© Springer Nature Switzerland AG 2019
Ali Dehghantanha and Kim-Kwang Raymond Choo (eds.), Handbook of Big Data and IoT Security, https://doi.org/10.1007/978-3-030-10543-3_8

Big Data Forensics: Hadoop Distributed File Systems as a Case Study

Mohammed Asim1, Dean Richard McKinnel1, Ali Dehghantanha2, Reza M. Parizi3, Mohammad Hammoudeh4 and Gregory Epiphaniou5
(1)
Department of Computer Science, University of Salford, Manchester, UK
(2)
Cyber Science Lab, School of Computer Science, University of Guelph, Guelph, ON, Canada
(3)
Department of Software Engineering and Game Development, Kennesaw State University, Marietta, GA, USA
(4)
School of Computing, Mathematics and Digital Technology, Manchester Metropolitan University, Manchester, UK
(5)
Wolverhampton Cyber Research Institute (WCRI), School of Mathematics and Computer Science, University of Wolverhampton, Wolverhampton, UK
 
 
Corresponding author: Ali Dehghantanha

Abstract

Big Data has fast become one of the most widely adopted paradigms within computer science and is considered an equally challenging one for forensic investigators. The Hadoop Distributed File System (HDFS) is one of the most popular big data platforms on the market, providing an unparalleled service with regard to parallel processing and data analytics. However, HDFS is not without its risks, having reportedly been targeted by cyber criminals as a means of stealing and exfiltrating confidential data. Using HDFS as a case study, we aim to detect remnants of malicious users’ activities within the HDFS environment. Our examination involves a thorough analysis of different areas of the HDFS environment, including a range of log files and disk images. Our experimental environment consisted of four virtual machines, all running Ubuntu. This research provides a thorough understanding of the types of forensically relevant artefacts that are likely to be found during a forensic investigation of HDFS.

Keywords

HDFS · Digital forensics · Hadoop · Big data · Distributed file systems

1 Introduction

Human dependence on technology has become an inadvertent side-effect of the transition into the digital age, which has subsequently resulted in an exponential increase in data production [1]. This growth of heterogeneous data has led to a relatively new computing paradigm evolving in recent years, aptly named Big Data [2]. Günther et al. [3] explicitly evaluated the current literature on the importance of Big Data, highlighting its value for organisations as a means of building data-driven business models. Consequently, the popularity of Big Data environments has led to their widespread adoption across many different industries. Companies such as Bank of America, Deutsche Bank and other large financial conglomerates are known users of Hadoop, utilising data analytics to fight fraudulent activity and present customers with tailored packages based on their transactions [4, 5]. The Hadoop ecosystem has developed significant popularity, largely due to Hadoop’s efficiency in parallel data processing and its fault tolerance [6]. Yu and Guo [7] emphasise the extent of Hadoop adoption when they make reference to the large percentage (79%) of investment in Hadoop by companies attending the Strata Data conference. Demand for Big Data analytics has become so vast that major business intelligence companies such as IBM and Oracle have developed their own Big Data platforms, allowing companies without extensive infrastructure to take advantage of Big Data analytics [8]. In addition to these elastic Big Data environments, Vorapongkitipun and Nupairoj [9] state that the Hadoop Distributed File System is extremely lightweight and can therefore be hosted on commodity hardware, making it more accessible to the average user for storing smaller file types in a distributed manner.

There is a cornucopia of forensic problems associated with the adoption of Big Data environments, from the perspective of companies and individuals alike [10]. Tahir and Iqbal [1] specify that Big Data is one of the top issues facing forensic investigators in the coming years due to the complexity and potential size of Big Data environments. As taxonomised by Fu et al. [11], criminals with access to a company’s confidential Hadoop environment can perform data leakage attacks from different platform layers within the Hadoop ecosystem. The authors also extend the problem of permissions to the Hadoop audit logs, specifying that attackers can make changes to these logs as a means of anti-forensics. Moreover, Big Data storage has been targeted by a range of attack vectors, from ransomware [12–14] and Trojans [15] to Distributed Denial of Service (DDoS) attacks [16].

Malicious actors may also utilise Hadoop and the HDFS environment as a means of performing criminal activity, as they have done previously in cloud platforms such as Mega [17] and SugarSync [18]. Almulla et al. [19] illustrate how criminals can use such settings as a means of hiding and distributing illicit content with relative ease. This view is shared by Tabona and Blyth [20], who indicate that complex technologies such as Hadoop have become readily available to criminals, subsequently resulting in crimes with digital trails that are not so easily discovered. Gao and Li [21] emphasise this complexity by identifying the difficulty of forensic extraction in such environments.

As well as problems attributed to the criminals themselves, Guarino [22] outlines problems associated with the sheer size of an HDFS environment. The potential area for analysis in an HDFS environment can be substantial, making it extremely difficult to interpret information of forensic significance efficiently. This enormity can be an inherent problem during the data collection phase of forensic analysis, as the gigantic size of the dataset correlates positively with the potential for errors during collection. Subsequently, errors in collection can propagate into the later stages of the investigation, invalidating the integrity of the overall investigation [23].

As if the HDFS environment were not already difficult to interpret forensically, the secure deletion method proposed by Agrawal et al. [24] may further complicate the forensic practices associated with Hadoop and its distributed file system.

As outlined above, there is a requirement for forensic analysis of HDFS as a means of determining all the forensically viable information that may be extracted from such an environment. The potential for anti-forensic techniques creates a need for further forensic analysis to ensure that forensic examiners remain one step ahead of criminals. Being in its infancy, Big Data is a paradigm that is relatively unexplored with regard to forensics; there is therefore an evident need for in-depth digital forensic analysis of the environment.

In this paper, we seek to answer the following questions:
  1. Does the Hadoop Distributed File System offer any forensic insight into data deletion or modification, including the change of file metadata?

  2. Can the memory and file systems of the Nodes within the HDFS cluster be forensically analysed and compared as a means of gaining forensic insight by obtaining forensic artefacts?

  3. Can forensically significant artefacts be obtained from auxiliary modules associated with HDFS, i.e. web interfaces?

  4. Does any network activity occur in HDFS between the respective nodes that may provide valuable forensic information?

It is expected that the findings of this research will further the forensic community’s understanding of the forensically significant components of the Hadoop Distributed File System.

This paper is structured as follows: Sect. 2 begins by giving an overview of current related research on Big Data forensics with an emphasis on Hadoop, focusing primarily on HDFS. This overview contextualises the forensic investigation performed in this paper, emphasising the importance of research in this field. Following the literature review, Sect. 3 outlines the research methodology employed in this paper, ensuring that the investigation coincides with accepted forensic frameworks. Section 4 covers the experimental setup and the collection of the testing data. Section 5 presents the experimental results and analysis, discussing the findings of the investigation. Finally, Sect. 6 concludes the investigation and extrapolates the findings to develop ideas for future work.

2 Related Work

Despite the relatively recent adoption of Big Data analytics, several researchers have begun to forensically examine HDFS and other distributed file system environments as a means of broadening the forensic community’s insight into these systems. The majority of Big Data forensics efforts have been based on procedures developed previously for cloud investigation [25], such as those used for analysing end-point devices [26, 27], cloud storage platforms [28, 29], and network traffic data [30, 31]. Forensic guidelines have also been developed as a means of building acceptable frameworks for distributed file system forensics [32].

Martini and Choo [33] use XtreemFS as a big data forensics case study with an emphasis on performing an in-depth forensic experiment. The authors focus on the delineation of certain digital artefacts within the system itself, in conjunction with highlighting the issues regarding the collection of forensic data from XtreemFS.

Thanekar et al. [34] have taken a holistic approach to Hadoop forensics by analysing the conventional log files and identifying the different files generated within Hadoop. The authors made use of accepted forensic tools, e.g. Autopsy, to recover the various files from the Hadoop environment.

Leimich et al. [35] utilise a more calculated method of forensics by performing triage of the metadata stored within the RAM of the Master NameNode to provide cluster reconnaissance, which can later be used for targeted data retrieval. This research focuses predominantly on establishing an accepted methodology for forensic investigators with regard to the RAM of the Master NameNode of a multi-cluster Hadoop environment. Despite the promising results gained by the investigators in data retrieval from the DataNodes, the research itself does not provide a means of identifying user actions within the environment.

A range of forensic frameworks for Big Data analysis has been proposed in contrast to conventional methods of forensic analysis and frameworks. Gao et al. [36] suggest a new framework for forensically analysing the Hadoop environment, named Haddle. The purpose of this framework is to aid in the reconstruction of the crime scene, identifying the stolen data and, more importantly, the user that perpetrated the action within Hadoop. Like the work mentioned above, the authors use Haddle and other tools to create custom logs to identify users performing specific actions, thus identifying the culprit.

Dinesh et al. [37] propose a similar incident-response methodology utilising logs within the Hadoop environment. The analysis of these logs aids in determining the timeline of an attack, as subsequent comparisons can be made with previous logs that capture the state of the systems at that time.

The use of an HDFS-specific file carving framework is proposed by Alshammari et al. [38], in which the authors identify the difficulties of file carving in bespoke distributed file systems that do not utilise conventional methods of storing data. This research sets out specific guidelines for the retrieval of HDFS data without corruption, concluding that to retrieve data from HDFS using file carving, the raw data can be treated much like that found in conventional file systems, with only the initial consolidation being an issue.

As mentioned, the current state of Big Data forensics is relatively unexplored due to the infancy of the area [39]. However, it is evident that researchers are moving into this area of forensics as a means of understanding all aspects of new systems relevant to Big Data, with an emphasis on distributed systems [40]. It is plain to see from the current research that areas of the Hadoop infrastructure are yet to be explored at a comprehensive level, as most researchers seem to stick to what they can see as opposed to what they cannot. A general trend in Hadoop and other file system forensics is the analysis of log files. Despite the inarguable richness of forensic data contained within system logs, the majority of the papers mentioned above hold scepticism over the heavy reliance on these files in forensics. Where mentioned in the papers above, logs are deemed a fragile source of forensic information, as these files can be easily tampered with or deleted. Inspired by this fact, and to further advance Big Data forensic research, there is a need to explore the Hadoop Distributed File System to retrieve the forensic artefacts that may appear in the infrastructure. The emphasis should be on determining aspects such as the user who perpetrated an action or the time associated with any action within the Hadoop environment. The work proposed in this research takes a first step towards addressing this need in the community.

3 Methodology

Digital forensic frameworks play a significant role during a forensic investigation. Previous research presents many reasons why a digital forensics investigation may face challenges and failures. As mentioned by Kohn et al. [41], the main concern is a lack of upfront planning, which subsequently leads to best security practices being avoided. To ensure best security practices, an investigator must adhere to security guidelines and standards, and this can be achieved by following a suitable framework.

As outlined by the National Institute of Standards and Technology (NIST), digital forensics is divided into four stages: collection, examination, analysis, and reporting. Alex and Kishore [42] have also followed a very similar structure to NIST, with their stages including identification, collection, organisation and presentation.

In this paper, we have adopted the cloud investigations framework recommended by Martini and Choo [43]. This framework contains elements that are very similar to the research performed by Alex and Kishore [42] and NIST, with a few distinct differences. The main difference is the presence of an iterative cycle that allows the investigator to move back and forth between stages, thus being able to re-examine and make changes much earlier in the investigation. This is advantageous for the proposed experiment, as it offers flexibility and an opportunity to explore additional technologies that may prove more efficient in obtaining forensically relevant results. Figure 1 below illustrates the four stages of the cloud investigations framework. Each stage is explained below with regard to the research carried out in this paper.
  1. Evidence source identification and preservation: In this stage, we identified the main components of the environment within HDFS. A forensic copy of the VMEM, VMDK and FSimage files (secondary NameNode checkpoints) was generated for each of the respective nodes at each interval of action. Network captures were also carried out during each action as a means of analysing the network transactions. We verified the integrity of each of the data files (including the dataset) by calculating the hash value associated with them; Windows PowerShell was selected for this, using the command Get-FileHash -Algorithm MD5 (see the sketch after Fig. 1).

  2. Collection: During this stage, we collected a varied array of evidence from the files outlined in the previous stage. Each piece of granular evidence retrieved from the main evidence files was attributed its own hash.

  3. Examination and analysis: The purpose of this stage was to examine and analyse the data collected. Due to the composition of the NameNode metadata and data-blocks in HDFS, string searching was the most effective analysis technique across the VMEM, network and VMDK captures. The string searching involved finding appropriate information with regard to aspects such as block IDs, the dataset file name or text within the sample dataset. A secondary stage of the analysis was to analyse the log information within the NameNode and the FSimage information supplied by the secondary NameNode, as a means of discovering further information to validate findings within the environment.

  4. Reporting and presentation: This stage contains all the evidence found during the investigation carried out within this paper, including a legal document that could be used in a real-life court environment.

../images/451567_1_En_8_Chapter/451567_1_En_8_Fig1_HTML.png
Fig. 1

HDFS cluster showing the FSimage checkpoint communication between the NameNode and Secondary NameNode, as well as the replication of the dataset over the two DataNodes
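
To make the integrity-verification step of stage 1 concrete, the same workflow could be scripted on a Linux workstation using md5sum in place of the PowerShell Get-FileHash cmdlet referenced above (a minimal sketch; the evidence file names are illustrative placeholders, not the exact names used in the experiment):

# hash every acquired evidence file and keep the digests for later re-verification
md5sum MasterNameNode.vmem MasterNameNode.vmdk fsimage_00000000000000000xx > evidence_hashes.md5
# at any later point, confirm that none of the evidence files have changed
md5sum -c evidence_hashes.md5
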

4 Experimental Setup

Our forensic experiment involved a multi-node cluster setup using Hadoop 2.7.4 based on Ubuntu 16.04.3 LTS. As shown in Fig. 1, the cluster consisted of four nodes: a Master NameNode, a secondary NameNode and two DataNodes. A snapshot of each machine state was taken following the completion of each of the base functions within the Hadoop distributed file system. The actions analysed within HDFS were PUT, GET, APPEND, REMOVE and the transactional processes of a simple MapReduce job. A base snapshot was taken before any data was inserted into HDFS and before any actions had been performed. This base snapshot offered a basis for comparison with the other snapshots. Figure 2 illustrates the hierarchy of all snapshots taken within the analysis.
../images/451567_1_En_8_Chapter/451567_1_En_8_Fig2_HTML.png
Fig. 2

The snapshots taken within each stage of the HDFS forensic analysis. All systems were snapshotted at each interval of action

The cluster itself was housed within a virtual environment using VMware Workstation as a means of saving time in analysis, as creating a physical Hadoop cluster with all the intermediate components involved would be an overly time-consuming exercise. Each VM was also given only its minimum running requirements and disk space to save time during the analysis of the VMDKs. As stated by Rathbone [44], HDFS is designed to take only a small range of file types for use in its analysis, including plain text storage such as CSV or TSV files, Sequence Files, Avro or Parquet. To reduce complications, a CSV dataset from the U.S. Department of Health containing Chronic Disease Indicators was chosen; this would aid in string searching later.
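
The base actions exercised between snapshots broadly correspond to the following HDFS client commands (a sketch only; the HDFS paths and the dataset file name chronic_disease_indicators.csv are illustrative placeholders rather than the exact names used in the experiment):

hdfs dfs -put chronic_disease_indicators.csv /user/criminal1/                          # PUT snapshot
hdfs dfs -get /user/criminal1/chronic_disease_indicators.csv download.csv              # GET snapshot
hdfs dfs -appendToFile jellyfish.mp4 /user/criminal1/chronic_disease_indicators.csv    # APPEND snapshot
hdfs dfs -rm /user/criminal1/chronic_disease_indicators.csv                            # REMOVE snapshot
# the MAPREDUCE snapshot follows the WordCount job described in Sect. 5.5
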

It should be noted that in a real-life scenario a commercial Hadoop cluster could not be powered down to obtain forensic copies of the environment; live data acquisition would therefore need to be utilised to obtain the information acquired within this experiment. To reduce the potential for error in data acquisition, the experiment made use of the VMEM and VMDK files within the virtual machine directories. This allowed real-time acquisition of the virtual machine states while and after the HDFS processes were performed. For the FSimages, the inbuilt tools of HDFS were utilised to examine these files, namely the Offline Image Viewer and the Offline Edits Viewer, which give an overview of the operations performed as well as granular information about each operation. FTK Imager was used to examine the VMDK file systems of each of the related snapshots, with an emphasis on recovering certain log or configuration files. Finally, Wireshark was used during each action to capture the communications occurring between the respective nodes. Table 1 outlines the tools utilised within this experiment, and Table 2 describes the specifications of each snapshot.
Table 1

Tools used during analysis

Tool (incl. version): Usage

Access Data FTK Imager 3.4.3.3: To analyse the file system structures and files of VMDK files recovered from the virtual machine snapshots

HxD (hex editor) 1.7.7.0: To analyse the VMEM files recovered from the virtual machine snapshots using string searches of the hex data

Offline Image Viewer (Hadoop tool): To analyse the secondary NameNode FSimages for each of the snapshots

Offline Edits Viewer (Hadoop tool): To analyse the secondary NameNode edit logs for each of the snapshots

Wireshark 2.4.2: To analyse network capture files

VMware Workstation Pro 12.5.0: To host the virtual machines

Table 2

Snapshot specification

Snapshot: Details

Base snapshot: A base snapshot with Hadoop installed and no data within the cluster

PUT snapshot: Copy of the base snapshot with data inserted into the cluster using the PUT command in HDFS

GET snapshot: Copy of the PUT snapshot with the HDFS data retrieved using the GET command in HDFS

APPEND snapshot: Copy of the PUT snapshot with the jellyfish.mp4 file appended to the original dataset

REMOVE snapshot: Copy of the APPEND snapshot with the data removed from HDFS

MAPREDUCE snapshot: Copy of the PUT snapshot with MapReduce configured

5 Results and Discussion

The following section is structured around each of the prominent actions performed in HDFS, showing the step-by-step analysis of the environment relevant to those actions.

5.1 Forensic Analysis of HDFS PUT

The HDFS PUT command (hdfs dfs -put) allows files within the local Linux file system to be distributed between the DataNodes of HDFS according to their size and the replication factor configured within the hdfs-site.xml file of the Hadoop configuration files. To begin the forensic analysis, it was imperative that the NameNode FSimage files and edit logs were located. Fortuitously for the investigation, the main page of the Hadoop web interface (http://localhost:50070/dfshealth.html#tab-overview) presented the exact location of these files along with other pertinent information, such as the transaction ID associated with the edit currently in progress (see Fig. 3). The configured location was within the Master NameNode’s (192.168.1.1) file directory at /root/usr/local/hadoop_tmp/hdfs/namenode/.
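
For reference, the action under analysis and the subsequent inspection of the NameNode metadata directory might look as follows (a sketch only; the local path to the dataset is a placeholder):

# insert the dataset into HDFS (the action being analysed in this subsection)
hdfs dfs -put ./chronic_disease_indicators.csv /user/criminal1/
# list the NameNode metadata directory reported by the web interface
ls -l /usr/local/hadoop_tmp/hdfs/namenode/
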
../images/451567_1_En_8_Chapter/451567_1_En_8_Fig3_HTML.png
Fig. 3

Hadoop web interface showing location of NameNode Edit Log Files

Upon browsing to this location within the VMDK image file using FTK Imager, the files edits_00000000000000000xx-00000000000000000xx, fsimage_00000000000000000xx and VERSION were found. These files play a significant role in the forensic analysis of the environment, as when analysed correctly they offer an abundance of information about the environment. This includes forensically significant information such as timestamps (access time etc.), the transaction owner and the names of files and blocks. Using the inbuilt Hadoop tools Offline Image Viewer (OIV) and Offline Edits Viewer (OEV), the FSimage can be consolidated into text and XML files that show the actions performed within Hadoop, as well as the metadata associated with these actions. On the live Master NameNode, the command “hdfs oiv -i /usr/local/hadoop_tmp/hdfs/namenode/fsimage_00000000000000000xx -o /Desktop/oivfsput.txt” was used to interpret the current FSimage.

This command consolidates the FSimage to produce information about the current NameNode metadata. As seen in Fig. 4, information regarding files within HDFS is logged with all its associated data in the FSimage. After the PUT action had finished, the NameNode stored information about the location of this data, in both the HDFS block and the relative Inode location within the Linux file system itself. Forensically significant aspects of this data include the modification time and the access time, as well as the permissions for this data. Following the acquisition of this data, the edit logs within the NameNode were also analysed using the Offline Edits Viewer, which, like the Offline Image Viewer, consolidates the edit logs with regard to the transactional processes that occurred between the last checkpoint of the NameNode metadata and the current checkpoint. The command “hdfs oev -i /usr/local/hadoop_tmp/hdfs/namenode/edits_00000000000000000xx-00000000000000000xx -o /Desktop/baseedits.txt -p stats” was used to create an overview of the transactional processes that had occurred. Following the PUT command, the transactional process added was the OP_ADD operation (see Fig. 5). As a means of further investigation, the Offline Edits Viewer was used to gain granular information regarding this transaction; the command “hdfs oev -i edits_00000000000000000xx-00000000000000000xx -o /Desktop/baseedits2.xml -v” created an XML file that contained more verbosity regarding the operations in the checkpoint.
../images/451567_1_En_8_Chapter/451567_1_En_8_Fig4_HTML.png
Fig. 4

Output of the Offline Image Viewer text file outlining a range of information regarding the dataset and associated blocks

../images/451567_1_En_8_Chapter/451567_1_En_8_Fig5_HTML.png
Fig. 5

Output of the Offline Edits Viewer showing the transaction summary for the checkpoint

The output file created as a result of this command includes a large amount of granular information surrounding the OP_ADD operation. As seen in Fig. 6, the Inode number for the data within the Linux file system is noted first, supplying the investigator with a means of retrieving the information by location in a scenario where no other information regarding the data can be found. The name of the file inserted into HDFS is also located in this XML file, as well as the replication number, showing that the data has been repeatedly striped within HDFS and thus allowing a higher chance of retrieval from the data-blocks themselves. Although a creation-time timestamp is not present, given that the modification time and access time are identical for the OP_ADD operation, it is safe to assume that this epoch is the time of creation. A critical piece of information regarding the operation is its owner; in this case, it is evident that criminal1 inserted the sample file into HDFS.
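
The relevant fields can be pulled out of the verbose OEV XML output with a simple text search; a sketch follows, where the element names (PATH, MTIME, ATIME, CLIENT_NAME, USERNAME) reflect the OEV XML output as we understand it for Hadoop 2.7 and should be treated as indicative rather than definitive:

# show the OP_ADD record together with the path, timestamps, client and owner fields
grep -A 25 "OP_ADD" /Desktop/baseedits2.xml | grep -E "PATH|MTIME|ATIME|CLIENT_NAME|USERNAME"
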
../images/451567_1_En_8_Chapter/451567_1_En_8_Fig6_HTML.png
Fig. 6

Granular output of the Offline Edits Viewer showing detailed information of HDFS PUT transaction

In addition to the information pertaining to the creation time, location and name of the file, other forensic information can be gleaned from the Edits Viewer output that coincides with additional forensic information obtained during the investigation. The information in the <CLIENT_NAME>DFSClient_NONMAPREDUCE_1292467381_1</CLIENT_NAME> section of the edits output is also seen in the network communications between the NameNode and the individual DataNodes that are presented with the data (see Fig. 7). This initial communication occurs from port 59196 on the NameNode to port 50010 on the DataNode, which according to Zeyliger [45] corresponds with the default Hadoop port for these transactions. This packet is closely followed by the transfer of the file information from the NameNode to the DataNodes over the same ports.
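
This correlation can also be reproduced from the capture file on the command line; a sketch using tshark (the CLI companion of the Wireshark 2.4.2 build listed in Table 1), with an illustrative capture file name:

# list NameNode-to-DataNode conversations on the default data-transfer port
tshark -r put_capture.pcapng -Y "tcp.port == 50010" -T fields -e ip.src -e tcp.srcport -e ip.dst -e tcp.dstport
# surface packets that carry the DFSClient identifier seen in the edits log
tshark -r put_capture.pcapng -Y 'frame contains "DFSClient_NONMAPREDUCE"'
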
../images/451567_1_En_8_Chapter/451567_1_En_8_Fig7_HTML.png
Fig. 7

Network capture occurring between the NameNode and DataNode

Assessing the remaining operations within the edits log showed that block allocations occurred directly after a file was added, so that HDFS could place the information on the respective blocks within the DataNodes. As the CSV dataset was relatively small in comparison to the size of datasets that would occur on a commercial cluster, only one block ID was allocated to the DataNodes. As seen in Fig. 8, block ID 1073741825 with a GENSTAMP of 1001 was generated for the information.
../images/451567_1_En_8_Chapter/451567_1_En_8_Fig8_HTML.png
Fig. 8

Offline Edits Viewer output showing the block allocation ID associated with the file being added

Following the discovery of the block identification number, attention was turned to the VMDK files of the DataNodes to determine whether this information could be retrieved from the Linux file directory that sits below the logical HDFS file directory. Upon analysing the VMDK of the NameNode, it was determined that the Hadoop configuration files were located in /usr/local/hadoop-2.7.4/etc/hadoop/ (see Fig. 9). Within this folder was the hdfs-site.xml file, which contained key configuration items within the Hadoop infrastructure. It was found that the DataNode data directory path was /usr/local/hadoop_tmp/hdfs/datanode on the DataNode (see Fig. 10). The DFS replication value was also located in this file, showing that some level of replication had been configured and validating the previous findings of replication occurring.
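
For orientation, an hdfs-site.xml excerpt of the kind found here would resemble the following hedged reconstruction, using the standard Hadoop property names dfs.datanode.data.dir and dfs.replication; only the data directory path is taken from the experiment, the replication value shown is illustrative:

<configuration>
  <!-- where each DataNode stores its block files (path reported in Fig. 10) -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/usr/local/hadoop_tmp/hdfs/datanode</value>
  </property>
  <!-- replication factor; the exact value used in the experiment is not stated, 2 is illustrative -->
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
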
../images/451567_1_En_8_Chapter/451567_1_En_8_Fig9_HTML.png
Fig. 9

File structure of NameNode showing the location of the Hadoop configuration files

../images/451567_1_En_8_Chapter/451567_1_En_8_Fig10_HTML.png
Fig. 10

Contents of the hdfs-site.xml file showing the DataNode data directory

Further analysis of the DataNode image using FTK Imager revealed that browsing to the directory specified in hdfs-site.xml showed the block files associated with the file, as ascertained earlier in the investigation. These block files contained the raw information of the CSV file along with metadata relevant to each of the block files. Additional forensic information was procured through the file directory itself, as the node ID (BP-7456194) was found within the file path of the DataNode data directory (see Fig. 11).
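
The same block files can be located and triaged on a mounted DataNode image or a live DataNode with standard Linux utilities; a sketch, using the block ID recovered above and the “Chronic” keyword from the dataset:

# locate the block file (and its .meta companion) under the DataNode data directory
find /usr/local/hadoop_tmp/hdfs/datanode -name "blk_1073741825*"
# confirm that raw CSV content survives inside the block file itself
strings $(find /usr/local/hadoop_tmp/hdfs/datanode -name "blk_1073741825" -type f) | grep -m 5 "Chronic"
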
../images/451567_1_En_8_Chapter/451567_1_En_8_Fig11_HTML.png
Fig. 11

The DataNode file structure with the block files present

During the PUT process, volatile memory was assessed while the PUT command was executed in HDFS. Many processes were running before, during and after the process; however, most of these were discounted forensically as they did not relate to HDFS itself. The only process that showed any relevant activity during the PUT action was PID 4644, which was labelled as a java process. Unfortunately, this did not supply any additional information upon further analysis of the PID itself, but the listed user showed that criminal1 had been performing actions, malicious or not (see Fig. 12).
../images/451567_1_En_8_Chapter/451567_1_En_8_Fig12_HTML.png
Fig. 12

Output of Linux top command following real time analysis of memory processes

Following the active memory analysis, the raw data within the memory file was analysed with the goal of retrieving any additional information. No specific information regarding the data blocks or the data itself was found; however, the recent bash commands performed in the terminal of the NameNode were stored, showing that criminal1 had performed a PUT command with the file name of the dataset (Fig. 13).
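
Rather than scrolling through the VMEM capture in a hex editor, the same command-line remnants can be surfaced with a string search over the raw memory file; a sketch with an illustrative file name:

# extract printable strings from the memory capture and filter for HDFS client commands
strings MasterNameNode.vmem | grep -E "hdfs dfs -(put|get|rm|appendToFile)" | sort -u
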
../images/451567_1_En_8_Chapter/451567_1_En_8_Fig13_HTML.png
Fig. 13

Raw volatile memory file analysed in HxD

As a means of validating the information above, the Hadoop log files were analysed to determine whether the changes in HDFS had been logged. As previously found while analysing the file structure of the NameNode using FTK Imager, the log files were located in /usr/local/hadoop-2.7.4/logs (see Fig. 14). The sizeable “hadoop-criminal1-namenode-masternamenode.log” file was of particular interest, as it seemed to log the majority of actions performed in Hadoop.
../images/451567_1_En_8_Chapter/451567_1_En_8_Fig14_HTML.png
Fig. 14

NameNode VDMK file within FTK Imager showing the location of log files

Upon obtaining this log file, string searches were performed using the data retrieved above, i.e. the block ID, the name of the file and additional information that may be logged within the log file coinciding with the timestamps of the insertion of the dataset. The log file was preferably analysed within a Linux environment, as the grep command could be used to retrieve the information referencing the details above. Following the use of the command “grep "Chronic" hadoop-criminal1-namenode-masternamenode.log”, multiple entries were found regarding the allocation of blocks associated with the dataset and the times at which these operations occurred. Another key piece of information that appears is the DFSClient code, which coincides with previous information about this operation (Fig. 15).
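
The same log can be pivoted on any of the identifiers recovered earlier; a short sketch of the searches that would corroborate the findings above:

# pivot on the allocated block ID recovered from the edits log
grep "blk_1073741825" hadoop-criminal1-namenode-masternamenode.log
# pivot on the DFSClient identifier seen in both the edits log and the network capture
grep "DFSClient_NONMAPREDUCE" hadoop-criminal1-namenode-masternamenode.log
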
../images/451567_1_En_8_Chapter/451567_1_En_8_Fig15_HTML.png
Fig. 15

grep output of the hadoop-criminal1-namenode-masternamenode.log file following the PUT command

5.2 Forensic Analysis of HDFS GET

Following the use of the PUT command within HDFS, data stored within HDFS can be retrieved to the local file system on the NameNode using the GET command (hdfs dfs -get). As with the PUT command, both Hadoop tools were used to analyse the FSimages and edits that had occurred during the GET command on the snapshots. Again, using the Offline Image Viewer command “hdfs oiv -i /usr/local/hadoop_tmp/hdfs/namenode/fsimage_00000000000000000xx -o /Desktop/getoivfs.xml” on the current FSimage, a preview of the current changes was presented. The image viewer output in this scenario showed an unusual absence of information regarding the retrieval of the information from HDFS. However, a notable piece of information was the change in the access time associated with the file, which was listed in the image output (see Fig. 16).
../images/451567_1_En_8_Chapter/451567_1_En_8_Fig16_HTML.png
Fig. 16

Output of the Offline Image Viewer showing a difference in access time

The Edits Viewer was then used to determine whether more information was present regarding the GET command. The command “hdfs oev -i /usr/local/hadoop_tmp/hdfs/namenode/edits_00000000000000000xx-00000000000000000xx -o /Desktop/getoevfs.txt -p stats” was used on the current edits log to give a summary of the operations that had occurred since the last image. Unlike the PUT command, however, the GET command did not list any specific operation codes relating to the GET operation. The only operation codes listed after this command were OP_START_LOG_SEGMENT and OP_END_LOG_SEGMENT, which unfortunately occur for every operation within HDFS (see Fig. 17).
../images/451567_1_En_8_Chapter/451567_1_En_8_Fig17_HTML.png
Fig. 17

Overview of the Offline Edits Viewer

To investigate further, the verbose option was added to break down the operations and see whether additional information could be gained from the edits log (“hdfs oev -i edits_00000000000000000xx-00000000000000000xx -o /Desktop/getoevfs2.txt -v”). Although the detailed output did present more detail regarding the operations themselves, the only notably forensic information in the edits log was the filename of the file that had been accessed, indicated by the changed access time and unchanged modification time (see Fig. 18).
../images/451567_1_En_8_Chapter/451567_1_En_8_Fig18_HTML.png
Fig. 18

Edits log showing the change in access time for the dataset

As no information of great value could be yielded from the FSimage and edit logs, attention was turned to the log files of Hadoop itself. Upon actively performing the GET command to see whether the logs were updated on either the NameNode or the DataNodes, using the grep command with the filename as an argument, it was found that the file was copied with a DFSClient ID associated with the copy (see Fig. 18). In addition to the analysis of the log files, a static approach of file system reconnaissance was used to recover information regarding the downloaded file. The Linux command “sudo find . -name "*.csv" -type f -mtime 0 > modified.txt”, run from the root directory, listed any files that had been modified (creation time is deemed the same as modified time) in the last 24 h. Upon running this command, it was found that the download.csv file (including its directory) that had been created as a result of the GET command appeared in the output of the find command. To verify the creation time of this file, the Linux command “stat download.csv” was used; however, the birth field was absent in the stat output for this file (see Fig. 19), which may be a result of the absence of the relevant APIs within the kernel. The ext4 filesystem within Linux does, in fact, maintain the file creation time. After finding the Inode ID from the stat command, the command “sudo debugfs -R ‘stat <1052122>’ /dev/sda1” revealed the creation time to be shortly after the epoch located in the edits log (see Fig. 20).
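
Collected together, the creation-time recovery workflow described above amounts to a short sequence of standard Linux commands (a sketch; the inode number 1052122 and device /dev/sda1 are the values reported in this experiment and would differ elsewhere):

# list CSV files modified within the last 24 hours anywhere on the file system
sudo find / -name "*.csv" -type f -mtime 0 2>/dev/null
# standard stat output; on many ext4 systems the Birth field is empty here
stat download.csv
# query the inode directly with debugfs to expose the ext4 creation time (crtime)
sudo debugfs -R 'stat <1052122>' /dev/sda1
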
../images/451567_1_En_8_Chapter/451567_1_En_8_Fig19_HTML.png
Fig. 19

Terminal output for the statistics of the newly created download.csv

../images/451567_1_En_8_Chapter/451567_1_En_8_Fig20_HTML.png
Fig. 20

Terminal output from the debugfs command showing the creation time of the file relative to the Inode

After analysing the network capture for this interval of action, much like the PUT command, a DFSClient_NONMAPREDUCE entry was found, occurring between the NameNode (192.168.1.1) and the DataNode (192.168.1.4), followed by an influx of data relating to the dataset itself (see Fig. 21). Despite this DFSClient ID being present, no other instance of it was found in the log files, FSimage or edit logs; upon inspection of the VMEM file, this was also the case.
../images/451567_1_En_8_Chapter/451567_1_En_8_Fig21_HTML.png
Fig. 21

Presence of the DFSClient ID in the network capture followed by data transfer

5.3 Forensic Analysis of HDFS APPEND

HDFS does not allow the modification of files once they are populated within the filesystem. The APPEND command (hdfs dfs -appendToFile) is therefore needed to add additional information to a file already within HDFS; otherwise the user must delete the entire file and repopulate the environment with the modified version. In this instance, a sample MP4 file was appended to the original dataset to simulate a criminal trying to hide information within another HDFS-based file. Again, drawing on the detailed logging capabilities of the FSimage and edit logs, the HDFS OIV and OEV commands were utilised to analyse the logs following the use of the APPEND command.

The Offline Image Viewer output harboured some interesting information regarding the dataset. Additional blocks were allocated to the filename of the dataset, with each block being allocated a block ID following on from the initial block allocation ID. Naturally, the accessed time was changed (see Fig. 22). The overview output of the Offline Edits Viewer also gave interesting information following the APPEND: an OP_APPEND operation was logged in the edits overview, with the multiple block ID allocations and block additions cited in the overview (see Fig. 23).
../images/451567_1_En_8_Chapter/451567_1_En_8_Fig22_HTML.png
Fig. 22

Output of the Offline Image Viewer following an append operation

../images/451567_1_En_8_Chapter/451567_1_En_8_Fig23_HTML.png
Fig. 23

Overview output of the Offline Edits Viewer following an append operation

A large quantity of activity was logged in the detailed edits log following the APPEND operation. The OP_APPEND operation holds a large amount of key information regarding the file being appended, such as the name of the file and whether a new block is needed to accommodate the newly appended information. If the first block has enough space to store some of the newly added information, that block can be utilised once more, allowing the newly appended information to be retrieved for analysis. As seen in Fig. 24, following the append function, once HDFS realises that the first block is filled to the specified HDFS block size, new blocks are allocated new block IDs to accommodate the additional information.
../images/451567_1_En_8_Chapter/451567_1_En_8_Fig24_HTML.png
Fig. 24

Block allocations and updates contained within the detailed edits log

The DFSClient ID was once again used as a search string parameter in Wireshark to determine the transfer of data within HDFS to the newly appended blocks. The DFSClient ID matched the client ID in the network traffic and was shortly followed by the transfer of data as the file was appended (see Fig. 25).
../images/451567_1_En_8_Chapter/451567_1_En_8_Fig25_HTML.png
Fig. 25

DFSClient ID present followed by data transfer of APPEND operation

The newly designated block size change is logged in the hadoop-criminal1-namenode-masternamenode.log file, referencing the block ID and additional information such as the new length of the file and the nodes on which the file is stored (see Fig. 26).
../images/451567_1_En_8_Chapter/451567_1_En_8_Fig26_HTML.png
Fig. 26

grep output of the NameNode log file, using DFSClient ID as a search parameter

An analysis of the block file associated with the appended dataset was undertaken using the HxD editor to determine how the file had been appended. As seen in Fig. 27, the file signature of the MP4 format was still recoverable in the appended CSV file. This illustrates that, despite being appended as a different file format, HDFS did not convert the file to a respective format type; it simply added the MP4 file as additional data to the dataset, leaving the MP4 file recoverable with file carving methods.
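
The presence of the embedded MP4 can also be confirmed without a GUI hex editor by searching the block file for the “ftyp” marker that follows the first four bytes of an MP4 header; a sketch, where the block file name is only illustrative of the newly allocated append block:

# report the byte offsets at which the MP4 'ftyp' signature appears inside the appended block file
grep -abo "ftyp" blk_1073741826
# then inspect the bytes around a reported offset (replace 1234 with an offset from the grep output)
xxd -s 1234 -l 64 blk_1073741826
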
../images/451567_1_En_8_Chapter/451567_1_En_8_Fig27_HTML.png
Fig. 27

MP4 file signature located in the dataset file block following the append

The volatile memory of the NameNode was also assessed during and following the append function to determine whether any forensic artefacts were present in memory. During the active assessment, it was found that many different processes were running during the append actions within HDFS, including ibus-daemon and ibus-ui-gtk3; however, it is not clear whether these processes were associated with the actions in HDFS. During the APPEND command, PID 4644 (java) rose to a high CPU usage until the command finished. This java execution is to be expected, given that HDFS is programmed in Java. Unfortunately, it does not give any granular detail regarding the action itself but only denotes the criminal user at the time (see Fig. 28).
../images/451567_1_En_8_Chapter/451567_1_En_8_Fig28_HTML.png
Fig. 28

Output of the “top” command in Linux during the APPEND process

Static analysis of the memory captured at the time of the append function was undertaken to determine whether any forensic artefacts were encapsulated within the raw memory file itself. Following on from the discovery of the DFSClient ID within the log files (above), the client ID was also listed within the volatile memory itself. Additional information was found within strings relating to a data streamer that pointed towards the file name of the dataset, followed by the blocks on which the dataset was hosted. Bash commands were also stored within the raw file, incriminating the criminal1 user by outlining the commands that had been executed by that user in HDFS (see Fig. 29).
../images/451567_1_En_8_Chapter/451567_1_En_8_Fig29_HTML.png
Fig. 29

Data stream information regarding the dataset and the blocks associated with it

5.4 Forensic Analysis of HDFS REMOVE

Deleting information from an environment is often an area of focus for forensic investigations. In HDFS, the command hdfs dfs -rm is utilised to remove data files from the filesystem. The OEV and the OIV were once again used on the FSimage and edit logs to determine the operation codes associated with the removal of data from HDFS. As seen in Fig. 30, the OIV output contained information regarding a new directory named .Trash, with a modification time (assumed to be the creation time) that coincided with the deletion of the data within HDFS.
../images/451567_1_En_8_Chapter/451567_1_En_8_Fig30_HTML.png
Fig. 30

Offline Image Viewer output showing the creation of the .Trash directory

The outputs within the edit logs were analysed with the assumption that a delete operation would be found within the overview and detailed outputs. This was not found to be the case, however, as the OP_DELETE counter showed that none had been performed. The only new operational instances were the OP_MKDIR and OP_RENAME operations (see Fig. 31).
../images/451567_1_En_8_Chapter/451567_1_En_8_Fig31_HTML.png
Fig. 31

Overview of Offline Edits Viewer output showing the absence of a delete operation

This absence of a deletion operation was intriguing, so the verbosity of the Offline Edits Viewer was increased. Figures 32 and 33 show the additional detail offered by the Offline Edits Viewer, with the operation codes for both the directory creation and the renaming of the dataset to a filename that encompasses the newly created .Trash directory. The OPTIONS parameter in the OP_RENAME operation is marked TO_TRASH, meaning that when a file is deleted using the rm parameter in HDFS it is only renamed, subsequently moving it to the trash directory temporarily. As HDFS is a logical file system, the actual transit of data from one directory to the next does not occur; the renaming of the data moves it accordingly.
../images/451567_1_En_8_Chapter/451567_1_En_8_Fig32_HTML.png
Fig. 32

Offline edits viewer output showing the rename of the dataset with the addition of the TO_TRASH option

../images/451567_1_En_8_Chapter/451567_1_En_8_Fig33_HTML.png
Fig. 33

Offline edits viewer showing the creation of the Trash directory

To validate the creation of this new directory and to determine whether the files could still be recovered, HDFS was explored following the deletion of the dataset. The deleted dataset was found within the newly created .Trash directory and could be retrieved with the GET command in HDFS (see Fig. 34).
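
In practice this recovery can be performed directly with the HDFS client; a sketch, where the paths follow HDFS’s usual .Trash/Current/<original path> layout and the file name is a placeholder:

# list the user's trash to confirm the "deleted" dataset is still present
hdfs dfs -ls /user/criminal1/.Trash/Current/user/criminal1/
# copy the dataset back out of HDFS for examination
hdfs dfs -get /user/criminal1/.Trash/Current/user/criminal1/chronic_disease_indicators.csv ./recovered.csv
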
../images/451567_1_En_8_Chapter/451567_1_En_8_Fig34_HTML.png
Fig. 34

Newly created .Trash directory within HDFS containing the deleted dataset

When analysing the network traffic in Wireshark during the deletion, no traffic of note was recorded. This suggests that only the NameNode was aware of the deletion; with the DataNodes unaware of it, the data saved on the respective blocks remained stored in the same place as before and was only moved to trash within the logical file structure of HDFS.

During the removal of data within HDFS, the processes in volatile memory were analysed as a means of determining any artefacts within the memory that may demonstrate a user performing a specific action in HDFS. Unfortunately, as with the previous two memory analyses, no processes of note (disregarding the java process) were found. The static memory analysis in HxD only recovered the bash commands relating to the removal of the data, but did show the trash interval time, indicating that the data was still recoverable (Fig. 35).
../images/451567_1_En_8_Chapter/451567_1_En_8_Fig35_HTML.png
Fig. 35

RAW memory file showing the trash policy default time

5.5 Forensic Analysis of a Basic MapReduce Task

One of the fundamental actions performed in Hadoop is data reduction and collation using HDFS and MapReduce. The Apache Hadoop MapReduce tutorial WordCount v1.0 was used to simulate the use of MapReduce within the environment [46]. Following the MapReduce tutorial, the word selected for the word count was “Chronic”, tested against the original dataset inserted into Hadoop. Following the MapReduce operation, the Hadoop infrastructure was assessed to try to unearth any artefacts that had been left as a result of the MapReduce execution within the cluster. It was found that the NameNode log file “yarn-criminal1-resourcemanager-masternamenode.log” was updated during the time of the MapReduce operation. Figure 36 shows the user criminal1 successfully submitting the job to HDFS to carry out the MapReduce job, found by using grep with the expression criminal1.
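
The job submission itself follows the Apache WordCount v1.0 tutorial; a sketch of the commands involved, with the input and output paths used here being placeholders rather than the exact paths from the experiment:

# compile and package the tutorial's WordCount class (per the tutorial's environment assumptions)
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
hadoop com.sun.tools.javac.Main WordCount.java
jar cf wc.jar WordCount*.class
# submit the job against the dataset already stored in HDFS
hadoop jar wc.jar WordCount /user/criminal1/input /user/criminal1/output
# corroborate the submission in the ResourceManager log
grep "criminal1" yarn-criminal1-resourcemanager-masternamenode.log
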
../images/451567_1_En_8_Chapter/451567_1_En_8_Fig36_HTML.png
Fig. 36

grep output of the yarn-criminal1-resourcemanager-masternamenode.log file showing successful submission by the criminal1 user

The network capture of the MapReduce operation again showed DFSClient IDs (see Fig. 40 below). The DFSClient ID was used as a grep search parameter on the log files as a means of determining whether the MapReduce task was logged in any additional places. Information regarding the MapReduce task was found in the generic NameNode log file “hadoop-criminal1-namenode-masternamenode.log”. This log file revealed that each job created by a user in MapReduce has a temporary directory made for it in HDFS, contained within the /tmp/ folder of HDFS, and the job data can therefore be retrieved from the HDFS file structure. Each of these job directories contained additional information files regarding the jobs (see Figs. 37 and 38).
../images/451567_1_En_8_Chapter/451567_1_En_8_Fig37_HTML.png
Fig. 37

Output of the grep command against the masternamenode.log file

../images/451567_1_En_8_Chapter/451567_1_En_8_Fig38_HTML.png
Fig. 38

Temporary directories created for each Hadoop job within HDFS

Using the HDFS GET command, the directory contents of /tmp/hadoop-yarn/staging/criminal1/.staging/job_1513885535224_0002 were retrieved for analysis. The directory contained an XML file, as seen in Fig. 39, and the JAR file containing the compiled Java code associated with the MapReduce job above. The XML file contained configuration information for each of the settings of the job; one of the parameters within this file was the MapReduce job username, marked in this scenario as criminal1.
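
The retrieval step referred to above is a single HDFS client command; a sketch using the job directory path reported in the logs:

# copy the job's staging directory out of HDFS for offline examination of job.xml and the job JAR
hdfs dfs -get /tmp/hadoop-yarn/staging/criminal1/.staging/job_1513885535224_0002 .
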
../images/451567_1_En_8_Chapter/451567_1_En_8_Fig39_HTML.png
Fig. 39

Job username of the MapReduce job contained in the job.xml file of the job directory

The network capture of the Master NameNode confirmed the DFS Client ID that was encountered during the analysis of the logs (see Fig. 40).
../images/451567_1_En_8_Chapter/451567_1_En_8_Fig40_HTML.png
Fig. 40

Network capture during the MapReduce Task

Auspiciously for forensic examiners, all tasks relating to MapReduce are accessible via a web interface on the NameNode, configured on port 8088. Users can see all tasks, both current and completed, as well as the task owner and the start and end times associated with each task. Figure 41 illustrates the web interface.
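
Where the web interface is unavailable, the same job history can usually be queried from the command line with the YARN client (a sketch; the output lists application IDs, names, users and states):

# list all YARN applications, including finished MapReduce jobs, with their owners and states
yarn application -list -appStates ALL
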
../images/451567_1_En_8_Chapter/451567_1_En_8_Fig41_HTML.png
Fig. 41

Hadoop web interface showing the currently scheduled jobs

6 Conclusion and Future Work

HDFS is a large and complex infrastructure with a web of different configuration files, logs and systems that can be analysed to gain information on the actions occurring within. The purpose of this paper was to forensically explore the HDFS environment and uncover forensic artefacts that could aid in a forensic investigation. During this investigation, each key HDFS command was executed and monitored to understand which artefacts were produced following the execution of each command. The Linux file systems, volatile system memory and network files were all analysed at each interval of action.

The inbuilt HDFS tools, the Offline Image Viewer and the Offline Edits Viewer, proved invaluable in forensic data retrieval during the investigation. Each tool provided the examiners with a range of information featured in each FSimage and edits log, which could be extrapolated to other areas of the investigation. The granularity of each tool’s output allowed for a comprehensive insight into the activities of the cluster during each action. Subsequently, it is our recommendation that, for effective forensic analysis of the HDFS environment to be possible, checkpointing of the Master NameNode on the secondary NameNode must occur at a high frequency in order to effectively track the activities of HDFS.

In addition to the FSimages and edit logs provided by checkpointing, the comprehensive nature of the Hadoop log files themselves provided a plethora of information on any actions performed within Hadoop. It is due to this, and the collation of the previously mentioned FSimages, that more information could be found. The generic hadoop-criminal1-namenode-masternamenode.log file logged almost every action within the HDFS cluster, showing that this file is of great importance to forensic investigators and that interpreting the output of this log file (and others) is vital when forensically analysing the Hadoop environment.

One of the key objectives of this investigation was to assess all aspects of HDFS, including the processes within volatile memory. Unfortunately, the findings in this area of the investigation did not reap any forensically significant information, as the processes associated with HDFS run as abstract Java processes; these processes were associated with the malicious user, but this information alone does not offer any insight into the actions being taken.

Clarification was also sought regarding the communication occurring between the Nodes and whether network traffic produced forensic artefacts that could prove useful in determining activity. Although some information was gained from the network traffic that corresponded with other findings, the traffic offered only a small amount of information that could be used to validate communications between the NameNode and DataNodes. It should also be noted that in some HDFS installations, communication between each of the nodes is often encrypted as a means of security and would, therefore, offer less information.

Assumptions were made regarding some of the actions that seemed logical; the removal of data was one instance in which one would assume some notation of deletion to occur, yet this produced results that did not coincide with the assumptions made. It is because of assumptions like this that forensic analysis of environments such as HDFS and other distributed file systems is required, so that the forensic community can build a repertoire of knowledge with regard to these systems.

As this investigation focused predominantly on HDFS, more work needs to be done to understand the entirety of the Hadoop ecosystem, with special attention paid to the types of artefacts generated as a result of MapReduce. The exploration of the Eclipse IDE with the MapReduce plugin may also prove valuable in gathering forensic insight into HDFS forensics. Additional research may be needed in HDFS environments that utilise Kerberos for authentication, as more forensic artefacts may be attributed to a Kerberos-based cluster. Moreover, the forensic investigation of evidence generated in HDFS through connecting it to emerging IoT and IIoT environments is another interesting avenue for future research [47, 48]. Analysing malicious programs specifically written to infect HDFS platforms to exfiltrate data or steal private information within cyber-attack campaigns is another future direction of this study [15]. Finally, as machine learning based forensic investigation techniques are proving their value in supporting forensic investigators and malware analysts, it is important to build AI-based techniques to collect, preserve and analyse potential evidence from distributed file system platforms [49–51].

Acknowledgement

We would like to thank the editor and anonymous reviewers for their constructive comments. The views and opinions expressed in this article are those of the authors and not of the organisations with whom the authors are or have been associated or by which they have been supported.