© Springer Nature Switzerland AG 2019
Eric Luiijf, Inga Žutautaitė and Bernhard M. Hämmerli (eds.), Critical Information Infrastructures Security, Lecture Notes in Computer Science 11260, https://doi.org/10.1007/978-3-030-05849-4_12

A Comparison of ICS Datasets for Security Research Based on Attack Paths

Seungoh Choi (corresponding author), Jeong-Han Yun, and Sin-Kyu Kim

The Affiliated Institute of ETRI, Daejeon, Republic of Korea

Abstract

Industrial control systems (ICSs) are widely deployed in various domains of critical infrastructure. In recent years, security threats targeting ICSs have been increasing. However, developing or verifying security technology at actual operation sites is difficult because the control system must run without disruption and with high availability. Datasets for security research are also hard to obtain. To overcome these limitations, several studies have built ICS testbeds as experimental environments, and datasets captured on these testbeds have been released publicly. Because each dataset has different characteristics depending on its domain and security concerns, a dataset should be analyzed before it is applied to a research objective. In this paper, we present a comparative analysis of various ICS datasets, focusing on attack scenarios, and discuss considerations for applying datasets to ICS security research. We expect our results to help researchers select and handle datasets for their individual purposes.

Keywords

Security · Dataset · Attack path · Industrial control system

1 Introduction

Industrial control systems (ICSs) are widely deployed in critical infrastructure such as power plants, water treatment facilities, and gas pipelines. ICSs provide measurement, monitoring, and control for various field devices [2]. In addition, ICSs are being extended into industrial applications such as digital twins as part of Industry 4.0, where they are connected with heterogeneous devices to monitor a wide range of state information and analyze data. As the operating environment of the ICS becomes more complex, owing to the scale and openness of the network connections among heterogeneous components, its attack surface is exposed to various security threats.

Since the ICS directly controls a physical system such as the field device, it is essential to prepare a security countermeasure in case a cyber-attack occurs, as it may cause not only destruction of the device but also physical damage due to a secondary explosion. In fact, the US Department of Homeland Security ICS-CERT reported 257 ICS-related vulnerabilities in 2016, and they are expected to continue to grow in the future [17].

To respond to security threats targeting ICSs, security technology that reflects the ICS operating environment is needed. Moreover, ICS research using big data analysis techniques has recently increased, drawing on integrated approaches such as machine learning to address the diversity and complexity of ICS security. However, it is very difficult to apply such technology in the real world, because we can neither accurately predict the effect of a new technology nor guarantee high availability during continuous operation of an actual ICS. Therefore, an experimental environment similar to the actual environment should be created so that ICS security technology can be developed and verified thoroughly.
../images/477940_1_En_12_Chapter/477940_1_En_12_Fig1_HTML.png
Fig. 1. Experimental environment to obtain dataset in critical infrastructures

In general, the ICS experimental environment consists of a Level 0 layer representing field devices, a Level 1 layer performing the computation and processing for the ICS control process, and a Level 2 layer handling the control process and operation information with a human-machine interface (HMI). Devices are placed and configured when the environment is built, and a system for collecting various data during ICS operation is arranged. Once the setup is complete, the operation scenario is configured according to the purpose of the test and used for testing and verification. Environments that provide datasets based on this hierarchical architecture are being actively studied [4, 10, 14–16, 23]. An environment for ICS dataset collection should simulate the actual operating environment of a control system while taking scalability into account. Thus, related works have used emulation or simulation where appropriate to construct the field devices, programmable logic controllers (PLCs), network, etc.

We analyze the publicly shareable ICS datasets for security research and present the results in terms of attack methods and paths. In addition, we discuss limitations and considerations for performing security research based on the surveyed datasets. We expect our results to be a useful basis for comparing datasets and for selecting a suitable one in further dataset-driven ICS security research.

The rest of this paper is organized as follows. Section 2 addresses the background of the ICS experimental environment and the scenarios of normal operation. Section 3 presents an overview of the ICS datasets. In Sect. 4, we describe our comparative analysis of each dataset in terms of attack scenarios. In Sect. 5, we discuss considerations for generating datasets. We conclude this paper in Sect. 6.

2 Background

Generally, when preparing the experimental environment, the field devices, control systems, and management systems (e.g., engineering workstation (EWS), HMI, and historian) are configured according to each level of the ICS, as shown in Fig. 1. For vertical data collection in the environment, there are communications between Level 0 and Level 1 or between Level 1 and Level 2. In the case of horizontal data collection, state and log information is generated by constituent devices at all levels. Additionally, the data flow can be considered as information moves through each level.
../images/477940_1_En_12_Chapter/477940_1_En_12_Fig2_HTML.png
Fig. 2. Scenarios of normal operations in critical infrastructures

In a typical control system, the normal operating scenarios are as shown in Fig. 2. They are divided into control, instrumentation, and state/event scenarios. First, the control scenario consists of remote control, field control, and stand-alone control. Remote control is used to control the field device using the HMI from a remote site, and data can be acquired at both the field and network levels. In field control, unlike remote control, the control system does not act on a command from the upper level: it performs the control itself and transmits the result to the management system. The stand-alone control architecture is not widely used at present, but control systems still support it. It pairs a control system with a field device to perform the control process and to store or discard the information without sending it to another system. Second, the instrumentation (measurement) scenarios are connected with feed-back, feed-forward, and cascade control, mainly by transmitting the sensor measurements (e.g., temperature, pressure, and flow rate) to the control and management systems. Lastly, a state or event scenario refers to state information produced by the functions of the device and to alarm information embedded in the control system.
Table 1. Public ICS datasets to be analyzed in this paper

| Dataset ID | Data domain | Year of release | Data source | Related works |
|---|---|---|---|---|
| Morris-1 | Power System | 2014 | [13] | [5, 18–20] |
| Morris-2 | Gas Pipeline | 2013 | [13] | [1] |
| Morris-3 | Gas Pipeline, Water | 2014 | [13] | [14] |
| Morris-4 | Gas Pipeline | 2015 | [13] | [16] |
| Morris-5 | EMS | 2017 | [13] | - |
| Lemay | SCADA | 2016 | [9] | [10] |
| SWaT | Water | 2016 | [6] | [4, 11] |
| Rodofile | Mining Refinery | 2017 | [22] | [23] |
| 4SICS | Complex | 2015 | [8] | - |
| S4x15CTF | Complex | 2015 | [21] | - |
| DEFCON23 | Complex | 2015 | [3] | - |

Table 2. Type of datasets

../images/477940_1_En_12_Chapter/477940_1_En_12_Tab2_HTML.png

3 Public ICS Datasets

In this section, we briefly describe each dataset listed in Table 1 before comparing them. Each dataset was collected from its own experimental environment in a specific or complex domain. To narrow the target of our analysis, we limited our study to ICS-related datasets that are publicly accessible.

3.1 Data Type

We identified the data types described in Fig. 1, as well as the file extensions, for each dataset, as shown in Table 2. The Morris collection provides five datasets as csv or arff files, which include field data, network data, and device log data. For the Lemay, Rodofile, 4SICS, S4x15CTF, and DEFCON23 datasets, the original raw network data is provided in the pcap format. The Lemay and Rodofile datasets also include csv files that provide label information. The SWaT (Secure Water Treatment) dataset contains only field data and network data collected over the same period, and it provides the data in csv and pcap file formats; however, we analyzed only the csv files in this study. Table 3 summarizes the datasets whose files include network traffic. As the table shows, the SWaT dataset has the longest duration and the largest packet volume.
Table 3. Network traffic in the dataset

| Dataset ID | Sub-data | Num. of packets | Bytes of packets | Duration |
|---|---|---|---|---|
| Lemay | Run8 | 72,186 | 6,035,064 | 1 h |
| | Run11 | 72,498 | 5,989,226 | 1 h |
| | Run1 6RTU | 134,690 | 15,017,158 | 1 h |
| | Run1 12RTU | 238,360 | 16,191,008 | 1 h 3 m |
| | Run1 3RTU 2 s | 305,932 | 20,330,477 | 1 h |
| | Polling only 6RTU | 58,325 | 3,441,247 | 59 m |
| | Moving two files 6RTU | 3,319 | 200,189 | 3 m |
| | Send a fake command 6RTU | 11,166 | 657,840 | 1 h 1 m |
| | Characterization 6RTU | 12,296 | 761,587 | 1 h 5 m |
| | CnC uploading exe 6RTU | 1,426 | 160,547 | 1 h 1 m |
| | 6RTU with operate | 1,856 | 1,129,078 | 1 h 1 m |
| | Channel 2d 3 s | 383,312 | 22,816,188 | 1 h 6 m |
| | Channel 3d 3 s | 255,668 | 15,218,187 | 44 m |
| | Channel 4d 1 s | 414,412 | 24,595,619 | 1 h 12 m |
| | Channel 4d 2 s | 266,387 | 15,833,346 | 46 m |
| | Channel 4d 5 s | 107,577 | 6,421,852 | 19 m |
| | Channel 4d 9 s | 60,295 | 3,619,845 | 11 m |
| | Channel 4d 12 s | 44,977 | 2,712,015 | 9 m |
| | Channel 5d 3 s | 143,809 | 8,559,985 | 25 m |
| SWaT | Network (pre-processed) | 19,761,714 | 5,498,545,489 | 11 d |
| Rodofile | Master | 1,802,757 | 173,836,593 | 9 h |
| | HMI | 448,655 | 61,956,933 | 9 h |
| | Attacker | 1,373,938 | 114,462,713 | 9 h |
| 4SICS | GeekLounge | 3,773,984 | 314,562,089 | 1 d 22 h 7 m |
| S4x15CTF | Advantech | 307 | 35,293 | 1 m |
| | BACnet FIU | 100,934 | 7,378,656 | N/A |
| | BACnet Host | 21,285 | 1,486,618 | N/A |
| | iFix Client | 5,149 | 818,114 | N/A |
| | iFix Server | 86,897 | 10,607,624 | N/A |
| | MicroLogix | 65,668 | 7,959,426 | N/A |
| | Modicon | 4,193 | 816,137 | 3 m |
| | WinXP | 26,068 | 2,975,574 | 3 m |
| DEFCON23 | ICS Village | 1,368,167 | 92,193,653 | 1 d 5 h 39 m |
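
Per-capture figures of the kind listed in Table 3 (packet count, byte volume, duration) can be derived directly from a pcap file. The following is a minimal sketch, not the procedure used in this paper: it assumes the classic pcap format (not pcapng), sums the original on-wire length of each packet, and takes the duration as the span between the earliest and latest timestamps. The function name `pcap_stats` is hypothetical.

```python
import struct
from io import BytesIO

# Magic bytes of the classic pcap global header -> (byte order, timestamp divisor)
_MAGICS = {
    b"\xd4\xc3\xb2\xa1": ("<", 1e6),  # little-endian, microsecond timestamps
    b"\xa1\xb2\xc3\xd4": (">", 1e6),  # big-endian, microsecond timestamps
    b"\x4d\x3c\xb2\xa1": ("<", 1e9),  # little-endian, nanosecond timestamps
    b"\xa1\xb2\x3c\x4d": (">", 1e9),  # big-endian, nanosecond timestamps
}

def pcap_stats(stream):
    """Return (packet count, total on-wire bytes, duration in seconds)."""
    header = stream.read(24)  # the global header is always 24 bytes
    if len(header) < 24 or header[:4] not in _MAGICS:
        raise ValueError("not a classic pcap stream")
    endian, divisor = _MAGICS[header[:4]]

    count = total_bytes = 0
    first = last = None
    while True:
        record = stream.read(16)  # per-packet record header
        if len(record) < 16:
            break
        sec, frac, incl_len, orig_len = struct.unpack(endian + "IIII", record)
        stream.seek(incl_len, 1)  # skip the captured payload
        ts = sec + frac / divisor
        first = ts if first is None else min(first, ts)
        last = ts if last is None else max(last, ts)
        count += 1
        total_bytes += orig_len  # assumption: count on-wire, not captured, length
    duration = (last - first) if count else 0.0
    return count, total_bytes, duration

# Usage (the file name is a placeholder):
# with open("capture.pcap", "rb") as f:
#     print(pcap_stats(f))
```

Because only the record headers are read, the sketch scales to large captures; it does not decode any protocol.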

3.2 ICS-related Protocols

We verified that the datasets contain various ICS protocols, as shown in Table 4. The Modbus protocol is included in most datasets (i.e., Modbus/RTU and Modbus/ASCII in Morris-2, and Modbus/TCP in the other datasets). EtherNet/IP (Common Industrial Protocol, CIP) is used in the SWaT, 4SICS, and S4x15CTF datasets. Since the 4SICS dataset contains the largest number of ICS protocols, it can be considered a priority for research on ICS protocols. The DEFCON23 dataset is unique in including all types of PROFINET protocols: PROFINET DCP, PROFINET PTCP, and PROFINET IO. Moreover, the S4x15CTF dataset includes BACnet, which is mainly used in building control systems. Therefore, it can be used for research related to direct digital control devices.
Table 4. ICS-related protocols in datasets

../images/477940_1_En_12_Chapter/477940_1_En_12_Tab3_HTML.png
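
A first-pass inventory of the ICS protocols in a capture can be obtained from well-known default ports and EtherTypes. The sketch below is a heuristic under stated assumptions, not the method used to build Table 4: ports can be remapped, TCP port 102 (ISO-TSAP) also carries protocols other than S7Comm, and a reliable result requires payload inspection. The function `classify` and its already-parsed arguments are hypothetical.

```python
from collections import Counter

# Commonly registered default ports; real deployments may use others.
ICS_PORTS = {
    ("tcp", 502): "Modbus/TCP",
    ("tcp", 102): "S7Comm (over ISO-TSAP)",
    ("tcp", 44818): "EtherNet/IP",
    ("udp", 44818): "EtherNet/IP",
    ("tcp", 20000): "DNP3",
    ("udp", 20000): "DNP3",
    ("udp", 47808): "BACnet/IP",
}

# PROFINET real-time frames (including DCP and PTCP) are carried directly
# over Ethernet with this EtherType rather than over TCP or UDP.
PROFINET_ETHERTYPE = 0x8892

def classify(ethertype, transport=None, src_port=None, dst_port=None):
    """Guess the ICS protocol of one packet from its header fields."""
    if ethertype == PROFINET_ETHERTYPE:
        return "PROFINET"
    for port in (dst_port, src_port):
        name = ICS_PORTS.get((transport, port))
        if name is not None:
            return name
    return "other"

# Usage with hypothetical, already-parsed packet summaries
packets = [
    (0x0800, "tcp", 49152, 502),
    (0x0800, "tcp", 502, 49152),
    (0x8892, None, None, None),
    (0x0800, "udp", 47808, 47808),
]
inventory = Counter(classify(*p) for p in packets)
```

Counting the guesses per capture gives a rough protocol inventory that can then be confirmed with a protocol dissector.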

3.3 Brief Description of Datasets

Morris et al. Datasets. Morris et al. [14–16] have released five different datasets related to power generation, gas, and water treatment for their intrusion detection research. Since all the Morris datasets provide labels, they can be considered as datasets for machine learning in the development of intrusion detection systems. The Morris-1 dataset consists of 37 power system event scenarios that consider the intelligent electronic device (IED) operation count, as well as normal/abnormal events, in a power system testbed composed of generators, IEDs, breakers, switches, and routers. The Morris-2, Morris-3, and Morris-4 datasets include Modbus communication between the control device and the HMI over an RS-232 or Ethernet interface in the gas pipeline testbed. Each dataset contains network data from which some header information of the raw packets, such as TCP and MAC headers, has been removed. In particular, the Morris-3 dataset also provides separate network data for the water storage tank. The Morris-5 dataset is relatively large and was collected from an actual energy management system for over 30 days, which is the longest period among the datasets. This dataset contains information on the event ID, priority code, device, and event message. Some of the information is anonymized due to security issues.

Lemay et al. Dataset. Lemay et al. [10] provided a network traffic dataset related to covert channel command and control in the supervisory control and data acquisition (SCADA) field. For the test environment, a SCADA network was constructed using SCADA Sandbox, a public tool, and two master terminal units were implemented using SCADA BR. The dataset includes Modbus/TCP traffic obtained by connecting three controllers with four field devices per controller. The dataset is diverse because it reflects various scenarios; for example, data were captured while changing the number of controllers and the polling cycle, with manual operation by the operator, etc. Most of the sub-datasets provide labels to distinguish between normal and abnormal data.

SWaT Dataset. The SWaT datasets contain sensor, actuator, and PLC input/output (I/O) signals as well as network traffic collected during seven days of normal operation and four days of attack scenarios. In particular, the SWaT datasets provide the largest amount of data, collected in a large testbed. The SWaT team defined the devices and physical points to be attacked and designed each attack, constructing a total of 36 attack scenarios related to field signals and network traffic [4]. The attack scenarios rely on the principles of the physical system to define normal operation: when the physical system operates differently, it is considered an attack [11]. In addition, since the datasets are separated into a network layer and a physical layer, they can be used in research on monitoring analog I/O and digital I/O, which are signals in the field layer.

Rodofile et al. Dataset. Rodofile et al. [23] used the Siemens S7-300 and S7-1200 PLCs to obtain the S7Comm Dataset on the mining refinery. The experimental environment consists of a conveyor, wash tank, pipeline reactor field device, master PLC, and slave PLC. To create an attack scenario, an attacker is allowed to access the PLC through the network and to perform a process attack that creates malfunctions in the control process. Rodofile et al. have released datasets on about nine hours of network traffic including S7Comm as well as HMI and PLC logs.

4SICS Dataset. The 4SICS dataset was collected from the ICS Lab environment, where Siemens S7-1200 and Automation Direct DirectLogic 205 PLCs and industrial network equipment including Hirschmann EAGLE 20 Tofino, Allen-Bradley Stratix 6000, and Moxa EDS-508A are deployed. Because heterogeneous ICS devices are networked in the same environment, the dataset includes traffic of various ICS-related protocols such as S7Comm, Modbus/TCP, EtherNet/IP, and DNP 3.0.

S4x15 ICS Village CTF Dataset. Unlike the other datasets, this dataset (hereinafter S4x15CTF) is the network traffic collected during the capture-the-flag (CTF) in the ICS Village, provided by DigitalBond [21]. Therefore, the dataset includes various attacks attempted by many CTF participants that focus on the components of the ICS Village (e.g., Advantech PLC, Modicon PLC, and MicroLogix PLC). Each dataset is grouped according to the components of the ICS Village, but no label is provided.

DEFCON 23 ICS Village Dataset. This dataset (hereinafter DEFCON23) includes network traffic collected while running the ICS Village at DEF CON 23. The ICS Village is composed of various control systems that use various communication protocols. In particular, the PROFINET PTCP, DCP, and IO protocols are included, but labels are not provided.

4 Attack Scenarios in the Public ICS Datasets

The ICS datasets are collections of information generated under normal operating scenarios or attack scenarios. As described in Fig. 2, a normal operating scenario consists of necessary actions or situations. Table 5 shows the analysis results for the datasets that provide attack labels or attack scenarios; datasets with neither are excluded. Due to the lack of space, we combined the individual sub-data of each dataset to describe the normal and attack data. We found that seven of the eleven datasets were provided with labels. The attack data were generally smaller than the normal data, but in the Morris-1 dataset and in some of the attacks of the Lemay dataset, the attack data accounted for a higher proportion than the normal data. In addition, some datasets contain less than 5% attack data. To classify the attack scenarios, we divided them into modification, fabrication, interruption, and interception [24].
Table 5. Statistics of normal and attack data in the labeled datasets

| Dataset ID | Num. of normal data (%) | Num. of attack data (%) |
|---|---|---|
| Morris-1 | 22,714 (28.98) | 55,663 (71.02) |
| Morris-2 | 140,382 (97.32) | 3,867 (2.68) |
| Morris-3 | 233,871 (70.10) | 99,627 (29.90) |
| Morris-4 | 643,740 (78.13) | 180,144 (21.87) |
| Lemay | 16,362 (92.09) | 1,405 (7.91) |
| SWaT | 395,298 (87.86) | 54,621 (12.14) |
| Rodofile | 1,137,294 (63.09) | 665,463 (36.91) |
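
The percentages in Table 5 follow directly from the record counts, and recomputing them is a quick way to check the class balance before training a detector. A minimal sketch, using counts taken from Table 5 (the function name `label_shares` is hypothetical):

```python
def label_shares(counts):
    """Map each label to its share of all records, in percent (2 decimals)."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no records")
    return {label: round(100.0 * n / total, 2) for label, n in counts.items()}

# Record counts taken from Table 5
morris_2 = {"normal": 140_382, "attack": 3_867}
swat = {"normal": 395_298, "attack": 54_621}

morris_2_shares = label_shares(morris_2)
swat_shares = label_shares(swat)
```

A dataset such as Morris-2, where attack records make up under 3% of the data, usually calls for imbalance-aware evaluation (e.g., precision and recall rather than plain accuracy).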

  • Modification. The information is not only intercepted but also modified by an attacker while in transit from the source to the destination (e.g., man-in-the-middle attack).

  • Fabrication. An attacker injects fake data into the system without having the sender do anything (e.g., relaying and masquerading attack).

  • Interruption. A system becomes unavailable due to resource exhaustion or physical destruction. This attack targets a specific system or communication path (e.g., denial-of-service (DoS) attack).

  • Interception. An attacker gets the information by intercepting information from the communication channel (e.g., wiretapping).

../images/477940_1_En_12_Chapter/477940_1_En_12_Fig3_HTML.png
Fig. 3. Attack scenarios with attack paths (Color figure online)

Figure 3 shows that attacks can occur along different attack paths. A solid red line denotes an attack path, and a dashed red line means that the attack can affect the information afterwards. An attack path within the same level means that the attack takes place at the device itself; for example, an attacker may change the settings of a control system directly. We assigned attack scenario IDs to each dataset through an attack scenario analysis. Table 6 shows the attack scenario IDs, which identify the four general attack methods on the communication paths between levels described in Fig. 3. Each attack method and path is expressed through a symbol. For instance, ‘F3’ means that the attack scenarios include a fabrication attack from Level 1 to Level 0.

We restricted the attack scenarios of the labeled ICS datasets to those represented in Table 6, although various ICS attacks and their real-world exploitation have been reported [12]. We found that the datasets include modification, fabrication, and interruption attacks, but not interception. In particular, the attacker injects data or commands through the connection between Level 1 and Level 2. The specific attack scenarios targeting ICSs are as follows. First, reconnaissance (e.g., scanning) attacks are performed as preliminary work to collect information, for example on control system services. In the case of PLCWorm, scanning was performed to identify available service ports. Second, a DoS attack either trips a field device by sending a trip command or exploits a Modbus communication vulnerability to cause resource exhaustion. Third, the response injection attack (i.e., HMI spoofing attack), of which Stuxnet is a representative example, injects response contents so that the operator cannot correctly recognize the state of the field device on the HMI. Lastly, a command/data injection attack causes a field device to trip or fault by manipulating or injecting a command with an abnormal value outside the threshold. As shown in Table 7, modification and fabrication attacks through all levels are the major attack scenarios in the public ICS datasets. In the datasets, the ICS attack scenarios tend to focus on causing a field device to malfunction by sending abnormal data to control systems.
Table 6. Categories of attack scenarios based on attack paths and methods

| Attack path | Modification | Fabrication | Interruption | Interception |
|---|---|---|---|---|
| Level 0 → 0 | M1 | F1 | R1 | C1 |
| Level 0 → 1 | M2 | F2 | R2 | C2 |
| Level 1 → 0 | M3 | F3 | R3 | C3 |
| Level 1 → 1 | M4 | F4 | R4 | C4 |
| Level 1 → 2 | M5 | F5 | R5 | C5 |
| Level 2 → 1 | M6 | F6 | R6 | C6 |
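
The ID scheme of Table 6 combines a method symbol (M, F, R, C) with the row index of the attack path. As an illustration only, the mapping can be encoded as a small lookup so that dataset annotations can be tagged consistently; the function names below are hypothetical.

```python
# Method symbols and path order exactly as listed in Table 6
METHOD_SYMBOL = {
    "modification": "M",
    "fabrication": "F",
    "interruption": "R",
    "interception": "C",
}
# (source level, destination level), in table row order
PATHS = [(0, 0), (0, 1), (1, 0), (1, 1), (1, 2), (2, 1)]

def scenario_id(method, src_level, dst_level):
    """Return the Table 6 ID for an attack method on a given path."""
    index = PATHS.index((src_level, dst_level)) + 1  # IDs are 1-based
    return METHOD_SYMBOL[method] + str(index)

def decode_scenario_id(sid):
    """Inverse of scenario_id: 'F3' -> ('fabrication', 1, 0)."""
    symbol_to_method = {s: m for m, s in METHOD_SYMBOL.items()}
    src_level, dst_level = PATHS[int(sid[1:]) - 1]
    return symbol_to_method[sid[0]], src_level, dst_level
```

For example, `scenario_id("fabrication", 1, 0)` yields the ID `F3` discussed in the text. A path that is not a row of Table 6, such as Level 0 to Level 2, raises a `ValueError`.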

Table 7. Classification of attack scenarios in datasets

../images/477940_1_En_12_Chapter/477940_1_En_12_Tab5_HTML.png

5 Consideration for Generating ICS Datasets

  • Timing issues. The acquisition times of I/O and internal information may not be synchronized when OLE for process control (OPC) is used to collect information from various control systems such as PLCs and distributed control systems. Even where OPC is not used, timing can be unreliable: the S4x15CTF dataset is provided with ‘1970-01-01 00:00:00’ as time information because time synchronization between the devices was not applied. Therefore, it is preferable to synchronize time, for example through the network time protocol, so that information generated at nearly the same time can be identified easily.

  • Criteria of abnormal states. In a dataset, it is important to label the time of each abnormal action (e.g., attack) that occurs during normal operation of the control systems. Machine-learning and detection techniques rely on the label of each record to distinguish between normal and abnormal states. If a dataset does not correctly provide both normal and abnormal labels reflecting the characteristics of the control devices, neither machine learning nor detection can be performed properly. Even when a record is labeled as normal, the actual data may show a pattern different from the normal state. In the case of the SWaT dataset, some studies have excluded the data collected during the initial operation of the experimental environment from training because the sensors had not yet stabilized [7]. In addition, after an attack ends, the sensor values may not stabilize immediately but may return to the normal state gradually.

  • Same attack in different environments. ICSs react differently depending on the time, target, and operational state at which an attack occurs, even if the attack is the same. To evaluate machine-learning-based anomaly detection against an attack scenario, it is essential to run the same attack several times in different states of the target system. To provide diverse datasets, it is necessary to consider constructing a system that can reproduce an attack defined by the user at a desired time.
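
The labeling consideration above, where records marked as normal may still be unstable during start-up or just after an attack, can be handled at preprocessing time by masking those periods. The following is a minimal sketch under stated assumptions: records are in time order, labels are 0 (normal) or 1 (attack), and the window lengths `warmup` and `settle` are hypothetical parameters that would have to be chosen per testbed. It is not a procedure prescribed by any of the surveyed datasets.

```python
def usable_mask(labels, warmup, settle):
    """Return one bool per record: True if the record can be trusted.

    labels : sequence of 0 (normal) / 1 (attack), in time order
    warmup : number of initial records to drop (sensors not yet stable)
    settle : number of normal-labeled records to drop after each attack ends
    """
    n = len(labels)
    keep = [True] * n

    # Drop the start-up period regardless of label.
    for i in range(min(warmup, n)):
        keep[i] = False

    # After each attack -> normal transition, drop the settling window.
    for i in range(1, n):
        if labels[i - 1] == 1 and labels[i] == 0:
            for j in range(i, min(i + settle, n)):
                if labels[j] == 1:  # a new attack started; stop masking
                    break
                keep[j] = False
    return keep

# Usage with a toy label sequence
labels = [0, 0, 0, 1, 1, 0, 0, 0, 0]
mask = usable_mask(labels, warmup=2, settle=2)
clean = [lab for lab, ok in zip(labels, mask) if ok]
```

Attack-labeled records outside the start-up period are always kept; only normal-labeled records in the settling window are discarded, so the attack proportion of the cleaned data rises slightly.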

6 Conclusion

We analyzed various aspects of publicly available ICS datasets. We broke down attack scenarios by attack method and path, and then identified the attack scenarios of each dataset. We found that the ICS datasets are biased towards specific attack paths. We also presented considerations for generating datasets for ICS security research. We expect that our results can be used as an index when using and generating ICS datasets for security research.