1 Introduction
Industrial control systems (ICSs) are widely deployed in critical infrastructure such as power plants, water treatment plants, and gas facilities. ICSs provide measurement, monitoring, and control functions for various field devices [2]. In addition, as part of Industry 4.0, ICSs are being extended to industrial applications such as digital twins, where they are connected to heterogeneous devices to monitor a wide range of state information and analyze data. As the operating environment of the ICS becomes more complex, owing to the scalability and openness with which heterogeneous components are connected to each other in a network, its attack surface is exposed to various security threats.
Because an ICS directly controls physical systems such as field devices, a cyber-attack may cause not only destruction of a device but also physical damage, for example from a secondary explosion. It is therefore essential to prepare security countermeasures. In fact, ICS-CERT of the US Department of Homeland Security reported 257 ICS-related vulnerabilities in 2016, and the number is expected to continue to grow [17].
In general, an ICS experimental environment consists of a Level 0 layer representing the field devices, a Level 1 layer performing the computation and processing for the ICS control process, and a Level 2 layer handling the control process and operation information through a human-machine interface (HMI). When the environment is built, the devices must be placed and configured, and a system for collecting various data during ICS operation must be arranged. Once the setup is complete, an operation scenario is configured according to the purpose of the test and used for testing and verification. Based on this hierarchical architecture, environments for providing datasets have been actively studied [4, 10, 14–16, 23]. An environment for ICS dataset collection should simulate the operating environment of an actual control system while taking scalability into account. Thus, related works have used emulation or simulation, as appropriate, to construct the field devices, programmable logic controllers (PLCs), network, and so on.
In this paper, we analyze publicly shareable ICS datasets for security research and present our results in terms of attack methods and attack paths. In addition, we discuss limitations and considerations for security research based on the surveyed datasets. We expect our results to be useful for comparison when selecting suitable datasets for further dataset-driven ICS security research.
The rest of this paper is organized as follows. Section 2 addresses the background of the ICS experimental environment and the scenario of normal operation. Section 3 presents an overview of the ICS datasets. In Sect. 4, we describe our comparative analysis of each dataset in terms of attack scenarios. In Sect. 5, we discuss considerations for generating datasets. We conclude this paper in Sect. 6.
2 Background
Table 1. Public ICS datasets to be analyzed in this paper
Dataset ID | Data domain | Year of release | Data source | Related works |
---|---|---|---|---|
Morris-1 | Power System | 2014 | [13] | |
Morris-2 | Gas Pipeline | 2013 | [13] | [1] |
Morris-3 | Gas Pipeline, Water | 2014 | [13] | [14] |
Morris-4 | Gas Pipeline | 2015 | [13] | [16] |
Morris-5 | EMS | 2017 | [13] | - |
Lemay | SCADA | 2016 | [9] | [10] |
SWaT | Water | 2016 | [6] | |
Rodofile | Mining Refinery | 2017 | [22] | [23] |
4SICS | Complex | 2015 | [8] | - |
S4x15CTF | Complex | 2015 | [21] | - |
DEFCON23 | Complex | 2015 | [3] | - |
Type of datasets
3 Public ICS Datasets
In this section, we briefly describe each dataset listed in Table 1 before comparing them. Each dataset was collected from its own experimental environment in a specific or complex domain. To bound our analysis, we limited our study to ICS-related datasets that are publicly accessible.
3.1 Data Type
Network traffic in the dataset
Dataset ID | Sub-data | Number of packets | Bytes | Duration |
---|---|---|---|---|
Lemay | Run8 | 72,186 | 6,035,064 | 1 h |
Lemay | Run11 | 72,498 | 5,989,226 | 1 h |
Lemay | Run1 6RTU | 134,690 | 15,017,158 | 1 h |
Lemay | Run1 12RTU | 238,360 | 16,191,008 | 1 h 3 m |
Lemay | Run1 3RTU 2 s | 305,932 | 20,330,477 | 1 h |
Lemay | Polling only 6RTU | 58,325 | 3,441,247 | 59 m |
Lemay | Moving two files 6RTU | 3,319 | 200,189 | 3 m |
Lemay | Send a fake command 6RTU | 11,166 | 657,840 | 1 h 1 m |
Lemay | Characterization 6RTU | 12,296 | 761,587 | 1 h 5 m |
Lemay | CnC uploading exe 6RTU | 1,426 | 160,547 | 1 h 1 m |
Lemay | 6RTU with operate | 1,856 | 1,129,078 | 1 h 1 m |
Lemay | Channel 2d 3 s | 383,312 | 22,816,188 | 1 h 6 m |
Lemay | Channel 3d 3 s | 255,668 | 15,218,187 | 44 m |
Lemay | Channel 4d 1 s | 414,412 | 24,595,619 | 1 h 12 m |
Lemay | Channel 4d 2 s | 266,387 | 15,833,346 | 46 m |
Lemay | Channel 4d 5 s | 107,577 | 6,421,852 | 19 m |
Lemay | Channel 4d 9 s | 60,295 | 3,619,845 | 11 m |
Lemay | Channel 4d 12 s | 44,977 | 2,712,015 | 9 m |
Lemay | Channel 5d 3 s | 143,809 | 8,559,985 | 25 m |
SWaT | Network (pre-processed) | 19,761,714 | 5,498,545,489 | 11 d |
Rodofile | Master | 1,802,757 | 173,836,593 | 9 h |
Rodofile | HMI | 448,655 | 61,956,933 | 9 h |
Rodofile | Attacker | 1,373,938 | 114,462,713 | 9 h |
4SICS | GeekLounge | 3,773,984 | 314,562,089 | 1 d 22 h 7 m |
S4x15CTF | Advantech | 307 | 35,293 | 1 m |
S4x15CTF | BACnet FIU | 100,934 | 7,378,656 | N/A |
S4x15CTF | BACnet Host | 21,285 | 1,486,618 | N/A |
S4x15CTF | iFix Client | 5,149 | 818,114 | N/A |
S4x15CTF | iFix Server | 86,897 | 10,607,624 | N/A |
S4x15CTF | MicroLogix | 65,668 | 7,959,426 | N/A |
S4x15CTF | Modicon | 4,193 | 816,137 | 3 m |
S4x15CTF | WinXP | 26,068 | 2,975,574 | 3 m |
DEFCON23 | ICS Village | 1,368,167 | 92,193,653 | 1 d 5 h 39 m |
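The per-capture figures in the table above (packet count, byte count, and duration) can be recomputed from a capture file. The following is a minimal sketch, assuming the capture is stored in the classic pcap format with microsecond timestamps (not pcapng) and that bytes are counted as on-the-wire lengths; the function name `summarize_pcap` is hypothetical, and the paper does not state which tool or length field was used for its own figures.

```python
import struct


def summarize_pcap(data: bytes) -> dict:
    """Return packet count, total bytes, and duration (s) of a classic pcap."""
    if len(data) < 24:
        raise ValueError("too short to be a pcap file")
    # The magic number reveals the byte order used for every header field.
    magic = struct.unpack("<I", data[:4])[0]
    if magic == 0xA1B2C3D4:
        endian = "<"
    elif magic == 0xD4C3B2A1:
        endian = ">"
    else:
        raise ValueError("not a classic microsecond-resolution pcap file")

    offset = 24  # skip the 24-byte global header
    packets = 0
    total_bytes = 0
    first_ts = None
    last_ts = None
    while offset + 16 <= len(data):
        # Per-packet record header: seconds, microseconds,
        # captured length, original (on-the-wire) length.
        ts_sec, ts_usec, incl_len, orig_len = struct.unpack(
            endian + "IIII", data[offset:offset + 16]
        )
        ts = ts_sec + ts_usec / 1_000_000
        first_ts = ts if first_ts is None else min(first_ts, ts)
        last_ts = ts if last_ts is None else max(last_ts, ts)
        packets += 1
        total_bytes += orig_len  # assumption: count on-the-wire bytes
        offset += 16 + incl_len  # packet data follows the record header
    duration = (last_ts - first_ts) if packets else 0.0
    return {"packets": packets, "bytes": total_bytes, "duration_s": duration}
```

A capture could be summarized with `summarize_pcap(open(path, "rb").read())`, where `path` points to one of the downloaded files. Captures with missing or unsynchronized timestamps would yield a meaningless duration, which may be why some durations are listed as N/A.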
3.2 ICS-related Protocols
ICS-related protocols in datasets
3.3 Brief Description of Datasets
Morris et al. Datasets. Morris et al. [14–16] have released five different datasets related to power generation, gas, and water treatment for their intrusion detection research. Because all of the Morris datasets provide labels, they can be used for machine learning in the development of intrusion detection systems. The Morris-1 dataset consists of 37 power system event scenarios that consider the intelligent electronic device (IED) operation count as well as normal and abnormal events in a power system testbed composed of generators, IEDs, breakers, switches, and routers. The Morris-2, Morris-3, and Morris-4 datasets capture Modbus communication between the control device and the HMI over an RS-232 or Ethernet interface in the gas pipeline testbed. Each of these datasets contains network data from which some header information of the raw packets, such as the TCP and MAC headers, has been removed. In particular, the Morris-3 dataset also provides separate network data for the water storage tank. The Morris-5 dataset is relatively large and was collected from an actual energy management system over more than 30 days, the longest collection period among the datasets. It contains the event ID, priority code, device, and event message. Some of the information is anonymized for security reasons.
Lemay et al. Dataset. Lemay et al. [10] provided a network traffic dataset related to covert channel command and control in the supervisory control and data acquisition (SCADA) field. The test SCADA network was built with SCADA Sandbox, a public tool, and two master terminal units were implemented using ScadaBR. The dataset contains Modbus/TCP traffic from three controllers, each connected to four field devices. The dataset is diverse because it reflects various scenarios: for example, the number of controllers and the polling cycle are varied, and manual operation by the operator is included. Most of the sub-datasets provide labels to distinguish between normal and abnormal data.
SWaT Dataset. The SWaT datasets contain sensor, actuator, and PLC input/output (I/O) signals as well as network traffic, collected during seven days of normal operation and four days of attack scenarios. In particular, the SWaT datasets provide the largest amount of data, collected in a large testbed. The SWaT team defined the devices and physical points to be attacked and designed a total of 36 attack scenarios related to field signals and network traffic [4]. The attack scenarios rely on the principles of the physical system to define normal operation: when the physical system operates differently, the behavior is considered an attack [11]. In addition, since the datasets are separated into a network layer and a physical layer, they can be used in research on monitoring analog and digital I/O, which are the signals in the field layer.
Rodofile et al. Dataset. Rodofile et al. [23] used Siemens S7-300 and S7-1200 PLCs to obtain an S7Comm dataset for a mining refinery. The experimental environment consists of conveyor, wash tank, and pipeline reactor field devices, a master PLC, and a slave PLC. In the attack scenario, an attacker is allowed to access the PLC through the network and to perform a process attack that causes malfunctions in the control process. Rodofile et al. have released about nine hours of network traffic including S7Comm, as well as HMI and PLC logs.
4SICS Dataset. The 4SICS dataset was collected in the ICS Lab environment, in which Siemens S7-1200 and Automation Direct DirectLogic 205 PLCs and industrial network equipment, including Hirschmann EAGLE 20 Tofino, Allen-Bradley Stratix 6000, and Moxa EDS-508A, are deployed. Because heterogeneous ICS devices are networked in the same environment, the dataset includes traffic from various ICS-related protocols such as S7Comm, Modbus/TCP, EtherNet/IP, and DNP 3.0.
S4x15 ICS Village CTF Dataset. Unlike the other datasets, this dataset (hereinafter S4x15CTF) is network traffic collected during a capture-the-flag (CTF) event in the ICS Village, provided by DigitalBond [21]. Therefore, the dataset includes various attacks attempted by many CTF participants against the components of the ICS Village (e.g., Advantech PLC, Modicon PLC, and MicroLogix PLC). The data are grouped by ICS Village component, but no labels are provided.
DEFCON 23 ICS Village Dataset. This dataset (hereinafter DEFCON23) includes network traffic collected while running the ICS Village at DEF CON 23. The ICS Village is composed of various control systems that use a variety of communication protocols. In particular, the PROFINET PTCP, DCP, and IO protocols are included, but labels are not provided.
4 Attack Scenarios in the Public ICS Datasets
Statistics of normal and attack data in the labeled datasets
Dataset ID | Num. of normal data (%) | Num. of attack data (%) |
---|---|---|
Morris-1 | 22,714 (28.98) | 55,663 (71.02) |
Morris-2 | 140,382 (97.32) | 3,867 (2.68) |
Morris-3 | 233,871 (70.10) | 99,627 (29.90) |
Morris-4 | 643,740 (78.13) | 180,144 (21.87) |
Lemay | 16,362 (92.09) | 1,405 (7.91) |
SWaT | 395,298 (87.86) | 54,621 (12.14) |
Rodofile | 1,137,294 (63.09) | 665,463 (36.91) |
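The imbalance between normal and attack records matters when these datasets are used to train detectors. The sketch below recomputes the attack share for three rows of the table above and derives inverse-frequency class weights. The weighting formula (total divided by twice the class count) is a common heuristic that we assume here for illustration; it is not taken from the surveyed works, and the function names are hypothetical.

```python
def attack_share(normal: int, attack: int) -> float:
    """Percentage of attack records, rounded to two decimals."""
    total = normal + attack
    if total == 0:
        raise ValueError("empty dataset")
    return round(100 * attack / total, 2)


def balanced_class_weights(normal: int, attack: int) -> dict:
    """Inverse-frequency weights for a two-class problem (assumed heuristic)."""
    total = normal + attack
    return {
        "normal": total / (2 * normal),
        "attack": total / (2 * attack),
    }


# (normal, attack) record counts copied from the table above.
counts = {
    "Morris-2": (140_382, 3_867),
    "Lemay": (16_362, 1_405),
    "SWaT": (395_298, 54_621),
}

shares = {name: attack_share(n, a) for name, (n, a) in counts.items()}
# shares == {"Morris-2": 2.68, "Lemay": 7.91, "SWaT": 12.14}

weights = {name: balanced_class_weights(n, a) for name, (n, a) in counts.items()}
```

With such weights, the rare attack class contributes as much to a weighted loss as the dominant normal class, which is one way to compensate for the skew visible in Morris-2 and Lemay.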
Modification. The information is not only intercepted but also modified by an attacker while in transit from the source to the destination (e.g., man-in-the-middle attack).
Fabrication. An attacker injects fake data into the system without any action by the legitimate sender (e.g., relaying and masquerading attacks).
Interruption. A system becomes unavailable due to resource exhaustion or physical destruction. This attack targets a specific system or communication path (e.g., a denial-of-service (DoS) attack).
Interception. An attacker gets the information by intercepting information from the communication channel (e.g., wiretapping).
Figure 3 shows that attacks can occur along different attack paths. A solid red line denotes an attack path. A dashed red line means that the attack could affect the information afterwards. An attack path within the same communication level means that the attack takes place at a device itself; for example, an attacker may change the settings of a control system directly. We assigned attack scenario IDs to each dataset through an attack scenario analysis. Table 6 lists the attack scenario IDs, which combine the four general attack methods with the communication paths between levels described in Fig. 3. We express each attack method and path with a symbol. For instance, ‘F3’ means that the attack scenarios include a fabrication attack from level 1 to level 0.
Table 6. Categories of attack scenarios based on attack paths and methods
Attack path | Modification | Fabrication | Interruption | Interception |
---|---|---|---|---|
Level 0 → 0 | M1 | F1 | R1 | C1 |
Level 0 → 1 | M2 | F2 | R2 | C2 |
Level 1 → 0 | M3 | F3 | R3 | C3 |
Level 1 → 1 | M4 | F4 | R4 | C4 |
Level 1 → 2 | M5 | F5 | R5 | C5 |
Level 2 → 1 | M6 | F6 | R6 | C6 |
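The scenario IDs in the table above can be encoded and decoded mechanically. The sketch below reads each attack path as a (source level, destination level) pair, consistent with the 'F3' example in the text (fabrication from level 1 to level 0). The function names are hypothetical and serve only to illustrate the naming scheme.

```python
# Letter used for each attack method in the scenario ID.
METHOD_LETTER = {
    "modification": "M",
    "fabrication": "F",
    "interruption": "R",
    "interception": "C",
}

# Index used for each (source level, destination level) attack path.
PATH_INDEX = {
    (0, 0): 1,
    (0, 1): 2,
    (1, 0): 3,
    (1, 1): 4,
    (1, 2): 5,
    (2, 1): 6,
}


def scenario_id(method: str, src_level: int, dst_level: int) -> str:
    """Build an ID such as 'F3' from an attack method and an attack path."""
    try:
        letter = METHOD_LETTER[method.lower()]
        index = PATH_INDEX[(src_level, dst_level)]
    except KeyError:
        raise ValueError("unknown attack method or attack path") from None
    return f"{letter}{index}"


def parse_scenario_id(sid: str) -> tuple:
    """Return (method, source level, destination level) for an ID like 'F3'."""
    methods = {letter: name for name, letter in METHOD_LETTER.items()}
    paths = {index: path for path, index in PATH_INDEX.items()}
    try:
        method = methods[sid[0].upper()]
        src_level, dst_level = paths[int(sid[1:])]
    except (KeyError, ValueError, IndexError):
        raise ValueError(f"invalid scenario ID: {sid!r}") from None
    return method, src_level, dst_level
```

For example, `scenario_id("fabrication", 1, 0)` yields `"F3"`, and `parse_scenario_id("C6")` yields `("interception", 2, 1)`.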
Classification of attack scenarios in datasets
5 Consideration for Generating ICS Datasets
Timing issues. When OLE for process control (OPC) is used to collect information from various control systems such as PLCs and distributed control systems, the acquisition times of I/O and internal information may not be synchronized. Timing can be a problem even when OPC is not used: the S4x15CTF dataset is provided with ‘1970-01-01 00:00:00’ as its time information because time synchronization between the devices was not applied. Therefore, it is preferable to synchronize clocks, for example through the Network Time Protocol (NTP), so that information generated at nearly the same time can be identified easily.
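A simple sanity check can reveal records whose clocks were never set before a dataset is used. The sketch below flags timestamps that predate a plausibility bound; the timestamp format, the bound of 1 January 2000, and the function name are assumptions chosen for illustration.

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S"           # assumed timestamp format
EARLIEST_PLAUSIBLE = datetime(2000, 1, 1)  # assumed lower bound for real captures


def suspicious_timestamps(stamps: list) -> list:
    """Return indices of timestamps that predate EARLIEST_PLAUSIBLE.

    A value such as '1970-01-01 00:00:00' (the Unix epoch) usually
    indicates that the recording device's clock was never synchronized.
    """
    return [
        i
        for i, stamp in enumerate(stamps)
        if datetime.strptime(stamp, TS_FORMAT) < EARLIEST_PLAUSIBLE
    ]
```

For instance, `suspicious_timestamps(["1970-01-01 00:00:00", "2015-01-14 09:30:00"])` returns `[0]`. Records flagged this way cannot be ordered reliably against data from other devices.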
Criteria of abnormal states. In a dataset, it is important to label the times at which abnormal actions (e.g., attacks) occur during normal operation of the control systems. Machine-learning and detection techniques can use the label attached to each record to distinguish between normal and abnormal states. If a dataset does not correctly provide both normal and abnormal labels that reflect the characteristics of the control devices, neither machine learning nor detection can be performed properly. Even when a record is labeled as normal, the actual data may show a pattern that differs from the normal state. In the case of the SWaT dataset, some studies have excluded the data collected during the initial operation of the experimental environment from training because the sensors had not yet stabilized during that period [7]. In addition, after an attack ends, the sensor values may not stabilize immediately but may return to the normal state gradually.
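One way to act on these observations is to exclude unstable periods from the normal training data. The following sketch drops an initial warm-up window and a settling window after each attack. It assumes one label per sample at a fixed sampling rate, with 0 for normal and 1 for attack; the window lengths and the function name are hypothetical and would have to be chosen per testbed.

```python
def stable_normal_indices(labels: list, warmup: int, settle: int) -> list:
    """Indices of normal samples outside warm-up and post-attack windows.

    labels: 0 (normal) or 1 (attack), one entry per sample.
    warmup: number of initial samples to discard.
    settle: number of normal samples to discard after each attack ends.
    """
    keep = []
    cooldown = 0
    for i, label in enumerate(labels):
        if label == 1:
            cooldown = settle  # restart the guard window during an attack
            continue
        if cooldown > 0:
            cooldown -= 1      # still settling after the last attack
            continue
        if i >= warmup:
            keep.append(i)
    return keep
```

For example, with `labels = [0, 0, 0, 0, 1, 1, 0, 0, 0, 0]`, `warmup=2`, and `settle=2`, the function returns `[2, 3, 8, 9]`: the first two samples and the two samples following the attack are excluded.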
Same attack in different environments. Even for the same attack, ICSs react differently depending on the time, target, and operational state at which the attack occurs. To evaluate machine-learning-based anomaly detection against an attack scenario, it is essential to run the same attack several times in different states of the target system. To provide diverse datasets, it is therefore necessary to consider building a system that can reproduce a user-defined attack at a desired time.
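Such a reproduction system needs a plan of which attack to run, when, and in which plant state. The sketch below enumerates one possible test matrix; the field names, the example plant states, and the function name are hypothetical and only illustrate the idea of repeating the same attack under varied conditions.

```python
from itertools import product


def attack_test_plan(attacks, start_offsets_s, plant_states, repeats=1):
    """Enumerate every combination of attack, start time, and plant state.

    attacks: attack scenario identifiers (e.g., 'F3').
    start_offsets_s: start times in seconds after the beginning of a run.
    plant_states: operational states in which to launch the attack.
    repeats: number of repetitions of each combination.
    """
    return [
        {"attack": a, "start_offset_s": t, "plant_state": s, "run": r}
        for a, t, s, r in product(
            attacks, start_offsets_s, plant_states, range(1, repeats + 1)
        )
    ]


plan = attack_test_plan(["F3"], [0, 600], ["filling", "draining"], repeats=2)
```

Here `plan` contains eight entries: one attack, two start times, two plant states, and two repetitions each.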
6 Conclusion
We analyzed various aspects of publicly available ICS datasets. We broke down attack scenarios by attack method and path, and then identified the attack scenarios in each dataset. As a result, we found that the ICS datasets are biased towards specific attack paths. We also presented additional considerations for generating datasets for ICS security research. We expect that our results can serve as a guide when using and generating ICS datasets for security research.