© Springer Nature Switzerland AG 2019
Jiaojiao Jiang, Sheng Wen, Bo Liu, Shui Yu, Yang Xiang and Wanlei ZhouMalicious Attack Propagation and Source IdentificationAdvances in Information Security73https://doi.org/10.1007/978-3-030-02179-5_1

1. Introduction

Jiaojiao Jiang1 , Sheng Wen1, Shui Yu3, Bo Liu2, Yang Xiang3 and Wanlei Zhou4
(1)
Swinburne University of Technology, Hawthorne, Melbourne, VIC, Australia
(2)
La Trobe University, Bundoora, VIC, Australia
(3)
University of Technology Sydney, Ultimo, NSW, Australia
(4)
Digital Research & Innovation Capability, Swinburne University of Technology, Hawthorn, Melbourne, VIC, Australia
 

1.1 Malicious Attacks

With the remarkable advances in computer technologies, our social, financial and professional existences become increasingly digitized, and governments, healthcare and military infrastructures rely more on computer technologies. They, meanwhile, present larger and more lucrative targets for malicious attacks [81]. A malicious attack is an attempt to forcefully abuse or take advantage of a computer system or a network asset. This can be done with the intent of stealing personal information (logins, financial data, even electronic money), or to reduce the functionality of a target computer. According to statistics, worldwide financial losses due to malicious attacks averaged $12.18 billion per year from 1997 to 2006 [50] and increased to $110 billion between July 2011 and the end of July 2012 [167].

Typical malicious attacks include viruses, worms, Trojan horses, spyware, adware, phishing, spamming, rumors, and other types of social engineering. Since the first computer virus surfaced in the early 1980s, malicious attacks has developed into thousands of variants that differ in infection mechanism, propagation mechanism, destructive payload, and other features [51].

According to the types of cyber threats that contributed to breaches, malicious attacks are divided into two main categories [100]: malware and social engineering.
  • Malware, short for “malicious software”, is a broad term that refers to a variety of malicious programs designed to compromise computers and devices in several ways, steal data, bypass access controls, or cause harm to the host computers. Malware comes in a number of forms, including viruses, worms, Trojans, spyware, adware, rootkits and more.

  • Social Engineering is the art of getting users to compromise information systems, where the attackers use human interaction (i.e., social skills) to obtain or compromise information about an organization or its computer systems. Instead of technical attacks on systems, social engineers target humans with access to information, manipulating them into divulging confidential information or even into carrying out their malicious attacks through influence and persuasion. Technical protection measures are usually ineffective against this kind of attack. Social engineering attacks are multifaceted and include physical, social and technical aspects, which are used in different stages of the actual attack. Pretexting, Phishing and Baiting are the most common social engineering tactics.

1.2 Examples of Malicious Attacks

Motivated by extraordinary financial or political rewards, malware owners are exhausting their energy to compromise as many networked computers as they can in order to achieve their malicious goals. A compromised computer is called a bot, and all bots compromised by a malware form a botnet. Bot-nets have become the attack engine of malicious attacks, and they pose critical challenges to cyber defenders. As many security reports indicate—attack numbers are quickly increasing worldwide, reaching new, unbelievable peaks. Here are some of the major cyber malicious attacks in recent years and what we can learn from them.
  • ILOVEYOU is considered one of the most virulent computer virus ever created. The virus managed to wreck havoc on computer systems all over the world, causing damages totaling in at an estimate of $10 billion. Ten percent of the world’s Internet-connected computers were believed to have been infected. It was so bad that governments and large corporations took their mailing system offline to prevent infection. The virus used social engineering to get people to click on the attachment; in this case, a love confession. The attachment was actually a script that poses as a TXT file. Once clicked, it will send itself to everyone in the user’s mailing list and proceed to overwrite files with itself, making the computer unbootable.

  • CryptoLocker [119] is a form of Trojan horse ransomware targeted at computers running Windows. It uses several methods to spread itself, such as email, and once a computer is infected, it will proceed to encrypt certain files on the hard drive and any mounted storage connected to it with RSA public key cryptography. Figure 1.1 illustrates the infection chain of CryptoLocker. While it is easy enough to remove the malware from the computer, the files will still remain encrypted. The only way to unlock the files is to pay a ransom by a deadline. If the deadline is not met, the ransom will increase significantly or the decryption keys deleted. The ransom usually amount to $400 in prepaid cash or bit coin. From data collected from the raid, the number of infections is estimated to be 500,000, with the number of those who paid the ransom to be at 1.3%, amounting to $3 million [176].
    ../images/466717_1_En_1_Chapter/466717_1_En_1_Fig1_HTML.png
    Fig. 1.1

    CryptoLocker infection chain [6]

  • Code Red first surfaced on 2001. The worm targeted computers with Microsoft IIS web server installed, exploiting a buffer overflow problem in the system. It leaves very little trace on the hard disk as it is able to run entirely on memory, with a size of 3569 bytes. Once infected, it will proceed to make a hundred copies of itself but due to a bug in the programming, it will duplicate even more and ends up eating a lot of the systems resources. It will then launch a denial of service attack on several IP address, famous among them the website of the White House. It also allows backdoor access to the server, allowing for remote access to the machine. It was estimate that it caused $2 billion in lost productivity. A total of 1–2 million servers were affected.

  • Conficker is a worm for Windows that made its first appearance in 2008. It infects computers using flaws in the OS to create a botnet. The malware was able to infect more than 9 millions computers all around the world, affecting governments, businesses and individuals. It was one of the largest known worm infections to ever surface causing an estimate damage of $9 billion. The worm works by exploiting a network service vulnerability that was present and unpatched in Windows. Once infected, the worm will then reset account lockout policies, block access to Windows update and anti-virus sites, turn off certain services and lock out user accounts among many. Then, it proceeds to install software that will turn the computer into a botnet slave and scareware to scam money off the user.

  • Mydoom surfacing in 2004, was a worm for Windows that became one of the fastest spreading email worm since ILOVEYOU. The worm spreads itself by appearing as an email transmission error and contains an attachment of itself. Once executed, it will send itself to email addresses that are in a user’s address book and copies itself to any P2P program’s folder to propagate itself through that network. The payload itself is twofold: first it opens up a backdoor to allow remote access and second it launches a denial of service attack on the controversial SCO Group. It was believed that the worm was created to disrupt SCO due to conflict over ownership of some Linux code. It caused an estimate of $38.5 billion in damages and the worm is still active in some form today.

  • Flashback is one of the Mac malware. The Trojan was first discovered in 2011. A user simply needs to have Java enabled. It propagates itself by using compromised websites containing JavaScript code that will download the payload. Once installed, the Mac becomes part of a botnet of other infected Macs. More than 600,000 Macs were infected, including 274 Macs in the Cupertino area, the headquarters of Apple.

1.3 Propagation Mechanism of Malicious Attacks

As we can see from the above examples of malicious attacks, a malicious attack propagation starts from one or few hosts and quickly infects many other hosts. For example, the ILOVEYOU worm attached in Outlook Mails mailed itself to all addresses on the host’s mailing list. The Code Red performed network scanning and propagated to IP addresses connected to the host. In order to fight against malicious attacks, it is important for cyber defenders to understand malicious attack behavior, such as membership recruitment patterns, the size of botnets, distribution of bots, especially the propagation mechanism of malicious attacks.

The diagram shown in Fig. 1.2 illustrates the process of real-world Trojan malware [54, 103, 105, 169]. Such a process consists of three stages:
  • In the first stage, the malware developer creates one or more fake profiles and infiltrates them into the social network. The purpose of these fake profiles is to make friends with as many real OSN users as possible. Infiltration has been shown to be an effective technique for disseminating malicious content in OSNs such as Facebook [22].

  • In the second stage, the malware developer uses social engineering techniques to create eye-catching web links that trick users into clicking on them. The web links, which are posted on the fake users’ walls, lead unsuspecting users to a web page that contains malicious content. A user simply needs to visit or “drive by” that web page, and the malicious code can be downloaded in the background and executed on the user’s computer without his/her knowledge. When security flaws are absent [198], malware creators resort to social engineering techniques to get assistance from users to activate the malicious code.

  • In the third stage, after a user is infected, the malware also posts the eye-catching web link(s) on the user’s wall to “ecruit” his/her friends. If a friend clicks on the link(s) and, as a result, unknowingly executes the malware, the friend’s computer and profile will become infected and the propagation cycle continues with his/her own friends.

../images/466717_1_En_1_Chapter/466717_1_En_1_Fig2_HTML.png
Fig. 1.2

Example of Trojan malware propagation in online social networks [55]

Note that, malicious attacks are similar to biological viruses on their self-replicating and propagation behaviors. Thus the mathematical techniques developed for the study of biological infectious diseases have been adapted to the study of malicious attack propagation. The basic epidemic model, Susceptible-Infected (SI) model, separates populations into two groups of nodes changing over time:
  • A susceptible node is a node that is vulnerable to malicious attack but otherwise “healthy”. We use S(t) to denote the number of susceptible nodes at time t.

  • An infected node is a node that became infected and may potentially infect other nodes. We use I(t) to denote the number of infected nodes at time t.

In the SI model, it was assumed that the population is large and steady with n nodes. If a node got infected, it does not become uninfected. Figure 1.3 presents the propagation progress of the Code Red worm [141]. The dataset on the Code Red worm was collected by Moore et al. during the whole day of July 19th [127]. The SI model well matches the propagation of Code Red worm [206].
../images/466717_1_En_1_Chapter/466717_1_En_1_Fig3_HTML.png
Fig. 1.3

Observed Code Red propagation (from www.​Caida.​org)

Many other models are derivations of this basic SI form. For example, the Susceptible-Infected-Recovered (SIR) models the propagation that node can get recovered or immune from the infectious state and if a node get recovered it will never again become susceptible. The Susceptible-Infected-Susceptible (SIS) models the propagation in which recovered node can become susceptible again.

1.4 Source Identification of Malicious Attack Propagation

From both practical and technical aspects, it is of great significance to identify propagation sources of malicious attacks. Practically, it is important to accurately identify the ‘culprit’ of the malicious attack for forensic purposes. Moreover, seeking the malicious attack as quickly as possible can find the causation of the attack, and therefore, mitigate the damages. Technically, the work in this field aims at identifying the sources of malicious attacks based on limited knowledge of network structures and the states of a small portion of nodes. In academia, traditional identification techniques, such as IP trace back [156] and stepping-stone detection [157], are not sufficient to seek the sources of malicious attacks, as they only determine the true source of packets received by a destination. In the propagation of malicious attacks, the source of packets is almost never the source of the malicious attack propagation but just one of the many propagation participants [191]. Methods are needed to find propagation sources higher up in the application level and logic structures of networks, rather than in the IP level and packets.

Identifying the source of a malicious attack is an extremely desirable but challenging task, because of the large-scale of networks and the limited knowledge of hosts’ infection statuses. In general, only a small fraction of hosts can be observed. Thus, the main difficulty is to develop tractable estimators that can be efficiently implemented, and that perform well on multiple topologies.

In the past few years, researchers have proposed a series of methods to identify the diffusion sources of malicious attacks. The first widely discussed research on this subject has been done by Shah and Zaman [161]. In social networks, Shah and Zaman introduced rumor centrality of a node as the number of distinct ways a rumor can spread in the network starting from that node. They showed that the node with maximum rumor centrality is the Maximum Likelihood Estimator of the rumor source if the underlying graph is a regular tree. They studied also the detection performance for irregular geometric trees, small-word networks and scale-free networks. This method assumes that we know all the connections between nodes and additionally the infection states of all nodes. Researchers [84, 85, 146, 200] later relaxed some of these constraints since their algorithm requires information about state of not every node, but only about some fraction of nodes called observers.

Based on the types of observations on the attacked underlying network, the approaches of identifying malicious attack sources can be divided into three main categories: the complete observation based source detection, the snapshot based source detection, and the detector/sensor based source detection. The first category requires a complete observation of the attacked network after a certain time of the malicious propagation. The second category requires the snapshot (partial observation) of the attacked network at a certain time instance. The third category needs to monitor a small set of nodes but all the time in the attacked network. Regardless of the above division, researchers considered also different epidemic models [24, 202], spreading at weighted or time-varying graphs [11, 23, 84, 163] and multi-source detection problems [58, 68, 86].

1.5 Outline and Book Overview

In many ways, current approaches of analyzing the propagation of malicious attacks and identifying the source of malicious attacks are facing the following critical challenges.
  • Networks portray a multitude of interactions among network assets. Researchers have found that unsolicited malicious attacks spread extremely fast through influential spreaders [42]. Hence, analyzing the influence of network hosts and identifying the most efficient ‘spreaders’ in a network becomes an important step towards identifying the propagation source of malicious attacks and restraining the propagation. Indeed, the propagation of malicious attacks is in an epidemic way (one host can infect multiple hosts simultaneously). However, previous methods in measuring the influence of network hosts focus on the dissemination of information from one host to only one other host. In this book, we explore different influence measures and their application of capturing influential nodes in the epidemics.

  • Restraining the propagation of malicious attacks in a underlying network has long been an important but difficult problem. Currently, there are mainly two types of methods: (1) blocking malicious attacks at the most influential users or community bridges, and (2) spreading the corresponding countermeasures to restrain malicious attacks. Given a fixed budget in the real world, how can we manage different methods working together to optimize the restraint of the propagation of malicious attacks. In this book, we explore the strategy of different methods working together and their equivalence, so as to find out a better strategy to restrain the propagation of malicious attacks.

  • The underlying networks are often of time-varying topology. For example, in human contact networks, the neighborhood of individuals moving over a geographic space evolves over time, and the interaction between the individuals appears/disappears in online social network websites (such as Facebook and Twitter) [151]. Indeed, the spreading of malicious attacks is affected by duration, sequence, and concurrency of contacts among nodes [27, 170]. Then, can we model the way that malicious attack spreads in time-varying networks? Can we estimate the probability of an arbitrary node being infected by a malicious attack? How do we detect the propagation source of a malicious attack in time-varying networks? Can we estimate the infection scale and infection time of the malicious attack?

  • Malicious attacks often emerge from multiple sources. However, current methods mainly focus on the identification of a single attack source in networks. A few approaches are proposed for identifying multiple attack sources but they all suffer from extremely high computational complexity, which is not practical to be adopted in real-world networks. In this book, we will answer the following questions corresponding to multi-source identification. How many sources are there? Where did the diffusion emerge? When did the diffusion start?

  • Another critical challenge in this research area is the scalability issue. Current methods generally require scanning the whole underlying network of malicious attack spreading to locate attack sources. However, real-world networks of malicious attack diffusion are often of a huge scale and extremely complex structure. Thus, it is impractical to scan the whole network to locate the attack sources. We develop efficient approaches to identify attack sources by taking the structural features of networks and the diffusion patterns of malicious attacks into account, and therefore address the scalability issue.

To address the above challenges, this book aims to summarize the new technologies and achieve a breakthrough in source identification of malicious attacks to enable its effective applicability in real world applications. Based on the challenges, we divide the book into three main parts.
  • Part I: Malicious Attack Propagation
    1. 1.

      Primary knowledge of modeling malicious attack propagation.

       
    2. 2.

      Spreading influence analysis of network hosts in the propagation of malicious attacks.

       
    3. 3.

      Restraining the propagation of malicious attacks.

       
  • Part II: Source Identification of Malicious Attack Propagation
    1. 1.

      Source identification under complete observations: a maximum likelihood (ML) source estimator.

       
    2. 2.

      Source identification under snapshots: a sample path based source estimator.

       
    3. 3.

      Source identification under sensor observations: a Gaussian source estimator.

       
  • Part III: Critical Research Issues in Source Identification
    1. 1.

      Identifying propagation source in time-varying networks.

       
    2. 2.

      Identifying multiple propagation sources.

       
    3. 3.

      Identifying propagation source in large-scale networks.

       
The approaches involved in this book include complex network theory, information diffusion theory, probability theory, and applied statistics.