© Springer Nature Switzerland AG 2019
Jiaojiao Jiang, Sheng Wen, Bo Liu, Shui Yu, Yang Xiang and Wanlei Zhou, Malicious Attack Propagation and Source Identification, Advances in Information Security 73, https://doi.org/10.1007/978-3-030-02179-5_7

7. Source Identification Under Snapshots: A Sample Path Based Source Estimator

Jiaojiao Jiang1 , Sheng Wen1, Shui Yu3, Bo Liu2, Yang Xiang3 and Wanlei Zhou4
(1)
Swinburne University of Technology, Hawthorn, Melbourne, VIC, Australia
(2)
La Trobe University, Bundoora, VIC, Australia
(3)
University of Technology Sydney, Ultimo, NSW, Australia
(4)
Digital Research & Innovation Capability, Swinburne University of Technology, Hawthorn, Melbourne, VIC, Australia
 

In this chapter, we introduce a propagation source estimator under snapshot observations: a sample path based source estimator (Jordan Center). As discussed in Chap. 5, a snapshot provides partial knowledge of the network status at a given time t. Many approaches have been proposed to identify propagation sources under snapshot observations, including Jordan Center, Dynamic Message Passing, the effective distance based method, etc. Among these methods, Jordan Center is representative, and many variations and improvements have been built on it. Here, we present the details of the Jordan Center estimator. For the techniques involved in other methods under snapshot observations, readers can refer to Chap. 9 for details.

7.1 Introduction

Malicious attack propagation in networks refers to the process by which an attack spreads through a network, and it has been widely used to model many real-world phenomena, such as the spreading of rumors over online social networks and the spreading of computer viruses over the Internet. The problem of identifying the propagation source has attracted much attention. For example, Shah and Zaman proposed Rumor Center to identify rumor propagation sources [162], which was further analyzed in [160, 161]. They formulated the problem as a maximum likelihood estimation (MLE) problem and developed novel algorithms to detect the source. However, these methods assume a complete observation of the network, which is not available in many real-world scenarios. Zhu and Ying [200] considered the following problem: given a snapshot of the diffusion process at time t, can we tell which node is the source of the propagation?

Zhu and Ying [200] adopted the Susceptible-Infected-Recovered (SIR) model, a standard model of epidemics [12, 47]. The network is assumed to be an undirected graph, and each node in the network has three possible states: susceptible (S), infected (I), and recovered (R). Nodes in state S can be infected and change to state I, and nodes in state I can recover and change to state R. Recovered nodes cannot be infected again. Initially, all nodes are assumed to be in the susceptible state except one infected node. This infected node is the propagation source of the malicious attack. The source infects its neighbors, and the attack starts to spread in the network. Now, given a snapshot of the network, some nodes are observed to be infected, while the others are healthy (susceptible or recovered); susceptible and recovered nodes are assumed to be indistinguishable. Zhu and Ying [200] proposed a low-complexity algorithm, called the reverse infection algorithm, to find the sample path based estimator in the underlying network. In the algorithm, each infected node broadcasts its identity over the network, and the node that first collects the identities of all infected nodes declares itself the information source. They proved that the estimated source is the node with the minimum infection eccentricity. Since a node with the minimum eccentricity in a graph is called the Jordan center, they call the nodes with the minimum infection eccentricity the Jordan infection centers.

7.2 The SIR Model for Information Propagation

Consider an undirected graph G(V, E), where V is the set of nodes and E is the set of undirected edges. Each node v ∈ V has three possible states: susceptible (S), infected (I), and recovered (R). Zhu and Ying [200] assumed a time-slotted system. Nodes change their states at the beginning of each time slot, and the state of node v at time t is denoted by X_v(t). Initially, all nodes are in state S except the source node v*, which is in state I. At the beginning of each time slot, each infected node infects each of its susceptible neighbors with probability q, and each infected node recovers with probability p. Once a node recovers, it cannot be infected again. The infection process can then be modeled as a discrete-time Markov chain X(t), where X(t) = {X_v(t), v ∈ V} is the vector of the states of all nodes at time t. The initial state of this Markov chain is X_v(0) = S for v ≠ v* and 
$$X_{v^*}(0)=I$$
.
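As a concrete illustration, the discrete-time SIR dynamics above can be sketched in a few lines of Python. This is our own minimal sketch, not Zhu and Ying's code: the adjacency-list representation, function name, and parameter values are illustrative choices.

```python
import random

def simulate_sir(adj, source, q, p, t_max, seed=0):
    """Discrete-time SIR on an undirected graph given as an adjacency list.
    In each time slot, every currently infected node first tries to infect
    each susceptible neighbor with probability q, then recovers with
    probability p."""
    rng = random.Random(seed)
    state = {v: 'S' for v in adj}
    state[source] = 'I'
    for _ in range(t_max):
        # Snapshot the infected set so that nodes infected in this slot
        # do not act until the next slot.
        for v in [u for u in adj if state[u] == 'I']:
            for u in adj[v]:
                if state[u] == 'S' and rng.random() < q:
                    state[u] = 'I'
            if rng.random() < p:
                state[v] = 'R'
    return state

# A small path graph 0-1-2-3; with q = 1 and p = 0 the infection
# deterministically reaches every node within three slots.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(simulate_sir(adj, source=0, q=1.0, p=0.0, t_max=4))
# → {0: 'I', 1: 'I', 2: 'I', 3: 'I'}
```

Taking the infected set before updating states mirrors the model's slot-by-slot dynamics: a node infected in slot t starts infecting its neighbors only from slot t + 1 onward.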

Suppose at time t, we observe Y = {Y_v, v ∈ V} such that

$$\displaystyle \begin{aligned} Y_v= \left\{ \begin{array}{ll} 1,~\mbox{if}~ v~ \mbox{is in state}~I;\\ 0,~\mbox{if}~ v~ \mbox{is in state}~ S ~\mbox{or}~ R. \end{array} \right. \end{aligned} $$
(7.1)
The propagation source identification problem is to identify v* given the graph G and the observation Y, where t is an unknown parameter. Figure 7.1 gives an example of the attack propagation process. The left figure shows the information propagation over time: the nodes on each dotted line are the nodes infected at that time slot, and the arrows indicate where each infection comes from. The right figure is the network we observe, where the red nodes are infected and the others are susceptible or recovered. The pair of numbers next to each node is the corresponding infection time and recovery time. For example, node 3 was infected at time slot 2 and recovered at time slot 3; −1 indicates that the infection or recovery has not yet occurred. Note that these two pieces of information are generally not available.
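The observation model (7.1) simply collapses states S and R to 0. A minimal helper (the function name is ours) that extracts the snapshot Y from a full state vector:

```python
def snapshot(state):
    """Map a state vector to the observation Y of (7.1): 1 for infected
    nodes, 0 for susceptible or recovered nodes, which the observer
    cannot tell apart."""
    return {v: 1 if s == 'I' else 0 for v, s in state.items()}

Y = snapshot({1: 'I', 2: 'S', 3: 'R', 4: 'I'})
print(Y)  # → {1: 1, 2: 0, 3: 0, 4: 1}
```

Note that the snapshot discards both the infection/recovery times and the S-versus-R distinction, which is exactly why the source must be inferred rather than read off.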
Fig. 7.1

An example of information propagation

7.3 Maximum Likelihood Detection

Suppose X[0, t] = {X(τ) : 0 ≤ τ ≤ t} is a sample path of the infection process from 0 to t, and define the function F(⋅) such that

$$\displaystyle \begin{aligned} F(X_v(t)) = \left\{ \begin{array}{ll} 1,~\mbox{if}~ X_v(t) = I; \\ 0,~\mbox{otherwise}. \end{array} \right. \end{aligned} $$
(7.2)
Then, F(X(t)) = Y if F(X_v(t)) = Y_v for all v. Identifying the propagation source can be formulated as a maximum likelihood detection problem as follows:

$$\displaystyle \begin{aligned} \hat{v} \in \mbox{arg}~ \displaystyle \max_{v\in V} \displaystyle\sum_{\mathbf{X}[0,t]:\mathbf{F}(\mathbf{X}(t))=\mathbf{Y}} \mbox{Pr}(\mathbf{X}[0, t]|v^* = v), \end{aligned} $$
(7.3)
where Pr(X[0, t]|v* = v) is the probability of obtaining sample path X[0, t] given the source node v.

Note that the difficulty of solving problem (7.3) is the curse of dimensionality. For each v such that Y_v = 0, both its infection time and recovery time are required, i.e., O(t^2) possible choices; for each v such that Y_v = 1, only the infection time needs to be considered, i.e., O(t) possible choices. Therefore, even for a fixed t, the number of possible sample paths is at least of the order of t^N, where N is the number of nodes in the network. This curse of dimensionality makes the computation expensive. Zhu and Ying [200] introduced a sample path based approach to overcome this difficulty.

7.4 Sample Path Based Detection

Instead of computing the marginal probability, Zhu and Ying [200] proposed to identify the sample path X*[0, t*] that most likely leads to Y, i.e.,

$$\displaystyle \begin{aligned} {\mathbf{X}}^*[0, t^*] = \mbox{arg}~\displaystyle \max_{t,\mathbf{X}[0,t]\in \mathcal{X}(t)}Pr(\mathbf{X}[0,t]), \end{aligned} $$
(7.4)
where 
$$\mathcal {X}(t)=\{\mathbf {X}[0,t]|\mathbf {F}(\mathbf {X}(t))=\mathbf {Y}\}$$
. The source node associated with X*[0, t*] is viewed as the information source.
In graph theory [77], the eccentricity e(v) of a vertex v is the maximum distance between v and any other vertex in the graph, and the Jordan centers of a graph are the nodes with the minimum eccentricity. For example, in Fig. 7.2, the eccentricity of node v_1 is 4, and the Jordan center is v_2, whose eccentricity is 3. Following a similar terminology, Zhu and Ying [200] defined the infection eccentricity 
$$\tilde {e}(v)$$
given Y as the maximum distance between v and any infected node in the graph. The Jordan infection centers of a graph are then the nodes with the minimum infection eccentricity given Y. In Fig. 7.2, nodes v_3, v_10, v_13 and v_14 are observed to be infected. The infection eccentricities of v_1, v_2, v_3, v_4 are 2, 3, 4, 5, respectively, and the Jordan infection center is v_1.
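These definitions translate directly into code: one breadth-first search per node suffices on small graphs. The sketch below is ours; the five-node path it uses as an example is an illustrative graph, not the topology of Fig. 7.2.

```python
from collections import deque

def bfs_dist(adj, src):
    """Unweighted shortest-path distances from src via BFS."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist

def infection_eccentricity(adj, v, infected):
    """Maximum distance from v to any observed infected node."""
    d = bfs_dist(adj, v)
    return max(d[u] for u in infected)

def jordan_infection_centers(adj, infected):
    """Nodes minimizing the infection eccentricity given the snapshot."""
    ecc = {v: infection_eccentricity(adj, v, infected) for v in adj}
    best = min(ecc.values())
    return [v for v in adj if ecc[v] == best]

# Path 1-2-3-4-5 with infected nodes {1, 3}: node 2 is at distance 1
# from both infected nodes, so it is the unique Jordan infection center.
adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
print(jordan_infection_centers(adj, {1, 3}))  # → [2]
```

Running one BFS per node costs O(N(N + |E|)) overall, which is fine for illustration; the reverse infection algorithm discussed later avoids this centralized computation.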
Fig. 7.2

An example illustrating the infection eccentricity

Zhu and Ying [200] proved that the propagation source associated with the optimal sample path is a node with the minimum infection eccentricity. The proof of this result consists of three steps. First, assuming the information source is v_r, they analyzed 
$$t^*_{v_r}$$
such that

$$\displaystyle \begin{aligned} t^*_{v_r} = \mbox{arg}~ \displaystyle \max_{t,\mathbf{X}[0,t]} Pr(\mathbf{X}[0,t]|v^* = v_r), \end{aligned} $$
(7.5)
i.e., 
$$t^*_{v_r}$$
is the time duration of the optimal sample path in which v_r is the information source. They proved that 
$$t^*_{v_r}$$
equals the infection eccentricity of node v_r. In the second step, they considered two neighboring nodes, say v_1 and v_2, and proved that if 
$$\tilde {e}(v_1) < \tilde {e}(v_2)$$
, then the optimal sample path rooted at v_1 occurs with a higher probability than the optimal sample path rooted at v_2. In the third step, they proved that, given any two nodes u and v, if v has the minimum infection eccentricity and u has a larger infection eccentricity, then there exists a path from u to v along which the infection eccentricity monotonically decreases; this implies that the source of the optimal sample path must be a Jordan infection center. For example, in Fig. 7.2, node v_4 has a larger infection eccentricity than v_1, and v_4 → v_3 → v_2 → v_1 is a path along which the infection eccentricity monotonically decreases from 5 to 2. In the next subsection, we briefly explain the techniques involved in these three steps.

7.5 The Sample Path Based Estimator

Lemma 7.1

Consider a tree network rooted at v r and with infinitely many levels. Assume the information source is the root, and the observed infection topology is Y which contains at least one infected node. If 
$$\tilde {e}(v_r) \leq t_1 < t_2$$
, then the following inequality holds

$$\displaystyle \begin{aligned} \displaystyle \max_{\mathbf{X}[0,t_1]\in \mathcal{X}(t_1)} Pr(\mathbf{X}[0,t_1]) > \max_{\mathbf{X}[0,t_2]\in \mathcal{X}(t_2)} Pr(\mathbf{X}[0,t_2]), \end{aligned} $$
(7.6)
where 
$$\mathcal {X}(t)=\{\mathbf {X}[0,t]|\mathbf {F}(\mathbf {X}(t))=\mathbf {Y}\}$$
. In addition,

$$\displaystyle \begin{aligned} t^*_{v_r} = \tilde{e}(v_r) = \displaystyle \max_{u\in I} d(v_r, u), \end{aligned} $$
(7.7)

where d(v_r, u) is the length of the shortest path between v_r and u, also called the distance between v_r and u, and I is the set of infected nodes.

This lemma states that the optimal time equals the infection eccentricity. The next lemma states that the optimal sample path rooted at a node with a smaller infection eccentricity is more likely to occur.

Lemma 7.2

Consider a tree network with infinitely many levels. Assume the information source is the root, and the observed infection topology is Y, which contains at least one infected node. For u, v ∈ V such that (u, v) ∈ E, if 
$$t^*_u > t^*_v$$
, then

$$\displaystyle \begin{aligned} Pr({\mathbf{X}}^*_u([0, t^*_u])) < Pr({\mathbf{X}}^*_v([0, t^*_v])), \end{aligned} $$
(7.8)

where 
$${\mathbf {X}}^*_u([0, t^*_u])$$
is the optimal sample path starting from node u.

Proof

Denote by T v the tree rooted in v and 
$$T^{-v}_u$$
the tree rooted at u but without the branch from v. See 
$$T^{-v_1}_{v_9}$$
and 
$$T^{-v_2}_{v_7}$$
in Fig. 7.2. Furthermore, denote by 
$$\mathcal {C}(v)$$
the set of children of v. The sample path X[0, t] restricted to 
$$T^{-v}_u$$
is defined to be 
$$\mathbf {X}([0,t], T^{-v}_u)$$
.

The first step is to show 
$$t^*_u = t^*_v +1$$
. Note that 
$$T^{-u}_v \cap I\neq \emptyset $$
; otherwise, all infected nodes would be on 
$$T^{-v}_u$$
. As T is a tree, v can only reach nodes in 
$$T^{-v}_u$$
through the edge (u, v), so 
$$t^*_v = t^*_u + 1$$
, which contradicts 
$$t^*_u &gt; t^*_v$$
.

If 
$$T^{-v}_u\cap I\neq \emptyset $$
, then for any 
$$a\in T^{-v}_u\cap I$$
,

$$\displaystyle \begin{aligned} d(u,a)=d(v,a)-1 \leq t^*_v -1, \end{aligned} $$
(7.9)
and 
$$\forall b\in T^{-u}_v \cap I$$
,

$$\displaystyle \begin{aligned} d(u,b)=d(v,b)+1\leq t^*_v +1. \end{aligned} $$
(7.10)
Hence,

$$\displaystyle \begin{aligned} t^*_u \leq t^*_v + 1, \end{aligned} $$
(7.11)
which implies that

$$\displaystyle \begin{aligned} t^*_v < t^*_u \leq t^*_v + 1. \end{aligned} $$
(7.12)
Hence, we obtain 
$$t^*_u = t^*_v + 1$$
.
The second step is to prove that 
$$t^I_v = 1$$
on the sample path 
$${\mathbf {X}}^*_u([0, t^*_u])$$
. If 
$$t^I_v > 1$$
on 
$${\mathbf {X}}^*_u([0, t^*_u])$$
, then

$$\displaystyle \begin{aligned} t^*_u - t^I_v = t^*_v + 1 - t^I_v < t^*_v. \end{aligned} $$
(7.13)
According to the definition of 
$$t^*_u$$
and 
$$t^I_v$$
, within 
$$t^*_u-t^I_v$$
time slots, node v can infect all infected nodes on 
$$T^{-u}_v$$
. Since 
$$t^*_u = t^*_v + 1$$
, the infected node farthest from node u must be on 
$$T^{-u}_v$$
, which implies that there exists a node 
$$a\in T^{-u}_v$$
such that 
$$d(u,a) = t^*_u = t^*_v +1$$
and 
$$d(v,a) = t^*_v$$
. So node v cannot reach a within 
$$t^*_u - t^I_v$$
time slots, which contradicts the fact that the infection can spread from node v to a within 
$$t^*_u - t^I_v$$
time slots along the sample path 
$${\mathbf {X}}^*_u[0, t^*_u ]$$
. Therefore, 
$$t^I_v = 1$$
.
Now given sample path 
$${\mathbf {X}}^*_u([0, t^*_u])$$
, the third step is to construct 
$${\mathbf {X}}^*_v([0, t^*_v])$$
which occurs with a higher probability. The sample path 
$${\mathbf {X}}^*_u([0, t^*_u])$$
can be divided into two parts along subtrees 
$$T^{-v}_u$$
and 
$$T^{-u}_v$$
. Since 
$$t^I_v =1$$
, then

$$\displaystyle \begin{aligned} Pr({\mathbf{X}}^*_u([0, t^*_u])) = q\cdot Pr({\mathbf{X}}^*_u([0, t^*_u], T^{-u}_v)|t^I_v=1) \cdot Pr({\mathbf{X}}^*_u([0, t^*_u], T^{-v}_u)). \end{aligned} $$
(7.14)
Suppose in 
$${\mathbf {X}}^*_v([0, t^*_v])$$
, node u was infected at the first time slot, then

$$\displaystyle \begin{aligned} Pr({\mathbf{X}}^*_v([0, t^*_v])) = q\cdot Pr({\mathbf{X}}^*_v([0, t^*_v], T^{-u}_v)|t^I_v=1) \cdot Pr({\mathbf{X}}^*_v([0, t^*_v], T^{-v}_u)|t^I_u=1). \end{aligned} $$
(7.15)
For the subtree 
$$T^{-u}_v$$
, given 
$${\mathbf {X}}^*_u([0, t^*_u], T^{-u}_v)$$
, in which 
$$t^I_v=1$$
, the partial sample path 
$${\mathbf {X}}^*_v([0, t^*_v], T^{-u}_v)$$
can be constructed to be identical to 
$${\mathbf {X}}^*_u([0, t^*_u], T^{-u}_v)$$
except that all events occur one time slot earlier, i.e.,

$$\displaystyle \begin{aligned} {\mathbf{X}}^*_v([0, t^*_v], T^{-u}_v) = {\mathbf{X}}^*_u([0, t^*_u], T^{-u}_v). \end{aligned} $$
(7.16)
Then,

$$\displaystyle \begin{aligned} Pr({\mathbf{X}}^*_u([0, t^*_u], T^{-u}_v)|t^I_v=1) = Pr({\mathbf{X}}^*_v([0, t^*_v], T^{-u}_v)). \end{aligned} $$
(7.17)
For the subtree 
$$T^{-v}_u$$
, 
$${\mathbf {X}}^*_v([0, t^*_v], T^{-v}_u)$$
can be constructed such that

$$\displaystyle \begin{aligned} {\mathbf{X}}^*_v([0, t^*_v], T^{-v}_u)\in \mbox{arg}~\displaystyle \max_{\tilde{\mathbf{X}}([0,t^*_v], T^{-v}_u)\in\mathcal{X}(t^*_v, T^{-v}_u)}Pr(\tilde{\mathbf{X}}([0,t^*_v], T^{-v}_u)|t^I_u=1). \end{aligned} $$
(7.18)
According to Lemma 7.1, the following inequality holds:

$$\displaystyle \begin{aligned} \begin{array}{lll} \displaystyle \max_{\tilde{\mathbf{X}}([0,t^*_v], T^{-v}_u)\in\mathcal{X}(t^*_v, T^{-v}_u)}Pr(\tilde{\mathbf{X}}([0,t^*_v], T^{-v}_u)|t^I_u=1)\\ =\displaystyle \max_{\tilde{\mathbf{X}}([0,t^*_u-1], T^{-v}_u)\in\mathcal{X}(t^*_u-1, T^{-v}_u)}Pr(\tilde{\mathbf{X}}([0,t^*_u-1], T^{-v}_u)|t^I_u=1)\\ >\displaystyle \max_{\tilde{\mathbf{X}}([0,t^*_u], T^{-v}_u)\in\mathcal{X}(t^*_u, T^{-v}_u)}Pr(\tilde{\mathbf{X}}([0,t^*_u], T^{-v}_u)).\\ \end{array} \end{aligned} $$
(7.19)
Therefore, given the optimal sample path rooted at u, a sample path rooted at v can be constructed, which occurs with a higher probability. The lemma holds.

The following lemma gives a useful property of the Jordan infection centers.

Lemma 7.3

On a tree network with at least one infected node, there exist at most two Jordan infection centers. When the network has two Jordan infection centers, the two must be neighbors.
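Lemma 7.3 is easy to confirm numerically. The sketch below is our own check under an assumed random-tree generator and seed; it simply recomputes the infection eccentricity by BFS and inspects the set of minimizers.

```python
import random
from collections import deque

def bfs_dist(adj, src):
    """Unweighted shortest-path distances from src via BFS."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist

def random_tree(n, rng):
    """Random recursive tree: node v attaches to a uniform earlier node."""
    adj = {v: [] for v in range(n)}
    for v in range(1, n):
        u = rng.randrange(v)
        adj[v].append(u)
        adj[u].append(v)
    return adj

rng = random.Random(42)
for _ in range(500):
    n = rng.randint(2, 30)
    adj = random_tree(n, rng)
    infected = set(rng.sample(range(n), rng.randint(1, n)))
    ecc = {v: max(bfs_dist(adj, v)[u] for u in infected) for v in adj}
    best = min(ecc.values())
    centers = [v for v in adj if ecc[v] == best]
    assert len(centers) <= 2                   # at most two centers ...
    if len(centers) == 2:
        assert centers[1] in adj[centers[0]]   # ... and if two, adjacent
print("Lemma 7.3 holds on 500 random trees")
```

Note that the lemma is specific to trees; on graphs with cycles the set of minimizers can be larger and need not be connected.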

The following theorem states that the sample path based estimator is one of the Jordan infection centers.

Theorem 7.1

Consider a tree network with infinitely many levels. Assume that the observed infection topology Y contains at least one infected node. Then the source node associated with X*[0, t*] (the solution to the optimization problem (7.4)) is a Jordan infection center, i.e.,

$$\displaystyle \begin{aligned} \hat{v} = \mathit{\mbox{arg}}~\displaystyle \min_{v\in V}\tilde{e}(v). \end{aligned} $$
(7.20)

Proof

Assume the network has two Jordan infection centers: w and u, and assume 
$$\tilde {e}(w) = \tilde {e}(u) = \lambda $$
. Based on Lemma 7.3, w and u must be adjacent. The following steps show that, for any a ∈ V ∖{w, u}, there exists a path from a to u (or w) along which the infection eccentricity strictly decreases.

First, it is easy to see from Fig. 7.3 that 
$$d(\gamma , w) \leq \lambda -1 ~\forall \gamma \in T^{-u}_w \cap I$$
. We next show that there exists a node ξ for which the equality holds. Suppose, to the contrary, that d(γ, w) ≤ λ − 2 for every 
$$\gamma \in T^{-u}_w\cap I$$
, which implies

$$\displaystyle \begin{aligned} d(\gamma, u)\leq\lambda-1~\forall~\gamma\in T^{-u}_w\cap I. \end{aligned} $$
(7.21)
Since w and u are both Jordan infection centers, we have 
$$\forall ~\gamma \in T^{-w}_u\cap I$$
,

$$\displaystyle \begin{aligned} \begin{array}{cc} d(\gamma, w)\leq\lambda\\ d(\gamma, u)\leq\lambda -1.\\ \end{array} \end{aligned} $$
(7.22)
In summary, ∀ γ ∈ I,

$$\displaystyle \begin{aligned} d(\gamma, u)\leq\lambda -1. \end{aligned} $$
(7.23)
This contradicts the fact that 
$$\tilde {e}(w)=\tilde {e}(u)=\lambda $$
. Therefore, there exists 
$$\xi \in T^{-u}_w\cap I$$
such that

$$\displaystyle \begin{aligned} d(\xi, w) = \lambda -1.\end{aligned} $$
(7.24)
Fig. 7.3

A pictorial description of the positions of nodes a, u, w and ξ

Similarly, 
$$\forall ~\gamma \in T^{-w}_u\cap I$$

$$\displaystyle \begin{aligned} d(\gamma, u) \leq \lambda -1.\end{aligned} $$
(7.25)
and there exists a node for which the equality holds.
If a ∈ V ∖{w, u}, 
$$a\in T^{-w}_u$$
and d(a, u) = β, then for any 
$$\gamma \in T^{-u}_w\cap I$$
, we have

$$\displaystyle \begin{aligned} \begin{array}{lll} d(a, \gamma) & = d(a, u) + d(u, w) + d(w, \gamma)\\ & \leq \beta + 1 + \lambda - 1\\ & = \lambda + \beta, \end{array} \end{aligned} $$
(7.26)
and there exists 
$$\xi \in T^{-u}_w\cap I$$
such that the equality holds. On the other hand, 
$$\forall ~\gamma \in T^{-w}_u\cap I$$

$$\displaystyle \begin{aligned} \begin{array}{ll} d(a, \gamma) & \leq d(a, u) + d(u, \gamma)\\ & \leq \beta + \lambda - 1. \end{array} \end{aligned} $$
(7.27)
Therefore,

$$\displaystyle \begin{aligned} \tilde{e}(a) = \lambda + \beta, \end{aligned} $$
(7.28)
Since β = d(a, u), the infection eccentricity strictly decreases along the path from a to u.

Repeatedly applying Lemma 7.2 along the path from node a to u shows that the optimal sample path rooted at node u is more likely to occur than the optimal sample path rooted at node a. Therefore, the root node associated with the optimal sample path X*[0, t*] must be a Jordan infection center. The theorem holds.

7.6 Reverse Infection Algorithm

Zhu and Ying [200] further proposed a reverse infection algorithm to identify the Jordan infection centers. The key idea of the algorithm is to let every infected node broadcast a message containing its identity (ID) to its neighbors. Each node, after receiving a message from a neighbor, checks whether the ID in the message has been received before. If not, the node records the ID (say v) and the time at which the message is received (say t_v), and then broadcasts the ID to its own neighbors. When a node has received the IDs of all infected nodes, it claims itself as the propagation source and the algorithm terminates. If multiple nodes receive all IDs at the same time, the tie is broken by selecting the node with the smallest 
$$\sum t_v$$
. The details of the algorithm are presented in Algorithm 7.1.
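Because the broadcast from an infected node u reaches node v exactly d(v, u) time slots later, the distributed Algorithm 7.1 is equivalent to picking the node with the minimum infection eccentricity, breaking ties by the smallest sum of arrival times. A sketch of this equivalent centralized form (the function names are ours, not from [200]):

```python
from collections import deque

def bfs_dist(adj, src):
    """Unweighted shortest-path distances from src via BFS."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist

def reverse_infection(adj, infected):
    """The ID broadcast by infected node u arrives at node v at time
    d(v, u). The first node to collect every ID minimizes the maximum
    arrival time (its infection eccentricity); ties are broken by the
    smallest sum of arrival times."""
    arrivals = {v: [] for v in adj}
    for u in infected:
        d = bfs_dist(adj, u)
        for v in adj:
            arrivals[v].append(d[v])
    return min(adj, key=lambda v: (max(arrivals[v]), sum(arrivals[v])))

# Star graph: center 0 with leaves 1-4; infected leaves {1, 2}.
# The center collects both IDs after one slot, so it is the estimate.
adj = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
print(reverse_infection(adj, {1, 2}))  # → 0
```

In the distributed version each infected node floods its ID, so the work is spread across the network; the centralized sketch above trades that for one BFS per infected node.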

Simulations on g-regular trees were conducted. The infection probability q was chosen uniformly from (0, 1), and the recovery probability p was chosen uniformly from (0, q). The duration t of the infection process was chosen uniformly from [3, 20]. The experimental results show that the detection rates of both the reverse infection and closeness centrality algorithms increase as the degree increases and are higher than 60% when g > 6.

Algorithm 7.1: Reverse infection algorithm
