Malware Propagation in Large-Scale Networks

Malware Propagation in Large-Scale NetworksAbstract—Malware is pervasive in networks, and poses a critical threat to network security. However, we have very limitedunderstanding of malware behavior in networks to date. In this paper, we investigate how malware propagates in networks from aglobal perspective. We formulate the problem, and establish a rigorous two layer epidemic model for malware propagation fromnetwork to network. Based on the proposed model, our analysis indicates that the distribution of a given malware follows exponentialdistribution, power law distribution with a short exponential tail, and power law distribution at its early, late and final stages, respectively.Extensive experiments have been performed through two real-world global scale malware data sets, and the results confirm ourtheoretical findings.Index Terms—Malware, propagation, modelling, power lawÇ1 INTRODUCTIONMALWARE are malicious software programs deployedby cyber attackers to compromise computer systemsby exploiting their security vulnerabilities. Motivated byextraordinary financial or political rewards, malware ownersare exhausting their energy to compromise as many networkedcomputers as they can in order to achieve theirmalicious goals. A compromised computer is called a bot,and all bots compromised by a malware form a botnet. Botnetshave become the attack engine of cyber attackers, andthey pose critical challenges to cyber defenders. In order tofight against cyber criminals, it is important for defenders tounderstand malware behavior, such as propagation ormembership recruitment patterns, the size of botnets, anddistribution of bots.To date, we do not have a solid understanding about thesize and distribution of malware or botnets. Researchershave employed various methods to measure the size of botnets,such as botnet infiltration [1], DNS redirection [3],external information [2]. These efforts indicate that the sizeof botnets varies from millions to a few thousand. There areno dominant principles to explain these variations. As aresult, researchers desperately desire effective models andexplanations for the chaos. Dagon et al. [3] revealed thattime zone has an obvious impact on the number of availablebots. Mieghem et al. [4] indicated that network topology hasan important impact on malware spreading through theirrigorous mathematical analysis. Recently, the emergence ofmobile malware, such as Cabir [5], Ikee [6], and Brador [7],further increases the difficulty level of our understandingon how they propagate. More details about mobile malwarecan be found at a recent survey paper [8]. To the best of ourknowledge, the best finding about malware distribution inlarge-scale networks comes from Chen and Ji [9]: the distributionis non-uniform. All this indicates that the research inthis field is in its early stage.The epidemic theory plays a leading role in malwarepropagation modelling. The current models for malwarespread fall in two categories: the epidemiology model andthe control theoretic model. The control system theorybased models try to detect and contain the spread of malware[10], [11]. The epidemiology models are more focusedon the number of compromised hosts and their distributions,and they have been explored extensively in the computerscience community [12], [13], [14]. Zou et al. [15] useda susceptible-infected (SI) model to predict the growth ofInternet worms at the early stage. Gao and Liu [16] recentlyemployed a susceptible-infected-recovered (SIR) model todescribe mobile virus propagation. One critical conditionfor the epidemic models is a large vulnerable populationbecause their principle is based on differential equations.More details of epidemic modelling can be find in [17]. Aspointed by Willinger et al. [18], the findings, which weextract from a set of observed data, usually reflect parts ofthe studied objects. It is more reliable to extract theoreticalresults from appropriate models with confirmation fromsufficient real world data set experiments. We practice thisprinciple in this study.In this paper, we study the distribution of malware interms of networks (e.g., autonomous systems (AS), ISPdomains, abstract networks of smartphones who share thesame vulnerabilities) at large scales. In this kind of setting,we have a sufficient volume of data at a large enough scaleto meet the requirements of the SI model. Different from the_ S. Yu is with the School of Information Technology, Deakin University,Burwood, Victoria 3125, Australia. E-mail: syu@deakin.edu.au._ G. Gu is with the Department of Computer Science and Engineering,Texas A&M University, College Station, TX 77843-3112.E-mail: guofei@cse.tamu.edu._ A. Barnawi is with the Faculty of Computing and IT, King AbdulazizUniversity, Jeddah, Saudi Arabia. E-mail: ambarnawi@kau.edu.sa._ S. Guo is with the School of Computer Science and Engineering, The Universityof Aizu, Aizuwakamatsu, Japan. E-mail: sguo@u-aizu.ac.jp._ I. Stojmenovic is with the School of Information Technology, DeakinUniversity, Australia; King Abdulaziz University, Jeddah, Saudi Arabia;and the School of EECS, University of Ottawa, Ottawa, ON K1N 6N5,Canada. E-mail: ivan@site.uottawa.ca.Manuscript received 1 Jan. 2014; revised 14 Apr. 2014; accepted 15 Apr. 2014.Date of publication 28 Apr. 2014; date of current version 1 Dec. 2014.Recommended for acceptance by F. Bonchi.For information on obtaining reprints of this article, please send e-mail to:reprints@ieee.org, and reference the Digital Object Identifier below.Digital Object Identifier no. 10.1109/TKDE.2014.2320725170 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 1, JANUARY 20151041-4347 _ 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.traditional epidemic models, we break our model into twolayers. First of all, for a given time since the breakout of amalware, we calculate how many networks have been compromisedbased on the SI model. Second, for a compromisednetwork, we calculate how many hosts have beencompromised since the time that the network was compromised.With this two layer model in place, we can determinethe total number of compromised hosts and theirdistribution in terms of networks. Through our rigorousanalysis, we find that the distribution of a given malwarefollows an exponential distribution at its early stage, andobeys a power law distribution with a short exponential tailat its late stage, and finally converges to a power law distribution.We examine our theoretical findings through twolarge-scale real-world data sets: the Android based malware[19] and the Conficker [20]. The experimental resultsstrongly support our theoretical claims. To the best of ourknowledge, the proposed two layer epidemic model andthe findings are the first work in the field.Our contributions are summarized as follows._ We propose a two layer malware propagation modelto describe the development of a given malware atthe Internet level. Compared with the existing singlelayer epidemic models, the proposed model representsmalware propagation better in large-scalenetworks._ We find the malware distribution in terms of networksvaries from exponential to power law witha short exponential tail, and to power law distributionat its early, late, and final stage, respectively.These findings are first theoretically provedbased on the proposed model, and then confirmedby the experiments through the two large-scalereal-world data sets.The rest of the paper is structured as follows. Relatedwork is briefly listed in Section 2. We present the preliminariesfor the proposed model in Section 3. The studiedproblem is discussed in Section 4. A two layer malwarepropagation model is established in Section 5, and followedby a rigorous mathematical analysis in Section 6. Experimentsare conducted to confirm our findings in Section 7. InSection 8, we provide a further discussion about the study.Finally, we summarize the paper and present future workin Section 9.2 RELATED WORKThe basic story of malware is as follows. A malware programerwrites a program, called bot or agent, and theninstalls the bots at compromised computers on the Internetusing various network virus-like techniques. All ofhis bots form a botnet, which is controlled by its ownersto commit illegal tasks, such as launching DDoS attacks,sending spam emails, performing phishing activities, andcollecting sensitive information. There is a command andcontrol (C&C) server(s) to communicate with the bots andcollect data from bots. In order to disguise himself fromlegal forces, the botmaster changes the url of his C&C frequently,e.g., weekly. An excellent explanation about thiscan be found in [1].With the significant growing of smartphones, we havewitnessed an increasing number of mobile malware. Malwarewriters have develop many mobile malware in recentyears. Cabir [5] was developed in 2004, and was the firstmalware targeting on the Symbian operating system formobile devices. Moreover, it was also the first malwarepropagating via Bluetooth. Ikee [6] was the first mobile malwareagainst Apple iPhones, while Brador [7] was developedagainst Windows CE operating systems. The attackvictors for mobile malware are diverse, such as SMS, MMS,Bluetooth, WiFi, and Web browsing. Peng et al. [8] presentedthe short history of mobile malware since 2004, andsurveyed their propagation models.A direct method to count the number of bots is to use botnetinfiltration to count the bot IDs or IP addresses. Stone-Gross et al. [1] registered the URL of the Torpig botnetbefore the botmaster, and therefore were able to hijack theC&C server for ten days, and collect about 70G data fromthe bots of the Torpig botnet. They reported that the footprintof the Torpig botnet was 182,800, and the median andaverage size of the Torpig’s live population was 49,272 and48,532, respectively. They found 49,294 new infections duringthe ten days takeover. Their research also indicated thatthe live population fluctuates periodically as users switchbetween being online and offline. This issue was also tackedby Dagon et al. in [3].Another method is to use DNS redirection. Dagon et al.[3] analyzed captured bots by honypot, and then identifiedthe C&C server using source code reverse engineeringtools. They then manipulated the DNS entry which isrelated to a botnet’s IRC server, and redirected the DNSrequests to a local sinkhole. They therefore could countthe number of bots in the botnet. As discussed previously,their method counts the footprint of the botnet, whichwas 350,000 in their report.In this paper, we use two large scale malware data setsfor our experiments. Conficker is a well-known and one ofthe most recently widespread malware. Shin et al. [20] collecteda data set about 25 million Conficker victims from allover the world at different levels. At the same time, malwaretargeting on Android based mobile systems are developingquickly in recent years. Zhou and Jiang [19] collecteda large data set of Android based malware.In [2], Rajab et al. pointed out that it is inaccurate tocount the unique IP addresses of bots because DHCP andNAT techniques are employed extensively on the Internet([1] confirms this by their observation that 78.9 percent ofthe infected machines were behind a NAT, VPN, proxy,or firewall). They therefore proposed to examine the hitsof DNS caches to find the lower bound of the size of agiven botnet.Rajab et al. [21] reported that botnets can be categorizedinto two major genres in terms of membership recruitment:worm-like botnets and variable scanning botnets. The latterweights about 82 percent in the 192 IRC bots that they investigated,and is the more prevalent class seen currently. Suchbotnets usually perform localized and non-uniform scanning,and are difficult to track due to their intermittent andcontinuously changing behavior. The statistics on the lifetimeof bots are also reported as 25 minutes on average with90 percent of them staying for less than 50 minutes.YU ET AL.: MALWARE PROPAGATION IN LARGE-SCALE NETWORKS 171Malware propagation modelling has been extensivelyexplored. Based on epidemiology research, Zou et al. [15]proposed a number of models for malware monitoring atthe early stage. They pointed out that these kinds of modelare appropriate for a system that consists of a large numberof vulnerable hosts; in other words, the model is effective atthe early stage of the outbreak of malware, and the accuracyof the model drops when the malware develops further. Asa variant of the epidemic category, Sellke et al. [12] proposeda stochastic branching process model for characterizingthe propagation of Internet worms, which especiallyfocuses on the number of compromised computers againstthe number of worm scans, and presented a closed formexpression for the relationship. Dagon et al. [3] extendedthe model of [15] by introducing time zone information aðtÞ,and built a model to describe the impact on the number oflive members of botnets with diurnal effect.The impact of side information on the spreading behaviorof network viruses has also been explored. Ganesh et al.[22] thoroughly investigated the effect of network topologyon the spead of epidemics. By combining Graph theory anda SIS (susceptible—infective—susceptible) model, theyfound that if the ratio of cure to infection rates is smallerthan the spectral radius of the graph of the studied network,then the average epidemic lifetime is of order log n, where nis the number of nodes. On the other hand, if the ratio islarger than a generalization of the isoperimetric constant ofthe graph, then the average epidemic lifetime is of order ena,where a is a positive constant. Similarly, Mieghem et al. [4]applied the N-intertwined Markov chain model, an applicationof mean field theory, to analyze the spread of viruses innetworks. They found that tc ¼ 1_maxðAÞ, where tc is the sharpepidemic threshold, and _maxðAÞ is the largest eigenvalue ofthe adjacency matrix A of the studied network. Moreover,there have been many other methodologies to tackle theproblem, such as game theory [23].3 PRELIMINARIESPreliminaries of epidemic modelling and complex networksare presented in this section as this work is mainly based onthe two fields.For the sake of convenience, we summarize the symbolsthat we use in this paper in Table 1.3.1 Deterministic Epidemic ModelsAfter nearly 100 years development, the epidemic models[17] have proved effective and appropriate for a system thatpossesses a large number of vulnerable hosts. In otherwords, they are suitable at a macro level. Zou et al. [15]demonstrated that they were suitable for the studies ofInternet based virus propagation at the early stage.We note that there are many factors that impact the malwarepropagation or botnet membership recruitment, suchas network topology, recruitment frequency, and connectionstatus of vulnerable hosts. All these factors contribute to thespeed of malware propagation. Fortunately, we can includeall these factors into one parameter as infection rate b inepidemic theory. Therefore, in our study, let N be the totalnumber of vulnerable hosts of a large-scale network (e.g., theInternet) for a given malware. There are two statuses for anyone of the N hosts, either infected or susceptible. Let IðtÞ bethe number of infected hosts at time t, then we havedIðtÞdt ¼ bðtÞ½N _ RðtÞ _ IðtÞ _ QðtÞ_IðtÞ _dRðtÞdt; (1)where RðtÞ, and QðtÞ represent the number of removedhosts from the infected population, and the number ofremoved hosts from the susceptible population at time t.The variable bðtÞ is the infection rate at time t.For our study, model (1) is too detailed and not necessaryas we expect to know the propagation and distribution of agiven malware. As a result, we employ the following susceptible-infected model:dIðtÞdt ¼ bIðtÞ½N _ IðtÞ_; (2)where the infection rate b is a constant for a given malwarefor any network.We note that the variable t is continuous in model (2) and(1). In practice, we measure IðtÞ at discrete time points.Therefore, t ¼ 0; 1; 2; . . . . We can interpret each time pointas a new round of malware membership recruitment, suchas vulnerable host scanning. As a result, we can transformmodel (2) into the discrete form as follows:IðtÞ ¼ ð1 þ aDÞIðt _ 1Þ _ bDIðt _ 1Þ2; (3)where t ¼ 0; 1; 2; . . . ; D is the unit of time, Ið0Þ is the initialnumber of infected hosts (we also call them seeds in thispaper), and a ¼ bN, which represents the average numberof vulnerable hosts that can be infected by one infected hostper time unit.In order to simplify our analysis, let D ¼ 1, it could beone second, one minute, one day, or one month, even oneyear, depending on the time scale in a given context. Hence,we have a simpler discrete form given byIðtÞ ¼ ð1 þ aÞIðt _ 1Þ _ bðIðt _ 1ÞÞ2: (4)Based on Equation (4), we define the increase of infectedhosts for each time unit as follows.DIðtÞ , IðtÞ _ Iðt _ 1Þ; t ¼ 1; 2; . . . : (5)To date, many researches are confined to the “earlystage” of an epidemic, such as [15]. Under the early stagecondition, IðtÞ << N, therefore, N _ IðtÞ _ N. As a result,a closed form solution is obtained as follows:IðtÞ ¼ Ið0ÞebNt: (6)TABLE 1Notations of Symbols in This Paper172 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 1, JANUARY 2015Whenwe take the ln operation on both sides of Equation (6),we haveln IðtÞ ¼ bNt þ ln Ið0Þ: (7)For a given vulnerable network, b, N and Ið0Þ are constants,therefore, the graphical representation of Equation (7)is a straight line.Based on the definition of Equation (5), we obtain theincrease of new members of a malware at the early stage asDIðtÞ ¼ ðebN _ 1ÞIðt _ 1Þ¼ ðebN _ 1ÞIð0ÞebNðt_1Þ: (8)Taking the ln operation on both side of (8), we haveln DIðtÞ ¼ bNðt _ 1Þ þ ln ððebN _ 1ÞIð0ÞÞ: (9)Similar to Equation (7), the graphical representation ofequation (9) is also a straight line. In other words, the numberof recruited members for each round follows an exponentialdistribution at the early stage.We have to note that it is hard for us to know whetheran epidemic is at its early stage or not in practice. Moreover,there is no mathematical definition about the termearly stage.In epidemic models, the infection rate b has a criticalimpact on the membership recruitment progress, and b isusually a small positive number, such as 0.00084 for wormCode Red [12]. For example, for a network with N ¼ 10;000vulnerable hosts, we show the recruited members underdifferent infection rates in Fig. 1. From this diagram, we cansee that the recruitment goes slowly when b ¼ 0:0001, however,all vulnerable hosts have been compromised in lessthan 7 time units when b ¼ 0:0003, and the recruitment progressesin an exponential fashion.This reflects the malware propagation styles in practice.For malware based on “contact”, such as blue tooth contacts,or viruses depending on emails to propagate, theinfection rate is usually small, and it takes a long time tocompromise a large number of vulnerable hosts in a givennetwork. On the other hand, for some malware, which takeactive actions for recruitment, such as vulnerable host scanning,it may take one or a few rounds of scanning to recruitall or a majority of the vulnerable hosts in a given network.We will apply this in the following analysis and performanceevaluation.3.2 Complex NetworksResearch on complex networks have demonstrated that thenumber of hosts of networks follows the power law. Peoplefound that the size distribution usually follows the powerlaw, such as population in cities in a country or personalincome in a nation [24]. In terms of the Internet, researchershave also discovered many power law phenomenon, suchas the size distribution of web files [25]. Recent progressesreported in [26] further demonstrated that the size of networksfollows the power law.The power law has two expression forms: the Pareto distributionand the Zipf distribution. For the same objects ofthe power law, we can use any one of them to represent it.However, the Zipf distributions are tidier than the expressionof the Pareto distributions. In this paper, we will useZipf distributions to represent the power law. The Zipfexpression is as follows:Prfx ¼ ig ¼Cia ; (10)where C is a constant, a is a positive parameter, calledthe Zipf index, Prfx ¼ ig represents the probability of theith ði ¼ 1; 2; . . .P Þ largest object in terms of size, andi Prfx ¼ ig ¼ 1.A more general form of the distribution is called theZipf-Mandelbrot distribution [27], which is defined asfollows:Prfx ¼ ig ¼Cði þ qÞa ; (11)where the additional constant q ðq _ 0Þ is called the plateaufactor, which makes the probability of the highest rankedobjects flat. The Zipf-Mandelbrot distribution becomes theZipf distribution when q ¼ 0.Currently, the metric to say a distribution is a powerlaw is to take the loglog plot of the data, and we usuallysay it is a power law if the result shows a straight line.We have to note that this is not a rigorous method, however,it is widely applied in practice. Power law distributionsenjoy one important property, scale free. We referinterested readers to [28] about the power law and itsproperties.4 PROBLEM DESCRIPTIONIn this section, we describe the malware propagation problemin general.As shown in Fig. 2, we study the malware propagationissue at two levels, the Internet level and the network level.We note that at the network level, a network could bedefined in many different ways, it could be an ISP domain,a country network, the group of a specific mobile devices,and so on. At the Internet level, we treat every network ofthe network level as one element.Fig. 1. The impact from infection rate b on the recruitment progress for agiven vulnerable network with N ¼ 10,000.YU ET AL.: MALWARE PROPAGATION IN LARGE-SCALE NETWORKS 173At the Internet level, we suppose, there are M networks,each network is denoted as Lið1 _ i _ MÞ. For anynetwork Li, we suppose it physically possesses Ni hosts.Moreover, we suppose the possibility of vulnerable hostsof Li is denoted as pið0 _ pi _ 1Þ. In general, it is highlypossible that Ni 6¼ Nj, and pi 6¼ pj for i 6¼ j; 1 _ i; j _ M.Moreover, due to differences in network topology, operatingsystem, security investment and so on, the infectionrates are different from network to network. We denote itas bi for Li. Similarly, it is highly possible that bi 6¼ bj fori 6¼ j; 1 _ i; j _ M.For any given network Li with pi _ Ni vulnerable hostsand infection rate bi. We suppose the malware propagationstarts at time 0. Based on Equation (4), we obtain the numberof infected hosts, IiðtÞ, of Li at time t as follows:IiðtÞ ¼ ð1 þ aiÞIiðt _ 1Þ _ biðIiðt _ 1ÞÞ2¼ ð1 þ bipiNiÞIiðt _ 1Þ _ biðIiðt _ 1ÞÞ2:(12)In this paper, we are interested in a global sense of malwarepropagation. We study the following question.For a given time t since the outbreak of a malware, whatare the characteristics of the number of compromised hostsfor each network in the view of the whole Internet. In otherwords, to find a function F about IiðtÞð1 _ i _ MÞ. Namely,the pattern ofFðI1ðtÞ; I2ðtÞ; . . . ; IMðtÞÞ: (13)For simplicity of presentation, we use SðLi; tÞ to replaceIiðtÞ at the network level, and IðtÞ is dedicated for the Internetlevel. Following Equation (13), for any networkLið1 _ i _ MÞ, we haveSðLi; tÞ ¼ ð1 þ bipiNiÞSðLi; t _ 1Þ _ biðSðLi; t _ 1ÞÞ2: (14)At the Internet level, we suppose there are k1; k2; . . . ; ktnetworks that have been compromised at each round foreach time unit from 1 to t. Any kið1 _ i _ tÞ is decided byEquation (4) as follows:ki ¼ ð1 þ bnMÞIði _ 1Þ _ bnðIði _ 1ÞÞ2; (15)where M is the total number of networks over the Internet,and bn is the infection rate among networks. Moreover,suppose the number of seeds, k0, is known.At this time point t, the landscape of the compromisedhosts in terms of networks is as follows.S_L1k1; t_; S_L2k1; t_; . . . ; S_Lk1k1; t_|fflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl{zfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl}k1S_L1k2; t _ 1_; S_L2k2; t _ 1_; . . . ; S_Lk2k2; t _ 1_|fflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl{zfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflk2. . .S_L1kt; 1_; S_L2kt; 1_; . . . ; S_Lktkt; 1_|fflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl{zfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl}kt;(16)where Ljkirepresents the jth network that was compromisedat round i. In other words, there are k1 compromised networks,and each of them have progressed t time units; k2compromised networks, and each of them has progressedt _ 1 time units; and kt compromised networks, and each ofthem have progressed 1 time unit.It is natural to have the total number of compromisedhosts at the Internet level asIðtÞ ¼ S_L1k1; t_þ S_L2k1; t_þ_ _ _þS_Lk1k1; t_|fflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl{zfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl}k1þ S_L1k2; t _ 1_þ_ _ _þS_Lk2k2; t _ 1_|fflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl{zfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl}k2þ_ _ _þ S_L1kt; 1_þ S_L2kt; 1_þ_ _ _þS_Lktkt; 1_|fflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl{zfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl}kt(17)Suppose kiði ¼ 1; 2; . . .Þ follows one distribution with aprobability distribution of pn (n stands for number), andthe size of a compromised network, SðLi; tÞ, followsanother probability distribution of ps (s stands for size).Let pI be the probability distribution of IðtÞðt ¼ 0; 1; . . .Þ.Based on Equation (18), we find pI is exactly the convolutionof pn and ps.pI ¼ pn        ps; (18)where      is the convolution operation.Our goal is to find a pattern of pI of Equation (18).5 MALWARE PROPAGATION MODELLINGAs shown in Fig. 2, we abstract theM networks of the Internetinto M basic elements in our model. As a result, anytwo large networks, Li and Lj (i 6¼ j), are similar to eachother at this level. Therefore, we can model the studiedproblem as a homogeneous system. Namely, all the M networksshare the same vulnerability probability (denoted asp), and the same infection rate (denoted as b). A simpleway to obtain these two parameters is to use the means:p ¼1MXMi¼1pib ¼1MXMi¼1bi:8>>>><>>>>:(19)Fig. 2. The system architecture of the studied malware propagation.174 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 1, JANUARY 2015For any network Li, let Ni be the total number of vulnerablehosts, then we haveNi ¼ p _ Ni; i ¼ 1; 2; . . .;M; (20)where Ni is the total number of computers of network Li.As discussed in Section 3, we know that Niði ¼ 1; 2; . . . ;MÞ follows the power law. As p is a constant in Equation(20), then Niði ¼ 1; 2; . . .;MÞ follows the power law as well.Without loss of generality, let Li represent the ith networkin terms of total vulnerable hosts (Ni). Based on the Zipf distribution,if we randomly choose a network X, the probabilitythat it is network Lj isPrfX ¼ Ljg ¼ pzðjÞ ¼N P jMi¼1 Ni ¼Cja : (21)Equation (21) shows clearly that a network with a largernumber of vulnerable hosts has a higher probability to becompromised.Following Equation (18), at time t, we have k1 þ k2 þ_ _ _þkt networks that have been compromised. Combiningwith Equation (21), in general, we know the first round ofrecruitment takes the largest k1 networks, and the secondround takes the k2 largest networks among the remainingnetworks, and so on. We therefore can simplify Equation(18) asIðtÞ ¼Xk1j¼1SðNj; tÞpzðjÞþXk2j¼1SðNk1þj; t _ 1Þpzðk1 þ jÞþ . . .þXktj¼1SðNk1þ___þkt_1þj; 1Þ_ pzðk1 þ_ _ _þkt_1 þ jÞ: (22)From Equation (22), we know the total number of compromisedhosts and their distribution in terms of networksfor a given time point t.6 ANALYSIS ON THE PROPOSED MALWAREPROPAGATION MODELIn this section, we try to extract the pattern of IðtÞ in termsof SðLi; t0 Þ, or pI of Equation (18).We make the following definitions before we progress forthe analysis.1) Early stage. An early stage of the breakout of a malwaremeans only a small percentage of vulnerablehosts have been compromised, and the propagationfollows exponential distributions.2) Final stage. The final stage of the propagation of amalware means that all vulnerable hosts of a givennetwork have been compromised.3) Late stage. A late stage means the time intervalbetween the early stage and the final stage.We note that many researches are focused on the earlystage, and we define the early stage to meet the pervasivelyaccepted condition, we coin the other two terms for theconvenience of our following discussion. Moreover, we setvariable Te as the time point that a malware’s progresstransfers from its early stage to late stage. In terms of mathematicalexpressions, we express the early, late and finalstage as 0 _ t < Te, Te _ t < 1, and t¼1, respectively.Due to the complexity of Equation (22), it is difficult toobtain conclusions in a dynamic style. However, we areable to extract some conclusions under some specialconditions.Lemma 1. If distributions pðxÞ and qðxÞ follow exponential distributions,then pðxÞqðxÞ follows an exponential distributionas well.Due to the space limitation, we skip the proof and referinterested readers to [29].At the early stage of a malware breakout, we have advantagesto obtain a clear conclusion.Theorem 1. For large scale networks, such as the Internet, at theearly stage of a malware propagation, the malware distributionin terms of networks follows exponential distributions.Proof. At a time point of the early stage (0 _ t < Te) of amalware breakout, following Equation (6), we obtain thenumber of compromised networks asIðtÞ ¼ Ið0ÞebnMt: (23)It is clear that IðtÞ follows an exponential distribution.For any of the compromised networks, we suppose ithas progressed t0 ð0 < t0 _ t < Te Þ time units, and itssize isSðLi; t0Þ ¼ Iið0ÞebNit0: (24)Based on Equation (24), we find that the size of anycompromised network follows an exponential distribution.As a result, all the sizes of compromised networksfollow exponential distributions at the early stage.Based on Lemma 1, we obtain that the malware distributionin terms of network follows exponential distributionsat its early stage. tuMoreover, we can obtain concrete conclusion of the propagationof malware at the final stage.Theorem 2. For large scale networks, such as the Internet, at thefinal stage (t¼1) of a malware propagation, the malwaredistribution in terms of networks follows the power lawdistribution.Proof. At the final stage, all vulnerable hosts have beencompromised, namely,SðLi;1Þ ¼ Ni; i ¼ 1; 2; . . .;M:Based on our previous discussion, we know Niði ¼1; 2; . . .;MÞ follows the power law. As a result, the theoremholds. tuNow, we move our study to the late stage of malwarepropagation.Theorem 3. For large scale networks, such as the Internet, at thelate stage (Te _ t < 1) of a malware breakout, the malwaredistribution include two parts: a dominant power law bodyand a short exponential tail.YU ET AL.: MALWARE PROPAGATION IN LARGE-SCALE NETWORKS 175Proof. Suppose a malware propagation has progressed fortðt > > TeÞ time units. Let t0 ¼ t _ Te. If we separate allthe compromised IðtÞ hosts by time point t0, we have twogroups of compromised hosts.Following Theorem 2, as t0 >> Te, the compromisedhosts before t0 follows the power law. At the same time,all the compromised networks after t0 are still in theirearly stage. Therefore, these recently compromised networksfollow exponential distributions.Now, we need to prove that the networks compromisedafter time point t0 are at the tail of the distribution.First of all, for a given network Li, for t1 > t2,we haveSðLi; t1Þ _ SðLi; t2Þ: (25)For two networks, Li and Lj, if Ni _ Nj, then Lishould be compromised earlier than Lj. Combining thiswith (25), we know the later compromised networks usuallylie at the tail of the distribution.Due to the fact that t0 >> Te, the length of the exponentialtail is much shorter than the length of the mainbody of the distribution. tu7 PERFORMANCE EVALUATIONIn this section, we examine our theoretical analysis throughtwo well-known large-scale malware: Android malwareand Conficker. Android malware is a recent fast developingand dominant smartphone based malware [19]. Differentfrom Android malware, the Conficker worm is an Internetbased state-of-the-art botnet [20]. Both the data sets havebeen widely used by the community.From the Android malware data set, we have an overviewof the malware development from August 2010 to October2011. There are 1,260 samples in total from 49 differentAndroid malware in the data set. For a given Android malwareprogram, it only focuses on one or a number of specificvulnerabilities. Therefore, all smartphones share these vulnerabilitiesform a specific network for that Android malware.In other words, there are 49 networks in the data set,and it is reasonable that the population of each network ishuge. We sort the malware subclasses according to their size(number of samples in the data set), and present them in aloglog format in Fig. 3, the diagram is roughly a straight line.In other words, we can say that the Android malware distributionin terms of networks follows the power law.We now examine the growth pattern of total number ofcompromised hosts of Android malware against time,namely, the pattern of IðtÞ. We extract the data from thedata set and present it in Table 2. We further transform thedata into a graph as shown in Fig. 4. It shows that the memberrecruitment of Android malware follows an exponentialdistribution nicely during the 15 months time interval. Wehave to note that our experiments also indicate that thisdata does not fit the power law (we do not show them heredue to space limitation).In Fig. 4, we match a straight line to the real data throughthe least squares method. Based on the data, we can estimatethat the number of seeds (Ið0Þ) is 10, and a ¼ 0:2349.Following our previous discussion, we infer that the propagationof Android malware was in its early stage. It is reasonableas the size of each Android vulnerable network ishuge and the infection rate is quite low (the infection is basicallybased on contacts).We also collected a large data set of Conficker from variousaspects. Due to the space limitation, we can only presenta few of them here to examine our theoretical analysis.First of all, we treat AS as networks in the Internet. Ingeneral, ASs are large scale elements of the Internet. A fewkey statistics from the data set are listed in Table 3. WeFig. 3. The probability distribution of Androidmalware in terms of networks.TABLE 2The Number of Different Android Malware against Time (Months) in 2010-2011Fig. 4. The growth of total compromised hosts by Android malwareagainst time from August 2010 to October 2011.176 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 1, JANUARY 2015present the data in a loglog format in Fig. 5, which indicatesthat the distribution does follow the power law.A unique feature of the power law is the scale free property.In order to examine this feature, we measure the compromisedhosts in terms of domain names at three differentdomain levels: the top level, level 1, and level 2, respectively.Some statistics of this experiment are listed inTable 4.Once again, we present the data in a loglog format inFigs. 6a, 6b and 6c, respectively. The diagrams show thatthe main body of the three scale measures are roughlystraight lines. In other words, they all fall into power lawdistributions. We note that the flat head in Fig. 6 can beexplained through a Zipf-Mandelbrot distribution. Therefore,Theorem 2 holds.In order to examine whether the tails are exponential, wetake the smallest six data from each tail of the three levels. Itis reasonable to say that they are the networks compromisedat the last 6 time units, the details are listed in Table 5 (wenote that t ¼ 1 is the sixth last time point, and t ¼ 6 is thelast time point).When we present the data of Table 5 into a graph asshown in Fig. 7, we find that they fit an exponential distributionvery well, especially for the level 2 and level 3domain name cases. This experiment confirms our claimin Theorem 3.8 FURTHER DISCUSSIONIn this paper, we have explored the problem of malwaredistribution in large-scale networks. There are many directionsthat could be further explored. We list some importantones as follows.1) The dynamics of the late stage. We have found thatthe main body of malware distribution follows thepower law with a short exponential tail at the latestage. It is very attractive to explore the mathematicalmechanism of how the propagation leads to suchkinds of mixed distributions.2) The transition from exponential distribution topower law distribution. It is necessary to investigatewhen and how a malware distribution moves froman exponential distribution to the power law. Inother words, how can we clearly define the transitionpoint between the early stage and the late stage.3) Multiple layer modelling. We hire the fluid model inboth of the two layers in our study as both layers aresufficiently large and meet the conditions for themodelling methods. In order to improve the accuracyof malware propagation, we may extend ourwork to nðn > 2Þ layers. In another scenario, weTABLE 3Statistics for Conficker Distribution in Terms of ASsFig. 5. Power law distribution of Conficker in terms of autonomousnetworks.TABLE 4Statistics for Conficker Distribution in Terms of DomainNames at the Three Top LevelsFig. 6. Power law distribution of Conficker botnet in the top three levels of domain names.YU ET AL.: MALWARE PROPAGATION IN LARGE-SCALE NETWORKS 177may expect to model a malware distribution for middlesize networks, e.g., an ISP network with manysubnetworks. In these cases, the conditions for thefluid model may not hold. Therefore, we need toseek suitable models to address the problem.4) Epidemic model for the proposed two layer model.In this paper, we use the SI model, which is thesimplest for epidemic analysis. More practical models,e.g., SIS or SIR, could be chosen to serve thesame problem.5) Distribution of coexist multiple malware in networks.In reality, multiple malware may coexist atthe same networks. Due to the fact that different malwarefocus on different vulnerabilities, the distributionsof different malware should not be the same. Itis challenging and interesting to establish mathematicalmodels for multiple malware distribution interms of networks.9 SUMMARY AND FUTURE WORKIn this paper, we thoroughly explore the problem of malwaredistribution at large-scale networks. The solution tothis problem is desperately desired by cyber defenders asthe network security community does not yet have solidanswers. Different from previous modelling methods, wepropose a two layer epidemic model: the upper layerfocuses on networks of a large scale networks, for example,domains of the Internet; the lower layer focuses on the hostsof a given network. *This two layer model improves theaccuracy compared with the available single layer epidemicmodels in malware modelling. Moreover, the proposed twolayer model offers us the distribution of malware in termsof the low layer networks.We perform a restricted analysis based on the proposedmodel, and obtain three conclusions: The distribution for agiven malware in terms of networks follows exponentialdistribution, power law distribution with a short exponentialtail, and power law distribution, at its early, late, andfinal stage, respectively. In order to examine our theoreticalfindings, we have conducted extensive experiments basedon two real-world large-scale malware, and the results confirmour theoretical claims.In regards to future work, we will first further investigatethe dynamics of the late stage. More details of the findingsare expected to be further studied, such as the length of theexponential tail of a power law distribution at the late stage.Second, defenders may care more about their own network,e.g., the distribution of a given malware at their ISPdomains, where the conditions for the two layer model maynot hold. We need to seek appropriate models to addressthis problem. Finally, we are interested in studying the distributionof multiple malware on large-scale networks aswe only focus on one malware in this paper. We believe it isnot a simple linear relationship in the multiple malwarecase compared to the single malware one.ACKNOWLEDGMENTSDr Yu’s work is partially supported by the National NaturalScience Foundation of China (grant No. 61379041), Prof.Stojmenovic’s work is partially supported by NSERCCanada Discovery grant (grant No. 41801-2010), and KAUDistinguished Scientists Program.Shui Yu (M’05-SM’12) received the BEng andMEng degrees from the University of ElectronicScience and Technology of China, Chengdu, P.R. China, in 1993 and 1999, respectively, andthe PhD degree from Deakin University, Victoria,Australia, in 2004. He is currently a senior lecturerwith the School of Information Technology,Deakin University, Victoria, Australia. He haspublished nearly 100 peer review papers, includingtop journals and top conferences, such asIEEE TPDS, IEEE TIFS, IEEE TFS, IEEE TMC,and IEEE INFOCOM. His research interests include networking theory,network security, and mathematical modeling. His actively servers hisresearch communities in various roles, which include the editorial boardsof the IEEE Transactions on Parallel and Distributed Systems, IEEECommunications Surveys and Tutorials, and IEEE Access, IEEE INFOCOMTPC members 2012-2015, symposium co-chairs of IEEE ICC2014, IEEE ICNC 2013-2015, and many different roles of internationalconference organizing committees. He is a senior member of the IEEE,and a member of the AAAS.Guofei Gu (S’06-M’08) received the PhD degreein computer science from the College of Computing,Georgia Institute of Technology. He is anassistant professor in the Department of ComputerScience and Engineering, Texas A&M University(TAMU), College Station, TX. Hisresearch interests are in network and systemsecurity, such as malware analysis, detection,defense, intrusion and anomaly detection, andweb and social networking security. He is currentlydirecting the Secure Communication andComputer Systems (SUCCESS) Laboratory at TAMU. He received the2010 National Science Foundation (NSF) Career Award and a corecipientof the 2010 IEEE Symposium on Security and Privacy (Oakland 10)Best Student Paper Award. He is a member of the IEEE.Ahmed Barnawi received the PhD degree fromthe University of Bradford, United Kingdom in2006. He is an associate professor at the Facultyof Computing and IT, King Abdulaziz University,Jeddah, Saudi Arabia, where he works since2007. He was visiting professor at the Universityof Calgary in 2009. His research areas are cellularand mobile communications, mobile ad hocand sensor networks, cognitive radio networksand security. He received three strategicresearch grants and registered two patents in theUS. He is a member of the IEEE.Song Guo (M’02-SM’11) received the PhDdegree in computer science from the Universityof Ottawa, Canada in 2005. He is currently asenior associate professor at the School of ComputerScience and Engineering, the University ofAizu, Japan. His research interests are mainly inthe areas of protocol design and performanceanalysis for reliable, energy-efficient, and costeffective communications in wireless networks.He is an associate editor of the IEEE Transactionson Parallel and Distributed Systems and aneditor of Wireless Communications and Mobile Computing. He is asenior member of the IEEE and the ACM.Ivan Stojmenovic was editor-in-chief of theIEEE Transactions on Parallel and DistributedSystems (2010-3), and is founder of three journals.He is editor of the IEEE Transactions onComputers, IEEE Network, IEEE Transactionson Cloud Computing, and ACM Wireless Networksand steering committee member of theIEEE Transactions on Emergent Topics in Computing.He is on Thomson Reuters list of HighlyCited Researchers from 2013, has top h-index inCanada for mathematics and statistics, and hasmore than 15,000 citations. He received five Best Paper Awards. He is afellow of the IEEE, Canadian Academy of Engineering and AcademiaEuropaea. He has received the Humboldt Research Award.” For more information on this or any other computing topic,please visit our Digital Library at www.computer.org/publications/dlib.YU ET AL.: MALWARE PROPAGATION IN LARGE-SCALE NETWORKS 179