
Code Red worm propagation modeling and analysis

(Last Update: March 1st, 2002)

Abstract:

In this report, we first point out several problems in Stuart Staniford's model of Code Red worm propagation [1]. We show that his model is too simple and cannot capture the real evolution of the Code Red worm; several other worm propagation papers [7,8,9] used the same modeling method. There are two main processes that these models do not consider. The first is the slowing down of the worm scanning process as more and more ISPs unplugged their networks or blocked web traffic, together with the BGP storm [11] that occurred during this period. The second is that the number of users and ISPs that blocked or patched their computers kept increasing while the worm was spreading rapidly. Starting from the general deterministic epidemic model, the Kermack-McKendrick model [6], we derive a general Internet worm model that we call the two-factor worm model. It takes into account the two processes above, which are unique to Internet worm propagation. Although we cannot derive an analytical solution due to the undetermined dynamics of these two processes, we can specify their forms and run simulations of Code Red worm propagation based on the new model. The simulations show that our model fits the observed Code Red propagation data better than Staniford's model. It gives us a better understanding of worm propagation on the Internet and better prediction of future worm spreading. It also explains why some previous worms did not spread as widely, or as frighteningly, as predicted by the news or by some researchers.


I. Background


The Windows IIS vulnerability was discovered on June 18, 2001. The first version of the Code Red worm emerged on July 12th, 2001, and the truly virulent strain began to spread on July 17th. This new strain uses a correct random number generator and spawns 100 threads. Each thread randomly picks one IP address and tries to set up a connection to port 80. If the connection succeeds, the worm sends a copy of itself to the victim web server and then continues looking for other web servers. If the target is not a web server and the connection cannot be set up, the thread randomly generates another IP address to probe. The Code Red worm kept spreading on July 19th until 0:00 GMT July 20th (8:00 pm EDT, July 19th), after which it stopped spreading [4]. Dr. Stuart Staniford of Silicon Defense published a theoretical analysis of this worm propagation [1]. His differential equation is the classical epidemic equation, and he claimed that it fit the observed Code Red spreading quite well; after that, many researchers simply took this model for granted. During this one-day period the worm spread continuously, which makes it valuable for studying worm propagation properties.


II. Questions about Stuart Staniford's report [1]



1. Why use the worm scan count rather than the infected host count? In Dr. Staniford's report, the Ken Eichman data set [14] consists of two types of data: the number of Code Red worm scans observed in each hour, and the number of unique source addresses scanning during each hour. Because each infected host generates 100 simultaneous scans (one per thread), the worm scan count is much larger than the number of unique source addresses, which corresponds to the number of infected hosts. Since we want to model worm propagation and study the dynamics of the infected host population, we should work mainly with the infected host data. Dr. Staniford's differential equation is the epidemic equation for infected hosts, so he should also have used the infected host count as his data set. Strangely, he used the worm scan count in his report and said nothing about the infected host data. Another questionable point is that he used only part of the data set that Ken Eichman provided. The Code Red worm kept spreading until 0:00 GMT July 20th, but Staniford only used data up to 21:00 GMT July 19th. During the three hours from 21:00 to 00:00 GMT the worm scan count kept decreasing, which Staniford's model cannot explain. To study the dynamics of the infected host population we need the total number of infected hosts, which Ken Eichman's data does not provide. We cannot derive it from his per-hour unique source counts either, since many sources overlap from hour to hour and his network covers only 1/65536 of the whole IP address space. The data set in report [2] is more useful for our purpose: the data was collected from two locations with two data formats to maximize coverage of the worm's expansion, the total number of infected hosts was collected directly, and a /8 network was used, covering 1/256 of the whole IP address space.


2. Have most of the vulnerable nodes been infected? The classical epidemic equation for a finite population is [6]:

$\displaystyle \frac{dI(t)}{dt} =\beta I(t)[N-I(t)]$ (1)

Here $ I(t)$ is the number of infected hosts, $ N$ is the total population, and $ \beta $ can be treated as the infection rate. This is the equation Staniford used in his report [1]. With $ \beta=1.8$, the curve is shown in Fig. 1:
Figure 1: Typical epidemic equation curve ($ \beta $ = 1.8)
If we replace $ I(t)$ by $ S(t)=N-I(t)$, where $ S(t)$ is the number of susceptible hosts, the equation is the same except for a sign, so the curve in Fig. 1 is diagonally symmetric. If we use the parameter from Staniford's report and set $ \beta $ to 1.8, it is clear that the growth of the infected host count begins to slow down noticeably only after about 80% of all vulnerable nodes have been infected. If Staniford's model were correct for Code Red propagation on July 19th, it would mean that over 80% of vulnerable Internet web servers were infected in just that one day. How can this be true? We know from the news [5] that Microsoft estimated there were about 6 million vulnerable Windows NT/2000 web servers. Even if that number is exaggerated and some web servers were unavailable on that day, we can conservatively estimate that at least 1 to 2 million vulnerable hosts were reachable. From the more complete data set [2], about 350,000 hosts were infected on July 19th, which means that only about 20% to 40% of vulnerable web servers were infected during that day.
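To make the argument concrete, the following minimal sketch integrates equation (1); we assume the $ \beta=1.8$ figure is used in the normalized form $ da/dt = K a(1-a)$ with $ a(t)=I(t)/N$ and $ K = 1.8$ per hour, and the step size and initial infected fraction below are our own illustrative choices.

    # Euler integration of the normalized epidemic equation da/dt = K*a*(1-a),
    # where a(t) = I(t)/N is the infected fraction and K is the fitted rate.
    K = 1.8          # per hour (the value quoted above)
    a = 1e-5         # assumed initial infected fraction (illustrative)
    dt = 0.01        # hours per Euler step

    trace = []
    for step in range(int(24 / dt)):         # simulate one day
        if step % int(1 / dt) == 0:          # record once per hour
            trace.append(a)
        a += K * a * (1.0 - a) * dt

    for hour, frac in enumerate(trace):
        print(f"hour {hour:2d}: infected fraction = {frac:.3f}")

Under this model essentially the whole vulnerable population is infected within a single day, which is exactly the prediction we question against the roughly 20% to 40% level seen in the Caida data.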


It is clear that Staniford's model is too simple and does not describe the worm propagation very well, yet judging from published reports and references, people have taken it for granted and nobody has questioned the model.


David Moore of Caida.org gave a detailed analysis of the Code Red worm [2]. He presented valuable data collected while the worm was spreading rampantly on July 19th. The observed infected host count is shown in Fig. 2.

Figure 2: Code Red worm infection data on July 19th (from Caida.org).
Because the Code Red worm was programmed to stop spreading after 00:00 UTC on July 20th, the infected host count stopped increasing after that time; otherwise the curve in Fig. 2 would keep increasing. The questions we are interested in are: How can we explain this Code Red propagation curve? What factors affect the spreading behavior of an Internet worm? Here we derive a new model for Code Red worm propagation that takes into account practical conditions affecting worm propagation on the Internet. It fits the data set [2] quite well and also shows why worm propagation is not as frightening as some people predicted.


III. New model on Code Red worm propagation


1. Factors affecting real worm propagation on the current Internet. Real worm propagation on the Internet is a complicated process involving many arbitrary and unpredictable factors. In order to model it, we need to capture the most important factors that clearly affect worm propagation. Here are some major factors that we can think of:

Because hackers can write worm code arbitrarily, we consider only the continuously activated worm pattern, in which the worm on an infected host keeps trying to find and infect other vulnerable hosts. This is the case for Code Red spreading during July 19th.


2. Classical general epidemic model: the Kermack-McKendrick model. The mathematical epidemiology book [6] presents the Kermack-McKendrick model, which takes the removal of infective hosts into consideration. It assumes that infective hosts are removed from circulation at a rate proportional to the current number of infective hosts. Using the same notation as [6], we denote by $ R(t)$ the number of removed hosts at time $ t$ and by $ \gamma$ the rate of removal of infective hosts from circulation, so that

$\displaystyle \frac{dR(t)}{dt}=\gamma I(t)$ (2)

Define $ \rho\equiv \gamma/\beta$ as the relative removal rate [10]. One interesting result coming out of this model is:

$\displaystyle \frac{dI(t)}{dt}>0$    if and only if  $\displaystyle S(t)>\rho$ (3)

This means that if the number of susceptible hosts is smaller than the critical value $ \rho$, there will be no epidemic. However, the Kermack-McKendrick model is not directly suitable for modeling Internet worm propagation. We can treat patching, blocking, and filtering as the removal process in this model, but the model's simple removal process cannot match the more complicated situation of an Internet worm. Moreover, it does not take into account the slowing down of worm infection caused by the large volume of worm traffic on the Internet, which plays a very important role in worm propagation.
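As a quick check of the threshold result (3), the sketch below integrates the Kermack-McKendrick system $ dS/dt=-\beta S I$, $ dI/dt=\beta S I - \gamma I$ for two initial susceptible populations, one above and one below $ \rho=\gamma/\beta$; all numerical values are illustrative and are not fitted to any worm data.

    # Euler integration of the Kermack-McKendrick (SIR) model to illustrate
    # the epidemic threshold: I(t) grows only while S(t) > rho = gamma/beta.
    def run_sir(s0, i0, beta, gamma, t_end=50.0, dt=0.001):
        s, i = s0, i0
        peak_i = i
        while t_end > 0:
            ds = -beta * s * i
            di = beta * s * i - gamma * i
            s += ds * dt
            i += di * dt
            peak_i = max(peak_i, i)
            t_end -= dt
        return peak_i

    beta, gamma = 0.002, 1.0          # illustrative rates
    rho = gamma / beta                # critical susceptible population = 500
    for s0 in (1000.0, 200.0):        # one start above rho, one below
        peak = run_sir(s0, i0=10.0, beta=beta, gamma=gamma)
        print(f"S(0) = {s0:6.0f} (rho = {rho:.0f}): peak infectives = {peak:.1f}")

With $ S(0)$ above $ \rho$ the infectives first grow to a peak; below $ \rho$ they only decay, as (3) predicts.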


3. General Internet worm model. We can extend the Kermack-McKendrick model by considering the following two specific processes that affect Internet worm propagation: (1) the slowing down of the worm scanning and infection process caused by worm traffic and by ISPs disconnecting networks or blocking web traffic; (2) the removal of both infected and still-vulnerable hosts from circulation through patching, filtering, blocking, and disconnection.

For the first process, we found a paper that supports it [11]. The paper showed that during the continuous propagation periods of the Code Red and Nimda worms, BGP storms occurred that made global Internet routing unstable and caused many transient failures of Autonomous Systems. Since the Code Red worm chooses the IP addresses it scans at random, on average it must scan more than 2000 addresses before finding a Windows IIS web server. When the worm receives a RST packet in response, meaning the scanned computer is not a web server, it picks another IP address to scan immediately. But if the scanned address is invalid, or the host is unreachable at that moment, the worm must wait for a connection timeout before picking another address. Getting a RST packet from a live host usually takes 50 to 500 milliseconds, whereas waiting for a timeout takes 21 seconds [12]. As more and more ISPs unplugged their networks, and many Autonomous Systems experienced transient unreachability as indicated by the BGP storm [11], more and more worm scans had to wait for timeouts, slowing down the overall scanning process. The worm scanning itself may also have slowed the Internet down: since each infected computer runs 100 threads scanning continuously in parallel, and there were so many infected hosts on July 19th, the worm generated a huge number of small packets. Although their total volume is small compared with normal Internet traffic, the sheer number of packets could cause trouble for routers, especially edge routers. The Code Red scanning data collected independently by Dave Goldsmith [13] and Ken Eichman [14] around July 19th gives further evidence for our conjecture. They collected Code Red port 80 scanning attempts and observed the same pattern (Fig. 3 and Fig. 4):

Figure 3: Ken Eichman data.  Figure 4: Dave Goldsmith data.

From the worm scanning curves in Fig. 3 and Fig. 4, we can see that both of these /16 networks received their maximum worm scanning between 18:00 and 19:00 UTC. After that, scanning activity dropped slowly until it stopped at 00:00 UTC. Comparing this with the infected host curve in Fig. 2, we see that they are not consistent: after 19:00 UTC the number of infected hosts kept increasing while the worm scanning kept dropping. Since the Code Red worm kept the same continuous scanning and spreading pattern throughout that day, this phenomenon can only be explained by the following three possible reasons:

The second process is obvious; we know it from news reports and from the countermeasures ISPs took against the Code Red worm.
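Returning to the first process, a rough back-of-the-envelope sketch shows how strongly unreachable addresses drag down a single thread's scan rate; the 50 to 500 ms response and 21 second timeout figures come from the discussion above, while the fraction of probes that time out is left as a free parameter.

    # Rough estimate of a single thread's scan rate as more of the address
    # space becomes unreachable (timeouts) instead of answering with a RST.
    RST_DELAY = 0.25      # seconds, midpoint of the 50-500 ms range above
    TIMEOUT   = 21.0      # seconds, the Code Red connect timeout [12]

    def scans_per_hour(timeout_fraction):
        """Expected scans/hour for one thread, given the fraction of probes
        that hit unreachable or invalid addresses and must wait for timeout."""
        mean_delay = (1 - timeout_fraction) * RST_DELAY + timeout_fraction * TIMEOUT
        return 3600.0 / mean_delay

    for frac in (0.0, 0.1, 0.3, 0.5):
        print(f"{frac:.0%} of probes time out -> {scans_per_hour(frac):7.0f} scans/hour/thread")

Even a modest rise in the unreachable fraction cuts the per-thread scan rate by an order of magnitude, which is consistent with the scan counts in Fig. 3 and Fig. 4 falling while the infected population still grew.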


In order to take the slowing down of the worm propagation process into account, the infection rate $ \beta $ in the typical epidemic equation (1) must become a variable $ \beta(t)$ rather than a constant. From the point of view of the whole Internet, patching, filtering, blocking, or disconnecting means that some hosts are removed from circulation, whether they are infected or merely vulnerable. So the removal process consists of two parts: removal of infected hosts and removal of vulnerable hosts. As before, let $ R(t)$ be the number of removed infected hosts at time $ t$, and let $ Q(t)$ be the number of removed vulnerable hosts at time $ t$. Consider the change in the number of vulnerable hosts from time $ t$ to time $ t+\triangle t$:

$\displaystyle S(t+\triangle t) - S(t) = -\beta(t) S(t)I(t)\triangle t - \frac{dQ(t)}{dt}\triangle t$ (4)

So we have:

$\displaystyle \frac{dS(t)}{dt}=-\beta(t) S(t)I(t)-\frac{dQ(t)}{dt}$ (5)

At any time, $ S(t)+I(t)+R(t)+Q(t)=N$ holds, so substituting $ S(t) = N-I(t)-R(t)-Q(t)$ into equation (5) gives the differential equation for the number of infected hosts $ I(t)$:

$\displaystyle \frac{dI(t)}{dt}=\beta(t)[N-R(t)-I(t)-Q(t)]I(t)-\frac{dR(t)}{dt}$ (6)


We tentatively call equation (6) the two-factor worm model.


In order to obtain an analytical solution of equation (6), we would have to know the dynamics of $ \beta(t)$, $ R(t)$ and $ Q(t)$. $ \beta(t)$ is determined by Internet traffic conditions and the worm's code pattern; $ R(t)$ and $ Q(t)$ involve people's attitudes and awareness, the difficulty of patching and filtering, and so on. By simplifying their dynamics we may be able to obtain some analytical results, which we leave for future work. For now, let us run simulations and compare them against the valuable Code Red propagation data collected by others [2].
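As a purely illustrative way to explore equation (6) before any analytical treatment, one can plug in simple assumed dynamics, for example a removal rate proportional to $ I(t)$ as in (2), a removal rate of vulnerable hosts proportional to $ S(t)$, and a $ \beta(t)$ that decays with the infected fraction. None of these forms is derived here; they are placeholders for the simulation in Section IV.

    # Euler integration of the two-factor model, equation (6):
    #   dI/dt = beta(t) * [N - R(t) - I(t) - Q(t)] * I(t) - dR/dt
    # with assumed (illustrative) dynamics for beta(t), R(t), Q(t).
    N      = 490_000        # vulnerable population, as used in Section IV
    beta0  = 1.8 / N        # assumed initial infection rate (1.8 per hour scaled by N)
    gamma  = 0.05           # assumed removal rate of infected hosts (per hour)
    mu     = 0.05           # assumed removal rate of still-vulnerable hosts
    n      = 2              # sensitivity of the slowdown to the infected fraction

    I, R, Q = 10.0, 0.0, 0.0
    dt = 0.01               # hours
    for step in range(int(30 / dt)):
        beta_t = beta0 * (1.0 - I / N) ** n       # assumed traffic slowdown factor
        S = N - I - R - Q
        dR = gamma * I
        dQ = mu * S
        dI = beta_t * (N - R - I - Q) * I - dR
        I += dI * dt
        R += dR * dt
        Q += dQ * dt
        if step % int(5 / dt) == 0:
            print(f"t = {step*dt:5.1f} h  I = {I:9.0f}  R = {R:9.0f}  Q = {Q:9.0f}")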


IV. Simulation of Code Red worm based on two-factor model


In order to verify our two-factor worm model, we simulate Code Red worm propagation based on it.


1. Two-factor worm simulation model description. In order to simulate Code Red propagation as realistically as possible, we used 490,000 vulnerable hosts in our simulation. Each worm randomly picks one host in the host space to infect at the end of each infection delay time, which represents the duration of the worm scanning process. The infection process works as follows: an infected host randomly picks one host in the host space, takes a specific delay time (determined by the worm scanning process) to infect it, and then repeats the same infection cycle; there is no reinfection. To account for the effect of blocking, patching, and filtering on worm propagation, we simply immunize the same number of hosts as the number of hosts ever infected, picking non-immunized hosts at random regardless of whether they are infected or still healthy. We vary the worm infection delay time $ D(t)$ to capture the slowing down of worm propagation. In our simulation, we use the following equation to generate the delay time $ D(t)$:

$\displaystyle D(t)=D(0)+ Normal(k_1p(t)^n, k_2p(t)^n)$ (7)

Here $ p(t)$ is the fraction of ever-infected hosts in the whole population; $ Normal(\mu,\sigma)$ is the normal distribution with mean $ \mu$ and standard deviation $ \sigma$; and $ D(0)$ is the base delay time. The normal distribution is used to introduce some randomness into the simulation. The exponent $ n$ adjusts how sensitive the delay time is to the number of infected hosts. From our preliminary simulations, we find that $ n=2$ is appropriate for the Code Red simulation.
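A stripped-down sketch of this simulation loop is given below. The population is scaled down from 490,000 hosts for runtime, and the base delay, the coefficients $ k_1$ and $ k_2$, and the event-driven bookkeeping are our own simplifications of the description above, not the original simulation code.

    import heapq
    import random

    # Scaled-down sketch of the two-factor simulation described above.
    # Population and timing constants are illustrative, not the paper's values.
    N      = 10_000          # vulnerable hosts (scaled down from 490,000 for speed)
    D0     = 1.0             # base infection delay D(0), arbitrary time units
    k1, k2 = 20.0, 2.0       # assumed coefficients of equation (7)
    n      = 2               # delay sensitivity exponent, as chosen in the text
    T_END  = 200.0           # simulated time horizon

    state = ["susceptible"] * N            # per-host state
    ever_infected = 0                      # hosts that were infected at any time

    def delay(p):
        """Infection delay D(t) from equation (7): D(0) + Normal(k1*p^n, k2*p^n)."""
        return max(D0, D0 + random.gauss(k1 * p ** n, k2 * p ** n))

    events = []                            # (time, infecting host id) min-heap
    state[0] = "infected"                  # seed one initial infection
    ever_infected = 1
    heapq.heappush(events, (delay(1 / N), 0))

    while events:
        t, src = heapq.heappop(events)
        if t > T_END:
            break
        if state[src] != "infected":       # source was immunized meanwhile
            continue
        target = random.randrange(N)
        if state[target] == "susceptible":
            state[target] = "infected"
            ever_infected += 1
            p = ever_infected / N
            heapq.heappush(events, (t + delay(p), target))
            # removal process: immunize one random non-immunized host per infection
            victim = random.randrange(N)
            if state[victim] != "immunized":
                state[victim] = "immunized"
        # source schedules its next scan regardless of the outcome
        heapq.heappush(events, (t + delay(ever_infected / N), src))

    print(f"ever infected: {ever_infected} of {N} "
          f"({100 * ever_infected / N:.1f}%)")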


2. Preliminary simulation results. We ran simulations of three cases using the same parameters. The first case corresponds to Staniford's model [1] and some other research [7,8,9]; it does not consider the two factors in our model. The second case considers only the traffic slowing-down effect. The third is our two-factor model, which considers both factors. The simulation results are shown in Fig. 5 (the infected node count is the total number of hosts ever infected, including infected hosts that were later patched or blocked):

Figure 5: Code Red worm simulation based on different models.
Comparing our simulation results with the Caida data curve in Fig. 2, we can see that by considering Internet traffic and the removal processes, we fit the observed data better than the original Code Red simulation in [1]. By adjusting the parameters in our simulation, such as those in equation (7) and the host immunization process, we can tune the curve to fit the real data and thus learn more about real Internet characteristics. We tried a linear formula in equation (7), i.e., $ n=1$, but the result was not good (see Fig. 6): the linearly increasing part of those curves starts too early and lasts too long.
Figure 6: Code Red worm propagation considering Internet traffic, with the linear ($ n=1$) delay equation


3. Statistics of the worm simulation. Because many random events occur in our worm simulations, we want to know the statistical properties of the simulation: is it sensitive to those random events? We repeated the second-case simulation of Fig. 5 100 times. Taking the maximum and minimum values at each time $ t$, we obtain two curves that bound all 100 simulation curves. These two curves are so close to each other that we cannot plot them separately; the maximum relative difference between them is only $ 0.852\%$! Worm propagation is therefore almost a deterministic process. The reason random events have so little effect is that the target population is huge and each worm infects independently, so from the viewpoint of the whole process, this huge number of random events averages out.
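The envelope statistic described here is easy to reproduce: run the stochastic simulation many times on a common time grid, take the pointwise minimum and maximum curves, and report the largest relative gap between them. The run_simulation interface and the toy logistic stand-in below are placeholders, not the simulation used in this report.

    import random

    def envelope_gap(run_simulation, runs=100):
        """Pointwise min/max envelope over repeated runs and the maximum
        relative gap between the two envelope curves."""
        curves = [run_simulation(seed) for seed in range(runs)]
        lo = [min(vals) for vals in zip(*curves)]   # pointwise minimum curve
        hi = [max(vals) for vals in zip(*curves)]   # pointwise maximum curve
        return max((h - l) / h for l, h in zip(lo, hi) if h > 0)

    # Toy stand-in for the worm simulation: logistic growth with small noise,
    # just to exercise the statistic. Replace with the real simulation's output.
    def toy_run(seed, steps=100, n=490_000):
        rng = random.Random(seed)
        i, out = 10.0, []
        for _ in range(steps):
            i += 0.2 * i * (1 - i / n) * (1 + rng.uniform(-0.05, 0.05))
            out.append(i)
        return out

    print(f"max relative difference: {envelope_gap(toy_run):.3%}")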


4. Some discussion. In our two-factor model simulations, the two corresponding curves slow down the infection speed when only 20% to 40% of the vulnerable hosts have been infected. This explains the fact that on July 19th only about 20% to 40% of vulnerable hosts were infected by Code Red, a fact that Staniford's model cannot explain, as discussed in Section II.2. When the Code Red and Nimda worms were spreading rampantly on the Internet, people made frightening predictions of how fast these worms would infect almost every computer in the world; the "Warhol worm" presented by Weaver [9] even claims it could infect nearly every vulnerable computer within 15 minutes. But it turned out that those worms did not infect that many computers. Our two-factor model is a first step toward explaining such facts. It should be noted that the 21-second timeout in Code Red's scanning is set too long; if it had been set to, say, 2 seconds, the worm would have spread much faster than it did on July 19th. Patching, blocking, and filtering play an important role in slowing down and eliminating virus and worm propagation. In reality, many viruses and worms appear almost every day, but only a few of them propagate seriously on the Internet, and eventually all of them die out. Most viruses and worms are not as contagious as Code Red and Nimda, so once the patching and filtering speed exceeds the infection speed, they simply disappear from Internet circulation.


In the near future, we will try to simplify the dynamics of the parameters $ Q(t)$, $ R(t)$ and $ \beta(t)$ in equation (6) in order to obtain analytical results.



Changchun Zou 2002-03-01