
HOW BIG IS THE WORLD WIDE WEB?
Darrell Rhodenizer & André Trudel
Jodrey School of Computer Science
Acadia University
Wolfville, Nova Scotia, Canada, B0P 1X0
andre.trudel@acadiau.ca
ABSTRACT
The World Wide Web is one of the most popular and quickly growing aspects of the Internet. Ways in which computer
scientists attempt to estimate its size vary from making educated guesses to performing extensive analyses of search engine databases. We present a new way of measuring the size of the World Wide Web using “Quadrat Counts”, a
technique used by biologists for population sampling.
KEYWORDS
WWW size estimation using biological techniques.
1. INTRODUCTION
How much of the Internet does the World Wide Web (WWW or simply Web) actually populate? Are web
servers abundant throughout the ‘Net, or is the vastness of cyberspace still a relatively empty void?
Unfortunately, due to the size and dynamic nature of the WWW, it is infeasible to conduct an
exhaustive census. Instead, recent studies have attempted to estimate its size using a variety of techniques.
We take an interdisciplinary approach to the problem of estimating the size of the WWW by adopting a
technique utilized by biologists who conduct population sampling using quadrat counts – a new, and
intuitive way of looking at the problem.
There are many ways in which the Web can be measured. Other studies have treated it as the total number of web pages, or as the disk space used to store those pages. We count the number of web servers available on port 80 across the ‘Net.
2. PREVIOUS STUDIES
Gone are the days when we had only to count the computers hard wired into the ARPANET to know the
exact number of hosts on the Internet. Now it is a difficult task to estimate even a relatively small
population, such as the number of computers running web servers, let alone to guess at the number of
computers connected to the ‘Net in total. There have been numerous studies and surveys conducted which
try to estimate the population of the Web, and we present a few of the approaches taken.
Search engines are a logical tool to use when trying to estimate the size of the web. A search
engine’s index size can be used as a lower bound estimate of the Web’s size. Steve Lawrence and C. Lee
Giles [Giles, Lawrence, 1999a] [Giles, Lawrence, 1999b] [Giles, Lawrence, 1998] used the overlap
between search engines to estimate N, the size of the “Indexable Web”. The Indexable Web is considered
to be those pages on the Web that are available for search engines to find and catalog.
Figure 1. Search Engine Coverage [Giles, Lawrence, 1998]
Let a and b be two search engines that search the Web independently. As illustrated in Figure 1, na is the number of results returned by Search Engine a, while nb is the number of results returned by Search Engine b. The value n0 is the area of overlap where the two search engines return the same results. Lawrence and Giles use n0 / nb as an estimate of the fraction of the Indexable Web, pa, covered by Search Engine a. The size of the Indexable Web can then be estimated by dividing sa by pa, where sa is the total number of pages indexed by Search Engine a.
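As a sketch of the calculation, the estimate works out as follows; the input values below are invented purely for illustration and are not taken from the study:

# Overlap estimate in the style of Lawrence and Giles; hypothetical inputs only.
use strict;
use warnings;

my $n0 = 60;           # results returned by both engines (the overlap) - hypothetical
my $nb = 200;          # results returned by Search Engine b - hypothetical
my $sa = 50_000_000;   # total pages indexed by Search Engine a - hypothetical

my $pa = $n0 / $nb;    # estimated fraction of the Indexable Web covered by a
my $N  = $sa / $pa;    # estimated size of the Indexable Web

printf "pa = %.2f, estimated Indexable Web size N = %.0f pages\n", $pa, $N;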
The fact that this study was limited to the “Indexable Web” poses a problem. Search engine results are not random and are often skewed, since pages can be submitted to the engines rather than simply being found by a spider program. As a result, search engines tend to favour more popular pages. In a scientific study of the size of the Web we would prefer truly random samples, rather than pages already chosen for being popular. Search engines also have the disadvantage of being unable to access large numbers of pages, such as those that require form input in order to be viewed.
Lawrence and Giles’ technique is equivalent to Peterson’s method for population estimation.
Peterson’s method involves capturing individuals, marking them, and then releasing them. Individuals are
then recaptured and checked for marks. The formula used for population estimation is identical to the one
used by Lawrence and Giles. In the latter work, the pages indexed by search engine a are the marked
individuals. The documents returned by engine b are the individuals found during the re-capture phase.
Krebs [Krebs, 1999] points out several problems with Peterson’s method. The first is that it tends to overestimate the actual population, especially with small samples. Also, random sampling is critical. As
described above, this is not the case with search engines.
It is possible that the “Indexable Web” could be similarly mapped using the databases of Domain
Name Servers, where each domain entry could be considered a web server. Unfortunately this approach
also has the problem of not considering those sites that may not want to be found (and so have not
registered a domain name), as well as virtual domains, which do not have real entries in the DNS database.
3. WHAT WE ARE MEASURING
IP addresses are 32-bit numbers, each of which corresponds to a network interface somewhere on the Internet. They are broken up into four octets, which are commonly represented in dotted decimal notation as aaa.bbb.ccc.ddd. Each octet contains eight bits of binary information and can thus be represented as a decimal number between 0 and 255. This scheme is known as IPv4 (version 4 of the Internet Protocol) and gives us 4,294,967,296 possible addresses. Not all of these addresses are useable,
however, because there are several ranges of addresses that are reserved for special purposes. Moreover,
the addresses are divided into network classes, and additional addresses may be reserved depending on
what class of network the address belongs to.
The network class is determined by the first octet. Addresses starting with 0, 127, and anything
over 223 are all reserved for special purposes. If the first octet is a number between 1 and 126, then the
address belongs to a Class A network. If the first octet of the address is between 128 and 191, then it
belongs to a Class B network. Finally, if the first octet of the address lies between 192 and 223, then it
belongs to a Class C network.
In a Class A network the first octet (first 8 bits) of the address is used to identify the network,
while the remaining three octets (last 24 bits) are used to identify the host. We therefore have 126 Class A
networks. Each of these networks can have up to 16,777,216 hosts. Likewise, in a Class B network the
first two octets (first 16 bits) of the address are used to identify the network and the remaining two octets
(last 16 bits) identify the host, and in a Class C network, the first three octets (first 24 bits) of the address
identify the network, while the remaining octet (last 8 bits) identifies the host. For example, the IP address
of Acadia’s Dragon server is 131.162.200.56. 131 corresponds to a Class B network, the first two octets
131.162 identify Acadia’s network, while the remaining octets, 200.56, identify the host (i.e. Dragon).
There are 16,384 Class B networks, each with 65,536 hosts. For Class C, there are 2,097,152 networks,
each with 256 hosts.
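As an illustrative sketch, the classful split described above can be computed from the first octet; the helper below is ours and not part of the study's software:

# Split a classful IPv4 address into its network and host parts,
# following the first-octet rules described above.
use strict;
use warnings;

sub classify {
    my ($ip) = @_;
    my @o = split /\./, $ip;
    return ('A', $o[0],               "$o[1].$o[2].$o[3]") if $o[0] >= 1   && $o[0] <= 126;
    return ('B', "$o[0].$o[1]",       "$o[2].$o[3]")       if $o[0] >= 128 && $o[0] <= 191;
    return ('C', "$o[0].$o[1].$o[2]", $o[3])               if $o[0] >= 192 && $o[0] <= 223;
    return ('reserved', '', '');
}

my ($class, $network, $host) = classify('131.162.200.56');
print "Class $class, network $network, host $host\n";   # Class B, network 131.162, host 200.56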
As mentioned above, certain addresses are reserved. The first octet cannot contain all 0’s or 1’s (0
or 255 in decimal notation). Nor can it be 127 as this is used for loopback. Also, the first octet cannot be a
decimal value higher than 223. Values between 224 and 255 consist of the Class D and E networks, but
these are experimental and are not currently useable. Aside from the restrictions on the first octet, the host
address cannot contain all 0’s, or all 1’s (0 or 255 in decimal notation). The octets corresponding to the host
address vary depending on the class of the network. In addition to the above restrictions, the ranges 10.(0 - 255).(0 - 255).(0 - 255), 172.(16 - 31).(0 - 255).(0 - 255), and 192.168.(0 - 255).(0 - 255) are reserved for
private networks and are not available for routing over the Internet. We are therefore left with
3,702,423,846 unique available IP addresses.
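This figure can be reproduced directly from the restrictions just listed; the sketch below assumes the all-zero and all-one host restrictions apply to every Class A, B, and C network, and avoids double-counting the host-reserved addresses that fall inside the private ranges:

# Count usable IPv4 addresses by subtracting the reserved ranges
# described above from the full 32-bit address space.
use strict;
use warnings;

my $total = 2 ** 32;                                    # 4,294,967,296 addresses

my $first_octet   = 34 * 2 ** 24;                       # first octet 0, 127, or 224 - 255
my $private       = 2 ** 24 + 16 * 2 ** 16 + 2 ** 16;   # 10/8, 172.16/12, 192.168/16
my $host_reserved = 2 * (126 + 16_384 + 2_097_152);     # all-0 and all-1 host parts per network
my $double_count  = 2 * (1 + 16 + 256);                 # host-reserved addresses already inside the private ranges

my $usable = $total - $first_octet - $private - $host_reserved + $double_count;
print "$usable usable addresses\n";                     # prints 3702423846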
4. Population Sampling by Quadrats
A straightforward method of determining an organism’s population in a certain area is to count it.
If the population we are attempting to measure is too large to count in total then we may consider counting
the organisms in subsections of the total area and using that to help us estimate a total value. In the real
world this can be done using quadrat counts. A quadrat is a fixed size area that is surveyed. By surveying
multiple quadrats and using a formula to estimate distribution probabilities, we can estimate the total
population size. In our case, the total area is the Internet, a quadrat is a set of IP addresses, and the
population we wish to estimate is web servers.
There are only two basic requirements to be met when using quadrat count sampling techniques.
First, the total area we are considering is known. The total size of the ‘Net was previously calculated to be
3,702,423,846 excluding reserved address values and 4,294,967,296 including reserved address values – so
that quantity is known. Second, it is necessary that the population we are sampling remains relatively
immobile during the sampling period. This is also mostly true. Although we certainly cannot confirm it for
every web server on the Net, it is common for web servers to remain at a fixed IP address in order to make
best use of the DNS service (Domain Name Servers can take a long time to propagate a change through all
of their system, so for a web server to be persistently accessible using a domain name, it needs to have a
static IP address).
We need to decide on the size of a quadrat. There are formulas available to help determine the
optimal size and shape of the quadrats based on obstacles such as distance and cost. Since these obstacles
are not applicable to the Internet, we arbitrarily decided to use quadrats that consist of 256 unique IP
addresses.
Originally, the quadrat scanning software was written to scan subnets of the form a.b.c.(0 - 255),
where a, b, and c were randomly generated numbers between 0 and 255. This seemed to be an intuitive
solution, but quadrats are ideally long and narrow, so that they are able to provide a larger cross section of
the environment, while lacking the homogeneity of a more localized area. IP ranges of the form a.b.c.(0 - 255), with every address in the same subnet, seemed to violate that principle, since such ranges are commonly owned by a single person or group and are subject to exactly the homogeneity we were trying to avoid. In order to correct
this, the software was altered to scan blocks of IP addresses of the form (0 - 255).a.b.c, where a, b, and c
are still randomly generated numbers between 0 and 255, but the quadrats would span entire networks,
virtually ensuring that no two addresses in a single quadrat would be owned by the same person and also
that no two addresses would belong to the same subnet.
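A minimal sketch of how such a quadrat can be generated (illustrative code, not the actual scanner):

# Build one quadrat of the form (0 - 255).a.b.c with randomly chosen
# values for the last three octets, as described above.
use strict;
use warnings;

my ($x, $y, $z) = (int rand 256, int rand 256, int rand 256);
my @quadrat = map { "$_.$x.$y.$z" } 0 .. 255;

print "Quadrat n.$x.$y.$z contains ", scalar @quadrat, " addresses\n";
print "first: $quadrat[0]   last: $quadrat[-1]\n";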
Changing the shape of the quadrats had the added benefit of making the scanning software seem
less like an attack on a given subnet. If a network administrator were to see 256 unsolicited probes on
his/her network, even if they were only on port 80 (the standard HTTP port), he/she could become
concerned. This was something we wished to avoid, and the new quadrat shape, in addition to offering a
better cross section and eliminating the homogeneity concern, ensured that an individual subnet would only
receive one port 80 probe.
Unfortunately, by changing the quadrat shape we introduced a new problem. Since the quadrat
increments across the first octet of the address n.a.b.c, any reserved addresses which are identified solely
by the first octet (as is the case with the entire Class D and Class E networks) will appear within every
quadrat in the same relative position. This is not particularly troubling since the distribution of web servers
can still be considered random outside of these restricted areas. They can never exist, however, in those
locations where the address is rendered invalid solely by the first octet. We allowed this for our purposes,
and although we recorded these addresses as invalid, in our final estimate they were treated as unrestricted
addresses that simply did not have a web server. Alternatively, we could have omitted these addresses
from our results entirely.
IP addresses are uniquely suited to quadrat counts. There is no standard address at which to place a
web server within a subnet, so occurrences of a web server are random. Moreover, there is no correlation
between the subnets. One could infer that the concept of ‘nearness’ exists within a subnet, in as much as all
the addresses will often be owned by a common entity, but there are no similar concepts between the
different subnets. By taking a vertical cross section as our quadrat rather than a horizontal one, we eliminate virtually all homogeneity from our sample: although there may be some relationship among the addresses within a subnet, there is none between subnets, so the distribution within our quadrat is effectively random. By incrementing across the first octet, we are taking the widest possible leap across the IP addressing system (consecutive addresses in a quadrat, n.a.b.c and (n + 1).a.b.c, are separated by 256 * 256 = 65,536 subnets).
When sampling in the real world, biases can arise and organisms can be missed altogether. Biases occur when an individual is observed to lie only partially within the quadrat being surveyed: the biologist is naturally inclined to include it in the results rather than ‘waste’ data. This can
lead to greedy estimates that may overestimate total populations. With IP addresses there is no ‘half way’,
and we don’t need to worry about a web server being only partially within the quadrat we are studying. An
address either lies inside the specified range, or it does not and will not be considered. We have crisp
boundaries between quadrats.
The other concern is the possibility of missing an individual within the area being studied. This
case can occur while sampling IP addresses. Using quadrat-scanning software however, it is less a case of
“missing” a web server and more a case of the web server not replying to our attempt at communication in
a timely manner. Nevertheless, with a reliable and fast ‘Net connection and a fair timeout value, this small
chance of error can be minimized. Other individuals we may miss, such as web servers that are not
listening for requests, or those that are being blocked by firewalls, are not considered in this study as they
are not publicly accessible. Since they cannot be detected or utilized by the majority of the Net’s users, we
do not treat them as part of the true World Wide Web.
Since the distribution of web servers across the Net is random, we can use the “Poisson
Distribution Method” to estimate the probability of encountering a given number of web servers within a
quadrat. This method is simple because the only parameter we are concerned with is the mean. Using the
Poisson method we can compute the probability of finding 0, 1, 2, …, n individuals within a quadrat, and
we can consider individuals to be either web servers, non-web servers, or invalid addresses. We use the
Poisson distribution Px = e^(-μ) * μ^x / x!, where:
Px = the probability of observing x organisms in a quadrat
x = an integer counter (0, 1, 2, 3, …)
μ = the mean of the distribution
e = the base of natural logarithms (2.71828…)
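The distribution is straightforward to evaluate; as an illustrative sketch, the mean used below anticipates the 1.11 web servers per quadrat reported in the results:

# Poisson probability of observing x individuals in a quadrat,
# using the formula given above.
use strict;
use warnings;

sub factorial {
    my ($n) = @_;
    my $f = 1;
    $f *= $_ for 1 .. $n;
    return $f;
}

sub poisson {
    my ($x, $mu) = @_;
    return exp(-$mu) * $mu ** $x / factorial($x);
}

my $mu = 1.11;   # mean number of web servers per quadrat (see the results below)
printf "P%d = %.4f\n", $_, poisson($_, $mu) for 0 .. 4;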
The Poisson distribution assumes that all quadrats are equally likely to contain individuals, which
is why a random distribution is necessary. It is through its near perfect randomness and the ease with
which it can be surveyed that the ‘Net distinguishes itself as being ideal for this form of population
sampling.
5. Implementation
In order to survey a great number of quadrats in a short amount of time, we decided that a
computer program would conduct the actual surveying, store the results, and analyze them dynamically.
All programs were written in Perl, and most output is in the form of HTML files, which allows the programs to be executed and the results to be viewed through a web-based front end (shown in Figure 2). This front
end displays a list of scanned quadrats on the left, from which one can view the detailed results for a given
quadrat. It also prints out the total number of quadrats it has scanned, the total number of addresses within
those quadrats, and the total number of open, closed, and invalid addresses it has encountered. Inside
another table it prints dynamically calculated values, such as the estimated number of open, closed and
invalid addresses on the entire Net as well as a bar graph displaying the probabilities of encountering 0, 1,
2, 3 or 4 web servers within a quadrat. At the bottom of the page are links for starting new instances of the
quadrat-scanning program and for checking the integrity of the currently stored quadrat counts.
Figure 2. The web-based front end
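The probe at the heart of the scan can be sketched as follows. This is an illustration of the kind of test ipasat.pl performs (a TCP connection attempt to port 80 with a timeout), not a reproduction of the program itself; the example address and 5-second timeout are taken from elsewhere in the paper:

# Probe a single address on port 80: report it as open if the connection
# is accepted, closed if nothing answers before the timeout expires.
use strict;
use warnings;
use IO::Socket::INET;

sub probe {
    my ($ip, $timeout) = @_;
    my $sock = IO::Socket::INET->new(
        PeerAddr => $ip,
        PeerPort => 80,
        Proto    => 'tcp',
        Timeout  => $timeout,
    );
    return 0 unless $sock;   # no reply in time: count the address as closed
    close $sock;
    return 1;                # connection accepted: a web server is listening
}

my $ip = '131.162.200.56';   # example address only
print "$ip is ", probe($ip, 5) ? 'open' : 'closed', " on port 80\n";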
6. Results
The quadrat-scanning program was allowed to complete 100 runs for a total of 100 quadrats, or
25600 total IP addresses. Although this is a large number, it is still extremely small relative to the entire
‘Net, and is in fact only approximately 0.0006% of the total number of possible IP addresses available on
the Internet.
Of these 25600 addresses, 111 (approximately 0.43%) allowed us to form a connection on port 80, which means a web server was running at that address. 21980 (approximately 85.86%) of those addresses did not reply to a connection request on port 80, implying that there were either no web servers at the interfaces identified by those addresses, or no interfaces at those addresses at all. Additionally, 3509 (approximately 13.71%) addresses were identified as invalid by the rules governing reserved addresses.
The mean number of web servers found per quadrat was 1.11. By multiplying this by the total
number of possible quadrats (256 * 256 * 256 = 16,777,216) we get an estimate of approximately 18,622,710 web servers across the entire Internet. Meanwhile, the mean number of non-web servers found per quadrat was 219.8. This value yields an estimate of approximately 3,687,632,077 non-web servers across the entire Internet. Finally, the mean number of invalid addresses found per quadrat was 35.09, yielding an estimate of 588,712,509 invalid addresses across the entire Internet.
The number of estimated invalid addresses (588,712,509) was compared to the known value of 592,543,450. The difference was taken to be 3,830,941, which was then calculated as a percentage of the total ‘Net – approximately 0.089%.
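These estimates, and the accompanying error check, amount to straightforward arithmetic; a sketch:

# Scale the per-quadrat means up to Net-wide estimates and compare the
# estimated number of invalid addresses to the known value.
use strict;
use warnings;

my $quadrats = 256 ** 3;                        # 16,777,216 possible quadrats

printf "web servers:       %.0f\n", 1.11  * $quadrats;   # ~18,622,710
printf "non-web servers:   %.0f\n", 219.8 * $quadrats;   # ~3,687,632,077
printf "invalid addresses: %.0f\n", 35.09 * $quadrats;   # ~588,712,509

my $known_invalid = 2 ** 32 - 3_702_423_846;    # 592,543,450 reserved addresses
my $diff          = $known_invalid - 35.09 * $quadrats;
printf "difference: %.0f addresses (%.3f%% of the address space)\n",
       $diff, 100 * $diff / 2 ** 32;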
The results computed also included the Poisson distribution estimate for the total number of web
servers on the ‘Net. By using a linear spline to estimate the values between our actual points (where x
values are the number of web servers we may find in a quadrat, and y values are the probability that the
number will occur), we are able to plot a graph showing the likelihood of discovering a given number of
individuals within a single quadrat. The probability curve for discovering a specified number of web
servers within a quadrat is illustrated in Figure 3.
Figure 3. Probability curve for the number of web servers within a quadrat
This curve peaks around the values 0, 1, and 2, leading us to conclude that we have a high
probability of discovering 0, 1, or 2 web servers in a quadrat. After that, the probabilities drop to zero
quickly, indicating that it is highly unlikely to find more than 3 or 4 web servers within a single quadrat.
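A linear spline of this kind is simply straight-line interpolation between adjacent (x, Px) points; a minimal sketch, where the probabilities shown are approximate Poisson values for a mean of 1.11:

# Linear interpolation between the discrete Poisson points (x, Px),
# used only to draw a continuous-looking probability curve.
use strict;
use warnings;

sub interpolate {
    my ($x, $points) = @_;           # $points: hashref mapping integer x => Px
    my $lo = int $x;
    my $t  = $x - $lo;
    return $points->{$lo} if $t == 0;
    return (1 - $t) * $points->{$lo} + $t * $points->{$lo + 1};
}

my %p = (0 => 0.33, 1 => 0.37, 2 => 0.20, 3 => 0.08, 4 => 0.02);   # approximate values
printf "P(1.5) is roughly %.3f\n", interpolate(1.5, \%p);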
Because the numbers involved were too large to handle in Perl, we used Maple to compute the corresponding probability curves for the likelihood of finding non-web servers and for the likelihood of finding invalid addresses within a quadrat. Note that we must be careful in what we infer from these
results, particularly with the invalid addresses, as they are not truly random, and so are not ideal for use
with the Poisson distribution method.
The value of approximately 18.6 million web servers seems reasonable given the rapid growth of
the Internet in recent years. Of course, this does not take into account that many servers do not actually
have any content behind them, or that some servers may in fact be masquerading as multiple servers
through virtual hosts or the use of additional ports.
7. Conclusion
Although only 100 out of a possible 16,777,216 quadrats were scanned, the quadrat counting
method of estimating the size of the World Wide Web appears reasonable. An estimate of 18.6 million web servers seems neither too high nor too low, and is probably at least as valid as any estimate a previously devised method could provide. By repeating this study at given time intervals, we would be
able to make further estimates on the growth of the Web relative to time, and by increasing the level of
information gathered by our software, we would be able to speculate on the distribution of server types
across the Web. Testing to determine how well the Poisson distribution fit our data, and to establish
confidence limits (to determine how many quadrats will provide a good estimate) would also increase the
validity of the study.
If future research were to be conducted in this area, it may be beneficial to change the nature of the quadrats yet again. Originally, our problem was that the quadrats we had chosen were of the form a.b.c.(0 - 255). This allowed a high level of homogeneity within the quadrats, and it was possible to stumble upon a
server farm subnet as a quadrat (giving us an extremely abundant quadrat), while another quadrat could be
a subnet composed entirely of unused addresses (giving us an extremely sparse quadrat). Additionally,
quadrats of this “shape” meant that all of our probes would be received by the same subnet and could be
interpreted as an attack by a zealous System Administrator.
To counteract these problems we changed the quadrats to be of the form (0 - 255).a.b.c. This
eliminated concerns engendered by homogeneity and the problem of sending 256 unsolicited port 80
requests to a single subnet. But it also introduced a new problem. Since we were iterating through the first
octet to form our quadrat, any reserved IP addresses determined by the first octet (such as when the first
octet is 0, or greater than 223) would appear in all of our quadrats. This is a problem because it detracts
from the random nature of our web server distribution (since a web server can never be found within these
ranges).
One possible solution to this situation is to iterate along the second or third octet (quadrat
addresses of the form a.(0 - 255).b.c and a.b.(0 - 255).c respectively) rather than the first. The second octet is a better choice than the third, since the third would still yield homogeneous quadrats in some Class A networks. By defining our quadrat in terms of the second octet, we still avoid homogeneity within a quadrat, as well as prevent ourselves from appearing to launch an attack against a single subnet. We also prevent invalid
addresses from appearing in the same locations in all of our quadrats.
Unfortunately, this method still has problems. Instead of having the reserved addresses clumped
together in the same positions in all quadrats, we have quadrats that contain either very few invalid addresses or nothing but invalid addresses. This is because the first octet is now
fixed throughout the quadrat, and if it determines the address to be invalid, then all addresses within that
quadrat will be invalid.
A better solution would be to continue with our current scheme, but simply discount all addresses
where the first octet renders the address invalid. Thus, our quadrats would be 35 values smaller: addresses beginning with 0, 10, 127, or 224 - 255 would be eliminated from our calculations. We would then need to
compensate for this in our final estimations by adding the removed values back to the total estimates. We
would still have invalid addresses within our quadrats, but they would not appear with the frequency and
relative stability that they do in our current quadrats.
Another way the results could be improved would be to increase the timeout value when
attempting to establish a connection to port 80. We used a timeout value of 5 seconds, and if a web server
had not replied within that time then we would not consider it in our results. To be more precise and
certain that virtually all web servers are counted, a higher time-out value should be used – perhaps in the
range of 60 seconds.
Finally, it should be pointed out that although being able to estimate the number of web servers on
the Internet is one way of estimating the total size of the Web, it does not take other factors, such as the
content of those servers, into account.
It is also possible that there are a sizeable number of web servers running on ports other than 80,
but these were also not identified by our software. The scanning software would also be rendered
incapable of detecting a web server on any machine behind a firewall that was blocking port 80. As a result
of these facts our estimate is likely to be a lower bound on the actual web server population of the Internet.
In the worst case, the ipasat.pl program takes 5 seconds per address to determine whether that address is open, closed, or invalid. Multiplying that by 256 (the number of addresses within a quadrat), the longest a run through a quadrat can take is 1280 seconds, or a little more than 21 minutes. Of course, it never actually takes this long in its current form, because 35 of those addresses are certain to be invalid and are identified almost instantly. Therefore, if we were to run this program continuously to scan the total number of possible quadrats (256 * 256 * 256 = 16,777,216), it would take 21,474,836,480 seconds, or approximately 681 years. With a higher timeout value this would take even longer. But what if we had an entire server farm running this program and gathering results? What if the program could query other IP addresses while waiting for another to reply? And what if each server could run multiple copies of ipasat.pl day and night? It may be that we could actually conduct a census of the entire Internet in a reasonable amount of time, and not have to do any estimation at all!
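For reference, the worst-case census arithmetic above works out as follows; the sketch ignores the addresses in each quadrat that are rejected instantly as invalid:

# Worst-case time for an exhaustive census with a 5-second timeout
# on every probe, as discussed above.
use strict;
use warnings;

my $timeout  = 5;                        # seconds allowed per unanswered probe
my $per_quad = $timeout * 256;           # 1280 seconds per quadrat
my $quadrats = 256 ** 3;                 # 16,777,216 quadrats
my $seconds  = $per_quad * $quadrats;    # 21,474,836,480 seconds
my $years    = $seconds / (60 * 60 * 24 * 365);

printf "%.0f seconds, or about %.0f years\n", $seconds, $years;   # ~681 years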
The ability to estimate the size of the Web using quadrat counts will become even more important in the next few years, when the new IPv6 protocol is rolled out to replace the increasingly inadequate IPv4. Despite the apparent vastness of the Internet, there is a growing shortage of available IP addresses, primarily due to the large number of non-computers now using IP addresses, such as cell phones and Internet appliances. IPv6 retains the basic structure of IPv4 but uses 128-bit (16-octet) addresses instead of the traditional 32-bit (4-octet) ones, giving approximately 3.4 x 10^38 possible addresses (including reserved addresses). For comparison's sake, using ipasat.pl to scan every possible quadrat in an address space that large would take not centuries but a span of time many orders of magnitude beyond the roughly 681 years required under the current format. This means the ability to estimate the size of the Web using smaller sample portions (such as quadrats) will become more and more important as the Internet becomes too large to feasibly quantify in any other way.
ACKNOWLEDGEMENT
The second author is supported by an NSERC research grant.
REFERENCES
Adamic, L. and Huberman, B., 2001. “The Web’s Hidden Order” Communications of the ACM,
September 2001/Vol. 44, No. 9.
Berners-Lee, T., 2000. “Weaving the Web” HarperCollins.
Giles, C.L. and Lawrence, S., 1998. “Searching the World Wide Web” Science Magazine, April 3, 1998
Volume 280.
Giles, C.L. and Lawrence, S., 1999a. “Searching the Web: General and Scientific Information Access”
IEEE Communications Magazine, January.
Giles, C.L. and Lawrence, S., 1999b. “Accessibility of Information on the Web” Nature, July 8, 1999, Volume 400.
Hildrum, J., 1999. “Jon Hildrum’s IP Addressing Page” Web Site, http://www.hildrum.com/IPAddress.htm.
Krebs, C., 1999. “Ecological Methodology, 2nd Edition” Addison-Wesley Educational Publishers, Inc.
Moschovitis, C, et al, 1999. “History of the Internet” ABC-CLIO, Inc.