The World Wide Web (or simply the Web) started as a tiny collection of several dozen web sites about 20 years ago. Since then, the number of Web pages has grown tremendously, and the Web itself has become quite segregated. There is a well-lit part of it, the surface Web, which is indexed by search engines, and there is the so-called deep Web, which is studied only slightly better than deep outer space.
How many pages are on the surface? According to some measurements, several dozen billion pages are indexed. Were all of these pages created manually by humans? It is possible, but I doubt it. There are about 100 million books written by humans. Let us assume that a book has 100 pages, each of which is published as a separate HTML page. This would give us only 10 billion pages. I think that during the 20 years of the Web's existence, the number of manually created pages could hardly have surpassed this threshold. Consequently, it is not unreasonable to assume that most Web pages were generated automatically, e.g., for spamming purposes (two common generation approaches are scraping/mirroring content from other web sites and generating gibberish text algorithmically).
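For the curious, here is that back-of-the-envelope calculation spelled out (the 100 million books and the 100 pages per book are, of course, just the assumptions stated above):

```python
# Back-of-the-envelope check of the "manually created pages" estimate.
# Assumptions from the text: ~100 million books, ~100 pages per book,
# each page published as a separate HTML page.
books = 100_000_000
pages_per_book = 100
manually_created_upper_bound = books * pages_per_book
print(f"Rough upper bound on manually created pages: {manually_created_upper_bound:.1e}")  # ~1e10, i.e., 10 billion
```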
OK, but what is the size of the deep Web? Six years ago, Google announced it knew about a trillion Web pages. Assuming that the Web doubles each year, the size of the deep Web should be in the dozens of trillions of pages right now. This is supported by a more recent Google announcement: there are at least 60 trillion pages lurking in the depths of the Web!
What constitutes this massive dataset? There are allegations that the deep Web is used for all kinds of illegal activity. Well, there is definitely some illegal activity going on there, but I seriously doubt that humans could have manually created even a tiny fraction of the deep Web directly. To make this possible, every person on the planet would have had to create about 10 thousand Web pages. This would be a tremendous enterprise even if each Web page were just a short status update on Facebook or Twitter. Anyway, most people probably write status updates once a year, and not everybody is connected to the Web either.
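To make the arithmetic behind the last two paragraphs explicit, here is a small sketch (the yearly doubling and the roughly seven billion people are just my illustrative assumptions):

```python
# Sketch of the estimates above: growth by doubling and pages per person.
# Assumptions: ~1 trillion known URLs in 2008, the Web doubles every year,
# and roughly 7 billion people (not all of them online).
known_2008 = 1e12
years = 6                      # 2008 -> now
deep_web_now = known_2008 * 2 ** years
print(f"Projected size after {years} doublings: {deep_web_now:.0e} pages")  # ~6e13, i.e., ~60 trillion

population = 7e9
pages_per_person = deep_web_now / population
print(f"Pages per person if all were hand-made: ~{pages_per_person:.0f}")   # ~9000, i.e., about 10 thousand
```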
Therefore, I conclude that the deep Web should be mostly trash generated by (presumably) spamming software. Any other thoughts regarding the origin of so many Web pages?
Comments
It might be the case that a document view is not appropriate for the deep web. For instance, a form that lets you sort a listing of records can lead to a different URL with exactly the same content ordered in a different fashion (say the records are paginated across 3000 pages; the ability to sort by date, time, votes, or whatever multiplies the number of pages serving the exact same content). You can (possibly) have a permalink for every blog comment, Facebook post, etc. Some weird combination of query terms will lead to yet another set of documents.
Since the search engine has a document-centric view of the web, it is going to count each of these views as different. A much better estimate could be the number of rows in SQL tables on the entire internet (maybe).
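A toy sketch of this effect (the 3000 pages and the handful of sort options are just the hypothetical numbers from above):

```python
# Toy illustration: how many distinct URLs can serve the same set of records?
# The numbers are the hypothetical ones from the comment above.
from itertools import product

pages = range(1, 3001)                     # 3000 result pages
sort_keys = ["date", "time", "votes"]      # sortable columns
orders = ["asc", "desc"]                   # sort directions

urls = {
    f"/records?page={p}&sort={k}&order={o}"
    for p, k, o in product(pages, sort_keys, orders)
}
print(len(urls))  # 18000 distinct URLs, all backed by the same table rows
```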
Thank you, this is a good observation. For some not-so-well-organized Web sites, the number of duplicates can be tremendous. There is a paper on how to crawl such forms carefully, but, I admit, I haven't read it yet. So, it may be the case that, despite all the precautions, a search engine can't properly crawl the forms.
I monitor pages containing certain terms ... which is how I just found your entry :)
I studied this question (i.e., the size of the deep web) during my PhD and the first years of my postdoc. So here is my input:
- A trillion web pages known to Google in 2008: first of all, it is not a trillion of _web pages_ but a trillion of _links_ that Google knew at that time. My educated estimate is that Google actually indexed 50-100 billion pages at that time. That is, it indexed some number below 100B pages, and the number of non-visited outgoing links from these pages was around, say, 900B (because #known links = #known visited links + #known non-visited links). Even with these rough figures, translating link counts into a number of web pages is far from trivial. One problem is that the same web page can be accessed via several distinct links. Then there are duplicate (or near-duplicate) web pages, each potentially accessible via several distinct URLs. And so far I am talking about "meaningful" web pages (whose content somehow makes sense), so I assume 'spam' web pages are somehow defined and hopefully excluded. So, overall, according to my educated guess, 1000B links in 2008 actually mean fewer than 100B indexed web pages and 200-300B non-indexed web pages out there (the latter estimate can vary greatly depending on the techniques used for identifying web spam pages and duplicate pages).
- There are different definitions of the deep web (or, as it is often called, the hidden web). For instance, nowadays many people consider the deep web to be the 'anonymous' web (which might be browsed using TOR or something similar). This is really misleading, especially when they reuse estimates obtained for the 'classical' deep web (such as, size of deep web = 500 x size of surface web), since the 'classical' deep web consists of fully accessible web pages (reachable by regular browsers; not password-protected; etc.) that are simply not indexed by search engines for some reason. Anyway, in academic papers most people define the deep Web as all non-password-protected web pages that are not in the indexes of search engines. There are several types of such pages (e.g., those specified in robots.txt or sitemaps), but the most interesting (and the most massive in terms of numbers) class consists of pages behind web search forms.
- If we restrict the deep web to only the pages behind search forms (one can consider a search form a web interface to a database; these pages contain data taken from databases), then people have estimated the number of pages in the deep Web, the size of the deep Web, the number of web databases, etc. Estimating the size (i.e., comparing the # of pages in the deep web with the # of pages in the surface web, or translating these numbers into actual sizes in gigabytes/terabytes) is extremely hard to do accurately, so in my opinion there are just no reliable estimates. For example, the well-known one done in 2000-2001 (see http://quod.lib.umich.edu/j/jep/3336451.0007.104?view=text;rgn=main) is actually just not good and has many flaws. More meaningful estimates (particularly, with a much more accurate and reproducible methodology) concern the number of web databases on the Web (whose content constitutes the deep web). The most well-known one is here: http://ranger.uta.edu/~cli/pubs/2004/dwsurvey-sigmodrecord-chlpz.pdf -- it reports 450 thousand databases on the Web. It can be shown that, because of the methodology used, this number is actually an underestimate. I did more accurate (but still imperfect) measurements of the Russian Web (http://link.springer.com/chapter/10.1007%2F978-3-642-23088-2_24; slides are at http://www.slideshare.net/denshe/sampling-national-deepweb-1). There were around 18 thousand databases in the Russian Web.
- Ballpark figures for the number of web pages indexed by Google can perhaps be obtained from here: http://www.google.com/intl/en_us/insidesearch/howsearchworks/thestory/ -- it says that the index size is 10^5 TB. I guess there are some typical ratios between the size of a collection and the size of the corresponding index. I actually have no idea, but if I use a ratio of 3 (the index is three times bigger than the collection) and assume an average web page size of 100 kB, then 10^5 TB / 3 / 100 kB is ~300B web pages.
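A small sketch of that calculation (the ratio of 3 and the 100 kB average page size are just my guesses above; other ratios can be plugged in):

```python
# Estimate the number of indexed pages from the reported index size.
# Assumptions from the comment: index-to-collection ratio of ~3,
# average web page of ~100 kB.
def indexed_pages(index_size_tb=1e5, index_to_collection_ratio=3, avg_page_bytes=100e3):
    collection_bytes = index_size_tb * 1e12 / index_to_collection_ratio
    return collection_bytes / avg_page_bytes

print(f"~{indexed_pages():.1e} pages")                        # ~3.3e11, i.e., ~300B
print(f"~{indexed_pages(index_to_collection_ratio=5):.1e}")   # ~2e11 with a ratio of 5
print(f"~{indexed_pages(index_to_collection_ratio=10):.1e}")  # ~1e11 with a ratio of 10
```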
Hi Denis,
Thank you for the in-depth answer. Interestingly, I had not realized that we don't have decent estimates for the # of pages behind forms. It seems that we have a consensus here (as backed up by Shriphani's opinion).
Regarding the size of the index: a well-compressed copy of the collection would take up only a fraction of the original HTML, say 50%. However, it is necessary to replicate the index multiple times. How many times? It is hard to tell, especially because the index is likely to be multi-tiered, so first-tier pages enjoy more redundancy. Anyway, it is not unreasonable to assume a ratio of 5-10.
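To spell out the arithmetic: if a single copy takes about 50% of the raw collection, the assumed overall ratios simply imply a certain replication factor (a rough sketch under these assumptions):

```python
# Implied replication under the assumptions above:
# overall index-to-collection ratio = (per-replica storage as a fraction of the raw collection) x replicas.
per_replica_fraction = 0.5       # well-compressed copy, ~50% of the raw HTML
for overall_ratio in (5, 10):    # the assumed overall ratios
    replicas = overall_ratio / per_replica_fraction
    print(f"an overall ratio of {overall_ratio} implies ~{replicas:.0f} replicas")  # 10 and 20
```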
@Leo,
This is a great post. I did some work at UT on the deep web. Essentially, we were trying to figure out how many cars Yahoo Cars and other car-search retailers have, in order to assess a distribution over their make-year, company, etc. And what Shriphani says is quite true: you really should count rows in the DB (if you can actually get/create such a DB).
@Denis,
Thanks for posting. This is very informative.
@Abhi, yes, this makes sense. However, it is probably hard to count rows in a database. This may require some cooperation on the part of the DB owners.
Yep. I don't think there is any other way.
Yes, there are no decent techniques for estimating the # of pages behind search forms. To the best of my knowledge, the only option is to do it manually. Ideas for automation could perhaps be taken from this project: http://qprober.cs.columbia.edu/ (done ~12 years ago).
Though, in fact, for evaluating the size of the deep web you might be OK with just rough estimates (orders of magnitude). I did this here https://www.dropbox.com/s/tfnj8cijhgpnpe3/191_FinalVersion.pdf for the Russian segment. Basically, as of 2006, my estimate for the number of large deep web sites (sites with search forms leading to more than 10^6 records) was 1500-2000. Interestingly, only one out of ten deep web sites was 'large', and eight out of ten were 'small' (giving access to fewer than 10^5 records). Anyway, I got ~20*10^9 entities accessible via search forms in the Russian Web as a very solid upper bound. In fact, based on a more careful analysis of large deep web sites, I managed to squeeze this upper bound down to 10*10^9 entities. For a number of domains, an entry in a database roughly corresponds to a regular web page (say, a record describing a car entity in a 'car' database leads to one regular-size web page with the car details), so technically a direct comparison with the size of the indexed web makes some sense. And the comparison is as follows: there were ~10^9 pages of the Russian Web indexed by Yandex vs. at most 10*10^9 entities (each potentially corresponding to a web page). The conclusion is that, at least for the Russian Web, the size of the deep web is comparable to that of the indexed part. At least, I saw no evidence of even a one-order-of-magnitude difference in sizes (contrary to the widely used estimate stating that the deep web is 500x the size of the indexed web).
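As a rough consistency check of these figures (all numbers are taken from this thread; treating web databases and deep web sites interchangeably is only approximately right):

```python
# Rough consistency check of the figures quoted in this thread (Russian Web, ~2006).
# Note: 'web databases' and 'deep web sites' are treated as roughly the same thing here.
total_databases = 18_000          # estimate quoted earlier in the thread
large_share = 0.1                 # one out of ten deep web sites is 'large' (>10^6 records)
print(f"Expected number of large sites: ~{total_databases * large_share:.0f}")  # ~1800, within the 1500-2000 range

indexed_pages = 1e9               # pages of the Russian Web indexed by Yandex
for upper_bound in (20e9, 10e9):  # initial and refined upper bounds on deep web entities
    print(f"Deep-to-indexed ratio: at most {upper_bound / indexed_pages:.0f}x (nowhere near 500x)")
```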
Index vs. collection size: OK, I see. Well, if the ratio is 5, then it is 200B web pages, or 100B pages with a ratio of 10. Some number within 100-300B indexed pages sounds quite reasonable to me.
Just returning to the one-trillion-links (known to Google) discussion: after that initial post there were a number of follow-up posts, e.g., http://techcrunch.com/2008/07/25/googles-misleading-blog-post-on-the-size-of-the-web/ -- it says 40B web pages in 2008. Then, as I just recalled, at that time there was a newly launched search engine, http://en.wikipedia.org/wiki/Cuil, that claimed 120B docs in its index. Plus, as you may check in the articles about it, the Cuil guys (including a former Google employee, meaning that at least one of them could know something) said that their index might be three times larger than Google's, so it is again something like 40B pages.