The World Wide Web (or simply Web) started as a tiny collection of several dozen web sites about 20 years ago. Since then, the number of Web pages grew tremendously and became quite segregated. There is a well-lit part of it, a surface Web, which is indexed by search engines, and there is a so-called deep-web, which is studied only slightly better than the outer deep space.
How many pages are on the surface? According to some of the measurements, there are several dozens of billions pages indexed. Were all of these pages created by humans manually? It is possible, but I doubt it. There are about 100 million books written by humans. Let us assume that a book has 100 pages each of which is published as a separate HTML page. This would give us only 10 billion pages. I think that during the 20 years of the existence of the Web, the number of manually created pages could have hardly surpassed this threshold. Consequently, it is not unreasonable to assume that most of the Web pages were automatically generated, e.g., for spamming purposes (two common generation approaches are: scrapping/mirroring contents from other web sites and generating gibberish text algorithmically).
Ok, but what is the size of the deep web? Six years ago, Google announced it knew about a trillion of Web pages. Assuming that the Web doubles each year, the size of the deep Web should be in the dozens of trillions of pages right now. This is supported by a more recent Google announcement: There are at least 60 trillion pages lurking in the depths of the Web!
What constitutes this massive dataset? There are allegations that the Deep Web is used for all kind of illegal activities. Well, there is definitely some illegal activity going on there, but I seriously doubt that humans could have manually created even a tiny fraction of the Deep Web directly. To make this possible, everybody would have to create about 10 thousand Web pages. This would be a tremendous enterprise even if each Web page were just a short status update on Facebook or Twitter. Anyways, most people write status updates probably once a year and not everybody is connected to the Web either.
Therefore, I conclude that the Deep Web should be mostly trash generated by (supposedly) spamming software. Any other thoughts regarding the origin of so many Web pages?