Category : Google

Google Hits One Trillion Unique URLs

2008-07-25 17:11:23

Google reached another milestone today: they are proud to announce that their systems have found 1 trillion (1,000,000,000,000) unique URLs on the web.

In 1998, they started off with 26 million indexed web pages; back then, that was enough to cement their reputation as a credible search engine in the shadow of Yahoo!. In 2000, they reached the unprecedented 1 billion mark. By that time, the internet bubble was about to pop, but they still managed to stay on track. In the eight years since, their spiders have turned up the remaining 999 billion URLs, and they're not done yet.

A trillion bytes (about 1 terabyte) is easy to picture, but a trillion web pages is overwhelming. One trillion unique web pages, each with its own unique URL. In their official announcement, they tell us how they arrived at the count and muse on how large the internet universe really is.

"We start at a set of well-connected initial pages and follow each of their links to new pages. Then we follow the links on those new pages to even more pages and so on, until we have a huge list of links. In fact, we found even more than 1 trillion individual links, but not all of them lead to unique web pages. Many pages have multiple URLs with exactly the same content or URLs that are auto-generated copies of each other. Even after removing those exact duplicates, we saw a trillion unique URLs, and the number of individual web pages out there is growing by several billion pages per day."
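To make that description a bit more concrete, here is a minimal sketch of the idea in Python: start from seed pages, follow links breadth-first, and count pages with identical content only once. This is a toy illustration, not Google's actual crawler. The `TOY_WEB` dictionary, the `crawl` function, and the `example` URLs are all made up for this post, and hashing the page content with SHA-256 is just one assumed way to spot exact duplicates.

```python
from collections import deque
import hashlib

# Hypothetical toy "web": URL -> (page content, outgoing links).
# Two URLs below serve exactly the same content on purpose.
TOY_WEB = {
    "a.example/": ("home", ["a.example/about", "a.example/index.html"]),
    "a.example/index.html": ("home", ["a.example/about"]),
    "a.example/about": ("about us", ["b.example/"]),
    "b.example/": ("other site", []),
}


def crawl(seeds, web):
    """Follow links breadth-first from the seed URLs.

    Returns (all URLs seen, one representative URL per distinct content).
    """
    seen_urls = set(seeds)
    first_url_for_content = {}  # content hash -> first URL found with it
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        if url not in web:
            continue  # link leads nowhere in our toy web
        content, links = web[url]
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        # Exact duplicates share a digest, so only the first URL is kept.
        first_url_for_content.setdefault(digest, url)
        for link in links:
            if link not in seen_urls:
                seen_urls.add(link)
                queue.append(link)
    return seen_urls, sorted(first_url_for_content.values())


urls, unique_pages = crawl(["a.example/"], TOY_WEB)
print(len(urls), "URLs seen,", len(unique_pages), "unique pages")
```

In this toy web the crawl sees four URLs but only three unique pages, because `a.example/` and `a.example/index.html` serve the same content.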

So how big is the internet really? No idea. Even Google doesn't bother to index them all, saying that many of them are irrelevant, which is actually true. Take a web calendar whose "next day" link always leads to yet another unique URL: indexing every one of those pages would only add pointless entries to an already overwhelming universe of data.

As the number of URLs has grown every day, Google's technologies and methods have grown too since 1998. They reminisce, "Back then, we did everything in batches: one workstation could compute the PageRank graph on 26 million pages in a couple of hours, and that set of pages would be used as Google's index for a fixed period of time. Today, Google downloads the web continuously, collecting updated page information and re-processing the entire web-link graph several times per day. This graph of one trillion URLs is similar to a map made up of one trillion intersections. So multiple times every day, we do the computational equivalent of fully exploring every intersection of every road in the United States. Except it'd be a map about 50,000 times as big as the U.S., with 50,000 times as many roads and intersections."
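For readers wondering what "computing PageRank" on a link graph involves, here is a small sketch of the textbook version of the algorithm in Python. It is not Google's production code. The four-page `TOY_GRAPH`, the `pagerank` function name, the damping factor of 0.85, and the fixed 50 iterations are all assumptions chosen for illustration.

```python
# Hypothetical toy link graph: page -> pages it links to.
TOY_GRAPH = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}


def pagerank(graph, damping=0.85, iterations=50):
    """Textbook PageRank by repeated iteration over the link graph."""
    pages = list(graph)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal scores
    for _ in range(iterations):
        # Every page gets a small baseline share each round.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, links in graph.items():
            if links:
                # A page splits its damped score among the pages it links to.
                share = damping * rank[page] / len(links)
                for target in links:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


ranks = pagerank(TOY_GRAPH)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page C comes out on top here because three of the four pages link to it, while D, which nobody links to, ends up last. The batch approach in the quote amounts to running a computation like this over the whole index in one go.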

It's an amazing feat to accomplish what Google has, and we're all glad they did it (except maybe for Yahoo and Microsoft, especially Microsoft).
