The emphatic post titled above, by Danny Sullivan at ClickZ, derides Yahoo's latest claim of 20 billion documents indexed and the constant attention paid to index-size claims by the top search engines. I personally don't believe Yahoo is approaching 20 billion pages indexed, as they claim, because they don't have all of the pages of my own sites or those of clients indexed, and their cache is woefully out of date. Yahoo doesn't explain where those documents reside or what value they give the Yahoo index above the other major players, just that they have more documents indexed.
Any time you look at raw server log files, you can see the crawling behavior of every search engine whose spider visits a site. My log files and those of clients show Google's spider, Googlebot, visiting on an extremely regular schedule, while Yahoo's Inktomi spider visits only sporadically and at random.
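As a rough illustration (my own sketch, not anything from Sullivan's post), a short script like the one below could tally crawler visits per day from a standard combined-format access log. The log path and the user-agent substrings are assumptions you would adjust for your own server; Yahoo's Inktomi-era crawler identified itself as "Slurp" in the user-agent string.

```python
import re
import sys
from collections import defaultdict

# Hypothetical crawler signatures; adjust to match the spiders you care about.
CRAWLERS = {"Googlebot": "Googlebot", "Yahoo Slurp": "Slurp"}

# In the combined log format the request date sits inside [..] and the
# user-agent is the last quoted field on each line.
DATE_RE = re.compile(r'\[(\d{2}/\w{3}/\d{4}):')

def tally(log_path):
    """Count crawler hits per day from a combined-format access log."""
    counts = defaultdict(lambda: defaultdict(int))  # counts[crawler][date]
    with open(log_path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            m = DATE_RE.search(line)
            if not m:
                continue
            day = m.group(1)
            for name, signature in CRAWLERS.items():
                if signature in line:
                    counts[name][day] += 1
    return counts

if __name__ == "__main__":
    for crawler, days in tally(sys.argv[1]).items():
        print(crawler)
        for day in sorted(days):
            print(f"  {day}: {days[day]} hits")
```

Run against a few months of logs, a steady day-by-day pattern for Googlebot versus sparse, irregular hits from Slurp is exactly the kind of difference described here.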
A new “study on search engine index freshness” shows the Google index to be the freshest on an ongoing basis. My own observations of spider crawling behavior on a new site over three months show the behavior of each of the major players and, as you would expect, illustrate similar findings on freshness, since a site must be crawled often and regularly to be indexed fully and to provide current results.
Freshness is a much better indicator of the worth of any particular search engine than index size, though size clearly contributes substantially to that value. Danny Sullivan, however, wants relevancy somehow quantified. Although that gets very complicated and pushes everyone toward the oft-discussed fantasy of “Latent Semantic Indexing,” it would indeed be very interesting to put hard numbers to such a squishy goal.
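For readers who haven't run into the term, Latent Semantic Indexing is essentially a truncated singular-value decomposition of the term-document matrix, so that documents can be compared in a low-rank "semantic" space rather than by raw keyword overlap. The toy sketch below is purely my own illustration with made-up documents, not a description of what any engine actually does.

```python
import numpy as np

# Tiny made-up corpus purely for illustration.
docs = [
    "search engine index size",
    "search engine relevancy ranking",
    "fresh index crawl schedule",
]

# Build a term-document matrix of raw term counts.
vocab = sorted({word for d in docs for word in d.split()})
A = np.array([[d.split().count(term) for d in docs] for term in vocab], dtype=float)

# LSI: keep only the top-k singular vectors, giving a reduced space
# in which documents that share related terms end up closer together.
k = 2
U, S, Vt = np.linalg.svd(A, full_matrices=False)
doc_vectors = (np.diag(S[:k]) @ Vt[:k]).T  # one row per document

def cosine(a, b):
    """Cosine similarity between two document vectors in the reduced space."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(doc_vectors[0], doc_vectors[1]))
```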
I agree with Sullivan and hope someone picks up that relevancy challenge gauntlet.