RSS Will Change Search Forever

Bloglines issued a proposal today to provide some limited control over whether search engines will index RSS feeds.

This would give users and sites control over what feed content is indexed by search engines, in much the same way as the robots.txt file has been used for web sites since the mid-90’s. This is something we’ve needed for a while now and it’s great to see Bloglines leading the charge.
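
To make the robots.txt comparison concrete, here is a minimal sketch of how a crawler checks those rules today, using Python's standard `urllib.robotparser`. The robots.txt content, the `examplebot` agent name, and the feed URLs are all made up for illustration, and nothing here reflects the actual format of the Bloglines proposal.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary site; the paths are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def can_index(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if the given rules allow `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in (
        "http://example.com/feeds/public.xml",
        "http://example.com/private/feed.xml",
    ):
        print(url, can_index(ROBOTS_TXT, "examplebot", url))
```

A feed-level equivalent would presumably let the publisher state the same kind of allow/deny preference inside or alongside the feed itself.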

What’s interesting to me, though, is that this is another sign of the micro formatting of content on a mass scale. From a programmer’s perspective HTML is tough to handle. Extracting content from it (and all its embedded subcomponents) is a nightmare. And if extracting the content wasn’t bad enough, think about the problem a search engine has trying to determine what’s new on any given page. When you visit a web page you have to go “pull” the content down to see if it’s new, and a search engine basically has to do the same thing. Now multiply that by a billion or so web sites. That’s a lot of checking for new content, and 99.999% of the time the content isn’t new.

But now enter RSS and the micro formatting (structuring) of content. Suddenly a search engine’s job is significantly easier. It still has to go pull the content, but instead of having to trawl through every page on a site it can simply check a feed, which is a vastly easier job. And when it does get content, it can access the search-engine-relevant pure text content much faster.
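
As a rough illustration of why a feed is easier to work with than a page, the sketch below pulls only the items published after the last visit out of an RSS 2.0 document, using nothing but the Python standard library. The sample feed, the `new_items` function, and the field names in the returned dictionaries are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# A made-up RSS 2.0 feed standing in for a real site's feed.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>http://example.com/</link>
    <item>
      <title>Newer post</title>
      <link>http://example.com/newer</link>
      <pubDate>Tue, 02 Aug 2005 09:00:00 GMT</pubDate>
      <description>Plain text of the newer post.</description>
    </item>
    <item>
      <title>Older post</title>
      <link>http://example.com/older</link>
      <pubDate>Mon, 01 Aug 2005 09:00:00 GMT</pubDate>
      <description>Plain text of the older post.</description>
    </item>
  </channel>
</rss>
"""


def new_items(feed_xml: str, last_seen: datetime) -> list:
    """Return the feed items published after `last_seen` (timezone-aware)."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for item in root.iter("item"):
        # RSS 2.0 dates use the RFC 822 format, which email.utils can parse.
        published = parsedate_to_datetime(item.findtext("pubDate"))
        if published > last_seen:
            fresh.append({
                "title": item.findtext("title"),
                "link": item.findtext("link"),
                "text": item.findtext("description"),
                "published": published,
            })
    return fresh


if __name__ == "__main__":
    cutoff = datetime(2005, 8, 1, 12, 0, tzinfo=timezone.utc)
    for entry in new_items(SAMPLE_FEED, cutoff):
        print(entry["title"], entry["link"])
```

There is no layout to strip and no guessing about which part of the page changed: each item carries its own date, link, and text.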

RSS isn’t dominant enough yet to be the primary source of information for search engines, but that time is coming: regardless of whether mainstream users accept RSS, it will become a mandatory function of web site content deployment and management. The benefits to a search engine will be significant: much faster uptake of content (freshness), to the point of near real-time; better results through more refined indexing of complete content (rather than snippet matching); and newfound powers of matching streams of content, relationships between feeds, and users’ activity. The search engine that takes advantage of this new power could well find itself with enough use-case grunt to take on Google for pure value, and it could do so with a fraction of the infrastructure. Will RSS erode Google’s infrastructure dominance by making it irrelevant?

The trick will be matching reliance on RSS (in results and in development) with its acceptance in the market. Will the winner be an RSS-based system, such as Technorati, or an existing search engine playing catch-up when the light goes on in 2 or 3 years?

And on a related note, I wonder how long it will be before we see the emergence of a notification standard for content (such as today’s feed pings) as a way of getting around the “pull” problem of determining new content. If things are 10 times easier for a search engine using RSS, they are 100 times easier if it doesn’t have to go track the content down. The problem search will be solving then will be organizing and presenting information, not finding it.
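
For a sense of what "push" looks like in practice, here is a small sketch of a feed ping expressed as an XML-RPC call, built and decoded with Python's standard `xmlrpc.client`. The `weblogUpdates.ping` method name follows the convention used by existing ping services; the helper names (`build_ping`, `read_ping`, `record_ping`) and the in-memory set of pending sites are hypothetical, and no network call is made.

```python
import xmlrpc.client

PING_METHOD = "weblogUpdates.ping"


def build_ping(site_name: str, site_url: str) -> str:
    """Build the XML-RPC request body a publisher would send after posting."""
    return xmlrpc.client.dumps((site_name, site_url), methodname=PING_METHOD)


def read_ping(payload: str) -> tuple:
    """Decode a ping request and return (site_name, site_url)."""
    params, method = xmlrpc.client.loads(payload)
    if method != PING_METHOD:
        raise ValueError(f"unexpected method: {method!r}")
    site_name, site_url = params
    return site_name, site_url


def record_ping(pending: set, payload: str) -> None:
    """Hypothetical receiver side: remember which sites reported a change."""
    _name, url = read_ping(payload)
    pending.add(url)


if __name__ == "__main__":
    pending = set()
    record_ping(pending, build_ping("Example Blog", "http://example.com/"))
    # The crawler now only needs to visit the sites that said they changed.
    print(sorted(pending))
```

The point of the sketch is the shift in workload: the search engine stops polling everything and simply works through the list of sites that announced new content.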