ACM Queue – Searching Vs. Finding – How do you help computers find the information people really want?

From the ACM’s Enterprise Search feature: William A. Woods of Sun Microsystems Laboratories discusses different methods of information retrieval, some of which are computation-intensive. As the author puts it:

“It would be possible, in principle, to apply the same kinds of semantic and morphological expansions to the entire Web, using the specific-passage-retrieval technique, but that has not been my primary target. The Web is so vast that it is difficult to predict what would happen without trying it. There would probably be more issues with word sense ambiguity, and a global conceptual taxonomy would be awe inspiring. It would be an interesting challenge. Certainly the cost would be greater than for current Web search engines and might not fit their business models.

The specific-passage-retrieval algorithm lends itself to applications of large scale, because it allows a collection to be subdivided and the search to be distributed, with the results easily collated (because the penalty scores are independent of collection statistics). In theory, this could be used for a kind of federated Web search in which owners of content could provide their own indexing and search and could update their indexes whenever the content changed. This would address a fundamental problem of Web searching: the never-ending task of repeatedly crawling the Web, trying to keep the indexes current.

It is interesting to contemplate a federation knit together by a spanning network of systems (possibly a peer-to-peer network) that distribute queries and collate the results. Some of the members of the federation could be large content providers who index their own content, whereas others could be crawler-based services like current Web search engines. Of course, this would take a heretofore untold amount of cooperation among many players that are currently fierce competitors, making this scenario perhaps nothing more than theoretical for the time being.”
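
To make the collation point concrete, here is a minimal Python sketch of why scores that are independent of collection statistics merge so easily. It is not Woods's actual algorithm: the `penalty` function is a toy stand-in I made up (one point per missing query term, plus a small charge for how far apart the matched terms sit), and the names `search_shard` and `federated_search` are hypothetical. The only property that matters here is that a passage's penalty depends on the query and the passage alone, so each content owner can score its own collection and the lists can simply be merged.

```python
import heapq
from typing import Dict, List, Tuple

# A hit is (penalty, source name, passage); lower penalty is better.
Hit = Tuple[float, str, str]


def penalty(query_terms: List[str], passage: str) -> float:
    """Toy penalty (an assumption, not the article's scoring function).

    Charges 1 for each query term missing from the passage, plus 0.1 for
    every word position separating the first-matched terms. It looks only
    at the query and the passage, never at collection-wide statistics.
    """
    words = passage.lower().split()  # naive tokenizer, no punctuation handling
    positions = [words.index(t) for t in query_terms if t in words]
    missing = len(query_terms) - len(positions)
    spread = (max(positions) - min(positions)) if positions else 0
    return missing + 0.1 * spread


def search_shard(name: str, passages: List[str],
                 query_terms: List[str], limit: int = 10) -> List[Hit]:
    """Score one owner's collection locally and return its best hits, sorted."""
    hits = [(penalty(query_terms, p), name, p) for p in passages]
    return sorted(hits)[:limit]


def federated_search(shards: Dict[str, List[str]],
                     query_terms: List[str], limit: int = 10) -> List[Hit]:
    """Send the query to every shard, then collate the sorted result lists."""
    per_shard = [search_shard(name, passages, query_terms, limit)
                 for name, passages in shards.items()]
    # Penalties are directly comparable across shards, so a plain
    # merge of the sorted lists gives the global ranking.
    return list(heapq.merge(*per_shard))[:limit]


if __name__ == "__main__":
    shards = {
        "site-a": ["federated web search engines", "cooking with garlic"],
        "site-b": ["search the federated index", "web search"],
    }
    for hit in federated_search(shards, ["federated", "search"]):
        print(hit)
```

With a ranking scheme that relies on collection-wide numbers (inverse document frequency, for example), the scores from two independently indexed sites would not be directly comparable, and the collation step would need some normalization first. That is the contrast the quoted passage is drawing.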

