Understanding the Deep Web

An interesting Salon article describes Yahoo!'s new Content Acquisition Program, which offers paid inclusion for deep-searching online databases. These treasure troves of information are often missed by search engines, which are cautious about following the links between dynamic pages.

Yahoo! has the right idea: search engines today aren't capturing the best the web has to offer, because that content is often behind query forms or login pages. Yahoo!'s solution seems to be to extend their search engine to understand the URLs of specific sites. However, many people are upset that this new program (which is basically a combination of premium offerings from their other properties) doesn't clearly mark the "paid inclusion" links in their main index. Some people point out that paid inclusion is a conflict of interest for search engines. (One Yahoo! employee disputes this on his personal blog.)

Ultimately, I think the solution to the problem of searching the deep web will be based on XML. Perhaps what we need is a way of defining the APIs these databases expose. A language like WSDL is a good start, but WSDL doesn't do a good job of capturing the semantics behind a web service call. What we need is a way to map the fields in a database to a common interface, something like what DBI and DB do.
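To make that concrete, here is a rough, hypothetical sketch of what such a descriptor might look like. None of these element names come from a real standard; they are invented here purely for illustration, with Dublin Core terms standing in for the common vocabulary that query fields would map onto.

    <deep-web-source name="example-journal-archive">
      <!-- Where and how to submit a query (hypothetical endpoint). -->
      <endpoint url="http://archive.example.com/search" method="GET"/>
      <!-- Each field maps an internal database column to a shared vocabulary,
           so a crawler knows how to fill in the query form. -->
      <field name="q" maps-to="dc:title" type="string" required="true"/>
      <field name="year" maps-to="dc:date" type="integer"/>
      <field name="topic" maps-to="dc:subject" type="string"/>
      <!-- Describe what comes back and how results are paginated. -->
      <results format="text/html" next-page-param="page"/>
    </deep-web-source>

A crawler that understood a handful of descriptors like this could generate sensible queries against the archive instead of blindly guessing at dynamic URLs.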

We may also want to consider ways of telling spiders a little more about the sites we run. robots.txt is great, but an expanded language could give advanced webmasters the ability to flag infinite loops, identify different presentations of the same content, specify preferred crawl schedules, and more, allowing smart robots to find even more information at a site and categorize it intelligently.
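Purely as an illustration (none of these directives exist in any real robots standard), an XML companion to robots.txt along those lines might look something like this:

    <site-hints>
      <!-- Warn crawlers away from URL patterns that generate endless pages. -->
      <crawler-trap pattern="/calendar/*"/>
      <!-- Declare that the print view is just another presentation of the same page. -->
      <duplicate-of canonical="/articles/deep-web" alternate="/articles/deep-web?print=1"/>
      <!-- Suggest how often different parts of the site are worth revisiting. -->
      <crawl-schedule path="/news/" frequency="daily"/>
      <crawl-schedule path="/archive/" frequency="monthly"/>
    </site-hints>

A smart robot could use hints like these to skip traps, collapse duplicate pages, and spend its crawl time where the site actually changes.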

(Original link courtesy Slashdot.)