Sunday, January 18, 2009

The Deep Web

The Deep Web is a very interesting place, and what makes it even more interesting is the fact that Google, Yahoo! and the other big search engines haven't laid their hands on it yet. So if you thought there was nothing left to innovate in the area of search engines, you may be happy to hear that there are still opportunities for challenging the big Web search engines.

Wikipedia defines the Deep Web as the portion of the Web that is not indexed by standard search engines, and the article goes on to enumerate the kinds of resources that make it up.

I'm particularly interested in the "dynamic content" part of the Deep Web: structured data that is hidden behind web form query interfaces. This content is stored in databases and served only in response to HTML form queries. I will refer to these kinds of sites as Deep Web directories, although this is not an exact term. An example of Deep Web directory content is this site, used for searching for used books.
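To make the idea concrete, here is a minimal sketch, in Python, of what fetching content from such a form-backed directory looks like programmatically. The endpoint and field names below are invented for illustration; a real directory defines its own form action and parameters.

```python
# Minimal sketch: querying a hypothetical used-book directory through its search form.
# The endpoint and the field names (title, author) are assumptions for illustration only.
import urllib.parse
import urllib.request


def query_book_directory(title, author):
    """Submit the directory's search form and return the resulting HTML page."""
    # Encode the fields exactly as the site's <form> would submit them (GET here).
    params = urllib.parse.urlencode({"title": title, "author": author})
    url = "http://used-books.example.com/search?" + params  # hypothetical endpoint

    with urllib.request.urlopen(url) as response:
        # The result page exists only as a response to this query, which is why a
        # conventional link-following crawler never discovers it.
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    page = query_book_directory("The Mythical Man-Month", "Frederick Brooks")
    print(page[:500])
```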


The Deep Web has the following characteristics and properties:
  • Much larger than the "visible" (a.k.a. "surface") Web.
  • May be locked behind a username and password.
  • Accessible only by query.
  • May take a long time to return a result.
  • May restrict the number of pages that can be crawled in a given period of time (see the sketch after this list).
  • The accuracy of its information is generally rated higher than that of the visible Web.
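Because of the last two constraints in particular, a Deep Web crawler has to pace itself and tolerate slow responses. Below is a small sketch, again in Python, of a throttled fetch loop; the delay value is an assumption, not something any particular site prescribes.

```python
# Sketch of a rate-limited fetch loop; the one-request-per-DELAY_SECONDS policy
# is an assumption, real sites publish (or silently enforce) their own limits.
import time
import urllib.request

DELAY_SECONDS = 10  # hypothetical politeness delay between requests


def fetch_politely(urls):
    """Fetch each URL in turn, sleeping between requests to respect crawl limits."""
    pages = []
    for url in urls:
        with urllib.request.urlopen(url, timeout=60) as response:  # results may be slow
            pages.append(response.read())
        time.sleep(DELAY_SECONDS)  # avoid tripping the site's crawl restrictions
    return pages
```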
It is tempting to venture that if the likes of Google and Yahoo! have not been very successful at uncovering the Deep Web, then it is a problem that requires immense resources. Indeed the problem of creating one generic solution to handle all Deep Web directory sites seems unsolvable with today's tools. However, vertical Deep Web crawlers have been implemented (e.g. try this) and the problem seems solvable with some human intervention in the process.

I venture to say that the first to uncover the Deep Web on a large scale will benefit greatly, and it is certainly worthwhile to attempt a solution. I propose a solution which I believe to be novel. It requires a software platform that can be "taught" how to crawl a particular set of vertical Deep Web sources (e.g. all directories listing flight departure and arrival times). The end user is provided with an interface for "teaching" the crawler how to crawl a particular Deep Web site; the crawler then crawls that site at a later, indeterminate time. I also foresee that the harvesting (extraction of semantic information) of the crawled pages, for more accurate search results, can be an automatic process that is "taught" to the crawler in the same way. I envision a massive collaborative community effort, à la Wikipedia, that will work to expose the Deep Web to the general public via a common, open and free Web service. A rough sketch of what such a "taught" crawl description might look like follows.
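This is only a thought experiment: the site, the field names and the value lists are all hypothetical. The point is that what the contributor "teaches" is declarative data that a generic crawler can consume later, rather than code.

```python
# Hypothetical "taught" crawl description for one vertical (flight departure/arrival
# directories). Every key, site and value list here is illustrative, not real.
import itertools
import urllib.parse
import urllib.request

crawl_recipe = {
    "site": "http://flights.example.com/timetable",   # hypothetical directory
    "method": "GET",
    "form_fields": {
        # Each field maps to the finite set of values the contributor taught us to try.
        "origin": ["TLV", "JFK", "LHR"],
        "destination": ["TLV", "JFK", "LHR"],
    },
    "extract": {
        # Taught harvesting rules: which parts of the result page carry the data.
        "departure_time": "css:td.departure",
        "arrival_time": "css:td.arrival",
    },
}


def crawl(recipe):
    """Enumerate every combination of taught field values and fetch each result page."""
    fields = recipe["form_fields"]
    names = list(fields)
    for combo in itertools.product(*(fields[n] for n in names)):
        params = urllib.parse.urlencode(dict(zip(names, combo)))
        url = recipe["site"] + "?" + params
        with urllib.request.urlopen(url) as response:
            yield url, response.read()  # harvesting per recipe["extract"] would follow
```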

To be successful, the community Deep Web search engine should:
  • Be freely accessible to all.
  • Provide a valuable service.
  • Cultivate a culture of the community serving the community.
  • Be easy to use!
  • Strive to crawl newly contributed directories as soon as possible, so that the contributor gets a quick reward.
  • Produce crawling progress reports, to give the contributor a feeling of participation in the process.
I may provide some research results in a later blog entry.
Until then, here are some interesting articles:
