Monday, January 19, 2009

Social networks and Individualism

Customization of services, user devices (e.g. cell phones), and user interfaces seems to be a mandatory feature these days. We have come to expect it everywhere, and even this blog site (blogspot.com) gives me powerful tools to customize the layout and behavior of my blog. It is not a novel idea, but a human need to express individualism that has come to "Web fruition": a need to be an individual within the group. Customization by the end-user of a web service (e.g. Gmail) or of her cell-phone theme are examples of features provided in order to meet this need.

It seems to me that this customization trend gained momentum hand-in-hand with the rise of social networks. It is as if the more we exposed ourselves to the community, albeit a virtual one, the more we wanted to express our individuality within this community.

These two (customization and social networks) are manifestations of a more general trend: the increased Web presence of common folk. I see two forms of Web presence: "passive" and "active." Passive Web presence is bestowed upon us, most likely by our friendly neighborhood web crawler or search engine. If you search the Web you may find several research papers I wrote as part of my undergraduate assignments. Somehow they made their way to the Web. Active Web presence is created when we actively publish on the Web (like this blog). Sometimes we may participate in this active publication process without realizing it. Again, if you search for me on the Web, you may find emails I posted to an IETF working group mailing list a good several years ago.

Returning to the subject of social networks and individualism: if I examine the social networks terrain, it seems to me that the next winning feature-set for social network services will be customization and more advanced methods that will let us express our individualism. Google knows this and you can see this in the latest Gmail Themes (http://googlesystem.blogspot.com/2008/11/gmail-themes.html) feature. The ever so simple feature in Gmail Chat that allows you to set your status to any string of your choosing packs an amazing individualization power punch. I've loosely tracked the status sentences that my team members have regularly published in Gmail in the past few months and it was a great source of information about the team. It was easy to identify the comedian, the cynic, and the artist in the group. Sometimes I knew if someone was having a good or bad day, just by looking at their Gmail Chat status string. And there were also some interesting patterns in how the group evolved their status strings over time. I could definitely identify a loose "nano" social network which communicated using the status string. Sometimes it was reminiscent of the Twitter format.

There is much more to say about the Gmail Chat status message and "nano" social networks (a term that I think I have identified here first), but these subjects deserve a blog entry of their own.

Sunday, January 18, 2009

The Deep Web

The Deep Web is a very interesting place, and what makes it even more interesting is the fact that Google, Yahoo and the other big search engines haven't laid their hands on it yet. So if you thought that there's nothing to innovate in the area of search engines, you may be happy to hear that there are still opportunities for challenging the big Web search engines.

Wikipedia defines the Deep Web as "websites that are not registered with any search engine."
The Wikipedia article further enumerates the resources that constitute the Deep Web.

I'm particularly interested in the "dynamic content" part of the Deep Web, which is structured Web data hidden behind web form query interfaces. This is content that is stored in databases and served only in response to queries submitted through HTML forms. I will refer to these kinds of sites as Deep Web directories, although this is not an exact description. An example of Deep Web directory content is this site, used for searching for used books.
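
To make "hidden behind web form query interfaces" a little more concrete, here is a minimal sketch of the first thing a crawler would have to do with such a page: find the form and work out which fields it can fill in. It uses only Python's standard `html.parser` module. The class name, the function name, and the sample page are all made up for illustration; they are not taken from any real site or crawler.

```python
from html.parser import HTMLParser


class FormFieldParser(HTMLParser):
    """Collects the action, method and named fields of each <form> on a page."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one dict per form found on the page
        self._current = None  # the form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                "method": (attrs.get("method") or "get").lower(),
                "fields": [],
            }
            self.forms.append(self._current)
        elif tag in ("input", "select", "textarea") and self._current is not None:
            # Only named fields are sent with the query; unnamed ones are skipped.
            name = attrs.get("name")
            if name:
                self._current["fields"].append(name)

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def find_forms(html):
    """Return a list of form descriptions found in an HTML string."""
    parser = FormFieldParser()
    parser.feed(html)
    return parser.forms


# A made-up search page, loosely modeled on a used-book directory.
SAMPLE_PAGE = """
<form action="/search" method="GET">
  <input type="text" name="title">
  <select name="condition"><option>used</option></select>
  <input type="submit" value="Go">
</form>
"""

if __name__ == "__main__":
    for form in find_forms(SAMPLE_PAGE):
        print(form)
```

Knowing the form's fields is only the start: the crawler still has no idea which values are worth submitting, which is exactly why these pages stay invisible to a link-following crawler.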


The Deep Web has the following characteristics and properties:
  • Much larger than the "visible" (a.k.a. "surface") Web
  • May be locked behind a username/password
  • Accessible only by query
  • May take a long time to return a result
  • May restrict the number of pages that may be crawled in a period of time
  • The accuracy of the information is rated higher than that found on the visible Web

It is tempting to venture that if the likes of Google and Yahoo! have not been very successful at uncovering the Deep Web, then it is a problem that requires immense resources. Indeed the problem of creating one generic solution to handle all Deep Web directory sites seems unsolvable with today's tools. However, vertical Deep Web crawlers have been implemented (e.g. try this) and the problem seems solvable with some human intervention in the process.

I venture to say that the first to uncover the Deep Web on a large scale will benefit greatly, and it is certainly worthwhile to attempt a solution. I propose a solution which I deem to be novel. It requires a software platform which can be "taught" how to crawl a particular set of vertical Deep Web sources (e.g. all directories listing flight departure and arrival times). The end-user is provided with an interface for "teaching" the crawler how to crawl a particular Deep Web site. The crawler then crawls this site at a later, indeterminate time. I also foresee that the harvesting (extraction of semantic information) of the crawled pages (for more accurate search results) can be an automatic process that is likewise "taught" to the crawler. I envision a massive collaborative community effort, a la Wikipedia, that will work to expose the Deep Web to the general public via a common, open and free Web service.
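
One way to picture what "teaching" might produce is a small recipe per site: where the form lives, which values to try in each field, and how long to wait between queries. The sketch below is only that, a sketch under my own assumptions. `CrawlRecipe`, `query_urls`, `crawl`, the field names and the `flights.example` address are all hypothetical, and the fetching is passed in as a function so that nothing here touches the network.

```python
import time
from dataclasses import dataclass
from itertools import product
from urllib.parse import urlencode


@dataclass
class CrawlRecipe:
    """What a contributor 'teaches' the crawler about one site (hypothetical format)."""
    form_url: str                   # where the query form submits to
    field_values: dict              # form field name -> list of values to try
    min_delay_seconds: float = 5.0  # politeness delay between two queries


def query_urls(recipe):
    """Yield one GET URL for every combination of the taught field values."""
    names = list(recipe.field_values)
    for combo in product(*(recipe.field_values[n] for n in names)):
        yield recipe.form_url + "?" + urlencode(dict(zip(names, combo)))


def crawl(recipe, fetch, sleep=time.sleep):
    """Fetch every query URL, pausing between requests; returns {url: page}."""
    pages = {}
    for i, url in enumerate(query_urls(recipe)):
        if i:
            sleep(recipe.min_delay_seconds)  # respect the site's crawl limits
        pages[url] = fetch(url)
    return pages


# A made-up recipe for a flight departures/arrivals directory.
FLIGHTS = CrawlRecipe(
    form_url="https://flights.example/search",
    field_values={
        "airport": ["TLV", "JFK"],
        "direction": ["departures", "arrivals"],
    },
)

if __name__ == "__main__":
    for url in query_urls(FLIGHTS):
        print(url)
```

The delay is there because, as noted above, Deep Web sites may restrict how many pages can be crawled in a period of time. A real platform would also need the taught extraction rules for harvesting, which this sketch leaves out entirely.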

To be successful, the community Deep Web search engine should:
  • Be freely accessible to all
  • Provide a valuable service
  • Cultivate a culture of community serving the community
  • Be easy to use!
  • Strive to crawl new directories as soon as possible, in order to give the contributor a quick reward
  • Produce crawling progress reports, to give the contributor a feeling of participation in the process

I may provide some research results in a later blog entry.
Until then, here are some interesting articles:

Sunday, January 11, 2009

Moving on

Lost my job on Thursday.
Yap - pull the lever, crank up the stats - I switched teams. From head coach to water-boy in less than a "bim bam, thank you, ma'am". Reached the corner bend, you know the one. The corner around which lies the greatest economic crisis since something or other. The corner that comes after the bubble.

I figured there can't be a better time to launder some words; rehash some ideas that the next-door blog has already beat to death in a language I can't even read. And in a while or so the Google "machine" will scoop them up, words and ideas, compress, index and cache. And in a Google cluster of 100,000 commodity machines, my commodity words will be stored on a commodity hard disk. Our friendly neighborhood GFS will make sure to replicate this babble-gabble on at least another commodity disk. For safe-keeping, you know.

But hey, 'Movin on' is the title of this feverish gibberish, right? So let's get going. I'll start by making a list. A Prioritized List (ooh!) - what's to do, and what comes first. When you don't know what to do - make a list. Worst case, you'll put it in your pocket.
And I've got the first item all planned out: I'll measure how long it will take Google to find this post - this should fill my schedule for the next couple of weeks.