Ingmar Weber: Free the Query Logs
With Google seeping into every nook of the Society of the Query conference – the subject, direct or indirect, of most presentations and discussions – you might ask why Google isn’t here to speak for itself. Unfortunately and unsurprisingly, the company makes it very difficult for staff to speak at events (look at how rarely they attend the field’s largest conferences: SIGIR, WSDM, WWW).
Luckily for the conference, there was a company rep in the house: Ingmar Weber, a search engine researcher from Yahoo! Weber rounded out yesterday’s discussion with his lecture “It’s Hard to Rank Without Being Evil: where evil means big centralized and keeping track of a huge query log.” Chock-full of metaphors linking data to wealth, his talk proposed an alternative search engine of the future that makes query logs a free public resource.
What’s a query log? Let’s say you’re a researcher like Weber and want to pioneer this alternative search engine. First you’d consider ranking, or how to organize, prioritize, and filter the web’s data. You could rank in a few ways: by document content, such as a word and where it appears on a page, the most basic ingredient of a search; or by hyperlink structure, using a giant web crawl to count inlinks (essentially votes) from other websites. Or you could use query logs. Query logs are quality votes; they show that users who search for x tend to click on y. They also show relations between pages: pages y and z are clicked by the same user. A search engine could use this implicit relevance feedback to infer what people like and direct them there.
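To make the “clicks as votes” idea concrete, here is a minimal Python sketch. It is not Weber’s method or any real engine’s ranking; the log format (user, query, clicked URL), the sample entries, and the function names are all invented for illustration. It tallies clicks per query as votes and counts which pairs of pages were clicked by the same user.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical toy log: (user_id, query, clicked_url). Invented for illustration.
LOG = [
    ("u1", "flu symptoms", "health.example/flu"),
    ("u2", "flu symptoms", "health.example/flu"),
    ("u3", "flu symptoms", "news.example/flu-season"),
    ("u1", "cold remedies", "health.example/cold"),
    ("u2", "cold remedies", "health.example/cold"),
]

def click_votes(log):
    """Count clicks per (query, url): each click is one implicit vote."""
    votes = defaultdict(Counter)
    for _user, query, url in log:
        votes[query][url] += 1
    return votes

def rank(votes, query):
    """Order the URLs clicked for a query, most-clicked first."""
    return [url for url, _count in votes[query].most_common()]

def co_clicks(log):
    """Count pairs of pages that were clicked by the same user."""
    pages_by_user = defaultdict(set)
    for user, _query, url in log:
        pages_by_user[user].add(url)
    pairs = Counter()
    for pages in pages_by_user.values():
        # Sort so each pair is stored in one canonical order.
        for a, b in combinations(sorted(pages), 2):
            pairs[(a, b)] += 1
    return pairs

votes = click_votes(LOG)
print(rank(votes, "flu symptoms"))
print(co_clicks(LOG))
```

Even this toy version shows why the log matters: without the click records there is nothing to count, which is exactly the resource a newcomer lacks.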
Over time, a log of individual search actions becomes a powerful resource, a goldmine of data. Put it all together, and we could track flu patterns, fine-tune election predictions, or discover which local bar most people like. But there’s a paradox: if you’re using search data to build a search engine from scratch, you’d need to pull that data from some other, pre-existing search engine. And currently there is no public access to major search engines’ query logs. Companies hoard their logs like misers sitting on mounds of gold.
There are other such hidden mounds of gold, or ‘information silos,’ as Weber terms them. Mobility data from mobile phones, for instance, could tell us where people are at all times. This would be useful for predicting traffic jams, for one. Shopping basket information, held by credit card companies and stores, could likewise tell us what people are buying, where, and when. Imagine a real-time snapshot of the amount of junk food consumed.
Weber wants to know if we can unlock these silos and chase the misers away while still respecting privacy and guarding against potential abuses. How can we all contribute to the query log but protect ourselves from intrusions or misuse of our personal data?
Weber offered a few current examples, such as Ippolita’s SCookies, a site that swaps cookies among Google users; you offer up search information but SCookies makes it anonymous. Data sharing without the creepiness factor. What other legal and technical innovations could open up massive query data for the public good? There’s no answer yet. But who knows what Weber’s cooking up when he’s outside the office.