A short paper in CIKM 2009 looks at frameworks for analyzing search sequences, from low levels (fixation, view, click, etc.) to high levels (task).
I was lightly involved but highly enthusiastic about three papers on online reviews: one from HLT-NAACL on identifying review pages, another from EMNLP 2009 on matching reviews to the underlying objects using language models, and a third from CIKM 2009 on the same problem but using translation models.
We had three WWW papers in 2010. The first was about a browsing process in which the user may open new tabs. The second was an algorithm for cover problems using the MapReduce framework. The third was a characterization study of online behavior.
This paper from WSDM 2010 is a study of some evolutionary dynamics of two-sided markets, for instance those in which buyers and sellers connect through an intermediary such as an auction site.
A paper at SDM 2009 suggests plotting measures of connectedness as a function of fraction of edges removed from a graph, as a simple visual approach to viewing the macroscopic connectivity of a large graph. We call these diagrams "shatterplots."
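The idea of plotting connectedness against the fraction of edges removed can be sketched in a few lines of Python. This is only an illustration under assumptions that are mine, not the paper's: edges are removed in a random order, and connectedness is measured as the fraction of nodes in the largest connected component. The names `largest_component_fraction` and `shatter_curve` are hypothetical.

```python
import random


def largest_component_fraction(num_nodes, edges):
    """Fraction of nodes in the largest connected component (union-find)."""
    parent = list(range(num_nodes))
    size = [1] * num_nodes

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            if size[ru] < size[rv]:
                ru, rv = rv, ru
            parent[rv] = ru
            size[ru] += size[rv]
    return max(size[find(x)] for x in range(num_nodes)) / num_nodes


def shatter_curve(num_nodes, edges, steps=10, seed=0):
    """Return (fraction_removed, connectedness) points for a shatterplot.

    Assumption: edges are removed in a random order fixed by `seed`.
    """
    order = list(edges)
    random.Random(seed).shuffle(order)
    curve = []
    for i in range(steps + 1):
        removed = round(len(order) * i / steps)
        remaining = order[removed:]  # edges still present at this step
        curve.append((i / steps, largest_component_fraction(num_nodes, remaining)))
    return curve
```

Plotting the returned points with any charting tool gives a curve of this kind; other removal orders or connectedness measures could be swapped in.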
Raghu Ramakrishnan gave a keynote at PODS, and the associated paper describes a vision for a "web of concepts," exploring some of the technical and social issues that arise from a richer set of identifiable entities.
A Data Engineering Bulletin paper gives a range of characterizations of online search behavior, based on analysis of toolbar data.
We wrote a series of three papers on privacy in query logs. The first paper assesses vulnerabilities in a specific scheme for anonymizing logs: hashing tokens securely into identifiers. The second paper suggests that the removal of personally-identifiable information from logs is problematic, as seemingly innocuous queries can in aggregate reveal personal information. Finally, the third paper covers a family of schemes that seek to provide privacy by grouping users into buckets, and explores certain families of attacks.
This SIGMOD 2008 paper describes the Pig system for large-scale data processing, a more flexible extension of the MapReduce framework.
This paper from WSDM 2008 describes some large-scale experiments on Yahoo! groups, and characterizes people who tend to receive preferential treatment upon joining a group, and then go on to become influential within the group.
Another WSDM 2008 paper gives an approach, called the KNC-plot, to studying the macroscopic structure of large bipartite graphs, and also describes some algorithms to compute it.
This paper from KDD 2008 gives a detailed study of edge-by-edge evolution of four large social networks, and compares various models of evolution based on likelihood rather than on aggregate statistics.
A paper from VLDB 2008 describes a new approach to generalization in search: given a universe of, say, restaurants, and a query for a particular cuisine type in a particular location, we give ways to efficiently generalize the location and/or the cuisine type if the initial query returns no results.
A paper in IEEE Computer describes the PeopleWeb, in which a user's social environment travels between sites with the user. This paper also gives some trends in the growth of online content.
I worked some years ago on a project called CLEVER, which was follow-on work to Jon Kleinberg's HITS algorithm, done at IBM Almaden. We finally put together an end-to-end algorithmic description of the CLEVER system.
We have a poster at WWW 2007 on some alternatives to personalized pagerank; a short two-page poster paper gives an overview.
The last few years have shown some very interesting results on the nature of web crawling; see the work of Junghoo Cho and Chris Olston for some nice examples. A new paper analyzes how efficiently new web content can be discovered, with varying levels of insight into the right places to look for it.
We have a paper in the Data Engineering Bulletin that gives a high-level overview of many of the key research directions that Yahoo! Research is pursuing.
We looked at a large social network to understand how friendship relates to geographic separation, and found that two people being friends is a function less of the distance between them than of the number of people between them. We introduced a model of friendship formation based on this intuition, and showed that the model produces social networks with short paths for arbitrary population densities. The results are in a paper in PNAS. The same work also has a more theoretical component, in which we analyze this rank-based friendship and show results for arbitrary metric spaces. These results are in a paper in ESA 2006.
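A small Python sketch can make the "number of people between them" intuition concrete. It assumes, as my own simplification, that the chance that u picks v as a friend is proportional to 1/rank, where rank is v's position when everyone else is sorted by distance from u. The function names and the 2-D coordinates are hypothetical.

```python
import math
import random


def rank_based_probabilities(u, positions):
    """Return {v: probability that u picks v}, weighting each v by 1/rank.

    rank = 1 for the person closest to u, 2 for the next closest, and so on.
    Ties in distance are broken by the order of `positions` (an assumption).
    """
    others = [v for v in positions if v != u]
    others.sort(key=lambda v: math.dist(positions[u], positions[v]))
    weights = {v: 1.0 / (i + 1) for i, v in enumerate(others)}
    total = sum(weights.values())
    return {v: w / total for v, w in weights.items()}


def sample_friend(u, positions, rng=random):
    """Draw one friend for u according to the rank-based probabilities."""
    probs = rank_based_probabilities(u, positions)
    people = list(probs)
    return rng.choices(people, weights=[probs[v] for v in people], k=1)[0]
```

Because only the ordering of distances matters, multiplying every coordinate by the same factor leaves the probabilities unchanged, which is one way to see why a rank-based rule is insensitive to population density.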
At the same time, we've been looking at various questions about the evolution of social networks. We looked at the contacts networks within Flickr and Yahoo! 360, and found some surprising results about the number of nontrivial disconnected components in the social network over time. The results appear in a short paper in KDD 2006.
A couple of years ago, we did some work on the nature of templates appearing on websites, and showed that roughly half of the content on the web comes from templates. This work appears in an industrial track paper in WWW 2005. Recent work extends the thread of site-level analysis to explore algorithms for segmenting a website into topics based on its hierarchical structure. This new work appears in a paper in KDD 2006.
We've also done some work on clustering objects that evolve over time. There's a short paper in KDD 2006.
As a follow-up to various work on measuring the size of collections of objects, we have a new approach appearing in a paper in CIKM 2006.
We've done some recent work on large dense subgraph discovery using a weakened formulation of a dense subgraph. The algorithmic approach is an iterated form of shingling a graph, in which each step concentrates dense subgraphs and removes sparse subgraphs. A paper in VLDB 2005 covers the results.
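One pass of a shingling step might look roughly like the Python sketch below. It is a simplified min-hash variant written for illustration, not the procedure from the paper: each node's neighbor set is reduced to `c` shingles, each made of the `s` neighbors with the smallest value under a seeded hash, and nodes sharing a shingle are grouped together. The parameters `s` and `c`, the helper `_h`, and the function names are all assumptions of this sketch.

```python
import hashlib
from collections import defaultdict


def _h(seed, item):
    """Deterministic seeded hash of an item (hypothetical helper)."""
    data = f"{seed}:{item}".encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def shingles(neighbors, s=2, c=3):
    """Return up to c shingles; each is the s neighbors with smallest hash.

    Nodes with fewer than s neighbors get no shingles, so sparse nodes
    drop out of the next round.
    """
    if len(neighbors) < s:
        return []
    result = []
    for seed in range(c):
        smallest = sorted(neighbors, key=lambda x: _h(seed, x))[:s]
        result.append((seed, tuple(sorted(smallest))))
    return result


def group_by_shingle(adjacency, s=2, c=3):
    """Map each shingle to the set of nodes that produced it (size > 1 only)."""
    groups = defaultdict(set)
    for node, nbrs in adjacency.items():
        for sh in shingles(nbrs, s, c):
            groups[sh].add(node)
    return {sh: nodes for sh, nodes in groups.items() if len(nodes) > 1}
```

Nodes with heavily overlapping neighbor sets tend to land in the same groups, and the grouping could be fed back in as a new graph for another round, which is the sense in which repeated passes concentrate dense regions.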