We wrote a series of three papers on privacy in query logs. The first paper assesses vulnerabilities in a specific scheme for anonymizing logs: hashing tokens securely into identifiers. The second paper suggests that the removal of personally-identifiable information from logs is problematic, as seemingly innocuous queries can in aggregate reveal personal information. Finally, the third paper covers a family of schemes that seek to provide privacy by grouping users into buckets, and explores certain families of attacks.
This SIGMOD 2008 paper describes the Pig system for large-scale data processing, a more flexible extension of the map-reduce framework.
This paper from WSDM 2008 describes some large-scale experiments on Yahoo! groups, and characterizes people who tend to receive preferential treatment upon joining and group, and then continue on to influence within the group.
Another WSDM 2008 paper gives an approach to studying the macroscopic structure of large bipartite graphs called the KNC-plot, and also describes some algorithms to compute it.
This paper from KDD 2008 gives a detailed study of edge-by-edge evolution of four large social networks, and compared various models of evolution based on likelihood rather than based on aggregate statistics.
A paper from VLDB 2008 describes a new approach to generalization in search: given a universe of, say, restaurants, and a query for a particular cuisine type in a particular location, we give ways to efficiently generalize the location and/or the cuisine type if the initial query returns no results.
A paper in IEEE Computer describes the PeopleWeb, in which a user's social environment travels between sites with the user. This paper also gives some trends in the growth of online content.
I worked some years ago on a project called CLEVER, which was follow-on work to Jon Kleinberg's HITS algorithm, done at IBM Almaden. We finally put together an end-to-end algorithmic description of the CLEVER system.
We have a poster at WWW 2007 on some alternatives to personalized pagerank; a short two-page poster paper gives an overview.
The last few years have shown some very interesting results on the nature of web crawling; see the work of Junghoo Cho and Chris Olston for some nice examples. A new paper shows an analysis of how efficiently new web content can be discovered, with varying levels of insight into the right places to look for them.
We have a paper in the Data Engineering Bulletin that gives a high-level overview of many of the key research directions that Yahoo! Research is pursuing.
We looked at a large social network to understand how friendship relates to geographic separation, and found that two people being friends is a function less of the distance between them than of the number of people between them. We introduced a model of friendship formation based on this intuition, and showed that the model produces social networks with short paths for arbitrary population densities. The results are in a paper in PNAS. The same work also has a more theoretical component, in which we analyze this rank-based friendship and show results for arbitrary metric spaces. These results are in a paper in ESA 2006.
At the same time, we've been looking at various questions about the evolution of social networks. We looked at the contacts networks within Flickr and Yahoo! 360, and found some surprising results about the number of nontrivial disconnected components in the social network over time. The results appear in a short paper in KDD 2006.
A couple of years ago, we did some work on the nature of templates appearing on websites, and showed that roughly half of the content on the web comes from templates. This work appears in an industrial track paper in WWW 2005. Recent work extends the thread of site-level analysis to explore algorithms for segmenting a website into topics based on its hierarchical structure. This new work appears in a paper in KDD 2006.
We've also done some work on clustering objects that evolve over time. There's a short paper in KDD 2006.
In follow-up to various work on measuring the size of collections of objects, we have a new approach appearing in a paper in CIKM 2006.
We've done some recent work on large dense subgraph discovery using a weakened formulation of a dense subgraph. The algorithmic approach is an iterated form of shingling a graph, in which each step concentrates dense subgraphs and removes sparse subgraphs. A paper in VLDB 2005 covers the results.
This article from Scientific American gives an overview of search and related techniques that make use of link analysis over hyperlinked corpora.
S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Hypersearching the web. Scientific American, June 1999. ( html )
This is a brief summary of some results on web structure, covering
HITS, the bow tie model, and the fractal structure of the web.
Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins:
The Web and Social Networks. IEEE Computer 35(11): 32-36 (2002) (pdf)
This is a slightly more detailed view of applications that make use of the web's link structure; it's getting a little dated.
S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Mining the web's link structure. IEEE Computer, August 1999. ( pdf )
This paper in CACM
2004 gives an analysis of the bloggers on the LiveJournal
blog-hosting site: their interests, locations, and demographics, from
the perspective of the social network.
R. Kumar, P. Raghavan, S. Rajagopalan, and
A. Tomkins. Social networks: From the web to knowledge
management. Book chapter in Web Intelligence, editors: Ning
Zhong, Jiming Liu, Yiyu Yao, by Springer-Verlag, pages 367--379,
January 2003. (pdf)
These slides
are from part of an AMS tutorial. They cover some basic introductory
remaks about power laws and related heavy-tailed distributions, and
discuss a set of generative models for these distributions. Slides
from an AMS tutorial on power laws and generative models.
R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Recommendation systems: a probabilistic analysis. Journal of Computer and System Sciences (JCSS), 63(1):42--61, August, 2001. appeared in Proc. 39th Symposium on Foundations of Computer Science, 1998. ( pdf )
R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. On targeting markov segments. In Proceedings of the ACM Symposium on Theory of Computing, 1999. ( pdf )
A. Broder, M. Fontura, V. Josifovski, R. Kumar, R. Motwani,
S. Nabar, R. Panigrahy, A. Tomkins and Y. Xu. Estimating Corpus Size
via Queries. In Conference on Information and Knowledge Management (CIKM), 2006. (pdf)
R. Fagin, R. Guha, R. Kumar, J. Novak, D. Sivakumar, and
A. Tomkins. Multi-Structural Databases. In Proceedings of the
24th ACM Symposium on Principles of Database Systems, 2005. ( pdf )
R. Fagin, P. Kolaitis, R. Kumar, J. Novak, D. Sivakumar, and
A. Tomkins. Efficient Implementation of Larce-Scale Multi-Structural
Databases. In IEEE International Conference
on Very Large Databases (VLDB), 2005. ( pdf )
Soumen Chakrabarti, Byron E. Dom, David Gibson, Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew Tomkins. Topic distillation and spectral filtering. Artificial Intelligence Review, 13:409--435, 1999. ( pdf )
S. Chakrabarti, B. Dom, D. Gibson, R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Spectral filtering for resource discovery. In SIGIR 98 Workshop on Hypertext Analysis, 1998. ( postscript )
R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Extracting large scale knowledge bases from the web. In IEEE International conference on Very Large Databases (VLDB), Edinburgh, Scotland, September 1999. ( pdf )
R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling the web for emerging cyber-communities. Computer Networks, 31:1481--1493, 1999. Conference version at Eighth Internation World Wide Web Conference, 1999. ( html )
Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, and Janet Wiener. Graph structure in the web (winner best paper award). In Proceedings of the Ninth International World Wide Web Conference, 2000. ( html )
A. Arasu, J. Novak, A. Tomkins, and J. Tomlin. Pagerank computation and the structure of the web: Experiments and algorithms, 2002. ( pdf )
Stephen Dill, Ravi Kumar, Kevin S. Mccurley, Sridhar Rajagopalan, D. Sivakumar, and Andrew Tomkins. Self-similarity in the web. ACM Transactions on Internet Technology (TOIT), 2(3):205--223, 2002. Appeared in IEEE International conference on Very Large Databases (VLDB) 2001, Rome, Italy. ( pdf )
J. Kleinberg, S.R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. The web as a graph: Measurements, models and methods. In Proceedings of the International Conference on Combinatorics and Computing, number 1627 in LNCS. Springer-Verlag, July 1999. ( pdf )
Z. BarYossef, A. Broder, R. Kumar and A. Tomkins. Sic Transit Gloria Telae: Towards an Understanding of theWeb's Decay. In Proceedings of the Thirteenth International World Wide Web Conference, New York, New York, 2004. ( html , pdf )
S.R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal. The web as a graph. In Proceedings of the 19th ACM Symposium on Principles of Database Systems, pages 1--10, 2000. ( pdf )
R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal. Stochastic models for the web graph. In Proc. 41st Symposium on Foundations of Computer Science, 2000. ( pdf )
R. Fagin, A. Karlin, J. Kleinberg, P. Raghavan, S. Rajagopalan, R. Rubinfeld, M. Sudan, and A. Tomkins. Random walks with ``back buttons''. In Proceedings of the ACM Symposium on Theory of Computing, 2000. ( pdf )
R. Kumar, J. Novak and A. Tomkins. Structure and Evolution of
Online Social Networks. In Proceedings of the Twelfth ACM SIGKDD
Conference on Knowledge Discovery and Data Mining (KDD), poster
track, 2006. (pdf)
J. Novak, P. Raghavan and A. Tomkins. Anti-Aliasing on the Web. In
Proceedings of the Thirteenth International World Wide Web
Conference, New York, New York, 2004. ( pdf )
R. Guha, R. Kumar, P. Raghavan and A. Tomkins. Propagation of Trust and Distrust. In Proceedings of the Thirteenth International World Wide Web Conference, New York, New York, 2004. ( html , pdf )
C. Faloutsos, K. McCurley and A. Tomkins. Fast Discovery of Connection Subgraphs. In Tenth ACM SIGKDD Conference, Seattle, WA, 2004. ( pdf )
R. Kumar, D. Liben-Nowell, J. Novak, P. Raghavan, and A. Tomkins.
Geographic routing in social networks. In Proceedings of the
National Academy of Science 102(33):11623-11628 (2005). (pdf)
R. Kumar, D. Liben-Nowell and A. Tomkins. Navigating Low-Dimensional and Hierarchical Population Networks. In European Symposium on Algorithms (ESA), 2006. (pdf)
D. Gruhl, R. Guha, D. Liben-Nowell and
A. Tomkins. Information Diffusion Through Blogspace. In
Proceedings of the Thirteenth International World Wide Web
Conference, New York, New York, 2004. ( pdf )
R. Kumar, J. Novak, P. Raghavan, and
A. Tomkins. On the bursty evolution of blogspace. In
Proceedings of the Twelth International World Wide Web
Conference, Budapest, Hungary, 2003. ( html , pdf )
D. Gruhl and R. Guha and R. Kumar and J. Novak and A. Tomkins. The
Predictive Power of Online Chatter. In Proceedings of the
Eleventh ACM SIGKDD Conference on Knowledge Discovery and Data Mining
(KDD), Chicago, IL, 2005. (pdf)
Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew
Tomkins. The Web and Social Networks. In IEEE Computer 35(11):32-36
(2002). (pdf)
D. Chakrabarti, R. Kumar and A. Tomkins. Evolutionary Clustering.
In Proceedings of the Twelfth ACM SIGKDD Conference on Knowledge
Discovery and Data Mining (KDD), poster track, 2006. (pdf)
D. Gruhl, L. Chavet, D. Gibson, J. Meyer, P. Pattanayak, A. Tomkins and J. Zien. How to build a WebFountain: An
architecture for very large-scale text analytics. In IBM Systems
Journal, vol 43, number 1, 2004. (pdf)
K. McCurley and A. Tomkins. Mining and knowledge discovery from the
Web. In 7th International Symposium on Parallel Architectures,
Algorithms and Networks, Hong Kong, 2004. (pdf)
A. Dasgupta, R. Kumar, P. Raghavan and A. Tomkins. Variable Latent
Semantic Indexing. In Proceedings of the Eleventh ACM SIGKDD
Conference on Knowledge Discovery and Data Mining (KDD), Chicago,
IL, 2005. (pdf)
D. Gibson, K. Punera, and A. Tomkins. The Volume and Evolution of Web
Page Templates. In Proceedings of the Fourteenth International
World Wide Web Conference (WWW), Chiba, Japan, 2005. (pdf)
R. Kumar, K. Punera and A. Tomkins. Hierarchical Topic
Segmentation of Websites. In Proceedings of the Twelth ACM SIGKDD
Conference on Knowledge Discovery and Data Mining (KDD), 2006. (a href="/andrew/papers/website-segmentation/website-segmentation.pdf">pdf S. Dill, N. Eiron, D. Gibson, D. Gruhl,
R. Guha, A. Jhingran, T. Kanungo, S. Rajagopalan,
A. Tomkins, J. Tomlin, and J. Zien. Bootstrapping the
semantic web via automated semantic annotation (winner best paper
award). In Proceedings of the Twelth International World Wide Web
Conference, Budapest, Hungary, 2003. (pdf)
R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. On semi-automated web taxonomy construction. In Fourth International Workshop on the Web and Databases (WebDB'2001), Santa Barbara, CA, May 24--25, 2001. ( pdf )
Tutorials
This is a tutorial that Jon Kleinberg and I gave at PODS99 on linear algebra techniques in information retrieval -- it's a broad introduction to the area that assumes little background beyond basic linear algebra.
J. Kleinberg and A. Tomkins. Applications of linear algebra in information retrieval and hypertext analysis. In Proceedings of the 18th ACM Symposium on Principles of Database Systems, 1999. ( pdf )
User targeting
Searching and querying large-scale data
Web graph analysis
D. Gibson, R. Kumar and A. Tomkins. Discovering Large Dense Subgraphs
in Massive Graphs. In IEEE International Conference on Very Large
Databases (VLDB), Trondheim, Norway, September 2005. (pdf)
Communities, blogs and social networks
WebFountain and content analysis
Website Analysis
Web taxonomies