Home All papers Authors Tags Topics

Back to homepage

Web graph analysis

This page describes threads of research performed by myself and my collaborators. The goal is to put our work into a meaningful sequence, instead of just a list of papers; the goal is not to give an overview of the field. At some later time, this page may turn into a survey, but at this point the many wonderful contributions of others are not represented here.

A survey paper from PODS 2000 covers some early work in analysis of the web graph.

Measurement studies

In the very early days of graph-based web research, a number of groups began to observe power law behavior on the web. These included teams from Notre Dame, Xerox PARC, and IBM. We published an early paper appearing in WWW8 focused on community extraction, which also included an early observation of the power law around indegrees of web pages. (see the journal version of this paper for details). Subsequently, a series of analyses of power laws on the web appeared, covering graph properties as well as growth, network properties, eigenvalues, and many others.

In 1999, researchers from Altavista, Compaq and IBM collaborated to perform a large-scale measurement study of the world wide web graph. The resulting paper uncovered a "bow tie" structure in the web, and won the best paper award in WWW9. See a nice summary in Nature on the topic.

There's a long and interesting line of work on measuring the size of corpora like the web, into which we have only limited visibility. We have a recent contribution along these lines, following on to this interesting body of work.

A follow-on paper in VLDB 2001 extended the bow tie structure to communities within the web, showing that any "thematically-unified community" contained similar structure to the web at large, and showing how the bow ties of many smaller communities connected together to form a navigational backbone of the web.

A number of researchers in the physics and theoretical computer science community and elsewhere have advanced models to explain some of these observations. We did some early work in the area, developing a model based on copying, which is strongly related to a model called "preferential attachment" introduced by Barabasi and Albert. The model and other general aspects of the area are discussed in this paper from COCOON99. A more theoretical treatment is given in this paper from FOCS 2000 . There have since been a number of fantastic results on these and related models that in many cases extend the results in our paper.

Dense subgraphs

A number of algorithms have been proposed for finding "communities" in the web graph using approaches based on dense subgraphs. The first such paper covered a methodology known as trawling for enumerating all communities based on a certain type of signature. This methodology was first described in WWW8 in this paper, and then extended to a general framework creating knowledge bases based on particular types of graphical signatures in a follow-on paper in VLDB 99.

A similar technique was employed to find communities in blogs, using a signature based on a greedily-constructed dense subgraph rather than a bipartite clique. Results are given in this paper.

There are fewer algorithms known for finding large dense subgraphs in graphs. One such algorithm based on iterated shingling of subgraphs is given here.

All of those approaches are offline in the sense that a large graph is processed over a period of days or weeks, and a large number of communities are extracted. There are also online approaches that try to approximate a particular community based on a user query. We have worked on this problem in social networks using a technique called connection subgraphs.

Similarly, early work on the CLEVER project covered finding dense bipartite graphs based on a user query; some results are given in this paper. A more complete summary of the algorithms is here.

Back to Andrew Tomkins homepage.