Mathematical properties of webgraphs

Many people are interested in the structure of World Wide Web and Internet. These are huge networks with many websites, webpages, and servers, routers, respectively, and even more links between webpages and physical connection between servers and routers. These structures can be investigated by means of graph theory.

The network of the Internet is not similar, and therefore not identical to the network of World Wide Web. The aim of Erdős Webgraph Project is to collect data of World Wide Web, and build a useful database.

If you want to learn more about the structure of world wide web, and large graphs in a strict mathematical sense, navigate here

Definition

Webgraph can be defined in several ways. We have our own definition which is the following.

What are the vertices and edges of our graph?

In the present project we construct webgraphs for scientific research purposes. In our construction, the vertices of the graph correspond to domain-names, and we connect by a directed edge (arrow) a domain A to domain B if and only if there exists a document, characterized by an URL, in domain A that hyperlinks to an URL under the domain B. For example, in the picture below, the thick arrows are the edges in our graph, and these edges were added since the thin arrows link from one document to the other, under the distinct domains.

Webgraph

The difference between URLs and domains

If we consider a large website, e.g., www.cnn.com, then we can find lots of images, videos and links on the page. If we place the mouse over that items, then most www browsers show the internet address, the URL of the item somewhere on the bottom of the browser. Every day hundreds of new URLs are added to large sites (pictures, videos, blog links, etc.), while the name of the site remain unchanged even for decades.

In our example in CNN, there are subdomains (e.g. http://money.cnn.com) and subdirectories (e.g., http://www.cnn.com/OPINION/). We count each subdomain as a distinct domain, but all subdirectory will be considered the part of the domain where it resides.

Degree distribution

The degree distribution of the webgraph strongly differs from the degree distribution of the classical Erdős–Rényi random graph model. The degree distribution of the webgraph obides the power law: the fraction of nodes with degree d is roughly proportional to d-c, where c=2.09 for in-degree and c=2.72 for out-degree. This means that the webgraph has lots of large-degree-nodes. Power law distributions are visualized as straight lines in double-logarithm scale graphics, you can witness this fact at our Download page, too.

 

The size of World Wide Web

By some estimates published on Wikipedia on April 2011, there are 1012 different URLs and about 110 million different websites on the WWW. A website can be identified roughly with a domain-name (we wrote "roughly" since the intricacies are not covered here).

Important papers about this topic