What is the PageRank Algorithm

PageRank - Everything about the Google algorithm

The PageRank algorithm

Basics

The PageRank algorithm is the best-known method for evaluating / weighting websites. The algorithm was developed and published by Sergey Brin and Lawrence Page at Stanford University in the late 1990s [p. Brin, L. Page, "The PageRank Citation Ranking: Bringing Order to the Web"; 'The Anatomy of Large-Scale Hypertextual Web Search Engine']. The innovation consisted in generating an evaluation factor based solely on the link structure. The content of a page plays just as little role in PageRank as the URL. In addition, no distinction is made between internal and external links. The PageRank value reflects the "importance of a website". PageRank is an "off-page factor", i.e. it is independent of the content of the page. The ranking (the placement of the websites) is determined from the combination of on-page and off-page factors. The former include, for example, the title, description, headings and normal texts; The latter includes the referring link text in addition to the PageRank. The PageRank algorithm is therefore only part of a ranking algorithm.

A modified PageRank algorithm was used at Google. In the early days, Google's ranking algorithm was superior to other algorithms because it was more difficult to manipulate and thus provided better search results. However, the publication of the PageRank procedure and the display of the PageRank in the Google toolbar led to increased manipulation (e.g. through the purchase and sale of PageRank). In addition, other search engines developed comparable algorithms, i.e. processes that use the structure of the link and the referring anchor text as an important means of evaluation. This led to the fact that the quality of the results lists of the best search engines has meanwhile adjusted.

Search engine optimization

Optimizing the PageRank value alone is relatively easy. This is because the (original) procedure is completely known. In general, the entire website should not have any outgoing links or any links that transfer PageRank. The structure of the link depends on whether the PageRank should only be optimized for one page or for a larger number of pages. In the first case, the website would have a strong hierarchical structure, with all internal pages linking to the start page. All external pages would also link to the (to be optimized) start page. The number of your own pages would be chosen as large as possible. In the second case, a relatively flat structure would be chosen. In this case, incoming links would be distributed to the various pages.

An optimization of the PageRank does not have to automatically lead to an improvement of the ranking (i.e. the position in the search results). PageRank is only one criterion within the ranking algorithm. Outbound links that reduce the PageRank slightly can often easily be compensated for by other positive effects. In addition, outbound links can lead to the corresponding pages linking back to your own and the PageRank profit being greater than the loss.

Finally, Google also made changes to the original PageRank algorithm, which also changed the optimal linking.

Maths

Linear system of equations

The calculation of the PageRank is the solution of a linear system of equations of the form

M * PR = (1 - d)

where 0 i, is the PageRank of the i-th page. The matrix M is composed as follows

M = 1 - d T

where T describes the transition matrix. The components of T can be derived from the number of outgoing links:

Tij= 1 / Cj (if page j links to page i) Tij= 0 (otherwise)

C.j is the number of links on page j. The solution to the system of linear equations is

PR = M-1 * (1 - d)

The calculation of the inverse matrix M-1 can in principle be carried out analytically. For 0 PR {k + 1} = (1 –d) + d T * PR {k}

PR {k} denotes the value of the PageRank vector in the kth iteration. For example, PR {0} = 1 can be taken as the start value. The convergence of the method is guaranteed since the greatest eigenvalue of d T is less than 1. The damping factor d determines the proportion of the pagerank that is transferred further. For d = 1 everything is passed on, for smaller d there is an attenuation. For values ​​of d close to 1, however, convergence problems arise. The number of iterations until the result is stable increases. For a real calculation of the PageRank approx. 100 iterations would be necessary. Of course, the exact number also depends on the link structure and the desired accuracy. The number of iterations can be reduced if, for example, the final result of the last iterations is taken as the starting value (provided the link structure has not fundamentally changed in the meantime).

If one writes the equation in components and leaves out the designation for the number of the iteration, the result is

PRi= (1 - d) + d (PRX1 / CX1 + ... + PRXn / CXn )

The last equation is often referred to as the PageRank algorithm. Strictly speaking, however, is only an iteration scheme for the numerical solution of the above matrix inversion. The Jacobi iteration is simple, but not the fastest method. For example, the following iteration scheme (minimal residue) converges

PR {k + 1} = PR {k} + R {k} (∑ Ri M.ij R.j) / (∑ Min R.n M.ij R.j)

With

R {k + 1} = R {k} - M * R {k} (∑ Ri M.ij R.j) / (∑ Min R.n M.ij R.j)

generally faster. In addition, there are a number of further iteration schemes for matrix inversion (see e.g. A. S. Householder. "The Theory of Matrices in Numerical Analysis" New York, 1964) such as Gauss-Seidel, over relaxation methods, conjugate gradient methods, preconditioning, multigrid and blocking techniques or Chebyshev. However, some of these methods are limited to Hermitian matrices.

In some cases, the system of equations is used as an alternative

M * PR = (1 - d) / N

considered. Obviously, the same solution vector only results with a different normalization (1 / N).

The case d = 1 (i.e. no damping) is a special case. In this case, the eigenvector of the matrix T with the eigenvalue one is sought. The problem is degenerate eigenvalues. These always occur if the link structure is such that not every page can be reached from every other. This is the case when there are dead ends (i.e. pages with no outbound links) or closed structures exist. The final result of the iteration in this case depends on the initial values ​​and is a combination of the various eigenvectors with eigenvalue 1. For d < 1="" stellen="" tote="" ende="" kein="" problem="" dar="" und="" brauchen="" nicht="" gesondert="" behandelt="" zu="">

Random surfer model

A clear interpretation of the PageRank (for d <1) is the random surfer model: a surfer starts on a random page and then moves through the network by following a link on the page with the probability d at random or with the probability (1-d) restarts on any side. If there are no dead ends, the probability of landing on page i corresponds exactly to the PageRank of the system of equations

M * PR = (1 - d) / N If there are such dead ends in the system, then the probability vector is proportional to the PageRank vector.

Sample calculations

Simple examples

In the following, the calculation of the PageRank will be illustrated using a few examples. The normalization / formula M * PR = (1-d) is used. Most calculations are done analytically. Numerical values ​​result from the choice of specific parameters, e.g. B. a damping factor of d = 0.85.

example 1
The simplest case is unlinked websites. The following applies here: PRA. = PRB. = PRC. = (1 - d) All pages have the same PageRank. The value corresponds to the possible minimum value. The result applies accordingly to any number of unlinked pages.
Example 2
Each page links every other page. The result: PRA. = PRB. = PRC. = 1 Here, too, all pages have the same PageRank - as with any symmetrical link. The result also applies to any number of pages.
Example 3
The first non-trivial case: Page B links to A and C, which in turn link back. The result: PRB. = (1 + 2 d) / (1 + d) PRA. = PRC. = (1 + d / 2) / (1 + d) Obviously the PageRank of page B is higher than that of A and C. The sum of the PageRank values ​​is equal to the number of pages: PRtotal = 3 It can be shown that for every link structure in which every page has at least one outgoing link (i.e. no dead ends exist), the sum of all PageRank values ​​is equal to N (the number of pages).

More complex examples

Example 4
Your own website consists of a single page, has no outbound links, but some inbound links: PRHome = (1 - d) + X X denotes the sum of all contributions from incoming links.
Example 5
Your own website consists of N + 1 pages. The start page links to N subpages (i = 1,…, i = N, N> 0), which each link back to the start page. There are some inbound links to the home page but no outbound links to external sites: PRHome = (N d + 1) / (1 + d) + X / (1 - d2 ) PRi = (N + d) / N (1 + d) + d X / N (1 - d2 ) PR applies to the sum of all contributions on one's own sidetotal = N + 1 + X / (1 - d) The part N + 1 is itself 'produced'; the portion X / (1 - d) is based on the transferred PageRank. The factor 1 / (1 - d) results from ∑ di = 1 / (1 - d) i.e. the PageRank X is transferred in the zeroth order, X in the first d X etc.
As mentioned earlier, the PageRank produced is equal to the sum of the pages, provided there are no dead ends.

Dead ends are a waste of PageRank. This becomes clear when one compares the result for N = 1 with example 4. The PageRank has increased by far more than 1 (due to the additional page). The link structure is ideal if you want to give a page the highest possible value. If no self-links (links to one's own page) are taken into account in the PageRank calculation, there is no structure that gives an individual page more PageRank.
Example 6
Your own website consists of N + 1 pages. The start page links to N subpages, all of which are linked to one another. There are some inbound links to the home page but no outbound links to external sites:
PRHome = 1 + X (N (1 - d) + d) / (N + d) (1 - d) PRi = 1 + d X / (N + d) (1 - d) PR applies to the sum of all contributions on one's own sidetotal = N + 1 + X / (1 - d) You can show that the result is always the same, provided there are no outgoing links and no dead ends on your own website. The internal link structure only changes the distribution.
Compared to example 5, the PageRank of the start page is lower, the PageRank of the subpages is higher.
Example 7
As in example 6, but all sub-pages have an outgoing link to external pages. The linked external pages have no connection (direct or indirect) to the pages with the incoming links, i.e. the linking of the external pages does not lead to an increase in the contributions of incoming links (X).
To simplify matters, d = 0.8 and N = 20 were assumed: PRHome ≈ 0.845 + 1.124 X PRi ≈ 0.846 + 0.163 X PRtotal ≈ 17.78 + 4.38 X If you compare the values ​​with example 6 (d = 0.8, N = 20), you can see that all values ​​have decreased:
PRHome = 1 + 15/13 X ≈ 1 + 1,154 X PRi = 1 + 5/26 X ≈ 1 + 0.192 X PRtotal = 21 + 5X
Example 8
As in example 6, but the start page has an outbound link to an external page that has no connection (direct or indirect) to the pages with the inbound links.
To simplify matters, d = 0.8 and N = 20 were assumed: PRHome ≈ 0.993 + 1.145 X PRi ≈ 0.991 + 0.182 X PRtotal ≈ 20.81 + 4.78 X If you compare the values ​​with example 6 (d = 0.8, N = 20), you can see that all values ​​have decreased, while the PageRank is higher than in example 7.

Extensions & modifications

Personalized PageRank

The personalized PageRank was first mentioned by Page, Brin, Motwani and Winograd. The idea is simple: Instead of distributing the sources of the PageRank evenly, they are modified individually (according to interests). The PageRank is calculated specifically

PR = M-1 * V where V is a personalized vector. With an even distribution of the sources Vi = the normal PageRank vector is constant. The choice of the source vector V depends on personal interests. For example, one could use the personal bookmarks as sources for the personalized PageRank. In this way, for example, a surfer with a strong interest in commercial pages would get the same as favorite search results, while with a user with an interest in non-commercial information pages, these would preferably be displayed as search results.

The problem with personalized PageRank is that it either has to be calculated individually for each person or a complete basic set of vectors has to be determined first. However, this basic rate would correspond to the number of all websites. Both calculation methods are currently practically impossible due to the calculation time.

Topic-dependent PageRank

With the topic-dependent PageRank, transition matrices T [i] are formed for individual topics. Out

PR [i] = M [i]-1 * (1 - d) a topic-dependent PageRank PR [i] is then determined. The number of topics is relatively small (e.g. the 16 top-level topics of the ODP). By individually weighting (w [i]) the individual topic-dependent values, an individual (personalized) pagerank can then be obtained PR = Σ w [i] PR [i] Σ w [i] = 1 This results in a simple form of the personalized PageRank .

Modular PageRank

The modular PageRank is another form of the personalized PageRank. A basic set of important, central websites is used as sources for the PageRank:

PR [i] = M-1 * V [i] By weighting the PR [i] depending on the personal interests, a personalized PageRank results. In the borderline case i → (number of all websites) the general form of the personalized PageRank results.

BlockRank

The BlockRank is not a new approach for the ordering of the results, but only a method for faster calculation of the PageRank. Here, the RageRank is first calculated on small blocks (e.g. domains) before the overall calculation is carried out. This offers two advantages: On the one hand, the calculation is faster, and on the other hand, the calculation can be parallelized.

The use of block techniques in matrix inversion methods is not new. They were used long before the PageRank process was described. In addition to the block technique, there are other more efficient calculation methods. In particular, in the field of lattice theory there are many methods to invert large matrices with many entries being zero.

miscellaneous

PageRank production & Google's current algorithm

If you have studied the PageRank algorithm and want to improve your own site, you will quickly come up with the idea of ​​generating millions of pages with the sole purpose of producing PageRank. In theory, this is possible with a suitable link structure - but not in practice! Google fundamentally modified its algorithm years ago. One of these changes prevents PageRank from 'producing' itself. On the contrary, it even requires PageRank (i.e. corresponding inbound links) to completely spider a larger website. In addition, Google has made some other small adjustments and e.g. adjusted the value of the damping factor from originally d = 0.85.

Another practical aspect is the question of which links are counted by Google.If page A links twice to page B and once to page C, there are several ways in which the PageRank could be distributed. Google's current implementation does not take multiple links into account, i.e. in the example above, 50% of the transferred PageRank is distributed between page B and page C. There are also various implementation options for linking the page itself (i.e. a self-link). Google self-links are currently being considered. In contrast, links marked with the rel = "nofollow" attribute are not evaluated.

Google toolbar and directory values

There are currently only two ways to get information about the PageRank: the Google toolbar and the Google directory. There are also providers who display the PageRank directly in the browser, but they also use the Google toolbar values.

The Google toolbar works on a logarithmic scale from zero to ten. Due to the logarithmic scale, an increase by one unit in the toolbar corresponds to an increase in the real PageRank by a factor of b - the logarithmic base. The scale is presumably adjusted so that the page with the currently highest PageRank has a toolbar value of exactly 11.

The Google directory worked on a logarithmic scale from zero to seven until early 2008. The other classification made it possible to get additional, more precise information about the value of the PageRank. On the one hand, websites that have the same toolbar value could have a different value in the directory, and on the other hand, the order according to PageRank in the directory resulted in further information on the exact PageRank. Since the directory is a clone of the ODP, the page must be listed there in order to receive the additional information. Problems arose, however, because the toolbar and directory values ​​are not updated at the same time. In general, the directory values ​​were more recent. In the meantime, however, only the toolbar value from 0 - 10 is displayed in the Google directory.

For some time now, toolbar values ​​have also been manipulated. For this purpose, pages are redirected to existing pages with a high PageRank. In this case, Google equates both sides and adopts the PageRank value of the landing page. If the forwarding is replaced by new content after some time, the displayed (incorrect) PageRank remains for the time being. Alternatively, cloaking is also used and a redirect is only shown to the search engines while visitors see other content. Incorrect PageRank values ​​can be identified by viewing cached and archived versions, reviewing the references, checking inheritance and using the 'info' command.

PageRank calculator

PageRank calculators calculate the PageRank of the individual pages numerically for selectable link structures. PageRank calculators can be useful to increase the basic understanding. Of course, only a small number of pages can be viewed with this. In addition, external inbound and outbound links are often not included. Since they neither take into account Google's current changes nor are more complex structures possible, they serve more didactic purposes and not as a concrete decision-making aid in the development of link structures.

PageRank prediction

There are some sites on the Internet that promise to predict future PageRank. Since, of course, no page has the link structure of the Internet or the part indexed by Google, the 'forecast' must be based on other data. Either only the number of incoming links for a website is taken, which is very imprecise, or the incoming links and the associated toolbar value of the PageRank are used for the forecast. Even in the second case, however, there are a large number of unknowns:

  • If the inbound links are taken by Google, a problem arises because Google does not display all the referring pages. If the incoming links are taken from other search engines, it is not clear whether Google takes them into account at all.
  • It cannot be checked whether the PageRank is inherited.
  • The toolbar value of the PageRank is only available as a whole number. A displayed value of PR6 can mean 6.0 or 6.99. A mean value must therefore be expected.
  • The values ​​displayed in the toolbar are incorrect for dynamic URLs.
  • The currently displayed toolbar PageRank values ​​of the reference pages are used. However, these values ​​change after the next calculation.
  • The basis of Google's logarithmic toolbar scale must be included in the calculation. Here, completely wrong values ​​between 5 and 8 are almost always used.
  • The number of outbound links must be estimated using an average.
  • The toolbar scale is not fixed so that the toolbar value can change even if the real PageRank remains constant.

All of these errors add up. A meaningful forecast is therefore impossible. The hit rate should not be higher than with a simple estimate of the form 'Maximum PageRank of a referring page -1' or 'The PageRank of the page does not change'.

Buy & Sell

If you want to improve your own PageRank, buying is an option. However, there are a number of points to consider here:

  • Is the PageRank on the seller side real?
  • Will the PageRank be transferred to other sites? Is it blocked by the page itself (e.g. with nofollow or a script) or is Google preventing redirection?
  • How many links (external plus internal) are there on the page? Is this number fixed or can it change?
  • What happens if the seller's PageRank changes?

If all these questions have been answered, the purchase of PageRank can be quite useful. However, this must be decided on a case-by-case basis on the basis of specific projects. However, there is no guarantee that the PageRank will actually be transferred. In principle, Google can intervene manually in the algorithm at any time and prevent the transfer for the purpose of sales, as happened in the example of SearchKing.com.

The PageRank algorithm as PDF (150 kB)


questions and answers

  • No PageRank information available

    Why does the Google toolbar show 'No PageRank information available' on some pages?
  • Are speaking URLs useful?

    Hello, in the context of SEO I keep hearing about speaking URLs, e.g. those from Bild.de. Is that a marketing gag or do such URLs have a positive influence on the PageRank or the search results?
  • Web statistics and search engine optimization

    Dear SEO world, as part of an interproject, the requirement for so-called link or element tracking is made. Example: An element consists of a linkable element 'Heading', 'Image' and 'Link'. All 3 targets are the page www.domain.de/zielzeite. In order to track the use of the above-mentioned elements, a separate parameter is required to differentiate between them. This means, ...
  • Question about the pagerank of a website

    Hello, our website www.xxx.de is a website created with Joomla. Quite extensive but with dynamic URLs. I've tried to optimize it as best I can. The site is listed in some web catalogs. But strangely enough, the pagerank (0) does not rise. Although we are sometimes found with good placements. A sitemap is sent to Google regularly. The ...
  • Pagerank loss after update

    After the current pagerank update, many pages were punished with a pagerank loss - heise.de/newsticker/meldung/98042. Is there already an explanation for this?
  • Pagerank changed

    After the current pagerank update one often reads that the algorithm has been tightened or internal links have been devalued. Can this be?
  • Pagerank and traffic

    Could it be that there is a connection between the page rank and the traffic of a page? Does a page with less traffic get a lower pagerank?
  • fake PageRank

    How can I tell whether the displayed pagerank is genuine or whether it is a fake pagerank?
  • PageRank abolished

    Is Pagerank still relevant at the moment? Will the pagerank procedure be abolished in the near future?
  • Cause pagerank loss

    All the pages on my homepage had PR4 for a long time. Now many are only PR3. What can be the reason?
  • Pagerank for dynamic pages

    The ad for my dynamic ratings is a little weird. I have just created a new page and a pagerank of 3 is displayed. Is the value real?
  • Sandbox through Pagerank

    I read a theory that the sandbox is caused by fluctuations in the calculation of the ragrank. When calculating PR (A) = (1-d) + d (PR (T1) / C (T1) + ... + PR (Tn) / C (Tn)), the values ​​change during the first iterations, in particular for new domains, very strong. This is supposed to be responsible for the effect. Can that be true?
  • Pagerank despite penalty

    My site seems to have received a penalty from Google. All pages are from the index. Nevertheless, the homepage shows a PR5 - can that be?
  • Pagerank update

    Why has my ranking not improved even though my page rose from PR2 to PR4 the last time it was updated?
  • Pagerank inheritance

    Is it true that the pagerank value -1 is inherited by subdirectories?