Link theory for search engines
Table of contents
- Link theory for search engines
- Preface
- I. Robin Li's hyperlink analysis patent
- II. HITS algorithm
- III. TrustRank algorithm
- IV. Google PR
  - 1. Concept and calculation of PR
  - 2. Two figurative models of PR
  - 3. Toolbar PR
  - 4. A few misconceptions about PR
  - 5. The meaning of PR
- V. Hilltop algorithm
Preface
Before Google, traditional search engines relied on keywords in page content to match users' query terms for ranking. The downside of this approach, as is now apparent, is that it is easily manipulated: black hat SEOs could improve rankings simply by stuffing keywords into a page or adding popular keywords unrelated to its topic, and the quality of search results declined significantly as a result. Today's search engines use link analysis to reduce spam and improve the user experience. This section briefly explores how links are used in search engine rankings.
Factoring links into rankings not only helps reduce spam and improve the relevance of results, it also allows files that traditional keyword matching cannot handle to be ranked. Images and videos, for example, cannot be matched against keywords, but they may have external links pointing at them. Through that link information, a search engine can understand what an image or video is about and rank it.
Ranking pages written in another language also becomes possible. Search for "SEO" on Baidu or Google.cn, for example, and you will see English-language SEO sites among the results. Even a search for "search engine optimization" may bring up pages that are not in Chinese, because some links use "search engine optimization" as anchor text pointing to an English page.
Link factors are now more important than page content. However, link relationships are abstract and hard to grasp, whereas the influence of on-page factors on rankings is easy to observe and understand intuitively. A simple example: for any given keyword, an SEO only needs to study the first few pages of results to see the effect of having the keyword in the title tag, or of placing it near the front. Anyone with the technical resources can also run large-scale statistics and work out the relationship between keyword position in the title tag and rankings. This relationship is not necessarily causal, but it is at least a statistical correlation that gives SEOs a rough sense of how to optimize.
The impact of links on rankings is hard to visualize and measure because no one has access to search engines’ link databases. The best we can do is qualitative observation and analysis.
Below are several link-related patents that shed some light on how links are used in search engine rankings and how much weight they carry.
I. Robin Li's hyperlink analysis patent
Baidu founder Robin Li was one of the top search engine engineers in the United States before returning to China to found Baidu. It is said that when Li was seeking venture capital, investors put a question to three other technical experts in the search engine industry: whom should one ask to understand search engine technology? Two of the three answered: ask Robin Li. From this the investors concluded that Li was one of the people who understood search engines best.
This is how links work in real life: to determine which pages are the most authoritative, look not just at what the pages say about themselves, but at what other pages say about them.
Li filed a patent application called "Hyperlinked Document Retrieval System and Method" in 1997. This was very forward-looking work, predating the invention of PR by Google's founders. In the patent, Li proposed a link-based ranking method different from traditional information retrieval systems.
In addition to indexing pages, the system also builds a link-word library that records information about link anchor text: which keywords appear in the anchor text, which page the link appears on, how many links in total contain a particular anchor text, and which pages the links containing a particular keyword point to. The word library contains not only the original keywords but also other keywords derived from the same stem.
From this link data, especially the anchor text, a link-based document relevance is calculated. When a user searches, this link-based relevance is combined with the traditional relevance based on keyword matching to produce more accurate rankings.
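To make the idea concrete, here is a minimal Python sketch (not the patented method itself) of how a link-word library keyed on anchor text might be built, and how a link-based relevance score could be blended with a traditional keyword-match score. The data structures, the `alpha` mixing weight, and all function names are illustrative assumptions.

```python
from collections import defaultdict

def build_anchor_index(links):
    """links: iterable of (source_page, target_page, anchor_text) tuples.

    Returns a simplified 'link word library':
    keyword -> {target_page: number of links whose anchor text contains that keyword}."""
    index = defaultdict(lambda: defaultdict(int))
    for _source, target, anchor in links:
        for word in anchor.lower().split():
            index[word][target] += 1
    return index

def combined_relevance(query, page, anchor_index, keyword_score, alpha=0.5):
    """Blend a traditional keyword-match score with an anchor-text link score.

    alpha is an illustrative mixing weight, not a value from the patent."""
    link_score = sum(anchor_index[word].get(page, 0) for word in query.lower().split())
    return alpha * keyword_score + (1 - alpha) * link_score

# Toy usage
links = [
    ("a.com", "seo-guide.com", "search engine optimization guide"),
    ("b.com", "seo-guide.com", "SEO tutorial"),
    ("c.com", "other.com", "holiday photos"),
]
idx = build_anchor_index(links)
print(combined_relevance("search engine optimization", "seo-guide.com", idx, keyword_score=2.0))
```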
Today this kind of link-based relevance calculation is standard in search engines, something every SEO knows about. But 17 or 18 years ago it was a very innovative concept. Of course, current search engine algorithms consider far more than anchor text when evaluating links and are much more complex.
The patent was owned by the company Li worked for at the time, with Li as the inventor. Interested readers can look up the details of the "Hyperlinked Document Retrieval System and Method" patent issued by the U.S. Patent Office at patft.uspto.gov/netacgi/NPH…
II. HITS algorithm
HITS stands for Hyperlink-Induced Topic Search. The HITS algorithm was proposed by Jon Kleinberg in 1997 and patented: patft.uspto.gov/netacgi/NPH… According to the HITS algorithm, after a user enters a query, two values are calculated for each matching page returned: a hub score and an authority score. The two values are interdependent and influence each other. A page's hub score is the sum of the authority scores of all the pages its outbound links point to; its authority score is the sum of the hub scores of all the pages that link to it.
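A minimal sketch of the mutually dependent hub/authority calculation just described, run on a toy link graph. The per-round normalization and the fixed iteration count are standard implementation choices assumed here, not details taken from the patent.

```python
import math

def hits(outlinks, iterations=30):
    """outlinks: dict page -> list of pages it links to.
    Returns (hub, authority) score dicts."""
    pages = set(outlinks) | {p for targets in outlinks.values() for p in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in outlinks.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of authority scores of the pages linked to.
        hub = {p: sum(auth[t] for t in outlinks.get(p, [])) for p in pages}
        # Normalize so the scores converge instead of growing without bound.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Toy graph: a "directory" page acts as a hub pointing at two authorities.
graph = {"directory": ["site_a", "site_b"], "blog": ["site_a"], "site_a": [], "site_b": []}
print(hits(graph))
```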
HITS abstracts two important types of pages: hubs and authorities. A hub page may not itself have many inbound links, but it has many outbound links pointing to authority pages. An authority page may not have many outbound links of its own, but it has many inbound links from hub pages.
Typical hub pages are web directories such as the Yahoo directory, the Open Directory, or hao123. The function of such high-quality directories is to point to other authoritative websites, hence the name hub. Authority pages have many inbound links, including many from hub pages, and are usually the pages that provide genuinely relevant content.
The HITS algorithm is computed for a specific query term, which is why it is called topic search.
The biggest drawback of the HITS algorithm is that it performs its calculations at query time rather than during crawling or preprocessing, so it comes at the cost of query response time. For this reason, the original HITS algorithm is not widely used in search engines as-is, but its ideas are likely incorporated into the indexing phase, where links are used to identify pages with hub or authority characteristics.
Being an authority page is the first choice, but the only way to get there is to attract quality links. When your site cannot be an authority page, make it a hub page. Outbound links are therefore also among today's search engine ranking factors, and never linking out to other sites is not good SEO practice.
III. TrustRank algorithm
TrustRank, which can be understood as a trust index, is a link-based ranking algorithm that has attracted growing attention in recent years.
TrustRank was originally developed in 2004 as a joint research project between Stanford University and Yahoo to detect spam sites, and it was patented in 2006. The inventors of the TrustRank algorithm also published a paper explaining how it is applied; interested readers can download the PDF at www.vldb.org/conf/2004/R…
TrustRank was not invented by Google, but because Google has the largest market share and TrustRank is a very important factor in Google rankings, some people mistakenly believe that Google invented it. Adding to the confusion, Google once filed a trademark application for "TrustRank", but that TrustRank referred to Google's method of detecting sites carrying malicious code, not to the trust index in its ranking algorithm.
The TrustRank algorithm rests on the basic assumption that good sites rarely link to bad ones. The reverse, however, does not hold: it is not true that bad sites rarely link to good ones. On the contrary, many spam sites link to high-authority, high-trust sites in an attempt to raise their own trust scores.
Based on this assumption, if one can select a set of sites that can be trusted 100 percent, those sites receive the highest TrustRank. The sites they link to get a slightly lower, but still high, trust score, and trust continues to decline as the second tier of trusted sites links to a third tier. For various reasons, good sites will inevitably link to some spam sites, but the closer a site sits, in clicks, to the first-tier sites, the higher its trust index, and the farther away it sits, the lower its trust index. Once the TrustRank algorithm has computed a trust score for every site, the farther a site is from the first tier, the more likely it is to be spam.
TrustRank is calculated by first selecting a batch of seed sites, reviewing them manually, and assigning them initial TrustRank values. There are two ways to pick seed sites. One is to pick the sites with the most outbound links, because the TrustRank algorithm models how the score decays along outbound links; sites with many outbound links can, in a sense, be understood as having a high "reverse PR".
The other way is to pick sites with high PR, because the higher the PR, the more likely a site is to appear on search results pages. These are the sites TrustRank most needs to examine and whose rankings most need adjusting; pages with low PR rank low anyway, so computing TrustRank for them would be of little use.
According to the calculations, selecting about 200 websites as seeds is enough to compute reasonably accurate TrustRank values for all websites.
There are two ways to model how TrustRank decreases along links. One is decay by link depth: if the TrustRank of a first-tier page is 100, that of a second-tier page is 90 and that of a third-tier page is 80. The other is splitting TrustRank across outbound links: if a page has a TrustRank of 100 and five outbound links, each link passes on 20 percent of the value. Decay and splitting are usually used together, and the overall effect is that TrustRank falls gradually as the link depth increases.
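The sketch below illustrates how decay and splitting might work together: trust flows out from manually scored seed sites, is damped at each hop, and is divided equally among a page's outbound links. The 0.85 decay factor and the iteration scheme are illustrative assumptions, not the exact constants of the original TrustRank paper.

```python
def propagate_trust(outlinks, seed_scores, decay=0.85, iterations=20):
    """outlinks: dict site -> list of sites it links to.
    seed_scores: manually assigned trust values for seed sites.
    Trust is damped by `decay` at each hop (decay) and divided equally
    among outbound links (splitting)."""
    sites = set(outlinks) | {t for targets in outlinks.values() for t in targets}
    trust = {s: seed_scores.get(s, 0.0) for s in sites}
    for _ in range(iterations):
        new_trust = {s: seed_scores.get(s, 0.0) for s in sites}  # seeds keep their score
        for src, targets in outlinks.items():
            if not targets:
                continue
            share = decay * trust[src] / len(targets)  # split across outbound links
            for t in targets:
                new_trust[t] += share
        trust = new_trust
    return trust

# Toy example: one trusted seed linking down two tiers; trust falls with depth.
graph = {"seed.org": ["good.com", "ok.com"], "good.com": ["deep.com"], "ok.com": [], "deep.com": []}
print(propagate_trust(graph, {"seed.org": 100.0}))
```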
Once the TrustRank values of sites and pages have been calculated, they can influence rankings in two ways. One is to take the relevant pages selected by the traditional ranking algorithm and rerank them according to their TrustRank. The other is to set a minimum TrustRank threshold: pages above the threshold are considered good enough to rank, while pages below it are treated as spam and filtered out of the search results.
Although the TrustRank algorithm was originally developed as a spam-detection method, the TrustRank concept is used far more broadly in current ranking algorithms and often affects the overall ranking of an entire site. The original algorithm worked at the page level; in today's search algorithms, TrustRank is usually expressed at the domain level, and the higher a domain's overall trust index, the stronger the whole domain's ranking ability.
IV. Google PR
PR is short for PageRank. Google PR theory is the best known of all link-based search engine theories. SEO people may not be aware of the other linking theories covered in this section, but they can’t be unaware of PR. PR was invented by Larry Page, one of the founders of Google, to indicate the importance of a page. In the simplest terms, the more backlinks a page has, the more important it is, and therefore the higher the PR value.
Google PR is somewhat similar to citation analysis in scientific and technical literature: papers cited by many other papers are likely to be relatively important.
1. Concept and calculation of PR
We can view the Internet as a directed graph of nodes and links. Each page is a node, and the directed links between pages convey page importance. The PR value a link passes on is determined by the PR of the page the link sits on: the higher that page's PR, the more PR it can pass. How much PR is passed also depends on the number of outbound links on the page. Suppose a page has 100 units of PR to hand down to the pages it links to: with 10 outbound links, each link passes 10 units; with 20 outbound links, each passes only 5. So the PR of a page depends on the total number of inbound links, the PR of the pages those links come from, and the number of outbound links on those source pages:

PR(A) = (1 − d) + d × (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
Here A stands for page A, and PR(A) is the PR value of page A. d is the damping coefficient, generally taken to be 0.85. T1…Tn are the pages that link to page A. C is the number of outbound links on a page, so C(T1) is the number of outbound links on page T1. The definition and formula show that PR values can only be obtained through repeated iteration: the PR of page A depends on the PR of the pages T1 to Tn that link to it, which in turn depends on the PR of other pages, quite possibly including page A itself. In the calculation, every page is given an initial value; after a certain number of iterations, the PR of each page stabilizes and converges to a specific value. It has been proved that the final iterated PR values do not depend on how the initial values are chosen.
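As an illustration of the iterative calculation just described, the following sketch applies the formula PR(A) = (1 − d) + d × Σ PR(Ti)/C(Ti) to a toy graph. The initial value and the convergence threshold are arbitrary choices and, as noted above, do not affect the final result.

```python
def pagerank(outlinks, d=0.85, tol=1e-8, max_iter=200):
    """outlinks: dict page -> list of pages it links to.
    Iteratively applies PR(A) = (1 - d) + d * sum(PR(Ti)/C(Ti))."""
    pages = set(outlinks) | {t for targets in outlinks.values() for t in targets}
    pr = {p: 1.0 for p in pages}          # arbitrary initial value
    for _ in range(max_iter):
        new_pr = {}
        for p in pages:
            incoming = sum(pr[src] / len(targets)
                           for src, targets in outlinks.items()
                           if p in targets)
            new_pr[p] = (1 - d) + d * incoming
        if max(abs(new_pr[p] - pr[p]) for p in pages) < tol:
            return new_pr                  # converged: initial values no longer matter
        pr = new_pr
    return pr

# Toy graph: C links to A and B; A and B link to each other.
graph = {"A": ["B"], "B": ["A"], "C": ["A", "B"]}
print(pagerank(graph))
```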
A brief note on the damping coefficient. Consider a link loop of the kind shown in the figure, which is bound to exist on the real web.
The external page Y injects PR into the loop, and the pages inside the loop pass PR to one another iteratively; without a damping coefficient the accumulated PR would grow toward infinity. Only by introducing a damping coefficient, so that PR naturally decays as it is passed along, can the calculation stabilize at a value.
2. Two figurative models of PR
There are two famous metaphors for PR. The first is voting. Links are like democratic votes: if page A links to page B, page A casts a vote for page B, making page B more important. At the same time, the PR of page A determines its voting power: the higher page A's PR, the more weight its vote carries. In this sense, traditional keyword matching looks at what a page says about itself, while link-based PR looks at what other pages say about it.
The second is the random surfer metaphor. Suppose a visitor starts on one page and clicks links at random to move to the next. Sometimes the user gets bored, stops clicking, jumps to a random new URL, and starts clicking again from there. The PR value of a page is the probability that it is visited during this random surfing. The more inbound links a page has, the higher the probability it is visited, and therefore the higher its PR.
The damping coefficient is also related to the random surfer model: (1 − d) = 0.15 is precisely the probability that the user gets bored, stops clicking, and jumps to a random new URL.
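The random surfer metaphor can be illustrated with a short simulation: the surfer follows a link with probability d = 0.85 and jumps to a random page with probability 0.15, and the fraction of visits each page receives approximates its (scaled) PR. This is purely a didactic sketch, not how PR is computed in practice.

```python
import random
from collections import Counter

def random_surfer(outlinks, d=0.85, steps=100_000):
    """Simulate the random surfer: follow a link with probability d,
    otherwise jump to a random page. Visit frequency approximates PR."""
    pages = list(set(outlinks) | {t for ts in outlinks.values() for t in ts})
    visits = Counter()
    current = random.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        links = outlinks.get(current, [])
        if links and random.random() < d:
            current = random.choice(links)   # keep clicking links
        else:
            current = random.choice(pages)   # bored: jump to a random URL
    return {p: visits[p] / steps for p in pages}

graph = {"A": ["B"], "B": ["A"], "C": ["A", "B"]}
print(random_surfer(graph))
```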
3. Toolbar PR
The actual Google PR used in ranking calculations is not public; all we could ever see was the Google toolbar PR. To be clear, the toolbar PR is not an accurate reflection of the real PR. The real PR is a precise number greater than 0.15 with no upper limit, whereas the value shown on the toolbar is normalized into eleven integer grades from 0 to 10: the lowest real values map to roughly 0 and the highest to roughly 10. Each toolbar grade therefore covers a wide range, and the real PR of two pages both showing toolbar PR5 may differ many times over.
The real PR is computed and updated continuously; the toolbar PR is just a simplified snapshot of the real PR at a point in time. Over the past decade or so, Google updated toolbar PR at intervals ranging from roughly a month to as long as a year. In October 2014, Google employee John Mueller said in a video Q&A that Google might not update toolbar PR again. The last toolbar PR update was on December 6, 2013, and even that one was pushed out accidentally by Google engineers working on something else rather than as a planned update, so it is a safe bet that Google will not update toolbar PR in the future. The update dates of toolbar PR in recent years are shown in the table below.
| Toolbar PR update date |
| --- |
| December 6, 2013 |
| February 4, 2013 |
| November 7, 2012 |
| August 2, 2012 |
| May 3, 2012 |
| February 6, 2012 |
| November 8, 2011 |
Toolbar PR has a logarithmic, not linear, relationship with the number of backlinks. Roughly speaking, if 100 external links are needed to go from PR1 to PR2, about 1,000 are needed to go from PR2 to PR3, and far more still to go from PR5 to PR6. A site with a higher PR therefore takes much more time and effort to move up one grade than a site with a lower PR.
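Assuming, purely for illustration, that each toolbar grade requires roughly ten times the real PR of the grade below (the actual base Google used is unknown), the mapping from real PR to toolbar PR might look like this sketch.

```python
import math

def toolbar_pr(real_pr, base=10, scale=1.0):
    """Map an unbounded 'real' PR onto the 0-10 toolbar scale.

    `base` encodes the assumption that each toolbar grade needs roughly
    `base` times more real PR than the grade below; the true values are unknown."""
    if real_pr <= scale:
        return 0
    return min(10, int(math.log(real_pr / scale, base)))

for pr in (1, 50, 500, 5_000, 50_000):
    print(pr, "->", toolbar_pr(pr))
```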
4. A few misconceptions about PR
The full English name of PR is PageRank. The name comes from its inventor, Larry Page, whose surname happens to coincide with the English word "page". Strictly speaking, PageRank is named after Page rather than after web pages, but by convention, and thanks to the neat pun, everyone treats PR as a page-level score.
The PR value is related only to links. Webmasters often ask: my site has been running for a long time and all its content is original, so why is its PR still zero? In fact, PR has no direct relationship with how hard the webmaster works, how long the site has existed, or whether the content is original: with backlinks there is PR, and without backlinks there is none. A high-quality original website will naturally attract more external links and thereby raise its PR indirectly, but that is not guaranteed.

There is also no correspondence in time between toolbar PR updates and ranking changes. Whenever toolbar PR was updated, some webmasters would notice that their PR had risen and conclude that this explained a recent improvement in rankings. It is safe to say this is only a coincidence of timing. As mentioned earlier, the real PR used in ranking calculations is computed and updated continuously and is always factored into the rankings, while the toolbar PR we saw was updated only every few months, the last time in December 2013. Even when toolbar PR was updated, by the time we saw the change the real PR had already been updated and reflected in the rankings for months. It is therefore meaningless to obsess over the relationship between PR and rankings by studying toolbar PR changes.
5. The meaning of PR
Google engineers have said many times that the importance of Google PR is overhyped: PR is only one of the 200-plus factors in Google's ranking algorithm, and its weight has declined considerably, so SEOs do not need to obsess over raising it. This is probably also why Google no longer updates toolbar PR.
That said, PR remains an important factor in Google's ranking algorithm. Beyond its direct influence on rankings, PR matters for several reasons. (1) Crawl depth and the number of pages indexed. Search engine spiders have limited crawl time and database space, and Google wants to prioritize the most important pages, so a site with higher PR gets more of its pages indexed and has its deeper internal pages crawled. For large and medium-sized websites, home page PR is one of the important factors driving how thoroughly the site is indexed.
(2) Crawl frequency and update speed. The higher the PR, the more often search engine spiders visit the site, and the faster new pages or updated content on old pages get indexed. Because new pages on a site are usually linked from existing pages, frequent visits mean new pages are discovered sooner.
(3) Duplicate content determination. When Google finds identical content on different sites, it selects one copy as the original and treats the others as reprints or copies. When a user searches a relevant query, the version judged to be the original is ranked first. PR is an important factor in deciding which version counts as the original. This is why, when a large site with high weight and high PR republishes content from a small site, the large site's copy is often treated as the original.
(4) Selection of the initial ranking subset. As mentioned in the earlier introduction to the ranking process, a search engine cannot run relevance calculations on every file that matches the keywords, because millions or tens of millions of files may be returned. It must first select an initial subset of files and then calculate relevance on that subset. Selecting the initial subset obviously cannot rely on keyword relevance itself; it can only start from page importance, and PR is an important keyword-independent measure of importance.
The current PR algorithm has certainly been improved and changed from what was described in the original Larry Page patent. One observation is that the PR calculation appears to exclude links Google considers suspicious or invalid, such as paid links and spam links in blogs and forums. Sometimes we see a page with inbound links from PR6 or even PR7 pages that, after a couple of toolbar PR updates, is still stuck at PR3 or even PR2, when a link from a PR6 or PR7 page should normally lift the linked page to PR4 or PR5. It is therefore likely that Google has excluded links it deems suspicious from its PR calculations.
For example, should links in different positions on the same page pass the same amount of PR? Should body-text links, sidebar navigation links, and footer links be treated the same? Under the original PR design, yes, because link position was not considered. But links in different positions clearly differ in importance, and in the probability that real users click them, so should the PR they pass be the same? Has the current Google PR algorithm introduced corrections for this?
The PR patent was invented by Larry Page, is owned by Stanford University, and is licensed to Google for permanent use. Although PR is in this sense Google's proprietary algorithm, every other major search engine has a similar algorithm of its own, just not called PR. The role and significance of PR discussed here therefore apply to other search engines as well.
V. Hilltop algorithm
The Hilltop algorithm was developed by Krishna Bharat around 2000. He filed a patent application in 2001 and licensed the patent to Google, and Bharat himself later joined Google.
The Hilltop algorithm can be loosely understood as topic-sensitive PR. Traditional PR is not tied to any particular keyword or topic; it reflects link relationships alone, which leaves room for a kind of loophole. For example, a high-PR university page about environmental topics might link to a children's products website simply because the page is maintained by a professor whose wife works for a company selling children's products. Such high-PR but topically unrelated links can help some sites rank well even though their actual authority and relevance are not high.
The Hilltop algorithm tries to close this loophole. Hilltop also computes link relationships, but it gives more weight to links coming from topic-related pages, which it calls expert documents. Obviously, different topics or search terms have different expert documents.
Under the Hilltop algorithm, after a user enters a query, Google first finds a set of relevant pages and ranks them with the normal ranking algorithm, then calculates how many links each of these pages receives from expert documents related to the topic. The more links from expert documents, the higher the page's score. According to the original design, a page must receive links from at least two expert documents to get any Hilltop value at all; otherwise its Hilltop value is zero.
The score computed from expert-document links is called LocalRank. The ranking program uses the LocalRank values to readjust the rankings produced by the traditional algorithm and arrive at the final ranking. This corresponds to the final filtering and adjustment step of the ranking phase discussed earlier.
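A hedged sketch of this reranking step: count links from topic-relevant expert documents, require at least two, and use the resulting LocalRank to adjust the traditional ranking. The way the two scores are combined here (a simple multiplicative boost) is an illustrative assumption; the patent's actual scoring is more involved.

```python
def local_rank(page, expert_links):
    """expert_links: dict page -> set of expert documents linking to it.
    A page needs links from at least two expert documents to score at all."""
    experts = expert_links.get(page, set())
    return len(experts) if len(experts) >= 2 else 0

def rerank(initial_ranking, expert_links):
    """initial_ranking: list of (page, traditional_score), best first.
    Re-sorts pages by the traditional score boosted by LocalRank."""
    adjusted = [(page, score * (1 + local_rank(page, expert_links)))
                for page, score in initial_ranking]
    return sorted(adjusted, key=lambda x: x[1], reverse=True)

# Toy usage: page2 has two expert links and jumps ahead of page1.
initial = [("page1", 0.90), ("page2", 0.85), ("page3", 0.80)]
experts = {"page2": {"expert_a", "expert_b"}, "page3": {"expert_a"}}
print(rerank(initial, experts))
```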
The selection of expert documents was described differently in the original Hilltop paper and in the later patent application. In the original research, Krishna Bharat defined expert documents as pages that cover a specific topic and contain many outbound links to third-party websites, similar to hub pages in the HITS algorithm. The pages an expert document links to must not be affiliated with the expert document itself, where affiliation means things like subdomains of the same root domain or pages hosted on the same or similar IP addresses. The most typical expert documents come from the websites of schools, governments, and industry organizations.
In the original Hilltop algorithm, expert documents were pre-selected: the search engine could compute a set of expert documents in advance for the most common search terms. When a user searches, the ranking algorithm picks the subset of pre-computed expert documents related to the query and computes LocalRank from the links within that subset.
In the patent filed in 2001, however, Krishna Bharat described a different way of selecting expert documents. Expert documents are not pre-selected. After a user enters a query, the search engine uses the traditional algorithm to pick an initial set of relevant pages. The Hilltop algorithm then determines which pages in this set receive links from other pages in the same set and assigns them a higher LocalRank. Because the set returned by the traditional algorithm is already relevant to the query, links from these pages to a particular page should naturally carry more weight. This way of selecting expert documents is carried out in real time.
The Hilltop algorithm is generally believed to have had a major influence on the Florida update in late 2003, but whether Hilltop has actually been incorporated into Google's ranking algorithm is not known for certain. Google has never confirmed or denied using any particular patent in its ranking algorithm, but observation of the results and the hiring of Krishna Bharat suggest that Hilltop's ideas are taken seriously.
For SEO, the Hilltop algorithm suggests that link building should focus more on sites and pages that are topically related and already rank well. The easiest way to find them is to search for the keyword: the top-ranking pages are the best link sources, even if they belong to competitors. Of course, such links are also the hardest to get. "Top-ranking" here includes the top few hundred results, not just the top 20 or 30 that ordinary users see; the top few hundred can already be regarded as expert documents.
To be continued…