This article gives a brief introduction to some of the basics behind the Google search engine. In an upcoming series of articles, I will also start covering some Elasticsearch fundamentals and the practical use of its Java REST client. Stay tuned!
Original address: moz.com/blog/search…
A good search engine does not try to return the pages that best match the input query; it tries to answer the underlying question. If you realize this, you can understand why Google (and other search engines) use such a sophisticated algorithm to determine which results to return. Factors in the algorithm include "hard factors" such as the number of backlinks to a page, and perhaps some social recommendations via likes and +1s. These are usually external factors. There are also factors on the page itself: the way the page is built and its various page elements all play a role in the algorithm. Only by analyzing both the on-site and off-site factors can Google determine which pages answer the question behind the query. To do that, Google has to analyze the text on the page.
In this article, I'll elaborate on some of the problems search engines face and their possible solutions. (Sadly) we won't have reverse-engineered Google's algorithm by the end of this article, but we will take a closer look at some of the advice we often give as SEOs. There are some formulas, but don't panic: this article is not just about those formulas. It also comes with an Excel file. Oh, and best of all: I'll use some Dutch food to illustrate things.
Look: croquets are the long ones, bitterballen are the round ones.
True or false
Search engines have developed enormously in recent years, but at first they could only handle Boolean operators. Simply put: whether a word is contained in a document or not. Something is true or false, one or zero. You can also use the operators AND, OR, and NOT to search for documents that contain multiple words or to exclude certain words. This sounds fairly simple, but it does have some problems. Suppose we have two documents containing the following text:
Document 1: “And our restaurant in New York serves croquets And bitterballen.”
Document 2: “In the Netherlands you retrieve croquets and frikandellen from the wall.”
Oh, I almost forgot to show you frikandellen ;-)
If we were building a search engine, the first step would be to tokenize the text. We want to be able to quickly determine which documents contain a given term, and that's easier if we put all the tokens in a database. A "token" can be any single word in a text. How many tokens does document 1 contain?
As you start to answer this question yourself, you'll probably think about what exactly counts as a "token". Strictly speaking, "New York" should be treated as a single term. How to determine that these two words actually belong together is beyond the scope of this article, so for now we'll treat every separate word as a separate token. That gives 10 tokens in document 1 and 11 tokens in document 2. To avoid storing duplicate information in our database, we store "types" rather than tokens.
A type is a unique token in a text. For example, document 1 contains the token "and" twice. In this example I'll ignore the fact that "and" appears once capitalized and once not. Just as with terms, there are techniques to determine whether a word really needs capitalization. In this case we assume we can store everything lowercased, and that "And" and "and" are the same type.
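As a rough sketch (assuming simple lowercased, letters-only tokenization, which glosses over multi-word terms such as "New York"), the difference between tokens and types looks like this in Python:

```python
import re

def tokenize(text):
    # Lowercase and split on anything that isn't a letter. A real tokenizer
    # would also recognize multi-word terms such as "New York".
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

doc1 = "And our restaurant in New York serves croquets and bitterballen."
tokens = tokenize(doc1)   # every word occurrence: 10 tokens
types = set(tokens)       # unique words only: "and" is stored once, so 9 types

print(len(tokens), len(types))  # 10 9
```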
By storing all types in the database together with the documents they appear in, we can search the database using Boolean logic. A search for "croquets" returns both document 1 and document 2. A search for "croquets AND bitterballen" returns only document 1. The problem with this approach is that you tend to get either too many or too few results, and it offers no way to order them. If we want to improve the method, we have to determine what else we can use beyond the mere presence or absence of a word in a document. If you were Google, which on-page factors would you use to order the results?
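Before moving on, here is a minimal sketch of the Boolean approach described above, using an inverted index (the two example documents are the ones from this article; the tokenization is the same simplification as before):

```python
import re
from collections import defaultdict

def tokenize(text):
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

docs = {
    1: "And our restaurant in New York serves croquets and bitterballen.",
    2: "In the Netherlands you retrieve croquets and frikandellen from the wall.",
}

# Inverted index: type -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in tokenize(text):
        index[token].add(doc_id)

def boolean_and(*terms):
    # Intersect the posting sets of all query terms.
    postings = [index[t] for t in terms]
    return set.intersection(*postings) if postings else set()

print(boolean_and("croquets"))                  # {1, 2}
print(boolean_and("croquets", "bitterballen"))  # {1}
```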
Zone Indexes algorithm
A relatively easy way to do this is to use the zone index algorithm. A page can be divided into different zones: think of the title, the description, the author, and the body. By assigning a weight to each zone of a document, we can calculate a simple score for each document. This was one of the first ways search engines used the web page itself to determine the topic of a page. Zone index scoring works as follows:
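In its common form (this is the standard weighted zone scoring from the information retrieval literature, which matches the example below), the score is a weighted sum of per-zone Boolean matches:

$$\text{score}(q, d) = \sum_{i=1}^{l} g_i \cdot s_i(q, d)$$

where $g_i$ is the weight of zone $i$ (the weights sum to 1) and $s_i(q, d)$ is 1 if zone $i$ of document $d$ matches the query and 0 otherwise.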
Suppose we assign the following weights to each zone:
Zone | Weight |
---|---|
title | 0.4 |
description | 0.1 |
content | 0.5 |
We perform the following search query: “croquets AND bitterballen”
We then have a document with the following zones:
Zone | Content | Boolean match | Score |
---|---|---|---|
title | New York Cafe | 0 | 0 |
description | Café with delicious croquets and bitterballen | 1 | 0.1 |
content | Our restaurant in New York serves croquets and bitterballen | 1 | 0.5 |
– | – | Total | 0.6 |
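As a minimal sketch (the zone weights, the document, and the per-zone Boolean check are taken from the tables above; the whitespace tokenization is a simplification):

```python
ZONE_WEIGHTS = {"title": 0.4, "description": 0.1, "content": 0.5}

doc = {
    "title": "New York Cafe",
    "description": "Café with delicious croquets and bitterballen",
    "content": "Our restaurant in New York serves croquets and bitterballen",
}

def zone_score(query_terms, document):
    score = 0.0
    for zone, weight in ZONE_WEIGHTS.items():
        words = document.get(zone, "").lower().split()
        # Boolean AND within the zone: all query terms must be present.
        if all(term in words for term in query_terms):
            score += weight
    return score

print(zone_score(["croquets", "bitterballen"], doc))  # 0.1 + 0.5 = 0.6
```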
Because at some point everyone started abusing the weights assigned to, say, the description, it became important for Google to divide the body itself into different zones and assign a different weight to each of those zones.
That is a bit difficult, because the web consists of all kinds of documents with different structures. Parsing an XML document is very simple for a machine; parsing an HTML document is much harder. The structure and tags are more limited, which makes the analysis more difficult. Of course, HTML5 is coming and Google supports microformats, but these still have their limitations. For example, if you knew that Google assigned more weight to the content inside a particular tag and less to the content in, say, the footer, you would start abusing that tag.
To determine the topic of a web page, Google has to slice the page into blocks. That way, Google can determine which parts of the page are important and which are not. One method it could use is the text-to-code ratio: a block of a page that contains much more text than HTML code is likely to be the main content of the page, while a block with many hyperlinks or a lot of HTML code and little content is probably a menu. This is also why choosing the right rich text editor is so important: some editors add a lot of unnecessary HTML code.
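A rough sketch of such a text-to-code ratio check, using only Python's standard library (the two HTML snippets are made up, and any threshold you would apply on top of the ratio is an assumption, not something Google has published):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the visible text of an HTML fragment."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def text_code_ratio(html):
    parser = TextExtractor()
    parser.feed(html)
    text_len = len("".join(parser.parts).strip())
    return text_len / max(len(html), 1)

content = "<p>Our restaurant in New York serves croquets and bitterballen.</p>"
menu = '<ul><li><a href="/menu">Menu</a></li><li><a href="/contact">Contact</a></li></ul>'

print(round(text_code_ratio(content), 2))  # high ratio -> probably main content
print(round(text_code_ratio(menu), 2))     # low ratio  -> probably navigation
```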
The text-to-code ratio is just one of the ways a search engine can divide a page into blocks. Bill Slawski also wrote about block identification earlier this year.
The advantage of the zone index algorithm is that you can calculate a fairly simple score for each document; the disadvantage, of course, is that many documents may end up with the same score.
Term frequency
When I asked you to think about the factors you would use to determine the relevance of a document, you probably thought of the frequency of the search terms. Giving more weight to documents that use the search terms more often is a logical step.
Some SEOs swear by using a certain percentage of keywords in the text. We both know that isn't true, but let me show you why. I'll try to explain it with the following examples. There are some formulas involved, but, as I said, it's the idea behind them that matters.
The numbers in the table below are the number of times the word appears in the document (also known as word frequency or TF). Which document has a higher score for the query “croquets and bitterballen”?
document | croquets | and | cafe | bitterballen | Amsterdam | … |
---|---|---|---|---|---|---|
Document 1 | 8 | 10 | 3 | 2 | 0 | |
Document 2 | 1 | 20 | 3 | 9 | 2 | |
Document N | … | … | … | … | … | |
Query | 1 | 1 | 0 | 1 | 0 | |
The scores for both documents are calculated as follows:

score("croquets and bitterballen", Doc1) = 8 + 10 + 2 = 20
score("croquets and bitterballen", Doc2) = 1 + 20 + 9 = 30
In this case, document 2 appears to be a better match for the query. But here the word "and" gets the most weight, and that is hardly fair: it's a stop word, and we only want to give it a small value. We can achieve this by using the inverse document frequency (IDF), the counterpart of document frequency (DF). Document frequency is the number of documents in which a word appears. Inverse document frequency is its opposite: as the number of documents containing the word grows, the IDF shrinks.
You can calculate IDF by dividing the total number of documents in the corpus by the number of documents containing the word, and then taking the logarithm of the quotient.
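Written as a formula, with N the total number of documents in the corpus and df_t the number of documents containing term t:

$$\text{idf}_t = \log \frac{N}{\text{df}_t}$$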
Assume that the IDFs of the search terms are as follows:

idf(croquets) = 5
idf(and) = 0.01
idf(bitterballen) = 2
You then get the following scores:

score("croquets and bitterballen", Doc1) = 8×5 + 10×0.01 + 2×2 = 44.1
score("croquets and bitterballen", Doc2) = 1×5 + 20×0.01 + 9×2 = 23.2
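A small sketch that reproduces these numbers (the term frequencies come from the table above and the idf values are the assumed ones):

```python
tf = {
    "Doc1": {"croquets": 8, "and": 10, "bitterballen": 2},
    "Doc2": {"croquets": 1, "and": 20, "bitterballen": 9},
}
idf = {"croquets": 5, "and": 0.01, "bitterballen": 2}

def tf_idf_score(query_terms, doc_tf):
    # Sum tf * idf over the query terms; a term missing from the document counts as 0.
    return sum(doc_tf.get(term, 0) * idf[term] for term in query_terms)

query = ["croquets", "and", "bitterballen"]
print(tf_idf_score(query, tf["Doc1"]))  # 8*5 + 10*0.01 + 2*2 = 44.1
print(tf_idf_score(query, tf["Doc2"]))  # 1*5 + 20*0.01 + 9*2 = 23.2
```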
Document 1 now gets the higher score. But we haven't taken document length into account yet. One document may contain far more text than another without being more relevant, and a long document can quite easily rack up a high score this way.
Vector model
We solve this problem by looking at the cosine similarity of a document. An exact explanation of the theory behind this method is beyond the scope of this article, but you can think of it as a kind of harmonic mean between the query terms present in a document. I made an Excel file so you can experiment with it yourself; the file contains its own notes. You need to know the following metrics:
- Query terms – every separate term in the query.
- Document frequency – how many documents does Google know of that contain that word?
- Term frequency – the frequency of each separate query term in the document (the focus keyword bookmarklet made by Sander Tamaela is very helpful for this part).
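Putting those three metrics together, a minimal sketch of the cosine similarity between a query vector and a document vector could look like this (the tf-idf weights below are made up purely for illustration):

```python
import math

def cosine_similarity(query_vec, doc_vec):
    # Dot product over all terms, divided by the lengths of both vectors,
    # so a document isn't favoured just for being long.
    terms = set(query_vec) | set(doc_vec)
    dot = sum(query_vec.get(t, 0) * doc_vec.get(t, 0) for t in terms)
    q_len = math.sqrt(sum(w * w for w in query_vec.values()))
    d_len = math.sqrt(sum(w * w for w in doc_vec.values()))
    return dot / (q_len * d_len) if q_len and d_len else 0.0

# Hypothetical tf-idf weights for the query and one document.
query_vec = {"croquets": 5.0, "bitterballen": 2.0}
doc_vec = {"croquets": 40.0, "and": 0.1, "bitterballen": 4.0}
print(cosine_similarity(query_vec, doc_vec))  # roughly 0.96
```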
Here's an example where I actually applied this model. A site had a page that was meant to rank for "buying bikes", or "fiets kopen" in Dutch. The problem was that the wrong page (the home page) was ranking for the query.
The formula uses the inverse document frequency (IDF) mentioned earlier, and for that we need the total number of documents in Google's index. Let's assume N is 10.4 billion.
The following table explains:
- Tf = term frequency
- Df = document frequency
- Idf = Inverse Document frequency
- Wt,q = weight for term in query
- Wt,d = weight for term in document
- Product = Wt,q * Wt,d
- Score = Sum of the products
The home page that was ranking at the time: www.fietsentoko.nl/
term | Query | – | – | – | Document | – | – | Product |
---|---|---|---|---|---|---|---|---|
– | tf | df | idf | Wt,q | tf | Wf | Wt,d | |
Fiets | 1 | 25.500.000 | 3.610493159 | 3.610493159 | 21 | 441 | 0.70711 | 2.55302 |
Kopen | 1 | 118.000.000 | 2.945151332 | 2.9452 | 21 | 441 | 0.70711 | 2.08258 |
– | – | – | – | – | – | – | Score: | 4.6356 |
The page I wanted to rank: www.fietsentoko.nl/fietsen/
term | Query | – | – | – | Document | – | – | Product |
---|---|---|---|---|---|---|---|---|
– | tf | df | idf | Wt,q | tf | Wf | Wt,d | |
Fiets | 1 | 25.500.000 | 3.610493159 | 3.610493159 | 22 | 484 | 0.61782 | 2.23063 |
Kopen | 1 | 118.000.000 | 2.945151332 | 2.945151332 | 28 | 784 | 0.78631 | 2.31584 |
– | – | – | – | – | – | – | Score: | 4.54647 |
A few days later, after Google had crawled the page again, the document I had adjusted started ranking for the term. We can conclude that it isn't necessarily how often you use a word that matters; what matters is finding the right balance for the words you want to rank for.
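For anyone who wants to verify the numbers, here is a small sketch that reproduces both tables (the idf values and term frequencies are taken from the tables; Wt,d appears to be the term frequency normalized by the Euclidean length of the document's tf vector):

```python
import math

# Query-side weights (Wt,q): the tf in the query is 1, so the weight equals the idf.
idf = {"fiets": 3.610493159, "kopen": 2.945151332}

# Document-side term frequencies from the two tables above.
pages = {
    "www.fietsentoko.nl/":         {"fiets": 21, "kopen": 21},
    "www.fietsentoko.nl/fietsen/": {"fiets": 22, "kopen": 28},
}

def vector_score(doc_tf):
    # Wt,d: tf divided by the Euclidean length of the document's tf vector.
    length = math.sqrt(sum(f * f for f in doc_tf.values()))
    return sum(idf[t] * (f / length) for t, f in doc_tf.items())

for url, doc_tf in pages.items():
    print(url, round(vector_score(doc_tf), 4))
# www.fietsentoko.nl/          4.6356
# www.fietsentoko.nl/fietsen/  4.5465
```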
Speed up the process
Performing this calculation for every document that matches the query takes a lot of processing power. You can solve this by adding some static values to decide which documents to score at all. PageRank, for example, is a good static value: if you first calculate the scores for the documents that match the query and also have a high PageRank, there's a good chance some of those documents belong in the top 10 of the results anyway.
Another possibility is the use of champion lists: for each term, only the top N documents containing that term are stored. For a query with multiple terms you can then intersect these lists to find the documents that contain all of the query terms and are likely to score well. Only if too few documents contain all of the terms do you have to search through all documents. That way you are not just looking for the best vector score; documents with a good static score are considered as well.
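A rough sketch of that idea (the document ids and list contents are made up; in practice each list would hold the N documents with the highest static score for that term):

```python
# For each term, keep only the top-N documents ("champion list"), e.g. by PageRank.
champion_lists = {
    "croquets":     {"doc1", "doc4", "doc7"},
    "bitterballen": {"doc1", "doc2", "doc7"},
}

def candidates(query_terms):
    # Intersect the champion lists; only these candidates get the expensive
    # vector score. Fall back to the full index if the intersection is too small.
    lists = [champion_lists.get(t, set()) for t in query_terms]
    return set.intersection(*lists) if lists else set()

print(candidates(["croquets", "bitterballen"]))  # {'doc1', 'doc7'}
```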
Relevance feedback
Relevance feedback assigns more or less weight to the words in the query, based on the relevance of documents. By using relevance feedback, a search engine can change the actual query without telling the user.
The first step is to determine whether a document is relevant. Although there have been search engines where you could indicate whether a result or document was relevant, Google lacked this capability for a long time. Their first attempt was adding a favorite star to search results; now they are trying it with the Google+ button. If enough people start clicking the button for a certain result, Google will start considering that document relevant for that query.
Another way is to look at the pages that currently rank well; these are considered relevant as well. The danger of this approach is topic drift. If you search for bitterballen and croquettes and the highest-ranking pages are all snack bars in Amsterdam, the risk is that Amsterdam itself gets assigned a value and you end up with only Amsterdam snack bars in the results.
Another approach Google can take is simply data mining. They can look at the click-through rates of different pages: pages with an above-average click-through rate and a below-average bounce rate can be considered relevant, while pages with a very high bounce rate are probably irrelevant.
An example of how this data can be used to adjust the weight of query terms is the Rocchio feedback formula. It boils down to adjusting the values of the words in the query and possibly adding extra query terms.
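In its standard form the Rocchio formula looks like this, with $\alpha$, $\beta$ and $\gamma$ the weights for the original query, the relevant documents $D_r$ and the non-relevant documents $D_{nr}$:

$$\vec{q}_m = \alpha\,\vec{q}_0 + \frac{\beta}{|D_r|}\sum_{\vec{d}_j \in D_r}\vec{d}_j - \frac{\gamma}{|D_{nr}|}\sum_{\vec{d}_k \in D_{nr}}\vec{d}_k$$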
The table below is a visual representation of this formula. Suppose we plug in the following values:

Query terms: +1 (alpha)
Relevant documents: +1 (beta)
Non-relevant documents: -0.5 (gamma)
We have the following query: “Croquets and bitterballen”
The relevance of the documents is as follows:

Doc1: relevant
Doc2: relevant
Doc3: not relevant
Terms | Q | Doc1 | Doc2 | Doc3 | Weight new query |
---|---|---|---|---|---|
croquets | 1 | 1 | 1 | 0 | 1 + 1 - 0 = 2 |
and | 1 | 1 | 0 | 1 | 1 + 0.5 - 0.5 = 1 |
bitterballen | 1 | 0 | 0 | 0 | 1 + 0 - 0 = 1 |
cafe | 0 | 0 | 1 | 0 | 0 + 0.5 - 0 = 0.5 |
Amsterdam | 0 | 0 | 0 | 1 | 0 + 0 - 0.5 = -0.5 → 0 |
The new query then becomes: croquets (2) and (1) bitterballen (1) cafe (0.5)
The value behind each word is the weight it gets in the new query, and we can use these weights in the vector calculation. Although the word Amsterdam got a score of -0.5, negative values are adjusted back to 0, so words are not actively excluded from the search results. And although cafe did not appear in the original query, it was added to the query and given a weight in the new query.
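A small sketch that reproduces the table above (alpha, beta, gamma and the binary document vectors are the ones from the example; negative weights are clipped back to 0 as described):

```python
ALPHA, BETA, GAMMA = 1.0, 1.0, 0.5

query = {"croquets": 1, "and": 1, "bitterballen": 1}
relevant = [    # Doc1 and Doc2 (binary term vectors from the table)
    {"croquets": 1, "and": 1, "bitterballen": 0, "cafe": 0, "Amsterdam": 0},
    {"croquets": 1, "and": 0, "bitterballen": 0, "cafe": 1, "Amsterdam": 0},
]
irrelevant = [  # Doc3
    {"croquets": 0, "and": 1, "bitterballen": 0, "cafe": 0, "Amsterdam": 1},
]

def rocchio_weight(term):
    rel_avg = sum(d[term] for d in relevant) / len(relevant)
    irr_avg = sum(d[term] for d in irrelevant) / len(irrelevant)
    weight = ALPHA * query.get(term, 0) + BETA * rel_avg - GAMMA * irr_avg
    return max(weight, 0.0)  # adjust negative values back to 0

for term in ["croquets", "and", "bitterballen", "cafe", "Amsterdam"]:
    print(term, rocchio_weight(term))
# croquets 2.0, and 1.0, bitterballen 1.0, cafe 0.5, Amsterdam 0.0
```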
If Google applies this kind of relevance feedback, it pays to look at the pages that already rank for a specific query: by using the same vocabulary, you make sure you get the most out of this relevance feedback.
Tips:
In short, we've looked at one way of assigning a weight to a document based on the content of the page. While the vector method is fairly accurate, it is certainly not the only way to calculate relevance. There are many adjustments to the model, and even then it is still only part of the complete algorithm of a search engine like Google. We also took a look at relevance feedback. Hopefully I've given you some insight into the ways a search engine can use these factors. Now it's your turn to explore this and to study that Excel file.