Building a distributed search engine with MaxCompute + OpenSearch

Abstract: Customers have recently been asking how to build a high-performance search engine over massive data at low cost, for use cases such as official-account search and video search. Because their data is already on Alibaba Cloud, a cloud-based solution was the natural choice. I investigated several cloud products, and many people recommended OpenSearch, so I spent some time studying it carefully and found that it works well in practice. Among other features, it provides word segmentation and synchronization with cloud databases. I also ran into some problems along the way, which I share here.

Background


For this experiment, I used Alibaba Cloud MaxCompute (formerly known as ODPS) and OpenSearch to build a demo search engine for video and news retrieval. With about 10 GB of data, setting up the service took only about 15 minutes, and synchronizing the data and building the index took about 1 hour. Because I chose pay-as-you-go billing, the whole experiment cost only a few tens of yuan.

In addition, OpenSearch offers a rich set of SDKs and APIs, so it can be integrated easily into online services.

Experiment architecture

The search engine layer is built on OpenSearch. It is a typical distributed, online, real-time interactive query architecture: no single point of failure, highly scalable, highly available, free of operations work for the user, and low in cost. Large volumes of information are indexed and searched in near real time, and the service is designed to search billions of documents and petabytes of data quickly.

The data warehouse layer is built on MaxCompute, a fast, fully managed data warehouse solution for TB- to PB-scale data. MaxCompute provides a comprehensive set of data import options and a variety of classic distributed computing models, which let users process massive data quickly, reduce enterprise costs, and keep data secure.

Experiment preparation

1. Register an Alibaba Cloud (Aliyun) account, complete real-name authentication, and bind an Alipay account;

2. Enable the Data Plus service;

3. Enable the MaxCompute and OpenSearch services with postpaid (pay-as-you-go) billing.

Experiment tasks

1. Import the data set into MaxCompute and create the table;

2. Create an application in OpenSearch and configure the data structure, index structure, and word segmentation;

3. Run a full data import and build the indexes;

4. Test the search results.

Step 1: Purchase and enable the OpenSearch, MaxCompute, and Big Data Development Suite services

1.1 Enabling the OpenSearch service

Visit www.aliyun.com/product/ope… , click “Open now”, and choose postpaid (pay-as-you-go) billing.

1.2 Enabling the MaxCompute and Big Data Development Suite services

1.2.1 Enabling MaxCompute

With a real-name-authenticated Alibaba Cloud account, visit www.aliyun.com/product/odp… , enable MaxCompute, and choose pay-as-you-go billing.

1.2.2 Creating a MaxCompute Project

Open the MaxCompute management console, either from the page shown after MaxCompute is enabled successfully or by navigating to Products -> Big Data -> MaxCompute and clicking “Management console”.

Create a project

In the console, navigate to “Big Data Development Suite -> Project List” and click “Create Project”.

In the dialog box that appears, select the I/O postpaid (pay-as-you-go) billing mode and enter the project name.

Create a MaxCompute table

To reach the data development page, sign in as a developer to Alibaba Cloud Data Plus platform > Big Data Development Suite > Management console. In the project list, use the operation column of the corresponding project to enter its workspace.

Note: If you are using the Data Plus platform for the first time, you need to register an AccessKey (AK) for it first.

Step 2: Import the data set into MaxCompute through the Big Data Development Suite

After entering the Big Data Development Suite workspace, first import a test data set.

Data description: this experiment uses the MaxCompute public data sets (beta), described at https://yq.aliyun.com/articles/89763. The public data currently covers stock prices, real estate information, and film and television data with box office figures. All of it is stored in the public_data project in MaxCompute.

Here we use the movie box office data.

It is very simple to use; the only prerequisite is that MaxCompute and the Big Data Development Suite are enabled.

In the Big Data Development Suite, create a script, name it Opensearch_demo, and execute the following statement in the window.

add user ALIYUN$everyone;

After this statement runs, every member of your project space can read the public data sets.

Verify:

select * from public_data.dwd_product_movie_basic_info where movie_name like '%' limit 10;

OpenSearch requires a primary key, so use the uuid() function in MaxCompute to generate one while copying the data into your own table.

Execute the following statement in the window:

create table alian.demo_opensearch_case2 as select uuid() as id,* from public_data.dwd_product_movie_basic_info ;

After the statement runs successfully, verify the data:

select count(1) from alian.demo_opensearch_case2;

You can see that the data set has been created.
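To illustrate why a generated id column works as a primary key, here is a minimal Python sketch (not MaxCompute code). It uses Python's uuid.uuid4() as a stand-in for the uuid() call in the create table statement; the sample rows and the helper names add_primary_key and ids_are_unique are hypothetical.

```python
import uuid


def add_primary_key(rows):
    """Return copies of the rows with a random UUID string stored under 'id'."""
    return [{"id": str(uuid.uuid4()), **row} for row in rows]


def ids_are_unique(rows):
    """Check that no two rows share the same 'id' value."""
    ids = [row["id"] for row in rows]
    return len(ids) == len(set(ids))


# Hypothetical sample rows; the two rows are deliberately identical,
# so only the generated id can tell them apart.
movies = [
    {"movie_name": "Example Movie", "director": "Example Director"},
    {"movie_name": "Example Movie", "director": "Example Director"},
]

keyed = add_primary_key(movies)
print(ids_are_unique(keyed))
```

Even when two source rows have identical content, each one gets its own id, which is what a search application needs in order to address documents individually.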

Step 3: Create an OpenSearch application

3.1 Enter the OpenSearch console and click “Create Application”

3.2 Select the product edition. I chose the standard edition. If you need to search across multiple associated tables, choose the advanced edition; for single-table queries, the standard edition is enough.

3.3 Enter the application name MaxCompute_OpenSearch_Demo and select the East China 1 (Hangzhou) region. At the time of writing MaxCompute is available only in East China, and choosing another region makes the data connection fail. Then click Next.

3.4 Select “Create application structure from data source”. This quickly generates the initial application structure from the source table schema, which saves manual work and reduces the chance of errors.

3.5 Select ODPS as the data source.

Select the ODPS project and the table demo_opensearch_case2 that you just created.

Note: the table imported from ODPS needs a primary key. Set the id field as the primary key; its STRING type corresponds to the LITERAL type.

3.6 Configuring indexes, word segmentation, and displayed search content

Select the fields movie_name, director, scriptwriter, area, actors, type, movie_date, and movie_language as index fields, and keep the default Chinese word segmentation.

Add display fields to control which content appears in the search results.
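The roles of index fields, word segmentation, and display fields can be sketched with a toy inverted index in Python. This is only an illustration under stated assumptions, not how OpenSearch is implemented: whitespace splitting stands in for Chinese word segmentation, only a subset of the fields is used, and the sample documents and the names tokenize, build_index, and search are hypothetical.

```python
from collections import defaultdict

INDEX_FIELDS = ["movie_name", "director", "actors"]  # fields that are searchable
DISPLAY_FIELDS = ["movie_name", "director"]          # fields returned in results


def tokenize(text):
    """Stand-in for word segmentation: lowercase and split on whitespace."""
    return text.lower().split()


def build_index(docs):
    """Map each token found in an index field to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc in docs:
        for field in INDEX_FIELDS:
            for token in tokenize(doc.get(field, "")):
                index[token].add(doc["id"])
    return index


def search(index, docs_by_id, query):
    """Return the display fields of documents that contain every query token."""
    tokens = tokenize(query)
    if not tokens:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in tokens))
    return [{f: docs_by_id[i][f] for f in DISPLAY_FIELDS} for i in sorted(hits)]


# Hypothetical sample documents.
docs = [
    {"id": "a1", "movie_name": "Wrestling Daddy", "director": "Example Director", "actors": "Actor One"},
    {"id": "b2", "movie_name": "Daddy Day", "director": "Other Person", "actors": "Actor Two"},
]
docs_by_id = {d["id"]: d for d in docs}
index = build_index(docs)

print(search(index, docs_by_id, "wrestling daddy"))
print(search(index, docs_by_id, "daddy"))
```

A field left out of INDEX_FIELDS cannot be matched by a query, and a field left out of DISPLAY_FIELDS is never returned, which mirrors the two separate choices made in this step.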

3.7 Creation Complete

Step 4: Synchronize data and create indexes

4.1 Activating the application

Select the storage quota and QPS. The data set used here is about 8 GB, so I purchased a 10 GB quota and left QPS at the default.

Note: MaxCompute (formerly ODPS) stores data in compressed form. The table size shown in MaxCompute was 2 GB, but the actual size was 8 GB. I first bought a 3 GB OpenSearch quota, and the import failed as a result.
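The quota arithmetic behind this note can be sketched as a small Python helper. The expansion ratio is taken from the figures observed in this experiment (2 GB compressed, 8 GB actual); the 20% headroom is an assumption made here for illustration, not an OpenSearch rule, and required_quota_gb is a hypothetical name.

```python
import math


def required_quota_gb(compressed_gb, expansion_ratio, headroom=0.2):
    """Estimate the quota in whole GB for data that expands when exported.

    compressed_gb:   table size as reported by the source system
    expansion_ratio: actual size divided by compressed size
    headroom:        extra fraction reserved on top (assumed value)
    """
    raw_gb = compressed_gb * expansion_ratio
    return math.ceil(raw_gb * (1 + headroom))


ratio = 8 / 2                        # observed in this experiment
quota = required_quota_gb(2, ratio)  # 2 GB compressed, expanded 4x, plus headroom
print(quota)
```

With these inputs the estimate comes out well above the 3 GB quota that failed, which is why sizing from the compressed figure alone is misleading.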

4.2 Starting to Build Indexes

This step is mostly waiting; in my case it took about an hour.

You can view the index build progress while you wait.

Step 5: Search test

Open Application Management -> Search Test and enter any movie title, such as the recently released Wrestling Daddy. The search automatically matches the corresponding movie information, which completes the experiment.

The public data set provided by MaxCompute is excellent: it is large and up to date.

Conclusion: That completes the whole experiment. The OpenSearch + MaxCompute combination is very convenient, and it is well suited to enterprises that have more than 100 GB of data and do not want to bear high operations and IT costs.
