The Youzan search platform is a PaaS product serving various internal search applications and some NoSQL storage applications; it helps applications support retrieval and multidimensional filtering reasonably and efficiently. The platform currently supports more than 100 search businesses of various sizes and serves nearly 100 billion records.
In addition to providing advanced retrieval and big-data interaction capabilities for traditional search applications, the Youzan search platform also supports other massive-data filtering scenarios such as product management, order retrieval, and fan filtering. From an engineering perspective, extending the platform to support such diverse retrieval needs is a huge challenge.
I am the first employee of the Youzan search team, and I have been fortunate to be responsible for designing and developing most of the functions and features of the Youzan search platform so far. Our search team is mainly responsible for the performance, scalability, and reliability of the platform, and for reducing the operation and maintenance cost of the platform and the development cost of the business as much as possible.
# Elasticsearch
Elasticsearch is a highly available distributed search engine. On the one hand, the technology is relatively mature and stable; on the other hand, the community is active. So we chose Elasticsearch as the basic engine when building our search system.
# Architecture 1.0
Back in 2015, the Youzan search system in the production environment was a cluster of several high-spec virtual machines, mainly running the product and fan indexes. Data was synchronized from the DB to Elasticsearch via Canal.
This approach kept things small and allowed indexes to be created and synchronized for different business applications quickly and at low cost, which suited a period of rapid business growth. But each synchronization program was a standalone application coupled to the business database, so it had to adapt quickly to business changes such as database migrations and sharding; moreover, multiple Canal instances subscribing to the same database at the same time also degraded database performance. The Elasticsearch cluster was not physically isolated either: during one promotion, the huge volume of fan data exhausted the Elasticsearch heap, and every index in the cluster stopped working.
# Architecture 2.0
In the process of solving these problems, the 2.0 version of the Youzan search architecture naturally took shape. The general architecture is as follows:
First, the data bus publishes data change messages to MQ, and synchronization applications keep indexes in step with the business databases by consuming those messages. The data bus decouples us from the business databases, and introducing it also avoids the waste of multiple Canal listeners consuming the same table's binlog.
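As an illustration, the core loop of such a synchronization application boils down to applying each change message to the index. Below is a minimal sketch; the message schema (`type`, `primary_key`, `row`) is hypothetical, not the actual Youzan data-bus format, and a dict stands in for the Elasticsearch index:

```python
import json

def apply_change(index: dict, message: str) -> None:
    """Apply one data-bus change message to a document store
    (a dict stands in for an Elasticsearch index)."""
    event = json.loads(message)
    doc_id = str(event["primary_key"])
    if event["type"] in ("INSERT", "UPDATE"):
        index[doc_id] = event["row"]   # upsert the changed row as a document
    elif event["type"] == "DELETE":
        index.pop(doc_id, None)        # remove the deleted row's document
```

A real consumer would batch these into Elasticsearch bulk requests, but the per-message logic is the same.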
# Advanced Search
As the business developed, some relatively concentrated traffic entry points gradually emerged, such as distribution and product selection. At that point a plain bool query could no longer meet our requirements for fine-grained ranking of search results, and pushing complex, highly specialized function_score queries and their optimization onto business developers was clearly not a desirable option. Our approach was an advanced-search middleware that intercepts business query requests, parses out the necessary conditions, reassembles them into an advanced query, and submits it to the engine for execution. The general architecture is as follows:
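As an illustration of the kind of query such middleware assembles, here is a sketch of a `function_score` query built in Python; the field names (`title`, `status`, `sales_30d`, `updated_at`) and the boost factors are hypothetical, not the actual Youzan schema or tuning:

```python
def build_product_query(keyword: str) -> dict:
    """Wrap a plain bool query in function_score to refine ranking."""
    return {
        "query": {
            "function_score": {
                "query": {
                    "bool": {
                        "must": [{"match": {"title": keyword}}],
                        "filter": [{"term": {"status": "on_sale"}}],
                    }
                },
                "functions": [
                    # boost by recent sales, log-dampened so big sellers
                    # don't dominate completely
                    {"field_value_factor": {"field": "sales_30d",
                                            "modifier": "log1p",
                                            "factor": 0.5}},
                    # decay the score of stale listings
                    {"gauss": {"updated_at": {"origin": "now", "scale": "7d"}},
                     "weight": 2},
                ],
                "score_mode": "sum",
                "boost_mode": "multiply",
            }
        }
    }
```

The business side only supplies the keyword; the middleware owns everything inside `function_score`.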
We also added a result cache here as an optimization. A regular full-text query is recomputed on every request, which in practice is unnecessary: it is perfectly acceptable for users issuing the same request within a certain window (for example, 15 or 30 minutes) to get the same search results. Caching results in the middleware avoids the waste of repeatedly executing identical queries and also improves the middleware's response time. Readers interested in advanced search can read another article, "Youzan Search Engine Practice (Engineering)"; we won't go into detail here.
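Such a middleware result cache can be as simple as a TTL map keyed by a hash of the normalized request. A minimal sketch, with the 15-minute default matching the window mentioned above:

```python
import hashlib
import json
import time

class ResultCache:
    """TTL cache for search results, keyed by a hash of the request."""

    def __init__(self, ttl_seconds=900, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable clock for testing
        self._store = {}

    @staticmethod
    def key(request: dict) -> str:
        # sort_keys makes logically identical requests hash identically
        payload = json.dumps(request, sort_keys=True).encode()
        return hashlib.md5(payload).hexdigest()

    def get(self, request: dict):
        entry = self._store.get(self.key(request))
        if entry and self.clock() - entry[0] < self.ttl:
            return entry[1]
        return None                 # miss or expired

    def put(self, request: dict, result) -> None:
        self._store[self.key(request)] = (self.clock(), result)
```

A production version would also bound the cache size and evict expired entries, omitted here for brevity.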
# Big data integration
Search applications are inseparable from big data. In addition to mining more value from user behavior through log analysis, offline computation of a comprehensive ranking score is also an indispensable part of optimizing the search experience. In the 2.0 stage, we built an interactive channel between Hive and Elasticsearch with the open-source ES-Hadoop component. The general architecture is as follows:
We use Flume to collect search logs and store them in HDFS for subsequent analysis; after Hive analysis, results can also be exported back for search suggestions. Of course, big data offers search services far more than this.
# Problems
This architecture supported the search system for more than a year, but it also exposed many problems. The first was the increasingly high maintenance cost: beyond Elasticsearch cluster maintenance and the index configuration and field changes themselves, although the data bus decoupled us from the business databases, the business code coupled into the synchronization programs still posed a significant maintenance burden for the team. And although message queues reduced our coupling with business code, they also brought message-ordering problems that are difficult to handle for anyone unfamiliar with the state of the business data. I summarized these questions in an earlier article. In addition, the traffic flowing through the Elasticsearch cluster was a semi-black box for us: we could observe it but not predict it, which led to an online incident where heavy erroneous internal traffic overloaded the cluster's CPU and left it unable to serve.
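One common way to cope with out-of-order change messages is to attach a monotonically increasing version (a binlog offset or update timestamp) to each change and drop anything stale, mirroring Elasticsearch's `version_type=external` semantics. A minimal sketch of the idea, with a dict standing in for the index:

```python
def apply_if_newer(index: dict, doc_id: str, doc: dict, version: int) -> bool:
    """Apply a change only if its version is newer than the stored one,
    so late-arriving messages cannot overwrite fresher data."""
    current = index.get(doc_id)
    if current is not None and current["_version"] >= version:
        return False                      # stale or duplicate message: drop
    index[doc_id] = {"_version": version, "_source": doc}
    return True
```

This only works if every producer stamps a comparable version, which is exactly the business-data knowledge the article says is hard to get right.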
# Current Architecture 3.0
To address the problems of the 2.0 era, we made some targeted adjustments in the 3.0 architecture. The main points are listed below:
- Receive user calls through open interfaces, completely decoupling from business code;
- Add a proxy in front of external services to preprocess user requests and perform necessary operations such as flow control and caching;
- Provide a management platform to simplify index changes and cluster management. With this, the Youzan search system gradually evolved into a platform, and a search platform architecture began to take shape:
# Proxy
As the entry and exit point for external services, the proxy provides a standardized interface for calling different versions of Elasticsearch through ESLoader, and embeds function modules such as request validation, caching, and template queries.

Request validation preprocesses user write and query requests. If it finds inconsistent fields, type errors, query syntax errors, or suspected slow queries, it rejects the request in fail-fast fashion or executes it at a lower flow-control level, preventing invalid or inefficient operations from affecting the whole Elasticsearch cluster.

Caching and ESLoader mainly absorb the general-purpose parts of advanced search, so that advanced search can focus on query parsing, rewriting, and ranking, and become more cohesive. We made a small optimization to the cache: since cached query results that carry source document content are usually large, to avoid network congestion from frequent access to the Codis cluster at traffic peaks, we implemented a simple local cache on the proxy that automatically takes over during peak traffic.
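The peak-time degradation to a local cache described above can be sketched as follows. Here `remote` is any object with a `get` method standing in for the Codis client, and the QPS threshold and window are illustrative assumptions, not the production values:

```python
import time
from collections import deque

class DegradingCache:
    """Proxy-side cache sketch: reads normally go to a remote cache;
    when observed QPS crosses a threshold, fall back to a local dict
    to keep large result payloads off the network during peaks."""

    def __init__(self, remote, qps_threshold=1000, window=1.0):
        self.remote = remote
        self.threshold = qps_threshold
        self.window = window
        self.local = {}
        self.hits = deque()        # timestamps of recent lookups

    def _qps(self, now: float) -> float:
        while self.hits and now - self.hits[0] > self.window:
            self.hits.popleft()    # drop timestamps outside the window
        return len(self.hits) / self.window

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        self.hits.append(now)
        if self._qps(now) > self.threshold:
            return self.local.get(key)   # degraded: local copy only
        value = self.remote.get(key)
        if value is not None:
            self.local[key] = value      # keep a local copy for peaks
        return value
```

A real implementation would also bound the local dict and expire its entries; the point here is only the automatic degradation path.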
When the query structure (DSL) is relatively fixed and lengthy, as in product category filtering or order filtering, search templates help. On the one hand, they relieve the business of the burden of assembling DSL; on the other hand, the server side can tune online query performance by editing the template, using default values, optional conditions, and similar means.
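Conceptually, a server-side template applies defaults and optional conditions before the DSL reaches the engine (Elasticsearch itself implements this with Mustache-based search templates). A sketch in Python with illustrative field names, not the actual Youzan templates:

```python
def build_order_filter(params: dict) -> dict:
    """Server-side 'template' for order filtering: the caller passes only
    business parameters; defaults and optional clauses live here."""
    merged = {"size": 20, "status": "paid", **params}   # caller may override
    filters = [{"term": {"status": merged["status"]}}]
    if "buyer_id" in merged:                            # optional condition
        filters.append({"term": {"buyer_id": merged["buyer_id"]}})
    return {"size": merged["size"],
            "query": {"bool": {"filter": filters}}}
```

Because the full DSL is owned server-side, the `size` default or an extra filter can be tuned without any change to business code.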
# Management platform
To reduce the maintenance cost of index creation and deletion, field modification, and configuration synchronization, we implemented the first version of the search management platform on Django. It mainly provides an index-change approval flow and synchronizes index configuration to different clusters, managing index metadata in a visual way and reducing the time we spend on platform maintenance. The open-source Head plugin's presentation is unfriendly, and it exposes some rather crude features:
In earlier versions of Elasticsearch, fielddata was used to load the fields that needed sorting; if the field was large enough, it caused full GCs or OOMs. To avoid repeating this problem, we also provide a custom visual query component to support users' need to browse data.
# ESWriter
ES-Hadoop can only adjust read/write traffic by controlling the number of MapReduce tasks; in effect, it adjusts its behavior based on whether Elasticsearch rejects requests, which is very unfriendly to an online cluster. To solve this problem, we developed an ESWriter plugin on top of DataX that provides second-level control over record count or traffic volume.
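Second-level record control can be done with a fixed one-second window that admits at most N records before forcing the writer to wait. This is a sketch of the idea only, not the actual DataX plugin code:

```python
class RecordThrottle:
    """Second-level write throttle: cap records admitted per
    one-second window; the caller sleeps and retries on rejection."""

    def __init__(self, records_per_second: int):
        self.limit = records_per_second
        self.window_start = None   # start time of the current window
        self.count = 0             # records admitted in the current window

    def admit(self, now: float) -> bool:
        # roll over to a fresh window once a full second has elapsed
        if self.window_start is None or now - self.window_start >= 1.0:
            self.window_start, self.count = now, 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False               # over budget: caller should back off
```

Throttling by bytes instead of records is the same structure with `count` accumulating payload sizes.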
# Challenges
With the rapid growth of services, the operation and maintenance cost of the Elasticsearch clusters themselves has become unbearable. Although there are physically isolated clusters, multiple business indexes inevitably share the same physical cluster. Differing production standards across services are poorly supported, and deploying too many indexes in one cluster is a hidden danger to stable production operation. In addition, cluster capacity is difficult to scale flexibly: expanding by one node requires applying for a machine, initializing the environment, and installing software, and for physical machines the purchase lead time is long.
# Future Architecture 4.0
The current architecture accepts users' data synchronization requirements through open interfaces. Although this achieves decoupling from the business and reduces our own team's development cost, the relative development cost for users is high: moving data from database to index takes three steps, namely fetching the data from the data bus, processing it in a synchronization application, and calling the search platform's open write interface. The first and last of these steps are developed over and over in each business's synchronization program, wasting resources. We are therefore planning to integrate with the Data Transporter (DTS) developed by the PaaS team to automate data synchronization between Elasticsearch and multiple data sources through configuration.
To solve the problem of shared clusters handling applications with different production standards, we hope to further upgrade the platform-based search service into a cloud-based application mechanism: core applications are independently deployed into mutually isolated physical clusters according to their service level, while non-core applications apply for Elasticsearch cloud services on K8s using different application templates. Application templates define the service configuration for different application scenarios, addressing the difference in production standards, and the cloud service can scale promptly based on each application's running state.
# Summary
This article gave an architectural overview of the search system's evolution and the problems each stage set out to solve. We will cover the technical details in the next article in this series.