About the authors

Lv Yalin, head of the infrastructure architecture R&D team at Zuoyebang, joined the company in 2019. At Zuoyebang he has led the evolution of the cloud-native architecture and driven the adoption of containerization, service governance, the Go microservice framework, and DevOps.

Haoran Zhang joined Zuoyebang in 2019 as a senior infrastructure architect. At Zuoyebang he has driven the evolution of the cloud-native architecture and is responsible for multi-cloud Kubernetes cluster construction, Kubernetes component development, Linux kernel optimization and tuning, and the containerization of low-level services.

Background

Large-scale retrieval systems have long been a cornerstone of many companies' platform businesses. They typically run as very large clusters of thousands of bare-metal servers, hold huge volumes of data, face extremely demanding requirements for performance, throughput, and stability, and tolerate little failure. Beyond operations, data iteration and service governance in such super-large-cluster, massive-data scenarios are also a huge challenge: the efficiency of incremental and full data distribution, the tracking of short- and long-term hot data, and similar problems all deserve further study. This article presents our internal design and implementation of a Fluid-based compute-storage separation architecture, which significantly reduces the complexity of large-scale retrieval services and lets a large-scale retrieval system be managed as smoothly as an ordinary online service.

Problems faced by large-scale retrieval systems

Many of Zuoyebang's learning materials depend on a large-scale data retrieval system for intelligent analysis and search. Our cluster spans more than a thousand servers, and the total data volume is at the hundreds-of-terabytes level. The whole system consists of several shards, with every server in a shard loading the same data set. Operationally, we require P99 latency around 1.x ms, peak throughput at the 100 GB level, and stability of 99.999% or higher.

In the past, to improve the efficiency and stability of data reads, we leaned toward storing data locally. The retrieval system produces index items every day, and we need TB-scale data updates: the data is produced by offline database-building services and must be distributed to the corresponding shards. This mode brought many challenges, with the key issues concentrated in data iteration and scalability:

  1. Discrete data sets: In operation, every node of a shard must hold a complete copy of that shard's data, which creates a data synchronization and distribution problem. In practice, synchronizing data to every server node requires hierarchical delivery: first to one tier (tens of nodes), then a second tier (hundreds), then a third tier (thousands). The distribution cycle is long, and each layer must be verified to guarantee data correctness.

  2. Weak elasticity of service resources: In the original architecture, computing and storage were tightly coupled, with data storage bound to compute resources, so resources could not be scaled flexibly. Capacity expansion typically took hours, leaving the system unable to cope with sudden traffic peaks.

  3. Limited single-shard data scalability: The data ceiling of a single shard is bounded by the storage ceiling of a single node in the sharded cluster. When that ceiling is reached, the data set has to be split, and such splits are forced by storage limits rather than driven by business requirements.

These problems of data iteration and scalability create cost pressure and make it hard to automate the overall process.

Analyzing the retrieval system's operation and its data update process, we found that the key problems we face today all stem from the coupling of computing and storage, so we considered how to decouple the two. Only by introducing a compute-storage separation architecture can we fundamentally resolve the complexity: above all, it breaks the pattern in which every node holds a full copy of its shard's data, and instead stores a shard's data logically on remote machines. Compute-storage separation does bring other problems, such as stability, but these problems are solvable. On this basis we concluded that compute-storage separation is the right direction for this scenario, since it fundamentally resolves the system's complexity.

Compute-storage separation addresses the complexity problem

To address the concerns above, the new compute-storage separation architecture had to achieve the following goals:

  1. Read stability: after the change, the original local file reads are served by a chain of components. The data loading method may change, but the stability of data reads must stay at the same level as before.

  2. When the roughly one thousand nodes across the shards update data simultaneously, read speed must be maximized and network pressure kept within bounds.

  3. Data must be readable through POSIX interfaces. POSIX is the mode most adaptable to diverse business scenarios; it shields the upstream from downstream changes without intruding on the business.

  4. Controllability of the data iteration process. For an online business, data iteration should be treated as a CD process on a par with service iteration, so its controllability is extremely important.

  5. Data set scalability. The new architecture must be a replicable, easily extensible pattern so that it can cope well with growth in data set size and cluster scale.

To achieve these goals, we chose the open source Fluid project as the key link in the new architecture.

Component introduction

Fluid is an open source, Kubernetes-native distributed data set orchestration and acceleration engine. It mainly serves data-intensive applications in cloud-native scenarios, such as big data and AI applications. Through the data layer abstraction it provides on Kubernetes, data can be moved, copied, evicted, transformed, and managed flexibly and efficiently, like a fluid, between storage sources such as HDFS, OSS, and Ceph and the cloud-native application computing in the upper Kubernetes layer. The concrete data operations are transparent to users, who no longer need to worry about the efficiency of accessing remote data or the convenience of managing data sources, nor about how to help Kubernetes make operations and scheduling decisions.

Users simply access the abstracted data through the most natural Kubernetes-native data volume mode, leaving the remaining tasks and low-level details to Fluid. The Fluid project currently focuses on two key scenarios: data set orchestration and application orchestration.

Data set orchestration caches the data of a specified data set onto Kubernetes nodes with specified characteristics, while application orchestration schedules applications onto nodes that can store, or already store, the specified data set. The two can also be combined into co-orchestration scenarios, in which node resources are scheduled according to the needs of both data sets and applications.
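To make data set orchestration concrete, here is a minimal sketch of the two Fluid resources involved: a Dataset and its cache runtime. The storage source, resource names, node label, and cache sizes below are hypothetical placeholders for illustration, not the configuration described in this article:

```yaml
# A Fluid Dataset: where the data lives, and which nodes may cache it.
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: index-data                              # hypothetical name
spec:
  mounts:
    - mountPoint: oss://example-bucket/index/   # hypothetical storage source
      name: index
  nodeAffinity:                   # cache only on nodes with this (hypothetical) label
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: example.com/fluid-cache
              operator: In
              values: ["true"]
---
# The cache runtime backing the Dataset (Alluxio, as used in this article).
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: index-data                # must match the Dataset name
spec:
  replicas: 3                     # number of cache worker nodes
  tieredstore:
    levels:
      - mediumtype: MEM           # in-memory tier; SSD/HDD tiers also work
        path: /dev/shm
        quota: 10Gi
```

Once both resources are created, Fluid binds them and exposes the cached data as a PersistentVolumeClaim named after the Dataset.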

Why we chose Fluid

  1. The retrieval service is already containerized, which makes it a natural fit for Fluid.

  2. As a data orchestration system, Fluid lets the upper layer consume data without knowing its concrete distribution; at the same time, its data-aware scheduling can place services close to the data and accelerate data access.

  3. Fluid implements a PVC interface, so business Pods can mount the data transparently and use it as if it were a local disk (see the Pod spec sketch after this list).

  4. Fluid provides distributed, tiered caching of metadata and data, as well as efficient file retrieval.

  5. Fluid plus Alluxio has several built-in cache modes (back-to-source mode, full cache mode), cache policies (optimizations for small-file scenarios, among others), and storage media (disk, memory), so it adapts well to different scenarios and meets a variety of business needs without much modification.
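On the consuming side, Fluid exposes each bound Dataset as a PVC with the same name, so a business Pod mounts it like any other volume and reads the files with ordinary POSIX calls. A minimal sketch, with a hypothetical image and paths, reusing the hypothetical index-data Dataset from the earlier sketch:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: retrieval-worker                      # hypothetical name
spec:
  containers:
    - name: server
      image: example/retrieval-server:latest  # hypothetical image
      volumeMounts:
        - name: index-vol
          mountPath: /data/index   # the service reads index files as if from local disk
  volumes:
    - name: index-vol
      persistentVolumeClaim:
        claimName: index-data      # PVC created by Fluid, named after the Dataset
```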

Implementation practice

  1. Separating cache nodes from compute nodes: Although co-deploying the FUSE client and the cache worker gives better data locality, in the online scenario we ultimately chose to separate cache nodes from compute nodes: trading somewhat longer startup time for better elasticity is worthwhile, and we do not want stability issues on business nodes to become entangled with stability issues on cache nodes. Fluid supports the schedulability of data sets, in other words, the schedulability of cache nodes: by specifying the nodeAffinity of a Dataset (as in the sketch above), we place its cache nodes so that they provide caching services efficiently and elastically.

  2. High demands of the online scenario: Online services place high requirements on the speed, integrity, and consistency of data access, so partial data updates and unexpected back-to-source requests must not occur. The choice of caching and update strategy is therefore critical.

    • A suitable caching strategy: Based on the requirements above, we chose Fluid's full cache mode. In full cache mode, all requests are served from the cache and never fall back to the source, which avoids unexpected, slow back-to-source requests. At the same time, the DataLoad process is driven by the data update workflow, which is safer and more standardized.

    • An update process combined with an approval flow: Data updates for an online business are a form of CD as well, and likewise need to be governed by an update process. Combining the DataLoad mode with an approval flow makes online data releases safer and more standardized.

    • Atomicity of data updates: Because a model consists of many files, a complete model is usable only after all of its files have been cached. Under the premise of full caching with no back-to-source, the DataLoad process must therefore be atomic: the new data version must not be readable while it is being loaded, and becomes readable only once loading completes (see the DataLoad sketch after this list).
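Fluid models cache warm-up as a DataLoad resource, which is what allows the update workflow described above to drive and gate the loading step. A minimal sketch, reusing the hypothetical Dataset name from the earlier sketches:

```yaml
# Preload the whole data set into the cache before switching to the new version.
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: index-data-warmup          # hypothetical name
spec:
  dataset:
    name: index-data               # the Dataset to warm up
    namespace: default
  loadMetadata: true               # refresh metadata so newly added files are visible
  target:
    - path: /                      # load everything under the mount point
      replicas: 1                  # cache replicas per data block
```

The update workflow can watch the DataLoad's status and switch traffic to the new data version only after the load reports completion, which is one way to enforce the atomicity requirement above.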

These schemes and strategies, combined with our automated database-building and data version management capabilities, greatly improve the security and stability of the whole system and make the end-to-end workflow more intelligent and automated.

Conclusion

With the Fluid-based compute-storage separation architecture, we successfully achieved:

  1. Minute-level distribution of data at the hundred-terabyte scale.

  2. Atomic data versioning and data updates, which turn data distribution and updating into a controllable, more intelligent, and automated process.

  3. Retrieval services scale horizontally through TKE HPA just like ordinary stateless services (a minimal HPA sketch follows); faster scaling brings greater stability and availability.
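For reference, such horizontal scaling can be expressed with a standard Kubernetes HorizontalPodAutoscaler, which TKE supports; the names and thresholds below are hypothetical:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: retrieval-server-hpa       # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: retrieval-server         # hypothetical Deployment running the retrieval service
  minReplicas: 10
  maxReplicas: 100
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # scale out above 60% average CPU
```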

Looking ahead

The compute-storage separation pattern allows services once considered very special to become stateless and be incorporated into DevOps like ordinary services. The Fluid-based data orchestration and acceleration system is our opening move in practicing compute-storage separation. Beyond the retrieval system, we are also exploring Fluid-based patterns for model training and distribution in our OCR system.

In future work, we plan to keep optimizing the scheduling strategies and execution modes of upper-layer jobs on top of Fluid, and to expand model training and distribution further to improve overall training speed and resource utilization. We also plan to help the community evolve Fluid's observability and high availability, to the benefit of more developers.

About us

For more cloud-native cases and knowledge, follow the [Tencent Cloud Native] WeChat public account of the same name~

Benefits:

① Reply [Manual] in the public account backend to get the "Tencent Cloud Native Roadmap Manual" & "Tencent Cloud Native Best Practices"~

② Reply [Series] to get the "100+ super-practical cloud-native pieces in 15 series", covering Kubernetes cost reduction and efficiency improvement, K8s performance optimization practices, best practices, and more.

③ Reply [White Paper] to get the "Tencent Cloud Container Security White Paper" & "The Source of Cost Reduction – Cloud Native Cost Management White Paper v1.0".