The "online" service responds to user requests, while the "offline" service converts and processes data from various sources into forms the online service can consume. "Search offline" data processing is a typical scenario that combines batch and real-time computing over massive data.
The full article is about 4,142 words and should take roughly 8 minutes to read.
1. The "Offline" and "Online" Behind Multimodal Retrieval
Baidu Search consists mainly of "search online" and "search offline" systems. The "online" service responds to user requests, while the "offline" service converts and processes data from various sources into forms the online service can consume. "Search offline" data processing is a typical scenario that combines batch and real-time computing over massive data.
Since 2015, Baidu App has offered multimodal retrieval, the most direct way intelligent search shows up in front of users. Multimodal retrieval adds visual retrieval and speech retrieval on top of traditional text retrieval.
Among these, "visual retrieval" and "text-to-image retrieval" share a great deal of offline and online technology. Taking visual retrieval as an example, its product features include word guessing, images in more sizes, image sources, vertical-category images (short videos, commodities, etc.), similar-image recommendation, and so on. The core technologies behind them are classification (online GPU model prediction) and ANN retrieval.
For ANN retrieval, the main methods currently used are cluster-based GNO-IMI, graph-based HNSW, and locality-sensitive hashing (LSH). Selection is driven chiefly by the cost of each technical solution and how well it fits the features. For example, GNO-IMI is a solution open-sourced within Baidu with relatively low memory consumption, so its cost is acceptable for ANN retrieval over tens of billions of items; LSH, applied to local features such as SIFT, improves recall in the phone-camera recognition scenario.
Behind these online technologies are more than one hundred kinds of features. Offline, crawling images from across the whole web and computing features for them carries a very large compute cost. In addition, images are attached to web pages, so the "image – image link – page link" relationship must be maintained (both offline data processing and online applications depend on this relationship; for example, tracing an image back to its source requires the URL of the page it came from).
Against this backdrop, the Search Architecture Department and the Content Technology Architecture Department, each contributing its own business and technical strengths, jointly designed and built the "Image Processing and Indexing Middle Platform" to achieve the following goals:
- Unified data acquisition and processing capabilities: consolidate the data acquisition, processing, and storage logic of image-related services, improving engineering efficiency and reducing storage and computing costs.
- For image applications at the scale of billions to tens of billions: enable rapid research, data collection, and whole-web data updates.
- Build a real-time image screening and customized delivery data path to improve the timeliness with which image resources are brought in.
Internally the project is known as Imazon. The name comes from Image + Amazon, with the Amazon standing for the middle platform's throughput, DAG processing capacity, and image capacity.
Today the image processing and indexing middle platform processes billions of images per day under complex business scenarios, with second-level real-time ingestion at hundreds-level QPS and whole-web ingestion at ten-thousand-level QPS. The platform now supports the image processing and indexing needs of multiple business lines and has greatly improved business execution efficiency.
2. Architecture and Key Technologies of the Image Processing and Indexing Middle Platform
Continuous optimization of search results depends on data and computing power, with collection, storage, and computation at the core. The general capabilities the middle platform is expected to provide include: filtering data from both the time-sensitive and whole-web image collection channels, a high-throughput streaming processing mechanism, the ability to describe the relationships between images and web pages, storage of original images and thumbnails, an online processing mechanism, and so on.
2.1 What problems does the image processing and indexing middle platform solve?
The main pipeline of the image processing and indexing middle platform goes through six stages: web page spider (fetch page content), image content extraction, image spider (crawl images), feature computation (more than 100 features), content relationship storage, and index building. As shown in the figure below:
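To make the data flow concrete, here is a minimal Go sketch of the six stages chained in order. All type and function names are hypothetical; in the real system each stage runs as a separate DAG node with its own scaling, not as calls inside one process.

```go
// Hypothetical sketch of the six-stage flow: spider -> extract -> crawl ->
// features -> relationship storage -> index build.
package main

import "fmt"

type Page struct{ URL, HTML string }
type ImageRef struct{ PageURL, ObjURL string }
type Image struct {
	Ref      ImageRef
	Bytes    []byte
	Features map[string]float32 // 100+ features in the real pipeline
}

// web page spider: fetch page content (stubbed here).
func fetchPage(url string) Page { return Page{URL: url, HTML: "<html>...</html>"} }

// image content extraction: find image links on the page.
func extractImageRefs(p Page) []ImageRef {
	return []ImageRef{{PageURL: p.URL, ObjURL: p.URL + "/a.jpg"}}
}

// image spider: crawl the image bytes.
func crawlImage(r ImageRef) Image { return Image{Ref: r, Bytes: []byte("...")} }

// feature computation: more than 100 features in practice, one placeholder here.
func computeFeatures(img *Image) { img.Features = map[string]float32{"clarity": 0.9} }

// content relationship storage: record the page/link/image relationship.
func storeRelation(img Image) { fmt.Println("store F-O-C relation for", img.Ref.ObjURL) }

// index building: make the image retrievable online.
func buildIndex(img Image) { fmt.Println("add to index:", img.Ref.ObjURL) }

func main() {
	for _, ref := range extractImageRefs(fetchPage("https://example.com")) {
		img := crawlImage(ref)
		computeFeatures(&img)
		storeRelation(img)
		buildIndex(img)
	}
}
```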
2.2 Technical indicators of the image processing and indexing middle platform
The middle platform's technical indicators are defined along three dimensions: architecture, effectiveness, and R&D efficiency.
Architecture indicators include throughput, scalability, and stability:
- Throughput: within the cost budget, maximize throughput. Concretely: single record size of ~100 KB (image + features); real-time ingestion at hundreds-level QPS; whole-web ingestion at ten-thousand-level QPS
- Scalability: cloud-native deployment with elastic scheduling of computing resources, computing faster when resources are available and slowing down when they are not
- Stability: no data loss, automatic retry, automatic replay; minute-level processing success rate for time-sensitive data; day-level processing success rate for whole-web data
Effectiveness indicators focus on data relationships:
- Accurate image–page link relationships (e.g., page/image removal, relationship updates)
R&D efficiency indicators include business generality and language flexibility:
- Business generality: support any business that relies on obtaining data from whole-web images; support feature iteration
- Language flexibility: C++, Go, and PHP
2.3 Architecture design of the image processing and indexing middle platform
Image processing is an unbounded streaming process, so the overall architecture is designed around a streaming, real-time processing system that also supports batch input. To cope with the large throughput requirement and keep business R&D efficient, the design adopts elastic computing and an event-driven approach, and decouples the deployment of business logic from the DAG framework. The details are shown in the figure below and explained later.
2.4 Infrastructure of the image processing and indexing middle platform
Baidu Infrastructure:
- Storage: Table, BDRP (Redis), UNDB, BOS
- Message queue: BigPipe
- Service frameworks: baidu-rpc, GDP (Go), ODP (PHP)
Business infrastructure relied on and built:
- Pipeline scheduling: Odyssey, supporting every DAG in the architecture panorama
- Flow control system: sits at the core entry layer and provides traffic smoothing and traffic adjustment capabilities
- Qianren: hosts, schedules, and routes CPU/GPU operators with hundreds to thousands of instances
- Content relationship engine: describes the image–web page relationship; computation is event-driven and scheduled elastically in association with Blades
- Offline microservice component: Tigris; the concrete business logic of a DAG node is executed via remote RPC
3. Optimization Practices
Below is a brief introduction to some of the middle platform's optimization practices in high-throughput, compute-intensive scenarios.
3.1 High-throughput processing architecture in practice
Cost (compute and storage) is limited. Facing a large throughput requirement, targeted optimizations were made in the following areas:
- High message queue cost
- Traffic spikes and peak/trough patterns leaving resources under-utilized
- Data backlog caused by insufficient computing power
3.1.1 Message queue cost optimization
In offline streaming data processing, passing data between DAG/pipeline stages through a message queue is the conventional approach; thanks to the queue's persistence it guarantees at-least-once delivery with no data loss. Our business has two characteristics:
- What flows through the pipeline/DAG is images plus their features, around 100 KB per record, so the message queue cost is relatively high
- Downstream operators do not necessarily need all of the data, so transparently passing every field through the message queue is not cost-effective
The specific optimization ideas are as follows:
- The message queue inside the DAG carries only a reference (a trigger msg); the actual output of each operator is stored in a bypass cache
- The bypass cache is optimized for high throughput and low cost, exploiting the data life cycle within the DAG for proactive deletion and dirty-write optimization
The specific protocol design is as follows:
- Trigger msg (byte-level), passed point-to-point between the message queue and the Miner
- TigrisTuple (~100 KB), shared between Miners via Redis
- ProcessorTuple (MB-level), read and written on demand via the offline cache (see the sketch after this list)
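As a rough illustration of this protocol, the sketch below passes only a small trigger msg through the "queue" and keeps the large TigrisTuple in a bypass cache, deleting it proactively once the DAG no longer needs it. The cache interface, key scheme, and struct fields are assumptions for the example; the real system uses BigPipe as the queue and Redis / the offline cache as storage.

```go
// Sketch of "reference in the queue, payload in the bypass cache".
package main

import (
	"encoding/json"
	"fmt"
)

// TriggerMsg is the small (byte-level) message that actually travels
// through the message queue between DAG nodes.
type TriggerMsg struct {
	DocID    string `json:"doc_id"`
	TupleKey string `json:"tuple_key"` // where the TigrisTuple lives in the bypass cache
}

// TigrisTuple is the ~100 KB payload (image + features) shared between Miners.
type TigrisTuple struct {
	ObjURL   string
	Image    []byte
	Features map[string]float32
}

// BypassCache abstracts Redis / the offline cache.
type BypassCache interface {
	Set(key string, val []byte) error
	Get(key string) ([]byte, error)
	Del(key string) error // proactive deletion once the DAG lifecycle ends
}

// produce stores the payload in the cache and returns the small trigger msg
// that would be published to the message queue.
func produce(c BypassCache, docID string, t TigrisTuple) (TriggerMsg, error) {
	key := "tigris:" + docID // hypothetical key scheme
	buf, _ := json.Marshal(t)
	if err := c.Set(key, buf); err != nil {
		return TriggerMsg{}, err
	}
	return TriggerMsg{DocID: docID, TupleKey: key}, nil
}

// consume reads the trigger msg from the queue, then fetches the payload on demand.
func consume(c BypassCache, m TriggerMsg) (TigrisTuple, error) {
	var t TigrisTuple
	buf, err := c.Get(m.TupleKey)
	if err != nil {
		return t, err
	}
	err = json.Unmarshal(buf, &t)
	return t, err
}

// memCache is an in-memory stand-in for the bypass cache.
type memCache map[string][]byte

func (m memCache) Set(k string, v []byte) error { m[k] = v; return nil }
func (m memCache) Get(k string) ([]byte, error) { return m[k], nil }
func (m memCache) Del(k string) error           { delete(m, k); return nil }

func main() {
	cache := memCache{}
	msg, _ := produce(cache, "doc42", TigrisTuple{ObjURL: "https://example.com/a.jpg"})
	t, _ := consume(cache, msg)
	fmt.Println(msg.TupleKey, t.ObjURL)
	cache.Del(msg.TupleKey) // delete as soon as the DAG no longer needs the data
}
```

The point of the design is that the message queue only ever carries byte-level references, so its cost no longer scales with the size of the images and features.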
3.1.2 Traffic smoothing and deferring peak load to off-peak computation
Because the incoming traffic has peaks, troughs, and spikes, the whole system would otherwise have to be provisioned for peak capacity, leaving resources under-utilized during off-peak periods. As shown in the figure below:
The specific optimization ideas are as follows:
Through backpressure and flow control, the system's total throughput is maximized with a fixed amount of resources:
- The flow control system smooths the traffic and narrows the gap between the mean and the peak, so that the "capacity utilization" of every module in the system stays stable at a high level
- The DAG/pipeline supports backpressure: when some module's capacity is insufficient, pressure propagates back to the flow control module, which adapts and defers peak data to be computed during troughs
- To keep the resulting data lag acceptable to the business, data is assigned priorities so that high-priority data is distributed first (the system's overall throughput must at least cover the throughput of the high-priority data); see the sketch after Figure 3 below
△ Figure 3: Flow control with three priorities
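A minimal sketch of the idea, assuming three priorities and an illustrative in-process dispatch loop (the real flow control system is a separate service): traffic is released at a steady rate, and when downstream capacity shrinks, only the higher priorities keep flowing while lower-priority data waits for a trough.

```go
// Hypothetical priority-aware flow control sketch.
package main

import (
	"fmt"
	"time"
)

const numPriorities = 3

type item struct {
	id       string
	priority int // 0 = highest priority
}

// flowController smooths traffic: producers Submit into per-priority queues,
// and a single dispatch loop releases items downstream at a steady rate.
type flowController struct {
	queues [numPriorities]chan item
}

func newFlowController(buf int) *flowController {
	fc := &flowController{}
	for i := range fc.queues {
		fc.queues[i] = make(chan item, buf) // peak data waits here until a trough
	}
	return fc
}

func (fc *flowController) Submit(it item) { fc.queues[it.priority] <- it }

// dispatchOne releases at most one item whose priority is currently allowed;
// when the downstream backpressures, maxPriority shrinks and low-priority
// data simply keeps waiting.
func (fc *flowController) dispatchOne(maxPriority int, out chan<- item) {
	for p := 0; p <= maxPriority && p < numPriorities; p++ {
		select {
		case it := <-fc.queues[p]:
			out <- it
			return
		default:
			// this priority queue is empty; try the next one
		}
	}
}

// Dispatch emits at a fixed rate, which is what flattens spikes into a
// steady load on the modules behind it.
func (fc *flowController) Dispatch(rate time.Duration, maxPriority func() int, out chan<- item) {
	tick := time.NewTicker(rate)
	defer tick.Stop()
	for range tick.C {
		fc.dispatchOne(maxPriority(), out)
	}
}

func main() {
	fc := newFlowController(1024)
	out := make(chan item, 16)
	// Pretend the downstream is congested: only priority 0 may pass right now.
	go fc.Dispatch(10*time.Millisecond, func() int { return 0 }, out)
	fc.Submit(item{id: "low-priority-doc", priority: 2})
	fc.Submit(item{id: "high-priority-doc", priority: 0})
	fmt.Println("dispatched first:", (<-out).id) // the high-priority doc goes out first
}
```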
3.1.3 Handling data backlog caused by temporarily insufficient computing power in high-throughput scenarios
In the whole-web collection scenario, feature computation is bottlenecked on GPU resources, and these features consume a huge number of GPU cards. The problem can be mitigated by off-peak scheduling, online/offline co-location, and borrowing temporary resources, but that introduces new problems: the offline pipeline cannot buffer this much data, and we do not want backpressure to hurt the processing throughput of the upstream DAG.
Specific optimization ideas:
- Analyze where the bottleneck is and split the DAG there; use the storage DB as a "natural flow control" buffer and drive the rest by events (feature computation is scheduled elastically, and once the features are in place an event triggers the downstream DAG), as sketched below.
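The sketch below illustrates the split under invented names, with channels standing in for the pipeline and a map standing in for the feature DB: the upstream DAG only records pending work, a GPU worker drains it when resources are elastically available, and a "features ready" event triggers the downstream DAG.

```go
// Hypothetical sketch: the DAG is split at the GPU bottleneck, the storage
// layer absorbs the backlog as "natural flow control", and an event emitted
// once features are in place triggers the downstream DAG.
package main

import (
	"fmt"
	"time"
)

type event struct{ docID string }

type featureStore map[string][]float32 // stands in for the feature DB

// upstream enqueues work and returns immediately, so upstream DAG throughput
// is not limited by GPU capacity.
func upstream(pending chan<- event, docID string) {
	pending <- event{docID: docID}
}

// gpuWorker is scheduled elastically; it computes features, writes them to
// the store, and emits a "features ready" event.
func gpuWorker(pending <-chan event, store featureStore, ready chan<- event) {
	for ev := range pending {
		store[ev.docID] = []float32{0.1, 0.2} // placeholder feature vector
		ready <- ev
	}
}

// downstream is the downstream DAG, triggered only by the ready event.
func downstream(ready <-chan event, store featureStore) {
	for ev := range ready {
		fmt.Println("downstream DAG triggered for", ev.docID, "features:", store[ev.docID])
	}
}

func main() {
	pending := make(chan event, 1024) // in practice the DB gives effectively unbounded buffering
	ready := make(chan event, 1024)
	store := featureStore{}
	go gpuWorker(pending, store, ready)
	go downstream(ready, store)
	upstream(pending, "doc42")
	time.Sleep(100 * time.Millisecond) // let the sketch finish printing
}
```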
3.2 Content relationship engine
The image content relationships on the Internet can be described as a tripartite graph. The following definitions describe the concept:
- F: fromurl, i.e., the web page. One F can have multiple O's. Features on the F dimension: title, page type, etc.
- O: objURL, i.e., the image link. One O points to exactly one image. Features on the O dimension: dead link, etc.
- C: the content signature (sign) of an image. Features on the C dimension: image content, OCR, clarity, people, etc.
- FO: the edge between a page and an image link. Edge features: image context, alt text
- OC: the edge between an image link and an image. Edge features: image crawl time
The content relationship engine needs to capture how these elements and their relationships change over time (e.g., page/image removal, relationship updates). Fully describing the relationships among all such elements on the Internet amounts to a graph database with on the order of 100 billion nodes and PB-level storage, and the system needs to hit the following indicators:
- Write performance:
  - Nodes: high write QPS, with single-node attributes of ~100 KB
  - Edges: 100,000-level QPS
- Read performance (full-data filtering, feature iteration):
  - Export node and edge attribute information (scan throughput: GB/s)
To meet these read and write performance requirements, a COF tripartite-graph content relationship engine was designed on top of Table. The core design ideas are:
- The C table uses a hash prefix to partition data, so that a sequential scan reads out a complete relationship (which O's a C comes from, and which F each O comes from). The C table is stored at PB scale
- The O table uses SSD media to look up the C corresponding to an O
- The F table uses SSD media to improve random read performance; it stores the reverse mapping and supports looking up O and C from F
To reduce the I/O bottleneck caused by random writes and to keep transaction complexity out of the system, correctness of the relationships is guaranteed by a version-based verification scheme: versions are checked at read time, and stale entries are removed asynchronously.
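A sketch of what the row-key layout and the read-time version check might look like; the key formats, hash-prefix length, and version scheme here are illustrative assumptions, not the actual Table schema.

```go
// Hypothetical sketch of the COF table keys and read-time version check.
package main

import (
	"crypto/md5"
	"fmt"
)

// sign is the content signature of an image (C).
type sign string

// cRowKey partitions the C table by a hash prefix of the sign, so rows for
// one image content are contiguous and a sequential scan reads the complete
// relationship (which O's the C comes from, and which F each O comes from).
func cRowKey(c sign, objURL, fromURL string) string {
	prefix := fmt.Sprintf("%x", md5.Sum([]byte(c)))[:4] // hash prefix for balanced partitions
	return fmt.Sprintf("%s|%s|%s|%s", prefix, c, objURL, fromURL)
}

// oRowKey looks up the C corresponding to an O (kept on SSD for random reads).
func oRowKey(objURL string) string { return "o|" + objURL }

// fRowKey stores the reverse mapping so that O and C can be found from F.
func fRowKey(fromURL string) string { return "f|" + fromURL }

// relation is one stored edge stamped with a version; instead of cross-table
// transactions, readers check the version at read time and stale entries are
// removed asynchronously.
type relation struct {
	RowKey  string
	Version int64
}

func isCurrent(read relation, currentVersion int64) bool {
	return read.Version == currentVersion
}

func main() {
	c := sign("img_content_sign_123")
	fmt.Println(cRowKey(c, "https://a.com/x.jpg", "https://a.com/page.html"))
	fmt.Println(oRowKey("https://a.com/x.jpg"))
	fmt.Println(fRowKey("https://a.com/page.html"))
	fmt.Println(isCurrent(relation{Version: 3}, 4)) // stale: ignore at read time, delete asynchronously
}
```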
3.3 Other Practices
To improve the iteration efficiency of business R&D and the maintainability of the system itself, we have solved a number of problems, though we are still only part-way along the road to better "R&D happiness". The focus is on R&D efficiency and maintenance cost.
For example, in terms of business onboarding efficiency:
Data source reuse
- Problem: data from 10 businesses arrives in 10 formats, with protos nested too deeply to understand
- Approach: move from heterogeneous schemas to a standard schema; manage OP inputs and outputs uniformly
DAG output reuse
- Problem: must not affect the processing throughput or speed of the upstream DAG
- Approach: chain DAGs via RPC to avoid cascading blocking; native DAG connection, with data lifecycle handled by copy-on-write & erase
Resource storage oversubscription:
- Problem: "I used this to generate a thumbnail, and now the thumbnail won't open! What, the original image was taken down too?"
- Approach: a multi-tenant mechanism, reference-counted removal, unified CDN access, and unified online intelligent cropping and compression
In terms of multi-language support:
- Problem:
  - Teams want to use C++/Python/PHP/Go, and keeping the framework compatible with all of them is complex; when things slow down, whose problem is it?
  - "I only implement one piece of business logic; I don't want to care about the details of the DAG"
- Approach:
  - Unify the DAG framework on a single language and isolate business implementations behind remote RPC (sketched below)
  - RPC Echo(trigger msg[in], tigris tuple[in], processor input list[in], processor output list[out])
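Sketching what that boundary can look like in Go, using the standard library's net/rpc purely for illustration (the real system sits on the company's RPC frameworks): the DAG framework stays in one language, and each piece of business logic only implements an Echo method with the inputs and outputs listed above.

```go
// Hypothetical sketch of the Echo RPC boundary between the DAG framework
// and business operators.
package main

import (
	"fmt"
	"net"
	"net/rpc"
)

type EchoArgs struct {
	TriggerMsg     []byte   // reference passed through the message queue
	TigrisTuple    []byte   // ~100 KB shared payload
	ProcessorInput [][]byte // operator-specific inputs
}

type EchoReply struct {
	ProcessorOutput [][]byte // operator-specific outputs
}

// Operator hosts one piece of business logic behind the RPC boundary; it
// never sees DAG scheduling details.
type Operator struct{}

func (o *Operator) Echo(args *EchoArgs, reply *EchoReply) error {
	reply.ProcessorOutput = [][]byte{append([]byte("processed:"), args.TriggerMsg...)}
	return nil
}

func main() {
	srv := rpc.NewServer()
	srv.Register(new(Operator))
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	go srv.Accept(ln)

	client, err := rpc.Dial("tcp", ln.Addr().String())
	if err != nil {
		panic(err)
	}
	var reply EchoReply
	if err := client.Call("Operator.Echo", &EchoArgs{TriggerMsg: []byte("doc42")}, &reply); err != nil {
		panic(err)
	}
	fmt.Println(string(reply.ProcessorOutput[0]))
}
```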
In terms of maintenance costs:
- Problem:
  - "Why hasn't this piece of data been indexed?"
  - 99+ unread alerts (warning or fatal): what do I do now that the message queue is backed up again?
- Approach:
  - Distributed application log tracing
  - Monitoring & alerting, categorized, each with a HOWTO
  - Business core metrics: ingestion scale per second, ingestion latency by quantile, feature coverage, business distribution scale per second, data loss rate, and proportion of ingestions that time out
  - System core metrics: DAG submission PV, DAG capacity/utilization, OP status (OK, FAIL, RETRY, ...), OP capacity/utilization, and OP latency/timeout rate
  - Key dependency metrics: throughput, latency, and failures of dependent services; OP internal detail monitoring
Author | imazon
We are hiring for positions in computer vision, index retrieval, offline streaming processing, and more. Follow the Baidu Geek Talk public account and send in your resume. We look forward to having you join us!
Welcome to follow the Baidu Geek Talk public account, where more technical articles, perks, and internal referrals await~