About the author
Xiong Ping leads the public R&D team of Ctrip's International Division and is mainly responsible for developing basic components for internationalization and for market-related projects. An open source community enthusiast who enjoys reading the source code of excellent open source projects and has a deep interest in new technology.
Recently I have been refactoring the original SEO project, and the work has now entered its later stage. I want to share the details of the whole refactoring, hoping they will be helpful to you.
One, what is the SEO project
SEO (Search Engine Optimization) means using the rules of search engines (currently mainly Google) to improve a site's natural ranking in search results and the site's brand influence.
Users search for keywords in a search engine, click a search result to jump directly to one of the SEO Landing Pages, and are then directed from the Landing Page to the website being promoted, so that the traffic is converted into orders.
The SEO project mainly designs Landing Pages for different promotion dimensions and provides the data behind them. At present the project mainly covers the Ctrip hotel and air ticket production lines, and more production lines may be connected in the future.
Two, why refactor
The original SEO project mainly has the following problems:
Code coupling: front-end and server-side code were coupled in the same project and depended on each other during development, which made features such as modular and configurable page information and AB testing difficult to implement.
Data storage: SEO project data was stored in the same DB as data from other systems, and some tables were shared. As a result, some table fields were useless from the perspective of the SEO project but could not be removed.
Data update: a full update of all data took about 2-3 days and required manual intervention throughout. If any problem occurred during the update, the full update had to be run again, and dirty data was left behind. The data was also incomplete. For example, if a city has no hotel or airport data, the city data is meaningless, because there is nothing to show for that city when promoting along the city dimension.
Unmet needs: the bottom of each SEO page shows related link information that must be calculated according to certain rules. Calculating one production line of one site in one language takes about 4 hours. There are 16 sites, each with 15 languages and 3 production lines, so calculating the link information for all languages and production lines of all sites takes 16*15*3*4 = 2880 hours (120 days). Obviously, the existing implementation could not meet business needs.
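The estimate above can be checked with a few lines of Python. This is only the arithmetic from the text, assuming the calculations run one after another with no parallelism.

```python
# Figures taken from the text above.
sites = 16
languages_per_site = 15
production_lines = 3
hours_per_combination = 4  # one production line of one site in one language

total_hours = sites * languages_per_site * production_lines * hours_per_combination
total_days = total_hours / 24

print(total_hours, total_days)  # 2880 120.0
```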
Why do these problems exist? The main reasons are as follows:
Requirements iterate too fast: IBU has been developing rapidly, many requirements must be completed in a very short time, and future changes in requirements are hard to anticipate. Under these conditions it is inevitable that some short-term, tentative, quick solutions are chosen.
Few developers: before the refactoring, only one or two people did all the development work for the entire SEO project, and they were overwhelmed whenever requirements came in.
Complex data: SEO needs almost all the data related to air tickets and hotels, and collecting it is scattered and complicated. Completing all the fields of a single record may require several data sources, and the large data volume leads to long update times. The strong dependencies between records make it even harder to guarantee data integrity during updates.
Three, technology selection
Technology selection is very important for the whole project. This section mainly covers the selection made for the server side.
Development language: the previous mix, with part of the code in Java and part in PHP, was dropped, and the development language was unified as Java. The main reasons are that the company focuses on Java and that all members of the team are most proficient in it.
Data storage: MySQL is the main data store. The previous scheme of keeping some data in ES and the rest in MySQL was dropped. Hotel data is the largest data set in the SEO project, and it is only in the tens of millions of records, which MySQL handles without any problem.
RPC framework: the company provides two ways to expose services externally, one through the Baiji contract and the other through CDubbo. The latter was not chosen mainly because it had only just been launched at the time and might have been slightly less stable than the former. The whole team also understood the former better and had used it more, which reduced the learning cost.
Four, design scheme
The overall architecture of an SEO project is shown below:
1. Vampire
Vampire is mainly used to collect data and convert it into formatted data. Data can be collected incrementally or in full from MQ, DB, or API sources (more data sources will be added later). The core idea is to pull data from the different data sources concurrently, convert it into formatted data, and then call Faba's Write interface to write it to the DB.
Because the full data set is large, pulling full data is the most complicated part of the whole process.
At present, full updates are mostly done by calling APIs. This requires considering the QPS and response time of the called API, the interval between updates, the size of each API response (paging is needed in some cases), API timeouts, Gateway timeouts, network bandwidth, and dependencies between data. These determine the number of threads, the call frequency, the call cycle, the call time (usually off-peak), the number of machines to deploy, and the number of CPU cores and memory size of the virtual machines. Different parameters need to be tuned for different APIs.
Incremental updates are relatively simple and are mainly connected through MQ. They need to handle messages arriving out of order (a message sent later arriving first), duplicate messages, message loss, MQ queue size, and so on. The data provider may also have dirty data, or may be unable to sustain Vampire's traffic during a pull, so pause, resume, and forced update must be supported as well.
Whether pulled incrementally or in full, data is eventually converted into formatted data and written to the DB. Processing speed in this conversion is critical, because Vampire as a whole is a producer-consumer model. The producers are the connected data sources. The consumers convert the data and then call the Write interface provided by Faba to complete the write.
Ideally, the production speed equals the consumption speed. When production is faster than consumption, unconsumed data accumulates in memory, which easily causes OOM. In practice, therefore, the consumption speed is generally kept higher than the production speed.
For Vampire, the producer speed is the sum of the traffic from all connected data sources and grows as data sources are added, while the capacity of the consumers is fixed. Improving the throughput of data collection and conversion therefore comes down to speeding up the consumers, that is, speeding up Faba's Write interface (how Faba handles data is described later).
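The producer-consumer relationship described above can be sketched with a bounded in-memory queue. This is a minimal illustration, not Vampire's actual implementation: the buffer size, the number of consumer threads, and the names `produce`, `consume`, `run`, and `fake_write` are all assumptions made for the example. The bounded queue is one common way to keep unconsumed data from growing without limit when producers are faster than consumers.

```python
import queue
import threading

BUFFER_SIZE = 100   # hypothetical bound on unconsumed records held in memory
NUM_CONSUMERS = 4   # hypothetical number of consumer threads
STOP = object()     # sentinel telling a consumer to exit


def produce(source_records, buffer):
    # put() blocks while the buffer is full, so a fast producer is slowed
    # down instead of piling up records in memory.
    for record in source_records:
        buffer.put(record)


def consume(buffer, write, convert):
    # Convert each record and hand it to the write function until STOP arrives.
    while True:
        record = buffer.get()
        if record is STOP:
            break
        write(convert(record))


def run(sources, write, convert=lambda r: r):
    buffer = queue.Queue(maxsize=BUFFER_SIZE)
    consumers = [threading.Thread(target=consume, args=(buffer, write, convert))
                 for _ in range(NUM_CONSUMERS)]
    producers = [threading.Thread(target=produce, args=(src, buffer))
                 for src in sources]
    for t in consumers + producers:
        t.start()
    for t in producers:
        t.join()
    # All records are queued; one STOP per consumer ends the run.
    for _ in consumers:
        buffer.put(STOP)
    for t in consumers:
        t.join()


written = []
lock = threading.Lock()


def fake_write(record):
    # Stand-in for a call to the Write interface.
    with lock:
        written.append(record)


run([range(0, 500), range(500, 1000)], fake_write)
print(len(written))  # 1000
```

With a bound on the buffer, raising throughput means making `write` faster or adding consumers, which matches the point made above about speeding up the Write interface.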
At present, four 8-core, 8 GB VMs are deployed in the production environment. Vampire can process 10K+ records per second, and processing 10 million records takes about 30 minutes.
2. Faba
This subproject provides data Read and Write operations for the entire SEO project. The Write interface is mainly called by Vampire, which uses it to write the collected and converted data to the DB. The Read interface provides external access to the data and is called by Service.
The Write interface is implemented asynchronously. Data from Vampire's calls is first stored in a message queue and consumed afterwards. This has three benefits. First, it improves the QPS and response time of the Write interface. Second, identical operations can be merged into batch operations to minimize the use of DB connections. Third, the data in a batch can be deduplicated as far as possible to avoid unnecessary write operations.
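The batching step can be illustrated as follows. This is a sketch under assumptions, not Faba's code: the record layout (`table`, `id`, `updated_at`) and the function name `build_batch` are hypothetical. It collapses the writes waiting in the queue so that each record is written once per batch, keeping the most recent version.

```python
def build_batch(pending_writes):
    """Keep only the newest pending write for each (table, id) key."""
    newest = {}
    for row in pending_writes:
        key = (row["table"], row["id"])
        kept = newest.get(key)
        if kept is None or row["updated_at"] > kept["updated_at"]:
            newest[key] = row
    return list(newest.values())


pending = [
    {"table": "city", "id": 1, "name": "B", "updated_at": 10},
    {"table": "city", "id": 2, "name": "X", "updated_at": 11},
    {"table": "city", "id": 1, "name": "C", "updated_at": 12},
    {"table": "city", "id": 1, "name": "B", "updated_at": 10},  # retried duplicate
]

batch = build_batch(pending)
print(len(batch))  # 2
```

Four queued writes become two rows in one batch, so the DB sees one batched operation instead of four single ones.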
There are three factors to consider when designing the Write interface:
First, idempotency. The data to be written comes from a message queue with a retry mechanism, so writes must be idempotent.
The message queue also cannot guarantee that data arrives in order. Ordering only matters for incremental pulls. It does not affect full pulls, because in a full pull each record is pulled exactly once, so the update operations for different records are independent of each other and order need not be considered.
For incremental pulls, suppose the name of a city is changed from A to B and then from B to C. The two update operations are pushed to Vampire in order, and Vampire calls Faba's Write interface after converting them into formatted data. When consuming from the message queue, Faba may receive the B-to-C change first and the A-to-B change afterwards. To handle this, the time at which each modification occurred is carried as a timestamp, and a record in the DB is updated only when the incoming timestamp is later than the update time stored in the DB. Everything else is filtered out. In this example, the B-to-C change is written to the DB and the A-to-B change is filtered out.
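The timestamp rule can be expressed as a conditional UPDATE. The sketch below uses an in-memory SQLite database purely so that it runs anywhere (the project itself uses MySQL); the `city` table, its columns, and the function `apply_update` are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE city (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
)
conn.execute("INSERT INTO city VALUES (1, 'A', 100)")


def apply_update(conn, city_id, name, modified_at):
    """Write the change only if it is newer than what the DB already holds."""
    cur = conn.execute(
        "UPDATE city SET name = ?, updated_at = ? "
        "WHERE id = ? AND updated_at < ?",
        (name, modified_at, city_id, modified_at),
    )
    conn.commit()
    return cur.rowcount == 1  # False means the change was filtered out


first = apply_update(conn, 1, "C", 102)   # the B-to-C change arrives first
second = apply_update(conn, 1, "B", 101)  # the A-to-B change arrives late
replay = apply_update(conn, 1, "C", 102)  # the queue retries a message

name = conn.execute("SELECT name FROM city WHERE id = 1").fetchone()[0]
print(first, second, replay, name)  # True False False C
```

The same condition covers both concerns: a late message is filtered because its timestamp is older, and a retried message is filtered because its timestamp is not strictly newer, which makes the write idempotent.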
Second, consumption rate. The bottleneck of the whole write path is clearly the DB write operation. The company's DB connection pool size is 100, so when multiple threads consume data from the message queue, the thread pool size cannot exceed 100. Once the capacity of the consumers is determined, the allowable capacity of the producers follows from a simple calculation. In theory, the total amount of data produced by all producers per hour should equal what the 100 consumer threads can write in that hour, that is, 100 threads * the average number of records per batch * the number of batches each thread completes per hour. This is the ideal situation.
In reality, there are three more factors to consider:
1) The size of the message queue, i.e. its capacity to hold pending data, which depends on machine memory;
2) The acceptable data delay, i.e. the time from a record entering the message queue to its being written to the DB;
3) I/O processing capability. Writing data to the DB produces a large number of I/O operations, especially with batch writes. Because this factor was not considered at first, normal read and write operations of other DBs on the same physical machine as the SEO DB raised a large number of timeout alarms.
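The capacity estimate can be written out as a small calculation. Only the limit of 100 threads comes from the text. The batch size, the time per batch, the production rate, and the function names are hypothetical values chosen to show the shape of the calculation.

```python
def consumer_capacity_per_hour(threads, records_per_batch, seconds_per_batch):
    """Records that the consumer threads can write to the DB in one hour."""
    batches_per_thread_per_hour = 3600 / seconds_per_batch
    return threads * records_per_batch * batches_per_thread_per_hour


def backlog_after(hours, produced_per_hour, consumed_per_hour):
    """Records left waiting in the message queue after the given time."""
    return max(0, (produced_per_hour - consumed_per_hour) * hours)


# 100 threads is the connection pool limit from the text; the rest is assumed.
capacity = consumer_capacity_per_hour(
    threads=100, records_per_batch=50, seconds_per_batch=0.5
)
backlog = backlog_after(1, produced_per_hour=40_000_000, consumed_per_hour=capacity)

print(capacity, backlog)  # 36000000.0 4000000.0
```

A positive backlog has to fit in the message queue (factor 1) and translates into delay before a record reaches the DB (factor 2), while shortening the time per batch is limited by the I/O capacity of the DB host (factor 3).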
Third, data priority. Vampire pulls data from different data sources, and several sources may each provide some of the fields of the same record. Data quality differs between sources, so for the same field of the same record, different sources have different priorities, with higher priority meaning higher quality. The priority is defined when a data source is connected, and data updates must then be decided according to it. Currently each field of a record has only one data source, so this problem can be ignored for now.
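One possible way to apply such priorities is a field-level merge. Since the article states that this case does not arise yet, the following is purely a hypothetical sketch: the source names, the priority values, and `merge_field_updates` are all invented for illustration.

```python
# Hypothetical priorities, assigned when a data source is connected.
SOURCE_PRIORITY = {"source_a": 2, "source_b": 1}


def merge_field_updates(current, incoming, source):
    """Merge incoming field values into a record.

    current maps field -> (value, source); incoming maps field -> value.
    A field is overwritten only by a source of equal or higher priority.
    """
    merged = dict(current)
    for field, value in incoming.items():
        existing = merged.get(field)
        if existing is None or SOURCE_PRIORITY[source] >= SOURCE_PRIORITY[existing[1]]:
            merged[field] = (value, source)
    return merged


record = {}
record = merge_field_updates(
    record, {"name": "Shanghai", "country": "CN"}, "source_a"
)
record = merge_field_updates(
    record, {"name": "shanghai city", "timezone": "UTC+8"}, "source_b"
)

print(record["name"])      # ('Shanghai', 'source_a')
print(record["timezone"])  # ('UTC+8', 'source_b')
```

The lower-priority source fills in the field it alone provides but cannot overwrite a value that came from the higher-priority source.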
(Figure: Write interface performance)
The Read interface currently reads data from the DB, and its performance depends on two factors:
First, the design of the database table structure. Data redundancy is reduced, and each original table is split vertically into smaller tables as far as possible. Indexes are created according to business needs so that every SQL query uses an index. Complex SQL queries are split into several simple SQL statements, each of which hits an index, and these simple statements are reused as much as possible. If the result of a query is large and needs paging, the execution of the SQL is analyzed to determine a reasonable page size. Complex queries and paged queries are thus served by running several simple SQL statements and assembling their results in the application.
Second, the design of the interface. The external Read interfaces are also kept as simple as possible, with simple input parameters and simple return values.
For example, an interface has three parameters: A, B, and C. Assuming that B can be derived indirectly from A and C, the parameter B is unnecessary and should be removed.
A simple return value means that the response is not too large. Generally a response is smaller than 4 KB, and every data field in it is useful.
There are two important factors to consider when splitting interfaces:
1) Whether all useful data in the DB can be obtained by combining calls to the interfaces;
2) To complete a particular function, the number of calls to simple interfaces should be as small as possible, preferring interfaces with fast responses and calling slower ones as little as possible.
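The table-design approach described above (several simple indexed queries whose results are assembled in the application) might look like the sketch below. It uses an in-memory SQLite database so that it is runnable as is. The `city` and `hotel` tables, the index, and the function names are assumptions, and the keyset paging (`id > last seen id`) is a technique chosen for this sketch rather than something the article specifies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE city (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE hotel (id INTEGER PRIMARY KEY, city_id INTEGER, name TEXT);
CREATE INDEX idx_hotel_city ON hotel (city_id, id);
INSERT INTO city VALUES (1, 'Shanghai'), (2, 'Beijing');
INSERT INTO hotel VALUES
    (10, 1, 'Hotel A'), (11, 1, 'Hotel B'), (12, 1, 'Hotel C'), (13, 2, 'Hotel D');
""")


def city_by_id(city_id):
    # Simple query 1: primary-key lookup, reusable by other interfaces.
    return conn.execute(
        "SELECT id, name FROM city WHERE id = ?", (city_id,)
    ).fetchone()


def hotels_page(city_id, after_id, page_size):
    # Simple query 2: one page of hotels for a city, ordered by id.
    return conn.execute(
        "SELECT id, name FROM hotel WHERE city_id = ? AND id > ? "
        "ORDER BY id LIMIT ?",
        (city_id, after_id, page_size),
    ).fetchall()


def city_with_hotels(city_id, after_id=0, page_size=2):
    # The "complex" result is assembled in the application, not with a JOIN.
    city = city_by_id(city_id)
    hotels = hotels_page(city_id, after_id, page_size)
    return {"city": city[1], "hotels": [name for _, name in hotels]}


page1 = city_with_hotels(1)
page2 = city_with_hotels(1, after_id=11)
print(page1)  # {'city': 'Shanghai', 'hotels': ['Hotel A', 'Hotel B']}
print(page2)  # {'city': 'Shanghai', 'hotels': ['Hotel C']}
```

Each simple query has few parameters and a small result, which also matches the interface guidelines above.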
With a data volume of 10 million+ records, the Read interface accesses the database directly without caching. QPS reaches 1400+ for simple queries and 400+ for complex multi-condition paged queries.
Caches in Faba are classified as local caches and distributed caches.
The local cache mainly stores data that is small in volume, frequently accessed, and tolerant of some inconsistency. The distributed cache is built on Redis and stores data that is relatively large in volume, small in value size, and frequently accessed.
Small data sets are cached in full and refreshed regularly. For large data sets, an LRU eviction strategy is used to improve the cache hit ratio within a fixed cache space. Current demand can be met by the QPS achieved with direct DB access, so cache development has low priority. It is still under way, and performance data for the cached interfaces cannot be provided yet.
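For readers unfamiliar with LRU eviction, here is a minimal sketch of the idea. It is not Faba's cache (which, per the text, is still in development): the class `LRUCache`, its capacity, and the `load` callback standing in for a DB read are assumptions for illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-size cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key, load):
        if key in self.items:
            self.items.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.items[key]
        self.misses += 1
        value = load(key)                # stand-in for reading from the DB
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used
        return value

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


cache = LRUCache(capacity=2)
for key in [1, 2, 1, 3, 2]:
    cache.get(key, lambda k: f"row-{k}")

print(cache.hits, cache.misses, list(cache.items))  # 1 4 [3, 2]
```

Tracking hits and misses alongside the cache makes it straightforward to report a hit ratio.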
3. Service
According to the analysis of business requirements, the SEO pages of each production line consist of several sets of pages, each set promoting from a different angle. Each page is composed of several modules, and one Module corresponds to one interface.
Take air tickets as an example: the flight ticket SEO pages contain two sets of pages, departure pages and airport pages. The departure page is composed of three modules A, B and C, and the airport page of three modules B, C and D. Only four interfaces need to be developed, one for each of the modules A, B, C and D, which improves interface reuse. A Module in a page can also be configured to display different data by language, currency, city and other dimensions.
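The page and module relationship in this example can be sketched as a simple mapping. The page and module names come from the example above. The functions `make_interface` and `render_page` and the shape of the returned data are hypothetical.

```python
# Page composition from the example: two sets of pages sharing modules B and C.
PAGES = {
    "departure": ["A", "B", "C"],
    "airport": ["B", "C", "D"],
}

# Each distinct module needs exactly one interface, however many pages use it.
interfaces_needed = sorted({m for modules in PAGES.values() for m in modules})


def make_interface(module_name):
    # Hypothetical interface: returns data for one module, varying by context.
    def interface(ctx):
        return {"module": module_name, "language": ctx["language"]}
    return interface


MODULE_INTERFACES = {name: make_interface(name) for name in interfaces_needed}


def render_page(page, ctx):
    # A page is assembled by calling the interface of each of its modules.
    return [MODULE_INTERFACES[m](ctx) for m in PAGES[page]]


print(interfaces_needed)  # ['A', 'B', 'C', 'D']
print([m["module"] for m in render_page("airport", {"language": "en"})])
# ['B', 'C', 'D']
```

Six module slots across two pages are served by four interfaces, and the context passed to each interface is where per-language or per-currency configuration would apply.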
4. Page
This project is primarily the responsibility of the front end team and will not be described in detail here.
5. Portal
Portal is mainly composed of four modules. The Config module configures, by language, currency and other conditions, the results returned and the sort order of each Service interface under different parameters. The Log module records the progress, duration and logs of Vampire data updates.
The AB Test module is used together with the Config module to compare different configurations and help business staff make better decisions. The Statistic module collects performance data such as the cache hit ratio in Faba.
Five, summary
Data is the core of the SEO project. How data is collected and updated, and how data quality accumulates with each update, is the key to the whole project. The interfaces and data tables are designed to be as simple as possible to improve the performance of the whole project.
This article gives only a general description of the overall SEO refactoring scheme and does not go into the implementation details of the design. Some non-core functions are still under development. Interested readers can leave a message, and criticism is welcome.