Author’s brief introduction
Peng Xu leads the Ctrip flight ticket big data platform and is responsible for its construction, operation, and maintenance. He has deep knowledge of open source big data products such as Spark, Presto, and Elasticsearch, and is the author of Apache Spark Source Code Anatomy. This article is based on Peng Xu's talk at the DAMS 2017 China Data Asset Management Summit and was first published by the DBAplus community (ID: DBAplus).
There are many open source big data projects today, so the first difficulty in building a platform lies in choosing appropriate technologies for the overall architecture. The second is how to use the platform to analyze business data so that users get a good interactive experience. The third is not modeling in the scientific sense but team modeling: how to mix people at different levels into an effective big data team.
I. Selection of data platform technology
1. Overall framework
This framework is a generic, even common, one. The flow from data sources to message queue to data cleaning to data presentation is easy to imagine; what differs is the concrete choice of components to fill each slot, and in different scenarios everyone's choices differ.
For the message queue layer we chose Kafka, which is widely used at present because of its high throughput. It combines push and pull, with the consumer side actively pulling data. For ETL there is a general desire for a custom approach; a popular one is LinkedIn's Camus for synchronizing data from Kafka to HDFS. This should be a popular architecture.
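The consumer-side pull model mentioned above can be illustrated with a toy sketch. This is an analogy, not Kafka's actual client API: the broker keeps an append-only log, and each consumer tracks its own offset and fetches batches at its own pace.

```python
class Broker:
    """Toy stand-in for a Kafka broker: an append-only log."""
    def __init__(self):
        self.log = []

    def append(self, msg):
        self.log.append(msg)

    def fetch(self, offset, max_records):
        # The broker never pushes; consumers read from an offset.
        return self.log[offset:offset + max_records]


class Consumer:
    """Pull-based consumer: owns its offset, polls at its own pace."""
    def __init__(self, broker):
        self.broker = broker
        self.offset = 0

    def poll(self, max_records=2):
        batch = self.broker.fetch(self.offset, max_records)
        self.offset += len(batch)
        return batch
```

Because the consumer owns the offset, a slow consumer simply falls behind instead of being overwhelmed, which is part of what gives this design its high throughput.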
The data in HDFS is basically prepared for batch processing, so which analysis engine to choose for batch analysis may be the controversial point: Hive, Spark, Presto, Impala, and others. The choice among these engines depends on the specific usage scenario.
Here is why we chose Presto over the others. Those of you with Presto experience will know that Presto ships with a CLI and has no good web UI. For the average user the CLI is difficult, both in perception and in practice, so you need a good web UI to increase ease of use.
Currently, the Presto web UI you can find on GitHub is Airpal, from Airbnb, but in our experience it is not very friendly, especially its UTC time settings, and its community maintenance stalled about two years ago. So we adapted this part ourselves: I used Presto's StatementClient to build a web UI, with jQuery EasyUI on the front end. That covers the batch processing line just mentioned; the bottom line means some data is intended for immediate storage, immediate search, or brief analysis.
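A web UI built on Presto ultimately speaks Presto's HTTP protocol: the client POSTs the SQL text to `/v1/statement` and then follows `nextUri` pages until the result set is exhausted. Below is a minimal standard-library sketch of that loop; the coordinator URL, catalog, and schema values are placeholders, not the talk's actual configuration.

```python
import json
import urllib.request

def presto_headers(user, catalog="hive", schema="default"):
    """Headers required by Presto's REST protocol (values illustrative)."""
    return {
        "X-Presto-User": user,
        "X-Presto-Catalog": catalog,
        "X-Presto-Schema": schema,
    }

def run_query(coordinator, sql, user="webui"):
    """POST SQL to /v1/statement, then follow nextUri pages until exhausted."""
    req = urllib.request.Request(
        coordinator + "/v1/statement",
        data=sql.encode("utf-8"),
        headers=presto_headers(user),
    )
    page = json.load(urllib.request.urlopen(req))
    columns, rows = None, []
    while True:
        columns = page.get("columns", columns)
        rows.extend(page.get("data", []))
        if "nextUri" not in page:
            return columns, rows
        page = json.load(urllib.request.urlopen(page["nextUri"]))

# Usage sketch (requires a live coordinator):
# columns, rows = run_query("http://presto-coordinator:8080", "SELECT 1")
```

This is essentially what StatementClient does for you in Java; a front end only needs to render `columns` and `rows`.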
Elasticsearch is a search engine with a very active community. The difficulty with Elasticsearch, however, is maintaining it well, and I will talk about its maintenance pain points later.
Elasticsearch has very powerful search capability and very fast response times, but its query interface is its own Lucene-based search syntax. Lucene's syntax is geeky and concise, but most people don't want to learn it, because for analysts it would mean that years of accumulated skills count for nothing.
Elasticsearch-sql allows you to run point or range queries against Elasticsearch using SQL statements. SQL support is also on Elasticsearch's own roadmap, expected in 2017 or at the latest 2018; without it, customers would be turned away.
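The elasticsearch-sql plugin exposes an HTTP endpoint (commonly `_sql`) that accepts a SQL string as a query parameter. A hedged sketch of building such requests with the standard library; the host and index name are illustrative, and the endpoint path is an assumption based on the plugin's common usage:

```python
import urllib.parse

def es_sql_url(host, sql):
    """Build a query URL for the elasticsearch-sql plugin's _sql endpoint."""
    return host + "/_sql?sql=" + urllib.parse.quote(sql)

# Point and range queries expressed in SQL instead of Lucene syntax:
point_q = es_sql_url("http://es:9200",
                     "SELECT * FROM flights WHERE id = 42")
range_q = es_sql_url("http://es:9200",
                     "SELECT * FROM flights WHERE price BETWEEN 100 AND 200")
```

An analyst who already knows SQL can issue these without learning the Lucene query language at all.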
The web UI is the human-computer interaction part, where we run ad-hoc queries, but many programs across the department also want to issue queries, that is, an application interface. Using an SOA architecture, we developed a BigQuery API, through which data can be fetched or analyzed via a RESTful interface. The service automatically determines whether a query goes to ES or to Presto.
In many companies, data analysis also needs reports, that is, a good dashboard.
2. ETL Pipeline — Gobblin
Here are some of the details of ETL; a quick look at this diagram. Why not use Spark or streaming directly for ETL? The most common problem is small files, which then need to be cleaned up and merged, and that is very troublesome. If Zeus is used for scheduling and a suitable number of partitions is set, each Map Task writes a block as full as possible, typically 64 MB or 128 MB. Besides size, the choice of storage format should also be considered.
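The idea of sizing partitions so that each Map Task writes one nearly full HDFS block can be sketched as a small helper. This is an illustration of the arithmetic only; the actual Zeus scheduling configuration is not shown in the talk.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # common HDFS block size; 64 MB is also used

def num_partitions(total_bytes, block_size=BLOCK_SIZE):
    """Choose a partition count so each task writes roughly one full block,
    avoiding the many-small-files problem of naive streaming ETL."""
    return max(1, math.ceil(total_bytes / block_size))

# e.g. a 10 GB batch maps to 80 partitions of ~128 MB each:
parts = num_partitions(10 * 1024**3)
```

Keeping files at block size means the NameNode tracks far fewer objects and downstream scans read full blocks instead of thousands of fragments.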
From our current selection, we recommend the ORC file format, because it already embeds a certain level of index. The index granularity is not very fine, but in some cases it can rapidly improve retrieval by skipping blocks of data that do not match the conditions, avoiding unnecessary data transfer. Another promising, heavily promoted format is Huawei's CarbonData, which carries finer-grained and more detailed index information than ORC; they have released a 1.x version, which is relatively mature.
3. Analysis engine — Presto
This is the inner workings of Presto. Why not use Hive or Spark? Hive is like a Russian weapon: crude but absolutely stable. Stable to what extent? Stable in the sense that it is the slowest one. There is a joke: "my grades have always been very stable, because I always come in last; no one can compete with that, so I stay stable, whereas whoever comes first is not stable at all." That is Hive's characteristic: it will absolutely produce a result, but it will make you feel there is no hope in life.
Spark's name is certainly loud, but there are some problems with using it. Resource sharing is one: if you use only Spark, you will hit concurrent-query problems, and you need something like Livy to solve resource sharing. And Spark is very ambitious, trying to eat almost everything, but it is hard to do everything well when you have to cover so much ground.
Presto focuses only on data analysis, only on the SQL query level; it does one thing only. This fully embodies the Unix philosophy: do one job, and link different jobs with pipelines. And Presto itself is pipeline-based: results come out block by block. For example, if we add a condition and then LIMIT 10 or LIMIT 1, the result comes back very quickly. With Spark, the WHERE search goes through multiple stages, and all previous stages must complete before the next one runs; the LIMIT 1 is only filtered at the end.
The figures Presto publishes show the improvement: on ORC file storage it should be 5 to 10 times, even 10 to 20 times, faster. In simple terms, there is a client that submits the SQL statement; a planner and scheduler divide the SQL into different stages, with multiple tasks per stage. These tasks run on different workers, which read data from the data sources.
In other words, Presto focuses on the analysis side and stores the actual data externally, so it must figure out what is worth pulling up and what can be pushed down to the data source, rather than pulling up lots of useless data.
Analysis engine comparison – Presto vs. MapReduce
I mentioned a stage-based approach versus a pipeline-based approach. In the pipeline approach there are no pauses: processing overlaps, and it does not wait for one stage to complete before starting the next. Spark typically waits until a stage finishes and spills its data to disk; the next stage then pulls that data and continues. Pipelining means each task emits data to the next task as soon as it is done, all the way to the aggregator node.
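The difference can be shown with a toy comparison: a staged version materializes the full filter result before applying the limit, while a pipelined version stops scanning as soon as the limit is satisfied. This is an analogy using Python generators-style early exit, not Presto internals.

```python
def staged_limit(rows, pred, n):
    """Stage 1 fully materializes the filtered set; the limit applies after."""
    filtered = [r for r in rows if pred(r)]   # whole input scanned
    return filtered[:n]

def pipelined_limit(rows, pred, n):
    """Each row flows straight through the filter to the limit operator,
    so scanning stops as soon as n rows have been produced."""
    out = []
    for r in rows:
        if pred(r):
            out.append(r)
            if len(out) == n:
                break                          # early exit: LIMIT pushes back
    return out

scanned = []
def is_even(r):
    scanned.append(r)                          # trace how many rows we touch
    return r % 2 == 0

result = pipelined_limit(range(1_000_000), is_even, 1)
# result is [0], and only a single row was ever scanned
```

This is why a `WHERE ... LIMIT 1` returns almost instantly on a pipelined engine while a staged engine still pays for the full scan.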
In this process you will also see one of Presto's biggest characteristics: all computation happens in memory. Machine memory, like the human brain, is finite, so it can collapse; and when it collapses, it collapses.
MapReduce, by contrast, will restart failed tasks and finish anyway. This also suggests that Presto is ideal for interactive queries. For bulk work, such as regular reports run overnight, it would be irresponsible to hand everything to Presto; there is plenty of time, so that is better left to Hive.
4. Near real-time search — Elasticsearch
Now for the ES side of things: a near-real-time search engine, wrapped around Lucene, with very good JSON support. Elasticsearch scales horizontally, is highly available, easy to manage, has an active community, and has a dedicated commercial company behind it. Its competitor is Solr, or rather SolrCloud. SolrCloud's installation is complicated and introduces an independent third party, a heavy dependency on a ZooKeeper cluster, which makes its management complicated.
SolrCloud's development is also very active: it is now at 6.x and will move to 7.x, and it introduced SQL support in 6.x. ES and SolrCloud are siblings, and from the competition between them you can see the trend: SQL will definitely be supported.
If you do search, this diagram is very common: a primary shard on one node, with one or more replicas on other nodes. If that were all, there would not be much to say about it. The point is that each machine hosts several replicas and several different shards, and once maintenance work is carried out, the problems above are worth analyzing and studying.
ES tuning and operations
The following covers the tuning and operation of ES, at two levels.
The first level is the OS. On Linux, tuning must take into account the number of file handles, memory, and I/O scheduling. If you are interested in the kernel, you will find that production deployments basically use CFQ, because most of them run RedHat or CentOS, not Arch Linux or Gentoo.
Then there is virtual memory: dirty_ratio has a huge impact on response time. You will sometimes see heavy I/O with the CPU constantly high because of the file cache: once enough is cached, the kernel keeps writing to disk. Our solution was to lower vm.dirty_ratio from the default 20% to 10%, which was still quite high. This means that once cached dirty pages exceed 10% of system memory, the system does nothing else and concentrates on flushing the cache. vm.dirty_background_ratio means that once that lower threshold is reached, the kernel starts writing the file cache to disk in the background. OS-level tuning here is similar to database system tuning.
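Written out as a config fragment, the settings above would look like this in `/etc/sysctl.conf` (applied with `sysctl -p`); the talk only specifies dirty_ratio, so the background value here is an illustrative assumption:

```ini
# Start background flushing early (illustrative value, not from the talk)
vm.dirty_background_ratio = 5
# Hard limit lowered from the default 20%: once dirty pages exceed 10%
# of memory, block writers and concentrate on flushing the cache
vm.dirty_ratio = 10
```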
The other level of tuning is ES itself. The first point is that, within a cluster, the number of shards should be evenly distributed across nodes.
I've put up a screenshot here showing that shard counts across all nodes are fairly even; corresponding parameter adjustments can achieve this. The second point is the replica-recovery process: when a machine is added or removed, corresponding maintenance has to happen as the cluster dynamically expands and shrinks. If shard transfers run at full speed, the writes and queries of the whole cluster are greatly affected, so a certain balance must be struck between recovery and serving traffic. Those are all cluster-level concerns; next comes index-level optimization.
Index-level optimization covers the number of shards to store (should an index be stored in ten shards or five?) and the refresh frequency: refresh determines how long after data is written it becomes searchable. The longer the refresh interval, the higher the write throughput, but the longer it takes for data to become searchable. There is also the merge process: because of segmenting, merges are needed to reduce file-handle usage. Some say that because ES is schemaless, a fixed schema is unnecessary; in actual use, however, we found that without limits everyone assumes they are free, fields balloon rapidly, and an index ends up with thousands of fields, dragging down write speed.
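Those index-level knobs map directly onto Elasticsearch index settings. Here is a sketch of building an index-creation body that sets shard count, refresh interval, and a cap on mapping fields to stop the schemaless field explosion described above; the index name and values are illustrative, not our production configuration.

```python
import json

def index_settings(shards=5, replicas=1, refresh="30s", max_fields=1000):
    """Build an index-creation body: shard count, refresh interval, and
    a total-fields limit to keep schemaless mappings from ballooning."""
    return {
        "settings": {
            "number_of_shards": shards,     # fixed at creation time
            "number_of_replicas": replicas, # can be changed later
            "refresh_interval": refresh,    # longer = higher write throughput
            "index.mapping.total_fields.limit": max_fields,
        }
    }

body = json.dumps(index_settings(shards=10, refresh="60s"))
# PUT http://es:9200/flights-2017.10  with this body
```

Note the asymmetry: replicas and refresh interval can be changed on a live index, but the shard count must be chosen up front.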
The figure below is the dashboard we wrote ourselves. Many ES users will find that Marvel is no longer such a good plugin after upgrading to 5.x; it now offers a so-called basic version. Free stuff, as you all know, is cheap for a reason: compared with the 1.x version, a lot of functionality and information is missing.
So we built our own. All of the metrics can be read through the REST APIs; they just need visualizing on the front end. In this picture you can easily see each node's current write volume and queries, its indices, shard counts, and the cluster state. If the state turns red, an email notification goes out, and drilling down on the page shows detailed information for each node and index.
To summarize, tuning generally involves four dimensions: CPU, memory, disk I/O, and network. For example, upgrading from a 100 Mb network card to 1000 Mb, and from 1000 Mb to 10000 Mb, greatly improves query response speed.
This is the unified SQL query interface mentioned earlier. You can see it is very simple, even naive: you type a SQL statement above, and a result quickly appears below. Because it uses Presto as the engine behind it, many colleagues in the department use it for data queries, and current daily usage is close to 8k queries. We recently upgraded the network cards, making it even faster. The extra time for coffee and a smoke beats waiting around worrying.
5. Data Visualization — Zeppelin
For data visualization, you can learn from competitors to see what others are doing. Hue, for example, lets you enter a query statement and displays the result. What we are currently considering is how to visualize data; we are trying Zeppelin, which can connect to Presto via a JDBC interface, query the data, and turn results into report graphics with simple drag and drop.
In addition, if you want to connect Zeppelin to Spark: connecting directly to a Spark cluster gives very low resource utilization, but with a Livy server in front, using Livy for resource scheduling, resource sharing is much better. There are two competing products in this area: Livy, and Ooyala's Spark Job Server, which do essentially the same job. Zeppelin integrates the Livy server.
6. Data microservices — REST query interface
For microservices, we provide a BigQuery API, whose benefit is a unified query entry point and unified permission management. Not everyone should see all the data in a query; that easily causes problems. Some of it is real business data, unlike ordinary log data; especially for air tickets or our hotels, there is a lot of sensitive information, which requires corresponding permission management.
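A hedged sketch of what a call to such a unified REST entry point might look like. The endpoint path, payload shape, and token header here are hypothetical illustrations, not the real BigQuery API contract:

```python
import json
import urllib.request

def build_query_request(base_url, sql, token):
    """Wrap a SQL statement for a unified query endpoint; the service side
    decides whether it runs on Presto or ES, and checks permissions first."""
    payload = json.dumps({"sql": sql}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/bigquery/v1/query",        # hypothetical path
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + token,  # unified permission check
        },
    )

req = build_query_request("http://api.internal",
                          "SELECT count(*) FROM tickets", "t0k3n")
# urllib.request.urlopen(req) would execute it against a live service
```

Every caller goes through the same entry, so both auditing (logs) and authorization live in one place instead of in every client.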
With this entrance unified, permission management is more convenient, and if a problem occurs, checking the corresponding logs is enough. It also uses a unified query language that everyone is already familiar with, SQL, rather than forcing people to learn a new set of knowledge that makes existing knowledge hard to carry over; that is something everyone should try to avoid.
7. Task scheduling — Job scheduler
Zeus: https://github.com/ctripcorp/dataworks-zeus
In fact, when building a big data platform, task scheduling is indispensable. We use the Zeus system, open-sourced by Ctrip; our company's Ops team is responsible for developing and maintaining it. Tasks delivered through this platform include ETL and scheduled jobs: you can land data from Kafka into HDFS, or synchronize data from SQL Server and MySQL into HDFS. There are competing products in this space as well.
II. Data team capacity building
This part is about building our team. At present, I divide it into five different perspectives. The first is engine development, which is relatively difficult and demands strong backend skills.
The second is interactive interface design. If you only build the engine, and the engine is done but nobody actually uses it, the boss will certainly call it off. The pieces must link ring by ring into an effective chain, not stand as single points; an engine alone is meaningless, you need the whole vehicle. So there are engines, which are quite demanding, and there is interface design: how people will actually use those engines.
Once those are up and the whole thing can run, we need operations. You can gradually see a trend: whether for big data or cloud platforms, the operational requirements are relatively high. A good ops engineer should not only master the basic OS level but also have a good understanding of the backend services. Whether moving from backend service development to operations or the reverse, both require cross-learning.
Platform planning, then, is the work of an architect or someone with a higher-level vision. Such a role shoulders both architect and product manager duties, because these things are mainly for internal use and hard to separate.
The language part is a matter of opinion; I am just listing the things that we use.
That is basically what we do. All our parts fit into this one picture; it looks kind of bland, but it takes a lot of time to get right.