This article was first published in the Tencent Cloud + community and may not be reproduced without permission.
Hello everyone, let me give you a quick introduction. My name is Rao Jun, and I am one of the co-founders of the Silicon Valley startup Confluent. All three of our founders were involved in the development of Kafka at LinkedIn from its very beginning. Our company was founded in 2014, and its goal is to help all kinds of enterprises do data streaming based on Kafka.
Before we get started, I'd like to do a quick survey: who in this room has used Kafka? About 80 percent of you. OK, thank you. What I want to share with you today is the development of our project Kafka: how it was created, and some of the lessons we learned along the way. Kafka goes back to 2010. That year I joined LinkedIn, which many of you are probably familiar with as a social platform for talent and opportunity. In 2010 LinkedIn was taking shape and entering a phase of rapid growth: when I joined in 2010 it had about 600 employees, and when I left in 2014 it had about 6,000. In just four years the organization grew very quickly.

That rapid growth is inseparable from data. Like many Internet companies, data is at LinkedIn's core. LinkedIn has its own users; it communicates with them through its services, and users directly or indirectly provide their data to LinkedIn. By doing all kinds of research and analysis on that data, LinkedIn can extract a lot of new knowledge and insight, and this information is fed back into the product; a more effective product attracts more users to the platform. So if data is handled well, it forms a very good virtuous cycle: more users produce more data, more data enables better analysis, better analysis produces a better product, and a better product attracts more users.
Diversity of data sources
From the data perspective, LinkedIn's data is very diverse. The most common kind, which we all know, is transactional data, and it generally lives in databases. From LinkedIn's perspective, transactional data is fairly simple: the jobs and schools on your profile, and the connections between you and other members, are all transactional data. But there is also a lot of non-transactional data, much of it behavioral data about users: as a user, which connections you click on, which search terms you type in. That is actually very valuable information. On the internal operations side, there are also many operational service metrics and application logs, and finally a lot of information from our smartphones, which is also very valuable. So in terms of value, non-transactional data is worth no less than transactional data; but in terms of traffic, non-transactional data may be a hundred times or even orders of magnitude larger than the transactional data sources. Here's a quick example of how LinkedIn uses this kind of data to offer a service.
It is called People You May Know, PYMK for short. What this service does is recommend to a LinkedIn user other LinkedIn members who are not yet connected to them. Internally it combines 30 or 40 different kinds of signals to produce the final recommendation. Some signals are simple: if the two of us went to the same school or worked at the same company, that is a very strong signal that we might want to connect. But there are also indirect signals: if two people, a and b, have no obvious direct relationship, yet within a very short time many people viewed both of their profiles, that suggests some hidden connection that makes them worth linking together. So in the early days, when people used this service, many of the recommendations felt amazing: at first glance you might wonder how it could possibly recommend such a person to you, but if you think about it, you find there are strong reasons behind it. Similarly, many other services at LinkedIn rely on all kinds of real-time data. But back in 2010, we had a big problem at LinkedIn: data integration was very incomplete. This figure gives a brief picture of the state at the time. On one side are the various data sources. LinkedIn started as a classic Internet company, with all its data in databases; as the business grew we added a system to collect data on user behavior, a lot of data sat in local files, and other information was stored in application logs and operational monitoring data.
Downstream we see all kinds of consumers. LinkedIn started with a data warehouse, and as time went on we added more and more real-time microservices that, much like those batch processes, capture more or less the same information from these different data sources. The recommendation engine we just talked about is one such service, and we have many of them. There is the social graph service, which can analyze how two nodes, say two LinkedIn members, are connected and which links between them are the strongest; there is also real-time search. So the number of consumers grows steadily, and many of them are quite real-time: the delay from the moment data is generated to the moment it reaches the system that uses it is a few seconds or even less.
Point-to-point data integration
So what did we do? To get data from a data source to a consumer, we did point-to-point data integration: we knew we wanted to get some data into the data warehouse, so we wrote a script or a program to do it. A few days later, we realized there were other systems that needed to read that data, so we did the same thing again and wrote another script. We kept doing this for quite a while, but after I had written five or six pipelines like that, it was clearly a very inefficient way to work. What are the main problems? The first is a cross-product problem between the data sources and the consumers: every time a data source is added, it has to be connected to every consumer, and every time a consumer is added, it has to be connected directly to every data source. The second problem is that when building these point-to-point data flows, we had to repeat a lot of the same work for each flow, and we never had enough time to make every pipeline 100 percent right for every data source. So we felt the architecture was far from ideal.
An ideal architecture
So what should be improved? We thought that if we had an architecture with a centralized logging system in the middle that buffered data from all the sources, we could simplify the framework a lot. If you are a data source, you don't need to know about all the consumers; the only thing you have to do is send your data to the central log. Likewise, if you are a consumer, you don't need to know about all the data sources; you just subscribe to the messages you want from the central log. So we reduced the cross-product problem to a linear one. The key architectural question is: what kind of system can serve as this central log? We did not want to start from scratch with a brand-new system; this seemed like a very common enterprise problem, and there should already be a solution for it. If you think about it, a central logging system looks, from the interface point of view, a lot like a traditional messaging system: messaging systems separate the production side from the consumption side, and they are very real-time systems. So we thought, why not try some existing messaging systems? There were open-source messaging systems at the time, and some enterprise-level ones as well, but we found they did not work very well. There are many reasons, but the most important is that these traditional messaging systems were not designed for this kind of use; the biggest problem is throughput.
The first version of Kafka: a high-throughput publish-subscribe messaging system
Much of this early messaging software was designed by its authors for database data, for consuming transactional data; you can't imagine pushing large volumes of non-transactional data, like user behavior logs or monitoring data, through such a traditional messaging system. So in that situation we felt the problem could not be solved with anything ready-made, and we decided to build something ourselves. Around 2010 we built the first version of Kafka. Its positioning was very simple: we wanted it to be a high-throughput publish-subscribe messaging system, and high throughput was our most important goal.
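To make the publish-subscribe idea concrete, here is a minimal producer sketch using today's Java client API (the 2010 version had a different interface); the broker address, topic name, and event payload are made up for illustration.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class ActivityEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The producer only needs to know the topic name; it never needs to know
            // which systems will eventually consume this event.
            producer.send(new ProducerRecord<>("user-activity", "user-42", "clicked:profile/123"));
        }
    }
}
```

On the other side, a consumer simply subscribes to the same topic; a sketch of that side appears in the log-storage section below.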
Distributed architecture
Now, let's talk a little about how we achieved high throughput. The first thing we did in the first version of Kafka was make it a distributed system. People familiar with Kafka know it has three layers: the production side, the server layer in the middle, and the consumer side. The server side usually has one or more nodes. The basic abstraction is called a topic, and a topic can be divided into partitions, with each partition placed on the hard disk of a node. So if you want to increase throughput, the simplest method is to add machines to the cluster: you get more resources, both in bandwidth and in storage, to absorb more data. We also used a multi-threaded design on both the production side and the consumer side, so you can have tens of thousands of producer and consumer threads writing to or reading from a Kafka cluster. This distributed design is something many of the older messaging systems lacked.
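As a rough illustration of topics and partitions, here is how a partitioned topic can be created with today's AdminClient API (which did not exist in the first version); the topic name and partition count are arbitrary choices for illustration.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;

public class CreatePartitionedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 12 partitions spread across the brokers; each partition lives on one
            // broker's disk, so adding brokers and partitions adds both bandwidth
            // and storage capacity.
            NewTopic topic = new NewTopic("user-activity", 12, (short) 1);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```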
Simple and practical log storage
The second thing we did was use a log as the storage structure. This is a very simple but very effective structure: each partition of a topic has a corresponding log, and the log is stored on the node's hard disk. Each small square in this structure corresponds to a message, and each message has an offset, a number that increases continuously. If you are a producer, all you do is append your new message to the end of the log, and it gets a new, larger offset. Messages are delivered to consumers in the order they were written: if you write them in a certain order, they are sent to the consumer in that same order. The advantage is that the overhead on the consumer side is very small, because the consumer does not need to remember every message it has consumed, only the offset of the last one. Knowing that, it can continue consuming from there, because all messages are delivered in order, so every message before that offset must already have been consumed.
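Here is the matching consumer sketch, again using today's Java client; the group id and topic name are made up, and manual offset commits are used to make the "remember only the last offset" idea visible.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ActivityEventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "activity-readers");
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("user-activity"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // The offset is the message's position in the partition log;
                    // the consumer only tracks the last one it has processed.
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
                // Persist the position so consumption can resume from here after a restart.
                consumer.commitSync();
            }
        }
    }
}
```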
Two optimizations
This design has several benefits. The first is that the access pattern is very easy to optimize: writes are linear appends, and reads are also linear scans starting from some position, which lets the operating system and file system optimize performance. The second is that we can support multiple consumers at the same time: at any moment you can have one or more consumers, one consuming from one position and another consuming again from a different position, but no matter how many consumers there are, the data is written only once, so performance does not depend on how many times the data is consumed. Another, less obvious point is that because the log lives on the hard disk, we can serve both real-time consumers and non-real-time batch consumers. All the data is on disk and we can rely on a very large cache, so whether you are real-time or not, consumers are served through the same path; we don't need different optimizations, we simply let the operating system decide which data can be served from memory and which has to be read from disk. The design of the framework is the same either way.

Finally, to reach high throughput we made two small optimizations, and the two are related. The first is batching, at all three levels. As we just said, messages are stored in an on-disk log, but each write to disk carries a certain overhead, so we don't write every message to disk immediately; we usually wait a while, until enough messages have accumulated, and flush them to disk together, so the same overhead is shared by many messages. The same applies on the production side: instead of sending each message to the server as its own remote request right away, we wait a little, hoping for more messages, and package them up to the server as one batch. Related to batching is data compression. We compress a batch of data at a time, and the compression is end to end: if you enable compression, the producer first waits until a batch is complete and then compresses the whole batch together, which tends to give a better compression ratio than compressing each message on its own, since different messages often share some redundancy. The compressed data is sent from the producer to the server, the server writes it to the log still compressed and sends it to the consumer in compressed form, and it is only decompressed when the consumer consumes it. So with compression enabled we save not only network overhead but also storage overhead. Both of these are very effective ways to achieve high throughput.

The first version of Kafka took about six months to build, but it took a bit more time to roll it out across LinkedIn's data pipelines, because LinkedIn has a lot of microservices; we finished that around the end of 2011, and these were roughly the numbers at the time.
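Both optimizations are exposed as producer settings in today's client. A minimal sketch follows, with the batch size, linger time, and compression codec chosen arbitrarily for illustration.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class BatchingCompressingProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        // Batching: wait up to 20 ms or until 64 KB of records accumulate, so the
        // per-request and per-disk-write overhead is shared by many messages.
        props.put(ProducerConfig.LINGER_MS_CONFIG, 20);
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);

        // End-to-end compression: the whole batch is compressed once on the producer,
        // stored compressed on the broker, and only decompressed by the consumer.
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "gzip");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 10_000; i++) {
                producer.send(new ProducerRecord<>("user-activity", "user-" + i, "page-view"));
            }
        }
    }
}
```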
On the production side we had hundreds of thousands of messages being produced at the time, and millions of messages being consumed; the volume was already considerable. LinkedIn then had hundreds of microservices and tens of thousands of microservice threads. More importantly, once this was in place we achieved a kind of democratization of data inside the company. Before Kafka, if you were an engineer at LinkedIn, or a product manager, or a data analyst, and you wanted to build some new design or application, one of the hardest problems was that you didn't know what interface to use to read the data, or whether the data was complete. With Kafka, that part of the problem became much simpler, freeing engineers to innovate. With that success, and the feeling that Kafka was a very useful system, we continued development, and the second phase was adding high-availability support.
Kafka version 2: High availability
In the first version, each message was stored on only one node; if that node went offline, the data was unavailable, and if the machine was permanently damaged, the data was lost. So in the second version we added high availability, and we did it by adding a multi-replica mechanism. If there are multiple nodes in the cluster, we can store a message redundantly on multiple replicas; in the figure, the same color marks the replicas of the same partition. In that situation, if you take one machine offline, another node that holds a replica of the same data can continue to provide the same service. With the second version we could widen the range of data Kafka could take in: not only non-transactional data but also transactional data could be collected through our system.
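In today's API, the replica count is set per topic. A minimal sketch, with the topic name, partition count, and replication factor chosen for illustration:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;

public class ReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Replication factor 3: every partition is stored on three brokers.
            // If one broker goes offline, another replica keeps serving the same data.
            NewTopic topic = new NewTopic("transactions", 6, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```

A producer that also sets acks=all will wait for the in-sync replicas to acknowledge each write, which is what makes it reasonable to send transactional data through the system.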
We also did something else in 2011: we donated the Kafka project to the Apache Software Foundation. When we did that, we thought the system was very useful at least inside LinkedIn, and we wanted to see whether it would be useful to other companies as well; other Internet companies would probably find it useful too. What I did not expect was that, once open-sourced, it would be used so widely, not just by Internet companies but across the industry as a whole. As long as your company has some real-time data that needs to be collected, you can use it. A big reason is that all kinds of traditional enterprises are also going through a process of software digitization. In some traditional industries, a company's strength used to lie in manufacturing or in retail outlets, but now it also has to be strong in software and data. Kafka provides a very effective channel for many enterprises to integrate data in real time. We worked on Kafka for several more years and knew it would keep growing in scale, so we wanted to work on Kafka full time; that is why we left LinkedIn and founded Confluent in 2014. The purpose of the company is to make it convenient for all kinds of enterprises to use Kafka more widely. At present our company has more than 200 people.
The development of Kafka
So here is an overview of how things have developed since 2014. After that, we did two big things for Kafka. The first was about enterprise functionality, which is about data integration; the second is about stream processing. Let me talk briefly about both. On the enterprise side, the big piece relates to the data-integration problem we talked about at the beginning. In many companies, especially ones that have been around for a while, you will find that data is scattered across many systems. As we just said, Kafka makes it very convenient to extract that data, but you don't want every company to build its own one-off pipeline for every system it needs to read from. So we designed this in two parts. The first part is a platform that factors the common work out into a framework: distributing the data, parallel processing, failure detection, and rebalancing the data after a failure is detected. These common things are handled by the framework, and the framework exposes an open interface that can be used to design and implement connectors for all kinds of data sources. On the data-delivery side, for example when you want to send copies of the data to a search system, you can do something similar. That was the first thing we did.
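This open connector interface is what is now known as Kafka Connect. As a rough sketch of the idea, here is a minimal source task built on the Connect API; the FileLineSourceTask class, its file and topic settings, and the simplified offset bookkeeping are hypothetical, and the companion SourceConnector class and real error handling are omitted.

```java
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical connector task that tails a local file and publishes each line to Kafka.
// The Connect framework handles distribution, parallelism, failure detection and rebalancing.
public class FileLineSourceTask extends SourceTask {
    private BufferedReader reader;
    private String topic;

    @Override
    public void start(Map<String, String> props) {
        topic = props.get("topic");
        try {
            reader = new BufferedReader(new FileReader(props.get("file")));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public List<SourceRecord> poll() throws InterruptedException {
        try {
            String line = reader.readLine();
            if (line == null) {
                Thread.sleep(100);               // nothing new yet
                return Collections.emptyList();
            }
            return Collections.singletonList(new SourceRecord(
                    Collections.singletonMap("file", "input"),   // source partition
                    Collections.singletonMap("position", 0L),    // source offset (simplified)
                    topic, Schema.STRING_SCHEMA, line));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void stop() {
        try { reader.close(); } catch (Exception ignored) { }
    }

    @Override
    public String version() {
        return "0.1";
    }
}
```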
The second piece is about stream processing. If you have a system like Kafka that collects a lot of data in real time, the initial use is as a data transport platform. But we think that over time Kafka can be not only a transport platform but also a platform for sharing and collaborating on real-time data. Once you have real-time data, there are things you often want to do with it. You may want to transform a stream from one format to another. You may want to enrich the data: say you have a stream of events that carries only a user code and no details about the user, but you also have a database with much richer user information; if you can join the two, the stream becomes richer and you can do more effective processing. You may also want to compute some real-time aggregations over the data. In our stream-processing work we wanted to make all of this simpler.
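That stream-processing layer is what became Kafka Streams. As a sketch of the enrichment case just described, assuming hypothetical topics named page-clicks, user-profiles, and enriched-clicks with string keys and values, a stream-table join might look like this:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

import java.util.Properties;

public class ClickEnrichment {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "click-enrichment");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Stream of click events keyed by user id; the value holds only the user code.
        KStream<String, String> clicks = builder.stream("page-clicks");

        // Changelog of user profiles keyed by the same user id.
        KTable<String, String> profiles = builder.table("user-profiles");

        // Join the click stream with the profile table to enrich each event in real time.
        KStream<String, String> enriched =
                clicks.join(profiles, (click, profile) -> click + " | " + profile);

        enriched.to("enriched-clicks");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

The click stream supplies the events, the table supplies the slowly changing user details, and the join attaches a profile to every click as it flows through.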
Kafka’s future
Looking ahead, I think Kafka will not just be a platform for collecting and transporting real-time data; over time it is more likely to become a platform for processing, exchanging, and sharing data streams, so we will do more work in that direction. As applications become more widespread, we think many of them will become more and more real-time, and on that basis a very strong ecosystem can grow around Kafka.
Finally, I want to share a short story about one of our users, a bank in North America. It is a relatively traditional bank with a history of several decades, and its long-standing problem is that its data is very scattered. If you are a customer of this bank, you might have a bank account, a loan, an insurance policy, and a credit card, and all of that information about the same customer was kept completely separate because it belonged to different business divisions. If you are a salesperson at the bank, your problem is that you don't have a complete picture of the customer. The bank built a project around Kafka that collects the customer information from all these different data sources in real time and pushes it to their tens of thousands of salespeople, so that when they talk to a customer they have effective, real-time information and can make better recommendations. The project was very successful.
This article has been authorized by the author for publication in the Tencent Cloud + community. Original link: https://cloud.tencent.com/developer/article/1114675?fromSource=waitui