This is Peng Wenhua's 179th original article.

Hello, I'm Peng Wenhua. I've been writing fewer technical articles lately, and if I don't write one soon, my big-data-architect friends are going to scold me.

I once talked with a friend about how to build a big data system, starting from data collection: monitoring and collecting with Flume, then transmitting and distributing with Kafka.

My friend is a Java guy, and he said, "I'm not really familiar with Kafka. How about something else?" That stumped me. I had heard the names of several MQs, but I didn't know much about them, except that Kafka is the one used in big data environments because of its high throughput.

I didn't want to dig into whether another MQ would work on the spot; presumably it would. So I said, "Yeah, I think so."

When I got home, I studied the question carefully, and I'm sharing my findings with you here. At the end of the article there is a solid Kafka resource pack you can download directly.

What is MQ?

MQ stands for Message Queue. What does that mean? It's a queue that stores messages. Put more plainly, it's a design pattern that places messages in a queue to wait for someone to read them. Like this:

The black dots on the left are individual messages, lined up in order; that is the message queue. Every message in the queue is ultimately destined to be consumed (eaten) by the application downstream (the mouth).

The reason for having a message queue is that messages are produced and consumed at different speeds, sometimes fast and sometimes slow.

It's like when I was a kid: Mom would peel melon seeds over there, and we'd sit here eating them. Mom played the role of producer, and we played the role of consumer.

Sometimes Mom peeled quickly and sometimes slowly. Sometimes we waited right there and sometimes we went out to play. The two roles are disconnected. If you wanted every seed eaten the moment it was peeled, you'd have to pin both roles in place at the same spot, which is exhausting!

But Mom was smart: while we were out, she would peel some seeds and put the kernels in a bowl. When we got home, we could eat them one by one.

The downside is that kernels peeled too early can go stale and lose their crunch. What to do? Eat them in order: first peeled, first eaten. That keeps the problem to a minimum.

Now move this whole scene into an information system and you have a message queue. The role producing messages up front is the producer; the role using messages afterward is the consumer; in between, lined up in order, are the messages. Together they make up the three elements of a message queue: producer, consumer, and message (plus the queue service holding them).
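The melon-seed scene above can be sketched in a few lines of code. This is a minimal in-process illustration (using Python's standard `queue` and `threading` modules, not a real MQ broker): the producer drops items in the queue at its own pace, the consumer takes them out whenever it likes, and neither has to wait on the other directly; the items come out first-peeled, first-eaten.

```python
import queue
import threading
import time

def producer(q: queue.Queue, n: int) -> None:
    """Mom: peels seeds at her own pace and drops kernels in the bowl."""
    for i in range(n):
        q.put(f"kernel-{i}")   # enqueue; never waits for a consumer
        time.sleep(0.001)      # producing is sometimes slow
    q.put(None)                # sentinel: production is finished

def consumer(q: queue.Queue, eaten: list) -> None:
    """Kids: eat whenever they come by, in FIFO order."""
    while True:
        item = q.get()         # blocks until something is available
        if item is None:
            break
        eaten.append(item)

bowl = queue.Queue()           # the "message queue" decoupling the two roles
eaten = []
t1 = threading.Thread(target=producer, args=(bowl, 5))
t2 = threading.Thread(target=consumer, args=(bowl, eaten))
t1.start(); t2.start()
t1.join(); t2.join()
print(eaten)  # kernels arrive in the order they were peeled (FIFO)
```

The queue in the middle is what lets the two threads run at completely different speeds without either one blocking the other.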

What can message queues do?

The purpose of a message queue was hinted at above: decoupling producers from consumers.

Without MQ, producers and consumers have to call each other one-to-one or poll periodically, which either wastes resources or isn't timely.

With MQ, the producer works whenever there is something to produce, and the consumer doesn't have to sit and wait or keep checking back: messages line up in the queue, and each arrival triggers exactly one consumption. One message, one consumption.

This achieves both minimal resource use and maximal efficiency. The people who thought this up were brilliant: MQ is a mechanism that lets resources and efficiency, two inherently contradictory goals, be satisfied at the same time.

And MQ has another virtue: it is not afraid of backlogs. Mom can peel a whole pile of seeds while the kids eat slowly; it doesn't matter. She doesn't need to wait for them to finish before peeling more, she just sets the kernels on the clean table. That feels easy and natural. But without MQ it's like passing something from hand to hand: you can't let go until the other person takes it.

So MQ has yet another capability: extremely high message-transfer capacity, known as throughput.

Why Kafka?

There are a lot of MQs: Apache's ActiveMQ, RabbitMQ, Alibaba's open-source RocketMQ, the big-data overlord Kafka, NSQ, ZeroMQ, Beanstalkd, and many more.

Besides Alibaba, other big Chinese companies have experts who rolled their own MQs, and those are good too. Meituan built MOS, a cloud message queue based on RabbitMQ; isn't that impressive? Didi built DDMQ on top of RocketMQ, and it works just fine.

However, no matter what anyone else builds, Kafka is unbeatable in big data. It's used everywhere: for data ingestion, for data distribution, and even for real-time data storage.

I suspect Kafka itself is a bit confused: "I'm an MQ, and somehow I ended up as a database."

Which brings us to Kafka's core strength, one of its greatest tricks: super-fast data transfer. As mentioned earlier, one metric for the efficiency of a message queue is throughput, measured in TPS (Transactions Per Second), the number of transactions processed per second.
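As a back-of-the-envelope illustration of the metric (the numbers below are made up for the example, not real benchmarks), TPS is simply the message count divided by the elapsed time:

```python
def tps(messages: int, seconds: float) -> float:
    """Throughput: transactions (messages) processed per second."""
    return messages / seconds

# A hypothetical run: 3,000,000 messages moved in 2.5 seconds.
print(tps(3_000_000, 2.5))  # 1200000.0 -- "millions per second" territory
```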

How much TPS can Kafka achieve? Millions per second! But that number doesn't mean much without a comparison.

What is the TPS of Alibaba's RocketMQ? On the order of hundreds of thousands. And Apache's ActiveMQ? Tens of thousands.

See that? That's crushing the competition, and by a full order of magnitude! Simply invincible!

But that's not all. Through clever design, Kafka also squeezes out as much reliability as big data scenarios need. For example:

It has a replica concept: you can enable multiple replicas to keep data safe;

Its Log and Index files are designed around the skip-list idea, keeping lookups highly efficient;

It uses a sparse index, so when the consumer side reads by Offset, it can locate the physical storage position extremely efficiently;

Zero-copy technology cuts down on data copying between stages, speeding up both writes and reads.
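To make the sparse-index idea concrete, here is a toy sketch (my own illustration of the technique, not Kafka's actual file format): instead of indexing every message, the index stores only every Nth offset with its byte position. A lookup binary-searches the sparse index for the nearest entry at or before the target offset, then scans forward the short remaining distance.

```python
import bisect

# Toy log: 100 messages, message i starts at byte position i * 50.
log_positions = {offset: offset * 50 for offset in range(100)}

# Sparse index: keep only every 10th offset -> byte position.
index = [(off, log_positions[off]) for off in range(0, 100, 10)]
index_offsets = [off for off, _ in index]

def locate(target_offset: int) -> int:
    """Binary-search the sparse index for the greatest indexed offset
    <= target, then scan forward to the exact message."""
    i = bisect.bisect_right(index_offsets, target_offset) - 1
    offset, pos = index[i]
    while offset < target_offset:     # short linear scan within one gap
        offset += 1
        pos = log_positions[offset]   # in a real log this is a file read
    return pos

print(locate(37))  # 1850: jump to index entry (30, 1500), then scan 7 steps
```

The trade-off is classic: a tiny index that fits in memory, at the cost of a bounded linear scan (at most N - 1 steps) per lookup.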

Bottom line: Kafka is built for performance. On that front, impressive or what?

But, to be honest, Kafka also has a less friendly side: to achieve that performance, it has to sacrifice some high-reliability features.

When you choose high concurrency, high performance, and high throughput, Kafka runs some risk of data loss and duplicate consumption. In an OLAP analysis scenario this doesn't matter: by the law of large numbers, losing a few records makes no real difference, especially when you're looking at percentages.

However, in any scenario involving money, high reliability comes first. You can't have a marketplace where 100 million orders succeed but one order inexplicably vanishes; that is intolerable. As the saying goes, what makes you succeed is also what makes you fail!

Conclusion

MQ: Message Queue. A tool for decoupling the upstream and downstream of data/message production and consumption.

There are a lot of MQs out there: RocketMQ, ActiveMQ, RabbitMQ, Kafka, and more.

Kafka dominates the big data ecosystem thanks to performance superior to every other MQ. Its defining trait: it's fast!

Its core design ideas are: sequential reads and writes, skip lists, sparse indexes, and zero copy.

To achieve that ultra-high performance, Kafka has to sacrifice some reliability. It is therefore only suitable for scenarios that can tolerate a tiny probability of data duplication or loss.

For high-reliability scenarios such as transactions, tools like RocketMQ are the way to go.

Kafka: Big Data Architect. You can also add me on WeChat (Shirenpengwh) and we can chat privately.

The resource pack is collected from the Internet for research use only, not commercial use. If anything is improper, please contact this account to have it removed.

You may also enjoy the following articles:

The secret of super-fast queries over massive data: the skip-list idea

Three semantics of real-time task processing in Spark Streaming

With 1 billion users, how do you compute a 7-consecutive-day user tag?

How is ID-Mapping, the core technology behind One ID, implemented?

An architect walks you through the MapReduce process

Design idea appreciation: the Snowflake distributed ID generation algorithm

I need your retweets to satisfy my vanity a little