As a crawler engineer, Kafka to you is simply a message queue: all you need to know is how to write data into it and how to read data out of it.
Keep in mind that Kafka is easy to use, but a Kafka cluster is difficult to set up, maintain, and tune. A cluster needs someone dedicated to maintaining it, so don’t assume you can do it all yourself easily.
In this article, and the next few on Kafka, we are targeting crawler engineers and readers who simply need to use Kafka. The deeper details and core principles of Kafka are beyond our scope. For the sake of explanation, some Kafka terms are used in this article with less precise but useful analogies. If you need to explain these terms in an interview, read Kafka’s official documentation.
Today’s topic is how Kafka supports continuous consumption, resuming from where you left off, and parallel consumption by multiple processes within a single application, while also letting several independent applications consume the same data without affecting each other.
A Kafka cluster can contain several different queues, which we call Topics. Suppose one of these queues looks like this:
Messages go in from the right and come out from the left. If this were a Redis list, reading would pop a message off, and the queue would look like this:
Message 1 on the far left is gone. So even if the program closes immediately after consuming message 1 and then restarts, it will continue with message 2 and will not consume message 1 twice.
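For comparison, here is a minimal sketch of that destructive Redis read using the redis-py library; the queue name `crawler_queue` and the connection details are just examples:

```python
import redis

# Connect to a local Redis instance (host/port are assumptions).
client = redis.Redis(host='localhost', port=6379)

# Simulate the queue in the figure: push messages 1-5.
for i in range(1, 6):
    client.rpush('crawler_queue', f'message {i}')

# lpop removes the leftmost message permanently; no other
# program can ever read 'message 1' again after this call.
msg = client.lpop('crawler_queue')
print(msg)  # b'message 1'
```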
But what if there are two programs? Program 1 reads each message and saves it to a database; program 2 reads each message and checks it for keywords. In that case, message 1 should be consumed by program 1 and also by program 2. The scheme above clearly cannot do this: once program 1 pops message 1, program 2 can never get it.
So in Kafka, the messages stay in the queue, and each program instead keeps a separate marker recording which message it has consumed up to, as shown in the figure below.
When program 1 reads the next message, Kafka moves program 1’s marker one position to the right and returns the new value. Moving the marker and returning the message happen together as one atomic operation, so there is no problem of reading the same message twice.
Program 1 and program 2 use different markers, so neither cares where the other’s marker points.
When you add a program 3, Kafka simply adds another marker, which is likewise unaffected by the first two.
This allows multiple applications to read Kafka without affecting each other.
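To make the marker idea concrete, here is a toy pure-Python sketch, not Kafka’s actual implementation, of one shared queue with an independent position marker per program:

```python
# Toy model: the messages stay in place; each program only
# advances its own marker. A simplification for intuition only.
queue = ['message 1', 'message 2', 'message 3', 'message 4']
markers = {'program 1': 0, 'program 2': 0, 'program 3': 0}

def read_next(program):
    """Return the next message for this program and advance
    only that program's marker; other markers are untouched."""
    position = markers[program]
    if position >= len(queue):
        return None  # nothing new for this program yet
    markers[program] = position + 1
    return queue[position]

print(read_next('program 1'))  # message 1
print(read_next('program 1'))  # message 2
print(read_next('program 2'))  # message 1 -- unaffected by program 1
```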
Now, suppose program 1 consumes too slowly, so you run three copies of it at the same time. Because moving the marker and returning the message are one atomic operation, even though the three copies appear to read Kafka simultaneously, Kafka internally queues their requests so that nothing is returned twice and nothing is missed.
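With the kafka-python library, “running program 1 three times” is just starting the same consumer script in three terminals with the same group_id. The topic name, broker address, and group name below are assumptions:

```python
from kafka import KafkaConsumer

# Start this exact script three times: Kafka shares the topic's
# partitions among the instances, and each message is delivered
# to only one of them.
consumer = KafkaConsumer(
    'crawler_topic',                     # example topic name
    bootstrap_servers='localhost:9092',  # example broker address
    group_id='save_to_database',         # same group => shared work
    auto_offset_reset='earliest',
)

for message in consumer:
    print(message.value)  # handle one message, e.g. save it to the database
```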
If you look at Kafka tutorials online, you’ll see something called an Offset, which is essentially the marker that records each program’s current position.
You’ll also see a keyword called Group (the consumer group), which corresponds to programs 1, 2, and 3 in this article.
If multiple applications consume the same queue using different Groups, the data they read does not interfere with one another.
Multiple processes consuming the same queue in the same Group behave like LPOP operations on a Redis list.
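A sketch of the different-Group case with kafka-python: the only change from the consumer above is the group_id, so this program receives every message regardless of what the `save_to_database` group has consumed (names are again examples):

```python
from kafka import KafkaConsumer

# Program 2: same topic as program 1, but a different group_id,
# so it keeps its own offsets and sees every message too.
consumer = KafkaConsumer(
    'crawler_topic',
    bootstrap_servers='localhost:9092',
    group_id='check_keywords',   # different group than program 1
    auto_offset_reset='earliest',
)

for message in consumer:
    if b'keyword' in message.value:   # example keyword check
        print('hit:', message.value)
```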
Finally, if you read about Kafka on the Internet, you will surely come across the term Partition, often translated as “shard”. And you may find that you can’t make sense of it at all.
Never mind, forget it for now. All you need to know is that the number of partitions a Topic has caps how many processes can read the same Group at the same time. If a Topic has 3 partitions, at most 3 processes can read the same Group simultaneously; if it has 5 partitions, at most 5.
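You can check how many partitions a Topic has from kafka-python itself; the topic name and address are examples:

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(bootstrap_servers='localhost:9092')

# partitions_for_topic returns the set of partition ids,
# e.g. {0, 1, 2}, or None if the topic does not exist.
partitions = consumer.partitions_for_topic('crawler_topic')
if partitions is not None:
    print('partition count:', len(partitions))
    # With 3 partitions, at most 3 processes in the same Group
    # will receive data; extra processes just sit idle.
```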
In the next article, let’s read and write Kafka in Python. It only takes a few lines of code.