What is a Message Queue?
In high-concurrency scenarios, incoming requests pile up faster than they can be processed synchronously. For example, when a large number of insert and update requests hit a database at the same time, rows or tables get locked, and the flood of requests ends in too-many-connections or timeout exceptions. Such scenarios therefore need a buffering mechanism, and a message queue fills that role well: by processing requests asynchronously, it shaves the peak load and reduces the pressure on the system.
A message queue can be understood simply as a first-in, first-out (FIFO) queue holding the data to be transmitted. It is mainly used for communication between different processes or threads that must handle a stream of incoming requests. Message queues use an asynchronous communication model: the sender and receiver of a message do not have to interact with the queue at the same time, and a message stays in the queue until the receiver reads it.
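To make the FIFO, asynchronous idea concrete, here is a minimal in-process sketch using Java's `ArrayBlockingQueue`. Real message queues add persistence, networking, and delivery guarantees on top of this same pattern; the names and counts here are illustrative.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class MiniQueueDemo {
    public static void main(String[] args) throws InterruptedException {
        // A bounded FIFO buffer: the sender and receiver never talk directly.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);

        // Sender: enqueues requests and moves on immediately (asynchronous).
        Thread producer = new Thread(() -> {
            for (int i = 0; i < 5; i++) {
                try {
                    queue.put("request-" + i); // blocks only if the buffer is full
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });

        // Receiver: drains the queue at its own pace, smoothing load peaks.
        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < 5; i++) {
                    System.out.println("processed " + queue.take());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
    }
}
```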
Message queues are used for:
- Application decoupling (a synchronous invocation is a strong dependency; an asynchronous one is a weak dependency);
- Peak shaving and valley filling (smoothing out traffic bursts);
- Reduced response time;
- Improved throughput (e.g. Kafka's throughput is 30-40 times that of MySQL, and Kafka scales out far more easily than MySQL).
What is Kafka?
Kafka is a distributed messaging system originally developed at LinkedIn. It can be deployed on a single server or as a cluster spanning multiple servers, and it provides publish/subscribe functionality: users send data to a Kafka cluster and read data back from it. Kafka is written in Scala and is widely used for its horizontal scalability and high throughput.
More and more open-source distributed processing systems, such as Storm, Spark, and Flink, support integration with Kafka. We use Kafka in our real-time data processing platform, and companies of many kinds now run it as a data pipeline and messaging backbone.
Why Kafka?
Compared with Apache ActiveMQ and RabbitMQ, Kafka's main advantage is performance. The main reasons for Kafka's strong performance are the following:
1. Sequential I/O
First of all, Kafka uses sequential I/O and tries to avoid random disk access; the former is orders of magnitude faster than the latter. For example, on a disk array of six 7200 RPM SATA disks, the sequential write speed can reach 300 MB/s, while the random write speed is only about 50 KB/s.
With a gap that large, it is no wonder Kafka flies. Kafka's commit log is written by appending to partitions: within a partition there are no deletes or in-place updates, so random writes are avoided, and data is likewise read from partitions sequentially, avoiding random reads.
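As a plain illustration of the append-only pattern (not Kafka's actual storage code), a log segment can be written with an appending stream, so every write lands at the end of the file and the disk never seeks backwards; the file name is made up:

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class AppendOnlyLog {
    public static void main(String[] args) throws IOException {
        // Opening the file in append mode means every write goes to the end:
        // purely sequential I/O, with no in-place updates or deletes.
        try (FileOutputStream log = new FileOutputStream("segment.log", true)) {
            for (int i = 0; i < 1000; i++) {
                log.write(("message-" + i + "\n").getBytes(StandardCharsets.UTF_8));
            }
        }
    }
}
```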
The obvious question: even sequential I/O is slower than memory, so why doesn't Kafka keep its data in memory? The first reason is cost: memory is fast but far more expensive than disk, and Kafka, as a member of the big data ecosystem, was built to store huge volumes of data, for which memory is simply impractical.
In addition, Kafka achieves high availability by keeping multiple replicas; a message may be copied three or five times, which multiplies the storage footprint and makes in-memory storage even less feasible. Finally, Kafka runs on the JVM, and a heap crowded with objects leads to long garbage-collection pauses that would hurt overall system performance.
2. Memory-mapped files
Memory-mapped files map the contents of a file on disk into memory; the application writes into that memory, and the operating system flushes the data to disk later. Writing data is therefore almost as fast as writing to memory, which is another reason Kafka flies.
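Java exposes this mechanism through NIO's `MappedByteBuffer` (Kafka's index files are memory-mapped this way); a minimal sketch, with a made-up file name and size:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapDemo {
    public static void main(String[] args) throws IOException {
        try (FileChannel ch = FileChannel.open(Path.of("mmap.data"),
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // Map 4 KB of the file into memory; writes land in the page cache,
            // and the OS flushes them to disk later.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            buf.put("hello kafka".getBytes(StandardCharsets.UTF_8));
            buf.force(); // optionally force an explicit flush to disk
        }
    }
}
```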
3. Zero copy
Kafka uses zero-copy technology: data is copied directly from the kernel-space read buffer to the kernel-space socket buffer and then to the NIC buffer, avoiding round trips between kernel space and user space.
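In Java this is exposed as `FileChannel.transferTo`, which is the call Kafka relies on when serving log data to consumers. A sketch, assuming something is listening on a hypothetical local port:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopySend {
    public static void main(String[] args) throws IOException {
        try (FileChannel file = FileChannel.open(Path.of("segment.log"), StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9000))) {
            long position = 0;
            long remaining = file.size();
            // transferTo hands bytes from the page cache to the socket
            // without copying them through user space.
            while (remaining > 0) {
                long sent = file.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}
```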
4. Application layer optimization
In addition to leveraging these low-level techniques, Kafka also improves performance at the application level, most visibly through batching. When writing data to Kafka, batch writes can be enabled to avoid the latency and bandwidth overhead of sending individual messages over the network one at a time. Given a network bandwidth of 10 MB/s, transferring 10 MB in one batch is obviously much faster than transferring a 1 KB message 10,000 times.
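With the official Java client, batching is controlled by the `batch.size` and `linger.ms` producer settings; a minimal sketch (the broker address, topic name, and tuning values are placeholders):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class BatchingProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("batch.size", "65536"); // accumulate up to 64 KB per partition batch
        props.put("linger.ms", "10");     // wait up to 10 ms for a batch to fill

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // send() is asynchronous; records are grouped into batches per partition.
            for (int i = 0; i < 1000; i++) {
                producer.send(new ProducerRecord<>("demo-topic", "key-" + i, "value-" + i));
            }
        }
    }
}
```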
Kafka basic concept and architecture diagram
1. Producer:
Message producers are the clients that send messages to a Kafka broker. After producing a message, a producer must send it to a specific destination (a partition of a topic). The producer can choose the target partition using a specified partition-selection algorithm or at random. For example, a partition index can be computed by hashing the message's key and taking the result modulo the number of partitions: `int partition = Math.abs(key.hashCode()) % numPartitions;`
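That hash-mod rule can be packaged as a custom partitioner for the official Java client, registered via the producer's `partitioner.class` setting. A sketch, with a simplistic fallback for null keys:

```java
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

public class HashModPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (key == null) {
            return 0; // simplistic fallback; a real partitioner would round-robin
        }
        // Same rule as in the text: hash the key, then take it modulo the
        // partition count. Masking keeps the hash non-negative even for
        // Integer.MIN_VALUE, where Math.abs would fail.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}
```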
2. Consumer:
Message consumers: the clients that pull messages from Kafka brokers.
3. Topic:
You can think of a topic as a queue. A topic usually holds one category of message, and each topic has one or more subscribers, the consumers of its messages.
4. Consumer Group (CG):
Consumer groups are the mechanism Kafka uses to support both broadcasting a topic's messages (to all consumers) and unicasting them (to exactly one consumer). A topic can have multiple CGs. The topic's messages are delivered to every CG (conceptually copied, not physically), but within a CG each partition delivers its messages to only one consumer.
To broadcast, give each consumer its own CG; to unicast, put all consumers in the same CG. CGs also let consumers regroup freely, without the producer having to send the same messages to multiple topics.
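The group semantics are driven entirely by the `group.id` setting of the official Java client: instances sharing a group id split the partitions among themselves (unicast), while instances with distinct group ids each receive every message (broadcast). A sketch, with placeholder addresses and names:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        // Same group.id on several instances -> partitions are divided among
        // them (unicast). A unique group.id per instance -> every instance
        // sees every message (broadcast).
        props.put("group.id", "demo-group");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("demo-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```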
5. Broker:
A Kafka server is a broker; a cluster consists of multiple brokers, and a single broker can host multiple topics.
6. Partition:
For scalability, a very large topic can be spread across multiple brokers (servers): a topic is divided into multiple partitions, each of which is an ordered queue, and each message within a partition is assigned an ordered id, its offset.
Kafka guarantees delivery order to consumers only within a single partition, not across a topic as a whole. Multiple replicas of each partition are distributed among different brokers. As shown in the figure below, a topic is divided here into four partitions:
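For reference, a topic like the four-partition example can be created programmatically with Kafka's admin client; the topic name, partition and replica counts, and broker address here are illustrative:

```java
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicDemo {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // 4 partitions spread over the brokers, 2 replicas of each partition.
            NewTopic topic = new NewTopic("demo-topic", 4, (short) 2);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```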
7. Offset:
Kafka's log segment files on disk are named after the offset of the first message they contain, which makes lookup straightforward: to find the message at offset 2049, you simply locate the segment file whose base offset is the largest one not exceeding 2049 (for example, 00000000000000002048.log). The first segment is, naturally, 00000000000000000000.log.
At the storage layer, each partition is an append-only log file. Messages published to a partition are appended to the end of its log file and written to disk sequentially (sequential disk writes can be more efficient than random writes to memory). The position of each message in the log file is its offset, a long value that uniquely identifies the message. The only metadata kept per consumer is its offset, and it is fully controlled by the consumer, so a consumer can consume records in any order it likes.
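Because the consumer owns its offset, it can reposition itself freely with `seek` in the official Java client; a sketch, with illustrative topic, partition, and offset values:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class SeekingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("demo-topic", 0);
            consumer.assign(Collections.singleton(tp)); // manual assignment; no consumer group needed
            consumer.seek(tp, 2049); // jump straight to offset 2049 and re-read from there
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                System.out.println(record.offset() + ": " + record.value());
            }
        }
    }
}
```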
Some other things to watch out for:
(1) One or more topics can be created on a single broker, and the same topic can be distributed across multiple brokers in the same cluster.
(2) On consumer groups, messages, and partitions: first, within a single consumer group, a message cannot be consumed by more than one consumer; exactly one consumer consumes it. Second, this means no message is consumed twice within the same consumer group. Third, a consumer in a group consumes at the granularity of a partition, not of individual messages; the consumer in effect establishes a fixed connection to the partition, and once that relationship exists, the partition's contents are consumed only by that consumer.
(3) Why is Kafka a distributed system? Within one Kafka cluster, a topic has multiple partitions, and different partitions can live on different brokers, that is, on different machines. The partitions, and therefore the data, are distributed, which is what makes Kafka a distributed system.
With these components in mind, Kafka's architecture diagram should be easy to follow. Kafka uses Zookeeper to manage and coordinate the broker nodes in the cluster: when a new broker joins the cluster, or when a broker fails, Zookeeper notifies producer and consumer applications so that they read from and write to the remaining healthy brokers.
Use scenarios for Kafka
The common business scenarios for Kafka are as follows:
**Log collection:** a company can use Kafka to collect the logs of its various services and expose them through a unified interface to consumers such as Hadoop, HBase, and Elasticsearch.
**Message systems:** decoupling producers from consumers, buffering messages, and so on.
**User activity tracking:** Kafka is often used to record the activities of web or app users, such as page views, searches, and clicks. These events are published by the various servers to Kafka topics; subscribers consume the topics for real-time monitoring and analysis, or load them into Hadoop or a data warehouse for offline analysis and mining.
**Operational metrics:** Kafka is also frequently used to record operational monitoring data, collecting metrics from various distributed applications and producing centralized feeds for operations, such as alarms and reports.
**Stream processing:** Kafka is a streaming platform and integrates with big data stream processors such as Spark Streaming, Storm, and Flink.
**Event sourcing:** an application design style in which every state change is captured as a timestamped record and stored as a time-ordered sequence. This approach can be used to build very stable and reliable back ends for applications with very high volumes of state changes.