Overview

Big data platforms generate a large number of logs every day, and a dedicated logging system is required to process them.
In general, these systems need to have the following characteristics:
They build a bridge between the application systems and the analysis systems and decouple the two.
They support both near-real-time online analysis systems and offline analysis systems such as Hadoop.
They are highly scalable: when the data volume grows, they can scale horizontally by adding nodes.

It is therefore recommended to divide a log collection and analysis system into the following modules:
Data acquisition module: collects data from each node in real time; Flume-NG is recommended for the implementation.
Data access module: because data collection and data processing do not necessarily run at the same speed, a message middleware is added as a buffer; Kafka is recommended.
Streaming computing module: analyzes the collected data in real time; Storm is recommended.
Data output module: persists the analysis results to HDFS, MySQL, and so on.

Log collection selection
As noted above, processing the logs that a big data platform produces every day requires a dedicated logging system. The commonly used open-source log systems at present are Flume and Kafka. Both are excellent, and each has its own characteristics; let's take a look at them one by one.
Flume component features
Flume is a distributed, reliable, and highly available system for collecting, aggregating, and transmitting massive amounts of log data. It supports customizable data senders in the log system for collecting data, and it also provides simple processing of the data and the ability to write it to various (customizable) data receivers.
Flume design objectives
Reliability: the core job of Flume is to collect data from a data source and deliver it to a destination. To guarantee delivery, the data is cached before it is sent, and the cached copy is deleted only after the data has actually arrived at the destination. Flume uses a transactional approach to guarantee the reliability of the whole event-delivery process.
Scalability: Flume has only one role, the Agent, which contains the Source, Sink, and Channel components. The Sink of one Agent can be directed to the Source of another Agent, and this kind of configuration enables multi-level flows.
Functional extensibility: Flume provides rich Source, Sink, and Channel implementations, and users can add custom component implementations as needed and use them through configuration.
Flume's architecture
The basic architectural unit of Flume is the Agent. It is a complete data collection tool that contains three core components: Source, Channel, and Sink. The basic unit of data is the Event, which flows through Source, Channel, and Sink from an external data source to an external destination.
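As an illustration of how these three components are wired together, a minimal single-agent configuration might look like the sketch below; the names (a1, r1, c1, k1) and the netcat/logger components are arbitrary choices, not taken from this article.

# a minimal, hypothetical single-agent pipeline: one Source, one Channel, one Sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listen for text lines on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffer events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Sink: write events to the Flume log, for demonstration only
a1.sinks.k1.type = logger

# wire Source -> Channel -> Sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1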
In addition to the single-agent architecture, multiple agents can also be combined to form a multi-layer data flow architecture:
Sequential connection of multiple Agents: several Agents are connected one after another, carrying the data from the initial source into the final storage system. The number of Agents chained this way should generally be kept small, because the longer the data path, the more a failure of any Agent on the flow will affect the whole collection service if failover is not taken into account.
Data from multiple Agents converging on one Agent: this applies to many scenarios, typically summarizing the data flows of a distributed system whose data sources are scattered.
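In both of these topologies an Agent hands events to the next one through an Avro Sink/Source pair. The sketch below assumes hypothetical agent names, a host, and a port that are not from this article.

# upstream agent: an Avro Sink sends events to the downstream agent
agentA.sinks.avroOut.type = avro
agentA.sinks.avroOut.hostname = collector-host
agentA.sinks.avroOut.port = 4545
agentA.sinks.avroOut.channel = chA

# downstream (collector) agent: an Avro Source receives events from one or more upstream agents
agentB.sources.avroIn.type = avro
agentB.sources.avroIn.bind = 0.0.0.0
agentB.sources.avroIn.port = 4545
agentB.sources.avroIn.channels = chB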
Multiplexing mode has two common forms: one for replication and one for splitting. In replication mode, the data from the front-end source is copied into multiple Channels, and every Channel receives the same data. In splitting mode, a Selector uses the value of an event Header to decide which Channel the data is passed to.
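These two forms correspond to Flume's replicating and multiplexing channel selectors. The agent, source, channel, and header names below are hypothetical.

# replication: copy every event to both channels
agentX.sources.src1.channels = ch1 ch2
agentX.sources.src1.selector.type = replicating

# splitting: route events by the value of a header named "type"
agentY.sources.src1.channels = ch1 ch2
agentY.sources.src1.selector.type = multiplexing
agentY.sources.src1.selector.header = type
agentY.sources.src1.selector.mapping.access = ch1
agentY.sources.src1.selector.mapping.error = ch2
agentY.sources.src1.selector.default = ch1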
Load balancing: the Events in a Channel can be balanced across multiple Sink components, with each Sink connected to an independent Agent, thus achieving load balancing.
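In Flume this is configured with a sink group whose processor type is load_balance; the sketch below uses hypothetical agent and sink names.

# group two sinks and balance events across them in round-robin fashion
agentZ.sinkgroups = g1
agentZ.sinkgroups.g1.sinks = sink1 sink2
agentZ.sinkgroups.g1.processor.type = load_balance
agentZ.sinkgroups.g1.processor.selector = round_robin
agentZ.sinkgroups.g1.processor.backoff = true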
Kafka component features
Kafka is essentially a publish/subscribe system. Producers publish messages to a Topic, and consumers subscribe to a Topic; when a new message arrives on a Topic, the Broker delivers it to every consumer that subscribes to that Topic.
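The publish/subscribe model can be tried out with the console tools that ship with Kafka. The broker and ZooKeeper addresses below reuse the example environment from later in this article, and the --zookeeper consumer option assumes an older Kafka release, matching the commands used there.

# publish messages to a topic (type lines on stdin)
kafka-console-producer.sh --broker-list zdh100:9092 --topic mytopic

# subscribe to the same topic from another terminal
kafka-console-consumer.sh --zookeeper zdh100:2181 --topic mytopic --from-beginning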
Kafka’s design goals
Kafka manages messages by Topic. Each Topic contains multiple Partitions, and each Partition corresponds to a logical log made up of multiple Segments, each of which stores multiple messages. A message's ID is determined by its logical position, so the ID maps directly to the location where the message is stored, avoiding an extra mapping from ID to location.
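For example, a topic spread across several partitions can be created with the standard tool; the partition and replication counts below are arbitrary, and the ZooKeeper address and old-style syntax reuse the example environment from later in this article.

# create a topic with 3 partitions, each replicated to 2 brokers
kafka-topics.sh --create --zookeeper zdh100:2181 --topic mytopic --partitions 3 --replication-factor 2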
High throughput for both publishing and subscribing: Kafka can produce about 250,000 messages per second (50 MB) and consume about 550,000 messages per second (110 MB).
Distributed and easy to scale out: producers, Brokers, and consumers are all distributed, and machines can be added without stopping the system.
Kafka’s architecture
Kafka is a distributed, partitioned, and replicated messaging system that maintains message queues.
Kafka's overall architecture is very simple. It is an explicitly distributed architecture with multiple Producers, Brokers, and Consumers; Producers and Consumers implement Kafka's registration interface. Data is sent from a Producer to a Broker, which acts as an intermediary, buffering the data and distributing it to the Consumers registered with the system. The Broker thus behaves like a cache sitting between the active data and the offline processing systems. Communication between clients and servers is based on a simple, high-performance TCP protocol that is independent of any programming language.
Flume vs. Kafka
Flume and Kafka are both excellent log systems. Both can handle data acquisition, data transmission, load balancing, fault tolerance, and a series of related requirements, but there are some differences between the two.
Flume and Kafka have their own characteristics:
Flume suits solutions that are configured rather than programmed: because it provides rich Source, Channel, and Sink implementations, new data sources can be introduced through configuration changes alone.
Kafka suits data pipelines with high throughput and availability requirements, but it generally requires programming to implement the production and consumption of data.
Summary of log collection and selection
It is recommended to use Flume as the producer of the data, so that data sources can be introduced without programming, and to use the Kafka Sink as the consumer of that data, so that higher throughput and reliability can be achieved. If the data reliability requirement is high, the Kafka Channel can be used as the Flume Channel.
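As a sketch of that last option, a Kafka Channel can replace an in-memory channel as shown below. The property names follow recent Flume releases (older releases use brokerList and zookeeperConnect instead), and the topic name is hypothetical.

# use Kafka itself as the Flume channel for higher reliability
agent1.channels.kafkacnl.type = org.apache.flume.channel.kafka.KafkaChannel
agent1.channels.kafkacnl.kafka.bootstrap.servers = zdh100:9092,zdh101:9092,zdh102:9092
agent1.channels.kafkacnl.kafka.topic = flume-channel-topic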
Integrating Flume with Kafka
Flume acts as the message producer and publishes message data (log data, business request data, and so on) to Kafka through the Kafka Sink.
Integration configuration
Integration example
Suppose an existing Flume agent reads /data1/logs/component_role.log in real time and imports the data into Kafka's mytopic topic.
The default environment is:
Zookeeper: zdh100:2181, zdh101:2181, zdh102:2181
Kafka Broker: zdh100:9092, zdh102:9093
Modify the Flume configuration as follows:
agent1.sources = logsrc
agent1.channels = memcnl
agent1.sinks = kafkasink
# source section
agent1.sources.logsrc.type = exec
agent1.sources.logsrc.command = tail -F /data1/logs/component_role.log
agent1.sources.logsrc.shell = /bin/sh -c
agent1.sources.logsrc.batchSize = 50
agent1.sources.logsrc.channels = memcnl
# Each sink's type must be defined.
agent1.sinks.kafkasink.type = org.apache.flume.sink.kafka.KafkaSink
agent1.sinks.kafkasink.brokerList = zdh100:9092,zdh101:9092,zdh102:9092
agent1.sinks.kafkasink.topic = mytopic
agent1.sinks.kafkasink.requiredAcks = 1
agent1.sinks.kafkasink.batchSize = 20
agent1.sinks.kafkasink.channel = memcnl
# Each channel's type is defined.
agent1.channels.memcnl.type = memory
agent1.channels.memcnl.capacity = 1000

Start the Flume node:
/home/mr/flume/bin/flume-ng agent -c /home/mr/flume/conf -f /home/mr/flume/conf/flume-conf.properties -n agent1

Once the agent is running (Port=10100), dynamically append log data to /data1/logs/component_role.log.
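Any command that writes to the file will do; for example:

# append a test line to the monitored log file (illustrative only)
echo "test message $(date)" >> /data1/logs/component_role.log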
Verify that Kafka received the data correctly; the appended data should be displayed:
kafka-console-consumer.sh --zookeeper zdh100:2181 --topic mytopic --from-beginning