Preface: Lao Liu can't promise how good this will be, but it is a conscientious account of his experiences and feelings from self-learning big data development, and he promises to cover details that other technical blogs leave out.
01 Self-study Flume details
Lao Liu wants to write something with his own flavor: the things he has run into while teaching himself big data, making sure to mention the knowledge points that other technology blogs ignore.
Many people who teach themselves programming run into the same problem, especially those about to look for a job in their second year of graduate school. Because job hunting is close, there isn't much time for self-study, so along the way they skip over many small but very important knowledge points, and many of them simply memorize material from training institutions.
They never calm down and study each knowledge point, never question whether what those institutions wrote is right or wrong, and copy the material wholesale without forming their own understanding. This is very dangerous!
Starting today, Lao Liu will walk you through the details that are easy to overlook on the road of self-studying big data development, so that you can form your own understanding of each knowledge point.
1. What is Flume?
Flume is a highly available, highly reliable, distributed system for massive log collection, aggregation, and transmission.
It's like someone saying they have a car and a house without saying what kind of car or what kind of house. A bicycle is also a car, and a rented flat is also a house! So if you claim to own a car or a house, you'd better have solid evidence.
Likewise, in an interview, simply stating that Flume is a highly available, highly reliable, distributed system for massive log collection, aggregation, and transmission is very unconvincing. It is the classic sign of copied material with no understanding of your own.
You also have to explain how it achieves high availability, high reliability, and distribution; only then does it sound credible! This is where Lao Liu is different from others: he really shares with a clear conscience!
In Lao Liu's view, a complete offline big data processing system needs more than HDFS, MapReduce, and Hive at the core of the analysis system; it also needs indispensable auxiliary systems such as data collection, result export, and task scheduling, and the Hadoop ecosystem has convenient open source frameworks for all of these.
Among them, Flume is the open source framework for log collection, aggregation, and transmission. Its high availability, high reliability, and distributed nature generally come from deploying multiple servers and running a Flume agent on each of them, while a transaction mechanism guarantees the integrity and accuracy of the transmitted data. Flume transactions are covered later.
That is enough about the concept of Flume. The point is: don't just copy the institutions' material; think it through and form your own understanding.
2. Flume architecture
Looking at this architecture diagram, let Lao Liu explain directly how Flume works.
An external data source sends events to Flume in a specific format. When the source receives events, it stores them in one or more channels, and the channel holds the events until they are consumed by the sink. The sink's main job is to read events from the channel, store them in an external storage system or forward them to the next source, and remove them from the channel once that succeeds.
Now let's go over each component: agent, source, channel, and sink (a minimal configuration that wires them together follows the list).
agent
It is a JVM process that sends data from source to destination in the form of events.
source
It is an acquisition component that is used to get data.
channel
It is a transport channel component used to cache data and pass data from source to sink.
sink
It is a sink component that sends data to the final storage system or to the next agent.
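To make the four components concrete, here is the minimal example from the Flume user guide, written as a sketch: one agent named a1 with a netcat source, a memory channel, and a logger sink (a1, r1, c1, and k1 are just labels; any names work as long as they are used consistently).
# one agent (a1) holding one source (r1), one channel (c1), and one sink (k1)
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# source: listen for lines of text on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# channel: buffer events in memory between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# sink: write events to the log, which is handy for a first test
a1.sinks.k1.type = logger
# wiring: a source can feed several channels, a sink drains exactly one channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1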
Flume transaction
Flume transactions are very, very important. As mentioned above, Flume transactions are what guarantee the integrity and accuracy of the transmitted data.
Take a look at this first:
Flume has two kinds of transactions: the put transaction and the take transaction.
The put transaction takes two steps:
doPut, which first writes the batch of data into the temporary buffer putList;
doCommit, which checks whether the channel has free space; if it does, the data is pushed into the channel, and if there is no free space, the data is rolled back to putList.
The take transaction also has two steps:
doTake, which reads data into a temporary buffer called takeList and sends the data on to HDFS;
doCommit, which checks whether the upload succeeded; if it did, the temporary buffer takeList is cleared, and if it failed, for example because HDFS crashed, the data is rolled back into the channel.
Walking through the steps of these two transactions shows why Flume can guarantee the integrity and accuracy of the data it transfers.
Lao Liu sums it up like this: when data is transferred to the next node and the receiving side runs into a problem, such as a network failure, that batch of data is rolled back and then retransmitted.
Within a single node, when the source writes data into the channel and something goes wrong within a batch, none of that batch is written to the channel, the partially received data is discarded, and the batch is re-sent by the upstream node.
Through these two transactions, Flume ensures the integrity and accuracy of data transmission.
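These transactions are also where the sizing parameters in the configurations below come from. As a small sketch, using the same values as the cases that follow: the channel's transactionCapacity caps how many events a single put or take transaction can carry, so a sink's batchSize should not exceed it.
# memory channel: capacity is the total number of events the channel can buffer,
# transactionCapacity is the maximum number of events in one put or take transaction
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# the HDFS sink takes up to batchSize events per take transaction,
# so batchSize should stay at or below the channel's transactionCapacity
a1.sinks.k1.hdfs.batchSize = 100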
Flume in practice
This is the most important part of Flume. As a log collection framework, how to use Flume matters more than its concepts, so make sure you know how to use it! At the beginning Lao Liu skipped this part entirely and only read the knowledge points; only now does he realize how important hands-on practice is!
But there are countless Flume hands-on examples out there. Do we have to memorize every one of them?
Of course not. We can write each Flume configuration by following the configuration examples on the official website, as shown below:
Look at the blue box in the lower left corner to find the relevant configuration examples. Lao Liu has a saying here: when you want to learn a new framework, the official website is your study material. Learning from the official website not only improves your technical skills but also your English, so why not?
Now for the cases. The first one is collecting a file into HDFS: the requirement is to monitor a file and, whenever new content is appended to it, collect that data into HDFS.
Following the official website, create a folder under the Flume installation directory to hold the Flume configuration files you develop.
Given the requirement, the source should be configured as exec; to make sure data is not lost, the channel should be configured as file; and the sink should be configured as hdfs.
Although this already meets the requirement, in real data development there will be a lot of small files, so we have to do some related tuning. For example, how do we deal with files that are too small or too large? How do we organize the output directories?
So we also need to set some parameters to control file rolling and the directory layout.
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Configure the source: type exec collects new data as it is appended to the file
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /opt/bigdata/flumeData/tail.log
# Send the source's data into the channel
a1.sources.r1.channels = c1
# Configure the channel: type file so that data is not lost
a1.channels.c1.type = file
# Checkpoint directory -- records the position of events in the data directories
a1.channels.c1.checkpointDir = /kkb/data/flume_checkpoint
# Data storage directory
a1.channels.c1.dataDirs = /kkb/data/flume_data
# Configure the sink
a1.sinks.k1.channel = c1
# Sink type is hdfs
a1.sinks.k1.type = hdfs
# HDFS directory the data is collected into
a1.sinks.k1.hdfs.path = hdfs://node01:9000/tailFile/%Y-%m-%d/%H%M
# Prefix for the generated file names
a1.sinks.k1.hdfs.filePrefix = events-
# Whether to round down the timestamp -- controls the directory rollover
a1.sinks.k1.hdfs.round = true
# Round-down value, e.g. 12:10--12:19 => 12:10, 12:20--12:29 => 12:20
a1.sinks.k1.hdfs.roundValue = 10
# Unit of the round-down value
a1.sinks.k1.hdfs.roundUnit = minute
# Roll to a new file after 60 s, or 50 bytes, or 10 events, whichever comes first
a1.sinks.k1.hdfs.rollInterval = 60
a1.sinks.k1.hdfs.rollSize = 50
a1.sinks.k1.hdfs.rollCount = 10
# Number of events written per batch
a1.sinks.k1.hdfs.batchSize = 100
# Use the local timestamp so that %Y-%m-%d etc. can be resolved
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Generated file type; the default is SequenceFile, DataStream gives plain text
a1.sinks.k1.hdfs.fileType = DataStream
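As a rough sketch of running this case (the config file name tail-file-hdfs.conf is an assumption, and paths should be adjusted to your own layout), start the agent from the Flume installation directory, then append a line to the monitored file from another shell:
# create the directories referenced in the config
mkdir -p /kkb/data/flume_checkpoint /kkb/data/flume_data /opt/bigdata/flumeData
# start the agent; --name must match the agent name used in the config (a1)
bin/flume-ng agent --conf conf --conf-file conf/tail-file-hdfs.conf \
  --name a1 -Dflume.root.logger=INFO,console
# in another shell, append a line to the monitored file to trigger collection
echo "hello flume" >> /opt/bigdata/flumeData/tail.log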
The second case is collecting a directory into HDFS: new files are constantly generated in a directory, and the data in those files needs to be continuously transferred to HDFS.
To collect a directory, the source is generally configured as spooldir; the channel can be file or something else, and is usually memory; the sink is still configured as hdfs.
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Configure the source: spooldir; do not drop files with the same name into the monitored directory more than once
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /opt/bigdata/flumeData/files
# Whether to add the absolute path of the file to the event header
a1.sources.r1.fileHeader = true
a1.sources.r1.channels = c1
# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Configure the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://node01:9000/spooldir/%Y-%m-%d/%H%M
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.rollInterval = 60
a1.sinks.k1.hdfs.rollSize = 50
a1.sinks.k1.hdfs.rollCount = 10
a1.sinks.k1.hdfs.batchSize = 100
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Generated file type; the default is SequenceFile, DataStream gives plain text
a1.sinks.k1.hdfs.fileType = DataStream
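A sketch of running this case (the config file name spooldir-hdfs.conf is an assumption): start the agent, then move a finished file into the monitored directory; by default the spooldir source renames a file with a .COMPLETED suffix once it has been fully collected.
# start the agent; --name must match the agent name in the config (a1)
bin/flume-ng agent --conf conf --conf-file conf/spooldir-hdfs.conf \
  --name a1 -Dflume.root.logger=INFO,console
# in another shell, move a complete file into the monitored directory
# (do not write into it in place, and do not reuse a file name)
mv /tmp/access.log /opt/bigdata/flumeData/files/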
Finally, there is the case of two agents in series. The first agent monitors a directory for newly added files, collects their data, and sends it over the network to the second agent; the second agent receives the data sent by the first one and saves it to HDFS.
Although two agents are chained together, it is not particularly difficult as long as we follow the configuration examples on the official website and the two cases above.
First, the configuration file for agent1 looks like this:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Configure the source: spooldir; do not drop files with the same name into the monitored directory more than once
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /opt/bigdata/flumeData/files
# Whether to add the absolute path of the file to the event header
a1.sources.r1.fileHeader = true
a1.sources.r1.channels = c1
# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Configure the sink
a1.sinks.k1.channel = c1
# An Avro sink transmits data over the network; events can be sent to an RPC server (such as an Avro source)
a1.sinks.k1.type = avro
# node02 should be changed to your own hostname
a1.sinks.k1.hostname = node02
a1.sinks.k1.port = 4141
The configuration file for agent2 looks like this:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Configure the source: an Avro source receives the events sent by agent1's Avro sink
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
# Hostname and port the Avro source binds to
a1.sources.r1.bind = node02
a1.sources.r1.port = 4141
# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Configure the sink
a1.sinks.k1.channel = c1
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://node01:9000/avro-hdfs/%Y-%m-%d/%H-%M
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.rollInterval = 60
a1.sinks.k1.hdfs.rollSize = 50
a1.sinks.k1.hdfs.rollCount = 10
a1.sinks.k1.hdfs.batchSize = 100
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Generated file type; the default is SequenceFile, DataStream gives plain text
a1.sinks.k1.hdfs.fileType = DataStream
To run it, start Flume on node02 first and then start Flume on node01, as sketched below.
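A sketch of the start-up commands (the config file names avro-hdfs.conf and spooldir-avro.conf are assumptions): agent2 on node02 has to be running first so that its Avro source is already listening on port 4141 when agent1's Avro sink on node01 connects.
# on node02: start agent2 so the Avro source listens on port 4141
bin/flume-ng agent --conf conf --conf-file conf/avro-hdfs.conf \
  --name a1 -Dflume.root.logger=INFO,console
# on node01: then start agent1, whose Avro sink sends to node02:4141
bin/flume-ng agent --conf conf --conf-file conf/spooldir-avro.conf \
  --name a1 -Dflume.root.logger=INFO,console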
02 Flume details summary
Lao Liu has covered four easily overlooked details about Flume, simply to remind self-taught partners to pay attention to the details: never copy the training material word for word, and form your own understanding of every knowledge point.
Finally, if you feel something is unclear or wrong, you can reach out through the official account: hardworking Old Liu. Lao Liu hopes this helps students interested in big data development, and he hopes to get their guidance too.
If you found it useful, give Lao Liu a thumbs up!