Preface
In The Age of Big Data, by Viktor Mayer-Schönberger and Kenneth Cukier, big data refers to analyzing all of the data rather than taking the shortcut of random sampling. Big data has five “V” characteristics (proposed by IBM): Volume, Velocity, Variety, Value, and Veracity.
MapReduce, Kafka, and Flink are three popular big data frameworks: MapReduce for batch computing, Kafka for messaging, and Flink for stream processing.
Now that we know how important MapReduce, Kafka, and Flink are for learning big data, what if you have no learning resources for them? Don’t worry: the editor has collected 20 GB of resources for you. I hope you like them!
This article is divided into three parts. Let’s start with MapReduce, one of Google’s three treasures and a very important part of Hadoop.
MapReduce
1: What does MapReduce do
Since I can’t find Google’s schematic, I’d like to borrow a diagram of the Hadoop project to illustrate where MapReduce stands, as shown below.
Hadoop is in fact an open source implementation of Google’s three treasures. Hadoop MapReduce corresponds to Google MapReduce, HBase corresponds to BigTable, and HDFS corresponds to GFS. HDFS (or GFS) provides efficient unstructured storage for the layers above it. HBase (or BigTable) is a distributed database that provides structured data services. Hadoop MapReduce (or Google MapReduce) is a parallel computing programming model used for job scheduling.
GFS and BigTable give us high-performance, high-concurrency services, but not every programmer is comfortable with parallel programming, and if our application itself cannot run concurrently, GFS and BigTable are of little use. The great thing about MapReduce is that it lets even programmers unfamiliar with parallel programming take full advantage of the power of distributed systems.
In a nutshell, MapReduce is a framework for breaking a large job up into smaller jobs (the small jobs are essentially the same as the large one, just smaller in scale). All the user has to do is decide how many jobs to break it into and define the job itself.
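To make the idea concrete, here is a minimal single-machine sketch of the classic word count, written in plain Python rather than with the Hadoop API. The function names (map_phase, shuffle, reduce_phase, word_count) and the use of a thread pool are assumptions made for illustration only. The user supplies the small job (count the words in one chunk) and the number of chunks, and the framework part takes care of the rest.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by key (the word)."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks, workers=2):
    # Each chunk is an independent small job, so the map calls can run in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    # The "large job" (count words in all the text) split into two small jobs.
    print(word_count(["big data is big", "data is everywhere"]))
```

A real cluster would run the map and reduce calls on different machines and move the shuffled data over the network, but the division of labor between user code and framework is the same.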
2: MapReduce 7.7GB teaching video
MapReduce is one of Google’s three treasures and an important part of Hadoop, and the MapReduce tutorial shows just how important it is. If you need the MapReduce learning videos, forward this article, follow the editor, and send the editor a private message to get access.
Kafka
1: What does Kafka do
Kafka is a distributed publish/subscribe based messaging system developed by LinkedIn, written in Scala and widely used for its horizontal scalability and high throughput.
What is Kafka? Consider a producer and a consumer: the producer produces an egg, the consumer eats it, and so on. Suppose the consumer chokes while eating an egg (the system crashes) while the producer keeps producing; the newly produced eggs are lost. Or suppose the producer is very strong (transaction volume is high): it produces 100 eggs a second, but the consumer can eat only 50 a second. After a while the consumer is overwhelmed (messages pile up and the system eventually times out) and refuses to eat any more, so eggs are lost again. Now we put a basket between the two: every egg produced goes into the basket, and the consumer takes eggs from the basket. No eggs are lost because they are all in the basket, and this basket is Kafka. An egg is really a “data stream”: all interaction between systems is carried by data streams (over TCP, HTTP, and so on), and each one is also called a “message”. When the message queue is full, the basket is full and no more eggs can be put in it.
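The basket idea can be sketched in a few lines of Python. This is only an in-memory illustration using the standard library queue module, not the Kafka API, and the name run_pipeline and the basket size are assumptions for the example. Kafka itself stores messages durably on disk rather than in an in-memory queue.

```python
import queue
import threading


def run_pipeline(num_eggs, basket_size):
    """Pass num_eggs messages from a producer to a consumer via a bounded basket."""
    basket = queue.Queue(maxsize=basket_size)  # the "basket" between the two sides
    eaten = []
    done = object()  # sentinel that tells the consumer to stop

    def producer():
        for i in range(num_eggs):
            # put() blocks while the basket is full, so no egg is dropped.
            basket.put(f"egg-{i}")
        basket.put(done)

    def consumer():
        while True:
            egg = basket.get()
            if egg is done:
                break
            eaten.append(egg)

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return eaten


if __name__ == "__main__":
    eggs = run_pipeline(num_eggs=100, basket_size=5)
    print(len(eggs), eggs[0], eggs[-1])
```

Because the producer waits when the basket is full instead of throwing eggs away, every message eventually reaches the consumer, in the order it was produced.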
Twitter is an example: some users post messages and others consume them, which is a typical Kafka scenario.
Kafka versus other major distributed messaging systems
2: Kafka learning route and 2.9G learning video
Learning path
2.9G learning video
Having read the introduction to Kafka, you will understand why, thanks to its horizontal scalability and high throughput, it is widely used by Taobao, JD.com, Weibo, and others. The editor has put together 2.9 GB of Kafka teaching videos. There is too much content to introduce in detail here; if you need the Kafka learning route and videos, forward this article, follow the editor, and send the private message “learn” to find out how to get them.
Flink
1: What does Flink do
Many of you may not have heard of Flink before 2015, but Flink began as a research project at the Technical University of Berlin in 2008 and was accepted into the Apache Incubator in 2014. It quickly became a top-level project of the ASF (Apache Software Foundation). At the time of writing, the latest version of Flink is 0.10.0. While many people marvel at the rapid growth of Spark, perhaps we should also give Flink a thumbs up.
Flink is a distributed processing engine for streaming and batch data. It is implemented mainly in Java and currently relies mostly on contributions from the open source community. The main scenario Flink targets is streaming data, and batch data is treated as just a special case of streaming data. In other words, Flink handles every task as a stream, which is its most distinctive feature.
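The idea that a batch is just a stream that happens to end can be sketched with Python generators. This is not Flink code; the operator running_total_length and the source unbounded_source are hypothetical names chosen for the example. The point is that one record-at-a-time operator serves both a finite list and an endless source.

```python
import itertools


def running_total_length(records):
    """A streaming operator: handles one record at a time, never the whole input."""
    total = 0
    for record in records:
        total += len(record)
        yield total


def unbounded_source():
    """A stream that never ends (hypothetical sensor readings)."""
    for i in itertools.count():
        yield f"reading-{i}"


# Batch: a finite list is simply a stream with an end.
batch_result = list(running_total_length(["a", "bb", "ccc"]))

# Stream: the same operator on an endless source; take only the first three outputs.
stream_result = list(itertools.islice(running_total_length(unbounded_source()), 3))

print(batch_result)   # [1, 3, 6]
print(stream_result)  # [9, 18, 27]
```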
Flink supports fast local iteration as well as cyclic iterative tasks. Flink also manages memory itself: compared with Spark, Flink does not hand memory over entirely to the application layer, which is why Spark is more prone to out-of-memory (OOM) errors than Flink. In terms of the framework itself and its application scenarios, Flink is more similar to Storm. Flink’s architecture and many of its concepts will be easier to understand if you are already familiar with Storm or Flume. Let’s take a look at Flink’s architecture first.
Flink’s most basic concepts are the Client, the JobManager, and the TaskManager. The Client submits tasks to the JobManager, which sends them to the TaskManagers for execution. Each TaskManager then reports task status through heartbeats. At this point, some of you may feel you are back in first-generation Hadoop. Indeed, the JobManager looks a lot like the JobTracker, and the TaskManager looks a lot like the TaskTracker. However, there are important differences. First, data flows between TaskManagers as streams. Second, first-generation Hadoop has only one shuffle, between Map and Reduce, whereas a Flink job can have many stages, with data transferred both within a TaskManager and between TaskManagers, unlike Hadoop’s fixed Map-to-Reduce pipeline.
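The division of roles described above can be modeled with a toy, single-process sketch. It is not Flink’s real API or protocol: the class and method names (submit, execute, heartbeat, collect_heartbeats) and the round-robin assignment are assumptions made for illustration, and a real system would run these components on separate machines and send heartbeats periodically over the network.

```python
class TaskManager:
    """Worker: runs the tasks it is given and reports their status via heartbeat."""

    def __init__(self, name):
        self.name = name
        self.status = {}   # task id -> "RUNNING" or "FINISHED"
        self.results = {}  # task id -> result of the task

    def execute(self, task_id, func, data):
        self.status[task_id] = "RUNNING"
        self.results[task_id] = func(data)
        self.status[task_id] = "FINISHED"

    def heartbeat(self):
        return {"worker": self.name, "tasks": dict(self.status)}


class JobManager:
    """Coordinator: splits a job into tasks and hands them to TaskManagers."""

    def __init__(self, task_managers):
        self.task_managers = task_managers

    def submit(self, func, partitions):
        assignments = []
        for task_id, data in enumerate(partitions):
            # Round-robin assignment (an assumption; real schedulers are smarter).
            worker = self.task_managers[task_id % len(self.task_managers)]
            worker.execute(task_id, func, data)
            assignments.append((worker, task_id))
        return [worker.results[task_id] for worker, task_id in assignments]

    def collect_heartbeats(self):
        return [tm.heartbeat() for tm in self.task_managers]


class Client:
    """Entry point: submits a job to the JobManager and returns its results."""

    def __init__(self, job_manager):
        self.job_manager = job_manager

    def run(self, func, partitions):
        return self.job_manager.submit(func, partitions)


job_manager = JobManager([TaskManager("tm-1"), TaskManager("tm-2")])
client = Client(job_manager)
results = client.run(sum, [[1, 2], [3, 4], [5]])
heartbeats = job_manager.collect_heartbeats()
print(results)
print(heartbeats)
```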
2: Alibaba chooses Flink as its first choice
Early last year, Alibaba shocked the big data community when it bought Data Artisans, the Berlin-based startup behind Flink, for 90 million euros.
In the Hadoop ecosystem, Flink is a newer engine than Spark. No doubt you know Spark, the data-processing engine that displaced MapReduce. But what you may not know is that inside Alibaba, Spark has been completely replaced by Flink.
Whether the workload is full data, incremental data, or real-time processing, a single solution can support all of them. This is the background to, and the original motivation for, Alibaba’s choice of Flink.
At present, there are many open source big data computing engines: for stream computing, Storm, Samza, Flink, Kafka Streams, and others; for batch computing, Spark, Hive, Pig, Flink, and others. Among engines that support both stream and batch processing, there are only two options: Apache Spark and Apache Flink.
Alibaba weighed technology, ecosystem, and other factors. First of all, Spark’s technical approach is to simulate stream computing on top of batches. Flink, on the other hand, simulates batch computing on top of streams.
From a technical perspective, simulating streams with batches has some limitations that may be difficult to overcome. Flink simulates batches on top of streams and is technically more scalable. Taking the long view, Alibaba decided to use Flink as its unified, general-purpose big data engine for the future.
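The article does not say which limitations it has in mind. One commonly cited example is latency, and the small calculation below illustrates it under simplifying assumptions: processing time is ignored, arrival times are made up, and the function names are hypothetical. In the batch-based model an event waits until its batch closes, while in the stream-based model it is handled as soon as it arrives.

```python
def micro_batch_latencies(arrival_times, batch_interval):
    """Batch-simulated streaming: an event is processed only when its batch closes."""
    latencies = []
    for t in arrival_times:
        # The batch containing t closes at the next multiple of batch_interval.
        batch_close = (t // batch_interval + 1) * batch_interval
        latencies.append(batch_close - t)
    return latencies


def per_record_latencies(arrival_times):
    """Stream-based processing: each event is handled as soon as it arrives."""
    return [0 for _ in arrival_times]


arrivals = [0, 3, 7, 12]  # hypothetical arrival times in milliseconds
batch = micro_batch_latencies(arrivals, batch_interval=10)
stream = per_record_latencies(arrivals)

print(batch)   # [10, 7, 3, 8]
print(stream)  # [0, 0, 0, 0]
```

Shrinking the batch interval reduces the waiting time but never removes it, which is one way to read the claim that the limitation is hard to overcome.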
3: Flink learning route and 5.97G learning video
Learning Route:
5.97G learning video:
If you know something about big data, you know how popular Flink is right now. It simulates batch computing on top of streaming and treats every task as a stream.
The editor has collected Flink, Kafka, and MapReduce teaching videos totaling 17 GB. There is too much content, all of it practical, to introduce in detail here. If you need the learning routes and videos, forward this article, follow the editor, and then send the private message “learning” to get access.