The history of big data
On September 30, 2018, Martin Lau, president of Tencent, sent a letter to all employees, officially launching the third major organizational restructuring in the history of the Chinese Internet giant. Google, Amazon, Alibaba, Baidu, Xiaomi and other Internet giants have likewise been adjusting their organizational structures in recent years to adapt to the arrival of the ABC era. ABC refers to the industrial trends and technological changes represented by A (artificial intelligence), B (big data) and C (cloud computing). The industry generally believes that this marks another industrial shift following the PC era and the mobile Internet era, the beginning of a new era. Like water and electricity in daily life, cloud computing (C) serves as the underlying infrastructure of the entire Internet, providing the places and channels for storing and accessing enterprise data assets. Infrastructure alone is not enough, however; it is the data carried on that infrastructure that is the truly valuable asset. Such data includes internal enterprise management information, commodity information on the Internet, interpersonal communication in chat software, location information and so on. The volume of this data will far exceed the carrying capacity of enterprises' existing IT architecture and infrastructure, and the real-time requirements of enterprise applications will greatly exceed existing computing capabilities. How to make good use of these valuable data assets, so that they can serve national governance, enterprise decision-making and personal life, is the core question of big data processing; it is also the inner soul of cloud computing and its inevitable direction of upgrade.
Along with this trend, the term big data has been mentioned more and more often at technology conferences in recent years. It is used to describe and define the massive amounts of data produced in the information age, and to name the technological developments and innovations associated with them. McKinsey, the well-known consulting firm, was among the first to proclaim the arrival of the era of big data. In fact, big data has existed for a long time in physics, biology, environmental ecology and other fields, as well as in the military, financial and communication industries, but it has only attracted broad attention in recent years thanks to the development of the Internet and the IT industry. According to the China Academy of Information and Communications Technology (CAICT), based on its survey of big data-related enterprises, the scale of China's big data industry was 470 billion yuan in 2017, up 30% year on year, and was expected to reach 1 trillion yuan by 2020. (Source: China Academy of Information and Communications Technology, White Paper on Big Data (2018))
What exactly is big data? According to the definition given by the research firm Gartner, big data consists of massive, high-growth and diversified information assets that require new processing modes in order to provide stronger decision-making power, insight and process optimization capability. The strategic significance of big data technology lies not in the mastery of huge amounts of data, but in the professional processing of meaningful data. In other words, if we compare big data to an industry, the key to this industry's profitability is to improve the "processing capacity" of data and to realize the "value-added" of data through "processing". (Source: Sogou Baike.) The definition given by the McKinsey Global Institute is: a data set so large that its acquisition, storage, management and analysis greatly exceed the capabilities of traditional database software tools, characterized by massive data volume, fast data flow, diverse data types and low value density. Among the many definitions, the one in the Sogou Baike entry on big data resonates with me most: big data is a collection of data that cannot be captured, managed and processed with conventional software tools within a certain time frame; it is a high-growth, diversified information asset that requires new processing modes in order to provide stronger decision-making power, insight discovery and process optimization capability.
Viktor Mayer-Schönberger, known as "the pioneer of commercial applications of big data", argues that big data does not mean shortcuts such as random analysis (for example sampling), but rather the analysis of all the data. The core of big data is prediction, and it will create an unprecedented quantifiable dimension for human life. The biggest shift in the era of big data, he says, is to abandon the pursuit of causality and focus instead on correlation: to know "what" rather than "why". This overturns thousands of years of human thinking conventions and poses a new challenge to human cognition and to the way we communicate with the world. IBM summarizes the characteristics of big data as the 5 Vs: Volume, Velocity, Variety, Value and Veracity.
From the perspective of technological history, big data processing can be divided into three stages: its precursor, its emergence and its application. From the 1990s until the beginning of this century was the precursor stage: mainstream data storage and processing still relied on databases. As database technology and data mining theory matured, data warehousing and data mining technologies gradually developed, and business intelligence tools such as data warehouses, expert systems and knowledge management systems began to be applied.
With the emergence of various new services on the Internet, large amounts of unstructured data appeared, which traditional database technology found increasingly difficult to handle. For example, the popularity of Facebook meant that social applications generated large amounts of unstructured data, and Google's search engine business naturally faced the ever-expanding storage and processing of massive data. All of this pushed the development of big data technology into the fast lane. The industry generally takes three papers published by Google between 2003 and 2006 as the starting point of big data processing technology: GFS, MapReduce and Bigtable. GFS (2003) is a scalable distributed file system for accessing large amounts of distributed data; it runs on inexpensive commodity hardware and provides fault tolerance. MapReduce (2004) is a parallel programming model for processing massive data, used for parallel computation over large-scale data sets. It can make full use of the many CPUs provided by the low-cost servers in a GFS cluster, can be regarded as an architectural complement to GFS, and together with GFS constitutes the core of massive data processing. GFS is suitable for storing a small number of very large files, but not for storing thousands of small files; to handle large amounts of formatted and semi-formatted data, BigTable (2006) was born as a distributed storage system for managing non-relational data, designed to process petabyte-scale data quickly and reliably and to be deployable across thousands of machines. These three papers can be regarded as the origin of big data processing technology.
In the development of big data processing technology, Hadoop has to be mentioned. Started in 2005 by Doug Cutting, later president of the Apache Software Foundation, during his time at Yahoo, this open source distributed computing platform for big data analysis allows applications to scale reliably to thousands of nodes and petabytes of data. By building an open source platform around MapReduce, Hadoop inadvertently created a thriving ecosystem whose influence extends far beyond its original footprint. In the Hadoop community, engineers could refine and extend the ideas from the earlier GFS and MapReduce papers, and many useful tools were built on top of them, such as Pig, Hive, HBase, Crunch and more. This openness is key to the diversity of thought across the industry, and Hadoop's open ecosystem directly contributed to the development of stream computing systems. With the rapid growth of the Internet industry, the production, use, processing and analysis of data also increased at an incredible pace, with verticals such as social media, the Internet of Things, advertising and gaming all beginning to process ever-larger data sets. These industries require near-real-time data processing and analysis from a business perspective, so traditional batch-oriented frameworks such as Hadoop are not well suited to these situations. Since 2007, several open source projects have been launched with new ideas for handling the endless stream of data records arriving from multiple sources, most notably a number of Apache projects, all at various stages of development.
Nowadays, with the wide application of smart mobile devices, the Internet of Things and other technologies, data is increasingly fragmented, distributed and stream-like. Big data technology has begun to combine with mobile and cloud technologies and to develop towards complex event processing, graph databases and in-memory computing. The concept of big data is increasingly accepted by vertical industries and the public; by catalyzing new business models, big data has blurred its boundary with traditional industries, and attention has shifted to innovation in business and in the technology itself. The big data industry has turned to the theme of the transformative impact of applications on industries and has entered the stage of real application.
The direction of big data
Big data technology is a new set of technologies and architectures committed to collecting, processing and analyzing all kinds of super-large-scale data at lower cost and higher speed, so as to extract information valuable to enterprises. As the technology boomed, it became easier, cheaper and faster for us to process huge amounts of data, and big data became a good assistant for harnessing data, even changing the business models of many industries. With the help of artificial intelligence, cloud computing and the Internet of Things, even complex big data can be processed by ordinary data practitioners with the appropriate analysis tools. Big data analytics has moved beyond being a hot IT trend and become something a company's business must have today; data may soon replace gold as one of the most valuable human assets. As A Brief History of the Future puts it: "Whoever owns the data, and the right to interpret it, will hold the advantage in future competition." To keep readers up to date, here are some of the hottest big data trends driving the industry into the future: ten data development trends worth knowing, translated and compiled by the Alibaba Cloud community.
The rapidly growing Internet of Things network
Thanks to Internet of Things (IoT) technology, it is becoming increasingly common for smartphones to be used to control household appliances. The IoT boom is also attracting companies to invest in research and development of the technology, as smart devices from companies such as Xiaomi and Alibaba automate specific tasks in the home.
More organizations will seize the opportunity to provide better IoT solutions, which will inevitably lead to more ways to collect large amounts of data, as well as more ways to manage and analyze it. The trend is to push for new devices that collect, analyze and process data, such as wristbands, smart speakers and glasses.
Pervasive artificial intelligence technology
AI is now more often used to help companies large and small improve their business processes. AI can perform tasks faster and more accurately than humans, reducing human error and improving the overall workflow, which allows people to focus on more critical tasks and further improves service quality.
The rapid growth and high salaries of AI have attracted many developers to the field, and fortunately there are mature AI development toolkits available, so anyone can build algorithms on real tasks to meet the growing demand. Organizations that find the most effective way to integrate these into their business processes may gain a major advantage.
The rise of predictive analytics
Big data analysis has always been one of the key strategies for enterprises to gain competitive advantage and achieve their goals. Researchers use the necessary analytical tools to process big data and determine the causes of certain events. Now, predictive analytics through big data can help better predict what’s likely to happen in the future.
There is no doubt that this strategy is very effective in analyzing the collected information to predict consumer behavior, allowing companies to understand a customer's next steps before undertaking related development and to determine what actions they must take. Data analysis can also provide more context around the data, helping to understand the real reasons behind it.
Migrate dark data to the cloud
Information that has not yet been converted into digital format is called dark data, and it is a huge, currently untapped database. It is expected that these analog databases will be digitized and migrated to the cloud, where they can feed predictive analytics that benefit the enterprise.
The chief data officer will play a bigger role
Big data is now becoming an increasingly important part of executing business strategies, and chief data officers are playing a more important role in their organizations. Chief data officers are expected to steer their companies in the right direction and take a more aggressive approach, a trend that has opened the door for data marketers looking to advance their careers.
Quantum computing
Currently, analyzing and interpreting large amounts of data with our existing technology can take a lot of time. If we could process billions of data points in just a few minutes, we could greatly reduce processing time and give companies the opportunity to make timely decisions and achieve better results.
This daunting task can only be achieved by quantum computing. Although quantum computer research is in its infancy, some companies are already conducting experiments using quantum computers to aid practical and theoretical research in different industries. Soon, big tech companies like Google, IBM and Microsoft will all begin testing quantum computers to integrate them into business processes.
Open Source solutions
There are many public data solutions available today, such as open source software, that have made considerable strides in speeding up data processing while also providing real-time access and responsiveness to data. For this reason, they are expected to grow rapidly and be in high demand in the future. Although open source software is cheap and can reduce operating costs, it also has some disadvantages that need to be weighed before adoption.
Edge computing
Due to the Internet of Things trend, many companies are turning to connected devices to collect more data about customers or processes, creating a need for technological innovation. This new technology aims to reduce the lag between collecting data, sending it to the cloud, analyzing it and acting on it.
Edge computing can address this problem by providing better performance: less data flows in and out of the network, and cloud computing costs are lower. Companies can also benefit from lower storage and infrastructure costs if they choose to discard unnecessary data previously collected from the Internet of Things. In addition, edge computing can speed up data analysis, giving companies enough time to respond correctly.
Smarter chatbots
Due to the rapid development of artificial intelligence, many companies are now deploying chatbots to handle scenarios such as customer queries, providing a more personalized mode of interaction while reducing the need for human agents.
Big data has a lot to do with delivering a more enjoyable customer experience, as bots crunch large amounts of data to provide relevant answers based on the keywords a customer enters in a query. During the interaction they can also gather and analyze information about the customer from the conversation, which in turn helps marketers develop more streamlined strategies to improve conversion.
Conclusion
All these technological leaps across different industries are based on the solid foundation laid by the development of big data. Advances in technology will continue to help us create a better society through smarter processes. To benefit from these trends, we must fully understand how the technology is being used, as well as how to achieve specific business goals. These are just the beginning, and big data will continue to serve as a catalyst for the changes we are experiencing in our business and technology. What we can do is think about how we can effectively adapt to these changes and use this technology to enable our business to flourish.
People's Daily Online has also published a list of seven development trends of the global big data industry in 2018, which is worth a look.
Introduction to big data processing frameworks
From a technical point of view, it is generally believed that the three classic papers published by Google between 2003 and 2006 truly opened the door to big data processing technology: GFS, BigTable and MapReduce, also known as Google's distributed computing "troika". They gave rise to Hadoop, the first big data processing framework to gain widespread attention in the open source community, and the technology stack based on HDFS, HBase and MapReduce continues to be influential today.
At the beginning, most big data processing was done offline, and its core processing logic was MapReduce. MapReduce divides all operations into two categories: Map and Reduce. Map splits the data into multiple pieces and processes them separately; Reduce merges the processed results to obtain the final result. MapReduce is a good abstraction but performed poorly in practice, and the Spark framework, with its powerful and high-performance batch processing, gradually replaced MapReduce as the mainstream of big data processing.
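To make the Map/Reduce division of labor concrete, here is a minimal word-count sketch using Hadoop's Java MapReduce API. It is the classic textbook example rather than anything specific to this article; the input and output paths passed on the command line are hypothetical placeholders.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: split each input line into words and emit (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // optional local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory (placeholder)
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (placeholder)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The Map output is shuffled so that all pairs with the same word reach the same reducer, which is exactly the "split, process separately, then merge" pattern described above.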
As time went on, many businesses were no longer satisfied with offline batch processing, and more and more scenarios required real-time processing. Spark Streaming addresses this need by simulating near-real-time processing with micro-batches, but the latency is still not as low as one might want; Twitter's Storm stream computing system, open-sourced in 2011, brought a lower-latency stream processing solution. The streaming model was then taken further by Flink, which began to challenge Spark's position by providing a general-purpose stream-first framework that also handles batch processing, rather than a simple streaming engine.
In discussions of big data processing, two terms come up frequently: processing framework and processing engine. In my personal understanding there is no essential difference between them; both refer to systems that compute over data, but most of the time the component that actually performs the data operations is called the engine, while a series of components playing similar roles together is called a framework. For example, Hadoop can be seen as a processing framework whose default processing engine is MapReduce, while another framework, Spark, can be integrated into Hadoop to replace MapReduce. This interoperation between components is one of the reasons big data systems are so flexible, so the terms engine and framework are often used interchangeably, or together, when talking about big data.
If big data processing frameworks are classified according to the state of the data they process, some systems process data in batches, some process data as streams, and some support both. We can therefore divide big data processing frameworks into three types: batch, stream and hybrid.
Batch processing
Batch processing decomposes a data processing task into smaller tasks, distributes them to instances across the cluster for computation, and then recombines the results computed by each instance into the final result. Batch systems typically operate on large amounts of static data and return results only after all the data has been processed. This model is suitable for work that requires access to the full set of records; for example, when calculating totals and averages, the data set must be treated as a whole rather than as a collection of individual records. Because of its excellent performance on large volumes of persistent data, batch processing is commonly used for historical data and serves as the underlying computing framework of many online analytical processing systems. Data sets used in batch processing typically have the following characteristics:
- Bounded: A batch data set represents a finite collection of data
- Persistence: Data is almost always stored in some type of persistent storage
- Bulk: Batch operations are often the only way to process extremely large data sets
Batch frameworks are typified by Hadoop, the first big data processing framework to gain significant attention in the open source community, and for a long time it was almost synonymous with big data technology. Hadoop is a distributed computing infrastructure. It has by now formed a broad ecosystem with a large number of algorithms and components, built around two cores: HDFS and MapReduce. HDFS (Hadoop Distributed File System) is a distributed file system that can be built on clusters of inexpensive machines, and MapReduce is a distributed task processing framework; these two components form the cornerstone of Hadoop. The core mechanism of Hadoop is the effective use and management of storage, memory and programs through HDFS and MapReduce. Hadoop combines many common, inexpensive servers into a distributed computing-and-storage cluster, providing storage and processing capability for big data.
Hadoop is actually an umbrella term for a large project that contains many subprojects:
- HDFS: the Hadoop distributed file system, an open source implementation of GFS, which stores files across all storage nodes in a Hadoop cluster. It provides reliable file storage on clusters of commodity PCs by keeping multiple replicas of each data block, so that server downtime or disk damage does not cause data loss (a minimal Java client sketch follows this list).
- MapReduce: A Java-based parallel distributed computing framework that is the native batch engine of Hadoop and the open source implementation of Google’s MapReduce paper.
- HBase: An open source distributed NoSQL database that references Google’s BigTable modeling.
- Hive: data warehouse tool.
- Pig: Big data analysis platform.
- Mahout: A collection of machine learning Java libraries for tasks such as classification, clustering, evaluation and pattern mining, providing implementations of classic machine learning algorithms.
- Zookeeper: An open source distributed coordination service, originally created at Yahoo, and an open source counterpart to Google's Chubby. In Hadoop it is mainly used to coordinate the cluster, for example NameNode management in a Hadoop cluster, and master election and state synchronization between servers in HBase.
- Sqoop: Used to transfer data between Hadoop and traditional databases.
- Ambari: Hadoop management tool, which can quickly monitor, deploy, and manage clusters.
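As referenced in the HDFS item above, here is a minimal sketch of writing and then reading a file through the HDFS Java client API (org.apache.hadoop.fs.FileSystem). The NameNode address and file paths are made-up placeholders; adjust them to your cluster's fs.defaultFS setting.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumed NameNode address; replace with your cluster's fs.defaultFS.
    conf.set("fs.defaultFS", "hdfs://namenode:9000");

    try (FileSystem fs = FileSystem.get(conf)) {
      Path target = new Path("/data/input/sample.txt");  // hypothetical HDFS path
      // Copy a local file into HDFS; HDFS replicates its blocks across data nodes.
      fs.copyFromLocalFile(new Path("sample.txt"), target);

      // Read the file back through the distributed file system API.
      try (FSDataInputStream in = fs.open(target);
           BufferedReader reader = new BufferedReader(
               new InputStreamReader(in, StandardCharsets.UTF_8))) {
        String line;
        while ((line = reader.readLine()) != null) {
          System.out.println(line);
        }
      }
    }
  }
}
```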
Hadoop and its MapReduce processing engine provide a proven batch processing model that is reliable, efficient and scalable enough to handle large amounts of data. Because users can build fully functional Hadoop clusters from very low-cost components, this inexpensive and effective processing technique can be applied flexibly in many situations. Compatibility and integration with other frameworks and engines make Hadoop the underlying foundation for a variety of workload processing platforms using different technologies. However, because this processing model relies on persistent storage and computing tasks need to read from and write to the cluster nodes many times, it is comparatively slow; on the other hand, its throughput is unmatched by other frameworks, which is characteristic of the batch processing model.
Stream processing
Stream processing computes over data in real time as it arrives. Instead of performing operations on an entire data set, it operates on each data item transmitted through the system. The data sets in stream processing are unbounded, which has several important consequences:
- The complete data set only represents the total amount of data that has been entered into the system so far.
- The working data set is perhaps more relevant and is limited to a single data item at any given time.
- Processing is event-based and has no “end” unless it explicitly stops. The results are available immediately and will continue to be updated as new data arrives.
In primary school we all did this kind of math problem: a pool has an inlet pipe and an outlet pipe; opening only the inlet pipe fills the pool in x hours, and opening only the outlet pipe drains it in y hours; if both pipes are opened at the same time, how long does it take to fill the pool? A stream processing system acts like this pool: it processes the water (data) flowing in through the inlet pipe and releases the processed water (data) through the outlet pipe. The data is like a flow of water that never stops and is processed as it passes through the pool. A system that processes such a never-ending stream of incoming data is called a stream processing system. Stream processing systems do not operate on an existing complete data set, but on data imported from external systems as it arrives. Stream processing systems can be divided into two types:
- Record-at-a-time processing: each data item is processed as it arrives; this is stream processing in the true sense.
- Micro-batch processing: data arriving within a short period of time is treated as a micro-batch, and the data within that micro-batch is processed together.
Because processing massive data in batch takes a long time, batch processing is not suitable for scenarios with strict latency requirements on results. The real-time performance of both record-at-a-time processing and micro-batch processing is much better than that of batch processing, so stream processing suits tasks that need near-real-time handling, such as log analysis, device monitoring and real-time website traffic. Because responding to changes in data promptly is common in these domains, stream processing is appropriate for data where changes or spikes must be acted on and where trends over time are the focus.
Well-known frameworks in the stream processing field include Twitter's open source Storm, LinkedIn's open source Samza, Alibaba's JStorm, Spark Streaming and so on. They all offer low latency, scalability and fault tolerance, and allow you to run data stream code by assigning tasks to a series of fault-tolerant machines for parallel execution. They also provide simple APIs that hide the underlying implementation. The terminology differs between frameworks, but the concepts are similar; Storm is used as the example here.
In Storm, the graph structure used for real-time computation is called a topology. A topology is submitted to the cluster; the master node distributes the code and assigns the work to worker nodes for execution. There are two kinds of components in a topology: spouts and bolts. A spout is the source of messages, responsible for emitting the data stream as tuples; a bolt transforms the data stream, performing computation, filtering and other operations, and can emit tuples onward to other bolts. The tuple emitted by a spout is an immutable array corresponding to a fixed set of key-value pairs.
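To illustrate the spout/bolt model, here is a minimal topology sketch assuming the Storm 2.x org.apache.storm Java API: a spout emits sentence tuples and a bolt splits them into words. The sample sentences, component names and the use of LocalCluster are illustrative only; a production topology would be submitted to a cluster with StormSubmitter instead.

```java
import java.util.Map;
import java.util.Random;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class SentenceTopology {

  // Spout: the source of the stream, emitting one sentence tuple at a time.
  public static class SentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private final String[] sentences = {"the quick brown fox", "jumped over the lazy dog"};
    private final Random random = new Random();

    @Override
    public void open(Map<String, Object> conf, TopologyContext context, SpoutOutputCollector collector) {
      this.collector = collector;
    }

    @Override
    public void nextTuple() {
      Utils.sleep(100);
      collector.emit(new Values(sentences[random.nextInt(sentences.length)]));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("sentence"));
    }
  }

  // Bolt: transforms the stream, here splitting each sentence into words.
  public static class SplitBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
      this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
      for (String word : tuple.getStringByField("sentence").split(" ")) {
        collector.emit(new Values(word));
      }
      collector.ack(tuple);  // acknowledge so Storm knows the tuple was fully processed
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word"));
    }
  }

  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("sentence-spout", new SentenceSpout());
    builder.setBolt("split-bolt", new SplitBolt()).shuffleGrouping("sentence-spout");

    // Run locally for demonstration; stop after ten seconds.
    try (LocalCluster cluster = new LocalCluster()) {
      cluster.submitTopology("sentence-topology", new Config(), builder.createTopology());
      Utils.sleep(10_000);
    }
  }
}
```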
If Hadoop is like a bucket that can fetch water from the well only one load at a time, Storm is like a faucet that, once turned on, delivers a continuous flow of water. Storm supports many languages, such as Java, Ruby and Python, and makes it easy to write and scale complex real-time computations in a cluster. Storm guarantees that every message will be processed, and a small cluster can process millions of messages per second.
Storm is a stream processing framework focused on extremely low latency; it can process very large amounts of data and deliver results with lower latency than other solutions. Storm suits pure stream processing workloads with strict latency requirements, guarantees that every message is processed, and can be used with multiple programming languages.
Hybrid processing
Among big data processing technologies, in addition to the pure batch and pure stream processing modes, there are frameworks that can handle both batch and stream workloads; these are called hybrid processing frameworks. While focusing on one kind of processing may fit particular scenarios well, hybrid frameworks provide a general solution for data processing: they handle both types of data with the same or related components and APIs, which simplifies differing processing requirements. The best-known hybrid frameworks are Spark and Flink.
Spark is a batch processing framework that also includes stream processing capabilities. It has its own real-time stream processing tool, can be integrated with Hadoop to replace MapReduce, and can also be deployed as a standalone cluster, in which case it needs to be paired with a distributed storage system such as HDFS. Spark's strength lies in its computing speed: like Storm, Spark works in memory and spills to disk when memory is full. Spark's original design was inspired by MapReduce, but unlike MapReduce it dramatically improves processing performance through an in-memory computing model and execution optimizations. In addition to Spark Core, originally developed for batch processing, and Spark Streaming for stream processing, it provides other programming models supporting graph computation (GraphX), interactive queries (Spark SQL) and machine learning (MLlib).
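As a small illustration of Spark's batch model, the following sketch uses the Java RDD API to run the same word count shown earlier for MapReduce, this time expressed as chained in-memory transformations. The local master setting and the HDFS paths are placeholders; Spark Streaming wraps this same model in micro-batches (DStreams) for near-real-time work.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    // "local[*]" runs Spark in-process for demonstration; on a cluster this would be set by the launcher.
    SparkConf conf = new SparkConf().setAppName("spark-word-count").setMaster("local[*]");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      // Batch input: a text file or directory, e.g. on HDFS (placeholder path).
      JavaRDD<String> lines = sc.textFile("hdfs://namenode:9000/data/input");

      JavaPairRDD<String, Integer> counts = lines
          .flatMap(line -> Arrays.asList(line.split(" ")).iterator())  // split lines into words
          .mapToPair(word -> new Tuple2<>(word, 1))                    // emit (word, 1) pairs
          .reduceByKey(Integer::sum);                                  // aggregate counts per word in memory

      counts.saveAsTextFile("hdfs://namenode:9000/data/output");       // placeholder output path
    }
  }
}
```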
Flink is a stream processing framework that can also handle batch tasks. Initially Flink focused on streaming data, which is the opposite of Spark's original design: whereas Spark splits streams into micro-batches for processing, Flink treats batch tasks as bounded streams, that is, as data streams with finite boundaries, and thus handles batch processing as a special case of stream processing. Besides stream and batch processing, Flink also provides SQL-like queries (Table API), graph computation (Gelly) and a machine learning library (Flink ML). Flink is compatible with native Storm and Hadoop programs and can run on YARN-managed clusters. Although Spark also offers batch and stream processing, its micro-batch architecture makes stream processing response times somewhat longer; Flink's stream-first approach achieves low latency, high throughput and true record-at-a-time processing. Although the two frameworks are often compared, both are moving toward greater compatibility, driven by market demand.
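For comparison, here is a minimal sketch of the same word count written against Flink's Java DataStream API, where the input is an unbounded socket stream and the count is updated record by record rather than per micro-batch. The host and port are placeholders (for a quick local test something like `nc -lk 9999` can feed the socket).

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class FlinkStreamingWordCount {
  public static void main(String[] args) throws Exception {
    // Entry point for the DataStream API; picks up a local or cluster environment automatically.
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Unbounded source: lines read from a socket (placeholder host and port).
    DataStream<String> lines = env.socketTextStream("localhost", 9999);

    DataStream<Tuple2<String, Integer>> counts = lines
        // Split each incoming line into (word, 1) tuples as it arrives.
        .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
          @Override
          public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String word : line.split(" ")) {
              out.collect(Tuple2.of(word, 1));
            }
          }
        })
        // Partition the stream by word and keep a running count per key.
        .keyBy(value -> value.f0)
        .sum(1);

    counts.print();
    env.execute("flink-streaming-word-count");
  }
}
```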
Offline and real-time computing
By the timeliness of data processing, big data technology can also be divided into offline computing and real-time computing.
Offline computing is performed on the premise that all input data are known before the calculation begins; the input data do not change, and the result is produced once the problem has been solved. Generally speaking, offline computing has the following features: large data volumes with long storage periods; complex batch computations over large amounts of data; data that is completely in place before the calculation and does not change; and batch results that can be conveniently queried afterwards.
Real-time computing, in computer science, refers to hardware and software systems subject to real-time constraints, that is, a maximum time limit between the occurrence of an event and the system's response. Real-time programs must respond within strict time limits, usually measured in milliseconds and sometimes microseconds. By contrast, a non-real-time system is one that cannot guarantee a response within the real-time constraints under all conditions: it may meet them in most cases, or even respond faster, but there is no guarantee under every condition. As the opposite of offline computing, real-time computing processes data as soon as it is generated, which tends to treat the data as a stream. Real-time computing is generally carried out over massive data and usually requires second-level latency. It mainly consists of two parts: real-time data ingestion and real-time data computation.
Real-time computing is represented by Storm, and offline computing by MapReduce. Real-time computing places high demands on processing latency; there is no unified standard for the latency threshold in big data, and it defaults to seconds. (Strictly speaking, a real-time system must guarantee a response within a fixed time bound, usually milliseconds or sub-second; Spark Streaming and Storm can only ensure low computation time, so they should really be called near-real-time systems.) Offline computing is less sensitive to processing time and usually only needs results on a T+1 basis, that is, the next day.
Because their application scenarios differ, the two kinds of computing engine accept data in different ways. The data source for real-time computing is usually a stream, with data processed item by item as it arrives, which is why it is also called streaming computing. The data source for offline computing is usually the static, complete data of one computing cycle, and computation starts only after all the data of that cycle has been received (typically the previous day's data is computed in the early morning), which is why it is also called batch computing. Sometimes these two ways of receiving data, and the different run patterns they imply (real-time tasks run continuously, offline tasks are scheduled), are what engineers use to distinguish real-time from offline computing engines.
Conclusion
Although the differences between stream processing and batch processing (especially micro-batch) may seem to be a matter of small time differences, they have fundamental implications for both the architecture of data processing systems and the applications built on them. Stream processing systems are designed to respond to data as it arrives, which requires an event-driven architecture: the system's internal workflow continuously monitors for new data and schedules processing as soon as it is received. In a batch system, by contrast, the internal workflow only checks for new data periodically and processes it only when the next batch window arrives.
The difference between stream and batch processing also matters for applications. Applications built for batch processing, by definition, handle data with a delay, and in a multi-step data pipeline these delays accumulate. In addition, the delay between the arrival of new data and its processing depends on the time until the next batch window: it ranges from almost nothing, for data arriving just before the window, to the full window length, for data arriving just after a batch has started. As a result, batch applications (and their users) cannot rely on consistent response times and must adjust for this inconsistency and greater latency.
Batch processing is typically appropriate for use cases where having the very latest data is not essential and where slower responses are tolerable. Stream processing is required for use cases that demand real-time interaction and real-time response.
Big data systems can use a variety of processing techniques. For batch-only workloads that are not time-sensitive, Hadoop, which is cheaper to implement than other solutions, may be a good choice. For stream-only workloads, Storm supports a wide range of languages and delivers very low-latency processing, but its default configuration can produce duplicate results and cannot guarantee ordering. Samza integrates tightly with YARN and Kafka, providing flexibility, easy use by multiple teams, and straightforward replication and state management. For mixed workloads, Spark provides high-speed batch processing and micro-batch stream processing; it enjoys broad support, with a variety of integrated libraries and tools for flexible integration. Flink offers true stream processing as well as batch capabilities, deep optimization, the ability to run tasks written for other platforms, and low-latency processing, though its practical adoption is still at an early stage.
The most suitable solution depends primarily on the state of the data to be processed, the requirements on processing time, and the desired results. Whether to use a full-featured solution or one narrowly focused on a specific problem requires careful weighing. As the field matures and becomes widely accepted, similar questions need to be considered when evaluating any emerging, innovative solution.