We have been honored to watch Hadoop go from zero to king in a decade. Moved by how fast the technology keeps changing, I hope this article offers an in-depth look at Hadoop’s yesterday, today, and tomorrow, as well as a vision for the next decade.
This article is divided into four parts: technology, industry, applications, and outlook.
Technology
When the project was launched in 2006, the word “Hadoop” stood for just two components: HDFS and MapReduce. Ten years on, it stands for “Core Hadoop” (the core project) and the growing ecosystem around it. In this respect it is very similar to Linux, which also consists of a kernel and an ecosystem.
Now, with the release of stable version 2.7.2 in January, Hadoop has grown from the traditional Hadoop triad of HDFS, MapReduce, and HBase to a vast ecosystem of more than 60 related components, more than 25 of which are included in major distributions. These include data storage, execution engines, programming, and data access frameworks.
Hadoop evolved from the three-tier architecture of 1.0 into a four-tier architecture after 2.0 split resource management out of MapReduce into a generic framework:
Bottom layer: the storage layer, i.e. the HDFS file system
Middle layer: resource and data management, e.g. YARN and Sentry
Upper layer: computing engines such as MapReduce, Impala, and Spark
Top layer: higher-level encapsulations and tools built on those engines, such as Hive, Pig, and Mahout
Storage layer
HDFS has become the de facto standard for big data disk storage and is commonly used for online storage of massive log files. After years of development, its architecture and feature set have consolidated: important features such as HA, heterogeneous storage, and local short-circuit reads have all been implemented. Apart from erasure coding, there are few exciting features left on the roadmap.
As HDFS stabilizes, its community becomes less active, and its usage scenarios become mature and fixed, more and more file formats are being layered on top of it. Columnar formats such as Parquet already serve existing BI analysis scenarios well, and new storage formats will emerge to suit more scenarios, for example formats geared toward machine-learning applications. HDFS itself will keep extending support for emerging storage media and server architectures.
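To make the storage layer concrete, here is a minimal sketch of writing and reading a file through the standard HDFS Java API (FileSystem). The NameNode address and paths are hypothetical placeholders; a real deployment would pick up the configuration from core-site.xml rather than setting it in code.

```java
// Minimal sketch: writing and reading a log file through the HDFS Java API.
// The cluster address and paths below are hypothetical placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class HdfsQuickStart {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; normally provided by core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/logs/2016/01/app.log");

        // Write a line into HDFS.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back.
        try (FSDataInputStream in = fs.open(path);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }
        fs.close();
    }
}
```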
In 2015, HBase released version 1.0, which also marked its maturity. Recent HBase features include a cleaner client interface, region replicas for highly available reads, per-column-family flush, and separate RPC read and write queues. Going forward, HBase is unlikely to add big new features and will focus on stability and performance, especially large-heap support and memory GC efficiency (a sketch of a replica-aware read follows below).
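As a hedged illustration of the region-replica read feature mentioned above, the sketch below issues a Get with TIMELINE consistency through the HBase 1.x client API; the table, row key, and column names are hypothetical.

```java
// Minimal sketch: a read against an HBase 1.x table with region replicas enabled,
// using TIMELINE consistency so a secondary replica may serve the read.
// Table, family, and qualifier names are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Consistency;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseTimelineRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user_bills"))) {

            Get get = new Get(Bytes.toBytes("user123#201601"));
            get.setConsistency(Consistency.TIMELINE);  // allow reads from replicas

            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("amount"));
            System.out.println("amount = " + Bytes.toString(value));
            System.out.println("served by a stale replica? " + result.isStale());
        }
    }
}
```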
Kudu is a new distributed storage architecture that Cloudera announced in October 2015. It is completely independent of HDFS, and its implementation references the Spanner paper Google published in 2012. Given Spanner’s success inside Google, Kudu is touted as an important part of the next-generation analytics platform for fast data query and analysis, bridging the gap between HDFS and HBase. Its emergence will bring the Hadoop market even closer to the traditional data warehouse market.
The Apache Arrow project defines a specification for processing and exchanging columnar in-memory data. Developers from the Apache Hadoop community are working to establish it as a de facto standard across big data projects.
Arrow is backed by big data heavyweights such as Cloudera and Databricks, and many of its committers are core developers of other big data projects such as HBase, Spark, and Kudu. Given that Tachyon and its peers have not yet found much practical traction, Arrow’s high-profile debut could become the future standard interface for in-memory analytics data.
Control layer
The management and control layer is divided into data management and resource management.
As Hadoop clusters grow and serve more external workloads, sharing and utilizing resources effectively and reliably becomes a problem the management and control layer must solve. YARN, which grew out of MapReduce 1.0, has become the general-purpose resource management platform of Hadoop 2.0. Given Hadoop’s market position, the industry is very optimistic about YARN’s future in resource management.
Other resource management frameworks such as Mesos, as well as the emerging Docker ecosystem, will influence YARN’s future development. To take Hadoop further, YARN still has plenty of work ahead: improving performance, integrating deeply with container technology, adapting better to short-task scheduling, and offering more complete multi-tenant support and fine-grained resource control (a small multi-tenant configuration sketch follows).
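As a small, hedged illustration of today’s multi-tenant controls, the snippet below submits a MapReduce job to a dedicated YARN queue with explicit container memory requests; the queue name "analytics" is a hypothetical example from a capacity-scheduler setup.

```java
// Minimal sketch: submitting a MapReduce job to a dedicated YARN queue with explicit
// container memory requests, the basic unit of multi-tenant resource control.
// The queue name "analytics" is a hypothetical capacity-scheduler queue.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class QueueAwareJob {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.job.queuename", "analytics");    // target YARN queue
        conf.setInt("mapreduce.map.memory.mb", 2048);         // per-map container memory
        conf.setInt("mapreduce.reduce.memory.mb", 4096);      // per-reduce container memory

        Job job = Job.getInstance(conf, "queue-aware-etl");
        // ... set mapper/reducer/input/output as usual ...
        return job;
    }
}
```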
On the other hand, the security and privacy of big data are receiving more and more attention. Hadoop relies solely on Kerberos for authentication, while each component implements its own authorization policy. The open source community has never seemed to care much about security; without components such as Hortonworks’ Ranger or Cloudera’s Sentry, a big data platform is basically neither secure nor reliable.
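To show what “relying on Kerberos” looks like in practice, here is a minimal sketch of a Java client logging in from a keytab via UserGroupInformation before accessing HDFS; the principal and keytab path are hypothetical.

```java
// Minimal sketch: authenticating to a Kerberos-secured Hadoop cluster from a Java
// client before touching HDFS. Principal and keytab path are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosLogin {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);

        // Log in from a keytab so long-running services need no interactive kinit.
        UserGroupInformation.loginUserFromKeytab(
                "etl-service@EXAMPLE.COM", "/etc/security/keytabs/etl.keytab");

        // Subsequent Hadoop client calls run as the authenticated principal.
        FileSystem fs = FileSystem.get(conf);
        System.out.println(fs.exists(new Path("/secure/data")));
    }
}
```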
Cloudera’s new RecordService component gives Sentry a head start in the security race. RecordService not only provides consistent security granularity across components, but also offers an underlying record-based abstraction (a bit like Spring, replacing the original Kite SDK) that decouples upper-layer applications from the underlying storage while providing a reusable data model across components.
Computing engine layer
One of the biggest differences between the Hadoop ecosystem and other ecosystems is the idea of “one platform, many applications.” A traditional database has a single engine and handles only relational workloads, so it is “one platform, one application”; the NoSQL market has hundreds of products, each built for a different scenario and completely independent, so it is a “many platforms, many applications” model. Hadoop instead shares HDFS storage at the bottom, with many upper-layer components each serving a different class of application. For example:
Deterministic data analysis: mostly simple statistical tasks such as OLAP, where fast response matters; implemented by components such as Impala.
Exploratory data analysis: discovering relevance in information, e.g. search, where collecting full unstructured information matters; implemented by components such as Search.
Predictive data analysis: machine learning tasks such as logistic regression, where the sophistication of the model and raw computing power matter; implemented by components such as Spark and MapReduce.
Data processing and transformation: ETL tasks such as data pipelines, where I/O throughput and reliability matter; implemented by components such as MapReduce (see the sketch after this list).
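For the last category, the canonical example remains the MapReduce word count; the sketch below is the standard Java version, with placeholder input and output paths.

```java
// The canonical MapReduce word count: a minimal example of the batch
// "data processing and transformation" category. Paths are placeholders.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);               // emit (word, 1)
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));   // emit (word, total)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/input"));
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```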
One of the most eye-catching components is Spark. With IBM announcing plans to train one million Spark developers, Cloudera declaring Spark the default general-purpose execution engine for Hadoop in its One Platform initiative, and Hortonworks fully backing Spark, we believe Spark will be at the heart of big data analytics in the future.
While Spark is fast, it still needs work on scalability, stability, and manageability in production environments. Its stream processing capability is also limited: achieving sub-second latency or very high-volume ingestion still requires other stream processing products. Cloudera’s stated goal of making Spark’s streaming technology suitable for 80% of use cases acknowledges this gap. That said, for real-time analysis (as opposed to simple filtering or distribution of data), we do see Kafka plus Spark Streaming replacing many earlier implementations built on streaming engines such as S4 or Storm (a minimal sketch of this pattern follows).
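Here is a minimal, hedged sketch of the Kafka + Spark Streaming pattern referred to above, using the direct-stream API of the spark-streaming-kafka (Kafka 0.8) integration to count events per micro-batch; broker addresses and the topic name are hypothetical.

```java
// Minimal sketch of the Kafka + Spark Streaming pattern: consume a Kafka topic with
// the direct-stream API and count events per micro-batch. Assumes the
// spark-streaming-kafka (Kafka 0.8 "direct") artifact is on the classpath.
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import kafka.serializer.StringDecoder;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

public class KafkaSparkStreamingSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("kafka-event-count");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        Map<String, String> kafkaParams = new HashMap<>();
        kafkaParams.put("metadata.broker.list", "broker1:9092,broker2:9092");  // hypothetical
        Set<String> topics = new HashSet<>(Arrays.asList("clickstream"));       // hypothetical

        JavaPairInputDStream<String, String> stream = KafkaUtils.createDirectStream(
                jssc, String.class, String.class,
                StringDecoder.class, StringDecoder.class,
                kafkaParams, topics);

        // Count events in each 5-second micro-batch.
        stream.count().print();

        jssc.start();
        jssc.awaitTermination();
    }
}
```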
Spark’s popularity will gradually send MapReduce and Tez to the museum.
The service layer
The service layer wraps the programming APIs of the underlying engines and offers business users more abstract access models such as Pig and Hive.
One of the hottest areas is the SQL-on-Hadoop market for OLAP. Right now, 70% of Spark usage comes from Spark SQL! Which SQL-on-Hadoop engine will win: Hive, Facebook’s Presto, Apache Phoenix, Spark SQL, Cloudera’s Impala, MapR’s Drill, IBM’s BigSQL, or Pivotal’s HAWQ?
This is perhaps the most fragmented area of all. Technically, almost every engine has its own niche, and ecologically, every vendor has its own favorite, so the SQL engine on Hadoop is no longer a purely technical game (for the sake of neutrality we will not comment further here). We can expect these SQL tools to consolidate in the future; some products are already falling behind the competition, and we look forward to the market’s verdict.
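Whichever engine wins, most of them are consumed the same way today: over JDBC. The hedged sketch below queries HiveServer2 as an example; Impala, Spark SQL’s Thrift server, Presto, and Drill expose similar JDBC endpoints with their own drivers. Host, credentials, and table names are hypothetical.

```java
// Minimal sketch: querying a SQL-on-Hadoop engine through JDBC (HiveServer2 here).
// Host, database, user, and table names are hypothetical.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver2.example.com:10000/default", "analyst", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT dt, COUNT(*) AS pv FROM web_logs GROUP BY dt ORDER BY dt")) {
            while (rs.next()) {
                System.out.println(rs.getString("dt") + "\t" + rs.getLong("pv"));
            }
        }
    }
}
```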
Tools abound, the most important being visualization, task management and data management.
Many open source tools, such as HUE and Zeppelin, support Hadoop-based query programming with real-time graphical presentation. A user writes some SQL or Spark code together with tags describing it, picks a visualization template, executes it, and saves the result for others to reuse. This pattern is sometimes called “agile BI.” Commercial products in this space, such as Tableau and Qlik, are even more competitive.
Oozie, the earliest of the scheduling tools, can chain several MapReduce jobs into a workflow. Later tools such as NiFi and Kettle provide more powerful scheduling and are worth trying.
There is no doubt that Hadoop’s data governance is relatively simple compared with the traditional database ecosystem. Atlas, Hortonworks’ new data governance tool, is making progress but is not yet fully mature. Cloudera Navigator, the core of Cloudera’s commercial edition, brings together lifecycle management, data lineage, security, auditing, SQL migration tools, and more. With the acquisition of Explain.io, Cloudera integrated its product as Navigator Optimizer, which helps users migrate traditional SQL applications to Hadoop and offers optimization suggestions, saving months of manual work.
Algorithms and machine learning
Automatically and intelligently mining the value of data through machine learning is the most attractive vision of big data and Hadoop, and the ultimate expectation many enterprises place on their big data platforms. As more data becomes available, the value of future big data platforms will depend increasingly on the depth of their computational intelligence.
Machine learning is slowly stepping out of the ivory tower, turning from a subject studied by a small number of academics into a data analysis tool used by many enterprises, and entering more and more of our daily lives.
Besides long-standing open source projects such as Mahout, MLlib, and Oryx, open-source machine learning saw a number of notable events this year, with several big names arriving:
In January 2015, Facebook open-sourced its cutting-edge deep learning modules for Torch.
In April 2015, Amazon launched its Machine Learning platform, Amazon Machine Learning, a fully hosted service that enables developers to easily develop and deploy predictive models using historical data.
In November 2015, Google open-sourced its machine learning platform TensorFlow.
In the same month, IBM open-sourced SystemML, which became an official Apache Incubator project.
At the same time, Microsoft Research Asia open-sourced its distributed machine learning toolkit DMTK on GitHub. DMTK consists of a distributed machine learning framework and a set of distributed algorithms that can be applied to big data.
In December 2015, Facebook open-sourced Big Sur, a server design equipped with high-performance graphics processing units (GPUs) for neural network and deep learning research.
Industry
Tens of thousands of enterprises now use Hadoop and make money from it; almost every large enterprise already uses, or plans to try, Hadoop technology to some extent. By their positioning and use of Hadoop, companies in the Hadoop industry fall into four tiers:
Tier 1: These companies have embraced Hadoop as a big data strategic weapon.
Tier 2: These companies productize Hadoop.
Tier 3: These companies create products that add value to the Hadoop ecosystem as a whole.
Tier 4: These companies consume Hadoop and provide Hadoop-based services to companies smaller than those in tiers 1 and 2.
Today, Hadoop is technically proven, accepted, and even mature. Nothing represents Hadoop’s trajectory better than its commercial distributions. Since Cloudera became the first commercial Hadoop company in 2008 and shipped the first Hadoop distribution in 2009, many large companies have joined the Hadoop productization bandwagon.
The word “distribution” is a symbol of open source culture. On the surface, any company can produce a “distribution” by packaging open source code and adding more or fewer ingredients; behind it, however, lie the work of selecting valuable components from a huge ecosystem, guaranteeing their compatibility and integration, and providing supporting services.
Before 2012, distributions were mainly about patching Hadoop, and several vendors maintained privatized versions, which reflected the quality defects of the Hadoop products of that era. The high activity of the HDFS and HBase communities during the same period confirms this.
Later, distributions focused more on tools, integration, and management: not on building a “better Hadoop” but on making better use of the “existing” Hadoop.
After 2014, with the rise of Spark and other OLAP products, the offline scenarios Hadoop excels at were largely solved, and vendors turned to expanding the ecosystem in order to adapt to new hardware and open up new markets.
Cloudera proposes a hybrid open source architecture: the core component, Cloudera’s Distribution including Apache Hadoop (CDH), is open source, free, and kept in sync with the Apache community, while the data governance and system management components are closed source and require a commercial license, making it easier for customers to use Hadoop technology, for example to deploy security policies. In those commercial components, Cloudera also provides operational capabilities needed to run Hadoop in enterprise production environments that the open source community does not cover, such as rolling upgrades without downtime and asynchronous disaster recovery.
Hortonworks adopts a 100% open source strategy with its product, the Hortonworks Data Platform (HDP). All of its software is open source and free to use, and Hortonworks sells commercial technical support services. Compared with CDH, it uses the open source Ambari for management, Atlas for data governance, Ranger instead of Sentry for security, and for SQL it continues to stick with Hive.
MapR follows the traditional software vendor model with a proprietary implementation: users must buy a license before using the software. For OLAP it mainly pushes Drill, without excluding Impala.
Mainstream public clouds such as AWS and Azure now offer Hadoop-based PaaS services in addition to their original virtual machine IaaS offerings, and this market will grow faster than privately deployed Hadoop in the future.
Applications
The Hadoop platform unleashes unprecedented computing power while dramatically reducing its cost. This leap in the productivity of the underlying core infrastructure is inevitably followed by the rapid build-out of the big data application layer.
Applications on Hadoop can be roughly divided into the following two categories:
IT optimization
Moving existing applications and workloads onto the Hadoop platform to handle more data, gain better performance, or lower costs. This benefits the enterprise by improving the input-output ratio and reducing production and maintenance costs.
Hadoop has proven to be a very suitable solution in several of these applications over the years, including:
Online query of historical log data: the traditional solution stores the data in an expensive relational database, which is costly and inefficient and cannot handle the high concurrency of online services. An HBase-based storage and query engine, suited to fixed (non-ad-hoc) query scenarios such as flight lookups or querying an individual’s transaction records, has become the standard solution for online query applications. China Mobile, for example, explicitly specifies in its enterprise technical guidance that HBase be used to implement the bill query service of all its branches (a sketch of this pattern follows).
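A hedged sketch of this fixed-scenario query pattern: if the row key encodes account and month, one subscriber’s bills over a month range become a simple range scan in HBase. The table layout, row keys, and values below are hypothetical.

```java
// Minimal sketch of a fixed-scenario online query: row keys encode "account#month",
// so one account's bills over a month range become a simple range scan.
// Table and column names are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class BillRangeQuery {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("billing"))) {

            // Row key design: <account>#<yyyyMM>; scan six months for one account.
            Scan scan = new Scan(Bytes.toBytes("13800000000#201507"),
                                 Bytes.toBytes("13800000000#201601"));
            scan.addFamily(Bytes.toBytes("d"));

            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()) + " -> "
                            + Bytes.toString(row.getValue(Bytes.toBytes("d"),
                                                          Bytes.toBytes("total"))));
                }
            }
        }
    }
}
```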
ETL tasks: many vendors provide excellent ETL products and solutions that are widely used, but in big data scenarios traditional ETL meets serious challenges in performance and QoS guarantees. Most ETL tasks are light on computation and heavy on I/O, while traditional IT hardware, such as the minicomputers that host databases, is designed for compute-intensive tasks and delivers at most tens of gigabytes of I/O even with the latest networking technology.
Hadoop’s distributed architecture provides a perfect fit: its shared-nothing, scale-out design offers linearly scalable I/O and keeps ETL tasks efficient, while the framework’s load balancing and automatic failover ensure the reliability and availability of those tasks.
Data warehouse offload: traditional data warehouses carry many offline batch processing workloads, such as daily and monthly reports, that occupy a large share of hardware resources. These are exactly the tasks Hadoop does well.
An oft-asked question is whether Hadoop can replace the data warehouse, that is, whether an enterprise can use free Hadoop to avoid buying expensive data warehouse products. Michael Stonebraker, a leading voice in the database world, said in a technical exchange that the overlap between data warehouse and Hadoop scenarios is so large that the two markets will merge in the future.
We believe that Hadoop will sooner or later displace today’s products in the data warehouse market, but the Hadoop that does so will not be the Hadoop of today. For now, Hadoop is only a complement to data warehouse offerings, forming mixed architectures with them to serve upper-layer applications.
Business optimization
Implementing algorithms and applications that were not feasible before, incubating new products and businesses out of the existing production line, and creating new value. The new business brings the enterprise new markets and customers and thereby increases revenue.
With the powerful computing capability Hadoop provides, specialized big data applications have emerged in almost every vertical: from banking (fraud detection, credit reporting, and so on) and health care (especially genomics and drug research) to retail and services (personalized and intelligent services, such as Uber’s automatic dispatch).
Inside the enterprise, a variety of tools have emerged to help users operate core functions. For example, big data can help sales and marketing determine which customers are most likely to buy, by combining large volumes of internal and external data updated in real time. Customer service applications can help personalize service, and HR applications can help figure out how to attract and retain the best employees.
Why has Hadoop been so successful? The question may sound like hindsight, but as we marvel today at how dominant Hadoop has become in just ten years, it is worth asking why it happened at all. Comparing it with other projects of the same period, we believe a combination of factors produced this miracle:
Technical architecture: Hadoop champions the idea of moving computation to the data, and its scalability, reliability, and flexible multi-layer architecture are the intrinsic reasons it prevailed over other products. No other system of comparable complexity could meet users’ changing needs so quickly.
Hardware trends: the scale-up approach embodied by Moore’s Law hit technical bottlenecks, and ever-growing computing demands forced software to look for distributed solutions. Meanwhile, advances in commodity PC servers made cheap clusters of nodes possible, giving Hadoop an attractive cost-performance advantage.
Engineering validation: when Google published the GFS and MapReduce papers, the systems had already seen considerable internal deployment and practical use, and Hadoop was likewise battle-tested at Yahoo and other Internet companies before being promoted to the wider industry, which greatly increased the industry’s confidence and sped up adoption. The large number of deployments in turn contributed to Hadoop’s maturity.
Community driven: the Hadoop ecosystem has remained fully open, and the friendly Apache license has all but eliminated barriers to entry for vendors and users, building the largest, most diverse, and most active developer community in history, one that keeps pushing the technology forward and keeps Hadoop ahead of many earlier and contemporary projects.
Focus on the foundation: Hadoop’s roots lie in building a distributed computing framework that makes application developers’ lives easier. The industry’s ongoing effort has concentrated on consolidating that foundation, continuing to bear fruit in areas such as resource management and security and clearing the way for enterprise production deployments.
Next generation analysis platform
The Apache Hadoop community has grown at a frenetic pace over the past decade and is now the de facto standard for big data platforms, but there is still much to do. The future value of big data applications lies in prediction, and the core of prediction is analysis. What will the next generation of analytics platforms look like? It must face, and solve, the following problems:
More data, faster data.
Updated hardware features and architecture.
More advanced analysis.
Therefore, in the coming years we will see the next generation of enterprise big data platforms take shape in the post-Hadoop era:
Memory computing comes of age. The growth of advanced analytics and real-time applications demands ever more processing power, shifting the bottleneck of data processing from I/O back to the CPU. Spark, centered on in-memory computing, will replace the I/O-throughput-oriented MapReduce as the default general-purpose engine for distributed big data processing. As a general-purpose engine supporting both batch and near-real-time stream processing, Spark can cover more than 80% of application scenarios.
Spark’s core, however, is batch processing: it is good at iterative computation but not suited to every scenario, so it is complemented by tools designed for specific workloads, including:
A) OLAP. OLAP, especially aggregation-centric online statistical analysis, stores, organizes, and processes data quite differently from offline batch applications.
B) Knowledge discovery. Unlike traditional applications that solve known problems, the value of big data lies in discovering and solving unknown problems, so the goal is to amplify the analyst’s intelligence as much as possible and turn data retrieval into data exploration.
Unified data access management. Today, because data is stored in different formats and places, users must use different interfaces, models, and even languages to access it, and the differing granularity of data stores creates many challenges for security control, management, and governance. The trend is to separate low-level deployment and operations details from upper-level business development, which requires the platform to provide:
A) Security. The big data platform should enforce data security policies with the same rigor as traditional data management systems, including integrated user and permission management across components and tools, fine-grained access control, encryption and decryption, and auditing.
B) A unified data model. By describing data through an abstract definition, data models can be managed centrally and parsing code reused, while upper-layer processing is shielded from the details of the underlying storage, decoupling development and processing from operations, maintenance, and deployment (a minimal Avro-based sketch follows).
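As a hedged illustration of the unified-data-model idea, the sketch below uses Apache Avro: the schema is defined once as a data description, and the same generic parsing code works no matter where the serialized bytes are stored. The schema and field names are hypothetical.

```java
// A hedged illustration of a "unified data model": define the schema once, and the
// same generic code serializes and parses records wherever the bytes live
// (HDFS, HBase, Kafka, ...). Schema and field names are hypothetical.
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

import java.io.ByteArrayOutputStream;

public class UnifiedModelSketch {
    private static final Schema SCHEMA = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
          + "{\"name\":\"user\",\"type\":\"string\"},"
          + "{\"name\":\"ts\",\"type\":\"long\"}]}");

    public static void main(String[] args) throws Exception {
        // Serialize with the shared schema...
        GenericRecord event = new GenericData.Record(SCHEMA);
        event.put("user", "alice");
        event.put("ts", 1454284800000L);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(SCHEMA).write(event, encoder);
        encoder.flush();

        // ...and any component holding the same schema can parse the bytes back.
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord parsed = new GenericDatumReader<GenericRecord>(SCHEMA).read(null, decoder);
        System.out.println(parsed.get("user") + " @ " + parsed.get("ts"));
    }
}
```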
Simplified real-time applications. Users now care not only about collecting data in real time, but also about making it visible and analyzable online as soon as possible. Both the older Delta architecture and today’s Lambda architecture attempt to provide an answer for fast data. Cloudera’s newly announced Kudu, while not yet production-ready, is currently the most promising answer to this problem: a single platform that simplifies the storage and access of fast data, and a new option for log-style data analysis in the future.
Looking ahead to the next decade
Ten years from now, “Hadoop” will simply be the name of an ecosystem and a set of standards. The underlying storage layer will include not only today’s architectures such as HDFS, HBase, and Kudu but new ones as well, and the processing components above will be as numerous as apps in an app store. Any third party will be able to develop its own components against Hadoop’s data access and computing protocols, and users will pick components from this marketplace for automatic deployment based on their own data characteristics and computing needs.
Of course, there are some obvious trends that must affect Hadoop’s progress:
Fifty percent of big data tasks are already running in the cloud, and in three years that number could rise to 80 percent. Hadoop’s evolution into the public cloud requires more secure localization support.
The Hadoop community will not stand by as advances in fast hardware force it to rethink its roots.
The development of the Internet of Things will lead to massive, distributed and decentralized data sources. Hadoop will adapt to this evolution.
What will happen in the next ten years? Here are my guesses:
The SQL and NoSQL markets will merge, NewSQL and Hadoop technologies will borrow from each other and eventually unify, and the Hadoop and data warehouse markets will converge, though product fragmentation will continue.
Hadoop will integrate with other resource management technologies and cloud platforms, absorbing Docker and unikernel technologies to unify resource scheduling and management and to provide complete multi-tenancy and QoS capabilities, and enterprise data analysis centers will converge onto a single architecture.
Enterprise big data products will become scenario-driven. Companies that today sell products and technology directly will mature and shift toward services, while more and more new companies offer professional, scenario-specific solutions, such as personal online credit suites and services.
The big data platform itself will be “segmented” by scenario. Unlike today, when big data almost automatically means Hadoop or some particular framework, future data platforms will offer tiered solutions and products segmented by data scale (from tens of TB to ZB) and by application scenario (various dedicated application clusters), and even customized integrated products.
Afterword
Hadoop has become the “new normal” for enterprise data platforms. We have been honored to watch Hadoop go from zero to king in a decade. Moved by the ever-changing technology, we hope this article offers our own reading of Hadoop’s yesterday, today, and tomorrow, as a gift for Hadoop’s 10th birthday.
The author’s knowledge is limited and time was short; please forgive the superficial and rough passages and offer your corrections. Some of the content is quoted from the web, and in some cases the original source could not be traced; we ask the original authors’ understanding.
The future of big data is bright, and Hadoop will be an essential skill for enterprise software; we hope to witness it together.