**“Everyone’s time is limited, and choosing a technology worth investing in becomes more important with that limited time.”**

I started working in 2008, and it has been 12 years since then. I have been working with data all along: I developed the kernels of many underlying big data frameworks (Hadoop, Pig, Hive, Tez, Spark), and I also spent years on higher-level data computing frameworks (Livy, Zeppelin) and on data application development, including data processing, data analysis, and machine learning. I am currently an Apache Member and a PMC member of multiple Apache projects. In 2018, I joined Alibaba’s real-time computing team to focus on the research and development of Flink.

Today I’d like to talk about how to evaluate whether a technology is worth learning, based on my own professional experience. I have worked on big data computing engines throughout my career, starting with Hadoop, then Pig, Hive, and Tez, then Spark as the next generation of computing engines, and most recently Flink. Personally, I was lucky: at each stage I was working on popular technologies, and I chose them based on my own interest and intuition. Looking back now, I think there are three broad dimensions for evaluating whether a technology is worth learning:

1. Technical depth
2. Ecological breadth
3. Ability to evolve

01 Technical Depth

Depth refers to whether the technology has a solid foundation, whether its moat is wide and deep enough, and whether it can easily be replaced by other technologies. In layman’s terms: does the technology solve important problems that other technologies cannot solve? There are two main points:

1. The technology was the first to solve a problem that no one else could solve.
2. Solving that problem brings significant value.

Take Hadoop, which I learned at the beginning of my career. Hadoop was a revolutionary technology when it first came out, because no other company in the industry had a complete solution for massive data, with the exception of Google, which was known to have in-house GFS and MapReduce systems. With the development of Internet technology, data volumes were growing day by day, and the need to process massive data had become urgent. Hadoop was created to address exactly that need.

As the technology matured, Hadoop’s ability to process large amounts of data came to be taken for granted, while its disadvantages drew constant criticism (poor performance, the complexity of writing MapReduce jobs, and so on). This is when Spark came into being and solved the stubborn problems of the Hadoop MapReduce computing engine. Spark’s computing performance far exceeds that of Hadoop, and its elegant, simple API catered to the needs of users at the time, making it popular among big data engineers.

Now I work on the research and development of Flink at Alibaba, mainly because I see the industry’s demand for real time and Flink’s dominant position in the field of real-time computing. The biggest challenge big data faced before was the sheer scale of the data (hence the name “big data”). After years of effort and practice in the industry, the scale problem has been basically solved. In the coming years, the bigger challenge will be speed, or real time. Real time in big data does not mean merely real-time data transmission or real-time data processing; it means real time from end to end. If any step is slow, the real-time performance of the whole big data system suffers.

In Flink’s view, everything is a stream. Flink’s stream-centric architecture is unique in the industry, and it yields superior performance, high scalability, end-to-end exactly-once semantics, and other features that make Flink the deserved king of stream computing.
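To make “everything is a stream” concrete, here is a minimal sketch of Flink’s DataStream API in Java, essentially the classic streaming word count from Flink’s quickstart. The socket source, port, and 5-second window are illustrative choices of mine, not specifics from this article:

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class StreamingWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // An unbounded source: Flink models all input, even bounded
        // batch data, as a stream of events.
        env.socketTextStream("localhost", 9999)
            .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                @Override
                public void flatMap(String line,
                                    Collector<Tuple2<String, Integer>> out) {
                    for (String word : line.toLowerCase().split("\\W+")) {
                        if (!word.isEmpty()) {
                            out.collect(new Tuple2<>(word, 1));
                        }
                    }
                }
            })
            .keyBy(0)                        // partition the stream by word
            .timeWindow(Time.seconds(5))     // 5-second tumbling windows
            .sum(1)                          // count per word, per window
            .print();

        env.execute("Streaming Word Count");
    }
}
```

Feed the socket with something like `nc -lk 9999` and counts are emitted continuously as events flow in, rather than once at the end of a batch job; that continuous model is the heart of the stream-centric architecture.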

There are three mainstream stream computing engines: Flink, Storm, and Spark Streaming.

Note: Spark Streaming can only be selected as a search term rather than a topic in Google Trends, so strictly speaking this comparison is not rigorous. But since what matters is the trend of the curves, the actual impact should be small.

As the Google Trends curves above show, Flink is in a period of rapid growth, Storm’s popularity is decreasing year by year, and Spark Streaming has almost reached a plateau. This shows that Flink has deep roots in stream computing, and at present no one can challenge its dominance in that field.

02 Ecological Breadth

Technical depth alone is not enough, because a single technology can only focus on one thing. To solve complex real-life problems, it must be integrated with other technologies, which requires a wide enough ecosystem. Ecological breadth can be measured along two dimensions:

1. Upstream and downstream ecology: the upstream and downstream of data, viewed from the perspective of data flow.
2. Vertical domain ecology: the integration with a specific domain or application scenario.

When Hadoop first came out, there were only two basic components, HDFS and MapReduce, which solved the problems of massive storage and distributed computing respectively. As it developed, however, more and more complex problems had to be solved, and HDFS and MapReduce could not easily handle some of them. That is when other projects in the Hadoop ecosystem emerged: Pig, Hive, HBase, and others solve, from the perspective of vertical domain ecology, problems that Hadoop itself cannot solve or cannot solve easily.

The same is true of Spark, which started as a replacement for the MapReduce computing engine and has since grown interfaces in a variety of languages. Frameworks built on top of it, such as Spark SQL, Spark Structured Streaming, MLlib, and GraphX, greatly enrich Spark’s usage scenarios and expand its vertical domain ecology. Spark’s support for a variety of data sources allies the computing engine with storage, creating a strong upstream and downstream ecosystem and laying the foundation for end-to-end solutions.
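As a rough illustration of that breadth, here is a minimal Java sketch that touches both dimensions at once: the data source API on the upstream/downstream side, and Spark SQL on the vertical side. The file paths, schema, and table name are hypothetical:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkEcosystemSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("EcosystemSketch")
                .master("local[*]")   // local mode, for illustration only
                .getOrCreate();

        // Upstream ecology: ingest data through a built-in data source.
        Dataset<Row> events = spark.read().json("/data/events.json");

        // Vertical ecology: analyze it with Spark SQL instead of raw RDD code.
        events.createOrReplaceTempView("events");
        Dataset<Row> daily = spark.sql(
                "SELECT date, COUNT(*) AS cnt FROM events GROUP BY date");

        // Downstream ecology: hand the result off to another storage format.
        daily.write().parquet("/data/daily_counts");

        spark.stop();
    }
}
```

The point is not any single line but the combination: one engine reading one system, computing with a domain-specific layer, and writing to another, which is exactly the end-to-end role the ecosystem enables.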

The ecosystem of the Flink project I work on is still in its infancy. I joined Alibaba not only because of Flink’s dominance as a stream computing engine, but also because of the opportunity in the Flink ecosystem. If you look at my career, you will notice a shift: I started out focused on the core framework layer of big data and gradually moved toward the surrounding ecosystem projects. One of the main reasons is my judgment of the whole big data industry: the first half of the big data battle was fought over the underlying frameworks, and it is nearing its end. In the future there will not be so many new technologies and frameworks at the bottom of the big data stack; each niche will see survival of the fittest, mature, and consolidate. The focus of the second half moves from the bottom to the top, toward the ecosystem. Past big data innovation was mostly at the IaaS and PaaS levels; in the future you will see more SaaS-style big data products and innovations.

Every time I talk about the big data ecosystem, I bring up this chart. It covers essentially all the big data scenarios you deal with day to day: from the data producers on the far left, through data collection and data processing, to data applications (BI + AI). You’ll find that Flink can be applied at every step, touching not only big data but also AI. Flink’s strength today, however, is stream processing; its ecosystem in the other areas is still in its infancy, and my own work is to improve Flink’s end-to-end capabilities across this chart.

03 Ability to Evolve

If the depth and breadth of a technology are both sound, then it is at least worth learning right now. But investing in a technology is also a bet over time: you don’t want your skills to become obsolete so quickly that you have to learn something new every year. So a technology worth investing in must have a sustained ability to evolve.

It’s been more than ten years since I first learned Hadoop, and it’s still widely used today. Plenty of public cloud vendors now compete with Hadoop, but you have to admit that when a company starts a big data division, the first thing it does is build a Hadoop cluster. And when we talk about Hadoop today, it’s no longer just Hadoop itself but a general term for the whole Hadoop ecosystem. For more on this, see Cloudera CPO Arun’s article [1], which I agree with.

[1]: medium.com/acmurthy/h…

The same goes for the Spark project. After its explosive growth around 2014 and 2015, Spark has reached a plateau. But Spark is still evolving and embracing change. Spark on K8s is a great example of Spark embracing cloud native, and Delta Lake and MLflow, both very popular in the Spark community, are proof of Spark’s strong ability to evolve. Spark is no longer just the engine that replaced MapReduce, but a general-purpose computing engine usable in a wide variety of scenarios.

It has been almost a year and a half since I joined Alibaba in 2018, and during that time I have witnessed Flink’s evolution firsthand.

First, Flink has shipped several major releases, integrating most of Blink’s functionality and greatly improving Flink SQL.
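For a flavor of what that looks like, here is a minimal Java sketch using the Blink planner (the Flink 1.9/1.10-era Table API). The Kafka topic, field names, and connector properties are illustrative assumptions and depend on the connector version actually deployed:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;

public class FlinkSqlSketch {
    public static void main(String[] args) {
        // The Blink planner is the part of Alibaba's Blink that was
        // contributed back into Flink.
        EnvironmentSettings settings = EnvironmentSettings.newInstance()
                .useBlinkPlanner()
                .inStreamingMode()
                .build();
        TableEnvironment tEnv = TableEnvironment.create(settings);

        // Declare a streaming source with SQL DDL (hypothetical topic and
        // fields; the WITH properties vary by connector version).
        tEnv.sqlUpdate(
                "CREATE TABLE clicks (" +
                "  user_name STRING," +
                "  url STRING" +
                ") WITH (" +
                "  'connector.type' = 'kafka'," +
                "  'connector.topic' = 'clicks'," +
                "  'format.type' = 'json'" +
                ")");

        // A continuous query: its result keeps updating as events arrive.
        Table counts = tEnv.sqlQuery(
                "SELECT user_name, COUNT(url) AS cnt " +
                "FROM clicks GROUP BY user_name");
    }
}
```

The notable part is that plain SQL runs as a continuous query over an unbounded stream, which is much of what the Blink integration brought to Flink SQL.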

Second, Flink’s support for K8s, Python, and AI workloads all demonstrate its strong ability to evolve.

A Few Tips

In addition to these three dimensions, I’d like to share a few tips that I use to evaluate a new technology.

1. Use Google Trends. Google Trends is a good indicator of where a technology is heading. The trend chart above comparing the three stream computing engines Flink, Spark Streaming, and Storm makes it easy to conclude that Flink is the king of stream computing.

2. Check the Awesome List on GitHub. One indicator of a technology’s popularity is whether it has an Awesome List on GitHub and how many stars that list has. Also, take a weekend to read through the Awesome List: it is essentially a distillation of the technology, and it gives you a pretty good idea of the technology’s value.

3. See whether tech evangelists endorse the technology on technical websites (I personally often read medium.com). There is a group of people in the tech world who are dedicated to technology and have good taste. If a technology is really good, evangelists will endorse it for free and share tips on how to use it.

04 Summary

Everyone’s time is limited, and with that limited time it becomes all the more important to choose a technology worth investing in.

These are my thoughts on how to evaluate whether a technology is worth learning, along with a brief review of the technology choices in my own career. I hope they are helpful to yours.

About the author:

Zhang Jianfeng, open source veteran, GitHub ID: zjffdu, Apache Member, formerly of Hortonworks. He is currently a senior technical expert in Alibaba’s Computing Platform Business Division. He is a PMC member of the Apache Tez, Livy, and Zeppelin projects, and an Apache Pig committer. He was lucky to encounter big data and open source very early, and hopes to contribute to big data and data science in the open source world.