Author: Sun Jincheng

Apache Flink's runtime is written in Java, and the upcoming Apache Flink 1.9.0 release will introduce a new ML interface and a new Flink-Python module. So why did Flink add support for Python?

Flink, too, needs to add support for Python, and that alone is probably reason enough. But why do so many very popular projects support Python? What is the “magic” of the Python language that has attracted such prestigious projects? Let's look at the statistics.

1. Most popular programming languages

Here is the latest ranking of the most popular programming languages from industry analyst firm RedMonk, based on GitHub and Stack Overflow data. The top 10 are:

  • JavaScript
  • Java
  • Python
  • PHP
  • C++
  • C#
  • CSS
  • Ruby
  • C
  • TypeScript

Python came in third, while R and Go, which are also very popular at the moment, came in 15th and 16th. This fairly objective ranking is a testament to how large Python's audience is, and any project that supports Python quietly expands its own audience!

2. The hottest field of the Internet

At present, the hottest field of the Internet is arguably big data computing. The era of single-machine computing has passed: the growth of single-machine processing power lags far behind the growth of data. The following sections analyze why big data computing is the hottest field of the Internet today.

2.1 In the era of big data, data volumes grow by the day

With the rapid development of cloud computing, the Internet of Things, artificial intelligence, and other information technologies, the amount of data is increasing exponentially. One forecast projects that the total volume of global data will grow from 16.1 ZB to 163 ZB within roughly 10 years, a pace that has long since outstripped the storage and processing capacity of any single machine.

The unit of data volume in that forecast is the ZB. Let's briefly review the units used to measure data. The basic unit is the bit, and the units in increasing order are: bit, Byte, KB, MB, GB, TB, PB, EB, ZB, YB, BB, NB, and DB. The conversions between them are:

  • 1 Byte = 8 bits
  • 1 KB = 1,024 Bytes
  • 1 MB = 1,024 KB
  • 1 GB = 1,024 MB
  • 1 TB = 1,024 GB
  • 1 PB = 1,024 TB
  • 1 EB = 1,024 PB
  • 1 ZB = 1,024 EB
  • 1 YB = 1,024 ZB
  • 1 BB = 1,024 YB
  • 1 NB = 1,024 BB
  • 1 DB = 1,024 NB
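
As a quick illustration of the ladder above, here is a minimal Python sketch (nothing Flink-specific; the function name is just for illustration) that uses the 1,024-based steps to express the 163 ZB forecast in bytes:

```python
# Illustrative only: the 1,024-based unit ladder listed above.
UNITS = ["Byte", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB", "BB", "NB", "DB"]

def to_bytes(value, unit):
    """Convert a value in the given unit to bytes (each step up is a factor of 1,024)."""
    return value * 1024 ** UNITS.index(unit)

# The projected global data volume mentioned above, expressed in bytes:
print(f"163 ZB = {to_bytes(163, 'ZB'):,} bytes")
```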

Looking at these figures, we might wonder whether global data volumes are really that staggering, and where all the data comes from. In fact, when I first saw these numbers I was quite skeptical, but after examining the data carefully I found that global data really is growing rapidly. For example, the Facebook social platform handles tens of billions of photos every day, and the New York Stock Exchange produces several terabytes of transaction data every day. On Alibaba's Singles' Day in 2018, transaction volume reached a record 213.5 billion yuan, and Alibaba's internal monitoring log processing alone reached 162 GB per second. The Internet industry, with Alibaba as a representative example, is thus also driving the rapid growth of data. The transaction volumes of Alibaba's Double 11 over the past 10 years tell the same growth story.

2.2 The value of data comes from data analysis

How can big data generate value? There is no doubt that statistical analysis of big data can help us make better decisions. In a recommendation system, for example, we can infer a user's interests from their long-term purchasing habits and purchase records and then make accurate, effective recommendations. But faced with data this massive, which no single computer can process, how do we run statistical analysis over all of it within a limited time? For that, we have Google to thank for publishing three papers:

  • GFS – In 2003, Google published the Google File System paper, describing a scalable distributed file system for large, distributed applications that access large amounts of data.
  • MapReduce – In 2004, Google published the MapReduce paper, which describes a distributed approach to computing over big data. The main idea is to decompose a job into tasks, process them in parallel on many computing nodes of modest capacity, and then combine the partial results. MapReduce is a programming model for distributed parallel computing; a minimal single-process sketch of the model follows this list.

  • BigTable – In 2006, Google published the BigTable paper, which describes a typical distributed NoSQL database.
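
To make the MapReduce model above concrete, here is a minimal, single-process Python sketch of a word count. It is only an illustration of the programming model under simplifying assumptions: a real MapReduce system distributes the map and reduce phases across many nodes, and the function and variable names here are invented for this example.

```python
# Word count in the MapReduce style, run in a single process for illustration.
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in an input line.
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    # Reduce: combine all counts emitted for the same word.
    return word, sum(counts)

lines = ["hello flink", "hello python", "flink streams data"]

# "Shuffle": group the intermediate pairs by key, as the framework would.
grouped = defaultdict(list)
for line in lines:
    for word, count in map_phase(line):
        grouped[word].append(count)

results = [reduce_phase(word, counts) for word, counts in grouped.items()]
print(results)  # [('hello', 2), ('flink', 2), ('python', 1), ('streams', 1), ('data', 1)]
```

The appeal of the model is that the map and reduce functions themselves know nothing about distribution; the framework takes care of partitioning the input, shuffling intermediate results, and recovering from failures.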

Benefiting from Google's three papers, the Apache open source community quickly developed the Hadoop ecosystem: HDFS, the MapReduce programming model, and the NoSQL database HBase. It soon attracted broad attention from academia and industry around the world and was widely promoted and adopted. Alibaba, for one, launched its Hadoop-based “Ladder” project in 2008, Hadoop became the core of Alibaba's distributed computing stack, and by 2010 the cluster had grown to thousands of machines.

However, developing MapReduce jobs on Hadoop requires proficiency in Java and some understanding of how MapReduce works, which raises the barrier to entry. A number of open source frameworks therefore emerged to simplify MapReduce development, with Hive as a typical representative. HiveQL lets users describe MapReduce computations in an SQL-like way: a word count that would otherwise take dozens or even hundreds of lines of code can be expressed in a single SQL statement, greatly lowering the development threshold. In this way the Hadoop technology ecosystem kept growing, and Hadoop-based distributed big data computing gradually became widespread in industry.

2.3 Maximization of data value and timeliness

Every piece of data carries information, and the timeliness of information refers to the interval between when information leaves its source and when it is received, processed, transmitted, and used, as well as how efficiently that happens. The shorter the interval, the more timely the information, and in general the more timely the information, the greater its value. Consider a recommendation scenario: if a user buys a steamer and we can recommend a discounted oven within seconds, the probability that the user buys the oven is very high; if the same oven recommendation only arrives a day after the purchase, it is far less likely to be acted on. This timeliness problem exposes the main weakness of Hadoop-style batch computing: poor real-time performance. Driven by such demands, the typical real-time computing platforms were born at just the right moment: Spark emerged from UC Berkeley's AMP Lab in 2009, Nathan Marz proposed the core concepts of Storm at BackType in 2010, and Flink likewise began as a research project in Berlin, Germany, in 2010.

3. AlphaGo and artificial intelligence

After Google's AlphaGo defeated Lee Se-dol, the world champion and professional nine-dan Go player, 4-1 in 2016, people gradually began to look at deep learning in a new light, setting off a “craze” for artificial intelligence. Baidu Encyclopedia defines artificial intelligence as a new technical science that researches and develops theories, methods, technologies, and application systems for simulating, extending, and expanding human intelligence.

Machine learning is a core method and tool of artificial intelligence, and it occupies a prominent place in the big data platforms led by Spark and Flink. Spark has invested heavily in ML in recent years, and PySpark integrates well with many excellent Python libraries such as Pandas; in this respect Spark is far ahead of Flink. Flink has therefore faced its limitations head-on and is opening up a new ML interface and a new Flink-Python module in Flink 1.9!

So what does the importance of machine learning have to do with Python? Let’s take a look at the statistics to see which languages are the most popular for machine learning.

Jean-François Puget, a data scientist at IBM, once did an interesting analysis: he mined the trends in employers' job requirements on Indeed, a popular job search site, to gauge which skills are most in demand. Searching for “machine learning” on its own produces a similar picture.

His results show that the trend for Python closely tracks that of the red-hot term “machine learning” itself. Although this survey dates from 2016, it is enough to demonstrate Python's standing in machine learning, and the RedMonk statistics mentioned above confirm it.

Beyond such surveys, we can also explain why Python is so well suited to machine learning from the characteristics of the language itself and from its existing ecosystem.

Python is an object-oriented, interpreted programming language created by the Dutch programmer Guido van Rossum in 1989, with the first version released in 1991. As an interpreted language it is not the fastest, but the philosophy of its designers is that there should be “one, and preferably only one, obvious way to do it.” When adding new syntax, faced with multiple options, Python's developers typically choose the one with little or no ambiguity. Because it is easy to learn, Python has a large user base, and many scientific and machine learning libraries are written for it, such as NumPy and SciPy for numerical computing and Pandas for structured data operations. This rich ecosystem makes machine learning in Python remarkably convenient, and Python is arguably the most popular language for machine learning.
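
As a small taste of that ecosystem (the purchase log below is invented purely for illustration), a few lines of NumPy and Pandas are enough to load some structured data and summarize it, the kind of preparation nearly every machine learning workflow starts with:

```python
# Requires the third-party packages numpy and pandas (pip install numpy pandas).
import numpy as np
import pandas as pd

# A toy purchase log; in many other languages this kind of wrangling takes far more code.
df = pd.DataFrame({
    "user": ["a", "a", "b", "b", "c"],
    "item": ["steamer", "oven", "steamer", "phone", "oven"],
    "price": [199.0, 899.0, 189.0, 3999.0, 859.0],
})

# Average spend per user in a single expression with Pandas.
print(df.groupby("user")["price"].mean())

# The same column as a NumPy array for numeric work.
prices = df["price"].to_numpy()
print("mean:", np.mean(prices), "std:", np.std(prices))
```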

4. To summarize

This article has focused on why Apache Flink needs to support a Python API. The figures show that we now live in the era of big data and that the value of data comes from big data analysis; it is the importance of data timeliness that gave rise to the well-known Apache Flink stream computing platform.

In this era of big data computing, AI is a hot direction of development, and machine learning is one of its most important methods. Thanks to the characteristics of the language and the advantages of its ecosystem, Python has become the most important language for machine learning, and that is precisely Apache Flink's motivation for launching the Flink Python API. The Apache Flink Python API is a product of its times; its arrival was as natural and inevitable as water flowing downhill.

Tips:

This article is excerpted from Jinzhu's blog series “Apache Flink Talks”; the full series is available in the author's original posts.

Apache Flink Geek Challenge: 100,000 in prize money is waiting for you.

The link for details: tianchi.aliyun.com/markets/tia…