On April 25–26, Flink Forward, the first fully online conference among Apache's top-level projects, was held. It focused on classic real-time computing scenarios and business stories from major Internet companies such as Alibaba, Google, AWS, Uber, Netflix, Dell EMC, Weibo, and Didi. Nineteen high-quality talks were translated and explained in Chinese by core Flink contributors, and all of them can be watched online for free.

Flink Forward was streamed live from Beijing, Shanghai, and Hangzhou for one and a half days, attracting nearly 20,000 online views from developers around the world. In addition to the high-quality content, this edition of Flink Forward was also the first to collect audience questions: viewers of the live stream could raise questions about each talk and invite the speaker to answer them online.

Download the Flink Forward slides (Chinese highlights edition) ▼

Follow the "Flink Chinese Community" WeChat official account and reply with the keyword [0425PPT] to download all the slides shared at the conference.

shimo.im/sheets/Twgy…

Click below for the live-stream replays and the Flink community learning package:

  • Flink Forward Global Online Conference — Chinese highlights (April 25)
  • Flink Forward Global Online Conference — Chinese highlights (April 26)

The following are some representative questions, along with the speakers' answers.

Keynote: Introducing Stateful Functions 2.0: Stream Processing meets Serverless Applications

**Speaker:** Yu Li (Juding), Apache Flink Committer, Apache Flink 1.10 Release Manager, senior technical expert at Alibaba.

Q: Does PyFlink support Stateful Functions? How is state managed in Stateful Functions? A: PyFlink does not support Stateful Functions yet.

Stateful Functions manages state the same way as a normal streaming job; no special handling is done. Actor systems (or applications) differ from stream processing in that stream processing is a DAG (directed acyclic graph), while actor systems can have cycles. Stateful Functions adds support for a feedback loop, but it does not change the runtime kernel, so it can be understood as reusing streaming state management.

Roundtable | Lyft: a quasi-real-time data analytics platform based on Flink

Commentator: Wang Yang (Yiqi), Alibaba technical expert.

Q: Does writing Parquet files in real time with Flink produce many small files? How do you deal with them? A: StreamingFileSink does produce small files when writing Parquet data, which hurts performance when Presto or the Hive client parses them. Lyft uses a SuccessFile Sensor to let Airflow automatically schedule ETL jobs that perform compaction and deduplication, and the processed results are swapped in to replace the raw-event partitions. This yields better data quality and improved interactive query performance.
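The compaction-and-deduplication step can be sketched in a few lines of Python. This is a toy illustration only, using plain text files in place of Parquet and a function name (`compact_partition`) that is our own, not Lyft's:

```python
import os
import tempfile

def compact_partition(part_dir: str, out_name: str = "compacted") -> str:
    """Merge the many small files in a partition directory into one file,
    de-duplicating records along the way (toy stand-in for Parquet compaction)."""
    seen = set()
    rows = []
    for name in sorted(os.listdir(part_dir)):
        with open(os.path.join(part_dir, name)) as f:
            for line in f:
                rec = line.rstrip("\n")
                if rec not in seen:            # deduplication
                    seen.add(rec)
                    rows.append(rec)
    out_path = os.path.join(part_dir, out_name)
    with open(out_path, "w") as f:
        f.write("\n".join(rows))
    return out_path
```

In the real pipeline the compacted output would be written to a new partition directory and atomically swapped in for the raw-event partition, so readers never see a half-compacted state.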

Speech | Weibo's machine learning practice based on Flink

Speakers:

  • Yu Qian is a senior algorithm engineer at the Weibo Machine Learning R&D Center. He has been dedicated to building real-time data processing and online machine learning frameworks with Flink for many years, and has rich experience in developing recommendation systems for social media applications.
  • Cao Fuqiang is a system engineer at the Weibo Machine Learning R&D Center. He is responsible for the data computing module of the Weibo machine learning platform, which mainly involves real-time computing (Flink, Storm, Spark Streaming) and offline computing (Hive, Spark). He currently focuses on applications of Flink in Weibo machine learning scenarios.
  • Xiang Yu is an algorithm architecture engineer at the Weibo Machine Learning R&D Center.

Q: How does Gemini work? A: This is a fairly involved question; we will publish detailed documentation and comparative experiments on our official account later.

The Weibo Machine Learning R&D Center team will later share a technical article on "how to use Gemini", including detailed instructions and comparative experimental analysis; stay tuned!

Q: Which window is the multi-stream join of samples based on? A: Flink's built-in window computation could not meet our business needs, so we implemented a sliding window with Union + Timer: the data is stored in MapState, backed by RocksDB on SSD, and a custom trigger fires the sample. After comparing the RocksDB and Java heap state backends, we chose RocksDB + SSD as the state backend, balancing the business scenario, processing speed, and hardware cost.
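The Union + Timer pattern can be sketched as follows. This is a simplified illustration under our own assumptions (class and field names are hypothetical, and a plain dict stands in for Flink's MapState): events from several streams are unioned into one keyed stream, buffered per key, and a per-key timer emits the joined sample when the window closes.

```python
from collections import defaultdict

class UnionTimerJoin:
    """Toy sketch of a multi-stream sample join via Union + Timer."""
    def __init__(self, window_ms: int):
        self.window_ms = window_ms
        self.state = defaultdict(dict)   # key -> {stream_tag: event}; stand-in for MapState
        self.timers = {}                 # key -> timer fire timestamp

    def on_element(self, key, stream_tag, event, ts):
        """Buffer an event from one of the unioned streams."""
        self.state[key][stream_tag] = event
        # register one timer per key, aligned to the end of its window
        self.timers.setdefault(key, ts - ts % self.window_ms + self.window_ms)

    def on_timer(self, key, now):
        """Emit the joined sample for `key` once its timer expires; evict state."""
        if key in self.timers and now >= self.timers[key]:
            del self.timers[key]
            return self.state.pop(key)
        return None
```

In the real job the trigger logic and state eviction run inside a `KeyedProcessFunction`, and the state backend (RocksDB + SSD) keeps the buffered events off the JVM heap.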

Q: Can you explain in detail how multimedia feature computation is supported by Flink? How stable is it, and how is stability guaranteed? A: First, we deploy the algorithm models on GPUs and wrap them as RPC services; Flink then calls these RPC services to generate image and video features in real time.

**Stability:** we monitor the whole job end to end with Flink metrics, including but not limited to RPC call latency and success rate. At-least-once semantics ensure every record is processed at least once, and overall job latency is monitored on the source (Kafka) side.

In addition, a high-availability guarantee mechanism (a reconciliation system) is introduced for specific business scenarios to ensure stable data processing. The success rate for key business currently reaches 99.999%.

Q: After a model goes online, how does the application automatically convert the raw input data into the input variables required by the model? A: For online prediction, the online system fetches feature fields from the feature service, assembles the raw feature data, and runs it through a feature-processing module that converts the raw sample into the model's input format (which could be libsvm, or another format suitable for a DNN) before passing it to the model-serving module. The output format and the code of the feature-processing module are identical for training and prediction; the only difference is that training data carries extra label-related fields compared with online prediction data.
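The key point above is that one shared feature-processing routine serves both training (label present) and prediction (label absent). A minimal sketch, with a hypothetical helper name (`to_libsvm`) and a caller-supplied vocabulary standing in for the real feature service:

```python
def to_libsvm(features: dict, label=None, vocab=None) -> str:
    """Turn a raw feature dict into a libsvm line `label idx:value ...`.
    `vocab` maps feature names to stable 1-based indices so that training
    and online prediction produce identical encodings."""
    vocab = {} if vocab is None else vocab
    pairs = []
    for name, value in features.items():
        idx = vocab.setdefault(name, len(vocab) + 1)   # assign stable indices
        pairs.append((idx, value))
    body = " ".join(f"{i}:{v}" for i, v in sorted(pairs))
    # training sample carries a label; online prediction sample does not
    return f"{label} {body}" if label is not None else body
```

Because the same `vocab` (in practice, the same feature-processing code and metadata) is used in both paths, the model sees exactly the same input layout online as it did during training.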

Speech | Alink: improving the ease of use of a Flink-based machine learning platform

Speaker: Yang Xu, senior technical expert at Alibaba.

Q: Do many algorithms support real-time machine learning? How do you prevent individual outliers from influencing the model? A: All of Alink's classification and regression models support prediction on streaming data, and for online learning Alink currently supports FTRL. Outlier data is handled specially during model training; in addition, data can be cleaned before training with Alink's data-processing components.

Q: Is FlinkML no longer available in 1.10? What is the relationship between FlinkML and Alink? A: FlinkML is Flink's machine learning library, divided into an old version and a new version. Before building Alink, we carefully investigated the old FlinkML as it stood: it supported only about a dozen algorithms, the data structures it supported were not general enough, there was little performance optimization, and its code had not been updated for a long time. So instead of trying to improve and update FlinkML, we decided to design and develop a new machine learning library on top of Flink from scratch, which evolved into Alink.

Throughout Alink's development, we have worked closely with the Flink community, reporting our progress, discussing technical issues, and getting feedback and suggestions at the annual Flink Forward conferences. As Alink's features have been enhanced and improved, the community has become increasingly welcoming of open-sourcing Alink, and we can now work more closely with the Flink community on merging the open-source Alink code into FlinkML.

At the same time, more people in the community recognized the problems of the old FlinkML and decided to scrap it entirely and build a new one. We actively participated in the design of the new FlinkML API and shared our experience from designing the Alink API; concepts such as Alink's Params were adopted by the community. After that, we started contributing algorithm implementations to the new FlinkML and have submitted more than 40 PRs, including the basic algorithm framework, utility classes, and several algorithm implementations.

Alink contains a large number of machine learning algorithms. Contributing them to FlinkML requires discussion, design, and code review with the community committers. This process improves the code, but since committer resources are limited, fully contributing the code to FlinkML could take a long time. So we considered whether there was another way to let users start using it first; open-sourcing Alink itself was a good solution, while we continue contributing algorithm implementations to FlinkML. User feedback also helps us improve the algorithm implementations. The idea gained support from the community and from leaders and colleagues within the company, and at Flink Forward Asia 2019 Alink was announced as open source.

Roundtable | Flink SQL 2020: who else but me?

Commentator: Wu Chong (Yunxie), Apache Flink PMC member, Alibaba technical expert.

Q: Is the metadata of the catalog tables in the demo stored in memory or persisted to external storage? A: The demo registers two catalogs: a default catalog (in memory) and a Hive catalog (persistent). Both catalogs can store batch tables and streaming tables.

Q: Is this case any different from the features used in the previous Flink SQL demo (February 2020)? A: This demo covers more features, including four types of join, stream–batch consistency, CEP, and more.

Roundtable | Common misuses of Apache Flink

**Commentator:** Sun Jincheng (Jinzhu), Apache Member, Apache Flink PMC member, senior technical expert at Alibaba.

Q: In Flink window computation, heap access consumes a lot of CPU; the same windowing logic consumes much more CPU than in Spark. A: This depends on the specific scenario, and a more detailed description would be needed. General optimization approaches:

  1. Use incremental aggregation instead of full aggregation whenever possible [1]. It not only reduces the size of the state, but also lets computation happen as soon as data arrives in the window.
  2. Check whether all your types are recognized by Flink; otherwise the default Kryo serializer will be used, increasing CPU overhead [2]. You can call env.getConfig().disableGenericTypes(); to disable the Kryo fallback and verify that every type is recognized by Flink.

[1] ci.apache.org/projects/fl… [2] ci.apache.org/projects/fl…
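Point 1 above is the difference between buffering every element of a window versus folding each element into a constant-size accumulator as it arrives. A language-agnostic sketch in Python (in Flink the incremental form is a ReduceFunction or AggregateFunction; class names here are our own):

```python
class FullAvg:
    """Full aggregation: buffer every element, compute only when the window fires."""
    def __init__(self):
        self.buf = []
    def add(self, v):
        self.buf.append(v)           # state grows linearly with the window
    def fire(self):
        return sum(self.buf) / len(self.buf)

class IncrementalAvg:
    """Incremental aggregation: keep only an O(1) accumulator per window."""
    def __init__(self):
        self.total, self.count = 0.0, 0
    def add(self, v):
        self.total += v              # updated as each element arrives
        self.count += 1
    def fire(self):
        return self.total / self.count
```

Both produce the same result, but the incremental version keeps only two numbers in state per window instead of the whole element list, which is where the CPU and state savings come from.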

Q: When multiple windows cascade on the same keyBy, can DataStreamUtils be used? And is there any way to optimize when keys are too long? A: 1. DataStreamUtils can be used to cascade on the same key and avoid multiple shuffles. 2. On the business side, it is best to shorten the key itself, for example by reducing the number of fields, or by extracting data at a fixed length or position as the key. Technically, you can also hash the key, e.g. with MD5, but that adds CPU cost, which has to be weighed against the network and I/O cost of carrying a long key.
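The hashing trade-off mentioned in point 2 can be sketched as follows (the helper name `shorten_key` is ours, and the 8-byte digest length is an arbitrary illustrative choice):

```python
import hashlib

def shorten_key(key: str, digest_bytes: int = 8) -> str:
    """Replace a long composite key with a fixed-length MD5 digest prefix.
    Saves network/state bytes per record at the price of extra CPU per
    element (and a tiny collision risk)."""
    return hashlib.md5(key.encode("utf-8")).hexdigest()[: digest_bytes * 2]
```

The hash is deterministic, so records with the same original key still land on the same parallel instance after keyBy; what is lost is the ability to read the original fields back out of the key.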

Roundtable | Uber: Flink CEP practice in geo-based scenarios

Speaker: Fu Dian, Apache Flink Committer, Alibaba technical expert.

Q: How is CEP tuned for performance? A: In Flink CEP, rule complexity has a great impact on performance. If you run into performance problems, start by simplifying the rules from the business perspective.

Q: Is staggering windows across different keys done with a custom window trigger? A: It can be done with a custom WindowAssigner: for each key, the WindowAssigner adds a random factor, so that different keys get different window ranges.
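The per-key staggered assignment can be sketched like this. It is a simplified model of a custom WindowAssigner: a stable per-key hash stands in for the "random factor", so every key gets windows of the same size but with a different start offset (function name and hashing choice are ours):

```python
import hashlib

def assign_window(key: str, ts: int, size: int) -> tuple:
    """Return the (start, end) of the size-`size` window containing `ts`,
    with the window grid shifted by a stable per-key offset."""
    offset = int(hashlib.md5(key.encode()).hexdigest(), 16) % size
    start = ts - (ts - offset) % size   # largest grid point <= ts on this key's grid
    return (start, start + size)
```

Because the offset depends only on the key, all elements of one key consistently fall on the same shifted grid, while different keys fire at different times, smoothing out load spikes at window boundaries.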

Speech | A deep dive into Flink SQL

Speaker: Wu Chong (Yunxie), Apache Flink PMC member, Alibaba technical expert.

Q: Can the minibatch optimization that reduces interaction with state be used in DataStream? A: Minibatch optimization is only implemented in the aggregation operators of the SQL layer and cannot be used in DataStream.
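The idea behind the SQL-layer minibatch optimization is to buffer incoming records, pre-aggregate them by key, and touch the state backend once per key per batch rather than once per record. A toy sketch with a dict standing in for keyed state (class and counter names are ours):

```python
from collections import defaultdict

class MiniBatchCount:
    """Buffered per-key counting; `state_reads` counts backend accesses
    to make the saving visible."""
    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.buffer = []
        self.state = defaultdict(int)   # stand-in for the keyed state backend
        self.state_reads = 0

    def process(self, key):
        self.buffer.append(key)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        grouped = defaultdict(int)
        for k in self.buffer:           # pre-aggregate inside the batch
            grouped[k] += 1
        for k, delta in grouped.items():
            self.state_reads += 1       # one state access per key per batch
            self.state[k] += delta
        self.buffer.clear()
```

The cost is extra latency (records wait in the buffer), which is why in Flink SQL the feature is opt-in and bounded by an allowed-latency setting.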

Q: To support stream–batch unification, Flink SQL uses a lot of code generation under the hood, and the same SQL generates different code. Does this codegen take time? For batch scenarios, and especially OLAP scenarios that need fast results, how long does codegen take? A: Currently codegen happens at compile time and is executed only once, so it is fine for streaming and batch jobs. For OLAP scenarios, however, codegen and code compilation are indeed very sensitive costs, and that is a future optimization direction. We have not yet measured codegen time.

Q: How does join optimization work when stream mode might not have statistics? A: All optimizations in the current streaming model are deterministic and do not consider statistics; statistics are only considered in batch optimization. When stats are unavailable, we use a default value, such as rowCount = 10^8.

Speech | Flink's applications at Didi

Speaker: Xue Kang, technical expert at Didi and head of real-time computing. He graduated from Zhejiang University, previously worked as a senior R&D engineer at Baidu, and has rich experience in building big data ecosystems.

Q: Can you explain the implementation of StreamSQL online debugging? A: We parse the SQL, replace the sources and sinks with files and standard output, then execute the DML normally; the result is printed to standard output and displayed on the platform.

Q: For SQL written in the SQL IDE, how is lineage implemented? A: Each connector reports the connected data source information, such as the Kafka cluster and topic, to Kafka as metrics, which are then stored in Druid. The platform connects all the links into a complete lineage graph.

Q: How do you monitor the status of each Flink job across clusters, similar to each job's status (running or failed) on the Flink web UI? A: We periodically obtain the JobManager address of each application through the YARN API, fetch the running jobs through the JobManager REST API, and check each job's start time: if the start time falls between two consecutive checks, the job restarted during that interval, and an alarm is raised after a certain number of restarts. Newly submitted jobs need special handling so they are not counted as restarts.
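The restart-detection logic above can be sketched as a small poller. This is an illustrative model only (class name and alarm threshold are ours; in reality `jobs` would come from the JobManager REST API), including the special case for jobs seen for the first time:

```python
class RestartMonitor:
    """Detect restarts: a job restarted if its reported start time falls
    between two successive polls. Jobs first seen this round are treated
    as newly submitted, not restarts."""
    def __init__(self, alarm_after: int = 3):
        self.alarm_after = alarm_after
        self.last_poll = None
        self.restarts = {}     # job_id -> restart count

    def poll(self, now: int, jobs: dict) -> list:
        """`jobs` maps job_id -> start_time. Returns job_ids that crossed
        the alarm threshold in this round."""
        alarms = []
        for job_id, start in jobs.items():
            known = job_id in self.restarts
            self.restarts.setdefault(job_id, 0)
            if known and self.last_poll is not None and self.last_poll < start <= now:
                self.restarts[job_id] += 1
                if self.restarts[job_id] == self.alarm_after:
                    alarms.append(job_id)
        self.last_poll = now
        return alarms
```

A production version would also expire counters over time so that old, long-recovered restarts do not accumulate toward the alarm threshold forever.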

Q: For metadata management of Kafka tables, do you store group.id and start-mode, or only static information such as the Kafka connection and schema, with group.id, start-mode, etc. passed as table parameters? A: Yes, only static information is saved. Per-job runtime information is passed as parameters and submitted as part of the job in the form of set key=value.

Speech | Data Warehouses, Data Lakes, What's Next?

**Speaker:** Jin Xiaojun (Xianyin), senior technical expert at Alibaba.

Q: Can Hologres support high-performance update operations to implement a Flink RetractSink? A: Yes. In fact, with Hologres you can simply store the detail records directly; most scenarios do not need pre-aggregation, and you query on demand.

Q: What is Hologres's query efficiency with large data volumes? Are update and delete operations supported? A: Yes. There are trillion-row tables online doing multidimensional analysis, with results computed within 200 ms. Hologres supports updates and deletes.

Q: What are the differences between Hologres and the community's current data lake frameworks such as Hudi, Delta, and Iceberg? A:

  1. Hologres ingests data in real time, while the current open-source solutions are mini-batch, similar to the difference between Flink and Spark Streaming.
  2. Hologres itself serves queries and can directly serve online applications with a higher SLA.
  3. Hologres provides high-QPS query ability and can be used directly as a Flink dimension table.

Speech | Finally here: PyFlink + Zeppelin

Speakers:

  • Sun Jincheng (Jinzhu), Apache Member, Apache Flink PMC member, senior technical expert at Alibaba.
  • Zhang Jianfeng (Jianfeng), Apache Member, Apache Zeppelin PMC member, senior technical expert at Alibaba.

Q: Since the goal is full Python integration, what about beefing up Jupyter notebook support? A: PyFlink will support both Zeppelin and Jupyter, with Zeppelin leading the way. Zeppelin is more oriented toward big data computing and multi-tenant enterprise use, while Jupyter is more oriented toward machine learning and single users.

Q: What are the best application scenarios for Flink on Zeppelin? A: ETL and data analysis with batch computation, well suited to Flink SQL, PyFlink, and the Table API.

Q: What is Zeppelin's support for Kubernetes so far? Does the community have plans for it? Why deploy the Zeppelin server as a plain Pod instead of a StatefulSet or Deployment? A: This is in progress and depends on Flink's K8s support. Zeppelin 0.9 + Flink 1.11 is expected to support K8s well.

Production-Ready Flink and Hive Integration – what story you can tell now?

Commentator: Li Rui (Tianli), Apache Hive PMC member, Alibaba technical expert.

Q: Hive is already available, along with useful Hive client tools such as DbVisualizer. If a business already uses Hive for offline batch queries, is it worth integrating yet another framework? Why not just use DbVisualizer for Hive analysis? Put differently: Hive is a batch analysis tool; is it necessary to force an integration with streaming, or is it better to use specialized tools? A: Many users need to make Hive more real-time, for example real-time writes, or interactive queries in the style of Presto and Impala. Flink's batch-mode integration with Hive can achieve better performance than Hive's original batch engine. On the other hand, we also see users who want to consume the data written to Hive in real time, which requires streaming integration.

Q: In 1.10, Kafka tables can be registered in a HiveCatalog. Is it possible to write Kafka data to Hive tables? A: In 1.10, the HiveCatalog only stores the Kafka table's metadata, and only batch writes to Hive tables are supported. Streaming writes to Hive tables are supported starting from 1.11.