The main points of the talk:

  • There are many factors to consider and trade-offs to make when migrating batch-based ETL to stream processing. You can’t stream everything just because stream processing technology is popular.
  • The example in this article comes from Netflix, which replaced an existing batch system with Apache Flink. Flink was chosen because of the need for real-time event processing and extensible custom windowing.
  • Many difficulties were encountered during the migration, such as how to obtain data from real-time sources, how to manage metadata, how to recover data, how to handle out-of-order data, and how to handle operations and maintenance.
  • Streaming offers business benefits, such as the ability to train machine learning algorithms with the latest data.
  • Streaming also brings technical benefits, such as greater flexibility to save on storage costs and integration with other real-time systems.

At QCon New York 2017, Shriya Arora presented the talk “Personalizing Netflix with Streaming Datasets”, sharing a data-job migration case from Netflix in which the team replaced its original batch-based ETL with Flink.

Arora, a senior data engineer at Netflix, began her talk by explaining that her main goal was to help the audience understand whether streaming data pipelines could help them solve problems with traditional batch ETL jobs. She also discussed the considerations and trade-offs that need to be made when moving to stream processing. “Batch processing is not dead,” Arora said, and while there are many streaming engines out there, no single one provides the best solution.

Netflix’s core mission is to enable users to watch customized video content anytime, anywhere. Every day, Netflix processes 450 billion events generated by more than 100 million active members in more than 190 countries, who collectively watch more than 125 million hours of video per day. Netflix’s system uses a microservice architecture, where services communicate with each other through remote procedure calls (RPCs) and messages. The Kafka cluster in the production environment has more than 700 topics that manage messages and feed data to the data processing pipeline.

Netflix’s Data Engineering and Analytics (DEA) team and Netflix Research team operate the personalization system. The microservice applications generate user and system data, which is collected into the Netflix Keystone Data Pipeline, a petabyte-scale real-time stream processing system. Traditional batch processing works like this: the data is saved to the Hadoop Distributed File System (HDFS) backed by Amazon S3, and then processed with Apache Spark, Pig, Hive, or Hadoop. The processed data is saved to database tables or Elasticsearch and can be used by the Research team, downstream systems, or dashboard applications. At the same time, they also use Apache Kafka to direct data to Flink or Spark Streaming.
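
To make the traditional flow concrete, here is a minimal sketch of such a batch job using Spark’s Java API; the S3 path, column names, and output table are hypothetical placeholders, not details from Netflix’s actual pipeline:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class BatchEtlSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("batch-etl-sketch")
                .enableHiveSupport() // needed for saveAsTable against Hive
                .getOrCreate();

        // Read accumulated play events; the bucket and layout are placeholders.
        // In a setup like Netflix's, S3 is exposed to jobs through the HDFS interface.
        Dataset<Row> events = spark.read().parquet("s3://events-bucket/play-events/");

        // A representative aggregation: play counts per country and day.
        events.groupBy("country", "event_date")
              .count()
              .write()
              .mode("overwrite")
              .saveAsTable("play_counts_daily"); // downstream teams query this table

        spark.stop();
    }
}
```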

“Don’t try to stream everything,” Arora admonished the audience before describing how her team converted an ETL batch job to streaming. Streaming does offer business benefits, such as the ability to use the latest data to train machine learning algorithms, provide new ideas for market innovation, and increase the possibility of developing new machine learning algorithms. Streaming also brings technical benefits, such as greater flexibility to save on storage costs (raw data does not need to be saved in its original form), faster recovery from errors (long-running batch jobs incur long delays when they fail and must be rerun), real-time auditing, and integration with other real-time systems.

The biggest challenge when implementing streaming is choosing the right engine. The first thing to be clear about is whether the data will be processed on an event-stream basis or on a microbatch basis. In Arora’s view, microbatching is just a subset of batch processing that changes the time window from a day to an hour or a minute; data is still processed as a large chunk rather than as individual events. Moving to microbatching may be the most appropriate low-cost solution if the expectation is simply to get results faster and the organization has already invested heavily in batch processing.
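
As an illustration of the microbatch model, the following sketch uses Spark Structured Streaming’s processing-time trigger; the one-minute interval and the built-in “rate” test source are illustrative assumptions, not anything described in the talk:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.Trigger;

public class MicrobatchSketch {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("microbatch-sketch")
                .master("local[*]") // run locally for illustration
                .getOrCreate();

        // The built-in "rate" source generates synthetic rows for testing;
        // a real job would read from Kafka or from files instead.
        Dataset<Row> stream = spark.readStream().format("rate").load();

        // Each trigger processes everything accumulated since the last batch:
        // still chunk-at-a-time processing, just with a one-minute window
        // instead of a daily one.
        stream.writeStream()
              .format("console")
              .trigger(Trigger.ProcessingTime("1 minute"))
              .start()
              .awaitTermination();
    }
}
```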

Another consideration when choosing an engine is which features are most critical. This is not a question that can be answered in a brainstorming session, but usually only after in-depth understanding and research. Netflix’s system requires “sessionization” of event data, that is, grouping events into session-based time windows. Engines vary in their support for this feature, and in the end the team chose Flink because it offered better support for custom windows than Spark Streaming (however, since this talk was given, Spark 2.2.0 and, more recently, 2.3.0 have been released, providing Structured Streaming and advanced session-handling capabilities).
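
For reference, a minimal sketch of session windowing with Flink’s DataStream API might look like the following; the (userId, playCount) schema and the 30-minute inactivity gap are illustrative assumptions, as the talk did not specify Netflix’s actual window configuration:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.ProcessingTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class SessionWindowSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical (userId, playCount) events; a real job would read these
        // from Kafka rather than a fixed collection.
        DataStream<Tuple2<String, Long>> plays = env.fromElements(
                Tuple2.of("user-1", 1L), Tuple2.of("user-1", 1L), Tuple2.of("user-2", 1L));

        plays.keyBy(value -> value.f0) // group events per user
             // a session closes after 30 minutes of inactivity (gap is illustrative)
             .window(ProcessingTimeSessionWindows.withGap(Time.minutes(30)))
             .sum(1)  // total plays per user session
             .print();

        env.execute("session-window-sketch");
    }
}
```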

Another question to consider is whether you need to support the Lambda architecture. Lambda here does not mean AWS Lambda or serverless architectures. In the world of data processing, the Lambda architecture refers to using both batch and stream processing to handle large amounts of data, trading off latency, throughput, and fault tolerance. The batch layer provides an overall accurate view of the batch data, while a speed layer provides a near-complete view of the real-time online data. If you already have a batch job and just need to add a speed layer, you can choose an engine that supports the Lambda architecture so that existing code can be reused.

Consider the following when choosing a streaming engine:

  • What engines are being used by other teams in the company? If adopting new technology requires a lot of investment, you might want to consider leveraging the technology your company already has.
  • What is the company’s existing ETL system? Does the new technology integrate well with existing data sources and data sinks?
  • What is the learning curve for new technology? What engine and programming language is used for batch processing?

The penultimate part of the presentation focused on how Netflix converted a batch ETL job into a streaming one. Previously, Netflix’s DEA team used batch ETL to analyze “source of play” and “source of discovery” data, which took at least eight hours to complete. A source of play is the location on the home page of the Netflix application from which the user starts playing a video, while a source of discovery is the location on the home page where the user discovers new content to watch. The DEA team’s ultimate goal was to optimize the home page, maximize the conversion rate of user views, and reduce the delay (which at the time was more than 24 hours) between the occurrence of an event and the availability of the analysis results; real-time processing could significantly reduce this gap.

After an in-depth analysis of the discovery-source problem, Netflix found that a streaming engine could help them achieve the following goals:

  • Process high-throughput data (users worldwide generate around 100 million play events every day);
  • Interact with live microservices through a fat client (based on RPC) to enrich play events;
  • Integrate with other systems in the Netflix ecosystem;
  • Centralize logging and alerting;
  • Support side inputs that change infrequently (such as tables containing movie metadata or country demographics).

In the end, Arora and her team opted for Flink, along with other related technologies (a minimal sketch of wiring Flink to Kafka follows the list):

  • Use Kafka as the message bus;
  • Abstract, query, and analyze data using Hive, which provides a SQL-style interface (mainly used for metadata);
  • Use Amazon S3 (accessed through the HDFS interface) to store data;
  • Integration with the rest of the Netflix ecosystem through Netflix OSS;
  • Schedule and execute jobs using Apache Mesos;
  • Continuous delivery using Spinnaker.
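
A minimal sketch of the first of these choices, reading a Kafka topic into a Flink job, is shown below; the broker address, topic name, and group id are hypothetical, and this uses the classic FlinkKafkaConsumer connector (newer Flink releases replace it with KafkaSource):

```java
import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class KafkaToFlinkSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");  // placeholder broker
        props.setProperty("group.id", "discovery-source-etl"); // illustrative group id

        // "playback-events" is a hypothetical topic name, not one of the
        // 700+ topics mentioned in the talk.
        DataStream<String> events = env.addSource(
                new FlinkKafkaConsumer<>("playback-events", new SimpleStringSchema(), props));

        events.print(); // enrichment, sessionization, and sinks would follow here

        env.execute("kafka-to-flink-sketch");
    }
}
```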

[Diagram: overall pipeline architecture for discovery sources]

Arora summarized the main challenges faced by the DEA team during the migration:

  • Obtaining data from live streaming sources:
    • The migrated job requires access to the complete user play history.
    • In theory, this is straightforward in stream processing: thanks to the integration with the Netflix ecosystem and the real-time nature of the processing, handling an event requires only a simple RPC call.
    • However, because both Flink and Netflix OSS are developed in Java, there are sometimes library compatibility issues (known as “JAR hell”).
  • Side inputs:
    • In a batch job, metadata needs to be fetched only once per run; a streaming job might otherwise fetch the same metadata every time an event is processed.
    • Fetching the data repeatedly incurs extra network overhead and ultimately leads to wasteful resource utilization.
    • So they cache the metadata in memory and refresh it every 15 minutes (see the cache sketch after this list).
  • Data recovery:
    • If a batch job fails due to infrastructure problems, it can simply be rerun, because the data is still stored in the underlying storage, such as HDFS. This may not be the case in stream processing, because the original event may no longer exist after it has been processed.
    • In the Netflix ecosystem, the TTL of the message bus (Kafka) can be 4 to 6 hours; if a streaming job fails and the failure cannot be detected and fixed within the TTL, data will be lost.
    • The solution is to save the raw data in HDFS (for a day or two) for subsequent reprocessing.
  • Out-of-order events:
    • If the data pipeline fails and data recovery is required, “old” data will be intermixed with new live data.
    • The problem is how to correctly use the event time of the data to flag late data.
    • The DEA team chose to implement its own time windows and post-processing to ensure that the correct event time was used (a watermark-based alternative is sketched after this list).
  • Increased monitoring and alerting demands:
    • In the event of a pipeline failure, the appropriate team must be alerted immediately.
    • If an alert is not raised in time, data will be lost.
    • Efficient monitoring, logging, and alerting are critical.
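
A minimal sketch of the side-input caching pattern described above, assuming a Flink RichMapFunction and a hypothetical fetchMovieMetadata() lookup standing in for the real metadata service:

```java
import java.util.Collections;
import java.util.Map;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

// Enriches each raw event with slowly changing metadata. The metadata is
// loaded once when the operator starts and refreshed every 15 minutes,
// rather than being fetched over the network for every event.
public class MetadataEnricher extends RichMapFunction<String, String> {
    private static final long REFRESH_INTERVAL_MS = 15 * 60 * 1000L;

    private transient Map<String, String> metadataCache;
    private transient long lastRefresh;

    @Override
    public void open(Configuration parameters) {
        metadataCache = fetchMovieMetadata();
        lastRefresh = System.currentTimeMillis();
    }

    @Override
    public String map(String movieId) {
        long now = System.currentTimeMillis();
        if (now - lastRefresh > REFRESH_INTERVAL_MS) { // cache is stale, reload it
            metadataCache = fetchMovieMetadata();
            lastRefresh = now;
        }
        return movieId + "," + metadataCache.getOrDefault(movieId, "unknown");
    }

    private Map<String, String> fetchMovieMetadata() {
        // Placeholder: a real job would query a metadata service or a Hive table.
        return Collections.singletonMap("movie-42", "Example Title");
    }
}
```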
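
The talk describes a custom window implementation with post-processing; for comparison, here is a minimal sketch of Flink’s built-in event-time and watermark mechanism, a standard way to tolerate out-of-order data. The five-minute lateness bound and the (userId, eventTimeMillis) schema are illustrative assumptions:

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventTimeSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical (userId, eventTimeMillis) records: the timestamp comes
        // from the event itself, not from its arrival time at the processor.
        DataStream<Tuple2<String, Long>> events = env.fromElements(
                Tuple2.of("user-1", 1_000L), Tuple2.of("user-1", 64_000L));

        events.assignTimestampsAndWatermarks(
                   WatermarkStrategy
                       // tolerate events that arrive up to 5 minutes late
                       .<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofMinutes(5))
                       .withTimestampAssigner((event, previous) -> event.f1))
              .map(event -> Tuple2.of(event.f0, 1L)) // count each play once
              .returns(Types.TUPLE(Types.STRING, Types.LONG))
              .keyBy(value -> value.f0)
              .window(TumblingEventTimeWindows.of(Time.hours(1))) // windows on event time
              .sum(1) // plays per user per hour, computed on event time
              .print();

        env.execute("event-time-sketch");
    }
}
```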

Arora concluded her presentation by saying that while migrating batch ETL to streaming has many business and technical benefits, it also presents many challenges. There is an engineering cost, because most traditional ETL is batch-based and streaming machine-learning training is a relatively new field. Data teams also need to deal with operational issues, such as being on call and handling outages. It can be said that batch failures need to be addressed urgently, while stream failures need to be addressed immediately. Enterprises need to invest resources to build resilient infrastructure, efficient monitoring and alerting mechanisms, and continuous-delivery pipelines that support rapid iteration and deployment.

A video of Arora’s presentation at QCon New York 2017 can be found on InfoQ.

About the author

Daniel Bryant is an organisational and technological change-maker. His current work includes promoting agile adoption in the enterprise by introducing better requirements and planning models, focusing on architectural issues related to agile development, and promoting continuous integration and continuous delivery. His current technical areas include DevOps tools, cloud computing/container platforms, and microservice implementations. He is head of the London Java Community (LJC), a contributor to multiple open source projects, writes for several well-known websites (InfoQ, DZone and Voxxed), and speaks regularly at international conferences (QCon, JavaOne and Devoxx).

View the original English article: Migrating Batch ETL to Stream Processing: A Netflix Case Study with Kafka and Flink