On August 12, 2017, Dong Xicheng, head of Hulu's Big Data Architecture department, delivered a talk titled "Hulu Big Data Architecture and Application Experience" at the NetEase Practice Day: Big Data and Artificial Intelligence Technology Conference. IT Dakashuo (ID: ITdakashuo), as the exclusive video partner, publishes this article with the review and authorization of the organizer and the speaker.

Word count: 1,540 | Reading time: 4 minutes

Guest speech video address: suo.im/4ymiiP

Abstract

Dong Xicheng, head of Hulu’s Big Data Architecture department, shares Hulu’s practical experience in big data architecture and applications.

Overview

Above is Hulu's overall big data architecture. Our architecture is much the same as most others, with a few minor differences.

Hulu runs four main things on YARN: batch processing, interactive computing, streaming, and services.

We developed Nesto, an interactive computing engine, and Voidbox, a tool for running services.

On top of this, we also provide a variety of tools to make it easy for users to work with the entire cluster, such as Firework, a client-side management tool, and Horcrux, a unified configuration manager.

In addition, as you can see on the right side of the figure, we use two cluster management tools: Cloudera Manager and Hawkeye.

We have four clusters across three data centers, and this infrastructure is shared by all the teams in Beijing and the United States. At present, about ten people are responsible for the operation, maintenance, optimization, and development of the entire big data infrastructure.

Hulu Focus

Hawkeye

As the team responsible for big data infrastructure, a large part of our work is operations, and operations inevitably requires a powerful management and monitoring system. Hawkeye mainly helps users better understand changes in their data and applications, and it consists of three parts.

The first is reporting. We periodically send teams reports on hot and cold data distribution, small-file distribution, and data growth, keeping them informed of changes to their data.

The second is alerting. There are service-level alerts, data-growth alerts, large-application alerts, service-state alerts, and so on. These alerts are sent either to the data teams or to the infrastructure team so everyone knows how the cluster is doing.

The third is automation. We have automated programs that generate repair tasks based on the condition of a disk or a machine and send them to the on-site team in the machine room to fix for us.
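As a rough illustration of the kind of report Hawkeye produces, the sketch below counts small files per top-level directory by parsing standard `hdfs dfs -ls -R` output. The threshold and grouping are illustrative assumptions, not Hulu's actual implementation (at Hulu's scale such a report is more likely generated from the NameNode fsimage).

```python
import subprocess

SMALL_FILE_THRESHOLD = 16 * 1024 * 1024  # 16 MB -- an arbitrary cutoff chosen for illustration

def small_file_report(root):
    """Count files under `root` smaller than the threshold, grouped by top-level directory."""
    # `hdfs dfs -ls -R` prints one entry per line:
    # permissions, replication, owner, group, size, date, time, path
    out = subprocess.run(["hdfs", "dfs", "-ls", "-R", root],
                         capture_output=True, text=True, check=True).stdout
    counts = {}
    for line in out.splitlines():
        parts = line.split()
        if len(parts) < 8 or parts[0].startswith("d"):
            continue  # skip directories and header lines
        size, path = int(parts[4]), parts[7]
        if size < SMALL_FILE_THRESHOLD:
            top = "/".join(path.split("/")[:3])  # e.g. group by /user/<team>
            counts[top] = counts.get(top, 0) + 1
    return counts

if __name__ == "__main__":
    for directory, n in sorted(small_file_report("/user").items()):
        print(directory, n)
```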

Firework

Firework integrates the location and version information of all Hadoop clusters within Hulu. To access a different cluster, you simply tell Firework which data center to target with the corresponding command; it automatically caches the matching client version locally from a central repository and then uses that client to access the cluster.
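Below is a minimal sketch of the idea behind such a client wrapper. The cluster names, paths, and registry are hypothetical; the real Firework commands and repository layout are internal to Hulu.

```python
import os
import subprocess

# Hypothetical registry mapping each cluster to its NameNode and client version;
# in Firework this information comes from a central repository, not a local dict.
CLUSTERS = {
    "dc1-prod": {"namenode": "hdfs://nn.dc1.example.com:8020", "version": "2.6.0"},
    "dc2-prod": {"namenode": "hdfs://nn.dc2.example.com:8020", "version": "2.7.3"},
}

def fetch_client(version, dest):
    # Placeholder: the real download logic (central repo -> local cache) is not shown.
    raise NotImplementedError("download hadoop-%s into %s" % (version, dest))

def run(cluster, hdfs_args):
    info = CLUSTERS[cluster]
    client_home = os.path.expanduser("~/.cache/hadoop-%s" % info["version"])
    if not os.path.isdir(client_home):
        fetch_client(info["version"], client_home)  # cache the matching client locally
    hdfs_bin = os.path.join(client_home, "bin", "hdfs")
    # `-fs` is a standard generic option that points the client at the right NameNode.
    subprocess.run([hdfs_bin, "dfs", "-fs", info["namenode"]] + hdfs_args, check=True)

# Example: run("dc2-prod", ["-ls", "/user/alice"])
```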

Hulu Spark

We have currently upgraded Spark to version 2.1, and we have developed more than 30 large patches for our internal version.

When Spark Streaming is combined with Kafka, stability is very poor: almost any disturbance can cause Spark Streaming problems, and the open source community has not fully addressed this.

Each Spark application has many executors, and sometimes users need to debug individual executors. We use heuristic algorithms to dynamically detect the state of each executor and take action on them as needed.

We also have high-CPU applications; for these, we allow users to customize the number of executors per node.

The above examples are some of the internal adjustments and optimizations we made to help users improve the stability and performance of Spark.
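The per-node executor limit mentioned above is an internal Hulu patch, but as a rough sketch of the standard, open-source knobs in the same area, the configuration below uses stock Spark and Spark Streaming properties. The values are illustrative, not Hulu's settings.

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Illustrative values only; stock Spark does not expose a per-node executor limit,
# so this sketch sticks to standard executor-sizing and streaming settings.
conf = (SparkConf()
        .set("spark.executor.cores", "4")                      # CPU per executor
        .set("spark.executor.memory", "8g")
        .set("spark.dynamicAllocation.enabled", "true")        # let YARN grow/shrink executors
        .set("spark.dynamicAllocation.maxExecutors", "200")
        .set("spark.streaming.backpressure.enabled", "true")   # damp Kafka ingest spikes
        .set("spark.streaming.kafka.maxRatePerPartition", "10000"))

spark = (SparkSession.builder
         .appName("tuning-sketch")
         .config(conf=conf)
         .getOrCreate())
```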

OLAP Engines

There are more and more OLAP engines, and our OLAP stack has three layers.

As shown in the figure above, at the bottom level are the various OLAP engines, Impala, Presto, Nesto, and Druid.

Above them are applications developed on top of the OLAP engines, such as multidimensional analysis, time-series analysis, cohort analysis, and user flow analysis.

At the top are various visualization systems, such as Tableau and the Hulu BI Portal.

OLAP – Presto

One of the more capable OLAP engines is Presto. Hulu has enabled Presto's Resource Groups feature, which divides OLAP resources into several resource pools for different groups to use.
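For reference, Presto's file-based resource-group manager is driven by a JSON definition. The sketch below writes an illustrative `resource_groups.json`; the group names, limits, and selectors are assumptions rather than Hulu's actual pools, and field names vary across Presto versions.

```python
import json

# Illustrative groups and limits only -- not Hulu's actual pools.
# Note: older Presto versions use "maxRunning"; newer ones use "hardConcurrencyLimit".
resource_groups = {
    "rootGroups": [
        {"name": "adhoc",     "softMemoryLimit": "40%", "maxQueued": 100, "maxRunning": 20},
        {"name": "reporting", "softMemoryLimit": "30%", "maxQueued": 50,  "maxRunning": 10},
        {"name": "etl",       "softMemoryLimit": "30%", "maxQueued": 50,  "maxRunning": 5},
    ],
    "selectors": [
        {"source": "tableau.*", "group": "reporting"},  # route BI traffic to its own pool
        {"group": "adhoc"},                             # default pool for everything else
    ],
}

with open("resource_groups.json", "w") as f:
    json.dump(resource_groups, f, indent=2)
```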

ElasticPresto is a scalable Presto that dynamically increases or decreases computing resources depending on query load. YARN provides a good operating environment for this: running Presto on YARN simplifies deployment, facilitates rolling upgrades, and allows computing resources to be scaled flexibly based on load.
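A deliberately simplified sketch of the scaling loop such a system needs is shown below: poll the coordinator's cluster-stats endpoint (`/v1/cluster` in Presto) and ask YARN for more or fewer worker containers. The scaling rule, the coordinator address, and the YARN call are all placeholders, not ElasticPresto's actual logic.

```python
import time
import requests

PRESTO_COORDINATOR = "http://presto-coordinator.example.com:8080"  # hypothetical address

def desired_workers(stats, min_workers=5, max_workers=50, queries_per_worker=4):
    """A toy rule: one worker per `queries_per_worker` running or queued queries."""
    load = stats["runningQueries"] + stats["queuedQueries"]
    return max(min_workers, min(max_workers, -(-load // queries_per_worker)))

def set_worker_count(n):
    # Placeholder: an ElasticPresto-style controller would ask YARN to add or
    # release Presto worker containers here.
    print("would scale Presto workers to", n)

while True:
    stats = requests.get(PRESTO_COORDINATOR + "/v1/cluster").json()
    target = desired_workers(stats)
    if target != stats["activeWorkers"]:
        set_worker_count(target)
    time.sleep(60)
```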

OLAP – Nesto Background

Hulu has developed its own computing engine, Nesto, to solve the problem of nested data queries.

Hadoop Multi-DC

Challenge

There is a lot of data, a large number of applications, and many different types of applications mixed together.

We need to keep downtime within a day, or even just a few hours or less.

The Hulu scenario is unique because we are a multinational company with many offices, and the data sets across those offices need to be coordinated.

If you want to do a transparent migration, you may need to make many technical changes and customize the infrastructure at the code level to ensure a smooth migration.

Components

DCNamenode: supports the data center topology.

DCTunnel: synchronizes blocks according to folder-level whitelists and blacklists; limits bandwidth; performs priority-based block replication; adjusts quotas; provides a web portal that displays progress.

DCBalancer: Balancing within each data center.

Conclusion

Build basic tools to better serve users, customize open source projects for users’ specific cases, and build new systems for specific scenarios.

That’s all for today’s sharing, thank you!