On July 29, 2017, Li Wei, senior product manager at QingCloud, gave a talk titled “Best Practices of Cloud Big Data Platform” at the Big Data and Artificial Intelligence Conference. As the exclusive video partner, IT Dakashuo (WeChat ID: Itdakashuo) is authorized to release the video after review and approval by the organizer and the speaker.


Video and slides of the talk:
suo.im/4A4Y7h


Abstract

When building a big data platform or solution, many enterprises do not know which products to choose to meet their needs. This talk discusses practices and lessons for big data platforms built on QingCloud's cloud platform architecture.

Cloud Platform Architecture

QingCloud provides a complete infrastructure cloud and technology platform cloud. The IaaS layer at the bottom of the figure provides standard network, storage, and compute services. We treat hosts, virtual machines, containers, and physical machines all as resources in the architecture, and they share the same scheduler. The big data platforms, databases, and caches in the PaaS layer above are built on IaaS and invoke the IaaS APIs. On top of that are managed services, which come with their own deployment architectures.

Complete enterprise-level big data platform

A typical big data platform architecture starts with the various data sources, followed by the data transport layer, for which Kafka is recommended. The transported data lands in the big data storage layer, such as MongoDB, HBase, or object storage. The computing layer above it generally falls into several categories: Storm for real-time processing, Spark for near-real-time processing, and Hadoop and Hive for batch processing. Task scheduling and platform management are also needed to manage the various open source products involved. We place Redis, Memcached, and MySQL at the top. This is not a UI layer; these stores sit there because they are closest to the user and are accessed most frequently.

All of the above is built on top of IaaS, whether on virtual machines, physical machines, or containers. The key point of the architecture is that every module, from bottom to top and from left to right along the data lifecycle, can be replaced to suit the enterprise's scenario.

Big data product selection

Real-time stream processing engine comparison

Mainstream real-time stream processing engines include Storm, Storm Trident, Spark Streaming, Samza, and Flink. There are many dimensions to consider when choosing among them. One is the delivery guarantee: at-least-once (a message may be retransmitted and therefore processed more than once) versus exactly-once (a message is processed exactly one time, whether or not an error occurs). In terms of latency, Storm uses native stream processing and has very low latency, while Spark Streaming uses micro-batching, which processes the stream in small batches collected over a time interval, so its latency is higher. In terms of throughput, native Storm's is comparatively low, while Spark Streaming's is high.
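To make the difference between the two delivery guarantees concrete, here is a minimal Python sketch. It is not how Storm, Flink, or any other engine implements them; it only simulates a source that retransmits a message when an acknowledgement is lost, and shows one common way to get an exactly-once effect on top of at-least-once delivery, namely deduplicating by message ID. All names (`deliver_at_least_once`, `count_naive`, `count_deduplicated`) are hypothetical.

```python
from collections import Counter


def deliver_at_least_once(messages, lost_acks):
    """Yield (msg_id, payload) pairs; resend any message whose ack was lost."""
    for msg_id, payload in messages:
        yield msg_id, payload
        if msg_id in lost_acks:
            # The ack never arrived, so the source retransmits the message.
            yield msg_id, payload


def count_naive(stream):
    """Count payloads as they arrive; retransmissions are counted twice."""
    counts = Counter()
    for _msg_id, payload in stream:
        counts[payload] += 1
    return counts


def count_deduplicated(stream):
    """Count each message ID once, ignoring retransmitted duplicates."""
    seen = set()
    counts = Counter()
    for msg_id, payload in stream:
        if msg_id in seen:
            continue
        seen.add(msg_id)
        counts[payload] += 1
    return counts


messages = [(1, "click"), (2, "view"), (3, "click")]
lost_acks = {2}  # assume the ack for message 2 is lost once

naive = count_naive(deliver_at_least_once(messages, lost_acks))
deduped = count_deduplicated(deliver_at_least_once(messages, lost_acks))
print("at-least-once, naive count:", dict(naive))
print("deduplicated count:", dict(deduped))
```

The naive counter over-counts the retransmitted message, which is acceptable for some workloads (for example, approximate monitoring) but not for billing-style aggregates. That trade-off is usually what the choice of guarantee comes down to.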

Storage: HBase vs. Cassandra

HBase and Cassandra are very similar products. Both provide column storage with high read and write performance on massive data. Their application scenarios are also similar: both are often used to store monitoring or log data.

However, there are many differences between them. First, HBase is strongly consistent, while Cassandra is eventually consistent. In terms of stability, HBase relies on HMaster and NameNode HA, while Cassandra is decentralized and has no single point of failure. In terms of partitioning, HBase uses range partitioning based on the sorted order of row keys, while Cassandra uses consistent hashing, and the partitioner can be customized. In terms of availability, HBase achieves consistency at the cost of availability: when an HBase node goes down, reads and writes are temporarily unavailable, whereas when a Cassandra node goes down, reads and writes continue.
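The two partitioning strategies can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual HBase or Cassandra code: the split points, node names, number of virtual nodes, and the use of MD5 as the hash function are all made up for the example (Cassandra's partitioner is configurable and is not MD5 by default).

```python
import bisect
import hashlib


def range_partition(row_key, split_points):
    """Range partitioning: return the index of the key range holding row_key."""
    return bisect.bisect_right(split_points, row_key)


class HashRing:
    """Consistent hashing: each node owns several tokens on a hash ring."""

    def __init__(self, nodes, vnodes=8):
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._tokens = [token for token, _node in self._ring]

    @staticmethod
    def _hash(value):
        # MD5 is a stand-in hash chosen only because it is in the stdlib.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def node_for(self, row_key):
        """Walk clockwise to the first token at or after the key's hash."""
        idx = bisect.bisect_right(self._tokens, self._hash(row_key))
        return self._ring[idx % len(self._ring)][1]


keys = ["apple", "grape", "melon", "zebra"]

# Range partitioning keeps sorted keys together, so range scans stay local.
split_points = ["g", "n", "t"]
regions = [range_partition(k, split_points) for k in keys]
print("range partition:", dict(zip(keys, regions)))

# Consistent hashing spreads keys by hash, independent of their sort order.
ring = HashRing(["node-a", "node-b", "node-c"])
owners = {k: ring.node_for(k) for k in keys}
print("consistent hash:", owners)

# Removing a node only moves the keys that node owned.
smaller_ring = HashRing(["node-a", "node-b"])
moved = [k for k in keys if smaller_ring.node_for(k) != owners[k]]
print("keys moved after removing node-c:", moved)
```

Range partitioning favors ordered scans but can create hot spots when keys arrive in sorted order. Consistent hashing balances load and limits data movement when nodes join or leave, at the cost of efficient range scans.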

Ad-hoc & OLAP query analysis product comparison

There are many dimensions to consider for interactive queries. Here we compare the products along three of them: data volume, flexibility, and performance.

Hive

Hive can handle massive data volumes and supports flexible queries, but its performance is low.

Phoenix + HBase

Together, the two can also query massive data with high performance. However, queries depend on the rowkey and are therefore inflexible, which makes the combination suitable for scenarios where the query patterns are fixed.

Elasticsearch

Elasticsearch provides flexible queries and high performance, but it only supports data at the terabyte scale. This is mainly due to its architecture: the nodes form a fully connected structure, which can cause scalability problems as nodes are added.

Kylin

Kylin can handle large amounts of data with high performance, but its flexibility is low. Kylin answers queries from pre-aggregated results: the cube's dimensions and facts must be calculated in advance in the data warehouse and stored in HBase. That is what makes it fast, and also what costs it flexibility.
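The pre-aggregation trade-off can be illustrated with a toy cube in Python. This is only a sketch of the general idea and not Kylin's storage format or API; the fact rows, the dimension names, and the `build_cube` and `query` helpers are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: two dimensions and one measure.
facts = [
    {"region": "north", "product": "tv", "sales": 100},
    {"region": "north", "product": "fridge", "sales": 50},
    {"region": "south", "product": "tv", "sales": 70},
]
dimensions = ("region", "product")


def build_cube(rows, dims, measure):
    """Precompute the sum of `measure` for every combination of dimensions."""
    cube = defaultdict(int)
    for size in range(len(dims) + 1):
        for group in combinations(dims, size):
            for row in rows:
                key = (group, tuple(row[d] for d in group))
                cube[key] += row[measure]
    return dict(cube)


def query(cube, dims, **filters):
    """Answer a query with a single lookup into the precomputed cube."""
    unknown = set(filters) - set(dims)
    if unknown:
        # Dimensions that were not modeled up front cannot be queried.
        raise KeyError(f"not precomputed: {sorted(unknown)}")
    group = tuple(d for d in dims if d in filters)
    return cube.get((group, tuple(filters[d] for d in group)), 0)


cube = build_cube(facts, dimensions, "sales")
total = query(cube, dimensions)
north = query(cube, dimensions, region="north")
tv = query(cube, dimensions, product="tv")
south_tv = query(cube, dimensions, region="south", product="tv")
print(total, north, tv, south_tv)
```

Every supported query is a dictionary lookup, which is why it is fast. A query on a dimension that was not included when the cube was built (for example a hypothetical `month`) fails, because the answer was never computed.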

Druid

Druid satisfies all three: massive data, performance, and query flexibility. Its restriction is that it targets time-series data, so every record in a data source must have a timestamp.

HashData (Greenplum)

It uses standard SQL, so queries are flexible. It also offers high performance, supports PB-level data, and can scale out nodes on the cloud. Its limitation is that it can only handle structured data.

Massive data OLTP: distributed database

The product we use for OLTP on massive data is shown in the figure above. It has a middleware cluster on top; each node below it is a shard, and each shard is itself a MySQL cluster. Shards can be added without limit on the cloud, so the architecture can support massive amounts of data.

At the architecture level, the distributed database also provides automatic sharding of databases and tables, data consistency, and distributed transactions.
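The role of the middleware layer can be sketched as a router that maps a shard key to one shard. The Python below is a simplified illustration and not the product's actual routing logic: the `ShardRouter` class, the shard names, the use of CRC32 modulo the shard count, and the in-memory dictionaries standing in for MySQL clusters are all assumptions.

```python
import zlib


class ShardRouter:
    """Toy middleware: routes each row to one shard based on its key."""

    def __init__(self, shard_names):
        self.shard_names = list(shard_names)
        # Each dict stands in for one MySQL cluster.
        self.shards = {name: {} for name in self.shard_names}

    def shard_for(self, user_id):
        """Pick a shard with a stable hash so the same key always matches."""
        digest = zlib.crc32(str(user_id).encode("utf-8"))
        return self.shard_names[digest % len(self.shard_names)]

    def insert(self, user_id, row):
        self.shards[self.shard_for(user_id)][user_id] = row

    def get(self, user_id):
        return self.shards[self.shard_for(user_id)].get(user_id)

    def count_all(self):
        """A cross-shard query: fan out to every shard and merge results."""
        return sum(len(rows) for rows in self.shards.values())


router = ShardRouter(["shard-0", "shard-1", "shard-2"])
for uid in range(1, 101):
    router.insert(uid, {"id": uid, "name": f"user-{uid}"})

print(router.get(42))
print({name: len(rows) for name, rows in router.shards.items()})
print(router.count_all())
```

Single-key reads and writes touch exactly one shard, while aggregate queries fan out to all of them. Plain modulo routing is the simplest choice but forces data to be redistributed when the shard count changes, which is one reason a real system automates sharding in the middleware.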

Object storage service with infinite scale

Object storage is very important in big data processing. We generally use S3 object storage, which scales without limit and supports all file types, including unstructured data such as logs, video, audio, and pictures. There is also no limit on how many files can be stored.

QingCloud's object storage, QingStor, supports the S3 interface, so mainstream big data products can use QingStor directly. Although this mode incurs some performance loss, storing data centrally makes it easy to switch between computing engines: the same data can be processed by different engines.


A large household appliance group: public opinion analysis system based on massive data

In this architecture, the crawled data and the backups of the relational databases are stored in object storage and then analyzed with Spark. The report files produced by the analysis can be displayed through the UI.

QingCloud: flow chart of online business big data analysis

QingCloud itself has many business needs that call for analysis. We synchronize and analyze data from some of our online databases, which involves typical ETL processing. The processed data is stored in HDFS and computed on by Spark. Data stores such as PostgreSQL and Elasticsearch are then exposed to the front end through an API server. This is a typical big data analysis flow, and we also use it to validate the services provided by the QingCloud big data platform.

A large global Internet workplace social platform

This is a large Internet social platform running on the QingCloud public cloud, with a very typical architecture. At the top, its own web servers are connected to load balancers. Below that is a data service layer that works with MySQL, caches, Elasticsearch, MongoDB, and other data stores, along with Kafka as the data transport layer. Application-level system logs can be fed into the Spark platform for analysis, for example to determine whether a user added the friends suggested by the recommendation system.

Roadmap

Big data platform management architecture

QingCloud provides not only the big data components but also a platform to manage them. Through the UI, our big data management platform can directly run Hive, SQL, and Spark scripts and show Storm and ZooKeeper data. From the browser you can view the file structure of HDFS and object storage, submit real-time queries to HBase, and inspect the structure of the data in Kafka queues. These are the platform management features built on top of the big data components.

Big data platform + AppCenter 2.0

Big data technology changes too fast for us to provide every related product ourselves. We therefore need a framework layer beneath the big data platform that lets us turn various products into services and integrate them into the platform.

QingCloud's AppCenter is such a framework, and our PaaS and big data services are all built on it. This gives unified platform management at the top and plug-in style integration of products at the bottom.