Author: “Sugar Fried Chestnut”, Senior Data R&D Engineer at Getui


In recent years, the rapid development of the mobile Internet, the Internet of Things, and cloud computing has generated massive amounts of data. When it comes to big data processing, different technology stacks perform differently, and many developers struggle with how to process such large volumes of data quickly and efficiently.


With the rise of Greenplum, many problems of older big data warehouses have been effectively solved, and Greenplum has become a typical representative of the new generation of databases. This article takes an in-depth look at how Getui (Daily Interaction Inc.) chose an effective technology stack for handling huge amounts of data, and analyzes Greenplum’s practice at Getui in detail based on its own business scenarios.





01



The background of Greenplum


In 2002, Internet data volumes were growing rapidly. On the one hand, traditional databases could not meet the computing needs of the time; on the other hand, most of them were built on SMP architecture, which scales poorly. Faced with ever-increasing data volumes, SMP architecture could no longer keep up; developers needed a database with distributed parallel computing capabilities, and Greenplum came into being.


Unlike the SMP architecture of traditional databases, Greenplum uses a fully shared-nothing architecture, which scales significantly better than SMP. The Greenplum system is based on the MPP (massively parallel processing) architecture: multiple server nodes are connected over a network, and each node accesses only its own local resources (memory, storage, and so on).
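The core shared-nothing idea is that each row lives on exactly one segment, chosen by hashing a distribution key, so every segment owns a disjoint slice of the data on its own local storage. A minimal sketch (purely illustrative; the segment count and hash function here are simplifications, not Greenplum’s actual algorithm):

```python
import zlib

NUM_SEGMENTS = 4  # illustrative; real clusters have many segments

def segment_for(key: str) -> int:
    # Route a row to a segment by hashing its distribution key;
    # the same key always lands on the same segment.
    return zlib.crc32(key.encode()) % NUM_SEGMENTS

def distribute(rows, key_fn):
    # Build each segment's local, disjoint slice of the table.
    segments = [[] for _ in range(NUM_SEGMENTS)]
    for row in rows:
        segments[segment_for(key_fn(row))].append(row)
    return segments

rows = [{"user": f"u{i}", "events": i % 5} for i in range(1000)]
segments = distribute(rows, key_fn=lambda r: r["user"])
```

Because each segment sees only its own slice, it can scan and aggregate locally without coordinating with other nodes, which is what makes linear scale-out possible.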






02



Interpret the Greenplum architecture

Greenplum consists of the Master node, the Interconnect network layer, and multiple Segment nodes responsible for data storage and computation.

Image from Greenplum community


The Master consists of a primary node and a standby node. It has two main functions: generating query plans and dispatching and coordinating the parallel computation of the Segments; and maintaining the Global System Catalog, which holds the system tables containing the metadata of the Greenplum database system itself. The Master usually does not participate in data interaction; Greenplum performs all parallel tasks on the Segment data nodes. Therefore, the Master node is not a performance bottleneck for the database.
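The scatter-gather pattern described above can be sketched in a few lines (a toy model only; real Greenplum plans involve slices, motion nodes, and much more): the master dispatches the same task to every segment, each segment aggregates only its local rows, and the master merges the partial results.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy scatter-gather: the "master" dispatches work to every "segment",
# each segment aggregates its own local rows, and the master combines
# the partial results. Illustrative only.

segments = [            # each inner list is one segment's local rows
    [3, 1, 4], [1, 5, 9], [2, 6, 5], [3, 5],
]

def local_count(rows):
    # Partial aggregate computed entirely on one segment's local data.
    return len(rows)

with ThreadPoolExecutor(max_workers=len(segments)) as pool:
    partials = list(pool.map(local_count, segments))

total = sum(partials)   # the master merges the partial results
```

This is why the Master stays cheap: it only plans, dispatches, and merges, while the heavy scanning and aggregation run in parallel on the segments.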


The Interconnect, the network layer in the middle, is mainly responsible for dispatching parallel query plans and coordinating the parallel execution of the executors on the QE (query executor) nodes over libpq network connections. Thanks to the Interconnect, Greenplum achieves efficient collaboration and parallel computation among the multiple PostgreSQL instances in a cluster.


At the bottom of the architecture diagram are the nodes responsible for data storage and computation, each running multiple instances. Each instance is a PostgreSQL database. Instances on the same machine share the node’s IO and CPU, while different machines are independent of each other. PostgreSQL offers excellent stability and data-processing performance, and its rich syntax support meets Greenplum’s needs.


03



The advantage of the Greenplum

Greenplum has several advantages that make it a great tool for processing massive data sets.

Advantage 1: Fast data loading and parallel computing, which greatly improve data processing efficiency.

Image from Greenplum community


Greenplum’s data pipeline efficiently moves data from disk to CPU. By contrast, Spark, a popular computing engine, processes data in memory allocated per concurrent query, which is less suitable for queries over very large data sets. Greenplum can load data in parallel, offers real-time query capability, and performs computation on large data sets more efficiently.


Advantage 2: Strong scalability

Greenplum is based on the MPP architecture, shares nothing between nodes, and executes queries in parallel, so it can scale linearly to petabyte-level data volumes. Greenplum is open source, has an active community, and is highly reliable for users.


Advantage 3: Rich SQL support

Greenplum supports complex SQL queries, greatly simplifying data manipulation and interaction. Popular technologies such as HAWQ, Spark SQL, and Impala are largely optimizations built on MapReduce; although some of them also offer SQL queries, their SQL support is limited.


We made a multi-dimensional comparison of several mainstream tools on the market, as shown in the table below:

Greenplum
Approach: real-time data is read directly from GPDB, and historical data is aggregated and stored in GPDB daily.
Advantages: 1. Originally a commercial database, with good disaster tolerance and stability; 2. Fast queries.
Disadvantages: scalability on very large clusters remains to be verified.

Phoenix
Approach: HBase stores the raw data, and Phoenix is used to query it.
Advantages: relies mainly on HBase, is convenient to use, and supports SQL queries.
Disadvantages: secondary indexes cause rapid data growth, especially when there are many business dimensions; enabling HBase’s coprocessor mode complicates O&M.

Impala + Kudu
Approach: streaming data first lands in Kudu, and real-time data is queried through Impala; every day Kudu’s data is converted to Parquet format and stored in HDFS, from which Impala queries the historical data; real-time and historical results are unioned for page display.
Advantages: compute and storage are separated, and using HDFS for storage speeds up queries.
Disadvantages: Impala queries are memory-intensive, and Kudu’s disaster tolerance is weak.




04



Greenplum’s fault tolerance mechanism

Greenplum Database, GPDB for short, is feature-rich, supports multi-level fault tolerance mechanisms, and offers high availability.



1

High availability of the Master node: to avoid a single point of failure on the Master, Greenplum provides a replica of it (called the Standby Master) that is kept in sync with the primary through streaming replication. When the primary Master fails, the Standby Master can take over, serving user requests and coordinating query execution.


2

High availability of data nodes: Greenplum can configure a mirror for each data node. Prior to version 6, mirror synchronization was performed at the file-block level (the filerep technology); in version 6, Greenplum adopted WAL log replication for data synchronization. The FTS probe (ftsprobe) periodically checks the heartbeat of each data node; when a node fails, GPDB automatically performs a failover.


3

High availability of the network: to avoid single points of failure, each host is configured with multiple network interfaces connected through multiple switches, so that a network fault does not make the entire server unavailable.

GPDB also provides graphical performance monitoring, with which users can view the database’s current running status and historical query information, and track system usage and resource consumption.



05



Application of Greenplum in push business scenarios

After an in-depth investigation of Greenplum, we incorporated it into our technology stack and deployed it in our business lines, such as our application statistics product.

1. Business pain points:

“Getui Application Statistics” is a data statistics and analysis platform for mobile apps. It provides comprehensive statistical analysis of an app across user attributes, user behavior, and industry benchmark indices, helps app operators dig deeply into user needs along multiple dimensions, and gives them a clear view of the app’s standing in its industry, providing comprehensive data support for product operation and promotion decisions.


In the beginning, data statistics in “Getui Application Statistics” were computed mainly offline. Later, as the product iterated and was optimized, some requirements gradually could no longer be met, for example:

1) Some scenarios require real-time computation;

2) Day-level deduplication in active-user statistics;

3) Metrics with too many statistical dimensions cannot be pre-computed.


2. Solutions:

To solve these problems, Getui introduced Greenplum as a real-time data analysis tool. Getui currently runs Greenplum version 5.16.


1) Real-time requirements. “Getui Application Statistics” produces billions of event log entries every day. For data import, we use GPKafka to write data in real time.


The data import process is as follows:

1) Flume collects log data and sends it to the Kafka cluster;

2) Spark Streaming consumes the data from the Kafka cluster and performs real-time data cleaning and processing;

3) Spark Streaming writes the processed data back to the Kafka cluster;

4) GPKafka consumes the data from the Kafka cluster and writes it into Heap tables in Greenplum for real-time data analysis.
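The last step is driven by a gpkafka load configuration file. A minimal sketch might look roughly like the following (all hosts, topic, database, and table names are hypothetical, and the exact schema varies by gpkafka version; treat this as illustrative, not authoritative):

```yaml
# Hypothetical gpkafka load config; illustrative values throughout.
DATABASE: stats_db
USER: gpadmin
HOST: mdw.example.com
PORT: 5432
VERSION: 2
KAFKA:
  INPUT:
    SOURCE:
      BROKERS: kafka1.example.com:9092
      TOPIC: cleaned_events
    VALUE:
      COLUMNS:
        - NAME: event_json
          TYPE: json
      FORMAT: json
  OUTPUT:
    TABLE: events_heap        # Heap table queried for real-time analysis
  COMMIT:
    MINIMAL_INTERVAL: 2000    # commit batches to Greenplum at most every 2s
```

Under such a setup, a command like `gpkafka load` run against this file keeps consuming the topic and micro-batching rows into the target table; consult the gpkafka documentation for your version before relying on any field shown here.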

2) Deduplication. We extended Greenplum by integrating RoaringBitmap. In the data processing stage, a scheduled task builds each day’s user information into bitmap form, and multi-day user queries are computed with bitwise AND operations on those bitmaps.
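The idea can be sketched in plain Python, using integers as stand-in bitmaps (the production system uses RoaringBitmap inside Greenplum; this sketch only illustrates the AND-based dedup logic, with hypothetical user ids):

```python
# Day-level dedup with bitmaps: each day's active users become a bitmap
# keyed by user id, and "active on every day" is a bitwise AND.

def to_bitmap(user_ids):
    bm = 0
    for uid in user_ids:
        bm |= 1 << uid          # set the bit for this user id
    return bm

day1 = to_bitmap([1, 2, 3, 5, 8])
day2 = to_bitmap([2, 3, 5, 13])
day3 = to_bitmap([3, 5, 21])

both = day1 & day2 & day3       # users active on all three days
active_all_days = bin(both).count("1")   # popcount = distinct user count
```

The appeal is that the AND and the popcount are cheap regardless of how many raw event rows each day contained, which is exactly what day-level deduplication needs.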


3) Complex-dimension statistics. In some scenarios users define complex query conditions, such as multi-dimensional event analysis and funnel queries. The query module of “Getui Application Statistics” uses PL/Java: functions defined with PL/Java, similar to UDF scripts, are called from SQL to achieve fast queries over complex statistics.
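The kind of per-user funnel check such a function might perform can be sketched as follows (event names and data are hypothetical; a real PL/Java UDF would run inside Greenplum over SQL rows):

```python
# A user converts through the funnel if the steps appear in their event
# stream in order (gaps between steps are allowed).

def passes_funnel(events, steps):
    # True if `steps` occur in order (not necessarily contiguously).
    it = iter(events)
    return all(step in it for step in steps)  # `in` consumes the iterator

funnel = ["launch", "view_item", "add_cart", "pay"]
users = {
    "u1": ["launch", "view_item", "add_cart", "pay"],
    "u2": ["launch", "view_item", "pay"],          # skipped add_cart
    "u3": ["launch", "add_cart", "view_item", "add_cart", "pay"],
}

converted = sum(passes_funnel(ev, funnel) for ev in users.values())
```

Pushing this per-user check into a database-side function keeps the event rows where they live instead of shipping them to a client, which is the point of using PL/Java here.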


3. Usage

Currently, the Greenplum cluster used by the Getui business line has 10 nodes and handles billions of records per day. We recently expanded the cluster, adding nodes with Greenplum’s native tools, which automatically redistribute the data. Greenplum’s suite of automation tools makes scaling easier for operations staff.


4. Results

The introduction of Greenplum greatly enhanced the data processing capability of Getui, including real-time storage of massive data, real-time query, label operation, etc. At present, the efficient use of data is of great importance to enterprises. Building OLAP analysis systems plays an important role in business guidance and corporate decision-making, and Greenplum can help companies quickly build analysis engines to achieve comprehensive data analysis.


Before using Greenplum
Data query: real-time writes to the database were not supported; data was generally delivered as T+1 reports.
Business support: the previously used HBase and Codis had limited capacity to support the business.
Stability: average.

After using Greenplum
Data query: real-time storage of massive data with real-time statistics; real-time data queries are supported.
Business support: supports complex SQL queries and functions such as funnel statistics and multi-dimensional event analysis.
Stability: complete O&M system support, with real-time database monitoring and alerting.



This article has introduced Getui’s practice with Greenplum. Going forward, Getui will explore Greenplum in more depth, such as the use of custom functions, and continue to share with developers how Greenplum can be used in production environments.