Abstract: As the information society moves from the Internet era to the Internet of Things era, enterprises inevitably face a series of problems brought by growing data volumes: how to store data efficiently and expand capacity, and how to achieve intelligent, real-time analysis with minimal changes to the existing business.

This article is shared from the Huawei Cloud community post “How to efficiently store and analyze 5 billion Massive Data? 3 Tips for GaussDB (for Cassandra)”, by Cassandra official.

At present, the information society is moving from the Internet era to the Internet of Things (IoT) era, and information interaction is becoming more complex, efficient, and intelligent. For Internet companies and IoT enterprises, this is both an opportunity and a challenge. Enterprises inevitably face a series of problems brought by growing data volumes, such as how to store data efficiently and expand capacity, and how to achieve intelligent, real-time analysis with minimal changes to the existing business.

To meet these challenges, Huawei Cloud GaussDB (for Cassandra) provides customers with capabilities such as strong scalability, large storage capacity, efficient import and export, and real-time analysis. It serves many Internet companies and IoT enterprises and has won recognition and support from customers. This article uses the business pain points of one customer as an example to present three tips for efficient storage and real-time analysis.

Massive storage, seamless expansion to the PB level

When users deploy a database on premises, or use another database backed by cloud disks, they often have to plan and purchase storage resources in advance as capacity approaches its threshold, and they may also have to expand computing resources they do not need. With GaussDB (for Cassandra), this is not a concern. GaussDB (for Cassandra) uses an architecture that separates storage from compute, so storage capacity can be expanded on its own, up to the PB level.

In addition, to perform big data analysis, customers typically write the data in the database to HDFS for MapReduce and Spark analysis, and so have to maintain two sets of resources at the same time. Maintenance and resource costs therefore become a pain point. With GaussDB (for Cassandra), customers can use a single service for both database storage and big data analysis. GaussDB (for Cassandra) also provides easy-to-use CQL interfaces, letting users focus on feature development rather than resource management.
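As a rough illustration of what working through a CQL interface looks like, here is a minimal sketch in Python. The keyspace, table, and column names (`demo.crawl_items` and so on) are hypothetical and do not come from the customer case. The `session` argument is assumed to be a session object from the open source Python driver for Cassandra, whose `execute(query, parameters)` method accepts `%s` placeholders for simple statements; connecting to an actual instance is left out.

```python
from datetime import datetime, timezone

# Hypothetical schema for illustration only; not from the original article.
CREATE_TABLE_CQL = """
CREATE TABLE IF NOT EXISTS demo.crawl_items (
    item_id text,
    fetched_at timestamp,
    content text,
    PRIMARY KEY (item_id, fetched_at)
)
"""

INSERT_CQL = (
    "INSERT INTO demo.crawl_items (item_id, fetched_at, content) "
    "VALUES (%s, %s, %s)"
)

SELECT_CQL = "SELECT fetched_at, content FROM demo.crawl_items WHERE item_id = %s"


def save_item(session, item_id, content, fetched_at=None):
    """Insert one row through any object exposing execute(query, params)."""
    if fetched_at is None:
        fetched_at = datetime.now(timezone.utc)
    session.execute(INSERT_CQL, (item_id, fetched_at, content))
    return fetched_at


def load_items(session, item_id):
    """Return all rows stored for one item_id (a single partition)."""
    return list(session.execute(SELECT_CQL, (item_id,)))
```

Because the functions only depend on `execute`, the application code stays the same whether the session points at a self-managed cluster or a managed service; capacity planning does not appear in it at all.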

Data change capture and real-time analysis

One usage scenario of the customer requires online analysis and real-time recommendation based on data from crawlers or user input. In this business, the total amount of data reaches 5 billion, but the incremental data is less than 500 million, and the analysis mainly targets the data newly added each day. For this scenario, GaussDB (for Cassandra) provides a streaming service and a real-time analysis solution. The client needs no modification: it can read and write data while real-time analysis runs, at the cost of a small loss in read/write performance. The solution consists of the following phases:

  1. The client writes data to GaussDB (for Cassandra) using an open source driver

  2. GaussDB (for Cassandra) provides a streaming interface to capture data changes

  3. The streaming service component built by the customer reads data from the streaming interface and writes it to the specified Kafka queue

  4. Spark or Flink consumes the streaming data from the Kafka queue

  5. In Spark, customers can analyze the incremental data on its own or merge it with existing data for full analysis
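The data flow in phases 3 to 5 can be sketched with a small in-memory simulation. This is not the GaussDB (for Cassandra) streaming interface or a real Kafka client: `ChangeEvent`, `InMemoryTopic`, `forward_changes`, and `analyze_increment` are hypothetical names, the event fields are invented for illustration, and plain Python stands in for Spark or Flink.

```python
from collections import Counter, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical shape of one captured row change."""
    key: str
    category: str
    op: str  # "insert" or "delete" in this sketch


class InMemoryTopic:
    """Stand-in for a Kafka queue: an append-only FIFO buffer."""

    def __init__(self):
        self._events = deque()

    def publish(self, event):
        self._events.append(event)

    def poll(self, max_events):
        """Remove and return up to max_events events, oldest first."""
        batch = []
        while self._events and len(batch) < max_events:
            batch.append(self._events.popleft())
        return batch


def forward_changes(change_stream, topic):
    """Phase 3: copy captured changes from the stream into the queue."""
    count = 0
    for event in change_stream:
        topic.publish(event)
        count += 1
    return count


def analyze_increment(topic, totals, batch_size=100):
    """Phases 4-5: consume batches and fold them into running totals."""
    while True:
        batch = topic.poll(batch_size)
        if not batch:
            return totals
        for event in batch:
            totals[event.category] += 1 if event.op == "insert" else -1


# Usage: start from totals of an earlier full analysis and merge new changes.
changes = [
    ChangeEvent("k1", "news", "insert"),
    ChangeEvent("k2", "video", "insert"),
    ChangeEvent("k3", "news", "delete"),
]
topic = InMemoryTopic()
forward_changes(changes, topic)
totals = analyze_increment(topic, Counter({"news": 10}))
```

Passing an empty `Counter()` instead of previous totals corresponds to analyzing the incremental data on its own; passing earlier totals corresponds to merging the increment into a full result.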

Full data export analysis

Another of the customer’s services needs to analyze and process the full data periodically, but without affecting online services, so the processing has to run during off-peak hours. GaussDB (for Cassandra) provides a full data export and analysis solution that can trigger tasks to export and analyze cold data during off-peak hours. The data export rate is more than 10 times that of the open source version. The following solution, used by an Internet customer, exports data on a weekly schedule to build user portraits. It has the following stages:

  1. The customer configures ECS specifications and mounts an obsFS parallel file system as required

  2. The customer configures export jobs on DLF, including ECS information, export parameters, and scheduled tasks

  3. CDM delivers a job

  4. The export task on the ECS exports the data that matches the specified conditions from the specified table in GaussDB (for Cassandra) to obsFS

  5. Spark reads full data from obsFS for data analysis
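Stages 4 and 5 above can be sketched as follows. Everything here is a stand-in under stated assumptions: a local directory plays the role of the mounted obsFS file system, an in-memory list of rows plays the role of the GaussDB (for Cassandra) table, plain Python plays the role of Spark, and the `ExportJob` fields are hypothetical rather than actual DLF or CDM parameters. Scheduling is not modeled.

```python
import csv
import tempfile
from collections import Counter
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ExportJob:
    """Hypothetical job description; field names are illustrative only."""
    table: str
    min_day: str  # export rows with day >= min_day (ISO date string)
    output_dir: Path
    schedule: str = "weekly, off-peak"


def run_export(job, rows):
    """Stage 4: write the rows matching the job condition to a CSV file."""
    job.output_dir.mkdir(parents=True, exist_ok=True)
    out_path = job.output_dir / f"{job.table}.csv"
    with out_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["user_id", "day", "tag"])
        writer.writeheader()
        for row in rows:
            # ISO dates compare correctly as plain strings.
            if row["day"] >= job.min_day:
                writer.writerow(row)
    return out_path


def build_portrait(export_path):
    """Stage 5: read the exported file and count tags per user."""
    portrait = {}
    with export_path.open(newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            portrait.setdefault(row["user_id"], Counter())[row["tag"]] += 1
    return portrait
```

The point of the split is that the analysis step only touches the exported file, never the online table, which is what keeps the periodic full analysis from affecting online services.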

Huawei Cloud GaussDB (for Cassandra) addresses problems such as difficult capacity expansion, high cost, and slow changes, and enables efficient storage and real-time analysis of massive data, giving Internet companies and IoT enterprises more possibilities for digital development. For more information about GaussDB (for Cassandra), visit the official Huawei Cloud website.

**Author of this article:** Cassandra team of Huawei Cloud Gauss

