One of our customers had a MySQL problem: a large table with over 2 billion records, and counting. Left alone, the table would eventually exhaust the available disk space, which could ultimately take down the entire application. A table this large brings other problems too: poor query performance, a poor schema design, and no easy way to analyze the data because there are simply too many records. We wanted a solution that addressed these issues without introducing costly maintenance windows that would take the application down and leave customers unable to use the system. In this article, I'll present our solution, but I also want to stress that it is not a prescription: different situations call for different solutions. Still, perhaps someone can take some valuable insight from it.
Will cloud solutions be the cure?
After evaluating several alternatives, we decided to move our data to the cloud, and we chose Google BigQuery. We picked it because our customers prefer Google's cloud solutions, their data is structured and suited to analysis, and they don't require low latency, so BigQuery seemed like the perfect fit. After testing, we were convinced that BigQuery met our customers' needs, letting them use analytics tools that can analyze the data in seconds. However, as you probably already know, running a large number of queries against BigQuery can get very expensive, so we wanted to avoid querying it directly from the application and use BigQuery only as an analysis and backup tool.
Stream data to the cloud
When it comes to streaming data, there are many ways to do it, and we chose a very simple one. We used Kafka because we were already using it extensively in the project, so there was no need to introduce another tool. Kafka gives us another advantage: we can push all the data onto Kafka, hold it there for a while, and then move it to its destination without putting much extra load on the MySQL cluster. This approach also gives us a fallback in case the move to BigQuery fails, for example if querying it turns out to be too costly or too difficult. It was an important decision, and it brought us many benefits at very little cost.
Stream data from MySQL to Kafka
When it comes to streaming data from MySQL to Kafka, you might think of Debezium (debezium.io) or Kafka Connect. Both are good choices, but in our case we couldn't use either. Our MySQL server version was too old for Debezium to support, and upgrading MySQL was not an option. We also couldn't use Kafka Connect, because the table lacks an auto-increment column, which makes it impossible for Kafka Connect to transfer the data without losing records. We know it is possible to key on timestamps instead, but that approach would likely lose some data, because Kafka Connect queries the data using timestamps with lower precision than the ones defined on the table columns. Of course, both of these tools are good, and I recommend them for streaming database data to Kafka if they don't conflict with anything in your project. In our case, we needed to develop a simple Kafka producer that queries the data without losing any of it and streams it to Kafka, plus a consumer that sends the data to BigQuery, as shown in the figure below.
Stream data to BigQuery
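To make this concrete, here is a minimal sketch of the two pieces in Python: a producer that reads the source table in keyset-paginated batches and publishes each row to a Kafka topic, and a consumer that streams those rows into BigQuery. The topic, table, and column names (including the `record_id` key used for pagination) are made up for illustration, and the sketch assumes the table has some unique, ordered key to resume from; our production code also had to handle filtering, batching, and retries.

```python
import json

import mysql.connector
from kafka import KafkaConsumer, KafkaProducer
from google.cloud import bigquery

TOPIC = "legacy-table-rows"   # hypothetical topic name
BATCH_SIZE = 10_000


def stream_mysql_to_kafka():
    """Read the big table in keyset-paginated batches and publish each row to Kafka."""
    conn = mysql.connector.connect(host="mysql-host", user="app", password="...", database="app")
    cursor = conn.cursor(dictionary=True)
    producer = KafkaProducer(
        bootstrap_servers=["kafka:9092"],
        value_serializer=lambda v: json.dumps(v, default=str).encode("utf-8"),
    )

    last_id = 0  # assumes a unique, ordered key we can resume from (hypothetical column)
    while True:
        cursor.execute(
            "SELECT * FROM big_table WHERE record_id > %s ORDER BY record_id LIMIT %s",
            (last_id, BATCH_SIZE),
        )
        rows = cursor.fetchall()
        if not rows:
            break
        for row in rows:
            producer.send(TOPIC, value=row)
        producer.flush()  # make sure the batch is on the broker before advancing
        last_id = rows[-1]["record_id"]


def stream_kafka_to_bigquery():
    """Consume the topic and stream rows into BigQuery in small batches."""
    client = bigquery.Client()
    table_id = "my-project.my_dataset.big_table"  # hypothetical destination table
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers=["kafka:9092"],
        group_id="bigquery-sink",
        auto_offset_reset="earliest",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )

    buffer = []
    for message in consumer:
        buffer.append(message.value)
        if len(buffer) >= 500:
            errors = client.insert_rows_json(table_id, buffer)  # streaming insert
            if errors:
                raise RuntimeError(f"BigQuery insert failed: {errors}")
            buffer.clear()
```

Since this pipeline is at-least-once, duplicates can appear on retries, so it is worth deduplicating in the analysis queries or passing row ids to the insert call.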
Reclaim storage space by partitioning
We streamed all the data to Kafka (with some filtering to reduce the load) and then streamed it on to BigQuery. This solved the query performance problem, allowing us to analyze large amounts of data in seconds, but the space problem remained. We wanted a design that both solved the problem now and would be easy to work with in the future. We prepared a new schema for the data tables, using sequence ids as the primary key and partitioning the data by month. By partitioning the large table, we can back up old partitions and drop them when they are no longer needed, reclaiming some space. So we created a new table with the new schema and populated it with data from Kafka. Once all the records were migrated, we deployed a new version of the application that inserts into the new table, and dropped the old table to reclaim the space. Of course, migrating the old data into the new table requires enough free space to hold both. In our case, however, we were constantly backing up and removing old partitions during the migration to make sure there was enough room for the new data.
Stream data into partitioned tables
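As an illustration, the new table and the consumer that populates it from Kafka might look roughly like the sketch below. The table, topic, and column names are hypothetical, and the exact partition ranges depend on your data; note that MySQL requires the partitioning column to appear in every unique key, which is why the primary key here is composite rather than the sequence id alone.

```python
import json

import mysql.connector
from kafka import KafkaConsumer

# Hypothetical schema: a sequence id plus the timestamp used for monthly partitioning.
CREATE_PARTITIONED_TABLE = """
CREATE TABLE big_table_partitioned (
    id         BIGINT NOT NULL,
    created_at DATETIME NOT NULL,
    payload    TEXT,
    PRIMARY KEY (id, created_at)
)
PARTITION BY RANGE (TO_DAYS(created_at)) (
    PARTITION p2023_01 VALUES LESS THAN (TO_DAYS('2023-02-01')),
    PARTITION p2023_02 VALUES LESS THAN (TO_DAYS('2023-03-01')),
    PARTITION p_future VALUES LESS THAN MAXVALUE
)
"""

# Old months can be backed up and then dropped to reclaim disk space.
DROP_OLD_PARTITION = "ALTER TABLE big_table_partitioned DROP PARTITION p2023_01"


def populate_from_kafka():
    """Replay the Kafka topic into the new partitioned table."""
    conn = mysql.connector.connect(host="mysql-host", user="app", password="...", database="app")
    cursor = conn.cursor()
    cursor.execute(CREATE_PARTITIONED_TABLE)

    consumer = KafkaConsumer(
        "legacy-table-rows",               # the topic fed by the producer above (hypothetical)
        bootstrap_servers=["kafka:9092"],
        group_id="partitioned-table-sink",
        auto_offset_reset="earliest",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    for message in consumer:
        row = message.value
        cursor.execute(
            "INSERT INTO big_table_partitioned (id, created_at, payload) VALUES (%s, %s, %s)",
            (row["id"], row["created_at"], json.dumps(row)),
        )
        conn.commit()  # in practice, commit in batches rather than per row
```

Dropping a whole partition is a quick metadata operation that hands its space back, which is what made the backup-then-drop cycle during the migration practical.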
Reclaim storage space by collating data
Once the data was streaming into BigQuery, we could easily analyze the entire data set and validate new ideas, such as reducing the space taken up by the tables in the database. One idea was to check how the different types of records are distributed across the table. It turned out that almost 90% of the data was unnecessary, so we decided to clean it up. I developed a new Kafka consumer that filters out the unwanted records and inserts the ones we need to keep into another table, which we call the collated table, as shown below.
Type A and type B are filtered out after collation:
Stream the data into the new table
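A sketch of that filtering consumer is below. It assumes, purely for illustration, that the record type lives in a `record_type` field, that types A and B are the ones we can discard, and that the source topic carries the rows of the partitioned table.

```python
import json

import mysql.connector
from kafka import KafkaConsumer

# Hypothetical set of record types that analysis in BigQuery showed we no longer need to keep.
DISCARDED_TYPES = {"type_a", "type_b"}


def collate_records():
    """Filter out unwanted record types and insert the rest into the collated table."""
    conn = mysql.connector.connect(host="mysql-host", user="app", password="...", database="app")
    cursor = conn.cursor()

    consumer = KafkaConsumer(
        "partitioned-table-rows",          # hypothetical topic fed from the partitioned table
        bootstrap_servers=["kafka:9092"],
        group_id="collated-table-sink",
        auto_offset_reset="earliest",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    for message in consumer:
        row = message.value
        if row.get("record_type") in DISCARDED_TYPES:
            continue  # roughly 90% of the rows are dropped here
        cursor.execute(
            "INSERT INTO big_table_collated (id, created_at, record_type, payload) "
            "VALUES (%s, %s, %s, %s)",
            (row["id"], row["created_at"], row["record_type"], json.dumps(row)),
        )
        conn.commit()  # batch commits in a real deployment
```

The consumer-group mechanics are the same as before; only the filter and the destination table change.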
With the data collated, we updated the application to read from the new table. We keep writing to the partitioned table mentioned earlier, and Kafka keeps pushing data from that table into the collated table. As you can see, this solved the customer's problem. With partitioning, storage space is no longer an issue, and collating and indexing the data resolved some of the application's query performance problems. Finally, we stream all the data to the cloud so that our customers can easily analyze all of it. Since we only use BigQuery for specific analytical queries, while the customers' other application queries are still handled by the MySQL server, the cost stays low. Another important point is that all of this was done without downtime, so customers were not affected.
Conclusion
In short, we used Kafka to stream the data to BigQuery. Pushing all the data onto Kafka first gave us room to work out the rest of the solution, so we could solve an important problem for our customers without worrying about getting things wrong.