Splitting databases and tables (sharding) solves two problems:
First, queries get slow when there is too much data. The "queries" we are talking about are mainly the queries and updates inside transactions, because read-only queries can be offloaded by caching and read/write splitting on master-slave replicas, as we discussed in the previous two lessons on how MySQL handles high concurrency. And as we covered in the last lesson, to fix a slow query you just need to reduce the amount of data each query has to scan. In other words, splitting tables solves the slow-query problem.
Second, coping with high concurrency. As mentioned earlier, when a single database instance can no longer handle the load, concurrent requests have to be spread across multiple instances. So solving high concurrency requires splitting databases.
To put it simply: if the data volume is large, split tables; if the concurrency is high, split databases.
Three sharding algorithms are commonly used
- Range sharding tends to create hot spots, but it is query-friendly, so it suits scenarios with low concurrency;
- Hash sharding distributes data and queries more evenly across all shards (see the sketch after this list);
- Lookup-table sharding is the most flexible, but performs worse, since every request needs an extra lookup.
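For concreteness, here is a minimal sketch of hash sharding in Python. The sharding key (order_id) and the 4-database-by-8-table layout are illustrative assumptions, not values from the text; CRC32 is used because, unlike Python's built-in hash(), it is stable across processes.

```python
import zlib

DB_COUNT = 4        # assumed number of databases
TABLES_PER_DB = 8   # assumed number of tables per database

def route(order_id: int) -> tuple[int, int]:
    """Map a sharding key to a (database, table) pair."""
    slot = zlib.crc32(str(order_id).encode()) % (DB_COUNT * TABLES_PER_DB)
    return slot // TABLES_PER_DB, slot % TABLES_PER_DB

db, table = route(20230001)
print(f"order 20230001 -> db_{db}.t_order_{table}")
```

Because every record with the same key always lands on the same shard, a query that carries the sharding key touches exactly one table.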
MySQL to Redis synchronization
- Synchronize via MQ messages: after updating MySQL, the service publishes a message, and a consumer applies the change to Redis;
- Update the Redis cache in near real time from the MySQL binlog (for example, using Canal).
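Both approaches end the same way: a consumer applies row-change events to Redis. Below is a minimal sketch of that consumer using the redis-py client; the event format is a hypothetical example, not Canal's actual wire format or any particular MQ schema.

```python
import json
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)

def apply_event(event: dict) -> None:
    """Apply one row-change event (from MQ or a binlog subscriber) to Redis."""
    key = f"order:{event['row']['order_id']}"
    if event["type"] in ("INSERT", "UPDATE"):
        r.set(key, json.dumps(event["row"]))  # refresh the cached row
    elif event["type"] == "DELETE":
        r.delete(key)                         # drop the stale cache entry

# Hypothetical event, shaped the way a parsed Canal/MQ payload might look:
apply_event({"type": "UPDATE", "row": {"order_id": 42, "status": "PAID"}})
```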
How do you switch to a new database without downtime?
- Bring a synchronization program online to copy data from the old database to the new one and keep the two in sync in real time;
- Deploy the double-write version of the order service online, still reading and writing only the old database;
- Turn on double-write and, at the same moment, stop the synchronization program;
- Start a compare-and-repair program to verify that the old and new databases hold exactly the same data, fixing any discrepancies (a sketch of the double-write step follows this list);
- Gradually (by grayscale release) switch read requests over to the new database;
- Take the compare-and-repair program offline, turn off double-write, and direct all reads and writes to the new database;
- Decommission the old database and remove the double-write logic from the order service.
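Here is a sketch of the double-write step only; all class and method names are hypothetical. The old database remains the source of truth, and a failed write to the new database is logged for the compare-and-repair program rather than failing the request.

```python
import logging

class OrderStore:
    """Hypothetical data-access layer for the order service."""

    def __init__(self, old_db, new_db, double_write: bool = False):
        self.old_db = old_db
        self.new_db = new_db
        self.double_write = double_write  # flipped on when the sync program stops

    def save(self, order) -> None:
        self.old_db.save(order)  # the old DB stays the source of truth
        if self.double_write:
            try:
                self.new_db.save(order)
            except Exception:
                # Never fail the request because of the new DB; the
                # compare-and-repair program reconciles the two later.
                logging.exception("double-write to new DB failed: %r", order)
```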
How should huge volumes of data like clickstreams be stored?
For a storage system holding massive raw data, what we need is very high write and read performance and near-unlimited capacity, while the requirements on query capability are modest. In production you can choose Kafka or HDFS: Kafka has better read and write performance, and a single node can sustain higher throughput; HDFS, on the other hand, offers truly unlimited storage capacity and is more query-friendly.
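As an illustration of the Kafka option, here is a minimal producer sketch using the kafka-python client; the broker address and the topic name "clickstream" are assumptions for the example.

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Each clickstream event is appended to the topic as an immutable log entry.
producer.send("clickstream", {"user_id": 1, "url": "/cart", "ts": 1700000000})
producer.flush()  # block until the event has actually been written
```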
Beyond Kafka and HDFS, two newer categories of storage systems are worth watching.
One category is distributed streaming data storage; among the more active projects are Pravega and Apache BookKeeper, the storage engine behind Pulsar. These systems follow the same path as Kafka, providing truly unlimited capacity and better query capability on top of high throughput.
The other category is time series databases; InfluxDB and OpenTSDB are the active projects here. These databases not only have very good read and write performance but also provide very convenient query and aggregation capabilities. However, they are not general-purpose stores: they focus on data such as monitoring metrics, which carry a timestamp and are mostly numeric. If you need to store massive amounts of monitoring data, these are the projects to look at.