In the previous article, we noted that if a server is to improve its overall throughput, it must reduce the processing time of each request. So what factors at the database level affect performance in this scenario?
Pooling technology to reduce the cost of frequently creating database connections
Let's think through the problem following the current overall logic: for the application to talk to the database, it must first establish a database connection, then perform its database operations over that connection, and finally close the connection.
In this scheme, every client request re-establishes a connection. Does frequent connection creation affect performance? The answer is yes, and here is a way to verify it:
```bash
# -i specifies the network interface to capture on
tcpdump -i eth0 -nn -tttt port 3306
```
When we connect to the database, the packet capture command prints the following information about the connection (tested with Navicat's connection test tool).
Focus on the following lines of output.
```
2021-06-24 23:15:50.130812 IP 218.76.8.219.57423 > 172.17.136.216.3306: Flags [S], seq 759743325, win 64240, options [mss 1448,nop,wscale 8,nop,nop,sackOK], length 0
2021-06-24 23:15:50.130901 IP 172.17.136.216.3306 > 218.76.8.219.57423: Flags [S.], seq 3058334924, ack 759743326, win 29200, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
2021-06-24 23:15:50.160730 IP 218.76.8.219.57423 > 172.17.136.216.3306: Flags [.], ack 1, win 260, length 0
2021-06-24 23:15:50.161037 IP 172.17.136.216.3306 > 218.76.8.219.57423: Flags [P.], seq 1:79, ack 1, win 229, length 78
2021-06-24 23:15:50.190126 IP 218.76.8.219.57423 > 172.17.136.216.3306: Flags [P.], seq 1:63, ack 79, win 259, length 62
2021-06-24 23:15:50.190193 IP 172.17.136.216.3306 > 218.76.8.219.57423: Flags [.], ack 63, win 229, length 0
2021-06-24 23:15:50.190306 IP 172.17.136.216.3306 > 218.76.8.219.57423: Flags [P.], seq 79:90, ack 63, win 229, length 11
2021-06-24 23:15:50.219256 IP 218.76.8.219.57423 > 172.17.136.216.3306: Flags [P.], seq 63:82, ack 90, win 259, length 19
2021-06-24 23:15:50.219412 IP 172.17.136.216.3306 > 218.76.8.219.57423: Flags [P.], seq 90:101, ack 82, win 229, length 11
2021-06-24 23:15:50.288721 IP 218.76.8.219.57423 > 172.17.136.216.3306: Flags [.], ack 101, win 259, length 0
```
The first part is the TCP three-way handshake that establishes the connection:
- The first packet is a SYN packet sent by the client to the server
- The second packet is a SYN+ACK packet returned by the server to the client
- The third packet is the ACK packet returned by the client to the server

The second part is the MySQL server verifying the client's password.
From 130812 to 288721, the total time is 157909 microseconds, close to 158 ms. That seems small, and with a low request volume the impact on the system is minor. But as the request volume grows, the cost of connection setup becomes significant.
As we all know, the solution is pooling: establish database connections in advance, so that when the application needs a connection it can fetch one directly from the pre-created pool, as shown in Figure 2-2.
A database connection pool works like a thread pool: its two most important settings are the minimum and maximum number of connections, which control how connections are handed out (a minimal sketch follows this list):
- If the current number of connections is less than the minimum, a new connection is created to serve the database request;
- If there are idle connections in the pool, an idle connection is reused;
- If there are no idle connections and the current number of connections is less than the maximum, a new connection is created to serve the request;
- If the current number of connections has reached the maximum, the request waits for an old connection to become available, up to the maxWait time specified in the configuration;
- If the wait exceeds maxWait, an error is thrown to the caller.
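To make the flow concrete, below is a minimal, illustrative sketch of the acquisition logic. The class and field names (SimpleConnectionPool, maxWaitMillis, and so on) are hypothetical; production pools such as HikariCP or Druid implement the same rules with far more care around concurrency and connection health.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical minimal pool illustrating the rules listed above.
// Not production-grade: it skips connection validation and some races.
public class SimpleConnectionPool {
    private final BlockingQueue<Connection> idle = new LinkedBlockingQueue<>();
    private final AtomicInteger total = new AtomicInteger();
    private final int maxSize;
    private final long maxWaitMillis;
    private final String url, user, password;

    public SimpleConnectionPool(int minSize, int maxSize, long maxWaitMillis,
                                String url, String user, String password) throws SQLException {
        this.maxSize = maxSize;
        this.maxWaitMillis = maxWaitMillis;
        this.url = url; this.user = user; this.password = password;
        for (int i = 0; i < minSize; i++) {   // pre-create the minimum number of connections
            idle.offer(newConnection());
        }
    }

    public Connection getConnection() throws SQLException, InterruptedException {
        Connection conn = idle.poll();        // reuse an idle connection if one exists
        if (conn != null) return conn;
        if (total.get() < maxSize) {          // below the maximum: create a new connection
            return newConnection();
        }
        // at the maximum: wait up to maxWait for a connection to be returned
        conn = idle.poll(maxWaitMillis, TimeUnit.MILLISECONDS);
        if (conn == null) {
            throw new SQLException("Timed out waiting for a pooled connection");
        }
        return conn;
    }

    public void release(Connection conn) {
        idle.offer(conn);                     // return the connection instead of closing it
    }

    private Connection newConnection() throws SQLException {
        total.incrementAndGet();
        return DriverManager.getConnection(url, user, password);
    }
}
```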
In general, the core idea of connection pooling is trading space for time: pre-created objects reduce the performance cost of creating an object on every request, and managing the objects centrally reduces the cost of using them.
Performance optimization of the database itself
Optimizing the database itself is also very important. Common techniques include:
- Create and use indexes properly, and try to satisfy queries through indexes alone
- Optimize SQL execution plans. The execution plan is one of the core technologies of a relational database: it describes the data-access algorithm used when a SQL statement runs, and optimizing it improves query performance
- Return as little data as possible per interaction, since larger payloads mean higher network latency. Common approaches are paging query results and returning only the fields the current scenario needs
- Reduce the number of round trips to the database, for example through batch submissions and batch queries (see the JDBC sketch after this list)
- ...
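As an illustration of the last two points, here is a hedged JDBC sketch; the orders table and its columns are assumptions made up for the example. It pages a query instead of fetching everything, and submits inserts as one batch instead of one round trip per row.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.List;

public class BatchAndPagingExample {

    // Batch insert: many rows, one submission to the database.
    static void insertOrders(Connection conn, List<long[]> orders) throws SQLException {
        String sql = "INSERT INTO orders (user_id, amount) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (long[] o : orders) {
                ps.setLong(1, o[0]);
                ps.setLong(2, o[1]);
                ps.addBatch();      // queue the row on the client side
            }
            ps.executeBatch();      // send the whole batch at once
        }
    }

    // Paged query that returns only the columns the caller needs.
    static void printOrderPage(Connection conn, int pageNo, int pageSize) throws SQLException {
        String sql = "SELECT id, amount FROM orders ORDER BY id LIMIT ? OFFSET ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, pageSize);
            ps.setInt(2, pageNo * pageSize);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getLong("id") + " -> " + rs.getLong("amount"));
                }
            }
        }
    }
}
```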
Performance problems of database read and write operations
Suppose the boss says the company will run a marketing campaign next month: the number of users will grow rapidly, increasing read pressure on the database. MySQL 5.7 on a 4-core 8GB machine can support roughly 500 TPS and 10,000 QPS, while the actual QPS may reach 100,000. What's the solution?
First, let's analyze the problem. Most user-facing systems are read-heavy: in e-commerce, users spend most of their time searching and browsing; in TikTok, most of the time is spent loading short videos. A common solution is read/write separation.
Read/write separation splits database traffic into two parts, one handling transactions (writes) and the other handling reads, as shown in Figure 2-3.
With master-slave replication, we can write only to the master and read only from the slaves, so that even if a write request locks a table or records, it does not block read requests. When read traffic is heavy, we can deploy multiple slave libraries to share it, the so-called "one master, multiple slaves" deployment. In a vertical e-commerce project, this is how you absorb high-concurrency read traffic. In addition, a slave library can also serve as a standby, avoiding data loss if the master library fails.
So you might ask: can I handle arbitrarily high concurrency by adding slave libraries indefinitely? Not really. As the number of slaves increases, more I/O threads connect from the slaves, and the master must create one log dump thread per slave to serve replication requests. Replication therefore consumes significant master resources, and the master's network bandwidth is limited, so in practice a master typically feeds at most 3 to 5 slaves.
Of course, master-slave replication has its drawbacks, including more complex deployment and master-slave synchronization delay, which can sometimes affect the business.
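In application code, one common way to implement the routing is Spring's AbstractRoutingDataSource. The sketch below is illustrative, assuming two DataSources registered under the keys "master" and "slave"; the names other than Spring's own API are made up for the example.

```java
import org.springframework.jdbc.datasource.lookup.AbstractRoutingDataSource;

// Routes each JDBC call to the master or a slave DataSource,
// based on a thread-local hint set by the service layer.
public class ReadWriteRoutingDataSource extends AbstractRoutingDataSource {

    private static final ThreadLocal<String> KEY = ThreadLocal.withInitial(() -> "master");

    public static void useMaster() { KEY.set("master"); }
    public static void useSlave()  { KEY.set("slave"); }

    @Override
    protected Object determineCurrentLookupKey() {
        // Matched against the targetDataSources map configured at startup
        // via setTargetDataSources(...).
        return KEY.get();
    }
}
```

At startup, the master and slave DataSources would be registered with setTargetDataSources, and read-only service methods would call useSlave() before querying.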
Performance problems caused by increased data volume
As the business grows, the amount of data in the database grows too. In early development, data is usually kept in single tables to meet the schedule. As a single table's data volume increases, database reads and writes incur very high performance overhead, for the following reasons.
- If a single table grows too large (tens of millions to hundreds of millions of rows), then even with an index, the index itself grows with the data and the database can no longer cache all of it in memory; index pages must then be read from disk, which hurts query performance.
- As the data volume grows, it occupies more disk space, and backing up and restoring the database takes longer.
- Data from different modules, such as user data and user relationship data, all lives in one master library; if that library fails, every module is affected.
- Benchmarking MySQL 5.7 on a 4-core 8GB cloud server yields roughly 500 TPS and 10,000 QPS: a database's write capability is weaker than its query capability. As write requests grow, write latency also grows (updates must maintain indexes as well, which takes a long time at large data volumes).
In such scenarios, the solution is to shard the data, that is, to split databases and tables, as shown in Figure 2-4. The core of data splitting is to reduce the I/O burden on any single table or library and thereby improve the performance of database operations. A simple routing sketch follows.
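Below is a minimal, hypothetical sketch of hash-based routing; the layout (2 databases of 4 tables each) and the naming scheme are assumptions, and real projects usually delegate this to middleware such as ShardingSphere or MyCat.

```java
// Hypothetical hash routing: maps an order id to a physical database and table.
public class ShardRouter {
    private static final int DB_COUNT = 2;     // order_db_0, order_db_1
    private static final int TABLE_COUNT = 4;  // t_order_0 .. t_order_3 in each database

    public static String route(long orderId) {
        long slot = orderId % (DB_COUNT * TABLE_COUNT);
        long db = slot / TABLE_COUNT;          // which database the slot falls into
        long table = slot % TABLE_COUNT;       // which table within that database
        return "order_db_" + db + ".t_order_" + table;
    }

    public static void main(String[] args) {
        System.out.println(route(1001L));      // slot 1 -> order_db_0.t_order_1
    }
}
```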
Performance improvements brought by different storage devices
Earlier we looked at optimization ideas for traditional relational databases. Overall, these optimizations improve a program's performance when accessing the database. But some situations cannot be solved with a traditional relational database even after optimization, such as:
- When the data volume reaches the TB level, a traditional relational database has little choice but to split databases and tables, and even then a single table's volume remains very large.
- Some data is not suited to relational storage at all; the characteristics of the relational model limit its ability to manage diverse data.
So NoSQL emerged. The concept is hardly unfamiliar: NoSQL is the collective name for database systems that differ from traditional relational databases. They do not use SQL as the query language and, compared with relational databases, offer higher performance and horizontal scalability, which fits the high-concurrency, large-data-volume scenarios of Internet projects. Figure 2-5 shows the relatively mainstream types of NoSQL databases today.
All NoSQL frameworks are catalogued on this site:
hostingdata.co.uk/no (- datab…
Key-value database
Key-value databases, typified by Redis and Memcached, are currently very mainstream NoSQL databases.
There are two reasons their I/O performance beats that of traditional relational databases:
- Data is memory-resident, so reads and writes are highly efficient
- Data is stored as key-value pairs; lookups are O(1), so queries are fast
The biggest advantage of KV NoSQL is high performance: in Redis benchmarks, TPS can approach the 100,000 level, which is very strong. Redis also shares some obvious disadvantages with all KV NoSQL stores:
- The query model is limited to key lookups; conditional queries are not supported, and the only way to run multi-condition queries is data redundancy, which greatly wastes storage space
- Memory is limited and cannot support massive data storage
- Likewise, because KV NoSQL storage is memory-based, there is a risk of data loss
Given these characteristics, key-value databases are suited to cache-like scenarios:
- Far more reads than writes
- High read throughput required
- Data loss is acceptable
Compared with traditional databases, this type of storage offers high read/write performance and is generally used in performance-sensitive scenarios, for example:
- As a distributed cache to improve processing efficiency (see the cache-aside sketch after this list)
- As session storage
- For other features, such as message queues, distributed locks, and Bloom filters
- Weibo's feed stream was implemented with Redis in its early days. (A feed is a continuously updated stream of content presented to the user; WeChat Moments and the Weibo home page are feed streams.)
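To illustrate the distributed-cache use case, here is a hedged cache-aside sketch using the Jedis client; the key prefix, the 300-second TTL, and the loadUserFromDb helper are assumptions made for the example.

```java
import redis.clients.jedis.Jedis;

public class UserCache {

    private final Jedis jedis = new Jedis("localhost", 6379);

    public String getUser(long uid) {
        String key = "user:" + uid;
        String cached = jedis.get(key);      // 1. try the cache first
        if (cached != null) {
            return cached;
        }
        String user = loadUserFromDb(uid);   // 2. cache miss: fall back to the database
        if (user != null) {
            jedis.setex(key, 300, user);     // 3. backfill the cache with a TTL
        }
        return user;
    }

    // Hypothetical database lookup; a real project would use JDBC or an ORM here.
    private String loadUserFromDb(long uid) {
        return "{\"uid\":" + uid + ",\"name\":\"demo\"}";
    }
}
```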
Column database
When we first learned about databases, data was stored in two-dimensional tables, with each row representing a complete record. Most traditional relational databases store data this way, row by row. In recent years, however, columnar storage has seen wider and wider use in big data frameworks.
As shown in Figure 2-6, all the columns of a table row are laid out together and stored in row units. With a B+ tree or SSTable as the index, the corresponding row can be found quickly through the primary key.
In practice, most operations work on whole entities: most CRUD operations target an entire row. Saving a record simply appends a row after the existing data, so writes are very fast.
Queries are a different matter. A typical analytical query traverses the whole table, then groups, sorts, and aggregates; row storage has no advantage in such operations. Worse, an analytical SQL statement may need only some of the columns, yet the columns irrelevant to the operation must still participate in the scan.
For example, as shown in Figure 2-7, suppose we want to count the total number of likes across all articles. With row storage, what does the database do?
- First, load all the rows into memory
- Then sum the like_num column
Row storage loses its advantage in OLAP scenarios, which is why column storage was introduced.
OLTP (Online Transaction Processing) and OLAP (Online Analytical Processing) are, literally, transaction processing and analytical processing. From the perspective of database operations, OLTP mainly inserts, updates, and deletes data, while OLAP mainly queries it.
As shown in Figure 2-8, column storage groups each column's data together, which makes column-wise operations convenient, such as the like_num sum above. With a columnar layout, a single disk read can fetch all three values being summed, so the layout is very well suited to OLAP scenarios:
- When a query involves only some columns, only those columns need to be scanned
- The data in each column has the same type and high similarity, so columnar data compresses efficiently
But columnar storage is not very friendly to OLTP, because writing a single row means modifying storage for every column. The toy sketch below contrasts the two layouts.
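A toy sketch of the difference, assuming three articles that each carry a like_num field: with a row layout, the scan must walk whole records to read one field, while a columnar layout sums one contiguous array.

```java
public class RowVsColumn {

    // Row layout: each record keeps all of its fields together.
    record Article(long id, String title, int likeNum) {}

    public static void main(String[] args) {
        Article[] rows = {
            new Article(1, "a", 10), new Article(2, "b", 20), new Article(3, "c", 30)
        };
        int rowSum = 0;
        for (Article a : rows) {       // touches whole records to read a single field
            rowSum += a.likeNum();
        }

        // Column layout: all like_num values are stored contiguously by themselves.
        int[] likeNumColumn = {10, 20, 30};
        int colSum = 0;
        for (int v : likeNumColumn) {  // scans only the bytes that matter
            colSum += v;
        }
        System.out.println(rowSum + " == " + colSum); // 60 == 60
    }
}
```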
Columnar storage is widely used in big data analysis, for example recommendation profiles (risk control at Ant Financial), spatial data (trip data at Didi), messages and orders (telecom and banking, where it underlies much order-query storage), and feed streams (Moments-style apps).
Document database
In a traditional database, all information is divided into discrete fields and stored in relational tables, and complex scenarios even spread it across different table structures.
For example, in a technical forum, suppose the relationship between users, articles, and article comment tables is shown in Figure 2-10.
If a user clicks on an article and we want to show the article's author, the article details, and the comments on it, what does the server do?
- Find article details
- Find user information based on the UID in the article
- Query the list of all comments on this article
- Query the name of the creator of each comment
This takes either multiple database queries or one complex join query, and neither is very convenient. A document database solves the problem.
A document database stores data in units of documents. Documents take many concrete forms, such as XML, YAML, JSON, and BSON; a document stores concrete fields and values, and applications can use those fields to query and filter data.
Generally, a document contains all of an entity's data. For the structure in Figure 2-10, we can assemble an article's basic information into one complete document and store it in the document database; the application then needs only a single request to fetch all the data:
```
Article: {
    Creator: { uid: "", username: "" },
    Topic: { title: "", content: "" },
    Reply: [
        { replyId: "", content: "" },
        { replyId: "", content: "" }
    ]
}
```
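Here is a hedged sketch of reading such a document in one round trip with the MongoDB Java driver; the database and collection names (forum, article) are assumptions.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import static com.mongodb.client.model.Filters.eq;

public class ArticleRepository {

    private final MongoCollection<Document> articles;

    public ArticleRepository() {
        MongoClient client = MongoClients.create("mongodb://localhost:27017");
        this.articles = client.getDatabase("forum").getCollection("article");
    }

    // One query returns the creator, topic, and all replies together.
    public Document findArticle(Object articleId) {
        return articles.find(eq("_id", articleId)).first();
    }
}
```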
MongoDB is currently the most popular NoSQL document database. It is collection-oriented and schema-free. Data is grouped into "collections"; each collection has its own name and can contain an unlimited number of documents. Collections are similar to tables in a relational database, except that they have no explicit schema.
In a database, a schema (pronounced "skee-muh" or "skee-mah") is the organization and structure of the database. Both "schemas" and "schemata" are acceptable plurals. A schema contains schema objects, which can be tables, columns, data types, views, stored procedures, relationships, primary keys, and foreign keys. A database schema can be represented by a visual diagram that shows the database objects and their relationships to each other.
As shown in Figure 2-11, storing data in flexible JSON-like documents means fields can vary from document to document, and the data structure can change over time.
MongoDB does not provide data consistency checks or transactions, so it is unsuitable for scenarios with strict transactional requirements and is better kept to non-critical data. Common application scenarios are as follows:
- Use MongoDB to record application logs
- Store monitoring data; application tracking events, for example, can be reported directly into MongoDB
- Implement O2O delivery applications: store rider and merchant information in MongoDB, then use its geospatial queries to conveniently find nearby merchants and riders
Graph database
A graph database stores data in a "graph" structure. A graph data store manages two types of information: nodes and edges. Nodes represent entities; edges represent the relationships between those entities. Both nodes and edges can carry attributes that provide information about them (similar to columns in a table).
Edges can also have a direction that indicates the nature of the relationship.
Graph data stores let applications efficiently run queries that traverse the network of nodes and edges and analyze the relationships between entities. Figure 2-12 shows organizational personnel data structured as a graph.
The entities are employees and departments, and the edges indicate reporting lines and the department each employee works in; the arrows on the edges show the direction of each relationship.
With this structure, queries such as "find Sarah's direct or indirect reports" or "who works in the same department as John?" are simple and direct, and even for large graphs with many entities and relationships, complex analyses run quickly. Many graph databases provide a query language for traversing a network of relationships efficiently. Typical scenarios: social relationships, maps, network topologies, and traffic routes.
NewSql
NewSQL is the generation of data storage solutions that followed NoSQL.
Earlier we saw the advantages of NoSQL:
- High availability and scalability, with automatic partitioning and easy expansion
- Large performance gains from not guaranteeing strong consistency
- Extreme flexibility, free of relational-model constraints
However, some of these traits do not fit every scenario. Not guaranteeing strong consistency is fine for ordinary applications, but financial-grade enterprise applications require it. In addition, NoSQL does not support SQL; each NoSQL database has its own independent API for data manipulation, which is relatively cumbersome and complex.
So NewSQL emerged. Simply put, NewSQL grafts NoSQL's powerful scalability onto the traditional relational database. Distribution was never in the DNA of traditional SQL architectures, whereas NewSQL was born in the cloud era with a naturally distributed architecture.
Key features of NewSQL:
- SQL support, including complex queries and big data analytics
- ACID transactions, with support for isolation levels
- Elastic scaling, with capacity expansion transparent to the business layer
- High availability with automatic disaster recovery
Commercial NewSQL offerings:
- Spanner, F1: Google
- OceanBase: Alibaba
- TDSQL: Tencent
- UDDB: UCloud
Conclusion
When NoSQL databases first appeared, they were regarded as the silver bullet that would replace relational databases, for several reasons, in my view:
- They make up for the performance shortcomings of traditional databases;
- Schema changes are convenient and do not require altering existing data structures;
- They suit the large data volumes common in Internet projects;
However, this view turned out to be mistaken. Over time we found that business development still relies on the powerful query capabilities of SQL statements, together with the transactions and flexible indexing of traditional databases; NoSQL can only complement them in certain scenarios.
Use Redis to optimize performance issues
Redis is one of the most heavily used key-value databases today. Let's first get a feel for the performance gap between Redis and MySQL through some load test data.
Demo project: Springboot-Redis-example
Use the JMeter tool to load test the two URLs in this project:
- http://localhost:8080/city/{id}
- http://localhost:8080/city/redis/{id}
For the interface backed by MySQL, the throughput is as follows: QPS = 4735/s.
Figure 2-14 shows the load test data for the Redis-backed interface.
Clearly, in the same application, Redis delivers roughly 1,000 more QPS than MySQL.
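The two endpoints might look like the hedged sketch below; the CityMapper interface and the cache key scheme are assumptions about the demo project, not its actual code.

```java
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

// Hypothetical data-access interface backed by MySQL.
interface CityMapper {
    String findById(long id);
}

@RestController
public class CityController {

    private final CityMapper cityMapper;
    private final StringRedisTemplate redis;

    public CityController(CityMapper cityMapper, StringRedisTemplate redis) {
        this.cityMapper = cityMapper;
        this.redis = redis;
    }

    // Every request hits MySQL.
    @GetMapping("/city/{id}")
    public String findById(@PathVariable long id) {
        return cityMapper.findById(id);
    }

    // Requests are served from Redis, falling back to MySQL on a miss.
    @GetMapping("/city/redis/{id}")
    public String findByIdWithRedis(@PathVariable long id) {
        String key = "city:" + id;
        String cached = redis.opsForValue().get(key);
        if (cached == null) {
            cached = cityMapper.findById(id);
            redis.opsForValue().set(key, cached);
        }
        return cached;
    }
}
```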
Understand Redis
Back in 2008, a programmer in Sicily, Italy, going by the handle antirez (invece.org), built a visitor-tracking site, LLOOGG.COM.
Sometimes we need to know a visitor's IP address, operating system, browser, search keywords, location, pages visited, and so on. In China, many websites provide this function, such as CNZZ and Baidu Analytics; abroad there is Google Analytics.
In other words, we do not have to write the code ourselves; we just embed a snippet of JavaScript in the global footer. When a page is visited, it automatically sends the visitor's information to these statistics servers, and we can then log in to the backend to view the data.
LLOOGG.COM does just that, letting you view up to the 10,000 most recent browsing records. To do this, it creates a list for each site, and visits to different sites go into different lists. If a list grows beyond the length the user specified, the earliest records must be deleted (FIFO), a pattern sketched below.
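That capped-list pattern maps directly onto Redis lists, as in the hedged Jedis sketch below; the key naming and the cap are assumptions. LPUSH adds the newest record at the head, and LTRIM discards everything beyond the configured length.

```java
import redis.clients.jedis.Jedis;

public class PageViewLog {

    private static final int MAX_LEN = 10_000;  // keep only the newest 10,000 records
    private final Jedis jedis = new Jedis("localhost", 6379);

    public void record(String site, String visitorInfo) {
        String key = "pageviews:" + site;       // one list per site
        jedis.lpush(key, visitorInfo);          // newest record goes to the head
        jedis.ltrim(key, 0, MAX_LEN - 1);       // drop the oldest records beyond the cap
    }
}
```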
As LLOOGG.COM gained more and more users, the number of lists to maintain grew, and with it the work of recording the latest requests and deleting the earliest ones. LLOOGG.COM originally used MySQL as its database; as you can imagine, since every record and every deletion required reading and writing disk, once the data volume and concurrency grew large enough, no amount of database optimization could help.
Realizing that disk was the ultimate bottleneck on database performance, antirez planned to bypass it and prototype a list-structured database that kept data in memory instead of on disk, which could greatly improve the efficiency of list pushes and pops. He then rewrote this in-memory database in C and added persistence, and in 2009 Redis was born. Starting as a list database, Redis now supports many data types and offers a range of advanced features, and it has become a widely used open source project around the world.
Why the name Redis? It is short for Remote Dictionary Server, that is, a remote dictionary service.
Key-value database usage ranking
Most of the time we treat Redis as a caching component, but as its history shows, it was not originally built as a cache. In many Internet applications, however, the cache is the most important role it plays. So let's talk about Redis's key features and why we use it as a database cache.
Caches are familiar: the CPU has hardware-level caches, browsers have caches, and mobile apps have caches. We cache data because fetching it from its original location is too expensive; placing it in temporary storage makes retrieval faster.
To understand the nature of Redis, we must answer a few questions:
1. Why put data in memory?
- Memory is faster: on the order of 100,000 QPS
- It reduces computation time
2. If an in-memory data structure is all we need, why not use a HashMap or Memcached?
- Richer data types
- Usable in-process and across processes; standalone and distributed
- Rich features: persistence mechanisms, expiration policies
- Support for many programming languages
- High availability and clustering
db-engines.com/en/ranking/…