Finally, some original content of my own. This article is the author's talk for EdgeDB Day on February 10, 2022. Because the Chinese version of the talk was written in advance, it is being published early, first on Nuggets; it will then be translated into English and posted on EdgeDB’s official website.
The questions I hear most often in the Chinese community about EdgeDB are: How is EdgeDB different from openGauss, OceanBase, or TiDB? Does EdgeDB support horizontal scaling? This article tries to answer them from the perspective of EdgeDB’s architecture, and to explain what EdgeDB actually is.
Architecture
The overall architecture of EdgeDB is very simple. It is basically a server program that encapsulates PostgreSQL:
Your application defines a data structure (schema) and sends EdgeQL queries to EdgeDB based on that schema. For example, here is a schema definition written in EdgeQL SDL:
type Person {
  property name -> str;
}

type Team {
  property slogan -> str;
  multi link members -> Person {
    property title -> str;
  }
}
Here is your EdgeQL query:
select Team {
  slogan,
  members: {
    name,
    @title
  } order by @title
}
For details on the advantages of EdgeQL and SDL, please refer to “This is SQL? We can Do Better”.
Returning to the diagram: on the EdgeDB side, EdgeQL queries are compiled into SQL, executed in PostgreSQL, and the results are returned along the same path. EdgeDB itself stores no data; your schema and your data both live directly in PostgreSQL. PostgreSQL also holds some eight thousand items of EdgeDB’s own base data, including built-in schemas, data types, the standard library, user roles, database configuration, and more.
EdgeDB does not modify the PostgreSQL backend. It is an ordinary PostgreSQL database at version 13 or higher, and it can even be a PostgreSQL service hosted on a cloud platform. Because EdgeDB also stores its configuration in PostgreSQL, EdgeDB Server is a “stateless” service relative to PostgreSQL, which is why EdgeDB Server and PostgreSQL are drawn separately in the following figure. Most of the time, though, EdgeDB runs its own bundled PostgreSQL and users never notice it.
Performance
If I told you that EdgeDB Server was actually written in Python, would you dare to use it?
In fact, EdgeDB is optimized to match PostgreSQL’s native performance. That may not sound like much, but I am talking about end-to-end efficiency: in most current solutions it is shaped by connection resource allocation, SQL compilation and caching, ORM overhead, SQL optimization, and so on, and taken together EdgeDB comes out ahead. And that is before counting the improvements EdgeDB makes to developer productivity. So how does EdgeDB do it?
Starting from the bottom, there is the binary protocol used to talk to PostgreSQL. EdgeDB’s code here is derived from the asyncpg project and likewise uses Cython, so it is much faster than connecting to PostgreSQL from Python through psycopg2, and even faster than either of the two options in Go. Beyond the binary protocol and Cython acceleration, EdgeDB makes extensive use of pipelining: each EdgeQL query (yes, no matter how complex, EdgeDB compiles it into one very long but efficient SQL statement) costs only a single logical network round trip, which greatly reduces the time spent on repeated network trips.
Next come prepared statements: EdgeDB caches the SQL statements it has prepared in PostgreSQL. A typical application has a limited number of distinct queries, perhaps a few dozen, executed with different parameters. EdgeDB can therefore skip having PostgreSQL parse the SQL again each time and go straight to the planning phase. Applications or database frameworks that do not use EdgeDB can achieve this too, but only with advanced techniques such as the baked query feature of SQLAlchemy.
Similarly, EdgeDB caches the results of compiling EdgeQL to SQL and executes cache hits directly. The cache key is not a hash of the query string but the AST (abstract syntax tree) produced by syntactic and semantic parsing. The advantage is that even if your EdgeQL statement contains literals, EdgeDB can extract the skeleton of the statement from the AST, so the literals do not affect cache hits. Because every query has to be parsed before the cache lookup, EdgeDB implements the parser as a Python extension written in Rust, which takes about 50-70 microseconds to parse a statement, that is, 0.05-0.07 milliseconds.
On the right is the EdgeQL compiler process pool. Because the compiler itself is written in Python and is CPU-intensive, EdgeDB uses a pool of processes (not threads, in order to bypass the GIL) to compile EdgeQL, exchanging pickled data with it over UNIX domain sockets. In general, though, once the cache is fully warmed up, the compiler pool has little work to do.
Finally, at the top is the binary protocol between EdgeDB and its clients. This protocol deliberately mirrors PostgreSQL’s binary protocol: for one thing, the team’s deepest experience comes from asyncpg; for another, it lets EdgeDB inherit PostgreSQL’s data format. In other words, at the binary transport layer, EdgeDB Server does not need to unpack the result data sent by the PostgreSQL server; it can wrap that data directly in its own binary protocol and send it on to the client. For actual user data, then, EdgeDB is almost a transparent proxy in front of PostgreSQL, only with a completely different query language and type system.
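To make the statement cache concrete, here is a minimal sketch of the idea. It is not EdgeDB's actual code: `FakeConnection`, its `prepare` method, and the cache capacity are hypothetical stand-ins, and the only point is that a given SQL text is prepared once and reused afterwards.

```python
from collections import OrderedDict


class FakeConnection:
    """Hypothetical stand-in for a PostgreSQL connection; it only counts prepares."""

    def __init__(self):
        self.prepare_calls = 0

    def prepare(self, sql):
        self.prepare_calls += 1
        return f"stmt_{self.prepare_calls}"  # pretend server-side statement name


class StatementCache:
    """LRU cache mapping SQL text to an already-prepared statement name."""

    def __init__(self, conn, capacity=100):
        self.conn = conn
        self.capacity = capacity
        self._cache = OrderedDict()

    def get(self, sql):
        if sql in self._cache:
            self._cache.move_to_end(sql)  # mark as most recently used
            return self._cache[sql]
        name = self.conn.prepare(sql)  # cache miss: prepare once on the server
        self._cache[sql] = name
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict the least recently used entry
        return name
```

With a few dozen distinct queries and a cache larger than that, almost every call after warm-up is a hit and skips the prepare step.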
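The effect of keying the cache on the shape of a query rather than on its text can be illustrated with a toy normalizer. This is only a sketch under my own assumptions: the regular-expression tokenizer and the `normalize` function are invented for illustration and are far simpler than EdgeDB's real Rust parser.

```python
import re

# Toy tokenizer: string literals, numbers, identifiers/keywords, then any other symbol.
TOKEN_RE = re.compile(r"""
    (?P<str>'(?:[^'\\]|\\.)*')
  | (?P<num>\d+(?:\.\d+)?)
  | (?P<word>[A-Za-z_@][A-Za-z_0-9]*)
  | (?P<sym>\S)
""", re.VERBOSE)


def normalize(query):
    """Return (cache_key, literals): the query shape with literals lifted out."""
    shape, literals = [], []
    for m in TOKEN_RE.finditer(query):
        if m.lastgroup in ("str", "num"):
            literals.append(m.group())
            shape.append(f"${len(literals) - 1}")  # placeholder for the literal
        else:
            shape.append(m.group())
    return tuple(shape), literals
```

Two queries that differ only in their literals, for example filtering on 'Alice' versus 'Bob', produce the same key, so the second one can reuse the compiled SQL of the first.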
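As a rough illustration of exchanging pickled data over a socket, here is a small sketch. The length-prefixed framing and the helper names `send_msg` and `recv_msg` are my own assumptions; the text above only says that pickled data travels over UNIX domain sockets.

```python
import pickle
import socket
import struct


def send_msg(sock, obj):
    """Pickle `obj` and send it with a 4-byte big-endian length prefix."""
    payload = pickle.dumps(obj)
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exact(sock, n):
    """Read exactly `n` bytes, or raise if the peer closes early."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the socket")
        buf.extend(chunk)
    return bytes(buf)


def recv_msg(sock):
    """Read one length-prefixed message and unpickle it."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return pickle.loads(recv_exact(sock, length))
```

On Unix, `socket.socketpair()` returns a connected pair of UNIX domain sockets, which is enough to try the two helpers out. Pickle is only appropriate here because both ends are trusted processes of the same server.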
Connections
When EdgeDB enters a real high-concurrency environment, things get even more interesting:
First, a connection between EdgeDB Server and a client is very lightweight and, once authenticated, completely stateless (except inside a database transaction, where it must stay bound to the same backend connection), because all front-end connections share the same EdgeQL compilation cache and the same pool of backend PostgreSQL connections. EdgeDB assigns a backend PostgreSQL connection only when a client actually makes a request, and returns it to the pool as soon as the query completes so that other front-end connections can use it. The principle is similar to pgBouncer: with EdgeDB you no longer need that middleware, and you no longer need to worry about front-end connections eating into PostgreSQL’s limited number of connections. With uvloop, supporting tens or even hundreds of thousands of front-end connections on a single machine is fine, as long as your backend PostgreSQL can keep up.
Because they are so lightweight, front-end connections on EdgeDB Server have no “connection pool”, only a cap on the maximum number of connections, which exists more to protect against attacks than to ration PostgreSQL resources. EdgeDB Server also proactively closes front-end connections that have been idle for a long time (30 seconds by default); the client can simply reconnect.
EdgeDB’s concurrent network I/O currently runs on uvloop, reputedly the fastest asynchronous networking framework for Python. If you have read the brief history of EdgeDB, you know that uvloop and asyncpg, and to some extent even Python’s async/await syntax, were created for EdgeDB. So EdgeDB already uses the best available solution for concurrent I/O in Python. Python was chosen mainly because EdgeDB iterated heavily in its early days and needed that flexibility; once things stabilize, we will consider rewriting the I/O layer in Rust.
Second, you may have noticed that there are two groups of clients using different schemas. This corresponds to the PostgreSQL concept of a “logical database”: one database instance can contain several logical databases. EdgeDB supports this as well, and handles it more maturely than PostgreSQL, because PostgreSQL cannot help you balance load between logical databases. The total number of database connections is limited to a few hundred or a few thousand; every connection you give to DB1 is one fewer for DB2, and an existing connection cannot be switched to another database at zero cost (blame PostgreSQL for that one). EdgeDB’s architecture gives it an advantage here: it can see the proportion of front-end connections per database, so we wrote a fairly complex algorithm in EdgeDB’s backend connection pool that schedules how many connections each logical database gets, automatically balancing for the best quality of service (QoS). In other words, no database starves while another is flooded, and “front-end” developers are freed from worrying about the problem at all.
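The borrow-per-request pattern described above can be sketched with asyncio. Everything here is a simplified assumption for illustration: `BackendPool`, the fake connection names, and the short sleep standing in for a PostgreSQL round trip are not EdgeDB's implementation.

```python
import asyncio


class BackendPool:
    """A fixed set of (fake) backend connections shared by all front-end clients."""

    def __init__(self, size):
        self._free = asyncio.Queue()
        for i in range(size):
            self._free.put_nowait(f"pg-conn-{i}")
        self.in_use = 0
        self.peak_in_use = 0

    async def run_query(self, query):
        conn = await self._free.get()  # wait until a backend connection is free
        self.in_use += 1
        self.peak_in_use = max(self.peak_in_use, self.in_use)
        try:
            await asyncio.sleep(0.001)  # stand-in for the PostgreSQL round trip
            return f"{query} via {conn}"
        finally:
            self.in_use -= 1
            self._free.put_nowait(conn)  # hand it back right after the query


async def demo(frontends=200, backends=4):
    """Serve many front-end requests with only a handful of backend connections."""
    pool = BackendPool(backends)
    results = await asyncio.gather(
        *(pool.run_query(f"q{i}") for i in range(frontends))
    )
    return len(results), pool.peak_in_use
```

Running `asyncio.run(demo())` serves 200 front-end requests while never holding more than 4 backend connections at once, which is the property that makes front-end connections cheap.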
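EdgeDB's real scheduling algorithm is more involved than anything I can show here, but the basic idea of dividing a fixed number of backend connections according to front-end demand can be sketched as follows. The `allocate` function, its proportional rule, and the one-connection minimum per active database are all assumptions made for illustration.

```python
def allocate(total, demand):
    """Split `total` backend connections across databases in proportion to demand.

    `demand` maps a database name to its number of waiting front-end requests.
    Every active database gets one connection first; the rest are shared
    proportionally, with rounding leftovers going to the largest fractions.
    """
    active = {db: n for db, n in demand.items() if n > 0}
    if not active:
        return {}
    if total < len(active):
        raise ValueError("fewer connections than active databases")
    shares = {db: 1 for db in active}  # nobody starves
    spare = total - len(active)
    weight = sum(active.values())
    exact = {db: spare * n / weight for db, n in active.items()}
    for db in active:
        shares[db] += int(exact[db])
    leftover = total - sum(shares.values())
    # Hand out the rounding leftovers to the largest fractional parts.
    by_fraction = sorted(
        active, key=lambda db: exact[db] - int(exact[db]), reverse=True
    )
    for db in by_fraction[:leftover]:
        shares[db] += 1
    return shares
```

For example, with 10 connections and demand of 30 requests on db1 and 10 on db2, this sketch gives db1 seven connections and db2 three; as demand shifts, calling it again rebalances the split.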
Quality of service
Speaking of quality of service (QoS), we have to mention the QoS optimizations built into EdgeDB’s official clients.
When your application calls the query() method of the EdgeDB client, the client does not simply forward the request. Instead, it does a series of things that are usually done by app developers to improve the quality of the application.
Each client wraps a connection pool that starts out empty; connections are created only when needed, so loading clients lazily is not a problem. When a connection is being created, maybe the network is down, maybe the Docker container hosting EdgeDB Server has not started yet, maybe the cloud service is restarting or failing over; whatever the reason, the connection just cannot be established. What then? Before reporting an error to the application developer, the client keeps trying to connect until it succeeds; how long it keeps retrying is configurable.
Once it has a connection, the client first checks its local query cache. If it already has the type information for the query, it uses that directly to encode the input arguments and completes the query with a single optimistic_execute exchange with the server. Otherwise, it must first prepare to obtain the argument types and then call execute, which takes two round trips to the server.
Further along, all is well if the server returns results successfully, but sometimes things go wrong. Again, before the client reports the problem to the application developer, it tries to fix it itself. If the connection to EdgeDB is still up and the problem is a “retryable” one, such as a data conflict caused by the isolation level or a temporary outage of the backend PostgreSQL, the client simply resends the already-assembled request. If the connection to EdgeDB has been lost, the client reconnects and retries if the retry rule allows it, but only when the EdgeQL query is read-only. How does the client know whether a statement is read-only? EdgeDB Server knows, and the prepare result carries that information. If that information has not come back in time, the client reports the error instead.
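The difference between the one-trip and two-trip paths can be sketched by counting round trips against a fake server. `FakeServer` and `CachingClient` are hypothetical names, the type descriptor is a placeholder, and `execute` here stands in for the optimistic path; the real client protocol is more detailed.

```python
class FakeServer:
    """Hypothetical server that just counts round trips."""

    def __init__(self):
        self.round_trips = 0

    def prepare(self, query):
        self.round_trips += 1
        return {"query": query, "arg_types": ("str",)}  # pretend type descriptor

    def execute(self, query, encoded_args):
        self.round_trips += 1
        return f"result of {query!r} with {encoded_args!r}"


class CachingClient:
    """Client that remembers type descriptors per query text."""

    def __init__(self, server):
        self.server = server
        self._types = {}  # query text -> cached type descriptor

    def query(self, query, *args):
        desc = self._types.get(query)
        if desc is None:
            # Cache miss: ask the server for the types first (extra round trip).
            desc = self._types[query] = self.server.prepare(query)
        encoded = tuple(str(a).encode() for a in args)  # stand-in for binary encoding
        return self.server.execute(query, encoded)
```

The first call for a given query costs two round trips in this sketch, and every later call with different arguments costs one.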
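The retry rule described above might look roughly like this in code. The exception classes, the `run_with_retry` helper, and the linear backoff are assumptions for the sake of the sketch and are not the official client's API.

```python
import time


class RetryableError(Exception):
    """Hypothetical: transient server-side failure, e.g. a serialization conflict."""


class ConnectionLost(Exception):
    """Hypothetical: the connection dropped while the query was in flight."""


def run_with_retry(send, *, read_only, attempts=3, backoff=0.01):
    """Call `send()` until it succeeds or the retry rule says to give up."""
    for attempt in range(1, attempts + 1):
        try:
            return send()
        except RetryableError:
            if attempt == attempts:
                raise
        except ConnectionLost:
            # A write may already have been applied, so only re-send read-only queries.
            if not read_only or attempt == attempts:
                raise
        time.sleep(backoff * attempt)  # simple linear backoff before the next try
```

The asymmetry is the important part: a transient conflict is always worth another attempt, while a lost connection is only retried when repeating the query cannot change data twice.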
For code that uses database transactions, the process is still the same, but more transparent to the application developer. For example, here’s Python code:
async for tx in client.transaction():
    async with tx:
        await tx.execute("insert ...")
Or this JavaScript code:
await client.transaction(async tx => {
  await tx.execute(`insert ...`);
});
EdgeDB’s official client interface forces application developers to think about what happens if the whole transaction block is retried. This is in fact the correct way to write database transactions (the best practice for handling a SerializationError at the SERIALIZABLE isolation level is to deliberately re-execute the entire transaction), but because other database drivers do not provide it, the failure usually ends up as a blank page for the user, and the user refreshing the page becomes the retry. With an interface that forces retryable transactions, you are also less likely to accidentally put code in a transaction that does not belong there, such as manipulating a counter in Redis.
With these QoS mechanisms in place, application developers do not need to worry about errors when, for example, a connection in the client pool sits idle for too long and the server closes it. At the same time, they let EdgeDB relieve concurrency pressure and improve overall quality of service.
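Why the whole block must be safe to re-run can be shown with a small simulation. This sketch does not use the edgedb library: `run_transaction`, the dictionary standing in for the database, and `SerializationError` are hypothetical, and the list plays the role of an external counter such as one kept in Redis.

```python
class SerializationError(Exception):
    """Hypothetical stand-in for a SERIALIZABLE conflict reported by the database."""


def run_transaction(block, db, attempts=3):
    """Run `block(tx)` against a scratch copy of `db`; commit only if it succeeds."""
    for attempt in range(1, attempts + 1):
        tx = dict(db)  # "BEGIN": work on a private copy
        try:
            result = block(tx)
        except SerializationError:
            if attempt == attempts:
                raise
            continue  # "ROLLBACK": discard the copy and re-run the whole block
        db.clear()
        db.update(tx)  # "COMMIT"
        return result


db = {"balance": 100}
external_counter = []  # side effect outside the transaction


def block(tx):
    external_counter.append(1)  # NOT rolled back when the block is retried
    tx["balance"] -= 10
    if len(external_counter) == 1:
        raise SerializationError()  # simulate a conflict on the first attempt
    return tx["balance"]


final_balance = run_transaction(block, db)
```

After the retry, the balance has been reduced exactly once, but the external counter has been incremented twice, which is exactly the kind of code that should stay out of a transaction block.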
Ecosystem
EdgeDB, on the other hand, is more than just a database server:
- At the top are the official client libraries we develop for multiple programming languages. Each has the features described in the previous section, and some also support advanced usage through query builders.
- In the blue box below is the PostgreSQL backend supported by EdgeDB. Besides the bundled PostgreSQL, this includes hosted PostgreSQL services on many cloud platforms, and EdgeDB can take advantage of their features, such as Multi-AZ failover on RDS.
- In the red box on the left is the EdgeDB command-line tool (CLI), a single executable written in Rust that is used mainly in development environments and supports the mainstream development platforms. The CLI gives developers a complete EdgeDB-based development workflow and set of usage conventions, as well as a very user-friendly interactive query client (REPL).
- On the bottom left, the EdgeDB official website maintains a large number of learning materials, and I personally try to maintain the Chinese content with the help and support of @Daisy (translator of EdgeDB I Ching).
- On the right is the hosted service for production that EdgeDB is developing. In addition, EdgeDB has a lot of support for DevOps, such as built-in Prometheus metrics, a server health-check interface, support for PostgreSQL high-availability clusters, and more.
- At the bottom right is EdgeDB’s development automation, including unit tests, functional tests, tests across different operating systems and programming language versions, and so on, as well as automated releases for about 60 different target platforms.
Experience
Finally, a little about the developer experience.
The first step in developing with EdgeDB is to install the EdgeDB CLI, which on Linux/macOS is a single command:
$ curl --proto '=https' --tlsv1.2 -sSf https://sh.edgedb.com | sh
On Windows it is also a single command (the server runs under WSL):
PS> iwr https://ps1.edgedb.com -useb | iex
When that is done, initialize an EdgeDB project in your project folder (an empty folder if you are starting fresh):
$ edgedb project init
No `edgedb.toml` found in this repo or above.
Do you want to initialize a new project? [Y/n]
> Y
How would you like to run EdgeDB for this project?
1. Local (native package)
2. Docker
> 1
Checking EdgeDB versions...
Specify the version of EdgeDB to use with this project [1-rc3]:
> # left blank for default
Specify the name of EdgeDB instance to use with this project:
> my_instance
Initializing EdgeDB instance...
Bootstrap complete. Server is up and running now.
Project initialialized.
At this point, the CLI program downloads the EdgeDB Server and creates a database instance (with a PostgreSQL instance inside), and then creates the following files (folders) in the current folder:
- edgedb.toml: the EdgeDB project file, including the version number and so on;
- dbschema/default.esdl: an empty schema definition, to be edited later;
- dbschema/migrations/*.edgeql: database migrations, generated automatically and managed through CLI commands rather than edited by hand.
These files should be added to version control such as Git, and then you can start developing. Some common EdgeDB-related tasks are:
- Connect to the database and run some queries manually: just type edgedb and press Enter;
- Modify the schema: edit the .esdl file directly, then first run edgedb migration create to create the migration script, and then run edgedb migrate to complete the migration;
- Connect to the database from code: import edgedb, then call edgedb.create_client(). Different languages and environments vary slightly, but as long as the code runs in the folder that contains edgedb.toml (or one of its subfolders), no additional connection parameters are needed. Production environments use environment variables to set connection parameters.
Please refer to the documentation for more details.
As you can see, day-to-day development with EdgeDB is simple and direct. Because the client libraries and the command-line tools are all built by us, we take care of everything for the developer and keep the experience highly consistent. For example, you will not even notice that TLS is enabled: EdgeDB Server automatically creates a development certificate and the CLI remembers it as trusted, so the client library can connect to the server without any intervention from the developer. Future hosted EdgeDB cloud instances will offer the same experience.
Conclusion
As can be seen from the system architecture, the current focus of EdgeDB lies in:
- EdgeQL, the theoretical successor to SQL
- Single-database efficiency: as a foundational database, serve most small and medium-sized application scenarios first
- Developer experience and productivity: a lot of work has gone into making it a pleasure to use
- Cloud ecosystem fit, with potential as a serverless database
However, EdgeDB has nothing to do with NewSQL and currently offers no additional support for horizontal scaling; it positions itself simply as a new, foundational, general-purpose OLTP database. Sure, you can supply your own scalable PostgreSQL backend, and EdgeDB has built-in support for some degree of high availability as well as read-only replicas, but that is not EdgeDB’s current focus. If EdgeDB Server itself becomes the system bottleneck, just add a couple more EdgeDB Server instances in front of the same PostgreSQL backend.
Again, EdgeDB is positioned as a foundational database like PostgreSQL: one you can build all sorts of things on, but a general-purpose database needs a solid foundation first. With EdgeQL as its calling card, EdgeDB prioritizes developer experience and efficiency while balancing performance, breaks free of the constraints of many existing technology stacks such as ORMs, and brings best practices such as declarative schemas, migration-based workflows, and transaction retries to modern application development. It is a whole new species of database.
For more information, visit our official website, OSCHINA project home page, Zhihu column, and Nuggets column.