Basics

VFS

A virtual file system (VFS) lets the standard Unix system calls (syscalls) read and write different file systems on different physical media, providing one unified API for all of them.

A file system generally goes through several stages before an operating system can read and write it:

  • Creation: format a disk in some way to create a file system. During creation, the system writes the file system's control information to the disk.
  • Registration: the file system announces itself to the operating system kernel, which records the control information it needs to support that file system. Registration is either static or dynamic. Static registration happens when the kernel is compiled; dynamic registration happens when a module is loaded.
  • Installation (mount): mount the file system into the directory tree of the operating system's root file system. From this point on, the operating system can read and write the file system.

After that, the file system can be read and written through the VFS interface.

Taking sys_read() as an example, VFS hides the details of each file system's read and write methods from sys_read() by dispatching through vfs_read(). From a design-pattern perspective, this is like putting a proxy layer in front of each file system's read(); the proxy forwards read and write operations to whichever file system is involved.
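The proxy idea can be sketched in C with a table of function pointers. This is a minimal illustration, not kernel code: fs_ops, vfile, vfs_read_sketch and the toy memfs are all hypothetical names invented for this example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical per-file-system operation table; names are illustrative only. */
struct fs_ops {
    long (*read)(void *file_data, char *buf, size_t len);
};

/* A hypothetical open file: private data plus the operations of its file system. */
struct vfile {
    const struct fs_ops *ops;
    void *data;
};

/* Generic entry point: it knows nothing about the concrete file system. */
long vfs_read_sketch(struct vfile *f, char *buf, size_t len) {
    if (f == NULL || f->ops == NULL || f->ops->read == NULL) return -1;
    return f->ops->read(f->data, buf, len);   /* forward to the real implementation */
}

/* One toy "file system" that serves bytes from a C string. */
static long memfs_read(void *file_data, char *buf, size_t len) {
    const char *src = file_data;
    size_t n = strlen(src);
    if (n > len) n = len;
    memcpy(buf, src, n);
    return (long)n;
}

const struct fs_ops memfs_ops = { memfs_read };
```

A caller builds a vfile whose ops field points at memfs_ops and only ever calls vfs_read_sketch(); supporting another file system means supplying another fs_ops table, with no change to the caller.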

BTree

From the Wikipedia introduction:

In computer science, a B-tree is a self-balancing tree that keeps data sorted. This data structure allows lookups, sequential access, insertions, and deletions in logarithmic time. The B-tree generalizes the binary search tree by allowing a node to have more than two children. Unlike self-balancing binary search trees, the B-tree is optimized for systems that read and write large blocks of data. B-trees reduce the number of intermediate steps needed to locate a record and thus speed up access. The B-tree is well suited to describing external storage, and it is often used in database and file system implementations.
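To make the "more than two child nodes" point concrete, here is a small lookup sketch in C. It assumes a simplified in-memory node (bnode, MAX_KEYS and btree_contains are names made up for this example) and is unrelated to SQLite's actual on-disk page format.

```c
#include <stddef.h>

#define MAX_KEYS 3   /* illustrative order; real B-trees size nodes to a disk page */

/* Hypothetical in-memory B-tree node: n sorted keys and up to n+1 children. */
struct bnode {
    int n;                              /* number of keys in use */
    int keys[MAX_KEYS];                 /* sorted ascending */
    struct bnode *child[MAX_KEYS + 1];  /* all NULL in a leaf */
};

/* Return 1 if key is present in the subtree rooted at node, else 0. */
int btree_contains(const struct bnode *node, int key) {
    while (node != NULL) {
        int i = 0;
        while (i < node->n && key > node->keys[i]) i++;   /* first key >= target */
        if (i < node->n && node->keys[i] == key) return 1;
        node = node->child[i];           /* descend; NULL when node is a leaf */
    }
    return 0;
}
```

Each step of the outer loop visits one node, so the number of nodes touched is bounded by the height of the tree; wide nodes keep that height small, which is what matters when every node visit is a disk read.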

Pager

The Pager module controls concurrent access in SQLite. It also makes SQLite "ACID" (Atomic, Consistent, Isolated, Durable), which is what supports transaction processing. Operations on the database do not touch the database file directly; they go through the Pager, which writes the changes to the database file.

Structure

The Pager divides SQLite's .db file into blocks of the same size and layout. Each block is called a "Page" and is typically 1024 bytes in size. On Unix systems, the Pager reaches the operating system through the abstraction layer implemented in os_unix.c.
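Because all pages have the same size, locating a page in the file is plain arithmetic. The helpers below are a sketch with hypothetical names (page_offset, page_of_offset); they rely on the fact that SQLite numbers pages starting from 1, so the caller must pass a page number of at least 1.

```c
#include <stdint.h>

/* Byte offset of a page in the database file. Pages are numbered from 1,
   so page 1 starts at offset 0. pgno must be >= 1. */
int64_t page_offset(uint32_t pgno, uint32_t page_size) {
    return (int64_t)(pgno - 1) * (int64_t)page_size;
}

/* Number of the page that contains a given byte offset. */
uint32_t page_of_offset(int64_t offset, uint32_t page_size) {
    return (uint32_t)(offset / page_size) + 1;
}
```

With 1024-byte pages, page 3 therefore starts 2048 bytes into the file.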

Multi-process/multi-threaded access

Pager effectively controls access to SQLite database files in multi-process/multi-threaded conditions through the following mechanisms.

Locks

Through locking, the Pager ensures that processes/threads operate on the database file safely. A lock can be in one of several states:

  • UNLOCKED: no lock is held on the DB. In this state the DB is neither read nor written, and any cached data is considered suspect.
  • SHARED: the DB can be read but not written. Multiple processes can hold this lock and read from the DB at the same time. While a SHARED lock is held, no other thread or process is allowed to write to the DB.
  • RESERVED: only one process at a time can hold this lock. A process acquires it when it is reading from the DB and intends to write to it at some point in the future. While a RESERVED lock exists, other processes can still acquire SHARED locks and read the DB.
  • PENDING: a process acquires this lock in preparation for an imminent write. While a PENDING lock is held, no new process or thread can acquire a SHARED lock. The holder of a PENDING lock cannot write immediately; it must wait for the existing SHARED locks to be released and then acquire the EXCLUSIVE lock to do the actual write.
  • EXCLUSIVE: the lock needed to actually write. Only the holder of an EXCLUSIVE lock can write to the DB, and the lock is mutually exclusive with everything else: while it exists, no other lock (SHARED, RESERVED, or PENDING) can be held.
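The rules in the list can be condensed into a small compatibility check. This is a sketch derived only from the descriptions above; lock_level and lock_compatible are hypothetical names and this is not how SQLite implements its locks internally.

```c
/* Lock levels in increasing strength, mirroring the list above. */
enum lock_level { UNLOCKED, SHARED, RESERVED, PENDING, EXCLUSIVE };

/* Can a connection be granted `want` while some OTHER connection holds `held`?
   Returns 1 if the two can coexist, 0 if the request must wait or fail. */
int lock_compatible(enum lock_level want, enum lock_level held) {
    if (want == UNLOCKED || held == UNLOCKED) return 1;   /* nothing to conflict with */
    if (want == EXCLUSIVE || held == EXCLUSIVE) return 0; /* EXCLUSIVE stands alone */
    if (want == SHARED) return held != PENDING;           /* PENDING blocks new readers */
    /* want is RESERVED or PENDING: allowed only alongside readers */
    return held == SHARED;
}
```

For example, lock_compatible(SHARED, RESERVED) is 1 (readers continue while a writer prepares), whereas lock_compatible(RESERVED, RESERVED) is 0 (only one prospective writer at a time).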

Writing to the database file

The Rollback Journal

The Rollback Journal mechanism protects the integrity of the database file when an error occurs during a database operation.

Before modifying the database file, the Pager saves a record called the Rollback Journal in the same directory as the database file. The file name is XXX(db name)-journal. Besides the original content of the pages that are about to change, the journal also stores the current size of the database file, so that the file can be restored correctly if a write goes wrong.

A Rollback Journal is considered "hot" when the Pager needs it to restore the database file. When a process or thread terminates in the middle of modifying the database file, because of a crash or a power failure, the journal it leaves behind becomes a Hot Journal that is used for recovery. When SQLite is about to read from a database file, it first checks whether a corresponding Hot Journal exists. If one does, the database file is rolled back before any data is read.
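The save-then-modify-then-restore cycle can be sketched with an in-memory toy. Everything here (toy_db, toy_journal, the tiny page size, 0-based page numbers) is an assumption made for illustration and does not reflect SQLite's journal file format; the caller must keep page numbers below TOY_MAX_PAGES.

```c
#include <string.h>

#define TOY_PAGE_SIZE 8      /* tiny pages keep the sketch readable */
#define TOY_MAX_PAGES 4

/* Toy "database file" and "journal", both held in memory. */
struct toy_db {
    unsigned char pages[TOY_MAX_PAGES][TOY_PAGE_SIZE];
    int npages;
};

struct toy_journal {
    int orig_npages;                                     /* size before the change */
    int count;                                           /* number of saved pages */
    int pgno[TOY_MAX_PAGES];                             /* page of each record */
    unsigned char saved[TOY_MAX_PAGES][TOY_PAGE_SIZE];   /* original contents */
};

void journal_begin(struct toy_journal *j, const struct toy_db *db) {
    j->orig_npages = db->npages;
    j->count = 0;
}

/* Save the original page to the journal (once), then modify the page. */
void journaled_write(struct toy_db *db, struct toy_journal *j,
                     int pgno, const unsigned char *data) {
    int i, seen = 0;
    for (i = 0; i < j->count; i++) if (j->pgno[i] == pgno) seen = 1;
    if (!seen && pgno < j->orig_npages) {     /* new pages need no backup */
        j->pgno[j->count] = pgno;
        memcpy(j->saved[j->count], db->pages[pgno], TOY_PAGE_SIZE);
        j->count++;
    }
    memcpy(db->pages[pgno], data, TOY_PAGE_SIZE);
    if (pgno >= db->npages) db->npages = pgno + 1;
}

/* Play the journal back: restore saved pages and the original size. */
void journal_rollback(struct toy_db *db, const struct toy_journal *j) {
    int i;
    for (i = 0; i < j->count; i++)
        memcpy(db->pages[j->pgno[i]], j->saved[i], TOY_PAGE_SIZE);
    db->npages = j->orig_npages;
}
```

Storing the original size is what lets the rollback also undo pages appended during the failed transaction: they are simply cut off again.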

Lock acquisition

A process that wants to write to a database file must acquire locks in the following order:

SHARED --> RESERVED --> PENDING --> EXCLUSIVE

If one process or thread holds the RESERVED lock, no other process or thread can obtain it, so their write operations fail and return an SQLITE_BUSY error. Once it holds the RESERVED lock, the process/thread that wants to write creates the Rollback Journal and writes into it the original content of the pages it is going to change. At this stage, other processes/threads can still read data from the database file.

Write
  • Changes a process makes to a page are not written to disk immediately; they stay in the cache. They are written to the database file on disk only when the cache is full or the transaction is committed. Before that write, the process/thread must make sure the Rollback Journal has been written and must wait for the holders of other SHARED locks to finish reading.
  • If the database file has to be written because the cache is full, the process/thread does not commit the transaction at that point but goes on changing other pages. As a result, an EXCLUSIVE lock can be held for a long time: from the first write until the transaction is committed.
  • The Rollback Journal is deleted as soon as the commit completes (if the operating system crashes during the write, the journal left behind acts as a Hot Journal). Once all changes are actually on disk, the EXCLUSIVE and PENDING locks are released. After the PENDING lock is released, other processes/threads can acquire SHARED locks and read the latest data from the database file.
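The "changes stay in the cache until it is full or the transaction commits" behaviour can be sketched as a tiny write-back cache. All names here (toy_file, toy_cache, cache_write, cache_flush) are hypothetical, and the sketch deliberately leaves out locking and the journal to focus on when data reaches the file.

```c
#include <string.h>

#define CACHE_SLOTS 2
#define PG_BYTES 4

/* Toy backing "file"; writes counts how many pages have been written to it. */
struct toy_file { unsigned char pages[8][PG_BYTES]; int writes; };

struct cache_slot { int pgno; int used; int dirty; unsigned char data[PG_BYTES]; };

struct toy_cache { struct cache_slot slot[CACHE_SLOTS]; struct toy_file *file; };

/* Write every dirty slot to the file. Called on commit or when the cache is full. */
void cache_flush(struct toy_cache *c) {
    int i;
    for (i = 0; i < CACHE_SLOTS; i++) {
        struct cache_slot *s = &c->slot[i];
        if (s->used && s->dirty) {
            memcpy(c->file->pages[s->pgno], s->data, PG_BYTES);
            c->file->writes++;
        }
        s->used = 0;
        s->dirty = 0;
    }
}

/* Modify a page in the cache only; spill to the file if no slot is free. */
void cache_write(struct toy_cache *c, int pgno, const unsigned char *data) {
    int i, target = -1;
    for (i = 0; i < CACHE_SLOTS; i++) {
        if (c->slot[i].used && c->slot[i].pgno == pgno) { target = i; break; }
        if (!c->slot[i].used && target < 0) target = i;
    }
    if (target < 0) { cache_flush(c); target = 0; }   /* cache full: spill */
    c->slot[target].pgno = pgno;
    c->slot[target].used = 1;
    c->slot[target].dirty = 1;
    memcpy(c->slot[target].data, data, PG_BYTES);
}
```

In this toy, a spill caused by a full cache writes pages to the file without ending the transaction, which mirrors why the EXCLUSIVE lock described above may have to be held from the first spill until the final commit.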

Source code analysis: creating a connection and writing to the database

SQLite exposes a number of interfaces to the Framework, covering connection setup, reads and writes, and transaction operations.

// /frameworks/base/core/java/android/database/sqlite/SQLiteConnection.java
// All Framework operations on the database are ultimately forwarded to SQLite through these JNI interfaces
private static native int nativeOpen(String path, int openFlags, String label,
        boolean enableTrace, boolean enableProfile);
private static native void nativeClose(int connectionPtr);
...
private static native long nativeExecuteForLastInsertedRowId(
        int connectionPtr, int statementPtr);
...
private static native void nativeResetCancel(int connectionPtr, boolean cancelable);

For reasons of space, this section covers only connection establishment (a database connection is the basis for every Framework operation on the DB) and writing to the DB (which goes through every Pager lock state and the Rollback Journal, and is more complex than reading).

Establishing a connection

nativeOpen() has its JNI implementation in /frameworks/base/core/jni/android_database_SQLiteConnection.cpp.

static jint nativeOpen(JNIEnv* env, jclass clazz, jstring pathStr, jint openFlags,
        jstring labelStr, jboolean enableTrace, jboolean enableProfile) {
    ... // sqliteFlags checks and string conversion
    sqlite3* db;
    // Open the database with sqlite3_open_v2 (sqlite3_open and sqlite3_open16 are the alternatives)
    int err = sqlite3_open_v2(path.string(), &db, sqliteFlags, NULL);
    if (err != SQLITE_OK) {
        throw_sqlite3_exception_errcode(env, err, "Could not open database");
        return 0;
    }
    ...
    // 1. For a read/write connection, check that the database object is actually writable
    // 2. Set a busy handler to deal with SQLITE_BUSY
    // 3. Register the Android-specific functions

    // Wrap the database object in a database connection object
    SQLiteConnection* connection = new SQLiteConnection(db, openFlags, path, label);
    ... // Set up tracing, profiling, etc.
    // Return the connection pointer to the Java layer as an int
    return reinterpret_cast<jint>(connection);
}

Whichever function is used to open the database, the call eventually reaches openDatabase() in /external/sqlite/dist/sqlite3.c:

static int openDatabase(
  const char *zFilename, /* Database filename UTF-8 encoded */
  sqlite3 **ppDb,        /* OUT: Returned database handle */
  unsigned int flags,    /* Operational flags */
  const char *zVfs       /* Name of the VFS to use */
){
  sqlite3 *db;           /* Store allocated handle here */
  int rc;                /* Return code */
  int isThreadsafe;      /* True for threadsafe connections */
  char *zOpen = 0;       /* Filename argument to pass to BtreeOpen() */
  char *zErrMsg = 0;     /* Error message from sqlite3ParseUri() */

  *ppDb = 0;

  ... // Set up flags
  // Allocate memory for the SQLite db object
  db = sqlite3MallocZero( sizeof(sqlite3) );
  // If the allocation fails, jump to opendb_out
  if( db==0 ) goto opendb_out;
  if( isThreadsafe ){  // No thread-safety flag was passed in, so this depends on how SQLite is configured
    db->mutex = sqlite3MutexAlloc(SQLITE_MUTEX_RECURSIVE);
    if( db->mutex==0 ){
      sqlite3_free(db);
      db = 0;
      goto opendb_out;
    }
  }
  // Lock db->mutex before continuing
  sqlite3_mutex_enter(db->mutex);
  // Two database backends: the first is main, the second is temp (see below)
  db->nDb = 2;
  db->aDb = db->aDbStatic;
  ... // Set db parameters
  // Create the collation (sort) functions
  createCollation(db, "BINARY", SQLITE_UTF8, 0, binCollFunc, 0);
  createCollation(db, "BINARY", SQLITE_UTF16BE, 0, binCollFunc, 0);
  createCollation(db, "BINARY", SQLITE_UTF16LE, 0, binCollFunc, 0);
  createCollation(db, "RTRIM", SQLITE_UTF8, (void*)1, binCollFunc, 0);
  if( db->mallocFailed ){  // Memory allocation failed: jump to opendb_out
    goto opendb_out;
  }
  // The default collation is BINARY
  db->pDfltColl = sqlite3FindCollSeq(db, SQLITE_UTF8, "BINARY", 0);
  assert( db->pDfltColl!=0 );
  // Create a UTF-8 case-insensitive collation
  createCollation(db, "NOCASE", SQLITE_UTF8, 0, nocaseCollatingFunc, 0);
  // Parse the database file name, the requested VFS, and the flags as a URI
  db->openFlags = flags;
  rc = sqlite3ParseUri(zVfs, zFilename, &flags, &db->pVfs, &zOpen, &zErrMsg);
  if( rc!=SQLITE_OK ){  // URI parsing failed
    if( rc==SQLITE_NOMEM ) db->mallocFailed = 1;
    sqlite3Error(db, rc, zErrMsg ? "%s" : 0, zErrMsg);
    sqlite3_free(zErrMsg);
    goto opendb_out;
  }
  // Open the backend database driver and store it in pBt of the first database backend.
  // This is where the database file is actually opened, through the BTree layer;
  // the BTree that is created is placed in db->aDb[0].
  rc = sqlite3BtreeOpen(db->pVfs, zOpen, db, &db->aDb[0].pBt, 0,
                        flags | SQLITE_OPEN_MAIN_DB);
  if( rc!=SQLITE_OK ){
    if( rc==SQLITE_IOERR_NOMEM ){
      rc = SQLITE_NOMEM;
    }
    sqlite3Error(db, rc, 0);
    goto opendb_out;
  }
  // Obtain the schema of each database backend
  db->aDb[0].pSchema = sqlite3SchemaGet(db, db->aDb[0].pBt);
  db->aDb[1].pSchema = sqlite3SchemaGet(db, 0);
  db->aDb[0].zName = "main";
  db->aDb[0].safety_level = 3;
  db->aDb[1].zName = "temp";
  db->aDb[1].safety_level = 1;
  ...
  // Register the built-in functions with db. Nothing is read from the database file here;
  // the first read happens when the DB object is accessed for the first time.
  sqlite3RegisterBuiltinFunctions(db);
  ...
// Exit path
opendb_out:
  sqlite3_free(zOpen);
  if( db ){
    assert( db->mutex!=0 || isThreadsafe==0 || sqlite3GlobalConfig.bFullMutex==0 );
    // Release the thread-safety mutex
    sqlite3_mutex_leave(db->mutex);
  }
  rc = sqlite3_errcode(db);
  assert( db!=0 || rc==SQLITE_NOMEM );
  if( rc==SQLITE_NOMEM ){
    // Out of memory for the database object: close the database and free its resources
    sqlite3_close(db);
    db = 0;
  }else if( rc!=SQLITE_OK ){
    db->magic = SQLITE_MAGIC_SICK;
  }
  // Point *ppDb at db for the caller to use
  *ppDb = db;
  return sqlite3ApiExit(0, rc);
}

SQLite uses the BTree as its storage engine, so the database file is opened through sqlite3BtreeOpen():

SQLITE_PRIVATE int sqlite3BtreeOpen(
  sqlite3_vfs *pVfs,       /* VFS to use for this b-tree */
  const char *zFilename,   /* Name of the file containing the BTree database */
  sqlite3 *db,             /* Associated database handle */
  Btree **ppBtree,         /* Pointer to new Btree object written here */
  int flags,               /* Options */
  int vfsFlags             /* Flags passed through to sqlite3_vfs.xOpen() */
){
  BtShared *pBt = 0;             /* Shared part of btree structure */
  Btree *p;                      /* Handle to return */
  sqlite3_mutex *mutexOpen = 0;  /* Prevents a race condition. Ticket #3537 */
  int rc = SQLITE_OK;            /* Result code from this function */
  u8 nReserve;                   /* Byte of unused space on each page */
  unsigned char zDbHeader[100];  /* Database header content */

  ...
  if( isMemdb ){
    // This is an in-memory DB
    flags |= BTREE_MEMORY;
  }
  if( (vfsFlags & SQLITE_OPEN_MAIN_DB)!=0 && (isMemdb || isTempDb) ){
    vfsFlags = (vfsFlags & ~SQLITE_OPEN_MAIN_DB) | SQLITE_OPEN_TEMP_DB;
  }
  // Allocate memory
  p = sqlite3MallocZero(sizeof(Btree));
  if( !p ){
    return SQLITE_NOMEM;
  }
  p->inTrans = TRANS_NONE;
  p->db = db;
  // If shared cache is enabled, look for an existing shared BTree object and reuse it.
  // If one is found, the steps below that create a new one are skipped.
  ...

  // pBt == 0: no shared BTree object exists yet, so create one
  ...
  pBt = sqlite3MallocZero( sizeof(*pBt) );
  if( pBt==0 ){
    rc = SQLITE_NOMEM;
    goto btree_open_out;
  }
  // Open the SQLite database file through the Pager
  rc = sqlite3PagerOpen(pVfs, &pBt->pPager, zFilename,
                        EXTRA_SIZE, flags, vfsFlags, pageReinit);
  if( rc==SQLITE_OK ){  // The Pager opened successfully: read the file header
    rc = sqlite3PagerReadFileheader(pBt->pPager,sizeof(zDbHeader),zDbHeader);
  }
  if( rc!=SQLITE_OK ){
    goto btree_open_out;
  }
  pBt->openFlags = (u8)flags;
  pBt->db = db;
  // Register the busy handler for the database file
  sqlite3PagerSetBusyhandler(pBt->pPager, btreeInvokeBusyHandler, pBt);
  p->pBt = pBt;

  pBt->pCursor = 0;
  pBt->pPage1 = 0;
  // Check whether the database was opened read-only
  if( sqlite3PagerIsreadonly(pBt->pPager) ) pBt->btsFlags |= BTS_READ_ONLY;
  ...
  // Set the Pager page size
  rc = sqlite3PagerSetPagesize(pBt->pPager, &pBt->pageSize, nReserve);
  // Add the new shared BTree object to the global linked list of shared BTree objects
  // Exit path
  ...
}

Writing

Take the JNI method nativeExecuteForLastInsertedRowId of SQLiteConnection as an example:

static jlong nativeExecuteForLastInsertedRowId(JNIEnv* env, jclass clazz,
        jint connectionPtr, jint statementPtr) {
    // Convert the int handle passed from the Java layer back to an SQLiteConnection pointer
    SQLiteConnection* connection = reinterpret_cast<SQLiteConnection*>(connectionPtr);
    // Convert the statement handle back to a sqlite3_stmt pointer
    sqlite3_stmt* statement = reinterpret_cast<sqlite3_stmt*>(statementPtr);
    int err = executeNonQuery(env, connection, statement);
    return err == SQLITE_DONE && sqlite3_changes(connection->db) > 0
            ? sqlite3_last_insert_rowid(connection->db) : -1;
}

Execution then moves into executeNonQuery():

// Call sqlite3_step() to enter SQLite
int err = sqlite3_step(statement);
if (err == SQLITE_ROW) {
    throw_sqlite3_exception(env,
            "Queries can be performed using SQLiteDatabase query or rawQuery methods only.");
} else if (err != SQLITE_DONE) {
    throw_sqlite3_exception(env, connection->db);
}

Execution then enters SQLite itself, in sqlite3.c:

SQLITE_API int sqlite3_step(sqlite3_stmt *pStmt){
  int rc = SQLITE_OK;      /* Result from sqlite3Step() */
  int rc2 = SQLITE_OK;     /* Result from sqlite3Reprepare() */
  Vdbe *v = (Vdbe*)pStmt;  /* the prepared statement */
  int cnt = 0;             /* Counter to prevent infinite loop of reprepares */
  sqlite3 *db;             /* The database connection */

  if( vdbeSafetyNotNull(v) ){
    return SQLITE_MISUSE_BKPT;
  }
  db = v->db;
  // Lock
  sqlite3_mutex_enter(db->mutex);
  // Run sqlite3Step() in a loop; it eventually forwards to sqlite3VdbeExec()
  while( (rc = sqlite3Step(v))==SQLITE_SCHEMA
         && cnt++ < SQLITE_MAX_SCHEMA_RETRY
         && (rc2 = rc = sqlite3Reprepare(v))==SQLITE_OK ){
    // Reset the statement
    sqlite3_reset(pStmt);
    assert( v->expired==0 );
  }
  // If re-preparing failed, record the error and clean up
  if( rc2!=SQLITE_OK && ALWAYS(v->isPrepareV2) && ALWAYS(db->pErr) ){
    const char *zErr = (const char *)sqlite3_value_text(db->pErr);
    sqlite3DbFree(db, v->zErrMsg);
    if( !db->mallocFailed ){
      v->zErrMsg = sqlite3DbStrDup(db, zErr);
      v->rc = rc2;
    } else {
      v->zErrMsg = 0;
      v->rc = rc = SQLITE_NOMEM;
    }
  }
  rc = sqlite3ApiExit(db, rc);
  sqlite3_mutex_leave(db->mutex);
  return rc;
}

Finally, the call is forwarded to sqlite3VdbeExec(), which executes the statement in its compiled form, a SQLite VDBE program. For the contents of a VDBE program, see the official VDBE documentation.
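The retry loop at the heart of sqlite3_step() is easier to follow in isolation. The sketch below reproduces only its control flow: step_with_retry, the callback types, the toy statement and the MAX_RETRY value are all hypothetical stand-ins, with RC_SCHEMA and RC_DONE merely using the same numeric values as SQLITE_SCHEMA and SQLITE_DONE.

```c
#define RC_OK      0
#define RC_SCHEMA  17    /* same numeric value as SQLITE_SCHEMA */
#define RC_DONE    101   /* same numeric value as SQLITE_DONE */
#define MAX_RETRY  5     /* illustrative retry limit, an assumption of this sketch */

/* Hypothetical callbacks standing in for sqlite3Step() and sqlite3Reprepare(). */
typedef int (*step_fn)(void *stmt);
typedef int (*reprepare_fn)(void *stmt);

/* Step once; if the schema changed, recompile and retry a bounded number of times. */
int step_with_retry(void *stmt, step_fn step, reprepare_fn reprepare) {
    int rc, cnt = 0;
    while ((rc = step(stmt)) == RC_SCHEMA
           && cnt++ < MAX_RETRY
           && (rc = reprepare(stmt)) == RC_OK) {
        /* the statement was recompiled against the new schema; try again */
    }
    return rc;
}

/* Toy statement: reports a stale schema a given number of times, then finishes. */
struct toy_stmt { int stale_steps; int reprepares; };

int toy_step(void *p) {
    struct toy_stmt *s = p;
    if (s->stale_steps > 0) { s->stale_steps--; return RC_SCHEMA; }
    return RC_DONE;
}

int toy_reprepare(void *p) {
    struct toy_stmt *s = p;
    s->reprepares++;
    return RC_OK;
}
```

The counter is what keeps a constantly changing schema from trapping the caller in an endless loop: once the limit is reached, the schema error is returned as-is.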