Author: Zhang Lianzhuang PostgreSQL research engineer

Engaged in PostgreSQL database kernel development for many years, have a very deep research on CITus.

Last time, we introduced the PostgreSQL data recovery tool: PG_reconvery

This article will take you to understand the implementation principle of PG_recovery tool, design ideas, and bring source code interpretation.

| data back to the implementation of the principle

A normal way to read data in a database system is to start with a select * from PG_recovery query (that is, to execute a transaction). During the query operation, a snapshot of the transaction is generated. The GetActiveSnapshot() function is used to view the currently visible data.

| design idea

1. How to read Dead tuples?

PostgreSQL uses snapshots to determine the visibility of current database data, so when a piece of data is deleted, the entity of the data still exists in the database instance. This invisible piece of data is usually called a Dead tuple (in PostgreSQL a piece of data is called a tuple).

PostgreSQL provides special snapshots of SnapshotAny (and many others). This snapshot can read any data, and PG_recovery is all data read in this way. By default, only recovery data is returned, not visible data.

2. How much data does the function return?

The amount of data is returned on a row, one row at a time.

3. How to control memory?

Functions are executed multiple times, and some states are global. Therefore, memory can be controlled using the multi_call_memory_CTx (context of the memory pool) parameter.

Parameters of the function

To create a function in SQL, execute the following statement. Refer to the previous section for details on how to use the function.

CREATE FUNCTION pg_recovery(regclass, recoveryrow bool DEFAULT true) RETURNS SETOF record
Copy the code

Regclass: indicates the type of a PostgreSQL table. The table name is automatically converted to OID (the unique identifier of an object in the OID database). Therefore, you only need to enter the table name.

Reconveryrow bool DEFAULT ture: The DEFAULT value is true, indicating that only recovery data is returned. The value false indicates that all data is returned. Execute the following statement to change the parameter default values.

select * from pg_recovery('aa', recoveryrow => false)
Copy the code

RETURNS SETOF Record: Function RETURNS row type data.

| source interpretation

Essential data

typedef struct { Relation rel; TupleDesc reltupleDESC; TupleConversionMap *map; Table mapping. Table data is mapped to user-defined returned columns. TableScanDesc Scan. HTAB * active_cTID; Ctid bool DROPpedColumn; Pg_recovery_ctx;Copy the code

Hidden columns

Add a hidden column in RecoveryRow that, when all information is returned, can be used to identify whether the row is recovery data or user visible data.

static const struct system_columns_t {
    char       *attname;
    Oid         atttypid;
    int32       atttypmod;
    int         attnum;
} system_columns[] = { 
    { "ctid",     TIDOID,  -1, SelfItemPointerAttributeNumber },
    { "xmin",     XIDOID,  -1, MinTransactionIdAttributeNumber },
    { "cmin",     CIDOID,  -1, MinCommandIdAttributeNumber },
    { "xmax",     XIDOID,  -1, MaxTransactionIdAttributeNumber },
    { "cmax",     CIDOID,  -1, MaxCommandIdAttributeNumber },
    { "tableoid", OIDOID,  -1, TableOidAttributeNumber },
    { "recoveryrow",     BOOLOID, -1, DeadFakeAttributeNumber },
    { 0 },
};
Copy the code

Pg_recovery simplified code

Datum pg_recovery(PG_FUNCTION_ARGS) { FuncCallContext *funcctx; pg_recovery_ctx *usr_ctx; recoveryrow = PG_GETARG_BOOL(1); Funcctx = SRF_FIRSTCALL_INIT(); funccTX = SRF_FIRSTCALL_INIT(); funccTX = SRF_FIRSTCALL_INIT(); Oldcontext = MemoryContextSwitchTo(funCCTX ->multi_call_memory_ctx); ->rel = heap_open(relid, AccessShareLock); Usr_ctx -> reltupleDESC = RelationGetDescr(usr_ctx->rel); Funcctx ->tuple_desc = BlessTupleDesc(tupdesc); Usr_ctx ->map = recovery_convert_tuples_by_name(usr_ctx->reltupledesc, funccTX ->tuple_desc, "Error converting tuple descriptors!" , &usr_ctx->droppedcolumn); -- column mapping usr_ctx->scan = heap_beginscan(usr_ctx->rel, SnapshotAny, 0, NULL, NULL, 0); Active_scan = heap_beginscan(usr_ctx->rel, GetActiveSnapshot(), 0, NULL, NULL, 0); active_scan = heap_beginscan(usr_ctx->rel, GetActiveSnapshot(), 0, NULL, NULL, 0); While ((tupleIN = heap_getNext (active_scan, ForwardScanDirection))! = NULL) hash_search(usr_ctx->active_ctid, (void*)&tuplein->t_self, HASH_ENTER, NULL); -- CTID} funccTX = SRF_PERCALL_SETUP(); Usr_ctx = (pg_recovery_ctx *) funccTX ->user_fctx; get_tuple: if ((tuplein = heap_getnext(usr_ctx->scan, ForwardScanDirection)) ! Dead hash_search(usr_ctx-> active_cTID, (void*)&tuplein->t_self, HASH_FIND, &alive); tuplein = recovery_do_convert_tuple(tuplein, usr_ctx->map, alive); SRF_RETURN_NEXT(funCCTX, HeapTupleGetDatum(tupleIN)); Heap_endscan (usr_ctx->scan); Heap_close (usr_ctx->rel, AccessShareLock); SRF_RETURN_DONE(funcctx); -- Release function resources}}Copy the code

Generate a mapping table

TupleConversionMap * recovery_convert_tuples_by_name(TupleDesc indesc, TupleDesc outdesc, const char *msg, bool *droppedcolumn) { attrMap = recovery_convert_tuples_by_name_map(indesc, outdesc, msg, droppedcolumn); -- Processing recoveryRow/hidden column/visible column mapping map->indesc = indesc; map->outdesc = outdesc; map->attrMap = attrMap; map->outvalues = (Datum *) palloc(n * sizeof(Datum)); map->outisnull = (bool *) palloc(n * sizeof(bool)); map->invalues = (Datum *) palloc(n * sizeof(Datum)); map->inisnull = (bool *) palloc(n * sizeof(bool)); map->invalues[0] = (Datum) 0; map->inisnull[0] = true; return map; }Copy the code

Tuple conversion function

HeapTuple recovery_do_convert_tuple(HeapTuple tuple, TupleConversionMap *map, bool alive) { heap_deform_tuple(tuple, map->indesc, invalues + 1, inisnull + 1); For (I = 0; i < outnatts; i++) { outvalues[i] = invalues[j]; Outisnull [I] = null[j]; } return heap_form_tuple(map->outdesc, outvalues, outisNULL); Convert column data to tuples}Copy the code