Background

Pika is a persistent, large-capacity Redis storage service that is compatible with most interfaces of the String, Hash, List, ZSet, and Set types. It solves the capacity bottleneck caused by insufficient memory when a large amount of data is stored in Redis, and users can migrate from Redis to Pika without changing any code. With good compatibility and stability, Pika is used in more than 3,000 instances inside 360, and the project has over 3.8K stars on GitHub. Since the capacity of a single Pika instance is limited by the capacity of a single hard disk, both 360's businesses and the community have an increasingly strong demand for a distributed Pika cluster. We therefore built a native distributed Pika cluster and released it in Pika v3.4. Compared with a Pika + Codis cluster, where Codis's support for creating and managing Pika slots is unfriendly and requires heavy involvement of operations personnel, the Pika native cluster does not require the additional deployment of the codis-proxy module.

Cluster Deployment Structure

Taking a cluster with three Pika nodes as an example, the cluster deployment structure is shown in the figure above:

  1. Deploy an etcd cluster as the meta-information store for Pika Manager.
  2. Deploy Pika Manager on the three physical machines and configure it with the etcd service ports. Each Pika Manager registers with etcd and competes to become the leader; one and only one Pika Manager in the cluster can become the leader and write cluster metadata to etcd (see the election sketch after this list).
  3. Deploy Pika nodes on each of the three physical machines and add the Pika node information to Pika Manager.
  4. For load balancing, register the Pika service ports with LVS.
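
For step 2, the sketch below shows the leader-election pattern using the etcd v3 client's concurrency package. Pika Manager is developed from codis-dashboard, so this is only an illustration of the pattern, not PM's actual implementation; the endpoints, TTL, election key, and node name are assumed values.

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func main() {
	// Endpoints are illustrative; point them at the etcd cluster from step 1.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"10.0.0.1:2379", "10.0.0.2:2379", "10.0.0.3:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Each Pika Manager campaigns on the same election key; exactly one wins
	// and becomes the leader that writes cluster metadata to etcd.
	session, err := concurrency.NewSession(cli, concurrency.WithTTL(10))
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	election := concurrency.NewElection(session, "/pika-manager/leader")
	if err := election.Campaign(context.Background(), "pm-node-1"); err != nil {
		log.Fatal(err)
	}
	log.Println("this Pika Manager instance is now the leader")
}
```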

Data Distribution

To isolate data by service, the Pika cluster introduces the concept of tables: different services store their data in different tables. Service data is stored in slots according to the hash value of the key. Each slot has multiple replicas, which form a replication group. All slot replicas in a replication group have the same slot ID; one replica is the leader and the others are followers. To ensure data consistency, only the leader provides read and write services. Pika Manager can schedule and migrate slots so that data and read/write pressure are evenly distributed across the Pika cluster. This ensures that cluster resources are fully utilized, and the cluster can be scaled out or in according to service pressure and storage capacity requirements.
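
The following minimal sketch models these concepts: a table split into slots, with each slot's replicas forming a replication group with one leader and several followers. The Go types, field names, and addresses are illustrative assumptions, not Pika's actual metadata format.

```go
package main

import "fmt"

// replicationGroup groups all replicas of one slot: a single leader that
// serves reads and writes, and followers that only receive replicated logs.
type replicationGroup struct {
	SlotID    uint32
	Leader    string
	Followers []string
}

// table isolates the data of one service and is split into slots.
type table struct {
	Name  string
	Slots []replicationGroup
}

func main() {
	t := table{
		Name: "table1",
		Slots: []replicationGroup{
			{SlotID: 0, Leader: "10.0.0.1:9221", Followers: []string{"10.0.0.2:9221", "10.0.0.3:9221"}},
			{SlotID: 1, Leader: "10.0.0.2:9221", Followers: []string{"10.0.0.1:9221", "10.0.0.3:9221"}},
		},
	}
	for _, rg := range t.Slots {
		fmt.Printf("table %s, slot %d: leader=%s followers=%v\n", t.Name, rg.SlotID, rg.Leader, rg.Followers)
	}
}
```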

Pika uses RocksDB as its storage engine and creates a corresponding RocksDB instance for each slot. Each slot in Pika supports reading and writing the five Redis data structures, which makes data migration particularly convenient: only the slot itself needs to be migrated between Pika nodes. At the same time, however, this consumes a lot of resources. Pika currently creates five RocksDB instances per slot by default, one for each of the five data structures. If a table contains a large number of slots, or many tables are created, a single Pika node will host many slots and therefore open too many RocksDB instances, consuming too many system resources. In future releases we will support creating only the data structures a service actually needs when creating a slot, and we will continue to optimize Pika's Blackwidow interface layer to reduce RocksDB usage.
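
The sketch below illustrates why the slot count drives resource usage: each slot currently fans out into five RocksDB instances, one per data structure. The types and directory layout are stand-ins for illustration; in Pika the real handles sit behind the C++ Blackwidow layer.

```go
package main

import "fmt"

// rocksDB is a placeholder for a RocksDB handle.
type rocksDB struct{ path string }

// slot holds the five per-data-structure instances of one slot.
type slot struct {
	id                                  uint32
	strings, hashes, lists, sets, zsets *rocksDB
}

// openSlot mimics opening the five instances; the directory layout is assumed.
func openSlot(id uint32, dir string) *slot {
	open := func(sub string) *rocksDB {
		return &rocksDB{path: fmt.Sprintf("%s/slot%d/%s", dir, id, sub)}
	}
	return &slot{
		id:      id,
		strings: open("strings"),
		hashes:  open("hashes"),
		lists:   open("lists"),
		sets:    open("sets"),
		zsets:   open("zsets"),
	}
}

func main() {
	// 64 slots on a single node already mean 320 RocksDB instances.
	slots := make([]*slot, 0, 64)
	for i := uint32(0); i < 64; i++ {
		slots = append(slots, openSlot(i, "/data/pika/table1"))
	}
	fmt.Printf("%d slots -> %d RocksDB instances\n", len(slots), len(slots)*5)
}
```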

Data Processing

  1. When a Pika node receives a user request, the parsing layer parses the Redis protocol and passes the parsed result to the router layer for routing.
  2. The router finds the slot corresponding to the key based on the key's hash value and determines whether the slot is on the local node (see the routing sketch after this list).
  3. If the slot where the key resides belongs to another node, a task is created based on the request and placed in a queue, and the request is forwarded to the peer node for processing. When the task receives the processing result, it returns the response to the client.
  4. If the slot where the key resides belongs to the local node, the request is processed locally and the response is returned to the client.
  5. For write requests that need to be processed locally, a binlog is written through the replication manager module and asynchronously replicated to the other slot replicas. The process layer writes to the leader slot's DB according to the consistency requirement; Blackwidow is Pika's interface wrapper around RocksDB.
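
The following sketch shows the routing decision in steps 2 through 4: hash the key to a slot, then either process the request locally or forward it to the node that owns the slot's leader. The slot count, hash function, and routing table are illustrative assumptions; in the real cluster the routing information comes from the cluster metadata.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// slotNum is an assumed slot count per table.
const slotNum = 1024

// router is a minimal stand-in for Pika's router layer: it maps a key to a
// slot and decides whether the slot's leader lives on the local node.
type router struct {
	localAddr string
	slotOwner map[uint32]string // slot ID -> node that holds the leader slot
}

func (r *router) route(key string) (slot uint32, local bool, owner string) {
	slot = crc32.ChecksumIEEE([]byte(key)) % slotNum
	owner = r.slotOwner[slot]
	return slot, owner == r.localAddr, owner
}

func main() {
	// Toy routing table: even slots on node 1, odd slots on node 2.
	owners := make(map[uint32]string, slotNum)
	for s := uint32(0); s < slotNum; s++ {
		if s%2 == 0 {
			owners[s] = "10.0.0.1:9221"
		} else {
			owners[s] = "10.0.0.2:9221"
		}
	}
	r := &router{localAddr: "10.0.0.1:9221", slotOwner: owners}

	for _, key := range []string{"user:1001", "order:42"} {
		slot, local, owner := r.route(key)
		if local {
			fmt.Printf("key %q -> slot %d: process locally\n", key, slot)
		} else {
			fmt.Printf("key %q -> slot %d: enqueue task and forward to %s\n", key, slot, owner)
		}
	}
}
```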

We built the proxy into Pika, so it does not need to be deployed separately. In contrast to Redis Cluster, the client does not need to be aware of the proxy's existence and can use the cluster as if it were a single Pika instance. The service ports of the Pika nodes can be mounted behind LVS to achieve load balancing across the cluster.
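
From the client's point of view, this means talking to the LVS address as if it were a single Redis/Pika instance. The snippet below is a minimal illustration using the go-redis client; the address, password, and database index are assumed values, and the AUTH/SELECT usage follows the per-table authentication described in the cluster metadata section below.

```go
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()

	// The address is the LVS virtual IP in front of the Pika nodes.
	client := redis.NewClient(&redis.Options{
		Addr:     "lvs-vip.example.com:9221",
		Password: "table1-password", // password set when the table was created (AUTH)
		DB:       1,                 // index used to SELECT the table to operate on
	})
	defer client.Close()

	if err := client.Set(ctx, "user:1001", "alice", 0).Err(); err != nil {
		panic(err)
	}
	val, err := client.Get(ctx, "user:1001").Result()
	if err != nil {
		panic(err)
	}
	fmt.Println("user:1001 =", val)
}
```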

Log Replication

The replication manager module in Pika is responsible for the primary/secondary synchronization of logs. To be compatible with Redis, Pika supports non-consistent log replication, in which the leader slot writes data directly to the DB without waiting for an ACK from the follower slots. Log replication following the Raft consistency protocol is also supported, in which the DB is written only after ACKs from a majority of the replicas have been received.

Non-consistent Log Replication

In the non-consistent scenario, the process is as follows (a minimal sketch follows the list):

  1. After receiving a request from the client, the processing thread takes the lock, writes the binlog, and applies the operation to the DB.
  2. The processing thread returns the response to the client.
  3. The auxiliary synchronization thread sends BinlogSync requests to the follower slots to synchronize the logs.
  4. The follower slots return BinlogSyncAck to report the synchronization status.
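
A minimal sketch of this flow, with a goroutine standing in for the auxiliary synchronization thread; the structs and channel-based hand-off are illustrative, not Pika's actual classes:

```go
package main

import (
	"fmt"
	"sync"
)

// binlogEntry is a simplified stand-in for a Pika binlog record.
type binlogEntry struct {
	offset int64
	cmd    string
}

// leaderSlot sketches the non-consistent path: the write is applied and
// acknowledged immediately, and followers are caught up asynchronously.
type leaderSlot struct {
	mu     sync.Mutex
	binlog []binlogEntry
	db     map[string]string
	syncCh chan binlogEntry // consumed by the auxiliary synchronization goroutine
}

func (l *leaderSlot) write(key, value string) string {
	l.mu.Lock()
	entry := binlogEntry{offset: int64(len(l.binlog)), cmd: "SET " + key + " " + value}
	l.binlog = append(l.binlog, entry) // step 1: write binlog under the lock
	l.db[key] = value                  // step 1: apply to the DB
	l.mu.Unlock()
	l.syncCh <- entry // step 3: hand off to the synchronization goroutine
	return "+OK"      // step 2: reply without waiting for follower ACKs
}

func main() {
	l := &leaderSlot{db: map[string]string{}, syncCh: make(chan binlogEntry, 16)}
	done := make(chan struct{})
	go func() { // sends BinlogSync to followers and awaits BinlogSyncAck (step 4)
		for e := range l.syncCh {
			fmt.Printf("BinlogSync -> follower, offset=%d cmd=%q\n", e.offset, e.cmd)
		}
		close(done)
	}()
	fmt.Println(l.write("k", "v"))
	close(l.syncCh)
	<-done
}
```
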
Consistent Log Replication

In the consistent log replication scenario, the process is as follows (a minimal sketch follows the list):

  1. The processing thread writes the client request to the binlog file.
  2. It sends a BinlogSync request to the follower slots to synchronize the log.
  3. The follower slots return BinlogSyncAck to report the synchronization status.
  4. After ACKs from a majority of the replicas have been received, the request is applied to the DB.
  5. The response is returned to the client.
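
A minimal sketch of the majority check in step 4, assuming a replication group of three replicas; the function name and ACK representation are illustrative:

```go
package main

import "fmt"

// ackQuorum reports whether a binlog entry may be applied to the DB: the
// leader's own binlog write plus the follower BinlogSyncAck responses must
// cover a majority of the replicas in the replication group.
func ackQuorum(followerAcks []bool, replicas int) bool {
	count := 1 // the leader's own binlog write counts toward the majority
	for _, ok := range followerAcks {
		if ok {
			count++
		}
	}
	return count > replicas/2
}

func main() {
	followerAcks := []bool{true, false} // BinlogSyncAck from two followers
	if ackQuorum(followerAcks, 3) {
		fmt.Println("majority reached: apply binlog entry to DB, reply to client")
	} else {
		fmt.Println("waiting for more BinlogSyncAck before writing DB")
	}
}
```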

Cluster Metadata Processing

Based on codis-dashboard, we developed Pika Manager (PM for short) as the global control node of the entire cluster, used to deploy, schedule, and manage the cluster. PM stores the metadata and routing information of the entire cluster.

  • Added the ability to create multiple tables in a cluster, making it easy to isolate the data of different services by table.
  • The number of slots and replicas can be specified when creating a table, making it easy for operations staff to size tables according to service scale and fault-tolerance requirements.
  • Logically changed the concept of a group to a replication group, turning the original process-level data and log replication into slot-level replication.
  • A password can be set when creating a table to isolate services. The client only needs to execute AUTH and SELECT commands to authenticate and operate on the specified table.
  • Slot migration is supported, making it easy to scale out or in according to service requirements.
  • A Sentinel-like module is integrated: PM continuously sends heartbeats to the Pika nodes in the cluster to monitor their liveness. If PM finds that a leader slot is down, it automatically promotes the follower slot with the largest binlog offset to be the new leader slot (see the sketch after this list).
  • The storage back end supports writing metadata to etcd to ensure high availability of the metadata.
  • Pika Manager achieves its own high availability by continuously competing for a lock in etcd to become the leader.
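
For the failover rule above (promote the live follower slot with the largest binlog offset), a minimal sketch is shown below; the replica fields are assumptions about what PM learns from its heartbeats:

```go
package main

import "fmt"

// replica is an illustrative view of one slot replica as seen by PM.
type replica struct {
	addr         string
	binlogOffset int64
	alive        bool
}

// pickNewLeader returns the live follower with the largest binlog offset,
// or false when no live follower is available.
func pickNewLeader(followers []replica) (replica, bool) {
	var best replica
	found := false
	for _, r := range followers {
		if !r.alive {
			continue
		}
		if !found || r.binlogOffset > best.binlogOffset {
			best, found = r, true
		}
	}
	return best, found
}

func main() {
	followers := []replica{
		{addr: "10.0.0.2:9221", binlogOffset: 1200, alive: true},
		{addr: "10.0.0.3:9221", binlogOffset: 1500, alive: true},
	}
	if leader, ok := pickNewLeader(followers); ok {
		fmt.Println("promote", leader.addr, "to leader slot")
	}
}
```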

Afterword

The introduction of the Pika native cluster removes the single-disk capacity limit of a standalone Pika instance, and the cluster can be expanded according to service requirements. Some shortcomings remain, such as the lack of Raft-based internal leader election, range-based data distribution, and a dashboard for displaying monitoring information. We will address these issues in subsequent releases.