Introduction

SeaweedFS is a simple, highly scalable distributed file system. It can:

  1. Store billions of files
  2. Serve files fast

Seaweed-FS started out as an implementation of Facebook's Haystack paper, which was designed to optimize Facebook's internal image storage and retrieval. Later, on this basis, the author added several features to form the current SeaweedFS.

SeaweedFS started out as an object store designed to handle small files efficiently. Instead of managing all file metadata in a central master server (as the HDFS NameNode does), it lets the volume servers manage files and their metadata. This relieves concurrency pressure on the central master and moves file metadata onto the volume servers, allowing faster file access: each read takes only one disk operation (O(1) disk reads), and metadata costs only 40 bytes of storage overhead per file.

Source code: github.com/chrislusf/s… Official documentation: github.com/chrislusf/s…

Components

Quick Start

Docker

1. Start the container

1.1 Starting the Master Cluster

$docker run -p 9333:9333 --name master0 chrislusf/seaweedfs master -port 9333 -peers {host2}:9444,{host3}:9555 &
$docker run -p 9444:9444 --name master1 chrislusf/seaweedfs master -port 9444 -peers {host1}:9333,{host3}:9555 &
$docker run -p 9555:9555 --name master2 chrislusf/seaweedfs master -port 9555 -peers {host1}:9333,{host2}:9444 &

#{host} is replaced with the corresponding container IP

You can visit http://localhost:9333/ at this point

1.2 Starting the Volume Cluster

$docker run -p 8080:8080 -p 18080:18080 --name volume0 --link master0 chrislusf/seaweedfs volume -max=5 -mserver={master}:9333 -port=8080 &

$docker run -p 8081:8081 -p 18081:18081 --name volume1 --link master0 chrislusf/seaweedfs volume -max=5 -mserver={master}:9333 -port=8081 &

$docker run -p 8082:8082 -p 18082:18082 --name volume2 --link master0 chrislusf/seaweedfs volume -max=5 -mserver={master}:9333 -port=8082 &

#{master} is replaced with the leader master IP

2. Upload files

2.1 Requesting file ID allocation

GET http://localhost:9333/dir/assign

2.2 Uploading files

POST http://localhost:8082/7,019238e0c0

3. Obtain files

GET http://localhost:8082/7,019238e0c0
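
To make the three steps above concrete, here is a minimal Go sketch of the assign/upload/read flow. The JSON field names (fid, url, publicUrl) follow the typical /dir/assign response but should be verified against your SeaweedFS version.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"mime/multipart"
	"net/http"
)

// assignResult mirrors the typical /dir/assign JSON reply
// (field names assumed; verify against your SeaweedFS version).
type assignResult struct {
	Fid       string `json:"fid"`
	URL       string `json:"url"`
	PublicURL string `json:"publicUrl"`
}

func main() {
	// 1. Ask the master for a file id and a target volume server.
	resp, err := http.Get("http://localhost:9333/dir/assign")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var a assignResult
	if err := json.NewDecoder(resp.Body).Decode(&a); err != nil {
		panic(err)
	}

	// 2. Upload the file body to the assigned volume server as multipart form data.
	var buf bytes.Buffer
	w := multipart.NewWriter(&buf)
	part, _ := w.CreateFormFile("file", "hello.txt")
	part.Write([]byte("hello seaweedfs"))
	w.Close()

	uploadURL := fmt.Sprintf("http://%s/%s", a.URL, a.Fid)
	upResp, err := http.Post(uploadURL, w.FormDataContentType(), &buf)
	if err != nil {
		panic(err)
	}
	upResp.Body.Close()

	// 3. Read the file back from the same volume server.
	getResp, err := http.Get(uploadURL)
	if err != nil {
		panic(err)
	}
	defer getResp.Body.Close()
	data, _ := io.ReadAll(getResp.Body)
	fmt.Println(string(data))
}
```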

Master

Principle

Replica synchronization between Master Server clusters is based on the RAFT protocol.

The master server stores the Topology information of all volume servers. The Topology is organized as: Topology -> Data Center -> Rack -> Data Node -> Volume. To find all volume information on a Data Node, traverse Data Center -> Rack -> Data Node.

In addition, the leader master also needs to find all the nodes where a volume is located (e.g. /dir/lookup in the Master API), so it maintains another structure: Topology -> Collection -> VolumeLayout -> Data Node List. Each Collection organizes its VolumeLayouts by replication type. A VolumeLayout stores the mapping from each volume ID to its location in the Data Node List, along with the read-write state of the volume.
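
To picture the two structures, here is a simplified Go sketch; the type and field names are illustrative, not the actual SeaweedFS definitions.

```go
package topology

// Illustrative types only; the real SeaweedFS structs differ in detail.

type Needle struct{} // a file entry inside a volume

type Volume struct {
	ID      uint32
	Size    uint64
	Needles map[uint64]Needle
}

type DataNode struct {
	IP      string
	Port    int
	Volumes map[uint32]*Volume
}

type Rack struct {
	Name      string
	DataNodes []*DataNode
}

type DataCenter struct {
	Name  string
	Racks []*Rack
}

// Topology: the tree walked to find all volumes on a data node.
type Topology struct {
	DataCenters []*DataCenter
}

// VolumeLayout: the flat index the leader keeps per collection and replica
// placement, mapping volume id -> locations plus writability.
type VolumeLayout struct {
	Locations map[uint32][]*DataNode // volume id -> data node list
	Writable  map[uint32]bool
}

type Collection struct {
	Name    string
	Layouts map[string]*VolumeLayout // keyed by replication type, e.g. "001"
}
```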

The leader master keeps a connection with each volume server through heartbeats. The volume server reports its local volume information (such as additions, deletions, and expirations) to the leader master through the heartbeat.

Then, the leader master synchronizes the volume location information to the other masters over gRPC (KeepConnected). Therefore, when the location of a volume is queried, the leader master reads directly from its local Topology, while a non-leader master reads from its local MasterClient's vidMap.

When a file is to be uploaded, the leader Master is also responsible for assigning a globally unique and increasing ID as the file ID.

Parameters

Example: $ weed master -port 9333 -peers {host2}:9444,{host3}:9555

Parameter Type Description
-ip String IP address of the master server (default "localhost")
-port Int HTTP listening port (default 9333)
-peers String Other master nodes in the cluster, comma-separated as ip:port, e.g. 127.0.0.1:9093,127.0.0.1:9094
-volumePreallocate (flag) Preallocate disk space for volumes
-volumeSizeLimitMB Uint The master stops writing to volumes that exceed this limit (default 30000)
-cpuProfile String CPU profile output file
-defaultReplication String Default replication type if none is specified (default "000")
-garbageThreshold String Threshold for vacuuming and reclaiming space (default "0.3")
-ip.bind String IP address to bind to (default "0.0.0.0")
-maxCpu Int Maximum number of CPUs; 0 means all available CPUs
-mdir String Data directory to store metadata (default "/tmp")
-memProfile String Memory profile output file
-pulseSeconds Int Number of seconds between heartbeats
-secure.secret String Secret used to encrypt JSON Web Tokens (JWT)
-whiteList String Comma-separated IP addresses with write permission; empty means no restriction

defaultReplication

- 000 No replication, only one copy of the data
- 001 Replicate once on the same rack
- 010 Replicate once on a different rack in the same data center
- 100 Replicate once in a different data center
- 200 Replicate twice, in two different data centers
- 110 Replicate once on a different rack and once in a different data center

If the replication type is xyz:

- x Number of copies in other data centers
- y Number of copies on other racks in the same data center
- z Number of copies on other servers in the same rack
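
Interpreting the xyz digits can be sketched with a small hypothetical Go helper (not the actual SeaweedFS parser):

```go
package main

import "fmt"

// ReplicaPlacement interprets a three-digit replication string "xyz":
// x = copies in other data centers, y = copies on other racks in the
// same data center, z = copies on other servers in the same rack.
type ReplicaPlacement struct {
	DiffDataCenterCount int
	DiffRackCount       int
	SameRackCount       int
}

func ParseReplicaPlacement(s string) (ReplicaPlacement, error) {
	if len(s) != 3 {
		return ReplicaPlacement{}, fmt.Errorf("replication must be 3 digits, got %q", s)
	}
	return ReplicaPlacement{
		DiffDataCenterCount: int(s[0] - '0'),
		DiffRackCount:       int(s[1] - '0'),
		SameRackCount:       int(s[2] - '0'),
	}, nil
}

func main() {
	rp, _ := ParseReplicaPlacement("110")
	// "110": one extra copy in another data center and one on another rack,
	// for a total of 3 copies including the original.
	total := 1 + rp.DiffDataCenterCount + rp.DiffRackCount + rp.SameRackCount
	fmt.Printf("%+v, total copies: %d\n", rp, total)
}
```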

Core components

Startup process

In the startup flow diagram, the dotted lines indicate starting a goroutine.

Process description

  1. Construct the master server

    1. Instantiate the MasterServer

    2. Bind handler funcs using gorilla/mux

    3. StartRefreshWritableVolumes

      1. The leader traverses all volumes in the local memory map and deletes the volumes that exceed the specified size from the writable list of volumeLayout.
      2. A Compact instruction is issued to each volume to compact file holes, using the replication algorithm plus two-phase commit: go through the Volume Location List in the VolumeLayout of every local collection. For each non-read-only volume, call the corresponding volume server's VacuumVolumeCheck to obtain its garbage ratio, check against the specified garbage threshold whether the file holes need to be compacted, and remove the volumes that need compaction from the writable list of the leader master's VolumeLayout. Then send a VacuumVolumeCompact request to each volume server in the Volume Location List; if compaction succeeds on all of them, send a VacuumVolumeCommit request to each volume server to commit the compaction and add the volume back to the leader master's VolumeLayout writable list. Otherwise, a VacuumVolumeCleanup request is sent to the volume servers to discard the compacted files. The volume is not writable during this period.
      3. A scheduled task runs this compaction every 15 minutes.
    4. Start the volume grow request goroutine, which invokes the volume server interface via RPC to request growing new volumes

  2. Start the raft server

SeaweedFS uses the author's own Raft framework. Raft is a consensus algorithm for managing a replicated log; the masters use the Raft protocol for leader election, heartbeats, and log synchronization.

Raft demo: thesecretlivesofdata.com/raft

Github:github.com/chrislusf/r…

  3. Start gRPC

Register the master server and raft server with gRPC and start it. The project uses google.golang.org/[email protected].

  4. If Raft has not elected a leader after 15 seconds, the first peer is manually selected as the leader

  5. Data is synchronized from the leader master through the MasterClient's KeepConnected; the master receives volume change information and saves it in the MasterClient's vidMap

  6. Start listening for client HTTP requests

Volume

Principle

Volume Server is the location where files are stored. Files are organized in the following formats: Store -> DiskLocations -> Volumes -> Needles

DiskLocations correspond to different directories. Each directory contains multiple volumes, and each volume is a .dat file on disk. To quickly locate a business file inside a .dat file, each volume has a corresponding .idx file that stores the offset and size of every business file in the .dat file. The .idx file is read into LevelDB when the volume server starts, to speed up index queries. Although a business file is replicated across multiple volume servers, all volume servers are peers: each accepts external requests equally and, after receiving an upload, replicates the business file to the other machines.

Directory structure

.idx

One index corresponds to one row
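
Assuming the common fixed-size layout of an 8-byte key, a 4-byte offset, and a 4-byte size per row (verify against your SeaweedFS version), reading the .idx file could look roughly like this:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"io"
	"os"
)

// idxEntry is one row of a .idx file: needle key, offset into the .dat
// file, and the stored size. Layout assumed: 8 + 4 + 4 = 16 bytes.
type idxEntry struct {
	Key    uint64
	Offset uint32 // often stored in aligned units rather than raw bytes
	Size   uint32
}

func readIndex(path string) ([]idxEntry, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var entries []idxEntry
	buf := make([]byte, 16)
	for {
		if _, err := io.ReadFull(f, buf); err != nil {
			if err == io.EOF {
				break
			}
			return nil, err
		}
		entries = append(entries, idxEntry{
			Key:    binary.BigEndian.Uint64(buf[0:8]),
			Offset: binary.BigEndian.Uint32(buf[8:12]),
			Size:   binary.BigEndian.Uint32(buf[12:16]),
		})
	}
	return entries, nil
}

func main() {
	entries, err := readIndex("1.idx") // hypothetical volume index file
	if err != nil {
		fmt.Println("read error:", err)
		return
	}
	fmt.Printf("loaded %d index entries\n", len(entries))
}
```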

Heartbeat

The gRPC doHeartbeat call maintains a continuous heartbeat between the volume server and the leader master.

The Volume Server obtains volumeSizeLimit from the Leader Master as the size limit of a single volume.

The leader master collects statistics from the volume server on the total number of volumes in each DiskLocation, the maximum file key ID, the .dat file size, fileCount, deleteCount, and deletedSize. The volume server reports this data to the leader master periodically, and also whenever a volume is added, deleted, expired, or reclaimed. In this way, the leader master knows which volumes are distributed on which volume servers and knows the running status of each volume server in real time.

Space recycling

Because needle deletion is a soft delete in append mode, the .dat file may contain deleted files. Once these exceed a certain percentage, the space needs to be reclaimed by compaction.

The compaction process reads the .idx file into LevelDB, filters out deleted and expired files, writes the surviving entries into a temporary .cpx index file, and copies the corresponding data from the .dat file into a .cpd file.

After the copy-and-compact process is complete, a commit renames the .cpx and .cpd files to .idx and .dat. If new needles were written in the meantime, they are appended to the end of the newly generated files.
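
A simplified sketch of the filtering step, with the .cpd/.cpx writes left as placeholder comments:

```go
package main

import "fmt"

// Entry is a live or deleted record from the volume index.
type Entry struct {
	Key     uint64
	Size    uint32
	Deleted bool
	Expired bool
	Data    []byte
}

// compact filters out deleted/expired needles and rewrites the survivors
// into temporary .cpd (data) and .cpx (index) files; a commit then renames
// them over the old .dat/.idx. The helpers below are placeholders.
func compact(entries []Entry) (kept []Entry) {
	for _, e := range entries {
		if e.Deleted || e.Expired {
			continue // holes are dropped, reclaiming their space
		}
		kept = append(kept, e)
		// writeToCpd(e.Data)     // append data to volume.cpd
		// writeToCpx(e.Key, ...) // record the new offset/size in volume.cpx
	}
	return kept
}

func main() {
	entries := []Entry{
		{Key: 1, Size: 10},
		{Key: 2, Size: 20, Deleted: true},
		{Key: 3, Size: 30},
	}
	kept := compact(entries)
	fmt.Printf("kept %d of %d needles after compaction\n", len(kept), len(entries))
}
```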

Needle format

A needle (business file) is stored in the .dat file in three sections: [Header][Data][EXT]

The sections are organized as follows:

Header: [Cookie(4 bytes)][NeedleId(8 bytes)][DataSize(4 bytes)]
Data:   [Data][CheckSum(4 bytes)][LastAppendNs(4 bytes)][Padding]
EXT:    [Name][MimeSize(1 byte)][Mime][LastModified(5 bytes)][TTL:Count(1 byte)][TTL:Unit(1 byte)][PairsSize(2 bytes)][Pairs]

The fields from Name to Pairs are optional and are indicated by the corresponding bits of the Flags field.
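
As an illustration, encoding just the fixed Header section described above could look like this (a sketch, not the actual SeaweedFS encoder):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// needleHeader mirrors the fixed-size header described above:
// Cookie(4 bytes) + NeedleId(8 bytes) + DataSize(4 bytes) = 16 bytes.
type needleHeader struct {
	Cookie   uint32
	NeedleID uint64
	DataSize uint32
}

func (h needleHeader) encode() []byte {
	buf := make([]byte, 16)
	binary.BigEndian.PutUint32(buf[0:4], h.Cookie)
	binary.BigEndian.PutUint64(buf[4:12], h.NeedleID)
	binary.BigEndian.PutUint32(buf[12:16], h.DataSize)
	return buf
}

func main() {
	h := needleHeader{Cookie: 0xe0c0, NeedleID: 0x019238, DataSize: 11}
	fmt.Printf("header bytes: % x\n", h.encode())
}
```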

Parameters

Example: $ weed volume -max=5 -mserver=localhost:9333 -port=8080

Parameter Type Description
-ip String IP address or server name
-dir String Directories to store data files: dir[,dir]... (default "/tmp")
-max String Maximum number of volumes: count[,count]... (default 7)
-port Int HTTP listening port (default 8080)
-cpuProfile String CPU profile output file
-dataCenter String Data center name of the current volume server
-rack String Rack name of the current volume server
-idleTimeout Int Connection idle seconds (default 30)
-images.fix.orientation (true/false) Adjust JPG orientation while uploading
-index String Memory/performance trade-off mode: [memory|leveldb|boltdb|btree]
-ip.bind String IP address to bind to (default "0.0.0.0")
-maxCpu Int Maximum number of CPUs
-memProfile String Memory profile output file
-mserver String Comma-separated list of master servers (default "localhost:9333")
-publicUrl String Publicly accessible address
-pulseSeconds Int Number of seconds between heartbeats; must be smaller than or equal to the master server setting (default 5)
-read.redirect (true/false) Redirect moved or non-local volumes
-whiteList String Comma-separated IP addresses with write permission; empty means no restriction

Startup process

Master Server API

See the full API at: github.com/chrislusf/s…

/dir/assign

Assign a volume: this allocates a data node and a file ID, which can then be used to upload data to that data node. The leader master first checks whether the local VolumeLayout has a volume with the required number of replicas. If so, it responds directly with that node; otherwise it grows new volumes by calling the AllocateVolume method of the chosen volume servers over gRPC. Those volume servers report the new volume to the leader master via SendHeartbeat, and the leader master synchronizes the change to the other masters' MasterClients through the KeepConnected mechanism, completing the volume allocation.
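
The decision the leader makes can be sketched as follows; the types and the growth trigger are illustrative only.

```go
package main

import (
	"errors"
	"fmt"
)

// Location is a volume server that holds a writable volume.
type Location struct{ URL string }

// VolumeLayout is a stand-in for the leader's per-collection layout.
type VolumeLayout struct {
	writable map[uint32][]Location
}

// pickWritable returns a writable volume with enough replicas, or an
// error telling the caller to grow a new volume first.
func (vl *VolumeLayout) pickWritable(replicaCount int) (uint32, []Location, error) {
	for vid, locs := range vl.writable {
		if len(locs) >= replicaCount {
			return vid, locs, nil
		}
	}
	return 0, nil, errors.New("no writable volume: grow a new one via AllocateVolume")
}

func main() {
	vl := &VolumeLayout{writable: map[uint32][]Location{
		7: {{URL: "localhost:8082"}},
	}}
	vid, locs, err := vl.pickWritable(1)
	if err != nil {
		fmt.Println(err) // the leader would trigger volume growth here
		return
	}
	fmt.Printf("assign to volume %d at %s\n", vid, locs[0].URL)
}
```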

/dir/lookup

Find the location of a volume. If it is the leader master, get it from the local topology. Otherwise, it is obtained from the Master client’s vidMap, where the location of the corresponding volume is obtained from the Leader Master through the KeepConnected mechanism.

/vol/grow

Add a volume. The logic is the same as that in /dir/assign.

/vol/vacuum

Use the replication algorithm and two-phase commit to compact the holes in volume server disk files. The compaction logic is the same as the one run when the master server starts.
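
A rough sketch of the check/compact/commit/cleanup sequence, with the gRPC calls abstracted behind a hypothetical interface:

```go
package main

import "fmt"

// vacuumClient abstracts the calls the leader makes to each volume server;
// the method set mirrors the steps described above, not the real gRPC API.
type vacuumClient interface {
	VacuumCheck(vid uint32) (garbageRatio float64, err error)
	VacuumCompact(vid uint32) error
	VacuumCommit(vid uint32) error
	VacuumCleanup(vid uint32) error
}

// vacuumVolume runs the two-phase flow for one volume across its replicas.
func vacuumVolume(vid uint32, replicas []vacuumClient, threshold float64) error {
	// Phase 0: only vacuum if every replica exceeds the garbage threshold.
	for _, c := range replicas {
		ratio, err := c.VacuumCheck(vid)
		if err != nil || ratio < threshold {
			return fmt.Errorf("skip volume %d", vid)
		}
	}
	// Phase 1: compact on every replica; if any fails, clean up on all of them.
	for _, c := range replicas {
		if err := c.VacuumCompact(vid); err != nil {
			for _, c2 := range replicas {
				c2.VacuumCleanup(vid)
			}
			return err
		}
	}
	// Phase 2: commit everywhere, after which the volume rejoins the writable list.
	for _, c := range replicas {
		if err := c.VacuumCommit(vid); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	err := vacuumVolume(7, nil, 0.3) // no replicas: all phases are no-ops
	fmt.Println("vacuum result:", err)
}
```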

/col/delete

Delete all volumes in a Collection. Obtain all volume servers of the Collection from the leader master's topology, then issue a DeleteCollection RPC call to each volume server. After all data on disk has been deleted, the leader master removes the volumes from its local topology.

Since this information is only available on the leader master, there is no need to notify the other masters. However, the volume server periodically invokes CollectHeartbeat to collect local volume information and sends it to the leader master using SendHeartbeat. The leader master compares the received information with the old data for that node, calculates which volumes have newly gone online or offline (i.e. the delta), and broadcasts that information to the other masters through the KeepConnected mechanism.

Volume API

See the full API at: github.com/chrislusf/s…

Access to the file

GET/HEAD /vid/fid/filename /vid/fid.ext /vid,fid.ext

Obtain a file from the Volume Server in fid format [NeedleId(8 bytes)][Cookie(8 bytes)]. When the server receives the request, it looks up the volume from all diskLocations to see if it exists.

If not, make a /dir/lookup request to the master to get its location and return 302 to the caller;

If the data is in the volume server, the data is read directly, and the cookie and checkSum are compared. If they are correct, the data is returned to the caller.
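
The read path can be summarized with a small sketch (hypothetical types; the real handler parses the fid and verifies the cookie and checksum):

```go
package main

import (
	"fmt"
	"net/http"
)

// store is a stand-in for the volume server's local volume map.
type store struct {
	volumes map[uint32]map[uint64][]byte // vid -> needle id -> data
}

// handleGet serves /vid,fid style reads: serve locally if the volume is
// here, otherwise redirect the caller to wherever the master says it lives.
func (s *store) handleGet(w http.ResponseWriter, r *http.Request, vid uint32, nid uint64) {
	vol, ok := s.volumes[vid]
	if !ok {
		// Volume not local: ask the master via /dir/lookup, then reply 302.
		http.Redirect(w, r, "http://other-volume-server"+r.URL.Path, http.StatusFound)
		return
	}
	data, ok := vol[nid]
	if !ok {
		http.NotFound(w, r)
		return
	}
	// In the real flow the cookie and checksum are verified before replying.
	w.Write(data)
}

func main() {
	s := &store{volumes: map[uint32]map[uint64][]byte{
		7: {0x019238: []byte("hello")},
	}}
	// handleGet would be wired into an http.ServeMux in a real server.
	fmt.Println("store ready:", len(s.volumes), "volume(s)")
}
```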

Delete the file

DELETE /vid/fid/filename /vid/fid.ext /vid,fid.ext

When the volume server receives the request, it reads the file from the local needle mapper and checks the cookie. If they match, it deletes the entry from the local needle mapper, appends a needle with empty data to the end of the .dat file, and adds a record with an invalid size to the .idx file to indicate that the needle has been deleted. In other words, this is a logical deletion, not a physical one.

After the local files are deleted, the same deletion request is sent to other volume servers to delete all other backup files.

As you can see, deleting a file does not actually remove it from the .dat file, so frequent deletions create a large amount of dead space and reduce the effective storage capacity of the volume server's disk. This is why the compact operation is needed to compress the .dat file.
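
A sketch of the soft-delete bookkeeping and the garbage ratio that later drives compaction (illustrative only):

```go
package main

import "fmt"

// volume tracks how much of the .dat file is live versus soft-deleted:
// a delete appends an empty-data needle to .dat and writes an
// invalid-size record to .idx instead of rewriting anything.
type volume struct {
	liveBytes    uint64
	deletedBytes uint64
	deleted      map[uint64]bool
}

func (v *volume) softDelete(nid uint64, size uint64) {
	if v.deleted[nid] {
		return
	}
	v.deleted[nid] = true
	v.liveBytes -= size
	v.deletedBytes += size
}

// garbageRatio is what the master's vacuum check compares to its threshold.
func (v *volume) garbageRatio() float64 {
	total := v.liveBytes + v.deletedBytes
	if total == 0 {
		return 0
	}
	return float64(v.deletedBytes) / float64(total)
}

func main() {
	v := &volume{liveBytes: 1000, deleted: map[uint64]bool{}}
	v.softDelete(42, 400)
	fmt.Printf("garbage ratio after delete: %.2f\n", v.garbageRatio())
}
```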

Upload a file

PUT /vid/fid/filename /vid/fid.ext /vid,fid.ext

When the volume server receives a request, it checks whether the file already exists with the same cookie, checksum, and data; if they are identical it returns directly, otherwise it appends a needle to the end of the .dat file and adds a record to the .idx file.

After volume Server has added local files, it makes the same write requests to other volume Servers to commit all other backup files.

Conclusion

In fact, my main purpose in studying SeaweedFS was to learn how Golang is applied in real projects, including Go language features, the use of mature libraries, object-oriented thinking in Go, server architecture design, networking, files, exception handling, Raft, gRPC, and so on. That is why I chose a distributed file system project with simple business functions. On the other hand, it can be compared with HDFS, which is developed in Java, giving me more and deeper insights. Of course, the differences between the two are not covered in detail in this article; they may be covered after deeper use of SeaweedFS.

Easter egg

I didn't finish it until the first day of the New Year, and now I'm on the high-speed train back home. I wish you all a happy Year of the Tiger.