1. Tune the system configuration

  1. Example: set the read-ahead cache for a disk

    echo "8192" > /sys/block/sda/queue/read_ahead_kb 
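    This change does not survive a reboot. A minimal sketch for verifying the value and persisting it with a udev rule (the device name sda and the rule file name are assumptions, not from the original):

    # Check the value currently in effect
    cat /sys/block/sda/queue/read_ahead_kb
    # One way to persist it across reboots (hypothetical rule file name)
    echo 'ACTION=="add|change", KERNEL=="sda", ATTR{queue/read_ahead_kb}="8192"' > /etc/udev/rules.d/99-read-ahead.rules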
  2. Example: raise the maximum number of system processes (pid_max)

    echo 4194303 > /proc/sys/kernel/pid_max
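    To keep the setting after a reboot, it can also be written to a sysctl configuration file (the file name below is a hypothetical example):

    echo "kernel.pid_max = 4194303" > /etc/sysctl.d/pid_max.conf
    sysctl -p /etc/sysctl.d/pid_max.conf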
  3. Adjusting CPU Performance

    Note: Virtual machines and some CPUs may not support frequency-scaling adjustments.

    1) Make sure the kernel tuning tools are installed:

    yum -y install kernel-tools

    2) Adjust to performance mode

    The governor can be set per core (replace ${i} with the core number); a loop over all cores is sketched below:

    echo performance > /sys/devices/system/cpu/cpu${i}/cpufreq/scaling_governor
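    A minimal loop sketch for applying the governor to every core (assumes a bash shell and that the cpufreq sysfs interface is present):

    for g in /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor; do
        echo performance > "$g"
    done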

    Or use the cpupower tool:

    cpupower frequency-set -g performance

    Five scaling governors (operating modes) are supported:

    Performance: focuses only on performance and fixes the CPU at the highest supported operating frequency. This mode maximizes system performance.

    Powersave: fixes the CPU at the lowest supported operating frequency, the so-called "power saving" mode. This mode minimizes system power consumption.

    Userspace: hands control of the frequency-scaling policy to user-space applications and exposes interfaces through which they can adjust the CPU frequency.

    Ondemand: adjusts the CPU frequency dynamically on demand. As soon as there is CPU work to do, the frequency jumps to the maximum; once the work completes, it drops back to the minimum.

    Conservative: adjusts the CPU frequency smoothly, changing it in gradual steps. The main difference from ondemand is that it raises the frequency incrementally as needed rather than jumping straight to the maximum.
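
    To see which governors the current driver actually exposes, and which one is active, the cpufreq sysfs files can be read (cpu0 is used here as an example):

    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor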

    Note: some hardware is not supported, and the following error may occur during adjustment:
    [root@CENTOS7-1 ~]# cpupower frequency-set -g performance
    Setting cpu: 0
    Error setting new values. Common errors:
    - Do you have proper administration rights? (super-user?)
    - Is the governor you requested available and modprobed?
    - Trying to set an invalid policy?
    - Trying to set a specific frequency, but userspace governor is not available,
       for example because of hardware which cannot be set to a specific frequency
       or because the userspace governor isn't loaded?
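    If this error appears, cpupower frequency-info can help confirm whether a frequency-scaling driver is active and which governors it supports (output varies by hardware):

    cpupower frequency-info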
  4. Optimizing network parameters

    Edit the configuration file:

    vi /etc/sysctl.d/ceph.conf

    Configuration contents:

    net.ipv4.tcp_rmem = 4096 87380 16777216
    net.ipv4.tcp_wmem = 4096 16384 16777216
    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216
    Copy the code

    Apply the settings:

    sysctl -p /etc/sysctl.d/ceph.conf
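    To confirm the values are in effect, they can be read back (a quick check, not part of the original steps):

    sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem net.core.rmem_max net.core.wmem_max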

2. Optimized Ceph cluster configuration

  1. Main configuration parameters of Ceph

    FILESTORE configuration parameters:

    Parameter | Description | Default | Recommended
    filestore xattr use omap | Use an object map for XATTRs (needed for EXT4 file systems; not required for XFS or BTRFS) | false | true
    filestore max sync interval | Maximum interval for syncing the journal to the data disk (seconds) | 5 | 15
    filestore min sync interval | Minimum interval for syncing the journal to the data disk (seconds) | 0.1 | 10
    filestore queue max ops | Maximum number of operations accepted by the data disk queue | 500 | 25000
    filestore queue max bytes | Maximum number of bytes per data disk operation (bytes) | 100 << 20 | 10485760
    filestore queue committing max ops | Number of operations that can be committed on the data disk | 500 | 5000
    filestore queue committing max bytes | Maximum number of bytes that can be committed on the data disk (bytes) | 100 << 20 | 10485760000
    filestore op threads | Number of concurrent file system operation threads | 2 | 32
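
    The values a running OSD is actually using can be checked through its admin socket; a minimal sketch, assuming it is run on the node hosting osd.0:

    ceph daemon osd.0 config get filestore_max_sync_interval
    ceph daemon osd.0 config show | grep filestore_queue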

    Journal Configuration parameters:

    Parameter | Description | Default | Recommended
    osd journal size | OSD journal size (MB) | 5120 | 20000
    journal max write bytes | Maximum number of bytes written to the journal at one time (bytes) | 10 << 20 | 1073714824
    journal max write entries | Maximum number of entries written to the journal at one time | 100 | 10000
    journal queue max ops | Maximum number of operations in the journal queue at one time | 500 | 50000
    journal queue max bytes | Maximum number of bytes in the journal queue at one time (bytes) | 10 << 20 | 10485760000

    OSD tuning configuration parameters:

    Parameter | Description | Default | Recommended
    osd max write size | Maximum size of a single write to an OSD (MB) | 90 | 512
    osd client message size cap | Maximum number of bytes of client data allowed in memory | 524288000 | 2147483648
    osd deep scrub stride | Number of bytes read per deep scrub read operation | 524288 | 131072
    osd op threads | Number of threads the OSD daemon uses for operations | 2 | 8
    osd disk threads | Number of OSD threads for intensive operations such as recovery and scrubbing | 1 | 4
    osd map cache size | Size of the OSD map cache (MB) | 500 | 1024
    osd map cache bl size | Size of the OSD map cache kept in memory (MB) | 50 | 128

    OSD recovery tuning configuration parameters:

    Parameter | Description | Default | Recommended
    osd recovery op priority | Priority of recovery operations (1 to 63; higher values use more resources) | 10 | 4
    osd recovery max active | Number of recovery requests active at the same time | 15 | 10
    osd max backfills | Maximum number of backfills allowed per OSD | 10 | 4
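
    These recovery limits can also be changed at runtime without restarting the OSDs; a sketch (not from the original) of temporarily throttling recovery on a loaded cluster, with illustrative values:

    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'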

    OSD client (RBD) tuning configuration parameters:

    Parameter | Description | Default | Recommended
    rbd cache | Enable the RBD cache | true | true
    rbd cache size | RBD cache size (bytes) | 33554432 | 268435456
    rbd cache max dirty | Maximum number of dirty bytes allowed when the cache is in write-back mode; 0 means write-through | 25165824 | 134217728
    rbd cache max dirty age | How long dirty data may stay in the cache before being flushed to disk (seconds) | 1 | 5
  2. Optimized Configuration Example

    [global]                                            # Global settings
    fsid = XXXXXXXXXXXXXXX                              # Cluster ID
    mon initial members = centos7-1, centos7-2, centos7-3
    mon host = 10.10.20.11,10.10.20.12,10.10.20.13      # Monitor IP addresses
    auth cluster required = cephx                       # Cluster authentication
    auth service required = cephx                       # Service authentication
    auth client required = cephx                        # Client authentication
    osd pool default size = 2                           # Number of replicas
    osd pool default min size = 1                       # Minimum number of replicas a PG needs in order to accept I/O
    public network = 10.10.20.0/24                      # Public network (monitor IP segment)
    cluster network = 10.10.20.0/24                     # Cluster network
    max open files = 131072                             # Default 0; if set, Ceph sets the max open fds at the OS level
    ############################################################
    [mon]
    mon data = /var/lib/ceph/mon/ceph-$id
    mon clock drift allowed = 1                         # Default 0.05; allowed clock drift between monitors
    mon osd min down reporters = 13                     # Default 1; minimum number of OSDs that must report a peer down to the monitor
    mon osd down out interval = 600                     # Default 300; seconds Ceph waits after an OSD goes down before marking it out
    ############################################################
    [osd]
    osd data = /var/lib/ceph/osd/ceph-$id
    osd journal size = 20000                            # Default 5120; OSD journal size (MB)
    osd journal = /var/lib/ceph/osd/$cluster-$id/journal   # Location of the OSD journal
    osd mkfs type = xfs                                 # File system type used when formatting the OSD
    osd max write size = 512                            # Default 90; maximum size of a single OSD write (MB)
    osd client message size cap = 2147483648            # Default 524288000; maximum bytes of client data allowed in memory
    osd deep scrub stride = 131072                      # Default 524288; bytes read per deep scrub
    osd op threads = 16                                 # Default 2; number of concurrent OSD operation threads
    osd disk threads = 4                                # Default 1; OSD threads for intensive operations such as recovery and scrubbing
    osd map cache size = 1024                           # Default 500; OSD map cache (MB)
    osd map cache bl size = 128                         # Default 50; in-memory OSD map cache (MB)
    osd mount options xfs = "rw,noexec,nodev,noatime,nodiratime,nobarrier"   # Default rw,noatime,inode64; Ceph OSD XFS mount options
    osd recovery op priority = 2                        # Default 10; recovery operation priority (1 to 63)
    osd recovery max active = 10                        # Default 15; number of recovery requests active at the same time
    osd max backfills = 4                               # Default 10; maximum number of backfills allowed per OSD
    osd min pg log entries = 30000                      # Default 3000; minimum number of PG log entries retained
    osd max pg log entries = 100000                     # Default 10000; maximum number of PG log entries retained
    osd mon heartbeat interval = 40                     # Default 30; interval (seconds) at which the OSD pings a monitor
    ms dispatch throttle bytes = 1048576000             # Default 104857600; maximum number of message bytes waiting to be dispatched
    objecter inflight ops = 819200                      # Default 1024; maximum in-flight I/O requests for client flow control; beyond this, application I/O blocks
    osd op log threshold = 50                           # Default 5; how many operations are logged at a time
    osd crush chooseleaf type = 0                       # Default 1; bucket type used when a CRUSH rule uses chooseleaf
    filestore xattr use omap = true                     # Default false; use an object map for XATTRs (needed for EXT4; not required for XFS or BTRFS)
    filestore min sync interval = 10                    # Default 0.1; minimum journal-to-data-disk sync interval (seconds)
    filestore max sync interval = 15                    # Default 5; maximum journal-to-data-disk sync interval (seconds)
    filestore queue max ops = 25000                     # Default 500; maximum operations accepted by the data disk queue
    filestore queue max bytes = 1048576000              # Default 100 << 20; maximum bytes per data disk operation
    filestore queue committing max ops = 50000          # Default 500; operations that can be committed on the data disk
    filestore queue committing max bytes = 10485760000  # Default 100 << 20; maximum bytes that can be committed on the data disk
    filestore split multiple = 8                        # Default 2; maximum number of files in a subdirectory before it is split into child directories
    filestore merge threshold = 40                      # Default 10; minimum number of files in a subdirectory before it is merged into the parent directory
    filestore fd cache size = 1024                      # Default 128; object file handle cache size
    filestore op threads = 32                           # Default 2; number of concurrent file system operation threads
    journal max write bytes = 1073714824                # Default 10485760; maximum bytes written to the journal at one time
    journal max write entries = 10000                   # Default 100; maximum entries written to the journal at one time
    journal queue max ops = 50000                       # Default 50; maximum operations in the journal queue at one time
    journal queue max bytes = 10485760000               # Default 33554432; maximum bytes in the journal queue at one time
    ############################################################
    [client]
    rbd cache = true                                    # Default true; enable the RBD cache
    rbd cache size = 335544320                          # Default 33554432; RBD cache size (bytes)
    rbd cache max dirty = 134217728                     # Default 25165824; maximum dirty bytes allowed in write-back mode; 0 means write-through
    rbd cache max dirty age = 30                        # Default 1; how long dirty data may stay in the cache before being flushed to disk (seconds)
    rbd cache writethrough until flush = false          # Default true; provided for compatibility with virtio drivers before linux-2.6.32 that never send
                                                        # flush requests; librbd stays in write-through mode until the first flush request arrives, then
                                                        # switches to write-back
    rbd cache max dirty object = 2                      # Default 0 (calculated from rbd cache size); librbd splits an image into 4 MB chunks, each mapped
                                                        # to an object, and manages the cache per object; increasing this value can improve performance
    rbd cache target dirty = 235544320                  # Default 16777216; amount of dirty data that triggers write-back; must not exceed rbd cache max dirty
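
    Most of the settings above take effect only after the daemons reload them. A common approach (assuming systemd-managed Ceph services) is to distribute ceph.conf to every node and then restart the daemons on each node:

    systemctl restart ceph-mon.target
    systemctl restart ceph-osd.target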

3. Tuning best practices

  1. MON advice

    The deployment of a Ceph cluster must be planned properly, and MON performance is critical to the overall performance of the cluster. MONs should normally run on dedicated nodes. To ensure a proper quorum, the number of MONs should be odd.

  2. OSD advice

    Each Ceph OSD has a journal. The OSD journal and data may reside on the same storage device. A write operation is considered complete only once it has been committed to the journals of all OSDs in the PG, so faster journal performance improves response times.

    In a typical deployment, OSDs use traditional mechanical disks with high latency. To maximize efficiency, Ceph recommends placing OSD journals on separate low-latency SSD or NVMe devices. Administrators must be careful not to place too many OSD journals on the same device, because that can create a performance bottleneck. The following SSD specifications should be considered:

    • Mean time between failures (MTBF) under the expected write load
    • IOPS capacity
    • Data transfer rate
    • Bus/SSD coupling capability

    Red Hat recommends placing at most six OSD journals on each SATA SSD device, or at most 12 OSD journals on each NVMe device.
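
    As a sketch of placing a filestore journal on a separate SSD/NVMe partition when creating an OSD with ceph-volume (available in Luminous and later; the device names are assumptions):

    ceph-volume lvm create --filestore --data /dev/sdb --journal /dev/nvme0n1p1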

  3. RBD advice

    Workloads on RBD block devices are usually I/O intensive, such as databases running on VMs in OpenStack. For RBD, OSD journals must be stored on SSD or NVMe devices. For the back-end storage devices, different service levels can be offered depending on the storage technology backing the OSD daemons, such as NVMe SSDs, SATA SSDs, and HDDs.

  4. Object Gateway advice

    Workloads on Ceph object gateways are typically throughput intensive; objects such as audio and video can be very large. However, the bucket index pool often shows a more I/O-intensive workload pattern, so administrators should store this pool on SSD devices.

    The Ceph Object Gateway maintains an index for each bucket, which Ceph stores in a RADOS object. As the number of objects in a bucket grows very large (beyond roughly 100,000), index performance degrades, because by default a single RADOS object handles all index operations.

    To address this, Ceph can spread a large index across multiple RADOS objects, or shards. Administrators can enable this feature by setting the rgw_override_bucket_index_max_shards configuration parameter in the ceph.conf configuration file. The recommended value for this parameter is the estimated number of objects in the bucket divided by 100,000.
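
    For example, a bucket expected to hold about one million objects would use 1,000,000 / 100,000 = 10 shards. A hypothetical ceph.conf snippet (the RGW instance name is an assumption):

    [client.rgw.gateway-node]
    rgw_override_bucket_index_max_shards = 10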

  5. CephFS advice

    Metadata pools that hold directory structures and other indexes can be a bottleneck for CephFS. You can use SSD devices for this pool.
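
    A hypothetical sketch of pinning the metadata pool to SSD-backed OSDs, assuming a CRUSH rule named ssd-rule that selects SSDs already exists and the pool is named cephfs_metadata (both names are assumptions):

    ceph osd pool set cephfs_metadata crush_rule ssd-rule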

    Each CephFS Metadata Server (MDS) maintains an in-memory cache for items such as inodes. Ceph limits the size of this cache with the mds_cache_memory_limit configuration parameter. Its value is expressed in absolute bytes, defaults to 1 GB, and can be tuned as needed.
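
    For example, to raise the MDS cache limit to 4 GB, the parameter can be set in the [mds] section of ceph.conf (the value is illustrative):

    [mds]
    mds_cache_memory_limit = 4294967296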


This article was created and shared by Mirson. For further communication, please join QQ group 19310171 or visit www.softart.cn