1. Tune the system configuration

  1. Example: set the read-ahead cache for a disk

    echo "8192" > /sys/block/sda/queue/read_ahead_kb 
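    This change does not survive a reboot. A minimal sketch for verifying the value and persisting it with a udev rule (the device name sda and the rule file name are assumptions, not from the original):

    # Check the value currently in effect
    cat /sys/block/sda/queue/read_ahead_kb
    # One way to persist it across reboots (hypothetical rule file name)
    echo 'ACTION=="add|change", KERNEL=="sda", ATTR{queue/read_ahead_kb}="8192"' > /etc/udev/rules.d/99-read-ahead.rules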
  2. Example: raise the maximum number of system processes (pid_max)

    echo 4194303 > /proc/sys/kernel/pid_max
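    To keep the setting after a reboot, it can also be written to a sysctl configuration file (the file name below is a hypothetical example):

    echo "kernel.pid_max = 4194303" > /etc/sysctl.d/pid_max.conf
    sysctl -p /etc/sysctl.d/pid_max.conf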
  3. Adjusting CPU Performance

    Note: Virtual machines and some CPUs may not support frequency-scaling adjustments.

    1) Make sure the kernel tuning tools are installed:

    yum -y install kernel-tools

    2) Adjust to performance mode

    The governor can be set per core (replace ${i} with the core number); a loop over all cores is sketched below:

    echo performance > /sys/devices/system/cpu/cpu${i}/cpufreq/scaling_governor
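    A minimal loop sketch for applying the governor to every core (assumes a bash shell and that the cpufreq sysfs interface is present):

    for g in /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor; do
        echo performance > "$g"
    done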

    Or use the cpupower tool:

    cpupower frequency-set -g performance

    Five scaling governors (operating modes) are supported:

    Performance: focuses only on performance and fixes the CPU at the highest supported operating frequency. This mode maximizes system performance.

    Powersave: fixes the CPU at the lowest supported operating frequency, the so-called "power saving" mode. This mode minimizes system power consumption.

    Userspace: hands control of the frequency-scaling policy to user-space applications and exposes interfaces through which they can adjust the CPU frequency.

    Ondemand: adjusts the CPU frequency dynamically on demand. As soon as there is CPU work to do, the frequency jumps to the maximum; once the work completes, it drops back to the minimum.

    Conservative: adjusts the CPU frequency smoothly, changing it in gradual steps. The main difference from ondemand is that it raises the frequency incrementally as needed rather than jumping straight to the maximum.
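
    To see which governors the current driver actually exposes, and which one is active, the cpufreq sysfs files can be read (cpu0 is used here as an example):

    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor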

    Note: some hardware is not supported, and the following error may occur during adjustment:
    [root@CENTOS7-1 ~]# cpupower frequency-set -g performance
    Setting cpu: 0
    Error setting new values. Common errors:
    - Do you have proper administration rights? (super-user?)
    - Is the governor you requested available and modprobed?
    - Trying to set an invalid policy?
    - Trying to set a specific frequency, but userspace governor is not available,
       for example because of hardware which cannot be set to a specific frequency
       or because the userspace governor isn't loaded?
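    If this error appears, cpupower frequency-info can help confirm whether a frequency-scaling driver is active and which governors it supports (output varies by hardware):

    cpupower frequency-info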
  4. Optimizing network parameters

    Edit the configuration file:

    vi /etc/sysctl.d/ceph.conf

    Configuration contents:

    net.ipv4.tcp_rmem = 4096 87380 16777216
    net.ipv4.tcp_wmem = 4096 16384 16777216
    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216
    Copy the code

    Apply the settings:

    sysctl -p /etc/sysctl.d/ceph.conf
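    To confirm the values are in effect, they can be read back (a quick check, not part of the original steps):

    sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem net.core.rmem_max net.core.wmem_max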

2. Optimized Ceph cluster configuration

  1. Main configuration parameters of Ceph

    FILESTORE configuration parameters:

    Parameter | Description | Default | Recommended
    filestore xattr use omap | Use an object map for XATTRs (needed for EXT4 file systems; not required for XFS or BTRFS) | false | true
    filestore max sync interval | Maximum interval for syncing the journal to the data disk (seconds) | 5 | 15
    filestore min sync interval | Minimum interval for syncing the journal to the data disk (seconds) | 0.1 | 10
    filestore queue max ops | Maximum number of operations accepted by the data disk queue | 500 | 25000
    filestore queue max bytes | Maximum number of bytes per data disk operation (bytes) | 100 << 20 | 10485760
    filestore queue committing max ops | Number of operations that can be committed on the data disk | 500 | 5000
    filestore queue committing max bytes | Maximum number of bytes that can be committed on the data disk (bytes) | 100 << 20 | 10485760000
    filestore op threads | Number of concurrent file system operation threads | 2 | 32
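
    The values a running OSD is actually using can be checked through its admin socket; a minimal sketch, assuming it is run on the node hosting osd.0:

    ceph daemon osd.0 config get filestore_max_sync_interval
    ceph daemon osd.0 config show | grep filestore_queue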

    Journal Configuration parameters:

    Parameter | Description | Default | Recommended
    osd journal size | OSD journal size (MB) | 5120 | 20000
    journal max write bytes | Maximum number of bytes written to the journal at one time (bytes) | 10 << 20 | 1073714824
    journal max write entries | Maximum number of entries written to the journal at one time | 100 | 10000
    journal queue max ops | Maximum number of operations in the journal queue at one time | 500 | 50000
    journal queue max bytes | Maximum number of bytes in the journal queue at one time (bytes) | 10 << 20 | 10485760000

    OSD tuning configuration parameters:

    Parameter | Description | Default | Recommended
    osd max write size | Maximum size of a single write to an OSD (MB) | 90 | 512
    osd client message size cap | Maximum number of bytes of client data allowed in memory | 524288000 | 2147483648
    osd deep scrub stride | Number of bytes read per deep scrub read operation | 524288 | 131072
    osd op threads | Number of threads the OSD daemon uses for operations | 2 | 8
    osd disk threads | Number of OSD threads for intensive operations such as recovery and scrubbing | 1 | 4
    osd map cache size | Size of the OSD map cache (MB) | 500 | 1024
    osd map cache bl size | Size of the OSD map cache kept in memory (MB) | 50 | 128

    OSD recovery tuning configuration parameters:

    Parameter | Description | Default | Recommended
    osd recovery op priority | Priority of recovery operations (1 to 63; higher values use more resources) | 10 | 4
    osd recovery max active | Number of recovery requests active at the same time | 15 | 10
    osd max backfills | Maximum number of backfills allowed per OSD | 10 | 4
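
    These recovery limits can also be changed at runtime without restarting the OSDs; a sketch (not from the original) of temporarily throttling recovery on a loaded cluster, with illustrative values:

    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'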

    OSD client (RBD) tuning configuration parameters:

    Parameter | Description | Default | Recommended
    rbd cache | Enable the RBD cache | true | true
    rbd cache size | RBD cache size (bytes) | 33554432 | 268435456
    rbd cache max dirty | Maximum number of dirty bytes allowed when the cache is in write-back mode; 0 means write-through | 25165824 | 134217728
    rbd cache max dirty age | How long dirty data may stay in the cache before being flushed to disk (seconds) | 1 | 5
  2. Optimized Configuration Example

    [global]                                            # Global settings
    fsid = XXXXXXXXXXXXXXX                              # Cluster ID
    mon initial members = centos7-1, centos7-2, centos7-3
    mon host = 10.10.20.11,10.10.20.12,10.10.20.13      # Monitor IP addresses
    auth cluster required = cephx                       # Cluster authentication
    auth service required = cephx                       # Service authentication
    auth client required = cephx                        # Client authentication
    osd pool default size = 2                           # Number of replicas
    osd pool default min size = 1                       # Minimum number of replicas a PG needs in order to accept I/O
    public network = 10.10.20.0/24                      # Public network (monitor IP segment)
    cluster network = 10.10.20.0/24                     # Cluster network
    max open files = 131072                             # Default 0; if set, Ceph sets the max open fds at the OS level
    ############################################################
    [mon]
    mon data = /var/lib/ceph/mon/ceph-$id
    mon clock drift allowed = 1                         # Default 0.05; allowed clock drift between monitors
    mon osd min down reporters = 13                     # Default 1; minimum number of OSDs that must report a peer down to the monitor
    mon osd down out interval = 600                     # Default 300; seconds Ceph waits after an OSD goes down before marking it out
    ############################################################
    [osd]
    osd data = /var/lib/ceph/osd/ceph-$id
    osd journal size = 20000                            # Default 5120; OSD journal size (MB)
    osd journal = /var/lib/ceph/osd/$cluster-$id/journal   # Location of the OSD journal
    osd mkfs type = xfs                                 # File system type used when formatting the OSD
    osd max write size = 512                            # Default 90; maximum size of a single OSD write (MB)
    osd client message size cap = 2147483648            # Default 524288000; maximum bytes of client data allowed in memory
    osd deep scrub stride = 131072                      # Default 524288; bytes read per deep scrub
    osd op threads = 16                                 # Default 2; number of concurrent OSD operation threads
    osd disk threads = 4                                # Default 1; OSD threads for intensive operations such as recovery and scrubbing
    osd map cache size = 1024                           # Default 500; OSD map cache (MB)
    osd map cache bl size = 128                         # Default 50; in-memory OSD map cache (MB)
    osd mount options xfs = "rw,noexec,nodev,noatime,nodiratime,nobarrier"   # Default rw,noatime,inode64; Ceph OSD XFS mount options
    osd recovery op priority = 2                        # Default 10; recovery operation priority (1 to 63)
    osd recovery max active = 10                        # Default 15; number of recovery requests active at the same time
    osd max backfills = 4                               # Default 10; maximum number of backfills allowed per OSD
    osd min pg log entries = 30000                      # Default 3000; minimum number of PG log entries retained
    osd max pg log entries = 100000                     # Default 10000; maximum number of PG log entries retained
    osd mon heartbeat interval = 40                     # Default 30; interval (seconds) at which the OSD pings a monitor
    ms dispatch throttle bytes = 1048576000             # Default 104857600; maximum number of message bytes waiting to be dispatched
    objecter inflight ops = 819200                      # Default 1024; maximum in-flight I/O requests for client flow control; beyond this, application I/O blocks
    osd op log threshold = 50                           # Default 5; how many operations are logged at a time
    osd crush chooseleaf type = 0                       # Default 1; bucket type used when a CRUSH rule uses chooseleaf
    filestore xattr use omap = true                     # Default false; use an object map for XATTRs (needed for EXT4; not required for XFS or BTRFS)
    filestore min sync interval = 10                    # Default 0.1; minimum journal-to-data-disk sync interval (seconds)
    filestore max sync interval = 15                    # Default 5; maximum journal-to-data-disk sync interval (seconds)
    filestore queue max ops = 25000                     # Default 500; maximum operations accepted by the data disk queue
    filestore queue max bytes = 1048576000              # Default 100 << 20; maximum bytes per data disk operation
    filestore queue committing max ops = 50000          # Default 500; operations that can be committed on the data disk
    filestore queue committing max bytes = 10485760000  # Default 100 << 20; maximum bytes that can be committed on the data disk
    filestore split multiple = 8                        # Default 2; maximum number of files in a subdirectory before it is split into child directories
    filestore merge threshold = 40                      # Default 10; minimum number of files in a subdirectory before it is merged into the parent directory
    filestore fd cache size = 1024                      # Default 128; object file handle cache size
    filestore op threads = 32                           # Default 2; number of concurrent file system operation threads
    journal max write bytes = 1073714824                # Default 10485760; maximum bytes written to the journal at one time
    journal max write entries = 10000                   # Default 100; maximum entries written to the journal at one time
    journal queue max ops = 50000                       # Default 50; maximum operations in the journal queue at one time
    journal queue max bytes = 10485760000               # Default 33554432; maximum bytes in the journal queue at one time
    ############################################################
    [client]
    rbd cache = true                                    # Default true; enable the RBD cache
    rbd cache size = 335544320                          # Default 33554432; RBD cache size (bytes)
    rbd cache max dirty = 134217728                     # Default 25165824; maximum dirty bytes allowed in write-back mode; 0 means write-through
    rbd cache max dirty age = 30                        # Default 1; how long dirty data may stay in the cache before being flushed to disk (seconds)
    rbd cache writethrough until flush = false          # Default true; provided for compatibility with virtio drivers before linux-2.6.32 that never send
                                                        # flush requests; librbd stays in write-through mode until the first flush request arrives, then
                                                        # switches to write-back
    rbd cache max dirty object = 2                      # Default 0 (calculated from rbd cache size); librbd splits an image into 4 MB chunks, each mapped
                                                        # to an object, and manages the cache per object; increasing this value can improve performance
    rbd cache target dirty = 235544320                  # Default 16777216; amount of dirty data that triggers write-back; must not exceed rbd cache max dirty
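
    Most of the settings above take effect only after the daemons reload them. A common approach (assuming systemd-managed Ceph services) is to distribute ceph.conf to every node and then restart the daemons on each node:

    systemctl restart ceph-mon.target
    systemctl restart ceph-osd.target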

3. Tuning best practices

  1. MON advice

    The deployment of a Ceph cluster must be planned properly, and MON performance is critical to the overall performance of the cluster. MONs should normally run on dedicated nodes. To ensure a proper quorum, the number of MONs should be odd.

  2. OSD advice

    Each Ceph OSD has a journal. The OSD journal and data may reside on the same storage device. A write operation is considered complete only once it has been committed to the journals of all OSDs in the PG, so faster journal performance improves response times.

    In a typical deployment, OSDs use traditional mechanical disks with high latency. To maximize efficiency, Ceph recommends placing OSD journals on separate low-latency SSD or NVMe devices. Administrators must be careful not to place too many OSD journals on the same device, because that can create a performance bottleneck. The following SSD specifications should be considered:

    • Mean time between failures (MTBF) under the expected write load
    • IOPS capacity
    • Data transfer rate
    • Bus/SSD coupling capability

    Red Hat recommends placing at most six OSD journals on each SATA SSD device, or at most 12 OSD journals on each NVMe device.
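
    As a sketch of placing a filestore journal on a separate SSD/NVMe partition when creating an OSD with ceph-volume (available in Luminous and later; the device names are assumptions):

    ceph-volume lvm create --filestore --data /dev/sdb --journal /dev/nvme0n1p1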

  3. RBD advice

    Workloads on RBD block devices are usually I/O intensive, such as databases running on VMs in OpenStack. For RBD, OSD journals must be stored on SSD or NVMe devices. For the back-end storage devices, different service levels can be offered depending on the storage technology backing the OSD daemons, such as NVMe SSDs, SATA SSDs, and HDDs.

  4. Object Gateway advice

    Workloads on Ceph object gateways are typically throughput intensive; objects such as audio and video can be very large. However, the bucket index pool often shows a more I/O-intensive workload pattern, so administrators should store this pool on SSD devices.

    The Ceph Object Gateway maintains an index for each bucket, which Ceph stores in a RADOS object. As the number of objects in a bucket grows very large (beyond roughly 100,000), index performance degrades, because by default a single RADOS object handles all index operations.

    To address this, Ceph can spread a large index across multiple RADOS objects, or shards. Administrators can enable this feature by setting the rgw_override_bucket_index_max_shards configuration parameter in the ceph.conf configuration file. The recommended value for this parameter is the estimated number of objects in the bucket divided by 100,000.
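
    For example, a bucket expected to hold about one million objects would use 1,000,000 / 100,000 = 10 shards. A hypothetical ceph.conf snippet (the RGW instance name is an assumption):

    [client.rgw.gateway-node]
    rgw_override_bucket_index_max_shards = 10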

  5. CephFS advice

    Metadata pools that hold directory structures and other indexes can be a bottleneck for CephFS. You can use SSD devices for this pool.
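
    A hypothetical sketch of pinning the metadata pool to SSD-backed OSDs, assuming a CRUSH rule named ssd-rule that selects SSDs already exists and the pool is named cephfs_metadata (both names are assumptions):

    ceph osd pool set cephfs_metadata crush_rule ssd-rule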

    Each CephFS Metadata Server (MDS) maintains an in-memory cache for items such as inodes. Ceph limits the size of this cache with the mds_cache_memory_limit configuration parameter. Its value is expressed in absolute bytes, defaults to 1 GB, and can be tuned as needed.
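
    For example, to raise the MDS cache limit to 4 GB, the parameter can be set in the [mds] section of ceph.conf (the value is illustrative):

    [mds]
    mds_cache_memory_limit = 4294967296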


This article was created and shared by Mirson. For further communication, please join QQ group 19310171 or visit www.softart.cn