Know yourself and know your enemy, and you can win a hundred battles
Analyze the service I/O model
Understand the basic business storage model:
Maximum concurrency, maximum read/write bandwidth requirements.
- The maximum number of concurrent requests determines how many RGW instances you need; you cannot size this without first knowing how much concurrency a single RGW can sustain.
- The maximum read/write bandwidth determines how many OSD nodes you need to sustain it, and whether the uplink at the endpoint entrance can carry that much traffic. A rough sizing sketch follows.
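As a rough illustration of the sizing arithmetic, the sketch below divides the business figures by assumed per-instance capabilities. Every number here is a placeholder: the concurrency a single RGW sustains and the bandwidth one OSD node delivers must be measured on your own hardware.

```python
import math

# Business requirements (placeholder figures; substitute your own).
peak_concurrent_requests = 4000    # maximum concurrent client requests
peak_bandwidth_mb_s = 3000         # maximum aggregate read/write bandwidth, MB/s

# Per-instance capabilities (assumptions; benchmark these yourself).
concurrency_per_rgw = 500          # what one tuned RGW instance sustains
bandwidth_per_osd_node_mb_s = 400  # what one OSD node delivers

rgw_instances = math.ceil(peak_concurrent_requests / concurrency_per_rgw)
osd_nodes = math.ceil(peak_bandwidth_mb_s / bandwidth_per_osd_node_mb_s)

print(f"RGW instances needed: {rgw_instances}")
print(f"OSD nodes needed for bandwidth alone: {osd_nodes}")
print(f"Endpoint uplink must also carry {peak_bandwidth_mb_s} MB/s")
```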
Are the clients on the internal or the external network?
- Clients are mainly on the external network, which means that in an environment as complex as the public Internet, reads and writes can be affected by factors outside your control. This is why public clouds offering object storage are reluctant to commit to bandwidth, latency, or concurrency figures.
- Clients are mainly on the internal network, which means the number of routing hops between the endpoint and the clients should be minimized to preserve bandwidth.
Read/write ratio, average request size.
- If the read ratio is high, as in CDN scenarios, consider adding a read-cache component at the endpoint entrance.
- If the write ratio is high, as in data backup, control the ingress bandwidth so that multiple services do not compete for write bandwidth at peak times and degrade overall service quality.
- The average request size determines whether to optimize the object storage service for large files or for small files.
A veteran's tip: in most cases we do not know the service's exact I/O model up front. In that case, consider onboarding the service at a small scale first and feeding the access logs of the front-end endpoint (for example, nginx running as the front end) into a log analysis stack such as ELK; with ELK it is easy to work out the service's I/O model.
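Even before the logs are wired into ELK, a few lines of Python over an nginx access log already give a first cut at the read/write ratio and average request size. A minimal sketch, assuming the default nginx "combined" log format; note that `$body_bytes_sent` only counts bytes sent back to the client, so for upload-heavy services you would add `$request_length` to your `log_format` and parse that as well.

```python
import re
from collections import Counter

# Matches the default nginx "combined" log format.
LINE_RE = re.compile(
    r'\S+ \S+ \S+ \[.*?\] "(?P<method>\S+) (?P<uri>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

reads = Counter()
writes = Counter()
total_bytes = 0
requests = 0

with open("access.log") as f:
    for line in f:
        m = LINE_RE.match(line)
        if not m:
            continue
        requests += 1
        total_bytes += 0 if m["bytes"] == "-" else int(m["bytes"])
        if m["method"] in ("GET", "HEAD"):
            reads[m["method"]] += 1
        elif m["method"] in ("PUT", "POST", "DELETE"):
            writes[m["method"]] += 1

print(f"read:write = {sum(reads.values())}:{sum(writes.values())}")
if requests:
    print(f"average request size: {total_bytes / requests / 1024:.1f} KiB")
```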
Before the troops and horses move, food and fodder go first
Every business idea ultimately has to be built on a hardware platform. Early on, when the business is small, hardware selection may not get much attention; but as the business keeps growing, choosing a reliable, stable, and cost-effective hardware platform becomes especially important.
Hardware experience at small scale
A small-scale cluster means fewer than 20 machines or around 200 OSDs. To save money you can run the MON nodes on VMs, but the following conditions must be met:
* Run 3 MONs, 5 at most; more is a waste of money. Give each MON at least 2 cores and 4 GB of RAM.
* MONs must sit on different physical machines, and NTP time synchronization must be in place across them.
OSD nodes must meet the following requirements:
* Never put OSDs on VMs, no matter what. Day-to-day disk replacement and troubleshooting become a pain, and the extra layer in the VM I/O stack weakens data safety.
* If you only have gigabit networking, bond the NICs so that each OSD node gets roughly 40 to 60 MB/s of bandwidth; if 10 GbE is within reach, use it rather than bonded gigabit.
* Reserve at least 1 core and 2 GB of physical resources per OSD, otherwise memory will run out as soon as data recovery kicks in.
* If you do not care about performance, you can get away without SSDs for the journal.
* Put the index pool on SSDs if at all possible.
RGW nodes are much easier:
* RGW can run on VMs or even in Docker, because the availability and concurrency of the whole service improve simply by running more instances.
* Give each RGW at least 2 cores and 4 GB of RAM, because every request is buffered in memory before being written to RADOS.
* Two or so RGW service nodes are enough; put nginx in front as a reverse proxy to improve concurrency.
* Untuned Civetweb delivers much lower concurrency than FastCGI.
Hardware experience at medium scale
Medium scale means 20 to 40 servers or roughly 400 OSDs. At this stage the business has taken shape and MON nodes are no longer suitable for VMs. Note the following:
* Run 3 MONs, 5 at most, each with at least 4 cores and 8 GB of RAM. If possible, put the MON data on SSDs: the LevelDB store on the MON hits performance bottlenecks once the cluster grows to a certain size, especially during compaction.
* MONs must sit on different physical servers, spread across multiple racks if possible, but do not spread them across IP subnets.
OSD nodes must meet the following requirements:
* Design crushmap failure-domain isolation properly and use 3 replicas. Absolutely do not drop to 2 replicas to save money; when a batch of disks reaches end of life at the same time, that is a huge hidden danger.
* Do not put too many OSD disks on one physical node, and do not use overly large single disks. With an 8 TB SATA disk at 80% utilization or more, a disk failure means a very long backfill. You can throttle the backfill concurrency (see the sketch after this list), but that affects the service; make your own trade-off.
* Every OSD should have an SSD journal; at this scale there is no excuse to save money on SSDs.
* The index pool must be on SSDs; it is a quantum leap in performance.
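The usual knobs for the backfill trade-off mentioned above are osd_max_backfills and osd_recovery_max_active, which can be injected into running OSDs. A minimal sketch, assuming the ceph CLI is available on an admin node and that runtime-injected values (lost on daemon restart) are acceptable; the numbers are starting points, not recommendations.

```python
import subprocess

def inject_osd_args(args: str) -> None:
    """Push runtime settings to all OSD daemons (lost on restart)."""
    subprocess.check_call(["ceph", "tell", "osd.*", "injectargs", args])

def throttle_backfill() -> None:
    # Business hours: keep recovery traffic in the background.
    inject_osd_args("--osd_max_backfills 1 --osd_recovery_max_active 1")

def speed_up_backfill() -> None:
    # Off-peak: allow more recovery parallelism; tune to your hardware.
    inject_osd_args("--osd_max_backfills 4 --osd_recovery_max_active 8")

if __name__ == "__main__":
    throttle_backfill()
```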
RGW nodes are much easier:
* RGW can still run on VMs or even in Docker.
* The front-end entrance must have a load-balancing layer such as LVS or an nginx reverse proxy; high availability and load balancing are mandatory.
* Set rgw_override_bucket_index_max_shards according to the number of SSDs and the failure-domain design to tune bucket index performance (a sizing sketch follows this list).
* You can deploy one RGW service on each OSD node, provided the node has enough CPU and memory.
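For the bucket index shard count, a common rule of thumb (not an official guarantee) is to keep each index shard below roughly 100,000 objects. A small sketch of the arithmetic; the expected object count per bucket is an assumption you have to supply yourself.

```python
import math

OBJECTS_PER_SHARD = 100_000  # rule-of-thumb ceiling per bucket index shard

def recommended_shards(expected_objects_per_bucket: int) -> int:
    return max(1, math.ceil(expected_objects_per_bucket / OBJECTS_PER_SHARD))

# Example: buckets expected to hold ~5 million objects each.
shards = recommended_shards(5_000_000)
print(f"rgw_override_bucket_index_max_shards = {shards}")

# Put the value under the RGW sections (client.rgw.<name>) of ceph.conf and
# restart the RGWs; it only affects buckets created afterwards. Existing
# buckets need resharding (radosgw-admin bucket reshard) on releases that
# support it.
```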
Hardware experience at medium-to-large scale
A medium-to-large cluster means up to about 50 machines or roughly 500 OSDs. At this stage the business has reached real scale; do not even think about running the MON nodes on VMs.
* Run 3 MONs, 5 at most, with the MON data on SSDs.
* MONs must sit on different physical machines and should be spread across racks, but not across IP subnets; 10 GbE interconnect is best.
* 8 cores and 16 GB of RAM per MON is basically enough.
OSD nodes must meet the following requirements:
* Crushmap failure-domain isolation must be designed up front; 3 replicas are recommended. As for erasure coding, it depends on whether you can handle EC data recovery after a whole batch of servers loses power.
* OSD disks must be uniform. Do not mix 4 TB and 8 TB drives, which makes weight management a headache. Also tune the PG distribution across OSDs so that performance pressure and capacity are not unevenly spread (a quick check follows this list); PG distribution tuning is covered later.
* SSD journals and an SSD-backed index pool are mandatory.
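As a quick first check on PG balance, the sketch below parses the JSON output of `ceph osd df` and flags OSDs whose PG count deviates noticeably from the average. The field names ("nodes", "pgs", "utilization") match recent releases, but verify them against your own version's output.

```python
import json
import subprocess

def osd_df():
    out = subprocess.check_output(["ceph", "osd", "df", "--format", "json"])
    return json.loads(out)["nodes"]

def report_skew(threshold=0.15):
    """Print OSDs whose PG count is more than `threshold` away from the mean."""
    nodes = [n for n in osd_df() if n.get("pgs")]
    avg_pgs = sum(n["pgs"] for n in nodes) / len(nodes)
    for n in nodes:
        skew = (n["pgs"] - avg_pgs) / avg_pgs
        if abs(skew) > threshold:
            print(f'{n["name"]}: {n["pgs"]} PGs '
                  f'({skew:+.0%} vs average), util {n["utilization"]:.1f}%')

if __name__ == "__main__":
    report_skew()
```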
RGW nodes are much easier:
* Run RGW on honest physical machines at this scale; the money saved today will one day be paid back in tears as tuition.
* The front-end load-balancing layer can take another round of optimization: more front-end bandwidth and concurrency, a read cache, even DPDK.
* Using Civetweb instead of FastCGI greatly simplifies deployment, and combined with Docker it allows fast, elastic scaling of concurrency.
* RGWs can be consolidated on a few physical nodes or co-located with the OSDs.
Preparation brings success; lack of preparation brings failure
Disks have a price, but data is priceless, so treat the storage system that handles that data with a bit more respect. Be careful with high-risk operations, and do not pick up the habit in the test environment of casually restarting services or deleting data; after go-live, one such reflexive action can haunt you for the rest of your life. Also, be sure to do all kinds of testing before going live, otherwise you may find there is not enough time once problems surface in production. The following tests are mandatory.
- Failure and recovery testing: run reads and writes with Cosbench while you exercise the crushmap failure-domain design and the operators' basic skills.
- Performance stress testing: prepare a dedicated client, and never co-locate the client with the servers. Keep all stress-test traffic on an isolated network so it cannot affect the production environment.
- NTP deserves a separate mention. Check the clocks on all nodes before go-live; raising mon_clock_drift_allowed only hides the symptom, it does not cure the cause. After any hardware maintenance (powering down to replace a motherboard, memory, CPU, or RAID card), verify that the clock is correct before putting the node back into service, or you are in for plenty of trouble.
- Feature coverage testing depends on your QA capability. Frankly, neither the official test cases nor Cosbench will cover everything you expect, so do it yourself. If possible, prepare test suites around the SDKs for the languages your users need; you never know when you will need them. Make sure you have an API compatibility list before go-live, so you are not checking interface usability at the last minute (a minimal smoke-test sketch follows).
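As a starting point for that API compatibility list, here is a minimal S3 smoke test with boto3. The endpoint URL, credentials, and bucket name are placeholders for a test RGW user; grow this into your own per-SDK, per-interface compatibility matrix.

```python
import boto3

# Placeholder endpoint and credentials: point these at a test RGW user.
s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.com:7480",
    aws_access_key_id="TESTACCESSKEY",
    aws_secret_access_key="TESTSECRETKEY",
)

bucket = "compat-check"
s3.create_bucket(Bucket=bucket)
s3.put_object(Bucket=bucket, Key="hello.txt", Body=b"hello rgw")
body = s3.get_object(Bucket=bucket, Key="hello.txt")["Body"].read()
assert body == b"hello rgw"

# Exercise a couple of the calls that tend to differ between releases.
s3.put_object_acl(Bucket=bucket, Key="hello.txt", ACL="public-read")
print([o["Key"] for o in s3.list_objects_v2(Bucket=bucket).get("Contents", [])])

s3.delete_object(Bucket=bucket, Key="hello.txt")
s3.delete_bucket(Bucket=bucket)
print("basic S3 calls OK")
```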
Devise strategies in the tent, decide victory a thousand miles away
After the system goes live, making day-to-day operations painless takes real skill. Good operations is seven parts tooling and three parts experience. With so many operations tools available, blindly writing scripts and reinventing wheels is clearly unwise; besides, at larger scale you simply cannot manage everything with scripts, and script-based management gets more and more expensive as staff turn over. I therefore recommend an operations architecture suited to small and medium-sized teams, as shown in the figure below.
Deployment tools
Puppet's Ruby syntax is not very friendly to operators, it needs a separate client deployed on every node, and it is heavy to maintain; for operating Ceph it feels like using an ox cleaver to kill a chicken. SaltStack's Python syntax suits operators much better, but the Ceph Calamari team and SaltStack kept falling out over version compatibility, and the Calamari project ended up going nowhere. Ansible is also Ceph's official deployment tool (ceph-ansible); it is agentless over SSH, much like ceph-deploy, but far better suited to engineering practice. So standardizing on Ansible is the way to go.
Log collection and management
ELK is the natural choice here. If there is anything to complain about, it is that learning grok regular expressions is a bit painful, but you gradually get used to it. Once the MON/OSD/RGW/MDS logs are flowing into ELK, the rest is about accumulating operational experience: get familiar with Ceph's logs, keep refining the trigger conditions for the various anomalies and alarms, and aggregate the common hardware failure logs from disks and RAID cards. With that in place you can usually diagnose an OSD disk failure from the logs alone, instead of staring at a downed OSD with no idea what happened; all that is left is to ask the machine room to swap the disk.
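The exact wording of Ceph log lines varies between releases, so treat the patterns in the sketch below as placeholders to be replaced with lines harvested from your own ceph.log. The point is the workflow of turning recurring log lines into alert conditions; the same expressions translate directly into grok patterns on the Logstash side.

```python
import re

# Placeholder patterns: harvest the real wording from your own ceph.log and
# hardware logs, then refine over time. Each maps a regex to an alert tag.
ALERT_PATTERNS = {
    "osd_down":   re.compile(r"osd\.(?P<id>\d+).*\b(down|failed)\b"),
    "slow_ops":   re.compile(r"slow (request|ops)", re.IGNORECASE),
    "disk_error": re.compile(r"(medium error|I/O error|SMART)", re.IGNORECASE),
}

def scan(path="/var/log/ceph/ceph.log"):
    alerts = []
    with open(path, errors="replace") as f:
        for line in f:
            for tag, pattern in ALERT_PATTERNS.items():
                if pattern.search(line):
                    alerts.append((tag, line.strip()))
    return alerts

if __name__ == "__main__":
    for tag, line in scan():
        print(f"[{tag}] {line}")
```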
Asynchronous task scheduling
Why is asynchronous task scheduling needed? Because even when the ELK front end has diagnosed the cause of a failure, someone still has to log in to the affected machine and deal with it. After introducing Celery, a distributed task-scheduling middleware, operators encode the corresponding fault-handling steps in Ansible playbooks. For example, when ELK spots a disk failure, it triggers the operators' playbook, which takes the corresponding OSD out, unmounts the disk, and uses tools such as MegaCli to light the disk's fault LED; finally an email notifies XX that the machine at rack XX in machine room XX with IP XX needs its disk replaced. The rest is waiting for the machine room to swap the disk and for the data to recover.
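A minimal sketch of that flow with Celery: the broker URL and the playbook name are placeholders, and the playbook itself is where the actual osd-out / umount / MegaCli steps live.

```python
import subprocess

from celery import Celery

# Placeholder broker; use whatever Redis/RabbitMQ instance you run.
app = Celery("ops", broker="redis://127.0.0.1:6379/0")

@app.task
def handle_failed_disk(host: str, osd_id: int) -> str:
    """Run the operators' playbook against the host that ELK flagged."""
    cmd = [
        "ansible-playbook", "replace_disk.yml",  # hypothetical playbook name
        "-l", host,
        "-e", f"osd_id={osd_id}",
    ]
    subprocess.check_call(cmd)
    return f"playbook finished for osd.{osd_id} on {host}"
```

Whatever watches ELK (an alerting webhook, a small consumer script) then only needs to call `handle_failed_disk.delay(host, osd_id)`; a Celery worker on the operations box picks the task up and runs the playbook.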
Alerting
WeChat and similar messaging tools make internal communication much easier, especially when you are off duty: triggering an Ansible playbook through a chat bot to handle a fault is when operations actually becomes fun. The machine-room staff replace the disk and message the WeChat bot; the bot creates an OSD recovery task and sends the corresponding execution request to the operations team; the on-call engineer only has to confirm the operation with the bot. Anything that saves legwork while fixing a failure is worth having.
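A rough sketch of that loop with wxpy, which drives a personal WeChat account through the web login, so it may not work for every account and is shown purely as an illustration. The group name and message format are made up, `handle_failed_disk` is the Celery task sketched above, and the explicit human confirmation step is omitted for brevity.

```python
from wxpy import Bot, TEXT

from tasks import handle_failed_disk  # the Celery task from the previous sketch

bot = Bot(console_qr=True)                       # scan the QR code to log in
ops_group = bot.groups().search("ceph-ops")[0]   # hypothetical ops group name

@bot.register(ops_group, TEXT)
def on_disk_replaced(msg):
    # Expected message from the machine-room staff, e.g.:
    #   replaced host=node12 osd=37
    if not msg.text.startswith("replaced"):
        return
    fields = dict(part.split("=") for part in msg.text.split()[1:])
    handle_failed_disk.delay(fields["host"], int(fields["osd"]))
    msg.reply(f"recovery task queued for osd.{fields['osd']} on {fields['host']}")

bot.join()                                       # keep the bot running
```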
Finally, the recommended tools are attached:
- github.com/ceph/ceph-a…
- www.elastic.co/cn/products
- docs.celeryproject.org/en/latest/i…
- wxpy.readthedocs.io/…/latest/i…
- ansible-tran.readthedocs.io/en/latest/