Event review
GitLab.com first published a review of the whole event in a public Google Doc, and later posted a blog entry explaining what happened. Here, I briefly review the course of the event.
First of all, an engineer referred to as YP was doing some load-balancing work on GitLab's online database. While he was doing this, an unexpected situation arose: GitLab was hit by a DDoS-style attack and database load soared. The db2.staging replica had fallen 4GB behind the production database. While fixing db2.staging's replication problem, YP ran into one issue after another between db2.staging and the primary, and by then it was very late at night. After trying several different approaches he still could not get db2.staging to catch up (replication kept hanging), so he decided to wipe db2.staging's data directory and start replication from scratch. Unfortunately, the delete command was mistakenly run against db1.cluster, and the entire production database was deleted. (Chen Hao's note: this failure was basically "working too long" plus "getting lost switching between too many terminal windows".)
During the restoration they found that only the db1.staging database could be used for recovery; the other five backup mechanisms were all unusable. The first, database replication, did not replicate webhooks; the second, disk snapshots, did not cover the database; the third, pg_dump backups, silently produced no data because of a version mismatch (a 9.2 client dumping a 9.6 database); the fourth, S3 backups, contained no backups at all; and the fifth, the backup procedures themselves, were a patchwork of scripts with only a few sketchy documents, in other words, not only manual but effectively unusable. (Note: even if these backup mechanisms had worked, there would still be a problem, because most of them ran only once every 24 hours, so restoring from them would still lose data; only the first, real-time database replication, would have been up to date.)
In the end, GitLab could only copy the data back from db1.staging, which was 6 hours old. The copy was very slow, about 60 Mbit/s, because that machine had poor performance, and it took a long time. The data has now been restored, but since it came from a snapshot taken 6 hours earlier, the following data was lost:
- As a rough estimate, 4,613 regular projects, 74 forks, and 350 imports were lost; however, because the Git repositories themselves were still there, those database records could be reconstructed from the repositories, but things like issues were completely lost.
- Around 4,979 commit records were lost.
- According to Kibana's logs, perhaps 707 users were lost.
- Webhooks created after 17:20 on January 31 were lost.
Since GitLab made the details of the whole incident public, a lot of help also came from outside. Simon Riggs, CTO of 2ndQuadrant, published a post on his blog, "Dataloss at GitLab", which offered some very good advice:
- There may be a bug in PostgreSQL 9.6 that causes data synchronization to hang; it is being fixed.
- A 4GB replication lag in PostgreSQL is within the normal range and is not itself a problem.
- Stopping a slave cleanly normally causes the master to release its WALSender connections, so the master's max_wal_senders parameter should not need to be raised. However, when a slave is stopped, its replication connections on the master are not released immediately, and a freshly started slave consumes additional connections. In his view GitLab's setting of 32 is too high; two to four would generally be enough. (See the monitoring sketch after this list.)
- Likewise, max_connections=8000 in GitLab's earlier configuration was too high; dropping it to 2000 now makes sense.
- pg_basebackup first creates a checkpoint on the primary and only then starts synchronizing, a step that can take around four minutes.
- Deleting a database directory by hand is a dangerous operation and should be left to a program; the recently released repmgr is recommended.
- Restoring from backups is equally important and should also be handled by a proper tool; Barman (which supports S3) is recommended.
- Testing backup and restore is an important process.
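To make the replication-related advice above a little more concrete, here is a minimal monitoring sketch, not taken from GitLab's or 2ndQuadrant's tooling; the connection string and the lag threshold are assumptions for illustration. It queries a PostgreSQL 9.6 primary for per-standby replication lag and for connection head-room, the two signals most relevant to this incident:

```python
# Minimal sketch: check replication lag and connection usage on a
# PostgreSQL 9.6 primary. The DSN and the threshold are illustrative
# assumptions, not GitLab's actual configuration.
import psycopg2

DSN = "host=db1.example.com dbname=postgres user=monitor"  # hypothetical
LAG_WARN_BYTES = 4 * 1024**3   # warn at 4GB of replication lag

def check_primary(dsn=DSN):
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            # Per-standby replication lag (9.6 still uses the *_xlog_* names).
            cur.execute("""
                SELECT application_name,
                       state,
                       pg_xlog_location_diff(pg_current_xlog_location(),
                                             replay_location) AS lag_bytes
                FROM pg_stat_replication
            """)
            for name, state, lag in cur.fetchall():
                flag = "WARN" if lag and lag > LAG_WARN_BYTES else "ok"
                print(f"standby={name} state={state} lag={lag} bytes [{flag}]")

            # Connection head-room: how close are we to max_connections?
            cur.execute("SHOW max_connections")
            max_conn = int(cur.fetchone()[0])
            cur.execute("SELECT count(*) FROM pg_stat_activity")
            used = cur.fetchone()[0]
            print(f"connections: {used}/{max_conn} in use")
    finally:
        conn.close()

if __name__ == "__main__":
    check_primary()
```

Run against the primary, something like this would have surfaced both the 4GB lag and the connection pressure long before anyone had to improvise at 11 p.m.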
Reading between the lines, part of the reason may be that GitLab's engineers are not very familiar with PostgreSQL.
Subsequently, GitLab also opened a series of issues on its own site; the list of issues from the post-mortem is below (it may keep being updated):
- infrastructure#1094 - Update PS1 across all hosts to more clearly differentiate between hosts and environments
- infrastructure#1095 - Prometheus monitoring for backups
- infrastructure#1096 - Set PostgreSQL's max_connections to a sane value
- infrastructure#1097 - Investigate Point in time recovery & continuous archiving for PostgreSQL
- Hourly LVM snapshots of the production database
- infrastructure#1099 - Snapshots of production databases
- infrastructure#1100 - Move staging to the ARM environment
- infrastructure#1101 - Recover production replica(s)
- infrastructure#1102 - Automated testing of recovering PostgreSQL database backups
- infrastructure#1103 - Improve PostgreSQL replication documentation/runbooks
- infrastructure#1104 - Kick out SSH users inactive for N minutes
- infrastructure#1105 - Investigate pgbarman for creating PostgreSQL backups
From the list above we can see the improvements they are making. That is fine, but in my view it is still not enough.
Related thoughts
I have done this kind of thing myself (deleted a database by mistake, gotten lost among too many terminal windows...), and I have seen it happen once at Amazon and at least four times at Alibaba (misoperation accidents caused by manual ops were the ones I saw most often at Alibaba). I cannot share those stories publicly here, only privately. So here I just want to share my experience and understanding, from both the technical and the non-technical side.
Technical aspects
Manual operations
I have always considered logging into production and typing commands by hand to be a very bad habit. In my opinion, the strength of a company's operations capability is reflected in how it treats its production environment: the more you like to type commands directly on production machines, the weaker your operations capability; the more you handle problems through automation, the stronger it is. The reasons are as follows:
First, if any change to code counts as a release, then any change to the production environment (hardware, operating system, network, software configuration...) is also a release. Such a release should go through a release system and a release process, be well tested, and have a rollout and rollback plan. The key point is that a release process can be recorded, tracked, and replayed, whereas commands typed directly on production machines are completely untraceable; nobody knows what commands you typed.
Second, a truly healthy operations capability means that people run the code and the code runs the machines, not that people run the machines directly. Nobody knows what commands you typed by hand, but if you make the change to the online system with a tool you wrote, anyone can see exactly what that tool did by reading its source.
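As a toy illustration of "the code runs the machine", here is a minimal sketch of a change script that records every action it performs; the commands, the service name, and the log path are hypothetical, not anything GitLab actually ran:

```python
# Minimal sketch of an auditable change script: every step is declared
# up front, logged with a timestamp, and executed by code rather than
# typed by hand. Commands, service name, and log path are hypothetical.
import datetime
import subprocess

AUDIT_LOG = "/var/log/change-20170201-lb-rebalance.log"   # hypothetical
STEPS = [
    ["systemctl", "reload", "pgbouncer"],                  # hypothetical step
    ["systemctl", "status", "pgbouncer", "--no-pager"],
]

def run_change(steps=STEPS, audit_log=AUDIT_LOG):
    with open(audit_log, "a") as log:
        for cmd in steps:
            stamp = datetime.datetime.utcnow().isoformat()
            log.write(f"{stamp} RUN {' '.join(cmd)}\n")
            result = subprocess.run(cmd, capture_output=True, text=True)
            log.write(f"{stamp} EXIT {result.returncode}\n")
            log.write(result.stdout)
            log.write(result.stderr)
            if result.returncode != 0:
                log.write(f"{stamp} ABORT remaining steps\n")
                break   # stop the change instead of improvising by hand

if __name__ == "__main__":
    run_change()
```

The specifics do not matter; what matters is the property this gives you: the change can be reviewed before it runs, and every step it took is on record afterwards.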
In addition, some people say we should use mv instead of rm from now on; some say that for this kind of work one person should type while another watches; some say we should have a checklist and a mandatory process for making changes online. In my opinion, although these measures can help, they are still not good, for the following reasons:
First, if solving a problem requires adding more people, you have made it labor-intensive. Technology today is about eliminating the need for labor, not adding to it. As engineers, the right way to solve problems is with technology, not with more manpower. What separates humans from animals is that we invent and use tools rather than throw more muscle at a problem. And it is not just that people have all kinds of problems (fatigue, moods, impatience, impulsiveness...) while a machine is single-minded and tireless; it is also that a machine works many times faster and more efficiently than a human.
Second, adding a permissions system or yet another watchdog system is completely backward. Who maintains and approves the permissions in the permissions system? It is not only that the extra system needs extra maintenance; it is that it does not solve the problem at the root. Beyond creating jobs, it brings no benefit: failures will still happen, because the person who holds the permission can still make the wrong operation. For GitLab, as the CTO of 2ndQuadrant suggested, what is needed is an automated backup-and-restore tool, not a permissions system.
Third, things like using mv instead of rm, checklists, and heavier processes are even worse. The logic is simple: 1) these rules have to be learned and remembered by people; in essence you do not trust people, so you invent rules and processes whose execution in turn depends on those same people, which defeats the purpose; 2) moreover, what is written on paper cannot be executed, only programs can be executed, so why not turn the checklists and processes into code? (You might say that programs make mistakes too. Yes, but a program's mistakes are consistent, while people's mistakes are not.)
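As a small sketch of "turning the checklist into code" (the expected hostname, the data directory, and the guarded command are assumptions for illustration, not GitLab's actual procedure), a destructive operation can simply refuse to run unless the machine it is on is the one the operator intended:

```python
# Minimal sketch: a checklist encoded as code. Before running a
# destructive command, verify we are on the host the operator intended
# and require the operator to retype that hostname. The hostname and
# data directory below are illustrative assumptions.
import socket
import subprocess
import sys

def guarded_wipe(expected_host, data_dir):
    actual_host = socket.gethostname()
    # Checklist item 1: are we on the machine we think we are on?
    if actual_host != expected_host:
        sys.exit(f"REFUSING: running on {actual_host}, expected {expected_host}")
    # Checklist item 2: make the operator retype the hostname.
    typed = input(f"Type the hostname to confirm wiping {data_dir}: ")
    if typed.strip() != expected_host:
        sys.exit("REFUSING: confirmation did not match")
    subprocess.run(["rm", "-rf", data_dir], check=True)

if __name__ == "__main__":
    # e.g. guarded_wipe("db2.staging.example.com", "/path/to/pgdata")
    guarded_wipe(sys.argv[1], sys.argv[2])
```

The same idea scales up to full checklists: each item becomes a check the program enforces, instead of a line someone tired at midnight may forget to read.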
Most crucially, data can be lost in all kinds of situations, not just through operator error: a power failure, a virus, a damaged disk, and so on. In those cases all the processes, rules, manual steps, permissions systems, and checklists you designed are useless. What do you do then? Yes, you will find that you have no choice but to use better technology to design a highly available system. There is no other way.
About backups
A system needs data backups, but as the GitLab case shows, even if all the backups had been usable, data loss or other serious problems would still have been likely. The reasons are as follows:
1) Backups are usually periodic, so if your data is lost and you restore from the most recent backup, everything between that backup and the failure is gone.
2) The backup may be incompatible with the current version. If you changed the data schema or otherwise adjusted the data between the last backup and the failure, the backup will not match your online application. (See the version-check sketch after this list.)
3) Some companies and banks have disaster-recovery data centers, but those centers have never gone live for a single day. When a real disaster comes and you need them, you find all kinds of problems keep them from going live. You can read the report from a few years ago on the Ningxia Bank system outage in July to get a good sense of this.
Therefore, when disaster strikes you may find that even a carefully designed "backup system" or "disaster-recovery system" still loses data when it actually runs, and a backup system that has not been exercised for a long time is hard to recover from (for example, the applications, tools, and data versions are no longer compatible).
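Version mismatch is exactly what silently broke GitLab's pg_dump backups (a 9.2 client dumping a 9.6 server). A minimal sketch of a pre-backup guard against that, assuming a reachable server and a pg_dump binary on the PATH (the DSN is an illustrative assumption, and this is not GitLab's tooling), might look like this:

```python
# Minimal sketch: before taking a pg_dump backup, compare the pg_dump
# client's major version with the server's major version and refuse to
# proceed on a mismatch (the failure mode in the GitLab incident).
import re
import subprocess
import sys

import psycopg2

DSN = "host=db1.example.com dbname=gitlabhq_production user=backup"  # hypothetical

def major_version(version_string):
    # "9.6.1" -> "9.6", "10.3" -> "10"
    match = re.match(r"(\d+)\.(\d+)", version_string)
    first, second = match.group(1), match.group(2)
    return f"{first}.{second}" if int(first) < 10 else first

def check_dump_version(dsn=DSN):
    out = subprocess.run(["pg_dump", "--version"],
                         capture_output=True, text=True, check=True).stdout
    client = major_version(out.split()[-1])

    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW server_version")
            server = major_version(cur.fetchone()[0])
    finally:
        conn.close()

    if client != server:
        sys.exit(f"REFUSING to back up: pg_dump {client} vs server {server}")
    print(f"pg_dump {client} matches server {server}; safe to proceed")

if __name__ == "__main__":
    check_dump_version()
```

GitLab's own issue infrastructure#1102 (automated testing of backup recovery) is the fuller version of the same idea: the only backup you can trust is one you regularly restore.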
I wrote earlier about transaction handling in distributed systems. Do you remember the diagram from that article? Look at the Data Loss row: Backups, Master/Slave, and Master/Master all lose data.
So if you want your backups to be usable at any time, they need to be live all the time, and a live multi-node system is essentially a distributed, highly available system. Data can be lost for many reasons (power failure, disk damage, virus infection, and so on), whereas processes, rules, manual checks, permission systems, and checklists only guard against misoperation. So you have to use better technology to design a highly available system; there is no other way. (An important point worth repeating.)
You can also refer to my other article on high-availability systems, which uses MySQL as the example; with replication-based backup alone you can reach only about two nines.
AWS S3 offers four nines of availability plus eleven nines of durability (AWS describes 11 nines of durability as: if you store 10,000 objects, on average you lose one every 10 million years). That means it is designed to survive not just a failed disk, a failed machine, or a whole data center going down, but the concurrent loss of data in two facilities, and still keep your data. Imagine: if technology makes your data this durable, would you still be afraid of data being mistakenly deleted on one node?
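As a quick sanity check of that figure (reading the eleven nines as an annual per-object durability of 99.999999999%, which is an interpretation for illustration rather than AWS's formal definition): 10,000 objects × (1 − 0.99999999999) = 10⁴ × 10⁻¹¹ = 10⁻⁷ expected losses per year, i.e. roughly one object every 10 million years, which matches the quoted description.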
Non-technical aspects
Reflecting on failures
Generally speaking, failures call for reflection. At Amazon, any failure of severity S2 or above requires a COE (Correction of Errors) write-up, one part of which is to Ask 5 Whys. I noticed that the first paragraph of GitLab's post-mortem blog also says they will write up the 5 Whys. Asking 5 Whys is not Amazon's invention; it is a common practice in the industry: keep asking "why" until you reach the root cause of the problem, which forces everyone involved to learn and dig into quite a few things. The Wikipedia entry for 5 Whys lists 14 rules:
- You need to find the right team to do this troubleshooting.
- Use paper or a whiteboard instead of a computer.
- Write down the whole process to make sure everyone can understand it.
- Distinguish between cause and symptom.
- Pay special attention to cause and effect.
- Indicate the Root Cause and related evidence.
- The answers to each "why" need to be precise.
- Work toward the root cause step by step rather than jumping straight to a conclusion.
- Be based on objective facts, data and knowledge.
- Evaluate the process, not the person.
- Don’t blame “human error” or “inattention to work” as the root of the problem.
- Foster a climate and culture of trust and sincerity.
- Keep asking “why” until you get to the root of the problem. This ensures that you don’t fall into the same hole twice.
- When you give the “why” answer, you should answer from the user’s point of view.
Engineering culture
In fact, I have talked about all of these points many times on my blog; you can refer to "What is Engineering Culture?" and "Development Team Effectiveness". The truth is, if you are a technology company, you should trust technology more than management. If you believe in technology, you will solve problems with technology; if you believe in management, you will have only rules, processes, and values to solve problems with.
The logic is simple: data can be lost in all kinds of situations, not just through operator error, for example power failure, a virus, a damaged disk, and in those cases all the processes, rules, manual steps, permissions systems, and checklists you designed do not work. What do you do then? Yes, you will find that you have to use better technology to design a highly available system. There is no other way. (Important things deserve to be said three times.)
Handling incidents publicly
Many companies follow basically the same playbook: first cover it up; if it cannot be covered up, lie; if lying will not work, fall back on glossing over the facts, playing down the serious parts, and changing the subject. The best way to face a crisis, however, is "more sincerity, fewer tricks", and the best practice of "more sincerity" is to make all the information transparent and open. GitLab set a very good example here, and AWS likewise publishes all of its outages and their details.
What was done was wrong, but releasing all the details leaves less room for speculation, takes the wind out of rumors and smear PR, and earns more public understanding and support. Look at how GitLab live-streamed the entire recovery process on YouTube; you can also see on their blog how positively people responded to this transparency and openness.