The upstart that claimed to have grown too fast for its cloud to keep up now finds that all five of its backup and replication mechanisms have failed.
Source-code hosting service GitLab.com has gone offline after losing production data, a loss made worse by the sudden discovery that its backups were effectively useless.
On Tuesday night, Pacific time, the startup sent out a series of alarming tweets, which we’ve listed below. Behind the scenes, a tired system administrator, working late into the night in the Netherlands, accidentally deleted a directory on the wrong server while wrestling with a frustrating database replication problem: he wiped a folder containing some 300GB of live production data that had not yet been fully replicated.
By the time he cancelled the rm -rf command, a measly 4.5GB of data was left. The most recent potentially usable backup had been taken roughly six hours before the deletion.
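Deleting the data directory of what you believe is a broken secondary is a routine step when re-seeding PostgreSQL replication; the catastrophe is running the same command on the primary. Here is a minimal sketch of the kind of guard that prevents exactly that, assuming Python with psycopg2 and an example data-directory path, and not anything GitLab actually runs:

    # Illustrative guard, not GitLab's actual tooling: refuse to wipe a PostgreSQL
    # data directory unless the local server really is a standby. Assumes psycopg2
    # is installed; the data-directory path below is only an example.
    import shutil
    import sys

    import psycopg2

    DATA_DIR = "/var/opt/gitlab/postgresql/data"  # example path

    def is_standby(dsn: str = "dbname=postgres") -> bool:
        """True only if this PostgreSQL instance is running in recovery, i.e. a replica."""
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute("SELECT pg_is_in_recovery()")
            return cur.fetchone()[0]

    if __name__ == "__main__":
        if not is_standby():
            sys.exit(f"Refusing to delete {DATA_DIR}: this host is the primary.")
        shutil.rmtree(DATA_DIR)
        print("Standby data directory removed; safe to re-seed replication.")

pg_is_in_recovery() returns true only on a standby, so a script like this bails out on the primary rather than repeat Tuesday night’s mistake.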
We are performing emergency database maintenance and https://t.co/r11UmmDLDE will be offline.
— GitLab.com status (@gitlabStatus) January 31, 2017
We are experiencing problems with our production database and are working to restore it.
— GitLab.com status (@gitlabStatus) February 1, 2017
We accidentally deleted production data and may have to restore from backup. Google Doc with live notes https://t.co/EVRbHzYlk8
— GitLab.com status (@gitlabStatus) February 1, 2017
“This incident affects the database (including issues and merge requests), but not the Git repositories (repos and wikis),” explains the Google Doc linked in that final tweet.
That will be of some comfort to users: not all of the data is gone. But the document also contains this admission:
So in other words, out of the five backup/replication techniques deployed, none are working reliably or were set up in the first place.
Once that admission got out, the internet erupted. The startup bluntly itemizes its mistakes as follows:
- LVM snapshots are, by default, taken only once every 24 hours; YP happened to run one manually about six hours before the outage.
- Regular backups also seem to be taken only once every 24 hours, though YP has not yet been able to figure out where they are stored. According to JN, these do not appear to be working, producing files only a few bytes in size.
- SH: pg_dump appears to be failing because the PostgreSQL 9.2 binaries are being run instead of the 9.6 binaries. This happens because Omnibus only uses Pg 9.6 if data/PG_VERSION is set to 9.6, and that file does not exist on the workers; as a result it defaults to 9.2 and fails silently, so no SQL dumps were made. The Fog gem may also have cleaned out older backups. (A sanity-check sketch appears after this list.)
- Disk snapshots in Azure are enabled for the NFS server, but not for the database servers.
- The synchronization process removes webhooks once it has synchronized data to staging. Unless we can pull these from a regular backup taken within the past 24 hours, they will be lost.
- The replication procedure is fragile, prone to error, relies on a handful of ad-hoc shell scripts, and is poorly documented.
- Our backups to S3 apparently do not work either: the bucket is empty.
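The pg_dump failure and the empty backups both come down to checks nobody ran: nothing confirmed the right binary was being used, and nothing noticed the output was a few bytes long. Below is a hedged sketch of such a sanity check, again in Python, with the paths, database name and size threshold all assumed for illustration rather than taken from GitLab’s setup:

    # Illustrative backup sanity check, not GitLab's tooling: confirm the pg_dump
    # binary matches the cluster's major version, run the dump, and refuse to call
    # a suspiciously small file a backup. Paths, database name, and threshold are
    # assumptions for the sake of the example.
    import pathlib
    import subprocess
    import sys

    DATA_DIR = pathlib.Path("/var/opt/gitlab/postgresql/data")   # example path
    DUMP_PATH = pathlib.Path("/var/opt/gitlab/backups/db.dump")  # example path
    DB_NAME = "gitlabhq_production"                              # example name
    MIN_DUMP_BYTES = 1024 * 1024  # anything under 1MB is almost certainly broken

    def cluster_version() -> str:
        """Major version recorded by initdb, e.g. '9.6'."""
        return (DATA_DIR / "PG_VERSION").read_text().strip()

    def pg_dump_version() -> str:
        """Full version of whichever pg_dump is first on PATH, e.g. '9.6.1'."""
        out = subprocess.run(["pg_dump", "--version"],
                             capture_output=True, text=True, check=True).stdout
        return out.split()[-1]  # "pg_dump (PostgreSQL) 9.6.1" -> "9.6.1"

    def main() -> None:
        want, have = cluster_version(), pg_dump_version()
        if not have.startswith(want):
            sys.exit(f"pg_dump is {have} but the cluster is {want}; aborting.")
        subprocess.run(["pg_dump", "--format=custom", "--file", str(DUMP_PATH), DB_NAME],
                       check=True)
        if not DUMP_PATH.exists() or DUMP_PATH.stat().st_size < MIN_DUMP_BYTES:
            sys.exit(f"{DUMP_PATH} is missing or suspiciously small; treat the backup as failed.")
        print(f"Backup looks sane: {DUMP_PATH} ({DUMP_PATH.stat().st_size} bytes)")

    if __name__ == "__main__":
        main()

Had something of that sort been wired into the backup job, the silently failing 9.2 dumps and the few-byte backup files would likely have raised an alarm long before Tuesday.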
Compounding the embarrassment, GitLab said last year that its business was growing so fast its cloud provider could not keep up with demand, and that it would therefore build and run its own Ceph cluster. Pablo Carranza, GitLab’s head of infrastructure, said at the time that the decision to run its own infrastructure “will make GitLab more efficient, more stable and more reliable because we will have more control over the whole infrastructure.”
It has since backtracked on that decision, telling us via Twitter:
@TheRegister GitLab is committed to improving application performance and is considering alternative cloud hosting providers.
— Connor Shea (@Connorjshea) February 1, 2017
At press time, GitLab said it had no estimate of how long recovery would take, but that it was restoring from the staging server, which has “no webhooks” and is the “only available snapshot.” That snapshot was taken about six hours before the deletion, so roughly six hours of data has been lost for good.
GitLab, founded in 2014, raised $20m in venture capital last year. For now, those investors may be rather more upset than its users.
The Register will update this article as more information becomes available. As for the system administrator who mistakenly deleted the live data, he has decided that he “had better not run any commands with superuser privileges” for the rest of the day.
Reproduction without authorization is prohibited.