What happened

In recent days, conversations all over Reddit and Hacker News have centered on an outage of Atlassian’s Cloud offering. Nearly 400 companies, and somewhere between 50,000 and 400,000 end users, lost access to at least seven products, including Jira and Confluence. The outage has been ongoing since April 4, Atlassian estimates that many affected customers will be without their services for up to two weeks, and access has so far been restored for at least 53% of the affected companies.

On April 4, Jira, Confluence, Opsgenie, and other Atlassian services stopped working for certain companies.

On April 6, Atlassian began posting the same status update every few hours, without sharing any new information. The update read:

“A number of instances are in the verification phase. Once they are re-enabled, our support staff will update the accounts that have an open incident ticket. Restoring customers’ sites remains our top priority, and we are coordinating with our global teams to ensure work continues 24/7 until all instances are restored.”

On April 7, Atlassian acknowledged the problem and shared a few brief details via its Twitter account.

Atlassian offered few details about the outage over the next few days. Meanwhile, the incident was widely discussed on Hacker News, where commenters claiming to be former employees said the company’s internal engineering practices are below standard:

“This does not surprise me at all. (…) at Atlassian, their incident process and monitoring is a joke. More than half of the incidents are customer detected.”

The cause of the incident

Atlassian says it is talking to customers, but those customers do not seem happy. Atlassian CTO Sri Viswanath published a blog post about the incident. Here is the main content:

On April 4, about 400 Atlassian Cloud customers experienced a complete outage of their services. We are working hard to bring the affected sites back online and have restored normal service to approximately 53% of impacted customers (as of April 14; you can see the latest progress on our Statuspage). The remaining customers are expected to have service gradually restored over the next two weeks.

First of all, let’s be clear: this was not a cyberattack, nor was it the result of a failure to scale our systems. In addition, most of the recovered customers have had no data loss, and only a small number of customers may be unable to recover up to five minutes of data written immediately before the incident.

Admittedly, both this incident and our response time fall short of our operating standards, and on behalf of Atlassian, I sincerely apologize. We know that Atlassian products are vital to your teams; when their availability is limited, your business suffers. We are working around the clock to help customers get their business back up and running as soon as possible.

What’s going on?

We have a standalone application called “Insight-Asset Management” that is widely used with Jira Service Management and Jira Software. The app’s functionality has since been fully integrated into those Atlassian products as native features, so we needed to deactivate the older standalone app on customer sites that still had it installed. Our engineering team planned to use an existing script to deactivate instances of the standalone application, but this unexpectedly ran into two major problems:

  • Communication gap. First, there was a communication breakdown between the team that requested the deactivation and the team that performed it. Instead of providing the IDs of the app instances to be deactivated, the requesting team handed over the IDs of the entire cloud sites on which the app was to be deactivated.
  • Faulty script. Second, the script we used provides both a “mark for deletion” capability used in normal, recoverable operations and a “permanently delete” capability used when data must be permanently removed for compliance reasons. Because the script ran with the wrong list of IDs and in the wrong execution mode, the sites of about 400 customers were wrongly deleted.
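Atlassian’s internal tooling is not public, so purely as an illustration, here is a minimal sketch of what the two missing safeguards might look like in a deactivation script: checking that the supplied IDs are actually app IDs rather than site IDs, and defaulting to a recoverable soft delete unless permanent deletion is explicitly confirmed. The flag names, ID format, and helper functions are all assumptions.

```python
#!/usr/bin/env python3
"""Hypothetical deactivation-script sketch; not Atlassian's actual tooling."""
import argparse
import sys


def looks_like_app_id(candidate: str) -> bool:
    # Assumption: app instance IDs carry a distinguishing prefix. A real
    # implementation would verify against an inventory service instead.
    return candidate.startswith("app-")


def mark_for_deletion(app_id: str) -> None:
    print(f"soft-deleting {app_id} (recoverable)")  # placeholder for the real call


def permanently_delete(app_id: str) -> None:
    print(f"PERMANENTLY deleting {app_id}")  # placeholder for the real call


def main() -> int:
    parser = argparse.ArgumentParser(description="Deactivate standalone app instances")
    parser.add_argument("ids_file", help="file containing one ID per line")
    parser.add_argument(
        "--permanently-delete",
        action="store_true",
        help="irreversibly delete data (compliance only); default is a recoverable soft delete",
    )
    args = parser.parse_args()

    with open(args.ids_file) as f:
        ids = [line.strip() for line in f if line.strip()]

    # Safeguard 1: refuse to run if any ID is not an app ID, e.g. if a list of
    # whole cloud-site IDs was handed over by mistake.
    bad = [i for i in ids if not looks_like_app_id(i)]
    if bad:
        print(f"Refusing to run: {len(bad)} IDs are not app IDs (e.g. {bad[0]!r})")
        return 1

    # Safeguard 2: permanent deletion requires an explicit flag plus confirmation.
    if args.permanently_delete:
        if input(f"Permanently delete {len(ids)} app instances? Type DELETE to confirm: ") != "DELETE":
            print("Aborted.")
            return 1

    for app_id in ids:
        (permanently_delete if args.permanently_delete else mark_for_deletion)(app_id)
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Either safeguard on its own would likely have stopped this particular incident before any site was deleted.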

To recover from the incident, our global engineering team has put a systematic process in place to bring affected customers back online.

Recovery program

Atlassian Data Management (www.atlassian.com/trust/secur…)

To ensure high availability, we provision and maintain synchronized standby replicas in multiple AWS availability zones (AZs). Failover between availability zones is automated and typically completes within 60 to 120 seconds. This lets us handle data center outages and other common service disruptions without appreciable impact to customers.
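Atlassian does not say which AWS services back these standby replicas, so the following is only an illustration of the general pattern: a managed RDS instance provisioned with Multi-AZ enabled gets a synchronous standby in another availability zone and automatic failover. All identifiers and sizes below are made up.

```python
import boto3

rds = boto3.client("rds")

# Illustration only: a Multi-AZ RDS instance keeps a synchronous standby in a
# second availability zone and fails over automatically. Names and sizes are invented.
rds.create_db_instance(
    DBInstanceIdentifier="tenant-shard-07",   # hypothetical shard name
    Engine="postgres",
    DBInstanceClass="db.r5.xlarge",
    AllocatedStorage=500,
    MasterUsername="app_admin",
    MasterUserPassword="change-me",           # keep real credentials in a secrets manager
    MultiAZ=True,                             # synchronous standby plus automatic failover
)
```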

We also maintain multiple immutable backups to protect against data corruption events, allowing us to recover to a specific point in time. Backup data is retained for 30 days, and we continuously test and audit these backups to make sure they can support recovery when needed.
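Again, Atlassian has not published which data stores or backup tooling are involved. Purely as an illustration of what “restore to a specific point in time” means, here is a boto3 sketch that restores an assumed RDS instance into a new instance at a timestamp just before the deletion; the instance names and timestamp are hypothetical.

```python
from datetime import datetime, timezone

import boto3

rds = boto3.client("rds")

# Restore into a *new* instance so the live one keeps serving unaffected tenants.
rds.restore_db_instance_to_point_in_time(
    SourceDBInstanceIdentifier="tenant-shard-07",           # hypothetical live instance
    TargetDBInstanceIdentifier="tenant-shard-07-restore",   # new instance to restore into
    RestoreTime=datetime(2022, 4, 4, 7, 0, tzinfo=timezone.utc),  # assumed moment just before the deletion
)
```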

We regularly use these backups to roll back individual customers, or small groups of customers, who have accidentally deleted their own data. And if necessary, we can restore all customers at once into a new environment.

But there is still a gap in our automation: restoring a large subset of customers into their existing (and currently in use) environment without affecting the other customers on it.

In our cloud environment, each data store holds data from multiple customers. Because the data deleted in this incident was not stored in isolation, and the data stores involved also carry data still in use by other customers, we cannot simply roll those stores back; we have to extract the affected customers’ portions from backup and restore them selectively. This makes recovering an individual customer site a lengthy and complex process that requires continual internal validation and customer verification. Our initial site recovery approach was only semi-automated and still involved a series of laborious, complex steps, largely because of the manual verification of customer data in the recovered sites mentioned above.
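The post does not describe the actual extraction mechanism, so the following is only a minimal sketch of the general idea, under assumed table, column, and host names: copy just the affected tenant’s rows from a restored copy back into the live, shared data store, leaving other tenants’ rows untouched.

```python
import psycopg2  # assumes a Postgres-style store; Atlassian has not said what it actually uses

TENANT_ID = "affected-customer-123"  # hypothetical tenant identifier

# One connection to the restored copy, one to the live shared data store.
restored = psycopg2.connect(host="tenant-shard-07-restore", dbname="jira")  # hypothetical
live = psycopg2.connect(host="tenant-shard-07", dbname="jira")              # hypothetical

with restored.cursor() as src, live.cursor() as dst:
    # Pull only the affected tenant's rows out of the restored copy...
    src.execute(
        "SELECT id, tenant_id, payload FROM issues WHERE tenant_id = %s",
        (TENANT_ID,),
    )
    # ...and write them back into the live store without touching other tenants.
    # Assumes "id" is the primary key, so re-running the copy is safe.
    for row in src:
        dst.execute(
            "INSERT INTO issues (id, tenant_id, payload) VALUES (%s, %s, %s) "
            "ON CONFLICT (id) DO NOTHING",
            row,
        )

live.commit()
```

Even in sketch form, it is clear why verifying every copied table against customer expectations makes this a slow, per-site process.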

Now, we want to move to a new, more automated process, as follows:

  • Re-enable metadata for deleted sites within a centralized orchestration system.
  • Restore customer data extracted from the backup, including users and permissions.
  • Re-enable ecosystem applications, billing data, and other data that is not directly related to customer-generated data.

Because each site requires extracting and restoring data from shared data stores, we continually test each restored site and work closely with the customer to verify that the recovered data is accurate.
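Atlassian has not published the orchestration system itself. As a rough sketch of the three steps above plus the verification step, with every function name being an assumption that stands in for work the post describes only at a high level:

```python
"""Rough sketch of a per-site restore pipeline; all names here are assumptions."""
from dataclasses import dataclass


@dataclass
class Site:
    site_id: str


def reenable_site_metadata(site: Site) -> None:
    print(f"re-enabling metadata for {site.site_id}")                 # step 1 (placeholder)


def restore_customer_data(site: Site) -> None:
    print(f"restoring users, permissions, data for {site.site_id}")   # step 2 (placeholder)


def reenable_ecosystem_and_billing(site: Site) -> None:
    print(f"re-enabling apps and billing for {site.site_id}")         # step 3 (placeholder)


def verify_with_customer(site: Site) -> bool:
    # In practice this is the slow part: internal checks plus customer sign-off.
    return True


def restore_site(site: Site) -> None:
    reenable_site_metadata(site)
    restore_customer_data(site)
    reenable_ecosystem_and_billing(site)
    if not verify_with_customer(site):
        raise RuntimeError(f"verification failed for {site.site_id}")
```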

We are currently restoring customers in batches, with up to 60 tenants per batch. From start to finish, the end-to-end recovery process for a batch takes about 4 to 5 days. Our team has developed and started using a new capability to run multiple batches at once, which we believe will help shorten our overall recovery cycle.
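Taking the numbers in this section at face value (about 400 affected sites, batches of up to 60 tenants, 4 to 5 days per batch), a quick back-of-the-envelope calculation shows why running batches in parallel matters. The assumption that batches would otherwise run strictly one at a time is mine, not Atlassian’s.

```python
import math

affected_sites = 400       # "about 400" affected customers, per the post
batch_size = 60            # up to 60 tenants per batch
days_per_batch = (4, 5)    # stated end-to-end time for one batch

batches = math.ceil(affected_sites / batch_size)                      # 7 batches
sequential = (batches * days_per_batch[0], batches * days_per_batch[1])
print(f"{batches} batches -> {sequential[0]}-{sequential[1]} days if run one at a time")
# 7 batches -> 28-35 days sequentially, which is why multi-batch (parallel) recovery
# is what shortens the overall cycle.
```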

Priority number one: get back to business

We know that incidents like this undermine our credibility with customers. We have failed to meet the high operational standards we set for ourselves, and we have also fallen short in our communication: so far, our outreach has been limited to the customers directly affected by the incident.

This incident has the full attention of our engineering team and of Atlassian as a whole. We will continue to work around the clock until every customer’s site has been restored.

In the meantime, here is how we are organizing the work:

  • Restoring customer sites. We will continue to work directly with affected customers to help them recover their sites, and each affected customer can communicate one-on-one with our support team through their open support ticket. Please be assured that we are resolving this as quickly as possible.
  • Daily updates. We will post a daily status update to each affected customer’s support ticket.
  • Post-incident review. We will also conduct a post-incident review and publish the findings and follow-up plans in a timely manner; the resulting report will be made public once it is complete.

Finally, thank you to our customers: thank you for your cooperation, and thank you for every step you have taken with us. We understand that you need to explain this incident to stakeholders within your own organizations, and that this outage has caused significant disruption to normal business. My team and I will do everything we can to restore each customer’s business as soon as possible and to mitigate the impact of this incident.

Lessons learned from this outage

Any engineering team could learn a lot from this outage.

Incident handling:

  • Have runbooks for disaster recovery and black swan events. Plan in advance how you will respond to, assess, and communicate about potential contingencies, and then follow those runbooks. Atlassian has its own disaster recovery runbooks in Confluence, but it may not have followed them fully; its own guidance states that every runbook should include instructions on communication and escalation.
  • Communicate directly and transparently. In an incident like this, it is important to communicate with customers in a timely manner; a lack of communication erodes trust, not only among the affected customers but also among anyone who hears about the disruption. Saying nothing may have felt like the safe option to Atlassian, but it was not the best one. Look at how GitLab or Cloudflare communicate after an outage.
  • Speak the language of the customer. Atlassian’s status updates were vague and lacked technical detail, which is not enough for the IT directors and CTOs who buy Atlassian products.
  • Avoid status updates that say nothing. Most status updates on the incident page were copy-pasted repetitions of the same content, apparently just so that something was posted every few hours. But these are not real updates, and they add to the impression that the outage is not under control.
  • Avoid silence. Atlassian stayed largely silent until day 9 of the outage. Avoid this approach at all costs.

Preventing incidents:

  • Have a rollback plan for all migrations and deprecations.
  • Conduct migration and deprecation test runs.
  • Do not delete data from production.
  • Instead, mark data as deleted (a soft delete), or put it behind a lease or grace period before permanent removal, so it remains recoverable (a minimal sketch follows below).
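As a minimal sketch of that last point, assuming a Postgres-style table named app_instances with a deleted_at column and a 30-day grace period (all assumptions, not anything Atlassian has described), soft deletion only marks rows and leaves permanent removal to a separate, delayed purge job:

```python
from datetime import datetime, timedelta, timezone

GRACE_PERIOD = timedelta(days=30)  # assumed retention window before a hard purge


def soft_delete(cursor, record_id: str) -> None:
    """Mark the row instead of removing it; it stays recoverable until the lease expires."""
    cursor.execute(
        "UPDATE app_instances SET deleted_at = %s WHERE id = %s",
        (datetime.now(timezone.utc), record_id),
    )


def restore(cursor, record_id: str) -> None:
    """Undo a soft delete while the grace period is still running."""
    cursor.execute(
        "UPDATE app_instances SET deleted_at = NULL WHERE id = %s",
        (record_id,),
    )


def purge_expired(cursor) -> None:
    """Only a scheduled job hard-deletes, and only rows whose lease has expired."""
    cutoff = datetime.now(timezone.utc) - GRACE_PERIOD
    cursor.execute(
        "DELETE FROM app_instances WHERE deleted_at IS NOT NULL AND deleted_at < %s",
        (cutoff,),
    )
```

With this pattern, a script run against the wrong IDs only marks rows, and everything it touched can still be restored at any point during the grace period.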