Preparing for the Inevitable: 5 Steps to Follow When Technology Fails

Give your engineers the help they need to make the tough calls at crunchtime.

By Patrick Hill

Opinions expressed by Entrepreneur contributors are their own.

Amazon's massive internet outage in late February was a reminder that any company offering a public cloud service, however big or small, needs a plan for incident response. Outages are a fact of life; what matters is how you respond when they occur.

Having processes in place is essential, but those processes can't (and shouldn't try to) cover all eventualities. If something unexpected strikes at 3 a.m., your incident response team needs firm guidelines to help them decide how to act in the critical moments that follow.

At Atlassian, we came up with five values that guide how we respond to incidents and minimize disruption. A lot gets written about "values," but they're more than something nice to hang on the wall. Our engineers look to these values to steer them through tough decisions they have to make under pressure.

Each value maps to a specific component of incident response. I'm sharing them here in the hope they'll be useful to your organization, too.

Detect

Value: Atlassian knows before our customers do

A well-designed service will have enough monitoring to detect and flag any issue before it becomes an incident. If your team isn't getting paged about imminent problems before they impact customers, you need to improve your monitoring and alerting.
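
To make that concrete, here is a minimal sketch of paging on a warning threshold set well below the level at which customers would notice. The function name, the 2 percent threshold and the three-sample window are all hypothetical values chosen for illustration, not a description of Atlassian's monitoring.

```python
from statistics import mean


def should_page(error_rates, warn_threshold=0.02, window=3):
    """Return True when the recent average error rate crosses the warning threshold.

    error_rates: fractions between 0 and 1, oldest first (hypothetical metric).
    warn_threshold: deliberately lower than the rate customers would notice (assumed).
    window: number of most recent samples to average, to smooth out one-off blips.
    """
    if not error_rates:
        return False  # no data yet, nothing to alert on
    recent = error_rates[-window:]
    return mean(recent) > warn_threshold


# A healthy service stays quiet; a rising error rate pages the team early.
print(should_page([0.001, 0.002, 0.001]))
print(should_page([0.01, 0.03, 0.04]))
```

The point of the design is the gap between the paging threshold and the customer-visible one: the wider you can make it without drowning in false alarms, the more often your team knows first.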

Respond

Value: Escalate, escalate, escalate

The worst thing an engineer can decide is that they don't want to wake someone because it might not be their problem. Nobody should mind getting woken for an incident and finding out they're not needed. But they will mind if they're not woken when they should have been. We're supposed to be on the same team, and teammates support each other.
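
One way to take that decision off the engineer's shoulders is to let a timer do the escalating. The sketch below assumes a hypothetical escalation chain and a five-minute acknowledgment timeout; both are illustrative and would be tuned to your own on-call setup.

```python
def who_to_page(chain, minutes_unacknowledged, ack_timeout=5):
    """Return everyone who should have been paged by now.

    chain: people in escalation order, first responder first (hypothetical).
    minutes_unacknowledged: whole minutes since the first page with no acknowledgment.
    ack_timeout: minutes to wait before adding the next person (assumed value).
    """
    if not chain:
        return []
    elapsed = max(0, minutes_unacknowledged)
    levels = 1 + elapsed // ack_timeout  # one more person per timeout that passes
    return chain[:levels]


chain = ["on-call engineer", "secondary on-call", "team lead"]
print(who_to_page(chain, 0))   # only the first responder
print(who_to_page(chain, 7))   # one timeout has passed, so two people
print(who_to_page(chain, 60))  # the whole chain
```

With a rule like this, nobody has to weigh whether waking a colleague is justified. The page goes out automatically, and the worst outcome is that someone learns they weren't needed.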

Recover

Value: Stuff happens; clean it up fast

Customers don't care why your service is down, only that you restore it as fast as possible. Never hesitate to resolve an incident quickly so you can minimize the impact.

If you're the tech lead and you know you can restore service with a quick restart, but you could also spend time investigating the cause while the service is still down, what should you do? This value guides your answer: Restore now and figure out the cause later; the customer experience comes first.

Learn

Value: Always blameless

Incidents are a part of running a service. We all improve by holding teams accountable, not apportioning blame. Human error is never a valid root cause for a major incident. Why was that engineer able to deploy a dev version to production? How did a command-line typo have such a devastating effect?

Assigning blame is never the appropriate response. Figure out what safeguards were missing and put them in place.
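
As an illustration of such a safeguard, the sketch below refuses to send anything that doesn't look like a release build to production. The version format (three dot-separated numbers) and the environment names are assumptions made for the example; your own release conventions will differ.

```python
import re

# Assumed convention: release builds look like "1.4.2"; anything else
# (for example "1.4.3-dev") is treated as a non-release build.
RELEASE_VERSION = re.compile(r"\d+\.\d+\.\d+")


def check_deploy(version, environment):
    """Return True if the deploy may proceed; raise ValueError otherwise."""
    if environment == "production" and not RELEASE_VERSION.fullmatch(version):
        raise ValueError(
            f"refusing to deploy {version!r} to production: not a release version"
        )
    return True


check_deploy("1.4.2", "production")   # allowed
check_deploy("1.4.3-dev", "staging")  # allowed outside production
try:
    check_deploy("1.4.3-dev", "production")
except ValueError as err:
    print(err)
```

A check like this turns "why was that engineer able to do it?" into a question the system answers on its own, which is where a blameless review wants the fix to land.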

Improve

Value: Never have the same incident twice

Determine the root cause and identify the changes that will prevent that whole class of incidents from happening again. Can the same bug bite elsewhere? What situations could lead to a programmer introducing this bug? Commit to delivering specific changes by specific dates.

With these values in place, the next step is to ensure they're put into practice. We hold monthly meetings where we discuss how they've been implemented and dissect occasions when they weren't. We call people out for following them -- and for not following them. And we've added them to our documentation for incident response.

Service outages are a big deal: the AWS incident affected 54 of the top 100 retailers, and that's in just one industry segment. Your footprint may be a good deal smaller, but the impact of an outage on both you and your customers can be just as disruptive, proportionally speaking. Give your engineers the help they need to make the tough calls at crunchtime. Both they and your customers will thank you for it.

Patrick Hill

SRE Team Lead at Atlassian

Patrick Hill is SRE team lead at Atlassian, provider of team collaboration and productivity software that helps teams organize, discuss and complete shared work. Teams at more than 68,000 organizations use Atlassian products including JIRA, Confluence, HipChat, Trello and Bitbucket. Based in Austin, Texas, Hill helps build teams and processes to ensure consistently high performance and availability of Atlassian's internal and external cloud services.
