Why Your Company Should Incorporate a Wartime Mindset to Prepare for a Digital Crisis
A few weeks ago, Amazon’s Prime Day didn’t kick off as planned. Eager shoppers temporarily encountered technical glitches on the website, preventing them from taking advantage of great deals. The result? Analysts estimated Amazon lost $90 million in sales, but the company ultimately rebounded from the episode just fine, setting a Prime Day record by taking home $3.4 billion in sales.
Incidents like these are increasingly common today (just visit Outage.Report to see services that are currently down). There’s a reason for that: technology breaks. Whether due to human error, a third-party service interruption or a rare but devastating “black swan” event, even the most sophisticated tech companies like Amazon have to deal with digital disruptions.
And, because today’s consumers expect an “always-on” experience, being unprepared can cost you even more than a missed revenue opportunity.
Fortunately, an unexpected outage doesn’t have to produce chaos. By adopting a disciplined approach and a “wartime” mindset, you can minimize damage and quickly restore order. Here are three ways to prepare before your next digital crisis occurs:
Commit acts of ‘self-sabotage’ to prepare for the worst.
Practice makes perfect. That’s why military service members undergo basic training before they ever set foot in a war zone. When they do finally enter combat, they are (hopefully) ready for battle.
You should apply a similar level of rigor and training to the teams that oversee your digital operations. One of the best ways to do that is to deliberately engineer a major incident, known in some circles as “wartime.” Our team at PagerDuty created just such an internal initiative, known as “Failure Fridays.”
Essentially: Every Friday from 10 a.m. to noon, we take something offline. It won’t be a mission-critical part of our business such as customer support, but the missing service still presents a problem that our engineers must solve in order to return to “peacetime.”
Running through mock scenarios like this gives each member of your team an opportunity to know what to do ahead of a real crisis. Investing this time and effort in advance will pay dividends when disaster actually strikes.
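For teams that want to try something similar, the drill described above can be sketched in a few lines of Python. This is a minimal illustration, not PagerDuty's actual tooling: the `Service` type, the service names and the helper functions are all hypothetical, and the only details taken from the text are the Friday 10 a.m. to noon window and the rule that mission-critical services are never chosen.

```python
import random
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class Service:
    """Hypothetical record for a service that could be taken offline."""
    name: str
    mission_critical: bool


# Drill window from the article: Fridays, 10 a.m. to noon.
DRILL_START = time(10, 0)
DRILL_END = time(12, 0)
FRIDAY = 4  # datetime.weekday() counts Monday as 0


def in_drill_window(now: datetime) -> bool:
    """Return True on Fridays from 10:00 (inclusive) to 12:00 (exclusive)."""
    return now.weekday() == FRIDAY and DRILL_START <= now.time() < DRILL_END


def pick_drill_target(services, rng=random):
    """Choose a random service to take offline, never a mission-critical one."""
    candidates = [s for s in services if not s.mission_critical]
    if not candidates:
        raise ValueError("no non-critical service available for a drill")
    return rng.choice(candidates)


if __name__ == "__main__":
    # Example inventory; these names are made up for illustration.
    inventory = [
        Service("customer-support", mission_critical=True),
        Service("internal-wiki", mission_critical=False),
        Service("report-exporter", mission_critical=False),
    ]
    if in_drill_window(datetime.now()):
        print("Today's drill target:", pick_drill_target(inventory).name)
    else:
        print("Outside the drill window; nothing goes offline.")
```

Keeping the selection random means engineers can't rehearse for one specific failure, which is closer to how a real outage arrives.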
Clearly define roles and responsibilities in advance.
Establishing clear lines of responsibility and duties is crucial to successful military or non-military operations, especially when things go awry. For instance, during the now-famous Osama bin Laden operation, one of the helicopters involved was damaged during the landing and needed to be destroyed because it featured proprietary technology. While other Navy SEALs infiltrated the compound, one remained with the helicopter to take care of this critical task.
In order for your team members to be successful in the face of the unexpected, they need to understand their roles. Without that understanding, unplanned events may become “all hands” events, which are unnecessarily disruptive to employees and extremely inefficient. Just imagine what would not have been accomplished had all 23 Navy SEALs tried to solve the helicopter problem.
The most important role to define is incident commander. While organizational structure during peacetime is generally based on seniority, wartime is different. During major digital incidents, the incident commander becomes the highest-ranking individual, higher even than the CEO. This individual, who is usually closer to the actual responder teams than someone in executive management, is responsible for taking all available information and then making every decision.
Formalizing this structure and its roles and responsibilities, and distributing that information to employees, helps avoid confusion about whom team members should take orders from. This enables a quicker, more efficient incident-response process.
Prioritize the health of your team and ensure work-life balance.
Just as military personnel are allotted off-duty time, your team needs breaks from the taxing, stress-inducing experience of putting out fires. In our research, we’ve found that over half of IT professionals experience sleep and/or personal life interruptions more than 10 times per week as a result of digital service disruptions or outages.
The proper response here starts with understanding the differences between human ops and humane ops. The latter accounts for the human factors associated with being on the frontlines of operations, such as employee health and happiness, and incorporates them into your people-management strategy.
In my opinion, one of our customers, Intercom, a tech company in the Bay Area, sets the standard for a humane ops approach. It allows employees within its engineering organization to volunteer when to go on call, and then rotates them out every six months. It also compensates employees for taking on-call shifts. These little things -- giving your team control over their schedules and recognizing their contributions -- can make a big difference in avoiding burnout and fostering goodwill.
Another way to achieve humane ops is to give employees the leeway to determine when a crisis is really a crisis instead of requiring them to sound the alarm at the slightest hint of trouble. Sometimes it is better to wait a few minutes and let an issue sort itself out instead of unnecessarily mobilizing an entire crisis team. Trust your team to make the right judgment calls and you’ll improve work-life balance in the process.
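The "wait a few minutes" idea can also be built into alerting itself. The sketch below assumes a hypothetical five-minute grace period and a made-up `should_page` helper; neither comes from a real product, and the right delay will depend on the service. The crisis team is paged only if the problem is still present once the grace period has passed.

```python
from datetime import datetime, timedelta

# Assumed value for illustration; tune it per service.
GRACE_PERIOD = timedelta(minutes=5)


def should_page(first_seen: datetime, now: datetime, still_failing: bool,
                grace: timedelta = GRACE_PERIOD) -> bool:
    """Page only when an issue has persisted for at least the grace period."""
    if not still_failing:
        # The issue sorted itself out, so nobody needs to be woken up.
        return False
    return now - first_seen >= grace


if __name__ == "__main__":
    started = datetime(2018, 8, 3, 2, 0)
    # Two minutes in, still failing: keep waiting.
    print(should_page(started, started + timedelta(minutes=2), True))
    # Six minutes in, still failing: time to mobilize.
    print(should_page(started, started + timedelta(minutes=6), True))
```

A rule like this is a floor, not a replacement for judgment: responders should still be free to escalate sooner when something is clearly wrong.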
In today’s digital-powered world, moments of failure are inescapable. But if you can battle-test your team and set its members up for success, you’ll find these incidents won’t inspire panic, but instead promote a shared commitment to resolve them as quickly and efficiently as possible. Here’s to a better customer experience for your company -- with more uptime and less downtime.