Not Every Bug or Outage Is Preventable, But Help Is Out There. Here's What You Can Do.
“Every business is a software business”: That 20-year-old saying by "the father of software quality" Watts Humphrey is truer now than ever before. Retail companies rely on ecommerce to sell to customers, and many newspapers have web-first publications. What's more, every major brand has its own app these days. Nike and Starbucks have both gained reputations as leading software companies despite the fact that their respective core businesses seem far removed from the world of high technology!
But the process isn’t over. Many traditional companies are still struggling to shift into the digital age, and even older tech companies may find themselves left behind by modern trends and changes.
Things, after all, move fast. Mobile phone companies release new versions of their apps every week, while big online platforms may go through multiple changes in a day. And the cost of failure amid all this change is high. If a major retailer’s web app is down, it's losing money every minute as online customers go elsewhere. If a bank’s login is unavailable, millions of people are locked out of their money, and that damages trust.
Not every bug or outage is preventable, of course, but there are some good guidelines and tools you can use to keep your business up and running in the digital age.
1. Test, but be ready to fix the things your tests missed.
Testing is extremely important if you're determined to avoid an outage. Whether you employ system tests, unit tests or comprehensive quality-assurance tests, more testing will catch more bugs. No matter how much you test, however, you’ll still have bugs that will make it out into the world. That's inevitable. And you can’t always wait to conduct every type of test when you’re focused on marketing and shipping.
The good news is that companies have a variety of ways of finding the causes of bugs in their live systems. Relying on users isn't one of them: users aren’t usually very good at giving detailed bug reports. (How often have you taken the time to write a detailed report when a website or app wasn't working?) So the first thing you need to do is detect the problem yourself.
Here, services like Sentry can send alerts if something breaks, and monitoring tools like Prometheus, DataDog and New Relic are useful for keeping an eye on complicated modern systems. Another interesting new tool is Rookout, which allows software teams to debug live apps without having to stop them or break them. This tool also helps to fix broken systems while keeping downtime to a minimum, and it can find hard-to-reach bugs that may not appear in testing.
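To make the idea of detecting a problem concrete, here is a minimal sketch in Python of an error-rate alert. It is not how Sentry, Prometheus or any other named product works internally. The class name, the window size and the threshold are all assumptions chosen for illustration.

```python
from collections import deque


class ErrorRateMonitor:
    """Hypothetical monitor: tracks recent request outcomes in a sliding window."""

    def __init__(self, window=100, threshold=0.05):
        # Only the most recent `window` outcomes are kept; older ones drop off.
        self.results = deque(maxlen=window)
        self.threshold = threshold

    def record(self, success):
        """Store one request outcome (True for success, False for failure)."""
        self.results.append(success)

    def error_rate(self):
        """Fraction of failed requests in the current window."""
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def should_alert(self):
        """True when the recent error rate is above the configured threshold."""
        return self.error_rate() > self.threshold
```

In a real system, `record` would be called from the request-handling code, and a positive `should_alert` would page someone or post to a team channel rather than just return a value.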
2. "Chaos engineering": testing to destruction
There’s only one way of knowing how well you can handle having things break: You break them yourself. Yes, that's a serious statement.
"Chaos engineering" is a way of building and testing resilience by deliberately breaking parts of a system to ensure that the system as a whole can still work. Netflix has been using chaos engineering since 2011 when the company built a “Chaos Monkey” app to randomly disable its servers. The app has since been released as a free, open-source tool. Also, a startup called Gremlin offers chaos engineering as a service to make it easier for companies to get started even if their personnel lack specialist knowledge.
3. Automate your deployment.
In one way, software is like any other product: Once you make it, you have to ship it. In many software companies, the code is written by a development team, shipped by an integration and deployment team, and maintained by an operations team. Throw in testing, and it might take weeks for the code to make it into your web app.
Automated deployment cuts out some of these middle stages and gets your new features or bug fixes live faster. Continuous integration tools like Jenkins and CircleCI can test, build and deploy your code automatically to keep the back end running. A new, specialized solution for automatic deployment of websites and web apps is Netlify, which also integrates with new technologies like serverless computing.
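The basic shape of an automated pipeline is simple: run each stage in order and stop at the first failure, so broken code never reaches the deploy step. The Python sketch below illustrates only that idea. It does not reflect the configuration format or internals of Jenkins, CircleCI or any other tool, and the stage names used with it are hypothetical.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a function that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order, stopping at the first one that fails.

    Returns a log with one entry per stage that was actually run.
    """
    log = []
    for name, step in stages:
        ok = step()
        log.append(f"{name}: {'ok' if ok else 'FAILED'}")
        if not ok:
            # Later stages (such as deploy) are skipped after a failure.
            break
    return log
```

For example, given the stages "test", "build" and "deploy" where the build step returns False, the log would contain entries for "test" and "build" only, and the deploy step would never run.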
4. Keep control of your code.
Automated integration and deployment fix a lot of problems, but they introduce new issues, too. Small mistakes in the code can lead to passwords or API keys being accidentally deployed live, or to the wrong configuration file being used and breaking the system.
It’s important to balance automated deployment with your ongoing effort to keep control of your code. Some of the popular continuous deployment tools will check for particularly bad mistakes at the build or deployment stage. Datree is a new tool to stop unwanted code before it even makes it into the company code repositories, by letting you define policies and then enforcing them before a pull request is merged.
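A policy check of this kind can be pictured as a function that scans the lines a change adds and reports anything that breaks a rule. The Python sketch below is an assumption-laden illustration, not Datree's actual rule format or API. The two policies and their regular expressions are hypothetical and far simpler than what a production secret scanner would use.

```python
import re

# Hypothetical policies: each maps a description to a pattern that flags
# a quoted literal assigned to a suspicious-looking name.
POLICIES = {
    "hard-coded password": re.compile(
        r"password\s*=\s*['\"][^'\"]+['\"]", re.IGNORECASE
    ),
    "hard-coded API key": re.compile(
        r"api[_-]?key\s*=\s*['\"][^'\"]+['\"]", re.IGNORECASE
    ),
}


def check_change(added_lines):
    """Return (line_number, policy_name) for every policy violation found.

    `added_lines` is the list of lines a proposed change would add;
    line numbers start at 1.
    """
    violations = []
    for number, line in enumerate(added_lines, start=1):
        for policy, pattern in POLICIES.items():
            if pattern.search(line):
                violations.append((number, policy))
    return violations
```

A merge gate built on this would block the pull request whenever `check_change` returns a non-empty list. Reading a secret from the environment instead of writing it into the code, as in `password = os.environ["DB_PASSWORD"]`, passes these particular patterns.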
Follow these guidelines and, yes, your systems may still suffer bugs and outages. But you’ll have fewer of them, be better equipped to handle them, be able to fix them faster and be more likely to keep your company up and running -- and making money.