What Teams Get Wrong About Outages and Operational Risk

Major cloud outages have become a regular part of operating in today’s ecosystem. Cloudflare went down this week, taking large portions of the internet with it and affecting companies such as X and OpenAI. Last month it was AWS. Before that, Azure. Each event took down thousands of services that rely on those providers for uptime, security, performance, or all three.

For most companies, these interruptions aren’t avoidable. When you build on top of external providers, you inherit their risks. Cross-cloud redundancy often sounds like a solution in theory, but the cost, complexity, and operational overhead make it unrealistic for most organizations.

This outage didn’t impact our environment, yet it highlighted a broader truth: not all downtime is preventable, and some risk lives outside your own systems.

What teams can control, however, is everything that happens inside their own environment. And that’s where most avoidable incidents begin.

What You Can’t Control

External outages will happen. Even the best providers have failure points. They operate global networks with enormous complexity, and no amount of planning can eliminate risk at that scale. Each of the following platforms has experienced major interruptions in the last year:

  • Cloudflare
  • AWS
  • Azure
  • Microsoft 365
  • GitHub
  • Xbox Live
  • OpenAI

For companies that rely on them, the impact is simple: you wait. You communicate. You monitor. And you recover when the provider recovers.

Trying to engineer around “no downtime ever” isn’t just impractical. It’s often counterproductive. The best teams focus their resilience efforts where they can actually make a difference.

What You Can Control

Most severe internal incidents happen for reasons that have nothing to do with AWS or Cloudflare. They happen because of small mistakes, neglected process, or rushed decisions.

These are the failures teams can prevent:

  • weak or untested backups
  • unverified restore procedures
  • code deployed without review
  • passwords shared or stolen through phishing
  • no environment separation
  • missing observability
  • unclear runbooks
  • lack of incident response practice

These risks sit inside the organization. They’re controllable. They compound when left unchecked. And unlike global outages, they’re the ones teams have full authority to eliminate.

The Cloudflare outage isn’t a warning about cloud providers. It’s a reminder to harden the parts of the system that belong to you.

Hardening Before Launch

As teams approach launch, process discipline matters more than building new features. The best protection against downtime isn’t avoiding risk entirely. It’s preparing for the kinds of failures that come from within.

This means:

  • confirming backups actually restore
  • validating deployment pipelines
  • reviewing access policies
  • rotating credentials
  • reducing blast radius where possible
  • tightening test coverage
  • establishing clear ownership of systems
  • practicing incident response before it’s needed

Teams that do these things consistently recover faster and make fewer mistakes. They won’t prevent outages from external providers, but they’ll avoid compounding the impact with their own errors.
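
As one illustration of the first item, confirming that backups actually restore, here is a minimal Python sketch. It assumes the backup is a plain zip archive of a directory, which stands in for whatever backup tooling a team actually uses. The function names (file_digests, create_backup, verify_restore) are hypothetical and exist only for this example. The idea is to restore into a scratch directory and compare checksums against the source, rather than trusting that the backup job reported success.

```python
import hashlib
import tempfile
import zipfile
from pathlib import Path


def file_digests(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            digests[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def create_backup(source_dir: Path, archive_path: Path) -> None:
    """Write every file under source_dir into a zip archive (hypothetical backup step)."""
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                archive.write(path, arcname=path.relative_to(source_dir).as_posix())


def verify_restore(source_dir: Path, archive_path: Path) -> bool:
    """Restore the archive into a scratch directory and compare it to the source."""
    with tempfile.TemporaryDirectory() as scratch:
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(scratch)
        return file_digests(Path(scratch)) == file_digests(source_dir)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        source = Path(workdir) / "data"
        source.mkdir()
        (source / "orders.csv").write_text("id,total\n1,9.99\n")

        backup = Path(workdir) / "backup.zip"
        create_backup(source, backup)
        print("restore matches source:", verify_restore(source, backup))
```

A real check would restore from the production backup system into an isolated environment and run on a schedule. The comparison step stays the same: the restore counts as verified only when the restored contents match what was backed up.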

Where Leadership Makes the Difference

Technology teams operate in an environment where some problems will always be outside their control. What leaders can provide is clarity: where the real risks are, what can be prevented, and how the team will respond when failures come from upstream providers.

The most effective approach is transparency. Communicate early. Communicate clearly. Share what happened, what’s known, and what the team is doing next. Incidents create uncertainty. Clear leadership reduces it.

The Cloudflare outage won’t be the last major disruption of the year. Outages will continue as the ecosystem grows in size and complexity. What matters is how well teams prepare for the parts they can influence.

External failures are inevitable.
Internal failures are optional.

Interested in learning more about this topic? Contact our solution experts and set up a time to talk.