What AWS Availability Zones Really Mean for Resilience
Recent outages have highlighted the importance of understanding how AWS Availability Zones, Regions and failover strategies affect cloud costs and business continuity
Running business-critical workloads on AWS rather than on-premises requires enterprises to balance the cost of redundancy against the likelihood of disruption. Every additional copy of data or failover mechanism improves availability, but it also increases storage and transfer costs.
Recent events have demonstrated the risks of getting that balance wrong. In May, a cooling system failure in a single AWS Availability Zone (AZ), a group of one or more data centers, in Virginia caused a major outage at Coinbase. In its postmortem, the US-based cryptocurrency exchange highlighted the tension between ensuring low latency for an application that “cannot tolerate inter-zone network hops” and having the capacity to fail over to another zone in the event of an outage.
A few weeks earlier, drone strikes damaged AWS facilities in the United Arab Emirates (UAE) and Bahrain, disrupting cloud services across the Middle East and prompting technology vendors including Red Hat and Snowflake to advise customers to move their workloads to other regions. Connected device platform EMQX was forced to decommission its capacity in the UAE region and migrate all customer deployments previously hosted there to alternative AWS regions.
These incidents raise questions for enterprises around how much resilience is enough and the real cost of achieving it. The answers depend on understanding the building blocks of AWS infrastructure.
Understanding AWS Geography
At a high level, AWS operates Regions, which incorporate AZs, Local Zones and Outposts.
The AWS Cloud spans 123 AZs within 39 geographic Regions. Amazon has announced plans to add seven more AZs and two more Regions in Saudi Arabia and Chile. Each AZ has separate power, networking and physical infrastructure, allowing workloads to continue running even if one localized component fails.
“In the public cloud, that was a concept that emerged with AWS,” Fred Lherault, Field CTO EMEA and Emerging Markets at Everpure, tells TechChannel. “You’ve got notions of Regions and Availability Zones that basically would map to a Region being a city, and the Availability Zone would be the computer room you would have within the data center in that city.”
The purpose of an AZ is fault isolation. As Lherault explains, “within one of the zones, if something fails, it will only impact this Availability Zone.” That is key for enterprises that require constant access to data. If they deploy all storage, networking, compute and application resources within a single zone, a failure can affect the entire workload.
“If you care about high availability, you have to architect your application so that it will be using resources in different Availability Zones,” Lherault says. “The likelihood of multiple AZs becoming unavailable is very low.”
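A rough calculation shows why spreading across zones helps. The sketch below is illustrative only: it assumes AZ failures are independent of one another, and the 99.9% per-zone figure is a made-up input, not an AWS number. Real zones in one Region can share failure causes, so actual results will be less favorable.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(per_az_availability: float, az_count: int) -> float:
    """Probability that at least one AZ is up, assuming independent failures."""
    if not 0.0 <= per_az_availability <= 1.0:
        raise ValueError("per_az_availability must be between 0 and 1")
    if az_count < 1:
        raise ValueError("az_count must be at least 1")
    # The workload is down only if every AZ is down at the same time.
    return 1.0 - (1.0 - per_az_availability) ** az_count


def expected_downtime_hours(availability: float) -> float:
    """Expected hours per year during which no AZ is available."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Hypothetical per-AZ availability of 99.9%, used only for illustration.
for az_count in (1, 2, 3):
    availability = combined_availability(0.999, az_count)
    downtime = expected_downtime_hours(availability)
    print(f"{az_count} AZ(s): availability {availability:.9f}, ~{downtime:.5f} h/year down")
```

The calculation covers only zone-level outages. It says nothing about application bugs, failed failovers or the regional disasters described next.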
But that does not eliminate risk entirely. “It can happen. You can still have a regional disaster, whether a natural disaster or a man-made disaster that would impact the whole Region … but it tends to be fairly rare.”
Bringing Compute Closer With Local Zones and Outposts
AWS has also introduced Local Zones, designed for latency-sensitive applications. “They are smaller Regions that tend to be closer to cities but are attached to a wider Region,” Lherault explains. An enterprise building an application may deploy the bulk of the infrastructure in a Region, “but then anything that communicates with the client that really needs low latency, [it would] deploy that in a Local Zone.”
For enterprises that have data sovereignty or latency requirements, AWS Outposts provide a hybrid option. “You’re basically deploying AWS hardware,” Lherault says. “It’s connected to AWS, you manage it through the AWS control plane, but the data and the compute run in your data center.”
AWS encourages the use of multi-zone architectures to reduce the risk of a single point of failure while maintaining relatively low latency within Regions. This means failures that affect a single data center should not cascade across an entire application, provided it is architected properly.
Why Cloud Redundancy Requires a New Way of Thinking
Resilience in the cloud does not come for free. Users pay AWS on demand for network traffic between distributed systems, as well as compute and storage. That includes the movement of data between AZs and across Regions, which becomes a significant cost driver in highly redundant architectures.
This represents a major shift in thinking for engineers, Lherault notes. “When you come from the world of the data center, you don’t necessarily think about it from a cost point of view, because you’ve already purchased your network equipment.”
In that model, networking decisions are primarily about performance. “That’s the way network engineers have been thinking for forever—you want to use the shortest path possible to optimize performance. Now in the public cloud, you have to think about it also from a cost point of view.”
AWS provides tools such as placement groups and orchestration controls to help manage where resources run, so that they are either co-located in the same AZ or, conversely, never placed in the same location.
Redundancy requires data transfer, but the key is minimizing unnecessary transfer through compression, selective replication and localized access patterns. “When you’re architecting, you want to make sure that you don’t cross Regions and Availability Zones unnecessarily,” Lherault advises. “Only send data when you really have to.”
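To make the effect of compression and selective replication concrete, here is a minimal cost sketch. Every number in it is an assumption: the per-GB rate is a placeholder rather than an AWS price, and the class and function names are hypothetical. Actual transfer pricing varies by service, Region and direction, so check the current AWS price list before relying on any figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicationPlan:
    daily_change_gb: float      # data written per day, in GB
    replicated_fraction: float  # share of changed data actually replicated (0 to 1)
    compression_ratio: float    # original size / compressed size; 1.0 means none
    rate_per_gb: float          # HYPOTHETICAL transfer price per GB, not an AWS rate


def monthly_transfer_cost(plan: ReplicationPlan, days: int = 30) -> float:
    """Estimate the monthly cost of moving replicated data between locations."""
    if not 0.0 <= plan.replicated_fraction <= 1.0:
        raise ValueError("replicated_fraction must be between 0 and 1")
    if plan.compression_ratio < 1.0:
        raise ValueError("compression_ratio must be at least 1.0")
    gb_sent_per_day = (
        plan.daily_change_gb * plan.replicated_fraction / plan.compression_ratio
    )
    return gb_sent_per_day * days * plan.rate_per_gb


# Replicate everything, uncompressed.
naive = ReplicationPlan(500.0, 1.0, 1.0, 0.02)
# Replicate only 40% of changes and compress them 2:1.
tuned = ReplicationPlan(500.0, 0.4, 2.0, 0.02)

print(f"naive: {monthly_transfer_cost(naive):.2f} per month")
print(f"tuned: {monthly_transfer_cost(tuned):.2f} per month")
```

The model deliberately ignores per-request charges and the compute cost of compression, both of which can matter at scale.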
That shift is especially important for mainframe engineers adapting to cloud environments. “One of the most important things for engineers moving from the world of the data center to the public cloud to understand is that resiliency in the public cloud is within one Availability Zone, so resiliency is going to be worse if you’re used to enterprise systems,” Lherault warns.
One of the platforms that can assist here is Kubernetes, which helps ensure that in the event of failure the application can restart automatically from a different location, in this case a different AZ.
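As one possible illustration, a Kubernetes Deployment can ask the scheduler to spread replicas across zones with a topology spread constraint. The sketch below builds such a manifest as a Python dictionary and prints it as JSON, which kubectl accepts. The application name, image and replica count are hypothetical, and the soft `ScheduleAnyway` setting is a design assumption, not a recommendation from the article's sources.

```python
import json

# Well-known Kubernetes node label that identifies the zone a node runs in.
ZONE_TOPOLOGY_KEY = "topology.kubernetes.io/zone"


def build_deployment(app_name: str, image: str, replicas: int) -> dict:
    """Build a Deployment manifest whose pods are spread across zones."""
    labels = {"app": app_name}
    spread_constraint = {
        "maxSkew": 1,  # aim for at most one pod of difference between zones
        "topologyKey": ZONE_TOPOLOGY_KEY,
        # Soft preference: replacement pods can still be scheduled in the
        # surviving zones if one zone becomes unavailable.
        "whenUnsatisfiable": "ScheduleAnyway",
        "labelSelector": {"matchLabels": labels},
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": app_name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "topologySpreadConstraints": [spread_constraint],
                    "containers": [{"name": app_name, "image": image}],
                },
            },
        },
    }


# Hypothetical application name and image, for illustration only.
manifest = build_deployment("checkout", "registry.example.com/checkout:1.0", 3)
print(json.dumps(manifest, indent=2))
```

This only helps if the cluster has worker nodes in more than one AZ, and any storage the pods depend on must also be reachable from the zone where they restart.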
Lherault adds: “If you build your application like you’ve been doing it in the data center, you’re going to have availability challenges and potentially also durability challenges.”
Understanding AWS Durability Levels and Storage Classes
In AWS, durability levels—which measure how likely data is to survive in the event of a failure—vary significantly depending on the storage class. “The highest class of service in terms of durability is AWS S3,” Lherault notes, describing it as “11 nines of durability (99.999999999%),” which makes it “really, really, really safe to store your data in there.” However, he cautions that S3 is object storage, which requires applications to be designed accordingly, as it does not behave like traditional databases.
Other storage options offer lower guarantees. Within Amazon Elastic Block Store (EBS), only certain volume types reach around 99.999% durability, a figure many mainframe engineers would consider low.
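The gap between 11 nines and five nines is easier to grasp as expected losses. The sketch below assumes each durability figure is an annual, per-item probability and that losses are independent. The item counts are arbitrary examples and the function names are hypothetical.

```python
def annual_loss_probability(nines: int) -> float:
    """Chance of losing one item in a year for a given number of nines.

    For example, 5 nines (99.999% durability) gives 0.00001. Taking the
    count of nines as input avoids rounding error from computing 1 - 0.999...
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    return 10.0 ** -nines


def expected_annual_losses(item_count: int, nines: int) -> float:
    """Expected number of items lost per year across item_count items."""
    if item_count < 1:
        raise ValueError("item_count must be at least 1")
    return item_count * annual_loss_probability(nines)


def years_per_expected_loss(item_count: int, nines: int) -> float:
    """Average number of years between single-item losses."""
    return 1.0 / expected_annual_losses(item_count, nines)


# Arbitrary example fleets: 10 million objects vs. 1,000 block volumes.
print(f"11 nines, 10,000,000 objects: one loss every "
      f"~{years_per_expected_loss(10_000_000, 11):,.0f} years")
print(f"5 nines, 1,000 volumes: one loss every "
      f"~{years_per_expected_loss(1_000, 5):,.0f} years")
```

Under these assumptions, a five-nines service is a million times more likely to lose any given item in a year than an 11-nines one, which is why additional replication or backup is usually layered on top.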
That has financial and technical implications that are unavoidable for regulated industries. “You still have to implement some form of higher availability on top of it,” Lherault says, citing the European Union’s Digital Operational Resilience Act (DORA) regulation in banking, which requires organizations to maintain disaster recovery systems spanning different locations.
In terms of how quickly an organization can recover and how much data loss it can tolerate, disaster recovery planning in the cloud follows the same principles as on-premises environments. This framework will determine how often to transfer data and the steps needed to recover, but the tools available in the cloud are different. “From a data point of view, you don’t necessarily have the same breadth of tools in the public cloud that have existed in the data center for decades,” Lherault says.
But one advantage is that the on-demand model in the cloud shifts the cost calculations.
Finding Cost Efficiencies With the On-Demand Model
Rather than maintaining idle standby compute and networking capacity in a secondary location, organizations typically pay on an ongoing basis only for data to be replicated and stored across AZs or Regions, and provision the remaining capacity when they need it.
This makes cloud-based disaster recovery more flexible and potentially more cost-efficient, but it also increases the importance of robust automation and well-tested processes. With infrastructure-as-code tools such as Terraform or Ansible, organizations can provision the compute and networking resources they need and deploy them automatically during a failover event.
Lherault observes that organizations using automation and abstraction tools to operate across hybrid environments are effective in “monitoring and understanding what’s being used when,” to ensure they manage resilience, cost and performance as part of a coherent architecture.