We use Heroku, a PaaS (Platform as a Service) vendor, quite often for hosting small and medium-sized web applications. It makes it drop-dead simple to install, deploy, maintain, and scale your projects on Amazon AWS's infrastructure. As a fully managed service, Heroku also takes care of ongoing hardware and OS-level maintenance, with engineers monitoring around the clock and responding to infrastructure emergencies. Furthermore, they offer a compelling library of third-party add-ons to complement and enhance your applications. All of this comes at a price point (anywhere from virtually free to a few hundred dollars a month) that is hard to beat.
June was a rough month for them, though, with their infrastructure suffering two widespread application outages, raising the question: what happened, and how reliable are they? Digging a little deeper, we find both incidents linked to power outages in Amazon's US East region, which is where Heroku's infrastructure exclusively lives. When they detected the first outage, Heroku began to move some of their infrastructure out of the affected Amazon AWS Availability Zone (AZ) into another AZ. About an hour into the incident, most Heroku applications were back online, well before the affected AZ had recovered from the outage. In this case, Heroku definitely proved their value-add by minimizing their customers' downtime.
Looking at the latter outage, we unfortunately didn't see the same minimizing effect. This outage stemmed from a terrible storm that sadly took several lives and left millions without power on the East Coast. It took out enough Amazon AWS infrastructure to knock Netflix, Instagram, and Pinterest offline as well, and made front-page news on MSNBC. During the crisis, while a few of our applications were offline, the majority were still online, albeit in a degraded state. Most production-level applications on Heroku have multiple "dynos" provisioned to them, and Heroku strives to automatically distribute these dynos across AZs precisely to guard against a disaster such as this. I surmise the applications that went totally offline either had all their dynos provisioned in the downed AZ, or had single points of failure there, such as a non-replicated database.
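To get a feel for why spreading dynos across AZs helps, here is a minimal back-of-the-envelope simulation. It assumes dynos are placed uniformly at random across three AZs; Heroku doesn't publish the actual placement strategy of the dyno manifold, so treat the numbers as illustrative only.

```python
import random

AZ_COUNT = 3        # hypothetical AZ count; actual counts vary by region
TRIALS = 100_000

def app_survives(dyno_count, downed_az=0):
    """True if at least one dyno lands outside the downed AZ,
    i.e. the app stays at least partially online."""
    placements = [random.randrange(AZ_COUNT) for _ in range(dyno_count)]
    return any(az != downed_az for az in placements)

for dynos in (1, 2, 4, 8):
    survived = sum(app_survives(dynos) for _ in range(TRIALS))
    print(f"{dynos} dyno(s): ~{survived / TRIALS:.1%} survive a single-AZ outage")
```

Under this (assumed) random placement, a single dyno survives a one-AZ outage only about two-thirds of the time, two dynos about 89% of the time, and the odds climb quickly with each additional dyno, which squares with what we observed: mostly degraded apps, a few fully dark ones.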
All taken into account, I must say Heroku proved their worth in both cases. An in-house ops team that could respond the way Heroku did in the first incident would run thousands of dollars a month. The cross-AZ redundancy already built into Heroku's "dyno manifold" would have easily tacked on thousands more in custom software architecture and engineering - simply not justifiable for small or even some medium-sized applications. It is still a bit unsatisfying, though, and prompts the question: how can we do better on reliability?
The IT industry frames the question more specifically as: how many 9's do I need (and can I pay for)? Acts of God, human error, and other unforeseen circumstances are a fact of life. Nobody promises 100% uptime, so it becomes a question of how many 9's. "Two 9's" (99%) means you are down for no more than 7.2 hours/mo., while "five 9's" (99.999%) means you are down for no more than 25.9 s/mo.; the quick calculation below spells out the math. Amazon's EC2 SLA ("Service Level Agreement") promises 99.95% uptime over a year before they reimburse credits to you, so it looks like customers may qualify for some credits this year ;-). Heroku does not have an SLA, but they usually land a bit over three 9's in a given month, with months like June bringing them down to two 9's. So there you have it: as the market stands today, up to a few hundred dollars a month will get you two-closer-to-three 9's, while four 9's or more will cost you thousands a month in infrastructure.
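The arithmetic behind those downtime budgets is simple enough to script. This sketch assumes a 30-day month for the monthly figures; the EC2 SLA line uses a full 365-day year.

```python
HOURS_PER_MONTH = 30 * 24              # 720, assuming a 30-day month

for nines in range(2, 6):
    uptime = 1 - 10 ** -nines          # two 9's = 99%, five 9's = 99.999%
    downtime_h = (1 - uptime) * HOURS_PER_MONTH
    print(f"{nines} 9's ({uptime:.3%} uptime): "
          f"{downtime_h:.4f} h/mo. ({downtime_h * 3600:.1f} s/mo.)")

# Amazon's EC2 SLA threshold is measured over a year, not a month:
sla_downtime_h = (1 - 0.9995) * 365 * 24
print(f"EC2 SLA (99.95%/yr): {sla_downtime_h:.2f} h of downtime allowed per year")
```

Note that even a "mere" 99.95% SLA still leaves room for nearly four and a half hours of downtime over a year before any credits kick in.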
Of course we all want more 9's with fewer 0's in the price tag. One way to achieve more 9's is to build apps that straddle multiple infrastructure providers across regions. Considering the work of maintaining machinery and processes for multiple providers, to say nothing of the additional technical challenges around the new failure points introduced, I'm hard-pressed to say the net effect is one less 0. If you're not familiar, Amazon infrastructure is partitioned by Region, then by AZs beneath each Region; a cross-region architecture will stay resilient even if a disaster takes out every AZ on the East Coast, yielding more 9's over time. Other infrastructure providers have a similar hierarchy with different terminology. Now rumor has it that Heroku and other PaaS vendors are working on the much harder problem of cross-region redundancy, aiming to offer it at an attractive price through their economies of scale. Until then, it looks like smaller and less mission-critical applications will have to sit tight and pray for [no] rain.
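For a rough sense of why cross-region (or cross-provider) redundancy buys more 9's in theory, here is the back-of-the-envelope version. It assumes the two deployments fail independently, which is exactly the assumption that shared tooling and the new failure points mentioned above tend to undermine, so treat the result as an upper bound.

```python
# Active-active deployment across two regions/providers, each delivering
# three 9's on its own. Independence of failures is assumed here; in practice
# shared dependencies (DNS, deploy pipeline, your own bugs) correlate them.
region_a = 0.999
region_b = 0.999

# The combined app is down only when BOTH regions are down at once.
combined = 1 - (1 - region_a) * (1 - region_b)
print(f"combined availability: {combined:.6%}")  # 99.999900% -- six 9's on paper
```

Two independent three-9's regions compound to nearly six 9's on paper; correlated failure modes and the added operational complexity are what eat into that in practice.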