Hawaiʻi's Technology Community

More 9's and fewer 0's please

We use Heroku, a PaaS (Platform as a Service) vendor, quite often for hosting small and medium-sized web applications. It makes it drop-dead simple to install, deploy, maintain, and scale your projects on Amazon AWS's infrastructure. Being a fully managed service, they also take care of ongoing hardware and OS-level maintenance, and have engineers around-the-clock monitoring and responding to infrastructure emergencies. Furthermore, they offer a compelling library of 3rd party add-ons to complement and enhance your applications. All of this comes at a price point (e.g. virtually free to hundreds a month) that is hard to beat.

June was a rough month for them though, with their infrastructure suffering two widespread application outages, raising the question: what happened and how reliable are they? Digging a little deeper, we find both incidents linked to power outages in Amazon's US East region, which is where Heroku's infrastructure exclusively lives. When they detected the first outage, Heroku began to move some of their infrastructure out of the affected Amazon AWS Availability Zone (AZ) into another AZ. About an hour into the incident most Heroku applications were back online, well before the affected AZ had recovered from the outage. In this case, Heroku definitely proved their value-add by minimizing their customers' downtime.

Looking at the latter outage, we unfortunately didn't experience the same minimizing effects. This outage stemmed from a terrible storm that sadly took several lives and left millions without power on the East Coast. It took out enough Amazon AWS infrastructure to knock Netflix, Instagram, and Pinterest offline as well, and made MSNBC front page news. During the crisis, while a few of our applications were offline, the majority of them were still online, albeit in a degraded state. Most production-level applications on Heroku have multiple "dynos" provisioned to them, and Heroku strives to automatically distribute these dynos across AZ's precisely to guard against a disaster such as this. I surmise applications totally offline either had all their dynos, or single points of failure such as a non-replicated database, provisioned in the downed AZ.

All taken into account, I must say Heroku proved their worth in both cases. An in-house ops team able to respond the way Heroku did in the first incident would run thousands of dollars a month. A cross-AZ redundant architecture, which is already built into Heroku's "dyno manifold" architecture, would have easily tacked on thousands as well in custom software architecture and engineering - simply not justifiable in small or even some medium-sized applications. It is still a bit unsatisfying though and prompts the question: how can we do better in reliability?

The IT industry frames the question more specifically as: how many 9's do I need (and can pay for)? Acts of God, human error, and other unforeseen circumstances are a fact of life. Nobody promises 100% uptime, so it becomes a question of how many 9's? "Two-9's" (99%) means you are down for no more than 7.2 hours/mo., while "Five-9's" (99.999%) means you are down for no more than 25.9s/mo. See handy chart here. Amazon's EC2 SLA ("Service Level Agreement") promises 99.95% uptime over a year, and they reimburse credits to you if they fall short of it, so it looks like customers may qualify for some credits this year ;-). Heroku does not have an SLA, but they are usually at a bit over Three-9's in a given month, with months like June bringing them down to Two-9's. So there you have it: as the market stands today, up to a few hundred dollars a month will get you two-closer-to-three-9's, while four-9's or more will cost you thousands a month in infrastructure.
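
For anyone who wants to check the arithmetic behind those figures, here is a minimal sketch. It assumes a 30-day month (which is what the 7.2 hours/mo. figure implies) and a 365-day year; the function names are just illustrative.

```python
# Assumption: a "month" is 30 days and a "year" is 365 days.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def downtime_seconds(uptime_percent: float, period_seconds: float) -> float:
    """Maximum downtime allowed in a period at the given uptime percentage."""
    return period_seconds * (1 - uptime_percent / 100)


def uptime_for_nines(nines: int) -> float:
    """Uptime percentage for N nines, e.g. 2 -> 99.0, 5 -> 99.999."""
    return 100 * (1 - 10 ** -nines)


if __name__ == "__main__":
    for n in range(2, 6):
        secs = downtime_seconds(uptime_for_nines(n), SECONDS_PER_MONTH)
        print(f"{n} nines: {secs / 3600:.4f} hours ({secs:.1f} s) per month")

    # A 99.95% target measured over a year, as in the EC2 SLA mentioned above.
    yearly = downtime_seconds(99.95, SECONDS_PER_YEAR)
    print(f"99.95% over a year: {yearly / 3600:.2f} hours")
```

Each extra 9 divides the allowed downtime by ten, which is why the jump from three to four 9's is so much harder to buy than the jump from two to three.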

Of course we all want more 9's with fewer 0's in the price tag. One way to achieve more 9's is to build apps that straddle multiple infrastructure providers across regions. Considering the work in maintaining machinery and processes for multiple providers and to say nothing of the additional technical challenges around new failure points introduced, I'm hard-pressed to say the net effect is one less 0. If you're not familiar, Amazon infrastructure is partitioned by Region, then by AZ's beneath a Region. A cross-region architecture will be resilient even if a disaster takes out all AZ's on the East Coast, resulting over time in more 9's. Other infrastructure providers have a similar hierarchy but different terminology. Now rumor has it that Heroku and other PaaS vendors are working on the much harder problem of cross-region redundancy, and being able to offer that at an attractive price using their economies of scale. Until then, it looks like smaller and less mission-critical applications will have to sit tight and pray for [no] rain.
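
To see roughly why straddling regions or providers buys more 9's, here is a back-of-the-envelope sketch. It rests on a strong assumption that is mine and not something any provider guarantees: that the deployments fail independently and that failover between them always works. Real outages are often correlated, and the failover machinery is itself one of the new failure points mentioned above, so treat the result as an optimistic upper bound.

```python
def combined_availability(availabilities):
    """Probability that at least one deployment is up.

    Assumes failures are independent and failover is perfect (hypothetical).
    """
    all_down = 1.0
    for a in availabilities:
        all_down *= 1.0 - a  # chance that this deployment is down too
    return 1.0 - all_down


if __name__ == "__main__":
    # Two independent deployments at two 9's each.
    print(f"{combined_availability([0.99, 0.99]):.4%}")
    # One deployment at three 9's paired with one at two 9's.
    print(f"{combined_availability([0.999, 0.99]):.5%}")
```

On paper, two two-9's deployments combine into four 9's. The catch is the operational cost of running both, which is exactly the extra 0 in the price tag.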

Comment

Comment by Cameron Souza on July 7, 2012 at 6:38am: Despite the bad luck, I think Amazon's ability to recover quickly was impressive. They are still much better than nearly all on-site solutions I've seen.

Comment by Paul Graydon on July 6, 2012 at 1:23pm: Alex, my intention is not to be hard on NoSQL, it was more just an observation from the fallout of last Friday.

There are all sorts of good uses for NoSQL stores, and they're being used in the right way all over the place. RDBMSs should never be seen as the be-all and end-all of data storage, like they were for a while.

The problem is people keep using them wrong, too, and they seem to get all vocal and rant a lot when they suddenly get hurt. There was a whole bunch of that on Friday & Saturday.

MongoDB is the biggest target because it seems to have picked up an almost cultish following. People have been switching to it when their RDBMS is slow, instead of either tuning their RDBMS or considering if they're using the RDBMS incorrectly. It's quick and simple to set up MongoDB, particularly if you don't bother to learn much about it, or learn how to use it correctly.

Worse, there has been an almost ongoing e-peen measuring contest as to how fast they're able to insert data (there is an awesome BLACKHOLE engine in MySQL that's arguably faster ;)), rather than paying attention to the important things too, like reliability. Folks like those at Basho are doing great stuff with Riak for its scalability and reliability. It's dead simple to scale a Riak cluster, especially in comparison to typical RDBMS setups.

It's the same argument I'll make in response to people who insist on the stupid Windows/OS X/Linux/Whatever debates: use the right tool for the right job :)

Comment by Calin West on July 6, 2012 at 10:32am: Joseph++. It's nothing to do with SQL vs. no-SQL. It's ACID vs. non-ACID. There is a place for ACID and a place for performance optimized non-ACID databases (logging, caching, reporting, etc.). There is no golden hammer.

Comment by Joseph Lui on July 5, 2012 at 10:06pm: Hello Alex, good to hear from you. I'm not going to split hairs about what NoSQL has come to mean, but there are certainly ACID-compliant data stores with no SQL involved. What I think Paul was trying to say (and I agree with) is that the pendulum swung a bit too far away from SQL recently. Reliability with your datastore is so important that with most applications you want to start with it first, and relax it if you need to based on requirements, not the other way around. After something like a power outage, you want those guarantees afforded to you by ACID-compliance, and not a contaminated DB with countless half-committed transactions. That being said, probably neither my original post nor Paul's comments were meant to primarily be about ACID databases. So yes we should bring the focus back to how a large provider like Amazon AWS should stay vigilant minimizing single points of failure and historically demonstrated bottlenecks, plus maximizing fault tolerance, disaster recovery, transparency, and overall QoS.

Comment by Alex Salkever on July 5, 2012 at 7:20pm: Also, I think you guys are maybe being a bit too hard on NoSQL. It opens up a lot of possibilities for scalable architectures that would simply not work on ACID DBs without a massive infrastructure investment / or some really fancy coding. There are some HUGE telcos running Riak and Mongo, for example, in critical applications or business analytics.

That said, the next gen of NewSQL platforms could be insanely good and offer both ACID and the type of speed and easy scalability that NoSQL offers.

Comment by Alex Salkever on July 5, 2012 at 7:05pm: Joseph, if you dig deep into the AWS explanation to customers, you see that the outage also caused software problems on AWS not related to the power outage. Granted, shutting down a massive chunk of their infrastructure caused the problem, but folks who wanted to move their instances to other zones could not do so because they had to go through key components on EBS hosted in Amazon East Zone. I haven't seen it covered. I asked reporters about it and they did not get an answer out. So this may also be less about ACID and DBs like Mongo and more about layers of abstraction that fail poorly, as well as system architectures that have persistent single points of failure.

Comment by Daniel Leuck on July 2, 2012 at 5:23pm: Paul Graydon: ...(prime amongst most of the anguish seems to be MongoDB) aren't all they're cracked up to be.

Very true. MongoDB has its place in areas such as logging and caching, but right now, for many developers, it's a golden hammer and everything looks like a nail.

Comment by Joseph Lui on July 1, 2012 at 8:27pm: Ah cool, one non-EC2 Amazon service that made me happy through the storm was RDS. I checked a Multi-AZ instance during that time and saw that it indeed had automatically flipped its IP to the replicant.

Your last point is a good one. Infrastructure reliability has two basic dimensions: uptime and data integrity--of course ongoing data integrity, but especially post-disaster data integrity. The only common acceptable use I can think of off the top of my head for a non-ACIDic database is as a performance optimizing cache, where cache misses automatically cascade to the underlying ACIDic persisted store. That and casual games which involve a lot of users and no real money.

Comment by Paul Graydon on July 1, 2012 at 12:52pm: Netflix are probably one of the biggest poster children of the Amazon EC2 infrastructure. Their entire infrastructure runs on it. If you're interested in cloud/Amazon stuff from an operations perspective, follow https://twitter.com/#!/adrianco and keep track of their tech blog: http://techblog.netflix.com/

Amongst the things that they do to ensure up-time is run a piece of software called Chaos Monkey, which will randomly shut down parts of their production infrastructure to help them identify any single points of failure. Their infrastructure is designed to continue to operate even when components die, such as search or the recommendation engine. They have extensive monitoring and automated systems in place to automatically handle any failure condition.

Good chunks of Netflix went down hard on Friday night. Those in the other AZs were okay, but yet again Amazon's back-plane proved incapable of handling the incoming requests. Netflix were sending requests to AWS to change the IPs on the Elastic Load Balance service to remove those in the troubled zone but found it to be stuck, completely unresponsive.

With almost every major problem that's happened in their infrastructure in the last few years, Amazon has been unable to cope with the in-bound provisioning and reconfiguration requests, to the point where the back-plane has become completely unresponsive. Worse, they've made no attempt to provide any real reassurance of intent to increase the capacity.

Amongst some of the debate back and forth on Twitter were discussions about relying on Amazon's non-EC2 services. It's nice and attractive to be able to do your DNS (Route 53), load balancing (ELB) and persistent storage (EBS) with them, but if you can't change them whenever, wherever, regardless of the circumstances, you're a little screwed :)

What I found a little amusing in an admittedly smug and arrogant way was the number of people who suddenly discovered the importance of ACID with their databases, and why some of these newer DB stores (prime amongst most of the anguish seems to be MongoDB) aren't all they're cracked up to be. Speed is one thing, data reliability is ultimately more important though.
