Hawaiʻi's Technology Community

Okay, who broke EC2?

Day 2 and I still can't check in on 4square. Oh calamity.

Seriously though, what a mess. I think this is a real black eye for cloud computing.

Replies to This Discussion

Permalink Reply by Les Vogel on April 22, 2011 at 6:29pm

I've heard that in the press a lot today. I'd suggest it's more likely a learning curve issue for Amazon. It's one way that Google may have a leg up on them.

Permalink Reply by Brian on April 22, 2011 at 6:31pm

Well, Google has had outages also. I don't think in the grand scheme of things this will change migration to cloud much. It does mean that people need to understand SLAs better, along with the impact of externalizing this particular risk.

Still though... there must be a lot of freaked out people. It may speak for keeping an off-site barebones infrastructure (one that can preserve your process/information, though perhaps not with the capacity you need to actually run it).

Permalink Reply by Paul Graydon on April 22, 2011 at 7:47pm

I wouldn't necessarily call this a black eye for cloud computing. I would definitely call it a wake-up call though. A single point of failure is still a single point of failure, even if it's in the cloud. Leaving your infrastructure dependent on a single location or provider is leaving yourself open for a possible complete failure like we've seen with Amazon over the past few days. It's been standard practice for a good long while to advise having a complete hot-spare somewhere, and preferably with another vendor, once you reach a stage where any downtime is liable to cause significant financial loss.

People have been seeing Cloud as the be-all end-all. No servers so no problems with dying hardware, merrily forgetting that there is a lot more to any infrastructure than the physical server. Being able to spawn up multiple servers doesn't mean a thing if the facilit(y|ies) you're spawning them in have no access to the internet, for example.

What really baffles me most is that there are tools out there that can present you a standardised API for spawning instances with different vendors. That takes a lot of the hassles out of a multi-vendor setup.
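To make the multi-vendor idea concrete, here is a minimal sketch of what "one API, several vendors" with failover can look like. Everything in it is an assumption for illustration: `FakeProvider`, `launch_with_failover`, and the vendor names are hypothetical stand-ins, not the API of any real tool or cloud SDK.

```python
from dataclasses import dataclass, field


class ProviderError(Exception):
    """Raised when a provider cannot launch an instance."""


@dataclass
class FakeProvider:
    """Hypothetical stand-in for a vendor SDK; `healthy` simulates an outage."""
    name: str
    healthy: bool = True
    launched: list = field(default_factory=list)

    def launch(self, image: str) -> str:
        if not self.healthy:
            raise ProviderError(f"{self.name} is unavailable")
        instance_id = f"{self.name}-{len(self.launched) + 1}"
        self.launched.append(instance_id)
        return instance_id


def launch_with_failover(providers, image):
    """Try each provider in order; return the first instance that starts."""
    errors = []
    for provider in providers:
        try:
            return provider.launch(image)
        except ProviderError as exc:
            errors.append(str(exc))
    raise ProviderError("all providers failed: " + "; ".join(errors))


primary = FakeProvider("vendor-a", healthy=False)  # simulated outage
secondary = FakeProvider("vendor-b")
instance = launch_with_failover([primary, secondary], "web-image")
print(instance)
```

Because the calling code only depends on a `launch` method, swapping or adding a vendor does not touch the failover logic, which is the point being made about standardised APIs.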

Permalink Reply by Brian on April 22, 2011 at 9:41pm

Well, I guess a black eye in the sense that it leaves a mark but doesn't take them down. Anyway, I don't want to get too wrapped up in analogies.

I agree with you though. I've been saying for a while that vendor lock-in is a huge risk with cloud computing. I think public/private clouds are part of the answer (where the public infrastructure provides capacity spillover, but you have your own clouds as well), and/or basically 'multihoming' your cloud infrastructure.

Permalink Reply by Fred Baclig on April 23, 2011 at 12:46am

Anyone else curious why companies like Quora, Reddit, or Foursquare haven't deployed instances in other regions as fail-safes?

Permalink Reply by Paul Graydon on April 23, 2011 at 9:40am

Reddit went into quite some detail about a month back, after they last had a several-hour outage caused by Amazon's EBS: http://blog.reddit.com/2011/03/why-reddit-was-down-for-6-of-last-24... In short, there are a number of architectural problems with reddit that lock it to a single geographic area. They do use multiple availability zones, which is Amazon's phrase for multiple data centers, but they're all in the same geographic area, Northern Virginia, and it was the entire area that was affected. There is a follow-up post covering the past few days: http://www.reddit.com/r/announcements/comments/gva4t/on_reddits_out... They are making changes, but it's a slow process.

Conde Nast, who own them, seem to be a typical large old media company: slow and unwilling to invest. Being bought by them was probably a really bad decision. Other sites in their position could easily push for VC funding to help them scale and stabilise the service, and to give them breathing time to sort out monetisation. Conde Nast put them under a hiring freeze back in 2009, at the start of a period of huge traffic growth for reddit, wanting monetisation sorted out first. Reddit Gold and a sudden change in management above them have finally started to allow them to hire new devs.

Ketralnis laid quite the smackdown on Amazon and the problems that it's caused Reddit: http://www.reddit.com/r/blog/comments/g66f0/why_reddit_was_down_for...

Fred "derfie" Baclig said:

Anyone else curious why companies like Quora, Reddit, or Foursquare haven't deployed instances in other regions as fail-safes?

Permalink Reply by Brian on April 25, 2011 at 7:37pm

Great link, thanks for sharing Paul.

I guess in the long run these are really just hiccups that everyone needs to learn from.

Permalink Reply by Paul Graydon on April 25, 2011 at 9:44pm

One other interesting write-up from today came from one of the developers behind the StackExchange websites, talking about Chaos Monkey, a concept that has fascinated me on and off for a couple of months since I first heard of it. Netflix uses AWS for all their hosting and, despite the outage, didn't suffer any downtime at all. Chaos Monkey, a service that deliberately breaks their infrastructure, is part of why:

http://www.codinghorror.com/blog/2011/04/working-with-the-chaos-mon...
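As a rough illustration of the idea (and not Netflix's actual implementation), the sketch below kills a random instance in a toy pool, lets the pool replace it, and checks that capacity never stays below a minimum. The `Pool` class, `chaos_round` function, and instance names are all hypothetical, made up for this example.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Pool:
    """A toy pool of interchangeable instances behind one service."""
    min_healthy: int
    instances: set = field(default_factory=set)
    spawned: int = 0  # how many replacements have been started so far

    def kill(self, instance_id):
        self.instances.discard(instance_id)

    def heal(self):
        """Start replacements until the pool is back at min_healthy."""
        while len(self.instances) < self.min_healthy:
            self.spawned += 1
            self.instances.add(f"replacement-{self.spawned}")


def chaos_round(pool, rng):
    """Deliberately terminate one randomly chosen instance."""
    victim = rng.choice(sorted(pool.instances))
    pool.kill(victim)
    return victim


rng = random.Random(42)  # seeded so runs are repeatable
pool = Pool(min_healthy=3, instances={"web-1", "web-2", "web-3"})
for _ in range(5):
    victim = chaos_round(pool, rng)
    assert victim not in pool.instances
    pool.heal()
    assert len(pool.instances) >= pool.min_healthy
```

The value is less in the killing than in the assertion after it: if the system cannot recover from a routine, self-inflicted failure, that is found out on a quiet day rather than during a real outage.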

Permalink Reply by J. David Beutel on May 9, 2011 at 1:26pm

Interesting post mortem on that AWS outage: http://aws.amazon.com/message/65648/

and how Netflix avoided most of that pain:

http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aw...

Permalink Reply by Brian on May 11, 2011 at 6:49pm

I really see Netflix's approach as OO interfaces applied at the system-component level rather than merely the software level. Great to see it worked for them. I love trying to break things and have usually been disappointed at how this is looked at as an undesirable thing in system development.
