TechHui

Hawaiʻi's Technology Community

Day 2 and I still can't check in on 4square. Oh calamity.

 

Seriously though, what a mess. I think this is a real black eye for cloud computing.

Replies to This Discussion

I've heard that in the press a lot today.  I'd suggest it's more likely a learning-curve issue for Amazon.  It's one area where Google may have a leg up on them.

Well, Google has had outages too. I don't think this will change migration to the cloud much in the grand scheme of things - it does mean that people need to understand SLAs better, along with the impact of externalizing this particular risk.

 

Still though... there must be a lot of freaked-out people. This may speak for an off-site bare-bones infrastructure (one that can preserve your process/information, though perhaps not with the capacity you need to actually run it).

I wouldn't necessarily call this a black eye for cloud computing.  I would definitely call it a wake-up call though.  A single point of failure is still a single point of failure, even if it's in the cloud.  Leaving your infrastructure dependent on a single location or provider is leaving yourself open for a possible complete failure like we've seen with Amazon over the past few days.  It's been standard practice for a good long while to advise having a complete hot-spare somewhere, and preferably with another vendor, once you reach a stage where any downtime is liable to cause significant financial loss.

 

People have been seeing the cloud as the be-all and end-all: no servers, so no problems with dying hardware. That merrily forgets that there is a lot more to any infrastructure than the physical server.  Being able to spin up multiple servers doesn't mean a thing if the facilit(y|ies) you're spawning them in have no access to the internet, for example.

 

What really baffles me most is that there are tools out there that can present you a standardised API for spawning instances with different vendors.  That takes a lot of the hassles out of a multi-vendor setup.
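To make the "standardised API" idea concrete, here is a minimal sketch in Python. It does not use any real vendor SDK or abstraction library: `CloudProvider`, `FakeProvider`, `ProviderUnavailable` and `launch_with_failover` are all hypothetical names invented for illustration, and the providers are in-memory stand-ins. The only point is that once every vendor sits behind the same small interface, falling back to a second vendor is a loop rather than a rewrite.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Protocol


class ProviderUnavailable(Exception):
    """Raised by a provider that cannot currently launch instances."""


class CloudProvider(Protocol):
    """The one interface every vendor adapter is assumed to implement."""

    name: str

    def launch(self, image: str) -> str:
        """Start an instance from `image` and return its identifier."""
        ...


@dataclass
class FakeProvider:
    """In-memory stand-in for a real vendor adapter (hypothetical)."""

    name: str
    healthy: bool = True
    launched: list[str] = field(default_factory=list)

    def launch(self, image: str) -> str:
        if not self.healthy:
            raise ProviderUnavailable(f"{self.name} is down")
        instance_id = f"{self.name}-{len(self.launched) + 1}"
        self.launched.append(instance_id)
        return instance_id


def launch_with_failover(providers: list[CloudProvider], image: str) -> str:
    """Try each provider in order and return the first instance that starts."""
    errors = []
    for provider in providers:
        try:
            return provider.launch(image)
        except ProviderUnavailable as exc:
            errors.append(str(exc))
    raise RuntimeError("no provider could launch an instance: " + "; ".join(errors))


if __name__ == "__main__":
    primary = FakeProvider("vendor-a", healthy=False)  # simulate the outage
    secondary = FakeProvider("vendor-b")
    print(launch_with_failover([primary, secondary], "web-frontend"))
```

In a real setup each adapter would wrap one vendor's own client library, and data replication between vendors (the hard part) is not covered by this sketch at all.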

Well, I guess it's a black eye in the sense that it leaves a mark but doesn't take them down. Anyway, I don't want to get too wrapped up in analogies.

 

I agree with you though; I've been saying for a while that vendor lock-in is a huge risk with cloud computing. I think public/private clouds are part of the answer (where the public infrastructure provides capacity spillover, but you have your own clouds as well), and/or basically 'multihoming' your cloud infrastructure.

Anyone else curious why companies like Quora, Reddit, or Foursquare haven't deployed instances in other regions as fail-safes?

Reddit went into quite some detail about a month back, after they last had a several-hour outage caused by Amazon's EBS: http://blog.reddit.com/2011/03/why-reddit-was-down-for-6-of-last-24...  In short, there are a number of architectural problems with reddit that lock it to a single geographic area.  They do use multiple availability zones, which is Amazon's phrase for multiple data centers, but they are all in the same geographic area, Northern Virginia, and it was the entire area that was affected.  There is a follow-up post covering the past few days: http://www.reddit.com/r/announcements/comments/gva4t/on_reddits_out...  They are making changes, but it's a slow process.  Conde Nast, who own them, seem to be a typical large old-media company: slow and unwilling to invest.  Being bought by them was probably a really bad decision.  Other sites in their position could easily push for VC funding to help them scale and stabilise the service, and to give them breathing time to sort out monetisation.  Conde Nast put them under a hiring freeze back in 2009, at the start of a period of huge traffic growth for reddit, wanting the monetisation first.  Reddit Gold and a sudden change in the management above them have finally started to allow them to hire new devs.
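The difference between "multiple availability zones" and "multiple regions" can be shown with a toy model. This is only a sketch: `Placement` and `survives` are made-up names, the region and zone labels are placeholders rather than real Amazon identifiers, and real failures are rarely this clean-cut.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Where one copy of the service runs (labels are illustrative only)."""

    region: str
    zone: str


def survives(
    placements: list[Placement],
    failed_regions: frozenset[str] = frozenset(),
    failed_zones: frozenset[tuple[str, str]] = frozenset(),
) -> bool:
    """Return True if at least one placement lies outside every failed domain."""
    return any(
        p.region not in failed_regions and (p.region, p.zone) not in failed_zones
        for p in placements
    )


# Two zones in one region, which is roughly the setup described above.
multi_az = [Placement("virginia", "a"), Placement("virginia", "b")]
# The same, plus one copy in a second (hypothetical) region.
multi_region = multi_az + [Placement("other-region", "a")]

if __name__ == "__main__":
    print(survives(multi_az, failed_zones={("virginia", "a")}))   # one zone lost
    print(survives(multi_az, failed_regions={"virginia"}))        # whole region lost
    print(survives(multi_region, failed_regions={"virginia"}))    # whole region lost
```

Spreading across zones covers the loss of a single data center, but only a placement in another region (or with another vendor) covers the loss of the whole area.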

Ketralnis laid quite the smackdown on Amazon over the problems it has caused Reddit:  http://www.reddit.com/r/blog/comments/g66f0/why_reddit_was_down_for...

 


Fred "derfie" Baclig said:

Anyone else curious why companies like Quora, Reddit, or Foursquare haven't deployed instances in other regions as fail-safes?

Great link, thanks for sharing Paul.

 

I guess in the long run these are really just hiccups that everyone needs to learn from.

One other interesting write-up from today came from one of the developers behind the Stack Exchange websites. It talks about a concept that has fascinated me on and off for a couple of months, since I first heard of it: Chaos Monkey.  Netflix uses AWS for all their hosting and, despite the outage, didn't suffer any downtime at all. Chaos Monkey, a service that deliberately breaks their infrastructure, is part of why:

http://www.codinghorror.com/blog/2011/04/working-with-the-chaos-mon...
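For anyone who hasn't come across the idea, here is a rough sketch of the principle in Python. It is not Netflix's tool or its API; `Fleet`, `unleash_monkey` and `run_drill` are hypothetical names, and the "instances" are just strings in a set. The drill kills one randomly chosen instance per round and fails loudly if the service ever has nothing left to serve from.

```python
from __future__ import annotations

import random
from dataclasses import dataclass, field


@dataclass
class Fleet:
    """A toy pool of interchangeable instances behind one service."""

    desired: int
    instances: set[str] = field(default_factory=set)
    _next_id: int = 0

    def reconcile(self) -> None:
        """Start replacements until the pool is back at its desired size."""
        while len(self.instances) < self.desired:
            self._next_id += 1
            self.instances.add(f"i-{self._next_id}")

    def is_serving(self) -> bool:
        """The service counts as up while at least one instance is alive."""
        return bool(self.instances)


def unleash_monkey(fleet: Fleet, rng: random.Random) -> str:
    """Terminate one randomly chosen instance and return its identifier."""
    victim = rng.choice(sorted(fleet.instances))
    fleet.instances.remove(victim)
    return victim


def run_drill(fleet: Fleet, rounds: int, seed: int = 0) -> list[str]:
    """Kill one instance per round, checking the service survives each kill."""
    rng = random.Random(seed)
    fleet.reconcile()
    killed = []
    for _ in range(rounds):
        killed.append(unleash_monkey(fleet, rng))
        if not fleet.is_serving():
            raise RuntimeError(f"outage after losing {killed[-1]}")
        fleet.reconcile()
    return killed
```

Running `run_drill(Fleet(desired=3), rounds=5)` completes, while `run_drill(Fleet(desired=1), rounds=1)` raises, which is the whole lesson: a design with a single point of failure gets found out in the drill instead of in production.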

Interesting post mortem on that AWS outage:  http://aws.amazon.com/message/65648/

 

and how Netflix avoided most of that pain:

http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aw...

I really see Netflix's approach as OO interfaces applied at the system-component level rather than merely the software level. Great to see it worked for them. I love trying to break things, and have usually been disappointed that this is looked at as an undesirable thing in system development.


© 2024   Created by Daniel Leuck.