Amazon Web Services are down

timf · on April 21, 2011

Some quotes regarding how Netflix handled this without interruptions:

"Netflix showed some increased latency, internal alarms went off but hasn't had a service outage." [1]

"Netflix is deployed in three zones, sized to lose one and keep going. Cheaper than cost of being down." [2]

[1] https://twitter.com/adrianco/status/61075904847282177

[2] https://twitter.com/adrianco/status/61076362680745984

campnic · on April 21, 2011

"Cheaper than cost of being down." This is very insightful. Many of us look at the cost of multi zone deployments and cringe, but its a mathematics exercise. (.05 * hours in a year)*(cost of being down per hour) = (expected cost of single zone availability). Now just compare to 2-3x your single zone deployment cost. Don't forget the cost of being down per hour should include lost customers as well.

jedberg · on April 21, 2011

At their level of income, this is true.

For us, we are just now staffing up to the level where we can make the changes necessary to do the same thing.

adrianco · on April 21, 2011

I think it's incredible that you guys can run a site at all with the few people you've got. Hope it all gets better again soon.

Goladus · on April 21, 2011

I would also be shocked if Amazon isn't giving Netflix preferred pricing because it's such a high-profile customer.

adrianco · on April 21, 2011

Netflix pays standard rates for instances but uses reserved instances to pay less on bulk EC2 deployments

redthrowaway · on April 22, 2011

Are you looking to diversify across ebs or set up dedicated hosting?

xsmasher · on April 21, 2011

It's a strange algebra though; doesn't it mean the WORSE Amazon's uptime is, the more money you should give them?

misterbwong · on April 21, 2011

More accurately, the more unstable your infrastructure is, the more you will need to spend to ensure stability.

rickmode · on April 21, 2011

Spending more on AWS to increase reliability isn't necessarily a benefit to Amazon. The increased costs can them less competative.

dfranke · on April 21, 2011

I'm actually surprised if incurring 50% extra hardware costs really is cheaper than the cost of being down. If Netflix is down for a few hours, then it costs them some goodwill, and maybe a few new signups, but is the immediate revenue impact really that great? Most of Netflix's revenue comes from monthly subscriptions, and it's not like their customers have an SLA.

imajes · on April 21, 2011

Actually, they do. and Netflix proactively refund customers for downtime. Usually it's pennies on the dollar, but i've had more than refund for sub 30 minute outages which have prohibited me from using the service.

Netflix are very very sensitive to this problem because it's much harder for them to sell against their biggest competitor (local cable) since they rely on the cable to deliver their service. If the service goes down, then the cable company can jump in and say, "You'll never lose the signal on our network" -- blatantly untrue, but it doesn't matter.

When you're disrupting a market, remember that what seem trivial is in fact hugely important when you're fighting huge well-established competition :)

awj · on April 21, 2011

I'd imagine that part of this cost is reputation. The only problem I have ever had with Netflix streaming is when an agreement runs out and the pull something I or my wife regularly watch. (looking at you, "Paint Your Wagon")

I have not had a single service issue with them, ever. They do a better job at reliably providing me with TV shows than the cable company does. That seems to be where they're looking to position themselves, and the reputation for always being there is hard to regain if you lose it.

adrianco · on April 21, 2011

There isn't a 50% extra hardware cost. You spread systems over three zones and run at the normal utilization levels of 30-60%. If you lose a zone while you are at 60% you will spike to 90% for a while, until you can deploy replacement systems in the remaining zones. Traffic spikes mean that you don't want to run more than 60% busy anyway.

brianpan · on April 21, 2011

Obviously you haven't been around my wife when she loses the last 5 minutes of a show. SLA or no, services will get cancelled.

hoop · on April 21, 2011

I don't think the cost of expanding to other regions/AZs is necessarily linear such that adding a zone would incur 50% more costs. Going from one zone to two would probably look that way (or even one server to two), but when you start going from two to three or even 10 to 11 then the %change-in-cost starts to decrease.

This is even more true if/when you load balance between zones and aren't just using them as hot backups. As another commenter pointed out, Netflix says they have three zones and only need two to operate.

darrenkopp · on April 21, 2011

Also, when there are service interruptions, they send out credits to customers.

nateberkopec · on April 21, 2011

Every decision in a business is like this - measure the cost of action A versus the cost of not-A. It's just rare that in this case, those costs are easily quantifiable.

elliottcarlson · on April 21, 2011

Are they only in three zones, or three regions? Three zones would not have helped them in this particular scenario and they would have still been at risk.

And if they do mean three regions - can that cost of spanning various regions be quantified for different companies. The money spent vs money earned for Netflix may be very different compared to Quora and Reddit. At the same time, the data synchronization needs in between regions may also vastly differ for different type of companies and infrastructures thus leading to varying amount of cost to maintain a site on multiple regions.

alexpopescu · on April 21, 2011

More comments coming from Adrian Cockcroft:

1. See slides 32-35 of http://www.slideshare.net/adrianco/netflix-in-the-cloud-2011

2. "Deploy in three AZ with no extra instances - target autoscale 30-60% util. You have 50% headroom for load spikes. Lose an AZ -> 90% util."

https://twitter.com/#!/adrianco/status/61089202229624832

prakash · on April 21, 2011

Here's the 24h latency data on EC2 east, west, eu, apac: http://dl.dropbox.com/u/1898990/EC2-multiple-zones-24h.png

Last 60 minutes comparison data: http://dl.dropbox.com/u/1898990/EC2-multiple-zones-60m.png

time in GMT.

A study we (Cedexis) did in January comparing multiple ec2 zones and other cloud providers: (pdf) http://dl.dropbox.com/u/1898990/76-marty-kagan.pdf

jjm · on April 21, 2011

Pure opinion: That convergence might show that Amazon tried to do a failover on a DC level. Once they figured that wouldn't work or east was down for the count they just let it cycle to the ground under latency.

olahav · on April 23, 2011

Yes - It is all business decisions. As someone said already an instance on AWS can cost up to 7X a machine you own on co-location. here is how outbrain manages it's multi datacenter architecture while saving on Disaster recovery headroom. http://techblog.outbrain.com/2011/04/lego-bricks-our-data-ce...

asymptotic · on April 21, 2011

Amazon's EC2 SLA is extremely clear - a given region has an availability of 99.95%. If you're running a website and you haven't deployed across across more than one region then, by definition, your website will have 99.95% availailbility. If you want a higher level of availability use more than one region.

Amazon's EBS SLA is less clear, but they state that they expect an annual failure rate of 0.1-0.5%, compared to commodity hard-drive failure rates of 4%. Hence, if you wanted a higher level of data availability you'd use more than one EBS volume in different regions.

These outages are affecting North America, and not Europe and Asia Pacific. That's it. Why is this even news? Were you expecting 100% availability?

hoop · on April 21, 2011

    Amazon's EC2 SLA is extremely clear -
    a given region has an availability of 99.95%.
    If you're running a website and you haven't
    deployed across across more than one region then,
    by definition, your website will have 99.95%
    availailbility. If you want a higher level of
    availability use more than one region.

Good point.

let P(region fails) = 0.05% and let's assume (and hope) that the probability of failure of one region is independent of the state of the other regions.

P(two regions fail) = P(one region fails and another region fails) = P(region fails) * P(region fails) = 0.05% * 0.05% = 0.0025%

Making your availability = 100% - 0.0025% = 99.9975%

Ultimately it's more of a business decision if you want to pay for the extra 0.0475% of availability. I would think (or hope) that most engineers would want it anyway.

The numbers at this size appear insignificant. How would one (say an engineer) convince "the management" that the extra 0.0475% of availability is worth the investment/expense?

jerf · on April 21, 2011

"let's assume (and hope) the probability of failure of one region is independent of the state of the other regions."

In practice, that's not true, and it's not true enough to ruin the entire rest of your calculations. For Amazon regions to function independently, they'd have to be actually, factually independent; there is no interaction between them. The reaction to one node going down is never to increase the load on other nodes as people migrate services, etc. There's fundamentally nothing you can do about the fact that if enough of your capacity goes out then you will experience demand in excess of supply.

If you want true redundancy you will at the very least need to go to another entirely separate service that is not Amazon... and if enough people do that, they'll break the effective independence of that arrangement, too.

(This is a special case of a more general rule, which is that computers are generally so reliable that the ways in which their probabilities deviate from Gaussian or independence tends to dominate your worst-case calculations.)

hoop · on April 21, 2011

I agree with you 100% that they're not independent, but I don't know enough about the data to model the probabilities of failure and availability in a HN comment :-)

After today's event, it would certainly be interesting to see how resource consumption changed in other availability zones and at other providers during this outage.

I wonder if that could be measured passively? What I mean is, by monitoring response times of various services that are known to be in specific regions and seeing how that metric changes (as opposed to waiting on a party that has little-to-no economic benefit to release that information.)

gairbheil · on April 21, 2011

No, your decimal point is off. 0.05% * 0.05% = 0.0005 * 0.0005 = 0.00000025, or 0.000025%. It works out to an expected downtime of 8 seconds per year, instead of over 4 hours for one location.

Of course, redundancy doesn't set itself up, so there are added costs on top of Amazon.

hoop · on April 21, 2011

Thank you for this, I don't know I tried doing the math without converting the percents to decimals, I should have known better.

bartonfink · on April 21, 2011

Why wouldn't a simple expected value calculation work? You've shown that you can calculate the extra availability that subscribing to another region provides. Simply multiply the cost of an outage by the extra availability provided by an additional region that would have prevented that outage.

If expanding to another region costs more than just taking the outage, then it's categorically not a good option. If management still says no in the face of numbers that suggest yes, then that tells you that you're missing a hidden objection, and how you proceed will depend on a lot of factors specific to your situation.

hoop · on April 21, 2011

I think you're right, that would be the best way of presenting this argument to management. To do so, however, the company would need to calculate its Total Cost of Downtime (which probably isn't very complex for many companies) which is its own subject entirely IMO.

obfuscate · on April 21, 2011

> calculate its Total Cost of Downtime (which probably isn't very complex for many companies)

Not complex even factoring in reputational damage?

gphil · on April 21, 2011

> The numbers at this size appear insignificant. How would one (say an engineer) convince "the management" that the extra 0.0475% of availability is worth the investment/expense?

In businesses where physical goods are sold to customers, "the management" is generally very motivated to avoid stock-out situations in which sales are lost due to lack of inventory (even if it's only a very small percentage.) The reason for this is because they are concerned about the potential loss of customer goodwill. It seems that the same applies in this situation.

yatima2975 · on April 21, 2011

Minor nitpick, but the availability should be even better, since 1% * 1% = 0.01% the availability becomes 99.999975% - six nines, anyone?

stef25 · on April 21, 2011

Reddit's been down for several hours today, I'm sure they are already way lower 99.95%.

JonnieCache · on April 21, 2011

0.05% of one year is 4 hours, 22 minutes and 48 seconds.

cmurdock · on April 21, 2011

Reddit experienced some issues with Amazon a month ago that resulted in the site being down for almost a day. I'm pretty sure they're way below that percentage.

forgotusername · on April 21, 2011

That's conflating SLAs again, Reddit's long-running problems have been with EBS reliability.

cube13 · on April 21, 2011

Even so, .5% downtime per year is about 44 hours, and Reddit's definitely had more downtime in the last few months than that.

Of course, that's assuming that 100% of Reddit's problems were due to EBS only and not a combination of EBS, EC2 and their own code.

ssmoot · on April 21, 2011

Is this really standard practice for measuring the SLA? The contracts I've seen for a couple small businesses are generally per billable period.

Which always made sense to me. I pay you for 99.5% uptime this month. If you don't achieve it, then I get a discount, as simple as that. If your availability is below that, I don't pay full price for the billable period and then reconcile at the end of the year.

Any links or general advice on this topic anyone has I'd be pretty interested on finding out if there's a general consensus of it being done differently?

stef25 · on April 21, 2011

Don't forget to take in to account ALL of Reddit's downtime; there is quite a bit of it.

bad_user · on April 21, 2011

Which means their entire quota for this year is all gone.

yuvadam · on April 21, 2011

and then some...

asymptotic · on April 21, 2011

If the Reddit web server admins took availability seriously they would have chosen to deploy across more than one region. Do you disagree? Why do you disagree? I'm being honest, no snark involved in my questions.

dspillett · on April 21, 2011

He wasn't suggesting that all Reddit's problems are due to Amazon services, he was using Reddit's down time today as a data point illustrating that the uptime guarantee claimed for the service has not been kept this year (in fact a whole year's "permitted downtime" as implied by the 99.95% SLA may be eaten on one day). Presumably Amazon will be handing out some refunds and other compensation (assuming the SLA isn't of the toothless "it'll be up at least 99.95% of the time, unless it isn't" variety).

Perhaps the Reddit admins decided that "up to 0.05%" downtime permitted by the SLA would be acceptable, compared to the extra expense of using more of Amazon's services (and any coding/testing time they may have needed to take advantage of the redundancy depending on how automatic the load balancing and/or failover are within Amazon's system) to improve their redundancy. By my understanding the promise isn't 99.95% if you use more than one of our locations, it is 99.95% at any one location, so the fact that Reddit don't make use of more than one location is irrelevant when talking about the one location they do use not meeting the expectations listed in the SLA.

I'm not saying Reddit's implementation decision is right (I don't have the metrics available to make such a judgement) but it would have been made based partly on that 99.95% figure and how much they trusted Amazon's method of coming to that figure as a reliability they could guarantee. If I had paid money for a service with a 99.95% SLA, unless the SLA had no teeth, I would be expecting some redress at this point (though there is probably no use nagging Amazon about that right now: let them concentrate on fixing the problem and worry about explanations/blame/compo later once things are running again).

trotsky · on April 21, 2011

Very few cloud SLA's seem to have teeth to me. Amazon's SLA gives service credit equal to 10% of your total bill for the billing period if they blow past the 0.05%. This is a lot better than some cloud providers that will simply prorate the downtime, but pretty crappy in terms of actual business compensation. It's equivalent to a sales discount almost any organization with a sales staff could write without thinking about it - meaning Amazon is still making money on every customer even when they've blown past their SLA - assuming every single customer fills out the forms to apply for the discount. Hint: Many won't, see mail in rebates.

A number of tier 1 network providers offer certain customers SLA's that are clearly in place to prove that they invest in redundancy and disaster planning. ex: less than 99.99% --> 10% credit. less than 99.90% --> no charges for the circuit in the billing period.

This reflects an understanding that downtime can hurt your business/infrastructure far in excess of the measurable percentage.

jedberg · on April 21, 2011

> If the Reddit web server admins took availability seriously

We do.

> they would have chosen to deploy across more than one region.

It's far too costly to do that. We are deployed across multiple AZs, but this failure hit multiple AZs.

jbellis · on April 21, 2011

Why is it more expensive to deploy in zones X,Y in regions A,B than zones M,N in region C? I assume you don't just mean "US West is ~10% more expensive than US East."

jedberg · on April 21, 2011

It's the combination of the extra cost of having machines in US West plus the cost of keeping the data synchronized between them (which is a lot) plus the added development overhead of making sure that things work cross region.

We'll get there one day, but we aren't there yet.

josephb · on April 21, 2011

> but this failure hit multiple AZs.

Gah! You can't always account for all the failure modes that Amazon might have.

asymptotic · on April 21, 2011

This seems to be a prevalent misunderstanding. Amazon's EC2 SLA of 99.95% applies at the scope of a region. A region may contain more than one availability zone. Hence, deploying on multiple availability zones still only affords the 99.95% availability level.

Yes, multi-region availability on AWS is hideously expensive. However, some organisations value an availability of greater than 99.95% enough to warrant such a multi-region deployment. Clearly reddit, and many, many other AWS users, do not. This isn't a value call on my part; I definitely couldn't afford the inter-region data transfer costs, all I know is that AWS offers you the tools to deploy high availability web services.

BrandonM · on April 21, 2011

Why does Reddit really need 99% availability? Is a customer unduly harmed or is the world even worse off if Reddit is down for a couple cumulative days per year? Is it worth the cost? Would you put up with more ads and/or pay for Reddit in order to make sure that it's available 24/7/365?

ry0ohki · on April 21, 2011

Probably not as much for the customer as for the company. When sites are unreliable, people end up going to the more reliable competitors as they arise.

shasta · on April 21, 2011

I wouldn't think there a large number of customers deciding "this is too unreliable, I'm leaving" on the basis of a few hours of down time. On the other hand, there might be a large number of people who, upon finding your site down, decide to visit alternatives that are up at the time, and some of those people might decide they like the alternatives better.

scorpion032 · on April 21, 2011

If any site can take down time and not lose users, reddit can. And it has.

stef25 · on April 21, 2011

Not qualified to speak about what Reddit should or should not do about the arrangement with Amazon. I have read several posts, including one by an (ex) Reddit employee saying Amazon is not delivering what they said they would, that much is clear. I really doubt all their downtime is part of the SLA.

bad_user · on April 21, 2011

The whole point of AWS is to forget about maintaining hardware infrastructure.

Amazon are the ones who should have made backups in multiple regions, and transfer the load on failure.

jasonkester · on April 21, 2011

Actually, the whole point of AWS is to have options for using hardware that you don't own. They don't offer any magic "all your stuff in one package, guaranteed to work all the time" service. So yes, you do still need to think about your hardware infrastructure. You just don't have to own it.

And Amazon does have all their stuff available in multiple regions. It's up to you to use it though.

ceejayoz · on April 21, 2011

> The whole point of AWS is to forget about maintaining hardware infrastructure.

If that were the case, you wouldn't be presented with region and availability zone options.

JonnieCache · on April 21, 2011

Then it would be much more expensive at the bottom tiers, meaning I wouldn't be able to play with it on a whim without thinking about the money. That would suck.

edanm · on April 21, 2011

"[S]hould" is the wrong word here. Clearly, they don't maintain such backups. This is clear to anyone using their service. They pay for the service anyway, so apparently it's still worth it to them, even without auto-backups.

Would it make sense for Amazon to maintain automatic backups (and potentially charge more for them)? I don't know. It might make business sense, it might not. But their service is apparently popular enough even without it.

ayb · on April 21, 2011

Not sure if I could say Amazon should be doing that - but I'd love it if other value added providers (such as Heroku) could implement this.

blhack · on April 21, 2011

http://cache-scale.appspot.com/c/www.reddit.com/

That's a cache of it. I really wish that the admins at reddit would implement something like this themselves, then link to it when downtime like this happens.

wallnutboy · on April 21, 2011

They do have a read-only mode, don't they? I'm not sure why they don't enable read-only mode when things like this happen. It may be that Amazon's service being down forbids this. I dunno.

groovylick · on April 21, 2011

They have a read-only mode for "free" which is their akamai cache that the unlogged in users see.

evolution · on April 21, 2011

do you have any stats of this app in terms of total data stored, bandwidth required per day, requests per second?

blhack · on April 21, 2011

I don't, unfortunately, because I didn't write it :(

_4vyi · on April 21, 2011

Reddit being down is not news.

jedberg · on April 21, 2011

That hurts. But you're right, we've had a lot of issues.

I think the reason this is news is because it is a massive Amazon failure.

Peroni · on April 21, 2011

Reddit gets a lot of grief for stability issues but the fact is it is an immensely popular site that a huge number of people have a close affiliation to. A massive percentage of these people spend a significant portion of their day browsing Reddit, interacting with other redditors, etc. and for the site to be down for as long as it has is news, regardless of similar issues occuring in the past.

The main reason this is news is because this is an Amazon issue but also because tens of thousands of people who frequent the site regularly are now aimlessly browsing the internet in the hopes of finding alternative lulz and in my case some of us are even getting work done. shudder

mattdeboard · on April 21, 2011

I can't imagine how frustrating the jobs of the Reddit admins must be.

jasonlotito · on April 21, 2011

Is admin supposed to be plural? I mean, do they really have multiple system admins now? I ask, only because I know people have been coming and going recently.

Frankly, for the size of the site, they do really, really well for the limited resources they have.

mattdeboard · on April 21, 2011

I meant admins in the more general purpose sense of administrators, people who are paid to maintain the system. But yeah I agree, quality:resources ratio is really really high.

jedberg · on April 21, 2011

It's usually very rewarding. The awesome community is what keeps me doing it.

freakball · on April 22, 2011

Awesome, I just figured out that you kept the votes tallied during the 'downtime'. What's interesting is how clearly good and bad submissions were dichotomized when nobody had anything else to vote on.

scorpion032 · on April 21, 2011

You made Google App Engine managers cry once: http://www.theregister.co.uk/2009/07/06/dziuba_google_app_en...

Anything coming up for amazon? If not for anything else, for pure entertainment value!

davidreiss666 · on April 21, 2011

No. Not really. But we kind of miss it anyway.

bdonlan · on April 21, 2011

Note also that 0.1-0.5% refers to irrecoverable data loss, not temporary unavailability.

yuvadam · on April 21, 2011

Current status: bad things are happening in the North Virginia datacenter.

EC2, EBS and RDS are all down on US-east-1.

Edit: Heroku, Foursquare, Quora and Reddit are all experiencing subsequent issues.

guynamedloren · on April 21, 2011

Indeed they are. Right before the issues began, I pushed a bad update to one of my Heroku apps, causing it to crash. A minute later I fixed the bug, re-pushed the git repo to Heroku... and nothing. I've been stuck with an error message on my website for hours. Unfortunate timing!

jrnkntl · on April 21, 2011

Staging servers are an easy thing on Heroku :)

tlholaday · on April 21, 2011

You have my sincere sympathy.

samratjp · on April 21, 2011

http://status.heroku.com/ for those on Heroku

mike-cardwell · on April 21, 2011

Is it just me, or is their status page down?

bskari · on April 21, 2011

I know you probably have your answer by now, but this site always helps me out when I have that question:

http://www.downforeveryoneorjustme.com

guynamedloren · on April 21, 2011

It's been on-and-off for the past few hours.. It's up right now (for me, anyway).

_d8fd · on April 21, 2011

This morning from 5a - 6a Pacific time I was able to access my Heroku app just fine.

JonnieCache · on April 21, 2011

I can access all my heroku apps that have their own DNS. Anything with a .heroku.com subdomain is down for me. Frustrating, knowing that the apps are still running but aren't routable.

cooldeal · on April 21, 2011

Yay, cloud.

JonnieCache · on April 21, 2011

Please do not make content-free posts such as this. It adds no value to the conversation and is only noise.

If you actually wish to make a useful point about the practicality or otherwise of massively virtualised systems for webapp deployment, please do. It's going to take more than two words though.

cooldeal · on April 21, 2011

I guess you haven't seen the Microsoft ads about the cloud? http://www.youtube.com/watch?v=Lel3swo4RMc

Anyway, my bad, I was just trying to make a joke to lighten up the mood. Sorry.

JonnieCache · on April 21, 2011

You're right, I hadn't. Fair enough.

recoiledsnake · on April 21, 2011

I laughed, it's relevant and puts things in perspective if you had seen the ads. So it's not really content free even if it's just two words.

mtogo · on April 21, 2011

Please do not make content-free posts such as this. It adds no value to the conversation and is only noise.

If you actually wish to bitch about a post that we all got the point of, you're going to need more than two paragraphs.

antonioe · on April 21, 2011

Foursquare is up now. Quora not showing the 503 anymore, Reddit still down.

jedberg · on April 21, 2011

We rely heavily on EBS still, so this is hurting us more than most others. Hopefully they'll have us back up soon.

bvi · on April 21, 2011

You guys might have answered this in one of your AMAs/blog posts (or was it raldi who commented?), but what options can reddit resort to should this stuff happen again to this degree of severity?

jedberg · on April 21, 2011

We're moving away from the EBS product altogether. The hard part is dealing with the master databases. Normally I'd have a master database with a built in raid-10, but I can't do that on EC2, so I have to come up with another option.

So I guess that is the long way of saying that hopefully it won't happen again.

jjm · on April 21, 2011

I do not believe you could be effective by moving away from EBS, you know without giving up quite a bit.

Doing things the right way with EC2 means using EBS. It's the brake caliper to the rotor. Sure you could have drum brakes but they're not nearly anywhere effective as they quickly get heat soaked. I'm referring to S3.

One should trust ephemeral storage. Your instance can go down at anytime. Write speeds to S3 are not nearly as fast as ephemeral or EBS arrays (raid).

Hate to say it, but If one cannot trust EBS then what the heck are 'we' doing on EC2... EBS quality should be priority one, otherwise we're all building Skyscrapers on foam foundations of candy cane rebar.

joshhart · on April 21, 2011

I can't say whether much has changed within the last year, but when I worked at FathomDB we had serious issues with EBS. You couldn't trust it. Odd things would happen like disks getting stuck in a reattaching state for days and disks having poor performance.

epi0Bauqu · on April 21, 2011

How do you move away from EBS and still deal with large data?

obfuscate · on April 21, 2011

Not sure what you had in mind by "large", but instance storage goes up to 1.7TB: http://aws.amazon.com/ec2/instance-types/

Devilboy · on April 22, 2011

The reason Reddit uses RAID10 is for performance, not disk size. A single instance storage device is just too slow for the Reddit database.

obfuscate · on April 22, 2011

Many instance types have 2 or 4 virtual disks (presumably on different physical disks).

joshhart · on April 21, 2011

I imagine they'd do consider some combination of the following (sorted by most likely)

1. Sharding data 2. Pulling tables out to other servers from the main DB 3. Pruning excessive data 4. Compressing data

jjm · on April 21, 2011

It still has to be stored somewhere though right? If it's EBS you've just made yourself a complicated solution that will eventually fail all over again. No?

joshhart · on April 21, 2011

If the data is sharded, then the data/server is small enough enough to fit within the individual server's disk and you no longer need EBS to store it.

shrike · on April 22, 2011

NOTE: I work for Gluster.

We have had a lot of success stabilizing EBS by creating mdadm arrays out of lots of smaller EBS volumes. There is minimal additional costs and you can get better performance, stability, and protection (RAID 5, 6).

Gluster makes an OSS distributed filesystem that runs across availability zones, our AMI (not OSS) builds multiple RAID arrays on each instance then spreads the filesystem across instances in multiple AZs. Send me an email if you want to chat.

oomkiller · on April 21, 2011

Please tell us how you plan on moving 700 EBS volumes to something completely different. It sounds amazing.

bad_user · on April 21, 2011

Foursquare is up, but Quora is still showing the 503.

antonioe · on April 21, 2011

I stand corrected. 4sq down again.

wmoxam · on April 21, 2011

Not all EC2 & EBS instances are down. I have several in US-east-1a and 1 is down, while all of the others are working.

jedberg · on April 21, 2011

Same with us. About 10% of our 700+ volumes are having problems right now.

It's hard to tell for sure since there isn't any load.

nostromo · on April 21, 2011

We're fine on EC2 -- but everything on RDS seems to be giving us big problems. We started a few backups before we new it was systemic, and all of them are stuck at 0%. We also tried spinning up new instances -- and they're all stuck in booting.

obfuscate · on April 21, 2011

I think which physical data center "us-east-1a" etc. corresponds to differs from user to user, to load-balance given that people will probably be more likely to use 1a than the other zones.

agotterer · on April 21, 2011

We had about 45 min of downtime around 4am EST. Our RDS instances, EBS backed and normal instances all returned without problems. We are in Virginia us-east-1a and us-east-1b.

potomak · on April 21, 2011

Also Cuorizini (http://cuorizini.heroku.com) is down!

dsl · on April 21, 2011

4/21/2011 is "Judgement Day" when Skynet becomes self aware and tries to kill us all. http://terminator.wikia.com/wiki/2011/04/21

I am just a little freaked out right now.

jedberg · on April 21, 2011

Don't worry. If skynet is in EC2, we'll be fine.

HelloBeautiful · on April 21, 2011

You don't understand, Skynet is using all Amazon resources, hence the outages ;-)

Amazon have stated many times that amazon.com itself runs mostly on AWS platform, but it works fine now ...

cicloid · on April 21, 2011

AWS platform on a private cloud, it is not the same as AWS platform for us commoners.

iamwil · on April 21, 2011

I wonder if it's a virus or worm whose activation date was today due to the fact above, found its way into the amazon servers.

Probably just a coincidence.

crb · on April 21, 2011

Do we all regret letting GLaDOS reboot yet?

crb · on April 21, 2011

Sigh, Portal jokes were so much more popular on Reddit. :)

mindcrime · on April 21, 2011

HN really isn't the place for internet memes, jokes about pop culture, and things that are judged trivial / frivolous. Part of what makes the HN community what it is, is a focus on high-quality, reasoned, rational discourse. IOW: HN != Reddit

ig1 · on April 21, 2011

A couple of hours into the failure, and no sign of coverage on Techcrunch (they're posting "business" stories though). It shows how detached Techcrunch has become from the startup world.

Edit: I tweeted their European editor about it and he's posted a story up now.

dwc · on April 21, 2011

Perhaps this isn't really news. These days it's normal.

dwc · on April 21, 2011

It's ugly, but true enough. You don't have to like it to acknowledge it. It's just another cloud outage bringing down one or more high profile sites. It's a "dog bites man" story.

mcritz · on April 21, 2011

This feels the same way as hearing that the whole Internet just got shut down.

kylec · on April 21, 2011

I guess this is one Reddit outage that can't be blamed on poor scaling

jedberg · on April 21, 2011

Thankfully, no. :)

But yeah, right now we're shutting everything down to try and avoid possible data corruption. Once they restore service, hopefully we'll be able to come back quickly.

pvh · on April 21, 2011

Hey Jedberg, if you guys aren't already rolling your own, check out fdr's WAL-E tool. It bounces postgres write-ahead logs off S3 and goes great with the new PG9 replication.

https://github.com/heroku/WAL-E

ghotli · on April 21, 2011

Thanks for this. I had designed and partially implemented this exact same thing. Do you know of this running in production anywhere?

stef25 · on April 21, 2011

Amazon is really not being kind to you guys; I sort of hope you'll find an alternative solution fast!

acangiano · on April 21, 2011

If I was Rackspace, I'd be at Reddit/Wired's headquarters already.

pbhjpbhj · on April 21, 2011

Didn't Jedburg say that they could reduce the failure by spending with Amazon.

I wonder if Rackspace really want this particular traffic burden. It seems that if Reddit choose not to pay for the load they need then you get lot's of bad press for it ... perhaps I'm seeing it wrong.

Rubbish analogy: Kinda like if I was doing a haulage business and you called out for a wheelbarrow to carry some elephants, then when the barrows broke we got bad press despite. If you'd paid for a heavy animal transport package ... OK it's all going wrong, you get the idea.

jedberg · on April 21, 2011

No, I said that we have spent all we can, and at this point we need development.

However, in this case, the outage is not because of any issues with our setup, but with Amazon.

pbhjpbhj · on April 21, 2011

>"we have spent all we can"

So is it a financial constraint with Amazon? Would you be suffering the same sorts of outages regardless of the technology on the backend or does AWS basically suck?

rm445 · on April 22, 2011

You have the wrong end of the stick, because you're missing the history of the story. Reddit have a weird budget when it comes to staffing costs versus operating costs due to their parent company's policies as a media comapny - so they have a decent budget but are massively understaffed.

Statements like the one you're quoting are in that context. Let's say you have an unlimited operating budget - you can come up with all kinds of wonderful plans for massive redundancy and zero downtime. But you can't make that happen if you're not allowed to hire any engineers or sysadmins! As far as I'm aware reddit are paying Amazon mucho dinero but still having irredeemable problems with the storage product, EBS. They are stuck on an unreliable service without the manpower to move off.

That's the story, as far as I can piece together from comments here and on reddit.

pbhjpbhj · on April 22, 2011

Ah, you see from what I read on reddit I understood that the staff shortage was simply part of Conde Nast's unwillingness to spend money on reddit and that constant downtime issues were another facet of that same problem.

It's not making money and those looking after reddit don't want to ruin it with a huge money grab - instead taking a soft approach, first just begging for money, then adding in a subscription model (freemium anyone?) and more subtle advertising by way of sponsored reddits (/r/yourCompany'sProduct type stuff).

I understand they've been hit with more staff problems just recently despite having a new [systems?] engineer start with them.

So in your view EBS is the problem regardless of finance? That was the nut I was attempting to crack. TBH I didn't expect someone at reddit to stick their neck out and say "yeah Amazon sucks" but they might have confirmed that the converse was true and they were simply lacking the necessary finance to support the massive userbase they have.

ulope · on April 21, 2011

Rackspace (and really all the "popular" US hosters) seem ridiculously expensive compared to hosting prices we have in Germany (see e.g. http://www.hetzner.de/en/hosting/produktmatrix/rootserver-pr... this is one of the biggest root server hosters in Germany).

Is this really so or are Racksoace and co. Just "boutique" offerings?

blhack · on April 21, 2011

Will you guys ever do something like this:

http://cache-scale.appspot.com/c/www.reddit.com/

but officially supported (and paid for) by you?

hysterix · on April 21, 2011

Sup jedberg, I obviously don't have nearly the level of knowledge with the intricacies of reddit, but coming from a strictly "business" standpoint, the amount of downtime reddit receives due to amazon issues is astounding. Perhaps it's time to look for alternatives?

Anyway, thanks for your time.

thirtysixred · on April 21, 2011

SHUT. DOWN. EVERYTHING.

Good plan though. :p

jedberg · on April 21, 2011

My original comment was "We're going Madagascar on the servers." Then I remembered I was on HN, not reddit. :)

JonnieCache · on April 21, 2011

>Then I remembered I was on HN, not reddit.

Right now we may as well be on reddit.

on April 21, 2011

[deleted]

jedberg · on April 21, 2011

Don't do that. HN is not the place for memes.

reddit isn't either, but we lost that battle a while ago.

hysterix · on April 21, 2011

It's refreshing for me to see you say this. God speed soldier.

davidreiss666 · on April 21, 2011

You don't need to flee the country yet.

ILIKECAKE · on April 21, 2011

Hey man.. some of us would have got it :)

rmoriz · on April 21, 2011

President Madagascar is doin' cloud biz, too?

retube · on April 21, 2011

Reminds me of that scene in Jurassic Park

retube · on April 21, 2011

Good lord. What's with all the DVs??

retube · on April 21, 2011

Ha. HN really on it's high horse today.

_tfee · on April 21, 2011

I know right? I made a hilarious joke a little while ago and jedberg yelled at me and everyone downvoted me. And I'll be very surprised if this comment doesn't get downvoted to hell too.

EDIT: I also simply greeted jedberg, and a bunch of people thought that was a good reason to downvote. Do people think there's an imminent influx of redditors, and that they have to dissuade them from becoming HNers? I don't think that's the case.

EDIT: Fuckin' called it.

on April 21, 2011

[deleted]

jedberg · on April 21, 2011

I'm here all the time. :)

stavros · on April 21, 2011

Apparently most of their problems are caused by bad EBS writes/performance, or at least so they said a few weeks ago after some particularly bad downtime.

It looks like EBS will randomly decide to switch to a few bps of performance from time to time. I would use Amazon for my startup, but these issues really make it hard to justify.

valisystem · on April 21, 2011

EBS seems to be the main problem here, I'll cite a former reddit employe (first comment on the blog that talked about EBS problems).

I don't work for reddit anymore (as of about a week ago, although I didn't get as much fanfare as raldi did), but I can tell you that they're giving Amazon too much credit here. Amazon's EBSs are a barrel of laughs in terms of performance and reliability and are a constant (and the single largest) source of failure across reddit. reddit's been in talks with Amazon all the way up to CIOs about ways to fix them for nearly a year and they've constantly been making promises that they haven't been keeping, passing us to new people (that "will finally be able to fix it"), and variously otherwise been desperately trying to keep reddit while not actually earning it.

Source: http://www.reddit.com/r/blog/comments/g66f0/why_reddit_was_d...

cache: - scroll down for comment - http://webcache.googleusercontent.com/search?q=cache:cfbs-sp...

crb · on April 21, 2011

EC2 instances only have one network interface. The public IP address you have pointing to your instance is a DNAT done somewhere further up the chain.

If you get a large network load to your instance - say, a DDoS attack - you can find you no longer have enough network capacity to talk to your EBS disks.

This is what happened to Bitbucket in 2009: http://blog.bitbucket.org/2009/10/04/on-our-extended-downtim...

stavros · on April 21, 2011

This doesn't appear to be the issue here, though. valisystem's link mentions that it wasn't an interface issue, EBS is just shit, apparently.

muppetman · on April 21, 2011

Slightly offtopic, but wasn't that post by an ex-employee as a comment? Not that the technical aspect of it wasn't fantastic, because it was, but I don't think Reddit said anything publically did they?

stavros · on April 21, 2011

It was, you are right, I misremembered. valisystem's comments above contains the reference.

muppetman · on April 21, 2011

Yes valisystem linked to it. All good, I wasn't sure if I misremembered.

thezilch · on April 21, 2011

Looks like troubles in only one availability zone.

mechanical_fish · on April 21, 2011

That seems to be incorrect. We have problem children in us-east-1b, -1c, and -1d.

justinsb · on April 21, 2011

AWS randomize the zones per account, so "your" -1b is not necessarily the same as "my" -1b. I'm only seeing problems in my -1c. Are you seeing 3 zones failing all under the same account?

If multiple AZs are down, AWS are going to have some serious explaining to do...

dangrossman · on April 21, 2011

Amazon RDS's most expensive feature is automatic, instant Multi-AZ failover to protect against this kind of situation. It's not working quite like that, which the AWS status page acknowledges. This is a major failure.

nlo · on April 21, 2011

"AWS randomize the zones per account"

Interesting. This is news to me.

I found some more info here (Google cache since alestic.com is presently unreachable): http://webcache.googleusercontent.com/search?q=cache:0jxzyFj...

jedberg · on April 21, 2011

We (reddit) are seeing failures in all zones.

redthrowaway · on April 21, 2011

If memory serves, Amazon's reporting of what zones are experiencing problems has been...optimistic...in the past.

justinsb · on April 21, 2011

AWS have now confirmed that this affects multiple availability zones. From the status page: "..impacting EBS volumes in multiple availability zones in the US-EAST-1 region"

justincormack · on April 21, 2011

Thats not good. The whole point of multiple AVs is for them to not fail at the same time. Suggests some dependencies that should not be there perhaps, or at least some correlation of something, like software upgrades. Looking for a good explanation of this; one AV going down is not a problem and should not impact anyone who is load balancing.

Joakal · on April 21, 2011

How does Reddit display the 'offline' page if it's down?

jedberg · on April 21, 2011

The server is still up, so we can serve it right out of the load balancer.

wallnutboy · on April 21, 2011

Are you able to enable the 'read-only-mode' using the same method?

clistctrl · on April 21, 2011

I was to be under the impression a great deal of Reddits issues were linked to Amazon.

mtodd · on April 21, 2011

Why is ELB not mentioned at all on the Service Health Dashboard?

We're experiencing problems with two of our ELBs, one indicating instance health as out of service, reporting "a transient error occurred". Another, new LB (what we hoped would replace the first problematic LB), reports: "instance registration is still in progress".

A support issue with Amazon indicated that it was related to the ongoing issues and to monitor the Service Health Dashboard. But, as I mentioned before, ELB isn't mentioned at all.

jen_h · on April 21, 2011

We've got a single non-responsive load balancer IP in one of our primary ELBs (others have been fine for several hours now), so while everything else for us is up & running, still have transient errors for folks that get shunted to through that one system.

The interesting thing about the ELB in a situation like this is that I believe it may, in many instances, be better to hobble along and deal with an elevated error rate if at least some of your ELB hosts are working than to re-create the entire ELB somewhere else, especially if you're a high-traffic site where you may hit scaling issues going from 0 to 60 in milliseconds (OMMV, but we've been spooked enough in the past not to try anything hasty until things get back to normal).

mtodd · on April 21, 2011

We have an identical load balancer to one that is causing problems so we're lucky enough to reroute traffic through that one instead to get to the same boxes. (The boxes serve two different APIs through two different DNS CNAMEs so we split the ELBs for future and sanity). In this case, it's helped us out. Alternatively, we would've just routed all traffic to our west coast ELBs.

mtodd · on April 21, 2011

Quote from the AWS support rep: "I can confirm that ELB has been affected by the EBS issue despite the lack of messaging on the AWS Dashboard".

potomak · on April 21, 2011

Quora says: "We'd point fingers, but we wouldn't be where we are today without EC2."

olegp · on April 21, 2011

Nice way to point fingers while saying you're not.

stevepotter · on April 21, 2011

Yes, that was by far my favorite comment. Well played.

helium · on April 21, 2011

I just launched a site on Heroku yesterday and cranked up the dynos up in anticipation of some "launch" traffic. Now, I can't log in to switch them off. Thanks EC2, you owe me $$$s

random42 · on April 21, 2011

Actually, I'd expect Heroku to not charge for when the site was down, as they are clearly not available, it does not sound fair if they charge for it.

Am I expecting too much from them?

helium · on April 21, 2011

My app on heroku is running, it's just that I can't log in to their management console to de-allocate resources that I am paying for by the hour.

random42 · on April 21, 2011

If I were you, I'd send them an email requesting this, on your behalf. At the end of the day, its their responsibility to make the console unavailable. I will be more than unimpressed if they dont see this logic here.

mike-cardwell · on April 21, 2011

Isn't that a design flaw in Heroku? Shouldn't you be able to log into Heroku and change stuff like that even if the entire of Amazons cloud service is down?

SandB0x · on April 21, 2011

Agree. You can delegate work but not responsibility.

Maxious · on April 21, 2011

> Nothing special-case here: we deploy with git push, just like any other Heroku user. Dogfooding is good for you. http://blog.heroku.com/archives/2009/4/1/fork_our_docs/

mike-cardwell · on April 21, 2011

That's all well and good, but it's no use for their customers if/when Amazon goes down.

mathrawka · on April 21, 2011

I think this is a good example of how the "cloud" is not a silver bullet to making your site always up. AWS provides a way to keep it up, but it is up to each developer to ensure that they are using AWS in a way to make sure their site can handle problems in one availability zone.

I think we will see more of a focus from big users of AWS about focusing on how to create a redundant service using AWS. Or at least I hope we will!

justinsb · on April 21, 2011

All well and good, but the elephant in the room is that multiple availability zones have failed at the same time. It looks like AWS have a single point of failure they weren't previously aware of.

jedberg · on April 21, 2011

This outage is affecting all AZ's in the East. So even a multizone setup wouldn't help for this one. Only a multiregion setup.

This outage is a lot like having your entire datacenter lose power.

mathrawka · on April 21, 2011

I thought AZs were supposed to be different physical data centers.

If that is not the case, then having a multi-region setup would be a necessity for any major sites on AWS.

Perhaps there will be a time where to truly be redundant, one would need to use multiple cloud providers. Which would be a _huge_ pain to do now I imagine, with all the provider lock-ins we have.

jedberg · on April 21, 2011

> I thought AZs were supposed to be different physical data centers.

They are. Which means this is probably a software issue or some other systemic issue.

mathrawka · on April 21, 2011

Well that isn't supposed to happen :(

I think I'll go home and wait it out there, but it appears that they are having some progress in recovering it. But our site is still affected.

6ren · on April 21, 2011

If "multi-region" means North America, Europe and Asia Pacific, doing so would also improve world-wide latency (e.g. here in Australia...).

Could you use this outage to justify switching to multi-region?