"Cheaper than cost of being down."
This is very insightful. Many of us look at the cost of multi-zone deployments and cringe, but it's a mathematics exercise.
(0.0005 * hours in a year) * (cost of being down per hour) = (expected cost of single-zone availability), where 0.0005 is the 0.05% of the year outside a 99.95% SLA.
Now just compare to 2-3x your single zone deployment cost. Don't forget the cost of being down per hour should include lost customers as well.
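The comparison above can be sketched in a few lines. The 99.95% figure matches the EC2 SLA discussed later in this thread; the cost-per-hour value is a made-up placeholder:

```python
# Rough sketch of the single-zone vs. multi-zone cost comparison.
# cost_per_hour is a placeholder and should include lost customers,
# not just immediately lost revenue.
HOURS_PER_YEAR = 24 * 365

def expected_downtime_cost(availability, cost_per_hour):
    """Expected yearly cost of outages at a given availability level."""
    downtime_hours = (1 - availability) * HOURS_PER_YEAR
    return downtime_hours * cost_per_hour

# At 99.95% availability you expect roughly 4.4 hours of downtime a year.
single_zone_risk = expected_downtime_cost(0.9995, cost_per_hour=50_000)
# Compare single_zone_risk (about $219k/yr at these assumed numbers)
# against the 2-3x extra spend of a multi-zone deployment.
```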
I'm actually surprised if incurring 50% extra hardware costs really is cheaper than the cost of being down. If Netflix is down for a few hours, then it costs them some goodwill, and maybe a few new signups, but is the immediate revenue impact really that great? Most of Netflix's revenue comes from monthly subscriptions, and it's not like their customers have an SLA.
Actually, they do, and Netflix proactively refunds customers for downtime. Usually it's pennies on the dollar, but I've had more than one refund for sub-30-minute outages that prevented me from using the service.
Netflix are very very sensitive to this problem because it's much harder for them to sell against their biggest competitor (local cable) since they rely on the cable to deliver their service. If the service goes down, then the cable company can jump in and say, "You'll never lose the signal on our network" -- blatantly untrue, but it doesn't matter.
When you're disrupting a market, remember that what seems trivial is in fact hugely important when you're fighting huge, well-established competition :)
I'd imagine that part of this cost is reputation. The only problem I have ever had with Netflix streaming is when an agreement runs out and they pull something I or my wife regularly watch. (Looking at you, "Paint Your Wagon".)
I have not had a single service issue with them, ever. They do a better job at reliably providing me with TV shows than the cable company does. That seems to be where they're looking to position themselves, and the reputation for always being there is hard to regain if you lose it.
There isn't a 50% extra hardware cost. You spread systems over three zones and run at the normal utilization levels of 30-60%. If you lose a zone while you are at 60% you will spike to 90% for a while, until you can deploy replacement systems in the remaining zones. Traffic spikes mean that you don't want to run more than 60% busy anyway.
I don't think the cost of expanding to other regions/AZs is necessarily linear such that adding a zone would incur 50% more costs. Going from one zone to two would probably look that way (or even one server to two), but when you start going from two to three or even 10 to 11 then the %change-in-cost starts to decrease.
This is even more true if/when you load balance between zones and aren't just using them as hot backups. As another commenter pointed out, Netflix says they have three zones and only need two to operate.
Every decision in a business is like this - measure the cost of action A versus the cost of not-A. It's just rare that in this case, those costs are easily quantifiable.
Are they only in three zones, or three regions? Three zones would not have helped them in this particular scenario and they would have still been at risk.
And if they do mean three regions - can that cost of spanning various regions be quantified for different companies. The money spent vs money earned for Netflix may be very different compared to Quora and Reddit. At the same time, the data synchronization needs in between regions may also vastly differ for different type of companies and infrastructures thus leading to varying amount of cost to maintain a site on multiple regions.
Pure opinion: that convergence might show that Amazon tried to do a failover at the DC level. Once they figured out that wouldn't work, or that East was down for the count, they just let it cycle to the ground under the latency.
Yes - It is all business decisions.
As someone said already, an instance on AWS can cost up to 7x a machine you own in colocation.
Here is how Outbrain manages its multi-datacenter architecture while saving on disaster recovery headroom:
http://techblog.outbrain.com/2011/04/lego-bricks-our-data-ce...
Amazon's EC2 SLA is extremely clear - a given region has an availability of 99.95%. If you're running a website and you haven't deployed across more than one region then, by definition, your website will have 99.95% availability. If you want a higher level of availability, use more than one region.
Amazon's EBS SLA is less clear, but they state that they expect an annual failure rate of 0.1-0.5%, compared to commodity hard-drive failure rates of 4%. Hence, if you wanted a higher level of data availability you'd use more than one EBS volume in different regions.
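As a rough sketch of why using more than one volume helps, assuming the quoted annual failure rates and (optimistically) independent failures:

```python
# Chance that every replica of the data fails in the same year,
# assuming independent failures (optimistic, as discussed in this thread).
def p_all_copies_fail(p_single, copies):
    return p_single ** copies

ebs_pair = p_all_copies_fail(0.005, 2)       # ~0.0025% for two EBS volumes
commodity_pair = p_all_copies_fail(0.04, 2)  # ~0.16% for two commodity drives
```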
These outages are affecting North America, and not Europe and Asia Pacific. That's it. Why is this even news? Were you expecting 100% availability?
> Amazon's EC2 SLA is extremely clear - a given region has an availability of 99.95%. If you're running a website and you haven't deployed across more than one region then, by definition, your website will have 99.95% availability. If you want a higher level of availability use more than one region.
Good point.
let P(region fails) = 0.05% and let's assume (and hope) that the probability of failure of one region is independent of the state of the other regions.
P(two regions fail) = P(one region fails and another region fails) = P(region fails) * P(region fails) = 0.05% * 0.05% = 0.0025%
Making your availability = 100% - 0.0025% = 99.9975%
Ultimately it's more of a business decision if you want to pay for the extra 0.0475% of availability. I would think (or hope) that most engineers would want it anyway.
The numbers at this size appear insignificant. How would one (say an engineer) convince "the management" that the extra 0.0475% of availability is worth the investment/expense?
"let's assume (and hope) the probability of failure of one region is independent of the state of the other regions."
In practice, that's not true, and it's not true enough to ruin the entire rest of your calculations. For Amazon regions to fail independently, they'd have to be actually, factually independent: no interaction between them at all. But the reaction to one region going down is precisely to increase the load on the other regions, as people migrate services over. There's fundamentally nothing you can do about the fact that if enough of your capacity goes out, you will experience demand in excess of supply.
If you want true redundancy you will at the very least need to go to another entirely separate service that is not Amazon... and if enough people do that, they'll break the effective independence of that arrangement, too.
(This is a special case of a more general rule, which is that computers are generally so reliable that the ways in which their probabilities deviate from Gaussian or independence tend to dominate your worst-case calculations.)
I agree with you 100% that they're not independent, but I don't know enough about the data to model the probabilities of failure and availability in a HN comment :-)
After today's event, it would certainly be interesting to see how resource consumption changed in other availability zones and at other providers during this outage.
I wonder if that could be measured passively? What I mean is, by monitoring response times of various services that are known to be in specific regions and seeing how that metric changes (as opposed to waiting on a party that has little-to-no economic benefit to release that information.)
No, your decimal point is off. 0.05% * 0.05% = 0.0005 * 0.0005 = 0.00000025, or 0.000025%. It works out to an expected downtime of 8 seconds per year, instead of over 4 hours for one location.
Of course, redundancy doesn't set itself up, so there are added costs on top of Amazon.
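Sketching that corrected arithmetic, still leaning on the dubious assumption that regions fail independently:

```python
# Expected yearly downtime for one region vs. two independent regions,
# each meeting a 99.95% SLA (a 0.05% failure budget).
SECONDS_PER_YEAR = 365 * 24 * 3600

p_fail = 0.0005                                     # 0.05% per region
one_region_down = p_fail * SECONDS_PER_YEAR         # ~4.4 hours/year
both_regions_down = p_fail ** 2 * SECONDS_PER_YEAR  # ~8 seconds/year
```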
Why wouldn't a simple expected value calculation work? You've shown that you can calculate the extra availability that subscribing to another region provides. Simply multiply the cost of an outage by the extra availability provided by an additional region that would have prevented that outage.
If expanding to another region costs more than just taking the outage, then it's categorically not a good option. If management still says no in the face of numbers that suggest yes, then that tells you that you're missing a hidden objection, and how you proceed will depend on a lot of factors specific to your situation.
I think you're right, that would be the best way of presenting this argument to management. To do so, however, the company would need to calculate its Total Cost of Downtime (which probably isn't very complex for many companies) which is its own subject entirely IMO.
> The numbers at this size appear insignificant. How would one (say an engineer) convince "the management" that the extra 0.0475% of availability is worth the investment/expense?
In businesses where physical goods are sold to customers, "the management" is generally very motivated to avoid stock-out situations in which sales are lost due to lack of inventory (even if it's only a very small percentage.) The reason for this is because they are concerned about the potential loss of customer goodwill. It seems that the same applies in this situation.
Reddit experienced some issues with Amazon a month ago that resulted in the site being down for almost a day. I'm pretty sure they're way below that percentage.
Is this really standard practice for measuring the SLA? The contracts I've seen for a couple small businesses are generally per billable period.
Which always made sense to me. I pay you for 99.5% uptime this month; if you don't achieve it, I get a discount, as simple as that. If your availability is below that, I don't pay full price for that billing period, rather than reconciling at the end of the year.
Any links or general advice on this topic? I'd be pretty interested in finding out whether there's a general consensus on it being done differently.
If the Reddit web server admins took availability seriously they would have chosen to deploy across more than one region. Do you disagree? Why do you disagree? I'm being honest, no snark involved in my questions.
He wasn't suggesting that all Reddit's problems are due to Amazon services, he was using Reddit's down time today as a data point illustrating that the uptime guarantee claimed for the service has not been kept this year (in fact a whole year's "permitted downtime" as implied by the 99.95% SLA may be eaten on one day). Presumably Amazon will be handing out some refunds and other compensation (assuming the SLA isn't of the toothless "it'll be up at least 99.95% of the time, unless it isn't" variety).
Perhaps the Reddit admins decided that "up to 0.05%" downtime permitted by the SLA would be acceptable, compared to the extra expense of using more of Amazon's services (and any coding/testing time they may have needed to take advantage of the redundancy depending on how automatic the load balancing and/or failover are within Amazon's system) to improve their redundancy. By my understanding the promise isn't 99.95% if you use more than one of our locations, it is 99.95% at any one location, so the fact that Reddit don't make use of more than one location is irrelevant when talking about the one location they do use not meeting the expectations listed in the SLA.
I'm not saying Reddit's implementation decision is right (I don't have the metrics available to make such a judgement) but it would have been made based partly on that 99.95% figure and how much they trusted Amazon's method of coming to that figure as a reliability they could guarantee. If I had paid money for a service with a 99.95% SLA, unless the SLA had no teeth, I would be expecting some redress at this point (though there is probably no use nagging Amazon about that right now: let them concentrate on fixing the problem and worry about explanations/blame/compo later once things are running again).
Very few cloud SLA's seem to have teeth to me. Amazon's SLA gives service credit equal to 10% of your total bill for the billing period if they blow past the 0.05%. This is a lot better than some cloud providers that will simply prorate the downtime, but pretty crappy in terms of actual business compensation. It's equivalent to a sales discount almost any organization with a sales staff could write without thinking about it - meaning Amazon is still making money on every customer even when they've blown past their SLA - assuming every single customer fills out the forms to apply for the discount. Hint: Many won't, see mail in rebates.
A number of tier 1 network providers offer certain customers SLA's that are clearly in place to prove that they invest in redundancy and disaster planning. ex: less than 99.99% --> 10% credit. less than 99.90% --> no charges for the circuit in the billing period.
This reflects an understanding that downtime can hurt your business/infrastructure far in excess of the measurable percentage.
Why is it more expensive to deploy in zones X,Y in regions A,B than zones M,N in region C? I assume you don't just mean "US West is ~10% more expensive than US East."
It's the combination of the extra cost of having machines in US West plus the cost of keeping the data synchronized between them (which is a lot) plus the added development overhead of making sure that things work cross region.
This seems to be a prevalent misunderstanding. Amazon's EC2 SLA of 99.95% applies at the scope of a region. A region may contain more than one availability zone. Hence, deploying on multiple availability zones still only affords the 99.95% availability level.
Yes, multi-region availability on AWS is hideously expensive. However, some organisations value an availability of greater than 99.95% enough to warrant such a multi-region deployment. Clearly reddit, and many, many other AWS users, do not. This isn't a value call on my part; I definitely couldn't afford the inter-region data transfer costs, all I know is that AWS offers you the tools to deploy high availability web services.
Why does Reddit really need 99% availability? Is a customer unduly harmed or is the world even worse off if Reddit is down for a couple cumulative days per year? Is it worth the cost? Would you put up with more ads and/or pay for Reddit in order to make sure that it's available 24/7/365?
Probably not as much for the customer as for the company. When sites are unreliable, people end up going to the more reliable competitors as they arise.
I wouldn't think there would be a large number of customers deciding "this is too unreliable, I'm leaving" on the basis of a few hours of downtime. On the other hand, there might be a large number of people who, upon finding your site down, decide to visit alternatives that are up at the time, and some of those people might decide they like the alternatives better.
Not qualified to speak about what Reddit should or should not do about the arrangement with Amazon. I have read several posts, including one by an (ex) Reddit employee saying Amazon is not delivering what they said they would, that much is clear. I really doubt all their downtime is part of the SLA.
Actually, the whole point of AWS is to have options for using hardware that you don't own. They don't offer any magic "all your stuff in one package, guaranteed to work all the time" service. So yes, you do still need to think about your hardware infrastructure. You just don't have to own it.
And Amazon does have all their stuff available in multiple regions. It's up to you to use it though.
Then it would be much more expensive at the bottom tiers, meaning I wouldn't be able to play with it on a whim without thinking about the money. That would suck.
"[S]hould" is the wrong word here. Clearly, they don't maintain such backups. This is clear to anyone using their service. They pay for the service anyway, so apparently it's still worth it to them, even without auto-backups.
Would it make sense for Amazon to maintain automatic backups (and potentially charge more for them)? I don't know. It might make business sense, it might not. But their service is apparently popular enough even without it.
That's a cache of it. I really wish that the admins at reddit would implement something like this themselves, then link to it when downtime like this happens.
They do have a read-only mode, don't they? I'm not sure why they don't enable read-only mode when things like this happen. It may be that Amazon's service being down forbids this. I dunno.
Reddit gets a lot of grief for stability issues, but the fact is it is an immensely popular site that a huge number of people have a close affiliation with. A massive percentage of these people spend a significant portion of their day browsing Reddit, interacting with other redditors, etc., and for the site to be down for as long as it has is news, regardless of similar issues occurring in the past.
The main reason this is news is because this is an Amazon issue but also because tens of thousands of people who frequent the site regularly are now aimlessly browsing the internet in the hopes of finding alternative lulz and in my case some of us are even getting work done. shudder
Is admin supposed to be plural? I mean, do they really have multiple system admins now? I ask, only because I know people have been coming and going recently.
Frankly, for the size of the site, they do really, really well for the limited resources they have.
I meant admins in the more general purpose sense of administrators, people who are paid to maintain the system. But yeah I agree, quality:resources ratio is really really high.
Awesome, I just figured out that you kept the votes tallied during the 'downtime'. What's interesting is how clearly good and bad submissions were dichotomized when nobody had anything else to vote on.
Indeed they are. Right before the issues began, I pushed a bad update to one of my Heroku apps, causing it to crash. A minute later I fixed the bug, re-pushed the git repo to Heroku... and nothing. I've been stuck with an error message on my website for hours. Unfortunate timing!
I can access all my heroku apps that have their own DNS. Anything with a .heroku.com subdomain is down for me. Frustrating, knowing that the apps are still running but aren't routable.
Please do not make content-free posts such as this. It adds no value to the conversation and is only noise.
If you actually wish to make a useful point about the practicality or otherwise of massively virtualised systems for webapp deployment, please do. It's going to take more than two words though.
You guys might have answered this in one of your AMAs/blog posts (or was it raldi who commented?), but what options can reddit resort to should this stuff happen again to this degree of severity?
We're moving away from the EBS product altogether. The hard part is dealing with the master databases. Normally I'd have a master database with a built in raid-10, but I can't do that on EC2, so I have to come up with another option.
So I guess that is the long way of saying that hopefully it won't happen again.
I do not believe you could move away from EBS without giving up quite a bit.
Doing things the right way with EC2 means using EBS. It's the brake caliper to the rotor. Sure, you could have drum brakes, but they're nowhere near as effective since they quickly get heat-soaked. (I'm referring to S3.)
One shouldn't trust ephemeral storage: your instance can go down at any time. And write speeds to S3 are not nearly as fast as ephemeral disks or EBS arrays (RAID).
Hate to say it, but if one cannot trust EBS, then what the heck are 'we' doing on EC2? EBS quality should be priority one; otherwise we're all building skyscrapers on foam foundations with candy-cane rebar.
I can't say whether much has changed within the last year, but when I worked at FathomDB we had serious issues with EBS. You couldn't trust it. Odd things would happen like disks getting stuck in a reattaching state for days and disks having poor performance.
It still has to be stored somewhere though right? If it's EBS you've just made yourself a complicated solution that will eventually fail all over again. No?
We have had a lot of success stabilizing EBS by creating mdadm arrays out of lots of smaller EBS volumes. There is minimal additional costs and you can get better performance, stability, and protection (RAID 5, 6).
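A hypothetical sketch of that setup; the device names, volume count, and RAID level here are assumptions, not a recipe:

```shell
# Assumes four EBS volumes are already created and attached to the
# instance as /dev/xvdf through /dev/xvdi.
mdadm --create /dev/md0 --level=6 --raid-devices=4 \
      /dev/xvdf /dev/xvdg /dev/xvdh /dev/xvdi

# Format and mount the array like any other block device.
mkfs.ext4 /dev/md0
mkdir -p /data
mount /dev/md0 /data
```

RAID 6 tolerates the loss of any two member volumes, at the cost of two volumes' worth of capacity and slower writes.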
Gluster makes an OSS distributed filesystem that runs across availability zones, our AMI (not OSS) builds multiple RAID arrays on each instance then spreads the filesystem across instances in multiple AZs. Send me an email if you want to chat.
We're fine on EC2 -- but everything on RDS seems to be giving us big problems. We started a few backups before we knew it was systemic, and all of them are stuck at 0%. We also tried spinning up new instances -- and they're all stuck in booting.
I think which physical data center "us-east-1a" etc. corresponds to differs from user to user, to load-balance given that people will probably be more likely to use 1a than the other zones.
We had about 45 min of downtime around 4am EST. Our RDS instances, EBS backed and normal instances all returned without problems. We are in Virginia us-east-1a and us-east-1b.
HN really isn't the place for internet memes, jokes about pop culture, and things that are judged trivial / frivolous. Part of what makes the HN community what it is, is a focus on high-quality, reasoned, rational discourse. IOW: HN != Reddit
A couple of hours into the failure, and no sign of coverage on Techcrunch (they're posting "business" stories though). It shows how detached Techcrunch has become from the startup world.
Edit: I tweeted their European editor about it and he's posted a story up now.
It's ugly, but true enough. You don't have to like it to acknowledge it. It's just another cloud outage bringing down one or more high profile sites. It's a "dog bites man" story.
But yeah, right now we're shutting everything down to try and avoid possible data corruption. Once they restore service, hopefully we'll be able to come back quickly.
Hey Jedberg, if you guys aren't already rolling your own, check out fdr's WAL-E tool. It bounces postgres write-ahead logs off S3 and goes great with the new PG9 replication.
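For anyone unfamiliar with WAL-E, the basic usage per its README looks roughly like this; the envdir path (holding AWS credentials and the S3 prefix) and the data directory are assumptions:

```shell
# In postgresql.conf, ship each completed WAL segment to S3:
#   wal_level       = archive
#   archive_mode    = on
#   archive_command = 'envdir /etc/wal-e.d/env wal-e wal-push %p'

# Then take periodic base backups of the data directory:
envdir /etc/wal-e.d/env wal-e backup-push /var/lib/postgresql/9.0/main
```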
Didn't Jedburg say that they could reduce the failure by spending with Amazon.
I wonder if Rackspace really want this particular traffic burden. It seems that if Reddit choose not to pay for the load they need, then you get lots of bad press for it... perhaps I'm seeing it wrong.
Rubbish analogy: kinda like if I ran a haulage business and you called for a wheelbarrow to carry some elephants, then when the barrows broke we got bad press, despite you never having paid for the heavy animal transport package... OK, it's all going wrong, you get the idea.
So is it a financial constraint with Amazon? Would you be suffering the same sorts of outages regardless of the technology on the backend or does AWS basically suck?
You have the wrong end of the stick, because you're missing the history of the story. Reddit have a weird budget when it comes to staffing costs versus operating costs due to their parent company's policies as a media company - so they have a decent budget but are massively understaffed.
Statements like the one you're quoting are in that context. Let's say you have an unlimited operating budget - you can come up with all kinds of wonderful plans for massive redundancy and zero downtime. But you can't make that happen if you're not allowed to hire any engineers or sysadmins! As far as I'm aware reddit are paying Amazon mucho dinero but still having irredeemable problems with the storage product, EBS. They are stuck on an unreliable service without the manpower to move off.
That's the story, as far as I can piece together from comments here and on reddit.
Ah, you see from what I read on reddit I understood that the staff shortage was simply part of Conde Nast's unwillingness to spend money on reddit and that constant downtime issues were another facet of that same problem.
It's not making money and those looking after reddit don't want to ruin it with a huge money grab - instead taking a soft approach, first just begging for money, then adding in a subscription model (freemium anyone?) and more subtle advertising by way of sponsored reddits (/r/yourCompany'sProduct type stuff).
I understand they've been hit with more staff problems just recently despite having a new [systems?] engineer start with them.
So in your view EBS is the problem regardless of finance? That was the nut I was attempting to crack. TBH I didn't expect someone at reddit to stick their neck out and say "yeah Amazon sucks" but they might have confirmed that the converse was true and they were simply lacking the necessary finance to support the massive userbase they have.
Rackspace (and really all the "popular" US hosters) seem ridiculously expensive compared to hosting prices we have in Germany (see e.g. http://www.hetzner.de/en/hosting/produktmatrix/rootserver-pr... this is one of the biggest root server hosters in Germany).
Is this really so, or are Rackspace and co. just "boutique" offerings?
Sup jedberg, I obviously don't have nearly the level of knowledge with the intricacies of reddit, but coming from a strictly "business" standpoint, the amount of downtime reddit receives due to amazon issues is astounding. Perhaps it's time to look for alternatives?
I know right? I made a hilarious joke a little while ago and jedberg yelled at me and everyone downvoted me. And I'll be very surprised if this comment doesn't get downvoted to hell too.
EDIT: I also simply greeted jedberg, and a bunch of people thought that was a good reason to downvote. Do people think there's an imminent influx of redditors, and that they have to dissuade them from becoming HNers? I don't think that's the case.
Apparently most of their problems are caused by bad EBS writes/performance, or at least so they said a few weeks ago after some particularly bad downtime.
It looks like EBS will randomly decide to switch to a few bps of performance from time to time. I would use Amazon for my startup, but these issues really make it hard to justify.
EBS seems to be the main problem here; I'll cite a former reddit employee (the first comment on the blog post that talked about EBS problems):
I don't work for reddit anymore (as of about a week ago, although I didn't get as much fanfare as raldi did), but I can tell you that they're giving Amazon too much credit here. Amazon's EBSs are a barrel of laughs in terms of performance and reliability and are a constant (and the single largest) source of failure across reddit. reddit's been in talks with Amazon all the way up to CIOs about ways to fix them for nearly a year and they've constantly been making promises that they haven't been keeping, passing us to new people (that "will finally be able to fix it"), and variously otherwise been desperately trying to keep reddit while not actually earning it.
EC2 instances only have one network interface. The public IP address you have pointing to your instance is a DNAT done somewhere further up the chain.
If you get a large network load to your instance - say, a DDoS attack - you can find you no longer have enough network capacity to talk to your EBS disks.
Slightly offtopic, but wasn't that post by an ex-employee as a comment? Not that the technical aspect of it wasn't fantastic, because it was, but I don't think Reddit said anything publicly, did they?
AWS randomize the zones per account, so "your" -1b is not necessarily the same as "my" -1b. I'm only seeing problems in my -1c. Are you seeing 3 zones failing all under the same account?
If multiple AZs are down, AWS are going to have some serious explaining to do...
Amazon RDS's most expensive feature is automatic, instant Multi-AZ failover to protect against this kind of situation. It's not working quite like that, which the AWS status page acknowledges. This is a major failure.
AWS have now confirmed that this affects multiple availability zones. From the status page: "..impacting EBS volumes in multiple availability zones in the US-EAST-1 region"
That's not good. The whole point of multiple AZs is for them not to fail at the same time. It suggests some dependencies that should not be there perhaps, or at least some correlation of something, like software upgrades. Looking for a good explanation of this; one AZ going down is not a problem and should not impact anyone who is load balancing.
Why is ELB not mentioned at all on the Service Health Dashboard?
We're experiencing problems with two of our ELBs, one indicating instance health as out of service, reporting "a transient error occurred". Another, new LB (what we hoped would replace the first problematic LB), reports: "instance registration is still in progress".
A support issue with Amazon indicated that it was related to the ongoing issues and to monitor the Service Health Dashboard. But, as I mentioned before, ELB isn't mentioned at all.
We've got a single non-responsive load balancer IP in one of our primary ELBs (the others have been fine for several hours now), so while everything else for us is up and running, we still have transient errors for folks that get shunted through that one system.
The interesting thing about the ELB in a situation like this is that I believe it may, in many instances, be better to hobble along and deal with an elevated error rate if at least some of your ELB hosts are working than to re-create the entire ELB somewhere else, especially if you're a high-traffic site where you may hit scaling issues going from 0 to 60 in milliseconds (OMMV, but we've been spooked enough in the past not to try anything hasty until things get back to normal).
We have an identical load balancer to one that is causing problems so we're lucky enough to reroute traffic through that one instead to get to the same boxes. (The boxes serve two different APIs through two different DNS CNAMEs so we split the ELBs for future and sanity). In this case, it's helped us out. Alternatively, we would've just routed all traffic to our west coast ELBs.
I just launched a site on Heroku yesterday and cranked the dynos up in anticipation of some "launch" traffic. Now I can't log in to switch them off. Thanks EC2, you owe me $$$s.
If I were you, I'd send them an email requesting this on your behalf. At the end of the day, it's their fault the console is unavailable. I will be more than unimpressed if they don't see the logic here.
Isn't that a design flaw in Heroku? Shouldn't you be able to log into Heroku and change stuff like that even if the entirety of Amazon's cloud service is down?
I think this is a good example of how the "cloud" is not a silver bullet to making your site always up. AWS provides a way to keep it up, but it is up to each developer to ensure that they are using AWS in a way to make sure their site can handle problems in one availability zone.
I think we will see more of a focus from big users of AWS about focusing on how to create a redundant service using AWS. Or at least I hope we will!
All well and good, but the elephant in the room is that multiple availability zones have failed at the same time. It looks like AWS have a single point of failure they weren't previously aware of.
I thought AZs were supposed to be different physical data centers.
If that is not the case, then having a multi-region setup would be a necessity for any major sites on AWS.
Perhaps there will be a time where to truly be redundant, one would need to use multiple cloud providers. Which would be a _huge_ pain to do now I imagine, with all the provider lock-ins we have.
A blog post last month touched on this:
"Q: Why is reddit tied so tightly to the affected availability zone?
A: When we started with Amazon, our code was written with the assumption that there would be one data center. We have been working towards fixing this since we moved two years ago. Unfortunately, progress has been slow in this area. Luckily, we are currently in a hiring round which will increase the technical staff by 200% :) These new programmers will help us address this issue."
Not sure if the costs of data transfer between regions (charged at full internet price) would justify the added reliability/lower latency though.
Quora is down, and evidently "They're not pointing fingers at EC2" --
http://news.ycombinator.com/item?id=2470119 -- I was going to post a screen shot, but evidently my Dropbox is down too.
I'm seeing 1 EBS server out of 9 having issues (5 in one availability zone, 4 in another). CPU wait time on the instance is stuck at 100% on all cores since the disk isn't responding. Sounds like others are having much more trouble.
Silver lining: Hopefully I can test my "aws is failing" fallback code. (my GAE based site keeps a state log on S3 for the day when GAE falls in a hole.)
Two years ago TechCrunch was publishing an article every time Rackspace went down listing all the hot startups down along with it. AWS is no more a SPOF than any other major hosting provider.
Wow. I can only imagine the intense frustration the site owners must be feeling right about now. It makes you stop and question the whole "cloud"-based service model. Or at least it should make you realize you need fallbacks beyond the cloud service itself.
So when big sites use Amazon Web Services for major traffic, do they get a serious customer relationship? Or is it just generic email/web support and a status page?
I would say, rather, that it is priced to have very specific sorts of customer using it.
The relationship between the pricing tiers changes fairly drastically, depending on how much you are already spending on Amazon Web Services. Gold, for instance, starts out at 4x the price of Silver support, but by the time you're spending 80K/month on services, it's only a $900 premium (and stays there no matter how much bigger your bill is). At the $150K/month level, it's a 2x jump from Gold to Platinum, which may or may not be a huge jump, considering the extra level of service you get.
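The shape described above, where a higher tier starts at a large multiple but the absolute premium flattens out on big bills, is what you get from graduated (bracketed) percent-of-spend pricing. A minimal sketch, with purely illustrative rates that are not Amazon's actual schedule:

```python
def support_cost(monthly_spend, tiers):
    """Sum support cost across graduated pricing brackets.

    `tiers` is a list of (spend threshold, rate) pairs; each rate
    applies only to the slice of spend above its threshold, up to
    the next threshold.
    """
    cost = 0.0
    for i, (threshold, rate) in enumerate(tiers):
        upper = tiers[i + 1][0] if i + 1 < len(tiers) else float("inf")
        if monthly_spend > threshold:
            cost += (min(monthly_spend, upper) - threshold) * rate
    return cost

# Hypothetical schedules (illustrative only): the higher tier charges
# 4x on the first dollars, a smaller premium in the middle bracket,
# and the same marginal rate on spend above $80K.
silver = [(0, 0.10), (10_000, 0.05), (80_000, 0.03)]
gold   = [(0, 0.40), (10_000, 0.07), (80_000, 0.03)]

# On a small bill, Gold costs 4x Silver; once the bill is large, the
# Gold premium is a fixed dollar amount no matter how big it grows,
# because the top-bracket rates are identical.
```

With schedules like these, the Gold-over-Silver premium is constant for any bill above the top threshold, which matches the "stays there no matter how much bigger your bill is" behavior the commenter describes.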
Mine is worse. I booked Tues-Thurs off. I only have internet at work at the moment. I'm going to miss reddit now and be without internet until I return to work on the 3rd of May. Stupid Sky and their stupid take-forever switchovers.
Or a South African 11 day weekend. Easter, and then public holidays on 27 April and 1/2 May are combining this year to provide a massive holiday opportunity.
UK Universities give staff the Tuesday too, giving them a five day weekend. I took two days holiday next week and because of Easter and the Royal Wedding that gives me 11 days off straight.
To clarify, this Monday is ANZAC Day, a commemorative holiday for troops who fought for Australia. Because that's also Easter Monday, the ANZAC Day holiday is moved to Tuesday, despite commemorative services being held on Monday.
Is it a 4-day holiday in Norway too?! Opera Support Forums have been sketchy for hours and if they won't come back until Tuesday, I'm SOL with my Opera problems.
Actually, I'm on the couch in front of the fireplace, watching old SNL, waiting patiently for Amazon to fix their shit, and figuring out how we can stop using EBS.
I'm pretty sure redditors would be more than happy to deal with a day or so extra downtime as you guys switched to a better platform. Just leave a simple page up saying "Dumping Amazon, brb"... doubt you'd get many complaints.
TBH Amazon is so bad at this point that turning off Reddit is as good as trying to keep it running. Of course then you need to deal with the increased suicide rate.
Periodically, most recently a couple of days ago, there's a post / discussion about whether outsourcing core functionality is the right thing to do. There are valid points on both sides of the issue.
For my part, if I'm going to be up in the middle of the night I'd rather be up working on fixing something rather than up fretting and checking status. But either way things get fixed. The real difference comes in the following days and weeks. When core stuff is in the cloud then you can try to get assurances and such, fwiw. When core stuff is in-house then you spend time, energy and money making sure you can sleep at night.
I thought you could cluster your instances across many regions, replicate blah blah blah, and change your elastic IP addresses in situations like this?
Is this a case that it's not being utilised or does that system not work?
I appreciate you are busy right now so I'm not expecting a reply any time soon.
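For what it's worth, the pattern the commenter is asking about boils down to a failover decision: given health checks on each location, pick where traffic should point. The sketch below is hypothetical and deliberately leaves out how the move is executed (elastic IP reassociation works within a region; going cross-region means something like DNS changes); the region names are just examples.

```python
def pick_active(regions, healthy):
    """Return the first healthy region from an ordered preference
    list, or None if everything is down."""
    for region in regions:
        if healthy.get(region):
            return region
    return None

def failover_plan(current, regions, healthy):
    """Decide where traffic should point: stay put if the current
    region is still healthy, otherwise fail over to the next healthy
    region in preference order."""
    if healthy.get(current):
        return current
    return pick_active(regions, healthy)
```

Preferring to stay put when the current region is healthy avoids flapping traffic back and forth on transient health-check blips.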
"Netflix showed some increased latency, internal alarms went off but hasn't had a service outage." [1]
"Netflix is deployed in three zones, sized to lose one and keep going. Cheaper than cost of being down." [2]
[1] https://twitter.com/adrianco/status/61075904847282177
[2] https://twitter.com/adrianco/status/61076362680745984
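The "sized to lose one and keep going" claim is simple arithmetic: if load is spread evenly over N zones, losing one pushes the survivors from utilization u to u * N / (N - 1). A minimal sketch (the 60% figure is just an example operating point, not a Netflix number):

```python
def survivor_utilization(u, zones):
    """Utilization of the remaining zones after losing one of
    `zones` zones, assuming load was spread evenly at utilization u
    and is redistributed across the survivors."""
    return u * zones / (zones - 1)

def max_normal_utilization(u_max, zones):
    """Highest steady-state utilization you can run at and still stay
    under u_max after losing one zone."""
    return u_max * (zones - 1) / zones

# Three zones at 60% busy spike to roughly 90% on the survivors when
# one zone drops out, leaving headroom until replacements come up.
```

The same formulas show why more zones are cheaper per unit of safety: with three zones you must reserve a third of each survivor's capacity for failover, but with six zones only a fifth.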