Given the analogy between actual clouds and computer clouds, it now seems appropriate to extend the concept to the storms that those clouds may bring. This was illustrated recently (April 21, 2011) when Amazon had a cloud outage (a mixed metaphor, no doubt) in its Amazon Web Services business. The situation was covered by the NY Times (here) and the professional computer press (here), among others. As a result of Amazon's problems, some Web sites were reported to be down for as long as 11 hours, although actual loss of previously stored information has seemingly not been part of the problem--this time. However, there is a related question about any new data that was, or should have been, generated during the outage: where is it, and will the gap be properly filled in retroactively?
The Amazon postmortem explanation has to be, if it is not already, a classic. In fact I can picture a pull-down menu of explanations where this would have to be one of the choices. The explanation, in short: a configuration error was made during a network upgrade. A far more detailed explanation was posted by Amazon here. From a Web page perspective, an interesting aspect of the posted explanation is that while it is clearly on the amazon.com Web site, it is not easily found, if at all, by starting at amazon.com; at least I didn't find it from there.
One interesting aspect of the story is that different customers were affected in different ways, in part because of the type of services they had under contract. For example, Netflix said it had taken "full advantage" of the redundant cloud architecture, which protects against local malfunctions, while other customers who relied only on the affected Virginia location were more adversely affected. Understanding what you are paying for can be a challenge here, especially for less sophisticated users, who in some cases are exactly the ones turning to cloud providers.
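For readers who want a concrete picture of what that redundancy means in practice, here is a minimal client-side sketch: if the primary region is unreachable, the client falls back to a second region rather than simply failing. The endpoint URLs and region names are hypothetical placeholders, not Amazon's actual services, and real multi-region designs involve far more (data replication, DNS failover, and so on); this only illustrates the basic idea.

```python
# Illustrative sketch only: a client that fails over from one regional
# endpoint to another. The URLs below are hypothetical placeholders.
import urllib.request

ENDPOINTS = [
    "https://service.us-east-1.example.com/status",  # primary (e.g., the Virginia region)
    "https://service.us-west-2.example.com/status",  # secondary, tried only on failure
]

def fetch_with_failover(urls=ENDPOINTS, timeout=5):
    """Try each regional endpoint in order; return the first successful response."""
    last_error = None
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except OSError as err:          # connection refused, DNS failure, timeout, etc.
            last_error = err            # remember the failure and try the next region
    raise RuntimeError(f"All regional endpoints failed: {last_error}")

if __name__ == "__main__":
    try:
        print(fetch_with_failover())
    except RuntimeError as exc:
        print("Service unavailable in every region:", exc)
```

A customer paying only for a single-region deployment has, in effect, a one-entry list above, which is why the Virginia outage took them down entirely.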
As readers here probably know well, the computer cloud is a way for enterprises to outsource their computer operations in whole or in part, including servers, software, and data management, substituting Web access for the maintenance of their own servers and support. This concept has grown in popularity in recent years among both large and small organizations. Not surprisingly, then, there was a good deal of sentiment expressed about this being a wake-up call that will cause a re-evaluation of reliance on the cloud, and that the cloud's reliability image would take a hit. What is perhaps most surprising about these sentiments is that seemingly knowledgeable people were willing to acknowledge that they were surprised that the Amazon system, or any large system, could go out. This is the it-will-always-work-and-always-be-there perspective that vendors love, and that customers continue to buy into. (There is a wireless version of this that adds there-will-be-full-coverage-of-unlimited-capacity-and-we-will-not-interfere-with-anyone-and-no-one-will-interfere-with-us.) It might also be noted that some cloud data storage providers have already closed their doors, and there is no automatic guarantee that customer data will be accessible or easily transferred when a cloud provider is gone.
While no medical systems were reported as being affected by the Amazon problem, this is nonetheless yet another example of what can go wrong with distributed information systems, and why preparedness generally, and risk management in particular, are necessary elements of today's connectivity. ISO 80001, as discussed here by Tim Gee, is a starting point for healthcare to think about these issues. The global question is: what do I do when the network system I am depending on isn't available? Related to that, which devices have stand-alone capability, and which do not? And equally important, where is the data, how secure is it, how fast can I move it, can I/they recover it, and will the vendor/my data be there tomorrow?
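One common answer to "what do I do when the network isn't available" is a store-and-forward contingency: data generated during the outage is held locally and replayed once connectivity returns, so the gap gets filled in retroactively rather than lost. The sketch below assumes a hypothetical send_to_cloud() uploader and a local file as the holding area; it is an illustration of the pattern, not anyone's actual product.

```python
# A minimal store-and-forward sketch. send_to_cloud() is a hypothetical
# placeholder for the real upload call; here it always fails so the queuing
# path can be seen. Data is appended to a local file when the cloud is
# unreachable and replayed later by replay_queue().
import json
import os
import time

QUEUE_FILE = "pending_records.jsonl"   # local holding area (assumed path)

def send_to_cloud(record):
    """Placeholder for the real upload; raises ConnectionError when the
    cloud service cannot be reached."""
    raise ConnectionError("cloud endpoint unreachable")

def submit(record):
    """Try to send immediately; if the cloud is down, queue the record locally."""
    try:
        send_to_cloud(record)
    except ConnectionError:
        with open(QUEUE_FILE, "a") as f:
            f.write(json.dumps(record) + "\n")

def replay_queue():
    """Once connectivity returns, push queued records and keep any that still fail."""
    if not os.path.exists(QUEUE_FILE):
        return
    remaining = []
    with open(QUEUE_FILE) as f:
        for line in f:
            try:
                send_to_cloud(json.loads(line))
            except ConnectionError:
                remaining.append(line)   # still down; keep for the next attempt
    with open(QUEUE_FILE, "w") as f:
        f.writelines(remaining)

submit({"device": "monitor-01", "value": 72, "ts": time.time()})
```

Of course this only works for devices that have stand-alone capability in the first place, which is exactly why the inventory questions above matter.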
Thanks for posting this - yes, a big wake-up call. One of the comments in the professional computer article was that customers looked at the cloud like a utility company that has redundancy and specified availability, and yet it doesn't. We as clinical engineers know that with electricity in the healthcare enterprise there is regular power and red power; the red power is for when the utility is unavailable, and back-up power is then provided. There is a similar analogy with the cloud - first, what is your philosophy with regard to centralized versus de-centralized control, and then how distributed will that be? This is the future - think about home healthcare and remote monitoring, and what systems will be presumed to be working in order to provide care - and your healthcare enterprise will not have control over large swathes of those systems. Managing risk and expectations becomes even more important.
And again - just over one year after my initial post here, Amazon's cloud was again in the news, this time knocked out by severe storms on June 29, 2012. Local utilities had major outages as well, and perhaps hospital backup generators were challenged, as was Amazon's. Nonetheless, this is another reminder that relying on external services whose reliability and redundancy you might not actually understand, and without contingency plans, can lead to a bit of chaos when things go bad.
A NY Times story is at: http://www.nytimes.com/2012/07/02/technology/amazons-cloud-service-is-disrupted-by-a-summer-storm.html?_r=1&ref=todayspaper
And again:
http://www.nytimes.com/2012/12/27/technology/latest-netflix-disruption-highlights-challenges-of-cloud-computing.html?ref=todayspaper&_r=0
This time, dare we even say it, Netflix was down Christmas Eve/Christmas Day. And the suggestion is that it was a software control issue (Elastic Load Balancing) rather than a server outage.