Given the analogy between actual clouds and computer clouds, it seems appropriate to extend the concept to the storms those clouds may bring. This was illustrated recently (April 21, 2011) when Amazon had a cloud outage (a mixed metaphor, no doubt) in its Amazon Web Services business. The situation was covered by the NY Times (here) and the professional computer press (here), among others. As a result of Amazon's problems, some Web sites were reported to be down for as long as 11 hours, although actual loss of previously stored information has seemingly not been part of the problem, at least this time. However, there is a related question about any new data that was, or should have been, generated during the outage. Where is it, and will the gap be properly filled in retroactively?
The Amazon postmortem explanation is destined to become a classic, if it is not one already. In fact, I can picture a pull-down menu of explanations where this would have to be one of the choices. The explanation in short: a configuration error was made during a network upgrade. A far more detailed explanation was posted by Amazon here. From a Web page perspective, an interesting aspect of the posted explanation is that while it is clearly on the amazon.com Web site, it is not easily found, if at all, by starting at amazon.com, or at least I didn't find it from there.
One interesting aspect of the story is that different customers were affected in different ways, in part because of the type of services they had under contract. For example, Netflix said it had taken "full advantage" of the redundant cloud architecture, which protects against local malfunctions, while other customers who relied on only the affected Virginia location were more adversely affected. Understanding what you are paying for can be a challenge here, especially for less sophisticated users, who in some cases would be exactly the ones turning to cloud providers.
As readers here probably know well, the computer cloud is a way for enterprises to outsource their computer operations in whole or in part, including servers, software and data management, substituting Web access for the maintenance of their own servers and support. This concept has grown in popularity in recent years among both large and small organizations. Not surprisingly, then, there was a good deal of sentiment expressed about this being a wake-up call that will cause a re-evaluation of reliance on the cloud, and that the cloud's image of reliability would take a hit. What is perhaps most surprising about these sentiments is that seemingly knowledgeable people were willing to acknowledge that they were surprised that the Amazon system, or any large system, could go out. This is the it-will-always-work-and-always-be-there perspective that vendors love, and customers continue to buy into. (There is a wireless version of this that adds there-will-be-full-coverage-of-unlimited-capacity-and-we-will-not-interfere-with-anyone-and-no-one-will-interfere-with-us.) It might also be noted that some cloud data storage providers have already closed their doors, and there is no automatic guarantee that customer data will be accessible or easily transferred when cloud providers are gone.
While no medical systems were reported as being affected by the Amazon problem, this is nonetheless yet another example of what can go wrong with distributed information systems, and why preparedness generally, and risk management in particular, are necessary elements of today's connectivity. ISO 80001, as discussed here by Tim Gee, is a starting point for healthcare to think about these issues. The global question is: what do I do when the network system I am depending on isn't available? Related to that: which devices have stand-alone capability, and which do not? And equally important: where is the data, how secure is it, how fast can I move it, can I (or they) recover it, and will the vendor and my data be there tomorrow?
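For readers who like to see such questions in concrete form, here is a minimal Python sketch of one possible answer to "what do I do when the network isn't available?": a device that keeps recording locally and forwards its backlog once the remote service returns, so the gap is filled in retroactively. This is purely illustrative and not drawn from Amazon, ISO 80001, or any vendor's product. The class name `StoreAndForward`, the `send` callback, and the `max_buffered` limit are all hypothetical, and a real device would also need persistent storage, timestamps, and security.

```python
from collections import deque
from typing import Callable, Deque


class StoreAndForward:
    """Hypothetical sketch: buffer readings locally while the remote
    service is unreachable, then forward them in order once it is back."""

    def __init__(self, send: Callable[[dict], None], max_buffered: int = 1000):
        # `send` is an assumed callback that uploads one reading and
        # raises ConnectionError when the remote service is unavailable.
        self._send = send
        # Bounded in-memory buffer; when it is full, the oldest reading
        # is silently dropped to make room for the newest one.
        self._pending: Deque[dict] = deque(maxlen=max_buffered)

    def record(self, reading: dict) -> None:
        """Store a new reading, then try to forward everything pending."""
        self._pending.append(reading)
        self.flush()

    def flush(self) -> int:
        """Forward pending readings oldest-first; stop at the first failure.

        Returns the number of readings successfully sent."""
        sent = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break  # still offline; keep the backlog for later
            self._pending.popleft()
            sent += 1
        return sent

    @property
    def backlog(self) -> int:
        """Number of readings still waiting to be forwarded."""
        return len(self._pending)
```

Even this toy version exposes the risk-management decisions hiding behind the questions above: how much data the device can hold on its own, what gets discarded when that limit is reached, and whether anyone is told that a backlog exists.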