The self-proclaimed “oldest guy with a computer science degree,” John Rushby of SRI International, started the second day with a presentation on Accidental Systems. Using aviation as a model, Rushby discussed interactive complexity and system failures. Exploring the causes of accidents, Rushby noted that “sufficiently complex systems can produce accidents without a simple cause.” He related aviation failures to health care, noting that in many patient safety incidents, it is the system that fails rather than the clinician.
Aviation is a good model because of the extensive reporting of both failures and incidents. Incidents are “near misses” rather than actual accidents. Because there was no crash, aviation incidents are a rich source of reliability and failure data. He noted that the scarcity of near misses in health care (i.e., adverse events) makes similar analysis much more difficult.
Interactive complexity and tight coupling were noted as important factors in system safety. He then went through a number of failure scenarios in military aviation, demonstrating how complex systems can fail when the interactions among their many variables are not understood.
“Nearly all failure indications were not due to actual hardware failures, but to design oversights concerning unsynchronized computer operation.”
Rushby noted that we are pretty good at building and understanding components. But systems are about the interactions of components that manifest “emergent behavior.” We are not so good at understanding this because many interactions are unintended and unanticipated; only some are the result of component faults. There are often multiple and latent faults, and components can malfunction or exhibit an unintended function rather than a simple and obvious loss of function. Many failures are simply due to . . . complexity.
Unlike medical devices, the FAA certifies components only as part of an airplane or engine. That’s because it is not currently understood how to relate the behavior of a component in isolation to its possible behaviors in a system (i.e., in interaction with other components).
So, what are we doing in medical devices? We’re making components and plugging them together, creating “accidental systems.” These medical device systems are created without conscious design. The resulting interconnects produce desired behaviors – most of the time – but may promote unanticipated interactions leading to system failures or accidents.
The solution is simple; we just need to think it all through before we build it. There are two problems with this. First, it’s not known how to do this in general due to the complexity. Second, in health care we regulate components that are then assembled into accidental systems.
Rushby went on to describe some partial techniques for improving the safety of systems. Modes of interaction are key. Examples include interaction:
- Among computational components
- Through shared resources (e.g., the network)
- Through the controlled plant (the patient)
- Through human operators
- Through the larger environment
Computer scientists have worked out how to predict and verify the combined behavior of interacting systems – sometimes. One method is called “assume/guarantee reasoning.” This method and others can be employed informally or formally. Other formal methods include automated theorem proving, model checking and static analysis.
A cool new technology is “assumption generation.” This is well suited to human physiology, which can be quite variable – unlike a radar or jet engine. Using this technique you can define the loosest set of variables. Assume/guarantee reasoning works from a model or specification of your component(s) that includes assumptions about its environment. The assumed environment can be made part of the component specification using techniques like interface automata. (Interface automata (IA) are a lightweight formalism used to describe the temporal interface behaviors of software components.) The IAs of various components can be combined to create a state machine that can verify that a collection of components satisfy each other's IAs. Data types and a state machine are extracted from these assumptions with the goal of creating high confidence in system operation across the assumed environmental range (the assumption guarantee).
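The flavor of assume/guarantee reasoning can be shown with a deliberately simplified sketch. It is not interface automata (which describe temporal behavior); it only checks value ranges. Each component declares the range it assumes on its inputs and the range it guarantees on its outputs, and the check flags any connection where a guarantee is looser than the assumption it feeds. The component names, signal names, and numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def contains(self, other: "Interval") -> bool:
        # True when every value in `other` also lies inside this interval.
        return self.lo <= other.lo and other.hi <= self.hi


@dataclass
class Contract:
    name: str
    # signal name -> range this component assumes on that input
    assumes: dict = field(default_factory=dict)
    # signal name -> range this component guarantees on that output
    guarantees: dict = field(default_factory=dict)


def check_composition(contracts):
    """Return (producer, consumer, signal) for each unmet assumption."""
    violations = []
    for producer in contracts:
        for signal, promised in producer.guarantees.items():
            for consumer in contracts:
                if consumer is producer:
                    continue
                needed = consumer.assumes.get(signal)
                if needed is not None and not needed.contains(promised):
                    violations.append((producer.name, consumer.name, signal))
    return violations


# Hypothetical components and ranges, for illustration only.
sensor = Contract("sensor", guarantees={"reading": Interval(0, 100)})
controller = Contract(
    "controller",
    assumes={"reading": Interval(50, 100)},
    guarantees={"rate": Interval(0, 5)},
)
pump = Contract("pump", assumes={"rate": Interval(0, 10)})

violations = check_composition([sensor, controller, pump])
print(violations)
```

Here the controller's assumption about the sensor is too tight for what the sensor actually promises, so the composition is flagged even though each component may be correct on its own.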
He also offered some simple tips to reduce interactive complexity:
- Send sensor samples with use-by date rather than time stamp
- For sensor fusion, send intervals rather than point estimates
- Define data with respect to an ontology, not just basic types
  - E.g., raw output of blood pressure sensor vs. corrected for bed height
- Critical things should not depend on less critical things
  - E.g., intervention for low blood pressure depends on blood pressure which depends on bed height sensor
  - Consequently, the bed height sensor is as critical as the blood pressure intervention or alarm
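The first two tips can be combined in a small sketch. Each sample carries an interval instead of a point estimate, plus a use-by time on a clock that all parties are assumed to share. Fusion drops expired samples and intersects the rest. The `Sample` type, the `fuse` function, and the numbers are hypothetical, and the intersection rule assumes every fresh sample is correct; it does not tolerate a faulty sensor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    lo: float      # lower bound of the measured value
    hi: float      # upper bound of the measured value
    use_by: float  # time (seconds on a shared clock) after which to discard


def fuse(samples, now):
    """Intersect all samples that are still fresh at time `now`."""
    fresh = [s for s in samples if now <= s.use_by]
    if not fresh:
        return None  # nothing usable; the caller must handle this explicitly
    lo = max(s.lo for s in fresh)
    hi = min(s.hi for s in fresh)
    if lo > hi:
        raise ValueError("fresh samples disagree: intervals do not overlap")
    # The fused value is only as fresh as its oldest contributor.
    return Sample(lo, hi, min(s.use_by for s in fresh))


# Hypothetical readings, for illustration only.
a = Sample(88, 96, use_by=10)
b = Sample(90, 99, use_by=12)
stale = Sample(60, 70, use_by=5)

fused = fuse([a, b, stale], now=8)
print(fused)
```

A use-by time puts the freshness decision with the producer, which knows how quickly its data goes out of date, so the consumer does not have to guess an acceptable age from a raw time stamp.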
A key challenge with medical devices lies in the need to integrate device systems with other information technology infrastructure to facilitate the communication and sharing of data. Interaction through shared resources becomes a major factor in safe and effective systems engineering. The concept of partitioning went to the heart of the trade-off between specialized embedded systems and general-purpose computing platforms – or to look at it another way, between a standalone system and a medical device system deployed on a shared enterprise infrastructure.
Assume/guarantee reasoning is about computational interactions and relies on there being no paths for interaction other than those intended and considered. But commodity operating systems and networks provide lots of additional and unintended paths – hard to model variability. Typically, A and B get disrupted because X has gone bad and the system did not contain its fault manifestations. So safety- and security-critical functions in airplanes, cars, military, nuclear power etc. don’t use Windows, Ethernet, CAN etc. Avionics and military high-reliability applications are highly partitioned using very specialized system software, architectures and rigorous verification and validation. These make the world safe for assume/guarantee reasoning but are too expensive (and overkill) for medical device interoperability.
Contention on shared resources is a classic example where assume/guarantee reasoning breaks down. An approach to solving this problem is interaction through a controlled plant. In medical devices, the controlled plant is the patient’s body. Medical device development would involve controller and plant models. A plant model may include only a few physiological parameters. Different devices will have different plant models and may be ignorant of other devices’ parameters – yet will interact in actual use.
Obvious perils include normal but unmodeled interactions, especially in the presence of faults. But problems can also occur due to inferior outcomes from lack of beneficial interaction, e.g., a harmonic relation between heart and breathing rates (Buchman).
Rushby’s final interaction is human interaction, with humans as cognitive agents (users) rather than the plant (or patient). Many things attributed to human error are in fact gross design errors. Even safety interlocks can introduce errors if the operator does not understand why an action is or is not happening. These kinds of problems suggest we may not be able to rely on skilled human intervention once we introduce automation – unless we design it right.
Modeling mental models can minimize human error. Operators use mental models to guide their interaction with automated systems. Problems arise due to divergence between an operator’s mental model and the actual behavior of the device or system. Design engineers can represent plausible mental models as state machines (e.g., derived from the training manual) and then simplify them using insights of Denis Javaux. They can then compare all behaviors of the mental model against the actual automation using model checking. Divergences between the mental model and actual operation represent your automation “surprises.”
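The comparison described above can be sketched as a tiny explicit-state search. This is a toy stand-in for a real model checker, written under several assumptions: both machines are deterministic, an event with no listed transition leaves the state unchanged, and the operator's mental state is simply the mode the operator expects to see. The device, its states, and its events are hypothetical.

```python
from collections import deque


def step(transitions, state, event):
    # Events with no listed transition leave the state unchanged (assumption).
    return transitions.get(state, {}).get(event, state)


def find_surprises(actual, mental, start, events, mode_of):
    """Breadth-first search over joint (actual, mental) states.

    Returns (trace, actual_state, expected_mode) for each reachable joint
    state where the mode shown by the device differs from the mode the
    operator expects. Traces are shortest because the search is breadth-first.
    """
    surprises = []
    seen = {start}
    queue = deque([(start, ())])
    while queue:
        (a, m), trace = queue.popleft()
        if mode_of[a] != m:
            surprises.append((trace, a, m))
            continue  # stop exploring past a divergence
        for event in events:
            nxt = (step(actual, a, event), step(mental, m, event))
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, trace + (event,)))
    return surprises


# Hypothetical device: it has a timeout the operator's model leaves out.
actual = {
    "off": {"press": "on"},
    "on": {"press": "off", "timeout": "standby"},
    "standby": {"press": "on"},
}
mental = {
    "off": {"press": "on"},
    "on": {"press": "off"},
}
mode_of = {"off": "off", "on": "on", "standby": "standby"}

surprises = find_surprises(
    actual, mental, start=("off", "off"),
    events=["press", "timeout"], mode_of=mode_of,
)
print(surprises)
```

The single reported trace, a press followed by a timeout, is the automation surprise: the device has moved to standby while the operator still believes it is on.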
Beyond specific interactions, there is the larger environment. The purpose of a system is to change some relationships in the environment external to the system. So requirements specification should focus on those changes. But changing intended relationships may also change unintended ones. Requirements engineering should focus on these issues where unintended relationships can result in failures. This can be done by building models of the environment and exploring interactions. Model checking and other formal methods allow exploration of all possible behaviors.
So, now we have identified various sources of unintended interactions and suggested some ways to detect and avoid them. The next challenge is to provide assurance that we’ve detected and avoided unintended interactions. All assurance is based on arguments that purport to justify certain claims, based on documented evidence. There are two approaches to assurance: implicit (standards-based) and explicit (goal-based).
Aviation and security use a standards-based approach to software certification. In this approach, the applicant follows a prescribed method (or processes) to deliver prescribed outputs like documented requirements, designs, analyses, tests and outcomes, traceability, etc. Standards usually define only the evidence to be produced. The claims and arguments are implicit. Hence, it is hard to tell whether given evidence meets the intended use. This approach works well in fields that are stable or change slowly. A standards-based methodology tends to institutionalize lessons learned and best practices, but is less suitable for novel problems, solutions, and methods – like medical device interoperability.
The goal-based approach involves the applicant developing an assurance case whose outline form may be specified by standards or regulation. This approach makes an explicit set of goals or claims, and provides supporting evidence for the claims and arguments that link the evidence to the claims. Underlying assumptions and judgments are explicitly and unambiguously stated. This approach should allow different viewpoints and levels of detail. In practice, the case is evaluated by independent assessors who evaluate claims, evidence, and argument.
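One way to picture the structure of a goal-based assurance case is as a tree of claims, each supported by evidence or subclaims and tied to that support by an explicit argument. The sketch below is a hypothetical data structure, not a standard notation or any regulator's format. It simply reports claims that have no support, or that have support but no stated argument. The claim texts and evidence names are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    text: str
    argument: str = ""  # why the support below justifies the claim
    evidence: list = field(default_factory=list)
    subclaims: list = field(default_factory=list)


def gaps(claim):
    """Return (claim text, problem) pairs, walking the tree depth-first."""
    found = []
    if not claim.evidence and not claim.subclaims:
        found.append((claim.text, "no evidence"))
    elif not claim.argument:
        found.append((claim.text, "no argument"))
    for sub in claim.subclaims:
        found.extend(gaps(sub))
    return found


# Hypothetical assurance case, for illustration only.
case = Claim(
    "System is acceptably safe in its intended environment",
    argument="Argue over each identified mode of interaction",
    subclaims=[
        Claim(
            "Shared-network interactions are contained",
            argument="Partitioning analysis covers every shared resource",
            evidence=["partitioning analysis report"],
        ),
        Claim(
            "Operator mental model matches the automation",
            evidence=["model-checking run"],
        ),
        Claim("Interactions through the patient are modeled"),
    ],
)

problems = gaps(case)
for text, problem in problems:
    print(f"{problem}: {text}")
```

Making the argument a required field reflects the point of the goal-based approach: evidence alone is not enough unless the case states why that evidence supports the claim.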
So what should evidence look like? There are two types of relevant evidence: evidence about the process, organization, and people, and evidence about the product. Certification reviews are based on human judgment and consensus, e.g., requirements inspections and code walkthroughs. Analysis can be repeated and checked by others, and potentially by machine. Formal methods/static analysis and tests are also used.
The interesting opportunity is to create a science of certification. Certification is ultimately a judgment that a system is adequately safe/secure/whatever for a given application in a given environment. But the judgment should be based on as much explicit and credible evidence as possible. A Science of Certification would be about ways to develop that evidence.
The bottom line is that there is enough damage caused by the practice of medicine that the introduction of inherently flawed medical device systems provides a net improvement. Consequently, there is no immediate requirement to rush to implement the kind of systems engineering and certification used in aviation or other high reliability endeavors.
A worst-case scenario is a medical device system that results in death or injury to multiple patients – like when all the passengers are lost in a catastrophic airplane failure. This could quickly shift the health care environment to one where aggressive safety engineering is required.
All of this flies in the face of current software licensing via the ubiquitous EULA. The typical EULA asserts extensive liability limitations. Software vendors need to start making specific claims and backing them up with evidence.
Given the interactions discussed, how is criticality addressed? There is no simple answer to that – one way is to calculate all the paths by which this could occur and ensure all the paths are workable. In the context of systems it’s really hard to know how to do these things.
A CANopen proponent witnessed to the gathered multitude about the use of CAN in high-reliability systems like helicopters, naval ships, etc. Rushby acknowledged that CAN is used in such systems, though not without controversy, and said that he would prefer not to fly on an airplane using CAN.
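The path calculation mentioned in that answer, together with the earlier tip that critical things should not depend on less critical things, can be sketched as a walk over a dependency graph. The function below raises each item's effective criticality to that of the most critical thing that depends on it, directly or indirectly. The item names and numeric levels are hypothetical, with higher numbers meaning more critical. The recursive walk is only suitable for small graphs.

```python
def effective_criticality(depends_on, assigned):
    """Propagate criticality down every dependency path.

    depends_on: item -> list of items it needs in order to work.
    assigned:   item -> criticality level given by the designer.
    """
    effective = dict(assigned)

    def push(item, level, path):
        for dep in depends_on.get(item, []):
            if dep in path:
                continue  # guard against dependency cycles
            if effective.get(dep, 0) < level:
                effective[dep] = level
            push(dep, level, path | {dep})

    for item, level in assigned.items():
        push(item, level, {item})
    return effective


# Hypothetical levels for the blood pressure example from the tips above.
depends_on = {
    "bp_alarm": ["bp_reading"],
    "bp_reading": ["bed_height_sensor"],
}
assigned = {"bp_alarm": 3, "bp_reading": 2, "bed_height_sensor": 1}

effective = effective_criticality(depends_on, assigned)
under_rated = {
    item: (assigned.get(item, 0), level)
    for item, level in effective.items()
    if level > assigned.get(item, 0)
}
print(under_rated)
```

In this example the bed height sensor ends up as critical as the alarm that ultimately depends on it, which is the conclusion drawn in the tips.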
A group has been working on an ICE standard on medical device interoperability. There is a requirement in the standard for the vendor to create a mental model, and a fall-back mode in the event of unanticipated failure. Tight coupling is inherent in closed-loop control. Adding slack into such a system may be advantageous in accommodating human operator variability.
What is the possible impact on medical devices of the sometimes draconian measures that aviation uses for time-sliced partitioning? This approach is very costly to developers, placing many constraints upon the engineer. The flexibility inherent in human physiology may be better served by an approach that is different from time-sliced partitioning.
How close do you think we are to the “science of certification”? Rushby said he thinks we’re far away from that. Much of this is due to the transformative change in the aviation industry. There is also much yet to be learned about controller synthesis, and how it would roll up into a certification science.