Abstract
Many commercial and governmental systems employing microcontrollers or microprocessors require very high mission reliability. This goal can be achieved by incorporating single units with exceptionally low failure rates that are correspondingly costly, or under certain circumstances, by constructing systems employing redundant inferior units that are relatively cheap. This paper's objective is to provide an aid to cost-effective design by analyzing and interpreting the absolute and relative mission reliability of such systems as a function of redundancy level for linked, parallel units operating in either inactive or active standby modes under constant hazard rates.
Problem Statement and Computational Objectives
High mission reliability is a critical design consideration for the microcontrollers/processors used ubiquitously in embedded control and other systems on which commercial and governmental organizations depend so heavily. Such systems include, for example, remote sensing/surveillance in security applications, robotics, transportation, health care, process control and file servers. The term "mission reliability" denotes the likelihood that, for some mission length T, such a system will perform within design specifications under stated conditions (Blanchard). Two non-mutually exclusive approaches to enhancing a system's reliability are improving individual component reliability and incorporating redundancy. Redundancy in turn offers two further options: redundant elements operating in either inactive or active standby.
As one might expect, because of the increasing application of embedded microcontrollers in civil and military systems where reliability is crucial (e.g., automatic braking and stability control systems in cars, pacemakers, flight control systems, etc.), an extensive literature has evolved on this subject: some in the open literature (ARINC), some proprietary and some classified. The world's largest technical society, the Institute of Electrical and Electronics Engineers, even has a separate society devoted to this subject. As noted in the following, the two alternative modes of operation (inactive standby and active redundant units) have been addressed in the literature along with their relevant equations.
The objective of this paper is to evaluate the cost effectiveness of systems design by analyzing absolute and relative mission reliabilities. The main departure of this paper is to provide specific quantitative results that reveal the incremental, relative gains (if any) in reliability versus the level of redundancy. This is done for a general system configuration with either inactive or active redundant components operating under various failure rate scenarios (144 in all) that do not appear to have been considered explicitly elsewhere. The parametric results provided in six accompanying exhibits are fully interpreted in a detailed findings section. There the reader may discover some counterintuitive results; viz., redundancy does not always improve reliability, and when it does, the gains can be quite marginal.
For some mission of duration, T, this paper examines the comparative reliability of:
(1) a single "unit" (i.e., microcontroller/processor);
(2) a system with a total of N units comprising one active primary unit and one or more (N-1) parallel, inactive units with separate power supplies switched by a single supervisory controller; and
(3) a system consisting of one active primary unit and (N-1) parallel, fully active redundant units (which therefore does not require monitoring or switching).
For comparative purposes, the overall probability of mission success as a function of redundancy level is determined for all three of these primary cases.
For Case (2), each redundant "path" of the system, as well as the primary path, is assumed to consist of independent parallel-series arrangements of independent units and independent, switched power supplies. All are considered in inactive, standby mode until the controller detects a failure of the primary unit and activates a redundant path and so on. During inactivity, it is assumed that neither a switched power source nor the unit coupled in series with it can fail.
In both Cases (2) and (3), failures are assumed to be independent, with exponential failure probability densities (i.e., constant hazard rate).
Single Unit Reliability: No Redundancy
In order to derive the absolute and relative reliability versus redundancy results for the general models incorporating redundancy (Cases 2 and 3), it is best to first consider a simplified system involving just one unit with no redundancy. For the mission to succeed, the unit must operate within specifications and under stated conditions until time T. Assuming the failure time probability density function, f(t), is exponential with constant failure rate, r, the unit's reliability becomes the following:
$$R_1(T) = 1 - \int_0^T f(t)\,dt = e^{-rT}, \qquad T \ge 0$$
Clearly, for this or any other probability density function, a unit's reliability must eventually approach zero; i.e., the unit will ultimately wear out or fail due to some destructive event. For the exponential model, the mean time to failure and the standard deviation of the failure time are both readily shown to equal the reciprocal of the failure rate, 1/r (Raheja).
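The single-unit relations can be checked numerically. The following Python sketch is illustrative only: the function names and the sample failure rate are assumptions introduced here, not taken from the paper. It evaluates the exponential reliability, inverts it to find the mission length that yields a prescribed reliability, and returns the mean time to failure.

```python
import math


def single_unit_reliability(r: float, T: float) -> float:
    """Reliability exp(-r*T) of one unit with constant hazard rate r over mission length T."""
    if r < 0 or T < 0:
        raise ValueError("r and T must be non-negative")
    return math.exp(-r * T)


def mission_time_for_reliability(r: float, target: float) -> float:
    """Invert exp(-r*T) = target for T, giving T = -ln(target) / r."""
    if r <= 0 or not (0.0 < target <= 1.0):
        raise ValueError("need r > 0 and 0 < target <= 1")
    return -math.log(target) / r


def mean_time_to_failure(r: float) -> float:
    """Mean (and standard deviation) of an exponential failure time: 1/r."""
    if r <= 0:
        raise ValueError("need r > 0")
    return 1.0 / r


if __name__ == "__main__":
    r = 1.0e-3  # hypothetical failure rate, failures per hour
    T90 = mission_time_for_reliability(r, 0.90)
    print(f"Mission length for 90% reliability: {T90:.2f} h")
    print(f"Reliability at that length: {single_unit_reliability(r, T90):.4f}")
    print(f"MTTF: {mean_time_to_failure(r):.1f} h")
```

Because only the product rT enters the reliability, fixing the single-unit reliability (for example at 90% or 80%) fixes rT regardless of the particular failure rate chosen.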
Redundant Units: Inactive Standby
As noted, this case incorporates redundancy to improve overall mission reliability. Here, each redundant path of the system, as well as the primary path, is assumed to consist of parallel-series arrangements of independent units and independent, switched power supplies. All are considered in inactive, standby mode until the supervisory controller (monitor or detector) senses a failure of the primary unit and activates a redundant path. This continues with each unit failure until all N-1 redundant units are exhausted. During inactivity, it is assumed that neither a switched power source nor the unit coupled in series with it can fail. Failures are assumed to be independent with exponential failure densities (i.e., constant hazard rate). Employing the well-known elementary theorem for the joint probability of independent composite events (Hoel), the reliability of the redundant unit configuration for a mission of duration T can be readily determined (ARINC) as the following:
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
In this equation, $r$, $r_p$, and $r_c$ are the respective constant failure rates of the individual, independent units, switched power supplies and single supervisory controller. From the preceding equation for a single unit with no redundant elements and with some prescribed reliability, R(T,1), T can be found as:
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
This allows the values of R(T,N) to be computed for different overall mission reliabilities and various ratios of the three failure rate parameters, including the case where all failure rates are equal. The improvements or degradations in percent relative reliabilities, 100 * R(T, N)/R(T,1) are summarized in the accompanying Exhibits (1-5) as a function of redundancy level for single unit mission reliabilities of 90% and 80%.
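Since the standby expression is not reproduced in this text, the following Python sketch substitutes a common textbook standby model and should not be read as the paper's own equation. Its assumptions are: each path is a unit in series with its switched power supply (combined rate r + r_p), switching is perfect, inactive paths cannot fail, the controller (rate r_c) must survive the whole mission, and the baseline R(T,1) is the bare single unit, e^{-rT}. All function names are hypothetical.

```python
import math


def standby_reliability(T: float, N: int, r: float, r_p: float, r_c: float) -> float:
    """Assumed textbook model: exp(-r_c*T) * exp(-x) * sum_{k<N} x^k / k!, with x = (r + r_p)*T."""
    if N < 1:
        raise ValueError("N must be at least 1")
    x = (r + r_p) * T          # expected number of path failures during the mission
    term, total = 1.0, 0.0     # term holds x^k / k!
    for k in range(N):
        total += term
        term *= x / (k + 1)
    return math.exp(-r_c * T) * math.exp(-x) * total


def relative_standby_reliability(target: float, N: int, r: float, r_p: float, r_c: float) -> float:
    """Percent ratio 100 * R(T,N) / R(T,1), with T fixed by the single-unit reliability target.

    Assumption: R(T,1) is the bare unit, so T = -ln(target) / r.
    """
    T = -math.log(target) / r
    return 100.0 * standby_reliability(T, N, r, r_p, r_c) / target


if __name__ == "__main__":
    r = 1.0e-3  # hypothetical common failure rate for all three component types
    for N in range(1, 6):
        pct = relative_standby_reliability(0.90, N, r, r, r)
        print(f"N={N}: {pct:.1f}% of single-unit reliability")
```

Under these assumptions the controller and power supply failure rates can more than offset the benefit of the standby paths, which is one way a ratio below 100% can arise.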
Redundant Active Units
This case also incorporates redundancy to improve overall mission reliability. Here, however, each redundant path of the system, as well as the primary path, is assumed to consist of parallel arrangements of N identical active units (with some internal controller for handling responses to service requests and perhaps functioning to determine load sharing). Since all units are active, there is no accompanying supervisory controller (monitor or detector) to switch separate power supplies to activate the standby (inactive) redundant units as in the previous case. As discussed earlier, failures of the N active units are assumed to be independent with exponential failure densities (i.e., constant hazard rate). Employing the general theorem for the joint probability of independent composite events (Hoel), the reliability of the redundant unit configuration for a mission of duration T can be readily determined as the following:
$$R(T,N) = 1 - \prod_{i=1}^{N}\left(1 - e^{-r_i T}\right)$$
where $r_i$ is the failure rate of the $i$th unit. If these are all equal to r, the above reliability equation becomes the familiar expression:
$$R(T,N) = 1 - \left(1 - e^{-rT}\right)^{N}$$
For N=1, this result reduces to the earlier equation for a single primary active unit with no redundancy. For a prescribed mission reliability, this and the result for R(T,1) lead to the outcomes given in the accompanying Exhibit 6 for the percentage relative reliability improvements (or degradations) as a function of active redundancy level, N > 1.
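As a rough illustration of the active case, the Python sketch below uses the standard textbook form for independent active parallel units, in which the system fails only if every unit fails: R = 1 - prod(1 - exp(-r_i T)). The function names are hypothetical, and the sketch ignores any internal controller or load-sharing effects.

```python
import math
from typing import Sequence


def active_parallel_reliability(T: float, rates: Sequence[float]) -> float:
    """System reliability for independent active units with constant hazard rates."""
    if not rates:
        raise ValueError("at least one unit is required")
    prob_all_fail = 1.0
    for r_i in rates:
        prob_all_fail *= 1.0 - math.exp(-r_i * T)  # unit i fails before T
    return 1.0 - prob_all_fail


def relative_active_reliability(target: float, N: int) -> float:
    """Percent ratio 100 * R(T,N) / R(T,1) for N identical units.

    With equal rates, exp(-r*T) equals the single-unit target, so the
    ratio depends only on the target and N, not on r itself.
    """
    if N < 1 or not (0.0 < target <= 1.0):
        raise ValueError("need N >= 1 and 0 < target <= 1")
    return 100.0 * (1.0 - (1.0 - target) ** N) / target


if __name__ == "__main__":
    for target in (0.90, 0.80):
        for N in range(1, 6):
            pct = relative_active_reliability(target, N)
            print(f"R(T,1)={target:.0%}, N={N}: {pct:.2f}%")
```

In this idealized model, the second unit at a 90% single-unit reliability raises the ratio to about 110%, and each further unit adds progressively less, since the system reliability is bounded by 1.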
Findings
Both dynamic models (i.e., inactive standby units and active redundant units) provide noteworthy insights regarding the impact of redundancy on overall system mission reliability. The foregoing reliability equations and the resultant summary exhibits of relative reliability versus redundancy level and relative component failure rates lead to the following conclusions, which can provide any system designer with useful, quantitative guidelines:
Inactive Parallel Units:
(1) For this dynamic failure model (with exponential failure probability density function), Exhibit 1 shows that the effects of redundancy are quite varied and not always beneficial, depending on the single unit reliability required and the associated common failure rate of the key independent system components: supervisory controller/monitor, switched power supplies and microcontroller/processors ("units").



