( ESNUG 354 Item 9 ) ---------------------------------------------- [6/1/00]
Subject: ( ESNUG 353 #8 ) URLs On Fault-Tolerant System Design Strategies
> Does anybody know of a resource (web, book or article) describing
> architecture design for systems, storage or logic, whose components are
> prone to a very high rate of failure, on the order of 0.1%-1%?
>
> - Greg Deych
From: Subhasish Mitra
Hi John,
You can take a look at the web site of the Center for Reliable Computing
(CRC) here at Stanford University, directed by Prof. Ed McCluskey:
http://crc.stanford.edu
On the bibliography page you can find a very comprehensive list of
publications related to fault tolerance and digital test. Some of the major
innovations in the field of fault-tolerant computing happened at Stanford
CRC. Right now, the center is running two big projects on fault tolerance:
(1) fault tolerance in reconfigurable systems and (2) fault tolerance in
the space environment (with experiments running on a real satellite in
orbit).
Regarding Greg's question about fault-tolerance techniques in systems, here
is a simple, high-level classification:
(1) Memories: error-detecting and error-correcting codes (Hamming codes, etc.)
(2) Logic:
Error detection:
Techniques include duplication, diverse duplication (different
implementations of the same logic function), and parity prediction. For
circuits such as adders, parity prediction may be economical; however,
recent IBM papers on the G5/G6 describe choosing duplication rather than
parity prediction for their execution units. A major concern in these
systems is common-mode failures (a single cause affecting multiple
modules, so data integrity is not guaranteed). It has been shown recently
that for random logic circuits, diverse duplication has only marginally
more area overhead than parity prediction, while providing significantly
more protection against common-mode failures.
Error correction:
Triple modular redundancy (TMR), etc.
(3) Storage (disks): RAID (Redundant Arrays of Inexpensive Disks).
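The Hamming codes mentioned under (1) Memories can be illustrated with the
classic Hamming(7,4) code, which stores 4 data bits alongside 3 parity bits
and corrects any single flipped bit. This is a minimal sketch; the function
names are hypothetical, and real memory ECC typically uses wider SEC-DED
variants over whole words rather than this textbook version.

```python
# Sketch of a Hamming(7,4) single-error-correcting code.
# Codeword positions are numbered 1..7; parity bits sit at 1, 2 and 4.


def hamming74_encode(data):
    """Encode 4 data bits (list of 0/1 ints) into a 7-bit codeword."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # checks positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # checks positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # checks positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(code):
    """Correct at most one flipped bit.

    Returns (data_bits, syndrome); the syndrome is the 1-based position
    of the corrected bit, or 0 if no error was detected.
    """
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]  # recheck positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]  # recheck positions 2, 3, 6, 7
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]  # recheck positions 4, 5, 6, 7
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the bit the syndrome points at
    return [c[2], c[4], c[5], c[6]], syndrome


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1  # corrupt position 5
    print(hamming74_decode(word))
```

Here `hamming74_encode([1, 0, 1, 1])` yields `[0, 1, 1, 0, 0, 1, 1]`; after
position 5 is flipped, the decoder reports syndrome 5 and returns the
original data bits.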
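Parity prediction, listed under (2) Logic as an error-detection technique,
can be sketched for an adder. Each sum bit is s_i = a_i XOR b_i XOR c_i,
where c_i is the carry into bit i, so the parity of the sum can be predicted
as P(a) XOR P(b) XOR P(c) and compared with the parity of the actual result.
The sketch assumes an 8-bit ripple-carry adder and models a fault as a
hypothetical XOR mask on the sum bits; it does not model how the carries are
sourced in hardware, which matters when faults hit the carry chain itself.

```python
WIDTH = 8  # assumed datapath width for this sketch


def parity(x):
    """XOR of all bits of a non-negative int (1 if odd number of ones)."""
    return bin(x).count("1") & 1


def add_with_carries(a, b):
    """Ripple-carry add of two WIDTH-bit values.

    Returns (sum truncated to WIDTH bits, vector of carries INTO each bit).
    """
    s = 0
    carries = 0
    c = 0
    for i in range(WIDTH):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        carries |= c << i  # record carry into bit i
        s |= (ai ^ bi ^ c) << i
        c = (ai & bi) | (ai & c) | (bi & c)  # carry out of bit i
    return s, carries


def predicted_parity(a, b, carries):
    # Since s_i = a_i ^ b_i ^ c_i, P(s) = P(a) ^ P(b) ^ P(c).
    return parity(a) ^ parity(b) ^ parity(carries)


def checked_add(a, b, fault_mask=0):
    """Add a and b with a parity check on the result.

    fault_mask is a hypothetical fault model: it flips sum bits after the
    adder, emulating a hardware fault. Returns (sum, error_detected).
    """
    s, carries = add_with_carries(a, b)
    s ^= fault_mask
    return s, parity(s) != predicted_parity(a, b, carries)
```

For example, `checked_add(11, 6)` returns `(17, False)`, while any single
flipped sum bit is flagged. Like any single parity bit, the check misses an
even number of flipped bits, e.g. `fault_mask=0b11`.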
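Duplication with comparison, including the diverse variant mentioned under
(2) Logic, amounts to running two implementations of the same function and
flagging any disagreement. In this hedged sketch the two "diverse"
implementations are an ordinary integer add and an iterative XOR/AND carry
loop, and `stuck_at_0_lsb` is a hypothetical fault model; none of these
names come from the original post.

```python
from typing import Callable

WIDTH = 8  # assumed datapath width for this sketch
MASK = (1 << WIDTH) - 1
Adder = Callable[[int, int], int]


def add_builtin(a: int, b: int) -> int:
    """Implementation 1: plain integer addition, truncated to WIDTH bits."""
    return (a + b) & MASK


def add_xor_carry(a: int, b: int) -> int:
    """Implementation 2 (diverse): iterative XOR/AND carry propagation."""
    a &= MASK
    b &= MASK
    while b:
        carry = (a & b) << 1
        a = (a ^ b) & MASK
        b = carry & MASK  # carry out of the top bit is dropped
    return a


def stuck_at_0_lsb(a: int, b: int) -> int:
    """Hypothetical faulty adder: bit 0 of the result is stuck at 0."""
    return add_xor_carry(a, b) & ~1


def duplicated(f1: Adder, f2: Adder) -> Adder:
    """Run both implementations and raise if their outputs disagree."""
    def checked(a: int, b: int) -> int:
        r1, r2 = f1(a, b), f2(a, b)
        if r1 != r2:
            raise RuntimeError(f"duplication mismatch: {r1} != {r2}")
        return r1
    return checked
```

`duplicated(add_builtin, stuck_at_0_lsb)(1, 2)` raises, because the two
copies disagree. When the same fault hits both copies (a crude model of a
common-mode failure), `duplicated(stuck_at_0_lsb, stuck_at_0_lsb)(1, 2)`
silently returns the wrong answer 2, which is the weakness of plain
duplication that the post points out.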
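Triple modular redundancy, listed under error correction, runs three copies
of a module and takes a 2-out-of-3 majority vote, so a single faulty copy is
outvoted. The sketch assumes a bitwise voter and a hypothetical fault model
in which each copy's output is XORed with an error mask; the function names
are illustrative only.

```python
def majority(a, b, c):
    """Bitwise 2-out-of-3 majority vote (the TMR voter)."""
    return (a & b) | (a & c) | (b & c)


def tmr(module, faults=(0, 0, 0)):
    """Wrap `module` in three redundant copies plus a voter.

    faults[i] is a hypothetical XOR error mask applied to copy i's output.
    """
    def voted(x):
        outputs = [module(x) ^ f for f in faults]
        return majority(*outputs)
    return voted


def triple(x):
    """Example module: multiply by 3, truncated to 8 bits."""
    return (3 * x) & 0xFF


if __name__ == "__main__":
    print(tmr(triple, faults=(0, 0x10, 0))(7))     # one bad copy is outvoted
    print(tmr(triple, faults=(0x10, 0x10, 0))(7))  # two identical faults win
```

The second call shows the limit of TMR: if two copies fail the same way, the
voter passes their wrong result through.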
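The single-parity RAID levels (such as RAID 4 and RAID 5) under (3) Storage
store the XOR of the data blocks in each stripe, so any one lost block can be
recomputed from the surviving blocks plus the parity block. A minimal sketch,
assuming equal-length blocks held in memory and hypothetical function names:

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of a list of equal-length byte blocks."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def rebuild(stripe, parity):
    """Recompute the single missing block (marked None) in a stripe."""
    missing = [i for i, blk in enumerate(stripe) if blk is None]
    if len(missing) != 1:
        raise ValueError("single-parity RAID can rebuild exactly one lost block")
    survivors = [blk for blk in stripe if blk is not None]
    return xor_blocks(survivors + [parity])


if __name__ == "__main__":
    data = [b"abcd", b"efgh", b"ijkl"]
    parity = xor_blocks(data)          # written to the parity disk
    degraded = [data[0], None, data[2]]  # disk 1 has failed
    print(rebuild(degraded, parity))
```

Losing two blocks in the same stripe is beyond what one parity block can
recover, which is why the sketch refuses that case.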
Another good source of real industrial data is the IBM Journal of Research
and Development.
- Subhasish Mitra
Stanford University