( ESNUG 354 Item 9 ) ---------------------------------------------- [6/1/00]

Subject: (  ESNUG 353 #8  )   URL's On Fault-Tolerant System Design Strategies

> Does anybody know of a resource (web, book or article) describing
> architecture design for systems, storage or logic, whose components are
> prone to very high rate of failure, along the line of 0.1%-1%?  
>
>     - Greg Deych


From: Subhasish Mitra 

Hi John,

You can take a look at the web site of the Center for Reliable Computing
(CRC) here at Stanford University, directed by Prof. Ed McCluskey:
http://crc.stanford.edu
The bibliography page has a very comprehensive list of publications related
to fault tolerance and digital test.  Some of the major innovations in the
field of fault-tolerant computing happened at the Stanford CRC.  Right now,
the center is running two big projects on fault tolerance:
(1) fault tolerance in reconfigurable systems and (2) fault tolerance in the
space environment (they have experiments running on a real satellite in
space).

As for Greg's question about fault-tolerance techniques in systems, here is
a simple, high-level classification:

  (1) Memories: Error detecting and correcting codes (Hamming codes, etc.)

  (2) Logic: 

      Error detection:

      Techniques include duplication, diverse duplication (different
      implementations of the same logic function), and parity prediction.
      For circuits like adders, parity prediction may be economical;
      however, recent IBM papers on the G5/G6 report that duplication,
      rather than parity prediction, was chosen for the execution units.
      A major concern in these systems is common-mode failures (a single
      cause affecting multiple modules, so data integrity is not
      guaranteed).  It has been shown recently that for random logic
      circuits, diverse duplication has marginally more area overhead than
      parity prediction.  However, diverse duplication provides
      significantly more protection against common-mode failures.

      Error correction:

      Triple modular redundancy (TMR), etc.

  (3) Storage (Disks): RAID (redundant array of inexpensive disks).
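
To make the classification above a bit more concrete, here is a minimal
Python sketch of one textbook instance of each category: a Hamming(7,4)
single-error-correcting code for memory, duplicate-and-compare (detection
only) and a bitwise majority voter (TMR) for logic, and XOR parity in the
style of RAID for disks.  All function names are hypothetical and chosen for
illustration only; real designs implement these in hardware or in a storage
controller, and none of this is taken from the CRC or IBM publications.

```python
from functools import reduce
from operator import xor


# (1) Memories: Hamming(7,4), corrects any single flipped bit.
def hamming74_encode(d):
    """Encode data bits [d1, d2, d3, d4] as codeword positions 1..7."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4          # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(c):
    """Return (data_bits, error_position); position 0 means no error seen."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s4  # syndrome is the 1-based error position
    if pos:
        c[pos - 1] ^= 1         # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], pos


# (2) Logic, error detection: run two (ideally diverse) implementations
# of the same function and compare.  Detects a mismatch, cannot correct it.
def duplicate_and_compare(impl_a, impl_b, x):
    a, b = impl_a(x), impl_b(x)
    if a != b:
        raise RuntimeError("mismatch between duplicated modules")
    return a


# (2) Logic, error correction: TMR bitwise majority of three module outputs.
def tmr_vote(a, b, c):
    return (a & b) | (a & c) | (b & c)


# (3) Storage: XOR parity across equal-length blocks; any one lost block
# can be rebuilt from the survivors plus the parity block.
def xor_parity(blocks):
    return bytes(reduce(xor, column) for column in zip(*blocks))


def rebuild_lost_block(surviving_blocks, parity):
    return xor_parity(list(surviving_blocks) + [parity])


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1                                  # inject a single-bit fault
    print(hamming74_correct(word))
    print(bin(tmr_vote(0b1010, 0b1010, 0b0110)))  # one faulty module
    disks = [b"abcd", b"efgh", b"ijkl"]
    parity = xor_parity(disks)
    print(rebuild_lost_block([disks[0], disks[2]], parity))
```

Note that the voter and the comparator are themselves single points of
failure in this sketch, and that identical copies offer no protection
against a common-mode fault that hits all copies the same way, which is the
motivation for diverse duplication mentioned above.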

Another good source of real industrial data is the IBM Journal of Research
and Development.

    - Subhasish Mitra
      Stanford University



"Relax. This is a discussion. Anything said here is just one engineer's opinion. Email in your dissenting letter and it'll be published, too."
Copyright 1991-2024 John Cooley.  All Rights Reserved.

   !!!     "It's not a BUG,
  /o o\  /  it's a FEATURE!"
 (  >  )
  \ - / 
  _] [_     (jcooley 1991)