( ESNUG 353 Item 8 ) --------------------------------------------- [5/24/00]

Subject: Books, Papers, URL's On Fault-Tolerant System Design Strategies

> Does anybody know of a resource (web, book or article) describing
> architecture design for systems, storage or logic, whose components are
> prone to a very high rate of failure, along the lines of 0.1%-1%?
>
>     - Greg Deych


From: Terje Mathisen 

You start with the component failure rate and the target failure rate.
Then you decide if you need error correction or if error detection and
retry is feasible.  If every single component, including the error
checking/voting hardware has the same, extremely high, error rate, then it
becomes _very_ hard.
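
As a rough illustration of the detect-and-retry option, here is a minimal Python sketch. It assumes that failures are independent from one attempt to the next and that the error detection itself is perfect; the function name and the example numbers are hypothetical, not taken from any of the posts.

```python
def attempts_needed(p_fail, target):
    """Smallest number of attempts (first try plus retries) such that the
    chance of *every* attempt failing is at or below `target`.

    Assumes independent failures and perfect error detection, so the
    residual failure probability after n attempts is p_fail ** n.
    """
    if not (0.0 < p_fail < 1.0) or not (0.0 < target < 1.0):
        raise ValueError("p_fail and target must both be strictly between 0 and 1")
    attempts = 1
    residual = p_fail  # probability that all attempts so far have failed
    while residual > target:
        residual *= p_fail
        attempts += 1
    return attempts


# A 1% per-operation failure rate and a 1e-9 target need 5 attempts.
print(attempts_needed(0.01, 1e-9))
```

If the checker is as unreliable as the parts it checks, the perfect-detection assumption breaks down, which is the hard case described above.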

What about the failures themselves?  Are they silent, soft, hard?

Will a faulty component stop working altogether, or will it just produce
wrong answers?  It helps a _lot_ if you can assume that some components
are much better than the worst parts. :-)

    - Terje Mathisen

         ----    ----    ----    ----    ----    ----   ----

From: Greg Deych 

That could be the case, but I think I'll start off with hardware which
has more or less regular failure rates.  

Actually, it's easier if the component just fails altogether.  That
way, you can ignore it, rather than accounting for unpredictable
values it may output.

    - Greg Deych

         ----    ----    ----    ----    ----    ----   ----

From: Philip Koopman 

You'd really need to express that failure rate in terms of failures per
unit time and then contrast that to expected "mission time".  Do you mean
failures of 1% per hour?  How long until you get to repair it -- 1 hour or
10,000 hours?
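
To make the rate-versus-mission-time point concrete, here is a small Python sketch. It assumes a constant failure rate (the exponential model) and reads "1% per hour" as a rate of 0.01 failures per hour; both are assumptions for illustration, and the function name is hypothetical.

```python
import math


def reliability(failure_rate_per_hour, mission_hours):
    """Probability that a single unrepaired component survives the mission,
    assuming a constant failure rate: R(t) = exp(-rate * t)."""
    return math.exp(-failure_rate_per_hour * mission_hours)


rate = 0.01  # assumed reading of "1% per hour"
print(reliability(rate, 1))       # about 0.990
print(reliability(rate, 100))     # about 0.368
print(reliability(rate, 10_000))  # about 3.7e-44, i.e. effectively zero
```

The same nominal failure rate is nearly harmless over a 1 hour mission and hopeless over 10,000 hours without repair.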

The usual tool to deal with this is redundancy, and there are shelves and
shelves of books that deal with that.  But it only works for "moderate"
failure rates with a lot of caveats and a lot of money in many cases.
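
One way to see why redundancy only helps at "moderate" failure rates is the textbook triple modular redundancy (TMR) formula. The Python sketch below assumes three identical, independently failing units and a perfect majority voter; the function name is hypothetical.

```python
def tmr_reliability(r):
    """Reliability of a 2-out-of-3 majority-voted system built from units
    that each work with probability r, assuming independent failures and
    a perfect voter: R = 3r^2 - 2r^3."""
    if not (0.0 <= r <= 1.0):
        raise ValueError("r must be between 0 and 1")
    return 3 * r**2 - 2 * r**3


for r in (0.99, 0.9, 0.5, 0.4):
    print(r, tmr_reliability(r))
```

Under these assumptions TMR beats a single unit only when r is above 0.5. At r = 0.5 it breaks even, and below that the voted system is worse than one unit alone.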

    - Philip Koopman
      Carnegie Mellon University

         ----    ----    ----    ----    ----    ----   ----

From: Greg Neff 

OK then, you might want to check out:

    "Fault Tolerant System Design" by
     Shem-Tov Levi and Ashok K. Agrawala
     McGraw-Hill
     ISBN 0-07-037515-1

Good book.

    - Greg Neff, VP Engineering
      Microsym Computers Inc.

         ----    ----    ----    ----    ----    ----   ----

From: janm@penfold.transactionsite.com (Jan Mikkelsen)

A few books:

  "Transaction Processing: Concepts & Techniques"
   By Gray & Reuter
   Morgan Kaufmann Publishing

  "Reliable Computer Systems: Design and Evaluation", Third Edition 
   By Siewiorek & Swarz

  "Atomic Transactions: In Concurrent and Distributed Systems"
   by Lynch, Merritt & Weihl
   Morgan Kaufmann Publishing

  "Fault Tolerance in Distributed Systems"
   By Pankaj Jalote

Tandem have (or had?) a thing called Tandem Information Manager (TIM) where
you could get the Tandem documentation on CD for around $150.  That might
also be useful.

    - Jan Mikkelsen
      TransactionSite Pty. Ltd.                  Sydney, Australia

         ----    ----    ----    ----    ----    ----   ----

From: "Daryl Bradley" 

Take a look at

      http://www.amp.york.ac.uk/external/media/cal/welcome.html

It has bits of 'something a bit different' on fault-tolerant design.

    - Daryl Bradley
      University of York                         York, UK

         ----    ----    ----    ----    ----    ----   ----

From: Mark Brehob 

This might be well known to those of you in the FPGA world:

  "A Defect-Tolerant Computer Architecture: Opportunities for Nanotechnology"
   by J. Heath, P. Kuekes, G. Snider, R. Williams.
   Science, 12 June 1998, pages 1716-1721.

Note it is _defect_ tolerant, not fault tolerant per se.  That is, it finds
the errors in the system _then_ starts to do work.  It assumes that it is
working with highly-broken components, but that they aren't in the process
of breaking as time goes on.
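
The "find the defects first, then work" idea can be sketched in a few lines of Python. This is only an illustration of the general approach, not the method from the paper: the self-test callback, the unit numbering, and the round-robin assignment are all hypothetical.

```python
def build_defect_map(n_units, passes_self_test):
    """Test every unit once, up front, and return the indices of the good ones.
    Assumes defects are static: a unit that passes now keeps working."""
    return [i for i in range(n_units) if passes_self_test(i)]


def assign_tasks(tasks, good_units):
    """Spread tasks round-robin over the known-good units only."""
    if not good_units:
        raise RuntimeError("no working units available")
    return {task: good_units[k % len(good_units)] for k, task in enumerate(tasks)}


defective = {1, 4}  # hypothetical defects, unknown to the system until tested
good = build_defect_map(6, lambda i: i not in defective)
print(good)                                  # [0, 2, 3, 5]
print(assign_tasks(["a", "b", "c"], good))   # {'a': 0, 'b': 2, 'c': 3}
```

Nothing here copes with a unit that fails after the map is built, which is exactly the distinction from fault tolerance.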

    - Mark Brehob
      Michigan State University

         ----    ----    ----    ----    ----    ----   ----

From: Brian Drummond 

I wonder if it's worth trawling for information from the vacuum-tube and
mercury delay line days (ACE, EDSAC, LEO etc), the late 1940's and very
early 1950's. They faced these problems and usually, certainly LEO (Lyons
Electronic Office) did, developed strategies to deal with them ... e.g.
regular checkpointing, running test patterns with over/under voltage to
catch marginal performance, etc.
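
For anyone unfamiliar with checkpointing, here is a minimal Python sketch of the idea: save a known-good state every few steps and roll back to it when a fault is detected. It assumes faults are transient and detectable; the function, its parameters, and the fault check are hypothetical and have nothing to do with how LEO actually did it.

```python
import copy


def run_with_checkpoints(steps, state, interval, is_faulty, max_rollbacks=10):
    """Apply `steps` in order, checkpointing every `interval` steps.
    When `is_faulty(state)` reports a problem, restore the last checkpoint
    and redo the work since then. Gives up after `max_rollbacks` rollbacks,
    since a permanent fault would otherwise loop forever."""
    checkpoint = (0, copy.deepcopy(state))  # (next step index, saved state)
    i = 0
    rollbacks = 0
    while i < len(steps):
        state = steps[i](state)
        i += 1
        if is_faulty(state):
            if rollbacks >= max_rollbacks:
                raise RuntimeError("too many rollbacks; fault looks permanent")
            i, saved = checkpoint
            state = copy.deepcopy(saved)
            rollbacks += 1
            continue
        if i % interval == 0:
            checkpoint = (i, copy.deepcopy(state))
    return state, rollbacks


calls = {"n": 0}

def inc(x):
    return x + 1

def flaky_inc(x):
    """Produces a bad result on its first call only (a transient fault)."""
    calls["n"] += 1
    return -1 if calls["n"] == 1 else x + 1

print(run_with_checkpoints([inc, inc, flaky_inc, inc], 0, 2, lambda s: s < 0))
# (4, 1): the correct final value, reached after one rollback
```

A shorter interval wastes less work per rollback but spends more time saving state.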

As a start I'd search for M.V. (Maurice) Wilkes and see what turns up...

    - Brian Drummond

         ----    ----    ----    ----    ----    ----   ----

From: Greg Deych 

That sounds very promising.  I went searching for those kinds of
references yesterday, but without much success.  Unfortunately, ACM's and
IEEE's digital libraries go back only to 1988 or so.

    - Greg Deych

         ----    ----    ----    ----    ----    ----   ----

From: eee@netcom.com (Mark Thorson)

I vaguely recall a paper with a title something like "On Building A Reliable
System From Unreliable Nodes" by von Neumann.  Does anyone remember that
more clearly?

    - Mark Thorson


Copyright 1999-2007 John Cooley.  All Rights Reserved.