( ESNUG 353 Item 8 ) --------------------------------------------- [5/24/00]
Subject: Books, Papers, URL's On Fault-Tolerant System Design Strategies
> Does anybody know of a resource (web, book or article) describing
> architecture design for systems, storage or logic, whose components are
> prone to a very high rate of failure, along the lines of 0.1%-1%?
>
> - Greg Deych
From: Terje Mathisen
You start with the component failure rate and the target failure rate.
Then you decide if you need error correction or if error detection and
retry is feasible. If every single component, including the error
checking/voting hardware has the same, extremely high, error rate, then it
becomes _very_ hard.
What about the failures themselves? Are they silent, soft, hard?
Will a faulty component stop working altogether, or will it just produce
wrong answers? It helps a _lot_ if you can assume that some components
are much better than the worst parts. :-)
- Terje Mathisen
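A minimal sketch of the trade-off Terje describes, assuming independent failures and a per-operation failure probability p. The function names and the sample values are illustrative assumptions, not anything from the posters. It compares 2-of-3 voting (with an ideal voter and with a voter as bad as the parts) against detect-and-retry with perfect detection:

```python
def tmr_failure(p, p_voter=0.0):
    """Failure probability of 2-of-3 majority voting.

    Assumes independent failures; p is the per-operation failure
    probability of each replica, p_voter that of the voter itself.
    """
    # The majority is wrong when at least two of the three replicas fail.
    majority_wrong = 3 * p**2 * (1 - p) + p**3
    # The result is right only if the voter works AND the majority is right.
    return 1 - (1 - p_voter) * (1 - majority_wrong)


def retry_failure(p, attempts):
    """Failure probability of detect-and-retry with perfect detection.

    Assumes each attempt fails independently with probability p and that
    every failure is detected, so the system fails only if all attempts do.
    """
    return p ** attempts


if __name__ == "__main__":
    for p in (0.001, 0.01):
        print(f"p={p}: TMR ideal voter {tmr_failure(p):.3e}, "
              f"TMR voter at p {tmr_failure(p, p):.3e}, "
              f"3 tries {retry_failure(p, 3):.3e}")
```

Under these assumptions, a voter that fails as often as the parts leaves the voted system slightly worse than a single part, which is the "_very_ hard" case above.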
---- ---- ---- ---- ---- ---- ----
From: Greg Deych
That could be the case, but I think I'll start off with hardware which
has more or less regular failure rates.
Actually, it's easier if the component just fails altogether. That
way, you can ignore it, rather than accounting for the unpredictable
values it may output.
- Greg Deych
---- ---- ---- ---- ---- ---- ----
From: Philip Koopman
You'd really need to express that failure rate in terms of failures per
unit time and then contrast that to expected "mission time". Do you mean
failures of 1% per hour? How long until you get to repair it -- 1 hour or
10,000 hours?
The usual tool to deal with this is redundancy, and there are shelves and
shelves of books that deal with that. But it only works for "moderate"
failure rates with a lot of caveats and a lot of money in many cases.
- Philip Koopman
Carnegie Mellon University
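To make the "failures per unit time vs. mission time" point concrete, here is a rough sketch that assumes a constant failure rate (the usual exponential model, which is an assumption here, not something the post specifies). The helper names are hypothetical. It reads "1%" as a 1% chance of failing in any given hour and asks how likely one unrepaired component is to survive missions of different lengths:

```python
import math


def rate_from_hourly_probability(p_fail_per_hour):
    """Constant failure rate (per hour) that gives probability p of failing within one hour."""
    return -math.log(1.0 - p_fail_per_hour)


def survival(rate_per_hour, hours):
    """Probability that one component survives `hours` without repair."""
    return math.exp(-rate_per_hour * hours)


if __name__ == "__main__":
    lam = rate_from_hourly_probability(0.01)  # "1% per hour", assumed reading
    for hours in (1, 100, 10_000):
        print(f"{hours:>6} h: survival probability {survival(lam, hours):.3e}")
```

The same 1% figure is tolerable over one hour and hopeless over 10,000 hours without repair, which is why the rate has to be tied to a time base.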
---- ---- ---- ---- ---- ---- ----
From: Greg Neff
OK then, you might want to check out:
"Fault Tolerant System Design" by
Shem-Tov Levi and Ashok K. Agrawala
McGraw-Hill
ISBN 0-07-037515-1
Good book.
- Greg Neff, VP Engineering
Microsym Computers Inc.
---- ---- ---- ---- ---- ---- ----
From: janm@penfold.transactionsite.com (Jan Mikkelsen)
A few books:
"Transaction Processing: Concepts & Techniques"
By Gray & Reuter
Morgan Kaufmann Publishing
"Reliable Computer Systems: Design and Evaluation", Third Edition
By Siewiorek & Swarz
"Atomic Transactions: In Concurrent and Distributed Systems"
by Lynch, Merritt & Weihl
Morgan Kaufmann Publishing
"Fault Tolerance in Distributed Systems"
By Pankaj Jalote
Tandem have (or had?) a thing called Tandem Information Manager (TIM) where
you could get the Tandem documentation on CD for around $150. That might
also be useful.
- Jan Mikkelsen
TransactionSite Pty. Ltd. Sydney, Australia
---- ---- ---- ---- ---- ---- ----
From: "Daryl Bradley"
Take a look at
http://www.amp.york.ac.uk/external/media/cal/welcome.html
bits of 'something a bit different' on fault tolerant design
- Daryl Bradley
University of York York, UK
---- ---- ---- ---- ---- ---- ----
From: Mark Brehob
This might be well known to those of you in the FPGA world:
"A defect-tolerant Computer Architecture: Opportunities for Nanotechnology"
by J. Heath, P. Kuekes, G. Snider, R. Williams.
Science, 12 June 1998, pages 1716-1721.
Note it is _defect_ tolerant, not fault tolerant per se. That is, it finds
the errors in the system _then_ starts to do work. It assumes that it is
working with highly-broken components, but that they aren't in the process
of breaking as time goes on.
- Mark Brehob
Michigan State University
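The "find the defects first, _then_ do work" idea can be sketched as a two-phase scheme. This is only an illustration under the assumption Mark states (defects are static and do not appear during operation); the function names, the self-test callback, and the 3% defect rate are hypothetical and are not taken from the paper:

```python
import random


def find_working(units, self_test):
    """Configuration phase: test every unit once and keep the ones that pass."""
    return [u for u in units if self_test(u)]


def assign(tasks, good_units):
    """Work phase: map each task onto a unit that passed the test, round-robin."""
    if not good_units:
        raise RuntimeError("no working units found")
    return {t: good_units[i % len(good_units)] for i, t in enumerate(tasks)}


if __name__ == "__main__":
    rng = random.Random(0)
    # Static defects, fixed before any work starts (assumed 3% defect rate).
    defective = {u for u in range(100) if rng.random() < 0.03}
    good = find_working(range(100), lambda u: u not in defective)
    print(len(good), assign(["t0", "t1", "t2"], good))
```

Because the map of good units is built once, this gives no protection against a unit that breaks later; that would need fault tolerance in the usual sense.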
---- ---- ---- ---- ---- ---- ----
From: Brian Drummond
I wonder if it's worth trawling for information from the vacuum-tube and
mercury delay line days (ACE, EDSAC, LEO etc), the late 1940's and very
early 1950's. They faced these problems and usually, certainly LEO (Lyons
Electronic Office) did, developed strategies to deal with them ... e.g.
regular checkpointing, running test patterns with over/under voltage to
catch marginal performance, etc.
As a start I'd search for M.V. (Maurice) Wilkes and see what turns up...
- Brian Drummond
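As a rough illustration of the regular-checkpointing strategy mentioned above, the sketch below saves state after each successful step and retries only the failed step. It assumes failures are detected (fail-stop), and every name in it is hypothetical; it is not a description of how LEO or any other machine actually did it:

```python
import random


class StepFailed(Exception):
    """Raised when a step detects its own failure (fail-stop assumption)."""


def run_with_checkpoints(steps, state, max_retries=100):
    """Run steps in order, keeping the state from the last step that succeeded.

    A failed step is retried from the last checkpoint instead of
    restarting the whole job.
    """
    checkpoint = state
    for step in steps:
        for _ in range(max_retries):
            try:
                checkpoint = step(checkpoint)  # commit only on success
                break
            except StepFailed:
                continue  # roll back: checkpoint is unchanged
        else:
            raise RuntimeError("step kept failing; giving up")
    return checkpoint


def flaky_increment(rng, p_fail):
    """Build a step that adds 1 but fails (detectably) with probability p_fail."""
    def step(x):
        if rng.random() < p_fail:
            raise StepFailed()
        return x + 1
    return step


if __name__ == "__main__":
    rng = random.Random(1)
    steps = [flaky_increment(rng, 0.2) for _ in range(50)]
    print(run_with_checkpoints(steps, 0))
```

With a 20% per-step failure rate, a 50-step job run without checkpoints would almost never finish in one pass; with them, each failure costs only one repeated step.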
---- ---- ---- ---- ---- ---- ----
From: Greg Deych
That sounds very promising. I went searching for those kinds of
references yesterday, but without much success. Unfortunately, ACM's and
IEEE's digital libraries go back only to 1988 or so.
- Greg Deych
---- ---- ---- ---- ---- ---- ----
From: eee@netcom.com (Mark Thorson)
I vaguely recall a paper with a title something like "On Building A Reliable
System From Unreliable Nodes" by von Neumann. Does anyone remember that
more clearly?
- Mark Thorson