Adaptable Fault Tolerance Configurations for Multiprocessor Systems

Samia A. Ali

Research Article

International Journal of Applied Information Systems

Foundation of Computer Science (FCS), NY, USA

Volume 3 - Number 2

Year of Publication: 2012

Authors: Samia A. Ali

http:/ijais12-450448

Samia A. Ali. Adaptable Fault Tolerance Configurations for Multiprocessor Systems. International Journal of Applied Information Systems. 3, 2 (July 2012), 1-8. DOI=http:/ijais12-450448

@article{ http:/ijais12-450448,
  author = { Samia A. Ali },
  title = { Adaptable Fault Tolerance Configurations for Multiprocessor Systems },
  journal = { International Journal of Applied Information Systems },
  issue_date = { July 2012 },
  volume = { 3 },
  number = { 2 },
  month = { July },
  year = { 2012 },
  issn = { 2249-0868 },
  pages = { 1-8 },
  numpages = {9},
  url = { https://www.ijais.org/archives/volume3/number2/201-0448/ },
  doi = { http:/ijais12-450448 },
  publisher = {Foundation of Computer Science (FCS), NY, USA},
  address = {New York, USA}
}

%0 Journal Article
%1 2023-07-05T10:45:21.190547+05:30
%A Samia A. Ali
%T Adaptable Fault Tolerance Configurations for Multiprocessor Systems
%J International Journal of Applied Information Systems
%@ 2249-0868
%V 3
%N 2
%P 1-8
%D 2012
%I Foundation of Computer Science (FCS), NY, USA

Abstract

The growing complexity of multiprocessor systems raises the probability of faults occurring in them. As a consequence, there is a great need for fault-tolerant processing in multiprocessor systems. Fault tolerance generally requires some form of hardware and/or time redundancy. Two fault-tolerant configurations are proposed that handle both single and double faults, transient or permanent, in any processor of a multiprocessor system. Fault tolerance takes place in three consecutive steps: fault detection, fault diagnosis, and system recovery. The overhead of the first configuration is 100% extra hardware for fault detection; the overhead of the second is 100% extra time for fault detection. In both configurations, fault diagnosis and system recovery cost an extra 100% time, incurred only by the processes running on the faulty processors. The advantages of the proposed configurations are their ease of application and their low overhead relative to a system without any fault tolerance. An enhancement is developed for both configurations that checks the system state so that faults are detected and recovered from as soon as they affect the system. Simulations are performed to illustrate the usefulness of the proposed configurations.
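
The three steps named in the abstract (detection, diagnosis, recovery) can be illustrated for the time-redundancy case with a minimal sketch. The abstract does not spell out the mechanism, so the code assumes that detection works by executing a process twice and comparing results, and that diagnosis and recovery use one further execution with a majority vote. The names `run_with_time_redundancy` and `make_flaky_task` are hypothetical and do not come from the paper. The sketch covers only a single transient fault; permanent and double faults, which the paper also addresses, would need more than re-execution on the same processor.

```python
from typing import Callable, Set, Tuple, TypeVar

T = TypeVar("T")


def run_with_time_redundancy(task: Callable[[], T]) -> Tuple[T, int, bool]:
    """Run `task` twice and compare; on a mismatch, run it a third time and vote.

    Returns (result, number_of_executions, fault_detected).
    Assumption: a fault shows up as a differing result between executions.
    """
    first = task()
    second = task()  # detection: one extra execution (100% time overhead)
    if first == second:
        return first, 2, False

    third = task()  # diagnosis and recovery: a further execution, only on mismatch
    if third == first:
        return first, 3, True
    if third == second:
        return second, 3, True
    raise RuntimeError("no two executions agree; the fault cannot be masked")


def make_flaky_task(correct: int, wrong: int, faulty_runs: Set[int]) -> Callable[[], int]:
    """Hypothetical test task: returns `wrong` on the 1-based run indices in `faulty_runs`."""
    calls = {"n": 0}

    def task() -> int:
        calls["n"] += 1
        return wrong if calls["n"] in faulty_runs else correct

    return task


if __name__ == "__main__":
    # A transient fault corrupts the second execution only.
    print(run_with_time_redundancy(make_flaky_task(42, -1, {2})))  # (42, 3, True)
```

In the fault-free case the sketch performs two executions, matching the 100% time overhead quoted for detection; the third execution is paid only when a mismatch is observed, in the same spirit as the abstract's statement that the extra cost falls only on processes affected by a fault.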

Index Terms

Computer Science

Information Sciences

Keywords

Hardware Redundancy, Time Redundancy, Transient Fault, Permanent Fault, Cold Standby Spare