Adaptable Fault Tolerance Configurations for Multiprocessor Systems

Samia A. Ali. Published in Architecture

International Journal of Applied Information Systems
Year of Publication 2012
© 2010 by IJAIS Journal
Authors Samia A. Ali
http:/ijais12-450448
  1. Samia A. Ali. Article: Adaptable Fault Tolerance Configurations for Multiprocessor Systems. International Journal of Applied Information Systems 3(2):1-8, July 2012.

     BibTeX

    @article{key:article,
    	author = "Samia A. Ali",
    	title = "Article: Adaptable Fault Tolerance Configurations for Multiprocessor Systems",
    	journal = "International Journal of Applied Information Systems",
    	year = 2012,
    	volume = 3,
    	number = 2,
    	pages = "1-8",
    	month = "July",
    	note = "Published by Foundation of Computer Science, New York, USA"
    }
    

Abstract

The growing complexity of multiprocessor systems increases the probability of faults occurring in them. As a consequence, there is a great need for fault-tolerant processing in multiprocessor systems. Fault tolerance generally requires some form of hardware and/or time redundancy. Two fault-tolerant configurations are proposed for both single and double faults, transient and permanent, in any processor of a multiprocessor system. Fault tolerance takes place in three consecutive steps: fault detection, fault diagnosis, and system recovery. The overhead of the first (second) configuration is only 100% hardware (time) for fault detection, plus an extra 100% time for fault diagnosis and system recovery, incurred only by the processes running on the faulty processors. The advantages of the proposed configurations are their ease of applicability and their low overhead relative to a system without any fault tolerance. An enhancement is developed for both configurations that checks the system state adequately, so that faults are detected and recovered from as soon as they affect the system. Simulations are performed to illustrate the usefulness of the proposed configurations.
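
The three steps named in the abstract (detection, diagnosis, recovery) can be illustrated with a minimal sketch of hardware redundancy for a single fault. This is not the paper's configuration: the abstract does not give the mechanism, so the duplicate-and-compare detection, the re-execution on a spare for diagnosis, and every name below (`run_with_duplication`, `healthy`, `flipped`, `primary`, `shadow`, `spare`) are assumptions made only for illustration.

```python
from typing import Callable, Optional, Tuple

Task = Callable[[int], int]
Processor = Callable[[Task, int], int]  # hypothetical model of a processor


def healthy(task: Task, x: int) -> int:
    """A processor that executes the task correctly."""
    return task(x)


def flipped(task: Task, x: int) -> int:
    """A processor with a simulated fault: the lowest result bit is flipped."""
    return task(x) ^ 1


def run_with_duplication(
    task: Task, x: int, primary: Processor, shadow: Processor, spare: Processor
) -> Tuple[int, Optional[str]]:
    """Return (result, name of the processor diagnosed as faulty, or None)."""
    # Step 1, fault detection: run the task on two processors and compare.
    # The second processor is the 100% hardware overhead in this sketch.
    a = primary(task, x)
    b = shadow(task, x)
    if a == b:
        return a, None  # no disagreement, so no extra time is spent

    # Step 2, fault diagnosis: re-execute on a spare (assumed fault-free)
    # and see which earlier result it contradicts. This is the extra time.
    c = spare(task, x)
    if c == b:
        faulty = "primary"
    elif c == a:
        faulty = "shadow"
    else:
        raise RuntimeError("more than one faulty result; cannot diagnose")

    # Step 3, system recovery: continue with the result that was confirmed.
    return c, faulty


def square(x: int) -> int:
    return x * x


if __name__ == "__main__":
    print(run_with_duplication(square, 3, healthy, healthy, healthy))
    print(run_with_duplication(square, 3, flipped, healthy, healthy))
```

In this sketch the re-execution runs only when the comparison fails, which mirrors the abstract's point that the extra time is paid only by processes on a faulty processor. Handling double faults or distinguishing transient from permanent faults would need more than is shown here.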

Keywords

Hardware Redundancy, Time Redundancy, Transient Fault, Permanent Fault, Cold Standby Spare