Towards Proactive Fault Management of Enterprise Systems
Abstract
This paper introduces a model-based approach for autonomic fault management of computing systems. The proposed approach can recover a system from common faults while minimizing the impact on the system's quality of service and reducing potential revenue loss. When faults occur, the approach identifies fault types and accordingly compute the optimal recovery action with minimum impact on performance and operating cost using a predictive control algorithm. The paper introduces the formal settings of the model-based fault management approach and the underlying predictive control algorithm. The fault management approach has been verified on a testbed with respect to simulated faults including memory leak and network congestion. Simulation results show that our approach enabled effective automatic recovery from these faults with minimum impacts of system performance. 2015 IEEE.
Collections
- Computer Science & Engineering [2402 items ]