Cluster computing, as the backbone of Sandia’s capacity computing, has become crucial to many of Sandia’s missions. Viewing a cluster as a large collection of statistically similar devices allows us to detect aberrant node behavior due to various effects long before catastrophic failure occurs. This view is the basis for our Distributed, Intelligent RAS System for Large Computational Clusters project. RAS stands for Reliability, Availability, and Serviceability, and refers to the usability and stability of clusters. Our software tool, OVIS, allows a system administrator to make sense of environmental data even without a fundamental understanding of the total internal state of the nodes, the interaction between nodes, and the interaction of the machines with their environment. In addition to advancing problem detection, OVIS allows visualization of various configuration effects and can aid in their resolution.
OVIS performs statistical calculations on system and environmental data, characterizing the behavior of each node relative to the behavior of the entire set of nodes. Abnormal behavior is then determined automatically by detecting behaviors that have low statistical probability. An extremely simple example is the detection of a node whose temperature is low compared to the temperatures of all other nodes in the system, when all nodes are situated in a uniform environment and under the same computational load. Here, "low" can be defined as lying a certain number of standard deviations below the mean of all nodes’ temperatures. This is in contrast to traditional methods, which merely check for the crossing of some fixed threshold value. Because a node can deviate noticeably from its peers while still inside such a threshold, our statistical methods can detect problems earlier than the traditional methods.
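The standard-deviation test described above can be illustrated with a minimal Python sketch. It is not OVIS code: the function name `find_outlier_nodes`, the node names, the temperature readings, and the default cutoff of 3 standard deviations are all assumptions chosen for illustration. The sketch flags deviations in either direction; the sign of the score shows whether a node reads low or high.

```python
from statistics import fmean, pstdev


def find_outlier_nodes(readings, n_sigma=3.0):
    """Return {node: z_score} for nodes far from the cluster-wide mean.

    readings: mapping of node name to a single measurement (e.g. temperature).
    n_sigma:  hypothetical cutoff, in standard deviations (an assumption).
    """
    values = list(readings.values())
    if len(values) < 2:
        return {}  # not enough nodes to compare against

    mean = fmean(values)
    spread = pstdev(values)  # population standard deviation over all nodes
    if spread == 0:
        return {}  # every node reports the same value, so nothing stands out

    # Distance of each node from the mean, in units of standard deviation.
    scores = {node: (value - mean) / spread for node, value in readings.items()}
    return {node: z for node, z in scores.items() if abs(z) >= n_sigma}


# Made-up readings: 20 nodes near 50 degrees and one unusually cool node.
temps = {f"n{i}": (49.0 if i % 2 == 0 else 51.0) for i in range(20)}
temps["n20"] = 30.0

outliers = find_outlier_nodes(temps)
for node, z in outliers.items():
    print(f"{node}: {z:+.2f} standard deviations from the mean")
```

Note that a fixed threshold check (say, an alarm above 70 degrees) would never flag the cool node in this example, while the comparison against its peers does. One caveat of this simple form is that an extreme reading inflates the standard deviation it is measured against, so on very small clusters a more robust measure of spread may be preferable.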
We have run the initial serial version of OVIS on Sandia platforms such as ICC/NWCC and Thunderbird. Current development work on OVIS focuses on advances that address scalability and fault tolerance, new analysis capabilities, and enhanced information visualization. A new version of OVIS, OVIS-2, featuring a 3-D visual interface, a database back-end, increased analysis capabilities, and a more robust framework architecture, will be released in the second quarter of 2008.
OVIS is available for free download at https://ovis.ca.sandia.gov
Read the Sandia News Release: http://www.sandia.gov/news/resources/releases/2007/ovis.html