Machine Room Diagnostic Daemon
Senior Project: 1996-1997
The NCAR Scientific Computing Division (SCD) is responsible for providing
state-of-the-art, high-performance computing resources to support the research
activities of atmospheric and related sciences around the country. The
supercomputers, mass storage devices, and associated high-speed networks that
comprise the core of NCAR's computing resources are, in turn, supported by an
array of peripheral devices and diagnostic equipment essential to keeping the
machine room operators informed of the physical status of the machines in the
room.
Problems arise, however, in ascertaining when a given device will require
operator attention, due either to hardware failure or software error. While
many of the devices in the machine room have on-board diagnostics and are
capable of reporting their individual condition, there has been no way for the
machine room operators to monitor all such devices from a single monitoring
station. As a result, device instabilities may go unnoticed (or unattended)
until the device fails, resulting in many lost hours of research productivity
until the device is brought back online.
This project involved building a software interface that collects diagnostic
information from a number of key peripheral devices and reports it to a single
communications node on an operator's workstation. The reports are generated in
real time and provide current device status, diagnostic information, and other
critical component information as available on the device. Moreover, the
diagnostics daemon is highly extensible: devices can be added or removed
as SCD acquires new hardware and test software over time. The software was
implemented in C++ using an object-oriented approach in a UNIX and Motif
environment.
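
The extensibility described above can be illustrated with a minimal sketch. It
is not the project's actual code, which is not shown here, and it uses modern
C++ features that were not available in 1996. All names in it (Device,
StatusReport, FixedStatusDevice, DeviceRegistry) are hypothetical, chosen only
to show how an abstract device interface lets devices be added or removed
without changing the code that gathers reports.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical status report; the fields are illustrative assumptions.
struct StatusReport {
    std::string device;   // name of the reporting device
    bool healthy;         // true if no operator attention is needed
    std::string detail;   // free-form diagnostic text
};

// Abstract interface that each monitored device type would implement.
class Device {
public:
    virtual ~Device() = default;
    virtual std::string name() const = 0;
    virtual StatusReport poll() const = 0;
};

// Stand-in device that always returns a fixed status. A real device class
// would query the hardware's on-board diagnostics instead.
class FixedStatusDevice : public Device {
public:
    FixedStatusDevice(std::string name, bool healthy, std::string detail)
        : name_(std::move(name)), healthy_(healthy),
          detail_(std::move(detail)) {}

    std::string name() const override { return name_; }
    StatusReport poll() const override { return {name_, healthy_, detail_}; }

private:
    std::string name_;
    bool healthy_;
    std::string detail_;
};

// Registry standing in for the single reporting point: devices can be
// added or removed at run time, and pollAll() gathers one report each.
class DeviceRegistry {
public:
    // Adds a device, replacing any existing device with the same name.
    void addDevice(std::unique_ptr<Device> device) {
        assert(device != nullptr);
        std::string key = device->name();  // read the name before moving
        devices_[key] = std::move(device);
    }

    // Removes a device by name; returns false if no such device exists.
    bool removeDevice(const std::string& name) {
        return devices_.erase(name) > 0;
    }

    // Polls every registered device, in alphabetical order by name.
    std::vector<StatusReport> pollAll() const {
        std::vector<StatusReport> reports;
        for (const auto& entry : devices_) {
            reports.push_back(entry.second->poll());
        }
        return reports;
    }

private:
    std::map<std::string, std::unique_ptr<Device>> devices_;
};
```

In this sketch, supporting a new kind of hardware means writing one more class
derived from Device and passing an instance to addDevice; the polling code
itself does not change.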
