Monitoring HP Integrated Lights-Out (iLO)

If you use HP servers with Integrated Lights-Out (iLO) installed, it is very helpful to know whether the server hardware is healthy. iLO uses MIBs from Compaq; you can find the necessary MIBs, for example, on ByteSphere's OidView.

Prerequisites

To make this work, you need HP iLO installed on your server and an OpenNMS core server that is up and running. SNMP must be configured on the HP server, and the OpenNMS core server must be able to read from it via SNMP. You can assign the services manually or use the SNMP detectors in Provisiond.

The monitors below use the following SNMP OIDs from CPQIDA-MIB and CPQHLTH-MIB:
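
For the "OpenNMS can read SNMP" prerequisite, a minimal sketch of a matching entry in snmp-config.xml could look like the following. The IP address and the community string `my-community` are placeholders for illustration, not values from this post, and your defaults on the root element will likely differ.

```xml
<snmp-config xmlns="http://xmlns.opennms.org/xsd/config/snmp"
             version="v2c" read-community="public" timeout="1800" retry="1">
    <!-- Hypothetical definition: overrides the defaults for one HP server -->
    <definition version="v2c" read-community="my-community">
        <specific>192.0.2.10</specific>
    </definition>
</snmp-config>
```

If an snmpwalk from the OpenNMS server against one of the OIDs listed below returns values with the same community, the monitors should be able to read them as well.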

Name OID
cpqDaSpareStatus .1.3.6.1.4.1.232.3.2.4.1.1.3
cpqDaLogDrvStatus .1.3.6.1.4.1.232.3.2.3.1.1.4
cpqDaPhyDrvStatus .1.3.6.1.4.1.232.3.2.5.1.1.6
cpqHeThermalTempStatus .1.3.6.1.4.1.232.6.2.6.3.0
cpqHeThermalSystemFanStatus .1.3.6.1.4.1.232.6.2.6.4.0
cpqHeThermalCpuFanStatus .1.3.6.1.4.1.232.6.2.6.5.0
cpqHeFltTolPowerSupplyStatus .1.3.6.1.4.1.232.6.2.9.3.1.5
cpqHeResMemModuleCondition .1.3.6.1.4.1.232.6.2.14.11.1.5
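
If you prefer Provisiond over assigning services manually, each OID above can be probed with an SNMP detector in the foreign source definition of your requisition. The fragment below is a sketch for two of the services only. It assumes the detector names should match the service names used in the poller configuration, and that the `isTable` parameter is appropriate for the table-based OIDs; verify both against your OpenNMS version.

```xml
<detectors>
    <!-- Scalar OID: detected if the agent returns a value -->
    <detector name="HP-iLO-Temperature"
              class="org.opennms.netmgt.provision.detector.snmp.SnmpDetector">
        <parameter key="oid" value=".1.3.6.1.4.1.232.6.2.6.3.0"/>
    </detector>
    <!-- Table column: assumption that isTable is needed to evaluate the column -->
    <detector name="HP-iLO-Drive-Physical"
              class="org.opennms.netmgt.provision.detector.snmp.SnmpDetector">
        <parameter key="oid" value=".1.3.6.1.4.1.232.3.2.5.1.1.6"/>
        <parameter key="isTable" value="true"/>
    </detector>
</detectors>
```

The remaining services follow the same pattern with their respective OIDs.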

Monitoring all these services

Next, create a monitor for each service in your poller configuration. Each monitor checks for the normal state, marked with :up: below. The following states are possible:

CPQIDA-MIB

cpqDaSpareStatus .1.3.6.1.4.1.232.3.2.4.1.1.3

  • other(1)
  • invalid(2)
  • failed(3)
  • inactive(4) - :up:
  • building(5)
  • active(6)

Test configuration in poller-configuration.xml

<service name="HP-iLO-Drive-Spare"
         interval="300000"
         user-defined="false"
         status="on">
    <parameter key="retry" value="6"/>
    <parameter key="timeout" value="4950"/>
    <parameter key="port" value="161"/>
    <parameter key="oid" value=".1.3.6.1.4.1.232.3.2.4.1.1.3"/>
    <parameter key="walk" value="true"/>
    <parameter key="operator" value="="/>
    <parameter key="operand" value="4"/>
    <parameter key="match-all" value="true"/>
    <parameter key="reason-template" value="One or more spare drives are not inactive. The state should be inactive(${operand}) the observed value is ${observedValue}. Please check your HP Insight Manager. Syntax: other(1), invalid(2), failed(3), inactive(4), building(5), active(6)"/>
</service>

<monitor service="HP-iLO-Drive-Spare" class-name="org.opennms.netmgt.poller.monitors.SnmpMonitor"/>
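
The service and monitor elements shown above do not sit side by side in the real file: the service goes inside a polling package, while the monitor definition is listed after the packages. A rough sketch of the placement, assuming the default package name `example1` (yours may differ) and with existing content reduced to comments:

```xml
<poller-configuration>
    <!-- existing attributes and node-outage settings omitted -->
    <package name="example1">
        <!-- existing filter, address ranges, rrd settings and other services omitted -->
        <service name="HP-iLO-Drive-Spare"
                 interval="300000"
                 user-defined="false"
                 status="on">
            <!-- parameters as shown above -->
        </service>
        <!-- existing downtime elements omitted -->
    </package>
    <!-- monitor definitions follow the packages -->
    <monitor service="HP-iLO-Drive-Spare" class-name="org.opennms.netmgt.poller.monitors.SnmpMonitor"/>
</poller-configuration>
```

The same placement applies to every service and monitor pair in the sections that follow.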

cpqDaLogDrvStatus .1.3.6.1.4.1.232.3.2.3.1.1.4

  • other(1)
  • ok(2) - :up:
  • failed(3)
  • unconfigured(4)
  • recovering(5)
  • readyForRebuild(6)
  • rebuilding(7)
  • wrongDevice(8)
  • badConnect(9)
  • overheating(10)
  • shutdown(11)
  • expanding(12)
  • notAvailable(13)
  • queuedForExpansion(14)
  • multipathAccessDegraded(15)
  • erasing(16)

Test configuration in poller-configuration.xml

<service name="HP-iLO-Drive-Logical"
         interval="300000"
         user-defined="false"
         status="on">
    <parameter key="retry" value="6"/>
    <parameter key="timeout" value="4950"/>
    <parameter key="port" value="161"/>
    <parameter key="oid" value=".1.3.6.1.4.1.232.3.2.3.1.1.4"/>
    <parameter key="walk" value="true"/>
    <parameter key="operator" value="="/>
    <parameter key="operand" value="2"/>
    <parameter key="match-all" value="true"/>
    <parameter key="reason-template" value="One or more logical drives are not ok. The state should be ok(${operand}) the observed value is ${observedValue}. Please check your HP Insight Manager. Syntax: other(1), ok(2), failed(3), unconfigured(4), recovering(5), readyForRebuild(6), rebuilding(7), wrongDevice(8), badConnect(9), overheating(10), shutdown(11), expanding(12), notAvailable(13), queuedForExpansion(14), multipathAccessDegraded(15), erasing(16)"/>
</service>

<monitor service="HP-iLO-Drive-Logical" class-name="org.opennms.netmgt.poller.monitors.SnmpMonitor"/>

cpqDaPhyDrvStatus .1.3.6.1.4.1.232.3.2.5.1.1.6

  • other(1)
  • ok(2) - :up:
  • failed(3)
  • predictiveFailure(4)
  • erasing(5)
  • eraseDone(6)
  • eraseQueued(7)

Test configuration in poller-configuration.xml

<service name="HP-iLO-Drive-Physical"
         interval="300000"
         user-defined="false"
         status="on">
    <parameter key="retry" value="6"/>
    <parameter key="timeout" value="4950"/>
    <parameter key="port" value="161"/>
    <parameter key="oid" value=".1.3.6.1.4.1.232.3.2.5.1.1.6"/>
    <parameter key="walk" value="true"/>
    <parameter key="operator" value="="/>
    <parameter key="operand" value="2"/>
    <parameter key="match-all" value="true"/>
    <parameter key="reason-template" value="One or more physical drives are not ok. The state should be ok(${operand}) the observed value is ${observedValue}. Please check your HP Insight Manager. Syntax: other(1), ok(2), failed(3), predictiveFailure(4), erasing(5), eraseDone(6), eraseQueued(7)"/>
</service>

<monitor service="HP-iLO-Drive-Physical" class-name="org.opennms.netmgt.poller.monitors.SnmpMonitor"/>

CPQHLTH-MIB

cpqHeThermalTempStatus .1.3.6.1.4.1.232.6.2.6.3.0

  • other(1)
  • ok(2) - :up:
  • degraded(3)
  • failed(4)

Test configuration in poller-configuration.xml

<service name="HP-iLO-Temperature"
         interval="300000"
         user-defined="false"
         status="on">
    <parameter key="retry" value="6"/>
    <parameter key="timeout" value="4950"/>
    <parameter key="port" value="161"/>
    <parameter key="oid" value=".1.3.6.1.4.1.232.6.2.6.3.0"/>
    <parameter key="operator" value="="/>
    <parameter key="operand" value="2"/>
    <parameter key="reason-template" value="Temperature status is not ok. The state should be ok(${operand}) the observed value is ${observedValue}. Please check your HP Insight Manager. Syntax: other(1), ok(2), degraded(3), failed(4)"/>
</service>

<monitor service="HP-iLO-Temperature" class-name="org.opennms.netmgt.poller.monitors.SnmpMonitor"/>

cpqHeThermalSystemFanStatus .1.3.6.1.4.1.232.6.2.6.4.0

  • other(1)
  • ok(2) - :up:
  • degraded(3)
  • failed(4)

Test configuration in poller-configuration.xml

<service name="HP-iLO-Fan-System"
         interval="300000"
         user-defined="false"
         status="on">
    <parameter key="retry" value="6"/>
    <parameter key="timeout" value="4950"/>
    <parameter key="port" value="161"/>
    <parameter key="oid" value=".1.3.6.1.4.1.232.6.2.6.4.0"/>
    <parameter key="operator" value="="/>
    <parameter key="operand" value="2"/>
    <parameter key="reason-template" value="System fan status is not ok. The state should be ok(${operand}) the observed value is ${observedValue}. Please check your HP Insight Manager. Syntax: other(1), ok(2), degraded(3), failed(4)"/>
</service>

<monitor service="HP-iLO-Fan-System" class-name="org.opennms.netmgt.poller.monitors.SnmpMonitor"/>

cpqHeThermalCpuFanStatus .1.3.6.1.4.1.232.6.2.6.5.0

  • other(1)
  • ok(2) - :up:
  • degraded(3)
  • failed(4)

Test configuration in poller-configuration.xml

<service name="HP-iLO-Fan-CPU"
         interval="300000"
         user-defined="false"
         status="on">
    <parameter key="retry" value="6"/>
    <parameter key="timeout" value="4950"/>
    <parameter key="port" value="161"/>
    <parameter key="oid" value=".1.3.6.1.4.1.232.6.2.6.5.0"/>
    <parameter key="operator" value="="/>
    <parameter key="operand" value="2"/>
    <parameter key="reason-template" value="CPU fan status is not ok. The state should be ok(${operand}) the observed value is ${observedValue}. Please check your HP Insight Manager. Syntax: other(1), ok(2), degraded(3), failed(4)"/>
</service>

<monitor service="HP-iLO-Fan-CPU" class-name="org.opennms.netmgt.poller.monitors.SnmpMonitor"/>

cpqHeFltTolPowerSupplyStatus .1.3.6.1.4.1.232.6.2.9.3.1.5

  • noError(1) - :up:
  • generalFailure(2)
  • bistFailure(3)
  • fanFailure(4)
  • tempFailure(5)
  • interlockOpen(6)
  • epromFailed(7)
  • vrefFailed(8)
  • dacFailed(9)
  • ramTestFailed(10)
  • voltageChannelFailed(11)
  • orringdiodeFailed(12)
  • brownOut(13)
  • giveupOnStartup(14)
  • nvramInvalid(15)
  • calibrationTableInvalid(16)

Test configuration in poller-configuration.xml

<service name="HP-iLO-Power-Supply"
         interval="300000"
         user-defined="false"
         status="on">
    <parameter key="retry" value="6"/>
    <parameter key="timeout" value="4950"/>
    <parameter key="port" value="161"/>
    <parameter key="oid" value=".1.3.6.1.4.1.232.6.2.9.3.1.5"/>
    <parameter key="walk" value="true"/>
    <parameter key="operator" value="="/>
    <parameter key="operand" value="1"/>
    <parameter key="match-all" value="true"/>
    <parameter key="reason-template" value="One or more power supplies are not ok. The state should be noError(${operand}) the observed value is ${observedValue}. Please check your HP Insight Manager. Syntax: noError(1), generalFailure(2), bistFailure(3), fanFailure(4), tempFailure(5), interlockOpen(6), epromFailed(7), vrefFailed(8), dacFailed(9), ramTestFailed(10), voltageChannelFailed(11), orringdiodeFailed(12), brownOut(13), giveupOnStartup(14), nvramInvalid(15), calibrationTableInvalid(16)"/>
</service>

<monitor service="HP-iLO-Power-Supply" class-name="org.opennms.netmgt.poller.monitors.SnmpMonitor"/>

cpqHeResMemModuleCondition .1.3.6.1.4.1.232.6.2.14.11.1.5

  • other(1) - :up: (I've seen servers report other(1) while HP Insight Manager said the system was fine)
  • ok(2) - :up:
  • degraded(3)

Test configuration in poller-configuration.xml

<service name="HP-iLO-Memory-Module"
         interval="300000"
         user-defined="false"
         status="on">
    <parameter key="retry" value="6"/>
    <parameter key="timeout" value="4950"/>
    <parameter key="port" value="161"/>
    <parameter key="oid" value=".1.3.6.1.4.1.232.6.2.14.11.1.5"/>
    <parameter key="walk" value="true"/>
    <parameter key="operator" value="&lt;"/>
    <parameter key="operand" value="3"/>
    <parameter key="match-all" value="true"/>
    <parameter key="reason-template" value="One or more memory modules are not ok. The state should be other(1) or ok(2), that is, lower than ${operand}; the observed value is ${observedValue}. Please check your HP Insight Manager. Syntax: other(1), ok(2), degraded(3)"/>
</service>

<monitor service="HP-iLO-Memory-Module" class-name="org.opennms.netmgt.poller.monitors.SnmpMonitor"/>
