Restart detection monitor

It’s always important to work through your outages and alarm events to get an overview of what happened in your system. In some cases you won’t notice system restarts, because systems (especially VMs) now restart very quickly, often faster than your ICMP polling cycle.

With SNMP enabled, hrSystemUptime can be used to detect restarts of Windows or Linux machines. A similar concept can be used to detect restarts of Java applications with JMX and the JVM uptime metric.

This article describes how to use thresholds in OpenNMS to detect server and Java application restarts and create events for alerting or logging.

:warning: Make sure you already collect SNMP and/or JMX metrics!

System Restarts

First of all, create the threshold events. Defining them manually is not strictly necessary, but doing so gives you a richer event with detailed, formatted information, plus a quick link to the metric graph.

    <event>
        <uei>uei.opennms.org/threshold/system/uptime/low/warning/exceeded</uei>
        <event-label>User-defined custom low threshold exceeded event for system-uptime [Warning]</event-label>
        <descr>&lt;p&gt;Low threshold exceeded for %service% datasource %parm[ds]% on interface %interface% for node %nodelabel% (nodeId %nodeid%).&lt;/p&gt;&lt;br&gt;
        &lt;table style='width:50%; white-space: nowrap;'&gt;
        &lt;tr&gt;&lt;td&gt;&lt;b&gt;Data Source&lt;/b&gt;&lt;/td&gt;&lt;td&gt;%parm[ds]%&lt;/td&gt;&lt;/tr&gt;
        &lt;tr&gt;&lt;td&gt;&lt;b&gt;Resource Label&lt;/b&gt;&lt;/td&gt;&lt;td&gt;%parm[label]%&lt;/td&gt;&lt;/tr&gt;
        &lt;tr&gt;&lt;td&gt;&lt;b&gt;Resource Instance&lt;/b&gt;&lt;/td&gt;&lt;td&gt;%parm[instance]%&lt;/td&gt;&lt;/tr&gt;
        &lt;tr&gt;&lt;td&gt;&lt;b&gt;Current Metric Value&lt;/b&gt;&lt;/td&gt;&lt;td&gt;%parm[value]%&lt;/td&gt;&lt;/tr&gt;
        &lt;tr&gt;&lt;td&gt;&lt;b&gt;Threshold Value&lt;/b&gt;&lt;/td&gt;&lt;td&gt;%parm[threshold]%&lt;/td&gt;&lt;/tr&gt;
        &lt;tr&gt;&lt;td&gt;&lt;b&gt;Rearm Value&lt;/b&gt;&lt;/td&gt;&lt;td&gt;%parm[rearm]%&lt;/td&gt;&lt;/tr&gt;
        &lt;tr&gt;&lt;td&gt;&lt;b&gt;Trigger Value&lt;/b&gt;&lt;/td&gt;&lt;td&gt;%parm[trigger]%&lt;/td&gt;&lt;/tr&gt;
        &lt;/table&gt;        &lt;/br&gt;&lt;p&gt;All parameters: %parm[all]%&lt;/p&gt;</descr>
        <logmsg dest="logndisplay">&lt;b&gt;&lt;a href='graph/results.htm?resourceId=node[%nodeid%].nodeSnmp[]&amp;reports=netsnmp.hrSystemUptime'&gt;SYSTEM&lt;/a&gt;&lt;/b&gt; has been rebooted.</logmsg>
        <severity>Warning</severity>
        <alarm-data auto-clean="false" alarm-type="1" reduction-key="%uei%:%dpname%:%nodeid%:%interface%:%parm[ds]%:%parm[threshold]%:%parm[trigger]%:%parm[rearm]%:%parm[label]%"/>
    </event>
    <event>
        <uei>uei.opennms.org/threshold/system/uptime/low/warning/rearmed</uei>
        <event-label>User-defined custom low threshold rearmed event for system-uptime</event-label>
        <descr>&lt;p&gt;Low threshold rearmed for %service% datasource %parm[ds]% on interface %interface% for node %nodelabel% (nodeId %nodeid%).&lt;/p&gt;&lt;br&gt;
        &lt;table style='width:50%; white-space: nowrap;'&gt;
        &lt;tr&gt;&lt;td&gt;&lt;b&gt;Data Source&lt;/b&gt;&lt;/td&gt;&lt;td&gt;%parm[ds]%&lt;/td&gt;&lt;/tr&gt;
        &lt;tr&gt;&lt;td&gt;&lt;b&gt;Resource Label&lt;/b&gt;&lt;/td&gt;&lt;td&gt;%parm[label]%&lt;/td&gt;&lt;/tr&gt;
        &lt;tr&gt;&lt;td&gt;&lt;b&gt;Resource Instance&lt;/b&gt;&lt;/td&gt;&lt;td&gt;%parm[instance]%&lt;/td&gt;&lt;/tr&gt;
        &lt;tr&gt;&lt;td&gt;&lt;b&gt;Current Metric Value&lt;/b&gt;&lt;/td&gt;&lt;td&gt;%parm[value]%&lt;/td&gt;&lt;/tr&gt;
        &lt;tr&gt;&lt;td&gt;&lt;b&gt;Threshold Value&lt;/b&gt;&lt;/td&gt;&lt;td&gt;%parm[threshold]%&lt;/td&gt;&lt;/tr&gt;
        &lt;tr&gt;&lt;td&gt;&lt;b&gt;Rearm Value&lt;/b&gt;&lt;/td&gt;&lt;td&gt;%parm[rearm]%&lt;/td&gt;&lt;/tr&gt;
        &lt;tr&gt;&lt;td&gt;&lt;b&gt;Trigger Value&lt;/b&gt;&lt;/td&gt;&lt;td&gt;%parm[trigger]%&lt;/td&gt;&lt;/tr&gt;
        &lt;/table&gt;        &lt;/br&gt;&lt;p&gt;All parameters: %parm[all]%&lt;/p&gt;</descr>
        <logmsg dest="logndisplay">&lt;b&gt;&lt;a href='graph/results.htm?resourceId=node[%nodeid%].nodeSnmp[]&amp;reports=netsnmp.hrSystemUptime'&gt;SYSTEM&lt;/a&gt;&lt;/b&gt; reboot alarm has been cleared.</logmsg>
        <severity>Normal</severity>
        <alarm-data auto-clean="false" clear-key="uei.opennms.org/threshold/system/uptime/low/warning/exceeded:%dpname%:%nodeid%:%interface%:%parm[ds]%:%parm[threshold]%:%parm[trigger]%:%parm[rearm]%:%parm[label]%" alarm-type="2" reduction-key="%uei%:%dpname%:%nodeid%:%interface%:%parm[ds]%:%parm[threshold]%:%parm[trigger]%:%parm[rearm]%:%parm[label]%"/>
    </event>

Create a Threshd package. The idea is to filter on a specific category (TH-SYSTEM-UPTIME-L-10), which stands for Threshold-System-Uptime-Low-10 (minutes). Put your nodes in this category and the threshold will be applied to them automatically.

    <package name="TH-SYSTEM-UPTIME-L-10">
        <filter>categoryname == 'TH-SYSTEM-UPTIME-L-10'</filter>
        <include-range begin="1.1.1.1" end="254.254.254.254"/>
        <include-range begin="::1" end="ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff"/>
        <service name="SNMP" interval="300000" user-defined="false" status="on">
            <parameter key="thresholding-group" value="TH-SYSTEM-UPTIME-L-10"/>
        </service>
    </package>

The second part of the threshold configuration is the threshold itself. It fires when the uptime is below 10 minutes, triggered by a single sample, and it is rearmed once the uptime reaches 20 minutes. hrSystemUptime is reported in hundredths of a second, so dividing by 6000 yields minutes.

    <group name="TH-SYSTEM-UPTIME-L-10" rrdRepository="/var/lib/opennms/rrd/snmp/">
        <expression type="low" ds-type="node" value="10.0" rearm="20.0"
            trigger="1"
            triggeredUEI="uei.opennms.org/threshold/system/uptime/low/warning/exceeded"
            rearmedUEI="uei.opennms.org/threshold/system/uptime/low/warning/rearmed"
            filterOperator="or" expression="hrSystemUptime/6000"/>
    </group>
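
To make the threshold behaviour above more concrete, here is a minimal Python sketch of how a low threshold with `value="10.0"`, `rearm="20.0"` and `trigger="1"` reacts to a series of uptime samples. It is only an illustration under simplified assumptions, not OpenNMS code: the function names `ticks_to_minutes` and `evaluate_low_threshold` are hypothetical, and the sample values are made up.

```python
def ticks_to_minutes(ticks):
    """Convert SNMP TimeTicks (hundredths of a second) to minutes."""
    return ticks / 6000


def evaluate_low_threshold(samples, value=10.0, rearm=20.0, trigger=1):
    """Return (sample_index, state) tuples for a low threshold.

    Simplified model (an assumption, not the OpenNMS implementation):
    the threshold fires after `trigger` consecutive samples at or below
    `value`, and rearms once a sample is at or above `rearm`.
    """
    events = []
    armed = True
    below = 0
    for i, sample in enumerate(samples):
        if armed:
            below = below + 1 if sample <= value else 0
            if below >= trigger:
                events.append((i, "triggered"))
                armed = False
                below = 0
        elif sample >= rearm:
            events.append((i, "rearmed"))
            armed = True
    return events


# Made-up uptime samples in TimeTicks, polled every 5 minutes (30000 ticks).
# The system reboots shortly before the third sample.
raw = [6_000_000, 6_030_000, 12_000, 42_000, 72_000, 102_000, 132_000]
minutes = [ticks_to_minutes(t) for t in raw]
events = evaluate_low_threshold(minutes)
print(events)
```

With a 5-minute collection interval, at least one sample after a reboot falls below the 10-minute mark, so a single sample is enough to trigger. Because the threshold stays disarmed until the uptime reaches 20 minutes, the second sample below 10 minutes does not create a duplicate event.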

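Reload Eventd and Threshd to activate the configuration without restarting OpenNMS, for example by sending a uei.opennms.org/internal/reloadDaemonConfig event with the daemonName parameter set to Eventd and to Threshd respectively. Or restart, if you like…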

JVM Restarts

You can do the same thing with the JVM uptime (via JMX) to be informed when your Java Virtual Machine restarts. The event and threshold configurations are similar to the ones above:

   <event>
      <uei>uei.opennms.org/threshold/jmx/uptime/low/warning/exceeded</uei>
      <event-label>User-defined custom low threshold exceeded event for jmx uptime [WARNING]</event-label>
      <descr>&lt;p>Low threshold exceeded for %service% datasource %parm[ds]% on interface %interface% for node %nodelabel% (nodeId %nodeid%).&lt;/p>&lt;br>
        &lt;table style='width:50%; white-space: nowrap;'>
        &lt;tr>&lt;td>&lt;b>Data Source&lt;/b>&lt;/td>&lt;td>%parm[ds]%&lt;/td>&lt;/tr>
        &lt;tr>&lt;td>&lt;b>Resource Label&lt;/b>&lt;/td>&lt;td>%parm[label]%&lt;/td>&lt;/tr>
        &lt;tr>&lt;td>&lt;b>Resource Instance&lt;/b>&lt;/td>&lt;td>%parm[instance]%&lt;/td>&lt;/tr>
        &lt;tr>&lt;td>&lt;b>Current Metric Value&lt;/b>&lt;/td>&lt;td>%parm[value]%&lt;/td>&lt;/tr>
        &lt;tr>&lt;td>&lt;b>Threshold Value&lt;/b>&lt;/td>&lt;td>%parm[threshold]%&lt;/td>&lt;/tr>
        &lt;tr>&lt;td>&lt;b>Rearm Value&lt;/b>&lt;/td>&lt;td>%parm[rearm]%&lt;/td>&lt;/tr>
        &lt;tr>&lt;td>&lt;b>Trigger Value&lt;/b>&lt;/td>&lt;td>%parm[trigger]%&lt;/td>&lt;/tr>
        &lt;/table>        &lt;/br>&lt;p>All parameters: %parm[all]%&lt;/p></descr>
      <logmsg dest="logndisplay">JVM has been restarted.</logmsg>
      <severity>Warning</severity>
      <alarm-data reduction-key="%uei%:%dpname%:%nodeid%:%interface%:%parm[ds]%:%parm[threshold]%:%parm[trigger]%:%parm[rearm]%:%parm[label]%" alarm-type="1" auto-clean="false"/>
   </event>
   <event>
      <uei>uei.opennms.org/threshold/jmx/uptime/low/warning/rearmed</uei>
      <event-label>User-defined custom low threshold rearmed event for jmx uptime</event-label>
      <descr>&lt;p>Low threshold rearmed for %service% datasource %parm[ds]% on interface %interface% for node %nodelabel% (nodeId %nodeid%).&lt;/p>&lt;br>
        &lt;table style='width:50%; white-space: nowrap;'>
        &lt;tr>&lt;td>&lt;b>Data Source&lt;/b>&lt;/td>&lt;td>%parm[ds]%&lt;/td>&lt;/tr>
        &lt;tr>&lt;td>&lt;b>Resource Label&lt;/b>&lt;/td>&lt;td>%parm[label]%&lt;/td>&lt;/tr>
        &lt;tr>&lt;td>&lt;b>Resource Instance&lt;/b>&lt;/td>&lt;td>%parm[instance]%&lt;/td>&lt;/tr>
        &lt;tr>&lt;td>&lt;b>Current Metric Value&lt;/b>&lt;/td>&lt;td>%parm[value]%&lt;/td>&lt;/tr>
        &lt;tr>&lt;td>&lt;b>Threshold Value&lt;/b>&lt;/td>&lt;td>%parm[threshold]%&lt;/td>&lt;/tr>
        &lt;tr>&lt;td>&lt;b>Rearm Value&lt;/b>&lt;/td>&lt;td>%parm[rearm]%&lt;/td>&lt;/tr>
        &lt;tr>&lt;td>&lt;b>Trigger Value&lt;/b>&lt;/td>&lt;td>%parm[trigger]%&lt;/td>&lt;/tr>
        &lt;/table>        &lt;/br>&lt;p>All parameters: %parm[all]%&lt;/p></descr>
      <logmsg dest="logndisplay">JVM restart alarm has been cleared.</logmsg>
      <severity>Normal</severity>
      <alarm-data reduction-key="%uei%:%dpname%:%nodeid%:%interface%:%parm[ds]%:%parm[threshold]%:%parm[trigger]%:%parm[rearm]%:%parm[label]%" alarm-type="2" clear-key="uei.opennms.org/threshold/jmx/uptime/low/warning/exceeded:%dpname%:%nodeid%:%interface%:%parm[ds]%:%parm[threshold]%:%parm[trigger]%:%parm[rearm]%:%parm[label]%" auto-clean="false"/>
   </event>

The Threshd package, again filtered on a category of the same name:

    <package name="TH-JMX-UPTIME-L-10">
        <filter>categoryname == 'TH-JMX-UPTIME-L-10'</filter>
        <include-range begin="1.1.1.1" end="254.254.254.254"/>
        <include-range begin="::1" end="ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff"/>
        <service name="JMX-Basic" interval="300000" user-defined="true" status="on">
            <parameter key="thresholding-group" value="TH-JMX-UPTIME-L-10"/>
        </service>
    </package>

And the threshold group. The JVM Uptime value is reported in milliseconds, so the expression divides by 1000 * 60 to get minutes:

    <group name="TH-JMX-UPTIME-L-10" rrdRepository="/var/lib/opennms/rrd/snmp/">
        <expression description="Java Uptime low 10 and rearmed 20"
            type="low" ds-type="node" value="10.0" rearm="20.0"
            trigger="1"
            triggeredUEI="uei.opennms.org/threshold/jmx/uptime/low/warning/exceeded"
            rearmedUEI="uei.opennms.org/threshold/jmx/uptime/low/warning/rearmed"
            filterOperator="or" expression="Uptime / (1000 * 60)"/>
    </group>
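
The arithmetic behind the `Uptime / (1000 * 60)` expression can be checked with a short Python sketch. It assumes the collected `Uptime` datasource carries the JVM uptime in milliseconds; the helper names `jvm_uptime_minutes` and `restart_suspected` are hypothetical and only serve the illustration.

```python
def jvm_uptime_minutes(uptime_ms):
    """Convert a JVM uptime given in milliseconds to minutes."""
    return uptime_ms / (1000 * 60)


def restart_suspected(uptime_ms, threshold_minutes=10.0):
    """True when the uptime is at or below the low-threshold value."""
    return jvm_uptime_minutes(uptime_ms) <= threshold_minutes


# Made-up sample values: three minutes after a JVM start, and one day of uptime.
print(jvm_uptime_minutes(180_000))
print(restart_suspected(180_000))
print(restart_suspected(86_400_000))
```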