How do I get a poller to alert only after X number of polling failures

I have a poller service defined with the following config:

  <service name="ABBCS-KernelMon" interval="3600000" user-defined="false" status="on">
     <parameter key="script" value="/opt/opennms/scripts/run_monitor.sh"/>
     <parameter key="retry" value="0"/>
     <parameter key="args" value="${nodeid} ${nodelabel} ${ipaddr} ${svcname} /opt/opennms/scripts/check_kernel.sh"/>
     <parameter key="timeout" value="900000"/>
     <parameter key="rrd-base-name" value="ABBCS-KernelMon"/>
     <parameter key="rrd-repository" value="/opt/opennms/share/rrd/response"/>
     <parameter key="ds-name" value="ABBCS-KernelMon"/>
  </service>
  <monitor service="ABBCS-KernelMon" class-name="org.opennms.netmgt.poller.monitors.SystemExecuteMonitor"/>

which is basically just there to raise an alert if someone forgets to reboot a server after patching. This poller is set to run hourly, but as it stands, there is a chance that this will trigger an alert before the person doing the OS patching has had a chance to reboot.

What I’d like is for the poller to require multiple consecutive failures before it raises an alert. That would give a grace period of 2 or 3 hours where the condition exists but no alert is raised, leaving time to reboot the server, so that the alert only fires at a point where the reboot was likely skipped.

Is there a native way I can set the poller service to not trigger an alert unless it fails more than a specified number of checks?
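
Ideally I could express this directly in the service definition, something like the sketch below. To be clear, the parameter here is hypothetical and does not exist in OpenNMS; it just illustrates the behaviour I’m after:

  <service name="ABBCS-KernelMon" interval="3600000" user-defined="false" status="on">
     <!-- HYPOTHETICAL parameter, not a real OpenNMS option: only mark the
          service down after 3 consecutive failed polls (a 2-3 hour grace
          period at an hourly interval) -->
     <parameter key="consecutive-failures-before-down" value="3"/>
     <!-- existing parameters as in the config above -->
  </service>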

It’s not possible right now. You can follow this issue, which describes the feature request:

https://issues.opennms.org/browse/NMS-10472

Feel free to add your ideas.
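
In the meantime, note that the existing retry parameter won’t give you that behaviour: retries happen back-to-back within a single scheduled poll, each attempt bounded by the timeout, so the service is still marked down at the end of that one poll. A sketch against your service definition (values are illustrative only):

  <service name="ABBCS-KernelMon" interval="3600000" user-defined="false" status="on">
     <!-- retry="2" means up to 3 attempts, all made within the same hourly
          poll and each limited by the timeout below; it does NOT mean
          "wait for 3 failed hourly polls before going down" -->
     <parameter key="retry" value="2"/>
     <parameter key="timeout" value="900000"/>
     <!-- remaining parameters unchanged -->
  </service>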


I often stumble over this problem.

Another example: here I am checking the cluster state of a Fortigate:

      <service name="FG-Cluster-Status" interval="300000" user-defined="true" status="on">
         <parameter key="retry" value="5"/>
         <parameter key="timeout" value="500"/>
         <parameter key="port" value="161"/>
         <parameter key="oid" value="1.3.6.1.4.1.12356.101.13.2.1.1.1"/>
         <parameter key="operator" value="&lt;"/>
         <parameter key="operand" value="3"/>
         <parameter key="walk" value="true"/>
         <parameter key="match-all" value="count"/>
         <parameter key="minimum" value="2"/>
         <parameter key="maximum" value="2"/>
      </service>
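
The matching monitor declaration is not shown above; assuming this uses the stock SnmpMonitor, it would be something like:

      <monitor service="FG-Cluster-Status" class-name="org.opennms.netmgt.poller.monitors.SnmpMonitor"/>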

The walk should always return 2 rows; otherwise the HA state is broken. This monitor works fine, but it happens really often that SNMP behaves strangely. Often the device’s SNMP implementation is simply not good, and I know that OpenNMS is doing exactly the right thing by taking the service down when there are not 2 rows. But that fact does not help: there is absolutely no chance to avoid the nodeLostService. There is a really big need for a trigger for monitors.

Next one, and this is of course not limited to SNMP issues; the same applies to all other monitors. In this case it’s an HttpMonitor. It has triggered 160 times this month (13 days), which means 160 alarms, but mostly for outages of only about a minute, since the downtime model makes sure the service is rechecked quickly. Granted, the component should not behave like this, and I am not even absolutely sure the service was really down. But the path between OpenNMS and the service is long, and who knows what could go wrong along the way. OpenNMS should be able to handle it with a poller trigger option.
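
For reference, the fast rechecking comes from the downtime model in the poller package. The defaults look roughly like this (exact values and attributes vary between versions):

      <!-- while a service is down, re-poll every 30s for the first 5 minutes,
           then every 5 minutes up to 12 hours, then every 10 minutes -->
      <downtime interval="30000" begin="0" end="300000"/>
      <downtime interval="300000" begin="300000" end="43200000"/>
      <downtime interval="600000" begin="43200000" end="432000000"/>

So a service that comes back within the first downtime window produces exactly the kind of short-lived outage and alarm described above.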

What makes a trigger different from retries + timeouts? The trigger for a nodeLostService is already defined by the number of retries * the timeout, which should be <= the polling interval. You can also delay a notification afterwards.
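
For the notification delay, one option is an initial delay on the destination path in destinationPaths.xml. A sketch, assuming an existing "Admin" user and the standard javaEmail command (the path name here is made up):

      <path name="Delayed-Email-Admin" initial-delay="2h">
         <target>
            <name>Admin</name>
            <command>javaEmail</command>
         </target>
      </path>

The idea being that if the service comes back up before the delay expires, the notice is auto-acknowledged and nothing is sent.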