Reinitialize PrimarySnmpInterface Events Spawning Endlessly on SNMP Monitored Nodes

Problem:
It seems that most, if not all, of our SNMP-monitored nodes are at some point generating uei.opennms.org/nodes/reinitializePrimarySnmpInterface events approximately every minute.

I am worried this might be related to some cleanup of these events I did prior to a vacuum, but if so, I do not understand why that might be the case.

In terms of logged errors, we have some SnmpAgentTimeoutException ERROR messages in the log, but they seem to be mostly related to actual SNMP issues (reachability, agent problems, etc.).

Resource graphs are intact, so polling and collection are occurring; it is just that these endless events keep being generated, some 1500 per node per day.

One observation is that for nodes that experience an outage, we see the following sequence of standard events, after which the problem (the recurring reinitializePrimarySnmpInterface events) seems to be resolved.

uei.opennms.org/internal/provisiond/nodeScanAborted
uei.opennms.org/nodes/dataCollectionFailed
uei.opennms.org/nodes/nodeDown
uei.opennms.org/nodes/nodeUp
uei.opennms.org/nodes/dataCollectionSucceeded

Expected outcome:

We only expect to see this event, if memory serves, during a capsd rescan or SNMP update.

OpenNMS version:
Debian 10 - Horizon 29.0.8-1 - evergreening since 2014 (~ONMS 1.10 / Debian 6)

Recent changes on our system include regular upgrades, all of which have completed successfully, following a standard sequence that we run first in the lab and shortly afterwards in production:

  • stop opennms
  • apt upgrade
  • merge back any .dpkg-new changes
  • install -dis
  • modify runas=root and systemd.service
  • purge /var/log
  • start opennms

Still running as root, but planning to switch to the opennms user with AmbientCapabilities=CAP_NET_BIND_SERVICE at the next upgrade, as the lab instance has been running this way for a while now.

All our nodes are provisioned via requisition, with a few exceptions.

Other relevant data:

Other recent changes include a few tweaks to some events (how these were configured is sketched after the list):

uei.opennms.org/internal/provisiond/nodeScanCompleted → logonly
uei.opennms.org/traps/compaqhp/cpqDa6AccelStatusChange → discardtraps (until we change a battery)
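
For reference, both tweaks were made by overriding the logmsg destination for those UEIs in our local event configuration. The overrides look roughly like this; labels and message text are trimmed to placeholders here, so treat it as a sketch rather than the exact file contents:

    <event>
        <uei>uei.opennms.org/internal/provisiond/nodeScanCompleted</uei>
        <event-label>provisiond event: nodeScanCompleted</event-label>
        <descr>A provisiond node scan has completed.</descr>
        <!-- only the dest attribute was changed, to logonly -->
        <logmsg dest="logonly">Node scan completed.</logmsg>
        <severity>Normal</severity>
    </event>

    <event>
        <uei>uei.opennms.org/traps/compaqhp/cpqDa6AccelStatusChange</uei>
        <event-label>Compaq/HP trap event: cpqDa6AccelStatusChange</event-label>
        <descr>Array accelerator status change trap.</descr>
        <!-- discardtraps drops the matching traps entirely -->
        <logmsg dest="discardtraps">Array accelerator status change.</logmsg>
        <severity>Warning</severity>
    </event>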

The cpqDa6AccelStatusChange events were a huge number, so I deleted them, and while I was at it I decided to clean out some other events I did not care about before a vacuum, to reclaim disk space after significant DB growth in recent months. I modified the handy NMS-9303-cleanup-deprecated-events-im.sql to include the following events, and cleared 20M events pre-vacuum.

    eventuei = 'uei.opennms.org/traps/compaqhp/cpqDa6AccelStatusChange' OR
    eventuei = 'uei.opennms.org/traps/CPQHOST-MIB/cpqHoMibHealthStatusArrayChangeTrap' OR
    eventuei = 'uei.opennms.org/nodes/reinitializePrimarySnmpInterface' OR
    eventuei = 'uei.opennms.org/internal/discovery/hardwareInventorySuccessful' OR
    eventuei = 'uei.opennms.org/internal/provisiond/nodeScanCompleted'

I then did a full VACUUM. Afterwards, having noticed a bunch of new hardwareInventorySuccessful events (possibly a new UEI attributed to requisition updates in newer OpenNMS versions), I added the following statements to vacuumd-configuration.xml; the raw SQL is shown below, with a sketch of the surrounding XML after it.

    DELETE FROM events
    WHERE eventuei = 'uei.opennms.org/internal/provisiond/nodeScanCompleted'
      AND (eventcreatetime < now() - interval '2 days');

    DELETE FROM events
    WHERE eventuei = 'uei.opennms.org/nodes/nodeUpdated'
      AND (eventcreatetime < now() - interval '2 days');

    DELETE FROM events
    WHERE eventuei = 'uei.opennms.org/internal/discovery/hardwareInventorySuccessful'
      AND (eventcreatetime < now() - interval '2 days');
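
In vacuumd-configuration.xml these end up inside <statement> elements, roughly like the sketch below. Only the first delete is shown, the stock statements and automations that ship with the file are omitted, and the < has to be escaped inside the XML:

    <VacuumdConfiguration period="86400000">
        <statement>
            DELETE FROM events
            WHERE eventuei = 'uei.opennms.org/internal/provisiond/nodeScanCompleted'
              AND (eventcreatetime &lt; now() - interval '2 days');
        </statement>
        <!-- ...one <statement> per DELETE above, plus the stock
             statements and automations that ship with the file... -->
    </VacuumdConfiguration>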
  

We have some ERROR occurrences in the logs. It has been a while (months, if not years) since we have had a reason to check on log health; I think most are related to SNMP client-specific issues or reachability problems, but I'll post an occurrence of every ERROR in case they point to something obvious. I see a few things worth investigating in separate threads.

collectd.log:2022-04-14 12:18:13,240 ERROR [collectd-Thread] o.o.n.c.Collectd: Unable to find service SNMP on interface with address 
enlinkd.log:2022-04-12 12:30:43,494 ERROR [EnhancedLinkd-Thread-6-of-5] o.h.u.JDBCExceptionReporter: ERROR: value too long for type character varying(16)
karaf.log:2022-04-12T12:11:40,858 | ERROR | FelixStartLevel  | ServiceBasedStrategyResolver     | 234 - org.opennms.core.snmp.api - 29.0.8 | SnmpStrategy class 'class Proxy5d12c59b_db49_48fb_9010_b559ad7c9d68' published as service with out 'implementation' key.  Ignoring.
karaf.log:2022-04-12T12:11:41,097 | ERROR | FelixStartLevel  | ServiceBasedStrategyResolver     | 234 - org.opennms.core.snmp.api - 29.0.8 | SnmpStrategy class 'class Proxy5d12c59b_db49_48fb_9010_b559ad7c9d68' published as service with out 'implementation' key.  Ignoring.
karaf.log:2022-04-12T12:12:09,949 | ERROR | Blueprint Extender: 1 | Felix                            | 5 - org.ops4j.pax.logging.pax-logging-api - 2.0.14 | Bundle org.opennms.core.soa [239] EventDispatcher: Error during dispatch. (java.util.ConcurrentModificationException)
karaf.log:2022-04-12T12:12:09,977 | ERROR | FelixDispatchQueue | Framework                        | 239 - org.opennms.core.soa - 29.0.8 | FrameworkEvent ERROR
poller.log:2022-04-14 12:18:13,391 ERROR [Poller:PollerEventProcessor-Thread] o.h.u.JDBCExceptionReporter: ERROR: duplicate key value violates unique constraint "one_outstanding_outage_per_service_and_location_idx1"
poller.log:2022-04-14 12:18:13,391 ERROR [Poller:PollerEventProcessor-Thread] o.o.n.p.Poller: Unable to schedule service 2551/10.6.8.174/ICMP
poller.log:2022-04-14 12:19:21,082 ERROR [Poller-Thread-9-of-30] o.o.n.p.QueryManagerDaoImpl: Failed to set the last good/fail timestamp for service named ICMP on node id 2551 and interface /10.4.1.10 for Down.
poller.log:2022-04-14 13:31:28,622 ERROR [Poller:PollerEventProcessor-Thread] o.o.n.p.PollerEventProcessor: Nodeid 1203 does not exist in pollable node map, unable to delete node.
provisiond.log:2022-04-15 11:40:30,245 ERROR [Provisiond:EventListener-Thread] o.o.n.p.s.Provisioner: getScheduleForNode fails
provisiond.log:2022-04-12 12:26:05,225 ERROR [scanExecutor-3] o.o.n.s.s.Snmp4JStrategy: org.opennms.netmgt.snmp.SnmpException: org.opennms.netmgt.snmp.SnmpAgentTimeoutException: Timeout retrieving response for 10.5.0.3.
web.log:2022-04-12 12:11:40,692 ERROR [Main] o.s.w.c.ContextLoader: Context initialization failed
web.log:org.postgresql.util.PSQLException: ERROR: syntax error at or near ".183"

Hopefully we did not break anything with our recent DB cleanup, but it seems likely, given that the event we are having issues with is one we deleted.

My previous observation that an outage resolves the issue for a particular node is incorrect.

Looking more closely at some outages from earlier today that appeared to stop the spawning of these events, the reinitializePrimarySnmpInterface events actually only paused for a couple of hours and then started rolling in again, approximately once per minute.

Comparing this UEI to our lab instance, we typically only see these events on the capsd rescan interval, which is set to the default of 86400 seconds (1 day).

We were able to trace the change in our reinitializePrimarySnmpInterface events to a specific date using our Daily reports. Since we had recently purged the events during some DB maintenance, we had initially believed we would not be able to determine when the problem started.

We figured out the date, and sure enough, on that day we had modified the scan interval from 1d to 1m for the requisition containing a large number of our SNMP-monitored nodes (see the sketch below).
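
For anyone else chasing the same symptom, the setting in question is the scan-interval in the foreign-source definition for the requisition. Roughly sketched, with the requisition name as a placeholder and detectors/policies trimmed:

    <foreign-source name="snmp-nodes"
                    xmlns="http://xmlns.opennms.org/xsd/config/foreign-source">
        <!-- this had been changed from 1d to 1m, so provisiond rescanned
             (and reinitialized the primary SNMP interface on) every node
             in the requisition once per minute -->
        <scan-interval>1d</scan-interval>
        <detectors>
            <!-- ...unchanged... -->
        </detectors>
        <policies/>
    </foreign-source>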

Just some human error in a configuration change that we had missed. All back to normal.
