It seems that most, if not all, of our SNMP-monitored nodes are at some point generating uei.opennms.org|nodes|reinitializePrimarySnmpInterface events approximately every minute.
I am worried this might be related to some cleanup of these events I did prior to a vacuum, but if so, I do not understand why that would be the case.
In terms of logged errors, we have some SnmpAgentTimeoutException ERROR messages in the log, but they seem to be mostly related to actual SNMP issues (reachability, agent problems, etc.).
Resource graphs are intact, so polling and collection are occurring; it is just that these endless events keep being generated, some 1,500 per node per day.
One observation: for nodes that experience an outage, we see the following sequence of standard events, after which the problem (the recurring reinitializePrimarySnmpInterface events) seems to be resolved.
If memory serves, we only expect to see this event during a capsd rescan or an SNMP update.
Debian 10 - Horizon 29.0.8-1 - Evergreening since 2014 ~ONMS 1.10|Debian 6
Recent events on our system include upgrades, which have completed successfully on a regular basis, following a standard sequence we execute first in the lab and shortly after in production:
- stop opennms
- apt upgrade
- merge back any .dpkg-new changes
- install -dis
- modify runas=root and systemd.service
- purge |var|log
- start opennms
Still running as root, but planning to switch to opennms user with AmbientCapabilities=CAP_NET_BIND_SERVICE at the next upgrade, as the lab instance has been running this way for a while now.
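For anyone curious, the non-root setup we have been trialling in the lab amounts to something like the following systemd drop-in. Treat this as a sketch: the drop-in path, unit name, and the RUNAS setting are assumptions based on a stock Debian package install, not a verbatim copy of our config.

```ini
# /etc/systemd/system/opennms.service.d/override.conf (hypothetical drop-in)
[Service]
User=opennms
Group=opennms
# Allow binding privileged ports (e.g. 162/udp for traps) without running as root
AmbientCapabilities=CAP_NET_BIND_SERVICE
```

Alongside this, RUNAS would be switched from root to opennms in opennms.conf (again, an assumption about the exact file on a Debian install).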
All our nodes are provisioned via requisition, with a few exceptions.
Other relevant data:
Recent changes also include a few tweaks to some events:
uei.opennms.org|internal|provisiond|nodeScanCompleted → logonly
uei.opennms.org|traps|compaqhp|cpqDa6AccelStatusChange → discardtraps (until we change a battery)
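For anyone replicating the tweaks above, they amount to changing the logmsg destination in the relevant eventconf files, roughly like this (a sketch only: the surrounding event definitions are abbreviated with "...", and pipes stand in for slashes in the UEIs as elsewhere in this post):

```xml
<!-- Sketch; real event definitions contain more elements than shown -->
<event>
  <uei>uei.opennms.org|internal|provisiond|nodeScanCompleted</uei>
  ...
  <logmsg dest="logonly">Node scan completed.</logmsg>
</event>
<event>
  <uei>uei.opennms.org|traps|compaqhp|cpqDa6AccelStatusChange</uei>
  ...
  <logmsg dest="discardtraps">Accelerator status change.</logmsg>
</event>
```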
The cpqDa6AccelStatusChange events existed in huge numbers, so I deleted them, and decided to also clean out some other events I did not care about before a vacuum, to reclaim disk space after significant DB growth in recent months. I modified the handy NMS-9303-cleanup-deprecated-events-im.sql to include the following events, and cleared 20M events pre-vacuum:
eventuei = 'uei.opennms.org|traps|compaqhp|cpqDa6AccelStatusChange'
OR eventuei = 'uei.opennms.org|traps|CPQHOST-MIB|cpqHoMibHealthStatusArrayChangeTrap'
OR eventuei = 'uei.opennms.org|nodes|reinitializePrimarySnmpInterface'
OR eventuei = 'uei.opennms.org|internal|discovery|hardwareInventorySuccessful'
OR eventuei = 'uei.opennms.org|internal|provisiond|nodeScanCompleted'
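As an aside, for anyone doing a similar cleanup, the per-UEI volumes are easy to eyeball before deleting anything, with something like the following query (a sketch against the stock events table; column names as in a standard OpenNMS schema):

```sql
-- Count events per UEI to see what is actually consuming space (sketch)
SELECT eventuei, count(*) AS n
  FROM events
 GROUP BY eventuei
 ORDER BY n DESC
 LIMIT 20;
```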
I then did a full VACUUM, and afterwards added the following statements to vacuumd-configuration.xml after noticing a bunch of new hardwareInventorySuccessful events, possibly a new UEI attributable to requisition updates in newer OpenNMS versions.
DELETE FROM events WHERE eventuei='uei.opennms.org|internal|provisiond|nodeScanCompleted' AND (eventcreatetime < now() - interval '2 days');
DELETE FROM events WHERE eventuei='uei.opennms.org|nodes|nodeUpdated' AND (eventcreatetime < now() - interval '2 days');
DELETE FROM events WHERE eventuei='uei.opennms.org|internal|discovery|hardwareInventorySuccessful' AND (eventcreatetime < now() - interval '2 days');
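Concretely, those statements live in vacuumd-configuration.xml wrapped in statement elements, roughly like this (a sketch: the period attribute and the rest of the file are omitted, the literal "<" in the SQL must be escaped as "&amp;lt;" inside XML, and pipes again stand in for slashes):

```xml
<!-- Sketch of the additions; the real file contains more configuration -->
<statement>
  DELETE FROM events
   WHERE eventuei = 'uei.opennms.org|internal|provisiond|nodeScanCompleted'
     AND eventcreatetime &lt; now() - interval '2 days'
</statement>
<statement>
  DELETE FROM events
   WHERE eventuei = 'uei.opennms.org|nodes|nodeUpdated'
     AND eventcreatetime &lt; now() - interval '2 days'
</statement>
<!-- ...and similarly for hardwareInventorySuccessful -->
```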
We have some ERROR occurrences in the logs. It has been a while (months, if not years) since we have had a reason to check on log health. I think most are related to SNMP client-specific issues or reachability problems, but I'll post an occurrence of every ERROR in case they point to something obvious. I see a few things worth investigating in separate threads.
collectd.log:2022-04-14 12:18:13,240 ERROR [collectd-Thread] o.o.n.c.Collectd: Unable to find service SNMP on interface with address
enlinkd.log:2022-04-12 12:30:43,494 ERROR [EnhancedLinkd-Thread-6-of-5] o.h.u.JDBCExceptionReporter: ERROR: value too long for type character varying(16)
karaf.log:2022-04-12T12:11:40,858 | ERROR | FelixStartLevel | ServiceBasedStrategyResolver | 234 - org.opennms.core.snmp.api - 29.0.8 | SnmpStrategy class 'class Proxy5d12c59b_db49_48fb_9010_b559ad7c9d68' published as service with out 'implementation' key. Ignoring.
karaf.log:2022-04-12T12:11:41,097 | ERROR | FelixStartLevel | ServiceBasedStrategyResolver | 234 - org.opennms.core.snmp.api - 29.0.8 | SnmpStrategy class 'class Proxy5d12c59b_db49_48fb_9010_b559ad7c9d68' published as service with out 'implementation' key. Ignoring.
karaf.log:2022-04-12T12:12:09,949 | ERROR | Blueprint Extender: 1 | Felix | 5 - org.ops4j.pax.logging.pax-logging-api - 2.0.14 | Bundle org.opennms.core.soa  EventDispatcher: Error during dispatch. (java.util.ConcurrentModificationException)
karaf.log:2022-04-12T12:12:09,977 | ERROR | FelixDispatchQueue | Framework | 239 - org.opennms.core.soa - 29.0.8 | FrameworkEvent ERROR
poller.log:2022-04-14 12:18:13,391 ERROR [Poller:PollerEventProcessor-Thread] o.h.u.JDBCExceptionReporter: ERROR: duplicate key value violates unique constraint "one_outstanding_outage_per_service_and_location_idx1"
poller.log:2022-04-14 12:18:13,391 ERROR [Poller:PollerEventProcessor-Thread] o.o.n.p.Poller: Unable to schedule service 2551|10.6.8.174|ICMP
poller.log:2022-04-14 12:19:21,082 ERROR [Poller-Thread-9-of-30] o.o.n.p.QueryManagerDaoImpl: Failed to set the last good|fail timestamp for service named ICMP on node id 2551 and interface |10.4.1.10 for Down.
poller.log:2022-04-14 13:31:28,622 ERROR [Poller:PollerEventProcessor-Thread] o.o.n.p.PollerEventProcessor: Nodeid 1203 does not exist in pollable node map, unable to delete node.
provisiond.log:2022-04-15 11:40:30,245 ERROR [Provisiond:EventListener-Thread] o.o.n.p.s.Provisioner: getScheduleForNode fails
provisiond.log:2022-04-12 12:26:05,225 ERROR [scanExecutor-3] o.o.n.s.s.Snmp4JStrategy: org.opennms.netmgt.snmp.SnmpException: org.opennms.netmgt.snmp.SnmpAgentTimeoutException: Timeout retrieving response for 10.5.0.3.
web.log:2022-04-12 12:11:40,692 ERROR [Main] o.s.w.c.ContextLoader: Context initialization failed
web.log:org.postgresql.util.PSQLException: ERROR: syntax error at or near ".183"
Hopefully we did not break anything with our recent DB cleanup, but it seems likely that we did, given that the event we are having issues with is one of the ones we deleted.
I modified this post to remove the long UEI from the title and replaced all slashes with a pipe symbol, as everything was being detected as links, and there is a rule that new posters may include no more than 2.