VMware CPU Usage Threshold

Hello.

I’m running OpenNMS 28.0.2.

I’m now monitoring my virtual machines via the vCenter configuration in OpenNMS. That’s working well: I have all the metrics for datastore, CPU, memory, CPU ready, CPU latency, and so on.

Today I had a virtual machine running at 100% CPU usage for more than an hour, but no alarm or event was triggered for that high CPU usage. I get some for TCP, discarded packets, and so on, but nothing for CPU.

Does the default high CPU threshold configured in OpenNMS only work for physical servers?

Do I need to configure a specific threshold to notify me when my virtual machine’s CPU usage rises above a specific percentage?

Has anybody using VMware/vCenter metrics already configured this? What do I need to do? I’m referring to the CpuUsageAvgNode (CPU Usage) graph.

Thanks

The default high threshold for CPU is only for the SNMP CPU data source. If the VMware metrics provide that as a different datasource name, you’d need to create a distinct high threshold for that DS.

There is definitely something that I don’t understand or am doing wrong.

Here is the Threshold I configured.

I have a server that has been running at between 65 and 75% CPU since this morning, and I never get any notifications or events generated in OpenNMS for that virtual server.

I also tried changing the data source to VMware 5 CPU, and still nothing.

Here is the file containing the Graph Data.
/opt/opennms/etc/snmp-graph.properties.d/vmware5-cpu-graph.properties

Here is the graph config from the file above.
report.vmware5.CpuUsageAvgNode.name=CpuUsageAvgNode
report.vmware5.CpuUsageAvgNode.columns=CpuUsageAvg
report.vmware5.CpuUsageAvgNode.type=nodeSnmp
report.vmware5.CpuUsageAvgNode.command=--title="CPU Usage" \

Here is my “CPU Usage” graph.

With that information, is my threshold configured correctly?
It is also configured in the netsnmp group.

That’s what you’re doing wrong. The netsnmp threshold group has a filter on it so it only applies to nodes that have a sysobjectid that matches net-snmpd, and I’m fairly certain your vmware nodes won’t have that sysobjectid.

Can I add my CpuUsageAvg threshold to any other threshold group, or do I need to create a specific one for VMware?

What are your thoughts?
Thanks

I would create one specific to vmware.

Same result. Still not working. No events, no notifications.
Here is the config in the thresholds.xml file.

   <group name="vmWare" rrdRepository="/opt/opennms/share/rrd/snmp/">
      <expression description="Trigger an alert when the five minute CPU load average metric reaches or goes above 60% for two consecutive measurement intervals" type="high" ds-type="vmware5Cpu" value="60" rearm="50" trigger="1" filterOperator="OR" expression="CpuUsageAvg / 100"/>
   </group>

Example of data.

Date/Time CPU usage as percent
1637895600000 NaN
1637895300000 6856.66
1637895000000 7286.92
1637894700000 7060.6033333333335
1637894400000 6784.6033333333335
1637894100000 6706.18
1637893800000 7197.51
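
The raw samples above are far larger than 100, which matches the `CpuUsageAvg / 100` expression in the posted threshold. As a rough check, here is a minimal Python sketch that converts the samples to percent and compares them with the configured value of 60. It assumes the raw values are in hundredths of a percent; the function names are made up for illustration and are not part of OpenNMS.

```python
import math

# Raw samples copied from the post: (epoch milliseconds, CpuUsageAvg).
samples = [
    (1637895600000, float("nan")),
    (1637895300000, 6856.66),
    (1637895000000, 7286.92),
    (1637894700000, 7060.6033333333335),
    (1637894400000, 6784.6033333333335),
    (1637894100000, 6706.18),
    (1637893800000, 7197.51),
]

THRESHOLD_PERCENT = 60.0  # "value" attribute of the posted expression


def to_percent(raw):
    """Assumption: the raw value is in hundredths of a percent."""
    return raw / 100.0


def exceeds(raw, threshold=THRESHOLD_PERCENT):
    """True when the converted sample is at or above the threshold."""
    if math.isnan(raw):
        return False  # no data for this interval, nothing to compare
    return to_percent(raw) >= threshold


for timestamp_ms, raw in samples:
    print(timestamp_ms, raw, exceeds(raw))
```

Under that assumption every non-NaN sample is above 60%, so the data itself would satisfy the threshold if it were evaluated.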

Do I need to restart a service or daemon after making changes? If I remember correctly, config changes made in the Web UI are picked up automatically.

I rebooted the OpenNMS server and still nothing.

I changed the ds-type to “node” because that is what needs to go there. Same result.

<expression description="Trigger an alert when the five minute CPU load average metric reaches or goes above 70% for two consecutive measurement intervals" type="high" ds-type="node" value="70.0" rearm="60.0" trigger="1" filterOperator="OR" expression="CpuUsageAvg / 100"/>
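
Note that the description in this expression mentions two consecutive measurement intervals while the configuration sets trigger="1". The following Python sketch shows how the trigger and rearm attributes are generally understood to interact for a high threshold. It is a simplified model written for illustration, not the OpenNMS implementation; the class name and the exact comparison operators (`>=` for the value, `<=` for rearm) are assumptions.

```python
class HighThreshold:
    """Simplified model of a high threshold with trigger count and rearm."""

    def __init__(self, value, rearm, trigger):
        self.value = value      # level that counts as exceeded
        self.rearm = rearm      # level at which the threshold arms again
        self.trigger = trigger  # consecutive exceeding samples required
        self.count = 0
        self.armed = True

    def evaluate(self, sample):
        """Return 'triggered', 'rearmed', or None for one sample."""
        if self.armed:
            if sample >= self.value:
                self.count += 1
                if self.count >= self.trigger:
                    self.armed = False
                    self.count = 0
                    return "triggered"
            else:
                self.count = 0  # the streak is broken
        elif sample <= self.rearm:
            self.armed = True
            return "rearmed"
        return None


# Hypothetical percent values, already divided by 100.
percents = [68.5, 72.8, 70.6, 55.0, 71.0]

once = HighThreshold(value=70.0, rearm=60.0, trigger=1)
twice = HighThreshold(value=70.0, rearm=60.0, trigger=2)
print([once.evaluate(p) for p in percents])
print([twice.evaluate(p) for p in percents])
```

With trigger set to 1 the first sample at or above 70 fires immediately; with trigger set to 2 it takes two such samples in a row, which is what the description text implies.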

I created a package in threshd-configuration.xml and still nothing.
I saw that the mib2 package does not have any specific filter, so I put my configuration there for testing, with the same result.

 <package name="mib2">
      <filter>IPADDR != '0.0.0.0'</filter>
      <include-range begin="1.1.1.1" end="254.254.254.254"/>
      <include-range begin="::1" end="ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff"/>
      <service name="SNMP" interval="300000" user-defined="false" status="on">
         <parameter key="thresholding-group" value="mib2"/>
      </service>
   </package>
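
The package above attaches the mib2 threshold group to the service named SNMP. A small Python sketch like the one below can list which service each thresholding group is bound to, so that the names can be compared with the services defined in collectd-configuration.xml. It works on the assumption that a group is only evaluated against data collected by the service it is attached to, which may be worth verifying for VMware metrics. The function name is hypothetical, the sample is a shortened copy of the posted fragment, and the full file may declare an XML namespace that this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the package posted above.
PACKAGE_XML = """
<package name="mib2">
   <filter>IPADDR != '0.0.0.0'</filter>
   <include-range begin="1.1.1.1" end="254.254.254.254"/>
   <service name="SNMP" interval="300000" user-defined="false" status="on">
      <parameter key="thresholding-group" value="mib2"/>
   </service>
</package>
"""


def group_bindings(package_xml):
    """Return (package, service, thresholding-group) tuples."""
    package = ET.fromstring(package_xml)
    bindings = []
    for service in package.findall("service"):
        for param in service.findall("parameter"):
            if param.get("key") == "thresholding-group":
                bindings.append(
                    (package.get("name"), service.get("name"), param.get("value"))
                )
    return bindings


for package_name, service_name, group_name in group_bindings(PACKAGE_XML):
    print(f"package={package_name} service={service_name} group={group_name}")
```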

On the node that was running at 100% today, here are the files generated.
ls -l /opt/opennms/share/rrd/snmp/579/CpuUsageAvg.*
-rw-rw-r-- 1 root root 37388 Nov 29 12:09 /opt/opennms/share/rrd/snmp/579/CpuUsageAvg.jrb
-rw-rw-r-- 1 root root 72 Jul 28 15:14 /opt/opennms/share/rrd/snmp/579/CpuUsageAvg.meta

Here is the graph associated with the value I’m trying to monitor.

report.vmware6.CpuUsageAvg.name=vmware6.cpu.usage.average
report.vmware6.CpuUsageAvg.columns=CpuUsageAvg
report.vmware6.CpuUsageAvg.type=nodeSnmp
report.vmware6.CpuUsageAvg.command=--title="VMware6 cpu.usage.average" \
--vertical-label="CpuUsageAvg" \
DEF:xxx={rrd1}:CpuUsageAvg:AVERAGE \
LINE2:xxx#0000ff:"CpuUsageAvg" \
GPRINT:xxx:AVERAGE:"Avg  \\: %8.2lf %s" \
GPRINT:xxx:MIN:"Min  \\: %8.2lf %s" \
GPRINT:xxx:MAX:"Max  \\: %8.2lf %s\\n"

But I’m still not able to get the CPU threshold triggered.

Any idea?

I upgraded this morning to 29.0.1, with the same result. The CpuUsageAvg threshold with ds-type=node is still not working.

I have over 100 VM servers that I want to monitor and I’m not able to get it working.

It can monitor disk space usage (hrStorageUsed), packet discards in/out, and TCP errors in/out, but for CPU nothing works.

I tried the configuration in mib2 and netsnmp, and I also created a new group named vmWare, and still nothing. The threshold does not seem to be working.

Is nobody monitoring VMware CPU usage?
Is there another way or data source I can use to monitor my VM CPU?

Thanks

I’m following the documentation and what people have said in the forum, but I’m still not able to get that threshold working.

As I understand it, if I have a graph that displays those values, it means the metric is being collected, right?

After that, I identified the ds-name and ds-type for them and created a threshold in the web interface. See the picture above, which shows the ds-name and ds-type, and the file below, which shows exactly the same thing.

./snmp-graph.properties.d/vmware7-graph-simple.properties:report.vmware7.CpuUsageAvg.columns=CpuUsageAvg
./snmp-graph.properties.d/vmware7-graph-simple.properties:report.vmware7.CpuUsageAvg.type=nodeSnmp

I created my threshold in the mib2 group because there is no specific filter for that one.

<package name="mib2">
      <filter>IPADDR != '0.0.0.0'</filter>
      <include-range begin="1.1.1.1" end="254.254.254.254"/>
      <include-range begin="::1" end="ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff"/>
      <service name="SNMP" interval="300000" user-defined="false" status="on">
         <parameter key="thresholding-group" value="mib2"/>
      </service>
   </package>

Here is the threshold in the thresholds.xml file.

<expression description="Trigger an alert when the CPU load metric reaches or goes above 70% for two consecutive measurement intervals" type="high" ds-type="node" value="70.0" rearm="60.0" trigger="1" filterOperator="OR" expression="CpuUsageAvg /100"/>

Is there anything else I should do to get it working? What’s wrong in my configuration?

Thanks

That’s very strange, it all looks correct to me.

If you set collectd to log at DEBUG in log4j2.xml, wait a couple of collection cycles, and then make the collectd.log available somewhere, I can try to take a look at it, time permitting. No promises on turnaround, though (unless you’re secretly a support customer?)

One thing I always forget about that you might try is dumping the existing threshold states from Postgres. In the Karaf shell:

admin@opennms()> threshold-clear-all --help
DESCRIPTION
        opennms:threshold-clear-all

        Clears all threshold states

SYNTAX
        opennms:threshold-clear-all [options]

OPTIONS
        -p, --persisted-only
                When set, clears only the persisted state
        --help
                Display this help message

Hi dino2gnt.

Thanks for your help.

Here is what I found in collectd.log in DEBUG mode.

2021-12-01 09:53:25,910 INFO  [Collectd-Thread-27-of-50] o.o.n.t.ThresholdingSetImpl: applyThresholds: Processing threshold CpuUsageAvg /100 : {evaluator=HIGH, dsName=CpuUsageAvg /100, dsType=node, evaluators=[{ds=CpuUsageAvg /100, value=70.0, rearm=60.0, trigger=1}]} on resource node[593]
2021-12-01 09:53:25,910 INFO  [Collectd-Thread-27-of-50] o.o.n.t.CollectionResourceWrapper: getAttributeValue: can't find attribute called CpuUsageAvg on node[593]
2021-12-01 09:53:25,910 INFO  [Collectd-Thread-27-of-50] o.o.n.t.ThresholdingSetImpl: applyThresholds: Could not get data source value for 'CpuUsageAvg', not evaluating threshold
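
The log lines above say that the threshold was processed but the attribute was not found on the resource. To check how widespread this is across nodes, a short Python sketch can pull the attribute and resource out of each such message. The regular expression is based only on the message format shown in this post, and the function name is hypothetical.

```python
import re

# Two of the log lines posted above, copied verbatim.
LOG_LINES = [
    "2021-12-01 09:53:25,910 INFO  [Collectd-Thread-27-of-50] "
    "o.o.n.t.CollectionResourceWrapper: getAttributeValue: "
    "can't find attribute called CpuUsageAvg on node[593]",
    "2021-12-01 09:53:25,910 INFO  [Collectd-Thread-27-of-50] "
    "o.o.n.t.ThresholdingSetImpl: applyThresholds: Could not get data "
    "source value for 'CpuUsageAvg', not evaluating threshold",
]

# Matches the "can't find attribute" message and captures name and resource.
MISSING_ATTR = re.compile(
    r"getAttributeValue: can.t find attribute called (\S+) on (\S+)"
)


def missing_attributes(lines):
    """Return (attribute, resource) pairs for every matching log line."""
    found = []
    for line in lines:
        match = MISSING_ATTR.search(line)
        if match:
            found.append((match.group(1), match.group(2)))
    return found


for attribute, resource in missing_attributes(LOG_LINES):
    print(f"{attribute} not present in the data evaluated for {resource}")
```

To run it against a real log, the list could be replaced by the lines of the collectd.log file.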

Weird. If I have that graph, I should be getting the attribute value, no?

Thanks for your help

Hi Dino2gnt.

I did a test. I changed the datasource type to VMware 7 CPU and I no longer get the error from the reply above.

As you can see below, Collectd is updating the metric data in the RRD folder for that machine.

ls -l /opt/opennms/share/rrd/snmp/579/CpuUsage*
-rw-rw-r-- 1 opennms opennms 37388 Dec  2 08:00 /opt/opennms/share/rrd/snmp/579/CpuUsageAvg.jrb
-rw-rw-r-- 1 opennms opennms    72 Dec  1 12:12 /opt/opennms/share/rrd/snmp/579/CpuUsageAvg.meta
-rw-rw-r-- 1 opennms opennms 37388 Dec  2 08:00 /opt/opennms/share/rrd/snmp/579/CpuUsagemhzAvg.jrb
-rw-rw-r-- 1 opennms opennms    78 Dec  1 12:12 /opt/opennms/share/rrd/snmp/579/CpuUsagemhzAvg.meta

So it looks like the metric is being collected correctly.

The only thing left is to get the threshold working. I have no event on the machine saying the threshold has been reached, and no notification.

We can see in the collectd.log that the value is updated.

2021-12-01 21:39:03,286 DEBUG [Collectd-Thread-48-of-50] o.o.n.r.j.JRobinRrdStrategy: createDefinition: filename [/opt/opennms/share/rrd/snmp/579/CpuUsageAvg.jrb] already exists returning null as definition
2021-12-01 21:39:03,286 INFO  [Collectd-Thread-48-of-50] o.o.n.c.p.r.RrdPersistOperationBuilder: updateRRD: updating RRD file /opt/opennms/share/rrd/snmp/579/CpuUsageAvg.jrb with values '1638412743:10000'
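
The update string in the log line above has the form timestamp:value. As a quick sanity check, this Python sketch splits it and applies the same division by 100 that the threshold expression uses. The helper name is made up for illustration, and the hundredths-of-a-percent scale is an assumption taken from the expression in this thread.

```python
def parse_update(update):
    """Split an 'epoch_seconds:value' update string into typed parts."""
    timestamp, raw = update.split(":", 1)
    return int(timestamp), float(raw)


# Value copied from the updateRRD log line.
timestamp, raw = parse_update("1638412743:10000")
percent = raw / 100.0  # same scaling as the "CpuUsageAvg / 100" expression

print(timestamp, raw, percent)
print("above 70%:", percent >= 70.0)
```

On that reading the stored sample corresponds to 100% CPU, so the collected value itself is well above the configured 70.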

My threshold divides the value by 100 to express it as a percentage: CpuUsageAvg / 100.
Any ideas, given this latest information?

Thanks