Monitoring systemd services using Prometheus' NodeExporter

What is systemd?

Its main aim is to unify service configuration and behavior across Linux distributions; systemd's primary component is a “system and service manager”—an init system used to bootstrap user space and manage user processes.

Source: systemd - Wikipedia

In other words: systemd knows about all the services that are running (or not) on most Linux systems.

In the following examples, we want to monitor whether the puppet service is running.
So we are talking about this running state:

[20:40]root@node1:~# systemctl status puppet
● puppet.service - Puppet agent
   Loaded: loaded (/lib/systemd/system/puppet.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2021-04-13 20:30:13 UTC; 10min ago
 Main PID: 25901 (puppet)
    Tasks: 2 (limit: 39321)
   Memory: 46.9M
      CPU: 8.053s
   CGroup: /system.slice/puppet.service
           └─25901 /opt/puppetlabs/puppet/bin/ruby /opt/puppetlabs/puppet/bin/puppet agent --no-daemonize

Apr 13 20:30:13 node1 systemd[1]: Started Puppet agent.
Apr 13 20:30:15 node1 puppet-agent[25901]: Starting Puppet client version 6.15.0
Apr 13 20:30:30 node1 puppet-agent[26049]: Applied catalog in 4.62 seconds
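For scripting, systemd can report the same state in machine-readable form via `systemctl show -p ActiveState puppet`, which prints `ActiveState=active`. A minimal sketch (we parse a captured copy of that line here, so the example runs without a live systemd):

```shell
# Machine-readable form of the state shown above:
#   systemctl show -p ActiveState puppet   ->   ActiveState=active
# We parse a captured copy of that line so the sketch runs anywhere.
line='ActiveState=active'
state="${line#ActiveState=}"   # strip the "ActiveState=" prefix
if [ "$state" = "active" ]; then
    echo "puppet is running"
fi
```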

What is a Node Exporter?

Node Exporter is an exporter for the Prometheus monitoring system. It exposes a wide range of system-related metrics over HTTP. For this article the focus is on the systemd metrics. Note that the systemd collector is not enabled by default; Node Exporter has to be started with the `--collector.systemd` flag.

Node Exporter's systemd metrics output looks like this:

node_systemd_unit_state{name="puppet.service",state="activating",type="simple"} 0
node_systemd_unit_state{name="puppet.service",state="active",type="simple"} 1
node_systemd_unit_state{name="puppet.service",state="deactivating",type="simple"} 0
node_systemd_unit_state{name="puppet.service",state="failed",type="simple"} 0
node_systemd_unit_state{name="puppet.service",state="inactive",type="simple"} 0

The important line here is:

node_systemd_unit_state{name="puppet.service",state="active",type="simple"} 1

It tells us that puppet.service's active state is 1. So it is running.
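The whole check therefore boils down to finding that one line and its trailing `1`. A small sketch (the metrics are copied from the sample output above so this runs without a live Node Exporter; on a real host you would fetch them with `curl -s http://<node>:9100/metrics` instead):

```shell
# Sample lines copied from the Node Exporter output above; on a real
# host, fetch them with: curl -s http://localhost:9100/metrics
metrics='node_systemd_unit_state{name="puppet.service",state="activating",type="simple"} 0
node_systemd_unit_state{name="puppet.service",state="active",type="simple"} 1
node_systemd_unit_state{name="puppet.service",state="failed",type="simple"} 0'

# Extract the value of the "active" state for puppet.service
value=$(printf '%s\n' "$metrics" \
  | grep 'name="puppet.service",state="active"' \
  | awk '{print $2}')
echo "active=$value"
```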

Configurations in OpenNMS

This article includes two scenarios:

  1. A simple detector and monitor configuration
  2. A more complex monitor configuration using MetaData

Simple Systemd-Unit detector / monitor

Based on the information above we can use a service detector to detect the service on nodes:

<detector name="Systemd-Unit:Puppet" class="org.opennms.netmgt.provision.detector.web.WebDetector">
    <parameter key="responseText" value="~.*node_systemd_unit_state.*puppet\.service.*1$"/>
    <parameter key="port" value="9100"/>
    <parameter key="path" value="/metrics"/>
</detector>

We use the WebDetector here and search the response text with a regular expression for the important line mentioned above. Port and path are the defaults Node Exporter usually uses.
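A quick way to sanity-check such a pattern before putting it into the detector is `grep -E` against the sample metrics. This is only an approximation (OpenNMS uses Java's regex engine, and the leading `~` just tells OpenNMS to treat the value as a regex; it is not part of the pattern itself), but for a simple pattern like this the behavior is the same:

```shell
# Regex matching the "active ... 1" line (leading "~" removed)
pattern='.*node_systemd_unit_state.*puppet\.service.*1$'

sample='node_systemd_unit_state{name="puppet.service",state="activating",type="simple"} 0
node_systemd_unit_state{name="puppet.service",state="active",type="simple"} 1
node_systemd_unit_state{name="puppet.service",state="failed",type="simple"} 0'

# Count matching lines -- only the line ending in "1" should match
matches=$(printf '%s\n' "$sample" | grep -Ec "$pattern")
echo "matches=$matches"
```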

The corresponding monitor is configured quite similarly:

<service name="Systemd-Unit:Puppet" interval="60000" user-defined="true" status="on">
   <parameter key="retry" value="1" />
   <parameter key="timeout" value="3000" />
   <parameter key="port" value="9100" />
   <parameter key="url" value="/metrics" />
   <parameter key="response-text" value="~.*node_systemd_unit_state.*puppet\.service.*1$" />
   <parameter key="rrd-repository" value="/opt/opennms/share/rrd/response" />
   <parameter key="rrd-base-name" value="puppet" />
   <parameter key="ds-name" value="puppet" />
</service>
<monitor service="Systemd-Unit:Puppet" class-name="org.opennms.netmgt.poller.monitors.HttpMonitor"/>

Systemd-Unit monitor using MetaData

This configuration is largely the same as the one described above.

Every node requires service MetaData definitions. With those we define which systemd units should be monitored.

<model-import xmlns="">
   <node foreign-id="node1" node-label="node1">
      <interface ip-addr="" status="1" snmp-primary="N">
         <monitored-service service-name="Systemd-Unit:Puppet">
            <meta-data context="requisition" key="systemd-unit" value="puppet" />
         </monitored-service>
         <monitored-service service-name="Systemd-Unit:SNMP">
            <meta-data context="requisition" key="systemd-unit" value="snmpd" />
         </monitored-service>
         <monitored-service service-name="Systemd-Unit:Nginx">
            <meta-data context="requisition" key="systemd-unit" value="nginx" />
         </monitored-service>
      </interface>
      <meta-data context="requisition" key="ne_port" value="9100" />
      <meta-data context="requisition" key="ne_path" value="/metrics" />
   </node>
</model-import>

The value of systemd-unit is the exact service name used in systemd. Other parameters like ne_port or ne_path are not strictly required; they could be hard-coded in the monitor definition below. Defining them as MetaData, however, makes it possible to use different ports and/or paths for each node.
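Conceptually, the monitor below takes the `systemd-unit` value of each monitored service and interpolates it into the `response-text` pattern. A rough shell illustration of that substitution (the real interpolation is done by OpenNMS; the unit names here just mirror the requisition above):

```shell
# Sketch of the interpolation OpenNMS performs per monitored service:
# the UNIT placeholder stands in for ${requisition:systemd-unit}.
template='.*node_systemd_unit_state.*UNIT\.service.*1$'

for unit in puppet snmpd nginx; do
    pattern=$(printf '%s' "$template" | sed "s/UNIT/$unit/")
    echo "$unit -> $pattern"
done
```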

The monitor configuration is very generic and is only required once in poller-configuration.xml:

<service name="Systemd-Unit" interval="60000" user-defined="true" status="on">
   <parameter key="retry" value="1" />
   <parameter key="timeout" value="3000" />
   <parameter key="port" value="${requisition:ne_port|9100}" />
   <parameter key="url" value="${requisition:ne_path|/metrics}" />
   <parameter key="response-text" value="~.*node_systemd_unit_state.*${requisition:systemd-unit}\.service.*1$" />
   <parameter key="rrd-repository" value="/opt/opennms/share/rrd/response" />
   <parameter key="rrd-base-name" value="${service:name}" />
   <parameter key="ds-name" value="${service:name}" />
</service>
<monitor service="Systemd-Unit" class-name="org.opennms.netmgt.poller.monitors.HttpMonitor"/>

The monitor matches against the service definitions of each node and uses the service name from there.
port and url fall back to default values if nothing else is defined in the node's MetaData, but as explained above, we have the chance to override those defaults per node.
In the response-text value, OpenNMS interpolates the systemd-unit value defined for each service on the node.
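The fallback syntax `${requisition:ne_port|9100}` behaves much like shell parameter expansion: use the metadata value when set, otherwise the default. A shell analogy of that behavior:

```shell
# Shell analogue of OpenNMS's "${requisition:ne_port|9100}" fallback:
# use the node's metadata value when set, otherwise the default.
ne_port=""                    # node defined no ne_port metadata
port="${ne_port:-9100}"       # falls back to 9100
echo "port=$port"

ne_port="9200"                # node overrides the port
port="${ne_port:-9100}"       # metadata value wins
echo "port=$port"
```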

:woman_facepalming: You can fix me, I’m a wiki post.