Performance Tuning

NOTE: This is based on an older article, and hardware and specific setting values described here are not current, however the concepts around sizing are still relevant.

Tuning Overview


  • If you are designing a new OpenNMS system, carefully read the hardware considerations below.
  • If you already have a running system, you may still find opportunities to move closer to the design described below.
  • Disk I/O and system memory are the main points to look at.
  • Remember that a 64-bit CPU is required for a single process to address more than about 2GB of memory, even with a PAE-aware kernel.

Operating system

  • There are some filesystem parameters to tune for the database and the collected data
  • The system’s shared memory pool might need to be increased for the database
  • If you have 64-bit hardware, be sure to install a 64-bit operating system in order to address more than 4GB of physical memory

Database PostgreSQL

  • A very important part, as there are a lot of parameters to tune. Newer releases (8.4 and later) tend to be configured much more sanely by default than older ones.

Java virtual machine

  • Heap space, permanent generation size, and garbage collection


OpenNMS

OpenNMS itself can generate a lot of data, so carefully decide what you really need. The main areas are:

  • logging
  • data collection
  • data storage and consolidation
  • discovery
  • polling
  • housekeeping

Hardware considerations

If at all possible, use a server with a 64-bit CPU as this will enable the CPU to address more than 4GB of physical memory. Remember that even with a PAE-aware kernel / operating system, most 32-bit OSes don’t allow a given process to address more than about 2GB of memory.

Probably the biggest performance improvement on systems that collect a lot of RRD data is to move PostgreSQL and Tomcat onto a system separate from the OpenNMS daemons; it makes a huge difference.

On a server with hardware RAID, consider investing in a battery-backed write cache (BBWC). On an HP DL380 G4, the server’s I/O wait dropped from an average of 15% to almost nil after a 128 MB BBWC was added. Additionally, ensure that the system has ample memory: on a single-processor HP G4 with 4 GB of memory monitoring about 300 devices with 700 interfaces, the I/O wait time climbed steadily until it was hogging the processor and making OpenNMS crawl. Upping the memory to 12 GB brought the wait time back down to 1%.

For a small collection of monitored nodes, moving the RRD data area into a tmpfs / RAM drive may also alleviate the I/O wait caused by all of the writing required by the RRD data. The trade-off is that a server crash or power-down will cause the RRD files to be lost, unless you implement a sync tool to sync the RAM drive to a disk backup.

Disk Tuning

Because OpenNMS is well-equipped for gathering and recording details regarding network and systems performance and behavior, it tends to be a write-heavy application. If your environment offers a very large number of data points to be managed, it would serve you well to ensure that a large degree of spindle separation exists. In particular and where possible, ensure that:

  • OpenNMS SNMP Collection
  • OpenNMS Response Time Collection
  • OpenNMS (and system) logging
  • PostgreSQL Database
  • PostgreSQL Writeahead logging

…occur on separate spindles, and in some cases separate drives or separate devices. Further, in a *nix environment, it may behoove you to ensure that the RRDs end up on different mounts, so that one has the option of mounting with the noatime directive without compromising other aspects of the system configuration.
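For example, a dedicated RRD volume could be mounted with noatime via an /etc/fstab entry like the following (the device name and filesystem type are placeholders; adjust to your own layout):

```
# Hypothetical fstab entry: separate RRD volume, no access-time updates
/dev/sdb1   /opt/opennms/share/rrd   xfs   defaults,noatime   0 2
```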

The default locations for the OpenNMS directories mentioned above are

/opt/opennms/logs or /var/log/opennms

for logging and /opt/opennms/share/rrd for collected data, but watch out for symbolic links!

The defaults for the PostgreSQL directories mentioned above are usually /var/lib/pgsql/data on Red Hat-style distributions or /var/lib/postgresql/<version>/main on Debian-style ones (with the write-ahead log in the pg_xlog subdirectory),

but note these may change slightly depending on the distro.

As a filesystem, the best performance is achieved with XFS. ext2/ext3 have built-in limits on the number of files per directory and cannot be used on larger installations.

Data storage is the critical factor, so the storage capacity must match the size of the installation: the best performance is achieved with SANs (FibreChannel plus NetApp, EMC, or similar). The important point is that the I/O queue is kept on the “other” device and not on the OpenNMS server.

Recently, good results for smaller systems have been reported with SSD drives.

To tell whether you have a disk bottleneck, there are a couple of quick checks. In top, look at the waiting (“wa”) CPU percentage; press “1” to break out the individual cores/CPUs and see whether one of them sits at 100% wait. This could be caused by the swap file or by any of the directories listed above.

The nmon program can show more detailed information: you will be able to see which spindles are being used, when, and how much they are reading versus writing.
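If neither tool is installed, the cumulative I/O wait share can also be read straight from /proc/stat on Linux. The snippet below is a rough sketch (it reports an average since boot, not a live value):

```shell
#!/bin/sh
# Report the share of CPU time spent in iowait since boot.
# /proc/stat "cpu" line fields: user nice system idle iowait irq softirq ...
iowait_pct() {
    awk '/^cpu / {
        total = 0
        for (i = 2; i <= NF; i++) total += $i
        printf "%.1f", 100 * $6 / total
    }' /proc/stat
}

echo "iowait since boot: $(iowait_pct)%"
```

A steadily climbing value here is the same symptom top shows in its %wa column.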

Memory-backed File Systems

One option, if your server has a lot of RAM, is to modify the OpenNMS startup scripts to maintain a memory-backed file system, combined with automatic backups and restores that handle any internally decided risk levels/SLAs. In Linux, this would be a tmpfs file system.

# XXX Custom code herein for dealing with memory drives
mount | grep -q rrd
if [ $? -ne 0 ]; then
        # RRD location is not present; create it and
        # unpack our data.
        mount -t tmpfs -o size=2G,nr_inodes=200k,mode=0700 tmpfs /opt/opennms/share/rrd
        cd /
        tar xf /mnt/db-backup/opennms-rrd.tar
fi
# XXX End custom code

This modification to /opt/opennms/bin/opennms is matched with a crontab entry that generates the opennms-rrd.tar file periodically.
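A minimal sketch of such a crontab entry (the backup path and 15-minute interval are assumptions; pick values that match your own risk tolerance/SLA):

```
# Hypothetical /etc/cron.d entry: snapshot the tmpfs-backed RRD tree,
# writing to a temporary file first so a crash mid-backup leaves the
# previous archive intact.
*/15 * * * * root cd / && tar cf /mnt/db-backup/opennms-rrd.tar.new opt/opennms/share/rrd && mv /mnt/db-backup/opennms-rrd.tar.new /mnt/db-backup/opennms-rrd.tar
```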

In-the-field: On a DL380 G4, with 6 GB of RAM, 2 GB of RAM was allocated to a memory-backed file system. This reduced the disk I/O load (one shared RAID-10 for Postgres, OS and JRBs; with battery-backed cache) from 300 IOPS to 10 IOPS, along with a correlated drop in load average and response times for the OpenNMS UI.

N.B. In Linux, a tmpfs file system will go to swap if memory pressure demands real memory for applications. This can have a very negative effect on the I/O load and system performance.

Operating system

  • Do run a 64-bit kernel so that OpenNMS will be able to address more than 2GB of memory.
  • Do put OpenNMS logs and RRDs and PostgreSQL data on separate spindles or separate RAID sets. Read details for postgres and RRD below.
  • Do run on a modern kernel. Linux 2.6 and later as well as Solaris 10 or newer are good. Stay away from Linux 2.4, in particular.
  • Set the noatime mount flag on the filesystems hosting the data areas listed above.
  • Adapt the systems shared memory to the database, see PostgreSQL and system’s shared memory section
  • Solaris 10 systems may require increasing ICMP buffer size if polling large numbers of systems (ndd -set /dev/icmp icmp_max_buf 2097152). Use netstat -s -p | grep ICMP and check the value of icmpInOverflows to determine if you’re overflowing the ICMP buffer.

Running OpenNMS in a Virtual Machine

Virtualization is a staple in modern infrastructures, so the question arises under what conditions OpenNMS can be run in a VM.
OpenNMS is known to run in several virtualization environments.
In principle there is no known limitation on running OpenNMS in a VM, but the underlying hardware is critical, so virtual machines must be set up carefully, paying attention to the performance of the hardware on which they are built.
OpenNMS can even take advantage of running in a VM: consider, for example, a setup in which OpenNMS uses a target RRD repository on very fast storage.

As an example of a well-established, working virtualized deployment, we report the following real case:

  • Virtualization Software: Proxmox-Ve
  • Hardware: 2 x Dell PowerEdge R710
  • Every server is configured with:
    • 2 x Intel(R) Xeon(R) CPU L5640 @ 2.27GHz
    • 64 GiB RAM
    • 2 SAS disks, 600 GB, 15k rpm, 3.5" (RAID 1), used for the operating system
    • 2 SSD disks (RAID 1), used for data collection

LVM is widely used to configure volumes on the machines’ local disks.

On this hardware there are three VMs: one used for RANCID integration, one running OpenNMS, and the third running PostgreSQL.

The OpenNMS VM is OpenVZ-based, while the PostgreSQL VM is KVM-based.

The PostgreSQL data is placed on external storage.

The size of the network monitored with such infrastructure is:

  • Nodes: 3500
  • IP Interfaces: 17000
  • SNMP Interfaces: 70000
  • Average size of Events Table: 9402684
  • Size of the RRD repository: 95 GiB

Database PostgreSQL

Shared Buffers

The default shared_buffers parameter in postgresql.conf is extremely conservative, and in most cases on modern servers it can be increased significantly, for a big performance boost and a drop in I/O wait time. This change needs to be made in line with changes to the shmmax kernel parameter.
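As a rough starting point, a common rule of thumb (not an official PostgreSQL recommendation) is to set shared_buffers to about 25% of physical RAM. The helper below is a hypothetical sketch of that arithmetic:

```shell
#!/bin/sh
# Hypothetical helper: suggest shared_buffers as ~25% of physical RAM,
# capped at 8 GB, beyond which returns tend to diminish.
suggest_shared_buffers_mb() {
    ram_mb=$1
    suggestion=$(( ram_mb / 4 ))
    if [ "$suggestion" -gt 8192 ]; then
        suggestion=8192
    fi
    echo "$suggestion"
}

# Read total RAM from /proc/meminfo (Linux) and print a suggestion.
ram_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
echo "shared_buffers = $(suggest_shared_buffers_mb $(( ram_kb / 1024 )))MB"
```

Whatever value you choose, remember to raise the kernel’s shmmax accordingly (e.g. via sysctl) so PostgreSQL can actually allocate that much shared memory.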

The PostgreSQL project wiki aggregates many good links in its Performance Optimization article. Among others linked from there, see Postgres Wiki tuning page and this PostgreSQL performance page for recommendations on this and other PostgreSQL settings.

If you want to put PostgreSQL on a different box, change the SQL host in opennms-datasources.xml. The PostgreSQL server will also need iplike installed and configured.

PostgreSQL 9.1 tuning

Summarized from these blog posts.

For a system that has been running for some time, a good start is to determine what resources are available. Linux systems have a nice SNMP “System Memory Stats” graph to review how system memory is used.

Next, db size can be found:

opennms=# select pg_size_pretty(pg_database_size('opennms')) as db_size;
 691 MB
(1 row)

By creating a view of a complex query,

CREATE EXTENSION pg_buffercache;
create view v_database_cache as
SELECT c.relname,
       pg_size_pretty(count(*) * 8192) AS buffered,
       round(100.0 * count(*) / (SELECT setting FROM pg_settings WHERE name = 'shared_buffers')::integer, 1) AS buffers_percent,
       round(100.0 * count(*) * 8192 / pg_relation_size(c.oid), 1) AS percent_of_relation
FROM pg_class c
INNER JOIN pg_buffercache b ON b.relfilenode = c.relfilenode
INNER JOIN pg_database d ON (b.reldatabase = d.oid AND d.datname = current_database())
GROUP BY c.oid, c.relname
ORDER BY 3 DESC
LIMIT 10;

that query can be run to view cache used by table.

opennms=# select * from v_database_cache;
            relname            | buffered | buffers_percent | percent_of_relation 
 events                        | 266 MB   |            52.0 |                64.5
 event_archives                | 22 MB    |             4.3 |               100.0
 notifications                 | 20 MB    |             3.9 |               100.1
 events_nodeid_display_ackuser | 5616 kB  |             1.1 |                33.2
 outages                       | 5000 kB  |             1.0 |               100.3
 events_ipaddr_idx             | 4496 kB  |             0.9 |                27.5
 events_nodeid_idx             | 4456 kB  |             0.8 |                36.4
 events_uei_idx                | 3648 kB  |             0.7 |                10.5
 iprouteinterface              | 2096 kB  |             0.4 |               101.6
 events_time_idx               | 2256 kB  |             0.4 |                20.0
(10 rows)

Pre-populating the cache can result in better UI responsiveness. You can warm the cache using:

psql -A opennms -c "select * from events; " > /dev/null

And then verify with a query using the view above.

opennms=# select * from  v_database_cache ;
            relname            | buffered | buffers_percent | percent_of_relation 
 events                        | 181 MB   |            17.7 |               100.0
 notifications                 | 44 MB    |             4.3 |                84.6
 outages                       | 16 MB    |             1.6 |               100.0
 events_ipaddr_idx             | 4128 kB  |             0.4 |                40.9
 bridgemaclink                 | 4704 kB  |             0.4 |               100.7
 events_nodeid_idx             | 4008 kB  |             0.4 |                51.9
 events_nodeid_display_ackuser | 4480 kB  |             0.4 |                42.9
 assets                        | 2848 kB  |             0.3 |               101.1
 snmpinterface                 | 1576 kB  |             0.2 |               100.0
 bridgemaclink_pk_idx2         | 2296 kB  |             0.2 |               100.0

PostgreSQL and system’s shared memory

If your OpenNMS system tends to have long response times and shows

  • no disk I/O waits
  • a lot of CPU idle time

then try increasing your operating system’s shared memory (and that of PostgreSQL) as described above. The values given above are absolute minimums; increasing the system’s shared memory may greatly boost OpenNMS performance, as it speeds up the communication between OpenNMS and the database. Try different values for the system’s shared memory, even up to 10 times the minimum or more.
For further details see the links to the PostgreSQL wiki documentation mentioned above.

PostgreSQL any Version

One additional configuration change that seems to make a tremendous performance improvement is putting the write-ahead logs on a separate spindle (even better, a separate disk controller/channel). The way to do this is:

  • shut down OpenNMS / Tomcat
  • shut down PostgreSQL
  • cd to $PG_DATA
  • move the write-ahead log to the separate spindle, e.g. mv pg_xlog /otherdisk/pg_xlog (the target path is an example)
  • symlink it back into place: ln -s /otherdisk/pg_xlog pg_xlog
  • restart PostgreSQL

Make sure postgres data and write-ahead logs do not live on a RAID-5 disk subsystem.

IPLIKE stored procedure

See the documentation in iplike to be sure you have the best version of iplike running.

Postgres and disk I/O waits

The standard PostgreSQL configuration writes transactions to disk before committing them. If there are I/O problems (wait states), database transactions suffer and high application response times result. On test machines, which often run on inappropriate hardware, synchronous writes may be disabled. In case of a system crash, database inconsistencies may result, requiring a rollback of the transaction log, etc. For test systems this is normally not a problem.

Try with following configuration changes in postgresql.conf on postgres 8.3 (or newer):

fsync = off
synchronous_commit = on
commit_delay = 1000

Find problems due to long-running queries

If there is a reasonable suspicion that some queries are running for a very long time, edit postgresql.conf and change the following parameter (PostgreSQL up to 8.3):

 log_min_duration_statement = 1000

This will log all queries running for more than 1000 ms to postgresql.log.

After this change a stop/start of opennms and postgres is required. Don’t forget to remove this configuration after debugging is finished.

You will probably find that, most of the time, “bad database response time” is due not to a single query running for a long time but to thousands of queries each running for a very short time.

Optimization for a lot of small queries

If anybody knows how to optimize PostgreSQL / OpenNMS for this please add it here! There are parameters like max_connections in postgresql.conf and c3p0.maxPoolSize in $OPENNMS_HOME/etc/ which might help here.

Java Virtual Machine (JVM)

The following phenomena in OpenNMS are typical signs of running low on memory in the Java virtual machine:

  • long response times
  • garbage collection is running very often and takes a lot of time (see below)
  • alarms that should have been cleared automatically are still listed as alarms

Tuning heap size

Enable extensive garbage-collection logging (see below) and observe the behaviour in output.log. If garbage collections regularly take a long time (0.5 seconds is an empirical threshold) or run very often (more than once every 10-20 seconds), the Java heap size should be increased. If GC runs every 10 seconds and takes 9 seconds, the system is stuck…

Parameters for tuning java may be added in $OPENNMS_HOME/etc/opennms.conf. If that file doesn’t already exist, check in $OPENNMS_HOME/etc/examples/opennms.conf for a template.

The most important parameter is the Java heap size, given in megabytes, for example:

 JAVA_HEAP_SIZE=2048

The default value is 512, which should be considered a conservative minimum, sufficient only for test cases with one to five managed devices. The value actually in effect can be verified with ps -ef | grep java.

You can roughly test performance improvements by opening the event list in the OpenNMS web UI, adding ?limit=250 to the URL, and pressing Return.

Now there should be 250 events in your list. Press F5 (at least in Firefox and IE this reloads the page) and time how long the page takes to finish refreshing. Repeat this several times to get a good mean value. Now stop OpenNMS, change the heap size as described above, restart OpenNMS, and wait about 10 minutes to let it settle down after starting.
Repeat the measurements, then increase the heap size again as described above. You will get a table like

 heap refresh time
 1536 5-7 sec.
 2048 3-4 sec.
 3072 1-2 sec.

Watch memory and swap usage on your system (for example, using top) and decide which value to keep in the config file.

To speed up the start phase of the java virtual machine you might want to add


though speeding up the startup time is in most cases not a big problem, and the parameter sometimes doesn’t help at all.

Lastly, if you use a JAVA_HEAP_SIZE greater than 4GB, you are recommended to use the G1 Garbage Collector (G1GC). Example settings are listed below (Note: these settings were used successfully on a system collecting JVM / JMX statistics with 1300 nodes and over 5000 services with a JAVA_HEAP_SIZE of 16GB):

 ADDITIONAL_MANAGER_OPTIONS="${ADDITIONAL_MANAGER_OPTIONS} -verbose:gc -XX:+UseG1GC -XX:+UseStringDeduplication -XX:MaxGCPauseMillis=200 -XX:InitiatingHeapOccupancyPercent=45 -XX:+UseCompressedOops -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCDetails -Xloggc:/var/log/opennms/gc.log"

Tuning the maximum Permanent Generation size

If you’re seeing messages in your logs containing a mention of:

 java.lang.OutOfMemoryError: PermGen space

Then you probably need to allocate more memory to the garbage collector’s permanent generation. This section of JVM memory is allocated separately from the heap, and its default maximum size varies according to the platform on which the JVM is running. The OpenNMS 1.8 start script on UNIX and Linux platforms sets the maximum size to 128MB, but you can adjust this value in $OPENNMS_HOME/etc/opennms.conf. For example:

 ADDITIONAL_MANAGER_OPTIONS="${ADDITIONAL_MANAGER_OPTIONS} -XX:MaxPermSize=256m"

Tuning garbage collection

If you have a system with many cores and hardware threads, like Sun’s Niagara CPUs, you might run into the scaling limits described by Amdahl’s Law. You can try to optimize garbage collection by using a different garbage collector.


If you add

 -verbose:gc \
 -XX:+PrintGCDetails \
 -XX:+PrintTenuringDistribution \

to the JVM options, you will get a lot of timing information about garbage collection in OpenNMS’s output.log. The default garbage collector used by OpenNMS is the incremental collector (-Xincgc); others to try are the concurrent mark-sweep collector (-XX:+UseConcMarkSweepGC) and the parallel collector (-XX:+UseParallelGC), which might be the best choice if you have a lot of cores/threads. Once you have settled on a configuration, remove the lines containing verbose and Print from the options.
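Putting that together, a hedged example of the resulting opennms.conf line once the debugging output is removed (the collector choice here is illustrative only):

```
ADDITIONAL_MANAGER_OPTIONS="${ADDITIONAL_MANAGER_OPTIONS} -XX:+UseParallelGC"
```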


Parallel thread library on Solaris systems

It is also useful to use libumem instead of the standard allocation library on Solaris 10.
If you want to enable libumem for an existing application, you can use the LD_PRELOAD environment variable (or LD_PRELOAD_64 for 64-bit applications) to interpose the library on the application and cause it to use the malloc() family of functions from libumem instead of libc, for example:

 LD_PRELOAD=libumem.so.1 opennms start

To confirm that you are using libumem, you can use the pldd(1) command to list the dynamic libraries being used by your application. For example:

 $ pgrep -l opennms
 2239 opennms
 $ pldd 2239
 2239:    opennms



Logging

By default the daemons log at WARN level and the webapp logs at DEBUG level, which causes a lot of extra disk I/O. You can reduce the logging substantially by setting the webapp’s level to WARN as well in /opt/opennms/etc/log4j2.xml (or in the per-daemon log4j configuration prior to OpenNMS 14).

You don’t need to restart OpenNMS; the changes will take effect a few seconds later.

Data Collection

High disk I/O load due to data collection is the major reason for performance problems in many OpenNMS systems.
Hardware and filesystem layout as described above helps a lot.

Another approach is to omit all unnecessary data collections.

Don’t collect what you don’t need

While the “default” snmp-collection definition in datacollection-config.xml provides an easy starting point for small networks, in larger environments it is undesirable to collect everything that can be collected. In those environments a better approach is probably NOT to use the default data collection, but to start by defining packages in collectd-configuration.xml and corresponding snmp-collections in datacollection-config.xml, to ensure that only the values you really care about are collected.

Don’t try to collect what you don’t get

If you try to collect a lot of data from nodes which don’t provide those values you will get a lot of threads waiting for timeouts or getting errors. If you have specific nodes with problems look in your $OPENNMS_HOME/share/rrd/snmp/[nodeid] directory for the node(s) in question and note all the mib objects that are actually being collected.

Another possibility is to change the logging for collectd from WARN to DEBUG:

 <KeyValuePair key="collectd"             value="DEBUG" />

and then fgrep for node[your_nodeid] in collectd.log.

There you should see which variables OpenNMS tries to collect and which variables are successfully collected. The successful ones normally end up in the jRRD files, all others defined in data-collection for this [type of] node can’t be collected for some reason.

If there are too many unsuccessful tries, change your datacollection-config.xml. You may omit those values for all devices, or create new collection groups that contain only those MIB objects the node(s) provide values for. Add a systemDef for your node(s) providing the same values. In collectd-configuration.xml, define a separate package for your node and reference the snmp-collection you just created in datacollection-config.xml. Make sure the node is only in this one package. This gives you an environment to work in that is free of extra clutter and avoids requesting extraneous MIB objects you won’t get a response for. Then experiment with different values for max-vars-per-pdu and timeout, and with SNMP v1 versus v2c.

Don’t forget to change back logging to WARN once you have finished debugging.


Data storage and consolidation

Writing all the SNMP-collected data and the results from polling services (response times) to RRD files produces a lot of disk I/O, so see the disk tuning section above. For further tuning see the fundamentals and some more detailed pages like

  • [[RRD performance fundamentals]]
  • [[RRD_store_by_group_feature]]
  • [[Queueing_RRD]]

Tomcat (if not using built-in Jetty server)

Note that there’s no need to use Tomcat since OpenNMS version 1.3.7 unless you have a specific requirement that the built-in Jetty server in OpenNMS cannot meet.

If not already done at installation time, allow Tomcat to access more memory than the default. The easiest way to do this is via the CATALINA_OPTS environment variable. If the Tomcat software being used has a configuration file as above, it can be added to that file; otherwise it is best just to add it to the startup script, for example:

 CATALINA_OPTS="-Xmx1024m"

The -Xmx option allows Tomcat to access up to 1GB of memory. Of course, this assumes that there is 1GB of available memory on the system; it will need to be tuned to the particular server in use.

Jetty built-in server

Similar to the Tomcat configuration, you can change the JVM startup options in the $OPENNMS_HOME/etc/opennms.conf file. To increase the maximum heap size (the -Xmx Java option), add a line like the following to $OPENNMS_HOME/etc/opennms.conf:

 JAVA_HEAP_SIZE=1024

On Ubuntu, $OPENNMS_HOME is defined in /usr/sbin/opennms as /usr/share/opennms, so the option must be added to the /usr/share/opennms/etc/opennms.conf file.

Provisiond service detection / rescan

If you have a few hundred or more nodes, and adding or rescanning a node takes a long time, you might consider turning up the maximum number of Provisiond scan threads for initial detection of services (scanThreads) or rescans (rescanThreads) at the top of provisiond-configuration.xml. Both attributes have a default value of 10, which should be enough on small-to-medium systems. Doubling these values is probably adequate for networks of up to several thousand nodes.
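For illustration, both attributes sit on the top-level element of provisiond-configuration.xml. The values below are a hypothetical doubling of the defaults, not recommended settings, and other attributes are omitted:

```
<provisiond-configuration scanThreads="20" rescanThreads="20" ... >
```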

Provisiond attempts to detect every service listed in the foreign-source definition for every interface of every node during a rescan. Provisiond’s highly asynchronous model for service detection allows each scan thread to handle multiple detectors on multiple interfaces simultaneously. Compared to Capsd, whose scan threads worked synchronously, this model is far more performant (albeit sometimes harder to debug).

Try removing any service detectors that you know are not relevant in your environment, and consider lowering the values of timeout and retries if appropriate.

Poller threads

If you have good hardware and find your pollers are not completing in time, you can turn up the maximum number of poller threads at the top of poller-configuration.xml.
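For example (a sketch; threads is an attribute of the top-level element, with a historical default of 30, and other attributes are omitted here):

```
<poller-configuration threads="60" ... >
```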

To find out how many threads are actually being used, make sure DEBUG level logging is enabled for daemon/poller.log, then run:

    $ tail -f poller.log | egrep 'PollerScheduler.*adjust:'
    2007-09-05 10:30:32,755 DEBUG [PollerScheduler-45 Pool] RunnableConsumerThreadPool$SizingFifoQueue:
        adjust: started fiber PollerScheduler-45 Pool-fiber2 ratio = 1.0227273, alive = 44
    2007-09-05 10:30:12,783 DEBUG [PollerScheduler-45 Pool-fiber29] RunnableConsumerThreadPool$SizingFifoQueue:
        adjust: calling stop on fiber PollerScheduler-45 Pool-fiber3

Watch the output for a while after startup. The “alive” count shows the number of active poller threads (minus one – the new thread isn’t counted). If the number of threads is continually pegged at the maximum (default 30), you might want to add more threads.

Changes in 1.12 and newer

In OpenNMS 1.12 the RunnableConsumerThreadPool no longer exists; it has been replaced by a ThreadPoolExecutor. Adjust your grep and expectations to roughly:

    $ tail -f poller.log | fgrep 'thread pool statistics:'
    2014-12-15 11:41:04,236 DEBUG [java.util.concurrent.ThreadPoolExecutor@34f5d77e
      [Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]]
      LegacyScheduler: thread pool statistics:
        activeCount=3, taskCount=5265, completedTaskCount=5262, completedRatio=0.999, poolSize=30

Note: that’s all one line in the log file. It’s hideous. Here’s a simplifier:

    $ tail -f poller.log | sed -r -n '/thread pool stat/{s/([-0-9: ,]{23}) .* (thread pool stat.*)/\1 \2/;p}'

Watch activeCount and compare it with poolSize.

Event Handling

All incoming events have to be checked against the configured events to classify them and to handle their parameters correctly. There are a lot of predefined events in OpenNMS, and incoming events are compared against the list of configured events until the first match is found. If you have a lot of incoming events you might consider making the following changes in $OPENNMS_HOME/etc/eventconf.xml:

  • comment out vendor events that you don’t need
  • put the vendor events that make most of your incoming events on top of the list
  • Take care that standard, default, and programmatic events keep their place at the end of the list.
    As there are probably a lot of events hitting the standard or default events configured at the end of the list, re-sorting the event list won’t help as much as commenting events out.

Event Archiving

In the OpenNMS “contrib” directory there is a small script that helps performance by archiving events into a historical event table and updating references to the archived events so that they point at a placeholder event.

You can download the latest version of the script here.

It is recommended that you run this script with a timestamp argument so that you archive one day’s worth of events at a time, beginning with the oldest day, up to the point where you want to keep live events (the default is 9 weeks). Then run the script without a timestamp parameter from cron, as often as you like, from then on.

./ "2008/01/01"
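Once the backlog is archived, a hypothetical crontab entry for the recurring run might look like this (the script name is elided in the source, so a placeholder is used):

```
# Hypothetical: run the event archive script nightly with no timestamp argument
30 2 * * * opennms /opt/opennms/contrib/<archive-script> >/dev/null 2>&1
```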

To analyze why your event table is so large, have a look at [[Event_Maintenance]].