Sizing Cassandra for Newts

The use case for having Cassandra (or ScyllaDB) as the backend for the performance metrics in OpenNMS is to be able to store a huge amount of non-aggregated data, which is not possible with RRDtool.

RRDtool is very good for installations with a finite and predictable amount of metrics, where the size and the I/O requirements can be satisfied by modern SSD-based disks. This is important because RRDtool only scales vertically, meaning that when the limits of the current disks are reached (mostly due to speed, not space), a faster disk is required.

This is when Cassandra or ScyllaDB can help, although there is a steep learning curve: using either of these applications requires a commitment to have qualified personnel managing the database.

ScyllaDB is binary compatible with Cassandra (even at the SSTable level), but they are implemented very differently. Cassandra is implemented in Java, meaning that JVM tuning is required in addition to Cassandra's own internal tuning, because it runs within a JVM. On the other hand, ScyllaDB is implemented in modern C++ and takes full advantage of the hardware it runs on. That means ScyllaDB can manage huge machines as single nodes, whereas Cassandra would require multiple instances on such machines. In terms of performance, it is feasible to get faster results with ScyllaDB than with its Java sibling.

Configuring and managing the two applications is different, even though they provide the same operational result with OpenNMS, so the decision should be analyzed carefully, especially by the team that is going to support this database.

From now on, consider the names Cassandra and ScyllaDB interchangeable, and to simplify the upcoming discussion, the term Cassandra will be used.

Sizing Terms

When sizing Cassandra we need to know the following:

  • Number of Nodes
  • Replication Factor
  • Single DC or Multi-DC environment
  • Number of disks per node
  • Total disk space per node
  • Total retention (TTL)
  • Injection Rate

Note that the last one (the injection rate) is the only operational requirement that is not necessarily easy to estimate, unless it is known exactly how many metrics will be collected at the chosen collection interval.

The following sections explain how to calculate the cluster size, starting with the injection rate.

Evaluation Layer

Since it is very common not to know the number of metrics to be collected, the evaluation layer has been implemented in OpenNMS. It performs data collection as usual against the expected inventory, but only to “count” the number of elements involved.

To use this evaluation layer, the following change is required:

echo "echo 'org.opennms.timeseries.strategy=evaluate' > \
   /opt/opennms/etc/opennms.properties.d/timeseries.properties"

Then, restart OpenNMS.

It is very important to know that no data will be persisted to disk while this feature is enabled. For that reason, it is worth considering a test or development server when OpenNMS is already in production; this separate test/development server should be able to reach all the nodes of the production environment.

This feature is going to count the following:

  • Number of Nodes involved in data collection.
  • Number of IP Interfaces involved in persisting response time data from the poller.
  • Number of unique MibObj Groups (based on the active datacollection-config.xml)
  • Number of unique OpenNMS resources
  • Number of unique numeric metrics
  • Number of unique string-based metrics
  • Injection rate (for numeric metrics)

Here is an example:

2016-05-23 06:03:12,374 INFO  [metrics-logger-reporter-1-thread-1] EvaluationMetrics: type=GAUGE, name=evaluate.groups, value=1341107
2016-05-23 06:03:12,374 INFO  [metrics-logger-reporter-1-thread-1] EvaluationMetrics: type=GAUGE, name=evaluate.interfaces, value=0
2016-05-23 06:03:12,374 INFO  [metrics-logger-reporter-1-thread-1] EvaluationMetrics: type=GAUGE, name=evaluate.nodes, value=6883
2016-05-23 06:03:12,374 INFO  [metrics-logger-reporter-1-thread-1] EvaluationMetrics: type=GAUGE, name=evaluate.numeric-attributes, value=4499569
2016-05-23 06:03:12,374 INFO  [metrics-logger-reporter-1-thread-1] EvaluationMetrics: type=GAUGE, name=evaluate.resources, value=507456
2016-05-23 06:03:12,374 INFO  [metrics-logger-reporter-1-thread-1] EvaluationMetrics: type=GAUGE, name=evaluate.string-attributes, value=1904879
2016-05-23 06:03:12,374 INFO  [metrics-logger-reporter-1-thread-1] EvaluationMetrics: type=METER, name=evaluate.samples, count=163832495, mean_rate=9415.655559643415, m1=7256.328061613966, m5=9467.944242318974, m15=9550.126418154872, rate_unit=events/second

From the above example, we can easily conclude that in the environment where the evaluation layer was running, the injection rate is about 9500 samples per second.
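
To extract those counters from a running system, something like the following can be used (a simple sketch, assuming the default log directory under /opt/opennms/logs; the exact log file that carries these entries depends on the logging configuration):

grep -h "EvaluationMetrics" /opt/opennms/logs/*.log | tail -n 7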

The other values are still extremely useful, as the two main settings required for Newts are based on them.

Estimations based on RRD/JRB files

When for some reason it is impossible to run the evaluation layer, it is still possible to obtain some estimates based on the RRD/JRB files that OpenNMS is currently updating on the production server.

The following tool can help in this regard:

It is important to note that if storeByGroup is not enabled, the tool will only report the total number of metrics; it won’t be able to report the number of Newts resources (or groups) required to size the resource cache.

Newts Caches

As shown before, from the evaluation layer we can get not only the injection rate but also important information about the expected resources and metrics.

Two caches must be configured when using Newts, and they are reserved from the Java heap when OpenNMS starts. That means the effective heap available for the rest of the OpenNMS daemons will be the total heap minus the size of these caches:

  • Ring Buffer
  • Resource Cache

The ring buffer size value has to be a power of 2 due to how this cache works.

When Newts is chosen as the persistence layer, and Collectd is gathering metrics from the target devices, at the end of each collection attempt, the data is passed from the collector implementation (for example the SNMP Collector) to the persistence layer. In the case of Newts, the data is added to this ring buffer and the persistence operation finishes, so Collectd can schedule the next collection attempt.

Then, the configured “write threads” will extract data from the ring buffer and push it to Cassandra using the Newts library and the DataStax driver (which is compatible with both Cassandra and ScyllaDB).

During the persistence phase, the resource cache is built. Entries will be added or removed according to the incoming CollectionSets. The size of this cache should be enough to cover all the entries; it is used to accelerate other Newts-related features, like enumerating resources, metrics, and string attributes for graphing purposes.

Unfortunately, filling up this cache has proven to be a very expensive and intensive operation when OpenNMS starts. It is intensive because it performs what is called “Newts indexing” at a very high rate against Cassandra, to either write new entries or read existing ones. While this is happening, the “Newts writing” speed is affected, meaning that lots of entries will live in the ring buffer for a while, until the indexing is completed.

For this reason, the ring buffer size must be configured to accommodate all the metrics temporarily, even if it was designed for a different purpose.

Once the indexing is done, the ring buffer will be barely filled up, unless there is a problem with Cassandra.

The capacity of the Cassandra cluster dictates how fast it can write and read data. Depending on how fast it is, it can complete the indexing phase in a very short time, or this phase can take several minutes.

The bigger the cluster, the faster it will be, but the idea is to size the cluster based on the data that will be stored, rather than sizing it just to accommodate the indexing phase.

On big installations, the benefit is that a big cluster can be easily justified, so the injection rate capacity of the cluster itself might already be high enough.

Unfortunately, how fast a cluster can receive data for writing purposes cannot be easily estimated, as it depends on multiple factors. For this reason, after choosing the technology (ScyllaDB or Cassandra) and the hardware (even if it is an estimate), field tests are required to understand whether the cluster is fast enough. Fortunately, OpenNMS provides a tool for this purpose.

Back to the resource cache, here is how to estimate its size.

The data stored in the resource cache corresponds to the number of resources plus the number of unique groups. On average, each entry takes about 1KB, meaning that the chosen number of entries can be read as the number of kilobytes taken from the heap. A similar rule applies to the ring buffer.

For example, let’s say you have one router with two physical interfaces and three IP addresses, using the default snmp-collection. You are going to have the following entries in the cache:

response:10.0.0.1:icmp
response:11.0.0.1:icmp
response:12.0.0.1:icmp
snmp:fs:Office:router:mib2-tcp
snmp:fs:Office:router:juniper-fwdd-process
snmp:fs:Office:router:ge_0_0
snmp:fs:Office:router:ge_0_0:mib2-X-interfaces
snmp:fs:Office:router:ge_0_0:mib2-X-interfaces-pkts
snmp:fs:Office:router:ge_0_0:mib2-interface-errors
snmp:fs:Office:router:ge_0_1
snmp:fs:Office:router:ge_0_1:mib2-X-interfaces
snmp:fs:Office:router:ge_0_1:mib2-X-interfaces-pkts
snmp:fs:Office:router:ge_0_1:mib2-interface-errors

From the list, the resource cache is going to have 13 entries for this device: the first 3 come from the poller (response time for ICMP on each IP address of the device), followed by the groups associated with the node-level resource, and then a set of entries for each interface (one for the interface itself, plus one for each MibObj group).

Back to the results from the evaluation layer:

type=GAUGE, name=evaluate.groups, value=1341107
type=GAUGE, name=evaluate.interfaces, value=0
type=GAUGE, name=evaluate.nodes, value=6883
type=GAUGE, name=evaluate.numeric-attributes, value=4499569
type=GAUGE, name=evaluate.resources, value=507456
type=GAUGE, name=evaluate.string-attributes, value=1904879
type=METER, name=evaluate.samples, count=163832495, mean_rate=9415.655559643415, m1=7256.328061613966, m5=9467.944242318974, m15=9550.126418154872, rate_unit=events/second

On the installation where the evaluation layer was enabled, we can infer that the size of the resource cache will be:

groups + resources = 1341107 + 507456 = 1848563

Finally, we can round it up and configure the following in OpenNMS:

echo "org.opennms.newts.config.cache.max_entries=2000000" >> \
  /opt/opennms/etc/opennms.properties.d/newts.properties

As mentioned, for the ring buffer there is no fixed rule, as it depends on how fast the chosen cluster is. As a rule of thumb, a good starting point is the nearest power of 2 greater than 2 times the size of the resource cache. In this particular case it would be:

echo "org.opennms.newts.config.ring_buffer_size=4194304" >> \
  /opt/opennms/etc/opennms.properties.d/newts.properties

IMPORTANT: It is recommended to perform field tests to evaluate the impact of the Newts indexing phase when OpenNMS starts, in order to validate the chosen value for the ring buffer.

As mentioned, the size of each entry is approximately 1KB, meaning that the configured values can roughly be read as kilobytes of heap (about 2GB for the resource cache and 4GB for the ring buffer in this example). In other words, considering that over 6GB of the heap will be dedicated to these 2 caches, the total heap should be greater than 8GB. That is the bare minimum, as OpenNMS requires at least 2GB for basic operations. For production, starting with 16GB (meaning 10GB for OpenNMS and 6GB for the buffers) would be better.
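
For reference, the heap is typically adjusted through $OPENNMS_HOME/etc/opennms.conf, where JAVA_HEAP_SIZE is expressed in megabytes. The following is only a sketch for the 16GB example discussed above:

cat <<EOF >> /opt/opennms/etc/opennms.conf
# 16GB heap: roughly 6GB for the Newts caches plus 10GB for the rest of OpenNMS
JAVA_HEAP_SIZE=16384
EOF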

There are 2 implementations of the resource cache:

  • org.opennms.netmgt.newts.support.GuavaSearchableResourceMetadataCache
  • org.opennms.netmgt.newts.support.RedisResourceMetadataCache

Based on field tests, using the Redis-based implementation is not recommended, as it has proven to be extremely slow for production loads (judging by the time required to drain the ring buffer after the indexing is done), even though it would otherwise be a very attractive option, especially when having external WebUI servers for OpenNMS that don’t have a cache of their own. If the cache doesn’t exist, rendering the Choose Resources page and the graph pages can take a considerable amount of time, depending on how many active resource types have been configured on the system.

Cluster Size

To estimate the cluster size, in other words:

  • Number of Nodes
  • Number of disks per node
  • Total disk space per node

We need to know the following:

  • Injection Rate
  • Total Retention
  • Replication Factor

As a general rule, the more data you want to keep, the more disk space per node will be required. This applies to all three of the settings listed above.

The replication factor facilitates High Availability, but it is important to keep in mind that this is not a way to back up the cluster.

The minimum size for a cluster is 3 nodes with a replication factor of 2. Having a replication factor of 2 means that one node can be down without losing data.

The formula for how many nodes can be down at a given time is:

NumberOfNodesDownSimultaneously = ReplicationFactor - 1

For bigger clusters, it makes sense to have a reasonably higher replication factor, but that imposes restrictions on the disk space for obvious reasons: increasing the replication factor means having an additional copy of the data somewhere else.

The discussion about Multi-DC won’t be covered here, but one thing to keep in mind is that a Multi-DC deployment can serve as a disaster recovery solution and might be considered as part of a backup strategy.

The retention, or TTL as we call it within OpenNMS, is the amount of time the data will be kept on the cluster. When this time expires, the data is removed automatically. This is done through a Cassandra feature also called TTL, which is a property associated with every single metric inserted into the cluster.

The greater the retention is, the more disk space will be needed.

Another factor when choosing the disk space is the compaction strategy. The default compaction strategy used by Cassandra and Newts is STCS (Size Tiered Compaction Strategy). This strategy is well known for wasting disk space. In fact, to use this strategy, the physical disk size on each node should be 2 times the expected data to be stored. In other words, a 50% overhead is required to perform compactions.

Considering that it is mandatory to use local and ultra-fast disks (i.e. Tier 1 Server Grade SSDs), a 50% overhead can be a very expensive feature.

For this reason, it is extremely important to reduce the overhead, and the only way to do it is by using a different compaction strategy.

Time Series Data

When the data to be stored in Cassandra consists of time series metrics, meaning immutable, timestamp-based entries stored in tables, the best approach is to use TWCS (Time Window Compaction Strategy).

The overhead when using this strategy can be as big as one time-windowed chunk. The size of the chunk depends on how TWCS is configured, but in practice it can be around 5% of the disk space. Compared with STCS, it is clear which one is the winner.

However, it is important to keep in mind the time series consideration: the data has to be immutable. In other words, once stored, it won’t be altered or manually modified; it will only be evicted by TTL when it is time to be removed. If this is not the case, TWCS won’t help as much as it should, meaning the overhead on disk space will be greater.

Fortunately, the data stored by OpenNMS through Newts can be considered time series data, so we can use this strategy.

IMPORTANT: The keyspace for Newts has to be configured manually when a different compaction strategy or a different replication strategy is going to be used. That means the usage of cqlsh is mandatory; the $OPENNMS_HOME/bin/newts facility won’t work in this case and should not be used.
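
For reference, the keyspace itself could be created manually from cqlsh with something like the following sketch (SimpleStrategy with a replication factor of 2 is only an example, NetworkTopologyStrategy would be the choice for Multi-DC environments, and cassandra-host is a placeholder for one of the nodes of the cluster):

cqlsh cassandra-host -e "CREATE KEYSPACE IF NOT EXISTS newts \
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};"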

There is going to be a section dedicated to the TTL, but let’s assume a year of retention. Here is a way to configure this strategy:

CREATE TABLE newts.samples (
  context text,
  partition int,
  resource text,
  collected_at timestamp,
  metric_name text,
  value blob,
  attributes map<text, text>,
  PRIMARY KEY((context, partition, resource), collected_at, metric_name)
) WITH compaction = {
  'compaction_window_size': '7',
  'compaction_window_unit': 'DAYS',
  'expired_sstable_check_frequency_seconds': '86400',
  'class': 'TimeWindowCompactionStrategy'
} AND gc_grace_seconds = 604800
  AND read_repair_chance = 0;

The “window size” is configured through 2 settings:

  • compaction_window_size
  • compaction_window_unit

The reason for choosing 7 days is the following: for a 1-year retention, the number of compacted chunks is going to be 52 (as there are 52 weeks in a year). This is a little higher than the recommended number of chunks, but in practice it is reasonable, especially because it simplifies the calculations. For different retentions, try to target around 40 chunks.

WARNING: The compaction strategy is declared on the samples table, which effectively makes it a global setting for all the metrics stored in the keyspace.

TTL

WARNING: This is a global setting in OpenNMS.

When configuring OpenNMS, the administrator should choose one value to be used as the retention for every single metric collected on the system.

Unlike with RRDtool, it is impossible to have different retention values for different metrics. If a given customer wants different retentions, it would be necessary to configure one OpenNMS server with a dedicated Newts keyspace for each TTL. With this schema in place, Grafana is the only way to have a unified view of all the metrics.

As mentioned, the retention is the amount of time a given metric will exist on the Cassandra cluster. When this time expires, the data will be evicted from the keyspace during compaction.
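
In OpenNMS, the retention is controlled by a single property expressed in seconds. A sketch for a one-year retention, assuming the org.opennms.newts.config.ttl property, would be:

# 365 days expressed in seconds
echo "org.opennms.newts.config.ttl=31536000" >> \
  /opt/opennms/etc/opennms.properties.d/newts.properties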

Sizing Formula

Knowing the number of metrics to be persisted (a.k.a. metricsCapacity), it is possible to assume the following:

availBytesPerNode = totalDiskSpacePerNodeInBytes * (1 - percentageOverhead) 

clusterUsableDiskSpace = (availBytesPerNode * numberOfNodes) / replicationFactor

sampleCapacity = clusterUsableDiskSpace / averageSampleSize

totalSamplesPerMetric = (ttl * 86400) / (collectionStep * 60)

metricsCapacity = sampleCapacity / totalSamplesPerMetric

totalSamplesPerMetric would be the total number of rows in the newts.samples table in Cassandra per metric.

Each installation is different, but on average we can consider that the size of a single sample (i.e. averageSampleSize) is going to be about 18 bytes.

Choose the number of nodes to calculate the required available bytes per node; or, vice versa, choose the disk size to calculate the number of nodes.

In general, for Cassandra, it is recommended to never use a disk greater than 4TB per node. ScyllaDB is different: they recommend having a “30:1” relationship between the disk space in gigabytes and the available RAM on the system.

For example, let’s assume 3TB per node, and the request is to collect data from 35 million metrics every 5 minutes for 3 months, assuming TWCS with 5% overhead and a replication factor of 2. The number of nodes can be calculated like this:

numberOfNodes = ((ttl * 86400) / (collectionStep * 60) * metricsCapacity * averageSampleSize * replicationFactor) / (totalDiskSpacePerNodeInBytes * (1 - percentageOverhead))

In other words,

numberOfNodes = ( ((90 * 86400) / (5 * 60)) * 35000000 * 18 * 2 ) / (2.85 * 2^40) ≈ 10

With the above assumptions, a 10-node cluster with 3TB of disk space per node, using TWCS and a replication factor of 2, is required to persist 35 million metrics every 5 minutes.
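
As a sanity check, the same arithmetic can be reproduced with a small shell calculation (a sketch only, using bc with the assumptions from above):

# ttl in days, step in minutes, disk per node in TiB, 5% overhead, replication factor of 2
ttl=90; step=5; metrics=35000000; sampleSize=18; rf=2; diskTiB=3; overhead=0.05
echo "(($ttl*86400)/($step*60) * $metrics * $sampleSize * $rf) / ($diskTiB*(1-$overhead) * 2^40)" | bc -l

The result is roughly 10.4, hence the 10-node estimate (rounding up to 11 nodes would leave some headroom).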

Let’s say the injection rate is well known (either because it was part of the requirements, or because the evaluation layer has been used). In this case, the total number of metrics to be collected is the injection rate multiplied by the collection interval in seconds, which is another way to obtain the number of metrics. For example, an injection rate of 9500 samples per second with a 5-minute (300-second) interval corresponds to roughly 2.85 million metrics.

As shown, the assumption includes knowing the number of metrics to be collected. When this is not known, an assumption on the number of nodes has to be made to estimate the total amount of metrics the cluster will be able to handle.

Cassandra grows linearly. That means, to have 4 times that capacity (i.e. 140 million metrics), we need to multiply the cluster size by 4. It is not recommended to increase the disk size per Cassandra node, as this is an anti-pattern, so the obvious variable to increase is the number of nodes in the cluster.

If we increase from 10 to 40 nodes, keeping the same assumptions, we cover the new requirement. That said, for that number of nodes, it is better to increase the replication factor to have more room for potential outages. If that parameter changes, more nodes will be required, which can be easily inferred from the formulas.

OpenNMS Configuration

At this point, we know that at a minimum the following parameters should be configured in OpenNMS:

  • Retention or TTL
  • Resource Cache Size
  • Ring Buffer Size
  • Writer Threads
  • Heap Size

We already provided a way to calculate the cache sizes and the size of the cluster itself.

One thing to keep in mind is that it is entirely possible to end up with the same injection rate from a different combination of resources and groups, which impacts the resource cache and the ring buffer differently.

To explain that, let’s introduce the stress tool.

Stress Tool

OpenNMS offers a tool that generates synthetic traffic, similar to what Collectd produces, to understand whether the OpenNMS settings, the Newts settings, and the chosen Cassandra cluster can fulfill the needs.

WARNING: It is important to keep in mind that the actual work of Collectd can be more expensive than what this tool does. For this reason, the chosen OpenNMS server should never exceed 20% CPU usage when executing the stress tests, to leave room for Collectd and for all the other OpenNMS daemons that would be running and doing work in any given production environment.

This tool is a Karaf Command, which requires access to the Karaf Shell:

ssh -o ServerAliveInterval=10 -p 8101 admin@localhost

The ServerAliveInterval is mandatory to keep the session alive; otherwise, you would have to restart OpenNMS if the session is closed.

From there, executing metrics:stress --help provides an overview of the parameters you can tune during the test.

WARNING: It is recommended to use this on a clean installation of OpenNMS and the cluster that will be evaluated.

Some tests have been executed using ScyllaDB and the results were published here.

An interesting fact from that test is that the bigger the cluster is, the faster it will be. That means, adding more nodes to the cluster will make it faster, as the work will be evenly distributed across all the nodes.

It is also important to keep the resource cache in mind, and here is why:

The following command:

metrics:stress -r 60 -n 15000 -f 100 -g 1 -a 20 -s 1 -t 200 -i 300

Injects 5000 string metrics per second, and 100000 numeric metrics per second, creating 3000000 entries on the resource cache.

On the other hand, the following command:

metrics:stress -r 60 -n 15000 -f 20 -g 5 -a 20 -s 1 -t 200 -i 300

Also injects 5000 string metrics per second and 100000 numeric metrics per second, but it creates 1800000 entries on the resource cache.

That means, even if the injection rate is the same, the requirements for the resource cache and the ring buffer will be different.
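
To illustrate where those numbers come from, the arithmetic can be written down explicitly (a sketch, under the assumption that -n is the number of nodes, -f the interfaces per node, -g the groups per interface, -a and -s the numeric and string attributes per group, and -i the collection interval in seconds):

# first command:  15000 nodes, 100 interfaces per node, 1 group per interface, 20 numeric attributes
echo "15000 * 100 * 1 * 20 / 300" | bc     # 100000 numeric samples per second
echo "15000 * 100 + 15000 * 100 * 1" | bc  # 3000000 resource cache entries (resources + groups)
# second command: 15000 nodes, 20 interfaces per node, 5 groups per interface, 20 numeric attributes
echo "15000 * 20 * 5 * 20 / 300" | bc      # 100000 numeric samples per second
echo "15000 * 20 + 15000 * 20 * 5" | bc    # 1800000 resource cache entries (resources + groups)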

Based on field tests on a 16-node cluster, a single OpenNMS server can inject 100000 samples per second with 2000000 entries in the resource cache and 4194304 entries in the ring buffer, using the second command above.

That said, with the same setup, to handle the first command the cache values have to be doubled (4000000 entries for the resource cache and 8388608 entries for the ring buffer).

Based on field tests, having larger buffer sizes can be counter-productive, meaning that at some point the load has to be divided across multiple OpenNMS servers.

While figuring out the cache sizes for the expected load, other parameters can be tuned, but the way in which data collection is configured influences the resource cache directly, as that is where the groups of objects to collect are defined (like the MibObj groups inside a datacollection-group).

Data Collection Configuration

As mentioned, it is crucial to review all the metrics that are going to be collected, to avoid inflating the caches and the injection rate with unnecessary data. If this step is not performed, the environment could end up with a big cluster and a big machine for OpenNMS (or multiple ones) for no reason other than collecting “all the data that’s available”, instead of storing “only the data that’s needed”.

For example, consider the following content extracted from $OPENNMS_HOME/etc/datacollection/mib2.xml:

<group name="mib2-X-interfaces" ifType="all">
  <mibObj oid=".1.3.6.1.2.1.31.1.1.1.1"  instance="ifIndex" alias="ifName" type="string"/>
  <mibObj oid=".1.3.6.1.2.1.31.1.1.1.15" instance="ifIndex" alias="ifHighSpeed" type="string"/>
  <mibObj oid=".1.3.6.1.2.1.31.1.1.1.6"  instance="ifIndex" alias="ifHCInOctets" type="Counter64"/>
  <mibObj oid=".1.3.6.1.2.1.31.1.1.1.10" instance="ifIndex" alias="ifHCOutOctets" type="Counter64"/>
</group>
<group name="mib2-X-interfaces-pkts" ifType="all">
  <mibObj oid=".1.3.6.1.2.1.31.1.1.1.7"  instance="ifIndex" alias="ifHCInUcastPkts" type="Counter64"/>
  <mibObj oid=".1.3.6.1.2.1.31.1.1.1.8"  instance="ifIndex" alias="ifHCInMulticastPkts" type="Counter64"/>
  <mibObj oid=".1.3.6.1.2.1.31.1.1.1.9"  instance="ifIndex" alias="ifHCInBroadcastPkts" type="Counter64"/>
  <mibObj oid=".1.3.6.1.2.1.31.1.1.1.11" instance="ifIndex" alias="ifHCOutUcastPkts" type="Counter64"/>
  <mibObj oid=".1.3.6.1.2.1.31.1.1.1.12" instance="ifIndex" alias="ifHCOutMulticastPkt" type="Counter64"/>
  <mibObj oid=".1.3.6.1.2.1.31.1.1.1.13" instance="ifIndex" alias="ifHCOutBroadcastPkt" type="Counter64"/>
</group>
<group name="mib2-interface-errors" ifType="all">
  <mibObj oid=".1.3.6.1.2.1.2.2.1.13" instance="ifIndex" alias="ifInDiscards" type="counter"/>
  <mibObj oid=".1.3.6.1.2.1.2.2.1.14" instance="ifIndex" alias="ifInErrors" type="counter"/>
  <mibObj oid=".1.3.6.1.2.1.2.2.1.19" instance="ifIndex" alias="ifOutDiscards" type="counter"/>
  <mibObj oid=".1.3.6.1.2.1.2.2.1.20" instance="ifIndex" alias="ifOutErrors" type="counter"/>
</group>

The above section is perfectly valid. Now imagine a scenario where there is a need to monitor 1000 Cisco Nexus Switches, each of them with 1500 Interfaces (between physical and virtual interfaces).

Because we have 3 groups associated with interface statistics, the Newts persistence strategy is going to create:

1000 * 1500 + 1000 * 1500 * 3 = 6000000

In other words, an entry for each resource (in this case, each interface), plus an entry for each group on each interface.

A cache of that size, which does not even account for the node-level resources, the groups for the node-level resources, and other resources in general, would require a ring buffer of 16777216 entries, or a tremendously big cluster to be able to handle the indexing with a smaller ring buffer.

Now, if we combine these 3 groups into one, which is entirely possible because they all share the same resource type (i.e. the value of instance is the same), we dramatically reduce the resource cache to:

1000 * 1500 + 1000 * 1500 = 3000000

Meaning, we would be able to handle the load with a ring buffer of 8388608 entries. This is important considering that, for a Java application, having a heap size greater than 31GB can be dangerous (compressed object pointers are lost past that point).

The proposed solution to reduce the entries on the resource cache is to have:

<group name="mib2-X-interfaces-full" ifType="all">
  <mibObj oid=".1.3.6.1.2.1.31.1.1.1.1"  instance="ifIndex" alias="ifName" type="string"/>
  <mibObj oid=".1.3.6.1.2.1.31.1.1.1.6"  instance="ifIndex" alias="ifHCInOctets" type="Counter64"/>
  <mibObj oid=".1.3.6.1.2.1.31.1.1.1.7"  instance="ifIndex" alias="ifHCInUcastPkts" type="Counter64"/>
  <mibObj oid=".1.3.6.1.2.1.31.1.1.1.8"  instance="ifIndex" alias="ifHCInMulticastPkts" type="Counter64"/>
  <mibObj oid=".1.3.6.1.2.1.31.1.1.1.9"  instance="ifIndex" alias="ifHCInBroadcastPkts" type="Counter64"/>
  <mibObj oid=".1.3.6.1.2.1.31.1.1.1.10" instance="ifIndex" alias="ifHCOutOctets" type="Counter64"/>
  <mibObj oid=".1.3.6.1.2.1.31.1.1.1.11" instance="ifIndex" alias="ifHCOutUcastPkts" type="Counter64"/>
  <mibObj oid=".1.3.6.1.2.1.31.1.1.1.12" instance="ifIndex" alias="ifHCOutMulticastPkt" type="Counter64"/>
  <mibObj oid=".1.3.6.1.2.1.31.1.1.1.13" instance="ifIndex" alias="ifHCOutBroadcastPkt" type="Counter64"/>
  <mibObj oid=".1.3.6.1.2.1.31.1.1.1.15" instance="ifIndex" alias="ifHighSpeed" type="string"/>
  <mibObj oid=".1.3.6.1.2.1.2.2.1.13"    instance="ifIndex" alias="ifInDiscards" type="counter"/>
  <mibObj oid=".1.3.6.1.2.1.2.2.1.14"    instance="ifIndex" alias="ifInErrors" type="counter"/>
  <mibObj oid=".1.3.6.1.2.1.2.2.1.19"    instance="ifIndex" alias="ifOutDiscards" type="counter"/>
  <mibObj oid=".1.3.6.1.2.1.2.2.1.20"    instance="ifIndex" alias="ifOutErrors" type="counter"/>
</group>

Of course, that’s assuming that all the metrics will be used by the operators of the monitoring platform; otherwise, it is advised to remove what’s not going to be used.

Write Threads

A thread pool is dedicated to extracting metrics from the ring buffer and pushing them to Cassandra through Newts. The number of threads should be tuned to match the number of cores of the OpenNMS server.

During field tests, we found that increasing the number of threads is not necessarily useful, as their impact on the injection rate is not dramatic, while having more threads working could increase the overall CPU usage of the OpenNMS server.
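
A sketch of how this could look, assuming an 8-core OpenNMS server and the org.opennms.newts.config.writer_threads property:

echo "org.opennms.newts.config.writer_threads=8" >> \
  /opt/opennms/etc/opennms.properties.d/newts.properties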

DataStax Driver Settings

There are 2 parameters that can be tuned on the Driver:

  • org.opennms.newts.config.max-connections-per-host
  • org.opennms.newts.config.max-requests-per-connection

To learn more about them, please refer to the driver’s documentation.
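
A hedged example of how these could be set (the values shown are purely illustrative and should come from field tests and the driver’s documentation):

echo "org.opennms.newts.config.max-connections-per-host=24" >> \
  /opt/opennms/etc/opennms.properties.d/newts.properties
echo "org.opennms.newts.config.max-requests-per-connection=8192" >> \
  /opt/opennms/etc/opennms.properties.d/newts.properties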
