Tweak parameters in flow definitions in Telemetryd and their usage

Problem:
[Trying to get flow support working. I went through the setup and got a running system with flows (Netflow 9 from a Mikrotik router) visualized in the Helm flow deep dive. But it seems that some data is missing, especially at high interface load (1 Gbit/s and above).
The router is exporting about 5,000 flows/s; the bandwidth consumed on the incoming interface is roughly 1.5 Mbit/s.
I went through all the parameters in the telemetryd config but am not quite sure what all the parameters really mean. Furthermore, I do not know which units are used for the parameters, especially cache sizes (kbyte, megabyte, number of flow messages…) and all the timeouts (seconds, milliseconds, minutes?).
Is there some info about this, best-practice examples, or howtos where these things are explained, or are there resources where one can gather all this together (e.g. GitHub…)?
I also collected the JVM and Netflow performance metrics, and I don't understand what the colours really mean. Are these flows dropped, or delayed? Who can help here?
As this is only a proof of concept, Elasticsearch is on the same machine as OpenNMS, with 8 cores, 64 GB of RAM and 1 TB of storage. CPU usage averages about 1 (1-minute average) with peaks to 3 (out of 8 max.). Do I need better hardware?
]

Expected outcome:
[Visualizing flows from ports with up to 5 Gbit/s of traffic, max. 10,000 flows/s]

Version:
[28.0.1]

Other relevant data:
[none]

Regards

Hans

Have you tried the documentation?

Hello,
yes, I did read everything I could find about parsers, listeners, queues and their parameters. That is why I am asking these questions: I could not find any answers there. I also searched the Discourse lists for the topics flow and telemetryd, with no success.

Regards
Hans

If you’ve read the docs and still have questions, you’re going to need to be specific about which parameters you have questions about.

Hello Dino,

so let's dive into it:

Telemetryd Queues:
Are there hints on how the parameters influence performance? Any best practices?

  • threads (default: 2 × number of cores): How time-consuming is message dispatching for flows? Can this value be increased safely, and is there a way to monitor this?

  • queue-size (default: 10000 messages): Is this 10,000 flows or 10,000 Netflow 9 messages (containing more than 10,000 flows in total)? Is it better to decrease this value in order to have faster dispatching, or to increase it in order not to lose messages?

Netflow 9 UDP Parser:

  • templateTimeout (default: 30 minutes): If I want to change it to adapt to the flow source, is this value in minutes, seconds, or milliseconds?

  • dnsLookupsEnabled (default: true): How much does this value influence performance? Are there timeouts? Is it OK to disable DNS lookups?

  • flowActiveTimeoutFallback, flowInactiveTimeoutFallback, flowSamplingIntervalFallback: The meaning is clear, but what are the units? Seconds, milliseconds, minutes?

Some additional questions on monitoring performance: how do I interpret the opennms-jvm, sink consumer, and telemetry adapter/listener metrics related to flows?

For example, message dispatch time: what do the colors mean? Should everything be green?

Or packets per log: does this look OK?

Furthermore, there are metrics for flow log enrichment, flow log persisting latency, and flows persisted. Is one of them a measure of whether some flows were not persisted, i.e. flows lost?

Regards Hans

You can increase it, but at some point the additional parallelism won't benefit you anymore unless you're also adding cores. This is more often tuned on Minion and Sentinel, where you're also constrained by the number of partitions in your Kafka topic (in that case you need as many partitions as threads, in a 1:1 relationship).

It’s 10000 flow documents (or logs), after it has been processed by the parser. A single log can contain multiple flows, that’s the flowPerLog graph you’ve posted above. You can view the size of the queue for a given protocol as Sink Producer Metrics graph resource Telemetry-protocol. If you can’t maintain or control the growth of the queue, you aren’t processing flows fast enough, so it’s a good barometer.

You will (probably) never need to adjust this. As I understand it, flow observers send special options template packets at some interval that describe the format of the flow packets they're sending to the receiver. If we don't get a new options template within templateTimeout minutes, we expire the template. This is actually in minutes, which is rare (almost everything is in milliseconds in OpenNMS).

DNS lookups can add a ton of latency to flow enrichment; even with the caching we do internally, performing reverse DNS lookups during enrichment can be a huge performance hit. If you're struggling with flow performance, set this to false and see if there's an improvement.

I’ve never seen these changed in a production environment. Unit is milliseconds (that’s true in almost all cases in OpenNMS that I can think of, templateTimeout being the only exception I can think of offhand)

It’s a histogram. The redder/hotter colors are higher percentiles, e.g. the dark red is 999th percentile latency, meaning 99.9% of your packets are faster than this. We should probably be more consistent with the coloring on those graphs. :-/

Packets per log is a measure of the number of flows per packet we're processing and how that's varying. If 50% of your Netflow packets contained 100 flows, your light green line would be at 100. In your graph, 50% of your flow logs contain an average of 2.61 flows.

There’s counters for flows dropped, but I don’t think there’s graphs shipped for them. You can easily visualize them in Helm, though, if you’re curious.

There’s also an mbean for invalid flows (from the parser) but it’s not collected or graphed out of the box (though it probably should be).

Hello Dino,
Thank you very much for the comprehensive reply. I was not able to spend more time on the issue over the last two weeks. When I have further questions, I will get in touch.
Regards
Hans