Measurements filter API incredibly slow and unusable


#1

Hello,

I’m trying to use the measurements API’s filter feature for a Percentile calculation. Unfortunately it seems impossible to do this on any serious number of samples. The Derivative filter behaves similarly, so it seems likely that the issue is not in the Apache percentile code but something bigger:

user@opennms ~
 ❯ curl -X POST -H "Accept: application/json" \                                                                               [16:03:10]
     -H "Content-Type: application/json" \
     -u foo:bar "http://127.0.0.1:8980/opennms/rest/measurements" \
     -d '{
           "start": 1550588428000,
           "end": 1550934028000,
           "step": 1,
           "maxrows": 0,
           "source": [
               {
                   "aggregation": "AVERAGE",
                   "attribute": "ifHCInOctets",
                   "label": "ifHCInOctets",
                   "resourceId": "nodeSource[internet-routers:edge1.mars].interfaceSnmp[TenGigE0_0_1_0-d46d50274570]",
                   "transient": "false"
               }
           ],
           "filter": [
               {
                   "name": "Percentile",
                   "parameter": [
                       {
                           "key": "inputColumn",
                           "value": "ifHCInOctets"
                       },
                       {
                           "key": "outputColumn",
                           "value": "ifHCInOctetsPerc"
                       }
                   ]
               }
           ]
         }' | jq '.columns[1].values' | uniq -c
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  303k    0  302k  100   980   7006     22  0:00:44  0:00:44 --:--:-- 75373
      1 [
   5761   65802853.94648651,
      1   65802853.94648651
      1 ]

When querying ifHCInOctets with the Percentile filter over a time period of 4 days, as demonstrated above, the server takes 44 seconds to respond.

As you can see, I pipe the response through jq and uniq to count the returned Percentile values: 5761 identical Percentile values = 5761 samples in 4 days, one sample per minute.
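For anyone unfamiliar with that counting trick, here is the same jq | uniq -c pipeline run against a tiny canned stand-in for the API response (file name and values are made up for illustration). Identical array entries collapse into one counted line; the last element has no trailing comma, so it shows up as its own line, exactly as in the output above.

```shell
# Miniature stand-in for the measurements API response (hypothetical values).
cat > /tmp/response.json <<'EOF'
{"columns":[{"values":[1,2,3]},{"values":[42.5,42.5,42.5]}]}
EOF

# Pretty-print the second column's values (one per line) and count identical lines.
jq '.columns[1].values' /tmp/response.json | uniq -c
```

This prints four counted lines: the opening bracket, the repeated comma-terminated values collapsed into one line, the final comma-less value, and the closing bracket.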

Now doing the same query without the Percentile filter…

 user@opennms ~
 ❯ curl -X POST -H "Accept: application/json" \                                                                               
     -H "Content-Type: application/json" \
     -u foo:bar "http://127.0.0.1:8980/opennms/rest/measurements" \
     -d '{
           "start": 1550588428000,
           "end": 1550934028000,
           "step": 1,
           "maxrows": 0,
           "source": [
               {
                   "aggregation": "AVERAGE",
                   "attribute": "ifHCInOctets",
                   "label": "ifHCInOctets",
                   "resourceId": "nodeSource[internet-routers:edge1.mars].interfaceSnmp[TenGigE0_0_1_0-d46d50274570]",
                   "transient": "false"
               }
           ]
         }' | jq '.columns[0].values' | wc -l
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  190k    0  189k  100   507  1697k   4539 --:--:-- --:--:-- --:--:-- 1707k
5764

5764 values returned, just the single array with AVERAGE aggregation, counted with wc -l.
This time the query took less than 1 second!

What leads to such inefficiency that the Percentile filter on a list of ~6,000 numbers takes an incredible 44 seconds? This 24-core Xeon DL380 with a load average of 3 does the same calculation on 70,000 samples in 2 seconds with a few lines of Perl. My goal, of course, is not benchmarking but calculating the monthly 95th percentile of some hundreds of ports.
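As a workaround until the filter is fixed, the percentile can be computed outside OpenNMS. A minimal sketch of a nearest-rank 95th-percentile calculation in shell (the sample values and file path are made up for illustration; this is one common percentile definition, not necessarily the one the Percentile filter uses):

```shell
# Hypothetical input: one sample value per line (tiny made-up data set).
printf '%s\n' 10 20 30 40 50 60 70 80 90 100 > /tmp/samples.txt

# Nearest-rank method: sort ascending, take the ceil(0.95 * N)-th value.
sort -n /tmp/samples.txt | awk '
  { v[NR] = $0 }
  END {
    rank = int(NR * 0.95)
    if (rank < NR * 0.95) rank++   # round up to the next whole rank
    print "p95:", v[rank]
  }'
```

With one sample per minute, a month of data is roughly 43,000 values per port, which sort and awk handle in well under a second.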

Ideas about this behavior are much appreciated!

thanks

(running 23.0.2-1 on CentOS 7)


#2

Does someone at least know whether this query time for a result set of this size is the expected behavior, or whether something particular in my setup is just broken?


#3

So is this the expected or normally experienced performance? Is no one using performance data with a filter? Could someone run the same query, just for comparison?


#4

I was able to reproduce the problem and found the bottleneck: filters are terribly slow for any result set with more than 1000 values.

Details here: https://issues.opennms.org/browse/NMS-10589


#5

“Cool”, Jesse, thanks. Hopefully someone is able to fix it…


#6

Can someone suggest the easiest way to apply the code change?