30 second outage

What is it?

A poll consists of a connection to a particular port on a remote interface, and then a test to see if the service on that port returns an expected response. If the response is not received within the timeout and there are retries configured, it will be tried again. If the number of retries is exceeded, the service is considered down.
In some networks, however, short, intermittent failures are common. With the default downtime model, a failed service is polled again after 30 seconds. If that second poll succeeds, the service is recorded as having been down for about 30 seconds, which is known as a “30 second outage”.
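
As a rough illustration, the downtime model is defined per package in poller-configuration.xml. The fragment below is a sketch of what a default package commonly contains; the exact intervals are an assumption and may differ in your version, so check your own file. All values are in milliseconds.

    <!-- Sketch only: intervals assumed from a typical default configuration -->
    <!-- First 5 minutes of an outage: poll again every 30 seconds -->
    <downtime interval="30000" begin="0" end="300000"/>
    <!-- From 5 minutes to 12 hours: poll again every 5 minutes -->
    <downtime interval="300000" begin="300000" end="43200000"/>

The first entry is the one that sets the 30 second re-poll interval for a newly failed service.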

Note that this is a real problem: a user attempting to access that resource would also have experienced a timeout. But in some networks these 30 second outages can be annoying yet hard to correct.

Possible causes

  • Packet loss due to Ethernet duplex mismatch
  • Packet loss due to collisions
  • Packet loss due to other factors (wireless interference, congestion, intermittent link failure, etc.)
  • DNS name resolution
  • High CPU load on the device to be tested
  • Service is recycling
  • Routing problems
  • SSH key generation
  • SMTP greylisting or a similar feature that delays or rejects the first connection

Workarounds

Fixing the root of the problem

We’re serious. You should seriously look into fixing the root of the problem instead of implementing a workaround. See the list of possible causes above.

Increase timeout and retries

If you cannot fix the problem, you can increase the timeout, the number of retries, or both, so that more attempts are made before the service is considered down. See Common Configuration Parameters for details.
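
A minimal sketch of where these settings usually live, assuming a service definition in poller-configuration.xml that accepts timeout and retry parameters. The service name, polling interval, and values here are illustrative, not recommendations; the timeout is in milliseconds.

    <service name="HTTP" interval="300000" user-defined="false" status="on">
        <!-- Assumed value: up to 2 retries after the first attempt fails -->
        <parameter key="retry" value="2"/>
        <!-- Assumed value: wait up to 5 seconds for each attempt -->
        <parameter key="timeout" value="5000"/>
    </service>

Keep in mind that a single poll can then take roughly (retries + 1) × timeout before it fails, about 15 seconds with the values above, so very large values make each failing poll correspondingly slower.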

Initial delay in notifications

If 30 second outages persist in spite of your efforts to track down the root cause, you may set an initial time delay for notifications. This is done in the configuration file destinationPaths.xml. For example:

    <path name="Email-Admin" initial-delay="1m">
        <target>
            <name xmlns="">Admin</name>
            <command xmlns="">javaEmail</command>
        </target>
    </path>

This delays the notification by 1 minute. If the service is restored in less than 1 minute, the notification is cancelled.

Setting serviceUnresponsiveEnabled

serviceUnresponsiveEnabled in poller-configuration.xml

This option was added to distinguish a failed port connection from a missing or unexpected response. When it is enabled, a service that accepts the connection but does not return the expected response does not generate an outage, only a “service unresponsive” event. To enable this behavior, set the value to “true”. See Polling Configuration for details.
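
As a sketch, the flag is an attribute on the root element of poller-configuration.xml. The threads value shown is an assumption taken from a typical default file; leave any other attributes in your own file as they are.

    <!-- Only serviceUnresponsiveEnabled is changed; the threads value is illustrative -->
    <poller-configuration threads="30" serviceUnresponsiveEnabled="true">
        <!-- existing packages and monitors stay unchanged -->
    </poller-configuration>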