Outlier Detection
Outlier detection is a passive health check that tracks which instances in a cluster are up or down, using user-defined rules. If an instance is found to be unresponsive, the proxy ejects it from the cluster and diverts traffic to the remaining instances, preventing timeouts and disruptions throughout the mesh. After a specified amount of time the instance is returned to the cluster; the ejection time grows with each subsequent ejection.
There are three main methods of outlier detection, which are used to determine whether an instance should be ejected:
Consecutive 5xx Responses - the number of consecutive 5xx responses the service or workload returns before it is ejected. This is configured with "consecutive_5xx".
Consecutive Gateway Failure - the same as consecutive 5xx responses, but counting only 502, 503, and 504 status codes (i.e. not including internal server errors). The threshold is configured with "consecutive_gateway_failure", and how often it is enforced with "enforcing_consecutive_gateway_failure".
Success Rate - the percentage of successful requests over the reporting interval. Rather than a fixed number, Envoy calculates the ejection threshold dynamically as mean - (standard deviation * factor), where the mean and standard deviation are taken over the success rates of the hosts in the cluster. The factor is configured with "success_rate_stdev_factor".
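The three checks above can be sketched in a few lines of Python. This is an illustrative model rather than Envoy's implementation: the function names and sample data are hypothetical, the population standard deviation is an assumption, and the factor is passed as a plain multiplier (Envoy's documentation describes the configured success_rate_stdev_factor as being divided by 1000, so 1900 means 1.9).

```python
from statistics import mean, pstdev

# Status codes counted by the gateway-failure check (per the text above).
GATEWAY_CODES = {502, 503, 504}


def trailing_failures(statuses, gateway_only=False):
    """Count the unbroken run of failing responses at the end of `statuses`."""
    count = 0
    for status in reversed(statuses):
        if gateway_only:
            failed = status in GATEWAY_CODES
        else:
            failed = 500 <= status <= 599
        if not failed:
            break  # a non-failing response resets the consecutive count
        count += 1
    return count


def success_rate_threshold(rates, factor):
    """Ejection threshold: mean - (standard deviation * factor)."""
    return mean(rates) - pstdev(rates) * factor


def success_rate_outliers(rates_by_host, factor):
    """Return the hosts whose success rate falls below the threshold."""
    threshold = success_rate_threshold(list(rates_by_host.values()), factor)
    return sorted(h for h, rate in rates_by_host.items() if rate < threshold)


if __name__ == "__main__":
    recent = [200, 500, 503, 502]
    print(trailing_failures(recent))                     # 3
    print(trailing_failures(recent, gateway_only=True))  # 2

    rates = {"a": 100.0, "b": 100.0, "c": 100.0, "d": 60.0}
    print(success_rate_outliers(rates, factor=1.0))      # ['d']
    print(success_rate_outliers(rates, factor=1.9))      # []
```

With these sample rates the mean is 90 and the standard deviation is about 17.3, so a factor of 1.0 puts the threshold near 72.7 and host "d" is an outlier, while a factor of 1.9 lowers the threshold to about 57.1 and nothing is ejected.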
Envoy exposes a few additional settings to fine-tune ejection rules:
Enforcement Probability - each detection type has a related attribute (e.g. enforcing_consecutive_gateway_failure) which sets how likely an actual ejection is when an outlier is detected. This can be used to toggle a specific type of ejection or to experiment with different levels of enforcement for each detection type in a deployment.
Ejection Configuration - there are two additional settings for ejection:
base_ejection_time_msec: the base duration of an ejection. The actual ejection time is this value multiplied by the number of times the host has been ejected.
max_ejection_percent: the maximum percentage of instances in a cluster which can be ejected at a given time.
Success Rate - because success rate is based on a statistical measure, there are a few basic safety checks to ensure that there is enough data for an ejection to be meaningful:
success_rate_minimum_hosts: the minimum number of hosts in a cluster before success rate outlier detection is enabled.
success_rate_request_volume: the minimum number of requests in the reporting interval before success rate outlier detection is enabled.
success_rate_stdev_factor: how much weight the standard deviation carries in calculating the success rate threshold. Because the threshold is mean - (standard deviation * factor), a lower factor (0-100) keeps the threshold close to the mean, so instances whose success rate varies only a little from their sister instances will be ejected. A higher factor moves the threshold further below the mean, so success rates can vary a lot between instances before any of them is ejected. See for implementation and examples.
Reporting: outlier detection rules are evaluated every interval seconds. This is when hosts are ejected or returned to the cluster.
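As a rough illustration of how the ejection settings above interact, the sketch below models the ejection time, the ejection limit, and enforcement in Python. The function names are hypothetical, the enforcement attribute is assumed to be a percentage from 0 to 100, and the limit check is a simplified reading of max_ejection_percent; Envoy's own handling of edge cases may differ.

```python
import random


def ejection_duration_ms(base_ejection_time_msec, times_ejected):
    """Ejection time: the base time multiplied by the number of ejections so far."""
    return base_ejection_time_msec * times_ejected


def may_eject(cluster_size, already_ejected, max_ejection_percent):
    """True if ejecting one more host keeps the cluster within the limit.

    Simplified assumption: the limit is checked against the ejected
    share *after* the new ejection.
    """
    return (already_ejected + 1) * 100 <= cluster_size * max_ejection_percent


def is_enforced(enforcing_percent, rng=random.random):
    """Decide whether a detected outlier is actually ejected.

    Assumes the enforcing_* attribute is a percentage: 0 never ejects,
    100 always ejects. `rng` returns a float in [0, 1).
    """
    return rng() * 100 < enforcing_percent


if __name__ == "__main__":
    # Third ejection of a host with a 30 second base time.
    print(ejection_duration_ms(30_000, 3))   # 90000

    # A 10-host cluster limited to 10% may lose one host, not two.
    print(may_eject(10, 0, 10))              # True
    print(may_eject(10, 1, 10))              # False

    # Enforcement at 0 acts as an off switch for that detection type.
    print(is_enforced(0))                    # False
```

Setting an enforcement value between 0 and 100 lets a deployment trial a detection type on only a fraction of detected outliers before enabling it fully.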