# Outlier Detection

## Introduction

*Reference:* [*Envoy Outlier Detection*](https://www.envoyproxy.io/docs/envoy/v1.15.0/intro/arch_overview/upstream/outlier.html)

Outlier detection is a passive health check that tracks which instances in a cluster are up or down, using user-defined rules. If an instance is found to be down, the proxy ejects it from the cluster and diverts traffic to the remaining instances, preventing timeouts and disruptions throughout the mesh. After a specified amount of time the instance is returned to the cluster; however, the ejection time grows with each subsequent ejection.
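
As a rough illustration of what "passive" means here, the sketch below watches the status codes of ordinary responses from one instance and flags it once a streak of 5xx responses reaches a threshold (the consecutive 5xx rule described under Detection Types). The class and method names are hypothetical and this is not Envoy's implementation, only a minimal model of the idea.

```python
class ConsecutiveFailureTracker:
    """Hypothetical helper: counts consecutive 5xx responses for one instance."""

    def __init__(self, consecutive_5xx):
        # Number of back-to-back 5xx responses that triggers an ejection.
        self.threshold = consecutive_5xx
        self.streak = 0

    def record(self, status_code):
        """Record one response; return True when the instance should be ejected."""
        if 500 <= status_code <= 599:
            self.streak += 1
        else:
            # Any non-5xx response breaks the streak.
            self.streak = 0
        if self.streak >= self.threshold:
            self.streak = 0  # start counting afresh after an ejection
            return True
        return False


tracker = ConsecutiveFailureTracker(consecutive_5xx=3)
decisions = [tracker.record(code) for code in [500, 503, 200, 500, 502, 504]]
```

In this run the `200` response resets the streak, so only the last response (the third consecutive 5xx) produces an ejection decision.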

### Detection Types

*Reference:* [*Envoy: Outlier Detection Types*](https://www.envoyproxy.io/docs/envoy/v1.15.0/intro/arch_overview/upstream/outlier.html#detection-types)

There are three main outlier detection methods used to determine whether an instance should be ejected:

* *Consecutive 5xx Responses* - The number of consecutive 5xx responses the service or workload returns. This is configured with `"consecutive_5xx"`.
* *Consecutive Gateway Failure* - The same as consecutive 5xx responses, but counting only 502, 503, and 504 status codes (i.e. not including internal server errors). The threshold is configured with `"consecutive_gateway_failure"`, and the ejection is enforced according to `"enforcing_consecutive_gateway_failure"`.
* *Success Rate* - The percentage of successful requests over the reporting interval. Rather than a fixed number, Envoy uses `mean - (standard deviation * factor)` to calculate the ejection threshold dynamically, and instances whose success rate falls below that threshold are ejected. The factor is configured with `"success_rate_stdev_factor"`.
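
The success rate formula above can be sketched in a few lines of Python. This is a minimal illustration, not Envoy's code: the function names and sample data are hypothetical, the population standard deviation is an assumption, and the factor is passed as an already-scaled multiplier (Envoy's documentation describes the configured value as being divided by 1000, so `1900` corresponds to `1.9`).

```python
from statistics import mean, pstdev


def success_rate_threshold(rates, stdev_factor):
    """Return mean - (standard deviation * factor) for a list of success rates."""
    # Assumption: population standard deviation over all reporting instances.
    return mean(rates) - pstdev(rates) * stdev_factor


def find_outliers(rates_by_host, stdev_factor):
    """Return the hosts whose success rate falls below the dynamic threshold."""
    threshold = success_rate_threshold(list(rates_by_host.values()), stdev_factor)
    return sorted(host for host, rate in rates_by_host.items() if rate < threshold)


# Hypothetical success rates (percent) for five instances in one interval.
rates = {"a": 99.0, "b": 98.0, "c": 99.0, "d": 97.0, "e": 60.0}

strict = find_outliers(rates, stdev_factor=1.5)   # threshold closer to the mean
lenient = find_outliers(rates, stdev_factor=3.0)  # threshold further below the mean
```

With these sample values the smaller factor flags instance `e`, while the larger factor pushes the threshold low enough that nothing is flagged.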

### Additional Settings

Envoy exposes a few additional settings to fine-tune ejection rules:

* *Enforcement Probability* - each detection type has a related attribute (e.g. `enforcing_consecutive_gateway_failure`) which sets how likely an actual ejection is when an outlier is detected. This can be used to toggle a specific type of ejection on or off, or to experiment with different levels of enforcement for each detection type in a deployment.
* *Ejection Configuration* - there are two additional settings for ejection:
  * `base_ejection_time_msec`: the base duration of an ejection. The actual time a host stays ejected is this value multiplied by the number of times the host has been ejected.
  * `max_ejection_percent`: the maximum percentage of instances in a cluster which can be ejected at a given time.
* *Success Rate* - because success rate is based on a statistical measure, there are a few basic safety checks to ensure that the [p value](https://en.wikipedia.org/wiki/P-value) is low enough for an ejection to be meaningful:
  * `success_rate_minimum_hosts`: the minimum number of hosts in a cluster before success rate outlier detection is enabled.
  * `success_rate_request_volume`: the minimum number of requests in the reporting interval before success rate outlier detection is enabled.
  * `success_rate_stdev_factor`: how heavily the standard deviation weighs in the success rate threshold, `mean - (standard deviation * factor)`. In Envoy the configured value is divided by 1000, so `1900` means 1.9 standard deviations. A lower factor places the threshold closer to the mean, so instances varying only a little from sister instances will be ejected. A higher factor means that success rates can vary a lot between instances before any are ejected. See [source code](https://github.com/envoyproxy/envoy/blob/v1.10.0/source/common/upstream/outlier_detection_impl.cc#L396-L430) for implementation and examples.
* *Reporting* - outlier detection rules are evaluated every `interval` seconds. This is when a host will be ejected or returned to the cluster.
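
To make the interaction between these settings concrete, here is a minimal Python sketch of an ejection decision. It is a simplified model under stated assumptions rather than Envoy's actual logic: the function names are hypothetical, the enforcement attribute is treated as a percentage from 0 to 100, and the `max_ejection_percent` check is approximated as "the share of currently ejected hosts must be below the cap".

```python
import random


def ejection_duration_msec(base_ejection_time_msec, times_ejected):
    """Ejection time grows linearly with the number of ejections so far."""
    return base_ejection_time_msec * times_ejected


def should_eject(enforcing_percent, ejected_hosts, total_hosts,
                 max_ejection_percent, rng=random.random):
    """Decide whether a detected outlier is actually ejected.

    `rng` returns a float in [0, 1) and is injectable so the decision
    can be made deterministic in tests.
    """
    if total_hosts == 0:
        return False
    # Assumed cap rule: never eject once the ejected share reaches the maximum.
    ejected_percent = 100 * ejected_hosts / total_hosts
    if ejected_percent >= max_ejection_percent:
        return False
    # Enforcement probability: 0 never ejects, 100 always ejects.
    return rng() * 100 < enforcing_percent


# A host ejected for the third time with a 30 second base ejection time.
third_ejection = ejection_duration_msec(base_ejection_time_msec=30000, times_ejected=3)
```

Setting an enforcement attribute to `0` in this model disables that detection type without removing its configuration, which is how the toggle described above can be read.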