# Outlier Detection

## Introduction

*Reference:* [*Envoy Outlier Detection*](https://www.envoyproxy.io/docs/envoy/v1.10.0/intro/arch_overview/outlier.html?highlight=outlier#outlier-detection)

Outlier detection is a form of passive health checking that uses user-defined rules to track which instances in a cluster are up or down. If an instance is found to be down, the proxy ejects it from the cluster and diverts traffic to the remaining instances, preventing timeouts and disruptions throughout the mesh. After a specified amount of time, the instance is returned to the cluster; however, the ejection time grows with each subsequent ejection.
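To make the growing ejection time concrete, here is a minimal sketch. It assumes the linear rule described under `base_ejection_time_msec` later on this page (base time multiplied by the number of ejections). The function name `ejection_duration_ms` and the 30-second base are hypothetical and chosen only for illustration.

```python
def ejection_duration_ms(base_ejection_time_msec: int, times_ejected: int) -> int:
    """Return how long a host stays ejected, in milliseconds.

    Assumption: the duration is the base ejection time multiplied by the
    number of times the host has been ejected so far (including this one).
    """
    if times_ejected < 1:
        raise ValueError("times_ejected must be at least 1")
    return base_ejection_time_msec * times_ejected


if __name__ == "__main__":
    # Illustrative base of 30 seconds; each ejection lasts longer than the last.
    for n in range(1, 4):
        print(f"ejection #{n}: {ejection_duration_ms(30_000, n)} ms")
```

Under this assumption, a host that keeps failing is kept out of rotation for progressively longer periods, while a host that fails once returns quickly.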

### Detection Types

*Reference:* [*Envoy: Outlier Detection Types*](https://www.envoyproxy.io/docs/envoy/v1.10.0/api-v2/api/v2/cluster/outlier_detection.proto#envoy-api-msg-cluster-outlierdetection)

There are three main outlier detection methods used to determine whether an instance should be ejected from a cluster:

* *Consecutive 5xx Responses* - The number of consecutive 5xx responses an instance can return before it is considered an outlier. This is configured with `"consecutive_5xx"`.
* *Consecutive Gateway Failure* - The same as consecutive 5xx responses, but only 502, 503, and 504 status codes are counted (i.e. internal server errors are not included). This is configured with `"consecutive_gateway_failure"`.
* *Success Rate* - The percentage of successful requests over the reporting interval. Rather than a fixed number, Envoy uses `mean - (standard deviation * factor)` to calculate the ejection threshold dynamically. This is configured with `"success_rate_stdev_factor"`.
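The three detection types above can be sketched roughly as follows. This is an illustration under stated assumptions, not Envoy's implementation: the function names are hypothetical, the standard deviation is assumed to be the population standard deviation, and `stdev_factor` is treated as a plain multiplier (for example `1.9`).

```python
from statistics import mean, pstdev

GATEWAY_ERRORS = {502, 503, 504}


def consecutive_failures(status_codes, gateway_only=False):
    """Count the run of failing responses at the end of `status_codes`.

    With gateway_only=True only 502, 503 and 504 count as failures;
    otherwise any 5xx status does.
    """
    count = 0
    for code in reversed(status_codes):
        if gateway_only:
            failed = code in GATEWAY_ERRORS
        else:
            failed = 500 <= code <= 599
        if not failed:
            break
        count += 1
    return count


def success_rate_threshold(success_rates, stdev_factor):
    """Ejection threshold: mean - (standard deviation * factor)."""
    return mean(success_rates) - pstdev(success_rates) * stdev_factor


def success_rate_outliers(rates_by_host, stdev_factor):
    """Return the hosts whose success rate falls below the threshold."""
    threshold = success_rate_threshold(list(rates_by_host.values()), stdev_factor)
    return [host for host, rate in rates_by_host.items() if rate < threshold]


if __name__ == "__main__":
    recent = [200, 500, 503, 502]
    print(consecutive_failures(recent))                     # any 5xx
    print(consecutive_failures(recent, gateway_only=True))  # 502/503/504 only

    rates = {"a": 100, "b": 100, "c": 100, "d": 100, "e": 50}
    print(success_rate_outliers(rates, stdev_factor=1.9))
```

A host would be flagged by one of the consecutive checks once its count reaches the configured limit, and by the success rate check once its rate drops below the threshold computed across all hosts.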

### Additional Settings

Envoy exposes a few additional settings to fine-tune ejection rules:

* *Enforcement Probability* - each detection type has a related attribute (e.g. `enforcing_consecutive_gateway_failure`) which sets how likely an actual ejection is when an outlier is detected. This can be used to toggle a specific type of ejection or to experiment with different levels of each outlier detection type in a deployment.
* *Ejection Configuration* - there are two additional settings for ejection:
  * `base_ejection_time_msec`: the base amount of time, in milliseconds, that a host is ejected for. The actual ejection time is this value multiplied by the number of times the host has been ejected.
  * `max_ejection_percent`: the maximum percentage of instances in a cluster which can be ejected at a given time.
* *Success Rate* - because success rate is based on a statistical measure, there are a few basic safety checks to ensure that the [p value](https://en.wikipedia.org/wiki/P-value) is low enough for an ejection to be meaningful:
  * `success_rate_minimum_hosts`: the minimum number of hosts in a cluster before success rate outlier detection is enabled.
  * `success_rate_request_volume`: the minimum number of requests in the reporting interval before success rate outlier detection is enabled.
  * `success_rate_stdev_factor`: how heavily the standard deviation is weighted when calculating the success rate threshold, `mean - (standard deviation * factor)`. Envoy divides the configured value by 1000, so `1900` corresponds to a factor of 1.9. A higher factor lowers the threshold, so an instance's success rate must fall further below that of its sister instances before it is ejected. A lower factor raises the threshold, so instances that vary only a little from their sister instances will be ejected. See [source code](https://github.com/envoyproxy/envoy/blob/v1.10.0/source/common/upstream/outlier_detection_impl.cc#L396-L430) for implementation and examples.
* *Reporting* - outlier detection rules are evaluated every `interval` seconds. This is when hosts are ejected from or returned to the cluster.
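The guards described in this list can be sketched as below. This is a simplified illustration, not the proxy's actual logic: the helper names are hypothetical, the values in the dictionary are examples rather than recommendations, the enforcement attribute is assumed to be a percentage from 0 to 100, and the request volume check is assumed to apply per host. The real proxy may round or treat small clusters differently.

```python
import random

# Illustrative values only; keys are the attribute names used on this page.
outlier_detection = {
    "interval": 10,
    "base_ejection_time_msec": 30_000,
    "max_ejection_percent": 10,
    "enforcing_consecutive_gateway_failure": 0,
    "success_rate_minimum_hosts": 5,
    "success_rate_request_volume": 100,
}


def enforcement_allows(enforcing_percent, rng=random.random):
    """Decide whether a detected outlier is actually ejected.

    Assumption: 0 never ejects, 100 always ejects, and values in
    between eject with that probability.
    """
    return rng() * 100 < enforcing_percent


def within_ejection_budget(ejected_hosts, total_hosts, max_ejection_percent):
    """True if ejecting one more host keeps the cluster at or under the cap."""
    return (ejected_hosts + 1) * 100 <= total_hosts * max_ejection_percent


def success_rate_enabled(requests_per_host, minimum_hosts, request_volume):
    """True if enough hosts saw enough traffic in the reporting interval."""
    eligible = [n for n in requests_per_host if n >= request_volume]
    return len(eligible) >= minimum_hosts


if __name__ == "__main__":
    cfg = outlier_detection
    print(enforcement_allows(cfg["enforcing_consecutive_gateway_failure"]))
    print(within_ejection_budget(0, 10, cfg["max_ejection_percent"]))
    print(success_rate_enabled(
        [150, 120, 100, 99, 300, 40],
        cfg["success_rate_minimum_hosts"],
        cfg["success_rate_request_volume"],
    ))
```

Setting an enforcement attribute to 0 in this sketch corresponds to detecting outliers without ejecting them, which is one way to trial a detection type before turning it on.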

