Outlier Detection

Introduction

Reference: Envoy Outlier Detection

Outlier detection is a passive health check that uses user-defined rules to track which instances in a cluster are up or down. If an instance is found to be unresponsive, the proxy ejects it from the cluster and diverts traffic to the remaining instances, preventing timeouts and disruptions throughout the mesh. After a specified amount of time the instance is returned to the cluster; however, the ejection time grows with each subsequent ejection.

Detection Types

Reference: Envoy: Outlier Detection Types

There are three main methods of outlier detection, which are used to determine whether an instance should be ejected from a cluster:

  • Consecutive 5xx Responses - The number of consecutive 5xx responses the service or workload returns. This is configured with "consecutive_5xx".

  • Consecutive Gateway Failure - The same as consecutive 5xx responses, but counting only 502, 503, and 504 status codes (i.e. not including internal server errors). This is configured with "consecutive_gateway_failure"; the related attribute "enforcing_consecutive_gateway_failure" controls enforcement probability (see Additional Settings).

  • Success Rate - The percentage of successful requests over the reporting interval. Rather than using a fixed number, Envoy calculates the ejection threshold dynamically as mean - (standard deviation * factor), and a host whose success rate falls below that threshold is treated as an outlier. The factor is configured with "success_rate_stdev_factor".
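The success rate threshold described above can be sketched in a few lines of Python. This is an illustrative sketch, not Envoy's implementation: the function names and sample data are hypothetical, the factor is passed as a plain multiplier, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def success_rate_threshold(success_rates, factor):
    """Return mean - (standard deviation * factor) for per-host success rates.

    Assumption: population standard deviation; `factor` is a plain multiplier.
    """
    return mean(success_rates) - pstdev(success_rates) * factor


def success_rate_outliers(success_rates_by_host, factor):
    """Return the sorted names of hosts whose success rate is below the threshold."""
    threshold = success_rate_threshold(list(success_rates_by_host.values()), factor)
    return sorted(
        host for host, rate in success_rates_by_host.items() if rate < threshold
    )


# Hypothetical success rates (percent) for five hosts in one reporting interval.
rates = {"a": 99.0, "b": 98.0, "c": 100.0, "d": 60.0, "e": 99.0}

# A factor of 1.9 puts the threshold just above host "d", so "d" is flagged.
print(success_rate_outliers(rates, 1.9))

# A larger factor lowers the threshold, so no host is flagged.
print(success_rate_outliers(rates, 3.0))
```

Because the threshold is derived from the hosts' own statistics, the same absolute success rate may or may not count as an outlier depending on how the rest of the cluster is behaving.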

Additional Settings

Envoy exposes a few additional settings to fine tune ejection rules:

  • Enforcement Probability - each detection type has a related attribute (e.g. enforcing_consecutive_gateway_failure) which sets how likely an actual ejection is when an outlier is detected. This can be used to toggle a specific type of ejection on or off, or to experiment with different levels of enforcement for each detection type in a deployment.

  • Ejection Configuration - there are two additional settings for ejection:

    • base_ejection_time_msec: the base length of time, in milliseconds, that a host is ejected for. The actual ejection time is this value multiplied by the number of times the host has been ejected.

    • max_ejection_percent: the maximum percentage of instances in a cluster which can be ejected at a given time.

  • Success Rate - because success rate detection is based on a statistical measure, there are a few basic safety checks to ensure that there is enough data for an ejection to be meaningful:

    • success_rate_minimum_hosts: the minimum number of hosts a cluster must have before success rate outlier detection is enabled.

    • success_rate_request_volume: the minimum number of requests in the reporting interval before success rate outlier detection is enabled.

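    • success_rate_stdev_factor: how heavily the standard deviation is weighted when calculating the success rate threshold. Because the threshold is mean - (standard deviation * factor), a higher factor lowers the threshold, so success rates can vary a lot between instances before any of them is ejected. A lower factor raises the threshold, so instances varying only a little from sister instances will be ejected. In Envoy, the configured value is divided by 1000 to obtain the factor (for example, 1900 means a factor of 1.9). See the Envoy source code for implementation and examples.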

  • Reporting - outlier detection rules are evaluated once every "interval" seconds. This is when a host is ejected from or returned to the cluster.
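To make the ejection settings above concrete, the following Python sketch models how base_ejection_time_msec, max_ejection_percent, and an enforcement percentage could interact. It is a simplified model under stated assumptions, not Envoy's code: the function names are hypothetical, the enforcement attribute is assumed to be a percentage from 0 to 100, and the cap is modeled as "ejecting one more host must not push the ejected share above max_ejection_percent".

```python
import random


def ejection_duration_ms(base_ejection_time_msec, times_ejected):
    """Ejection time grows linearly with the number of times the host was ejected."""
    return base_ejection_time_msec * times_ejected


def should_eject(enforcing_percent, ejected_count, cluster_size,
                 max_ejection_percent, rng=random.random):
    """Decide whether a detected outlier is actually ejected (simplified model)."""
    if cluster_size <= 0:
        return False
    # Assumption: refuse the ejection if it would exceed max_ejection_percent.
    if (ejected_count + 1) * 100 > max_ejection_percent * cluster_size:
        return False
    # Enforcement probability: 0 never ejects, 100 always ejects.
    return rng() * 100 < enforcing_percent


# With a hypothetical base of 30000 ms, the third ejection lasts three times as long.
print(ejection_duration_ms(30000, 1))
print(ejection_duration_ms(30000, 3))

# 10 hosts with max_ejection_percent = 10: only one host may be ejected at a time.
print(should_eject(100, 0, 10, 10))
print(should_eject(100, 1, 10, 10))
```

Setting an enforcement percentage to 0 in this model leaves detection in place but never ejects a host, which is the "toggle" use described above.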
