Health Checks

Reference: Envoy Health Checking

Grey Matter supports active health checks on upstream clusters. Health checking is configured per cluster object in the health_checks field, and Envoy uses the results to determine whether or not to route to the cluster. Grey Matter offers two types of health checking: HTTP and TCP.

Configuration

Health checking in Grey Matter is set through the health_checks field of the cluster object. This field takes a list of health check objects. A cluster with a health check enabled should look like the object below:

{
  "cluster_key": "example-cluster",
  ...
  "health_checks": [
    {
      "timeout_msec": 2000,
      "interval_msec": 10000,
      "unhealthy_threshold": 6,
      "healthy_threshold": 1,
      "health_checker": {
        "http_health_check": {
          "path": "/health"
        }
      }
    }
  ],
  ...
}

Note: The following fields are required: timeout_msec, interval_msec, health_checker.
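
The required-field rules on this page can be checked before a cluster object is submitted. The sketch below is illustrative only: `validate_health_check` is a hypothetical helper, not part of Grey Matter, and it assumes the millisecond values are plain integers.

```python
def validate_health_check(check: dict) -> list[str]:
    """Return a list of problems found in one health check object (empty if none)."""
    problems = []

    # timeout_msec and interval_msec are required and must be greater than 0.
    # Assumption: the values are integers (booleans are rejected explicitly).
    for field in ("timeout_msec", "interval_msec"):
        value = check.get(field)
        if not isinstance(value, int) or isinstance(value, bool) or value <= 0:
            problems.append(f"{field} is required and must be an integer greater than 0")

    # health_checker is required and must set exactly one checker type.
    checker = check.get("health_checker")
    if not isinstance(checker, dict):
        problems.append("health_checker is required and must be an object")
    else:
        set_types = [k for k in ("http_health_check", "tcp_health_check") if k in checker]
        if len(set_types) != 1:
            problems.append(
                "health_checker must set exactly one of http_health_check or tcp_health_check"
            )
    return problems


example = {
    "timeout_msec": 2000,
    "interval_msec": 10000,
    "unhealthy_threshold": 6,
    "healthy_threshold": 1,
    "health_checker": {"http_health_check": {"path": "/health"}},
}
print(validate_health_check(example))
```

An empty list means the object satisfies the rules stated here; it does not guarantee that the control plane will accept it.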

Fields

`timeout_msec`

The time in milliseconds to wait for a health check response. If the timeout is reached without a response, the health check attempt will be considered a failure. This value is required and must be greater than 0.

`interval_msec`

The time in milliseconds between health checks after the first health check. The first round of health checks occurs during startup, before any traffic is routed to a cluster, so the first interval between health checks will be the value of no_traffic_interval_msec. This value is required and must be greater than 0.

`interval_jitter_msec`

An optional jitter amount that is added to each interval value calculated by the proxy. Defaults to 0.

`unhealthy_threshold`

The number of failed health checks required before a host is marked as unhealthy. Note that for HTTP health checking, if a host responds with a 503 status, this value is ignored and the host is considered unhealthy immediately.

`healthy_threshold`

The number of successful health checks required before a host is marked healthy. During startup, only a single successful health check is required to mark a host healthy.

`reuse_connection`

A boolean value indicating whether or not to reuse a health check connection between health checks. Defaults to true.

`no_traffic_interval_msec`

When a cluster has never had traffic routed to it (i.e., on startup), this is the interval used for health checking instead of interval_msec. Once the cluster has been used for traffic routing, the interval will shift to the interval_msec value. This should be a longer interval, which allows cluster info to be checked without sending large amounts of unnecessary active health checking traffic. Defaults to 60 seconds (60000 milliseconds).

`unhealthy_interval_msec`

When a host is marked as unhealthy, this is the interval used for health checking instead of interval_msec. As soon as the host is marked as healthy, the interval will shift back to the interval_msec value. Defaults to the value of interval_msec.

`unhealthy_edge_interval_msec`

The health check interval used for the first health check immediately after a host is marked as unhealthy. After this initial health check, the interval will shift to unhealthy_interval_msec. Defaults to the value of unhealthy_interval_msec.

`healthy_edge_interval_msec`

The health check interval used for the first health check immediately after a host is marked as healthy. After this initial health check, the interval will shift back to the standard interval_msec. Defaults to the value of interval_msec.
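
The interval fields and their defaults can be summarized as one selection function. The sketch below follows the descriptions on this page and leaves jitter out. The function name, its parameters, and the precedence of the no-traffic case over the other cases are assumptions made for illustration.

```python
def next_interval_msec(
    check: dict,
    *,
    cluster_has_had_traffic: bool,
    host_healthy: bool,
    state_just_changed: bool,
) -> int:
    """Pick the wait in milliseconds before the next health check (jitter excluded)."""
    interval = check["interval_msec"]

    # A cluster that has never received traffic uses the no-traffic interval.
    # Assumption: this case takes precedence over the cases below.
    if not cluster_has_had_traffic:
        return check.get("no_traffic_interval_msec", 60_000)

    # unhealthy_interval_msec defaults to interval_msec.
    unhealthy_interval = check.get("unhealthy_interval_msec", interval)

    if not host_healthy:
        if state_just_changed:
            # First check right after the host was marked unhealthy.
            return check.get("unhealthy_edge_interval_msec", unhealthy_interval)
        return unhealthy_interval

    if state_just_changed:
        # First check right after the host was marked healthy.
        return check.get("healthy_edge_interval_msec", interval)
    return interval


check = {"interval_msec": 10000, "unhealthy_interval_msec": 3000}
print(next_interval_msec(check, cluster_has_had_traffic=True,
                         host_healthy=False, state_just_changed=True))
```

Because unhealthy_edge_interval_msec is not set in this example, the call falls back to unhealthy_interval_msec.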

`health_checker`

An object that defines the type of health checking to use. This object is required, and exactly one of the following fields must be set.

Fields:

http_health_check

Configures the HTTP health check endpoint for each instance in a cluster.

Fields:

host
- the value of the host header in the HTTP health check request
- defaults to an empty string
- if empty, the name of the cluster being health checked will be used.
path
- the HTTP path that will be requested during health checking
- this value is required and cannot be an empty string
service_name
- an optional value which is compared to the X-Envoy-Upstream-Healthchecked-Cluster header to validate the identity of the health checked cluster
request_headers_to_add
- a list of HTTP headers that should be added to each health check request that is sent to the cluster

tcp_health_check

Configures the TCP health check endpoint for each instance in a cluster.

Fields:

send
- a base64 encoded string representing an array of bytes to be sent in health check requests
- if empty, implies a connect-only health check
receive
- an array of base64 encoded strings, each representing an array of bytes that is expected in health check responses
- a "fuzzy" match is performed when checking the response, such that each binary block must be found, in the order specified, but not necessarily contiguously
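
The base64 encoding of send and the ordered, non-contiguous matching of receive can be sketched as follows. This is a rough model of the matching rule described above, applied to a complete response buffer, and not Envoy's implementation. `encode_payload`, `fuzzy_match`, and the PING/PONG payloads are hypothetical.

```python
import base64


def encode_payload(raw: bytes) -> str:
    """Base64-encode raw bytes for use in the send or receive fields."""
    return base64.b64encode(raw).decode("ascii")


def fuzzy_match(response: bytes, expected_blocks: list[str]) -> bool:
    """Return True if every base64-encoded block occurs in the response, in order.

    Blocks do not need to be contiguous, but each one must start after the
    end of the previous match.
    """
    position = 0
    for encoded in expected_blocks:
        block = base64.b64decode(encoded)
        found = response.find(block, position)
        if found == -1:
            return False
        position = found + len(block)
    return True


tcp_health_check = {
    "send": encode_payload(b"PING"),
    "receive": [encode_payload(b"PONG"), encode_payload(b"OK")],
}
print(fuzzy_match(b"..PONG..OK..", tcp_health_check["receive"]))
```

Reversing the order of the receive entries would make the same response fail, since each block is searched for only after the previous match.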

Stats

If health checking is enabled on a cluster, a series of health check statistics will be reported in its /stats endpoint, and will look like the following:

cluster.service.health_check.attempt: 20998
cluster.service.health_check.degraded: 0
cluster.service.health_check.failure: 10583
cluster.service.health_check.healthy: 1
cluster.service.health_check.network_failure: 10583
cluster.service.health_check.passive_failure: 0
cluster.service.health_check.success: 10415
cluster.service.health_check.verify_cluster: 0

If the Envoy log level in the Sidecar is set to debug, the logs will also show health checking activity.
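
The counters in the stats output above can be turned into a quick success ratio. The sketch below parses text in the name: value form shown on this page. It assumes the text contains the health check stats of a single cluster, and `parse_health_check_stats` is a hypothetical helper.

```python
def parse_health_check_stats(stats_text: str) -> dict[str, int]:
    """Collect health_check counters, keyed by the last segment of the stat name."""
    stats = {}
    for line in stats_text.splitlines():
        name, sep, value = line.partition(":")
        # Skip lines without a separator and stats unrelated to health checking.
        if sep and ".health_check." in name:
            stats[name.strip().rsplit(".", 1)[-1]] = int(value.strip())
    return stats


sample = """cluster.service.health_check.attempt: 20998
cluster.service.health_check.failure: 10583
cluster.service.health_check.network_failure: 10583
cluster.service.health_check.success: 10415
"""

stats = parse_health_check_stats(sample)
success_ratio = stats["success"] / stats["attempt"]
print(f"{success_ratio:.1%} of health check attempts succeeded")
```

In the sample, failure equals network_failure, which suggests that every failed attempt was a connection-level failure rather than an unhealthy response.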

Health Check Results

When active health checking is configured on a cluster, Envoy will route to the target instance if the health check is successful, and will not route to the instance while the health check is failing.

Envoy combines the active health check results with the service discovery status to make other routing decisions. More details can be found in the service discovery docs.
