Health Checks

Reference: Envoy Health Checking

Grey Matter supports configuring active health checks on an upstream cluster. Health checking is configured per cluster object in the health_checks field, and Envoy uses the results to determine whether or not to route to the cluster's instances. Grey Matter offers two types of health checking: HTTP and TCP.

Configuration

Health checking in Grey Matter is set through the cluster object's health_checks field, which takes a list of health check objects. A cluster with a health check enabled looks like the object below:

{
  "cluster_key": "example-cluster",
  ...
  "health_checks": [
    {
      "timeout_msec": 2000,
      "interval_msec": 10000,
      "unhealthy_threshold": 6,
      "healthy_threshold": 1,
      "health_checker": {
        "http_health_check": {
          "path": "/health"
        }
      }
    }
  ],
  ...
}

Note: The following fields are required: timeout_msec, interval_msec, health_checker.
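The required-field rule above, together with the constraints stated in the field descriptions on this page, can be checked before a cluster object is submitted. The following is a minimal sketch, not part of Grey Matter: the function name validate_health_check and its error messages are hypothetical, and it only covers the rules this page states explicitly.

```python
REQUIRED_FIELDS = ("timeout_msec", "interval_msec", "health_checker")
CHECKER_TYPES = ("http_health_check", "tcp_health_check")


def validate_health_check(check: dict) -> list:
    """Return a list of problems found in one health check object (empty if none).

    Hypothetical helper: it mirrors the rules documented on this page only.
    """
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in check:
            problems.append(f"missing required field: {field}")
    # Both timing values must be greater than 0 when present.
    for field in ("timeout_msec", "interval_msec"):
        value = check.get(field)
        if value is not None and value <= 0:
            problems.append(f"{field} must be greater than 0")
    checker = check.get("health_checker")
    if checker is not None:
        # Exactly one checker type may be set inside health_checker.
        configured = [t for t in CHECKER_TYPES if t in checker]
        if len(configured) != 1:
            problems.append(
                "health_checker must set exactly one of: " + ", ".join(CHECKER_TYPES)
            )
        # An HTTP health check requires a non-empty path.
        http = checker.get("http_health_check")
        if http is not None and not http.get("path"):
            problems.append("http_health_check.path must be a non-empty string")
    return problems


example = {
    "timeout_msec": 2000,
    "interval_msec": 10000,
    "unhealthy_threshold": 6,
    "healthy_threshold": 1,
    "health_checker": {"http_health_check": {"path": "/health"}},
}
print(validate_health_check(example))
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a health check object at once.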

Fields

timeout_msec

The time in milliseconds to wait for a health check response. If the timeout is reached without a response, the health check attempt will be considered a failure. This value is required and must be greater than 0.

interval_msec

The time in milliseconds between health checks after the first health check. The first round of health checks occurs during startup, before any traffic is routed to the cluster, so the first interval used is the value of no_traffic_interval_msec. This value is required and must be greater than 0.

interval_jitter_msec

An optional jitter amount, in milliseconds, that is added to each interval value calculated by the proxy. Defaults to 0.

unhealthy_threshold

The number of failed health checks required before a host is marked as unhealthy. Note that for HTTP health checking, if a host responds with a 503 status, this value is ignored and the host is considered unhealthy immediately.

healthy_threshold

The number of successful health checks required before a host is marked healthy. During startup, only a single successful health check is required to mark a host healthy.
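The interaction between unhealthy_threshold and healthy_threshold can be sketched as a small state machine. This is a simplified model, not Envoy's implementation: the class HostHealth is hypothetical, it assumes the thresholds count consecutive results, and it treats "startup" as the very first check only.

```python
from dataclasses import dataclass


@dataclass
class HostHealth:
    """Simplified, hypothetical model of threshold-based health marking."""

    unhealthy_threshold: int
    healthy_threshold: int
    healthy: bool = False          # a host is not marked healthy until it passes a check
    first_check_done: bool = False
    failures: int = 0              # consecutive failures (assumption)
    successes: int = 0             # consecutive successes (assumption)

    def record(self, success, http_status=None):
        """Record one health check result and return the host's health state."""
        startup = not self.first_check_done
        self.first_check_done = True
        if success:
            self.failures = 0
            self.successes += 1
            # During startup, a single successful check marks the host healthy.
            if startup or self.successes >= self.healthy_threshold:
                self.healthy = True
        else:
            self.successes = 0
            self.failures += 1
            # An HTTP 503 bypasses unhealthy_threshold entirely.
            if http_status == 503 or self.failures >= self.unhealthy_threshold:
                self.healthy = False
        return self.healthy


host = HostHealth(unhealthy_threshold=3, healthy_threshold=2)
for outcome in (True, False, False, False, True, True):
    print(outcome, "->", host.record(outcome))
```

In this run the host becomes healthy on its first success, survives two failures, is marked unhealthy on the third, and needs two successes to recover.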

reuse_connection

A boolean value indicating whether or not to reuse a health check connection between health checks. Defaults to true.

no_traffic_interval_msec

When a cluster has never had traffic routed to it (i.e., on startup), this interval is used for health checking instead of interval_msec. Once the cluster has been used for traffic routing, the interval shifts to the interval_msec value. This should be a longer interval, so that cluster info can be checked without sending large amounts of unnecessary active health checking traffic. Defaults to 60s.

unhealthy_interval_msec

When a host is marked as unhealthy, this interval is used for health checking instead of interval_msec. As soon as the host is marked as healthy, the interval shifts back to the interval_msec value. Defaults to the value of interval_msec.

unhealthy_edge_interval_msec

The health check interval used for the first health check immediately after a host is marked as unhealthy. After this initial health check, the interval will shift to unhealthy_interval_msec. Defaults to the value of unhealthy_interval_msec.

healthy_edge_interval_msec

The health check interval used for the first health check immediately after a host is marked as healthy. After this initial health check, the interval shifts back to the standard interval_msec. Defaults to the value of interval_msec.
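The way the interval fields and their defaults fit together can be sketched as a single lookup. This is an illustration under stated assumptions, not Envoy's scheduling code: the function next_interval_msec and its keyword arguments are hypothetical, the precedence order (no traffic first, then host state) is assumed, and jitter is left out.

```python
DEFAULT_NO_TRAFFIC_INTERVAL_MSEC = 60_000  # documented default of 60s


def next_interval_msec(check, *, has_had_traffic, host_healthy, state_just_changed):
    """Pick the interval before the next health check (hypothetical helper).

    check              -- one health check object, as a dict
    has_had_traffic    -- whether the cluster has ever had traffic routed to it
    host_healthy       -- the host's current health state
    state_just_changed -- whether the last check flipped the health state
    """
    interval = check["interval_msec"]
    # Apply the default chain described in the field reference.
    no_traffic = check.get("no_traffic_interval_msec", DEFAULT_NO_TRAFFIC_INTERVAL_MSEC)
    unhealthy = check.get("unhealthy_interval_msec", interval)
    unhealthy_edge = check.get("unhealthy_edge_interval_msec", unhealthy)
    healthy_edge = check.get("healthy_edge_interval_msec", interval)

    if not has_had_traffic:
        return no_traffic
    if not host_healthy:
        # The edge interval applies only to the first check after the transition.
        return unhealthy_edge if state_just_changed else unhealthy
    return healthy_edge if state_just_changed else interval


config = {"interval_msec": 10000, "unhealthy_interval_msec": 3000}
print(next_interval_msec(config, has_had_traffic=True, host_healthy=False,
                         state_just_changed=True))
```

Because unhealthy_edge_interval_msec is not set in this example, it falls back to unhealthy_interval_msec, so the first check after the host turns unhealthy is scheduled 3000 ms later.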

health_checker

An object that defines the type of health checking to use. This object is required, and exactly one of the following fields must be set.

Fields:

http_health_check

Configures the HTTP health check endpoint for each instance in a cluster.

Fields:

  • host

    • the value of the host header in the HTTP health check request

    • defaults to an empty string

    • if empty, the name of the cluster being health checked will be used.

  • path

    • the HTTP path that will be requested during health checking

    • this value is required and cannot be an empty string

  • service_name

    • an optional value which is compared to the X-Envoy-Upstream-Healthchecked-Cluster header to validate the identity of the health checked cluster

  • request_headers_to_add

    • a list of HTTP headers that should be added to each health check request that is sent to the cluster

tcp_health_check

Configures the TCP health check endpoint for each instance in a cluster.

Fields:

  • send

    • a base64 encoded string representing an array of bytes to be sent in health check requests

    • if empty, implies a connect-only health check

  • receive

    • an array of base64 encoded strings, each representing an array of bytes that is expected in health check responses

    • "fuzzy" matching is performed when checking the response, such that each binary block must be found, in the order specified, but not necessarily contiguously
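Because send and receive hold base64 encoded bytes, it can help to generate them programmatically and to see what the "fuzzy" match accepts. The sketch below is illustrative only: the PING, PO and NG payloads are hypothetical, and response_matches is a simplified reading of the matching rule described above, not Envoy's code.

```python
import base64


def encode_payload(raw: bytes) -> str:
    """Base64 encode raw bytes for use in the send or receive fields."""
    return base64.b64encode(raw).decode("ascii")


def response_matches(response: bytes, receive: list) -> bool:
    """Check that every expected block appears in order, gaps allowed."""
    position = 0
    for encoded in receive:
        block = base64.b64decode(encoded)
        found = response.find(block, position)
        if found == -1:
            return False
        # Continue searching after the block that was just matched.
        position = found + len(block)
    return True


# Hypothetical payloads, chosen only for illustration.
tcp_health_check = {
    "send": encode_payload(b"PING"),
    "receive": [encode_payload(b"PO"), encode_payload(b"NG")],
}
print(tcp_health_check)
print(response_matches(b"xxPO--NGyy", tcp_health_check["receive"]))
print(response_matches(b"NGPO", tcp_health_check["receive"]))
```

The first response matches because both blocks appear in order with other bytes between them. The second does not, because the blocks appear in the wrong order.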

Stats

If health checking is enabled on a cluster, a series of health check statistics is reported at its /stats endpoint, similar to the following:

cluster.service.health_check.attempt: 20998
cluster.service.health_check.degraded: 0
cluster.service.health_check.failure: 10583
cluster.service.health_check.healthy: 1
cluster.service.health_check.network_failure: 10583
cluster.service.health_check.passive_failure: 0
cluster.service.health_check.success: 10415
cluster.service.health_check.verify_cluster: 0

If the Envoy log level in the Sidecar is set to debug, the logs will also show when health checks are being performed.
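The statistics above can be turned into a quick success ratio. This is a minimal sketch that assumes the stats text has already been fetched from the /stats endpoint and uses the "name: value" line format shown above; parse_health_check_stats is a hypothetical helper, and the sample values are copied from the example output.

```python
def parse_health_check_stats(text):
    """Map each health_check stat's last name segment to its integer value."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(": ")
        # Skip lines that are not integer-valued health check stats.
        if sep and ".health_check." in name and value.strip().isdigit():
            stats[name.rsplit(".", 1)[1]] = int(value)
    return stats


sample = """\
cluster.service.health_check.attempt: 20998
cluster.service.health_check.degraded: 0
cluster.service.health_check.failure: 10583
cluster.service.health_check.healthy: 1
cluster.service.health_check.network_failure: 10583
cluster.service.health_check.passive_failure: 0
cluster.service.health_check.success: 10415
cluster.service.health_check.verify_cluster: 0
"""

stats = parse_health_check_stats(sample)
success_ratio = stats["success"] / stats["attempt"]
print(f"{success_ratio:.1%} of health checks succeeded")
```

In the sample, every failure is also counted as a network_failure, which suggests the checks are failing at the connection level rather than on the response content.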

Health Check Results

When active health checking is configured on a cluster, Envoy will route to the target instance if the health check is successful, and will not route to the instance while the health check is failing.

Envoy combines the active health check results with the service discovery status to make other routing decisions. More details can be found in the service discovery docs.
