# Health Checks

*Reference:* [*Envoy Health Checking*](https://www.envoyproxy.io/docs/envoy/v1.15.0/intro/arch_overview/upstream/health_checking)

Grey Matter supports active health checks on upstream clusters. Health checking is configured per [cluster](https://greymatter.gitbook.io/grey-matter-documentation/1.3/reference/api/fabric-api/cluster) object in the `health_checks` field, and Envoy uses the results to determine whether or not to route to the cluster. Grey Matter offers two types of health checking: HTTP and TCP.

## Configuration

Health checking in Grey Matter is set through the [cluster object](https://greymatter.gitbook.io/grey-matter-documentation/1.3/reference/api/fabric-api/cluster/health-check) `health_checks` field, which takes a list of health check objects. A cluster with a health check enabled looks like the object below:

```javascript
{
  "cluster_key": "example-cluster",
  ...
  "health_checks": [
    {
      "timeout_msec": 2000,
      "interval_msec": 10000,
      "unhealthy_threshold": 6,
      "healthy_threshold": 1,
      "health_checker": {
        "http_health_check": {
          "path": "/health"
        }
      }
    }
  ],
  ...
}
```

> Note: The following fields are **required**: [`timeout_msec`](#timeoutmsec), [`interval_msec`](#intervalmsec), [`health_checker`](#healthchecker).

### Fields

#### `timeout_msec`

The time in milliseconds to wait for a health check response. If the timeout is reached without a response, the health check attempt will be considered a failure. This value is **required** and **must** be greater than 0.

#### `interval_msec`

The interval in milliseconds between health checks after the first health check. The first round of health checks occurs during startup, before any traffic is routed to the cluster, so the first interval is the value of [`no_traffic_interval_msec`](#notrafficintervalmsec). This value is **required** and **must** be greater than 0.

#### `interval_jitter_msec`

An optional jitter amount in milliseconds that is added to each interval value calculated by the proxy. Defaults to 0.

#### `unhealthy_threshold`

The number of failed health checks required before a host is marked as unhealthy. *Note* that for HTTP health checking, if a host responds with a 503 status, this value is ignored and the host is considered unhealthy immediately.

#### `healthy_threshold`

The number of successful health checks required before a host is marked healthy. During startup, only a single successful health check is required to mark a host healthy.
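The two thresholds can be read as a small per-host state machine. The following is a minimal sketch of that reading, not Envoy's implementation. It assumes the thresholds count consecutive results and that "startup" lasts until a host's first successful check; `createHostTracker` and the shape of the `result` argument are hypothetical names introduced for illustration.

```javascript
// Sketch only: tracks one host's health from a stream of check results.
// Assumptions: thresholds count consecutive results, and startup ends at
// the first successful check. `createHostTracker` is a hypothetical helper.
function createHostTracker({ unhealthy_threshold, healthy_threshold }) {
  let healthy = false; // a host is not routable until a check succeeds
  let startup = true;  // true until the first successful check
  let successes = 0;   // consecutive successes
  let failures = 0;    // consecutive failures

  return {
    // result: { ok: boolean, status?: number }
    record(result) {
      if (result.ok) {
        successes += 1;
        failures = 0;
        // During startup a single success is enough to mark the host healthy.
        if (startup || successes >= healthy_threshold) {
          healthy = true;
          startup = false;
        }
      } else {
        failures += 1;
        successes = 0;
        // An HTTP 503 bypasses unhealthy_threshold entirely.
        if (result.status === 503 || failures >= unhealthy_threshold) {
          healthy = false;
        }
      }
      return healthy;
    },
  };
}
```

With `unhealthy_threshold: 6`, a host that has been marked healthy stays healthy through five failed checks in a row and is marked unhealthy on the sixth, unless one of those failures is a 503.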

#### `reuse_connection`

A boolean value indicating whether or not to reuse a health check connection between health checks. Defaults to `true`.

#### `no_traffic_interval_msec`

When a cluster has never had traffic routed to it (i.e., on startup), this is the interval used for health checking instead of [`interval_msec`](#intervalmsec). Once the cluster has been used for traffic routing, the interval shifts to the `interval_msec` value. This should be a longer interval, which allows cluster info to be checked without sending large amounts of unnecessary active health checking traffic. Defaults to 60 seconds.

#### `unhealthy_interval_msec`

When a host is marked as unhealthy, this is the interval used for health checking instead of [`interval_msec`](#intervalmsec). As soon as the host is marked as healthy, the interval shifts back to the `interval_msec` value. Defaults to the value of `interval_msec`.

#### `unhealthy_edge_interval_msec`

The health check interval used for the first health check immediately after a host is marked as unhealthy. After this initial health check, the interval will shift to [`unhealthy_interval_msec`](#unhealthyintervalmsec). Defaults to the value of `unhealthy_interval_msec`.

#### `healthy_edge_interval_msec`

The health check interval used for the first health check immediately after a host is marked as healthy. After this initial health check, the interval will shift back to the standard [`interval_msec`](#intervalmsec). Defaults to the value of `interval_msec`.
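The interval fields above interact through their defaults, so it can help to see the fallback chain in one place. The sketch below only restates the descriptions in this section and is not how Envoy computes its timers. The `state` object, the function name `nextIntervalMsec`, and the choice to evaluate the no-traffic case first are assumptions made for illustration; jitter is left out.

```javascript
// The documented default for no_traffic_interval_msec (60s), in milliseconds.
const DEFAULT_NO_TRAFFIC_INTERVAL_MSEC = 60000;

// Sketch only. `state` is a hypothetical shape:
//   hasHadTraffic: the cluster has been used for routing at least once
//   healthy:       the host's current health status
//   justChanged:   the status changed on the previous check (an "edge")
function nextIntervalMsec(check, state) {
  const interval = check.interval_msec;
  // unhealthy_interval_msec defaults to interval_msec.
  const unhealthyInterval = check.unhealthy_interval_msec ?? interval;

  // Assumption: the no-traffic interval takes precedence over the others.
  if (!state.hasHadTraffic) {
    return check.no_traffic_interval_msec ?? DEFAULT_NO_TRAFFIC_INTERVAL_MSEC;
  }
  if (!state.healthy) {
    // The edge interval applies once, then unhealthy_interval_msec takes over.
    return state.justChanged
      ? (check.unhealthy_edge_interval_msec ?? unhealthyInterval)
      : unhealthyInterval;
  }
  // The healthy edge interval applies once, then interval_msec takes over.
  return state.justChanged
    ? (check.healthy_edge_interval_msec ?? interval)
    : interval;
}
```

For example, with only `interval_msec: 10000` and `unhealthy_interval_msec: 3000` set, a host that has just been marked unhealthy is checked again after 3000 ms, because `unhealthy_edge_interval_msec` falls back to `unhealthy_interval_msec`.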

#### `health_checker`

An object that defines the type of health checking to use. This object is **required**, and **exactly one** of the following fields **must** be set.

Fields:

* [`http_health_check`](#httphealthcheck)
* [`tcp_health_check`](#tcphealthcheck)
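The required-field rules described so far can be checked before a cluster object is submitted. The following is a minimal sketch under the rules stated in this document only; `validateHealthCheck` is a hypothetical helper, not part of any Grey Matter tooling, and the real API may apply further validation.

```javascript
// Sketch only: checks one entry of the `health_checks` list against the
// rules stated above. `validateHealthCheck` is a hypothetical helper.
function validateHealthCheck(check) {
  const errors = [];

  // timeout_msec and interval_msec are required and must be greater than 0.
  for (const field of ["timeout_msec", "interval_msec"]) {
    const value = check[field];
    if (typeof value !== "number" || !(value > 0)) {
      errors.push(`${field} is required and must be greater than 0`);
    }
  }

  // health_checker is required and must set exactly one checker type.
  const checker = check.health_checker;
  if (checker === null || typeof checker !== "object") {
    errors.push("health_checker is required");
  } else {
    const configured = ["http_health_check", "tcp_health_check"].filter(
      (key) => checker[key] != null
    );
    if (configured.length !== 1) {
      errors.push(
        "health_checker must set exactly one of http_health_check or tcp_health_check"
      );
    }
  }
  return errors;
}
```

An empty array means the entry satisfies the rules above; each string in a non-empty result names one violated rule.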

**http\_health\_check**

Configures the HTTP health check endpoint for each instance in a cluster.

Fields:

* `host`
  * the value of the host header in the HTTP health check request
  * defaults to an empty string
  * if empty, the name of the cluster being health checked will be used.
* `path`
  * the HTTP path that will be requested during health checking
  * this value is **required** and cannot be an empty string
* `service_name`
  * an optional value which is compared to the `X-Envoy-Upstream-Healthchecked-Cluster` header to validate the identity of the health checked cluster
* `request_headers_to_add`
  * a list of HTTP headers that should be added to each health check request that is sent to the cluster

**tcp\_health\_check**

Configures the TCP health check endpoint for each instance in a cluster.

Fields:

* `send`
  * a base64 encoded string representing an array of bytes to be sent in health check requests
  * if empty, implies a connect-only health check
* `receive`
  * an array of base64 encoded strings, each representing an array of bytes that is expected in health check responses
  * a "fuzzy" match is performed when checking the response, such that each binary block must be found in the order specified, but not necessarily contiguously
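The base64 encoding and the "fuzzy" match described above can be illustrated with a short sketch. It assumes a Node.js environment (for `Buffer`); the `PING`, `PONG`, and `OK` payloads are placeholders rather than a real protocol, and `toBase64` and `matchesReceive` are hypothetical helpers that only mirror the matching rule as described here, not Envoy's code.

```javascript
// Encode raw bytes the way `send` and `receive` expect them.
const toBase64 = (text) => Buffer.from(text, "utf8").toString("base64");

// Illustrative tcp_health_check body; the payloads are placeholders.
const tcpHealthCheck = {
  send: toBase64("PING"),
  receive: [toBase64("PONG"), toBase64("OK")],
};

// Sketch of the fuzzy match: every expected block must appear in the
// response, in the listed order, but other bytes may sit between blocks.
function matchesReceive(responseBytes, receive) {
  let offset = 0;
  for (const encoded of receive) {
    const block = Buffer.from(encoded, "base64");
    const index = responseBytes.indexOf(block, offset);
    if (index === -1) return false; // block missing or out of order
    offset = index + block.length;  // continue searching after this block
  }
  return true;
}
```

Under this reading, a response of `..PONG..OK` matches the `receive` list above, while `OK PONG` does not, because the blocks appear in the wrong order.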

### Stats

If health checking is enabled on a cluster, a series of [health check statistics](https://www.envoyproxy.io/docs/envoy/v1.15.0/configuration/upstream/cluster_manager/cluster_stats#health-check-statistics) is reported in the `/stats` endpoint and looks like the following:

```bash
cluster.service.health_check.attempt: 20998
cluster.service.health_check.degraded: 0
cluster.service.health_check.failure: 10583
cluster.service.health_check.healthy: 1
cluster.service.health_check.network_failure: 10583
cluster.service.health_check.passive_failure: 0
cluster.service.health_check.success: 10415
cluster.service.health_check.verify_cluster: 0
```
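Output in this format is easy to post-process. As a minimal sketch, the snippet below parses `name: value` lines into an object and derives a failure ratio; `parseStats` is a hypothetical helper, the sample lines are copied from the output above, and how the text is fetched from `/stats` is left out.

```javascript
// Sketch only: turn `name: value` lines from /stats into an object.
function parseStats(text) {
  const stats = {};
  for (const line of text.split("\n")) {
    const match = line.trim().match(/^(\S+):\s+(\d+)$/);
    if (match) stats[match[1]] = Number(match[2]);
  }
  return stats;
}

// Three of the sample lines shown above.
const sample = [
  "cluster.service.health_check.attempt: 20998",
  "cluster.service.health_check.failure: 10583",
  "cluster.service.health_check.success: 10415",
].join("\n");

const stats = parseStats(sample);
const prefix = "cluster.service.health_check.";

// Share of health check attempts that failed in the sample.
const failureRatio = stats[prefix + "failure"] / stats[prefix + "attempt"];
console.log(failureRatio.toFixed(3));
```

In the sample, `success` and `failure` add up to `attempt`, and roughly half of the attempts failed.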

If the Envoy log level in the Sidecar is set to debug, the logs will also show health checking activity.

### Health Check Results

When active health checking is configured on a cluster, Envoy will route to the target instance if the health check is successful, and will not route to the instance while the health check is failing.

Envoy combines the active health check results with the service discovery status to make other routing decisions. More details can be found in the [service discovery docs](https://www.envoyproxy.io/docs/envoy/v1.15.0/intro/arch_overview/upstream/service_discovery).
