Health Checks
Reference: Envoy Health Checking
Grey Matter supports active health checks on upstream clusters. Health checking is configured per cluster object in the health_checks field, and Envoy uses the results to determine whether or not to route to the cluster. Grey Matter offers two types of health checking: HTTP and TCP.
Configuration
Health checking in Grey Matter is set through the health_checks field of the cluster object. This field takes a list of health check objects. A cluster with a health check enabled should look like the object below:
{
  "cluster_key": "example-cluster",
  ...
  "health_checks": [
    {
      "timeout_msec": 2000,
      "interval_msec": 10000,
      "unhealthy_threshold": 6,
      "healthy_threshold": 1,
      "health_checker": {
        "http_health_check": {
          "path": "/health"
        }
      }
    }
  ],
  ...
}
Note: The following fields are required: timeout_msec, interval_msec, health_checker.
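The required fields above, together with the constraints documented for them on this page (both durations greater than 0, exactly one checker type), can be checked before a cluster object is submitted. The following is a minimal sketch only: `validate_health_check` is a hypothetical helper written for illustration, not part of Grey Matter or Envoy.

```python
from typing import Any

# Field names come from this page; the helper itself is hypothetical.
REQUIRED_FIELDS = ("timeout_msec", "interval_msec", "health_checker")
CHECKER_TYPES = ("http_health_check", "tcp_health_check")


def validate_health_check(check: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one health check object (empty if none)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in check:
            problems.append(f"missing required field: {field}")
    # Both durations are documented as "must be greater than 0".
    for field in ("timeout_msec", "interval_msec"):
        value = check.get(field)
        if isinstance(value, (int, float)) and value <= 0:
            problems.append(f"{field} must be greater than 0")
    # Exactly one checker type may be set inside health_checker.
    checker = check.get("health_checker")
    if isinstance(checker, dict):
        configured = [name for name in CHECKER_TYPES if name in checker]
        if len(configured) != 1:
            problems.append(
                "health_checker must set exactly one of: " + ", ".join(CHECKER_TYPES)
            )
    return problems


example = {
    "timeout_msec": 2000,
    "interval_msec": 10000,
    "health_checker": {"http_health_check": {"path": "/health"}},
}
print(validate_health_check(example))
print(validate_health_check({"timeout_msec": 0}))
```

The helper collects every problem instead of stopping at the first one, so a single pass reports everything that needs fixing in a health check object.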
Fields
timeout_msec
The time in milliseconds to wait for a health check response. If the timeout is reached without a response, the health check attempt will be considered a failure. This value is required and must be greater than 0.
interval_msec
The time interval in milliseconds between health checks after the first health check. The first round of health checks occurs during startup, before any traffic is routed to a cluster, so the first interval used will be the value of no_traffic_interval_msec. This value is required and must be greater than 0.
interval_jitter_msec
An optional jitter amount that is added to each interval value calculated by the proxy. Defaults to 0.
unhealthy_threshold
The number of failed health checks required before a host is marked as unhealthy. Note that for HTTP health checking, if a host responds with a 503 status, this value is ignored and the host is considered unhealthy immediately.
healthy_threshold
The number of successful health checks required before a host is marked healthy. During startup, only a single successful health check is required to mark a host healthy.
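The interaction between the two thresholds can be sketched as a small state tracker. This is an illustrative model, not Envoy's implementation: `HostHealth` is a hypothetical class, the thresholds are assumed to count consecutive results, and "during startup" is modeled as the very first check only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HostHealth:
    """Hypothetical model of one host's active health check state."""

    unhealthy_threshold: int
    healthy_threshold: int
    healthy: bool = False     # assumption: a host is not healthy until a check succeeds
    first_check: bool = True  # assumption: "startup" means the first check only
    successes: int = 0        # consecutive successes (assumed)
    failures: int = 0         # consecutive failures (assumed)

    def record(self, success: bool, http_status: Optional[int] = None) -> bool:
        """Record one health check result and return the resulting health state."""
        if success:
            self.successes += 1
            self.failures = 0
            # During startup a single success is enough to mark the host healthy.
            if self.first_check or self.successes >= self.healthy_threshold:
                self.healthy = True
        else:
            self.failures += 1
            self.successes = 0
            # An HTTP 503 bypasses unhealthy_threshold entirely.
            if http_status == 503 or self.failures >= self.unhealthy_threshold:
                self.healthy = False
        self.first_check = False
        return self.healthy


host = HostHealth(unhealthy_threshold=3, healthy_threshold=2)
results = [host.record(ok) for ok in (True, False, False, False, True, True)]
print(results)
print(host.record(False, http_status=503))
```

In this model the host becomes healthy on its first successful check, turns unhealthy after three failures in a row, needs two successes in a row to recover, and is marked unhealthy at once by a 503.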
reuse_connection
A boolean value indicating whether or not to reuse a health check connection between health checks. Defaults to true.
no_traffic_interval_msec
When a cluster has never had traffic routed to it (i.e. on startup), this is the interval used for health checking instead of interval_msec. Once the cluster has been used for traffic routing, the interval shifts to the interval_msec value. This should be a longer interval, which allows cluster info to be checked without sending large amounts of unnecessary active health checking traffic. Defaults to 60 seconds.
unhealthy_interval_msec
When a host is marked as unhealthy, this is the interval used for health checking instead of interval_msec. As soon as the host is marked as healthy, the interval shifts back to the interval_msec value. Defaults to the value of interval_msec.
unhealthy_edge_interval_msec
The health check interval used for the first health check immediately after a host is marked as unhealthy. After this initial health check, the interval shifts to unhealthy_interval_msec. Defaults to the value of unhealthy_interval_msec.
healthy_edge_interval_msec
The health check interval used for the first health check immediately after a host is marked as healthy. After this initial health check, the interval shifts back to the standard interval_msec. Defaults to the value of interval_msec.
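The fallback chain between the interval fields described above can be summarized in one function. This is a sketch under stated assumptions: `next_interval_msec` and its three boolean arguments are hypothetical names, and giving the no-traffic interval precedence over the other intervals is an assumption made for illustration rather than documented behavior.

```python
from typing import Any

DEFAULT_NO_TRAFFIC_INTERVAL_MSEC = 60_000  # documented default of 60 seconds


def next_interval_msec(
    check: dict[str, Any],
    *,
    has_had_traffic: bool,
    host_healthy: bool,
    just_changed_state: bool,
) -> int:
    """Pick the interval to wait before the next health check (illustrative only)."""
    interval = check["interval_msec"]
    if not has_had_traffic:
        # Assumption: the no-traffic interval wins while the cluster is unused.
        return check.get("no_traffic_interval_msec", DEFAULT_NO_TRAFFIC_INTERVAL_MSEC)
    # unhealthy_interval_msec defaults to interval_msec.
    unhealthy_interval = check.get("unhealthy_interval_msec", interval)
    if host_healthy:
        if just_changed_state:
            # healthy_edge_interval_msec defaults to interval_msec.
            return check.get("healthy_edge_interval_msec", interval)
        return interval
    if just_changed_state:
        # unhealthy_edge_interval_msec defaults to unhealthy_interval_msec.
        return check.get("unhealthy_edge_interval_msec", unhealthy_interval)
    return unhealthy_interval


check = {"interval_msec": 10000, "unhealthy_interval_msec": 3000}
print(next_interval_msec(check, has_had_traffic=False, host_healthy=True, just_changed_state=False))
print(next_interval_msec(check, has_had_traffic=True, host_healthy=False, just_changed_state=True))
```

With only interval_msec and unhealthy_interval_msec set, the edge intervals fall back to them, so the first check after a host turns unhealthy uses the unhealthy interval.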
health_checker
An object that defines the type of health checking to use. This object is required, and exactly one of the following fields must be set.
Fields:
http_health_check
Configures the HTTP health check endpoint for each instance in a cluster.
Fields:
host
The value of the Host header in the HTTP health check request. Defaults to an empty string. If empty, the name of the cluster being health checked will be used.
path
The HTTP path that will be requested during health checking. This value is required and cannot be an empty string.
service_name
An optional value which is compared to the X-Envoy-Upstream-Healthchecked-Cluster header to validate the identity of the health checked cluster.
request_headers_to_add
A list of HTTP headers that should be added to each health check request that is sent to the cluster.
tcp_health_check
Configures the TCP health check endpoint for each instance in a cluster.
Fields:
send
A base64 encoded string representing an array of bytes to be sent in health check requests. If empty, implies a connect-only health check.
receive
An array of base64 encoded strings, each representing an array of bytes that is expected in health check responses. A "fuzzy" match is performed when checking the response: each binary block must be found, in the order specified, but not necessarily contiguously.
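The base64 encoding of the send and receive values, and the ordered but non-contiguous matching described for receive, can be sketched as follows. This is an illustration only: `encode_payload` and `fuzzy_match` are hypothetical helpers, the PING/PONG payload is made up, and the matching function approximates the described behavior rather than reproducing Envoy's code.

```python
import base64


def encode_payload(raw: bytes) -> str:
    """Base64 encode raw bytes for use in the send or receive fields."""
    return base64.b64encode(raw).decode("ascii")


def fuzzy_match(response: bytes, expected_blocks: list[str]) -> bool:
    """Check that every expected block appears in the response, in order.

    Blocks do not need to be adjacent, but each one must start after the
    end of the previous match.
    """
    position = 0
    for encoded in expected_blocks:
        block = base64.b64decode(encoded)
        found = response.find(block, position)
        if found == -1:
            return False
        position = found + len(block)
    return True


# Hypothetical payload: send "PING", expect "PO" and then "NG" in the reply.
tcp_health_check = {
    "send": encode_payload(b"PING"),
    "receive": [encode_payload(b"PO"), encode_payload(b"NG")],
}

print(tcp_health_check)
print(fuzzy_match(b"..PO..NG..", tcp_health_check["receive"]))
print(fuzzy_match(b"NG..PO", tcp_health_check["receive"]))
```

The second call fails because the blocks appear in the wrong order, even though both are present in the response.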
Stats
If health checking is enabled on a cluster, a series of health check statistics will be reported in its /stats endpoint, and will look like the following:
cluster.service.health_check.attempt: 20998
cluster.service.health_check.degraded: 0
cluster.service.health_check.failure: 10583
cluster.service.health_check.healthy: 1
cluster.service.health_check.network_failure: 10583
cluster.service.health_check.passive_failure: 0
cluster.service.health_check.success: 10415
cluster.service.health_check.verify_cluster: 0
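Counters in this format are straightforward to post-process. The sketch below assumes the stats text has already been retrieved from the /stats endpoint by some other means; `parse_health_check_stats` is a hypothetical helper, and the sample values are the ones shown above.

```python
def parse_health_check_stats(text: str) -> dict[str, int]:
    """Map the last segment of each health_check stat name to its integer value."""
    stats = {}
    for line in text.splitlines():
        name, separator, value = line.partition(":")
        # Skip blank lines and any stat that is not a health check counter.
        if not separator or ".health_check." not in name:
            continue
        stats[name.strip().rsplit(".", 1)[-1]] = int(value.strip())
    return stats


sample = """\
cluster.service.health_check.attempt: 20998
cluster.service.health_check.degraded: 0
cluster.service.health_check.failure: 10583
cluster.service.health_check.healthy: 1
cluster.service.health_check.network_failure: 10583
cluster.service.health_check.passive_failure: 0
cluster.service.health_check.success: 10415
cluster.service.health_check.verify_cluster: 0
"""

stats = parse_health_check_stats(sample)
success_ratio = stats["success"] / stats["attempt"]
print(f"{success_ratio:.1%} of health check attempts succeeded")
```

In the sample, success and failure add up to attempt, and every failure is also counted as a network_failure.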
If the Envoy log level in the Sidecar is set to debug, the logs will also show health checking activity.
Health Check Results
When active health checking is configured on a cluster, Envoy will route to the target instance if the health check is successful, and will not route to the instance while the health check is failing.
Envoy combines the active health check results with the service discovery status to make other routing decisions. More details can be found in the service discovery docs.