Retry Policies

A retry policy lets the Grey Matter Sidecar or Edge automatically retry a failed request on behalf of the client. This is mostly transparent to the client, which only receives the status and response of the final attempt (whether it failed or succeeded). The only effects the client should see from a successful retry are a longer average request time and fewer failures.

Retry policies are used in the Fabric Mesh to improve resiliency. Each request, to Edge or to a service, can experience intermittent interruptions, high network latency, and server errors. Normally, it's up to the calling system to detect the error and re-issue the request, but retry policies handle this automatically. This generally lowers the failure rate seen by users, at the cost of some additional traffic throughout the mesh.

Defaults

The Fabric Mesh has a default retry policy, which applies whenever no specific policy is set. This default performs up to 2 retries, each with a timeout of 60s. Retries are attempted on connect-failure, refused-stream, and gateway-error conditions.

Default Policy

"retry_policy": {
  "num_retries": 2,
  "per_try_timeout_msec": 60000,
  "timeout_msec": 60000
}

API config

Retry policies have 3 fields that can be configured. The conditions that trigger a retry are "connect-failure,refused-stream,gateway-error" (see Defaults above).

Fields

`num_retries`

This is the max number of retries attempted. Setting this field to N will cause up to N retries to be attempted before returning a result to the user.

Setting to 0 means only the original request will be sent and no retries are attempted. A value of 1 means the original request plus up to 1 retry will be sent, resulting in potentially 2 total requests to the server. A value of N will result in up to N+1 total requests going to the service.
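The counting rule above can be sketched in a few lines of Python. This is only an illustration of the N+1 arithmetic, not the data plane's actual implementation; `send_with_retries` and the `send` callable are hypothetical names introduced for this example.

```python
from typing import Callable, Tuple


def send_with_retries(send: Callable[[], bool], num_retries: int) -> Tuple[bool, int]:
    """Send once, then retry up to num_retries more times on failure.

    `send` is a hypothetical stand-in for one request to the service and
    returns True on success. Returns (succeeded, total_requests_sent).
    """
    total_requests = 0
    # One original request plus up to num_retries retries: N + 1 attempts.
    for _ in range(num_retries + 1):
        total_requests += 1
        if send():
            return True, total_requests
    return False, total_requests


def always_fail() -> bool:
    # Hypothetical upstream that fails every request.
    return False


print(send_with_retries(always_fail, 0))  # (False, 1)
print(send_with_retries(always_fail, 1))  # (False, 2)
```

With a service that never succeeds, a value of N always produces exactly N+1 requests; with a healthy service, only the original request is sent.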

`per_try_timeout_msec`

This is the timeout for each retry. The retry attempts can have longer or shorter timeouts than the original request. However, if per_try_timeout_msec is too long, some retries may never be attempted, because attempting them would exceed the timeout_msec field (see below).

`timeout_msec`

This is the total timeout for the entire chain: initial request + all retries. This should typically be set large enough to accommodate the request and all retries desired.
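To see how the two timeout fields interact, the following Python sketch estimates how many attempts can start in the worst case, where every attempt runs until its per-try timeout. It is a simplified model under stated assumptions: every attempt (including the original request) is assumed to use per_try_timeout_msec, and any delay between attempts is ignored. The function name `worst_case_attempts` is hypothetical.

```python
def worst_case_attempts(num_retries: int,
                        per_try_timeout_msec: int,
                        timeout_msec: int) -> int:
    """Count attempts that can start before timeout_msec is used up,
    assuming each attempt times out after per_try_timeout_msec.

    Simplification: no backoff or network delay between attempts.
    """
    attempts = 0
    elapsed_msec = 0
    while attempts < num_retries + 1 and elapsed_msec < timeout_msec:
        attempts += 1
        elapsed_msec += per_try_timeout_msec
    return attempts


# 3 retries, 2s per try, 10s total: all 4 attempts fit (8s <= 10s).
print(worst_case_attempts(3, 2000, 10000))   # 4
# 2 retries, 60s per try, 60s total: a timed-out first attempt
# uses the whole budget, so no retry can start.
print(worst_case_attempts(2, 60000, 60000))  # 1
```

Under this model, a policy leaves room for every retry only when (num_retries + 1) * per_try_timeout_msec does not exceed timeout_msec. Attempts that fail quickly (for example, a refused connection) use far less of the budget than a timeout does, so more retries may still happen in practice.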

Sample Policy

"retry_policy": {
  "num_retries": 3,
  "per_try_timeout_msec": 2000,
  "timeout_msec": 10000
}

Example

Retry Policies are configured in the route API object.

{
  "route_key": "my-service-route",
  "domain_key": "service",
  "zone_key": "default-zone",
  "path": "/services/service/1.0",
  "route_match": {
    "path": "",
    "match_type": ""
  },
  "prefix_rewrite": "/",
  "redirects": null,
  "shared_rules_key": "service-shared-rules",
  "rules": null,
  "response_data": {},
  "cohort_seed": null,
  "retry_policy": {
    "num_retries": 3,
    "per_try_timeout_msec": 2000,
    "timeout_msec": 10000
  },
  "high_priority": false,
  "filter_metadata": null,
  "checksum": "3a12044abf1b7c83f3265dcf00e5353b38adb4623a4f43b6bda53e78fd5d58b0"
}

Notes

Apparently Conflicting Metrics

Having retries set up in the mesh can lead to confusing behavior when looking at service metrics. This has implications for both setting SLOs and how users experience the system. It shows up mostly in the dashboard where, after setting up retries, the success/failure rate of the upstream service will no longer match what the client experiences. This is because the data plane will have retried some requests on the client's behalf. These retried requests add some successes and some failures to the metrics that the service sees. To demonstrate this, consider the scenarios below.

0 retries

A service fails about 10% of the time
A user sends 100 requests, and experiences ~90 successes and ~10 failures
The dashboard shows metrics consistent with this experience

2 retries

NOTE: the numbers shown here are an example of a plausible outcome only. Retries do not guarantee a 100% success rate.

A service fails about 10% of the time
A user sends 100 requests, and experiences 100 successes and 0 failures
The dashboard metrics show that ~110 requests hit the service, with 10 failures and 100 successes.
The client sees an error rate of 0%, while the dashboard shows an error rate of 10/110 ≈ 0.091, or about 9.1%

This complicates the metrics story in the mesh, but it does not mean the dashboard is wrong. The service error rate is correct; it is simply not the same as the error rate the user experiences.
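The gap between the two error rates can be estimated with a small calculation. The Python sketch below assumes every request to the service fails independently with the same probability, which real failures often do not; `expected_rates` is a hypothetical helper written for this illustration.

```python
def expected_rates(failure_prob: float, num_retries: int, client_requests: int) -> dict:
    """Expected service-side and client-side counts under retries.

    Assumes each request fails independently with probability failure_prob.
    """
    attempts = num_retries + 1
    # Attempt k+1 is only sent when the previous k attempts all failed,
    # which happens with probability failure_prob ** k.
    requests_per_client_request = sum(failure_prob ** k for k in range(attempts))
    service_requests = client_requests * requests_per_client_request
    service_failures = service_requests * failure_prob
    # The client only sees a failure when every attempt failed.
    client_failures = client_requests * failure_prob ** attempts
    return {
        "service_requests": service_requests,
        "service_failures": service_failures,
        "service_error_rate": service_failures / service_requests,
        "client_error_rate": client_failures / client_requests,
    }


for name, value in expected_rates(0.10, 2, 100).items():
    print(f"{name}: {value:.3f}")
```

For a 10% failure rate, 2 retries, and 100 client requests, this model expects about 111 requests and about 11 failures at the service (an error rate of 10%), while the client's expected error rate drops to about 0.1%. These are expected values, so a single run of 100 requests, such as the 110-request outcome described above, will differ slightly.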

Reference

Envoy Retry Policy API

