# Metrics

The Grey Matter Metrics Filter sets up a local metrics server to gather and report real-time statistics for the sidecar, microservice, and host system.

## Gathered Metrics

### Total Stats

* metrics version
* total requests
* total HTTP
* total HTTPS
* total RPC
* total RPC/TLS
* total requests
* total 200
* total 2xx
* latency (avg)
* latency (count)
* latency max
* latency min
* latency sum
* latency p50
* latency p90
* latency p95
* latency p99
* latency p9990
* latency p9999
* number of errors
* incoming throughput
* outgoing throughput

### Route Stats

For each route that is addressed, the following stats will be computed and reported.

* total requests
* total 200
* total 2xx
* latency (avg)
* latency (count)
* latency max
* latency min
* latency sum
* latency p50
* latency p90
* latency p95
* latency p99
* latency p9990
* latency p9999
* number of errors
* incoming throughput
* outgoing throughput

### Host Stats

* number of goroutines
* start time
* CPU percent used
* CPU cores on system
* OS
* OS architecture
* memory available
* memory used
* memory used %
* process memory used

## Prometheus

Optionally, this filter can serve the computed statistics in a form suitable for scraping by [Prometheus](https://prometheus.io/). The Prometheus endpoint is hosted at `{METRICS_HOST}:{METRICS_PORT}{METRICS_PROMETHEUS_URI_PATH}` and can be scraped directly through any of the supported Prometheus [service discovery](https://prometheus.io/docs/prometheus/latest/configuration/configuration/) mechanisms.
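As a rough illustration of how the endpoint address is assembled from the filter options, the sketch below joins host, port, and path into a scrape URL. The function name `prometheus_endpoint` is hypothetical, and choosing the scheme from `use_metrics_tls` is an assumption rather than documented behavior.

```python
def prometheus_endpoint(host: str = "0.0.0.0",
                        port: int = 8081,
                        path: str = "/prometheus",
                        use_tls: bool = False) -> str:
    """Build {METRICS_HOST}:{METRICS_PORT}{METRICS_PROMETHEUS_URI_PATH} as a URL.

    Defaults mirror the filter defaults in the options table. Picking the
    scheme from use_metrics_tls is an assumption made for this sketch.
    """
    scheme = "https" if use_tls else "http"
    return f"{scheme}://{host}:{port}{path}"


# Example: a sidecar reachable at 10.0.0.5 (hypothetical address) on port 9080.
print(prometheus_endpoint(host="10.0.0.5", port=9080))
```

Note that `0.0.0.0` is a listen address; a scraper would normally target the address at which the sidecar is actually reachable.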

## AWS CloudWatch

The metrics filter can also push the compiled statistics directly to AWS CloudWatch. This allows Grey Matter Proxy metrics to trigger actions such as Auto Scaling, or simply to provide tighter monitoring within AWS.

### Filter Configuration Option

| Name                                         | Type    | Default       | Description                                                             |
| -------------------------------------------- | ------- | ------------- | ----------------------------------------------------------------------- |
| `metrics_port`                               | Integer | `8081`        | Port the metrics server listens on                                      |
| `metrics_host`                               | String  | `0.0.0.0`     | Host the metrics server listens on                                      |
| `metrics_dashboard_uri_path`                 | String  | `/metrics`    | The `HTTP` path to query JSON metrics data                              |
| `metrics_prometheus_uri_path`                | String  | `/prometheus` | The `HTTP` path to be scraped by Prometheus                             |
| `prometheus_system_metrics_interval_seconds` | Integer | `15`          | Interval, in seconds, between system metrics updates for Prometheus     |
| `metrics_ring_buffer_size`                   | Integer | `4096`        | Size of the cache of active metrics data                                |
| `metrics_key_function`                       | String  | `""`          | Function to provide internal rollup of URL paths when reporting metrics |
| `metrics_key_depth`                          | String  | `"1"`         | Number of leading URL path sections to keep in each metrics key         |
| `use_metrics_tls`                            | Boolean | `false`       | If true, the metrics server uses TLS                                    |
| `server_ca_cert_path`                        | String  |               | SSL Trust file to use when serving metrics over TLS                     |
| `server_cert_path`                           | String  |               | SSL Certificate to use when serving metrics over TLS                    |
| `server_key_path`                            | String  |               | SSL Private Key file to use when serving metrics over TLS               |
| `enable_cloudwatch`                          | Boolean | `false`       | If true, report metrics to AWS CloudWatch                               |
| `cw_reporting_interval_seconds`              | Integer |               | Interval, in seconds, at which metrics are sent to AWS CloudWatch       |
| `cw_namespace`                               | String  |               | Namespace for CloudWatch Metrics                                        |
| `cw_dimensions`                              | String  |               | Dimensions to report to CloudWatch                                      |
| `cw_metrics_routes`                          | String  |               | URI paths to send metrics for                                           |
| `cw_metrics_values`                          | String  |               | Metrics keys to send metrics for                                        |
| `cw_debug`                                   | Boolean | `false`       | Verbose debugging for CloudWatch connection                             |
| `aws_region`                                 | String  |               | AWS region for access                                                   |
| `aws_access_key_id`                          | String  |               | AWS access key                                                          |
| `aws_secret_access_key`                      | String  |               | AWS Secret Access Key                                                   |
| `aws_session_token`                          | String  |               | AWS Session Token                                                       |
| `aws_profile`                                | String  |               | AWS Profile to use for login                                            |
| `aws_config_file`                            | String  |               | Location on disk of AWS config file                                     |

#### Example Configuration

```yaml
http_filters:
- name: gm.metrics
  config:
      metrics_port: 9080
      metrics_host: 0.0.0.0
      metrics_dashboard_uri_path: "/metrics"
      metrics_prometheus_uri_path: "/prometheus"
      metrics_ring_buffer_size: 4096
      use_metrics_tls: false
      enable_cloudwatch: false
```

#### Example Responses

**/metrics**

```javascript
{
  "grey-matter-metrics-version": "1.0.0",
  "Total/requests": 22091,
  "HTTP/requests": 0,
  "HTTPS/requests": 22091,
  "RPC/requests": 0,
  "RPC_TLS/requests": 0,
  "route/services/catalog/1.0/summary/GET/requests": 3345,
  "route/services/catalog/1.0/summary/GET/routes": "",
  "route/services/catalog/1.0/summary/GET/status/200": 3345,
  "route/services/catalog/1.0/summary/GET/status/2XX": 3345,
  "route/services/catalog/1.0/summary/GET/latency_ms.avg": 0.000000,
  "route/services/catalog/1.0/summary/GET/latency_ms.count": 7,
  "route/services/catalog/1.0/summary/GET/latency_ms.max": 0,
  "route/services/catalog/1.0/summary/GET/latency_ms.min": 0,
  "route/services/catalog/1.0/summary/GET/latency_ms.sum": 0,
  "route/services/catalog/1.0/summary/GET/latency_ms.p50": 0,
  "route/services/catalog/1.0/summary/GET/latency_ms.p90": 0,
  "route/services/catalog/1.0/summary/GET/latency_ms.p95": 0,
  "route/services/catalog/1.0/summary/GET/latency_ms.p99": 0,
  "route/services/catalog/1.0/summary/GET/latency_ms.p9990": 0,
  "route/services/catalog/1.0/summary/GET/latency_ms.p9999": 0,
  "route/services/catalog/1.0/summary/GET/errors.count": 0,
  "route/services/catalog/1.0/summary/GET/in_throughput": 0,
  "route/services/catalog/1.0/summary/GET/out_throughput": 25970425,
  "route/services/sense/1.0/recommendation/GET/requests": 3350,
  "route/services/sense/1.0/recommendation/GET/routes": "",
  "route/services/sense/1.0/recommendation/GET/status/200": 3341,
  "route/services/sense/1.0/recommendation/GET/status/503": 9,
  "route/services/sense/1.0/recommendation/GET/status/2XX": 3341,
  "route/services/sense/1.0/recommendation/GET/status/5XX": 9,
  "route/services/sense/1.0/recommendation/GET/latency_ms.avg": 0.000000,
  "route/services/sense/1.0/recommendation/GET/latency_ms.count": 7,
  "route/services/sense/1.0/recommendation/GET/latency_ms.max": 0,
  "route/services/sense/1.0/recommendation/GET/latency_ms.min": 0,
  "route/services/sense/1.0/recommendation/GET/latency_ms.sum": 0,
  "route/services/sense/1.0/recommendation/GET/latency_ms.p50": 0,
  "route/services/sense/1.0/recommendation/GET/latency_ms.p90": 0,
  "route/services/sense/1.0/recommendation/GET/latency_ms.p95": 0,
  "route/services/sense/1.0/recommendation/GET/latency_ms.p99": 0,
  "route/services/sense/1.0/recommendation/GET/latency_ms.p9990": 0,
  "route/services/sense/1.0/recommendation/GET/latency_ms.p9999": 0,
  "route/services/sense/1.0/recommendation/GET/errors.count": 0,
  "route/services/sense/1.0/recommendation/GET/in_throughput": 0,
  "route/services/sense/1.0/recommendation/GET/out_throughput": 1450994,
  "all/requests": 21924,
  "all/routes": "",
  "all/status/304": 112,
  "all/status/200": 21803,
  "all/status/503": 9,
  "all/status/2XX": 21803,
  "all/status/5XX": 9,
  "all/status/3XX": 112,
  "all/latency_ms.avg": 0.013428,
  "all/latency_ms.count": 4096,
  "all/latency_ms.max": 13,
  "all/latency_ms.min": 0,
  "all/latency_ms.sum": 55,
  "all/latency_ms.p50": 0,
  "all/latency_ms.p90": 0,
  "all/latency_ms.p95": 0,
  "all/latency_ms.p99": 0,
  "all/latency_ms.p9990": 4,
  "all/latency_ms.p9999": 13,
  "all/errors.count": 0,
  "all/in_throughput": 132437,
  "all/out_throughput": 3622059,
  "route//GET/requests": 13,
  "route//GET/routes": "",
  "route//GET/status/304": 12,
  "route//GET/status/200": 1,
  "route//GET/status/3XX": 12,
  "route//GET/status/2XX": 1,
  "route//GET/latency_ms.avg": 0.000000,
  "route//GET/latency_ms.count": 1,
  "route//GET/latency_ms.max": 0,
  "route//GET/latency_ms.min": 0,
  "route//GET/latency_ms.sum": 0,
  "route//GET/latency_ms.p50": 0,
  "route//GET/latency_ms.p90": 0,
  "route//GET/latency_ms.p95": 0,
  "route//GET/latency_ms.p99": 0,
  "route//GET/latency_ms.p9990": 0,
  "route//GET/latency_ms.p9999": 0,
  "route//GET/errors.count": 0,
  "route//GET/in_throughput": 0,
  "route//GET/out_throughput": 1628356,
  "go_metrics/runtime/num_goroutines": 6,
  "system/start_time": 1570507704592,
  "system/cpu.pct": 100.000000,
  "system/cpu_cores": 4,
  "os": "linux",
  "os_arch": "amd64",
  "system/memory/available": 5576384512,
  "system/memory/used": 10214662144,
  "system/memory/used_percent": 63.169011,
  "process/memory/used": 72286456
}
```
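The JSON response is a flat map whose keys encode the route path and HTTP method. As a minimal sketch of how a consumer might regroup it, the snippet below extracts per-route request counts from a few keys copied from the sample above. The helper name `requests_by_route` is hypothetical, and the key layout `route<path>/<METHOD>/requests` is inferred from the sample rather than from a published schema.

```python
import json

# A small excerpt of the sample /metrics response shown above.
SAMPLE = json.loads("""{
  "Total/requests": 22091,
  "route/services/catalog/1.0/summary/GET/requests": 3345,
  "route/services/catalog/1.0/summary/GET/status/200": 3345,
  "route//GET/requests": 13,
  "all/requests": 21924
}""")


def requests_by_route(metrics: dict) -> dict:
    """Map "METHOD path" to its request count, using only route/... keys."""
    counts = {}
    for key, value in metrics.items():
        if not (key.startswith("route/") and key.endswith("/requests")):
            continue
        # Drop the "route" prefix and "/requests" suffix, leaving "<path>/<METHOD>".
        middle = key[len("route"):-len("/requests")]
        path, _, method = middle.rpartition("/")
        counts[f"{method} {path}"] = value
    return counts


print(requests_by_route(SAMPLE))
# {'GET /services/catalog/1.0/summary': 3345, 'GET /': 13}
```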

**/prometheus**

```
...
http_request_duration_seconds_bucket{key="all",method="",status="401",le="0.005"} 1
http_request_duration_seconds_bucket{key="all",method="",status="401",le="0.01"} 2
http_request_duration_seconds_bucket{key="all",method="",status="401",le="0.025"} 2
http_request_duration_seconds_bucket{key="all",method="",status="401",le="0.05"} 2
http_request_duration_seconds_bucket{key="all",method="",status="401",le="0.1"} 2
http_request_duration_seconds_bucket{key="all",method="",status="401",le="0.25"} 2
http_request_duration_seconds_bucket{key="all",method="",status="401",le="0.5"} 2
http_request_duration_seconds_bucket{key="all",method="",status="401",le="1"} 2
http_request_duration_seconds_bucket{key="all",method="",status="401",le="2.5"} 2
http_request_duration_seconds_bucket{key="all",method="",status="401",le="5"} 2
http_request_duration_seconds_bucket{key="all",method="",status="401",le="10"} 2
http_request_duration_seconds_bucket{key="all",method="",status="401",le="+Inf"} 2
http_request_duration_seconds_sum{key="all",method="",status="401"} 0.01088538
http_request_duration_seconds_count{key="all",method="",status="401"} 2
http_request_duration_seconds_bucket{key="all",method="",status="503",le="0.005"} 0
http_request_duration_seconds_bucket{key="all",method="",status="503",le="0.01"} 0
http_request_duration_seconds_bucket{key="all",method="",status="503",le="0.025"} 0
http_request_duration_seconds_bucket{key="all",method="",status="503",le="0.05"} 0
http_request_duration_seconds_bucket{key="all",method="",status="503",le="0.1"} 0
http_request_duration_seconds_bucket{key="all",method="",status="503",le="0.25"} 7
http_request_duration_seconds_bucket{key="all",method="",status="503",le="0.5"} 9
http_request_duration_seconds_bucket{key="all",method="",status="503",le="1"} 9
http_request_duration_seconds_bucket{key="all",method="",status="503",le="2.5"} 9
http_request_duration_seconds_bucket{key="all",method="",status="503",le="5"} 9
http_request_duration_seconds_bucket{key="all",method="",status="503",le="10"} 9
http_request_duration_seconds_bucket{key="all",method="",status="503",le="+Inf"} 9
http_request_duration_seconds_sum{key="all",method="",status="503"} 1.9743323400000001
http_request_duration_seconds_count{key="all",method="",status="503"} 9
# HELP http_request_size_bytes number of bytes read from the request
# TYPE http_request_size_bytes counter
http_request_size_bytes{key="/",method="GET",status="200"} 0
http_request_size_bytes{key="/",method="GET",status="304"} 0
http_request_size_bytes{key="/app-icon-144x144.png",method="GET",status="200"} 0
http_request_size_bytes{key="/app-icon-144x144.png",method="GET",status="304"} 0
http_request_size_bytes{key="/appConfig.js",method="GET",status="304"} 0
http_request_size_bytes{key="/favicon.ico",method="GET",status="200"} 0
http_request_size_bytes{key="/manifest.json",method="GET",status="304"} 0
http_request_size_bytes{key="/outdatedbrowser.min.css",method="GET",status="200"} 0
http_request_size_bytes{key="/outdatedbrowser.min.css",method="GET",status="304"} 0
http_request_size_bytes{key="/outdatedbrowser.min.js",method="GET",status="200"} 0
http_request_size_bytes{key="/outdatedbrowser.min.js",method="GET",status="304"} 0
http_request_size_bytes{key="/services/catalog/1.0/metrics",method="GET",status="200"} 0
http_request_size_bytes{key="/services/catalog/1.0/summary",method="GET",status="200"} 0
http_request_size_bytes{key="/services/data/latest/props",method="GET",status="200"} 0
http_request_size_bytes{key="/services/data/latest/read",method="POST",status="200"} 1379
http_request_size_bytes{key="/services/data/latest/self",method="GET",status="200"} 0
http_request_size_bytes{key="/services/data/latest/show",method="GET",status="200"} 0
http_request_size_bytes{key="/services/data/latest/static",method="GET",status="200"} 0
http_request_size_bytes{key="/services/data/latest/static",method="GET",status="304"} 0
http_request_size_bytes{key="/services/data/latest/stream",method="GET",status="200"} 0
http_request_size_bytes{key="/services/data/latest/stream",method="GET",status="206"} 0
http_request_size_bytes{key="/services/data/latest/stream",method="GET",status="304"} 0
http_request_size_bytes{key="/services/gm-control-api/1.0/v1.0",method="GET",status="200"} 0
http_request_size_bytes{key="/services/jwt/latest/policies",method="GET",status="200"} 0
http_request_size_bytes{key="/services/jwt/latest/policies",method="GET",status="401"} 0
http_request_size_bytes{key="/services/jwt/latest/tokens",method="GET",status="307"} 0
http_request_size_bytes{key="/services/kibana/1.0/api",method="GET",status="200"} 0
http_response_size_bytes{key="all",method="",status="200"} 1.61519157e+08
http_response_size_bytes{key="all",method="",status="206"} 8.7419618e+07
http_response_size_bytes{key="all",method="",status="304"} 0
http_response_size_bytes{key="all",method="",status="307"} 67
http_response_size_bytes{key="all",method="",status="401"} 102
http_response_size_bytes{key="all",method="",status="503"} 513
# HELP non_tls_requests Number of requests not using TLS.
...
```
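Each histogram in the Prometheus output ends with a `_sum` and a `_count` series, so a mean request duration can be derived by dividing one by the other. The sketch below does this for the `key="all"` series using four lines copied from the sample above. The regular expressions are a simplification that assumes label values contain no escaped quotes, and `mean_duration_by_status` is a hypothetical helper name.

```python
import re

# Lines copied from the sample /prometheus response above.
SAMPLE = """
http_request_duration_seconds_sum{key="all",method="",status="401"} 0.01088538
http_request_duration_seconds_count{key="all",method="",status="401"} 2
http_request_duration_seconds_sum{key="all",method="",status="503"} 1.9743323400000001
http_request_duration_seconds_count{key="all",method="",status="503"} 9
"""

LINE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')   # name{labels} value
LABEL = re.compile(r'(\w+)="([^"]*)"')          # label="value"


def mean_duration_by_status(text: str) -> dict:
    """Return mean request duration in seconds per status for key="all"."""
    sums, counts = {}, {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = LINE.match(line)
        if not match:
            continue  # skip anything that is not a labelled sample
        name, labels, value = match.group(1), dict(LABEL.findall(match.group(2))), float(match.group(3))
        if labels.get("key") != "all":
            continue
        if name == "http_request_duration_seconds_sum":
            sums[labels["status"]] = value
        elif name == "http_request_duration_seconds_count":
            counts[labels["status"]] = value
    # Only report statuses that have a non-zero count.
    return {status: sums[status] / counts[status] for status in sums if counts.get(status)}


for status, mean in mean_duration_by_status(SAMPLE).items():
    print(f"status {status}: mean {mean:.4f}s")
```

In practice the same ratio is usually computed in Prometheus itself from the scraped series; the snippet only illustrates how the exposed numbers relate to each other.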

### Per-Route configuration

```javascript
{
  "metrics_key_function": <string>,
  "metrics_key_depth": <string>
}
```

See [Routing](https://greymatter.gitbook.io/grey-matter-documentation/1.7-beta/usage/traffic_control/routing).

### Setting `metrics_key_depth` value

Typically, the greater the `metrics_key_depth`, the finer-grained the metrics available for analysis. However, there are some tradeoffs to consider.

#### Edge Proxy

As shown in the [gm.metrics filter documentation](#filter-configuration-option), `metrics_key_depth` defaults to 1. The resulting metrics for an edge proxy would look something like this:

![](https://3431003532-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-LsNFVozLgvw3NDMzxBg-1847203797%2Fuploads%2Fgit-blob-e8912a187632ed186d0304af99c762b1d5f1baec%2Fprometheus.png?alt=media)

Note that the `key` field above only goes one path section deep. Whether this provides enough granularity depends on how your URLs are structured.

Let's say we have the following endpoints:

* `https://greymatter.io/apis/my-service/stores/`
* `https://greymatter.io/apis/my-service/users/37`
* `https://greymatter.io/apis/another-service/featured/2020/09`
* `https://greymatter.io/apis/another-service/home.html`

With a `metrics_key_depth` of 1, metrics such as the average response time for the above routes are rolled up into one key:

* `/apis`

If you choose a `metrics_key_depth` of 2, the same URLs are rolled up into two keys:

* `/apis/my-service`
* `/apis/another-service`

This would likely give you an idea of the average response time for each microservice. If URLs in your environment are structured like `https://[domain]/[service]/`, you get the same granularity with a `metrics_key_depth` of 1 (i.e. `key="/my-service"` and `key="/another-service"`).

If you choose a `metrics_key_depth` of 3, the URLs in the example are rolled up into:

* `/apis/my-service/stores`
* `/apis/my-service/users`
* `/apis/another-service/featured`
* `/apis/another-service/home.html`

These keys look fine for the example URLs. But if URLs are structured like `https://[domain]/[service]/` and `my-service` has millions of users, a depth of 3 produces a separate key of the form `/my-service/users/[id]` for every user ID, which means millions of keys.
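The rollups described above can be sketched as a small truncation function. This is an illustration of the idea only, not the filter's actual implementation: the name `metrics_key` is hypothetical, and dropping trailing slashes from the resulting key is an assumption made for this sketch.

```python
from urllib.parse import urlsplit


def metrics_key(url: str, depth: int) -> str:
    """Keep only the first `depth` path sections of a URL."""
    # Split the path and discard empty sections caused by leading/trailing slashes.
    sections = [s for s in urlsplit(url).path.split("/") if s]
    return "/" + "/".join(sections[:depth])


URLS = [
    "https://greymatter.io/apis/my-service/stores/",
    "https://greymatter.io/apis/my-service/users/37",
    "https://greymatter.io/apis/another-service/featured/2020/09",
    "https://greymatter.io/apis/another-service/home.html",
]

for depth in (1, 2, 3):
    keys = sorted({metrics_key(url, depth) for url in URLS})
    print(depth, keys)
# 1 ['/apis']
# 2 ['/apis/another-service', '/apis/my-service']
# 3 ['/apis/another-service/featured', '/apis/another-service/home.html',
#    '/apis/my-service/stores', '/apis/my-service/users']
```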

The default value of 1 was chosen to minimize the amount of data stored. As stated in [Prometheus' best practices](https://prometheus.io/docs/practices/naming/):

> CAUTION: Remember that every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.

Keep in mind that this example is for the edge proxy, through which requests for many different microservices flow. For this reason, the safe option is to choose a small `metrics_key_depth` to prevent a cardinality explosion caused by a service that may be added in the future.
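To make the cardinality concern concrete, the sketch below counts distinct keys at several depths for a synthetic service with 1,000 user IDs. The paths are invented for illustration, and `metrics_key` and `count_keys` are hypothetical helpers that approximate the rollup rather than reproduce the filter's code.

```python
def metrics_key(path: str, depth: int) -> str:
    """Keep only the first `depth` sections of a URL path."""
    sections = [s for s in path.split("/") if s]
    return "/" + "/".join(sections[:depth])


def count_keys(paths, depth: int) -> int:
    """Number of distinct metrics keys (one time series family per key)."""
    return len({metrics_key(p, depth) for p in paths})


# Synthetic traffic: 1,000 distinct user IDs plus one static route.
PATHS = [f"/my-service/users/{uid}" for uid in range(1000)] + ["/my-service/stores"]

for depth in (1, 2, 3):
    print(f"depth={depth}: {count_keys(PATHS, depth)} keys")
# depth=1: 1 keys
# depth=2: 2 keys
# depth=3: 1001 keys
```

Each key is further multiplied by method, status, and histogram bucket labels, so the number of stored time series grows faster than the key count alone suggests.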

#### Sidecar Proxy

Service sidecars can also have the `gm.metrics` filter. Because a sidecar is specific to the service it sits next to, we can go a little deeper if needed.

Let's take `my-service` from the first example:

* `https://greymatter.io/apis/my-service/stores/`
* `https://greymatter.io/apis/my-service/users/`

`metrics_key_depth` of 1 will give us:

* `/stores`
* `/users`

It is typical to have a mesh route object that rewrites the path `/apis/my-service/` to `/` before forwarding the request to a sidecar. So even with a depth of 1, the sidecar still yields time series data with finer-grained paths.
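The effect of the prefix rewrite on the sidecar's keys can be sketched as follows. Both helpers are hypothetical stand-ins: `rewrite_prefix` imitates what a mesh route rewrite would do, and `metrics_key` approximates the depth rollup with trailing slashes dropped.

```python
def rewrite_prefix(path: str, prefix: str = "/apis/my-service/", replacement: str = "/") -> str:
    """Imitate a route that rewrites a leading path prefix before forwarding."""
    if path.startswith(prefix):
        return replacement + path[len(prefix):]
    return path


def metrics_key(path: str, depth: int = 1) -> str:
    """Keep only the first `depth` sections of a URL path."""
    sections = [s for s in path.split("/") if s]
    return "/" + "/".join(sections[:depth])


for edge_path in ("/apis/my-service/stores/", "/apis/my-service/users/"):
    sidecar_path = rewrite_prefix(edge_path)
    # Same depth of 1, but the sidecar sees the rewritten path.
    print(metrics_key(edge_path), "->", metrics_key(sidecar_path))
# /apis -> /stores
# /apis -> /users
```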

#### Balancing between data storage and data granularity

In short, the greater the `metrics_key_depth`, the faster the data storage will fill up. However, if highly rolled-up "average" metrics do not give users the information they need, there is no point in collecting them. In these scenarios, strategies other than reducing the `metrics_key_depth` value should be considered, such as shorter data retention periods or shipping data to cheaper storage.
