Metrics

The Grey Matter Metrics Filter sets up a local metrics server to gather and report real-time statistics for the sidecar, microservice, and host system.

Gathered Metrics

Total Stats

metrics version
total requests
total HTTP
total HTTPs
total RPC
total RPC/TLS
total requests
total 200
total 2xx
latency (avg)
latency (count)
latency max
latency min
latency sum
latency p50
latency p90
latency p95
latency p99
latency p9990
latency p9999
number of errors
incoming throughput
outgoing throughput

Route Stats

For each route that is addressed, the following stats will be computed and reported.

total requests
total 200
total 2xx
latency (avg)
latency (count)
latency max
latency min
latency sum
latency p50
latency p90
latency p95
latency p99
latency p9990
latency p9999
number of errors
incoming throughput
outgoing throughput

Host Stats

number of goroutines
start time
CPU percent used
CPU cores on system
os
os Architecture
memory available
memory used
memory used %
process memory used

Prometheus

Optionally, this filter can serve the computed statistics in a form suitable for scraping by Prometheus. The prometheus endpoint will be hosted at {METRICS_HOST}:{METRICS_PORT}{METRICS_PROMETHEUS_URI_PATH}, which can then be scraped directly through the supported Prometheus service discovery mechanisms.

AWS CloudWatch

The metrics filter can also push the compiled statistics directly to AWS Cloudwatch. This allows the Grey Matter Proxy metrics to be directly used to trigger things like AutoScale actions or just for tighter monitoring directly in AWS.

Filter Configuration Option

Name

Type

Default

Description

metrics_port

Integer

8081

Port the metrics server listens on

metrics_host

String

0.0.0.0

Host the metrics server listens on

metrics_dashboard_uri_path

String

/metrics

The HTTP path to query JSON metrics data

metrics_prometheus_uri_path

String

/prometheus

The HTTP path to be scraped by Prometheus

prometheus_system_metrics_interval_seconds

Integer

15

metrics_ring_buffer_size

Integer

4096

Size of the cache of active metrics data

metrics_key_function

String

Function to provide internal rollup of URL paths when reporting metrics

metrics_key_depth

String

"1"

Truncate URLs to the first path section

use_metrics_tls

Boolean

false

If true, metrics server

uses TLS

server_ca_cert_path

String

SSL Trust file to use when serving metrics over TLS

server_cert_path

String

SSL Certificate to use when serving metrics over TLS

server_key_path

String

SSL Private Key file to use when serving metrics over TLS

enable_cloudwatch

Boolean

false

If true, report metrics to AWS Cloudwatch

cw_reporting_interval_seconds

Integer

Interval to send metrics to AWS Cloudwatch

cw_namespace

String

Namespace for Cloudwatch Metrics

cw_dimensions

String

Dimensions to report to Cloudwatch

cw_metrics_routes

String

URI paths to send metrics for

cw_metrics_values

String

Metrics keys to send metrics for

cw_debug

Boolean

false

Verbose debugging for Cloudwatch connection

aws_region

String

AWS region for access

aws_access_key_id

String

AWS access key

aws_secret_access_key

String

AWS Secrete Access Key

aws_session_token

String

AWS Session Token

aws_profile

String

AWS Profile to use for login

aws_config_file

String

Location on disk of AWS config file

Example Configuration

http_filters:
- name: gm.metrics
  config:
      metrics_port: 9080
      metrics_host: 0.0.0.0
      metrics_dashboard_uri_path: "/metrics"
      metrics_prometheus_uri_path: "/prometheus"
      metrics_ring_buffer_size: 4096
      use_metrics_tls: false
      enable_cloudwatch: false

Example Responses

/metrics

{
  "grey-matter-metrics-version": "1.0.0",
  "Total/requests": 22091,
  "HTTP/requests": 0,
  "HTTPS/requests": 22091,
  "RPC/requests": 0,
  "RPC_TLS/requests": 0,
  "route/services/catalog/1.0/summary/GET/requests": 3345,
  "route/services/catalog/1.0/summary/GET/routes": "",
  "route/services/catalog/1.0/summary/GET/status/200": 3345,
  "route/services/catalog/1.0/summary/GET/status/2XX": 3345,
  "route/services/catalog/1.0/summary/GET/latency_ms.avg": 0.000000,
  "route/services/catalog/1.0/summary/GET/latency_ms.count": 7,
  "route/services/catalog/1.0/summary/GET/latency_ms.max": 0,
  "route/services/catalog/1.0/summary/GET/latency_ms.min": 0,
  "route/services/catalog/1.0/summary/GET/latency_ms.sum": 0,
  "route/services/catalog/1.0/summary/GET/latency_ms.p50": 0,
  "route/services/catalog/1.0/summary/GET/latency_ms.p90": 0,
  "route/services/catalog/1.0/summary/GET/latency_ms.p95": 0,
  "route/services/catalog/1.0/summary/GET/latency_ms.p99": 0,
  "route/services/catalog/1.0/summary/GET/latency_ms.p9990": 0,
  "route/services/catalog/1.0/summary/GET/latency_ms.p9999": 0,
  "route/services/catalog/1.0/summary/GET/errors.count": 0,
  "route/services/catalog/1.0/summary/GET/in_throughput": 0,
  "route/services/catalog/1.0/summary/GET/out_throughput": 25970425,
  "route/services/sense/1.0/recommendation/GET/requests": 3350,
  "route/services/sense/1.0/recommendation/GET/routes": "",
  "route/services/sense/1.0/recommendation/GET/status/200": 3341,
  "route/services/sense/1.0/recommendation/GET/status/503": 9,
  "route/services/sense/1.0/recommendation/GET/status/2XX": 3341,
  "route/services/sense/1.0/recommendation/GET/status/5XX": 9,
  "route/services/sense/1.0/recommendation/GET/latency_ms.avg": 0.000000,
  "route/services/sense/1.0/recommendation/GET/latency_ms.count": 7,
  "route/services/sense/1.0/recommendation/GET/latency_ms.max": 0,
  "route/services/sense/1.0/recommendation/GET/latency_ms.min": 0,
  "route/services/sense/1.0/recommendation/GET/latency_ms.sum": 0,
  "route/services/sense/1.0/recommendation/GET/latency_ms.p50": 0,
  "route/services/sense/1.0/recommendation/GET/latency_ms.p90": 0,
  "route/services/sense/1.0/recommendation/GET/latency_ms.p95": 0,
  "route/services/sense/1.0/recommendation/GET/latency_ms.p99": 0,
  "route/services/sense/1.0/recommendation/GET/latency_ms.p9990": 0,
  "route/services/sense/1.0/recommendation/GET/latency_ms.p9999": 0,
  "route/services/sense/1.0/recommendation/GET/errors.count": 0,
  "route/services/sense/1.0/recommendation/GET/in_throughput": 0,
  "route/services/sense/1.0/recommendation/GET/out_throughput": 1450994,
  "all/requests": 21924,
  "all/routes": "",
  "all/status/304": 112,
  "all/status/200": 21803,
  "all/status/503": 9,
  "all/status/2XX": 21803,
  "all/status/5XX": 9,
  "all/status/3XX": 112,
  "all/latency_ms.avg": 0.013428,
  "all/latency_ms.count": 4096,
  "all/latency_ms.max": 13,
  "all/latency_ms.min": 0,
  "all/latency_ms.sum": 55,
  "all/latency_ms.p50": 0,
  "all/latency_ms.p90": 0,
  "all/latency_ms.p95": 0,
  "all/latency_ms.p99": 0,
  "all/latency_ms.p9990": 4,
  "all/latency_ms.p9999": 13,
  "all/errors.count": 0,
  "all/in_throughput": 132437,
  "all/out_throughput": 3622059,
  "route//GET/requests": 13,
  "route//GET/routes": "",
  "route//GET/status/304": 12,
  "route//GET/status/200": 1,
  "route//GET/status/3XX": 12,
  "route//GET/status/2XX": 1,
  "route//GET/latency_ms.avg": 0.000000,
  "route//GET/latency_ms.count": 1,
  "route//GET/latency_ms.max": 0,
  "route//GET/latency_ms.min": 0,
  "route//GET/latency_ms.sum": 0,
  "route//GET/latency_ms.p50": 0,
  "route//GET/latency_ms.p90": 0,
  "route//GET/latency_ms.p95": 0,
  "route//GET/latency_ms.p99": 0,
  "route//GET/latency_ms.p9990": 0,
  "route//GET/latency_ms.p9999": 0,
  "route//GET/errors.count": 0,
  "route//GET/in_throughput": 0,
  "route//GET/out_throughput": 1628356,
  "go_metrics/runtime/num_goroutines": 6,
  "system/start_time": 1570507704592,
  "system/cpu.pct": 100.000000,
  "system/cpu_cores": 4,
  "os": "linux",
  "os_arch": "amd64",
  "system/memory/available": 5576384512,
  "system/memory/used": 10214662144,
  "system/memory/used_percent": 63.169011,
  "process/memory/used": 72286456
}

/prometheus

...
http_request_duration_seconds_bucket{key="all",method="",status="401",le="0.005"} 1
http_request_duration_seconds_bucket{key="all",method="",status="401",le="0.01"} 2
http_request_duration_seconds_bucket{key="all",method="",status="401",le="0.025"} 2
http_request_duration_seconds_bucket{key="all",method="",status="401",le="0.05"} 2
http_request_duration_seconds_bucket{key="all",method="",status="401",le="0.1"} 2
http_request_duration_seconds_bucket{key="all",method="",status="401",le="0.25"} 2
http_request_duration_seconds_bucket{key="all",method="",status="401",le="0.5"} 2
http_request_duration_seconds_bucket{key="all",method="",status="401",le="1"} 2
http_request_duration_seconds_bucket{key="all",method="",status="401",le="2.5"} 2
http_request_duration_seconds_bucket{key="all",method="",status="401",le="5"} 2
http_request_duration_seconds_bucket{key="all",method="",status="401",le="10"} 2
http_request_duration_seconds_bucket{key="all",method="",status="401",le="+Inf"} 2
http_request_duration_seconds_sum{key="all",method="",status="401"} 0.01088538
http_request_duration_seconds_count{key="all",method="",status="401"} 2
http_request_duration_seconds_bucket{key="all",method="",status="503",le="0.005"} 0
http_request_duration_seconds_bucket{key="all",method="",status="503",le="0.01"} 0
http_request_duration_seconds_bucket{key="all",method="",status="503",le="0.025"} 0
http_request_duration_seconds_bucket{key="all",method="",status="503",le="0.05"} 0
http_request_duration_seconds_bucket{key="all",method="",status="503",le="0.1"} 0
http_request_duration_seconds_bucket{key="all",method="",status="503",le="0.25"} 7
http_request_duration_seconds_bucket{key="all",method="",status="503",le="0.5"} 9
http_request_duration_seconds_bucket{key="all",method="",status="503",le="1"} 9
http_request_duration_seconds_bucket{key="all",method="",status="503",le="2.5"} 9
http_request_duration_seconds_bucket{key="all",method="",status="503",le="5"} 9
http_request_duration_seconds_bucket{key="all",method="",status="503",le="10"} 9
http_request_duration_seconds_bucket{key="all",method="",status="503",le="+Inf"} 9
http_request_duration_seconds_sum{key="all",method="",status="503"} 1.9743323400000001
http_request_duration_seconds_count{key="all",method="",status="503"} 9
# HELP http_request_size_bytes number of bytes read from the request
# TYPE http_request_size_bytes counter
http_request_size_bytes{key="/",method="GET",status="200"} 0
http_request_size_bytes{key="/",method="GET",status="304"} 0
http_request_size_bytes{key="/app-icon-144x144.png",method="GET",status="200"} 0
http_request_size_bytes{key="/app-icon-144x144.png",method="GET",status="304"} 0
http_request_size_bytes{key="/appConfig.js",method="GET",status="304"} 0
http_request_size_bytes{key="/favicon.ico",method="GET",status="200"} 0
http_request_size_bytes{key="/manifest.json",method="GET",status="304"} 0
http_request_size_bytes{key="/outdatedbrowser.min.css",method="GET",status="200"} 0
http_request_size_bytes{key="/outdatedbrowser.min.css",method="GET",status="304"} 0
http_request_size_bytes{key="/outdatedbrowser.min.js",method="GET",status="200"} 0
http_request_size_bytes{key="/outdatedbrowser.min.js",method="GET",status="304"} 0
http_request_size_bytes{key="/services/catalog/1.0/metrics",method="GET",status="200"} 0
http_request_size_bytes{key="/services/catalog/1.0/summary",method="GET",status="200"} 0
http_request_size_bytes{key="/services/data/latest/props",method="GET",status="200"} 0
http_request_size_bytes{key="/services/data/latest/read",method="POST",status="200"} 1379
http_request_size_bytes{key="/services/data/latest/self",method="GET",status="200"} 0
http_request_size_bytes{key="/services/data/latest/show",method="GET",status="200"} 0
http_request_size_bytes{key="/services/data/latest/static",method="GET",status="200"} 0
http_request_size_bytes{key="/services/data/latest/static",method="GET",status="304"} 0
http_request_size_bytes{key="/services/data/latest/stream",method="GET",status="200"} 0
http_request_size_bytes{key="/services/data/latest/stream",method="GET",status="206"} 0
http_request_size_bytes{key="/services/data/latest/stream",method="GET",status="304"} 0
http_request_size_bytes{key="/services/gm-control-api/1.0/v1.0",method="GET",status="200"} 0
http_request_size_bytes{key="/services/jwt/latest/policies",method="GET",status="200"} 0
http_request_size_bytes{key="/services/jwt/latest/policies",method="GET",status="401"} 0
http_request_size_bytes{key="/services/jwt/latest/tokens",method="GET",status="307"} 0
http_request_size_bytes{key="/services/kibana/1.0/api",method="GET",status="200"} 0
http_response_size_bytes{key="all",method="",status="200"} 1.61519157e+08
http_response_size_bytes{key="all",method="",status="206"} 8.7419618e+07
http_response_size_bytes{key="all",method="",status="304"} 0
http_response_size_bytes{key="all",method="",status="307"} 67
http_response_size_bytes{key="all",method="",status="401"} 102
http_response_size_bytes{key="all",method="",status="503"} 513
# HELP non_tls_requests Number of requests not using TLS.
...

Per-Route configuration

{
  "metrics_key_function": <string>,
  "metrics_key_depth": <string>
}

See Routing.

Setting `metrics_key_depth` value

Typically, the greater metrics_key_depth, the finer-grained metrics you will end up with for analysis. However, there are some tradeoffs to consider.

Edge Proxy

As you see in the gm.metrics filter documentation, metrics_key_depth will be set to 1 by default. The resulting metrics for an edge proxy would look something like this:

Note that key field above only goes down 1 subdirectory. Does this provide enough granularity of the information? It depends.

Let's say we have following endpoints:

https://greymatter.io/apis/my-service/stores/
https://greymatter.io/apis/my-service/users/37
https://greymatter.io/apis/another-service/featured/2020/09
https://greymatter.io/apis/another-service/home.html

With metrics_key_depth of 1, the average response time for the above routes get rolled up to one key:

/apis

If you chose metrics_key_depth of 2, the same URLs get rolled up to two:

/apis/my-service
/apis/another-service

This would likely give you an idea of the average response time for each micro service. If URLs are structured as something like https://[domain]/[service]/ in your environment, you can get the same granularity of the information for metrics_key_depth of 1 (i.e. key="/my-service" and key="/another-service").

If you chose metrics_key_depth of 3, the URLs in the example would get rolled up to:

/apis/my-service/stores/
/apis/my-service/users/
/apis/another-service/featured/
/apis/another-service/home.html

These look fine for these example URLs. But if URLs are structured like https://[domain]/[service]/ and my-service has millions of users, then you will end up with keys that look like: /my-service/users/[id] for each and every single user IDs - which will be millions.

The motivation behind choosing the default value of 1 is to minimize the size of the data storage. As stated in Prometheus' best practices:

CAUTION: Remember that every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.

Keep in mind that this example is for the edge proxy where requests for many different microservices will flow through just the same. For this reason, the safe option would be to choose a small number for metrics_key_depth to prevent the cardinality explosions due to a service that may get added in future.

Sidecar Proxy

Service sidecars can also have gm.metrics filter. Because this is specific to a service it sits next to, we can go down a little deeper if we wanted to.

Let's take my-service from the first example:

https://greymatter.io/apis/my-service/stores/
https://greymatter.io/apis/my-service/users/

metrics_key_depth of 1 will give us:

/stores
/users

It is typical to have a mesh route object that will rewrite a path /apis/my-service/ to / before forwarding the request to a side car. So even though we have a depth of 1, it still gives us timeseries data with finer-grained path.

Balancing between data storage and data granularity

In short, the greater the metrics_key_depth, the faster the data storage will fill up. However, if highly rolled up "average" metrics will not give users the information they need, then there is no point in collecting them. In these scenarios, other strategies besides reducing the metrics_key_depth value should be considered (such as data retention periods or shipping to cheaper storage).

PreviousListAuth NextOAuth

Last updated 4 years ago

Was this helpful?

hashtagGathered Metrics

hashtagTotal Stats

hashtagRoute Stats

hashtagHost Stats

hashtagPrometheus

hashtagAWS CloudWatch

hashtagFilter Configuration Option

hashtagExample Configuration

hashtagExample Responses

hashtagPer-Route configuration

hashtagSetting metrics_key_depth value

hashtagEdge Proxy

hashtagSidecar Proxy

hashtagBalancing between data storage and data granularity

Gathered Metrics

Total Stats

Route Stats

Host Stats

Prometheus

AWS CloudWatch

Filter Configuration Option

Example Configuration

Example Responses

Per-Route configuration

Setting `metrics_key_depth` value

Edge Proxy

Sidecar Proxy

Balancing between data storage and data granularity