Prometheus Store

All collected metrics are stored in a Prometheus server inside the mesh. The store can be accessed both through its HTTP API and through a built-in UI. To access the UI, visit the Prometheus URL from the Intelligence 360 Application. The default path of the UI is /services/prometheus/latest.
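For example, assuming the mesh is reached through an edge host (the hostname below is a placeholder), the full UI URL would look like:

https://{edge_host}/services/prometheus/latest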

NOTE: you can always find the URLs of core Grey Matter components, like the Prometheus base URL, at the toggles path.

Metrics

Grey Matter aggregates metrics from every instance of every Service throughout the Fabric mesh and presents them for insight and analysis. The main key indicators are surfaced in the historical and instance views of the Intelligence 360 Application, and a great deal more can be accessed whenever needed.

From the UI, you can execute queries against the collected metrics and graph the results.

[Screenshots: the Prometheus query editor and the resulting graph]
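For instance, entering Prometheus's built-in up metric into the expression bar shows which scrape targets are currently reachable (the edge job is the same one used in the examples below):

up{job='edge'}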

Querying

In addition to the UI, Prometheus exposes a query API at /api/{version}/query. This endpoint can be used to pull historical metrics for reporting and custom analysis. The examples below demonstrate the types of queries that can be performed; a full explanation of the available options can be found in the Prometheus Documentation.

Using recording rules

The Prometheus server deployed with Grey Matter ships with many useful recording rules that precompute frequently needed or computationally expensive expressions. You can see a list of all available recording rules by navigating to the Status > Rules page in the Prometheus UI, or by accessing the /rules route.
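The same list is also available as JSON from the standard Prometheus rules API, which is convenient for scripting; a minimal sketch, using the same endpoint placeholder as the examples below:

$ curl https://{prometheus_endpoint}/api/v1/rules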

These rules can be used as is, or built upon to form more complex queries. For example, the overviewQueries:avgUpPercent:avg rule computes the uptime for a service at each scrape interval (usually every 15s) and stores it as a new time series. We can combine this time series with Prometheus's built-in avg_over_time function to return the percentage of uptime for the edge service over the past hour:

avg_over_time(overviewQueries:avgUpPercent:avg{job="edge"}[1h]) * 100

Running this query returns an instant vector result. The value array contains a timestamp representing the instant that the metric was captured and a corresponding percentage value.

$ curl https://{prometheus_endpoint}/api/v1/query --data-urlencode "query=avg_over_time(overviewQueries:avgUpPercent:avg{job='edge'}[1h])*100"

{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "job": "edge"
        },
        "value": [
          1598394589.487,
          "100"
        ]
      }
    ]
  }
}
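To pull a series of values over a window rather than a single instant, Prometheus also exposes the standard /api/v1/query_range endpoint, which takes start, end, and step parameters; the timestamps below are illustrative placeholders:

$ curl https://{prometheus_endpoint}/api/v1/query_range \
    --data-urlencode "query=overviewQueries:avgUpPercent:avg{job='edge'} * 100" \
    --data-urlencode "start=2020-08-25T00:00:00Z" \
    --data-urlencode "end=2020-08-25T01:00:00Z" \
    --data-urlencode "step=60s"

This returns a matrix result with one timestamp/value pair per step instead of a single instant vector.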

Querying metrics directly

Grey Matter metrics can also be queried directly. For example, we can find the system CPU usage for all services that Prometheus monitors by running the following query:

system_cpu_pct

This returns an instant vector containing one sample per monitored instance.

$ curl "https://{prometheus_endpoint}/api/v1/query?query=system_cpu_pct"

{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "system_cpu_pct",
          "instance": "10.0.179.118:8080",
          "job": "example"
        },
        "value": [
          1598392226.087,
          "12.596401008059724"
        ]
      },
      {
        "metric": {
          "__name__": "system_cpu_pct",
          "instance": "10.0.158.182:8080",
          "job": "edge"
        },
        "value": [
          1598392226.087,
          "5.236907732468766"
        ]
      },
      ...
    ]
  }
}

To narrow down the results, we can add a job label matcher to the query. The job label maps to the discovered proxy name of the service:

system_cpu_pct{job='edge'}
$ curl https://{prometheus_endpoint}/api/v1/query --data-urlencode "query=system_cpu_pct{job='edge'}"

{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "system_cpu_pct",
          "instance": "10.0.138.150:8080",
          "job": "edge"
        },
        "value": [
          1598392692.487,
          "5.01253132453294"
        ]
      }
    ]
  }
}
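Label matchers also support regular expressions through the =~ operator, so a single query can cover several jobs at once (the job names here are illustrative):

system_cpu_pct{job=~'edge|catalog'}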

Similarly, the request duration for a specific route can be queried:

http_request_duration_seconds_sum{key='/services/catalog/latest'}
$ curl https://{prometheus_endpoint}/api/v1/query --data-urlencode "query=http_request_duration_seconds_sum{key='/services/catalog/latest'}"

{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "http_request_duration_seconds_sum",
          "instance": "192.168.37.172:8081",
          "job": "edge",
          "key": "/services/catalog/latest",
          "method": "GET",
          "status": "200"
        },
        "value": [
          1598409389.582,
          "1.5646367309999996"
        ]
      },
      {
        "metric": {
          "__name__": "http_request_duration_seconds_sum",
          "instance": "192.168.37.172:8081",
          "job": "edge",
          "key": "/services/catalog/latest",
          "method": "GET",
          "status": "503"
        },
        "value": [
          1598409389.582,
          "0.029742157"
        ]
      }
    ]
  }
}
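Because _sum metrics are cumulative counters, they are most useful alongside a matching _count series. Assuming http_request_duration_seconds_count is also collected, as is conventional for Prometheus duration metrics, the average request latency for the route over the past five minutes can be computed as:

rate(http_request_duration_seconds_sum{key='/services/catalog/latest'}[5m])
  / rate(http_request_duration_seconds_count{key='/services/catalog/latest'}[5m])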

Alerting

Alerting is currently performed directly through Prometheus. This requires setting up an Alertmanager and defining alerting rules in the Prometheus configuration of the deployment.
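As a sketch, an alerting rule fed into that configuration might build on the recording rule from earlier. The rule name, threshold, and labels below are hypothetical, not part of the stock deployment:

groups:
  - name: uptime
    rules:
      - alert: EdgeUptimeLow
        # Fire when edge uptime over the last 5m stays below 90%.
        expr: avg_over_time(overviewQueries:avgUpPercent:avg{job='edge'}[5m]) * 100 < 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: edge service uptime below 90% for 5 minutes

Rule files like this are referenced from the rule_files section of prometheus.yml, and firing alerts are forwarded to the configured Alertmanager.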
