Nikos Oikonomou

Summary

This section of the guide introduces a Prometheus Module for monitoring NestJS applications and feeding production-grade monitoring systems like Grafana.

Abstract

The guide explains how to use the Prometheus Module to monitor NestJS applications, collect and serve metrics, and configure the module based on desired behavior. It covers installing the required packages, enabling system metrics, creating custom counters and gauges, tracking HTTP requests, and introducing a new GET /metrics endpoint. The guide also discusses config options, the module's functionality, and improvements to the current solution, such as replacing the http_requests Counter with a Histogram and addressing performance issues. The goal of the guide is to provide a ready-to-plug-in Prometheus module for monitoring apps and showcase how to create such a module to cover specific app needs.

Bullet points

  • The guide introduces a Prometheus Module for monitoring NestJS applications.
  • The module offers app-level metrics, HTTP metrics, and system metrics.
  • The guide covers installing the required packages, enabling system metrics, creating custom counters and gauges, and tracking HTTP requests.
  • The guide introduces a new GET /metrics endpoint for scraping metrics.
  • The guide discusses config options, the module's functionality, and improvements to the current solution.
  • The guide's goal is to provide a ready-to-plug-in Prometheus module for monitoring apps and showcase how to create such a module to cover specific app needs.

NestJS - Monitoring - Metrics (Prometheus)

In this section, we’ll introduce a new Prometheus Module that can help you monitor your applications and feed production-grade monitoring systems (e.g. Grafana).

This section is part of a larger guide. You can follow it from the beginning or just complete the only prerequisite step (getting started).

At this point, you should have a fully operational NestJS project to follow the upcoming steps.

Intro

Monitoring our apps is a crucial part of our workflow: it helps us trace anomalies and bugs that may cause failures and system malfunctions. Alongside logging, metrics are essential for real-time visibility into our system’s operation. Prometheus offers a great way to collect and serve such metrics, providing a comprehensive view of the application flows and the runtime that hosts them.

Prometheus offers a very simple and easy-to-implement way to collect metrics, a very powerful query language (PromQL), built-in alerting based on rules, and lots of integrations with well-known tools like Grafana.

In this section, we’ll focus only on the application side and system instrumentation. The server side goes beyond the scope of this guide: for that you can use docker (K8S) or ready-to-use hosted Prometheus solutions (I’ll try to offer such a guide later on, so make sure you revisit).

Now, let’s start by installing the needed packages:

npm install prom-client @nestjs/config

The module will offer (i) app-level metrics (Counter and Gauge for now), (ii) HTTP metrics (i.e. metrics related to incoming requests and responses), and (iii) system metrics (provided as default metrics from prom-client) and ways to configure the above based on desired behavior.

Collect metrics

This service offers a method to enable system metrics (check https://github.com/siimon/prom-client/tree/master/lib/metrics for further details), as well as ways to create custom counters and gauges. The latter can be used by other app services to create metrics with the desired configuration (e.g. an acceptable name and labels).

The module also offers a middleware that can be attached to the web server’s request cycle to track metrics for HTTP requests. It counts requests and labels them by method, url, and statusCode, which is must-have information for any monitoring system.

Scrape metrics

The module introduces a new GET /metrics endpoint that returns the currently enabled metrics (based on your config or the app metrics you have set), which other systems can use to scrape them.

Example output (HTTP/System metrics enabled, +custom hello_world_counter):

# HELP hello_world_counter Counts hello world requests
# TYPE hello_world_counter counter
hello_world_counter 4

# HELP http_requests Tracks HTTP requests
# TYPE http_requests counter
http_requests{method="GET",url="/api/v1/metrics",statusCode="200"} 13
http_requests{method="GET",url="/api/v1/hello",statusCode="200"} 4

# HELP process_cpu_user_seconds_total Total user CPU time spent in seconds.
# TYPE process_cpu_user_seconds_total counter
process_cpu_user_seconds_total 0.065486

# HELP process_cpu_system_seconds_total Total system CPU time spent in seconds.
# TYPE process_cpu_system_seconds_total counter
process_cpu_system_seconds_total 0.007919

# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 0.073405

# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1706560290
...

Config Options

  • PROMETHEUS_DEFAULT_METRICS_ENABLED: Set this to ‘true’ to enable Prometheus default metrics (provided by prom-client).
  • PROMETHEUS_HTTP_METRICS_ENABLED: Set this to ‘true’ to enable HTTP metrics (tracks HTTP requests/responses).

The module

On boot, it enables default metrics based on the corresponding config option.

If HTTP tracking is enabled, it applies MeasureHttp middleware to all routes (check here for more details).

Don’t forget to import your new module to your app.module.ts and any other module that must track custom metrics:

import { Module } from '@nestjs/common';
// Adjust the import path to wherever you placed the module.
import { PrometheusModule } from './prometheus/prometheus.module';

@Module({
  imports: [PrometheusModule],
})
export class AppModule {}

Improvements

Although the current solution provides our system with a way to track HTTP requests, it has some issues:

  1. Too simple. It offers a way to count requests and filter them, but beyond that it doesn’t offer additional insights like response time, which is very important for an effective monitoring system.
  2. Poor performance. One very important goal for maintaining a performant Prometheus system is to keep label cardinality low (i.e. the set of values for each label should be as small as possible) for all metrics, and our current labels don’t guarantee that.

For the 1st issue, the solution is pretty straightforward: we replace the http_requests Counter with a Histogram, which offers richer information (total count, cumulative counters for specific response-time buckets, and a total response time we can use to compute averages).

For the 2nd issue (which now becomes even more important, as a Histogram is more expensive than a Counter), we must first identify which label values can cause issues. We have 3 labels: method, url, and statusCode. The method and statusCode can’t cause issues as they have a limited number of values (for the latter, if needed, we can even group them to minimize them, e.g. group all 400≤statusCode<500 into a 4xx value). But url can have an unlimited number of values: it can contain path parameters (e.g. /resources/:id) that scale with the number of available values, or it can be abused by external systems simply spamming random unique URLs (an attack). To fix this, we can use the route path instead of the URL and group all unknown URLs under unmatched.
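The cardinality fix can be sketched as a small helper. In Express-based NestJS apps the matched route pattern is exposed as `req.route.path` (with `req.baseUrl` holding any prefix) — this is an assumption about the router; adjust it for yours:

```typescript
// Returns a low-cardinality label for a request: the matched route pattern
// (e.g. /api/v1/hello/:param) when available, otherwise "unmatched".
export function pathLabel(req: { route?: { path: string }; baseUrl?: string }): string {
  if (req.route?.path) {
    return (req.baseUrl ?? '') + req.route.path;
  }
  return 'unmatched';
}
```

This keeps the label set bounded by the number of declared routes, no matter how many distinct URLs are requested.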

The current buckets 50ms, 100ms, 300ms, 700ms, 1s, 2s, 5s, 10s are just examples; you can reconfigure them to your liking or simply keep the defaults, which cover most use cases.

We can now see the new metrics which resolve both issues:

http_requests_bucket{le="0.05",method="GET",path="/api/v1/metrics",statusCode="200"} 84
http_requests_bucket{le="0.1",method="GET",path="/api/v1/metrics",statusCode="200"} 84
http_requests_bucket{le="0.3",method="GET",path="/api/v1/metrics",statusCode="200"} 84
http_requests_bucket{le="0.7",method="GET",path="/api/v1/metrics",statusCode="200"} 84
http_requests_bucket{le="1",method="GET",path="/api/v1/metrics",statusCode="200"} 84
http_requests_bucket{le="2",method="GET",path="/api/v1/metrics",statusCode="200"} 84
http_requests_bucket{le="5",method="GET",path="/api/v1/metrics",statusCode="200"} 84
http_requests_bucket{le="10",method="GET",path="/api/v1/metrics",statusCode="200"} 84
http_requests_bucket{le="+Inf",method="GET",path="/api/v1/metrics",statusCode="200"} 84
http_requests_sum{method="GET",path="/api/v1/metrics",statusCode="200"} 0.08400000000000006
http_requests_count{method="GET",path="/api/v1/metrics",statusCode="200"} 84
http_requests_bucket{le="0.05",method="GET",path="unmatched",statusCode="404"} 2
http_requests_bucket{le="0.1",method="GET",path="unmatched",statusCode="404"} 2
http_requests_bucket{le="0.3",method="GET",path="unmatched",statusCode="404"} 2
http_requests_bucket{le="0.7",method="GET",path="unmatched",statusCode="404"} 2
http_requests_bucket{le="1",method="GET",path="unmatched",statusCode="404"} 2
http_requests_bucket{le="2",method="GET",path="unmatched",statusCode="404"} 2
http_requests_bucket{le="5",method="GET",path="unmatched",statusCode="404"} 2
http_requests_bucket{le="10",method="GET",path="unmatched",statusCode="404"} 2
http_requests_bucket{le="+Inf",method="GET",path="unmatched",statusCode="404"} 2
http_requests_sum{method="GET",path="unmatched",statusCode="404"} 0.002
http_requests_count{method="GET",path="unmatched",statusCode="404"} 2
http_requests_bucket{le="0.05",method="GET",path="/api/v1/hello/:param",statusCode="200"} 0
http_requests_bucket{le="0.1",method="GET",path="/api/v1/hello/:param",statusCode="200"} 0
http_requests_bucket{le="0.3",method="GET",path="/api/v1/hello/:param",statusCode="200"} 0
http_requests_bucket{le="0.7",method="GET",path="/api/v1/hello/:param",statusCode="200"} 0
http_requests_bucket{le="1",method="GET",path="/api/v1/hello/:param",statusCode="200"} 0
http_requests_bucket{le="2",method="GET",path="/api/v1/hello/:param",statusCode="200"} 1
http_requests_bucket{le="5",method="GET",path="/api/v1/hello/:param",statusCode="200"} 1
http_requests_bucket{le="10",method="GET",path="/api/v1/hello/:param",statusCode="200"} 1
http_requests_bucket{le="+Inf",method="GET",path="/api/v1/hello/:param",statusCode="200"} 1
http_requests_sum{method="GET",path="/api/v1/hello/:param",statusCode="200"} 1.001
http_requests_count{method="GET",path="/api/v1/hello/:param",statusCode="200"} 1
  • /api/v1/hello/world is mapped to /api/v1/hello/:param, so the label value stays the same no matter the parameter value.
  • /unknown is mapped to unmatched, so all unknown URLs now share a single label value.

Final Thoughts

The goal of this guide is to provide a ready-to-plug-in Prometheus module that can help you with the monitoring of your app, but also to showcase how you can create such a module and extend its functionality to cover your own app’s needs. Feel free to use it as-is, or change/upgrade it.

Finally, you can find here a full-fledged example, alongside various other modules that you might need to create a production application.
