Lightweight Prometheus Server Tutorial

Posted on April 15, 2020

How often have you deployed Prometheus on your local infrastructure only to find that it devours your resources?

Yes, we’re thinking mainly of RAM, CPU, and disk space.

Many times we have deployed Prometheus only to end up shrinking the retention time for our metrics, or increasing the scrape interval. The main driver behind these adjustments was almost always cost.

Let’s see what other tools we can use to provide an efficient local infrastructure. First stop: your Lightweight Prometheus Server. Hop on!

Installing exporters and Grafana dashboards

It’s tempting to install countless exporters and pre-configured Grafana dashboards. They can give you greater visibility into your resource usage, but they come at a steep cost. Depending on the purpose of each deployed Prometheus server, we ended up using only a handful of metrics, even though exporters like Node Exporter can provide a multitude of them.
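Before committing to an exporter, it helps to see how many distinct metric names it actually exposes. A minimal sketch, assuming you have saved an exporter’s /metrics output to a local file (here we fabricate a tiny sample so the pipeline is runnable as-is; in practice you would use something like `curl -s http://localhost:9100/metrics > metrics.txt` against Node Exporter):

```shell
# metrics.txt stands in for a saved /metrics response from an exporter.
cat > metrics.txt <<'EOF'
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
node_cpu_seconds_total{cpu="0",mode="idle"} 1.0
node_cpu_seconds_total{cpu="1",mode="idle"} 2.0
node_load1 0.5
EOF
# Strip comment lines, drop label sets and sample values, de-duplicate:
# the result is the list of distinct metric names the exporter exposes.
grep -v '^#' metrics.txt | sed 's/{.*//' | awk '{print $1}' | sort -u
```

Piping the same output through `wc -l` gives you a quick series-name count to weigh against the exporter’s resource cost.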

Choosing the right metrics

Next, we decided to use Prometheus to scale some application pods based on CPU usage. The first step toward reducing resource usage was to delete as many scrape jobs from the default values file as possible, so that Prometheus would scrape only the necessary targets: the cAdvisor endpoints on the Kubernetes nodes. Thus, our Prometheus configuration file ended up looking like this:

prometheus.yml:
 rule_files:
    - /etc/config/recording_rules.yml
    - /etc/config/alerting_rules.yml
    - /etc/config/rules
    - /etc/config/alerts

 scrape_configs:
    - job_name: 'kubernetes-nodes-cadvisor'
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

      kubernetes_sd_configs:
        - role: node
        
      relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - target_label: __address__
          replacement: kubernetes.default.svc:443
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/$1/proxy/metrics/cadvisor

Nonetheless, even with only the cAdvisor metrics, Prometheus was crashing the VM, mainly because of excessive RAM and CPU usage. Additionally, out of all the metrics scraped from the cAdvisor endpoint, we only needed container_cpu_usage_seconds_total.
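Since container_cpu_usage_seconds_total is a counter, autoscaling logic typically works on its per-second rate rather than the raw value. A hedged sketch of a recording rule that could live in the recording_rules.yml file already referenced under rule_files (the group and record names are illustrative, and the pod label name varies across Kubernetes versions, e.g. pod_name before 1.16):

```yaml
# recording_rules.yml -- illustrative names; adjust the rate window to taste.
groups:
  - name: cpu-usage
    rules:
      - record: namespace_pod:container_cpu_usage_seconds:rate5m
        expr: sum by (namespace, pod) (rate(container_cpu_usage_seconds_total[5m]))
```

Precomputing the rate this way also keeps dashboard and autoscaler queries cheap, since they read the recorded series instead of re-evaluating the rate on every request.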

How to create a curated list of metrics

After spending some time researching the issue, we finally came across a setting that would fix it: metric_relabel_configs, which tells Prometheus to keep only certain metrics. The configuration file then turned into this:

prometheus.yml:
 rule_files:
    - /etc/config/recording_rules.yml
    - /etc/config/alerting_rules.yml
    - /etc/config/rules
    - /etc/config/alerts

 scrape_configs:
    - job_name: 'kubernetes-nodes-cadvisor'
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

      kubernetes_sd_configs:
        - role: node
        
      relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - target_label: __address__
          replacement: kubernetes.default.svc:443
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/$1/proxy/metrics/cadvisor
      metric_relabel_configs:
      - source_labels: [ __name__ ]
        regex: 'container_cpu_usage_seconds_total'
        action: keep

Notice the last part of the configuration file: the keep action tells Prometheus to retain only that specific metric from the scrape job. Below you can find the resource usage before and after the change:

Before:
NAME                                 CPU(cores)   MEMORY(bytes)
prometheus-server-6cd9b7c5f4-lh94r   41m          526Mi

After:
NAME                                 CPU(cores)   MEMORY(bytes)
prometheus-server-6cd9b7c5f4-6q597   20m          222Mi
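The keep action drops every scraped sample whose __name__ does not match the regex, so the metric list can grow with your needs: the regex accepts alternation. A sketch, using common cAdvisor metric names as examples:

```yaml
      metric_relabel_configs:
        - source_labels: [__name__]
          regex: 'container_cpu_usage_seconds_total|container_memory_working_set_bytes'
          action: keep
```

Note that metric_relabel_configs runs after the scrape, so it saves storage and query-time resources, not scrape bandwidth.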

To sum it up, the first step towards creating your Lightweight Prometheus Server is to install only the exporters and Grafana dashboards you actually need. Next, identify the metrics your graphs and autoscaling rules truly require and compile them into a curated list. Ultimately, update your Prometheus configuration file to keep only those metrics.

Coming up next, check how you can use Prometheus to autoscale your workloads.
