How often have you deployed Prometheus on your local infrastructure only to find that it devours your resources?
Yes, we’re thinking mainly of RAM, CPU, and disk space.
Many times we have deployed Prometheus only to shrink the metrics retention time afterwards, or to increase the scrape interval. The main reason behind these adjustments was almost always cost constraints.
Let’s see what other tools we can use to provide an efficient local infrastructure. First stop: your Lightweight Prometheus Server. Hop on!
Installing exporters and Grafana dashboards
It’s tempting to install countless exporters and pre-configured Grafana dashboards. They give you greater visibility into your resource usage, but that visibility comes at a cost. Depending on the purpose of each deployed Prometheus server, we ended up using only a handful of metrics, even though we also considered exporters like Node Exporter that expose a multitude of them.
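Assuming Prometheus is deployed with the community Helm chart (the “default values file” mentioned below), a first trimming pass can happen right in the values file by disabling bundled components you do not need. The snippet below is only a sketch of that idea; the key names are assumptions, so check the values.yaml of the chart version you actually deploy:

values.yaml:

# Sketch: disable bundled sub-charts that are not needed.
# Key names are assumptions based on the prometheus community Helm chart;
# verify them against your chart version before applying.
alertmanager:
  enabled: false
pushgateway:
  enabled: false
nodeExporter:
  enabled: false
kubeStateMetrics:
  enabled: false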
Choosing the right metrics
Next, we decided to use Prometheus to scale some application pods based on CPU usage. To keep resource usage down, the first step was to delete as many scrape jobs from the default values file as possible, so that Prometheus would scrape only the necessary targets, namely the cAdvisor endpoints on the Kubernetes nodes. Our Prometheus configuration file ended up looking like this:
prometheus.yml:

rule_files:
  - /etc/config/recording_rules.yml
  - /etc/config/alerting_rules.yml
  - /etc/config/rules
  - /etc/config/alerts

scrape_configs:
  - job_name: 'kubernetes-nodes-cadvisor'
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      insecure_skip_verify: true
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/$1/proxy/metrics/cadvisor
Nonetheless, even with only the cAdvisor metrics, Prometheus was still crashing the VM, mainly because of excessive RAM and CPU usage. And out of all the metrics scraped from the cAdvisor endpoint, we only needed container_cpu_usage_seconds_total.
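For context, container_cpu_usage_seconds_total is a counter, so for scaling decisions it is usually turned into a per-pod rate first. Below is a minimal recording-rule sketch of that idea; the rule name and aggregation labels are illustrative assumptions, not the exact rules we shipped:

recording_rules.yml:

groups:
  - name: container-cpu
    rules:
      # Per-pod CPU usage in cores, averaged over the last 5 minutes.
      # Record name and label set are illustrative only.
      - record: namespace_pod:container_cpu_usage_seconds_total:rate5m
        expr: sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace, pod)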
How to create a curated list of metrics
After spending some time researching the issue, we finally came across the setting that fixes it: metric_relabel_configs, which tells Prometheus to keep only certain metrics from a scrape job. With it, the configuration file turned into this:
prometheus.yml:

rule_files:
  - /etc/config/recording_rules.yml
  - /etc/config/alerting_rules.yml
  - /etc/config/rules
  - /etc/config/alerts

scrape_configs:
  - job_name: 'kubernetes-nodes-cadvisor'
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      insecure_skip_verify: true
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/$1/proxy/metrics/cadvisor
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'container_cpu_usage_seconds_total'
        action: keep
Notice the last part of the configuration file: it tells Prometheus to keep only that specific metric from the scrape job and to drop everything else before ingestion. Below you can find the results:
- Resource usage with the old config, after running for 1 hour (RAM usage keeps increasing as more time series are scraped):

NAME                                 CPU(cores)   MEMORY(bytes)
prometheus-server-6cd9b7c5f4-lh94r   41m          526Mi

- Resource usage with the new config, after running for 7 days:

NAME                                 CPU(cores)   MEMORY(bytes)
prometheus-server-6cd9b7c5f4-6q597   20m          222Mi
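If a dashboard or alert later needs more than one metric, the same keep action accepts a regex alternation. Here is a sketch of how the metric_relabel_configs block above could look, assuming you also wanted a memory metric; the second metric name is purely illustrative, so adjust the list to your own curated set:

metric_relabel_configs:
  - source_labels: [__name__]
    # Keep only the metrics on the curated list; everything else is dropped.
    # container_memory_working_set_bytes is an illustrative assumption here.
    regex: 'container_cpu_usage_seconds_total|container_memory_working_set_bytes'
    action: keep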
To sum it up, the first step towards creating your Lightweight Prometheus Server is to install only the exporters and Grafana dashboards you actually need. Next, identify the metrics your graphs rely on and turn them into a curated list. Finally, update your Prometheus configuration file so that only those metrics are kept.
Coming up next, check how you can use Prometheus to autoscale your workloads.