At CloudHero, we value infrastructure automation above all. Automation is the key to delivering value to customers, as it allows teams to focus on what matters most for the business. Important automation events, such as Kubernetes autoscaling, need to trigger correctly; otherwise, they can generate unwanted infrastructure costs.
In this blog post, we will go through the story of how we implemented Kubernetes autoscaling using Prometheus, and the struggles we faced along the way. The application running on Kubernetes was the Magento eCommerce platform, which is why, as you will see later, we use metrics from Nginx and PHP-FPM.
Methods for Kubernetes Autoscaling
The way we approach Kubernetes autoscaling is by using two components:
- Kubernetes Cluster Autoscaler
- Kubernetes Horizontal Pod Autoscaler
The Kubernetes Cluster Autoscaler can be found here, and the Kubernetes Horizontal Pod Autoscaler is a resource built into Kubernetes. A basic HPA YAML looks like this:
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
The Horizontal Pod Autoscaler uses the metrics.k8s.io API, provided by the metrics-server, which is deployed using this Helm chart. Resource utilization is calculated by taking the average CPU usage across all pods in a deployment, measured against the pods' CPU requests. For example, with a request of 200m and an average usage of 150m, utilization is 75%. When the average utilization rises above 50%, the HPA scales the deployment by adding pods.
If your deployment has resource requests or anti-affinity rules set (as it should if you want Kubernetes autoscaling to make sense), newly created pods will remain in a Pending state once your VMs are at capacity. This is where the Cluster Autoscaler comes into play: it creates more instances in your auto-scaling group so that the Pending pods can be scheduled.
Basic HPA Scaling
Although this sounds good, it has a fundamental problem. Basic Kubernetes autoscaling based on instantaneous CPU usage is sensitive to spikes and can create unnecessary pods and instances, wasting resources. This is what happened to our Magento deployment. It caused all sorts of problems because we were over-provisioning compute capacity most of the time, and it showed on our billing sheet for that month. We therefore decided to opt for a more flexible solution and chose Prometheus as our metrics aggregator and source.
When using basic HPA scaling, remember to set resource requests on your deployments, as utilization is calculated as (current usage) / (resource request).
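For reference, here is a minimal sketch of what resource requests could look like on a deployment (the names, image and values are illustrative, not from our actual setup):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: magento
spec:
  replicas: 2
  selector:
    matchLabels:
      app: magento
  template:
    metadata:
      labels:
        app: magento
    spec:
      containers:
      - name: php-fpm
        image: example/magento:latest   # placeholder image
        resources:
          requests:
            cpu: 500m        # utilization for basic HPA scaling is measured against this
            memory: 512Mi
          limits:
            cpu: "1"
            memory: 1Gi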
Before going forward, we recommend installing Prometheus using Helm from the official Helm chart.
Prometheus HPA Scaling. Part 1: Adapter
To achieve communication between the Horizontal Pod Autoscaler and Prometheus, we need a layer of aggregation. More specifically, an application that serves metrics using the custom.metrics.k8s.io API. A handy tool that does exactly this is Prometheus Adapter, which can be installed using this Helm chart. The most difficult thing to understand from the values file is the rules configuration, so we will focus on that.
First, you need to define a custom rule, as such:
custom:
- seriesQuery: 'container_memory_cache{name!="",container!="POD"}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod_name: {resource: "pod"}
  name:
    matches: "container_memory_cache"
    as: "pod_memory_cache"
  metricsQuery: 'avg(container_memory_cache{name!="",container!="POD"}) by (pod_name, namespace)'
The logic for the rule above is as follows:
- seriesQuery is a query that selects the metric series the rule applies to, filtering out metrics scraped from the pause container. You can read what a pause container is in this great article by Ian Lewis.
- overrides specifies which metric labels map to the namespace and pod resources that the HPA targets.
- name specifies which name the HPA will use (in our case, we will use pod_memory_cache for our HPA instead of the default metric name).
- metricsQuery must contain a query that returns exactly one value per pod/namespace combination; this is the value the HPA reads through the custom metrics API.
Before going into metrics, the basic template for the HPA is this:
{{- if .Values.autoscaling }}
kind: HorizontalPodAutoscaler
apiVersion: autoscaling/v2beta1
metadata:
  name: {{ .Release.Name }}
  namespace: {{ .Release.Namespace }}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{ .Release.Name }}
  minReplicas: {{ .Values.autoscaling.min }}
  maxReplicas: {{ .Values.autoscaling.max }}
  metrics:
  - type: Pods
    pods:
      metricName: {{ .Values.autoscaling.metricName }}
      targetAverageValue: {{ .Values.autoscaling.target }}
{{- end }}
This is from our application Helm chart. Now, let’s talk a little bit about metricName and targetAverageValue, because those are the most important.
metricName refers to the name of the metric that you have chosen; in our examples below, it is either website_request_seconds, phpfpm_request_duration, or pod_cpu_usage.
targetAverageValue is the value at which the HPA will begin adding pods. A very important thing to keep in mind is that it uses the <int>m ("milli") notation, which is also used for CPU requests and limits. For example:
- When using CPU metrics, 1500m = 1.5 = one and a half CPU cores (or compute units).
- When using time based metrics, 1500m = 1.5 = one and a half seconds.
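To tie this together, here is a minimal sketch of what the values for the template above could look like when scaling on a time-based metric (the numbers are illustrative, not our production settings):
autoscaling:
  min: 2
  max: 10
  metricName: website_request_seconds   # a metric exposed through the Prometheus Adapter
  target: 1500m                         # 1.5 seconds average response time per pod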
Prometheus HPA Scaling. Part 2: Nginx
Shortly after the failed attempt with the standard autoscaler, we decided to scale our Magento deployment based on the response time reported by Nginx. By default, Nginx does not expose performance metrics, so we used the VTS module on our proxies. After swapping the standard Nginx image for one with the module built in, the only thing needed for Prometheus to start scraping metrics was adding these annotations to the application deployment manifest:
prometheus.io/scrape: "true"
prometheus.io/port: "80"
prometheus.io/path: "/status"
This enables Prometheus to auto-discover scrape targets. Make sure that Prometheus has all the necessary permissions to communicate with the Kubernetes API.
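For clarity, here is a minimal sketch of where these annotations live in the deployment manifest (the deployment name and image are placeholders):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: magento-nginx
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"   # let Prometheus auto-discover this pod
        prometheus.io/port: "80"       # port serving the VTS status endpoint
        prometheus.io/path: "/status"  # path exposed by the VTS module
    spec:
      containers:
      - name: nginx
        image: example/nginx-vts:latest   # placeholder image built with the VTS module
        ports:
        - containerPort: 80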
Next we created a custom adapter rule. At the time, our rule looked like this:
custom:
- seriesQuery: 'nginx_vts_upstream_request_seconds{kubernetes_namespace!="",kubernetes_pod_name!=""}'
  resources:
    overrides:
      kubernetes_namespace: {resource: "namespace"}
      kubernetes_pod_name: {resource: "pod"}
  name:
    matches: "nginx_vts_upstream_request_seconds"
    as: "website_request_seconds"
  metricsQuery: 'nginx_vts_upstream_response_seconds{kubernetes_namespace!="",kubernetes_pod_name!=""} + nginx_vts_upstream_request_seconds{kubernetes_namespace!="",kubernetes_pod_name!=""}'
This approach was working, but it had one fundamental problem: only the /status endpoint could be excluded from the upstream response and request duration computation, so every other location was counted. Our metrics were way off from what NewRelic APM was reporting, and the autoscaling was not firing when needed.
Keep in mind that if your Nginx config file only has a location block for PHP-FPM, this should work, but it did not satisfy our use case.
Prometheus HPA Scaling. Part 3: PHP-FPM
After the failed attempt with Nginx, we decided to peel another layer of the onion and dive into PHP-FPM metrics directly. To achieve this, we deployed a PHP-FPM Exporter developed by hipages. After deploying it, make sure that the status page is activated in your PHP-FPM config, like so:
pm.status_path = /status
Next, the Prometheus annotations: we replace the previous Nginx ones with these:
prometheus.io/scrape: "true"
prometheus.io/port: "9253"
prometheus.io/path: "/metrics"
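A minimal sketch of how the exporter could run as a sidecar next to PHP-FPM, together with these annotations (the image tags and the FPM address are assumptions; adjust them to your setup):
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9253"     # default metrics port of the exporter
        prometheus.io/path: "/metrics"
    spec:
      containers:
      - name: php-fpm
        image: example/magento-fpm:latest       # placeholder application image
      - name: php-fpm-exporter
        image: hipages/php-fpm_exporter:latest  # pin a specific version in production
        env:
        - name: PHP_FPM_SCRAPE_URI
          value: "tcp://127.0.0.1:9000/status"  # assumes FPM listens on TCP port 9000
        ports:
        - containerPort: 9253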
The Prometheus adapter rule looks like this:
custom:
- seriesQuery: 'phpfpm_process_request_duration{kubernetes_namespace!="",kubernetes_pod_name!=""}'
  resources:
    overrides:
      kubernetes_namespace: {resource: "namespace"}
      kubernetes_pod_name: {resource: "pod"}
  name:
    matches: "phpfpm_process_request_duration"
    as: "phpfpm_request_duration"
  metricsQuery: '(avg(avg_over_time(phpfpm_process_request_duration[10m])) by (kubernetes_pod_name,kubernetes_namespace)) / 1000000'
We use this query to compute the average request duration across all FPM workers over the last 10 minutes, converted from microseconds (the exporter's unit) to seconds.
This approach worked great and was in use for some time; it even matched the NewRelic APM metrics. But it had a flaw: whenever MySQL or Redis became overloaded for one reason or another, PHP-FPM reported that requests took too long and triggered the autoscaling process, even though the load on the servers was not high, rendering the autoscaling useless in this situation. In some cases it even made things worse by opening new connections to an already overloaded MySQL, on top of the extra infrastructure costs.
Prometheus HPA Scaling. Part 4: CPU
In the end, we went back to CPU usage metrics, but with a fundamental difference: Prometheus let us smooth out spikes by calculating rates, averages, and sums over longer periods of time.
Finally, the adapter rule looks like this:
custom:
- seriesQuery: 'container_cpu_usage_seconds_total{name!="",container!="POD"}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod_name: {resource: "pod"}
  name:
    matches: "container_cpu_usage_seconds_total"
    as: "pod_cpu_usage"
  metricsQuery: 'sum(rate(container_cpu_usage_seconds_total{name!="",container!="POD"}[10m])) by (pod_name, namespace)'
By using the CPU usage rate averaged over the last 10 minutes (summed across each pod's containers), our workloads now scale only when it is absolutely necessary.
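Plugged into the HPA template from Part 1, the values could look like this (the threshold of 750m, i.e. 0.75 cores averaged over 10 minutes, is an illustrative number, not our production setting):
autoscaling:
  min: 2
  max: 15
  metricName: pod_cpu_usage   # exposed by the adapter rule above
  target: 750m                # scale out above 0.75 cores per pod on average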
We look forward to hearing about your experience with Kubernetes autoscaling.