How to Automate Elasticsearch Index Creation

Looking to increase developer productivity and observability at Otter, we noticed several benefits of using one Elasticsearch index per application: searches become faster, queries become simpler, logs can be parsed with custom regex patterns, and we get full control over the cleanup policy through Elasticsearch Curator.

Whenever we deployed a new application, we had to manually modify a configuration file. This quickly became time-consuming and error-prone. Since our entire infrastructure runs on Kubernetes and we use Elasticsearch as our log database, we decided to replace our Fluent Bit setup with Filebeat and Logstash. This allowed us to fully automate the Elasticsearch index creation process, as these tools are tightly integrated with the rest of our stack.

Getting Started with Elasticsearch Index Creation

As always, we start by using Helm. We recommend using the Helm charts from Elastic’s GitHub repository, as they are still actively maintained. Installing Elasticsearch and Kibana is outside the scope of this article, so we will focus only on installing and configuring Filebeat and Logstash.

To add the official Elastic Helm repository, run the following command:

helm repo add elastic https://helm.elastic.co

Also, this tutorial assumes that you are deploying your resources in the elk namespace. To create it, run:

kubectl create namespace elk

Filebeat Configuration

Here, we only need to modify the values file and then install the chart with it. The only key we need to change in the main values file is filebeatConfig.

filebeatConfig:
  filebeat.yml: |
    logging.level: error
    filebeat.autodiscover:
      providers:
        - type: kubernetes
          node: ${NODE_NAME}
          hints.enabled: true
          hints.default_config:
            type: container
            paths:
              - /var/log/containers/*${data.kubernetes.container.id}.log

    output.logstash:
      hosts: ["logstash-logstash:5044"]

    setup.template:
      name: "k8s"
      pattern: "k8s-*"
      enabled: false

    setup.ilm.enabled: false

The main differences are that we use Filebeat’s autodiscover feature with the kubernetes provider. We also enable hints-based discovery, point Filebeat at the k8s-* index template, and disable both automatic template setup and ILM (the latter is only available if you have an Elastic license).
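With hints-based discovery enabled, individual pods can override the default config through co.elastic.logs/* annotations. As a minimal sketch (the pod name is hypothetical), this excludes a noisy pod from log collection entirely:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: noisy-batch-job   # hypothetical pod name
  annotations:
    # Tell Filebeat's autodiscover to skip this pod's logs
    co.elastic.logs/enabled: "false"
```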

You should also modify your tolerations so Filebeat can run on all nodes, including the master nodes (your key names may differ):

tolerations:
  - key: node-role.kubernetes.io/master
    operator: Exists
    effect: NoSchedule

After modifying the values file, run:

helm install --name filebeat --namespace elk elastic/filebeat -f fb-values.yaml

or, if you are using Helm v3:

helm install filebeat --namespace elk elastic/filebeat -f fb-values.yaml

Setting Up and Running Logstash for Elasticsearch

This is where most of the work goes. We are going to show you a fairly complex example, which you can simplify as needed. As with Filebeat’s values file, we only modify Logstash’s config.

logstashPipeline:
 logstash.conf: |
  input {
    beats {
      port => 5044
      host => "0.0.0.0"
    }
  }
  filter {
    if ("" in [kubernetes][labels][app]) and ("" in [kubernetes][labels][environment]) {
      mutate { add_field => { "[@metadata][target_index]" => "k8s-%{[kubernetes][labels][app]}-%{[kubernetes][labels][environment]}-%{[kubernetes][container][name]}-%{+yyyy.MM.dd}" } }
    } 
    else {
      mutate { add_field => { "[@metadata][target_index]" => "k8s-%{[kubernetes][namespace]}-%{+yyyy.MM.dd}" } }
    }
    if [event][module] == "nginx" {
      if [fileset][name] == "access" {
        grok {
          match => { "message" => ["%{IPORHOST:[nginx][access][remote_ip]} - %{DATA:[nginx][access][user_name]} \[%{HTTPDATE:[nginx][access][time]}\] \"%{WORD:[nginx][access][method]} %{DATA:[nginx][access][url]} HTTP/%{NUMBER:[nginx][access][http_version]}\" %{NUMBER:[nginx][access][response_code]} %{NUMBER:[nginx][access][body_sent][bytes]} \"%{DATA:[nginx][access][referrer]}\" \"%{DATA:[nginx][access][agent]}\" \"(?:%{IPORHOST:[nginx][access][real_ip]},%{IPORHOST:[nginx][access][internal_ip]}|-)\" \"%{DATA:[nginx][access][frontend_cookie]}\" \"%{DATA:[nginx][access][request_time]}\" \"%{DATA:[nginx][access][upstream_response_time]}\""] }
          remove_field => "message"
        }
        mutate {
          add_field => { "read_timestamp" => "%{@timestamp}" }
        }
        date {
          match => [ "[nginx][access][time]", "dd/MMM/YYYY:H:m:s Z" ]
          remove_field => "[nginx][access][time]"
        }
        useragent {
          source => "[nginx][access][agent]"
          target => "[nginx][access][user_agent]"
          remove_field => "[nginx][access][agent]"
        }
        geoip {
          source => "[nginx][access][real_ip]"
          target => "[nginx][access][geoip]"
        }
      }
    }
  }
  output {
    elasticsearch {
      hosts => "elasticsearch-master:9200"
      manage_template => false
      index => "%{[@metadata][target_index]}"
    }
  }

The Logstash configuration file is based on conditional statements, which makes it very powerful. Inside the conditional statements, you can use different filters. We will go through a short description of the ones we use. For the rest of them, you can check the documentation on Elastic’s website, which is very well written.

A Few More Steps to Go

Firstly, we use the mutate plugin to dynamically create indices based on Kubernetes metadata. The index names are built by concatenating two pod labels and one Kubernetes metadata field. We use the app and environment labels to distinguish between our logs, and we divide them further by container name, so we can keep Nginx logs in one index and PHP logs in another. We do not apply these labels to every deployment, as that would create unnecessary shards in the Elasticsearch cluster. For example, we run many cronjobs, and having an index for each of them would be overkill, so if these labels are not present, the logs are sent to a common index based on the namespace name. We also use a mutate filter to preserve the timestamp Filebeat reports in read_timestamp, while the date filter replaces @timestamp with Nginx’s own timestamp and removes the original field.
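As a quick sanity check of the naming scheme (the label values below are hypothetical), the index name produced by the first mutate branch can be sketched in shell:

```shell
#!/bin/sh
# Hypothetical label values; in the real pipeline these come from Kubernetes metadata.
app="billing"
environment="production"
container="nginx"

# Mirrors the Logstash sprintf pattern:
#   k8s-%{app}-%{environment}-%{container}-%{+yyyy.MM.dd}
index="k8s-${app}-${environment}-${container}-$(date +%Y.%m.%d)"
echo "$index"   # e.g. k8s-billing-production-nginx-2024.01.31
```

Note that Logstash formats %{+yyyy.MM.dd} from the event’s @timestamp in UTC, so the date component comes from the log event rather than the wall clock.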

Then, the grok filter helps us with parsing our Nginx logs, because we use a different log format. Notice the if [event][module] == "nginx" condition. The event.module label is present because we put an Nginx module on our deployments. In order to do so yourself, apply an annotation to your deployments like this:

      annotations:
        co.elastic.logs.nginx/module: 'nginx'

This annotation states that logs coming from the nginx container should be parsed using the Filebeat nginx module, but it only works with default log formats.
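Note that the container name is part of the annotation key (co.elastic.logs.&lt;container&gt;/module), so the annotation above targets the container named nginx. If your container were named proxy instead (a hypothetical name), it would look like this:

```yaml
      annotations:
        co.elastic.logs.proxy/module: 'nginx'
```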

What About the Other Ones?

The rest of the filters are easy to understand. We will only say a little about the geoip filter, because we think it’s wonderful. It takes an IP address, which we extract from the X-Forwarded-For header, looks it up in its GeoIP database, and creates another field with geographic information. This way, you can create a Kibana map dashboard to see where your requests come from. However, with this setup, GeoIP fields will not work by default. To fix this issue, check out our blog post about the Top Five Tips and Tricks to Manage Your Elasticsearch Cluster.
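As a rough illustration (the values are made up, and the exact field set depends on your GeoIP database version), the geoip filter enriches an event with a nested structure along these lines under the configured target:

```json
"nginx": {
  "access": {
    "geoip": {
      "country_name": "Germany",
      "city_name": "Berlin",
      "location": { "lat": 52.52, "lon": 13.40 }
    }
  }
}
```

The location field is what a Kibana map visualization plots, provided it is mapped as a geo_point in the index.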

After modifying the values file, run:

helm install --name logstash --namespace elk elastic/logstash -f ls-values.yaml

or, if you are using Helm v3:

helm install logstash --namespace elk elastic/logstash -f ls-values.yaml

We hope this article provides the basic information you need to take your cloud infrastructure to the next level. If you decide to use Fluent Bit instead of Filebeat/Logstash, we highly recommend reading our previous article on this topic.
