Cognition, the Wismut Labs engineering blog

Log Management Made Easy in Large-scale Distributed Systems

The following is a guest post by Steven Chen.

In the past, an application was usually a monolithic system hosted on a single server, or on a few servers with a backup system enabled. System administrators could handle system monitoring (e.g. log management) the “old-fashioned” way: logging on to those servers over SSH and analysing log files with Linux commands such as grep or awk.

But this is no longer the case for modern software systems, especially those that adopt the microservices architectural style. A single, huge, monolithic application is split into many small, independently deployable services. Together, those services perform the same complete functions that the monolithic system did.

First of all, to handle different workloads, each of those services usually has multiple instances running across different servers in a data centre (possibly hundreds of instances, though the number varies with the workload).

Secondly, we need to monitor those services to keep track of their running status, so that the operations team can perform system maintenance when necessary.

Last but not least, beyond basic system status monitoring, we may want to understand how those services perform. A typical way of doing this is by parsing log files to answer questions about usage patterns and performance.

Taken together, these factors make log management very challenging in large-scale distributed systems.

Centralised monitoring using ELK

In this post, we propose a centralised monitoring service using ELK (Elasticsearch, Logstash and Kibana) to address the aforementioned challenges. The ELK stack is open source under the Apache License, version 2.0, has been used by many companies in production, and is built to be scalable.

The typical approach to applying ELK is to deploy Logstash to each server and leverage its text-parsing capability to parse raw string log entries into structured formats (e.g. JSON strings). Logstash then forwards these structured entries to Elasticsearch for indexing and querying, and Kibana allows us to visualise the data in Elasticsearch.

The way Logstash works is that it applies a codec to incoming data (i.e. deserialisation), then executes filtering logic over the deserialised data to produce more structured output. Before it sends that output to its recipients, it applies a codec again (i.e. serialisation). The filtering stage consumes the majority of Logstash’s computational resources.

Recall that in a typical deployment, Logstash runs on the production servers that host the applications. It may therefore add overhead to those servers, even though it is claimed to be very lightweight.

Another issue is that debugging Logstash, particularly its filtering logic, can be difficult in customers’ production environments, some of which are extremely restricted.

Fortunately, Elastic provides Beats, a family of more lightweight agents, including Filebeat, which simply forwards data from a local machine to remote servers. In our practice, we deploy Filebeat to each application server, where it does only one thing: forward logs to the Logstash instances.

Elastic claims that Filebeat uses a back-pressure-sensitive protocol, so it will not overload the pipeline. A minimal configuration for sending data to Logstash looks like this:

filebeat.prospectors:
- input_type: log
  # Paths that should be crawled and fetched. Glob-based paths.
  paths:
    - /path/to/log/*.log

output.logstash:
  # The Logstash hosts, in "hostname:port" format;
  # multiple Logstash host strings are separated by commas.
  hosts: ["localhost:5044"]

To parse raw string log entries, a typical solution is the Grok plugin for Logstash. The community has developed quite a few Grok patterns, such as those for Apache HTTP server logs, but there may not be an existing pattern for our use case.

And as the system evolves, more information may need to be recorded, and the log entry format may become more and more complex. To avoid writing a Grok pattern for every new type of log entry, we propose changing the way we generate our log entries (for existing systems or third-party systems, depending on the context).

Rather than generating an unstructured raw string, we require our developers to write log entries in a predefined semi-structured format. For example:

[ISO_8601_TIMESTAMP] [LOGGER] {"key1":"value1","key2":"value2", ...}

Only the timestamp and logger information are logged as raw strings, followed by a JSON string of the important log information. This preserves the log’s readability for administrators while making the format easy to extend.

In the future, if we have more information to log, we can simply add it to the JSON string. More importantly, we only ever need one Grok pattern, since the JSON string can easily be converted into a JSON object, which is the form Elasticsearch expects.
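As a minimal sketch of how a service could emit such entries (the helper name and fields below are hypothetical, not part of the original post), in Python:

```python
import json
from datetime import datetime, timezone

def format_entry(logger_name, **fields):
    """Build one semi-structured log line:
    [ISO_8601_TIMESTAMP] [LOGGER] {"key1": "value1", ...}
    """
    timestamp = datetime.now(timezone.utc).isoformat()
    # sort_keys keeps the JSON part stable and diff-friendly
    payload = json.dumps(fields, sort_keys=True)
    return f"[{timestamp}] [{logger_name}] {payload}"

# Adding a new field later only extends the JSON part;
# the single Grok pattern for the line stays the same.
print(format_entry("checkout-service", action="pay", duration_ms=87))
```

Because every extra field lands inside the JSON payload, the line-level format never changes as the system evolves.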

Implementing the Grok pattern is rather straightforward because we can reuse some of the patterns that ship with Grok; you can refer to this repository for more information.
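As a sketch of such a pattern (the pattern name LOG_ENTRY and the field names logger and payload are illustrative, not taken from the original post), the semi-structured format above can be matched by reusing Grok’s built-in TIMESTAMP_ISO8601, DATA and GREEDYDATA patterns:

```
LOG_ENTRY \[%{TIMESTAMP_ISO8601:timestamp}\] \[%{DATA:logger}\] %{GREEDYDATA:payload}
```

The captured payload field can then be expanded into individual fields with Logstash’s json filter (json { source => "payload" }), so new keys in the JSON string require no changes to the Grok pattern.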

Below is an example minimal configuration of Logstash forwarding data to Elasticsearch:

input {
  beats {
    port => 5044
    codec => "plain"
    include_codec_tag => false
  }
}

filter {
  grok {
    # Path to directory that contains customized Grok patterns.
    # Multiple path strings are separated by commas.
    patterns_dir => ["/path/to/your/patterns"]
    match => {"message" => ["%{LOG_FOO}", "%{LOG_BAR}"]}
  }
  # You may perform other mutations depending on which fields you require
}

output {
  elasticsearch {
    codec => "plain"
    hosts => ["localhost:9200"]
    user => "elastic"
    password => "changeme"
  }
}

This solution also handles deployments that use Docker containers: we mount the log directory from the containers onto the Docker host, and Filebeat picks up the data from those directories. If Filebeat itself runs in a Docker container, the log directory must also be mounted from the Docker host into the Filebeat container.
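As a sketch of this wiring (the image names, tags and paths are illustrative, not from the original post), a Docker Compose fragment sharing a host log directory between an application container and a Filebeat container might look like:

```yaml
version: "2"
services:
  myapp:
    # Hypothetical application image; it writes logs inside the container.
    image: myapp:latest
    volumes:
      - /var/log/myapp:/app/logs            # container log dir -> host
  filebeat:
    image: docker.elastic.co/beats/filebeat:5.6.16
    volumes:
      - /var/log/myapp:/var/log/myapp:ro    # host log dir -> Filebeat container
      - ./filebeat.yml:/usr/share/filebeat/filebeat.yml:ro
```

The same host directory appears in both containers, so Filebeat reads exactly the files the application writes.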

Using this approach, all log data is centralised in Elasticsearch; our administrators can visualise that data using Kibana, and data scientists can build models on it to garner insights via the Elasticsearch DSL.
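As an illustrative sketch of such a DSL query (the field names action and duration_ms are hypothetical, echoing the example log format above rather than any fixed schema), the query body for the average latency per action over the last hour could be built like this:

```python
import json

# Hypothetical aggregation: average duration_ms per action over the last hour.
query = {
    "size": 0,  # we only want aggregation results, not hits
    "query": {"range": {"@timestamp": {"gte": "now-1h"}}},
    "aggs": {
        "by_action": {
            "terms": {"field": "action.keyword"},
            "aggs": {"avg_duration": {"avg": {"field": "duration_ms"}}},
        }
    },
}

# This JSON body can be POSTed to /<index>/_search on Elasticsearch.
print(json.dumps(query, indent=2))
```

Because the indexed documents are the JSON objects our services emitted, queries like this need no further parsing step.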

Summary

Monitoring services in large-scale systems has become rather important, and the traditional way of doing it involves too much human effort and is error-prone. ELK has proven to be one of the more promising alternatives for solving this problem.

At Wismut Labs, we have developed a proof-of-concept of this approach for our systems. Interested to find out more? Drop us an email at sales@wismutlabs.com.