Our Experience of Creating Large Scale Log Search System Using ElasticSearch

by soon posted May 11, 2020
Written by Lee Jae Ik on 05/01/2018

At NHN, we have a service called NELO (NHN Error Log System) that manages and searches logs pushed to the system by various applications and other Web services. The search performance and functionality of NELO2, the second generation of the system, have been significantly improved through ElasticSearch. Today I would like to share our experience at NHN of deploying ElasticSearch in a log search system.

ElasticSearch is a distributed search engine based on Lucene developed by Shay Banon. Shay and his team have recently released the long-awaited version 0.90. Here is a link to a one-hour recorded webinar where Clinton Gormley, one of the core ElasticSearch developers, explains what's new in ElasticSearch 0.90.

If you are developing a system that requires search functionality, I would recommend ElasticSearch because it is very easy to install and to expand with more servers. Since it is a distributed system, ElasticSearch can easily cope with an increase in the volume of search targets. At NHN, all logs coming into NELO2 are stored and indexed by ElasticSearch to provide fast, near real-time search results.

Features of ElasticSearch

Let's get started with familiarizing ourselves with the terms widely used in ElasticSearch. For those who are familiar with relational database systems, the following table compares the terms used in relational databases with the terms used in ElasticSearch.

Table 1: Comparison of the terms of RDBMS and ElasticSearch.

Relational DB    ElasticSearch
Database         Index
Table            Type
Row              Document
Column           Field
Schema           Mapping
Index            Everything is indexed
SQL              Query DSL

JSON-based Schemaless Storage

ElasticSearch is a search engine, but it can also be used like a NoSQL document store. Since the data model is represented in JSON, both requests and responses are exchanged as JSON documents, and the original documents (sources) are stored as JSON as well. No schema has to be defined in advance: JSON documents are automatically indexed when they are sent to the server, and number and date types are mapped automatically.

Multitenancy

ElasticSearch supports multitenancy. Multiple indexes can be stored in a single ElasticSearch server, and the data of multiple indexes can be searched with a single query. NELO2 stores logs in a separate index for each date. When executing a search, NELO2 queries all the date indexes within the search range with a single request.

Code 1: Multitenancy Example Query.

# Store logs in the log-2012-12-26 index
curl -XPUT http://localhost:9200/log-2012-12-26/hadoop/1 -d '{
    "projectName": "hadoop",
    "logType": "hadoop-log",
    "logSource": "namenode",
    "logTime":"2012-12-26T14:12:12",
    "host": "host1.nelo2",
    "body": "org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.completeFile"
}'

# Store logs in the log-2012-12-27 index
curl -XPUT http://localhost:9200/log-2012-12-27/hadoop/1 -d '{
    "projectName": "hadoop",
    "logType": "hadoop-log",
    "logSource": "namenode",
    "logTime":"2012-12-27T02:02:02",
    "host": "host2.nelo2",
    "body": "org.apache.hadoop.hdfs.server.namenode.FSNamesystem"
}'

# Search the log-2012-12-26 and log-2012-12-27 indexes at once
curl -XGET http://localhost:9200/log-2012-12-26,log-2012-12-27/_search
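
As a rough sketch of how an application might build the multi-index target for a date-range search, the helper below joins date-based index names with commas. The function name index_list and the log prefix are assumptions made for illustration; they are not part of NELO2 or ElasticSearch.

```shell
# Hypothetical helper: join date-based index names into one
# comma-separated search target.
# $1 is the index prefix; the remaining arguments are dates.
index_list() {
    prefix=$1
    shift
    list=""
    for day in "$@"; do
        # Prepend a comma only when the list already has an entry.
        list="${list:+$list,}${prefix}-${day}"
    done
    printf '%s\n' "$list"
}

# Example (needs a running server, so it is left commented out):
# curl -XGET "http://localhost:9200/$(index_list log 2012-12-26 2012-12-27)/_search"
```

Building the list on the client side keeps the request to a single round trip, however many days the search range covers.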

Scalability and Flexibility

ElasticSearch provides excellent scalability and flexibility. Its functionality can be extended through plugins, and plugin support was further improved in the recent 0.90 release. For example, you can change the transport protocol by using the Thrift or Jetty plugin. If you install BigDesk or Head, which are practically essential plugins, you can monitor ElasticSearch. You can also adjust the number of replicas dynamically. The number of shards, however, is fixed for each index when it is created, as in Code 2 below, and cannot be changed afterwards, so an appropriate number of shards should be allocated from the start by taking the number of nodes and future server expansion into account.

Code 2: Dynamic Configuration Change Query.

$ curl -XPUT http://localhost:9200/log-2012-12-27/ -d '{
    "settings": {
        "number_of_shards": 10,
        "number_of_replicas": 1
    }
}'
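
Code 2 sets both values when the index is created. To change the replica count of an index that already exists, the index update settings endpoint (_settings) can be used. The following is a minimal sketch, assuming the log-2012-12-27 index exists; replica_settings_body is a hypothetical helper name chosen for this example.

```shell
# Hypothetical helper: print the JSON body that sets the replica count.
# $1 is the desired number of replicas.
replica_settings_body() {
    printf '{"index": {"number_of_replicas": %d}}\n' "$1"
}

# Example (needs a running server, so it is left commented out):
# curl -XPUT http://localhost:9200/log-2012-12-27/_settings -d "$(replica_settings_body 2)"
```

Only number_of_replicas is sent here, because number_of_shards cannot be changed after the index has been created.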

Distributed Storage

ElasticSearch is a distributed search engine. It distributes the data of an index across multiple shards according to a key. Each shard is itself a Lucene index, and each shard has 0 or more replicas. Moreover, ElasticSearch supports clustering: when a cluster runs, one of its nodes is elected as the master node to manage metadata. If the master node fails, another node in the cluster automatically becomes the master. It is also very easy to add nodes. When a node is added to the same network, it automatically finds the cluster through multicast and joins it. If the node is not on the same network, the address of the master node should be specified through unicast (see a related video: http://youtu.be/l4ReamjCxHo).

Installing

Quick Start

ElasticSearch supports zero configuration installation. As shown in the following code snippets, all you have to do for execution is download a file from the official homepage and unzip it.

  1. Download
    ~$ wget http://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.20.1.tar.gz
    ~$ tar xvzf elasticsearch-0.20.1.tar.gz
  2. Executing Server
    ~$ cd elasticsearch-0.20.1
    ~$ bin/elasticsearch -f

Installing Plugins

You can easily expand the functionality of ElasticSearch through plugins. You can add management functionality, change the Lucene analyzer, and change the basic transport module from Netty to Jetty. The following are the commands we use to install plugins for NELO2. Head and bigdesk, which are installed by the first and second lines, are the plugins we rely on for monitoring ElasticSearch. It is strongly recommended to install them and explore what they offer. After installing them, visit http://localhost:9200/_plugin/head/ and http://localhost:9200/_plugin/bigdesk/ to see the status of ElasticSearch in your Web browser.

bin/plugin -install Aconex/elasticsearch-head
bin/plugin -install lukas-vlcek/bigdesk
bin/plugin -install elasticsearch/elasticsearch-transport-thrift/1.4.0
bin/plugin -install sonian/elasticsearch-jetty/0.19.9

Main Configurations

You don't need to change configurations when conducting a simple functionality test. When you carry out a performance test or apply it to production services, then you should change some default configurations. See the following snippet and try to find for yourself the configurations which should be changed from the initial configuration file.

Code 5: Main Configurations (config/elasticsearch.yml).

# As it is a name used to identify clusters, use a name with uniqueness and a meaning.
cluster.name: ElasticSearch-nelo2
 
# A node name is automatically created but it is recommended to use a name that is discernible in a cluster like a host name.
node.name: "xElasticSearch01.nelo2"
 
# Both of the following default to true. node.master sets whether the node can become the master, while node.data sets whether the node stores data. Usually both should be true; if the cluster is large, adjust these values per node to configure the three types of node. Details are given in the section on configuring cluster topologies.
node.master: true
node.data: true
 
# You can change the number of shards and replicas. The following value is a default value:
index.number_of_shards: 5
index.number_of_replicas: 1
 
#To prevent jvm swapping, you should set the following value as true:
bootstrap.mlockall: true
 
# It is a timeout value for checking the status of each node in a cluster. You should set an appropriate value; if the value is too small, nodes may frequently get out of a cluster. The default value is 3 seconds.
discovery.zen.ping.timeout: 10s
 
# The default value is multicast, but in an actual environment, unicast should be employed due to the possibility of overlapping with other clusters. It is recommended to list servers that can be a master in the second setting.
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["host1", "host2:port", "host3[portX-portY]"]

Using REST API

ElasticSearch provides a REST API as shown below. It provides most of its functionalities through REST API, including the creation and deletion of indexes, mappings, as well as search and change of settings. In addition to REST API, it also provides various client APIs for Java, Python and Ruby.

Code 6: REST API Format in ES.

http://host:port/(index)/(type)/(action|id)

As mentioned earlier, NELO2 separates indexes (databases in RDBMS terms) by date and types (tables) by project. Code 7 below shows how to handle, document by document, the logs that came into the hadoop project on December 27, 2012 by using the REST API.

Code 7: An Example of Using ElasticSearch REST API.

# Creating, getting, and deleting a document
curl -XPUT http://localhost:9200/log-2012-12-27/hadoop/1
curl -XGET http://localhost:9200/log-2012-12-27/hadoop/1
curl -XDELETE http://localhost:9200/log-2012-12-27/hadoop/1
 
#Search
curl -XGET http://localhost:9200/log-2012-12-27/hadoop/_search
curl -XGET http://localhost:9200/log-2012-12-27/_search
curl -XGET http://localhost:9200/_search
 
#Seeing the status of indexes
curl -XGET http://localhost:9200/log-2012-12-27/_status

Creating Documents and Indexes

As shown in Code 8 below, when the request is sent, ElasticSearch automatically creates the log-2012-12-27 index and the hadoop type even though neither has been defined in advance. If you want to create them explicitly instead of relying on auto-creation, set action.auto_create_index and index.mapper.dynamic to false in the configuration file.

Code 8: Creating Documents.

# Request
curl -XPUT http://localhost:9200/log-2012-12-27/hadoop/1 -d '{
    "projectName": "hadoop",
    "logType": "hadoop-log",
    "logSource": "namenode",
    "logTime":"2012-12-27T02:02:02",
    "host": "host2.nelo2",
    "body": "org.apache.hadoop.hdfs.server.namenode.FSNamesystem"
}'
 
# Result
{
    "ok": true,
    "_index": "log-2012-12-27",
    "_type": "hadoop",
    "_id": "1",
    "_version": 1
}

As shown in Code 9 below, you can also send the request with the document wrapped in its type name.

Code 9: A Query Including Type.

curl -XPUT http://localhost:9200/log-2012-12-27/hadoop/1 -d '{
    "hadoop": {
        "projectName": "hadoop",
        "logType": "hadoop-log",
        "logSource": "namenode",
        "logTime":"2012-12-27T02:02:02",
        "host": "host2.nelo2",
        "body": "org.apache.hadoop.hdfs.server.namenode.FSNamesystem"
    }
}'

If the id value is omitted as in Code 10, an id is generated automatically when the document is created. Note that the POST method is used instead of PUT for this request.

Code 10: A Query Creating a Document without an ID.

# Request
curl -XPOST http://localhost:9200/log-2012-12-27/hadoop/ -d '{
    "projectName": "hadoop",
    "logType": "hadoop-log",
    "logSource": "namenode",
    "logTime":"2012-12-27T02:02:02",
    "host": "host2.nelo2",
    "body": "org.apache.hadoop.hdfs.server.namenode.FSNamesystem"
}'
 
# Result
{
    "ok": true,
    "_index": "log-2012-12-27",
    "_type": "hadoop",
    "_id": "kgfrarduRk2bKhzrtR-zhQ",
    "_version": 1
}

Deleting a Document

Code 11 below shows how to delete a document (a record in RDBMS terms) in type (a table). You can delete a hadoop type document with id=1 of the log-2012-12-27 index by using the DELETE method.

Code 11: A Query to Delete a Document.

# Request
curl -XDELETE http://localhost:9200/log-2012-12-27/hadoop/1
 
# Result
{
    "ok": true,
    "_index": "log-2012-12-27",
    "_type": "hadoop",
    "_id": "1",
    "found": true
}

Getting a Document

You can get a hadoop type document with id=1 of the log-2012-12-27 index by using the GET method as shown in Code 12.

Code 12: A Query to Get a Document.

# Request
curl -XGET http://localhost:9200/log-2012-12-27/hadoop/1
 
# Result
{
    "_index": "log-2012-12-27",
    "_type": "hadoop",
    "_id": "1",
    "_source": {
        "projectName": "hadoop",
        "logType": "hadoop-log",
        "logSource": "namenode",
        "logTime":"2012-12-27T02:02:02",
        "host": "host2.nelo2",
        "body": "org.apache.hadoop.hdfs.server.namenode.FSNamesystem"
    }
}

Search

When the Search API is called, ElasticSearch executes the query and returns the search results that match it. Code 13 shows examples of using the Search API.

Code 13: An Example Query of Using Search API.

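# All types of a specific index
$ curl -XGET 'http://localhost:9200/log-2012-12-27/_search?q=host:host2.nelo2'
 
# Specific types of a specific index
$ curl -XGET 'http://localhost:9200/log-2012-12-27/hadoop,apache/_search?q=host:host2.nelo2'
 
# A specific type of all indexes
$ curl -XGET 'http://localhost:9200/_all/hadoop/_search?q=host:host2.nelo2'
 
# All indexes and types
$ curl -XGET 'http://localhost:9200/_search?q=host:host2.nelo2'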

Search API by Using URI Request

Table 2: Main Parameters.
Name              Description
q                 Query string.
default_operator  The operator used as a default (AND or OR). The default is OR.
fields            The field to get as a result. The default is the "_source" field.
sort              Sort method. Ex) fieldName:asc/fieldName:desc.
timeout           Search timeout value. The default is "unlimited".
size              The number of result values. The default is 10.

With a URI request, you can search simply by combining a query string with the parameters in Table 2. Because a URI request does not expose all search options, it is mainly useful for testing.

Code 14: Search Query by Using URI Request.

# Request
$ curl -XGET 'http://localhost:9200/log-2012-12-27/hadoop/_search?q=host:host2.nelo2'
 
# Result
{
    "_shards":{
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits":{
        "total": 1,
        "hits": [
            {
                "_index": "log-2012-12-27",
                "_type": "hadoop",
                "_id": "1",
                "_source": {
                    "projectName": "hadoop",
                    "logType": "hadoop-log",
                    "logSource": "namenode",
                    "logTime":"2012-12-27T02:02:02",
                    "host": "host2.nelo2",
                    "body": "org.apache.hadoop.hdfs.server.namenode.FSNamesystem"
                }
            }
        ]
    }
}
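
To illustrate how several parameters from Table 2 combine in one URI request, here is a small sketch. The helper name search_uri, the choice of AND as the default operator, and sorting by logTime are assumptions made for this example only.

```shell
# Hypothetical helper: build a URI search request from an index, a type,
# a query string, and a result size. Only parameters listed in Table 2
# are used. Values are not URL-encoded, so keep them simple.
# $1 = index, $2 = type, $3 = query string, $4 = size
search_uri() {
    printf 'http://localhost:9200/%s/%s/_search?q=%s&default_operator=AND&sort=logTime:desc&size=%s\n' \
        "$1" "$2" "$3" "$4"
}

# Example (needs a running server, so it is left commented out):
# curl -XGET "$(search_uri log-2012-12-27 hadoop host:host2.nelo2 20)"
```

Quoting the generated URL matters in a shell, because an unquoted & would send the command to the background.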

Search API by Using Request Body

When you send the search conditions in the HTTP request body, you write them in the query DSL. The query DSL is extensive, so please refer to the guide on the official website for details.

Code 15: Search by Using Query DSL.

# Request
$ curl -XGET 'http://localhost:9200/log-2012-12-27/hadoop/_search' -d '{
    "query": {
        "term": { "host": "host2.nelo2" }
    }
}'
 
# Result
{
    "_shards":{
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits":{
        "total": 1,
        "hits": [
            {
                "_index": "log-2012-12-27",
                "_type": "hadoop",
                "_id": "1",
                "_source": {
                    "projectName": "hadoop",
                    "logType": "hadoop-log",
                    "logSource": "namenode",
                    "logTime":"2012-12-27T02:02:02",
                    "host": "host2.nelo2",
                    "body": "org.apache.hadoop.hdfs.server.namenode.FSNamesystem"
                }
            }
        ]
    }
}

Mapping

Put Mapping API

To add a mapping to a specific type, you can define a mapping in the form shown in Code 16.

Code 16: Query to Register a Mapping.

$ curl -XPUT 'http://localhost:9200/log-2012-12-27/hadoop/_mapping' -d '
{
    "hadoop": {
        "properties": {
            "projectName": {"type": "string", "index": "not_analyzed"},
            "logType": {"type": "string", "index": "not_analyzed"},
            "logSource": {"type": "string", "index": "not_analyzed"},
            "logTime": {"type": "date"},
            "host": {"type": "string", "index": "not_analyzed"},
            "body": {"type": "string"}
        }
    }
}'

Get Mapping API

To get defined mapping information, you can use a query in the form shown in Code 17.

Code 17: Query to Get a Mapping.

$ curl -XGET 'http://localhost:9200/log-2012-12-27/hadoop/_mapping'

Delete Mapping API

Code 18 shows an example of deleting a defined mapping.

Code 18: Query to Delete a Mapping.

$ curl -XDELETE 'http://localhost:9200/log-2012-12-27/hadoop'

How to Optimize Performance

Memory and the Number of Open Files

As the amount of data to search increases, you will need more memory, and many of the problems you encounter when running ElasticSearch are related to memory use. When a server runs ElasticSearch exclusively, the ElasticSearch community recommends allocating only half of the memory capacity to ElasticSearch and leaving the other half to the OS for the system cache. You can set the memory size through the ES_HEAP_SIZE environment variable or through the -Xms and -Xmx options of the JVM.

Code 19: Execution by Specifying Heap Size.

bin/elasticsearch -Xmx2g -Xms2g
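
The half-of-memory guideline above can be sketched as a small calculation. The helper name half_heap is hypothetical, and the total memory is passed in by hand here rather than detected, since detection differs between operating systems.

```shell
# Hypothetical helper: given the machine's total memory in MB,
# print half of it as a heap size string such as "8192m".
half_heap() {
    printf '%dm\n' "$(( $1 / 2 ))"
}

# Example for a dedicated 16 GB server (commented out; it would start a node):
# ES_HEAP_SIZE=$(half_heap 16384) bin/elasticsearch -f
```

Setting the minimum and maximum heap to the same value, as Code 19 does, avoids heap resizing while the node is running.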

When using ElasticSearch, you may see OutOfMemory errors frequently. This error occurs when the field cache exceeds the maximum heap size. If you change the index.cache.field.type setting from resident (the default) to soft, the cache will use soft references, so the garbage collector can reclaim the cache area first when memory runs low, and this problem can be resolved.

Code 20: Configuring Field Cache Type.

index.cache.field.type: soft

As the amount of data increases, the number of index files also increases. This is because Lucene, which ElasticSearch is built on, manages indexes in units of segments. Sometimes the number of files will even exceed the maximum number of open files allowed for the process. For this reason, you need to raise the open file limit by using the ulimit command. The recommended value is 32000-64000, but you may need a larger value depending on the size of the system or data.
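
A quick way to see whether the current limit is in the recommended range is to compare the output of ulimit -n with the lower bound. The sketch below assumes a POSIX shell; check_fd_limit is a hypothetical helper name.

```shell
# Hypothetical helper: report whether an open-file limit (as printed by
# `ulimit -n`) meets a wanted minimum. Prints "ok" or "too low".
# $1 = current limit, $2 = wanted minimum
check_fd_limit() {
    current=$1
    wanted=$2
    if [ "$current" = "unlimited" ] || [ "$current" -ge "$wanted" ]; then
        echo "ok"
    else
        echo "too low"
    fi
}

# Example: compare this shell's limit with the lower recommended bound.
# check_fd_limit "$(ulimit -n)" 32000
```

The limit has to be raised for the user that runs ElasticSearch and before the process starts, since a running process keeps the limit it was started with.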

Index Optimization

NELO2 manages indexes by date. When indexes are managed by date, you can easily and quickly delete old logs that no longer need to be kept, as shown in Code 21. This imposes less overhead on the system than deleting logs by specifying a TTL value for each document.

Code 21: Deleting an Index.

$ curl -XDELETE 'http://localhost:9200/log-2012-12-26/'

When index optimization is performed, segments are merged, which can enhance search performance. As index optimization can impose a burden on the system, it is better to perform it when the system is being used less.

Code 22: Index Optimization.

$ curl -XPOST 'http://localhost:9200/log-2012-12-27/_optimize'

Shards and Replicas

You can't change the number of shards after setting it. For this reason, you need to decide this value carefully by taking into account the current number of nodes in the system and the number of nodes expected to be added in the future. For example, if there are 5 nodes and the number is expected to reach 10 in the future, it is recommended to set the number of shards to 10 from the beginning. If you set it to 5 in the beginning, you can add 5 more nodes later, but the added 5 nodes will have no primary shards to hold. If you set the number of replicas to 1, of course, you can utilize the added 5 nodes as nodes exclusively for replication.
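
The arithmetic behind this example can be written down as a short sketch. It assumes that a node is only useful to an index if it holds at least one shard copy; shard_copies is a hypothetical helper name.

```shell
# Hypothetical helper: number of shard copies in an index (primaries plus
# replicas), which is also the largest number of nodes that can each
# hold at least one copy.
# $1 = number_of_shards, $2 = number_of_replicas
shard_copies() {
    echo $(( $1 * ($2 + 1) ))
}

# 5 shards, no replicas: only 5 nodes can hold data for the index.
# shard_copies 5 0
# 5 shards, 1 replica: 10 copies, so 5 added nodes can hold the replicas.
# shard_copies 5 1
```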

A larger number of shards is advantageous for processing a large amount of data, because each query is distributed across all of the shards. But you need to set this value appropriately, because performance can deteriorate due to increased traffic between nodes if the value is too high.

Configuring Cluster Topologies

The content of the configuration file of ElasticSearch is shown in Code 23 below. There are three types of nodes:

  1. data node
    This does not act as the master, and only stores data. When it receives a request from a client, it searches the data in its shards or indexes documents.
  2. master node
    This maintains the cluster, and passes indexing or search requests on to data nodes.
  3. search balancer node
    When this receives a search request, it requests data from the data nodes, gathers the data, and delivers the result.

You can have one node that functions as both a master and a data node. But if you use the three types of node separately, you can reduce the burden on the data nodes. In addition, if you configure the master node separately, you can improve the stability of the cluster. You can also reduce operating costs by using low-spec servers for the master and search nodes.

Code 23: Settings Related to Topology.

# You can exploit these settings to design advanced cluster topologies.
#
# 1. You want this node to never become a master node, only to hold data.
#    This will be the "workhorse" of your cluster.
#
# node.master: false
# node.data: true
#
# 2. You want this node to only serve as a master, to not store any data and
#    to have free resources. This will be the "coordinator" of your cluster.
#
# node.master: true
# node.data: false
#
# 3. You want this node to be neither a master nor a data node, but
#    to act as a "search load balancer" (fetching data from nodes,
#    aggregating results, etc.)
#
# node.master: false
# node.data: false

Figure 1 below shows the configuration of the NELO2 topology that uses ElasticSearch. The efficiency of equipment use and the stability of the entire cluster have been improved as follows: only ElasticSearch runs on the 20 data nodes (servers) so that they can achieve sufficient performance, while other daemon processes run alongside ElasticSearch on the 4 master nodes and 3 search nodes.

nhn_nelo2_elasticsearch_topologies.png

Figure 1: NELO2 ElasticSearch Topologies.

Configuring Routing

When a large amount of data needs to be indexed, increasing the number of shards will improve the overall performance. On the other hand, if the number of shards increases, the traffic among nodes will also go up. For example, when there are 100 shards, if it receives a single search request, it sends the request to all the 100 shards and aggregates data, and this imposes a burden on the entire cluster.

If you use routing, data will be stored only in a specific shard. Even if the number of shards increases, the application will still send a request only to a single shard, and consequently the traffic can be reduced dramatically. Figures 2, 3, and 4 are excerpted from the slides Rafal Kuc presented at Berlin Buzzwords 2012. If you don't use routing, as shown in Figure 2, the application will send a request to all the shards. But if you use routing, it will send a request only to a specific shard, as shown in Figure 3.

According to the material cited, when there are 200 shards, the response time is over 10 times faster with routing than without routing, as shown in Figure 4. When routing is applied, the number of threads increases by 10 to 20 times compared to when it is not applied, but the CPU usage is much lower. In some cases, however, performance is better without routing. For a search query whose results must be collected from multiple shards, it can be more advantageous to send the request to multiple shards. For this reason, NELO2 decides whether to use routing for each project depending on how the project's logs are used.

nhn_nelo_before_using_routing.png

Figure 2: Before Using Routing.

nhn_nelo2_after_using_routing.png

Figure 3: After Using Routing.

nhn_nelo2_performance_comparison_before_after_using_routing.png

Figure 4: Performance Comparison before and after Using Routing.
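
In the REST API, a routing value is passed as the routing request parameter, both when a document is indexed and when it is searched. The following is a minimal sketch that assumes the project name is used as the routing value; the article does not state which value NELO2 actually uses, and routed_url is a hypothetical helper name.

```shell
# Hypothetical helper: append a routing value to a request URL.
# $1 = URL (may already contain a query string), $2 = routing value
routed_url() {
    case $1 in
        *\?*) printf '%s&routing=%s\n' "$1" "$2" ;;  # URL already has parameters
        *)    printf '%s?routing=%s\n' "$1" "$2" ;;  # first parameter
    esac
}

# Examples (need a running server, so they are left commented out):
# curl -XPUT "$(routed_url http://localhost:9200/log-2012-12-27/hadoop/1 hadoop)" -d '{"host": "host2.nelo2"}'
# curl -XGET "$(routed_url 'http://localhost:9200/log-2012-12-27/_search?q=host:host2.nelo2' hadoop)"
```

The same routing value must be used for indexing and for searching; otherwise the search may be sent to a shard that does not hold the documents.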

Conclusion

The number of ElasticSearch users is increasing rapidly, thanks to its easy installation and high scalability. It has been only a few days since the release of the latest ElasticSearch version, 0.90. Its functionality is improving very quickly thanks to its active community. In addition, more and more companies are beginning to use ElasticSearch for their services. Recently, some committers, including the developer Shay Banon, got together and established ElasticSearch.com, which provides consulting and training services.

In this article I have explained the basics of installing ElasticSearch, using it, and tuning its performance. We have started testing the latest 0.90 release and will soon migrate our current 0.20.1 ES deployment. In the next post I will continue this topic and tell you about our experience with 0.90 as well as the critical split-brain problem we have previously experienced. Since solutions to this problem are scarce, I believe it will be very useful for our readers.
