
ElasticSearch Aggs Save the Day

I’ve been working with ElasticSearch quite a bit over the last couple of months.  As with many others, I was introduced to ES by Logstash.  I’m more and more impressed every time I sit down to hammer out a new query.  They have clearly been working hard over in the ES shop.

A lot of the stuff that I want to look at is time-based, and if there was one key limitation in ElasticSearch 0.90.x, it was the inflexibility of date histogram facets.  They were very helpful when I wanted to look at things in two dimensions, but the wheels fell off as soon as I wanted to look at three or more.

I posted a StackOverflow question the other day to see if I was missing something in my approach to a multi-faceted problem, and aggregations in ES 1.0 were suggested as a possible solution.  I had read a little about them and was excited to use them, so I mentally prepared myself for a tedious lab upgrade and for reworking some existing code after the migration.  As it turns out, the move went very smoothly.

The only thing I really changed was the Logstash output: based on a couple of posts I found while exploring the topic, I switched from the elasticsearch output to the elasticsearch_http output (sketched below).  Our code passes its existing tests, indexing appears to be humming along just fine, and, as suggested, aggregations solve the problem I was facing.
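
For reference, the relevant Logstash output section ended up along these lines.  This is a minimal sketch rather than my exact config; the host and port are placeholders for your own Elasticsearch endpoint:

   output {
     # Index over HTTP via the elasticsearch_http output instead of the
     # node-protocol elasticsearch output; host/port are placeholder values.
     elasticsearch_http {
       host => "127.0.0.1"
       port => 9200
     }
   }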

The Problem

Let’s say I have a doc structure like the one below, 100 different hosts reporting in, and I want to find the sum of the maximum and average values of “total_widgets” across all hosts, by hour:

{
  "_index": "logstash-2014.02.06",
  "_type": "xyz",
  "_id": "HZ_2oaGvQvKWvsOLyYrGrw",
  "_score": 1,
  "_source": {
    "@version": "1",
    "@timestamp": "2014-02-05T16:01:01.260-08:00",
    "type": "log",
    "host": "compute-4.lab.solinea.com",
    "received_at": "2014-02-05 21:01:01 UTC",
    "total_widgets": 24
  }
}

Old Approach

In short, I never did get the facet-based approach to work at scale, but I suspect it would have involved 100 queries (one for each host) and a bunch of post-processing (also not very scalable).  The other approach I tried was to use crossfilter.js to do the bucketing and aggregation.  That was successful on very small datasets, but eventually I ran into a problem with the amount of data that had to be pulled across the wire, and I didn’t have the time to write the code to really take advantage of crossfilter’s ability to stream data through the filter.

Using Aggregations

It turns out that this is a pretty simple thing to do with aggregations, since they can be hierarchical.  The basic approach I took was to bucket the logs by host first, then bucket each host’s logs by date, then apply the max and avg metric aggregations to each host-by-date bucket.

The query for searching one day and bucketing by hour looks like this:

{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "@timestamp": {
              "from": "2014-02-07T00:00:00.000-00:00",
              "to": "2014-02-07T23:59:59.999-00:00"
            }
          }
        },
        {
          "term": {
            "type": "log"
          }
        }
      ]
    }
  },
  "aggs": {
    "events_by_host": {
      "terms": {
        "field": "host.raw"
      },
      "aggs": {
        "events_by_date": {
          "date_histogram": {
            "field": "@timestamp",
            "interval": "hour"
          },
          "aggs": {
            "max_total_widgets": {
              "max": {
                "field": "total_widgets"
              }
            },
            "avg_total_widgets": {
              "avg": {
                "field": "total_widgets"
              }
            }
          }
        }
      }
    }
  }
}

Here’s a subset of the returned result:

"aggregations": {
  "events_by_host": {
    "buckets": [
      {
        "key": "compute-1.lab.solinea.com",
        "doc_count": 166,
        "events_by_date": {
          "buckets": [
            {
              "key": 1391806800000,
              "doc_count": 46,
              "max_total_widgets": {
                "value": 24
              },
              "avg_total_widgets": {
                "value": 24
              }
            },
            {
              "key": 1391810400000,
              "doc_count": 60,
              "max_total_widgets": {
                "value": 24
              },
              "avg_total_widgets": {
                "value": 24
              }
            },
            {
              "key": 1391814000000,
              "doc_count": 60,
              "max_total_widgets": {
                "value": 24
              },
              "avg_total_widgets": {
                "value": 24
              }
            }
          ]
        }
      },
      {
        "key": "compute-2.lab.solinea.com",
        "doc_count": 164,
        "events_by_date": {
          "buckets": [
            {
              "key": 1391806800000,
              "doc_count": 44,
              "max_total_widgets": {
                "value": 24
              },
              "avg_total_widgets": {
                "value": 24
              }
            },
            {
              "key": 1391810400000,
              "doc_count": 60,
              "max_total_widgets": {
                "value": 24
              },
              "avg_total_widgets": {
                "value": 24
              }
            },
            {
              "key": 1391814000000,
              "doc_count": 60,
              "max_total_widgets": {
                "value": 24
              },
              "avg_total_widgets": {
                "value": 24
              }
            }
          ]
        }
      }
    ]
  }
}

It doesn’t get me all the way to the finish line, but the finish line is definitely within reach.  With this I could either stack the data in a graph, or do some light post-processing on the server or the client to sum the max and average values for each time slice and present a total view.  You could also reverse the aggregations to get a dataset with the time values as the outer buckets, which might make things easier depending on how you want to use the data; a sketch of that reversed query follows.
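
Reversing the nesting is just a matter of swapping the two bucket levels in the query above.  As a rough sketch (only the aggs section changes; the query clause stays the same), it would look something like this:

   "aggs": {
     "events_by_date": {
       "date_histogram": {
         "field": "@timestamp",
         "interval": "hour"
       },
       "aggs": {
         "events_by_host": {
           "terms": {
             "field": "host.raw"
           },
           "aggs": {
             "max_total_widgets": {
               "max": {
                 "field": "total_widgets"
               }
             },
             "avg_total_widgets": {
               "avg": {
                 "field": "total_widgets"
               }
             }
           }
         }
       }
     }
   }

With that shape, each hourly bucket carries its own per-host breakdown, so summing the max and average values for a time slice becomes a simple walk over one bucket’s children.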

 

Author: John Stanford