
How We Boosted Counting Performance by 7410x with Redis


This is the second of a 2-part post about how we improved query performance on our analytics dashboard by over 7000x, all just by moving some of our data from MySQL to Redis. Part 1 was a technical explanation of the setup, while this part shows the benchmarks we saw when comparing data fetches from both systems.


We use Redis a lot. It is fast, stable, effective and awesome! This time, we found Redis useful in solving a painful problem: counting unique users for multiple different filters.

We recently found a new feature in Redis (new for us, at least): HyperLogLog. HyperLogLog is a probabilistic data structure that makes estimating the number of distinct objects in a set very fast (actually, more like blazing fast), but with a minor standard error (you can read more about it here). The moment we read about HyperLogLog we knew there was something to it, and now that Redis has made it so easy to use, our testing started almost immediately.
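
To get a feel for how little code this takes, here is a minimal sketch of the basic commands, assuming the node-redis v4 client (key and element names are purely illustrative):

// Minimal HyperLogLog sketch (node-redis v4; names are illustrative).
const { createClient } = require('redis');

async function demo() {
  const client = createClient(); // defaults to localhost:6379
  await client.connect();

  // PFADD records elements into the HyperLogLog stored at "visitors:demo".
  // Adding the same element twice does not change the count.
  await client.pfAdd('visitors:demo', ['user:1', 'user:2', 'user:1']);

  // PFCOUNT returns an estimate of the number of distinct elements,
  // with a standard error of roughly 0.8%, using only ~12 KB per key.
  console.log(await client.pfCount('visitors:demo')); // -> 2

  await client.quit();
}

demo().catch(console.error);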

We Want Real-Time Data

Until now, we kept all the data about unique users in MySQL. The data was saved in different variations, ready for filtering (country, day …). As time went by, our queries became slower and slower. It was a pretty grim situation: all our different optimizations on MySQL showed us there was no real solution there. We were advised to take many different approaches using Redshift, Hadoop or ElasticSearch, but we didn't want our data presented to our users with any delay. We wanted a complete, real-time data presentation in our dashboard, updated instantly by our background workers.

Redis to The Rescue

Once we had Redis running and had migrated the MySQL data in, the results were astonishing. We had been tweaking MySQL to try to make distinct counting faster for a couple of months, and the results were mediocre at best (not that it was MySQL's fault; we were counting cardinality in tables with 10 million+ rows), but Redis was FAST. Speed wasn't the only thing we had to benchmark, though: we weren't sure how well the 0.8% standard error Redis promises for HyperLogLog would hold up when we ran queries on our own data.

MySQL is Underperforming

To get us started, here is a benchmark of some of the many different ways we tried tweaking MySQL specifically for COUNT DISTINCT:

[Chart: MySQL COUNT DISTINCT query times across the different tweaks]

We tried different query and index structures; here are the conclusions we drew from the process:

  • SELECT COUNT(*) FROM (SELECT * GROUP BY id) consistently performed better than SELECT COUNT(DISTINCT id).
  • MySQL Workbench is awesome.
  • With 10M+ rows, and more arriving every day, MySQL just wasn't the tool for counting the cardinality of our user data.

Revelation of Goodness

Once we migrated all of our MySQL data into Redis keys, we saw Redis zip by MySQL in the blink of an eye.

[Chart: MySQL vs. Redis query times on the same scale]

There's no mistake in that graph. We charted the performance times of both MySQL and Redis on the same graph, but you probably can't see Redis's values there. Here's a close-up of the Redis performance times.

[Chart: Redis query times, close-up]

Amazing!

The Fly in The Ointment

It couldn't all be this good. HyperLogLog only gives an estimate, so it was time to compare the estimates to the actual MySQL counts. For most queries, the difference was much smaller than the 0.8% standard error (the smallest was 0.03%), but after benchmarking many different queries, we also had two that reached errors of 1.1% and 1.7%.

[Chart: deviation between Redis HyperLogLog estimates and exact MySQL counts]

In the end, these error deviations were acceptable for some of our use cases. We’re still saving exact counts outside of Redis … Just in case.
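
If you want to run a similar sanity check yourself, a rough sketch of that kind of cross-check could look like this (assuming a connected node-redis v4 client; getExactCount is a hypothetical stand-in for wherever your exact counts live):

// Hypothetical cross-check of an HLL estimate against an exact count kept elsewhere.
async function checkDeviation(client, key, getExactCount) {
  const estimate = await client.pfCount(key); // HyperLogLog estimate
  const exact = await getExactCount(key);     // exact count, e.g. from MySQL
  const errorPct = (Math.abs(estimate - exact) / exact) * 100;
  console.log(key + ': estimate=' + estimate + ', exact=' + exact + ', error=' + errorPct.toFixed(2) + '%');
  return errorPct;
}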

HyperLogLog is a very powerful tool for counting unique entities. You should definitely use it if you’re willing to accept its minor standard error.


4 Comments

  1. Keith W.

    April 29, 2015 at 11:17 pm

    Can you offer any indication of how you built your data structures? Maybe an example of how the keys are used/named?

    We have just recently started using Redis for a bunch of things and I think HLL could be very useful for DAU in our case.


    • Matan Kulikovsky

      May 2, 2015 at 3:44 pm

      Yeah, no problem.

      We split keys by the types of groups we're interested in. To keep things simple, you can use day and country code.

      So, for example, you would have 20150501-US for visitors from the United States on May 1st.
      To add a user, just use pfadd('20150501-US', <user_id>) every time the user visits; they will only be counted once per key.

      To get aggregated results for a few dates and/or countries, use pfmerge, which gives you one count per user for the merged result as well, then pfcount on the key you merged into.
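
      As an illustration, that scheme might look like this with the node-redis v4 client (the user ids and key names are just examples):

      // Day+country HyperLogLog keys, as described above (illustrative names).
      const { createClient } = require('redis');

      async function example() {
        const client = createClient();
        await client.connect();

        // One HyperLogLog per day+country; repeat visits by the same user
        // only count once per key.
        await client.pfAdd('20150501-US', 'user:42');
        await client.pfAdd('20150502-US', 'user:42');
        await client.pfAdd('20150502-DE', 'user:99');

        // Aggregate a few keys: PFMERGE into a destination key, then PFCOUNT
        // the merged key for a de-duplicated total.
        await client.pfMerge('tmp:US-may', ['20150501-US', '20150502-US']);
        console.log(await client.pfCount('tmp:US-may')); // -> 1 (user:42 counted once)

        await client.quit();
      }

      example().catch(console.error);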


  2. Oliver

    January 25, 2016 at 1:24 am

    Hi, just a quick question.
    Say you want to find the number of requests in the last seven days. Do you need to have a function which builds all the key names?

    I'm looking to do per-minute tracking, and that's a lot of key names for a week or a month.


    • Matan Kulikovski

      February 3, 2016 at 11:11 pm

      Hi Oliver,

      Your problem actually seems pretty simple: minute/hour/day are a known set of numbers, so building the strings is easy with nested loops.

      For example (just a general idea):

      var keys = [];
      // Build one key name per minute for a 7-day window (general idea only).
      for (var day = 1; day <= 7; day++) {
        for (var hour = 0; hour <= 23; hour++) {
          for (var minute = 0; minute <= 59; minute++) {
            keys.push("2016-01-" + day + "-" + hour + "-" + minute);
          }
        }
      }

      Now you have 7 days' worth of minute-by-minute key names inside keys[], and you can use those for pfmerge before pfcount.
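
      For completeness, the merge step might look something like this (a sketch assuming a connected node-redis v4 client and the keys array from the loop above):

      // Hypothetical merge of the per-minute keys into one weekly total.
      async function weeklyUniqueCount(client, keys) {
        // PFMERGE combines all the minute-level HLLs into one destination key,
        // so each visitor is still counted only once across the whole week.
        await client.pfMerge('weekly-requests', keys);
        return client.pfCount('weekly-requests');
      }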

