Monday, May 30, 2016

Node.js CPU Intensive Joins

CPU Intensive

One of the first rules you'll hear when moving to node.js is that you should not do CPU intensive tasks with node.js.  This is usually followed up with the classic examples of calculating prime numbers or floating point mathematics.  Are there examples of CPU intensive processing in your typical node.js application?


Joins

As it turns out, there is a common code pattern in node.js that is CPU intensive, joins.  Joins are when you have two arrays of objects and want to find the matching object(s) in the two arrays.  They're one of the pillars of relational databases and have been optimized and improved for years.  So, why is it that node.js suffers from the age old practice of joining.  It turns out that our move to NoSQL databases has oftentimes moved the join operations to the application, in this case node.js.  If a NoSQL database is not designed in a way that duplicates or embeds the related data, there's no choice but to query both tables and join the data in application code. There's many ways in which we can join the data in node.js application code.  I'm going to demonstrate some of the ways to do inner joins in node.js.  As you'll see later, there's a stark difference in CPU performance between the various methods.


loops O(n * m)

One of the easiest ways to join the data is to simply loop over one array and for each element in the array loop over the other array until you find the matching element.  While easy to write, it's CPU usage leaves a lot to be desired.  The number of operations in the worst case is O(n * m).  This means that we will at least have n operations and the second loop will be at worst searched through all items.  As you'll see this causes some major performance implications on the CPU, even with very small n and m sizes.

       
    let result = [];
    for (let i =0; i < foo.length; i++) {
        let item = undefined;
        for (let j =0; j < bar.length; j++) {
            if (foo[i]._id === bar[j]._id) {
                item = bar[j];
                break;
            }
        }
        result.push({
            _id: foo[i]._id,
            a: foo[i].a,
            b: foo[i].b,
            c: foo[i].c,
            d: foo[i].d,
            e: item.e,
            f: item.f,
            g: item.g,
            h: item.h
        });
    }


Array.find O(n * m)

Another easy way to join the data is to use the ES6 Array.find method.  This is very similar to using two loops but offers an easier API for the developer.  You just need to implement the matching function and the find method handles the nastiness of dealing with the loops. While easier to write than the loops, it has similar performance metrics to the loops plus the cost of pushing and popping the matching function on the stack.

       
    let result = [];
    for (let i =0; i < foo.length; i++) {
        var item = bar.find(function (b) { return b._id === foo[i]._id; });
        result.push({
            _id: foo[i]._id,
            a: foo[i].a,
            b: foo[i].b,
            c: foo[i].c,
            d: foo[i].d,
            e: item.e,
            f: item.f,
            g: item.g,
            h: item.h
        });
    }


Map O(n+n)

ES6 provides a new data structure called the Map.  Map allows the developer to use a key value data structure that is a first class citizen in javascript.  No longer do you need to awkwardly use an object as a name value bucket.  Also, the javascript runtime will manage the best ways to allocate and search these values. For example, at smaller numbers the implementation might just be a sorted set but at larger numbers it will switch over to a hash table.

       
    let result = [];
    for (let i =0; i < foo.length; i++) {
        var item = barMap.get(foo[i]._id);
        result.push({
            _id: foo[i]._id,
            a: foo[i].a,
            b: foo[i].b,
            c: foo[i].c,
            d: foo[i].d,
            e: item.e,
            f: item.f,
            g: item.g,
            h: item.h
        });
    }

Object O(n+n)

Previously to ES6 we had to use to use the object as a key value store.  The key would be set as the field name on an object and the value of the field set to the value.  To join with the object, we simply iterate across one of the arrays and find the corresponding field that matches.

       
    let result = [];
    for (let i =0; i < foo.length; i++) {
        var item = barMapObj[foo[i]._id];
        result.push({
            _id: foo[i]._id,
            a: foo[i].a,
            b: foo[i].b,
            c: foo[i].c,
            d: foo[i].d,
            e: item.e,
            f: item.f,
            g: item.g,
            h: item.h
        });
    }

Merge O(n)

One of the not so obvious ways to join your data is through a merge join.  It is similar in fashion to the loop join above but it requires both of your arrays to be sorted by a key before the merge.  To do the merge join, you iterate until both arrays are empty and check equality of the keys.  If the two keys equal then you join the values.  If the keys do not equal you add the key that is less than the other key.

       
    let result = [];
    for (let i =0; i < elementCount; i++) {
        if (foo[i]._id === bar[i]._id) {
            result.push({
                _id: foo[i]._id,
                a: foo[i].a,
                b: foo[i].b,
                c: foo[i].c,
                d: foo[i].d,
                e: bar[i].e,
                f: bar[i].f,
                g: bar[i].g,
                h: bar[i].h
            });
        } else if (foo[i]._id < bar[i]._id) {
            // outer join foo
            result.push({
                _id: foo[i]._id,
                a: foo[i].a,
                b: foo[i].b,
                c: foo[i].c,
                d: foo[i].d,
                e: undefined,
                f: undefined,
                g: undefined,
                h: undefined
            });
        } else {
            // outer join foo
            result.push({
                _id: bar[i]._id,
                a: undefined,
                b: undefined,
                c: undefined,
                d: undefined,
                e: bar[i].e,
                f: bar[i].f,
                g: bar[i].g,
                h: bar[i].h
            });
        }
    }

The results

As you can see the array.find and the loop joins quickly become inefficient and lock up the CPU for long periods of time.  Even at 10,000 items in each array it took about 1.4 seconds to perform an array.find join and nearly 1 second for the loop join.  Making them unusable in websites or APIs that have even the highest of performance requirements.  Remember with the CPU working on this request it cannot complete any other requests.  Therefore, all in progress requests would be waiting the nearly 1.4 seconds for this one call to complete.

With 100,000 items in each array, the loop join and array.find join were over 3 minutes to complete.  This type of performance would make any background processing that has near time requirements very hard to complete.  It's hard to see in the graph due to the vast time differences but the merge sort was faster than both the map or the object joins.  Which makes sense because it is O(n) versus O(n+n) complexity.  However, this doesn't take into account the time it takes to sort the two arrays.  Finally, the map join was slightly better than the map object join.

Which makes the final take away this, for the vast majority of joins in nodejs you should use the Map object.  The ES6 Map object is faster and has a cleaner API.

Gist

* All code ran against node.js v4.2.2

No comments: