Monday, May 30, 2016

Node.js CPU Intensive Joins

CPU Intensive

One of the first rules you'll hear when moving to node.js is that you should not do CPU intensive tasks with node.js.  This is usually followed by the classic examples of calculating prime numbers or floating point math.  But are there examples of CPU intensive processing in your typical node.js application?


Joins

As it turns out, there is a common code pattern in node.js that is CPU intensive: joins.  A join takes two arrays of objects and finds the matching object(s) between them.  Joins are one of the pillars of relational databases and have been optimized and improved for decades.  So why does node.js suffer from this age-old practice?  It turns out that our move to NoSQL databases has often shifted join operations into the application, in this case node.js.  If a NoSQL database is not designed in a way that duplicates or embeds the related data, there's no choice but to query both tables and join the data in application code.  There are many ways to join the data in node.js application code.  I'm going to demonstrate some of the ways to do inner joins in node.js.  As you'll see later, there's a stark difference in CPU performance between the various methods.
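For the examples that follow, picture two arrays of records fetched from separate collections, matched on a shared _id.  The data below is illustrative (the field names a-h mirror the snippets later in this post):

```javascript
// Hypothetical sample data: two "tables" fetched separately from a NoSQL
// store.  Each foo record matches the bar record with the same _id.
let foo = [
    { _id: 1, a: 'a1', b: 'b1', c: 'c1', d: 'd1' },
    { _id: 2, a: 'a2', b: 'b2', c: 'c2', d: 'd2' }
];
let bar = [
    { _id: 1, e: 'e1', f: 'f1', g: 'g1', h: 'h1' },
    { _id: 2, e: 'e2', f: 'f2', g: 'g2', h: 'h2' }
];
```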


Loops O(n * m)

One of the easiest ways to join the data is to simply loop over one array and, for each element, loop over the other array until you find the matching element.  While easy to write, its CPU usage leaves a lot to be desired.  The number of operations in the worst case is O(n * m): the outer loop always runs n times, and in the worst case each inner search scans all m items.  As you'll see, this has major performance implications on the CPU, even with fairly small n and m.

       
    let result = [];
    for (let i = 0; i < foo.length; i++) {
        let item = undefined;
        for (let j = 0; j < bar.length; j++) {
            if (foo[i]._id === bar[j]._id) {
                item = bar[j];
                break;
            }
        }
        // inner join: assumes every foo record has a match in bar;
        // if not, item stays undefined and the property reads below throw
        result.push({
            _id: foo[i]._id,
            a: foo[i].a,
            b: foo[i].b,
            c: foo[i].c,
            d: foo[i].d,
            e: item.e,
            f: item.f,
            g: item.g,
            h: item.h
        });
    }


Array.find O(n * m)

Another easy way to join the data is to use the ES6 Array.prototype.find method.  This is very similar to using two loops but offers a friendlier API: you implement only the matching function, and find handles the nastiness of the loop itself.  While easier to write than the raw loops, it has similar performance characteristics, plus the overhead of invoking the callback function for every comparison.

       
    let result = [];
    for (let i = 0; i < foo.length; i++) {
        let item = bar.find(function (b) { return b._id === foo[i]._id; });
        result.push({
            _id: foo[i]._id,
            a: foo[i].a,
            b: foo[i].b,
            c: foo[i].c,
            d: foo[i].d,
            e: item.e,
            f: item.f,
            g: item.g,
            h: item.h
        });
    }


Map O(n+n)

ES6 provides a new data structure called Map.  Map gives the developer a key-value data structure that is a first-class citizen in JavaScript.  No longer do you need to awkwardly use an object as a name-value bucket.  Also, the JavaScript runtime manages how these values are allocated and searched; for example, at small sizes the implementation might be a simple sorted structure, switching over to a hash table as the collection grows.
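The join snippet below assumes barMap has already been built from the bar array.  A minimal sketch of that step, one O(n) pass over bar (the sample data here is illustrative):

```javascript
// Build a Map keyed by _id so each join lookup is O(1) on average.
let bar = [
    { _id: 1, e: 'e1', f: 'f1', g: 'g1', h: 'h1' },
    { _id: 2, e: 'e2', f: 'f2', g: 'g2', h: 'h2' }
];
let barMap = new Map();
for (let i = 0; i < bar.length; i++) {
    barMap.set(bar[i]._id, bar[i]);
}
```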

       
    let result = [];
    for (let i = 0; i < foo.length; i++) {
        let item = barMap.get(foo[i]._id);
        result.push({
            _id: foo[i]._id,
            a: foo[i].a,
            b: foo[i].b,
            c: foo[i].c,
            d: foo[i].d,
            e: item.e,
            f: item.f,
            g: item.g,
            h: item.h
        });
    }

Object O(n+n)

Prior to ES6 we had to use an object as the key-value store.  Each key becomes a field name on the object, and the value of that field is the record itself.  To join with the object, we simply iterate over one of the arrays and look up the field that matches.
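As with the Map version, the snippet below assumes barMapObj was built beforehand.  A minimal sketch, again one pass over bar (sample data is illustrative; note that object keys are coerced to strings):

```javascript
// Build a plain object keyed by _id as a pre-ES6 lookup table.
let bar = [
    { _id: 1, e: 'e1', f: 'f1', g: 'g1', h: 'h1' },
    { _id: 2, e: 'e2', f: 'f2', g: 'g2', h: 'h2' }
];
let barMapObj = {};
for (let i = 0; i < bar.length; i++) {
    barMapObj[bar[i]._id] = bar[i];
}
```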

       
    let result = [];
    for (let i = 0; i < foo.length; i++) {
        let item = barMapObj[foo[i]._id];
        result.push({
            _id: foo[i]._id,
            a: foo[i].a,
            b: foo[i].b,
            c: foo[i].c,
            d: foo[i].d,
            e: item.e,
            f: item.f,
            g: item.g,
            h: item.h
        });
    }

Merge O(n)

One of the less obvious ways to join your data is a merge join.  It is similar in fashion to the loop join above, but it requires both of your arrays to be sorted by the join key before the merge.  To do the merge join, you walk both arrays in tandem until they are exhausted, comparing keys.  If the two keys are equal, you join the values and advance both arrays.  If they are not equal, you emit the record with the lesser key and advance only that array.

       
    let result = [];
    let i = 0;
    let j = 0;
    // both arrays must already be sorted by _id
    while (i < foo.length && j < bar.length) {
        if (foo[i]._id === bar[j]._id) {
            result.push({
                _id: foo[i]._id,
                a: foo[i].a,
                b: foo[i].b,
                c: foo[i].c,
                d: foo[i].d,
                e: bar[j].e,
                f: bar[j].f,
                g: bar[j].g,
                h: bar[j].h
            });
            i++;
            j++;
        } else if (foo[i]._id < bar[j]._id) {
            // outer join foo: no matching bar record
            result.push({
                _id: foo[i]._id,
                a: foo[i].a,
                b: foo[i].b,
                c: foo[i].c,
                d: foo[i].d,
                e: undefined,
                f: undefined,
                g: undefined,
                h: undefined
            });
            i++;
        } else {
            // outer join bar: no matching foo record
            result.push({
                _id: bar[j]._id,
                a: undefined,
                b: undefined,
                c: undefined,
                d: undefined,
                e: bar[j].e,
                f: bar[j].f,
                g: bar[j].g,
                h: bar[j].h
            });
            j++;
        }
    }
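The merge only works if both arrays arrive sorted by the join key.  If they don't, the sort itself costs O(n log n) and should be counted against the merge join.  A minimal sketch of that prerequisite step (sample data is illustrative; the comparator assumes numeric _id values):

```javascript
// Sort both arrays by _id before merging.  Array.prototype.sort mutates
// the arrays in place.
let foo = [{ _id: 2, a: 'a2' }, { _id: 1, a: 'a1' }];
let bar = [{ _id: 2, e: 'e2' }, { _id: 1, e: 'e1' }];
foo.sort(function (x, y) { return x._id - y._id; });
bar.sort(function (x, y) { return x._id - y._id; });
```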

The results

As you can see, the Array.find and loop joins quickly become inefficient and lock up the CPU for long periods of time.  Even at 10,000 items in each array it took about 1.4 seconds to perform the Array.find join and nearly 1 second for the loop join, making them unusable in websites or APIs with any meaningful performance requirements.  Remember, while the CPU is busy working on this request it cannot service any other requests.  Therefore, every in-progress request would wait up to 1.4 seconds for this one call to complete.

With 100,000 items in each array, the loop join and Array.find join took over 3 minutes to complete.  That kind of performance would make any background processing with near-real-time requirements very hard to meet.  It's hard to see in the graph due to the vast time differences, but the merge join was faster than both the Map and the object joins, which makes sense because the single merge pass does less work per element.  However, this doesn't take into account the time it takes to sort the two arrays.  Finally, the Map join was slightly faster than the object join.
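A rough way to reproduce numbers like these is to time each join over generated data.  The sketch below is not the original test harness; the sizes, field names, and the use of Date.now for timing are all illustrative:

```javascript
// Hypothetical micro-benchmark: time the Map join over generated data.
let n = 10000;
let foo = [];
let bar = [];
for (let i = 0; i < n; i++) {
    foo.push({ _id: i, a: i });
    bar.push({ _id: i, e: i });
}

// Build the lookup Map, then time only the join itself.
let barMap = new Map();
for (let i = 0; i < bar.length; i++) {
    barMap.set(bar[i]._id, bar[i]);
}

let start = Date.now();
let result = [];
for (let i = 0; i < foo.length; i++) {
    let item = barMap.get(foo[i]._id);
    result.push({ _id: foo[i]._id, a: foo[i].a, e: item.e });
}
console.log('map join of ' + n + ' items took ' + (Date.now() - start) + ' ms');
```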

Which brings us to the final takeaway: for the vast majority of joins in node.js you should use the ES6 Map.  It is fast and has a clean API.

Gist

* All code was run against node.js v4.2.2

Sunday, August 18, 2013

Jenkins build events integrated into Show Slow

Show Slow allows custom events to be graphed along with all of your normal metrics.  Seeing your various performance metrics graphed with your Jenkins builds labeled on the graph is extremely useful: you can compare the metrics across your builds.

To add the Jenkins builds as custom events in Show Slow, simply add a shell build step to your build job.  In the command field add the following bash script.



Set the URL_PREFIX variable to your website's scheme and domain.  Remember to include a trailing /.
Set the RESOURCE_URL variable to the path of your sub-page.
Set the SHOW_SLOW_URL variable to the location of your Show Slow beacon events page, without any parameters.


Friday, August 16, 2013

Ubuntu ShowSlow setup

Show Slow is an excellent website that you can use to track your website's performance metrics over time.  You can have your stats sent to the publicly available www.showslow.com site, or install your own instance so your metrics are kept private.  The installation and configuration guide is here, but it's lacking some of the details for Ubuntu.  So here's a step-by-step way to set up Show Slow on Ubuntu 13.04.

  1. sudo apt-get install apache2          #installs the apache webserver
  2. sudo apt-get install php5             #installs php5
  3. sudo apt-get install php5-mysql       #installs php's mysql support
  4. sudo apt-get install php5-mcrypt      #installs php encryption support
  5. sudo apt-get install curl php5-curl   #installs curl & php curl support
  6. sudo apt-get install mysql-server     #installs mysql
  7. mysql -u root -p  #fill in your password
    1. GRANT USAGE ON *.* TO 'showslowuser'@'%' IDENTIFIED BY PASSWORD '-- Your Password Here!--';
    2. GRANT ALL PRIVILEGES ON `showslow`.* TO 'showslowuser'@'%';
    3. exit
  8. cd /var/www
  9. sudo wget http://www.showslow.org/downloads/showslow_1.2.tar.bz2   #download Show Slow
  10. sudo tar xvfj showslow_1.2.tar.bz2   #uncompress Show Slow
  11. cd showslow_1.2
  12. sudo cp config.sample.php config.php
  13. sudo vi config.php
    1. set your timezone as one of the IANA timezones
    2. set your db connection information to the showslowuser from #7
  14. make
  15. /etc/init.d/apache2 restart
Your private instance of Show Slow should be available now at http://localhost/showslow_1.2.  To configure Firefox to send data to your instance of Show Slow follow the guide at http://localhost/showslow_1.2/configure.php.

Monday, June 10, 2013

Updating an expired WinRM HTTPS self-signed cert

WinRM is the best API for managing Windows remotely.  However, the configuration and maintenance of WinRM leaves a lot to be desired.  I've switched over to using a fantastic set of powershell scripts for configuring WinRM found here.

The certs generated by the selfsignedcert.ps1 have a default expiration of 90 days.  Therefore, you'll need to update the certs in short order if you use this setup.

Here's an easy step by step way to update the self signed certs.

  1. download the windows 7.1 SDK here
  2. copy the makecert.exe from the SDK bin to your server's c:\temp\
  3. open a command prompt to c:\temp
    1. makecert.exe -sk "%YOURHOSTNAME%" -ss My -sr LocalMachine -r -n "CN=%YOURHOSTNAME%" -a sha1 -eku "1.3.6.1.5.5.7.3.1"
  4. open the mmc certificates snap in & copy the new cert's thumbprint to notepad & remove the spaces
  5. open powershell
    1. winrm enumerate winrm/config/Listener
    2. copy down the address you used for the HTTPS Listener
    3. run winrm delete winrm/config/Listener?Address=%YOURADDRESS%+Transport=HTTPS
  6. open command prompt
    1. winrm create winrm/config/Listener?Address=%YOURADDRESS%+Transport=HTTPS @{Hostname="%YOURHOSTNAME%";CertificateThumbprint="%YOURTHUMBPRINT%"}
Hopefully this saves someone else a day's worth of work.


Tuesday, November 30, 2010

TFS 2010 BlueGreen Deployment with Amazon Elastic Load Balancer

One of the best ways to minimize the outage taken while deploying a website or web service is to use the Blue-Green deployment model. I've created a TFS 2010 deployment workflow that uses Amazon EC2 to do just that.

Step 1: Download and install the Amazon Elastic Load Balancer API tools and install on your build box. Make sure you are able to use the command line on your build box to register and deregister Amazon EC2 instances before continuing.

Step 2: Add BlueInstanceIds and GreenInstanceIds String[] arguments to your build WF.

Step 3: Add a LoadBalancerName String argument to your build WF.

Step 4: Add AWSAccessKeyId and AWSSecretAccessKey String arguments to your build WF.

Step 5: Add a Parallel ForEach or a ForEach to your build WF that iterates over your GreenInstanceIds.

Step 6: Add an InvokeProcess activity to the body of your ForEach from step 5. The FileName property should point to elb-deregister-instances-from-lb.cmd. The Arguments should be set to LoadBalancerName & " --instances " & GreenServerInstanceId & " --I " & AWSAccessKeyId & " --S " & AWSSecretAccessKey

Step 7: Set your variables in your build definition.

Step 8: Run your build.


This should successfully deregister any servers in the GreenInstanceIds variable. Follow the same approach for registering your green servers, as well as registering/deregistering your blue servers.

TFS 2010 InvokeProcess & command line commands

The InvokeProcess activity is an activity that you can use to start an external process from the build workflow. There seems to be quite a bit of documentation online stating that you can set the FileName property to various DOS commands (xcopy, move, del, etc). However, I've never been able to get this to work. I was, however, able to get c:\windows\system32\cmd.exe /c [YOUR COMMAND HERE] to work. This makes sense: InvokeProcess just wraps the .NET Process class, which launches a Windows process, not a command shell.