Background and Purpose

A client reached out to us to investigate a performance problem in their Node application. After two days of intensive instrumentation, we traced every symptom back to a single root cause.

Project Description

Our client had spent months converting their existing Rails application to Node. The switch was motivated by a desire to improve performance: the Rails application was serving around 1.4k requests per second, and their goal was to double that figure. Node looked like one of the best options for getting there.

But once they deployed the rewritten application to production, performance was actually worse! Response times were on par with, or slower than, those of the Rails application. And to make matters worse, the application kept crashing and required frequent restarts. After exhausting their options, they turned to us for a second look.

After a brief introduction, we dove right in. Initial tests indicated that the server was fast, but under load it was a different story. With hundreds of concurrent requests, the server would hold on to all of them before responding, so response times climbed as more requests arrived. Occasionally something would cause the server to flush all of its queued responses, and then the queue would start building back up again.

We saw that each incoming request spawned fifteen requests to internal microservices. That, in and of itself, is not a problem, and is something Node should handle without issue. But with the response queue backing up, these calls took longer and longer, starting at around 500ms and climbing to 20s to complete. In the more brutal test scenarios, Node would crash with a “JavaScript heap out of memory” error. These out-of-memory errors were likely the cause of the application’s random restarts.

Thinking this might be the source of the problem, we took a hard look at that integration point. The microservice integration was built on the Fetch API, backed by Promises. Because the application was server-side rendered, it relied on polyfills that provide the Fetch API where it is not available natively. The microservice responses were then fed through Redux to initialize the application’s state.
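
To make the discussion concrete, here is a rough sketch of what that integration point looks like. This is our own illustration, not the client’s code: the endpoint list, the root reducer, and the use of an isomorphic fetch polyfill are all assumptions.

```javascript
// Hypothetical sketch of the integration point, not the client's actual code.
require('isomorphic-fetch');               // polyfill: provides fetch() on the server
const { createStore } = require('redux');
const rootReducer = require('./reducers'); // hypothetical root reducer

// Each incoming request fanned out to roughly fifteen internal microservices.
const ENDPOINTS = [
  'http://internal/users',
  'http://internal/catalog',
  // ...and so on for the remaining services
];

async function buildInitialState() {
  // Fire all microservice calls in parallel and wait for every response.
  const responses = await Promise.all(ENDPOINTS.map((url) => fetch(url)));
  const payloads = await Promise.all(responses.map((res) => res.json()));

  // Feed the combined responses through Redux to initialize application state.
  const store = createStore(rootReducer, Object.assign({}, ...payloads));
  return store.getState();
}
```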

The first step of any optimization work should be to gather data. Only with data can you say with any certainty that you have optimized the right places. So before changing anything, we added instrumentation to measure the time each line of code took to run.
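
The instrumentation itself was nothing exotic. A minimal sketch of the idea, with hypothetical labels and call sites, looks something like this:

```javascript
// Minimal timing helper: wrap a step, log how long it took.
// The labels and wrapped calls below are hypothetical.
async function timed(label, fn) {
  const start = process.hrtime.bigint();
  try {
    return await fn();
  } finally {
    const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
    console.log(`${label}: ${elapsedMs.toFixed(1)}ms`);
  }
}

// Usage: wrap each suspect step to see where the time actually goes.
// await timed('fetch user service', () => fetch('http://internal/users'));
// await timed('build initial state', () => buildInitialState());
```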

With data in hand, we started stripping out the internals of the microservice integration to narrow down the source of the slowdown. We removed the server-side rendering logic, removed Redux, and removed the polyfills, ruling out as much third-party code as possible. Unfortunately, none of that solved the problem. On the plus side, we now knew those abstractions were quite fast and efficient!

We then turned to the Fetch API itself and the Promises it returns. Those are native abstractions, but abstractions nonetheless. So we went one level lower and made the network requests with Node’s http module directly. That, too, was a dead end.
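
For reference, dropping down to the http module looks roughly like the sketch below; the host, path, and callback shape are placeholders for illustration.

```javascript
// Making a request with Node's low-level http module instead of fetch.
const http = require('http');

function getJSON(path, callback) {
  const req = http.request(
    { host: 'internal', port: 80, path, method: 'GET' }, // placeholder host/port
    (res) => {
      let body = '';
      res.setEncoding('utf8');
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => callback(null, JSON.parse(body)));
    }
  );
  req.on('error', callback);
  req.end();
}
```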

We had pared down the response code. We had removed third-party libraries. We had used the lowest-level native Node APIs. We still saw no significant improvement in response times. The only other place to go was up. Up into Hapi’s route-handling code.

To test the theory that the issue lived inside Hapi, we rewrote our test case to use Node’s built-in HTTP server. Tests against that were lightning fast, as expected, confirming our suspicion. Digging into Hapi’s internals, we found a potential culprit: Hapi uses a custom event emitter that depends on setImmediate. Before diving any further, we decided to take a look at Hapi’s open issues, and there we found our smoking gun in Issue #3347.
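
The control case really was that bare. Something along these lines, where the handler body stands in for the pared-down integration code:

```javascript
// A control server built on Node's http module alone, bypassing Hapi entirely.
const http = require('http');

http.createServer((req, res) => {
  // Placeholder for the stripped-down integration code we were testing.
  res.writeHead(200, { 'Content-Type': 'application/json' });
  res.end(JSON.stringify({ ok: true }));
}).listen(3000, () => {
  console.log('control server listening on :3000');
});
```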

GitHub user DavidTPate laid out the source of the problem: setImmediate “schedules the ‘immediate’ execution of the callback after I/O events’ callbacks…”

It's possible to saturate Node. It's possible that since setImmediate() defers its work, the flood of incoming requests could prevent the setImmediate() callbacks from executing causing memory consumption to increase.

@cjihrig

This means that any pending I/O takes priority over the deferred work needed to finish responding to earlier requests. The server never gets a chance to answer old requests because it is too busy handling new ones; it holds on to them all until something breaks loose and lets it release them. If it never gets the opportunity to respond, the application runs out of memory and crashes.

process.nextTick, on the other hand, adds the callback to the “next tick queue”: once the current turn of the event loop runs to completion, everything in that queue runs before any new I/O is processed. That is a much better fit for our specific problem. Whether setImmediate and callback deferment are truly necessary is still under discussion upstream, but downgrading Hapi to an earlier version that used process.nextTick solved the issue for our client. Their application went from 1.4k req/s to over 6k req/s, far exceeding their goal!
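
The difference in scheduling is easy to see in isolation. The following demo is our own illustration (not taken from the Hapi issue); it schedules both kinds of callback from inside an I/O callback:

```javascript
// Scheduling order demo: process.nextTick vs. setImmediate inside an I/O callback.
const fs = require('fs');

fs.readFile(__filename, () => {
  setImmediate(() => console.log('setImmediate: deferred until after I/O callbacks'));
  process.nextTick(() => console.log('nextTick: runs before the event loop continues'));
});
// The nextTick callback always logs first.
```

Under a flood of incoming connections, that ordering is exactly what allowed new requests to crowd out the deferred work needed to finish the old ones.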

Outcome

We provided our client with a detailed walkthrough of our findings. Because the fix was simple and its impact was large, they adopted it right away and saw an immediate improvement in both speed and stability. That cleared the way for their plans to launch the new backend site-wide.