Running Pluggable Code at Scale

BlueCrew|7 Minute Read

BlueConic is a fast-growing company, and our accelerated growth impacts the development of our product. To illustrate some of the technical challenges we face, we figured we’d shed some light on one aspect of this rapid growth: running pluggable code at scale.

At BlueConic, we process ~650 million requests daily. And for all these requests, we gather profile info, apply content personalization, and synchronize data with external systems.

To keep this all running smoothly, BlueConic’s customer data platform (CDP) was designed with a steady “core” (with several APIs) and a highly fluid plugin architecture.

Our plugin architecture allows us to build out-of-the-box connections at a fast pace, with easy on-the-fly deployment — and without impacting the core operation of BlueConic.

This plugin architecture also allows us to capably scale as an organization. For instance, multiple teams can work on different plugins without conflicting with one another.

Plugins can package two kinds of client-side code: code that runs in the browser or mobile app of a visitor to a channel, and code that runs in the browser of a user working in BlueConic.

Until 2015, plugins could only ship code that runs on the client, be it in a browser or mobile app. Although this is powerful and solves several use cases for client-side content personalization and data gathering, it doesn’t enable BlueConic to provide server-to-server communication, process privacy sensitive info, or execute long-running batch processes.

Technical goals for the desired solution

Apart from the functional objectives (see the original historical Jira issue below), the following technical goals were specified in our platform goals:

  • Ship plugins with JavaScript code that can run in a server-side context.
  • Ensure there is no downtime whatsoever when it comes to deployment.
  • Scale horizontally, when needed.
  • Make sure the pluggable code doesn’t impact other customers and is separated in terms of processing the data.
  • Allow for both long-running processes and “real-time” code execution.
  • Keep costs scalable, with a low “footprint” in terms of resources. (We have to guard against exponential cost growth that wouldn’t fit our business model, since we expect our steadily growing customer base to run more and more server-side code.)

[Image: the original Jira issue]

Exploring the technological landscape

From the start, it was clear that JavaScript was the desired language.

Why? Our existing plugin mechanism leveraged browser-based JavaScript and could use our JavaScript-based API. It only made sense to have a similar API for the server-side code.

Also, the JavaScript ecosystem is very rich, in terms of existing modules that we could pick from, making it a no-brainer to choose JavaScript.

But server-side JavaScript was not very mature in 2015. Node.js (although released in 2009) was still at a pre-1.0 version (0.12.0), and other server-side JavaScript implementations also looked promising.

In addition to Node.js (which seemed suitable in almost every respect), we also looked at Java 8’s Nashorn. Although it met the basic need of executing JavaScript from within the JVM, it had several disadvantages when compared to Node.js. The most important were:

  • The lack of support for some npm modules (those with native code)
  • Uncertainty about how well Nashorn would scale
  • Uncertainty about how we could run both long-running processes and real-time code execution

Node.js also gained momentum over Nashorn, which increased our trust in Node.js for the future. At the end of the day, Node.js fit the bill in terms of flexibility, long-running processes, no-downtime deployment, and a low footprint, but challenges remained around both security and horizontal scalability.

What challenges did we encounter?

When doing research into a new technology, there will always be pains aside from the gains. The challenges we encountered when going down the Node.js road were mostly around security and scalability.

Security

Node.js has a very open API, with direct access to the file system, the network, and child processes. Since plugin code would run inside the BlueConic network infrastructure, that opens up all kinds of attack vectors. That left us with a challenge:

An open API is great for developers, but not so much for security
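To make the concern concrete, here is a minimal sketch of what plugin code can reach when it runs directly in a Node.js process with nothing around it. The function name is hypothetical and is not part of BlueConic's plugin API; it only uses Node.js built-in modules.

```javascript
// Hypothetical "plugin" with no sandbox around it: it can reach everything
// the host Node.js process can. Nothing here is BlueConic's plugin API.
const fs = require('fs');
const os = require('os');
const childProcess = require('child_process');

function whatCanAPluginReach() {
  return {
    // Environment variables often hold credentials and connection strings.
    environmentVariables: Object.keys(process.env).length,
    hostname: os.hostname(),
    canReadFiles: typeof fs.readFileSync === 'function',
    canSpawnProcesses: typeof childProcess.spawn === 'function',
    canExitHost: typeof process.exit === 'function',
  };
}

console.log(whatCanAPluginReach());
```

Every one of these capabilities is useful to a plugin developer, which is exactly why the boundary has to come from somewhere outside the Node.js process itself.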

Isolation was another important topic. Every tenant would need to be isolated completely from other tenants. Customers own their data, and we act as a processor for their data, so data from customers can never be compromised. We examined a couple options:

  • 1) Using a “serverless” service such as AWS Lambda (generally available in 2015): Not having to manage the Node.js infrastructure was something we liked right away. Pricing was reasonable, although more expensive than managing it ourselves. It supported the use case for real-time code execution, but not so much for long-running processes (timeouts). Later, AWS Batch was introduced, which could be an alternative for long-running processes. When going down this road, how would we share data between BlueConic and Lambda/Batch? Using the existing REST API seemed reasonable but introduced other headaches with authentication. In the end, “serverless” solutions (both AWS Lambda and AWS Batch) were discarded because they were too immature at the time, too expensive, and not flexible enough for our needs.
  • 2) Running Node.js in our own (AWS-based) infrastructure: This would give us the most flexibility but posed a security risk with the open Node.js API in mind. We started investigating solutions where Node.js could be used in a kind of “sandboxed” environment. Docker popped up (introduced publicly in 2013). This new-to-us technology certainly looked promising.

As stated above, Node.js was ideal for us, but raised security and isolation concerns. Docker could solve these issues (along with horizontal scalability), because Docker provides us with the tools to run Node.js (through a Dockerfile) in a preconfigured virtual environment.

The open API risk was mitigated with Docker, because the risk of a plugin breaking things (or accessing other tenants’ data) was brought down to “container” level. Docker itself is not without attack vectors (on the host-machine level), but when it is configured properly, a misbehaving plugin can’t do much harm.
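As a rough illustration of what “configured properly” can mean, the sketch below builds a locked-down `docker run` argument list for one tenant's container. The flags are standard Docker CLI options, but the helper name, the image name, the naming scheme, and the resource limits are assumptions made for this example, not BlueConic's actual configuration.

```javascript
// Hypothetical helper: builds `docker run` arguments for one tenant container.
// The image name and all limits are assumptions chosen for illustration.
function dockerRunArgs(tenantId, image = 'example/plugin-runner:latest') {
  // Refuse anything that is not a plain identifier before it reaches a CLI.
  if (!/^[a-z0-9-]+$/.test(tenantId)) {
    throw new Error(`Invalid tenant id: ${tenantId}`);
  }
  return [
    'run', '--detach', '--rm',
    '--name', `plugins-${tenantId}`,
    '--read-only',                         // read-only root filesystem
    '--cap-drop', 'ALL',                   // drop all Linux capabilities
    '--security-opt', 'no-new-privileges', // block privilege escalation
    '--user', '1000:1000',                 // do not run as root
    '--memory', '256m',                    // cap memory
    '--cpus', '0.5',                       // cap CPU time
    '--pids-limit', '128',                 // cap the number of processes
    image,
  ];
}

console.log(['docker', ...dockerRunArgs('acme')].join(' '));
```

Returning an argument array rather than a shell string means the arguments can be handed to a process-spawning call without passing through a shell, which avoids quoting problems.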

So, with the security and isolation issues now tackled, horizontal scalability was still an open challenge.

Horizontal scalability

There are several ways to make Node.js horizontally scalable, ranging from leveraging Node.js cluster to having multiple servers with a load balancer.

Because our Node.js processes wouldn’t be directly accessible from outside the BlueConic network, and because we had decided to use Docker containers for Node.js containment, we opted for a custom solution for horizontal scalability.

In July 2015, Amazon released Elastic Container Service (ECS). This is basically an Amazon-managed Docker service. Since all of BlueConic’s infrastructure runs on Amazon, ECS looked promising.

ECS offers a reliable EC2-based solution for deploying Docker containers, including auto-scalability, monitoring, and logging.

Implementation

We now had all the ingredients for creating our desired solution:

  • The ability to ship (server-side) JavaScript in plugins
  • Node.js as our execution environment (which enabled us to be quick and lean resource-wise and flexible through npmjs.com)
  • Docker for both security and horizontal scalability
  • Amazon ECS for managing the Docker infrastructure

A diagram says more than a thousand words, so here comes the (simplified) infrastructure that we came up with:

[Diagram: simplified infrastructure]

The most basic setup for one customer is:

  • One Docker container running Node.js
  • The ability to scale up with more containers when needed, to distribute the load. BlueConic core communicates over internal HTTPS requests with one of the (almost-stateless) Docker containers running Node.js, to execute either real-time code or a long-running batch process
  • Customer-specific plugins installed into the containers (including npm dependencies)
  • Relevant profile information delivered to the Node.js containers in each request, with the option to request more info through the existing REST API, which is available to all plugins
  • Consul for service discovery, as it keeps a list of available containers per customer
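The per-customer container list and the “one of the containers” selection can be sketched as follows. This is a simplified in-memory stand-in, not Consul's API: the class, its method names, the round-robin strategy, and the addresses are all assumptions made for illustration.

```javascript
// In-memory stand-in for the service registry (the real setup uses Consul).
// Class name, methods, and round-robin selection are illustrative assumptions.
class ContainerRegistry {
  constructor() {
    this.byTenant = new Map(); // tenantId -> { addresses: string[], next: number }
  }

  register(tenantId, address) {
    const entry = this.byTenant.get(tenantId) || { addresses: [], next: 0 };
    if (!entry.addresses.includes(address)) entry.addresses.push(address);
    this.byTenant.set(tenantId, entry);
  }

  deregister(tenantId, address) {
    const entry = this.byTenant.get(tenantId);
    if (!entry) return;
    entry.addresses = entry.addresses.filter((a) => a !== address);
    if (entry.addresses.length === 0) this.byTenant.delete(tenantId);
  }

  // Round-robin over this tenant's containers only, never another tenant's.
  pick(tenantId) {
    const entry = this.byTenant.get(tenantId);
    if (!entry) throw new Error(`No containers registered for ${tenantId}`);
    const address = entry.addresses[entry.next % entry.addresses.length];
    entry.next = (entry.next + 1) % entry.addresses.length;
    return address;
  }
}

const registry = new ContainerRegistry();
registry.register('acme', 'https://10.0.1.12:8443');
registry.register('acme', 'https://10.0.1.13:8443');
registry.register('globex', 'https://10.0.2.7:8443');
console.log(registry.pick('acme')); // https://10.0.1.12:8443
console.log(registry.pick('acme')); // https://10.0.1.13:8443
```

Because the containers are almost stateless, any container of the right customer can serve a request, so scaling out amounts to registering one more address for that customer.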

Successes

Since its inception in 2015, a lot of functionality (APIs) has been added to the initial solution. At its core, though, it’s still what it was five years ago.

All BlueConic connections that have been created since run as server-side code in one of the Node.js containers (as long-running processes), while other plugins lean more towards responding to real-time events (profile properties that have changed or timeline events that have been added to a profile’s timeline).
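The two execution styles can be pictured as two entry points on a plugin, with a small dispatcher inside the container choosing between them. The sketch below is hypothetical: the `onEvent` and `runBatch` names, the message format, and the example plugin are assumptions for illustration and do not reflect BlueConic's actual plugin API.

```javascript
// Hypothetical plugin shape: `onEvent` for real-time work, `runBatch` for
// long-running work. These names are illustrative, not BlueConic's plugin API.
const examplePlugin = {
  async onEvent(event, profile) {
    // Real-time: react to a single profile change and return quickly.
    return { profileId: profile.id, changed: event.property };
  },
  async runBatch(profiles) {
    // Long-running: walk a potentially large set of profiles.
    let exported = 0;
    for (const profile of profiles) {
      if (profile.email) exported += 1; // stand-in for a call to an external system
    }
    return { exported };
  },
};

// Routes a message from the core to the matching plugin entry point.
async function dispatch(plugin, message) {
  switch (message.type) {
    case 'event':
      return plugin.onEvent(message.event, message.profile);
    case 'batch':
      return plugin.runBatch(message.profiles);
    default:
      throw new Error(`Unknown message type: ${message.type}`);
  }
}

dispatch(examplePlugin, {
  type: 'event',
  event: { property: 'email' },
  profile: { id: 'p1' },
}).then((result) => console.log(result)); // { profileId: 'p1', changed: 'email' }
```

Keeping both entry points asynchronous lets a long-running batch yield while it waits on external systems, so the same Node.js process can keep answering real-time events in the meantime.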

This solution has brought us a great deal of flexibility: plugins can ship new versions of server-side code, and multiple teams can work in parallel, all without impacting other customers (thanks to sandboxing) or the BlueConic core.

It felt like a pioneering solution back in 2015, and it has since proven to be a rock-solid foundation to build on.

At the time of writing, we’re running over 600 Docker containers (for Node.js), distributed over six m5.2xlarge servers. Combined, these containers serve an average of 160 million requests each day, with the containers of larger customers taking a much bigger share than those of smaller customers.
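For a sense of scale, the figures above can be turned into rough per-server and per-container averages. This is back-of-the-envelope arithmetic only: it treats “over 600” as exactly 600, assumes an even spread that the real workload does not have, and takes the 32 GiB of memory per m5.2xlarge from AWS's published instance specifications rather than from the article.

```javascript
const SECONDS_PER_DAY = 24 * 60 * 60;

// Derives simple averages from the headline numbers. Real load is skewed
// toward larger customers, so these are averages, not per-customer figures.
function loadSummary({ containers, servers, requestsPerDay, memoryGiBPerServer }) {
  const containersPerServer = containers / servers;
  const requestsPerContainerPerDay = requestsPerDay / containers;
  return {
    containersPerServer,
    requestsPerSecond: requestsPerDay / SECONDS_PER_DAY,
    requestsPerContainerPerDay,
    requestsPerContainerPerSecond: requestsPerContainerPerDay / SECONDS_PER_DAY,
    // Upper bound: ignores memory used by the OS and the container runtime.
    memoryMiBPerContainer: (memoryGiBPerServer * 1024) / containersPerServer,
  };
}

const summary = loadSummary({
  containers: 600,        // "over 600", rounded down
  servers: 6,
  requestsPerDay: 160e6,
  memoryGiBPerServer: 32, // m5.2xlarge, per AWS's published specs
});

console.log(summary.containersPerServer);                      // 100
console.log(Math.round(summary.requestsPerSecond));            // 1852
console.log(Math.round(summary.requestsPerContainerPerDay));   // 266667
console.log(summary.requestsPerContainerPerSecond.toFixed(1)); // 3.1
console.log(summary.memoryMiBPerContainer.toFixed(2));         // 327.68
```

Roughly 100 containers per server, each averaging about three requests per second within a memory ceiling of around 328 MiB, fits the low-footprint goal set out at the start.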

If you want to learn more about this work or are interested in joining one of our engineering teams, visit BlueConic’s careers page today to check out our available openings.
