Using Kubernetes to scale our infrastructure

Codavel Performance Service is built on top of our core technology, Bolina, a mobile-first network protocol. Apart from delivering the best network experience possible to all our users, our main goal is to make our protocol as easy to use as HTTP, so that it can be integrated into any service in a few minutes. 

For us, this meant creating a mobile SDK that requires only a few lines of code to enable Bolina on any Android or iOS app, and a server cluster that not only can be quickly deployed into any infrastructure but is also able to scale, so that it can cope with the millions of requests per second we aim to serve. And while the old-school engineer in me glows at the beeps of a physical computer booting up, or at 1000+ lines of custom installation scripts, large-scale deployments should definitely make use of automated workflows. In the following paragraphs, I will explain why we chose Kubernetes to help us with that.

Our Service, the ins and outs

But first let me explain the basics of how we designed our integration of the Bolina Protocol with HTTP:

When the Client enables our mobile SDK, the SDK starts intercepting all the configured HTTP requests made by the App and redirects them to a local service running on the device, managed by our SDK. This service is responsible for translating those requests into our Bolina Protocol and sending them to our servers through a secure Bolina connection. The Bolina Server performs almost the opposite operation: it processes the Bolina requests and translates them back into HTTP. Then, it fetches the resource from the Content Server, as a regular HTTP client would, and sends the data back to the Client through the same Bolina connection. Pretty simple, right?
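
To make this flow concrete, here is a minimal sketch, in JavaScript, of what the server side of this proxying could look like. All the Bolina-specific names (handleBolinaRequest, bolinaRequest, bolinaConnection and their fields) are illustrative, not our actual API:

// Hypothetical sketch of the Bolina Server's request handling: rebuild the
// HTTP request from the Bolina payload, fetch the resource from the Content
// Server, and stream it back over the same Bolina connection.
const http = require("http");

function handleBolinaRequest(bolinaRequest, bolinaConnection) {
  const options = {
    host: bolinaRequest.host,
    path: bolinaRequest.path,
    method: bolinaRequest.method,
    headers: bolinaRequest.headers,
  };
  // Fetch the resource from the Content Server, as a regular HTTP client would
  const proxied = http.request(options, (response) => {
    // Send the data back to the Client through the same Bolina connection
    response.on("data", (chunk) => bolinaConnection.send(bolinaRequest.id, chunk));
    response.on("end", () => bolinaConnection.end(bolinaRequest.id));
  });
  proxied.end();
}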

And while establishing a connection between a client and one specific server is pretty straightforward, a few more challenges arise as the number of clients using our service increases:

  • How do we know how many clients each server should handle simultaneously?
  • How do the clients know which server they should connect to?
  • How do we scale the number of servers when needed?

 

With these questions in mind, the first thing we tackled was finding a way to detect the capacity of each server. Performance is at the core of any communication protocol, and Bolina is no different, so we had to ensure that all the effort we put into making Bolina as fast as possible was not being undermined by bottlenecks external to the protocol itself, such as the number of open connections or a lack of resources to perform all the required tasks.

One common solution is to monitor the instance's CPU and/or memory usage and evaluate whether the measured values are under a certain threshold. But since we have full control of the entire process running on the server, we felt that we could include a few more metrics, such as the number of bytes processed, the number of requests per second, and even the processing time of each event of the process's main loop. This gave us higher granularity on each server's availability and, most importantly, let us quickly detect when a server was reaching its maximum capacity and prevent it from accepting additional Bolina connections.

Based on these metrics, we divided the Bolina Server availability into three different stages (a short sketch of this classification follows the list):

  • Available: Server is below its capacity, and is ready to accept new clients.
  • Warning: Server is reaching its maximum capacity, and it should not accept new clients in order to stabilize its usage and avoid getting into the Critical stage.
  • Critical: Server is above its capacity, and the overall performance could be affected. It should stop receiving traffic until it drops at least to a Warning stage.
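
As a minimal sketch of this classification: each metric is normalized by the server's known limits and the worst one decides the stage. The metric names, limits, and the 70%/90% thresholds below are illustrative values, not our production settings:

// Normalize each metric by the server's limits and classify the result;
// the limits and thresholds are example values only.
const LIMITS = { connections: 10000, requestsPerSecond: 50000, eventTimeMs: 20 };

function availabilityStage(metrics) {
  const load = Math.max(
    metrics.openConnections / LIMITS.connections,
    metrics.requestsPerSecond / LIMITS.requestsPerSecond,
    metrics.mainLoopEventTimeMs / LIMITS.eventTimeMs
  );
  if (load < 0.7) return "available"; // below capacity, ready for new clients
  if (load < 0.9) return "warning";   // reaching capacity, stop accepting clients
  return "critical";                  // above capacity, shed traffic
}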

But how to know which servers are running?

After guaranteeing that each Bolina Server could accurately determine its availability, the next step was to aggregate that information into a service that another entity, such as a load balancer, could query, not only to get a list of available servers but also to determine which one would be more suitable to receive a new client.

Let’s take the following JSON as an example of the information that a load balancer could receive: 

[
   {
       "Address": {
           "IP": "1.2.3.4",
           "Port": {
           "TCP": 9001,
           "UDP": 9002
           }
       },
       "Metadata": {
           "ServerID": "805390934",
           "Status": "available"
       }
   },
   {
       "Address": {
           "IP": "5.6.7.8",
           "Port": {
           "TCP": 9001,
           "UDP": 9002
           }
       },
       "Metadata": {
           "ServerID": "202130133",
           "Status": "warning"
       }
    }
]

Based on this information, the Load Balancer knows that there are two Bolina Servers running and that one of them is in the “Warning” state. If a client opened a new connection at this stage, the decision should be to redirect it to server “805390934”, at address 1.2.3.4, since it has more spare capacity than the other.
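
As a simple illustration (not our actual load balancing logic), a load balancer consuming this JSON could pick a server like this:

// Prefer servers in the "available" state; fall back to "warning" servers
// only when none has spare capacity, and never pick a "critical" one.
function pickServer(servers) {
  const available = servers.filter((s) => s.Metadata.Status === "available");
  const candidates = available.length > 0
    ? available
    : servers.filter((s) => s.Metadata.Status === "warning");
  return candidates.length > 0 ? candidates[0] : null;
}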

To store that information, we decided to use Consul, a distributed and highly available service mesh system that allowed us to quickly set up a system through which all the servers could communicate efficiently. Consul also offers custom health checks, available through an HTTP API, to which the servers can periodically report their availability and, as a bonus, a distributed Key-Value store that we could use to share relevant information across the entire cluster of servers.

Regarding its architecture, it basically consists of a cluster composed of at least 3 Consul agents (running in server mode), to which other agents (running in client mode) connect and communicate through a gossip protocol. To integrate with Consul, each Bolina Server is deployed alongside a Consul agent, as a “sidecar”, which is responsible for receiving the server's HTTP messages describing its availability and propagating that information to the Consul cluster, giving us, as intended, an aggregated view of all the running instances.
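
Conveniently, our three availability stages map naturally onto Consul's passing/warning/critical health statuses. As a hedged sketch (the check id "bolina-availability" and the reporting code are illustrative, not our actual implementation), a server could report its stage to the local sidecar through Consul's TTL check update endpoint:

// Report the current availability stage to the local Consul agent sidecar,
// using Consul's HTTP API (PUT /v1/agent/check/update/:check_id).
const CONSUL_AGENT = "http://127.0.0.1:8500";
const STATUS_MAP = { available: "passing", warning: "warning", critical: "critical" };

async function reportAvailability(stage, details) {
  await fetch(`${CONSUL_AGENT}/v1/agent/check/update/bolina-availability`, {
    method: "PUT",
    body: JSON.stringify({ Status: STATUS_MAP[stage], Output: details }),
  });
}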

Great! Now I want more servers

Now that we had defined the model to detect the availability of a Bolina Server, and a service from which we could fetch the status of all Bolina Servers, we moved on to the next problem: how to use that information? (Of course, we already knew what to do, but bear with me for the sake of the writing flow.)

As described previously, each Bolina Server acts basically as an HTTP proxy and does not store any of the requested resources. This means that, apart from the data currently being transferred between Client and Server, each instance is pretty much stateless, and each Bolina Server is able to handle data from any Bolina Client, regardless of whether a session was previously established. So, in terms of scaling our service, our concern is “only” (and I need to emphasize the quotes) to ensure a number of Bolina Servers capable of serving all the Clients' requests at any given time.

Coupling this requirement with the definition of availability above, we decided that the system should always keep the percentage of servers in the “Available” state above a certain threshold, to act as a safeguard of spare capacity. Each time the percentage drops below that value, a new instance should be spawned. To guarantee that, we developed a custom watcher for the Consul cluster, which is notified each time a server changes its availability or the number of running instances changes. When required, this service is also responsible for scaling the deployment to the necessary number of instances. Since we put all our creativity into the development of the products, we simply named it Bolina Scaler.
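
In pseudocode terms, the scaling rule boils down to something like the sketch below. The 30% threshold is an example value, not our production setting:

// Keep the share of "available" servers above a minimum ratio; if it drops
// below, ask for one more instance to restore the spare capacity.
const MIN_AVAILABLE_RATIO = 0.3;

function desiredReplicas(servers) {
  const total = servers.length;
  const available = servers.filter((s) => s.Metadata.Status === "available").length;
  if (total === 0 || available / total < MIN_AVAILABLE_RATIO) {
    return total + 1;
  }
  return total;
}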

This leads us to the final challenge: what is the most efficient way to create and launch new instances? How do we ensure that the number of running instances at a given time is what we intended? And although we run Codavel Performance Service, composed of the necessary number of Bolina Servers, on our own infrastructure, how could we make it easier for anyone interested in deploying Bolina on theirs? Here comes Kubernetes!

Show me some code!

Bolina Servers were always a nice fit for containers. The stateless nature of the service allowed us to deploy each server in an isolated and immutable environment. However, as the number of connected clients grows, the number of containers required to handle all that traffic also increases significantly, and we had to ensure that we could deploy and orchestrate all of them efficiently, while guaranteeing that they were fault tolerant and able to communicate with all the other necessary components within the cluster. Kubernetes addresses all of those concerns.

Keeping things very simple, its architecture consists of one or more (typically many more) physical/virtual hosts, called Nodes, divided into a master node and worker nodes. As you may have guessed, the master is responsible for managing and orchestrating the cluster, while the workers run all the necessary containers and associated tasks. After you define what you want to be deployed (more on that in a few lines), Kubernetes is responsible for managing the system so that it always matches your configuration. This means that it distributes the workload between all the available workers, in a virtual network where all the containers can find and communicate with each other, and it constantly verifies that the current state (in terms of each service's configuration and number of running instances) matches the intended one.

Through its automated rollouts and rollbacks, it also makes it very easy and fast to horizontally scale the deployment up or down, as that boils down to starting or terminating containers. This allows us to quickly adapt to the natural variation in the number of connected Bolina clients, without any downtime and with efficient use of resources.

By now you are probably getting tired of all the writing and just waiting to see some code, right? So buckle up, a lot of k8s terminology coming:

First, we deployed a StatefulSet with three replicas and an associated persistent volume, to run the Consul agents in server mode. This creates the Consul server cluster, while ensuring that all data is kept safe even if one of those containers stops.

---
apiVersion: apps/v1
kind: StatefulSet
metadata:
 name: consul
spec:
 selector:
   matchLabels:
     app: consul
     component: server
 serviceName: consul
 podManagementPolicy: "Parallel"
 replicas: 3
 template:
   metadata:
     labels:
       app: consul
       component: server
     annotations:
       "consul.hashicorp.com/connect-inject": "false"
   spec:
     serviceAccountName: consul
     affinity:
       podAntiAffinity:
         requiredDuringSchedulingIgnoredDuringExecution:
           - labelSelector:
               matchExpressions:
                 - key: app
                   operator: In
                   values:
                     - consul
             topologyKey: kubernetes.io/hostname
     terminationGracePeriodSeconds: 10
     containers:
       - name: consul
         image: consul:latest
         env:
           - name: POD_IP
             valueFrom:
               fieldRef:
                 fieldPath: status.podIP
           - name: CONSUL_ENCRYPTION_KEY
             valueFrom:
               secretKeyRef:
                 name: consul
                 key: consul-encryption-key
         args:
           - "agent"
           - "-advertise=$(POD_IP)"
           - "-bootstrap-expect=3"
           - "-config-file=/etc/consul/config/server.json"
           - "-encrypt=$(CONSUL_ENCRYPTION_KEY)"
         volumeMounts:
           - name: consul-config-server
             mountPath: /etc/consul/config
         lifecycle:
           preStop:
             exec:
               command:
               - /bin/sh
               - -c
               - consul leave
         ports:
         - name: server-rpc
           containerPort: 8300
           protocol: "TCP"
         - name: consul-dns-tcp
           containerPort: 8600
           protocol: "TCP"
         - name: consul-dns-udp
           containerPort: 8600
           protocol: "UDP"
         - name: http
           containerPort: 8500
           protocol: "TCP"
         - name: serflan-tcp
           protocol: "TCP"
           containerPort: 8301
         - name: serflan-udp
           protocol: "UDP"
           containerPort: 8301
         - name: serfwan-tcp
           protocol: "TCP"
           containerPort: 8302
         - name: serfwan-udp
           protocol: "UDP"
           containerPort: 8302
     volumes:
       - name: consul-config-server
         configMap:
           name: consul-config-server

The config itself is pretty much self-explanatory, but please note the podAntiAffinity section. It forces each Consul pod onto a separate node, so that we do not lose the entire Consul cluster if the single node hosting all the pods fails.

To communicate with that cluster, we deployed a Consul agent in client mode as a sidecar for each Bolina Server, to reduce the latency of the communication between them. Translating that into Kubernetes language, each Bolina instance is a pod composed of two containers: the Bolina Server itself and a Consul agent.

The cluster of Bolina Servers is defined as a Deployment, where the number of replicas equals the number of desired running instances, all connected and registered in the Consul cluster and ready for action. Note that we set the number of replicas to 0. This is not an error: the Bolina Scaler will be the one responsible for changing that value dynamically.

---
apiVersion: apps/v1
kind: Deployment
metadata:
 name: bolina
spec:
 replicas: 0
 selector:
   matchLabels:
     app: bolina
     component: server
 template:
   metadata:
     labels:
       app: bolina
       component: server
   spec:
     terminationGracePeriodSeconds: 60
     containers:
     - name: bolina-server
        image: bolina-image
       ports:
          - name: bolina-tcp
            containerPort: 9001
            hostPort: 9001
            protocol: TCP
          - name: bolina-udp
            containerPort: 9002
            hostPort: 9002
            protocol: UDP
     - name: consul-bolina
       image: consul:latest
       env:
       - name: CONSUL_ENCRYPTION_KEY
         valueFrom:
           secretKeyRef:
             name: consul
             key: consul-encryption-key
       - name: POD_IP
         valueFrom:
           fieldRef:
             fieldPath: status.podIP
       lifecycle:
         preStop:
           exec:
             command:
               - /bin/sh
               - -c
               - consul leave
       args:
           - "agent"
           - "-advertise=$(POD_IP)"
           - "-config-file=/etc/consul/config/client.json"
           - "-encrypt=$(CONSUL_ENCRYPTION_KEY)"
       volumeMounts:
         - name: consul-config-client
           mountPath: /etc/consul/config
       ports:
         - name: server-rpc
           containerPort: 8300
           hostPort: 8300 
           protocol: "TCP"
         - name: consul-dns-tcp
           containerPort: 8600
           hostPort: 8600
           protocol: "TCP"
         - name: consul-dns-udp
           containerPort: 8600
           hostPort: 8600
           protocol: "UDP"
         - name: http
           containerPort: 8500
           hostPort: 8500 
           protocol: "TCP"
         - name: serflan-tcp
           protocol: "TCP"
           containerPort: 8301
           hostPort: 8301
         - name: serflan-udp
           protocol: "UDP"
           containerPort: 8301
           hostPort: 8301
         - name: serfwan-tcp
           protocol: "TCP"
           containerPort: 8302
           hostPort: 8302
         - name: serfwan-udp
           protocol: "UDP"
           containerPort: 8302
           hostPort: 8302
     volumes:
       - name: consul-config-client
         configMap:
           name: consul-config-client

As for the scaling itself, and as I explained previously, the number of Bolina Servers is calculated by our Bolina Scaler service. Since all the data is kept and managed by our Consul cluster, this service is also stateless. So, in order to deploy it in our Kubernetes cluster, we simply created a Deployment, similar to the one for the Bolina Servers:

---
apiVersion: apps/v1
kind: Deployment
metadata:
 name: watcher
spec:
 replicas: 1
 selector:
   matchLabels:
     app: watcher
     component: server
 template:
   metadata:
     labels:
       app: watcher
       component: server
   spec:
     terminationGracePeriodSeconds: 10
     containers:
       - name: watcher
         image: watcher-image
         env:
           - name: POD_IP
             valueFrom:
               fieldRef:
                  fieldPath: status.podIP
         volumeMounts:
          - name: kube-config
            mountPath: /usr/src/app/config
            subPath: config
       - name: consul-watcher
         image: consul:latest
         env:
           - name: CONSUL_ENCRYPTION_KEY
             valueFrom:
               secretKeyRef:
                 name: consul
                 key: consul-encryption-key
           - name: POD_IP
             valueFrom:
               fieldRef:
                 fieldPath: status.podIP
         lifecycle:
           preStop:
             exec:
               command:
                 - /bin/sh
                 - -c
                 - consul leave
          args:
           - "agent"
           - "-advertise=$(POD_IP)"
           - "-config-file=/etc/consul/config/client.json"
           - "-encrypt=$(CONSUL_ENCRYPTION_KEY)"
         volumeMounts:
           - name: consul-config-client
             mountPath: /etc/consul/config
         ports:
           - name: server-rpc
             containerPort: 8300
             protocol: "TCP"
           - name: consul-dns-tcp
             containerPort: 8600
             protocol: "TCP"
           - name: consul-dns-udp
             containerPort: 8600
             protocol: "UDP"
           - name: http
             containerPort: 8500
             protocol: "TCP"
           - name: serflan-tcp
             protocol: "TCP"
             containerPort: 8301
           - name: serflan-udp
             protocol: "UDP"
             containerPort: 8301
           - name: serfwan-tcp
             protocol: "TCP"
             containerPort: 8302
           - name: serfwan-udp
             protocol: "UDP"
             containerPort: 8302
     volumes:
       - name: consul-config-client
         configMap:
           name: consul-config-client
       - name: kube-config
         configMap:
           name: kube-config

Did you notice the kube-config configuration? Well, if the number of desired instances changes at any given time, this service needs to interact with the cluster and instruct it to create or shut down instances. And here is another helpful tool: the Kubernetes API.

Since the Watcher runs inside the Kubernetes cluster, we just needed to integrate the JavaScript client into our code and change the number of replicas of the Bolina Deployment when required. It is up to Kubernetes to ensure that our cluster always has the desired number of pods up and running.

async createInstances(desired_replicas) {
    // Fetch the current state of the Bolina Deployment
    const current_deployment = await this.k8sApi.readNamespacedDeployment(this.name, this.namespace);
    let deployment = current_deployment.body;

    // Update the replica count and push the new spec to the cluster
    deployment.spec.replicas = desired_replicas;
    const new_deployment = await this.k8sApi.replaceNamespacedDeployment(this.name, this.namespace, deployment);

    const number_of_replicas = new_deployment.body.spec.replicas;
    console.log(`Desired instances: ${desired_replicas}. Running ${number_of_replicas} replicas`);
    return number_of_replicas;
}

The kube config file is what grants the service access to modify the cluster.
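
For completeness, here is a sketch of how the Watcher could build the k8sApi client used above with the @kubernetes/client-node library. The config path matches the kube-config volumeMount of the Deployment, but the exact loading code is illustrative:

const k8s = require("@kubernetes/client-node");

// Load the credentials mounted from the kube-config ConfigMap and build a
// client for the apps/v1 API group, which manages Deployments.
const kc = new k8s.KubeConfig();
kc.loadFromFile("/usr/src/app/config/config");
const k8sApi = kc.makeApiClient(k8s.AppsV1Api);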

As soon as we start the Watcher, it begins to monitor and update the number of Bolina Server replicas in the Deployment to the desired number of starting instances, and the system is up and running.

That was easy!

If the above description of our deployment seems too simple, well… it's because it is. Kubernetes is a great and easy-to-use tool that allows you to quickly deploy your service in any infrastructure, independently of your cloud provider, with some great features that ensure the stability and robustness of your deployment. We strongly believe that you should focus most of your efforts on what you're good at, and tools like Kubernetes help us focus more on delivering the best user experience possible to anyone, through a modern transfer protocol, Bolina, and less on the infrastructure side.

Still, this is only a part of our product, and there are a lot of challenges we faced when scaling to millions of users. For instance, things get a little bit more complicated when we start discussing how to access the Kubernetes cluster from the wild (meaning, the internet), and how we do L4 load balancing for both UDP and TCP traffic, but that is enough content for one or two blog posts that you will be able to read soon (spoiler alert: Envoy is another great tool ;) ).