Golang in Production

Golang has finally reached the big leagues at reecetech! We have released a rewritten REST API microservice - users-service - which is core to The Reece System. The Reece System is Reece’s internally built point of sale system, referred to as TRS for simplicity. This article looks at the motivations for the rewrite, the architecture, and the performance of the microservice.

Motivation

In the Delivery Engineering team there is a backlog of exciting new epics (aka projects), tech debt items and various impromptu items fighting for the team’s attention. The team’s method of choosing an epic is to look at the estimated work versus estimated value of each epic and choose the epics with higher value, lower effort first. Last quarter we chose an epic to rewrite one of the existing microservices in Golang and deploy it to production.

We estimated that the epic would be medium effort: developing the code takes time, and much care must be taken to work out all of the bugs so the application can safely go all the way into production. We agreed the epic was high value for several reasons, including retiring the existing Django Python application, which had exorbitant resource usage, random errors and crashes, and gaining experience in using Golang for microservices. The Django application had fallen into a bad state after the team which built it was dissolved in a corporate restructure, leaving the application neglected, with no owner and few Django developers with the know-how to patch up the issues.

Golang was considered a good candidate for a rewrite due to its maintainability, ease of use, efficient resource usage, and high request throughput. Thanks to the language’s simple syntax and concise training materials, new team members have been able to start raising pull requests to modify Golang code within a few days, and in some cases have even created entire microservices in their first few weeks. The efficient resource usage of Golang is also particularly appealing as reecetech sets its sights on running more compute in the cloud, where excessive resource use can have a serious impact on the balance sheet of the business. Using Golang to replace a Django microservice was a good way to trial a pattern which can be adopted by other teams.

Architecture

The application is an HTTP REST API microservice which interfaces with various tables in an IBM Informix database, and also orchestrates calls to various other microservices to fulfil the HTTP requests. Since reecetech predominantly runs its compute workloads in a distributed Kubernetes environment, it is essential for the application to be tolerant of network and worker node failures to maintain 24/7 uptime, and to cache appropriate data. Also, reecetech expects all applications used in production to have a Swagger documentation page for their API, and to publish and consume Pact contracts. The final architectural requirement was for the Golang application to be delivered iteratively, meaning that one endpoint should be rewritten and released to production at a time.

Open Source Packages

Almost all of the functional and non-functional requirements were achieved with the following libraries:

- echo: for lightweight, fast HTTP routing
- go_ibm_db: to interface Golang with the Informix database
- sqlx: a superset of the database/sql package built in to Golang which provides useful functions like Get and Select
- echo-swagger: generates REST API documentation
- go-retryablehttp: HTTP retries for resiliency to network failures and outages of downstream services
- newrelic/go-agent: runs the NewRelic agent for monitoring
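To make the retry idea concrete, here is a minimal sketch of retrying a GET with a doubling wait between attempts. It uses only the Go standard library so it can run on its own; it is not the go-retryablehttp API, and the function names, attempt count and wait times are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// getWithRetry issues a GET and retries on transport errors and 5xx
// responses, doubling the wait between attempts. Other statuses
// (including 4xx) are returned to the caller without retrying.
func getWithRetry(client *http.Client, url string, maxAttempts int, wait time.Duration) (string, error) {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		resp, err := client.Get(url)
		if err == nil && resp.StatusCode < 500 {
			defer resp.Body.Close()
			body, readErr := io.ReadAll(resp.Body)
			return string(body), readErr
		}
		if err != nil {
			lastErr = err
		} else {
			lastErr = fmt.Errorf("server returned %d", resp.StatusCode)
			resp.Body.Close()
		}
		if attempt < maxAttempts {
			time.Sleep(wait)
			wait *= 2
		}
	}
	return "", fmt.Errorf("all %d attempts failed: %w", maxAttempts, lastErr)
}

// demoRetry starts a throwaway downstream server that fails twice before
// succeeding, and reports the body plus how many calls the server saw.
func demoRetry() (string, int, error) {
	var calls int32
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.AddInt32(&calls, 1) < 3 {
			http.Error(w, "unavailable", http.StatusServiceUnavailable)
			return
		}
		fmt.Fprint(w, "ok")
	}))
	defer srv.Close()

	body, err := getWithRetry(srv.Client(), srv.URL, 5, time.Millisecond)
	return body, int(atomic.LoadInt32(&calls)), err
}

func main() {
	body, calls, err := demoRetry()
	fmt.Println(body, calls, err)
}
```

Retrying like this is only safe for idempotent requests such as GET; a library such as go-retryablehttp wraps the same idea behind a standard HTTP client.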

Node Failure Resilience

The only way to ensure node failure resilience is to run more than one replica (i.e. more than one pod) on different nodes. This is achieved by setting pod anti-affinity (podAntiAffinity) on the Kubernetes deployment resource, which tells the Kubernetes scheduler to make sure it runs the replicas on different nodes.
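As a rough illustration of what that setting looks like, the sketch below builds an anti-affinity stanza and prints it as JSON (Kubernetes manifests are usually written in YAML, but the field names are the same). The struct types are local stand-ins defined here rather than the official Kubernetes Go types, and the `app: users-service` label and the choice of a hard "required" rule are assumptions, not details taken from the actual deployment.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local stand-ins mirroring the Kubernetes field names for pod anti-affinity.
type labelSelector struct {
	MatchLabels map[string]string `json:"matchLabels"`
}

type podAffinityTerm struct {
	LabelSelector labelSelector `json:"labelSelector"`
	TopologyKey   string        `json:"topologyKey"`
}

type podAntiAffinity struct {
	Required []podAffinityTerm `json:"requiredDuringSchedulingIgnoredDuringExecution"`
}

type affinity struct {
	PodAntiAffinity podAntiAffinity `json:"podAntiAffinity"`
}

// antiAffinityFor returns a rule that keeps two pods carrying the same
// app label (assumed label key "app") off the same node.
func antiAffinityFor(appLabel string) affinity {
	return affinity{
		PodAntiAffinity: podAntiAffinity{
			Required: []podAffinityTerm{{
				LabelSelector: labelSelector{MatchLabels: map[string]string{"app": appLabel}},
				// kubernetes.io/hostname makes "one node" the unit of separation.
				TopologyKey: "kubernetes.io/hostname",
			}},
		},
	}
}

func main() {
	stanza := map[string]affinity{"affinity": antiAffinityFor("users-service")}
	out, err := json.MarshalIndent(stanza, "", "  ")
	if err != nil {
		panic(err)
	}
	// The printed object belongs under spec.template.spec of a Deployment.
	fmt.Println(string(out))
}
```

A "preferred" rule (preferredDuringSchedulingIgnoredDuringExecution) is the softer alternative when there may be fewer nodes than replicas.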

Distributed Caching

Since the application is deployed as a Kubernetes deployment with more than one replica, it is not feasible to use an in-memory cache, as each pod would hold its own copy of the data, so we use a distributed cache. Redis was chosen for this since reecetech has proven existing patterns for bundling Redis with applications in helm charts, with great results thanks to Redis being lightweight and fast. The Redis cache is simply exposed through a Kubernetes service which all of the Golang application pods consume.
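The way a shared cache sits in front of a slower lookup can be sketched with the cache-aside pattern below. To keep the example runnable without a Redis server, a small in-process map stands in for Redis behind a `Cache` interface; the interface, the `user:` key prefix, the five minute TTL and the function names are all assumptions for illustration and not the service's actual code.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Cache is the small surface the service would need from a shared cache.
// In production this would be backed by Redis; here it is an assumption.
type Cache interface {
	Get(key string) (string, bool)
	Set(key, value string, ttl time.Duration)
}

type entry struct {
	value   string
	expires time.Time
}

// memoryCache is a stand-in for Redis so the sketch runs on its own.
type memoryCache struct {
	mu    sync.Mutex
	items map[string]entry
}

func newMemoryCache() *memoryCache {
	return &memoryCache{items: make(map[string]entry)}
}

func (c *memoryCache) Get(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.items[key]
	if !ok {
		return "", false
	}
	if time.Now().After(e.expires) {
		delete(c.items, key) // expired entries count as a miss
		return "", false
	}
	return e.value, true
}

func (c *memoryCache) Set(key, value string, ttl time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items[key] = entry{value: value, expires: time.Now().Add(ttl)}
}

// lookupUser checks the cache first and only calls load (for example a
// database query) on a miss, storing the result for later callers.
func lookupUser(c Cache, id string, load func(string) (string, error)) (string, error) {
	key := "user:" + id
	if v, ok := c.Get(key); ok {
		return v, nil
	}
	v, err := load(id)
	if err != nil {
		return "", err
	}
	c.Set(key, v, 5*time.Minute)
	return v, nil
}

// demoCache simulates two pods sharing one cache and counts backend loads.
func demoCache() (string, string, int) {
	loads := 0
	load := func(id string) (string, error) {
		loads++
		return "user-" + id, nil
	}
	shared := newMemoryCache()
	first, _ := lookupUser(shared, "42", load)  // "pod A": miss, loads from the backend
	second, _ := lookupUser(shared, "42", load) // "pod B": hit, no backend call
	return first, second, loads
}

func main() {
	first, second, loads := demoCache()
	fmt.Println(first, second, loads)
}
```

Because both lookups go through the same cache, the second one is served without touching the backend, which is the behaviour a per-pod in-memory cache cannot give across replicas.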

Strangler Pattern

The Strangler Pattern, also known as the Strangler Fig Application, is a pattern used when rewriting critical applications. It does so in a way that avoids lengthy rewrites followed by a big bang cutover. Instead, it allows both the new and the old applications to live side by side, whilst over time the new application slowly strangles the old one out.

We utilised the Strangler Pattern by having all requests directed from the ingress to the Golang application, which then proxies requests for endpoints it has not yet implemented to the Django application.

Whilst this pattern means that during the strangulation we are utilising more resources than we were before the rewrite, we believe that a pragmatic approach of smaller endpoints being cut over allows for a smaller blast radius if any bugs were to be introduced than if all endpoints were cut over in a big bang.
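The request routing described in this section can be sketched as follows. This is a minimal illustration using the standard library's reverse proxy rather than echo, and the endpoint paths, response bodies and function names are hypothetical; the legacy application is represented by a throwaway test server.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
)

// newStranglerHandler serves already migrated routes itself and forwards
// every other request to the legacy application.
func newStranglerHandler(legacy *url.URL) http.Handler {
	mux := http.NewServeMux()
	// An endpoint that has already been rewritten (hypothetical path).
	mux.HandleFunc("/users/health", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "served by new service")
	})
	// Catch-all: anything not registered above is proxied to the old app.
	mux.Handle("/", httputil.NewSingleHostReverseProxy(legacy))
	return mux
}

// fetch performs a GET and returns the response body as a string.
func fetch(u string) (string, error) {
	resp, err := http.Get(u)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	b, err := io.ReadAll(resp.Body)
	return string(b), err
}

// demoStrangler wires a fake legacy server behind the new handler and
// requests one migrated and one unmigrated endpoint.
func demoStrangler() (string, string, error) {
	legacy := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "served by legacy: "+r.URL.Path)
	}))
	defer legacy.Close()

	legacyURL, err := url.Parse(legacy.URL)
	if err != nil {
		return "", "", err
	}
	front := httptest.NewServer(newStranglerHandler(legacyURL))
	defer front.Close()

	migrated, err := fetch(front.URL + "/users/health")
	if err != nil {
		return "", "", err
	}
	proxied, err := fetch(front.URL + "/users/123/permissions")
	if err != nil {
		return "", "", err
	}
	return migrated, proxied, nil
}

func main() {
	migrated, proxied, err := demoStrangler()
	fmt.Println(migrated)
	fmt.Println(proxied)
	fmt.Println(err)
}
```

Cutting over another endpoint then amounts to registering one more handler ahead of the catch-all, and the legacy application can be retired once the catch-all no longer receives traffic.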

Test Coverage

Four tiers of testing were built to ensure the rewritten application is working as specified.

Performance

The performance of the Golang rewritten application has far exceeded the performance of the original Django application in all areas including reduced CPU usage, reduced memory usage and increased request throughput.

The throughput results are very clear when we look at the most heavily used GET endpoint (collected via NewRelic based on production workloads):

- Django RPM: 1,200
- Django Response Time (ms): 21.5
- Golang RPM: 4,740
- Golang Response Time (ms): 4.3

In the above results we noticed that the Golang application was receiving much higher RPM than the Django application. This is because the average response time was so much lower for the Golang application that we were able to scale back the number of replicas running by about 1/5th, which resulted in more requests to fewer pods. However, as we see in the resource results below, the Golang app still doesn’t break a sweat.

Django Application NewRelic Data: statistics

Golang Application NewRelic Data: statistics

Go Go Golang

Given the smooth experience and solid performance so far, and the warm reception from Engineering management, the Delivery Engineering team is planning to continue the push for Golang by pairing up with other reecetech development teams to see how the learning curve is for them and how it might fit into their work.