Coremetrix’s successful migration to Kubernetes

After a huge amount of work from our local engineering team, with assistance from our colleagues at Creditinfo Global Technologies, Coremetrix are delighted to confirm that we have completed our migration to a containerised infrastructure built on Kubernetes.

Kubernetes is a natural technology choice for Coremetrix and allows us to maximise the advantage we’ve gained by running our services in the AWS cloud. With Kubernetes, our clients can rest assured that our services will scale horizontally as load increases, further reducing the possibility of downtime on the platform.
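As a rough illustration of what scaling horizontally means in practice, Kubernetes’ Horizontal Pod Autoscaler derives a desired replica count from the ratio of an observed metric to its target: desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). The sketch below applies that documented formula and clamps the result to minimum and maximum replica counts. It leaves out the controller’s tolerance and stabilisation behaviour. The numbers are illustrative and are not our actual configuration.

```scala
object HorizontalScaling {
  /** Replica count suggested by the Horizontal Pod Autoscaler formula
    * ceil(currentReplicas * currentMetric / targetMetric), clamped to
    * [minReplicas, maxReplicas]. Simplified: the real controller also applies
    * a tolerance and stabilisation windows, which are omitted here.
    */
  def desiredReplicas(
      currentReplicas: Int,
      currentMetric: Double, // observed value, e.g. average CPU utilisation (%)
      targetMetric: Double,  // target value for the same metric
      minReplicas: Int,
      maxReplicas: Int
  ): Int = {
    require(targetMetric > 0, "target metric must be positive")
    require(minReplicas <= maxReplicas, "minReplicas must not exceed maxReplicas")
    val proposed = math.ceil(currentReplicas * currentMetric / targetMetric).toInt
    math.max(minReplicas, math.min(maxReplicas, proposed))
  }
}
```

For example, three replicas averaging 90% CPU against a 60% target give ceil(4.5) = 5, so `desiredReplicas(3, 90.0, 60.0, 2, 10)` returns 5.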

It is critical to Coremetrix to provide a global solution that scales massively while keeping our technology operating costs as low as possible. Kubernetes gives us the tools to do this, combining the stability you would associate with a large enterprise with the agility and flexibility of a startup.

Cristian Pupazan, Head of Engineering, led the migration and is delighted to see it completed: “It is important that our quiz platform can run across multiple operating environments, such as private clouds powered by open-source technologies or public clouds such as AWS. Moving to a technology such as Kubernetes enables us to achieve this. Using Kubernetes also means that our deployments are now a lot cleaner and our services are self-healing and easy to scale. Building on top of such a platform will give the tech team more velocity, efficiency and agility.”

Well done to Cris and team!

According to Conor Redmond, Head of Operations at Coremetrix, the new infrastructure improves on an already impressive offering: “Our clients rely on us to provide innovative solutions that scale. Kubernetes allows us to meet our service commitments to them and to provide a highly stable infrastructure. We already provide an uptime SLA of 99.99%, so our new infrastructure can only provide even greater confidence and better service. Most importantly for us as a growing business, it also allows us to maintain our low cost base while having the confidence that we can scale without limits as we grow.”
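To put the 99.99% figure in context, an availability target corresponds to a downtime budget of (1 − availability) × period. The helper below only illustrates that arithmetic. It assumes nothing about how our SLA is actually measured, such as the reporting period or any exclusions.

```scala
object UptimeBudget {
  private val MinutesPerDay = 24 * 60

  /** Downtime (in minutes) that an availability target such as 0.9999 (99.99%)
    * still permits over a period of `periodDays` days.
    */
  def allowedDowntimeMinutes(availability: Double, periodDays: Double): Double = {
    require(availability > 0.0 && availability <= 1.0, "availability must be in (0, 1]")
    (1.0 - availability) * periodDays * MinutesPerDay
  }
}
```

At 99.99%, `allowedDowntimeMinutes(0.9999, 30)` comes to about 4.32 minutes for a 30-day month. A 365-day year allows about 52.6 minutes.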

Further reading on Kubernetes and containerisation:

https://kubernetes.io/…/concep…/overview/what-is-kubernetes/
https://www.cio.com/…/what-are-containers-and-why-do-you-ne…

A guide to software manifestos

As we are all too aware, manifestos are a feature of modern party politics. They also have an application in the IT world, and while political manifestos are sometimes forgotten, their computing equivalents help software developers produce systems that are efficient, adaptable and robust, whatever the demands of users. Here, we discuss three of the most important.

Manifesto for Agile Software Development

This manifesto was released by 17 pioneering software professionals after a gathering in 2001. They came from different industries and backgrounds but unanimously agreed on four values:

 

Individuals and interactions over processes and tools

Working software over comprehensive documentation

Customer collaboration over contract negotiation

Responding to change over following a plan

 

The authors believe that there is value in the items on the right, but they value the items on the left more. The Agile movement these principles spawned has not only shaped the way software is developed but also transformed organisations of all kinds.

 

At the heart of “agile” is the short feedback cycle. Prioritisation allows important or risky features to be developed, tested and delivered first. Agile became a key to innovation: the short feedback loop enables ideas to be tested, iterated, progressed or even dismissed quickly. Being agile also means adapting to changes in the business rather than following a predefined plan.

 

At Coremetrix, we continually reap the benefits of applying the Agile Manifesto. The tech team collaborates closely with our other teams, including data scientists, data modelling specialists, research psychologists and salespeople. By taking requirements, priorities and feedback directly from these stakeholders, we ensure that the right platform is developed for every project, from compiling the quiz to collating the data.

 

Manifesto for Software Craftsmanship

 

Agile was just the start. Many adoptions of that approach focus on process while neglecting technical practices and the competence of developers. This can result in systems that are expensive to maintain and difficult to evolve. That realisation led to the proposal of a fifth Agile value, “craftsmanship over execution”. This, in turn, yielded another manifesto, encapsulating the following values:

 

Not only working software, but also well-crafted software

Not only responding to change, but also steadily adding value

Not only individuals and interactions, but also a community of professionals

Not only customer collaboration, but also productive partnerships

 

The manifesto emphasises that “in pursuit of the items on the left we have found the items on the right to be indispensable.” It is about raising the bar of quality in the software profession.

 

Every member of the tech team at Coremetrix strives to be a craftsman and shares this mindset of professionalism in software development. We value well-crafted software as well as working software. We adopt relevant technical practices and strive to keep code quality high. We believe that each developer should improve not only themselves but also others in the company and in the wider tech community. We create simple, elegant, high-quality software solutions that deliver business value.

 

The Reactive Manifesto

 

Version 2 of the manifesto, published in 2014, defines Reactive Systems as:

 

Responsive: The system responds in a timely manner if at all possible

Resilient: The system stays responsive in the face of failure

Elastic: The system stays responsive to varying workload

Message Driven: Reactive Systems rely on asynchronous message-passing to establish a boundary between components…

 

This manifesto summarises a set of architectural design principles that were discovered independently by organisations in different domains. The industry often summarises them as the “4Rs” stated in version 1 of the manifesto: react to users, react to failure, react to load and react to events.

 

The manifesto was created to address the challenges faced by modern applications. These applications must run on multicore processors and serve billions of requests each day, including requests from Internet of Things (IoT) and mobile devices. They must also maintain low latency, high throughput and high availability.

 

Coremetrix’s quiz and data collection platform consists of many microservices and software components. At the system level, the tech team applies these reactive design principles when we design, develop, deploy and continually evolve the platform. At the component level, we use programming languages and tools, such as Scala and Akka, that support reactive programming. As a result, the platform, which is backed by an asynchronous messaging system (RabbitMQ) and deployed in the cloud, has all of the reactive properties mentioned above.
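The following sketch illustrates the “responsive” and “resilient” traits using only standard-library Scala Futures. It is not Akka and not our production code. A call that fails, or that takes longer than a timeout, is replaced by a fallback value, and no thread is blocked while waiting. The name `withFallback`, the timeouts and the fallback values are hypothetical.

```scala
import java.util.concurrent.{Executors, ScheduledExecutorService, ThreadFactory, TimeUnit}
import scala.concurrent.{Await, ExecutionContext, Future, Promise}
import scala.concurrent.duration._

object ResilientCall {
  // One daemon thread that fires timeouts, so waiting never blocks a worker thread.
  private val scheduler: ScheduledExecutorService =
    Executors.newSingleThreadScheduledExecutor(new ThreadFactory {
      def newThread(r: Runnable): Thread = {
        val t = new Thread(r, "fallback-timer")
        t.setDaemon(true) // do not keep the JVM alive just for timers
        t
      }
    })

  /** Runs `call`, but completes with `fallback` if the call fails or has not
    * finished within `timeout`, so callers always get a timely answer.
    */
  def withFallback[A](call: => Future[A], timeout: FiniteDuration, fallback: A)(
      implicit ec: ExecutionContext): Future[A] = {
    val timedOut = Promise[A]()
    scheduler.schedule(new Runnable {
      def run(): Unit = { timedOut.trySuccess(fallback); () }
    }, timeout.toMillis, TimeUnit.MILLISECONDS)
    // Future.unit.flatMap also turns an exception thrown while building the call into a failed Future.
    val guarded = Future.unit.flatMap(_ => call).recover { case _ => fallback }
    Future.firstCompletedOf(Seq(guarded, timedOut.future))
  }
}

object ResilientCallDemo {
  def main(args: Array[String]): Unit = {
    implicit val ec: ExecutionContext = ExecutionContext.global
    val slow = ResilientCall.withFallback(Future { Thread.sleep(2000); "fresh score" }, 100.millis, "cached score")
    // The slow call loses the race against the 100 ms timeout, so the fallback is printed.
    println(Await.result(slow, 1.second))
  }
}
```

A production version would also cancel the scheduled timeout when the call wins the race. The sketch keeps things minimal.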

Summary

The Agile Manifesto mainly addresses the interactions between customers and developers, so that the right software is developed. The Craftsmanship Manifesto aims to raise the bar of the software profession, so that systems are developed in the right way. The Reactive Manifesto gives guidance on designing systems that can handle business requirements in the age of cloud computing, big data and the IoT. Any business developing software to gain a competitive advantage must address the challenges and aspirations set out in the three manifestos.

 

 

Ex Coremetrix Senior Software Engineer

Moving to an asynchronous microservice architecture

At Coremetrix we recently made some big changes to our architecture, and that is where the inspiration for this article came from. We are fortunate not to have to deal with a monolithic application, but rather with microservices. We run in the cloud (AWS) and we are big fans of automation, from infrastructure as code (Terraform, Ansible) to our tests (CDC tests, synthetic monitoring). While a microservice architecture has a lot of benefits, as your architecture evolves you can end up with quite a lot of dependencies between your services.

This article will highlight some of the benefits of moving towards an asynchronous architecture and some of our learnings.

Synchronous vs. Asynchronous

 

A synchronous operation is when you ask someone else for something and wait for them to respond, so the exchange happens in real time. As illustrated in the diagram below, after making a request, Alice has to actively wait for Bob to respond. Alice and Bob have to work together at the same time. In other words, they are synchronised.

 

 

An asynchronous operation is when you send someone a message and do not wait for them to reply. You can go on with your business and only react once you receive a reply. Email is one example of asynchronous communication. The diagram below illustrates Alice and Bob working asynchronously.

 

 
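The Alice-and-Bob contrast can be sketched with standard-library Scala Futures. `askSynchronously` ties up the caller until “Bob” answers. `askAsynchronously` hands back a `Future` immediately, so Alice carries on with her own work and reacts only when the reply arrives. The names and the simulated 200 ms delay are purely illustrative.

```scala
import java.util.concurrent.ConcurrentLinkedQueue
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import scala.jdk.CollectionConverters._

object AliceAndBob {
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Bob needs some time to come up with an answer.
  def bobAnswers(question: String): String = {
    Thread.sleep(200) // simulated thinking time
    s"answer to '$question'"
  }

  // Synchronous: Alice's thread is blocked until Bob's answer comes back.
  def askSynchronously(question: String): String = bobAnswers(question)

  // Asynchronous: Alice gets a Future back immediately and reacts when it completes.
  def askAsynchronously(question: String): Future[String] = Future(bobAnswers(question))

  /** Records what Alice does, in order, when she asks asynchronously. */
  def demo(): List[String] = {
    val log = new ConcurrentLinkedQueue[String]()
    val reply = askAsynchronously("Lunch at noon?")
    log.add("Alice carries on with her own work") // Bob is still thinking at this point
    val reacted = reply.map(answer => log.add(s"Alice reacts to: $answer"))
    Await.result(reacted, 2.seconds) // block only so the demo can return its log
    log.asScala.toList
  }
}
```

`demo()` returns the two log entries in that order. Alice's own work is recorded before her reaction to Bob's answer.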

Microservice architecture

“Microservices” is an architectural term describing a way of building a software system from independently deployable, loosely coupled services. Rather than creating one large application that does everything (in other words, a monolith), you create a suite of modular services, each with its own well-defined function. Microservices integrate via well-defined interfaces, most commonly REST over HTTP.

Another important aspect of a microservice is data encapsulation. Each service owns its own data and this data is only accessed via its interface.

I’ve definitely seen places where so-called microservices shared the same database. This is a major anti-pattern: you pay the cost of a distributed application and lose many of the benefits of a microservice architecture. Going back and fixing things like this is time-consuming.
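This minimal sketch of data encapsulation uses hypothetical names (`QuizResultsApi`, `QuizResultsService`) and an in-memory map in place of a real database. The store is private to the service, so other services can only reach the data through its interface. A shared database bypasses exactly this boundary.

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical quiz-results service, used only to illustrate data ownership.
final case class QuizResult(userId: String, score: Int)

// The public interface: the only way other services can reach the data.
trait QuizResultsApi {
  def record(result: QuizResult): Unit
  def latestScore(userId: String): Option[Int]
}

final class QuizResultsService extends QuizResultsApi {
  // Private store (an in-memory map standing in for the service's own database).
  // Nobody else reads or writes it directly, so its schema can change freely.
  private val results = TrieMap.empty[String, QuizResult]

  def record(result: QuizResult): Unit = results.update(result.userId, result)
  def latestScore(userId: String): Option[Int] = results.get(userId).map(_.score)
}
```

Because consumers depend only on `QuizResultsApi`, the service could move to a different database without any of them noticing.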

A microservice architecture allows you to choose the right technology for each specific problem. It also makes each service easier to maintain, as its codebase is a lot smaller.

 

Our journey started like this: we followed best practices, some of which are mentioned above, and took into account guidelines such as the Twelve-Factor App. We took testing seriously and introduced Consumer-Driven Contracts (CDC) and synthetic monitoring. CDC tests were relatively simple to implement. Each test was written from the point of view of a consumer, describing how that consumer expected an API to behave. Tests created their own data and cleaned up after themselves.
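The idea behind a CDC test can be sketched without any HTTP or testing library. Each consumer lists only the fields it relies on, and the check passes as long as the provider’s response still contains those fields with the expected types. Everything here is a hypothetical illustration, not the tooling we used. That includes `ConsumerContracts`, the field names, and the choice to represent a parsed JSON body as a `Map`. A real test would call the deployed API, and would create and clean up its own data around the check.

```scala
object ConsumerContracts {
  /** The shape a consumer expects one field of a JSON response to have. */
  sealed trait FieldType { def accepts(value: Any): Boolean }
  case object TextField extends FieldType {
    def accepts(value: Any): Boolean = value.isInstanceOf[String]
  }
  case object NumberField extends FieldType {
    def accepts(value: Any): Boolean = value match {
      case _: Int | _: Long | _: Double => true
      case _                            => false
    }
  }

  /** Hypothetical contract: only the fields this consumer actually relies on. */
  final case class Contract(consumer: String, expectedFields: Map[String, FieldType])

  /** Contract violations found in a provider response (an already-parsed JSON object).
    * Extra fields are ignored, so the provider can evolve without breaking this consumer.
    */
  def violations(contract: Contract, response: Map[String, Any]): List[String] =
    contract.expectedFields.toList.sortBy(_._1).flatMap { case (field, fieldType) =>
      response.get(field) match {
        case None                                     => List(s"missing field '$field'")
        case Some(value) if !fieldType.accepts(value) => List(s"field '$field' has the wrong type")
        case Some(_)                                  => Nil
      }
    }
}
```

For instance, a scoring consumer that needs `userId` and `score` is unaffected if the provider adds an `extra` field. If the provider removes `score`, the consumer gets `missing field 'score'`.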

We also made sure that we decoupled legacy services that were integrated at the database level. We ended up with a microservice architecture where each service could be safely deployed multiple times a day, and each service owned only the data it needed to do its job.

 

As the application evolved, more and more microservices were created, and introducing certain features became harder as some of the services inevitably became quite coupled to one another. Furthermore, if one service became unavailable, this immediately affected the services that depended on it.

The diagram below illustrates a simplified version of how the architecture looked at that point in time. You can easily imagine that if service A becomes unavailable for a period of time, service I cannot do its work.

 

Apart from calling services being impacted by errors, we also lost some flexibility, as every service knew about the others.

 

Asynchronous Microservice architecture

 

Introducing a message queue such as RabbitMQ and having services interact with each other through it solved many of the problems described above. The number of dependencies was minimised (less coupling), so each service became more autonomous. More importantly, a service can continue to work even if one of its downstream dependencies is down.

 

 

In this new architecture, none of these services know about each other, and availability is increased. Imagine that service I scores certain events and service A produces them. If service I becomes unavailable for a while, A can continue to function. In the meantime, the messages accumulate on the queue, and they are consumed once the scoring service is available again.
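The scenario above can be simulated in a few lines of Scala. `InMemoryQueue` is a hypothetical, single-threaded stand-in for a RabbitMQ queue, with no persistence, acknowledgements or concurrency. Service A keeps publishing while the scoring service is down. Once the scoring service comes back, it consumes the backlog in order.

```scala
import scala.collection.mutable

object BufferedScoring {
  final case class Event(id: Int)

  /** In-memory stand-in for a durable broker queue (such as a RabbitMQ queue).
    * Single-threaded for simplicity; a real broker persists and acknowledges messages.
    */
  final class InMemoryQueue[T] {
    private val messages = mutable.Queue.empty[T]
    def publish(msg: T): Unit = messages.enqueue(msg)
    def size: Int = messages.size
    /** Delivers every pending message to `handler`, oldest first. */
    def drainTo(handler: T => Unit): Unit =
      while (messages.nonEmpty) handler(messages.dequeue())
  }

  // Service A publishes events without knowing (or waiting for) who consumes them.
  def produce(queue: InMemoryQueue[Event], ids: Seq[Int]): Unit =
    ids.foreach(id => queue.publish(Event(id)))
}
```

Publishing events 1 to 5 while nothing consumes leaves `size == 5`. A later `drainTo` then delivers them as 1, 2, 3, 4, 5 and empties the queue.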

 

This new setup comes with some new problems that you need to take into account. What happens if a message cannot be consumed? How do you make sure you do not lose messages? What if your message queue fills up? At Coremetrix, having chosen RabbitMQ as our message queue, we answered these questions by making sure every queue has an equivalent dead letter queue where rejected messages end up. These messages can be manually inspected and even replayed later on. We also mirrored our queues across multiple nodes for high availability. Finally, we set up alerts on queue sizes.
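The dead letter queue and alerting rules can be modelled in memory to show the flow. In RabbitMQ itself, dead-lettering is configured on the queue, for example with the `x-dead-letter-exchange` queue argument or a policy. A message is dead-lettered when a consumer rejects it without requeueing. Other triggers, such as TTL expiry or a queue length limit, also dead-letter a message. `QueueWithDlq`, the `.dlq` naming and the alert threshold below are hypothetical illustrations, not our actual setup.

```scala
import scala.collection.mutable
import scala.util.Try

object DeadLettering {
  /** A work queue paired with its own dead letter queue (DLQ). Hypothetical,
    * in-memory and single-threaded; a broker such as RabbitMQ does this server-side.
    */
  final class QueueWithDlq[T](val name: String, alertThreshold: Int) {
    private val pending = mutable.Queue.empty[T]
    private val deadLetters = mutable.Queue.empty[T]

    def publish(msg: T): Unit = pending.enqueue(msg)
    def pendingCount: Int = pending.size
    def deadLetterCount: Int = deadLetters.size

    /** Consumes all pending messages. A message whose handler throws is rejected
      * to the DLQ, so it is neither lost nor redelivered forever. */
    def consume(handler: T => Unit): Unit =
      while (pending.nonEmpty) {
        val msg = pending.dequeue()
        if (Try(handler(msg)).isFailure) deadLetters.enqueue(msg)
      }

    /** Moves dead-lettered messages back onto the work queue (e.g. after a fix);
      * returns how many were replayed. */
    def replayDeadLetters(): Int = {
      val replayed = deadLetters.size
      while (deadLetters.nonEmpty) pending.enqueue(deadLetters.dequeue())
      replayed
    }

    /** Stand-in for monitoring: one alert per queue at or above the threshold. */
    def alerts: List[String] =
      List(name -> pending.size, s"$name.dlq" -> deadLetters.size).collect {
        case (queue, size) if size >= alertThreshold => s"ALERT: $queue has $size messages"
      }
  }
}
```

Suppose the queue receives `ok-1`, `bad`, `ok-2` and `bad`, and the handler throws on `bad`. Consuming them processes the two good messages and leaves two in the DLQ, which raises an alert at a threshold of 2. `replayDeadLetters()` then puts them back on the work queue.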

 

Another complication that we had to overcome was testing. Automated testing is a lot more problematic when it comes to asynchronous systems. Our tests consist of unit tests and integration tests that run against Docker containers and stubs when we do a build. On top of that we have CDC tests and synthetic tests after deployment to each environment.

One of the first things we had to do was to write integration tests for the integration with the queues. These tests give us the confidence that we are using the infrastructure correctly.

Secondly, after the new changes were introduced, the majority of our CDC tests started failing. In these tests we had to replicate the behaviour of the producer and poll the downstream API to check for the response. Our tests are written in Scala; a typical test looks something like this:
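Here is a self-contained sketch of that shape, written against the Scala standard library only. `eventually` and `Patience` are simplified, hypothetical stand-ins for ScalaTest’s `Eventually` and `PatienceConfig`, not their real implementation. An in-memory map updated by a background thread stands in for the message queue and the downstream API.

```scala
import scala.annotation.tailrec
import scala.collection.concurrent.TrieMap
import scala.concurrent.duration._
import scala.util.{Failure, Success, Try}

object Polling {
  /** Simplified stand-in for ScalaTest's PatienceConfig. */
  final case class Patience(timeout: FiniteDuration, interval: FiniteDuration)

  /** Re-evaluates `check` until it succeeds or `patience.timeout` runs out,
    * sleeping `patience.interval` between attempts; rethrows the last failure. */
  def eventually[A](patience: Patience)(check: => A): A = {
    val deadline = patience.timeout.fromNow
    @tailrec def attempt(): A = Try(check) match {
      case Success(result) => result
      case Failure(_) if deadline.hasTimeLeft() =>
        Thread.sleep(patience.interval.toMillis)
        attempt()
      case Failure(lastError) => throw lastError
    }
    attempt()
  }

  /** Hypothetical test: act as the producer, then poll the consumer's API. */
  def scoreIsEventuallyServed(): Int = {
    val scoresApi = TrieMap.empty[String, Int] // stands in for the downstream API
    // Stands in for publishing "user-42 finished the quiz" and the consumer processing it later.
    new Thread(new Runnable {
      def run(): Unit = { Thread.sleep(300); scoresApi.update("user-42", 87) }
    }).start()
    eventually(Patience(timeout = 5.seconds, interval = 100.millis)) {
      val score = scoresApi.get("user-42")
      assert(score.isDefined, "score not served yet")
      score.get
    }
  }
}
```

`scoreIsEventuallyServed()` returns 87 after a few polls. If the score never appeared, the last assertion failure would be rethrown once the five-second timeout ran out.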

 

 

You would also have to set up a PatienceConfig to define how long to keep retrying unsuccessful attempts and the interval to sleep between attempts.

These tests give us the confidence that messages can be published/consumed and served as expected via the tested API.

 

Summary

 

Moving to an asynchronous architecture made our system more robust and less coupled, and improved its performance and scalability. It also improved reliability by making our system more tolerant of errors. There are definitely a lot of new things to learn and new problems to take into account. Testing can be more complicated, but that is a price worth paying.

 

Cristian Pupazan

Head of Engineering