Machine learning product development

Machine learning products are generally complex workflows that stitch together several operations: data extraction, data preparation, and model building or inference.

These operations need to be performed in exactly the same way when models are built and when they are used. They often require bundles of software tools, libraries and vendor products.
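As a minimal sketch of this idea (scikit-learn and the synthetic dataset are assumed here purely for illustration, not as a prescribed stack), bundling the preparation and modelling steps into a single pipeline object is one way to guarantee that the same operations run at build time and at inference time:

```python
# A minimal sketch of a workflow as one pipeline object; scikit-learn and
# the synthetic dataset are illustrative assumptions, not a prescribed stack.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_new, y_train, _ = train_test_split(X, y, random_state=0)

workflow = Pipeline([
    ("prepare", StandardScaler()),     # data preparation step
    ("model", LogisticRegression()),   # model building / inference step
])

workflow.fit(X_train, y_train)         # the operations used to build the model...
predictions = workflow.predict(X_new)  # ...are replayed identically at inference
```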

Unlike traditional software development, there are two types of testing or validation of machine learning products.

First, models need to be validated with predefined numerical measures to ensure that they answer their base problem statement and that they generalise well (i.e. they perform well on new, previously unseen data). This step can either take place in the development phase or be encapsulated in the machine learning product itself, if the product is meant to build new models automatically.
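As a sketch of this first kind of validation (scikit-learn and accuracy are again illustrative assumptions), data held out from model building is scored with a predefined measure to estimate generalisation:

```python
# Minimal sketch of validating a model on held-out data; the dataset,
# model and metric (accuracy) are illustrative choices.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression().fit(X_train, y_train)

# Scoring on data the model has never seen is a proxy for generalisation.
print(accuracy_score(y_test, model.predict(X_test)))
```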

Second, the code or software needs to be tested for bugs, logic errors, performance issues and so on, in the same way as traditional software.
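For example, the deterministic pieces of a workflow can be unit-tested like any other code. The sketch below assumes pytest, and `clean_age` is a hypothetical data preparation helper invented for illustration:

```python
# Hypothetical example: unit-testing a data preparation helper with pytest.
import pytest

def clean_age(value):
    """Coerce a raw age value to an int, rejecting implausible values."""
    age = int(value)
    if not 0 <= age <= 130:
        raise ValueError(f"implausible age: {age}")
    return age

def test_clean_age_accepts_valid_strings():
    assert clean_age("42") == 42

def test_clean_age_rejects_implausible_values():
    with pytest.raises(ValueError):
        clean_age("-1")
```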

This can involve a lot of code refactoring, hyperparameter tuning, experimentation and debugging.

A second difference between traditional software development and machine learning product development concerns how data is used and managed.

In traditional software, the engineer is concerned with building logic that uses data for its intended purpose, and with ensuring that the software doesn’t fail unexpectedly when it encounters new or erroneous data values.

The machine learning engineer or data scientist is concerned with the above, and in addition with applying mathematical (and at times logical) transformations to the data in order to adapt it to the requirements of the algorithms downstream. These transformations often depend on the statistical properties of the data sample used to build the models at the development stage. When deploying their products to production, they need to ensure that exactly the same transformations are applied to the new, unseen data on which their workflow runs. It would be a grave error to recalculate these transformations from the statistical properties of the incoming data in production.
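A minimal sketch of that discipline, assuming scikit-learn and joblib: the transformation is fitted once on the development sample, persisted, and then only applied (never refitted) to production data:

```python
# Sketch: fit a transformation on the development sample, persist it,
# and reuse the fitted object unchanged in production.
import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# --- development stage ---
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])  # illustrative sample
scaler = StandardScaler().fit(X_train)            # statistics come from here
joblib.dump(scaler, "scaler.joblib")

# --- production stage ---
scaler = joblib.load("scaler.joblib")
X_new = np.array([[10.0], [11.0]])
X_scaled = scaler.transform(X_new)                # correct: reuse training statistics
# StandardScaler().fit_transform(X_new)           # the grave error: refitting on production data
```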

Finally, once a machine learning product is deployed, it will be used to infer something about new incoming data (e.g. make a prediction, give a recommendation or perform a classification). Usually, it will be integrated into a broader product targeting an end user group (e.g. customers on a website, or engineers monitoring a critical process). To achieve both good performance and efficiency, the machine learning product is often accessed as a service. This works in a similar way to accessing a website, where a user types a URL that fires up some pre-built web pages. In the case of machine learning products, instead of a web page, we have a process that is executed when its corresponding URL is called by a user or another application. Ideally, we want that process to consume the resources it needs for the workload it is called to execute and to hibernate when it is not being used. This adds a non-trivial scaling requirement.
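As a minimal sketch of that service pattern (Flask, the `/predict` path and the stub model are all illustrative assumptions, not a prescribed design), a small process executes each time its URL is called:

```python
# Minimal sketch of a model behind a URL; Flask and the /predict route are
# illustrative, and the "model" is a stub standing in for a persisted one.
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict(features):
    # Stand-in for a real model loaded once at start-up.
    return sum(features)

@app.route("/predict", methods=["POST"])
def predict_endpoint():
    features = request.get_json()["features"]
    return jsonify({"prediction": predict(features)})

if __name__ == "__main__":
    app.run()  # in production, a platform layer handles scaling up and down
```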

Implications for efficient machine learning product deployment strategy

Repeatability, scalability and consistency of the complex operations that make up a machine learning product are, therefore, the cornerstone of successful implementations of machine learning strategies.

One of the challenges that data scientists face is how to achieve these three objectives efficiently, without having to allocate substantial resources to translating their code into more production-appropriate languages, or to running and maintaining large infrastructure that carries a high fixed cost.

The technical challenge of moving from a laptop- or virtual-machine-based prototype to production-grade software is one of the major reasons why data science initiatives remain limited in scale in most organisations.

Our focus has been on adopting and developing best practices, from both a technology and a process perspective, to ensure that machine learning products can be efficiently developed, deployed, used and improved.

Enter Kubernetes.

What is Kubernetes

According to its official website, Kubernetes is:

an open source system for automating deployment, scaling, and management of containerised applications.

Application containerisation is an operating system (OS) level virtualisation method used to deploy and run distributed applications without launching an entire virtual machine (VM) for each application.

Simply put, containerisation allows developers to write software made up of small apps that run on multiple computers within a network (distributed applications), and to deploy each of these small apps as a container that packages the code and everything it depends on to run, but no more (virtualisation). This makes running applications fast, efficient, reliable and almost independent of the underlying computing environment.

If we have a large number of these containerised applications, it can become onerous to manage them and the resources they consume. Kubernetes is designed to allow organisations to scale the number of applications they deploy in this pattern, and to manage the underlying infrastructure and processes needed to run them at scale.

Relevance to machine learning

So, with Kubernetes we get three key features:

  1. we can deploy complex software as a set of small applications, each running in its own virtual environment that contains the code and its dependencies (containers)
  2. we can lift and shift these containers from one computing environment to another without reworking the code
  3. we can scale our computing environment depending on our need, and automate quite a lot of the container deployment, scaling and management operations

Meanwhile, as discussed above, machine learning applications are in effect complex workflows. Data scientists use a variety of tools, libraries and vendor products to build the various operations in these workflows. They build complex mathematical logic that needs to be deployed consistently, and they need an easy way to package all of their applications’ dependencies alongside it. It is therefore easy to see how the application containerisation paradigm perfectly fits the needs of machine learning product developers for repeatability, scalability and consistency.

A non-trivial advantage of building machine learning products in a Kubernetes environment is that the code used in development is the code used in production. There are no translations, no adjustments to a target computing environment. This is a huge efficiency gain for what is technically and intellectually demanding work.

Further insights

State of The Art Machine Learning Ops: in this article we discuss the Dev/Git Ops process we use to build scalable machine learning products.

Automating Machine Learning: in this article we give an example of a product we built using both Kubernetes and our Dev/Git Ops process.