Vertex AI custom training jobs in GitLab CI

How to set up containers for MLOps pipelines

Ricardo Mendes
Google Cloud - Community



This blog post is about a challenge my team recently faced when working on tasks that depend on both gcloud beta ai and docker, in the scope of an MLOps project. Those tasks consist of developing lightweight integration tests for custom training applications. We use the Google Cloud Vertex AI unified ML platform, so we can leverage gcloud beta ai custom-jobs local-run to test “locally”, both on development machines and in the early stages of the CI/CD pipeline.
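For reference, a local-run invocation looks roughly like the sketch below. The executor image and script path are illustrative placeholders, not the actual values from our project:

$ gcloud beta ai custom-jobs local-run \
    --executor-image-uri=us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-4:latest \
    --local-package-path=. \
    --script=trainer/task.py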

The Machine Learning engineers who work on the custom training applications successfully ran the tests in their development environments, where Cloud SDK and the Docker daemon are available. Once local development was done, it was time to set up the continuous integration environment, which runs on GitLab.

Docker image setup

As stated in the documentation, gcloud beta ai custom-jobs local-run packages your training code into a Docker image and executes it locally. Under the hood, it seems to call docker build and docker run, so our first action was to learn about running Docker commands within GitLab CI containers (our runners use the Docker executor). Fortunately, GitLab provides extensive documentation on it!

If you read GitLab’s documentation, you will notice that most examples use the below image and services setup:
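(The snippet below is reconstructed from GitLab’s Docker-in-Docker documentation for this Docker release; the embed in the original post may have differed slightly.)

image: docker:19.03.12

services:
  - docker:19.03.12-dind

variables:
  DOCKER_HOST: tcp://docker:2376
  DOCKER_TLS_CERTDIR: "/certs"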

That should be fine for various Docker in Docker (aka DinD) use cases but not for ours. We need not only the Docker daemon but also Cloud SDK in the CI container.

So we decided to build a custom image, starting from docker:19.03.12 and installing the Cloud SDK into it, as follows.
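The sketch below closely follows the official Cloud SDK Alpine Dockerfile; the actual file in the demo repository may differ in minor details, such as the pinned SDK version:

FROM docker:19.03.12

# Pin the Cloud SDK version so builds are reproducible
ARG CLOUD_SDK_VERSION=361.0.0
ENV PATH /google-cloud-sdk/bin:$PATH

# Install the SDK the same way the official Alpine image does,
# then add the beta component required by gcloud beta ai
RUN apk --no-cache add curl python3 py3-crcmod bash libc6-compat gnupg && \
    curl -O https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-${CLOUD_SDK_VERSION}-linux-x86_64.tar.gz && \
    tar xzf google-cloud-sdk-${CLOUD_SDK_VERSION}-linux-x86_64.tar.gz && \
    rm google-cloud-sdk-${CLOUD_SDK_VERSION}-linux-x86_64.tar.gz && \
    gcloud config set core/disable_usage_reporting true && \
    gcloud components install beta && \
    gcloud --version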

A few things to keep in mind:

  1. The docker:19.03.12 image is built upon alpine:3.12, so we used an Alpine-based approach to install Cloud SDK into the container, taking the official Cloud SDK Alpine Dockerfile as a reference.
  2. Cloud SDK’s Alpine image is built upon alpine:3.13, so we don’t expect compatibility issues when using the same commands on top of docker:19.03.12 / alpine:3.12.
  3. Cloud SDK’s Alpine image does not include extra components, and we need gcloud beta. So, we installed the component ourselves, as you can see in the gcloud components install beta step of the Dockerfile above.

Demo

I’ve put together a git repository to demonstrate the custom image in action: gitlab.com/ricardomendes/docker-cloud-sdk-sample-repo.

The .gitlab-ci.yml file, which defines a short CI pipeline, consists of two steps. First, it builds the custom Docker image and pushes it to the repo’s container registry. Then, it pulls the image and runs a few commands that allow us to make sure gcloud beta and docker are up and running, as shown in the below code snippet:
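A sketch of such a pipeline is shown below; the job names are illustrative, and the registry-related variables are GitLab’s predefined CI variables:

variables:
  DOCKER_HOST: tcp://docker:2376
  DOCKER_TLS_CERTDIR: "/certs"

stages:
  - build
  - verify

build-image:
  stage: build
  image: docker:19.03.12
  services:
    - docker:19.03.12-dind
  script:
    - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
    - docker build -t $CI_REGISTRY_IMAGE:latest .
    - docker push $CI_REGISTRY_IMAGE:latest

verify-image:
  stage: verify
  image: $CI_REGISTRY_IMAGE:latest
  services:
    - docker:19.03.12-dind
  script:
    - gcloud --version
    - docker --version
    - docker run hello-world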

The results are:

$ gcloud --version
Google Cloud SDK 361.0.0
beta 2021.10.15
bq 2.0.71
core 2021.10.15
gsutil 5.4

$ docker --version
Docker version 19.03.12, build 48a66213fe

$ docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
...
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub. (amd64)
 3. The Docker daemon created a new container from that image which runs the executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it to your terminal.

The complete build log is available here.

This is how we set up an image with the required services to run gcloud beta ai custom-jobs local-run and enable the integration tests in the MLOps pipeline we are working on.

Final considerations

Running distinct workloads in the same Docker container is not always straightforward. Even so, we considered using the custom image to run other types of integration tests as well, Python-based ones, for instance.

Although the custom image brings the elementary Python 3 resources required by gcloud, we realized it is not appropriate for running Python-based tests. Installing Python dependencies in Alpine-based containers is cumbersome (please look at these links to understand what I mean: pythonspeed.com/articles/alpine-docker-python and python.org/dev/peps/pep-0656), which is why we also considered a Debian-based approach instead of Alpine. To do so, we started from google/cloud-sdk:361.0.0, which is based on debian:buster-slim, and then tried to install Docker into it, but we have not succeeded so far.

That being said, the team agreed to run the integration tests in multiple containers, which can run concurrently, leveraging the features each image provides and avoiding ever-growing customization.

I hope it helps, and feedback is always welcome!

Best,
