Auto-Scaling GitLab CI in OTC

Auto-Scaling CI

This document captures the essential operational aspects of the auto-scaling CI deployed to ease the wait for large and intensive bitbake builds.

The central idea is that a single gitlab-runner acts as a front-end to a dynamic pool of machines that are created, provisioned, used and then destroyed as the number of pending jobs changes.

This setup is documented by GitLab in "Install and register GitLab Runner for autoscaling with Docker Machine".

In our configuration, due to the nature of bitbake builds, we deviate from the stock system in two ways:

  1. The machine running as the gitlab-runner manager also provides an NFS share for the download cache and the sstate-cache.
  2. The machine running as the gitlab-runner manager also provides an NFS share for the git-repo cache and maintains that cache with systemd services.

The first deviation is purely practical: without it even the smallest build takes hours and may easily fail to download a source archive or git repository. The second deviation avoids cloning large meta-layers over and over, and seems to help with rate limiting by GitHub, where many of those layers are stored.
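As a rough illustration of how a CI job might use the shared caches, the sketch below appends DL_DIR and SSTATE_DIR settings to a bitbake local.conf. It assumes the NFS share is visible inside the build container at /var/shared (the mount point used by the runner configuration in this document). The function name use_shared_cache and the downloads and sstate-cache subdirectory names are hypothetical and not taken from the actual deployment.

```shell
#!/bin/sh
# use_shared_cache LOCAL_CONF [SHARED_DIR]
# Append bitbake cache settings to LOCAL_CONF so that downloads and
# shared state land on the NFS-backed directory.
# SHARED_DIR defaults to /var/shared; the subdirectory names below are
# assumptions made for this sketch.
use_shared_cache() {
  conf=$1
  shared=${2:-/var/shared}
  cat >>"$conf" <<EOF
DL_DIR = "$shared/downloads"
SSTATE_DIR = "$shared/sstate-cache"
EOF
}

# Example call from a job script, before invoking bitbake:
# use_shared_cache build/conf/local.conf
```

Appending rather than overwriting keeps any settings the layer or the job has already placed in local.conf.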

GitLab Runner

Install a standard GitLab runner from the Debian packages provided by GitLab.

Register the runner and use the /etc/gitlab-runner/config.toml file reproduced below as a template. XXX denotes secrets that were removed and need to be replaced by real values.

/usr/local/bin/docker-machine

We are using a fork of docker-machine maintained by GitLab, with some fixes that are not yet upstream.

The merge request that fixes the bug affecting docker is "Restart, not start docker after setting up auth" (!62) in the GitLab.org / Ops Sub-Department / docker-machine project on GitLab.

/usr/local/bin/docker-machine-driver-otc

We are using a fork of the OTC driver for docker-machine with a patch that enables cloud-init user-data.

The pull request that adds the missing feature is "Add support for user-data" by zyga (Pull Request #32) in the Huawei/DockerMachineDriver4OTC repository on GitHub.

/etc/gitlab-runner/config.toml

# Accept up to 19 jobs at a time.
concurrent = 19
check_interval = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = "gitlab-runner-manager"
  url = "https://git.ostc-eu.org/"
  # This is NOT the token you get from the GitLab CI page.
  # This is a machine-specific token created using that token.
  # Don't copy-paste this across machines.
  token = "XXX"
  executor = "docker+machine"
  # Allow the docker+machine executor to have up to 19 machines.
  # This cannot be smaller than the concurrent= setting above.
  limit = 19
  [runners.custom_build_dir]
  [runners.cache]
    [runners.cache.s3]
    [runners.cache.gcs]
    [runners.cache.azure]
  [runners.docker]
    tls_verify = false
    image = "ubuntu:20.04"
    privileged = false
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
		# Mount /cache from the container host to /var/shared
		# in the container. This is how we expose the mounted
		# NFS share all the way down to the containers.
		#
		# Note that the /cache mount point is established by cloud-init below.
    volumes = ["/cache:/var/shared:rw"]
    shm_size = 0
  [runners.machine]
    # Do not keep any idle machines around.
    # In a pure auto-scaling setup this could be tweaked to keep some
    # number of workers around during working hours and fewer or none
    # in off-peak hours. There's extensive syntax to support this, documented
    # in the GitLab reference manual.
    IdleCount = 0
		# Keep an idle machine up for 60 seconds before removing it.
    IdleTime = 60
    MachineDriver = "otc"
    MachineName = "%s"
		# Recycle machines after 100 builds in case something piles up.
    MaxBuilds = 100
    MachineOptions = [
        # Region and endpoint URL. There's also one for NL but we've never used it.
        "otc-service-endpoint=https://ecs.eu-de.otc.t-systems.com",
        "otc-region=eu-de",
				# The access key and secret permitted to create machines.
        "otc-access-key-id=XXX",
        "otc-access-key-secret=XXX",
				# Find the tenant ID of your account.
        "otc-tenant-id=ed302128210940f189099d63dcf62b16",
				# The type of machine created. Each machine gets only one job
				# so find some balance between large but expensive machines
				# and a swarm of cheaper but smaller machines. This describes
				# a general purpose 2CPU+16GB system.
        "otc-flavor-id=s3.large.8",
				# Find available images in OTC Image Management Service.
				# Look for public images and expand the details of the image
				# you want to use to find the UUID. The image referenced below
				# is Ubuntu 20.04/latest for amd64.
        "otc-image-id=660f9a71-d491-4013-92ad-22d65cdd7c67",
				# Find the VPC information in your cloud dashboard
				# otc-vpc-id is the UUID of the "core" VPC (the name is arbitrary)
				# in the OTC dashboard. 
        "otc-vpc-id=XXX",
				# otc-subnet-id is actually the Network ID as shown in the OTC dashboard
				# of the "core" subnet of the "core" VPC.
        "otc-subnet-id=XXX",
				# There are several availability zones in eu-de.
				# Perhaps pick one that matches the deployment of gitlab-runner-manager.
        "otc-available-zone=eu-de-01",
				# Size of the root volume should be enough to hold a single build, the OS
				# and any containers needed. 100GB is a rough approximation.
        "otc-root-volume-size=100",
        "otc-root-volume-type=SSD",
				# The security group must allow connections to the Docker TCP port
				# in the inbound rules: TCP/2376 in addition to SSH (TCP/22)
        "otc-security-group=docker-machine",
				# Give the machine a public IP address.
        "otc-elastic-ip=1",
				# Give the machine 10Mbit/s of bandwidth for public traffic.
        "otc-bandwidth-size=10",
				# User name docker-machine tries to ssh as. This must be consistent with
				# the selected image.
        "otc-ssh-user=ubuntu",
				# The user-data script (valid cloud-init setup) which mounts NFS
				# from the manager machine.
        "otc-user-data-file=/srv/user-data.sh",
      ]
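The comments in the configuration state that limit must not be smaller than concurrent. A minimal sketch of a sanity check for that rule is shown below. It is not part of the deployment: the function names toml_int and check_limits are hypothetical, and the parsing is a simple line match rather than a real TOML parser, so it only suits a file with a single [[runners]] section like the one above.

```shell
#!/bin/sh
# toml_int KEY FILE
# Print the first integer assigned to KEY in FILE (naive line match,
# not a real TOML parser).
toml_int() {
  sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p" "$2" | head -n 1
}

# check_limits FILE
# Return 0 when limit >= concurrent, 1 when limit is smaller,
# 2 when either setting cannot be found.
check_limits() {
  concurrent=$(toml_int concurrent "$1")
  limit=$(toml_int limit "$1")
  if [ -z "$concurrent" ] || [ -z "$limit" ]; then
    echo "could not find concurrent= or limit= in $1" >&2
    return 2
  fi
  if [ "$limit" -lt "$concurrent" ]; then
    echo "limit ($limit) is smaller than concurrent ($concurrent)" >&2
    return 1
  fi
  echo "ok: concurrent=$concurrent limit=$limit"
}

# Example:
# check_limits /etc/gitlab-runner/config.toml
```

Running such a check before restarting gitlab-runner would catch the case where only one of the two values was edited.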

/srv/user-data.sh

The cloud-init user-data payload, as a simple shell script to mount NFS.

#!/bin/sh -v
export DEBIAN_FRONTEND=noninteractive

apt-get update
apt-get install -y nfs-client

mkdir -p /cache
# This is the IP of the NFS server, which in our case is also the
# coordinator, for convenience.
mount -t nfs 192.168.184.154:/srv/cache /cache
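The script above attempts the mount exactly once. If the network or the NFS server is not yet reachable when cloud-init runs, a retry wrapper may make the mount more robust. The following is only a sketch under that assumption; the retry helper is hypothetical and is not part of the deployed user-data.sh.

```shell
#!/bin/sh
# retry ATTEMPTS DELAY COMMAND...
# Run COMMAND until it succeeds, at most ATTEMPTS times, sleeping
# DELAY seconds between attempts. Returns 1 if every attempt fails.
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while ! "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "giving up after $n attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Possible use in user-data.sh, with the same server address and export
# path as the script above (attempt count and delay are arbitrary):
# retry 10 6 mount -t nfs 192.168.184.154:/srv/cache /cache
```

A worker that ends up without /cache would still accept jobs but build without any cache, so failing loudly here is preferable to continuing silently.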

Statistics

We want to keep some basic statistics about the number of machines kept alive over time. This is accomplished by the following system:

  1. A systemd service enumerating machines and storing the output in a file readable by unprivileged users
  2. A systemd timer running said service every 150 seconds
  3. A script which counts the number of machines based on the output of the service

The last script is used by a Landscape custom graph, which runs the script as an unprivileged user (nobody) to display a graph of the number of machines over time.

The scripts are reproduced below:

/etc/systemd/system/docker-machine-ls.service

[Unit]
Description=Query the number of docker machine units

[Service]
Type=oneshot
Environment=HOME=/root
ExecStart=/bin/sh -c "/usr/local/bin/docker-machine ls >/tmp/docker-ls.txt && mv /tmp/docker-ls.txt /run/docker-ls.txt"

/etc/systemd/system/docker-machine-ls.timer

[Unit]
Description=Check how many docker-machine machines are up

[Timer]
OnBootSec=10min
OnUnitActiveSec=150sec

[Install]
WantedBy=timers.target

/usr/local/bin/docker-machine-count

#!/bin/sh
if [ -s /run/docker-ls.txt ]; then
  echo "$(grep -c Running /run/docker-ls.txt)"
else
  echo 0
fi
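The script above counts every line that contains the word Running, which would also match a machine whose name or error text happens to contain it. A stricter variant, sketched below, compares only the STATE column. It assumes the usual docker-machine ls column order (NAME, ACTIVE, DRIVER, STATE, ...) and machine names without spaces; the function name count_running is hypothetical.

```shell
#!/bin/sh
# count_running FILE
# Print the number of rows in saved "docker-machine ls" output whose
# STATE column (4th whitespace-separated field) is exactly "Running".
# Prints 0 when FILE is missing or empty, like the original script.
count_running() {
  if [ -s "$1" ]; then
    # NR > 1 skips the header row; n + 0 prints 0 when nothing matched.
    awk 'NR > 1 && $4 == "Running" { n++ } END { print n + 0 }' "$1"
  else
    echo 0
  fi
}

# Example:
# count_running /run/docker-ls.txt
```

Like the original, this variant exits with status 0 even when no machines are running, which keeps a zero count from looking like a failure to whatever runs the script.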