Auto-Updating VM OS Base Images with KubeVirt’s Containerised Data Importer

KubeVirt is a great way to run virtual machines on top of an existing Kubernetes cluster – and is useful for situations where you have an existing bare-metal Kubernetes cluster, but still have a few workloads that either can’t be containerised or are unsuitable for containerisation.

If you want to boot a VM straight into a pre-installed OS, rather than using an external provisioning server or hand-installing from an attached ISO, there are two main ways to do this:

  1. Use a containerDisk. This pulls a pre-made container image and runs the OS inside of it. This is ephemeral and so upon VM restart any data written to the disk will be lost. This is fine if your workloads are stateless or don’t rely upon persistent data (say, Kubernetes worker nodes!) but problematic otherwise.
  2. Use a dataVolume. This clones an existing DataVolume, or imports an externally-hosted disk image, into a new PersistentVolume and attaches it to the VM. This disk persists across VM reboots, and so is a great option for stateful workloads or those that need to persist data. The downside, however, is that an externally-hosted image will need to be pulled each time you create a new VM. That is not a problem for one-off VMs or small OS images, but if you’re spinning up a large number of VMs on a regular basis then this could significantly delay start-up unless you host a copy of the image internally. Given that many OS images also receive nightly updates with the latest packages, ideally this mirror would automatically sync the latest updates too.
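
As a rough sketch of how the two options differ in a VM spec, the fragments below show only the volumes section for each. The DataVolume name my-vm-disk is a hypothetical placeholder; the container image is the same one used later in this post.

```yaml
# Option 1: ephemeral disk backed by a container image
volumes:
  - name: os
    containerDisk:
      image: quay.io/containerdisks/ubuntu:22.04
---
# Option 2: persistent disk backed by a DataVolume
volumes:
  - name: os
    dataVolume:
      name: my-vm-disk  # hypothetical DataVolume name
```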

KubeVirt’s Containerised Data Importer (CDI) has a handy feature that implements option 2 – periodically importing (syncing) the latest version of an image and making it available on the cluster without the need to deploy a separate mirroring system! Once configured, newly launched VMs will consume the latest version of the OS image that has been imported (synchronised) to the cluster.

At a high level, this works as follows:

  1. A DataImportCron object is defined. This tells CDI where to import the image from, how often to check for new versions, how many versions to retain etc.
  2. CDI will periodically import the image as per the DataImportCron spec. Each import will create a new DataVolume object that refers to an underlying PersistentVolume containing the cloned image. At this point, a VM could consume this DataVolume, however the name of it changes with each import.
  3. If the DataImportCron object specifies a managedDataSource, then on each import CDI will also create or update an existing DataSource object. This object acts as an abstraction – it has a consistent name but points to the latest DataVolume that CDI has created. As the name is consistent, it makes it ideal for using in a VM spec!
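
To make the abstraction in step 3 concrete, a managed DataSource might look roughly like the sketch below once an import has completed. The suffix on the source volume name is a made-up placeholder; CDI generates the real one, and the exact status fields are omitted here.

```yaml
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataSource
metadata:
  name: ubuntu-2204            # stable name, safe to reference from VM specs
  namespace: default
spec:
  source:
    pvc:
      name: ubuntu-2204-0123456789ab   # hypothetical; changes with each new import
      namespace: default
```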

With the high level flow established, let’s create the actual objects and see this in action.

Note: This assumes that you have both KubeVirt and the Containerised Data Importer already installed on your cluster. If not, please follow the KubeVirt and Containerised Data Importer installation instructions first.

Step 1 – Defining a DataImportCron object

The manifest below shows an example DataImportCron object which will import (sync) an Ubuntu 22.04 image to the cluster at midnight every day:

apiVersion: cdi.kubevirt.io/v1beta1
kind: DataImportCron
metadata:
  name: ubuntu-2204
  namespace: default
spec:
  managedDataSource: ubuntu-2204
  schedule: "0 0 * * *"
  template:
    spec:
      source:
        registry:
          url: docker://quay.io/containerdisks/ubuntu:22.04
      storage:
        resources:
          requests:
            storage: 5Gi

Let’s break down what each part of this does!

  managedDataSource: ubuntu-2204

Here, we specify the name of the DataSource object that should be created and kept up to date with the latest DataVolume created by the import. This will be created within the same namespace as the DataImportCron object.

  schedule: "0 0 * * *"

How often should new images be imported? A balance needs to be struck here between freshness and data pull volume. Most OS images are rebuilt nightly, and so a nightly pull is likely appropriate. This uses standard crontab syntax.
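
For illustration, a few alternative schedules in the same crontab syntax are sketched below. Only one schedule value should be set; the alternatives are left commented out. Which timezone the schedule is evaluated in depends on your cluster, so treat the times as an assumption to verify.

```yaml
spec:
  # Nightly at midnight (as used in the manifest above)
  schedule: "0 0 * * *"
  # Alternatives (use only one schedule value):
  # schedule: "0 */6 * * *"   # every six hours
  # schedule: "30 2 * * 0"    # 02:30 every Sunday
```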

  template:
    spec:
      source:
        registry:
          url: docker://quay.io/containerdisks/ubuntu:22.04

This defines where the image should be imported from. This must either be a container registry, or the location of an OCI Archive – it cannot be an ISO or disk image file. The KubeVirt team maintains a set of container disk images (published under quay.io/containerdisks, as in the URL above) for Ubuntu, RedHat Container OS and CentOS Stream that make great sources for OS images.

      storage:
        resources:
          requests:
            storage: 5Gi

This defines the storage specification for the cloned image. The example here is basic, only specifying that the disk should be inflated to 5Gi on clone. This follows KubeVirt’s StorageSpec API, and so other parameters can also be specified such as storageClassName. This is handy if you have multiple storage classes available in your cluster, for example, slow HDD-backed storage or fast SSD-backed storage.
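
As a sketch of the storageClassName option mentioned above, the storage section could be extended as follows. The class name slow-hdd is hypothetical; substitute one that exists in your cluster.

```yaml
      storage:
        storageClassName: slow-hdd   # hypothetical storage class name
        resources:
          requests:
            storage: 5Gi
```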

Upon creating the object on the cluster, CDI will automatically perform the initial import. Subsequent imports will run as per the cron timer specified in the schedule section.
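
Because every scheduled import can create a new volume, you may want to bound how many old imports are kept. The fragment below is a sketch using the garbageCollect and importsToKeep fields as I understand them from the CDI API; check the field names and defaults against your installed CDI version before relying on them.

```yaml
spec:
  managedDataSource: ubuntu-2204
  schedule: "0 0 * * *"
  garbageCollect: Outdated   # remove imports that are no longer current
  importsToKeep: 2           # illustrative value: keep the two most recent imports
```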

A new DataSource and DataVolume object will be created automatically. Initially, the DataVolume will be in an ImportScheduled phase. An importer pod will be spun up automatically, and the DataVolume will transition to the ImportInProgress phase as the importer pod carries out the import operation. By running kubectl describe against the DataVolume, the import progress can be seen along with any errors. Once complete, the DataVolume will transition into the Succeeded phase. At this point, the data volume is ready for use by a VM via the DataSource abstraction.

Step 2 – Create a VM to clone and consume the volume

Now that the DataVolume has been created, with a DataSource pointing to it, a VM can be created which will clone the DataVolume into a new persistent DataVolume specifically for the new VM. The following VirtualMachine object demonstrates this, referencing the imported image through the DataSource created in the previous step.

apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachine
metadata:
  labels:
    kubevirt.io/vm: ubuntu-2204-vm
  name: ubuntu-2204-vm
spec:
  running: true
  template:
    metadata:
      labels:
        kubevirt.io/vm: ubuntu-2204-vm
    spec:
      domain:
        devices:
          disks:
          - disk:
              bus: virtio
            name: os
          - disk:
              bus: virtio
            name: cloudinitdisk
        resources:
          requests:
            memory: 512M
      volumes:
      - dataVolume:
          name: ubuntu-2204-vm-disk
        name: os
      - cloudInitNoCloud:
          userData: |
            #cloud-config
            password: ThisIsReallyUnsafeButThisIsJustADemo
            chpasswd: { expire: False }
        name: cloudinitdisk
  dataVolumeTemplates:
    - spec:
        pvc:
          accessModes:
            - "ReadWriteOnce"
          resources:
            requests:
              storage: 5Gi
        sourceRef:
          kind: DataSource
          name: ubuntu-2204
      metadata:
        name: ubuntu-2204-vm-disk

Much of this is just a standard definition of a KubeVirt VM – but let’s break down the bits related to the volume specifically. Keep in mind that this is a VirtualMachine object, and not a VirtualMachineInstance. When a running VirtualMachine object is created on a cluster, KubeVirt will automatically create a corresponding VirtualMachineInstance object which runs the VM itself – loosely similar to how a Deployment object will create Pod objects!

  dataVolumeTemplates:
    - spec:
        pvc:
          accessModes:
            - "ReadWriteOnce"
          resources:
            requests:
              storage: 5Gi
        sourceRef:
          kind: DataSource
          name: ubuntu-2204
      metadata:
        name: ubuntu-2204-vm-disk

Starting at the bottom, a dataVolumeTemplate is defined. KubeVirt will use this to create a new DataVolume that matches the spec here. This is important, as we want to make a new data volume for this VM that is cloned from our imported image – we don’t want to consume the shared imported image directly!

The important part here is the sourceRef. If we were to consume an existing, known, DataVolume we would instead specify a source here, pointing to the DataVolume by name. However, our imported images will have a unique name on each import with a randomly generated suffix. Rather, we want to use a sourceRef – as we can use this to point towards a DataSource object, which in turn points to a DataVolume. As the DataImportCron object we created earlier to import the image specified a managedDataSource, a DataSource object was automatically created that points towards the imported DataVolume – and this will also be kept up to date with future imports! If your DataSource object is in a different namespace to this VM, ensure you specify the namespace of the DataSource object here.
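
As a sketch of the cross-namespace case, the sourceRef below points at a DataSource in a hypothetical os-images namespace. Cloning across namespaces usually also needs suitable RBAC permissions on the source namespace, which are not shown here.

```yaml
        sourceRef:
          kind: DataSource
          name: ubuntu-2204
          namespace: os-images   # hypothetical namespace holding the DataSource
```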

Ensure you specify a suitable name for the DataVolume – the cloned DataVolume will be unique to this VM, and so should have a name that is suitably related to the VM!

The pvc section here follows the standard Kubernetes PersistentVolumeClaim spec, so other options such as storageClassName can be specified. If your imported DataVolume was stored on slow HDD-backed storage for cost savings, for example, you might want to specify a storageClassName here that relates to fast SSD-backed storage for optimal VM I/O performance!
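
For example, the template's pvc section might be extended as sketched below. The class name fast-ssd is a hypothetical placeholder for whatever SSD-backed class your cluster provides.

```yaml
        pvc:
          storageClassName: fast-ssd   # hypothetical storage class name
          accessModes:
            - "ReadWriteOnce"
          resources:
            requests:
              storage: 5Gi
```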

      volumes:
      - dataVolume:
          name: ubuntu-2204-vm-disk
        name: os
      - cloudInitNoCloud:
          userData: |
            #cloud-config
            password: ThisIsReallyUnsafeButThisIsJustADemo
            chpasswd: { expire: False }
        name: cloudinitdisk

Although this is a fairly standard volume definition for a VM, there are a few things of note here. Firstly, we need to refer to a DataVolume that will be attached to the VM. This must match the name of the DataVolume that is created by the dataVolumeTemplates section. Note that although the referenced dataVolume has a specific name, the volume’s name is just os. This is specific to the VM itself, and so does not need to be unique – therefore something generic such as os is ok!

Secondly, we define a cloudInitNoCloud volume that contains a few lines of cloud config. As the VM images are pre-installed, cloud config is the easiest way to perform configuration on the VM on boot. In this case, we’re simply setting the password of the built-in ubuntu account and disabling password expiration. Ideally, we’d set a trusted SSH key here instead, or not permit any logins! Much more configuration can be done with cloud config, such as installing packages or starting services, and this can be a great way to programmatically configure VMs rather than performing manual post-deployment configuration over SSH. More information about what is possible in cloud-init can be found in the cloud-init documentation.
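
A slightly safer variant of the cloud config might look like the sketch below, which swaps the password for an SSH key and installs a package on first boot. The public key is a placeholder that must be replaced, and qemu-guest-agent is just an illustrative package choice.

```yaml
      - cloudInitNoCloud:
          userData: |
            #cloud-config
            ssh_authorized_keys:
              - ssh-ed25519 REPLACE_WITH_YOUR_PUBLIC_KEY   # placeholder, not a real key
            package_update: true
            packages:
              - qemu-guest-agent   # illustrative package choice
            runcmd:
              - [systemctl, enable, --now, qemu-guest-agent]
        name: cloudinitdisk
```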

When the VM is created, KubeVirt will automatically start to clone the DataVolume specified by the DataSource object (our previously imported OS image). Running a kubectl describe against the VM’s DataVolume will show the status of this and any errors – going through a CloneInProgress phase before finishing in the CloneComplete phase. Once complete, the volume will be attached to the VM and the VM will boot!

And that’s it! A VM will now be running that uses a cloned version of the automatically imported OS base image. As the volume was cloned, we can safely perform writes to it – such as specific configuration for this VM, package installation, data storage etc. If the VM is rebooted, the data volume is kept. Only if the VM is destroyed will the data volume be lost.

Conclusion

Hopefully this post helped to explain how to set up a new DataImportCron object to automatically import the latest version of an OS base image to a cluster, and then create a VM to clone the OS image for use as persistent storage by that VM!

When writing manifests for KubeVirt, the API reference can be a handy guide for showing exactly what can be set. Although KubeVirt does have extensive documentation, parts of the API reference are not documented and so it’s often useful to combine both sources of information. Furthermore, the original design documentation for the CDI Golden Image Delivery and Update Pipeline may help to explain some of the design decisions made. Finally, on the CDI repo there is also some documentation on OS Image Polling and Updates, which has an example of a similar DataImportCron object and a DataVolume that consumes it as a clone.
