Within the SODALITE H2020 project, one of the areas we are focusing on is the deployment of optimized containers with accelerator-specific ML models for Edge-based inference across heterogeneous Edge Gateways in a Connected Vehicle Fleet. The main motivation for this is three-fold:
- While we can benefit from Cloud and HPC resources for base model training (e.g. TensorFlow), adaptations for specific accelerators are still required (e.g. preparing derivative TFLite models for execution on a GPU or EdgeTPU).
- The lifecycle of a vehicle far exceeds that of a specific Cloud service, meaning that we cannot make assumptions about the environment we are deploying into over time.
- Different services may have a stronger need for a specific type of accelerator (e.g. GPU, FPGA), meaning that an existing service may need to be re-scheduled and re-deployed onto another available resource on a best-fit basis.
Node Labelling in Kubernetes
While there is existing work on node feature labelling in heterogeneous Kubernetes clusters, the most prominent being the official node feature discovery project, support for Edge Gateways, which are typically embedded SBCs (either as a standalone blackbox, or as part of a pre-existing In-Vehicle Infotainment (IVI) system), has been found to be somewhat lacking. In the case of the official NVIDIA GPU device plugin, for example, detection of the GPU requires use of the NVIDIA Management Library (NVML), which, in turn, assumes an enumerable PCI bus. Jetson Nano users with an integrated GPU are therefore simply out of luck at the moment. Others, such as the Coral Dev Board, provide for enumeration of the EdgeTPU via the PCI bus, but do not yet provide a specific device plugin to manage and expose the EdgeTPU device. In both of these cases, we can work around these limitations by tagging the node with platform-specific properties / device capabilities and deploying a container with a targeted accelerator-specific run-time environment.
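As a point of reference, this kind of workaround can be as simple as manually labelling the node with kubectl (the node name here is a placeholder, and the label mirrors the convention used by the labeller described below):
$ kubectl label nodes jetson-nano-01 beta.devicetree.org/nvidia-gm20b=1
A Pod can then be steered to the node by matching on that label with a nodeSelector, as shown later in this post. Doing this by hand for every Gateway does not scale, however, which is what motivates the automated approach below.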
Enter Device Tree
A common feature across most of these Edge Gateways is the existence of a semi-standard devicetree blob (DTB), which exposes a static description of hardware and its corresponding topology through a special tree structure. While a comprehensive explanation of the devicetree is out of scope for this post, those that are so inclined can read through the specification here. A general overview of the devicetree and its node structure is as follows:
/dts-v1/;

/ {
    nvidia,fastboot-usb-pid = <0xb442>;
    compatible = "nvidia,jetson-nano", "nvidia,tegra210";
    nvidia,proc-boardid = "3448";
    nvidia,pmu-boardid = "3448";
    serial-number = "xxxxxxxxxxxxxxxxx";
    nvidia,dtbbuildtime = "Jul 16 2019", "17:09:35";
    model = "NVIDIA Jetson Nano Developer Kit";
    ...
    gpu {
        compatible = "nvidia,tegra210-gm20b", "nvidia,gm20b";
        access-vpr-phys;
        resets = <0x21 0xb8>;
        status = "okay";
        interrupts = <0x0 0x9d 0x4 0x0 0x9e 0x4>;
        reg = <0x0 0x57000000 0x0 0x1000000 0x0 0x58000000 0x0 0x1000000 0x0 0x538f0000 0x0 0x1000>;
        iommus = <0x2b 0x1f>;
        reset-names = "gpu";
        nvidia,host1x = <0x78>;
        interrupt-names = "stall", "nonstall";
    };
    ...
While, in general, the model property of the root node should provide us with a unique identifier for the specific model of the system board in a standard manufacturer,model format, the specification unfortunately opts to take the easy way out and only recommends (rather than mandates) this format. This watering down of the specification means that we are, unfortunately, unable to use the model property as a consistent source for node labelling, and must fall back on the compatible properties instead — while the specification provides no firm requirements here either, these are at least implicitly forced to adopt a standard convention in order to match the Linux kernel naming conventions.
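As an illustration (not part of the labeller itself), the root-node compatible property can be inspected directly on the node. It is stored as a NUL-separated list of strings, and /proc/device-tree is a commonly available symlink to /sys/firmware/devicetree/base:
$ tr '\0' '\n' < /proc/device-tree/compatible
nvidia,jetson-nano
nvidia,tegra210
These are exactly the strings that we would like to surface as node labels.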
Generating Node Labels from DeviceTree Properties
In order to generate node labels from DeviceTree properties, we developed a custom Kubernetes controller, k8s-dt-node-labeller, specifically for this purpose.
Given the lack of consistency of the model encoding, as mentioned above, the approach taken by our DeviceTree node labeller is therefore to iterate over the compatible properties within the root node, as well as within any designated children for which we wish to expose labels — such as, in the Jetson Nano case, the gpu node. A simple dry-run on the node with the children of interest defined demonstrates the labels that will be generated:
$ k8s-dt-node-labeller -d -n gpu
Discovered the following devicetree properties:
beta.devicetree.org/nvidia-jetson-nano: 1
beta.devicetree.org/nvidia-tegra210: 1
beta.devicetree.org/nvidia-tegra210-gm20b: 1
beta.devicetree.org/nvidia-gm20b: 1
Deploying the Node Labeller into a Heterogeneous Cluster
The node labeller itself is intended to be deployed into a heterogeneous cluster as a DaemonSet, so that a labelling Pod runs on each eligible node.
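As a rough sketch, a DaemonSet definition along the following lines could be used. The image name, namespace, and ServiceAccount below are illustrative assumptions rather than the project's actual manifest; the ServiceAccount would need RBAC permission to update node objects, and the Pod runs privileged in order to read /sys/firmware (see Limitations below):
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: k8s-dt-node-labeller
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: k8s-dt-node-labeller
  template:
    metadata:
      labels:
        name: k8s-dt-node-labeller
    spec:
      # ServiceAccount with permission to update node labels (assumed to exist)
      serviceAccountName: k8s-dt-node-labeller
      # Constrain scheduling to architectures known to expose a devicetree
      nodeSelector:
        kubernetes.io/arch: arm64
      containers:
      - name: k8s-dt-node-labeller
        # Illustrative image name - substitute the actual labeller image
        image: adaptant/k8s-dt-node-labeller
        # Additionally generate labels for the 'gpu' child node, as in the dry-run above
        args: [ "-n", "gpu" ]
        securityContext:
          # Privileged mode is currently required for /sys/firmware access
          privileged: true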
Targeted Pod Placement
Once the labeller is up and running, it’s now possible to target specific Gateways or Gateway + accelerator pairs. To target the Jetson Nano, for example, the model-specific beta.devicetree.org/nvidia-jetson-nano label can be used as the basis for node selection. To target the specific GPU, beta.devicetree.org/nvidia-gm20b can be used. To further constrain the selection, multiple labels can be used together to define the selection basis.
Using an HTTP echo server as a simple deployment example, a targeted Pod description for a Jetson Nano with a GM20B GPU can be written as follows:
apiVersion: v1
kind: Pod
metadata:
  name: http-echo-gpu-pod
  labels:
    app: http-echo
spec:
  containers:
  - name: http-echo
    image: adaptant/http-echo
    imagePullPolicy: IfNotPresent
    args: [ "-text", "hello from a Jetson Nano with an NVIDIA GM20B GPU" ]
    ports:
    - containerPort: 5678
  nodeSelector:
    beta.devicetree.org/nvidia-jetson-nano: "1"
    beta.devicetree.org/nvidia-gm20b: "1"
Which can then be exposed through a simple service definition, as follows:
kind: Service
apiVersion: v1
metadata:
  name: http-echo-service
spec:
  type: NodePort
  selector:
    app: http-echo
  ports:
  - port: 5678
    protocol: TCP
    name: http
For testing from outside the cluster, we can further forward the service port locally:
$ kubectl port-forward service/http-echo-service 5678:5678
and demonstrate connectivity to the appropriate node:
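For illustration, assuming the adaptant/http-echo image simply echoes back the configured text, a request through the forwarded port would look along these lines:
$ curl http://localhost:5678
hello from a Jetson Nano with an NVIDIA GM20B GPU
A response containing the Nano-specific text confirms that the Pod has been placed on the intended node.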
Next Steps
Specific base container run-time environments for the different accelerator types are being prepared separately and will be made available during later stages of the SODALITE project.
For an example of getting started with an NVIDIA GPU container runtime targeting the Jetson Nano, please refer to the official guidance from NVIDIA here. The scheduling of the Pod within the cluster can be carried out using the aforementioned node selection criteria and Pod template.
Limitations
- In order to walk the devicetree, access to the node’s /sys/firmware directory is required — this is presently enabled by running the Pod in privileged mode. It may be possible to leverage allowedProcMountTypes to disable path masking within the Pod and run without privileged mode, but this has not yet been verified.
- At present there is no mechanism by which a DaemonSet can gracefully terminate without triggering a restart, due to the Pod RestartPolicy being forced to Always in DaemonSet Pod specifications. This means that, at the moment, the initial node selector for the DaemonSet must constrain itself to nodes that are known to be DeviceTree-capable in order to avoid spurious restarts. This has not been an issue yet when targeting primarily arm64 and armhf nodes, but could be problematic for other architectures.
- While the labeller can attest to the existence of a node in the devicetree, it offers no detailed device-specific information or control - all of which would need to be implemented through a fit-for-purpose device plugin (or baked into the container runtime, as in the GPU case). The labeller can, however, be used as a basis for scheduling device plugins on nodes with matching capabilities.