OFED

Warning: Experimental workflow subject to change

The NVIDIA DOCA framework is distributed as part of the StackHPC Release Train to provide OFED driver support. The DOCA repository is synced into Ark as part of the Release Train workflows. However, to ensure compatibility with Release Train packages, the OFED modules must be built with support for the latest Release Train kernel.

Workflow

The workflow uses workflow_dispatch, so an OFED build must be requested manually. Once triggered, it deploys a builder VM, applies Kayobe configuration to the builder, upgrades the kernel, reboots, and then runs two Ansible playbooks that build the OFED modules and upload them to Ark.

Pre-requisites

Before building OFED packages, the workflow will ensure that:

  • A full distro-sync has taken place, ensuring the kernel is upgraded.

  • The bootloader has been configured to use the latest kernel (reset-bls-entries.yml).

  • noexec is disabled in the temporary logical volume.
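The noexec check can be illustrated with a short shell sketch. This is not the workflow's own implementation: the helper name has_noexec is hypothetical, and the /tmp mount point in the usage comment is an assumption, since the text above does not say where the temporary logical volume is mounted.

```shell
# Sketch only: report whether a comma-separated mount option string
# contains "noexec". The function name is hypothetical.
has_noexec() {
    # Wrap the list in commas so "noexec" only matches as a whole option.
    case ",$1," in
        *,noexec,*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example usage (assumes the temporary logical volume is mounted at /tmp):
#   if has_noexec "$(findmnt -n -o OPTIONS --target /tmp)"; then
#       echo "noexec is still set; the DOCA build script cannot run here"
#   fi
```

Matching against the comma-wrapped string avoids false positives from options that merely contain the substring "noexec".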

build-ofed

Currently we only support building Rocky Linux 9 OFED kernel module packages.

The Build OFED module workflow will check that the filesystem is configured (noexec disabled) to allow the DOCA build script to run. The workflow will also install any necessary dependencies for the module build.

The build script will output a doca-kernel-repo RPM which contains all kernel modules built as part of the workflow. Installing this RPM creates a repo file pointing to the modules in /usr/share/doca-host-<doca-version>/Modules/<kernel-version>/ on the host.

push-ofed

As mentioned above, the DOCA repository is synced into the doca repository in Ark. This workflow will upload the doca-kernel-repo RPM to a separate repository named doca-modules. The version for this repository is set in pulp-repo-versions.yml, and the repository is disabled for local Pulp syncs by default.

Install process

Release Train configuration

DOCA repositories will need to be synced to the local Pulp service. Ensure that the DOCA hosts have been added to the mlnx group before running a package sync: if the group is not empty, DOCA will be synced into the local Pulp. The local Pulp can be synced with Ark by running:

kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-repo-sync.yml
kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-repo-publish.yml

DOCA repositories can be templated to hosts by running Kayobe host configure.

kayobe overcloud host configure -t dnf

StackHPC DOCA kernel modules will require the latest kernel version available in Ark for the current Rocky minor version. You should ensure that packages are up to date by running a package update, which can also be limited to hosts in the mlnx group.

kayobe overcloud host package update --packages "*" --limit mlnx

To ensure the latest kernel is the default on boot, the bootloader entries will need to be reset before rebooting.

kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/reset-bls-entries.yml -e reset_bls_host=mlnx

The hosts can now be rebooted to use the latest kernel. A rolling reboot may be appropriate here to reduce disruption. See the package updates documentation <package-updates>.

kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/reboot.yml --limit mlnx
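After the reboot, it can be useful to confirm that a host is actually running the newest installed kernel before installing DOCA. The following is a minimal sketch rather than part of the provided playbooks: the helper name newest_kernel is hypothetical, and sort -V is used as an approximation of RPM version ordering, which may differ in unusual cases.

```shell
# Sketch only: read kernel version strings on stdin, one per line,
# and print the highest according to version sort.
newest_kernel() {
    sort -V | tail -n 1
}

# Example usage on a host (assumes the "kernel" package is installed):
#   latest="$(rpm -q kernel --qf '%{VERSION}-%{RELEASE}.%{ARCH}\n' | newest_kernel)"
#   if [ "$(uname -r)" = "$latest" ]; then
#       echo "running the newest installed kernel: $latest"
#   else
#       echo "running $(uname -r), but $latest is installed"
#   fi
```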

install-doca

A playbook is provided to install DOCA on hosts in the mlnx group. Ensure this group is configured to include the hosts you wish to install DOCA on. To run the install playbook:

kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/install-doca.yml
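The install process above can be collected into a single wrapper, sketched below. The commands are the ones documented in this section; the function names run and doca_rollout and the DRY_RUN variable are hypothetical additions for illustration. By default the sketch only prints each command, and it stops at the first step that fails.

```shell
# Sketch only: print each command (default) or execute it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Run the documented steps in order; "&&" stops at the first failure.
doca_rollout() {
    cfg="${KAYOBE_CONFIG_PATH:?KAYOBE_CONFIG_PATH must be set}"
    run kayobe playbook run "$cfg/ansible/pulp-repo-sync.yml" &&
    run kayobe playbook run "$cfg/ansible/pulp-repo-publish.yml" &&
    run kayobe overcloud host configure -t dnf &&
    run kayobe overcloud host package update --packages "*" --limit mlnx &&
    run kayobe playbook run "$cfg/ansible/reset-bls-entries.yml" -e reset_bls_host=mlnx &&
    run kayobe playbook run "$cfg/ansible/reboot.yml" --limit mlnx &&
    run kayobe playbook run "$cfg/ansible/install-doca.yml"
}

# Example usage:
#   doca_rollout              # print the commands without running them
#   DRY_RUN=0 doca_rollout    # execute them
```

Note that this sketch reboots all hosts in the mlnx group in one step. If a rolling reboot is needed, the reboot step is better run separately.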