---
slug: proxmox-cluster-upgrade-8-to-9-with-ceph
title: Template
description:
date:
draft: true
tags:
categories:
---

Intro

My Proxmox VE cluster is almost one year old now, and it's been a while since I last updated my nodes. Now is the time to move forward and bump it to Proxmox VE 9.

I'm mainly interested in the new HA affinity rules. Here is what this version brings:

  • Debian 13 "Trixie"
  • Snapshots for thick-provisioned LVM shared storage
  • Fabrics feature for the SDN stack
  • Better mobile interface
  • Affinity rules in HA cluster

In this article, I will walk you through the upgrade steps for my highly available Proxmox VE cluster backed by Ceph distributed storage. The official documentation can be found here.


Prerequisites

Before jumping into the upgrade, let's review the prerequisites:

  1. All nodes upgraded to the latest version of Proxmox VE 8.4.
  2. Ceph: the cluster upgraded to Ceph 19.2 Squid beforehand.
  3. Proxmox Backup Server: upgraded to Proxmox Backup Server 4.
  4. Reliable access to the nodes.
  5. A healthy cluster.
  6. A backup of all VMs and CTs.
  7. At least 5 GB of free disk space on the root mount point.

Well, I have some homework to do before the major upgrade to Proxmox VE 9. My nodes are currently on version 8.3.2, so a first update is necessary.

Then my Ceph cluster, which provides my distributed storage, is running Ceph Reef (18.2.4). After the update to Proxmox VE 8.4, I'll move from Reef to Squid.

I don't use Proxmox Backup Server in my homelab for now, so I can skip that point. For access to the nodes, it is better to reach the console directly (not through the WebGUI). I don't have direct console access, I only have SSH.

The last points are checked: all my nodes have more than 10 GB free on the / mount point.

One of my VMs uses the host APU's processing unit via PCI passthrough. As this prevents the VM from being live-migrated, I remove the device at the beginning of this procedure to avoid having to restart the VM each time.

Also, until the end of the upgrade to Proxmox VE 9, I set the noout flag on the Ceph OSDs, to prevent the CRUSH algorithm from trying to rebalance the Ceph cluster during the upgrade:

ceph osd set noout

Update Proxmox VE to 8.4.14

The plan is simple. For each node, one at a time, I will:

  • Enable the maintenance mode
ha-manager crm-command node-maintenance enable $(hostname)
  • Update the node
apt-get update
apt-get dist-upgrade -y

At the end of the update, I'm told that a removable bootloader was found but that GRUB is not set up to update it, so I run the suggested commands:

Removable bootloader found at '/boot/efi/EFI/BOOT/BOOTX64.efi', but GRUB packages not set up to update it!
Run the following command:

echo 'grub-efi-amd64 grub2/force_efi_extra_removable boolean true' | debconf-set-selections -v -u

Then reinstall GRUB with 'apt install --reinstall grub-efi-amd64'
  • Restart it
reboot
  • Disable the maintenance mode
ha-manager crm-command node-maintenance disable $(hostname)

Between each node, I wait for the Ceph status to be clean, without warnings.
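
In practice I just watch the health output between nodes; note that the noout flag set earlier will itself be reported as a warning, so the goal is to have no other one:

ceph -s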

At the end, the Proxmox VE cluster is updated to 8.4.14.
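
A quick check on each node confirms the running version:

pveversion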

Upgrade Ceph from Reef to Squid

I can now move on to the Ceph upgrade; the Proxmox documentation for that topic is here.

On all nodes, I update the source of the Ceph packages for Proxmox:

sed -i 's/reef/squid/' /etc/apt/sources.list.d/ceph.list

I upgrade the Ceph packages:

apt update
apt full-upgrade -y

After the upgrade on the first node, the Ceph version shows 19.2.3. My OSDs now appear as outdated, and the monitors need either an upgrade or a restart: proxmox-ceph-version-upgrade.png
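
To see which daemons are still running the old release, Ceph can break down the versions per daemon type:

ceph versions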

I carry on and upgrade the packages on the two other nodes.

I have a monitor on each node, so I have to restart the monitors, one node at a time:

systemctl restart ceph-mon.target

I verify the Ceph status between each restart:

ceph status

Once all monitors are restarted, they report the latest version, with ceph mon dump:

  • Before: min_mon_release 18 (reef)
  • After: min_mon_release 19 (squid)
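
A quick way to check just that field:

ceph mon dump | grep min_mon_release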

Now I can restart the OSDs, still one node at a time. I have one OSD per node:

systemctl restart ceph-osd.target

I monitor the Ceph status with the Proxmox WebGUI. At first, it shows some fancy colors; I just wait for it to go back to full green, which takes less than a minute: Pasted_image_20251102230907.png

A warning now shows up: HEALTH_WARN: all OSDs are running squid or later but require_osd_release < squid. Since all my OSDs are now running Squid, I can set the minimum required release to it:

ceph osd require-osd-release squid

I'm not currently using CephFS, so I don't have to care about the MDS (Metadata Server) daemons.
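
To double-check that no CephFS (and therefore no MDS) is configured, the filesystem list should come back empty:

ceph fs ls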

The Ceph cluster has been successfully upgraded to Squid (19.2.3).


Checks

The prerequisites to upgrade the cluster to Proxmox VE 9 are now complete. Am I ready to upgrade? Not yet.

Script pve8to9

A small checklist program named pve8to9 is included in the latest Proxmox VE 8.4 packages. It provides hints and warnings about potential issues before, during and after the upgrade process. Pretty handy, isn't it?
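
The script is run directly on each node; the --full option runs the complete set of checks:

pve8to9 --full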

Running the tool for the first time gives me some insights into what I need to do. The script checks a number of parameters, grouped by theme. Here is the virtual guest section:

= VIRTUAL GUEST CHECKS =

INFO: Checking for running guests..
WARN: 1 running guest(s) detected - consider migrating or stopping them.
INFO: Checking if LXCFS is running with FUSE3 library, if already upgraded..
SKIP: not yet upgraded, no need to check the FUSE library version LXCFS uses
INFO: Checking for VirtIO devices that would change their MTU...
PASS: All guest config descriptions fit in the new limit of 8 KiB
INFO: Checking container configs for deprecated lxc.cgroup entries
PASS: No legacy 'lxc.cgroup' keys found.
INFO: Checking VM configurations for outdated machine versions
PASS: All VM machine versions are recent enough

At the end, you have the summary. The goal is to address as many FAILURES and WARNINGS as possible:

= SUMMARY =

TOTAL:    57
PASSED:   43
SKIPPED:  7
WARNINGS: 2
FAILURES: 2

These are the FAILURES and WARNINGS on my system:

FAIL: 1 custom role(s) use the to-be-dropped 'VM.Monitor' privilege and need to be adapted after the upgrade
FAIL: systemd-boot meta-package installed. This will cause problems on upgrades of other boot-related packages. Remove 'systemd-boot' See https://pve.proxmox.com/wiki/Upgrade_from_8_to_9#sd-boot-warning for more information.
WARN: 1 running guest(s) detected - consider migrating or stopping them.
WARN: The matching CPU microcode package 'amd64-microcode' could not be found! Consider installing it to receive the latest security and bug fixes for your CPU.
        Ensure you enable the 'non-free-firmware' component in the apt sources and run:
        apt install amd64-microcode
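
The systemd-boot and microcode items point to their own fix. A hedged sketch of what I plan to run before the upgrade, based on the messages above and the linked wiki section (my nodes boot via GRUB):

apt remove systemd-boot
# after adding the 'non-free-firmware' component to the Debian APT sources
apt install amd64-microcode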

Custom Role using VM.Monitor

Some time ago, to use Terraform with my Proxmox cluster, I created a dedicated role; I detailed this in this [post]({{< ref "post/3-terraform-create-vm-proxmox" >}}).

This role uses the VM.Monitor privilege, which is removed in Proxmox VE 9.

NOTICE: Proxmox VE 9 replaced the ambiguously named 'VM.Monitor' privilege with 'Sys.Audit' for QEMU HMP monitor access and new dedicated 'VM.GuestAgent.*' privileges for access to a VM's guest agent. The guest agent sub-privileges are 'Audit' for all informational commands, 'FileRead' and 'FileWrite' for file-read and file-write, 'FileSystemMgmt' for filesystem freeze, thaw and trim, and 'Unrestricted' for everything, including command execution. Operations that affect the VM runstate require 'VM.PowerMgmt' or 'VM.GuestAgent.Unrestricted'

New

  • VM.PowerMgmt
  • Sys.Console
  • Sys.Audit
  • VM.Config.Cloudinit
  • Pool.Allocate
  • SDN.Use
  • VM.Config.Memory
  • VM.Allocate
  • VM.Console
  • VM.Clone
  • VM.Config.Network
  • Sys.Modify
  • VM.Config.Disk
  • Datastore.Allocate
  • VM.Config.CPU
  • VM.Config.CDROM
  • Datastore.Audit
  • VM.Migrate
  • Datastore.AllocateSpace
  • VM.Config.Options
  • Pool.Audit
  • VM.Config.HWType
  • VM.Audit

New

  • VM.GuestAgent.Audit
  • VM.GuestAgent.FileRead
  • VM.GuestAgent.FileWrite
  • VM.GuestAgent.FileSystemMgmt
  • VM.GuestAgent.Unrestricted
  • SDN.Audit
  • Mapping.Audit
  • Mapping.Use
  • Sys.Syslog

Dropped

  • Permissions.Modify
  • SDN.Allocate
  • Realm.Allocate
  • VM.Replicate
  • Realm.AllocateUser
  • Sys.AccessNetwork
  • Datastore.AllocateTemplate
  • Sys.PowerMgmt
  • User.Modify
  • Mapping.Modify
  • Group.Allocate
  • Sys.Incoming
  • VM.Backup
  • VM.Snapshot
  • VM.Snapshot.Rollback

Old

VM.Monitor

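After the upgrade, the role will have to be adapted. A minimal sketch, assuming the role is called TerraformProv (hypothetical name, the real one comes from my Terraform post) and that read-only guest-agent access is enough; note that --privs replaces the whole privilege list, so the complete set has to be passed:

pveum role modify TerraformProv --privs "Datastore.Allocate Datastore.AllocateSpace Datastore.Audit Pool.Allocate Pool.Audit SDN.Use Sys.Audit Sys.Console Sys.Modify VM.Allocate VM.Audit VM.Clone VM.Config.CDROM VM.Config.CPU VM.Config.Cloudinit VM.Config.Disk VM.Config.HWType VM.Config.Memory VM.Config.Network VM.Config.Options VM.Console VM.GuestAgent.Audit VM.Migrate VM.PowerMgmt"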

Continuously use the pve8to9 checklist script

pve8to9

Move important Virtual Machines and Containers

Upgrade

Update the configured APT repositories

Update Debian Base Repositories to Trixie
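
A minimal sketch, assuming the Debian sources are still in the classic one-line format:

sed -i 's/bookworm/trixie/g' /etc/apt/sources.list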

Add the Proxmox VE 9 Package Repository
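
I use the no-subscription repository in my homelab. A sketch in the classic one-line format, the target file name being my own choice:

echo "deb http://download.proxmox.com/debian/pve trixie pve-no-subscription" > /etc/apt/sources.list.d/pve-no-subscription.list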

Update the Ceph Package Repository
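
Same idea for the Ceph Squid repository configured earlier, only the Debian suite changes:

sed -i 's/bookworm/trixie/g' /etc/apt/sources.list.d/ceph.list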

Refresh Package Index
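
With all the repositories now pointing to Trixie, refresh the package index:

apt update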

Upgrade the system to Debian Trixie and Proxmox VE 9.0
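
Then the actual upgrade, pulling in Debian Trixie and Proxmox VE 9 in one go (this is the long step):

apt dist-upgrade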

Check Result & Reboot Into Updated Kernel
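
A sketch of how I intend to validate each node before moving on to the next one:

pve8to9 --full
reboot
# once the node is back up
pveversion
uname -r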

Post-Upgrade Validation

  • Checking cluster communication (pvecm status)
  • Verifying storage mounts and access
  • Testing Ceph cluster health (ceph -s)
  • Confirming VM operations, backups, and HA groups
  • Re-enabling HA and migrating workloads back
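
A minimal sketch of the command side of these checks; ha-manager status and qm list are my picks for the HA and VM parts:

pvecm status
ceph -s
ha-manager status
qm list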

Finally, I can remove the noout flag:

ceph osd unset noout