Aria Orchestrator patch 5 upgrade failing

Article ID: 435269

Products

VCF Automation

Issue/Introduction

  • The upgrade of Aria Automation Orchestrator from version 8.18.1 (build 24977824) to 8.18.1 (build 25193536) fails
  • Aria Automation Orchestrator appliance journal logs show:

    Mar 18 06:09:58 hostname[1903]: panic: freepages: failed to get all reachable pages (page 636: multiple references (stack: [672 669 636]))
    Mar 18 06:09:58 hostname[1903]: goroutine 15 [running]:
    Mar 18 06:09:58 hostname[1903]: go.etcd.io/bbolt.(*DB).freepages.func2()
    Mar 18 06:09:58 hostname[1903]: /go/pkg/mod/go.etcd.io/bbolt@<version>/db.go:1202 +0x8d
    Mar 18 06:09:58 hostname[1903]: created by go.etcd.io/bbolt.(*DB).freepages in goroutine 55
    Mar 18 06:09:58 hostname[1903]: /go/pkg/mod/go.etcd.io/bbolt@<version>/db.go:1200 +0x1c5
    Mar 18 06:09:58 hostname systemd[1]: containerd.service: Main process exited, code=exited, status=2/INVALIDARGUMENT

    Review the appliance journal logs with the command journalctl -xe. Historical logs are stored in /services-logs/journal/systemd.journal-YYYYMMDD
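
To pull only the relevant lines out of a long journal, a small filter can help. This is a sketch, not a tool shipped with the appliance: the function name find_bbolt_panic is hypothetical, and its two match patterns are copied from the log excerpt above.

```shell
# Hypothetical helper (not part of the appliance): print log lines that match
# the bbolt freepages panic or the resulting containerd exit shown above.
# Reads the file given as $1, or standard input when no argument is passed.
find_bbolt_panic() {
    pattern='freepages: failed to get all reachable pages|containerd\.service: Main process exited'
    if [ "$#" -gt 0 ]; then
        grep -E "$pattern" "$1"
    else
        grep -E "$pattern"
    fi
}
```

For example, journalctl -u containerd --no-pager | find_bbolt_panic. For the historical files, pass the path as the argument if they are plain text; if they are binary journal files, read them with journalctl --file=PATH and pipe that output in instead.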

Environment

Aria Automation Orchestrator 8.18.1

Cause

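During startup, containerd opens its bbolt metadata database. As the stack trace shows, the panic is raised from bbolt's (*DB).freepages, which rebuilds the list of free pages by walking every page reachable from the database root and recording each page it visits. During that walk, page 636 was reached a second time. The stack [672 669 636] is the path bbolt was following from the root page of the bucket being checked when it found the duplicate (page 672 → page 669 → page 636). In a healthy database every page is referenced exactly once, so a page reachable by more than one path means the B+ tree's page pointers are inconsistent. bbolt treats this as fatal and panics rather than risk further data corruption, which stops containerd (status=2).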
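
When reading the panic, it can help to separate the duplicated page id from the traversal path. The filter below is a hypothetical illustration: describe_bbolt_panic is an invented name, and it assumes the exact message format shown in the log excerpt.

```shell
# Hypothetical helper: read journal lines on stdin and, for each bbolt
# "multiple references" panic, print the duplicated page id and the page
# path (bucket root first) that bbolt was following when it found it.
describe_bbolt_panic() {
    sed -n 's/.*page \([0-9][0-9]*\): multiple references (stack: \[\([0-9 ]*\)]).*/page \1 reached more than once; path: \2/p'
}
```

For example, journalctl -u containerd --no-pager | describe_bbolt_panic. Lines that do not contain the panic produce no output.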

Common causes of this kind of corruption include:

  • Unclean shutdown / power loss: if the system lost power or was hard-reset while containerd was writing to the database, a write that had not fully reached disk can leave the B+ tree inconsistent. bbolt reads the database through a memory-mapped file and writes pages with fdatasync at each commit, so it depends on the storage stack actually persisting acknowledged writes. If it does not (for example, because of a volatile write cache), an interruption at the wrong moment can corrupt the freelist or internal page pointers.
  • Disk I/O errors: bad sectors, failing storage, or flaky storage controllers can silently corrupt pages on disk.
  • Filesystem corruption: if the underlying filesystem (ext4, xfs, etc.) itself has inconsistencies, the database file can be damaged.
  • Known bugs in bbolt v1.3.x freelist handling: several bugs in bbolt's freelist management in the 1.3.x line, tracked in issues such as etcd-io/bbolt#707 and related PRs, could cause page double-frees or multiple references even without external I/O failures.
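
To check whether the disk or filesystem causes listed above apply, the kernel log is a reasonable first place to look. The sketch below is hypothetical: the function name and the message patterns are assumptions chosen as common examples, not an exhaustive or appliance-specific list.

```shell
# Hypothetical helper: filter kernel log text on stdin for messages that often
# accompany disk I/O errors or filesystem corruption. The patterns are
# illustrative only; an empty result does not prove the storage is healthy.
scan_storage_errors() {
    grep -Ei 'I/O error|EXT4-fs error|XFS .*(corrupt|error)'
}
```

For example, journalctl -k --no-pager | scan_storage_errors checks kernel messages from the current boot, and dmesg | scan_storage_errors checks the kernel ring buffer.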

Resolution

Perform the following recovery steps on the source 8.18.1 appliance, in its pre-upgrade state:

  1. Stop containerd (it may already have crashed):
    systemctl stop containerd
  2. Locate the corrupted database, typically at:
    /var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db
  3. Back up the corrupted database by renaming the folder that contains it:
    mv /var/lib/containerd/io.containerd.metadata.v1.bolt /var/lib/containerd/io.containerd.metadata.v1.bolt.corrupted
  4. Start containerd:
    systemctl start containerd
  5. Restart services with the command:
    /opt/scripts/deploy.sh
  6. Proceed with the upgrade using the known instructions in article
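
Steps 1 to 5 can also be run as a single function. This is a minimal sketch under stated assumptions, not a supported script: recover_containerd_metadata and the CONTAINERD_ROOT, SYSTEMCTL, and DEPLOY_SCRIPT overrides are hypothetical names added only so the logic can be dry-run away from the appliance. The default paths and commands are the ones from the steps above. Run it as root on the source 8.18.1 appliance.

```shell
# Hypothetical sketch of recovery steps 1-5. The function name and the
# CONTAINERD_ROOT / SYSTEMCTL / DEPLOY_SCRIPT overrides are assumptions,
# not part of the appliance; defaults match the documented commands.
recover_containerd_metadata() {
    root=${CONTAINERD_ROOT:-/var/lib/containerd}
    systemctl_cmd=${SYSTEMCTL:-systemctl}
    deploy_cmd=${DEPLOY_SCRIPT:-/opt/scripts/deploy.sh}
    meta_dir="$root/io.containerd.metadata.v1.bolt"

    # Step 1: stop containerd; tolerate failure in case it has already crashed.
    $systemctl_cmd stop containerd || true

    # Step 2: confirm the metadata database is where we expect it.
    if [ ! -f "$meta_dir/meta.db" ]; then
        echo "meta.db not found under $meta_dir" >&2
        return 1
    fi

    # Step 3: keep the corrupted copy. If the backup folder already exists,
    # mv would move the folder inside it instead of renaming it, so stop.
    if [ -e "$meta_dir.corrupted" ]; then
        echo "$meta_dir.corrupted already exists; move it aside first" >&2
        return 1
    fi
    mv "$meta_dir" "$meta_dir.corrupted" || return 1

    # Step 4: start containerd again.
    $systemctl_cmd start containerd || return 1

    # Step 5: restart the Orchestrator services.
    $deploy_cmd
}
```

The function covers only steps 1 to 5; step 6, the upgrade itself, is unchanged. Stopping early when the .corrupted folder already exists keeps an earlier backup from being nested or mixed up if the procedure is run twice.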