Production Deployment Guide

Step-by-step instructions for deploying ROMbsr in enterprise environments with high availability, monitoring, and operational excellence

Deployment Architecture

A typical production ROMbsr deployment consists of multiple specialized nodes working together:

Build Nodes (2-4 servers)

  • High CPU/RAM for compilation
  • 500GB+ fast storage per node
  • Can be cloud or on-premise
  • Stateless - can scale horizontally

Sign Nodes (2 servers)

  • Physical servers with USB ports
  • Isolated network segment
  • YubiHSM2 attached
  • Active/passive HA configuration

GitOps Queue (Git server)

  • Can use GitHub/GitLab/Gitea
  • Private repository required
  • GPG commit signing enforced
  • Webhook support beneficial

Production Setup Guide

Infrastructure Preparation

# Create dedicated user
sudo useradd -r -s /bin/bash -m rombsr
sudo usermod -a -G docker rombsr  # If using containers

# Create directory structure
sudo mkdir -p /opt/rombsr
sudo mkdir -p /etc/rombsr
sudo mkdir -p /var/lib/rombsr
sudo mkdir -p /var/log/rombsr

# Set permissions
sudo chown -R rombsr:rombsr /opt/rombsr
sudo chown -R rombsr:rombsr /etc/rombsr
sudo chown -R rombsr:rombsr /var/lib/rombsr
sudo chown -R rombsr:rombsr /var/log/rombsr

Install ROMbsr

# Clone repository
cd /opt
sudo -u rombsr git clone https://github.com/your-org/rombsr.git
cd rombsr

# Build PKCS#11 libraries
sudo -u rombsr make all

# Install Python dependencies
sudo -u rombsr python3 -m venv /var/lib/rombsr/venv
sudo -u rombsr /var/lib/rombsr/venv/bin/pip install -r lib/python/requirements.txt

# Set production paths
export ROMBSR_ROOT=/opt/rombsr
export ROMBSR_CONFIG=/etc/rombsr
export ROMBSR_DATA=/var/lib/rombsr

Configure Build Nodes

# /etc/rombsr/common.conf
SIGNING_QUEUE_TYPE=git
SIGNING_GIT_URL=git@gitops.internal:android/signing-queue.git
RELEASE_BASE_URL=https://releases.example.com
LOG_LEVEL=INFO
LOG_FORMAT=json

# /etc/rombsr/build.conf
# Request capture mode - intercepts signing operations
SIGNING_MODE=mock
BUILD_JOBS=32
CCACHE_SIZE=100G
BUILD_TMPFS_SIZE=30G
REPO_SYNC_JOBS=8
GIT_USER_NAME="ROMbsr Build Bot"
GIT_USER_EMAIL="rombsr-build@example.com"
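
The files above use plain KEY=VALUE lines with `#` comments. As a rough illustration of how such files could be read and layered, here is a minimal sketch. The function names `parse_conf` and `load_conf` are hypothetical, and the rule that a node-specific file overrides common.conf is an assumption, not documented ROMbsr behavior.

```python
import shlex
from pathlib import Path


def parse_conf(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    settings: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a KEY=VALUE line: {raw!r}")
        # shlex.split strips shell-style quotes, e.g. "ROMbsr Build Bot"
        settings[key.strip()] = " ".join(shlex.split(value))
    return settings


def load_conf(*paths: Path) -> dict[str, str]:
    """Merge several files; later files win (assumed precedence)."""
    merged: dict[str, str] = {}
    for path in paths:
        merged.update(parse_conf(path.read_text()))
    return merged
```

Under these assumptions, a build node would call `load_conf(Path("/etc/rombsr/common.conf"), Path("/etc/rombsr/build.conf"))`.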

Configure Sign Nodes

# /etc/rombsr/sign.conf
# Real HSM mode - processes captured requests
SIGNING_MODE=hsm
HSM_CONNECTOR_URL=http://localhost:12345
HSM_AUTH_KEY_ID=3
SIGNING_BATCH_SIZE=10
SIGNING_RATE_LIMIT=100
GIT_USER_NAME="ROMbsr Sign Bot"
GIT_USER_EMAIL="rombsr-sign@example.com"

# Install YubiHSM connector
wget https://developers.yubico.com/YubiHSM2/Releases/yubihsm-connector-latest.tar.gz
tar xzf yubihsm-connector-latest.tar.gz
cd yubihsm-connector
sudo make install

# Configure connector
cat > /etc/yubihsm-connector.yaml <

Setup HSM

# Insert YubiHSM2 into USB port
# Start connector service
sudo systemctl enable --now yubihsm-connector

# Provision HSM (one-time)
cd /opt/rombsr
sudo -u rombsr bin/rombsr hsm-provision

# Import Android signing keys
sudo -u rombsr bin/rombsr hsm-import /secure/android-keys/

# Verify setup
sudo -u rombsr bin/rombsr hsm-status
sudo -u rombsr bin/rombsr hsm-test

Install Systemd Services

# Copy service files
sudo cp /opt/rombsr/contrib/systemd/*.service /etc/systemd/system/
sudo cp /opt/rombsr/contrib/systemd/*.timer /etc/systemd/system/

# Reload systemd
sudo systemctl daemon-reload

# Enable services (sign node)
sudo systemctl enable --now rombsr-sign.service
sudo systemctl enable --now rombsr-monitor.service

# Enable timers (build node)
sudo systemctl enable --now rombsr-nightly.timer
sudo systemctl enable --now rombsr-cleanup.timer

# Check status
sudo systemctl status 'rombsr-*'

Understanding SIGNING_MODE Configuration

Important: SIGNING_MODE is not about development vs production environments. Both modes are production features that work together:

SIGNING_MODE=mock (Build Nodes)

Enables request capture via libmock_pkcs11.so. This intercepts signing operations during build, saves the requests, and returns temporary signatures so the build can complete. This is NOT a test mode - it's how production builds capture what needs to be signed.

SIGNING_MODE=hsm (Sign Nodes)

Uses real HSM for processing captured signing requests. The sign orchestrator reads requests from the GitOps queue, applies real signatures using the YubiHSM2, and writes responses back. This completes the signing process started by the build nodes.
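
The interplay of the two modes can be sketched as a capture step and a signing step that share a queue. The sketch below is illustrative only: the `Queue` class is an in-memory stand-in for the Git repository, the request format is invented, and `fake_hsm_sign` uses an HMAC in place of the YubiHSM2. None of these names come from ROMbsr itself.

```python
import hashlib
import hmac
import json
from dataclasses import dataclass, field


@dataclass
class Queue:
    """In-memory stand-in for the GitOps queue (a Git repository in production)."""
    requests: dict[str, str] = field(default_factory=dict)   # request id -> JSON request
    responses: dict[str, str] = field(default_factory=dict)  # request id -> hex signature


def capture_request(queue: Queue, key_label: str, payload: bytes) -> str:
    """Build-node side (SIGNING_MODE=mock): record what must be signed and
    return a placeholder signature so the build can continue."""
    digest = hashlib.sha256(payload).hexdigest()
    request_id = f"{key_label}-{digest[:16]}"
    queue.requests[request_id] = json.dumps({"key": key_label, "sha256": digest})
    return "PLACEHOLDER-" + request_id


def process_queue(queue: Queue, sign) -> int:
    """Sign-node side (SIGNING_MODE=hsm): sign every request without a response."""
    done = 0
    for request_id, raw in queue.requests.items():
        if request_id in queue.responses:
            continue  # already signed on an earlier pass
        request = json.loads(raw)
        digest = bytes.fromhex(request["sha256"])
        queue.responses[request_id] = sign(request["key"], digest)
        done += 1
    return done


def fake_hsm_sign(key_label: str, digest: bytes) -> str:
    """HMAC stand-in for the HSM; a real sign node would call the YubiHSM2."""
    return hmac.new(key_label.encode(), digest, hashlib.sha256).hexdigest()
```

The point of the split is that the build node never needs access to key material: it only records digests, and the real signature is produced later on the isolated sign node.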

High Availability Configuration

Active/Passive Sign Nodes

For production environments requiring high availability:

# Primary sign node
SIGN_NODE_ROLE=primary
SIGN_NODE_PRIORITY=100

# Secondary sign node
SIGN_NODE_ROLE=secondary
SIGN_NODE_PRIORITY=50

# Use distributed locking (Redis/etcd)
LOCK_BACKEND=redis
LOCK_REDIS_URL=redis://ha-redis:6379/0
LOCK_KEY_PREFIX=rombsr:sign:

# Failover configuration
FAILOVER_TIMEOUT=300
HEALTH_CHECK_INTERVAL=30

Important: Only one sign node should be active at a time to prevent race conditions. The distributed lock ensures mutual exclusion even if both nodes are running.
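
To make the mutual-exclusion idea concrete, here is a minimal sketch of a lease-style lock. It is an in-memory stand-in for the Redis or etcd backend, and it assumes that the lease lifetime corresponds to FAILOVER_TIMEOUT and that the active node renews on every HEALTH_CHECK_INTERVAL. That mapping, and the `LeaseLock` class itself, are assumptions for illustration; SIGN_NODE_PRIORITY is not modeled.

```python
import time
from typing import Callable, Optional


class LeaseLock:
    """In-memory stand-in for a Redis/etcd lock with a time-to-live."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.holder: Optional[str] = None
        self.expires_at = 0.0

    def acquire_or_renew(self, node: str) -> bool:
        """Take the lock if it is free or expired, or extend it if `node` holds it."""
        now = self.clock()
        if self.holder in (None, node) or now >= self.expires_at:
            self.holder = node
            self.expires_at = now + self.ttl
            return True
        return False  # another node holds an unexpired lease
```

With a 300-second lease, a secondary that polls the lock takes over only after the primary has failed to renew for the full 300 seconds; until then every call from the secondary returns False and it stays passive.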

Load Balancing Build Nodes

Round-Robin Builds

Use a simple scheduler to distribute builds across available nodes. Each node pulls from a central work queue.
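
As a sketch of the round-robin idea, the function below pairs each build with the next node in rotation. The function name and the (rom, device) tuples are illustrative; ROMbsr's own scheduler interface is not described in this guide.

```python
from itertools import cycle


def assign_round_robin(builds: list[tuple[str, str]], nodes: list[str]) -> list[tuple[tuple[str, str], str]]:
    """Pair each (rom, device) build with the next node in rotation."""
    if not nodes:
        raise ValueError("no build nodes available")
    rotation = cycle(nodes)
    return [(build, next(rotation)) for build in builds]
```

A pull-based work queue achieves a similar spread without a central assignment step, because each idle node simply takes the next pending build.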

Capacity-Based

Monitor CPU and RAM usage and route each build to the least-loaded node. The Prometheus metrics exported by the build nodes can supply the load figures for this decision.
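
One possible selection rule, sketched below, scores each node by its busier resource and picks the lowest score. The input format (node name mapped to CPU and RAM percentages) and the function name are assumptions; in practice the numbers would come from whatever metrics the nodes export.

```python
def pick_least_loaded(usage: dict[str, tuple[float, float]]) -> str:
    """Return the node whose busier resource (CPU or RAM, in percent) is lowest."""
    if not usage:
        raise ValueError("no build nodes available")
    # Ties are broken by node name so the choice is deterministic.
    return min(usage, key=lambda node: (max(usage[node]), node))
```

Scoring by the maximum of the two values avoids sending a memory-hungry build to a node that has idle CPUs but little free RAM.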

Device Affinity

Assign specific devices to specific build nodes to optimize cache usage and reduce redundant downloads.
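
Device affinity needs a mapping that stays the same from one run to the next. A minimal sketch, assuming a hash-based mapping is acceptable (an explicit device-to-node table would work equally well), is shown below. The function name is hypothetical.

```python
import hashlib


def node_for_device(device: str, nodes: list[str]) -> str:
    """Map a device codename to a fixed node using a stable hash."""
    if not nodes:
        raise ValueError("no build nodes available")
    # hashlib is used instead of hash() because str hashes vary between Python processes.
    digest = hashlib.sha256(device.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]
```

Note that this simple modulo scheme remaps most devices whenever the node list changes, which temporarily defeats the cache benefit; a fixed table avoids that at the cost of manual upkeep.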

Monitoring & Alerting

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'rombsr-build'
    static_configs:
      - targets: ['build1:9090', 'build2:9090', 'build3:9090']

  - job_name: 'rombsr-sign'
    static_configs:
      - targets: ['sign1:9090', 'sign2:9090']

# Key metrics to monitor
# - rombsr_build_duration_seconds
# - rombsr_signing_requests_pending
# - rombsr_hsm_errors_total
# - rombsr_disk_usage_percent

Critical Alerts

# alerting_rules.yml
groups:
  - name: rombsr
    rules:
      - alert: BuildFailed
        expr: increase(rombsr_build_failures_total[1h]) > 0
        annotations:
          summary: "Build failed for {{ $labels.rom }}/{{ $labels.device }}"

      - alert: SigningQueueBacklog
        expr: rombsr_signing_requests_pending > 50
        for: 10m
        annotations:
          summary: "Signing queue backlog: {{ $value }} requests pending"

      - alert: HSMError
        expr: increase(rombsr_hsm_errors_total[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "HSM errors detected"

      - alert: DiskSpaceLow
        expr: rombsr_disk_usage_percent > 85
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"

Grafana Dashboard

Import the provided Grafana dashboard for comprehensive visibility.

Operational Procedures

Daily Operations Checklist

Morning (9 AM):
□ Check overnight build status
  $ rombsr status --since yesterday

□ Review any alerts from monitoring
  $ journalctl -u 'rombsr-*' --since "8 hours ago" | grep ERROR

□ Verify signing queue health
  $ rombsr sign --status

□ Check disk space on all nodes
  $ ansible all -m shell -a "df -h /var/lib/rombsr"

End of Day (5 PM):
□ Trigger nightly builds if not automated
  $ rombsr build grapheneos shiba
  $ rombsr build calyxos shiba

□ Clean old builds (keep last 7 days)
  $ rombsr clean --builds --older-than 7

□ Verify HSM connectivity for overnight signing
  $ rombsr hsm-status
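
The "keep last 7 days" rule in the checklist can be read as a simple cutoff on build age. The sketch below shows that reading; how `rombsr clean --older-than` actually decides what to delete is not documented here, so the function, its inputs, and the treatment of the exact 7-day boundary are assumptions.

```python
from datetime import datetime, timedelta


def builds_to_clean(finished_at: dict[str, datetime], now: datetime, keep_days: int = 7) -> list[str]:
    """Return the names of builds that finished more than `keep_days` days before `now`."""
    cutoff = now - timedelta(days=keep_days)
    # A build that finished exactly at the cutoff is kept.
    return sorted(name for name, ts in finished_at.items() if ts < cutoff)
```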

Maintenance Windows

Weekly Maintenance

  • Update ROMbsr to latest version
  • Rotate logs to prevent disk fill
  • Update virus definitions for scans
  • Test backup restoration procedure

Monthly Maintenance

  • Security patches for OS
  • Review and update firewall rules
  • Audit user access permissions
  • Performance tuning review

Quarterly Maintenance

  • HSM firmware updates
  • Disaster recovery drill
  • Security audit and pen testing
  • Capacity planning review

Performance Tuning

Build Performance Optimization

Setting          Default        Optimized          Impact
BUILD_JOBS       auto (nproc)   nproc * 1.5        +20% build speed
CCACHE_SIZE      50G            100-200G           +40% cache hits
USE_TMPFS        false          true (64GB+ RAM)   +30% I/O speed
REPO_SYNC_JOBS   4              16                 +60% sync speed
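
The "nproc * 1.5" recommendation for BUILD_JOBS is a small calculation. A minimal sketch follows; the function name is hypothetical, and rounding down is an assumption since the table does not say how fractional results are handled.

```python
import os
from typing import Optional


def optimized_build_jobs(cpu_count: Optional[int] = None, factor: float = 1.5) -> int:
    """Compute a BUILD_JOBS value as CPU count times `factor`, rounded down, at least 1."""
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, int(cpus * factor))
```

For example, `optimized_build_jobs(16)` gives 24. Oversubscribing CPUs like this is only likely to help when the node has enough RAM for the extra parallel compile jobs.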

Storage Configuration

# Fast NVMe for builds
/var/lib/rombsr/builds  - NVMe SSD (1TB+)
/var/lib/rombsr/cache   - NVMe SSD (200GB)

# Standard SSD for sources
/var/lib/rombsr/sources - SATA SSD (500GB)

# tmpfs for maximum speed (requires 64GB+ RAM)
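sudo mount -t tmpfs -o size=30G tmpfs /var/lib/rombsr/builds/out

# Optimize filesystem (ext4 only; pick one of the two options)
# Option 1: default to writeback journaling for file data
sudo tune2fs -o journal_data_writeback /dev/nvme0n1p1
# Option 2: disable the journal entirely (risky! filesystem must be unmounted first)
sudo tune2fs -O ^has_journal /dev/nvme0n1p1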

Troubleshooting Guide

Common Issues and Solutions

Build Keeps Failing at Same Stage

# Check state file for errors
jq . /var/lib/rombsr/state/<batch-id>.json

# Clear ccache if corruption suspected
ccache -C

# Increase memory limits
ulimit -v unlimited

# Check for disk space issues
df -h /var/lib/rombsr

HSM Not Responding

# Check USB connection
lsusb | grep Yubico

# Restart connector
sudo systemctl restart yubihsm-connector

# Check connector logs
tail -f /var/log/yubihsm-connector.log

# Test with yubihsm-shell
yubihsm-shell -C http://localhost:12345

Signing Queue Stuck

# Check git connectivity
cd /var/lib/rombsr/signing-queue
git pull

# Verify GPG keys
git log --show-signature -1

# Check sign orchestrator logs
journalctl -u rombsr-sign -f

# Manual queue processing
rombsr sign --once --verbose

Pro Tip: Enable debug logging temporarily for detailed troubleshooting: export ROMBSR_LOG_LEVEL=DEBUG

Disaster Recovery

Backup Strategy

#!/bin/bash
# Daily backup script
# Abort on the first failed step so a partial backup is never uploaded
set -euo pipefail

# Configuration backup
tar czf /backup/rombsr-config-$(date +%Y%m%d).tar.gz /etc/rombsr

# State backup
tar czf /backup/rombsr-state-$(date +%Y%m%d).tar.gz /var/lib/rombsr/state

# HSM backup (encrypted)
rombsr hsm-export /backup/hsm-$(date +%Y%m%d)

# Signing queue backup
cd /var/lib/rombsr/signing-queue
git bundle create /backup/queue-$(date +%Y%m%d).bundle --all

# Upload to offsite storage
aws s3 sync /backup/ s3://disaster-recovery/rombsr/

Recovery Procedures

  1. System Recovery: Restore from configuration management (Ansible/Puppet)
  2. HSM Recovery: Import keys from encrypted backup to new HSM
  3. Queue Recovery: Clone from git bundle or secondary remote
  4. State Recovery: Restore state files to resume interrupted builds