openboatmobile-ai/HERMES_FIX_SUMMARY.md
CeeLo Greenheart a593af9b27 Initial commit - Clean public release
Sanitized for public release:
- Removed all API keys, tokens, and secrets
- Removed personal Discord IDs from hermes-openclaw.json
- Updated git URLs to be generic placeholders
- All sensitive data uses environment variable interpolation
2026-04-22 19:13:28 +00:00


Hermes Deployment Audit - Summary of Fixes

Executive Summary

The Terraform deployment of Hermes had five critical issues that prevented the service from running. All five have been fixed in the cloud-init template (templates/userdata-hermes.tpl).

What Was Wrong

Critical Issues Found:

  1. Systemd service couldn't find docker-compose.yml

    • ExecStart=/usr/bin/docker compose up (missing file path)
  2. Service ran as non-root user without Docker permissions

    • The docker group membership added by usermod -aG docker did not take effect for the systemd service
  3. Docker image pulled before docker-compose-plugin installed

    • Installation order was wrong
  4. No check that Docker daemon was ready

    • Timing issues during bootstrap
  5. No verification service actually started

    • Deployment would complete even if Hermes failed to start

What Was Fixed

1. Systemd Service Configuration

Before:

ExecStart=/usr/bin/docker compose up
ExecStop=/usr/bin/docker compose down
User=${admin_user}

After:

ExecStart=/bin/sh -c 'cd /home/${admin_user} && exec docker compose -f docker-compose.yml up'
ExecStop=/bin/sh -c 'cd /home/${admin_user} && exec docker compose -f docker-compose.yml down'
User=root
StandardOutput=journal
StandardError=journal
SyslogIdentifier=hermes

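Why: The service now changes into /home/${admin_user} and names the compose file explicitly with -f, so docker compose no longer depends on the working directory systemd happens to use. Running as root avoids the Docker socket permission problem, and output goes to the journal under the identifier hermes (readable with journalctl -t hermes).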


2. Installation Order

Before:

- curl -fsSL https://get.docker.com | sh
- docker pull nousresearch/hermes-agent:latest
- apt-get install -y docker-compose-plugin  # too late

After:

- curl -fsSL https://get.docker.com | sh
- apt-get install -y docker-compose-plugin  # right after docker
- sleep 5
- docker ps > /dev/null || (sleep 10 && docker ps)  # verify ready
- docker pull nousresearch/hermes-agent:latest

Why: docker-compose-plugin is now installed directly after Docker, and the image is pulled only after docker ps confirms that the daemon is responding.
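A fixed sleep can be too short on a slow boot and wasteful on a fast one. A bounded retry loop is one alternative. The following is a minimal sketch, not the code in the template: the function name retry and its argument order are hypothetical, and the attempt count and delay shown in the usage comment are arbitrary.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
# Returns 0 on the first success, 1 if every attempt failed.
retry() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        # Do not sleep after the final failed attempt.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Hypothetical use in the bootstrap sequence (not executed here):
#   retry 12 5 docker ps || { echo "Docker daemon not ready" >&2; exit 1; }
#   docker pull nousresearch/hermes-agent:latest
```

With this shape the bootstrap waits only as long as needed and fails explicitly if the daemon never comes up, instead of continuing to the pull.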


3. Service Startup Verification

Before:

- systemctl start hermes.service
# ... done, might have failed but we don't know

After:

- systemctl start hermes.service
- sleep 3
- systemctl is-active hermes.service || systemctl status hermes.service

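Why: If the unit is not active a few seconds after being started, systemctl status prints its state and most recent log lines into the bootstrap output, so a failed start is no longer silent.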
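A single check after a short sleep can miss a service that starts and then exits a few seconds later. If the unit is Type=simple (an assumption; the unit's Type is not shown here), systemd reports it as active as soon as the process is spawned. A stricter variant, sketched below under that assumption, requires the unit to stay active for a whole observation window. The function name stays_active is hypothetical.

```shell
# stays_active UNIT SECONDS
# Checks `systemctl is-active` once per second for SECONDS seconds.
# Returns 1 at the first check that does not report "active",
# after printing recent journal lines for the unit to stderr.
stays_active() {
    unit=$1
    window=$2
    elapsed=0
    while [ "$elapsed" -lt "$window" ]; do
        # is-active exits non-zero for inactive units; keep the text only.
        state=$(systemctl is-active "$unit" 2>/dev/null) || true
        if [ "$state" != "active" ]; then
            echo "$unit is ${state:-unknown}" >&2
            journalctl -u "$unit" -n 20 --no-pager >&2 || true
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    echo "$unit stayed active for ${window}s"
}

# Hypothetical use after starting the service (not executed here):
#   systemctl start hermes.service
#   stays_active hermes.service 10 || exit 1
```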


4. Enhanced Health Check Script

The health check script (/usr/local/bin/hermes-health-check.sh) now reports:

  • ✓ Docker daemon status
  • ✓ Container exists
  • ✓ Container running (with uptime)
  • ✓ Port listening
  • ✓ Config files exist
  • ✓ Systemd service status
  • ✓ Recent logs
  • ✓ Discord configuration check
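The checks listed above can be structured around one small helper that runs a command and prints a pass or fail mark. This is an illustrative sketch, not the contents of hermes-health-check.sh: the names check and run_checks are hypothetical, and only a subset of the checks is shown. The container name hermes and port 18789 are the ones used elsewhere in this document.

```shell
# Counters for the final summary line.
passed=0
failed=0

# check LABEL COMMAND [ARGS...]
# Runs COMMAND quietly and prints "✓ LABEL" or "✗ LABEL".
check() {
    label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "✓ $label"
        passed=$((passed + 1))
    else
        echo "✗ $label"
        failed=$((failed + 1))
    fi
}

# run_checks: a subset of the diagnostics, one check per line.
# Returns non-zero if any check failed.
run_checks() {
    check "Docker daemon status" docker ps
    check "Container exists" docker inspect hermes
    check "Container running" \
        sh -c '[ "$(docker inspect -f "{{.State.Running}}" hermes)" = "true" ]'
    check "Port listening" sh -c 'ss -ltn | grep -q ":18789 "'
    check "Systemd service status" systemctl is-active --quiet hermes.service
    echo "$passed passed, $failed failed"
    [ "$failed" -eq 0 ]
}
```

Calling run_checks on the server prints one line per check followed by the summary. Because every check goes through the same helper, a failing check does not stop the ones after it.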

New Documentation

1. HERMES_DEBUGGING.md

Complete troubleshooting guide with:

  • Quick diagnostic checklist
  • Common issues and their fixes
  • Command reference
  • Manual start/stop procedures
  • Discord connectivity testing
  • Log interpretation

2. HERMES_AUDIT_REPORT.md

Detailed audit findings explaining:

  • What each issue was
  • Why it caused failures
  • How it was fixed
  • Expected behavior after fixes

How to Apply These Fixes

Option 1: Fresh Deployment (Cleanest)

source .env && terraform destroy -auto-approve
terraform init && terraform apply

Option 2: Update Existing Stack

source .env && terraform apply -auto-approve

Verification After Deployment

After applying these fixes and deploying:

# SSH into server
ssh hermes@<SERVER_IP>

# Run comprehensive health check
/usr/local/bin/hermes-health-check.sh

# Manually verify
systemctl status hermes.service
docker ps
docker logs hermes

Expected output:

  • ✓ Hermes systemd service active
  • ✓ Docker container running
  • ✓ Gateway listening on port 18789
  • ✓ Discord bot shows online in your server

Files Changed

Core Deployment

  • templates/userdata-hermes.tpl - Fixed cloud-init configuration

Documentation

  • docs/HERMES_DEBUGGING.md - NEW Troubleshooting guide
  • docs/HERMES_AUDIT_REPORT.md - NEW Detailed audit findings
  • README.md - Added reference to debugging guide

Why These Fixes Work

Each fix addresses a specific failure point:

| Issue                    | Root Cause                      | Fix                       | Result                   |
|--------------------------|---------------------------------|---------------------------|--------------------------|
| Compose file not found   | No path specified               | Specify full path with -f | Service finds config     |
| Docker permission denied | Non-root user, group not applied | Run service as root       | Service can use Docker   |
| Docker not ready         | Immediate pull attempt          | Add delays and checks     | Image pulls successfully |
| Silent failures          | No verification                 | Check service status      | Know if it failed        |
| Can't debug              | No logging                      | Added journal logging     | Can read logs            |

Testing the Fixes

To verify the fixes work on your deployments:

  1. Quick test (5 min):

    # Just check service is running
    systemctl status hermes.service
    docker ps | grep hermes
    
  2. Full health check (10 min):

    /usr/local/bin/hermes-health-check.sh
    
  3. Discord test (Manual):

    • Mention the bot in a configured channel
    • It should respond within a few seconds

Rollback Plan

If something goes wrong:

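# Discard uncommitted changes to the template
# (if the changes are already committed, use:
#  git checkout <previous-commit> -- templates/userdata-hermes.tpl)
git checkout templates/userdata-hermes.tpl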

# Then redeploy or manually stop
systemctl stop hermes.service
docker compose -f ~hermes/docker-compose.yml down

OpenClaw Status

✓ The OpenClaw service is configured correctly and is not affected by these issues.


Next Steps

  1. Review the changes in templates/userdata-hermes.tpl
  2. Redeploy using terraform apply
  3. Verify using systemctl status hermes.service
  4. Test Discord connectivity
  5. Refer to HERMES_DEBUGGING.md if any issues occur

All changes are backward compatible and don't affect other components.