# Hermes Deployment Audit - Summary of Fixes ## Executive Summary The Terraform Hermes deployment had **5 critical issues** preventing the service from running. All have been fixed in the cloud-init template. ## What Was Wrong ### Critical Issues Found: 1. ✗ **Systemd service couldn't find docker-compose.yml** - `ExecStart=/usr/bin/docker compose up` (missing file path) 2. ✗ **Service ran as non-root user without Docker permissions** - User permissions from `usermod -aG docker` don't take effect for the systemd service 3. ✗ **Docker image pulled before docker-compose-plugin installed** - Installation order was wrong 4. ✗ **No check that Docker daemon was ready** - Timing issues during bootstrap 5. ✗ **No verification service actually started** - Deployment would complete even if Hermes failed to start ## What Was Fixed ### 1. Systemd Service Configuration **Before:** ```ini ExecStart=/usr/bin/docker compose up ExecStop=/usr/bin/docker compose down User=${admin_user} ``` **After:** ```ini ExecStart=/bin/sh -c 'cd /home/${admin_user} && exec docker compose -f docker-compose.yml up' ExecStop=/bin/sh -c 'cd /home/${admin_user} && exec docker compose -f docker-compose.yml down' User=root StandardOutput=journal StandardError=journal SyslogIdentifier=hermes ``` **Why:** Now properly finds the compose file and doesn't have permission issues. --- ### 2. Installation Order **Before:** ```yaml - curl -fsSL https://get.docker.com | sh - apt-get install -y docker-compose-plugin # too late - docker pull nousresearch/hermes-agent:latest ``` **After:** ```yaml - curl -fsSL https://get.docker.com | sh - apt-get install -y docker-compose-plugin # right after docker - sleep 5 - docker ps > /dev/null || (sleep 10 && docker ps) # verify ready - docker pull nousresearch/hermes-agent:latest ``` **Why:** Ensures docker-compose-plugin is installed before use and Docker is ready. --- ### 3. Service Startup Verification **Before:** ```yaml - systemctl start hermes.service # ... done, might have failed but we don't know ``` **After:** ```yaml - systemctl start hermes.service - sleep 3 - systemctl is-active hermes.service || systemctl status hermes.service ``` **Why:** Immediately tells us if startup failed. --- ### 4. Enhanced Health Check Script **Added comprehensive diagnostics:** - ✓ Docker daemon status - ✓ Container exists - ✓ Container running (with uptime) - ✓ Port listening - ✓ Config files exist - ✓ Systemd service status - ✓ Recent logs - ✓ Discord configuration check --- ## New Documentation ### 1. **HERMES_DEBUGGING.md** Complete troubleshooting guide with: - Quick diagnostic checklist - Common issues and their fixes - Command reference - Manual start/stop procedures - Discord connectivity testing - Log interpretation ### 2. **HERMES_AUDIT_REPORT.md** Detailed audit findings explaining: - What each issue was - Why it caused failures - How it was fixed - Expected behavior after fixes --- ## How to Apply These Fixes ### Option 1: Fresh Deployment (Cleanest) ```bash terraform destroy -auto-approve source .env && terraform init && terraform apply ``` ### Option 2: Update Existing Stack ```bash source .env && terraform apply -auto-approve ``` --- ## Verification After Deployment After applying these fixes and deploying: ```bash # SSH into server ssh hermes@ # Run comprehensive health check /usr/local/bin/hermes-health-check.sh # Manually verify systemctl status hermes.service docker ps docker logs hermes ``` **Expected output:** - ✓ Hermes systemd service active - ✓ Docker container running - ✓ Gateway listening on port 18789 - ✓ Discord bot shows online in your server --- ## Files Changed ### Core Deployment - `templates/userdata-hermes.tpl` - Fixed cloud-init configuration ### Documentation - `docs/HERMES_DEBUGGING.md` - **NEW** Troubleshooting guide - `docs/HERMES_AUDIT_REPORT.md` - **NEW** Detailed audit findings - `README.md` - Added reference to debugging guide --- ## Why These Fixes Work Each fix addresses a specific failure point: | Issue | Root Cause | Fix | Result | |-------|-----------|-----|--------| | Compose file not found | No path specified | Specify full path with `-f` | Service finds config | | Docker permission denied | Non-root user, group not applied | Run service as root | Service can use Docker | | Docker not ready | Immediate pull attempt | Add delays and checks | Image pulls successfully | | Silent failures | No verification | Check service status | Know if it failed | | Can't debug | No logging | Added journal logging | Can read logs | --- ## Testing the Fixes To verify the fixes work on your deployments: 1. **Quick test (5 min):** ```bash # Just check service is running systemctl status hermes.service docker ps | grep hermes ``` 2. **Full health check (10 min):** ```bash /usr/local/bin/hermes-health-check.sh ``` 3. **Discord test (Manual):** - Mention the bot in a configured channel - It should respond within a few seconds --- ## Rollback Plan If something goes wrong: ```bash # Revert to previous state git checkout templates/userdata-hermes.tpl # Then redeploy or manually stop systemctl stop hermes.service docker compose -f ~hermes/docker-compose.yml down ``` --- ## OpenClaw Status ✓ OpenClaw service is properly configured and doesn't have these issues. --- ## Next Steps 1. **Review** the changes in `templates/userdata-hermes.tpl` 2. **Redeploy** using `terraform apply` 3. **Verify** using `systemctl status hermes.service` 4. **Test** Discord connectivity 5. **Refer** to `HERMES_DEBUGGING.md` if any issues occur All changes are backward compatible and don't affect other components.