Hermes Deployment Audit - Summary of Fixes
Executive Summary
The Terraform Hermes deployment had 5 critical issues preventing the service from running. All have been fixed in the cloud-init template.
What Was Wrong
Critical Issues Found:
- ✗ Systemd service couldn't find docker-compose.yml
  - ExecStart=/usr/bin/docker compose up (missing file path)
- ✗ Service ran as non-root user without Docker permissions
  - User permissions from usermod -aG docker don't take effect for the systemd service
- ✗ Docker image pulled before docker-compose-plugin installed
  - Installation order was wrong
- ✗ No check that Docker daemon was ready
  - Timing issues during bootstrap
- ✗ No verification service actually started
  - Deployment would complete even if Hermes failed to start
What Was Fixed
1. Systemd Service Configuration
Before:
ExecStart=/usr/bin/docker compose up
ExecStop=/usr/bin/docker compose down
User=${admin_user}
After:
ExecStart=/bin/sh -c 'cd /home/${admin_user} && exec docker compose -f docker-compose.yml up'
ExecStop=/bin/sh -c 'cd /home/${admin_user} && exec docker compose -f docker-compose.yml down'
User=root
StandardOutput=journal
StandardError=journal
SyslogIdentifier=hermes
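Why: The service now changes into /home/${admin_user} before invoking docker compose, so it finds docker-compose.yml, and running as root avoids the Docker permission problem. Output goes to the journal under the hermes identifier.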
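For context, the corrected directives could sit in a complete unit file along the following lines. This is only a sketch: render_hermes_unit is a hypothetical helper, and the [Unit] and [Install] sections (Description, After, Requires, WantedBy) are assumptions, since the original shows only the [Service] directives.

```shell
#!/bin/sh
# Hypothetical helper: print a hermes.service unit for the given admin user.
# Only the [Service] directives come from the fix above; the [Unit] and
# [Install] sections are assumptions added to make the unit complete.
render_hermes_unit() {
    admin_user="$1"
    cat <<EOF
[Unit]
Description=Hermes agent (docker compose)
# Assumption: start only after the Docker daemon unit is up.
After=docker.service
Requires=docker.service

[Service]
User=root
ExecStart=/bin/sh -c 'cd /home/${admin_user} && exec docker compose -f docker-compose.yml up'
ExecStop=/bin/sh -c 'cd /home/${admin_user} && exec docker compose -f docker-compose.yml down'
StandardOutput=journal
StandardError=journal
SyslogIdentifier=hermes

[Install]
WantedBy=multi-user.target
EOF
}
```

A possible use is render_hermes_unit hermes > /etc/systemd/system/hermes.service followed by systemctl daemon-reload.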
2. Installation Order
Before:
- curl -fsSL https://get.docker.com | sh
- docker pull nousresearch/hermes-agent:latest
- apt-get install -y docker-compose-plugin # too late
After:
- curl -fsSL https://get.docker.com | sh
- apt-get install -y docker-compose-plugin # right after docker
- sleep 5
- docker ps > /dev/null || (sleep 10 && docker ps) # verify ready
- docker pull nousresearch/hermes-agent:latest
Why: Ensures docker-compose-plugin is installed before use and Docker is ready.
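The fixed sleeps above could also be written as a bounded polling loop, which returns as soon as the daemon answers and fails loudly if it never does. The sketch below is one way to do that; the retry helper and its argument order are hypothetical and not part of the deployed template.

```shell
#!/bin/sh
# Hypothetical helper: retry ATTEMPTS DELAY_SECONDS COMMAND...
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
# Returns 0 on the first success, 1 if every attempt failed.
retry() {
    attempts="$1"
    delay="$2"
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        # Discard the command's output; only its exit status matters here.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        # Wait between attempts, but not after the last one.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Example use in the bootstrap (assumed values: 30 attempts, 2 s apart):
#   retry 30 2 docker ps || { echo "Docker daemon not ready" >&2; exit 1; }
#   docker pull nousresearch/hermes-agent:latest
```

With this shape the worst-case wait is bounded by ATTEMPTS times DELAY_SECONDS, and a daemon that is already up costs no delay at all.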
3. Service Startup Verification
Before:
- systemctl start hermes.service
# ... done, might have failed but we don't know
After:
- systemctl start hermes.service
- sleep 3
- systemctl is-active hermes.service || systemctl status hermes.service
Why: A failed start now shows up immediately in the bootstrap output, along with the service status, instead of passing silently.
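A slightly more defensive variant of this check polls for a short time and dumps recent journal entries on failure. The following is a sketch only: verify_service and its default 30-second limit are hypothetical, not what the template currently runs.

```shell
#!/bin/sh
# Hypothetical helper: verify_service UNIT [LIMIT_SECONDS]
# Polls once per second until UNIT is active or the limit is reached.
# On failure, prints status and recent journal lines to stderr and returns 1.
verify_service() {
    unit="$1"
    limit="${2:-30}"
    elapsed=0
    while [ "$elapsed" -lt "$limit" ]; do
        if systemctl is-active --quiet "$unit"; then
            echo "$unit is active"
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    echo "$unit did not become active within ${limit}s" >&2
    # status exits non-zero for inactive units, so do not let it abort callers.
    systemctl status "$unit" --no-pager >&2 || true
    journalctl -u "$unit" -n 50 --no-pager >&2 || true
    return 1
}

# Example use after starting the service:
#   systemctl start hermes.service
#   verify_service hermes.service 30 || exit 1
```

Returning a non-zero status lets the calling script decide whether a failed start should abort the rest of the bootstrap.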
4. Enhanced Health Check Script
Added comprehensive diagnostics:
- ✓ Docker daemon status
- ✓ Container exists
- ✓ Container running (with uptime)
- ✓ Port listening
- ✓ Config files exist
- ✓ Systemd service status
- ✓ Recent logs
- ✓ Discord configuration check
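The diagnostics listed above can be organised around one small reporting helper. The sketch below is an illustration rather than the contents of hermes-health-check.sh: the check and run_hermes_checks functions are hypothetical, it covers only a subset of the checks, and the config file path is an assumption based on the compose file location used elsewhere in this document.

```shell
#!/bin/sh
# Hypothetical sketch of a health-check layout; not the deployed script.
failures=0

# check LABEL COMMAND...
# Runs COMMAND quietly and prints a ✓ or ✗ line for LABEL.
check() {
    label="$1"
    shift
    if "$@" >/dev/null 2>&1; then
        printf '✓ %s\n' "$label"
    else
        printf '✗ %s\n' "$label"
        failures=$((failures + 1))
    fi
}

# A subset of the checks listed above. The container name "hermes" matches
# the docker logs command used in this document; the file path is assumed.
run_hermes_checks() {
    check "Docker daemon status"   docker ps
    check "Container exists"       docker inspect hermes
    check "Container running"      sh -c '[ "$(docker inspect -f "{{.State.Running}}" hermes 2>/dev/null)" = "true" ]'
    check "Config files exist"     test -f /home/hermes/docker-compose.yml
    check "Systemd service status" systemctl is-active --quiet hermes.service
    # Report overall success only if no check failed.
    [ "$failures" -eq 0 ]
}
```

Calling run_hermes_checks prints one line per check and returns non-zero if any of them failed, which makes the script usable from other automation as well as interactively.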
New Documentation
1. HERMES_DEBUGGING.md
Complete troubleshooting guide with:
- Quick diagnostic checklist
- Common issues and their fixes
- Command reference
- Manual start/stop procedures
- Discord connectivity testing
- Log interpretation
2. HERMES_AUDIT_REPORT.md
Detailed audit findings explaining:
- What each issue was
- Why it caused failures
- How it was fixed
- Expected behavior after fixes
How to Apply These Fixes
Option 1: Fresh Deployment (Cleanest)
terraform destroy -auto-approve
source .env && terraform init && terraform apply
Option 2: Update Existing Stack
source .env && terraform apply -auto-approve
Verification After Deployment
After applying these fixes and deploying:
# SSH into server
ssh hermes@<SERVER_IP>
# Run comprehensive health check
/usr/local/bin/hermes-health-check.sh
# Manually verify
systemctl status hermes.service
docker ps
docker logs hermes
Expected output:
- ✓ Hermes systemd service active
- ✓ Docker container running
- ✓ Gateway listening on port 18789
- ✓ Discord bot shows online in your server
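The port expectation can be confirmed by hand from the server. The sketch below assumes the ss utility from iproute2 is installed and that its listing has a header line followed by one socket per line; has_listener is a hypothetical helper that reads such a listing on standard input.

```shell
#!/bin/sh
# Hypothetical helper: has_listener PORT
# Reads `ss -ltn` style output on stdin and succeeds if any address field
# ends in ":PORT". The header line is skipped.
has_listener() {
    port="$1"
    awk -v suffix=":${port}" '
        NR > 1 {
            for (i = 1; i <= NF; i++) {
                n = length($i) - length(suffix)
                # Compare the tail of the field with ":PORT" exactly.
                if (n >= 0 && substr($i, n + 1) == suffix) found = 1
            }
        }
        END { exit !found }
    '
}

# Example use on the server:
#   ss -ltn | has_listener 18789 && echo "Gateway listening on port 18789"
```

Matching the exact ":PORT" suffix avoids false positives from ports that merely contain the same digits.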
Files Changed
Core Deployment
- templates/userdata-hermes.tpl - Fixed cloud-init configuration
Documentation
- docs/HERMES_DEBUGGING.md - NEW Troubleshooting guide
- docs/HERMES_AUDIT_REPORT.md - NEW Detailed audit findings
- README.md - Added reference to debugging guide
Why These Fixes Work
Each fix addresses a specific failure point:
| Issue | Root Cause | Fix | Result |
|---|---|---|---|
| Compose file not found | No path specified | Specify full path with -f | Service finds config |
| Docker permission denied | Non-root user, group not applied | Run service as root | Service can use Docker |
| Docker not ready | Immediate pull attempt | Add delays and checks | Image pulls successfully |
| Silent failures | No verification | Check service status | Know if it failed |
| Can't debug | No logging | Added journal logging | Can read logs |
Testing the Fixes
To verify the fixes work on your deployments:
- Quick test (5 min):
      # Just check service is running
      systemctl status hermes.service
      docker ps | grep hermes
- Full health check (10 min):
      /usr/local/bin/hermes-health-check.sh
- Discord test (Manual):
  - Mention the bot in a configured channel
  - It should respond within a few seconds
Rollback Plan
If something goes wrong:
# Revert to previous state
git checkout templates/userdata-hermes.tpl
# Then redeploy or manually stop
systemctl stop hermes.service
docker compose -f ~hermes/docker-compose.yml down
OpenClaw Status
✓ OpenClaw service is properly configured and doesn't have these issues.
Next Steps
- Review the changes in templates/userdata-hermes.tpl
- Redeploy using terraform apply
- Verify using systemctl status hermes.service
- Test Discord connectivity
- Refer to HERMES_DEBUGGING.md if any issues occur
All changes are backward compatible and don't affect other components.