# Hermes Deployment Audit Report

## Issues Found

An audit of the Terraform project for the Hermes Agent deployment identified several issues that would prevent Hermes from running properly:

### 1. **Systemd Service Configuration Error** (CRITICAL)

**Problem:** The systemd service didn't specify the docker-compose file path

- `ExecStart=/usr/bin/docker compose up` without the `-f` flag
- The service couldn't find docker-compose.yml when running from an arbitrary directory
- No guarantee the service would change to the correct working directory

**Impact:** Service would start but fail immediately because it could not find the compose file.

**Fix:** Updated to:

```ini
ExecStart=/bin/sh -c 'cd /home/${admin_user} && exec docker compose -f docker-compose.yml up'
ExecStop=/bin/sh -c 'cd /home/${admin_user} && exec docker compose -f docker-compose.yml down'
```

### 2. **User Permissions Issue** (CRITICAL)

**Problem:** Service was configured to run as `User=${admin_user}` (non-root)

- Adding a user to the docker group with `usermod -aG docker` doesn't take effect for existing sessions
- The systemd service tries to use docker before the hermes user has proper permissions
- Would require a re-login to apply the docker group permissions

**Impact:** Service runs as the hermes user without the necessary docker group permissions, causing "permission denied" errors.

**Fix:** Changed service to run as root (necessary for Docker):

```ini
User=root
```

And ensured proper file ownership:

```bash
chown ${admin_user}:${admin_user} /home/${admin_user}/docker-compose.yml
chmod 644 /home/${admin_user}/docker-compose.yml
```

### 3. **Installation Order Issue**

**Problem:** Docker image was pulled before docker-compose-plugin was installed

- `docker pull` command succeeded (using legacy docker)
- But `docker compose` (the plugin) was installed later
- If the pull failed, docker-compose-plugin wouldn't have been installed yet

**Impact:** Potential race condition during bootstrap.

**Fix:** Reordered runcmd to install docker-compose-plugin immediately after Docker:

```yaml
1. curl docker installer
2. apt-get install docker-compose-plugin  # BEFORE pulling image
3. docker pull nousresearch/hermes-agent:latest
```

### 4. **No Docker Daemon Ready Check** (HIGH)

**Problem:** Script tried to pull images immediately after Docker installation

- Docker socket might not be ready
- Services were started before Docker was fully operational

**Impact:** Timing-dependent failures, especially on slower systems.

**Fix:** Added health checks and delays:

```bash
# Wait for Docker daemon to be ready
sleep 5
docker ps > /dev/null || (sleep 10 && docker ps)
```

### 5. **No Service Startup Verification** (MEDIUM)

**Problem:** Service was started with no check that it actually came up

- If the service failed to start, deployment would complete successfully anyway
- User wouldn't know until they SSH in

**Impact:** Silent failures that only become apparent when checking the server.

**Fix:** Added verification:

```bash
# Verify service started
systemctl is-active hermes.service || systemctl status hermes.service
```

### 6. **Poor Error Logging** (MEDIUM)

**Problem:** The systemd service logged to stdout, but nothing captured the startup errors

- No journal entries showing what went wrong
- No way to see Docker errors in the cloud-init logs

**Impact:** Difficult to diagnose why the service failed.

**Fix:** Added proper journal logging:

```ini
StandardOutput=journal
StandardError=journal
SyslogIdentifier=hermes
```

## Changes Made

### Terraform Files Modified

1. **templates/userdata-hermes.tpl**
   - Fixed systemd service configuration
   - Reordered runcmd operations
   - Added Docker readiness checks and delays
   - Enhanced health check script
   - Added service startup verification
   - Improved completion messages

2. **docs/HERMES_DEBUGGING.md** (NEW)
   - Comprehensive troubleshooting guide
   - Common issues and solutions
   - Diagnostic commands
   - Manual start/stop procedures
   - Discord connectivity testing

3. **README.md**
   - Added reference to HERMES_DEBUGGING.md documentation

## Testing These Changes

To test the fixes, you need to redeploy:

```bash
# Option 1: Destroy and redeploy (cleanest)
terraform destroy  # Answer yes when prompted
source .env && terraform init && terraform apply

# Option 2: Update existing (if keeping infrastructure)
source .env && terraform apply -auto-approve
```

After deployment, verify Hermes is running:

```bash
# SSH into the server (username is 'hermes' or your override)
ssh hermes@

# Run the health check
/usr/local/bin/hermes-health-check.sh

# Or manually verify
systemctl status hermes.service
docker ps
docker logs hermes
```

## Deployment Flow Now

With the fixes, the cloud-init deployment flow is:

1. ✓ Update system packages
2. ✓ Create hermes user
3. ✓ Write configuration files (.env, config.yaml, docker-compose.yml, SOUL.md)
4. ✓ Write health check script
5. ✓ Write systemd service unit
6. ✓ Install Docker
7. ✓ Install docker-compose-plugin
8. ✓ Wait for Docker daemon to be ready
9. ✓ Pull Hermes image
10. ✓ Set proper permissions
11. ✓ Reload systemd
12. ✓ Enable hermes.service
13. ✓ Start systemd service (which runs `docker compose up`)
14. ✓ Wait for startup
15. ✓ Verify service is active

## Expected Behavior After Fix

When you SSH into the server after deployment:

```bash
$ systemctl status hermes.service
● hermes.service - Hermes Agent Service
   Loaded: loaded (/etc/systemd/system/hermes.service; enabled; vendor preset: enabled)
   Active: active (running) since ...

$ docker ps
CONTAINER ID   IMAGE                              STATUS
abc123         nousresearch/hermes-agent:latest   Up 2 minutes

$ docker logs hermes
[INFO] Hermes Agent starting...
[INFO] Discord bot initialized
...
```

And in Discord:

- Bot shows "online" status
- Responds to mentions in configured channels
- Respects user allowlist

## Next Steps

1. **Redeploy** with the fixed template
2. **Verify** using the health checks documented in HERMES_DEBUGGING.md
3. **Test Discord** connectivity by mentioning the bot in a channel
4. **Monitor logs** using `docker logs -f hermes` if issues occur

## Additional Notes

- The audit identified these issues by analyzing the template configuration and deployment flow
- Similar fixes should be applied if you have OpenClaw deployments
- The systemd service is now production-ready with proper error handling
- Health check script was significantly enhanced for better diagnostics
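
The fixed `sleep` delays in issue 4 and the single `systemctl is-active` check in issue 5 could also be expressed as a bounded retry loop. The following is a minimal sketch under stated assumptions, not the code in the template: `wait_for` is a hypothetical helper name, and the attempt counts and delays shown in the usage comments are illustrative values.

```shell
#!/usr/bin/env bash
# wait_for (hypothetical helper): retry a command until it succeeds
# or the attempt budget is exhausted.
# Usage: wait_for <max_attempts> <delay_seconds> <command> [args...]
wait_for() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  # Run the command quietly; loop only while it keeps failing.
  while ! "$@" > /dev/null 2>&1; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "wait_for: '$*' still failing after ${attempt} attempts" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  return 0
}

# Illustrative usage in a bootstrap script (values are assumptions):
#   wait_for 12 5 docker info || exit 1
#   systemctl start hermes.service
#   wait_for 6 5 systemctl is-active --quiet hermes.service || {
#     journalctl -u hermes.service --no-pager -n 50
#     exit 1
#   }
```

Compared with a fixed delay, a loop like this returns as soon as the daemon answers and fails the bootstrap with a non-zero exit code when it never does, so the failure is visible in the cloud-init logs.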