openboatmobile-ai/docs/HERMES_AUDIT_REPORT.md
CeeLo Greenheart a593af9b27 Initial commit - Clean public release
Sanitized for public release:
- Removed all API keys, tokens, and secrets
- Removed personal Discord IDs from hermes-openclaw.json
- Updated git URLs to be generic placeholders
- All sensitive data uses environment variable interpolation
2026-04-22 19:13:28 +00:00

Hermes Deployment Audit Report

Issues Found

An audit of the Terraform project for the Hermes Agent deployment found six issues, two of them critical, that could prevent Hermes from starting or make a failed start hard to diagnose:

1. Systemd Service Configuration Error (CRITICAL)

Problem: The systemd service didn't specify the docker-compose file path

  • ExecStart=/usr/bin/docker compose up without the -f flag
  • The service couldn't find docker-compose.yml when running from an arbitrary directory
  • No guarantee the service would change to the correct working directory

Impact: The service would start and then fail immediately because docker compose could not locate docker-compose.yml.

Fix: Updated to:

ExecStart=/bin/sh -c 'cd /home/${admin_user} && exec docker compose -f docker-compose.yml up'
ExecStop=/bin/sh -c 'cd /home/${admin_user} && exec docker compose -f docker-compose.yml down'

2. User Permissions Issue (CRITICAL)

Problem: Service was configured to run as User=${admin_user} (non-root)

  • Adding a user to the docker group with usermod -aG docker doesn't take effect for existing sessions
  • The systemd service tries to use docker before the hermes user has proper permissions
  • Would require a re-login to apply the docker group permissions

Impact: Service runs as hermes user without the necessary docker group permissions, causing "permission denied" errors.

Fix: Changed the service to run as root, which can always reach the Docker socket and does not depend on docker group membership:

User=root

And ensured proper file ownership:

chown ${admin_user}:${admin_user} /home/${admin_user}/docker-compose.yml
chmod 644 /home/${admin_user}/docker-compose.yml
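
To confirm whether a group change has taken effect, it can help to compare the groups recorded for the account with the groups of the current session. The helper below is a minimal sketch for that comparison; `has_group` is a hypothetical name and is not part of the template.

```shell
#!/bin/sh
# has_group GROUP_LIST GROUP
# Returns 0 if GROUP appears as a whole word in the space-separated
# GROUP_LIST, 1 otherwise. (Hypothetical helper for illustration.)
has_group() {
    for g in $1; do
        [ "$g" = "$2" ] && return 0
    done
    return 1
}
```

For example, `has_group "$(id -nG hermes)" docker` checks the groups stored for the hermes account, while `has_group "$(id -nG)" docker` checks the groups of the shell that runs it. If the first succeeds and the second fails, the session predates the `usermod` call.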

3. Installation Order Issue

Problem: Docker image was pulled before docker-compose-plugin was installed

  • docker pull command succeeded (using legacy docker)
  • But docker compose (the plugin) comes later
  • If the pull failed, docker-compose-plugin wouldn't have been installed yet

Impact: An ordering dependency during bootstrap: a failed image pull could leave the host without the compose plugin that the service needs.

Fix: Reordered runcmd to install docker-compose-plugin immediately after Docker:

1. curl docker installer
2. apt-get install docker-compose-plugin  # BEFORE pulling image
3. docker pull nousresearch/hermes-agent:latest

4. No Docker Daemon Ready Check (HIGH)

Problem: Script tried to pull images immediately after Docker installation

  • Docker socket might not be ready
  • Starting services before Docker is fully operational

Impact: Timing-dependent failures, especially on slower systems.

Fix: Added health checks and delays:

# Wait for Docker daemon to be ready
sleep 5
docker ps > /dev/null || (sleep 10 && docker ps)
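
A fixed sleep followed by a single retry can still lose on a slow host. One possible alternative is a bounded polling loop, sketched below under the assumption that a POSIX shell runs the bootstrap commands; `wait_for` is a hypothetical helper name, not something the template defines.

```shell
#!/bin/sh
# wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds, sleeping DELAY_SECONDS between tries.
# Returns 0 on the first success, 1 if every attempt fails.
# (Hypothetical helper for illustration.)
wait_for() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        # Sleep only when another attempt remains.
        [ "$i" -lt "$attempts" ] && sleep "$delay"
        i=$((i + 1))
    done
    return 1
}
```

A call such as `wait_for 30 2 docker info || exit 1` would poll the daemon for up to roughly a minute and stop the bootstrap if it never answers. The attempt count and delay here are illustrative values, not tuned ones.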

5. No Service Startup Verification (MEDIUM)

Problem: Service was started with no check that it actually came up

  • If the service failed to start, deployment would complete successfully anyway
  • User wouldn't know until they SSH in

Impact: Silent failures that only become apparent when checking the server.

Fix: Added verification:

# Verify service started
systemctl is-active hermes.service || systemctl status hermes.service
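
The one-liner above prints the unit status when the service is not active. A slightly fuller sketch could also dump recent journal entries and return a clear failure code for the caller to act on; `verify_service` is a hypothetical helper, and the 50-line journal tail is an arbitrary choice.

```shell
#!/bin/sh
# verify_service UNIT
# Returns 0 if UNIT is active; otherwise prints diagnostics to stderr
# and returns 1. (Hypothetical helper for illustration.)
verify_service() {
    unit=$1
    if systemctl is-active --quiet "$unit"; then
        echo "$unit is active"
        return 0
    fi
    echo "$unit failed to start" >&2
    # Diagnostics are best-effort; their exit codes are ignored.
    systemctl status "$unit" --no-pager >&2 || true
    journalctl -u "$unit" -n 50 --no-pager >&2 || true
    return 1
}
```

Called as `verify_service hermes.service || exit 1`, this would leave the reason for a failed start in the bootstrap output instead of requiring an SSH session to find it.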

6. Poor Error Logging (MEDIUM)

Problem: systemd service logged to stdout but nothing captured the startup errors

  • No journal entries with what went wrong
  • No way to see Docker errors in the cloud-init logs

Impact: Difficult to diagnose why the service failed.

Fix: Added proper journal logging:

StandardOutput=journal
StandardError=journal
SyslogIdentifier=hermes
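
With these directives in place, entries written by the service carry the identifier hermes, so they can be filtered by that tag. The wrapper below is a small sketch of such a query; `hermes_logs` is a hypothetical name and the default of 100 lines is arbitrary.

```shell
#!/bin/sh
# hermes_logs [LINES]
# Prints the last LINES (default 100) journal entries tagged "hermes"
# from the current boot. (Hypothetical helper for illustration.)
hermes_logs() {
    journalctl -t hermes -b -n "${1:-100}" --no-pager
}
```

Running `hermes_logs 20` shows the 20 most recent entries; `journalctl -u hermes.service` is an equivalent view keyed on the unit name instead of the identifier.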

Changes Made

Terraform Files Modified

  1. templates/userdata-hermes.tpl

    • Fixed systemd service configuration
    • Reordered runcmd operations
    • Added Docker readiness checks and delays
    • Enhanced health check script
    • Added service startup verification
    • Improved completion messages
  2. docs/HERMES_DEBUGGING.md (NEW)

    • Comprehensive troubleshooting guide
    • Common issues and solutions
    • Diagnostic commands
    • Manual start/stop procedures
    • Discord connectivity testing
  3. README.md

    • Added reference to HERMES_DEBUGGING.md documentation

Testing These Changes

To test the fixes, you need to redeploy:

# Option 1: Destroy and redeploy (cleanest)
terraform destroy
# Answer yes when prompted
source .env && terraform init && terraform apply

# Option 2: Update existing (if keeping infrastructure)
source .env && terraform apply -auto-approve

After deployment, verify Hermes is running:

# SSH into the server (username is 'hermes' or your override)
ssh hermes@<SERVER_IP>

# Run the health check
/usr/local/bin/hermes-health-check.sh

# Or manually verify
systemctl status hermes.service
docker ps
docker logs hermes

Deployment Flow Now

With the fixes, the cloud-init deployment flow is now:

  1. ✓ Update system packages
  2. ✓ Create hermes user
  3. ✓ Write configuration files (.env, config.yaml, docker-compose.yml, SOUL.md)
  4. ✓ Write health check script
  5. ✓ Write systemd service unit
  6. ✓ Install Docker
  7. ✓ Install docker-compose-plugin
  8. ✓ Wait for Docker daemon to be ready
  9. ✓ Pull Hermes image
  10. ✓ Set proper permissions
  11. ✓ Reload systemd
  12. ✓ Enable hermes.service
  13. ✓ Start systemd service (which runs docker compose up)
  14. ✓ Wait for startup
  15. ✓ Verify service is active

Expected Behavior After Fix

When you SSH into the server after deployment:

$ systemctl status hermes.service
● hermes.service - Hermes Agent Service
     Loaded: loaded (/etc/systemd/system/hermes.service; enabled; vendor preset: enabled)
     Active: active (running) since ...
     
$ docker ps
CONTAINER ID   IMAGE                              STATUS
abc123         nousresearch/hermes-agent:latest   Up 2 minutes

$ docker logs hermes
[INFO] Hermes Agent starting...
[INFO] Discord bot initialized
...

And in Discord:

  • Bot shows "online" status
  • Responds to mentions in configured channels
  • Respects user allowlist

Next Steps

  1. Redeploy with the fixed template
  2. Verify using the health checks documented in HERMES_DEBUGGING.md
  3. Test Discord connectivity by mentioning the bot in a channel
  4. Monitor logs using docker logs -f hermes if issues occur

Additional Notes

  • The audit identified these issues by analyzing the template configuration and deployment flow
  • Similar fixes should be applied if you have OpenClaw deployments
  • The systemd service now logs to the journal, and its startup is verified during bootstrap
  • Health check script was significantly enhanced for better diagnostics