openboatmobile-ai/docs/HERMES_AUDIT_REPORT.md
CeeLo Greenheart a593af9b27 Initial commit - Clean public release
Sanitized for public release:
- Removed all API keys, tokens, and secrets
- Removed personal Discord IDs from hermes-openclaw.json
- Updated git URLs to be generic placeholders
- All sensitive data uses environment variable interpolation
2026-04-22 19:13:28 +00:00

# Hermes Deployment Audit Report
## Issues Found
During the audit of the Terraform project for the Hermes Agent deployment, six issues were identified, two of them critical, that could prevent Hermes from starting or make failures hard to diagnose:
### 1. **Systemd Service Configuration Error** (CRITICAL)
**Problem:** The systemd service didn't specify the docker-compose file path
- `ExecStart=/usr/bin/docker compose up` without the `-f` flag
- The service couldn't find docker-compose.yml when running from an arbitrary directory
- No guarantee the service would change to the correct working directory
**Impact:** Service would start but immediately fail or not find the compose file.
**Fix:** Updated to:
```ini
ExecStart=/bin/sh -c 'cd /home/${admin_user} && exec docker compose -f docker-compose.yml up'
ExecStop=/bin/sh -c 'cd /home/${admin_user} && exec docker compose -f docker-compose.yml down'
```
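A bootstrap script could also confirm that the compose file is where the unit expects it before the service is started. The following is a minimal sketch under that assumption; `compose_file_ready` is a hypothetical helper, not part of the audited template, and it only checks that the file exists and is non-empty.

```bash
# Hypothetical preflight helper, not part of the audited template.
# Usage: compose_file_ready <directory>
# Returns 0 if <directory>/docker-compose.yml exists and is non-empty.
compose_file_ready() {
  cf_path="$1/docker-compose.yml"
  if [ -s "$cf_path" ]; then
    echo "found $cf_path"
    return 0
  fi
  echo "missing or empty: $cf_path" >&2
  return 1
}
```

It might be called as `compose_file_ready /home/hermes || exit 1` when the admin user is the default `hermes`.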
### 2. **User Permissions Issue** (CRITICAL)
**Problem:** Service was configured to run as `User=${admin_user}` (non-root)
- Adding a user to the docker group with `usermod -aG docker` doesn't take effect for existing sessions
- The systemd service tries to use docker before the hermes user has proper permissions
- Would require a re-login to apply the docker group permissions
**Impact:** Service runs as hermes user without the necessary docker group permissions, causing "permission denied" errors.
**Fix:** Changed the service to run as root, which can access the Docker socket regardless of group membership:
```ini
User=root
```
And ensured proper file ownership:
```bash
chown ${admin_user}:${admin_user} /home/${admin_user}/docker-compose.yml
chmod 644 /home/${admin_user}/docker-compose.yml
```
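To see whether the group change has been recorded for the user, the account database can be queried directly. This is a small sketch with a hypothetical helper name (`in_group`). Note that `id -nG <user>` looks the user up in the account database, while `id -nG` with no argument reports the groups of the current process, which is why an existing session can still lack the new group.

```bash
# Hypothetical helper: succeed if <user> is listed in <group> in the account database.
# Usage: in_group <user> <group>
in_group() {
  # id -nG <user> prints the user's group names on one line, space-separated;
  # split them onto separate lines and look for an exact match
  id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -qxF -- "$2"
}
```

For example, `in_group hermes docker && echo "docker group recorded"` would confirm the `usermod -aG docker` step for the default user name.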
### 3. **Installation Order Issue**
**Problem:** The Docker image was pulled before docker-compose-plugin was installed
- `docker pull` only needs the base Docker CLI, so the pull itself succeeded
- The `docker compose` plugin was installed in a later step
- If the pull failed, docker-compose-plugin would not yet have been installed at that point
**Impact:** Fragile ordering during bootstrap: the service depends on `docker compose`, which was installed last.
**Fix:** Reordered runcmd to install docker-compose-plugin immediately after Docker:
```yaml
# Order of runcmd steps after the fix:
#   1. Run the Docker installer (fetched with curl)
#   2. apt-get install docker-compose-plugin   (BEFORE pulling the image)
#   3. docker pull nousresearch/hermes-agent:latest
```
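One way to make the ordering explicit in a bootstrap script is to run each step through a small wrapper that names the step and reports a failure. The sketch below is illustrative only; `run_step` is a hypothetical helper and is not taken from the template.

```bash
# Hypothetical helper: announce a step, run it, and report a failure by name.
# Usage: run_step <description> <command> [args...]
run_step() {
  rs_name="$1"
  shift
  echo "==> $rs_name"
  if ! "$@"; then
    echo "FAILED: $rs_name" >&2
    return 1
  fi
}

# Possible use, chained with && so a failed step stops the sequence:
#   run_step "install compose plugin" apt-get install -y docker-compose-plugin &&
#   run_step "pull Hermes image" docker pull nousresearch/hermes-agent:latest
```

Chaining with `&&` means the image pull is only attempted once the plugin install has succeeded, and a failure is reported under the name of the step that caused it.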
### 4. **No Docker Daemon Ready Check** (HIGH)
**Problem:** Script tried to pull images immediately after Docker installation
- Docker socket might not be ready
- Starting services before Docker is fully operational
**Impact:** Timing-dependent failures, especially on slower systems.
**Fix:** Added health checks and delays:
```bash
# Wait for Docker daemon to be ready
sleep 5
docker ps > /dev/null || (sleep 10 && docker ps)
```
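A fixed sleep can be too short on a slow host and wasteful on a fast one. As an alternative, a bounded retry loop polls until the daemon answers. The following is a sketch under that assumption, not the check used in the template; `wait_for` is a hypothetical helper, and the attempt count and delay are arbitrary values.

```bash
# Hypothetical helper: retry a probe command until it succeeds or attempts run out.
# Usage: wait_for <max_attempts> <delay_seconds> <command> [args...]
wait_for() {
  wf_max="$1"
  wf_delay="$2"
  shift 2
  wf_try=1
  while [ "$wf_try" -le "$wf_max" ]; do
    # The probe's output is discarded; only its exit status matters
    if "$@" > /dev/null 2>&1; then
      return 0
    fi
    wf_try=$((wf_try + 1))
    # Skip the delay once the last attempt has been made
    if [ "$wf_try" -le "$wf_max" ]; then
      sleep "$wf_delay"
    fi
  done
  return 1
}

# Possible use in the bootstrap script (12 attempts, 5 seconds apart):
#   wait_for 12 5 docker info || { echo "Docker daemon not ready" >&2; exit 1; }
```

The sketch deliberately avoids `${...}` expansions, since the template already uses that syntax for Terraform interpolation (as in `${admin_user}`).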
### 5. **No Service Startup Verification** (MEDIUM)
**Problem:** Service was started with no check that it actually came up
- If the service failed to start, deployment would complete successfully anyway
- User wouldn't know until they SSH in
**Impact:** Silent failures that only become apparent when checking the server.
**Fix:** Added verification:
```bash
# Verify service started
systemctl is-active hermes.service || systemctl status hermes.service
```
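The one-line check prints the status but does not wait for a slow start. A stricter variant could poll the unit and, on timeout, dump recent logs and return a non-zero status. This is a minimal sketch assuming that behavior is wanted; `verify_service` is a hypothetical helper name.

```bash
# Hypothetical helper: poll a unit until it is active; on timeout, print
# its status and recent journal entries to stderr and fail.
# Usage: verify_service <unit> <max_attempts> <delay_seconds>
verify_service() {
  vs_unit="$1"
  vs_max="$2"
  vs_delay="$3"
  vs_try=1
  while [ "$vs_try" -le "$vs_max" ]; do
    if systemctl is-active --quiet "$vs_unit"; then
      echo "$vs_unit is active"
      return 0
    fi
    vs_try=$((vs_try + 1))
    sleep "$vs_delay"
  done
  echo "$vs_unit failed to become active" >&2
  # Best-effort diagnostics; their own failures must not mask the result
  systemctl status "$vs_unit" --no-pager >&2 || true
  journalctl -u "$vs_unit" --no-pager -n 50 >&2 || true
  return 1
}
```

A call such as `verify_service hermes.service 6 5 || exit 1` would give the service up to roughly 30 seconds to come up.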
### 6. **Poor Error Logging** (MEDIUM)
**Problem:** systemd service logged to stdout but nothing captured the startup errors
- No journal entries with what went wrong
- No way to see Docker errors in the cloud-init logs
**Impact:** Difficult to diagnose why the service failed.
**Fix:** Added proper journal logging:
```ini
StandardOutput=journal
StandardError=journal
SyslogIdentifier=hermes
```
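With `SyslogIdentifier=hermes` set, journal entries from the service can be filtered by that identifier. The wrapper below is a small illustrative sketch; `hermes_logs` is a hypothetical name, not a script shipped by the template.

```bash
# Hypothetical wrapper: show the last <n> journal entries tagged "hermes".
# Usage: hermes_logs <n>
hermes_logs() {
  # -t filters on the syslog identifier, which the unit sets to "hermes"
  journalctl -t hermes --no-pager -n "$1"
}
```

For example, `hermes_logs 100` prints the most recent 100 entries without opening a pager, which is convenient over SSH or in scripts.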
## Changes Made
### Terraform Files Modified
1. **templates/userdata-hermes.tpl**
   - Fixed systemd service configuration
   - Reordered runcmd operations
   - Added Docker readiness checks and delays
   - Enhanced health check script
   - Added service startup verification
   - Improved completion messages
2. **docs/HERMES_DEBUGGING.md** (NEW)
   - Comprehensive troubleshooting guide
   - Common issues and solutions
   - Diagnostic commands
   - Manual start/stop procedures
   - Discord connectivity testing
3. **README.md**
   - Added reference to HERMES_DEBUGGING.md documentation
## Testing These Changes
To test the fixes, you need to redeploy:
```bash
# Option 1: Destroy and redeploy (cleanest)
terraform destroy
# Answer yes when prompted
source .env && terraform init && terraform apply
# Option 2: Update existing (if keeping infrastructure)
source .env && terraform apply -auto-approve
```
After deployment, verify Hermes is running:
```bash
# SSH into the server (username is 'hermes' or your override)
ssh hermes@<SERVER_IP>
# Run the health check
/usr/local/bin/hermes-health-check.sh
# Or manually verify
systemctl status hermes.service
docker ps
docker logs hermes
```
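The manual `docker ps` check can also be scripted. The sketch below assumes the container is named `hermes`, as in the `docker logs hermes` command above; `container_running` is a hypothetical helper.

```bash
# Hypothetical helper: succeed if a container with exactly this name is running.
# Usage: container_running <name>
container_running() {
  # The name filter matches substrings, so compare whole lines afterwards
  docker ps --filter "name=$1" --filter "status=running" --format '{{.Names}}' \
    | grep -qxF -- "$1"
}
```

For example, `container_running hermes || docker logs --tail 50 hermes` prints recent container output only when the container is not up.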
## Deployment Flow After the Fixes
With the fixes applied, cloud-init now runs these steps in order:
1. ✓ Update system packages
2. ✓ Create hermes user
3. ✓ Write configuration files (.env, config.yaml, docker-compose.yml, SOUL.md)
4. ✓ Write health check script
5. ✓ Write systemd service unit
6. ✓ Install Docker
7. ✓ Install docker-compose-plugin
8. ✓ Wait for Docker daemon to be ready
9. ✓ Pull Hermes image
10. ✓ Set proper permissions
11. ✓ Reload systemd
12. ✓ Enable hermes.service
13. ✓ Start systemd service (which runs `docker compose up`)
14. ✓ Wait for startup
15. ✓ Verify service is active
## Expected Behavior After Fix
When you SSH into the server after deployment:
```bash
$ systemctl status hermes.service
● hermes.service - Hermes Agent Service
     Loaded: loaded (/etc/systemd/system/hermes.service; enabled; vendor preset: enabled)
     Active: active (running) since ...

$ docker ps
CONTAINER ID   IMAGE                              STATUS
abc123         nousresearch/hermes-agent:latest   Up 2 minutes

$ docker logs hermes
[INFO] Hermes Agent starting...
[INFO] Discord bot initialized
...
```
And in Discord:
- Bot shows "online" status
- Responds to mentions in configured channels
- Respects user allowlist
## Next Steps
1. **Redeploy** with the fixed template
2. **Verify** using the health checks documented in HERMES_DEBUGGING.md
3. **Test Discord** connectivity by mentioning the bot in a channel
4. **Monitor logs** using `docker logs -f hermes` if issues occur
## Additional Notes
- The audit identified these issues by analyzing the template configuration and deployment flow
- Similar fixes should be applied if you have OpenClaw deployments
- The systemd service now logs to the journal, and the deployment verifies that it started
- Health check script was significantly enhanced for better diagnostics