# Hermes Deployment Audit Report

## Issues Found

An audit of the Terraform project for the Hermes Agent deployment identified six issues, two of them critical, that would prevent Hermes from running properly:

### 1. **Systemd Service Configuration Error** (CRITICAL)
**Problem:** The systemd service did not specify the path to the Compose file.
- `ExecStart=/usr/bin/docker compose up` was invoked without the `-f` flag
- The service could not find `docker-compose.yml` when started from an arbitrary directory
- Nothing guaranteed that the service would change to the correct working directory

**Impact:** The service would start and then fail immediately because it could not find the Compose file.

**Fix:** The unit now changes into the user's home directory before invoking Compose:
```ini
ExecStart=/bin/sh -c 'cd /home/${admin_user} && exec docker compose -f docker-compose.yml up'
ExecStop=/bin/sh -c 'cd /home/${admin_user} && exec docker compose -f docker-compose.yml down'
```

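An alternative to the `cd` wrapper, shown here only as a sketch and not as what the template does, is systemd's `WorkingDirectory=` directive. It sets the directory the service's processes start in, so `docker compose` can find `docker-compose.yml` without a `-f` flag. The helper name `render_hermes_unit`, the `[Unit]` and `[Install]` sections, and the dependency on `docker.service` are assumptions for illustration.

```bash
# render_hermes_unit USER
# Prints a hypothetical hermes.service unit to stdout.
# Assumption: docker-compose.yml lives in /home/USER, as in the fix above.
render_hermes_unit() {
  ru_home="/home/$1"
  cat <<EOF
[Unit]
Description=Hermes Agent Service
Requires=docker.service
After=docker.service

[Service]
# systemd changes into this directory before running ExecStart/ExecStop,
# so Compose discovers docker-compose.yml in the working directory.
WorkingDirectory=$ru_home
ExecStart=/usr/bin/docker compose up
ExecStop=/usr/bin/docker compose down

[Install]
WantedBy=multi-user.target
EOF
}
```

Calling `render_hermes_unit hermes` prints the unit text, which could be redirected to `/etc/systemd/system/hermes.service`. The trade-off against the `cd` wrapper is mostly readability: the working directory becomes an explicit setting instead of part of a shell command line.
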
### 2. **User Permissions Issue** (CRITICAL)
**Problem:** The service was configured to run as `User=${admin_user}` (non-root).
- Adding a user to the `docker` group with `usermod -aG docker` does not take effect for existing sessions
- The systemd service tried to use Docker before the hermes user had the necessary permissions
- A re-login would be required to apply the `docker` group membership

**Impact:** The service ran as the hermes user without the necessary `docker` group permissions, causing "permission denied" errors.

**Fix:** Changed the service to run as root, so access to the Docker socket does not depend on group membership:
```ini
User=root
```
And ensured proper file ownership:
```bash
chown ${admin_user}:${admin_user} /home/${admin_user}/docker-compose.yml
chmod 644 /home/${admin_user}/docker-compose.yml
```

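The difference between "the user is in the group" and "this session has the group" can be checked directly. The sketch below relies on `id -nG USER`, which reports the groups recorded in the user database, and `id -nG` with no user, which reports the groups the current process actually holds. The function names `in_group_db` and `in_group_current` are hypothetical helpers, not part of the template.

```bash
# in_group_db USER GROUP
# Succeeds if GROUP is recorded for USER in the user/group database,
# which is what a newly started login session would receive.
in_group_db() {
  for ig_name in $(id -nG "$1"); do
    [ "$ig_name" = "$2" ] && return 0
  done
  return 1
}

# in_group_current GROUP
# Succeeds if the current process already holds GROUP.
in_group_current() {
  for ig_name in $(id -nG); do
    [ "$ig_name" = "$1" ] && return 0
  done
  return 1
}
```

When run as the hermes user, `in_group_db hermes docker` succeeding while `in_group_current docker` fails would indicate that `usermod` has been applied but the current session predates it.
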
### 3. **Installation Order Issue**
**Problem:** The Docker image was pulled before `docker-compose-plugin` was installed.
- The `docker pull` command succeeded, because it needs only the base Docker CLI
- `docker compose` (the plugin) was installed only later
- If the pull failed, `docker-compose-plugin` would not have been installed yet

**Impact:** Fragile ordering during bootstrap: a failed pull could leave the host without the Compose plugin.

**Fix:** Reordered `runcmd` to install `docker-compose-plugin` immediately after Docker:
```text
1. curl docker installer
2. apt-get install docker-compose-plugin  # BEFORE pulling image
3. docker pull nousresearch/hermes-agent:latest
```

### 4. **No Docker Daemon Ready Check** (HIGH)
**Problem:** The script tried to pull images immediately after installing Docker.
- The Docker socket might not be ready yet
- Services were started before Docker was fully operational

**Impact:** Timing-dependent failures, especially on slower systems.

**Fix:** Added a readiness check with delays:
```bash
# Wait for Docker daemon to be ready
sleep 5
# If the first check fails, wait longer and try once more
docker ps > /dev/null || (sleep 10 && docker ps)
```

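Fixed sleeps can still be too short on a slow host and needlessly long on a fast one. A bounded polling loop is one way to make the check less timing-dependent. The following is a minimal sketch, not the code in the template; `wait_for` is a hypothetical helper name, and the attempt count and delay are arbitrary.

```bash
# wait_for ATTEMPTS DELAY CMD [ARGS...]
# Runs CMD until it succeeds, sleeping DELAY seconds between attempts.
# Returns 0 on the first success and 1 if every attempt fails.
wait_for() {
  wf_attempts=$1
  wf_delay=$2
  shift 2
  wf_try=1
  while [ "$wf_try" -le "$wf_attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Skip the sleep after the final failed attempt.
    if [ "$wf_try" -lt "$wf_attempts" ]; then
      sleep "$wf_delay"
    fi
    wf_try=$((wf_try + 1))
  done
  return 1
}
```

For example, `wait_for 30 2 docker ps || exit 1` would poll the daemon for roughly a minute and abort the bootstrap if it never responds, while returning as soon as Docker is ready.
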
### 5. **No Service Startup Verification** (MEDIUM)
**Problem:** The service was started with no check that it actually came up.
- If the service failed to start, the deployment would still complete successfully
- The user would not find out until they connected over SSH

**Impact:** Silent failures that only become apparent when checking the server.

**Fix:** Added verification:
```bash
# Verify service started
systemctl is-active hermes.service || systemctl status hermes.service
```

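A slightly stricter variant waits a bounded time for the unit to become active and prints recent journal entries when it does not. This is a sketch under assumptions: `verify_unit` is a hypothetical helper, the 30-second default and the 50-line log tail are arbitrary, and the `SYSTEMCTL` and `JOURNALCTL` variables exist only so the commands can be substituted in tests.

```bash
# verify_unit UNIT [TIMEOUT_SECONDS]
# Polls once per second until UNIT is active or the timeout expires.
# On failure, prints the last 50 journal lines for UNIT to stderr
# and returns 1 so the caller can stop with a non-zero exit status.
verify_unit() {
  vu_unit=$1
  vu_left=${2:-30}
  while [ "$vu_left" -gt 0 ]; do
    if "${SYSTEMCTL:-systemctl}" is-active --quiet "$vu_unit"; then
      echo "$vu_unit is active"
      return 0
    fi
    sleep 1
    vu_left=$((vu_left - 1))
  done
  echo "$vu_unit failed to become active" >&2
  "${JOURNALCTL:-journalctl}" -u "$vu_unit" -n 50 --no-pager >&2
  return 1
}
```

A bootstrap script could end with `verify_unit hermes.service 30 || exit 1`, so that a failed start is visible in the deployment logs instead of only after logging in.
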
### 6. **Poor Error Logging** (MEDIUM)
**Problem:** The systemd service logged to stdout, but nothing captured startup errors.
- No journal entries showing what went wrong
- No way to see Docker errors in the cloud-init logs

**Impact:** Difficult to diagnose why the service failed.

**Fix:** Added explicit journal logging:
```ini
StandardOutput=journal
StandardError=journal
SyslogIdentifier=hermes
```

## Changes Made

### Terraform Files Modified

1. **templates/userdata-hermes.tpl**
   - Fixed systemd service configuration
   - Reordered runcmd operations
   - Added Docker readiness checks and delays
   - Enhanced health check script
   - Added service startup verification
   - Improved completion messages

2. **docs/HERMES_DEBUGGING.md** (NEW)
   - Comprehensive troubleshooting guide
   - Common issues and solutions
   - Diagnostic commands
   - Manual start/stop procedures
   - Discord connectivity testing

3. **README.md**
   - Added reference to HERMES_DEBUGGING.md documentation

## Testing These Changes

To test the fixes, you need to redeploy:

```bash
# Option 1: Destroy and redeploy (cleanest)
terraform destroy
# Answer yes when prompted
source .env && terraform init && terraform apply

# Option 2: Update existing (if keeping infrastructure)
source .env && terraform apply -auto-approve
```

After deployment, verify Hermes is running:

```bash
# SSH into the server (username is 'hermes' or your override)
ssh hermes@<SERVER_IP>

# Run the health check
/usr/local/bin/hermes-health-check.sh

# Or manually verify
systemctl status hermes.service
docker ps
docker logs hermes
```

## Deployment Flow Now

With the fixes applied, the cloud-init deployment flow is:

1. ✓ Update system packages
2. ✓ Create hermes user
3. ✓ Write configuration files (.env, config.yaml, docker-compose.yml, SOUL.md)
4. ✓ Write health check script
5. ✓ Write systemd service unit
6. ✓ Install Docker
7. ✓ Install docker-compose-plugin
8. ✓ Wait for Docker daemon to be ready
9. ✓ Pull Hermes image
10. ✓ Set proper permissions
11. ✓ Reload systemd
12. ✓ Enable hermes.service
13. ✓ Start systemd service (which runs `docker compose up`)
14. ✓ Wait for startup
15. ✓ Verify service is active

## Expected Behavior After Fix

When you SSH into the server after deployment:

```bash
$ systemctl status hermes.service
● hermes.service - Hermes Agent Service
     Loaded: loaded (/etc/systemd/system/hermes.service; enabled; vendor preset: enabled)
     Active: active (running) since ...

$ docker ps
CONTAINER ID   IMAGE                              STATUS
abc123         nousresearch/hermes-agent:latest   Up 2 minutes

$ docker logs hermes
[INFO] Hermes Agent starting...
[INFO] Discord bot initialized
...
```

And in Discord:
- Bot shows "online" status
- Responds to mentions in configured channels
- Respects user allowlist

## Next Steps

1. **Redeploy** with the fixed template
2. **Verify** using the health checks documented in HERMES_DEBUGGING.md
3. **Test Discord** connectivity by mentioning the bot in a channel
4. **Monitor logs** using `docker logs -f hermes` if issues occur

## Additional Notes

- The audit identified these issues by analyzing the template configuration and deployment flow
- Similar fixes should be applied to any OpenClaw deployments you have
- The systemd service now has an explicit Compose file path, journal logging, and startup verification
- The health check script was significantly enhanced for better diagnostics