# Hermes Deployment Audit - Summary of Fixes

## Executive Summary

The Terraform Hermes deployment had **5 critical issues** preventing the service from running. All have been fixed in the cloud-init template.

## What Was Wrong

### Critical Issues Found:

1. ✗ **Systemd service couldn't find docker-compose.yml**
   - `ExecStart=/usr/bin/docker compose up` (missing file path)

2. ✗ **Service ran as non-root user without Docker permissions**
   - Group membership added by `usermod -aG docker` didn't take effect for the systemd service

3. ✗ **Docker image pulled before docker-compose-plugin installed**
   - The bootstrap commands ran in the wrong order

4. ✗ **No check that Docker daemon was ready**
   - Docker commands could run before the daemon was accepting connections during bootstrap

5. ✗ **No verification service actually started**
   - Deployment would complete even if Hermes failed to start

## What Was Fixed

### 1. Systemd Service Configuration
**Before:**
```ini
ExecStart=/usr/bin/docker compose up
ExecStop=/usr/bin/docker compose down
User=${admin_user}
```

**After:**
```ini
ExecStart=/bin/sh -c 'cd /home/${admin_user} && exec docker compose -f docker-compose.yml up'
ExecStop=/bin/sh -c 'cd /home/${admin_user} && exec docker compose -f docker-compose.yml down'
User=root
StandardOutput=journal
StandardError=journal
SyslogIdentifier=hermes
```

**Why:** Changing into `/home/${admin_user}` first lets the relative `-f docker-compose.yml` resolve, running as root removes the dependency on `docker` group membership, and the journal settings make the service output readable with `journalctl` under the `hermes` identifier.

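
To see how these directives fit into a complete unit, here is a minimal sketch that prints one from a shell function. It is an illustration under stated assumptions, not the contents of `templates/userdata-hermes.tpl`: the helper name `render_hermes_unit`, the `[Unit]` and `[Install]` sections, and `Restart=on-failure` are hypothetical additions, and it swaps the `sh -c 'cd … && …'` wrapper for systemd's `WorkingDirectory=` directive. Here `${admin_user}` is a shell variable, not the Terraform template variable.

```shell
# Print a hermes.service unit for the given admin user (hypothetical helper).
# Possible usage, as root:
#   render_hermes_unit hermes > /etc/systemd/system/hermes.service
#   systemctl daemon-reload
render_hermes_unit() {
    admin_user="$1"
    cat <<EOF
[Unit]
Description=Hermes agent (docker compose)
After=docker.service
Requires=docker.service

[Service]
# WorkingDirectory stands in for the "cd ... &&" wrapper, so the relative
# -f docker-compose.yml resolves against the admin user's home directory.
WorkingDirectory=/home/${admin_user}
ExecStart=/usr/bin/docker compose -f docker-compose.yml up
ExecStop=/usr/bin/docker compose -f docker-compose.yml down
User=root
Restart=on-failure
StandardOutput=journal
StandardError=journal
SyslogIdentifier=hermes

[Install]
WantedBy=multi-user.target
EOF
}
```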

---

### 2. Installation Order
**Before:**
```yaml
- curl -fsSL https://get.docker.com | sh
- docker pull nousresearch/hermes-agent:latest
- apt-get install -y docker-compose-plugin  # too late
```

**After:**
```yaml
- curl -fsSL https://get.docker.com | sh
- apt-get install -y docker-compose-plugin  # right after docker
- sleep 5
- docker ps > /dev/null || (sleep 10 && docker ps)  # verify ready
- docker pull nousresearch/hermes-agent:latest
```

**Why:** The plugin is installed before anything needs it, and the image pull only runs once the Docker daemon has answered `docker ps`.

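
The fixed `sleep 5` plus a single retry can be generalized into a bounded polling loop. The following is a minimal sketch of that idea, not the code in `templates/userdata-hermes.tpl`; the helper name `wait_until` and the attempt counts in the usage comment are illustrative assumptions.

```shell
# Retry a probe command until it succeeds or the attempt budget runs out.
# Usage: wait_until <max_attempts> <delay_seconds> <command> [args...]
# Example (assumed values):
#   wait_until 30 2 docker ps && docker pull nousresearch/hermes-agent:latest
wait_until() {
    max_attempts="$1"
    delay="$2"
    shift 2
    attempt=1
    # Probe output is discarded; only the exit status matters.
    while ! "$@" >/dev/null 2>&1; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "gave up after ${max_attempts} attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    return 0
}
```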

---

### 3. Service Startup Verification
**Before:**
```yaml
- systemctl start hermes.service
# ... done, might have failed but we don't know
```

**After:**
```yaml
- systemctl start hermes.service
- sleep 3
- systemctl is-active hermes.service || systemctl status hermes.service
```

**Why:** A failed start now shows up in the bootstrap output right away, with `systemctl status` printing the reason.

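
If a calling script should also be able to react to the failure, the same check can be wrapped so that it returns a non-zero exit code and dumps recent logs. This is a hedged sketch rather than the deployed code; the function name `verify_service` and the 50-line log window are assumptions.

```shell
# Return 0 if the unit is active; otherwise print diagnostics and return 1.
# Example: verify_service hermes.service || exit 1
verify_service() {
    unit="$1"
    if systemctl is-active --quiet "$unit"; then
        echo "OK: $unit is active"
        return 0
    fi
    echo "FAILED: $unit is not active" >&2
    # Diagnostics are best effort; their own exit codes are ignored.
    systemctl status "$unit" --no-pager >&2 || true
    journalctl -u "$unit" -n 50 --no-pager >&2 || true
    return 1
}
```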

---

### 4. Enhanced Health Check Script
**Added comprehensive diagnostics:**
- ✓ Docker daemon status
- ✓ Container exists
- ✓ Container running (with uptime)
- ✓ Port listening
- ✓ Config files exist
- ✓ Systemd service status
- ✓ Recent logs
- ✓ Discord configuration check
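
The script itself is not reproduced here. As a rough idea of how such a checklist can be structured, the sketch below runs each probe through one helper that prints a ✓ or ✗ line and counts failures. It covers only part of the list, and everything in it is an assumption for illustration: the helper names `check` and `run_hermes_checks`, the container name `hermes`, the path `/home/hermes/docker-compose.yml`, and the use of `ss` for the port probe.

```shell
failures=0

# Run a probe quietly and print one ✓/✗ line for it; count failures.
check() {
    label="$1"
    shift
    if "$@" >/dev/null 2>&1; then
        echo "✓ $label"
    else
        echo "✗ $label"
        failures=$((failures + 1))
    fi
}

# Hypothetical probe list; names, paths and the port may need adjusting.
# Example: run_hermes_checks || echo "$failures check(s) failed"
run_hermes_checks() {
    check "Docker daemon status"   docker ps
    check "Container exists"       docker container inspect hermes
    check "Container running"      sh -c '[ -n "$(docker ps -q --filter name=hermes --filter status=running)" ]'
    check "Port listening"         sh -c 'ss -ltn | grep -q ":18789 "'
    check "Config files exist"     test -f /home/hermes/docker-compose.yml
    check "Systemd service status" systemctl is-active --quiet hermes.service
    # Succeed only if every probe passed.
    [ "$failures" -eq 0 ]
}
```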

---

## New Documentation

### 1. **HERMES_DEBUGGING.md**
Complete troubleshooting guide with:
- Quick diagnostic checklist
- Common issues and their fixes
- Command reference
- Manual start/stop procedures
- Discord connectivity testing
- Log interpretation

### 2. **HERMES_AUDIT_REPORT.md**
Detailed audit findings explaining:
- What each issue was
- Why it caused failures
- How it was fixed
- Expected behavior after fixes

---

## How to Apply These Fixes

### Option 1: Fresh Deployment (Cleanest)
```bash
terraform destroy -auto-approve
source .env && terraform init && terraform apply
```

### Option 2: Update Existing Stack
Cloud-init user data normally runs only on an instance's first boot, so this option picks up the fixes only if the apply replaces the instance.
```bash
source .env && terraform apply -auto-approve
```
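
Note that `source .env` makes variables visible to child processes such as `terraform` only if the file exports them. Assuming `.env` holds plain `KEY=value` lines (its contents are not shown in this document), a minimal sketch that exports everything while sourcing could look like this; the helper name `load_env` and the `TF_VAR_example` name in the comment are hypothetical.

```shell
# Source a KEY=value file with auto-export on, so that child processes
# (for example terraform reading TF_VAR_example) inherit the values.
# Example: load_env .env && terraform init && terraform apply
load_env() {
    env_file="${1:-.env}"
    # A bare file name would be looked up via PATH by ".", so anchor it.
    case "$env_file" in
        */*) ;;
        *) env_file="./$env_file" ;;
    esac
    if [ ! -f "$env_file" ]; then
        echo "missing env file: $env_file" >&2
        return 1
    fi
    set -a    # export every variable assigned from here on
    . "$env_file"
    set +a    # back to normal assignment behaviour
}
```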

---

## Verification After Deployment

After applying these fixes and deploying:

```bash
# SSH into server
ssh hermes@<SERVER_IP>

# Run comprehensive health check
/usr/local/bin/hermes-health-check.sh

# Manually verify
systemctl status hermes.service
docker ps
docker logs hermes
```

**Expected output:**
- ✓ Hermes systemd service active
- ✓ Docker container running
- ✓ Gateway listening on port 18789
- ✓ Discord bot shows online in your server

---

## Files Changed

### Core Deployment
- `templates/userdata-hermes.tpl` - Fixed cloud-init configuration

### Documentation
- `docs/HERMES_DEBUGGING.md` - **NEW** Troubleshooting guide
- `docs/HERMES_AUDIT_REPORT.md` - **NEW** Detailed audit findings
- `README.md` - Added reference to debugging guide

---

## Why These Fixes Work

Each fix addresses a specific failure point:

| Issue | Root Cause | Fix | Result |
|-------|-----------|-----|--------|
| Compose file not found | No path specified | `cd` to the compose directory and pass `-f docker-compose.yml` | Service finds config |
| Docker permission denied | Non-root user, group not applied | Run service as root | Service can use Docker |
| Docker not ready | Immediate pull attempt | Add delays and checks | Image pulls successfully |
| Silent failures | No verification | Check service status after start | Failed starts are reported |
| Can't debug | No logging | Added journal logging | Logs readable via `journalctl` |

---

## Testing the Fixes

To verify the fixes work on your deployments:

1. **Quick test (5 min):**
   ```bash
   # Just check service is running
   systemctl status hermes.service
   docker ps | grep hermes
   ```

2. **Full health check (10 min):**
   ```bash
   /usr/local/bin/hermes-health-check.sh
   ```

3. **Discord test (Manual):**
   - Mention the bot in a configured channel
   - It should respond within a few seconds

---

## Rollback Plan

If something goes wrong:

```bash
# Revert to previous state
git checkout templates/userdata-hermes.tpl

# Then redeploy or manually stop
systemctl stop hermes.service
docker compose -f ~hermes/docker-compose.yml down
```

---

## OpenClaw Status

✓ OpenClaw service is properly configured and doesn't have these issues.

---

## Next Steps

1. **Review** the changes in `templates/userdata-hermes.tpl`
2. **Redeploy** using `terraform apply`
3. **Verify** using `systemctl status hermes.service`
4. **Test** Discord connectivity
5. **Refer** to `HERMES_DEBUGGING.md` if any issues occur

All changes are backward compatible and don't affect other components.