Aligns runtime behaviour with SECURITY.md 2.6: externally reachable
messaging adapters must fail closed unless access is explicitly
configured. Closes the confirmed multiplex authorization bypass a
secondary profile's open dm/group policy no longer inherits the default
profile's allowlist trust.
- Own-policy adapters (WhatsApp, WeCom, Weixin, QQBot, Yuanbao) default
dm_policy/group_policy to pairing/allowlist instead of open; open now
requires an explicit GATEWAY_ALLOW_ALL_USERS or per-platform allow-all.
- Startup guard (_own_policy_open_startup_violation) refuses to boot when
an enabled adapter is open without the allow-all opt-in; the guard now
runs for every secondary profile in multiplex mode too.
- Profile-aware own-policy authorization: _authorization_adapter /
_adapter_for_source resolve the live adapter via SessionSource.profile,
so _is_user_authorized and the ingress/pairing/busy/queue paths read the
originating profile's adapter policy, not the default profile's.
- Fail-closed intake for Email, Feishu P2P, and Discord (blank-principal
denial, empty-allowlist deny, missing-interaction.user deny).
Salvaged from #44073 (external-surface hardening), split into a focused
gateway-authz PR per maintainer request. Follow-up fix by Hermes Agent:
the Discord slash-auth channel bypass now matches DISCORD_ALLOWED_CHANNELS
by the same name-inclusive keys (id + name + #name + parent) the on_message
scope gate uses, so a name-form channel allowlist authorizes slash
interactions consistently (was id-only, breaking #name matching).
Co-authored-by: Hermes Agent <agent@nousresearch.com>
Discord's _fetch_channel_context backfills recent channel/thread activity
(from any member who can post there, not just the allowlisted user) into
the agent's context with no sender-trust distinction. Slack's equivalent
_fetch_thread_context was fixed to prefix non-allowlisted senders with
[unverified] and add LLM guidance not to act on their content, mitigating
indirect prompt injection from third parties in shared channels/threads.
Port the same mechanism to Discord using the already-wired
_is_sender_authorized/set_authorization_check plumbing.
A transient Bot API network error during gateway bootstrap (deleteWebhook
or the initial start_polling) currently raises out of connect() and marks
the Telegram adapter fatal, restart-looping the whole gateway even though
the right behavior is to degrade the Telegram channel and let the existing
reconnect ladder recover in the background.
- _delete_webhook_best_effort(): swallow only transient network errors and
continue to polling; non-network errors (e.g. auth failures) still raise.
- _start_polling_resilient(): on a transient conflict/network error at
bootstrap, schedule background recovery and return degraded instead of
raising; non-transient errors still propagate.
- Track the polling error-callback recovery tasks in _background_tasks so
they can't be garbage-collected mid-flight.
- Add a second Telegram Bot API seed fallback IP (149.154.166.110).
Reconnect keeps its existing 10-retry -> supervisor-restart semantics; this
change only fixes the bootstrap raise, it does not alter the retry ladder.
disconnect() reads self._post_connect_task, but several tests build a bare
TelegramAdapter via object.__new__() without calling __init__ (which sets the
attr). Use getattr(..., None) so disconnect() works on those instances too
(pitfall #17).
Command-menu registration (set_my_commands), the status-indicator, and
DM-topic setup make Bot API calls that can stall for certain bot tokens.
They ran inside connect() before/after _mark_connected() but still within
the coroutine the gateway wraps in a connect timeout, so one slow call blew
the whole connect and the adapter never came up — even though polling/webhook
was already live (getMe works via curl). Fixes#46298.
- mark connected as soon as polling/webhook startup succeeds
- move command-menu, status-indicator, and DM-topic setup into a cancellable
background housekeeping task (_run_post_connect_housekeeping)
- cancel that task during disconnect so it can't fire into a torn-down client
- harden scope-name lookup with getattr fallback
Salvaged onto the relocated plugin adapter (plugins/platforms/telegram/
adapter.py) since the original PR #46404 targeted the pre-migration
gateway/platforms/telegram.py path.
Co-authored-by: Hermes Agent <teknium@nousresearch.com>
Add a per-platform `cron_continuable_surface` extra key
(`thread` default | `in_channel`) so a continuable cron job can deliver
FLAT into a Slack channel — no dedicated thread — and still be
replied-to. In `in_channel` mode the scheduler skips the thread-open
branch (leaves `thread_id=None`); the shipped origin-mirror then seeds
the `(slack, chat_id, None)` shared-channel session — the same bucket
`reply_in_thread: false` routes inbound channel replies to — so a plain
channel reply continues the job in context.
Design: specs/cron-inchannel-continuable (D1–D7, F5). Model B
(shared-channel session), NOT anchoring to the delivery `ts` — on Slack
replying to a specific message IS threading, so a `ts` anchor would only
relocate the thread, never deliver true threadless continuable.
- gateway/platforms/base.py: `supports_inchannel_continuable` capability
flag (default False → unsupported platforms fail SAFE to `thread`).
- plugins/platforms/slack/adapter.py: flag=True; `_cron_continuable_surface()`
resolver (coerces to the two-value enum); `_warn_if_inchannel_without_flat_reply`
connect-time warning (D5: warn, not hard-require — the misconfig fails safe).
- gateway/config.py: shared-key bridge line (top-level OR nested config).
- cron/scheduler.py: read the key generically from platform config, gate
the `in_channel` branch on the adapter capability flag, skip thread-open.
No new seed function (reuses the existing mirror — G6).
Pairing (docs): `in_channel` + `reply_in_thread: false` +
`require_mention: false` (or a free-response channel). Missing
`reply_in_thread: false` fails safe to a threaded continuation.
Gateway-side config flag — `/restart` to apply; NO Slack app reinstall.
Tests (from inside the worktree, PYTHONPATH=$PWD):
- +6 cron scheduler tests (in_channel skips thread-open; seeds flat
channel session with thread_id=None; thread-mode regression;
fail-safe on unsupported platform; value coercion). Prove-fail:
removing the `and not in_channel_surface` guard turns the two
load-bearing tests RED; restore → GREEN.
- +10 slack resolver/capability/warning tests; +2 config-bridge tests.
- tests/manual/cron_inchannel_e2e.py: offline E2E driving BOTH real
legs (delivery seed + inbound reply keying) → both converge on
(slack, C, None).
- No regressions: test_slack.py 216 passed alone; broader sweep green
(4 pre-existing cross-file-ordering failures reproduce identically on
pristine origin/main).
Docs: cron.md + slack.md + zh-Hans mirrors of both.
Add an interactive Raft setup flow for hermes gateway setup. The wizard follows the existing platform adapter setup pattern, persists RAFT_PROFILE to the Hermes env file, preserves an existing profile when the user declines reconfiguration, and registers the flow via setup_fn.
Add focused Raft adapter coverage for saving RAFT_PROFILE, keeping an existing profile, and registering setup_fn.
Signed-off-by: skyzh <skyzh@mail.build>
Signed-off-by: HaoHao <HaoHao@mail.build>
Matrix outbound image downloads validated only the final URL after
following redirects, so a public URL that 302-redirects to loopback /
private-network / cloud-metadata endpoints had already connected to the
unsafe hop before the check ran.
Re-validate every redirect hop before following it:
- aiohttp path resolves redirects manually with allow_redirects=False,
validating each Location via is_safe_url (aiohttp can't use the httpx
response event hook).
- httpx fallback installs the shared _ssrf_redirect_guard event hook.
Regression tests cover per-hop blocking of an unsafe redirect, following
a safe redirect chain, and httpx guard wiring.
An invite to a room with no remaining members surfaces as "no servers
in the room have been provided" or "room not found" on join. The pending
invite was never cleared, so every gateway startup re-attempted the join
and re-emitted the warning indefinitely.
Detect that specific failure mode by narrow error-message match and call
leave_room to decline the invite; transient/network errors leave the
invite untouched for the next sync. Adds 5 tests.
Reimplements the matrix portion of #33953 onto the current plugin adapter
(gateway/platforms/matrix.py was relocated to
plugins/platforms/matrix/adapter.py since the PR was opened). The two
gateway/status.py fixes from that PR (wrapper-subcommand rejection,
psutil start-time fallback) already landed on main independently.
Reported by @Bougey; original patch authored by @KiraKatana.
The network-error reconnect ladder (#55992) captured a stable self._app
local across its awaits and failed fast when the adapter was torn down
mid-sleep. The 409-conflict retry path had the identical unguarded
self._app.updater.start_polling() deref — a concurrent disconnect()
during its RETRY_DELAY sleep would raise the same 'NoneType' object has
no attribute 'updater' and, on a non-final retry, land in limbo. Apply
the same stable-local + fail-fast pattern so the existing except block
reschedules or escalates to fatal.
_handle_polling_network_error's chained retry never updated
self._polling_error_task, so the reentrancy guard shared with the
heartbeat loop and the pending-updates probe went stale mid-recovery,
letting more than one recovery attempt run concurrently against the
same adapter. Combined with a TOCTOU window in
_handle_adapter_fatal_error (the adapter was only removed from
self.adapters in a finally block after awaiting disconnect()), two
concurrent fatal notifications for the same adapter could both pass
the "still installed" check and call disconnect() twice, which is
where the reported "'NoneType' object has no attribute 'updater'"
originates once self._app is cleared by the first call.
- Reassign the chained retry task to self._polling_error_task so the
guard reflects an in-flight recovery.
- Capture self._app in a local variable across the stop/start_polling
sequence instead of re-reading self._app between awaits.
- Claim (pop) the adapter from self.adapters before awaiting
disconnect() in _handle_adapter_fatal_error, not after, closing the
TOCTOU window for a concurrent notification on the same adapter.
The forum-topic skill-binding lookup assumed config.extra['group_topics']
was always a list of {chat_id, topics} entries. When an operator writes the
natural mapping shape ({"-100...": [...]}), iterating yields string keys and
chat_entry.get(...) raises AttributeError, breaking dispatch for that group.
Normalize both shapes to a common iterator and guard non-dict/non-list
entries so malformed config falls through cleanly instead of crashing.
Inside httpx AsyncClient response event hooks, response.next_request is
often None even for a genuine redirect, so guards keyed on
`if response.is_redirect and response.next_request` silently never fire.
A public URL that 302s to http://169.254.169.254/ was followed anyway,
defeating the pre-flight is_safe_url() check.
Resolve the redirect target from the Location header (via urljoin, so
relative Locations work too), falling back to next_request only when no
Location is present. Extracted as tools.url_safety.redirect_target_from_response
and wired into every SSRF redirect guard:
- gateway/platforms/base.py (shared image + audio download for all platforms)
- tools/vision_tools.py (two download hooks)
- plugins/platforms/slack/adapter.py
Original fix by @zapabob (PR #35940), which targeted the since-refactored
gateway/platforms/slack.py; reconstructed onto the current shared sites and
widened to the whole bug class.
When discord.auto_thread is enabled and a top-level server-channel message
should be routed to a new thread, a transient thread-create failure (e.g.
Cannot connect to host discord.com:443) returned None and _handle_message
fell through to an inline parent-channel reply — dumping a new task into a
shared channel and breaking thread-first workflows.
- _auto_create_thread retries the primary + seed-message paths once after a
750ms backoff for transient connect errors.
- _handle_message treats None as a hard failure: posts a short visible notice
in the parent channel and returns without invoking the agent. The notify
send is wrapped so a secondary connect error can't raise.
Fixes#20243
Group gating (_should_process_message) read the raw message_thread_id,
while event routing (_build_message_event) normalized it. A plain
non-forum group reply's message_thread_id is a reply-UI anchor, not a
topic, so an anchor id matching an ignored_threads entry wrongly
dropped the message, and the anchor was treated as a routable topic
under allowed_topics.
Extract _effective_message_thread_id and route both gating and
event-building through it, so gating and session routing agree on one
normalized value: real topic/forum messages keep their thread id, reply
anchors are dropped, and forum General-topic messages normalize to the
General-topic id.
The renderer now emits native Block Kit table blocks; the module and
_rich_blocks_enabled docstrings still described the earlier monospace-only
approach.
Replace the interim monospace table fallback with Slack's native `table`
block (rows of rich_text cells). Addresses the core ask in #18918.
- _table_block(): builds type:"table" with rich_text cells, so inline
formatting (bold, links, code) renders inside cells.
- Column alignment parsed from the markdown separator row (:---, :-:, --:)
into column_settings (left = default/null-skip, center/right emitted).
- Escaped pipes (\\|) are not treated as column separators.
- Respects Slack's table limits (100 rows / 20 cols / 10k aggregate chars);
oversized or unparseable tables gracefully fall back to aligned monospace
(rich_text_preformatted), so a big table never breaks the message.
Docs (EN + zh-Hans) updated to describe native tables + the fallback.
Tests: native table shape, alignment->column_settings, inline-formatted
cells, oversized/too-wide monospace fallback, escaped-pipe cell. Prove-
failed against a stubbed _table_block (native-table tests fail, fallback
tests stay green). All existing Slack tests still pass.
Add platforms.slack.extra.rich_blocks (default off). When enabled, the
final agent message is sent as Slack Block Kit blocks — section headers,
dividers, and true nested lists via rich_text — instead of flat mrkdwn.
- New plugins/platforms/slack/block_kit.py: pure markdown->blocks renderer
(headers, dividers, nested ordered/bullet lists, blockquotes, fenced code;
pipe-tables as aligned monospace since Block Kit has no robust table block).
Enforces Slack's 50-block / 3000-char section limits and returns None to
fall back to plain text on empty/oversized/unexpected input. Never raises.
- adapter.send(): render blocks on the single-chunk primary message; a
text= fallback is ALWAYS sent alongside (notifications/accessibility).
- adapter.edit_message(): blocks only on finalize=True, so intermediate
streaming edits stay plain mrkdwn (no per-flush block re-derivation).
- Docs (EN + zh-Hans) + config example. Send-side only: no app reinstall.
Tests: pure-renderer unit suite + adapter integration suite (blocks present
when on, plain text when off, text fallback always set, finalize gating,
multi-chunk fallback). Prove-failed against a stubbed renderer.
Mitigates indirect prompt injection (CWE-863) in Slack thread context.
When the bot is mentioned mid-thread for the first time, _fetch_thread_context
pulls the full thread via conversations.replies and prepends every reply to
the LLM prompt. Replies from senders not on the allowlist were rendered
identically to authorised senders, letting a third party in a shared channel
inject instructions the model might act on when answering the next authorised
message.
- BasePlatformAdapter.set_authorization_check / _is_sender_authorized, registered
by GatewayRunner._make_adapter_auth_check() with a closure over the existing
_is_user_authorized chain (platform/global/group allowlists, allow-all flags,
pairing store all stay the single source of truth — no env-var re-parsing).
- Tags non-bot thread messages whose sender fails the auth check with an
[unverified] prefix; strengthens the header with soft guidance only when at
least one unverified message is present, so setups without an allowlist see
no behaviour change.
- Wired into all three adapter-init sites in run.py (start, reconnect watcher,
restart) so the reconnect path is covered too.
Softened wording: adapted from the original [untrusted] tag to [unverified]
and non-accusatory header framing — the label reflects allowlist status, not
a judgment about the person. Adapter relocated to plugins/platforms/slack/
since the PR was authored.
Salvaged from #17059.
Buffered text/photo/media-group flushes and the polling-error recovery
task sit behind an asyncio.sleep(). On disconnect they kept running and
dispatched handle_message() into a torn-down session, producing stale or
duplicate deliveries. disconnect() only cancelled media-group and photo
batch tasks — text batches and the polling-error task leaked.
Set a _drop_delayed_deliveries flag from _mark_disconnected/_set_fatal_error
(cleared by _mark_connected) and check it in all enqueue+flush paths so a
flush that wins the race against teardown drops instead of dispatching.
_cancel_pending_delivery_tasks() now cancels+clears all four task maps,
skipping the current task. Media-group flush finally-block guarded so a
cancelled stale flush cannot erase a replacement task handle.
Two independent fixes salvaged from #12811 (closing it; one of its three
bundled fixes — Discord free_response — is already on main).
Anthropic max_tokens (#12790): the chat-completions max_tokens fallback only
fired for OpenRouter/Nous URLs, so any other proxy serving a Claude model
(AWS Bedrock, NVIDIA, LiteLLM, vLLM, corporate gateways) shipped requests
with no max_tokens and inherited the proxy's low default (Bedrock: 4096),
exhausting on thinking + large tool calls. Changed the gate in
chat_completion_helpers.build_api_kwargs from URL-gated to model-gated:
fires whenever the model matches an _ANTHROPIC_OUTPUT_LIMITS key. This also
fixes a latent miss — the old 'claude' substring gate skipped MiniMax and
Qwen3 even on OpenRouter. Remains a last-resort fallback (build_kwargs only
applies it after ephemeral/user/profile max_tokens), so it never overrides
an explicit value, and only touches the chat-completions transport (native
Anthropic Messages API is a separate path).
Feishu channel_prompt (#12805): the Feishu adapter never resolved
channel_prompts config, unlike Discord/Slack, so per-channel role prompts
were silently ignored. Added _resolve_channel_prompt() (delegating to the
shared gateway.platforms.base.resolve_channel_prompt) and wired it into all
three MessageEvent construction sites — inbound message, reaction routing,
and card-action routing.
Tests: tests/gateway/test_feishu_channel_prompts.py (6 cases) covering exact
match, parent-thread fallback, no-match, missing-config safety, and event
propagation.
Some legitimate @bot pings were dropped because the mention gates relied on
message.mentions alone, which does not always populate raw <@ID> / <@!ID>
forms (mobile, edited, relayed messages). A bare @bot with no other text
could also spawn a fake empty-text turn.
- add _self_is_explicitly_mentioned() / _raw_mentioned_user_ids() helpers that
treat the bot as mentioned via resolved mentions OR raw content forms
- use them at the allow_bots=mentions gate, multi-agent bot filtering, the
mention-strip/mention_prefix step, and the require_mention gate
- drop bare mention-only pings (no text, no media, no injection, no backfill
context) instead of injecting a placeholder empty turn
Co-authored-by: Teknium <teknium1@gmail.com>
The polling heartbeat's pending-update probe treated a stopped updater
(running=False) as "someone else's job" and silently reset its counter,
so a long-poll task that disappears with no reconnect in flight was never
recovered. get_me() on the general request path stays healthy, so neither
PTB's error_callback nor the connectivity probe ever fires — the gateway
keeps running but stops receiving messages indefinitely (#55769).
Detect the stopped-updater case directly in _probe_pending_updates and feed
it into the existing _handle_polling_network_error ladder, debounced over two
consecutive probes so a just-starting updater or the brief stop()->start_polling()
window of an in-flight reconnect never trips it.
The inbound-media validator _is_allowed_bridge_path() checked against
IMAGE_CACHE_DIR / AUDIO_CACHE_DIR / VIDEO_CACHE_DIR / DOCUMENT_CACHE_DIR
value-imported at module load. After the base.py cache-dir getters became
per-call resolvers, the bridge writes media into the active profile's cache
while the validator still matched the frozen launch-profile constants — so
media was rejected under a profile override (multi-profile gateway).
Resolve the cache roots per-call via the get_*_cache_dir() getters and drop
the now-unused frozen value-imports. Caught by automated review on #55867.
DiscordAdapter.edit_message clipped any formatted payload over the 2,000-char
cap to [:1997]+"..." and returned success=True, so the stream consumer
believed the full reply landed and stopped — the user lost everything past the
boundary and perceived the agent as quitting mid-task.
edit_message is now overflow-aware, mirroring Telegram's proven contract:
- finalize=True: split-and-deliver via _edit_overflow_split — edit chunk 1 in
place, send chunks 2..N as reply-threaded continuations, return the last
visible id in message_id plus continuation_message_ids so the stream
consumer keeps editing the most recent chunk and can clean them all up.
- finalize=False (mid-stream): truncate a one-message preview in place, never
split. A mid-stream split moves the edit target to a continuation and the
next accumulated-token tick re-splits, looping forever (the Telegram #48648
lesson the original port predated).
- Reactive 50035 '2000 or fewer in length' on edit runs the same branch logic.
- Partial continuation failure still reports success with a partial_overflow
raw_response so the consumer retries the tail instead of marking a clipped
reply complete.
Co-authored-by: xxxigm <tuancanhnguyen706@gmail.com>
Co-authored-by: AhmetArif0 <147827411+AhmetArif0@users.noreply.github.com>
When WHATSAPP_FORWARD_OWNER_MESSAGES is enabled and the bridge marks an
inbound message with fromOwner=true, also prefix MessageEvent.text with
"[owner reply] " at construction time. This makes the disambiguation
survive any downstream plugin failure (e.g. handover-rule errors that
bypass silent_ingest), so transcripts never misattribute owner-typed
text to the customer.
Idempotent: re-applies are guarded so a future producer that pre-tags
text won't be double-prefixed.
In `WHATSAPP_MODE=bot` the bridge currently drops every fromMe inbound
message — they are all assumed to be echoes of our own /send calls.
That makes it impossible for plugins / agents to detect when a human
owner has typed directly into a customer chat from the same WhatsApp
Business account (e.g. via a linked phone or WhatsApp Web).
This adds an opt-in `WHATSAPP_FORWARD_OWNER_MESSAGES` env var. When
true, the bridge classifies fromMe inbound by looking up `key.id` in a
bounded LRU of recently-sent message IDs (the existing 50-entry echo
suppressor, bumped to 512 and extracted to a testable
`outbound_ids.js` helper). Hits in the LRU are still dropped (echoes);
misses are forwarded to the Python adapter with `fromOwner: true`.
The Python adapter lifts that flag onto
`MessageEvent.metadata["whatsapp_from_owner"]`. `metadata` is a new
free-form dict on the event so future per-platform signals don't each
need their own field. Default behaviour is unchanged: with the env
flag unset, bot mode still drops every fromMe message exactly as
before.
Use cases for downstream consumers:
- Implicit handover activation when the owner replies manually
- Sliding TTL on owner activity (keep an active session alive while
the owner is engaged)
- Audit trails of owner interventions
- Analytics on human-vs-bot reply ratios
Heuristic limitation (documented in code): the LRU is in-memory. After
a bridge restart, in-flight delivery receipts of pre-restart sends will
briefly look like owner-typed for a few seconds until the set is
repopulated. Persisting isn't worth the disk churn — downstream
consumers should treat the flag as best-effort.
Tests:
- tests/gateway/test_whatsapp_from_owner.py (new): adapter sets the
metadata flag iff the bridge payload has `fromOwner: true`; absent
otherwise.
- scripts/whatsapp-bridge/outbound_ids.test.mjs (new): LRU bounds,
eviction order, falsy-id handling.
Backwards compatibility: with the env flag unset, every code path is
identical to before. No existing deployment is affected.
The Matrix adapter's _upload_file fell back to sending
"(file not found: {file_path})" directly into the room — the same
host-path leak class fixed for the base adapter and Slack in the
previous commit. Replace it with a friendly notice, log the path at
WARN for operators, and preserve any caller-supplied caption.
The base BasePlatformAdapter implementations of send_voice, send_video,
send_document, and send_image_file forwarded their *_path argument
verbatim into the chat text (e.g. "🎬 Video: /home/.../hermes/cache/...").
Telegram, Discord, and Slack adapters all fall back to those base methods
when their native send raises — so a rejected video on Telegram surfaced
the host filesystem layout to the user instead of a useful message.
Replace the path-echo with a friendly notice, log the path for operator
diagnostics, and keep the user-supplied caption intact. The Slack adapter
had three identical sites that fell through to the same path-echo on its
own native upload failures; fix those too. send_document still surfaces
the caller-provided file_name (or the basename derived from it) since
that is the user-facing filename, not a host path.
Add regression tests asserting the *_path argument never appears in the
fallback content while caption text and explicit file_name still do.
Follow-up to the salvaged #8008 fix:
- Sibling-site fix: _evaluate_slash_authorization gated DISCORD_ALLOWED_CHANNELS /
DISCORD_IGNORED_CHANNELS on numeric IDs only, so name/#name config that now works
for on_message still silently failed for slash-command interactions. Refactor the
channel-key helper to _discord_channel_keys_from_channel(channel, parent) and reuse
it at the interaction gate. Fail-closed on missing channel id is preserved.
- The contributor's hardcoded 8s flush deadline could be hard-cancelled mid-flush:
_teardown_adapter already wraps cancel_background_tasks() in the per-adapter
disconnect budget (HERMES_GATEWAY_ADAPTER_DISCONNECT_TIMEOUT, default 5s). The flush
deadline now derives from that budget with headroom so it always completes inside it.
- AUTHOR_MAP: map cypher@augmentl.com -> Nickperillo for CI.
- Tests: slash-auth name/#name allow + name ignore matching.
Two related fixes to the Discord gateway adapter:
1. Channel name matching (free-response, allowed, ignored, no-thread channels)
Previously these config values only matched against numeric channel IDs.
If a user configured free_response_channels: cypher (by name), the adapter
would silently ignore it because it only intersected against channel_ids.
Now the adapter builds a channel_keys set that includes the channel ID,
channel name, and #channel-name form, and checks all three for each gate.
2. Flush pending text-batch tasks before shutdown
The Discord adapter uses _pending_text_batch_tasks (its own dict) for
merging rapid successive message chunks. These tasks were NOT added to
self._background_tasks (the base class list), so the base
cancel_background_tasks() never awaited them on restart/shutdown.
This caused a race: in-flight response deliveries were cancelled before
Discord had a chance to send them, resulting in silent dropped messages
visible to users as tool-log-only replies with no text body.
Fix: override cancel_background_tasks() in DiscordAdapter to await all
pending text-batch tasks (8s deadline) before delegating to the base class.
A Slack user/legacy token (xoxp-...) makes auth.test resolve to the
installing human's member ID with no bot_id, so the adapter binds its
identity (_bot_user_id / _team_bot_user_ids) to that human. Every
"is this the bot?" check then misfires: that person's <@...> mentions
wake the bot and are stripped as the bot's own mention, so the agent is
genuinely told it was @mentioned and replies to messages merely
addressed to that human (symptom: bot responds to "@trevor ..." and
insists it was explicitly mentioned).
There is no runtime API error to catch — a user token still
sends/receives — so the only detectable moment is connect time. Add a
warning-only nudge (_warn_if_not_bot_token) alongside the existing
group-DM scope nudge: when auth.test resolves a user_id but no bot_id,
log that the token is a user token and to use the xoxb-... Bot User
OAuth Token. Warning-only: does not block a working-but-misconfigured
install. Fires once per workspace per process.
Salvages the two still-valid hardenings from #5381 onto the relocated
plugin adapters (the discord/feishu/whatsapp adapters moved to
plugins/platforms/ since the PR was opened, and 4 of its 6 hunks are
already on main or superseded).
- feishu: rate limiter now denies untracked keys when the tracking table
is at capacity after pruning stale entries (was: allow through without
tracking). At-capacity-with-all-fresh-entries only happens under abuse,
so allowing untracked requests let an attacker who flooded the table
bypass the limiter entirely. Already-tracked keys and post-prune room
are unaffected.
- whatsapp: absolute file paths handed back by the Baileys bridge are now
validated to resolve inside a known media cache dir before being
attached. A compromised/buggy bridge could otherwise return an
arbitrary path (e.g. /etc/passwd) that would be sent verbatim to the
model. Guard resolves symlinks and accepts both the canonical
cache/<kind> and legacy <kind>_cache layouts.
Follow-up to the group-DM manifest fix. The manifest change only helps
NEW installs; existing apps keep their old (mpim-less) scopes until the
admin reinstalls. Since a missing message.mpim event delivers nothing
(no runtime API error to catch), detect stale installs at connect time
from the auth.test x-oauth-scopes header and log an actionable reinstall
nudge when im:history is granted but mpim:history is not. Also promote
message.mpim from Recommended to Required in the docs event tables so the
default setup path can't drop it.
The WeCom callback endpoint (internet-facing, 0.0.0.0) parsed untrusted
request bodies before signature verification. defusedxml already guards
the entity-expansion class on main, but there was no cap on raw body
size, so an unauthenticated POST could still force unbounded read work
pre-auth.
Set client_max_size=64KB on the aiohttp app (413 at the framework layer)
plus an explicit length guard in _handle_callback as defense in depth.
WeCom callbacks are small encrypted XML envelopes — media is delivered
out-of-band via MediaId, never inline — so 64KB is ample for legitimate
traffic. Adds tests for oversized (413) and normal-sized (not 413) bodies.
Salvaged from #10192 by @memosr (body-size limit half; defusedxml half
already superseded on main).
Two platform-security hardenings:
- Matrix: _on_invite now checks the inviter against the existing
allow-list (_allowed_user_ids / GATEWAY_ALLOW_ALL_USERS) before
auto-joining. Without this any federated Matrix user could invite
the bot into arbitrary rooms, exposing its presence and metadata.
The message and reaction paths already enforce this allow-list; the
invite path bypassed it.
- Mattermost: _api_get / _api_post / _api_put reject any path
containing '..'. WebSocket-event values (channel_id, post_id,
file_id) are interpolated directly into API paths, so a malicious or
compromised server could craft traversal payloads to make the bot
issue authenticated requests to arbitrary endpoints with its bearer
token.
The configurable-E2EE-passphrase change from the original PR is dropped:
the matrix adapter was rewritten onto mautrix and the passphrase-protected
key-export file no longer exists.
Removed/unauthorized Telegram users could inject prompt content before the
per-user auth gate fired. The adapter ran `_should_process_message`,
`_build_message_event`, and text/photo batching — and dispatched to the
runner — before `_is_user_authorized()` (gateway/authz_mixin.py) rejected
the sender. Unmentioned group chatter from a removed user was also
persisted into the session transcript via `_observe_unmentioned_group_message`,
leaking into the agent's observed context independent of dispatch.
Add `_is_user_authorized_from_message()` as an intake prefilter that runs
in `_handle_text_message`, `_handle_command`, `_handle_location_message`,
and `_handle_media_message` BEFORE batching, event construction, and the
unmentioned-group observe branch. It reuses the runner's
`_is_user_authorized()` with a correctly-shaped SessionSource (group vs
forum vs dm, real chat_id for TELEGRAM_GROUP_ALLOWED_* allowlists),
falls back to env allowlists, and only rejects when an allowlist actually
exists — unknown DMs with no allowlist still reach the pairing flow.
Channel posts authorize via `sender_chat` identity when `from_user` is
absent.
Co-authored-by: liuhao1024 <sunsky.lau@gmail.com>
Co-authored-by: Carlos Manuel Cejas <carlosmcejas@gmail.com>
main (cb982ad99) wired windows_hide_flags() into the auxiliary git/gh/wmic/
bash/powershell/taskkill legs but left two it didn't reach, plus the Electron
backend-launch leg it explicitly deferred. Cover them the same way:
- apps/desktop/electron/main.cjs: getNoConsoleVenvPython resolves the BASE
pythonw.exe instead of the venv Scripts\pythonw.exe shim, which re-execs a
console python.exe and flashes a conhost the desktop backend can't suppress.
Both backend creators put the venv site-packages on PYTHONPATH so imports
still resolve under the base interpreter. (main's commit said this Electron
leg "needs a Windows-tested change of its own".)
- tools/tts_tool.py, tools/transcription_tools.py, plugins/platforms/discord:
ffmpeg conversions (voice notes / TTS / STT) via windows_hide_flags().
- plugins/platforms/whatsapp: netstat + taskkill bridge-port cleanup via
windows_hide_flags().
All no-ops on POSIX. Tests assert the base-pythonw preference and the ffmpeg
legs pass CREATE_NO_WINDOW.
The merged CLOSE-WAIT heartbeat (#52744) only probes get_me(), which uses the
general request path and stays healthy while PTB's getUpdates consumer is
silently wedged (updater.running=True but the long-poll task is stuck, observed
on WSL2). DMs then queue in the Bot API and never reach handlers (#42909).
Augment the existing _polling_heartbeat_loop to also probe
get_webhook_info().pending_update_count. After two consecutive probes that see a
non-draining queue while the updater claims to be running, escalate into the
existing _handle_polling_network_error recovery ladder — no new restart
machinery. No-ops in webhook mode, when the updater is not running, or when a
reconnect is already in flight.
Credit to @gazzumatteo, whose PR #42959 identified the pending_update_count
signal as the missing liveness probe. This reuses the existing heartbeat +
recovery path rather than adding a parallel watchdog.
Fixes#42909.
Rebase onto plugins/platforms/matrix/adapter.py (code moved from
gateway/platforms/matrix.py). Same logic: _on_invite checks is_direct
on invite events and calls _record_dm_room to persist in m.direct
account data.
Fixes#44679
* fix(telegram): clear send_path_degraded on successful reconnect
_send_path_degraded was cleared only in _verify_polling_after_reconnect,
60s after reconnect and only if scheduled. A clean start_polling() reconnect
left the flag stuck True, short-circuiting send() and blocking all outbound
messages until the deferred probe ran (or forever if it never did).
Clear the flag the moment start_polling() succeeds — that is the recovery
signal. The deferred probe remains a defensive re-check that re-enters the
reconnect ladder (re-setting the flag) if it detects a silent wedge.
Fixes#35205.
* docs: add infographic for #35205 telegram send-path fix
Telegram polling entered a self-inflicted ~31s loop of 409 Conflict ->
retry -> resume -> Conflict. The error_callback PTB invokes synchronously
inside its internal network_retry_loop only scheduled our async recovery
task (loop.create_task) and returned, so PTB kept polling getUpdates on its
own while our handler concurrently ran stop -> sleep -> start_polling. The
two polling sessions overlapped and Telegram returned a fresh 409.
Fix: in the conflict branch of the error_callback, synchronously set PTB's
private polling stop_event before scheduling recovery. PTB's loop exits on
its next tick (it races that event in do_action), so our handler owns
polling alone. The handler's await updater.stop() drains the task and PTB
clears the event, so the subsequent start_polling() builds a fresh event
and is not poisoned.
Keeps the existing reconnect ladder intact (option B) — fixes only the
race. Defensive: probes mangled + unmangled stop_event spellings and no-ops
(prior behaviour) if neither exists; never flips _running, which would make
the handler skip stop() and leave the loop wedged.
The Discord adapter could enter a silent zombie state after a network
outage / proxy stall: the process is alive, _client looks open, but the
underlying socket is dead. discord.py's WebSocket reconnect never sees a
RST through a wedged proxy/NAT, so client.start() spins forever without
exiting — which means the bot-task done callback (which only fires on
task completion) never trips either. The bot stays "offline" in Discord
until a manual `hermes gateway restart`. Reported offline for 13-17h.
Adds an out-of-band REST liveness probe in DiscordAdapter. Every
`discord.liveness_interval_seconds` (default 60s) the adapter issues a
cheap fetch_user(bot_id) — the same REST path as message delivery, so it
fails when the proxy/NAT is wedged. After
`discord.liveness_failure_threshold` consecutive failures (default 3) the
probe closes the wedged client and surfaces a retryable fatal error,
which trips the gateway's existing _platform_reconnect_watcher and
rebuilds the adapter. Operators disable it by setting either knob to 0.
Config lives in config.yaml (discord.liveness_*) per the .env-is-secrets
policy; _apply_yaml_config bridges it to internal env vars the adapter
reads, matching the existing HERMES_DISCORD_TEXT_BATCH_* pattern.
Co-authored-by: Hermes Agent <agent@nousresearch.com>
When a Telegram attachment download/cache fails (typically a transient
httpx.ConnectError to Telegram's CDN), the except handler logged a warning
and fell through to handle_message() with empty media and no text — the user
thought the file was delivered, the agent saw a content-less turn with no
signal an attachment was attempted, and the only record was a buried log line.
Adds _surface_media_cache_failure(): replies to the user in Telegram so they
know to retry, and appends an agent-visible notice to event.text via the
existing _append_observed_note channel so the agent knows an attachment was
attempted and failed. No new event fields (structured-event refactor is out
of scope per #23045). Wired into all five cache-failure sites — photo, voice,
audio, video, document — since they shared the identical silent fall-through.
Bug 1 from #23045 (unsupported types routed as fake user messages) no longer
exists on main: the document handler now accepts any file type, so there is no
rejection branch to fix.
Closes#23045
DISCORD_ALLOWED_USERS="*" now means "allow everyone", matching the
SIGNAL_ALLOWED_USERS / DISCORD_ALLOWED_CHANNELS wildcard convention and
the value `claw migrate` emits. Previously _is_allowed_user did exact
ID matching only, so "*" matched no user and blocked every non-self
sender — a P1 with no workaround.
Three sites, all required for the fix to hold at runtime:
- _is_allowed_user: short-circuit when "*" is in the allowlist.
- connect(): exclude "*" from the intents.members trigger so the
wildcard does not request the privileged Server Members intent
(which can block the bot from coming online).
- _resolve_allowed_usernames: preserve "*" verbatim; otherwise it lands
in the username-resolution bucket, matches no member, and is silently
dropped from the set and env var on the first on_ready — quietly
undoing the fix.
Slash auth delegates to _is_allowed_user (auto-covered); component auth
already honors "*" on main.