agent0

gitprov c8e8067993 Serialize local model calls and skip concurrent context discovery

LM Studio and Ollama run one model on one GPU, so concurrent requests
cause crashes. Three fixes:

1. Per-upstream semaphore (concurrency=1) in _route_agent_chat for
   lm-studio/ollama providers. All agent-routed calls to the same
   base URL queue instead of hitting the GPU simultaneously.

2. skip_discovery=True when routing to a local model. Context discovery
   would fire a second LM Studio call alongside the main inference.
   Novel words are still registered in SOAS (low saliency) but the
   LLM confirmation step waits. Configure write_model_id or a separate
   agent model pointing at a cloud/remote model to re-enable live
   context discovery.

3. _LLM_CONCURRENCY 2 → 1 in write_queue for the same reason.
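
Fixes 1 and 2 can be sketched roughly as follows. This is a minimal
illustration, not the code in this commit: route_chat, _semaphore_for,
LOCAL_PROVIDERS and demo_peak are hypothetical names, and the real
_route_agent_chat signature is not shown here. It assumes an asyncio-based
router and a call that accepts a skip_discovery keyword.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple

# Hypothetical: providers that serve one model on one GPU.
LOCAL_PROVIDERS = {"lm-studio", "ollama"}

# One semaphore per upstream base URL, created lazily.
_upstream_semaphores: Dict[str, asyncio.Semaphore] = {}


def _semaphore_for(base_url: str) -> asyncio.Semaphore:
    """Return the concurrency=1 semaphore for this upstream, creating it once."""
    sem = _upstream_semaphores.get(base_url)
    if sem is None:
        sem = _upstream_semaphores[base_url] = asyncio.Semaphore(1)
    return sem


async def route_chat(
    provider: str,
    base_url: str,
    call: Callable[..., Awaitable[bool]],
) -> bool:
    """Queue calls to local upstreams; pass remote calls straight through."""
    if provider not in LOCAL_PROVIDERS:
        # Remote model: no serialization, live context discovery stays on.
        return await call(skip_discovery=False)
    async with _semaphore_for(base_url):
        # Local model: one request at a time, and no second discovery call.
        return await call(skip_discovery=True)


async def demo_peak(provider: str, base_url: str, n: int = 4) -> Tuple[int, List[bool]]:
    """Fire n concurrent calls; report peak in-flight count and the flags seen."""
    in_flight = 0
    peak = 0

    async def fake_call(skip_discovery: bool) -> bool:
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # stand-in for model inference
        in_flight -= 1
        return skip_discovery

    flags = await asyncio.gather(
        *(route_chat(provider, base_url, fake_call) for _ in range(n))
    )
    return peak, list(flags)
```

Keying the semaphore on the base URL rather than on the provider name
means two agents pointing at the same local server share one queue, while
separate local servers do not block each other.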

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-04-21 22:07:37 +02:00