c8e8067993
LM Studio and Ollama run one model on one GPU, so concurrent requests cause crashes. Three fixes:

1. Per-upstream semaphore (concurrency=1) in _route_agent_chat for lm-studio/ollama providers. All agent-routed calls to the same base URL queue instead of hitting the GPU simultaneously.
2. skip_discovery=True when routing to a local model. Context discovery would otherwise fire a second LM Studio call alongside the main inference. Novel words are still registered in SOAS (low saliency), but the LLM confirmation step is deferred. Configure write_model_id or a separate agent model pointing at a cloud/remote model to re-enable live context discovery.
3. _LLM_CONCURRENCY 2 → 1 in write_queue for the same reason.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>