Adding a lot of skills for Hermes Gerhard

2026-05-09 15:51:39 +02:00
parent 7d6362d9d4
commit 106fe12c68
245 changed files with 63514 additions and 163 deletions
@@ -1,6 +1,6 @@
 ---
 name: serving-llms-vllm
-description: Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.
+description: "vLLM: high-throughput LLM serving, OpenAI API, quantization."
 version: 1.0.0
 author: Orchestra Research
 license: MIT
@@ -13,6 +13,10 @@ metadata:

 # vLLM - High-Performance LLM Serving

+## When to use
+
+Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.
+
 ## Quick start

 vLLM achieves 24x higher throughput than standard transformers through PagedAttention (block-based KV cache) and continuous batching (mixing prefill/decode requests).