Adding a lot of skills for Hermes Gerhard
This commit is contained in:
@@ -1,6 +1,6 @@
|
||||
---
|
||||
name: serving-llms-vllm
|
||||
description: Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.
|
||||
description: "vLLM: high-throughput LLM serving, OpenAI API, quantization."
|
||||
version: 1.0.0
|
||||
author: Orchestra Research
|
||||
license: MIT
|
||||
@@ -13,6 +13,10 @@ metadata:
|
||||
|
||||
# vLLM - High-Performance LLM Serving
|
||||
|
||||
## When to use
|
||||
|
||||
Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.
|
||||
|
||||
## Quick start
|
||||
|
||||
vLLM achieves 24x higher throughput than standard transformers through PagedAttention (block-based KV cache) and continuous batching (mixing prefill/decode requests).
|
||||
|
||||
Reference in New Issue
Block a user