---
id: vllm
title: "vLLM"
type: concept
tags: [LLM, inference, GPU, optimization]
sources:
- "[[LLM Terms Framework]]"
last_updated: 2025-12-20
---

## Definition

vLLM is an efficient LLM inference framework that improves GPU utilization through KV cache management (PagedAttention) and continuous batching.

## Key Optimizations

### KV Cache
- Caches the Key and Value matrices already computed for earlier tokens
- Avoids recomputing them at every decoding step
- Greatly speeds up inference
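
The caching idea can be sketched with a toy single-head attention decoder. This is only an illustration of the general KV cache technique, not vLLM's implementation: the helper names (`project`, `attend`, `decode_with_cache`, `decode_without_cache`), the 2x2 weight matrices, and the input vectors are all hypothetical.

```python
import math

def project(x, w):
    """Multiply vector x by matrix w (a stand-in for a learned projection)."""
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*w)]

def attend(q, keys, values):
    """Scaled dot-product attention of one query over all keys and values."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w / total * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# Hypothetical fixed weights, chosen only for illustration.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.1], [0.2, 0.3]]
W_V = [[0.3, 0.7], [0.9, 0.4]]

def decode_without_cache(tokens):
    """Recompute K and V for the whole prefix at every step."""
    outputs, positions_projected = [], 0
    for t in range(1, len(tokens) + 1):
        prefix = tokens[:t]
        keys = [project(x, W_K) for x in prefix]
        values = [project(x, W_V) for x in prefix]
        positions_projected += len(prefix)
        outputs.append(attend(project(prefix[-1], W_Q), keys, values))
    return outputs, positions_projected

def decode_with_cache(tokens):
    """Compute K and V once per token and reuse them on later steps."""
    outputs, positions_projected = [], 0
    key_cache, value_cache = [], []
    for x in tokens:
        key_cache.append(project(x, W_K))
        value_cache.append(project(x, W_V))
        positions_projected += 1
        outputs.append(attend(project(x, W_Q), key_cache, value_cache))
    return outputs, positions_projected

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
out_slow, n_slow = decode_without_cache(tokens)
out_fast, n_fast = decode_with_cache(tokens)
print(n_slow, n_fast)
```

For these four tokens the uncached loop projects 10 token positions into K and V, while the cached loop projects 4, and both produce the same attention outputs. The gap grows quadratically with sequence length, which is why the cache matters for long generations.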
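
### Continuous Batching
- Dynamically batches multiple requests, admitting new ones as running ones finish
- Improves GPU utilization
- Reduces latency, since requests need not wait for a whole batch to drain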
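
A small simulation can make the scheduling difference concrete. The sketch below is a simplified model, not vLLM's scheduler: it assumes every running request produces one token per decoding step, that each request's output length is known in advance and positive, and that the batch has a fixed number of slots. The function names and the request lengths are hypothetical.

```python
from collections import deque

def static_batching(lengths, batch_size):
    """Run fixed batches; each batch lasts until its longest request finishes."""
    steps, finish = 0, {}
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        for offset, n in enumerate(batch):
            finish[start + offset] = steps + n
        steps += max(batch)  # short requests leave their slot idle meanwhile
    return steps, finish

def continuous_batching(lengths, batch_size):
    """Refill a slot with a waiting request as soon as one finishes."""
    waiting = deque(enumerate(lengths))
    running = {}  # request id -> tokens still to generate
    steps, finish = 0, {}
    while waiting or running:
        while waiting and len(running) < batch_size:
            rid, n = waiting.popleft()
            running[rid] = n
        steps += 1  # one decoding iteration: one token per running request
        for rid in list(running):
            running[rid] -= 1
            if running[rid] <= 0:
                finish[rid] = steps
                del running[rid]
    return steps, finish

lengths = [2, 8, 3, 8, 2, 2, 3, 2]  # tokens to generate per request
static_steps, static_finish = static_batching(lengths, batch_size=2)
cont_steps, cont_finish = continuous_batching(lengths, batch_size=2)

total_tokens = sum(lengths)
print("static:", static_steps, "steps, utilization",
      total_tokens / (static_steps * 2))
print("continuous:", cont_steps, "steps, utilization",
      total_tokens / (cont_steps * 2))
```

With these lengths, static batching needs 21 decoding steps and continuous batching needs 15, because no slot sits idle waiting for a long request in the same batch. The mean completion step also drops from 14.25 to 10, which is the latency effect the list above refers to.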

## Why It Matters

- Stock Hugging Face inference is comparatively slow
- vLLM can raise throughput by 10-24x over that baseline
- Supports high-concurrency inference

## Connections
- [[LLM]] ← uses ← [[vLLM]]
- [[推理优化|Inference Optimization]] ← uses ← [[vLLM]]
- [[GPU利用率|GPU Utilization]] ← improves ← [[vLLM]]