Auto-sync
@@ -1,36 +0,0 @@
---
id: vllm
title: "vLLM"
type: concept
tags: [LLM, inference, GPU, optimization]
sources:
  - "[[LLM Terms Framework]]"
last_updated: 2025-12-20
---

## Definition
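
vLLM is an efficient LLM inference framework that raises GPU utilization through KV caching (managed in fixed-size blocks by PagedAttention) and continuous batching.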
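
## Key Optimizations

### KV Cache
- Caches the key and value tensors already computed for earlier tokens
- Avoids recomputing them for the whole prefix at every decoding step
- Substantially speeds up autoregressive generation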
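
The effect of the KV cache can be illustrated with a toy single-head attention decoder in plain Python. This is a minimal sketch, not vLLM code: `project`, `attend`, `decode_without_cache`, and `decode_with_cache` are hypothetical names, and the tiny 2-dimensional vectors stand in for real hidden states. Both decoders produce the same outputs; only the number of key/value projections differs.

```python
import math


def project(x, w):
    """Multiply vector x (length d) by matrix w (d rows, d columns)."""
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*w)]


def attend(q, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


def decode_without_cache(tokens, wq, wk, wv):
    """Recompute every key and value of the prefix at each step."""
    outputs, projections = [], 0
    for t in range(len(tokens)):
        prefix = tokens[: t + 1]
        keys = [project(x, wk) for x in prefix]
        values = [project(x, wv) for x in prefix]
        projections += 2 * len(prefix)
        outputs.append(attend(project(tokens[t], wq), keys, values))
    return outputs, projections


def decode_with_cache(tokens, wq, wk, wv):
    """Project each token once and append the result to the cache."""
    outputs, projections = [], 0
    key_cache, value_cache = [], []
    for x in tokens:
        key_cache.append(project(x, wk))
        value_cache.append(project(x, wv))
        projections += 2
        outputs.append(attend(project(x, wq), key_cache, value_cache))
    return outputs, projections


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
    wq = [[0.5, -0.2], [0.1, 0.3]]
    wk = [[0.4, 0.1], [-0.3, 0.2]]
    wv = [[1.0, 0.0], [0.0, 1.0]]
    plain, plain_cost = decode_without_cache(tokens, wq, wk, wv)
    cached, cached_cost = decode_with_cache(tokens, wq, wk, wv)
    print("same outputs:", plain == cached)
    print("K/V projections without cache:", plain_cost)
    print("K/V projections with cache:", cached_cost)
```

For a sequence of n tokens the uncached decoder performs n(n+1) key/value projections while the cached one performs 2n, which is why the saving grows with sequence length. The price is memory: the cache itself must be stored for every active request.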
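
### Continuous Batching
- Batches multiple requests dynamically, admitting new requests as soon as running ones finish
- Raises GPU utilization
- Lowers latency, since a request does not wait for an entire batch to drain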
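
The scheduling idea can be sketched with a toy step counter. This is a simplified model under stated assumptions, not vLLM's scheduler: each request needs a fixed number of decode steps, every step yields one token per running request, prefill cost is ignored, and the function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Run fixed batches; each batch lasts as long as its longest request."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Refill free batch slots from the queue before every decode step."""
    waiting = deque(lengths)
    running = []  # remaining decode steps for each active request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any slots freed by finished ones.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Each running request emits one token; finished requests leave.
        running = [r - 1 for r in running if r > 1]
    return steps


if __name__ == "__main__":
    lengths = [10, 2, 2, 2, 2, 2]  # decode steps needed per request
    print("static:", static_batching_steps(lengths, batch_size=2))
    print("continuous:", continuous_batching_steps(lengths, batch_size=2))
```

With these inputs the static schedule needs 14 steps and the continuous one needs 10, because the short requests reuse the second slot while the long request is still decoding instead of leaving it idle.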
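
## Why It Matters

- Inference with the stock Hugging Face Transformers library is comparatively slow
- vLLM is reported to deliver roughly 10-24x higher throughput than that baseline, depending on model and workload
- Supports high-concurrency inference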

## Connections
- [[LLM]] ← uses ← [[vLLM]]
- [[推理优化|Inference Optimization]] ← uses ← [[vLLM]]
- [[GPU利用率|GPU Utilization]] ← improves ← [[vLLM]]