---
title: "Continuous Batching"
type: concept
tags: [continuous-batching, vllm, inference, gpu]
aliases: [Continuous Batching, 连续批处理, Iteration-Level Scheduling]
last_updated: 2025-12-20
---
## Definition
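Continuous Batching (连续批处理) is an LLM inference optimization used by vLLM. Traditional batching waits until a full batch has been collected and then processes it to completion. Continuous Batching instead reassembles the batch of active requests at every decoding step (one iteration per generated token), which keeps the GPU close to fully loaded.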
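
To make the difference between the two scheduling styles concrete, here is a toy Python simulation. It is a sketch under stated assumptions, not vLLM code. Every decode step costs one time unit regardless of batch size, prefill is ignored, and, for the static case, results are released only when the whole batch finishes. `static_batching` and `continuous_batching` are hypothetical helper names.

```python
from collections import deque


def static_batching(lengths, batch_size):
    """Finish step of each request when batches run back to back to completion."""
    finish = [0] * len(lengths)
    clock = 0
    for start in range(0, len(lengths), batch_size):
        batch = range(start, min(start + batch_size, len(lengths)))
        clock += max(lengths[i] for i in batch)  # batch occupies the GPU until its longest request ends
        for i in batch:
            finish[i] = clock  # assumption: results are released only when the whole batch is done
    return finish


def continuous_batching(lengths, batch_size):
    """Finish step of each request when the batch is rebuilt before every decode step."""
    finish = [0] * len(lengths)
    waiting = deque(range(len(lengths)))  # FCFS queue of request ids
    running = {}  # request id -> tokens still to generate (assumed >= 1)
    clock = 0
    while waiting or running:
        # Iteration-level scheduling: fill any free slots before this step runs.
        while waiting and len(running) < batch_size:
            rid = waiting.popleft()
            running[rid] = lengths[rid]
        clock += 1  # one decode step: every running request emits one token
        for rid in list(running):
            running[rid] -= 1
            if running[rid] <= 0:
                finish[rid] = clock
                del running[rid]  # the slot is free for the next step
    return finish


lengths = [10, 1, 1, 1]  # hypothetical output lengths, in tokens
print(static_batching(lengths, batch_size=2))      # [10, 10, 11, 11]
print(continuous_batching(lengths, batch_size=2))  # [10, 1, 2, 3]
```

Request 0 finishes at step 10 under both schemes. Under continuous batching, however, the three one-token requests no longer wait behind it, because a freed slot is refilled on the very next step.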
## Key Facts
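- Traditional (static) batching: requests are collected into a full batch before it runs, and the batch is not done until its longest request finishes. Short requests are therefore stuck behind long ones (head-of-line blocking).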
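- Continuous Batching: the batch is reassembled at every decoding step, so new requests can join, and finished ones can leave, without waiting for the whole batch to finish.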
- In vLLM it is built on the block-based KV-cache memory of [[PagedAttention]], combined with an iteration-level (per-step) scheduler.
- Raises GPU concurrency and fairness across requests, making fuller use of the GPU's compute.
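
The block-based memory and step-level scheduler mentioned above can be sketched roughly as follows. This is not vLLM's actual scheduler. `Request`, `BlockPool`, `schedule_step` and `BLOCK_SIZE` are hypothetical names. The sketch also assumes that the prompt's KV cache is already filled at admission (prefill is not modeled), and that a request that cannot get a block simply stalls for that step, where a real system would preempt it.

```python
import math
from collections import deque
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per KV-cache block; small hypothetical value for illustration


@dataclass
class Request:
    rid: str
    prompt_len: int      # tokens in the KV cache right after admission
    max_new_tokens: int  # tokens to decode before the request finishes
    generated: int = 0
    blocks: list = field(default_factory=list)  # this request's block table

    def cached_tokens(self) -> int:
        return self.prompt_len + self.generated


class BlockPool:
    """Fixed pool of KV-cache blocks shared by all requests."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def take(self, n: int) -> list:
        taken, self.free = self.free[:n], self.free[n:]
        return taken

    def give_back(self, blocks: list) -> None:
        self.free.extend(blocks)


def schedule_step(waiting: deque, running: list, pool: BlockPool) -> list:
    """Run one decode iteration; return ids of requests that finished in it."""
    # Admit waiting requests (FCFS) while their prompt fits into free blocks.
    while waiting:
        need = math.ceil(waiting[0].prompt_len / BLOCK_SIZE)
        if need > len(pool.free):
            break
        req = waiting.popleft()
        req.blocks = pool.take(need)
        running.append(req)

    finished = []
    for req in list(running):
        # The new token needs a slot: grab one more block only when the last one is full.
        if req.cached_tokens() >= len(req.blocks) * BLOCK_SIZE:
            if not pool.free:
                continue  # out of memory: this sketch stalls the request for this step
            req.blocks += pool.take(1)
        req.generated += 1  # decode one token
        if req.generated == req.max_new_tokens:
            pool.give_back(req.blocks)  # blocks return to the pool immediately
            req.blocks = []
            running.remove(req)
            finished.append(req.rid)
    return finished


waiting = deque([Request("A", 2, 2), Request("B", 2, 1), Request("C", 6, 1)])
running, pool = [], BlockPool(num_blocks=3)
for t in (1, 2):
    print(t, schedule_step(waiting, running, pool))
# 1 ['B']
# 2 ['A', 'C']
```

In this trace, request C cannot be admitted at step 1 because the pool lacks the two free blocks its prompt needs. As soon as B finishes and returns its block, C joins the running batch at step 2 without waiting for A to finish.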
## Connections
- [[vLLM]] ← uses ← [[Continuous Batching]]
- [[PagedAttention]] ← works with ← [[Continuous Batching]]
## Sources
- [[大模型相关术语和框架总结|llm-mcp-prompt-rag-vllm-token-数据蒸馏]]