Auto-sync: update nexus workspace
24
wiki/concepts/PagedAttention.md
Normal file
@@ -0,0 +1,24 @@
---
title: "PagedAttention"
type: concept
tags: [paged-attention, vllm, inference, optimization]
aliases: [PagedAttention, 分页注意力]
last_updated: 2025-12-20
---

## Definition
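PagedAttention is the core attention-mechanism innovation in vLLM. It splits the [[KV Cache]] into fixed-size blocks and manages them through a page-table-style mapping, much as an operating system manages virtual memory.
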
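The page-table idea can be illustrated with a minimal Python sketch. It is not vLLM's actual implementation: `BlockAllocator`, `Sequence`, and the block size of 16 tokens are hypothetical names and values chosen for illustration. Each sequence keeps a block table that maps logical block indices to physical block ids, so its KV entries need not be contiguous in memory.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block; illustrative value, not a vLLM default claim


class BlockAllocator:
    """Hypothetical pool that hands out physical block ids."""

    def __init__(self, num_blocks: int) -> None:
        # Reversed so that pop() yields ids 0, 1, 2, ... in order.
        self.free = list(range(num_blocks - 1, -1, -1))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        return self.free.pop()


@dataclass
class Sequence:
    """Tracks one request: logical block index -> physical block id."""

    block_table: list = field(default_factory=list)
    num_tokens: int = 0

    def append_token(self, allocator: BlockAllocator) -> None:
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1

    def locate(self, pos: int) -> tuple:
        """Return (physical block id, offset inside the block) for a token."""
        if not 0 <= pos < self.num_tokens:
            raise IndexError(pos)
        return self.block_table[pos // BLOCK_SIZE], pos % BLOCK_SIZE


allocator = BlockAllocator(num_blocks=8)
seq_a, seq_b = Sequence(), Sequence()
for _ in range(20):
    seq_a.append_token(allocator)   # seq_a takes blocks 0 and 1
for _ in range(5):
    seq_b.append_token(allocator)   # seq_b takes block 2
for _ in range(13):
    seq_a.append_token(allocator)   # seq_a grows into block 3
print(seq_a.block_table)            # [0, 1, 3]: physically non-contiguous
```

Because blocks are allocated on demand, a sequence wastes at most one partially filled block, whatever its final length.
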
## Key Facts
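- Traditional approach: each sequence is allocated one large contiguous region of memory, which causes fragmentation and out-of-memory (OOM) errors on the GPU.
- PagedAttention's solution: split the KV Cache into fixed-size blocks and manage them with a page table, so memory can be scheduled flexibly.
- Advantages: avoids fragmentation, supports dynamic concurrency, and allows KV blocks to be reused (multi-branch or repeated-prefix scenarios).
- Significantly reduces prefill time when KV blocks for a repeated prefix can be reused, since those tokens do not need to be recomputed.
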
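KV block reuse for multi-branch or repeated-prefix requests can be sketched with reference counting and copy-on-write. This is a simplified illustration under stated assumptions, not vLLM's code: `SharedBlockPool` and its methods are hypothetical, and the sketch tracks only block ids, not the KV tensors themselves.

```python
from collections import Counter


class SharedBlockPool:
    """Hypothetical block pool with reference counts for shared KV blocks."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks - 1, -1, -1))  # pop() yields 0, 1, 2, ...
        self.refcount = Counter()

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def fork(self, block_table: list) -> list:
        """Let a new sequence share every block of an existing one."""
        for block in block_table:
            self.refcount[block] += 1
        return list(block_table)

    def writable_last_block(self, block_table: list) -> int:
        """Copy-on-write: make the last block private before appending to it."""
        last = block_table[-1]
        if self.refcount[last] > 1:
            self.refcount[last] -= 1
            # A real system would also copy the KV data into the new block.
            block_table[-1] = self.allocate()
        return block_table[-1]

    def free_sequence(self, block_table: list) -> None:
        """Drop one reference per block; recycle blocks nobody uses anymore."""
        for block in block_table:
            self.refcount[block] -= 1
            if self.refcount[block] == 0:
                del self.refcount[block]
                self.free.append(block)
        block_table.clear()


pool = SharedBlockPool(num_blocks=8)
parent = [pool.allocate() for _ in range(3)]  # prefix occupies blocks 0, 1, 2
child = pool.fork(parent)                     # shares the prefix, no new memory
pool.writable_last_block(child)               # child diverges: private copy of last block
print(parent, child)                          # [0, 1, 2] [0, 1, 3]
```

The two branches hold six logical blocks but only four physical ones, and the shared prefix blocks are filled once rather than once per branch.
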
## Connections
- [[vLLM]] ← used by ← [[PagedAttention]]
- [[KV Cache]] ← management optimized by ← [[PagedAttention]]
- [[Continuous Batching]] ← works together with ← [[PagedAttention]]

## Sources
- [[大模型相关术语和框架总结|llm-mcp-prompt-rag-vllm-token-数据蒸馏]]