{"id":180149,"date":"2026-06-09T18:06:37","date_gmt":"2026-06-09T18:06:37","guid":{"rendered":"https:\/\/ktromedia.com\/?p=180149"},"modified":"2026-06-09T18:06:37","modified_gmt":"2026-06-09T18:06:37","slug":"serving-multiple-users-at-once-how-continuous-batching-keeps-llm-inference-efficient","status":"publish","type":"post","link":"https:\/\/ktromedia.com\/?p=180149","title":{"rendered":"Serving Multiple Users at Once: How Continuous Batching Keeps LLM Inference Efficient"},"content":{"rendered":"<div style=\"font-size: 12px !important; line-height: 15px !important; -moz-tab-size:4; -o-tab-size:4; -webkit-tab-size:4; tab-size:4;\">\n<p><span class=\"crayon-s\">&#8220;&#8221;<\/span><span class=\"crayon-s\">&#8220;<\/span><\/p>\n<p><span class=\"crayon-s\">Continuous batching = iteration-level scheduling + ragged (packed) batching.<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-s\">Two approaches are compared (both run BATCH_SIZE sequences concurrently, so the<\/span><\/p>\n<p><span class=\"crayon-s\">comparison is slot-for-slot fair):<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a01. Static batching (baseline):<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 Prompts are processed BATCH_SIZE at a time.\u00a0\u00a0Each wave is padded to a<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 common length and run together until the LONGEST request in that wave<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 finishes; a hard &#8220;<\/span><span class=\"crayon-e\">batch <\/span><span class=\"crayon-i\">barrier<\/span><span class=\"crayon-s\">&#8221; then has to clear before the next wave<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 starts.\u00a0\u00a0Short requests sit idle behind the barrier.<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a02. 
Continuous batching (production-aligned):<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 Two ideas combine to keep the GPU busy.<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 (a) Iteration-level scheduling: the moment a sequence finishes it frees<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 its slot, and the next queued prompt is admitted on the SAME step &#8211;<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 no waiting for the rest of the batch.<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 (b) Ragged \/ packed batching &#8211; the part that makes it truly &#8220;<\/span><span class=\"crayon-i\">continuous<\/span><span class=\"crayon-s\">&#8220;:<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 instead of padding every sequence into a rectangular [B, max_len]<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 tensor, ALL in-flight tokens are concatenated into a single unpadded<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 [1, total_tokens] row and run in ONE forward pass.\u00a0\u00a0A block-diagonal<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 causal attention mask stops tokens from attending across sequence<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 boundaries, so packing is mathematically identical to running each<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 sequence on its own (verified: greedy output matches per-prompt<\/span><\/p>\n<p><span 
class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 generation token-for-token).<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 Because attention is governed entirely by the mask, a newly admitted<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 prompt&#8217;s multi-token PREFILL rides along in the same forward pass as<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 every other sequence&#8217;s single-token DECODE step.\u00a0\u00a0Prefill and decode are<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 fused: no padding, no separate prefill pass.<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 KV cache: each sequence keeps its own DynamicCache; every step the caches<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 are concatenated along the time axis into one packed cache, and the newly<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 computed KV is scattered back per sequence.\u00a0\u00a0(Real engines store the<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 cache in fixed-size pages &#8211; &#8220;<\/span><span class=\"crayon-e\">paged <\/span><span class=\"crayon-i\">attention<\/span><span class=\"crayon-s\">&#8221; &#8211; to avoid this per-step<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 reassembly, but the attention\/masking logic is exactly what you see here.)<\/span><\/p>\n<p><span class=\"crayon-s\">&#8220;<\/span><span class=\"crayon-s\">&#8220;&#8221;<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">time<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">torch<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span 
class=\"crayon-e\">dataclasses <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-v\">dataclass<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">field<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">typing <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">Optional<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">transformers <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-v\">AutoTokenizer<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">AutoModelForCausalLM<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">DynamicCache<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-v\">transformers<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">cache_utils <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">DynamicLayer<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-v\">MODEL_ID<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-s\">&#8220;openai-community\/gpt2&#8221;<\/span><span class=\"crayon-h\">\u00a0\u00a0 <\/span><span class=\"crayon-p\"># swap for any causal LM<\/span><\/p>\n<p><span class=\"crayon-v\">BATCH_SIZE<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">3<\/span><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 <\/span><span class=\"crayon-p\"># max concurrent sequences (slots)<\/span><\/p>\n<p>\u00a0<\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-e\">def <\/span><span class=\"crayon-e\">_device_sync<\/span><span class=\"crayon-sy\">(<\/span><span 
class=\"crayon-v\">model<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">-&gt;<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">None<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-s\">&#8220;&#8221;<\/span><span class=\"crayon-s\">&#8220;Block until queued GPU work finishes, so timings are accurate.&#8221;<\/span><span class=\"crayon-s\">&#8220;&#8221;<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">if<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">model<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">device<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">type<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">==<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-s\">&#8220;cuda&#8221;<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">torch<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">cuda<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">synchronize<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-e\">elif <\/span><span class=\"crayon-v\">model<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">device<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">type<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">==<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-s\">&#8220;mps&#8221;<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">torch<\/span><span 
class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">mps<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">synchronize<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-e\">def <\/span><span class=\"crayon-e\">static_batching<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">requests<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">list<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">tuple<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">str<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-t\">int<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">tokenizer<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">model<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">-&gt;<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">list<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">str<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-s\">&#8220;&#8221;<\/span><span class=\"crayon-s\">&#8220;Baseline. 
Process requests BATCH_SIZE at a time; each wave runs together<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0until its LONGEST request finishes, then a batch barrier clears before the<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0next wave starts.<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0Downside: short requests in a wave idle until the wave&#8217;s longest is done &#8211;<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0and no slot can be refilled until the whole wave clears the barrier.<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0&#8220;<\/span><span class=\"crayon-s\">&#8220;&#8221;<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">if<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">not<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">requests<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">return<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-sy\">]<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">tokenizer<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">padding_side<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-s\">&#8220;left&#8221;<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">results<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">dict<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-t\">int<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">str<\/span><span 
class=\"crayon-sy\">]<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">{<\/span><span class=\"crayon-sy\">}<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">indexed<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">list<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-e\">enumerate<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">requests<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-p\"># (req_id, (prompt, cap))<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">wave_start <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">range<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-cn\">0<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">len<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">indexed<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">BATCH_SIZE<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">wave<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">indexed<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">wave_start<\/span><span class=\"crayon-o\">:<\/span><span 
class=\"crayon-h\"> <\/span><span class=\"crayon-v\">wave_start<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">+<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">BATCH_SIZE<\/span><span class=\"crayon-sy\">]<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">wave_max<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">max<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-e\">cap <\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">_<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">_<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">cap<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">wave<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-p\"># Show which request occupies each slot in this wave.<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">slot<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">req_id<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">prompt<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">cap<\/span><span class=\"crayon-sy\">)<\/span><span 
class=\"crayon-sy\">)<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">enumerate<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">wave<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-i\">f<\/span><span class=\"crayon-s\">&#8221;\u00a0\u00a0++ slot {slot}\u00a0\u00a0req {req_id}&#8221;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">flush<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-t\">True<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">prompts<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-i\">p<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">_<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">p<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">_<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">wave<\/span><span class=\"crayon-sy\">]<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">inputs<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span 
class=\"crayon-e\">tokenizer<\/span><span class=\"crayon-sy\">(<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">prompts<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">return_tensors<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-s\">&#8220;pt&#8221;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">padding<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-t\">True<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">truncation<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-t\">True<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-st\">to<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">model<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">device<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-e\">with <\/span><span class=\"crayon-v\">torch<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">no_grad<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">output_ids<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">model<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">generate<\/span><span class=\"crayon-sy\">(<\/span><\/p>\n<p><span 
class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-o\">*<\/span><span class=\"crayon-o\">*<\/span><span class=\"crayon-v\">inputs<\/span><span class=\"crayon-sy\">,<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">max_new_tokens<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-v\">wave_max<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 <\/span><span class=\"crayon-p\"># whole wave decodes to the longest<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">pad_token_id<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-v\">tokenizer<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">eos_token_id<\/span><span class=\"crayon-sy\">,<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">do_sample<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-t\">False<\/span><span class=\"crayon-sy\">,<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">width<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">inputs<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">input_ids<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">shape<\/span><span 
class=\"crayon-sy\">[<\/span><span class=\"crayon-cn\">1<\/span><span class=\"crayon-sy\">]<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-i\">f<\/span><span class=\"crayon-s\">&#8221;\u00a0\u00a0*** batch barrier: all {len(wave)} slots wait for the longest &#8220;<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-i\">f<\/span><span class=\"crayon-s\">&#8220;({wave_max} tokens) ***&#8221;<\/span><span class=\"crayon-sy\">,<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">flush<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-t\">True<\/span><span class=\"crayon-sy\">,<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">slot<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">req_id<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">prompt<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">cap<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">row<\/span><span 
class=\"crayon-sy\">)<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">enumerate<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-e\">zip<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">wave<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">output_ids<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">text<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">prompt<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">+<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">tokenizer<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">decode<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">row<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">width<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-v\">width<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">+<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">cap<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">skip_special_tokens<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-t\">True<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">results<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">req_id<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span 
class=\"crayon-h\"> <\/span><span class=\"crayon-e\">text<\/span><\/p>\n<p><span class=\"crayon-e\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-i\">f<\/span><span class=\"crayon-s\">&#8221;\u00a0\u00a0-- slot {slot} done\u00a0\u00a0req {req_id} ({cap}\/{wave_max} tokens): {text[:90]}&#8221;<\/span><span class=\"crayon-sy\">,<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">flush<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-t\">True<\/span><span class=\"crayon-sy\">,<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">return<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">results<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">k<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">k<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">sorted<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">results<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">]<\/span><\/p>\n<p>\u00a0<\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-sy\">@<\/span><span class=\"crayon-e\">dataclass<\/span><\/p>\n<p><span class=\"crayon-t\">class<\/span><span class=\"crayon-h\"> 
<\/span><span class=\"crayon-v\">Sequence<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-s\">&#8220;&#8221;<\/span><span class=\"crayon-s\">&#8220;State for a single in-flight sequence.&#8221;<\/span><span class=\"crayon-s\">&#8220;&#8221;<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">req_id<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-t\">int<\/span><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 <\/span><span class=\"crayon-p\"># original request index (for ordering results)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">prompt<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">str<\/span><\/p>\n<p><span class=\"crayon-e\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">max_new_tokens<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-t\">int<\/span><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 <\/span><span class=\"crayon-p\"># per-request cap so short requests finish early<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-p\"># Tokens to feed on the NEXT step: the whole prompt right after admission<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-p\"># (prefill), then a single token per step (decode).<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">pending_ids<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">list<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-t\">int<\/span><span 
class=\"crayon-sy\">]<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-p\"># Per-sequence KV-cache; None until this sequence has run once.<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">kv_cache<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">Optional<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">DynamicCache<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">None<\/span><\/p>\n<p><span class=\"crayon-e\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">kv_len<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-t\">int<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">0<\/span><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 <\/span><span class=\"crayon-p\"># number of cached tokens (prompt + generated)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">tokens_generated<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-t\">int<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">0<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">output_ids<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">list<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-t\">int<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span 
class=\"crayon-e\">field<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">default_factory<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-v\">list<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-e\">def <\/span><span class=\"crayon-e\">_make_cache<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">layers_kv<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">list<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">tuple<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">torch<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">Tensor<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">torch<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">Tensor<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">-&gt;<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">DynamicCache<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-s\">&quot;&quot;<\/span><span class=\"crayon-s\">&quot;Build a DynamicCache from explicit per-layer (keys, values) tensors.<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0We SET the tensors directly instead of calling DynamicLayer.update() (which<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0would append), because we are assembling caches from scratch each step.<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0&quot;<\/span><span class=\"crayon-s\">&quot;&quot;<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">cache<\/span><span class=\"crayon-h\"> 
<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">DynamicCache<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">k<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">v<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">layers_kv<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">layer<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">DynamicLayer<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">layer<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">lazy_initialization<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">k<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">v<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">layer<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">keys<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">k<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">layer<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">values<\/span><span class=\"crayon-h\"> <\/span><span 
class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">v<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">cache<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">layers<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">append<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">layer<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">return<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">cache<\/span><\/p>\n<p>\u00a0<\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-e\">def <\/span><span class=\"crayon-e\">_ragged_step<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">seqs<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">list<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">Sequence<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">model<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">device<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">dtype<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">-&gt;<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">list<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-t\">int<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-s\">&quot;&quot;<\/span><span class=\"crayon-s\">&quot;Run ONE packed forward pass over every active sequence.<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0All 
sequences are flattened into a single row (batch dim = 1):<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0input_ids\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0[1, total_q]\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0&#8211; every sequence&#8217;s pending tokens<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0position_ids\u00a0\u00a0 [1, total_q]\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0&#8211; each token&#8217;s position in ITS sequence<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0attention_mask [1, 1, total_q, total_kv + total_q]\u00a0\u00a0&#8211; block-diagonal causal<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0past_key_values\u00a0\u00a0packed cache [1, H, total_kv, D]<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0total_q\u00a0\u00a0= sum of pending tokens (1 per decoding seq, prompt_len per new seq)<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0total_kv = sum of already-cached tokens across sequences<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0Returns the next greedy token for each sequence (same order as ``seqs``).<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0&quot;<\/span><span class=\"crayon-s\">&quot;&quot;<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">q_lens<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-e\">len<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">s<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">pending_ids<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-h\"> 
<\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">s<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">seqs<\/span><span class=\"crayon-sy\">]<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">total_q<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">sum<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">q_lens<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">total_kv<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">sum<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">s<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">kv_len <\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">s<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">seqs<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-p\"># Packed inputs: concatenate every sequence&#8217;s pending tokens into one row.<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">flat_ids<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-i\">t<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">s<\/span><span class=\"crayon-h\"> <\/span><span 
class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">seqs <\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">t<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">s<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">pending_ids<\/span><span class=\"crayon-sy\">]<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">input_ids<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">torch<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">tensor<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">flat_ids<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">dtype<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-v\">torch<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-t\">long<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">device<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-v\">device<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-p\"># Tag every KEY and every QUERY token with (sequence index, position-in-sequence).<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-p\"># Key space is laid out as [ cached tokens | this step&#8217;s new tokens ], matching<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-p\"># how the model appends new KV to the end of the packed cache.<\/span><\/p>\n<p><span 
class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">key_seq<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">key_pos<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-sy\">]<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">si<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">s<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">enumerate<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">seqs<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 <\/span><span class=\"crayon-p\"># cached block<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">p<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">range<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">s<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">kv_len<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">key_seq<\/span><span class=\"crayon-sy\">.<\/span><span 
class=\"crayon-e\">append<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">si<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">key_pos<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">append<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">p<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">q_seq<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">q_pos<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-sy\">]<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">si<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">s<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">enumerate<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">seqs<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 <\/span><span class=\"crayon-p\"># new block (also queries)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">j<\/span><span class=\"crayon-h\"> <\/span><span 
class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">range<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-e\">len<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">s<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">pending_ids<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">pos<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">s<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">kv_len<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">+<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">j<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">q_seq<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">append<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">si<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">q_pos<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">append<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">pos<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">key_seq<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">append<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">si<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span 
class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">key_pos<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">append<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">pos<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">q_seq_t<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">torch<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">tensor<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">q_seq<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">device<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-v\">device<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">q_pos_t<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">torch<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">tensor<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">q_pos<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">device<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-v\">device<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">key_seq_t<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">torch<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">tensor<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">key_seq<\/span><span class=\"crayon-sy\">,<\/span><span 
class=\"crayon-h\"> <\/span><span class=\"crayon-v\">device<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-v\">device<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">key_pos_t<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">torch<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">tensor<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">key_pos<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">device<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-v\">device<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-p\"># Each token&#8217;s positional embedding uses its own sequence position, not its<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-p\"># offset in the packed row.<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">position_ids<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">q_pos_t<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">unsqueeze<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-cn\">0<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 <\/span><span class=\"crayon-p\"># [1, total_q]<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-p\"># Block-diagonal causal mask: a query may attend to a key only if they belong<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-p\"># to the SAME sequence 
(block-diagonal) and the key is not in the future<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-p\"># (causal).\u00a0\u00a0This is the whole trick &#8211; it makes packing equivalent to running<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-p\"># each sequence separately.\u00a0\u00a00.0 = attend, large-negative = blocked (additive).<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">same<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">q_seq_t<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">None<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">==<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">key_seq_t<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">None<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-sy\">]<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">causal<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">key_pos_t<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">None<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">&lt;=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">q_pos_t<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span 
class=\"crayon-v\">None<\/span><span class=\"crayon-sy\">]<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">allowed<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">same<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">&amp;<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">causal<\/span><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 <\/span><span class=\"crayon-p\"># [total_q, total_kv + total_q]<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">attn_mask<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">torch<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">zeros<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-cn\">1<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">1<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">total_q<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">total_kv<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">+<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">total_q<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">dtype<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-v\">dtype<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">device<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-v\">device<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span 
class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">attn_mask<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">masked_fill_<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-o\">~<\/span><span class=\"crayon-v\">allowed<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">None<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">None<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">torch<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">finfo<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">dtype<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">min<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-p\"># Packed KV-cache: concatenate each sequence&#8217;s cache along the time axis.<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-p\"># Freshly admitted sequences (kv_len == 0) contribute nothing here.<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">cached<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-i\">s<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">s<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">seqs <\/span><span class=\"crayon-st\">if<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">s<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">kv_len<\/span><span 
class=\"crayon-h\"> <\/span><span class=\"crayon-o\">&gt;<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">0<\/span><span class=\"crayon-sy\">]<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">if<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">cached<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">num_layers<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">len<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">cached<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-cn\">0<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">kv_cache<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">layers<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">layers_kv<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-sy\">]<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">l<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">range<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">num_layers<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">ks<\/span><span class=\"crayon-h\"> <\/span><span 
class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">torch<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">cat<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">s<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">kv_cache<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">layers<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">l<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">keys <\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">s<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">cached<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">dim<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">2<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">vs<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">torch<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">cat<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">s<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">kv_cache<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">layers<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">l<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">values <\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">s<\/span><span 
class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">cached<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">dim<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">2<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">layers_kv<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">append<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">ks<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">vs<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">past<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">_make_cache<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">layers_kv<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">else<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">past<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">DynamicCache<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-e\">with <\/span><span class=\"crayon-v\">torch<\/span><span class=\"crayon-sy\">.<\/span><span 
class=\"crayon-e\">no_grad<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">out<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">model<\/span><span class=\"crayon-sy\">(<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">input_ids<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-v\">input_ids<\/span><span class=\"crayon-sy\">,<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">attention_mask<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-v\">attn_mask<\/span><span class=\"crayon-sy\">,<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">position_ids<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-v\">position_ids<\/span><span class=\"crayon-sy\">,<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">past_key_values<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-v\">past<\/span><span class=\"crayon-sy\">,<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">use_cache<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-t\">True<\/span><span class=\"crayon-sy\">,<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span 
class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-p\"># Greedy next token for each sequence: read the logits at its LAST pending<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-p\"># token (for a prefilling sequence that is the final prompt token).<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">logits<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">out<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">logits<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-cn\">0<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 <\/span><span class=\"crayon-p\"># [total_q, vocab]<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">offsets<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">last_idx<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">off<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">0<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">ql <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span 
class=\"crayon-v\">q_lens<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">offsets<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">append<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">off<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">last_idx<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">append<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">off<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">+<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">ql<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">-<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">1<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">off<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">+=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">ql<\/span><\/p>\n<p><span class=\"crayon-e\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">next_tokens<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-t\">int<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">logits<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">i<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">argmax<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span 
class=\"crayon-i\">i<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">last_idx<\/span><span class=\"crayon-sy\">]<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-p\"># Scatter the newly computed KV back to each sequence.\u00a0\u00a0The output cache is<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-p\"># [ old packed block | new packed block ]; slice this step&#8217;s new block per<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-p\"># sequence and append it to that sequence&#8217;s own cache.<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">out_kv<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">out<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">past_key_values<\/span><\/p>\n<p><span class=\"crayon-e\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">num_layers<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">len<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">out_kv<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">layers<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">si<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">s<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">enumerate<\/span><span class=\"crayon-sy\">(<\/span><span 
class=\"crayon-v\">seqs<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">o<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">ql<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">offsets<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">si<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">q_lens<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">si<\/span><span class=\"crayon-sy\">]<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">layers_kv<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-sy\">]<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">l<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">range<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">num_layers<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">k_new<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">out_kv<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">layers<\/span><span class=\"crayon-sy\">[<\/span><span 
class=\"crayon-v\">l<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">keys<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">total_kv<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">+<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">o<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">total_kv<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">+<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">o<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">+<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">ql<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-sy\">]<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">v_new<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">out_kv<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">layers<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">l<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">values<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">total_kv<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">+<\/span><span class=\"crayon-h\"> 
<\/span><span class=\"crayon-v\">o<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">total_kv<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">+<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">o<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">+<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">ql<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-sy\">]<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">if<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">s<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">kv_cache <\/span><span class=\"crayon-st\">is<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">None<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">layers_kv<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">append<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">k_new<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">v_new<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">else<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">layers_kv<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">append<\/span><span 
class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">(<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">torch<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">cat<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">s<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">kv_cache<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">layers<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">l<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">keys<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">k_new<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">dim<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">2<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">,<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">torch<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">cat<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">s<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">kv_cache<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">layers<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">l<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">values<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">v_new<\/span><span 
class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">dim<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">2<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">,<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">s<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">kv_cache<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">_make_cache<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">layers_kv<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">s<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">kv_len<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">+=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">ql<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-e\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">return<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">next_tokens<\/span><\/p>\n<p>\u00a0<\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-e\">def <\/span><span class=\"crayon-e\">visualize_ragged_step<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">seqs<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">list<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">Sequence<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span 
class=\"crayon-v\">tokenizer<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">title<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">str<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">slot_ids<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">list<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-t\">int<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">-&gt;<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">None<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-s\">&quot;&quot;<\/span><span class=\"crayon-s\">&quot;Illustrative print of ONE packed step: the concatenated input row and the<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0block-diagonal causal attention mask.<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0This mirrors the masking logic in _ragged_step (recomputed here as a boolean<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0grid purely for display) so you can SEE that sequences are packed together<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0yet isolated by the mask.\u00a0\u00a0Each sequence gets a letter A, B, C, ...<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0# = a query may attend to that key\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0. 
= blocked<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0&quot;<\/span><span class=\"crayon-s\">&quot;&quot;<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">labels<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-e\">chr<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-e\">ord<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-s\">&quot;A&quot;<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">+<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">s<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">req_id<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">s<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">seqs<\/span><span class=\"crayon-sy\">]<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">q_lens<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-e\">len<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">s<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">pending_ids<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">s<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">seqs<\/span><span class=\"crayon-sy\">]<\/span><\/p>\n<p><span 
class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">total_q<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">sum<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">q_lens<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">total_kv<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">sum<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">s<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">kv_len <\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">s<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">seqs<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-i\">f<\/span><span class=\"crayon-s\">&quot;\\n{'=' * 72}\\n\u00a0\u00a0{title}&quot;<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-i\">f<\/span><span class=\"crayon-s\">&quot;\u00a0\u00a0total_q={total_q} tokens fed this step\u00a0\u00a0|\u00a0\u00a0total_kv={total_kv} cached&quot;<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-i\">f<\/span><span class=\"crayon-s\">&quot;\u00a0\u00a0{len(seqs)} sequences packed into ONE unpadded row of shape [1, 
{total_q}]:\\n&quot;<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-p\"># The concatenated tokens, grouped per sequence (this is the &#8220;ragged&#8221; row).<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">i<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">s<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">enumerate<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">seqs<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">kind<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">f<\/span><span class=\"crayon-s\">&quot;PREFILL({q_lens[i]})&quot;<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">if<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">s<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">kv_len<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">==<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">0<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">else<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">f<\/span><span class=\"crayon-s\">&quot;decode({q_lens[i]})&quot;<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">toks<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-s\">&quot; &quot;<\/span><span 
class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">join<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-e\">repr<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">tokenizer<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">decode<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">t<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">t<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">s<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">pending_ids<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">if<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">len<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">toks<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">&gt;<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">66<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">toks<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">toks<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-cn\">63<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">+<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-s\">&quot;...&quot;<\/span><\/p>\n<p><span 
class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-i\">f<\/span><span class=\"crayon-s\">&quot;\u00a0\u00a0\u00a0\u00a0{labels[i]} = slot {slot_ids[i]}\u00a0\u00a0{kind}\u00a0\u00a0{toks}&quot;<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-p\"># Rebuild the block-diagonal causal mask as a boolean grid for display.<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">key_seq<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">key_pos<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-sy\">]<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">si<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">s<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">enumerate<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">seqs<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-p\"># cached keys<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">key_seq<\/span><span class=\"crayon-h\"> <\/span><span 
class=\"crayon-o\">+=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">si<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">*<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">s<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">kv_len<\/span><\/p>\n<p><span class=\"crayon-e\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">key_pos<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">+=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">list<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-e\">range<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">s<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">kv_len<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">q_seq<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">q_pos<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-sy\">]<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">si<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">s<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">enumerate<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">seqs<\/span><span class=\"crayon-sy\">)<\/span><span 
class=\"crayon-o\">:<\/span><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-p\"># new keys \/ queries<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">j<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">range<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">q_lens<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">si<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">q_seq<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">append<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">si<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">q_pos<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">append<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">s<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">kv_len<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">+<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">j<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">key_seq<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">+=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">q_seq<\/span><\/p>\n<p><span 
class=\"crayon-e\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">key_pos<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">+=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">q_pos<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-e\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">q_seq_t<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">q_pos_t<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">torch<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">tensor<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">q_seq<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">torch<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">tensor<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">q_pos<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">key_seq_t<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">key_pos_t<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">torch<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">tensor<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">key_seq<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">torch<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">tensor<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">key_pos<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span 
class=\"crayon-v\">allowed<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">q_seq_t<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">None<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">==<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">key_seq_t<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">None<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">&amp;<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">key_pos_t<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">None<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">&lt;=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">q_pos_t<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">None<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">K<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">len<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">key_seq<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span 
class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-e\">def <\/span><span class=\"crayon-e\">row_str<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">cells<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-p\"># Space between sequence groups; &#8216; | &#8216; at the cached -&gt; new-tokens split.<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">out<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-sy\">]<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">ki <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">range<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">K<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">if<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">total_kv<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">&gt;<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">0<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">and<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">ki<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">==<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">total_kv<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span 
class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">out<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">append<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-s\">&quot; | &quot;<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-e\">elif <\/span><span class=\"crayon-v\">ki<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">&gt;<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">0<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">and<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">key_seq<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">ki<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">!=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">key_seq<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">ki<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">-<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">1<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">out<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">append<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-s\">&quot; &quot;<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">out<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">append<\/span><span class=\"crayon-sy\">(<\/span><span 
class=\"crayon-v\">cells<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">ki<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">return<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-s\">&quot;&quot;<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">join<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">out<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-e\">def <\/span><span class=\"crayon-e\">line<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">left<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">cells<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">return<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">f<\/span><span class=\"crayon-s\">&quot;{left:&gt;7} &quot;<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">+<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">row_str<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">cells<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-i\">f<\/span><span class=\"crayon-s\">&quot;\\n\u00a0\u00a0block-diagonal causal mask\u00a0\u00a0(row = query, col = key)\u00a0\u00a0 # attend\u00a0\u00a0 . 
blocked&quot;<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">if<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">total_kv<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">&gt;<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">0<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-i\">f<\/span><span class=\"crayon-s\">&quot;\u00a0\u00a0\u00a0\u00a0key layout:\u00a0\u00a0[ cached KV\u00a0\u00a0|\u00a0\u00a0this step&#8217;s new tokens ]&quot;<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-e\">line<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-s\">&quot;keys:&quot;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">labels<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">key_seq<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">ki<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">ki <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">range<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">K<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">for<\/span><span 
class=\"crayon-h\"> <\/span><span class=\"crayon-e\">qi <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">range<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">total_q<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">cells<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-s\">&quot;#&quot;<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">if<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">allowed<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">qi<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">ki<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">else<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-s\">&quot;.&quot;<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">ki <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">range<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">K<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">]<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-e\">line<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-i\">f<\/span><span class=\"crayon-s\">&quot;{labels[q_seq[qi]]} p{q_pos[qi]}&quot;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">cells<\/span><span 
class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-e\">def <\/span><span class=\"crayon-e\">continuous_batching<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">requests<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">list<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">tuple<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">str<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-t\">int<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">tokenizer<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">model<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">-&gt;<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">list<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">str<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-s\">&quot;&quot;<\/span><span class=\"crayon-s\">&quot;Ragged continuous batching: dynamic scheduling + packed prefill\/decode.<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0Scheduling policy:<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0- Up to BATCH_SIZE sequences run concurrently.<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0- A newly admitted sequence is queued with its full prompt as the next<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0tokens to feed; its prefill then happens packed into the next 
step<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0alongside everyone else&#8217;s decode.<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0- Every step runs ONE packed forward pass across all active slots.<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0- When a sequence finishes it is immediately replaced by the next prompt.<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0The admission log shows slots being reused (iteration-level scheduling).<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0Two representative steps are visualized: the first step (all prompts being<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0prefilled at once) and the first step that fuses a new prompt&#8217;s prefill with<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0other sequences&#8217; decode tokens.<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a0\u00a0\u00a0&quot;<\/span><span class=\"crayon-s\">&quot;&quot;<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">device<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">model<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">device<\/span><\/p>\n<p><span class=\"crayon-e\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">dtype<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">next<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">model<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">parameters<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">.<\/span><span 
class=\"crayon-e\">dtype<\/span><\/p>\n<p><span class=\"crayon-e\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">queue<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">list<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-e\">enumerate<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">requests<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 <\/span><span class=\"crayon-p\"># (req_id, (prompt, max_new_tokens))<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">slots<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">list<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">Optional<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">Sequence<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">None<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">*<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">BATCH_SIZE<\/span><\/p>\n<p><span class=\"crayon-e\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">results<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">dict<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-t\">int<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">str<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span 
class=\"crayon-sy\">{<\/span><span class=\"crayon-sy\">}<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-e\">def <\/span><span class=\"crayon-e\">_admit<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">slot_idx<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-t\">int<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">-&gt;<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">None<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">if<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">not<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">queue<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">slots<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">slot_idx<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">None<\/span><\/p>\n<p><span class=\"crayon-e\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">return<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">req_id<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">prompt<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">max_new_tokens<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> 
<\/span><span class=\"crayon-v\">queue<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">pop<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-cn\">0<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">prompt_ids<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">tokenizer<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">prompt<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-s\">&quot;input_ids&quot;<\/span><span class=\"crayon-sy\">]<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">slots<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">slot_idx<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">Sequence<\/span><span class=\"crayon-sy\">(<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">req_id<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-v\">req_id<\/span><span class=\"crayon-sy\">,<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">prompt<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-v\">prompt<\/span><span class=\"crayon-sy\">,<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">max_new_tokens<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-v\">max_new_tokens<\/span><span class=\"crayon-sy\">,<\/span><\/p>\n<p><span 
class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">pending_ids<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-e\">list<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">prompt_ids<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-p\"># prefill rides the next step<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-i\">f<\/span><span class=\"crayon-s\">&quot;\u00a0\u00a0++ [step {step:3d}] slot {slot_idx} &quot;<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-i\">f<\/span><span class=\"crayon-s\">&quot;({max_new_tokens} tok cap): {prompt!r}&quot;<\/span><span class=\"crayon-sy\">,<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">flush<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-t\">True<\/span><span class=\"crayon-sy\">,<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-p\"># Fill the pool with the first batch of prompts (step 0 = before any decode).<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span 
class=\"crayon-v\">step<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">0<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">i<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">range<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">BATCH_SIZE<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-e\">_admit<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">i<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">printed_mixed<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-t\">False<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">while<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">any<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-i\">s<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">is<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">not<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">None <\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">s<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">slots<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span 
class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">step<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">+=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">1<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">active<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">i<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">s<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">i<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">s<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">enumerate<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">slots<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">if<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">s<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">is<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">not<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">None<\/span><span class=\"crayon-sy\">]<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">seqs<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-i\">s<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">for<\/span><span 
class=\"crayon-h\"> <\/span><span class=\"crayon-v\">_<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">s<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">active<\/span><span class=\"crayon-sy\">]<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">slot_ids<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-i\">i<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">i<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">_<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">active<\/span><span class=\"crayon-sy\">]<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-p\"># Visualize a couple of representative steps so the packing is visible<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-p\"># (printing every step would be far too much output).<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">mixed<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">any<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">s<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">kv_len<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">==<\/span><span class=\"crayon-h\"> <\/span><span 
class=\"crayon-cn\">0<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">s<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">seqs<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">and<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">any<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">s<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">kv_len<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">&gt;<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">0<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">s<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">seqs<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">if<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">step<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">==<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">1<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-e\">visualize_ragged_step<\/span><span class=\"crayon-sy\">(<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">seqs<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">tokenizer<\/span><span class=\"crayon-sy\">,<\/span><span 
class=\"crayon-h\"> <\/span><span class=\"crayon-i\">f<\/span><span class=\"crayon-s\">&#8220;STEP {step}\u00a0\u00a0&#8211;\u00a0\u00a0prompts packed together (all PREFILL)&#8221;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">slot_ids<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-e\">elif <\/span><span class=\"crayon-e\">mixed <\/span><span class=\"crayon-st\">and<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">not<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">printed_mixed<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-e\">visualize_ragged_step<\/span><span class=\"crayon-sy\">(<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">seqs<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">tokenizer<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">f<\/span><span class=\"crayon-s\">&#8220;STEP {step}\u00a0\u00a0&#8211;\u00a0\u00a0PREFILL + DECODE fused in one pass&#8221;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">slot_ids<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">printed_mixed<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-t\">True<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span 
class=\"crayon-p\"># ONE packed forward pass (prefill + decode fused, no padding).<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">next_tokens<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">_ragged_step<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">seqs<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">model<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">device<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">dtype<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">slot_idx<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">seq<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">tok <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">zip<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">active<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">next_tokens<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">seq<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">output_ids<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">append<\/span><span 
class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">tok<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">seq<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">tokens_generated<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">+=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">1<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">seq<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">pending_ids<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">tok<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-p\"># next step: a single decode token<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">if<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">tok<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">==<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">tokenizer<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">eos_token_id <\/span><span class=\"crayon-st\">or<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">seq<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">tokens_generated<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">&gt;=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">seq<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">max_new_tokens<\/span><span 
class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">result_text<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">seq<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">prompt<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">+<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">\\<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">tokenizer<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">decode<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">seq<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">output_ids<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">skip_special_tokens<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-t\">True<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">results<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">seq<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">req_id<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">result_text<\/span><\/p>\n<p><span class=\"crayon-e\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><\/p>\n<p><span 
class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-i\">f<\/span><span class=\"crayon-s\">&#8221;\u00a0\u00a0&#8212; step {step:3d}] slot {slot_idx} done\u00a0\u00a0req {seq.req_id} &#8220;<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-i\">f<\/span><span class=\"crayon-s\">&#8220;({seq.tokens_generated}\/{seq.max_new_tokens} tokens): {result_text[:90]}&#8221;<\/span><span class=\"crayon-sy\">,<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">flush<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-t\">True<\/span><span class=\"crayon-sy\">,<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-e\">_admit<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">slot_idx<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">return<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">results<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">k<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">k<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span 
class=\"crayon-h\"> <\/span><span class=\"crayon-e\">sorted<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">results<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">]<\/span><\/p>\n<p>\u00a0<\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-e\">def <\/span><span class=\"crayon-e\">main<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-i\">f<\/span><span class=\"crayon-s\">&#8220;Loading {MODEL_ID}&#8221;<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">tokenizer<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">AutoTokenizer<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">from_pretrained<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">MODEL_ID<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">tokenizer<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">pad_token<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">tokenizer<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">eos<\/span><span class=\"crayon-sy\">_<\/span>token<\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-p\"># Pick the fastest available device.\u00a0\u00a0On Apple Silicon (M1\/M2\/&#8230;) this is<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-p\"># the MPS GPU.\u00a0\u00a0We keep float32 on MPS on purpose: float16 there flips a 
few<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-p\"># greedy ties, which would break the &#8220;static == continuous, token-for-token&#8221;<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-p\"># property this demo relies on.<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">if<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">torch<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">cuda<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">is_available<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">device<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">dtype<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-s\">&#8220;cuda&#8221;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">torch<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">float16<\/span><\/p>\n<p><span class=\"crayon-e\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-e\">elif <\/span><span class=\"crayon-v\">torch<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">backends<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">mps<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">is_available<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">device<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> 
<\/span><span class=\"crayon-v\">dtype<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-s\">&#8220;mps&#8221;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">torch<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">float32<\/span><\/p>\n<p><span class=\"crayon-e\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">else<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">device<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">dtype<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-s\">&#8220;cpu&#8221;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">torch<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">float32<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-e\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">model<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">AutoModelForCausalLM<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">from_pretrained<\/span><span class=\"crayon-sy\">(<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">MODEL_ID<\/span><span class=\"crayon-sy\">,<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">dtype<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-v\">dtype<\/span><span class=\"crayon-sy\">,<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span 
class=\"crayon-v\">attn_implementation<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-s\">&#8220;eager&#8221;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\">\u00a0\u00a0 <\/span><span class=\"crayon-p\"># use our custom 4D mask directly<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">model<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">eval<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">model<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-st\">to<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">device<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-i\">f<\/span><span class=\"crayon-s\">&#8220;Running on {device} ({dtype})\\n&#8221;<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">requests<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">[<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-s\">&#8220;The capital of France is&#8221;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">6<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">,<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-sy\">(<\/span><span 
class=\"crayon-s\">&#8220;Today&#8217;s weather is so&#8221;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">50<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">,<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-s\">&#8220;In machine learning, a transformer is&#8221;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">300<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">,<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-s\">&#8220;Once upon a time in a land far away,&#8221;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">30<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">,<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-s\">&#8220;Quantum computing differs from classical computing because&#8221;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">180<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">,<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-s\">&#8220;The history of the Roman Empire began&#8221;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">45<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">,<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-sy\">]<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span 
class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-s\">&#8220;=== Static batching ===&#8221;<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-e\">_device_sync<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">model<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">start<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">time<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">perf_counter<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-e\">static_batching<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">requests<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">tokenizer<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">model<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-e\">_device_sync<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">model<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">static_elapsed<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">time<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">perf_counter<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">-<\/span><span class=\"crayon-h\"> <\/span><span 
class=\"crayon-e\">start<\/span><\/p>\n<p><span class=\"crayon-e\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-i\">f<\/span><span class=\"crayon-s\">&#8220;\\nStatic batching elapsed: {static_elapsed:.2f}s\\n&#8221;<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-s\">&#8220;=== Continuous batching (ragged) ===&#8221;<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-e\">_device_sync<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">model<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">start<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">time<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">perf_counter<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-e\">continuous_batching<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">requests<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">tokenizer<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">model<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-e\">_device_sync<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">model<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span 
class=\"crayon-v\">continuous_elapsed<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">time<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">perf_counter<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">-<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">start<\/span><\/p>\n<p><span class=\"crayon-e\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-i\">f<\/span><span class=\"crayon-s\">&#8220;\\nContinuous batching elapsed: {continuous_elapsed:.2f}s&#8221;<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-st\">if<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">__name__<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">==<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-s\">&#8220;__main__&#8221;<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-e\">main<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>&#8220;&#8221;&#8220; Continuous batching = iteration-level scheduling + ragged (packed) batching. \u00a0 Two approaches are compared (both run BATCH_SIZE sequences concurrently, so the comparison is slot-for-slot fair): \u00a0 \u00a0\u00a01. 
Static batching (baseline): \u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 Prompts are processed BATCH_SIZE at a time.\u00a0\u00a0Each wave is padded to a \u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 common length and run together until the LONGEST request in<\/p>\n","protected":false},"author":1,"featured_media":180150,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[42],"tags":[],"class_list":{"0":"post-180149","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-ai"},"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v26.4 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Serving Multiple Users at Once: How Continuous Batching Keeps LLM Inference Efficient - Ktromedia<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/ktromedia.com\/?p=180149\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Serving Multiple Users at Once: How Continuous Batching Keeps LLM Inference Efficient - Ktromedia\" \/>\n<meta property=\"og:description\" content=\"&#8220;&#8221;&#8220; Continuous batching = iteration-level scheduling + ragged (packed) batching. \u00a0 Two approaches are compared (both run BATCH_SIZE sequences concurrently, so the comparison is slot-for-slot fair): \u00a0 \u00a0\u00a01. 
Static batching (baseline): \u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 Prompts are processed BATCH_SIZE at a time.\u00a0\u00a0Each wave is padded to a \u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 common length and run together until the LONGEST request in\" \/>\n<meta property=\"og:url\" content=\"https:\/\/ktromedia.com\/?p=180149\" \/>\n<meta property=\"og:site_name\" content=\"Ktromedia\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/KTROMedia\/\" \/>\n<meta property=\"article:published_time\" content=\"2026-06-09T18:06:37+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/ktromedia.com\/wp-content\/uploads\/2026\/06\/Serving-Multiple-Users-at-Once-How-Continuous-Batching-Keeps-LLM.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"2560\" \/>\n\t<meta property=\"og:image:height\" content=\"1707\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"KTRO TEAM\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"KTRO TEAM\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"14 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/ktromedia.com\/?p=180149#article\",\"isPartOf\":{\"@id\":\"https:\/\/ktromedia.com\/?p=180149\"},\"author\":{\"name\":\"KTRO TEAM\",\"@id\":\"https:\/\/ktromedia.com\/#\/schema\/person\/612bf2fbac107722ea365932cdd35f5b\"},\"headline\":\"Serving Multiple Users at Once: How Continuous Batching Keeps LLM Inference Efficient\",\"datePublished\":\"2026-06-09T18:06:37+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/ktromedia.com\/?p=180149\"},\"wordCount\":2750,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/ktromedia.com\/#organization\"},\"image\":{\"@id\":\"https:\/\/ktromedia.com\/?p=180149#primaryimage\"},\"thumbnailUrl\":\"https:\/\/ktromedia.com\/wp-content\/uploads\/2026\/06\/Serving-Multiple-Users-at-Once-How-Continuous-Batching-Keeps-LLM.jpg\",\"articleSection\":[\"\u4eba\u5de5\u667a\u80fd\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/ktromedia.com\/?p=180149#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/ktromedia.com\/?p=180149\",\"url\":\"https:\/\/ktromedia.com\/?p=180149\",\"name\":\"Serving Multiple Users at Once: How Continuous Batching Keeps LLM Inference Efficient - 
Ktromedia\",\"isPartOf\":{\"@id\":\"https:\/\/ktromedia.com\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/ktromedia.com\/?p=180149#primaryimage\"},\"image\":{\"@id\":\"https:\/\/ktromedia.com\/?p=180149#primaryimage\"},\"thumbnailUrl\":\"https:\/\/ktromedia.com\/wp-content\/uploads\/2026\/06\/Serving-Multiple-Users-at-Once-How-Continuous-Batching-Keeps-LLM.jpg\",\"datePublished\":\"2026-06-09T18:06:37+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/ktromedia.com\/?p=180149#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/ktromedia.com\/?p=180149\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/ktromedia.com\/?p=180149#primaryimage\",\"url\":\"https:\/\/ktromedia.com\/wp-content\/uploads\/2026\/06\/Serving-Multiple-Users-at-Once-How-Continuous-Batching-Keeps-LLM.jpg\",\"contentUrl\":\"https:\/\/ktromedia.com\/wp-content\/uploads\/2026\/06\/Serving-Multiple-Users-at-Once-How-Continuous-Batching-Keeps-LLM.jpg\",\"width\":2560,\"height\":1707},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/ktromedia.com\/?p=180149#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/ktromedia.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Serving Multiple Users at Once: How Continuous Batching Keeps LLM Inference Efficient\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/ktromedia.com\/#website\",\"url\":\"https:\/\/ktromedia.com\/\",\"name\":\"Ktromedia\",\"description\":\"KTRO MEDIA Crypto 
News\",\"publisher\":{\"@id\":\"https:\/\/ktromedia.com\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/ktromedia.com\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/ktromedia.com\/#organization\",\"name\":\"Ktromedia\",\"url\":\"https:\/\/ktromedia.com\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/ktromedia.com\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/ktromedia.com\/wp-content\/uploads\/2025\/11\/ktroicon.png\",\"contentUrl\":\"https:\/\/ktromedia.com\/wp-content\/uploads\/2025\/11\/ktroicon.png\",\"width\":250,\"height\":250,\"caption\":\"Ktromedia\"},\"image\":{\"@id\":\"https:\/\/ktromedia.com\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/KTROMedia\/\",\"https:\/\/www.linkedin.com\/company\/ktro-media\/\",\"https:\/\/t.me\/ktrogroup\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/ktromedia.com\/#\/schema\/person\/612bf2fbac107722ea365932cdd35f5b\",\"name\":\"KTRO TEAM\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/ktromedia.com\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/ktromedia.com\/wp-content\/uploads\/2025\/10\/cropped-Untitled-design-7-1-150x150.png\",\"contentUrl\":\"https:\/\/ktromedia.com\/wp-content\/uploads\/2025\/10\/cropped-Untitled-design-7-1-150x150.png\",\"caption\":\"KTRO TEAM\"},\"description\":\"KTRO MEDIA 
is a global Chinese-language WEB3 media company. We are committed to providing the latest news, insights, and trend analysis for the blockchain and fintech sectors. Our aim is to offer users worldwide high-quality, comprehensive information services so that they can better follow the latest developments in the blockchain and fintech industries. We also hope to help more outstanding WEB3 products find more and better resources, so that the field can mature. Our coverage spans blockchain, cryptocurrency, smart contracts, DeFi, NFT, and the Web3 ecosystem. Our reporting draws not only on industry experts and pioneers but also on our own analysis and views. We have teams in many countries and regions that provide readers with localized reporting and analysis. In addition to news reporting, we offer market research and consulting services. Our professional team can provide in-depth analysis of the blockchain and fintech industries and of market trends, to help you make better-informed investment decisions. 
Our mission is to become one of the most trusted sources of information for the global Chinese-language blockchain and fintech industry. We will keep working to provide readers with the latest, most comprehensive, and most reliable information services.\",\"sameAs\":[\"https:\/\/ktromedia.com\"],\"url\":\"https:\/\/ktromedia.com\/?author=1\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Serving Multiple Users at Once: How Continuous Batching Keeps LLM Inference Efficient - Ktromedia","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/ktromedia.com\/?p=180149","og_locale":"en_US","og_type":"article","og_title":"Serving Multiple Users at Once: How Continuous Batching Keeps LLM Inference Efficient - Ktromedia","og_description":"&#8220;&#8221;&#8220; Continuous batching = iteration-level scheduling + ragged (packed) batching. \u00a0 Two approaches are compared (both run BATCH_SIZE sequences concurrently, so the comparison is slot-for-slot fair): \u00a0 \u00a0\u00a01. 
Static batching (baseline): \u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 Prompts are processed BATCH_SIZE at a time.\u00a0\u00a0Each wave is padded to a \u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 common length and run together until the LONGEST request in","og_url":"https:\/\/ktromedia.com\/?p=180149","og_site_name":"Ktromedia","article_publisher":"https:\/\/www.facebook.com\/KTROMedia\/","article_published_time":"2026-06-09T18:06:37+00:00","og_image":[{"width":2560,"height":1707,"url":"https:\/\/ktromedia.com\/wp-content\/uploads\/2026\/06\/Serving-Multiple-Users-at-Once-How-Continuous-Batching-Keeps-LLM.jpg","type":"image\/jpeg"}],"author":"KTRO TEAM","twitter_card":"summary_large_image","twitter_misc":{"Written by":"KTRO TEAM","Est. reading time":"14 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/ktromedia.com\/?p=180149#article","isPartOf":{"@id":"https:\/\/ktromedia.com\/?p=180149"},"author":{"name":"KTRO TEAM","@id":"https:\/\/ktromedia.com\/#\/schema\/person\/612bf2fbac107722ea365932cdd35f5b"},"headline":"Serving Multiple Users at Once: How Continuous Batching Keeps LLM Inference Efficient","datePublished":"2026-06-09T18:06:37+00:00","mainEntityOfPage":{"@id":"https:\/\/ktromedia.com\/?p=180149"},"wordCount":2750,"commentCount":0,"publisher":{"@id":"https:\/\/ktromedia.com\/#organization"},"image":{"@id":"https:\/\/ktromedia.com\/?p=180149#primaryimage"},"thumbnailUrl":"https:\/\/ktromedia.com\/wp-content\/uploads\/2026\/06\/Serving-Multiple-Users-at-Once-How-Continuous-Batching-Keeps-LLM.jpg","articleSection":["\u4eba\u5de5\u667a\u80fd"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/ktromedia.com\/?p=180149#respond"]}]},{"@type":"WebPage","@id":"https:\/\/ktromedia.com\/?p=180149","url":"https:\/\/ktromedia.com\/?p=180149","name":"Serving Multiple Users at Once: How Continuous Batching Keeps LLM Inference Efficient - 
Ktromedia","isPartOf":{"@id":"https:\/\/ktromedia.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/ktromedia.com\/?p=180149#primaryimage"},"image":{"@id":"https:\/\/ktromedia.com\/?p=180149#primaryimage"},"thumbnailUrl":"https:\/\/ktromedia.com\/wp-content\/uploads\/2026\/06\/Serving-Multiple-Users-at-Once-How-Continuous-Batching-Keeps-LLM.jpg","datePublished":"2026-06-09T18:06:37+00:00","breadcrumb":{"@id":"https:\/\/ktromedia.com\/?p=180149#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/ktromedia.com\/?p=180149"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/ktromedia.com\/?p=180149#primaryimage","url":"https:\/\/ktromedia.com\/wp-content\/uploads\/2026\/06\/Serving-Multiple-Users-at-Once-How-Continuous-Batching-Keeps-LLM.jpg","contentUrl":"https:\/\/ktromedia.com\/wp-content\/uploads\/2026\/06\/Serving-Multiple-Users-at-Once-How-Continuous-Batching-Keeps-LLM.jpg","width":2560,"height":1707},{"@type":"BreadcrumbList","@id":"https:\/\/ktromedia.com\/?p=180149#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/ktromedia.com\/"},{"@type":"ListItem","position":2,"name":"Serving Multiple Users at Once: How Continuous Batching Keeps LLM Inference Efficient"}]},{"@type":"WebSite","@id":"https:\/\/ktromedia.com\/#website","url":"https:\/\/ktromedia.com\/","name":"Ktromedia","description":"KTRO MEDIA Crypto 
News","publisher":{"@id":"https:\/\/ktromedia.com\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/ktromedia.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/ktromedia.com\/#organization","name":"Ktromedia","url":"https:\/\/ktromedia.com\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/ktromedia.com\/#\/schema\/logo\/image\/","url":"https:\/\/ktromedia.com\/wp-content\/uploads\/2025\/11\/ktroicon.png","contentUrl":"https:\/\/ktromedia.com\/wp-content\/uploads\/2025\/11\/ktroicon.png","width":250,"height":250,"caption":"Ktromedia"},"image":{"@id":"https:\/\/ktromedia.com\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/KTROMedia\/","https:\/\/www.linkedin.com\/company\/ktro-media\/","https:\/\/t.me\/ktrogroup"]},{"@type":"Person","@id":"https:\/\/ktromedia.com\/#\/schema\/person\/612bf2fbac107722ea365932cdd35f5b","name":"KTRO TEAM","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/ktromedia.com\/#\/schema\/person\/image\/","url":"https:\/\/ktromedia.com\/wp-content\/uploads\/2025\/10\/cropped-Untitled-design-7-1-150x150.png","contentUrl":"https:\/\/ktromedia.com\/wp-content\/uploads\/2025\/10\/cropped-Untitled-design-7-1-150x150.png","caption":"KTRO TEAM"},"description":"KTRO MEDIA 
is a global Chinese-language WEB3 media company. We are committed to providing the latest news, insights, and trend analysis for the blockchain and fintech sectors. Our aim is to offer users worldwide high-quality, comprehensive information services so that they can better follow the latest developments in the blockchain and fintech industries. We also hope to help more outstanding WEB3 products find more and better resources, so that the field can mature. Our coverage spans blockchain, cryptocurrency, smart contracts, DeFi, NFT, and the Web3 ecosystem. Our reporting draws not only on industry experts and pioneers but also on our own analysis and views. We have teams in many countries and regions that provide readers with localized reporting and analysis. In addition to news reporting, we offer market research and consulting services. Our professional team can provide in-depth analysis of the blockchain and fintech industries and of market trends, to help you make better-informed investment decisions. 
Our mission is to become one of the most trusted sources of information for the global Chinese-language blockchain and fintech industry. We will keep working to provide readers with the latest, most comprehensive, and most reliable information services.","sameAs":["https:\/\/ktromedia.com"],"url":"https:\/\/ktromedia.com\/?author=1"}]}},"_links":{"self":[{"href":"https:\/\/ktromedia.com\/index.php?rest_route=\/wp\/v2\/posts\/180149","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ktromedia.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ktromedia.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ktromedia.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/ktromedia.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=180149"}],"version-history":[{"count":1,"href":"https:\/\/ktromedia.com\/index.php?rest_route=\/wp\/v2\/posts\/180149\/revisions"}],"predecessor-version":[{"id":180151,"href":"https:\/\/ktromedia.com\/index.php?rest_route=\/wp\/v2\/posts\/180149\/revisions\/180151"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ktromedia.com\/index.php?rest_route=\/wp\/v2\/media\/180150"}],"wp:attachment":[{"href":"https:\/\/ktromedia.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=180149"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ktromedia.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=180149"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ktromedia.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=180149"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}