Date / Name
Sep 16, 2021 / SuperNIC: A Hardware-Based, Programmable, and Multi-Tenant SmartNIC
May 18, 2024 / The CAP Principle for LLM Serving: A Survey of Long-Context Large Language Model Serving
Aug 7, 2021 / Clio: A Hardware-Software Co-Designed Disaggregated Memory System
Aug 4, 2025 / Huawei Cloud Model-as-a-Service on the CloudMatrix384 SuperPod
Jan 20, 2024 / Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads
Jan 24, 2025 / DeepServe: Serverless Large Language Model Serving at Scale
Jun 17, 2025 / Efficient Serving of LLM Applications with Probabilistic Demand Modeling
Sep 8, 2024 / InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference
Dec 18, 2025 / MEPIC: Memory Efficient Position Independent Caching for LLM Serving
Apr 11, 2026 / Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation
Jun 16, 2025 / DDiT: Dynamic Resource Allocation for Diffusion Transformer Model Serving
Oct 20, 2024 / EPIC: Efficient Position-Independent Caching for Serving Large Language Models
Feb 24, 2026 / ReviveMoE: Fast Recovery for Hardware Failures in Large-Scale MoE LLM Inference Deployments
Jun 25, 2024 / MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool
Jan 20, 2024 / CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference
Feb 6, 2019 / Storm: a fast transactional dataplane for remote data structures
Apr 19, 2025 / Improving the Serving Performance of Multi-LoRA Large Language Models via Efficient LoRA and KV Cache Management
Feb 16, 2025 / RaaS: Reasoning-Aware Attention Sparsity for Efficient LLM Reasoning
Dec 23, 2024 / BLITZSCALE: Fast and Live Large Model Autoscaling with O(1) Host Caching