Date / Name
Sep 16, 2021 / SuperNIC: A Hardware-Based, Programmable, and Multi-Tenant SmartNIC
May 18, 2024 / The CAP Principle for LLM Serving: A Survey of Long-Context Large Language Model Serving
Aug 7, 2021 / Clio: A Hardware-Software Co-Designed Disaggregated Memory System
Aug 4, 2025 / Huawei Cloud Model-as-a-Service on the CloudMatrix384 SuperPod
Jan 20, 2024 / Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads
Jan 24, 2025 / DeepServe: Serverless Large Language Model Serving at Scale
Jun 17, 2025 / Efficient Serving of LLM Applications with Probabilistic Demand Modeling
Sep 8, 2024 / InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference
Dec 18, 2025 / MEPIC: Memory Efficient Position Independent Caching for LLM Serving
Apr 11, 2026 / Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation
Jun 16, 2025 / DDiT: Dynamic Resource Allocation for Diffusion Transformer Model Serving
Oct 20, 2024 / EPIC: Efficient Position-Independent Caching for Serving Large Language Models
Feb 24, 2026 / ReviveMoE: Fast Recovery for Hardware Failures in Large-Scale MoE LLM Inference Deployments
Jun 25, 2024 / MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool
Jan 20, 2024 / CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference
Feb 6, 2019 / Storm: a fast transactional dataplane for remote data structures
Apr 19, 2025 / Improving the Serving Performance of Multi-LoRA Large Language Models via Efficient LoRA and KV Cache Management
Feb 16, 2025 / RaaS: Reasoning-Aware Attention Sparsity for Efficient LLM Reasoning
Dec 23, 2024 / BLITZSCALE: Fast and Live Large Model Autoscaling with O(1) Host Caching