| Date | Name |
| --- | --- |
| Mar 11, 2025 | FastCache: Optimizing Multimodal LLM Serving through Lightweight KV-Cache Compression Framework |
| Sep 27, 2025 | A Flexible Programmable Pipeline Parallelism Framework for Efficient DNN Training |
| Feb 18, 2026 | FlowPrefill: Decoupling Preemption from Prefill Scheduling Granularity to Mitigate Head-of-Line Blocking in LLM Serving |
| Feb 15, 2022 | Suppressing ZZ Crosstalk of Quantum Computers through Pulse and Scheduling Co-Optimization |
| May 24, 2022 | GraphQ IR: Unifying the Semantic Parsing of Graph Query Languages with One Intermediate Representation |
| Mar 24, 2021 | FastMoE: A Fast Mixture-of-Expert Training System |
| Mar 24, 2025 | Jenga: Effective Memory Management for Serving LLM with Heterogeneity |
| Jun 17, 2025 | HARMONY: A Scalable Distributed Vector Database for High-Throughput Approximate Nearest Neighbor Search |
| Nov 8, 2025 | Lethe: Layer- and Time-Adaptive KV Cache Pruning for Reasoning-Intensive LLM Serving |
| Dec 15, 2025 | FlashFuser: Expanding the Scale of Kernel Fusion for Compute-Intensive Operators via Inter-Core Connection |
| Feb 28, 2026 | Jano: Adaptive Diffusion Generation with Early-stage Convergence Awareness |
| Apr 21, 2026 | UniEP: Unified Expert-Parallel MoE MegaKernel for LLM Training |
| Jun 13, 2021 | G-TADOC: Enabling Efficient GPU-Based Text Analytics without Decompression |
| Mar 26, 2022 | A Roadmap for Big Model |
| Oct 4, 2022 | Unveiling the Black Box of PLMs with Semantic Anchors: Towards Interpretable Neural Semantic Parsing |
| May 16, 2017 | Algorithm-Directed Crash Consistence in Non-Volatile Memory for HPC |
| Jul 11, 2023 | PowerFusion: A Tensor Compiler with Explicit Data Movement Description and Instruction-level Graph IR |
| Feb 20, 2025 | GS-Cache: A GS-Cache Inference Framework for Large-scale Gaussian Splatting Models |
| May 12, 2025 | SpecRouter: Adaptive Routing for Multi-Level Speculative Decoding in Large Language Models |
| Aug 8, 2025 | GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models |