Date            Name
Feb 10, 2024    Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models
Jun 16, 2024    Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
Feb 17, 2025    Tactic: Adaptive Sparse Attention with Clustering and Distribution Fitting for Long-Context LLMs
Aug 22, 2024    NanoFlow: Towards Optimal Large Language Model Serving Throughput
Oct 29, 2023    Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
Jul 17, 2025    PolyServe: Efficient Multi-SLO Serving at Scale
Nov 25, 2024    BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching
Jan 1, 2016     Practical Algorithms for Learning Near-Isometric Linear Embeddings
Feb 28, 2025    TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval
Dec 1, 2025     Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding
Dec 24, 2025    NVIDIA Nemotron 3: Efficient and Open Intelligence
Dec 23, 2025    Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning