- Apr 29, 2024: Performance-Aligned LLMs for Generating Fast Code
- May 22, 2023: A 4D Hybrid Algorithm to Scale Parallel Training to Thousands of GPUs
- Mar 11, 2023: A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training
- Feb 12, 2025: Democratizing AI: Open-source Scalable LLM Training on GPU-based Supercomputers
- Jun 9, 2025: Simulating Nationwide Coupled Disease and Fear Spread in an Agent-based Model
- Apr 3, 2026: Communication-free Sampling and 4D Hybrid Parallelism for Scalable Mini-batch GNN Training
- Mar 9, 2026: Speculating Experts Accelerates Inference for Mixture-of-Experts
- Oct 25, 2021: AxoNN: An Asynchronous, Message-driven Parallel Framework for Extreme-scale Deep Learning
- Feb 10, 2023: Exploiting Sparsity in Pruned Neural Networks to Optimize Large Model Training
- Jun 19, 2023: Pipit: Scripting the Analysis of Parallel Execution Traces
- Jul 7, 2020: Analytics of Longitudinal System Monitoring Data for Performance Prediction
- Dec 19, 2024: HPC-Coder-V2: Studying Code LLMs Across Low-Resource Parallel Languages
- Dec 11, 2023: ML-based Modeling to Predict I/O Performance on Different Storage Sub-systems
- May 7, 2025: Plexus: Taming Billion-edge Graphs with 3D Parallel Full-graph GNN Training
- Dec 17, 2025: Optimizing Agentic Language Model Inference via Speculative Tool Calls
- Apr 25, 2025: The Big Send-off: Scalable and Performant Collectives for Deep Learning
- Oct 18, 2023: Jorge: Approximate Preconditioning for GPU-efficient Second-order Optimization
- Jan 23, 2024: Automated Programmatic Performance Analysis of Parallel Programs
- Jan 23, 2024: Can Large Language Models Write Parallel Code?
- Jun 26, 2025: ParEval-Repo: A Benchmark Suite for Evaluating LLMs with Repository-level HPC Translation Tasks