Date | Name
Dec 1, 2017 | Deep Learning Scaling is Predictable, Empirically
Mar 25, 2020 | Pipelined Backpropagation at Scale: Training Large Models without Batches
Sep 3, 2019 | Beyond Human-Level Accuracy: Computational Challenges in Deep Learning
Apr 19, 2021 | Memory Efficient 3D U-Net with Reversible Mobile Inverted Bottlenecks for Brain Tumor Segmentation
Apr 6, 2023 | Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster
Jun 28, 2022 | RevBiFPN: The Fully Reversible Bidirectional Feature Pyramid Network
Mar 15, 2017 | Convolutional Recurrent Neural Networks for Small-Footprint Keyword Spotting
Oct 18, 2023 | Position Interpolation Improves ALiBi Extrapolation
Nov 1, 2024 | Normalization Layer Per-Example Gradients are Sufficient to Predict Gradient Noise Scale in Transformers
May 2, 2025 | Don't be lazy: CompleteP enables compute-efficient deep transformers
May 19, 2025 | Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training
Sep 20, 2023 | BTLM-3B-8K: 7B Parameter Performance in a 3B Parameter Model
Sep 19, 2023 | SlimPajama-DC: Understanding Data Combinations for LLM Training
May 24, 2024 | Sparse maximal update parameterization: A holistic approach to sparse training dynamics
Nov 6, 2024 | Crystal: Illuminating LLM Abilities on Language and Code
Dec 5, 2025 | K2-V2: A 360-Open, Reasoning-Enhanced LLM
Feb 21, 2025 | Straight to Zero: Why Linearly Decaying the Learning Rate to Zero Works Best for LLMs
Aug 30, 2023 | Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models
Oct 7, 2019 | Compositional Generalization for Primitive Substitutions
Mar 1, 2024 | MediSwift: Efficient Sparse Pre-trained Biomedical Language Models