au:"Gavia Gray" — arXiv Search
Showing 1–6 of 6 results
Nov 1, 2024: Normalization Layer Per-Example Gradients are Sufficient to Predict Gradient Noise Scale in Transformers
Feb 21, 2025: Straight to Zero: Why Linearly Decaying the Learning Rate to Zero Works Best for LLMs
May 19, 2025: Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training
Jun 10, 2019: BlockSwap: Fisher-guided Block Substitution for Network Compression on a Budget
Jun 3, 2019: Separable Layers Enable Structured Efficient Linear Substitutions
Nov 7, 2017: Moonshine: Distilling with Cheap Convolutions