Showing 1–20 of 31 results
Date          Name
Oct 4, 2019   Neural Language Priors
Feb 9, 2025   Scaling Laws for Forgetting during Finetuning with Pretraining Data Injection
Jan 21, 2025  Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models
Jul 15, 2022  Position Prediction as an Effective Pretraining Strategy
Jul 12, 2025  Scaling Laws for Optimal Data Mixtures
Jul 27, 2020  Neural Temporal Point Processes For Modelling Electronic Health Records
May 9, 2018   Decoding Decoders: Finding Optimal Representation Spaces for Unsupervised Similarity Tasks
Jun 28, 2023  DUET: 2D Structured and Approximately Equivariant Representations
Sep 6, 2024   Theory, Analysis, and Best Practices for Sigmoid Self-Attention
Mar 8, 2024   Poly-View Contrastive Learning
Jul 25, 2023  How to Scale Your EMA
Dec 9, 2025   Revisiting the Scaling Properties of Downstream Metrics in Large Language Model Training
Dec 26, 2025  Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration
Apr 27, 2026  Scaling Properties of Continuous Diffusion Spoken Language Models
Mar 11, 2023  Stabilizing Transformer Training by Preventing Attention Entropy Collapse
Mar 28, 2020  Learning medical triage from clinicians using Deep Q-Learning
Apr 11, 2019  Relational Graph Attention Networks
Oct 1, 2021   Do Self-Supervised and Supervised Methods Learn Similar Visual Representations?
Feb 12, 2025  Distillation Scaling Laws
Jun 4, 2025   How PARTs assemble into wholes: Learning the relative composition of images