Date          Name
Jul 16, 2022  Progress and limitations of deep networks to recognize objects in unusual poses
Feb 16, 2026  ÜberWeb: Insights from Multilingual Curation for a 20-Trillion-Token Dataset
Jan 9, 2024   Effective pruning of web-scale datasets based on complexity of concept clusters
Feb 6, 2024   A comparison between humans and AI at recognizing objects in unusual poses
Aug 14, 2025  BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining
Jun 17, 2024  DataComp-LM: In search of the next generation of training sets for language models
Jan 5, 2026   DatBench: Discriminative, Faithful, and Efficient VLM Evaluations
Mar 17, 2026  The Finetuner's Fallacy: When to Pretrain with Your Finetuning Data
Mar 16, 2023  SemDeDup: Data-efficient learning at web-scale through semantic deduplication
Oct 3, 2023   Sieve: Multimodal Dataset Pruning Using Image Captioning Models
Dec 9, 2025   Luxical: High-Speed Lexical-Dense Text Embeddings