Date / Name
Feb 15, 2024 / Generative Representational Instruction Tuning
Nov 7, 2024 / Scaling Laws for Precision
Feb 18, 2024 / KMMLU: Measuring Massive Multitask Language Understanding in Korean
Jul 1, 2024 / RegMix: Data Mixture as Regression for Language Model Pre-training
Mar 31, 2025 / A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?
Aug 25, 2025 / UQ: Assessing Language Models on Unsolved Questions
Oct 11, 2025 / HUME: Measuring the Human-Model Performance Gap in Text Embedding Tasks
Dec 6, 2021 / NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation
Feb 17, 2026 / MAEB: Massive Audio Embedding Benchmark
Nov 9, 2022 / BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
May 23, 2024 / Lessons from the Trenches on Reproducible Evaluation of Language Models
Oct 27, 2022 / What Language Model to Train if You Have One Million GPU Hours?
Jun 17, 2024 / DataComp-LM: In search of the next generation of training sets for language models
Jul 23, 2024 / OpenHands: An Open Platform for AI Software Developers as Generalist Agents
Jun 4, 2024 / The Scandinavian Embedding Benchmarks: Comprehensive Assessment of Multilingual and Monolingual Text Embedding
Jun 22, 2024 / BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
Oct 24, 2025 / ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality
Mar 25, 2026 / Composer 2 Technical Report
Jan 24, 2025 / Humanity's Last Exam
Oct 4, 2024 / SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?