"au:"Guilherme Penedo"" - arXiv2 Search
Showing 1–8 of 8 results
Date  Name
Jun 25, 2024  The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Nov 28, 2023  The Falcon Series of Open Language Models
Jun 1, 2023  The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
Jun 26, 2025  FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
Jun 5, 2025  The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text
Apr 15, 2026  How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design, Generator Model, and Source Data
Jan 14, 2025  Towards Best Practices for Open Datasets for LLM Training
Feb 4, 2025  SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model