"au:"Hynek Kydlíček"" — arXiv2 SearchShowing 1–8 of 8 results
Date / Name
Jul 20, 2023 / A Dataset and Strong Baselines for Classification of Czech News Texts
Dec 23, 2024 / BenCzechMark : A Czech-centric Multitask and Multimetric Benchmark for Large Language Models with Duel Scoring Mechanism
Jun 25, 2024 / The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Apr 15, 2026 / How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design, Generator Model, and Source Data
Jan 14, 2025 / Towards Best Practices for Open Datasets for LLM Training
Feb 4, 2025 / SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
Feb 18, 2025 / Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs
Jun 26, 2025 / FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language