Fanar Collection A powerful and versatile family of Arabic Large Language Models (LLMs) designed for a wide range of tasks. • 3 items • Updated Jun 10, 2025 • 11
gpt-oss Collection Open-weight models designed for powerful reasoning, agentic tasks, and versatile developer use cases. • 2 items • Updated Aug 7, 2025 • 396
view article Article AtlasOCR: Building the First Open-Source Darija OCR Model with Vision Language Models Sep 16, 2025 • 19
SmolLM3 evaluation datasets Collection Datasets to decontaminate the post-training mixtures against. Use the subset and column values described per entry • 13 items • Updated Jul 8, 2025 • 7
SmolLM3 pretraining datasets Collection datasets used in SmolLM3 pretraining • 15 items • Updated Aug 12, 2025 • 42
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language Paper • 2506.20920 • Published Jun 26, 2025 • 75
view article Article SmolVLA: Efficient Vision-Language-Action Model trained on Lerobot Community Data +7 Jun 3, 2025 • 304
view article Article Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models +1 Mar 20, 2024 • 108
view article Article Atlaset Dataset for Moroccan Darija: From Data Collection, Analysis, to Model Trainings Mar 6, 2025 • 26