Turkish mBART-50 News Summarizer: Semantic Optimization

This model is an mBART-50 sequence-to-sequence transformer fine-tuned for Turkish abstractive summarization. It is designed to condense long-form journalistic content into concise summaries.

Table of Contents

  1. Model Details
  2. Intended Use
  3. Training Progress
  4. Technical Specifications
  5. Limitations
  6. Usage Guide
  7. Citation

Model Details

  • Developed by: Muhammad Jamil (jamil226)
  • Model Type: Multilingual mBART-50 Large (Seq2Seq)
  • Language: Turkish (tr)
  • Base Model: facebook/mbart-large-50
  • Fine-tuning Dataset: Turkish News Summarization Master Corpus (TNSMC)

Intended Use

This model is designed for:

  • Automated Journalism: Generating headlines and lead-ins for Turkish news portals.
  • Academic Research: Serving as a baseline for Turkish abstractive summarization.
  • Content Aggregation: Summarizing large volumes of text for quick information retrieval.

Training Progress

The model was fine-tuned for 20 epochs. Checkpoint selection prioritized ROUGE over validation loss, since ROUGE measures n-gram overlap with the reference summaries and is the standard evaluation metric for abstractive summarization.

Stage      Epoch   Training Loss   Validation Loss   ROUGE-1   ROUGE-Lsum
Initial    1       2.8768          2.6416            23.26     20.09
Stable     5       2.1822          2.4883            25.62     22.31
Advanced   10      1.0560          3.0150            30.32     26.85
Peak       20      0.2659          3.7714            32.75     29.13

Analytic Note: Validation loss begins to rise after Epoch 5 even as training loss keeps falling, which indicates the model is overfitting the token-level likelihood objective. ROUGE nevertheless continues to improve, with ROUGE-1 rising from 25.62 at Epoch 5 to 32.75 at Epoch 20 (a relative gain of roughly 28%), which is why checkpoints were selected by ROUGE rather than by validation loss.
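
For reference, a minimal sketch of ROUGE-based checkpoint selection is shown below. It assumes the standard Hugging Face Seq2SeqTrainer workflow with the evaluate library; the actual training script for this model is not published, so the details here are illustrative.

import numpy as np
import evaluate
from transformers import AutoTokenizer

# Illustrative sketch: score generated summaries with ROUGE so that the best
# checkpoint can be chosen by ROUGE-1 rather than by validation loss.
tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-50")
rouge = evaluate.load("rouge")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Labels are padded with -100 by the data collator; restore pad tokens before decoding.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    scores = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=False)
    return {key: round(value * 100, 2) for key, value in scores.items()}

Passing this function to Seq2SeqTrainer together with predict_with_generate=True, load_best_model_at_end=True, and metric_for_best_model="rouge1" makes the trainer keep the checkpoint with the highest ROUGE-1 regardless of validation loss.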

Technical Specifications

Infrastructure

  • Hardware: NVIDIA RTX 5000 (64GB GDDR6)
  • Deep Learning Framework: PyTorch 2.9.1 / Transformers 4.57.3
  • Compute Time: ~48 Hours total

Training Parameters

  • Batch Size: 8 (Per Device)
  • Learning Rate: 2e-5 (with Weight Decay)
  • Max Input Length: 1024 tokens
  • Max Output Length: 128 tokens
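
These parameters map onto a Seq2SeqTrainingArguments configuration roughly like the sketch below. The weight-decay value, the dataset column names (text, summary), and the output directory are assumptions, since the exact training script is not included here.

from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-50")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/mbart-large-50")
tokenizer.src_lang = "tr_TR"  # mBART-50 language code for Turkish
tokenizer.tgt_lang = "tr_TR"

def preprocess(batch):
    # "text" and "summary" are assumed column names for the TNSMC dataset.
    model_inputs = tokenizer(batch["text"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

training_args = Seq2SeqTrainingArguments(
    output_dir="turkish-mbart-summarizer",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=2e-5,
    weight_decay=0.01,               # decay is applied, but the exact value is not reported
    num_train_epochs=20,
    predict_with_generate=True,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="rouge1",  # checkpoint selection by ROUGE, as described above
)
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)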

Limitations

  • Domain Specificity: Optimized for news-style Turkish; performance may vary on highly technical or creative literature.
  • Hallucination Risk: Like all abstractive models, it may occasionally generate facts not present in the source text; pair it with human verification (see the heuristic sketch below).
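
As one lightweight aid to that verification step, the sketch below flags summaries that mention numbers absent from the source article, a crude check for one common class of hallucination. This heuristic is an illustrative assumption layered on top of the model, not part of it.

import re

def unsupported_numbers(source: str, summary: str) -> list[str]:
    """Return numbers that appear in the summary but not in the source article."""
    source_numbers = set(re.findall(r"\d[\d.,]*", source))
    return [n for n in re.findall(r"\d[\d.,]*", summary) if n not in source_numbers]

# Any hit should route the summary to manual review.
print(unsupported_numbers("Şirket 2023'te 120 milyon TL gelir elde etti.",
                          "Şirket 2024'te 150 milyon TL gelir elde etti."))
# -> ['2024', '150']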

Usage Guide

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Initialize tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("jamil226/turkish-mbart-summarizer")
model = AutoModelForSeq2SeqLM.from_pretrained("jamil226/turkish-mbart-summarizer")

# mBART-50 works with language codes; set Turkish explicitly in case the
# saved tokenizer config does not already default to it.
tokenizer.src_lang = "tr_TR"

# Input Turkish text
article_text = "Türkiye'nin teknoloji ekosistemi, yeni nesil girişimlerle küresel pazarda büyümeye devam ediyor..."

# Tokenize (truncated to the 1024-token training limit) and generate with beam search
inputs = tokenizer(article_text, return_tensors="pt", max_length=1024, truncation=True)
summary_ids = model.generate(**inputs, num_beams=4, max_length=128, early_stopping=True)

# Decode the generated summary
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)
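
For the content-aggregation scenario mentioned under Intended Use, several articles can be summarized in a single batched call. The snippet below reuses the model and tokenizer objects from the example above; the article strings are placeholders.

# Batched summarization for content aggregation (reuses model and tokenizer from above)
articles = [
    "Birinci haber metni ...",
    "İkinci haber metni ...",
]

batch = tokenizer(articles, return_tensors="pt", max_length=1024,
                  truncation=True, padding=True)
summary_ids = model.generate(**batch, num_beams=4, max_length=128, early_stopping=True)

for ids in summary_ids:
    print(tokenizer.decode(ids, skip_special_tokens=True))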

Citation

If you use this model or build upon it, please cite the following work:

@misc{jamil2026turkishmbart,
  author       = {Jamil, Muhammad},
  title        = {Turkish mBART-50 News Summarizer: Semantic Optimization},
  year         = {2026},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/jamil226/turkish-mbart-summarizer}}
}

Evaluation results

  • ROUGE-1 on TNSMC (Turkish News Summarization Master Corpus): 32.75 (self-reported)
  • ROUGE-Lsum on TNSMC (Turkish News Summarization Master Corpus): 29.13 (self-reported)