Multilingual Named Entity Recognition (NER) with XLM-RoBERTa

This model is a fine-tuned version of xlm-roberta-base on the PAN-X dataset for multiple languages (German, French, Italian, and English).

It achieves the following results on the evaluation set:

  • Loss: 0.1809
  • F1: 0.8529

💻 How to Use

The easiest way to use this model for inference is with the Hugging Face pipeline API, setting the task to "ner" and using the "simple" aggregation strategy to automatically merge subword tokens back into recognizable entities.

1. Using the Pipeline (Recommended)

This method simplifies the process and handles token aggregation for you.

from transformers import pipeline

# Replace with your model ID
model_id = "shroukAdel/xlm-roberta-base-finetuned-panx-all"

# Initialize the NER pipeline
# aggregation_strategy="simple" ensures subwords are combined into single entities
ner_pipeline = pipeline(
    "ner",
    model=model_id,
    aggregation_strategy="simple"
)

# Example 1: German
text_de = "Jeff Dean ist ein Informatiker bei Google in Kalifornien"
print(ner_pipeline(text_de))
# Output: [{'entity_group': 'PER', 'score': 0.997, 'word': 'Jeff Dean', ...}, {'entity_group': 'ORG', 'score': 0.996, 'word': 'Google', ...}, {'entity_group': 'LOC', 'score': 0.998, 'word': 'Kalifornien', ...}]

# Example 2: English
text_en = "My name is Sarah and I live in London"
print(ner_pipeline(text_en))
# Output: [{'entity_group': 'PER', 'score': 0.996, 'word': 'Sarah', ...}, {'entity_group': 'LOC', 'score': 0.998, 'word': 'London', ...}]

# Example 3: French
text_fr = "Marie Curie était une physicienne française"
print(ner_pipeline(text_fr))
# Output: [{'entity_group': 'PER', 'score': 0.995, 'word': 'Marie Curie', ...}]

2. Manual Loading (For Custom Tasks)

If you need lower-level access to the model or are integrating it into a custom training loop, you can load the model and tokenizer manually.

from transformers import AutoTokenizer, AutoModelForTokenClassification

model_id = "shroukAdel/xlm-roberta-base-finetuned-panx-all"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

# Now you can use the tokenizer and model for custom inference or further fine-tuning
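
Continuing from the tokenizer and model loaded above, the following is a minimal sketch of manual token-level inference: the example sentence is illustrative, and label names are read from the model's id2label config. Note that, unlike the pipeline route, this prints one prediction per subword token without aggregation.

import torch

text = "Jeff Dean ist ein Informatiker bei Google in Kalifornien"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Pick the highest-scoring label for each subword token
predicted_ids = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[i.item()] for i in predicted_ids]

for token, label in zip(tokens, labels):
    print(token, label)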

Languages

  • English
  • French
  • German
  • Italian
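
For reference, PAN-X data for these languages can be loaded per language with the Hugging Face datasets library. This is a hedged sketch that assumes the standard XTREME distribution of PAN-X; the exact splits and preprocessing used for this fine-tune are not restated here.

from datasets import load_dataset

# Each language is a separate PAN-X config inside the XTREME benchmark
langs = ["de", "fr", "it", "en"]
panx = {lang: load_dataset("xtreme", name=f"PAN-X.{lang}") for lang in langs}

print(panx["de"]["train"][0])
# Examples contain "tokens" and "ner_tags" over the label set
# O, B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC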

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-05
  • train_batch_size: 24
  • eval_batch_size: 24
  • seed: 42
  • optimizer: AdamW (torch fused) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: linear
  • num_epochs: 3
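
For convenience, the values above translate roughly into the TrainingArguments sketch below. This is an assumption-laden reconstruction, not the exact configuration used: the output directory, evaluation strategy, and logging behavior are guesses.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="xlm-roberta-base-finetuned-panx-all",  # assumed name
    learning_rate=5e-5,
    per_device_train_batch_size=24,
    per_device_eval_batch_size=24,
    num_train_epochs=3,
    seed=42,
    optim="adamw_torch_fused",   # AdamW (torch fused); betas and epsilon at their defaults (0.9, 0.999) and 1e-08
    lr_scheduler_type="linear",
    eval_strategy="epoch",       # assumed, since the results table reports per-epoch validation metrics
)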

Training results

| Training Loss | Epoch | Step | Validation Loss | F1     |
|---------------|-------|------|-----------------|--------|
| 0.2902        | 1.0   | 835  | 0.1952          | 0.8212 |
| 0.1570        | 2.0   | 1670 | 0.1852          | 0.8387 |
| 0.1028        | 3.0   | 2505 | 0.1809          | 0.8529 |
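
The F1 column is an entity-level score. As a minimal illustration of how such a score is typically computed, the sketch below uses seqeval; the metric implementation is an assumption, and the label sequences are made up for the example.

from seqeval.metrics import f1_score

y_true = [["B-PER", "I-PER", "O", "B-ORG", "O", "B-LOC"]]
y_pred = [["B-PER", "I-PER", "O", "B-ORG", "O", "B-LOC"]]

print(f1_score(y_true, y_pred))  # 1.0 only when every entity span and type matches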

Framework versions

  • Transformers 4.57.3
  • Pytorch 2.9.0+cu126
  • Datasets 4.0.0
  • Tokenizers 0.22.1