# Multilingual NER with XLM-RoBERTa

This model is a fine-tuned version of xlm-roberta-base on the PAN-X dataset for multiple languages (German, French, Italian, and English).
It achieves the following results on the evaluation set:
- Loss: 0.1809
- F1: 0.8529
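The PAN-X (WikiANN) data is distributed as part of the XTREME benchmark. The exact preprocessing script is not included in this card, but a minimal sketch of loading one language subset with 🤗 Datasets, assuming the XTREME distribution of PAN-X, looks like this:

```python
from datasets import load_dataset

# Each PAN-X language is a separate configuration, e.g. "PAN-X.de", "PAN-X.fr",
# "PAN-X.it", "PAN-X.en"; English is shown here as an example.
panx_en = load_dataset("xtreme", name="PAN-X.en")
print(panx_en["train"][0])
# {'tokens': [...], 'ner_tags': [...], 'langs': [...]}
```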
## 💻 How to Use
The easiest way to use this model for inference is with the Hugging Face pipeline API, setting the task to "ner" and using the "simple" aggregation strategy to automatically merge subword tokens back into recognizable entities.
### 1. Using the Pipeline (Recommended)
This method simplifies the process and handles token aggregation for you.
```python
from transformers import pipeline

# Model ID on the Hugging Face Hub
model_id = "shroukAdel/xlm-roberta-base-finetuned-panx-all"

# Initialize the NER pipeline
# aggregation_strategy="simple" ensures subwords are combined into single entities
ner_pipeline = pipeline(
    "ner",
    model=model_id,
    aggregation_strategy="simple"
)

# Example 1: German
text_de = "Jeff Dean ist ein Informatiker bei Google in Kalifornien"
print(ner_pipeline(text_de))
# Output: [{'entity_group': 'PER', 'score': 0.997, 'word': 'Jeff Dean', ...}, {'entity_group': 'ORG', 'score': 0.996, 'word': 'Google', ...}, {'entity_group': 'LOC', 'score': 0.998, 'word': 'Kalifornien', ...}]

# Example 2: English
text_en = "My name is Sarah and I live in London"
print(ner_pipeline(text_en))
# Output: [{'entity_group': 'PER', 'score': 0.996, 'word': 'Sarah', ...}, {'entity_group': 'LOC', 'score': 0.998, 'word': 'London', ...}]

# Example 3: French
text_fr = "Marie Curie était une physicienne française"
print(ner_pipeline(text_fr))
# Output: [{'entity_group': 'PER', 'score': 0.995, 'word': 'Marie Curie', ...}]
```
### 2. Manual Loading (For Custom Tasks)
If you need lower-level access to the model or are integrating it into a custom training loop, you can load the model and tokenizer manually.
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_id = "shroukAdel/xlm-roberta-base-finetuned-panx-all"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

# Now you can use the tokenizer and model for custom inference or further fine-tuning
```
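For a quick manual sanity check, a minimal sketch of token-level inference with the objects loaded above might look like the following (the label names come from the model's `id2label` config; this is an illustrative example, not the original training or evaluation code):

```python
import torch

text = "Jeff Dean ist ein Informatiker bei Google in Kalifornien"
inputs = tokenizer(text, return_tensors="pt")

# Run the model and take the highest-scoring label id for each token
with torch.no_grad():
    logits = model(**inputs).logits
predictions = logits.argmax(dim=-1)[0]

# Map each subword token to its predicted NER tag
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, predictions):
    print(token, model.config.id2label[pred.item()])
```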
Languages: English, French, German, Italian
## Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 24
- eval_batch_size: 24
- seed: 42
- optimizer: AdamW (torch fused) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- num_epochs: 3
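The training script itself is not part of this card. As a rough sketch, 🤗 Transformers `TrainingArguments` matching the values above might look like the following (the output directory and evaluation strategy are assumptions, not taken from the card):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="xlm-roberta-base-finetuned-panx-all",  # assumed output directory
    learning_rate=5e-5,
    per_device_train_batch_size=24,
    per_device_eval_batch_size=24,
    seed=42,
    optim="adamw_torch_fused",
    lr_scheduler_type="linear",
    num_train_epochs=3,
    eval_strategy="epoch",  # assumption: the results table reports one evaluation per epoch
)
```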
## Training results
| Training Loss | Epoch | Step | Validation Loss | F1 |
|---|---|---|---|---|
| 0.2902 | 1.0 | 835 | 0.1952 | 0.8212 |
| 0.157 | 2.0 | 1670 | 0.1852 | 0.8387 |
| 0.1028 | 3.0 | 2505 | 0.1809 | 0.8529 |
## Framework versions
- Transformers 4.57.3
- Pytorch 2.9.0+cu126
- Datasets 4.0.0
- Tokenizers 0.22.1
## Base model

- FacebookAI/xlm-roberta-base