# Multilingual NER with XLM-RoBERTa

This model is a fine-tuned version of xlm-roberta-base on the PAN-X dataset for multiple languages (German, French, Italian, and English).
It achieves the following results on the evaluation set:
- Loss: 0.1809
- F1: 0.8529
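The PAN-X (WikiANN) data is distributed as part of the XTREME benchmark. The exact preprocessing script is not included in this card, but a minimal sketch of loading one language subset with 🤗 Datasets, assuming the XTREME distribution of PAN-X, looks like this:

```python
from datasets import load_dataset

# Each PAN-X language is a separate configuration, e.g. "PAN-X.de", "PAN-X.fr",
# "PAN-X.it", "PAN-X.en"; English is shown here as an example.
panx_en = load_dataset("xtreme", name="PAN-X.en")
print(panx_en["train"][0])
# {'tokens': [...], 'ner_tags': [...], 'langs': [...]}
```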
## 💻 How to Use
The easiest way to use this model for inference is with the Hugging Face pipeline API, setting the task to "ner" and using the "simple" aggregation strategy to automatically merge subword tokens back into recognizable entities.
### 1. Using the Pipeline (Recommended)
This method simplifies the process and handles token aggregation for you.
```python
from transformers import pipeline

# Model ID on the Hugging Face Hub
model_id = "shroukAdel/xlm-roberta-base-finetuned-panx-all"

# Initialize the NER pipeline
# aggregation_strategy="simple" ensures subwords are combined into single entities
ner_pipeline = pipeline(
    "ner",
    model=model_id,
    aggregation_strategy="simple"
)

# Example 1: German
text_de = "Jeff Dean ist ein Informatiker bei Google in Kalifornien"
print(ner_pipeline(text_de))
# Output: [{'entity_group': 'PER', 'score': 0.997, 'word': 'Jeff Dean', ...}, {'entity_group': 'ORG', 'score': 0.996, 'word': 'Google', ...}, {'entity_group': 'LOC', 'score': 0.998, 'word': 'Kalifornien', ...}]

# Example 2: English
text_en = "My name is Sarah and I live in London"
print(ner_pipeline(text_en))
# Output: [{'entity_group': 'PER', 'score': 0.996, 'word': 'Sarah', ...}, {'entity_group': 'LOC', 'score': 0.998, 'word': 'London', ...}]

# Example 3: French
text_fr = "Marie Curie était une physicienne française"
print(ner_pipeline(text_fr))
# Output: [{'entity_group': 'PER', 'score': 0.995, 'word': 'Marie Curie', ...}]
```
### 2. Manual Loading (For Custom Tasks)
If you need lower-level access to the model or are integrating it into a custom training loop, you can load the model and tokenizer manually.
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_id = "shroukAdel/xlm-roberta-base-finetuned-panx-all"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

# Now you can use the tokenizer and model for custom inference or further fine-tuning
```
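For a quick manual sanity check, a minimal sketch of token-level inference with the objects loaded above might look like the following (the label names come from the model's `id2label` config; this is an illustrative example, not the original training or evaluation code):

```python
import torch

text = "Jeff Dean ist ein Informatiker bei Google in Kalifornien"
inputs = tokenizer(text, return_tensors="pt")

# Run the model and take the highest-scoring label id for each token
with torch.no_grad():
    logits = model(**inputs).logits
predictions = logits.argmax(dim=-1)[0]

# Map each subword token to its predicted NER tag
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, predictions):
    print(token, model.config.id2label[pred.item()])
```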
Languages: English, French, German, Italian
## Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 24
- eval_batch_size: 24
- seed: 42
- optimizer: AdamW (torch fused) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- num_epochs: 3
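The training script itself is not part of this card. As a rough sketch, 🤗 Transformers `TrainingArguments` matching the values above might look like the following (the output directory and evaluation strategy are assumptions, not taken from the card):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="xlm-roberta-base-finetuned-panx-all",  # assumed output directory
    learning_rate=5e-5,
    per_device_train_batch_size=24,
    per_device_eval_batch_size=24,
    seed=42,
    optim="adamw_torch_fused",
    lr_scheduler_type="linear",
    num_train_epochs=3,
    eval_strategy="epoch",  # assumption: the results table reports one evaluation per epoch
)
```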
## Training results
| Training Loss | Epoch | Step | Validation Loss | F1 |
|---|---|---|---|---|
| 0.2902 | 1.0 | 835 | 0.1952 | 0.8212 |
| 0.157 | 2.0 | 1670 | 0.1852 | 0.8387 |
| 0.1028 | 3.0 | 2505 | 0.1809 | 0.8529 |
## Framework versions
- Transformers 4.57.3
- Pytorch 2.9.0+cu126
- Datasets 4.0.0
- Tokenizers 0.22.1
## Base model

- FacebookAI/xlm-roberta-base