SentenceTransformer based on sentence-transformers/all-mpnet-base-v2

This is a sentence-transformers model finetuned from sentence-transformers/all-mpnet-base-v2. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: sentence-transformers/all-mpnet-base-v2
  • Maximum Sequence Length: 384 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 384, 'do_lower_case': False, 'architecture': 'MPNetModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
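
The three modules above amount to: run the MPNet encoder, average the token vectors that the attention mask marks as real tokens (`pooling_mode_mean_tokens: True`), then scale the result to unit length. The following is a dependency-free sketch of the last two steps only; `mean_pool_and_normalize` and the toy 2-dimensional vectors are illustrative assumptions, not the library's implementation.

```python
import math

def mean_pool_and_normalize(token_embeddings, attention_mask):
    """Average the unmasked token vectors, then L2-normalize the result.

    token_embeddings: list of per-token vectors (lists of floats)
    attention_mask:   list of 0/1 flags, 1 = real token, 0 = padding
    """
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    # Mean pooling: padding tokens do not contribute to the average.
    mean = [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]
    # Normalize: divide by the Euclidean norm so the output has length 1.
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean] if norm > 0 else mean

# Toy input: two real tokens and one padding token that must be ignored.
sentence_vector = mean_pool_and_normalize(
    [[3.0, 0.0], [0.0, 4.0], [100.0, 100.0]],
    [1, 1, 0],
)
print(sentence_vector)
```

Because the output is unit length, the dot product of two such vectors equals their cosine similarity.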

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("chatlas/all-mpnet-base-v2-combined_4400-400vs1000")
# Run inference
sentences = [
    'Which file could not be opened according to the xAOD::TFileMerger::addFile error message?',
    'Metadata:\nsource: AtlasTalk\n\nChunk text:\nError in <xAOD::TFileMerger::addFile>: /build1/atnight/localbuilds/nightlies/AnalysisBase-2.3.X/AnalysisBase/rel_nightly/xAODRootAccess/Root/TFileMerger.cxx:105 Couldn\'t open file "user.pottgen.5855794._000003.hist-output.root"',
    "Metadata:\nsource: GitLabMarkdown\nproject path: acc-co/ucap/ucap-core\nproject description: \nfile path: docs/src/docs/reference/device-behavior.md\nheader path: 'Device Behavior' > 'Acquisition properties' > 'First updates'\n\nChunk text:\nAs of May 2024, UCAP retains converter outputs (for each selector) within an in-memory data structure, paired with the\nrelevant selector. Thus, UCAP nodes provide first-updates as needed for `get` and `subscribe` operations; however,",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[ 1.0000,  0.7130, -0.0958],
#         [ 0.7130,  1.0000, -0.1120],
#         [-0.0958, -0.1120,  1.0000]])
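
A similarity matrix like the one printed above can be turned into a ranking of the other sentences for each query. The helper below is a hypothetical illustration in plain Python (`rank_matches` is not part of Sentence Transformers), and it reuses the printed scores rather than calling the model.

```python
def rank_matches(similarities, query_index, k=1):
    """Return the k most similar other items as (index, score) pairs."""
    row = similarities[query_index]
    # Skip the query itself, whose self-similarity is always the maximum.
    candidates = [(score, j) for j, score in enumerate(row) if j != query_index]
    candidates.sort(reverse=True)
    return [(j, score) for score, j in candidates[:k]]

# Scores copied from the example output above.
similarities = [
    [1.0000, 0.7130, -0.0958],
    [0.7130, 1.0000, -0.1120],
    [-0.0958, -0.1120, 1.0000],
]
best_for_question = rank_matches(similarities, query_index=0, k=2)
print(best_for_question)
```

Here the question (index 0) ranks the AtlasTalk chunk that contains its answer (index 1) above the unrelated UCAP chunk (index 2).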

Evaluation

Metrics

Information Retrieval

Metric Value
cosine_accuracy@1 0.745
cosine_accuracy@3 0.8583
cosine_accuracy@5 0.8867
cosine_accuracy@10 0.9183
cosine_precision@1 0.745
cosine_precision@3 0.2861
cosine_precision@5 0.1773
cosine_precision@10 0.0918
cosine_recall@1 0.745
cosine_recall@3 0.8583
cosine_recall@5 0.8867
cosine_recall@10 0.9183
cosine_ndcg@10 0.8348
cosine_mrr@10 0.8078
cosine_map@100 0.8109
dot_accuracy@1 0.745
dot_accuracy@3 0.8583
dot_accuracy@5 0.8867
dot_accuracy@10 0.9183
dot_precision@1 0.745
dot_precision@3 0.2861
dot_precision@5 0.1773
dot_precision@10 0.0918
dot_recall@1 0.745
dot_recall@3 0.8583
dot_recall@5 0.8867
dot_recall@10 0.9183
dot_ndcg@10 0.8348
dot_mrr@10 0.8078
dot_map@100 0.8109
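
The values above are consistent with each query having exactly one relevant passage: precision@k equals accuracy@k divided by k (for example 0.8583 / 3 ≈ 0.2861) and recall@k equals accuracy@k. The cosine and dot rows are identical, as expected when the final Normalize() module makes every embedding unit length. A minimal sketch of these metrics under that single-positive assumption follows; the function name and toy data are illustrative and not taken from the evaluator that produced the table.

```python
def retrieval_metrics(ranked_ids_per_query, relevant_id_per_query, k):
    """Accuracy, precision, recall and MRR at k, assuming one relevant id per query."""
    n = len(ranked_ids_per_query)
    hits = 0
    reciprocal_rank_sum = 0.0
    for ranked, relevant in zip(ranked_ids_per_query, relevant_id_per_query):
        top_k = ranked[:k]
        if relevant in top_k:
            hits += 1
            # Rank is 1-based: a hit in first place contributes 1.0.
            reciprocal_rank_sum += 1.0 / (top_k.index(relevant) + 1)
    accuracy = hits / n
    return {
        "accuracy": accuracy,
        "precision": accuracy / k,   # at most one of the k results is relevant
        "recall": accuracy,          # the single relevant item is found or not
        "mrr": reciprocal_rank_sum / n,
    }

# Toy rankings for three queries whose relevant passage is always "a".
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
metrics_at_2 = retrieval_metrics(rankings, ["a", "a", "a"], k=2)
print(metrics_at_2)
```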

Training Details

Training Dataset

Unnamed Dataset

  • Size: 12,000 training samples
  • Columns: anchor and positive
  • Approximate statistics based on the first 1000 samples:
    • anchor (string): min 11 tokens, mean 29.43 tokens, max 93 tokens
    • positive (string): min 33 tokens, mean 142.95 tokens, max 356 tokens
  • Samples:
    Sample 1
    anchor: On the ATLAS Trigger Developer Pages, what do two-digit version numbers (e.g., 21.3) and three-digit version numbers (e.g., 21.3.9) indicate?
    positive:
    Metadata:
    source: twiki
    name:
    version: 51
    last modification: 09-09-2024
    category: trigger
    parents_structure: W, e, b, H, o, m, e, /, A, t, l, a, s, T, r, i, g, g, e, r, /, T, r, i, g, g, e, r, D, e, v, e, l, o, p, e, r, P, a, g, e, s

    Chunk text:
    * Two-digit version numbers correspond the branch used to build the nightly (e.g. 21.3) while three digit version numbers correspond to built releases (21.3.9).

    Sample 2
    anchor: How can I list all available nox sessions using the uv runner?
    positive:
    Metadata:
    source: GitLabMarkdown
    project path: particlepredatorinvasion/digout
    project description: Configurable Python library that automates the conversion of LHCb DIGI files into parquet dataframes by managing a sequence of dependent steps and scheduling their parallel execution on local or distributed systems.
    file path: docs/source/development/tests.md
    header path: 'Testing & Automation' > 'Running Sessions'

    Chunk text:
    To list all available sessions:
        uv run nox --list
    To run a specific session:
        uv run nox -s <session_name>
    For example, to run the linter: uv run nox -s lint_check.

    Sample 3
    anchor: Which setupATLAS -c options will set up the default CentOS6 container used by ATLAS?
    positive:
    Metadata:
    source: AtlasTalk

    Chunk text:
    Answer 5:
    Hi,
    You can also do
    setupATLAS -c centos6
    setupATLAS -c sl6
    setupATLAS -c rhel6
    and it will always setup the default centos6 container that is used by ATLAS.
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 1.0,
        "similarity_fct": "dot_score",
        "gather_across_devices": false
    }
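
    With these parameters, each anchor is scored against every positive in the batch by dot product, the scores are multiplied by `scale` (1.0 here), and a cross-entropy loss treats the anchor's own positive as the correct class while the other positives act as negatives. The following is a dependency-free sketch of that computation, assuming plain Python lists as embeddings; it illustrates the idea and is not the library's code.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def multiple_negatives_ranking_loss(anchors, positives, scale=1.0):
    """Mean cross-entropy over in-batch scores; positives[i] is the target for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        # One logit per candidate passage in the batch.
        logits = [scale * dot(anchor, p) for p in positives]
        # Numerically stable log-sum-exp.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        # Negative log-probability of the matching positive.
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy batch of four orthogonal unit vectors where each anchor equals its positive.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
toy_loss = multiple_negatives_ranking_loss(identity, identity)
print(toy_loss)
```

    Because the embeddings are unit length and the scale is 1.0, every logit lies in [-1, 1]. For a full batch of 16 this keeps the loss above roughly ln(1 + 15·e⁻²) ≈ 1.11, with ln(16) ≈ 2.77 corresponding to uniform scores, so absolute loss values are compressed compared with runs that use a larger scale.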
    

Evaluation Dataset

Unnamed Dataset

  • Size: 1,200 evaluation samples
  • Columns: anchor and positive
  • Approximate statistics based on the first 1000 samples:
    • anchor (string): min 8 tokens, mean 28.49 tokens, max 100 tokens
    • positive (string): min 38 tokens, mean 142.38 tokens, max 384 tokens
  • Samples:
    Sample 1
    anchor: Which copytool was used when the file transfer failed according to the error message?
    positive:
    Metadata:
    source: AtlasTalk

    Chunk text:
    No matching replicas were found in list_replicas() output: [ReplicasNotFound(('No replica found for lfn=panda.0911140145.367865.lib._30337145.30050914397.lib.tgz (allow_lan=True, allow_wan=False)',), {})]:failed to transfer files using copytools=['rucio']

    Sample 2
    anchor: What are the dimensions of the single conductor wire used in SMC_set10 model set #10?
    positive:
    Metadata:
    source: GitLabMarkdown
    project path: steam/analyses/esc-on-smc
    project description:
    file path: SMC_set10/README.md
    header path: 'Model set #10'

    Chunk text:
    Its conductor is a single 2 mm * 0.5 mm wire, but in ROXIE it has 4x4 current lines.

    Sample 3
    anchor: Where should an author go to submit an ATLAS internal note to the CERN Document Server (CDS)?
    positive:
    Metadata:
    source: twiki
    name:
    version: 5
    last modification: 19-04-2022
    category: pubcom
    parents_structure: P, u, b, C, o, m

    Chunk text:
    For each ATLAS internal note the following should be done:
    * go to the [[https://cds.cern.ch/submit?ln=en&doctype=ATN][CDS submission page for ATLAS notes]]: =https://cds.cern.ch/submit?ln=en&doctype=ATN=
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 1.0,
        "similarity_fct": "dot_score",
        "gather_across_devices": false
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • gradient_accumulation_steps: 4
  • learning_rate: 5e-07
  • warmup_ratio: 0.1
  • fp16: True
  • batch_sampler: no_duplicates

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 4
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-07
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 3
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • hub_revision: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • liger_kernel_config: None
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional
  • router_mapping: {}
  • learning_rate_mapping: {}

Training Logs

Epoch   Step  Training Loss  Validation Loss  validation_cosine_ndcg@10
0.5333  100   2.3423         2.2474           0.7773
1.064   200   2.2441         2.1880           0.8141
1.5973  300   2.208          2.1673           0.8285
2.128   400   2.1906         2.1575           0.8343
2.6613  500   2.1826         2.1530           0.8348
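
The fractional epochs in the log can be related to the batch settings. Assuming a single device, 12,000 samples at batch size 16 give 750 batches per epoch, and with 4 accumulation steps that is 188 optimizer steps per epoch (the last one partial). The sketch below is an illustrative reconstruction under those assumptions, not code from the trainer; it reproduces the logged epoch values.

```python
import math

def logged_epoch(step, n_samples=12_000, batch_size=16, grad_accum=4):
    """Approximate fractional epoch reported at a given optimizer step."""
    batches_per_epoch = math.ceil(n_samples / batch_size)        # 750
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)  # 188
    full_epochs, steps_into_epoch = divmod(step, steps_per_epoch)
    # Progress inside the current epoch is measured in batches consumed.
    return full_epochs + steps_into_epoch * grad_accum / batches_per_epoch

for step in (100, 200, 300, 400, 500):
    print(step, round(logged_epoch(step), 4))
```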

Framework Versions

  • Python: 3.12.11
  • Sentence Transformers: 5.1.0
  • Transformers: 4.55.2
  • PyTorch: 2.2.2+cu121
  • Accelerate: 1.10.1
  • Datasets: 4.0.0
  • Tokenizers: 0.21.4

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Safetensors

  • Model size: 0.1B params
  • Tensor type: F32