Magneto Schema Retriever - GDC

This is a fine-tuned version of sentence-transformers/all-mpnet-base-v2 for schema matching tasks in the biomedical domain, specifically targeting the GDC (Genomic Data Commons) data standard.

Model Details

  • Base Model: sentence-transformers/all-mpnet-base-v2
  • Task: Column-level semantic search for schema matching
  • Domain: Biomedical/Genomic data (GDC standard)
  • Training Data: Synthetically generated columns from GDC target schema
  • Evaluation Benchmark: GDC-SM, a benchmark of 10 real biomedical study datasets
  • Training Method: Self-supervised contrastive learning with LLM-generated synthetic data
  • Paper: Magneto: Combining Small and Large Language Models for Schema Matching

Training Data vs. Evaluation Benchmark

Important distinction:

  • Training: This model was fine-tuned on synthetically generated data derived from the GDC target schema (736 columns). No real source table data was used during training to avoid data leakage.
  • Evaluation: The model's performance was evaluated on the GDC-SM benchmark, which contains 10 pairs of real biomedical study tables (from published cancer research papers) manually aligned to the GDC standard by domain experts.

The synthetic training data is available at: vida-nyu/magneto-gdc-synthetic

Training Approach

Synthetic Data Generation

The model was fine-tuned using a self-supervised approach that leverages LLMs to generate diverse training data from the GDC target schema. The key innovation is combining LLM-based augmentation with structure-based augmentation:

1. LLM-Based Augmentation (llm-aug):

  • For each anchor column in the GDC target schema, an LLM generates semantically equivalent but syntactically diverse variations
  • Captures syntactic heterogeneity common in real biomedical datasets
  • Example transformations:
    • patient_id → subject_identifier, participant_number, patient_identifier
    • diagnosis_date → date_of_diagnosis, dx_date, diagnostic_date
  • Creates diverse positive examples while preserving semantic meaning
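
The augmentation loop above can be sketched as follows. This is a hypothetical illustration: the prompt template and the `generate()` stub (standing in for a real LLM call) are assumptions, not the paper's actual prompts.

```python
# Hypothetical sketch of the LLM-based augmentation loop. The prompt template
# and the generate() stub are illustrative assumptions; the paper's actual
# prompts may differ.
PROMPT = ("Generate {n} column names that are semantically equivalent to "
          "'{column}' but syntactically different. One name per line.")

def generate(prompt):
    """Stub standing in for an LLM call; returns canned variations."""
    return "subject_identifier\nparticipant_number\npatient_identifier"

def llm_augment(anchor, n=3):
    """Build (anchor, variation) positive pairs from the LLM response."""
    response = generate(PROMPT.format(n=n, column=anchor))
    variations = [v.strip() for v in response.splitlines() if v.strip()]
    return [(anchor, v) for v in variations]

pairs = llm_augment("patient_id")
```

Each `(anchor, variation)` pair then serves as a positive example during contrastive fine-tuning.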

2. Structure-Based Augmentation (struct-aug):

  • Random sampling and shuffling of column values
  • Minor perturbations to column names (character replacements/deletions)
  • Priority sampling that emphasizes frequently occurring values to enhance matching likelihood
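
A minimal sketch of these structure-based augmentations, assuming a straightforward implementation (function names and perturbation details are illustrative, not the paper's exact code):

```python
# Illustrative sketch of structure-based augmentation: priority sampling of
# frequent values, value shuffling, and character-level name perturbations.
# Details are assumptions, not the paper's implementation.
import random
from collections import Counter

def priority_sample(values, k):
    """Keep the k most frequent values (priority sampling)."""
    return [v for v, _ in Counter(values).most_common(k)]

def perturb_name(name, rng, p=0.1):
    """Randomly delete or replace characters in a column name."""
    out = []
    for c in name:
        r = rng.random()
        if r < p / 2:
            continue  # delete the character
        if r < p:
            out.append(rng.choice("abcdefghijklmnopqrstuvwxyz_"))  # replace it
        else:
            out.append(c)
    return "".join(out) or name

def struct_augment(name, values, k=5, seed=0):
    """Produce a perturbed (name, sampled values) view of a column."""
    rng = random.Random(seed)
    sampled = priority_sample(values, k)
    rng.shuffle(sampled)  # shuffle the retained values
    return perturb_name(name, rng), sampled
```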

Contrastive Learning Framework

The fine-tuning employs triplet loss with online triplet mining:

  • Positive pairs: Columns derived from the same anchor (semantically equivalent, syntactically diverse)
  • Negative pairs: Columns from different anchor classes
  • Objective: Minimize embedding distance between positive pairs while maximizing distance between negative pairs
  • Training strategy: Target-only fine-tuning (trained only on GDC target columns to avoid false negatives from unlabeled source data)
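
The batch-hard variant of online triplet mining described above can be illustrated with a small numpy sketch: for each anchor, select the hardest (farthest) positive and hardest (closest) negative within the batch. This is a simplified picture of the objective, not the actual training code.

```python
# Simplified numpy illustration of batch-hard online triplet mining:
# per anchor, use the farthest positive and the closest negative.
import numpy as np

def batch_hard_triplet_loss(emb, labels, margin=0.5):
    # Pairwise Euclidean distances between all embeddings in the batch.
    diff = emb[:, None, :] - emb[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1) + 1e-12)
    same = labels[:, None] == labels[None, :]
    losses = []
    for i in range(len(emb)):
        pos = dist[i][same[i] & (np.arange(len(emb)) != i)]  # exclude self
        neg = dist[i][~same[i]]
        if len(pos) == 0 or len(neg) == 0:
            continue
        # Hardest positive minus hardest negative, hinged at the margin.
        losses.append(max(0.0, pos.max() - neg.min() + margin))
    return float(np.mean(losses))
```

When same-anchor columns already cluster far from other classes, the loss is zero; overlapping clusters produce a positive loss that pushes the embeddings apart.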

Column Serialization

Columns are encoded using the header_values_verbose strategy:

  • Combines column names with sampled column values
  • Uses priority sampling to select the most informative values (frequent values that enhance matching)
  • Provides richer semantic context beyond just column names
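
One way to implement this serialization is sketched below; the exact template string used by Magneto is an assumption here, the point is the combination of header and priority-sampled values.

```python
# Hedged sketch of a header_values_verbose-style serialization: the column
# name plus its most frequent values in one string. The template text is an
# assumption, not Magneto's exact format.
from collections import Counter

def serialize_column(name, values, k=5):
    # Priority sampling: keep the k most frequent values.
    top_values = [v for v, _ in Counter(values).most_common(k)]
    return f"Column: {name}. Values: {', '.join(top_values)}"

text = serialize_column("tumor_stage", ["stage i", "stage ii", "stage ii", "stage iii"])
```

The resulting string is what gets passed to the encoder, so value distributions directly shape the embedding.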

Usage

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Load the model
model = SentenceTransformer('vida-nyu/magneto-schema-retriever-gdc')

# Encode column names/descriptions
embeddings = model.encode([
    "patient_id",
    "diagnosis_date",
    "tumor_stage"
])

# Find semantic similarity between two columns
similarity = cos_sim(embeddings[0], embeddings[1])

# For schema matching: encode source and target columns, then retrieve top-k matches
source_columns = ["subject_identifier", "dx_date"]  # columns from a source table
target_columns = ["patient_id", "diagnosis_date"]   # columns from the target schema
source_embeddings = model.encode(source_columns)
target_embeddings = model.encode(target_columns)
similarities = cos_sim(source_embeddings, target_embeddings)
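
Given the similarity matrix, the top-k retrieval step reduces to a per-row sort. A small model-independent helper (names here are illustrative):

```python
# Sketch of the retrieval step on a precomputed similarity matrix: for each
# source column, return the k most similar target columns. Pure numpy, so it
# works with any encoder.
import numpy as np

def top_k_matches(similarities, target_names, k=3):
    """similarities: (n_source, n_target) matrix of cosine similarities."""
    order = np.argsort(-similarities, axis=1)[:, :k]  # descending per row
    return [[target_names[j] for j in row] for row in order]

sim = np.array([[0.9, 0.2, 0.5],
                [0.1, 0.8, 0.3]])
targets = ["case_id", "diagnosis", "stage"]
matches = top_k_matches(sim, targets, k=2)
# matches -> [["case_id", "stage"], ["diagnosis", "stage"]]
```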

Performance

This retriever serves as the first stage in the two-phase Magneto pipeline:

  1. Candidate Retrieval (this model): Efficiently generates a ranked list of potential column matches
  2. Match Reranking: LLM reranker further refines the candidate list for optimal accuracy
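
The two-stage flow can be sketched as a skeleton like the one below. Both `retrieve()` (a toy token-overlap ranker standing in for this model) and `rerank()` (an identity stub standing in for the LLM reranker) are illustrative placeholders, not Magneto's implementation.

```python
# Illustrative retrieve-then-rerank skeleton. Both stages are stubs:
# retrieve() uses toy token-overlap scoring instead of embeddings, and
# rerank() is an identity function where a real pipeline would call an LLM.
def retrieve(source_col, target_cols, k):
    """Stub retriever: rank targets by token overlap with the source."""
    def score(t):
        a, b = set(source_col.split("_")), set(t.split("_"))
        return len(a & b) / len(a | b)  # Jaccard similarity of name tokens
    return sorted(target_cols, key=score, reverse=True)[:k]

def rerank(source_col, candidates):
    """Stub reranker (identity); a real pipeline would query an LLM here."""
    return candidates

def match(source_col, target_cols, k=3):
    candidates = retrieve(source_col, target_cols, k)  # stage 1: retrieval
    return rerank(source_col, candidates)[0]           # stage 2: reranking

best = match("patient_id", ["case_submitter_id", "patient_identifier", "tumor_stage"])
```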

On the GDC-SM benchmark, the fine-tuned retriever with LLM-generated synthetic training data significantly outperforms zero-shot baselines. It is particularly effective at capturing domain-specific biomedical terminology and handling syntactic variations in real-world datasets.

Intended Use

  • Primary: Schema matching for biomedical/genomic datasets aligned to GDC standard
  • Mapping clinical study data to standardized formats (e.g., data harmonization in cancer research)
  • Column name semantic search within biomedical data portals
  • Building automated data integration pipelines for genomic research
  • First-stage retrieval in retrieve-then-rerank schema matching workflows

Limitations

  • Domain-specific: Optimized for biomedical/genomic terminology and GDC standard. Performance may degrade on other domains without additional fine-tuning.
  • Column-level only: Designed specifically for schema matching tasks, not general semantic similarity.
  • Requires reranking: Best results achieved when used as first-stage retrieval with LLM-based reranking.
  • Target-only training: Fine-tuned only on synthetically augmented target schema columns. Does not incorporate source table patterns during training to avoid introducing biases from unlabeled data.
  • Synthetic training data: The LLM-generated augmentations are effective, but the model may struggle with novel syntactic variations they do not cover.

Citation

If you use this model, please cite the Magneto paper:

@article{10.14778/3742728.3742757,
  author = {Liu, Yurong and Pena, Eduardo H. M. and Santos, A\'{e}cio and Wu, Eden and Freire, Juliana},
  title = {Magneto: Combining Small and Large Language Models for Schema Matching},
  year = {2025},
  issue_date = {April 2025},
  publisher = {VLDB Endowment},
  volume = {18},
  number = {8},
  issn = {2150-8097},
  url = {https://doi.org/10.14778/3742728.3742757},
  doi = {10.14778/3742728.3742757},
  journal = {Proc. VLDB Endow.},
  month = apr,
  pages = {2681--2694},
  numpages = {14}
}

If you use the GDC-SM benchmark, please also cite:

@dataset{santos_2025_14963588,
  author = {Santos, Aécio and Wu, Eden and Lopez, Roque and Keegan, Sarah and 
            Pena, Eduardo and Liu, Wenke and Liu, Yurong and Fenyo, David and Freire, Juliana},
  title = {GDC-SM: The GDC Schema Matching Benchmark},
  month = apr,
  year = 2025,
  publisher = {Zenodo},
  version = {1.0},
  doi = {10.5281/zenodo.14963588},
  url = {https://doi.org/10.5281/zenodo.14963588}
}

Acknowledgments

This work was supported by NSF awards IIS-2106888 and OAC-2411221, the DARPA ASKEM program (HR0011262087), and the ARPA-H BDF program.
