Magneto Schema Retriever - GDC

This is a fine-tuned version of sentence-transformers/all-mpnet-base-v2 for schema matching tasks in the biomedical domain, specifically targeting the GDC (Genomic Data Commons) data standard.

Model Details

  • Base Model: sentence-transformers/all-mpnet-base-v2
  • Task: Column-level semantic search for schema matching
  • Domain: Biomedical/Genomic data (GDC standard)
  • Training Data: Synthetically generated columns from GDC target schema
  • Evaluation Benchmark: GDC-SM, a benchmark of 10 real biomedical study datasets
  • Training Method: Self-supervised contrastive learning with LLM-generated synthetic data
  • Paper: Magneto: Combining Small and Large Language Models for Schema Matching

Training Data vs. Evaluation Benchmark

Important distinction:

  • Training: This model was fine-tuned on synthetically generated data derived from the GDC target schema (736 columns). No real source table data was used during training to avoid data leakage.
  • Evaluation: The model's performance was evaluated on the GDC-SM benchmark, which contains 10 pairs of real biomedical study tables (from published cancer research papers) manually aligned to the GDC standard by domain experts.

The synthetic training data is available at: vida-nyu/magneto-gdc-synthetic

Training Approach

Synthetic Data Generation

The model was fine-tuned using a self-supervised approach that leverages LLMs to generate diverse training data from the GDC target schema. The key innovation is combining LLM-based augmentation with structure-based augmentation:

1. LLM-Based Augmentation (llm-aug):

  • For each anchor column in the GDC target schema, an LLM generates semantically equivalent but syntactically diverse variations
  • Captures syntactic heterogeneity common in real biomedical datasets
  • Example transformations:
    • patient_id → subject_identifier, participant_number, patient_identifier
    • diagnosis_date → date_of_diagnosis, dx_date, diagnostic_date
  • Creates diverse positive examples while preserving semantic meaning
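
The augmentation loop above can be sketched as follows. This is a hypothetical illustration: the prompt template and the `generate()` stub (standing in for a real LLM call) are assumptions, not the paper's actual prompts.

```python
# Hypothetical sketch of the LLM-based augmentation loop. The prompt template
# and the generate() stub are illustrative assumptions; the paper's actual
# prompts may differ.
PROMPT = ("Generate {n} column names that are semantically equivalent to "
          "'{column}' but syntactically different. One name per line.")

def generate(prompt):
    """Stub standing in for an LLM call; returns canned variations."""
    return "subject_identifier\nparticipant_number\npatient_identifier"

def llm_augment(anchor, n=3):
    """Build (anchor, variation) positive pairs from the LLM response."""
    response = generate(PROMPT.format(n=n, column=anchor))
    variations = [v.strip() for v in response.splitlines() if v.strip()]
    return [(anchor, v) for v in variations]

pairs = llm_augment("patient_id")
```

Each `(anchor, variation)` pair then serves as a positive example during contrastive fine-tuning.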

2. Structure-Based Augmentation (struct-aug):

  • Random sampling and shuffling of column values
  • Minor perturbations to column names (character replacements/deletions)
  • Priority sampling that emphasizes frequently occurring values to enhance matching likelihood
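
A minimal sketch of these structure-based augmentations, assuming a straightforward implementation (function names and perturbation details are illustrative, not the paper's exact code):

```python
# Illustrative sketch of structure-based augmentation: priority sampling of
# frequent values, value shuffling, and character-level name perturbations.
# Details are assumptions, not the paper's implementation.
import random
from collections import Counter

def priority_sample(values, k):
    """Keep the k most frequent values (priority sampling)."""
    return [v for v, _ in Counter(values).most_common(k)]

def perturb_name(name, rng, p=0.1):
    """Randomly delete or replace characters in a column name."""
    out = []
    for c in name:
        r = rng.random()
        if r < p / 2:
            continue  # delete the character
        if r < p:
            out.append(rng.choice("abcdefghijklmnopqrstuvwxyz_"))  # replace it
        else:
            out.append(c)
    return "".join(out) or name

def struct_augment(name, values, k=5, seed=0):
    """Produce a perturbed (name, sampled values) view of a column."""
    rng = random.Random(seed)
    sampled = priority_sample(values, k)
    rng.shuffle(sampled)  # shuffle the retained values
    return perturb_name(name, rng), sampled
```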

Contrastive Learning Framework

The fine-tuning employs triplet loss with online triplet mining:

  • Positive pairs: Columns derived from the same anchor (semantically equivalent, syntactically diverse)
  • Negative pairs: Columns from different anchor classes
  • Objective: Minimize embedding distance between positive pairs while maximizing distance between negative pairs
  • Training strategy: Target-only fine-tuning (trained only on GDC target columns to avoid false negatives from unlabeled source data)
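
The batch-hard variant of online triplet mining described above can be illustrated with a small numpy sketch: for each anchor, select the hardest (farthest) positive and hardest (closest) negative within the batch. This is a simplified picture of the objective, not the actual training code.

```python
# Simplified numpy illustration of batch-hard online triplet mining:
# per anchor, use the farthest positive and the closest negative.
import numpy as np

def batch_hard_triplet_loss(emb, labels, margin=0.5):
    # Pairwise Euclidean distances between all embeddings in the batch.
    diff = emb[:, None, :] - emb[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1) + 1e-12)
    same = labels[:, None] == labels[None, :]
    losses = []
    for i in range(len(emb)):
        pos = dist[i][same[i] & (np.arange(len(emb)) != i)]  # exclude self
        neg = dist[i][~same[i]]
        if len(pos) == 0 or len(neg) == 0:
            continue
        # Hardest positive minus hardest negative, hinged at the margin.
        losses.append(max(0.0, pos.max() - neg.min() + margin))
    return float(np.mean(losses))
```

When same-anchor columns already cluster far from other classes, the loss is zero; overlapping clusters produce a positive loss that pushes the embeddings apart.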

Column Serialization

Columns are encoded using the header_values_verbose strategy:

  • Combines column names with sampled column values
  • Uses priority sampling to select the most informative values (frequent values that enhance matching)
  • Provides richer semantic context beyond just column names
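
One way to implement this serialization is sketched below; the exact template string used by Magneto is an assumption here, the point is the combination of header and priority-sampled values.

```python
# Hedged sketch of a header_values_verbose-style serialization: the column
# name plus its most frequent values in one string. The template text is an
# assumption, not Magneto's exact format.
from collections import Counter

def serialize_column(name, values, k=5):
    # Priority sampling: keep the k most frequent values.
    top_values = [v for v, _ in Counter(values).most_common(k)]
    return f"Column: {name}. Values: {', '.join(top_values)}"

text = serialize_column("tumor_stage", ["stage i", "stage ii", "stage ii", "stage iii"])
```

The resulting string is what gets passed to the encoder, so value distributions directly shape the embedding.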

Usage

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Load the model
model = SentenceTransformer('vida-nyu/magneto-schema-retriever-gdc')

# Encode column names/descriptions
embeddings = model.encode([
    "patient_id",
    "diagnosis_date",
    "tumor_stage"
])

# Find semantic similarity between two columns
similarity = cos_sim(embeddings[0], embeddings[1])

# For schema matching: encode source and target columns, then retrieve top-k matches
source_columns = ["subject_identifier", "dx_date"]  # columns from a source table
target_columns = ["patient_id", "diagnosis_date"]   # columns from the target schema
source_embeddings = model.encode(source_columns)
target_embeddings = model.encode(target_columns)
similarities = cos_sim(source_embeddings, target_embeddings)
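
Given the similarity matrix, the top-k retrieval step reduces to a per-row sort. A small model-independent helper (names here are illustrative):

```python
# Sketch of the retrieval step on a precomputed similarity matrix: for each
# source column, return the k most similar target columns. Pure numpy, so it
# works with any encoder.
import numpy as np

def top_k_matches(similarities, target_names, k=3):
    """similarities: (n_source, n_target) matrix of cosine similarities."""
    order = np.argsort(-similarities, axis=1)[:, :k]  # descending per row
    return [[target_names[j] for j in row] for row in order]

sim = np.array([[0.9, 0.2, 0.5],
                [0.1, 0.8, 0.3]])
targets = ["case_id", "diagnosis", "stage"]
matches = top_k_matches(sim, targets, k=2)
# matches -> [["case_id", "stage"], ["diagnosis", "stage"]]
```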

Performance

This retriever serves as the first stage in the two-phase Magneto pipeline:

  1. Candidate Retrieval (this model): Efficiently generates a ranked list of potential column matches
  2. Match Reranking: LLM reranker further refines the candidate list for optimal accuracy
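
The two-stage flow can be sketched as a skeleton like the one below. Both `retrieve()` (a toy token-overlap ranker standing in for this model) and `rerank()` (an identity stub standing in for the LLM reranker) are illustrative placeholders, not Magneto's implementation.

```python
# Illustrative retrieve-then-rerank skeleton. Both stages are stubs:
# retrieve() uses toy token-overlap scoring instead of embeddings, and
# rerank() is an identity function where a real pipeline would call an LLM.
def retrieve(source_col, target_cols, k):
    """Stub retriever: rank targets by token overlap with the source."""
    def score(t):
        a, b = set(source_col.split("_")), set(t.split("_"))
        return len(a & b) / len(a | b)  # Jaccard similarity of name tokens
    return sorted(target_cols, key=score, reverse=True)[:k]

def rerank(source_col, candidates):
    """Stub reranker (identity); a real pipeline would query an LLM here."""
    return candidates

def match(source_col, target_cols, k=3):
    candidates = retrieve(source_col, target_cols, k)  # stage 1: retrieval
    return rerank(source_col, candidates)[0]           # stage 2: reranking

best = match("patient_id", ["case_submitter_id", "patient_identifier", "tumor_stage"])
```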

On the GDC-SM benchmark, the fine-tuned retriever with LLM-generated synthetic training data significantly outperforms zero-shot baselines. It is particularly effective at capturing domain-specific biomedical terminology and handling syntactic variations in real-world datasets.

Intended Use

  • Primary: Schema matching for biomedical/genomic datasets aligned to GDC standard
  • Mapping clinical study data to standardized formats (e.g., data harmonization in cancer research)
  • Column name semantic search within biomedical data portals
  • Building automated data integration pipelines for genomic research
  • First-stage retrieval in retrieve-then-rerank schema matching workflows

Limitations

  • Domain-specific: Optimized for biomedical/genomic terminology and GDC standard. Performance may degrade on other domains without additional fine-tuning.
  • Column-level only: Designed specifically for schema matching tasks, not general semantic similarity.
  • Requires reranking: Best results achieved when used as first-stage retrieval with LLM-based reranking.
  • Target-only training: Fine-tuned only on synthetically augmented target schema columns. Does not incorporate source table patterns during training to avoid introducing biases from unlabeled data.
  • Synthetic training data: The LLM-generated augmentations are effective, but the model may struggle with novel syntactic variations they do not cover.

Citation

If you use this model, please cite the Magneto paper:

@article{10.14778/3742728.3742757,
  author = {Liu, Yurong and Pena, Eduardo H. M. and Santos, A\'{e}cio and Wu, Eden and Freire, Juliana},
  title = {Magneto: Combining Small and Large Language Models for Schema Matching},
  year = {2025},
  issue_date = {April 2025},
  publisher = {VLDB Endowment},
  volume = {18},
  number = {8},
  issn = {2150-8097},
  url = {https://doi.org/10.14778/3742728.3742757},
  doi = {10.14778/3742728.3742757},
  journal = {Proc. VLDB Endow.},
  month = apr,
  pages = {2681--2694},
  numpages = {14}
}

If you use the GDC-SM benchmark, please also cite:

@dataset{santos_2025_14963588,
  author = {Santos, Aécio and Wu, Eden and Lopez, Roque and Keegan, Sarah and 
            Pena, Eduardo and Liu, Wenke and Liu, Yurong and Fenyo, David and Freire, Juliana},
  title = {GDC-SM: The GDC Schema Matching Benchmark},
  month = apr,
  year = 2025,
  publisher = {Zenodo},
  version = {1.0},
  doi = {10.5281/zenodo.14963588},
  url = {https://doi.org/10.5281/zenodo.14963588}
}

Acknowledgments

This work was supported by NSF awards IIS-2106888 and OAC-2411221, the DARPA ASKEM program (HR0011262087), and the ARPA-H BDF program.
