Magneto Schema Retriever - GDC
This is a fine-tuned version of sentence-transformers/all-mpnet-base-v2 for schema matching tasks in the biomedical domain, specifically targeting the GDC (Genomic Data Commons) data standard.
Model Details
- Base Model: sentence-transformers/all-mpnet-base-v2
- Task: Column-level semantic search for schema matching
- Domain: Biomedical/Genomic data (GDC standard)
- Training Data: Synthetically generated columns from GDC target schema
- Evaluation Benchmark: GDC-SM (10 real biomedical study datasets)
- Training Method: Self-supervised contrastive learning with LLM-generated synthetic data
- Paper: Magneto: Combining Small and Large Language Models for Schema Matching
Training Data vs. Evaluation Benchmark
Important distinction:
- Training: This model was fine-tuned on synthetically generated data derived from the GDC target schema (736 columns). No real source table data was used during training to avoid data leakage.
- Evaluation: The model's performance was evaluated on the GDC-SM benchmark, which contains 10 pairs of real biomedical study tables (from published cancer research papers) manually aligned to the GDC standard by domain experts.
The synthetic training data is available at: vida-nyu/magneto-gdc-synthetic
Training Approach
Synthetic Data Generation
The model was fine-tuned using a self-supervised approach that leverages LLMs to generate diverse training data from the GDC target schema. The key innovation is combining LLM-based augmentation with structure-based augmentation:
1. LLM-Based Augmentation (llm-aug):
- For each anchor column in the GDC target schema, an LLM generates semantically equivalent but syntactically diverse variations
- Captures syntactic heterogeneity common in real biomedical datasets
- Example transformations:
  - patient_id → subject_identifier, participant_number, patient_identifier
  - diagnosis_date → date_of_diagnosis, dx_date, diagnostic_date
- Creates diverse positive examples while preserving semantic meaning
2. Structure-Based Augmentation (struct-aug):
- Random sampling and shuffling of column values
- Minor perturbations to column names (character replacements/deletions)
- Priority sampling that emphasizes frequently occurring values to enhance matching likelihood
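As a rough illustration, a structure-based augmentation step could look like the sketch below. The function name, parameters, and perturbation details are illustrative assumptions, not taken from the Magneto codebase:

```python
import random

def struct_augment(column_name, values, n_samples=10, p_perturb=0.15):
    """Toy struct-aug sketch: sample/shuffle values by frequency, perturb the name."""
    # Priority sampling: weight values by their frequency so common values are kept.
    freq = {v: values.count(v) for v in set(values)}
    sampled = random.choices(list(freq), weights=list(freq.values()),
                             k=min(n_samples, len(values)))
    random.shuffle(sampled)

    # Minor character-level perturbation of the column name (replace or delete one char).
    chars = list(column_name)
    if chars and random.random() < p_perturb:
        i = random.randrange(len(chars))
        chars[i] = random.choice("abcdefghijklmnopqrstuvwxyz_") if random.random() < 0.5 else ""
    return "".join(chars), sampled
```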
Contrastive Learning Framework
The fine-tuning employs triplet loss with online triplet mining:
- Positive pairs: Columns derived from the same anchor (semantically equivalent, syntactically diverse)
- Negative pairs: Columns from different anchor classes
- Objective: Minimize embedding distance between positive pairs while maximizing distance between negative pairs
- Training strategy: Target-only fine-tuning (trained only on GDC target columns to avoid false negatives from unlabeled source data)
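Under these assumptions, such a fine-tuning loop can be sketched with sentence-transformers; the example columns, labels, and hyperparameters below are illustrative, not the actual Magneto training configuration:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# Each serialized column is labeled with the id of its GDC anchor column;
# examples sharing a label are positives for each other, all others negatives.
train_examples = [
    InputExample(texts=["patient_id: P001, P002, P003"], label=0),
    InputExample(texts=["subject_identifier: P001, P003, P002"], label=0),
    InputExample(texts=["diagnosis_date: 2021-03-01, 2020-11-15"], label=1),
    InputExample(texts=["dx_date: 2020-11-15, 2021-03-01"], label=1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=4)

# BatchHardTripletLoss mines the hardest positive/negative per anchor within
# each batch, i.e., online triplet mining.
train_loss = losses.BatchHardTripletLoss(model=model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```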
Column Serialization
Columns are encoded using the `header_values_verbose` strategy:
- Combines column names with sampled column values
- Uses priority sampling to select the most informative values (frequent values that enhance matching)
- Provides richer semantic context beyond just column names
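A minimal sketch of this serialization, assuming a simple name-plus-values template (the exact template used by Magneto may differ):

```python
from collections import Counter

def serialize_column(name, values, k=5):
    """Hypothetical header_values_verbose-style serialization: header plus frequent values."""
    top_values = [v for v, _ in Counter(values).most_common(k)]  # frequency-based priority sampling
    return f"Column: {name}. Values: {', '.join(top_values)}"

print(serialize_column("tumor_stage", ["stage ii", "stage ii", "stage i", "stage iv"]))
# -> Column: tumor_stage. Values: stage ii, stage i, stage iv
```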
Usage

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Load the model
model = SentenceTransformer('vida-nyu/magneto-schema-retriever-gdc')

# Encode column names/descriptions
embeddings = model.encode([
    "patient_id",
    "diagnosis_date",
    "tumor_stage",
])

# Find semantic similarity between two columns
similarity = cos_sim(embeddings[0], embeddings[1])

# For schema matching: encode source and target columns, then retrieve top-k matches
source_columns = ["subject_id", "dx_date"]           # e.g., columns from a study table
target_columns = ["patient_id", "diagnosis_date"]    # e.g., columns from the GDC schema
source_embeddings = model.encode(source_columns)
target_embeddings = model.encode(target_columns)
similarities = cos_sim(source_embeddings, target_embeddings)
```
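To turn the similarity matrix into a ranked candidate list, you can take the top-k entries per source column (k=2 here is arbitrary and capped by the number of target columns):

```python
import torch

values, indices = torch.topk(similarities, k=2, dim=1)
for i, source_col in enumerate(source_columns):
    candidates = [(target_columns[j], round(score, 3))
                  for j, score in zip(indices[i].tolist(), values[i].tolist())]
    print(source_col, "->", candidates)
```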
Performance
This retriever serves as the first stage in the two-phase Magneto pipeline:
1. Candidate Retrieval (this model): efficiently generates a ranked list of potential column matches
2. Match Reranking: an LLM-based reranker further refines the candidate list for optimal accuracy
On the GDC-SM benchmark, the fine-tuned retriever with LLM-generated synthetic training data significantly outperforms zero-shot baselines, particularly excelling at capturing domain-specific biomedical terminology and handling syntactic variations in real-world datasets.
Intended Use
- Primary: Schema matching for biomedical/genomic datasets aligned to GDC standard
- Mapping clinical study data to standardized formats (e.g., data harmonization in cancer research)
- Column name semantic search within biomedical data portals
- Building automated data integration pipelines for genomic research
- First-stage retrieval in retrieve-then-rerank schema matching workflows
Limitations
- Domain-specific: Optimized for biomedical/genomic terminology and GDC standard. Performance may degrade on other domains without additional fine-tuning.
- Column-level only: Designed specifically for schema matching tasks, not general semantic similarity.
- Requires reranking: Best results achieved when used as first-stage retrieval with LLM-based reranking.
- Target-only training: Fine-tuned only on synthetically augmented target schema columns. Does not incorporate source table patterns during training to avoid introducing biases from unlabeled data.
- Synthetic training data: While effective, the model's ability to handle novel syntactic variations not covered by the LLM-generated augmentations may be limited.
Citation
If you use this model, please cite the Magneto paper:
```bibtex
@article{10.14778/3742728.3742757,
  author = {Liu, Yurong and Pena, Eduardo H. M. and Santos, A\'{e}cio and Wu, Eden and Freire, Juliana},
  title = {Magneto: Combining Small and Large Language Models for Schema Matching},
  year = {2025},
  issue_date = {April 2025},
  publisher = {VLDB Endowment},
  volume = {18},
  number = {8},
  issn = {2150-8097},
  url = {https://doi.org/10.14778/3742728.3742757},
  doi = {10.14778/3742728.3742757},
  journal = {Proc. VLDB Endow.},
  month = apr,
  pages = {2681--2694},
  numpages = {14}
}
```
If you use the GDC-SM benchmark, please also cite:
```bibtex
@dataset{santos_2025_14963588,
  author = {Santos, A\'{e}cio and Wu, Eden and Lopez, Roque and Keegan, Sarah and
            Pena, Eduardo and Liu, Wenke and Liu, Yurong and Fenyo, David and Freire, Juliana},
  title = {GDC-SM: The GDC Schema Matching Benchmark},
  month = apr,
  year = 2025,
  publisher = {Zenodo},
  version = {1.0},
  doi = {10.5281/zenodo.14963588},
  url = {https://doi.org/10.5281/zenodo.14963588}
}
```
Related Resources
- Paper: Magneto on arXiv
- Code: magneto-matcher GitHub repository
- Benchmark: GDC-SM on Zenodo
- Training Data: vida-nyu/magneto-gdc-synthetic
- Lab: VIDA Lab @ NYU
Acknowledgments
This work was supported by NSF awards IIS-2106888 and OAC-2411221, the DARPA ASKEM program (HR0011262087), and the ARPA-H BDF program.