
Sortformer CoreML Models

Streaming speaker diarization models converted from NVIDIA's Sortformer to CoreML for Apple Silicon.

Model Variants

| Variant | File | Latency | Use Case |
|---|---|---|---|
| Default | Sortformer.mlmodelc | ~1.04s | Low-latency streaming |
| NVIDIA Low | SortformerNvidiaLow.mlmodelc | ~1.04s | Low-latency streaming |
| NVIDIA High | SortformerNvidiaHigh.mlmodelc | ~30.4s | Best quality, offline |

Configuration Parameters

| Parameter | Default | NVIDIA Low | NVIDIA High |
|---|---|---|---|
| chunk_len | 6 | 6 | 340 |
| chunk_right_context | 7 | 7 | 40 |
| chunk_left_context | 1 | 1 | 1 |
| fifo_len | 40 | 188 | 40 |
| spkcache_len | 188 | 188 | 188 |

Model Input/Output Shapes

General:

| Input | Shape | Description |
|---|---|---|
| chunk | [1, 8*(C+L+R), 128] | Mel spectrogram features |
| chunk_lengths | [1] | Actual chunk length |
| spkcache | [1, S, 512] | Speaker cache embeddings |
| spkcache_lengths | [1] | Actual cache length |
| fifo | [1, F, 512] | FIFO queue embeddings |
| fifo_lengths | [1] | Actual FIFO length |

| Output | Shape | Description |
|---|---|---|
| speaker_preds | [1, C+L+R+S+F, 4] | Speaker probabilities (4 speakers) |
| chunk_pre_encoder_embs | [1, C+L+R, 512] | Embeddings for state update |
| chunk_pre_encoder_lengths | [1] | Actual embedding count |
| nest_encoder_embs | [1, C+L+R+S+F, 192] | Embeddings for speaker discrimination |
| nest_encoder_lengths | [1] | Actual speaker embedding count |

Note: C = chunk_len, L = chunk_left_context, R = chunk_right_context, S = spkcache_len, F = fifo_len.
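The table values can be derived directly from the configuration parameters. A minimal sketch, assuming the 8x factor in the chunk input comes from the encoder's mel-frame subsampling (the variable names here are illustrative, not part of the model's API):

```python
# Derive sequence lengths from the configuration parameters above.
# C = chunk_len, L = chunk_left_context, R = chunk_right_context,
# S = spkcache_len, F = fifo_len.

CONFIGS = {
    "default":     dict(C=6,   L=1, R=7,  S=188, F=40),
    "nvidia_low":  dict(C=6,   L=1, R=7,  S=188, F=188),
    "nvidia_high": dict(C=340, L=1, R=40, S=188, F=40),
}

def shapes(C, L, R, S, F):
    chunk_frames = 8 * (C + L + R)  # mel frames fed to the model per chunk
    emb_frames = C + L + R          # pre-encoder embeddings per chunk
    total = emb_frames + S + F      # sequence length over chunk + cache + FIFO
    return chunk_frames, emb_frames, total

for name, cfg in CONFIGS.items():
    print(name, shapes(**cfg))
# default: (112, 14, 242), nvidia_low: (112, 14, 390), nvidia_high: (3048, 381, 609)
```

These numbers match the configuration-specific shape tables below: e.g. the NVIDIA High chunk input is [1, 3048, 128] because 8 * (340 + 1 + 40) = 3048.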

Configuration-Specific Shapes:

| Input | Default | NVIDIA Low | NVIDIA High |
|---|---|---|---|
| chunk | [1, 112, 128] | [1, 112, 128] | [1, 3048, 128] |
| chunk_lengths | [1] | [1] | [1] |
| spkcache | [1, 188, 512] | [1, 188, 512] | [1, 188, 512] |
| spkcache_lengths | [1] | [1] | [1] |
| fifo | [1, 40, 512] | [1, 188, 512] | [1, 40, 512] |
| fifo_lengths | [1] | [1] | [1] |

| Output | Default | NVIDIA Low | NVIDIA High |
|---|---|---|---|
| speaker_preds | [1, 242, 4] | [1, 390, 4] | [1, 609, 4] |
| chunk_pre_encoder_embs | [1, 14, 512] | [1, 14, 512] | [1, 381, 512] |
| chunk_pre_encoder_lengths | [1] | [1] | [1] |
| nest_encoder_embs | [1, 242, 192] | [1, 390, 192] | [1, 609, 192] |
| nest_encoder_lengths | [1] | [1] | [1] |

| Metric | Default | NVIDIA High |
|---|---|---|
| Latency | ~1.12s | ~30.4s |
| RTFx (M4 Max) | ~5.7x | ~125.3x |
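RTFx here is read as the usual real-time factor: audio duration divided by wall-clock processing time, so higher is faster than real time. A minimal sketch of that arithmetic (the function name is illustrative):

```python
# Real-time factor: how many seconds of audio are processed per second
# of wall-clock compute. An RTFx above 1.0 means faster than real time.

def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    return audio_seconds / processing_seconds

# At ~125.3x RTFx, one hour of audio takes roughly half a minute:
print(3600 / 125.3)  # ≈ 28.7 seconds
```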

Usage with FluidAudio (Swift)

```swift
import FluidAudio

// Initialize with the default config (auto-downloads from Hugging Face)
let diarizer = SortformerDiarizer(config: .default)
let models = try await SortformerModels.loadFromHuggingFace(config: .default)
diarizer.initialize(models: models)

// Streaming processing
for audioChunk in audioStream {
    if let result = try diarizer.processSamples(audioChunk) {
        for frame in 0..<result.frameCount {
            for speaker in 0..<4 {
                let prob = result.getSpeakerPrediction(speaker: speaker, frame: frame)
            }
        }
    }
}

// Or batch processing
let timeline = try diarizer.processComplete(audioSamples)
for (speakerIndex, segments) in timeline.segments.enumerated() {
    for segment in segments {
        print("Speaker \(speakerIndex): \(segment.startTime)s - \(segment.endTime)s")
    }
}
```
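Under the hood, turning per-frame speaker probabilities into the time segments that processComplete returns amounts to thresholding each speaker's probability track. A minimal Python sketch of that idea, with two explicit assumptions: an 80 ms output frame duration (a typical Sortformer frame rate, not confirmed by this card) and a 0.5 activity threshold; the function name is illustrative, not the FluidAudio API:

```python
# Threshold per-frame probabilities for one speaker into time segments.
# FRAME_SEC (80 ms) and THRESHOLD (0.5) are assumptions for illustration.

FRAME_SEC = 0.08
THRESHOLD = 0.5

def frames_to_segments(probs, frame_sec=FRAME_SEC, threshold=THRESHOLD):
    """probs: per-frame probabilities for a single speaker."""
    segments, start = [], None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i  # segment opens
        elif p < threshold and start is not None:
            segments.append((round(start * frame_sec, 3), round(i * frame_sec, 3)))
            start = None  # segment closes
    if start is not None:  # segment still open at end of audio
        segments.append((round(start * frame_sec, 3), round(len(probs) * frame_sec, 3)))
    return segments

print(frames_to_segments([0.1, 0.9, 0.8, 0.2, 0.7]))
# → [(0.08, 0.24), (0.32, 0.4)]
```

In practice smoothing (e.g. median filtering) is usually applied before thresholding to avoid spurious one-frame segments; this sketch omits it for clarity.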

Performance

Detailed benchmark results are maintained in the FluidAudio repository: https://github.com/FluidInference/FluidAudio/blob/main/Documentation/Benchmarks.md

Files

Models

  • Sortformer.mlpackage / .mlmodelc - Default config (low latency)
  • SortformerNvidiaLow.mlpackage / .mlmodelc - NVIDIA low latency config
  • SortformerNvidiaHigh.mlpackage / .mlmodelc - NVIDIA high latency config

Scripts

  • convert_to_coreml.py - PyTorch to CoreML conversion
  • streaming_inference.py - Python streaming inference example
  • mic_inference.py - Real-time microphone demo

Source

Original model: https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1

Credits & Acknowledgements

This project would not have been possible without the significant technical contributions of GradientDescent2718 (https://huggingface.co/GradientDescent2718).

Their work was instrumental in:

  • Architecture Conversion: Developing the complex PyTorch-to-CoreML conversion pipeline for the 17-layer Fast-Conformer and 18-layer Transformer heads.
  • Build & Optimization: Engineering the static shape configurations that allow the model to achieve ~120x RTF on Apple Silicon.
  • Logic Implementation: Porting the critical streaming state logic (speaker cache and FIFO management) to ensure consistent speaker identity tracking.

This project was built upon the foundational work of the NVIDIA NeMo team.
