DINO-HuVITS

Info

The development of this model was inspired by the DINO-VITS paper. It is based on the VITS architecture, in which the original PosteriorEncoder is replaced with a HuBERT Base model and the SpeakerEncoder is trained with the DINO loss function.
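For readers unfamiliar with the DINO loss, the following is a minimal NumPy sketch of its general form: a cross-entropy between a centered, sharpened teacher distribution and a student distribution computed on two views of the same utterance. It is an illustration only, not the training code of this model. The function names, the temperatures (0.1 and 0.04), and the center momentum (0.9) are assumptions chosen for the example, and a real training loop would use an autograd framework such as PyTorch.

```python
import numpy as np


def softmax(x, temperature):
    """Temperature-scaled softmax over the last axis (numerically stable)."""
    z = x / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def log_softmax(x, temperature):
    """Temperature-scaled log-softmax over the last axis (numerically stable)."""
    z = x / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy H(teacher, student), averaged over the batch.

    student_logits, teacher_logits: arrays of shape (batch, dim), produced
    from two different views (e.g. crops) of the same utterances.
    center: array of shape (dim,), a running mean of teacher outputs.
    Temperatures are illustrative assumptions, not values from this model.
    """
    # The teacher output is centered and sharpened; it is treated as a fixed target.
    teacher_probs = softmax(teacher_logits - center, teacher_temp)
    student_logp = log_softmax(student_logits, student_temp)
    return float(-(teacher_probs * student_logp).sum(axis=-1).mean())


def update_center(center, teacher_logits, momentum=0.9):
    """Exponential moving average of the batch mean of teacher outputs."""
    return momentum * center + (1.0 - momentum) * teacher_logits.mean(axis=0)
```

In a typical setup the teacher is an exponential moving average of the student and receives no gradients; centering and sharpening the teacher output together help prevent the two networks from collapsing to a trivial solution.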

Quick start

import librosa
import torch

from dino_huvits import DinoHuVits


# Load the pretrained weights and switch to inference mode
model = DinoHuVits.from_pretrained("SazerLife/DINO-HuVITS")
model = model.eval()

# Load both recordings as 16 kHz mono waveforms
content, _ = librosa.load("<content-path>", sr=16000)
reference, _ = librosa.load("<reference-path>", sr=16000)

# Add a batch dimension; lengths holds the content length in samples
content = torch.from_numpy(content).unsqueeze(0)
lengths = torch.tensor([content.shape[1]], dtype=torch.long)
reference = torch.from_numpy(reference).unsqueeze(0)

# Run inference without tracking gradients
with torch.no_grad():
    output, _ = model(content, lengths, reference)
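The quick start passes a single utterance with a batch dimension of 1 and a `lengths` tensor holding its sample count. If several content utterances were to be processed together, they would first have to be zero-padded to a common length. The sketch below shows one way to do that with NumPy. `pad_batch` is a hypothetical helper, not part of `dino_huvits`, and whether the model accepts batches larger than 1 is an assumption that the original example does not confirm.

```python
import numpy as np


def pad_batch(waveforms):
    """Zero-pad 1-D waveforms to a common length.

    Returns a float32 array of shape (batch, max_len) and an int64 array
    with the original length of each waveform in samples.
    """
    lengths = np.array([len(w) for w in waveforms], dtype=np.int64)
    batch = np.zeros((len(waveforms), int(lengths.max())), dtype=np.float32)
    for i, w in enumerate(waveforms):
        batch[i, : len(w)] = w  # trailing samples stay zero
    return batch, lengths
```

The two arrays can be converted with `torch.from_numpy(batch)` and `torch.from_numpy(lengths)` to obtain tensors with the same layout as `content` and `lengths` in the quick start.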

Safetensors

Model size: 0.1B params
Tensor type: F32

Papers for SazerLife/DINO-HuVITS

DINO-VITS: Data-Efficient Noise-Robust Zero-Shot Voice Cloning via Multi-Tasking with Self-Supervised Speaker Verification Loss. arXiv:2311.09770, published Nov 16, 2023.

Self-Supervised Learning with Cluster-Aware-DINO for High-Performance Robust Speaker Verification. arXiv:2304.05754, published Apr 12, 2023.

HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. arXiv:2106.07447, published Jun 14, 2021.