# OmniVoice 🌍
<p align="center">
<img width="200" height="200" alt="OmniVoice" src="https://zhu-han.github.io/omnivoice/pics/omnivoice.jpg" />
</p>
<p align="center">
<a href="https://huggingface.co/k2-fsa/OmniVoice"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-FFD21E" alt="Hugging Face Model"></a>
<a href="https://huggingface.co/spaces/k2-fsa/OmniVoice"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Space-blue" alt="Hugging Face Space"></a>
<a href="https://arxiv.org/abs/2604.00688"><img src="https://img.shields.io/badge/arXiv-Paper-B31B1B.svg"></a>
<a href="https://zhu-han.github.io/omnivoice"><img src="https://img.shields.io/badge/GitHub.io-Demo_Page-blue?logo=GitHub&style=flat-square"></a>
</p>
OmniVoice is a state-of-the-art zero-shot multilingual TTS model supporting more than 600 languages. Built on a novel diffusion language model architecture, it generates high-quality speech with superior inference speed, supporting voice cloning and voice design.
**Contents**: [Key Features](#key-features) | [Installation](#installation) | [Quick Start](#quick-start) | [Python API](#python-api) | [Command-Line Tools](#command-line-tools) | [Training & Evaluation](#training--evaluation) | [Discussion](#discussion--communication) | [Citation](#citation)
## Key Features
- **600+ Languages Supported**: The broadest language coverage among zero-shot TTS models ([full list](docs/languages.md))
- **Voice Cloning**: State-of-the-art voice cloning quality.
- **Voice Design**: Control voices via assigned speaker attributes (gender, age, pitch, dialect/accent, whisper, etc.).
- **Fast Inference**: RTF as low as 0.025 (40x faster than real-time).
- **Diffusion Language Model Architecture**: A clean, streamlined, and scalable design that delivers both quality and speed.
---
## Installation
Choose **one** of the following methods: **pip** or **uv**.
### pip
> We recommend using a fresh virtual environment (e.g., `conda`, `venv`, etc.) to avoid conflicts.
**Step 1**: Install PyTorch
<details>
<summary>NVIDIA GPU</summary>
```bash
# Install pytorch with your CUDA version, e.g.
pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128
```
> See [PyTorch official site](https://pytorch.org/get-started/locally/) for other versions installation.
</details>
<details>
<summary>Apple Silicon</summary>
```bash
pip install torch==2.8.0 torchaudio==2.8.0
```
</details>
**Step 2**: Install OmniVoice (choose one)
```bash
# From PyPI (stable release)
pip install omnivoice
# From the latest source on GitHub (no need to clone)
pip install git+https://github.com/k2-fsa/OmniVoice.git
# For development (clone first, editable install)
git clone https://github.com/k2-fsa/OmniVoice.git
cd OmniVoice
pip install -e .
```
### uv
Clone the repository and sync dependencies:
```bash
git clone https://github.com/k2-fsa/OmniVoice.git
cd OmniVoice
uv sync
```
> **Tip**: Can use mirror with `uv sync --default-index "https://mirrors.aliyun.com/pypi/simple"`
---
## Quick Start
Try OmniVoice without coding:
- Launch the local web UI: `omnivoice-demo --ip 0.0.0.0 --port 8001`
- Or try it directly on [HuggingFace Space](https://huggingface.co/spaces/k2-fsa/OmniVoice)
> If you have trouble connecting to HuggingFace when downloading the pre-trained models, set `export HF_ENDPOINT="https://hf-mirror.com"` before running.
For full usage, see the [Python API](#python-api) and [Command-Line Tools](#command-line-tools) sections below.
---
## Python API
The OmniVoice model supports three generation modes. All features in this section are also available via [command-line tools](#command-line-tools).
### Voice Cloning
Clone a voice from a short reference audio. Provide `ref_audio` and `ref_text`:
```python
from omnivoice import OmniVoice
import torch
import torchaudio
model = OmniVoice.from_pretrained(
"k2-fsa/OmniVoice",
device_map="cuda:0",
dtype=torch.float16
)
# Apple Silicon users: use device_map="mps" instead
audio = model.generate(
text="Hello, this is a test of zero-shot voice cloning.",
ref_audio="ref.wav",
ref_text="Transcription of the reference audio.",
) # audio is a list of `torch.Tensor` with shape (1, T) at 24 kHz.
# If you don't want to input `ref_text` manually, you can directly omit the `ref_text`.
# The model will use Whisper ASR to auto-transcribe it.
torchaudio.save("out.wav", audio[0], 24000)
```
### Voice Design
Describe the desired voice with speaker attributes — no reference audio needed.
Supported attributes: **gender** (male/female), **age** (child to elderly),
**pitch** (very low to very high), **style** (whisper), **English accent**
(American, British, etc.), and **Chinese dialect** (四川话, 陕西话, etc.).
Attributes are comma-separated and freely combinable across categories.
```python
audio = model.generate(
text="Hello, this is a test of zero-shot voice design.",
instruct="female, low pitch, british accent",
)
```
See [docs/voice-design.md](docs/voice-design.md) for the full attribute
reference, Chinese equivalents, and usage tips.
### Auto Voice
Let the model choose a voice automatically:
```python
audio = model.generate(text="This is a sentence without any voice prompt.")
```
### Generation Parameters
All above three modes share the same `model.generate()` API. You can further control the generation behavior via keyword arguments:
```python
audio = model.generate(
text="...",
num_step=32, # diffusion steps (or 16 for faster inference)
speed=1.0, # speed factor (>1.0 faster, <1.0 slower)
duration=10.0, # fixed output duration in seconds (overrides speed)
# ... more options
)
```
See more detailed control in [docs/generation-parameters.md](docs/generation-parameters.md).
### Non-Verbal & Pronunciation Control
OmniVoice supports inline **non-verbal symbols** and **pronunciation hints** within the input text.
**Non-verbal symbols**: Insert tags like `[laughter]` directly in the text to add expressive non-verbal sounds.
```python
audio = model.generate(text="[laughter] You really got me. I didn't see that coming at all.")
```
Supported tags: `[laughter]`, `[confirmation-en]`, `[question-en]`, `[question-ah]`, `[question-oh]`, `[question-ei]`, `[question-yi]`, `[surprise-ah]`, `[surprise-oh]`, `[surprise-wa]`, `[surprise-yo]`, `[dissatisfaction-hnn]`, `[sniff]`, `[sigh]`
**Pronunciation control (Chinese)**: Use pinyin with tone numbers to correct specific character pronunciations.
```python
audio = model.generate(text="这批货物打ZHE2出售后他严重SHE2本了,再也经不起ZHE1腾了。")
```
**Pronunciation control (English)**: Use [CMU pronunciation dictionary](https://svn.code.sf.net/p/cmusphinx/code/trunk/cmudict/cmudict.0.7a) (uppercase, in brackets) to override default English pronunciations.
```python
audio = model.generate(text="You could probably still make [IH1 T] look good.")
```
---
## Command-Line Tools
Three CLI entry points are provided. The CLI tools support all features available in the Python API (voice cloning, voice design, auto voice, generation parameters, etc.) — all controlled via command-line arguments.
| Command | Description | Source |
|---|---|---|
| `omnivoice-demo` | Interactive Gradio web demo | [omnivoice/cli/demo.py](omnivoice/cli/demo.py) |
| `omnivoice-infer` | Single-item inference | [omnivoice/cli/infer.py](omnivoice/cli/infer.py) |
| `omnivoice-infer-batch` | Batch inference across multiple GPUs | [omnivoice/cli/infer_batch.py](omnivoice/cli/infer_batch.py) |
### Demo
```bash
omnivoice-demo --ip 0.0.0.0 --port 8001
```
Provides a web UI for voice cloning and voice design. See `omnivoice-demo --help` for all options.
### Single Inference
```bash
# Voice Cloning
# ref_text can be omitted (Whisper will auto-transcribe ref_audio to get it).
omnivoice-infer \
--model k2-fsa/OmniVoice \
--text "This is a test for text to speech." \
--ref_audio ref.wav \
--ref_text "Transcription of the reference audio." \
--output hello.wav
# Voice Design
omnivoice-infer --model k2-fsa/OmniVoice \
--text "This is a test for text to speech." \
--instruct "male, British accent" \
--output hello.wav
# Auto Voice
omnivoice-infer \
--model k2-fsa/OmniVoice \
--text "This is a test for text to speech."\
--output hello.wav
```
### Batch Inference
`omnivoice-infer-batch` can distribute batch inference across multiple GPUs, designed for large-scale TTS tasks.
```bash
omnivoice-infer-batch \
--model k2-fsa/OmniVoice \
--test_list test.jsonl \
--res_dir results/
```
The test list is a JSONL file where each line is a JSON object:
```json
{"id": "sample_001", "text": "Hello world", "ref_audio": "/path/to/ref.wav", "ref_text": "Reference transcript", "instruct": "female, british accent", "language_id": "en", "language_name": "English", "duration": 10.0, "speed": 1.0}
```
Only `id` and `text` are mandatory fields. `ref_audio` and `ref_text` are used in voice cloning mode. `instruct` is used in voice design mode. If no reference audio or instruct are provided, the model will generate text in a random voice.
`language_id`, `language_name`, `duration`, and `speed` are optional. `duration` (in seconds) fixes the output length; `speed` controls the speaking rate. If `duration` and `speed` are both provided, `speed` will be ignored.
---
## Training & Evaluation
See [examples/](examples/) for the complete pipeline — from data preparation to training, evaluation, and finetuning.
---
## Discussion & Communication
You can directly discuss on [GitHub Issues](https://github.com/k2-fsa/OmniVoice/issues).
You can also scan the QR code to join our wechat group or follow our wechat official account.
| Wechat Group | Wechat Official Account |
| ------------ | ----------------------- |
| | |
---
## Citation
```bibtex
@article{zhu2026omnivoice,
title={OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models},
author={Zhu, Han and Ye, Lingxuan and Kang, Wei and Yao, Zengwei and Guo, Liyong and Kuang, Fangjun and Han, Zhifeng and Zhuang, Weiji and Lin, Long and Povey, Daniel},
journal={arXiv preprint arXiv:2604.00688},
year={2026}
}
```