1. Project Overview
| Dimension | Information |
|---|---|
| Project | OpenMOSS/MOSS-TTS-Nano |
| Positioning | Multilingual tiny speech generation model |
| Parameter Scale | 0.1B |
| Main Language | Python |
| Open Source License | Apache-2.0 |
| Created | 2026-04-10 |
| Recently pushed | 2026-06-02 |
| GitHub popularity | As of 2026-06-30: about 3.8k stars, 483 forks, 58 open issues |
| Supported languages | README lists 20 languages |
| Key capabilities | Voice cloning, streaming inference, CPU/ONNX, long text chunking, Web Demo, CLI |
[Figure: official concept map and architecture diagram]
2. What Does It Do?
| Capability | Description | Pre-Sales Value |
|---|---|---|
| Multilingual TTS | Supports 20 languages, including Chinese, English, Japanese, Korean, French, and German | Suits overseas expansion, multilingual broadcast, and learning products |
| Voice cloning | Generates speech in a similar timbre from reference audio | Usable for brand voices, virtual teachers, and personalized read-aloud |
| Streaming inference | Targets low latency and a fast first audio packet | Fits real-time assistants and interactive voice products |
| CPU friendly | 0.1B small model; the README says streaming generation can run on a 4-core CPU | Lowers deployment cost; suits edge and local demos |
| ONNX CPU version | ONNX Runtime CPU inference with no PyTorch dependency | Easier to integrate into lightweight services and on-device applications |
| Browser/extension route | The project notes that the Reader can run directly in a browser extension | Suits local readers, web page read-aloud, and privacy-sensitive scenarios |
| Android example | Provides an Android ONNX Runtime smoke example | Helps verify the feasibility of mobile integration |
| Fine-tuning code | Fine-tuning code released on 2026-04-16 | Custom timbre and domain-style requirements can be explored further |
3. Language Support
The README currently lists 20 languages, including Chinese, English, German, Spanish, French, Japanese, Italian, Hungarian, Korean, Russian, Persian, Arabic, Polish, Portuguese, Czech, Danish, Swedish, Greek, and Turkish.
This makes it suitable not only for Chinese read-aloud but also for cross-border e-commerce, overseas education, overseas customer service, and international content broadcast. However, the actual naturalness and accent quality of each language still need to be verified with customer samples.
4. Applicable Scenarios
| Scenario | Fit | Description |
|---|---|---|
| Educational read-aloud and speaking practice | High | Small model, low latency, multilingual; suits sentence-level and paragraph-level read-aloud |
| Voice broadcast for enterprise knowledge bases | High | Turns text answers into speech; local deployment helps protect privacy |
| Browser read-aloud extension | High | Matches the direction of the official MOSS-TTS-Nano-Reader |
| Lightweight voice assistant | Medium to High | Low-latency TTS as the output layer of a voice agent |
| Mobile/edge TTS | Medium to High | The ONNX Android example makes on-device exploration worthwhile |
| Brand voice cloning | Medium | Supports reference audio, but commercial use requires strict authorization |
| Film-grade dubbing | Medium to Low | A 0.1B small model prioritizes real-time, lightweight use; do not over-promise top-tier sound quality |
5. Unsuitable Scenarios
| Unsuitable Case | Reason |
|---|---|
| Demanding, highly human-like emotional expression | The small model is positioned for lightweight real-time use; complex emotion and performance may fall short of large models or commercial TTS |
| Cloning a voice without authorization | Voice cloning involves likeness rights, personality rights, and compliance risks |
| Launching directly as a high-concurrency cloud service | Still requires a service layer: throttling, queuing, caching, monitoring, authentication, and compliance auditing |
| Strict broadcast-grade quality | Pronunciation, pauses, prosody, accent, and long-text stability must first be evaluated with realistic scripts |
6. Architecture Understanding
MOSS-TTS-Nano uses a pure autoregressive architecture that pairs an audio tokenizer with an LLM. This can be explained to the customer as follows:
- Audio is first converted into discrete audio tokens by MOSS-Audio-Tokenizer-Nano.
- The TTS model generates audio tokens the same way a language model generates text tokens.
- The audio tokenizer then decodes the tokens into 48 kHz, two-channel audio.
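The three steps above can be pictured with a toy sketch. Everything in it is a hypothetical stand-in: `toy_next_token`, `toy_decode`, the vocabulary size, the end token and the samples-per-token ratio are illustrative assumptions, not the real MOSS-TTS-Nano or MOSS-Audio-Tokenizer-Nano API. It only shows the shape of the loop: condition on text and reference-audio tokens, emit one audio token at a time, then decode tokens into two-channel frames.

```python
import random
from typing import List, Sequence, Tuple

VOCAB_SIZE = 1024      # hypothetical codebook size, not the real tokenizer's
EOS_TOKEN = 0          # hypothetical end-of-audio token
SAMPLES_PER_TOKEN = 4  # toy value; a real codec maps one token to many samples


def toy_next_token(context: Sequence[int], rng: random.Random) -> int:
    """Stand-in for the LLM: pick the next audio token given the context."""
    # A real model would score every token in the vocabulary; this stub samples.
    return rng.randrange(VOCAB_SIZE)


def generate_audio_tokens(text_tokens: Sequence[int],
                          prompt_audio_tokens: Sequence[int],
                          max_new_tokens: int,
                          seed: int = 0) -> List[int]:
    """Autoregressive loop: each new token is appended to the context."""
    rng = random.Random(seed)
    context = list(text_tokens) + list(prompt_audio_tokens)
    generated: List[int] = []
    for _ in range(max_new_tokens):
        token = toy_next_token(context, rng)
        if token == EOS_TOKEN:
            break
        generated.append(token)
        context.append(token)
    return generated


def toy_decode(audio_tokens: Sequence[int], channels: int = 2) -> List[Tuple[float, ...]]:
    """Stand-in for the tokenizer decoder: tokens -> per-channel sample frames."""
    frames: List[Tuple[float, ...]] = []
    for token in audio_tokens:
        value = token / VOCAB_SIZE * 2.0 - 1.0  # map token id into [-1, 1)
        frames.extend([(value,) * channels] * SAMPLES_PER_TOKEN)
    return frames
```

For customer conversations, the useful takeaway is that audio generation is sequential, which is why streaming output and first-packet latency are natural metrics for this kind of model.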
The project also provides architecture and evaluation figures for MOSS-Audio-Tokenizer-Nano:
[Figure: MOSS-Audio-Tokenizer-Nano Architecture]
[Figure: MOSS-Audio-Tokenizer Evaluation]
The pre-sales selling point of this architecture is its unified audio token representation, which can extend to the speech, dialogue, sound-effect, and other models in the MOSS-TTS family. For current customer deployments, however, the most practical benefit is Nano's lightweight deployment.
7. How to Use
Environment:
```bash
conda create -n moss-tts-nano python=3.12 -y
conda activate moss-tts-nano
git clone https://github.com/OpenMOSS/MOSS-TTS-Nano.git
cd MOSS-TTS-Nano
pip install -r requirements.txt
pip install -e .
```
Voice Cloning:
```bash
python infer.py \
  --prompt-audio-path assets/audio/zh_1.wav \
  --text "欢迎关注模思智能、上海创智学院与复旦大学自然语言处理实验室。"
```
Local Web Demo:
```bash
python app.py
```
Open:
http://127.0.0.1:18083
ONNX CPU inference:
```bash
python infer_onnx.py \
  --prompt-audio-path assets/audio/zh_1.wav \
  --text "Welcome to the ONNX Runtime CPU demo."
```
CLI:
```bash
moss-tts-nano generate \
  --backend onnx \
  --prompt-speech assets/audio/zh_1.wav \
  --text "欢迎关注模思智能、上海创智学院与复旦大学自然语言处理实验室。"
```
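For a PoC script that synthesizes many test sentences, it can help to build the CLI call programmatically. The sketch below uses only the subcommand and flags shown in the command above (`generate`, `--backend`, `--prompt-speech`, `--text`); the helper name `build_generate_command` is hypothetical, and the CLI's output options are not assumed here.

```python
import shlex
from typing import List


def build_generate_command(text: str, prompt_speech: str, backend: str = "onnx") -> List[str]:
    """Build the argv list for `moss-tts-nano generate` (hypothetical helper)."""
    if not text.strip():
        raise ValueError("text must not be empty")
    # Passing a list to subprocess avoids shell-quoting problems with
    # punctuation and non-ASCII text.
    return [
        "moss-tts-nano", "generate",
        "--backend", backend,
        "--prompt-speech", prompt_speech,
        "--text", text,
    ]


cmd = build_generate_command("Welcome to the PoC run.", "assets/audio/zh_1.wav")
print(shlex.join(cmd))  # shell-safe rendering, handy for logs

# To actually run it (requires the package to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Keeping the command as a list rather than a formatted string means customer-supplied text never has to be escaped by hand.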
Service mode:
```bash
moss-tts-nano serve --backend onnx
```
8. What to Say in Pre-Sales
One-sentence positioning:
"MOSS-TTS-Nano is a lightweight TTS model that can run locally, is CPU-friendly, supports voice cloning and multi-language, and is suitable for turning the text output of AI applications into low-latency voice output."
Customer Value Mapping:
| Customer Pain Point | MOSS-TTS-Nano Value |
|---|---|
| Commercial TTS is costly and raises privacy concerns | Can be deployed locally, keeping data private and enabling offline demos |
| On-device voice capability is hard to integrate | The ONNX CPU version and Android example lower the integration threshold |
| A text-only AI assistant feels unnatural | Can serve as the output layer of a voice agent |
| Education and reading products need multilingual read-aloud | Supports multiple languages; suits learning, read-aloud, and shadowing practice |
| Wants a brand voice | Reference-audio voice cloning can support a proof of concept |
9. PoC Recommendations
| PoC Item | Acceptance Indicators |
|---|---|
| Chinese long-text read-aloud | Misreading rate, pause naturalness, long-text stability |
| Multilingual read-aloud | Target-language naturalness, accent acceptability, speaking rate |
| CPU/ONNX performance | First-packet latency, real-time rate, CPU usage, memory |
| Mobile verification | Android demo runs end to end; model size and power consumption |
| Voice assistant pipeline | End-to-end latency from LLM-generated text to streaming TTS output |
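The CPU/ONNX indicators can be measured with a small harness. The sketch below assumes the audio arrives as an iterable of raw 16-bit PCM byte chunks at 48 kHz, two channels; the chunk format and the function name `measure_stream` are assumptions and would need adapting to whatever the actual streaming interface returns. It also assumes the common convention that the real-time factor is processing time divided by audio duration, so values below 1 mean faster than real time.

```python
import time
from typing import Callable, Dict, Iterable

SAMPLE_RATE = 48_000   # Hz, per the architecture notes
CHANNELS = 2
BYTES_PER_SAMPLE = 2   # assumption: 16-bit PCM chunks


def measure_stream(chunks: Iterable[bytes],
                   clock: Callable[[], float] = time.perf_counter) -> Dict[str, float]:
    """Consume a stream of PCM chunks and report latency metrics."""
    start = clock()
    first_packet_latency = float("nan")
    seen_first = False
    total_bytes = 0
    for chunk in chunks:
        if not seen_first:
            # Time from request start until the first audio bytes arrive.
            first_packet_latency = clock() - start
            seen_first = True
        total_bytes += len(chunk)
    elapsed = clock() - start
    audio_seconds = total_bytes / (SAMPLE_RATE * CHANNELS * BYTES_PER_SAMPLE)
    rtf = elapsed / audio_seconds if audio_seconds > 0 else float("inf")
    return {
        "first_packet_latency_s": first_packet_latency,
        "audio_seconds": audio_seconds,
        "real_time_factor": rtf,
    }
```

The `clock` parameter is injectable so the harness can be unit-tested with fake timestamps. For real runs, wrap the streaming call in a generator and repeat the measurement several times on the customer's target hardware.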
Prepare three types of audio samples before the pre-sales meeting: ordinary read-aloud, business speech, and short interactive sentences. Do not test only a short text; also test long text, numbers, English abbreviations, names, professional terms, and mixed Chinese-English passages.
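A quick coverage check can confirm that a test script touches the machine-detectable categories before recording starts. This is a minimal sketch: the category names, the 200-character threshold for "long text" and the regular expressions are illustrative assumptions. Names and professional terms cannot be detected reliably this way and still need manual tagging.

```python
import re
from typing import Iterable, Set

LONG_TEXT_CHARS = 200  # hypothetical threshold for "long text"
REQUIRED = {"numbers", "english_abbreviation", "mixed_zh_en", "long_text"}


def classify(text: str) -> Set[str]:
    """Return the machine-detectable categories a test sentence covers."""
    tags: Set[str] = set()
    if re.search(r"\d", text):
        tags.add("numbers")
    # Two or more capital letters not embedded in a longer Latin word, e.g. "GPU".
    if re.search(r"(?<![A-Za-z])[A-Z]{2,}(?![A-Za-z])", text):
        tags.add("english_abbreviation")
    has_cjk = re.search(r"[\u4e00-\u9fff]", text) is not None
    has_latin = re.search(r"[A-Za-z]", text) is not None
    if has_cjk and has_latin:
        tags.add("mixed_zh_en")
    if len(text) >= LONG_TEXT_CHARS:
        tags.add("long_text")
    return tags


def missing_categories(texts: Iterable[str]) -> Set[str]:
    """Categories from REQUIRED that no sentence in the set covers."""
    covered: Set[str] = set()
    for text in texts:
        covered |= classify(text)
    return REQUIRED - covered
```

Running `missing_categories` over the customer's sample script gives a short list of gaps to fill before the listening session.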
10. Frequently Asked Customer Questions
| Question | Suggested Answer |
|---|---|
| Can it run on a CPU? | The project emphasizes CPU friendliness and provides an ONNX CPU version; actual performance should be stress-tested on the customer's hardware. |
| Can I use reference audio for voice cloning? | Yes, but you must ensure the voice is authorized and the use is compliant. |
| Does it support mobile devices? | The project provides an Android ONNX Runtime example, which is suitable for feasibility verification. A production app still needs optimization of model size and performance. |
| How does it compare to commercial TTS? | Commercial TTS may be more mature in stability, voice libraries, and SLA; the advantages of MOSS-TTS-Nano are that it is open source, lightweight, locally deployable, and customizable. |
| Can it power a real-time voice assistant? | It is a candidate for the TTS output layer, but the end-to-end experience also depends on the ASR, LLM, dialog management, and audio playback pipelines. |
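When the TTS model sits behind a streaming LLM, one common glue step is buffering text fragments into sentence-sized pieces before synthesis, so that speech can start before the full answer is generated. The sketch below illustrates that buffering only. The delimiter set, the `max_chars` cap and the function name are assumptions, and it is not the project's built-in long-text chunking.

```python
from typing import Iterable, Iterator

SENTENCE_END = "。！？；.!?;\n"  # assumed delimiter set; tune per language


def sentences_from_stream(fragments: Iterable[str], max_chars: int = 120) -> Iterator[str]:
    """Group streamed text fragments into sentence-sized chunks for TTS."""
    buffer = ""
    for fragment in fragments:
        for ch in fragment:
            buffer += ch
            # Flush at a sentence boundary, or when the buffer grows too long.
            if ch in SENTENCE_END or len(buffer) >= max_chars:
                sentence = buffer.strip()
                if sentence:
                    yield sentence
                buffer = ""
    tail = buffer.strip()
    if tail:
        yield tail  # flush whatever is left when the stream ends


for sentence in sentences_from_stream(["Hello wor", "ld. How are", " you?", " Bye"]):
    print(sentence)
```

A known limitation of this naive rule is that a period inside a number such as "3.8" also triggers a split, so production code would need smarter boundary detection.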
11. Risks and Considerations
- Voice cloning compliance: clear authorization is required, especially for the real voices of employees, streamers, teachers, and customer service staff.
- Sound quality must be measured: small models prioritize deployment efficiency and should not be assumed to reach top commercial dubbing quality.
- Language coverage does not equal quality: each of the 20 languages needs separate acceptance testing for its target market.
- Dependency installation: the README mentions that WeTextProcessing / pynini may require extra handling.
- Production still needs a service layer: authentication, concurrency control, caching, logging, auditing, text cleaning, and sensitive-word filtering all have to be added.
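As one illustration of the service-layer work, repeated prompts such as greetings or fixed announcements can be served from a cache instead of being synthesized again. The sketch below is a minimal in-memory LRU cache keyed by backend, voice and normalized text. The class name, the key fields and the `synthesize` callable are hypothetical and stand in for whatever the real service would call.

```python
import hashlib
from collections import OrderedDict
from typing import Callable


class AudioCache:
    """Tiny LRU cache for synthesized audio (illustrative, in-memory only)."""

    def __init__(self, max_items: int = 128) -> None:
        self._items: "OrderedDict[str, bytes]" = OrderedDict()
        self.max_items = max_items

    @staticmethod
    def key(text: str, voice_id: str, backend: str) -> str:
        # Collapse whitespace so trivially different inputs share one entry.
        normalized = " ".join(text.split())
        payload = "\x1f".join([backend, voice_id, normalized]).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def get_or_create(self, text: str, voice_id: str, backend: str,
                      synthesize: Callable[[str], bytes]) -> bytes:
        k = self.key(text, voice_id, backend)
        if k in self._items:
            self._items.move_to_end(k)  # mark as most recently used
            return self._items[k]
        audio = synthesize(text)
        self._items[k] = audio
        if len(self._items) > self.max_items:
            self._items.popitem(last=False)  # evict the least recently used
        return audio
```

Including the voice and backend in the key prevents audio cloned from one reference voice from being returned for another. A shared deployment would more likely use an external cache with expiry.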
12. My Pre-Sales Judgment
The advantages of MOSS-TTS-Nano are clear: lightweight, open source, locally deployable, ONNX-ready, and capable of voice cloning. It is not meant to compete with the strongest commercial TTS on emotional expressiveness, but it is a very good fit for customers who "need to put voice capabilities in edge, local, browser, or lightweight services."
The recommended pre-sales targets are education, read-aloud, AI assistants, enterprise knowledge base voice broadcast, and on-device voice demos. When moving forward, run the PoC with the customer's real text and target hardware: if CPU latency, sound quality, and compliance all pass, it can become a very cost-effective component of a TTS solution.
13. References
- GitHub: https://github.com/OpenMOSS/MOSS-TTS-Nano
- Demo: https://openmoss.github.io/MOSS-TTS-Nano-Demo/
- Hugging Face: https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Nano
- ModelScope: https://modelscope.cn/models/openmoss/MOSS-TTS-Nano
- Paper: https://arxiv.org/abs/2603.18090
- ONNX Weights: https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Nano-100M-ONNX