MOSS-TTS-Nano is an open-source, 0.1B-parameter multilingual speech generation model from OpenMOSS/MOSI.AI. It focuses on real-time speech generation, CPU-friendly inference, voice cloning, ONNX deployment, and lightweight product integration. It suits scenarios that are sensitive to latency and deployment cost, such as on-device read-aloud, browser read-aloud, lightweight voice assistants, education and practice-companion products, and customer service voice announcements. Its main pre-sales highlight: "the goal is not the largest model, but TTS that is deployable, low-cost, and runs locally."

1. Project Overview

| Dimension | Information |
| --- | --- |
| Project | OpenMOSS/MOSS-TTS-Nano |
| Positioning | Multilingual tiny speech generation model |
| Parameter scale | 0.1B |
| Main language | Python |
| Open source license | Apache-2.0 |
| Created | 2026-04-10 |
| Last pushed | 2026-06-02 |
| GitHub popularity | As of 2026-06-30: about 3.8k stars, 483 forks, 58 open issues |
| Supported languages | README lists 20 languages |
| Key capabilities | Voice cloning, streaming inference, CPU/ONNX, long text chunking, Web Demo, CLI |

Official concept and architecture diagrams:

[Figure: MOSS-TTS-Nano Concept]

[Figure: MOSS-TTS-Nano Architecture]

2. What Does It Do?

| Capability | Description | Pre-sales value |
| --- | --- | --- |
| Multilingual TTS | Supports 20 languages, including Chinese, English, Japanese, Korean, French, and German | Suitable for overseas expansion, multilingual broadcasting, and learning products |
| Voice cloning | Generates speech with a similar timbre from reference audio | Can serve as a brand voice, virtual teacher, or personalized read-aloud |
| Streaming inference | Targets low latency and fast first-packet audio | For real-time assistants and interactive voice products |
| CPU friendly | 0.1B small model; the README says streaming generation can run on a 4-core CPU | Lowers deployment cost; suitable for edge/local demos |
| ONNX CPU version | No PyTorch dependency; inference on ONNX Runtime CPU | Easier to integrate into lightweight services and on-device applications |
| Browser/extension route | The project mentions that the Reader can run directly in a browser extension | Suitable for local readers, web page read-aloud, and privacy-sensitive scenarios |
| Android example | Provides an Android ONNX Runtime smoke example | Can verify the feasibility of mobile integration |
| Fine-tuning code | Fine-tuning code released 2026-04-16 | Custom timbre or domain-style requirements can be explored further |

3. Language Support

The README currently lists 20 languages, including Chinese, English, German, Spanish, French, Japanese, Italian, Hungarian, Korean, Russian, Persian, Arabic, Polish, Portuguese, Czech, Danish, Swedish, Greek, and Turkish.

This means it is suitable not only for Chinese read-aloud but also for cross-border e-commerce, overseas education, overseas customer service, international content broadcasting, and similar scenarios. However, the actual naturalness and accent quality of each language still need to be verified with customer samples.

4. Applicable Scenarios

| Scenario | Fit | Description |
| --- | --- | --- |
| Educational read-aloud / practice-partner products | High | Small model, low latency, multilingual; suitable for sentence-level and paragraph-level read-aloud |
| Voice output for enterprise knowledge bases | High | Turns text answers into speech; local deployment can protect privacy |
| Browser read-aloud extension | High | Official MOSS-TTS-Nano-Reader direction |
| Lightweight voice assistant | Medium to high | Low-latency TTS as the output layer of a voice agent |
| Mobile/edge TTS | Medium to high | ONNX Android example; worth exploring for on-device use |
| Brand voice cloning | Medium | Supports reference audio, but commercial use requires strict authorization |
| Film-grade dubbing | Medium to low | A 0.1B small model favors real-time, lightweight use; do not over-promise top-tier sound quality |

5. Unsuitable Scenarios

| Unsuitable scenario | Reason |
| --- | --- |
| Demanding, highly human-like emotional expression | The small model is positioned for lightweight real-time use; complex emotion and performance may fall short of large models or commercial TTS |
| Cloning a voice without authorization | Voice cloning involves likeness rights, personality rights, and compliance risk |
| Launching directly as a high-concurrency cloud service | Requires service wrapping, throttling, queuing, caching, monitoring, authentication, and compliance auditing |
| Strict broadcast-grade quality | Pronunciation, pauses, prosody, accent, and long-text stability must first be evaluated with realistic scripts |

6. Architecture Understanding

MOSS-TTS-Nano uses a pure autoregressive architecture that pairs an audio tokenizer with an LLM. It can be explained to customers as follows:

  1. Audio is first converted into discrete audio tokens by MOSS-Audio-Tokenizer-Nano.
  2. The TTS model generates audio tokens the same way a language model generates text tokens.
  3. The audio tokenizer then decodes the tokens into 48 kHz, two-channel audio.
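
The three steps above can be illustrated with a toy sketch. Nothing here is the MOSS-TTS-Nano API: `encode`, `decode`, `toy_next_token`, and `generate` are hypothetical stand-ins, the "tokenizer" is plain uniform quantization, and the "model" is a fixed rule rather than a neural network. The sketch only shows the control flow: audio to discrete tokens, one token appended per step, tokens back to samples.

```python
from typing import List

LEVELS = 256  # size of the toy token vocabulary (assumption, not the real codebook)


def encode(samples: List[float]) -> List[int]:
    """Step 1: map samples in [-1.0, 1.0] to discrete token ids."""
    tokens = []
    for s in samples:
        clipped = max(-1.0, min(1.0, s))
        tokens.append(round((clipped + 1.0) / 2.0 * (LEVELS - 1)))
    return tokens


def decode(tokens: List[int]) -> List[float]:
    """Step 3: map token ids back to samples in [-1.0, 1.0]."""
    return [t / (LEVELS - 1) * 2.0 - 1.0 for t in tokens]


def toy_next_token(context: List[int]) -> int:
    """Stand-in for the model: pick the next token from the context.

    A real TTS model would condition on text and prompt audio and sample
    from a learned distribution; this rule just repeats the last step size.
    """
    if not context:
        return LEVELS // 2
    if len(context) < 2:
        return context[-1]
    step = context[-1] - context[-2]
    return max(0, min(LEVELS - 1, context[-1] + step))


def generate(prompt_tokens: List[int], n_new: int) -> List[int]:
    """Step 2: autoregressive loop, appending one token per iteration."""
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        tokens.append(toy_next_token(tokens))
    return tokens[len(prompt_tokens):]


prompt = encode([-1.0, -0.5, 0.0])      # reference audio -> tokens
new_tokens = generate(prompt, n_new=3)  # tokens -> more tokens
audio = decode(new_tokens)              # tokens -> samples
```

The point for customers is that each new token depends on everything generated so far, which is why audio can be streamed out chunk by chunk as tokens are produced.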

The project also provides architecture and evaluation diagrams for MOSS-Audio-Tokenizer-Nano:

[Figure: MOSS-Audio-Tokenizer-Nano Architecture]

[Figure: MOSS-Audio-Tokenizer Evaluation]

The pre-sales selling point of this architecture is its unified audio-token representation, which can extend to speech, dialogue, sound effects, and other models in the MOSS-TTS family. For current customer deployments, however, the most practical benefit is Nano's lightweight deployment.

7. How to Use

Environment:

conda create -n moss-tts-nano python=3.12 -y
conda activate moss-tts-nano

git clone https://github.com/OpenMOSS/MOSS-TTS-Nano.git
cd MOSS-TTS-Nano

pip install -r requirements.txt
pip install -e .

Voice Cloning:

python infer.py \
  --prompt-audio-path assets/audio/zh_1.wav \
  --text "欢迎关注模思智能、上海创智学院与复旦大学自然语言处理实验室。"

Local Web Demo:

python app.py

Open:

http://127.0.0.1:18083

ONNX CPU inference:

python infer_onnx.py \
  --prompt-audio-path assets/audio/zh_1.wav \
  --text "Welcome to the ONNX Runtime CPU demo."

CLI:

moss-tts-nano generate \
  --backend onnx \
  --prompt-speech assets/audio/zh_1.wav \
  --text "欢迎关注模思智能、上海创智学院与复旦大学自然语言处理实验室。"
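
For batch PoC runs it can be convenient to drive the CLI from a script. The sketch below builds the same `moss-tts-nano generate` command using only the flags shown above; the helper names `build_generate_cmd` and `run_generate` are hypothetical, and because this brief does not say where the CLI writes its output, the sketch does not try to handle output files.

```python
import shlex
import subprocess
from typing import List


def build_generate_cmd(text: str, prompt_speech: str, backend: str = "onnx") -> List[str]:
    """Build the argv list for `moss-tts-nano generate` (flags taken from the command above)."""
    return [
        "moss-tts-nano", "generate",
        "--backend", backend,
        "--prompt-speech", prompt_speech,
        "--text", text,
    ]


def run_generate(text: str, prompt_speech: str, backend: str = "onnx") -> int:
    """Run the CLI and return its exit code. Requires the package to be installed."""
    cmd = build_generate_cmd(text, prompt_speech, backend)
    return subprocess.run(cmd, check=False).returncode


# Build (but do not run) a command and show it in copy-pasteable shell form.
cmd = build_generate_cmd("Hello from the batch script.", "assets/audio/zh_1.wav")
printable = shlex.join(cmd)
print(printable)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with punctuation-heavy or mixed-language test sentences.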

Service mode:

moss-tts-nano serve --backend onnx

8. Pre-Sales Talking Points

One-sentence positioning:

"MOSS-TTS-Nano is a lightweight TTS model that can run locally, is CPU-friendly, supports voice cloning and multi-language, and is suitable for turning the text output of AI applications into low-latency voice output."

Customer Value Mapping:

| Customer pain point | MOSS-TTS-Nano value |
| --- | --- |
| Commercial TTS is costly and raises privacy concerns | Can be deployed locally, which suits private data and offline demos |
| On-device voice capability is hard to integrate | ONNX CPU and the Android example lower the integration barrier |
| A text-only AI assistant feels unnatural | Can serve as the output layer of a voice agent |
| Education/reading products need read-aloud in multiple languages | Supports multiple languages; suitable for learning, read-aloud, and repeat-after-me practice |
| Wants a brand voice | Voice cloning from reference audio can support a proof of concept |

9. PoC Recommendations

| PoC item | Acceptance metrics |
| --- | --- |
| Chinese long-text read-aloud | Misreading rate, pause naturalness, long-text stability |
| Multilingual read-aloud | Target-language naturalness, accent acceptability, speed |
| CPU/ONNX performance | First-packet latency, real-time factor (RTF), CPU usage, memory |
| Mobile verification | Android demo runs end to end; model size and power consumption |
| Voice assistant pipeline | End-to-end latency from LLM-generated text to streamed TTS audio |
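
First-packet latency and real-time factor can be measured around any streaming source with a small harness. The sketch below is an assumption-laden example, not part of the project: `measure_stream` and `fake_stream` are hypothetical names, the stream is simulated with silent chunks, and 16-bit PCM is assumed (the brief only states 48 kHz, two-channel output). RTF is computed as processing time divided by audio duration, so a value below 1 means faster than real time.

```python
import time
from typing import Callable, Dict, Iterable, Iterator, Optional


def measure_stream(
    chunks: Iterable[bytes],
    sample_rate: int,
    channels: int,
    bytes_per_sample: int = 2,  # assumption: 16-bit PCM
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, Optional[float]]:
    """Consume a stream of raw PCM chunks and report latency metrics."""
    start = clock()
    first_packet_latency: Optional[float] = None
    total_bytes = 0
    for chunk in chunks:
        if first_packet_latency is None:
            # Time from request start until the first audio chunk arrives.
            first_packet_latency = clock() - start
        total_bytes += len(chunk)
    elapsed = clock() - start
    audio_seconds = total_bytes / (sample_rate * channels * bytes_per_sample)
    return {
        "first_packet_latency_s": first_packet_latency,
        "audio_seconds": audio_seconds,
        "elapsed_s": elapsed,
        "rtf": elapsed / audio_seconds if audio_seconds else None,
    }


def fake_stream(n_chunks: int, chunk_seconds: float, delay_s: float) -> Iterator[bytes]:
    """Simulated TTS stream: sleeps, then yields silent 48 kHz stereo 16-bit audio."""
    chunk_bytes = int(48000 * 2 * 2 * chunk_seconds)
    for _ in range(n_chunks):
        time.sleep(delay_s)
        yield b"\x00" * chunk_bytes


metrics = measure_stream(fake_stream(3, chunk_seconds=0.1, delay_s=0.01), 48000, 2)
print(metrics)
```

In a real PoC, `fake_stream` would be replaced by whatever iterator the customer's integration exposes for streamed audio, and the numbers should be collected on the customer's target hardware.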

Prepare three types of audio samples before a sales engagement: ordinary read-aloud, business scripts, and short interactive sentences. Do not test only a short text: also test long text, numbers, English abbreviations, names, technical terms, and mixed Chinese-English passages.
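
When preparing long-text and mixed-language test material, it helps to see how a passage might break into sentence-level chunks. The project lists long text chunking as a capability, but this brief does not describe how it is implemented, so the following is only an assumed approach with a hypothetical `chunk_text` helper: split on sentence-ending punctuation, then pack sentences into chunks up to a character limit.

```python
import re
from typing import List

# Split after CJK sentence punctuation, or after ASCII punctuation followed by
# whitespace (so "3.14" and similar are not split in the middle).
_SPLIT = re.compile(r"(?<=[。！？；])|(?<=[.!?;])\s+")
_CJK_END = "。！？；"


def chunk_text(text: str, max_chars: int = 80) -> List[str]:
    """Pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    sentences = [s.strip() for s in _SPLIT.split(text) if s and s.strip()]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not current:
            current = sentence
            continue
        # No space is needed after CJK punctuation; use one after ASCII text.
        sep = "" if current[-1] in _CJK_END else " "
        candidate = current + sep + sentence
        if len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "你好。今天是3.14号。Hello world. How are you?"
chunks = chunk_text(sample, max_chars=12)
for c in chunks:
    print(c)
```

Running candidate test scripts through a splitter like this makes it easier to spot passages where numbers, abbreviations, or mixed-language punctuation could end up at awkward chunk boundaries.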

10. Frequently Asked Customer Questions

| Question | Answer |
| --- | --- |
| Can it run on a CPU? | The project emphasizes CPU friendliness and provides an ONNX CPU version. Actual performance should be stress-tested on the customer's hardware. |
| Can I use reference audio for voice cloning? | Yes, but you must ensure the voice is authorized and the use is compliant. |
| Does it support mobile devices? | The project provides an Android ONNX Runtime example suitable for feasibility verification. A production product still needs to optimize model size and performance. |
| How does it compare to commercial TTS? | Commercial TTS may be more mature in stability, voice libraries, and SLAs; MOSS-TTS-Nano's advantages are that it is open source, lightweight, local, and customizable. |
| Can I build a real-time voice assistant? | It can be a candidate for the TTS output layer, but the end-to-end experience also depends on ASR, the LLM, dialog management, and the audio playback pipeline. |

11. Risks and Considerations

  1. Voice cloning compliance: explicit authorization is required, especially for the real voices of employees, hosts, teachers, and customer service agents.
  2. Sound quality must be measured: small models prioritize deployment efficiency, so do not assume they reach top commercial dubbing quality.
  3. Language coverage does not equal quality: each of the 20 languages needs separate acceptance testing for the target market.
  4. Dependency installation: the README mentions that WeTextProcessing / pynini may require extra handling.
  5. Production still needs a service layer: authentication, concurrency, caching, logging, auditing, text cleaning, and sensitive-word filtering all have to be added.
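
As one illustration of the service-layer work in the last item, repeated phrases (common in customer service announcements) can be served from a cache instead of being re-synthesized. This is a minimal in-memory sketch under stated assumptions: `SynthesisCache` and `fake_synthesize` are hypothetical, and the real synthesis call would be whatever function the integration wraps around the model.

```python
import hashlib
from collections import OrderedDict
from typing import Callable


class SynthesisCache:
    """Small in-memory LRU cache placed in front of a TTS call."""

    def __init__(self, synthesize: Callable[[str, str], bytes], max_items: int = 256):
        self._synthesize = synthesize
        self._max_items = max_items
        self._items: "OrderedDict[str, bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(text: str, voice_id: str) -> str:
        # Hash voice and text together so the same text in two voices never collides.
        payload = f"{voice_id}\x00{text}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def get(self, text: str, voice_id: str) -> bytes:
        key = self._key(text, voice_id)
        if key in self._items:
            self._items.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._items[key]
        self.misses += 1
        audio = self._synthesize(text, voice_id)
        self._items[key] = audio
        if len(self._items) > self._max_items:
            self._items.popitem(last=False)  # evict the least recently used entry
        return audio


def fake_synthesize(text: str, voice_id: str) -> bytes:
    """Placeholder for a real TTS call; returns deterministic bytes."""
    return f"{voice_id}:{text}".encode("utf-8")


cache = SynthesisCache(fake_synthesize, max_items=2)
cache.get("Your order has shipped.", "brand_voice")
cache.get("Your order has shipped.", "brand_voice")
print(cache.hits, cache.misses)
```

A production version would also need persistence, size limits in bytes rather than item count, and invalidation when the model or reference voice changes.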

12. My Pre-Sales Judgment

The strengths of MOSS-TTS-Nano are clear: it is lightweight, open source, locally deployable, ONNX-ready, and supports voice cloning. It is not meant to compete with the strongest commercial TTS on emotional expressiveness, but it is a very good fit for customers who "need to put voice capabilities into edge, local, browser, or lightweight services."

For pre-sales, I recommend positioning it for education, read-aloud, AI assistants, enterprise knowledge base voice output, and on-device voice demos. When moving a deal forward, run the PoC with the customer's real text and target hardware: if CPU latency, sound quality, and compliance all pass, it can become a very cost-effective component of a TTS solution.

13. References

- GitHub: https://github.com/OpenMOSS/MOSS-TTS-Nano
- Demo: https://openmoss.github.io/MOSS-TTS-Nano-Demo/
- Hugging Face: https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Nano
- ModelScope: https://modelscope.cn/models/openmoss/MOSS-TTS-Nano
- Paper: https://arxiv.org/abs/2603.18090
- ONNX Weights: https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Nano-100M-ONNX