
We introduce THCHS-30 (Tsinghua Chinese 30-hour database), a free Chinese speech database for automatic speech recognition (ASR) research. The database was published by the Center for Speech and Language Technologies (CSLT) at Tsinghua University. The release includes speech data, language models and lexicons, together with a Kaldi-based recipe that can be used to build a baseline Chinese ASR system.


Databases are highly important for speech recognition. Because of the complex patterns in human speech, ASR research requires large amounts of speech data to train robust acoustic models. Many 'standard' databases exist, among which the best known include the TIMIT database [1], the WSJ database [2], the Switchboard database [3] and, for Chinese, the RASC863 corpus [4]. These databases have been widely used to build baseline systems and verify new algorithms. However, most of them are costly and unaffordable for many individual researchers, which has dampened the initial interest of students and young researchers in speech research. We support the public data movement and believe that free speech data gives prospective researchers a valuable resource for taking their first step into the exciting field of ASR research.


The THCHS-30 speech database was recorded in 2000-2001 by Dr. Dong Wang when he was a master's student at the Department of Computer Science, Tsinghua University, supervised by Prof. Xiaoyan Zhu. The database was originally named TCMSD (Tsinghua Continuous Mandarin Speech Database) [5]; the name was changed to THCHS-30 when it was published 10 years later, to follow the naming convention of the CSLT open data series.

A. Audio data and transcriptions

THCHS-30 contains about 30 hours of speech recorded with a single carbon microphone in a quiet office environment. Forty people took part in the recording. Most of them were young college students, and all are fluent in standard Mandarin. The sampling rate is 16,000 Hz and the sample size is 16 bits. Transcriptions are provided at the word, syllable and phone levels. The database consists of four subsets, and participants in the same subset recorded the same 500 sentences. We divide the database into a training set, a development set and a test set. The training set comprises subsets A, B and C, recorded by 30 people and amounting to 10,000 utterances. The development set comprises nearly 900 utterances spoken by the same people, from the same subsets, as the training set. The test set comprises the recordings of subset D, amounting to 2,495 utterances spoken by 10 people. More details are given in [6].
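The stated audio format (16,000 Hz, 16-bit samples) can be checked on a downloaded file before training. The following is a minimal sketch using Python's standard `wave` module; the function name `check_wav` is illustrative and not part of the release, and it assumes the recordings are stored as ordinary PCM WAV files.

```python
import wave

EXPECTED_RATE_HZ = 16000    # sampling rate stated in the description above
EXPECTED_SAMPLE_BYTES = 2   # 16-bit samples = 2 bytes per sample


def check_wav(path):
    """Return the duration in seconds if the file matches the stated format.

    Raises ValueError if the sampling rate or sample size differs.
    (Hypothetical helper; assumes a PCM WAV file.)
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        frames = wav.getnframes()

    if rate != EXPECTED_RATE_HZ or width != EXPECTED_SAMPLE_BYTES:
        raise ValueError(
            f"{path}: got {rate} Hz, {8 * width}-bit; "
            f"expected {EXPECTED_RATE_HZ} Hz, {8 * EXPECTED_SAMPLE_BYTES}-bit"
        )
    # Duration is the number of frames divided by frames per second.
    return frames / rate
```

Summing the returned durations over all files gives a quick check against the figure of about 30 hours quoted above.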

B. Language model and lexicon

We release a word-based tri-gram language model (LM) and a phone-based tri-gram LM to support the word and phone decoding tasks, respectively. The word-based LM covers 48,000 words and the phone LM covers 218 Chinese tonal initials and finals. The associated lexicons are provided as part of the release.

C. Recipe

To demonstrate how to build a baseline ASR system with THCHS-30, a recipe has been released with the Kaldi toolkit. The recipe is similar to the GPU-based WSJ s5 recipe, with a few modifications to support phone-based decoding and noisy training. The baseline results are also published. To use it, update Kaldi to the latest snapshot; the THCHS-30 recipe is located in egs/thchs30. With this recipe and the associated free resources, a complete Chinese ASR system can be built from scratch.


We publish two versions of THCHS-30: the 'openslr version', which is easy to use with the Kaldi toolkit, and the 'standalone version', which contains the same content in a slightly different format. If you work with other toolkits (e.g., HTK, Sphinx), the standalone version is probably more appropriate.

The data can be downloaded freely from the following links:

The above links point to our own web server at Tsinghua University, which may be unstable and slow for some connections. The mirrors on public cloud disks can be used as a backup, and are in fact the recommended option:


We call for a challenge based on THCHS-30. Two tasks are currently in focus: one uses the original clean data (CLEAN TEST), and the other uses very noisy data with an SNR of 0 dB (0DB TEST). Details can be found on the challenge web page.
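For reference, an SNR of 0 dB means that the noise power equals the speech power, since SNR in dB is 10·log10(P_speech / P_noise). The sketch below shows one generic way to scale a noise signal to a target SNR before adding it to speech. It assumes this standard power-ratio definition; the function names are illustrative, and the procedure actually used to create the 0DB TEST data is not described on this page.

```python
import math


def mean_power(samples):
    """Average power of a signal given as a sequence of sample values."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the speech-to-noise power ratio is snr_db.

    Hypothetical helper: both inputs are equal-length sequences of floats.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_speech = mean_power(speech)
    p_noise = mean_power(noise)
    if p_noise == 0:
        raise ValueError("noise signal has zero power")

    # Solve snr_db = 10 * log10(p_speech / (gain**2 * p_noise)) for gain.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


# Toy signals: at 0 dB the scaled noise has the same power as the speech.
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
noisy = mix_at_snr(speech, noise, snr_db=0.0)
```

With real recordings the noise would first be trimmed or looped to the length of the utterance; that step is left out here.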

The baseline results we have obtained are based on a deep neural network (DNN) model trained with the minimum phone error (MPE) criterion [7], plus a deep auto-encoder (DAE) model for the 0DB TEST [8]. The DAE training has been checked in as part of the Kaldi thchs30 recipe. Any improvements reported by researchers will be published on the challenge web page.


  • Dong Wang
  • Xuewei Zhang
  • Zhiyong Zhang


[1] C. Lopes and F. Perdigão, “Phoneme recognition on the TIMIT database,” in Speech Technologies, 2011.

[2] D. Paul and J. Baker, “The design of Wall Street Journal-based CSR corpus,” in Proceedings of the International Conference on Spoken Language Processing (ICSLP), pp. 899–902, 1992.

[3] J. Godfrey, E. Holliman, and J. McDaniel, “SWITCHBOARD: telephone speech corpus for research and development,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-92), vol. 1, pp. 517–520, Mar. 1992.

[4] A. Li, Z. Yin, T. Wang, Q. Fang, and F. Hu, “RASC863 - A Chinese speech corpus with four regional accents,” ICSLT-o-COCOSDA, New Delhi, India, 2004.

[5] D. Wang, D. Wu, and X. Zhu, “TCMSD: A new Chinese continuous speech database,” in International Conference on Chinese Computing (ICCC01), 2001. [Online]. Available: camera.pdf

[6] D. Wang, X. Zhang, and Z. Zhang, “THCHS-30: A free Chinese speech corpus,” 2015. [Online]. Available:

[7] X. He, L. Deng, and W. Chou, “Discriminative learning in sequential pattern recognition,” IEEE Signal Processing Magazine, pp. 14–36, 2008.

[8] D. Yu and L. Deng, Automatic Speech Recognition: A Deep Learning Approach. Springer, 2015.