- Current text resources have been rearranged and listed
- Optimal Brain Damage (OBD).
- GA-based block sparsity
Efficient DNN training
- Asymmetric window: large improvement on the training set (WER 34% to 24%), but the improvement is lost on the test set. Overfitting?
- Pure Chinese training reached 4.9%
- Chinese + English reduced to 7.9%
- English phone set should distinguish between beginning phones and ending phones
- Should set up a multilingual network structure that shares the low layers but separates the languages at the high layers
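The shared-low-layer idea above can be sketched as a forward pass in which all languages go through the same hidden layers and only the output layer is language-specific. This is a minimal sketch under assumptions: the class name `MultilingualDNN`, the layer sizes, the language tags and the random initialization are all hypothetical, biases are omitted for brevity, and training is not shown.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Small random weight matrix stored as a list of rows."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def affine(matrix, vector):
    """Matrix-vector product (no bias term in this sketch)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sigmoid_layer(matrix, vector):
    return [1.0 / (1.0 + math.exp(-z)) for z in affine(matrix, vector)]


def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class MultilingualDNN:
    """Two shared hidden layers plus one softmax output layer per language."""

    def __init__(self, input_dim, hidden_dim, outputs_per_language, seed=0):
        rng = random.Random(seed)
        # Low layers: one set of weights used by every language.
        self.shared = [
            make_matrix(hidden_dim, input_dim, rng),
            make_matrix(hidden_dim, hidden_dim, rng),
        ]
        # High layers: a separate output matrix for each language.
        self.heads = {
            lang: make_matrix(n_out, hidden_dim, rng)
            for lang, n_out in outputs_per_language.items()
        }

    def forward(self, features, language):
        hidden = features
        for layer in self.shared:
            hidden = sigmoid_layer(layer, hidden)
        return softmax(affine(self.heads[language], hidden))


# Hypothetical sizes: 4 input features, 8 hidden units, 5 and 3 output classes.
net = MultilingualDNN(4, 8, {"zh": 5, "en": 3})
zh_posteriors = net.forward([0.1, 0.2, 0.3, 0.4], "zh")
en_posteriors = net.forward([0.1, 0.2, 0.3, 0.4], "en")
```

With this layout, data from either language would update the shared matrices, while each output matrix only sees frames from its own language.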
- Train with wsj database by corrupting data with various noise types
- baseline system ready
- noise data ready; selected 5 noise types that occur in real environments
- Liuchao's noise-adding toolkit ready
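Corrupting clean speech with noise, as described above, usually means mixing a noise recording into the speech at a chosen signal-to-noise ratio (SNR). The following is a minimal sketch of that step, not the toolkit mentioned above: the function names `mix_at_snr` and `measured_snr_db` are hypothetical, signals are plain lists of floats, and the sine wave and Gaussian noise are stand-ins for real recordings.

```python
import math
import random


def mean_power(samples):
    """Mean of the squared samples."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the mixture has the requested SNR in dB."""
    # Repeat the noise cyclically so it covers the whole utterance.
    tiled = [noise[i % len(noise)] for i in range(len(speech))]
    p_speech = mean_power(speech)
    p_noise = mean_power(tiled)
    if p_noise == 0:
        return list(speech)
    # Solve p_speech / (scale^2 * p_noise) = 10^(snr_db / 10) for scale.
    scale = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, tiled)]


def measured_snr_db(speech, mixed):
    """SNR of a mixture, computed from the clean signal and the residual."""
    residual = [m - s for s, m in zip(speech, mixed)]
    return 10 * math.log10(mean_power(speech) / mean_power(residual))


# Synthetic stand-ins: a 440 Hz tone as "speech", Gaussian samples as "noise".
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0, 1) for _ in range(4000)]
noisy = mix_at_snr(speech, noise, 10)
```

Running the same utterance through several noise types and SNR levels would give the multi-condition training set; which SNR levels were used is not stated in these notes.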
- Investigating LOUDS FST.
- Tested adaptation performance with 10 to 40 adaptation utterances.
Word to Vector
- Tested a training toolkit from Stanford University, which can incorporate global information into word-vector training
- C++ implementation (instead of Python) for data pre-processing; problems encountered
- Basic wordvector plus global sense
- Training on 100M data (with global sense) caused a memory overflow
- Split the data into small pieces
- Improved wordvector with multi sense
- Prepare scripts
- Keyword extraction based on wordvectors
- Using Google word vectors
- Using k-means to cluster
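The clustering step above can be sketched with a plain k-means loop over word vectors. This is a minimal sketch under assumptions: the function name `kmeans`, the toy 2-dimensional vectors and the word labels are hypothetical, and how the resulting clusters are turned into keywords is not described in these notes, so it is not shown.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans(vectors, k, iterations=20, seed=0):
    """Return (assignment, centroids) after a fixed number of k-means iterations."""
    rng = random.Random(seed)
    # Initialize centroids from k distinct input vectors.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_vector(members)
    return assignment, centroids


# Toy stand-ins for word vectors; real vectors would have hundreds of dimensions.
words = ["w1", "w2", "w3", "w4", "w5", "w6"]
vectors = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
assignment, centroids = kmeans(vectors, k=2)
clusters = {}
for word, cluster_id in zip(words, assignment):
    clusters.setdefault(cluster_id, []).append(word)
```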
- Investigating Senna toolkit from NEC. Intending to implement POS tagging based on word vectors.
- Character-based NNLM (6700 chars, 7-gram), 500M data training done.
- 3 hours per iteration
- For word-based NNLM, 1 hour/iteration for 1024 words, 4 hours/iteration for 10240 words
- Performance lower than word-based NNLM
- WordVector-based word and char NNLM training done
- The NNLM initialized with Google word vectors is worse than the randomly initialized NNLM
3T Sogou LM
- Naive training
- all-word in lexicon
- split into 9G text blocks
- Merge one-by-one
- Cutting to 110k lexicon
- Test on QA
- Performance reduced compared to Liurong's previous LM
- Improved training
- re-segmentation with the Tencent 110k lexicon
- re-train with 4G text blocks
- sub-model training done, ready for merging based on the Tencent online1 test set.
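One plausible reading of the block-wise training and one-by-one merging above is: train a sub-model per text block, then fold the sub-models together pairwise by linear interpolation, choosing each weight to minimize perplexity on a held-out set. The sketch below illustrates that reading with toy unigram models. It is an assumption, not the procedure actually used: the function names, the weight grid and the toy probabilities are all hypothetical, and a real system would use n-gram models and an LM toolkit.

```python
import math


def perplexity(model, words, floor=1e-10):
    """Perplexity of a unigram model (dict word -> probability) on a word list."""
    log_sum = sum(math.log(model.get(w, floor)) for w in words)
    return math.exp(-log_sum / len(words))


def interpolate(model_a, model_b, weight_a):
    """Linear interpolation: weight_a * P_a(w) + (1 - weight_a) * P_b(w)."""
    vocab = set(model_a) | set(model_b)
    return {
        w: weight_a * model_a.get(w, 0.0) + (1 - weight_a) * model_b.get(w, 0.0)
        for w in vocab
    }


def merge_one_by_one(sub_models, dev_words, grid=None):
    """Fold sub-models into one model, tuning each weight on dev_words."""
    grid = grid or [i / 10 for i in range(1, 10)]
    merged = sub_models[0]
    for next_model in sub_models[1:]:
        best_weight = min(
            grid,
            key=lambda lam: perplexity(interpolate(merged, next_model, lam), dev_words),
        )
        merged = interpolate(merged, next_model, best_weight)
    return merged


# Toy sub-models standing in for models trained on separate text blocks.
block_a = {"x": 0.9, "y": 0.1}
block_b = {"x": 0.1, "y": 0.9}
dev_words = ["y", "y", "y", "x"]
merged = merge_one_by_one([block_a, block_b], dev_words)
```

Tuning the weights on a set that is later used for evaluation would bias the reported results, so a separate held-out set is the safer choice for this step.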
- CLG embedded decoder is almost done. Online compiler is in progress.
- Zhiyong is working on layer-by-layer DNN training.
- Current N-best results
- N-best search plus pinyin correction
- Total 2718 QA requests
- default 1844 QA correct
- no-entity 1650 QA correct
- with-entity 1884 QA correct
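For comparison, the counts above can be converted to accuracy rates over the 2718 requests. The snippet below only does that arithmetic on the figures listed above; the variable names are illustrative.

```python
TOTAL_REQUESTS = 2718
CORRECT = {"default": 1844, "no-entity": 1650, "with-entity": 1884}


def accuracy(correct, total):
    """Percentage of requests answered correctly."""
    return 100.0 * correct / total


rates = {name: accuracy(count, TOTAL_REQUESTS) for name, count in CORRECT.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1f}%")
# default: 67.8%
# no-entity: 60.7%
# with-entity: 69.3%
```

The with-entity setting therefore gains about 1.5 points over the default (40 more correct requests), while the no-entity setting loses about 7.1 points.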
- Analyze error patterns for Nbest match
- 10.8% song transcription errors
- 18.3% English errors
- 38.7% entity (song name, singer name) recognition lost
- 32.3% non-entity recognition error
- Computing complexity
- 11,000 entities have 23,000 different pronunciations
- Use tree to improve efficiency
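One way to read "use tree" above is a prefix tree (trie) over syllable sequences: pronunciations that share a prefix share a path, so a query is scanned once per start position instead of being compared against every pronunciation. The sketch below assumes that reading. The class name `PronunciationTrie`, the method names and the two example entities are illustrative and are not taken from the actual entity list.

```python
class PronunciationTrie:
    """Prefix tree keyed by syllables; nodes store the entities ending there."""

    def __init__(self):
        self.children = {}
        self.entities = set()

    def insert(self, syllables, entity):
        node = self
        for syllable in syllables:
            node = node.children.setdefault(syllable, PronunciationTrie())
        node.entities.add(entity)

    def find_all(self, syllables):
        """Return (start, end, entity) for every stored pronunciation found."""
        matches = []
        for start in range(len(syllables)):
            node = self
            for end in range(start, len(syllables)):
                node = node.children.get(syllables[end])
                if node is None:
                    break  # no stored pronunciation continues this way
                for entity in sorted(node.entities):
                    matches.append((start, end + 1, entity))
        return matches


# Illustrative entries only: one singer name and one song name.
trie = PronunciationTrie()
trie.insert(["zhou", "jie", "lun"], "周杰伦")
trie.insert(["qing", "hua", "ci"], "青花瓷")

query = ["bo", "fang", "zhou", "jie", "lun", "de", "qing", "hua", "ci"]
found = trie.find_all(query)
# found == [(2, 5, "周杰伦"), (6, 9, "青花瓷")]
```

Several pronunciations of the same entity are simply inserted as separate paths that end in the same entity label.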
- Entity-class LM comparison
- re-segmentation & re-train
- SRILM class-based LM
- Subgraph integration from Zhiyong