- The current text resources have been rearranged and listed
- Optimal Brain Damage (OBD).
- GA-based block sparsity
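
As a rough illustration of the OBD item above: OBD ranks each weight by a saliency of 0.5 * h_kk * w_k^2, where h_kk is the diagonal Hessian entry of the training loss for that weight, and removes the least salient weights. The sketch below assumes the diagonal Hessian has already been computed; the function names and the `prune_fraction` parameter are hypothetical and not taken from these notes.

```python
def obd_saliency(weights, hessian_diag):
    """Per-weight OBD saliency: 0.5 * h_kk * w_k^2."""
    return [0.5 * h * w * w for w, h in zip(weights, hessian_diag)]


def obd_prune(weights, hessian_diag, prune_fraction):
    """Zero out the given fraction of weights with the lowest saliency.

    `weights` and `hessian_diag` are flat lists of equal length.
    `prune_fraction` is an illustrative parameter, not from the notes.
    """
    saliency = obd_saliency(weights, hessian_diag)
    n_prune = int(prune_fraction * len(weights))
    # Indices sorted from least to most salient.
    order = sorted(range(len(weights)), key=lambda k: saliency[k])
    drop = set(order[:n_prune])
    return [0.0 if k in drop else w for k, w in enumerate(weights)]


if __name__ == "__main__":
    w = [1.0, -2.0, 0.5, 3.0]
    h = [1.0, 0.1, 4.0, 0.01]
    print(obd_prune(w, h, 0.5))
```

Note that a large weight can still be pruned when its Hessian entry is small, which is the main difference from plain magnitude-based pruning.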
Efficient DNN training
- Asymmetric window: large improvement on the training set (WER 34% to 24%), but the improvement is lost on the test set. Possibly overfitting?
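
One common reading of an asymmetric window is frame splicing with a different number of left and right context frames. The sketch below assumes that interpretation. The notes do not state the window sizes used, so `left` and `right` are illustrative parameters, and padding by repeating the boundary frame is also an assumption.

```python
def splice_frames(frames, left, right):
    """Concatenate each frame with `left` past and `right` future frames.

    `frames` is a list of feature vectors (lists of floats).
    At utterance boundaries the first or last frame is repeated.
    """
    n = len(frames)
    spliced = []
    for t in range(n):
        context = []
        for offset in range(-left, right + 1):
            # Clamp the index so boundary frames are repeated.
            idx = min(max(t + offset, 0), n - 1)
            context.extend(frames[idx])
        spliced.append(context)
    return spliced


if __name__ == "__main__":
    feats = [[1.0], [2.0], [3.0], [4.0]]
    # Asymmetric window: 2 frames of history, 1 frame of look-ahead.
    for row in splice_frames(feats, left=2, right=1):
        print(row)
```

With `left != right` the input dimension is `(left + right + 1)` times the frame dimension, the same as for a symmetric window of equal total width, so the network size does not change.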
Multi-GPU training
- Error encountered
GMM-DNN co-training
- Error encountered
- Pure Chinese training reached 4.9%
- Chinese + English reduced to 7.9%
- The English phone set should distinguish beginning phones from ending phones
- Should set up a multilingual network structure that shares the low layers but separates the languages at the high layers
- Train on the WSJ database, corrupting the data with various noise types
- White noise training completed. All results are fine
- Car noise training almost finished. Large-variance training in progress
- Investigating LOUDS FST.
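
For the noisy-training items above (corrupting WSJ data with white and car noise), a minimal sketch of mixing a noise signal into clean speech at a target signal-to-noise ratio is given below. The SNR values, the function name `mix_at_snr`, and the choice to loop short noise clips are assumptions. The notes do not say how the corruption was actually done.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mixture has the given SNR in dB.

    Both signals are lists of float samples. If the noise is shorter
    than the speech it is repeated cyclically (an assumption).
    """
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    noise_seg = [noise[i % len(noise)] for i in range(len(speech))]
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise_seg) / len(noise_seg)
    if p_noise == 0.0:
        raise ValueError("noise has zero power")
    # Solve 10 * log10(p_speech / (scale^2 * p_noise)) = snr_db for scale.
    scale = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + scale * n for s, n in zip(speech, noise_seg)]


if __name__ == "__main__":
    clean = [1.0, -1.0, 1.0, -1.0]
    car_noise = [0.5, 0.5]  # placeholder samples, not real data
    print(mix_at_snr(clean, car_noise, snr_db=0.0))
```

Drawing `snr_db` at random per utterance is one way to expose the network to a range of noise conditions during training.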
Word to Vector
- Testing a training toolkit from Stanford University, which can incorporate global information into word-vector training
- C++ implementation (instead of Python) for data pre-processing. Failed. Just use Python.
- Basic word vector plus global sense
- 1 MB corpus takes 5 minutes, vocabulary size 16698
- 10 MB corpus takes about 82 minutes, vocabulary size 56287
- Improved word vector with multiple senses
- Almost impossible with the toolkit
- Could pre-train the vectors and then do clustering
- Word-vector-based keyword extraction
- Word-vector keyword extraction seems more reasonable if the keywords are in the lexicon
- For OOV words, word-vector-based extraction is limited by the vocabulary
- Need a standard new-word extraction method
- Investigating the SENNA toolkit from NEC. Intending to implement POS tagging based on word vectors.
- Character-based NNLM (6700 chars, 7-gram), training on 500M data done.
- 3 hours per iteration
- For word-based NNLM, 1 hour/iteration for 1024 words, 4 hours/iteration for 10240 words
- Performance lower than word-based NNLM
- Word-vector-based word and char NNLM training done
- The Google word-vector-based NNLM is worse than the randomly initialized NNLM
3T Sogou LM
- Improved training
- Re-segmentation with the Tencent 110k lexicon
- Re-train with 4G text blocks
- 1/6 merge done. PPL reduced to 466 (vs. Tencent 8w8 213.74)
- Need to check the OOV problem
- Need to finish the final merge.
- CLG embedded decoder is almost done. Online compiler is in progress.
- Zhiyong is working on layer-by-layer DNN training.
- N-best with entity LM was analyzed
- Entity-class LM comparison
- Re-segmentation & re-train
- SRILM class-based LM ???
- Subgraph integration from Zhiyong
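
For reference when comparing the PPL figures reported for the Sogou LM above, perplexity is the inverse geometric mean of the per-token probabilities. The sketch below computes it from a list of per-token log probabilities. The default base of 10 is an assumption chosen because LM toolkits such as SRILM commonly report base-10 log probabilities, and the function name is hypothetical.

```python
import math


def perplexity(log_probs, base=10.0):
    """Perplexity from per-token log probabilities in the given base.

    PPL = base ** (-(1/N) * sum(log_probs)), with N the number of scored tokens.
    """
    if not log_probs:
        raise ValueError("need at least one token log probability")
    avg_log_prob = sum(log_probs) / len(log_probs)
    return base ** (-avg_log_prob)


if __name__ == "__main__":
    # Four tokens, each assigned probability 0.25 by a hypothetical model.
    scores = [math.log10(0.25)] * 4
    print(perplexity(scores))
```

Because only scored tokens enter N, two models with different vocabularies (and therefore different OOV rates) are not directly comparable by PPL alone, which is relevant to the OOV check listed above.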