
Resource Building

  • The current text resources have been rearranged and listed

AM development

Sparse DNN

  • Optimal Brain Damage (OBD).
  1. GA-based block sparsity
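
For reference, a minimal sketch of OBD-style pruning. It assumes the diagonal Hessian entries are already available (how they are estimated is not shown), and the function names and toy values are hypothetical, not taken from the actual experiments. Each weight gets the saliency s_k = 0.5 * h_kk * w_k^2 and the least salient weights are zeroed.

```python
import numpy as np

def obd_saliencies(weights, hessian_diag):
    """OBD saliency of each weight: s_k = 0.5 * h_kk * w_k**2."""
    return 0.5 * hessian_diag * weights ** 2

def prune_lowest_saliency(weights, hessian_diag, fraction):
    """Zero out the given fraction of weights with the smallest saliency."""
    saliency = obd_saliencies(weights, hessian_diag)
    n_prune = int(fraction * weights.size)
    # Indices (in the flattened array) of the n_prune least salient weights.
    prune_idx = np.argsort(saliency, axis=None)[:n_prune]
    mask = np.ones(weights.size, dtype=bool)
    mask[prune_idx] = False
    mask = mask.reshape(weights.shape)
    return weights * mask, mask

# Toy values, for illustration only.
w = np.array([[0.5, -0.1], [2.0, 0.05]])
h = np.array([[1.0, 4.0], [0.1, 8.0]])  # hypothetical diagonal Hessian entries
pruned, mask = prune_lowest_saliency(w, h, fraction=0.5)
```

Note that the large weight 2.0 survives even though its Hessian entry is small, while the two small weights are removed; plain magnitude pruning would give the same result here, but the two criteria differ in general.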

Efficient DNN training

  1. Asymmetric window: large improvement on the training set (WER 34% to 24%), but the improvement is lost on the test set. Overfitting?
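
A minimal sketch of what an asymmetric context window could look like at the feature level, assuming it means splicing more past frames than future frames into each DNN input. The function name and the window sizes are hypothetical; the sizes used in the experiment are not recorded here.

```python
import numpy as np

def splice_frames(features, left, right):
    """Stack each frame with `left` past and `right` future frames.

    Edges are padded by repeating the first and last frame.
    """
    n_frames, _ = features.shape
    padded = np.concatenate([
        np.repeat(features[:1], left, axis=0),
        features,
        np.repeat(features[-1:], right, axis=0),
    ])
    # Row t of the result is frames t-left .. t+right, concatenated.
    return np.stack([
        padded[t:t + left + right + 1].reshape(-1) for t in range(n_frames)
    ])

# Toy input: 5 frames of 1-dimensional features.
feats = np.arange(5, dtype=float).reshape(5, 1)
spliced = splice_frames(feats, left=3, right=1)  # asymmetric: 3 past, 1 future
```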

Multi GPU training

  • Error encountered

GMM - DNN co-training

  • Error encountered

Multilanguage training

  1. Pure Chinese training reached 4.9%
  2. Chinese + English training: performance dropped to 7.9%
  3. The English phone set should distinguish between beginning phones and ending phones
  4. Should set up a multilingual network structure that shares the low layers but separates the languages at the high layers
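
The proposed structure in item 4 can be sketched as a forward pass with shared hidden layers and one softmax output layer per language. This is only an illustration under assumed settings: the class name, layer sizes, language keys and sigmoid activation are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out):
    """Small random weights and zero biases for one affine layer."""
    return rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

class SharedHiddenDNN:
    """Hidden layers shared by all languages, one output layer per language."""

    def __init__(self, n_in, hidden_sizes, outputs_per_language):
        sizes = [n_in] + list(hidden_sizes)
        self.shared = [init_layer(a, b) for a, b in zip(sizes[:-1], sizes[1:])]
        self.heads = {lang: init_layer(sizes[-1], n_out)
                      for lang, n_out in outputs_per_language.items()}

    def forward(self, x, lang):
        h = x
        for W, b in self.shared:             # language-independent layers
            h = 1.0 / (1.0 + np.exp(-(h @ W + b)))
        W, b = self.heads[lang]              # language-specific output layer
        return softmax(h @ W + b)

# Hypothetical sizes: 40-dim input, 100 Chinese and 120 English targets.
net = SharedHiddenDNN(40, [64, 64], {"zh": 100, "en": 120})
x = rng.normal(size=(3, 40))
post_zh = net.forward(x, "zh")
post_en = net.forward(x, "en")
```

During training, each mini-batch would update the shared layers plus only the output layer of its own language.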

Noise training

  • Training on the WSJ database, with the data corrupted by various noise types
  • White noise training completed. All results are fine
  • Car noise training almost finished. Large-variance training in progress
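
A minimal sketch of how a clean utterance can be corrupted with a noise recording at a chosen signal-to-noise ratio. The function name, the toy signals and the 10 dB value are hypothetical; the SNR levels used in the actual training are not listed here.

```python
import numpy as np

def add_noise(speech, noise, snr_db):
    """Mix `noise` into `speech` so that the result has the requested SNR (dB)."""
    # Repeat or crop the noise to match the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[:len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale so that 10 * log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
clean = np.sin(np.linspace(0, 100, 16000))  # stand-in for a speech waveform
white = rng.normal(size=4000)               # stand-in for a noise recording
noisy = add_noise(clean, white, snr_db=10)
measured_snr = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
```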

Engine optimization

  • Investigating LOUDS FST.
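
For background, a minimal sketch of LOUDS (level-order unary degree sequence) on a plain tree such as a lexicon trie. How LOUDS would be applied to the FST in the engine is not decided here, and the function names are hypothetical. Each node, visited in breadth-first order, contributes one 1 bit per child followed by a 0, after a "10" super-root prefix. A real implementation would use constant-time rank/select structures instead of the linear scans below.

```python
from collections import deque

def louds_encode(children, root=0):
    """Return the LOUDS bit list and the breadth-first node order."""
    bits = [1, 0]  # super-root
    order = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        kids = children.get(node, [])
        bits.extend([1] * len(kids) + [0])  # degree in unary
        queue.extend(kids)
    return bits, order

def louds_children(bits, i):
    """Breadth-first numbers of the children of the node numbered i."""
    zeros_seen = 0
    start = None
    for pos, bit in enumerate(bits):
        if bit == 0:
            zeros_seen += 1
            if zeros_seen == i + 1:  # block of node i starts after this 0
                start = pos + 1
                break
    if start is None:
        raise IndexError("no such node")
    first_child = sum(bits[:start])  # rank1: number of 1 bits before the block
    degree = 0
    while start + degree < len(bits) and bits[start + degree] == 1:
        degree += 1
    return list(range(first_child, first_child + degree))

# Toy tree whose labels equal the breadth-first numbers.
tree = {0: [1, 2], 1: [3]}
bits, order = louds_encode(tree)
```

A tree with n nodes takes 2n + 1 bits in this encoding, which is what makes it attractive for embedded use.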

Word to Vector

  • Tested a training toolkit from Stanford University that can incorporate global information into word-vector training
  • Tried a C++ implementation (instead of Python) for data pre-processing. Failed; will just use Python.
  • Basic word vectors plus global sense:
  • 1 MB corpus takes 5 mins, vocab size 16698
  • 10 MB corpus takes about 82 mins, vocab size 56287
  • Improved word vectors with multiple senses:
  • Almost impossible with the toolkit
  • Could pre-train the vectors and then do clustering
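
One possible reading of the "pre-train, then cluster" idea, sketched under assumptions: each occurrence of an ambiguous word is represented by the average of its neighbours' pre-trained vectors, and the occurrences are clustered with k-means so that each cluster stands for one sense. The toy vectors, sentences, window size and function names are all hypothetical.

```python
import numpy as np

def context_vector(tokens, position, vectors, window=2):
    """Average the pre-trained vectors of the words around `position`."""
    lo, hi = max(0, position - window), position + window + 1
    neighbours = [vectors[t] for j, t in enumerate(tokens[lo:hi], start=lo)
                  if j != position and t in vectors]
    return np.mean(neighbours, axis=0)

def kmeans(points, k, n_iter=20):
    """Plain k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels, centers

# Toy pre-trained vectors and sentences, for illustration only.
vectors = {"river": np.array([1.0, 0.0]), "water": np.array([0.9, 0.1]),
           "money": np.array([0.0, 1.0]), "loan": np.array([0.1, 0.9])}
sentences = [["river", "bank", "water"], ["money", "bank", "loan"],
             ["water", "bank", "river"], ["loan", "bank", "money"]]
contexts = np.array([context_vector(s, s.index("bank"), vectors) for s in sentences])
sense_labels, sense_centers = kmeans(contexts, k=2)
```

Each cluster centre could then serve as one sense vector for the word.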

  • Word-vector-based keyword extraction
  • Word-vector keyword extraction seems more reasonable if the keywords are in the lexicon
  • For OOV words, word-vector-based extraction is limited by the vocabulary
  • Need a standard method for new-word extraction

  • Investigating the SENNA toolkit from NEC, intending to implement POS tagging based on word vectors.

LM development


  • Character-based NNLM (6700 chars, 7-gram): training on 500M data done.
  • 3 hours per iteration
  • For word-based NNLM, 1 hour/iteration for 1024 words, 4 hours/iteration for 10240 words
  • Performance is lower than that of the word-based NNLM
  • Word-vector-based word and char NNLM training done
  • NNLM initialized with Google word vectors is worse than randomly initialized NNLM

3T Sogou LM

  • Improved training
  • Re-segmentation with the Tencent 110k lexicon
  • Re-training with 4G text blocks
  • 1/6 merge done. PPL reduced to 466 (vs. Tencent 8w8: 213.74)
  • Need to check the OOV problem
  • Need to finish the final merge.
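
For checking the PPL and OOV figures, a minimal sketch of the two quantities, assuming per-token log10 probabilities are available from the LM toolkit. The function names and toy inputs are hypothetical. A high OOV rate on the test text is one reason PPL values from LMs with different lexicons are hard to compare directly.

```python
import math

def perplexity(log10_probs):
    """PPL = 10 ** (-(1/N) * sum of per-token log10 probabilities)."""
    return 10 ** (-sum(log10_probs) / len(log10_probs))

def oov_rate(tokens, vocab):
    """Fraction of test tokens that are not in the LM vocabulary."""
    return sum(1 for t in tokens if t not in vocab) / len(tokens)

# Toy inputs, for illustration only.
toy_ppl = perplexity([math.log10(0.25)] * 4)
toy_oov = oov_rate(["a", "b", "c", "d"], {"a", "b", "c"})
```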

Embedded development

  • CLG embedded decoder is almost done. Online compiler is in progress.
  • Zhiyong is working on layer-by-layer DNN training.

Speech QA

  • N-best with entity LM was analyzed
  • Entity-class LM comparison
  • Re-segmentation & re-training
  • SRILM class-based LM ???
  • Subgraph integration from Zhiyong