Revision as of 04:58, 7 March 2014 by Zhaomy


Resource Building

  • The current text resources have been re-arranged and catalogued

AM development

Sparse DNN

  • Optimal Brain Damage (OBD).
  1. GA-based (genetic algorithm) block sparsity
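As a rough illustration of the OBD idea (the saliency formula is standard OBD; the function and numbers below are hypothetical, not the group's actual code): each weight w_i gets a saliency s_i = h_ii * w_i^2 / 2 from the diagonal Hessian entry h_ii, and the least salient weights are zeroed to sparsify the layer.

```python
def obd_prune(weights, hessian_diag, sparsity):
    """Zero out the fraction `sparsity` of weights with lowest OBD saliency.

    Saliency of weight i is 0.5 * h_ii * w_i^2 (diagonal-Hessian
    approximation of the loss increase when removing that weight).
    """
    saliencies = [0.5 * h * w * w for w, h in zip(weights, hessian_diag)]
    n_prune = int(len(weights) * sparsity)
    # indices of the n_prune smallest saliencies
    order = sorted(range(len(weights)), key=lambda i: saliencies[i])
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned

# Toy example: half the weights are removed by saliency, not magnitude order.
print(obd_prune([0.9, -0.1, 0.5, 0.05], [1.0, 1.0, 1.0, 1.0], 0.5))
```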

Efficient DNN training

  1. Asymmetric window: large improvement on the training set (WER 34% → 24%), but the gain is lost on the test set. Possibly overfitting?
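An asymmetric window here means splicing more past frames than future frames when building the DNN input. A minimal sketch (the frame data, window sizes, and function name are illustrative, not the actual recipe):

```python
def splice(frames, left, right):
    """Stack `left` past frames and `right` future frames around each frame.

    An asymmetric window uses left != right, e.g. more history than
    look-ahead.  Edges are handled by repeating the boundary frame.
    """
    out = []
    n = len(frames)
    for t in range(n):
        ctx = []
        for d in range(-left, right + 1):
            idx = min(max(t + d, 0), n - 1)  # clamp at utterance edges
            ctx.extend(frames[idx])
        out.append(ctx)
    return out

# Toy 1-dimensional frames, asymmetric window: 3 past, 1 future.
frames = [[0.0], [1.0], [2.0], [3.0]]
print(splice(frames, 3, 1))
```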

Multi GPU training

  • Error encountered

GMM - DNN co-training

  • Error encountered

Multilanguage training

  1. Pure Chinese training reached 4.9% (WER)
  2. Joint Chinese + English training degraded this to 7.9%
  3. The English phone set should distinguish word-beginning and word-ending phones
  4. Should set up a multilingual network structure that shares the low layers but separates the languages at the high layers
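The shared-low-layers idea in item 4 can be sketched as follows (a toy forward pass with illustrative identity weights, not the actual training setup): all languages pass through the same hidden layers, then branch into a language-specific output layer.

```python
def dense(W, b):
    """Return a layer computing W @ v + b for a plain-list vector v."""
    def layer(v):
        return [sum(wi * vi for wi, vi in zip(row, v)) + bi
                for row, bi in zip(W, b)]
    return layer

class MultilingualNet:
    """Low layers shared across languages; one output head per language."""
    def __init__(self, shared, heads):
        self.shared = shared  # layers used for every language
        self.heads = heads    # language -> language-specific output layer

    def forward(self, x, language):
        h = x
        for layer in self.shared:
            h = layer(h)
        return self.heads[language](h)

# Illustrative weights: a shared identity layer, then per-language heads.
net = MultilingualNet(
    shared=[dense([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])],
    heads={"zh": dense([[1.0, 0.0]], [0.0]),
           "en": dense([[0.0, 1.0]], [0.0])})
print(net.forward([3.0, 4.0], "zh"), net.forward([3.0, 4.0], "en"))
```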

Noise training

  • Train on the WSJ database, corrupting the data with various noise types
  • White-noise and car-noise training partially completed
  • Mixture training produces better performance for both car and white noise
  • Unknown-noise testing is in progress
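Corrupting clean data with noise typically means scaling a noise signal to a target SNR and adding it. A simplified sketch (toy samples; the real corruption would run over WSJ waveforms):

```python
import math

def add_noise(speech, noise, snr_db):
    """Mix `noise` into `speech` at the target signal-to-noise ratio (dB).

    The noise is rescaled so that 10*log10(P_speech / P_noise) == snr_db.
    """
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    scale = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + scale * n for s, n in zip(speech, noise)]

# At 0 dB SNR the scaled noise has the same power as the speech.
print(add_noise([1.0, -1.0, 1.0, -1.0], [1.0, 1.0, 1.0, 1.0], 0.0))
```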

AMR compression re-training

  • WeChat uses AMR compression, which requires adapting our AM
  • Tested the AMR and non-AMR models (WER %):
        model   test WAV   test AMR
        WAV         4.31      26.09
        AMR        13.80       6.77
  • Preparing to do adaptation


gfbank training

  • Finished the first round of gfbank training & testing
  • The same GMM model (MFCC features) was used to obtain the alignments
  • Trained fbank and gfbank systems based on the MFCC alignments
  • Clean training, noisy test (WER %):
             clean    25 dB     5 dB
        gf    4.22     5.60    73.03
        fb    4.31     5.87    84.12

Engine optimization

  • Investigating LOUDS FST.
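LOUDS (Level-Order Unary Degree Sequence) stores a tree's shape as a bit string: visit nodes breadth-first and write one '1' per child followed by a '0', then navigate with rank/select. A sketch of the encoding on a generic tree (illustrative only, not the FST code under investigation):

```python
from collections import deque

def louds_encode(children, root):
    """children: dict node -> ordered child list.

    Returns the LOUDS bit string (with the conventional '10' super-root
    prefix) and the BFS node order that the bits describe.
    """
    bits = ["10"]  # super-root: virtual parent of the real root
    order = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        kids = children.get(node, [])
        bits.append("1" * len(kids) + "0")  # unary degree, then terminator
        queue.extend(kids)
    return "".join(bits), order

# Small tree: a -> (b, c), b -> (d)
print(louds_encode({"a": ["b", "c"], "b": ["d"]}, "a"))
```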

Word to Vector

  • Tested a training toolkit from Stanford University that can incorporate global information into word2vec training
  • Tried a C++ implementation (instead of Python) for data pre-processing; it failed, so we will just use Python
  • Basic word vectors plus global sense:
      • A 1 MB corpus takes 5 minutes (vocab size 16,698)
      • A 10 MB corpus takes about 82 minutes (vocab size 56,287)
  • Improved word vectors with multiple senses:
      • Almost impossible with this toolkit
      • Could pre-train the vectors and then do clustering
  • Word-vector-based keyword extraction:
      • Prepared 500+ articles in 7 categories
      • Encountered a problem in keyword identification; fixed it by using the article vector space
  • Investigating the SENNA toolkit from NEC; intending to implement POS tagging based on word vectors
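The "pre-train vectors, then cluster" idea above could be sketched with plain k-means over a word's context vectors, each cluster becoming one sense. The data, k, and function names below are illustrative only:

```python
def dist2(a, b):
    """Squared Euclidean distance between two plain-list vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(pts):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(pts)
    return [sum(p[i] for p in pts) / n for i in range(len(pts[0]))]

def kmeans(points, k, iters=10):
    """Cluster `points` into k groups; returns the final centroids.

    Initialization is naive (first k points) -- enough for a sketch of
    splitting one word's context vectors into senses.
    """
    centers = points[:k]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda j: dist2(p, centers[j]))].append(p)
        centers = [mean(g) if g else centers[j] for j, g in enumerate(groups)]
    return centers

# Two well-separated "senses" in toy 2-D context space.
pts = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
print(kmeans(pts, 2))
```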

LM development


  • Character-based NNLM (6,700 characters, 7-gram): training on 500M data done
  • Performance is lower than the word-based NNLM
  • Preparing to run a boundary-aware character NNLM
  • Word-vector-based word and character NNLM training done
  • The NNLM initialized with Google word vectors is worse than a randomly initialized NNLM

3T Sogou LM

  • Improved training
  • 3T LM + Tencent 80k LM: performance worse than the original 80k LM
  • Need to check whether this is caused by the mismatched vocabulary
  • 3T LM + QA LM: using online1 as the EM target; performance worse than the QA LM alone
  • Probably due to the incorrect EM target
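The EM step mentioned above (fitting an interpolation weight against a target text) can be sketched as follows; `p1`/`p2` are per-token probabilities that the two LMs assign to the held-out target, and the data here are hypothetical, not the actual 3T/QA setup:

```python
def em_interp_weight(p1, p2, iters=100):
    """EM-estimate lambda for the mixture lam*p1 + (1-lam)*p2 that
    maximizes likelihood of the held-out tokens."""
    lam = 0.5
    for _ in range(iters):
        # E-step: posterior probability that each token came from LM 1
        post = [lam * a / (lam * a + (1.0 - lam) * b)
                for a, b in zip(p1, p2)]
        # M-step: the new weight is the average posterior
        lam = sum(post) / len(post)
    return lam

# LM 1 fits the target text far better, so lambda goes to ~1.
print(em_interp_weight([0.9, 0.8, 0.9], [0.1, 0.2, 0.1]))
```

If the EM target text is mismatched (as suspected above), the estimated lambda is tuned to the wrong distribution, which would explain the degradation.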

QA Matching

  • Working on edit FST for fuzzy matching
  • TF/IDF score matching completed
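The completed TF/IDF score matching presumably works along these lines (a generic sketch with toy documents, not the group's implementation): weight each term by tf * log(N/df), then rank candidate answers by cosine similarity to the query.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists.  Returns one sparse tf-idf dict per doc."""
    df = Counter()
    for d in docs:
        df.update(set(d))          # document frequency per term
    n = len(docs)
    return [{w: tf[w] * math.log(n / df[w]) for w in tf}
            for tf in (Counter(d) for d in docs)]

def cosine(a, b):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [["open", "the", "door"],
        ["close", "the", "door"],
        ["sing", "a", "song"]]
vecs = tfidf_vectors(docs)
# The first two documents share terms, the third shares none.
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```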

Embedded development

  • The CLG embedded decoder is almost done; the online compiler is in progress
  • English scoring is underway

Speech QA

  • N-best with entity LM was analyzed
  • Entity-class LM comparison
  • Re-segmentation & re-training
  • SRILM class-based LM ???
  • Subgraph integration from Zhiyong
  • WER summary is done
  • Preparing to compose a paper