Difference between revisions of "11-16 Bin Yuan"

From cslt Wiki
=== Result ===
1. experiment 1

  1.1 baseline
    corpus: BJYD2.txt, gxdx500h.txt, huawei_126h.txt, huawei_new.txt
    am: mdl_v3.0.S
    test set: test_BJYD
    result:
      %WER 56.58 [ 8541 / 15096, 288 ins, 5075 del, 3178 sub ]
      %SER 93.20 [ 1096 / 1176 ]
      北京: 6 / 10 (the text of the BJYD test set contains 10 occurrences of "北京"; 6 of the 10 were decoded)

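As a quick sanity check on the scoring lines, WER is conventionally the sum of insertions, deletions and substitutions divided by the number of reference words. The sketch below recomputes the baseline figure from the counts in the result line; `wer_percent` is a hypothetical helper name, not part of the scoring scripts used in the experiment.

```python
def wer_percent(ins: int, dels: int, subs: int, ref_words: int) -> float:
    """Word error rate in percent: (ins + del + sub) / reference words."""
    return 100.0 * (ins + dels + subs) / ref_words


# Counts copied from the baseline line:
# %WER 56.58 [ 8541 / 15096, 288 ins, 5075 del, 3178 sub ]
baseline = wer_percent(288, 5075, 3178, 15096)
print(f"%WER {baseline:.2f}")
```

The same helper can be applied to every %WER line in this report to confirm that the error counts and the percentage agree.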
  1.2 use address tag:
    jsgf: the 500 most frequent addresses (including "北京") extracted from the corpus
    corpus: BJYD2.txt, gxdx500h.txt, huawei_126h.txt, huawei_new.txt, with sentences containing "北京" removed;
      tags are then added to the corpus (e.g. if "清华大学" is in the jsgf and the corpus contains the sentence
      "我 在 清华大学 上课", the sentence "我 在 <address> 上课" is added to the corpus)
    am: mdl_v3.0.S
    test set: test_BJYD

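The tagging step described in the setup above can be sketched as follows. This is a minimal illustration, not the script used in the experiment: the function name `add_tagged_sentences` is hypothetical, sentences are assumed to be whitespace-tokenized, and only single-token addresses are matched (multi-token addresses would need phrase matching).

```python
def add_tagged_sentences(sentences, addresses, tag="<address>"):
    """Return the original sentences plus a tagged copy of every sentence
    that contains at least one address from the list.

    Assumption: each address is a single whitespace-delimited token.
    """
    address_set = set(addresses)
    augmented = list(sentences)
    for sentence in sentences:
        tokens = sentence.split()
        tagged = [tag if tok in address_set else tok for tok in tokens]
        if tagged != tokens:
            # Keep the original sentence and add the tagged variant.
            augmented.append(" ".join(tagged))
    return augmented


corpus = ["我 在 清华大学 上课", "今天 天气 很 好"]
augmented_corpus = add_tagged_sentences(corpus, ["清华大学"])
print(augmented_corpus)
```

The original sentence is kept alongside the tagged one, matching the description that the tagged sentence is added to the corpus rather than substituted for the original.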
    results for different merge weights:
      weight: 0.1
        %WER 69.49 [ 10490 / 15096, 196 ins, 6016 del, 4278 sub ]
        %SER 94.98 [ 1117 / 1176 ]
        北京: 4 / 10

      weight: 0.5
        %WER 62.23 [ 9394 / 15096, 190 ins, 5870 del, 3334 sub ]
        %SER 93.88 [ 1104 / 1176 ]
        北京: 4 / 10

      weight: 1
        %WER 58.03 [ 8760 / 15096, 243 ins, 5294 del, 3223 sub ]
        %SER 93.28 [ 1097 / 1176 ]
        北京: 2 / 10

      weight: 2
        %WER 56.90 [ 8589 / 15096, 344 ins, 4558 del, 3687 sub ]
        %SER 93.71 [ 1102 / 1176 ]
        北京: 1 / 10

      weight: 3
        "北京" could not be decoded

-------------------------------------------------------------------------------
Over the weekend I found two mistakes in experiment 1:
    1. run_decode.sh was used incorrectly. I copied this script from xiaoxi's directory into my own directory
      and ran it from there, which led to a higher WER.
    2. One step in building the merged lexicon fst was wrong (in experiment 1.2). Merging grammar_G.fst and lm_G.fst
      generates a new sym.txt and a new lexicon, and the new sym.txt has a "#0" at the end of the file.
      format_lm.sh then uses this sym.txt to generate words.txt and appends another "#0" to the end of words.txt,
      so words.txt contains two "#0" entries, which leads to wrong results. Under this condition, whenever
      the decoding result contained a TAG, it was truncated. This explains why the deletion error is
      high when the merge weight is small in experiment 1.2.

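A duplicated entry such as the double "#0" can be caught before decoding with a small check on the symbol table. The sketch below assumes each non-empty line of words.txt has the form "<symbol> <integer id>"; the function name `duplicated_symbols` and the toy table are illustrative only.

```python
from collections import Counter


def duplicated_symbols(lines):
    """Return the symbols that occur more than once in a symbol table.

    Assumption: each non-empty line looks like "<symbol> <integer id>".
    """
    counts = Counter(line.split()[0] for line in lines if line.strip())
    return sorted(sym for sym, n in counts.items() if n > 1)


# Toy table reproducing the problem: "#0" was appended twice.
table = ["<eps> 0", "我 1", "<address> 2", "#0 3", "#0 4"]
print(duplicated_symbols(table))
```

For a real file, the lines could be read with `open("words.txt", encoding="utf-8")` and passed to the same function; an empty result means no symbol is listed twice.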
2. experiment 2
  2.1 pre-work:
    2.1.1 build jsgf file
      Extract an address list from the corpus, sort it and count the occurrences, then uniformly sample 490 addresses
      from those that appear no more than 10 times in the corpus, and finally add 10 addresses that do not
      appear in the corpus.

      some samples of the 490 addresses:
        黑龙江省、宿迁市、安定门、吉林省 吉林市、芙蓉 西街、南三环 中路、朝阳 北路 大悦城、石门县
      some samples of the 10 addresses:
        上海市 浦东新区 陆家嘴、布鲁塞尔、阿姆斯特丹、圣马力诺、北京市 海淀区 清华大学、明斯克、摩纳哥

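The selection of rare addresses in 2.1.1 could look roughly like the sketch below. It assumes "uniformly sample" means uniform random sampling without replacement, which the report does not state explicitly; the function name, the seed and the toy data are hypothetical.

```python
import random
from collections import Counter


def sample_rare_addresses(address_mentions, n_samples, max_count=10, seed=0):
    """Sample n_samples distinct addresses that occur at most max_count times.

    address_mentions holds one entry per occurrence of an address in the corpus.
    """
    counts = Counter(address_mentions)
    # Sort so that the result is reproducible for a fixed seed.
    rare = sorted(addr for addr, c in counts.items() if c <= max_count)
    if n_samples > len(rare):
        raise ValueError("not enough rare addresses to sample from")
    return random.Random(seed).sample(rare, n_samples)


# Toy data: "北京" is frequent (50 mentions), the others are rare.
mentions = ["北京"] * 50 + ["安定门"] * 3 + ["石门县"] * 2 + ["宿迁市"]
picked = sample_rare_addresses(mentions, n_samples=2)
print(picked)
```

In the setting of this experiment, `n_samples` would be 490 and `max_count` would be 10, with the 10 unseen addresses appended afterwards.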
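    2.1.2 construct a new test set named "test_address_tag", composed as follows:
      The 120 sentences in the test set contain three kinds of place names:
        place names that are frequent in the training corpus (more than 10 occurrences) and not in the jsgf (30 sentences, sampled at equal intervals by the place name's occurrence count in the training corpus)
        first kind of jsgf place name: fewer than 10 occurrences in the training corpus (40 sentences, sampled at equal intervals by the place name's occurrence count in the training corpus)
        second kind of jsgf place name: never appears in the training corpus (50 sentences, 5 test sentences per place name)
      Each of the 120 sentences was recorded twice (by two different speakers), giving 240 audio files in total; 12 speakers took part, each recording 20 sentences.
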
  2.2 baseline
    corpus: BJYD2.txt, gxdx500h.txt, huawei_126h.txt, huawei_new.txt
    am: mdl_1400
    test set: test_address_tag
    result:
      %WER 20.66 [ 848 / 4104, 189 ins, 354 del, 305 sub ]
      %SER 73.33 [ 176 / 240 ]

  2.3 address tag
    corpus: BJYD2.txt, gxdx500h.txt, huawei_126h.txt, huawei_new.txt, with address tags added to the corpus
    am: mdl_1400
    test set: test_address_tag
    weight: 1
      %WER 15.98 [ 656 / 4104, 169 ins, 291 del, 196 sub ]
      %SER 69.17 [ 166 / 240 ]
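To put the two experiment 2 results side by side, the relative error reduction can be computed from the error counts in the %WER lines. This is a small arithmetic sketch; `relative_reduction` is a hypothetical helper name, and the comparison is meaningful here only because both runs are scored against the same 4104 reference words.

```python
def relative_reduction(baseline_errors: int, new_errors: int) -> float:
    """Relative reduction in error count, as a percentage of the baseline."""
    return 100.0 * (baseline_errors - new_errors) / baseline_errors


# Error counts on test_address_tag:
# baseline 848 errors, address tag with weight 1: 656 errors.
reduction = relative_reduction(848, 656)
print(f"{reduction:.1f}% relative WER reduction")
```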

Latest revision as of 14:51, 23 November 2014

Accomplished this week

  • build a new jsgf file
  • construct a test set for address tag language model
  • conduct a new experiment; the results are reported in the Result section

Planned for next week

  • check the relation between weight and dict size.
  • short terms should be penalized.
  • write a summary of tag-lm.
  • read some papers about knowledge vectors.