Difference between revisions of "11-16 Bin Yuan"

 
* write a summary of tag-lm.

* read some papers about knowledge vectors.
 
=== Result ===
 
1. experiment 1
 
 
  1.1 baseline
 
    corpus:BJYD2.txt, gxdx500h.txt, huawei_126h.txt, huawei_new.txt
 
    am: mdl_v3.0.S
 
    test set: test_BJYD
 
    result:
 
      %WER 56.58 [ 8541 / 15096, 288 ins, 5075 del, 3178 sub ]
 
      %SER 93.20 [ 1096 / 1176 ]
 
      北京: 6 / 10 (the reference text of test_BJYD contains 10 occurrences of "北京"; 6 of the 10 were decoded)
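The %WER lines in this report follow the pattern "errors / reference words, ins, del, sub". As a sanity check, the sketch below parses such a line and verifies that the three error types sum to the error count and that the percentage matches. The function names `parse_wer` and `check_wer` are illustrative only and are not part of the decoding scripts.

```python
import re

# Matches lines such as:
#   %WER 56.58 [ 8541 / 15096, 288 ins, 5075 del, 3178 sub ]
WER_LINE = re.compile(
    r"%WER\s+([\d.]+)\s+\[\s*(\d+)\s*/\s*(\d+),\s*(\d+)\s+ins,"
    r"\s*(\d+)\s+del,\s*(\d+)\s+sub\s*\]"
)


def parse_wer(line):
    """Split a %WER line into its numeric fields."""
    m = WER_LINE.search(line)
    if m is None:
        raise ValueError("not a %WER line: " + line)
    errors, total, ins, dele, sub = (int(g) for g in m.groups()[1:])
    return {"wer": float(m.group(1)), "errors": errors, "total": total,
            "ins": ins, "del": dele, "sub": sub}


def check_wer(line):
    """True if ins + del + sub equals the error count and the
    percentage equals errors / total, up to rounding."""
    r = parse_wer(line)
    if r["ins"] + r["del"] + r["sub"] != r["errors"]:
        return False
    return abs(100.0 * r["errors"] / r["total"] - r["wer"]) < 0.005


print(check_wer("%WER 56.58 [ 8541 / 15096, 288 ins, 5075 del, 3178 sub ]"))
```

For the baseline line above, 288 + 5075 + 3178 = 8541 and 8541 / 15096 rounds to 56.58%, so the check passes.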
 
 
  1.2 use address tag:
 
    jsgf: extract the 500 most frequent addresses (including "北京") from the corpus

    corpus: BJYD2.txt, gxdx500h.txt, huawei_126h.txt, huawei_new.txt, with sentences containing "北京" removed;

      tags are then added to the corpus (e.g. if "清华大学" is in the jsgf and the corpus contains the sentence "我 在 清华大学 上课",

      the sentence "我 在 <address> 上课" is added to the corpus)
 
    am: mdl_v3.0.S
 
    test set: test_BJYD
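The tagging step of 1.2 can be sketched as follows. This is a minimal illustration, not the script actually used: it assumes the corpus is whitespace-tokenized, that an address may span several tokens, that the longest match wins, and that the original sentence is kept next to its tagged copy. The names `tag_sentence` and `augment_corpus` are hypothetical.

```python
def tag_sentence(tokens, address_set, tag="<address>"):
    """Replace every address occurrence in tokens by tag.

    address_set holds addresses as tuples of tokens.
    Returns the tagged token list, or None if nothing matched.
    """
    max_len = max((len(a) for a in address_set), default=0)
    out, i, found = [], 0, False
    while i < len(tokens):
        # Try the longest candidate span first (assumed matching policy).
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in address_set:
                out.append(tag)
                i += n
                found = True
                break
        else:
            # No address starts at position i: copy the token unchanged.
            out.append(tokens[i])
            i += 1
    return out if found else None


def augment_corpus(sentences, addresses):
    """Keep every sentence and add a tagged copy when it contains an address."""
    address_set = {tuple(a.split()) for a in addresses}
    result = []
    for sentence in sentences:
        result.append(sentence)
        tagged = tag_sentence(sentence.split(), address_set)
        if tagged is not None:
            result.append(" ".join(tagged))
    return result


print(augment_corpus(["我 在 清华大学 上课"], ["清华大学"]))
```

With the example from the text, the output contains the original sentence followed by "我 在 <address> 上课".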
 
 
    results with different merge weights:
 
      weight: 0.1
 
        %WER 69.49 [ 10490 / 15096, 196 ins, 6016 del, 4278 sub ]
 
        %SER 94.98 [ 1117 / 1176 ]
 
        北京: 4 / 10
 
 
      weight: 0.5
 
        %WER 62.23 [ 9394 / 15096, 190 ins, 5870 del, 3334 sub ]
 
        %SER 93.88 [ 1104 / 1176 ]
 
        北京: 4 / 10
 
 
      weight: 1
 
        %WER 58.03 [ 8760 / 15096, 243 ins, 5294 del, 3223 sub ]
 
        %SER 93.28 [ 1097 / 1176 ]
 
        北京: 2 / 10
 
 
      weight: 2
 
        %WER 56.90 [ 8589 / 15096, 344 ins, 4558 del, 3687 sub ]
 
        %SER 93.71 [ 1102 / 1176 ]
 
        北京: 1 / 10
 
 
      weight: 3
 
        "北京" cannot be decoded at all
 
 
-------------------------------------------------------------------------------
 
This weekend I found two mistakes in experiment 1:

    1. run_decode.sh was used incorrectly. I copied this script from xiaoxi's directory to my own directory

      and ran it there, which led to a higher WER.

    2. One step of building the merged lexicon fst was wrong (in experiment 1.2). Merging grammar_G.fst and lm_G.fst

      generates a new sym.txt and a new lexicon. The new sym.txt already contains a "#0" at the end of the file,

      and format_lm.sh uses this sym.txt to generate words.txt and appends another "#0" to the end of words.txt,

      so words.txt contains two "#0" entries, which leads to wrong results. Under this condition, whenever

      the decoding result contains TAG, it is always truncated. This explains why the deletion error is

      high when the merge weight is small in experiment 1.2.
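A duplicated entry such as the second "#0" can be caught before decoding with a small check on the symbol table. The sketch below assumes the usual one "symbol id" pair per line layout of words.txt; the helper name `duplicate_symbols` is illustrative and not part of format_lm.sh.

```python
from collections import Counter


def duplicate_symbols(lines):
    """Return the symbols that occur on more than one line.

    Each non-empty line is expected to look like "symbol id".
    """
    counts = Counter(line.split()[0] for line in lines if line.strip())
    return sorted(sym for sym, n in counts.items() if n > 1)


# Toy table reproducing the problem: "#0" is listed twice.
words_txt = ["<eps> 0", "北京 1", "#0 2", "#0 3"]
print(duplicate_symbols(words_txt))
```

On a real table the lines could be read with `open("words.txt", encoding="utf-8")`; an empty result means every symbol is unique.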
 
 
2. experiment 2
 
  2.1 pre-work:
 
    2.1.1 build jsgf file
 
      extract an address list from the corpus, sort and count the addresses, and uniformly sample 490 addresses

      from those that appear no more than 10 times in the corpus; finally add 10 addresses that do not

      appear in the corpus.
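The selection of jsgf entries can be sketched as below. This is an assumption-laden illustration: "uniformly sample" is read here as uniform random sampling without replacement (equal-interval sampling over the sorted list would be another reading), and the function name `build_address_list`, its parameters, and the fixed seed are all hypothetical.

```python
import random
from collections import Counter


def build_address_list(corpus_addresses, unseen_addresses,
                       n_rare=490, max_count=10, seed=0):
    """Pick n_rare rare corpus addresses and append the unseen ones.

    corpus_addresses: one entry per address occurrence in the corpus.
    unseen_addresses: addresses that never occur in the corpus.
    """
    counts = Counter(corpus_addresses)
    # Candidates: addresses seen no more than max_count times.
    rare = sorted(a for a, c in counts.items() if c <= max_count)
    rng = random.Random(seed)  # fixed seed so the list is reproducible
    sampled = rng.sample(rare, min(n_rare, len(rare)))
    return sampled + list(unseen_addresses)


occurrences = ["北京"] * 11 + ["安定门"] * 10 + ["石门县"]
print(build_address_list(occurrences, ["明斯克"], n_rare=2))
```

In the toy call, "北京" occurs 11 times and is excluded, so the result holds the two rare addresses plus the unseen one.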
 
 
      some samples of the 490 addresses:
 
        黑龙江省、宿迁市、安定门、吉林省 吉林市、芙蓉 西街、南三环 中路、朝阳 北路 大悦城、石门县
 
      some samples of the 10 addresses:
 
        上海市 浦东新区 陆家嘴、布鲁塞尔、阿姆斯特丹、圣马力诺、北京市 海淀区 清华大学、明斯克、摩纳哥
 
 
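    2.1.2 construct a new test set named "test_address_tag", composed as follows:

      the 120 texts in the test set contain three kinds of addresses:

        addresses that are frequent in the training corpus (more than 10 occurrences) and are not in the jsgf (30 texts, sampled at equal intervals by the address's occurrence count in the training corpus)

        jsgf addresses of the first kind: fewer than 10 occurrences in the training corpus (40 texts, sampled at equal intervals by the address's occurrence count in the training corpus)

        jsgf addresses of the second kind: never seen in the training corpus (50 texts, 5 test samples per address)

      each of the 120 texts is recorded twice (by different speakers), giving 240 audio files in total; 12 speakers recorded 20 texts each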
 
 
  2.2 baseline
 
    corpus:BJYD2.txt, gxdx500h.txt, huawei_126h.txt, huawei_new.txt
 
    am: mdl_1400
 
    test set: test_address_tag
 
    result:
 
      %WER 20.66 [ 848 / 4104, 189 ins, 354 del, 305 sub ]

      %SER 73.33 [ 176 / 240 ]
 
 
  2.3 address tag
 
    corpus: BJYD2.txt, gxdx500h.txt, huawei_126h.txt, huawei_new.txt, with address tags added to the corpus
 
    am: mdl_1400
 
    test set: test_address_tag
 
    weight: 1
 
      %WER 15.98 [ 656 / 4104, 169 ins, 291 del, 196 sub ]
 
      %SER 69.17 [ 166 / 240 ]
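To compare 2.2 and 2.3 directly, the relative error reduction can be computed from the error counts reported in the two result blocks. The helper `relative_reduction` is only an illustration added here; it is not part of the experiment scripts.

```python
def relative_reduction(baseline_errors, new_errors):
    """Relative error reduction in percent, from raw error counts."""
    return 100.0 * (baseline_errors - new_errors) / baseline_errors


# Word errors: 848 (baseline) vs 656 (address tag, weight 1), out of 4104 words.
print("word errors:     %.2f%% fewer" % relative_reduction(848, 656))
# Sentence errors: 176 vs 166, out of 240 utterances.
print("sentence errors: %.2f%% fewer" % relative_reduction(176, 166))
```

On these counts the address-tag model makes about 22.6% fewer word errors and about 5.7% fewer sentence errors than the baseline on test_address_tag.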
 

Latest revision as of 14:51, 23 November 2014

Accomplished this week

  • build a new jsgf file
  • construct a test set for address tag language model
  • conduct a new experiment, result is as below

Planned for next week

  • check the relation between the merge weight and the size of the dict.
  • short terms should be penalized.
  • write a summary of tag-lm.
  • read some papers about knowledge vectors.