- Data resource
- Uyghur: 250h seed speech data ready, 10k sentences for morpheme learning ready (XJU)
- Kazak: 300h seed speech data ready, 5k sentences for morpheme learning ready (XJU)
- Kirgiz: no speech data yet; 500k text sentences collected (XJU)
- Tibetan: seed speech data from 42 speakers; lexicon with 50k words; 50M text + 40M new blog data collected (NMU)
- Mongolian: lexicon with 30k words; 50M text collected; text sentences for recording the seed speech dataset are under preparation (NMU)
- Technical progress
- Multilingual decoding is done. Performance is good and better than that of the single-language systems, although Uyghur and Kazak are often confused with each other. (THU)
- Zero-resource ASR is in progress: structure & knowledge transfer + learning with unlabelled data (THU)
- Resource collection
- Seed data for Kirgiz and Mongolian should be collected quickly. Collection should be finished before August 1st.
- Body data should be collected as soon as possible. Shiying will release a recording app and a checking platform for the collection. This should be done before Just 1st.
- Resource centralization
- A key problem is that the resources have not been well managed. We should put all light resources (lexicon, transcription, recipe, tools) on github, and keep heavy resources (speech data, text data) on disk but accessible by URL. All resources should be indexed from the wiki.
- State-of-the-art recipe
- The research has not been put on a unified baseline. We should set up baseline systems for the 5 languages, so that individual research has a good reference.
- We also need to put the multilingual ASR system onto github, so that everyone can start their research from the state of the art.
- Tang Zhiyuan will be responsible for the above task, and Shiying will be the main researcher (done before June 1st).
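
The resource centralization plan above (light resources on github, heavy resources on disk behind a URL, everything indexed from the wiki) could be tracked with a small manifest. The following is only a minimal sketch: the `Resource` class, the `wiki_index` helper, and all URLs are hypothetical placeholders, not existing project tools or locations.

```python
from dataclasses import dataclass

# Split taken from the notes: light resources go on github,
# heavy resources stay on disk and are reached by URL.
LIGHT_KINDS = {"lexicon", "transcription", "recipe", "tools"}
HEAVY_KINDS = {"speech", "text"}


@dataclass(frozen=True)
class Resource:
    """One indexed resource (hypothetical structure, for illustration only)."""
    language: str
    kind: str
    location: str  # github URL for light resources, data URL for heavy ones

    @property
    def weight(self) -> str:
        if self.kind in LIGHT_KINDS:
            return "light"
        if self.kind in HEAVY_KINDS:
            return "heavy"
        raise ValueError(f"unknown resource kind: {self.kind}")


def wiki_index(resources):
    """Render a bullet list, grouped by language, that could be pasted into the wiki."""
    lines = []
    for lang in sorted({r.language for r in resources}):
        lines.append(f"- {lang}")
        per_lang = sorted((r for r in resources if r.language == lang),
                          key=lambda r: r.kind)
        for r in per_lang:
            lines.append(f"  - {r.kind} ({r.weight}): {r.location}")
    return "\n".join(lines)


# Placeholder entries; the URLs below are made up and do not point to real data.
EXAMPLE = [
    Resource("Uyghur", "speech", "https://data.example.org/uyghur/speech"),
    Resource("Uyghur", "lexicon", "https://github.com/example-org/example-repo"),
]

if __name__ == "__main__":
    print(wiki_index(EXAMPLE))
```

Keeping the light/heavy split in one place means a new resource kind has to be classified explicitly before it can be indexed, which makes it harder for unmanaged resources to slip in.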