MagicHub, an open-source community for AI, has released a 180-hour conversational speech dataset in Mandarin for free, enriching the open-source speech corpus and promoting the development of spoken language processing technology and conversational AI.
Data Profile
MagicData-RAMC is a collection of high-quality, richly annotated training data comprising 351 sets of multi-turn Mandarin conversations, recorded indoors on smartphones, with a total duration of 180 hours.
To reflect real-world conversation scenarios as closely as possible, the collection process for MagicData-RAMC ensured a balanced gender and geographic distribution, as well as a diversity of topics. There are 663 speakers in total in MagicData-RAMC: 368 male and 295 female, with 334 from the north and 329 from the south.
The annotations for each conversation include the transcribed text, voice activity timestamps, speaker information, recording information, and topic information. The speaker information covers gender, age, and geography, and the recording information covers environment and device.
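For readers who plan to work with the annotations, the fields listed above could be modeled roughly as follows. This is only an illustrative sketch: the class names, field names, types, and sample values are assumptions made for this example and do not reflect the actual file format or schema of the MagicData-RAMC release.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Speaker:
    # Hypothetical field names; the article only says gender, age, geography.
    speaker_id: str
    gender: str
    age: int          # assumed to be an integer; the release may differ
    region: str


@dataclass
class Segment:
    # One voice-activity span with its transcript (times in seconds, assumed).
    start: float
    end: float
    speaker_id: str
    text: str


@dataclass
class Conversation:
    conversation_id: str
    topic: str
    environment: str  # recording information: environment
    device: str       # recording information: device
    speakers: List[Speaker] = field(default_factory=list)
    segments: List[Segment] = field(default_factory=list)

    def speech_seconds(self) -> float:
        """Total duration covered by the voice-activity segments."""
        return sum(seg.end - seg.start for seg in self.segments)


# Made-up sample values, for illustration only.
conv = Conversation(
    conversation_id="demo-001",
    topic="travel",
    environment="indoor",
    device="smartphone",
    speakers=[
        Speaker("spk-a", "female", 30, "north"),
        Speaker("spk-b", "male", 41, "south"),
    ],
    segments=[
        Segment(0.0, 2.5, "spk-a", "你好"),
        Segment(3.0, 4.5, "spk-b", "你好，最近怎么样"),
    ],
)
print(conv.speech_seconds())
```

Keeping speakers and segments as separate lists linked by a speaker identifier mirrors the way the article separates speaker information from the timestamped transcript; a real loader would need to follow whatever layout the downloaded files actually use.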
MagicData-RAMC is currently available for download at
Research Based on MagicData-RAMC
Magic Data, together with the Institute of Acoustics of the Chinese Academy of Sciences, Shanghai Jiao Tong University, and Northwestern Polytechnical University, has completed research on speech recognition, speaker diarization, and keyword search based on MagicData-RAMC. The work has been submitted to Interspeech 2022, a leading conference in the field of speech.
A preprint is available on arXiv.
Challenge and Baseline
Together with the Institute of Acoustics of the Chinese Academy of Sciences and Jiangsu Normal University, Magic Data held the Magic Data ASR-SD Challenge from July to October 2021 to evaluate MagicData-RAMC.