Open-Source MagicData-RAMC: 180-Hour Conversational Speech Dataset in Mandarin Released

MagicHub, an open-source community for AI, has released a 180-hour Mandarin conversational speech dataset free of charge, adding to the available open-source speech corpora and supporting the development of spoken language processing technology and conversational AI.

Data Profile

MagicData-RAMC is a collection of high-quality, richly annotated training data comprising 351 multi-turn Mandarin conversations, recorded indoors on smartphones, with a total duration of 180 hours.

To reflect real-world conversation scenarios as closely as possible, the collection process for MagicData-RAMC ensured a balanced gender and geographic distribution as well as a diversity of topics. MagicData-RAMC has 663 speakers in total: 368 male and 295 female, with 334 from the north and 329 from the south.

The annotations for each conversation include the transcribed text, voice activity timestamps, speaker information, recording information, and topic information. The speaker information covers gender, age, and region; the recording information covers environment and device.
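As a rough illustration of how the annotation fields listed above might be held in memory, here is a minimal Python sketch. All class names, field names, and sample values are hypothetical; this is not the dataset's actual file format, which should be taken from the official documentation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Speaker:
    # Hypothetical fields mirroring the speaker information described above.
    speaker_id: str
    gender: str
    age: int
    region: str


@dataclass
class Segment:
    # One voice-activity span with its transcription (times in seconds).
    start: float
    end: float
    speaker_id: str
    text: str


@dataclass
class Conversation:
    # Topic plus recording information (environment and device).
    topic: str
    environment: str
    device: str
    speakers: List[Speaker]
    segments: List[Segment] = field(default_factory=list)

    def speech_seconds(self) -> float:
        """Total annotated speech time, summed over all segments."""
        return sum(seg.end - seg.start for seg in self.segments)


# Made-up sample values, for illustration only.
example = Conversation(
    topic="travel",
    environment="indoor",
    device="smartphone",
    speakers=[
        Speaker("S1", "male", 30, "north"),
        Speaker("S2", "female", 28, "south"),
    ],
    segments=[
        Segment(0.0, 2.5, "S1", "你好"),
        Segment(2.5, 4.0, "S2", "你好"),
    ],
)
print(example.speech_seconds())
```

Keeping speakers and segments as separate records lets per-speaker metadata be stored once per conversation rather than repeated on every segment.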

MagicData-RAMC is currently available for download at

Research Based on MagicData-RAMC

Magic Data, together with the Institute of Acoustics of the Chinese Academy of Sciences, Shanghai Jiao Tong University, and Northwestern Polytechnical University, has completed research on speech recognition, speaker diarization, and keyword search based on MagicData-RAMC. The work has been submitted to Interspeech 2022, a top conference in the speech field.

A preprint is available on arXiv.

Challenge and Baseline

Together with the Institute of Acoustics of the Chinese Academy of Sciences and Jiangsu Normal University, Magic Data held the Magic Data ASR-SD Challenge from July to October 2021 to evaluate MagicData-RAMC.