Spoken English doc

9/17/2023

Abstract

Podcasts are a large and growing repository of spoken audio. As an audio format, podcasts are more varied in style and production type than broadcast news, contain more genres than typically studied in video data, and are more varied in style and format than previous corpora of conversations. When transcribed with automatic speech recognition, they represent a noisy but fascinating collection of documents that can be studied through the lens of natural language processing, information retrieval, and linguistics. Paired with the audio files, they are also a resource for speech processing and the study of paralinguistic, sociolinguistic, and acoustic aspects of the domain. We introduce the Spotify Podcast Dataset, a new corpus of 100,000 podcasts. This corpus is orders of magnitude larger than previous speech corpora used for search and summarization. We demonstrate the complexity of the domain with a case study of two tasks: (1) passage search and (2) summarization. Our results show that the size and variability of this corpus open up new avenues for research.
