“Audio diarization in the cocktail party scenario”
Improving the performance of audio diarization in the presence of high levels of noise, distortion, reverberation and other forms of speech corruption.
Statement of the general topic area
Audio diarization is the process of annotating an input audio channel with information that attributes (possibly overlapping) temporal regions of signal energy to their specific sources.
Diarization identifies, for example, whether speech or non-speech is present and, for non-speech, its type, such as silence, noise or music. It might also identify the gender of the speaker, the style and structure of the content, and when a change in speaker occurs.
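To make the definition concrete, the output of a diarization system can be pictured as a list of labelled time segments, some of which may overlap. The following is only an illustrative sketch in Python: the `Segment` type, the `overlaps` helper, the source labels and the timings are all hypothetical and are not taken from any existing system or data set.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Segment:
    """One annotated region of the audio channel (hypothetical structure)."""
    start: float  # start time in seconds
    end: float    # end time in seconds
    source: str   # illustrative label, e.g. "speaker_1", "music", "noise"


def overlaps(segments: List[Segment]) -> List[Tuple[Segment, Segment]]:
    """Return pairs of segments from different sources that overlap in time."""
    ordered = sorted(segments, key=lambda s: s.start)
    found = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b.start >= a.end:
                # Segments are sorted by start time, so no later
                # segment can overlap with `a` either.
                break
            if a.source != b.source:
                found.append((a, b))
    return found


# Made-up annotation: two speakers talk over each other, then music starts.
annotation = [
    Segment(0.0, 4.2, "speaker_1"),
    Segment(3.5, 7.0, "speaker_2"),
    Segment(7.0, 9.0, "music"),
]

for a, b in overlaps(annotation):
    shared = min(a.end, b.end) - max(a.start, b.start)
    print(f"{a.source} and {b.source} overlap for {shared:.1f} s")
```

In this made-up example only the two speaker segments are reported as overlapping; the music segment begins exactly where the second speaker ends, so it is treated as adjacent rather than overlapping.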
The cost of processing power, storage capacity and network bandwidth continues to fall while access to them increases, allowing large volumes of audio to be amassed, including broadcasts, voice mails, meetings and other “spoken documents.” There is therefore a growing need to apply automatic Human Language Technologies to allow efficient and effective searching, indexing and accessing of these information sources. In addition to the fundamental technology of speech recognition, which extracts the words being spoken, other technologies are needed to extract metadata that provides context and information beyond the words. Audio diarization, the marking and categorising of audio sources within a spoken document, is one such technology.
Diarization can therefore help to make available material that would otherwise be too time-consuming to access, and provide new uses for information that was previously discarded.
This research will seek to build on existing work and improve diarization performance, initially in noisy and clipped multi-speaker environments.