This paper investigates robust privacy-sensitive audio features for speaker diarization in multiparty conversations, i.e., a set of audio features that carry little linguistic information, in both single and multiple distant microphone scenarios. We systematically investigate the Linear Prediction (LP) residual, studying issues such as the prediction order and the choice of representation of the LP residual. Additionally, we explore the combination of the LP residual with subband information from 2.5 kHz to 3.5 kHz and with spectral slope. Next, we propose a supervised framework using a deep neural architecture for deriving privacy-sensitive audio features. We benchmark these approaches against traditional Mel Frequency Cepstral Coefficient (MFCC) features for speaker diarization in both microphone scenarios. Experiments on the RT07 evaluation dataset show that the proposed approaches yield diarization performance close to that of MFCC features on the single distant microphone dataset. To objectively evaluate the notion of privacy in terms of linguistic information, we perform human and automatic speech recognition tests, which show that the proposed privacy-sensitive audio features yield much lower recognition accuracies than MFCC features.
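To make the central quantity concrete, the following is a minimal sketch of how an LP residual can be computed for one frame using the standard autocorrelation method. It is an illustration under stated assumptions, not the authors' implementation: the function names `lp_coefficients` and `lp_residual`, the diagonal loading constant, and the absence of windowing or pre-emphasis are all choices made here for brevity. The `order` argument corresponds to the prediction order that the abstract says is studied.

```python
import numpy as np

def lp_coefficients(frame, order):
    """Estimate LP coefficients with the autocorrelation method (hypothetical helper)."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # Autocorrelation at lags 0..order
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    # Toeplitz matrix built from lags 0..order-1
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    # Assumption: tiny diagonal loading so near-silent frames stay solvable
    R += 1e-9 * (r[0] + 1.0) * np.eye(order)
    # Solve the normal equations R a = [r(1), ..., r(order)]
    return np.linalg.solve(R, r[1:])

def lp_residual(frame, order):
    """Return the frame minus its linear prediction from the previous `order` samples."""
    frame = np.asarray(frame, dtype=float)
    a = lp_coefficients(frame, order)
    pred = np.zeros_like(frame)
    for k in range(order):
        # Contribution of the sample k+1 steps in the past
        pred[k + 1:] += a[k] * frame[:len(frame) - k - 1]
    return frame - pred
```

The prediction removes most of the spectral envelope that the all-pole model captures, leaving the prediction error as the residual. How that residual is then represented as a feature is one of the questions the paper studies and is not covered by this sketch.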

Sree Hari Krishnan Parthasarathi

H. Bourlard

D. Gatica-Perez

IEEE Transactions on Audio Speech and Language Processing

The IEEE Transactions on Audio, Speech and Language Processing covers the sciences, technologies and applications relating to the analysis, coding, enhancement, recognition and synthesis of audio, music, speech and language. In particular, audio processing also covers auditory modeling, acoustic modeling and source separation. Speech processing also covers speech production and perception, adaptation, lexical modeling and speaker recognition. Language processing also covers spoken language understanding, translation, summarization, mining, general language modeling, as well as spoken dialog systems.


Wordless Sounds: Robust Speaker Diarization Using Privacy-Preserving Audio Representations

A supervised framework using a deep neural architecture is proposed for deriving privacy-sensitive audio features for speaker diarization in multiparty conversations. Experiments show that the proposed approaches yield diarization performance close to that of MFCC features on the single distant microphone dataset.


