Abstract

Speaker recognition is a popular topic in biometric authentication, and many deep learning approaches have achieved extraordinary performance. However, it has been shown in both image and speech applications that deep neural networks are vulnerable to adversarial examples. In this study, we exploit this weakness to perform targeted adversarial attacks against an x-vector based speaker recognition system. We propose to generate inaudible adversarial perturbations, based on the psychoacoustic principle of frequency masking, that achieve targeted white-box attacks on the speaker recognition system. Specifically, we constrain the perturbation to lie under the masking threshold of the original audio, instead of using a common l_p norm to measure the perturbation. Experiments on the Aishell-1 corpus show that our approach yields an attack success rate of up to 98.5% against target speakers of arbitrary gender, while remaining indistinguishable to listeners. Furthermore, we also achieve an effective speaker attack when applying the proposed approach to a completely irrelevant waveform, such as music.


Proposed Method

In this study, inspired by the work in [1] and [2], we propose to generate inaudible adversarial perturbations for targeted attacks on speaker recognition directly at the waveform level. We use the x-vector speaker recognition system proposed in [3] as our baseline and conduct targeted white-box attacks against it. To generate the inaudible adversarial perturbations, we adopt the concept of frequency masking [4], in which a faint but audible sound becomes inaudible in the presence of another, louder sound. Our experimental results on the Aishell-1 corpus demonstrate that the inaudible adversarial perturbations achieve better targeted attack performance than previous l_p norm based adversarial examples. To further compare the frequency masking based approach with previous ones, we also evaluate them with both subjective and objective metrics. The results show that the adversarial perturbations generated by the proposed method are less audible, even though they have a larger absolute energy.
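The idea of keeping a perturbation under a masking threshold can be illustrated with a minimal sketch. It is not the implementation used in this work: `toy_masking_threshold` is a hypothetical stand-in (the original spectrum lowered by a fixed offset, with a fixed floor) for a real psychoacoustic model, and the hinge-style `masking_penalty`, the 32-sample frame, and all numeric values are assumptions chosen only for illustration.

```python
import cmath
import math


def power_spectrum_db(frame):
    """Power spectrum in dB of one Hann-windowed frame, for bins 0..N/2."""
    n = len(frame)
    windowed = [
        s * 0.5 * (1.0 - math.cos(2.0 * math.pi * i / n))
        for i, s in enumerate(frame)
    ]
    spectrum_db = []
    for k in range(n // 2 + 1):
        coeff = sum(
            s * cmath.exp(-2j * math.pi * k * i / n)
            for i, s in enumerate(windowed)
        )
        power = abs(coeff / n) ** 2
        # Clamp so that empty bins do not trigger log10(0).
        spectrum_db.append(10.0 * math.log10(max(power, 1e-20)))
    return spectrum_db


def toy_masking_threshold(original_db, offset_db=20.0, floor_db=-90.0):
    """HYPOTHETICAL threshold: original spectrum minus an offset, with a floor.

    A real masking threshold comes from a psychoacoustic model; this is
    only a placeholder so the penalty below has something to compare with.
    """
    return [max(p - offset_db, floor_db) for p in original_db]


def masking_penalty(perturbation_db, threshold_db):
    """Mean hinge penalty: only bins that exceed the threshold contribute."""
    excess = [max(p - t, 0.0) for p, t in zip(perturbation_db, threshold_db)]
    return sum(excess) / len(excess)


def tone(bin_index, amplitude, n=32):
    """Sine wave centred on one DFT bin of an n-sample frame."""
    return [
        amplitude * math.sin(2.0 * math.pi * bin_index * i / n)
        for i in range(n)
    ]


original = tone(4, 1.0)
threshold = toy_masking_threshold(power_spectrum_db(original))

# Same perturbation energy, placed under the loud tone or far away from it.
penalty_masked = masking_penalty(power_spectrum_db(tone(4, 0.01)), threshold)
penalty_exposed = masking_penalty(power_spectrum_db(tone(10, 0.01)), threshold)
```

In this toy setting the perturbation that shares a frequency bin with the loud original tone incurs no penalty, while the same perturbation placed in an otherwise quiet bin is penalized. An l_p norm would treat the two perturbations identically, since they have the same amplitude.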

Figure 1: An overview of the generation of adversarial examples based on frequency masking.

Speech Samples

The adversarial examples are generated from waveforms taken from two databases:

Aishell-1.

The music portion of MUSAN.

We present the adversarial examples generated based on different approaches:

Attack Stage1: l_p norm based approach.

Attack Stage2: frequency masking based approach.
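To make the l_p norm based idea concrete, the following sketch runs a targeted attack under an l_inf bound on a toy model. Everything in it is an assumption for illustration: the two-class linear classifier `weights` stands in for the x-vector system, and the signed-gradient update with clipping is one common way to enforce an l_inf constraint, not necessarily the optimizer used to produce the samples on this page.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_of(weights, x):
    """Scores of a toy linear 'speaker classifier' (one weight row per speaker)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def predict(weights, x):
    """Index of the highest-scoring speaker."""
    scores = logits_of(weights, x)
    return max(range(len(scores)), key=scores.__getitem__)


def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def targeted_linf_attack(weights, x, target, epsilon, step, iterations):
    """Find a perturbation with |delta_j| <= epsilon that pushes x toward `target`."""
    delta = [0.0] * len(x)
    for _ in range(iterations):
        adv = [v + d for v, d in zip(x, delta)]
        probs = softmax(logits_of(weights, adv))
        # Gradient of the cross-entropy toward `target` w.r.t. the input:
        # sum over classes c of (p_c - 1[c == target]) * W[c][j].
        grad = [
            sum(
                (probs[c] - (1.0 if c == target else 0.0)) * weights[c][j]
                for c in range(len(weights))
            )
            for j in range(len(x))
        ]
        # Signed descent step, then clip back into the l_inf ball.
        delta = [
            min(max(d - step * sign(g), -epsilon), epsilon)
            for d, g in zip(delta, grad)
        ]
    return delta


# Hypothetical toy data: two "speakers", three input dimensions.
weights = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
x = [1.0, 0.8, 0.3]
delta = targeted_linf_attack(weights, x, target=1, epsilon=0.2, step=0.05, iterations=10)
adversarial = [v + d for v, d in zip(x, delta)]
```

Here the clean input is assigned to speaker 0, and the bounded perturbation moves the prediction to the chosen target, speaker 1. The bound limits only the amplitude of each element of the perturbation and does not account for how audible it is.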

Female → Female

Sample 1
Original (Original label: S0131, Predicted label: S0131)
Attack Stage1 (Original label: S0131, Predicted label: S0173)
Attack Stage2 (Original label: S0131, Predicted label: S0173)

Sample 2
Original (Original label: S0355, Predicted label: S0355)
Attack Stage1 (Original label: S0355, Predicted label: S0358)
Attack Stage2 (Original label: S0355, Predicted label: S0358)

Sample 3
Original (Original label: S0336, Predicted label: S0336)
Attack Stage1 (Original label: S0336, Predicted label: S0186)
Attack Stage2 (Original label: S0336, Predicted label: S0186)

Female → Male

Sample 1
Original (Original label: S0153, Predicted label: S0153)
Attack Stage1 (Original label: S0153, Predicted label: S0003)
Attack Stage2 (Original label: S0153, Predicted label: S0003)

Sample 2
Original (Original label: S0660, Predicted label: S0660)
Attack Stage1 (Original label: S0660, Predicted label: S0027)
Attack Stage2 (Original label: S0660, Predicted label: S0027)

Sample 3
Original (Original label: S0336, Predicted label: S0336)
Attack Stage1 (Original label: S0336, Predicted label: S0068)
Attack Stage2 (Original label: S0336, Predicted label: S0068)

Male → Male

Sample 1
Original (Original label: S0050, Predicted label: S0050)
Attack Stage1 (Original label: S0050, Predicted label: S0248)
Attack Stage2 (Original label: S0050, Predicted label: S0248)

Sample 2
Original (Original label: S0102, Predicted label: S0102)
Attack Stage1 (Original label: S0102, Predicted label: S0068)
Attack Stage2 (Original label: S0102, Predicted label: S0068)

Sample 3
Original (Original label: S0092, Predicted label: S0092)
Attack Stage1 (Original label: S0092, Predicted label: S0122)
Attack Stage2 (Original label: S0092, Predicted label: S0122)

Male → Female

Sample 1
Original (Original label: S0010, Predicted label: S0010)
Attack Stage1 (Original label: S0010, Predicted label: S0186)
Attack Stage2 (Original label: S0010, Predicted label: S0186)

Sample 2
Original (Original label: S0039, Predicted label: S0039)
Attack Stage1 (Original label: S0039, Predicted label: S0235)
Attack Stage2 (Original label: S0039, Predicted label: S0235)

Sample 3
Original (Original label: S0065, Predicted label: S0065)
Attack Stage1 (Original label: S0065, Predicted label: S0521)
Attack Stage2 (Original label: S0065, Predicted label: S0521)

Music → Male

Sample 1
Original (Original label: -, Predicted label: S0068)
Attack Stage1 (Original label: -, Predicted label: S0027)
Attack Stage2 (Original label: -, Predicted label: S0027)

Sample 2
Original (Original label: -, Predicted label: S0718)
Attack Stage1 (Original label: -, Predicted label: S0027)
Attack Stage2 (Original label: -, Predicted label: S0027)

Sample 3
Original (Original label: -, Predicted label: S0130)
Attack Stage1 (Original label: -, Predicted label: S0027)
Attack Stage2 (Original label: -, Predicted label: S0027)

References

[1] Nicholas Carlini, David Wagner, “Audio Adversarial Examples: Targeted Attacks on Speech-to-Text,” in Proceedings of SPW, 2018, pp. 1–7.

[2] Yao Qin, Nicholas Carlini, Ian Goodfellow, Garrison Cottrell, Colin Raffel, “Imperceptible, Robust, and Targeted Adversarial Examples for Automatic Speech Recognition,” in Proceedings of ICML, 2019, pp. 5231–5240.

[3] David Snyder, Daniel Garcia-Romero, Daniel Povey, Sanjeev Khudanpur, “Deep Neural Network Embeddings for Text-Independent Speaker Verification,” in Proceedings of INTERSPEECH, 2017, pp. 999–1003.

[4] Yiqing Lin, Waleed H. Abdulla, “Principles of Psychoacoustics,” in Audio Watermark, 2015, pp. 15–49.