Voice Activity Detection (VAD) distinguishes speech segments from non-speech content in an audio signal and is an enabling component of many speech applications such as speech coding, transmission, and recognition. In real-world conditions it is challenged by complicated auditory environments: a low signal-to-noise ratio or the variety of acoustic events in a metropolitan soundscape can interfere with VAD. This project was proposed to separate the presence of speech from the urban sound environment.
To cope with these distractions, the problem can be recast as urban Sound Event Detection (SED), which is usually a multiclass classification task. Ours is simpler because it is binary (speech vs. non-speech), so it can be handled with off-the-shelf machine learning frameworks.
Full write-up here: a tiny first step, but a momentous one.
- (Optional) If no dataset is available, the following datasets and tools can be used to synthesize urban soundscape data:
- UrbanSound8K by @justinsalamon, which contains 8732 labeled sound excerpts (<=4 s) of urban sounds from 10 classes: air_conditioner, car_horn, children_playing, dog_bark, drilling, engine_idling, gun_shot, jackhammer, siren, and street_music.
- Mozilla Common Voice, a crowdsourcing project started by Mozilla to create a free database for speech recognition software.
- scaper by @justinsalamon, a library for soundscape synthesis and augmentation.
- UrbanSound-SED by @justinsalamon, an example dataset of 10,000 synthetic soundscapes with sound event annotations generated using scaper.
- SONYC Urban Sound Tagging (SONYC-UST), a dataset for the development and evaluation of machine listening systems for realistic urban noise monitoring.
- Synthetic Data Generation:
- scaper, a Python library for soundscape synthesis and augmentation.
- Front-end Processing & Viterbi Decoding:
- librosa, a Python package for music and audio analysis.
- Model Selection:
- sklearn, a Python module for machine learning built on top of SciPy and distributed under the 3-Clause BSD license.
- Data Augmentation:
- muda, a library for Musical Data Augmentation.
This project consists of five stages:

0. Soundscape generation
1. Feature extraction
2. Model selection
3. Data augmentation
4. Decision smoothing
Step 0 was performed because no off-the-shelf dataset fit this task: UrbanSound8K and Mozilla Common Voice were combined to create our synthetic urban soundscape data, roughly as sketched below.
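This is a minimal sketch of how such soundscapes could be generated with scaper; the directory layout, clip duration, and SNR range below are illustrative assumptions rather than the exact settings used in this project.

```python
import scaper

# Illustrative layout: speech clips (Common Voice) as foreground events,
# UrbanSound8K clips as backgrounds, each sorted into label subfolders.
FG_PATH = "data/foreground"   # e.g. data/foreground/speech/*.wav
BG_PATH = "data/background"   # e.g. data/background/street_music/*.wav

for i in range(10):  # number of soundscapes to synthesize
    sc = scaper.Scaper(duration=10.0, fg_path=FG_PATH, bg_path=BG_PATH)
    sc.ref_db = -20  # reference loudness against which SNR is measured

    # Urban background chosen at random from the available files.
    sc.add_background(label=("choose", []),
                      source_file=("choose", []),
                      source_time=("const", 0))

    # One speech event with a randomized onset, duration, and SNR.
    sc.add_event(label=("const", "speech"),
                 source_file=("choose", []),
                 source_time=("const", 0),
                 event_time=("uniform", 0, 8),
                 event_duration=("uniform", 1, 4),
                 snr=("uniform", -5, 20),
                 pitch_shift=None,
                 time_stretch=None)

    # Write the mixed audio plus a JAMS file carrying the speech annotations.
    sc.generate(f"soundscape_{i}.wav", f"soundscape_{i}.jams")
```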
Step 1 is the front-end processing that turns raw audio into a concise but informative representation. In this project, Per-Channel Energy Normalization (PCEN) (Wang et al., 2017; Lostanlen et al., 2018) was tested and gave promising results, even better than Mel-Frequency Cepstral Coefficients (MFCCs), especially at low signal-to-noise ratio (SNR).
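A minimal sketch of this front end with librosa is shown below; the mel and PCEN parameters are default or illustrative values, not necessarily the ones used in this project.

```python
import librosa

def extract_pcen(path, sr=22050, n_mels=128, hop_length=512):
    """Return PCEN-normalized mel features of shape (n_mels, n_frames)."""
    y, sr = librosa.load(path, sr=sr)
    # Mel magnitude spectrogram (power=1), since PCEN operates on magnitudes.
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                       hop_length=hop_length, power=1.0)
    # Per-Channel Energy Normalization; the 2**31 scaling mimics the
    # fixed-point input range assumed by librosa's default parameters.
    return librosa.pcen(S * (2**31), sr=sr, hop_length=hop_length)

def extract_mfcc(path, sr=22050, n_mfcc=20):
    """Baseline MFCC features for comparison."""
    y, sr = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
```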
Step 2 is the procedure of selecting a statistical model. Since a Random Forest classifier had already been chosen, this step only covers metric selection and hyperparameter optimization through grid-search cross-validation, roughly as sketched below.
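A minimal sketch of that search with scikit-learn, using balanced accuracy as the scoring metric; the placeholder data and parameter grid are illustrative assumptions, not the grid actually evaluated.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Placeholder data; in practice X holds frame-level PCEN/MFCC features
# and y the binary speech (1) / non-speech (0) labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = rng.integers(0, 2, size=1000)

param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 10, 30],
    "min_samples_leaf": [1, 5, 10],
}

search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_grid,
    scoring="balanced_accuracy",  # same metric used to compare PCEN vs. MFCC
    cv=5,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```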
A comparison of the two feature extraction algorithms was conducted; the balanced-accuracy results suggest PCEN is the more promising choice for speech at lower signal-to-noise ratios:
Step 3 improves the robustness of the model by fitting it to additionally degraded data while evaluating on unmodified examples. For audio, deformations such as pitch shifting, time stretching, colored noise, dynamic range compression, and impulse-response (IR) convolution are effective, and open-source implementations are available in muda. The deformations and their parameter settings are listed below, followed by a sketch of the augmentation pipeline.
| Deformation | Parameter Setting |
|---|---|
| Random Pitch Shifting | Pitch ∼ N(μ = 0, σ² = 1) |
| Random Time Stretching | ln(Rate) ∼ N(μ = 0, σ = 0.3) |
| Colored Noise | Brownian noise, weight ∈ [0.1, 0.9] |
| Dynamic Range Compression | Dolby E standard: speech |
| IR Convolution | Isophonics Room Impulse Response dataset: Great Hall sample IR |
(Please ignore the IR convolution result for now; it is affected by a distribution gap in the deformed training set, which will be fixed and updated later.)
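The following is a minimal sketch of such a pipeline with muda, covering only the random pitch-shift and time-stretch deformations from the table above; the file names and the number of samples per deformation are illustrative assumptions.

```python
import muda

# Load a training clip together with its JAMS annotation.
jam = muda.load_jam_audio("train_0001.jams", "train_0001.wav")

# Random pitch shift: semitones drawn from N(mean=0, sigma=1).
pitch = muda.deformers.RandomPitchShift(n_samples=3, mean=0.0, sigma=1.0)
# Random time stretch: log-rate drawn from N(location=0, scale=0.3).
stretch = muda.deformers.RandomTimeStretch(n_samples=3, location=0.0, scale=0.3)

pipeline = muda.Pipeline(steps=[("pitch", pitch), ("stretch", stretch)])

# Each output jam carries the deformed audio plus a record of the deformation,
# so the degraded training set stays traceable to its source clips.
for i, out_jam in enumerate(pipeline.transform(jam)):
    muda.save(f"train_0001_aug{i:02d}.wav", f"train_0001_aug{i:02d}.jams", out_jam)
```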
Step 4 models the occurrence of speech and non-speech with a Hidden Markov Model and finds the most likely state sequence through Viterbi decoding, in order to smooth the jumpy frame-level decisions of the classifier. The Viterbi path was computed as sketched below.
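In the standard formulation, the smoothed output is roughly the state sequence maximizing ∏ₜ p(sₜ | sₜ₋₁) · p(sₜ | xₜ), where p(sₜ | xₜ) are the classifier's frame posteriors and p(sₜ | sₜ₋₁) the HMM transition probabilities. Below is a minimal sketch using librosa's discriminative Viterbi decoder; the 0.99 self-transition probability is an illustrative assumption rather than a tuned value.

```python
import numpy as np
import librosa

def smooth_decisions(speech_posteriors, self_loop=0.99):
    """Viterbi-smooth frame-wise P(speech | frame) into binary decisions."""
    # Two-state transition matrix: staying in the current state is far more
    # likely than switching, which suppresses short, spurious label flips.
    transition = librosa.sequence.transition_loop(2, self_loop)

    # Stack into shape (n_states, n_frames): row 0 = non-speech, row 1 = speech.
    prob = np.vstack([1.0 - speech_posteriors, speech_posteriors])

    # Most likely state sequence given the classifier posteriors.
    return librosa.sequence.viterbi_discriminative(prob, transition)

# Example: posteriors would come from classifier.predict_proba(X)[:, 1].
posteriors = np.array([0.1, 0.2, 0.9, 0.4, 0.95, 0.9, 0.1])
print(smooth_decisions(posteriors))  # 1 = speech, 0 = non-speech
```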
Finally, the robustness improvement contributed by each stage can be seen below: