Voice Activity Detection (VAD) distinguishes speech segments from non-speech content in an audio signal and is an enabling component of many speech applications such as speech coding, transmission, and recognition. In real-world conditions it is challenged by complicated auditory environments: a low signal-to-noise ratio or the variety of acoustic events in a metropolitan soundscape can interfere with VAD. This project was proposed to separate the presence of speech from the urban sound environment.
To cope with these distractions, the problem can be recast as urban Sound Event Detection (SED), which is usually a multiclass classification task. Ours is simpler because it is binary (speech vs. non-speech), so it can be handled with off-the-shelf machine learning frameworks.
Full write-up here: a tiny first step, but a momentous one.
- (Optional) If no dataset is available, the following datasets and tools can be used to synthesize urban soundscape data:
- UrbanSound8K by @justinsalamon, which contains 8732 labeled sound excerpts (<=4 s) of urban sounds from 10 classes: air_conditioner, car_horn, children_playing, dog_bark, drilling, engine_idling, gun_shot, jackhammer, siren, and street_music.
- Mozilla Common Voice, a crowdsourcing project started by Mozilla to create a free database for speech recognition software.
- scaper by @justinsalamon, a library for soundscape synthesis and augmentation.
- UrbanSound-SED by @justinsalamon, an example dataset of 10,000 synthetic soundscapes with sound event annotations generated using scaper.
- SONYC Urban Sound Tagging (SONYC-UST), a dataset for the development and evaluation of machine listening systems for realistic urban noise monitoring.
- Synthetic Data Generation:
- scaper, a Python library for soundscape synthesis and augmentation.
- Front-end Processing & Viterbi Decoding:
- librosa, a Python package for music and audio analysis.
- Model Selection:
- sklearn, a Python module for machine learning built on top of SciPy and distributed under the 3-Clause BSD license.
- Data Augmentation:
- muda, a library for Musical Data Augmentation.
This project consists of five stages:

0. Soundscape generation
1. Feature extraction
2. Model selection
3. Data augmentation
4. Decision smoothing
Step 0 was performed because no off-the-shelf dataset fit this task: UrbanSound8K and Mozilla Common Voice were combined to create our synthetic urban soundscape data, roughly as sketched below.
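This is a minimal sketch of how such soundscapes could be generated with scaper; the directory layout, clip duration, and SNR range below are illustrative assumptions rather than the exact settings used in this project.

```python
import scaper

# Illustrative layout: speech clips (Common Voice) as foreground events,
# UrbanSound8K clips as backgrounds, each sorted into label subfolders.
FG_PATH = "data/foreground"   # e.g. data/foreground/speech/*.wav
BG_PATH = "data/background"   # e.g. data/background/street_music/*.wav

for i in range(10):  # number of soundscapes to synthesize
    sc = scaper.Scaper(duration=10.0, fg_path=FG_PATH, bg_path=BG_PATH)
    sc.ref_db = -20  # reference loudness against which SNR is measured

    # Urban background chosen at random from the available files.
    sc.add_background(label=("choose", []),
                      source_file=("choose", []),
                      source_time=("const", 0))

    # One speech event with a randomized onset, duration, and SNR.
    sc.add_event(label=("const", "speech"),
                 source_file=("choose", []),
                 source_time=("const", 0),
                 event_time=("uniform", 0, 8),
                 event_duration=("uniform", 1, 4),
                 snr=("uniform", -5, 20),
                 pitch_shift=None,
                 time_stretch=None)

    # Write the mixed audio plus a JAMS file carrying the speech annotations.
    sc.generate(f"soundscape_{i}.wav", f"soundscape_{i}.jams")
```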
Step 1 is the front-end processing that turns raw audio into a concise but informative representation. In this project, Per-Channel Energy Normalization (PCEN) (Wang et al., 2017; Lostanlen et al., 2018) was tested and gave promising results, even better than Mel-Frequency Cepstral Coefficients (MFCCs), especially at low signal-to-noise ratio (SNR).
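A minimal sketch of this front end with librosa is shown below; the mel and PCEN parameters are default or illustrative values, not necessarily the ones used in this project.

```python
import librosa

def extract_pcen(path, sr=22050, n_mels=128, hop_length=512):
    """Return PCEN-normalized mel features of shape (n_mels, n_frames)."""
    y, sr = librosa.load(path, sr=sr)
    # Mel magnitude spectrogram (power=1), since PCEN operates on magnitudes.
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                       hop_length=hop_length, power=1.0)
    # Per-Channel Energy Normalization; the 2**31 scaling mimics the
    # fixed-point input range assumed by librosa's default parameters.
    return librosa.pcen(S * (2**31), sr=sr, hop_length=hop_length)

def extract_mfcc(path, sr=22050, n_mfcc=20):
    """Baseline MFCC features for comparison."""
    y, sr = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
```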
Step 2 is the procedure of selecting a statistical model. Since a Random Forest classifier had already been chosen, this step only covers metric selection and hyperparameter optimization through grid-search cross-validation, roughly as sketched below.
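A minimal sketch of that search with scikit-learn, using balanced accuracy as the scoring metric; the placeholder data and parameter grid are illustrative assumptions, not the grid actually evaluated.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Placeholder data; in practice X holds frame-level PCEN/MFCC features
# and y the binary speech (1) / non-speech (0) labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = rng.integers(0, 2, size=1000)

param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 10, 30],
    "min_samples_leaf": [1, 5, 10],
}

search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_grid,
    scoring="balanced_accuracy",  # same metric used to compare PCEN vs. MFCC
    cv=5,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```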
A comparison of the two feature extraction algorithms was conducted; the balanced-accuracy results suggest PCEN is the more promising choice for speech at lower signal-to-noise ratios:
Step 3 improves the robustness of the model by fitting it to additionally degraded data while evaluating on unmodified examples. For audio, deformations such as pitch shifting, time stretching, colored noise, dynamic range compression, and impulse-response (IR) convolution are effective, and open-source implementations are available in muda. The deformations and their parameter settings are listed below, followed by a sketch of the augmentation pipeline.
| Deformation | Parameter Setting |
|---|---|
| Random Pitch Shifting | Pitch ∼ N(μ = 0, σ² = 1) |
| Random Time Stretching | ln(Rate) ∼ N(μ = 0, σ = 0.3) |
| Colored Noise | Brownian noise, weight ∈ [0.1, 0.9] |
| Dynamic Range Compression | Dolby E standard: speech |
| IR Convolution | Isophonics Room Impulse Response dataset: Great Hall sample IR |
(Please ignore the IR convolution result for now; it is affected by a distribution gap in the deformed training set, which will be fixed and updated later.)
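The following is a minimal sketch of such a pipeline with muda, covering only the random pitch-shift and time-stretch deformations from the table above; the file names and the number of samples per deformation are illustrative assumptions.

```python
import muda

# Load a training clip together with its JAMS annotation.
jam = muda.load_jam_audio("train_0001.jams", "train_0001.wav")

# Random pitch shift: semitones drawn from N(mean=0, sigma=1).
pitch = muda.deformers.RandomPitchShift(n_samples=3, mean=0.0, sigma=1.0)
# Random time stretch: log-rate drawn from N(location=0, scale=0.3).
stretch = muda.deformers.RandomTimeStretch(n_samples=3, location=0.0, scale=0.3)

pipeline = muda.Pipeline(steps=[("pitch", pitch), ("stretch", stretch)])

# Each output jam carries the deformed audio plus a record of the deformation,
# so the degraded training set stays traceable to its source clips.
for i, out_jam in enumerate(pipeline.transform(jam)):
    muda.save(f"train_0001_aug{i:02d}.wav", f"train_0001_aug{i:02d}.jams", out_jam)
```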
Step 4 models the occurrence of speech and non-speech with a Hidden Markov Model and finds the most likely state sequence through Viterbi decoding, in order to smooth the jumpy frame-level decisions of the classifier. The Viterbi path was computed as sketched below.
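In the standard formulation, the smoothed output is roughly the state sequence maximizing ∏ₜ p(sₜ | sₜ₋₁) · p(sₜ | xₜ), where p(sₜ | xₜ) are the classifier's frame posteriors and p(sₜ | sₜ₋₁) the HMM transition probabilities. Below is a minimal sketch using librosa's discriminative Viterbi decoder; the 0.99 self-transition probability is an illustrative assumption rather than a tuned value.

```python
import numpy as np
import librosa

def smooth_decisions(speech_posteriors, self_loop=0.99):
    """Viterbi-smooth frame-wise P(speech | frame) into binary decisions."""
    # Two-state transition matrix: staying in the current state is far more
    # likely than switching, which suppresses short, spurious label flips.
    transition = librosa.sequence.transition_loop(2, self_loop)

    # Stack into shape (n_states, n_frames): row 0 = non-speech, row 1 = speech.
    prob = np.vstack([1.0 - speech_posteriors, speech_posteriors])

    # Most likely state sequence given the classifier posteriors.
    return librosa.sequence.viterbi_discriminative(prob, transition)

# Example: posteriors would come from classifier.predict_proba(X)[:, 1].
posteriors = np.array([0.1, 0.2, 0.9, 0.4, 0.95, 0.9, 0.1])
print(smooth_decisions(posteriors))  # 1 = speech, 0 = non-speech
```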
Finally, the robustness improvement contributed by each stage can be seen below: