
RandomForestVoiceActivityDetector

Voice Activity Detection (VAD) distinguishes speech segments from non-speech content in audio, and it underpins many speech-related applications such as speech coding, transmission, and recognition. This enabling function is challenged by complicated real-world auditory environments: low signal-to-noise ratios and the varied acoustic events of a metropolitan soundscape interfere with VAD. This project was proposed to separate the presence of speech from the urban sound environment.

To deal with these distractions, we can translate the problem into urban Sound Event Detection (SED), a common multiclass classification task. Ours, however, is simpler, since it is a binary classification (speech vs. non-speech). As a result, we can handle it with off-the-shelf machine learning frameworks.

The full write-up is available here: a tiny first step, but a momentous one.

Table of Contents

  • Dataset
  • Install
  • Overview

Dataset

  • (Optional) If no dataset is available, the following datasets and tools can be used to synthesize urban soundscape data:
    • UrbanSound8K by @justinsalamon, which contains 8732 labeled sound excerpts (<= 4 s) of urban sounds from 10 classes: air_conditioner, car_horn, children_playing, dog_bark, drilling, engine_idling, gun_shot, jackhammer, siren, and street_music.
    • Mozilla Common Voice, a crowdsourcing project started by Mozilla to create a free database for speech recognition software.
    • scaper by @justinsalamon, a library for soundscape synthesis and augmentation.
    • UrbanSound-SED by @justinsalamon, an example dataset of 10,000 synthetic soundscapes with sound event annotations generated using scaper.
    • SONYC Urban Sound Tagging (SONYC-UST), a dataset for the development and evaluation of machine listening systems for realistic urban noise monitoring.

Install

  • Synthetic data generation:
    • scaper, a Python library for soundscape synthesis and augmentation.
  • Frontend processing & Viterbi decoding:
    • librosa, a Python package for music and audio analysis.
  • Model selection:
    • scikit-learn (sklearn), a Python module for machine learning built on top of SciPy and distributed under the 3-Clause BSD license.
  • Data augmentation:
    • muda, a library for Musical Data Augmentation.
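All four packages are published on PyPI, so in most environments a plain `pip install scaper librosa scikit-learn muda` should suffice.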

Overview

This project consists of five stages:

  0. Soundscape generation
  1. Feature extraction
  2. Model selection
  3. Data augmentation
  4. Decision smoothing

Step 0 was performed because no off-the-shelf dataset was available: UrbanSound8K and Mozilla Common Voice were combined to create our synthetic urban soundscape data, as sketched below.
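As an illustrative sketch (the directory layout, durations, and distribution parameters are assumptions, not the project's actual settings), annotated soundscapes can be generated with scaper along these lines:

```python
import scaper

# Assumed layout: foreground/speech/ holds speech clips (e.g. Common Voice),
# background/ holds urban noise (e.g. UrbanSound8K); both paths are hypothetical.
sc = scaper.Scaper(duration=10.0, fg_path='foreground', bg_path='background')
sc.ref_db = -20  # reference loudness for the background

# One background track chosen at random from the background folder
sc.add_background(label=('choose', []),
                  source_file=('choose', []),
                  source_time=('const', 0))

# One speech event with randomized onset, duration, and SNR
sc.add_event(label=('const', 'speech'),
             source_file=('choose', []),
             source_time=('const', 0),
             event_time=('uniform', 0, 6),
             event_duration=('truncnorm', 3.0, 1.0, 0.5, 4.0),
             snr=('uniform', 0, 20),
             pitch_shift=None,
             time_stretch=None)

# Writes the audio plus a JAMS annotation with the speech segment times
sc.generate('soundscape.wav', 'soundscape.jams')
```

The JAMS annotation produced alongside the audio provides the frame-level speech/non-speech labels needed for supervised training.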

Figure: Statistics of the synthetic training set.

Step 1 is the frontend processing that turns raw audio into a concise but informative representation. In this project, Per-Channel Energy Normalization (PCEN) (Wang et al., 2017; Lostanlen et al., 2018) was tested with promising results, outperforming even Mel-Frequency Cepstral Coefficients (MFCCs), especially under low signal-to-noise ratio (SNR) conditions. A sketch of both frontends follows.
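As a minimal sketch (the input file name and the mel/frame parameters are assumptions), both feature sets can be computed per frame with librosa:

```python
import numpy as np
import librosa

# Hypothetical input file; 22.05 kHz mono is librosa's default
y, sr = librosa.load('soundscape.wav', sr=22050)

# Mel spectrogram (magnitude), then per-channel energy normalization.
# The 2**31 rescaling follows the librosa.pcen documentation, which
# expects roughly PCM-scaled input.
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                   hop_length=512, power=1)
pcen = librosa.pcen(S * (2 ** 31), sr=sr, hop_length=512)

# MFCCs for the baseline comparison
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, hop_length=512)

# One feature vector per frame for the classifier: shape (n_frames, n_mels)
X = pcen.T
```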

Step 2 is the procedure of selecting a statistical model. Since we had already decided to use a Random Forest classifier, this step only covers metric selection and hyperparameter optimization via grid-search cross-validation, as in the sketch below.
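A hedged sketch of that tuning step (the grid values, fold count, and placeholder data are assumptions, not the project's tuned settings):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Placeholder data: per-frame feature vectors X and binary labels y
# (1 = speech, 0 = non-speech); in practice these come from Step 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 40))
y = rng.integers(0, 2, size=1000)

param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [None, 10, 30],
    'min_samples_leaf': [1, 5],
}

# Balanced accuracy is the evaluation metric reported below
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid,
                      scoring='balanced_accuracy',
                      cv=5,
                      n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```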

A comparison of the two feature extraction algorithms was conducted; the balanced accuracy results show that PCEN is promising for speech at lower signal-to-noise ratios:

Figure: MFCC vs. PCEN in frontend processing.

Step 3 improves the robustness of the model by fitting it to additionally degraded data while evaluating on unmodified examples. For audio data augmentation, deformations such as pitch shifting, time stretching, colored noise, dynamic range compression, and impulse response (IR) convolution are effective, and open-source implementations are available in muda; a sketch follows the parameter table below.

| Deformation | Parameter setting |
| --- | --- |
| Random pitch shifting | Pitch ∼ N(μ = 0, σ² = 1) |
| Random time stretching | ln(Rate) ∼ N(μ = 0, σ = 0.3) |
| Colored noise | Brownian noise, weight ∈ [0.1, 0.9] |
| Dynamic range compression | Dolby E standard: speech |
| IR convolution | Isophonics Room Impulse Response dataset: Great Hall sample IR |
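An illustrative muda pipeline (file names are hypothetical; the pitch-shift and time-stretch parameters mirror the table above, and the other deformers are omitted for brevity):

```python
import muda

# Pair the synthesized audio with its JAMS annotation (hypothetical files)
j_orig = muda.load_jam_audio('soundscape.jams', 'soundscape.wav')

# Random pitch shift: semitones ~ N(mu=0, sigma=1), as in the table
pitch = muda.deformers.RandomPitchShift(n_samples=3, mean=0.0, sigma=1.0)

# Random time stretch: ln(rate) ~ N(mu=0, sigma=0.3), as in the table
stretch = muda.deformers.RandomTimeStretch(n_samples=3, location=0.0, scale=0.3)

# Chain the deformers; each combination yields one augmented example,
# with annotations deformed consistently alongside the audio
pipeline = muda.Pipeline(steps=[('pitch', pitch), ('stretch', stretch)])

for i, jam_out in enumerate(pipeline.transform(j_orig)):
    muda.save('aug_{:02d}.wav'.format(i),
              'aug_{:02d}.jams'.format(i),
              jam_out)
```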

Figure: Data augmentation results. (Please ignore the IR convolution result; it is affected by a distribution gap in the deformed training set and will be fixed in a later update.)

Step 4 models the alternation of speech and non-speech with a Hidden Markov Model (HMM) and finds its most likely state sequence via Viterbi decoding, in order to smooth the classifier's choppy frame-level decisions, as sketched below.

Figure: Computing the Viterbi path.
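Since librosa is already listed above for Viterbi decoding, a minimal sketch of this smoothing step (reusing `search` and `X` from the model-selection sketch; the 0.99 self-transition probability is an assumption) might be:

```python
import numpy as np
import librosa

# Frame-level speech probabilities from the random forest;
# column 1 is taken as the speech class in this sketch.
p_speech = search.predict_proba(X)[:, 1]

# Rows are states: 0 = non-speech, 1 = speech; shape (2, n_frames)
prob = np.vstack([1.0 - p_speech, p_speech])

# A self-loop transition matrix with a high self-transition probability
# penalizes rapid state switching and thus smooths the decisions
transition = librosa.sequence.transition_loop(2, 0.99)

# Most likely state sequence under the HMM
states = librosa.sequence.viterbi(prob, transition)
```

Raising the self-transition probability trades responsiveness for smoothness: the closer it is to 1, the longer a segment must be before the decoder is willing to switch states.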

The robustness improvement contributed by each stage can be seen in the balanced accuracy (BACC) results:

Figure: BACC improvement across stages.
