Skip to content

Text-to-speech and speech recognition, VAD with NVIDIA NeMo and ONNX Runtime for .NET Core.

License

Notifications You must be signed in to change notification settings

kaiidams/NeMoOnnxSharp

Repository files navigation

Speech recognizer with QuartzNet and ONNX Runtime

build

This repository explains how to export QuartzNet of NeMo as ONNX and use ONNX Runtime to recognize English texts from audio.

Quick Start

Clone the repository. Make sure to enable Git LFS as ONNX and WAV files are stored in LFS.

git clone https://github.com/kaiidams/NeMoOnnxSharp.git

Run dotnet command to run the program. The project file is for .NET Core 7.0 SDK. This should work either with Linux and Windows, also probably with MacOS.

cd NeMoOnnxSharp
dotnet run --project NeMoOnnxSharp.Example

If you are more famililar to Visual Studio, you can open NeMoOnnxSharp\NeMoOnnxSharp.sln and run the program with F5.

The program reads a test file, test_data\transcript.txt and print predicted results. The format of the output is three columns separated by |, names of wav files, target texts, predicted texts. The test data is from test-clean.tar.gz of LibriSpeech.

Name Target Predicted
61-70968-0000.wav he began a confused complaint against the wizard who had vanished behind the curtain on the left he began a confused complaint against the wizard who had vanished behind the curtain on the left
61-70968-0001.wav give not so earnest a mind to these mummeries child kive not so earnest a mind to these mummeries child
61-70968-0002.wav a golden fortune and a happy life a golden fortune and a happy life
61-70968-0003.wav he was like unto my father in a way and yet was not my father he was like unto my father in a way and yet was not my father
61-70968-0004.wav also there was a stripling page who turned into a maid also there was a stripling page who turned it to a maid
61-70968-0005.wav this was so sweet a lady sir and in some manner i do think she died this was so sweet a lady sir and in some manner i do think she died
61-70968-0006.wav but then the picture was gone as quickly as it came but then the picture was gone as quickly as it came
61-70968-0007.wav sister nell do you hear these marvels sister nell do you hear these marvels
61-70968-0008.wav take your place and let us see what the crystal can show to you take your place and let us see what the crystal can show to you
61-70968-0009.wav like as not young master though i am an old man like as not young master though i am an old man
61-70968-0010.wav forthwith all ran to the opening of the tent to see what might be amiss but master will who peeped out first needed no more than one glance forthwithal ran to the opening of the tent to see what might be amiss but master will who peeped out first needed no more than one glance
61-70968-0011.wav he gave way to the others very readily and retreated unperceived by the squire and mistress fitzooth to the rear of the tent he gave way to the others very readily and retreated unperceived by the squire and mistress fitzooth to the rear of the tent
61-70968-0012.wav cries of a nottingham a nottingham cries of a nottingham an nottingham
61-70968-0013.wav before them fled the stroller and his three sons capless and terrified before them fled the stroller and his three sons capless and terrified
61-70968-0014.wav what is the tumult and rioting cried out the squire authoritatively and he blew twice on a silver whistle which hung at his belt what is the tumult an rioting cried out the squire authoritatively and he blew twice on the silver whistle which hung at his belt
61-70968-0015.wav nay we refused their request most politely most noble said the little stroller nay we refused their request most politely most noble said the little stroller
61-70968-0016.wav and then they became vexed and would have snatched your purse from us and then they became vexed and would have snatched your purse from us
61-70968-0017.wav i could not see my boy injured excellence for but doing his duty as one of cumberland's sons i could not see my boy injured excellence for but doing his duty as one of cumberland's sons
61-70968-0018.wav so i did push this fellow so i did push this fellow
61-70968-0019.wav it is enough said george gamewell sharply and he turned upon the crowd it is eough said george gamwell sharply as he turned upon the crowd
61-70968-0020.wav shame on you citizens cried he i blush for my fellows of nottingham shame on you citizens cried he i blush for my fellows of nottingham
61-70968-0021.wav surely we can submit with good grace surely we can submit with good grace
61-70968-0022.wav tis fine for you to talk old man answered the lean sullen apprentice tis fine for you to talk old man answered the lean sullen apprentice
61-70968-0023.wav but i wrestled with this fellow and do know that he played unfairly in the second bout but i wrestled with this fellow and do know that he played unfairly in the second bout
61-70968-0024.wav spoke the squire losing all patience and it was to you that i gave another purse in consolation spoke the squire losing all patient and it was to you that i gave another person consolation
61-70968-0025.wav come to me men here here he raised his voice still louder come to me men her here he raised his voice still louder
61-70968-0026.wav the strollers took their part in it with hearty zest now that they had some chance of beating off their foes the strollers took their part in it with heardy zest now that they had some chance of beating off their foes
61-70968-0027.wav robin and the little tumbler between them tried to force the squire to stand back and very valiantly did these two comport themselves robin and the little tumbler between them tried to force the squire to stand back and very valiantly did these two comport themselves
61-70968-0028.wav the head and chief of the riot the nottingham apprentice with clenched fists threatened montfichet the head and chief of the riot the nottingham apprenticed with clenched fists threatened montfichet
61-70968-0029.wav the squire helped to thrust them all in and entered swiftly himself the squire helped to thrust them all in and entered swiftly himself
61-70968-0030.wav now be silent on your lives he began but the captured apprentice set up an instant shout now be silent on your lives he began but the captured apprentice set up an instant shout
61-70968-0031.wav silence you knave cried montfichet silence you knave cried montfichet
61-70968-0032.wav he felt for and found the wizard's black cloth the squire was quite out of breath he felt fur and found the wizzard's black cloth the squire was quite out of breath
61-70968-0033.wav thrusting open the proper entrance of the tent robin suddenly rushed forth with his burden with a great shout thrusting open the proper entrance of the tent robin suddenly rushed forth with his burden with a great shout
61-70968-0034.wav a montfichet a montfichet gamewell to the rescue a montfichet a montfichet came well to the rescue
61-70968-0035.wav taking advantage of this the squire's few men redoubled their efforts and encouraged by robin's and the little stroller's cries fought their way to him taking advantage of this the squire's few men redoubled their efforts and encouraged by robins and the little strollers cries fought their way to him
61-70968-0036.wav george montfichet will never forget this day george montfichet will never forget this day
61-70968-0037.wav what is your name lording asked the little stroller presently what is your name lordding asked the little stroller presently
61-70968-0038.wav robin fitzooth robin fitzooth
61-70968-0039.wav and mine is will stuteley shall we be comrades and mine is will stutley shall we be come rads
61-70968-0040.wav right willingly for between us we have won the battle answered robin right willingly for between us we have won the battle answered robin
61-70968-0041.wav i like you will you are the second will that i have met and liked within two days is there a sign in that i like you wil you are the second will that i have met in light within two days is there a sign in that
61-70968-0042.wav montfichet called out for robin to give him an arm montfichet called out for robin to give him an arm
61-70968-0043.wav friends said montfichet faintly to the wrestlers bear us escort so far as the sheriff's house friends said mont fichet faintly to the wrestlers bear us escort so far as the sheriff's house
61-70968-0044.wav it will not be safe for you to stay here now it will not be safe for you to stay here now
61-70968-0045.wav pray follow us with mine and my lord sheriff's men pray follow us with mine in my lord sheriff's men
61-70968-0046.wav nottingham castle was reached and admittance was demanded nottingham castle was reached and admittance was demanded
61-70968-0047.wav master monceux the sheriff of nottingham was mightily put about when told of the rioting master monceux the sheriff of nottingham was mightily put about when told of the rioting
61-70968-0048.wav and henry might return to england at any moment and henry might return to england at any moment
61-70968-0049.wav have your will child if the boy also wills it montfichet answered feeling too ill to oppose anything very strongly just then have your will child if the boy also wilts it montfichet answered feeling too ill to oppose anything very strongly just then
61-70968-0050.wav he made an effort to hide his condition from them all and robin felt his fingers tighten upon his arm he made an effort to hide his condition from them all and robin felt his fingers tightened upon his arm
61-70968-0051.wav beg me a room of the sheriff child quickly beg me a room of the sheriff child quickly
61-70968-0052.wav but who is this fellow plucking at your sleeve but who is this fellow plucking at your steeve
61-70968-0053.wav he is my esquire excellency returned robin with dignity he is my esquire excellency returned robin with dignity
61-70968-0054.wav mistress fitzooth had been carried off by the sheriff's daughter and her maids as soon as they had entered the house so that robin alone had the care of montfichet mistress fitzoth had been carried off by the sheriff's daughter and her maids as soon as they had entered the house so that robin alone had the care of montfichet
61-70968-0055.wav robin was glad when at length they were left to their own devices robin was glad when at length they were left to their own devices
61-70968-0056.wav the wine did certainly bring back the color to the squire's cheeks the wine did certainly bring back the color to the squire's cheeks
61-70968-0057.wav these escapades are not for old gamewell lad his day has come to twilight these escapades are not for old gamewell lad his day has come to twilight
61-70968-0058.wav will you forgive me now will you forgive me now
61-70968-0059.wav it will be no disappointment to me itill be no disappointment to me
61-70968-0060.wav no thanks i am glad to give you such easy happiness no thanks i am glad to give you such easy happiness
61-70968-0061.wav you are a worthy leech will presently whispered robin the wine has worked a marvel you are a worthy leech will presently whispered robin the wine has worked a marvel
61-70968-0062.wav ay and show you some pretty tricks i enshow you some pretty tricks

Text-to-speech samples

NeMoOnnxSharp supports text-to-speech with FastSpeech and HiFiGAN.

Generated Target
generated-61-70968-0000.wav he began a confused complaint against the wizard who had vanished behind the curtain on the left
generated-61-70968-0001.wav give not so earnest a mind to these mummeries child
generated-61-70968-0002.wav a golden fortune and a happy life

Exporting ONNX

Exported ONNX file is included in this repository. But if you want to do it yourself, you can use NeMo.

pip install 'git+https://github.com/NVIDIA/NeMo.git#egg=nemo_toolkit[asr]'

Then run the script below to export the model as an ONNX file.

import nemo.collections.asr as nemo_asr
quartznet = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="QuartzNet15x5Base-En")
quartznet.export("QuartzNet15x5Base-En.onnx")

How it works

Most of deep-learning system are composed of pre-processing, model and post-processing. Model code is usually written with deep-learning framework like PyTorch and TensorFlow. Model consumes numeric arrays called tensors and produces tensors. Pre-processing code receives inputs for the system like texts, audio clips, images and converts them into tensors so that deep-learning framework can handle. Post-processing code receives tensors and converts them into the final output forms.

For pre-processing of ASR task, audio data are usually divided into short time frames and converted into log spectrogram, log mel-spectrogram or MFCC. QuartzNet uses 16000Hz sampling rate, 10ms frame, 64 dimention mel-spectrogram. 5 sec audio is divided into 500 frames so it makes a 32-bit float tensor of 64x500 elements. There are a lot of flavors of conversions and they are not well explained in research papers, so you need to read the code carefully so that you make sure that doing the exactly same conversion.

Pre-processing

QuartzNet's pre-processing is implemented in nemo_asr.modules.AudioToMelSpectrogramPreprocessor. You can instantiate a preprocessor with proper parameters from config file examples/asr/conf/quartznet/quartznet_15x5.yaml. See NeMoOnnxTest.test_nemo_preprocess() of Python/nemo_onnx_test.py for example.

config = OmegaConf.load(config_file)
preprocessor = nemo_asr.models.EncDecCTCModel.from_config_dict(config.model.preprocessor)
preprocessor.eval()
audio_signal, audio_signal_length = preprocessor(
    input_signal=input_signal,
    length=input_signal_length)

nemo_asr.modules.AudioToMelSpectrogramPreprocessor is a complicated class so that researchers can try various configurations. Essentially it does the following things for QuartzNet.

  • Pre-emphasis (High band filter so emphasize high-freq)
  • Get frames with Hann window
  • Short-time Fourier fransform
  • Convert complex to squared magnitude
  • Convert to mel-spectrogram
  • Convert to log mel-spectrogram
  • Normalize per feature (i.e. normalize using mean and std along the time per feature)

in C# code, NeMoOnnxSharp.AudioToMelSpectrogramPreprocessor does the same conversion except input audio format is 16-bit integers, not 32-bit floats. Torch implementation uses vectorization for efficient parallelization, but C# implementation computes frame by frame for efficient memory usage.

Post-processing

Post-processing is simpler than pre-processing. The model outputs log probabilities for each time frame and labels. The label is raw English text in case of QuartzNet. Post-processing does the following things,

  • gets the most probable labels
  • decode into characters
  • then decode CTC.

Decoding CTC is eliminating duplicated characters as one English characters may span more than one time frame.

See NeMoOnnxTest.postprocess() for Python implementation.

Examples

About

Text-to-speech and speech recognition, VAD with NVIDIA NeMo and ONNX Runtime for .NET Core.

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published