Elixir client for Google Speech-to-Text V2 streaming API using gRPC
The package can be installed by adding :ex_google_stt to your list of dependencies in mix.exs:
def deps do
  [
    {:ex_google_stt, "~> 0.5.1"}
  ]
end
This library uses Goth to obtain authentication tokens. It requires Google Cloud credentials to be configured. See Goth's README for details.
Using Google's V2 API requires that you set a recognizer to use for your requests (see here). It is a string like the following:
projects/{project}/locations/{location}/recognizers/{recognizer}
You can either set this in the config or pass it as an option when starting the TranscriptionServer.
In the config:
config :ex_google_stt, recognizer: "projects/{project}/locations/{location}/recognizers/_"
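Since the recognizer is just a formatted string, a small helper can build it. The following is a minimal sketch and not part of the library: the RecognizerPath module is hypothetical, and the "global" location and "_" recognizer defaults are assumptions chosen for illustration.

```elixir
defmodule RecognizerPath do
  # Hypothetical helper, not part of ex_google_stt.
  # The default location ("global") and recognizer ("_") are assumptions.
  def build(project, location \\ "global", recognizer \\ "_") do
    "projects/#{project}/locations/#{location}/recognizers/#{recognizer}"
  end
end
```

The resulting string can be placed in the config or passed as the recognizer option when starting the TranscriptionServer.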
The library is designed to abstract most of the gRPC logic, so I'll provide the most basic use of it here.
- In summary, we use a TranscriptionServer that handles the gRPC streams to Google.
- That TranscriptionServer is responsible for monitoring/opening the streams and parsing the responses.
- Send audio data (as binary) using TranscriptionServer.process_audio(server_pid, audio_data).
- The TranscriptionServer will then send the responses to the target pid, set when creating the server.
- The caller should define a handle_info that will receive the transcripts and handle any errors.
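Audio sources sometimes hand over buffers larger than you want to send in one call. As a rough sketch, a caller could split a binary into fixed-size chunks before passing each one to TranscriptionServer.process_audio/2. The AudioChunker module is hypothetical and not part of the library, and the chunk size is left to the caller.

```elixir
defmodule AudioChunker do
  # Hypothetical helper, not part of ex_google_stt.
  # Splits a binary into chunks of at most `size` bytes, preserving order.
  def chunk(binary, size) when is_binary(binary) and is_integer(size) and size > 0 do
    do_chunk(binary, size, [])
  end

  # Nothing left to split
  defp do_chunk(<<>>, _size, acc), do: Enum.reverse(acc)

  # The remainder fits in a single chunk
  defp do_chunk(binary, size, acc) when byte_size(binary) <= size do
    Enum.reverse([binary | acc])
  end

  # Take `size` bytes off the front and keep going
  defp do_chunk(binary, size, acc) do
    <<chunk::binary-size(size), rest::binary>> = binary
    do_chunk(rest, size, [chunk | acc])
  end
end
```

Each chunk could then be sent with Enum.each(chunks, &TranscriptionServer.process_audio(server_pid, &1)).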
When starting the TranscriptionServer, you can define a few configs:
- target - a pid to send the results to, defaults to self()
- language_codes - a list of language codes to use for recognition, defaults to ["en-US"]
- enable_automatic_punctuation - a boolean to enable automatic punctuation, defaults to true
- interim_results - a boolean to enable interim results, defaults to false
- recognizer - a string representing the recognizer to use, defaults to use the recognizer from the config
- model - a string representing the model to use, defaults to "latest_long". Be careful, changing to 'short' may have unintended consequences
- explicit_decoding_config - a struct with audio decoding parameters
Note that apart from interim_results, these configurations are better set up in the recognizer directly, so that you can control them without deploying any code.
See here for details: https://cloud.google.com/speech-to-text/v2/docs/recognizers
Basically, create a recognizer in GCP, then add a system environment variable containing the recognizer string.
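One way to wire an environment variable into the config is a runtime configuration file. This is a sketch under assumptions: the variable name GOOGLE_STT_RECOGNIZER is hypothetical, so use whatever name your deployment defines.

```elixir
# config/runtime.exs
import Config

# GOOGLE_STT_RECOGNIZER is a hypothetical variable name.
# fetch_env!/1 raises at boot if the variable is missing, which surfaces
# a misconfigured deployment early.
config :ex_google_stt, recognizer: System.fetch_env!("GOOGLE_STT_RECOGNIZER")
```

With this in place, switching recognizers only requires changing the environment variable and restarting the application.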
defmodule MyModule.Transcribing do
  use GenServer

  alias ExGoogleSTT.{Error, SpeechEvent, Transcript, TranscriptionServer}

  ...

  def init(_opts) do
    {:ok, transcription_server} = TranscriptionServer.start_link(target: self(), interim_results: true)
    # Keep the server pid in the state so the callbacks below can reach it
    {:ok, %{server_pid: transcription_server}}
  end

  def handle_info({:got_new_speech, speech_binary}, state) do
    TranscriptionServer.process_audio(state.server_pid, speech_binary)
    {:noreply, state}
  end

  def handle_info({:stt_event, %Transcript{} = transcript}, state) do
    # Do whatever you need with the transcription
    {:noreply, state}
  end

  def handle_info({:stt_event, %SpeechEvent{event: :SPEECH_ACTIVITY_BEGIN}}, state) do
    # You probably want to ignore these
    {:noreply, state}
  end

  def handle_info({:stt_event, :stream_timeout}, state) do
    # You probably want to ignore these as well. This is only a simple gRPC timeout, when nothing is coming.
    {:noreply, state}
  end

  def handle_info({:response, %Error{status: some_status, message: message}}, state) do
    # You might want to log these, as they are real errors.
    {:noreply, state}
  end
end
The library allows you to define other response handling functions and even ditch the GenServer part of TranscriptionServer altogether.
If you are not relying on auto decoding, you can specify the custom encoding parameters of your audio stream.
defmodule MyModule.Transcribing do
  use GenServer

  alias ExGoogleSTT.TranscriptionServer
  alias Google.Cloud.Speech.V2.ExplicitDecodingConfig

  def init(_opts) do
    {:ok, transcription_server} =
      TranscriptionServer.start_link(
        target: self(),
        interim_results: true,
        # Describe the raw audio format explicitly instead of relying on auto decoding
        explicit_decoding_config: %ExplicitDecodingConfig{
          encoding: :LINEAR16,
          sample_rate_hertz: 16000,
          audio_channel_count: 1
        }
      )

    # Keep the server pid in the state for later calls
    {:ok, %{server_pid: transcription_server}}
  end
end
Google's STT V2 knows when a sentence finishes, as long as there's some silence after it. When that happens, it'll return the transcription without ending the stream.
Therefore, as long as we keep the stream open, we can keep transcribing realtime speech.
A few points to notice though.
- The model must be long or latest_long. short will result in ending the stream after the first utterance.
- One must end the stream to ensure the transcription stops.
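If the model name comes from configuration or user input, a small guard can catch a value that would cut the stream short. This is only an illustrative sketch: the ContinuousModel module is hypothetical, and its list of models reflects the note above (long and latest_long) rather than anything the library exposes.

```elixir
defmodule ContinuousModel do
  # Hypothetical helper, not part of ex_google_stt.
  # Models that, per the note above, keep the stream open across utterances.
  @continuous_models ["long", "latest_long"]

  # Returns true when the given model name is suitable for continuous transcription
  def continuous?(model), do: model in @continuous_models
end
```

A caller could check ContinuousModel.continuous?(model) before starting the TranscriptionServer and log a warning otherwise.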
This library uses protobuf-elixir and its protoc-gen-elixir plugin to generate Elixir modules from *.proto files for Google's Speech gRPC API. The documentation for the types defined in *.proto files can be found here.
All the tests require communication with Google, so you must have Google credentials configured to run them in this repo.
Tests tagged :load_test are excluded by default, since they can be a bit expensive to run. Use mix test --include load_test to run them.
A recording fragment in test/fixtures comes from the audiobook "The adventures of Sherlock Holmes (version 2)", available on LibriVox.
The current version of the library supports only the Streaming API and has not been tested in production. Treat it as experimental.
Portions of this project are modifications based on work created by Software Mansion and used according to the terms described in the Apache License 2.0. See here for the original repository.
This work is not endorsed by or affiliated with the original authors or their organizations.
The modifications are also licensed under Apache License 2.0.