Speech Recognition Integration | Eleveo User Guides

Supported for Cloud Deployments + On-Premise Deployments + Hybrid Deployments

Overview

Eleveo offers a Speech Recognition package that is installed on a separate, dedicated server. The solution is provided for both on-premise and cloud deployments. Feature availability may vary based on your installation.

The Eleveo solution does not support multiple speech engines in parallel: only a single speech engine can be configured. Multiple language packs can be configured for that engine, based on what the given speech engine supports.

Speech Recognition

Speech Recognition works with a limited number of languages. It is installed as an add-on to Quality Management and must be configured before use. The feature provides transcription services for all supported languages. Audio files generated in the contact center or back office are sent via a dedicated API to a secondary system that processes the recording, analyzes the audio, detects emotion/sentiment, transcribes the audio, and tags the relevant sections. The transcription can be viewed in the Conversation Explorer.

Figure: Cloud Installation. Graphical overview of how the Speech Recognition service is interconnected with other services.

What is Supported - Based on Speech Recognition Service

Languages

Speech Recognition - Voci
Supported Languages	Dialects supported
English	North America, Australia, United Kingdom, Europe, Philippines, International
French	Canada, France, Europe
Spanish	North America, Spain, Mexico, Argentina, Colombia, Panama
German
Italian
Portuguese	Brazil
Dutch
For up-to-date information regarding supported language packs, please refer to the provider's documentation: Medallia Documentation

Speech Recognition - Phonexia
CPU Based Languages	Arabic (Gulf), Arabic (Levantine), Bengali, Chinese Mandarin, Croatian, Czech, Dutch, English (US), Farsi, French, Georgian, German, Hungarian, Italian, Kazakh, Pashto, Polish, Russian, Serbian, Slovak, Spanish, Swedish, Turkish, Ukrainian, Vietnamese
GPU Based Languages	Afrikaans, Albanian, Arabic, Armenian, Azerbaijani, Basque, Belarusian, Bengali, Bosnian, Bulgarian, Catalan, Cantonese (HK, CN), Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Filipino, Finnish, French, Galician, German, Greek, Gujarati, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Mandarin (TW, CN), Marathi, Maori, Nepali, Norwegian Nynorsk, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, Vietnamese, Welsh
For up-to-date information regarding supported language packs, please refer to the provider's documentation: https://docs.cloud.phonexia.com/docs/products/speech-platform-4

Additional Features - Installation Dependent

Feature	Speech Recognition - Voci	Speech Recognition - Phonexia
Transcription
Phrase Spotting (on top of transcription)		UI differences – There are minor differences in the way the Speech Tags are displayed to end users. Found speech tags may display as overlapping each other for GPU based languages.
Emotion/sentiment detection	(English only) Available on transcription utterance as well as participant level	Emotion is not supported
Acoustic parameters (crosstalk, silence, speed of speech, talk time, gender, etc.)		Gender is not supported. Silence and talking count are not supported.
Transcription redaction
Automated language identification	Automated language detection – If automated language detection is enabled for your server, the system detects which language is used in the first twenty seconds of the recording and switches the language processor to the detected language. The system does not detect or switch languages after the first twenty seconds, so if multiple languages are used in a conversation, the whole recording is transcribed according to the language detected at the beginning. If the speech recognition engine fails to detect the language accurately, it may produce a transcription in the incorrect language. Automatic recognition is available for the following language pairs: English / Spanish, English / French.	Automated language detection – GPU languages only – When configuring the language model to be used, it is possible to define a single language or to set the system to auto-detect the language spoken. The language is detected every 30 seconds, after which the system can start using a different language model.
Transcription tuning	Possibility to define vocabulary.	Not supported at this time
Supported formats	WAV only	WAV, MP3, MP4
Availability	Cloud, Hybrid, On-Premise	On-Premise
Reprocessing of Archived Media
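
The two automated language detection behaviors described in the table above can be contrasted with a small sketch. This is an illustration only, not either provider's implementation: the `Segment` type, the function names, and the "language with the most speech time in the window wins" rule are all assumptions made for the example. Only the 20-second first window and the 30-second interval come from the table.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float   # seconds from the beginning of the recording
    end: float
    language: str  # language spoken in this segment (hypothetical input)


def dominant_language(segments, window_start, window_end):
    """Return the language with the most speech time inside the window, or None."""
    totals = {}
    for seg in segments:
        overlap = min(seg.end, window_end) - max(seg.start, window_start)
        if overlap > 0:
            totals[seg.language] = totals.get(seg.language, 0.0) + overlap
    return max(totals, key=totals.get) if totals else None


def detect_once(segments, window=20.0):
    """First-window behavior: one language is chosen for the whole recording."""
    return dominant_language(segments, 0.0, window)


def detect_periodically(segments, duration, interval=30.0):
    """Periodic behavior: the language is re-detected for every interval."""
    result = []
    current = None
    t = 0.0
    while t < duration:
        detected = dominant_language(segments, t, min(t + interval, duration))
        if detected is not None:
            current = detected  # keep the previous language if nothing is detected
        result.append((t, current))
        t += interval
    return result


# A 90-second call that starts in English and switches to Spanish at 25 seconds.
call = [Segment(0, 25, "en"), Segment(25, 90, "es")]
first_window = detect_once(call)
periodic = detect_periodically(call, duration=90)
print(first_window)
print(periodic)
```

In this sketch, the first-window approach selects English for the entire call even though most of it is in Spanish, while the periodic approach switches to Spanish from the second interval onward.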

Acoustic Parameters by Provider

The following list is provided as additional information. Data available may vary based on the quality of the recorded conversation.

Acoustic Parameter	Speech Recognition - Voci	Speech Recognition - Phonexia
General statistics – Aggregated for the entire conversation
Interruptions count – Number of interruptions
Total crosstalk duration (sec.) – Total time that the speakers were interrupting or speaking over each other
Total crosstalk ratio (%) – Ratio of time that the speakers were interrupting or speaking over each other
Silence count – Counts only silences that are longer than 800 milliseconds, so the silence count may be 0. In contrast, Total silence duration combines all silence time, including short periods of silence, and may therefore be greater than 0.
Total silence duration (sec.) – How much time was silent (no audio)
Silence ratio (%) – Ratio of time that was silent relative to talk time
Talking count – Total count of utterances (i.e. phrases, sentences in the transcription)
Total talking duration (sec.) – Total time a participant was speaking
Talking ratio (%) – How much time (as a ratio) a participant was speaking
Speaker specific statistics
Gender (Male/Female) – If detected, the system displays the gender of the speaker (this information is not displayed unless configured by an administrator)
Total talking duration (sec.) – Total time the participant was speaking
Talking ratio (%) – How much time (as a ratio) the participant was speaking
Average speed (words/min.) – How fast the speaker was speaking. Average number of words per minute (rounded to 2 decimal places)
Interruptions count – Number of interruptions (times the speakers spoke over each other)
Total crosstalk duration (sec.) – Total time that the speaker was interrupting or speaking over the other
Total crosstalk ratio (%) – Ratio of time that the speaker was interrupting or speaking over the other
Average talk speed – Average number of words spoken per minute
Agent talking ratio – Ratio of the call, in percent, where the agent is speaking
Agent crosstalk ratio – Ratio of the call, in percent, where there is crosstalk
Agent number of interruptions – Number of times crosstalk is detected
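
The relationship between Silence count and Total silence duration, and the Average speed calculation, can be illustrated with a minimal sketch. The function names and the input format (a list of silence durations in seconds) are assumptions made for the example. Only the 800 millisecond threshold and the rounding to 2 decimal places come from the descriptions above; the way each provider actually computes these values is not documented here.

```python
SILENCE_COUNT_THRESHOLD = 0.8  # seconds (800 milliseconds, per the Silence count description)


def silence_stats(silences):
    """Return (silence count, total silence duration) for a list of silence lengths in seconds."""
    # Only silences longer than the threshold are counted...
    count = sum(1 for duration in silences if duration > SILENCE_COUNT_THRESHOLD)
    # ...but every silence, however short, adds to the total duration.
    total = sum(silences)
    return count, total


def average_speed(word_count, talking_seconds):
    """Return words per minute, rounded to 2 decimal places."""
    if talking_seconds <= 0:
        return 0.0
    return round(word_count * 60.0 / talking_seconds, 2)


# Three short pauses: none is counted, yet the total silence duration is non-zero.
count, total = silence_stats([0.25, 0.5, 0.75])
print(count, total)

# 150 words spoken over 72 seconds of talking time.
print(average_speed(150, 72))
```

This shows why a conversation can report a Silence count of 0 while Total silence duration is greater than 0.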