Speech Emotion & Sentiment Analysis
Enabling Smarter Call Centers with AI-Driven Insights
Leveraging Shaip’s expertise in audio data collection and annotation to enhance real-time emotion and sentiment detection for improved customer service.
Automated Speech Emotion & Sentiment Analysis
The Client partnered with Shaip to develop an automated speech emotion and sentiment analysis model for call centers. The project involved collecting and annotating 250 hours of call center audio data across four English dialects – US, UK, Australian, and Indian. This enabled the client to enhance their AI models for detecting emotions such as Happy, Neutral, and Angry, and sentiment like Dissatisfied and Satisfied in real-time customer interactions.
The project overcame challenges such as sarcasm detection, varying audio lengths, and subtle verbal cues of dissatisfaction, delivering precise and scalable results.
Key Stats
- 250 Hrs: Call center audio data collected & annotated across 4 English dialects
- Languages: US English, UK English, Australian English & Indian English
- Use Case: Automated Speech Emotion & Sentiment Analysis
Project Scope
Collect and annotate 250 hours of call center audio data in four dialects of English:
- US English (30%)
- UK English (30%)
- Australian English (20%)
- Indian English (20%)
In Scope
The project consists of three parts:
- Audio data with specific entities, including metadata.
- Corresponding transcribed files with segmentation and time-stamping details.
- Emotion and sentiment annotations:
- Audio Emotion: Happy, Neutral, Angry
- Transcription Sentiment: Extremely Dissatisfied, Dissatisfied, Neutral, Satisfied, Extremely Satisfied
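To make the annotation taxonomy concrete, a minimal sketch of the two label sets is shown below; the Python class and member names are illustrative assumptions, not part of the delivered schema.

```python
from enum import Enum

class AudioEmotion(Enum):
    """Emotion labels applied to each audio segment (one primary emotion per segment)."""
    HAPPY = "Happy"
    NEUTRAL = "Neutral"
    ANGRY = "Angry"

class TranscriptionSentiment(Enum):
    """Five-point sentiment labels applied to the corresponding transcription."""
    EXTREMELY_DISSATISFIED = "Extremely Dissatisfied"
    DISSATISFIED = "Dissatisfied"
    NEUTRAL = "Neutral"
    SATISFIED = "Satisfied"
    EXTREMELY_SATISFIED = "Extremely Satisfied"
```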
Challenges
Ensuring that the audio data accurately represents the dialects specified (US, UK, Australian, and Indian) can be challenging. Different regions within these categories may use varied vocabulary, accents, and pronunciation.
Annotating audio and transcriptions for emotion and sentiment requires trained annotators familiar with the cultural nuances and linguistic subtleties of each dialect.
Audio emotion and transcription sentiment do not always align; for instance, a speaker may sound angry yet actually express satisfaction. Sarcastic phrases such as "Oh, wonderful, another person who can't solve my problem" must be annotated correctly for both emotion and sentiment.
The quality of the audio recordings can vary, affecting transcription accuracy and emotion detection. Background noise, overlapping conversations, and varying recording equipment can pose significant challenges.
Detecting dissatisfaction conveyed through non-lexical verbal cues, such as heavy exhales or other audible signs of frustration.
Solution
Leveraging advanced natural language processing (NLP) techniques, Shaip implemented the following solutions:
Data Collection
- 250 hours of audio data split into dialect-specific quotas (see the quota-check sketch after this list):
- US English (30% or 75 hours)
- UK English (30% or 75 hours)
- Australian English (20% or 50 hours)
- Indian English (20% or 50 hours)
- Native-accent speakers from the U.S., U.K., Australia, and India.
- Speech samples containing varying tones, with special focus on cases where voice emotion is Angry and text sentiment is Dissatisfied or Extremely Dissatisfied.
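As a rough illustration of the quota logic above (not the client's actual collection pipeline), the sketch below converts the dialect percentages into target hours and checks a collected-data manifest against them; the manifest structure and values are placeholders.

```python
# Hypothetical quota check for the 250-hour corpus; the manifest format and
# values are illustrative assumptions, not project tooling.
TOTAL_HOURS = 250
DIALECT_QUOTAS = {
    "US English": 0.30,
    "UK English": 0.30,
    "Australian English": 0.20,
    "Indian English": 0.20,
}

def target_hours() -> dict:
    """Return the target number of hours for each dialect."""
    return {dialect: TOTAL_HOURS * share for dialect, share in DIALECT_QUOTAS.items()}

def coverage_report(collected: dict) -> None:
    """Print collected vs. target hours for each dialect."""
    for dialect, target in target_hours().items():
        got = collected.get(dialect, 0.0)
        status = "OK" if got >= target else "short"
        print(f"{dialect:<20} {got:6.1f} / {target:5.1f} hrs  {status}")

# Placeholder manifest: hours of audio collected so far.
coverage_report({"US English": 75.0, "UK English": 74.5,
                 "Australian English": 50.0, "Indian English": 50.0})
```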
Text Classification/Annotation
- Annotation of emotions and sentiments based on specific categories:
- Audio Emotion: Happy, Neutral, Angry.
- Transcription Sentiment: Extremely Dissatisfied, Dissatisfied, Neutral, Satisfied, Extremely Satisfied.
- Each audio segment contained only one primary emotion.
- Segments of varying duration (from 2 to 30 seconds) within conversations.
- The transcription format followed JSON output, including left and right speaker information, sentiment tags, and final segment sentiment.
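A minimal example of what one such JSON segment record might look like is sketched below; the field names are assumptions for illustration and were ultimately defined by the project specification.

```python
import json

# Illustrative transcription segment; field names are assumed, not the
# project's actual JSON schema. The sarcastic utterance is the example
# quoted in the Challenges section above.
segment = {
    "audio_file": "call_0001.wav",
    "segment_id": 7,
    "start_time": "00:02:15.300",
    "end_time": "00:02:22.150",
    "speaker": "left",                      # left- or right-channel speaker
    "transcript": "Oh, wonderful, another person who can't solve my problem.",
    "audio_emotion": "Angry",               # Happy / Neutral / Angry
    "sentiment": "Dissatisfied",            # five-point sentiment scale
    "final_segment_sentiment": "Extremely Dissatisfied",
}

print(json.dumps(segment, indent=2))
```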
Quality Assurance
Transcription Accuracy:
- Ensured that all 250 hours of audio were delivered with a minimum of:
- 90% transcription accuracy, measured via Transcription Error Rate (TER).
- 95% word-level accuracy, measured via Word Error Rate (WER).
QA Process:
- Regular audits of randomly selected samples from the dataset were conducted.
- Used automated tools to measure TER and WER across the dataset.
- Manual review of flagged sections ensured accuracy thresholds were met.
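To illustrate how word-level accuracy can be measured automatically (the project's own QA tooling is not detailed here), the sketch below computes Word Error Rate with a standard word-level edit distance; a WER of 5% or lower corresponds to the 95% accuracy target.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words (substitutions, insertions, deletions).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one substituted word out of seven ≈ 14.3% WER (85.7% accuracy).
wer = word_error_rate("i would like to close my account",
                      "i would like to close the account")
print(f"WER: {wer:.2%}")
```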
The Outcome
The training data will support the development of an automated emotion and sentiment detection model, delivering:
- Real-time emotion detection in call center interactions.
- More effective handling of complex cases, such as sarcasm or dissatisfaction.
- Scalability for future projects, easily adapting to increased data volumes and more languages.
Deliverables
- 250 hrs of Audio files (in 8 kHz PCM WAV format, mono)
- Transcription files (with segmentation, sentiment tags, and speaker identifiers)
- Metadata (audio duration, speaker details, etc.)
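As a small sanity-check sketch (assuming Python's standard wave module and a placeholder file name), the stated 8 kHz mono PCM WAV format of the delivered audio can be verified like this:

```python
import wave

def check_wav_format(path: str) -> bool:
    """Verify a deliverable WAV file is 8 kHz mono PCM, as specified."""
    with wave.open(path, "rb") as wav:
        ok = wav.getframerate() == 8000 and wav.getnchannels() == 1
        duration_sec = wav.getnframes() / float(wav.getframerate())
        print(f"{path}: rate={wav.getframerate()} Hz, "
              f"channels={wav.getnchannels()}, "
              f"sample width={wav.getsampwidth() * 8}-bit, "
              f"duration={duration_sec:.1f} s -> {'OK' if ok else 'format mismatch'}")
    return ok

# Placeholder file name for illustration.
check_wav_format("call_0001.wav")
```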
Partnering with Shaip for our call center data project has been a pivotal moment in advancing our AI solutions. Their team expertly collected and annotated 250 hours of audio data across four key English dialects – US, UK, Australian, and Indian – ensuring the highest quality and precision. The attention to linguistic nuances across these regions significantly improved the accuracy of our speech recognition models. Additionally, Shaip’s expertise in handling complex data annotation projects has been instrumental in helping us build reliable, compliant models at scale.