Speech Emotion & Sentiment Analysis

Enabling Smarter Call Centers with AI-Driven Insights

Leveraging Shaip’s expertise in audio data collection and annotation to enhance real-time emotion and sentiment detection for improved customer service.


Automated Speech Emotion & Sentiment Analysis

The Client partnered with Shaip to develop an automated speech emotion and sentiment analysis model for call centers. The project involved collecting and annotating 250 hours of call center audio data across four English dialects – US, UK, Australian, and Indian. This enabled the client to enhance their AI models for detecting emotions such as Happy, Neutral, and Angry, and sentiments such as Dissatisfied and Satisfied, in real-time customer interactions.

The project overcame challenges such as sarcasm detection, varying audio lengths, and subtle verbal cues of dissatisfaction, delivering precise and scalable results.


Key Stats

  • 250 Hrs – Call center audio data collected & annotated across 4 English dialects
  • Languages – US English, UK English, Australian English & Indian English
  • Use Cases – Automated Speech Emotion & Sentiment Analysis

Project Scope

Collect and annotate 250 hours of call center audio data in four dialects of English:

  • US English (30%)
  • UK English (30%)
  • Australian English (20%)
  • Indian English (20%)

In Scope

The project consists of three parts:

  • Audio data with specific entities, including metadata.
  • Corresponding transcribed files with segmentation and time-stamping details.
  • Emotion and sentiment annotations:
    • Audio Emotion: Happy, Neutral, Angry
    • Transcription Sentiment: Extremely Dissatisfied, Dissatisfied, Neutral, Satisfied, Extremely Satisfied
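
As a rough sketch of how these label sets and time-stamped segments could be represented in code, here is one possible structure; the class and field names are illustrative assumptions rather than the project's actual schema.

```python
from dataclasses import dataclass
from enum import Enum

# Emotion labels applied to the audio signal (names taken from the scope above).
class AudioEmotion(Enum):
    HAPPY = "Happy"
    NEUTRAL = "Neutral"
    ANGRY = "Angry"

# Sentiment labels applied to the transcription (names taken from the scope above).
class TranscriptionSentiment(Enum):
    EXTREMELY_DISSATISFIED = "Extremely Dissatisfied"
    DISSATISFIED = "Dissatisfied"
    NEUTRAL = "Neutral"
    SATISFIED = "Satisfied"
    EXTREMELY_SATISFIED = "Extremely Satisfied"

# One transcribed, annotated segment with time-stamping details (field names are assumptions).
@dataclass
class AnnotatedSegment:
    start_sec: float                   # segment start within the call
    end_sec: float                     # segment end within the call
    speaker: str                       # e.g. "agent" or "customer"
    text: str                          # transcribed utterance
    emotion: AudioEmotion              # single primary emotion per segment
    sentiment: TranscriptionSentiment  # sentiment tag for the transcription
```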

Challenges

Diversity of Dialects

Ensuring that the audio data accurately represents the dialects specified (US, UK, Australian, and Indian) can be challenging. Different regions within these categories may use varied vocabulary, accents, and pronunciation.

Expertise Requirement

Annotating audio and transcriptions for emotion and sentiment requires trained annotators familiar with the cultural nuances and linguistic subtleties of each dialect.

Complexity of Emotions & Sentiments

Audio emotion and transcription sentiment do not always align. For instance, a speaker may sound angry yet express satisfaction in their words. Sarcastic phrases such as "Oh, wonderful, another person who can't solve my problem" must be annotated correctly for both emotion and sentiment.

Audio Quality

The quality of the audio recordings can vary, affecting transcription accuracy and emotion detection. Background noise, overlapping conversations, and varying recording equipment can pose significant challenges.

Accurately Capturing Dissatisfaction

Dissatisfaction is often conveyed through vocal cues such as heavy exhales or other signs of frustration, which are difficult to capture and annotate consistently.

Solution

Leveraging advanced natural language processing (NLP) techniques, the following solutions were implemented:

Data Collection

  • 250 hours of audio data split into dialect-specific quotas.
    • US English (30% or 75 hours)
    • UK English (30% or 75 hours)
    • Australian English (20% or 50 hours)
    • Indian English (20% or 50 hours)
  • Native speakers from the U.S., U.K., Australia, and India.
  • Speech samples containing varying tones, with special focus on cases where voice emotion is Angry and text sentiment is Dissatisfied or Extremely Dissatisfied.
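
The split above is simple arithmetic, but a small sketch can make the quota bookkeeping explicit; the share values come from the scope, while the function name is an illustrative assumption.

```python
TOTAL_HOURS = 250

# Dialect split taken from the project scope.
DIALECT_SHARES = {
    "US English": 0.30,
    "UK English": 0.30,
    "Australian English": 0.20,
    "Indian English": 0.20,
}

def dialect_quotas(total_hours: float = TOTAL_HOURS) -> dict[str, float]:
    """Convert percentage shares into hour quotas per dialect."""
    return {dialect: share * total_hours for dialect, share in DIALECT_SHARES.items()}

quotas = dialect_quotas()
assert sum(quotas.values()) == TOTAL_HOURS  # 75 + 75 + 50 + 50 hours
print(quotas)  # {'US English': 75.0, 'UK English': 75.0, 'Australian English': 50.0, 'Indian English': 50.0}
```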

Text Classification/Annotation


  • Annotation of emotions and sentiments based on specific categories:
    • Audio Emotion: Happy, Neutral, Angry.
    • Transcription Sentiment: Extremely Dissatisfied, Dissatisfied, Neutral, Satisfied, Extremely Satisfied.
  • Each audio segment contained only one primary emotion.
  • Varying delay segments (from 2 to 30 seconds) applied within conversations.
  • The transcription format followed JSON output, including left and right speaker information, sentiment tags, and the final segment sentiment (a possible structure is sketched after this list).
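
The case study does not publish the exact JSON keys, so the record below is a hedged illustration of what one transcribed, annotated call might look like; the field names are assumptions, and the label values match the categories defined above.

```python
import json

# Hypothetical per-call transcription record with left/right speaker segments,
# per-segment emotion and sentiment tags, and a final segment sentiment.
sample_record = {
    "call_id": "example-0001",            # illustrative identifier
    "dialect": "US English",
    "segments": [
        {
            "channel": "left",            # e.g. the agent
            "start_sec": 0.0,
            "end_sec": 6.4,
            "text": "Thank you for calling, how can I help you today?",
            "emotion": "Neutral",
            "sentiment": "Neutral",
        },
        {
            "channel": "right",           # e.g. the customer
            "start_sec": 6.4,
            "end_sec": 14.9,
            "text": "Oh, wonderful, another person who can't solve my problem.",
            "emotion": "Angry",
            "sentiment": "Dissatisfied",
        },
    ],
    "final_segment_sentiment": "Dissatisfied",
}

print(json.dumps(sample_record, indent=2))
```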

 

Quality Assurance

Transcription Accuracy:

  • Ensured that all 250 hours of audio were delivered with a minimum of:
    • 90% transcription accuracy (a Transcription Error Rate, TER, of at most 10%).
    • 95% word-level accuracy (a Word Error Rate, WER, of at most 5%).
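
For reference, word-level accuracy of this kind is typically derived from a word error rate computed via edit distance; the snippet below is a generic sketch of that calculation, not the project's actual QA tooling.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over words: (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Word-level accuracy is 1 - WER; the 95% target corresponds to WER <= 0.05.
wer = word_error_rate("thank you for calling today", "thank you for calling")
print(f"WER = {wer:.2f}, accuracy = {1 - wer:.2f}")
```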

QA Process:

  • Regular audits of randomly selected samples from the dataset.
  • Automated tools to measure TER and WER across the dataset.
  • Manual review of flagged sections to ensure accuracy thresholds were met.
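
A minimal sketch of how such an audit could be automated, reusing the word_error_rate helper sketched above; the sample size, threshold, and data layout are assumptions for illustration.

```python
import random

WER_THRESHOLD = 0.05  # corresponds to the 95% word-level accuracy target

def audit_sample(dataset: dict[str, tuple[str, str]], sample_size: int = 25,
                 seed: int = 42) -> list[str]:
    """Randomly sample (reference, hypothesis) transcript pairs and flag
    any file whose WER exceeds the threshold for manual review."""
    rng = random.Random(seed)
    sampled_ids = rng.sample(sorted(dataset), min(sample_size, len(dataset)))
    flagged = []
    for file_id in sampled_ids:
        reference, hypothesis = dataset[file_id]
        if word_error_rate(reference, hypothesis) > WER_THRESHOLD:
            flagged.append(file_id)  # route to a human reviewer
    return flagged
```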

The Outcome

The training data will support the development of an automated emotion and sentiment detection model, delivering:

  • Real-time emotion detection in call center interactions.
  • More effective handling of complex cases, such as sarcasm or dissatisfaction.
  • Scalability for future projects, easily adapting to increased data volumes and more languages.

Deliverables

  • 250 hrs of Audio files (in 8 kHz PCM WAV format, mono)
  • Transcription files (with segmentation, sentiment tags, and speaker identifiers)
  • Metadata (audio duration, speaker details, etc.)
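
As a sanity check on the audio deliverable format, a small script using Python's standard wave module could verify the 8 kHz mono specification and report durations for the metadata; the directory path below is a placeholder.

```python
import wave
from pathlib import Path

def check_wav(path: Path) -> dict:
    """Validate a delivered WAV file against the 8 kHz mono spec and report its duration."""
    with wave.open(str(path), "rb") as wav:
        frame_rate = wav.getframerate()
        channels = wav.getnchannels()
        duration_sec = wav.getnframes() / frame_rate
    return {
        "file": path.name,
        "sample_rate_ok": frame_rate == 8000,   # 8 kHz
        "mono_ok": channels == 1,               # single channel
        "duration_sec": round(duration_sec, 2),
    }

# Placeholder directory; iterate over the delivered audio files.
for wav_path in Path("deliverables/audio").glob("*.wav"):
    print(check_wav(wav_path))
```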

Partnering with Shaip for our call center data project has been a pivotal moment in advancing our AI solutions. Their team expertly collected and annotated 250 hours of audio data across four key English dialects – US, UK, Australian, and Indian – ensuring the highest quality and precision. The attention to linguistic nuances across these regions significantly improved the accuracy of our speech recognition models. Additionally, Shaip’s expertise in handling complex data annotation projects has been instrumental in helping us build reliable, compliant models at scale.
