Augmenting Speech Therapy via LLMs and MetaHumans
AI in Healthcare

Reimagining speech therapy

Prologue

We explore the integration of 3D avatars to augment remote speech therapy for children with motor-speech disorders.


We hypothesized that including a 3D avatar would keep the child engaged while also providing visual cues as they learn syllables and words.


Please scroll to explore the design directions, solution synthesis, and the challenges we faced. To learn more about the underlying research, see the companion case study.

Summary

PROBLEM

Contemporary solutions rely on bulky hardware and are not covered by medical insurance, making remote speech therapy inaccessible to most families.

SOLUTION

We designed a mobile-based remote speech therapy platform, Echo. It leverages 3D avatars to visually engage the child while learning syllables or practicing speech patterns.

IMPACT

86% of users reported a higher level of engagement when using the application.

KEY DECISIONS

Designing a child-friendly user experience, multimodal interaction through text and speech input, and the inclusion of conversational interfaces.

MY ROLE

I supervised the design of the product and subsequently built the technical pipeline for conversational 3D avatars via Convai.

MY TEAM

1 HCAI practitioner (me), 3 HCI practitioners, 3 software developers

1 Why a 3D Model?

Multimodal Interaction (Text)

Through primary research, we found that speech therapists prefer to cue children by directing their attention to lip movement. (User testing of text interactions, text-to-speech)

Multimodal Interaction (Voice)

Children are often asked to practice in front of a mirror. This made a strong case for incorporating a 3D avatar, which can take on a similar role when therapy is delivered remotely. (User testing of voice interactions, speech-to-speech)

2 How We Built the Model

MetaHuman Creator

I used MetaHuman Creator to build high-fidelity 3D avatars that look friendly and welcoming. Incorporating subtle movements such as blinking and body sway kept the character human and approachable rather than eerie or uncanny.

NVIDIA Audio2Face to Convai (Prompt Design)

The initial prototypes leveraged NVIDIA Audio2Face as a proof of concept, but I ran into latency that negatively affected the user experience.

I therefore switched to Convai, whose conversational capabilities and lower latency made the interactions feel lifelike. The critical aspect of using Convai was prompt design, which grounds the 3D avatar in the context and intent of delivering speech therapy.
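To give a sense of what this prompt design involved, here is a minimal illustrative sketch in Python. The wording, the build_persona_prompt helper, and the target age range are hypothetical stand-ins rather than the exact prompt used in Echo; in practice the resulting text would sit in the Convai character's description/backstory setup rather than be generated in code.

# Illustrative persona prompt for a speech-therapy avatar (hypothetical wording).
# In Echo's pipeline the equivalent text lives in the Convai character setup,
# not in application code; this sketch only shows the qualities we emphasized.

TARGET_AGE = "4 to 8"  # assumed age range for mild-to-moderate cases

def build_persona_prompt(child_name: str, target_sound: str) -> str:
    """Compose a warm, encouraging persona prompt for one therapy session."""
    return (
        f"You are Echo, a friendly speech-therapy companion for children aged {TARGET_AGE}. "
        f"You are helping {child_name} practice the sound '{target_sound}'. "
        "Speak slowly, in short and simple sentences. "
        "Always be warm, patient, and encouraging; never criticize a mistake. "
        "Demonstrate each syllable clearly so the child can watch your lip movement, "
        "then invite the child to repeat after you. "
        "If the child struggles, break the word into smaller syllables and try again."
    )

print(build_persona_prompt("Maya", "s"))  # example usage

Making qualities such as warmth, patience, and slow, clearly articulated speech explicit in the prompt is what keeps the avatar's responses anchored to the therapy context.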

Facial Animation

Through Convai, I was able to accurately animate the facial expressions of the 3D avatar, which proved beneficial for the child's learning process.
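As a toy illustration of why accurate mouth shapes matter as visual cues, the sketch below maps a few visemes (visual phonemes) to exaggerated mouth poses. The viseme labels, the ARKit-style blendshape names, and the weights are illustrative assumptions; in Echo itself the lip sync is driven through Convai and the MetaHuman facial rig, not by hand-written mappings like this.

# Toy example: each viseme corresponds to a distinct, visible mouth pose.
# Blendshape names follow the ARKit convention; the weights are made up.

VISEME_TO_BLENDSHAPES = {
    "PP": {"mouthClose": 1.0, "mouthPressLeft": 0.6, "mouthPressRight": 0.6},      # p, b, m
    "FF": {"mouthLowerDownLeft": 0.5, "mouthLowerDownRight": 0.5, "jawOpen": 0.1},  # f, v
    "AA": {"jawOpen": 0.8, "mouthStretchLeft": 0.2, "mouthStretchRight": 0.2},      # ah
    "OU": {"mouthFunnel": 0.9, "mouthPucker": 0.7, "jawOpen": 0.3},                 # oo
}

def cue_for(viseme: str) -> dict:
    """Return a slightly exaggerated mouth pose for a viseme, mirroring how
    therapists over-articulate in front of a mirror so the child can see it."""
    pose = VISEME_TO_BLENDSHAPES.get(viseme, {})
    return {shape: min(weight * 1.2, 1.0) for shape, weight in pose.items()}

print(cue_for("AA"))  # example usage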

3 The Challenges We Faced

It wasn't smooth sailing all the way. Here are some of the challenges we faced, each of which made the solution more robust.

I

Challenge faced: Users had strong reservations about the application.

Problem generated

A mobile-based solution is of limited value for children with severe motor-speech disorders.

Solution synthesized

We limited the solution to mild-to-moderate cases; human intervention remains imperative for severe cases.

II

Challenge faced: Latency and desynchronization between lip movement and audio.

Problem generated

Latency and audio-visual desynchronization caused user hesitance; lag in the lip movement led to confusion among users.

Solution synthesized

Convai reduced latency by 950 ms, leading to more natural conversation and a better user experience.
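The case study does not record the exact measurement protocol behind this figure, but one simple way to compare the two backends is to time the gap between the end of the child's utterance and the avatar's first response frame, as sketched below; request_avatar_response is a hypothetical placeholder, not a real Audio2Face or Convai API.

import time

def measure_latency_ms(request_avatar_response, utterance_audio) -> float:
    """Time one round trip from the finished utterance to the first response
    frame (audio chunk or lip-sync frame) returned by the backend under test."""
    start = time.perf_counter()
    request_avatar_response(utterance_audio)  # placeholder: blocks until the first frame arrives
    return (time.perf_counter() - start) * 1000.0

def average_latency_ms(samples) -> float:
    """Average several trials so one slow round trip does not skew the comparison."""
    return sum(samples) / len(samples)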

4 Testing and Iterating Early Ideas

First iteration

The first iteration focused on establishing the flow of the therapy process through a 3D avatar. Although limited in functionality, it laid the foundation for future iterations.

Second iteration

The second iteration optimized the user experience for children by enlarging the buttons to match a child's dexterity and shifting the color palette to softer, friendlier tones.

5 Refining through Feedback

Based on user research and testing, we identified design directions that we translated into the final iteration, with refined user flows and storytelling exercises.

Learning via 3D Model

The 3D avatar took center stage. The user experience was augmented with clear progression markers and a home screen underscoring the importance of prosodic variation.

Storytelling Exercises

Storytelling is an effective way to execute the last step of the app-based therapy, spontaneous production, because it checks the child's ability to speak spontaneously.

Clear Messages and Progression

Lastly, we implemented progression markers that show the child their learning progress, along with motivating toast messages to increase engagement.

6 The Impact We Achieved

What is the impact of integrating a 3D model when delivering therapy for childhood apraxia of speech (CAS)?

The feedback was positive: 86% of users described the application as engaging, once we had reduced latency and removed the desynchronization between lip movement and audio.

High Engagement!

6 of 7 participants were interested in learning more.

7 What Comes Next?

Full Body Animation

We plan to add full-body animation to the current model, so it can guide the child through hand gestures and make speech production more intuitive to understand.

Lessons We Learned

Epilogue

CONVERSATIONAL AI

I realized the importance of spontaneous conversational capabilities to drive engagement.

CHILD-COMPUTER INTERACTION

Echo primarily deals with children. I chose the color palette and button sizes after studying children's psychology and dexterity.

PROMPT DESIGN

Prompt design works in tandem with conversational AI. Warm, encouraging, and friendly were some of the qualities I emphasized through prompt design.

Next Project: Understanding Human Behaviors via AI Agents
