Augmenting Speech Therapy via LLMs and Metahumans
AI in Healthcare
Reimagining speech therapy
Prologue
We explore the integration of 3D avatars to augment remote speech therapy for children with motor-speech disorders.
We hypothesized that the inclusion of a 3D avatar would keep the child engaged, in addition to providing them with visual cues while learning syllables or words.
Please scroll to understand the design directions, solution synthesis, and the challenges we faced. To learn more about the research, switch to the following case study.
Summary
PROBLEM
Contemporary solutions use bulky hardware and are not covered by medical insurance, making remote speech therapy inaccessible to the masses.
SOLUTION
We designed a mobile-based remote speech therapy platform, Echo. It leverages 3D avatars to visually engage the child when learning syllables or practicing speaking patterns.
IMPACT
86% of the users reported a higher level of engagement when going through our application.
KEY DECISIONS
Designing a child-friendly user experience, multimodal text/speech input, and conversational interfaces.
MY ROLE
I supervised the design of the product and subsequently built the technical pipeline for conversational 3D avatars via Convai.
MY TEAM
1 HCAI practitioner (me), 3 HCI practitioners, 3 software developers
1 Why a 3D Model?
Multimodal Interaction (Text)
Through primary research, we found that speech therapists prefer to cue children by having them focus on lip movement. (User testing of the text interaction, text-to-speech)
Multimodal Interaction (Voice)

Children are often made to practice in front of a mirror. This made a strong case for incorporating a 3D avatar, which can take on a similar role when therapy is delivered remotely. (User testing of the voice interaction, speech-to-speech)
2 How We Built the Model
Metahuman Creator

I used Metahuman Creator to create high-fidelity 3D avatars that look friendly and welcoming. Incorporating subtle movements such as blinking and body sway kept the character feeling human rather than eerie or uncanny.
NVIDIA Audio2Face to Convai (Prompt Design)

The initial prototypes leveraged NVIDIA Audio2Face as a proof of concept, but I ran into latency that negatively affected the user experience.

Therefore, I switched to Convai, as its conversational capabilities and reduced latency make interactions more life-like. The critical aspect of using Convai was prompt design, which gives the 3D avatar the context and intention it needs to deliver speech therapy.
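To make the prompt-design step concrete, here is a minimal sketch of how a therapy persona prompt might be composed before it is attached to the avatar's conversational backend. The trait list, rules, and helper name are illustrative assumptions, not the exact prompt used in Echo; in Convai's character setup, text like this would typically live in the character's description or backstory field.

```python
# Illustrative sketch only: the traits, rules, and helper name are assumptions,
# not the exact prompt Echo uses or Convai's API.

THERAPY_PERSONA = {
    "role": "pediatric speech-therapy companion",
    "tone": ["warm", "encouraging", "friendly"],
    "rules": [
        "use short, simple sentences suited to a young child",
        "model each target syllable slowly and clearly before asking the child to repeat it",
        "praise every attempt; never criticize a mispronunciation",
        "stay on the current exercise until it is completed",
    ],
}

def build_persona_prompt(persona: dict, target_words: list[str]) -> str:
    """Compose a single persona/backstory prompt string from the spec above."""
    tone = ", ".join(persona["tone"])
    rules = "\n".join(f"- {rule}" for rule in persona["rules"])
    words = ", ".join(target_words)
    return (
        f"You are a {tone} {persona['role']} named Echo.\n"
        f"Today's practice words are: {words}.\n"
        f"Follow these rules:\n{rules}"
    )

if __name__ == "__main__":
    # Example: a session practicing early bilabial syllables.
    print(build_persona_prompt(THERAPY_PERSONA, ["ba", "ma", "mama"]))
```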
Facial Animation
Through Convai, I was able to drive the 3D avatar's facial expressions accurately, which proved beneficial for the child's learning process.



3 The Challenges We Faced
It wasn't smooth sailing all the way. Here are some of the challenges we faced along the way, each of which made the solution more robust.
I
Challenge faced: Users had strong reservations about the application.
Problem generated
A mobile-based solution is limited for children with severe motor-speech disorders.
Solution synthesized
We limited the solution to mild-to-moderate cases; human intervention remains imperative for severe cases.
II
Challenge faced: Latency and desynchronization between lip movement and audio.
Problem generated
Lag and desynchronized lip movement caused hesitance and confusion among users.
Solution synthesized
Switching to Convai reduced latency by 950 ms, leading to natural conversation and an improved user experience.
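For context on how a latency difference like this can be quantified, below is a minimal timing sketch: it wraps any request/response call to an avatar backend and reports the average wall-clock gap between sending the child's utterance and receiving the avatar's audio. The backend callables named in the usage comment are placeholders, not real APIs.

```python
import time
from statistics import mean
from typing import Callable

def measure_latency_ms(respond: Callable[[str], bytes], utterances: list[str]) -> float:
    """Average wall-clock latency (ms) between sending an utterance and
    receiving the avatar's audio response. `respond` stands in for whichever
    pipeline is under test (e.g. an Audio2Face- or Convai-backed backend)."""
    samples = []
    for text in utterances:
        start = time.perf_counter()
        respond(text)  # blocking call to the avatar backend
        samples.append((time.perf_counter() - start) * 1000.0)
    return mean(samples)

# Hypothetical usage, comparing two pipelines on the same test phrases:
# old = measure_latency_ms(audio2face_pipeline, test_phrases)
# new = measure_latency_ms(convai_pipeline, test_phrases)
# print(f"improvement: {old - new:.0f} ms")
```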
4 Testing and Iterating Early Ideas
First iteration





The first iteration focused on establishing the flow of the therapy process through a 3D avatar. Although limited in functionality, it laid the foundation for future iterations.
Second iteration





The second iteration optimized the user experience for children by increasing button sizes to match a child's dexterity and by shifting the color palette to a softer, friendlier tone.
5 Refining through Feedback
Based on user research and testing, we identified design directions which were translated into the final iteration, with refined user flows and storytelling exercises.
Learning via 3D Model




The 3D avatar took center stage. The user experience was augmented with clear progression markers and a home screen underscoring the importance of prosodic variation.
Storytelling Exercises



Storytelling is an effective way to carry out the last step of app-based therapy, spontaneous production, as it checks the child's ability to speak spontaneously.
Clear Messages and Progression

Lastly, we implemented progression markers that show the child their learning progress, along with motivating toast messages to increase engagement.
6 The Impact We Achieved
What is the impact of integrating a 3D model when delivering therapy for childhood apraxia of speech (CAS)?
The feedback was positive: 86% of users found the application engaging once we reduced latency and removed the desynchronization between lip movement and audio.

High Engagement!
6 of 7 participants were interested in learning more.
7 What Comes Next?
Full Body Animation
We plan to add full-body animation to the current model so it can guide the child with hand gestures, making speech production more intuitive to understand.
Lessons We Learned
Epilogue

CONVERSATIONAL AI
I realized the importance of spontaneous conversational capabilities to drive engagement.
Child-Computer Interaction
Echo primarily deals with children. I chose the color palette and adjusted the button sizes after studying children's psychology and dexterity.
PROMPT DESIGN
Prompt design works in tandem with conversational AI. Warm, encouraging, and friendly were some of the qualities I emphasized through prompt design.