Nora El-Zein, Peter Cady and Paul Fessele

Last updated: January 17, 2025 7:49 PM (GMT+1)

Mindfulness Master Class is a system that mimics a mindfulness coach by providing adaptive, contextually meaningful responses based on users' emotional states and speech. The system consists of two parts. The first, the User Perception subsystem, captures emotional expressions through image analysis and uses machine learning models to predict emotions. The second, the Interaction subsystem, lets users engage with a robot from Furhat Robotics, which processes both emotional and speech inputs during interactions.

The complete project can be found here:

GitHub - norael-zein/IntelligentInteractiveSystems: Course project

System Overview

An overview of the system is shown in the image below. The user's facial expressions and verbal input are captured with a camera and a microphone. Visual processing followed by model prediction detects valence, arousal and emotional states, while speech recognition converts verbal messages into text. This data is then processed to determine the user's emotional state. The Interaction subsystem uses the Furhat agent, a screen, and a speaker to provide visual and audio feedback. Text-to-speech generates verbal responses, and gestures make the interaction more engaging. The subsystem generates responses and adapts them to the user's affective state.

1. User Perception Subsystem

This subsystem focuses on real-time detection and analysis of user emotions through webcam input. Its primary objective is to automatically recognize affective states, such as valence and arousal, from one or more users interacting with Furhat. To achieve this, a comprehensive dataset containing thousands of images showcasing diverse emotional expressions has been used.

The system processes these images to extract Action Units (AUs), the individual facial muscle movements that serve as the fundamental building blocks for emotion detection. These features are then used as input to a machine learning pipeline, in which multiple algorithms are tested with a variety of configurations.

To optimize performance, the pipeline incorporates techniques such as parameter tuning and model selection, ensuring that the final model accurately predicts emotions based on the extracted features. The trained model is then capable of processing real-time inputs to detect and classify affective states dynamically during interaction.
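The tuning and model selection step can be illustrated with a minimal sketch. It assumes scikit-learn, which the text above does not name, and the two classifiers and their parameter grids are placeholders chosen for illustration, not the configurations used in the project. X stands for a feature matrix of AU values and y for the emotion labels.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def select_best_model(X, y, cv=3):
    """Grid-search several classifiers and return the fitted search object."""
    # The "clf" step is a placeholder that each grid below replaces.
    pipeline = Pipeline([("scale", StandardScaler()), ("clf", SVC())])

    # One dict per algorithm: the classifier itself plus its own parameter grid.
    # These candidates are assumptions for the sketch, not the project's choices.
    param_grid = [
        {"clf": [SVC()], "clf__C": [0.1, 1, 10], "clf__kernel": ["linear", "rbf"]},
        {"clf": [RandomForestClassifier(random_state=0)], "clf__n_estimators": [50, 100]},
    ]

    # Cross-validated accuracy decides which algorithm and configuration wins.
    search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
    search.fit(X, y)
    return search
```

After fitting, search.best_params_ reports the winning algorithm and configuration, and search.best_estimator_ is the refitted pipeline that could then be applied to AUs extracted from live frames.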

Visual processing of images

Webcam Input

The first part of the system captures the user's webcam input, for which we use OpenCV's VideoCapture class. This class lets us read individual frames from the webcam feed, which we then process further to extract the AUs corresponding to the user's face. Because frames are provided as NumPy arrays, which are incompatible with the next processing step, each frame is saved to a temporary file as a PIL image.
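The capture step might look roughly like the sketch below. It is an illustration under assumptions, not the project's code: the function names, the camera index 0, and the PNG format are hypothetical choices. The one detail worth noting is that OpenCV delivers frames in BGR channel order, while PIL expects RGB.

```python
import tempfile

import numpy as np
from PIL import Image


def grab_frame(camera_index: int = 0) -> np.ndarray:
    """Read a single frame from the webcam as a BGR uint8 array."""
    import cv2  # imported here so the helper below also works without OpenCV

    capture = cv2.VideoCapture(camera_index)
    try:
        ok, frame = capture.read()
        if not ok:
            raise RuntimeError("Could not read a frame from the webcam")
        return frame
    finally:
        capture.release()  # always free the camera device


def frame_to_temp_image(frame_bgr: np.ndarray) -> str:
    """Save an OpenCV frame to a temporary PNG file and return its path."""
    # Reverse the channel axis: OpenCV uses BGR, PIL expects RGB.
    frame_rgb = np.ascontiguousarray(frame_bgr[:, :, ::-1])
    with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmp:
        Image.fromarray(frame_rgb).save(tmp, format="PNG")
        return tmp.name
```

A caller would run frame_to_temp_image(grab_frame()) and pass the returned path to the feature extraction step, deleting the file once it has been processed.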

Feature Extraction

To extract the AUs from an image, we use the Detector class from Py-Feat, a facial expression analysis toolbox. Its detect_image method returns the detection results as a Pandas DataFrame, from which we keep all of (and only) the AU columns, in sequential order. This data is then passed along to the Machine Learning Pipeline to predict the emotion of the user.
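A minimal sketch of this step follows. The helper names are hypothetical, and the sketch assumes that AU columns are named "AU" followed by a number (for example AU01, AU12), which is the naming Py-Feat uses as far as we know. Selecting columns by that prefix keeps the result limited to AUs even if the returned frame carries other columns as well.

```python
import re
from typing import Iterable, List

# Assumed naming scheme for AU columns: "AU" followed by digits, e.g. "AU06".
AU_PATTERN = re.compile(r"^AU(\d+)$")


def au_column_names(columns: Iterable[str]) -> List[str]:
    """Return only the AU column names, sorted by AU number."""
    numbered = []
    for name in columns:
        match = AU_PATTERN.match(str(name))
        if match:
            numbered.append((int(match.group(1)), str(name)))
    return [name for _, name in sorted(numbered)]


def extract_aus(detector, image_path: str):
    """Run the detector on one image and keep only the AU columns.

    `detector` is expected to be a feat.Detector instance; it is passed in
    so that the (expensive) model loading happens once, not per frame.
    """
    result = detector.detect_image(image_path)
    return result[au_column_names(result.columns)]


# Example usage (requires Py-Feat to be installed):
#     from feat import Detector
#     detector = Detector()
#     aus = extract_aus(detector, "frame.png")  # "frame.png" is a placeholder path
```

Sorting by AU number gives the model a stable feature order regardless of how the columns happen to be arranged in the detection result.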