Affective Faces for Goal-Driven Dyadic Communication

Columbia University

Overview



We introduce a video framework for modeling the goal-conditioned association between verbal and non-verbal communication during dyadic conversation. Given the input speech of a speaker (left) and conditioned on a listener's goals, personality, or background (middle), our approach retrieves a video of a listener (right) whose facial expressions are socially appropriate for the context. We accomplish this by modeling conversations with a composition of large language models and vision-language models, creating internal representations that are interpretable and controllable. To study multimodal communication, we contribute a new video dataset of unscripted conversations covering diverse topics and demographics. Experiments and visualizations show that our approach retrieves listeners that are significantly more socially appropriate than those of baselines. Many challenges remain, however, and we release our dataset publicly to spur further progress.
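To make the pipeline concrete, below is a minimal sketch of the goal-conditioned retrieval idea: a language model describes how a listener with a given goal might react, a text encoder embeds that description, and the best-matching listener clip is retrieved by similarity search. All names here (describe_reaction, embed_text, retrieve_listener) and the random placeholder encoder are illustrative assumptions, not our actual models or API.

import numpy as np

def describe_reaction(transcript: str, goal: str) -> str:
    # The framework prompts a large language model at this step; this stub just
    # composes a plain-text description of the desired listener reaction.
    return f"A listener whose goal is to {goal}, reacting to: {transcript}"

def embed_text(description: str, dim: int = 512) -> np.ndarray:
    # Placeholder for a vision-language text encoder (e.g., a CLIP-style model).
    rng = np.random.default_rng(abs(hash(description)) % (2 ** 32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def retrieve_listener(transcript: str, goal: str,
                      candidate_embeddings: np.ndarray) -> int:
    # Return the index of the candidate listener clip whose (unit-norm) visual
    # embedding best matches the description of a goal-appropriate reaction.
    query = embed_text(describe_reaction(transcript, goal))
    scores = candidate_embeddings @ query  # cosine similarity
    return int(np.argmax(scores))

# Example usage with a random bank of pre-normalized listener clip embeddings.
bank = np.random.default_rng(0).normal(size=(1000, 512))
bank /= np.linalg.norm(bank, axis=1, keepdims=True)
idx = retrieve_listener("All exams have been canceled.", "act socially correct", bank)
print(f"Retrieved listener clip #{idx}")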

The RealTalk Video Dataset

Studying multimodal human dyadic interaction requires high-quality video of natural, unexaggerated conversations. We contribute the RealTalk dataset, consisting of 692 in-the-wild videos from The Skin Deep, a popular YouTube channel capturing long-form, unscripted personal conversations between diverse individuals about different facets of the human experience. Conversations in our dataset deal with topics such as family, dreams, relationships, illness, and mental health, among many others, enabling the dataset to organically capture a wide gamut of emotions and expressions (shown below). On average, each video in RealTalk is slightly over 10 minutes long, for a total of 115 hours. Alongside these videos, we release pre-computed ASR transcripts (from Whisper), visual embeddings (from various pretrained face models), and active speaker annotations (from TalkNet).
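The snippet below sketches one way the released pre-computed assets could be consumed. The directory layout, file names, and file formats are assumptions for illustration only; consult the dataset release for the actual structure.

import json
from pathlib import Path
import numpy as np

root = Path("RealTalk")          # hypothetical dataset root
video_id = "example_video"       # hypothetical video identifier

# Whisper ASR transcript: assumed to be a list of {"start", "end", "text"} segments.
with open(root / "transcripts" / f"{video_id}.json") as f:
    segments = json.load(f)

# Visual face embeddings: assumed to be one (num_frames, dim) array per video.
face_embeddings = np.load(root / "face_embeddings" / f"{video_id}.npy")

# TalkNet active-speaker scores: assumed to be one score per frame per face track.
speaker_scores = np.load(root / "active_speaker" / f"{video_id}.npy")

print(len(segments), face_embeddings.shape, speaker_scores.shape)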

Qualitative Results

We highlight several example outputs of our framework. For each speaker, we retrieve a listener conditioned on a specified goal and stitch the retrieved listener video together with the speaker's video for visualization (see the sketch below). The speaker's ASR transcript is shown with each example for convenience.
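The stitching step is simple video composition; the following shows one possible way to do it with ffmpeg, placing the speaker on the left, the retrieved listener on the right, and the transcript along the bottom. The file names and filter settings are illustrative, not the exact commands used to produce the visualizations on this page.

import subprocess

speaker, listener = "speaker.mp4", "retrieved_listener.mp4"   # placeholder paths
transcript = "All exams have been canceled."

# hstack assumes both clips share the same resolution; insert scale filters otherwise.
filter_graph = (
    "[0:v][1:v]hstack=inputs=2[stacked];"
    f"[stacked]drawtext=text='{transcript}':fontcolor=white:fontsize=28:"
    "x=(w-text_w)/2:y=h-60[out]"
)
subprocess.run([
    "ffmpeg", "-y", "-i", speaker, "-i", listener,
    "-filter_complex", filter_graph,
    "-map", "[out]", "-map", "0:a?",   # keep the speaker's audio track if present
    "stitched.mp4",
], check=True)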

Specified goals: Act socially correct (left) vs. Act socially incorrect (right)

Speaker transcript: "New York has been your dream for a while and you've been here for quite some time so I feel bad. I already feel like heaviness about having you move as opposed to me and we've talked about it a bunch."


Specified goals: Listener is a Ronaldo fan (left) vs. Listener is a Messi fan (right)

Speaker transcript: "Cristiano Ronaldo is the type of player working hard. But Messi is different. Messi is like you know, genius. He's like natural, you know. At training it's always easy, he is very good with the ball."


Specified goals: Listener is a lazy student (left) vs. Listener is Hermione Granger (right)

Speaker transcript: "Also in light of the recent events, as a (Hogwarts) school treat, all exams have been canceled."

Click here to see more qualitative results of our framework.

Paper


PDF
BibTeX Citation
@inproceedings{geng2023affective,
  title={Affective Faces for Goal-Driven Dyadic Communication},
  author={Geng, Scott and Teotia, Revant and Tendulkar, Purva and Menon, Sachit and Vondrick, Carl},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2023}
}

Acknowledgements

We thank Athena Tsu for her valuable assistance in creating visualizations, as well as Dídac Surís and Sruthi Sudhakar for their helpful feedback. This research is based on work partially supported by the DARPA CCU program under contract HR001122C0034 and the NSF NRI Award #2132519. Scott Geng is supported by the Rabi Scholarship; Sachit Menon is supported by the NSF GRFP fellowship. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the sponsors. The webpage template was inspired by this project page.