SLVideo: A Sign Language Video Moment Retrieval Framework

Sign Language Recognition has been studied and developed increasingly over the years to help deaf and hard-of-hearing individuals in their everyday social interactions. These technologies rely on manual sign recognition algorithms; however, most of them cannot recognise facial expressions, which are an essential part of sign language because they let the signer add expressiveness to their dialogue or even change the meaning of certain manual signs.

SLVideo is a video moment retrieval system for Sign Language videos that incorporates facial expressions, addressing this gap in existing technology by covering both hand and facial signs. The system extracts embedding representations for the hand and face signs from video frames to capture the signs in their entirety, so users can search for a specific sign language video segment with a text query or retrieve segments similar to a given sign language video.
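As a rough illustration of building one representation per video segment from hand and face embeddings, the sketch below mean-pools per-frame vectors and concatenates the two normalized results. The pooling and concatenation choices, the function name `segment_embedding`, and the toy vectors are all assumptions made for illustration; this page does not state how SLVideo combines the two modalities.

```python
import numpy as np

def segment_embedding(hand_frames, face_frames):
    """Build one vector for a segment (hypothetical fusion scheme).

    Each input has shape (num_frames, dim). The per-frame vectors are
    mean-pooled over time, scaled to unit length, and concatenated.
    """
    hand = np.asarray(hand_frames, dtype=np.float64).mean(axis=0)
    face = np.asarray(face_frames, dtype=np.float64).mean(axis=0)
    # Normalize each modality so neither dominates because of its scale.
    hand /= max(np.linalg.norm(hand), 1e-12)
    face /= max(np.linalg.norm(face), 1e-12)
    return np.concatenate([hand, face])

# Toy per-frame embeddings standing in for real model outputs.
hand_frames = np.array([[1.0, 0.0], [3.0, 0.0]])
face_frames = np.array([[0.0, 2.0], [0.0, 4.0]])
emb = segment_embedding(hand_frames, face_frames)
```

Normalizing before concatenation keeps the hand and face parts on a comparable scale; other fusion schemes, such as averaging or weighting the two vectors, would be equally plausible here.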

The system is evaluated on a dataset of eight hours of annotated Portuguese Sign Language videos, with a CLIP model generating the embeddings. Initial results in a zero-shot setting are promising.
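Text-to-segment search over such embeddings can be sketched as a cosine-similarity ranking. This is a minimal example under stated assumptions: the toy vectors stand in for text and segment embeddings that a CLIP model would produce in a shared space, and `rank_segments` and `l2_normalize` are hypothetical helper names, not part of SLVideo.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)

def rank_segments(query_emb, segment_embs, top_k=3):
    """Return (segment_index, cosine_score) pairs, best match first."""
    q = l2_normalize(np.asarray(query_emb, dtype=np.float64)[None, :])[0]
    s = l2_normalize(np.asarray(segment_embs, dtype=np.float64))
    scores = s @ q
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]

# Toy data: three segment embeddings and one text-query embedding.
segment_embs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
])
query_emb = np.array([0.9, 0.1, 0.0])
results = rank_segments(query_emb, segment_embs, top_k=2)
```

In a zero-shot setting no model fine-tuning is involved, so retrieval quality depends entirely on how well the pretrained embeddings already align text with sign video frames.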

In addition, SLVideo incorporates a thesaurus that lets users search for signs similar to those retrieved, using the video segment embeddings. Users can also edit existing annotations and create new ones, making it a collaborative tool for annotators working with the same videos.
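A thesaurus lookup of this kind can be approximated as a nearest-neighbour search among the segment embeddings themselves. The sketch below assumes cosine similarity as the measure and uses a hypothetical `similar_segments` function with toy vectors; the similarity measure SLVideo actually uses is not stated here.

```python
import numpy as np

def similar_segments(index, segment_embs, top_k=2):
    """Return indices of the segments most similar to segment `index`."""
    embs = np.asarray(segment_embs, dtype=np.float64)
    norms = np.linalg.norm(embs, axis=1, keepdims=True)
    embs = embs / np.clip(norms, 1e-12, None)
    scores = embs @ embs[index]
    # Exclude the query segment so it is never returned as its own neighbour.
    scores[index] = -np.inf
    order = np.argsort(-scores)[:top_k]
    return [int(i) for i in order]

# Toy embeddings for four annotated segments.
segment_embs = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.6, 0.8],
])
neighbours = similar_segments(0, segment_embs, top_k=2)
```

Because the lookup reuses the stored segment embeddings, it needs no text query: the retrieved segment itself acts as the query.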
