SLVideo: A Sign Language Video Moment Retrieval Framework

Submission to ECIR 2025
Universidade NOVA de Lisboa

Abstract

Sign language recognition has been increasingly studied and developed over the years to help deaf and hard-of-hearing individuals in their everyday social interactions. These technologies rely on manual sign recognition algorithms; however, most of them cannot recognise facial expressions, which are also an essential part of sign language, as they allow the signer to add expressiveness to their dialogue or even change the meaning of certain manual signs.

SLVideo is a video moment retrieval system for sign language videos that incorporates facial expressions, addressing the gap in existing technology by focusing on both hand and facial signs. The system extracts embedding representations of the hand and face signs from video frames to capture the language signs in their entirety, enabling users to search for a specific sign language video segment with a text query or with a similar sign language video.
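The retrieval step described above can be pictured as a cosine-similarity search over pooled frame embeddings. The sketch below is a minimal illustration, not SLVideo's actual implementation: the embedding dimension, the mean-pooling of frames into segments, and all function names are assumptions, with random vectors standing in for CLIP features.

```python
import numpy as np

def normalize(v):
    # L2-normalise so dot products become cosine similarities
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def pool_segment(frame_embeddings):
    # Represent a video segment by the mean of its frame embeddings
    return normalize(frame_embeddings.mean(axis=0))

def retrieve(query_embedding, segment_embeddings, k=3):
    # Rank segments by cosine similarity to the query embedding
    sims = segment_embeddings @ normalize(query_embedding)
    return np.argsort(-sims)[:k]

# Toy example: random 8-dim vectors stand in for real CLIP embeddings
rng = np.random.default_rng(0)
segments = normalize(rng.normal(size=(5, 8)))
query = segments[2] + 0.01 * rng.normal(size=8)  # query near segment 2
print(retrieve(query, segments, k=1))  # → [2]
```

In this scheme a text query and a video-segment query are handled identically: both are mapped into the same embedding space and ranked by the same similarity.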

To evaluate this system, a collection of eight hours of annotated Portuguese Sign Language videos is used as the dataset, and a CLIP model is used to generate the embeddings. The initial results are promising in a zero-shot setting.

In addition, SLVideo incorporates a thesaurus that enables users to search for similar signs to those retrieved, using the video segment embeddings. The users can also edit existing annotations and create new ones, making it a collaborative tool for annotators working with the same videos.
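The thesaurus lookup can be understood as a nearest-neighbour search over the stored segment embeddings, excluding the query segment itself. This is a hypothetical sketch under that assumption; the function and variable names are ours, not SLVideo's.

```python
import numpy as np

def similar_signs(index_embeddings, query_idx, k=3):
    """Return the k segments most similar to segment query_idx,
    excluding the query itself (cosine similarity on unit vectors)."""
    sims = index_embeddings @ index_embeddings[query_idx]
    sims[query_idx] = -np.inf  # never return the query segment
    return np.argsort(-sims)[:k]

# Toy index: unit vectors where segments 0 and 3 are near-duplicate signs
rng = np.random.default_rng(1)
emb = rng.normal(size=(6, 16))
emb[3] = emb[0] + 0.05 * rng.normal(size=16)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
print(similar_signs(emb, 0, k=1))  # → [3]
```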

Model Architecture


UI Design

UI Overview

Examples

All of the examples below were produced using the CAPIVARA model.


Searches


Thesaurus


Annotation Editing


Annotation Creation


Video Playback


Results

Automatic Evaluation

Results for searching the words "muito" (a lot), "correr" (run), "grande" (big), "pensar" (think), "lobo" (wolf), "dúvida" (doubt), "então" (so), "lebre" (hare) and "não" (no) using the seven frame embedding-based search techniques and the two embedding-generation models. For each signed word, the metric reported is the median F1 score across the frame embedding-based search options, alongside the F1 score of the annotation embedding-based search.
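To make the reported metric concrete, the frame-level F1 score for one query under one search technique can be computed as below, with the median then taken across techniques. This is a generic sketch of the metric, not the paper's actual evaluation code; the example overlaps are invented.

```python
from statistics import median

def f1(retrieved, relevant):
    """Frame-level F1: retrieved and relevant are sets of frame indices."""
    tp = len(retrieved & relevant)
    if tp == 0:
        return 0.0
    precision = tp / len(retrieved)
    recall = tp / len(relevant)
    return 2 * precision * recall / (precision + recall)

relevant = set(range(10, 20))         # ground-truth annotated frames
per_technique = [
    f1(set(range(8, 18)), relevant),  # technique A: partial overlap
    f1(set(range(10, 20)), relevant), # technique B: exact match
    f1(set(range(30, 40)), relevant), # technique C: complete miss
]
print(median(per_technique))  # → 0.8
```

Taking the median rather than the mean keeps the per-word score robust to a single technique failing completely, as technique C does here.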

BibTeX

@misc{martins2024slvideosignlanguagevideo,
      title={SLVideo: A Sign Language Video Moment Retrieval Framework}, 
      author={Gonçalo Vinagre Martins and Afonso Quinaz and Carla Viegas and Sofia Cavaco and João Magalhães},
      year={2024},
      eprint={2407.15668},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2407.15668}, 
}