Fingerspelling is a component of sign languages in which words are spelled out letter by letter using specific hand poses. Automatic fingerspelling recognition plays a crucial role in bridging the communication gap between Deaf and hearing communities, yet it remains challenging due to the signing-hand ambiguity issue, the lack of appropriate training losses, and the out-of-vocabulary (OOV) problem. Prior fingerspelling recognition methods rely on explicit signing-hand detection, which often leads to recognition failures, and on a connectionist temporal classification (CTC) loss, which exhibits the peaky-behavior problem. To address these issues, we develop OpenFS, an open-source approach for fingerspelling recognition and synthesis. We propose a multi-hand-capable fingerspelling recognizer that supports both single- and multi-hand inputs and performs implicit signing-hand detection by incorporating a dual-level positional encoding and a signing-hand focus (SF) loss. The SF loss encourages cross-attention to focus on the signing hand, enabling implicit signing-hand detection during recognition. Furthermore, without relying on the CTC loss, we introduce a monotonic alignment (MA) loss that constrains the output letter sequence to follow the temporal order of the input pose sequence through cross-attention regularization. In addition, we propose a frame-wise letter-conditioned generator that synthesizes realistic fingerspelling pose sequences for OOV words. This generator enables us to construct FSNeo, a new synthetic benchmark. Through comprehensive experiments, we demonstrate that our approach achieves state-of-the-art recognition performance and validate the effectiveness of the proposed recognizer and generator. We release the code and processed data.
Overview of the multi-hand-capable fingerspelling recognizer. The hand pose sequence is embedded into a feature space and encoded using our proposed dual-level positional encoding, which consists of hand-identity encoding (τ) and temporal positional encoding (η). The recognizer's decoder then predicts the next letter token based on the pose-aware, semantically rich encoder features. ψ denotes the standard positional encoding, and Wi represents the i-th letter of the word. <start> and <end> are special tokens indicating the start and end of the letter token sequence, respectively.
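The dual-level encoding described above can be sketched as follows. This is a schematic re-implementation from the caption, not the released code: the sinusoidal form of the temporal code η and the use of random vectors as a stand-in for a learned hand-identity table τ are illustrative assumptions.

```python
import numpy as np

def dual_level_positional_encoding(num_hands, num_frames, dim):
    """Sketch of a dual-level positional encoding: a hand-identity code
    (tau, one vector per hand) plus a temporal code (eta, one vector per
    frame). Token (h, t) receives tau[h] + eta[t]."""
    # Temporal encoding eta: a standard sinusoidal table over frames
    # (an assumption; the paper may use a learned table instead).
    pos = np.arange(num_frames)[:, None]                        # (T, 1)
    freq = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)
    eta = np.zeros((num_frames, dim))
    eta[:, 0::2] = np.sin(pos * freq)
    eta[:, 1::2] = np.cos(pos * freq)
    # Hand-identity encoding tau: one distinct vector per hand, shared
    # across all frames of that hand (random here, learned in practice).
    rng = np.random.default_rng(0)
    tau = rng.normal(size=(num_hands, dim))
    # Broadcast-sum so every token carries both its hand identity and
    # its temporal position.
    return tau[:, None, :] + eta[None, :, :]                    # (H, T, D)
```

Adding the two codes (rather than concatenating them) keeps the embedding dimension fixed while still letting the encoder tell hands apart at the same timestep.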
Overview of the signing-hand focus (SF) and monotonic alignment (MA) losses. (a) The signing-hand focus (SF) loss ℒSF measures the entropy of the hand-wise attention distribution derived from the cross-attention map between input hand pose tokens and output letter tokens. Minimizing this entropy encourages the recognizer to focus on the single signing hand. (b) The monotonic alignment (MA) loss ℒMA penalizes misalignments that violate the natural temporal order between input hand pose tokens and output letter tokens in fingerspelling. Reducing these violations encourages the model to interpret the hand pose tokens in a temporally coherent manner when predicting the letter tokens.
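The two losses can be sketched directly from the caption. Everything below is a schematic re-implementation under stated assumptions: the attention map is taken to have shape (L letters, H hands, T frames), and the MA loss is realized as a hinge on each letter's expected attended frame index, which is one simple way to penalize order violations.

```python
import numpy as np

def sf_loss(attn):
    """Signing-hand focus (SF) loss sketch. attn: cross-attention map of
    shape (L, H, T). Marginalize over time to get a per-letter hand
    distribution, then take its entropy; minimizing this entropy pushes
    each letter token to attend to a single hand."""
    hand_dist = attn.sum(axis=2)                          # (L, H)
    hand_dist = hand_dist / hand_dist.sum(axis=1, keepdims=True)
    ent = -(hand_dist * np.log(hand_dist + 1e-9)).sum(axis=1)
    return ent.mean()

def ma_loss(attn):
    """Monotonic alignment (MA) loss sketch. Compute each letter token's
    expected frame index under its attention weights and penalize cases
    where a later letter attends, on average, to an earlier frame."""
    time_dist = attn.sum(axis=1)                          # (L, T)
    time_dist = time_dist / time_dist.sum(axis=1, keepdims=True)
    centers = time_dist @ np.arange(attn.shape[2])        # (L,) expected frames
    violations = np.maximum(centers[:-1] - centers[1:], 0.0)
    return violations.mean()
```

A perfectly monotonic, single-hand attention map incurs (near-)zero loss under both terms; attention that drifts backward in time or spreads across hands is penalized.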
Overview of the coarse-to-fine frame-wise letter annotation method. (a) We utilize the cross-attention map between input hand pose tokens and output letter tokens to generate coarse frame-wise letter annotations, where φ denotes a non-letter annotation. (b) To refine the coarse frame-wise letter annotations, we freeze the pre-trained recognizer and train a frame-wise annotation refiner supervised by the coarse annotations. (c) The trained refiner produces refined frame-wise letter annotations; the coarse and refined annotations are compared with the corresponding image frames, where each label–frame pair is linked with arrows and mismatched cases are highlighted in red.
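The coarse annotation step in (a) can be sketched as an argmax over the letter-to-frame cross-attention. The thresholding rule for assigning the non-letter label φ is an illustrative assumption; the paper's exact criterion may differ.

```python
import numpy as np

def coarse_frame_annotations(attn, letters, threshold=0.5):
    """Sketch of coarse frame-wise letter annotation: give each frame the
    letter whose cross-attention weight on that frame is largest, and
    label weakly attended frames with the non-letter annotation phi.

    attn: (L, T) letter-to-frame attention; letters: list of L characters.
    """
    frame_scores = attn.max(axis=0)      # (T,) strongest attention per frame
    frame_letter = attn.argmax(axis=0)   # (T,) which letter wins each frame
    return [letters[i] if s > threshold else "φ"
            for i, s in zip(frame_letter, frame_scores)]
```

These coarse labels are noisy, which is why step (b) trains a refiner against them rather than using them directly as supervision for the generator.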
Overview of the frame-wise letter-conditioned generator. Wi is the i-th letter of the word, |W| is the word length, ⊗ denotes concatenation, and ψ denotes the standard positional encoding. The generator embeds each letter token and each noised pose vector through their respective embedding layers. The resulting letter and pose embeddings are concatenated frame-wise and, given a diffusion timestep, are denoised by the generator encoder to produce a clean hand-pose sequence.
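The generator's per-frame input assembly can be sketched as follows. This is a minimal sketch of the conditioning path only (not the diffusion encoder itself); the random table standing in for a learned letter-embedding layer is an assumption, and the frame-wise letter ids are assumed to come from the refined annotations.

```python
import numpy as np

def build_generator_input(letter_ids, noised_pose, letter_dim, rng=None):
    """Sketch of frame-wise letter conditioning: embed each frame's letter
    label and concatenate it with that frame's noised pose vector, yielding
    the per-frame sequence the generator encoder denoises.

    letter_ids: (T,) integer letter label per frame.
    noised_pose: (T, pose_dim) noised hand-pose vectors.
    Returns: (T, letter_dim + pose_dim).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    vocab = int(max(letter_ids)) + 1
    letter_table = rng.normal(size=(vocab, letter_dim))  # stand-in embedding
    letter_emb = letter_table[np.asarray(letter_ids)]    # (T, letter_dim)
    return np.concatenate([letter_emb, noised_pose], axis=1)
```

Concatenating per frame (rather than prepending the letters as a prefix) gives the encoder an explicit letter condition at every timestep, which matches the frame-wise annotations the refiner produces.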
Qualitative recognition results on ChicagoFSWild. For each example, we show the input frames, the ground-truth letters, and the predictions from PoseNet, PoseNet†, Ours, and Ours†. The symbol † denotes models trained with additional synthetic data generated by our frame-wise letter-conditioned generator. We also report the letter accuracy for each prediction. Colored blocks indicate different types of prediction outcomes: blue for ground-truth letters, green for correct predictions, purple for substitution errors, red for deletion errors, and yellow for insertion errors. The symbol ˆ denotes a space character. For the first example in the third case, earlier input frames are omitted to save space.
Generated using the input word "Denver" (15 fps, 0.5× playback speed).