Fingerspelling is a component of sign languages in which words are spelled out letter by letter using specific hand poses. Automatic fingerspelling recognition plays a crucial role in bridging the communication gap between Deaf and hearing communities, yet it remains challenging due to the signing-hand ambiguity issue, the lack of appropriate training losses, and the out-of-vocabulary (OOV) problem. Prior fingerspelling recognition methods rely on explicit signing-hand detection, which often leads to recognition failures, and on a connectionist temporal classification (CTC) loss, which exhibits the peaky-behavior problem. To address these issues, we develop OpenFS, an open-source approach for fingerspelling recognition and synthesis. We propose a multi-hand-capable fingerspelling recognizer that supports both single- and multi-hand inputs and performs implicit signing-hand detection by incorporating a dual-level positional encoding and a signing-hand focus (SF) loss. The SF loss encourages cross-attention to focus on the signing hand, enabling implicit signing-hand detection during recognition. Furthermore, without relying on the CTC loss, we introduce a monotonic alignment (MA) loss that constrains the output letter sequence to follow the temporal order of the input pose sequence through cross-attention regularization. In addition, we propose a frame-wise letter-conditioned generator that synthesizes realistic fingerspelling pose sequences for OOV words. This generator enables us to construct FSNeo, a new synthetic benchmark. Through comprehensive experiments, we demonstrate that our approach achieves state-of-the-art recognition performance and validate the effectiveness of the proposed recognizer and generator. We release the code and processed data.
Overview of the multi-hand-capable fingerspelling recognizer. The hand pose sequence is embedded into a feature space and encoded using our proposed dual-level positional encoding, which consists of hand-identity encoding (τ) and temporal positional encoding (η). The recognizer's decoder then predicts the next letter token based on the pose-aware, semantically rich encoder features. ψ denotes the standard positional encoding, and Wi represents the i-th letter of the word. <start> and <end> are special tokens indicating the start and end of the letter token sequence, respectively.
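The dual-level encoding described above can be sketched as follows. This is a schematic re-implementation from the caption, not the released code: the sinusoidal form of the temporal code η and the use of random vectors as a stand-in for a learned hand-identity table τ are illustrative assumptions.

```python
import numpy as np

def dual_level_positional_encoding(num_hands, num_frames, dim):
    """Sketch of a dual-level positional encoding: a hand-identity code
    (tau, one vector per hand) plus a temporal code (eta, one vector per
    frame). Token (h, t) receives tau[h] + eta[t]."""
    # Temporal encoding eta: a standard sinusoidal table over frames
    # (an assumption; the paper may use a learned table instead).
    pos = np.arange(num_frames)[:, None]                        # (T, 1)
    freq = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)
    eta = np.zeros((num_frames, dim))
    eta[:, 0::2] = np.sin(pos * freq)
    eta[:, 1::2] = np.cos(pos * freq)
    # Hand-identity encoding tau: one distinct vector per hand, shared
    # across all frames of that hand (random here, learned in practice).
    rng = np.random.default_rng(0)
    tau = rng.normal(size=(num_hands, dim))
    # Broadcast-sum so every token carries both its hand identity and
    # its temporal position.
    return tau[:, None, :] + eta[None, :, :]                    # (H, T, D)
```

Adding the two codes (rather than concatenating them) keeps the embedding dimension fixed while still letting the encoder tell hands apart at the same timestep.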
Overview of the signing-hand focus (SF) and monotonic alignment (MA) losses. (a) The signing-hand focus (SF) loss ℒSF measures the entropy of the hand-wise attention distribution derived from the cross-attention map between input hand pose tokens and output letter tokens. Minimizing this entropy encourages the recognizer to focus on the single signing hand. (b) The monotonic alignment (MA) loss ℒMA penalizes misalignments that violate the natural temporal order between input hand pose tokens and output letter tokens in fingerspelling. Reducing these violations encourages the model to interpret the hand pose tokens in a temporally coherent manner when predicting the letter tokens.
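The two losses can be sketched directly from the caption. Everything below is a schematic re-implementation under stated assumptions: the attention map is taken to have shape (L letters, H hands, T frames), and the MA loss is realized as a hinge on each letter's expected attended frame index, which is one simple way to penalize order violations.

```python
import numpy as np

def sf_loss(attn):
    """Signing-hand focus (SF) loss sketch. attn: cross-attention map of
    shape (L, H, T). Marginalize over time to get a per-letter hand
    distribution, then take its entropy; minimizing this entropy pushes
    each letter token to attend to a single hand."""
    hand_dist = attn.sum(axis=2)                          # (L, H)
    hand_dist = hand_dist / hand_dist.sum(axis=1, keepdims=True)
    ent = -(hand_dist * np.log(hand_dist + 1e-9)).sum(axis=1)
    return ent.mean()

def ma_loss(attn):
    """Monotonic alignment (MA) loss sketch. Compute each letter token's
    expected frame index under its attention weights and penalize cases
    where a later letter attends, on average, to an earlier frame."""
    time_dist = attn.sum(axis=1)                          # (L, T)
    time_dist = time_dist / time_dist.sum(axis=1, keepdims=True)
    centers = time_dist @ np.arange(attn.shape[2])        # (L,) expected frames
    violations = np.maximum(centers[:-1] - centers[1:], 0.0)
    return violations.mean()
```

A perfectly monotonic, single-hand attention map incurs (near-)zero loss under both terms; attention that drifts backward in time or spreads across hands is penalized.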
Overview of the coarse-to-fine frame-wise letter annotation method. (a) We utilize the cross-attention map between input hand pose tokens and output letter tokens to generate coarse frame-wise letter annotations, where φ denotes a non-letter annotation. (b) To refine the coarse frame-wise letter annotations, we freeze the pre-trained recognizer and train a frame-wise annotation refiner supervised by the coarse annotations. (c) The trained refiner produces refined frame-wise letter annotations; the coarse and refined annotations are compared with the corresponding image frames, where each label–frame pair is linked with arrows and mismatched cases are highlighted in red.
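The coarse annotation step in (a) can be sketched as an argmax over the letter-to-frame cross-attention. The thresholding rule for assigning the non-letter label φ is an illustrative assumption; the paper's exact criterion may differ.

```python
import numpy as np

def coarse_frame_annotations(attn, letters, threshold=0.5):
    """Sketch of coarse frame-wise letter annotation: give each frame the
    letter whose cross-attention weight on that frame is largest, and
    label weakly attended frames with the non-letter annotation phi.

    attn: (L, T) letter-to-frame attention; letters: list of L characters.
    """
    frame_scores = attn.max(axis=0)      # (T,) strongest attention per frame
    frame_letter = attn.argmax(axis=0)   # (T,) which letter wins each frame
    return [letters[i] if s > threshold else "φ"
            for i, s in zip(frame_letter, frame_scores)]
```

These coarse labels are noisy, which is why step (b) trains a refiner against them rather than using them directly as supervision for the generator.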
Overview of the frame-wise letter-conditioned generator. Wi is the i-th letter of the word, |W| is the word length, ⊗ denotes concatenation, and ψ denotes the standard positional encoding. The generator embeds each letter token and each noised pose vector through their respective embedding layers. The resulting letter and pose embeddings are concatenated frame-wise and, given a diffusion timestep, are denoised by the generator encoder to produce a clean hand-pose sequence.
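The generator's per-frame input assembly can be sketched as follows. This is a minimal sketch of the conditioning path only (not the diffusion encoder itself); the random table standing in for a learned letter-embedding layer is an assumption, and the frame-wise letter ids are assumed to come from the refined annotations.

```python
import numpy as np

def build_generator_input(letter_ids, noised_pose, letter_dim, rng=None):
    """Sketch of frame-wise letter conditioning: embed each frame's letter
    label and concatenate it with that frame's noised pose vector, yielding
    the per-frame sequence the generator encoder denoises.

    letter_ids: (T,) integer letter label per frame.
    noised_pose: (T, pose_dim) noised hand-pose vectors.
    Returns: (T, letter_dim + pose_dim).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    vocab = int(max(letter_ids)) + 1
    letter_table = rng.normal(size=(vocab, letter_dim))  # stand-in embedding
    letter_emb = letter_table[np.asarray(letter_ids)]    # (T, letter_dim)
    return np.concatenate([letter_emb, noised_pose], axis=1)
```

Concatenating per frame (rather than prepending the letters as a prefix) gives the encoder an explicit letter condition at every timestep, which matches the frame-wise annotations the refiner produces.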
Qualitative recognition results on ChicagoFSWild. For each example, we show the input frames, the ground-truth letters, and the predictions from PoseNet, PoseNet†, Ours, and Ours†. The symbol † denotes models trained with additional synthetic data generated by our frame-wise letter-conditioned generator. We also report the letter accuracy for each prediction. Colored blocks indicate different types of prediction outcomes: blue for ground-truth letters, green for correct predictions, purple for substitution errors, red for deletion errors, and yellow for insertion errors. The symbol ˆ denotes a space character. For the first example in the third case, earlier input frames are omitted to save space.
Generated using the input word "Denver" (15 fps, 0.5× playback speed).