Try Sign Translate to experience state-of-the-art sign language translation technology.

Introduction

Signed languages (also known as sign languages) are languages that use the visual-gestural modality to convey meaning through manual articulations in combination with non-manual elements like the face and body. They are the primary means of communication for many deaf and hard-of-hearing individuals. Similar to spoken languages, signed languages are natural languages governed by a set of linguistic rules (Sandler and Lillo-Martin 2006), both having emerged through an abstract, protracted developmental process and evolving without meticulous planning. Signed languages are not universal or mutually intelligible, despite often having striking similarities among them. They are also distinct from spoken languages: American Sign Language (ASL), for example, is not a visual form of English but its own unique language.

Sign Language Processing (Bragg et al. 2019; Yin et al. 2021) is an emerging field of artificial intelligence concerned with the automatic processing and analysis of sign language content. It is a subfield of both Natural Language Processing (NLP) and Computer Vision (CV), although, to date, research has focused more on the visual aspects of signed languages. Challenges in sign language processing frequently involve machine translation of sign language videos into spoken language text (sign language translation), production of sign language videos from spoken language text (sign language production), and sign language recognition for sign language understanding.

Unfortunately, the latest advances in language-based artificial intelligence, like machine translation and personal assistants, expect a spoken language input (text or transcribed speech), excluding around 200 to 300 different signed languages (United Nations 2022) and up to 70 million deaf people (World Health Organization 2021; World Federation of the Deaf 2022).

Throughout history, Deaf communities have fought for the right to learn and use signed languages and for the public recognition of signed languages as legitimate languages. Indeed, signed languages are sophisticated communication modalities that are at least as capable as spoken languages in all respects, linguistic and social. However, in a predominantly oral society, deaf people are constantly encouraged to use spoken languages through lip-reading or text-based communication. The exclusion of signed languages from modern language technologies further suppresses signing in favor of spoken languages. This exclusion disregards the preferences of the Deaf communities, who strongly prefer to communicate in signed languages both online and in day-to-day in-person interactions, among themselves and when interacting with spoken language communities (C. A. Padden and Humphries 1988; Glickman and Hall 2018). Thus, it is essential to make signed languages accessible.

To date, a large amount of research on Sign Language Processing (SLP) has focused on the visual aspect of signed languages, led by the Computer Vision (CV) community, with little NLP involvement. This focus is not unreasonable, given that a decade ago, we lacked adequate CV tools to process videos for further linguistic analyses. However, like spoken languages, signed languages are fully-fledged systems that exhibit all the fundamental characteristics of natural languages, and current SLP techniques fail to address or leverage the linguistic structure of signed languages. Signed languages introduce novel challenges for NLP due to their visual-gestural modality, simultaneity, spatial coherence, and lack of written form. The lack of a written form makes spoken language processing pipelines, which often start by transcribing audio before further processing, incompatible with signed languages, forcing researchers to work directly on the raw video signal.

Moreover, SLP is not only intellectually appealing but also an important research area with solid potential to benefit signing communities. Examples of beneficial applications enabled by signed language technologies include better documentation of endangered sign languages; educational tools for sign language learners; tools for query and retrieval of information from signed language videos; personal assistants that react to signed languages; real-time automatic sign language interpretation; and more. Needless to say, in addressing this research area, researchers should work alongside and under the direction of deaf communities, placing the interests of the signing communities above all else (Harris, Holmes, and Mertens 2009).

In this work, we describe the different representations used for sign language processing and survey the various tasks and recent advances in them. We also compile a comprehensive list of existing datasets and make the publicly available ones easy to load through a simple, standardized interface.

(Brief) History of Signed Languages and Deaf Culture

Throughout modern history, spoken languages were dominant, so much so that signed languages struggled to be recognized as languages in their own right and educators developed misconceptions that signed language acquisition might hinder the development of speech skills. For example, in 1880, a large international conference of deaf educators called the “Second International Congress on Education of the Deaf” banned teaching signed languages, favoring speech therapy instead. It was not until the seminal work on American Sign Language (ASL) by Stokoe (1960) that signed languages started gaining recognition as natural, independent, and well-defined languages, which inspired other researchers to further explore signed languages as a research area. Nevertheless, antiquated notions that deprioritized signed languages continue to do harm and subject many to linguistic neglect (Humphries et al. 2016). Several studies have shown that deaf children raised solely with spoken languages do not gain enough access to a first language during their critical period of language acquisition (Murray, Hall, and Snoddon 2020). This language deprivation can lead to life-long consequences on the cognitive, linguistic, socio-emotional, and academic development of the deaf (Hall, Levin, and Anderson 2017).

Signed languages are the primary languages of communication for the Deaf1 and are at the heart of Deaf communities. Failing to recognize signed languages as fully-fledged natural language systems in their own right has had harmful effects in the past, and in an increasingly digitized world, NLP research should strive to enable a world in which all people, including the Deaf, have access to languages that fit their lived experience.

Sign Language Linguistics Overview

Signed languages consist of phonological, morphological, syntactic, and semantic levels of structure that fulfill the same social, cognitive, and communicative purposes as other natural languages. While spoken languages primarily channel the oral-auditory modality, signed languages use the visual-gestural modality, relying on the signer’s face, hands, body, and space around them to create distinctions in meaning. We present the linguistic features of signed languages2 that researchers must consider during their modeling.

Phonology

Signs are composed of minimal units that combine manual features such as hand configuration, palm orientation, placement, contact, path movement, local movement, as well as non-manual features including eye aperture, head movement, and torso positioning (Liddell and Johnson 1989; Johnson and Liddell 2011; Brentari 2011; Sandler 2012). Not all possible phonemes are realized in both signed and spoken languages, and inventories of two languages’ phonemes/features may not overlap completely. Different languages are also subject to rules for the allowed combinations of features.

Simultaneity

Though an ASL sign takes about twice as long to produce as an English word, the rates of information transmission in the two languages are similar (Bellugi and Fischer 1972). One way signed languages compensate for the slower production rate of signs is through simultaneity: signed languages use multiple visual cues to convey different information simultaneously (Sandler 2012). For example, the signer may produce the sign for ‘cup’ on one hand while simultaneously pointing to the actual cup with the other to express “that cup”. Similar to tone in spoken languages, the face and torso can convey additional affective information (Liddell and others 2003; Johnston and Schembri 2007). Facial expressions can modify adjectives, adverbs, and verbs; a head shake can negate a phrase or sentence; eye direction can help indicate referents.

Referencing

The signer can introduce referents in discourse either by pointing to their actual locations in space or by assigning a region in the signing space to a non-present referent and then pointing to this region to refer to it (Rathmann and Mathur 2011; Schembri, Cormier, and Fenlon 2018). Signers can also establish relations between referents grounded in signing space by using directional signs or embodying the referents using body shift or eye gaze (Dudis 2004; Liddell and Metzger 1998). Spatial referencing also impacts morphology when the directionality of a verb depends on the locations assigned to its subject and/or object (Beuzeville 2008; Fenlon, Schembri, and Cormier 2018): for example, a directional verb can start at its subject’s location and end at its object’s location. While the relation between referents and verbs in spoken language is more arbitrary, referent relations are usually grounded in signed languages. The visual space is heavily exploited to make referencing clear.

Another way anaphoric entities are referenced in sign language is by using classifiers or depicting signs (Supalla 1986; Wilcox and Hafer 2004; Roy 2011) that help describe the characteristics of the referent. Classifiers are typically one-handed signs that do not have a particular location or movement assigned to them, or derive features from meaningful discourse (Liddell and others 2003), so they can be used to convey how the referent relates to other entities, describe its movement, and give more details. For example, to tell about a car swerving and crashing, one might use the hand classifier for a vehicle, move it to indicate swerving, and crash it with another entity in space.

To quote someone other than oneself, signers perform role shift (Cormier, Smith, and Sevcikova-Sehyr 2015), where they may physically shift in space to mark the distinction and take on some characteristics of the people they represent. For example, to recount a dialogue between a taller and a shorter person, the signer may shift to one side and look up when taking the shorter person’s role, shift to the other side and look down when taking the taller person’s role.

Fingerspelling

Fingerspelling results from language contact between a signed language and the written form of a surrounding spoken language (Battison 1978; Wilcox 1992; Brentari and Padden 2001; Patrie and Johnson 2011). A set of manual gestures corresponds to a written orthography or phonetic system. This phenomenon, found in most signed languages, is often used to indicate names, places, or new concepts from the spoken language, but has often become integrated into the signed languages as another linguistic strategy (Padden 1998; Montemurro and Brentari 2018).

Sign Language Representations

Representation is a significant challenge for SLP. Unlike spoken languages, signed languages have no widely adopted written form. As signed languages are conveyed through the visual-gestural modality, video recording is the most straightforward way to capture them. However, as videos include more information than needed for modeling and are expensive to record, store, and transmit, a lower-dimensional representation has been sought after.

The following figure illustrates each signed language representation we will describe below. In this demonstration, we deconstruct the video into its individual frames to exemplify the alignment of the annotations between the video and representations.

Videos

are the most straightforward representation of a signed language and can amply incorporate the information conveyed through signing. One major drawback of using videos is their high dimensionality: they usually include more information than needed for modeling and are expensive to store, transmit, and encode. As facial features are essential in signing, anonymizing raw videos remains an open problem, limiting the possibility of making these videos publicly available (Isard 2020).

Poses

reduce the visual cues from videos to skeleton-like wireframes or meshes representing the locations of joints. While motion capture equipment can often provide better-quality pose estimation, it is expensive and intrusive, so estimating poses from videos is currently the preferred method (Pishchulin et al. 2012; Chen et al. 2017; Cao et al. 2019; Güler, Neverova, and Kokkinos 2018). Compared to video representations, accurate poses are lower in complexity and semi-anonymized, while incurring relatively low information loss. However, they remain a continuous, multidimensional representation that is not adapted to most NLP models.
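Concretely, a pose sequence is commonly handled as a dense array of shape (frames, joints, dimensions). As a minimal sketch (the anchor joint and normalization scheme here are illustrative assumptions, not a standard), one might center and scale such a sequence before feeding it to a model:

```python
import numpy as np

def normalize_pose_sequence(poses: np.ndarray, anchor_joint: int = 0) -> np.ndarray:
    """Center a pose sequence on an anchor joint (e.g., the neck) and
    scale it so the mean distance of all joints from the anchor is 1.

    poses: array of shape (frames, joints, dims), e.g. (T, 137, 2).
    """
    centered = poses - poses[:, anchor_joint:anchor_joint + 1, :]
    # Mean distance of every joint from the anchor, over the whole sequence.
    scale = np.linalg.norm(centered, axis=-1).mean()
    return centered / scale

# A toy sequence: 4 frames, 3 joints, 2D coordinates.
poses = np.random.rand(4, 3, 2)
normalized = normalize_pose_sequence(poses)
print(normalized.shape)  # (4, 3, 2)
```

Normalization of this kind makes pose features comparable across signers and camera setups, at the cost of discarding absolute position and scale.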

Written notation systems

represent signs as discrete visual features. Some systems are written linearly, and others use graphemes in two dimensions. While various universal (Sutton 1990; Prillwitz and Zienert 1990) and language-specific notation systems (Stokoe Jr 2005; Kakumasu 1968; Bergman 1977) have been proposed, no writing system has been adopted widely by any sign language community, and the lack of standards hinders the exchange and unification of resources and applications between projects. The figure above depicts two universal notation systems: SignWriting (Sutton 1990), a two-dimensional pictographic system, and HamNoSys (Prillwitz and Zienert 1990), a linear stream of graphemes designed to be machine-readable.

Glossing

is the transcription of signed languages sign-by-sign, where every sign has a unique identifier. Although various sign language corpus projects have provided gloss annotation guidelines (Mesch and Wallin 2015; Johnston and De Beuzeville 2016; Konrad et al. 2018), again, there is yet to be a single agreed-upon standard. Linear gloss annotations are also an imprecise representation of signed language: they do not adequately capture all the information expressed simultaneously through different cues (e.g., body posture, eye gaze) or spatial relations, which leads to an inevitable information loss, up to the semantic level, that affects downstream performance on SLP tasks (Yin and Read 2020).

The following table additionally exemplifies the various representations for more isolated signs. For this example, we use SignWriting as the notation system. Note that the same sign might have two unrelated glosses, and the same gloss might have multiple valid spoken language translations.

| Video | Pose Estimation | Notation | Gloss | English Translation |
| --- | --- | --- | --- | --- |
| ASL HOUSE | ASL HOUSE | HOUSE | HOUSE | House |
| ASL WRONG-WHAT | ASL WRONG-WHAT | WRONG-WHAT | WRONG-WHAT | What’s the matter? / What’s wrong? |
| ASL DIFFERENT | ASL DIFFERENT | DIFFERENT | DIFFERENT / BUT | Different / But |

Tasks

So far, SLP research has mainly been led by the computer vision community, focusing on processing the visual features in signed language videos. As a result, current SLP methods do not fully address the linguistic complexity of signed languages. We survey common SLP tasks and current methods’ limitations, drawing on signed languages’ linguistic theories.

Sign Language Detection

Sign language detection (Borg and Camilleri 2019; Moryossef et al. 2020) is the binary classification task to determine whether a signed language is being used in a given video frame. A similar task in spoken languages is voice activity detection (VAD) (Sohn, Kim, and Sung 1999; Ramırez et al. 2004), the detection of when a human voice is used in an audio signal. However, as VAD methods often rely on speech-specific representations such as spectrograms, they are not necessarily applicable to videos.

Borg and Camilleri (2019) introduced the classification of frames taken from YouTube videos as either signing or not signing. They take a spatial and temporal approach based on a VGG-16 CNN (Simonyan and Zisserman 2015) to encode each frame and use a GRU (Cho et al. 2014) to encode the sequence of frames in a window of 20 frames at 5 fps. In addition to the raw frame, they either encode optical-flow history, aggregated motion history, or frame difference.

Moryossef et al. (2020) improved upon their method by performing sign language detection in real time. They identified that sign language use involves movement of the body and, as such, designed a model that works on top of estimated human poses rather than directly on the video signal. They calculate the optical flow norm of every joint detected on the body and apply a shallow yet effective contextualized model to predict for every frame whether the person is signing or not.
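In the spirit of this approach, the pose-based motion signal can be sketched as follows; averaging the joint norms and using a fixed threshold are illustrative simplifications standing in for the paper’s trained contextual classifier:

```python
import numpy as np

def joint_flow_norm(poses: np.ndarray, fps: float) -> np.ndarray:
    """Optical-flow norm of each joint between consecutive frames.

    poses: (frames, joints, 2) array of 2D joint positions.
    Returns an array of shape (frames - 1, joints), scaled by fps so the
    value reflects movement per second rather than per frame.
    """
    flow = np.diff(poses, axis=0)            # (frames - 1, joints, 2)
    return np.linalg.norm(flow, axis=-1) * fps

def is_signing(poses: np.ndarray, fps: float, threshold: float = 0.5) -> np.ndarray:
    """Crude stand-in for the shallow contextualized classifier: label a
    frame transition as 'signing' when average joint motion exceeds a threshold."""
    motion = joint_flow_norm(poses, fps).mean(axis=-1)
    return motion > threshold
```

A real detector would learn the decision boundary from data and use temporal context rather than a hand-picked threshold.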

While these recent detection models achieve high performance, we need well-annotated data that include interference and distractions with non-signing instances for proper real-world evaluation.

Sign Language Identification

Sign language identification (Gebre, Wittenburg, and Heskes 2013; Monteiro et al. 2016) classifies which signed language is used in a given video.

Gebre, Wittenburg, and Heskes (2013) found that a simple random-forest classifier utilizing the distribution of phonemes can distinguish between British Sign Language (BSL) and Greek Sign Language (GSL) with a 95% F1 score. This finding is further supported by Monteiro et al. (2016), which, based on activity maps in signing space, manages to differentiate between British Sign Language and French Sign Language (Langue des Signes Française, LSF) with a 98% F1 score in videos with static backgrounds, and between American Sign Language and British Sign Language with a 70% F1 score for videos mined from popular video-sharing sites. The authors attribute their success mainly to the different fingerspelling systems, which are two-handed in BSL and one-handed in ASL and LSF.

Although these pairwise classification results seem promising, better models would be needed for classifying from a large set of signed languages. These methods only rely on low-level visual features, while signed languages have several distinctive features on a linguistic level, such as lexical or structural differences (McKee and Kennedy 2000; Kimmelman 2014; Ferreira-Brito 1984; Shroyer and Shroyer 1984) which have not been explored for this task.
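To give intuition for the activity-map features mentioned above, a simplified version, a normalized 2D histogram of hand positions in signing space, could be computed like this (the grid resolution and the use of a single hand joint are illustrative assumptions):

```python
import numpy as np

def activity_map(hand_positions: np.ndarray, bins: int = 8) -> np.ndarray:
    """2D histogram of normalized hand positions over a video.

    hand_positions: (frames, 2) array with coordinates in [0, 1).
    Returns a (bins, bins) map normalized to sum to 1, usable as a
    flattened feature vector for a downstream language classifier.
    """
    hist, _, _ = np.histogram2d(
        hand_positions[:, 0], hand_positions[:, 1],
        bins=bins, range=[[0, 1], [0, 1]],
    )
    return hist / hist.sum()
```

Such a map captures where in signing space the hands spend their time, which is exactly the kind of low-level visual feature these identification methods rely on.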

Sign Language Segmentation

Segmentation consists of detecting the frame boundaries for signs or phrases in videos to divide them into meaningful units. While the most canonical way of dividing a spoken language text is into a linear sequence of words, due to the simultaneity of sign language, the notion of a sign language “word” is ill-defined, and sign language cannot be fully linearly modeled.

Current methods resort to segmenting units loosely mapped to signed language units (Santemiz et al. 2009; Farag and Brock 2019; Bull, Gouiffès, and Braffort 2020; Renz, Stache, Albanie, et al. 2021; Renz, Stache, Fox, et al. 2021; Bull et al. 2021), and do not leverage reliable linguistic predictors of sentence boundaries such as prosody in signed languages (i.e., pauses, sign duration, facial expressions, eye apertures) (Sandler 2010; Ormel and Crasborn 2012).
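To illustrate what one such prosodic cue could provide, the following hypothetical sketch (not one of the surveyed methods) marks candidate boundaries where overall pose motion stays low for several consecutive frames, i.e., a pause:

```python
import numpy as np

def pause_boundaries(poses: np.ndarray, threshold: float = 0.05,
                     min_pause: int = 3) -> list:
    """Return indices of frame transitions where a pause begins.

    poses: (frames, joints, 2) pose sequence. A pause is min_pause
    consecutive frame transitions whose mean joint displacement stays
    below threshold.
    """
    # Mean joint displacement between consecutive frames: (frames - 1,)
    motion = np.linalg.norm(np.diff(poses, axis=0), axis=-1).mean(axis=-1)
    low = motion < threshold
    boundaries, run = [], 0
    for i, quiet in enumerate(low):
        run = run + 1 if quiet else 0
        if run == min_pause:  # pause just confirmed; record where it began
            boundaries.append(i - min_pause + 1)
    return boundaries
```

Pauses are only one prosodic signal; sign duration, facial expressions, and eye aperture would need to be modeled alongside it.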

Santemiz et al. (2009) automatically extract isolated signs from continuous signing by aligning the sequences obtained via speech recognition, modeled with Dynamic Time Warping (DTW) and Hidden Markov Model (HMM) approaches.

Farag and Brock (2019) use a random forest classifier to distinguish frames containing words in Japanese Sign Language based on the composition of spatio-temporal angular and distance features between domain-specific pairs of joint segments.

Bull, Gouiffès, and Braffort (2020) segment French Sign Language into subtitle units by detecting the temporal boundaries of subtitles aligned with sign language videos, leveraging a spatio-temporal graph convolutional network with a BiLSTM on 2D skeleton data.

Renz, Stache, Albanie, et al. (2021) determine the location of temporal boundaries between signs in continuous sign language videos by employing 3D convolutional neural network representations with iterative temporal segment refinement to resolve ambiguities between sign boundary cues. Renz, Stache, Fox, et al. (2021) further propose the Changepoint-Modulated Pseudo-Labelling (CMPL) algorithm to solve the problem of source-free domain adaptation.

Bull et al. (2021) present a Transformer-based approach to segment sign language videos and align them with subtitles simultaneously, encoding subtitles with BERT and videos with CNN video representations.

Sign Language Recognition, Translation, and Production

Sign language translation (SLT) commonly refers to the translation of signed language to spoken language. Sign language production is the reverse process of producing a sign language video from spoken language text. Sign language recognition (SLR) (Adaloglou et al. 2020) detects and labels signs from a video, either on isolated (Imashev et al. 2020; Sincan and Keles 2020) or continuous (Cui, Liu, and Zhang 2017; Cihan Camgöz et al. 2018; N. C. Camgöz et al. 2020b) signs.

In the following graph, we can see a fully connected pentagon where each node is a single data representation, and each directed edge represents the task of converting one data representation to another.

We split the graph into two:

Sign language tasks graph: language-agnostic tasks and language-specific tasks

There are 20 tasks conceptually defined by this graph, with varying amounts of previous research. Every path between two nodes might or might not be valid, depending on how lossy the tasks in the path are.


Video-to-Pose

Video-to-Pose—commonly known as pose estimation—is the task of detecting human figures in images and videos, so that one could determine, for example, where someone’s elbow shows up in an image. It was shown (Vogler and Goldenstein 2005) that the face pose correlates with facial non-manual features like head direction.

This area has been thoroughly researched (Pishchulin et al. 2012; Chen et al. 2017; Cao et al. 2019; Güler, Neverova, and Kokkinos 2018), with objectives ranging from predicting 2D or 3D poses to selecting a small, specific set of landmarks or estimating a dense mesh of a person.

OpenPose (Cao et al. 2019; Simon et al. 2017; Cao et al. 2017; Wei et al. 2016) is the first multi-person system to jointly detect human body, hand, facial, and foot keypoints (135 keypoints in total) in 2D on single images. While their model can estimate the full pose directly from an image in a single inference pass, they also suggest a pipeline approach in which they first estimate the body pose and then independently estimate the hand and face poses by acquiring higher-resolution crops around those areas. Building on the slow pipeline approach, a single-network whole-body OpenPose model has been proposed (Hidalgo et al. 2019), which is faster and more accurate for the case of obtaining all keypoints. With multiple recording angles, OpenPose also offers keypoint triangulation to reconstruct the pose in 3D.

Güler, Neverova, and Kokkinos (2018) take a different approach with DensePose. Instead of classifying, for every keypoint, which pixel is most likely, they suggest, similarly to semantic segmentation, classifying for each pixel which body part it belongs to. Then, for each pixel, knowing the body part, they predict where that pixel is on the body part relative to a 2D projection of a representative body model. This approach results in the reconstruction of the full-body mesh and allows sampling to find specific keypoints similar to OpenPose.

However, 2D human poses might not be sufficient to fully understand the position and orientation of landmarks in space, and applying pose estimation per frame does not take the video temporal movement information into account, especially in cases of rapid movement, which contain motion blur.

Pavllo et al. (2019) developed two methods to convert 2D poses to 3D poses. The first, a supervised method, was trained to use the temporal information between frames to predict the missing Z-axis. The second is an unsupervised method that leverages the fact that the 2D poses are merely a projection of an unknown 3D pose, training a model to estimate the 3D pose and back-project it to the input 2D poses. This back-projection is a deterministic process, applying constraints on the 3D pose encoder. Zelinka and Kanis (2020) follow a similar process and add a constraint for bones to keep a fixed length between frames.
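A fixed bone-length constraint of this kind can be sketched directly on a lifted pose sequence; the skeleton below is a hypothetical list of (parent, child) joint pairs, and enforcing the median length per bone is an illustrative simplification:

```python
import numpy as np

def bone_lengths(pose: np.ndarray, bones) -> np.ndarray:
    """Length of each bone (a pair of joint indices) in one 3D pose."""
    return np.array([np.linalg.norm(pose[a] - pose[b]) for a, b in bones])

def enforce_bone_lengths(poses: np.ndarray, bones) -> np.ndarray:
    """Rescale each child joint so that every bone keeps its median length
    across the sequence, a simple stand-in for a fixed-length constraint.

    poses: (frames, joints, 3). Bones are (parent, child) pairs, listed
    so that a parent is adjusted before its children.
    """
    target = np.median([bone_lengths(p, bones) for p in poses], axis=0)
    fixed = poses.astype(float).copy()
    for t in range(len(fixed)):
        for (parent, child), length in zip(bones, target):
            vec = fixed[t, child] - fixed[t, parent]
            norm = np.linalg.norm(vec)
            if norm > 0:  # leave degenerate (zero-length) bones untouched
                fixed[t, child] = fixed[t, parent] + vec / norm * length
    return fixed
```

In a learned model this constraint would typically be a training loss rather than a post-hoc projection, but the invariant being enforced is the same.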

Panteleris, Oikonomidis, and Argyros (2018) suggest converting the 2D poses to 3D using inverse kinematics (IK), a process taken from computer animation and robotics to calculate the variable joint parameters needed to place the end of a kinematic chain, such as a robot manipulator or an animation character’s skeleton, in a given position and orientation relative to the start of the chain. Demonstrating their approach on hand pose estimation, they explicitly encode the constraints and limits of each joint, resulting in 26 degrees of freedom. Then, non-linear least-squares minimization fits a 3D model of the hand to the estimated 2D joint positions, recovering the 3D hand pose. This process is similar to the back-projection used by Pavllo et al. (2019), except that here, no temporal information is used.
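For intuition, the IK idea can be reduced to its simplest instance: a planar two-link chain solved in closed form with the law of cosines. This analytic special case is only a stand-in for the 26-degree-of-freedom non-linear least-squares fit described above:

```python
import math

def two_link_ik(x: float, y: float, l1: float, l2: float):
    """Joint angles (shoulder, elbow) placing a two-link planar chain's
    end effector at (x, y). Raises ValueError if the target is out of reach."""
    d2 = x * x + y * y
    cos_elbow = (d2 - l1 * l1 - l2 * l2) / (2 * l1 * l2)
    if not -1 <= cos_elbow <= 1:
        raise ValueError("target out of reach")
    elbow = math.acos(cos_elbow)
    shoulder = math.atan2(y, x) - math.atan2(l2 * math.sin(elbow),
                                             l1 + l2 * math.cos(elbow))
    return shoulder, elbow

def forward(shoulder: float, elbow: float, l1: float, l2: float):
    """Forward kinematics for the same chain, to verify a solution."""
    x = l1 * math.cos(shoulder) + l2 * math.cos(shoulder + elbow)
    y = l1 * math.sin(shoulder) + l2 * math.sin(shoulder + elbow)
    return x, y
```

For chains with many joints and joint limits, such as a hand, no closed form exists, which is why the paper resorts to iterative least-squares optimization.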

MediaPipe Holistic (Grishchenko and Bazarevsky 2020) attempts to solve the 3D pose estimation problem directly by taking a similar approach to OpenPose, having a pipeline system to estimate the body, then the face and hands. Unlike OpenPose, the estimated poses are in 3D, and the pose estimator runs in real time on CPU, allowing for pose-based sign language models on low-powered mobile devices. This pose estimation tool is widely available and built for Android, iOS, C++, Python, and the Web using JavaScript.

Pose-to-Video

Pose-to-Video, also known as motion transfer or skeletal animation in the field of robotics and animation, is the conversion of a sequence of poses to a realistic-looking video. This task is the final “rendering” for sign language production to make the produced sign language look human.

Chan et al. (2019) demonstrate a semi-supervised approach where they take a set of videos, run pose estimation with OpenPose (Cao et al. 2019), and learn an image-to-image translation (Isola et al. 2017) between the rendered skeleton and the original video. They demonstrate their approach on human dancing, where they can extract poses from a choreography and render any person as if they were performing that dance. They predict two consecutive frames for temporally coherent video results and introduce a separate pipeline for more realistic face synthesis, although it is still flawed.

Wang et al. (2018) suggest a similar method using DensePose (Güler, Neverova, and Kokkinos 2018) representations in addition to the OpenPose (Cao et al. 2019) ones. They formalize a different model, with various objectives to optimize for, such as background-foreground separation and temporal coherence by using the previous two timestamps in the input.

Using the same method as Chan et al. (2019) in “Everybody Dance Now”, Giró-i-Nieto (2020) asks, “Can Everybody Sign Now?” They evaluate the generated videos by asking signers to perform various tasks after watching them, comparing the signers’ ability to perform these tasks on the original videos, rendered pose videos, and reconstructed videos. They show that subjects prefer synthesized realistic videos over skeleton visualizations, and that out-of-the-box synthesis methods are not effective enough, as subjects struggled to understand the reconstructed videos.

As a direct response, Saunders, Camgöz, and Bowden (2020b) show that, as in Chan et al. (2019), where an adversarial loss is added specifically to generate the face, adding a similar loss to the hand-generation process yields high-resolution, more photo-realistic continuous sign language videos.

Deepfakes is a technique to replace a person in an existing image or video with someone else’s likeness (Nguyen et al. 2019). This technique can be used to improve the unrealistic face synthesis produced by models that are not specialized for faces, or even to replace cartoon faces rendered by animated 3D models.


Pose-to-Gloss

Pose-to-Gloss—also known as sign language recognition—is the task of recognizing a sequence of signs from a sequence of poses. Though some previous works have referred to this as “sign language translation”, recognition merely determines the associated label of each sign, without handling the syntax and morphology of the signed language (C. Padden 1988) to create a spoken language output. Instead, SLR has often been used as an intermediate step during translation to produce glosses from signed language videos.

Jiang et al. (2021) propose a novel Skeleton Aware Multi-modal Framework with a Global Ensemble Model (GEM) for isolated SLR (SAM-SLR-v2) to learn and fuse multimodal feature representations. Specifically, they use a Sign Language Graph Convolution Network (SL-GCN) to model the embedded dynamics of skeleton keypoints and a Separable Spatial-Temporal Convolution Network (SSTCN) to exploit skeleton features. The proposed late-fusion GEM fuses the skeleton-based predictions with other RGB and depth-based modalities to provide global information and make an accurate SLR prediction.

Dafnis et al. (2022) work on the same modified WLASL dataset as Jiang et al. (2021), but do not require multimodal data input. Instead, they propose a bidirectional skeleton-based graph convolutional network framework with linguistically motivated parameters and attention to the start and end frames of signs. They cooperatively use forward and backward data streams, including various sub-streams, as input. They also use pre-training to leverage transfer learning.

Gloss-to-Pose

Gloss-to-Pose—also known as sign language production—is the task of producing a sequence of poses that adequately represent a sequence of signs written as gloss.

To produce a sign language video, Stoll et al. (2018) construct a lookup table between glosses and sequences of 2D poses. They align all pose sequences at the neck joint of a reference skeleton and group all sequences belonging to the same gloss. Then, for each group, they apply dynamic time warping and average all sequences in the group to construct the mean pose sequence. This approach suffers from not having an accurate set of poses aligned to each gloss and from unnatural motion transitions between glosses.
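The core operation of this lookup-table construction, aligning pose sequences with dynamic time warping and averaging them along the alignment path, can be sketched for a pair of sequences as follows (the full method averages all sequences in a gloss group):

```python
import numpy as np

def dtw_path(a: np.ndarray, b: np.ndarray):
    """Dynamic-time-warping alignment path between two pose sequences,
    each of shape (frames, joints, dims), using Euclidean frame distance."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from the end to recover the alignment path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def mean_pose_sequence(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Average two pose sequences along their DTW alignment path."""
    return np.stack([(a[i] + b[j]) / 2 for i, j in dtw_path(a, b)])
```

Averaging along the warped path, rather than frame-by-frame, is what lets sequences of different lengths contribute to a single mean sequence.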

To alleviate the downsides of the previous work, Stoll et al. (2020) construct a lookup table from glosses to groups of pose sequences rather than creating a mean pose sequence. They build a Motion Graph (Min and Chai 2012), a Markov process used to generate new motion sequences that are representative of natural motion, and select the motion primitives (sequences of poses) per gloss with the highest transition probability. To smooth that sequence and reduce unnatural motion, they use a Savitzky–Golay motion transition smoothing filter (Savitzky and Golay 1964).
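A Savitzky–Golay filter smooths a trajectory by fitting a low-degree polynomial over a sliding window and keeping the fitted value at the window center; a minimal per-coordinate sketch (window size and degree are the usual tunable parameters, and the edges are left unfiltered here for simplicity):

```python
import numpy as np

def savgol_smooth(signal: np.ndarray, window: int = 5, degree: int = 2) -> np.ndarray:
    """Savitzky–Golay smoothing of a 1D trajectory (e.g., one pose
    coordinate over time) via least-squares polynomial fits on windows."""
    half = window // 2
    smoothed = signal.astype(float).copy()
    xs = np.arange(-half, half + 1)
    for t in range(half, len(signal) - half):
        coeffs = np.polyfit(xs, signal[t - half:t + half + 1], degree)
        smoothed[t] = np.polyval(coeffs, 0)  # value of the fit at the center
    return smoothed
```

Unlike a plain moving average, the polynomial fit preserves local peaks and velocities, which matters for natural-looking motion transitions.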


Video-to-Gloss

Video-to-Gloss—also known as sign language recognition—is the task of recognizing a sequence of signs from a video.

For this recognition, Cui, Liu, and Zhang (2017) construct a three-step optimization model. First, they train a video-to-gloss end-to-end model, where they encode the video using a spatio-temporal CNN encoder and predict the gloss using Connectionist Temporal Classification (CTC) (Graves et al. 2006). Then, from the CTC alignment and category proposals, they encode each gloss-level segment independently, training it to predict the gloss category, and use this encoding of the gloss video segments to optimize the sequence learning model.
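With CTC, the model emits one label (or a blank) per frame, and the gloss sequence is obtained by collapsing consecutive repeats and removing blanks; greedy decoding therefore reduces to:

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Collapse a frame-level CTC labeling into a gloss sequence:
    merge consecutive repeats, then drop blanks."""
    decoded, previous = [], None
    for label in frame_labels:
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded

# e.g. frames labeled [A, A, -, B, B, -, A] with '-' the blank
print(ctc_greedy_decode([1, 1, 0, 2, 2, 0, 1]))  # [1, 2, 1]
```

The blank symbol is what allows the same gloss to appear twice in a row: a blank between two runs of the same label keeps them from being merged.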

Cihan Camgöz et al. (2018) take a fundamentally different approach and formulate the problem as one of natural-language translation. They encode each video frame using AlexNet (Krizhevsky, Sutskever, and Hinton 2012), initialized with weights trained on ImageNet (Deng et al. 2009). Then they apply a GRU encoder-decoder architecture with Luong attention (Luong, Pham, and Manning 2015) to generate the gloss. In follow-up work, N. C. Camgöz et al. (2020b) replace the GRU with a transformer encoder (Vaswani et al. 2017) and use CTC to decode the gloss, showing a slight improvement on the video-to-gloss task.

Adaloglou et al. (2020) perform a comparative experimental assessment of computer vision-based methods for the video-to-gloss task. They implement various approaches from previous research (Camgöz et al. 2017; Cui, Liu, and Zhang 2019; Vaezi Joze and Koller 2019) and test them on multiple datasets (Huang et al. 2018; Cihan Camgöz et al. 2018; Von Agris and Kraiss 2007; Vaezi Joze and Koller 2019) either for isolated sign recognition or continuous sign recognition. They conclude that 3D convolutional models outperform models using only recurrent networks to capture the temporal information, and that these models are more scalable given the restricted receptive field, which results from the CNN “sliding window” technique.

Gloss-to-Video

Gloss-to-Video—also known as sign language production—is the task of producing a video that adequately represents a sequence of signs written as gloss.

As of 2020, no research discusses the direct translation task between gloss and video. This lack of discussion results from the computational impracticality of the desired model, leading researchers to refrain from performing this task directly and instead rely on pipeline approaches using intermediate pose representations.


Gloss-to-Text

Gloss-to-Text—also known as sign language translation—is the natural language processing task of translating between gloss text representing sign language signs and spoken language text. These texts commonly differ in terminology, capitalization, and sentence structure.

Cihan Camgöz et al. (2018) experiment with various machine-translation architectures, comparing LSTMs vs. GRUs for the recurrent model, Luong attention (Luong, Pham, and Manning 2015) vs. Bahdanau attention (Bahdanau, Cho, and Bengio 2015), and various batch sizes. They conclude that on the RWTH-PHOENIX-Weather-2014T dataset, introduced in the same work, GRUs with Luong attention and a batch size of 1 outperform all other configurations.

In parallel with the advancements in spoken language machine translation, Yin and Read (2020) proposed replacing the RNN with a Transformer (Vaswani et al. 2017) encoder-decoder model, showing improvements on both RWTH-PHOENIX-Weather-2014T (DGS) and ASLG-PC12 (ASL) datasets both using a single model and ensemble of models. Interestingly, in gloss-to-text, they show that using the sign language recognition (video-to-gloss) system output outperforms using the gold annotated glosses.

Building on the code published by Yin and Read (2020), Moryossef et al. (2021) show it is beneficial to pre-train these translation models using augmented monolingual spoken language corpora. They try three different approaches for data augmentation: (1) Back-translation; (2) General text-to-gloss rules, including lemmatization, word reordering, and dropping of words; (3) Language-pair-specific rules augmenting the spoken language syntax to its corresponding sign language syntax. When pretraining, all augmentations show improvements over the baseline for RWTH-PHOENIX-Weather-2014T (DGS) and NCSLGR (ASL).
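A minimal sketch of rule-based text-to-gloss augmentation in the spirit of approach (2). The stopword list, the crude suffix-stripping "lemmatizer", and the uppercase gloss convention are illustrative assumptions, not the exact rules of the paper:

```python
import random

# Illustrative function words that glossing conventions often omit.
STOPWORDS = {"the", "a", "an", "is", "are", "to"}

def text_to_gloss_rules(sentence, drop_prob=0.0, rng=None):
    """Turn a spoken language sentence into synthetic glosses by
    dropping stopwords, optionally dropping random words, crudely
    stripping inflectional suffixes, and uppercasing."""
    rng = rng or random.Random(0)
    glosses = []
    for word in sentence.lower().split():
        if word in STOPWORDS or rng.random() < drop_prob:
            continue
        for suffix in ("ing", "ed", "s"):  # naive lemmatization
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                word = word[: -len(suffix)]
                break
        glosses.append(word.upper())
    return " ".join(glosses)
```

Pairs of (original sentence, synthetic gloss) produced this way can then serve as monolingual augmentation data for pre-training a gloss-to-text model.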

Text-to-Gloss

Text-to-gloss—also known as sign language translation—is the task of translating between a spoken language text and sign language glosses.

Zhao et al. (2000) used a Tree Adjoining Grammar (TAG) based system to translate English sentences into American Sign Language glosses. They parse the English text and simultaneously assemble an American Sign Language gloss tree, using Synchronous TAGs (Shieber and Schabes 1990; Shieber 1994), by associating the ASL elementary trees with the English elementary trees and associating the nodes at which subsequent substitutions or adjunctions can occur. Synchronous TAGs had been used for machine translation between spoken languages (Abeillé, Schabes, and Joshi 1991), but this is the first application to a signed language.

For automatic text-to-gloss translation, Othman and Jemni (2012) identified the need for a large parallel corpus of sign language glosses and spoken language text. They develop a part-of-speech-based grammar to transform English sentences from the Gutenberg Project ebooks collection (Lebert 2008) into American Sign Language gloss. Their final corpus contains over 100 million synthetic sentences and 800 million words and is the most extensive English-ASL gloss corpus we know of. Unfortunately, it is hard to attest to the quality of the corpus, as they did not evaluate their method on real English-ASL gloss pairs, and only a small sample of the corpus is available online.


Video-to-Text

Video-to-text—also known as sign language translation—is the task of translating a raw video to spoken language text.

N. C. Camgöz et al. (2020b) proposed a single architecture to perform this task that can use both the sign language gloss and the spoken language text in joint supervision. They use the pre-trained spatial embeddings from Koller et al. (2019) to encode each frame independently and encode the frames with a transformer. On this encoding, they use a Connectionist Temporal Classification (CTC) (Graves et al. 2006) to classify the sign language gloss. Using the same encoding, they use a transformer decoder to decode the spoken language text one token at a time. They show that adding gloss supervision improves the model over not using it and that it outperforms previous video-to-gloss-to-text pipeline approaches (Cihan Camgöz et al. 2018).

Following up, N. C. Camgöz et al. (2020a) propose a new architecture that does not require the supervision of glosses, named “Multi-channel Transformers for Multi-articulatory Sign Language Translation”. In this approach, they crop the signing hand and the face and perform 3D pose estimation to obtain three separate data channels. They encode each data channel separately using a transformer, then encode all channels together and concatenate the separate channels for each frame. Like their previous work, they use a transformer decoder to decode the spoken language text, but unlike their previous work, do not use the gloss as additional supervision. Instead, they add two “anchoring” losses to predict the hand shape and mouth shape from each frame independently, as silver annotations are available to them using the model proposed in Koller et al. (2019). They conclude that this approach is on-par with previous approaches requiring glosses, and so they have broken the dependency upon costly annotated gloss information in the video-to-text task.

Text-to-Video

Text-to-Video—also known as sign language production—is the task of producing a video that adequately represents a spoken language text in sign language.

As of 2020, no research discusses the direct translation task between text and video. This lack of discussion results from the computational impracticality of the desired model, leading researchers to refrain from performing this task directly and instead rely on pipeline approaches using intermediate pose representations.


Pose-to-Text

Pose-to-text—also known as sign language translation—is the task of translating a captured or estimated pose sequence to spoken language text.

Ko et al. (2019) demonstrate impressive performance on the pose-to-text task by feeding the pose sequence into a standard encoder-decoder translation network. They experiment both with GRUs using various types of attention (Luong, Pham, and Manning 2015; Bahdanau, Cho, and Bengio 2015) and with a Transformer (Vaswani et al. 2017), and show similar performance, with the transformer underperforming on the validation set but outperforming on the test set, which consists of unseen signers. They experiment with various normalization schemes, mainly subtracting the mean and dividing by the standard deviation of every individual keypoint, computed either over the entire frame or over the relevant “object” (body, face, and hand).
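The keypoint-wise normalization scheme can be sketched as follows; the array shape and the small `eps` guard against zero variance are assumptions for illustration:

```python
import numpy as np

def normalize_pose(pose, eps=1e-8):
    """Normalize a pose sequence of shape (frames, keypoints, channels)
    by subtracting the mean and dividing by the standard deviation of
    every individual keypoint, computed over all frames."""
    mean = pose.mean(axis=0, keepdims=True)   # per-keypoint mean
    std = pose.std(axis=0, keepdims=True)     # per-keypoint std
    return (pose - mean) / (std + eps)
```

Normalizing per "object" (body, face, hand) instead would amount to computing the statistics over each keypoint subset separately.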

Text-to-Pose

Text-to-Pose—also known as sign language production—is the task of producing a sequence of poses that adequately represent a spoken language text in sign language. Most efforts use poses as an intermediate representation to overcome the challenges of generating videos directly, with the goal of using computer animation or pose-to-video models to perform video production.

Saunders, Camgöz, and Bowden (2020c) propose Progressive Transformers, a model to translate from discrete spoken language sentences to continuous 3D sign pose sequences in an autoregressive manner. Unlike symbolic transformers (Vaswani et al. 2017), which use a discrete vocabulary and thus can predict an end-of-sequence (EOS) token in every step, the progressive transformer predicts a counter ∈ [0, 1] in addition to the pose. At inference time, counter = 1 is treated as the end of the sequence. They test their approach on the RWTH-PHOENIX-Weather-2014T dataset using OpenPose 2D pose estimation, uplifted to 3D (Zelinka and Kanis 2020), and show favorable results when evaluating using back-translation from the generated poses to spoken language. They further show (Saunders, Camgöz, and Bowden 2020a) that using an adversarial discriminator between the ground-truth and generated poses, conditioned on the input spoken language text, improves the production quality as measured by back-translation.
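The counter-based stopping criterion can be sketched independently of the transformer itself; here `step_fn` is a placeholder standing in for the trained progressive decoder, which predicts a pose and a counter from the history:

```python
def progressive_decode(step_fn, max_len=100):
    """Autoregressive pose decoding with a counter in [0, 1]:
    decoding stops once the counter reaches 1, the continuous
    analogue of an EOS token."""
    poses = []
    while len(poses) < max_len:
        pose, counter = step_fn(poses)  # predict next pose + progress
        poses.append(pose)
        if counter >= 1.0:
            break
    return poses
```

A dummy decoder that advances the counter by 0.2 per step, for example, would stop after five poses.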

To overcome the issues of under-articulation seen in the above works, Saunders, Camgöz, and Bowden (2020b) expand the progressive transformer model with a Mixture Density Network (MDN) (Bishop 1994) to model the variation found in sign language. While this model underperforms on the validation set compared to previous work, it outperforms on the test set.

Zelinka and Kanis (2020) present a similar autoregressive decoder approach, with added dynamic time warping (DTW) and soft attention. They test their approach on Czech Sign Language weather data extracted from the news, which is neither manually annotated nor aligned to the spoken language captions, and show that their DTW is advantageous for this kind of task.

Xiao, Qin, and Yin (2020) close the loop by proposing a text-to-pose-to-text model for isolated sign language recognition. They first train a classifier that takes a sequence of poses encoded by a BiLSTM and classifies the relevant sign. They then propose a production system that takes a single sign and samples a constant-length sequence of 50 poses from a Gaussian Mixture Model. These components are combined such that, given a sign class y, a pose sequence is generated and classified back into a sign class ŷ; the loss is applied between y and ŷ, not directly on the generated pose sequence. They evaluate their approach on the CSL dataset (Huang et al. 2018) and show that the generated pose sequences almost reach the same classification performance as the reference sequences.

In the absence of suitable automatic evaluation methods for generated signs, existing works resort to measuring back-translation quality, which can accurately capture neither the quality of the produced signs nor their usability in real-world settings. Understanding how distinctions in meaning are created in signed language may help develop a better evaluation method.


Notation-to-X

As of 2020, no research discusses the translation task between a writing notation system and any other modality.

Text-to-Notation

Walsh, Saunders, and Bowden (2022) explore Text to HamNoSys (T2H) translation, with HamNoSys as the target sign language notation system. They experiment with direct T2H and Text to Gloss to HamNoSys (T2G2H) on a subset of the data from the MEINE DGS dataset (Hanke et al. 2020), where all glosses are mapped to HamNoSys by a dictionary lookup. They find that direct T2H translation yields higher BLEU scores (though it remains unclear how well BLEU reflects the quality of HamNoSys translations). They encode HamNoSys with BPE (Sennrich, Haddow, and Birch 2016), which outperforms character-level and word-level tokenization. They also leverage BERT to create better sentence-level embeddings and use HamNoSys to extract the handshape of a sign as additional supervision during training.


Fingerspelling

Fingerspelling is spelling a word letter-by-letter, borrowing from the spoken language alphabet (Battison 1978; Wilcox 1992; Brentari and Padden 2001; Patrie and Johnson 2011). This phenomenon, found in most signed languages, often occurs when there is no previously agreed-upon sign for a concept, like in technical language, colloquial conversations involving names, conversations involving current events, emphatic forms, and the context of code-switching between the sign language and corresponding spoken language (Padden 1998; Montemurro and Brentari 2018). The relative amount of fingerspelling varies between signed languages, and for American Sign Language (ASL), it accounts for 12–35% of the signed content (Padden and Gunsauls 2003).

Patrie and Johnson (2011) introduce terminology distinguishing three different forms of fingerspelling, including the “careful” and “rapid” forms discussed below.

Recognition

Fingerspelling recognition–a sub-task of sign language recognition–is the task of recognizing fingerspelled words from a sign language video.

Shi et al. (2018) introduced a large dataset for American Sign Language fingerspelling recognition. This dataset includes both the “careful” and “rapid” forms of fingerspelling collected from naturally occurring videos “in the wild”, which are more challenging than studio conditions. They train a baseline model that takes a sequence of images cropped around the signing hand and uses either an autoregressive decoder or CTC. They find that CTC outperforms the autoregressive decoder, but both achieve poor recognition rates (35-41% character-level accuracy) compared to human performance (around 82%).

In follow-up work, Shi et al. (2019) collected nearly an order-of-magnitude larger dataset and designed a new recognition model. Instead of detecting the signing hand, they detect the face and crop a large area around it. Then, they perform an iterative process of zooming in to the hand using visual attention to retain sufficient information in high resolution of the hand. Finally, like their previous work, they encode the image hand crops sequence and use a CTC to obtain the frame labels. They show that this method outperforms their original “hand crop” method by 4% and that they can achieve up to 62.3% character-level accuracy using the additional data collected. Looking through this dataset, we note that the videos in the dataset are taken from longer videos, and as they are cut, they do not retain the signing before the fingerspelling. This context relates to language modeling, where at first, one fingerspells a word carefully, and when repeating it, might fingerspell it rapidly, but the interlocutors can infer they are fingerspelling the same word.

Production

Fingerspelling production–a sub-task of sign language production–is the task of producing a fingerspelling video for words.

In its basic form, “careful” fingerspelling production can be trivially solved by interpolating between pre-defined letter handshapes. Adeline (2013) demonstrates this approach for American Sign Language and English fingerspelling. They rig a hand armature for each letter in the English alphabet (N = 26) and generate all (N² = 676) transitions between every two letters using interpolation or manual animation. Then, to fingerspell entire words, they chain pairs of letter transitions. For example, for the word “CHLOE”, they chain the following transitions sequentially: #C CH HL LO OE E#.
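Enumerating the chain of letter-pair transitions for a word can be sketched in a few lines; here `#` marks the rest pose at the word boundaries, as in the “CHLOE” example above:

```python
def letter_transitions(word):
    """List the pre-animated letter-pair transitions needed to
    fingerspell `word`, padding with '#' for the rest pose at the
    start and end of the word."""
    padded = f"#{word.upper()}#"
    return [padded[i : i + 2] for i in range(len(padded) - 1)]
```

Each returned pair indexes one of the 676 pre-built transition animations, which are then played back to back.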

However, to produce life-like animations, one must also consider the rhythm and speed of holding letters and transitioning between them, as those affect how intelligible fingerspelling motions are to an interlocutor (Wilcox 1992). Wheatland et al. (2016) analyze both “careful” and “rapid” fingerspelling videos for these features. They find that for both forms, on average, the longer the word, the shorter the transition and hold times. Furthermore, they find that less time is spent on middle letters on average, and that the last letter is held longer than the other letters in the word. Finally, they use this information to construct an animation system that interpolates letter poses and controls timing with a data-driven statistical model.

Annotation Tools

ELAN - EUDICO Linguistic Annotator

ELAN (Wittenburg et al. 2006) is an annotation tool for audio and video recordings. With ELAN, a user can add an unlimited number of textual annotations to audio and/or video recordings. An annotation can be a sentence, word, gloss, comment, translation, or description of any feature observed in the media. Annotations can be created on multiple layers, called tiers, which can be hierarchically interconnected. An annotation can either be time-aligned to the media or refer to other existing annotations. The content of annotations consists of Unicode text, and annotation documents are stored in an XML format (EAF). ELAN is open source (GPLv3), and installation is available for Windows, macOS, and Linux. pympi (Lubbers and Torreira 2013) allows for simple Python interaction with ELAN files.
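Because EAF is plain XML, annotations can also be read with standard XML tooling. The fragment below is a hand-written, heavily simplified EAF document (the real schema has more required attributes), used only to illustrate time slots, tiers, and time-aligned annotations:

```python
import xml.etree.ElementTree as ET

# A minimal, synthetic EAF fragment: one tier ("gloss") with a single
# time-aligned annotation spanning 0-1200 ms.
EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1200"/>
  </TIME_ORDER>
  <TIER TIER_ID="gloss">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>HELLO</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""

def read_annotations(eaf_xml):
    """Return (tier, start_ms, end_ms, value) tuples from an EAF string."""
    root = ET.fromstring(eaf_xml)
    # Resolve time-slot IDs to millisecond values.
    times = {ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
             for ts in root.iter("TIME_SLOT")}
    rows = []
    for tier in root.iter("TIER"):
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            rows.append((tier.get("TIER_ID"),
                         times[ann.get("TIME_SLOT_REF1")],
                         times[ann.get("TIME_SLOT_REF2")],
                         ann.findtext("ANNOTATION_VALUE")))
    return rows
```

For real EAF files, pympi wraps this kind of parsing behind a higher-level API.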

iLex

iLex (Hanke 2002) is a tool for sign language lexicography and corpus analysis that combines features found in empirical sign language lexicography and sign language discourse transcription. It supports the user in integrated lexicon building while working on the transcription of a corpus and offers several unique features considered essential due to the specific nature of signed languages. iLex binaries are available for macOS.

SignStream

SignStream (Neidle, Sclaroff, and Athitsos 2001) is a tool for linguistic annotation and computer vision research on visual-gestural language data. SignStream is only available for old macOS versions and is distributed under an MIT license.

Anvil - The Video Annotation Research Tool

Anvil (Kipp 2001) is a free video annotation tool, offering multi-layered annotation based on a user-defined coding scheme. In Anvil, the annotator can see color-coded elements on multiple tracks in time alignment. Some special features are cross-level links, non-temporal objects, timepoint tracks, coding agreement analysis, 3D viewing of motion capture data and a project tool for managing whole corpora of annotation files. Anvil installation is available for Windows, macOS, and Linux.

Resources

Bilingual dictionaries

for signed languages (Mesch and Wallin 2012; Fenlon, Cormier, and Schembri 2015; Crasborn et al. 2016; Gutierrez-Sigut et al. 2016) map a spoken language word or short phrase to a signed language video. One notable dictionary, SpreadTheSign, is a parallel dictionary containing around 23,000 words with up to 41 different spoken-signed language pairs and more than 500,000 videos in total. Unfortunately, while dictionaries may help create lexical rules between languages, they do not demonstrate the grammar or the usage of signs in context.

Fingerspelling corpora

usually consist of videos of words borrowed from spoken languages that are signed letter-by-letter. They can be synthetically created (Dreuw et al. 2006) or mined from online resources (Shi et al. 2018, 2019). However, they only capture one aspect of signed languages.

Isolated sign corpora

are collections of annotated single signs. They are synthesized (Ebling et al. 2018; Huang et al. 2018; Sincan and Keles 2020; Hassan et al. 2020) or mined from online resources (Vaezi Joze and Koller 2019; Li et al. 2020), and can be used for isolated sign language recognition or contrastive analysis of minimal signing pairs (Imashev et al. 2020). However, like dictionaries, they do not describe relations between signs, nor do they capture coarticulation during signing, and they are often limited in vocabulary size (20-1,000 signs).

Continuous sign corpora

contain parallel sequences of signs and spoken language. Available continuous sign corpora are extremely limited, containing 4-6 orders of magnitude fewer sentence pairs than similar corpora for spoken language machine translation (Arivazhagan et al. 2019). Moreover, while automatic speech recognition (ASR) datasets contain up to 50,000 hours of recordings (Pratap et al. 2020), the most extensive continuous sign language corpus contains only 1,150 hours, of which only 50 are publicly available (Hanke et al. 2020). These datasets are usually synthesized (Databases 2007; Crasborn and Zwitserlood 2008; Ko et al. 2019; Hanke et al. 2020) or recorded in studio conditions (Forster et al. 2014; Cihan Camgöz et al. 2018), which does not account for noise in real-life conditions. Moreover, some contain signed interpretations of spoken language rather than naturally produced signs, which may not accurately represent native signing since translation becomes part of the discourse event.

Availability

Unlike the vast amount and diversity of available spoken language resources, which allow various applications, signed language resources are scarce and currently only support translation and production. Unfortunately, most of the signed language corpora discussed in the literature are either not available for use or available under heavy restrictions and licensing terms. Furthermore, signed language data is especially challenging to anonymize due to the importance of facial and other physical features in signing videos, limiting its open distribution. Developing anonymization with minimal information loss, or accurate anonymous representations, is a promising research direction.

Collect Real-World Data

Data is essential to develop any of the core NLP tools previously described, and current efforts in SLP are often limited by the lack of adequate data. We discuss the considerations to keep in mind when building datasets, the challenges of collecting such data, and directions to facilitate data collection.

What is Good Signed Language Data?

For SLP models to be deployable, they must be developed using data that accurately represents the real world. What constitutes an ideal signed language dataset is an open question; we suggest the following requirements: (1) a broad domain; (2) sufficient data and vocabulary size; (3) real-world conditions; (4) naturally produced signs; (5) a diverse signer demographic; (6) native signers; and, when applicable, (7) dense annotations.

To illustrate the importance of data quality during modeling, Yin et al. (2021) first take as an example a current benchmark for SLP, the RWTH-PHOENIX-Weather 2014T dataset (Cihan Camgöz et al. 2018) of German Sign Language, which does not meet most of the above criteria: it is restricted to the weather domain (1); contains only around 8K segments with 1K unique signs (2); is filmed in studio conditions (3); is interpreted from German utterances (4); and is signed by nine Caucasian interpreters (5,6). Although this dataset addressed data scarcity issues at the time, rendered results comparable, and fueled competitive research, it does not accurately represent signed languages in the real world. On the other hand, the Public DGS Corpus (Hanke et al. 2020) is an open-domain (1) dataset consisting of 50 hours of natural signing (4) by 330 native signers from various regions in Germany (5,6), annotated with glosses, HamNoSys, and German translations (7), meeting all but two of the requirements we suggest.

They train a gloss-to-text sign language translation transformer (Yin and Read 2020) on both datasets. On RWTH-PHOENIX-Weather 2014T, they obtain 22.17 BLEU on the test set; on the Public DGS Corpus, the BLEU score is far lower. Although Transformers achieve encouraging results on RWTH-PHOENIX-Weather 2014T (Saunders, Camgöz, and Bowden 2020c; N. C. Camgöz et al. 2020a), they fail on more realistic, open-domain data. These results reveal that, for real-world applications, we need more data to train such models. At the same time, since available data is severely limited in size, less data-hungry and more linguistically-informed approaches may be more suitable. This experiment shows how crucial it is to use data that accurately represents the complexity and diversity of signed languages, both to assess precisely which types of methods are suitable and to predict how well our models would deploy to the real world.

Challenges of Data Collection

Collecting and annotating signed data in line with the ideal requires more resources than speech or text data, taking up to 600 minutes per minute of an annotated signed language video (Hanke et al. 2020). Moreover, annotation usually requires specific knowledge and skills, which makes recruiting or training qualified annotators challenging. Additionally, there is little existing signed language data in the wild openly licensed for use, especially from native signers that are not interpretations of speech. Therefore, data collection often requires significant efforts and costs of on-site recording.

Automating Annotation

One helpful research direction for collecting data that enables the development of deployable SLP models is creating tools that simplify or automate parts of the collection and annotation process. One of the most significant bottlenecks in obtaining more adequate signed language data is the time and scarcity of the experts required to perform annotation. Therefore, tools that perform automatic parsing, detect frame boundaries, extract articulatory features, suggest lexical annotations, or allow parts of the annotation process to be crowdsourced to non-experts, to name a few, have high potential to facilitate and accelerate the availability of good data.

Practice Deaf Collaboration

Finally, when working with signed languages, it is vital to keep in mind that this technology should benefit the Deaf community and meet its needs. Researchers in SLP must honor that signed languages belong to the Deaf community and avoid exploiting their language as a commodity (Bird 2020).

Solving Real Needs

Many efforts in SLP have developed intrusive methods (e.g., requiring signers to wear special gloves), which are often rejected by signing communities and therefore have limited real-world value. Such efforts are often marketed to perform “sign language translation” when they, in fact, only identify fingerspelling or recognize a minimal set of isolated signs at best. These approaches oversimplify the rich grammar of signed languages, promote the misconception that signs are solely expressed through the hands, and are considered by the Deaf community as a manifestation of audism, where it is the signers who must make the extra effort to wear additional sensors to be understood by non-signers (Erard 2017). To avoid such mistakes, we encourage close Deaf involvement throughout the research process to ensure that we direct our efforts toward applications that will be adopted by signers and do not make false assumptions about signed languages or the needs of signing communities.

Building Collaboration

Deaf collaborations and leadership are essential for developing signed language technologies to ensure they address the community’s needs and will be adopted, not relying on misconceptions or inaccuracies about signed language (Harris, Holmes, and Mertens 2009; Kusters, De Meulder, and O’Brien 2017). Hearing researchers cannot relate to the deaf experience or fully understand the context in which the tools being developed would be used, nor can they speak for the deaf. Therefore, we encourage creating a long-term collaborative environment between signed language researchers and users so that deaf users can identify meaningful challenges and provide insights on the considerations to take while researchers cater to the signers’ needs as the field evolves. We also recommend reaching out to signing communities for reviewing papers on signed languages to ensure an adequate evaluation of this type of research results published at ACL venues. There are several ways to connect with Deaf communities for collaboration: one can seek deaf students in their local community, reach out to schools for the deaf, contact deaf linguists, join a network of researchers of sign-related technologies, and/or participate in deaf-led projects.

Downloading

Currently, there is no easy way or agreed-upon format to download and load sign language datasets; as a result, evaluation across these datasets is scarce. As part of this work, we streamlined the loading of available datasets using TensorFlow Datasets (“TensorFlow Datasets, a Collection of Ready-to-Use Datasets,” n.d.). This tool allows researchers to load large and small datasets alike with a simple command and makes their results comparable to other works. We make these datasets available using a custom library, Sign Language Datasets.

import tensorflow_datasets as tfds
import sign_language_datasets.datasets

# Loading a dataset with default configuration
aslg_pc12 = tfds.load("aslg_pc12")

# Loading a dataset with custom configuration
from sign_language_datasets.datasets.config import SignDatasetConfig
config = SignDatasetConfig(name="videos_and_poses256x256:12", 
                           version="3.0.0",          # Specific version
                           include_video=True,       # Download, and load dataset videos
                           fps=12,                   # Load videos at constant, 12 fps
                           resolution=(256, 256),    # Convert videos to a constant resolution, 256x256
                           include_pose="holistic")  # Download and load Holistic pose estimation
rwth_phoenix2014_t = tfds.load(name='rwth_phoenix2014_t', builder_kwargs=dict(config=config))

Furthermore, we follow a unified interface when possible, making attributes the same and comparable between datasets:

{
    "id": tfds.features.Text(),
    "signer": tfds.features.Text() | tf.int32,
    "video": tfds.features.Video(shape=(None, HEIGHT, WIDTH, 3)),
    "depth_video": tfds.features.Video(shape=(None, HEIGHT, WIDTH, 1)),
    "fps": tf.int32,
    "pose": {
        "data": tfds.features.Tensor(shape=(None, 1, POINTS, CHANNELS), dtype=tf.float32),
        "conf": tfds.features.Tensor(shape=(None, 1, POINTS), dtype=tf.float32)
    },
    "gloss": tfds.features.Text(),
    "text": tfds.features.Text()
}

The following table contains a curated list of datasets, including various signed languages and data formats:

🎥 Video | 👋 Pose | 👄 Mouthing | ✍ Notation | 📋 Gloss | 📜 Text | 🔊 Speech

| Dataset | Publication | Language | Features | #Signs | #Samples | #Signers | License |
|---|---|---|---|---|---|---|---|
| ASL-100-RGBD | Hassan et al. (2020) | American | 🎥👋📋 | 100 | 4,150 Tokens | 22 | Authorized Academics |
| ASL-Homework-RGBD | Hassan et al. (2022) | American | 🎥👋📋 | | 935 | 45 | Authorized Academics |
| ASLG-PC12 💾 | Othman and Jemni (2012) | American (Synthetic) | 📋📜 | | > 100,000,000 Sentences | N/A | Sample Available (1, 2) |
| ASLLVD | Athitsos et al. (2008) | American | TODO | 3,000 | 12,000 Samples | 4 | Attribution |
| ATIS | Bungeroth et al. (2008) | Multilingual | TODO | 292 | 595 Sentences | | |
| AUSLAN | Johnston (2010) | Australian | TODO | | 1,100 Videos | 100 | |
| AUTSL 💾 | Sincan and Keles (2020) | Turkish | 🎥📋 | 226 | 36,302 Samples | 43 | Codalab |
| BosphorusSign | Camgöz et al. (2016) | Turkish | TODO | 636 | 24,161 Samples | 6 | Not Published |
| BSL Corpus 💾 | Schembri et al. (2013) | British | TODO | | 40,000 Lexical Items | 249 | Partially Restricted |
| CDPSL | Łacheta and Rutkowski (2014) | Polish | 🎥📜 | | 300 hours | | |
| ChicagoFSWild 💾 | Shi et al. (2018) | American | 🎥📜 | 26 | 7,304 Sequences | 160 | Public |
| ChicagoFSWild+ 💾 | Shi et al. (2019) | American | 🎥📜 | 26 | 55,232 Sequences | 260 | Public |
| Content4All | Camgöz et al. (2021) | Swiss-German, Flemish | 🎥👋📜📜 | | 190 Hours | | CC BY-NC-SA 4.0 |
| CopyCat | Zafrulla et al. (2010) | American | TODO | 22 | 420 Phrases | 5 | |
| Corpus NGT 💾 | Crasborn and Zwitserlood (2008) | Netherlands | TODO | | 15 Hours | 92 | CC BY-NC-SA 3.0 NL |
| DEVISIGN | Chai, Wang, and Chen (2014) | Chinese | TODO | 2,000 | 24,000 Samples | 8 | Research purpose on request |
| Dicta-Sign 💾 | Matthes et al. (2012) | Multilingual | TODO | | 6-8 Hours (/Participant) | 16-18 /Language | |
| How2Sign 💾 | Duarte et al. (2020) | American | 🎥👋📋📜🔊 | 16,000 | 79 hours (35,000 sentences) | 11 | CC BY-NC 4.0 |
| K-RSL | Imashev et al. (2020) | Kazakh-Russian | 🎥👋📜 | 600 | 28,250 Videos | 10 | Attribution |
| KETI | Ko et al. (2019) | Korean | 🎥👋📋📜 | 524 | 14,672 Videos | 14 | TODO (emailed Sang-Ki Ko) |
| KRSL-OnlineSchool | Mukushev et al. (2022) | Kazakh-Russian | 🎥📋📜 | | 890 Hours (1M sentences) | 7 | |
| LSE-SIGN | Gutierrez-Sigut et al. (2016) | Spanish | TODO | 2,400 | 2,400 Samples | 2 | |
| MS-ASL | Vaezi Joze and Koller (2019) | American | TODO | 1,000 | 25,000 (25 hours) | 200 | Public |
| NCSLGR 💾 | Databases (2007) | American | 🎥📋📜 | | 1,875 sentences | 4 | TODO |
| Public DGS Corpus 💾 | Hanke et al. (2020) | German | 🎥🎥👋👄📋📜📜 | | 50 Hours | 330 | Custom |
| RVL-SLLL ASL | Martínez et al. (2002) | American | TODO | 104 | 2,576 Videos | 14 | Research Attribution |
| RWTH Fingerspelling | Dreuw et al. (2006) | German | 🎥📜 | 35 | 1,400 single-char videos | 20 | |
| RWTH-BOSTON-104 | Dreuw et al. (2008) | American | 🎥📜 | 104 | 201 Sentences | 3 | |
| RWTH-PHOENIX-Weather T 💾 | Forster et al. (2014); Cihan Camgöz et al. (2018) | German | 🎥📋📜 | 1,231 | 8,257 Sentences | 9 | CC BY-NC-SA 3.0 |
| S-pot | Viitaniemi et al. (2014) | Finnish | TODO | 1,211 | 5,539 Videos | 5 | Permission |
| Sign2MINT 💾 | 2021 | German | 🎥📜 | 740 | 1135 | | CC BY-NC-SA 3.0 DE |
| SignBank 💾 | | Multilingual | 🎥📜 | | 222,148 | | |
| SIGNOR | Vintar, Jerko, and Kulovec (2012) | Slovene | 🎥👄📋📜 | | | 80 | TODO emailed Špela |
| SIGNUM | Von Agris and Kraiss (2007) | German | TODO | 450 | 15,600 Sequences | 20 | |
| SMILE | Ebling et al. (2018) | Swiss-German | TODO | 100 | 9,000 Samples | 30 | Not Published |
| SSL Corpus | Öqvist, Riemer Kankkonen, and Mesch (2020) | Swedish | 🎥📋📜 | | | | TODO In January |
| SSL Lexicon | Mesch and Wallin (2012) | Swedish | 🎥📋📜📜 | 20,000 | | | CC BY-NC-SA 2.5 SE |
| Video-Based CSL | Huang et al. (2018) | Chinese | TODO | 500 | 125,000 Videos | 50 | Research Attribution |
| WLASL 💾 | Li et al. (2020) | American | 🎥📋 | 2,000 | | 100 | C-UDA 1.0 |

Other Resources

Citation

For attribution in academic contexts, please cite this work as:

@misc{moryossef2021slp, 
    title = "{S}ign {L}anguage {P}rocessing", 
    author = "Moryossef, Amit and Goldberg, Yoav",
    howpublished = "\url{https://sign-language-processing.github.io/}",
    year = "2021"
}

References

Abeillé, Anne, Yves Schabes, and Aravind K Joshi. 1991. “Using Lexicalized Tags for Machine Translation.”

Adaloglou, Nikolas, Theocharis Chatzis, Ilias Papastratis, Andreas Stergioulas, Georgios Th Papadopoulos, Vassia Zacharopoulou, George J Xydopoulos, Klimnis Atzakas, Dimitris Papazachariou, and Petros Daras. 2020. “A Comprehensive Study on Sign Language Recognition Methods.” arXiv Preprint arXiv:2007.12530.

Adeline, Chloe. 2013. “Fingerspell.net.” http://fingerspell.net/.

Arivazhagan, Naveen, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, et al. 2019. “Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges.” arXiv Preprint arXiv:1907.05019.

Athitsos, Vassilis, Carol Neidle, Stan Sclaroff, Joan Nash, Alexandra Stefan, Quan Yuan, and Ashwin Thangali. 2008. “The American Sign Language Lexicon Video Dataset.” In 2008 Ieee Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 1–8. IEEE.

Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2015. “Neural Machine Translation by Jointly Learning to Align and Translate.” Edited by Yoshua Bengio and Yann LeCun. http://arxiv.org/abs/1409.0473.

Battison, Robbin. 1978. “Lexical Borrowing in American Sign Language.”

Bellugi, Ursula, and Susan Fischer. 1972. “A Comparison of Sign Language and Spoken Language.” Cognition 1 (2-3): 173–200.

Bergman, Brita. 1977. Tecknad Svenska: [Signed Swedish]. LiberLäromedel/Utbildningsförl.

Beuzeville, Louise de. 2008. “Pointing and Verb Modification: The Expression of Semantic Roles in the Auslan Corpus.” In Workshop Programme, 13. Citeseer.

Bird, Steven. 2020. “Decolonising Speech and Language Technology.” In Proceedings of the 28th International Conference on Computational Linguistics, 3504–19. Barcelona, Spain (Online): International Committee on Computational Linguistics. https://doi.org/10.18653/v1/2020.coling-main.313.

Bishop, Christopher M. 1994. “Mixture Density Networks.”

Borg, Mark, and Kenneth P Camilleri. 2019. “Sign Language Detection "in the Wild" with Recurrent Neural Networks.” In ICASSP 2019-2019 Ieee International Conference on Acoustics, Speech and Signal Processing (Icassp), 1637–41. IEEE.

Bragg, Danielle, Oscar Koller, Mary Bellard, Larwan Berke, Patrick Boudreault, Annelies Braffort, Naomi Caselli, et al. 2019. “Sign Language Recognition, Generation, and Translation: An Interdisciplinary Perspective.” In The 21st International Acm Sigaccess Conference on Computers and Accessibility, 16–31.

Brentari, Diane. 2011. “Sign Language Phonology.” The Handbook of Phonological Theory, 691–721.

Brentari, Diane, and Carol Padden. 2001. “A Language with Multiple Origins: Native and Foreign Vocabulary in American Sign Language.” Foreign Vocabulary in Sign Language: A Cross-Linguistic Investigation of Word Formation, 87–119.

Bull, Hannah, Triantafyllos Afouras, Gül Varol, Samuel Albanie, Liliane Momeni, and Andrew Zisserman. 2021. “Aligning Subtitles in Sign Language Videos.” In Proceedings of the Ieee/Cvf International Conference on Computer Vision, 11552–61.

Bull, Hannah, Michèle Gouiffès, and Annelies Braffort. 2020. “Automatic Segmentation of Sign Language into Subtitle-Units.” In European Conference on Computer Vision, 186–98. Springer.

Bungeroth, Jan, Daniel Stein, Philippe Dreuw, Hermann Ney, Sara Morrissey, Andy Way, and Lynette van Zijl. 2008. “The ATIS Sign Language Corpus.” In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08). Marrakech, Morocco: European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2008/pdf/748_paper.pdf.

Camgöz, Necati Cihan, Simon Hadfield, Oscar Koller, and Richard Bowden. 2017. “Subunets: End-to-End Hand Shape and Continuous Sign Language Recognition.” In 2017 Ieee International Conference on Computer Vision (Iccv), 3075–84. IEEE.

Camgöz, Necati Cihan, Ahmet Alp Kındıroğlu, Serpil Karabüklü, Meltem Kelepir, Ayşe Sumru Özsoy, and Lale Akarun. 2016. “BosphorusSign: A Turkish Sign Language Recognition Corpus in Health and Finance Domains.” In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), 1383–8. Portorož, Slovenia: European Language Resources Association (ELRA). https://www.aclweb.org/anthology/L16-1220.

Camgöz, Necati Cihan, Oscar Koller, Simon Hadfield, and Richard Bowden. 2020a. “Multi-Channel Transformers for Multi-Articulatory Sign Language Translation.” In European Conference on Computer Vision, 301–19.

———. 2020b. “Sign Language Transformers: Joint End-to-End Sign Language Recognition and Translation.” In Proceedings of the Ieee/Cvf Conference on Computer Vision and Pattern Recognition, 10023–33.

Camgöz, Necati Cihan, Ben Saunders, Guillaume Rochette, Marco Giovanelli, Giacomo Inches, Robin Nachtrab-Ribback, and Richard Bowden. 2021. “Content4all Open Research Sign Language Translation Datasets.” In 2021 16th Ieee International Conference on Automatic Face and Gesture Recognition (Fg 2021), 1–5. IEEE.

Cao, Zhe, Tomas Simon, Shih-En Wei, and Yaser Sheikh. 2017. “Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields.” In CVPR.

Cao, Z., G. Hidalgo Martinez, T. Simon, S. Wei, and Y. A. Sheikh. 2019. “OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields.” IEEE Transactions on Pattern Analysis and Machine Intelligence.

Chai, Xiujuan, Hanjie Wang, and Xilin Chen. 2014. “The Devisign Large Vocabulary of Chinese Sign Language Database and Baseline Evaluations.” Technical Report VIPL-TR-14-SLR-001. Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS.

Chan, Caroline, Shiry Ginosar, Tinghui Zhou, and Alexei A Efros. 2019. “Everybody Dance Now.” In Proceedings of the Ieee International Conference on Computer Vision, 5933–42.

Chen, Yu, Chunhua Shen, Xiu-Shen Wei, Lingqiao Liu, and Jian Yang. 2017. “Adversarial Posenet: A Structure-Aware Convolutional Network for Human Pose Estimation.” In Proceedings of the Ieee International Conference on Computer Vision, 1212–21.

Cho, Kyunghyun, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. “Learning Phrase Representations Using RNN Encoder–Decoder for Statistical Machine Translation.” In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1724–34. Doha, Qatar: Association for Computational Linguistics. https://doi.org/10.3115/v1/D14-1179.

Cihan Camgöz, Necati, Simon Hadfield, Oscar Koller, Hermann Ney, and Richard Bowden. 2018. “Neural Sign Language Translation.” In Proceedings of the Ieee Conference on Computer Vision and Pattern Recognition, 7784–93.

Cormier, Kearsy, Sandra Smith, and Zed Sevcikova-Sehyr. 2015. “Rethinking Constructed Action.” Sign Language & Linguistics 18 (2): 167–204.

Crasborn, Onno, R Bank, I Zwitserlood, E Van der Kooij, E Ormel, J Ros, A Schüller, et al. 2016. “NGT Signbank.” Nijmegen: Radboud University, Centre for Language Studies.

Crasborn, O., and I. Zwitserlood. 2008. “The Corpus Ngt: An Online Corpus for Professionals and Laymen.” In.

Cui, Runpeng, Hu Liu, and Changshui Zhang. 2017. “Recurrent Convolutional Neural Networks for Continuous Sign Language Recognition by Staged Optimization.” In Proceedings of the Ieee Conference on Computer Vision and Pattern Recognition, 7361–9.

———. 2019. “A Deep Neural Framework for Continuous Sign Language Recognition by Iterative Training.” IEEE Transactions on Multimedia 21 (7): 1880–91.

Dafnis, Konstantinos M, Evgenia Chroni, Carol Neidle, and Dimitris N Metaxas. 2022. “Bidirectional Skeleton-Based Isolated Sign Recognition Using Graph Convolution Networks.” In Proceedings of the 13th Conference on Language Resources and Evaluation (Lrec 2022), Marseille, 20-25 June 2022.

Databases, NCSLGR. 2007. “Volumes 2–7.” American Sign Language Linguistic Research Project (Distributed on CD-ROM ….

Deng, Jia, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. “Imagenet: A Large-Scale Hierarchical Image Database.” In 2009 Ieee Conference on Computer Vision and Pattern Recognition, 248–55. Ieee.

Dreuw, Philippe, Thomas Deselaers, Daniel Keysers, and Hermann Ney. 2006. “Modeling Image Variability in Appearance-Based Gesture Recognition.” In ECCV Workshop on Statistical Methods in Multi-Image and Video Processing, 7–18.

Dreuw, Philippe, Carol Neidle, Vassilis Athitsos, Stan Sclaroff, and Hermann Ney. 2008. “Benchmark Databases for Video-Based Automatic Sign Language Recognition.” In LREC.

Duarte, Amanda, Shruti Palaskar, Deepti Ghadiyaram, Kenneth DeHaan, Florian Metze, Jordi Torres, and Xavier Giro-i-Nieto. 2020. “How2Sign: A Large-Scale Multimodal Dataset for Continuous American Sign Language.” arXiv Preprint arXiv:2008.08143.

Dudis, Paul G. 2004. “Body Partitioning and Real-Space Blends.” Cognitive Linguistics 15 (2): 223–38.

Ebling, Sarah, Necati Cihan Camgöz, Penny Boyes Braem, Katja Tissi, Sandra Sidler-Miserez, Stephanie Stoll, Simon Hadfield, et al. 2018. “SMILE Swiss German Sign Language Dataset.” In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Miyazaki, Japan: European Language Resources Association (ELRA). https://www.aclweb.org/anthology/L18-1666.

Erard, Michael. 2017. “Why Sign-Language Gloves Don’t Help Deaf People.” The Atlantic 9.

Farag, Iva, and Heike Brock. 2019. “Learning Motion Disfluencies for Automatic Sign Language Segmentation.” In ICASSP 2019-2019 Ieee International Conference on Acoustics, Speech and Signal Processing (Icassp), 7360–4. IEEE.

Fenlon, Jordan, Kearsy Cormier, and Adam Schembri. 2015. “Building Bsl Signbank: The Lemma Dilemma Revisited.” International Journal of Lexicography 28 (2): 169–206.

Fenlon, Jordan, Adam Schembri, and Kearsy Cormier. 2018. “Modification of Indicating Verbs in British Sign Language: A Corpus-Based Study.” Language 94 (1): 84–118.

Ferreira-Brito, Lucinda. 1984. “Similarities & Differences in Two Brazilian Sign Languages.” Sign Language Studies 42: 45–56.

Forster, Jens, Christoph Schmidt, Oscar Koller, Martin Bellgardt, and Hermann Ney. 2014. “Extensions of the Sign Language Recognition and Translation Corpus Rwth-Phoenix-Weather.” In LREC, 1911–6.

Gebre, Binyam Gebrekidan, Peter Wittenburg, and Tom Heskes. 2013. “Automatic Sign Language Identification.” In 2013 Ieee International Conference on Image Processing, 2626–30. IEEE.

Giró-i-Nieto, Xavier. 2020. “Can Everybody Sign Now? Exploring Sign Language Video Generation from 2D Poses.”

Glickman, Neil S, and Wyatte C Hall. 2018. Language Deprivation and Deaf Mental Health. Routledge.

Graves, Alex, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. “Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks.” In Proceedings of the 23rd International Conference on Machine Learning, 369–76.

Grishchenko, Ivan, and Valentin Bazarevsky. 2020. “MediaPipe Holistic.” https://google.github.io/mediapipe/solutions/holistic.html.

Gutierrez-Sigut, Eva, Brendan Costello, Cristina Baus, and Manuel Carreiras. 2016. “LSE-Sign: A Lexical Database for Spanish Sign Language.” Behavior Research Methods 48 (1): 123–37.

Güler, Rıza Alp, Natalia Neverova, and Iasonas Kokkinos. 2018. “Densepose: Dense Human Pose Estimation in the Wild.” In Proceedings of the Ieee Conference on Computer Vision and Pattern Recognition, 7297–7306.

Hall, Wyatte C, Leonard L Levin, and Melissa L Anderson. 2017. “Language Deprivation Syndrome: A Possible Neurodevelopmental Disorder with Sociocultural Origins.” Social Psychiatry and Psychiatric Epidemiology 52 (6): 761–76.

Hanke, Thomas. 2002. “ILex-a Tool for Sign Language Lexicography and Corpus Analysis.” In LREC.

Hanke, Thomas, Marc Schulder, Reiner Konrad, and Elena Jahn. 2020. “Extending the Public DGS Corpus in Size and Depth.” In Proceedings of the Lrec2020 9th Workshop on the Representation and Processing of Sign Languages: Sign Language Resources in the Service of the Language Community, Technological Challenges and Application Perspectives, 75–82. Marseille, France: European Language Resources Association (ELRA). https://www.aclweb.org/anthology/2020.signlang-1.12.

Harris, Raychelle, Heidi M Holmes, and Donna M Mertens. 2009. “Research Ethics in Sign Language Communities.” Sign Language Studies 9 (2): 104–31.

Hassan, Saad, Larwan Berke, Elahe Vahdani, Longlong Jing, Yingli Tian, and Matt Huenerfauth. 2020. “An Isolated-Signing RGBD Dataset of 100 American Sign Language Signs Produced by Fluent ASL Signers.” In Proceedings of the Lrec2020 9th Workshop on the Representation and Processing of Sign Languages: Sign Language Resources in the Service of the Language Community, Technological Challenges and Application Perspectives, 89–94. Marseille, France: European Language Resources Association (ELRA). https://www.aclweb.org/anthology/2020.signlang-1.14.

Hassan, Saad, Matthew Seita, Larwan Berke, Yingli Tian, Elaine Gale, Sooyeon Lee, and Matt Huenerfauth. 2022. “ASL-Homework-RGBD Dataset: An Annotated Dataset of 45 Fluent and Non-Fluent Signers Performing American Sign Language Homeworks.” In 13th International Conference on Language Resources and Evaluation (LREC 2022), edited by Eleni Efthimiou, Stavroula-Evita Fotinea, Thomas Hanke, Julie A. Hochgesang, Jette Kristoffersen, Johanna Mesch, and Marc Schulder, 67–72. Marseille, France: European Language Resources Association (ELRA). https://www.sign-lang.uni-hamburg.de/lrec/pub/22008.pdf.

Hidalgo, Gines, Yaadhav Raaj, Haroon Idrees, Donglai Xiang, Hanbyul Joo, Tomas Simon, and Yaser Sheikh. 2019. “Single-Network Whole-Body Pose Estimation.” In ICCV.

Huang, Jie, Wengang Zhou, Qilin Zhang, Houqiang Li, and Weiping Li. 2018. “Video-Based Sign Language Recognition Without Temporal Segmentation.” In Proceedings of the Aaai Conference on Artificial Intelligence. Vol. 32. 1.

Humphries, Tom, Poorna Kushalnagar, Gaurav Mathur, Donna Jo Napoli, Carol Padden, Christian Rathmann, and Scott Smith. 2016. “Avoiding Linguistic Neglect of Deaf Children.” Social Service Review 90 (4): 589–619.

Imashev, Alfarabi, Medet Mukushev, Vadim Kimmelman, and Anara Sandygulova. 2020. “A Dataset for Linguistic Understanding, Visual Evaluation, and Recognition of Sign Languages: The K-Rsl.” In Proceedings of the 24th Conference on Computational Natural Language Learning, 631–40.

Isard, Amy. 2020. “Approaches to the Anonymisation of Sign Language Corpora.” In Proceedings of the Lrec2020 9th Workshop on the Representation and Processing of Sign Languages: Sign Language Resources in the Service of the Language Community, Technological Challenges and Application Perspectives, 95–100.

Isola, Phillip, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. 2017. “Image-to-Image Translation with Conditional Adversarial Networks.” In Proceedings of the Ieee Conference on Computer Vision and Pattern Recognition, 1125–34.

Jiang, Songyao, Bin Sun, Lichen Wang, Yue Bai, Kunpeng Li, and Yun Fu. 2021. “Sign Language Recognition via Skeleton-Aware Multi-Model Ensemble.” arXiv Preprint arXiv:2110.06161.

Johnson, Robert E, and Scott K Liddell. 2011. “Toward a Phonetic Representation of Signs: Sequentiality and Contrast.” Sign Language Studies 11 (2): 241–74.

Johnston, Trevor. 2010. “From Archive to Corpus: Transcription and Annotation in the Creation of Signed Language Corpora.” International Journal of Corpus Linguistics 15 (1): 106–31.

Johnston, Trevor, and Louise De Beuzeville. 2016. “Auslan Corpus Annotation Guidelines.” Auslan Corpus.

Johnston, Trevor, and Adam Schembri. 2007. Australian Sign Language (Auslan): An Introduction to Sign Language Linguistics. Cambridge University Press.

Kakumasu, Jim. 1968. “Urubu Sign Language.” International Journal of American Linguistics 34 (4): 275–81.

Kimmelman, Vadim. 2014. “Information Structure in Russian Sign Language and Sign Language of the Netherlands.” Sign Language & Linguistics 18 (1): 142–50.

Kipp, Michael. 2001. “Anvil-a Generic Annotation Tool for Multimodal Dialogue.” In Seventh European Conference on Speech Communication and Technology.

Ko, Sang-Ki, Chang Jo Kim, Hyedong Jung, and Choongsang Cho. 2019. “Neural Sign Language Translation Based on Human Keypoint Estimation.” Applied Sciences 9 (13): 2683.

Koller, Oscar, Cihan Camgöz, Hermann Ney, and Richard Bowden. 2019. “Weakly Supervised Learning with Multi-Stream Cnn-Lstm-Hmms to Discover Sequential Parallelism in Sign Language Videos.” IEEE Transactions on Pattern Analysis and Machine Intelligence.

Konrad, Reiner, Thomas Hanke, Gabriele Langer, Susanne König, Lutz König, Rie Nishio, and Anja Regen. 2018. “Public Dgs Corpus: Annotation Conventions.” Project Note AP03-2018-01, DGS-Korpus project, IDGS, Hamburg University.

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E Hinton. 2012. “Imagenet Classification with Deep Convolutional Neural Networks.” In Advances in Neural Information Processing Systems, 1097–1105.

Kusters, Annelies, Maartje De Meulder, and Dai O’Brien. 2017. Innovations in Deaf Studies: The Role of Deaf Scholars. Oxford University Press.

Lebert, Marie. 2008. “Project Gutenberg (1971-2008).” Project Gutenberg.

Li, Dongxu, Cristian Rodriguez, Xin Yu, and Hongdong Li. 2020. “Word-Level Deep Sign Language Recognition from Video: A New Large-Scale Dataset and Methods Comparison.” In The Ieee Winter Conference on Applications of Computer Vision, 1459–69.

Liddell, Scott K, and Robert E Johnson. 1989. “American Sign Language: The Phonological Base.” Sign Language Studies 64 (1): 195–277.

Liddell, Scott K, and Melanie Metzger. 1998. “Gesture in Sign Language Discourse.” Journal of Pragmatics 30 (6): 657–97.

Liddell, Scott K, and others. 2003. Grammar, Gesture, and Meaning in American Sign Language. Cambridge University Press.

Lubbers, Mart, and Francisco Torreira. 2013. “Pympi-Ling: A Python Module for Processing ELANs EAF and Praats TextGrid Annotation Files.” https://pypi.python.org/pypi/pympi-ling.

Luong, Thang, Hieu Pham, and Christopher D. Manning. 2015. “Effective Approaches to Attention-Based Neural Machine Translation.” In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 1412–21. Lisbon, Portugal: Association for Computational Linguistics. https://doi.org/10.18653/v1/D15-1166.

Martínez, Aleix M, Ronnie B Wilbur, Robin Shay, and Avinash C Kak. 2002. “Purdue Rvl-Slll Asl Database for Automatic Recognition of American Sign Language.” In Proceedings. Fourth Ieee International Conference on Multimodal Interfaces, 167–72. IEEE.

Matthes, Silke, Thomas Hanke, Anja Regen, Jakob Storz, Satu Worseck, Eleni Efthimiou, Athanasia-Lida Dimou, Annelies Braffort, John Glauert, and Eva Safar. 2012. “Dicta-Sign–Building a Multilingual Sign Language Corpus.” In Proceedings of the 5th Workshop on the Representation and Processing of Sign Languages: Interactions Between Corpus and Lexicon (Lrec 2012).

McKee, David, and Graeme Kennedy. 2000. “Lexical Comparison of Signs from American, Australian, British and New Zealand Sign Languages.” The Signs of Language Revisited: An Anthology to Honor Ursula Bellugi and Edward Klima, 49–76.

Mesch, Johanna, and Lars Wallin. 2012. “From Meaning to Signs and Back: Lexicography and the Swedish Sign Language Corpus.” In Proceedings of the 5th Workshop on the Representation and Processing of Sign Languages: Interactions Between Corpus and Lexicon [Language Resources and Evaluation Conference (Lrec)], 123–26.

———. 2015. “Gloss Annotations in the Swedish Sign Language Corpus.” International Journal of Corpus Linguistics 20 (1): 102–20.

Min, Jianyuan, and Jinxiang Chai. 2012. “Motion Graphs++ a Compact Generative Model for Semantic Motion Analysis and Synthesis.” ACM Transactions on Graphics (TOG) 31 (6): 1–12.

Monteiro, Caio DD, Christy Maria Mathew, Ricardo Gutierrez-Osuna, and Frank Shipman. 2016. “Detecting and Identifying Sign Languages Through Visual Features.” In 2016 Ieee International Symposium on Multimedia (Ism), 287–90. IEEE.

Montemurro, Kathryn, and Diane Brentari. 2018. “Emphatic Fingerspelling as Code-Mixing in American Sign Language.” Proceedings of the Linguistic Society of America 3 (1): 61–61.

Moryossef, Amit, Ioannis Tsochantaridis, Roee Yosef Aharoni, Sarah Ebling, and Srini Narayanan. 2020. “Real-Time Sign-Language Detection Using Human Pose Estimation.”

Moryossef, Amit, Kayo Yin, Graham Neubig, and Yoav Goldberg. 2021. “Data Augmentation for Sign Language Gloss Translation.” In Proceedings of the 1st International Workshop on Automatic Translation for Signed and Spoken Languages (At4ssl), 1–11. Virtual: Association for Machine Translation in the Americas. https://aclanthology.org/2021.mtsummit-at4ssl.1.

Mukushev, Medet, Aigerim Kydyrbekova, Vadim Kimmelman, and Anara Sandygulova. 2022. “Towards Large Vocabulary Kazakh-Russian Sign Language Dataset: KRSL-OnlineSchool.” In 13th International Conference on Language Resources and Evaluation (LREC 2022), edited by Eleni Efthimiou, Stavroula-Evita Fotinea, Thomas Hanke, Julie A. Hochgesang, Jette Kristoffersen, Johanna Mesch, and Marc Schulder, 154–58. Marseille, France: European Language Resources Association (ELRA). https://www.sign-lang.uni-hamburg.de/lrec/pub/22031.pdf.

Murray, Joseph J, Wyatte C Hall, and Kristin Snoddon. 2020. “The Importance of Signed Languages for Deaf Children and Their Families.” The Hearing Journal 73 (3): 30–32.

Neidle, Carol, Stan Sclaroff, and Vassilis Athitsos. 2001. “SignStream: A Tool for Linguistic and Computer Vision Research on Visual-Gestural Language Data.” Behavior Research Methods, Instruments, & Computers 33 (3): 311–20.

Nguyen, Thanh Thi, Cuong M. Nguyen, Dung Tien Nguyen, Duc Thanh Nguyen, and Saeid Nahavandi. 2019. “Deep Learning for Deepfakes Creation and Detection.” CoRR abs/1909.11573. http://arxiv.org/abs/1909.11573.

Ormel, Ellen, and Onno Crasborn. 2012. “Prosodic Correlates of Sentences in Signed Languages: A Literature Review and Suggestions for New Types of Studies.” Sign Language Studies 12 (2): 279–315.

Othman, Achraf, and Mohamed Jemni. 2012. “English-Asl Gloss Parallel Corpus 2012: Aslg-Pc12.” In 5th Workshop on the Representation and Processing of Sign Languages: Interactions Between Corpus and Lexicon Lrec.

Öqvist, Zrajm, Nikolaus Riemer Kankkonen, and Johanna Mesch. 2020. “STS-Korpus: A Sign Language Web Corpus Tool for Teaching and Public Use.” In Proceedings of the Lrec2020 9th Workshop on the Representation and Processing of Sign Languages: Sign Language Resources in the Service of the Language Community, Technological Challenges and Application Perspectives, 177–80. Marseille, France: European Language Resources Association (ELRA). https://www.aclweb.org/anthology/2020.signlang-1.29.

Padden, C. 1988. Interaction of Morphology and Syntax in American Sign Language. Outstanding Disc Linguistics Series. Garland. https://books.google.com/books?id=Mea7AAAAIAAJ.

Padden, Carol A. 1998. “The Asl Lexicon.” Sign Language & Linguistics 1 (1): 39–60.

Padden, Carol A, and Darline Clark Gunsauls. 2003. “How the Alphabet Came to Be Used in a Sign Language.” Sign Language Studies, 10–33.

Padden, Carol A, and Tom Humphries. 1988. Deaf in America. Harvard University Press.

Panteleris, Paschalis, Iason Oikonomidis, and Antonis Argyros. 2018. “Using a Single Rgb Frame for Real Time 3d Hand Pose Estimation in the Wild.” In 2018 Ieee Winter Conference on Applications of Computer Vision (Wacv), 436–45. IEEE.

Patrie, Carol J, and Robert E Johnson. 2011. Fingerspelled Word Recognition Through Rapid Serial Visual Presentation: RSVP. DawnSignPress.

Pavllo, Dario, Christoph Feichtenhofer, David Grangier, and Michael Auli. 2019. “3d Human Pose Estimation in Video with Temporal Convolutions and Semi-Supervised Training.” In Proceedings of the Ieee Conference on Computer Vision and Pattern Recognition, 7753–62.

Pishchulin, Leonid, Arjun Jain, Mykhaylo Andriluka, Thorsten Thormählen, and Bernt Schiele. 2012. “Articulated People Detection and Pose Estimation: Reshaping the Future.” In 2012 Ieee Conference on Computer Vision and Pattern Recognition, 3178–85. IEEE.

Pratap, Vineel, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, and Ronan Collobert. 2020. “MLS: A Large-Scale Multilingual Dataset for Speech Research.” In Proc. Interspeech 2020, 2757–61. https://doi.org/10.21437/Interspeech.2020-2826.

Prillwitz, Siegmund, and Heiko Zienert. 1990. “Hamburg Notation System for Sign Language: Development of a Sign Writing with Computer Application.” In Current Trends in European Sign Language Research. Proceedings of the 3rd European Congress on Sign Language Research, 355–79.

Ramírez, Javier, José C Segura, Carmen Benítez, Angel De La Torre, and Antonio Rubio. 2004. “Efficient Voice Activity Detection Algorithms Using Long-Term Speech Information.” Speech Communication 42 (3-4): 271–87.

Rathmann, Christian, and Gaurav Mathur. 2011. “A Featural Approach to Verb Agreement in Signed Languages.” Theoretical Linguistics 37 (3-4): 197–208.

Renz, Katrin, Nicolaj C Stache, Samuel Albanie, and Gül Varol. 2021. “Sign Language Segmentation with Temporal Convolutional Networks.” In ICASSP 2021-2021 Ieee International Conference on Acoustics, Speech and Signal Processing (Icassp), 2135–9. IEEE.

Renz, Katrin, Nicolaj C Stache, Neil Fox, Gul Varol, and Samuel Albanie. 2021. “Sign Segmentation with Changepoint-Modulated Pseudo-Labelling.” In Proceedings of the Ieee/Cvf Conference on Computer Vision and Pattern Recognition, 3403–12.

Roy, Cynthia B. 2011. Discourse in Signed Languages. Gallaudet University Press.

Sandler, Wendy. 2010. “Prosody and Syntax in Sign Languages.” Transactions of the Philological Society 108 (3): 298–328.

———. 2012. “The Phonological Organization of Sign Languages.” Language and Linguistics Compass 6 (3): 162–82.

Sandler, Wendy, and Diane Lillo-Martin. 2006. Sign Language and Linguistic Universals. Cambridge University Press.

Santemiz, Pinar, Oya Aran, Murat Saraclar, and Lale Akarun. 2009. “Automatic Sign Segmentation from Continuous Signing via Multiple Sequence Alignment.” In 2009 Ieee 12th International Conference on Computer Vision Workshops, Iccv Workshops, 2001–8. IEEE.

Saunders, Ben, Necati Cihan Camgöz, and Richard Bowden. 2020a. “Adversarial Training for Multi-Channel Sign Language Production.” In The 31st British Machine Vision Virtual Conference. British Machine Vision Association.

———. 2020b. “Everybody Sign Now: Translating Spoken Language to Photo Realistic Sign Language Video.” arXiv Preprint arXiv:2011.09846.

———. 2020c. “Progressive Transformers for End-to-End Sign Language Production.” In European Conference on Computer Vision, 687–705.

Savitzky, Abraham, and Marcel JE Golay. 1964. “Smoothing and Differentiation of Data by Simplified Least Squares Procedures.” Analytical Chemistry 36 (8): 1627–39.

Schembri, Adam, Kearsy Cormier, and Jordan Fenlon. 2018. “Indicating Verbs as Typologically Unique Constructions: Reconsidering Verb ‘Agreement’ in Sign Languages.” Glossa: A Journal of General Linguistics 3 (1).

Schembri, Adam, Jordan Fenlon, Ramas Rentelis, Sally Reynolds, and Kearsy Cormier. 2013. “Building the British Sign Language Corpus.” Language Documentation & Conservation 7: 136–54.

Sennrich, Rico, Barry Haddow, and Alexandra Birch. 2016. “Neural Machine Translation of Rare Words with Subword Units.” In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1715–25. Berlin, Germany: Association for Computational Linguistics. https://doi.org/10.18653/v1/P16-1162.

Shi, B., A. Martinez Del Rio, J. Keane, D. Brentari, G. Shakhnarovich, and K. Livescu. 2019. “Fingerspelling Recognition in the Wild with Iterative Visual Attention.” ICCV.

Shi, B., A. Martinez Del Rio, J. Keane, J. Michaux, D. Brentari, G. Shakhnarovich, and K. Livescu. 2018. “American Sign Language Fingerspelling Recognition in the Wild.” SLT.

Shieber, Stuart M. 1994. “RESTRICTING the Weak-Generative Capacity of Synchronous Tree-Adjoining Grammars.” Computational Intelligence 10 (4): 371–85.

Shieber, Stuart, and Yves Schabes. 1990. “Synchronous Tree-Adjoining Grammars.” In Proceedings of the 13th International Conference on Computational Linguistics. Association for Computational Linguistics.

Shroyer, Edgar H, and Susan P Shroyer. 1984. Signs Across America: A Look at Regional Differences in American Sign Language. Gallaudet University Press.

Simon, Tomas, Hanbyul Joo, Iain Matthews, and Yaser Sheikh. 2017. “Hand Keypoint Detection in Single Images Using Multiview Bootstrapping.” In CVPR.

Simonyan, Karen, and Andrew Zisserman. 2015. “Very Deep Convolutional Networks for Large-Scale Image Recognition.” CoRR.

Sincan, Ozge Mercanoglu, and Hacer Yalim Keles. 2020. “AUTSL: A Large Scale Multi-Modal Turkish Sign Language Dataset and Baseline Methods.” IEEE Access 8: 181340–55.

Sohn, Jongseo, Nam Soo Kim, and Wonyong Sung. 1999. “A Statistical Model-Based Voice Activity Detection.” IEEE Signal Processing Letters 6 (1): 1–3.

Stokoe, William C., Jr. 1960. “Sign Language Structure: An Outline of the Visual Communication Systems of the American Deaf.” Studies in Linguistics, Occasional Papers 8. Buffalo, NY: University of Buffalo.

Stokoe, William C., Jr. 2005. “Sign Language Structure: An Outline of the Visual Communication Systems of the American Deaf.” Journal of Deaf Studies and Deaf Education 10 (1): 3–37. https://doi.org/10.1093/deafed/eni001.

Stoll, Stephanie, Necati Cihan Camgöz, Simon Hadfield, and Richard Bowden. 2018. “Sign Language Production Using Neural Machine Translation and Generative Adversarial Networks.” In Proceedings of the 29th British Machine Vision Conference (BMVC 2018). British Machine Vision Association.

———. 2020. “Text2Sign: Towards Sign Language Production Using Neural Machine Translation and Generative Adversarial Networks.” International Journal of Computer Vision, 1–18.

Supalla, Ted. 1986. “The Classifier System in American Sign Language.” Noun Classes and Categorization 7: 181–214.

Sutton, Valerie. 1990. Lessons in Sign Writing. SignWriting.

“TensorFlow Datasets, a Collection of Ready-to-Use Datasets.” n.d. https://www.tensorflow.org/datasets.

United Nations. 2022. “International Day of Sign Languages.” https://www.un.org/en/observances/sign-languages-day.

Vaezi Joze, Hamid, and Oscar Koller. 2019. “MS-ASL: A Large-Scale Data Set and Benchmark for Understanding American Sign Language.” In The British Machine Vision Conference (BMVC). https://www.microsoft.com/en-us/research/publication/ms-asl-a-large-scale-data-set-and-benchmark-for-understanding-american-sign-language/.

Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” In Advances in Neural Information Processing Systems, 5998–6008.

Viitaniemi, Ville, Tommi Jantunen, Leena Savolainen, Matti Karppa, and Jorma Laaksonen. 2014. “S-Pot - a Benchmark in Spotting Signs Within Continuous Signing.” In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), 1892–7. Reykjavik, Iceland: European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2014/pdf/440_Paper.pdf.

Vintar, Špela, Boštjan Jerko, and Marjetka Kulovec. 2012. “Compiling the Slovene Sign Language Corpus.” In 5th Workshop on the Representation and Processing of Sign Languages: Interactions Between Corpus and Lexicon. Language Resources and Evaluation Conference (LREC), 5:159–62.

Vogler, Christian, and Siome Goldenstein. 2005. “Analysis of Facial Expressions in American Sign Language.” In Proceedings of the 3rd International Conference on Universal Access in Human-Computer Interaction. Springer.

Von Agris, Ulrich, and Karl-Friedrich Kraiss. 2007. “Towards a Video Corpus for Signer-Independent Continuous Sign Language Recognition.” Gesture in Human-Computer Interaction and Simulation, Lisbon, Portugal, May 11.

Walsh, Harry Thomas, Ben Saunders, and Richard Bowden. 2022. “Changing the Representation: Examining Language Representation for Neural Sign Language Production.” In LREC 2022.

Wang, Ting-Chun, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. 2018. “Video-to-Video Synthesis.” In Advances in Neural Information Processing Systems (NeurIPS).

Wei, Shih-En, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. 2016. “Convolutional Pose Machines.” In CVPR.

Wheatland, Nkenge, Ahsan Abdullah, Michael Neff, Sophie Jörg, and Victor Zordan. 2016. “Analysis in Support of Realistic Timing in Animated Fingerspelling.” In 2016 IEEE Virtual Reality (VR), 309–10. IEEE.

Wilcox, Sherman. 1992. The Phonetics of Fingerspelling. Vol. 4. John Benjamins Publishing.

Wilcox, Sherman, and Sarah Hafer. 2004. “Rethinking Classifiers.” Review of Perspectives on Classifier Constructions in Sign Languages, edited by Karen Emmorey (Mahwah, NJ: Lawrence Erlbaum Associates, 2003). Oxford University Press.

Wittenburg, Peter, Hennie Brugman, Albert Russel, Alex Klassmann, and Han Sloetjes. 2006. “ELAN: A Professional Framework for Multimodality Research.” In 5th International Conference on Language Resources and Evaluation (LREC 2006), 1556–9.

World Federation of the Deaf. 2022. “World Federation of the Deaf - Our Work.” https://wfdeaf.org/our-work/.

World Health Organization. 2021. “Deafness and Hearing Loss.” https://www.who.int/news-room/fact-sheets/detail/deafness-and-hearing-loss.

Xiao, Qinkun, Minying Qin, and Yuting Yin. 2020. “Skeleton-Based Chinese Sign Language Recognition and Generation for Bidirectional Communication Between Deaf and Hearing People.” Neural Networks 125: 41–55.

Yin, Kayo, Amit Moryossef, Julie Hochgesang, Yoav Goldberg, and Malihe Alikhani. 2021. “Including Signed Languages in Natural Language Processing.” In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 7347–60. Online: Association for Computational Linguistics. https://aclanthology.org/2021.acl-long.570.

Yin, Kayo, and Jesse Read. 2020. “Better Sign Language Translation with STMC-Transformer.” In Proceedings of the 28th International Conference on Computational Linguistics, 5975–89. Barcelona, Spain (Online): International Committee on Computational Linguistics. https://doi.org/10.18653/v1/2020.coling-main.525.

Zafrulla, Zahoor, Helene Brashear, Harley Hamilton, and Thad Starner. 2010. “A Novel Approach to American Sign Language (ASL) Phrase Verification Using Reversed Signing.” In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 48–55. IEEE.

Zelinka, Jan, and Jakub Kanis. 2020. “Neural Sign Language Synthesis: Words Are Our Glosses.” In The IEEE Winter Conference on Applications of Computer Vision, 3395–3403.

Zhao, Liwei, Karin Kipper, William Schuler, Christian Vogler, Norman Badler, and Martha Palmer. 2000. “A Machine Translation System from English to American Sign Language.” In Conference of the Association for Machine Translation in the Americas, 54–67. Springer.

Łacheta, Joanna, and Paweł Rutkowski. 2014. “A Corpus-Based Dictionary of Polish Sign Language (PJM).”


  1. When capitalized, “Deaf” refers to a community of deaf people who share a language and a culture, whereas the lowercase “deaf” refers to the audiological condition of not hearing.↩︎

  2. We mainly refer to ASL, where most sign language research has been conducted, but not exclusively.↩︎