Try sign translate to experience state-of-the-art sign language translation technology.

Introduction

Signed languages (also known as sign languages) are languages that use the visual-gestural modality to convey meaning through manual articulations in combination with non-manual elements like the face and body. They serve as the primary means of communication for numerous deaf and hard-of-hearing individuals. Similar to spoken languages, signed languages are natural languages governed by a set of linguistic rules (Sandler and Lillo-Martin 2006), both emerging through an abstract, protracted aging process and evolving without deliberate, meticulous planning. Signed languages are not universal or mutually intelligible, despite often having striking similarities among them. They are also distinct from spoken languages: American Sign Language (ASL), for example, is not a visual form of English but its own unique language.

Sign Language Processing (Bragg et al. 2019; Yin et al. 2021) is an emerging field of artificial intelligence concerned with the automatic processing and analysis of sign language content. Although research so far has focused more on the visual aspects of signed languages, it is a subfield of both Natural Language Processing (NLP) and Computer Vision (CV). Typical challenges in sign language processing include machine translation of sign language videos into spoken language text (sign language translation), production of sign language videos from spoken language text (sign language production), and sign language recognition for sign language understanding.

Unfortunately, the latest advances in language-based artificial intelligence, like machine translation and personal assistants, expect a spoken language input (text or transcribed speech), excluding around 200 to 300 different signed languages (United Nations 2022) and up to 70 million deaf people (World Health Organization 2021; World Federation of the Deaf 2022).

Throughout history, Deaf communities fought for the right to learn and use signed languages and for the public recognition of signed languages as legitimate languages. Indeed, signed languages are sophisticated communication modalities, at least as capable as spoken languages in all aspects, both linguistic and social. However, in a predominantly oral society, deaf people are constantly encouraged to use spoken languages through lip-reading or text-based communication. The exclusion of signed languages from modern language technologies further suppresses signing in favor of spoken languages. This exclusion disregards the preferences of the Deaf communities who strongly prefer to communicate in signed languages both online and in day-to-day in-person interactions, among themselves and when interacting with spoken language communities (C. A. Padden and Humphries 1988; Glickman and Hall 2018). Thus, it is essential to make signed languages accessible.

To date, a large amount of research on Sign Language Processing (SLP) has been focused on the visual aspect of signed languages, led by the Computer Vision (CV) community, with little NLP involvement. This focus is not unreasonable, given that a decade ago, we lacked adequate CV tools to process videos for further linguistic analyses. However, similar to spoken languages, signed languages are fully-fledged systems exhibiting all the fundamental characteristics of natural languages, and existing SLP techniques do not adequately address or leverage the linguistic structure of signed languages. Signed languages introduce novel challenges for NLP due to their visual-gestural modality, simultaneity, spatial coherence, and lack of written form. The lack of a written form makes the spoken language processing pipelines - which often start with audio transcription before processing - incompatible with signed languages, forcing researchers to work directly on the raw video signal.

Furthermore, SLP is not only intellectually appealing but also an important research area with significant potential to benefit signing communities. Beneficial applications enabled by signed language technologies include improved documentation of endangered sign languages; educational tools for sign language learners; tools for query and retrieval of information from signed language videos; personal assistants that react to signed languages; real-time automatic sign language interpretations; and more. Needless to say, in addressing this research area, researchers should work alongside and under the direction of deaf communities, putting the signing communities’ interests above all (Harris, Holmes, and Mertens 2009).

In this work, we describe the different representations used for sign language processing and survey the various tasks and recent advances on them. We also compile a comprehensive list of existing datasets and make the available ones easy to load using a simple and standardized interface.

(Brief) History of Signed Languages and Deaf Culture

Throughout modern history, spoken languages were dominant, so much so that signed languages struggled to be recognized as languages in their own right, and educators developed misconceptions that signed language acquisition might hinder the development of speech skills. For example, in 1880, a large international conference of deaf educators called the “Second International Congress on Education of the Deaf” banned teaching signed languages, favoring speech therapy instead. It was not until the seminal work on American Sign Language (ASL) by Stokoe (1960) that signed languages started gaining recognition as natural, independent, and well-defined languages, which inspired other researchers to further explore signed languages as a research area. Nevertheless, antiquated attitudes that placed less importance on signed languages continue to inflict harm and subject many to linguistic neglect (Humphries et al. 2016). Several studies have shown that deaf children raised solely with spoken languages do not gain enough access to a first language during their critical period of language acquisition (Murray, Hall, and Snoddon 2020). This language deprivation can lead to life-long consequences on the cognitive, linguistic, socio-emotional, and academic development of the deaf (Hall, Levin, and Anderson 2017).

Signed languages are the primary languages of communication for the Deaf1 and are at the heart of Deaf communities. In the past, the failure to recognize signed languages as fully-fledged natural language systems in their own right has had detrimental effects, and in an increasingly digitized world, NLP research should strive to enable a world in which all people, including the Deaf, have access to languages that fit their lived experience.

Sign Language Linguistics Overview

Signed languages consist of phonological, morphological, syntactic, and semantic levels of structure that fulfill the same social, cognitive, and communicative purposes as other natural languages. While spoken languages primarily channel the oral-auditory modality, signed languages use the visual-gestural modality, relying on the signer’s face, hands, body, and space around them to create distinctions in meaning. We present the linguistic features of signed languages2 that researchers must consider during their modeling.

Phonology

Signs are composed of minimal units that combine manual features such as hand configuration, palm orientation, placement, contact, path movement, local movement, as well as non-manual features including eye aperture, head movement, and torso positioning (Liddell and Johnson 1989; Johnson and Liddell 2011; Brentari 2011; Sandler 2012). Not all possible phonemes are realized in both signed and spoken languages, and inventories of two languages’ phonemes/features may not overlap completely. Different languages are also subject to rules for the allowed combinations of features.

Simultaneity

Though an ASL sign takes about twice as long to produce as an English word, the rates of transmission of information between the two languages are similar (Bellugi and Fischer 1972). One way signed languages compensate for the slower production rate of signs is through simultaneity: Signed languages use multiple visual cues to convey different information simultaneously (Sandler 2012). For example, the signer may produce the sign for “cup” on one hand while simultaneously pointing to the actual cup with the other to express “that cup.” Similarly to tone in spoken languages, the face and torso can convey additional affective information (Liddell and others 2003; Johnston and Schembri 2007). Facial expressions can modify adjectives, adverbs, and verbs; a head shake can negate a phrase or sentence; eye direction can help indicate referents.

Referencing

The signer can introduce referents in discourse either by pointing to their actual locations in space or by assigning a region in the signing space to a non-present referent and pointing to this region to refer to it (Rathmann and Mathur 2011; Schembri, Cormier, and Fenlon 2018). Signers can also establish relations between referents grounded in signing space by using directional signs or embodying the referents using body shift or eye gaze (Dudis 2004; Liddell and Metzger 1998). Spatial referencing also impacts morphology when the directionality of a verb depends on the locations of the referents of its subject and/or object (Beuzeville 2008; Fenlon, Schembri, and Cormier 2018): For example, a directional verb can move from its subject’s location and end at its object’s location. While the relation between referents and verbs in spoken language is more arbitrary, referent relations are usually grounded in signed languages. The visual space is heavily exploited to make referencing clear.

Another way anaphoric entities are referenced in sign language is by using classifiers or depicting signs (Supalla 1986; Wilcox and Hafer 2004; Roy 2011) that help describe the characteristics of the referent. Classifiers are typically one-handed signs that do not have a particular location or movement assigned to them, or derive features from meaningful discourse (Liddell and others 2003), so they can be used to convey how the referent relates to other entities, describe its movement, and give more details. For example, to tell about a car swerving and crashing, one might use the hand classifier for a vehicle, move it to indicate swerving, and crash it with another entity in space.

To quote someone other than oneself, signers perform role shift (Cormier, Smith, and Sevcikova-Sehyr 2015), where they may physically shift in space to mark the distinction and take on some characteristics of the people they represent. For example, to recount a dialogue between a taller and a shorter person, the signer may shift to one side and look up when taking the shorter person’s role, and shift to the other side and look down when taking the taller person’s role.

Fingerspelling

Fingerspelling results from language contact between a signed language and the written form of a surrounding spoken language (Battison 1978; Wilcox 1992; Brentari and Padden 2001; Patrie and Johnson 2011). A set of manual gestures corresponds to a written orthography or phonetic system. This phenomenon, found in most signed languages, is often used to indicate names, places, or new concepts from the spoken language, and has often become integrated into the signed languages as another linguistic strategy (Padden 1998; Montemurro and Brentari 2018).

Sign Language Representations

Representation is a significant challenge for SLP. Unlike spoken languages, signed languages have no widely adopted written form. As signed languages are conveyed through the visual-gestural modality, video recording is the most straightforward way to capture them. However, as videos include more information than needed for modeling and are expensive to record, store, and transmit, a lower-dimensional representation has been sought after.

The following figure illustrates each signed language representation we will describe below. In this demonstration, we deconstruct the video into its individual frames to exemplify the alignment of the annotations between the video and representations.

Videos

are the most straightforward representation of a signed language and can amply incorporate the information conveyed through signing. One major drawback of using videos is their high dimensionality: They usually include more information than needed for modeling and are expensive to store, transmit, and encode. As facial features are essential in signing, anonymizing raw videos remains an open problem, limiting the possibility of making these videos publicly available (Isard 2020).

Skeletal Poses

reduce the visual cues in videos to skeleton-like wireframes or meshes representing the locations of joints. This technique has been extensively used in the field of computer vision to estimate human pose from video data, where the goal is to determine the spatial configuration of the body at each point in time. Although high-quality pose estimation can be achieved using motion capture equipment, such methods are often expensive and intrusive. As a result, estimating pose from videos has become the preferred method in recent years (Pishchulin et al. 2012; Chen et al. 2017; Cao et al. 2019; Güler, Neverova, and Kokkinos 2018). Compared to video representations, accurate skeletal poses have lower complexity and provide a semi-anonymized representation of the human body, while incurring relatively low information loss. However, they remain a continuous, multidimensional representation that is not adapted to most NLP models.
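To make the dimensionality argument concrete, the following back-of-the-envelope sketch (with an illustrative clip length and frame size, and the 135 OpenPose keypoints discussed later) compares the memory footprint of a raw video tensor with that of the corresponding pose sequence.

```python
import numpy as np

# A hypothetical 10-second clip at 25 fps with 512x512 RGB frames.
frames, height, width, channels = 250, 512, 512, 3
video = np.zeros((frames, height, width, channels), dtype=np.uint8)

# The same clip as estimated poses: 135 keypoints (body + hands + face,
# as in OpenPose), each with (x, y, confidence).
keypoints, dims = 135, 3
poses = np.zeros((frames, keypoints, dims), dtype=np.float32)

print(video.nbytes / poses.nbytes)  # the raw video is roughly 500x larger
```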

Written notation systems

represent signs as discrete visual features. Some systems are written linearly, and others use graphemes in two dimensions. While various universal (Sutton 1990; Prillwitz and Zienert 1990) and language-specific notation systems (Stokoe Jr 2005; Kakumasu 1968; Bergman 1977) have been proposed, no writing system has been adopted widely by any sign language community, and the lack of standards hinders the exchange and unification of resources and applications between projects. The figure above depicts two universal notation systems: SignWriting (Sutton 1990), a two-dimensional pictographic system, and HamNoSys (Prillwitz and Zienert 1990), a linear stream of graphemes designed to be machine-readable.

Glosses

are the transcription of signed languages sign-by-sign, with each sign having a unique semantic identifier. While various sign language corpus projects have provided guidelines for gloss annotation (Mesch and Wallin 2015; Johnston and De Beuzeville 2016; Konrad et al. 2018), a standardized gloss annotation protocol has yet to be established. Linear gloss annotations have been criticized for their imprecise representation of signed language. These annotations fail to capture all the information expressed simultaneously through different cues, such as body posture, eye gaze, or spatial relations, leading to a loss of information that can significantly affect downstream performance on SLP tasks (Yin and Read 2020; Müller et al. 2023).

The following table additionally exemplifies the various representations for more isolated signs. For this example, we use SignWriting as the notation system. Note that the same sign might have two unrelated glosses, and the same gloss might have multiple valid spoken language translations.

Video | Pose Estimation | Notation (SignWriting) | Gloss | English Translation
(video: ASL HOUSE) | (pose: ASL HOUSE) | HOUSE | HOUSE | House
(video: ASL WRONG-WHAT) | (pose: ASL WRONG-WHAT) | WRONG-WHAT | WRONG-WHAT | What’s the matter? / What’s wrong?
(video: ASL DIFFERENT) | (pose: ASL DIFFERENT) | DIFFERENT | DIFFERENT / BUT | Different / But

Tasks

So far, SLP research has primarily been led by the computer vision community and has focused on processing the visual features in signed language videos. As a result, current SLP methods do not fully address the linguistic complexity of signed languages. We survey common SLP tasks and current methods’ limitations, drawing on signed languages’ linguistic theories.

Sign Language Detection

Sign language detection (Borg and Camilleri 2019; Moryossef et al. 2020; Pal et al. 2023) is the binary classification task of determining whether signing activity is present in a given video frame. A similar task in spoken languages is voice activity detection (VAD) (Sohn, Kim, and Sung 1999; Ramırez et al. 2004), the detection of when a human voice is used in an audio signal. As VAD methods often rely on speech-specific representations such as spectrograms, they are not necessarily applicable to videos.

Borg and Camilleri (2019) introduced the classification of frames taken from YouTube videos as either signing or not signing. They took a spatial and temporal approach based on VGG-16 (Simonyan and Zisserman 2015) CNN to encode each frame and used a Gated Recurrent Unit (GRU) (Cho et al. 2014) to encode the sequence of frames in a window of 20 frames at 5fps. In addition to the raw frame, they either encoded optical-flow history, aggregated motion history, or frame difference.

Moryossef et al. (2020) improved upon their method by performing sign language detection in real time. They identified that sign language use involves movement of the body and, as such, designed a model that works on top of estimated human poses rather than directly on the video signal. They calculated the optical flow norm of every joint detected on the body and applied a shallow yet effective contextualized model to predict for every frame whether the person is signing or not.
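As a rough illustration of this pose-based approach (not the authors’ exact architecture or hyperparameters), the sketch below computes a per-joint movement norm between consecutive frames and contextualizes it with a small recurrent model to produce per-frame signing / not-signing logits.

```python
import torch
import torch.nn as nn

class PoseSignDetector(nn.Module):
    """Frame-level signing detector over pose movement; dimensions are illustrative."""

    def __init__(self, num_joints=75, hidden=64):
        super().__init__()
        self.rnn = nn.LSTM(num_joints, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 2)  # signing / not signing

    def forward(self, poses):  # poses: (batch, time, joints, 2) x/y coordinates
        flow = poses[:, 1:] - poses[:, :-1]   # movement between consecutive frames
        flow_norm = flow.norm(dim=-1)         # (batch, time - 1, joints)
        hidden, _ = self.rnn(flow_norm)       # contextualize over time
        return self.head(hidden)              # per-frame logits

model = PoseSignDetector()
dummy_poses = torch.randn(1, 100, 75, 2)      # a dummy 100-frame pose sequence
logits = model(dummy_poses)                   # (1, 99, 2)
```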

While these recent detection models achieve high performance, we need well-annotated data that include interference and distractions with non-signing instances for proper real-world evaluation. Pal et al. (2023) conducted a detailed analysis of the impact of signer overlap between the training and test sets on two sign detection benchmark datasets (Signing in the Wild (Borg and Camilleri 2019) and the DGS Corpus (Hanke et al. 2020)) used by Borg and Camilleri (2019) and Moryossef et al. (2020). By comparing the accuracy with and without overlap, they noticed a relative decrease in performance for signers not present during training. As a result, they suggested new dataset partitions that eliminate overlap between train and test sets and facilitate a more accurate evaluation of performance.

Sign Language Identification

Sign language identification (Gebre, Wittenburg, and Heskes 2013; Monteiro et al. 2016) classifies which signed language is used in a given video.

Gebre, Wittenburg, and Heskes (2013) found that a simple random-forest classifier utilizing the distribution of phonemes can distinguish between British Sign Language (BSL) and Greek Sign Language (GSL) with a 95% F1 score. This finding is further supported by Monteiro et al. (2016), who, based on activity maps in signing space, manage to differentiate between British Sign Language and French Sign Language (Langue des Signes Française, LSF) with a 98% F1 score in videos with static backgrounds, and between American Sign Language and British Sign Language with a 70% F1 score for videos mined from popular video-sharing sites. The authors attribute their success mainly to the different fingerspelling systems, which are two-handed in the case of BSL and one-handed in the case of ASL and LSF.

Although these pairwise classification results seem promising, better models would be needed for classifying from a large set of signed languages. These methods only rely on low-level visual features, while signed languages have several distinctive features on a linguistic level, such as lexical or structural differences (McKee and Kennedy 2000; Kimmelman 2014; Ferreira-Brito 1984; Shroyer and Shroyer 1984) which have not been explored for this task.

Sign Language Segmentation

Segmentation consists of detecting the frame boundaries for signs or phrases in videos to divide them into meaningful units. While the most canonical way of dividing a spoken language text is into a linear sequence of words, due to the simultaneity of sign language, the notion of a sign language “word” is ill-defined, and sign language cannot be fully linearly modeled.

Current methods resort to segmenting units loosely mapped to signed language units (Santemiz et al. 2009; Farag and Brock 2019; Bull, Gouiffès, and Braffort 2020; Renz, Stache, Albanie, et al. 2021; Renz, Stache, Fox, et al. 2021; Bull et al. 2021) and do not explicitly leverage reliable linguistic predictors of sentence boundaries such as prosody in signed languages (i.e., pauses, extended sign duration, facial expressions) (Sandler 2010; Ormel and Crasborn 2012). De Sisto et al. (2021) call for a better understanding of sign language structure, which they believe is the necessary ground for the design and development of sign language recognition and segmentation methodologies.

Santemiz et al. (2009) automatically extracted isolated signs from continuous signing by aligning the sequences obtained via speech recognition, modeled by Dynamic Time Warping (DTW) and Hidden Markov Models (HMMs) approaches.

Farag and Brock (2019) used a random forest classifier to distinguish frames containing signs in Japanese Sign Language based on the composition of spatio-temporal angular and distance features between domain-specific pairs of joint segments.

Bull, Gouiffès, and Braffort (2020) segmented French Sign Language into segments corresponding to subtitle units by relying on the alignment between subtitles and sign language videos, leveraging a spatio-temporal graph convolutional network (STGCN; Yu, Yin, and Zhu (2017)) with a BiLSTM on 2D skeleton data.

Renz, Stache, Albanie, et al. (2021) located temporal boundaries between signs in continuous sign language videos by employing 3D convolutional neural network representations with iterative temporal segment refinement to resolve ambiguities between sign boundary cues. Renz, Stache, Fox, et al. (2021) further proposed the Changepoint-Modulated Pseudo-Labelling (CMPL) algorithm to solve the problem of source-free domain adaptation.

Bull et al. (2021) presented a Transformer-based approach to segment sign language videos and align them with subtitles simultaneously, encoding subtitles by BERT (Devlin et al. 2019) and videos by CNN video representations.

Moryossef, Jiang, et al. (2023) presented a method motivated by linguistic cues observed in sign language corpora, such as prosody (pauses, pace, etc.) and handshape changes. They also find that BIO tagging, an annotation scheme that marks the beginning, inside, and outside of segments, makes a significant difference over previous schemes that only mark IO (inside or outside). Finally, they find that including optical flow and 3D hand normalization helps generalization to out-of-domain data and to other signed languages.
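To illustrate why the tagging scheme matters, the sketch below converts hypothetical (start, end) sign segments into per-frame BIO tags; with plain IO tags, the boundary between two back-to-back signs would be invisible.

```python
from typing import List, Tuple

def segments_to_bio(num_frames: int, segments: List[Tuple[int, int]]) -> List[str]:
    """Convert (start, end) sign segments (end exclusive) into per-frame BIO tags."""
    tags = ["O"] * num_frames
    for start, end in segments:
        tags[start] = "B"                      # the B tag marks a new segment
        for frame in range(start + 1, end):
            tags[frame] = "I"
    return tags

# Two adjacent signs: an IO scheme would merge them into one segment.
print(segments_to_bio(10, [(2, 5), (5, 8)]))
# ['O', 'O', 'B', 'I', 'I', 'B', 'I', 'I', 'O', 'O']
```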

Sign Language Recognition, Translation, and Production

Sign language translation (SLT) commonly refers to the translation of signed language to spoken language (De Coster et al. 2022; Müller et al. 2022). Sign language production is the reverse process of producing a sign language video from spoken language text. Sign language recognition (SLR) (Adaloglou et al. 2020) detects and labels signs from a video, either on isolated (Imashev et al. 2020; Sincan and Keles 2020) or continuous (Cui, Liu, and Zhang 2017; Camgöz et al. 2018; N. C. Camgöz et al. 2020b) signs.

In the following graph, we can see a fully connected pentagon where each node is a single data representation, and each directed edge represents the task of converting one data representation to another.

We split the graph into two:

(Figure: the sign language tasks graph, split into language-agnostic and language-specific tasks.)

There are 20 tasks conceptually defined by this graph, with varying amounts of previous research. Every path between two nodes might or might not be valid, depending on how lossy the tasks in the path are.


Video-to-Pose

Video-to-Pose—commonly known as pose estimation—is the task of detecting human figures in images and videos, so that one could determine, for example, where someone’s elbow shows up in an image. It was shown that the face pose correlates with facial non-manual features like head direction (Vogler and Goldenstein 2005).

This area has been thoroughly researched (Pishchulin et al. 2012; Chen et al. 2017; Cao et al. 2019; Güler, Neverova, and Kokkinos 2018) with objectives varying from predicting 2D / 3D poses to a selection of a small specific set of landmarks or a dense mesh of a person.

OpenPose (Cao et al. 2019; Simon et al. 2017; Cao et al. 2017; Wei et al. 2016) is the first multi-person system to jointly detect human body, hand, facial, and foot keypoints (in total 135 keypoints) in 2D on single images. While their model can estimate the full pose directly from an image in a single inference, they also suggest a pipeline approach where they first estimate the body pose and then independently estimate the hands and face pose by acquiring higher resolution crops around those areas. Building on the slow pipeline approach, a single network whole body OpenPose model has been proposed (Hidalgo et al. 2019), which is faster and more accurate for the case of obtaining all keypoints. With multiple recording angles, OpenPose also offers keypoint triangulation to reconstruct the pose in 3D.

DensePose (Güler, Neverova, and Kokkinos 2018) takes a different approach. Instead of classifying for every keypoint which pixel is most likely, they suggest a method similar to semantic segmentation, for each pixel to classify which body part it belongs to. Then, for each pixel, knowing the body part, they predict where that pixel is on the body part relative to a 2D projection of a representative body model. This approach results in the reconstruction of the full-body mesh and allows sampling to find specific keypoints similar to OpenPose.

However, 2D human poses might not be sufficient to fully understand the position and orientation of landmarks in space, and applying pose estimation per frame disregards temporal movement information in the video, especially in cases of rapid movement, which introduce motion blur.

Pavllo et al. (2019) developed two methods to convert 2D poses to 3D poses. The first, a supervised method, was trained to use the temporal information between frames to predict the missing Z-axis. The second is an unsupervised method, leveraging the fact that the 2D poses are merely a projection of an unknown 3D pose and training a model to estimate the 3D pose and back-project it to the input 2D poses. This back-projection is a deterministic process, applying constraints on the 3D pose encoder. Zelinka and Kanis (2020) followed a similar process and added a constraint that bones maintain a fixed length across frames.
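A minimal sketch of such a constraint is shown below, assuming a hypothetical skeleton given as parent-child joint pairs (an illustration of the idea, not the authors’ implementation): the loss penalizes bones whose length varies across frames.

```python
import torch

def bone_length_consistency_loss(poses_3d, bones):
    """poses_3d: (time, joints, 3); bones: list of (parent, child) joint indices.
    Penalizes variation of each bone's length over time."""
    losses = []
    for parent, child in bones:
        lengths = (poses_3d[:, child] - poses_3d[:, parent]).norm(dim=-1)  # (time,)
        losses.append(lengths.var())
    return torch.stack(losses).mean()

# A dummy 30-frame sequence with 5 joints and an illustrative skeleton.
poses_3d = torch.randn(30, 5, 3, requires_grad=True)
skeleton = [(0, 1), (1, 2), (0, 3), (3, 4)]
loss = bone_length_consistency_loss(poses_3d, skeleton)
loss.backward()
```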

Panteleris, Oikonomidis, and Argyros (2018) suggest converting the 2D poses to 3D using inverse kinematics (IK), a process taken from computer animation and robotics to calculate the variable joint parameters needed to place the end of a kinematic chain, such as a robot manipulator or animation character’s skeleton, in a given position and orientation relative to the start of the chain. Demonstrating their approach to hand pose estimation, they manually explicitly encode the constraints and limits of each joint, resulting in 26 degrees of freedom. Then, non-linear least-squares minimization fits a 3D model of the hand to the estimated 2D joint positions, recovering the 3D hand pose. This process is similar to the back-projection used by Pavllo et al. (2019), except here, no temporal information is being used.

MediaPipe Holistic (Grishchenko and Bazarevsky 2020) attempts to solve 3D pose estimation by taking a similar approach to OpenPose, having a pipeline system to estimate the body, then the face and hands. Unlike OpenPose, the estimated poses are in 3D, and the pose estimator runs in real-time on CPU, allowing for pose-based sign language models on low-powered mobile devices. This pose estimation tool is widely available and built for Android, iOS, C++, Python, and the Web using JavaScript.
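A minimal usage example with MediaPipe’s Python Solutions API is sketched below; the input video path is hypothetical, and the exact API surface may differ across MediaPipe versions.

```python
import cv2
import mediapipe as mp

mp_holistic = mp.solutions.holistic
cap = cv2.VideoCapture("signing.mp4")  # hypothetical input video

with mp_holistic.Holistic(static_image_mode=False, model_complexity=1) as holistic:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB input; OpenCV reads frames as BGR.
        results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.pose_landmarks:
            # 33 body landmarks with normalized x, y and relative z.
            nose = results.pose_landmarks.landmark[0]
            print(nose.x, nose.y, nose.z)
        # results.face_landmarks, results.left_hand_landmarks, and
        # results.right_hand_landmarks hold the face and hand keypoints.
cap.release()
```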

Pose-to-Video

Pose-to-Video, also known as motion transfer or skeletal animation in the field of robotics and animation, is the conversion of a sequence of poses to a video. This task is the final “rendering” of sign language in a visual modality.

Chan et al. (2019) demonstrated a semi-supervised approach where they took a set of videos, ran pose estimation with OpenPose (Cao et al. 2019), and learned an image-to-image translation (Isola et al. 2017) between the rendered skeleton and the original video. They demonstrated their approach on human dancing, extracting poses from a choreography and rendering any person as if they were dancing. They predicted two consecutive frames for temporally coherent video results and introduced a separate pipeline for a more realistic face synthesis, although still flawed.

Wang et al. (2018) suggested a similar method using DensePose (Güler, Neverova, and Kokkinos 2018) representations in addition to the OpenPose (Cao et al. 2019) ones. They formalized a different model, with various objectives to optimize for, such as background-foreground separation and temporal coherence by using the previous two timestamps in the input.

Using the method of Chan et al. (2019) on “Everybody Dance Now”, Giró-i-Nieto (2020) asked, “Can Everybody Sign Now?” and investigated if people could understand sign language from automatically generated videos. They conducted a study in which participants watched three types of videos: the original signing videos, videos showing only poses (skeletons), and reconstructed videos with realistic signing. The researchers evaluated the participants’ understanding after watching each type of video. Results revealed a preference for reconstructed videos over skeleton videos. However, the standard video synthesis methods used in the study were not effective enough for clear sign language translation. Participants had trouble understanding the reconstructed videos, suggesting that improvements are needed for better sign language translation in the future.

As a direct response, Saunders, Camgöz, and Bowden (2020b) showed that like in Chan et al. (2019), where an adversarial loss was added to specifically generate the face, adding a similar loss to the hand generation process yielded high-resolution, more photo-realistic continuous sign language videos. To further improve the hand image synthesis quality, they introduced a keypoint-based loss function to avoid issues caused by motion blur.

In a follow-up paper, Saunders, Camgöz, and Bowden (2021) introduced the task of Sign Language Video Anonymisation (SLVA) as an automatic method to anonymize the visual appearance of a sign language video while retaining the original sign language content. Using a conditional variational autoencoder framework, they first extracted pose information from the source video to remove the original signer appearance, then generated a photo-realistic sign language video of a novel appearance from the pose sequence. The authors proposed a novel style loss that ensures style consistency in the anonymized sign language videos.

Sign Language Avatars
JASigning

is a virtual signing system that generates sign language performances using virtual human characters. This system evolved from the earlier SiGMLSigning system, which was developed during the ViSiCAST (Bangham et al. 2000; Elliott et al. 2000) and eSIGN (Zwitserlood et al. 2004) projects, and later underwent further development as part of the Dicta-Sign project (Matthes et al. 2012; Efthimiou et al. 2012).

Originally, JASigning relied on Java JNLP apps for standalone use and integration into web pages. However, this approach became outdated due to the lack of support for Java in modern browsers. Consequently, the more recent CWA Signing Avatars (CWASA) system was developed, which is based on HTML5, utilizing JavaScript and WebGL technologies.

SiGML (Signing Gesture Markup Language) (Elliott et al. 2004) is an XML application that enables the transcription of sign language gestures. SiGML builds on HamNoSys, and indeed, one variant of SiGML is essentially an encoding of HamNoSys manual features, accompanied by a representation of non-manual aspects. SiGML is the input notation used by the JASigning applications and web applets. A number of editing tools for SiGML are available, mostly produced by the University of Hamburg.

The system parses the English text into SiGML before mapping it onto a 3D signing avatar that can produce signing. CWASA then uses a large database of pre-defined 3D signing avatar animations, which can be combined to form new sentences. The system includes a 3D editor, allowing users to create custom signing avatars and animations.

PAULA (Davidson 2006)

is a computer-based sign language avatar, initially developed for teaching sign language to hearing adults. The avatar is a 3D model of a person with a sign vocabulary that is manually animated. It takes an ASL utterance as a stream of glosses, performs syntactic and morphological modifications, decides on the appropriate phonemes and timings, and combines the results into a 3D animation of the avatar. Over the years, several techniques were used to make the avatar look more realistic.

Over the years, several advancements have been made to enhance the realism and expressiveness of the PAULA avatar, such as refining the eyebrow motion to appear more natural (Wolfe et al. 2011), combining emotion and co-occurring facial nonmanual signals (Schnepp et al. 2012, 2013), improving smoothness while avoiding robotic movements (McDonald et al. 2016), and facilitating simultaneity (McDonald et al. 2017). Other developments include interfacing with sign language notation systems like AZee (Filhol, McDonald, and Wolfe 2017), enhancing mouthing animation (Johnson, Brumm, and Wolfe 2018; Wolfe et al. 2022), multi-layering facial textures and makeup (Wolfe et al. 2019), and applying adverbial modifiers (Moncrief 2020, 2021).

Additional improvements to PAULA focus on making the avatar more lifelike by relaxing wrist orientations and other extreme “mathematical” angles (Filhol and McDonald 2020), refining hand shape transition, relaxation, and collision (Baowidan 2021), implementing hierarchical phrase transitions (McDonald and Filhol 2021), creating more realistic facial muscle control (McDonald, Johnson, and Wolfe 2022), and supporting geometric relocations (Filhol and McDonald 2022).

SiMAX (Sign Time GmbH 2020)

is a software application developed to transform textual input into 3D animated sign language representations. Utilizing a comprehensive database and the expertise of deaf sign language professionals, SiMAX ensures accurate translations of both written and spoken content. The process begins with the generation of a translation suggestion, which is subsequently reviewed and, if necessary, modified by deaf translators to ensure accuracy and cultural appropriateness. These translations are carried out by a customizable digital avatar that can be adapted to reflect the corporate identity or target audience of the user. This approach offers a cost-effective alternative to traditional sign language video production, as it eliminates the need for expensive film studios and complex video technology typically associated with such productions.

Image and Video Generation Models

Most recently in the field of image and video generation, there have been notable advances in methods such as Style-Based Generator Architecture for Generative Adversarial Networks (Karras, Laine, and Aila 2018; Karras et al. 2019, 2021), Variational Diffusion Models (Kingma et al. 2021), High-Resolution Image Synthesis with Latent Diffusion Models (Rombach et al. 2021), High Definition Video Generation with Diffusion Models (Ho et al. 2022), and High-Resolution Video Synthesis with Latent Diffusion Models (Blattmann et al. 2023). These methods have significantly improved image and video synthesis quality, providing stunningly realistic and visually appealing results.

However, despite their remarkable progress in generating high-quality images and videos, these models trade off computational efficiency. The complexity of these algorithms often results in slower inference times, making real-time applications challenging. On-device deployment of these models provides benefits such as lower server costs, offline functionality, and improved user privacy. While compute-aware optimizations, specifically targeting hardware capabilities of different devices, could improve the inference latency of these models, Chen et al. (2023) found that optimizing such models on top-of-the-line mobile devices such as the Samsung S23 Ultra or iPhone 14 Pro Max can decrease per-frame inference latency from around 23 seconds to around 12 seconds.

ControlNet (L. Zhang and Agrawala 2023) recently presented a neural network structure for controlling pretrained large diffusion models with additional input conditions. This approach enables end-to-end learning of task-specific conditions, even with a small training dataset. Training a ControlNet is as fast as fine-tuning a diffusion model and can be executed on personal devices or scaled to large amounts of data using powerful computation clusters. ControlNet has been demonstrated to augment large diffusion models like Stable Diffusion with conditional inputs such as edge maps, segmentation maps, and keypoints. One of the applications of ControlNet is pose-to-image translation control, which allows the generation of images based on pose information. Although this method has shown promising results, it still requires retraining the model and does not inherently support temporal coherency, which is important for tasks like sign language translation.
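A minimal sketch of pose-conditioned image generation with the Hugging Face diffusers library follows; the model identifiers are illustrative, and, as noted above, each frame is generated independently, so temporal coherence is not handled.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from PIL import Image

# An OpenPose-conditioned ControlNet on top of a Stable Diffusion backbone.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

pose_image = Image.open("pose_frame.png")  # a rendered skeleton for one frame
frame = pipe("a person signing in front of a plain background",
             image=pose_image, num_inference_steps=20).images[0]
frame.save("generated_frame.png")
```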

In the near future, we can expect many works on controlling video diffusion models directly from text for sign language translation. These models will likely generate visually appealing and realistic videos. However, they may still make mistakes and be limited to scenarios with more training data available. Developing models that can accurately generate sign language videos from text or pose information while maintaining visual quality and temporal coherency will be essential for advancing the field of sign language production.


Pose-to-Gloss

Pose-to-Gloss, also known as sign language recognition, is the task of recognizing a sequence of signs from a sequence of poses. Though some previous works have referred to this as “sign language translation,” recognition merely determines the associated label of each sign, without handling the syntax and morphology of the signed language (C. Padden 1988) to create a spoken language output. Instead, SLR has often been used as an intermediate step during translation to produce glosses from signed language videos.

Jiang et al. (2021) proposed a novel Skeleton Aware Multi-modal Framework with a Global Ensemble Model (GEM) for isolated SLR (SAM-SLR-v2) to learn and fuse multimodal feature representations. Specifically, they use a Sign Language Graph Convolution Network (SL-GCN) to model the embedded dynamics of skeleton keypoints and a Separable Spatial-Temporal Convolution Network (SSTCN) to exploit skeleton features. The proposed late-fusion GEM fuses the skeleton-based predictions with other RGB and depth-based modalities to provide global information and make an accurate SLR prediction.

Dafnis et al. (2022) work on the same modified WLASL dataset as Jiang et al. (2021), but do not require multimodal data input. Instead, they propose a bidirectional skeleton-based graph convolutional network framework with linguistically motivated parameters and attention to the start and end frames of signs. They cooperatively use forward and backward data streams, including various sub-streams, as input. They also use pre-training to leverage transfer learning.

Selvaraj et al. (2022) introduced an open-source OpenHands library, which consists of standardized pose datasets for different existing sign language datasets and trained checkpoints of four pose-based isolated sign language recognition models across six languages (American, Argentinian, Chinese, Greek, Indian, and Turkish). To address the lack of labeled data, they propose self-supervised pretraining on unlabeled data and curate the largest pose-based pretraining dataset on Indian Sign Language (Indian-SL). They established that pretraining is effective for sign language recognition by demonstrating improved fine-tuning performance especially in low-resource settings and high crosslingual transfer from Indian-SL to a few other sign languages.

The work of Kezar, Thomason, and Sehyr (2023), based on the OpenHands library, explicitly recognizes the role of phonology to achieve more accurate isolated sign language recognition (ISLR). To allow additional predictions on phonological characteristics (such as handshape), they combine the phonological annotations in ASL-LEX 2.0 (Sehyr et al. 2021) with signs in the WLASL 2000 ISLR benchmark (Li et al. 2020). Interestingly, Tavella et al. (2022) construct a similar dataset aiming just for phonological property recognition in American Sign Language (ASL).

Gloss-to-Pose

Gloss-to-Pose, subsumed under the task of sign language production, is the task of producing a sequence of poses that adequately represent a sequence of signs written as gloss.

To produce a sign language video, Stoll et al. (2018) constructed a lookup table between glosses and sequences of 2D poses. They aligned all pose sequences at the neck joint of a reference skeleton and grouped all sequences belonging to the same gloss. Then, for each group, they applied dynamic time warping and averaged out all sequences in the group to construct the mean pose sequence. This approach suffers from not having an accurate set of poses aligned to the gloss and from unnatural motion transitions between glosses.

To alleviate the downsides of the previous work, Stoll et al. (2020) constructed a lookup table of gloss to a group of sequences of poses rather than creating a mean pose sequence. They built a Motion Graph (Min and Chai 2012), which is a Markov process used to generate new motion sequences that are representative of natural motion, and selected the motion primitives (sequence of poses) per gloss with the highest transition probability. To smooth that sequence and reduce unnatural motion, they used a Savitzky–Golay motion transition smoothing filter (Savitzky and Golay 1964). Moryossef, Müller, et al. (2023) re-implemented their approach and made it open-source.
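The sketch below illustrates the general lookup-and-smooth idea with a hypothetical gloss-to-pose dictionary; naive concatenation plus a Savitzky–Golay filter stands in for the motion-graph selection described above.

```python
import numpy as np
from scipy.signal import savgol_filter

# A hypothetical lookup of gloss -> pose sequence, shaped (frames, joints, 2).
pose_lookup = {
    "HOUSE": np.random.rand(20, 137, 2),
    "DIFFERENT": np.random.rand(25, 137, 2),
}

def glosses_to_pose(glosses, lookup, window=9, order=3):
    """Concatenate per-gloss pose sequences and smooth joint trajectories
    over time to soften the transitions between glosses."""
    sequence = np.concatenate([lookup[g] for g in glosses], axis=0)
    return savgol_filter(sequence, window_length=window, polyorder=order, axis=0)

poses = glosses_to_pose(["HOUSE", "DIFFERENT"], pose_lookup)
print(poses.shape)  # (45, 137, 2)
```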

Huang et al. (2021) used a new non-autoregressive model to generate a sequence of poses for a sequence of glosses. They argued that existing models like Saunders, Camgöz, and Bowden (2020a) are prone to error accumulation and high inference latency due to their autoregressive nature. Their model performs gradual upsampling of the poses, starting with a pose including only two joints in the first layer and gradually introducing more keypoints. They evaluated their model on the Phoenix-14T dataset (Camgöz et al. 2018) using Dynamic Time Warping (DTW) (Berndt and Clifford 1994) to align the poses before computing Mean Joint Error (DTW-MJE). They demonstrated that their model outperforms existing methods in terms of accuracy and speed, making it a promising approach for fast and high-quality sign language production.


Video-to-Gloss

Video-to-Gloss, also known as sign language recognition, is the task of recognizing a sequence of signs from a video.

For this recognition, Cui, Liu, and Zhang (2017) construct a three-step optimization model. First, they train a video-to-gloss end-to-end model, where they encode the video using a spatio-temporal CNN encoder and predict the gloss using a Connectionist Temporal Classification (CTC) loss (Graves et al. 2006). Then, from the CTC alignment and category proposal, they encode each gloss-level segment independently, trained to predict the gloss category, and use this gloss-segment encoding to optimize the sequence learning model.

Camgöz et al. (2018) fundamentally differ from that approach and formulate this problem as if it is a natural-language translation problem. They encode each video frame using AlexNet (Krizhevsky, Sutskever, and Hinton 2012), initialized using weights trained on ImageNet (Deng et al. 2009). Then they apply a GRU encoder-decoder architecture with Luong Attention (Luong, Pham, and Manning 2015) to generate the gloss. In follow-up work, N. C. Camgöz et al. (2020b) use a transformer encoder (Vaswani et al. 2017) to replace the GRU and use a CTC to decode the gloss. They show a slight improvement with this approach on the video-to-gloss task.
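The sketch below shows the general shape of such a model in PyTorch, assuming precomputed per-frame CNN features rather than raw video and using illustrative dimensions: a transformer encoder over frame features followed by a CTC loss over the gloss vocabulary.

```python
import torch
import torch.nn as nn

class VideoToGloss(nn.Module):
    """Frame-feature encoder with a CTC objective over glosses (illustrative sizes)."""

    def __init__(self, feature_dim=512, vocab_size=1000, blank=0):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feature_dim, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(feature_dim, vocab_size)
        self.ctc = nn.CTCLoss(blank=blank, zero_infinity=True)

    def forward(self, frame_features, gloss_ids, input_lengths, target_lengths):
        # frame_features: (batch, time, feature_dim) from a pretrained CNN.
        encoded = self.encoder(frame_features)
        log_probs = self.classifier(encoded).log_softmax(-1)  # (batch, time, vocab)
        # nn.CTCLoss expects (time, batch, vocab).
        return self.ctc(log_probs.transpose(0, 1), gloss_ids,
                        input_lengths, target_lengths)

model = VideoToGloss()
loss = model(torch.randn(2, 120, 512),             # 2 clips, 120 frames each
             torch.randint(1, 1000, (2, 12)),      # 12 glosses per clip
             torch.tensor([120, 120]), torch.tensor([12, 12]))
```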

Adaloglou et al. (2020) perform a comparative experimental assessment of computer vision-based methods for the video-to-gloss task. They implement various approaches from previous research (Camgöz et al. 2017; Cui, Liu, and Zhang 2019; Vaezi Joze and Koller 2019) and test them on multiple datasets (Huang et al. 2018; Camgöz et al. 2018; Von Agris and Kraiss 2007; Vaezi Joze and Koller 2019) either for isolated sign recognition or continuous sign recognition. They conclude that 3D convolutional models outperform models using only recurrent networks to capture the temporal information, and that these models are more scalable given the restricted receptive field, which results from the CNN “sliding window” technique.

Momeni, Bull, Prajwal, et al. (2022a) developed a comprehensive pipeline that combines various models to densely annotate sign language videos. By leveraging the use of synonyms and subtitle-signing alignment, their approach demonstrates the value of pseudo-labeling from a sign recognition model for sign spotting. They propose a novel method to increase annotations for both known and unknown classes, relying on in-domain exemplars. As a result, their framework significantly expands the number of confident automatic annotations on the BOBSL BSL sign language corpus (Albanie et al. 2021) from 670K to 5M, and they generously make these annotations publicly available.

Gloss-to-Video

Gloss-to-Video, also known as sign language production, is the task of producing a video that adequately represents a sequence of signs written as gloss.

As of 2020, no research discusses the direct translation task between gloss and video. This lack of discussion results from the computational impracticality of the desired model, leading researchers to refrain from performing this task directly and instead rely on pipeline approaches using intermediate pose representations.


Gloss-to-Text

Gloss-to-Text, also known as sign language translation, is the natural language processing task of translating between gloss text representing sign language signs and spoken language text. These texts commonly differ in terminology, capitalization, and sentence structure.

Camgöz et al. (2018) experimented with various machine-translation architectures and compared using an LSTM (Hochreiter and Schmidhuber 1997) vs. GRU for the recurrent model, as well as Luong attention (Luong, Pham, and Manning 2015) vs. Bahdanau attention (Bahdanau, Cho, and Bengio 2015) and various batch sizes. They concluded that on the RWTH-PHOENIX-Weather-2014T dataset, which was also presented in this work, using GRUs, Luong attention, and a batch size of 1 outperforms all other configurations.

In parallel with the advancements in spoken language machine translation, Yin and Read (2020) proposed replacing the RNN with a Transformer (Vaswani et al. 2017) encoder-decoder model, showing improvements on both RWTH-PHOENIX-Weather-2014T (DGS) and ASLG-PC12 (ASL) datasets both using a single model and ensemble of models. Interestingly, in gloss-to-text, they show that using the sign language recognition (video-to-gloss) system output outperforms using the gold annotated glosses.

Building on the code published by Yin and Read (2020), Moryossef, Yin, et al. (2021) show it is beneficial to pre-train these translation models using augmented monolingual spoken language corpora. They try three different approaches for data augmentation: (1) Back-translation; (2) General text-to-gloss rules, including lemmatization, word reordering, and dropping of words; (3) Language-pair-specific rules augmenting the spoken language syntax to its corresponding sign language syntax. When pretraining, all augmentations show improvements over the baseline for RWTH-PHOENIX-Weather-2014T (DGS) and NCSLGR (ASL).
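The sketch below loosely mimics the second, rule-based augmentation strategy, with a toy stopword list and lemma table; a real system would use a proper lemmatizer and language-pair-specific rules.

```python
import random

STOPWORDS = {"the", "a", "an", "is", "are", "to", "of"}   # illustrative list
LEMMAS = {"signing": "sign", "languages": "language"}      # stand-in lemmatizer

def text_to_pseudo_gloss(sentence, drop_prob=0.1, seed=0):
    """Lemmatize, drop function words, randomly drop tokens, and uppercase."""
    random.seed(seed)
    tokens = [LEMMAS.get(t, t) for t in sentence.lower().split()
              if t not in STOPWORDS]
    tokens = [t for t in tokens if random.random() > drop_prob]
    return " ".join(t.upper() for t in tokens)

print(text_to_pseudo_gloss("The weather is nice in the south of Germany"))
# WEATHER NICE IN SOUTH GERMANY
```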

Text-to-Gloss

Text-to-gloss, an instantiation of sign language translation, is the task of translating between a spoken language text and sign language glosses. It is an appealing area of research because glosses integrate easily into existing NMT pipelines, even though recent works such as Yin and Read (2020) and Müller et al. (2023) argue that glosses are an inefficient representation of sign language, and glosses are not a complete representation of signs (Pizzuto, Rossini, and Russo 2006).

Zhao et al. (2000) used a Tree Adjoining Grammar (TAG)-based system to translate English sentences to American Sign Language (ASL) gloss sequences. They parsed the English text and simultaneously assembled an ASL gloss tree, using Synchronous TAGs (Shieber and Schabes 1990; Shieber 1994), by associating the ASL elementary trees with the English elementary trees and associating the nodes at which subsequent substitutions or adjunctions can occur. Synchronous TAGs have been used for machine translation between spoken languages (Abeillé, Schabes, and Joshi 1991), but this was the first application to a signed language.

For the automatic translation of gloss-to-text, Othman and Jemni (2012) identified the need for a large parallel sign language gloss and spoken language text corpus. They developed a part-of-speech-based grammar to transform English sentences from the Gutenberg Project ebooks collection (Lebert 2008) into American Sign Language gloss. Their final corpus contains over 100 million synthetic sentences and 800 million words and is the most extensive English-ASL gloss corpus we know of. Unfortunately, it is hard to attest to the quality of the corpus, as the authors did not evaluate their method on real English-ASL gloss pairs.

Egea Gómez, McGill, and Saggion (2021) presented a syntax-aware transformer for this task, injecting word dependency tags to augment the embeddings inputted to the encoder. This involves minor modifications to the neural architecture, with negligible impact on the computational complexity of the model. Testing their model on RWTH-PHOENIX-Weather-2014T (Camgöz et al. 2018), they demonstrated that injecting this additional information results in better translation quality.


Video-to-Text

Video-to-text, also known as sign language translation, is the task of translating a raw video to spoken language text.

N. C. Camgöz et al. (2020b) proposed a single architecture to perform this task that can use both the sign language gloss and the spoken language text in joint supervision. They use the pre-trained spatial embeddings from Koller et al. (2019) to encode each frame independently and encode the frames with a transformer. On this encoding, they use a Connectionist Temporal Classification (CTC) (Graves et al. 2006) to classify the sign language gloss. Using the same encoding, they use a transformer decoder to decode the spoken language text one token at a time. They show that adding gloss supervision improves the model over not using it and that it outperforms previous video-to-gloss-to-text pipeline approaches (Camgöz et al. 2018).
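The sketch below shows this joint objective in PyTorch with illustrative shapes and the text decoder stubbed out: a CTC loss over glosses computed from the shared frame encoding, added to a cross-entropy loss over the spoken language tokens.

```python
import torch
import torch.nn as nn

# `encoded` stands in for transformer-encoded frame features: (time, batch, dim).
time_steps, batch, dim, gloss_vocab, text_vocab = 120, 2, 512, 1000, 8000
encoded = torch.randn(time_steps, batch, dim)

# Gloss recognition head trained with CTC over the shared encoding.
gloss_head = nn.Linear(dim, gloss_vocab)
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
gloss_log_probs = gloss_head(encoded).log_softmax(-1)         # (time, batch, vocab)
gloss_loss = ctc(gloss_log_probs,
                 torch.randint(1, gloss_vocab, (batch, 10)),  # dummy gloss targets
                 torch.full((batch,), time_steps), torch.full((batch,), 10))

# A text decoder (omitted here) attends to `encoded`; its logits are trained
# with ordinary cross-entropy on the spoken language tokens.
decoder_logits = torch.randn(batch, 20, text_vocab, requires_grad=True)
text_loss = nn.CrossEntropyLoss()(decoder_logits.reshape(-1, text_vocab),
                                  torch.randint(0, text_vocab, (batch * 20,)))

loss = gloss_loss + text_loss   # joint recognition + translation objective
loss.backward()
```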

Following up, N. C. Camgöz et al. (2020a) propose a new architecture that does not require the supervision of glosses, named “Multi-channel Transformers for Multi-articulatory Sign Language Translation”. In this approach, they crop the signing hand and the face and perform 3D pose estimation to obtain three separate data channels. They encode each data channel separately using a transformer, then encode all channels together and concatenate the separate channels for each frame. Like their previous work, they use a transformer decoder to decode the spoken language text, but unlike their previous work, do not use the gloss as additional supervision. Instead, they add two “anchoring” losses to predict the hand shape and mouth shape from each frame independently, as silver annotations are available to them using the model proposed in Koller et al. (2019). They conclude that this approach is on-par with previous approaches requiring glosses, and so they have broken the dependency upon costly annotated gloss information in the video-to-text task.

Shi et al. (2022) introduce OpenASL, a large-scale American Sign Language (ASL) - English dataset collected from online video sites (e.g., YouTube), and then propose a set of techniques including sign search as a pretext task for pre-training and fusion of mouthing and handshape features to improve translation quality in the absence of glosses and in the presence of visually challenging data.

B. Zhang, Müller, and Sennrich (2023) propose a multi-modal, multi-task learning approach to end-to-end sign language translation. The model features shared representations for different modalities such as text and video and is trained jointly on several tasks such as video-to-gloss, gloss-to-text, and video-to-text. The approach allows leveraging external data such as parallel data for spoken language machine translation.

Text-to-Video

Text-to-Video, also known as sign language production, is the task of producing a video that adequately represents a spoken language text in sign language.

As of 2020, no research discusses the direct translation task between text and video. This lack of discussion results from the computational impracticality of the desired model, leading researchers to refrain from performing this task directly and instead rely on pipeline approaches using intermediate pose representations.


Pose-to-Text

Pose-to-text, also known as sign language translation, is the task of translating a captured or estimated pose sequence to spoken language text.

Ko et al. (2019) demonstrated impressive performance on the pose-to-text task by inputting the pose sequence into a standard encoder-decoder translation network. They experimented both with a GRU and various types of attention (Luong, Pham, and Manning 2015; Bahdanau, Cho, and Bengio 2015) and with a Transformer (Vaswani et al. 2017), and showed similar performance, with the transformer underperforming on the validation set and performing better on the test set, which consists of unseen signers. They experimented with various normalization schemes, mainly subtracting the mean and dividing by the standard deviation of every individual keypoint, either with respect to the entire frame or to the relevant “object” (body, face, and hand).
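A minimal sketch of such object-wise normalization is shown below, with hypothetical keypoint counts per “object” and each keypoint standardized over time.

```python
import numpy as np

def normalize_pose(poses, eps=1e-6):
    """poses: dict of arrays shaped (time, keypoints, 2), one per 'object'
    (body, face, hand); standardize each keypoint independently over time."""
    normalized = {}
    for part, keypoints in poses.items():
        mean = keypoints.mean(axis=0, keepdims=True)
        std = keypoints.std(axis=0, keepdims=True)
        normalized[part] = (keypoints - mean) / (std + eps)
    return normalized

poses = {"body": np.random.rand(100, 25, 2),
         "face": np.random.rand(100, 70, 2),
         "right_hand": np.random.rand(100, 21, 2)}
normalized = normalize_pose(poses)
print(normalized["body"].mean(), normalized["body"].std())  # ~0 and ~1
```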

Text-to-Pose

Text-to-Pose, also known as sign language production, is the task of producing a sequence of poses that adequately represent a spoken language text in sign language. Most efforts use poses as an intermediate representation to overcome the challenges of generating videos directly, with the goal of using computer animation or pose-to-video models to perform video production.

Saunders, Camgöz, and Bowden (2020c) proposed Progressive Transformers, a model to translate from discrete spoken language sentences to continuous 3D sign pose sequences in an autoregressive manner. Unlike symbolic transformers (Vaswani et al. 2017), which use a discrete vocabulary and thus can predict an end-of-sequence (EOS) token in every step, the progressive transformer predicts a counter ∈ [0, 1] in addition to the pose. At inference time, counter = 1 is considered the end of the sequence. They tested their approach on the RWTH-PHOENIX-Weather-2014T dataset using OpenPose 2D pose estimation, uplifted to 3D (Zelinka and Kanis 2020), and showed favorable results when evaluating using back-translation from the generated poses to spoken language. They further showed (Saunders, Camgöz, and Bowden 2020a) that using an adversarial discriminator between the ground truth poses and the generated poses, conditioned on the input spoken language text, improves the production quality as measured using back-translation.
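The decoding loop below sketches the counter mechanism, with a dummy step function standing in for a trained progressive transformer and an assumed stopping threshold close to 1.

```python
import torch

def decode_until_counter(step_fn, max_len=200, threshold=0.95):
    """Autoregressive pose decoding with a progress counter instead of an EOS token:
    `step_fn` maps the poses generated so far to (next_pose, counter in [0, 1])."""
    poses = []
    for _ in range(max_len):
        next_pose, counter = step_fn(poses)
        poses.append(next_pose)
        if counter >= threshold:   # a counter near 1 marks the end of the sequence
            break
    return torch.stack(poses)

# A dummy step function producing 50 3D joints per frame.
def dummy_step(poses):
    return torch.zeros(50, 3), min(1.0, (len(poses) + 1) / 60)

print(decode_until_counter(dummy_step).shape)  # torch.Size([57, 50, 3])
```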

To overcome the issues of under-articulation seen in the above works, Saunders, Camgöz, and Bowden (2020b) expanded on the progressive transformer model using a Mixture Density Network (MDN) (Bishop 1994) to model the variation found in sign language. While this model underperformed on the validation set, compared to previous work, it outperformed on the test set.

Zelinka and Kanis (2020) presented a similar autoregressive decoder approach, adding dynamic time warping (DTW) and soft attention. They tested their approach on Czech Sign Language weather data extracted from news broadcasts, which is neither manually annotated nor aligned to the spoken language captions, and showed that their DTW is advantageous for this kind of task.

Xiao, Qin, and Yin (2020) closed the loop by proposing a text-to-pose-to-text model for the case of isolated sign language recognition. They first trained a classifier that takes a sequence of poses encoded by a BiLSTM and classifies the relevant sign, then proposed a production system that takes a single sign and samples a constant-length sequence of 50 poses from a Gaussian Mixture Model. These components are combined such that given a sign class y, a pose sequence is generated, then classified back into a sign class ŷ, and the loss is applied between y and ŷ rather than directly on the generated pose sequence. They evaluate their approach on the CSL dataset (Huang et al. 2018) and show that their generated pose sequences almost reach the same classification performance as the reference sequences.
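
The classify-back training signal can be sketched as follows; this is a simplified illustration in which the per-class GMM sampler is replaced by a placeholder and the BiLSTM classifier is reduced to a minimal module, so the sizes and components are assumptions rather than the paper's:

import torch
import torch.nn as nn
import torch.nn.functional as F

POSE_DIM, N_CLASSES, SEQ_LEN = 100, 50, 50  # illustrative sizes only

class PoseClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(POSE_DIM, 128, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * 128, N_CLASSES)

    def forward(self, poses):  # poses: (batch, SEQ_LEN, POSE_DIM)
        _, (hidden, _) = self.rnn(poses)
        hidden = torch.cat([hidden[0], hidden[1]], dim=-1)  # both directions
        return self.out(hidden)  # logits over sign classes

def generate_poses(labels):  # placeholder for the per-class GMM sampler
    return torch.randn(labels.shape[0], SEQ_LEN, POSE_DIM)

classifier = PoseClassifier()
y = torch.randint(0, N_CLASSES, (8,))
poses = generate_poses(y)          # sign class -> pose sequence
logits = classifier(poses)         # pose sequence -> predicted class
loss = F.cross_entropy(logits, y)  # loss between y and ŷ, not on the poses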

Due to the lack of suitable automatic evaluation methods for generated signs, existing works resort to measuring back-translation quality, which can accurately capture neither the quality of the produced signs nor their usability in real-world settings. Understanding how distinctions in meaning are created in signed languages may help develop a better evaluation method.


Notation-to-Text

Jiang et al. (2023) explore text-to-text sign to spoken language translation, with SignWriting as the chosen sign language notation system. Although SignWriting is usually represented in 2D, they use the 1D Formal SignWriting specification and propose a neural factored machine translation approach to encode sequences of SignWriting graphemes as well as their positions in 2D space. They verify the proposed approach on the SignBank dataset in both a bilingual setup (American Sign Language to English) and two multilingual setups (4 and 21 language pairs, respectively). They apply several low-resource machine translation techniques originally used to improve spoken language translation and show that they similarly improve the performance of sign language translation. Their findings validate the use of an intermediate text representation for signed language translation and pave the way for including sign language translation in natural language processing research.
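
A factored encoding of this kind can be sketched as follows; this is not the authors' implementation, and it assumes the SignWriting input has already been split into grapheme IDs and discretized x/y positions:

import torch
import torch.nn as nn

class FactoredSignWritingEmbedding(nn.Module):
    def __init__(self, n_graphemes, n_positions, dim):
        super().__init__()
        self.grapheme = nn.Embedding(n_graphemes, dim)
        self.x = nn.Embedding(n_positions, dim)
        self.y = nn.Embedding(n_positions, dim)

    def forward(self, graphemes, xs, ys):
        # Combine the factor embeddings into a single input vector per symbol
        # (summation shown here; concatenation is another common choice).
        return self.grapheme(graphemes) + self.x(xs) + self.y(ys)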

Text-to-Notation

Jiang et al. (2023) also explore the reverse translation direction, i.e., text to SignWriting translation. They conduct experiments under the same conditions as their multilingual SignWriting-to-text (4 language pairs) experiment, and again propose a neural factored machine translation approach, this time to decode the graphemes and their positions separately. They borrow BLEU from spoken language translation to evaluate the predicted graphemes and use mean absolute error to evaluate the positional numbers.

Walsh, Saunders, and Bowden (2022) explore Text to HamNoSys (T2H) translation, with HamNoSys as the target sign language notation system. They experiment with direct T2H and Text to Gloss to HamNoSys (T2G2H) translation on a subset of the data from the MEINE DGS dataset (Hanke et al. 2020), where all glosses are mapped to HamNoSys by a dictionary lookup. They find that direct T2H translation yields higher BLEU scores, although it remains unclear how well BLEU reflects the quality of HamNoSys translations. They encode HamNoSys with BPE (Sennrich, Haddow, and Birch 2016), outperforming character-level and word-level tokenization. They also leverage BERT to create better sentence-level embeddings and use HamNoSys to extract the handshapes of a sign as additional supervision during training.


Notation-to-Pose

Arkushin, Moryossef, and Fried (2023) proposed Ham2Pose, a model to animate HamNoSys into a sequence of poses. They first encode the HamNoSys into a meaningful “context” representation using a transformer encoder and use it to predict the length of the pose sequence to be generated. Then, starting from a still frame, they use an iterative non-autoregressive decoder to gradually refine the sign over T steps: in each time step t from T to 1, the model predicts the required change from step t to step t − 1. After T steps, the pose generator outputs the final pose sequence. Their model outperformed previous methods such as Saunders, Camgöz, and Bowden (2020c), animating HamNoSys into more realistic sign language sequences.
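
The refinement procedure can be sketched as follows; this is a schematic illustration rather than the Ham2Pose code, and decoder.refine is a hypothetical callable that predicts the change from step t to step t − 1:

import numpy as np

def animate_hamnosys(decoder, context, seq_len, num_joints=50, T=10):
    poses = np.zeros((seq_len, num_joints, 3))  # a repeated still frame
    for t in range(T, 0, -1):
        delta = decoder.refine(context, poses, t)  # predicted change t -> t-1
        poses = poses + delta
    return poses  # final pose sequence after T refinement steps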

Fingerspelling

Fingerspelling is spelling a word letter-by-letter, borrowing from the spoken language alphabet (Battison 1978; Wilcox 1992; Brentari and Padden 2001; Patrie and Johnson 2011). This phenomenon, found in most signed languages, often occurs when there is no previously agreed-upon sign for a concept, like in technical language, colloquial conversations involving names, conversations involving current events, emphatic forms, and the context of code-switching between the signed language and the corresponding spoken language (Padden 1998; Montemurro and Brentari 2018). The relative amount of fingerspelling varies between signed languages, and for American Sign Language (ASL), accounts for 12-35% of the signed content (Padden and Gunsauls 2003).

Patrie and Johnson (2011) described the following terminology for three different forms of fingerspelling:

Recognition

Fingerspelling recognition, a sub-task of sign language recognition, is the task of recognizing fingerspelled words from a sign language video.

Shi et al. (2018) introduced a large dataset for American Sign Language fingerspelling recognition. This dataset includes both the “careful” and “rapid” forms of fingerspelling, collected from naturally occurring videos “in the wild”, which are more challenging than studio conditions. They trained a baseline model that takes a sequence of images cropped around the signing hand and decodes it with either an autoregressive decoder or connectionist temporal classification (CTC). They found that CTC outperformed the autoregressive decoder, but both achieved poor recognition rates (35-41% character-level accuracy) compared to human performance (around 82%).
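
The CTC objective allows training without frame-level letter labels by marginalizing over all alignments between the frame sequence and the letter sequence. A minimal sketch with torch.nn.CTCLoss follows; the shapes and class counts are illustrative assumptions, not the authors' model:

import torch
import torch.nn as nn

FRAMES, BATCH, CLASSES = 60, 4, 27  # 26 letters + a blank symbol (index 0)

# Per-frame letter log-probabilities, e.g. from an encoder over hand crops.
log_probs = torch.randn(FRAMES, BATCH, CLASSES).log_softmax(dim=2)

targets = torch.randint(1, CLASSES, (BATCH, 10))  # target letter indices
input_lengths = torch.full((BATCH,), FRAMES, dtype=torch.long)
target_lengths = torch.full((BATCH,), 10, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)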

In follow-up work, Shi et al. (2019) collected a nearly order-of-magnitude larger dataset and designed a new recognition model. Instead of detecting the signing hand, they detected the face and cropped a large area around it. Then, they performed an iterative process of zooming in on the hand using visual attention, to retain sufficient high-resolution information about the hand. Finally, as in their previous work, they encoded the sequence of hand crops and used CTC to obtain the frame labels. They showed that this method outperformed their original “hand crop” method by 4% and that they could achieve up to 62.3% character-level accuracy using the additional data collected. Looking through this dataset, we note that its videos were cut from longer videos and do not retain the signing that precedes the fingerspelling. This context relates to language modeling: a signer might first fingerspell a word carefully and, when repeating it, fingerspell it rapidly, yet the interlocutors can still infer that the same word is being fingerspelled.

Production

Fingerspelling production, a sub-task of sign language production, is the task of producing a fingerspelling video for words.

In its basic form, “careful” fingerspelling production can be trivially solved by interpolating between pre-defined letter handshapes. Adeline (2013) demonstrated this approach for American Sign Language and English fingerspelling. They rigged a hand armature for each letter in the English alphabet (N = 26) and generated all (N² = 676) transitions between every two letters using interpolation or manual animation. Then, to fingerspell entire words, they chain pairs of letter transitions. For example, for the word “CHLOE”, they would chain the following transitions sequentially: #C CH HL LO OE E#.
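
Chaining transitions can be illustrated in a few lines of Python; transitions is a hypothetical lookup from a letter pair (with “#” marking word boundaries) to its pre-animated clip:

def fingerspell(word, transitions):
    letters = f"#{word.upper()}#"
    pairs = [letters[i:i + 2] for i in range(len(letters) - 1)]
    # e.g. "CHLOE" -> ["#C", "CH", "HL", "LO", "OE", "E#"]
    return [transitions[pair] for pair in pairs]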

However, to produce life-like animations, one must also consider the rhythm and speed of holding letters and of transitioning between letters, as these affect how intelligible fingerspelling motions are to an interlocutor (Wilcox 1992). Wheatland et al. (2016) analyzed both “careful” and “rapid” fingerspelling videos for these features. They found that for both forms of fingerspelling, on average, the longer the word, the shorter the transition and hold times. Furthermore, they found that less time is spent on middle letters on average, and that the last letter is held on average longer than the other letters in the word. Finally, they used this information to construct an animation system using letter pose interpolation, controlling the timing with a data-driven statistical model.

Annotation Tools

ELAN - EUDICO Linguistic Annotator

ELAN (Wittenburg et al. 2006) is an annotation tool for audio and video recordings. With ELAN, a user can add an unlimited number of textual annotations to audio and/or video recordings. An annotation can be a sentence, word, gloss, comment, translation, or description of any feature observed in the media. Annotations can be created on multiple layers, called tiers, which can be hierarchically interconnected. An annotation can either be time-aligned to the media or refer to other existing annotations. The content of annotations consists of Unicode text, and annotation documents are stored in an XML format (EAF). ELAN is open source (GPLv3), and installation is available for Windows, macOS, and Linux. pympi (Lubbers and Torreira 2013) allows for simple Python interaction with ELAN files.
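
For example, reading the tiers and annotations of an EAF file with pympi might look roughly like this (a sketch assuming a local file recording.eaf; consult the pympi-ling documentation for the exact API):

import pympi

eaf = pympi.Elan.Eaf("recording.eaf")
for tier in eaf.get_tier_names():
    # Each annotation is returned with its start time, end time, and value.
    for start_ms, end_ms, value in eaf.get_annotation_data_for_tier(tier):
        print(tier, start_ms, end_ms, value)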

iLex

iLex (Hanke 2002) is a tool for sign language lexicography and corpus analysis that combines features found in empirical sign language lexicography and sign language discourse transcription. It supports the user in integrated lexicon building while working on the transcription of a corpus and offers several unique features considered essential due to the specific nature of signed languages. iLex binaries are available for macOS.

SignStream

SignStream (Neidle, Sclaroff, and Athitsos 2001) is a tool for linguistic annotation and computer vision research on visual-gestural language data. SignStream installation is available for macOS, and it is distributed under an MIT license.

Anvil - The Video Annotation Research Tool

Anvil (Kipp 2001) is a free video annotation tool offering multi-layered annotation based on a user-defined coding scheme. In Anvil, the annotator can see color-coded elements on multiple tracks in time alignment. Special features include cross-level links, non-temporal objects, timepoint tracks, coding agreement analysis, 3D viewing of motion capture data, and a project tool for managing whole corpora of annotation files. Anvil installation is available for Windows, macOS, and Linux.

Resources

Bilingual dictionaries

for signed languages (Mesch and Wallin 2012; Fenlon, Cormier, and Schembri 2015; Crasborn et al. 2016; Gutierrez-Sigut et al. 2016) map a spoken language word or short phrase to a signed language video. One notable dictionary, SpreadTheSign, is a parallel dictionary containing around 25,000 words with up to 42 different spoken-signed language pairs and more than 600,000 videos in total. Unfortunately, while dictionaries may help create lexical rules between languages, they do not demonstrate the grammar or the usage of signs in context.

Fingerspelling corpora

usually consist of videos of words borrowed from spoken languages that are signed letter-by-letter. They can be synthetically created (Dreuw et al. 2006) or mined from online resources (Shi et al. 2018, 2019). However, they only capture one aspect of signed languages.

Isolated sign corpora

are collections of annotated single signs. They are synthesized (Ebling et al. 2018; Huang et al. 2018; Sincan and Keles 2020; Hassan et al. 2020) or mined from online resources (Vaezi Joze and Koller 2019; Li et al. 2020), and can be used for isolated sign language recognition or contrastive analysis of minimal signing pairs (Imashev et al. 2020). However, like dictionaries, they do not describe relations between signs, nor do they capture coarticulation during the signing, and are often limited in vocabulary size (20-1000 signs).

Continuous sign corpora

contain parallel sequences of signs and spoken language. Available continuous sign corpora are extremely limited, containing 4-6 orders of magnitude fewer sentence pairs than similar corpora for spoken language machine translation (Arivazhagan et al. 2019). Moreover, while automatic speech recognition (ASR) datasets contain up to 50,000 hours of recordings (Pratap et al. 2020), the most extensive continuous sign language corpus contains only 1,150 hours, and only 50 of them are publicly available (Hanke et al. 2020). These datasets are usually synthesized (Databases 2007; Crasborn and Zwitserlood 2008; Ko et al. 2019; Hanke et al. 2020) or recorded in studio conditions (Forster et al. 2014; Camgöz et al. 2018), which does not account for noise in real-life conditions. Moreover, some contain signed interpretations of spoken language rather than naturally produced signs, which may not accurately represent native signing since translation becomes part of the discourse event.

Availability

Unlike the vast amount and diversity of available spoken language resources, which enable various applications, sign language resources are scarce and currently only support translation and production. Unfortunately, most of the sign language corpora discussed in the literature are either unavailable for use or available only under heavy restrictions and licensing terms. Furthermore, sign language data is especially challenging to anonymize due to the importance of facial and other physical features in signing videos, limiting its open distribution. Developing anonymization with minimal information loss, or accurate anonymous representations, is a promising research direction.

Collect Real-World Data

Data is essential to develop any of the core NLP tools previously described, and current efforts in SLP are often limited by the lack of adequate data. We discuss the considerations to keep in mind when building datasets, the challenges of collecting such data, and directions to facilitate data collection.

What is Good Signed Language Data?

For SLP models to be deployable, they must be developed using data that accurately represents the real world. What constitutes an ideal signed language dataset is an open question; we suggest the following requirements: (1) a broad domain; (2) sufficient data and vocabulary size; (3) real-world conditions; (4) naturally produced signs; (5) a diverse signer demographic; (6) native signers; and, when applicable, (7) dense annotations.

To illustrate the importance of data quality for modeling, Yin et al. (2021) first take as an example a current benchmark for SLP, the RWTH-PHOENIX-Weather 2014T dataset (Camgöz et al. 2018) of German Sign Language, which does not meet most of the above criteria: it is restricted to the weather domain (1); contains only around 8K segments with 1K unique signs (2); is filmed in studio conditions (3); is interpreted from German utterances (4); and is signed by nine Caucasian interpreters (5, 6). Although this dataset addressed data scarcity issues at the time, rendered results comparable, and fueled competitive research, it does not accurately represent signed languages in the real world. On the other hand, the Public DGS Corpus (Hanke et al. 2020) is an open-domain (1) dataset consisting of 50 hours of natural signing (4) by 330 native signers from various regions of Germany (5, 6), annotated with glosses, HamNoSys, and German translations (7), meeting all but two of the requirements we suggest.

They then train a gloss-to-text sign language translation transformer (Yin and Read 2020) on both datasets. On RWTH-PHOENIX-Weather 2014T, they obtain 22.17 BLEU on the test set; on the Public DGS Corpus, they obtain a far lower BLEU score. Although Transformers achieve encouraging results on RWTH-PHOENIX-Weather 2014T (Saunders, Camgöz, and Bowden 2020c; N. C. Camgöz et al. 2020a), they fail on more realistic, open-domain data. These results suggest that, for real-world applications, we need more data to train such models. At the same time, since the available data is severely limited in size, less data-hungry and more linguistically informed approaches may be more suitable. This experiment shows how crucial it is to use data that accurately represents the complexity and diversity of signed languages, in order to precisely assess which methods are suitable and how well our models would perform when deployed in the real world.

Challenges of Data Collection

Collecting and annotating signed data in line with these ideals requires more resources than for speech or text data, taking up to 600 minutes of annotation per minute of signed language video (Hanke et al. 2020). Moreover, annotation usually requires specific knowledge and skills, which makes recruiting or training qualified annotators challenging. Additionally, there is little existing signed language data in the wild that is openly licensed for use, especially data from native signers that is not an interpretation of speech. Therefore, data collection often requires the significant effort and cost of on-site recording.

Automating Annotation

One helpful research direction for collecting more data that enables the development of deployable SLP models is creating tools that simplify or automate parts of the collection and annotation process. One of the most significant bottlenecks in obtaining more adequate signed language data is the time required for annotation and the scarcity of experts able to perform it. Therefore, tools that perform automatic parsing, detection of frame boundaries, extraction of articulatory features, and suggestion of lexical annotations, or that allow parts of the annotation process to be crowdsourced to non-experts, have a high potential to facilitate and accelerate the availability of good data.

Practice Deaf Collaboration

Finally, when working with signed languages, it is vital to keep in mind that this technology should benefit Deaf communities and address what they need. Researchers in SLP should acknowledge that signed languages belong to the Deaf community and avoid exploiting their language as a commodity (Bird 2020).

Solving Real Needs

Many efforts in SLP have developed intrusive methods (e.g., requiring signers to wear special gloves), which are often rejected by signing communities and therefore have limited real-world value. Such efforts are often marketed as performing “sign language translation” when, in fact, they only identify fingerspelling or, at best, recognize a minimal set of isolated signs. These approaches oversimplify the rich grammar of signed languages, promote the misconception that signs are solely expressed through the hands, and are considered by the Deaf community a manifestation of audism, where it is the signers who must make the extra effort of wearing additional sensors to be understood by non-signers (Erard 2017). To avoid such mistakes, we encourage close Deaf involvement throughout the research process to ensure that we direct our efforts toward applications that will be adopted by signers, and that we do not make false assumptions about signed languages or the needs of signing communities.

Building Collaboration

Deaf collaboration and leadership are essential for developing signed language technologies, to ensure that they address the community’s needs and will be adopted, and that they do not rely on misconceptions or inaccuracies about signed languages (Harris, Holmes, and Mertens 2009; Annelies Kusters, De Meulder, and O’Brien 2017). Hearing researchers cannot relate to the deaf experience or fully understand the context in which the tools being developed would be used, nor can they speak for the deaf. Therefore, we encourage creating a long-term collaborative environment between signed language researchers and users, so that deaf users can identify meaningful challenges and provide insights on the considerations to take into account, while researchers cater to the signers’ needs as the field evolves. We also recommend inviting members of signing communities to review papers on signed languages, to ensure that this type of research is adequately evaluated at academic venues. There are several ways to connect with Deaf communities for collaboration: one can seek deaf students in their local community, reach out to schools for the deaf, contact deaf linguists, join a network of researchers of sign-related technologies, and/or participate in deaf-led projects.

Downloading

Currently, there is no easy way or agreed-upon format to download and load sign language datasets, and as a result, evaluation on these datasets is scarce. As part of this work, we streamlined the loading of available datasets using TensorFlow Datasets (“TensorFlow Datasets, a Collection of Ready-to-Use Datasets,” n.d.). This tool allows researchers to load large and small datasets alike with a simple command and makes their results comparable to other works. We make these datasets available using a custom library, Sign Language Datasets (Moryossef and Müller 2021).

import tensorflow_datasets as tfds
import sign_language_datasets.datasets

# Loading a dataset with default configuration
aslg_pc12 = tfds.load("aslg_pc12")

# Loading a dataset with custom configuration
from sign_language_datasets.datasets.config import SignDatasetConfig

config = SignDatasetConfig(
    name="videos_and_poses256x256:12",
    # Specific version
    version="3.0.0",
    # Download and load the dataset videos
    include_video=True,
    # Load videos at a constant 12 fps
    fps=12,
    # Convert videos to a constant resolution, 256x256
    resolution=(256, 256),
    # Download and load Holistic pose estimation
    include_pose="holistic")

rwth_phoenix2014_t = tfds.load(
    name='rwth_phoenix2014_t',
    builder_kwargs=dict(config=config))

Furthermore, we follow a unified interface when possible, making attributes consistent and comparable across datasets:

{
    "id": tfds.features.Text(),
    "signer": tfds.features.Text() | tf.int32,
    "video": tfds.features.Video(
        shape=(None, HEIGHT, WIDTH, 3)),
    "depth_video": tfds.features.Video(
        shape=(None, HEIGHT, WIDTH, 1)),
    "fps": tf.int32,
    "pose": {
        "data": tfds.features.Tensor(
            shape=(None, 1, POINTS, CHANNELS),
            dtype=tf.float32),
        "conf": tfds.features.Tensor(
            shape=(None, 1, POINTS),
            dtype=tf.float32)
    },
    "gloss": tfds.features.Text(),
    "text": tfds.features.Text()
}
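
For instance, one can iterate over a loaded dataset (such as rwth_phoenix2014_t from the example above) and access these unified attributes roughly as follows; this is a sketch, and which attributes are populated depends on the specific dataset:

for datum in rwth_phoenix2014_t["train"]:
    gloss = datum["gloss"].numpy().decode("utf-8")
    text = datum["text"].numpy().decode("utf-8")
    print(gloss, "->", text)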

The following table contains a curated list of datasets, including various signed languages and data formats:

🎥 Video | 👋 Pose | 👄 Mouthing | ✍ Notation | 📋 Gloss | 📜 Text | 🔊 Speech

Dataset Publication Language Features #Signs #Samples #Signers License
ASL-100-RGBD Hassan et al. (2020) American 🎥👋📋 100 4,150 Tokens 22 Authorized Academics
ASL-Homework-RGBD Hassan et al. (2022) American 🎥👋📋 935 45 Authorized Academics
ASL-LEX Sehyr et al. (2021) American 📋 2,723 2723 glosses+linguistic annotations, video downloads not allowed CC BY-NC 4.0
ASLG-PC12 💾 Othman and Jemni (2012) American (Synthetic) 📋📜 > 100,000,000 Sentences N/A Sample Available (1, 2)
ASLLVD Athitsos et al. (2008) American TODO 3,000 12,000 Samples 4 Attribution
ATIS Bungeroth et al. (2008) Multilingual TODO 292 595 Sentences
AUSLAN Johnston (2010) Australian TODO 1,100 Videos 100
AUTSL 💾 Sincan and Keles (2020) Turkish 🎥📋 226 36,302 Samples 43 Codalab
BOBSL Momeni, Bull, Prajwal, et al. (2022b) British 🎥📜 2,281 1.2M Sentences 37 non-commercial authorized academics
BosphorusSign Camgöz et al. (2016) Turkish TODO 636 24,161 Samples 6 Not Published
BSL Corpus 💾 Schembri et al. (2013) British TODO 40,000 Lexical Items 249 Partially Restricted
CDPSL Łacheta and Paweł Rutkowski (2014) Polish 🎥📜 300 hours
ChicagoFSWild 💾 Shi et al. (2018) American 🎥📜 26 7,304 Sequences 160 Public
ChicagoFSWild+ 💾 Shi et al. (2019) American 🎥📜 26 55,232 Sequences 260 Public
Content4All Camgöz et al. (2021) Swiss-German, Flemish 🎥👋📜📜 190 Hours CC BY-NC-SA 4.0
CopyCat Zafrulla et al. (2010) American TODO 22 420 Phrases 5
Corpus NGT 💾 Crasborn and Zwitserlood (2008) Netherlands TODO 15 Hours 92 CC BY-NC-SA 3.0 NL
DEVISIGN Chai, Wang, and Chen (2014) Chinese TODO 2,000 24,000 Samples 8 Research purpose on request
Dicta-Sign 💾 Matthes et al. (2012) Multilingual TODO 6-8 Hours (/Participant) 16-18 /Language
How2Sign 💾 Duarte et al. (2020) American 🎥👋📋📜🔊 16,000 79 hours (35,000 sentences) 11 CC BY-NC 4.0
ISL-HS Oliveira et al. (2017) Irish 🎥📋 23 468 videos->58,114 images->23 handshapes 6
K-RSL Imashev et al. (2020) Kazakh-Russian 🎥👋📜 600 28,250 Videos 10 Attribution
KETI Ko et al. (2019) Korean 🎥👋📋📜 524 14,672 Videos 14 TODO (emailed Sang-Ki Ko)
KRSL-OnlineSchool Mukushev et al. (2022) Kazakh-Russian 🎥📋📜 890 Hours (1M sentences) 7
LSE-SIGN Gutierrez-Sigut et al. (2016) Spanish TODO 2,400 2,400 Samples 2 Custom
MS-ASL Vaezi Joze and Koller (2019) American TODO 1,000 25,000 (25 hours) 200 Public
NCSLGR 💾 Databases (2007) American 🎥📋📜 1,875 sentences 4 TODO
Public DGS Corpus 💾 Prillwitz et al. (2008) German 🎥🎥👋👄📋📜📜 50 Hours 330 Custom
RVL-SLLL ASL Martínez et al. (2002) American TODO 104 2,576 Videos 14 Research Attribution
RWTH Fingerspelling Dreuw et al. (2006) German 🎥📜 35 1,400 single-char videos 20
RWTH-BOSTON-104 Dreuw et al. (2008) American 🎥📜 104 201 Sentences 3
RWTH-PHOENIX-Weather T 💾 Forster et al. (2014);Camgöz et al. (2018) German 🎥📋📜 1,231 8,257 Sentences 9 CC BY-NC-SA 3.0
S-pot Viitaniemi et al. (2014) Finnish TODO 1,211 5,539 Videos 5 Permission
Sign2MINT 💾 2021 German 🎥📜 740 1135 CC BY-NC-SA 3.0 DE
SignBank 💾 Multilingual 🎥📜 222148
SignBD-Word Sams, Akash, and Rahman (2023) Bangla 🎥👋 200 6000 videos 16
SIGNOR Vintar, Jerko, and Kulovec (2012) Slovene 🎥👄📋📜 80 TODO emailed Špela
SIGNUM Von Agris and Kraiss (2007) German TODO 450 15,600 Sequences 20
SMILE Ebling et al. (2018) Swiss-German TODO 100 9,000 Samples 30 Custom
SSL Corpus Öqvist, Riemer Kankkonen, and Mesch (2020) Swedish 🎥📋📜
SSL Lexicon Mesch and Wallin (2012) Swedish 🎥📋📜📜 20,000 CC BY-NC-SA 2.5 SE
Video-Based CSL Huang et al. (2018) Chinese TODO 500 125,000 Videos 50 Research Attribution
WLASL 💾 Li et al. (2020) American 🎥📋 2,000 100 C-UDA 1.0

Other Resources

Citation

For attribution in academic contexts, please cite this work as:

@misc{moryossef2021slp, 
    title = "{S}ign {L}anguage {P}rocessing", 
    author = "Moryossef, Amit and Goldberg, Yoav",
    howpublished = "\url{https://sign-language-processing.github.io/}",
    year = "2021"
}

References

Abeillé, Anne, Yves Schabes, and Aravind K Joshi. 1991. “Using Lexicalized Tags for Machine Translation.”

Adaloglou, Nikolas, Theocharis Chatzis, Ilias Papastratis, Andreas Stergioulas, Georgios Th Papadopoulos, Vassia Zacharopoulou, George J Xydopoulos, Klimnis Atzakas, Dimitris Papazachariou, and Petros Daras. 2020. “A Comprehensive Study on Sign Language Recognition Methods.” arXiv Preprint arXiv:2007.12530.

Adeline, Chloe. 2013. “Fingerspell.net.” http://fingerspell.net/.

Albanie, Samuel, Gül Varol, Liliane Momeni, Hannah Bull, Triantafyllos Afouras, Himel Chowdhury, Neil Fox, et al. 2021. “BOBSL: BBC-Oxford British Sign Language Dataset.” In. https://www.robots.ox.ac.uk/~vgg/data/bobsl.

Arivazhagan, Naveen, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, et al. 2019. “Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges.” arXiv Preprint arXiv:1907.05019.

Arkushin, Rotem Shalev, Amit Moryossef, and Ohad Fried. 2023. “Ham2Pose: Animating Sign Language Notation into Pose Sequences,” 21046–56.

Athitsos, Vassilis, Carol Neidle, Stan Sclaroff, Joan Nash, Alexandra Stefan, Quan Yuan, and Ashwin Thangali. 2008. “The American Sign Language Lexicon Video Dataset.” In 2008 Ieee Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 1–8. IEEE.

Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2015. “Neural Machine Translation by Jointly Learning to Align and Translate.” Edited by Yoshua Bengio and Yann LeCun. http://arxiv.org/abs/1409.0473.

Bangham, J Andrew, SJ Cox, Ralph Elliott, John RW Glauert, Ian Marshall, Sanja Rankov, and Mark Wells. 2000. “Virtual Signing: Capture, Animation, Storage and Transmission-an Overview of the Visicast Project.” In IEE Seminar on Speech and Language Processing for Disabled and Elderly People (Ref. No. 2000/025), 6–1. IET.

Baowidan, Souad. 2021. “Improving Realism in Automated Fingerspelling of American Sign Language.” Machine Translation 35 (3): 387–404.

Battison, Robbin. 1978. “Lexical Borrowing in American Sign Language.”

Bellugi, Ursula, and Susan Fischer. 1972. “A Comparison of Sign Language and Spoken Language.” Cognition 1 (2-3): 173–200.

Bergman, Brita. 1977. Tecknad Svenska [Signed Swedish]. LiberLäromedel/Utbildningsförl.

Berndt, Donald J., and James Clifford. 1994. “Using Dynamic Time Warping to Find Patterns in Time Series.” In KDD Workshop.

Beuzeville, Louise de. 2008. “Pointing and Verb Modification: The Expression of Semantic Roles in the Auslan Corpus.” In Workshop Programme, 13. Citeseer.

Bird, Steven. 2020. “Decolonising Speech and Language Technology.” In Proceedings of the 28th International Conference on Computational Linguistics, 3504–19. Barcelona, Spain (Online): International Committee on Computational Linguistics. https://doi.org/10.18653/v1/2020.coling-main.313.

Bishop, Christopher M. 1994. “Mixture Density Networks.”

Blattmann, Andreas, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. 2023. “Align Your Latents: High-Resolution Video Synthesis with Latent Diffusion Models.” In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Borg, Mark, and Kenneth P Camilleri. 2019. “Sign Language Detection "in the Wild" with Recurrent Neural Networks.” In ICASSP 2019-2019 Ieee International Conference on Acoustics, Speech and Signal Processing (Icassp), 1637–41. IEEE.

Bragg, Danielle, Oscar Koller, Mary Bellard, Larwan Berke, Patrick Boudreault, Annelies Braffort, Naomi Caselli, et al. 2019. “Sign Language Recognition, Generation, and Translation: An Interdisciplinary Perspective.” In The 21st International Acm Sigaccess Conference on Computers and Accessibility, 16–31.

Brentari, Diane. 2011. “Sign Language Phonology.” The Handbook of Phonological Theory, 691–721.

Brentari, Diane, and Carol Padden. 2001. “A Language with Multiple Origins: Native and Foreign Vocabulary in American Sign Language.” Foreign Vocabulary in Sign Language: A Cross-Linguistic Investigation of Word Formation, 87–119.

Bull, Hannah, Triantafyllos Afouras, Gül Varol, Samuel Albanie, Liliane Momeni, and Andrew Zisserman. 2021. “Aligning Subtitles in Sign Language Videos.” In Proceedings of the Ieee/Cvf International Conference on Computer Vision, 11552–61.

Bull, Hannah, Michèle Gouiffès, and Annelies Braffort. 2020. “Automatic Segmentation of Sign Language into Subtitle-Units.” In European Conference on Computer Vision, 186–98. Springer.

Bungeroth, Jan, Daniel Stein, Philippe Dreuw, Hermann Ney, Sara Morrissey, Andy Way, and Lynette van Zijl. 2008. “The ATIS Sign Language Corpus.” In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08). Marrakech, Morocco: European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2008/pdf/748_paper.pdf.

Camgöz, Necati Cihan, Simon Hadfield, Oscar Koller, and Richard Bowden. 2017. “Subunets: End-to-End Hand Shape and Continuous Sign Language Recognition.” In 2017 Ieee International Conference on Computer Vision (Iccv), 3075–84. IEEE.

Camgöz, Necati Cihan, Simon Hadfield, Oscar Koller, Hermann Ney, and Richard Bowden. 2018. “Neural Sign Language Translation.” In Proceedings of the Ieee Conference on Computer Vision and Pattern Recognition, 7784–93.

Camgöz, Necati Cihan, Ahmet Alp Kındıroğlu, Serpil Karabüklü, Meltem Kelepir, Ayşe Sumru Özsoy, and Lale Akarun. 2016. “BosphorusSign: A Turkish Sign Language Recognition Corpus in Health and Finance Domains.” In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), 1383–8. Portorož, Slovenia: European Language Resources Association (ELRA). https://www.aclweb.org/anthology/L16-1220.

Camgöz, Necati Cihan, Oscar Koller, Simon Hadfield, and Richard Bowden. 2020a. “Multi-Channel Transformers for Multi-Articulatory Sign Language Translation.” In European Conference on Computer Vision, 301–19.

———. 2020b. “Sign Language Transformers: Joint End-to-End Sign Language Recognition and Translation.” In Proceedings of the Ieee/Cvf Conference on Computer Vision and Pattern Recognition, 10023–33.

Camgöz, Necati Cihan, Ben Saunders, Guillaume Rochette, Marco Giovanelli, Giacomo Inches, Robin Nachtrab-Ribback, and Richard Bowden. 2021. “Content4all Open Research Sign Language Translation Datasets.” In 2021 16th Ieee International Conference on Automatic Face and Gesture Recognition (Fg 2021), 1–5. IEEE.

Cao, Zhe, Tomas Simon, Shih-En Wei, and Yaser Sheikh. 2017. “Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields.” In CVPR.

Cao, Z., G. Hidalgo Martinez, T. Simon, S. Wei, and Y. A. Sheikh. 2019. “OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields.” IEEE Transactions on Pattern Analysis and Machine Intelligence.

Chai, Xiujuan, Hanjie Wang, and Xilin Chen. 2014. “The Devisign Large Vocabulary of Chinese Sign Language Database and Baseline Evaluations.” Technical Report VIPL-TR-14-SLR-001. Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS.

Chan, Caroline, Shiry Ginosar, Tinghui Zhou, and Alexei A Efros. 2019. “Everybody Dance Now.” In Proceedings of the Ieee International Conference on Computer Vision, 5933–42.

Chen, Yu-Hui, Raman Sarokin, Juhyun Lee, Jiuqiang Tang, Chuo-Ling Chang, Andrei Kulik, and Matthias Grundmann. 2023. “Speed Is All You Need: On-Device Acceleration of Large Diffusion Models via Gpu-Aware Optimizations.” In.

Chen, Yu, Chunhua Shen, Xiu-Shen Wei, Lingqiao Liu, and Jian Yang. 2017. “Adversarial Posenet: A Structure-Aware Convolutional Network for Human Pose Estimation.” In Proceedings of the Ieee International Conference on Computer Vision, 1212–21.

Cho, Kyunghyun, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. “Learning Phrase Representations Using RNN Encoder–Decoder for Statistical Machine Translation.” In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1724–34. Doha, Qatar: Association for Computational Linguistics. https://doi.org/10.3115/v1/D14-1179.

Cormier, Kearsy, Sandra Smith, and Zed Sevcikova-Sehyr. 2015. “Rethinking Constructed Action.” Sign Language & Linguistics 18 (2): 167–204.

Crasborn, Onno, R Bank, I Zwitserlood, E Van der Kooij, E Ormel, J Ros, A Schüller, et al. 2016. “NGT Signbank.” Nijmegen: Radboud University, Centre for Language Studies.

Crasborn, O., and I. Zwitserlood. 2008. “The Corpus Ngt: An Online Corpus for Professionals and Laymen.” In.

Cui, Runpeng, Hu Liu, and Changshui Zhang. 2017. “Recurrent Convolutional Neural Networks for Continuous Sign Language Recognition by Staged Optimization.” In Proceedings of the Ieee Conference on Computer Vision and Pattern Recognition, 7361–9.

———. 2019. “A Deep Neural Framework for Continuous Sign Language Recognition by Iterative Training.” IEEE Transactions on Multimedia 21 (7): 1880–91.

Dafnis, Konstantinos M, Evgenia Chroni, Carol Neidle, and Dimitris N Metaxas. 2022. “Bidirectional Skeleton-Based Isolated Sign Recognition Using Graph Convolution Networks.” In Proceedings of the 13th Conference on Language Resources and Evaluation (Lrec 2022), Marseille, 20-25 June 2022.

Databases, NCSLGR. 2007. “Volumes 2–7.” American Sign Language Linguistic Research Project (Distributed on CD-ROM ….

Davidson, Mary Jo. 2006. “PAULA: A Computer-Based Sign Language Tutor for Hearing Adults.” In Intelligent Tutoring Systems 2006 Workshop on Teaching with Robots, Agents, and Natural Language Processing, 66–72. Citeseer.

De Coster, Mathieu, Dimitar Shterionov, Mieke Van Herreweghe, and Joni Dambre. 2022. “Machine Translation from Signed to Spoken Languages: State of the Art and Challenges.” arXiv Preprint arXiv:2202.03086.

Deng, Jia, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. “Imagenet: A Large-Scale Hierarchical Image Database.” In 2009 Ieee Conference on Computer Vision and Pattern Recognition, 248–55. Ieee.

De Sisto, Mirella, Dimitar Shterionov, Irene Murtagh, Myriam Vermeerbergen, and Lorraine Leeson. 2021. “Defining Meaningful Units. Challenges in Sign Segmentation and Segment-Meaning Mapping (Short Paper).” In Proceedings of the 1st International Workshop on Automatic Translation for Signed and Spoken Languages (At4ssl), 98–103. Virtual: Association for Machine Translation in the Americas. https://aclanthology.org/2021.mtsummit-at4ssl.11.

Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–86. Minneapolis, Minnesota: Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1423.

Dreuw, Philippe, Thomas Deselaers, Daniel Keysers, and Hermann Ney. 2006. “Modeling Image Variability in Appearance-Based Gesture Recognition.” In ECCV Workshop on Statistical Methods in Multi-Image and Video Processing, 7–18.

Dreuw, Philippe, Carol Neidle, Vassilis Athitsos, Stan Sclaroff, and Hermann Ney. 2008. “Benchmark Databases for Video-Based Automatic Sign Language Recognition.” In LREC.

Duarte, Amanda, Shruti Palaskar, Deepti Ghadiyaram, Kenneth DeHaan, Florian Metze, Jordi Torres, and Xavier Giro-i-Nieto. 2020. “How2Sign: A Large-Scale Multimodal Dataset for Continuous American Sign Language.” arXiv Preprint arXiv:2008.08143.

Dudis, Paul G. 2004. “Body Partitioning and Real-Space Blends.” Cognitive Linguistics 15 (2): 223–38.

Ebling, Sarah, Necati Cihan Camgöz, Penny Boyes Braem, Katja Tissi, Sandra Sidler-Miserez, Stephanie Stoll, Simon Hadfield, et al. 2018. “SMILE Swiss German Sign Language Dataset.” In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Miyazaki, Japan: European Language Resources Association (ELRA). https://www.aclweb.org/anthology/L18-1666.

Efthimiou, Eleni, Stavroula-Evita Fotinea, Thomas Hanke, John Glauert, Richard Bowden, Annelies Braffort, Christophe Collet, Petros Maragos, and François Lefebvre-Albaret. 2012. “Sign Language Technologies and Resources of the Dicta-Sign Project.” In 5th Workshop on the Representation and Processing of Sign Languages: Interactions Between Corpus and Lexicon. Satellite Workshop to the Eighth International Conference on Language Resources and Evaluation (Lrec-2012).

Egea Gómez, Santiago, Euan McGill, and Horacio Saggion. 2021. “Syntax-Aware Transformers for Neural Machine Translation: The Case of Text to Sign Gloss Translation.” In Proceedings of the 14th Workshop on Building and Using Comparable Corpora (Bucc 2021), 18–27. Online (Virtual Mode): INCOMA Ltd. https://aclanthology.org/2021.bucc-1.4.

Elliott, Ralph, John Glauert, Vince Jennings, and Richard Kennaway. 2004. “An Overview of the Sigml Notation and Sigmlsigning Software System.” Sign-Lang LREC 2004, 98–104.

Elliott, Ralph, John RW Glauert, JR Kennaway, and Ian Marshall. 2000. “The Development of Language Processing Support for the Visicast Project.” In Proceedings of the Fourth International Acm Conference on Assistive Technologies, 101–8.

Erard, Michael. 2017. “Why Sign-Language Gloves Don’t Help Deaf People.” The Atlantic 9.

Farag, Iva, and Heike Brock. 2019. “Learning Motion Disfluencies for Automatic Sign Language Segmentation.” In ICASSP 2019-2019 Ieee International Conference on Acoustics, Speech and Signal Processing (Icassp), 7360–4. IEEE.

Fenlon, Jordan, Kearsy Cormier, and Adam Schembri. 2015. “Building Bsl Signbank: The Lemma Dilemma Revisited.” International Journal of Lexicography 28 (2): 169–206.

Fenlon, Jordan, Adam Schembri, and Kearsy Cormier. 2018. “Modification of Indicating Verbs in British Sign Language: A Corpus-Based Study.” Language 94 (1): 84–118.

Ferreira-Brito, Lucinda. 1984. “Similarities & Differences in Two Brazilian Sign Languages.” Sign Language Studies 42: 45–56.

Filhol, Michael, and John C McDonald. 2020. “The Synthesis of Complex Shape Deployments in Sign Language.” In Proceedings of the 9th Workshop on the Representation and Processing of Sign Languages.

———. 2022. “Representation and Synthesis of Geometric Relocations.” In Workshop on the Representation and Processing Sign Language.

Filhol, Michael, John C McDonald, and Rosalee J Wolfe. 2017. “Synthesizing Sign Language by Connecting Linguistically Structured Descriptions to a Multi-Track Animation System.” In Universal Access in Human–Computer Interaction. Designing Novel Interactions: 11th International Conference, Uahci 2017, Held as Part of Hci International 2017, Vancouver, Bc, Canada, July 9–14, 2017, Proceedings, Part Ii 11, 27–40. Springer.

Forster, Jens, Christoph Schmidt, Oscar Koller, Martin Bellgardt, and Hermann Ney. 2014. “Extensions of the Sign Language Recognition and Translation Corpus Rwth-Phoenix-Weather.” In LREC, 1911–6.

Gebre, Binyam Gebrekidan, Peter Wittenburg, and Tom Heskes. 2013. “Automatic Sign Language Identification.” In 2013 Ieee International Conference on Image Processing, 2626–30. IEEE.

Giró-i-Nieto, Xavier. 2020. “Can Everybody Sign Now? Exploring Sign Language Video Generation from 2D Poses.” SLRTP 2020: The Sign Language Recognition, Translation & Production Workshop.

Glickman, Neil S, and Wyatte C Hall. 2018. Language Deprivation and Deaf Mental Health. Routledge.

Graves, Alex, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. “Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks.” In Proceedings of the 23rd International Conference on Machine Learning, 369–76.

Grishchenko, Ivan, and Valentin Bazarevsky. 2020. “MediaPipe Holistic.” https://google.github.io/mediapipe/solutions/holistic.html.

Gutierrez-Sigut, Eva, Brendan Costello, Cristina Baus, and Manuel Carreiras. 2016. “LSE-Sign: A Lexical Database for Spanish Sign Language.” Behavior Research Methods 48 (1): 123–37.

Güler, Rıza Alp, Natalia Neverova, and Iasonas Kokkinos. 2018. “Densepose: Dense Human Pose Estimation in the Wild.” In Proceedings of the Ieee Conference on Computer Vision and Pattern Recognition, 7297–7306.

Hall, Wyatte C, Leonard L Levin, and Melissa L Anderson. 2017. “Language Deprivation Syndrome: A Possible Neurodevelopmental Disorder with Sociocultural Origins.” Social Psychiatry and Psychiatric Epidemiology 52 (6): 761–76.

Hanke, Thomas. 2002. “ILex-a Tool for Sign Language Lexicography and Corpus Analysis.” In LREC.

Hanke, Thomas, Marc Schulder, Reiner Konrad, and Elena Jahn. 2020. “Extending the Public DGS Corpus in Size and Depth.” In Proceedings of the Lrec2020 9th Workshop on the Representation and Processing of Sign Languages: Sign Language Resources in the Service of the Language Community, Technological Challenges and Application Perspectives, 75–82. Marseille, France: European Language Resources Association (ELRA). https://www.aclweb.org/anthology/2020.signlang-1.12.

Harris, Raychelle, Heidi M Holmes, and Donna M Mertens. 2009. “Research Ethics in Sign Language Communities.” Sign Language Studies 9 (2): 104–31.

Hassan, Saad, Larwan Berke, Elahe Vahdani, Longlong Jing, Yingli Tian, and Matt Huenerfauth. 2020. “An Isolated-Signing RGBD Dataset of 100 American Sign Language Signs Produced by Fluent ASL Signers.” In Proceedings of the Lrec2020 9th Workshop on the Representation and Processing of Sign Languages: Sign Language Resources in the Service of the Language Community, Technological Challenges and Application Perspectives, 89–94. Marseille, France: European Language Resources Association (ELRA). https://www.aclweb.org/anthology/2020.signlang-1.14.

Hassan, Saad, Matthew Seita, Larwan Berke, Yingli Tian, Elaine Gale, Sooyeon Lee, and Matt Huenerfauth. 2022. “ASL-Homework-RGBD Dataset: An Annotated Dataset of 45 Fluent and Non-Fluent Signers Performing American Sign Language Homeworks.” In 13th International Conference on Language Resources and Evaluation (LREC 2022), edited by Eleni Efthimiou, Stavroula-Evita Fotinea, Thomas Hanke, Julie A. Hochgesang, Jette Kristoffersen, Johanna Mesch, and Marc Schulder, 67–72. Marseille, France: European Language Resources Association (ELRA). https://www.sign-lang.uni-hamburg.de/lrec/pub/22008.pdf.

Hidalgo, Gines, Yaadhav Raaj, Haroon Idrees, Donglai Xiang, Hanbyul Joo, Tomas Simon, and Yaser Sheikh. 2019. “Single-Network Whole-Body Pose Estimation.” In ICCV.

Ho, Jonathan, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey A. Gritsenko, Diederik P. Kingma, et al. 2022. “Imagen Video: High Definition Video Generation with Diffusion Models.” ArXiv abs/2210.02303.

Hochreiter, Sepp, and Jürgen Schmidhuber. 1997. “Long Short-Term Memory.” Neural Computation 9 (8): 1735–80.

Huang, Jie, Wengang Zhou, Qilin Zhang, Houqiang Li, and Weiping Li. 2018. “Video-Based Sign Language Recognition Without Temporal Segmentation.” In Proceedings of the Aaai Conference on Artificial Intelligence. Vol. 32. 1.

Huang, Wencan, Wenwen Pan, Zhou Zhao, and Qi Tian. 2021. “Towards Fast and High-Quality Sign Language Production.” In Proceedings of the 29th Acm International Conference on Multimedia, 3172–81.

Humphries, Tom, Poorna Kushalnagar, Gaurav Mathur, Donna Jo Napoli, Carol Padden, Christian Rathmann, and Scott Smith. 2016. “Avoiding Linguistic Neglect of Deaf Children.” Social Service Review 90 (4): 589–619.

Imashev, Alfarabi, Medet Mukushev, Vadim Kimmelman, and Anara Sandygulova. 2020. “A Dataset for Linguistic Understanding, Visual Evaluation, and Recognition of Sign Languages: The K-Rsl.” In Proceedings of the 24th Conference on Computational Natural Language Learning, 631–40.

Isard, Amy. 2020. “Approaches to the Anonymisation of Sign Language Corpora.” In Proceedings of the Lrec2020 9th Workshop on the Representation and Processing of Sign Languages: Sign Language Resources in the Service of the Language Community, Technological Challenges and Application Perspectives, 95–100.

Isola, Phillip, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. 2017. “Image-to-Image Translation with Conditional Adversarial Networks.” In Proceedings of the Ieee Conference on Computer Vision and Pattern Recognition, 1125–34.

Jiang, Songyao, Bin Sun, Lichen Wang, Yue Bai, Kunpeng Li, and Yun Fu. 2021. “Sign Language Recognition via Skeleton-Aware Multi-Model Ensemble.” arXiv Preprint arXiv:2110.06161.

Jiang, Zifan, Amit Moryossef, Mathias Müller, and Sarah Ebling. 2023. “Machine Translation Between Spoken Languages and Signed Languages Represented in SignWriting.” In Findings of the Association for Computational Linguistics: EACL 2023, 1661–79. Dubrovnik, Croatia: Association for Computational Linguistics. https://aclanthology.org/2023.findings-eacl.127.

Johnson, Robert E, and Scott K Liddell. 2011. “Toward a Phonetic Representation of Signs: Sequentiality and Contrast.” Sign Language Studies 11 (2): 241–74.

Johnson, Ronan, Maren Brumm, and Rosalee J Wolfe. 2018. “An Improved Avatar for Automatic Mouth Gesture Recognition.” In Language Resources and Evaluation Conference, 107–8.

Johnston, Trevor. 2010. “From Archive to Corpus: Transcription and Annotation in the Creation of Signed Language Corpora.” International Journal of Corpus Linguistics 15 (1): 106–31.

Johnston, Trevor, and Louise De Beuzeville. 2016. “Auslan Corpus Annotation Guidelines.” Auslan Corpus.

Johnston, Trevor, and Adam Schembri. 2007. Australian Sign Language (Auslan): An Introduction to Sign Language Linguistics. Cambridge University Press.

Kakumasu, Jim. 1968. “Urubu Sign Language.” International Journal of American Linguistics 34 (4): 275–81.

Karras, Tero, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. 2021. “Alias-Free Generative Adversarial Networks.” In Proc. NeurIPS.

Karras, Tero, Samuli Laine, and Timo Aila. 2018. “A Style-Based Generator Architecture for Generative Adversarial Networks.” 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 4396–4405.

Karras, Tero, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. 2020. “Analyzing and Improving the Image Quality of StyleGAN.” In Proc. CVPR.

Kezar, Lee, Jesse Thomason, and Zed Sehyr. 2023. “Improving Sign Recognition with Phonology.” In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2724–9. Dubrovnik, Croatia: Association for Computational Linguistics. https://aclanthology.org/2023.eacl-main.200.

Kimmelman, Vadim. 2014. “Information Structure in Russian Sign Language and Sign Language of the Netherlands.” Sign Language & Linguistics 18 (1): 142–50.

Kingma, Diederik P., Tim Salimans, Ben Poole, and Jonathan Ho. 2021. “Variational Diffusion Models.” ArXiv abs/2107.00630.

Kipp, Michael. 2001. “Anvil-a Generic Annotation Tool for Multimodal Dialogue.” In Seventh European Conference on Speech Communication and Technology.

Ko, Sang-Ki, Chang Jo Kim, Hyedong Jung, and Choongsang Cho. 2019. “Neural Sign Language Translation Based on Human Keypoint Estimation.” Applied Sciences 9 (13): 2683.

Koller, Oscar, Cihan Camgöz, Hermann Ney, and Richard Bowden. 2019. “Weakly Supervised Learning with Multi-Stream Cnn-Lstm-Hmms to Discover Sequential Parallelism in Sign Language Videos.” IEEE Transactions on Pattern Analysis and Machine Intelligence.

Konrad, Reiner, Thomas Hanke, Gabriele Langer, Susanne König, Lutz König, Rie Nishio, and Anja Regen. 2018. “Public Dgs Corpus: Annotation Conventions.” Project Note AP03-2018-01, DGS-Korpus project, IDGS, Hamburg University.

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E Hinton. 2012. “Imagenet Classification with Deep Convolutional Neural Networks.” In Advances in Neural Information Processing Systems, 1097–1105.

Kusters, Annelies, Maartje De Meulder, and Dai O’Brien. 2017. Innovations in Deaf Studies: The Role of Deaf Scholars. Oxford University Press.

Kusters, Annelies Maria Jozef, Dai O’Brien, and Maartje De Meulder. 2017. “Innovations in Deaf Studies: Critically Mapping the Field.” In Innovations in Deaf Studies, edited by Annelies Kusters, Maartje De Meulder, and Dai O’Brien, 1–53. United Kingdom: Oxford University Press.

Lebert, Marie. 2008. “Project Gutenberg (1971-2008).” Project Gutenberg.

Li, Dongxu, Cristian Rodriguez, Xin Yu, and Hongdong Li. 2020. “Word-Level Deep Sign Language Recognition from Video: A New Large-Scale Dataset and Methods Comparison.” In The Ieee Winter Conference on Applications of Computer Vision, 1459–69.

Liddell, Scott K, and Robert E Johnson. 1989. “American Sign Language: The Phonological Base.” Sign Language Studies 64 (1): 195–277.

Liddell, Scott K, and Melanie Metzger. 1998. “Gesture in Sign Language Discourse.” Journal of Pragmatics 30 (6): 657–97.

Liddell, Scott K, and others. 2003. Grammar, Gesture, and Meaning in American Sign Language. Cambridge University Press.

Lubbers, Mart, and Francisco Torreira. 2013. “Pympi-Ling: A Python Module for Processing ELANs EAF and Praats TextGrid Annotation Files.” https://pypi.python.org/pypi/pympi-ling.

Luong, Thang, Hieu Pham, and Christopher D. Manning. 2015. “Effective Approaches to Attention-Based Neural Machine Translation.” In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 1412–21. Lisbon, Portugal: Association for Computational Linguistics. https://doi.org/10.18653/v1/D15-1166.

Martínez, Aleix M, Ronnie B Wilbur, Robin Shay, and Avinash C Kak. 2002. “Purdue Rvl-Slll Asl Database for Automatic Recognition of American Sign Language.” In Proceedings. Fourth Ieee International Conference on Multimodal Interfaces, 167–72. IEEE.

Matthes, Silke, Thomas Hanke, Anja Regen, Jakob Storz, Satu Worseck, Eleni Efthimiou, Athanasia-Lida Dimou, Annelies Braffort, John Glauert, and Eva Safar. 2012. “Dicta-Sign–Building a Multilingual Sign Language Corpus.” In Proceedings of the 5th Workshop on the Representation and Processing of Sign Languages: Interactions Between Corpus and Lexicon (Lrec 2012).

McDonald, John C, and Michael Filhol. 2021. “Natural Synthesis of Productive Forms from Structured Descriptions of Sign Language.” Machine Translation 35 (3): 363–86.

McDonald, John C, Ronan Johnson, and Rosalee J Wolfe. 2022. “A Novel Approach to Managing Lower Face Complexity in Signing Avatars.” In Proceedings of the 7th International Workshop on Sign Language Translation and Avatar Technology: The Junction of the Visual and the Textual: Challenges and Perspectives, 67–72.

McDonald, John C, Rosalee J Wolfe, Sarah Johnson, Souad Baowidan, Robyn Moncrief, and Ningshan Guo. 2017. “An Improved Framework for Layering Linguistic Processes in Sign Language Generation: Why There Should Never Be a ‘Brows’ Tier.” In Universal Access in Human–Computer Interaction. Designing Novel Interactions: 11th International Conference, Uahci 2017, Held as Part of Hci International 2017, Vancouver, Bc, Canada, July 9–14, 2017, Proceedings, Part Ii 11, 41–54. Springer.

McDonald, John C, Rosalee J Wolfe, Jerry C Schnepp, Julie Hochgesang, Diana Gorman Jamrozik, Marie Stumbo, Larwan Berke, Melissa Bialek, and Farah Thomas. 2016. “An Automated Technique for Real-Time Production of Lifelike Animations of American Sign Language.” Universal Access in the Information Society 15: 551–66.

McKee, David, and Graeme Kennedy. 2000. “Lexical Comparison of Signs from American, Australian, British and New Zealand Sign Languages.” The Signs of Language Revisited: An Anthology to Honor Ursula Bellugi and Edward Klima, 49–76.

Mesch, Johanna, and Lars Wallin. 2012. “From Meaning to Signs and Back: Lexicography and the Swedish Sign Language Corpus.” In Proceedings of the 5th Workshop on the Representation and Processing of Sign Languages: Interactions Between Corpus and Lexicon [Language Resources and Evaluation Conference (Lrec)], 123–26.

———. 2015. “Gloss Annotations in the Swedish Sign Language Corpus.” International Journal of Corpus Linguistics 20 (1): 102–20.

Min, Jianyuan, and Jinxiang Chai. 2012. “Motion Graphs++ a Compact Generative Model for Semantic Motion Analysis and Synthesis.” ACM Transactions on Graphics (TOG) 31 (6): 1–12.

Momeni, Liliane, Hannah Bull, KR Prajwal, Samuel Albanie, Gül Varol, and Andrew Zisserman. 2022a. “Automatic Dense Annotation of Large-Vocabulary Sign Language Videos.” In Computer Vision–Eccv 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part Xxxv, 671–90. Springer.

Momeni, Liliane, Hannah Bull, K. R. Prajwal, Samuel Albanie, Gül Varol, and Andrew Zisserman. 2022b. “Automatic Dense Annotation of Large-Vocabulary Sign Language Videos.” In Computer Vision – ECCV 2022, edited by Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner, 671–90. Cham: Springer Nature Switzerland. https://doi.org/10.1007/978-3-031-19833-5_39.

Moncrief, Robyn. 2020. “Extending a Model for Animating Adverbs of Manner in American Sign Language.” In Proceedings of the Lrec2020 9th Workshop on the Representation and Processing of Sign Languages: Sign Language Resources in the Service of the Language Community, Technological Challenges and Application Perspectives, 151–56.

———. 2021. “Generalizing a Model for Animating Adverbs of Manner in American Sign Language.” Machine Translation 35 (3): 345–62.

Monteiro, Caio DD, Christy Maria Mathew, Ricardo Gutierrez-Osuna, and Frank Shipman. 2016. “Detecting and Identifying Sign Languages Through Visual Features.” In 2016 Ieee International Symposium on Multimedia (Ism), 287–90. IEEE.

Montemurro, Kathryn, and Diane Brentari. 2018. “Emphatic Fingerspelling as Code-Mixing in American Sign Language.” Proceedings of the Linguistic Society of America 3 (1): 61–61.

Moryossef, Amit, Zifan Jiang, Mathias Müller, Sarah Ebling, and Yoav Goldberg. 2023. “Linguistically Motivated Sign Language Segmentation.” In Findings of the Association for Computational Linguistics: EMNLP 2023, edited by Houda Bouamor, Juan Pino, and Kalika Bali, 12703–24. Singapore: Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.findings-emnlp.846.

Moryossef, Amit, and Mathias Müller. 2021. “Sign Language Datasets.” https://github.com/sign-language-processing/datasets.

Moryossef, Amit, Mathias Müller, Anne Göhring, Zifan Jiang, Yoav Goldberg, and Sarah Ebling. 2023. “An Open-Source Gloss-Based Baseline for Spoken to Signed Language Translation.” In 2nd International Workshop on Automatic Translation for Signed and Spoken Languages (AT4SSL). https://github.com/ZurichNLP/spoken-to-signed-translation.

Moryossef, Amit, Ioannis Tsochantaridis, Roee Yosef Aharoni, Sarah Ebling, and Srini Narayanan. 2020. “Real-Time Sign-Language Detection Using Human Pose Estimation.”

Moryossef, Amit, Kayo Yin, Graham Neubig, and Yoav Goldberg. 2021. “Data Augmentation for Sign Language Gloss Translation.” In Proceedings of the 1st International Workshop on Automatic Translation for Signed and Spoken Languages (AT4SSL), 1–11. Virtual: Association for Machine Translation in the Americas. https://aclanthology.org/2021.mtsummit-at4ssl.1.

Mukushev, Medet, Aigerim Kydyrbekova, Vadim Kimmelman, and Anara Sandygulova. 2022. “Towards Large Vocabulary Kazakh-Russian Sign Language Dataset: KRSL-OnlineSchool.” In 13th International Conference on Language Resources and Evaluation (LREC 2022), edited by Eleni Efthimiou, Stavroula-Evita Fotinea, Thomas Hanke, Julie A. Hochgesang, Jette Kristoffersen, Johanna Mesch, and Marc Schulder, 154–58. Marseille, France: European Language Resources Association (ELRA). https://www.sign-lang.uni-hamburg.de/lrec/pub/22031.pdf.

Murray, Joseph J, Wyatte C Hall, and Kristin Snoddon. 2020. “The Importance of Signed Languages for Deaf Children and Their Families.” The Hearing Journal 73 (3): 30–32.

Müller, Mathias, Sarah Ebling, Eleftherios Avramidis, Alessia Battisti, Michèle Berger, Richard Bowden, Annelies Braffort, et al. 2022. “Findings of the First WMT Shared Task on Sign Language Translation (WMT-SLT22).” In Proceedings of the Seventh Conference on Machine Translation (WMT), 744–72. Abu Dhabi, United Arab Emirates (Hybrid): Association for Computational Linguistics. https://aclanthology.org/2022.wmt-1.71.

Müller, Mathias, Zifan Jiang, Amit Moryossef, Annette Rios, and Sarah Ebling. 2023. “Considerations for Meaningful Sign Language Machine Translation Based on Glosses.” In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 682–93. Toronto, Canada: Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.acl-short.60.

Napier, Jemina, and Lorraine Leeson. 2016. Sign Language in Action. London: Palgrave Macmillan.

Neidle, Carol, Stan Sclaroff, and Vassilis Athitsos. 2001. “SignStream: A Tool for Linguistic and Computer Vision Research on Visual-Gestural Language Data.” Behavior Research Methods, Instruments, & Computers 33 (3): 311–20.

Oliveira, Marlon, Houssem Chatbri, Ylva Ferstl, Mohamed Farouk, Suzanne Little, Noel O’Connor, and A. Sutherland. 2017. “A Dataset for Irish Sign Language Recognition.” In Proceedings of the Irish Machine Vision and Image Processing Conference (IMVIP).

Ormel, Ellen, and Onno Crasborn. 2012. “Prosodic Correlates of Sentences in Signed Languages: A Literature Review and Suggestions for New Types of Studies.” Sign Language Studies 12 (2): 279–315.

Othman, Achraf, and Mohamed Jemni. 2012. “English-ASL Gloss Parallel Corpus 2012: ASLG-PC12.” In 5th Workshop on the Representation and Processing of Sign Languages: Interactions Between Corpus and Lexicon, LREC.

Öqvist, Zrajm, Nikolaus Riemer Kankkonen, and Johanna Mesch. 2020. “STS-Korpus: A Sign Language Web Corpus Tool for Teaching and Public Use.” In Proceedings of the LREC 2020 9th Workshop on the Representation and Processing of Sign Languages: Sign Language Resources in the Service of the Language Community, Technological Challenges and Application Perspectives, 177–80. Marseille, France: European Language Resources Association (ELRA). https://www.aclweb.org/anthology/2020.signlang-1.29.

Padden, C. 1988. Interaction of Morphology and Syntax in American Sign Language. Outstanding Dissertations in Linguistics Series. Garland. https://books.google.com/books?id=Mea7AAAAIAAJ.

Padden, Carol A. 1998. “The ASL Lexicon.” Sign Language & Linguistics 1 (1): 39–60.

Padden, Carol A, and Darline Clark Gunsauls. 2003. “How the Alphabet Came to Be Used in a Sign Language.” Sign Language Studies, 10–33.

Padden, Carol A, and Tom Humphries. 1988. Deaf in America. Harvard University Press.

Pal, Abhilash, Stephan Huber, Cyrine Chaabani, Alessandro Manzotti, and Oscar Koller. 2023. “On the Importance of Signer Overlap for Sign Language Detection.” arXiv Preprint arXiv:2303.10782.

Panteleris, Paschalis, Iason Oikonomidis, and Antonis Argyros. 2018. “Using a Single RGB Frame for Real-Time 3D Hand Pose Estimation in the Wild.” In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), 436–45. IEEE.

Patrie, Carol J, and Robert E Johnson. 2011. Fingerspelled Word Recognition Through Rapid Serial Visual Presentation: RSVP. DawnSignPress.

Pavllo, Dario, Christoph Feichtenhofer, David Grangier, and Michael Auli. 2019. “3D Human Pose Estimation in Video with Temporal Convolutions and Semi-Supervised Training.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7753–62.

Pishchulin, Leonid, Arjun Jain, Mykhaylo Andriluka, Thorsten Thormählen, and Bernt Schiele. 2012. “Articulated People Detection and Pose Estimation: Reshaping the Future.” In 2012 IEEE Conference on Computer Vision and Pattern Recognition, 3178–85. IEEE.

Pizzuto, Elena Antinoro, Paolo Rossini, and Tommaso Russo. 2006. “Representing Signed Languages in Written Form: Questions That Need to Be Posed.” In 5th International Conference on Language Resources and Evaluation (LREC 2006), edited by Chiara Vettori, 1–6. Genoa, Italy: European Language Resources Association (ELRA). https://www.sign-lang.uni-hamburg.de/lrec/pub/06001.pdf.

Pratap, Vineel, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, and Ronan Collobert. 2020. “MLS: A Large-Scale Multilingual Dataset for Speech Research.” In Proc. Interspeech 2020, 2757–61. https://doi.org/10.21437/Interspeech.2020-2826.

Prillwitz, Siegmund, Thomas Hanke, Susanne König, Reiner Konrad, Gabriele Langer, and Arvid Schwarz. 2008. “DGS Corpus Project–Development of a Corpus Based Electronic Dictionary German Sign Language/German.” In Sign-Lang at LREC 2008, 159–64. European Language Resources Association (ELRA).

Prillwitz, Siegmund, and Heiko Zienert. 1990. “Hamburg Notation System for Sign Language: Development of a Sign Writing with Computer Application.” In Current Trends in European Sign Language Research. Proceedings of the 3rd European Congress on Sign Language Research, 355–79.

Ramírez, Javier, José C Segura, Carmen Benítez, Angel De La Torre, and Antonio Rubio. 2004. “Efficient Voice Activity Detection Algorithms Using Long-Term Speech Information.” Speech Communication 42 (3-4): 271–87.

Rathmann, Christian, and Gaurav Mathur. 2011. “A Featural Approach to Verb Agreement in Signed Languages.” Theoretical Linguistics 37 (3-4): 197–208.

Renz, Katrin, Nicolaj C Stache, Samuel Albanie, and Gül Varol. 2021. “Sign Language Segmentation with Temporal Convolutional Networks.” In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2135–9. IEEE.

Renz, Katrin, Nicolaj C Stache, Neil Fox, Gül Varol, and Samuel Albanie. 2021. “Sign Segmentation with Changepoint-Modulated Pseudo-Labelling.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3403–12.

Rombach, Robin, A. Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2021. “High-Resolution Image Synthesis with Latent Diffusion Models.” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10674–85.

Roy, Cynthia B. 2011. Discourse in Signed Languages. Gallaudet University Press.

Sams, Ataher, Ahsan Habib Akash, and S. M. Mahbubur Rahman. 2023. “SignBD-Word: Video-Based Bangla Word-Level Sign Language and Pose Translation.” In 2023 14th International Conference on Computing Communication and Networking Technologies (ICCCNT), 1–7. https://doi.org/10.1109/ICCCNT56998.2023.10306914.

Sandler, Wendy. 2010. “Prosody and Syntax in Sign Languages.” Transactions of the Philological Society 108 (3): 298–328.

———. 2012. “The Phonological Organization of Sign Languages.” Language and Linguistics Compass 6 (3): 162–82.

Sandler, Wendy, and Diane Lillo-Martin. 2006. Sign Language and Linguistic Universals. Cambridge University Press.

Santemiz, Pinar, Oya Aran, Murat Saraclar, and Lale Akarun. 2009. “Automatic Sign Segmentation from Continuous Signing via Multiple Sequence Alignment.” In 2009 IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops), 2001–8. IEEE.

Saunders, Ben, Necati Cihan Camgöz, and Richard Bowden. 2020a. “Adversarial Training for Multi-Channel Sign Language Production.” In The 31st British Machine Vision Virtual Conference. British Machine Vision Association.

———. 2020b. “Everybody Sign Now: Translating Spoken Language to Photo Realistic Sign Language Video.” arXiv Preprint arXiv:2011.09846.

———. 2020c. “Progressive Transformers for End-to-End Sign Language Production.” In European Conference on Computer Vision, 687–705.

———. 2021. “AnonySign: Novel Human Appearance Synthesis for Sign Language Video Anonymisation.” In 2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021), 1–8. https://doi.org/10.1109/FG52635.2021.9666984.

Savitzky, Abraham, and Marcel JE Golay. 1964. “Smoothing and Differentiation of Data by Simplified Least Squares Procedures.” Analytical Chemistry 36 (8): 1627–39.

Schembri, Adam, Kearsy Cormier, and Jordan Fenlon. 2018. “Indicating Verbs as Typologically Unique Constructions: Reconsidering Verb ‘Agreement’ in Sign Languages.” Glossa: A Journal of General Linguistics 3 (1).

Schembri, Adam, Jordan Fenlon, Ramas Rentelis, Sally Reynolds, and Kearsy Cormier. 2013. “Building the British Sign Language Corpus.” Language Documentation & Conservation 7: 136–54.

Schnepp, Jerry C, Rosalee J Wolfe, John C McDonald, and Jorge A Toro. 2012. “Combining Emotion and Facial Nonmanual Signals in Synthesized American Sign Language.” In Proceedings of the 14th International ACM SIGACCESS Conference on Computers and Accessibility, 249–50.

Schnepp, Jerry C., Rosalee J. Wolfe, John C. McDonald, and Jorge A. Toro. 2013. “Generating Co-Occurring Facial Nonmanual Signals in Synthesized American Sign Language.” In GRAPP/IVAPP.

Sehyr, Zed Sevcikova, Naomi Caselli, Ariel M Cohen-Goldberg, and Karen Emmorey. 2021. “The ASL-LEX 2.0 Project: A Database of Lexical and Phonological Properties for 2,723 Signs in American Sign Language.” The Journal of Deaf Studies and Deaf Education 26 (2): 263–77.

Selvaraj, Prem, Gokul Nc, Pratyush Kumar, and Mitesh Khapra. 2022. “OpenHands: Making Sign Language Recognition Accessible with Pose-Based Pretrained Models Across Languages.” In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2114–33. Dublin, Ireland: Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.acl-long.150.

Sennrich, Rico, Barry Haddow, and Alexandra Birch. 2016. “Neural Machine Translation of Rare Words with Subword Units.” In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1715–25. Berlin, Germany: Association for Computational Linguistics. https://doi.org/10.18653/v1/P16-1162.

Shi, Bowen, Diane Brentari, Gregory Shakhnarovich, and Karen Livescu. 2022. “Open-Domain Sign Language Translation Learned from Online Video.” In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 6365–79. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics. https://aclanthology.org/2022.emnlp-main.427.

Shi, B., A. Martinez Del Rio, J. Keane, D. Brentari, G. Shakhnarovich, and K. Livescu. 2019. “Fingerspelling Recognition in the Wild with Iterative Visual Attention.” ICCV.

Shi, B., A. Martinez Del Rio, J. Keane, J. Michaux, D. Brentari, G. Shakhnarovich, and K. Livescu. 2018. “American Sign Language Fingerspelling Recognition in the Wild.” SLT.

Shieber, Stuart M. 1994. “Restricting the Weak-Generative Capacity of Synchronous Tree-Adjoining Grammars.” Computational Intelligence 10 (4): 371–85.

Shieber, Stuart, and Yves Schabes. 1990. “Synchronous Tree-Adjoining Grammars.” In Proceedings of the 13th International Conference on Computational Linguistics. Association for Computational Linguistics.

Shroyer, Edgar H, and Susan P Shroyer. 1984. Signs Across America: A Look at Regional Differences in American Sign Language. Gallaudet University Press.

Sign Time GmbH. 2020. SiMAX. Sign Time GmbH. https://simax.media/.

Simon, Tomas, Hanbyul Joo, Iain Matthews, and Yaser Sheikh. 2017. “Hand Keypoint Detection in Single Images Using Multiview Bootstrapping.” In CVPR.

Simonyan, Karen, and Andrew Zisserman. 2015. “Very Deep Convolutional Networks for Large-Scale Image Recognition.” CoRR.

Sincan, Ozge Mercanoglu, and Hacer Yalim Keles. 2020. “AUTSL: A Large Scale Multi-Modal Turkish Sign Language Dataset and Baseline Methods.” IEEE Access 8: 181340–55.

Sohn, Jongseo, Nam Soo Kim, and Wonyong Sung. 1999. “A Statistical Model-Based Voice Activity Detection.” IEEE Signal Processing Letters 6 (1): 1–3.

Stokoe, Jr., William C. 1960. “Sign Language Structure: An Outline of the Visual Communication Systems of the American Deaf.” The Journal of Deaf Studies and Deaf Education 10 (1): 3–37. https://doi.org/10.1093/deafed/eni001.

Stokoe Jr, William C. 2005. “Sign Language Structure: An Outline of the Visual Communication Systems of the American Deaf.” Journal of Deaf Studies and Deaf Education 10 (1): 3–37.

Stoll, Stephanie, Necati Cihan Camgöz, Simon Hadfield, and Richard Bowden. 2018. “Sign Language Production Using Neural Machine Translation and Generative Adversarial Networks.” In Proceedings of the 29th British Machine Vision Conference (BMVC 2018). British Machine Vision Association.

———. 2020. “Text2Sign: Towards Sign Language Production Using Neural Machine Translation and Generative Adversarial Networks.” International Journal of Computer Vision, 1–18.

Supalla, Ted. 1986. “The Classifier System in American Sign Language.” Noun Classes and Categorization 7: 181–214.

Sutton, Valerie. 1990. Lessons in Sign Writing. SignWriting.

Tavella, Federico, Viktor Schlegel, Marta Romeo, Aphrodite Galata, and Angelo Cangelosi. 2022. “WLASL-LEX: A Dataset for Recognising Phonological Properties in American Sign Language.” In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 453–63. Dublin, Ireland: Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.acl-short.49.

“TensorFlow Datasets, a Collection of Ready-to-Use Datasets.” n.d. https://www.tensorflow.org/datasets.

United Nations. 2022. “International Day of Sign Languages.” https://www.un.org/en/observances/sign-languages-day.

Vaezi Joze, Hamid, and Oscar Koller. 2019. “MS-ASL: A Large-Scale Data Set and Benchmark for Understanding American Sign Language.” In The British Machine Vision Conference (BMVC). https://www.microsoft.com/en-us/research/publication/ms-asl-a-large-scale-data-set-and-benchmark-for-understanding-american-sign-language/.

Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” In Advances in Neural Information Processing Systems, 5998–6008.

Viitaniemi, Ville, Tommi Jantunen, Leena Savolainen, Matti Karppa, and Jorma Laaksonen. 2014. “S-Pot - a Benchmark in Spotting Signs Within Continuous Signing.” In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), 1892–7. Reykjavik, Iceland: European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2014/pdf/440_Paper.pdf.

Vintar, Špela, Boštjan Jerko, and Marjetka Kulovec. 2012. “Compiling the Slovene Sign Language Corpus.” In 5th Workshop on the Representation and Processing of Sign Languages: Interactions Between Corpus and Lexicon. Language Resources and Evaluation Conference (LREC), 5:159–62.

Vogler, Christian, and Siome Goldenstein. 2005. “Analysis of Facial Expressions in American Sign Language.” In Proc. of the 3rd Int. Conf. on Universal Access in Human-Computer Interaction. Springer.

Von Agris, Ulrich, and Karl-Friedrich Kraiss. 2007. “Towards a Video Corpus for Signer-Independent Continuous Sign Language Recognition.” Gesture in Human-Computer Interaction and Simulation, Lisbon, Portugal, May 11.

Walsh, Harry Thomas, Ben Saunders, and Richard Bowden. 2022. “Changing the Representation: Examining Language Representation for Neural Sign Language Production.” In LREC 2022.

Wang, Ting-Chun, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. 2018. “Video-to-Video Synthesis.” In Advances in Neural Information Processing Systems (NeurIPS).

Wei, Shih-En, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. 2016. “Convolutional Pose Machines.” In CVPR.

Wheatland, Nkenge, Ahsan Abdullah, Michael Neff, Sophie Jörg, and Victor Zordan. 2016. “Analysis in Support of Realistic Timing in Animated Fingerspelling.” In 2016 IEEE Virtual Reality (VR), 309–10. IEEE.

Wilcox, Sherman. 1992. The Phonetics of Fingerspelling. Vol. 4. John Benjamins Publishing.

Wilcox, Sherman, and Sarah Hafer. 2004. “Rethinking Classifiers.” Review of Perspectives on Classifier Constructions in Sign Languages, edited by K. Emmorey (Mahwah, NJ: Lawrence Erlbaum Associates, 2003), 332 pages, hardcover. Oxford University Press.

Wittenburg, Peter, Hennie Brugman, Albert Russel, Alex Klassmann, and Han Sloetjes. 2006. “ELAN: A Professional Framework for Multimodality Research.” In 5th International Conference on Language Resources and Evaluation (LREC 2006), 1556–9.

Wolfe, Rosalee, Peter Cook, John C McDonald, and Jerry C Schnepp. 2011. “Linguistics as Structure in Computer Animation: Toward a More Effective Synthesis of Brow Motion in American Sign Language.” Sign Language & Linguistics 14 (1): 179–99.

Wolfe, Rosalee J, Elena Jahn, Ronan Johnson, and John C McDonald. 2019. “The Case for Avatar Makeup.”

Wolfe, Rosalee J, John C McDonald, Ronan Johnson, Ben Sturr, Syd Klinghoffer, Anthony Bonzani, Andrew Alexander, and Nicole Barnekow. 2022. “Supporting Mouthing in Signed Languages: New Innovations and a Proposal for Future Corpus Building.” In Proceedings of the 7th International Workshop on Sign Language Translation and Avatar Technology: The Junction of the Visual and the Textual: Challenges and Perspectives, 125–30.

World Federation of the Deaf. 2022. “World Federation of the Deaf - Our Work.” https://wfdeaf.org/our-work/.

World Health Organization. 2021. “Deafness and Hearing Loss.” https://www.who.int/news-room/fact-sheets/detail/deafness-and-hearing-loss.

Xiao, Qinkun, Minying Qin, and Yuting Yin. 2020. “Skeleton-Based Chinese Sign Language Recognition and Generation for Bidirectional Communication Between Deaf and Hearing People.” Neural Networks 125: 41–55.

Yin, Kayo, Amit Moryossef, Julie Hochgesang, Yoav Goldberg, and Malihe Alikhani. 2021. “Including Signed Languages in Natural Language Processing.” In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 7347–60. Online: Association for Computational Linguistics. https://aclanthology.org/2021.acl-long.570.

Yin, Kayo, and Jesse Read. 2020. “Better Sign Language Translation with STMC-Transformer.” In Proceedings of the 28th International Conference on Computational Linguistics, 5975–89. Barcelona, Spain (Online): International Committee on Computational Linguistics. https://doi.org/10.18653/v1/2020.coling-main.525.

Yu, Ting, Haoteng Yin, and Zhanxing Zhu. 2017. “Spatio-Temporal Graph Convolutional Networks: A Deep Learning Framework for Traffic Forecasting.” In International Joint Conference on Artificial Intelligence.

Zafrulla, Zahoor, Helene Brashear, Harley Hamilton, and Thad Starner. 2010. “A Novel Approach to American Sign Language (ASL) Phrase Verification Using Reversed Signing.” In 2010 Ieee Computer Society Conference on Computer Vision and Pattern Recognition-Workshops, 48–55. IEEE.

Zelinka, Jan, and Jakub Kanis. 2020. “Neural Sign Language Synthesis: Words Are Our Glosses.” In The IEEE Winter Conference on Applications of Computer Vision, 3395–3403.

Zhang, Biao, Mathias Müller, and Rico Sennrich. 2023. “SLTUNET: A Simple Unified Model for Sign Language Translation.” In The Eleventh International Conference on Learning Representations. Kigali, Rwanda. https://openreview.net/forum?id=EBS4C77p_5S.

Zhang, Lvmin, and Maneesh Agrawala. 2023. “Adding Conditional Control to Text-to-Image Diffusion Models.” http://arxiv.org/abs/2302.05543.

Zhao, Liwei, Karin Kipper, William Schuler, Christian Vogler, Norman Badler, and Martha Palmer. 2000. “A Machine Translation System from English to American Sign Language.” In Conference of the Association for Machine Translation in the Americas, 54–67. Springer.

Zwitserlood, Inge, Margriet Verlinden, Johan Ros, and Sanny Van Der Schoot. 2004. “Synthetic Signing for the Deaf: eSIGN.” In Proceedings of the Conference and Workshop on Assistive Technologies for Vision and Hearing Impairment, CVHI.

Łacheta, Joanna, and Paweł Rutkowski. 2014. “A Corpus-Based Dictionary of Polish Sign Language (PJM).” In.


  1. When capitalized, “Deaf” refers to a community of deaf people who share a language and a culture, whereas the lowercase “deaf” refers to the audiological condition of not hearing. We follow the more recent convention of abandoning the distinction between “Deaf” and “deaf”, using the latter term also to refer to (deaf) members of the sign language community (Napier and Leeson 2016; Annelies Maria Jozef Kusters, O’Brien, and De Meulder 2017).↩︎

  2. We refer mainly, but not exclusively, to ASL, the signed language on which most sign language research has been conducted.↩︎