🤟 Sign Language Recognition System

Multi-Stream Architecture: Skeleton + RGB + Feature Fusion

System Overview

This system combines three complementary approaches for robust sign language recognition:

  • 2000 sign classes (WLASL)
  • 3 processing streams (skeleton, RGB, feature)
  • 32 or 150 frames per sequence (RGB / skeleton)
  • 27 skeleton keypoints

Processing Pipeline

Skeleton stream:

  1. RGB video (input stream)
  2. Pose estimation: HRNet via MMPose. Graph reduction: the 133-node whole-body pose is trimmed to 27 nodes (10 nodes per hand, 7 nodes for the upper body).
  3. Four input modalities: joint, bone, joint motion, bone motion.
     • Bone data are generated by representing joint data as vectors pointing from source joints to their target joints.
     • Motion data are generated by taking the difference between adjacent frames in both the joint and bone streams.
  4. Sign Language Graph Convolutional Network (VSNet transform)
  5. Prediction: 2000 / 1000 / 300 / 100 classes
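The bone and motion modalities described above can be sketched as follows. This is a minimal illustration, not the project's code; the source→target pairs of the real 27-node graph depend on the trimmed skeleton, so the 3-joint chain here is purely a toy example.

```python
import numpy as np

def make_bone_and_motion(joints: np.ndarray, pairs: list):
    """joints: (T, V, C) keypoints, C = (x, y, confidence).
    pairs: (source, target) joint index pairs of the skeleton graph.

    Returns the bone and motion modalities: bones are vectors pointing
    from a source joint to its target joint; motion is the difference
    between adjacent frames in the joint and bone streams.
    """
    bones = np.zeros_like(joints)
    for src, dst in pairs:
        bones[:, dst] = joints[:, dst] - joints[:, src]  # vector source -> target

    joint_motion = np.zeros_like(joints)
    joint_motion[1:] = joints[1:] - joints[:-1]          # frame-to-frame difference

    bone_motion = np.zeros_like(bones)
    bone_motion[1:] = bones[1:] - bones[:-1]
    return bones, joint_motion, bone_motion

# Toy example: 3 joints in a chain 0 -> 1 -> 2 over T = 2 frames
pairs = [(0, 1), (1, 2)]
joints = np.random.rand(2, 3, 3)
bones, jm, bm = make_bone_and_motion(joints, pairs)
```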
RGB stream:

  1. RGB video (input stream)
  2. Video frame extraction: all frames of each video
  3. Frame selection: target length of 32 frames
     • Training strategy: randomly select a continuous segment of frames
     • Testing strategy: divide the video into 5 segments of equal length
  4. Data augmentation (training only)
     • Horizontal flip: 50% probability
     • Random rotation: within ±5 degrees
     • Random cropping: from 256 × 256 down to 224 × 224 pixels
  5. 3D-CNN: R(2+1)D (r2plus1d_18)
  6. Prediction: 2000 / 1000 / 300 / 100 classes
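The frame-selection strategies above can be sketched like this. The function names and the padding rule for short clips are illustrative assumptions, not the project's exact implementation:

```python
import random

def select_train_frames(num_frames: int, target_len: int = 32) -> list:
    """Training: randomly pick one continuous segment of target_len frames."""
    if num_frames <= target_len:
        # Assumption: pad short clips by repeating the last frame index.
        return list(range(num_frames)) + [num_frames - 1] * (target_len - num_frames)
    start = random.randint(0, num_frames - target_len)
    return list(range(start, start + target_len))

def select_test_segments(num_frames: int, num_segments: int = 5) -> list:
    """Testing: divide the video into num_segments equal-length segments."""
    seg_len = num_frames // num_segments
    return [list(range(i * seg_len, (i + 1) * seg_len)) for i in range(num_segments)]

train_idx = select_train_frames(100)   # one random contiguous 32-frame window
test_segs = select_test_segments(100)  # five 20-frame segments
```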

🦴 Skeleton Stream

  • 27 body keypoints
  • VSFormer architecture
  • Spatial-temporal attention
  • Graph convolution (44→36→24)
  • Robust to background

🎥 RGB Stream

  • Raw video frames
  • R(2+1)D CNN
  • Swish activation
  • Spatiotemporal decomposition
  • Captures appearance details

🔄 Feature Stream

  • Hand-crafted features
  • Separable 3D Conv
  • Efficient processing
  • Complementary information
  • Domain knowledge

Skeleton-Based Recognition (GCN)

HRNet · Graph Convolution · Transformer

Step 1: Pose Estimation (HRNet)

Extract 27 keypoints from each frame using hrnet_w48_coco_wholebody.

Output: (T=150, V=27, C=3), where C = (x, y, confidence).

Step 2: Graph Construction

Build the skeleton graph with an adjacency matrix A representing joint connections.

Progressive reduction: 44 nodes → 36 nodes → 24 nodes.

Step 3: GCN Processing (3 Layers)

L1: TCN_GCN_drop(3 → 64, V=44) + DropBlock
L2: TCN_GCN_drop(64 → 96, V=36) + DropBlock
L3: TCN_GCN_drop(96 → 96, V=24) + DropBlock
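The core graph-convolution step inside these layers can be illustrated with a minimal sketch (this is the generic operation X' = A·X·W, not the project's actual TCN_GCN_drop module, which adds temporal convolution and DropBlock):

```python
import torch

def gcn_layer(x: torch.Tensor, A: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """One graph-convolution step: X' = A X W.

    x: (N, T, V, C_in) skeleton features
    A: (V, V) normalized adjacency matrix of joint connections
    W: (C_in, C_out) learnable weights
    """
    x = torch.einsum("uv,ntvc->ntuc", A, x)   # aggregate features over neighbors
    return torch.einsum("ntvc,cd->ntvd", x, W)  # per-node linear projection

# Toy sizes echoing layer L1: channels 3 -> 64 over V = 44 nodes
N, T, V = 2, 150, 44
x = torch.randn(N, T, V, 3)
A = torch.eye(V)               # identity adjacency, just for the sketch
W = torch.randn(3, 64)
out = gcn_layer(x, A, W)       # (2, 150, 44, 64)
```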

Step 4: VSFormer Blocks (×8)

Multi-head self-attention with partition size (6, 4)

Parallel GCN + TCN branches

Channel progression: 96 → 192 → 192 → 192

Step 5: Classification

Global pooling over the (T, V) dimensions

Fully connected layer: 192 → 100 classes
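The classification step reduces to a few lines; tensor layout (N, C, T, V) is an assumption about how the features leave the last block:

```python
import torch
import torch.nn as nn

# Global pooling over the (T, V) dimensions, then the 192 -> 100 classifier.
feats = torch.randn(2, 192, 150, 24)   # (N, C, T, V) after the last VSFormer block
pooled = feats.mean(dim=(2, 3))        # average over time and joints -> (N, 192)
fc = nn.Linear(192, 100)
logits = fc(pooled)                    # (N, 100) class scores
```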

Key Components

| Component | Description | Parameters |
|---|---|---|
| HRNet Extractor | Whole-body pose estimation | 27 joints (hands, body, face) |
| Graph Reduction | Progressive joint pooling | 44 → 36 → 24 nodes |
| DropBlock | Spatial-temporal regularization | Skeleton & time dropout |
| VSFormer | Spatial-temporal transformer | 8 blocks, 32 heads |

RGB-Based Recognition (3D-CNN)

3D CNN · Spatiotemporal Decomposition · Swish Activation

Step 1: Video Input

Raw RGB video frames: (C=3, T, H, W), with 3 channels (red, green, blue).

Step 2: Spatiotemporal Decomposition

2D spatial conv: captures hand shapes and poses within each frame.

1D temporal conv: captures movement patterns across frames.

Architecture: torchvision.models.video.r2plus1d_18

Step 3: Activation Optimization

Replace every ReLU with SiLU (Swish).

Formula: Swish(x) = x · sigmoid(x)

Benefits: smoother gradients, better learning.
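The formula maps directly to code; PyTorch ships it as `torch.nn.SiLU` / `torch.nn.functional.silu`, which this one-liner reproduces:

```python
import torch

def swish(x: torch.Tensor) -> torch.Tensor:
    """Swish(x) = x * sigmoid(x); identical to PyTorch's SiLU."""
    return x * torch.sigmoid(x)

x = torch.linspace(-3.0, 3.0, 7)
y = swish(x)   # smooth, non-monotonic: slightly negative for x < 0, ~x for large x
```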

Step 4: Feature Flattening

Global spatiotemporal pooling, then flatten to a 1D feature vector: out.flatten(1)

Step 5: Classification Head

Dropout(p=0.5) for regularization

Linear layer → 100 sign classes

R(2+1)D Architecture Details

| Layer Type | Operation | Purpose |
|---|---|---|
| 2D Spatial Conv | Conv2D on each frame | Extract spatial features (hand shapes) |
| 1D Temporal Conv | Conv1D across time | Capture motion dynamics |
| Swish Activation | x · sigmoid(x) | Smooth, non-monotonic activation |
| Dropout | Random neuron dropout | Prevent overfitting |

Method Comparison

| Aspect | Skeleton (VSFormer) | RGB (R(2+1)D) | Feature Stream |
|---|---|---|---|
| Input | 27 keypoints (x, y, conf) | RGB frames (3×H×W) | Hand-crafted features |
| Preprocessing | HRNet pose estimation | Frame normalization | Feature extraction |
| Main Architecture | GCN + Transformer | 3D CNN (decomposed) | Separable 3D Conv |
| Key Strength | Structural understanding | Appearance details | Domain knowledge |
| Robustness | Background invariant | Lighting sensitive | Stable features |
| Computation | Medium (pose + GCN) | High (3D convolutions) | Low (efficient conv) |
| Extraction Time | ~64 hours | ~17 hours | ... |

Fusion Strategy

Step 1: Individual Stream Predictions

Each stream produces a probability distribution over 100 classes.

Step 2: Weighted Fusion

Final = α·Skeleton + β·RGB + γ·Feature

Weights are learned during training or fixed empirically.

Step 3: Ensemble Decision

Final prediction: argmax(Final), combining the complementary strengths of all streams.
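The three fusion steps can be sketched as below. The weights (0.5, 0.3, 0.2) are illustrative placeholders for (α, β, γ), not values from the project:

```python
import torch
import torch.nn.functional as F

def fuse(skel_logits, rgb_logits, feat_logits, weights=(0.5, 0.3, 0.2)):
    """Weighted late fusion: Final = a*Skeleton + b*RGB + c*Feature."""
    a, b, c = weights  # fixed empirically here; could also be learned
    probs = (a * F.softmax(skel_logits, dim=-1)
             + b * F.softmax(rgb_logits, dim=-1)
             + c * F.softmax(feat_logits, dim=-1))
    return probs.argmax(dim=-1), probs  # ensemble decision + fused distribution

s, r, f = (torch.randn(4, 100) for _ in range(3))  # dummy per-stream logits
pred, probs = fuse(s, r, f)                        # pred: (4,), probs: (4, 100)
```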

Complete Multi-Stream Architecture

Skeleton stream: RGB video (H×W×T) → HRNet pose (27 joints, (T, V, 3)) → GCN (44→36→24) → VSFormer ×8 → Prediction 1 (2000 classes)

Feature stream: hand-crafted features → separable spatial-temporal 3D conv → Prediction 2 (2000 classes)

RGB stream: raw frames (3, T, H, W) → R(2+1)D (2D + 1D conv, Swish) → Prediction 3 (2000 classes)

Implementation Details

| Component | Framework/Library | Key Parameters |
|---|---|---|
| Pose Estimation | MMPose / HRNet | hrnet_w48_coco_wholebody |
| Skeleton GCN | PyTorch | 3 layers, dropout, batch_size=16 |
| VSFormer | Custom PyTorch | 8 blocks, 32 heads, partition (6,4) |
| R(2+1)D | torchvision.models.video | r2plus1d_18, Swish activation |
| Optimizer | Adam | lr=0.0005, epochs=100 |
| Loss Function | CrossEntropyLoss | Multi-class classification |
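The optimizer and loss rows of the table translate to a one-step training sketch; the `nn.Linear(192, 100)` model is a stand-in for any of the streams, not the real network:

```python
import torch
import torch.nn as nn

model = nn.Linear(192, 100)  # stand-in for a stream's classifier head
optimizer = torch.optim.Adam(model.parameters(), lr=0.0005)  # as in the table
criterion = nn.CrossEntropyLoss()

# One training step on dummy data (batch_size=16, as in the skeleton stream)
x = torch.randn(16, 192)
y = torch.randint(0, 100, (16,))
loss = criterion(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```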