🤟 Sign Language Recognition System

Multi-Stream Architecture: Skeleton + RGB + Feature Fusion

System Overview

This system combines three complementary approaches for robust sign language recognition:

  • 2000 sign classes (WLASL)
  • 3 processing streams (skeleton, RGB, feature)
  • 32 or 150 frames per sequence (RGB / skeleton)
  • 27 skeleton keypoints

Processing Pipeline

Skeleton stream:

  1. RGB video (input stream)
  2. Pose estimation: HRNet via MMPose. Graph reduction: the 133-node whole-body pose is trimmed to 27 nodes (10 nodes per hand, 7 nodes for the upper body).
  3. Four input modalities: joint, bone, joint motion, bone motion.
     • Bone data are generated by representing joint data as vectors pointing from source joints to their target joints.
     • Motion data are generated by taking the difference between adjacent frames in both the joint and bone streams.
  4. Sign Language Graph Convolutional Network (VSNet transform)
  5. Prediction: 2000 / 1000 / 300 / 100 classes
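The bone and motion modalities described above can be sketched as follows. This is a minimal illustration, not the project's code; the source→target pairs of the real 27-node graph depend on the trimmed skeleton, so the 3-joint chain here is purely a toy example.

```python
import numpy as np

def make_bone_and_motion(joints: np.ndarray, pairs: list):
    """joints: (T, V, C) keypoints, C = (x, y, confidence).
    pairs: (source, target) joint index pairs of the skeleton graph.

    Returns the bone and motion modalities: bones are vectors pointing
    from a source joint to its target joint; motion is the difference
    between adjacent frames in the joint and bone streams.
    """
    bones = np.zeros_like(joints)
    for src, dst in pairs:
        bones[:, dst] = joints[:, dst] - joints[:, src]  # vector source -> target

    joint_motion = np.zeros_like(joints)
    joint_motion[1:] = joints[1:] - joints[:-1]          # frame-to-frame difference

    bone_motion = np.zeros_like(bones)
    bone_motion[1:] = bones[1:] - bones[:-1]
    return bones, joint_motion, bone_motion

# Toy example: 3 joints in a chain 0 -> 1 -> 2 over T = 2 frames
pairs = [(0, 1), (1, 2)]
joints = np.random.rand(2, 3, 3)
bones, jm, bm = make_bone_and_motion(joints, pairs)
```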
RGB stream:

  1. RGB video (input stream)
  2. Video frame extraction: all frames of each video
  3. Frame selection: target length of 32 frames
     • Training strategy: randomly select a continuous segment of frames
     • Testing strategy: divide the video into 5 segments of equal length
  4. Data augmentation (training only)
     • Horizontal flip: 50% probability
     • Random rotation: within ±5 degrees
     • Random cropping: from 256 × 256 down to 224 × 224 pixels
  5. 3D-CNN: R(2+1)D (r2plus1d_18)
  6. Prediction: 2000 / 1000 / 300 / 100 classes
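The frame-selection strategies above can be sketched like this. The function names and the padding rule for short clips are illustrative assumptions, not the project's exact implementation:

```python
import random

def select_train_frames(num_frames: int, target_len: int = 32) -> list:
    """Training: randomly pick one continuous segment of target_len frames."""
    if num_frames <= target_len:
        # Assumption: pad short clips by repeating the last frame index.
        return list(range(num_frames)) + [num_frames - 1] * (target_len - num_frames)
    start = random.randint(0, num_frames - target_len)
    return list(range(start, start + target_len))

def select_test_segments(num_frames: int, num_segments: int = 5) -> list:
    """Testing: divide the video into num_segments equal-length segments."""
    seg_len = num_frames // num_segments
    return [list(range(i * seg_len, (i + 1) * seg_len)) for i in range(num_segments)]

train_idx = select_train_frames(100)   # one random contiguous 32-frame window
test_segs = select_test_segments(100)  # five 20-frame segments
```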

🦴 Skeleton Stream

  • 27 body keypoints
  • VSFormer architecture
  • Spatial-temporal attention
  • Graph convolution (44→36→24)
  • Robust to background

🎥 RGB Stream

  • Raw video frames
  • R(2+1)D CNN
  • Swish activation
  • Spatiotemporal decomposition
  • Captures appearance details

🔄 Feature Stream

  • Hand-crafted features
  • Separable 3D Conv
  • Efficient processing
  • Complementary information
  • Domain knowledge

Skeleton-Based Recognition (GCN)

HRNet · Graph Convolution · Transformer

Step 1: Pose Estimation (HRNet)

Extract 27 keypoints from each frame using hrnet_w48_coco_wholebody.

Output: (T=150, V=27, C=3), where C = (x, y, confidence).

Step 2: Graph Construction

Build the skeleton graph with an adjacency matrix A representing joint connections.

Progressive reduction: 44 nodes → 36 nodes → 24 nodes.

Step 3: GCN Processing (3 Layers)

L1: TCN_GCN_drop(3 → 64, V=44) + DropBlock
L2: TCN_GCN_drop(64 → 96, V=36) + DropBlock
L3: TCN_GCN_drop(96 → 96, V=24) + DropBlock
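The core graph-convolution step inside these layers can be illustrated with a minimal sketch (this is the generic operation X' = A·X·W, not the project's actual TCN_GCN_drop module, which adds temporal convolution and DropBlock):

```python
import torch

def gcn_layer(x: torch.Tensor, A: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """One graph-convolution step: X' = A X W.

    x: (N, T, V, C_in) skeleton features
    A: (V, V) normalized adjacency matrix of joint connections
    W: (C_in, C_out) learnable weights
    """
    x = torch.einsum("uv,ntvc->ntuc", A, x)   # aggregate features over neighbors
    return torch.einsum("ntvc,cd->ntvd", x, W)  # per-node linear projection

# Toy sizes echoing layer L1: channels 3 -> 64 over V = 44 nodes
N, T, V = 2, 150, 44
x = torch.randn(N, T, V, 3)
A = torch.eye(V)               # identity adjacency, just for the sketch
W = torch.randn(3, 64)
out = gcn_layer(x, A, W)       # (2, 150, 44, 64)
```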

Step 4: VSFormer Blocks (×8)

Multi-head self-attention with partition size (6, 4)

Parallel GCN + TCN branches

Channel progression: 96 → 192 → 192 → 192

Step 5: Classification

Global pooling over the (T, V) dimensions

Fully connected layer: 192 → 100 classes
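The classification step reduces to a few lines; tensor layout (N, C, T, V) is an assumption about how the features leave the last block:

```python
import torch
import torch.nn as nn

# Global pooling over the (T, V) dimensions, then the 192 -> 100 classifier.
feats = torch.randn(2, 192, 150, 24)   # (N, C, T, V) after the last VSFormer block
pooled = feats.mean(dim=(2, 3))        # average over time and joints -> (N, 192)
fc = nn.Linear(192, 100)
logits = fc(pooled)                    # (N, 100) class scores
```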

Key Components

| Component | Description | Parameters |
|---|---|---|
| HRNet Extractor | Whole-body pose estimation | 27 joints (hands, body, face) |
| Graph Reduction | Progressive joint pooling | 44 → 36 → 24 nodes |
| DropBlock | Spatial-temporal regularization | Skeleton & time dropout |
| VSFormer | Spatial-temporal transformer | 8 blocks, 32 heads |

RGB-Based Recognition (3D-CNN)

3D CNN · Spatiotemporal Decomposition · Swish Activation

Step 1: Video Input

Raw RGB video frames: (C=3, T, H, W), with 3 channels (red, green, blue).

Step 2: Spatiotemporal Decomposition

2D spatial conv: captures hand shapes and poses within each frame.

1D temporal conv: captures movement patterns across frames.

Architecture: torchvision.models.video.r2plus1d_18

Step 3: Activation Optimization

Replace every ReLU with SiLU (Swish).

Formula: Swish(x) = x · sigmoid(x)

Benefits: smoother gradients, better learning.
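The formula maps directly to code; PyTorch ships it as `torch.nn.SiLU` / `torch.nn.functional.silu`, which this one-liner reproduces:

```python
import torch

def swish(x: torch.Tensor) -> torch.Tensor:
    """Swish(x) = x * sigmoid(x); identical to PyTorch's SiLU."""
    return x * torch.sigmoid(x)

x = torch.linspace(-3.0, 3.0, 7)
y = swish(x)   # smooth, non-monotonic: slightly negative for x < 0, ~x for large x
```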

Step 4: Feature Flattening

Global spatiotemporal pooling, then flatten to a 1D feature vector: out.flatten(1)

Step 5: Classification Head

Dropout(p=0.5) for regularization

Linear layer → 100 sign classes

R(2+1)D Architecture Details

| Layer Type | Operation | Purpose |
|---|---|---|
| 2D Spatial Conv | Conv2D on each frame | Extract spatial features (hand shapes) |
| 1D Temporal Conv | Conv1D across time | Capture motion dynamics |
| Swish Activation | x · sigmoid(x) | Smooth, non-monotonic activation |
| Dropout | Random neuron dropout | Prevent overfitting |

Method Comparison

| Aspect | Skeleton (VSFormer) | RGB (R(2+1)D) | Feature Stream |
|---|---|---|---|
| Input | 27 keypoints (x, y, conf) | RGB frames (3×H×W) | Hand-crafted features |
| Preprocessing | HRNet pose estimation | Frame normalization | Feature extraction |
| Main Architecture | GCN + Transformer | 3D CNN (decomposed) | Separable 3D Conv |
| Key Strength | Structural understanding | Appearance details | Domain knowledge |
| Robustness | Background invariant | Lighting sensitive | Stable features |
| Computation | Medium (pose + GCN) | High (3D convolutions) | Low (efficient conv) |
| Extraction Time | ~64 hours | ~17 hours | ... |

Fusion Strategy

Step 1: Individual Stream Predictions

Each stream produces a probability distribution over 100 classes.

Step 2: Weighted Fusion

Final = α·Skeleton + β·RGB + γ·Feature

Weights are learned during training or fixed empirically.

Step 3: Ensemble Decision

Final prediction: argmax(Final), combining the complementary strengths of all streams.
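The three fusion steps can be sketched as below. The weights (0.5, 0.3, 0.2) are illustrative placeholders for (α, β, γ), not values from the project:

```python
import torch
import torch.nn.functional as F

def fuse(skel_logits, rgb_logits, feat_logits, weights=(0.5, 0.3, 0.2)):
    """Weighted late fusion: Final = a*Skeleton + b*RGB + c*Feature."""
    a, b, c = weights  # fixed empirically here; could also be learned
    probs = (a * F.softmax(skel_logits, dim=-1)
             + b * F.softmax(rgb_logits, dim=-1)
             + c * F.softmax(feat_logits, dim=-1))
    return probs.argmax(dim=-1), probs  # ensemble decision + fused distribution

s, r, f = (torch.randn(4, 100) for _ in range(3))  # dummy per-stream logits
pred, probs = fuse(s, r, f)                        # pred: (4,), probs: (4, 100)
```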

Complete Multi-Stream Architecture

Skeleton stream: RGB video (H×W×T) → HRNet pose (27 joints, (T, V, 3)) → GCN (44→36→24) → VSFormer ×8 → Prediction 1 (2000 classes)

Feature stream: hand-crafted features → separable spatial-temporal 3D conv → Prediction 2 (2000 classes)

RGB stream: raw frames (3, T, H, W) → R(2+1)D (2D + 1D conv, Swish) → Prediction 3 (2000 classes)

Implementation Details

| Component | Framework/Library | Key Parameters |
|---|---|---|
| Pose Estimation | MMPose / HRNet | hrnet_w48_coco_wholebody |
| Skeleton GCN | PyTorch | 3 layers, dropout, batch_size=16 |
| VSFormer | Custom PyTorch | 8 blocks, 32 heads, partition (6,4) |
| R(2+1)D | torchvision.models.video | r2plus1d_18, Swish activation |
| Optimizer | Adam | lr=0.0005, epochs=100 |
| Loss Function | CrossEntropyLoss | Multi-class classification |
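The optimizer and loss rows of the table translate to a one-step training sketch; the `nn.Linear(192, 100)` model is a stand-in for any of the streams, not the real network:

```python
import torch
import torch.nn as nn

model = nn.Linear(192, 100)  # stand-in for a stream's classifier head
optimizer = torch.optim.Adam(model.parameters(), lr=0.0005)  # as in the table
criterion = nn.CrossEntropyLoss()

# One training step on dummy data (batch_size=16, as in the skeleton stream)
x = torch.randn(16, 192)
y = torch.randint(0, 100, (16,))
loss = criterion(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```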