Skeleton-Based Recognition (GCN)
HRNet → Graph Convolution → Transformer
Pose Estimation (HRNet)
Extract 27 keypoints from each frame using hrnet_w48_coco_wholebody
Output: (T=150, V=27, C=3) where C = (x, y, confidence)
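As a minimal sketch of the output layout above, the snippet below packs per-frame keypoints into a fixed (T, V, C) structure. The helper name `to_fixed_length` and the zero-padding and truncation policy are assumptions for illustration; the source does not say how clips shorter or longer than 150 frames are handled.

```python
T, V, C = 150, 27, 3  # frames, keypoints, (x, y, confidence)

def to_fixed_length(frames, num_frames=T):
    """Pad with all-zero frames or truncate so the clip has exactly num_frames frames.

    frames: list of frames, each a list of V [x, y, confidence] triples.
    Zero-padding and truncation are assumptions, not the documented policy.
    """
    for frame in frames:
        if len(frame) != V or any(len(kp) != C for kp in frame):
            raise ValueError("each frame must hold V keypoints of C values")
    clipped = [[list(kp) for kp in frame] for frame in frames[:num_frames]]
    missing = num_frames - len(clipped)
    padding = [[[0.0] * C for _ in range(V)] for _ in range(missing)]
    return clipped + padding

# Hypothetical 2-frame clip with dummy keypoints.
clip = [[[0.5, 0.5, 0.9] for _ in range(V)] for _ in range(2)]
skeleton = to_fixed_length(clip)
shape = (len(skeleton), len(skeleton[0]), len(skeleton[0][0]))
print(shape)  # (150, 27, 3)
```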
Graph Construction
Build skeleton graph with adjacency matrix A representing joint connections
Progressive reduction: 44 nodes → 36 nodes → 24 nodes
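The adjacency matrix A can be illustrated with a small example. The sketch below builds A from an undirected edge list, adds self-loops, and applies symmetric normalization, D^(-1/2) (A + I) D^(-1/2). The normalization scheme is a common GCN choice assumed here, and the 5-joint chain is a toy skeleton, not the 44-node graph used by the model.

```python
import math

def build_adjacency(num_nodes, edges):
    """Return the normalized adjacency D^(-1/2) (A + I) D^(-1/2) as nested lists."""
    a = [[0.0] * num_nodes for _ in range(num_nodes)]
    for i, j in edges:
        a[i][j] = 1.0  # joints are connected in both directions
        a[j][i] = 1.0
    for i in range(num_nodes):
        a[i][i] = 1.0  # self-loop so each joint keeps its own features
    degree = [sum(row) for row in a]
    return [
        [a[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(num_nodes)]
        for i in range(num_nodes)
    ]

# Toy chain of 5 joints (hypothetical): 0-1-2-3-4
toy_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
A = build_adjacency(5, toy_edges)
for row in A:
    print([round(value, 3) for value in row])
```

Joints that are not directly connected, such as 0 and 2 in the toy chain, keep a zero entry, so one graph convolution only mixes features between neighboring joints.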
GCN Processing (3 Layers)
L1: TCN_GCN_drop(3 → 64, V=44) + DropBlock
L2: TCN_GCN_drop(64 → 96, V=36) + DropBlock
L3: TCN_GCN_drop(96 → 96, V=24) + DropBlock
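The node counts in the three layers above imply a pooling step between layers. The sketch below shows one way such a reduction could work, by averaging the features of grouped joints. The grouping produced by `pairwise_groups` is purely hypothetical (it merges trailing nodes in pairs); the source does not state which joints are merged, and `pool_nodes` is not the internals of `TCN_GCN_drop`.

```python
# (in_channels, out_channels, num_nodes) for the three layers listed above.
LAYERS = [(3, 64, 44), (64, 96, 36), (96, 96, 24)]

def pairwise_groups(num_in, num_out):
    """Hypothetical grouping: keep leading nodes, merge trailing nodes in pairs."""
    merges = num_in - num_out
    kept = num_in - 2 * merges
    if merges < 0 or kept < 0:
        raise ValueError("cannot reduce with pairwise merges only")
    singles = [[i] for i in range(kept)]
    pairs = [[kept + 2 * m, kept + 2 * m + 1] for m in range(merges)]
    return singles + pairs

def pool_nodes(features, groups):
    """Average node features within each group; one output node per group."""
    pooled = []
    for group in groups:
        dim = len(features[group[0]])
        pooled.append(
            [sum(features[i][d] for i in group) / len(group) for d in range(dim)]
        )
    return pooled

# One scalar feature per node, so the effect of averaging is easy to read.
nodes_44 = [[float(i)] for i in range(44)]
nodes_36 = pool_nodes(nodes_44, pairwise_groups(44, 36))
nodes_24 = pool_nodes(nodes_36, pairwise_groups(36, 24))
print(len(nodes_44), len(nodes_36), len(nodes_24))  # 44 36 24
```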
VSFormer Blocks (×8)
Multi-head self-attention with partition size (6, 4)
Parallel GCN + TCN branches
Channel progression: 96 → 192 → 192 → 192
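To make the partition size concrete, the sketch below splits a (T, V) grid into non-overlapping windows of 6 frames by 4 joints, inside which self-attention would be computed. This reading of "partition size (6, 4)" is an assumption, as are the grid dimensions T=150 and V=24 at this stage (the source does not say whether the temporal length is reduced before the transformer blocks).

```python
def partition(grid, size=(6, 4)):
    """Split a T x V grid into non-overlapping windows of size[0] x size[1] tokens."""
    part_t, part_v = size
    num_t, num_v = len(grid), len(grid[0])
    if num_t % part_t or num_v % part_v:
        raise ValueError("grid dimensions must be divisible by the partition size")
    windows = []
    for t0 in range(0, num_t, part_t):
        for v0 in range(0, num_v, part_v):
            windows.append(
                [
                    grid[t][v]
                    for t in range(t0, t0 + part_t)
                    for v in range(v0, v0 + part_v)
                ]
            )
    return windows

# Each token is labeled with its (frame, joint) index for inspection.
grid = [[(t, v) for v in range(24)] for t in range(150)]
windows = partition(grid)
print(len(windows), len(windows[0]))  # 150 24
```

Under these assumptions, attention operates on 24 tokens per window rather than on all 3,600 positions at once, which keeps the attention matrix small.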
Classification
Global pooling over (T, V) dimensions
Fully connected layer: 192 → 100 classes
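The classification head can be sketched as follows: features of shape (C, T, V) are pooled over the T and V dimensions, then a fully connected layer maps the 192 pooled values to 100 class scores. Average pooling is an assumption (the source only says "global pooling"), the weights are random placeholders, and T is kept small so the toy example runs quickly.

```python
import random

def global_average_pool(x):
    """x: nested list of shape (C, T, V); returns a list of C channel means."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in x
    ]

def linear(features, weight, bias):
    """Fully connected layer: weight has shape (out_features, in_features)."""
    return [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weight, bias)
    ]

random.seed(0)
C_IN, NUM_CLASSES = 192, 100
T_TOY, V_TOY = 4, 24  # toy sizes, not the real sequence length

x = [[[random.random() for _ in range(V_TOY)] for _ in range(T_TOY)] for _ in range(C_IN)]
weight = [[random.gauss(0.0, 0.01) for _ in range(C_IN)] for _ in range(NUM_CLASSES)]
bias = [0.0] * NUM_CLASSES

pooled = global_average_pool(x)        # length 192
logits = linear(pooled, weight, bias)  # length 100
predicted = max(range(NUM_CLASSES), key=logits.__getitem__)
print(len(pooled), len(logits))  # 192 100
```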
Key Components
| Component | Description | Parameters |
|---|---|---|
| HRNet Extractor | Whole-body pose estimation | 27 joints (hands, body, face) |
| Graph Reduction | Progressive joint pooling | 44 → 36 → 24 nodes |
| DropBlock | Spatial-temporal regularization | Skeleton & time dropout |
| VSFormer | Spatial-temporal transformer | 8 blocks, 32 heads |