Real-Time Sign Language Translator
Model Description
A production-ready deep learning system for American Sign Language (ASL) recognition achieving 99.60% accuracy with real-time performance (25-30 FPS) on consumer hardware. The system combines ResNet18 with MediaPipe hand detection for robust gesture classification.
Key Features:
- 99.60% test accuracy on 26 ASL letter classes (A-Z)
- Real-time inference at 25-30 FPS on consumer GPU (NVIDIA MX450)
- MediaPipe integration for hand isolation and background removal
- Transfer learning from ImageNet (convergence in 9 epochs)
- Production-ready with comprehensive documentation
Model Architecture
Webcam Input → MediaPipe Hand Detection (~15 ms)
        ↓
Hand Region Extraction (21 landmarks)
        ↓
Preprocessing (224×224, Normalize)
        ↓
ResNet18 Model (~13 ms)
        ↓
Softmax Classification (26 classes)
        ↓
Top-3 Predictions + Confidence
ResNet18 + Custom Classification Head:
Input: 224×224×3 RGB Image
        ↓
ResNet18 Backbone (Pretrained on ImageNet)
├── Conv1: 7×7, 64 filters, stride=2
├── Layer1: 2× Residual Blocks (64 filters)
├── Layer2: 2× Residual Blocks (128 filters)
├── Layer3: 2× Residual Blocks (256 filters)
├── Layer4: 2× Residual Blocks (512 filters)
└── Global Average Pooling → 512-D features
        ↓
Custom Classification Head
├── Dropout(0.5)
├── Linear(512 → 512)
├── ReLU()
├── Dropout(0.3)
└── Linear(512 → 26)
        ↓
Output: 26 class logits (A-Z)
Model Statistics:
- Total Parameters: 11,452,506
- Model Size: 44 MB
- Trainable Parameters: 100% (full fine-tuning)
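The parameter count above can be sanity-checked with a few lines of torchvision code. The snippet below is a minimal sketch that rebuilds the backbone-plus-head stack from the diagram (it does not load the released weights):

# Minimal sketch: reproduce the parameter count reported above.
import torch.nn as nn
from torchvision import models

backbone = models.resnet18(weights=None)
backbone.fc = nn.Sequential(
    nn.Dropout(0.5),
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(512, 26),
)
total = sum(p.numel() for p in backbone.parameters())
print(f"Total parameters: {total:,}")  # 11,452,506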
Performance
Test Set Results (13,050 images)
| Metric | Value |
|---|---|
| Overall Accuracy | 99.60% |
| Precision (Macro Avg) | 99.59% |
| Recall (Macro Avg) | 99.57% |
| F1-Score (Macro Avg) | 99.58% |
| Correct Predictions | 12,998 / 13,050 |
| Misclassifications | 52 |
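For reference, the macro-averaged metrics above are the kind produced by scikit-learn's classification_report. The following is a sketch of such an evaluation loop; it assumes `model` and `device` as set up in the Usage section below, plus a hypothetical `test_loader` yielding (image, label) batches over the test split.

# Evaluation sketch (test_loader is assumed, not provided by this repo)
import torch
from sklearn.metrics import accuracy_score, classification_report

model.eval()
all_preds, all_labels = [], []
with torch.no_grad():
    for images, labels in test_loader:
        logits = model(images.to(device))
        all_preds.extend(logits.argmax(dim=1).cpu().tolist())
        all_labels.extend(labels.tolist())

print(f"Accuracy: {accuracy_score(all_labels, all_preds):.4f}")
print(classification_report(all_labels, all_preds,
                            target_names=[chr(65 + i) for i in range(26)],
                            digits=4))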
Real-Time Inference
| Configuration | FPS | Latency | Accuracy |
|---|---|---|---|
| Without Hand Detection | 76.4 | 13 ms | Poor (~10%) |
| With MediaPipe | 25-30 | 30-35 ms | Excellent (~95%) |
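The model-only latency (~13 ms in the table) can be approximated with a simple timing loop such as the sketch below; `model` and `device` are assumed to be set up as in the Usage section, and the numbers will vary with hardware.

# Rough model-only latency/FPS measurement (excludes MediaPipe and webcam I/O)
import time
import torch

dummy = torch.randn(1, 3, 224, 224, device=device)
with torch.no_grad():
    for _ in range(10):                  # warm-up iterations
        model(dummy)
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(100):
        model(dummy)
    if device.type == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

print(f"Model-only latency: {elapsed / 100 * 1000:.1f} ms ({100 / elapsed:.1f} FPS)")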
Per-Class Performance
Perfect Classes (100% F1-Score): C, D, F, L, Q, W, Y, Z
Most Challenging Classes (F1-score):
- M: 98.08% (confused with N - visually similar)
- U: 98.79% (confused with V - finger orientation)
- A: 99.12% (confused with E - thumb position)
Training Details
- Hardware: NVIDIA GeForce MX450 (2GB VRAM)
- Framework: PyTorch 2.7.1 + CUDA 12.9
- Epochs: 9 (manually stopped, best at epoch 8)
- Optimizer: Adam (lr=0.001→0.0005, weight_decay=1e-4)
- Scheduler: ReduceLROnPlateau (factor=0.5, patience=3)
- Batch Size: 32
- Training Time: ~1 hour
- Best Validation Accuracy: 99.72% (epoch 8)
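A minimal sketch of this training setup is shown below. It assumes `model` and a `train_dataset` are already defined; the cross-entropy loss and the quantity monitored by the scheduler are reasonable assumptions, not confirmed details.

# Training configuration sketch matching the hyperparameters listed above
import torch
from torch.utils.data import DataLoader

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)   # train_dataset assumed
criterion = torch.nn.CrossEntropyLoss()                                 # assumed loss for 26-class classification
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3                       # assumes validation loss is monitored
)

# Per epoch: train, validate, then step the scheduler on the monitored metric
# scheduler.step(val_loss)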
Training Progress
| Epoch | Train Loss | Train Acc | Val Loss | Val Acc | LR |
|---|---|---|---|---|---|
| 1 | 0.3521 | 89.45% | 0.0893 | 97.21% | 0.001 |
| 2 | 0.0745 | 97.68% | 0.0421 | 98.76% | 0.001 |
| 3 | 0.0412 | 98.71% | 0.0298 | 99.12% | 0.001 |
| 4 | 0.0289 | 99.08% | 0.0234 | 99.34% | 0.001 |
| 5 | 0.0221 | 99.31% | 0.0198 | 99.51% | 0.0005 |
| 6 | 0.0187 | 99.42% | 0.0176 | 99.58% | 0.0005 |
| 7 | 0.0165 | 99.53% | 0.0162 | 99.64% | 0.0005 |
| 8 | 0.0152 | 99.61% | 0.0151 | 99.72% | 0.0005 |
| 9 | 0.0143 | 99.67% | 0.0158 | 99.69% | 0.0005 |
Usage
Installation
pip install torch torchvision mediapipe opencv-python numpy pillow huggingface_hub
Download Model
from huggingface_hub import hf_hub_download
import torch
# Download model
model_path = hf_hub_download(
    repo_id="huzaifanasirrr/realtime-sign-language-translator",
    filename="best_model.pth"
)

# Download MediaPipe hand detector
mediapipe_path = hf_hub_download(
    repo_id="huzaifanasirrr/realtime-sign-language-translator",
    filename="hand_landmarker.task"
)
Load Model
import torch
import torch.nn as nn
from torchvision import models
# Define the model architecture
class SignLanguageModel(nn.Module):
    def __init__(self, num_classes=26, pretrained=False):
        super().__init__()
        # Use the modern torchvision weights API rather than the deprecated `pretrained=` flag
        weights = models.ResNet18_Weights.DEFAULT if pretrained else None
        self.model = models.resnet18(weights=weights)
        self.model.fc = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(512, num_classes)
        )

    def forward(self, x):
        return self.model(x)

# Load the trained checkpoint
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SignLanguageModel(num_classes=26)
checkpoint = torch.load(model_path, map_location=device)  # pass weights_only=False if a newer PyTorch rejects the checkpoint
model.load_state_dict(checkpoint['model_state_dict'])
model.to(device)
model.eval()

print("Model loaded successfully!")
print(f"Validation Accuracy: {checkpoint['val_acc']:.2f}%")
Real-Time Inference with MediaPipe
import cv2
import mediapipe as mp
import torch
from mediapipe.tasks import python
from mediapipe.tasks.python import vision
from torchvision import transforms
from PIL import Image

# Set up the MediaPipe hand detector
base_options = python.BaseOptions(model_asset_path=mediapipe_path)
options = vision.HandLandmarkerOptions(
    base_options=base_options,
    running_mode=vision.RunningMode.VIDEO,
    num_hands=1
)
hands = vision.HandLandmarker.create_from_options(options)

# Preprocessing transform (matches training-time normalization)
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )
])

# Class mapping
idx_to_class = {i: chr(65 + i) for i in range(26)}  # A-Z

# Capture from webcam
cap = cv2.VideoCapture(0)
timestamp_ms = 0

while True:
    ret, frame = cap.read()
    if not ret:
        break

    # Detect hand
    frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=frame_rgb)
    results = hands.detect_for_video(mp_image, timestamp_ms)
    timestamp_ms += 33  # ~30 FPS

    if results.hand_landmarks:
        landmarks = results.hand_landmarks[0]

        # Extract hand region with a 40-pixel margin around the landmarks
        h, w = frame.shape[:2]
        x_coords = [lm.x * w for lm in landmarks]
        y_coords = [lm.y * h for lm in landmarks]
        x_min = max(0, int(min(x_coords)) - 40)
        y_min = max(0, int(min(y_coords)) - 40)
        x_max = min(w, int(max(x_coords)) + 40)
        y_max = min(h, int(max(y_coords)) + 40)
        hand_crop = frame[y_min:y_max, x_min:x_max]

        if hand_crop.size > 0:
            # Preprocess and predict
            pil_image = Image.fromarray(cv2.cvtColor(hand_crop, cv2.COLOR_BGR2RGB))
            tensor = preprocess(pil_image).unsqueeze(0).to(device)

            with torch.no_grad():
                outputs = model(tensor)
                probabilities = torch.softmax(outputs, dim=1)
                top_prob, top_idx = torch.max(probabilities, dim=1)

            predicted_class = idx_to_class[top_idx.item()]
            confidence = top_prob.item() * 100

            # Display prediction
            cv2.rectangle(frame, (x_min, y_min), (x_max, y_max), (0, 255, 0), 2)
            cv2.putText(frame, f"{predicted_class}: {confidence:.1f}%",
                        (x_min, y_min - 10), cv2.FONT_HERSHEY_SIMPLEX,
                        0.9, (0, 255, 0), 2)

    cv2.imshow('ASL Recognition', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()
Dataset
ASL Alphabet Dataset (Kaggle)
- Source: grassknoted/asl-alphabet
- Total Images: 87,000
- Classes: 26 letters (A-Z)
- Format: RGB, 200×200 pixels
- Background: Plain (controlled environment)
Data Split
Total: 87,000 images
├── Training:   60,900 (70%) → 2,342 per class
├── Validation: 13,050 (15%) → 502 per class
└── Test:       13,050 (15%) → 502 per class
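One way to reproduce a 70/15/15 split of this size is sketched below; the dataset folder path and random seed are illustrative, and the exact split used for the reported numbers is not specified here.

# Sketch of a 70/15/15 split over the 87,000-image dataset
import torch
from torch.utils.data import random_split
from torchvision import datasets

full_dataset = datasets.ImageFolder("asl_alphabet_train")   # path is illustrative
n_total = len(full_dataset)              # 87,000
n_train = int(0.70 * n_total)            # 60,900
n_val = int(0.15 * n_total)              # 13,050
n_test = n_total - n_train - n_val       # 13,050
train_set, val_set, test_set = random_split(
    full_dataset, [n_train, n_val, n_test],
    generator=torch.Generator().manual_seed(42)              # seed is an assumption
)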
Data Augmentation
Training Augmentation:
- Random Rotation (±15°)
- Random Affine Translation (10%)
- Random Horizontal Flip (30%)
- Color Jitter (±20% brightness, contrast, saturation)
- ImageNet Normalization
Impact: +2.52% validation accuracy improvement
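A sketch of this augmentation pipeline using torchvision transforms is shown below; the transform order and the Resize step are assumptions, while the listed magnitudes match the values above.

# Training-time augmentation sketch mirroring the list above
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomRotation(15),
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),
    transforms.RandomHorizontalFlip(p=0.3),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])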
Model Files
- best_model.pth - PyTorch checkpoint (44 MB) - epoch 8, 99.72% val acc
- hand_landmarker.task - MediaPipe hand detection model (7.8 MB)
- config.json - training configuration
- class_mapping.json - A-Z class mappings
- training_history.png - training curves visualization
- confusion_matrix.png - test set confusion matrix
- sample_predictions.png - sample predictions with confidence
- classification_report.txt - detailed per-class metrics
- requirements.txt - Python dependencies
Limitations
- Static Gestures Only: Recognizes only static alphabet letters (A-Z), no dynamic gestures
- Single Hand: Processes only one hand at a time, no two-handed signs
- Lighting Sensitivity: Performance degrades in very low light conditions
- Hand Orientation: Expects specific orientations matching training data
- Dataset Bias: Trained on single signer's hands (may not generalize to all hand sizes/skin tones)
- No Word Formation: Letter-by-letter recognition only, no automatic word construction
Citation
If you use this model in your research, please cite:
@software{nasir2025signlanguage,
  author    = {Nasir, Huzaifa},
  title     = {Real-Time Sign Language Translator: ResNet18 + MediaPipe for ASL Recognition},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/huzaifanasirrr/realtime-sign-language-translator},
  note      = {GitHub: https://github.com/Huzaifanasir95/RealTime-Sign-Language-Translator}
}
Academic Report
A comprehensive academic report in Springer LNCS format is available in the GitHub repository.
Report Highlights:
- Title: Real-Time Sign Language Recognition using Deep Convolutional Neural Networks and MediaPipe Hand Detection
- Institution: National University of Computer and Emerging Sciences
- Format: Springer Lecture Notes in Computer Science (LNCS)
- Length: 776 lines, comprehensive analysis with 20+ citations
Author
Huzaifa Nasir
National University of Computer and Emerging Sciences (FAST-NUCES), Islamabad, Pakistan
Email: nasirhuzaifa95@gmail.com
Portfolio
GitHub Repository
License
MIT License - See LICENSE file for details.
Acknowledgments
- Akash (grassknoted) for the ASL Alphabet dataset on Kaggle
- Google Research for MediaPipe hand tracking framework
- PyTorch Team for the deep learning framework
- ResNet18 pretrained on ImageNet (He et al., 2016)
- FAST-NUCES for computational resources
Research conducted at FAST-NUCES Islamabad. Inspired by the need for accessible communication tools for the deaf and hard-of-hearing community.
Made with ❤️ for accessibility and inclusion
Breaking communication barriers, one sign at a time