Concepts¶
Key concepts and theory behind PrimateFace.
Facial Landmarks¶
What are Facial Landmarks?¶
Facial landmarks are specific points on the face that correspond to anatomical features. They enable: - Precise face alignment - Expression analysis - Individual identification - Behavioral studies
68-Point vs 48-Point Systems¶
68-Point System (Human-Oriented)¶
Originally designed for human faces with detailed coverage: - Jaw line: 17 points - Eyebrows: 10 points (5 each) - Eyes: 12 points (6 each) - Nose: 9 points - Mouth: 20 points
48-Point System (Primate-Optimized)¶
Adapted for non-human primates: - Reduced jaw points: Better fits primate facial structure - Simplified eyebrows: Accounts for fur coverage - Core features preserved: Eyes, nose, mouth remain detailed - Species-agnostic: Works across primate genera
Why Convert Between Systems?¶
- Dataset Compatibility: Leverage human face datasets
- Cross-Species Transfer: Apply human-trained models to primates
- Tool Interoperability: Use various annotation tools
DINOv2 Features¶
Self-Supervised Vision Transformers¶
DINOv2 is a vision transformer trained without labels that learns: - Universal visual features: Not task-specific - Semantic understanding: Groups similar content - Fine-grained details: Captures subtle differences
Why DINOv2 for Primates?¶
- No Species Bias: Not trained specifically on humans
- Robust Features: Works across lighting, pose, species
- Zero-Shot Transfer: No primate-specific training needed
Feature Applications¶
- Smart Selection: Choose diverse training samples
- Quality Assessment: Identify good/bad images
- Clustering: Group by species or individuals
- Anomaly Detection: Find unusual cases
Cascade R-CNN Architecture¶
Multi-Stage Detection¶
Cascade R-CNN improves detection through: 1. Stage 1: Rough face localization 2. Stage 2: Refined bounding boxes 3. Stage 3: Final precise detection
Why Cascade for Primates?¶
- Handles Occlusion: Partial faces in natural settings
- Multi-Scale: Detects faces at various distances
- High Precision: Reduces false positives
HRNet Architecture¶
High-Resolution Networks¶
HRNet maintains high-resolution representations throughout: - Parallel branches: Multiple resolution streams - Information exchange: Cross-resolution fusion - Detail preservation: Fine landmark localization
Advantages for Landmarks¶
- Precise localization: Sub-pixel accuracy
- Robust to blur: Handles motion and focus issues
- Efficient: Good accuracy/speed trade-off
Semi-Supervised Learning¶
Pseudo-Labeling Strategy¶
Combine model predictions with human verification: 1. Initial Prediction: Model generates candidates 2. Human Review: Quick verification/correction 3. Iterative Improvement: Retrain with new data
Benefits¶
- Faster Annotation: 5-10x speedup
- Consistency: Reduces human annotator variance
- Scalability: Handle large datasets
Cross-Framework Training¶
Why Multiple Frameworks?¶
Different frameworks excel at different tasks: - MMPose: Best accuracy, production-ready - DeepLabCut: Behavioral analysis, tracking - SLEAP: Multi-animal, social interactions - YOLO: Real-time, edge deployment
Unified Data Format¶
COCO format as universal exchange:
{
"images": [...],
"annotations": [{
"keypoints": [x1,y1,v1, x2,y2,v2, ...],
"bbox": [x, y, width, height]
}],
"categories": [...]
}
Evaluation Metrics¶
NME (Normalized Mean Error)¶
Distance error normalized by face size:
- Good: < 5% - Acceptable: 5-10% - Poor: > 10%PCK (Percentage of Correct Keypoints)¶
Percentage within threshold distance:
mAP (Mean Average Precision)¶
Detection quality across IoU thresholds: - mAP@50: 50% overlap required - mAP@75: 75% overlap required - mAP@50-95: Average across range
Species-Specific Considerations¶
Anatomical Variations¶
Different primates require adjustments: - Great Apes: Pronounced brow ridges - Prosimians: Large eyes, small faces - New World Monkeys: Varied nose structures
Behavioral Context¶
Consider natural behaviors: - Arboreal species: Often partially occluded - Social species: Multiple faces in frame - Nocturnal species: Low-light conditions