Evaluation API

The evals module provides comprehensive tools for evaluating pose estimation models across multiple frameworks, calculating standard metrics, and generating comparison reports.

Quick Reference

from evals import (
    NMECalculator,
    PCKCalculator,
    OKSCalculator,
    ModelManager,
    ModelConfig,
    CrossFrameworkEvaluator,
    EvalVisualizer
)

Main Classes

NMECalculator

Calculate Normalized Mean Error with various normalization strategies.

calculator = NMECalculator(
    normalize_by='bbox',  # 'bbox', 'interocular', or 'head_size'
    use_visible_only=True
)

# Calculate NME
result = calculator.calculate(
    predictions,  # Shape: (N, K, 2) or (N, K, 3)
    ground_truth,  # Shape: (N, K, 2) or (N, K, 3)
    metadata={'bbox': [x, y, w, h]}
)

print(f"NME: {result['nme']:.3f}")
print(f"Per-keypoint NME: {result['per_keypoint_nme']}")

PCKCalculator

Calculate Percentage of Correct Keypoints at multiple thresholds.

calculator = PCKCalculator(
    thresholds=[0.05, 0.1, 0.2],  # Multiple thresholds
    normalize_by='bbox'
)

result = calculator.calculate(predictions, ground_truth, metadata)

# Access PCK at different thresholds
print(f"PCK@0.1: {result['pck'][0.1]:.2%}")
print(f"AUC: {result['auc']:.3f}")

OKSCalculator

COCO-style Object Keypoint Similarity metric.

calculator = OKSCalculator(
    sigmas=None,  # Use default or provide custom
    use_area=True
)

result = calculator.calculate(
    predictions,
    ground_truth,
    areas  # Object areas for normalization
)

print(f"Mean OKS: {result['mean_oks']:.3f}")

ModelManager

Unified interface for loading models from different frameworks.

manager = ModelManager()

# Load model from config
config = ModelConfig(
    name='pf_hrnet',
    framework='mmpose',
    config_path='configs/hrnet.py',
    checkpoint_path='models/hrnet.pth',
    device='cuda:0'
)

model = manager.load_model(config)

# Run inference
results = manager.run_inference(
    model,
    test_images,
    framework='mmpose'
)
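
One manager instance can drive several backends in a single session, with each config's framework field selecting the backend for inference. A usage sketch built from the calls shown above; the second config's name, paths, and framework string are illustrative placeholders, and config fields are assumed to be readable as attributes:

configs = [
    ModelConfig(name='pf_hrnet', framework='mmpose',
                config_path='configs/hrnet.py',
                checkpoint_path='models/hrnet.pth', device='cuda:0'),
    ModelConfig(name='pf_dlc', framework='deeplabcut',
                config_path='configs/dlc_config.yaml',
                checkpoint_path='models/dlc_snapshot.pt', device='cuda:0'),
]

all_results = {}
for cfg in configs:
    model = manager.load_model(cfg)
    all_results[cfg.name] = manager.run_inference(
        model, test_images, framework=cfg.framework
    )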

CrossFrameworkEvaluator

Compare models across different frameworks.

evaluator = CrossFrameworkEvaluator(
    test_data='test_annotations.json',
    output_dir='evaluation_results/'
)

# Add models for comparison
evaluator.add_model('configs/mmpose_model.yaml')
evaluator.add_model('configs/dlc_model.yaml')
evaluator.add_model('configs/sleap_model.yaml')

# Evaluate all models
evaluator.evaluate_all(metrics=['nme', 'pck', 'oks'])

# Generate comparison report
df = evaluator.generate_comparison_report()
evaluator.plot_comparison(metric='nme')
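
Assuming generate_comparison_report returns a pandas DataFrame (suggested by the df naming and the to_string() usage further down this page), the comparison can be persisted alongside the plots:

# Save the comparison table next to the generated figures.
df.to_csv('evaluation_results/model_comparison.csv', index=False)
print(df.round(3).to_string(index=False))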

Common Usage Patterns

Basic Model Evaluation

from evals import NMECalculator, PCKCalculator

# Load predictions and ground truth
# (load_predictions / load_annotations are placeholder loaders; substitute your own I/O)
predictions = load_predictions('predictions.json')
ground_truth = load_annotations('annotations.json')

# Calculate multiple metrics
nme_calc = NMECalculator(normalize_by='bbox')
nme = nme_calc.calculate(predictions, ground_truth, metadata)

pck_calc = PCKCalculator(thresholds=[0.1, 0.2])
pck = pck_calc.calculate(predictions, ground_truth, metadata)

print(f"NME: {nme['nme']:.3f}")
print(f"PCK@0.2: {pck['pck'][0.2]:.2%}")

Per-Genus Evaluation

from evals import GenusEvaluator

evaluator = GenusEvaluator(
    predictions_path='predictions.json',
    annotations_path='annotations.json'
)

# Evaluate per genus
evaluator.evaluate_per_genus(metrics=['nme', 'pck'])

# Generate report
df = evaluator.generate_report()
print(df.to_string())

# Plot comparison
fig = evaluator.plot_genus_comparison(metric='nme')
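
Assuming the report is a pandas DataFrame and plot_genus_comparison returns a Matplotlib figure (as the df and fig names suggest), both can be saved for later inspection:

# Persist the per-genus report and the comparison figure.
df.to_csv('genus_metrics.csv', index=False)
fig.savefig('genus_nme_comparison.png', dpi=200, bbox_inches='tight')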

Visualization

from evals import EvalVisualizer

viz = EvalVisualizer()

# Plot training curves
history = {
    'loss': [0.5, 0.4, 0.3],
    'val_loss': [0.6, 0.5, 0.4],
    'nme': [0.1, 0.08, 0.06]
}
viz.plot_training_curves(history, metrics=['loss', 'nme'])

# Visualize predictions
viz.plot_predictions(
    images=test_images,
    predictions=model_predictions,
    ground_truth=annotations,
    max_images=9
)

# Error distribution
viz.plot_error_distribution(
    errors,
    keypoint_names=KEYPOINT_NAMES
)

CLI Scripts

Detection Model Comparison

python compare_det_models.py \
    --coco test_annotations.json \
    --model-config cascade_rcnn.py \
    --model-checkpoint cascade_rcnn.pth \
    --output detection_eval.json

Pose Model Comparison

python compare_pose_models.py \
    --coco test_annotations.json \
    --pose-config hrnet.py \
    --pose-checkpoint hrnet.pth \
    --output pose_eval.json

Genus-Specific Evaluation

python eval_genera.py \
    --predictions predictions.json \
    --annotations annotations.json \
    --output genus_metrics.json

Metrics Overview

| Metric | Description                      | Range | Lower is Better |
|--------|----------------------------------|-------|-----------------|
| NME    | Normalized Mean Error            | 0–∞   | Yes             |
| PCK    | Percentage of Correct Keypoints  | 0–1   | No              |
| OKS    | Object Keypoint Similarity       | 0–1   | No              |
| mAP    | Mean Average Precision           | 0–1   | No              |
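
As a point of reference, bbox-normalized NME divides the mean Euclidean keypoint error by a bounding-box scale. A minimal NumPy sketch of that idea, using sqrt(w*h) as the scale (an illustration, not the module's implementation):

import numpy as np

def nme_bbox(pred, gt, bbox):
    """pred, gt: (K, 2) keypoint arrays; bbox: (x, y, w, h)."""
    _, _, w, h = bbox
    scale = np.sqrt(w * h)                       # one common bbox normalizer
    errors = np.linalg.norm(pred - gt, axis=1)   # per-keypoint Euclidean error
    return errors.mean() / scale

pred = np.array([[10.0, 12.0], [30.0, 33.0]])
gt = np.array([[11.0, 12.0], [29.0, 35.0]])
print(f"NME: {nme_bbox(pred, gt, (0, 0, 100, 100)):.4f}")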

Framework Support

Models from the following frameworks are supported; detection, pose estimation, training, and evaluation coverage varies by framework:

  • MMPose
  • DeepLabCut
  • SLEAP
  • YOLO

Configuration

Model Configuration

# model_config.yaml
name: pf_hrnet_68kpt
framework: mmpose
config_path: configs/hrnet_w32_primateface.py
checkpoint_path: checkpoints/pf_hrnet_best.pth
device: cuda:0
additional_params:
  bbox_thr: 0.3
  kpt_thr: 0.3

Load and Use

from evals import ModelConfig

# Load from file
config = ModelConfig.from_file('model_config.yaml')

# Or create programmatically
config = ModelConfig(
    name='my_model',
    framework='mmpose',
    config_path='config.py',
    checkpoint_path='model.pth'
)

Advanced Features

Temporal Consistency

from evals import TemporalConsistencyEvaluator

evaluator = TemporalConsistencyEvaluator()
result = evaluator.evaluate_video(
    predictions,  # Shape: (T, K, 2)
    fps=30.0
)

print(f"Jitter: {result['mean_jitter']:.2f}")
print(f"Smoothness: {result['smooth_score']:.2f}")
print(f"Problematic frames: {result['problematic_frames']}")

Multi-Scale Evaluation

from evals import MultiScaleEvaluator

evaluator = MultiScaleEvaluator(scales=[0.5, 1.0, 1.5])
results = evaluator.evaluate(
    model,
    test_images,
    ground_truth
)

for scale, metrics in results.items():
    print(f"Scale {scale}: NME={metrics['nme']:.3f}")

Detailed Documentation

For comprehensive technical documentation covering:

  • Full metric implementations
  • Cross-framework comparison details
  • Advanced evaluation pipelines
  • Performance optimization
  • Testing strategies

See: evals/eval_docs.md

  • Demos - Run inference with models
  • GUI - Visualize evaluation results
  • Converter - Convert between formats

Best Practices

  1. Use configuration files instead of hardcoded paths
  2. Store results in standardized formats such as COCO JSON (see the sketch after this list)
  3. Batch inputs for efficient evaluation
  4. Report multiple metrics for a comprehensive assessment
  5. Use cross-validation for robust results
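
For item 2, the standard COCO keypoint-results format stores one entry per detection, with keypoints flattened into repeated (x, y, confidence) triples:

import json

# One detection in COCO keypoint-results format (toy values).
result_entry = {
    "image_id": 42,
    "category_id": 1,
    "keypoints": [210.5, 118.0, 1.0,   # x, y, confidence for keypoint 1
                  225.0, 120.5, 0.9],  # ...repeated for every keypoint
    "score": 0.87,
}

with open("predictions.json", "w") as f:
    json.dump([result_entry], f)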

Troubleshooting

Common Issues

  1. Framework conflicts: Use separate environments
  2. Memory errors: Reduce batch size
  3. Metric discrepancies: Check normalization methods
  4. Path errors: Use absolute paths in configs
