Early Access · Research Partners

The egocentric data layer for robotics.

Vista delivers annotated first-person video datasets — structured, diverse, and ready for your training pipeline.

LeRobot-compatible · 1080p minimum · Confidence scores included

The problem

Real-world data is the bottleneck.

Lab-captured data lacks diversity

Always the same environments, the same operators. Real-world generalization requires exposure to thousands of unique scenarios — not just your lab floor.

Synthetic data has a realism gap

Models trained on simulated environments struggle in the real world. The physics, lighting, and occlusion patterns of real daily life cannot be fully replicated.

Building your own fleet is slow and expensive

It takes 5 to 20 data collectors to capture a few hours of useful footage. Months of logistics before your first training run.

The Vista Standard

Built for the data your models actually need.

01

Body & hand pose

38 keypoints via MediaPipe

Every dataset will include full skeletal tracking — 38 keypoints per frame with confidence scores, exported as normalized coordinates compatible with the LeRobot schema.
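As a sketch of what "normalized coordinates with confidence scores" means in practice, a per-frame pose record could look like the following. The field layout and helper function here are illustrative assumptions, not the shipped schema:

```python
import numpy as np

# Illustrative sketch: 38 keypoints per frame, each stored as
# (x, y, confidence), with x and y normalized to [0, 1] by image size.
NUM_KEYPOINTS = 38

def make_pose_frame(raw_pixels, width, height, confidences):
    """Convert pixel-space keypoints into a (38, 3) normalized array."""
    kp = np.asarray(raw_pixels, dtype=np.float32)      # (38, 2) in pixels
    conf = np.asarray(confidences, dtype=np.float32)   # (38,) per-keypoint scores
    normalized = kp / np.array([width, height], dtype=np.float32)
    return np.concatenate([normalized, conf[:, None]], axis=1)  # (38, 3)

# Dummy frame: every keypoint at the image center of a 1080p clip.
frame = make_pose_frame(
    raw_pixels=[[960, 540]] * NUM_KEYPOINTS,
    width=1920, height=1080,
    confidences=[0.95] * NUM_KEYPOINTS,
)
```

Normalized coordinates keep the record resolution-independent, so 1080p and higher-resolution clips share one format.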

02

Object segmentation

SAM2-powered instance masks

Pixel-level segmentation masks will be generated for every manipulated object, designed to track occlusions and hand-object interactions across the full clip.
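To make "tracking through occlusion" concrete, a minimal sketch: if each object carries a stable id and a boolean mask per frame, occlusion shows up as a drop in visible area relative to a reference frame. The helper and mask layout below are illustrative assumptions, not the product's mask format:

```python
import numpy as np

# Illustrative sketch: per-frame instance masks as boolean arrays,
# keyed by a stable object id so occlusion can be measured over time.
def visible_fraction(mask_t, mask_ref):
    """Fraction of an object's reference area still visible at frame t."""
    ref_area = mask_ref.sum()
    return float(mask_t.sum()) / float(ref_area) if ref_area else 0.0

h, w = 4, 4
full = np.zeros((h, w), dtype=bool)
full[1:3, 1:3] = True          # 4-pixel object fully visible
occluded = full.copy()
occluded[1, 1:3] = False       # a hand covers half of it

vf = visible_fraction(occluded, full)  # 0.5 of the object remains visible
```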

03

Action labeling

Timestamped + natural language

Every atomic action will be tagged with start/end timestamps and a structured natural language label, built to align with standard robotics action taxonomies.
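One possible shape for such an annotation, combining timestamps, taxonomy slots, and a free-form label. The field names here are illustrative assumptions, not the published taxonomy:

```python
# Illustrative sketch of a single timestamped action annotation.
action = {
    "action_id": 0,
    "start_s": 12.40,   # clip-relative start time, seconds
    "end_s": 15.10,     # clip-relative end time, seconds
    "verb": "pick",     # structured taxonomy slot
    "object": "mug",    # structured taxonomy slot
    "text": "pick up the mug from the counter",  # natural language label
}

def duration_s(a):
    """Length of the atomic action in seconds."""
    return round(a["end_s"] - a["start_s"], 2)
```

Keeping both structured slots and free text lets the same record drive taxonomy-based filtering and language-conditioned training.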

04

LeRobot-ready

HuggingFace format native

Datasets will ship as LeRobot-compatible HDF5 archives — designed to load directly into your training loop with no conversion scripts and no format wrangling.
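As a sketch of what "no conversion scripts" looks like on the consumer side, the snippet below writes and reads a tiny HDF5 archive with `h5py`. The group and attribute names are illustrative assumptions, not the actual archive layout:

```python
import os
import tempfile

import h5py
import numpy as np

# Illustrative layout: pose observations plus clip-level metadata in one
# HDF5 file. A tiny archive is written first so the read path is runnable.
path = os.path.join(tempfile.mkdtemp(), "clip_0001.hdf5")

with h5py.File(path, "w") as f:
    # 90 frames x 38 keypoints x (x, y, confidence)
    f.create_dataset("observations/pose",
                     data=np.zeros((90, 38, 3), dtype=np.float32))
    f.attrs["fps"] = 30

with h5py.File(path, "r") as f:
    pose = f["observations/pose"][:]   # load straight into a numpy array
    fps = int(f.attrs["fps"])
```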

Coverage

Diversity across every dimension.

Designed for diversity
Kitchen, office, workshop, outdoor — contributors film real life, not lab setups
Global by design
Diverse cultural and spatial contexts, not a single controlled environment
1080p
Minimum resolution
30fps minimum · MP4 & MOV accepted
>90%
Confidence threshold
Auto-validated · AI quality score per dataset

Pipeline

From raw footage to training-ready data.

01

Upload

Contributor submits raw footage via secure portal

02

Validation

Technical checks: resolution, fps, duration, format

03

Frame extraction

FFmpeg splits video into annotatable frames

04

Pose estimation

MediaPipe extracts 38 body & hand keypoints

05

Segmentation

SAM2 generates instance masks on all objects

06

LeRobot packaging

HDF5 export, indexed and ready to train
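The validation and frame-extraction stages above can be sketched as command builders. The flags shown are standard `ffprobe`/`ffmpeg` usage, but the paths, fps target, and function names are illustrative assumptions:

```python
# Illustrative sketch of pipeline stages 02 and 03 as shell commands.
def validation_cmd(src):
    """ffprobe reports resolution, fps, and duration for technical checks."""
    return ["ffprobe", "-v", "error", "-select_streams", "v:0",
            "-show_entries", "stream=width,height,avg_frame_rate,duration",
            "-of", "json", src]

def frame_extraction_cmd(src, out_dir, fps=30):
    """FFmpeg splits the validated clip into annotatable frames."""
    return ["ffmpeg", "-i", src, "-vf", f"fps={fps}",
            f"{out_dir}/frame_%06d.png"]

probe = validation_cmd("clip_0001.mp4")
extract = frame_extraction_cmd("clip_0001.mp4", "frames")
```

Building the argument lists separately keeps each stage testable before anything is executed against contributor footage.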

Early access

Get early access to Vista datasets.

We're onboarding our first research partners. Tell us about your use case.


We'll respond within 48 hours.

For contributors

Film your daily life. Get paid.

Join the Vista contributor network. Upload egocentric videos from your daily life and earn per validated clip.