Human demonstration datasets for imitation learning.

First-person video annotated with the layers imitation learning baselines actually consume, plus the format and the audit trail your research workflow needs.

A bimanual t-shirt folding task on a workbench in a home workshop, seen from a first-person view. Demonstrations like this are the raw input for an imitation learning dataset.

Why egocentric demonstrations for imitation learning.

Imitation learning has been quietly shifting away from teleoperation as the dominant data source. Teleop is precise and repeatable, but slow: its cost scales linearly with robot, operator and lab time. Learning from human video promises a different scaling curve: cheap to capture, abundant in the wild, and naturally diverse in environment, contributor and task. The trade-off is the embodiment gap. A human hand is not a robot end-effector, and a first-person camera is not a wrist-mounted gripper view.

Recent work such as EgoMimic, EgoZero and the Ego4D-based pretraining studies has shown that the gap is surmountable when the human demonstrations come with the right annotation depth and the right alignment to robot policies. Hand pose, contact timing, depth and action language each narrow the gap. EgoVista is built around that lesson: produce the annotation stack that closes the embodiment gap for the policy class your research targets.

Cross-embodiment learning from human video.

Several approaches now exist to translate human demonstrations into robot policies. They differ in where they place the embodiment translation, and in which signals they rely on.

Direct human-to-robot retargeting

The most explicit approach: extract 3D hand pose from the human video, then retarget it onto the kinematics of a specific robot. The robot policy is trained directly on the retargeted trajectories. EgoVista supplies the 3D hand pose; the retargeting layer is the responsibility of your team, since it is robot-specific.
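
As a rough illustration of what a retargeting layer might do for a parallel-jaw gripper, the sketch below maps one frame of 3D hand pose to a gripper target. It assumes the common 21-keypoint ordering in which index 4 is the thumb tip and index 8 is the index fingertip; the actual ordering should be checked against the dataset card. The function name, the 0.08 m maximum opening and the pinch heuristic are illustrative assumptions, not part of an EgoVista deliverable.

```python
import math

THUMB_TIP = 4   # assumed keypoint index (common 21-keypoint convention)
INDEX_TIP = 8   # assumed keypoint index

def hand_to_gripper_target(keypoints_3d, max_opening_m=0.08):
    """Map one frame of 3D hand keypoints to a parallel-jaw gripper target.

    keypoints_3d: list of 21 (x, y, z) tuples in metres, camera frame.
    Returns (position, opening): the thumb-index midpoint and the jaw
    opening, clipped to a hypothetical maximum gripper opening.
    """
    thumb = keypoints_3d[THUMB_TIP]
    index = keypoints_3d[INDEX_TIP]
    position = tuple((t + i) / 2.0 for t, i in zip(thumb, index))
    opening = min(math.dist(thumb, index), max_opening_m)
    return position, opening

# Example: a hand with thumb and index fingertip 3 cm apart.
frame = [(0.0, 0.0, 0.5)] * 21
frame[THUMB_TIP] = (0.00, 0.0, 0.4)
frame[INDEX_TIP] = (0.03, 0.0, 0.4)
position, opening = hand_to_gripper_target(frame)
```

A real retargeting layer would also transform the target from the camera frame into the robot base frame and handle wrist orientation, both of which depend on the robot and are left out here.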

Visual-only learning

A more conservative approach: train a visual policy that maps egocentric image observations to actions, without an explicit hand pose channel. Useful for VLA-style training and for policies that should rely on visual context rather than precise hand tracking. EgoVista frames at 30 FPS plus the action label channel are sufficient inputs for this setup.

Hand pose as supervision signal

A middle ground: use hand pose as an auxiliary supervision signal during pretraining, then drop it at fine-tune time on the robot. The hand pose annotation is consumed by the loss function rather than by the policy directly, which lets the encoder learn a manipulation-aware representation without locking the architecture to a specific input shape.
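
The idea can be sketched as a loss with an auxiliary term. The snippet below is a minimal, framework-free illustration: the function names and the 0.5 weight are assumptions, and a real training loop would compute the same quantities on tensors in the framework of your choice.

```python
def mse(pred, target):
    """Mean squared error over two equal-length sequences of floats."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def pretraining_loss(action_pred, action_target,
                     pose_pred=None, pose_target=None, pose_weight=0.5):
    """Action loss plus a weighted auxiliary hand-pose loss.

    The pose prediction only appears in the loss; the policy input is
    unchanged. With pose_weight=0.0 the pose arguments are ignored, which
    corresponds to fine-tuning on robot data without hand pose labels.
    """
    action_loss = mse(action_pred, action_target)
    if pose_weight == 0.0:
        return action_loss
    return action_loss + pose_weight * mse(pose_pred, pose_target)

# Pretraining on human video: both terms are active.
pretrain = pretraining_loss([1.0], [0.0], [2.0], [0.0], pose_weight=0.5)
# Fine-tuning on the robot: the auxiliary term is dropped.
finetune = pretraining_loss([1.0], [0.0], pose_weight=0.0)
```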

Contact timing as task structure signal

For long-horizon tasks, contact start and end transitions are a strong segmentation signal. Imitation learning baselines that use contact phases as subgoal markers consume the per-frame contact transitions delivered by EgoVista, with no further preprocessing.
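
To make the segmentation idea concrete, here is a small sketch that turns per-frame transitions into contact intervals that could serve as subgoal boundaries. The encoding used here ("start", "end" or None per frame) is a hypothetical one chosen for the example; the actual encoding is the one documented in the dataset card.

```python
def contact_segments(transitions):
    """Turn per-frame contact transitions into (start, end) frame intervals.

    transitions: one entry per frame, "start", "end" or None
    (hypothetical encoding). A contact still open at the end of the
    episode is closed at the last frame index.
    """
    segments = []
    open_start = None
    for frame_idx, event in enumerate(transitions):
        if event == "start" and open_start is None:
            open_start = frame_idx
        elif event == "end" and open_start is not None:
            segments.append((open_start, frame_idx))
            open_start = None
    if open_start is not None:
        segments.append((open_start, len(transitions) - 1))
    return segments

episode = [None, "start", None, "end", None, "start", None]
print(contact_segments(episode))  # [(1, 3), (5, 6)]
```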

What is in an imitation learning dataset from EgoVista.

Every dataset is shipped as a self-contained directory in either LeRobot or RLDS format. The annotation stack is the same regardless of format: the format only determines how the data is serialised. The contents:

Egocentric RGB at 30 FPS, in MP4 with frame-aligned timestamps.
2D hand pose: 21 keypoints per hand, with per-keypoint confidence.
3D hand pose: 21 keypoints per hand, lifted from 2D and per-frame depth.
Depth maps from a monocular depth model, metric where intrinsics are available.
Hand-object segmentation with five classes per frame.
Contact phase annotations: per-frame contact start and end transitions.
Action language descriptions from a vision language model, with confidence scores.
Camera intrinsics from EXIF when available, estimated otherwise.
Episode metadata with task identifier, environment label, contributor pseudonym, and QA score.
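
For readers who want to see how 2D keypoints, depth and intrinsics fit together, the sketch below applies standard pinhole back-projection to lift a pixel with metric depth into the camera frame. It is a generic illustration of the geometry, not a description of the EgoVista pipeline; the function names and the intrinsics values are assumptions.

```python
def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame.

    fx, fy: focal lengths in pixels; cx, cy: principal point in pixels.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)

def lift_hand(keypoints_2d, depths_m, fx, fy, cx, cy):
    """Lift a list of (u, v) keypoints using one depth value per keypoint."""
    return [backproject(u, v, d, fx, fy, cx, cy)
            for (u, v), d in zip(keypoints_2d, depths_m)]

# A keypoint at the principal point lands on the optical axis.
point = backproject(320, 240, 0.5, fx=600, fy=600, cx=320, cy=240)
```

When the depth map is not metric (no reliable intrinsics), the lifted coordinates are only defined up to scale, which is worth checking before mixing episodes.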

Research applications.

Several research workflows benefit from the EgoVista format and annotation depth. The same dataset can usually serve more than one of these in parallel.

Pretraining vision encoders on egocentric data. The 30 FPS frames with hand pose supervision are well suited to encoders that will later be fine-tuned on robot data.
Self-supervised representation learning. Contact transitions and action labels can serve as weakly supervised signals for contrastive or masked learning, with the QA layer providing a confidence channel.
Few-shot imitation from a single human demo. Recent work on conditional policies uses a single demonstration plus a target instruction. The action label and the structured episode metadata are the conditioning signal.
Cross-embodiment transfer. Combined with the public Open X-Embodiment corpus through the RLDS export, EgoVista datasets add a human-demonstration dimension to multi-embodiment training.

Applied use cases.

On the applied side, three families of projects fit this format well.

Skilled-trade and professional tasks. Tool handling, fitting, repairs, finishing. Egocentric video from independent professionals captures the dexterous, real-world manipulation and the specific failure modes that a policy must learn to handle.
Household robotics. Cooking, cleaning, tidying. Real homes vary in layout, lighting and clutter. Human demonstrations in real homes are the only realistic source of data that captures this variation.
Light assembly. Tool use, fastening, simple electronics. For light assembly tasks, hand pose and contact timing are the determining signals, and the human demonstration captures dexterity that is hard to replicate in teleop.

Imitation learning dataset FAQ.

Can I train a policy on human video without a robot?

Yes, that is the core promise of learning from human demonstration. The standard approach is to extract hand pose and contact transitions from the egocentric video, then learn either a visual policy that maps observations to actions, or a representation that a downstream robot policy can fine-tune. EgoVista provides the annotations that make this approach work: hand pose in 2D and 3D, contact phase, language descriptions, and depth. The training side and the robot retargeting side remain your responsibility.

How do you ensure consistency across different contributors?

Consistency comes from three places. First, every contributor receives a mission brief with explicit guidance on framing, camera position, and the target action. Second, the annotation pipeline is the same for every contributor, so the layers are produced under identical conditions. Third, the QA pass flags clips that deviate from the brief (off-screen hands, wrong objects, occluded camera) so they can be replaced before delivery. The QA report quantifies inter-contributor consistency for the audit trail.

What is the difference between Ego4D and EgoVista datasets?

Ego4D is a large public corpus of egocentric video, collected for general activity recognition, with broad coverage but limited annotation depth for robotics. EgoVista is purpose-built for robot manipulation training. Every clip is annotated with the layers a policy actually consumes, the data flow is GDPR-compliant by construction, and the dataset is delivered in LeRobot or RLDS rather than a custom format. Ego4D is well suited to pretraining and broad context; EgoVista is targeted to your specific tasks and policies.

Can you replicate environments from existing datasets?

Up to a point. We can match common task families (kitchen, assembly, light office work) and align object categories, but we cannot reproduce a specific physical environment from a public dataset because we work with our contributor network in their own spaces. If the goal is to combine an EgoVista dataset with Ego4D or with Open X-Embodiment, we ship in formats and conventions that align naturally with both, and we document the cross-corpus integration in the dataset card.

Do you provide synchronized multi-view?

Yes, when the input has multiple synchronized cameras. The annotation pipeline runs per view, so each camera gets its own pose, depth, segmentation and action labels. Synchronization information is preserved in the dataset metadata as frame timestamps and clock offsets. If your project needs multi-view but the contributor network does not natively cover the setup, we can stage controlled multi-view recording sessions with the relevant hardware.
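
On the consumer side, frame timestamps and clock offsets are enough to pair frames across views. The sketch below shows one way to do it with nearest-timestamp matching; the function name, the sign convention of the offset and the default tolerance (half a frame period at 30 FPS) are assumptions for the example.

```python
import bisect

def align_views(ref_ts, other_ts, other_offset_s, tolerance_s=1 / 60):
    """Pair each reference frame with the nearest frame of another view.

    ref_ts, other_ts: sorted frame timestamps in seconds.
    other_offset_s: value added to other_ts to express it on the
    reference clock (assumed sign convention).
    Returns a list of (ref_index, other_index), with other_index set to
    None when no frame falls within the tolerance.
    """
    corrected = [t + other_offset_s for t in other_ts]
    pairs = []
    for i, t in enumerate(ref_ts):
        j = bisect.bisect_left(corrected, t)
        # Only the neighbours on either side of the insertion point can be nearest.
        candidates = [k for k in (j - 1, j) if 0 <= k < len(corrected)]
        best = min(candidates, key=lambda k: abs(corrected[k] - t), default=None)
        if best is not None and abs(corrected[best] - t) > tolerance_s:
            best = None
        pairs.append((i, best))
    return pairs

pairs = align_views([0.0, 0.5, 1.0], [10.0, 10.5, 12.0], other_offset_s=-10.0)
```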

How do you handle hand occlusions?

Hand occlusions are part of natural egocentric video. The hand pose detector flags low-confidence keypoints, and the QA report quantifies how many frames per episode fall under the occlusion threshold. Severely occluded episodes are flagged for replacement. The dataset card documents the occlusion distribution so a training pipeline can either include occluded frames as is, mask them with the confidence signal, or filter them out using the QA tags.
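
The masking and filtering options can be sketched with the per-keypoint confidence alone. The thresholds and function names below are illustrative assumptions; sensible values depend on the occlusion distribution reported in the dataset card.

```python
def visible_fraction(confidences, threshold=0.5):
    """Fraction of keypoints in one frame at or above the confidence threshold."""
    return sum(c >= threshold for c in confidences) / len(confidences)

def loss_mask(confidences, threshold=0.5):
    """Per-keypoint 0/1 weights for masking low-confidence keypoints in a pose loss."""
    return [1.0 if c >= threshold else 0.0 for c in confidences]

def keep_frame(confidences, threshold=0.5, min_visible=0.7):
    """Filter rule: keep a frame only if enough keypoints are confidently tracked."""
    return visible_fraction(confidences, threshold) >= min_visible

# 15 of 21 keypoints are confidently tracked in this frame.
frame_conf = [0.9] * 15 + [0.1] * 6
mask = loss_mask(frame_conf)      # option 1: keep the frame, mask the loss
keep = keep_frame(frame_conf)     # option 2: filter whole frames
```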

Discuss your imitation learning project.

A short call is enough to scope the task, the volume, and the format. From there, a pilot batch of 10 to 20 episodes typically lands in two to three weeks. For related material, see the manipulation policy use case, the LeRobot format details, the RLDS format details, or the full product page.