LeRobot v3.0 dataset format support.

Every EgoVista dataset ships in the LeRobot v3.0 format, ready to load with the Hugging Face LeRobot stack. No conversion script, no schema gymnastics, no edge-case bugs in the data path.

What LeRobot brings to robotics teams.

LeRobot became the de facto reference dataset format for imitation learning and behavior cloning in 2025 because it solved a very practical problem: every robotics team was reinventing the same data path. Frames, hand pose, actions, episode boundaries, train and validation splits, dataset cards, statistics. With LeRobot, all of that lives in a single versioned directory structure that the Hugging Face stack understands natively.

For an ML team, receiving a dataset already in LeRobot v3.0 means skipping two to four weeks of conversion work. The training scripts shipped with the LeRobot repository (ACT, Diffusion Policy, VQ-BeT) load it directly. The standard visualization tools render it without modification. Upload to the Hugging Face Hub is a single command. For an ML lead trying to ship a policy in a quarter, that translates into more time iterating on architecture and less time fighting IO bugs.

LeRobot v3.0 dataset structure.

A LeRobot v3.0 dataset is a versioned directory with a small set of well-defined files. The structure was designed to scale from a few episodes for a quick prototype to hundreds of thousands of episodes for a production policy. EgoVista produces every dataset in this exact layout, so you can point the LeRobot loader at the root and start training.

Figure: Standard LeRobot v3.0 dataset directory structure.

The directory layout in plain terms:

data/chunk-000/episode_000000.parquet holds the per-frame time series for one episode. Observations, actions, timestamps and any auxiliary signals are columns in this parquet. Episodes are grouped into chunks so the dataset can be streamed and sharded across workers.
videos/chunk-000/<camera>/episode_000000.mp4 holds the raw video frames per camera, aligned with the parquet timestamps. For egocentric datasets there is typically a single camera, but multi-view setups are supported.
meta/info.json is the dataset header: schema version, feature dictionary, frame rate, list of cameras, total episode count and build provenance.
meta/episodes.jsonl lists every episode with its start and end frame indices, the chunk it belongs to, and any per-episode tags (task identifier, contributor pseudonym, environment label).
meta/stats.json contains per-feature normalization statistics (mean, std, min, max) computed over the full dataset. The standard data loaders use this for input normalization at train time.
meta/tasks.jsonl describes the tasks present in the dataset with stable identifiers, natural language descriptions, and language instruction templates for VLA training.
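
As a rough illustration of the naming pattern above, the sketch below builds the parquet and video paths for one episode. The helper name `episode_paths` and the default of 1000 episodes per chunk are assumptions made for this example; the real chunk size of a given dataset should be read from its metadata.

```python
from pathlib import Path


def episode_paths(root, episode_index, camera, episodes_per_chunk=1000):
    """Return (parquet_path, video_path) for one episode.

    Follows the naming pattern described above. `episodes_per_chunk` is an
    assumed value, not something this page specifies.
    """
    chunk_dir = f"chunk-{episode_index // episodes_per_chunk:03d}"
    stem = f"episode_{episode_index:06d}"
    root = Path(root)
    parquet_path = root / "data" / chunk_dir / f"{stem}.parquet"
    video_path = root / "videos" / chunk_dir / camera / f"{stem}.mp4"
    return parquet_path, video_path


parquet_path, video_path = episode_paths("my_dataset", 0, "head_cam")
print(parquet_path.as_posix())  # my_dataset/data/chunk-000/episode_000000.parquet
print(video_path.as_posix())    # my_dataset/videos/chunk-000/head_cam/episode_000000.mp4
```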

Schema version is part of the dataset, so a re-export under a newer LeRobot release can be requested without re-collecting any data. Schema changes are documented in the dataset card so audits stay reproducible.
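
Because the schema version travels with the dataset, a training script can refuse to run on an unexpected export. The following is a minimal sketch of such a guard using only the standard library. The key name `codebase_version` and the value `"v3.0"` are assumptions here; check the `meta/info.json` of your delivery for the exact field.

```python
import json
from pathlib import Path


def read_schema_version(dataset_root, key="codebase_version"):
    """Return the schema version string stored in meta/info.json, or None."""
    info_path = Path(dataset_root) / "meta" / "info.json"
    with info_path.open(encoding="utf-8") as f:
        info = json.load(f)
    # `key` is an assumed field name; adjust it to match your dataset header.
    return info.get(key)


def check_schema_version(dataset_root, expected="v3.0"):
    """Raise ValueError if the dataset was exported under another schema version."""
    found = read_schema_version(dataset_root)
    if found != expected:
        raise ValueError(f"expected schema {expected!r}, found {found!r}")
    return found
```

Calling `check_schema_version(root)` at the top of a training script turns a silent schema mismatch into an early, explicit failure.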

Features and observation schema.

The parquet rows follow the canonical LeRobot v3.0 feature dictionary, extended with the EgoVista annotation layers. Every feature has a stable name so your training script can address it by string key, and every feature is documented in meta/info.json with its dtype, shape, and origin.

Default features in an EgoVista LeRobot dataset:

observation.image.<camera>: reference to the per-frame video sample, decoded lazily by the LeRobot loader.
observation.state.hand_pose_2d: 21 keypoints per hand, normalized image coordinates, with per-keypoint confidence.
observation.state.hand_pose_3d: 21 keypoints per hand in 3D, lifted from 2D pose and the per-frame depth map.
observation.depth_map: per-frame depth map from Depth Anything V2. Metric where camera intrinsics are available, relative otherwise.
observation.segmentation.hand_object: 5-class segmentation (left hand, right hand, object in left hand, object in right hand, object in both hands).
action.contact_event: per-frame contact start and end transitions inferred from hand and object segmentation overlap.
action.label: natural language action description produced by a vision language model, with a confidence score and the model identifier.
timestamp, episode_index, frame_index: standard LeRobot bookkeeping.
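
To make the idea behind `action.contact_event` concrete, here is a small sketch that turns a per-frame contact flag into start and end transitions. It assumes a boolean "hand overlaps object" flag has already been computed for every frame; how the overlap is thresholded and how events are encoded in the delivered column are not specified on this page, so treat the function and its output format as illustrative only.

```python
def contact_transitions(in_contact):
    """Turn a per-frame contact flag sequence into (start, end) frame pairs.

    `start` is the first frame in contact and `end` is the last frame in
    contact (inclusive).
    """
    events = []
    start = None
    for i, flag in enumerate(in_contact):
        if flag and start is None:
            start = i  # contact begins on this frame
        elif not flag and start is not None:
            events.append((start, i - 1))  # contact ended on the previous frame
            start = None
    if start is not None:
        # The sequence ended while still in contact.
        events.append((start, len(in_contact) - 1))
    return events


flags = [False, True, True, False, False, True]
print(contact_transitions(flags))  # [(1, 2), (5, 5)]
```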

Loading a dataset is a one-liner:

from lerobot.datasets.lerobot_dataset import LeRobotDataset  # older releases: lerobot.common.datasets.lerobot_dataset

dataset = LeRobotDataset("egovista/sample-manipulation-v1")
frame = dataset[0]  # indexing returns one frame (a dict of features), not a whole episode
print(frame["observation.state.hand_pose_2d"].shape)
print(frame["action.label"])
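
Since the 2D hand pose is stored in normalized image coordinates, it usually needs to be scaled back to pixels before it can be drawn over a frame. The sketch below assumes each keypoint is an `(x, y, confidence)` triple with `x` and `y` in the range 0 to 1; the actual tensor layout should be checked against `meta/info.json`. The function name and the 0.5 confidence threshold are illustrative choices.

```python
def keypoints_to_pixels(keypoints, width, height, min_confidence=0.5):
    """Convert normalized (x, y, confidence) keypoints to pixel coordinates.

    Keypoints below `min_confidence` are returned as None so that their
    position in the 21-point hand skeleton is preserved.
    """
    pixels = []
    for x, y, conf in keypoints:
        if conf < min_confidence:
            pixels.append(None)
        else:
            pixels.append((x * width, y * height))
    return pixels


hand = [(0.5, 0.25, 0.9), (0.1, 0.1, 0.2)]
print(keypoints_to_pixels(hand, width=640, height=480))
# [(320.0, 120.0), None]
```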

From raw video to LeRobot dataset.

The conversion from a raw egocentric video to a LeRobot v3.0 dataset is a deterministic pipeline. Every step is logged with timestamps and model versions, so any frame in the final dataset can be traced back to the exact pipeline run that produced it.

Figure: Example of a raw egocentric contributor video (first-person view of a kitchen cooking task), before annotation and LeRobot v3.0 packaging.

The pipeline stages, in order:

Ingest the contributor video into EU-region object storage. The original video stays in the contributor zone; only an anonymized derived version reaches the annotation stack.
Apply face anonymization at source. No frame with an identifiable third-party face crosses an external boundary.
Run the 9 annotation layers in parallel on EU compute: 2D and 3D hand pose with our hand-tracking model, depth with a monocular depth model, segmentation with a hand-object segmentation model on EU GPU, contact timing derived from segmentation, action labels with a vision language model in the EU region, plus camera intrinsics and metadata enrichment.
Map raw annotations to the LeRobot v3.0 feature dictionary, validate schema, and generate the parquet chunks.
Build meta/info.json, meta/episodes.jsonl, meta/stats.json, and meta/tasks.jsonl, then produce a per-dataset QA report.
Deliver via signed URL on EU-region object storage. Retention follows the agreed engagement model, with raw footage purged after the contractual window.

Every step in this pipeline runs on EU compute. See the GDPR compliance page for the full data flow and legal basis per processing operation.
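
To show what a per-step log entry with timestamps and model versions might look like, here is a hypothetical record format. None of these field names come from the actual pipeline; this is only a sketch of how a step could be serialized so that a frame can later be traced back to the run that produced it.

```python
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone


@dataclass
class StepRecord:
    """One logged pipeline step (all field names are illustrative)."""
    run_id: str
    stage: str
    model_version: str
    # UTC timestamp in ISO 8601 format, filled in when the record is created.
    started_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_log_line(record):
    """Serialize a step record as one JSON line, with keys sorted for stable diffs."""
    return json.dumps(asdict(record), sort_keys=True)


record = StepRecord(run_id="run-0001", stage="depth", model_version="example-1.0")
print(to_log_line(record))
```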

Compatibility with the LeRobot ecosystem.

A LeRobot-native dataset only adds value if it plugs into the tools the community already uses. EgoVista datasets are tested against the standard LeRobot stack at every export, so the integration is predictable.

Hugging Face Hub: direct upload via huggingface-cli upload or the Python API. The dataset card is generated at build time and includes schema version, provenance, and QA scores.
LeRobot training scripts: ACT, Diffusion Policy and VQ-BeT load the dataset without modification. Configuration files are provided as a starting point.
Standard data loaders: the Hugging Face datasets library and the LeRobot loader both work natively with the parquet layout.
Visualization tools: lerobot.scripts.visualize renders episodes with the annotation layers overlaid, which is useful for QA review before training.

When LeRobot is the right format for your team.

LeRobot is the natural pick if you fall into one of three buckets. First, teams that already train on the Hugging Face stack: LeRobot is the path of least resistance, and the existing tooling immediately accelerates iteration. Second, research groups that want their results to be comparable with the LeRobot benchmark suite: producing datasets in the standard format makes cross-paper reproduction much easier. Third, startups that want to focus engineering capacity on model architecture and policy evaluation, not on maintaining a custom dataset format. If you need TFRecord-based interop instead, see the RLDS format page.

LeRobot format frequently asked questions.

Do you support older LeRobot versions?

Yes, we can export to LeRobot v2 on request, with the older format conventions (HDF5 episode files, single-file metadata). However, the schema differs from v3.0 in several places (no chunked parquet, different observation naming), so we recommend v3.0 for any new project unless your training stack is pinned to an older version. We document the v2-to-v3 mapping in the dataset card so your team can migrate later without re-collecting data.

Can the dataset be uploaded directly to Hugging Face Hub?

Yes. Datasets are produced in the exact directory layout expected by the Hugging Face datasets library and the LeRobot loader, so `huggingface-cli upload` works without modification. Clients can choose between a private repository under their organization or a fully self-hosted delivery via signed URL. The dataset card is generated as part of the build and includes provenance, schema version, and the per-episode QA scores.
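
For teams that prefer the Python API over the CLI, the upload can be sketched with `huggingface_hub` as shown below. The local path and repository name in the example call are hypothetical, and the function assumes you are already authenticated (for example via `huggingface-cli login`).

```python
def upload_dataset(local_dir, repo_id, private=True):
    """Upload a local dataset directory to a Hugging Face dataset repository."""
    # Imported lazily so this module can be loaded without huggingface_hub installed.
    from huggingface_hub import HfApi

    api = HfApi()  # picks up the token from `huggingface-cli login` or HF_TOKEN
    # Create the dataset repository if it does not exist yet.
    api.create_repo(repo_id, repo_type="dataset", private=private, exist_ok=True)
    return api.upload_folder(
        folder_path=local_dir,
        repo_id=repo_id,
        repo_type="dataset",
    )


# Example call (hypothetical path and repository name):
# upload_dataset("./sample-manipulation-v1", "your-org/sample-manipulation-v1")
```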

How do you handle multi-camera setups in LeRobot v3.0?

Each camera stream gets its own observation key under `observation.image.<camera_name>`, with the video files stored under `videos/chunk-<n>/<camera_name>/episode_<id>.mp4`. Synchronization metadata (frame timestamps, clock offsets) is preserved in `meta/episodes.jsonl`. Hand pose, depth, and segmentation are produced per-view so each camera has its own consistent annotation stack. Cross-view alignment data is added when EXIF intrinsics are available.
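
As an illustration, the cameras present in a dataset can be recovered from the feature dictionary by filtering on the observation key prefix. The sketch below uses a toy dictionary; in a real dataset the feature dictionary would come from `meta/info.json`, and the helper name is an assumption for this example.

```python
def camera_names(features, prefix="observation.image."):
    """List camera names found in a feature dictionary, sorted alphabetically."""
    return sorted(key[len(prefix):] for key in features if key.startswith(prefix))


# Toy feature dictionary; real entries also carry dtype and shape.
features = {
    "observation.image.head": {},
    "observation.image.wrist": {},
    "observation.state.hand_pose_2d": {},
}
print(camera_names(features))  # ['head', 'wrist']
```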

What is the maximum dataset size you can deliver?

There is no hard cap on dataset size. The format supports shards of several hundred gigabytes, and large deliveries are split across multiple parquet chunks so a training loop can load them in parallel and stream rather than hold everything in memory. We also publish `meta/stats.json` so your data loader can normalize without scanning the full dataset.
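
The normalization that these statistics enable is the usual standardization, applied element-wise. The snippet below is a minimal sketch with a toy stats entry; the exact JSON structure of `meta/stats.json` and the small `eps` term are assumptions, and the LeRobot loader's own normalization may differ in detail.

```python
def normalize(values, mean, std, eps=1e-8):
    """Standardize one feature vector element-wise: (x - mean) / (std + eps)."""
    # eps guards against division by zero for constant features.
    return [(x - m) / (s + eps) for x, m, s in zip(values, mean, std)]


# Toy stats entry in the shape {"mean": [...], "std": [...]}.
stats = {"mean": [0.5, 0.5], "std": [0.25, 0.5]}
print(normalize([1.0, 0.0], stats["mean"], stats["std"]))  # approximately [2.0, -1.0]
```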

Can we request a custom schema variant?

Yes. The defaults follow the canonical LeRobot v3.0 spec, but every dataset can be re-shaped to a custom feature set on request: drop unused layers, rename observation keys, change the action space encoding, or split a feature into separate columns. Custom variants are versioned alongside the canonical one so your team can switch between them without re-collection. Custom mappings are documented in the dataset card.

Are videos included or referenced externally?

Videos are included inside the dataset, under the `videos/` directory, encoded as H.264 MP4 with timestamps aligned to the per-frame parquet rows. We do not reference external buckets, so the dataset is self-contained and reproducible after delivery. If a client prefers external video references for storage reasons, we can produce a hybrid layout where parquet holds the annotations and videos sit on a separate signed URL.

How do you handle action label uncertainty in the schema?

Each action label produced by our vision language model ships with a confidence score under `action.label_confidence`, plus a model version under `action.label_model_id`. Low-confidence labels are flagged in the per-episode QA report so your training pipeline can mask them or weight them down. We never claim that action labels are ground truth: they are predictions from a vision language model, and we document the methodology so audits are reproducible.
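
One way to act on these scores in a training loop is to derive a per-sample loss weight from `action.label_confidence`. The sketch below shows both options mentioned above, masking and down-weighting. The 0.5 threshold and the weighting rule are illustrative assumptions, not a recommendation from the pipeline.

```python
def label_weights(confidences, threshold=0.5, mask=False):
    """Per-sample loss weights from label confidence scores.

    Below `threshold`, a label is dropped (weight 0.0) when `mask` is True,
    otherwise it is down-weighted to its own confidence. Labels at or above
    the threshold keep full weight.
    """
    weights = []
    for c in confidences:
        if c >= threshold:
            weights.append(1.0)
        else:
            weights.append(0.0 if mask else c)
    return weights


print(label_weights([0.9, 0.3, 0.5]))             # [1.0, 0.3, 1.0]
print(label_weights([0.9, 0.3, 0.5], mask=True))  # [1.0, 0.0, 1.0]
```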

Request a LeRobot sample.

Tell us the task, the environment, and the format variant. We can ship a 10 to 20 episode LeRobot v3.0 sample so your team can validate the schema and the data path on your existing training loop before any scale-up. For related reading, see RLDS TFRecord and Open X-Embodiment, the full product overview, or the GDPR pipeline details.
