Chapter 6: Policy Architectures: ACT, Diffusion Policy, RT-X, OpenVLA, and pi0

Overview

This is not a generic VLA chapter; it is an architecture map of action representation, latency, sensor channels, and release gates for factory manipulation. This chapter rewrites the problem as a factory-cell data contract. The promise of large robot datasets in ^[1] and ^[2] matters, but in manufacturing an episode becomes valuable only when it is attached to inspection and process outcomes.

ACT is strong for short-horizon bimanual imitation, Diffusion Policy handles continuous action distributions smoothly, and RT-X/OpenVLA/pi0 broaden embodiment while still needing contact, force, and inspection gates. The data is therefore not just a camera stream. It includes hand choice, force-torque, tactile patches, controller mode, operator intervention, and inspection result. ^[3] and ^[4] show why human-to-robot transfer loses critical signals unless the capture interface is designed around embodiment and contact.

After reading this chapter... - Explain factory manipulation as a data contract rather than a model score. - Separate two-finger, suction, custom, and five-finger choices by collection cost and failure observability. - Read papers, company releases, and deployment claims through evidence tiers. - Design the replay set and QA trace required for a first manufacturing PoC.

Core Map

Decision axis	Data that must remain observable	Factory-cell decision
Task distribution	SKU, lot, fixture, contact event, inspection result	Decide whether one policy family is enough or the cell needs task-specific policies ^[1]
Hardware	Gripper, hand, tactile/force channel, calibration state	Compare suction, two fingers, custom tools, and five-finger hands by data cost ^[2]
Operations log	Override, stop, rework, scrap, cycle time	Set retraining triggers and rollback criteria ^[3]

Visual Argument

Figure 6.1. Manipulation policy architecture map. Source: reused local survey asset or author-created illustration.

Figure 6.2. ACT action-chunking behavior. Source: reused local survey asset or author-created illustration.

Figure 6.3. Open X-Embodiment cross-robot data. Source: reused local survey asset or author-created illustration.

Action Chunking and Latency

The practical meaning of action chunking and latency is that process variables must be left in a learnable form. ^[1] shows how scale can broaden embodiment and task families, yet a factory cell is narrower and stricter. A pick that looks identical to a person can become a different distribution when fixture tolerance, surface contamination, cycle pressure, or reject code changes.

In the concrete scenario, ACT is strong for short-horizon bimanual imitation, Diffusion Policy handles continuous action distributions smoothly, and RT-X/OpenVLA/pi0 broaden embodiment while still needing contact, force, and inspection gates. Human demonstrations alone are not enough. Even when an interface such as ^[2] makes collection easier, a policy will repeat the same release failure if contact force and failure cause are absent. The episode schema must bind observation, action, contact state, QA outcome, and operator note under one key.

The evidence tier matters. A source such as ^[3] gives a method and benchmark that can be inspected. A company source can reveal deployment direction, but usually exposes less about data rights, recovery handling, and operating metrics. Manufacturers should place both in the same comparison table but never read them with the same confidence.

From a control perspective, the learned policy should not be the whole system. Force limits, guarded motion, fixture state, collision zones, and rollback conditions have to sit around the model. Without that boundary, collecting more data can create more safety stops and more rework instead of a more deployable robot.

From an operating perspective, the reproducibility of failure matters more than a headline success rate. A failure has to enter a replay set, the next update has to pass that replay set, and the deployed cell has to show a lower rate for the same failure code. The question is not only how to describe action chunking and latency, but which missing log would make the claim collapse on a line.

Diffusion Actions and Continuous Control

The practical meaning of diffusion actions and continuous control is that process variables must be left in a learnable form. ^[4] shows how scale can broaden embodiment and task families, yet a factory cell is narrower and stricter. A pick that looks identical to a person can become a different distribution when fixture tolerance, surface contamination, cycle pressure, or reject code changes.

In the concrete scenario, ACT is strong for short-horizon bimanual imitation, Diffusion Policy handles continuous action distributions smoothly, and RT-X/OpenVLA/pi0 broaden embodiment while still needing contact, force, and inspection gates. Human demonstrations alone are not enough. Even when an interface such as ^[5] makes collection easier, a policy will repeat the same release failure if contact force and failure cause are absent. The episode schema must bind observation, action, contact state, QA outcome, and operator note under one key.

The evidence tier matters. A source such as ^[6] gives a method and benchmark that can be inspected. A company source can reveal deployment direction, but usually exposes less about data rights, recovery handling, and operating metrics. Manufacturers should place both in the same comparison table but never read them with the same confidence.

From a control perspective, the learned policy should not be the whole system. Force limits, guarded motion, fixture state, collision zones, and rollback conditions have to sit around the model. Without that boundary, collecting more data can create more safety stops and more rework instead of a more deployable robot.

From an operating perspective, the reproducibility of failure matters more than a headline success rate. A failure has to enter a replay set, the next update has to pass that replay set, and the deployed cell has to show a lower rate for the same failure code. The question is not only how to describe diffusion actions and continuous control, but which missing log would make the claim collapse on a line.

RT-X and Cross-Embodiment

The practical meaning of rt-x and cross-embodiment is that process variables must be left in a learnable form. ^[7] shows how scale can broaden embodiment and task families, yet a factory cell is narrower and stricter. A pick that looks identical to a person can become a different distribution when fixture tolerance, surface contamination, cycle pressure, or reject code changes.

In the concrete scenario, ACT is strong for short-horizon bimanual imitation, Diffusion Policy handles continuous action distributions smoothly, and RT-X/OpenVLA/pi0 broaden embodiment while still needing contact, force, and inspection gates. Human demonstrations alone are not enough. Even when an interface such as ^[8] makes collection easier, a policy will repeat the same release failure if contact force and failure cause are absent. The episode schema must bind observation, action, contact state, QA outcome, and operator note under one key.

The evidence tier matters. A source such as ^[9] gives a method and benchmark that can be inspected. A company source can reveal deployment direction, but usually exposes less about data rights, recovery handling, and operating metrics. Manufacturers should place both in the same comparison table but never read them with the same confidence.

From a control perspective, the learned policy should not be the whole system. Force limits, guarded motion, fixture state, collision zones, and rollback conditions have to sit around the model. Without that boundary, collecting more data can create more safety stops and more rework instead of a more deployable robot.

From an operating perspective, the reproducibility of failure matters more than a headline success rate. A failure has to enter a replay set, the next update has to pass that replay set, and the deployed cell has to show a lower rate for the same failure code. The question is not only how to describe rt-x and cross-embodiment, but which missing log would make the claim collapse on a line.

OpenVLA, Octo, and pi0

The practical meaning of openvla, octo, and pi0 is that process variables must be left in a learnable form. ^[10] shows how scale can broaden embodiment and task families, yet a factory cell is narrower and stricter. A pick that looks identical to a person can become a different distribution when fixture tolerance, surface contamination, cycle pressure, or reject code changes.

In the concrete scenario, ACT is strong for short-horizon bimanual imitation, Diffusion Policy handles continuous action distributions smoothly, and RT-X/OpenVLA/pi0 broaden embodiment while still needing contact, force, and inspection gates. Human demonstrations alone are not enough. Even when an interface such as ^[11] makes collection easier, a policy will repeat the same release failure if contact force and failure cause are absent. The episode schema must bind observation, action, contact state, QA outcome, and operator note under one key.

The evidence tier matters. A source such as ^[12] gives a method and benchmark that can be inspected. A company source can reveal deployment direction, but usually exposes less about data rights, recovery handling, and operating metrics. Manufacturers should place both in the same comparison table but never read them with the same confidence.

From a control perspective, the learned policy should not be the whole system. Force limits, guarded motion, fixture state, collision zones, and rollback conditions have to sit around the model. Without that boundary, collecting more data can create more safety stops and more rework instead of a more deployable robot.

From an operating perspective, the reproducibility of failure matters more than a headline success rate. A failure has to enter a replay set, the next update has to pass that replay set, and the deployed cell has to show a lower rate for the same failure code. The question is not only how to describe openvla, octo, and pi0, but which missing log would make the claim collapse on a line.

How Force and Touch Enter the Policy Stack

The practical meaning of how force and touch enter the policy stack is that process variables must be left in a learnable form. ^[13] shows how scale can broaden embodiment and task families, yet a factory cell is narrower and stricter. A pick that looks identical to a person can become a different distribution when fixture tolerance, surface contamination, cycle pressure, or reject code changes.

In the concrete scenario, ACT is strong for short-horizon bimanual imitation, Diffusion Policy handles continuous action distributions smoothly, and RT-X/OpenVLA/pi0 broaden embodiment while still needing contact, force, and inspection gates. Human demonstrations alone are not enough. Even when an interface such as ^[14] makes collection easier, a policy will repeat the same release failure if contact force and failure cause are absent. The episode schema must bind observation, action, contact state, QA outcome, and operator note under one key.

The evidence tier matters. A source such as ^[15] gives a method and benchmark that can be inspected. A company source can reveal deployment direction, but usually exposes less about data rights, recovery handling, and operating metrics. Manufacturers should place both in the same comparison table but never read them with the same confidence.

From a control perspective, the learned policy should not be the whole system. Force limits, guarded motion, fixture state, collision zones, and rollback conditions have to sit around the model. Without that boundary, collecting more data can create more safety stops and more rework instead of a more deployable robot.

From an operating perspective, the reproducibility of failure matters more than a headline success rate. A failure has to enter a replay set, the next update has to pass that replay set, and the deployed cell has to show a lower rate for the same failure code. The question is not only how to describe how force and touch enter the policy stack, but which missing log would make the claim collapse on a line.

Manufacturing Cell Checkpoint

Start a PoC by writing the data contract before selecting the model. ACT is strong for short-horizon bimanual imitation, Diffusion Policy handles continuous action distributions smoothly, and RT-X/OpenVLA/pi0 broaden embodiment while still needing contact, force, and inspection gates. The contract should include episode ID, operator or teleop ID, hand or tool ID, contact channel, controller mode, quality decision, defect code, override reason, and rollback condition. If a vendor hides or cannot export these fields, the manufacturer cannot explain why model improvement happens.

The second checkpoint is evidence tiering. Papers with arXiv or DOI links support method claims; official company pages support product-direction claims; press and watchlist items should not become load-bearing claims. PI, Generalist, Skild, Figure, Covariant, Dexterity, Chef Robotics, Sunday, Config, and CarbonSix all fit under data-driven manipulation, but they expose very different evidence about manufacturing readiness.

Open Questions and Failure Modes

First, large data without contact state leaves insertion, wiping, deformable handling, and tool use under-observed. Second, when data rights stay entirely with the vendor, the manufacturer does not own the cause of process improvement. Third, worker video and process IP are learning assets and governance risks at the same time. Fourth, sim-to-real data does not guarantee factory performance unless it is tied to real QA labels.

Data Contract Addendum

The important unit is not the model name but the observable attempt. To a human, two factory cycles can look like the same task; to a robot, they may differ in surface friction, initial pose, contact order, force limit, and inspection rule. Data strategy is therefore an operating system for connecting failures to process variables, not a campaign for collecting success clips.

Factory data is slower and messier than benchmark data, but it carries more decisive labels. Defect codes, rework, operator intervention, line stops, and safety stops are not noise to be hidden from learning. They are the labels that decide whether a policy can be deployed. If those labels are detached from episodes, model curves can improve while process KPIs stay flat.

Large-data driven manipulation is therefore not a choice of one VLA. It is a joint design of task schema, hand choice, controller boundary, replay set, QA trace, and update governance. Without that design, a foundation model may produce impressive demonstrations without becoming repeatable replacement labor in manufacturing.

Operating Loop Addendum

The important unit is not the model name but the observable attempt. To a human, two factory cycles can look like the same task; to a robot, they may differ in surface friction, initial pose, contact order, force limit, and inspection rule. Data strategy is therefore an operating system for connecting failures to process variables, not a campaign for collecting success clips.

Factory data is slower and messier than benchmark data, but it carries more decisive labels. Defect codes, rework, operator intervention, line stops, and safety stops are not noise to be hidden from learning. They are the labels that decide whether a policy can be deployed. If those labels are detached from episodes, model curves can improve while process KPIs stay flat.

Large-data driven manipulation is therefore not a choice of one VLA. It is a joint design of task schema, hand choice, controller boundary, replay set, QA trace, and update governance. Without that design, a foundation model may produce impressive demonstrations without becoming repeatable replacement labor in manufacturing.

Deployment Governance Addendum

The important unit is not the model name but the observable attempt. To a human, two factory cycles can look like the same task; to a robot, they may differ in surface friction, initial pose, contact order, force limit, and inspection rule. Data strategy is therefore an operating system for connecting failures to process variables, not a campaign for collecting success clips.

Factory data is slower and messier than benchmark data, but it carries more decisive labels. Defect codes, rework, operator intervention, line stops, and safety stops are not noise to be hidden from learning. They are the labels that decide whether a policy can be deployed. If those labels are detached from episodes, model curves can improve while process KPIs stay flat.

Large-data driven manipulation is therefore not a choice of one VLA. It is a joint design of task schema, hand choice, controller boundary, replay set, QA trace, and update governance. Without that design, a foundation model may produce impressive demonstrations without becoming repeatable replacement labor in manufacturing.

Field Failure Addendum

The important unit is not the model name but the observable attempt. To a human, two factory cycles can look like the same task; to a robot, they may differ in surface friction, initial pose, contact order, force limit, and inspection rule. Data strategy is therefore an operating system for connecting failures to process variables, not a campaign for collecting success clips.

Factory data is slower and messier than benchmark data, but it carries more decisive labels. Defect codes, rework, operator intervention, line stops, and safety stops are not noise to be hidden from learning. They are the labels that decide whether a policy can be deployed. If those labels are detached from episodes, model curves can improve while process KPIs stay flat.

Large-data driven manipulation is therefore not a choice of one VLA. It is a joint design of task schema, hand choice, controller boundary, replay set, QA trace, and update governance. Without that design, a foundation model may produce impressive demonstrations without becoming repeatable replacement labor in manufacturing.

What to Learn Next

The next chapter applies the same principle to hands and end-effectors. Some cells are best served by suction or two fingers; others cannot close the data loop without five-finger hands and tactile or force-rich data.

References

Zhao, Tony Z. (2023). Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. arXiv.
Chi, Cheng (2023). Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. arXiv.
O'Neill, Abby (2023). Open X-Embodiment: Robotic Learning Datasets and RT-X Models. arXiv.
Brohan, Anthony (2022). RT-1: Robotics Transformer for Real-World Control at Scale. arXiv.
Brohan, Anthony (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. arXiv.
Kim, Moo Jin (2024). OpenVLA: An Open-Source Vision-Language-Action Model. arXiv.
Octo Model Team (2024). Octo: An Open-Source Generalist Robot Policy. arXiv.
Black, Kevin (2024). pi0: A Vision-Language-Action Flow Model for General Robot Control. arXiv.
Physical Intelligence (2024). pi0: A Generalist Robot Policy. Company research post.
Physical Intelligence (2025). OpenPI: Open Source Robot Policy Stack. GitHub.
Pertsch, Karl (2025). FAST: Efficient Action Tokenization for Vision-Language-Action Models. arXiv.
Shukla, Shivin (2025). SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics. arXiv.
DeepMind Robotics Team (2025). Gemini Robotics: Bringing AI into the Physical World. arXiv.
Bjorck, Johan (2025). GR00T N1: An Open Foundation Model for Generalist Humanoid Robots. arXiv.
NVIDIA (2025). Isaac GR00T N1 Open Foundation Model for Humanoid Robots. NVIDIA Developer.
Yu, Wenhao (2025). ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation. arXiv.
Huang, Yuhang (2025). Tactile-VLA: Unlocking Vision-Language-Action Model's Physical Knowledge for Tactile Generalization. arXiv.
Hao, Yaru (2025). Tactile-Language-Action Model for Contact-Rich Manipulation. arXiv.
Khazatsky, Alexander (2024). DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset. arXiv.
AgiBot World Team (2025). AgiBot World Colosseo: A Large-Scale Manipulation Platform. arXiv.
Li, Xingyu (2024). Evaluating Real-World Robot Manipulation Policies in Simulation. arXiv.
Mandlekar, Ajay (2021). What Matters in Learning from Offline Human Demonstrations for Robot Manipulation. arXiv.
Ha, Huy (2024). Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots. arXiv.
Toyota Research Institute (2024). Large Behavior Models for Robot Manipulation. Company technical post.