2889 - Vision-Language-Action for humanoid robots
Vision-Language-Action (VLA) models are an emerging class of multi-modal large language models that enable robots to understand and execute complex verbal instructions [1,2,3]. These models integrate Vision-Language Models (VLMs), Large Language Models (LLMs), and imitation learning methods (often based on diffusion models) to combine language, vision, and action in a unified framework. The long-term goal is to develop generalist robots capable of performing virtually any task in any environment—ranging from factories to homes—based solely on natural language commands. For example, a user could instruct a robot to “make breakfast with some eggs and coffee and bring it to me,” or alternatively, “put all these items in a box, close the package, and send it to a client,” using the same robot and the same underlying policy.
The first objective of this internship is to evaluate open-source VLA systems (e.g., OpenVLA, Pi0) on real robotic platforms, starting with our Unitree G1 robot, integrating them with our whole-body stack, and potentially extending to our bimanual Tiago++ robot. These models will be fine-tuned using in-house data collected in the lab to improve performance in our specific scenarios.
The second objective is to enrich these models with force sensing data to improve the interactions with the environment.
Finally, the internship will explore any opportunity to push the state of the art in VLA-based robotics.
[1] Kim, Moo Jin, et al. 'OpenVLA: An open-source vision-language-action model.' arXiv preprint arXiv:2406.09246 (2024) - https://arxiv.org/pdf/2406.09246
[2] Black, Kevin, et al. '$\pi_0 $: A Vision-Language-Action Flow Model for General Robot Control.' arXiv preprint arXiv:2410.24164 (2024). - https://arxiv.org/pdf/2410.24164 - https://www.physicalintelligence.company/blog/pi0
[3] Pertsch, Karl, et al. 'Fast: Efficient action tokenization for vision-language-action models.' arXiv preprint arXiv:2501.09747 (2025) - https://arxiv.org/pdf/2501.09747
Required skills:
Languages: English (French is not required -- English is the main language of the team).