ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning - Presented by 황지훈

Page information

Author: Administrator | Posted: 26-06-23 13:54

Body

Vision-language-action (VLA) reasoning tasks require agents to interpret multimodal instructions, perform long-horizon planning, and act adaptively in dynamic environments. Existing approaches typically train VLA models in an end-to-end fashion, directly mapping inputs to actions without explicit reasoning, which hinders their ability to plan over multiple steps or adapt to complex task variations. In this paper, we propose ThinkAct, a dual-system framework that bridges high-level reasoning with low-level action execution via reinforced visual latent planning. ThinkAct trains a multimodal LLM to generate embodied reasoning plans guided by reinforcing action-aligned visual rewards based on goal completion and trajectory consistency. These reasoning plans are compressed into a visual plan latent that conditions a downstream action model for robust action execution on target environments. Extensive experiments on embodied reasoning and robot manipulation benchmarks demonstrate that ThinkAct enables few-shot adaptation, long-horizon planning, and self-correction behaviors in complex embodied AI tasks.
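
The abstract names two reward signals, goal completion and trajectory consistency, but gives no formulas. The sketch below is only one plausible way to combine such terms, not the paper's definition. Everything in it is an assumption made for illustration: the 2D point trajectories, the function names (goal_reward, trajectory_reward, action_aligned_reward), the exponential distance mapping, and the equal default weights.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]  # assumed 2D waypoint; the abstract does not specify a format


def goal_reward(pred: Sequence[Point], ref: Sequence[Point]) -> float:
    """Hypothetical goal-completion term: closeness of start and end points, in (0, 1]."""
    d_start = math.dist(pred[0], ref[0])
    d_end = math.dist(pred[-1], ref[-1])
    return 0.5 * (math.exp(-d_start) + math.exp(-d_end))


def trajectory_reward(pred: Sequence[Point], ref: Sequence[Point]) -> float:
    """Hypothetical consistency term: mean pointwise distance mapped into (0, 1]."""
    n = min(len(pred), len(ref))  # zip below truncates to the shorter trajectory
    mean_d = sum(math.dist(p, r) for p, r in zip(pred, ref)) / n
    return math.exp(-mean_d)


def action_aligned_reward(
    pred: Sequence[Point],
    ref: Sequence[Point],
    w_goal: float = 0.5,  # assumed weight, not taken from the paper
    w_traj: float = 0.5,  # assumed weight, not taken from the paper
) -> float:
    """Weighted sum of the two illustrative terms."""
    if not pred or not ref:
        raise ValueError("both trajectories must contain at least one point")
    return w_goal * goal_reward(pred, ref) + w_traj * trajectory_reward(pred, ref)


if __name__ == "__main__":
    reference = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    predicted = [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0)]
    print(action_aligned_reward(predicted, reference))
```

In a reinforcement-learning setup of the kind the abstract describes, a scalar like this would score each plan sampled from the multimodal LLM. How ThinkAct actually computes and weights its rewards should be taken from the paper itself.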

Attachments

황지훈_세미나_260331.pptx (6.5M), 0 downloads | Date: 2026-06-23 13:54:41
