Recent advancements in large language models (LLMs) have significantly improved their ability to generate natural and contextually relevant text, enabling more human-like AI interactions. However, generating and understanding interactive human-like motion, where multiple individuals engage in coordinated movements, remains challenging due to the complexity of modeling these interactions. Additionally, a unified and versatile model is needed to handle diverse interactive scenarios, such as chat systems that dynamically adapt to user instructions and assigned roles. To address these challenges, we introduce VIM, the Versatile Interactive Motion-language model, which integrates both language and motion modalities to effectively understand, generate, and control interactive motions in multi-turn conversational contexts. Unlike previous studies that primarily focus on uni-directional tasks such as text-to-motion or motion-to-text, VIM employs a unified architecture capable of simultaneously understanding and generating both motion and text modalities. Given the absence of an appropriate dataset to support this task, we introduce Inter-MT2, a large-scale instruction-tuning dataset containing 82.7K multi-turn interactive motion instructions, covering 153K interactive motion samples. Inter-MT2 spans diverse instructional scenarios, including motion editing, question answering, and story generation, leveraging off-the-shelf large language models and motion diffusion models to construct a broad set of interactive motion instructions. We extensively evaluate the versatility of VIM across multiple interactive motion-related tasks, including motion-to-text, text-to-motion, reaction generation, motion editing, and reasoning about motion sequences. Notably, VIM is the first model capable of effectively addressing all these tasks within a single unified framework, achieving competitive performance compared to task-specific methods.
Existing datasets for modeling interactive motions (Inter-X, InterHuman) lack sufficient diversity in instructions and do not include multi-turn conversations. To address this gap, we introduce Inter-MT2, an INTERactive Multi-Turn Motion-Text dataset. Inter-MT2 covers a variety of interactive motion scenarios with multi-turn conversations, diverse instructions, and spatiotemporally aligned motions between two individuals.
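To make this structure concrete, the sketch below shows what a single multi-turn sample could look like. The field names and motion identifiers are illustrative assumptions, not the released schema.

```python
# A minimal, hypothetical view of one Inter-MT2 multi-turn sample.
# Field names ("scenario", "turns", "motion_ref") and the ids are placeholders.
sample = {
    "scenario": "motion editing",   # e.g. editing, question answering, story generation
    "turns": [
        {
            "role": "user",
            "text": "Two friends meet after a long time and greet each other "
                    "like [source motion]. What if one of the friends becomes "
                    "overly touched upon meeting the other friend?",
            "motion_ref": "real_motion_0421",        # placeholder id, data-sourced motion
        },
        {
            "role": "assistant",
            "text": "Here, one friend pulls the other into a long, emotional hug.",
            "motion_ref": "synth_motion_0421",       # placeholder id, synthesized motion
        },
    ],
}
```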
Our pipeline creates samples in two ways. First, starting from a motion in an existing dataset, we generate a caption and an instruction, then use InterGEN to synthesize a matching motion, pairing the original and synthesized motions under that instruction. Second, we generate two captions and an instruction and synthesize both motions, producing samples composed entirely of synthesized motions. Blending data-sourced and synthesized motions in this way supports reliable interactive motion modeling.
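The two construction branches can be summarized as in the sketch below. The callables `captioner`, `llm`, and `motion_gen` stand in for the captioning step, the off-the-shelf language model, and the motion diffusion model; their names and signatures are assumptions for illustration, not the actual pipeline code.

```python
def build_from_real_motion(dataset_motion, captioner, llm, motion_gen):
    """Branch 1: pair a real dataset motion with a synthesized, instruction-edited counterpart."""
    caption = captioner(dataset_motion)                        # caption for the real motion
    instruction = llm(f"Write an editing instruction for: {caption}")
    edited_motion = motion_gen(instruction)                    # motion diffusion output
    return {
        "instruction": instruction,
        "source_motion": dataset_motion,                       # data-sourced motion
        "target_motion": edited_motion,                        # synthesized motion
    }


def build_from_synthesized_pair(llm, motion_gen):
    """Branch 2: generate two captions and an instruction, then synthesize both motions."""
    caption_a = llm("Write a caption describing a two-person interaction.")
    instruction = llm(f"Write an editing instruction for this interaction: {caption_a}")
    caption_b = llm(f"Rewrite the caption after applying this edit: {instruction}")
    return {
        "instruction": instruction,
        "source_motion": motion_gen(caption_a),                # synthesized motion
        "target_motion": motion_gen(caption_b),                # synthesized motion
    }
```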
Overall, we collected 82K multi-turn conversations covering 96K synthesized and 56K real motions. Figure 2 shows statistics and samples from Inter-MT2, where motion scenarios are categorized by a large language model based on the motion captions. Additional samples, highlighting spatiotemporally aligned motions between two individuals, are showcased on the project page.
We pursue the versatility of VIM through a unified architecture that can take both motion and text as input and produce both as output. Building on a pre-trained LLM, our training process is divided into three stages.
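As a conceptual illustration of such a unified interface, the sketch below interleaves discrete motion tokens with text so that a single LLM can read and emit both modalities. The special tokens, token names, and codebook indices are assumptions for illustration, not the exact design of VIM.

```python
# Hypothetical special tokens marking where a motion segment starts and ends.
SOM, EOM = "<som>", "<eom>"

def interleave(instruction_text, motion_codes):
    """Replace the [source motion] placeholder with discrete motion tokens."""
    motion_tokens = " ".join(f"<motion_{c}>" for c in motion_codes)
    return instruction_text.replace("[source motion]", f"{SOM} {motion_tokens} {EOM}")

# Example: an editing prompt becomes a single interleaved token stream the LLM attends over;
# the codebook indices below are placeholders.
prompt = interleave(
    "Two friends are meeting up like in [source motion]. "
    "The first person seems way too aggressive. Can you make this person more gentle?",
    motion_codes=[17, 284, 9, 451],
)
print(prompt)
```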
Two friends meet after a long time and greet each other like [source motion]. What if one of the friends becomes overly touched upon meeting the other friend?
Let's have a scene where two people are interacting, like the motion you see in [source motion]. What if the person who is walking up is a bit more respectful?
Two individuals are performing a self-defense technique, where one person strikes with the left hand, and the other intercepts the attack. It's similar to [source motion]. The first person seems too aggressive. Can you make the second person counter in a more forceful and confident way?
Two friends are meeting up like in [source motion]. The first person seems way too aggressive. Can you make this person more gentle?
One person pats the other on the back, and the other person turns to look.
The first person holds onto the second's right forearm, then stumbles and drags them down.
One person approaches and massages the other's shoulders using both hands.
One person steps forward and steps on the left foot of the other person.
Two people sit facing each other, taking turns to play rock-paper-scissors by waving their right arms to the right three times each.
Two people face each other and raise both hands in front of their heads. Then, they move forward and clap.
The initial individual is seated on the chair. The subsequent individual approaches from the left of the first person, grasps his/her left arm with both hands, and assists him/her in rising.
Two people walk towards each other, and when they meet, their arms collide.
@article{park2024versatile,
  title   = {A Unified Framework for Motion Reasoning and Generation in Human Interaction},
  author  = {Park, Jeongeun and Choi, Sungjoon and Yun, Sangdoo},
  journal = {arXiv preprint arXiv:2410.05628},
  year    = {2024}
}