VIM: Versatile Interactive Motion Language Model

Versatile Motion-Language Models for Multi-Turn Interactive Agents

1 Korea University 2 NAVER AI Lab
*This research was conducted as part of an internship at NAVER AI Lab, 2024.
Corresponding authors

Abstract

Recent advancements in large language models (LLMs) have greatly enhanced their ability to generate natural and contextually relevant text, making AI interactions more human-like. However, generating and understanding interactive human-like motion, where two individuals engage in coordinated movements, remains a challenge due to the complexity of modeling these coordinated interactions. Furthermore, a versatile model is required to handle diverse interactive scenarios, such as chat systems that follow user instructions or adapt to their assigned role while adjusting interaction dynamics. To tackle this problem, we introduce VIM, short for the Versatile Interactive Motion language model, which integrates both language and motion modalities to effectively understand, generate, and control interactive motions in multi-turn conversational contexts. To address the scarcity of multi-turn interactive motion data, we introduce a synthetic dataset, Inter-MT2, built by leveraging pre-trained models to create diverse instructional data paired with interactive motions. Our approach first trains a motion tokenizer that encodes interactive motions into residual discrete tokens. In the pre-training stage, the model learns to align motion and text representations over these discrete tokens. During the instruction fine-tuning stage, VIM adapts to multi-turn conversations using Inter-MT2. We evaluate the versatility of our method across motion-related tasks: motion-to-text, text-to-motion, reaction generation, motion editing, and reasoning about motion sequences. The results highlight VIM's versatility and effectiveness in handling complex interactive motion synthesis.

Inter-MT2: Interactive Multi-Turn Motion-Text Dataset

Current datasets for modeling interactive motions (Inter-X, InterHuman) lack sufficient diversity in instructions and do not include multi-turn conversations. To address this gap, we introduce Inter-MT2, an INTERactive Multi-Turn Motion-Text dataset. It covers a variety of interactive motion scenarios with multi-turn conversations, diverse instructions, and spatiotemporally aligned motions between two individuals.

  • Generate Motion Captions & Instructions. We employ GPT-4o to generate motion captions and conversational instructions for a variety of tasks, such as motion editing, reasoning, and story generation, enhancing the model’s versatility.
  • Generate Corresponding Motion. We utilize the state-of-the-art text-to-motion diffusion model, InterGen, to generate motions that align with the captions produced by the LLM. A sketch of the full pipeline follows this list.
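
Below is a minimal sketch of this two-step pipeline. The wrapper names (query_gpt4o, intergen_sample), the prompt wording, and the JSON turn format are illustrative assumptions rather than details from the paper; only the overall flow (GPT-4o writes the multi-turn captions and instructions, InterGen synthesizes the matching motions) follows the description above.

    import json

    def query_gpt4o(prompt: str) -> str:
        """Hypothetical wrapper around a GPT-4o chat-completion call."""
        raise NotImplementedError

    def intergen_sample(caption: str):
        """Hypothetical wrapper that samples a two-person motion from InterGen."""
        raise NotImplementedError

    TASKS = ["motion editing", "motion reasoning", "story generation"]

    def build_sample(seed_caption: str, task: str) -> dict:
        # Step 1: GPT-4o expands a seed caption into a multi-turn conversation,
        # where each turn carries an instruction and a new motion caption.
        raw = query_gpt4o(
            f"Given the interaction '{seed_caption}', write a multi-turn "
            f"{task} conversation as JSON, with one motion caption per turn."
        )
        turns = json.loads(raw)  # assumes GPT-4o is prompted to return JSON
        # Step 2: InterGen synthesizes an interactive motion for every caption,
        # so each conversational turn is grounded in an actual motion clip.
        for turn in turns:
            turn["motion"] = intergen_sample(turn["caption"])
        return {"task": task, "turns": turns}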

Data samples

VIM: Versatile Interactive Motion Language Model

We pursue the versatility of VIM through a unified architecture that takes both motion and text as input and produces both as output. Building on a pre-trained LLM, our training process is divided into three stages:

  • Stage 1: Motion Tokenizer. We train a motion tokenizer that encodes interactive motion into residual discrete tokens and decodes those tokens back into motion.
  • Stage 2: Pre-training for Feature Alignment. We pre-train the model on paired motion and text data so that it learns the alignment between the two modalities. In particular, we train the large language model (LLM) with a low-rank adaptation (LoRA) module, along with the embedding layer and the decoder head. We use the following motion-text paired datasets:
    • InterHuman: 7K interactive motion-text pairs
    • Inter-X: 11K interactive motion-text pairs
    • Motion-X: single-person motion-text pairs
  • Stage 3: Instruction Fine-tuning. This stage fine-tunes the model to follow instructions and improves its responsiveness to conversational cues. We first merge the LoRA weights into the base model, then fully fine-tune all weights on Inter-MT2 together with the interactive motion-text paired datasets, i.e., Inter-X and InterHuman. Sketches of the tokenizer and the LoRA merge follow this list.
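
Below is a minimal PyTorch sketch of the residual quantization behind the Stage 1 tokenizer, assuming interactive motion features have already been encoded into continuous latents. The codebook depth and sizes are illustrative, and the straight-through estimator and commitment losses needed for actual training are omitted.

    import torch
    import torch.nn as nn

    class ResidualVQ(nn.Module):
        """Quantize motion latents into layered (residual) discrete tokens."""

        def __init__(self, num_layers: int = 4, codebook_size: int = 512, dim: int = 256):
            super().__init__()
            self.codebooks = nn.ModuleList(
                nn.Embedding(codebook_size, dim) for _ in range(num_layers)
            )

        def forward(self, z: torch.Tensor):
            # z: (batch, time, dim) continuous latents from the motion encoder.
            residual, quantized, ids = z, torch.zeros_like(z), []
            for codebook in self.codebooks:
                # Pick the nearest codebook entry for the current residual.
                table = codebook.weight.unsqueeze(0).expand(z.size(0), -1, -1)
                idx = torch.cdist(residual, table).argmin(dim=-1)  # (batch, time)
                q = codebook(idx)
                quantized = quantized + q
                residual = residual - q  # the next layer quantizes what is left
                ids.append(idx)
            # ids: (batch, time, num_layers) map to discrete motion tokens
            # that extend the LLM vocabulary.
            return quantized, torch.stack(ids, dim=-1)

For the Stage 3 merge, a short sketch using the Hugging Face peft library; the paper does not name its LoRA toolkit, and the model and adapter identifiers below are placeholders:

    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained("base-llm")         # placeholder id
    model = PeftModel.from_pretrained(base, "stage2-lora-adapter")  # placeholder path
    model = model.merge_and_unload()  # fold the LoRA deltas into the base weights
    # `model` is then fully fine-tuned on Inter-MT2, Inter-X, and InterHuman.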

Examples on Motion Editing Task

User

Two friends meet after a long time and greet each other like [source motion]. What if one of the friends becomes overly touched upon meeting the other friend?

VIM

[edited motion video]
User

Let's have a scene where two people are interacting, like the motion you see in [source motion]. What if the person who is walking up is a bit more respectful?

VIM

[edited motion video]
User

Two individuals are performing a self-defense technique, where one person strikes with the left hand, and the other intercepts the attack. It's similar to [source motion]. The first person seems too aggressive. Can you make the second person counter in a more forceful and confident way?

VIM

[edited motion video]
User

Two friends are meeting up like in [source motion]. The first person seems way too aggressive. Can you make this person more gentle?

VIM

[edited motion video]

Examples on Motion Reasoning Task

Examples on Traditional Motion-Related Tasks

Motion-to-Text

One person pats the other on the back, and the other person turns to look.

The first person holds onto the second's right forearm, then stumbles and drags them down.

One person approaches and massages the other's shoulders using both hands.

One person steps forward and steps on the left foot of the other person.

Text-to-Motion

Two people sit facing each other, taking turns to play rock-paper-scissors by waving their right arms to the right three times each.

Two people face each other and raise both hands in front of their heads. Then, they move forward and clap.

The initial individual is seated on the chair. The subsequent individual approaches from the left of the first person, grasps his/her left arm with both hands, and assists him/her in rising.

Two people walk towards each other, and when they meet, their arms collide.

Reaction Generation

[Four video examples: each pairs a source motion with the generated reaction motion, shown in blue.]

BibTeX

