Recent advances in large language models (LLMs) have greatly improved their ability to generate natural, contextually relevant text, making AI interactions more human-like. However, generating and understanding interactive human-like motion, where two individuals engage in coordinated movements, remains challenging because of the complexity of modeling such interactions. Moreover, a versatile model is required to handle diverse interactive scenarios, such as chat systems that follow user instructions or adapt to an assigned role while adjusting the interaction dynamics. To tackle this problem, we introduce VIM, short for the Versatile Interactive Motion language model, which integrates the language and motion modalities to understand, generate, and control interactive motions in multi-turn conversational contexts. To address the scarcity of multi-turn interactive motion data, we introduce a synthetic dataset, INTER-MT2, in which we use pre-trained models to create diverse instructional data paired with interactive motions. Our approach first trains a motion tokenizer that encodes interactive motions into residual discrete tokens. In the pre-training stage, the model learns to align motion and text representations over these discrete tokens. During the instruction fine-tuning stage, VIM adapts to multi-turn conversations using INTER-MT2. We evaluate our method across motion-related tasks: motion-to-text, text-to-motion, reaction generation, motion editing, and reasoning about motion sequences. The results highlight VIM's versatility and effectiveness in handling complex interactive motion synthesis.
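To make the motion-tokenizer idea mentioned above concrete, the sketch below shows one way residual quantization can turn per-frame motion latents into stacked discrete token ids. The module structure, codebook size, and latent dimension are illustrative assumptions, and the motion encoder/decoder and straight-through gradient details are omitted; this is not the released implementation.

```python
# Minimal sketch of residual vector quantization for a motion tokenizer.
# Assumes motion has already been encoded into per-frame latent vectors;
# sizes and module names are assumptions for illustration only.
import torch
import torch.nn as nn


class ResidualQuantizer(nn.Module):
    def __init__(self, num_layers: int = 4, codebook_size: int = 512, dim: int = 256):
        super().__init__()
        # One codebook per residual layer; each row is a learnable code vector.
        self.codebooks = nn.ModuleList(
            [nn.Embedding(codebook_size, dim) for _ in range(num_layers)]
        )

    def forward(self, z: torch.Tensor):
        """Quantize latents z of shape (batch, frames, dim) into residual token ids."""
        residual = z
        quantized = torch.zeros_like(z)
        token_ids = []
        for codebook in self.codebooks:
            # Squared Euclidean distance from each residual vector to every code vector.
            distances = (residual.unsqueeze(-2) - codebook.weight).pow(2).sum(-1)  # (B, T, K)
            ids = distances.argmin(dim=-1)        # (B, T) nearest-code indices
            codes = codebook(ids)                 # (B, T, dim)
            quantized = quantized + codes
            residual = residual - codes           # next layer models what is left over
            token_ids.append(ids)
        # Stacked ids are the discrete motion tokens that a language model can consume.
        return quantized, torch.stack(token_ids, dim=-1)  # (B, T, num_layers)


# Toy usage: 2 sequences, 64 frames, 256-dim latents per frame (shapes are assumptions).
latents = torch.randn(2, 64, 256)
quantized, tokens = ResidualQuantizer()(latents)
```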
Current datasets (Inter-X, InterHuman) for modeling interactive motions lack sufficient diversity in instructions and do not include multi-turn conversations. To address this gap, we introduce INTER-MT2, the INTERactive MULTI-Turn Motion-Text dataset. It covers a variety of interactive motion scenarios with multi-turn conversations, diverse instructions, and spatiotemporally aligned motions between two individuals.
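For a rough sense of what such a multi-turn sample might contain, here is an illustrative sketch; the field names and schema are assumptions for explanation, not the released INTER-MT2 format.

```python
# Hypothetical structure of one multi-turn sample: two aligned motions plus a
# conversation that references them. Field names and contents are assumptions.
sample = {
    "motions": {
        # Spatiotemporally aligned sequences for the two interacting people,
        # e.g. per-person joint trajectories stored elsewhere and referenced by id.
        "source_motion": "motion_001",
        "edited_motion": "motion_002",
    },
    "conversation": [
        {"role": "user",
         "content": "Two friends meet after a long time and greet each other like "
                    "<motion_001>. What if one of the friends becomes overly touched?"},
        {"role": "assistant",
         "content": "<motion_002>"},
        {"role": "user",
         "content": "Describe how the second person's reaction changed."},
        {"role": "assistant",
         "content": "The second person now holds the greeting longer and pulls the "
                    "friend into a tighter embrace."},
    ],
}
```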
We pursue the versatility of VIM through a unified architecture that can simultaneously take motion and text as input and produce both as output. Building on a pre-trained LLM, training proceeds in three stages: (1) training a motion tokenizer that encodes interactive motions into residual discrete tokens, (2) pre-training the model to align motion and text representations over these discrete tokens, and (3) instruction fine-tuning on the multi-turn conversations of INTER-MT2.
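As a small illustration of how a multi-turn conversation can be flattened into a single stream that interleaves text with discrete motion tokens before fine-tuning, consider the sketch below. The chat template, placeholder syntax, and special-token names (<motion>, <m_*>) are assumptions for illustration, not VIM's actual vocabulary.

```python
# Sketch of rendering a conversation with motion placeholders into one token stream.
# Template and special-token names are hypothetical.
from typing import Dict, List


def format_turn(role: str, content: str) -> str:
    # Simple chat template; the model's actual template may differ.
    return f"<|{role}|>\n{content}\n"


def render_conversation(turns: List[Dict[str, str]],
                        motion_tokens: Dict[str, List[int]]) -> str:
    """Replace placeholders like <motion_001> with their discrete motion token ids."""
    rendered = ""
    for turn in turns:
        content = turn["content"]
        for motion_id, ids in motion_tokens.items():
            placeholder = f"<{motion_id}>"
            token_str = "<motion>" + " ".join(f"<m_{i}>" for i in ids) + "</motion>"
            content = content.replace(placeholder, token_str)
        rendered += format_turn(turn["role"], content)
    return rendered


# Toy usage with a handful of fake motion token ids.
print(render_conversation(
    [{"role": "user", "content": "Edit <motion_001> so the first person is more gentle."},
     {"role": "assistant", "content": "<motion_002>"}],
    {"motion_001": [12, 57, 3], "motion_002": [40, 8, 91]},
))
```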
Two friends meet after a long time and greet each other like [source motion]. What if one of the friends becomes overly touched upon meeting the other friend?
Let's have a scene where two people are interacting, like the motion you see in [source motion]. What if the person who is walking up is a bit more respectful?
Two individuals are performing a self-defense technique, where one person strikes with the left hand, and the other intercepts the attack. It's similar to [source motion]. The first person seems too aggressive. Can you make the second person counter in a more forceful and confident way?
Two friends are meeting up like in [source motion]. The first person seems way too aggressive. Can you make this person more gentle?
One person pats the other on the back, and the other person turns to look.
The first person holds onto the second's right forearm, then stumbles and drags them down.
One person approaches and massages the other's shoulders using both hands.
One person steps forward and steps on the left foot of the other person.
Two people sit facing each other, taking turns to play rock-paper-scissors by waving their right arms to the right three times each.
Two people face each other and raise both hands in front of their heads. Then, they move forward and clap.
The initial individual is seated on the chair. The subsequent individual approaches from the left of the first person, grasps his/her left arm with both hands, and assists him/her in rising.
Two people walk towards each other, and when they meet, their arms collide.
bibtex here