📍 CHI 2026 · Barcelona
🏅 Honorable Mention Award

DancingBox: A Lightweight MoCap System for
Character Animation from Physical Proxies

1 University of Edinburgh  ·  2 Inria, Université Côte d'Azur  ·  3 Tsinghua University
DancingBox teaser: a user manipulates a banana as a proxy puppet; the system generates a realistic 3D character animation.
TL;DR We bridge vision foundation models with generative AI to propose a new technical path toward accessible motion capture. No more heavy systems — just a smartphone camera and an everyday object as your actor.

Abstract

Creating compelling 3D character animations typically requires either expert use of professional software or expensive motion capture systems operated by skilled actors. We present DancingBox, a lightweight, vision-based system that makes motion capture accessible to novices by reimagining the process as digital puppetry. Instead of tracking precise human motions, DancingBox captures the approximate movements of everyday objects manipulated by users with a single webcam. These coarse proxy motions are then refined into realistic character animations by conditioning a generative motion model on bounding-box representations, enriched with human motion priors learned from large-scale datasets. To overcome the lack of paired proxy–animation data, we synthesize training pairs by converting existing motion capture sequences into proxy representations. A user study demonstrates that DancingBox enables intuitive and creative character animation using diverse proxies — from plush toys to bananas — lowering the barrier to entry for novice animators.

Method

DancingBox consists of two core modules linked by a compact bounding-box representation: a vision-based Motion Capture (MoCap) module that tracks proxy objects in a webcam video, and a diffusion-based Motion Generation (MoGen) module that lifts those coarse box trajectories into realistic full-body skeletal animation.

Step 1

Vision-Based Motion Capture

DancingBox MoCap pipeline: user clicks → SAM2 segmentation → π³ point cloud → bounding box tracking
Fig. 3 — MoCap pipeline. From left to right: user-provided seed clicks, SAM2 part segmentation propagated across frames, monocular 3D reconstruction, and oriented bounding-box tracking.

Given a short webcam recording of a user manipulating any physical object, DancingBox estimates a sequence of 3D oriented bounding boxes — one per articulated part — without any specialised hardware. First, the user marks the desired parts with a few clicks on the first frame, and SAM2 propagates the resulting segments across all frames, robustly handling occlusions. Next, π³ (Pi3) reconstructs a dense 3D point cloud from each monocular frame. Finally, an oriented bounding box is fitted to the first-frame point cloud via PCA, then tracked across time through SVD-based rigid alignment (the Kabsch–Umeyama algorithm), yielding a compact box-motion sequence ready for generation.
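The two geometric steps above — PCA-based box fitting and SVD-based rigid alignment — can be sketched in a few lines of numpy. This is a minimal illustration, not the authors' implementation; the function names `fit_obb` and `kabsch` are ours, and the sketch omits the scale term of the full Kabsch–Umeyama formulation since box tracking only needs rotation and translation.

```python
import numpy as np

def fit_obb(points):
    """Fit an oriented bounding box to an (N, 3) point cloud via PCA.

    Returns (center, axes, half_extents), where the columns of `axes`
    are the principal directions of the cloud.
    """
    center = points.mean(axis=0)
    centered = points - center
    # Principal axes from the SVD of the centered cloud
    # (equivalent to an eigen-decomposition of the covariance).
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    axes = vt.T
    half_extents = np.abs(centered @ axes).max(axis=0)
    return center, axes, half_extents

def kabsch(src, dst):
    """Rigid transform (R, t) with R @ src_i + t ≈ dst_i (no scale)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    # Cross-covariance between the two centered clouds.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Guard against reflections so R is a proper rotation.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t
```

In a tracking loop, `fit_obb` would run once on the first frame's segmented points, and `kabsch` would align each part's points from frame *t* to frame *t+1*, composing the per-frame rigid transforms into the box-motion sequence.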

▸ Bounding-box motion sequence passed to MoGen
Step 2

Box-Guided Motion Generation

DancingBox MoGen architecture: permutation-invariant box encoder + ControlNet conditioning on MDM
Fig. 4 — MoGen architecture. The box-motion encoder conditions a ControlNet adapter built on top of the pre-trained MDM denoiser. Spatial guidance refines trajectory alignment at inference time.

The box-motion sequence is consumed by a ControlNet-style adapter built on top of a pre-trained human motion diffusion model (MDM). A permutation-invariant encoder aggregates temporal box features, handling the variable number of parts produced by different proxies (e.g., 4 boxes for a banana, 6 for a humanoid puppet). The encoder output is injected into the frozen denoiser via zero-initialised linear blocks for stable training. At inference, a lightweight spatial guidance mechanism further improves trajectory alignment without any additional training.
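Two properties of this design can be sketched concretely: a pooling-based encoder is invariant to box ordering (so 4 banana boxes and 6 puppet boxes are handled uniformly), and a zero-initialised injection contributes nothing at the start of training, leaving the frozen denoiser's output untouched. The numpy sketch below is illustrative only — the actual MoGen module is a learned transformer adapter, and `encode_boxes` and its shapes are our assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(7)

def encode_boxes(box_feats, W):
    """Toy permutation-invariant encoder: per-box projection, then mean-pool.

    box_feats: (num_boxes, feat_dim) box features for one frame;
    num_boxes may differ between proxies.
    """
    return np.tanh(box_feats @ W).mean(axis=0)

feat_dim, hidden = 9, 16
W = rng.normal(size=(feat_dim, hidden))

# Variable part counts, e.g. 4 boxes for a banana, 6 for a humanoid puppet.
banana = rng.normal(size=(4, feat_dim))
puppet = rng.normal(size=(6, feat_dim))
z_banana = encode_boxes(banana, W)   # both map to the same latent size
z_puppet = encode_boxes(puppet, W)

# Shuffling the box order leaves the encoding unchanged.
perm = rng.permutation(6)
z_shuffled = encode_boxes(puppet[perm], W)

# Zero-initialised injection: at training step 0 the adapter adds nothing,
# so the frozen denoiser's activations pass through unchanged.
W_zero = np.zeros((hidden, hidden))
denoiser_hidden = rng.normal(size=(hidden,))
injected = denoiser_hidden + z_puppet @ W_zero
```

Mean-pooling is the simplest permutation-invariant aggregator; attention-based pooling has the same invariance and is the more likely choice in a real adapter.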

User Study Materials

Questionnaires used in the user study, provided for reproducibility.

📋 Pre-study Questionnaire 📋 Post-study Questionnaire

Citation

To be announced.