
Data preparation for more keypoints #25

Open

AmitMY opened this issue Dec 23, 2024 · 1 comment

AmitMY commented Dec 23, 2024

This looks really interesting, and I would love to try this for sign language motion diffusion.

My data consists of MediaPipe poses with 73 joints (body and two hands) plus 468 face-mesh keypoints (reduced to 105 contour points), for a total of 178 keypoints, or 534 floats.

I downloaded the tiny HumanML3D dataset and tried to understand the data structure:

  • Mean.npy and Std.npy contain (263,) floats. Why 263? I thought these were X, Y, Z positions, but 263 is not divisible by 3.
  • texts/{id}.txt contains multiple prompts (one per row), formatted as {text}#{part of speech}#0.0#0.0. Is the part-of-speech tag necessary?
  • new_joint_vecs/{id}.npy contains ({num frames}, 263) arrays. It is still not clear to me why 263. (See the loading sketch below.)
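For context, here is roughly how I am reading the files (a minimal sketch; the file id 000001 is just an illustrative placeholder):

```python
import numpy as np

# Per-dimension statistics computed over the whole dataset.
mean = np.load("Mean.npy")   # shape (263,)
std = np.load("Std.npy")     # shape (263,)

# One motion clip: a (num_frames, 263) feature matrix.
motion = np.load("new_joint_vecs/000001.npy")
print(mean.shape, std.shape, motion.shape)

# Each line of texts/{id}.txt: {caption}#{POS-tagged tokens}#0.0#0.0
with open("texts/000001.txt") as f:
    for line in f:
        caption, pos_tokens, start, end = line.strip().split("#")
        print(caption, pos_tokens, start, end)
```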

Further questions on training:

  • My data includes text that is not covered by the GloVe vocabulary. Is there a way for me to use a byte-based tokenizer?
  • My data also includes an image for every text-pose pair. Is there a way to add that as a control signal, through CLIP?
Dai-Wenxun (Owner) commented

263 is the feature dimension of this motion dataset. The feature vector consists of root velocity, root height, joint positions, etc., not simply X, Y, and Z positions. Mean.npy and Std.npy are used to normalize the motion features. As for the text, that is the format defined by the HumanML3D dataset; of course, you can change the dataset-loading logic to accommodate your needs.
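For reference (not spelled out in this thread, but this is the commonly documented HumanML3D layout for the 22-joint skeleton): the 263 dimensions decompose as root angular velocity (1) + root linear velocity on the XZ plane (2) + root height (1) + local joint positions (21×3) + 6D joint rotations (21×6) + local joint velocities (22×3) + foot contacts (4) = 263. Normalization is then a simple per-dimension affine transform; a minimal sketch:

```python
import numpy as np

mean = np.load("Mean.npy")                     # (263,)
std = np.load("Std.npy")                       # (263,)
motion = np.load("new_joint_vecs/000001.npy")  # (num_frames, 263)

normed = (motion - mean) / std   # what the model consumes
restored = normed * std + mean   # inverse transform after sampling
```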

GloVe is used for evaluation, not for training the diffusion model.
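Regarding the byte-based tokenizer and the image control: a hedged sketch, assuming the diffusion model's text encoder is CLIP (common in motion diffusion codebases; check this repo's config to confirm). CLIP's BPE tokenizer accepts arbitrary strings, so words outside the GloVe vocabulary only matter for the GloVe-based evaluation, and the same CLIP model exposes an image encoder that could supply an extra conditioning signal. The file name frame.png below is a hypothetical placeholder:

```python
import clip   # pip install git+https://github.com/openai/CLIP.git
import torch
from PIL import Image

device = "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# BPE tokenization handles out-of-GloVe-vocabulary text.
tokens = clip.tokenize(["a person signs HELLO"]).to(device)  # (1, 77)
with torch.no_grad():
    text_feat = model.encode_text(tokens)                    # (1, 512)

    # Hypothetical extra control signal from the paired image.
    img = preprocess(Image.open("frame.png")).unsqueeze(0).to(device)
    img_feat = model.encode_image(img)                       # (1, 512)
```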

I suggest you run the training script to see the architecture of the diffusion models, since it is hard to explain how to build sign language motion diffusion from scratch here. I'm sure this framework can do it.
