This looks really interesting, and I would love to try this for sign language motion diffusion.
My data is MediaPipe poses, with 73 joints (body and two hands) as well as 468 face mesh keypoints (reduced to 105 for contour), for a total of 178 keypoints or 534 floats.
I downloaded the tiny HumanML3D dataset and tried to understand the data structure:
Mean.npy and Std.npy contain arrays of shape (263,). Why 263? I thought these were X, Y, Z positions, but 263 is not divisible by 3.
texts/{id}.txt contains multiple prompts (one per line), formatted as {text}#{part of speech}#0.0#0.0. Is the part-of-speech annotation necessary?
new_joint_vecs/{id}.npy contains arrays of shape ({num frames}, 263). It is still not clear to me why 263.
Further questions on training:
My data includes text that is not covered by GloVe. Is there a way for me to use a byte-based tokenizer?
My data also includes an image for every text-pose pair. Is there a way to add that as a conditioning signal, e.g., through CLIP?
263 is the feature dimension of this motion dataset. The feature vector consists of root velocity, root height, local joint positions, and so on, not simply X, Y, Z positions. Mean.npy and Std.npy are used to normalize the motion features. As for the text, that is the format defined by the HumanML3D dataset; of course, you can change the dataset-loading logic to accommodate your needs.
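For reference, if this repo follows the standard HumanML3D representation from Guo et al. (2022) unchanged, the 263 dimensions break down as below for the 22-joint skeleton. The file ids in the sketch are hypothetical:

```python
import numpy as np

# Standard HumanML3D feature layout (Guo et al., 2022) for J = 22 joints;
# assumption: this repo uses the representation unchanged.
J = 22
dims = {
    "root angular velocity (about y)": 1,
    "root linear velocity (xz plane)": 2,
    "root height": 1,
    "local joint positions": (J - 1) * 3,  # 63
    "6D joint rotations": (J - 1) * 6,     # 126
    "local joint velocities": J * 3,       # 66
    "foot contact labels": 4,
}
assert sum(dims.values()) == 263

# Mean.npy / Std.npy hold per-channel statistics for normalization:
mean = np.load("Mean.npy")                     # shape (263,)
std = np.load("Std.npy")                       # shape (263,)
motion = np.load("new_joint_vecs/000001.npy")  # hypothetical id, shape (num_frames, 263)
motion_norm = (motion - mean) / std

# Each line of texts/{id}.txt is "caption#tokens#start#end"; the tokens are
# word/POS pairs used for evaluation, and 0.0#0.0 means the caption covers
# the whole clip.
with open("texts/000001.txt") as f:            # hypothetical id
    caption, pos_tokens, start, end = f.readline().strip().split("#")
```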
GloVe is used for evaluation, not for training the diffusion model.
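To expand on that: since GloVe only feeds the evaluation metrics, the conditioning text encoder is free to handle arbitrary vocabulary. A minimal sketch, assuming conditioning goes through OpenAI's CLIP as in MDM-style codebases; CLIP's byte-pair tokenizer falls back to byte-level pieces, so out-of-vocabulary words are not a problem:

```python
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# CLIP's BPE tokenizer degrades gracefully to byte-level pieces,
# so no GloVe vocabulary lookup is needed for the conditioning signal.
tokens = clip.tokenize(["a signer fingerspells a name"]).to(device)
with torch.no_grad():
    text_feat = model.encode_text(tokens)  # shape (1, 512) for ViT-B/32
```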
I suggest you run the training script and inspect the architecture of the diffusion models, since it is hard to explain how to build sign language motion diffusion from scratch. I'm sure this framework can do it.