
WIP: ltx-video support #491

Draft · wants to merge 2 commits into base: master
Conversation

stduhpf
Contributor

@stduhpf stduhpf commented Dec 1, 2024

For now, the diffusion model seems to load into memory. The 128-D VAE is still completely unimplemented, and the forward logic might be off.

TODO:

  • Rebase on master properly
  • Figure out how to work around `tensor 'model.diffusion_model.proj_out.weight' has wrong shape in model file: got [1, 1, 2048, 128], expected [2048, 128, 1, 1]` (the diffusion model will hopefully load properly after that)
  • VAE support
  • Figure out modulation order
  • Make it generate video
  • Use it to implement PixArt-Alpha too

@Amin456789

Amin456789 commented Dec 3, 2024

This is great, thank you so much, can't wait to test it when it's done.

Please support quantized LTX models and conversion too; there are fp8 models on Hugging Face, and fp16 GGUFs are there as well.

Please add support for CPU users too.

It would be great if you could make im2vid work.

Also, in the LTX-Video playground on Hugging Face there are advanced options that let us make videos up to 11 seconds, e.g. 512x320 resolution with 257 frames. It would be great if we could make long videos here too.

@stduhpf
Contributor Author

stduhpf commented Dec 3, 2024

Conversion/quantization should be working already.

@stduhpf
Contributor Author

stduhpf commented Dec 8, 2024

The 5D tensors in the VAE are a pain to deal with. I'm losing motivation...

@Amin456789

Amin456789 commented Dec 8, 2024

It's ok, this is a hard task; we have a non-working SVD too, so it seems video is harder to implement in sd.cpp. Thank you for your hard work.
Let me summon the LTX staff themselves:
@yoavhacohen
Could you give us a hand here, please? We seem to be having a hard time bringing video to cpp here.

@ggerganov
Contributor

> The 5D tensors in the VAE are a pain to deal with.

@stduhpf Which operators require more than 4 dimensional tensors? Can these tensors be transformed to less dimensions? Maybe with appropriate combination of ggml_view and ggml_reshape it can be done. We had similar issues when implementing SAM and I remember we managed to avoid the need for 5D tensors which were used in the original implementation. Hopefully there is a workaround, since adding support for 5D and more dimensions to ggml would be very difficult.

@stduhpf
Contributor Author

stduhpf commented Dec 8, 2024

@ggerganov Basically the whole VAE is made of 3D convolutions, so this means a 3x3x3 kernel for each input/output channel pair. Maybe there is a way to flatten it to use Conv2d instead, but I couldn't figure it out.

@ggerganov
Contributor

Hm, indeed it's not obvious. I guess we will need to increase the GGML_MAX_DIMS at some point.

@yoavhacohen

> It's ok, this is a hard task, we have a non-working SVD too, it seems video is harder to implement in sd.cpp, thank you for your hard work. Let me summon the LTX staff themselves: @yoavhacohen could you give us a hand here please, we seem to be having a hard time bringing video to cpp here.

It’s understandable that this task is challenging, and I appreciate everyone’s efforts so far. Based on the comments, the issue seems to stem from the lack of a conv3d implementation in the GGML library.

Although I’m not familiar with GGML, I noticed that conv2d is implemented using im2col and matmul:
https://github.com/ggerganov/ggml/blob/a5960e80d3e65ce6ff18f90315ab96f63cf9c4cc/src/ggml.c#L3884

The same principle can be extended to conv3d using a 3D version of im2col. Here’s a high-level approach:

Implementing conv3d:

You can create an im2col_3d tensor and perform matrix multiplication for convolution, similar to the conv2d implementation. Below is sample (untested) code:

```c
// a: [OC, IC, KD, KH, KW]
// b: [N, IC, ID, IH, IW]
// result: [N, OC, OD, OH, OW]
struct ggml_tensor * ggml_conv_3d(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * b,
        int                   s0, // stride depth
        int                   s1, // stride height
        int                   s2, // stride width
        int                   p0, // padding depth
        int                   p1, // padding height
        int                   p2, // padding width
        int                   d0, // dilation depth
        int                   d1, // dilation height
        int                   d2) { // dilation width
    // Create im2col tensor for 3D input
    struct ggml_tensor * im2col = ggml_im2col_3d(ctx, a, b, s0, s1, s2, p0, p1, p2, d0, d1, d2, a->type); // [N, OD, OH, OW, IC * KD * KH * KW]

    // Perform matrix multiplication for the convolution
    struct ggml_tensor * result =
        ggml_mul_mat(ctx,
                ggml_reshape_2d(ctx, im2col, im2col->ne[0], im2col->ne[4] * im2col->ne[3] * im2col->ne[2] * im2col->ne[1]), // [N, OD, OH, OW, IC * KD * KH * KW] => [N*OD*OH*OW, IC * KD * KH * KW]
                ggml_reshape_2d(ctx, a, (a->ne[0] * a->ne[1] * a->ne[2] * a->ne[3]), a->ne[4]));                          // [OC, IC, KD, KH, KW] => [OC, IC * KD * KH * KW]

    // Reshape the result back to a 5D tensor
    result = ggml_reshape_5d(ctx, result, im2col->ne[1], im2col->ne[2], im2col->ne[3], im2col->ne[4], a->ne[4]); // [OC, N, OD, OH, OW]
    result = ggml_cont(ctx, ggml_permute(ctx, result, 0, 1, 4, 3, 2)); // [N, OC, OD, OH, OW]

    return result;
}
```

Implementing im2col_3d:

Since GGML lacks im2col_3d, you can emulate it using a composition of two im2col operations:

  • Step 1: Apply 1D im2col along the depth dimension.
  • Step 2: Apply 2D im2col for the height and width dimensions.
```c
struct ggml_tensor * ggml_im2col_3d(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * b,
        int                   s0, // stride depth
        int                   s1, // stride height
        int                   s2, // stride width
        int                   p0, // padding depth
        int                   p1, // padding height
        int                   p2, // padding width
        int                   d0, // dilation depth
        int                   d1, // dilation height
        int                   d2, // dilation width
        enum ggml_type        dst_type) {

    // Step 1: Perform 1D im2col along the depth dimension
    const int64_t OD = ggml_calc_conv_output_size(b->ne[2], a->ne[2], s0, p0, d0); // Depth
    const int64_t IH = b->ne[3];
    const int64_t IW = b->ne[4];
    const int64_t IC_KD = b->ne[1] * a->ne[2]; // IC * KD

    const int64_t ne1[5] = { IC_KD, IW, IH, OD, b->ne[0] }; // Intermediate tensor shape: [N, OD, IH, IW, IC * KD]

    struct ggml_tensor * intermediate = ggml_new_tensor(ctx, dst_type, 5, ne1);

    int32_t params_1d[] = { s0, 1, p0, 0, d0, 1 }; // Stride and padding for depth dimension
    ggml_set_op_params(intermediate, params_1d, sizeof(params_1d));
    intermediate->op     = GGML_OP_IM2COL; // Use existing im2col operation
    intermediate->src[0] = a;
    intermediate->src[1] = b;

    // Step 2: Perform 2D im2col on the intermediate tensor for height and width
    const int64_t OH = ggml_calc_conv_output_size(IH, a->ne[3], s1, p1, d1); // Height
    const int64_t OW = ggml_calc_conv_output_size(IW, a->ne[4], s2, p2, d2); // Width

    const int64_t ne2[5] = { IC_KD * a->ne[3] * a->ne[4], OW, OH, OD, b->ne[0] }; // Final output shape: [N, OD, OH, OW, IC * KD * KH * KW]

    struct ggml_tensor * result = ggml_new_tensor(ctx, dst_type, 5, ne2);

    int32_t params_2d[] = { s2, s1, p2, p1, d2, d1 }; // Stride and padding for height and width dimensions
    ggml_set_op_params(result, params_2d, sizeof(params_2d));
    result->op     = GGML_OP_IM2COL; // Use existing im2col operation
    result->src[0] = a;              // Use filter tensor
    result->src[1] = intermediate;   // Intermediate tensor from step 1

    return result;
}
```

As already stated, GGML_MAX_DIMS would need to be increased to 5 to support 5D tensors.

These are just starting points and will need testing and optimization.
Does this align with your understanding? Are there additional constraints or goals that we should consider?

4 participants