What space does the transformation matrix given by Face Landmarker transform to? #5883

flamingotech commented Mar 5, 2025

Have I written custom code (as opposed to using a stock example script provided in MediaPipe)

None

OS Platform and Distribution

Windows

MediaPipe Tasks SDK version

0.10.14

Task name (e.g. Image classification, Gesture recognition etc.)

Face Landmarker

Programming Language and version (e.g. C++, Python, Java)

Java

Describe the actual behavior

When a face is detected, a quad that is supposed to have the same size as the bounding box of the detected face is always drawn at the center of the screen. Its width did not change with the detected face width, and its height changed only slightly with the detected face height.

Describe the expected behaviour

When a face is detected, the quad, which has the same size as the bounding box of the detected face, should match the size of the detected face and cover it.

Standalone code/steps you may have used to try to get what you need

I loaded the canonical face from the face_with_iris.obj file, read the vertices (in centimeters) and converted them to meters. I then applied the given transformation matrix to the canonical face to obtain the transformed vertices in an unknown space, computed the minimum and maximum x and y values and the center of the transformed vertices, and used these values to draw a quad with OpenGL.

Other info / Complete Logs

I am working on an Android camera app that detects the face mesh in real time on the camera feed using the MediaPipe Face Landmarker and renders a 3D mesh overlay on the detected face using OpenGL.
When a face mesh is detected in a frame, MediaPipe provides a column-major transformation matrix that transforms the face landmarks of the canonical face model into vertices at the same position as the detected face mesh, in an unknown space. Since the MediaPipe documentation does not explicitly say which space this is, it could be world space or camera space.
https://ai.google.dev/edge/api/mediapipe/java/com/google/mediapipe/tasks/vision/facelandmarker/FaceLandmarkerResult
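
For completeness, this is roughly how the matrix is read from the result (a trimmed sketch, not my exact code; setOutputFacialTransformationMatrixes(true) is enabled in the options so that facialTransformationMatrixes() is populated, and onFaceLandmarkerResult is just a placeholder name for my result callback):

void onFaceLandmarkerResult(FaceLandmarkerResult result) {
    if (result.facialTransformationMatrixes().isPresent()
            && !result.facialTransformationMatrixes().get().isEmpty()) {
        // 16 floats for the first detected face, documented as column-major
        float[] transformMtx = result.facialTransformationMatrixes().get().get(0);
        // transformMtx is then scaled and applied to the canonical vertices as shown below
    }
}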

I experimented with this a bit by drawing a quad based on the transformed vertices using OpenGL ES.

I loaded the canonical face from the face_with_iris.obj file provided on GitHub, read the vertices (in centimeters) and converted them to meters.
Then I applied the given transformation matrix to the canonical face to obtain the transformed vertices.

Tried the following test cases (a sketch of the overall math for case 3 follows the results below):
1. Assumed the transformed vertices are in clip space: used the transformed vertices directly to get the min/max x and y and the center vertex, which are then passed to QuadRenderer for rendering the quad.
2. Assumed the transformed vertices are in camera/view space: applied the derived projection matrix to the transformed vertices, then used the result to get the min/max x and y and the center vertex, which are then passed to QuadRenderer for rendering the quad.
3. Assumed the transformed vertices are in world space: applied the derived view matrix and projection matrix to the transformed vertices, then used the result to get the min/max x and y and the center vertex, which are then passed to QuadRenderer for rendering the quad.

All of the above cases produced the following result:

When a face is detected, the quad, which is supposed to have the same size as the bounding box of the detected face, is always at the center of the screen. Its width and height changed only slightly with the detected face width and height, and the overall size of the quad stayed more or less the same even when the detected face moved back and forth in front of the camera.
However, in case 3 the quad was taller and narrower than in the other two cases. The proportions of the quad drawn in cases 1 and 2 are closer to those of the detected face.
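
For clarity, this is the math I intend case 3 to compute, written as a minimal Java sketch with android.opengl.Matrix (column-major float[16] matrices). The variable names are placeholders for the values described above, and I am not sure whether the perspective divide by w belongs here:

float[] mvp = new float[16];
Matrix.multiplyMM(mvp, 0, projMatrix, 0, viewMatrix, 0);   // mvp = projection * view

float[] in = new float[4];
float[] out = new float[4];
float[] ndc = new float[transformedVertices.length];       // x, y, z per vertex

for (int i = 0; i < transformedVertices.length / 3; i++) {
    in[0] = transformedVertices[3 * i];
    in[1] = transformedVertices[3 * i + 1];
    in[2] = transformedVertices[3 * i + 2];
    in[3] = 1.0f;
    Matrix.multiplyMV(out, 0, mvp, 0, in, 0);               // clip-space position
    ndc[3 * i]     = out[0] / out[3];                       // perspective divide
    ndc[3 * i + 1] = out[1] / out[3];
    ndc[3 * i + 2] = out[2] / out[3];
}
// min/max x/y and the center are then computed from ndc and passed to QuadRenderer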

Which space does the given transformation matrix transform into?
The matrix is supposed to transform the canonical face mesh, which is in model space at its default position and in centimeters, to the detected face position of a specific frame in some space other than model space, right?
What is wrong with my approach to drawing the quad over the detected face?

Here is how I did it in detail:

Since the input image to the Face Landmarker was 4x smaller than the preview, the vertices are scaled back up by a factor of 4 by multiplying the diagonal elements of the given transformation matrix like this, before applying the matrix.

int imgScaleFactor = 4;
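// indices 0, 5 and 10 are the diagonal entries in both row-major and column-major layouts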
transformMtx[0] *= imgScaleFactor;
transformMtx[5] *= imgScaleFactor;
transformMtx[10] *= imgScaleFactor;

// native code to apply transformation matrix
float* transformVerticesColumnMajor(const float* vertices, int numVertices, const float* transformationMtx) {
    // Allocate memory for the transformed vertices (x, y, z only)
    float* transformedVertices = new float[numVertices * 3];

    // Apply the transformation matrix to each vertex
    for (int i = 0; i < numVertices; ++i) {
        // Get the current vertex (x, y, z)
        float x = vertices[i * 3];
        float y = vertices[i * 3 + 1];
        float z = vertices[i * 3 + 2];

        // Treat w as 1.0 for homogeneous coordinates
        float w = 1.0f;

        // Transform the vertex using the matrix (column-major)
        float tx = transformationMtx[0] * x + transformationMtx[1] * y + transformationMtx[2] * z + transformationMtx[3] * w;
        float ty = transformationMtx[4] * x + transformationMtx[5] * y + transformationMtx[6] * z + transformationMtx[7] * w;
        float tz = transformationMtx[8] * x + transformationMtx[9] * y + transformationMtx[10] * z + transformationMtx[11] * w;

        // Store the transformed vertex (x, y, z only)
        transformedVertices[i * 3] = tx;
        transformedVertices[i * 3 + 1] = ty;
        transformedVertices[i * 3 + 2] = tz;
    }

    return transformedVertices;
}
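
For comparison, if the 16 floats are laid out in OpenGL column-major order (each column stored contiguously), the same multiplication can be written on the Java side with android.opengl.Matrix.multiplyMV, which expects exactly that layout (a sketch; the vertex components and transformMtx are the same values as above):

float[] in  = new float[]{x, y, z, 1.0f};
float[] out = new float[4];
// multiplyMV treats transformMtx as a column-major 4x4 and computes out = transformMtx * in
Matrix.multiplyMV(out, 0, transformMtx, 0, in, 0);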

// native code to get the data for drawing quad where the input argument vertices is the float array of transformed vertices
void getMeshData(const float* vertices, int vertexCount, MeshXYZData& meshData) {
    // Initialize center to zero
    meshData.center[0] = 0.0f;
    meshData.center[1] = 0.0f;
    meshData.center[2] = 0.0f;

    // Initialize min and max values
    meshData.minX = vertices[0];
    meshData.maxX = vertices[0];
    meshData.minY = vertices[1];
    meshData.maxY = vertices[1];

    // Sum all vertex positions and find min/max x and y
    for (int i = 0; i < vertexCount; ++i) {
        float x = vertices[3 * i];
        float y = vertices[3 * i + 1];
        float z = vertices[3 * i + 2];

        // Update center
        meshData.center[0] += x;
        meshData.center[1] += y;
        meshData.center[2] += z;

        // Update min and max x
        meshData.minX = std::min(meshData.minX, x);
        meshData.maxX = std::max(meshData.maxX, x);

        // Update min and max y
        meshData.minY = std::min(meshData.minY, y);
        meshData.maxY = std::max(meshData.maxY, y);
    }

    // Divide by the number of vertices to get the average (center)
    meshData.center[0] /= vertexCount;
    meshData.center[1] /= vertexCount;
    meshData.center[2] /= vertexCount;
}


// renderer to draw quad (member fields such as quadVertices, shaderProgram, positionHandle,
// colorHandle, vertexBuffer and the shader source strings are declared elsewhere in the class)
public class QuadRenderer {

    public QuadRenderer() {
    }

    public void createOnGlThread(Context context) throws IOException {

        // Load and compile shaders
        int vertexShader = loadShader(GLES20.GL_VERTEX_SHADER, vertexShaderCode);
        int fragmentShader = loadShader(GLES20.GL_FRAGMENT_SHADER, fragmentShaderCode);

        // Create the shader program
        shaderProgram = GLES20.glCreateProgram();
        GLES20.glAttachShader(shaderProgram, vertexShader);
        GLES20.glAttachShader(shaderProgram, fragmentShader);
        GLES20.glLinkProgram(shaderProgram);

        // Get handles to shader variables
        positionHandle = GLES20.glGetAttribLocation(shaderProgram, "vPosition");
        colorHandle = GLES20.glGetUniformLocation(shaderProgram, "vColor");
    }


    public void draw(float minX, float maxX, float minY, float maxY, float[] ctr, float[] quadColor) {
        // Use the shader program
        GLES20.glUseProgram(shaderProgram);

        translateVertices(minX, maxX, minY, maxY, ctr);

        Log.d("Quad renderer", "MP: quad vertices = " + Arrays.toString(quadVertices));

        // Initialize the vertex buffer
        ByteBuffer bb = ByteBuffer.allocateDirect(quadVertices.length * 4);
        bb.order(ByteOrder.nativeOrder());
        vertexBuffer = bb.asFloatBuffer();
        vertexBuffer.put(quadVertices);
        vertexBuffer.position(0);

        // Pass the vertex data
        GLES20.glEnableVertexAttribArray(positionHandle);
        GLES20.glVertexAttribPointer(positionHandle, 3, GLES20.GL_FLOAT, false, 0, vertexBuffer);

        // Pass the color data
        GLES20.glUniform4fv(colorHandle, 1, quadColor, 0);

        // Draw the quad
        GLES20.glDrawArrays(GLES20.GL_TRIANGLE_FAN, 0, 4);

        // Disable the vertex attribute array
        GLES20.glDisableVertexAttribArray(positionHandle);
    }


    // Helper method to load and compile a shader
    private int loadShader(int type, String shaderCode) {
        int shader = GLES20.glCreateShader(type);
        GLES20.glShaderSource(shader, shaderCode);
        GLES20.glCompileShader(shader);
        return shader;
    }

    private void translateVertices(float minX, float maxX, float minY, float maxY, float[] center) {
        // Set the quad vertices using minX, minY, maxX, maxY, and the z value of the center
        quadVertices[0] = minX;  // Top-left x
        quadVertices[1] = maxY;  // Top-left y
        quadVertices[2] = center[2];  // Top-left z

        quadVertices[3] = minX;  // Bottom-left x
        quadVertices[4] = minY;  // Bottom-left y
        quadVertices[5] = center[2];  // Bottom-left z

        quadVertices[6] = maxX;  // Bottom-right x
        quadVertices[7] = minY;  // Bottom-right y
        quadVertices[8] = center[2];  // Bottom-right z

        quadVertices[9] = maxX;  // Top-right x
        quadVertices[10] = maxY; // Top-right y
        quadVertices[11] = center[2]; // Top-right z
    }
}
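
For context, the per-frame flow on the GL thread is then roughly (a sketch; meshData holds the values filled in by getMeshData above, and the color is arbitrary):

// after getMeshData(...) has filled meshData for the current frame
quadRenderer.draw(meshData.minX, meshData.maxX, meshData.minY, meshData.maxY,
        meshData.center, new float[]{0.0f, 1.0f, 0.0f, 0.5f});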

Here are my approaches to calculating the projection and view matrices. I am not sure whether they are correct.

Projection matrix
// Approach 1
// using intrinsic camera parameters like focal lengths in unit of mm, optical axis position in pixel
// actual projection matrix = [4.092262, 0.0, 1.0, 0.0, 0.0, 5.456349, 1.0, 0.0, 0.0, 0.0, -1.002002, -0.2002002, 0.0, 0.0, -1.0, 0.0]
// this resulted in a quad covering the whole screen

void calculateProjectionMatrix(float* matrix, float fx, float fy, float cx, float cy, int width, int height, float near, float far) {
    float w = static_cast<float>(width);
    float h = static_cast<float>(height);

    matrix[0] = 2.0f * fx / w;
    matrix[1] = 0.0f;
    matrix[2] = 1.0f - 2.0f * cx / w;
    matrix[3] = 0.0f;

    matrix[4] = 0.0f;
    matrix[5] = 2.0f * fy / h;
    matrix[6] = 1.0f - 2.0f * cy / h;
    matrix[7] = 0.0f;

    matrix[8] = 0.0f;
    matrix[9] = 0.0f;
    matrix[10] = -(far + near) / (far - near);
    matrix[11] = -2.0f * far * near / (far - near);

    matrix[12] = 0.0f;
    matrix[13] = 0.0f;
    matrix[14] = -1.0f;
    matrix[15] = 0.0f;
}
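
For comparison, the same pinhole intrinsics can also be mapped to an OpenGL frustum with android.opengl.Matrix.frustumM, which fills a column-major matrix (a sketch assuming fx, fy, cx, cy are in pixels; the top/bottom terms may need swapping depending on whether cy is measured from the top or the bottom of the image):

static float[] projectionFromIntrinsics(float fx, float fy, float cx, float cy,
                                        int width, int height, float near, float far) {
    float[] m = new float[16];
    float left   = -cx * near / fx;
    float right  = (width - cx) * near / fx;
    float bottom = -(height - cy) * near / fy;
    float top    = cy * near / fy;
    // frustumM writes an off-center perspective matrix in column-major order
    Matrix.frustumM(m, 0, left, right, bottom, top, near, far);
    return m;
}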

// Approach 2
// using current camera field of view, preview size
// actual projection matrix = [1.1049107, 0.0, 0.0, 0.0, 0.0, 1.4732143, 0.0, 0.0, 0.0, 0.0, -1.002002, -0.2002002, 0.0, 0.0, -1.0, 0.0]
// this approach is used in the above test cases

glm::mat4 calculateProjectionMatrix(float fov, int width, int height, float near, float far) {
    // Calculate aspect ratio
    float aspectRatio = static_cast<float>(width) / static_cast<float>(height);

    // Create perspective projection matrix using GLM
    glm::mat4 proj = glm::perspective(glm::radians(fov), aspectRatio, near, far);

    return proj;
}
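
The Java equivalent of this, using android.opengl.Matrix, would be roughly the following (a sketch; fovDegrees, width, height, near and far are placeholders, and perspectiveM expects the vertical field of view in degrees):

float[] projMatrix = new float[16];
// perspectiveM fills a column-major perspective matrix, like glm::perspective
Matrix.perspectiveM(projMatrix, 0, fovDegrees, (float) width / height, near, far);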

View matrix
float[] viewMatrix = new float[16];
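// camera at (0, 0, 3), looking at the origin, with +y up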
Matrix.setLookAtM(viewMatrix, 0, 0, 0, 3, 0f, 0f, 0f, 0f, 1.0f, 0.0f);

Other references:
https://github.com/google-ai-edge/mediapipe/wiki/MediaPipe-Face-Mesh
https://github.com/google-ai-edge/mediapipe/tree/master/mediapipe/modules/face_geometry/data