feat: Patents Document Processing with Gemini Notebook #1549

Merged: 7 commits, Dec 18, 2024

Conversation

holtskinner
Collaborator

No description provided.

@holtskinner holtskinner requested a review from a team as a code owner December 17, 2024 19:51


Contributor

@code-review-assist code-review-assist bot left a comment

Hi @holtskinner, code-review-assist is currently reviewing this pull request and will provide feedback shortly.

In the meantime, here's a summary of the changes for you and other reviewers to quickly get up to speed:

This pull request introduces a Gemini notebook for processing patent documents. The notebook uses the Gemini 2.0 Flash model to perform several tasks: classifying the patent granter (US or EU), classifying the invention type (Medical Tech, Computer Vision, Cryptography, or Other), extracting key entities (publication date, application number, etc.), and detecting bounding boxes for figures within the document. The notebook fetches patent PDF URIs from BigQuery, constructs a detailed prompt for Gemini, processes each PDF, and saves the structured results to a new BigQuery table. A comparison with ground truth data is also included, though a direct comparison isn't straightforward due to data format differences. The overall intent is to demonstrate a simplified document understanding pipeline using Gemini, highlighting improvements in comprehensive extraction, workflow simplification, and the elimination of custom model training.

The main changes are within gemini/use-cases/document-processing/patents_understanding.ipynb. The entire file is new, containing code to perform the tasks described above. The code includes functions for fetching data from BigQuery, defining a detailed Gemini prompt with JSON schema for structured output, processing patents using Gemini's controlled generation capabilities, and saving the results back to BigQuery. The notebook also includes sections for installation, authentication (Colab-specific), library imports, project setup, and result comparison.
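To make the "JSON schema for structured output" idea concrete, here is a minimal sketch of what such a schema could look like, built only from the categories named in the summary above. The field names (`granter`, `invention_type`, `figure_boxes`, and so on) and the helper `check_against_schema` are hypothetical and not taken from the notebook; how the schema is handed to the model depends on the SDK the notebook uses, which this thread does not show.

```python
import json

# Hypothetical schema: field names are illustrative, not the notebook's own.
# The enum values come from the categories listed in the PR summary.
PATENT_SCHEMA = {
    "type": "object",
    "properties": {
        "granter": {"type": "string", "enum": ["US", "EU"]},
        "invention_type": {
            "type": "string",
            "enum": ["Medical Tech", "Computer Vision", "Cryptography", "Other"],
        },
        "publication_date": {"type": "string"},
        "application_number": {"type": "string"},
        # One list of integer coordinates per detected figure (layout assumed).
        "figure_boxes": {
            "type": "array",
            "items": {"type": "array", "items": {"type": "integer"}},
        },
    },
    "required": ["granter", "invention_type"],
}


def check_against_schema(record: dict, schema: dict = PATENT_SCHEMA) -> list:
    """Return a list of problems; checks only required keys and enum values."""
    problems = []
    for key in schema["required"]:
        if key not in record:
            problems.append(f"missing required field: {key}")
    for key, spec in schema["properties"].items():
        if key in record and "enum" in spec and record[key] not in spec["enum"]:
            problems.append(f"{key}: {record[key]!r} not in {spec['enum']}")
    return problems


if __name__ == "__main__":
    sample = json.loads('{"granter": "US", "invention_type": "Cryptography"}')
    print(check_against_schema(sample))
```

The checker is deliberately shallow: it is a local sanity check on a parsed response, not a full JSON Schema validator.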

If there's any missing context, please let me know. I did my best to summarize based on the available information.


And now, a little haiku to brighten your day:

Code flows like a stream,
Gemini's wisdom lights the way,
Patents find their form.

Contributor

@code-review-assist code-review-assist bot left a comment

This notebook demonstrates how to use Gemini to extract structured information from patent documents and store the results in BigQuery. The detailed prompt and JSON schema keep the output format consistent and simplify the document understanding pipeline, and the notebook is well-structured and easy to follow.

The PR description is empty, however, so reviewers have to read the code to understand the context of the change; a short description would make review easier. No tests are included either, which is a significant gap for correctness and reliability; even basic tests would help. Some minor code improvements are detailed in the reviews below.

I also recommend adding a section on limitations and error handling. The current implementation handles some basic errors but not cases such as invalid PDF URIs or JSON parsing failures; covering these would make the notebook more robust and user-friendly.
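As a rough illustration of the two failure cases named in this review (invalid PDF URIs and JSON parsing), the sketch below uses only the Python standard library. It assumes the PDF URIs are Cloud Storage `gs://` paths, which this thread does not confirm, and the helper names `is_valid_pdf_uri` and `parse_model_json` are hypothetical rather than functions from the notebook.

```python
import json
from urllib.parse import urlparse


def is_valid_pdf_uri(uri: str) -> bool:
    """Accept only gs:// URIs (assumed scheme) with a bucket and a .pdf path."""
    parsed = urlparse(uri)
    return (
        parsed.scheme == "gs"
        and bool(parsed.netloc)
        and parsed.path.lower().endswith(".pdf")
    )


def parse_model_json(text: str):
    """Parse model output as a JSON object; return (data, error_message)."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return None, "expected a JSON object"
    return data, None


if __name__ == "__main__":
    for uri in ["gs://bucket/patents/doc.pdf", "https://example.com/doc.pdf"]:
        print(uri, is_valid_pdf_uri(uri))
    print(parse_model_json('{"granter": "EU"}'))
    print(parse_model_json("not json"))
```

Returning an error message instead of raising lets a batch loop skip a bad document and record the reason, and the same functions double as the kind of basic tests the review asks for.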

@holtskinner holtskinner merged commit 3e99071 into main Dec 18, 2024
9 checks passed
@holtskinner holtskinner deleted the patents-gemini branch December 18, 2024 19:42