Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adding first pass at Dockerfile #12

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

amazingvince
Copy link
Contributor

No description provided.

Dockerfile Outdated
@@ -0,0 +1,18 @@
FROM ubuntu:latest
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to use an ubuntu image rather than python:3-alpine so you don't have to install python and the image can be smaller?

https://hub.docker.com/_/python

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function pthread_attr_setaffinity_np is not on alpine linux and needed for pytorch. I am not sure if there is a slimmer distro than ubuntu. Open to try something if you have any suggestions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer we use ubuntu and maybe even package the model weights in the dockerfile (you're going to have to download it anyway, and we pin the revision in docquery).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On my local when using it to develop I mount my machines huggingface cache. But I think as a general easy to use docker container that would simplify things and be respectful of huggingfaces resources.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to build from source than? Or should I just pip install docquery?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it depends on how we want to publish it. We should probably get a better release cadence going and pin the Dockerfile to whatever the latest release is. I think the way to do that would be to somehow provide the version as a parameter to the Dockerfile and populate the parameter by looking at https://github.com/impira/docquery/blob/main/src/docquery/version.py.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docker run IMAGE_NAME docquery scan "What is the invoice number?" https://templates.invoicehome.com/invoice-template-us-neat-750px.png

Seems like a really nice way to use the tool.

Dockerfile Outdated
FROM ubuntu:latest

RUN apt-get update \
&& apt-get install -y python3-pip python3-dev \

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this need to be specific about which python3 version to use? I think some of the dependencies here require > 3.6.

RUN apt-get update && apt-get install -y python3.9 python3.9-dev python3-pip

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I switched to python:3.10-slim-bullseye. So we will be using 3.10.

@ankrgyl
Copy link
Contributor

ankrgyl commented Sep 10, 2022 via email

@ankrgyl
Copy link
Contributor

ankrgyl commented Sep 11, 2022

@amazingvince how does the scan command work from the docker container? Specifically, if you run something like docker ... scan "What is the invoice number?" /path/to/fs how is it able to access files outside the container?

@ankrgyl
Copy link
Contributor

ankrgyl commented Sep 11, 2022

Yeah but what happens if you point to a file on your local filesystem?

@amazingvince
Copy link
Contributor Author

@ankrgyl It would be something like $ docker run -v $(pwd) ... docquery scan "What is the invoice number?" .
Something is missing there and also might change based on system env like windows

Copy link
Contributor

@ankrgyl ankrgyl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh cool, that looks good. I think as long as we document it that works. This is very exciting, and I think we are in the homestretch.

To land this we need to sort out a few more things:

  • Add a command to make, like make docker, that builds the Dockerfile (with the appropriate container/tag names)
  • Add documentation to the README that shows how to run the scan command and how to add local files
  • Add something to the tests (maybe a sanity test that the container can be built, or a mode that builds the container and then runs the tests inside of it)

I've also filed some follow ups: #27 and #28. We can address these in follow ups.

COPY ["src/", "./src"]

RUN pip install .
CMD ["python3"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we make the default entrypoint python3 -m docquery.cmd (or just docquery)?

COPY ["README.md", "pyproject.toml", "setup.py", "./"]
COPY ["src/", "./src"]

RUN pip install .
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should either pip install .[all] or a select list of extensions, e.g. [donut] (I'm currently working on adding [web] which will contain extras for web scraping).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also manually install transformers (the same version that's suggested in the README)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants