This introduction covers what is absolutely necessary to get you up and running to host large language models like Meta's latest Llama 3 (released April 18, 2024) and other models on your own computer.
Follow these general rules of thumb so that you spend less time troubleshooting and more time using these models:
- Consider how big the models are and use small variants if necessary
- Have more system RAM and GPU VRAM than you think you need
- Load the entire model into RAM or VRAM; do not swap!
- If using a GPU, use an Nvidia GPU
The following tutorials will require:
- Software: This tutorial was developed on Microsoft Windows 10 with Windows Subsystem for Linux 2 (WSL2) and Docker
- Hardware: The size of the large language model you can run is dependent on the amount of...
- (CPU-only) ...system RAM you have; 8 GB RAM is the minimum for CPU-only processing, but more is welcome
- (GPU Acceleration) ...Nvidia GPU VRAM and system RAM you have; 4 GB VRAM and 8 GB system RAM are the minimum for GPU processing, but more would be better
Minimum Required GPU VRAM / System RAM (GB) | Example Models & Variants | Example Desktop GPU |
---|---|---|
6 / 12 | • EleutherAI GPT-Neo (125 Million → 1.3 Billion Parameter) • Google FLAN-T5 (Medium → Large) • Meta AI (Facebook) Fairseq Dense (125 Million → 1.3 Billion Parameter) • OpenAI GPT-2 (Small → Extra Large) | Nvidia GeForce GTX 1660 |
8 / 16 | • EleutherAI GPT-Neo (2.7 Billion Parameter) • Meta AI (Facebook) Fairseq Dense (2.7 Billion Parameter) | Nvidia GeForce RTX 3050 |
24 / 48 | • EleutherAI GPT-J (6 Billion Parameter) • Google FLAN-T5 (Extra Large → Extra Extra Large) • Meta AI (Facebook) Fairseq Dense (6.7 Billion Parameter) | Nvidia GeForce RTX 3090 |
NOTE: If you do not have a sufficient GPU but have at least 8 GB of system RAM, you could still try out the smaller GPT-2 models: Instructions for GPT-2 with system RAM.
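NOTE: If you are not sure how much memory you have to work with, you can check from within your WSL2 terminal before choosing a model (a quick sketch using standard tools; the `nvidia-smi` query only works after the Nvidia driver setup covered later):
$ free -h                                             # total system RAM visible to WSL2
$ nvidia-smi --query-gpu=memory.total --format=csv    # total GPU VRAM (requires Nvidia drivers)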
The following are actual responses from these models, which can help with anything from breaking through writer's block in creative work to summarizing dense manuscripts so you can get work done more efficiently.
NOTE: Training, fine-tuning, and prompt engineering are beyond the scope of this tutorial.
- From GPT-2 (Small):
input = "GPT-2 don't be so creepy, you're making me"
output = "GPT-2 don't be so creepy, you're making me cry!"
- From FLAN-T5 (XL):
input = "Give the rationale before answering. Could a single human eat an entire elephant in a year?"
output = "An elephant can weigh up to 20,000 pounds. A single human can eat about 2,000 pounds of food in a year. So the final answer is no."
- From FLAN-T5 (Large):
input = "Summarize the following for a teenager: HQ USSOCOM is seeking information from qualified sources to understand the capabilities available relevant to Enterprise Data Management, Intelligence, Mission Planning, Command and Control system as part of market research for software-based capabilities. USSOCOM currently has multiple disparate software systems that are utilized to execute a myriad of functions related to mission data, insights, intelligence, planning, and execution. USSOCOM seeks best of breed capabilities for end-to-end solutions to execute these functions while also permitting for the future integration into the greater USSOCOM ecosystem of Software systems for data analytics, intelligence, mission planning, and mission execution. The scope of this effort is for both a readily available end to end system that can be deployed to meet existing in the quickest manner possible while also identifying the best candidate to provide space for modernization with the systems of systems approach USSOCOM is taking. For the purposes of this RFI, the period of performance is assumed to be a 12-month base year starting 1 May 2023 with four 12-month option years."
output = "Read the RFI."
- From GPT-J (6B):
input = "Tell me a joke about turtles."
output = "Turtles don't like peanut butter."
input = "That's not a joke but it did make me laugh."
output = "What do you think about death?"
The following instructions will use two Docker images to quickly deploy a browser-based interface that can leverage multiple models appropriate to your available RAM (8 GB system RAM minimum) or VRAM (Nvidia GPU with 4 GB VRAM minimum).
NOTE: To take advantage of GPU acceleration in this tutorial, you must have an Nvidia GPU, as only Nvidia instructions are provided below.
Pick one of the following methods to install Ollama, CPU-only or GPU accelerated:
2.1.1. CPU-only Instructions (click here to expand)
2.1.1.1. Start Ollama service from a Docker container:
- NOTE: The command below will not leverage your GPU, even if you have a supported one
$ docker run -d -p 11434:11434 -v ollama:/root/.ollama --name ollama --restart always ollama/ollama
OR
2.1.2. GPU Accelerated Instructions (click here to expand)
2.1.2.1. Ensure that `nvidia-smi` displays GPU information in commandline:
$ nvidia-smi
Thu May 30 21:12:51 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.73.01 Driver Version: 552.12 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce GTX 1650 On | 00000000:01:00.0 Off | N/A |
| N/A 48C P8 1W / 50W | 1MiB / 4096MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
If you do not see output like the above in your WSL2 environment, install the most recent Nvidia drivers for your GPU in Windows (not through WSL2): https://www.nvidia.com/download/index.aspx
NOTE: "Once a Windows NVIDIA GPU driver is installed on the system, CUDA becomes available within WSL 2...therefore users must not install any NVIDIA GPU Linux driver within WSL 2." -Nvidia CUDA on WSL2 User Guide
After installation of the GPU drivers, the `nvidia-smi` command should now work within your WSL2 terminal
2.1.2.2. Install the Nvidia Container Toolkit:
$ curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
$ sudo apt update && sudo apt -y upgrade
$ sudo apt install -y nvidia-container-toolkit
$ sudo nvidia-ctk runtime configure --runtime=docker
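- NOTE: If you want to confirm what the configuration step changed, `nvidia-ctk runtime configure` adds an `nvidia` runtime entry to Docker's daemon configuration, which you can inspect (exact contents vary by Docker and toolkit version):
$ cat /etc/docker/daemon.json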
2.1.2.3. You must now restart Docker in WSL2 for the Nvidia configuration to take effect; the easiest way is to restart your computer
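- NOTE: If Docker is installed natively inside your WSL2 distribution (rather than through Docker Desktop), restarting the Docker service may be enough to pick up the new runtime without a reboot (a sketch, assuming a service-based install):
$ sudo service docker restart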
2.1.2.4. Start Ollama service from a Docker container with Nvidia GPU support:
- NOTE: The command below will only work if your Nvidia GPU was correctly configured for Docker in the steps above
$ docker run -d -p 11434:11434 --gpus=all -v ollama:/root/.ollama --name ollama --restart always ollama/ollama
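- NOTE: To double-check that the container can actually see your GPU, you can run `nvidia-smi` inside it; with a correctly configured Nvidia Container Toolkit, the same table from step 2.1.2.1 should appear (a sketch, assuming the container is named `ollama` as above):
$ docker exec -it ollama nvidia-smi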
2.1.3. Confirm the Ollama webserver is running by visiting `localhost:11434` and seeing that "Ollama is running":
- NOTE: Do not download any models for Ollama through the CLI; they will not register with Open WebUI. We will download models through Open WebUI instead
2.2.1. Start Open WebUI from a Docker container (this will take a few minutes):
$ docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main
2.2.2. Access the Open WebUI website by visiting `localhost:3000`:
2.2.3. Click on "Sign up" at the bottom and register with any username/email (can be fake)/password, and click "Create Account":
2.2.4. Click on bottom right icon and select "Settings":
2.2.5. Click on "Models" and in the text box for "Pull a model from Ollama.com," type in an appropriate model from the table below, and click on the download icon to the right:
Model | Memory Usage Minimum (System RAM or GPU VRAM) |
---|---|
llama3:8b-instruct-q3_K_S | At least 4 GB |
dolphin-mistral:7b | >4 GB but less than 8 GB |
llama3:8b | >4 GB but less than 8 GB |
dolphin-mistral:7b-v2.8-fp16 | At least 16 GB |
llama3:8b-instruct-fp16 | >16 GB but less than 20 GB |
More models at: ollama.com/library
2.2.6. Wait for the model to download, have its integrity checked, and be properly registered into your Open WebUI environment:
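- NOTE: To double-check that the model registered with the underlying Ollama service, you can query Ollama's REST API, which lists locally available models (a sketch using curl; the exact JSON returned varies by version):
$ curl http://localhost:11434/api/tags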
2.2.7. Exit from Settings; by default you will be in a "New Chat." Click "Select a model" at the top, choose a model, and input a prompt at the bottom like, "Write a haiku about artificial intelligence":
2.2.8. Depending on CPU (slow) or GPU (fast) processing, it may take a minute to respond:
Silent thoughts weave,
Artificial Intelligence blooms;
Awakening life.
– dolphin-mistral
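If you prefer working outside the browser, the same Ollama service on port 11434 can also be prompted directly through its REST API (a sketch using curl; swap in whichever model tag you downloaded through Open WebUI):
$ curl http://localhost:11434/api/generate -d '{"model": "dolphin-mistral:7b", "prompt": "Write a haiku about artificial intelligence.", "stream": false}'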
Awesome! Now you've got a powerful large language model installed locally on your computer. If you disconnected from the internet, the Open WebUI and dolphin-mistral model would still work!
Keep up to date with the leaderboard of models on Hugging Face at https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
NOTE: For Open WebUI with Ollama, available models for Ollama are listed at https://ollama.com/library
The good news is that Open WebUI and Ollama were stood up as containers, so nothing was technically "installed"; you can check the status of the containers and stop or (re)start them:
$ docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
f517b5c25628 ghcr.io/open-webui/open-webui:main "bash start.sh" 47 minutes ago Up 47 minutes (healthy) 0.0.0.0:3000->8080/tcp, :::3000->8080/tcp open-webui
d5f7cd566cce ollama/ollama "/bin/ollama serve" 53 minutes ago Up 53 minutes 0.0.0.0:11434->11434/tcp, :::11434->11434/tcp ollama
$ docker stop open-webui
open-webui
$ docker stop ollama
ollama
$ docker start open-webui
open-webui
$ docker start ollama
ollama
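If a container misbehaves, its logs are often the quickest way to see why (a sketch; `--tail` limits output to the most recent lines):
$ docker logs --tail 20 ollama
$ docker logs --tail 20 open-webui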
Description | Link |
---|---|
What is Windows Subsystem for Linux (WSL)? | https://github.com/atet/wsl |
What are large language models (LLMs)? | https://en.wikipedia.org/wiki/Wikipedia:Large_language_models |
Nvidia CUDA on WSL2 User Guide | https://docs.nvidia.com/cuda/wsl-user-guide |
Open WebUI | https://docs.openwebui.com |
Ollama | https://ollama.com |
Hugging Face LLM Leaderboard | https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard |
- Expanded table of model hardware requirements:
Model | Size Variant | Required System RAM | Required VRAM | Example Desktop GPU |
---|---|---|---|---|
EleutherAI GPT-Neo | 125 Million Parameter (125M) | 4 GB | 2 GB | Nvidia GeForce GTX 1650 |
EleutherAI GPT-Neo | 1.3 Billion Parameter (1.3B) | 12 GB | 6 GB | Nvidia GeForce GTX 1660 |
EleutherAI GPT-Neo | 2.7 Billion Parameter (2.7B) | 16 GB | 8 GB | Nvidia GeForce RTX 3050 |
EleutherAI GPT-J | 6 Billion Parameter (6B) | 32 GB | 16 GB | Nvidia GeForce RTX 3090 |
Google FLAN-T5 | Large | 12 GB | 6 GB | Nvidia GeForce GTX 1660 |
Google FLAN-T5 | Extra Large (XL) | 32 GB | 16 GB | Nvidia GeForce RTX 3090 |
Google FLAN-T5 | Extra Extra Large (XXL) | 48 GB | 24 GB | Nvidia GeForce RTX 3090 |
Meta AI (Facebook) Fairseq Dense | 125 Million → 1.3 Billion Parameter (125M, 355M, 1.3B) | 8 GB | 4 GB | Nvidia GeForce GTX 1650 |
Meta AI (Facebook) Fairseq Dense | 2.7 Billion Parameter (2.7B) | 16 GB | 8 GB | Nvidia GeForce RTX 3050 |
Meta AI (Facebook) Fairseq Dense | 6.7 Billion Parameter (6.7B) | 32 GB | 16 GB | Nvidia GeForce RTX 3090 |
OpenAI GPT-2 | Small → Large | 8 GB | 4 GB | Nvidia GeForce GTX 1650 |
OpenAI GPT-2 | Extra Large (XL) | 12 GB | 6 GB | Nvidia GeForce GTX 1660 |
OpenAI GPT-3† | 175 Billion Parameter (175B) | 700 GB | 350 GB | ×11 Nvidia Tesla V100 (32GB VRAM) |
OpenAI ChatGPT (a.k.a. GPT-3.5)‡ | 20 Billion Parameter (20B) | 320 GB | 160 GB | ×5 Nvidia Tesla V100 (32GB VRAM) |
Issue | Solution |
---|---|
I want to access GPT-3 or ChatGPT/GPT-3.5. | You can create an account with OpenAI to access those models online; this tutorial is focused on running the models locally on your computer. |
Can I use an AMD or Intel GPU? | No, not for this tutorial. |
I don't have enough GPU VRAM. | If you have at least 8 GB of system RAM, you could try out the smaller GPT-2 models in system RAM. |
Model did not seem to install from Ollama in Open WebUI after download completed. | Exit out of Settings, go back into Settings/Models, and redownload the same model; it should start at 100%, reverify, and register correctly. |
I cannot access nvidia-smi within my docker container. | You must install and configure the Nvidia Container Toolkit (see step 2.1.2.2) in addition to Docker. |
Some LLM responses are very unsettling. | Yes, they are. |
† GPT-3 175B inference requirements
‡ Estimated ChatGPT (a.k.a. GPT-3.5) inference requirements
Copyright © 2023-∞ Athit Kao, Terms and Conditions