Unlocking the Power of GGUF Models Locally with Ollama
G’day mates! Today, I want to share my experience diving into the world of large language models (LLMs) and how you can run GGUF models locally using Ollama. If you’re like me and have been keen to harness AI capabilities without relying on cloud services, this guide is for you.
Why GGUF and Ollama?
First off, let’s talk about GGUF (GPT-Generated Unified Format). It’s quickly become the go-to standard for running LLMs on personal machines. With an ever-growing collection of GGUF models available on Hugging Face—thanks to legends like TheBloke and many other contributors—it’s never been easier to get started.
Then there’s Ollama, a fantastic tool built on top of llama.cpp that simplifies running these models locally. Combining GGUF models with Ollama lets you leverage powerful AI models without relying on hefty cloud-based solutions.
The Game Changer: Running GGUF Models Directly from Hugging Face via Ollama
In the past, running GGUF models with Ollama involved downloading the models manually and setting up Modelfiles. But now, Ollama has introduced a brilliant feature that lets you run any GGUF model available on Hugging Face directly, without the need for manual downloads or Modelfiles. This has been a game-changer in simplifying the workflow.
How It Works
At the time of writing, there are over 45,000 public GGUF checkpoints on the Hugging Face Hub that you can run with a single ollama run command. Here’s how you can do it.
Step 1: Install Ollama
If you haven’t installed Ollama yet, you can do so by following the instructions on their GitHub repository.
For macOS users with Homebrew:
brew install ollama
For other platforms, check the instructions in the README.
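Once it’s installed, make sure the Ollama server is running and the CLI responds (brew services assumes the Homebrew install; otherwise just run the server directly):
# Start the server as a background service (Homebrew)
brew services start ollama
# ...or run it in a terminal
ollama serve
# Sanity check
ollama --version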
Step 2: Running a GGUF Model Directly from Hugging Face
To run a GGUF model directly from Hugging Face, use the following command:
ollama run hf.co/{username}/{repository}
For example, to run the Llama-3.2-1B-Instruct-GGUF model:
ollama run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF
This command tells Ollama to fetch the model directly from Hugging Face, download it if necessary, and then run it. No need to manually download files or set up Modelfiles!
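If you’d rather grab the model without jumping straight into an interactive chat, the same Hugging Face reference works with ollama pull:
ollama pull hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF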
Examples of Models You Can Try
Here are some models you can try running directly:
ollama run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF
ollama run hf.co/mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF
ollama run hf.co/arcee-ai/SuperNova-Medius-GGUF
ollama run hf.co/bartowski/Humanish-LLama3-8B-Instruct-GGUF
Custom Quantization
By default, Ollama uses the Q4_K_M quantization scheme when it’s present inside the model repository. If not, it picks a reasonable default. However, you can specify a different quantization scheme directly in the command:
ollama run hf.co/{username}/{repository}:{quantization}
For example:
ollama run hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:IQ3_M
ollama run hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0
# Quantization names are case-insensitive
ollama run hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:iq3_m
# You can also use the full filename as a tag
ollama run hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:Llama-3.2-3B-Instruct-IQ3_M.gguf
Custom Chat Template and Parameters
If you want to customize the chat template or sampling parameters, you can create files named template, system, or params in the model repository on Hugging Face. However, for most users, the default settings work just fine.
The Traditional Method: Downloading GGUF Models Manually
While running models directly from Hugging Face is incredibly convenient, you might still want to download models manually in certain scenarios—like if you need to modify them or use custom Modelfiles for advanced configurations.
Step 1: Install the Hugging Face CLI
First, ensure you have the Hugging Face CLI installed:
pip install huggingface_hub
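Some models on the Hub are gated and require you to be logged in before downloading:
huggingface-cli login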
Step 2: Download a GGUF Model
Not all models are available in GGUF format, so you need to ensure you’re downloading the correct files.
Checking the Model Files on Hugging Face
- Visit the Model’s Page: Go to the Hugging Face page of the model you’re interested in, such as huihui-ai/Llama-3.2-3B-Instruct or TheBloke/MistralLite-7B-GGUF.
- Navigate to “Files and Versions”:
  - Look for files ending with .gguf. These are the ones compatible with Ollama.
  - Avoid files like .safetensors or .bin, as they require conversion.
Downloading the GGUF Model
Once you’ve identified the correct GGUF file, download it using the Hugging Face CLI:
huggingface-cli download \
TheBloke/MistralLite-7B-GGUF \
mistrallite.Q4_K_M.gguf \
--local-dir downloads
Note: Be sure to specify the exact GGUF file to avoid downloading incompatible formats.
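If you’re not sure of the exact filename, you can also filter by quantization with a glob pattern via the CLI’s --include option:
huggingface-cli download \
  TheBloke/MistralLite-7B-GGUF \
  --include "*Q4_K_M*.gguf" \
  --local-dir downloads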
Step 3: Creating a Modelfile
After downloading, create a Modelfile to point Ollama to your local model:
# Modelfile
FROM ./downloads/mistrallite.Q4_K_M.gguf
Step 4: Building the Model with Ollama
Build the model using:
ollama create mistrallite -f Modelfile
Step 5: Running the Model
Now, you can run the model:
ollama run mistrallite "What is Kubernetes?"
You should receive an informative response, like:
Kubernetes is an open-source container orchestration platform that automates the deployment, scaling, and management of containerized applications. It helps organizations manage complex workloads and services.
Handling Models in Other Formats
If the model you want isn’t available in GGUF format and only provides .safetensors or .bin files, you have a couple of options:
- Search for a GGUF Version: Often, someone else has already converted the model and shared it under a different repository.
- Convert the Model Yourself: This can be a bit of a process and requires additional tools like transformers and llama.cpp’s conversion script (see the sketch below).
Personally, I prefer to find a GGUF version to save time.
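That said, if you do want to convert a model yourself, the usual route is llama.cpp’s conversion script followed by its quantize tool. Roughly, hedging on exact paths for your checkout (original-model is a placeholder for a directory of downloaded safetensors weights, and the quantize binary only exists once you’ve built llama.cpp):
# Grab llama.cpp and the Python deps for its converter
git clone https://github.com/ggerganov/llama.cpp
pip install -r llama.cpp/requirements.txt
# Convert the original Hugging Face weights to a full-precision GGUF file
python llama.cpp/convert_hf_to_gguf.py ./downloads/original-model \
  --outfile original-model-f16.gguf
# After building llama.cpp, quantize it down to something manageable
./llama.cpp/build/bin/llama-quantize original-model-f16.gguf original-model-Q4_K_M.gguf Q4_K_M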
Cleaning Up: Managing Your Models
Over time, you might accumulate several models and want to tidy up.
Listing Installed Models
ollama list
Removing Models
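To remove a single model by name:
ollama rm mistrallite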
To remove all models:
ollama list | awk 'NR>1 {print $1}' | xargs ollama rm
Or, to remove specific models (e.g., those containing ‘llama’):
ollama list | awk 'NR>1 {print $1}' | grep 'llama' | xargs ollama rm
Building Custom Models for Code Reviews
One of the coolest things I’ve done is create a custom model for code reviews. Here’s how you can do it too.
Setting Up a Custom Model
Create a file named custom-coder.modelfile
:
# custom-coder.modelfile
# Use CodeLlama as the base model
FROM codellama
# Adjusts creativity
PARAMETER temperature 0.7
# Controls diversity
PARAMETER top_p 0.9
# Limits the number of tokens to consider
PARAMETER top_k 40
SYSTEM """
You're a senior developer tasked with code reviews:
1. Explain why code works or might break.
2. Identify potential performance issues.
3. Suggest improvements with example code.
"""
Building and Running Your Custom Model
Build the model:
ollama create mycoder -f custom-coder.modelfile
Now, you can use it to review code:
git diff HEAD~1 | ollama run mycoder "Review these changes:"
This pipes your recent Git changes into the model for analysis. It’s like having a senior dev looking over your shoulder!
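To make that a repeatable one-liner, you could wrap it in a small shell function (review here is just a made-up helper name):
# Review the last commit by default, or any ref/range you pass in
review() {
  git diff "${1:-HEAD~1}" | ollama run mycoder "Review these changes:"
}
Then review on its own checks your last commit, while review main..my-branch looks over a whole branch.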
Enhancing Your Workflow with Shell Integration
To make things even smoother, you can set up shell aliases and functions.
Creating Aliases
Add these to your .bashrc or .zshrc:
alias oc='ollama run codellama'
alias om='ollama run mistral'
Dynamic Model Loading Function
Here’s a nifty function to load models on the fly:
model() {
local model_name="$1"
shift
if ! ollama list | grep -q "$model_name"; then
echo "Model $model_name not found. Downloading..."
ollama pull "$model_name"
fi
ollama run "$model_name" "$@"
}
Now, you can use it like so:
model codellama "Explain this Python code:"
Advanced Techniques: Model Chaining
You can get creative by chaining models together for more complex tasks.
Example: Code Analysis and Simplification
cat complex.py | \
ollama run codellama "Analyze this code:" | \
ollama run mistral "Explain this in layman's terms."
This takes a complex piece of code, analyses it, and then simplifies the explanation.
Example: Code Generation and Review
ollama run codellama "Write a Redis cache wrapper in Python." | \
ollama run mycoder "Review this code."
Generate code and immediately have it reviewed by your custom model.
Keeping an Eye on Resources
Running these models can be resource-intensive. Here’s how to manage that.
Monitoring GPU Usage
If you’ve got a GPU, monitor its memory usage:
watch -n 1 'nvidia-smi | grep MiB'
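Ollama itself can also show which models are currently loaded and how they’re split between GPU and CPU memory:
ollama ps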
Checking Disk Space
See how much space your models are taking:
du -sh ~/.ollama
Limiting GPU Memory Usage
Ollama doesn’t take a memory limit on the command line, but you can reduce VRAM usage by offloading fewer layers to the GPU. Set the num_gpu parameter in your Modelfile and rebuild:
PARAMETER num_gpu 20
Lower the value based on your GPU’s capacity; layers that aren’t offloaded run on the CPU instead.
Choosing the Right Model for Your Hardware
Not all of us have monster rigs, so pick models that suit your setup:
- 8GB VRAM GPU: Models like mistral or codellama work well.
- 12GB VRAM GPU: You can handle larger models like wizard-math.
- 24GB+ VRAM GPU: Go wild with models like llama2:70b.
It’s all about balancing performance with capability.
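Not sure how much VRAM you’ve actually got? On an NVIDIA card, a quick query tells you:
nvidia-smi --query-gpu=name,memory.total --format=csv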
Troubleshooting: Quick Fixes
Sometimes things go pear-shaped. Here’s how to get back on track:
# Kill any stuck processes
pkill ollama
# Restart the Ollama server
ollama serve
# Full reset (you'll need to re-download models after this)
rm -rf ~/.ollama
Use that last command cautiously; it’s a nuclear option.
Wrapping Up
Running GGUF models locally with Ollama has been a game-changer for me. With the new ability to run models directly from Hugging Face, it’s never been easier to experiment and develop with powerful AI models right on your own machine.
I hope this guide helps you on your journey. Feel free to share your experiences or ask questions—I’m always keen to chat about this stuff!
Cheers!