Export to Ollama (GGUF) and Import Steps

Unsloth’s save_pretrained_gguf usually:

- merges the LoRA adapter into the base model,
- converts the merged weights to 16-bit,
- writes a .gguf file (with metadata), optionally quantized to q4_k_m (as your previous script did).

After training, confirm the .gguf file exists:
- gguf_dir will contain a file like model_name_q4_k_m.gguf or similar. Note the absolute path.
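
The check above can be scripted. This is a minimal sketch, assuming the export directory is flat (the .gguf sits directly inside it); find_gguf_files is a hypothetical helper name, not part of Unsloth or Ollama.

```python
from pathlib import Path


def find_gguf_files(gguf_dir):
    """Return absolute paths of the .gguf files directly inside gguf_dir, sorted."""
    directory = Path(gguf_dir)
    if not directory.is_dir():
        raise FileNotFoundError(f"Not a directory: {directory}")
    # resolve() gives the absolute path that the Modelfile's FROM line needs.
    return sorted(p.resolve() for p in directory.glob("*.gguf") if p.is_file())


# Usage ("gguf_model" is a placeholder for the directory you saved to):
#   for path in find_gguf_files("gguf_model"):
#       print(path, f"{path.stat().st_size / 1e9:.2f} GB")
```

An empty result most likely means the export step failed or wrote to a different directory.
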
Create a Modelfile (a plain text file) for Ollama, in the same folder or anywhere else. Example Modelfile content:

FROM /absolute/path/to/gguf_model/model_file_name.gguf
# optional: add SYSTEM or additional metadata

Then, in a terminal opened in the folder that contains the Modelfile:

# create model in Ollama
ollama create my-gemma-resume -f Modelfile

# run it
ollama run my-gemma-resume
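
If you would rather generate the Modelfile from the training script, the following is one possible sketch. write_modelfile and create_command are hypothetical helper names; the only Ollama pieces assumed are the FROM and SYSTEM instructions and the ollama create NAME -f FILE command shown above.

```python
from pathlib import Path


def write_modelfile(gguf_path, modelfile_path="Modelfile", system_prompt=None):
    """Write a minimal Ollama Modelfile that points at a local .gguf file."""
    gguf_path = Path(gguf_path).resolve()  # FROM should be an absolute path
    lines = [f"FROM {gguf_path}"]
    if system_prompt:
        # Triple quotes allow the SYSTEM text to span several lines.
        lines.append(f'SYSTEM """{system_prompt}"""')
    modelfile_path = Path(modelfile_path)
    modelfile_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return modelfile_path


def create_command(model_name, modelfile_path):
    """Build the `ollama create` argument list without running it."""
    return ["ollama", "create", model_name, "-f", str(modelfile_path)]


# Usage (requires the ollama CLI on PATH):
#   import subprocess
#   mf = write_modelfile("/absolute/path/to/gguf_model/model_file_name.gguf")
#   subprocess.run(create_command("my-gemma-resume", mf), check=True)
```

Building the command as a list rather than a shell string sidesteps quoting problems when the path contains spaces.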

If ollama create complains about permissions or paths on Windows, copy the .gguf into your Ollama models folder, %USERPROFILE%\.ollama\models\, then point the FROM line of your Modelfile at that local path and run ollama create again.

GGUF is commonly expanded as “GPT-Generated Unified Format”.

It is a model file format introduced by the llama.cpp project, designed to make large language models easier to run on local machines.

Unified Format
- GGUF is a standardized format for storing LLM weights, tokenizer, and metadata in a single package.
- It replaces older formats like GGML and GGJT that llama.cpp used previously.
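
Because everything lives in one file with a fixed header, you can sanity-check a .gguf before handing it to Ollama. The sketch below assumes a little-endian file of GGUF version 2 or later, where the header is the 4-byte magic "GGUF", a 32-bit version, and two 64-bit counts (tensors and metadata key/value pairs). read_gguf_header is a hypothetical helper, not a llama.cpp API.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_FORMAT = "<4sIQQ"  # magic, version, tensor count, metadata key/value count
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 24 bytes


def read_gguf_header(path):
    """Return the version and counts from the first 24 bytes of a GGUF file."""
    with open(path, "rb") as f:
        raw = f.read(HEADER_SIZE)
    if len(raw) < HEADER_SIZE or raw[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    _magic, version, tensor_count, kv_count = struct.unpack(HEADER_FORMAT, raw)
    if version < 2:
        # Assumption: only the v2+ layout (64-bit counts) is handled here.
        raise ValueError(f"Unsupported GGUF version: {version}")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }
```

A ValueError here usually means the export was interrupted or the file is in one of the older formats.
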
Optimized for Local Inference
- GGUF models are often distributed pre-quantized (e.g., Q4, Q5, Q8), which makes them smaller and faster to run; the format can also hold unquantized 16-bit or 32-bit weights.
- They can run efficiently on consumer hardware (CPU or GPU) without huge VRAM requirements.
Cross-Compatibility
- Widely supported by inference tools like Ollama, llama.cpp, LM Studio, GPT4All, and many more.
- If you see a .gguf file, it usually means you can drop it into these tools and run the model immediately.
Quantization Support
- GGUF files often come in variants like q4_k_m, q5_1, q8_0, etc.
- These indicate different quantization levels (trade-offs between size and accuracy).
- Example: q4_k_m is smaller and faster, while q8_0 is larger but closer to full precision.
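
To make the size trade-off concrete, here is a rough estimate of weight storage for the simple block formats. It is a sketch under stated assumptions: each block covers 32 weights, the byte counts follow the block layouts noted in the comments, and every tensor is assumed to use the same type (real files keep some tensors at higher precision and add metadata). K-quants such as q4_k_m mix several block types, so this sketch deliberately does not assign them a single figure. The function names are hypothetical.

```python
BLOCK_WEIGHTS = 32  # weights covered by one block in the formats below

# Bytes per block, from the block layout of each format.
BLOCK_BYTES = {
    "q4_0": 18,  # fp16 scale (2) + 32 x 4-bit values (16)
    "q4_1": 20,  # fp16 scale and min (4) + 32 x 4-bit values (16)
    "q5_0": 22,  # fp16 scale (2) + 32 high bits (4) + 32 x 4-bit values (16)
    "q5_1": 24,  # fp16 scale and min (4) + 32 high bits (4) + 16
    "q8_0": 34,  # fp16 scale (2) + 32 x 8-bit values (32)
}
FLOAT_BITS = {"f16": 16.0, "f32": 32.0}


def bits_per_weight(quant):
    """Average storage cost of one weight, in bits, for a given format name."""
    key = quant.lower()
    if key in FLOAT_BITS:
        return FLOAT_BITS[key]
    if key in BLOCK_BYTES:
        return BLOCK_BYTES[key] * 8 / BLOCK_WEIGHTS
    raise ValueError(f"No single bits-per-weight figure known for {quant!r}")


def estimate_weight_bytes(n_params, quant):
    """Approximate bytes needed to store n_params weights in the given format."""
    return n_params * bits_per_weight(quant) / 8


# Usage: compare formats for a hypothetical 7-billion-parameter model.
#   for q in ("q4_0", "q5_1", "q8_0", "f16"):
#       print(q, f"{estimate_weight_bytes(7_000_000_000, q) / 1e9:.2f} GB")
```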