ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING

Setup and Fine-Tune Qwen 3 with Ollama: Complete Guide (2026)

By Jebasta

What if you could run one of the most capable open-source AI models in the world entirely on your own laptop, with no cloud fees, no API keys, and no internet connection after the initial download? That is exactly what you get when you set up Qwen 3 with Ollama, making it one of the most powerful local LLM setups available in 2026. The combination of Alibaba’s powerful Qwen 3 model family and Ollama’s dead-simple local LLM inference platform has made running state-of-the-art AI locally more accessible than ever.

This blog walks you through every step: what Qwen 3 is, what Ollama does, how to install and run different Qwen 3 models, how to control thinking mode, how to connect Qwen 3 to Python, and how to fine-tune Qwen 3 with a custom Modelfile. By the end, you will know how to fine-tune a model for a specific task using nothing but a text file. Everything in this guide is beginner-friendly and verified against current documentation.

Quick Answer 

To set up Qwen 3 with Ollama, install Ollama from ollama.com, run ollama pull qwen3:8b to download the model, and run ollama run qwen3:8b to start chatting. It runs entirely on your local machine with full privacy and zero cloud cost. You can customise the model’s behaviour by creating a Modelfile and using ollama create to build a customised version.

Table of contents


  1. What is Qwen 3?
    • Qwen 3 Model Variants and Hardware Requirements
  2. What is Ollama?
  3. How to Install Ollama
    • Installing Ollama on macOS
    • Installing Ollama on Linux
    • Installing Ollama on Windows
    • Verifying the Installation
  4. Downloading and Running Qwen 3 Models
    • Step 1: Pull a Qwen 3 Model
    • Step 2: List Your Downloaded Models
    • Step 3: Run Qwen 3 in Chat Mode
  5. Understanding Qwen 3 Thinking Mode
    • Switching Thinking Mode in the CLI
    • Controlling Thinking Mode via the API
  6. Using Qwen 3 with Python
    • Basic Chat with the Python Client
    • Using the requests Module
    • Streaming Responses
  7. Fine-Tuning Qwen 3 with a Modelfile
    • What is a Modelfile?
    • Step 1: Create a Project Folder and Modelfile
    • Step 2: Write the Modelfile
    • Step 3: Create the Custom Model
    • Step 4: Run Your Custom Model
    • Practical Modelfile Examples
  8. Tips for Getting the Best Results with Qwen 3 and Ollama
    • 💡 Did You Know?
  9. Conclusion
  10. FAQs
    • What hardware do I need to run Qwen 3 with Ollama?
    • Is Qwen 3 free to use in commercial projects?
    • What is the difference between thinking mode and non-thinking mode in Qwen 3?
    • Can I truly fine-tune Qwen 3 with new training data using Ollama?
    • How do I use Qwen 3 in a Python application?

What is Qwen 3?

Qwen 3 is the third generation of large language models developed by Alibaba. It was released on April 28-29, 2025, and represents a significant leap forward from Qwen 2.5. The model family includes eight main variants and six specialised models for retrieval and ranking tasks.

The model was trained on 36 trillion tokens across 119 languages, nearly double the training data used for Qwen 2.5. This makes it one of the most multilingual open-source model families available. All models in the family are released under the Apache 2.0 licence, which means you can download, use, and build on them for both personal and commercial projects without any restrictions.

Qwen 3 Model Variants and Hardware Requirements

Qwen 3 comes in two architectural types: dense models and Mixture-of-Experts (MoE) models. Dense models are straightforward and easier to understand. MoE models are more complex but activate only a fraction of their parameters during each inference step, which means a model with 30 billion total parameters might only use 3 billion parameters per response, keeping it fast and memory-efficient.

Here is a plain-language breakdown of the available Qwen 3 models and what hardware you need to run them comfortably with Ollama.

| Model | Parameters | Disk Size | Minimum VRAM | Best For |
|---|---|---|---|---|
| qwen3:0.6b | 0.6B dense | ~400MB | 4GB | Testing, very fast responses |
| qwen3:1.7b | 1.7B dense | ~1.1GB | 4GB | Basic tasks on limited hardware |
| qwen3:4b | 4B dense | ~2.6GB | 6GB | Good starter model |
| qwen3:8b | 8B dense | ~5.2GB | 8GB | Best default choice |
| qwen3:14b | 14B dense | ~9GB | 10-12GB | Strong reasoning quality |
| qwen3:32b | 32B dense | ~20GB | 20-24GB | Best quality on a single GPU |
| qwen3:30b-a3b | 30B MoE (3B active) | ~18GB | 19-24GB | Efficient choice for a 24GB GPU |
| qwen3:235b-a22b | 235B MoE (22B active) | ~140GB | 140GB+ | Maximum quality, multi-GPU |

The recommended starting point for most developers new to local LLM inference is qwen3:8b. It runs on any machine with 8GB of VRAM or 16GB of system RAM, delivers strong performance across reasoning, coding, and writing tasks, and downloads in a few minutes on a decent internet connection.

Riddle: The qwen3:30b-a3b model has 30 billion total parameters but only activates 3 billion during each inference step. A developer with an 18GB GPU is choosing between this MoE model and the qwen3:14b dense model. Which one should they choose, and why?

Answer: The qwen3:30b-a3b MoE model is generally the better choice. Because it only activates 3 billion parameters per step, it fits comfortably in 18-24GB of VRAM despite its large total parameter count. It offers more overall model capacity and stronger performance than the 14B dense model, while using similar or less memory during actual inference. The “30B” in the name refers to stored knowledge, not runtime cost.


What is Ollama?

Ollama is a free, open-source tool that makes running a local LLM on your own machine as simple as running a command in your terminal. It handles model downloading, version management, GPU acceleration, and serving a local LLM API, all automatically.

When you run ollama pull qwen3:8b, Ollama downloads the model in GGUF format, the standard format for local CPU and GPU inference. GGUF bundles the model weights, tokeniser, and metadata into a single file, so there are no Python environments to configure and no separate config files to manage. When you run ollama run qwen3:8b, your local LLM starts immediately in chat mode.

Ollama also runs a local REST API server on port 11434. This means you can integrate any local LLM into Python, JavaScript, or curl-based applications using standard HTTP requests, without sending your data to any cloud service.

  • Private by default: Your conversations and data never leave your machine. Every local LLM runs entirely on your hardware after the initial download.
  • GPU acceleration: It automatically detects and uses your NVIDIA or AMD GPU, falling back to CPU inference if no GPU is available.
  • Free and open-source: Ollama is completely free with no usage limits or rate limits of any kind.
  • Works on all platforms: Ollama runs on Windows, macOS (including Apple Silicon), and Linux with the same commands on every platform.

Do check out HCL GUVI’s Artificial Intelligence and Machine Learning course, a comprehensive, industry-aligned program designed to help you become job-ready in AI/ML through live mentor-led sessions, hands-on projects, and real-world case studies covering topics like machine learning, deep learning, NLP, and model deployment. With expert guidance, placement support, and certification, it’s ideal for beginners and professionals looking to build a strong AI career in just a few months.

How to Install Ollama

Installing Ollama takes under two minutes on any supported platform. Here is how to do it on each operating system.

1. Installing Ollama on macOS

Open your terminal and run the following command to install Ollama using Homebrew.

brew install ollama

If you do not have Homebrew installed, you can alternatively download the macOS installer directly from ollama.com/download and run it like any other Mac application. Ollama will appear in your menu bar after installation.

2. Installing Ollama on Linux

Open your terminal and run the official install script with the following command.

curl -fsSL https://ollama.com/install.sh | sh

This single command downloads and installs Ollama, sets it up as a system service, and configures it to start automatically. After the script completes, Ollama is running and ready to use.

3. Installing Ollama on Windows

Go to ollama.com/download in your browser, download the Windows installer, and run it. The installer handles everything automatically. Once installed, Ollama runs in the background as a system service.

4. Verifying the Installation

After installing on any platform, open a new terminal window and run the following command to confirm everything is working.

ollama --version

You should see the installed version number printed in the terminal. If you see it, Ollama is installed and running correctly.

Downloading and Running Qwen 3 Models

Now for the part that makes this whole setup worthwhile: downloading and running Qwen 3 with Ollama.

Step 1: Pull a Qwen 3 Model

Use the ollama pull command to download a Qwen 3 model. For most developers, the 8B model is the best starting point.

ollama pull qwen3:8b

A download progress bar appears. The 8B model is approximately 5.2GB, so download time depends on your internet speed. Once the download completes, the model is stored locally and you never need to download it again.

To download a smaller model for faster responses on limited hardware, run the following.

ollama pull qwen3:4b

To download the most powerful single-GPU model, run the following.

ollama pull qwen3:32b

Step 2: List Your Downloaded Models

After downloading, you can see all locally installed models with the following command.

ollama list

This shows every model installed on your machine, along with its size and the date it was last used.

Step 3: Run Qwen 3 in Chat Mode

To start an interactive chat session with Qwen 3, use the ollama run command.

ollama run qwen3:8b

A chat prompt opens where you can type messages and receive responses. To exit the session, type /bye and press Enter.

You can also run a model that you have not downloaded yet. When you run a model that is not installed locally, Ollama automatically downloads it first and then starts the session.

ollama run qwen3:14b

Understanding Qwen 3 Thinking Mode

One of the most distinctive features of Qwen 3 is its dual-mode design. Every Qwen 3 model can operate in two modes: thinking mode and non-thinking mode.

In thinking mode, Qwen 3 works through the problem step by step before giving you its final answer. You can see the reasoning process displayed in the output, wrapped in think tags. This mode is more accurate for complex problems like maths, logic, coding, and multi-step analysis, but it adds response time because the model is doing more work.

In non-thinking mode, Qwen 3 responds directly and quickly without showing any reasoning. This is better for simple questions, conversational interactions, summarisation, and translation where speed matters more than deep reasoning.

Brain teaser: You are building a customer support chatbot using Qwen 3 with Ollama. For simple questions like “What are your business hours?” you want fast responses. For complex technical questions like “How do I configure two-factor authentication?”, you want careful step-by-step reasoning. Can you design a system that uses both modes depending on the question?

Answer: Yes. You can build a simple classifier that first categorises the incoming question as simple or complex. For simple questions, you send the request to Qwen 3 with think=False in the API call. For complex questions, you send it with think=True. Both routes use the same local Qwen 3 model running through Ollama’s REST API. You can even let Qwen 3 itself do the classification first in non-thinking mode, and then route complex questions back through thinking mode for the detailed answer.
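The routing idea from the teaser can be sketched in a few lines of Python. This is a sketch, not a production design: the keyword heuristic and COMPLEX_HINTS list are hypothetical placeholders for a real classifier, and the actual chat call (which requires the Ollama Python client and a running local server with qwen3:8b pulled) is gated behind a RUN_LIVE flag.

```python
# Sketch of two-mode routing with the Ollama Python client (pip install ollama).
# The keyword heuristic below is a hypothetical placeholder for a real classifier.

COMPLEX_HINTS = ("how do i", "configure", "error", "why", "debug")

def is_complex(question: str) -> bool:
    """Crude classifier: route questions with technical hints to thinking mode."""
    q = question.lower()
    return any(hint in q for hint in COMPLEX_HINTS)

def build_request(question: str) -> dict:
    """Chat arguments for ollama.chat, enabling thinking only for hard questions."""
    return {
        "model": "qwen3:8b",
        "messages": [{"role": "user", "content": question}],
        "think": is_complex(question),
    }

RUN_LIVE = False  # set True when a local Ollama server with qwen3:8b is running

if RUN_LIVE:
    from ollama import chat
    response = chat(**build_request("How do I configure two-factor authentication?"))
    print(response.message.content)
```

In a real system you could replace the keyword check with a first-pass classification call to Qwen 3 itself in non-thinking mode, as described above.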

1. Switching Thinking Mode in the CLI

By default, Qwen 3 models run in thinking mode when started with ollama run. You can switch modes during a chat session using the following commands.

To enable thinking mode during a session, type the following at the chat prompt.

/set think

To disable thinking mode and switch to fast direct responses, type the following.

/set nothink

You can also set the mode when starting the session. To start in thinking mode explicitly, run the following command.

ollama run qwen3:8b --think

To start in non-thinking mode from the beginning, run the following.

ollama run qwen3:8b --think=false

2. Controlling Thinking Mode via the API

When using Qwen 3 through the Ollama REST API or the Python client, you control thinking mode with a parameter in your request.

To enable thinking in a Python API call, call ollama.chat and pass think=True along with your model name and messages list. The response object contains both response.message.thinking, with the reasoning trace, and response.message.content, with the final answer.

To disable thinking and get fast direct responses, pass think=False in the same call instead.
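Put together, the call described above looks roughly like this. It is a sketch assuming the Ollama Python client (pip install ollama) and a local server with qwen3:8b pulled, so the network call is gated behind a RUN_LIVE flag.

```python
# Sketch: toggling Qwen 3 thinking mode through the Ollama Python client.

def thinking_kwargs(prompt: str, think: bool) -> dict:
    """Arguments for ollama.chat with thinking mode on or off."""
    return {
        "model": "qwen3:8b",
        "messages": [{"role": "user", "content": prompt}],
        "think": think,
    }

RUN_LIVE = False  # set True when a local Ollama server with qwen3:8b is running

if RUN_LIVE:
    from ollama import chat
    response = chat(**thinking_kwargs("What is 17 * 24?", think=True))
    print("Reasoning:", response.message.thinking)  # the step-by-step trace
    print("Answer:", response.message.content)      # the final answer only
```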

Using Qwen 3 with Python

Ollama exposes a REST API at http://localhost:11434 that any language can call. For Python developers, the official Ollama Python client makes this even simpler. Install it with the following command.

pip install ollama

1. Basic Chat with the Python Client

The following code sends a message to Qwen 3 using the Ollama Python client and prints the response. Import chat from the ollama module. Call chat with model='qwen3:8b' and a messages list containing a dictionary with role='user' and your message as the content. Access the response with response.message.content and print it.
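As a concrete sketch (assuming pip install ollama and a local server with qwen3:8b already pulled; the call itself is gated behind a RUN_LIVE flag because it needs the server):

```python
# Basic chat with the Ollama Python client (pip install ollama).

messages = [{"role": "user", "content": "Explain GGUF in one sentence."}]

RUN_LIVE = False  # set True when a local Ollama server with qwen3:8b is running

if RUN_LIVE:
    from ollama import chat
    response = chat(model="qwen3:8b", messages=messages)
    print(response.message.content)
```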

This is all the code you need for a basic integration. The Ollama Python client handles the HTTP request, connection management, and response parsing for you.

2. Using the requests Module

If you prefer to use the standard requests module without installing the Ollama client, you can call the API directly. Send a POST request to http://localhost:11434/api/chat with a JSON body containing model, messages, and optionally think and stream fields. Set stream to False if you want the complete response as a single JSON object rather than a streaming series of chunks.
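A minimal sketch of that direct call, assuming the default endpoint at http://localhost:11434 and the requests package installed; the payload builder is plain Python, and the POST itself is gated behind a RUN_LIVE flag since it needs a running server:

```python
# Calling the Ollama REST API directly with the requests module.

def chat_payload(prompt: str, think: bool = False, stream: bool = False) -> dict:
    """JSON body for POST /api/chat: model, messages, and optional flags."""
    return {
        "model": "qwen3:8b",
        "messages": [{"role": "user", "content": prompt}],
        "think": think,
        "stream": stream,  # False returns one complete JSON object
    }

RUN_LIVE = False  # set True when the Ollama server is running (pip install requests)

if RUN_LIVE:
    import requests
    r = requests.post("http://localhost:11434/api/chat", json=chat_payload("Hello!"))
    print(r.json()["message"]["content"])
```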

3. Streaming Responses

For a more responsive user experience in an application, you can stream the response token by token. In the Ollama Python client, pass stream=True to the chat call. This returns a generator that you iterate over, printing each chunk as it arrives. This is how you get the typewriter effect in chat interfaces.
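A sketch of streaming with the Python client, under the same assumptions as above (pip install ollama, local server with qwen3:8b pulled, network call gated behind RUN_LIVE):

```python
# Streaming a response token by token with the Ollama Python client.
# stream=True makes chat return a generator of chunks.

def stream_kwargs(prompt: str) -> dict:
    """Chat arguments with streaming enabled."""
    return {
        "model": "qwen3:8b",
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }

RUN_LIVE = False  # set True when a local Ollama server with qwen3:8b is running

if RUN_LIVE:
    from ollama import chat
    for chunk in chat(**stream_kwargs("Write a haiku about local LLMs.")):
        print(chunk.message.content, end="", flush=True)  # typewriter effect
    print()
```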

Fine-Tuning Qwen 3 with a Modelfile

Here is something important to understand before this section: Ollama does not support true fine-tuning with new training data. What Ollama does support, through the Modelfile system, is creating a custom model that starts from a base Qwen 3 model but has a fixed system prompt, custom parameters, and a persistent persona baked in. This is sometimes called prompt-based fine-tuning or model customisation. For local LLM applications, this type of customisation covers the vast majority of real-world needs.

Also read – How to Fine-Tune Large Language Models (LLMs)? And Fine-Tuning LLMs with Unsloth and Ollama: A Step-by-Step Guide

For many practical use cases, this is exactly what you need. If you want the model to always behave as a customer support agent for your product, always respond in a specific format, or always have access to certain context without you having to provide it every time, a Modelfile handles all of that elegantly, making it the simplest fine-tune approach available in the local LLM ecosystem.

What is a Modelfile?

A Modelfile is a plain text configuration file that tells Ollama how to build a custom model. It uses a simple syntax similar to a Dockerfile. The most important fields are shown below.

FROM specifies the base model you are building on. For example, FROM qwen3:8b means your custom model starts from the downloaded 8B Qwen 3 model.

SYSTEM defines the system prompt that gets injected at the start of every conversation. This is how you give your model a role, a persona, or task-specific instructions that apply every single time without you having to type them manually.

PARAMETER sets inference parameters. Common parameters include temperature (controls creativity, 0.0 to 1.0), top_p (controls the diversity of word selection), and num_ctx (sets the context window size in tokens).

Step 1: Create a Project Folder and Modelfile

Create a new folder for your custom model. Inside it, create a plain text file named Modelfile with no extension. This is the standard naming convention Ollama expects.

Step 2: Write the Modelfile

Here is a working example of a Modelfile that creates a Qwen 3 model customised for sentiment analysis. The file starts with FROM qwen3:0.6b to use the small, fast 0.6B model as the base. Then it sets a SYSTEM prompt that instructs the model to analyse the sentiment of any text it receives and respond only with one of three labels: Positive, Negative, or Neutral. It includes a few-shot example showing the expected input and output format. The PARAMETER temperature 0.0 line sets temperature to zero, which makes the model deterministic and consistent, ideal for classification tasks. The PARAMETER num_ctx 4096 line sets the context window.
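The Modelfile described above might look like the following sketch; the exact prompt and few-shot wording are illustrative, not a fixed recipe.

```
FROM qwen3:0.6b

SYSTEM """
You are a sentiment analysis assistant. Analyse the sentiment of any text
you receive and respond with exactly one word: Positive, Negative, or Neutral.

Example input: "I love this product, it works perfectly."
Example output: Positive
"""

PARAMETER temperature 0.0
PARAMETER num_ctx 4096
```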

This Modelfile approach means every single conversation automatically starts with those instructions. You do not need to repeat the system prompt in your code every time you call the model.

Step 3: Create the Custom Model

Navigate to your project folder in the terminal and run the following command to build your custom model from the Modelfile.

ollama create qwen3-sentiment-analyzer -f Modelfile

Ollama reads the Modelfile, applies the configuration to the base Qwen 3 model, and registers the new model under the name qwen3-sentiment-analyzer. The process takes only a few seconds because it is configuring the model, not retraining it.

Step 4: Run Your Custom Model

Run your new custom Qwen 3 model exactly like any other Ollama model.

ollama run qwen3-sentiment-analyzer

Every conversation now starts with your system prompt automatically. When you type a sentence, the model responds with only a sentiment label, exactly as configured.

You can verify your custom model appears in your installed models list with the following command.

ollama list

Practical Modelfile Examples

You can create a wide variety of specialised Qwen 3 models using different Modelfile configurations. Here are three useful examples.

1. Code review model: Set the base to qwen3:14b for better code understanding. Write a SYSTEM prompt that instructs the model to review any code it receives, identify bugs, suggest improvements, and explain its reasoning clearly. Set temperature 0.2 for consistent, precise technical output.

2. Customer support model: Use qwen3:8b as the base. Write a SYSTEM prompt that gives the model a specific company name, product name, and list of common issues it should handle. Include instructions to be polite, stay on topic, and escalate to a human for anything it cannot resolve confidently. Set num_ctx 8192 to handle longer support conversations.

3. Non-thinking fast responder: Use any Qwen 3 base model. Add /no_think at the end of your SYSTEM prompt to permanently disable thinking mode for all conversations with this model. This is the recommended approach when you need fast API responses and do not want thinking mode turning on by default.
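As one concrete example, the code review variant above could be sketched like this; the prompt wording is illustrative.

```
FROM qwen3:14b

SYSTEM """
You are a senior code reviewer. For any code you receive, identify bugs,
suggest improvements, and explain your reasoning clearly and concisely.
"""

PARAMETER temperature 0.2
```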

Tips for Getting the Best Results with Qwen 3 and Ollama

  • Match the model size to your hardware: Running a model that fits entirely in VRAM gives you 10x or more speed compared to one that spills into system RAM. The qwen3:8b model running at 80+ tokens per second on 8GB VRAM is far more useful than qwen3:32b crawling at 7 tokens per second on the same card.
  • Increase the context window if you see looping: If Qwen 3 seems to repeat itself or loop unexpectedly, the tool may have defaulted to a very small context window. Set it higher by running /set parameter num_ctx 32768 during your session or adding PARAMETER num_ctx 32768 to your Modelfile.
  • Use non-thinking mode for API integrations: When you are calling Qwen 3 from an application rather than chatting interactively, non-thinking mode is usually the better default. It is faster and produces cleaner output for structured tasks like summarisation, classification, and extraction.
  • Keep Ollama updated: Run ollama --version to check your version and visit ollama.com for updates. Newer Ollama versions often include speed improvements and support for new model features.
  • Use LangChain for complex pipelines: If you want to chain Qwen 3 with other tools, retrieval systems, or agent workflows, the local API is fully compatible with LangChain. You can swap a cloud model for a local LLM like Qwen 3 by changing a single model definition line.
  • Test your Modelfile with a small base model first: When building a custom Modelfile configuration, test your system prompt with the 0.6B or 4B model first. Iteration is faster and cheaper. Once you are happy with the prompt, switch the FROM line to a larger model for production.

💡 Did You Know?

  • Qwen3-32B delivers performance equivalent to Qwen2.5-72B, meaning you get 72B-class capability from a model that fits on a single RTX 4090 GPU, which typically has 24GB of VRAM.
  • Qwen 3 was trained on 119 languages and dialects, making it one of the most multilingual open-source model families ever released, including support for Indian languages such as Hindi, Bengali, Tamil, and Telugu.
  • All Qwen 3 models are released under the Apache 2.0 licence, allowing you to use them in commercial products, modify them, and distribute them without paying licensing fees or requesting permission.

Conclusion

Setting up Qwen 3 with Ollama is one of the most practical things a developer or AI enthusiast can do in 2026, giving you a complete local LLM stack with a built-in customisation workflow. Three commands take you from nothing to a fully running local AI model: install Ollama, pull Qwen 3, and run it. Everything after that (controlling thinking mode, integrating with Python, building custom models with Modelfiles) is a natural extension of those basics.

The ability to fine-tune Qwen 3 with a Modelfile and create task-specific local LLM models for sentiment analysis, code review, customer support, or any other workflow you can describe in a system prompt gives you a genuinely powerful local AI stack. And because it all runs on your machine through Ollama, there are no API bills, no data leaving your network, and no rate limits to worry about.

FAQs

1. What hardware do I need to run Qwen 3 with Ollama?

The minimum practical setup for a local LLM is a machine with 8GB of RAM and a modern mid-range CPU. For a good experience, 8GB of VRAM lets you run the qwen3:8b model at full speed. Apple Silicon Macs (M1 and later) are excellent for local inference because they share unified memory between CPU and GPU. A MacBook Air M2 with 16GB of unified memory runs qwen3:8b comfortably as a local LLM.

2. Is Qwen 3 free to use in commercial projects?

Yes. All Qwen 3 models are released under the Apache 2.0 licence. You can use them in commercial applications, fine-tune them, and deploy them without any licensing fees or restrictions. Ollama is also free and open-source.

3. What is the difference between thinking mode and non-thinking mode in Qwen 3?

Thinking mode activates chain-of-thought reasoning before the final answer. The model shows its work step by step, which improves accuracy on hard problems but adds response latency. Non-thinking mode responds directly and is faster, making it better for simple questions, structured data extraction, and API integrations where speed matters.

4. Can I truly fine-tune Qwen 3 with new training data using Ollama?

No. Ollama does not support retraining models on new datasets. The Modelfile system in Ollama lets you fine-tune a model’s behaviour with a system prompt, fixed parameters, and a persona, which handles most practical customisation needs. For true fine-tuning with new training data, you would use a framework like Unsloth or Hugging Face Transformers. These tools let you run a real fine-tune job that updates the model weights on your own dataset.


5. How do I use Qwen 3 in a Python application?

Install the Ollama Python client with pip install ollama. Import chat from the ollama module and call it with your model name, messages, and optional parameters like think=True or think=False. The model must be downloaded locally with ollama pull before you can call it from Python. The Ollama server runs automatically in the background once Ollama is installed.
