Running Gemma 4 Locally

I believe every developer can relate to the urge to DIY. When there is a paid subscription for a tool, the immediate instinct is: If I can build it myself, why bother paying for it? This mindset has been the driving force behind countless open-source projects that eventually eclipsed the very proprietary tools that inspired them.

I’ve been chasing that feeling with local LLMs over the past month. It all started when Google released Gemma 4, with press releases highlighting smaller variants designed to run on Google Pixels. My logic was simple: if it can run on a phone, it should fly on my Mac. Why pay for a monthly subscription when I can run the models on my own silicon?

Spoiler Alert: It did not fly on my Mac.

Harsh Reality

Before we dive in, we need to have a conversation about hardware. Let’s be realistic: most of us aren’t dropping thousands of dollars on a maxed-out Mac Studio, and for many, even a MacBook Pro is out of reach. While those high-end machines make running large models feel seamless, they aren’t the norm.

I’m testing these models on a MacBook Pro—which I recognize is still a premium device—but the reality is that until there is a massive leap in how models handle memory, hardware will always be the primary bottleneck. You likely won’t be running the world’s largest frontier models locally, and that’s okay. The secret to enjoying local AI is not having the most expensive gear; it’s setting your expectations based on the hardware you actually have.

The Setup

So, how do you actually get these models running locally? My go-to is Ollama, which simplifies the process of downloading and managing models on macOS. Once you have Ollama installed, you can pull a model—such as gemma4:e4b—directly from your terminal using:

$ ollama pull gemma4:e4b
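Before layering anything else on top, it’s worth a quick sanity check from the same terminal to confirm the model actually loads and responds. The prompt here is just a throwaway example:

$ ollama list    # confirm the model finished downloading
$ ollama run gemma4:e4b "Write a one-line hello world in Python."

If that comes back in a reasonable amount of time, you’re ready to move on to a proper interface.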

While Ollama handles the heavy lifting in the background, you’ll eventually want a more robust way to interact with your models than a terminal window. That’s where Open WebUI comes in. It’s a browser-based interface that transforms the experience, allowing you to modify system prompts and integrate tools like skills and MCP servers to expand your model’s capabilities.
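If you have Docker installed, the quickest route I know of is the project’s official container image; the port mapping and volume name below are just sensible defaults, so adjust them to taste:

$ docker run -d -p 3000:8080 \
    --add-host=host.docker.internal:host-gateway \
    -v open-webui:/app/backend/data \
    --name open-webui \
    ghcr.io/open-webui/open-webui:main

Once it’s up, Open WebUI is available at http://localhost:3000 and should pick up your local Ollama instance automatically.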

However, if you’re looking for a more integrated, agentic experience—one that can actually interact with your filesystem and execute commands—you want a tool that lives inside your code.

Naturally, I tried this first with Claude Code. Since it’s designed for this exact workflow, I thought it would be the perfect bridge. I pointed it toward my local model, but this is where I hit my first real roadblock: Gemma 4 can think, but it doesn’t know it can act.

[Screenshot: Gemma 4 in Claude Code stuck in a “thinking” state, never executing a command]

As you can see in the screenshot, the model entered a “thinking” state, but it never actually executed a command. It just sat there, pondering for over a minute, acting like a standard chatbot rather than a coding agent. It had the logic to solve the problem, but it didn’t seem to realize it had permission to call tools. I spent some time digging through the settings, but I couldn’t find a way to force tool-calling behavior in Claude Code.

That led me to explore OpenCode. When I first fired it up, I ran into the exact same issue—Gemma 4 was still just chatting, not acting. But unlike my experience with Claude Code, I was able to find a way to “flip the switch.”

By editing the OpenCode config file (usually located at ~/.config/opencode/opencode.json), I found that I could manually set the model’s tool_call property to true.

{
  "$schema": "https://opencode.ai/config.json",
  "model": "ollama/minimax-m2.7:cloud",
  "provider": {
    "ollama": {
      "models": {
        "gemma4:e2b": {
          "_launch": true,
          "name": "gemma4:e2b",
          "tool_call": true
        }
      },
      "name": "Ollama",
      "npm": "@ai-sdk/openai-compatible",
      "options": {
        "baseURL": "http://127.0.0.1:11434/v1"
      }
    }
  }
}

With that one line changed, the model stopped just “thinking” and started doing. Now, we were cooking.
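If you want to confirm tool calling works independently of OpenCode, you can hit Ollama’s OpenAI-compatible endpoint directly. The get_weather tool below is made up purely for the test:

$ curl http://127.0.0.1:11434/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "gemma4:e2b",
      "messages": [{"role": "user", "content": "What is the weather in Lisbon right now?"}],
      "tools": [{
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get the current weather for a city",
          "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"]
          }
        }
      }]
    }'

If the response comes back with a tool_calls entry instead of a plain text answer, the model is actually willing to act, not just chat.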

The Mindset Shift

Now that the tools are working, there is a second hurdle you have to clear: your own habits.

If you’ve spent any time with Claude Code, Codex, or Antigravity, you’re probably used to “one-shotting” your development. You give the model a massive prompt, describe three different feature changes, and expect it to output a perfectly refactored app in one go.

With local models, that approach will fail. Local models generally have smaller context windows and a lower “reasoning ceiling.” If you try to dump your entire codebase into the prompt and ask for a complex overhaul, the model will either lose the plot or start hallucinating.
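You can see the constraint for yourself: Ollama will print a model’s metadata, including its context length, with a single command:

$ ollama show gemma4:e4b

Compare that context length with the prompt you were about to paste in, and the case for working in small steps makes itself.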

To get the most out of local AI, you have to move from delegation to collaboration. Instead of one-shotting, you need to guide the model through incremental steps.

The Frontier Way

With Claude Code, you might be used to prompting something like “Refactor the authentication logic to use JWT instead of sessions and update all the affected routes,” and it works most of the time.

The Local Way

  1. “Look at auth.ts. How would we change this to use JWT?” → (Verify).
  2. “Now apply that change to auth.ts.” → (Verify).
  3. “Now look at routes.ts and update the middleware to match.” → (Verify).

You can achieve the same results, but you have to hold the model’s hand.

Finding the Balance

Why would you want to run local models when frontier models are accessible? Maybe you’re like me, chasing the DIY itch. Maybe you keep hitting the usage limits on your subscriptions. In that case, local models can help take some of that load off: use local agents for simple tasks, like refactoring a file, writing a model file, or, with the right harness (coming soon), identifying bugs and potential security issues.

You can then offload the heavy work to the frontier models, saving those precious tokens for complex tasks.

In the end, I didn’t replace my Claude Pro subscription. But I did change how I use it. The repetitive stuff — refactoring a file, scaffolding a model, the tasks I used to burn tokens on without thinking — that’s local now. Claude gets the problems that actually need Claude.

That’s not the outcome I was chasing when I pulled gemma4:e4b, but it might be the more useful one.