Gemma 4: Run AI Free Locally
By Lukas Uhl ·
Every API Call Is a Subscription You Forgot to Cancel
Most companies discover their AI costs the same way they discover a SaaS leak: by accident, when someone looks at the credit card statement.
The first OpenAI API key gets created for a quick experiment. A developer hooks it up to a small workflow. Then three more workflows. Then a customer support bot. Then an internal FAQ tool. Suddenly you’re paying $400/month to run what is, at the core, a few text processing scripts - and every piece of customer data you own is traveling to a US server you’ve never audited.
Google released Gemma 4 on April 2, 2026. Apache 2.0 license. Four model sizes from 2B to 27B parameters. Runs on a laptop, a workstation, or a company server. Zero per-call cost. Data stays on your hardware.
This is not a research project. This is production-ready infrastructure that changes the economics of AI for every business willing to spend one afternoon setting it up.
The Real Cost of Renting Intelligence
Here’s a pattern we see in almost every consulting engagement: a company has been running AI through APIs for 6 to 18 months. The initial bill looked manageable. Now it doesn’t.
A logistics company in the Ruhr area came to us with exactly this situation. They’d built five internal automation workflows over 18 months - document processing, supplier email classification, route summary generation, internal FAQ responses, and a basic customer inquiry router. Each workflow seemed cheap in isolation.
Combined, they were processing roughly 12,000 API calls per month through GPT-4o. Cost: €340/month. That’s €4,080/year. For document processing.
Their legal team had also flagged a GDPR concern: supplier contracts, customer delivery addresses, and internal routing data were all being sent to OpenAI’s servers in the US. Not illegal under current interpretation, but not clean either - and the EU AI Act enforcement timeline was making the legal team nervous.
We ran a three-hour audit. The conclusion: every single one of their five workflows could run on Gemma 4 9B locally. The switch took one developer, one afternoon, and a €1,200 server upgrade they’d been postponing for other reasons anyway.
Month 4 after the migration: their AI infrastructure cost €18/month in electricity. Net saving in year one: over €3,000. GDPR issue: closed. No more external data transfer. Compliance officer satisfied.
That’s not an exceptional case. That’s what local AI does to the unit economics.
What Gemma 4 Actually Is - And What It Isn’t
Before we get to deployment, it’s worth being precise. Gemma 4 is a family of language models. Not an API service. Not a product you subscribe to. A model - a set of weights - that you download and run on your own hardware.
The four sizes:
-
Gemma 4 2B - Runs on a modern laptop with 8GB RAM. Fast response times. Best for classification tasks, short-form text extraction, simple summarization. If you need AI on a device with limited compute, this is your starting point.
-
Gemma 4 9B - The sweet spot for most SMB deployments. Runs comfortably on a server or workstation with 16GB RAM. Handles multi-step reasoning, document analysis, email drafting, and German/English bilingual content well. Think of this as your workhorse model.
-
Gemma 4 27B - Needs serious hardware (32GB+ RAM), but delivers quality comparable to GPT-4 Turbo on most business tasks. If your use case requires nuanced reasoning or complex content generation, this is the right tier.
-
Gemma 4 31B - Enterprise hardware territory. Unless you’re building something at significant scale or with very demanding quality requirements, 27B covers you.
The Apache 2.0 license is the detail that changes everything. Previous open-source models often came with non-commercial restrictions or required attribution in ways that made business use complicated. Apache 2.0 is clean: you can use it, modify it, embed it in commercial products, and sell those products. No royalties, no usage caps, no dependency on a third party’s pricing decisions.
Agentic workflow support is built in. Gemma 4 was designed with multi-step autonomous task execution in mind. It can chain actions, use tools, and operate in pipelines - which matters enormously when you’re building systems that do real work rather than just answering one-off questions.
The Economics of Owning vs. Renting AI
Let’s run the numbers for a 20-person professional services firm that wants to automate internal knowledge work.
Rented AI (current reality for most):
- 15,000 API calls/month through OpenAI GPT-4o: ~€270-380/month
- Annual cost: €3,240-4,560
- Data leaves the building: yes
- Cost per additional 5,000 calls: ~€90
- Risk of pricing change: high (OpenAI changed API pricing four times in 24 months)
Local Gemma 4 9B:
- Hardware: one server upgrade or dedicated workstation: €1,000-1,800 one-time
- Electricity and maintenance: €15-25/month
- Annual cost (year 1 including hardware): €1,300-2,100
- Annual cost (year 2+): €180-300
- Data leaves the building: no
- Cost per additional 5,000 calls: €0
- Risk of pricing change: zero
Breakeven at month 3 to 5 depending on hardware cost. From year 2 onward, you’re running AI for less than €300/year.
A second case: an e-commerce company with 45 employees running roughly 8,000 monthly AI operations for product description optimization, customer support draft generation, and returns classification. They were on a €220/month API plan. After switching to local Gemma 4 9B, the monthly operational cost dropped to €12 in electricity. The one-time setup cost was €800 (hardware upgrade already in the budget). They reached breakeven in month 4.
The product description quality? Slightly different from GPT-4, but within acceptable range after two sessions of prompt refinement. Their conversion rate on optimized descriptions: unchanged.
GDPR Is Not Optional - Local AI Makes It Simple
For any company operating in the EU, the data question is not theoretical. Sending customer data, internal documents, or business-sensitive information through a US API means that data is being processed outside the EU. Under GDPR, this requires either a Data Processing Agreement (DPA) with the API provider, Standard Contractual Clauses (SCCs), or other appropriate safeguards.
OpenAI, Anthropic, and Google all offer DPAs. The paperwork is manageable. But it doesn’t make the data transfer go away - it just makes it documented. If there’s ever an incident, a data request, or an audit, you’ll need to account for everything that left your servers.
With a local model, the answer to “where did this data go?” is: nowhere. It was processed on your server, by your model, in your jurisdiction. That’s not a minor detail for law firms, financial services companies, healthcare-adjacent businesses, or anyone handling sensitive B2B contract data.
The NIS2, AI Act, and CSRD compliance wave we covered last week makes this even more relevant. Three regulatory frameworks are converging simultaneously. Every one of them traces back to data control. Local AI removes an entire category of compliance risk - not by adding controls, but by eliminating the exposure in the first place.
The System: Deploy Gemma 4 in an Afternoon
You don’t need a data science team. You need one developer, a few hours, and clear requirements for one specific use case.
Step 1: Define the task first
The biggest mistake in AI deployment is starting with the technology. Start with the task. What is a process that happens more than 30 times per month, follows a predictable pattern, and currently requires manual attention?
Good candidates:
- Classifying incoming support emails by type and urgency
- Summarizing internal reports or meeting notes
- Drafting standard responses from a template
- Extracting structured data from invoices, forms, or contracts
- Translating or localizing content between German and English
Pick one. Not five. One.
Step 2: Choose the right model size
For most SMB tasks, Gemma 4 9B is the answer. Start there. You can upgrade later if quality doesn’t meet your threshold.
Step 3: Install Ollama
Ollama is the fastest path to running open-source models locally. It’s free, available for Mac, Linux, and Windows, and takes under 10 minutes to set up.
# On Mac or Linux:
curl -fsSL https://ollama.ai/install.sh | sh
# Pull Gemma 4 9B:
ollama pull gemma4:9b
# Start the local API:
ollama serve
Your model is now running at localhost:11434. The API is OpenAI-compatible - any tool that works with OpenAI (n8n, Make.com, LangChain, custom Python scripts) can be pointed to this endpoint with a one-line change.
Step 4: Build one workflow and measure it
Connect your first workflow. Run it in parallel with your current setup for two weeks. Compare output quality. Track time saved. Calculate cost difference. Only then expand.
Step 5: Prompt engineering is a one-time investment
Open-source models respond differently to prompts than commercial models. What works for GPT-4 may need adjustment for Gemma 4. Expect 2 to 4 hours of prompt refinement for your first use case. After that, the patterns carry forward to new workflows.
The Competitive Angle Most Businesses Miss
Here’s a positioning argument that’s working well in our B2B consulting engagements: “We run AI without your data leaving your premises.”
For enterprise clients, this is a procurement differentiator. Their legal, IT, and compliance teams have been burned by vendor data handling. A service provider who can demonstrate that client data never touches a third-party API stands out in a competitive conversation.
If you’re building internal tools or client-facing products, local AI is a feature - not just a cost optimization. “Powered by local AI - your data never leaves” is a line that lands in regulated industries, with privacy-conscious buyers, and in any public sector or healthcare-adjacent context.
The agentic workflow shift we’re tracking shows that 40% of enterprise workflows are going agentic this year. Local models are the foundation of agentic systems that can run at scale without a per-call price tag. And the 5-minute lead response benchmark shows that response speed is a direct revenue driver - local models eliminate API latency and rate limits as bottlenecks in customer-facing workflows.
Renting AI was a reasonable starting point when local options didn’t exist. Today, it’s a choice. And for most SMBs running steady AI workloads, it’s the wrong one.
One Action
This week: identify the one AI task in your business with the highest monthly call volume. Open a spreadsheet. Calculate what you paid for it in API costs over the last three months.
Then download Ollama, pull Gemma 4 9B, and run your five most recent examples through it locally.
If the quality is acceptable - and for most document processing, classification, and drafting tasks it will be - you have a clear migration path. The model is free. The infrastructure is one afternoon. The savings start from day one.
If you want help mapping which of your processes qualify for local AI first, that’s the kind of focused analysis we run in our Strategy Call. Forty-five minutes, concrete recommendations, no pitch deck.
Your AI budget should go to scale, not overhead.


