Gemma 4 Free

Chat with Google's most capable open AI model for free. Multimodal, 256K context, 140+ languages. No login required.


Gemma 4 Free Models

Mobile / Embedded

Gemma 4 E2B

2.3B effective / 5.1B total

Ultra-lightweight with Per-Layer Embeddings (PLE). Runs on Raspberry Pi, smartphones, and IoT devices. Native audio input for voice applications.

Context: 128K
Modalities: Text, Image, Audio
Hardware: 8GB RAM
Best for: Voice assistants, mobile apps, IoT
Laptop / Desktop

Gemma 4 E4B

4.5B effective / 8B total

Best balance of quality and efficiency for on-device AI. Full multimodal with native audio. Runs on consumer GPUs and Apple Silicon Macs.

Context: 128K
Modalities: Text, Image, Audio
Hardware: 6GB VRAM
Best for: Local chatbots, code assistants, offline AI
Best Value

Gemma 4 27B MoE

3.8B active / 26B total

Mixture-of-Experts: 128 experts, only 8 active per token. Achieves near-flagship performance at a fraction of the compute cost. #6 on LMArena with just 3.8B active parameters.

LMArena: #6 (ELO 1441)
Context: 256K
Modalities: Text, Image
Hardware: ~15GB VRAM (Q4)
Best for: Production chatbots, API serving, cost-sensitive inference
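The routing step behind the MoE numbers above (128 experts, 8 active per token) can be sketched in a few lines. This is a generic top-k softmax gate, not Gemma's actual router; the logits here are random placeholders:

```python
import math
import random

NUM_EXPERTS = 128   # experts per MoE layer (per the spec above)
TOP_K = 8           # experts activated per token

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits):
    """Select the TOP_K highest-scoring experts and renormalize their gate weights."""
    top = sorted(range(len(router_logits)), key=router_logits.__getitem__, reverse=True)[:TOP_K]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
routing = route_token(logits)  # 8 (expert_id, weight) pairs; weights sum to 1
```

Only the 8 selected experts run a forward pass for that token, which is why a 26B-parameter model can bill itself at ~3.8B active compute.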
Flagship

Gemma 4 31B

31B dense

Maximum quality dense model. #3 on LMArena globally. 89.2% AIME 2026 math, 80% LiveCodeBench coding, 94.1% HumanEval. Best foundation for fine-tuning and enterprise deployment.

LMArena: #3 (ELO 1452)
Context: 256K
Modalities: Text, Image
Hardware: ~20GB VRAM (Q4)
Best for: Enterprise AI, fine-tuning, research, complex reasoning

What Gemma 4 Can Do

Text

256K tokens context
  • Process entire codebases and long documents in a single prompt
  • 140+ languages with strong multilingual performance
  • Built-in reasoning (thinking mode) for complex math and logic
  • Native function calling and structured JSON output for agentic workflows
  • Codeforces ELO 2150 — above 98% of human competitive programmers
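The function-calling bullet above can be made concrete: a minimal sketch of validating a structured JSON tool call emitted by the model. The `get_weather` tool and its schema are invented for illustration and are not part of any Gemma API:

```python
import json

# Hypothetical tool schema you might describe to the model (names are illustrative).
WEATHER_TOOL = {
    "name": "get_weather",
    "description": "Look up current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def parse_tool_call(model_output: str):
    """Parse and validate a structured JSON tool call from the model's reply."""
    call = json.loads(model_output)
    if call.get("name") != WEATHER_TOOL["name"]:
        raise ValueError(f"unknown tool: {call.get('name')}")
    args = call.get("arguments", {})
    for required in WEATHER_TOOL["parameters"]["required"]:
        if required not in args:
            raise ValueError(f"missing argument: {required}")
    return call["name"], args

# Simulated structured output from the model:
reply = '{"name": "get_weather", "arguments": {"city": "Tokyo"}}'
name, args = parse_tool_call(reply)
```

In an agentic loop, the validated `(name, args)` pair is dispatched to the real function and the result is fed back to the model as the next turn.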

Vision

Image & video understanding
  • Configurable image token budgets (70–1120 tokens) for speed-quality tradeoff
  • OCR, chart interpretation, diagram understanding, visual reasoning
  • Variable aspect ratio with ViT encoder (16×16 patches, 2D RoPE)
  • 76.9% MMMU Pro, 85.6% MATH-Vision on 31B model
  • Process video as multi-frame sequences
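The speed-quality tradeoff from configurable image token budgets reduces to simple arithmetic. A sketch, assuming a hypothetical 4,000 tokens reserved for the text prompt; only the 70–1120 budget range and the 256K window come from this page:

```python
CONTEXT_WINDOW = 256_000            # tokens (model spec above)
MIN_BUDGET, MAX_BUDGET = 70, 1120   # supported per-image token budgets

def max_frames(token_budget: int, reserved_for_text: int = 4_000) -> int:
    """How many images/video frames fit in context at a given per-image budget?"""
    if not MIN_BUDGET <= token_budget <= MAX_BUDGET:
        raise ValueError("budget outside supported range")
    return (CONTEXT_WINDOW - reserved_for_text) // token_budget

fast = max_frames(70)       # low detail per frame, many frames
detailed = max_frames(1120)  # high detail per frame, far fewer frames
```

Low budgets suit long video sequences; high budgets suit dense OCR or chart reading on a handful of images.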

Audio

E2B & E4B models
  • Native speech recognition via USM-style conformer encoder (~300M params)
  • No separate ASR pipeline needed — direct audio-to-text understanding
  • Ideal for on-device voice assistants and real-time transcription
  • Available on edge models (E2B, E4B) for mobile deployment

Gemma 4 Benchmarks

MMLU Pro
Gemma 4 31B: 85.2 · Qwen 3.5 27B: 86.1 · Llama 4 Scout: 81.0 · GPT-4o: 83.5

AIME 2026
Gemma 4 31B: 89.2 · Qwen 3.5 27B: 85.0 · Llama 4 Scout: 72.0 · GPT-4o: 76.0

LiveCodeBench v6
Gemma 4 31B: 80.0 · Qwen 3.5 27B: 72.0 · Llama 4 Scout: 68.0 · GPT-4o: 70.0

GPQA Diamond
Gemma 4 31B: 84.3 · Qwen 3.5 27B: 85.5 · Llama 4 Scout: 74.0 · GPT-4o: 78.0

Gemma 4 vs Qwen 3.5 Comparison

                          Gemma 4 31B            Qwen 3.5 27B
LMArena Rank              #3 (1452)              #4 (1448)
AIME 2026 (Math)          89.2%                  ~85%
LiveCodeBench (Code)      80.0%                  72.0%
MMLU Pro (Knowledge)      85.2%                  86.1%
GPQA Diamond (Science)    84.3%                  85.5%
Multimodal                Text + Image + Video   Text + Image
Audio Input               Yes (E2B/E4B)          No
Context Window            256K                   128K
Languages                 140+                   29
License                   Apache 2.0             Apache 2.0
Edge Models               E2B (2B), E4B (4B)     Qwen3 0.6B/1.7B/4B
MoE Variant               27B (3.8B active)      No

Why Gemma 4

Multimodal Native

Every model understands text and images. Edge models add native audio. No separate pipelines needed.

256K Context Window

Process entire codebases, research papers, or hours of conversation history in a single prompt.

140+ Languages

Broad multilingual support including CJK, Arabic, Hindi, and 130+ more languages.

Apache 2.0 License

Fully permissive. No user count limits, no commercial restrictions. Unlike Llama's 700M MAU cap.

On-Device Ready

E2B runs on 8GB RAM devices. Quantized 31B fits on a single RTX 4090. NVIDIA, AMD, and Apple Silicon supported.

Built-in Reasoning

Thinking mode for step-by-step reasoning. AIME jumped from 20.8% (Gemma 3) to 89.2% (Gemma 4).

Agentic Workflows

Native function calling, structured JSON output, and multi-step planning for tool-use agents.

Fine-tuning Ready

LoRA, QLoRA, full SFT supported. Works with HuggingFace PEFT, Keras, Unsloth, and NVIDIA NeMo.
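The LoRA technique named above can be sketched without any framework: the frozen weight W receives a low-rank update (alpha/r)·B·A, and only the small factors A and B are trained. All dimensions here are illustrative, not Gemma's:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 8, 16             # hidden size, LoRA rank, scaling (all illustrative)

W = rng.normal(size=(d, d))         # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                # trainable up-projection; zero init => adapter is a no-op

def adapted_forward(x):
    """Forward pass with the LoRA update W + (alpha/r) * B @ A folded in."""
    return x @ (W + (alpha / r) * (B @ A)).T

x = rng.normal(size=(1, d))
baseline = x @ W.T                  # frozen-model output; equals adapter output at init
```

The payoff: 2·d·r trainable parameters instead of d², which is what lets a quantized 31B fine-tune fit on a single consumer GPU.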

Gemma 4 Use Cases

Code Generation

94.1% HumanEval, 2150 Codeforces ELO. Generates, reviews, and debugs code. Outperforms GPT-4o on coding benchmarks.

RAG & Document Q&A

256K context + function calling. Ingest entire PDFs, codebases, or knowledge bases. Structured JSON output for pipelines.

Multimodal Chat

Understand images, charts, diagrams, and video frames. Describe photos, extract data from screenshots, analyze documents.

On-Device AI

E2B and E4B with native audio run on phones, Raspberry Pi, and Jetson. Offline-capable with <1.5GB RAM footprint.

Enterprise Deployment

Vertex AI managed serving, sovereign cloud ready. Apache 2.0 means no license headaches. 15+ frameworks supported day one.

Research & Fine-tuning

Base models for custom training. Used by Yale (cancer research), INSAIT (BgGPT for Bulgarian), and 100K+ community variants.

Gemma 4 Architecture

Per-Layer Embeddings (PLE): Each decoder layer gets its own small embedding table. E2B has 5.1B total params but only 2.3B effective compute — more capacity without more inference cost.
Mixture of Experts (MoE): 27B model has 128 experts with 8 active per token. Only 3.8B parameters activate per forward pass, delivering near-31B quality at ~4B compute.
Hybrid Attention: Interleaves local sliding-window (512/1024 tokens) with global full-context attention. Fast processing + deep long-range understanding.
Shared KV Cache: Keys=Values constraint in global attention layers halves KV cache memory. Enables longer context on less hardware.
Dual RoPE: Combines standard RoPE with proportional extension (p=0.25). Better position encoding for long sequences up to 256K tokens.
Vision Encoder: ViT with 16×16 patches and 2D RoPE. Configurable token budgets (70–1120) let you trade image detail for speed.
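The interleaved local/global scheme above can be sketched as causal attention masks. The layer pattern and tiny sizes here are illustrative; only the sliding-window idea comes from the architecture description:

```python
def attention_mask(seq_len, window=None):
    """Causal mask: window=None gives global attention; an int gives a sliding window."""
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            visible = k <= q and (window is None or q - k < window)
            row.append(visible)
        mask.append(row)
    return mask

# Hypothetical interleaving: a few local layers (512-token window) per global layer.
LAYER_PATTERN = ["local"] * 5 + ["global"]

local_mask = attention_mask(8, window=4)  # tiny sizes for illustration
global_mask = attention_mask(8)           # full causal attention
```

Local layers only attend over the last `window` tokens, so their KV cache stays constant-size; the occasional global layer stitches long-range context back together.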

Get Started with Gemma 4 Free

# Install and run the flagship model
ollama run gemma4:31b

# Or the efficient MoE variant
ollama run gemma4:27b

# Edge model for lightweight devices
ollama run gemma4:4b
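Beyond the CLI, a local Ollama install also serves a REST API. A minimal sketch of building a request for its documented `/api/generate` endpoint; the network call itself is left commented out since it needs a running server, and the model tag is the one used above:

```python
import json

def build_generate_request(model: str, prompt: str) -> dict:
    """Payload for Ollama's local REST endpoint: POST http://localhost:11434/api/generate"""
    return {"model": model, "prompt": prompt, "stream": False}

payload = build_generate_request("gemma4:31b", "Summarize this repo's README.")
body = json.dumps(payload)

# To actually send it (requires a running Ollama server):
#   import urllib.request
#   req = urllib.request.Request("http://localhost:11434/api/generate",
#                                data=body.encode(),
#                                headers={"Content-Type": "application/json"})
#   print(urllib.request.urlopen(req).read().decode())
```

With `"stream": False` the server returns one JSON object containing the full response; omit it to receive newline-delimited streaming chunks.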

Gemma 4 FAQ
