gemma3n:e2b

719372f8c7de · 5.6GB

Architecture: gemma3n · Parameters: 4.46B · Quantization: Q4_K_M

License: Gemma Terms of Use (last modified March 24, 2025)

Readme

Gemma 3n models are designed for efficient execution on everyday devices such as laptops, tablets or phones. These models were trained with data in over 140 spoken languages.

Gemma 3n models use selective parameter activation technology to reduce resource requirements. This technique allows the models to operate at an effective size of 2B and 4B parameters, which is lower than the total number of parameters they contain.
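Conceptually, only a subset of the weights participates in a given forward pass, so the memory that must stay resident is closer to the effective size than the total size; for example, the e2b build above stores 4.46B parameters but runs at an effective 2B. The toy sketch below illustrates the idea only; it is not Gemma 3n's actual mechanism, and the shapes and the fixed slice are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

total_cols = 8    # stand-in for a layer's total parameter count
active_cols = 4   # stand-in for the "effective" subset used per pass

W = rng.standard_normal((16, total_cols))  # full weight matrix kept in storage
x = rng.standard_normal(16)

# Activate only a slice of the columns: the matmul touches half the weights,
# so only that half needs to be resident for this pass.
y = x @ W[:, :active_cols]
print(y.shape)  # (4,)
```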

Models

Effective 2B

ollama run gemma3n:e2b

Effective 4B

ollama run gemma3n:e4b
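
Beyond the CLI, the models can be called programmatically. A minimal sketch using the ollama Python client (assumptions: the client is installed via `pip install ollama` and a local Ollama server is running on the default port):

```python
import ollama

# Send a single chat turn to the effective-2B model through the
# local Ollama server (default address http://localhost:11434).
response = ollama.chat(
    model="gemma3n:e2b",
    messages=[{"role": "user", "content": "Explain in one sentence what Gemma 3n is."}],
)

print(response["message"]["content"])
```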

Evaluation

Model evaluation metrics and results.

Benchmark Results

These models were evaluated at full precision (float32) against a large collection of different datasets and metrics to cover different aspects of content generation. Evaluation results marked with IT are for instruction-tuned models. Evaluation results marked with PT are for pre-trained models. The models available on Ollama are instruction-tuned models.
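Note that full precision differs from the Ollama build above, which is Q4_K_M quantized. A rough back-of-the-envelope comparison, using the 4.46B parameter count from the metadata (illustrative arithmetic only):

```python
# Approximate weight-only memory for gemma3n:e2b at different precisions.
params = 4.46e9                # total parameters (from the model metadata)
fp32_gb = params * 4 / 1e9     # float32: 4 bytes per parameter -> ~17.8 GB
q4_gb = params * 0.5 / 1e9     # naive 4-bit: 0.5 bytes per parameter -> ~2.2 GB

print(f"float32 ~{fp32_gb:.1f} GB, naive 4-bit ~{q4_gb:.1f} GB")
# The shipped file is 5.6GB: quantization formats such as Q4_K_M keep some
# tensors at higher precision, so the naive 4-bit figure is only a floor.
```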

Reasoning and factuality

| Benchmark | Metric | n-shot | E2B PT | E4B PT |
|---|---|---|---|---|
| HellaSwag | Accuracy | 10-shot | 72.2 | 78.6 |
| BoolQ | Accuracy | 0-shot | 76.4 | 81.6 |
| PIQA | Accuracy | 0-shot | 78.9 | 81.0 |
| SocialIQA | Accuracy | 0-shot | 48.8 | 50.0 |
| TriviaQA | Accuracy | 5-shot | 60.8 | 70.2 |
| Natural Questions | Accuracy | 5-shot | 15.5 | 20.9 |
| ARC-c | Accuracy | 25-shot | 51.7 | 61.6 |
| ARC-e | Accuracy | 0-shot | 75.8 | 81.6 |
| WinoGrande | Accuracy | 5-shot | 66.8 | 71.7 |
| BIG-Bench Hard | Accuracy | few-shot | 44.3 | 52.9 |
| DROP | Token F1 score | 1-shot | 53.9 | 60.8 |

Multilingual

| Benchmark | Metric | n-shot | E2B IT | E4B IT |
|---|---|---|---|---|
| MGSM | Accuracy | 0-shot | 53.1 | 60.7 |
| WMT24++ (ChrF) | Character-level F-score | 0-shot | 42.7 | 50.1 |
| Include | Accuracy | 0-shot | 38.6 | 57.2 |
| MMLU (ProX) | Accuracy | 0-shot | 8.1 | 19.9 |
| OpenAI MMLU | Accuracy | 0-shot | 22.3 | 35.6 |
| Global-MMLU | Accuracy | 0-shot | 55.1 | 60.3 |
| ECLeKTic | ECLeKTic score | 0-shot | 2.5 | 1.9 |

STEM and code

| Benchmark | Metric | n-shot | E2B IT | E4B IT |
|---|---|---|---|---|
| GPQA Diamond | RelaxedAccuracy/accuracy | 0-shot | 24.8 | 23.7 |
| LiveCodeBench v5 | pass@1 | 0-shot | 18.6 | 25.7 |
| Codegolf v2.2 | pass@1 | 0-shot | 11.0 | 16.8 |
| AIME 2025 | Accuracy | 0-shot | 6.7 | 11.6 |

Additional benchmarks

| Benchmark | Metric | n-shot | E2B IT | E4B IT |
|---|---|---|---|---|
| MMLU | Accuracy | 0-shot | 60.1 | 64.9 |
| MBPP | pass@1 | 3-shot | 56.6 | 63.6 |
| HumanEval | pass@1 | 0-shot | 66.5 | 75.0 |
| LiveCodeBench | pass@1 | 0-shot | 13.2 | 13.2 |
| HiddenMath | Accuracy | 0-shot | 27.7 | 37.7 |
| Global-MMLU-Lite | Accuracy | 0-shot | 59.0 | 64.5 |
| MMLU (Pro) | Accuracy | 0-shot | 40.5 | 50.6 |

Usage and Limitations

These models have certain limitations that users should be aware of.

Intended Usage

Open generative models have a wide range of applications across various industries and domains. The following list of potential uses is not comprehensive; it provides context on the use cases the model creators considered as part of model training and development.

  • Content Creation and Communication
    • Text Generation: Generate creative text formats such as poems, scripts, code, marketing copy, and email drafts.
    • Chatbots and Conversational AI: Power conversational interfaces for customer service, virtual assistants, or interactive applications.
    • Text Summarization: Generate concise summaries of a text corpus, research papers, or reports (see the sketch after this list).
    • Image Data Extraction: Extract, interpret, and summarize visual data for text communications.
    • Audio Data Extraction: Transcribe spoken language, translate speech to text in other languages, and analyze sound-based data.
  • Research and Education
    • Natural Language Processing (NLP) and Generative Model Research: These models can serve as a foundation for researchers to experiment with generative models and NLP techniques, develop algorithms, and contribute to the advancement of the field.
    • Language Learning Tools: Support interactive language learning experiences, aiding in grammar correction or providing writing practice.
    • Knowledge Exploration: Assist researchers in exploring large bodies of data by generating summaries or answering questions about specific topics.
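
As a concrete illustration of the summarization use case above, here is a minimal sketch that calls Ollama's local REST API directly (assumptions: a local server on the default port, and `report_text` holding whatever document you want condensed):

```python
import requests

report_text = "..."  # any long document you want condensed

# /api/generate is Ollama's one-shot completion endpoint; stream=False
# returns the whole completion in a single JSON payload.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma3n:e2b",
        "prompt": "Summarize the following report in three bullet points:\n\n" + report_text,
        "stream": False,
    },
)
print(resp.json()["response"])
```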

Ethics and Safety

Ethics and safety evaluation approach and results.

Evaluation Approach

Our evaluation methods include structured evaluations and internal red-teaming testing of relevant content policies. Red-teaming was conducted by a number of different teams, each with different goals and human evaluation metrics. These models were evaluated against a number of different categories relevant to ethics and safety, including:

  • Child Safety: Evaluation of text-to-text and image-to-text prompts covering child safety policies, including child sexual abuse and exploitation.
  • Content Safety: Evaluation of text-to-text and image-to-text prompts covering safety policies, including harassment, violence and gore, and hate speech.
  • Representational Harms: Evaluation of text-to-text and image-to-text prompts covering safety policies, including bias, stereotyping, and harmful associations or inaccuracies.

In addition to development-level evaluations, we conduct “assurance evaluations”, our arms-length internal evaluations for responsibility governance decision making. They are conducted separately from the model development team to inform release decisions. High-level findings are fed back to the model team, but prompt sets are held out to prevent overfitting and preserve the results’ ability to inform decision making. Notable assurance evaluation results are reported to our Responsibility & Safety Council as part of release review.

Evaluation Results

Across all areas of safety testing, we saw safe levels of performance in the categories of child safety, content safety, and representational harms relative to previous Gemma models. All testing was conducted without safety filters to evaluate the model's capabilities and behaviors. For text-to-text, image-to-text, and audio-to-text, and across all model sizes, the model produced minimal policy violations and showed significant improvements over previous Gemma models with respect to high-severity violations. A limitation of our evaluations was that they included primarily English-language prompts.