
Breaking down Gemma, Google’s new open-source AI model

Gemma marks a return to the practice of releasing new research into the open-source ecosystem.

[Source photo: Vertigo3d/Getty Images]

Welcome to AI Decoded, Fast Company’s weekly newsletter that breaks down the most important news in the world of AI. You can sign up to receive this newsletter every week here.


Google announced today a set of new large language models, collectively called “Gemma,” and a return to the practice of releasing new research into the open-source ecosystem. The new models were developed by Google DeepMind and other teams within the company that already brought us the state-of-the-art Gemini models.

The Gemma models come in two sizes: one built on a neural network with 2 billion adjustable variables (called parameters) and one with 7 billion parameters. Both sizes are significantly smaller than the largest Gemini model, “Ultra,” which is said to be well beyond a trillion parameters, and more in line with the 1.8B- and 3.25B-parameter Gemini Nano models. While Gemini Ultra can handle large or nuanced requests, it requires data centers full of expensive servers.

The Gemma models, meanwhile, are small enough to run on a laptop or desktop workstation. Or they can run in the Google cloud, for a price. (Google says its researchers optimized the Gemma models to run on Nvidia GPUs and Google Cloud TPUs.)
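To see why models of this size can fit on a laptop or workstation, here is a rough back-of-the-envelope sketch in Python. It uses the rounded parameter counts quoted above and assumes common numeric precisions (4, 2, or 1 bytes per parameter); the function name is hypothetical, and the estimate covers only the weights, not the extra memory a model needs while running.

```python
# Rough memory needed just to hold model weights (ignores activations and caches).
# Bytes per parameter for common precisions; these values are an assumption.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(num_params: int, dtype: str = "float16") -> float:
    """Return the approximate size of the weights in gigabytes (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    # Rounded parameter counts as quoted in the article.
    for name, params in [("Gemma 2B", 2_000_000_000), ("Gemma 7B", 7_000_000_000)]:
        print(f"{name}: about {weight_memory_gb(params):.0f} GB at float16")
```

Under these assumptions the smaller model's weights come to roughly 4 GB and the larger model's to roughly 14 GB at 16-bit precision, which is within reach of a well-equipped laptop or desktop.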

The Gemma models will be released to developers on Hugging Face, accompanied by the model weights that resulted from pretraining. Google will also include the inference code and the code for fine-tuning the models. It is not supplying the data or code used during pretraining. Both Gemma sizes are released in two variants—one that’s been pretrained and the other that’s already been fine-tuned with pairs of questions and corresponding answers.

But why is Google releasing open models in a climate where state-of-the-art LLMs are hidden away as proprietary? In short, Google is acknowledging that a great many developers, large and small, don’t just build their apps atop a third-party LLM (such as Google’s Gemini or OpenAI’s GPT-4) that they access via a paid API, but also use free and open-source models at certain times and for certain tasks.

The company would rather see non-API developers build with a Google model than move their app to Meta’s Llama or some other open-source model. Those developers would remain in Google’s ecosystem and might be more likely to host their models in Google Cloud, for example. For the same reasons, Google built Gemma to work on a variety of common development platforms.

There’s of course a risk that bad actors will use open-source generative AI models to do harm. Google DeepMind director Tris Warkentin said during a call with media on Tuesday that Google researchers tried to simulate all the nasty ways that bad actors might try to use Gemma, then used extensive fine-tuning and reinforcement-learning to keep the model from doing those things.


Remember that scene in The Fly when the scientist Seth (played by Jeff Goldblum) tries to teleport a piece of steak from one pod to another but fails? “It tastes synthetic,” says science journalist Ronnie (Geena Davis). “The computer is rethinking it rather than reproducing it, and something’s getting lost in the translation,” Seth concludes. I was reminded of that scene, and that problem, last week when I was getting over my initial open-mouthed reaction to videos created by OpenAI’s new Sora tool.

Sora uses a hybrid architecture that combines the accuracy of diffusion models with the scalability of transformer models (meaning that the more computing power you give the model, the better the results). The resulting videos seem more realistic and visually pleasing than those created by the text-to-video generator from Runway, which has been the leader in that space.

But as I looked a bit closer at some of the Sora videos, the cracks began to show. The shapes and movements of things are no longer ridiculously, nightmarishly, wrong, but they’re still not quite right—enough so to break the spell. Objects in videos often move in unnatural ways. The generation of human hands remains a challenge in some cases. For all its flash appeal, Sora still has one foot in the Uncanny Valley.

The model still seems to lack a real understanding of the laws of physics that govern the play of light over objects and surfaces, the fineries of facial expressions, the textures of things. That’s why text-to-video AI still isn’t ready to start putting thousands of actors out of work. However, it’s hard to argue that Sora couldn’t be useful for producing “just in time” or “just good enough” videos, such as for short-run ads for social media.

OpenAI has been able to rapidly improve the capabilities of its large language models by increasing their size, the amount of data they train on, and the amount of compute power they use. A unique quality of the transformer architecture that underpins GPT-4 is that it scales up in predictable and (surprisingly) productive ways. Sora is built on the same transformer architecture. We may see the same rapid improvements in Sora that we’ve seen in the GPT language models in just a few years.


Google announced last week that a new version of its Gemini LLM, called Gemini 1.5 Pro, offers a context window of one million tokens (words or word parts). This is far larger than the previous industry leader, Anthropic’s Claude 2, which offered a 200,000-token window. You can tell Gemini 1.5 Pro to digest an hour of video, or 11 hours of audio, or 30,000 lines of computer code, or 700,000 words.
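The figures above imply a rule of thumb of about 0.7 words per token (700,000 words in a one-million-token window). A minimal Python sketch of that arithmetic follows. The function names are hypothetical, and a whitespace word count is only a stand-in for a real tokenizer, so treat the result as a rough estimate.

```python
# Ratio implied by the article: 700,000 words fit in 1,000,000 tokens,
# i.e. roughly 7 words for every 10 tokens. Integer math avoids rounding surprises.
WORDS_PER_10_TOKENS = 7


def estimate_tokens(word_count: int) -> int:
    """Estimate the token count for a given word count, rounding up."""
    return (word_count * 10 + WORDS_PER_10_TOKENS - 1) // WORDS_PER_10_TOKENS


def fits_in_context(text: str, window_tokens: int = 1_000_000) -> bool:
    """Return True if the text's estimated token count fits in the window."""
    return estimate_tokens(len(text.split())) <= window_tokens


if __name__ == "__main__":
    print(estimate_tokens(700_000))  # words that fill a one-million-token window
    print(estimate_tokens(140_000))  # words that fill a 200,000-token window
```

By the same ratio, a 200,000-token window holds about 140,000 words.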

In the past, the “context window size” metric has been somewhat overplayed because, regardless of the prompt’s capacity for data, there’s no guarantee the LLM will be able to make sense of it all. As one developer told me, LLMs can become overwhelmed by large amounts of prompt data and start spitting out gibberish. This doesn’t seem to be the case with Gemini 1.5 Pro, however. Here are some of the things developers have been doing with the model and its context window:

  • A developer uploaded an hour-long video and asked Gemini 1.5 Pro to answer detailed questions about the content of the video. They then asked the model to write a detailed outline of all slides shown in the video.
  • A developer instructed the LLM to read through every department in a company’s year-end reports and analyze overlapping goals or identify ways for departments to work together.
  • A developer input half a million lines of computer code and asked the model to answer specific questions about code that were discussed in only one place (i.e., the “needle in the haystack” problem).
  • A developer fed the model the entire text of The Great Gatsby, inserted a mention of a laser-lawnmower and an “iPhone in a box,” then asked the model if it “saw anything weird.” Gemini found both additions and explained why they sounded out of place. It even seized on the (real) mention in the book of a business called “Swastika Holding Company,” calling it “historically inaccurate” and “jarring.”
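For readers who want to try a test like the Gatsby one, here is a minimal Python sketch of how such a "needle in the haystack" prompt might be assembled. It is not the developer's actual method: the helper names are hypothetical, the needle sentence is invented for illustration, and the step of sending the prompt to a model is left out.

```python
import random


def insert_needle(paragraphs: list[str], needle: str, seed: int = 0) -> tuple[str, int]:
    """Insert `needle` as a new paragraph at a random position.

    Returns the combined text and the paragraph index where the needle landed.
    """
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    index = rng.randint(0, len(paragraphs))  # randint is inclusive on both ends
    salted = paragraphs[:index] + [needle] + paragraphs[index:]
    return "\n\n".join(salted), index


def build_prompt(document: str) -> str:
    """Append an open-ended question that does not hint at the needle."""
    return document + "\n\nDid you see anything weird in the text above?"


if __name__ == "__main__":
    book = ["First paragraph.", "Second paragraph.", "Third paragraph."]
    text, where = insert_needle(book, "He mowed the lawn with a laser-lawnmower.")
    print(f"Needle inserted at paragraph {where}")
    print(build_prompt(text))
```

Recording where the needle was inserted lets you check afterward whether the model's answer points to the right passage.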



Mark Sullivan is a senior writer at Fast Company, covering emerging tech, AI, and tech policy. Before coming to Fast Company in January 2016, Sullivan wrote for VentureBeat, Light Reading, CNET, Wired, and PCWorld.
