Welcome to AI Decoded, Fast Company’s weekly LinkedIn newsletter that breaks down the most important news in the world of AI. If a friend or colleague shared this newsletter with you, you can sign up to receive it every week here.
GOOGLE ANNOUNCES GEMINI MODEL, WHICH CAN SEE AND HEAR
Google was caught flat-footed when OpenAI suddenly released ChatGPT to the public a year ago, and the search giant has been furiously playing catch-up ever since. On Wednesday, Google announced its powerful new Gemini large language models (LLMs), which it says are the first built to process not just words but also sounds and images. Gemini was developed in part by the formidable brains at Google DeepMind, with involvement across the organization. Based on what I saw in a press briefing earlier this week, Gemini could put Google back at the front of the current AI arms race.
Google is releasing a family of Gemini models: a large Ultra model for complex AI tasks, a midsize Pro model for more general work, and a smaller Nano model that’s designed to run on mobile phones and the like. (In fact, Google plans to build Gemini into the Android OS for one of its phones next year.) The Ultra model “exceeds current state-of-the-art results” on 30 of the 32 benchmarks commonly used to test LLMs, Google says. It also scored 90% on a harder test called Massive Multitask Language Understanding (MMLU), which assesses a model’s comprehension ability in 57 subject areas including math, physics, history, and medicine. Google says it’s the first LLM to score better than most humans on the test.
The models were pretrained (allowed to process large amounts of training data on their own) using images, audio, and code. A Google spokesperson tells me the new models were pretrained using “data” from YouTube but didn’t say if they were pretrained by actually “watching” videos, which would be a major breakthrough. (OpenAI’s GPT-4 model is multimodal and can accept image and voice prompts.)
Models that can see and hear are a big step forward in terms of functionality. When running on an Android phone, Gemini (the Nano version) could use the device’s camera and microphones to process images and sounds from the real world. Or, if Nano performs anything like the larger models, it might be used to identify and reason about real-world objects it “sees” through the lenses of a future augmented reality headset (developed by Google or one of its hardware partners). That’s something Apple’s iPhone and Vision Pro VR headset probably won’t be able to deliver next year, though Meta is hard at work on XR headsets that perform this sort of visual computing.
During the press briefing Tuesday, Google screened a video showing Gemini reasoning over a set of images. In the video, a person placed an orange and a fidget toy on the table in front of a lens connected to Gemini. Gemini immediately identified both objects and responded with a clever commonality between the two items: “Citrus can be calming and so can the spin of a fidget toy,” it said aloud. In another video, Gemini is shown a math test on which a student has handwritten the calculations for a problem. Gemini then identifies and explains the errors in the student’s calculations.
In the near term, Gemini’s powers can be experienced through Google’s Bard chatbot. Google says Bard will now be powered by the mid-tier Gemini Pro model, which it expects will give the chatbot better learning and reasoning skills. Bard will upgrade to the more powerful Gemini Ultra model next year, says Sissie Hsiao, Google’s VP/GM of Assistant and Bard. Developers and enterprise customers will be able to access and build on Gemini Pro via an API served from the Google Cloud starting December 13, a spokesperson said.
IBM AND META JOIN 50 OTHER ORGANIZATIONS IN AI ALLIANCE
IBM, Meta, and about 50 other tech companies and universities have set up a new organization to promote open-source AI. The organization, called the AI Alliance, aims to build and support open technology for AI, as well as the communities that will enable AI to benefit business and society.
As the tech industry, regulators, researchers, and others struggle to understand the real risks of advanced AI models, a debate rages about whether it’s better to develop AI models out in the open so that anyone can understand and build on them, or to keep AI models secret in order to prevent bad actors from using the tech for nefarious purposes (say, flooding the web with misinformation). While tech industry heavyweight OpenAI has become more secretive about its research as competition among AI developers has increased, other players like Meta and Hugging Face have been very vocal about the advantages of open source.
The AI Alliance’s formation comes as lawmakers in Europe and the U.S. are working toward passing safety and responsibility regulations for AI developers. Many in the AI community hope the government’s eventual regulations won’t stifle quickly moving development. “A key part of this objective will be to advocate for smart policymaking that is focused on regulating specific applications of AI, not underlying algorithms,” an IBM spokesperson tells me.
But the term “open source” means different things to different people. Meta often talks about open sourcing its models (like Llama), but even it has been criticized for not opening certain details of its models to the developer community. Indeed, one of the Alliance’s main goals will be to define open source. “For example, creating a framework for releasing models with varying levels of openness and defining what those levels are and the standard requirements—these types of nuances are what AI Alliance members intend to tackle,” the IBM spokesperson said.
YANN LECUN: TODAY’S LLMS ARE DUMBER THAN YOUR CAT
Meta’s chief AI scientist, Yann LeCun, said last week at a Meta FAIR press event that LLMs have a long way to go to reach artificial general intelligence (meaning AIs that are smarter than humans in a wide array of tasks). “Those systems are stupider than your house cat,” LeCun told an assembled crowd.
LeCun, who helped invent the neural networks underpinning today’s LLMs, says today’s best AI systems still lack a basic understanding of the world needed to carry out useful tasks. They can’t yet execute higher-order actions like perceiving, remembering, reasoning, and planning. There isn’t yet a robot, he said, that can be trained to figure out how to clear the table after dinner and put the dishes in the dishwasher, a task a seven-year-old could perform. “AGI is not just around the corner,” LeCun added. “It’s going to take a lot of work.”
The difficulty of reaching AGI may lie in how LLMs are trained. Most LLMs train by processing text content from the internet. They can learn a certain amount about the world by doing that, LeCun says, about as much as a baby knows. But babies can also learn from their senses of sight and touch, a much richer flow of training data, if you will. Training transformer models using motion video would be a big improvement, but researchers haven’t discovered how to do that yet, LeCun explained. A more robust way of training models might be the path to creating systems with enough common sense (including an understanding of the laws of physics) to handle more complex tasks on behalf of humans.