Key takeaways:
- Google unveiled an enhanced Gemini 1.5 Pro, its flagship large multimodal model with a context window of 1 million tokens that will eventually increase to 2 million, the longest in the industry.
- Google debuted Project Astra, a universal AI assistant that can ‘see’ and analyze the world.
- Google also took the wraps off Veo, its video generator and budding rival to OpenAI’s Sora, plus Imagen 3, an improved text-to-image generator, as well as enhanced Search with AI overviews.
Google came out sprinting in the AI race at its annual developer conference, Google I/O. The search giant unveiled a dizzying array of AI-enhanced products, led by an improved Gemini 1.5 Pro with an upcoming 2-million-token context window – the longest in the industry by far.
“At Google, we are fully in the Gemini era,” said Alphabet and Google CEO Sundar Pichai at the conference. Gemini is the company’s flagship large, natively multimodal model.
Google said Gemini 1.5 Pro is now in private preview. The company unveiled Gemini 1.5 Pro earlier this year but launched an enhanced version at the developer conference. What sets this model apart from competitors’ models is its long context window of one million tokens, which will soon double.
Tokens can be whole words or parts of words, images, videos, audio or code; 100 tokens is roughly 75 words. That means the model can process over 700,000 words, an hour of video, 11 hours of audio or 30,000 lines of code – in one go.
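The article’s capacity figure follows from the rule of thumb it cites – roughly 75 words per 100 tokens. A quick back-of-envelope check (the ratio is the approximation from the text above, not an exact tokenizer property):

```python
# Rough word-capacity estimate for a long context window, using the
# ~75 words per 100 tokens rule of thumb cited above (approximate).
WORDS_PER_TOKEN = 75 / 100  # 0.75, a rough average for English text

def words_for_tokens(tokens: int) -> int:
    """Approximate how many words fit in a context window of `tokens`."""
    return int(tokens * WORDS_PER_TOKEN)

print(words_for_tokens(1_000_000))  # 750000 -> "over 700,000 words"
print(words_for_tokens(2_000_000))  # 1500000 once the window doubles
```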
To join the Gemini 1.5 Pro waitlist, go to Google AI Studio or, for cloud customers, Vertex AI. Google also unveiled a smaller, faster version: Gemini 1.5 Flash.
Gemini 1.5 represents a “step change” in Google’s approach, Google DeepMind CEO Demis Hassabis wrote in a blog post: it combines a Transformer with a Mixture-of-Experts (MoE) architecture, making it more efficient to train and serve. (A Transformer works as one large neural network, while an MoE model is divided into smaller ‘expert’ neural networks.) See pricing here.
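The dense-vs-MoE distinction Hassabis describes can be sketched in a few lines: instead of one large network processing every input, a router activates only one of several smaller ‘expert’ networks. This is an illustrative toy (scalar inputs, hand-written top-1 router) – not Gemini’s actual architecture:

```python
# Toy Mixture-of-Experts: a router picks one small "expert" per input,
# so only a fraction of the total parameters run for any given token.
# Purely illustrative -- not Gemini's actual design.

experts = {
    "double": lambda x: x * 2,   # each "expert" is a tiny stand-in network
    "shift":  lambda x: x + 10,
    "decr":   lambda x: x - 1,
}

def router(x: int) -> str:
    """Toy gating function: choose one expert based on the input."""
    if x < 0:
        return "decr"
    return "double" if x % 2 == 0 else "shift"

def moe_forward(x: int) -> int:
    # Top-1 routing: activate a single expert instead of the whole model.
    return experts[router(x)](x)

print(moe_forward(4))   # routed to "double" -> 8
print(moe_forward(3))   # routed to "shift"  -> 13
```

In real MoE models the router is learned and typically activates the top-k experts per token, but the efficiency argument is the same: compute scales with the experts used, not the total parameter count.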
Google also unveiled Project Astra, an AI assistant that can ‘see’ the world and provide information about it. For example, ask it to find objects that make noise – using a smartphone camera panning a room – and it picks out a speaker on a desk. It can do this by continuously encoding video frames and combining the video and speech input into a timeline of events and caching them for efficient recall.
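The recall mechanism described – encoding video frames into a timeline of events and caching them – can be sketched as a bounded, timestamped event log. The class and method names here are assumptions for illustration, not Astra’s implementation:

```python
# Sketch of an Astra-style event timeline: each video frame is encoded
# into a text description and cached with a timestamp, so a later query
# like "where did I see a speaker?" scans the cached log.
# Hypothetical structure -- not Google's actual implementation.
from collections import deque

class EventTimeline:
    def __init__(self, max_events: int = 1000):
        # Bounded cache: the oldest events fall off as new frames arrive.
        self.events = deque(maxlen=max_events)

    def add_frame(self, timestamp: float, description: str) -> None:
        """Stand-in for encoding a video frame into an event."""
        self.events.append((timestamp, description))

    def recall(self, keyword: str) -> list:
        """Return timestamps of cached events mentioning the keyword."""
        return [t for t, desc in self.events if keyword in desc.lower()]

timeline = EventTimeline()
timeline.add_frame(0.0, "desk with a laptop")
timeline.add_frame(1.5, "speaker on a desk")   # an object that makes noise
timeline.add_frame(3.0, "window with curtains")
print(timeline.recall("speaker"))  # [1.5]
```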
Imagen 3, the latest version of Google’s text-to-image generator, was launched as well, showing the ability to generate more photorealistic images. Google said it can draw text accurately – a difficulty seen in most text-to-image generators.
Google also unveiled Veo, its closest competitor to OpenAI’s Sora, which went viral on social media for its capabilities. Veo generates 1080p high-definition video, accepts text, image and video prompts, and can create videos more than a minute long. It also boasts cinematic techniques such as aerial views and time lapses, plus in-video editing. Experience Veo in VideoFX and sign up for the waitlist.
Souped-up Google Search, chips
Gemini will power many of the AI-enhanced products unveiled at Google I/O, including a souped-up version of its search engine. The AI model will not only help people search for information but also help them accomplish tasks.
For example, searching for the best gym in Los Angeles that has a smoothie bar and is located close to the beach will yield a full page of results (called AI Overviews). Gemini uses multi-step reasoning to break down the task and work through it step by step. Soon, users will be able to upload or take a video and ask Gemini questions about it – such as how to fix water gushing from a pipe at home.
Gemini will also enhance Gmail on mobile, giving it the ability to summarize long email threads and search for specific items. Chat will also get AI enhancements, among other Workspace advances.
Pichai showed off Trillium, Google’s sixth-generation TPU for AI compute. He said it delivers a 4.7x improvement in compute performance over the fifth generation and is 67% more energy efficient. Trillium is coming to cloud customers by late 2024. Google also offers CPUs and GPUs, with Nvidia’s Blackwell GPU coming in early 2025.
Open-source model, edge AI
Google’s open-source AI model, Gemma, is also getting refreshed with Gemma 2. Available in June, it adds a 27-billion-parameter model that runs on a single TPU in Google Cloud’s Vertex AI platform. Gemma 2 can perform on par with Meta’s Llama 3 (a 70-billion-parameter model) “at less than half the size,” according to a Google blog post.
Google also introduced PaliGemma, an open-source vision-language model (VLM), and Gemini Nano, a smaller, on-device model that can do things like detect scam calls on smartphones.
Google is also releasing its LLM Comparator as open source, an expansion of its Responsible Generative AI Toolkit. The tool compares responses from different AI models side by side.
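Conceptually, a response comparator lines up two models’ answers to the same prompt and scores their differences. A minimal sketch – the metric and function names are assumptions, and Google’s actual tool is far more capable:

```python
# Minimal side-by-side response comparison in the spirit of an LLM
# comparator: same prompt, two model outputs, a couple of simple metrics.
# Illustrative only -- not Google's LLM Comparator.

def compare_responses(resp_a: str, resp_b: str) -> dict:
    def tokens(text: str) -> set:
        # Crude normalization: lowercase, strip basic punctuation.
        return set(text.lower().replace(".", "").replace(",", "").split())

    words_a, words_b = tokens(resp_a), tokens(resp_b)
    overlap = len(words_a & words_b) / max(len(words_a | words_b), 1)
    return {
        "len_a": len(resp_a.split()),
        "len_b": len(resp_b.split()),
        "word_overlap": round(overlap, 2),  # Jaccard similarity of vocab
    }

result = compare_responses(
    "Paris is the capital of France.",
    "The capital of France is Paris.",
)
print(result)  # identical vocabulary -> word_overlap == 1.0
```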
Pichai said there’s more to come: “We are in the very early days of the AI platform shift.”