Google Unveils Multimodal AI Model Gemini

In a groundbreaking announcement on Wednesday, Google introduced Gemini, a formidable multimodal AI model family poised to challenge the supremacy of OpenAI’s GPT-4, the model that powers the paid version of ChatGPT. Google claims that the largest iteration of Gemini surpasses “current state-of-the-art results on 30 of the 32 widely used academic benchmarks in large language model (LLM) research and development.” This unveiling follows PaLM 2, Google’s earlier attempt to compete with GPT-4’s capabilities.

Gemini stands out as a game-changer due to its multimodal capabilities, adeptly handling various types of input such as text, code, images, and audio. This versatility positions Gemini to tackle a wide array of challenges, from everyday queries to complex scientific problem-solving. Google envisions this technology as a catalyst for a new era in computing, with plans to integrate Gemini into its diverse array of products.

The model comes in three distinct sizes: Gemini Ultra, designed for highly complex tasks; Gemini Pro, suitable for scaling across a broad range of tasks; and Gemini Nano, tailored for on-device tasks like those on Google’s Pixel 8 Pro smartphone. Each size represents a different trade-off between capability and computational cost, with Nano catering to local consumer devices and Ultra demanding data centre hardware.

In the eyes of Sundar Pichai, the CEO of Google and its parent company Alphabet, Gemini represents a significant step forward. Pichai stated, “The model is innately more capable. It’s a platform. AI is a profound platform shift, bigger than web or mobile. And so it represents a big step for us.” Indeed, it is a big step for Google, but not necessarily a giant leap for the field as a whole.

Google DeepMind claims that Gemini outmatches GPT-4 on 30 out of 32 standard measures of performance. However, the margins between them are thin. What Google DeepMind has achieved is the consolidation of AI’s best current capabilities into one powerful package. To judge from demos, Gemini does many things very well—but few things that we haven’t seen before. Despite the buzz surrounding the next big thing, Gemini might be signalling that we’ve reached peak AI hype, at least for now.

Chirag Shah, a professor at the University of Washington specialising in online search, compares the launch of Gemini to Apple’s annual iPhone introductions. “Maybe we have risen to a different threshold now, where this doesn’t impress us as much because we’ve just seen so much,” he suggests.

Like GPT-4, Gemini is multimodal, meaning it is trained to handle multiple kinds of input: text, images, audio, and more. This capability lets Gemini combine different formats to answer questions about everything from household chores to college-level maths to economics.

In a compelling demonstration for journalists, Google showcased Gemini’s ability to take an existing screenshot of a chart, analyse hundreds of pages of research with new data, and then update the chart with that new information. In another example, Gemini was shown pictures of an omelette cooking in a pan and asked (using speech, not text) if the omelette was cooked yet. “It’s not ready because the eggs are still runny,” it replied.
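To make that interaction concrete, here is a rough, unofficial sketch of how such an image-plus-question query could look in code, assuming Google’s google-generativeai Python SDK (the developer access route covered later in this piece) and its multimodal “gemini-pro-vision” model; the API key and image file are illustrative placeholders, not part of Google’s demo:

```python
# Illustrative sketch only, assuming the google-generativeai Python SDK;
# the key and the image file are hypothetical placeholders.
import PIL.Image
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder, not a real key

# Load a photo of the pan and ask about it in the same request,
# mixing an image part and a text part in a single prompt.
image = PIL.Image.open("omelette.jpg")
model = genai.GenerativeModel("gemini-pro-vision")
response = model.generate_content([image, "Is the omelette cooked yet?"])
print(response.text)
```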

Most users will have to wait for the full Gemini experience, though. The version launched today is integrated into Bard, Google’s text-based search chatbot, which the company says will gain more advanced reasoning, planning, and understanding capabilities as a result. The full release of Gemini will be staggered over the coming months, with the English version available in more than 170 countries, initially excluding the EU and the UK while Google engages with local regulators.

Google’s claim that Gemini Ultra outperforms GPT-4 on paper has been met with scepticism. MIT Technology Review notes that while Google claims superiority on 30 out of 32 performance measures, the margins are thin. A benchmark comparison chart provided by Google shows only slight leads in most metrics, raising questions about how substantial a leap Gemini actually makes over GPT-4.

Particularly noteworthy is Gemini Ultra’s 90% score on the Massive Multitask Language Understanding (MMLU) benchmark, which Google says makes it the first model to surpass human expert performance on that test. However, critics warn that such benchmark results should be interpreted carefully, pointing to an ongoing debate within the AI community about the efficacy and relevance of these metrics.

The promise of Gemini lies in its ability to process complex written and visual information, offering breakthroughs across various fields from science to finance. Google envisions Gemini becoming integral to its products, with plans for integration into Search, Ads, Chrome, and Duet AI in the coming months. Additionally, developers and enterprise customers can access Gemini Pro through the Gemini API in Google AI Studio or Google Cloud Vertex AI from December 13.
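For a sense of what that developer access looks like, here is a minimal sketch of a plain-text call to Gemini Pro, again assuming the google-generativeai Python SDK that Google AI Studio issues keys for; the prompt and the key are illustrative, not an official example:

```python
# Minimal sketch, assuming the google-generativeai Python SDK
# (pip install google-generativeai); not an official Google example.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder for a real AI Studio key

model = genai.GenerativeModel("gemini-pro")  # the API-served, mid-size model
response = model.generate_content(
    "Explain the trade-off between model size and latency."
)
print(response.text)
```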

However, the real-world impact of Gemini remains uncertain. Despite its benchmark performance, the question of whether it translates into more accurate and useful answers for end-users remains unanswered. The challenges of evaluating large language models persist, with ongoing debates about the reliability of benchmarks and concerns about transparency in training data.

Gemini’s arrival follows a trend of companies attempting to catch up with OpenAI’s evolving GPT models. Google’s previous attempt, PaLM 2, aimed to achieve this goal but fell short. Now, with Gemini, Google is making bold claims, setting the stage for a new chapter in the battle for AI supremacy alongside competitors like Anthropic, Meta, Microsoft, and OpenAI.

The blogosphere buzzes with opinions about Gemini’s incremental improvements over its predecessors and competitors. Some experts express scepticism, suggesting that Gemini might not represent a substantial leap forward. The AI community remains divided on the significance of benchmarks and the true capabilities of models like Gemini, raising questions about the trajectory of AI development.

Gemini’s immediate impact is felt through Bard, Google’s chatbot, which now integrates a specially tuned version of Gemini Pro. The mid-level model enhances Bard’s reasoning, planning, and understanding capabilities, marking a considerable improvement over the previous version based on PaLM 2. According to Google, early blind evaluations found that testers preferred the Gemini Pro-powered Bard over leading free alternatives.

Google’s collaboration with YouTuber and educator Mark Rober showcases Bard’s capabilities: working with the Gemini Pro-powered chatbot, Rober sets out to craft the most accurate paper aeroplane he can. This practical demonstration highlights Gemini’s potential in real-world applications, emphasising its ability to assist users in creative endeavours.

As Gemini rolls out in phases, with Gemini Ultra set to debut in Bard Advanced early next year, Google envisions this as the beginning of a new era in AI development. Despite scepticism and debates over benchmarks, Sundar Pichai remains optimistic about the potential headroom for AI advancements. Multimodality is identified as a key focus, with the expectation of more significant breakthroughs as models like Gemini evolve to reason more deeply.

In conclusion, Google’s Gemini marks a significant milestone in the ongoing AI race, but the real impact will unfold as users experience its capabilities firsthand. As AI continues to advance, the narrative surrounding models like Gemini will likely shape the future of technology, making it a compelling journey to watch unfold. Whether Gemini truly lives up to the hype or not, its existence undeniably propels us further into the realm of AI possibilities.

For all my daily news and tips on AI and emerging technologies, sign up for my FREE newsletter at www.robotpigeon.beehiiv.com