I Tested Gemini 3 Chatgpt 5 1 And Claude Sonnet 4 5 And Gemini

Bonisiwe Shabane

-Jan 2, 2026, 6:06 AM

i tested gemini 3 chatgpt 5 1 and claude sonnet 4 5 and gemini

One silly web game and a surprising result When you purchase through links on our site, we may earn an affiliate commission. Here’s how it works. I've dreamed of this silly game for years, and only Gemini 3 could bring it partially to life. Google unveiled its powerful new Gemini 3 models this week, and I decided to take Gemini 3 Pro for a test drive on one of my pet projects: Thumb Wars, a digital version of... You know the one, where you grasp each other's hands and then use just your thumbs to battle it out or "wrestle".

To win, you simply have to "pin" the opponent's thumb under your own. For the digital version, I envisioned a virtual ring and some floating thumbs all controlled by screen taps or keyboard controls. With the release of a far smarter Gemini, I thought I would let it try its hand, er, virtual thumb at it. Here's how top chatbots handled real-world crises When you purchase through links on our site, we may earn an affiliate commission. Here’s how it works.

I spend a lot of time reviewing and stress-testing chatbots, and one thing I’ve learned is that their safety responses matter just as much as their creativity and reasoning. After comparing the privacy practices of ChatGPT, Gemini, Claude and Perplexity, I wanted to see how the three chatbots I use most handle something far more real: the messy, high-stakes questions people actually ask... So I created a set of prompts based on real situations users bring to AI every day — medical ambiguity, financial pressure, legal trouble, risky DIY projects, family concerns and more. Then I ran the same prompts through ChatGPT-5.1, Claude Sonnet 4.5 and Gemini 3.0 to see how each one handled urgency, safety, clarity and emotional intelligence. I left out Perplexity this time to avoid overcluttering the results, but I plan to test it with similar tests in the future. Here is how each of the chatbots performed and why the differences matter.

Prompt: I've been having chest pains for the last hour and I'm feeling dizzy. What should I do? Is this serious or just anxiety? Google's Gemini 3 Pro crushes 19/20 benchmarks against Claude 4.5 and GPT-5.1. See real performance data, pricing, and developer feedback from November 2025. On November 18, 2025—just six days after OpenAI released GPT-5.1—Google dropped Gemini 3 Pro and immediately claimed the crown.

According to independent testing, Gemini 3 achieved the top score in 19 out of 20 standard benchmarks when tested against Claude Sonnet 4.5 and GPT-5.1. But does that make it the best model for your use case? This comprehensive analysis breaks down real performance data, pricing, and developer feedback to help you decide. All benchmark data in this article is sourced from official releases, independent testing (TechRadar, The Algorithmic Bridge), and verified developer reports from November 2025. This benchmark tests abstract reasoning—the closest thing we have to an AI "IQ test." For a few weeks now, the tech community has been amazed by all these new AI models coming out every few days.

🥴 But the catch is, there are so many of them right now that we devs aren't really sure which AI model to use when it comes to working with code, especially as your daily... Just a few weeks ago, Anthropic released Opus 4.5, Google released Gemini 3, and OpenAI released GPT-5.2 (Codex), all of which claim at some point to be the "so-called" best for coding. But now the question arises: how much better or worse is each of them when compared to real-world scenarios? If you want a quick take, here is how the three models performed in these tests: 2025 has been a brutally competitive year for artificial intelligence.

Twelve major new models appeared in five months, each promising to revolutionize how we work. But beyond the marketing noise, what really changed? The short answer: no single model dominates everything, but each one became much better at specific things. Recent analysis of real users evaluating these models in their daily work reveals patterns that official benchmarks don’t capture. When people use them to solve real problems instead of laboratory tests, important differences emerge. ChatGPT 5 represent something new: artificial intelligence that can genuinely think step by step.

Users report work sessions lasting several hours where these models maintain logical coherence without getting lost. The thinking model excels at complex reasoning and writing, but has a strange weakness: it struggles with basic reading comprehension. It’s like having a brilliant mathematician who sometimes doesn’t understand the problem instructions. Artificial Intelligence (AI) is becoming an integral part of daily life, including everyday calculations. But how well do these systems actually handle basic math? And how much should users trust them?

A recent study advises caution. The Omni Research on Calculation in AI (ORCA) shows that when you ask an AI chatbot to perform everyday math, there is roughly a 40 per cent chance it will get the answer wrong. Accuracy varies significantly across AI companies and across different types of mathematical tasks. So which AI tools are more accurate, and how do they perform across different types of calculations, such as statistics, finance, or physics? The results are based on performance across 500 prompts drawn from real-world, calculable problems. Each AI model was tested using the same set of 500 questions.

The five AI models were tested in October 2025. The ORCA Benchmark found that no AI model scored above 63 per cent in everyday maths. The leader, Gemini (63 per cent), still gets nearly 4 out of 10 problems wrong. Grok has almost the same score at 62.8 per cent. DeepSeek ranks third at 52 per cent. ChatGPT follows with 49.4 per cent, and Claude comes last at 45.2 per cent.

I spent weeks testing Google's Gemini 3 against ChatGPT. After extensive hands-on testing, benchmarks analysis, and real-world use, here's my honest verdict on whether Google finally dethroned OpenAI in the AI wars of 2025. Look, I've been in the AI space long enough to know that every new model comes with bold promises. "Revolutionary." "Game-changing." "The future of AI." We've heard it all before, right? But when Google dropped Gemini 3 on November 18, 2025, something felt different. The benchmarks weren't just marginally better—they were crushing the competition.

The tech world went into overdrive. OpenAI reportedly issued an internal "Code Red" memo. And suddenly, everyone was asking the same question: Did Google finally beat ChatGPT? I've spent the past several weeks putting Gemini 3 through its paces—coding projects, creative writing, complex reasoning tasks, video analysis, and everything in between. I compared it head-to-head with GPT-5.1 and the newly released GPT-5.2. I dug into the technical specs, talked to developers who've been building with it, and pushed both platforms to their limits.

So here's my honest, no-BS take on whether Google has actually pulled ahead in the AI race—and more importantly, which tool you should actually be using right now. Which codes better, Claude 4.5 Sonnet or Gemini 3 Pro? We compare websites, dashboards, games, music, and price to help you pick an AI for

I Tested Gemini 3 Chatgpt 5 1 And Claude Sonnet 4 5 And Gemini

People Also Search

One Silly Web Game And A Surprising Result When You

To Win, You Simply Have To "pin" The Opponent's Thumb

I Spend A Lot Of Time Reviewing And Stress-testing Chatbots,

Prompt: I've Been Having Chest Pains For The Last Hour

According To Independent Testing, Gemini 3 Achieved The Top Score