AI models can do some pretty impressive stuff, like understanding text and images and solving complex problems. Take OpenAI's GPT-4, for example: according to OpenAI, it scored 700 out of 800 on the SAT math exam. But not every claim holds up. A paper asserting that GPT-4 could pass MIT's computer science curriculum was later withdrawn.
To get a clearer picture of how well language and multimodal models actually handle problem-solving, researchers from several universities and Microsoft Research developed a benchmark called MathVista. It focuses on visually grounded challenges, testing whether models can perform mathematical reasoning in visual contexts.
A benchmark like MathVista matters because it gives developers a way to measure, and then improve, their models' mathematical reasoning when a visual component is involved. And let's be honest: if we're going to trust AI to drive a car without running anyone over, it had better solve visual problems reliably.
MathVista consists of 6,141 examples drawn from 28 existing multimodal datasets plus three newly created ones. It covers seven forms of reasoning (algebraic, arithmetic, geometric, logical, numeric, scientific, and statistical) across five task types: figure question answering, geometry problem solving, math word problems, textbook question answering, and visual question answering.
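If you want to poke at the benchmark yourself, here's a minimal sketch of loading it from Hugging Face and tallying its task types, assuming the datasets library and the publicly released AI4Math/MathVista repository; the split name and field layout follow the dataset card and may differ in practice.

```python
# Minimal sketch: inspect the task mix in MathVista.
# Assumes the Hugging Face "datasets" library and the public
# "AI4Math/MathVista" dataset; field names per the dataset card.
from collections import Counter
from datasets import load_dataset

# "testmini" is the smaller evaluation subset released with the paper.
ds = load_dataset("AI4Math/MathVista", split="testmini")

# Count how many examples fall under each task type
# (figure QA, geometry problem solving, math word problems, etc.).
tasks = Counter(example["metadata"]["task"] for example in ds)
for task, count in tasks.most_common():
    print(f"{task}: {count}")
```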
The researchers tested a range of models, and OpenAI's GPT-4V came out on top. It even surpassed human performance in certain areas, such as algebraic reasoning and complex visual challenges. Still, GPT-4V answered only 49.9 percent of the questions correctly. That's better than other models, like Google's Bard, which managed just 34.8 percent.
It's worth mentioning that Microsoft, whose researchers were involved in this project, is a major investor in OpenAI.
So, while AI models are making progress, they still have a ways to go before reaching human-level accuracy. The Amazon Mechanical Turk workers who provided the human baseline scored 60.3 percent, leaving GPT-4V 10.4 percentage points behind. There's definitely room for improvement, but hey, at least they're not completely off the mark.