OpenAI: This is How GPT-3 Responds When Asked to Mimic Human Reasoning
According to UCLA psychologists, GPT-3 can reason approximately as effectively as a college student. But does the technology mirror the way people think, or does it use a completely new way of thinking?
People are good at solving new problems without any special training or practice. They do this by comparing a new problem to ones they have already solved and extending the earlier solution to the new case.
But now there might be a new kid on the block, so people might have to make room.
The study challenges the belief that analogical reasoning, the process by which new problems are solved by comparing them to known problems, is a uniquely human skill.
The UCLA researchers, however, raised a new question: Is GPT-3 merely imitating human thought patterns because of the extensive language dataset it was trained on, or is it demonstrating an entirely novel cognitive process?
Access to GPT-3’s internal structure is restricted by OpenAI, its creator, making it difficult for the researchers to definitively pinpoint how its reasoning abilities function. They noted that despite its remarkable performance in certain reasoning tasks, the AI tool failed significantly in others.
“No matter how impressive our results, it’s important to emphasize that this system has major limitations,” remarked Taylor Webb, a UCLA postdoctoral researcher in psychology and the study’s first author. “It can do analogical reasoning, but it can’t do things that are very easy for people, such as using tools to solve a physical task. When we gave it those sorts of problems — some of which children can solve quickly — the things it suggested were nonsensical.”
How did GPT-3 perform on Raven’s Progressive Matrices?
To compare GPT-3’s reasoning capabilities to humans’, Webb and his team designed a series of tests inspired by Raven’s Progressive Matrices, which require the subject to predict the next image in a complex pattern of shapes.
Webb converted the images into a text format that GPT-3 could process, allowing it to “see” the shapes. This method also ensured that the AI had never encountered the questions before.
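The article does not describe the exact encoding the researchers used, but the general idea of rendering a shape matrix as text can be sketched as follows. The grid contents, separator tokens, and function name below are illustrative assumptions, not the study’s actual code.

```python
# Hypothetical sketch: turn a 3x3 matrix of shape descriptions into a
# plain-text prompt, leaving the final cell blank for the model to fill in.
# The encoding scheme here is an assumption for illustration only.

def encode_matrix(matrix):
    """Render a grid of shape names as text, with None shown as '?'."""
    rows = []
    for row in matrix:
        rows.append(" | ".join(cell if cell is not None else "?" for cell in row))
    return "\n".join(rows)

# A toy progression: each step adds one side to the shape.
problem = [
    ["triangle", "square",   "pentagon"],
    ["square",   "pentagon", "hexagon"],
    ["pentagon", "hexagon",  None],  # the model must predict this cell
]

prompt = encode_matrix(problem)
print(prompt)
```

A prompt like this could then be sent to a language model, which would answer in text rather than by selecting an image.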
The researchers gave the same questions to 40 first-year college students at UCLA. Interestingly, GPT-3’s solutions mirrored the human responses, including making similar errors.
“Surprisingly, not only did GPT-3 do about as well as humans but it made similar mistakes as well,” commented UCLA psychology professor Hongjing Lu, the study’s senior author.
The AI managed to solve 80% of the problems, exceeding the human average of just under 60%, but within the range of the highest human scores.
Can GPT-3 solve SAT analogy questions?
Additionally, GPT-3 outperformed the average human score on a set of SAT analogy questions that the researchers believe had never been published online, making it unlikely that the questions were part of GPT-3’s training data.
The questions ask test-takers to identify pairs of words that share the same relationship.
(For example, in the question “‘Love’ is to ‘Hate’ as ‘Rich’ is to which word?,” the answer is “Poor.”)
They compared GPT-3’s scores with published SAT results from college applicants and found that the AI beat the average human score.
Next, the researchers gave GPT-3 and the student participants analogy problems based on short stories, asking them to read one passage and then identify a different story that conveyed the same meaning. On these problems, GPT-3 did worse than the students, although GPT-4, the most recent version of OpenAI’s technology, performed better than GPT-3.
The UCLA researchers have developed their own computer model, inspired by human cognition, and have been comparing its abilities to those of commercial AI.
“AI was getting better, but our psychological AI model was still the best at doing analogy problems until last December when Taylor got the latest upgrade of GPT-3, and it was as good or better,” added UCLA psychology professor Keith Holyoak, a co-author of the study.
According to the researchers, GPT-3 has so far been unable to solve problems that require an understanding of spatial relations. For instance, when given descriptions of a set of tools it could use to transfer gumballs from one bowl to another, such as a cardboard tube, scissors, and tape, GPT-3 proposed bizarre solutions.
“Language learning models are just trying to do word prediction so we’re surprised they can do reasoning,” Lu remarked. “Over the past two years, the technology has taken a big jump from its previous incarnations.”
The UCLA researchers want to find out whether language learning models are genuinely starting to “think” like people or if they are just mimicking what people think.
“GPT-3 might be kind of thinking like a human,” Holyoak added. “But on the other hand, people did not learn by ingesting the entire internet, so the training method is completely different. We’d like to know if it’s really doing it the way people do, or if it’s something brand new — a real artificial intelligence — which would be amazing in its own right.”
To identify the fundamental cognitive processes AI models are using, the researchers would need access to the software and to the data used to train it, as well as the ability to administer tests they are certain the software has not encountered before. That, they said, would be the next step in determining what AI could become.
“It would be very useful for AI and cognitive researchers to have the backend to GPT models,” Webb added. “We’re just doing inputs and getting outputs and it’s not as decisive as we’d like it to be.”
Image Credit: Omar Marques/SOPA Images/LightRocket via Getty Images