Joe makes a call from a phone booth. It costs him 60 cents for each minute of his call. After 10 minutes, the price drops to 50 cents per minute. How much would a 30-minute call cost him?
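For the record, the intended answer is $16: the first 10 minutes cost $6 and the remaining 20 minutes cost $10. A minimal Python sketch of that calculation (the function name and defaults are illustrative, not part of the benchmark):

```python
# A sketch of the intended arithmetic: the first 10 minutes are billed
# at 60 cents, every minute after that at 50 cents.
def call_cost(minutes, base_rate=0.60, reduced_rate=0.50, threshold=10):
    if minutes <= threshold:
        return minutes * base_rate
    return threshold * base_rate + (minutes - threshold) * reduced_rate

print(call_cost(30))  # 16.0 – i.e. $16 for a 30-minute call
```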
Questions like this are part of a suite of arithmetic tests for US grade schools, typically targeting 10- to 11-year-olds. Mathematical reasoning is key to problem solving, and it can therefore be used to measure the capabilities of an artificial intelligence (AI) system.
The grade school math 8K (GSM8K) suite has become a popular benchmark for AI large language models (LLMs), such as those behind ChatGPT. The suite contains 8,500 problems like the one above, divided into a set for training an LLM and a separate set of test problems to be solved. OpenAI, the maker of ChatGPT, has scored 92.5 per cent on the GSM8K suite with its latest LLM, the GPT-4o model, while Google's Gemini 1.5 Pro LLM scored 91.7 per cent. A smaller LLM with fewer tuning parameters, Microsoft's Phi-3-small, has nevertheless achieved an impressive 88.5 per cent.
However, a recent paper by six researchers at Apple has uncovered significant weaknesses in the reasoning ability of 22 different state-of-the-art LLMs, including those named above. A simple name change – such as from “Joe” to “Dave” in the problem above – with the remainder of the test question left completely unchanged, can lead to a different answer from an LLM. This is clearly surprising, and would not be expected from a student with genuine mathematical understanding.
The fragility of the LLMs examined by the researchers was even more pronounced when numbers in the test problems were changed, rather than names alone.
For example, changing the base rate of the phone call in the test above from 60 cents a minute to 70 cents a minute, and making similar numerical changes in the remainder of the test problems, led to a much wider spread in the accuracy of responses. The researchers concluded that the LLMs are not performing formal reasoning, and hypothesised that they are instead making best efforts to match patterns within the set of supplied training problems.
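To illustrate the kind of variation involved (a rough sketch of the idea, not the researchers’ actual method or code), the same question can be regenerated from a template with different names and rates; a system doing genuine arithmetic should answer every variant correctly:

```python
import random

# Hypothetical template in the spirit of the perturbations described above:
# swap the name and the per-minute rates, leave the question structure untouched.
TEMPLATE = ("{name} makes a call from a phone booth. It costs him {base} cents "
            "for each minute of his call. After 10 minutes, the price drops to "
            "{reduced} cents per minute. How much would a 30-minute call cost him?")

def make_variant():
    name = random.choice(["Joe", "Dave", "Sam", "Alex"])
    base = random.choice([60, 70, 80])
    reduced = random.choice([40, 45, 50])
    question = TEMPLATE.format(name=name, base=base, reduced=reduced)
    answer_dollars = (10 * base + 20 * reduced) / 100  # same arithmetic every time
    return question, answer_dollars

print(make_variant())
```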
Even more intriguingly, dropping or adding clauses had a significant impact on the performance of the LLMs. For example, removing the clause specifying a call price reduction after 10 minutes in the test problem above, or adding a new clause giving a 5 per cent discount for calls costing more than $10, frequently caused the accuracy of the results to vary.
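To make the second change concrete (a sketch using the original rates, not necessarily the paper’s exact variant): a 30-minute call costs $16, which exceeds $10, so the discount clause would bring it to $15.20.

```python
# Effect of the added clause, with the original rates:
# 10 minutes at 60c plus 20 minutes at 50c comes to $16, over the $10 threshold.
cost = 10 * 0.60 + 20 * 0.50   # $16.00
if cost > 10:
    cost *= 0.95               # the hypothetical 5 per cent discount clause
print(round(cost, 2))          # 15.2
```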
The researchers noted that as the difficulty of the test problems was increased by adding more clauses, the performance of the LLMs deteriorated rapidly. They postulated that searching and pattern matching become significantly harder for the LLMs as problem difficulty increases, reinforcing their suggestion that authentic mathematical reasoning is not actually occurring.
In addition to changing the specified values and the complexity of problems, the researchers then tried adding apparently relevant, but in practice completely inconsequential, clauses. For example, the phone call problem above might gain an unimportant clause observing that phone call prices were 10 per cent cheaper last year, even though the problem remains about how much Joe’s phone call costs him today. Frequently, however, the LLMs would apply the discount rate anyway. In these scenarios, the researchers observed catastrophic performance declines across all of the LLMs tested, which they tentatively attributed to an over-reliance by the LLMs on the particular set of training problems.
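To see why the clause is inconsequential: the correct answer is still $16, since last year’s prices have no bearing on today’s call; a model that latches on to the word “cheaper” and applies the 10 per cent anyway would answer $14.40. A small sketch of the trap, again using the original rates:

```python
# The irrelevant clause ("prices were 10 per cent cheaper last year")
# should be a no-op for today's call.
correct = 10 * 0.60 + 20 * 0.50    # 16.0 – the clause changes nothing
pattern_matched = correct * 0.90   # 14.4 – wrongly applying the 10 per cent
print(correct, pattern_matched)
```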
The researchers concluded: “Ultimately, our work underscores significant limitations in the ability of LLMs to perform genuine mathematical reasoning. The high variance in LLM performance on different versions of the same question, their substantial drop in performance with a minor increase in difficulty, and their sensitivity to inconsequential information indicate that their reasoning is fragile. It may resemble sophisticated pattern matching more than true logical reasoning.”
The text responses from ChatGPT and other LLMs attracted the attention of both the public and investors because they gave the impression of genuinely understanding the world. In practice, it seems they have reached such a size that they absorb more information from their training data than individual humans might typically know or recall, and recombine this data in various combinations. With sufficient input data and training, requiring considerable investment and energy, an LLM can give an illusion of intelligence, but in fact it is inherently limited in high-level reasoning and has no sapient conceptual model.
One of today’s most influential giants in computing is Linus Torvalds, the creator of the very widely used Linux operating system. He recently stated that while he finds AI really interesting, he is nevertheless going to basically ignore it for now. He observed that the whole tech industry around AI is 90 per cent marketing and 10 per cent reality, and that “in five years’ time things will change and at that point we’ll see what AI is getting used every day for real workloads”.
I agree with him. The current generation of LLMs has some utility in text analysis and searching, and can also produce great images and videos, but its true business impact has yet to be proven.