It seems that every day brings a new headline about the burgeoning capabilities of large language models (LLMs) like ChatGPT and Google's Gemini: headlines that are either exciting or increasingly apocalyptic, depending on one's point of view.
One particularly striking story arrived earlier this year: a paper describing how an LLM had passed the Turing Test, an experiment devised in the 1950s by computer science pioneer Alan Turing to determine whether machine intelligence could be distinguished from that of a human. The LLM in question was ChatGPT 4.5, and the paper found that it had been strikingly successful in fooling people into thinking it was human: in an experiment where participants had to decide whether the chatbot or an actual human was the real person, nearly three out of four chose the chatbot.
This sounds... significant. But how, exactly? What does it all mean?
What the Turing Test is (and what it isn't)
To answer that question, we first need to look at what the Turing Test is, and what it means for an LLM to pass or fail it.
Cameron Jones, a postdoctoral researcher at UC San Diego and one of the co-authors of the new paper, explains that Turing introduced the idea of the test in his seminal 1950 paper "Computing Machinery and Intelligence." The paper set out to address a big, fundamental question that occupied the minds of Turing's contemporaries: "Can machines think?"
In his paper, Turing quickly rejects that question as ambiguous and non-rigorous, because it is not clear what either a "machine" or "thinking" means in this context. He argues that a more nuanced and readily answerable question is required, and proposes, "Can a machine act in such a manner that its actions are indistinguishable from those of a human?" To answer this question, he proposes what he calls "The Imitation Game," and it's this exercise that has since come to be referred to simply as "the Turing Test."
The test involves one person, the "interrogator," communicating simultaneously with two hidden interlocutors, referred to as the "witnesses." All communication is written. The twist is that while one of these witnesses is a real person, the other is a machine. The point of the game, Jones explains, "is for the interrogator to figure out which of the two witnesses is the real human."
Jones and his team performed this experiment with four AI models. ChatGPT 4.5 was by far the most successful: 73% of participants identified it as the real human. Another model, which goes by the unwieldy name LLaMa-3.1-405B, was identified as human 56% of the time. (The other two systems, ELIZA and GPT-4o, achieved 23% and 21% success rates, respectively, and will not be spoken of again.)
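For the numerically inclined, those headline percentages are just the share of rounds in which the interrogator pointed at the machine rather than the person. Here's a minimal Python sketch of that bookkeeping; it is purely illustrative, not the study's code, and the 73% figure is simply plugged in as an assumed probability:

```python
import random

def simulate_rounds(p_model_picked, n_rounds=1000, seed=0):
    """Toy stand-in for the imitation game: each round, an interrogator
    chats with a hidden human and a hidden model, then picks the 'human'.
    p_model_picked is our assumed chance that the model gets picked."""
    rng = random.Random(seed)
    return [rng.random() < p_model_picked for _ in range(n_rounds)]

def judged_human_rate(rounds):
    """Share of rounds in which the model was judged to be the human.
    50% would mean interrogators are guessing at chance; the reported
    73% means they chose the model over the actual person most of the time."""
    return sum(rounds) / len(rounds)

if __name__ == "__main__":
    rounds = simulate_rounds(p_model_picked=0.73)
    print(f"Model judged human in {judged_human_rate(rounds):.0%} of rounds")
```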
What does ChatGPT passing the Turing Test mean?
The results for ChatGPT 4.5 and LLaMa are striking enough, but the really interesting question is what their success signifies.
It's important to note from the outset that the test isn't designed to detect machine intelligence. In rejecting the question "Can machines think?" Turing also neatly sidesteps the thorny question of exactly who is doing the thinking if the answer is "yes." Consider René Descartes' famous assertion cogito, ergo sum ("I think, therefore I am"), which essentially demands that the presence of thought requires consciousness.
However, Turing's paper does argue that success in the Imitation Game means that we can't deny the possibility that genuine machine intelligence is at work. As Jones explains, Turing "basically [argued] that if we could build a machine that was so good at this game that we couldn't reliably tell the difference between the witnesses, then essentially we'd have to say that that machine was intelligent."
Modern readers might well recoil from such an assessment, so it's worth looking at Turing's line of reasoning, which went as follows:
- We can't know that our fellow humans are intelligent. We can't inhabit their minds or see through their eyes.
- Nevertheless, we accept them as intelligent.
- How do we make this judgment? Turing argues that we do so on the basis of our fellow humans' behavior.
- If we attribute intelligence based on behavior, and we encounter a situation in which we can't distinguish a machine's behavior from a human's, we should be prepared to conclude that the machine's behavior also indicates intelligence.
Again, readers might argue that this feels wrong. And indeed, the crux of the matter is Turing's premise that we attribute intelligence on the basis of behavior alone. We'll address counter-arguments in due course, but first, it's worth considering what sort of behavior we might feel conveys intelligence.
Why Turing selected language as a test for machines
It was surely no accident that Turing chose language as the medium through which his "Imitation Game" would be conducted. After all, there are many obvious ways in which a machine could never imitate a human, and equally, there are many ways in which a person could never imitate a machine. Printed language, however, is simply a set of letters on a page. It says nothing about whether it was produced by a human with a typewriter or a computer with a printer.
Nevertheless, the simple presence of language comes with a whole set of assumptions. Ever since our ancestors first started putting together sentences, language has, as far as we can tell, been the exclusive domain of humanity (though some apes are getting close).
This has also been the case for the type of intelligence that we possess: other animals are clever, but none of them seem to think the way we do, or to possess the degree of self-consciousness that humans demonstrate. On that basis, it's almost impossible not to conflate language and intelligence. This, in turn, makes it very difficult not to instinctively attribute some degree of intelligence to anything that appears to be talking to you.
This point was made eloquently in a recent essay by Rusty Foster, author of the long-running newsletter Today in Tabs. Foster argues that we tend to conflate language with intelligence because until now, the presence of the former has always indicated the presence of the latter. "The essential problem is this: generative language software is very good at producing long and contextually informed strings of language, and humanity has never before experienced coherent language without any cognition driving it," writes Foster. "In regular life, we have never been required to distinguish between 'language' and 'thought' because only thought was capable of producing language."
Foster makes an exception for "trivial" examples, but even these are surprisingly compelling to us. Consider, for example, a parrot. It's certainly disconcerting to hear a bird suddenly speaking in our language, but, crucially, it's also almost impossible not to talk back. (Viewers with a tolerance for profanity might enjoy this example, which features a very Australian woman arguing with a very Australian parrot over the intellectual merits of the family dog.) Even though we know that parrots don't really know what they're "saying," the presence of language demands language in response. So what about LLMs? Are they essentially energy-hungry parrots?
"I think [this has] been one of the major lines of criticism" of the Turing Test, says Jones. "It's a super behaviorist perspective on what intelligence is: that to be intelligent is to display intelligent behavior. And so you might want to have other conditions: You might require that a machine produce the behavior in the right kind of way, or have the right kind of history of interaction with the world."

The Chinese Room thought experiment
There are also thought experiments that challenge the Turing Test's central assumption: that the appearance of intelligence is indistinguishable from the presence of genuine intelligence. Jones cites John Searle's Chinese Room thought experiment, presented in a paper published in 1980, as perhaps the best known of these. In the paper, Searle imagines himself placed in a room where someone passes him pieces of paper under the door. These pieces of paper have Chinese characters written on them. Searle speaks no Chinese, but he has been provided with a book of detailed instructions about how to draw Chinese characters and which characters to provide in response to those he receives under the door.
To a person outside, it might appear that Searle speaks perfect Chinese when in reality, he is simply following instructions (a program) that tell him which characters to draw and how to draw them. As Searle explains in his paper, "It seems to me quite obvious in the example that I do not understand a word of the Chinese stories. I have inputs and outputs that are indistinguishable from those of the native Chinese speaker, and I can have any formal program you like, but I still understand nothing."
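For what it's worth, the whole setup can be caricatured in a few lines of code. The snippet below is a deliberately tiny, invented rulebook (nothing from Searle's paper), but it captures the structure of his argument: a program that maps inputs to outputs without anything resembling comprehension.

```python
# A toy "Chinese Room": replies come from a lookup table, nothing more.
# The rulebook below is invented for illustration; Searle imagines a vastly
# larger one, but the point stands: the output can look fluent to the person
# outside the door while no understanding happens anywhere inside.

RULEBOOK = {
    "你好": "你好！",            # "hello" -> "hello!"
    "你会说中文吗？": "会。",     # "do you speak Chinese?" -> "I do."
}

def rule_follower(note_under_door: str) -> str:
    """Return the scripted reply, with no grasp of what either string means."""
    return RULEBOOK.get(note_under_door, "请再说一遍。")  # "please say that again."

print(rule_follower("你好"))  # fluent-looking, understanding-free
```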
This argument is an explicit rejection of the Turing Test's premise. With it, Searle proposes a crucial distinction between understanding and appearing to understand, between thinking and appearing to think.
Tweaking ChatGPT to fool people
It also demonstrates another potential issue with the Turing Test: the Chinese Room is clearly designed with the express purpose of fooling the person on the other side of the door; to put it another way, it's a program designed specifically to pass the Turing Test. With this in mind, it's worth noting that in Jones's experiment, the LLMs that passed the test required a degree of tweaking and tuning to be convincing. Jones says that his team tested a large number of prompts for the chatbot, and one of the key challenges was "getting [the model] to not do stuff that ChatGPT does."
Some of the ways that Jones and his team got ChatGPT to not sound like ChatGPT are certainly fascinating, and again they revolve around the nuances of language. "You want it to not always speak in complete sentences," says Jones. "There's a kind of casual way that people speak when they're texting; it's just, like, sentence fragments. You need to get that sort of thing in."
Additionally, the team experimented with having ChatGPT make spelling errors to sound more human. Typos are "actually quite hard to get right. If you just tell an LLM to try really hard to make spelling errors, they do it in every word, and the errors are really unconvincing. I don't think they have a good model of what a keyboard substitution looks like, where you hit the wrong key in a word."
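Jones's point about keyboard substitutions is easy to make concrete. The sketch below is hypothetical (it's not the researchers' prompt or code), but it shows the kind of model an LLM would need: errors that are rare, and that swap a letter only for one of its physical neighbors on a QWERTY keyboard.

```python
import random

# Partial QWERTY adjacency map, invented here for illustration: a plausible
# typo replaces a letter with a key that sits next to it, not a random one.
QWERTY_NEIGHBOURS = {
    "a": "qwsz", "e": "wrsd", "i": "uojk", "n": "bhjm",
    "o": "iklp", "s": "awed", "t": "ryfg",
}

def add_typos(text: str, rate: float = 0.03, seed: int = 1) -> str:
    """Occasionally swap a letter for an adjacent key, at a low rate,
    so the result reads like hasty texting rather than random noise."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        neighbours = QWERTY_NEIGHBOURS.get(ch.lower())
        if neighbours and rng.random() < rate:
            out.append(rng.choice(neighbours))
        else:
            out.append(ch)
    return "".join(out)

print(add_typos("honestly not sure, it just sounds like a person to me"))
```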
Why ChatGPT is better than other LLMs
LLMs are difficult subjects for research: by their very nature, their internal operations are fundamentally inscrutable. Even the aspects of their construction that can be studied are often hidden behind NDAs and layers of corporate secrecy. Nevertheless, Jones says, the experiment did reveal some things about what sort of LLM is best equipped to perform a credible imitation of a human: "ChatGPT 4.5 is rumored to be one of the biggest models, and I think that being a large model is really helpful."
What does "big" mean in this sense? A large codebase? A large dataset? No, says Jones. He explains that a big model has a relatively large number of internal variables whose values can be tuned as the model hoovers up training data. "One of the things you see [is that] the smaller distilled models often can mimic good performance in math, and even in quite simple reasoning. But I think it's the really big models that tend to have good social, interpersonal behavioral skills."
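To put a rough shape on what those internal variables are (parameters, in the jargon), here's a back-of-the-envelope sketch. The layer sizes are made up purely for illustration; they aren't the real dimensions of ChatGPT 4.5 or any other specific model, which remain undisclosed. The point is only that widening and deepening a network multiplies the number of tunable values very quickly.

```python
def layer_params(d_model: int, d_ff: int) -> int:
    """Rough parameter count for one transformer-style block: four attention
    projection matrices (d_model x d_model) plus a two-layer feed-forward
    network. Biases, embeddings, and normalization are ignored."""
    attention = 4 * d_model * d_model
    feed_forward = 2 * d_model * d_ff
    return attention + feed_forward

# Hypothetical sizes, chosen only to show how fast the count climbs:
smallish = 24 * layer_params(d_model=2048, d_ff=8192)     # ~1.2 billion
very_big = 120 * layer_params(d_model=16384, d_ff=65536)  # ~390 billion
print(f"{smallish:,} vs {very_big:,} tunable values")
```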
[Embedded video: "Even the computer programmers that created artificial intelligence don't know how it works." Credit: TED-Ed]
Did Turing predict ChatGPT?
So did Turing ever conceive of his test as something that would actually be carried out? Or was it more of a thought experiment? Jones says that the answer to that question continues to be the subject of debate amongst Turing scholars. For his part, Jones says that he is "just drawing on the paper itself. I think you can read the paper quite literally, as a suggestion that people could run this experiment at some point in the future."
Having said that, Jones also points out, "I think it's clear that Turing is not laying out a methodology. I mean, I think he doesn't imagine this experiment would be worth running for decades. So he's not telling you how long it should be or, you know, if there's any rules and what they can talk about."
If Turing did envisage that the test might be passable, he certainly knew that it wouldn't happen in the 1950s. Nevertheless, his paper makes it clear that he did at least imagine the possibility that one day we might build machines that would succeed: "We are not asking whether all digital computers would do well in the game nor whether the computers at present available would do well, but whether there are imaginable computers which would do well," he writes.
Turing has often been described, rightly, as a visionary, but there's one passage in the 1950 paper that's genuinely startling in its prescience. "I believe that in about 50 years' time it will be possible to programme computers...to make them play the imitation game so well that an average interrogator will not have more than [a] 70 per cent chance of making the right identification after five minutes of questioning."
It took 75 years, not 50, but here we are, confronted by a computer (or, at least, a computer-driven model) that comfortably clears the bar Turing set: he predicted interrogators would make the right identification no more than 70% of the time, and in Jones's study they picked the actual human over ChatGPT 4.5 only 27% of the time.
What makes human intelligence unique, anyway?
This all brings us back to the original question: what does it all mean? "That's a question I'm still struggling with," Jones laughs.
"One line of thinking that I think is useful is that the Turing Test is neither necessary nor sufficient evidence for intelligence: you can imagine something being intelligent that doesn't pass, because it didn't use the right kind of slang, and you can also imagine something that does pass that isn't intelligent."
Ultimately, he says, the key finding is exactly what it says on the tin: "It's evidence that these models are becoming able to imitate human-like behavior well enough that people can't tell the difference." This, clearly, has all sorts of social implications, many of which appear to interest the public and the scientific community far more than they interest the companies making LLMs.
There are also other philosophical questions raised here. Turing addresses several of these in his paper, most notably what he calls the "Argument from Consciousness." Even if a machine is intelligent, is it conscious? Turing uses the example of a hypothetical conversation between a person and a sonnet-writing machine, one that sounds strikingly like the sort of conversation one can have with ChatGPT today. The conversation provides an example of something that could be examined "to discover whether [its author] really understands [a subject] or has 'learnt it parrot-like.'"
Of course, there are many more philosophical questions at play here. Perhaps the most disquieting is this: if we reject the Turing Test as a reliable method of detecting genuine artificial intelligence, do we have an alternative? Or, in other words, do we have any reliable method of knowing when (or if) a machine could possess genuine intelligence?
"I think most people would say that our criteria for consciousness [should] go beyond behavior," says Jones. "We can imagine something producing the same behavior as a conscious entity, but without having the conscious experience. And so maybe we want to have additional criteria."
What those criteria should be (or even whether reliable criteria exist for a definitive "Is this entity intelligent or not?" test) remains to be determined. After all, it's not even clear that we have such criteria for a similar test for animals. As humans, we possess an unshakeable certainty that we are somehow unique, but over the years, characteristic after characteristic that we once considered exclusively human has turned out to be no such thing. Examples include the use of tools, the construction of societies, and the experience of empathy.
And yet, it's hard to give up the idea that we are different. It's just surprisingly difficult to identify precisely how. It's similarly difficult to determine where this difference begins. Where do we stop being sacks of electrolytes and start being conscious beings? It turns out that this question is no easier to answer than that of whether consciousness might arise from the bewildering mess of electrical signals in our computers' CPUs.
Turing, being Turing, had an answer for this, too. "I do not wish to give the impression that I think there is no mystery about consciousness. There is, for instance, something of a paradox connected with any attempt to localise it." However, he argued that understanding the source of human consciousness wasn't necessary to answer the question posed by the test.
In the narrowest sense, he was correct: in and of itself, the question of whether a machine can reliably imitate a human says nothing about consciousness. But the sheer amount of publicity around ChatGPT passing the Turing Test says a lot about the age we're in: an age in which it may well be very important to know whether genuine artificial intelligence is possible.
To understand if a machine can be intelligent, perhaps we first need to understand how, and from where, intelligence emerges in living creatures. That may provide some insight into whether such emergence is possible in computers, or whether the best we can do is construct programs that do a very, very convincing job of parroting the internet, along with all its biases and prejudices, back at us.
