It seems that every day brings a new headline about the burgeoning capabilities of large language models (LLMs) like ChatGPT and Google's Gemini, headlines that are either exciting or increasingly apocalyptic, depending on one's point of view.
One particularly striking story arrived earlier this year: a paper describing how an LLM had passed the Turing Test, an experiment devised in the 1950s by computer science pioneer Alan Turing to determine whether machine intelligence could be distinguished from that of a human. The LLM in question was ChatGPT 4.5, and the paper found that it had been strikingly successful in fooling people into thinking it was human: In an experiment where participants were asked to decide whether the chatbot or an actual human was the real person, nearly three out of four chose the chatbot.
This sounds... significant. But how, exactly? What does it all mean?
What the Turing Test is (and what it isn't)
To answer that question, we first need to look at what the Turing Test is, and what it means for an LLM to pass or fail it.
Cameron Jones, a postdoctoral researcher at UC San Diego and one of the co-authors of the new paper, explains that Turing introduced the idea of the test in his seminal 1950 paper "Computing Machinery and Intelligence." The paper set out to address a big, fundamental question that occupied the minds of Turing's contemporaries: "Can machines think?"
In his paper, Turing quickly rejects that question as ambiguous and non-rigorous, because it is not clear what a "machine" is in this context, nor what "thinking" is. He argues that a more precise, more tractable question is required, and proposes, "Can a machine act in such a manner that its actions are indistinguishable from those of a human?" To answer this question, he proposes what he calls "The Imitation Game," and it's this exercise that has since come to be referred to as simply "The Turing Test."
The test involves one person, the "interrogator," communicating simultaneously with two hidden interlocutors, referred to as the "witnesses." All communication is written. The twist is that while one of these witnesses is a real person, the other is a machine. The point of the game, Jones explains, "is for the interrogator to figure out which of the two witnesses is the real human."
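To make the setup concrete, here is a minimal sketch of that three-party protocol in Python. The interrogator and witness objects, their method names, and the number of turns are placeholders for illustration; they are not the interface Jones's team actually used.

import random

def run_imitation_game(interrogator, human_witness, machine_witness, turns=5):
    # One round of the Imitation Game: the interrogator exchanges written
    # messages with two hidden witnesses and must say which one is human.
    labels = ["A", "B"]
    random.shuffle(labels)  # hide which label belongs to the human
    witnesses = {labels[0]: human_witness, labels[1]: machine_witness}

    transcripts = {label: [] for label in witnesses}
    for _ in range(turns):
        for label, witness in witnesses.items():
            question = interrogator.ask(label, transcripts[label])
            answer = witness.reply(question)
            transcripts[label].append((question, answer))

    # The interrogator's only evidence is the written exchange itself.
    verdict = interrogator.guess_human(transcripts)  # returns "A" or "B"
    return witnesses[verdict] is human_witness

The machine "passes" a round whenever this function returns False, meaning the interrogator picked the machine as the human.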
Jones and his team performed this experiment with four LLMs. ChatGPT 4.5 was by far the most successful: 73% of participants identified it as the real human. Another model, which goes by the unwieldy name LLaMa-3.1-405B, was identified as human 56% of the time. (The other two models, ELIZA and GPT-4o, achieved 23% and 21% success rates, respectively, and will not be spoken of again.)
What does ChatGPT passing the Turing Test mean?
The results for ChatGPT 4.5 and LLaMa are striking enough, but the really interesting question is what their success signifies.
It's important to note from the outset that the test isn't designed to detect machine intelligence. In rejecting the question "Can machines think?" Turing also neatly sidesteps the thorny question of exactly who is doing the thinking if the answer is "yes." Consider René Descartes' famous assertion cogito, ergo sum, "I think, therefore I am," which essentially holds that the presence of thought requires consciousness.
However, Turing's paper does argue that success in the Imitation Game means that we can't deny the possibility that genuine machine intelligence is at work. As Jones explains, Turing "basically [argued] that if we could build a machine that was so good at this game that we couldn't reliably tell the difference between the witnesses, then essentially we'd have to say that that machine was intelligent."
Modern readers might well recoil from such an assessment, so it's worth looking at Turing's line of reasoning, which went as follows:
- We can't know that our fellow humans are intelligent. We can't inhabit their minds or see through their eyes.
- Nevertheless, we accept them as intelligent.
- How do we make this judgment? Turing argues that we do so on the basis of our fellow humans' behavior.
- If we attribute intelligence based on behavior, and we encounter a situation where we can't distinguish a machine's behavior from a human's, we should be prepared to conclude that the machine's behavior also indicates intelligence.
Again, readers might argue that this feels wrong. And indeed, the key point of contention is Turing's premise that we attribute intelligence on the basis of behavior alone. We'll address counter-arguments in due course, but first, it's worth thinking about what sort of behavior we might feel conveys intelligence.
Why Turing selected language as a test for machines
It feels like no accident that Turing chose language as the medium through which his "Imitation Game" would be conducted. After all, there are many obvious ways in which a machine could never imitate a human, and equally, there are many ways in which a person could never imitate a machine. Printed language, however, is simply a set of letters on a page. It says nothing about whether it was produced by a human with a typewriter or a computer with a printer.
Nevertheless, the simple presence of language comes with a whole set of assumptions. Ever since our ancestors first started putting together sentences, language has, as far as we can tell at least, been the exclusive domain of humanity (though some apes are getting close).
This has also been the case for the type of intelligence that we possess: other animals are clever, but none of them seem to think the way we do, or possess the degree of self-consciousness that humans demonstrate. On that basis, it's almost impossible not to conflate language and intelligence. This, in turn, makes it very difficult not to instinctively attribute some degree of intelligence to anything that appears to be talking to you.
This point was made eloquently in a recent essay by Rusty Foster, author of the long-running newsletter Today in Tabs. Foster argues that we tend to conflate language with intelligence because until now, the presence of the former has always indicated the presence of the latter. "The essential problem is this: generative language software is very good at producing long and contextually informed strings of language, and humanity has never before experienced coherent language without any cognition driving it," writes Foster. "In regular life, we have never been required to distinguish between 'language' and 'thought' because only thought was capable of producing language."
Foster makes an exception for "trivial" examples, but even these are surprisingly compelling to us. Consider, for example, a parrot. It's certainly disconcerting to hear a bird suddenly speaking in our language, but, crucially, it's also almost impossible not to talk back. (Viewers with a tolerance for profanity might enjoy this example, which features a very Australian woman arguing with a very Australian parrot over the intellectual merits of the family dog.) Even though we know that parrots don't really know what they're "saying," the presence of language demands language in response. So what about LLMs? Are they essentially energy-hungry parrots?
"I think [this has] been one of the major lines of criticism" of the Turing Test, says Jones. "It's a super behaviorist perspective on what intelligence is: that to be intelligent is to display intelligent behavior. And so you might want to have other conditions: You might require that a machine produce the behavior in the right kind of way, or have the right kind of history of interaction with the world."

The Chinese Room thought experiment
There are also thought experiments that challenge the Turing Test's assumption that the appearance of intelligence is equivalent to the presence of genuine intelligence. Jones cites John Searle's Chinese Room thought experiment, presented in a paper published in 1980, as perhaps the best known of these. In the paper, Searle imagines himself placed in a room where someone is passing him pieces of paper under the door. These pieces of paper have Chinese characters written on them. Searle speaks no Chinese, but he has been provided with a book of detailed instructions about how to draw Chinese characters, and a set of instructions about which characters to provide in response to those he receives under the door.
To a person outside, it might appear that Searle speaks perfect Chinese when in reality, he is simply following instructions (a program) that tell him which characters to draw and how to draw them. As Searle explains in his paper, "It seems to me quite obvious in the example that I do not understand a word of the Chinese stories. I have inputs and outputs that are indistinguishable from those of the native Chinese speaker, and I can have any formal program you like, but I still understand nothing."
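The room can be caricatured in a few lines of code: a pure lookup table that maps incoming symbols to outgoing symbols, with no representation of meaning anywhere. The entries below are invented placeholders rather than anything from Searle's paper; his point is that no rulebook, however large or convincing, gives its operator any grasp of what the symbols mean.

# A toy caricature of the Chinese Room: replies come from a fixed rulebook,
# never from understanding.
RULEBOOK = {
    "你好吗？": "我很好，谢谢。",      # "How are you?" -> "I'm fine, thanks."
    "你叫什么名字？": "我叫小明。",    # "What is your name?" -> "My name is Xiaoming."
}

def room_reply(slip_of_paper):
    # Follow the instructions mechanically; never interpret the symbols.
    return RULEBOOK.get(slip_of_paper, "请再说一遍。")  # "Please say that again."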
This argument is an explicit rejection of the Turing Test's premise. With it, Searle proposes a crucial distinction between understanding and appearing to understand, between thinking and appearing to think.
Tweaking ChatGPT to fool people
It also demonstrates another potential issue with the Turing Test: The Chinese Room is clearly designed with the express purpose of fooling the person on the other side of the door. To put it another way, it's a program designed specifically to pass the Turing Test. With this in mind, it's worth noting that in Jones's experiment, the LLMs that passed the test required a degree of tweaking and tuning to be convincing. Jones says that his team tested a large number of prompts for the chatbot, and one of the key challenges was "getting [the model] to not do stuff that ChatGPT does."
Some of the ways that Jones and team got ChatGPT to not sound like ChatGPT are certainly fascinating, and again they revolve around the nuances of language. "You want it to not always speak in complete sentences," says Jones. "There's a kind of casual way that people speak when they're texting; it's just like sentence fragments. You need to get that sort of thing in."
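The kind of instruction this takes is easy to imagine. The system prompt below is a purely hypothetical illustration, not the prompt Jones's team actually used, but it shows the sort of register a model has to be asked for to suppress its default, helpful, fully punctuated voice.

# A hypothetical persona prompt in the spirit Jones describes; NOT the prompt
# used in the study, just an illustration.
PERSONA_PROMPT = """
You are a young adult texting a stranger in an online study.
Write the way people text: lowercase, sentence fragments, minimal punctuation.
Never offer help, never apologise, never explain anything at length.
Keep most replies under fifteen words.
"""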
Additionally, the team experimented with having ChatGPT make spelling errors to sound more human. Typos, says Jones, are "actually quite hard to get right. If you just tell an LLM to try really hard to make spelling errors, they do it in every word, and the errors are really unconvincing. I don't think they have a good model of what a keyboard substitution looks like, where you hit the wrong key in a word."
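What a more convincing typo model might look like is straightforward to sketch: substitutions should come from physically neighbouring keys, and they should be rare. The adjacency map and error rate below are illustrative guesses, not parameters taken from the study.

import random

# Partial QWERTY adjacency map (illustrative; extend for full coverage).
ADJACENT_KEYS = {
    "a": "qwsz", "e": "wrsd", "i": "ujko", "o": "iklp",
    "n": "bhjm", "t": "rfgy", "s": "awedxz", "r": "edft",
}

def add_typos(text, rate=0.03, seed=None):
    # Replace a small fraction of letters with a neighbouring key, mimicking
    # the "hit the wrong key" errors Jones describes, rather than mangling
    # every word.
    rng = random.Random(seed)
    out = []
    for ch in text:
        neighbours = ADJACENT_KEYS.get(ch.lower())
        if neighbours and rng.random() < rate:
            out.append(rng.choice(neighbours))
        else:
            out.append(ch)
    return "".join(out)

# Example: add_typos("honestly no idea what you mean by that", rate=0.05)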
Why ChatGPT is better than other LLMs
LLMs are difficult subjects for research: by their very nature, their internal operations are fundamentally inscrutable. Even the aspects of their construction that can be studied are often hidden behind NDAs and layers of corporate secrecy. Nevertheless, Jones says, the experiment did reveal some things about what sort of LLM is best equipped to perform a credible imitation of a human: "ChatGPT 4.5 is rumored to be one of the biggest models, and I think that being a large model is really helpful."
What does "big" mean in this sense? A large codebase? A large dataset? No, says Jones. He explains that a big model has a relatively large number of internal variables, or parameters, whose values can be tuned as the model hoovers up training data. "One of the things you see [is that] the smaller distilled models often can mimic good performance in math, and even in quite simple reasoning. But I think it's the really big models that tend to have good social, interpersonal behavioral skills."
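To put rough numbers on that idea of size, here is a back-of-envelope parameter count for a standard decoder-only transformer. The layer counts, widths, and vocabulary sizes are illustrative guesses; the real configurations of ChatGPT 4.5 and of the smaller distilled models Jones mentions are not public.

def transformer_params(layers, d_model, vocab, ff_mult=4):
    # Rough count of tunable weights: attention projections (about 4 * d_model^2)
    # plus the feed-forward block (about 2 * ff_mult * d_model^2) per layer,
    # plus the token embedding matrix. Biases, norms, and positional
    # embeddings are ignored.
    per_layer = 4 * d_model ** 2 + 2 * ff_mult * d_model ** 2
    return layers * per_layer + vocab * d_model

# Illustrative scales only:
small = transformer_params(layers=24, d_model=2048, vocab=50_000)     # roughly 1.3 billion
large = transformer_params(layers=120, d_model=12288, vocab=100_000)  # roughly 220 billion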

Even the computer programmers that created artificial intelligence don't know how it works. Credit: TED-Ed
Did Turing predict ChatGPT?
So did Turing ever conceive of his test as something that would actually be carried out? Or was it more of a thought experiment? The answer, Jones says, continues to be the subject of debate amongst Turing scholars. For his part, he is "just drawing on the paper itself. I think you can read the paper quite literally, as a suggestion that people could run this experiment at some point in the future."
Having said that, Jones also points out, "I think it's clear that Turing is not laying out a methodology. I mean, I think he doesn't imagine this experiment would be worth running for decades. So he's not telling you how long it should be or, you know, if there's any rules and what they can talk about."
If Turing did envisage the test might be passable, he certainly knew that it wouldn't happen in the 1950s. Nevertheless, his paper makes it clear that he did at least imagine the possibility that one day we might build machines that would succeed: "We are not asking whether all digital computers would do well in the game nor whether the computers at present available would do well, but whether there are imaginable computers which would do well," he writes.
Turing has often been described, rightly, as a visionary, but there's one passage in the 1950 paper that's genuinely startling in its prescience. "I believe that in about 50 years' time it will be possible to programme computers ... to make them play the imitation game so well that an average interrogator will not have more than [a] 70 per cent chance of making the right identification after five minutes of questioning."
It took 75 years, not 50, but here we are, confronted by a computer, or at least a computer-driven model, that does indeed fool people more than 70% of the time.
What makes human intelligence unique, anyway?
This all brings us back to the original question: What does it all mean? "That's a question I'm still struggling with," Jones laughs.
"One line of thinking that I think is useful is that the Turing Test is neither necessary nor sufficient evidence for intelligence: you can imagine something being intelligent that doesn't pass, because it didn't use the right kind of slang, and you can also imagine something that does pass that isn't intelligent."
Ultimately, he says, the key finding is exactly what it says on the tin: "It's evidence that these models are becoming able to imitate human-like behavior well enough that people can't tell the difference." This, clearly, has all sorts of social implications, many of which appear to interest the public and the scientific community far more than they interest the companies making LLMs.
There are also other philosophical questions raised here. Turing addresses several of these in his paper, most notably what he calls the "Argument from Consciousness." Even if a machine is intelligent, is it conscious? Turing uses the example of a hypothetical conversation between a person and a sonnet-writing machine, one that sounds strikingly like the sort of conversation one can have with ChatGPT today. The conversation provides an example of something that could be examined "to discover whether [its author] really understands [a subject] or has 'learnt it parrot-like.'"
Of course, there are many more philosophical questions at play here. Perhaps the most disquieting is this: if we reject the Turing Test as a reliable method of detecting genuine artificial intelligence, do we have an alternative? Or, in other words, do we have any reliable method of knowing when (or if) a machine could possess genuine intelligence?
"I think most people would say that our criteria for consciousness [should] go beyond behavior," says Jones. "We can imagine something producing the same behavior as a conscious entity, but without having the conscious experience. And so maybe we want to have additional criteria."
What those criteria should be, or even whether reliable criteria exist for a definitive "Is this entity intelligent or not?" test, remains to be determined. After all, it's not even clear that we have such criteria for a similar test for animals. As humans, we possess an unshakeable certainty that we are somehow unique, but over the years, characteristic after characteristic that we once considered exclusively human has turned out to be no such thing. Examples include the use of tools, the construction of societies, and the experience of empathy.
And yet, it's hard to give up the idea that we are different. It's just surprisingly difficult to identify precisely how. It is similarly difficult to determine where this difference begins: Where do we stop being sacks of electrolytes and start being conscious beings? That question turns out to be no easier to answer than whether consciousness might arise from the bewildering mess of electrical signals in our computers' CPUs.
Turing, being Turing, had an answer for this, too. "I do not wish to give the impression that I think there is no mystery about consciousness. There is, for instance, something of a paradox connected with any attempt to localise it." However, he argued that understanding the source of human consciousness wasn't necessary to answer the question posed by the test.
In the narrowest sense, he was correct: in and of itself, the question of whether a machine can reliably imitate a human says nothing about consciousness. But the sheer amount of publicity around ChatGPT passing the Turing Test says a lot about the age we're in: an age in which it may well be very important to know whether genuine artificial intelligence is possible.
To understand if a machine can be intelligent, perhaps we first need to understand how, and from where, intelligence emerges in living creatures. That may provide some insight into whether such emergence is possible in computers, or whether the best we can do is construct programs that do a very, very convincing job of parroting the internet, along with all its biases and prejudices, back at us.