Thinking outside the box by using images of text, rather than pure text, during tokenization of LLMs.
In today’s column, I examine a rather innovative idea that cleverly turns the conventional design of generative AI and large language models (LLMs) on its head. Simply stated, consider the brash notion that instead of generative AI receiving pure text, the text is first captured as images, and the images are then fed into the AI.
Say what?
For anyone versed in the technical underpinnings of LLMs, this seems entirely oddball and counterintuitive. You might already be yelling aloud that this makes no sense. Here’s why. An LLM is designed to deal with natural languages such as English and, therefore, makes abundant use of text. Text is the way that we normally input prompts and enter our questions into LLMs. Opting to use images of text, in place of actual text, has got to be a screwball concept. Blasphemous.
Hold onto your hat because some earnest researchers tried the approach, and there is enough merit that we ought to give the flight of fancy a modicum of serious, diligent attention.
Let’s talk about it.
This analysis of AI breakthroughs is part of my ongoing Forbes column coverage on the latest in AI, including identifying and explaining various impactful AI complexities (see the link here).
Tokenization Is Crucial
The heart of the matter entails the tokenization aspects of modern-era generative AI and LLMs. I’ve covered the details of tokenization at the link here. I will provide a quick overview to get you up to speed.
When you enter text into AI, the text gets converted into various numbers. Those numbers are then dealt with throughout the rest of the processing of your prompt. Once the AI has arrived at an answer, the answer is actually in a numeric format and needs to be converted back into text, so it is readable by the user. The AI proceeds to convert the numbers into text and displays the response accordingly.
That whole process is known as tokenization. The text that you enter is encoded into a set of numbers. The numbers are referred to as tokens. The numbers, or shall we say tokens, flow through the AI and are used to figure out answers to your questions. The response is initially in the numeric format of tokens and needs to be decoded back into text.
Fortunately, an everyday user is blissfully unaware of the tokenization process. There is no need for them to know about it. The topic is of keen interest to AI developers, but of little interest to the general public. All sorts of numeric trickery are often employed to try to make the tokenization process as fast as possible so that the AI isn’t held up during the encoding and decoding that needs to occur.
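For readers who like to see the nuts and bolts, here is a small illustrative sketch of that round trip in Python. It assumes the open-source tiktoken tokenizer library; other LLMs use their own tokenizers, but the encode-and-decode flow is conceptually the same.

```python
# Minimal sketch of the tokenization round trip, assuming the open-source
# tiktoken library is installed (pip install tiktoken). Other LLMs use their
# own tokenizers, but the encode/decode flow is conceptually the same.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a commonly used token encoding

prompt = "The dog chased the ball across the yard."
tokens = enc.encode(prompt)      # text -> list of integer token IDs
restored = enc.decode(tokens)    # token IDs -> text

print(tokens)       # a short list of integers
print(len(tokens))  # far fewer tokens than characters
print(restored)     # identical to the original prompt
```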
Tokens Are An Issue
I mentioned that the general public usually doesn’t know about the tokenization aspects of LLMs. That’s not always the case. Anyone who has pushed AI to its limits is probably vaguely aware of tokens and tokenization.
The deal is this.
Most of the contemporary LLMs, such as OpenAI’s ChatGPT and GPT-5, Anthropic Claude, Meta Llama, Google Gemini, xAI Grok, and others, are somewhat limited in the number of tokens they can adequately handle at one time. When ChatGPT first burst onto the scene, the number of allowed tokens in a single conversation was quite limited.
You would rudely discover this fact by ChatGPT suddenly no longer being able to recall the earlier portions of your conversation. This was due to the AI hitting the wall on how many active tokens could exist at one time. The tokens from earlier in your conversation were summarily being tossed away.
If you were doing any lengthy and complex conversations, these limitations were exasperating and pretty much knocked out of contention any big-time use of generative AI. You were limited to relatively short conversations. The same issue arose when you imported text via a method such as RAG (see my discussion at the link here). The text had to be tokenized and once again counted against the threshold of how many active tokens the AI could handle.
It was maddening to those who had dreams of using generative AI for larger-scale problem-solving.
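To picture how that forgetting happens, here is a deliberately simplified sketch. It is not any vendor’s actual implementation; real systems use more elaborate strategies, but the basic effect of the oldest material falling out of a capped window is the same. Token counts are crudely approximated here by word counts.

```python
# Illustrative sketch (not any vendor's actual implementation) of how early
# conversation turns fall out of a limited context window. Token counts are
# crudely approximated here by whitespace word counts.
TOKEN_CAP = 50  # hypothetical, deliberately tiny cap for demonstration

def fit_to_window(turns: list[str], cap: int = TOKEN_CAP) -> list[str]:
    """Drop the oldest turns until the rough token total fits under the cap."""
    kept = list(turns)
    while kept and sum(len(t.split()) for t in kept) > cap:
        kept.pop(0)  # the earliest part of the conversation is tossed away
    return kept

conversation = [f"Turn {i}: " + "some earlier discussion " * 3 for i in range(1, 11)]
window = fit_to_window(conversation)
print(f"Kept {len(window)} of {len(conversation)} turns; the rest were forgotten.")
```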
Limits Are Higher But Still Exist
The early versions of ChatGPT had a limitation of fewer than 10,000 tokens that could be active at any point in time. If you think of a token as representing a small word, such as “the” or “dog”, this means you hit the wall once your conversation had consumed roughly ten thousand simple words. This was insufferable at the time for any lengthy or complex usage.
Nowadays, the conventional version of GPT-5 has a token context window of about 400,000 tokens. That figure is the combined capacity for both the input tokens and the output tokens. Context window sizes can vary. For example, Claude has a limit of about 200,000 tokens on some of its models, while others extend further to around 500,000 tokens.
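One practical consequence is that an application has to reserve part of the window for the model’s reply. A quick back-of-the-envelope check, using illustrative numbers, looks like this:

```python
# Back-of-the-envelope budget check, using illustrative numbers. The key point
# is that input tokens and output tokens draw from the same combined window.
CONTEXT_WINDOW = 400_000    # e.g., roughly the combined window cited above
MAX_OUTPUT_TOKENS = 20_000  # room reserved for the model's reply

input_budget = CONTEXT_WINDOW - MAX_OUTPUT_TOKENS
prompt_tokens = 385_000     # suppose your prompt plus history is this large

if prompt_tokens > input_budget:
    print("Prompt won't fit; trim the history or shrink the reserved output.")
else:
    print(f"OK: {input_budget - prompt_tokens} input tokens to spare.")
```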
A visionary view of the future is that there won’t be any limitations associated with the allowed number of tokens. There is state-of-the-art work on so-called infinite or endless memory in AI that would pretty much enable any number of tokens. Of course, in a practical sense, there is only so much server memory that can exist; thus, it isn’t truly infinite, but the claim is catchy and reasonably fair. For my explanation of how AI infinite memory works, see the link here.
Coping With The Token Problem
Because tokenization is at the crux of how most LLMs are designed and utilized, a lot of effort has been strenuously undertaken to try to optimize the tokenization aspects. The aim is to somehow make tokens smaller, if possible, allowing more tokens to exist within whatever memory constraints the system has.
AI designers have repeatedly sought to compress tokens. Doing so could be a big help. Whereas a token window might be customarily limited to 200,000 tokens, if you could drop each token down to half its usual size, you could double the limit to 400,000 tokens. Nice.
There is a nagging catch associated with the compression of tokens. Often, yes, you can squeeze them down in size, but the precision gets undercut when you do so. That’s bad. It might not be overly bad in the sense that the results are still workable and usable. It all depends upon how much precision gets sacrificed.
Ideally, you would want the maximum possible compression while retaining 100% of the precision. It’s a lofty goal. The odds are that you will need to weigh compression levels against precision. Like most things in life, there is never a free lunch.
Knock Your Socks Off
Suppose we allowed ourselves to think outside the box.
The usual approach with LLMs is to accept pure text, encode the text into tokens, and proceed on our merry way. We would almost always begin our thought processes about tokenization by logically and naturally assuming that the input from the user will be pure text. They enter text via their keyboard, and text is what gets converted into tokens. It’s a straightforward approach.
Ponder what else we might do.
Seemingly out of left field, suppose we treated text as images.
You already know that you can take a picture of text, have it optically scanned, and either keep it as an image or later convert it into text. The process is a longstanding practice known as OCR (optical character recognition). OCR has been around since the early days of computers.
The usual OCR process consists of converting images into text and is referred to as image-to-text. Sometimes you might want to do the reverse, namely, you have text and want to transform the text into images, which is text-to-image processing. There are lots and lots of existing software applications that will gladly do image-to-text and do text-to-image. It is old hat.
Here’s the crazy idea about LLMs and tokenization.
We still have people enter text, but we take that text and convert it to an image (i.e., text-to-image). Next, the image of the text is used by the token encoder. Thus, rather than encoding pure text, the encoder is encoding based on images of text. When the AI is ready to provide a response to the user, the tokens are decoded back into text, making use of image-to-text conversions.
Boom, drop the mic.
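For concreteness, here is a rough sketch of that flow under stated assumptions. The text-to-image rendering uses the Pillow imaging library, which is real and widely available; the vision encoder is a hypothetical placeholder (a real system such as DeepSeek-OCR runs a trained encoder at that step), and the token counts it produces are made up purely to illustrate that the count tracks the image rather than the word count.

```python
# Rough sketch of the text-as-image flow, under stated assumptions: the
# rendering uses Pillow (pip install Pillow), while vision_encode() is a
# hypothetical placeholder for a trained vision encoder such as DeepSeek-OCR's.
from PIL import Image, ImageDraw

def render_text_to_image(text: str, width: int = 800, line_height: int = 16) -> Image.Image:
    """Text-to-image: draw the prompt onto a plain white canvas."""
    lines = text.splitlines() or [""]
    img = Image.new("RGB", (width, line_height * (len(lines) + 1)), "white")
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        draw.text((10, 8 + i * line_height), line, fill="black")
    return img

def vision_encode(img: Image.Image, patch: int = 64) -> list[int]:
    """Hypothetical stand-in for a vision encoder: one token per image patch.
    A real encoder would emit learned vision tokens, not patch indices."""
    cols, rows = img.width // patch, img.height // patch
    return list(range(cols * rows))

long_prompt = "\n".join(
    "This sentence stands in for one line of a much longer document." for _ in range(100)
)
image = render_text_to_image(long_prompt)
vision_tokens = vision_encode(image)

print(f"Words in the prompt: {len(long_prompt.split())}")    # 1,200 words
print(f"Vision tokens (illustrative): {len(vision_tokens)}")  # far fewer
```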
Making Sense Of The Surprise
Whoa, you might be saying, what good does this playing around with images achieve?
If the image-to-tokens conversion yields fewer tokens for the same text, we are effectively compressing the token stream. This, in turn, means we can potentially fit more content within the bounds of limited memory. Remember that the compression of tokens is squarely on our mind.
In a recently posted study entitled “DeepSeek-OCR: Contexts Optical Compression” by Haoran Wei, Yaofeng Sun, and Yukun Li, arXiv, October 21, 2025, the researchers made these claims (excerpts):
- “A single image containing document text can represent rich information using substantially fewer tokens than the equivalent digital text, suggesting that optical compression through vision tokens could achieve much higher compression ratios.”
- “This insight motivates us to reexamine vision-language models (VLMs) from an LLM-centric perspective, focusing on how vision encoders can enhance LLMs’ efficiency in processing textual information rather than basic VQA, which humans excel at.”
- “OCR tasks, as an intermediate modality bridging vision and language, provide an ideal testbed for this vision-text compression paradigm, as they establish a natural compression-decompression mapping between visual and textual representations while offering quantitative evaluation metrics.”
- “Our method achieves 96%+ OCR decoding precision at 9-10x text compression, ∼90% at 10-12x compression, and ∼60% at 20x compression on Fox benchmarks featuring diverse document layouts (with actual accuracy being even higher when accounting for formatting differences between output and ground truth).”
As noted above, the experimental work seemed to suggest that a 10x compression ratio could at times be achieved with 96% precision. If that could be done across the board, it would imply that, whereas a token window limit today might be 400,000 tokens, the effective limit could be raised to 4,000,000 tokens, albeit at a 96% precision rate.
The precision at 96% might be tolerable or intolerable, depending on what the AI is being used for. You can’t get a free lunch, at least so far. A compression rate of 20x would be even better, though the precision at 60% would seem quite unattractive. Still, there might be circumstances in which one could begrudgingly accept the 60% for the 20x increase.
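To see the trade-off in plain numbers, here is a quick arithmetic sketch that applies the paper’s reported compression/precision pairs to the roughly 400,000-token window mentioned earlier:

```python
# Quick arithmetic on the trade-off, using the ~400,000-token window cited
# earlier and the compression/precision pairs reported in the DeepSeek-OCR paper.
BASE_WINDOW = 400_000

reported = [
    (10, 0.96),  # ~10x compression at 96%+ decoding precision
    (12, 0.90),  # 10-12x compression at roughly 90% precision
    (20, 0.60),  # 20x compression at roughly 60% precision
]

for ratio, precision in reported:
    effective = BASE_WINDOW * ratio
    print(f"{ratio:>2}x compression -> effective window of ~{effective:,} tokens "
          f"at ~{precision:.0%} decoding precision")
```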
Famous AI luminary Andrej Karpathy posted his initial thoughts online about this approach overall: “I quite like the new DeepSeek-OCR paper. It’s a good OCR model (maybe a bit worse than dots), and yes data collection etc., but anyway it doesn’t matter. The more interesting part for me (esp as a computer vision at heart who is temporarily masquerading as a natural language person) is whether pixels are better inputs to LLMs than text. Whether text tokens are wasteful and just terrible, at the input. Maybe it makes more sense that all inputs to LLMs should only ever be images.” (source: Twitter/X, October 20, 2025).
Brainstorming Is Useful
The research study also tried using a multitude of natural languages. This is yet another value of using images rather than pure text. As you know, there are natural languages that make use of pictorial characters and words. Those languages would seem especially well-suited to an image-based method of tokenization.
Yet another intriguing facet is that we already have VLMs, namely AI that deals with visual images rather than text per se (i.e., vision-language models). We don’t have to reinvent the wheel when it comes to doing likewise with LLMs. Just borrow what has worked with VLMs and readjust it for usage in LLMs. That’s using the whole noggin and leveraging reuse when feasible.
The idea is worthy of acknowledgment and additional digging. I wouldn’t suggest going around and right away declaring that all LLMs need to switch to this kind of method. The jury is still out. We need more research to see how far this goes, along with understanding both the upsides and the downsides.
Meanwhile, I guess we can at least make this bold pronouncement: “Sometimes, a picture really is worth a thousand words.”

