Two recent papers on the question of nonhuman (and AI) consciousness have led me to reflect further on an often-rehashed limitation of large language models (LLMs): they inhabit worlds that are entirely textual[1]. But is this really as limiting as it first seems?
Text isn’t all you need?
It is often taken as axiomatic that learning from text alone is not sufficient to create a self-aware entity. On one hand, this seems natural: the gold-standard existence proof for conscious self-experience is our own, and so it feels rational to suppose that the more “human” an entity is, the more likely it might be to have self-awareness. At the sensory level, this means vision, audition, taste, … and at the motor level it means the ability to physically act upon our world. Indeed, embodied theories of consciousness argue that these things are necessary to create self-awareness.
For simplicity and concreteness, let’s restrict our discourse here to sensory inputs and a caricatured position on consciousness: “One cannot generate a conscious entity by training only on text. Other sensory modalities are required.”
But are they?
It from bit
Physicist John Archibald Wheeler proposed the radical “it from bit” hypothesis, suggesting that the fundamental substrate of the universe is information:
It from bit symbolises the idea that every item of the physical world has at bottom — at a very deep bottom, in most instances — an immaterial source and explanation; that what we call reality arises in the last analysis from the posing of yes-no questions and the registering of equipment-evoked responses; in short, that all things physical are information-theoretic in origin and this is a participatory universe.
Suppose that Wheeler is correct. Information — and information alone — is the true substrate underlying everything in the universe. Vision? Converting the bits of certain wavelengths of light in certain configurations to the bits of activity in the visual hierarchy of your brain. Sound? Bits. Taste? Bits. Language? Bits. Thought? Bits. Everything is information and computation.
When everything is viewed as information, it is easy to see morphisms between sensory modalities (and everything else, for that matter); all it takes is some computation. And if it really is just bits all the way down, and a bit is a bit, what is the difference between tasting an apple via sensory receptors on your tongue and reading a very good natural language description of the same? Is not the same (or at least *-morphic) information being encoded?
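To make the “a bit is a bit” intuition concrete, here is a toy Python sketch. Everything in it, the colour triples and their textual descriptions alike, is invented for illustration; the only point is that an invertible mapping between two encodings preserves the information exactly:

```python
# Toy illustration: the same information carried by two encodings.
# The "sensory" RGB triples and the textual descriptions are both
# made up for this sketch; what matters is that the mapping between
# them is invertible, i.e. an isomorphism between representations.

sensory_to_text = {
    (255, 0, 0): "a vivid, saturated red",
    (0, 255, 0): "a bright, saturated green",
    (0, 0, 255): "a deep, saturated blue",
}

# Invert the table: recover the "sensory" encoding from the text.
text_to_sensory = {text: rgb for rgb, text in sensory_to_text.items()}

def round_trip(rgb):
    """Encode a 'percept' as text, then decode it back again."""
    return text_to_sensory[sensory_to_text[rgb]]

# No information is lost in the detour through language.
assert all(round_trip(rgb) == rgb for rgb in sensory_to_text)
```

A three-row lookup table is a caricature, of course; the claim here is only that some computable mapping of this kind exists, not that it is small or easy to construct (see footnote [2]).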
This is, admittedly, an extreme position, but it establishes an intellectual lower bound: a framework that forces critical examination of our intuitions regarding the ineffable qualia of our sensory experiences. If Wheeler is correct, there is no fundamental obstacle to tasting an apple via text, so long as the same information is presented and processed in the same way[2].
Quining qualia
Returning to the more practical world of LLMs, philosopher Dan Dennett aligns with Wheeler when he writes:
… are we really so sure that what it is like to see red or blue can't be conveyed to one who has never seen colors in a few million or billion words?[3]
I answer Dennett’s query with humility: I’m not so sure. Maybe all of the text that serves as training grist for our LLMs is providing not just language but also something resembling experience. Maybe training on a few trillion tokens’ worth of text related to colour, perception, physics, psychology, … can impart “what it is like” to see red. Maybe an LLM experiences reality with the same fidelity that I experience an author’s constructed world in an exceptional novel? If this is true, then what it is like to be a thinking LLM seems somewhat close to what it is like to be a thinking human.
It from bat
There are many who reject Dennett (and Wheeler) and propose the existence of qualia — that which makes up our subjective experience and transcends easy material explanation — succinctly summarized in Thomas Nagel’s famous question: “What is it like to be a bat?” In such theories, what it is like to see, and taste, a red apple is, by definition, distinct from a mere description of the same. If these theories are correct, what it is like to be an LLM is likely very different from what it is like to be a human; and, indeed, perhaps there is nothing at all it is like to be an LLM.
What do I think?
In the short term, I believe Dennett: The experience of what it is like to see red can be transmitted in written language, and one can profitably tap-dance about architecture. LLMs build world models and can reason about them. They can probably have experiences of a sort. No embodiment required[4].
In the long term, I believe Wheeler. It’s information and computation all the way down and deep neural nets are a powerful window into the class of computation we call cognition.
[1] Multi-modal models exist, and are exciting for many reasons, but without loss of generality I’ll argue here from the simple case of a text-only LLM.
[2] Though this may be exceptionally difficult to do. I argue only for the existence of a mapping, not for the ease of implementing it.
[4] Which is not to say that models better suited to some tasks cannot be built, or built better, by an embodied agent.