The Unrealized Promise of LLM Games

When ChatGPT burst onto the scene in late 2022, game executives and AI enthusiasts imagined that LLMs would reshape games almost overnight. Two and a half years later, no breakout LLM-driven hit has appeared.

Here is my take after two years of working on AuraVale, an agentic life sim game.

Defining the problem

To get started, we need to define games and compare them to LLMs.

Most great games achieve this kind of surprise, and it is a major reason why gaming has exploded in popularity over the last 50 years. If emergence is a critical byproduct of a game, can LLMs amplify it?

At first glance, LLMs seem perfect for amplifying emergence. They produce fluent text, improvise character dialogue, and can remix lore on demand.

But an LLM is still a next-token predictor, not an all-purpose simulator, and it comes with critical constraints:

State management: games track game state in precise data structures. An LLM returns natural-language tokens and can’t hold state itself.
Difficult for designers to control: without guardrails, an LLM can contradict earlier facts or say things that are not possible within the system.
Limited mechanical surface area: most games are only lightly conversational and aren’t made any better by “intelligent” NPCs.
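
One common way to work around the first two constraints is to keep the authoritative state in ordinary data structures and treat the model's output as a proposal that the game validates before applying it. Below is a minimal Python sketch of that idea. It assumes the model has been prompted to reply with JSON. `GameState`, `apply_proposal`, and the `sell_to_player` action schema are hypothetical names invented for illustration, and the model reply is passed in as a plain string rather than fetched from any real API.

```python
import json
from dataclasses import dataclass, field


@dataclass
class GameState:
    """Authoritative state lives in the game, not in the model."""
    gold: int = 50
    inventory: list[str] = field(default_factory=list)
    # Hypothetical shop catalog; prices are illustrative only.
    shop_prices: dict[str, int] = field(
        default_factory=lambda: {"sword": 40, "shield": 80}
    )


def apply_proposal(state: GameState, raw_reply: str) -> str:
    """Validate a model-proposed action against game rules before mutating state."""
    try:
        proposal = json.loads(raw_reply)
    except json.JSONDecodeError:
        return "rejected: reply was not valid JSON"
    if not isinstance(proposal, dict):
        return "rejected: reply was not a JSON object"
    if proposal.get("action") != "sell_to_player":
        return "rejected: unknown action"
    item = proposal.get("item")
    if not isinstance(item, str) or item not in state.shop_prices:
        return "rejected: item does not exist in this shop"
    price = state.shop_prices[item]
    if state.gold < price:
        return "rejected: player cannot afford item"
    # Only now does the game change its own state.
    state.gold -= price
    state.inventory.append(item)
    return "applied"
```

The point of the design is that a reply naming an item the shop does not stock, or a sale the player cannot afford, never touches the state. The game can then re-prompt the model or fall back to a scripted line.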

This last issue is critical because, even if you solve the tech, LLMs won’t add meaningful player value in most genres. For example:

There was a mod for Fable where players could sit around a fire with an LLM-powered NPC and listen to it tell a story. While on-demand stories sound revolutionary, players don’t play Fable to sit around a campfire listening to endless stories.
There have been multiple mods where you can talk to the blacksmith and negotiate a better price. Again, this sounds cool, but players play action RPGs because they want to battle epic monsters, not have long conversations with their blacksmith.

Unlocking new game design surface area

Constraints aside, LLMs have unlocked new game design surface area, specifically:

NPCs that can role-play and generate in-character dialog in real time based on game state.
Systems that can understand player dialog and react to it in a human-like way.
Nuanced and emotional AI agents with a common sense understanding of the world.
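
To make "in-character dialog based on game state" concrete, here is a rough Python sketch of how a game might turn its own state into a prompt on each turn. Everything in it is an assumption for illustration: `NpcContext`, `build_npc_prompt`, and the prompt wording are hypothetical, and the resulting string would still need to be sent to whichever model the game uses.

```python
from dataclasses import dataclass


@dataclass
class NpcContext:
    """Snapshot of the game state that is relevant to one NPC (hypothetical shape)."""
    name: str
    persona: str
    mood: str
    known_facts: list[str]


def build_npc_prompt(npc: NpcContext, player_line: str) -> str:
    """Assemble a role-play prompt from current game state and the player's line."""
    facts = "\n".join(f"- {fact}" for fact in npc.known_facts) or "- (none)"
    return (
        f"You are {npc.name}, {npc.persona}.\n"
        f"Current mood: {npc.mood}.\n"
        "Facts you know (do not contradict them or invent new ones):\n"
        f"{facts}\n"
        f'The player says: "{player_line}"\n'
        "Reply in character in one or two sentences."
    )
```

Because the prompt is rebuilt from game state every turn, the NPC's mood and knowledge can change as the simulation changes, without relying on the model to remember anything.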

These strengths matter most in genres built on dialogue, relationships, and systemic simulation. Obvious genres that fit these criteria are:

Dating sims
Life simulation
RP games (popular on Roblox)
Social deduction games
Choose-your-own-adventure narrative games

Evaluating LLM-Native Game Concepts

For anyone exploring how to use LLMs in games today, I would recommend focusing your research on the above genres, diving deep into audience expectations, researching systems design, and brainstorming novel mechanics unlocked by LLMs.

Because the genre is so new, I would generate multiple ideas and then evaluate them against the rubric below.

#1: How deeply are LLMs integrated into the game?
Are they part of the moment-to-moment gameplay, a step in the core loop, a narrative mechanic that fires periodically, or a side feature of the game?

#2: How aligned is this feature with audience expectations?
Do players of this game actually want LLM-generated content in their game experience? When you talk to them or read Reddit posts in their communities, how open are they to LLM-based features?

You need to deeply understand the player expectations for this genre.
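
The two rubric questions can be written down as a small scoring structure so that a batch of ideas is compared consistently. This is only a sketch under stated assumptions: the `IntegrationDepth` levels mirror the four options named in question #1, while the 1 to 5 alignment scale, the `Concept` and `rank_concepts` names, and the choice to sort by alignment before depth are my own illustrative choices, not a validated method.

```python
from dataclasses import dataclass
from enum import IntEnum


class IntegrationDepth(IntEnum):
    """Rubric question #1, ordered from shallowest to deepest integration."""
    SIDE_FEATURE = 1
    PERIODIC_NARRATIVE = 2
    CORE_LOOP_STEP = 3
    MOMENT_TO_MOMENT = 4


@dataclass
class Concept:
    name: str
    depth: IntegrationDepth
    # Rubric question #2, judged by the team from player research.
    # Assumed scale: 1 (players reject it) to 5 (players ask for it).
    audience_alignment: int


def rank_concepts(concepts: list[Concept]) -> list[Concept]:
    """Return concepts best-first: audience alignment first, then integration depth."""
    return sorted(
        concepts,
        key=lambda c: (c.audience_alignment, c.depth),
        reverse=True,
    )
```

Alignment is placed first in the sort key because a deeply integrated feature that players do not want is unlikely to help; a team that weighs the two questions differently would change the key.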

We have now established the new game design surface area that LLMs unlock (dialog generation, role-playing, and understanding human dialog and emotions) and settled on a few genres that these abilities would improve. Let’s dive into how I would build a mobile-first live service game using LLMs.

Step 1: Broad Research and LLM features brainstorm
I would start by playing the top titles on mobile and Steam and writing mini-reports that include gameplay screenshots, the core loop, and LLM-native feature ideas.

Step 2: Genre Deep-dive & LLM features evaluation
As the team homes in on a genre, we’d want to double down and research all of the games in that genre (the good and the bad) to get a deep understanding of player expectations and motivations.

From there, we’d evaluate each potential LLM-enabled feature on:

Depth of impact to the core loop.
Alignment with audience expectations.

Before writing any code, we’d want to have a handful of design ideas where LLMs are tightly coupled with the core game loop and meaningfully aligned with what players want from this genre.

Step 3: Prototyping
The team would then build out multiple prototypes to compare and contrast internally. I’d prioritize the following questions:

Is this game actually fun?
What new design possibilities have we unlocked?
Do we believe LLMs will make this game more fun, engaging, deep, or monetizable than the current best-in-class titles?

Step 4: Demos & Playtesting
Pre-production ends when we have a playable demo that feels new and exceeds player expectations for the genre.

The prototyping and demo phase would take ~50% longer than for traditional games, as there are more design questions and less tooling available.

Where gaming is going

We are currently in the early days of the AI age of gaming, where designers are still naively exploring the design space unlocked by LLMs.

Going forward, designers need to understand what LLMs are good at, identify genres where players value those strengths, and run a process to iterate and unlock new ways to play using LLMs.

The teams that understand the tech, understand their audience, and get the most iterations in are going to build the next hit in gaming.

That said, I am much less confident than I was back in 2023, as we are two and a half years in and have yet to see a single big LLM-driven game in the market.