From Infinite Scene Images to Infinite Comic Books (a first ComfyUI comic book generator that turns a simple story into consistent pages using no references, LoRAs, etc.)

A few days ago I published an article about a workflow capable of generating an effectively infinite number of consistent scene images using nothing more than a text description. The central idea was surprisingly simple: instead of maintaining consistency by carrying visual information from one generation to the next through LoRAs, ControlNet, reference images, edit models or image-to-image workflows, I continuously regenerated the description of the world itself. The previous image stopped being the source of truth. The description became the source of truth.

If you haven’t read it yet, you can find it here:

KREA 2 (and maybe others) infinite scene images with consistency using description (ComfyUI workflow)

At the time, I thought I had simply found an interesting way of generating endless consistent scene variations. What I completely failed to realize was that I had accidentally stumbled onto something much more fundamental.

If a textual description can become the persistent representation of a scene, then why shouldn’t it become the persistent representation of an entire fictional universe?

That single question completely changed the direction of the project.

Instead of generating isolated images, I started building a workflow capable of transforming an ordinary story into a complete comic book, where every page is generated independently, yet still belongs to the same world.

No previous image.

No LoRA.

No ControlNet.

No reference images.

No character sheets.

No edit models.

No IP-Adapter.

No image-to-image.

Nothing.

Just a story.

That sounds almost absurd when you think about how we’ve approached consistency in AI image generation over the last few years. The accepted wisdom has always been that visual consistency requires visual memory. If you want the same character tomorrow, you somehow have to carry yesterday’s image into tomorrow’s generation. Almost every existing workflow is built around this assumption.

While working on this project I realized something embarrassingly obvious.

In a pure text-to-image workflow, the image model never actually sees the previous image.

It only sees the prompt.

So why was I treating the image as the source of truth?

Instead, I made the description itself the canonical representation of the world.

Every comic page begins by redefining every recurring visual element that appears on it. Characters are described again. Locations are described again. Clothing, lighting, atmosphere, architecture, recurring objects, materials and colors are all described again. Every page assumes the model remembers absolutely nothing, because in reality it doesn’t. Any page can be generated first, last, or completely on its own, and it should still belong to exactly the same universe.
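To make the idea concrete, here is a minimal sketch of how a self-contained page prompt could be assembled. It is not the actual prompt from the workflow: the `World` structure, the `build_page_prompt` function, the section labels and the example character and location are all hypothetical names I am using for illustration. What it shows is only the principle: every page rebuilds its recurring elements from the same canonical text, word for word, and never refers to another page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class World:
    """Canonical text description of the fictional world (hypothetical structure)."""
    style: str
    characters: dict[str, str]
    locations: dict[str, str]


def build_page_prompt(world: World, cast: list[str], location: str, action: str) -> str:
    """Rebuild everything the page needs from the canon, assuming no memory."""
    lines = [f"STYLE: {world.style}"]
    lines.append(f"LOCATION {location}: {world.locations[location]}")
    for name in cast:
        # The full description is repeated verbatim on every page.
        lines.append(f"CHARACTER {name}: {world.characters[name]}")
    lines.append(f"PAGE ACTION: {action}")
    return "\n".join(lines)


# Illustrative canon; the character and location are made up for this example.
world = World(
    style="inked comic art, muted teal and amber palette",
    characters={"Mara": "a tall woman in her thirties, short silver hair, long red wool coat"},
    locations={"harbor": "a foggy stone harbor with iron cranes and wet cobblestones"},
)

page_1 = build_page_prompt(world, ["Mara"], "harbor", "Mara steps off the ferry.")
page_7 = build_page_prompt(world, ["Mara"], "harbor", "Mara reads a letter under a crane.")
```

Here `page_1` and `page_7` share an identical opening block and differ only in the final action line, so either one can be rendered first, last, or alone.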

Initially I thought I was simply giving the model more information.

After spending far too many hours experimenting with it, I don’t think that’s what’s happening anymore.

I think the repeated descriptions act as semantic anchors.

The model isn’t remembering the previous page. It is repeatedly being pushed into almost exactly the same semantic region before every generation begins. Instead of preserving images, the workflow preserves meaning. Every page reconstructs the same fictional universe from scratch before asking the image model to render a single panel.

Ironically, I almost destroyed the workflow while trying to optimize it.

Like most people, my instinct was to make the prompts shorter, cleaner and more elegant. The results became noticeably worse. Characters drifted. Environments slowly changed. Clothing evolved. The comic stopped feeling like a single world.

That’s when I realized something that now seems obvious.

The redundancy wasn’t the cost.

The redundancy was the mechanism.

Every repeated sentence wasn’t wasting tokens. It was removing degrees of freedom from the image model. Every repeated description narrowed the space of possible interpretations until the model naturally converged toward the same result over and over again.

There’s another reason I believe this approach has only become practical recently.

The models themselves have changed.

Ironically, one of the most common complaints about modern image models is that they’re becoming “too consistent.” People often describe this as a limitation because identical prompts tend to produce similar results. Personally, I think we’ve been looking at this backwards.

Consistency is an asset.

If you want variation, that’s easy. Randomize the prompt. In fact, I wrote an entire article about exactly that. A prompt enhancer or prompt randomizer can deliberately inject variation whenever you want it.

But if the underlying model is fundamentally inconsistent, there is almost nothing you can do to force it to become consistent.

Inconsistency is extremely difficult to fix.

Consistency is trivial to break.

I’ll take consistency every single time.
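As a rough illustration of how easy it is to break consistency on purpose, here is a small sketch of a prompt randomizer. This is not the randomizer from my other article; the `VARIATIONS` table, the slot names and the `randomize_prompt` function are assumptions made up for this example. The canonical description is left untouched and variation is injected only into a few chosen slots, driven by a seed so the result is repeatable.

```python
import random

# Hypothetical slots where variation is allowed; the canon itself is never edited.
VARIATIONS = {
    "camera": ["low-angle shot", "wide establishing shot", "close-up"],
    "weather": ["light rain", "clear sky", "morning fog"],
}


def randomize_prompt(canonical: str, seed: int) -> str:
    """Append one randomly chosen option per slot to an unchanged canonical prompt."""
    rng = random.Random(seed)  # local generator, so the same seed gives the same prompt
    extras = [f"{slot}: {rng.choice(options)}" for slot, options in VARIATIONS.items()]
    return canonical + "\n" + "\n".join(extras)


canon = "CHARACTER Mara: a tall woman, short silver hair, long red wool coat"
variant_a = randomize_prompt(canon, seed=1)
variant_b = randomize_prompt(canon, seed=2)
```

Both variants still start with exactly the same canonical text, so the character stays anchored while the framing is free to change.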

The second reason this has become viable is context length. Modern language models can comfortably process thousands of tokens, while image models have become much better at following long, detailed descriptions. That allows the prompt to evolve from a simple caption into something much closer to a specification document. Those thousands of tokens are no longer wasted—they become the persistent memory of the fictional world.

Krea 2 was simply the first open-source model I experimented with that demonstrated this idea convincingly enough to make me realize there was something fundamentally different happening. But the technique itself doesn’t appear to be tied to Krea. I’ve had similarly encouraging results using FLUX Dev and Z-Image as well. Klein, on the other hand, is still rather poor at following long textual descriptions, so the approach doesn’t work nearly as well there. To me, that’s actually good news. It suggests the workflow isn’t exploiting some hidden quirk inside Krea. It’s taking advantage of a broader shift in how modern image models understand language.

Perhaps the funniest part of the whole project is how technically unimpressive the workflow actually looks.

If someone opened the graph expecting dozens of custom nodes, complex diffusion tricks or some elaborate latent manipulation pipeline, they’d probably be disappointed.

There isn’t one.

The workflow is almost entirely composed of ordinary ComfyUI nodes. A language model transforms the story into production-ready comic pages. A tiny Python snippet separates the pages. The image model renders them. That’s essentially the whole thing.
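The page-splitting step can be sketched roughly like this. It is not the exact snippet from the workflow: it assumes the language model has been told to start every page with a marker line such as `=== PAGE 3 ===`, and both that marker format and the `split_pages` name are my assumptions for the example.

```python
import re

# Assumed marker format: a line like "=== PAGE 3 ===" opens each page.
PAGE_MARKER = re.compile(r"^=== PAGE \d+ ===[ \t]*$", re.MULTILINE)


def split_pages(llm_output: str) -> list[str]:
    """Return one prompt string per page, ignoring any chatter before the first marker."""
    first = PAGE_MARKER.search(llm_output)
    if first is None:
        return []  # no markers found, so there is nothing to render
    body = llm_output[first.start():]
    return [chunk.strip() for chunk in PAGE_MARKER.split(body) if chunk.strip()]


raw = (
    "Here are your pages:\n"
    "=== PAGE 1 ===\n"
    "Full world description, then page one action.\n"
    "=== PAGE 2 ===\n"
    "Full world description, then page two action.\n"
)
pages = split_pages(raw)
```

Each entry in `pages` is then a complete, standalone prompt that can be sent to the image model on its own.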

The intelligence isn’t hidden in the node graph.

It’s hidden in the prompt.

And I don’t mean prompt engineering in the old sense of finding magical adjectives or secret keywords. For a while prompt engineering became something of a meme, almost a joke. But I think we’ve quietly crossed a threshold where that attitude no longer matches reality.

As language models become an increasingly important part of image generation itself, prompt engineering stops being about writing prettier prompts and starts becoming software architecture.

The prompt defines the persistent state of the entire system.

It decides what information is immutable.

What information evolves.

What gets repeated.

What never changes.

The workflow isn’t really a collection of nodes anymore.

The prompt is the workflow.

That observation reminds me of prompt enhancers. They started as experiments. Today they’re becoming standard. Prompt randomizers followed a similar path. They sounded like a novelty until people realized they were genuinely useful. I believe canonical semantic reconstruction—the idea of repeatedly rebuilding the same fictional world through language instead of preserving it visually—is following exactly the same trajectory.

Maybe I’m wrong.

Maybe someone has already built something similar and I simply haven’t found it.

I’ve spent quite a while looking, and so far I haven’t seen a workflow whose entire consistency strategy is based on repeatedly reconstructing the world semantically while generating every image independently. If one exists, I’d genuinely love to see it.

But if it doesn’t, I have a feeling this idea will become one of those techniques that, a few years from now, simply feels obvious.

Not because it’s particularly complicated.

Quite the opposite.

Because it’s almost embarrassingly simple.

And those are usually the ideas that last.

For years we’ve been asking, “How do we make the model remember the previous image?”

I’m starting to think we’ve been asking the wrong question.

The better question may be:

“How do we make the previous image completely irrelevant?”

If the answer turns out to be language, then I think we’re witnessing the beginning of a much larger shift—one where prompt engineering stops being a trick for improving prompts and becomes the foundation upon which entire generative workflows are built.

Download the workflow from here. It’s actually a Promptflow, and it is straightforward to use 🙂

Aurel Manea
