WAN 2.2 + external actors > LTX-2 upscaler/refiner/actor reinforcement in ComfyUI

In my previous posts I talked about how you can use LTX-2 as a WAN upscaler/refiner and how to add external actor and element references without img2vid (you need an empty scene without them and need them to come into the scene).
But why not both? LTX-2 is weak in action sequences and human interactions, so the alternative at this point is WAN 2.2. But WAN is low-res and has the same issue as LTX: there is no way, for now, to add actors in latent space.
So I used the same technique as for LTX-2 to add actors to WAN and then reinforce them in LTX-2 using the same method. Here are some results:

Idea:
Generate a very low-res WAN 2.2 video as a reference for LTX, but still prepend the actor and element images at the beginning of the video. Then comes the first image of the actual shot, with the prompt referencing the characters shown at the beginning of the video. This step at 480p is very fast and good enough for character interaction, movement coherence, etc. to be used as vid2vid input in LTX-2. We save it at 12 fps so we can upscale it with the temporal upscaler in LTX.
Then in the LTX step we bring in the same intro images, but at the highest resolution possible, so LTX knows what the characters actually look like in maximum detail and paints them over the low-res WAN video at 4x the resolution. So the 480p video becomes 1440p in this case (but you can go lower if you don’t have the resources; I have a 3090 and 64 GB of system RAM).
Both Qwen Image Edit and Flux Klein were used for generating the actors, the scene, zoom-ins on the scene, removing characters, etc.
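To make the frame layout and the frame-rate step concrete, here is a minimal planning sketch in plain Python. It is not part of the ComfyUI workflow and does not call any ComfyUI or model API. The function names, the file names and the `hold_frames` parameter are hypothetical, and the 2x temporal factor is an assumption: the post only says the video is saved at 12 fps so the temporal upscaler can be used.

```python
def build_reference_sequence(reference_images, first_frame, hold_frames=1):
    """Order the inputs the way the idea describes: reference images
    (actors, elements) come first, then the first frame of the actual shot.

    hold_frames is a hypothetical knob for how many frames each
    reference image is held; it is not a setting from the workflow.
    """
    frames = []
    for image in reference_images:
        frames.extend([image] * hold_frames)
    frames.append(first_frame)
    return frames


def frame_count(seconds, fps):
    """Number of frames needed to cover a clip of the given length."""
    return int(round(seconds * fps))


def upscaled_fps(source_fps, temporal_factor=2):
    """Frame rate after temporal upscaling.

    temporal_factor=2 is an assumption, not a value stated in the post.
    """
    return source_fps * temporal_factor


if __name__ == "__main__":
    # Hypothetical file names, used only for illustration.
    layout = build_reference_sequence(["actors.png"], "first_frame.png")
    print(layout)
    print(frame_count(5, 12), "frames for a 5 second clip at 12 fps")
    print(upscaled_fps(12), "fps after the assumed 2x temporal upscale")
```

The point of the sketch is only the ordering (references before the shot) and the fact that a low source frame rate leaves room for the temporal upscaler to add frames afterwards.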

Here is the workflow for it.

Usage: This is a two-group, two-step workflow. The two groups could be merged into a single step, but I run into OOMs no matter what I do.
Input 2 images on the left, both 1920×1088 (very important). Set the prompts similar to the one at the end of this post to pick up the actors from the first image and put them into the scene that starts with the second image.
The first image contains the actors on a gray background and the second one is the actual first frame you need for the video.
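Since the input size matters so much, a small pre-flight check can catch a wrong image before a long run. The sketch below is a hypothetical helper, not something from the workflow. It takes the width and height as plain numbers, so it needs no image library. The `multiple=32` check is an assumption about why 1088 is used rather than 1080 (1088 divides evenly by 32, 1080 does not); the post itself only says both images must be 1920×1088.

```python
EXPECTED_SIZE = (1920, 1088)  # size required by the post for both inputs


def check_input_size(width, height, expected=EXPECTED_SIZE, multiple=32):
    """Return a list of human-readable problems; empty means the size is fine.

    multiple=32 is an assumed divisibility requirement, not a documented one.
    """
    problems = []
    if (width, height) != expected:
        problems.append(
            f"size is {width}x{height}, expected {expected[0]}x{expected[1]}"
        )
    if width % multiple or height % multiple:
        problems.append(f"width and height should both be multiples of {multiple}")
    return problems


if __name__ == "__main__":
    # The actors image and the first frame should both pass this check.
    for name, size in {"actors": (1920, 1088), "first_frame": (1920, 1080)}.items():
        issues = check_input_size(*size)
        print(name, "OK" if not issues else issues)
```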
The results are far from perfect and there are huge consistency issues, but I will work on it and see what can be improved.
WAN is pretty bad at properly referencing the initial images (it adds eyes to the xenomorph, etc.), so the WAN refiner needs a high enough denoising value. 0.6 works great, but because of the huge resolution (bigger than 1920p) it starts adding hands to the legs of the characters, repeating on the horizontal, so I needed to lower it. If your final output is just 1080p or smaller, you can increase denoising and it will pick up details from the actors reference better.

Here are the images used, and some not used yet. It should be clear what kind of resources are needed as “actors”: either the first frame of the actual shot or the actors to be cut out later.

wan22 ltx upscaler refiner external reference actors: The actors reference image. I put multiple angles for the second one for better consistency.

wan22 ltx upscaler refiner external reference actors 2: One of the first images for an action sequence.

wan22 ltx upscaler refiner external reference actors 7: Another one, empty for the actors to come into the scene and kiss.

wan22 ltx upscaler refiner external reference actors 4: Wide angle for another scene where I want to zoom in on the characters. Qwen Image Edit was used to get the wide angle of the arena.

wan22 ltx upscaler refiner external reference actors 6: The original arena image.

The prompt for the generated kissing scene was this (written by ChatGPT after a lot of trials with instructions):

Hard cinematic cut.

Tight close-up shot at chest and head level inside a massive ancient stone coliseum under warm golden midday sunlight. Background spectators and layered stone arches are softly visible, with fine dust suspended in the heated air.

From the left edge of the close frame, the tall dark biomechanical alien enters first. Its body is a fusion of organic and mechanical forms: a glossy black elongated domed skull with no visible eyes, the surface smooth and reflective with subtle wear and micro-scratches. Beneath the dome, a partially exposed metallic jaw reveals sharp biomechanical teeth. Thick ribbed tubing and cable-like tendons run from the base of the head down into the chest cavity. The torso is composed of layered exoskeletal plates over visible internal tubing, with deep recesses between segments. The shoulders are rounded and armored, connecting to elongated arms ending in articulated, multi-jointed clawed fingers with sharp tapered tips. Portions of the long muscular tail base are visible behind the torso, segmented and flexible. Warm sunlight glides across the curved surfaces, creating sharp specular highlights along the black exoskeleton.

From the right edge of the same close frame, the slender beige alien enters simultaneously. Her head is elongated and crowned with an ornate sculpted headpiece integrated seamlessly into her skull, detailed with delicate symmetrical engravings and embedded metallic accents. Her enormous glossy black eyes dominate her face, reflective and spherical. Elongated pointed ears extend outward from beneath the headpiece. Her pale beige skin is covered entirely in intricate ornamental patterns—fine swirling motifs and geometric filigree etched and embossed across her neck, shoulders, and upper chest. Subtle warm golden nodes or embedded details glow faintly along her skin. A long flowing pale cape descends from behind her headpiece, the fabric smooth and heavy, catching the light in soft folds. The surface of her skin appears matte yet subtly luminous under the sun.

They step toward each other into the center of the frame until their faces are inches apart. The biomechanical dome reflects her pale patterned form; her large eyes mirror the dark curvature of his head.

A brief still moment.

They move closer and embrace. The biomechanical alien’s segmented arms wrap around her back, claws carefully curving without tension. The beige alien’s long delicate fingers press against the layered exoskeletal plates and tubing along his upper torso. Her cape drapes softly around both figures.

They kiss.

As they kiss, the camera begins a slow continuous push-in, transitioning from tight close-up to extreme close-up. Simultaneously, the camera performs a smooth rotational orbit at head level. Golden sunlight glides across the glossy black dome, creating shifting reflections, while warm highlights bloom across her engraved skin patterns and metallic headpiece.

Fine dust particles drift through volumetric light beams. The coliseum architecture blurs progressively as the rotation tightens and the zoom intensifies.

Final frame: extreme close-up of their faces in embrace, the biomechanical dome and her patterned skin filling the frame, warm light flaring softly behind them as the camera completes its slow rotation.

Cinematic realism. High surface detail. Warm volumetric lighting. Smooth rotational camera movement. Continuous single shot. Identity consistency of the two established aliens only.

And this is the output from WAN 2.2 at 480p. It has extra eyes and other errors, but the information was still good enough for LTX to refine and upscale, and LTX took care of those problems.

Limitations:
-Obvious consistency limitations, visible in the videos. But maybe this can be improved, and until LTX 2.1 or 2.5 comes I have not found any other workaround.
-Short videos, 5-6 seconds, due to the WAN model. Maybe in the future another WAN will come and give better results.
-High motion still gives artefacts. While slow motion is very good, when the action gets going everything fails.

What can improve results a lot:
I am using low steps and low-resolution WAN. You can increase the steps, use a higher resolution and change the workflow; you can go up to 720p and even use non-quantized models to get the best possible initial output out of WAN. Also, I always use only 3 steps in LTX, so increasing that may help. This workflow is designed to balance speed and quality, but quality can always be improved.
Play around with the prompts A LOT so that as many of the actors' details as possible are picked up. Use separate prompts for WAN and LTX and see what works best for each at picking up details and identity.

That is about it. Leave a comment if you have any questions or suggestions, or check out the reddit post.
If you enjoy this, maybe buy me some coffee?

Aurel Manea
