For me, this was the biggest problem with LTX-2: the inability to add characters from outside the camera frame without training a LoRA. So I finally managed to get something working (workflow).
The idea is this. The workflow consists of 2 Flux Klein groups: one to generate an image with 3 actors (you can extend it and add more) and one to generate the first frame of the video. What the Flux workflow does is take the characters image and batch-multiply it by 8 or so (you may need to change this), so you end up with the first 8 frames of the video containing the actors that you can reference later. After that, the normal first frame is also multiplied by 9, so you end up with an extra 9 frames appended to the original 8 (lower values don't work properly; you need to keep the 8+1 ratio).
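The batching trick above can be sketched in a few lines. This is just an illustration using NumPy arrays in place of the ComfyUI image batch/repeat nodes; the function name and the (H, W, C) image layout are my own assumptions, not part of the actual workflow.

```python
# Minimal sketch of the frame-stacking trick, assuming images are
# NumPy arrays of shape (height, width, channels). In the real
# workflow this is done with image batch nodes, not code.
import numpy as np

def build_reference_frames(actors_img, first_frame,
                           actor_copies=8, frame_copies=9):
    """Stack copies of the actors sheet, then copies of the real
    first frame: 8 reference frames plus 9 start frames (the 8+1
    ratio mentioned above; lower values did not work properly)."""
    actors = np.repeat(actors_img[None, ...], actor_copies, axis=0)
    frames = np.repeat(first_frame[None, ...], frame_copies, axis=0)
    return np.concatenate([actors, frames], axis=0)

# e.g. two 512x512 RGB images -> a (17, 512, 512, 3) batch that the
# video model sees as the opening frames of the clip
```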
In this video I used the first Flux group to put 3 characters from 3 photos together, standing on a dark background.

And yes, referencing the last character failed on the shirt, but that was just my first try.
And I used the second Flux group to generate the initial shot, shown here:

Then I made the prompt using ChatGPT, giving it the actors image and the first-frame image and asking it to make the actors come into the shot. The prompt it generated was this:
“A dark minimalist studio environment with a deep charcoal seamless backdrop and faint atmospheric haze. The matte floor subtly reflects light beneath the figures. High-contrast cinematic lighting with a soft overhead key and subtle rim lights from both sides. The frame begins in near darkness. A narrow overhead spotlight fades up, revealing three figures standing evenly spaced in a straight line. On the left stands a confident athletic woman wearing a black-and-white striped bikini. Her posture is powerful yet relaxed, shoulders back, gaze fixed forward. In the center stands a second woman wearing the same striped bikini, but with a realistic human skull covering her face. The skull catches sharp highlights across its brow ridge and cheekbones, deep shadows filling the eye sockets. On the right stands a young man in black loungewear with a subtle red floral pattern, barefoot, expression serious and focused. The camera begins in a wide frontal shot at chest height. Slow, deliberate dolly forward begins. The figures remain still for a brief beat, then all three begin walking forward in synchronized, controlled steps. The lighting intensifies gradually as they approach. Floor reflections sharpen. The skull’s hollow eyes catch thin slivers of light. The striped pattern subtly shifts perspective as depth closes. The camera continues its steady push-in. No cuts. No shake. Smooth mechanical dolly motion. As they approach, the composition tightens. Their bodies fill more of the frame. The background falls further into darkness. They stop very close to the lens, settling into a powerful medium shot — framed from mid-torso up. The skull-faced figure remains centered, the striped-bikini woman just to the left of frame, and the man to the right. Their faces and upper bodies dominate the composition. The skull is now large in frame, dramatic and imposing. The woman’s expression is calm and confident. The man’s gaze is intense and unwavering. The camera eases into a subtle micro push-in at the end, heightening tension. Single continuous shot. Smooth dolly forward. Cinematic contrast. High-detail skin and fabric texture. Subtle haze. Controlled dramatic lighting”
I made this workflow to be as close to Seedance 2.0 as possible, but for control and memory management I would generate the first frame and the actors/elements images separately. You can theoretically add more image batches with more actors, and even increase the number per batch to 5-6. So in theory you could add 10-20 actors and elements to the scene (I have not tested this), as long as they are different enough and you reference them properly so the model can pick them up from the latent space it encoded them into.
I think this can be extended to voices too, but I prefer to use input audio; besides, that would take many seconds for the initial segment and you would end up not having enough memory for the video. With only 2 images, not even a second is lost.
At the end, the extra frames are subtracted from the final video. Enjoy, I hope it is useful.
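The subtraction at the end amounts to dropping the leading conditioning frames from the decoded video. Again a hedged sketch, not the actual workflow node: the exact count to drop depends on how many copies you stacked in front (17 in the 8+9 setup above) and on whether you keep one first-frame copy as the real opening frame.

```python
import numpy as np

def trim_reference_frames(video, n_extra=17):
    """Drop the leading conditioning frames (8 actor copies + 9
    first-frame copies in the setup described above). Adjust n_extra
    to your own batch counts."""
    # video is assumed to be a (frames, H, W, C) array
    return video[n_extra:]

# a 40-frame decoded clip becomes a 23-frame final video
```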
You can make a small donation if you like.

