LTX-2.3 is now out with better temporal and spatial coherence, so I gave it a spin with a hard, sure-to-fail scenario: a very long continuous action scene with actor and environment referencing. I know the characters' faces change during the video, but that is my fault, as I updated the characters mid-creation.

The results were indeed much better than LTX 2.0 regarding movement and coherence. In my experiments, rendering directly in 1080p ruins motion and introduces a lot of artifacts, so I stuck to half resolution, upscaled with the LTX spatial upscaler, and used 0.20 denoising for the second pass.
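The two-pass setup described above can be summarized as a small config sketch. The parameter names here are illustrative, not actual ComfyUI node fields, and the exact half-res dimensions are an assumption based on a 1080p target:

```python
# Hypothetical two-pass settings mirroring the text.
# Names and resolutions are illustrative assumptions.
FIRST_PASS = {
    "width": 960,    # half of 1920 to keep motion clean
    "height": 540,   # half of 1080
    "denoise": 1.0,  # full generation on the first pass
}
SECOND_PASS = {
    "upscaler": "LTX spatial upscaler",  # as named in the text
    "width": 1920,
    "height": 1080,
    "denoise": 0.20,  # light pass: refine detail, don't re-imagine motion
}
```

The low 0.20 denoise on the second pass is what keeps the upscale from re-introducing the motion artifacts the half-res render avoided.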
Following the concepts from my previous articles, I segmented this 40-second video into smaller 10-second passes, each with its own prompt. Instead of feeding just an initial image, I fed 2-3 images plus the previous segment twice: once at 4 fps to give the model visual context of what happened before, and once with just the last second at normal speed to give it motion context. In the first 2-3 images I provide the environment with full-body shots of the characters, plus a close-up of their faces.
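The two context clips built from the previous segment can be sketched like this. The frame rate and segment length are assumptions (24 fps, 10-second segments); the point is the 4 fps subsample for visual context and the full-rate final second for motion context:

```python
# Hypothetical sketch of building the two context clips.
# FPS is an assumption; adjust to your render settings.
FPS = 24

def build_context_clips(prev_segment_frames, fps=FPS):
    """prev_segment_frames: list of frames from the previous pass.

    Returns (visual_context, motion_context):
    - visual_context: the whole previous segment subsampled to ~4 fps,
      a compressed summary of what happened before.
    - motion_context: the last second at full frame rate, so motion
      carries over smoothly into the next segment.
    """
    step = max(1, fps // 4)              # ~4 fps subsampling stride
    visual_context = prev_segment_frames[::step]
    motion_context = prev_segment_frames[-fps:]
    return visual_context, motion_context
```

For a 10-second segment at 24 fps this yields a 40-frame visual summary and a 24-frame motion tail, which is far cheaper to condition on than the full 240 frames.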
This applies from the second segment onward; the very first segment is just a normal image2video pass.

Workflow HERE .

Usage:
On the left you have 2 chained batch nodes, each holding 2 images batched 8 times (you can experiment with more). You can also link more images, like the third one that is currently decoupled.
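The repeat-and-chain step can be sketched as a small array operation. This is a plain NumPy stand-in for the batch nodes, not ComfyUI code; the repeat count of 8 comes from the text and is tunable:

```python
import numpy as np

def batch_reference_images(images, repeats=8):
    """images: list of (H, W, C) arrays.

    Each reference image is repeated `repeats` times along a new
    batch axis, then all repeats are chained into one batch --
    a stand-in for the two chained batch nodes in the workflow.
    """
    return np.concatenate(
        [np.repeat(img[None], repeats, axis=0) for img in images],
        axis=0,
    )
```

Repeating each image gives it more weight in the conditioning than a single copy would; experimenting with the repeat count trades reference fidelity against flexibility.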
On the bottom left, drag the previous segment into both nodes there: one for visual context and one for temporal context, so motion stays fluid between shots. Since you now have one second in common between shots, you can blend them seamlessly in whatever editing software you use.
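Blending the shared second is a simple linear crossfade over the overlapping frames, which any editor can do; as a sketch (frame count and array shapes are assumptions):

```python
import numpy as np

def crossfade(tail_frames, head_frames):
    """Linearly blend the second shared by two adjacent segments.

    tail_frames: last second of segment N,   shape (n, H, W, C)
    head_frames: first second of segment N+1, same shape
    Returns n blended frames fading from tail into head.
    """
    n = len(tail_frames)
    # Per-frame blend weight ramping 0 -> 1 across the overlap
    alphas = np.linspace(0.0, 1.0, n)[:, None, None, None]
    return (1.0 - alphas) * tail_frames + alphas * head_frames
```

Because the motion-context clip makes the two segments nearly identical over that second, even this naive linear blend hides the cut well.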
You need extremely complex prompts, which I generate with ChatGPT (Gemini, Grok, and Claude failed for me).
You can also use this same workflow to add actors that enter the screen later. Used wisely, it can be quite powerful in many situations.