I always feel like it helps to have a bit of a base understanding of how the models work on these things.
Initially, someone created a large dataset of images and descriptions. The descriptions were tokenized, and the images cut up into squares. Training then took one square at a time, generated random noise based on a seed, and attempted to denoise that noise back into the image on the square, guided by the description. Once it got something close, it discarded the square and grabbed another one. At the end, all of that learning was saved in a model.
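If it helps to see the shape of that idea, here’s a toy sketch in PyTorch (my own stand-in code, not the real Stable Diffusion training loop; the tiny conv net stands in for the actual U-Net, and it skips the text conditioning and the proper noise schedule entirely): mix noise into an image, have the model guess the noise, repeat.

```python
import torch
import torch.nn as nn

# Stand-in for the real denoising network (which would also see the text tokens).
model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.SiLU(),
                      nn.Conv2d(64, 3, 3, padding=1))
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

for step in range(100):
    image = torch.rand(1, 3, 64, 64)      # stand-in for one training square
    noise = torch.randn_like(image)       # the random noise
    t = torch.rand(1, 1, 1, 1)            # how strongly to mix the noise in (0..1)
    noisy = (1 - t) * image + t * noise   # the noised-up square
    pred = model(noisy)                   # model tries to guess the noise it saw
    loss = nn.functional.mse_loss(pred, noise)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Do that over a huge dataset and “guess the noise so it can be subtracted back out” is the skill that ends up saved in the model.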
Now, what happens when you are generating an image is that your prompt is turned into tokens and fed through the text encoder (XL-based models use two: CLIP-L and CLIP-G), random noise is generated from the specified seed, and then the sampler and noise schedule control how that noise gets denoised, over as many steps as you specify.
Some samplers introduce a bit of extra noise at every step, namely the ancestral ones (with an “a” at the end) and SDE, but there may be others. With those, the image is going to change more between steps and they’ll be more chaotic. Also, some will take fewer steps than others to get to a good image, and how long each step takes will vary a bit. I believe some are just better at dealing with certain things in the image, too, so it’ll take some playing around.
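To make those knobs concrete (seed, steps, which sampler), here’s a rough sketch using the diffusers library (that part is my assumption about tooling, and the model name, prompt, and settings are just placeholders): same prompt and seed, fixed step count, two different samplers.

```python
import torch
from diffusers import (StableDiffusionXLPipeline,
                       DPMSolverMultistepScheduler,
                       EulerAncestralDiscreteScheduler)

# An XL model: its two text encoders are the CLIP-L and CLIP-G mentioned above.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

prompt = "a red fox in the snow, detailed photo"

# Non-ancestral sampler: the same seed and settings should give back (nearly) the same image.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
img_a = pipe(prompt, num_inference_steps=25,
             generator=torch.Generator("cuda").manual_seed(42)).images[0]

# Ancestral sampler: extra noise gets injected at every step, so the result drifts
# more as you change the step count or other settings.
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
img_b = pipe(prompt, num_inference_steps=25,
             generator=torch.Generator("cuda").manual_seed(42)).images[0]
```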
Now, the CLIP text encoder actually can’t cope with more than 77 tokens at once, and that includes a start token and an end token, so effectively 75 for your prompt. So if your prompt is more than 75 tokens, it gets broken up into chunks of 75.
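You can actually see that 77-token ceiling if you poke at the CLIP tokenizer with the transformers library (my choice of tooling here, and the prompt is made up):

```python
from transformers import CLIPTokenizer

tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")  # the CLIP-L tokenizer

print(tok.model_max_length)            # 77

ids = tok("a cozy cabin in the woods at night")["input_ids"]
print(len(ids))                        # your prompt's tokens plus the start and end tokens
print(tok.convert_ids_to_tokens(ids))  # first is <|startoftext|>, last is <|endoftext|>
```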
The idea behind “BREAK” is that you’re telling it to end the current chunk right there and just pad the rest of it out with null tokens. The point is to make sure that particular part of the prompt all lands in the same chunk. I’ve had mixed results with it, so I try doing it that way occasionally, but also don’t a lot of the time. The model is going to get confused by long prompts anyway; this is just an attempt to minimize that a bit.
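For what it’s worth, here’s a rough sketch of the chunking idea (this is NOT the actual A1111 code, just my illustration of it, and the chunk_prompt helper is made up): split on BREAK, split anything still over 75 tokens, and pad every chunk out to exactly 75.

```python
from transformers import CLIPTokenizer

tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
CHUNK = 75  # 77 minus the start and end tokens

def chunk_prompt(prompt: str):
    """BREAK ends the current chunk early; every chunk gets padded out to 75 tokens."""
    chunks = []
    for part in prompt.split("BREAK"):
        ids = tok(part.strip(), add_special_tokens=False)["input_ids"]
        for i in range(0, max(len(ids), 1), CHUNK):  # anything over 75 tokens spills into a new chunk
            piece = ids[i:i + CHUNK]
            piece = piece + [tok.pad_token_id] * (CHUNK - len(piece))  # pad out the rest of the chunk
            chunks.append(piece)
    return chunks

parts = chunk_prompt("a knight in silver armor BREAK a ruined castle, fog, moonlight")
print(len(parts), [len(c) for c in parts])   # 2 chunks, 75 tokens each
```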
(Text encoding is one of the differences between model architectures, too. 1.x and 2.x had one CLIP, XL has two, and then when you start getting into things like Flux and SD3, you start dealing with two CLIPs plus a T5 encoder, and the T5 encoder accepts more like 154 tokens. I also didn’t get into the VAE, which is what actually turns the denoised result into the final image…)
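And since the VAE only got a one-line mention: here’s a small sketch with diffusers showing that it’s just the piece that maps between pixels and the smaller latent space the denoising actually happens in (the random tensor is standing in for a real image):

```python
import torch
from diffusers import AutoencoderKL

# Just the VAE from an XL model, on its own.
vae = AutoencoderKL.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="vae"
)

with torch.no_grad():
    pixels = torch.rand(1, 3, 512, 512) * 2 - 1        # fake image in [-1, 1]
    latents = vae.encode(pixels).latent_dist.sample()  # down to 1 x 4 x 64 x 64
    decoded = vae.decode(latents).sample               # back up to 1 x 3 x 512 x 512
print(latents.shape, decoded.shape)
```

The sampler loop works on those small latents the whole time; the VAE decode at the end is what you actually get to look at.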