Text-to-image prompting

tyto4tme4l

Something of an artist
@MareStare
Great idea! Img2img and inpainting are invaluable tools and it would be useful to describe them in detail. Things like denoising strength and the difference between “Whole picture” vs “Only masked” for the inpaint area are extremely important here.
Scarlet Ribbon

@Thoryn
I have the same GPU. I can generate a 1024x1024 image in ComfyUI in less than 15 seconds. I don’t know what was up with Automatic1111, but I was getting similarly glacial performance on it.
Strongly recommend you just get rid of it and learn a different front end.
Zerowinger

3-3/4" Army Man Fan
So, I tried out Pony Diffusion on Civitai with some success, and part of the prompt was copy-pasting the score_x / score_up prompts that I had seen elsewhere. However, I’m a little confused as to exactly how those prompts work; the whole text-to-image format is very different from the style I’m familiar with.
Could I get some insider info on exactly how this format works in Pony Diffusion and similar checkpoints?
MareStare

Mare is very curious👀
@Zerowinger
The score_* tags are specific to Pony Diffusion. The original idea was that you’d be able to write just a single score_7_up tag and get an image based on the part of the dataset with quality 7 or higher.
However, the way this was implemented during training was broken. The developers discovered the bug only in the middle of training, at which point fixing it would have been too expensive (they’d have needed to restart training from scratch, which could have cost potentially tens or even hundreds of thousands of dollars). So they kept the bug and made a guideline to include that lengthy score_9, score_8_up, … etc. string at the start of the prompt to work around it.
There is more detail on this training fiasco in this article: https://civitai.com/articles/4248/what-is-score9-and-how-to-use-it-in-pony-diffusion
Zerowinger

3-3/4" Army Man Fan
@MareStare
So basically, including that string is necessary for higher-quality images then? What about the rest of the prompting? On Imagen, I’m used to using full sentences and phrases to describe exactly what I want the output to be; with Pony Diffusion it seems the go-to format is to list each individual aspect as a prompt, separated by commas.
Scarlet Ribbon

@Zerowinger
Different models are trained in different ways, leading to some models being better for natural language, and others better for tag-based prompting. Pony doesn’t completely fail with natural language prompting, but in my experience it performs much better with tag-based. If you add source_pony to your prompt, you can damn near just use Derpi/Tanta tags to get most of the results you’re looking for.
MareStare

Mare is very curious👀
@Zerowinger
You can use full sentences to describe the prompt with Pony Diffusion as well. Quoting the recommended prompt format from their Civitai page:
score_9, score_8_up, score_7_up, score_6_up, score_5_up, score_4_up, just describe what you want, tag1, tag2
where tag1, tag2 are simple words or word combinations similar to Derpibooru tags, like “unicorn, blushing, trio, duo”, etc.
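If you want to try that format from a script instead of a web UI, a rough sketch with the diffusers library might look something like this. The checkpoint filename, tags, and settings below are just placeholders, not an official recipe:
```python
# Rough sketch only: the recommended Pony Diffusion prompt format used through
# the diffusers library. The checkpoint path, tags and settings are placeholders.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_single_file(
    "ponyDiffusionV6XL.safetensors",  # wherever you saved the Pony Diffusion XL checkpoint
    torch_dtype=torch.float16,
).to("cuda")

prompt = (
    "score_9, score_8_up, score_7_up, score_6_up, score_5_up, score_4_up, "
    "a unicorn mare reading a book under a tree, "  # "just describe what you want"
    "unicorn, blushing, solo, forest"               # simple Derpibooru-style tags
)

image = pipe(
    prompt=prompt,
    negative_prompt="blurry, lowres",  # just an example, not an official list
    width=1024,
    height=1024,
    num_inference_steps=25,
    guidance_scale=7.0,
    generator=torch.Generator("cuda").manual_seed(1234),  # fixed seed for reproducibility
).images[0]
image.save("pony_test.png")
```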
MareStare

Mare is very curious👀
The number of steps depends on the sampler; for Euler it’s 25+ sampling steps, but sometimes it can be lower. I guess it depends on the composition, and it’s never constant. I recommend just trying different settings and checking whether increasing the steps substantially improves the image.
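If you’re scripting it, one easy way to compare is to lock the seed and sweep the step count; a rough diffusers sketch (the checkpoint path and prompt are placeholders):
```python
# Rough sketch: same prompt and seed, only the step count changes, so you can
# see where extra steps stop helping. Checkpoint path and prompt are placeholders.
import torch
from diffusers import StableDiffusionXLPipeline, EulerDiscreteScheduler

pipe = StableDiffusionXLPipeline.from_single_file(
    "ponyDiffusionV6XL.safetensors", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config)  # plain Euler

prompt = "score_9, score_8_up, score_7_up, unicorn, blushing, duo, meadow"

for steps in (15, 25, 40):
    image = pipe(
        prompt=prompt,
        num_inference_steps=steps,
        generator=torch.Generator("cuda").manual_seed(42),  # fixed seed for a fair comparison
    ).images[0]
    image.save(f"steps_{steps}.png")
```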
Thoryn

Latter Liaison
I’ve seen some guides mention using BREAK in prompts to help guide the model, e.g.
Description of scenery
BREAK
Character 1 wearing denim jeans and red sweater sitting on a bench
BREAK
Character 2 wearing black suit with bowtie walking in the background
But I’m not having much success with it; the model still gets confused as to who wears/does what.
Any of you using it successfully?
Lord Waite

I always feel like it helps to have a bit of a base understanding on how the models work on these things.
Initially, someone created a large dataset of images and descriptions. The descriptions were tokenized, and the images were cut up into squares. Training then took one square, generated random noise based on a seed, and attempted to denoise that noise back into the image on the square. Once it got something close, it discarded the square and grabbed another one. At the end, all of this was saved in a model.
Now, what happens when you are generating an image is that your prompt is reduced to tokens by a text encoder (XL-based models use CLIP-L and CLIP-G), random noise is generated from the specified seed, and then the sampler and noise schedule determine how it denoises, with as many steps as you specify.
Some schedulers introduce a bit of noise at every step, namely the ancestral ones (with an “a” at the end) and SDE, but there may be others. With those, the image is going to change more between steps and they’ll be more chaotic. Also, some will take fewer steps than others to get to a good image, and how long each step takes will vary a bit. I believe some are just better at dealing with certain things in the image, too, so it’ll take some playing around.
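If anyone wants to see that difference for themselves outside a UI, here’s a rough diffusers sketch that renders the same prompt and seed with a plain Euler sampler and with the ancestral variant (the checkpoint path and prompt are placeholders):
```python
# Rough sketch: render the same prompt and seed with plain Euler and with the
# ancestral variant to see how differently they behave. Paths are placeholders.
import torch
from diffusers import (
    StableDiffusionXLPipeline,
    EulerDiscreteScheduler,
    EulerAncestralDiscreteScheduler,
)

pipe = StableDiffusionXLPipeline.from_single_file(
    "ponyDiffusionV6XL.safetensors", torch_dtype=torch.float16
).to("cuda")

for name, sched_cls in [("euler", EulerDiscreteScheduler),
                        ("euler_a", EulerAncestralDiscreteScheduler)]:
    pipe.scheduler = sched_cls.from_config(pipe.scheduler.config)  # swap the sampler
    image = pipe(
        "score_9, score_8_up, score_7_up, unicorn, meadow",
        num_inference_steps=25,
        generator=torch.Generator("cuda").manual_seed(7),  # same seed for both samplers
    ).images[0]
    image.save(f"{name}.png")
```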
Now, the CLIP text encoder actually can’t cope with more than 77 tokens at once, and that includes a start and an end token, so effectively 75. So if your prompt is more than 75 tokens, it gets broken up into chunks of 75.
The idea behind “BREAK” is that you’re telling it to end the current chunk right there and pad it out with null tokens. The point is just to make sure that particular part of the prompt all lands in the same chunk. I’ve had mixed results with it, so I try doing it that way occasionally, but a lot of the time I don’t. The model is going to get confused sometimes anyway; this is just an attempt to minimize it a bit.
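If you’re curious where your prompt actually splits, a small sketch with the CLIP-L tokenizer from the transformers library can count tokens and show the 75-token chunk boundaries. This is purely illustrative, not exactly how any particular UI implements the chunking:
```python
# Illustrative only: count CLIP tokens in a prompt and show how it would split
# into 75-token chunks (the 77-token limit minus the start and end tokens).
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")  # CLIP-L

prompt = (
    "description of scenery, character 1 wearing denim jeans and red sweater "
    "sitting on a bench, character 2 wearing black suit with bowtie walking in the background"
)
ids = tokenizer(prompt, add_special_tokens=False).input_ids
print(f"{len(ids)} tokens (75 fit per chunk)")

# Naive 75-token chunking, roughly what a UI does before padding each chunk out.
chunks = [ids[i:i + 75] for i in range(0, len(ids), 75)]
for n, chunk in enumerate(chunks, start=1):
    print(f"chunk {n}: {tokenizer.decode(chunk)}")
```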
(Text encoding is one of the differences between model architectures, too. SD 1.* & 2.* had one CLIP, XL has two, and when you start getting into things like Flux and SD3, you’re dealing with two CLIPs plus a T5 encoder, and the T5 encoder accepts more like 154 tokens. I also didn’t get into the VAE, which is what actually turns the results into an image…)