>>> Separate the prompt into parts
Prompts are usually written as words separated by commas, but it is also common to enter phrases of two or more words, such as Prompt best quality, long hair. The length of each phrase can therefore vary greatly depending on the model and the user's preference.
For example, when you want an image of a girl with green hair and blue eyes, wearing a pink shirt and a yellow skirt with a red ribbon, a brown belt, and white shoes, you can enter it as one long sentence: Prompt 1girl, having green hair and blue eyes wearing pink shirt and yellow skirt with red ribbon and brown belt putting white shoes. Or you can separate it with commas: Prompt 1girl, green hair, blue eyes, pink shirt, yellow skirt, red ribbon, brown belt, white shoes.
If you generate images with both prompts and compare them, you can see that the instructions are generally followed even when the prompt is written as one long sentence.
Long Prompt: masterpiece, best quality, ultra detailed, fantasy, vivid color, 1girl, having green hair and blue eyes wearing pink shirt and yellow skirt with red ribbon and brown belt putting white shoes
Short Prompt: 1girl, green hair, blue eyes, pink shirt, yellow skirt, red ribbon, brown belt, white shoes
>>> Build prompts considering the number of tokens
As mentioned in Chapter 3-1 when explaining CLIP, the prompt is divided into tokens. The current token count is displayed in the upper right corner of the boxes where you enter the prompt and negative prompt. At first glance it looks like the number of comma-separated phrases, but it is actually the number of tokens as CLIP tokenizes the text, and the commas themselves are also counted as tokens.
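If you want to check how CLIP splits a prompt outside the WebUI, you can count the tokens directly. A minimal sketch, assuming the transformers library and the openai/clip-vit-large-patch14 checkpoint (the text encoder family used by Stable Diffusion v1.x):

```python
# Count prompt tokens with the CLIP tokenizer; commas are tokens too.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

prompt = "masterpiece, best quality, 1girl, green hair"
tokens = tokenizer.tokenize(prompt)

print(tokens)       # each comma appears as its own token, ',</w>'
print(len(tokens))  # roughly the count shown in the WebUI's '0/75' counter
```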
Now, let's see why the token count matters. The important part is the denominator of the counter, shown as '0/75'. The number 75 is not a maximum token count, but the boundary of the prompt that CLIP processes at one time. This is explained in detail on AUTOMATIC1111's GitHub.
GitHub - AUTOMATIC1111/stable-diffusion-webui
Unlimited Token Works #2138
https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/2138#issuecomment-1272825515
When the number of tokens exceeds 75, the counter changes to [76/150] and the denominator becomes 150. The denominator keeps growing in increments of 75: 150, 225, 300, and so on. Tokens up to the 75th are processed in one chunk, so CLIP treats them as related to one another. From the 76th token onward, a new chunk begins: these tokens are no longer related to the tokens in the first chunk, and instead relate only to the tokens up to the 150th.
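As a rough illustration of this chunking (a simplified sketch, not the WebUI's actual implementation):

```python
# Group token ids into chunks of at most 75. CLIP encodes each chunk
# separately, so tokens only "relate" to others within their own chunk.
def split_into_chunks(token_ids, chunk_size=75):
    return [token_ids[i:i + chunk_size]
            for i in range(0, len(token_ids), chunk_size)]

chunks = split_into_chunks(list(range(100)))   # a 100-token prompt
print([len(c) for c in chunks])                # [75, 25] -> counter reads 100/150
```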
In addition, since each token affects itself and the tokens after it, the closer a token is to the beginning, the stronger its influence on the generation result. The first token therefore has the strongest influence, while a token near the 75th position, at the end of the chunk, has only a weak influence on the rest of the process. The 76th token, however, has a very strong influence again, because it is the leading token of the second chunk.
As a result, when a prompt grows long, important tokens may end up near the end of a chunk where they have little influence, or may unintentionally land on the 76th position and exert more influence than intended. The solution to this problem is a construct called Prompt BREAK.
>>> Let's use the BREAK syntax to pad out the token count
The BREAK syntax is a function that controls the influence of the prompt by forcibly creating a token-processing boundary. If you enter Prompt BREAK in the prompt, the chunk it ends is padded out to the 75th (or 150th) boundary, and the next token is recognized as the 76th token, the start of the second chunk. When a prompt is not being reflected well, or when you want to emphasize a prompt near the end, you can use this to make the generated result follow your instructions as intended.
▲ Only BREAK written in all capital letters is recognized.
▲ Be careful: a comma placed after BREAK is itself counted as one token.
For example, suppose you are trying to generate an indigo dress, but a white dress appears because of the influence of Prompt white long hair immediately before it. Inserting Prompt BREAK in front of Prompt indigo dress cuts off the influence of Prompt white by pushing indigo dress into the second CLIP chunk, so the dress color comes out as originally intended.
Prompt absurdres, masterpiece, best quality, ultra detailed, cinematic lighting, pastel color, classic, 1girl, upper body, white long hair, indigo dress, gorgeous classical long dress, castle, night, multiple candle
Prompt absurdres, masterpiece, best quality, ultra detailed, cinematic lighting, pastel color, classic, 1girl, upper body, white long hair, BREAK indigo dress, gorgeous classical long dress, castle, night, multiple candle
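Conceptually, BREAK just forces the chunk boundary described earlier. A hypothetical sketch of the behavior (not the WebUI's actual code):

```python
# Each BREAK ends the current chunk; when encoded, that chunk is padded
# to the 75-token boundary, so the next word starts a fresh chunk.
def chunk_with_break(words):
    chunks, current = [], []
    for w in words:
        if w == "BREAK":
            chunks.append(current)
            current = []
        else:
            current.append(w)
    chunks.append(current)
    return chunks

prompt = "white long hair BREAK indigo dress".split()
print(chunk_with_break(prompt))  # [['white', 'long', 'hair'], ['indigo', 'dress']]
```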
>>> Learn about CFG and sampling
Here, we will look in detail at how prompts are involved in image generation; this knowledge is necessary to handle prompts well. As explained in Chapter 3-1, when CLIP interprets the prompt, the resulting conditioning is fed into U-Net through its attention blocks (AttnBlock), gradually fitting the denoising of the latent space to the given conditions so that accurate noise removal can be performed.
On its own, however, this process would only generate an image from whatever features happen to be in the initial random noise. Guiding it with a classifier is impractical: the number of possible keywords is far too large to prepare classifiers for in advance. It is also a very hard problem to generate an image from a single 'one-shot' text, without the varied context of a conversation.
To solve this, Stable Diffusion compares the 'unconditional' inference result, denoised with no prompt, against the 'conditional' inference result, denoised with the prompt, using the sampler, and derives the part affected by the prompt. Because this way of estimating values from the conditional and unconditional results requires no separately trained classifier, it is called CFG (Classifier-Free Guidance).
By comparing the 'conditional' inference given the prompt with the 'unconditional' inference without it, an image more consistent with the conditions can be generated. In short, the conditions produced by CLIP are evaluated repeatedly, a specified number of times, by U-Net under various sampling algorithms and schedulers. Put simply, 'the prompt's instructions are delivered repeatedly, a specified number of times, to the network that handles U-Net's spatial processing, attention, and diffusion time steps.'
This is implemented as the CFG scale, a unitless indicator of 'how closely the prompt is followed'. The standard value is 7; reducing it to 6, 5, 4 weakens the prompt's influence, while increasing it to 8, 9, 10 makes the difference between the conditional and unconditional results clearly stronger. You can test this yourself using the X/Y/Z plot, which I recommend trying. Expressed as a conceptual formula, it looks like this:
Noise prediction = unconditional + CFG scale × (conditional − unconditional)
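In code, the same formula looks like the sketch below, assuming the diffusers library's U-Net interface; the names latents, t, cond_emb, and uncond_emb are illustrative stand-ins for values prepared elsewhere, not the WebUI's internals:

```python
import torch

def cfg_noise_prediction(unet, latents, t, cond_emb, uncond_emb, cfg_scale=7.0):
    # Run the U-Net once with the prompt's CLIP embedding and once with
    # the unconditional (empty prompt) embedding.
    noise_cond = unet(latents, t, encoder_hidden_states=cond_emb).sample
    noise_uncond = unet(latents, t, encoder_hidden_states=uncond_emb).sample
    # Noise prediction = unconditional + CFG scale * (conditional - unconditional)
    return noise_uncond + cfg_scale * (noise_cond - noise_uncond)
```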
The negative prompt is also an application of CFG. Instead of the empty 'unconditional' input, the negative prompt is used as the comparison target in the noise-removal calculation. The sampler then applies an inference result that moves toward the condition specified by the prompt and away from the condition specified by the negative prompt. As a result, specific elements can be excluded from the image.
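In terms of the sketch above, this simply means encoding the negative prompt and passing it where the empty-prompt embedding went (encode_prompt here is an assumed helper, not a real API):

```python
# The negative prompt's embedding replaces the empty unconditional one.
neg_emb = encode_prompt("lowres, bad anatomy")  # hypothetical helper
noise = cfg_noise_prediction(unet, latents, t, cond_emb, neg_emb, cfg_scale=7.0)
```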
With this approach, Stable Diffusion has established itself as a flexible and powerful image generation tool, true to its name as a 'stable diffusion model'. By combining prompts, CFG, and negative prompts, users can give the model detailed, precise instructions and reliably obtain the results they expect. This also gives Stable Diffusion the ability to generate meaningful, purposeful images from the initial noise without supervision (zero-shot).
Negative prompt · AUTOMATIC1111/stable-diffusion-webui Wiki · GitHub
https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Negative-prompt
Finally, let's look at the sampler, which carries out the noise prediction task. We have already touched on the sampling process that performs noise removal based on the inference results for the prompt's conditions. In AUTOMATIC1111 you can configure this with the sampler and steps parameters.
Have you ever wondered what a cryptic sampler name and step count such as 'DPM++ 2M Karras' actually mean? This is the name of the algorithm used for sampling, and it usually contains the name of the mathematician or researcher who proposed it.
As explained so far, Stable Diffusion is a large-scale artificial neural network that specializes in removing noise components from noisy images, and it performs this noise removal while weighing the prompt against the negative prompt via CLIP.
In Stable Diffusion, the names of the samplers are derived from the types and abbreviations of the algorithms that solve the differential equations used for noise removal.
Now let's touch on the loss function from machine learning. A loss function calculates the difference between the correct answer and the model's predicted value. In Stable Diffusion you can think of it as the distance between the current noisy state and the 'finished' state, with the prompt applying extra 'pressure' on that distance. The sampler decides 'how to diffuse the initial noise' and generates the resulting image so as to minimize this loss.
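Expressed in the same conceptual style as the CFG formula above (a simplified sketch, not the exact training objective):

Loss = || actual noise − noise predicted by U-Net(noisy latent, step, prompt) ||²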
In addition, due to how the software is built, this complex differential-equation machinery is encoded in more than one billion floating-point parameters within the model.
The sampler runs for a specified number of steps; at each step it samples the latent space, computes a local gradient, and decides how to proceed to the next step. After the specified number of steps is complete, the final inference result is sent to the VAE and output as an image. The sampler tries to drive the loss function as low as possible, like a ball rolling down a hill. However, the path that looks fastest locally often does not lead to an optimal solution (the value that maximizes or minimizes the objective function among the solutions satisfying the constraints).
Sometimes the search gets stuck in a local optimum and has to climb back up the function's surface to find a better path. A latent diffusion model should proceed slowly and evenly so the inference does not collapse; another approach is a schedule that moves in large steps at the beginning, to escape the valleys of local optima, and then converges as carefully as possible at the end.
Depending on the type of sampler, the noise removal algorithm and schedule are different, and there are also differences in the generation results.
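The shared step structure can be sketched with the diffusers library, where the scheduler object plays the role of the WebUI's sampler; swapping in a different scheduler class changes the algorithm and schedule. This reuses the hypothetical cfg_noise_prediction from earlier, and unet, cond_emb, and uncond_emb are assumed to be prepared already (checkpoint name is illustrative):

```python
import torch
from diffusers import EulerDiscreteScheduler

# Build a scheduler from a v1.5-style checkpoint.
scheduler = EulerDiscreteScheduler.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="scheduler")
scheduler.set_timesteps(20)  # the sampling steps parameter

latents = torch.randn(1, 4, 64, 64) * scheduler.init_noise_sigma

for t in scheduler.timesteps:
    latent_in = scheduler.scale_model_input(latents, t)
    noise_pred = cfg_noise_prediction(unet, latent_in, t, cond_emb, uncond_emb)
    # Each step removes part of the predicted noise according to the schedule.
    latents = scheduler.step(noise_pred, t, latents).prev_sample

# After the last step, `latents` is handed to the VAE to decode the image.
```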
Although some models specify a recommended sampler, what the user basically wants is 'a sampler that generates good-quality images quickly'. Whether you are making an illustration or a vivid, photo-like image, review the candidates from several angles using the X/Y/Z plot, such as 'how many steps does it take to converge' and 'is the level of completion acceptable'. Fix the CFG scale at 7 and compare each sampler with the steps set to '2, 4, 8, 10, 20, 40'. With most samplers, including Euler, you can observe the image changing over 20 to 40 steps.
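A script-based analogue of that X/Y/Z test, assuming diffusers and a v1.5-style checkpoint (names illustrative); each scheduler corresponds to one of the WebUI's samplers:

```python
import torch
from diffusers import (StableDiffusionPipeline, EulerDiscreteScheduler,
                       HeunDiscreteScheduler, DPMSolverMultistepScheduler)

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

samplers = {
    "euler": (EulerDiscreteScheduler, {}),
    "heun": (HeunDiscreteScheduler, {}),
    "dpmpp_2m_karras": (DPMSolverMultistepScheduler,
                        {"use_karras_sigmas": True}),
}

for name, (cls, kwargs) in samplers.items():
    pipe.scheduler = cls.from_config(pipe.scheduler.config, **kwargs)
    for steps in [2, 4, 8, 10, 20, 40]:
        image = pipe("1girl, green hair",
                     num_inference_steps=steps,
                     guidance_scale=7.0,  # CFG scale fixed at 7
                     generator=torch.Generator("cuda").manual_seed(42)
                     ).images[0]
        image.save(f"{name}_{steps:02d}.png")
```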
There are speed-oriented samplers such as Heun and DPM++ 2M Karras that converge before step 20 and no longer change afterward, and conversely there are samplers that keep searching even at long inference times, such as 40 steps. There are also samplers that decide the number of steps themselves, such as DPM adaptive. Increasing the CFG scale makes it easier to check how faithfully each sampler follows the prompt.
For example, in the early stages of character design and posing, you usually don't want the same composition every time but a variety of images, and Euler is stable and fast for this. Image quality improves with more steps, but the image tends to become blurry as the step count grows, so it may not suit final finishing. Try the ancestral samplers, the ones with an 'a' in the name, such as Euler a, DPM2 a, DPM++ 2S a, and DPM++ 2S a Karras. They have the drawback of high randomness and low reproducibility, but that can also stimulate creativity. If the character and clothing design is already settled, or you need to work on detailed parts like facial expressions, DDIM or DPM++ 2M Karras is a good choice.
It is also convenient to hide the samplers you rarely use from the selection list. To do so, open Settings > Sampler parameters and select the unneeded samplers under 'Hide samplers in user interface'. Hidden samplers are not deleted, so you can bring them back later.
>>> Let's change the sampling method
There is a parameter called Sampling method at the top left of the Generation tab. Here you select the sampler, the algorithm that performs the denoising process. The results can differ greatly depending on the sampler, so choose one to suit the model and the image you want to create.
Currently, the sampler recommended for many illustration-style models, and one that generates high-quality images, is DPM++ 2M Karras. If you are unsure, I recommend selecting DPM++ 2M Karras and generating with that. Different samplers are recommended for different models, so read each model's description carefully, and compare candidates with the X/Y/Z plot.
DPM++ 2M | DPM++ SDE | DPM++ 2M SDE | DPM++ 2M SDE Heun | DPM++ 2S a
DPM++ 3M SDE | Euler a | Euler | LMS | Heun
DPM2 | DPM2 a | DPM fast | DPM adaptive | Restart
DDIM | DDIM CFG++ | PLMS | UniPC | LCM
>>> Let's adjust the sampling steps
Alongside the Sampling method there is a parameter called Sampling steps. It sets how many times the denoising process described above is performed. The more steps, the more denoising passes, and the more the image changes.
If the number of steps is extremely small, the noise cannot be removed properly, which makes a blurred image more likely. Conversely, no sampling method shows dramatic changes once the step count passes a certain level, and more steps mean longer inference times, so set it to suit the sampling method you use.
For example, with the sampling method DPM++ 2M Karras, the image is severely blurred at 1 to 5 steps, while a proper image is produced from around 10 up to 80 steps. Beyond 20 steps there is no dramatic change, though you can still see differences in composition and fine detail. The appropriate number of steps varies with the model, so refer to the figures in the model's overview and check the generated images to find the step count that works for you.
Steps: 1 | Steps: 5 | Steps: 10 | Steps: 20 | Steps: 40 | Steps: 80
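To reproduce this comparison outside the WebUI, you can reuse pipe from the sampler-comparison sketch earlier, with the DPM++ 2M Karras-style scheduler selected, varying only the step count:

```python
for steps in [1, 5, 10, 20, 40, 80]:
    image = pipe("1girl, green hair",
                 num_inference_steps=steps,
                 guidance_scale=7.0,
                 generator=torch.Generator("cuda").manual_seed(42)).images[0]
    image.save(f"steps_{steps:02d}.png")
```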