Sharing experiences of image creation using AI

For those who would like to understand the origin of certain AI hallucinations, I'll share my experience with an AI that was struggling to create images. Those who would like to replicate my experiment can see the exact test case I conducted with ChatGPT Plus 5 Thinking (extended thinking).

(The prompt is the request I send to the AI)

Prompt: Generates a hyper-realistic image of a generic man sitting at a table looking at a closed book in front of him. On the table, he can read the title: "Flower". The man is seen from the front with his eyes lowered to the book.

Image created >

Prompt: The man cannot read backwards. Correct the image.

Image created >

Prompt: replaces the word in the title with the word "Vagabond"»

Image created >

Prompt: Put a suit on the man

Image created >

Prompt: Turn the book over so the man can read the book

(No image created)

Prompt: Why don't you provide the image?

Text response >

Good question 😊

The last time you wrote «"Turn the book over so the man can read the book"», I simply did not restart the image generation: my response was… empty.

Technically, when I generate an image, I have to send the request to the image tool and then respond without text. In this case, I didn't make another call to the tool, so no new image appeared with the book the right way up.

If you still want the scene with the book "Vagabond" facing the man correctly, just say so (for example: «"regenerates the same image with the book the right way up"»), and I'll take care of it.

Prompt: generates the image

Image created >

Prompt: Generates the image of the man who can read the book title

Image created >

Prompt: Why did you take your hands off the table?

Text response >

Because, technically, each time I "regenerate" the image, it's not a retouching of the previous one, but a new complete image created from the text I give to the tool.

As a result :

The tool has retained the idea well. «man in suit / book Vagabon»
But he changed some details of the pose: this time he made the man a little closer to the table, without visible hands, and with the book the wrong way for him, but readable for the viewer.

So I don't have voluntarily Removing the hands is just an automatically generated variation.

If you like, I can explicitly ask: "Man in a suit, both hands flat on either side of the book, title" Vagabond readable for him" and regenerate an image in that style.

Prompt: So you're saying that we always need to provide you with the entire context of image generation. Is that correct?

Text response >

Not obliged… but it helps a lot, yes 😊

Two important points:

Each generation is a new image, not a pixel-by-pixel retouching of the previous one.
Even if we start from the same idea, the tool can change the pose, the hands, the expression, the size of the book, etc.
The more the context is repeated and precise, the more consistent the result remains.
If you absolutely must:
- a man in a suit
- both hands placed on the table
- a beige book in front of him
- the title «"Vagabond"» readable for him (and not for the viewer)

So it's best to give me all of that in your instructions, or tell me clearly:

«Keep the exact same scene as the first image in costume, with the hands on the table, but turn the book around so the man can read “Vagabond”.»

For my part, I can reuse what we've already discussed to rewrite a complete prompt, but if you want a very controlled result, yes: restore all the key elements is the best method.

If you want, in the following message you can just write:

«"Generates the man in a suit, both hands on the table, the book Vagabond "In front of him, the title was legible to him." And I'll go back to that.

Prompt:

I just tested it in a new, blank discussion with no history. I've attached the generated image. The result isn't valid. The book is upside down for the man. How do you explain this?

Text response >

Yes, I can clearly see the problem in the image you attached: the word VAGABON is readable for us (the camera), therefore necessarily upside down for the human.

Why does this happen, even in a new, well-formulated discussion?

The model "thinks" for the camera, not for the character
The image generator was trained primarily on photos where book titles, posters, etc., are facing the camera. So, when it sees “a book with the title VAGABON”, its natural tendency is to make the text legible for the viewer, not for the person in the scene.
Your instruction “the man can read the title” is correct, but his “statistical reflex” remains: text = facing the camera.
The constraints of perspective are difficult for him
Stating that “the title must be legible to humans, therefore upside down for the viewer” requires fairly sophisticated geometric reasoning (who sees what, from what angle, in which direction). Image models are quite good at writing text and placing objects, but they are still very imprecise when it comes to combining elements.
- Text
- Precise orientation
- Viewpoint of a character within the scene.
Each generation is independent and probabilistic
Even with good instructions, the model does not guarantee that it will comply all The constraints. He produces what seems to him visually “most probable”. Now, a cover legible to the camera, with a man looking at the book, is exactly the kind of image he has seen thousands of times during training.

In practical terms, to maximize your chances of getting what you want, you almost have to "force" the model to go against its reflex:

“The title VAGABON is facing the man and therefore upside down for the viewer.”
“The word VAGABON appears upside down from the camera's point of view, because the book is oriented towards the person who can read it.”

And sometimes adding a different view also helps a lot:

“Seen from a slightly elevated angle, we can see the top of the man’s head and, in front of him, the book 'VAGABON' whose title is readable for him and upside down for us.‘

So you haven't "done anything wrong": you're just encountering a very strong bias in the image model, which almost always prefers to make the text readable for the camera rather than for the character.

Finally, here are two prompts that can help generate images:

1. Quick way to generate an image from scratch:

You are an expert image creation assistant.
I want you to generate an image.

Main topic: [Clearly describe the subject: character, object, landscape, scene…]
Frame / composition: [close-up, medium shot, aerial view, low angle shot, etc.]
Visual style: [realistic, oil painting, digital illustration, comic book, watercolor, low-poly, pixel art, etc.]
Time period / atmosphere: [futuristic, medieval, contemporary, dystopian, dreamlike…]
Color palette: [bright/pastel/dark colors, specific dominant colors: midnight blue, warm orange…]
Light : [soft, dramatic, backlit, studio light, sunset, neon…]
Important details to include: [clothing, accessories, decor, facial expression, weather, material texture…]
What to avoid: [elements to exclude, atmosphere to avoid, style not to use…]
Format: [1:1 square, 16:9 horizontal, 9:16 vertical, etc.]
Give me an optimized, compact and descriptive final prompt, ready to be used in an image generator, without any surrounding commentary, only the prompt text.

2. Quick to retouch/modify an existing image (provide the reference image)

You are an expert assistant in image retouching and modification.
I'm going to provide you with an image and I want you to suggest a precise retouching prompt.

Starting image: [briefly describe the image if needed]
Items to keep: [main character, general setting, overall palette, atmosphere…]
Desired changes:
– [change the background / add or remove a character / modify the outfit / adjust the colors / correct a defect, etc.]
– [e.g., replace the grey sky with a dramatic cloudy sky, add light rain, etc.]
Desired overall style after retouching: [more realistic, more stylized, brighter, darker, cartoonish, etc.]
Retouching level: [light (correction) / medium (simple additions) / significant (recomposed)]
What to avoid: [e.g., no text in the image, no glitch effect, no excessive blur…]
Give me a clear and detailed description, ready to be used in an image editing tool with an input image, no comments around it, only the prompt text.

Pine

0 Shares

Leave a ReplyCancel Reply