this post was submitted on 15 May 2026
710 points (99.3% liked)
Programmer Humor
you are viewing a single comment's thread
Primitive image generators don't "understand" the semantics of the patterns they assemble. The obvious example we're probably all familiar with is "AI hands": weird poses, too many fingers, sometimes even extra arms. The AI knows the pattern we recognise as a finger, and it knows the correlation with the patterns we see as hands and arms, but it doesn't know what a hand is or why it's important to have a specific number of those patterns combined in specific ways.
Text is the same: the model knows the graphic patterns of letters, and maybe the patterns of words those letter-patterns often occur in, but the same randomisation that produces results varied enough to look human can also randomly generate patterns that it fundamentally cannot know aren't actually valid words.
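To make that concrete, here's a toy sketch (everything in it is invented for illustration): a "model" that only knows rough letter-pair frequencies, with no concept of which strings are real words. It happily emits wordish-looking strings, which is exactly the failure mode described above.

```python
import random

# Invented bigram table for the sketch: for each letter, a string of
# plausible next letters (weighted by repetition). No dictionary anywhere.
BIGRAMS = {
    "t": "hheeoori", "h": "eeeaaio", "e": "rrnnssdl", "a": "nntrls",
    "n": "ddgteo", "r": "eeiaos", "i": "nntsoc", "o": "nnrfum",
    "s": "tteiah", "d": "eeiao", "l": "eliao", "g": "ehrao",
}

def babble(length=6, seed=None):
    """Generate a plausible-looking letter sequence with no notion of validity."""
    rng = random.Random(seed)
    out = [rng.choice(list(BIGRAMS))]
    for _ in range(length - 1):
        choices = BIGRAMS.get(out[-1], "etaoin")
        out.append(rng.choice(choices))
    return "".join(out)

words = [babble(seed=i) for i in range(5)]
print(words)  # six-letter strings that look wordish, but mostly aren't words
```

The model can't flag its own misses here, because "is this a real word" is information the bigram table simply doesn't contain.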
There are solutions, such as using a more specialised tool: a combined layout generator and text generator with feedback, so the text is made to fit the layout. Using the right tool for the task at hand, paired with supervision by a human who knows its shortcomings and can check whether and where it trips up, might do a better job.
But that human has to have the required know-how, and if you really want a single LLM to feed all prompts into, that model should be capable of detecting such tasks and delegating the work to those specialised tools (checking with the prompter to confirm that its detection is accurate).
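A minimal sketch of that detect-and-delegate step, with invented tool names and keywords (a real router would use the model itself, not keyword matching):

```python
# Hypothetical registry: which specialised tool handles which kind of prompt.
TOOLS = {
    "layout+text": ["poster", "logo", "sign", "label"],
    "image": ["photo", "scene", "portrait"],
}

def route(prompt):
    """Pick a specialised tool for the prompt, or fall back to a general model."""
    words = prompt.lower().split()
    for tool, keywords in TOOLS.items():
        if any(k in words for k in keywords):
            return tool
    return "general"

def confirm_with_user(prompt, tool):
    # In a real system this would ask the prompter to confirm the detection;
    # here it just formats the question.
    return f"This looks like a job for {tool!r} -- is that right? [y/n]"

print(route("design a poster with the event dates"))   # layout+text
print(confirm_with_user("design a poster", "layout+text"))
```

The confirmation step matters: if the router misdetects, a human catches it before the wrong tool produces confidently wrong output.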
A simple all-purpose model and a general "prompt engineer" without subject-specific experience and training just won't cut it. The marketing for these tools generally seems terribly opaque about that problem, and executives generally seem oblivious to it (or just indifferent, so long as it helps them cut costs on paper for a few quarters).
(As an aside: it's the same difficulty text generators occasionally have with facts and citations: they can't tell when it's important to have very specific combinations of words that map to very specific real-world things. The model might have picked up the correlation of the word pattern "Words (Number), Words (something edition). Words." with human names, years, book titles and publishers respectively, but it can't know why only specific combinations of author, year, existing book title and publisher are permissible.
It may get them right often enough by picking a likely combination from texts related to the prompt, but unless you double-check (or provide in advance) that citations refer to existing works and fit the cited content, you run the risk that it randomly generates bullshit. A student "writing" a paper might not be able to catch it, but a professor who knows the major authors and works of their field is probably gonna spot it.)
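The gap between "looks like a citation" and "is a real citation" is easy to demonstrate. In this sketch (regex and both example strings are invented for illustration, and one cites a book that has never existed), a format check passes the fabricated citation just as happily as the plausible one:

```python
import re

# Loose APA-ish citation shape: "Surname, I. (Year). Title. Publisher."
# A format check can only say a string *looks like* a citation -- checking
# that the work exists would require a lookup against a real database.
CITATION_RE = re.compile(
    r"^[A-Z][a-z]+, [A-Z]\. \(\d{4}\)\. .+\. [A-Z][\w &-]+\.$"
)

def looks_like_citation(s):
    return bool(CITATION_RE.match(s))

# Both strings are made up for this sketch.
plausible  = "Knuth, D. (1968). The Art of Computer Programming. Addison-Wesley."
fabricated = "Smith, J. (2019). Quantum Llama Farming. Springer."

print(looks_like_citation(plausible))   # True
print(looks_like_citation(fabricated))  # True -- the pattern can't tell
```

That's the professor's advantage in a nutshell: they carry the database of real authors and works in their head, and the format check doesn't.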