this post was submitted on 29 May 2026
746 points (98.7% liked)

Technology

84998 readers
3188 users here now

This is a most excellent place for technology news and articles.


Our Rules


  1. Follow the lemmy.world rules.
  2. Only tech related news or articles.
  3. Be excellent to each other!
  4. Mod approved content bots can post up to 10 articles per day.
  5. Threads asking for personal tech support may be deleted.
  6. Politics threads may be removed.
  7. No memes allowed as posts, OK to post as comments.
  8. Only approved bots from the list below, this includes using AI responses and summaries. To ask if your bot can be added please contact a mod.
  9. Check for duplicates before posting, duplicates may be removed
  10. Accounts 7 days and younger will have their posts automatically removed.

Approved Bots


founded 3 years ago
MODERATORS
you are viewing a single comment's thread
view the rest of the comments
[–] BlackLaZoR@lemmy.world 41 points 16 hours ago (1 children)

Just to make things clear: API access to most models is charged per input tokens + output tokens. It means that the longer your conversation is, the more you pay for every new answer. Single prompt with no context and 100 tokens of answer is cheap. Single prompt with 100k tokens of context and 100 tokens of answer is NOT cheap.

Extremely long conversations with most expensive top of the line models can absolutely demolish your budget.

[–] perviouslyiner@lemmy.world 10 points 15 hours ago (3 children)

does it give the full history to the LLM each time?

Last time I tried implementing something like this, it suggested to have a rolling window of history so that it takes into account your last X messages but not the entire conversation.

(I guess this is what ollama calls "context length"?)

[–] BlackLaZoR@lemmy.world 2 points 4 hours ago

does it give the full history to the LLM each time?

It's limited to the context size supported by given model. You can give the model 100k tokens of history but if it's configured for less, it will just truncate it before processing (usually by removing oldest tokens first)

[–] percent@infosec.pub 7 points 14 hours ago

Most agent harnesses do something called "compaction." For example, here's how Pi does compaction

[–] Sabata11792@ani.social 7 points 15 hours ago* (last edited 15 hours ago) (1 children)

You send the entire history for that conversation every time and likely more if its getting info from tools. If its not in the context the model dose not see it unless you have a memory system that dose something like feeding in summaries of past conversations that also takes up tokens and context. Rolling drops old messages to not reach context limits but you can lose important info or get odd results. If the history gets bigger than the context things break or slow way down.

[–] perviouslyiner@lemmy.world 9 points 15 hours ago

presumably this is why Claude periodically writes its conclusions so far into a text file that it can read later instead of having to remember everything. Sounds like an interesting approach.