Hi,
On my fork of AI Character Chat, I have a function that estimates how many lore entries I can fit within a prompt: it takes the token count of the prompt without any lore entries and subtracts that from idealMaxContextTokens to get an idea of how much space is left for lore.
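For context, the budget calculation described above can be sketched roughly like this (the helper name and signature are my own invention, not the fork's actual code):

```javascript
// The stock char-based approximation, reproduced here so the sketch
// is self-contained (see the countTokens implementation quoted below).
function countTokens(text) {
    return Math.ceil(text.length / 3.6);
}

// Hypothetical helper: how many tokens remain for lore entries once
// the base prompt (without lore) is accounted for.
function estimateLoreBudget(promptWithoutLore, idealMaxContextTokens) {
    const baseTokens = countTokens(promptWithoutLore);
    return Math.max(0, idealMaxContextTokens - baseTokens);
}
```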
This works alright, but while testing it I was watching my debug prints in the console, and that finally made me realize that the token count returned by countTokens is fairly inflated (by over 1200 tokens on average) compared to what the AI Text Plugin itself reports when it prints "initial token count" and "token count after trimming".
So I looked at the implementation for countTokens:
return Math.ceil(text.length/3.6); // TODO: just an approximation for now - this approx will not work well with non-english characters. will update during next model upgrade.
That got me wondering what other ways of approximating the token count might look like. For my use case I only need to handle plain English text, which made me think a more precise estimate should be possible.
I don't know how to do this calculation properly myself, though, so I went back and forth with the text model (at my own peril) and it landed on this:
function countTokens_ENG(text) {
    // count word-like sequences (including apostrophes/hyphens)
    const wordCount = (text.match(/[\w'-]+/g) || []).length;
    // count punctuation/symbols (excluding characters already allowed inside words)
    const punctCount = (text.match(/[^\w\s'-]/g) || []).length;
    // estimate ~1.3 tokens per word (accounts for common words being split)
    // and count each punctuation mark as its own token
    return Math.round(wordCount * 1.3 + punctCount);
}
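For a quick sanity check, the two estimators can be compared side by side on a short sample (the sentence below is made up purely for illustration):

```javascript
// Made-up sample sentence, purely for illustration.
const sample = "The quick brown fox jumps over the lazy dog, doesn't it?";

// Character-based estimate (the stock countTokens approximation).
const charEstimate = Math.ceil(sample.length / 3.6);

// Word/punctuation-based estimate (same logic as countTokens_ENG above).
const words = (sample.match(/[\w'-]+/g) || []).length;   // 11 word-like runs
const punct = (sample.match(/[^\w\s'-]/g) || []).length; // "," and "?"
const wordEstimate = Math.round(words * 1.3 + punct);

console.log({ charEstimate, wordEstimate });
```

On short natural-language samples like this the two land close together; the gap shows up on longer prompts with lots of punctuation and formatting.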
I tested the above code for a while; it still inflates the token count, but only by about half as much, so not too bad.
So now I'm curious whether anything else can be done in code to get a more accurate estimate, since I think that could slightly improve both summarization and dynamic retrieval.
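One idea (my own suggestion, not something from the fork): since the plugin already logs the real counts ("initial token count"), you could calibrate the estimator against them instead of hard-coding 1.3 tokens per word. A minimal sketch, with entirely fabricated sample numbers:

```javascript
// Least-squares scale through the origin: actual ≈ k * estimate.
// Fed with (estimate, actual) pairs collected from the debug prints.
function calibrateScale(samples) {
    let num = 0, den = 0;
    for (const { estimate, actual } of samples) {
        num += estimate * actual;
        den += estimate * estimate;
    }
    return den === 0 ? 1 : num / den;
}

// Fabricated example pairs; in practice, log these from real prompts.
const samples = [
    { estimate: 5200, actual: 4100 },
    { estimate: 3100, actual: 2450 },
    { estimate: 760,  actual: 610  },
];
const k = calibrateScale(samples);

// Calibrated estimator wrapping the countTokens_ENG logic above.
function countTokens_calibrated(text) {
    const wordCount = (text.match(/[\w'-]+/g) || []).length;
    const punctCount = (text.match(/[^\w\s'-]/g) || []).length;
    return Math.round((wordCount * 1.3 + punctCount) * k);
}
```

A handful of logged pairs should be enough to pull the estimate close to what the plugin reports, and you can re-run the calibration whenever the model (and hence the tokenizer) changes.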
Cheers!
Guess that depends on how you're applying it, i.e. for natural language I'm sure it does fine.