Hi,
On my fork of AI Character Chat, I have a function that estimates how many lore entries I can fit within a prompt: it takes the token count of the prompt without any lore entries and subtracts that from idealMaxContextTokens to get an idea of how much space is left for lore.
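For context, the budget calculation described above can be sketched roughly like this (the helper name and signature are my own invention, not the fork's actual code):

```javascript
// The stock char-based approximation, reproduced here so the sketch
// is self-contained (see the countTokens implementation quoted below).
function countTokens(text) {
    return Math.ceil(text.length / 3.6);
}

// Hypothetical helper: how many tokens remain for lore entries once
// the base prompt (without lore) is accounted for.
function estimateLoreBudget(promptWithoutLore, idealMaxContextTokens) {
    const baseTokens = countTokens(promptWithoutLore);
    return Math.max(0, idealMaxContextTokens - baseTokens);
}
```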
This works alright, but while testing it I was watching my debug prints in the console, and that finally made me realize that the token count returned by countTokens is fairly inflated (by over 1200 tokens on average) compared to what the AI Text Plugin itself reports when it prints "initial token count" and "token count after trimming".
So I looked at the implementation for countTokens:
return Math.ceil(text.length/3.6); // TODO: just an approximation for now - this approx will not work well with non-english characters. will update during next model upgrade.
That got me wondering what other ways of approximating the token count might look like. For my use case I only need to handle plain English text, which made me think a more precise estimate should be possible.
I don't know how to do this calculation properly myself, though, so I went back and forth with the text model (at my own peril) and it landed on this:
function countTokens_ENG(text) {
    // count word-like sequences (including apostrophes/hyphens)
    const wordCount = (text.match(/[\w'-]+/g) || []).length;
    // count punctuation/symbols (excluding characters already allowed inside words)
    const punctCount = (text.match(/[^\w\s'-]/g) || []).length;
    // estimate ~1.3 tokens per word (accounts for common words being split)
    // and count each punctuation mark as its own token
    return Math.round(wordCount * 1.3 + punctCount);
}
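For a quick sanity check, the two estimators can be compared side by side on a short sample (the sentence below is made up purely for illustration):

```javascript
// Made-up sample sentence, purely for illustration.
const sample = "The quick brown fox jumps over the lazy dog, doesn't it?";

// Character-based estimate (the stock countTokens approximation).
const charEstimate = Math.ceil(sample.length / 3.6);

// Word/punctuation-based estimate (same logic as countTokens_ENG above).
const words = (sample.match(/[\w'-]+/g) || []).length;   // 11 word-like runs
const punct = (sample.match(/[^\w\s'-]/g) || []).length; // "," and "?"
const wordEstimate = Math.round(words * 1.3 + punct);

console.log({ charEstimate, wordEstimate });
```

On short natural-language samples like this the two land close together; the gap shows up on longer prompts with lots of punctuation and formatting.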
I tested the above code for a while; it still inflates the token count, but only by about half as much, so not too bad.
So now I'm curious whether anything else can be done in code to get a more accurate estimate, since I think that could slightly improve both summarization and dynamic retrieval.
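One idea (my own suggestion, not something from the fork): since the plugin already logs the real counts ("initial token count"), you could calibrate the estimator against them instead of hard-coding 1.3 tokens per word. A minimal sketch, with entirely fabricated sample numbers:

```javascript
// Least-squares scale through the origin: actual ≈ k * estimate.
// Fed with (estimate, actual) pairs collected from the debug prints.
function calibrateScale(samples) {
    let num = 0, den = 0;
    for (const { estimate, actual } of samples) {
        num += estimate * actual;
        den += estimate * estimate;
    }
    return den === 0 ? 1 : num / den;
}

// Fabricated example pairs; in practice, log these from real prompts.
const samples = [
    { estimate: 5200, actual: 4100 },
    { estimate: 3100, actual: 2450 },
    { estimate: 760,  actual: 610  },
];
const k = calibrateScale(samples);

// Calibrated estimator wrapping the countTokens_ENG logic above.
function countTokens_calibrated(text) {
    const wordCount = (text.match(/[\w'-]+/g) || []).length;
    const punctCount = (text.match(/[^\w\s'-]/g) || []).length;
    return Math.round((wordCount * 1.3 + punctCount) * k);
}
```

A handful of logged pairs should be enough to pull the estimate close to what the plugin reports, and you can re-run the calibration whenever the model (and hence the tokenizer) changes.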
Cheers!
Guess that depends on how you're applying it, i.e. for natural language I'm sure it does fine.