this post was submitted on 23 Apr 2026
532 points (99.4% liked)

196

6097 readers
2114 users here now

Community Rules

You must post before you leave

Be nice. Assume others have good intent (within reason).

Block or ignore posts, comments, and users that irritate you in some way rather than engaging. Report if they are actually breaking community rules.

Use content warnings and/or mark as NSFW when appropriate. Most posts with content warnings likely need to be marked NSFW.

Most 196 posts are memes, shitposts, cute images, or even just recent things that happened, etc. There is no real theme, but try to avoid posts that are very inflammatory, offensive, very low quality, or very "off topic".

Bigotry is not allowed, this includes (but is not limited to): Homophobia, Transphobia, Racism, Sexism, Ableism, Classism, or discrimination based on things like Ethnicity, Nationality, Language, or Religion.

Avoid shilling for corporations, posting advertisements, or promoting exploitation of workers.

Proselytization, support, or defense of authoritarianism is not welcome. This includes but is not limited to: imperialism, nationalism, genocide denial, ethnic or racial supremacy, fascism, Nazism, Marxism-Leninism, Maoism, etc.

Avoid AI generated content.

Avoid misinformation.

Avoid incomprehensible posts.

No threats or personal attacks.

No spam.

Moderator Guidelines

  • Don’t be mean to users. Be gentle or neutral.
  • Most moderator actions which have a modlog message should include your username.
  • When in doubt about whether or not a user is problematic, send them a DM.
  • Don’t waste time debating/arguing with problematic users.
  • Assume the best, but don’t tolerate sealioning/just asking questions/concern trolling.
  • Ask another mod to take over cases you struggle with, if you get tired, or when things get personal.
  • Ask the other mods for advice when things get complicated.
  • Share everything you do in the mod matrix, both so several mods aren't unknowingly handling the same issue and so you can receive feedback on what you intend to do.
  • Don't rush mod actions. If a case doesn't need to be handled right away, consider taking a short break before getting to it. This is to say, cool down and make room for feedback.
  • Don’t perform too much moderation in the comments, except if you want a verdict to be public or to ask people to dial a convo down/stop. Single comment warnings are okay.
  • Send users concise DMs about verdicts concerning them, such as bans, except in cases where it is clear we don’t want them at all, such as obvious transphobes. There is, of course, no need to notify someone that they have not been banned.
  • Explain to a user why their behavior is problematic and how it is distressing others rather than engage with whatever they are saying. Ask them to avoid this in the future and send them packing if they do not comply.
  • First warn users, then temp ban them, then finally perma ban them when they break the rules or act inappropriately. Skip steps if necessary.
  • Use neutral statements like “this statement can be considered transphobic” rather than “you are being transphobic”.
  • No large decisions or actions without community input (polls or meta posts, for example).
  • Large internal decisions (such as ousting a mod) might require a vote, needing more than 50% of the votes to pass. Also consider asking the community for feedback.
  • Remember you are a voluntary moderator. You don’t get paid. Take a break when you need one. Perhaps ask another moderator to step in if necessary.

founded 1 year ago
MODERATORS
 
top 25 comments
[–] TherapyGary@lemmy.dbzer0.com 3 points 5 hours ago
[–] Grandwolf319@sh.itjust.works 14 points 10 hours ago

What is ironic is that there have been consistent reports that it does not improve productivity.

[–] Smorty@lemmy.blahaj.zone 15 points 12 hours ago (1 children)

on a serious note: designing benchmarks is hard.

the consensus has been that creating verifiable benchmarks is surprisingly difficult, and the ones that are actually hard (like HLE) only get included in these benchmark images when new high scores are achieved.

it's just soooo nice seeing a 99% score on a tool calling benchmark which literally just tests whether the model can generate proper json

people are trying their best designing benchmarks.
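
the "proper json" style of check being mocked above can be sketched in a few lines. this is a hypothetical scorer, not any real benchmark's harness; the function name and required keys are made up for illustration:

```python
import json

def score_tool_call(output: str, required_keys: set[str]) -> bool:
    """Pass if the model's output parses as JSON and contains the expected keys."""
    try:
        call = json.loads(output)
    except json.JSONDecodeError:
        return False
    # Checks syntax and key presence only -- not whether the chosen
    # tool or its arguments actually make sense for the task.
    return isinstance(call, dict) and required_keys <= call.keys()

good = score_tool_call('{"tool": "search", "arguments": {"q": "x"}}', {"tool", "arguments"})
bad = score_tool_call('call search with q=x', {"tool", "arguments"})
```

a benchmark built from checks like this rewards syntactic validity, which is why near-perfect scores on it say very little.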

[–] TotallynotJessica@lemmy.blahaj.zone 5 points 7 hours ago* (last edited 7 hours ago) (1 children)

The best measure for AI is the productivity and accuracy of the work people do with the models. It doesn't matter if the tech is good at anything if people don't use it properly. Just like any tool, there are right and wrong ways to use them.

AI isn't just about machine learning, but about the role that technology has in our lives. The problem with AI has never been the underlying tech, but how people perceive it and how they use it.

[–] Fiery@lemmy.dbzer0.com 1 points 2 hours ago

The best measure is indeed the final impact of these systems. However, that is very hard to actually measure properly, and it doesn't make benchmarks completely useless. Benchmarks are still good data points (if they're designed well) for measuring advances in the technology. If a model failed at a realistic task before and the next gen can do it, that often translates to a real improvement in impact. Still, a 2× improvement on a benchmark doesn't mean the model will have 2× the impact.

A benchmark can be run automatically and often, while real impact studies take time.

In software development, the best measure of quality is the end user having no issues; that doesn't mean automated testing (unit/integration/end-to-end) suddenly becomes irrelevant, though.
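
The testing analogy can be made concrete: an automated check is a cheap, repeatable proxy that runs on every change, while "no end-user issues" only shows up much later. A minimal illustrative sketch (the function and its test are invented for this example):

```python
def slugify(title: str) -> str:
    """Turn a page title into a URL slug."""
    return "-".join(title.lower().split())

# Cheap automated proxy: runs in milliseconds on every commit,
# the way a benchmark can be run on every new model checkpoint.
def test_slugify():
    assert slugify("Hello World") == "hello-world"
    assert slugify("  spaced   out ") == "spaced-out"

test_slugify()
# The real quality measure -- users never hitting a broken URL --
# can only be observed slowly, in production.
```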

[–] ActualGrapesTasteGreen@piefed.zip 57 points 16 hours ago (3 children)

A big detail nobody seems to bring up about Project Glasswing is that they didn't just prompt it "Hey, check out this codebase looking for issues" and out popped zero-days. They ran each project through tens of thousands of dollars' worth of compute time. Iteration after iteration, and after all that they accumulated a report. Now they've reached out to some of the most cash-flush companies to say "we can do the same for you."

Put your quarter in the one-armed bandit. Maybe you'll get a zero-day, but more than likely you'll get a "better luck next time." But please, keep paying us. In 10,000 more iterations we'll surely find the bug that would have cost you millions.

[–] Ophrys@lemmy.dbzer0.com 17 points 14 hours ago

Yeah it's cool a computer can write a script but if it takes 5 megawatts to do it then it's not really an improvement

[–] qqq@lemmy.world 7 points 12 hours ago* (last edited 11 hours ago) (1 children)

A competent pentest already costs tens of thousands of dollars, and we're also not guaranteed to find anything. Some of the bugs discovered by Mythos had existed in long-standing codebases for a very long time and were not previously known. I would definitely not write off those capabilities.

[–] peeteer@feddit.org 1 points 40 minutes ago

According to the Mythos System Card, Mythos did not find any new bugs. It was provided with descriptions of existing bugs and an intentionally weakened, improperly sandboxed environment. Even then, it was only able to replicate those bugs 85% of the time.

[–] derbolle@lemmy.world 19 points 15 hours ago (1 children)

I read that in Ed Zitron's voice

[–] germanatlas@lemmy.blahaj.zone 4 points 10 hours ago
[–] supersquirrel@sopuli.xyz 37 points 16 hours ago (1 children)

Data without context is irrelevant and meaningless.

[–] ImgurRefugee114@reddthat.com 15 points 15 hours ago (4 children)
[–] supersquirrel@sopuli.xyz 15 points 15 hours ago* (last edited 15 hours ago) (1 children)
[–] ImgurRefugee114@reddthat.com 14 points 15 hours ago (1 children)

Luckily this won't be on the exam...

[–] supersquirrel@sopuli.xyz 9 points 15 hours ago* (last edited 15 hours ago) (1 children)

Well as long as the AI I use to cheat on the exam wasn't trained on data inputted from confident bullshit I have said or other idiots like me have said on the internet I will be fine!

[–] Viking_Hippie@lemmy.dbzer0.com 7 points 14 hours ago

Narrator: supersquirrel would not be fine

[–] UnspecificGravity@piefed.social 5 points 13 hours ago

What do you get when you multiply six by nine?

[–] Black616Angel@discuss.tchncs.de 2 points 12 hours ago

1337/420≈π

[–] Smorty@lemmy.blahaj.zone 8 points 12 hours ago

benchmaxxing is real and real annoying.

recent local model releases appear to be good, but i dismissed them becuz of the high scores (implying benchmaxxing)

this whole project glasswing thing, oh gosh... most of the exploits found by that model were later proven to be findable with older models too, so this is nothing new.

[–] Quacksalber@sh.itjust.works 8 points 16 hours ago

This can be copied 1:1 to right-wingers and wanna-be fascists. They too love to make up scary big numbers.

[–] CriticalMiss@lemmy.world 6 points 15 hours ago

It’s more like: the tests we came up with ourselves show our models improved, therefore you can safely invest a lot of money in us, and uhh, yeah, we will become profitable one day

[–] Canadian_Cabinet@lemmy.ca 3 points 15 hours ago

The numbers don't lie! And they spell disaster for you!