this post was submitted on 19 Nov 2025
69 points (96.0% liked)

I've been running OCR on the recent House Epstein email dump. Making this available now that it's close to finishing (20k/23k emails processed).

Processing script available here: https://codeberg.org/sillyhonu/Image_OCR_Processing_Epstein

I also put an analysis script in there if you want to use Drive/Colab.

Currently finished files are available here:

https://files.catbox.moe/xrgts0.sqlite
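If you grab the SQLite file, here's a minimal sketch for poking at it locally. I haven't confirmed the schema, so the table and column names in the commented query are assumptions; list the tables first:

```python
import sqlite3

conn = sqlite3.connect("xrgts0.sqlite")
cur = conn.cursor()

# Inspect the actual schema before querying anything.
for name, sql in cur.execute(
    "SELECT name, sql FROM sqlite_master WHERE type='table'"
):
    print(name)
    print(sql)

# Example search once you know the real names (hypothetical here:
# table 'emails', text column 'ocr_text'):
# rows = cur.execute(
#     "SELECT rowid, ocr_text FROM emails WHERE ocr_text LIKE ?",
#     ("%flight%",),
# ).fetchall()

conn.close()
```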

[–] Grimy@lemmy.world 2 points 18 hours ago* (last edited 18 hours ago) (1 children)

https://old.reddit.com/r/LocalLLaMA/comments/1p0q3z1/offline_epstein_file_ranker_using_gptoss120b/

Here's a similar idea that uses a ranking method. I'd maybe add that, and also ask in the prompt to classify how the name was mentioned from a fixed list of options, so you can quickly filter out names like Obama that are most likely not involved in the actual bad stuff. A rough sketch of that is below.
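This is just a sketch of the idea, not the linked post's prompt; the label set and ask_llm() are placeholders for whatever local model you wire in:

```python
# Fixed-label classification of how a name appears in an email.
LABELS = ["sender", "recipient", "cc", "mentioned in passing",
          "subject of discussion", "scheduling/logistics"]

def build_prompt(name: str, email_text: str) -> str:
    return (
        f"Classify how the name '{name}' appears in the email below. "
        f"Answer with exactly one of: {', '.join(LABELS)}.\n\n"
        f"---\n{email_text}\n---"
    )

def classify_mention(name, email_text, ask_llm):
    # ask_llm is a hypothetical callable: prompt -> model's text response
    answer = ask_llm(build_prompt(name, email_text)).strip().lower()
    # Constrain output: anything off-list gets bucketed for manual review.
    return answer if answer in LABELS else "unparseable"
```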

Awesome work!

[–] TropicalDingdong@lemmy.world 3 points 18 hours ago (1 children)

Yeah, I have a feeling my OCR is better. I had some crashes but it's still running. There have been some real issues with how they released this. I've also added a SHA-256 hash to ensure file integrity if, for whatever reason, shit gets fucked with (although I wish I hadn't limited it to 4096 bits).
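For what it's worth, a streaming full-file SHA-256 is cheap. The prefix version below is my reading of the 4096-bit limit (512 bytes), which is an assumption, and it only catches tampering near the start of a file:

```python
import hashlib

def sha256_full(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Hash the entire file, chunk by chunk, without loading it all.
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def sha256_prefix(path: str, nbytes: int = 512) -> str:
    # 4096 bits == 512 bytes; misses any change past the prefix.
    with open(path, "rb") as f:
        return hashlib.sha256(f.read(nbytes)).hexdigest()
```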

I'm running it in Google Colab so I can mount the docs directly, but it's been behaving very, very fuckily. I've also tried downloading them all, and that almost always fails.
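The mount itself is just the stock Colab call; the retry loop is a generic sketch for the flaky downloads, not what my script actually does:

```python
from google.colab import drive
drive.mount('/content/drive')

import time
import requests

def fetch_with_retries(url: str, dest: str, tries: int = 5) -> None:
    for attempt in range(tries):
        try:
            r = requests.get(url, timeout=60)
            r.raise_for_status()
            with open(dest, "wb") as f:
                f.write(r.content)
            return
        except requests.RequestException:
            time.sleep(2 ** attempt)  # back off and retry
    raise RuntimeError(f"gave up on {url}")
```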

We're at 21840/23124 images processed, but it's been a slog.

I think there are also some failures in there I'll have to go fix. I set this up to process across multiple nodes, and if a node failed, it may have checked out a data file without processing it. It should be fewer than a few hundred.
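The fix is basically re-queuing stale checkouts. A sketch, with a hypothetical queue table (path, checked_out_at, done):

```python
import sqlite3
import time

STALE_SECONDS = 3600  # checked out over an hour ago, never finished

def requeue_stale(db_path: str) -> int:
    conn = sqlite3.connect(db_path)
    cutoff = time.time() - STALE_SECONDS
    cur = conn.execute(
        "UPDATE queue SET checked_out_at = NULL "
        "WHERE done = 0 AND checked_out_at IS NOT NULL "
        "AND checked_out_at < ?",
        (cutoff,),
    )
    conn.commit()
    n = cur.rowcount  # number of files put back in the queue
    conn.close()
    return n
```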

[–] Grimy@lemmy.world 1 points 17 hours ago* (last edited 17 hours ago) (1 children)

It's definitely a lot of data. I didn't mean anything by it, btw; I just sent the link since it has his prompt in the comments, or part of it at least. In any case, we get the most out of it if different people go through it separately. I'm very excited to see what you find.

Just to tack on: I just thought about ranking them in terms of how cryptic the exchange is. I'm guessing they use a lot of inside terms when talking about the heavy stuff that the LLM will struggle to understand. I don't know how successful this would be, though; LLMs already struggle with simple ranking sometimes.

Maybe you can run normal algorithms to flag terms that aren't in dictionaries and comb through the recurring ones manually, something like the sketch below.
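A minimal version of that, assuming a Unix wordlist at /usr/share/dict/words (adjust the path for your system):

```python
import re
from collections import Counter

with open("/usr/share/dict/words") as f:
    dictionary = {w.strip().lower() for w in f}

def odd_terms(texts, min_count=5):
    counts = Counter()
    for text in texts:
        for tok in re.findall(r"[A-Za-z]{3,}", text):
            if tok.lower() not in dictionary:
                counts[tok.lower()] += 1
    # Recurring unknowns are the candidates for manual review.
    return [(t, c) for t, c in counts.most_common() if c >= min_count]
```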

Just trying to brainstorm a bit.

[–] TropicalDingdong@lemmy.world 3 points 17 hours ago (1 children)

> Just to tack on: I just thought about ranking them in terms of how cryptic the exchange is. I'm guessing they use a lot of inside terms when talking about the heavy stuff that the LLM will struggle to understand. I don't know how successful this would be, though; LLMs already struggle with simple ranking sometimes.

I mean, you can download the data and start digging around.

I promise it's not that cryptic.

And I think the issue you're identifying is real, but it's also why I'm not even trying to do any kind of heavy-lift analysis with LLMs. They have a role, imo, but that ain't it. You'll really struggle if you insist on doing things this way. My editorial opinion is that the best use of LLMs is for the weakest tasks possible. The more you ask them to do, the less good they are at it. Or maybe "less good" isn't quite right: the greater the propensity for hallucination, and the more likely they are to "smooth over" information that would actually be interesting.

You take responsibility for the analysis and let the LLM help where it's capable. Take a "just the facts, ma'am" approach to the analysis, but use LLMs for things (like writing regexes) that they're good at.
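For example, the kind of output worth asking an LLM for is a regex you can read and verify yourself. The patterns here are just illustrative:

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")

def extract_facts(text: str) -> dict:
    # Deterministic extraction; the analysis stays with you.
    return {
        "emails": EMAIL_RE.findall(text),
        "dates": DATE_RE.findall(text),
    }
```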

And honestly, as a data scientist, this isn't exactly my specific kind of work. I do work with machine learning, but usually on global spatial data, for actual scientific questions. I just thought of this as a fun, relatively easy thing to do over the weekend. And now it's next week. And it's still going...

We did not do these things because they are easy; we do these things because we thought they were easy.

[–] mojofrododojo@lemmy.world 1 points 10 hours ago

props. huzzah.