this post was submitted on 19 Nov 2025
63 points (95.7% liked)

Data Hoarder


I've been running OCR on the recent House Epstein email dump. I'm making this available now that it's close to finishing (20k/23k emails processed).

Processing script available here: https://codeberg.org/sillyhonu/Image_OCR_Processing_Epstein

I also put an analysis script in there if you want to use Drive/Colab.

Currently finished files are available here:

https://files.catbox.moe/xrgts0.sqlite
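For anyone opening the file for the first time, a minimal sketch for listing what is in it. The table and column names are not stated in this post, so the code reads them from SQLite's own catalog instead of assuming a schema; the function name `describe_database` is just an illustrative choice.

```python
import sqlite3


def describe_database(path):
    """Return {table_name: [column names]} for every table in a SQLite file."""
    schema = {}
    conn = sqlite3.connect(path)
    try:
        tables = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        for (table,) in tables:
            # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
            quoted = table.replace('"', '""')
            columns = conn.execute(f'PRAGMA table_info("{quoted}")').fetchall()
            schema[table] = [column[1] for column in columns]
    finally:
        conn.close()
    return schema
```

Calling `describe_database("xrgts0.sqlite")` on the downloaded file (local filename assumed) should show which table holds the OCR text and timings before you write any queries against it.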

all 19 comments
[–] Typotyper@sh.itjust.works 2 points 4 hours ago

Why are all the emails from Epstein and very few to him? Is this what Congress is holding back?

[–] TootSweet@lemmy.world 14 points 17 hours ago (2 children)

We have to see all the places where "Bubba" is mentioned.

[–] TropicalDingdong@lemmy.world 13 points 16 hours ago* (last edited 16 hours ago) (2 children)

Search Results: "Bubba" References in Epstein Documents

Found 9 matches across the documents:


Match 1 (Document 2549)

...ulation as a souvenir? Well, Jonathan Brandmeier on KLSX invited listeners to call in and suggest euphemisms for presidential semen. My favorite was "Bubba butter." Apparently, my role is to serve as a vehicle for the destruction of taboos. I have also become an automatic comedy reference. So, to Jay...


Match 2 (Document 10035)

...ng that if anyone really investigates the crop of liberal Hypocrats in power today, they would find who were the real Dirty Old Men of the party. Bubba Clinton is only the first to be exposed.


Match 3 (Document 10053)

...arch 29, 2014 at 8:43 am — Log in to Reply. Good thing it was on here for a moment or two, because once the powers that be get their paws on this, Bubba, The Pervert info, they will bend over backward to make it disappear from view.


Match 4 (Document 10053 - continuation)

...bba, The Pervert info, they will bend over backward to make it disappear from view. By the time 2016 rolls around, Hillary and the co-president (Bubba) will be scrubbed squeaky clean and she'll be so sanitized, the Pope will bow to her when she deigns to meet him.


Match 5 (Document 10079)

...other powerful figure in the Western world); dissections of handwritten flight logs for the financier's private 727 aircraft that frequently capture Bubba at 30,000 feet; and extensive allegations from a woman, Virginia Roberts, who claims she was Epstein's sex slave as a teenager and whose recollection...


Match 6 (Document 10090)

...other powerful figure in the Western world); dissections of handwritten flight logs for the financier's private 727 aircraft that frequently capture Bubba at 30,000 feet; and extensive allegations from a woman, Virginia Roberts, who claims she was Epstein's sex slave as a teenager and whose recollection...


Match 7 (Document 10101)

...other powerful figure in the Western world); dissections of handwritten flight logs for the financier's private 727 aircraft that frequently capture Bubba at 30,000 feet; and extensive allegations from a woman, Virginia Roberts, who claims she was Epstein's sex slave as a teenager and whose recollection...


Match 8 (Document 18120)

21/2018 8:32:08 AM To: 'jeffrey E.' Subject: RE: hey. Ask him if Putin has the photos of Trump blowing Bubba? From: jeffrey E. Sent: Monday, March 19, 2018 2:15 PM


Match 9 (Document 18121)

and i thought- I had tsuris. On Wed, Mar 21, 2018 at 4:32 AM, Mark L. Epstein wrote: Ask him if Putin has the photos of Trump blowing Bubba? From: jeffrey E. Sent: Monday, March 19, 2018 2:15 PM
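A listing like the one above can be produced with a plain keyword search that keeps some context around each hit. This is a rough sketch, not the script that generated these results; it assumes the OCR text has already been loaded into a dict of `{document_id: text}`, and `find_snippets` is a made-up name.

```python
import re


def find_snippets(documents, term, width=150):
    """Yield (doc_id, snippet) for each case-insensitive whole-word match."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    for doc_id, text in documents.items():
        for match in pattern.finditer(text):
            start = max(0, match.start() - width)
            end = min(len(text), match.end() + width)
            # Mark truncation so it is obvious the snippet is an excerpt.
            prefix = "..." if start > 0 else ""
            suffix = "..." if end < len(text) else ""
            yield doc_id, prefix + text[start:end] + suffix
```

Because matching is on whole words, OCR errors inside the term itself will be missed, so treat a count from this kind of search as a lower bound.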

[–] TootSweet@lemmy.world 5 points 12 hours ago (1 children)

You are an absolute god.

And it's clear that they used the nickname "Bubba" to refer to Bill Clinton more than once.

[–] TropicalDingdong@lemmy.world 10 points 12 hours ago

I would caution against over-interpreting this. The only direct mention of Bubba getting a BJ is that one email thread.

And a few more things. One, the email dump has a huge number of "digest"-type emails, like summaries of forum conversations. Documents 10035 and 10053 are from a forum they seemed to be tracking, so Epstein didn't say those things; some commenter in the forum did.

Also, they were regularly being forwarded articles. Like, a lot of articles, from a lot of people, and often about themselves. So in some ways this contaminates the emails, because it creates a set of names, dates, and locations that come from nothing more than someone sending an article about Epstein to Epstein.
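One cheap way to see how much of a result set is the same forwarded article repeated (matches 5 to 7 in the search above are identical text) is to group hits by a hash of the normalized snippet. A minimal sketch, assuming hits arrive as `(doc_id, text)` pairs; it only catches exact repeats after lowercasing and whitespace cleanup, not near-duplicates with OCR differences.

```python
import hashlib
import re
from collections import defaultdict


def group_duplicates(snippets):
    """Group (doc_id, text) pairs whose text is identical after normalization."""
    groups = defaultdict(list)
    for doc_id, text in snippets:
        # Lowercase and collapse whitespace so trivial spacing differences still match.
        normalized = re.sub(r"\s+", " ", text.lower()).strip()
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        groups[key].append(doc_id)
    # Dicts preserve insertion order, so groups come back in first-seen order.
    return list(groups.values())
```

Any group with more than one document ID is a candidate for "one article, forwarded several times" rather than several independent mentions.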

[–] resipsaloquitur@lemmy.world 1 points 11 hours ago (1 children)

Hey now. “Bubba Clinton” could be anyone. Definitely isn’t Bill Clinton.

[–] mojofrododojo@lemmy.world 2 points 5 hours ago

PFFFT it's obviously George Clinton from Parliament-Funkadelic.

[–] TropicalDingdong@lemmy.world 10 points 17 hours ago (1 children)

Just a bit of a note...

It's like, all the email Epstein got, including newsletters/advertisements/digests they were signed up to.

Gonna be a lot of crap to filter through.

[–] Nomad@infosec.pub 5 points 17 hours ago (1 children)

I know LLMs are a bad word here, but why not feed that stuff into an LLM and have it give you all the significant references?

[–] TropicalDingdong@lemmy.world 8 points 17 hours ago* (last edited 17 hours ago) (2 children)

I mean, it's not an LLM, but it's a form of transformer model that is doing the OCR (if you look at the code I put out). Just the OCR component has taken almost 10 days. The model I'm using is incredibly small compared with a full LLM, and it takes roughly 30s per document (see the DB for the actual timings).

Before I kicked this off I did the math on using Claude and OpenAI to do this. The price would have been between $1.5k and $2.5k in tokens. So the process I'm using is a lot slower (I could have done this in hours by leveraging Anthropic's or OpenAI's GPUs instead of rolling my own), but it's also OCR-specific and frankly, other than the failures I've mentioned previously (very small text), I've yet to find any substantial mistakes or hallucinations. It's at least 99% accurate.
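For a sense of scale, here is the back-of-envelope arithmetic using only numbers quoted in this thread (about 30 s per document, 23,124 images, $1.5k to $2.5k quoted for the API route). It assumes a single worker running back to back, which is a simplification of the actual multi-node setup.

```python
TOTAL_DOCS = 23_124             # image count quoted in this thread
SECONDS_PER_DOC = 30            # rough per-document OCR time quoted above
API_COST_USD = (1_500, 2_500)   # quoted low and high API estimates


def local_days(total_docs, seconds_per_doc):
    """Wall-clock days for one worker processing documents one after another."""
    return total_docs * seconds_per_doc / 86_400


def cost_per_doc(total_cost_usd, total_docs):
    """Implied API cost per document in USD."""
    return total_cost_usd / total_docs


days = local_days(TOTAL_DOCS, SECONDS_PER_DOC)
low, high = (cost_per_doc(cost, TOTAL_DOCS) for cost in API_COST_USD)
print(f"{days:.1f} days locally; ${low:.3f} to ${high:.3f} per document via API")
```

That works out to about 8 days of pure compute, in the same ballpark as the "almost 10 days" once crashes and restarts are counted, and roughly 6 to 11 cents per document for the API route.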

But beyond that, I'm absolutely planning on using an LLM to go further than I've done already.

Specifically, I'm going to run this prompt:

prompt = Extract ALL contact information from this document. ONLY extract what exists EXACTLY as written - DO NOT make anything up. CRITICAL: Extract EVERY person mentioned (full names, nicknames, first names only, Mr/Ms + lastname). Look in headers AND body text.

Document: {content}

Return ONLY JSON: {{"emails": [{{"email": "exact", "context": "where", "associated_name": null}}], "names": [{{"name": "exact", "associated_email": null, "associated_phone": null, "context": "where"}}], "phone_numbers": [{{"phone": "exact", "associated_name": null, "context": "where"}}], "organizations": [{{"name": "exact", "context": "where"}}], "locations": [{{"location": "exact", "context": "where"}}], "addresses": [{{"address": "exact", "associated_name": null, "context": "where"}}]}}
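The doubled braces in the prompt are what you write when the template goes through Python's `str.format`, so the JSON braces survive while `{content}` is substituted. A minimal sketch of that plus a defensive parse of the reply; the prompt here is shortened, and `build_prompt`, `parse_response`, and `keep_verbatim` are illustrative names, not functions from the repo.

```python
import json

# Shortened form of the prompt above; {{ and }} come out of str.format as literal braces.
PROMPT_TEMPLATE = (
    "Extract ALL contact information from this document. "
    "ONLY extract what exists EXACTLY as written.\n\n"
    "Document: {content}\n\n"
    'Return ONLY JSON: {{"emails": [], "names": [], "phone_numbers": [], '
    '"organizations": [], "locations": [], "addresses": []}}'
)

EXPECTED_KEYS = {
    "emails", "names", "phone_numbers", "organizations", "locations", "addresses",
}


def build_prompt(content):
    """Fill the template; braces inside `content` are inserted untouched."""
    return PROMPT_TEMPLATE.format(content=content)


def parse_response(raw):
    """Return the parsed reply, or None unless it is an object with all six list keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not EXPECTED_KEYS.issubset(data):
        return None
    if not all(isinstance(data[key], list) for key in EXPECTED_KEYS):
        return None
    return data


def keep_verbatim(entries, field, content):
    """Drop entries whose `field` value does not appear verbatim in the source text."""
    return [
        entry for entry in entries
        if isinstance(entry, dict)
        and isinstance(entry.get(field), str)
        and entry[field] in content
    ]
```

The `keep_verbatim` check is a cheap guard for the "EXACTLY as written" instruction: anything the model returns that cannot be found in the document text gets discarded instead of trusted.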

And it's worked well so far. That being said, I bought a processing machine (the AMD Strix Halo, Ryzen AI Max+ 395), so I have a machine that can hold up to a 60 GB model in memory. Once it's done processing the OCR, I'm planning to start the next step in the process.

That being said, you are more than welcome to hop in on the repo I published and do whatever you please. It's all available to you.

[–] Nomad@infosec.pub 6 points 15 hours ago

I like. Thanks for your work so far.

[–] Grimy@lemmy.world 2 points 13 hours ago* (last edited 13 hours ago) (1 children)

https://old.reddit.com/r/LocalLLaMA/comments/1p0q3z1/offline_epstein_file_ranker_using_gptoss120b/

Here's a similar idea that uses a ranking method. I'd maybe add that and ask it in the prompt to classify how the name was mentioned from a set list of options, so you can quickly filter out names like Obama which are most likely not involved in the actual bad stuff.

Awesome work!

[–] TropicalDingdong@lemmy.world 3 points 12 hours ago (1 children)

Yeah, I have a feeling my OCR is better. I had some crashes but it's still running. There have been some real issues with how they released this. I've also added a sha256 hash to ensure file integrity if, for whatever reason, shit gets fucked with (although I wish I hadn't limited it to 4096 bits).
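For reference, a minimal sketch of hashing a whole file with SHA-256 in Python. This covers the full file contents in chunks; it is not necessarily how the repo computes its hash, and `sha256_of_file` is an illustrative name.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the whole file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Storing that digest next to each OCR result lets anyone re-download a source image later and confirm it is byte-for-byte the file that was processed.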

I'm running it in Google Colab to be able to mount the docs directly, but it's been behaving very, very fuckily. I've also tried downloading them all, and it almost always fails.

We're at 21840/23124 images processed, but it's been a slog.

I think there were also some failures in there I'll have to go fix. Because I set this up to process across multiple nodes, if a node failed it may have checked out a data file but never processed it. Should be fewer than a few hundred.
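The usual fix for "checked out but never processed" is a lease timeout: anything that has been checked out for longer than a node could plausibly need goes back into the queue. A rough sketch, assuming a hypothetical `jobs` table with `status` and `checked_out_at` (Unix seconds) columns; the real tracking schema in the repo may look nothing like this.

```python
import sqlite3
import time


def reclaim_stale(conn, max_age_seconds, now=None):
    """Reset jobs stuck in 'checked_out' longer than max_age_seconds to 'pending'."""
    now = time.time() if now is None else now
    cursor = conn.execute(
        "UPDATE jobs SET status = 'pending', checked_out_at = NULL "
        "WHERE status = 'checked_out' AND checked_out_at < ?",
        (now - max_age_seconds,),
    )
    conn.commit()
    # rowcount is the number of jobs that were put back in the queue.
    return cursor.rowcount
```

With about 30 s per document, a timeout of several minutes would be a conservative choice, so a slow node does not get its work stolen mid-run.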

[–] Grimy@lemmy.world 1 points 12 hours ago* (last edited 12 hours ago) (1 children)

It's definitely a lot of data. I didn't mean anything by it btw, I just sent the link since it has his prompt in the comments, or part of it at least. We get the most out of it if different people are going through it separately in any case. I'm very excited to see what you find.

Just to tack on, I just thought about ranking them, but in terms of how cryptic the exchange is. I'm guessing they use a lot of inside terms when talking about the heavy stuff that the LLM will struggle to understand. I don't know how successful this would be though; LLMs already struggle with simple ranking sometimes.

Maybe you can run normal algorithms for terms that aren't in dictionaries and comb through the recurring ones manually.

Just trying to brainstorm a bit.
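The dictionary idea can be sketched in a few lines: tokenize, drop anything in a known word list, and count what is left. This assumes you supply the word list yourself as a set of lowercase words (a system dictionary file, for instance); `recurring_unknown_terms` is an illustrative name.

```python
import re
from collections import Counter


def recurring_unknown_terms(texts, vocabulary, min_count=3):
    """Count lowercase word tokens missing from `vocabulary`; keep the recurring ones."""
    counts = Counter()
    for text in texts:
        for token in re.findall(r"[a-z']+", text.lower()):
            token = token.strip("'")
            if token and token not in vocabulary:
                counts[token] += 1
    # most_common() sorts by descending count, so frequent terms come first.
    return [(term, n) for term, n in counts.most_common() if n >= min_count]
```

Expect OCR misreads and surnames to dominate the output, so the manual pass over the recurring terms is still where the actual work happens.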

[–] TropicalDingdong@lemmy.world 3 points 12 hours ago (1 children)

Just to tack on, I just thought about ranking them, but in terms of how cryptic the exchange is. I’m guessing they use a lot of inside terms when talking about the heavy stuff that the LLM will struggle to understand. I don’t know how successful this would be though; LLMs already struggle with simple ranking sometimes.

I mean, you can download the data and start digging around.

I promise it's not that cryptic.

And I think the issue you are identifying is real, but it's also why I'm not even trying to do any kind of heavy-lift analysis with LLMs. They have a role imo, but that ain't it. You'll really struggle if you insist on doing things this way. My editorial opinion is that the best use of LLMs is to do the weakest tasks possible. The more you ask them to do, the less good they are at it. Or maybe "less good" isn't the right phrase, but the greater the propensity for hallucination, and the more likely they are to "smooth over" information that would actually be interesting.

You take responsibility for the analysis, and let the LLM help where it's capable. Take a "just the facts, ma'am" approach to analysis, but use the LLMs to do things (like regex) that they are good at.
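As an example of the kind of small, checkable regex job meant here: pulling "Sent:" timestamps out of the OCR text, in the header format seen in the search results earlier in the thread. A minimal sketch; the pattern only covers that one date layout, and month and AM/PM parsing assumes an English locale.

```python
import re
from datetime import datetime

# Optional weekday, then e.g. "March 19, 2018 2:15 PM".
SENT_PATTERN = re.compile(
    r"Sent:\s*(?:[A-Za-z]+,\s*)?([A-Za-z]+ \d{1,2}, \d{4} \d{1,2}:\d{2} [AP]M)"
)


def extract_sent_times(text):
    """Return a datetime for every 'Sent:' header that parses cleanly."""
    results = []
    for raw in SENT_PATTERN.findall(text):
        try:
            results.append(datetime.strptime(raw, "%B %d, %Y %I:%M %p"))
        except ValueError:
            # OCR noise such as a misspelled month fails to parse; skip it rather than guess.
            continue
    return results
```

The point of keeping it this narrow is that every result can be verified against the source text by eye, which is the "just the facts" part.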

And honestly, as a data scientist, this isn't exactly my specific kind of work. I do work with machine learning, but usually on global spatial data and for actual scientific questions. I just thought of this as a fun, relatively easy thing to do over the weekend. And now it's next week. And it's still going...

We did not do these things because they are easy; we do these things because we thought they were easy.

[–] mojofrododojo@lemmy.world 1 points 4 hours ago

props. huzzah.