this post was submitted on 19 Nov 2025

74 points (96.2% liked)

OCR'd Epstein emails (lemmy.world)

submitted 3 months ago by TropicalDingdong@lemmy.world to c/datahoarder@lemmy.world


I've been running OCR on the recent House Epstein email dump. Making this available now that it's close to finishing (20k/23k emails processed).

Processing script available here: https://codeberg.org/sillyhonu/Image_OCR_Processing_Epstein

I also put an analysis script in there if you want to use Drive/Colab.

Currently finished files are available here:

https://files.catbox.moe/xrgts0.sqlite
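The schema of the SQLite file isn't described in the post, so a reasonable first step after downloading it is to ask the database what it contains. A minimal sketch using only Python's standard `sqlite3` module; the function name `list_tables` is made up for illustration and nothing here assumes particular table or column names:

```python
import sqlite3


def list_tables(db_path: str) -> dict[str, list[str]]:
    """Return {table_name: [column names]} for every table in the database."""
    conn = sqlite3.connect(db_path)
    try:
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        schema = {}
        for table in tables:
            # PRAGMA table_info yields one row per column; index 1 is the column name.
            columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
            schema[table] = [col[1] for col in columns]
        return schema
    finally:
        conn.close()
```

Calling `list_tables("xrgts0.sqlite")` on the downloaded file should show which table holds the OCR text and which holds the per-document timings.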

all 24 comments

sorted by: hot top controversial new old

[–] TootSweet@lemmy.world 14 points 3 months ago (2 children)

We have to see all the places where "Bubba" is mentioned.

[–] TropicalDingdong@lemmy.world 18 points 3 months ago* (last edited 3 months ago) (2 children)

Search Results: "Bubba" References in Epstein Documents

Found 9 matches across the documents:

Match 1 (Document 2549)

...ulation as a souvenir? Well, Jonathan Brandmeier on KLSX invited listeners to call in and suggest euphemisms for presidential semen. My favorite was "Bubba butter." Apparently, my role is to serve as a vehicle for the destruction of taboos. I have also become an automatic comedy reference. So, to Jay...

Match 2 (Document 10035)

...ng that if anyone really investigates the crop of liberal Hypocrats in power today, they would find who were the real Dirty Old Men of the party. Bubba Clinton is only the first to be exposed.

Match 3 (Document 10053)

...arch 29, 2014 at 8:43 am — Log in to Reply. Good thing it was on here for a moment or two, because once the powers that be get their paws on this, Bubba, The Pervert info, they will bend over backward to make it disappear from view.

Match 4 (Document 10053 - continuation)

...bba, The Pervert info, they will bend over backward to make it disappear from view. By the time 2016 rolls around, Hillary and the co-president (Bubba) will be scrubbed squeaky clean and she'll be so sanitized, the Pope will bow to her when she deigns to meet him.

Match 5 (Document 10079)

...other powerful figure in the Western world); dissections of handwritten flight logs for the financier's private 727 aircraft that frequently capture Bubba at 30,000 feet; and extensive allegations from a woman, Virginia Roberts, who claims she was Epstein's sex slave as a teenager and whose recollection...

Match 6 (Document 10090)

...other powerful figure in the Western world); dissections of handwritten flight logs for the financier's private 727 aircraft that frequently capture Bubba at 30,000 feet; and extensive allegations from a woman, Virginia Roberts, who claims she was Epstein's sex slave as a teenager and whose recollection...

Match 7 (Document 10101)

...other powerful figure in the Western world); dissections of handwritten flight logs for the financier's private 727 aircraft that frequently capture Bubba at 30,000 feet; and extensive allegations from a woman, Virginia Roberts, who claims she was Epstein's sex slave as a teenager and whose recollection...

Match 8 (Document 18120)

21/2018 8:32:08 AM To: 'jeffrey E.' Subject: RE: hey. Ask him if Putin has the photos of Trump blowing Bubba? From: jeffrey E. Sent: Monday, March 19, 2018 2:15 PM

Match 9 (Document 18121)

and i thought- I had tsuris. On Wed, Mar 21, 2018 at 4:32 AM, Mark L. Epstein wrote: Ask him if Putin has the photos of Trump blowing Bubba? From: jeffrey E. Sent: Monday, March 19, 2018 2:15 PM
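How the matches above were produced isn't stated. One way to get similar keyword-in-context excerpts, sketched under the assumption that you have already loaded `(document_id, text)` pairs from the database (the function name and the `width` parameter are illustrative):

```python
import re
from typing import Iterable, Iterator


def find_snippets(
    docs: Iterable[tuple[int, str]], term: str, width: int = 150
) -> Iterator[tuple[int, str]]:
    """Yield (doc_id, excerpt) for every case-insensitive occurrence of term."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    for doc_id, text in docs:
        for match in pattern.finditer(text):
            start = max(0, match.start() - width)
            end = min(len(text), match.end() + width)
            excerpt = text[start:end].replace("\n", " ")
            # Mark truncation with "..." only where text was actually cut.
            prefix = "..." if start > 0 else ""
            suffix = "..." if end < len(text) else ""
            yield doc_id, f"{prefix}{excerpt}{suffix}"
```

Because every occurrence is yielded separately, a document that mentions the term twice shows up twice, which matches the "continuation" entry for document 10053.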

[–] TootSweet@lemmy.world 7 points 3 months ago (1 children)

You are an absolute god.

And it's clear that they used the nickname "Bubba" to refer to Bill Clinton more than once.

[–] TropicalDingdong@lemmy.world 11 points 3 months ago

I would caution against over-interpreting this. The only direct mention of Bubba getting a BJ is that one email thread.

And a few more things. One, the email dump has a huge number of "digest" type emails, like summaries of forum conversations. 10035 and 10053 are from a forum they seemed to be tracking, so Epstein didn't say those things; some commenter in the forum did.

Also, they were regularly being forwarded articles. Like a lot of articles. From a lot of people. And it often had to do with them. So in some ways, this contaminates the emails, because it creates a set of names, dates, and locations that came from someone just sending an article about Epstein to Epstein.

[–] resipsaloquitur@lemmy.world 1 points 3 months ago (1 children)

Hey now. “Bubba Clinton” could be anyone. Definitely isn’t Bill Clinton.

[–] mojofrododojo@lemmy.world 2 points 3 months ago

PFFFT it's obviously George Clinton from Parliament-Funkadelic.

[–] TropicalDingdong@lemmy.world 11 points 3 months ago (1 children)

Just a bit of a note...

It's like, all the email Epstein got, including newsletters/advertisements/digests they were signed up to.

Gonna be a lot of crap to filter through.

[–] Nomad@infosec.pub 5 points 3 months ago (1 children)

I know LLMs are a bad word here, but why not feed that stuff into an LLM and have it give you all the significant references?

[–] TropicalDingdong@lemmy.world 10 points 3 months ago* (last edited 3 months ago) (2 children)

I mean it's not an LLM, but it's a form of transformer model that is doing the OCR (if you look at the code I put out). Just the OCR component has taken almost 10 days. The model I'm using is incredibly small compared with a full LLM, and it takes about 30s a document (see the DB for the actual timings).
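As a rough sanity check on those numbers, here is the arithmetic for a single worker processing every image back to back. The total of 23124 images is the figure quoted elsewhere in this thread; the 30 seconds is the per-document estimate above, not a measured average:

```python
TOTAL_IMAGES = 23124      # total image count quoted in the thread
SECONDS_PER_DOC = 30      # rough per-document OCR time quoted above


def serial_runtime_days(n_docs: int, seconds_per_doc: float) -> float:
    """Days needed to process n_docs one after another on a single worker."""
    return n_docs * seconds_per_doc / 86_400  # 86,400 seconds per day


print(round(serial_runtime_days(TOTAL_IMAGES, SECONDS_PER_DOC), 1))  # prints 8.0
```

About 8 days of pure compute is in the same range as the "almost 10 days" reported, though actual wall-clock time also depends on how many nodes were running and how much was lost to crashes.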

Before I kicked this off I did the math on using Claude and OpenAI to do this. The price would have been between $1.5k and $2.5k in tokens. So the process I'm using is a lot slower (I could have done this in hours leveraging Anthropic's or OpenAI's GPUs instead of rolling my own), but it's also OCR specific and frankly, other than the failures I've mentioned previously (very small text), I've yet to find any substantial mistakes or hallucinations. It's at least 99% accurate.

But beyond that, I'm absolutely planning on using an LLM to go further than I've done already.

Specifically, I'm going to run this prompt:

prompt = Extract ALL contact information from this document. ONLY extract what exists EXACTLY as written - DO NOT make anything up. CRITICAL: Extract EVERY person mentioned (full names, nicknames, first names only, Mr/Ms + lastname). Look in headers AND body text.

Document: {content}

Return ONLY JSON: {{"emails": [{{"email": "exact", "context": "where", "associated_name": null}}], "names": [{{"name": "exact", "associated_email": null, "associated_phone": null, "context": "where"}}], "phone_numbers": [{{"phone": "exact", "associated_name": null, "context": "where"}}], "organizations": [{{"name": "exact", "context": "where"}}], "locations": [{{"location": "exact", "context": "where"}}], "addresses": [{{"address": "exact", "associated_name": null, "context": "where"}}]}}
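The doubled braces in that prompt suggest it is meant to go through Python's `str.format` (or an f-string), where `{{` and `}}` become literal braces and `{content}` is substituted. A sketch of that round trip, assuming the model's reply is available as a plain string; the template below is an abbreviated stand-in for the full prompt, and `build_prompt`, `parse_reply`, and `EXPECTED_KEYS` are illustrative names, not taken from the repo:

```python
import json

# Abbreviated stand-in for the prompt above (only two of the six keys shown).
PROMPT_TEMPLATE = (
    "Extract ALL contact information from this document.\n\n"
    "Document: {content}\n\n"
    'Return ONLY JSON: {{"emails": [{{"email": "exact", "context": "where", '
    '"associated_name": null}}], "names": [{{"name": "exact", '
    '"associated_email": null, "associated_phone": null, "context": "where"}}]}}'
)

# Top-level keys requested by the full prompt.
EXPECTED_KEYS = (
    "emails", "names", "phone_numbers", "organizations", "locations", "addresses",
)


def build_prompt(content: str) -> str:
    # str.format fills {content} and turns each {{ or }} into a single brace.
    return PROMPT_TEMPLATE.format(content=content)


def parse_reply(reply: str) -> dict:
    """Parse a model reply, tolerating stray text around the JSON object."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    data = json.loads(reply[start:end + 1])
    # Missing or null keys become empty lists so callers can iterate safely.
    return {key: data.get(key) or [] for key in EXPECTED_KEYS}
```

Since the prompt insists on exact strings, a cheap follow-up check is to confirm that each extracted value actually appears in the source document and to discard any that do not.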

And it's worked well so far. That being said, I bought a processing machine (the AMD Strix Halo, Ryzen AI Max+ 395), so I have a machine which can hold up to a 60 GB model in memory. Once it's done processing the OCR, I'm planning to start the next step in the process.

That being said, you are more than welcome to hop in on the repo I published and do whatever you please. It's all available to you.

[–] Nomad@infosec.pub 6 points 3 months ago

I like. Thanks for your work so far.

[–] Grimy@lemmy.world 2 points 3 months ago* (last edited 3 months ago) (1 children)

https://old.reddit.com/r/LocalLLaMA/comments/1p0q3z1/offline_epstein_file_ranker_using_gptoss120b/

Here's a similar idea that uses a ranking method. I'd maybe add that and ask it in the prompt to classify how the name was mentioned from a set list of options, so you can quickly filter out names like Obama which are most likely not involved in the actual bad stuff.

Awesome work!

[–] TropicalDingdong@lemmy.world 3 points 3 months ago (1 children)

Yeah I have a feeling my OCR is better. I had some crashes but it's still running. There have been some real issues with how they released this. I've also added a sha256 hash to ensure file integrity if, for whatever reason, shit gets fucked with (although I wish I hadn't limited it to 4096 bits).
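The post doesn't show how the hash is computed. For anyone wanting to verify files independently, this is a common way to get a SHA-256 over an entire file without loading it into memory, using the standard `hashlib` module (the function name and chunk size are arbitrary choices for the sketch):

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of the whole file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps calling read() until it returns b"" (EOF).
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in chunks means the digest covers every byte regardless of file size, so two copies of the same image will only match if they are byte-for-byte identical.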

I'm running it in Google Colab to be able to mount the docs directly, but it's been behaving very, very fuckily. I've also tried downloading them all, and it almost always fails.

We're at 21840/23124 images processed, but it's been a slog.

I think there were also some failures in there I'll have to go fix. Because I set this up to process across multiple nodes, if a node failed, it may have checked out a data file but never processed it. Should be less than a few hundred.
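One common way to recover from that failure mode is to treat any checkout older than some timeout as abandoned and put it back in the queue. The sketch below assumes a hypothetical `jobs` table with `status` and `checked_out_at` (Unix timestamp) columns; the real repo may track checkouts quite differently:

```python
import sqlite3
import time
from typing import Optional


def reclaim_stale(
    conn: sqlite3.Connection, max_age_s: float, now: Optional[float] = None
) -> int:
    """Reset jobs that were checked out long ago but never finished.

    Returns the number of jobs put back into the 'pending' state.
    """
    now = time.time() if now is None else now
    cur = conn.execute(
        "UPDATE jobs SET status = 'pending', checked_out_at = NULL "
        "WHERE status = 'checked_out' AND checked_out_at < ?",
        (now - max_age_s,),
    )
    conn.commit()
    return cur.rowcount
```

With roughly 30 seconds per document, a timeout of several minutes leaves slow documents alone while still catching files orphaned by a crashed node.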

[–] Grimy@lemmy.world 1 points 3 months ago* (last edited 3 months ago) (1 children)

It's definitely a lot of data. I didn't mean anything by it btw, I just sent the link since it has his prompt in the comments, or part of it at least. We get the most out of it if different people are going through it separately in any case. I'm very excited to see what you find.

Just to tack on, I just thought about ranking them but in terms of how cryptic the exchange is. I'm guessing they use a lot of inside terms when talking about the heavy stuff that the LLM will struggle to understand. I don't know how successful this would be though; LLMs already struggle with simple ranking sometimes.

Maybe you can run normal algorithms for terms that aren't in dictionaries and comb through the recurring ones manually.
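A plain-Python sketch of that idea: count every word that is not in a reference vocabulary and keep the ones that recur. Where the vocabulary comes from is an assumption left to the reader (for example a system word list such as `/usr/share/dict/words`, lowercased), and OCR misreads will show up alongside genuine jargon:

```python
import re
from collections import Counter
from typing import Iterable


def recurring_unknown_terms(
    texts: Iterable[str], vocabulary: set[str], min_count: int = 3
) -> list[tuple[str, int]]:
    """Return (word, count) pairs for out-of-vocabulary words seen at least min_count times."""
    counts: Counter = Counter()
    for text in texts:
        for word in re.findall(r"[a-z']+", text.lower()):
            word = word.strip("'")
            # Skip very short tokens; they are mostly OCR fragments and initials.
            if len(word) > 2 and word not in vocabulary:
                counts[word] += 1
    return [(word, n) for word, n in counts.most_common() if n >= min_count]
```

Proper names will dominate the output, so in practice the list still needs the manual pass suggested above.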

Just trying to brainstorm a bit.

[–] TropicalDingdong@lemmy.world 3 points 3 months ago (1 children)

> Just to tack on, I just thought about ranking them but in terms of how cryptic the exchange is. I’m guessing they use a lot of inside terms when talking about the heavy stuff that the LLM will struggle to understand. I don’t know how successful this would be though; LLMs already struggle with simple ranking sometimes.

I mean you can download the data and start digging around.

I promise it's not that cryptic.

And I think the issue you are identifying is real, but it's also why I'm not even trying to do any kind of heavy-lift analysis with LLMs. They have a role imo, but that ain't it. You'll really struggle if you insist on doing things this way. My editorial opinion is that the best use of LLMs is to do the weakest tasks possible. The more you ask them to do, the less good they are at it. Or maybe "less good" isn't the right word, but the greater the propensity for hallucination, and the more likely they are to "smooth over" information that would actually be interesting.

You take responsibility for the analysis, and let the LLM help where it's capable. Take a "just the facts, ma'am" approach to analysis, but use the LLMs to do things (like regex) that they are good at.

And honestly, as a data scientist, this isn't exactly my specific kind of work. I do work with machine learning, but usually global spatial data and for actual scientific questions. I just thought of this as a fun, relatively easy thing to do over the weekend. And now it's next week. And it's still going...

We did not do these things because they are easy; we do these things because we thought they were easy.

[–] mojofrododojo@lemmy.world 1 points 3 months ago

props. huzzah.

[–] Typotyper@sh.itjust.works 2 points 3 months ago (1 children)

Why are all the emails from Epstein and very few to him? Is this what Congress is holding back?

[–] TropicalDingdong@lemmy.world 1 points 3 months ago (1 children)

That's all combined: to, from, cc, etc.

It's his email account. So yeah, he's in all of them.

[–] Typotyper@sh.itjust.works 1 points 3 months ago (1 children)

Maybe I'm missing something. In the bottom left graph there are 2200+ emails "from" him and only ~150 "to" him.

He should have a lot more "to" him.

[–] TropicalDingdong@lemmy.world 1 points 3 months ago* (last edited 3 months ago) (1 children)

Oh.

Yeah that might be a formatting artifact. Or it might speak to the fact that we just receive far more than we send.

Many of the emails are digests or news articles. Like the NYT might send out a headlines email. And you just receive it, and aren't going to respond, so it only gets a "from"; no "to".
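How the from/to graph was built isn't shown, so the following is only one plausible way to reproduce such counts: look for header labels in the OCR text and count each label once per document. In the excerpts quoted earlier, headers often run together on one line, so the pattern does not require them to start a line; that also means ordinary body text containing "to:" can be counted by mistake:

```python
import re
from collections import Counter
from typing import Iterable

# Header labels followed by a colon, anywhere in the text (OCR often merges lines).
HEADER_RE = re.compile(r"\b(From|To|Cc|Bcc)\s*:", re.IGNORECASE)


def header_counts(texts: Iterable[str]) -> Counter:
    """Count how many documents contain each header label at least once."""
    counts: Counter = Counter()
    for text in texts:
        # A set ensures a label is counted once per document, even in long quoted threads.
        kinds = {match.group(1).lower() for match in HEADER_RE.finditer(text)}
        counts.update(kinds)
    return counts
```

A digest or newsletter whose OCR text has a "From:" but no "To:" line would push the two totals apart in the way described above.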

There is a lot of just... Crap in there. At least two partial books. Random stuff from forums and threads.

[–] Typotyper@sh.itjust.works 1 points 3 months ago (1 children)

For emails I respond to there are roughly equal numbers. For emails I send people I deal with there are roughly equal numbers. Some businesses ignore me, so those would skew things.

Maybe things will balance out once they release them all.

I seriously expect the Epstein files to spawn JFK-assassination-level conspiracies which linger for decades.

[–] TropicalDingdong@lemmy.world 1 points 3 months ago

I doubt you actually do, and I doubt most people do. The vast vast majority of email is sent and never even read.

You are only thinking about email used as direct correspondence, but how many random mass mailer emails have landed in your inboxes today? 10s? Hundreds?

I have another figure I can send you, but let me get some coffee in me. It's a frequency analysis in time.
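The figure itself isn't included in the thread. For readers who want to build something similar from the OCR text, here is a sketch that assumes dates appear in the `Sent: Monday, March 19, 2018 2:15 PM` form seen in the excerpts earlier; other date formats in the dump would need their own patterns, and month names are parsed with the default English locale:

```python
import re
from collections import Counter
from datetime import datetime
from typing import Iterable

# Captures e.g. "March 19, 2018" from "Sent: Monday, March 19, 2018 2:15 PM".
SENT_RE = re.compile(r"Sent:\s*\w+,\s*(\w+ \d{1,2}, \d{4})")


def emails_per_month(texts: Iterable[str]) -> Counter:
    """Count 'Sent:' dates per calendar month, keyed as 'YYYY-MM'."""
    counts: Counter = Counter()
    for text in texts:
        for raw in SENT_RE.findall(text):
            try:
                when = datetime.strptime(raw, "%B %d, %Y")
            except ValueError:
                continue  # OCR noise or an unrecognized month name
            counts[when.strftime("%Y-%m")] += 1
    return counts
```

Sorting the resulting keys gives a month-by-month series that can be plotted directly.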