this post was submitted on 17 Apr 2026
Selfhosted
A governmental-ish site I'm required to use doesn't send notifications by email, so you have to log in daily to check for updates. Updates may come several times a day or once a month. I automated my server to log in to the site once a day with my credentials, screenshot the notifications, parse them with OCR, and email the result to myself.
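For anyone wanting to build something similar, here is a minimal sketch of the last step only (turning the OCR text into a mail), using the Python standard library. It is not the setup described above. The state file, the addresses, the SMTP host and port, and the idea of mailing only when the text has changed are all assumptions for illustration.

```python
import hashlib
import smtplib
from email.message import EmailMessage
from pathlib import Path


def has_changed(text: str, state_file: Path) -> bool:
    """Return True if `text` differs from what was seen on the previous run."""
    digest = hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
    previous = state_file.read_text().strip() if state_file.exists() else ""
    if digest == previous:
        return False
    # Remember the new digest so the next run can compare against it.
    state_file.write_text(digest)
    return True


def build_mail(text: str, sender: str, recipient: str) -> EmailMessage:
    """Wrap the OCR text in a plain-text email message."""
    msg = EmailMessage()
    msg["Subject"] = "New site notifications"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(text)
    return msg


def notify(text: str, state_file: Path, sender: str, recipient: str,
           host: str = "localhost", port: int = 25) -> bool:
    """Send a mail only when the text changed; host and port are placeholders."""
    if not has_changed(text, state_file):
        return False
    with smtplib.SMTP(host, port) as smtp:
        smtp.send_message(build_mail(text, sender, recipient))
    return True
```

Run from a daily cron job, `notify(...)` would stay quiet on days when the notification text is identical to the previous run.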
What do you use for OCR parsing?
The data is non-critical and doesn't contain identifying info, so I use the ocr.space API. You could probably use the Tesseract libraries locally instead.
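As a rough sketch of the local option, the `tesseract` command-line tool can be called from Python with `subprocess`. This assumes Tesseract is installed and on the `PATH` with the English language data; the function names and the screenshot filename are made up for illustration.

```python
import subprocess


def tesseract_cmd(image_path: str, lang: str = "eng") -> list[str]:
    """Build the tesseract command line for one image."""
    # Passing "stdout" as the output base makes tesseract print the
    # recognized text instead of writing it to a .txt file.
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_image(image_path: str, lang: str = "eng") -> str:
    """Run tesseract on the image and return the recognized text."""
    result = subprocess.run(
        tesseract_cmd(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract fails
    )
    return result.stdout.strip()
```

Usage would look like `text = ocr_image("notifications.png")`. Nothing leaves the machine this way, which matters more if the screenshots ever do contain personal data.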
Why screenshot and parse? Can't you just parse the HTML directly?
Since the dawn of LLMs, it's virtually impossible to scrape web content. Headless browsers have become basically useless. I actually have to automate keyboard inputs to simulate the navigation. I could maybe try to write the JavaScript cache to a file, but honestly it's just faster this way.
What? Why? I'm scraping HTML just fine.
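For reference, a minimal sketch of what direct HTML parsing can look like with only the standard library. It assumes the notifications are present in the served HTML, in elements carrying a `notification` class; that class name is hypothetical. It does nothing about bot detection or JavaScript-rendered pages, which is the problem described above.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class NotificationParser(HTMLParser):
    """Collect the text of every element that has the given CSS class."""

    def __init__(self, css_class: str = "notification"):
        super().__init__()
        self.css_class = css_class
        self.depth = 0  # greater than 0 while inside a matching element
        self.items = []
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif self.css_class in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self._buf = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            # Join the text fragments and collapse runs of whitespace.
            self.items.append(" ".join(" ".join(self._buf).split()))

    def handle_data(self, data):
        if self.depth:
            self._buf.append(data)


def extract_notifications(html: str, css_class: str = "notification") -> list[str]:
    parser = NotificationParser(css_class)
    parser.feed(html)
    parser.close()
    return parser.items
```

Whether this works for a given site depends entirely on how the page is served; if the content only appears after client-side rendering, the raw HTML will not contain it.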