datahoarder

7641 readers
3 users here now

Who are we?

We are digital librarians. Among us are represented the various reasons to keep data -- legal requirements, competitive requirements, uncertainty of permanence of cloud services, distaste for transmitting your data externally (e.g. government or corporate espionage), cultural and familial archivists, internet collapse preppers, and people who do it themselves so they're sure it's done right. Everyone has their reasons for curating the data they have decided to keep (either forever or For A Damn Long Time). Along the way we have sought out like-minded individuals to exchange strategies, war stories, and cautionary tales of failures.

We are one. We are legion. And we're trying really hard not to forget.

-- 5-4-3-2-1-bang from this thread

founded 5 years ago
1
 
 

@[email protected] Got it done. I'm the first of the mods here and will be learning a little Lemmy over the next few weeks.

While everything is up in the air with the Reddit changes, I'll be very busy working on replacing the historical Pushshift API without Reddit's bastardizations, should a PS version come back.

In the meantime you should all mirror this data to ensure its survival. Do what you do best and HOARD!!

https://the-eye.eu/redarcs/
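
If you have never bulk-mirrored something like this before, a downloader doesn't need to be fancy. Below is a rough, untested sketch that just streams a list of direct file URLs to disk with Python and requests; the URL list is something you would assemble yourself from the redarcs page, and the commented-out entry is only a placeholder.

    import requests  # pip install requests
    from pathlib import Path
    from urllib.parse import urlparse

    # Direct file URLs collected from https://the-eye.eu/redarcs/ (placeholder example below).
    URLS = [
        # "https://the-eye.eu/redarcs/files/DataHoarder_submissions.zst",
    ]

    DEST = Path("redarcs-mirror")

    def mirror(urls):
        DEST.mkdir(exist_ok=True)
        for url in urls:
            target = DEST / Path(urlparse(url).path).name
            if target.exists():
                continue  # crude resume: skip files already downloaded
            with requests.get(url, stream=True, timeout=60) as resp:
                resp.raise_for_status()
                with open(target, "wb") as fh:
                    for chunk in resp.iter_content(chunk_size=1 << 20):
                        fh.write(chunk)
            print("mirrored", target)

    if __name__ == "__main__":
        mirror(URLS)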

2
7
submitted 2 days ago* (last edited 2 days ago) by [email protected] to c/[email protected]
 
 

My partner's grandmother has passed away and left a collection of hundreds, possibly thousands, of DVDs. These range from official releases to pirated and bootleg copies.

What would be the best way to digitize and archive this collection? Is there an external device out there that will let me rip and convert the DVDs? I'd possibly want to upload to archive.org where the copyright has expired, and store the rest on Backblaze or maybe another digital archiving site besides a regular torrent; I'd appreciate any recommendations on sites and advice in general. I haven't gone through these yet, but I figure the project would be a fun learning experience.
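
For the ripping step, a cheap USB DVD drive plus HandBrake (or MakeMKV for stubborn discs) is a common route; encrypted commercial discs additionally need libdvdcss installed. Below is a rough, untested sketch that rips the main feature of whatever disc is in the drive with HandBrakeCLI; the device path, preset, and output folder are assumptions to adjust for your setup.

    import subprocess
    from pathlib import Path

    DEVICE = "/dev/sr0"                    # assumed DVD drive device path
    OUTPUT_DIR = Path("~/dvd_rips").expanduser()
    PRESET = "Fast 1080p30"                # a stock HandBrake preset; pick what suits you

    def rip_disc(label: str) -> Path:
        """Rip the main feature of the disc currently in the drive to an MKV file."""
        OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
        out_file = OUTPUT_DIR / f"{label}.mkv"
        subprocess.run(
            [
                "HandBrakeCLI",
                "--input", DEVICE,
                "--main-feature",          # pick the longest title on the disc
                "--preset", PRESET,
                "--output", str(out_file),
            ],
            check=True,
        )
        return out_file

    if __name__ == "__main__":
        rip_disc(input("Disc label: ").strip() or "untitled_disc")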

3
 
 

Hi there, I've been meaning to get more serious about my data. I have minimal backups, and some stuff is not backed up at all. I'm begging for disaster.

Here's what I've got:

  • Two 8 TB drives, almost full, in universal external enclosures
  • A small form factor PC as a server, with one 8 TB drive connected
  • An unused Raspberry Pi
  • No knowledge of how to properly use ZFS

Here's what I want: I've decided I don't need RAID. I don't want the extra cost of drives or electricity, and I don't need uptime. I just need backups. I want to use the drives I have, plus an additional 16 TB drive I'll buy.

My thought was that I would replace the server's 8 TB drive with a 16 TB one, format it with ZFS (primarily to avoid bit rot; I'll need to learn how to check for this), then back it up across the two 8 TB drives as a cold backup. Either as two separate drives somehow? A Btrfs volume spanning both? Or a JBOD enclosure connected to the Raspberry Pi that I leave unplugged except when it's time to sync new data?

Or do you have a similarly cheap solution that's less janky?

I just want to back up my data cheaply, with some protection against bit rot.

I understand that it might make sense to invest in something a bit more robust right now, and fill it with drives as needed.

But the thing I keep coming back to is the cold backup. How can you keep cold backups across several hard drives without an entire second server to do the work?
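
For what it's worth, ZFS can cover the cold-backup part without a second server: the two 8 TB drives could become a second pool (striped, so no redundancy, but it is only the backup copy), kept exported and unplugged most of the time, and replicated into with zfs send/receive whenever it is plugged in. A rough, untested sketch of one sync cycle is below; the pool and dataset names are placeholders, and it is wrapped in Python only so the whole sequence sits in one block.

    import subprocess
    from datetime import datetime, timezone

    SOURCE = "tank/data"        # placeholder: dataset on the new 16 TB drive
    BACKUP_POOL = "coldpool"    # placeholder: pool spanning the two 8 TB drives
    BACKUP_DATASET = f"{BACKUP_POOL}/data"

    def run(*cmd, **kwargs):
        return subprocess.run(cmd, check=True, **kwargs)

    def sync_cold_backup(previous_snap=None):
        """Plug the backup drives in, run this, unplug once it returns."""
        new_snap = f"{SOURCE}@cold-{datetime.now(timezone.utc):%Y%m%d}"
        run("zpool", "import", BACKUP_POOL)        # bring the cold pool online
        run("zfs", "snapshot", new_snap)           # freeze the current state
        if previous_snap is None:
            send_cmd = ["zfs", "send", new_snap]   # first run: full copy
        else:
            send_cmd = ["zfs", "send", "-i", previous_snap, new_snap]  # only changed blocks
        send = subprocess.Popen(send_cmd, stdout=subprocess.PIPE)
        run("zfs", "receive", "-F", BACKUP_DATASET, stdin=send.stdout)
        send.wait()
        run("zpool", "export", BACKUP_POOL)        # safe to power off and unplug
        return new_snap                            # remember this name for the next -i

The bit-rot check is then just a periodic zpool scrub on whichever pool is online, and the last snapshot name can live in a text file between runs.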

Thanks for listening to my rambling.

4
4
SS Blog [New Archival Project] (tracker.archiveteam.org)
submitted 1 week ago* (last edited 1 week ago) by [email protected] to c/[email protected]
 
 

cross-posted from: https://lemm.ee/post/60023388

Archive Team has just begun the distributed archiving of the Japanese SS Blog, a blog hosting service, which is set to be discontinued on March 31, 2025.

And you can help! There isn't much time left, so as many people as possible running the Warrior are needed.

Resources:

  • The wiki page of the project (not much info)
  • The tracker (linked at the top of the page) has the simplest info on how you can help out
  • The GitHub page offers a Docker-based alternative for advanced users, plus more info on best practices for this sort of archiving

Why help out?

The web is disappearing all the time, and a lot of previously easily accessible information is lost to time. These Japanese blogs may not be very important to you, but they certainly are to a lot of people, and nobody knows what sort of information can be found only here until they need it.

5
 
 

Faster downloads from Gofile, in case Internet Archive is slow or not available: https://gofile.io/d/EFyn1q

Internet Archive for preservation: https://archive.org/details/snes_mods_and_romhacks_collection_20250326_patched


This is the first time I am uploading patched ROMs; previously I uploaded only the patch files. This is my personal collection of Super Nintendo ROM hacks as ready-to-play patched ROMs in .sfc and .smc formats, complete with a descriptive text document. Most, if not all, files were patched by myself, but I have not tested every game yet. Some old ROM hacks do not work in accurate emulators.
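
For anyone who wants to verify or re-create one of these from an original dump and its patch file: most older SNES hacks ship as IPS patches, which are simple enough to apply with a few lines of Python. The sketch below is a minimal IPS applier only (no BPS support, no copier-header handling), and the file names in the usage comment are just examples.

    from pathlib import Path

    def apply_ips(rom_path, patch_path, out_path):
        """Apply a plain IPS patch to a ROM and write the patched copy."""
        rom = bytearray(Path(rom_path).read_bytes())
        patch = Path(patch_path).read_bytes()

        if patch[:5] != b"PATCH":
            raise ValueError("not an IPS patch")

        i = 5
        while i < len(patch) and patch[i:i + 3] != b"EOF":
            offset = int.from_bytes(patch[i:i + 3], "big")
            size = int.from_bytes(patch[i + 3:i + 5], "big")
            i += 5
            if size == 0:
                # RLE record: one byte repeated `run` times.
                run = int.from_bytes(patch[i:i + 2], "big")
                data = patch[i + 2:i + 3] * run
                i += 3
            else:
                data = patch[i:i + size]
                i += size
            if offset + len(data) > len(rom):
                # Some hacks expand the ROM beyond its original size.
                rom.extend(b"\x00" * (offset + len(data) - len(rom)))
            rom[offset:offset + len(data)] = data

        Path(out_path).write_bytes(rom)

    # Example with made-up file names:
    # apply_ips("Super Metroid (JU).sfc", "Nature v1.03.ips", "Super Metroid_Nature v1.03.sfc")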

Please share this everywhere ROM files are allowed to be shared. I am only sharing it here at the moment.

This collection comes in two variants: flat structure and sub structure. "Flat" just means all ROMs and documents are saved in one single directory. "Sub" means every game gets its own dedicated directory, where only the related ROM hacks and mods are saved.

snes_mods_and_romhacks_collection_20250326_patched_flat.7z:

    snes_mods_and_romhacks_collection_20250326/
        Super Metroid_Nature v1.03.smc
        Super Metroid_Nature v1.03.txt

snes_mods_and_romhacks_collection_20250326_patched_sub.7z:

    Super Nintendo Mods and Romhacks Collection 2025-03-26/
        Documents/
            Super Metroid/
                Nature v1.03.txt
        Games/
            Super Metroid/
                Nature v1.03.smc

6
 
 

For years I've looked on and off for web archiving software that can capture most sites, including "complex" ones with lots of AJAX that require logins, like Reddit. Which ones have worked best for you?

Ideally I want one that can be started programmatically or via the command line, opens a Chromium instance (or any browser), and captures everything shown on the page. I could also open the instance myself to log into sites and install add-ons like uBlock Origin. (By the way, archiveweb.page must be started manually.)
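
If scripting your own capture is an option, one approach worth mentioning is Playwright for Python: it can launch a headed Chromium with a persistent profile (so logins and add-ons you set up manually stick around) and record everything the page loads into an HAR file. A rough sketch is below; the profile directory, target URL, and output paths are placeholders. Heavier tools such as Browsertrix Crawler build on the same browser-driven idea but produce WARC/WACZ output.

    from playwright.sync_api import sync_playwright

    PROFILE_DIR = "./archive-profile"   # persistent profile: log in and add uBlock Origin here once
    TARGET_URL = "https://old.reddit.com/r/DataHoarder/"  # placeholder target
    HAR_PATH = "capture.har"            # every request/response the page made

    with sync_playwright() as p:
        # Headed Chromium with a persistent profile, recording network traffic to an HAR file.
        context = p.chromium.launch_persistent_context(
            PROFILE_DIR,
            headless=False,
            record_har_path=HAR_PATH,
        )
        page = context.pages[0] if context.pages else context.new_page()
        page.goto(TARGET_URL, wait_until="networkidle")
        page.screenshot(path="capture.png", full_page=True)   # visual snapshot
        with open("capture.html", "w", encoding="utf-8") as f:
            f.write(page.content())                           # rendered DOM after AJAX
        context.close()                                       # flushes the HAR to disk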

7
 
 

Hi all,

I've been thinking about picking up an N150 or 5825U Mini-ITX board for a NAS, but I'm wondering if there are better options given my requirements.

  • At least 2x 2.5Gb LAN
  • A 10Gb LAN port, or another 2.5Gb one if not
  • 2x NVMe
  • 8x SATA for spinning disks
  • 2x SATA for SSDs
  • Mini-ITX is required for the 10" rack
  • 64+ GB of RAM for ZFS cache (not possible on an N150)

The problem I'm running into with the boards I've looked at is PCIe lanes: there's no way to expand the SATA or network ports without stealing lanes from the NVMe slots.

I've started looking at boards with PCIe 4.0 x16 slots and risers/splitters for expansion, but then I can't find low-power CPUs for them.

Thoughts?

8
29
submitted 3 weeks ago* (last edited 3 weeks ago) by [email protected] to c/[email protected]
9
10
 
 

cross-posted from: https://lemmy.world/post/26375626

A team of volunteer archivists has recreated the Centers for Disease Control website exactly as it was the day Donald Trump was inaugurated. The site, called RestoredCDC.org, went live Tuesday and is currently being hosted in Europe.

As we have been following since the beginning of Trump’s second term, websites across the entire federal government have been altered and taken offline under this administration’s war on science, health, and diversity, equity, and inclusion. Critical information promoting vaccines, HIV care, reproductive health options including abortion, and trans and gender confirmation healthcare has been purged from the CDC’s live website under Trump. Disease surveillance data about bird flu and other concerns have either been delayed or have stopped being updated entirely. Some deleted pages across the government have at least temporarily been restored thanks to a court order, but the Trump administration has added a note rejecting “gender ideology” to some of them.

“Our goal is to provide a resource that includes the information and data previously available,” the team wrote. “We are committed to providing the previously available webpages and data, from before the potential tampering occurred. Our approach is to be as transparent as possible about our process. We plan to gather archival data and then remove CDC logos and branding, using GitHub to host our code to create the site.”

11
 
 

They made cool workout posters, and still do, but I think they got DMCA'd in 2016. The superhero ones are all gone.

Navigating archive.org is slow and often leads to "no hotlinking" errors and unavailable Google Drive PDFs.

Anyone got these stocked somewhere?

12
 
 

Lexipol, also known as PoliceOne, is a private company based in Frisco, Texas that provides policy manuals, training bulletins, and consulting services to approximately 8,500 law enforcement agencies, fire departments, and other public safety departments across the United States. This leak contains the policy manuals produced by Lexipol, and some subscriber information.

Founded by two former cops who became lawyers, Lexipol retains copyright over all the manuals it creates, despite the public nature of its work. There is little transparency about how decisions are made in drafting its policies, which have an outsized influence on policing in the United States. The company localizes its materials to address differences in legal frameworks, depending on the city or state where the client is based.

Lexipol's manuals become public policy in thousands of jurisdictions. Lexipol's policies have been challenged in court for their role in racial profiling, harassment of immigrants, and unlawful detention. For example, Lexipol policies were used to justify body cameras being turned off when a police officer shot and killed Eric Logan in South Bend, Indiana in June 2019.

13
 
 

cross-posted from: https://lemmy.dbzer0.com/post/37424352

I have been lurking on this community for a while now and have really enjoyed the informational and instructional posts, but a topic I don't see come up very often is scaling a growing hoard. Currently I have a 20 TB server which I am rapidly filling, and most posts about expanding recommend simply buying larger drives and slotting them into a single machine. That is definitely the easiest way to expand, but it seems like it would only get you to about 100 TB before you can't reasonably do it anymore. So how do you build 100 TB+ setups with multiple servers?

My main concern is that currently all my services are Dockerized on a single machine running Ubuntu, which works extremely well. It is space-efficient thanks to hardlinking, and I can still seed everything back. From the posts I've read, it seems like as people scale they either give up on hardlinks and eat up a lot of storage with copied files, or they eventually delete their seeds and just keep the content. Do the Arr suite and qBittorrent allow dynamically selecting servers based on available space? Or are there other ways to solve these issues with additional tools? How do you set up large systems, and what recommendations would you make? Any advice is appreciated, from hardware to software!
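
I don't know of a built-in Arr/qBittorrent feature that does this, so purely as an illustration of the idea, here is a tiny, untested helper that picks whichever mounted storage server currently has the most free space; the mount points are made up, and something would still need to feed its answer to the download client.

    import shutil
    from pathlib import Path

    # Hypothetical mount points, e.g. NFS/SMB exports from several storage servers.
    STORAGE_MOUNTS = [Path("/mnt/store01"), Path("/mnt/store02"), Path("/mnt/store03")]

    def pick_download_target(minimum_free_bytes=500 * 1024**3):
        """Return the mount with the most free space, requiring ~500 GiB headroom."""
        candidates = [(shutil.disk_usage(m).free, m) for m in STORAGE_MOUNTS if m.is_mount()]
        free, mount = max(candidates)
        if free < minimum_free_bytes:
            raise RuntimeError("every storage server is nearly full -- time to buy drives")
        return mount

    if __name__ == "__main__":
        print(pick_download_target())

The catch is that hardlinks only work within a single filesystem, so once content is spread across machines you would need a per-server download/seed stack (or something like mergerfs pooling local disks) to keep that benefit.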

Also, huge shout out to Saik0 from this thread: https://lemmy.dbzer0.com/post/24219297 I learned a ton from his post, but it seemed like the tip of the iceberg!

14
15
 
 

Just noticed this today: it seems all the archiving activity has been noticed by NCBI/NLM staff. Thankfully, most of the SRA (the Sequence Read Archive) and other genomic data is also mirrored in Europe.

16
 
 

cross-posted from: https://beehaw.org/post/18335989

I set up an instance of the ArchiveTeam Warrior on my home server with Docker in under 10 minutes. It feels like I'm doing my part to combat the removal of information from the internet.
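
For anyone curious what that looks like, the Warrior is a single container. Here is a rough sketch using the Docker SDK for Python; the image tag, environment variables, and web UI port reflect my reading of the ArchiveTeam documentation, so treat them as assumptions and check the wiki for the current values.

    import docker  # pip install docker

    client = docker.from_env()

    container = client.containers.run(
        "atdr.meo.ws/archiveteam/warrior-dockerfile",  # assumed Warrior image tag
        name="archiveteam-warrior",
        detach=True,
        environment={
            "DOWNLOADER": "your-nickname",   # assumed: shows up on the tracker leaderboard
            "SELECTED_PROJECT": "auto",      # assumed: let the tracker pick the project
            "CONCURRENT_ITEMS": "2",
        },
        ports={"8001/tcp": 8001},            # assumed web UI at http://localhost:8001
        restart_policy={"Name": "unless-stopped"},
    )
    print("Warrior running as", container.short_id)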

17
 
 

In light of some of the recent dystopian executive orders, a lot of data is being proactively taken down. I am relying on this data for a report I'm writing at work, and I suspect a lot of others may be relying on it for more important reasons. As such, I created two torrents, one for the data behind the ETC Explorer tool and another for the data behind the Climate and Economic Justice Screening Tool. Here's an article about taking down the latter. My team at work suspects the former will follow soon.

Here are the .torrent files. Please help seed. They're not very large at all, <300 MB.

Of course, this is worthless without access to these torrents, so please distribute them to any groups you think would be interested, or otherwise help make them available.

18
19
 
 

This is bad, like very bad. The proposed draft law in India, in its current form, only prescribes the deletion and purging of inactive accounts when their users die. There should be a clause describing archiving or locking/suspension (like Facebook's memorialization feature) as alternatives to account deletion.

If the law is pushed through as-is and passed by the legislature, our understanding of the past will be destroyed in the long term, just as the fires in LA already destroyed the archives of the notable composer Arnold Schoenberg.

If you're an Indian citizen you can go to this page to post your feedback and concerns.

20
 
 

Trying to figure out if there is a way to do this without ZFS sending a ton of data. I have:

  • s/test1, inside it are folders:
    • folder1
    • folder2

I have this pool backed up remotely by sending snapshots.

I'd like to split this up into:

  • s/test1, inside is folder:
    • folder1
  • s/test2, inside is folder:
    • folder2

I'm trying to figure out if there is some combination of clone and promote that would limit the amount of data that needs to be sent over the network.
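
For reference, the local side of the clone-and-promote idea would look roughly like the commands below (wrapped in Python only to keep the sequence in one block; the snapshot names are made up). My understanding is that a clone can also be sent incrementally against its origin snapshot, which the remote side should already have from the existing backups, but treat that as something to verify rather than a given.

    import subprocess

    def zfs(*args):
        subprocess.run(["zfs", *args], check=True)

    # Dataset names from the post; snapshot names are made up.
    zfs("snapshot", "s/test1@split")

    # The clone shares all existing blocks with s/test1@split, so nothing is copied locally.
    zfs("clone", "s/test1@split", "s/test2")

    # Optional: make s/test2 independent of s/test1's snapshot chain.
    # zfs("promote", "s/test2")

    # Then delete folder2 from s/test1 and folder1 from s/test2 at the file level.

    # To replicate, the clone can (as far as I can tell) be sent incrementally
    # against its origin snapshot, which the remote already holds:
    #   zfs send -i s/test1@split s/test2@first | ssh backup zfs receive pool/test2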

Or maybe there is some record/replay method I could do on snapshots that I'm not aware of.

Thoughts?

21
 
 

cross-posted from: https://slrpnk.net/post/17044297

You don't understand, I might need that hilarious Cracked listicle from fifteen years ago!

22
23
24
 
 

While we are deeply disappointed with the Second Circuit’s opinion in Hachette v. Internet Archive, the Internet Archive has decided not to pursue Supreme Court review. We will continue to honor the Association of American Publishers (AAP) agreement to remove books from lending at their member publishers’ requests.

We thank the many readers, authors and publishers who have stood with us throughout this fight. Together, we will continue to advocate for a future where libraries can purchase, own, lend and preserve digital books.

25
 
 

Most-commented pages on each site, sorted from the most (Aniwave) to the least (Anitaku) comments:

Aniwave (9anime): Attack on Titan The Final Season Part 3 Episode 1

Gogoanime old comments: Yuri on Ice category page

Anitaku (Gogoanime): Kimetsu no Yaiba Yuukaku Hen Episode 10

Folders were compressed into tarballs with zstd level 9 compression:

Aniwave (9anime): 23.7 GiB uncompressed, 1.4 GiB compressed

Gogoanime: 16.4 GiB uncompressed, 769.5 MiB compressed

Anitaku (Gogoanime): 7.2 GiB uncompressed, 326.7 MiB compressed
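
For anyone who wants to reproduce this kind of packaging, a comment folder can be streamed straight into a zstd level 9 tarball from Python with the zstandard package; the folder and output names below are placeholders.

    import tarfile
    from pathlib import Path

    import zstandard  # pip install zstandard

    def make_tar_zst(source_dir, output_file, level=9):
        """Stream source_dir into a .tar.zst compressed at the given zstd level."""
        cctx = zstandard.ZstdCompressor(level=level)
        with open(output_file, "wb") as raw:
            with cctx.stream_writer(raw) as compressed:
                # "w|" = streaming tar write, so the archive is never built up in memory.
                with tarfile.open(fileobj=compressed, mode="w|") as tar:
                    tar.add(source_dir, arcname=Path(source_dir).name)

    # Hypothetical usage:
    # make_tar_zst("aniwave-comments", "aniwave-comments.tar.zst")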

DOWNLOADS:

Aniwave (9anime) Comments: https://archive.org/details/aniwave-comments.tar

Anitaku (Gogoanime) March 2024: https://archive.org/details/anitaku-feb-2024-comments.tar

Gogoanime Comments Before 2021: https://archive.org/details/gogoanimes-comments-archive-prior-2021.tar

EDIT: I replaced all the mega links with archive.org links and removed all images to reduce file size
