this post was submitted on 10 Mar 2026
14 points (88.9% liked)

Sysadmin


I have begun the process of building a lab for my team of HPC consultants, and I'm trying to make some plans. I would like this to be as flexible as I can make it. I live 3½ hours away from the site, so the fewer trips down there to recable and/or move stuff around the better! Most of this hardware has various older InfiniBand connectivity, along with multiport LOM & OCP cards at either 1Gb or 10Gb. Most also have the option to do dedicated and shared BMC. We have 2 dedicated IPs (so far) that I'm currently using for the head node's BMC & SSH access. This will be all Linux, though we will be accessing web interfaces when testing various products. My initial thoughts:

  • Identify what we want to keep and what we want to excess. There's some _very_ old hardware in there, including some old OmniPath gear. We don't see much OPA, but some team members seem to think that may change. Still, this stuff is old.
  • Carve out a management/provisioning network. Ideally, this will allow us to switch between dedicated and shared BMC ports at will. We use this for customer knowledge transfer when we demo our cluster management software. The shared port is usually onboard port 1, which is usually 1Gb, so this is easy enough. We can probably cable all of that up to one switch.
  • Identify a subset of nodes to cable up with the capability of accessing the campus network. These systems are behind the company VPN, and we will be controlling login access ourselves. While I'm not worried about someone on the team doing something nefarious on the company network, I don't want everything to have this capability. Still, having the option on some will give us flexibility, and we have a handful of systems with more Ethernet ports than we would otherwise need (campus LAN access is 1Gb).
  • Head node will run Proxmox to give us the flexibility to spin up temporary test heads for team member projects. The idea here is we can partition the network using VLANs to isolate what a group is doing with some systems from what anybody else is doing. The current head node has sufficient space to host shared home directories. We will also have a small IBM ESS that will be added to these racks next time I'm there.
  • I had thought about running some containers, either in a VM on the head node or as LXCs. Right now the only thing I'm considering on that front is NetBox.
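To make the VLAN partitioning concrete, here's a rough sketch of how the address plan could be carved up with Python's stdlib `ipaddress` module. The `10.42.0.0/16` supernet, the project names, and the VLAN IDs are all hypothetical placeholders; substitute whatever is actually free on site.

```python
import ipaddress

# Hypothetical private supernet for the lab; adjust to whatever range
# is actually free on site.
LAB_SUPERNET = ipaddress.ip_network("10.42.0.0/16")

# Carve /24s out of the supernet: one for the management/provisioning
# network (BMCs + PXE), then one per project VLAN for isolated test heads.
subnets = LAB_SUPERNET.subnets(new_prefix=24)
mgmt_net = next(subnets)

vlan_plan = {"mgmt": (1, mgmt_net)}
for vlan_id, project in enumerate(["team-a", "team-b", "scratch"], start=10):
    vlan_plan[project] = (vlan_id, next(subnets))

for name, (vlan_id, net) in vlan_plan.items():
    # First usable host reserved for the Proxmox bridge/gateway on that VLAN.
    gateway = next(net.hosts())
    print(f"{name:8s} vlan={vlan_id:<3d} subnet={net} gw={gateway}")
```

A plan generated like this can be imported into NetBox as prefixes/VLANs, so the documentation and the actual addressing never drift apart.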

This is what I have off the top of my head. If there's any useful software or procedures, or if I'm on the wrong path entirely, I'd appreciate your help. We have a modest budget, but we did convince our management to at least buy us a used 1Gb switch that is at least similar to the hardware we would see "in the wild." We're hoping we can use the lab to show value there and get them to approve some other, still modest, requests in the future!

[–] moonpiedumplings@programming.dev 3 points 2 days ago (1 children)

In addition to netbox, a wiki or other knowledgebase would be nice. You can document setup procedures as you go, and then other people can use that to figure stuff out.

[–] ClownStatue@piefed.social 1 points 2 days ago (1 children)

We actually use Redmine on another server that doesn’t require the VPN (still requires login though). Figured that would probably be a decent place for that stuff. Won’t be posting any passwords there! Initial access to the cluster will be key-based.

I (plus friends who do something similar) have been using centralized auth systems for this stuff. Proxmox supports OIDC, so if you are using Authentik or something similar you can just use one password.

And then Authentik supports 2FA, so you can use TOTP with that, or use passwords only.

[–] frongt@lemmy.zip 2 points 2 days ago (1 children)

What are you trying to lab? I would virtualize as much as possible, but that only gets you so far. You can't really virtualize appliances or hardware like InfiniBand connections.

You might just set up a whole day once a month to go in and run a class on setting up one kind of environment, then give them a few weeks to play around, break it, tear it down, and rebuild it.

[–] ClownStatue@piefed.social 1 points 18 hours ago

The team is spread out around the world, so going on site is really only feasible for me and a guy whose son lives in the same city.

As for what we’re trying to lab… first thing is large scale HPC/AI deployments. Starting with a model of the customer only wanting us to stand up, validate, burn in, and wipe as many racks at a time as possible. Right now we’re looking at a small deployment server that can connect to whatever network is needed and run our management software. Probably looking at some kind of automation for the networking as well. Why don’t we do that in manufacturing, you ask? There are some political reasons some of the stuff wouldn’t get done, but customers change their minds while hardware is in transit a lot! Also, stuff breaks and shimmies loose in transit all the time!
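The rack-at-a-time power/burn-in side of that lends itself to simple fan-out scripting over the BMCs. A minimal sketch that only *builds* `ipmitool` command lines (the BMC IP range and username are made up; an inventory would really come from NetBox):

```python
# Hypothetical BMC inventory; in practice this would be pulled from NetBox.
BMC_HOSTS = [f"10.42.0.{i}" for i in range(101, 105)]

def ipmi_power_cmd(host: str, action: str = "on",
                   user: str = "ADMIN") -> list[str]:
    """Build (but don't run) an ipmitool chassis-power command for one BMC.

    The password is deliberately not embedded: -E tells ipmitool to read
    it from the IPMI_PASSWORD environment variable instead of argv.
    """
    return ["ipmitool", "-I", "lanplus", "-H", host,
            "-U", user, "-E", "chassis", "power", action]

for host in BMC_HOSTS:
    print(" ".join(ipmi_power_cmd(host)))
```

From there it's a short step to running the commands with `subprocess` (or concurrently with `concurrent.futures`) to power-cycle a whole rack between burn-in passes.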

After that, we can look at our smaller deployments and see where we can automate things. Also looking to test/evaluate products so we can add to our delivery portfolio.