Erik, Tony, TJ, Jonathan, Dave:
At 12:04 PST on Fri 19dec2025 we had an MSR network switch failure, which took down the network for all the front ends, the /opt/rtcds NFS server and Guardian.
The Brocade Ruckus Network Switch SW-MSR-H1FE-STK (1/3) was found to be powered off (no LEDs illuminated). This is an ICX7150-48-4X10GR-RMT3 48-port unit with firmware version 08.0.95g.
This unit is being powered by a Geist power distribution box, which itself is UPS powered, so all power cords are ORANGE. A second switch is also being powered by the Geist, and both it and the Geist were powered up.
We disconnected the failed switch's power cord (as found it seemed to be well connected, not loose) and plugged it back in, but the switch did not light up. We then ran a longer power cord to the main rack power strips; again the switch did not light up. We then returned to the original power cord from the Geist.
While we were scratching our heads and planning the next move, we noticed the switch was lit up. We think it took at least a few minutes to show signs of life after its power cord was inserted.
We then let it boot up, which took a suspiciously long time: we expected something like 5 minutes, but it was actually close to 15 minutes. At that point network connectivity was restored to all the systems attached to this switch.
By then all the front ends had come back in their running state, except h1sush12 and h1asc0, which had persistent DAQ errors. We restarted these two front ends.
Many Guardian nodes were disconnected from the front ends, so we elected to reboot h1guardian1. It came back up with no problems.
Currently we do not know why this switch powered down. If it happens again our options are:
1) power it directly from the UPS rack power strip (i.e. not from the Geist) and give it plenty of time to show it is powering up
2) replace it with a spare switch
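
For option 1, "plenty of time" could be enforced with a simple polling loop rather than by watching the LEDs. The following is only a rough sketch, not something we ran: the function names, the 20 minute timeout, the 30 second interval and the idea of probing a TCP port on the switch's management address are all assumptions for illustration.

```python
import socket
import time
from typing import Callable, Optional


def tcp_probe(host: str, port: int, connect_timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds.

    host and port are placeholders; the switch's management address
    and an open port on it would have to be filled in.
    """
    try:
        with socket.create_connection((host, port), timeout=connect_timeout_s):
            return True
    except OSError:
        return False


def wait_for_device(
    probe: Callable[[], bool],
    timeout_s: float = 20 * 60,   # assumed: longer than the ~15 min boot we saw
    interval_s: float = 30.0,     # assumed polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll probe() until it returns True or timeout_s has elapsed.

    Returns the elapsed seconds at the first success, or None on timeout.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)
```

With a real address this would be called as wait_for_device(lambda: tcp_probe(host, port)), moving on to option 2 only if it returns None.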
We also do not know why h1sush12 and h1asc0 had DAQ errors when the network was restored. Among the corner systems, these two did have recent hardware changes. They also had model changes, but so did h1lsc0, which came back without DAQ errors.
Ticket FRS36424 has been opened for this issue.