H1 CDS
jonathan.hanks@LIGO.ORG - posted 10:56, Monday 08 December 2025 (88418)
Replaced a failed disk on cdsfs0

I replaced a failed disk on cdsfs0.  zpool status told us:

	  raidz3-1               DEGRADED     0     0     0
	    sdk                  ONLINE       0     0     0
	    3081545339777090432  OFFLINE      0     0     0  was /dev/sdq1
	    sdq                  ONLINE       0     0     0
	    sdn                  ONLINE       0     0     0

This hinted that the failed disk had been /dev/sdq. To identify the physical disk behind /dev/sdq I ran a continuous read against it (dd if=/dev/sdq of=/dev/null), which makes the drive's activity LED light up; that pointed to a disk in the caddy marked 1:17. I then told zfs to fail /dev/sdq1, and reads kept showing up on that disk (as identified by the LEDs).
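For reference, the LED-identification trick is just a bulk sequential read of the whole device, roughly:

	# continuous read; the drive's activity LED blinks while this runs
	# (options are illustrative; any long sequential read works)
	dd if=/dev/sdq of=/dev/null bs=1M status=progress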

To be safe I took the list of drives shown by zpool status and the list of drives listed by the OS (looking in /dev/disk/by-id). I then identified every disk on the system by doing a long read from each (to force the LED). There was a jump in the caddies from 1:15 to 1:17; nothing I could read lit up 1:16. After accounting for all the disks, I concluded the bad disk was the one in the 1:16 slot and pulled it. zpool status showed no other issues.
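A rough sketch of that cross-check (the exact commands aren't recorded here; this assumes all pool members show up as /dev/sd* nodes):

	# the pool's view of its member devices
	zpool status fs0pool
	# the OS's view, with persistent by-id names pointing at the sd* nodes
	ls -l /dev/disk/by-id/
	# force a long read from each disk so its activity LED identifies the caddy
	for d in /dev/sd[a-z]; do
	    dd if="$d" of=/dev/null bs=1M count=4096
	done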

After physically replacing the disk I had to create a GPT partition on it using parted.
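The exact parted invocation isn't recorded here; it was something along these lines, assuming the new disk came up as /dev/sdz:

	parted /dev/sdz mklabel gpt
	parted /dev/sdz mkpart primary 0% 100%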

Then I replaced it in the zpool:

zpool replace fs0pool 3081545339777090432 /dev/sdz

Now it is resilvering:

	  raidz3-1                 DEGRADED     0     0     0
	    sdk                    ONLINE       0     0     0
	    replacing-1            OFFLINE      0     0     0
	      3081545339777090432  OFFLINE      0     0     0  was /dev/sdq1
	      sdz                  ONLINE       0     0     0  (resilvering)
	    sdq                    ONLINE       0     0     0
	    sdn                    ONLINE       0     0     0
	    sdj                    ONLINE       0     0     0
	    sdm                    ONLINE       0     0     0

We need to retire this array. There are hints of problems on other disks.