[SOLVED] Wrong disk in slot after rebuild/reboot



Hey all, 

 

Recently one of the disks in my array started throwing errors. I had been consolidating the array with the Unbalance plugin to free up space, which was taking forever because I have a lot of SMR drives. While I was moving data around, I got this alert:

 

Alert [MEDIA] - Disk 4 in error state (disk dsbl)          ST8000AS0002-1NA17Z_Z840XT2C (sdk)

 

This led me down a long road of trying to rebuild parity, then deciding that I should swap the disk with a spare from another system. I replaced that drive with WDC_WD80EFZX-68UW8N0_VKGW28LX and rebuilt the array. While rebuilding, I received another error saying that Disk 1, an ST8000AS0002-1NA17Z_Z840X72E, was also experiencing errors.

 

The array finished the rebuild/replace for the Disk 4 failure, but we had a heat wave roll through, so before I could look into everything I shut down the system so it wouldn't overheat. I also bought a new drive to replace Disk 1, and it just came in.

 

I started up the server, getting ready to replace Disk 1, when I saw that the system had tried to use my old Disk 4 ST8000AS0002-1NA17Z_Z840XT2C instead of the new Disk 4 WDC_WD80EFZX-68UW8N0_VKGW28LX.

 

When I tried to assign the correct drive, the one the array had rebuilt onto, Unraid said it was the wrong disk. I didn't want to start anything without finding out what I should do.

 

SMART data shows:

 

(Old Disk 4) 

Current pending sector     2392

Offline uncorrectable        2392

 

(New Disk 4)

no errors

 

(Disk 1)

Reported uncorrect         7

Current pending sector   24

Offline uncorrectable       24

 

I have the array offline while I try to figure out what has happened. Any advice would be greatly appreciated.

 

 

media-diagnostics-20210816-1407.zip

Edited by Alex.vision
Link to comment

Since you rebooted, I can't see anything in the syslog about the rebuild you said you did. Do you have earlier diagnostics or a syslog that shows the rebuild?

 

That's a lot of disks for only one parity. And apparently you had 2 disks that you probably should have already replaced instead of getting to a point where you needed to replace both.

 

That's more disks than I want to look at SMART reports for. Do you have any SMART warnings on the Dashboard page for any other disks?

 

Do you have Notifications set up to alert you immediately by email or other agent as soon as a problem is detected?

 

Do you have backups of everything important and irreplaceable?

Link to comment

Normally I have the syslog mirroring to another computer, but I don't see information in it about the rebuild during that time. I am attaching it anyway, though I don't really know what I'm looking for.

 

I know this is a lot of drives for one parity. I was running dual parity a long time ago, but I had a backplane fry the controller on 3 data drives and one of the parity drives, and since then I have been working on building a complete backup system for my entire array. I have offloaded about half the array onto 2 new servers that will act as a pair. I still want to make a backup of the data on this system, but having just shelled out $2500 for the other pair, I can't afford to buy another 100TB of drives during this Chia mining thing. I was hoping to make it just a little longer until prices came down.

 

 

Timeline 

07-03-21

Last scheduled parity check (Completed with no errors)

 

07-15-21

Read Errors on Disk 4

Rebuild disk 4

Pending Sector Errors Disk 4

 

08-08-21

Replace Disk 4 with new disk

Rebuild New Disk 4

Parity Finished rebuild with 384 errors

Read Errors Disk 1

Shutdown array to replace Disk 1

 

08-16

New Disk 1 arrives, start server to begin replacement for Disk 1

Array list places old Disk 4 in array instead of New Disk 4.

 

Yeah, I can understand not wanting to look through all those SMART reports, there are a bunch of them :). I don't have any other SMART warnings on the main page, though I may have marked them as "Acknowledge" so they are not showing.

 

I am also attaching a text file of my archived notifications from that time period.

 

I do get emails as soon as a problem is detected.  

 

I don't have a complete backup of this server, it is mostly media, so I could get it all back, eventually. Though I really don't want to re-rip 100TB worth of data, ouch.

(Media) - Archived Notifications.txt alex.vision syslog Aug 10-12.txt

Link to comment
Aug 10 20:20:16 Media kernel: md: import disk4: (sdab) WDC_WD80EFZX-68UW8N0_VKGW28LX size: 7814026532 
...
Aug 10 20:21:44 Media kernel: md: recovery thread: recon D4 ...
...
Aug 12 05:25:25 Media kernel: md: disk1 read error, sector=13152692720
Aug 12 05:25:25 Media kernel: md: recovery thread: multiple disk errors, sector=13152692520
Aug 12 05:25:25 Media kernel: md: disk1 read error, sector=13152692728

There are lots of gaps in that syslog. The disk4 rebuild onto the new disk began Aug 10, but I can't tell whether it ever completed.
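For anyone who wants to pull the same md events out of a longer syslog, here is a minimal Python sketch. It assumes the message formats are exactly those in the excerpt above; the `summarize` helper and the regexes are only illustrative, not part of Unraid.

```python
import re
from collections import Counter

# Lines copied from the syslog excerpt in this post.
SYSLOG = """\
Aug 10 20:20:16 Media kernel: md: import disk4: (sdab) WDC_WD80EFZX-68UW8N0_VKGW28LX size: 7814026532
Aug 10 20:21:44 Media kernel: md: recovery thread: recon D4 ...
Aug 12 05:25:25 Media kernel: md: disk1 read error, sector=13152692720
Aug 12 05:25:25 Media kernel: md: recovery thread: multiple disk errors, sector=13152692520
Aug 12 05:25:25 Media kernel: md: disk1 read error, sector=13152692728
"""

# Patterns assumed from the lines above; other Unraid versions may log differently.
READ_ERROR = re.compile(r"md: (disk\d+) read error, sector=(\d+)")
RECON = re.compile(r"md: recovery thread: recon (\S+)")


def summarize(text):
    """Count read errors per disk and list the rebuild targets that were logged."""
    errors = Counter()
    recon_targets = []
    for line in text.splitlines():
        match = READ_ERROR.search(line)
        if match:
            errors[match.group(1)] += 1
            continue
        match = RECON.search(line)
        if match:
            recon_targets.append(match.group(1))
    return errors, recon_targets


errors, recon_targets = summarize(SYSLOG)
print(dict(errors), recon_targets)
```

On a complete log this would show which disk was being reconstructed and which disks were throwing read errors, but it can't fill in the gaps.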

 

And another rebuild seems to be happening Aug 12 with lots of errors, but there isn't enough to tell which disk was being rebuilt, whether it completed, or even what the assignments were.

 

Did you have any problems with the flash drive during any of this?

Link to comment

No problems with the flash drive during any of this time. I think the syslog server was not running the whole time that the Unraid server was, so the log is incomplete. From the timeline, I know the new Disk 4 was being rebuilt on Aug 12, and it was within roughly the last 2 hours of the rebuild that errors started showing for Disk 1. I was monitoring the end of the rebuild more closely than any other part. I want to say it had completed 98.2+% of the rebuild before any notices about issues with Disk 1 appeared.

 

So moving forward, should I tell Unraid to rebuild Disk 4 onto the new one again, hope that Disk 1 holds out, then replace Disk 1 with the new one I just ordered?

 

or

 

Replace Disk 1 and hope the old disk 4 holds out, then replace it?

 

Since starting this whole thing, I have left the mover script off. I don't think I have written anything to the array, but I'm not 100% sure. If I have written anything it was just some media that can be lost. I'd like to avoid losing an entire drive, but a few media files hopefully won't be missed.

Edited by Alex.vision
Link to comment
12 hours ago, Alex.vision said:

If I have written anything it was just some media that can be lost.

The problem isn't really about any files that can be lost. Parity is updated whenever you write to the array, so if you wrote anything after replacing, parity will no longer be in sync with the old disks you replaced.
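To make that concrete, here is a toy Python sketch of single (XOR) parity. The disk contents and the helper names `xor_parity` and `rebuild` are made up for illustration; this is not how Unraid's md driver is actually written.

```python
from functools import reduce


def xor_parity(blocks):
    """Byte-wise XOR across equally sized blocks, as in single parity."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(parity, surviving):
    """Reconstruct the one missing block from parity plus the surviving blocks."""
    return xor_parity([parity, *surviving])


# Toy 3-disk array; the contents are invented for illustration.
disk1, disk2, old_disk4 = b"AAAA", b"BBBB", b"DDDD"
parity = xor_parity([disk1, disk2, old_disk4])

# While parity is in sync, a rebuild reproduces the original disk exactly.
assert rebuild(parity, [disk1, disk2]) == old_disk4

# The old disk4 is pulled and a write then lands on the emulated disk4.
# Only parity is updated; the pulled physical disk never sees the write.
current_disk4 = b"DDXX"
parity = xor_parity([disk1, disk2, current_disk4])

rebuilt = rebuild(parity, [disk1, disk2])
print(rebuilt == current_disk4, rebuilt == old_disk4)
```

After the write, parity describes the new contents, so putting the old physical disk back and trusting parity would leave the two disagreeing in every block that changed.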

 

Can you unassign disk4 then start the array? If so, post new diagnostics and a screenshot of Main - Array Devices.

Link to comment

OK, emulated disk4 is mounted, you can take a look at its files just as if it was really there. The emulated contents are what would be rebuilt if you did the rebuild again.

 

What I really wanted to see, though, is that the original and the replacement for disk4 are both showing in Unassigned Devices. If you click the Settings (gears) icon for each, you should be able to set them to Read-only, and then if they mount, you can also look at the files on each of them.

 

See if you can decide which of these, emulated, original, replacement, looks most correct.

 

If the replacement looks good as an Unassigned Device, maybe we can use it as disk4 in a New Config with Parity Valid and then try to rebuild disk1 to a new disk. And, since you will have original disk4 and original disk1, if there are any problems maybe files can be copied from those.

Link to comment

OK, I will mount both drives in Unassigned Devices and look at their contents to see whether they look good. Are you thinking I should compare the drives with something like Krusader, or should a quick look at the folder structure and layout with Midnight Commander be sufficient? I disabled SMB, Docker and VMs before starting the array as I wanted to minimize anything writing to the array. If I should use Krusader, I will need to stop the array and re-enable Docker.

 

Thanks for all the help!!

Link to comment

I ran into problems trying to mount the new Disk 4.  This was in my syslog.

 

Aug 17 10:18:18 Media unassigned.devices: Adding disk '/dev/mapper/WDC_WD80EFZX-68UW8N0_VKGW28LX'...
Aug 17 10:18:18 Media emhttpd: shcmd (69093): /usr/sbin/cryptsetup luksOpen /dev/sdab1 WDC_WD80EFZX-68UW8N0_VKGW28LX
Aug 17 10:18:18 Media unassigned.devices: Mount drive command: /sbin/mount -o rw,noatime,nodiratime '/dev/mapper/WDC_WD80EFZX-68UW8N0_VKGW28LX' '/mnt/disks/WDC_WD80EFZX-68UW8N0_VKGW28LX'
Aug 17 10:18:18 Media kernel: XFS (dm-28): Filesystem has duplicate UUID 55c3e94c-c28a-4132-adb2-43d2ff58c702 - can't mount
Aug 17 10:18:18 Media unassigned.devices: Mount of '/dev/mapper/WDC_WD80EFZX-68UW8N0_VKGW28LX' failed: 'mount: /mnt/disks/WDC_WD80EFZX-68UW8N0_VKGW28LX: wrong fs type, bad option, bad superblock on /dev/mapper/WDC_WD80EFZX-68UW8N0_VKGW28LX, missing codepage or helper program, or other error. '
Aug 17 10:18:18 Media unassigned.devices: Partition 'WDC_WD80EFZX-68UW8N0_VKGW28LX' cannot be mounted.

 

 

Looks like it can't be mounted.

Link to comment
6 minutes ago, Alex.vision said:

it can't be mounted.

So either the rebuild never completed or, at least, the result wasn't good, maybe due to the disk1 problems.

 

Can you mount original disk4?

 

Disk1

Serial Number:    Z840X72E
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
187 Reported_Uncorrect      -O--CK   093   093   000    -    7
197 Current_Pending_Sector  -O--C-   100   100   000    -    24
198 Offline_Uncorrectable   ----C-   100   100   000    -    24
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

 

Original disk4

Serial Number:    Z840XT2C
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
197 Current_Pending_Sector  -O--C-   093   045   000    -    2392
198 Offline_Uncorrectable   ----C-   093   045   000    -    2392
No self-tests have been logged.  [To run self-tests, use: smartctl -t]
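If you want to check many disks without reading each report by hand, something like the following Python sketch could flag the attributes that matter here. It assumes the column layout shown above (ID, name, flags, value, worst, threshold, fail, raw value); `WATCHED` and `nonzero_watched` are hypothetical names, and the choice of attributes to watch is mine.

```python
# Attribute rows copied from the two reports above.
DISK1_ROWS = """\
187 Reported_Uncorrect      -O--CK   093   093   000    -    7
197 Current_Pending_Sector  -O--C-   100   100   000    -    24
198 Offline_Uncorrectable   ----C-   100   100   000    -    24
"""

ORIGINAL_DISK4_ROWS = """\
197 Current_Pending_Sector  -O--C-   093   045   000    -    2392
198 Offline_Uncorrectable   ----C-   093   045   000    -    2392
"""

# Attributes treated as warning signs in this sketch (an assumption, not an official list).
WATCHED = {"Reported_Uncorrect", "Current_Pending_Sector", "Offline_Uncorrectable"}


def nonzero_watched(text):
    """Return {attribute_name: raw_value} for watched attributes with a raw value above zero."""
    found = {}
    for line in text.splitlines():
        fields = line.split()
        # Expect: ID NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
        if len(fields) < 8 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[7]
        if name in WATCHED and raw.isdigit() and int(raw) > 0:
            found[name] = int(raw)
    return found


for label, rows in (("disk1", DISK1_ROWS), ("original disk4", ORIGINAL_DISK4_ROWS)):
    print(label, nonzero_watched(rows))
```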

 

Let me know if original disk4 mounts.

 

Run an extended SMART test on disk1 and original disk4.

Link to comment

Hmm, I seem to be getting the same error on the old disk4.

 

Aug 17 10:43:51 Media unassigned.devices: Adding disk '/dev/mapper/ST8000AS0002-1NA17Z_Z840XT2C'...
Aug 17 10:43:51 Media emhttpd: shcmd (69098): /usr/sbin/cryptsetup luksOpen /dev/sdk1 ST8000AS0002-1NA17Z_Z840XT2C
Aug 17 10:43:52 Media unassigned.devices: Mount drive command: /sbin/mount -o rw,noatime,nodiratime '/dev/mapper/ST8000AS0002-1NA17Z_Z840XT2C' '/mnt/disks/ST8000AS0002-1NA17Z_Z840XT2C'
Aug 17 10:43:52 Media kernel: XFS (dm-28): Filesystem has duplicate UUID 55c3e94c-c28a-4132-adb2-43d2ff58c702 - can't mount
Aug 17 10:43:52 Media unassigned.devices: Mount of '/dev/mapper/ST8000AS0002-1NA17Z_Z840XT2C' failed: 'mount: /mnt/disks/ST8000AS0002-1NA17Z_Z840XT2C: wrong fs type, bad option, bad superblock on /dev/mapper/ST8000AS0002-1NA17Z_Z840XT2C, missing codepage or helper program, or other error. '
Aug 17 10:43:52 Media unassigned.devices: Partition 'ST8000AS0002-1NA17Z_Z840XT2C' cannot be mounted.

 

I will run an extended test on disk1 now.

Link to comment

Also just thought I would ask, is it strange that right when attempting to mount both old disk4 and new disk4 with Unassigned devices there would be this line in the syslog..

 

Media kernel: XFS (dm-28): Filesystem has duplicate UUID 55c3e94c-c28a-4132-adb2-43d2ff58c702 - can't mount

 

I didn't notice it at first but is that saying that both devices have the same UUID?

 

I'm not sure what I'm reading in the syslog, so it may be nothing at all.

 

Link to comment
3 minutes ago, Alex.vision said:

is it strange that right when attempting to mount both old disk4 and new disk4 with Unassigned devices there would be this line in the syslog..

No, that's normal. They are the same filesystem, so they have the same UUID, and you can't have both mounted at the same time. Note that the UUID can be changed in UD if needed.
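A quick way to confirm this from the syslog is to pair each UD mount attempt with the UUID the kernel complained about. The Python sketch below assumes the two message formats seen in the excerpts earlier in this thread; `duplicate_uuid_attempts` is a hypothetical helper, not part of UD.

```python
import re

# Lines copied from the two failed mount attempts posted earlier in the thread.
SYSLOG = """\
Aug 17 10:18:18 Media unassigned.devices: Adding disk '/dev/mapper/WDC_WD80EFZX-68UW8N0_VKGW28LX'...
Aug 17 10:18:18 Media kernel: XFS (dm-28): Filesystem has duplicate UUID 55c3e94c-c28a-4132-adb2-43d2ff58c702 - can't mount
Aug 17 10:43:51 Media unassigned.devices: Adding disk '/dev/mapper/ST8000AS0002-1NA17Z_Z840XT2C'...
Aug 17 10:43:52 Media kernel: XFS (dm-28): Filesystem has duplicate UUID 55c3e94c-c28a-4132-adb2-43d2ff58c702 - can't mount
"""

ADDING = re.compile(r"unassigned\.devices: Adding disk '([^']+)'")
DUPLICATE = re.compile(r"Filesystem has duplicate UUID ([0-9a-f-]{36})")


def duplicate_uuid_attempts(text):
    """Map each rejected UUID to the devices UD was adding when it was rejected."""
    attempts = {}
    last_device = None
    for line in text.splitlines():
        match = ADDING.search(line)
        if match:
            last_device = match.group(1)
            continue
        match = DUPLICATE.search(line)
        if match and last_device is not None:
            attempts.setdefault(match.group(1), []).append(last_device)
    return attempts


for uuid, devices in duplicate_uuid_attempts(SYSLOG).items():
    print(uuid, devices)
```

Both disks show up under the same UUID, which is what you would expect when one is a rebuild of the other. Outside UD, XFS also has a `nouuid` mount option that skips this check; whether that is appropriate here is a judgment call.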

Link to comment
13 minutes ago, Alex.vision said:

waiting for the Extended SMART report on old disk4 to finish, though I don't know if you still want it after UD couldn't mount it.

You should try new disk4 again after taking care of the duplicate UUID. Then you can try old disk4 again after the test completes.

 

 

Link to comment

Trurl, just so I'm clear on what you're asking, you want me to change the UUID of new disk4, then try to mount it with UD?

 

Edit: Does it matter that the disks in my array are encrypted? Do I need to add the encryption passcode to each drive in UD, or will they be unlocked with the array's passcode since it is the same?

Edited by Alex.vision
Link to comment

OK, using the Change Disk UUID option in UD, I selected the new disk4 drive and clicked Change UUID. The syslog reported:

 

Aug 17 15:03:00 Media unassigned.devices: Changing disk 'sda' UUID. Result:

  

I waited about a minute before attempting to mount the device again in UD, but it didn't mount.

 

Aug 17 15:03:00 Media unassigned.devices: Changing disk 'sda' UUID. Result:
Aug 17 15:03:57 Media unassigned.devices: Adding disk '/dev/mapper/WDC_WD80EFZX-68UW8N0_VKGW28LX'...
Aug 17 15:03:57 Media emhttpd: shcmd (202422): /usr/sbin/cryptsetup luksOpen /dev/sdab1 WDC_WD80EFZX-68UW8N0_VKGW28LX
Aug 17 15:03:58 Media unassigned.devices: Mount drive command: /sbin/mount -o rw,noatime,nodiratime '/dev/mapper/WDC_WD80EFZX-68UW8N0_VKGW28LX' '/mnt/disks/WDC_WD80EFZX-68UW8N0_VKGW28LX'
Aug 17 15:03:58 Media kernel: XFS (dm-28): Filesystem has duplicate UUID 55c3e94c-c28a-4132-adb2-43d2ff58c702 - can't mount
Aug 17 15:03:58 Media unassigned.devices: Mount of '/dev/mapper/WDC_WD80EFZX-68UW8N0_VKGW28LX' failed: 'mount: /mnt/disks/WDC_WD80EFZX-68UW8N0_VKGW28LX: wrong fs type, bad option, bad superblock on /dev/mapper/WDC_WD80EFZX-68UW8N0_VKGW28LX, missing codepage or helper program, or other error. '
Aug 17 15:03:58 Media unassigned.devices: Partition 'WDC_WD80EFZX-68UW8N0_VKGW28LX' cannot be mounted.

 

It still seems to be identified by the same duplicate UUID. 

 

Would it be worth trying to mount it in one of my other Unraid servers and bypass the duplicate UUID issue entirely? I can do that in an hour or two when I get off work.

Link to comment
  • Alex.vision changed the title to [SOLVED] Wrong disk in slot after rebuild/reboot
