Gico Posted January 3, 2019 (edited)

One drive has a red X. I downloaded diagnostics, then rebooted and downloaded a SMART report. How does the SMART look? I have a replacement drive on hand, but I would prefer to take my chances with the failed one if it looks OK. If so, what is the procedure for that?

juno-smart-20190103-1721.zip
juno-diagnostics-20190103-1709.zip
trurl Posted January 3, 2019

Diagnostics already include SMART for all disks, the syslog, and much more, so there is no need for the separate SMART report. The syslog is also showing some issues with the 1st cache disk. Just curious, why do you have 5 cache disks?

SMART for disk3 looks OK. Assuming you aren't getting any warnings for other array disks on the Dashboard page, you can check ALL connections and rebuild the disk to itself: https://wiki.unraid.net/index.php/Troubleshooting#What_do_I_do_if_I_get_a_red_X_next_to_a_hard_disk.3F

Not sure what, if anything, should be done about cache other than checking connections, but you can deal with that after the disk3 rebuild. Probably a good idea to quit writing to anything until everything is square again.
Gico Posted January 3, 2019 (Author)

Done, rebuild began. Thanks.

5 cache disks because they are small and still working. The cache is 3.8TB in total, and that space allows me some flexibility when seeding torrents. When I have bigger drives I will be able to reduce the number of cache drives.

I attached a SMART report because of this thread: "Disk dropped offline, so there's no SMART".

As for the 1st cache disk: the high CRC error count is from this event, when I had a faulty PSU that had random hiccups.
trurl Posted January 3, 2019

17 minutes ago, Gico said:
"I attached a SMART report because of this thread: 'Disk dropped offline, so there's no SMART'."

If you happen to notice your diagnostics are missing SMART for a disk, you can try to correct the problem causing it and get us a separate SMART file. And sometimes even if you get a SMART file it won't really contain anything, so another attempt will be needed. Generally, though, there is no need to post one separately: SMART for all disks is already in the diagnostics, and we often want to check other disks before making a recommendation anyway, so one SMART report may not be enough. We will ask for it when we see it is missing or incomplete.

22 minutes ago, Gico said:
"As for the 1st cache disk: the high CRC error count is from this event, when I had a faulty PSU that had random hiccups."

I wasn't talking about that. The syslog included in the current diagnostics you posted is full of lines complaining about that disk.
JorgeB Posted January 3, 2019

2 hours ago, trurl said:
"Syslog also showing some issues with 1st cache"

Not just the first one:

Dec 22 22:16:29 Juno kernel: BTRFS info (device sdc1): bdev /dev/sdc1 errs: wr 0, rd 1441, flush 0, corrupt 0, gen 0
Dec 22 22:16:29 Juno kernel: BTRFS info (device sdc1): bdev /dev/sdr1 errs: wr 48902, rd 50277, flush 36, corrupt 63, gen 58
Dec 22 22:16:29 Juno kernel: BTRFS info (device sdc1): bdev /dev/sde1 errs: wr 44260, rd 45177, flush 14, corrupt 0, gen 0

See here for more info on how to monitor pools: https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=700582
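Those per-device counters can also be read directly with `btrfs dev stats`, which is what the linked FAQ entry relies on. A minimal sketch, using the counter values from the syslog above as a stand-in for live output (on the server you would pipe the real `btrfs dev stats /mnt/cache` instead of the here-string):

```shell
# Hypothetical sample of `btrfs dev stats /mnt/cache` output, built from
# the counters in the syslog above; pipe the real command on the server.
stats='[/dev/sdc1].write_io_errs   0
[/dev/sdc1].read_io_errs    1441
[/dev/sdr1].write_io_errs   48902
[/dev/sdr1].read_io_errs    50277
[/dev/sdr1].corruption_errs 63
[/dev/sde1].write_io_errs   44260'

# Print only the nonzero counters -- these are the devices needing attention.
echo "$stats" | awk '$2 != 0 { print $1, $2 }'
```

Any counter that keeps incrementing after the cables are sorted points at a device that still has a problem.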
Gico Posted January 5, 2019 (edited)

Disk 3 is disabled again, and I'm having read errors on multiple disks. Same symptoms as with the faulty PSU in the previous event. This is frustrating. Corsair RM750x, low-power CPU. What are the odds of that happening again?

juno-diagnostics-20190105-2037.zip

Edit: Disk11 disabled too.
Edit2: Stopped the server, not fast enough: 4 disks disabled. I'll check the cables.
trurl Posted January 5, 2019

Check ALL connections, power and SATA. Cables should not be bundled and should have enough slack to allow the connectors to sit squarely on the connection. If a controller card is involved, also reseat it. Then post new diagnostics, since it looks like the latest set was taken before the additional disks were disabled.
Gico Posted January 5, 2019 (edited)

No controller card; all controllers are on board. The disabled disks (3, 7, 9, 11) are not connected with the same power and data cables. Disks 7 & 9 were also reported "missing". After a reboot they are back and OK. Attached diagnostics from before the shutdown.

juno-diagnostics-20190105-2057.zip
trurl Posted January 5, 2019

Diagnostics only show disk3 disabled. Post a screenshot of Main - Array Devices.
trurl Posted January 5, 2019

Diagnostics are in fact showing disk11 also disabled, but it isn't showing up in the SMART folder of the diagnostics, so it must not be responding. Check connections again, change cables, try another port, etc. If it isn't seen in the BIOS, Unraid won't be able to see it either, so check there before continuing with the boot.
JorgeB Posted January 5, 2019

The problem was with the LSI controller; update it to the latest firmware, 20.00.07.

P.S. Did you look into the link I posted above about the pool? You are still getting checksum errors; you need to run a scrub.
Gico Posted January 6, 2019

18 hours ago, trurl said:
"Diagnostics are in fact showing disk11 also disabled. But it isn't showing up in the SMART folder of diagnostics, so it must not be responding. Check connections again. change cables, try another port, etc. If it isn't seen in the BIOS Unraid won't be able to see it either, so check there before continuing with boot."

Attached the SMART report for disk 11. I rechecked the connections, but it is unlikely to be cabling/connections, as these disks are connected through separate cables.

18 hours ago, johnnie.black said:
"Problem was with the LSI controller, update it to latest firmware 20.00.07. P.S. Did you look into the link I posted above about the pool? Still getting checksum errors, you need to run a scrub."

Firmware updated. Diagnostics attached. The array is currently not started.

Cache pool scrub: OK, but only once the main array stabilizes, because a scrub takes a long time and prevents me from taking down the array, right?

Disk4's CRC error count is creeping upward at each boot: 577 (yesterday) --> 779 --> 782 --> 787 --> 817 --> 819 --> 820

What's the best course of action now?

juno-disk11-smart-20190106-1722.zip
juno-diagnostics-20190106-1721.zip
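That climbing counter is SMART attribute 199 (UDMA_CRC_Error_Count), whose raw value is the last field of the `smartctl -A` attribute line. A small sketch of pulling it out, with a hypothetical one-line excerpt standing in for real `smartctl -A /dev/sdX` output:

```shell
# Hypothetical excerpt of `smartctl -A` output for disk4; on the server
# you would pipe the real command instead of this sample line.
smart='199 UDMA_CRC_Error_Count   0x003e   200   200   000   Old_age   Always   -   820'

# Attribute 199 counts interface CRC errors; the raw value is the last field.
echo "$smart" | awk '$1 == 199 { print "CRC errors:", $NF }'
```

A steadily rising 199 raw value generally points at the SATA link (cable, backplane, or port) rather than the platters, which fits the cable-replacement advice in this thread.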
JorgeB Posted January 6, 2019

Replace that SATA cable and/or swap backplanes if they are in use, then rebuild the disabled disks. Make sure they are mounting before rebuilding on top of the old disks; you can also rebuild to newer disks to play it safer, in case something goes wrong during the rebuild.
Gico Posted January 6, 2019 (edited)

Replaced 2 data cables. Unassigned disk3 and disk11, started the array, stopped the array, assigned disk3, assigned a replacement disk as disk11, and started the array. It looks like they are being reconstructed (orange triangle), but they are marked as "Unmountable: No file system". Is that OK? Why aren't they emulated?

I also get these FCP errors, obviously because the disks are not emulated:

Jan 6 21:12:25 Juno root: Fix Common Problems: Error: Share Media has disk11 set in its included disk settings
Jan 6 21:12:25 Juno root: Fix Common Problems: Error: Share Temp has disk3 set in its included disk settings
Jan 6 21:12:25 Juno root: Fix Common Problems: Error: Share Temp has disk11 set in its included disk settings

Edit: OK, I missed "make sure they are mounting before rebuilding on top of the old disks". I will wait for the rebuild to end, and then what?
Gico Posted January 7, 2019 (edited)

Attached. Scrub results:

scrub status for cbff7a3a-fca4-4829-81a2-aea98601bbd9
    scrub started at Sun Jan 6 23:42:58 2019 and finished after 07:44:34
    total bytes scrubbed: 6.18TiB with 36183 errors
    error details: verify=1 csum=36182
    corrected errors: 36183, uncorrectable errors: 0, unverified errors: 0

Is there a cache disk that is causing these errors and should be replaced?

juno-diagnostics-20190107-0647.zip
JorgeB Posted January 7, 2019

SMART for the cache disks looks fine; the issues are more likely connection related. The scrub corrected all errors, so check/replace cables, reset the filesystem error counters, and keep monitoring as explained in the other link.

As for the unmountable disks, wait for the rebuild to finish and check the filesystem on them, though as mentioned this should have been done before rebuilding, especially when rebuilding on top of the old disks.

https://wiki.unraid.net/Check_Disk_Filesystems#Checking_and_fixing_drives_in_the_webGui
or
https://wiki.unraid.net/Check_Disk_Filesystems#Drives_formatted_with_XFS
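The "all corrected, none uncorrectable" reading can be checked mechanically against the scrub summary. A small sketch, parsing the exact summary posted above (the field names are standard `btrfs scrub status` output):

```shell
# The scrub summary from the post above, parsed to confirm that every
# detected error was corrected and none were uncorrectable.
summary='total bytes scrubbed: 6.18TiB with 36183 errors
error details: verify=1 csum=36182
corrected errors: 36183, uncorrectable errors: 0, unverified errors: 0'

corrected=$(echo "$summary" | sed -n 's/.*corrected errors: \([0-9]*\),.*/\1/p')
uncorrectable=$(echo "$summary" | sed -n 's/.*uncorrectable errors: \([0-9]*\),.*/\1/p')

if [ "$uncorrectable" -eq 0 ]; then
  echo "scrub OK: $corrected corrected, $uncorrectable uncorrectable"
fi
```

The device error counters are cumulative, so after fixing the cables they can be zeroed with `btrfs dev stats -z /mnt/cache` and then watched for new increments, as the linked FAQ entry describes.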
Gico Posted January 7, 2019

OK, thanks a lot for the help.

An old problem I have is that all shares disconnect for a few seconds, several times a day. When it happens I have no access to the shares from any PC on my network. Could that be the reason for the cache filesystem problems?

As for disk11, it is a replacement. Should I continue with this disk for now, or try the original (also unmountable) disk?
JorgeB Posted January 7, 2019

Wait for the rebuild to finish and see if the filesystems are fixable.
Gico Posted January 7, 2019

Disk3:

Phase 1 - find and verify superblock...
        - block cache size set to 720560 entries
sb root inode value 18446744073709551615 (NULLFSINO) inconsistent with calculated value 96
would reset superblock root inode pointer to 96
sb realtime bitmap inode 18446744073709551615 (NULLFSINO) inconsistent with calculated value 97
would reset superblock realtime bitmap ino pointer to 97
sb realtime summary inode 18446744073709551615 (NULLFSINO) inconsistent with calculated value 98
would reset superblock realtime summary ino pointer to 98
Phase 2 - using internal log
        - zero log...
zero_log: head block 1239298 tail block 1239298
        - scan filesystem freespace and inode maps...
sb_icount 0, counted 19840
sb_ifree 0, counted 4772
sb_fdblocks 1464608875, counted 195524130
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 5
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 1
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

Disk11:

Phase 1 - find and verify superblock...
bad primary superblock - bad CRC in superblock !!!
attempting to find secondary superblock...
.found candidate secondary superblock...
verified secondary superblock...
would write modified primary superblock
Primary superblock would have been modified.
Cannot proceed further in no_modify mode.
Exiting now.

Run repair on both disks?
Gico Posted January 8, 2019

FS repair was successful. Both disks were successfully mounted. Any way to know if something was lost? Thanks again for the help.

Disk3:

Phase 1 - find and verify superblock...
        - block cache size set to 720560 entries
sb root inode value 18446744073709551615 (NULLFSINO) inconsistent with calculated value 96
resetting superblock root inode pointer to 96
sb realtime bitmap inode 18446744073709551615 (NULLFSINO) inconsistent with calculated value 97
resetting superblock realtime bitmap ino pointer to 97
sb realtime summary inode 18446744073709551615 (NULLFSINO) inconsistent with calculated value 98
resetting superblock realtime summary ino pointer to 98
Phase 2 - using internal log
        - zero log...
zero_log: head block 1239300 tail block 1239300
        - scan filesystem freespace and inode maps...
sb_icount 0, counted 19840
sb_ifree 0, counted 4772
sb_fdblocks 1464608875, counted 195524130
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 5
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 1
Phase 5 - rebuild AG headers and trees...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...

Disk11:

Phase 1 - find and verify superblock...
bad primary superblock - bad CRC in superblock !!!
attempting to find secondary superblock...
.found candidate secondary superblock...
verified secondary superblock...
writing modified primary superblock
        - block cache size set to 690760 entries
sb root inode value 18446744073709551615 (NULLFSINO) inconsistent with calculated value 96
resetting superblock root inode pointer to 96
sb realtime bitmap inode 18446744073709551615 (NULLFSINO) inconsistent with calculated value 97
resetting superblock realtime bitmap ino pointer to 97
sb realtime summary inode 18446744073709551615 (NULLFSINO) inconsistent with calculated value 98
resetting superblock realtime summary ino pointer to 98
Phase 2 - using internal log
        - zero log...
zero_log: head block 295183 tail block 295179
ERROR: The filesystem has valuable metadata changes in a log which needs to be replayed. Mount the filesystem to replay the log, and unmount it before re-running xfs_repair. If you are unable to mount the filesystem, then use the -L option to destroy the log and attempt a repair. Note that destroying the log may cause corruption -- please attempt a mount of the filesystem before doing this.
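For reference, the two outcomes above map to different next steps: disk3's dry run (`xfs_repair -n`) just needed a real repair pass, while disk11's dirty log wanted a mount/unmount to replay it before repairing, with `-L` only as a last resort since it discards the log. A hedged sketch of that decision, keyed on the exact messages xfs_repair prints (the `next_step` helper itself is hypothetical):

```shell
# Hypothetical helper: given the tail of xfs_repair output, suggest the
# next step. The matched strings are ones xfs_repair actually prints.
next_step() {
  case "$1" in
    *"Mount the filesystem to replay the log"*)
      echo "mount and unmount the disk, then re-run xfs_repair" ;;
    *"No modify flag set"*)
      echo "dry run clean enough: re-run xfs_repair without -n" ;;
    *)
      echo "repair finished: mount and check lost+found" ;;
  esac
}

next_step "No modify flag set, skipping filesystem flush and exiting."   # disk3's case
next_step "ERROR: ... Mount the filesystem to replay the log ..."        # disk11's case
```

On Unraid the repair is normally run against the md device for the slot (e.g. /dev/md3 for disk3) with the array in maintenance mode, or via the webGUI check, so that parity stays in sync.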
JonathanM Posted January 8, 2019

1 hour ago, Gico said:
"Any way to know if something was lost?"

Compare with your backups.
JorgeB Posted January 8, 2019

Also look for the lost+found folder; any lost/partial files should be there.
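A quick sketch of that check, assuming Unraid's standard per-disk mount points (a temp directory with dummy orphans stands in for the real disks so the snippet is self-contained; xfs_repair names orphaned files after their inode numbers):

```shell
# Stand-in for /mnt/disk3 or /mnt/disk11; on the server, set DISK to the
# real mount point instead of creating a temp directory.
DISK=$(mktemp -d)
mkdir -p "$DISK/lost+found"
touch "$DISK/lost+found/131075" "$DISK/lost+found/262152"   # dummy orphans

# Count the orphaned files that need manual inspection.
find "$DISK/lost+found" -type f | wc -l
```

An empty (or absent) lost+found is a good sign that the repair did not orphan anything.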