Gico Posted January 3, 2019 (edited)

One drive has a red X. I downloaded diagnostics, then rebooted and downloaded a SMART report. How does the SMART look? I have a replacement drive on hand, but I would prefer to take my chances with the failed one if it looks OK. If so, what is the procedure for that?

juno-smart-20190103-1721.zip
juno-diagnostics-20190103-1709.zip
trurl Posted January 3, 2019

Diagnostics already include SMART for all disks, the syslog, and much more, so there is no need for the separate SMART report. The syslog is also showing some issues with the 1st cache disk. Just curious, why do you have 5 cache disks?

SMART for disk3 looks OK. Assuming you aren't getting any warnings for other array disks on the Dashboard page, you can check ALL connections and rebuild the disk to itself: https://wiki.unraid.net/index.php/Troubleshooting#What_do_I_do_if_I_get_a_red_X_next_to_a_hard_disk.3F

Not sure what, if anything, should be done about cache other than checking connections, but you can deal with that after the disk3 rebuild. Probably a good idea to quit writing to anything until everything is square again.
Gico Posted January 3, 2019 (Author)

Done, rebuild began. Thanks.

5 cache disks because they are small and still working. The cache is 3.8TB in total, and that space allows me some flexibility when seeding torrents. When I have bigger drives I will be able to reduce the number of cache drives.

I attached a SMART report because of this thread: "Disk dropped offline, so there's no SMART".

As for the 1st cache disk: the high CRC error count is from this event, when I had a faulty PSU that had random hiccups.
trurl Posted January 3, 2019

17 minutes ago, Gico said:
"I attached a SMART report because of this thread: 'Disk dropped offline, so there's no SMART'."

If you happen to notice your diagnostics are missing SMART for a disk, you can try to correct the problem causing it and get us a separate SMART file. And sometimes even if you get a SMART file it won't really contain anything, so another attempt will be needed. Generally, though, there is no need to post one separately: SMART for all disks is already in the diagnostics, and we often want to check other disks before making a recommendation anyway, so one SMART report may not be enough. We will ask for it when we see it is missing or incomplete.

22 minutes ago, Gico said:
"As for the 1st cache disk: the high CRC error count is from this event, when I had a faulty PSU that had random hiccups."

I wasn't talking about that. The syslog included in the current diagnostics you posted is full of lines complaining about that disk.
JorgeB Posted January 3, 2019

2 hours ago, trurl said:
"Syslog also showing some issues with 1st cache"

Not just the first one:

Dec 22 22:16:29 Juno kernel: BTRFS info (device sdc1): bdev /dev/sdc1 errs: wr 0, rd 1441, flush 0, corrupt 0, gen 0
Dec 22 22:16:29 Juno kernel: BTRFS info (device sdc1): bdev /dev/sdr1 errs: wr 48902, rd 50277, flush 36, corrupt 63, gen 58
Dec 22 22:16:29 Juno kernel: BTRFS info (device sdc1): bdev /dev/sde1 errs: wr 44260, rd 45177, flush 14, corrupt 0, gen 0

See here for more info on how to monitor pools: https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=700582
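Those per-device counters can also be read directly with `btrfs dev stats`, which is what the linked FAQ entry relies on. A minimal sketch, using the counter values from the syslog above as a stand-in for live output (on the server you would pipe the real `btrfs dev stats /mnt/cache` instead of the here-string):

```shell
# Hypothetical sample of `btrfs dev stats /mnt/cache` output, built from
# the counters in the syslog above; pipe the real command on the server.
stats='[/dev/sdc1].write_io_errs   0
[/dev/sdc1].read_io_errs    1441
[/dev/sdr1].write_io_errs   48902
[/dev/sdr1].read_io_errs    50277
[/dev/sdr1].corruption_errs 63
[/dev/sde1].write_io_errs   44260'

# Print only the nonzero counters -- these are the devices needing attention.
echo "$stats" | awk '$2 != 0 { print $1, $2 }'
```

Any counter that keeps incrementing after the cables are sorted points at a device that still has a problem.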
Gico Posted January 5, 2019 (edited)

Disk 3 is disabled again, and I'm having read errors on multiple disks. Same symptoms as with the faulty PSU in the previous event. This is frustrating. Corsair RM750x, low-power CPU. What are the odds of that happening again?

juno-diagnostics-20190105-2037.zip

Edit: Disk11 disabled too.
Edit2: Stopped the server, not fast enough: 4 disks disabled. I'll check the cables.
trurl Posted January 5, 2019

Check ALL connections, power and SATA. Cables should not be bundled and should have enough slack to allow the connectors to sit squarely on the connection. If a controller card is involved, also reseat it. Then post new diagnostics, since it looks like the latest set was taken before the additional disks were disabled.
Gico Posted January 5, 2019 (edited)

No controller card; all controllers are on board. The disabled disks (3, 7, 9, 11) are not connected with the same power and data cables. Disks 7 & 9 were also reported "missing". After a reboot they are back and OK. Attached diagnostics from before the shutdown.

juno-diagnostics-20190105-2057.zip
trurl Posted January 5, 2019

Diagnostics only show disk3 disabled. Post a screenshot of Main - Array Devices.
trurl Posted January 5, 2019

Diagnostics are in fact showing disk11 also disabled, but it isn't showing up in the SMART folder of the diagnostics, so it must not be responding. Check connections again, change cables, try another port, etc. If it isn't seen in the BIOS, Unraid won't be able to see it either, so check there before continuing with the boot.
JorgeB Posted January 5, 2019

The problem was with the LSI controller; update it to the latest firmware, 20.00.07.

P.S. Did you look into the link I posted above about the pool? You are still getting checksum errors; you need to run a scrub.
Gico Posted January 6, 2019

18 hours ago, trurl said:
"Diagnostics are in fact showing disk11 also disabled. But it isn't showing up in the SMART folder of diagnostics, so it must not be responding. Check connections again. change cables, try another port, etc. If it isn't seen in the BIOS Unraid won't be able to see it either, so check there before continuing with boot."

Attached the SMART report for disk 11. I rechecked the connections, but it is unlikely to be cabling/connections, as these disks are connected through separate cables.

18 hours ago, johnnie.black said:
"Problem was with the LSI controller, update it to latest firmware 20.00.07. P.S. Did you look into the link I posted above about the pool? Still getting checksum errors, you need to run a scrub."

Firmware updated. Diagnostics attached. The array is currently not started.

Cache pool scrub: OK, but only once the main array stabilizes, because a scrub takes a long time and prevents me from taking down the array, right?

Disk4's CRC error count is creeping upward at each boot: 577 (yesterday) --> 779 --> 782 --> 787 --> 817 --> 819 --> 820

What's the best course of action now?

juno-disk11-smart-20190106-1722.zip
juno-diagnostics-20190106-1721.zip
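That climbing counter is SMART attribute 199 (UDMA_CRC_Error_Count), whose raw value is the last field of the `smartctl -A` attribute line. A small sketch of pulling it out, with a hypothetical one-line excerpt standing in for real `smartctl -A /dev/sdX` output:

```shell
# Hypothetical excerpt of `smartctl -A` output for disk4; on the server
# you would pipe the real command instead of this sample line.
smart='199 UDMA_CRC_Error_Count   0x003e   200   200   000   Old_age   Always   -   820'

# Attribute 199 counts interface CRC errors; the raw value is the last field.
echo "$smart" | awk '$1 == 199 { print "CRC errors:", $NF }'
```

A steadily rising 199 raw value generally points at the SATA link (cable, backplane, or port) rather than the platters, which fits the cable-replacement advice in this thread.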
JorgeB Posted January 6, 2019

Replace that SATA cable and/or swap backplanes if they are in use, then rebuild the disabled disks. Make sure they are mounting before rebuilding on top of the old disks; you can also rebuild to newer disks to play it safer, in case something goes wrong during the rebuild.
Gico Posted January 6, 2019 (edited)

Replaced 2 data cables. Unassigned disk3 and disk11, started the array, stopped the array, assigned disk3, assigned a replacement disk as disk11, and started the array. It looks like they are being reconstructed (orange triangle), but they are marked as "Unmountable: No file system". Is that OK? Why aren't they emulated?

I also get these FCP errors, obviously because the disks are not emulated:

Jan 6 21:12:25 Juno root: Fix Common Problems: Error: Share Media has disk11 set in its included disk settings
Jan 6 21:12:25 Juno root: Fix Common Problems: Error: Share Temp has disk3 set in its included disk settings
Jan 6 21:12:25 Juno root: Fix Common Problems: Error: Share Temp has disk11 set in its included disk settings

Edit: OK, I missed "make sure they are mounting before rebuilding on top of the old disks". I will wait for the rebuild to end, and then what?
Gico Posted January 7, 2019 (edited)

Attached. Scrub results:

scrub status for cbff7a3a-fca4-4829-81a2-aea98601bbd9
    scrub started at Sun Jan 6 23:42:58 2019 and finished after 07:44:34
    total bytes scrubbed: 6.18TiB with 36183 errors
    error details: verify=1 csum=36182
    corrected errors: 36183, uncorrectable errors: 0, unverified errors: 0

Is there a cache disk that is causing these errors and should be replaced?

juno-diagnostics-20190107-0647.zip
JorgeB Posted January 7, 2019

SMART for the cache disks looks fine; the issues are more likely connection related. The scrub corrected all errors, so check/replace cables, reset the filesystem error counters, and keep monitoring as explained in the other link.

As for the unmountable disks, wait for the rebuild to finish and check the filesystem on them, though as mentioned this should have been done before rebuilding, especially when rebuilding on top of the old disks.

https://wiki.unraid.net/Check_Disk_Filesystems#Checking_and_fixing_drives_in_the_webGui
or
https://wiki.unraid.net/Check_Disk_Filesystems#Drives_formatted_with_XFS
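The "all corrected, none uncorrectable" reading can be checked mechanically against the scrub summary. A small sketch, parsing the exact summary posted above (the field names are standard `btrfs scrub status` output):

```shell
# The scrub summary from the post above, parsed to confirm that every
# detected error was corrected and none were uncorrectable.
summary='total bytes scrubbed: 6.18TiB with 36183 errors
error details: verify=1 csum=36182
corrected errors: 36183, uncorrectable errors: 0, unverified errors: 0'

corrected=$(echo "$summary" | sed -n 's/.*corrected errors: \([0-9]*\),.*/\1/p')
uncorrectable=$(echo "$summary" | sed -n 's/.*uncorrectable errors: \([0-9]*\),.*/\1/p')

if [ "$uncorrectable" -eq 0 ]; then
  echo "scrub OK: $corrected corrected, $uncorrectable uncorrectable"
fi
```

The device error counters are cumulative, so after fixing the cables they can be zeroed with `btrfs dev stats -z /mnt/cache` and then watched for new increments, as the linked FAQ entry describes.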
Gico Posted January 7, 2019

OK, thanks a lot for the help.

An old problem I have is that all shares disconnect for a few seconds, several times a day. When it happens I have no access to the shares from any PC on my network. Could that be the reason for the cache filesystem problems?

As for disk11, it is a replacement. Should I continue with this disk for now, or try the original (also unmountable) disk?
JorgeB Posted January 7, 2019

Wait for the rebuild to finish and see if the filesystems are fixable.
Gico Posted January 7, 2019

Disk3:

Phase 1 - find and verify superblock...
        - block cache size set to 720560 entries
sb root inode value 18446744073709551615 (NULLFSINO) inconsistent with calculated value 96
would reset superblock root inode pointer to 96
sb realtime bitmap inode 18446744073709551615 (NULLFSINO) inconsistent with calculated value 97
would reset superblock realtime bitmap ino pointer to 97
sb realtime summary inode 18446744073709551615 (NULLFSINO) inconsistent with calculated value 98
would reset superblock realtime summary ino pointer to 98
Phase 2 - using internal log
        - zero log...
zero_log: head block 1239298 tail block 1239298
        - scan filesystem freespace and inode maps...
sb_icount 0, counted 19840
sb_ifree 0, counted 4772
sb_fdblocks 1464608875, counted 195524130
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 5
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 1
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

Disk11:

Phase 1 - find and verify superblock...
bad primary superblock - bad CRC in superblock !!!
attempting to find secondary superblock...
.found candidate secondary superblock...
verified secondary superblock...
would write modified primary superblock
Primary superblock would have been modified.
Cannot proceed further in no_modify mode.
Exiting now.

Run repair on both disks?
Gico Posted January 8, 2019

FS repair was successful. Both disks were successfully mounted. Any way to know if something was lost? Thanks again for the help.

Disk3:

Phase 1 - find and verify superblock...
        - block cache size set to 720560 entries
sb root inode value 18446744073709551615 (NULLFSINO) inconsistent with calculated value 96
resetting superblock root inode pointer to 96
sb realtime bitmap inode 18446744073709551615 (NULLFSINO) inconsistent with calculated value 97
resetting superblock realtime bitmap ino pointer to 97
sb realtime summary inode 18446744073709551615 (NULLFSINO) inconsistent with calculated value 98
resetting superblock realtime summary ino pointer to 98
Phase 2 - using internal log
        - zero log...
zero_log: head block 1239300 tail block 1239300
        - scan filesystem freespace and inode maps...
sb_icount 0, counted 19840
sb_ifree 0, counted 4772
sb_fdblocks 1464608875, counted 195524130
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 5
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 1
Phase 5 - rebuild AG headers and trees...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...

Disk11:

Phase 1 - find and verify superblock...
bad primary superblock - bad CRC in superblock !!!
attempting to find secondary superblock...
.found candidate secondary superblock...
verified secondary superblock...
writing modified primary superblock
        - block cache size set to 690760 entries
sb root inode value 18446744073709551615 (NULLFSINO) inconsistent with calculated value 96
resetting superblock root inode pointer to 96
sb realtime bitmap inode 18446744073709551615 (NULLFSINO) inconsistent with calculated value 97
resetting superblock realtime bitmap ino pointer to 97
sb realtime summary inode 18446744073709551615 (NULLFSINO) inconsistent with calculated value 98
resetting superblock realtime summary ino pointer to 98
Phase 2 - using internal log
        - zero log...
zero_log: head block 295183 tail block 295179
ERROR: The filesystem has valuable metadata changes in a log which needs to be replayed. Mount the filesystem to replay the log, and unmount it before re-running xfs_repair. If you are unable to mount the filesystem, then use the -L option to destroy the log and attempt a repair. Note that destroying the log may cause corruption -- please attempt a mount of the filesystem before doing this.
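For reference, the two outcomes above map to different next steps: disk3's dry run (`xfs_repair -n`) just needed a real repair pass, while disk11's dirty log wanted a mount/unmount to replay it before repairing, with `-L` only as a last resort since it discards the log. A hedged sketch of that decision, keyed on the exact messages xfs_repair prints (the `next_step` helper itself is hypothetical):

```shell
# Hypothetical helper: given the tail of xfs_repair output, suggest the
# next step. The matched strings are ones xfs_repair actually prints.
next_step() {
  case "$1" in
    *"Mount the filesystem to replay the log"*)
      echo "mount and unmount the disk, then re-run xfs_repair" ;;
    *"No modify flag set"*)
      echo "dry run clean enough: re-run xfs_repair without -n" ;;
    *)
      echo "repair finished: mount and check lost+found" ;;
  esac
}

next_step "No modify flag set, skipping filesystem flush and exiting."   # disk3's case
next_step "ERROR: ... Mount the filesystem to replay the log ..."        # disk11's case
```

On Unraid the repair is normally run against the md device for the slot (e.g. /dev/md3 for disk3) with the array in maintenance mode, or via the webGUI check, so that parity stays in sync.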
JonathanM Posted January 8, 2019

1 hour ago, Gico said:
"Any way to know if something was lost?"

Compare with your backups.
JorgeB Posted January 8, 2019

Also look for the lost+found folder; any lost/partial files should be there.
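A quick sketch of that check, assuming Unraid's standard per-disk mount points (a temp directory with dummy orphans stands in for the real disks so the snippet is self-contained; xfs_repair names orphaned files after their inode numbers):

```shell
# Stand-in for /mnt/disk3 or /mnt/disk11; on the server, set DISK to the
# real mount point instead of creating a temp directory.
DISK=$(mktemp -d)
mkdir -p "$DISK/lost+found"
touch "$DISK/lost+found/131075" "$DISK/lost+found/262152"   # dummy orphans

# Count the orphaned files that need manual inspection.
find "$DISK/lost+found" -type f | wc -l
```

An empty (or absent) lost+found is a good sign that the repair did not orphan anything.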