DannyG Posted April 20, 2020
I found two red X's in my array tonight: Disk 10 and Disk 12 (of 20). Disk 10 has data on it; Disk 12 is still empty. If the disks are defective, I can either remove them or replace them with smaller ones.

These disks are part of my NetApp shelf, which I've only owned for about two months now, but they have "disappeared" from my array before (what I would notice is all new shows missing from my Plex). A reboot usually brought everything back up, but this is the first time I've seen an actual error like this.

I removed Disk 10 and rebooted the server; that didn't do much. I put the disk back in, and now it's doing a "parity-sync/data rebuild". Not sure if this will cause data loss. What should I do?

tower-diagnostics-20200419-2208.zip
trurl Posted April 20, 2020
Rebuilding an unmountable filesystem usually results in an unmountable filesystem, so the filesystem will probably have to be repaired after the rebuild. If you had asked before starting the rebuild, we could have tried to repair the emulated filesystem before rebuilding it, and maybe we would have had other options at that point.

13 minutes ago, DannyG said: "replace them with smaller ones"
Replacing/rebuilding a disk with a smaller disk isn't possible. You could set a New Config and assign any disk you wanted, but then a rebuild wouldn't be possible.

Disk 12 will also have to be rebuilt to make it consistent with parity, but since it is empty it can simply be formatted again instead of having its filesystem repaired.

The diagnostics are from after the reboot, so there is no syslog from when the problems occurred. Also, your controller isn't passing the usual complete SMART attributes. I'm not sure how to interpret that; it does indicate the disks are OK, I just don't know whether its idea of OK is the same as ours.

Let the Disk 10 rebuild complete if it will, and post new diagnostics either way.
DannyG Posted April 20, 2020
Sounds good, thank you. Another 12 hours or so to go until 100%.
trurl Posted April 20, 2020
10 hours ago, trurl said: "your controller isn't passing the usual complete SMART attributes."
Apparently that is what you get with SAS drives: less than ideal, but it's what we have. Here is another recent post about that: https://forums.unraid.net/topic/91279-aneely-docker-image-filling-up/?do=findComment&comment=846769
DannyG Posted April 20, 2020
OK, so Disk 10 has just finished rebuilding. It's now green, but it's still displaying "Unmountable: No file system". I stopped the array and started it back up to see if it would change, but it didn't. I have attached my latest diagnostics. Disk 12 still has a red X.

tower-diagnostics-20200420-1413.zip
JorgeB Posted April 20, 2020
Check the filesystem on both: https://wiki.unraid.net/Check_Disk_Filesystems#Checking_and_fixing_drives_in_the_webGui
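For reference, the webGui check described in that wiki link runs xfs_repair in no-modify mode. A rough command-line equivalent, assuming the array is started in maintenance mode and Disk 10 maps to /dev/md10 (the device name is an assumption; confirm it on your own system before running anything):

```shell
# Read-only XFS check of Disk 10 via its parity-protected md device.
# -n = no modify, -v = verbose. Nothing is written to the disk.
xfs_repair -nv /dev/md10
```

Running against the md device rather than the raw /dev/sdX device keeps parity in sync with any later repair.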
DannyG Posted April 20, 2020
This is what I got after running the check on Disk 10 in maintenance mode with flags -nv (Disk 12 has the same results):

Phase 1 - find and verify superblock...
bad primary superblock - bad CRC in superblock !!!
attempting to find secondary superblock...
.found candidate secondary superblock...
verified secondary superblock...
would write modified primary superblock
Primary superblock would have been modified.
Cannot proceed further in no_modify mode.
Exiting now.
DannyG Posted April 20, 2020
OK, I tried the repair (running just -v); this is what I got:

Phase 1 - find and verify superblock...
bad primary superblock - bad CRC in superblock !!!
attempting to find secondary superblock...
.found candidate secondary superblock...
verified secondary superblock...
writing modified primary superblock
        - block cache size set to 15439760 entries
sb realtime bitmap inode 18446744073709551615 (NULLFSINO) inconsistent with calculated value 129
resetting superblock realtime bitmap ino pointer to 129
sb realtime summary inode 18446744073709551615 (NULLFSINO) inconsistent with calculated value 130
resetting superblock realtime summary ino pointer to 130
Phase 2 - using internal log
        - zero log...
zero_log: head block 667834 tail block 667830
ERROR: The filesystem has valuable metadata changes in a log which needs to be replayed. Mount the filesystem to replay the log, and unmount it before re-running xfs_repair. If you are unable to mount the filesystem, then use the -L option to destroy the log and attempt a repair. Note that destroying the log may cause corruption -- please attempt a mount of the filesystem before doing this.
DannyG Posted April 20, 2020
Can't I mount the drive first like it's asking (to replay the log)?
DannyG Posted April 20, 2020
I don't want to lose the data on Disk 10. How do I mount the drive? (It looks mounted to me.) -L sounds like it will destroy the data on the disk... can you confirm?
itimpi Posted April 20, 2020
36 minutes ago, DannyG said: "-L sounds like I'll destroy the data on the disk.. can you confirm?"
It normally destroys nothing. If anything does get lost, it will only be related to the file that was being written at the time the corruption occurred.
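To make that concrete: the usual order of operations, per the xfs_repair error message above, is to attempt a mount first (which replays the log) and only fall back to -L if the mount fails. A sketch, with the array in maintenance mode; the device name /dev/md10 and the mount point are both assumptions here:

```shell
# Try to replay the XFS log by mounting read-only; fall back to -L.
mkdir -p /mnt/test
if mount -o ro /dev/md10 /mnt/test; then
    umount /mnt/test        # log replayed; re-run xfs_repair -n to verify
else
    xfs_repair -L /dev/md10 # zeroes the log; may lose the last in-flight writes
fi
```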
DannyG Posted April 20, 2020
Thank you @itimpi. I used the -L flag and received the following:

Phase 1 - find and verify superblock...
sb realtime bitmap inode 18446744073709551615 (NULLFSINO) inconsistent with calculated value 129
resetting superblock realtime bitmap ino pointer to 129
sb realtime summary inode 18446744073709551615 (NULLFSINO) inconsistent with calculated value 130
resetting superblock realtime summary ino pointer to 130
Phase 2 - using internal log
        - zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being destroyed because the -L option was used.
        - scan filesystem freespace and inode maps...
sb_icount 0, counted 3776
sb_ifree 0, counted 178
sb_fdblocks 976277679, counted 626303125
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
Maximum metadata LSN (1:667824) is ahead of log (1:2).
Format log to cycle 4.
done
DannyG Posted April 20, 2020
Alright, so I stopped the array and spun it back up. Disk 10 is back, and it looks like the files are there too. Any idea why this is happening?
JorgeB Posted April 21, 2020
7 hours ago, DannyG said: "Any idea why this is happening?"
We'd need the diagnostics from before rebooting; grab them if it happens again.
DannyG Posted April 22, 2020
I'll know for next time. Thank you very much.
DannyG Posted May 4, 2020
Well... it didn't take that long. I'm getting the following errors now:

Unraid array errors: 2020-05-03 04:42
Warning [TOWER] - array has errors
Array has 12 disks with read errors

Here's my diagnostics file (without a reboot): tower-diagnostics-20200504-0925.zip
trurl Posted May 4, 2020
Disks 10-20 plus parity2 are all disconnected. Are these on the same controller?
JorgeB Posted May 4, 2020
Looks like a problem with the enclosure; there was a reset and it lost communication with all the disks:

May 3 04:41:15 Tower kernel: scsi 3:0:13:0: Enclosure NETAPP DS424IOM3 0212 PQ: 0 ANSI: 5
May 3 04:41:15 Tower kernel: scsi 3:0:13:0: Power-on or device reset occurred
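A quick way to spot events like this is to grep the syslog for reset/enclosure messages. The snippet below demonstrates the pattern against a captured sample line from the log quoted above; on a live server you would grep /var/log/syslog instead:

```shell
# Count reset/enclosure events; the sample line is from the quoted log.
log='May  3 04:41:15 Tower kernel: scsi 3:0:13:0: Power-on or device reset occurred'
printf '%s\n' "$log" | grep -icE 'device reset|enclosure'
# prints 1 (one matching line)
```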
DannyG Posted May 4, 2020
Yes, they're all on the same controller, but all the disks are still showing as online on my dashboard.
trurl Posted May 4, 2020
Have you considered going to fewer, larger disks to get the same capacity? Larger disks perform better, and fewer disks mean fewer opportunities for problems. In fact, it looks like you have enough free space to move everything off the 2TB disks and shrink the array.
DannyG Posted May 5, 2020
Hi trurl,

My server originally had 10x 2TB disks. They are internal disks running off two IBM M1015 controllers in IT mode, from my FreeNAS days. Recently I got my hands on a NetApp shelf with some NetApp disks; those are the ones showing up as Disk 10-22 plus the second parity drive, and they're the ones I'm having reliability issues with.

I believe what you're asking is why I don't migrate everything off my reliable 2TB disks onto the 4TB SAS drives, to have fewer disks and fewer problems. I'm just not convinced that would be the case.

PS: my 2TB drives are old, like 7-9 years old. The plan is to just let them die and shrink the array as that happens.
trurl Posted May 5, 2020
There is more than one way to get small old disks out of the array and shrink it. Since the disks you are having connection problems with are mostly empty, shrink the array to remove them, and then use some of those larger newer disks to rebuild the small old disks onto larger newer ones.
JonathanM Posted May 6, 2020
23 hours ago, DannyG said: "the Plan is to just let them die and shrink the array as it happens."
The issue with that is if one of your critical drives full of content unexpectedly quits, you are relying on known-failing drives to rebuild it. Unraid requires ALL drives to be read flawlessly to rebuild a failed drive. You may as well just run without parity; at least that way, when a drive dies it doesn't immediately start stressing all the other marginal drives.

I'm not being flippant here. You really do stand a better chance of keeping your data safe if you drop parity and use those two drives plus multiple other drives to keep backup copies of your data. You can set up scheduled copies with the User Scripts add-on, so that when drives die you have backups. Keeping questionable drives as members of the parity array will bite you; I lost data that way many years ago when I started with Unraid, never again.