April 26, 2019 My server has become unstable (it crashed, and parity checks lead to more crashes). After it came back up, I had a failed disk. The disk shows as "unmountable". As I have plenty of space (50% usage), I'm wondering if I should remove the disk, try to put it back into the array, or do something else. Is there a way to find out how "dead" the disk is? knowlage-diagnostics-20190426-1816.zip Edited April 30, 2019 by Jaster
April 26, 2019 Community Expert There are a few relatively recent UNC @ LBA errors; you should run an extended SMART test.
April 26, 2019 Author It's a very old 2TB disk; I guess it's time to remove it. Is there a way to remove it and have the array reallocate the missing data somewhere else (as I said, I have plenty of space left)?
April 26, 2019 Community Expert Not automatically. Either check the filesystem on the emulated disk and move the data to other disks, or mount the old disk with UD (Unassigned Devices) and copy its data to the array after doing a new config and re-syncing parity.
April 26, 2019 Author 5 minutes ago, johnnie.black said: Not automatically, either check filesystem on the emulated disk and move the data to other disks I do this by...? Hit the "check" button and then use unbalance?
April 26, 2019 Community Expert 2 minutes ago, Jaster said: I do this by...? https://wiki.unraid.net/Check_Disk_Filesystems#Checking_and_fixing_drives_in_the_webGui 2 minutes ago, Jaster said: and then use unbalance? It's an option.
April 26, 2019 Community Expert Then you'll still need to do a new config and re-sync parity without that disk, though this doesn't bode well: 26 minutes ago, Jaster said: crashed and parity checks lead to more crashes
April 26, 2019 Community Expert I would first try to find out why the server is crashing: run memtest, check cooling, power supply, etc.
April 26, 2019 Author It seems to be the disk; everything else is doing well. So I either replace or remove it. I'm sure it won't pass another parity run, as I tried that a couple of times.
April 26, 2019 Community Expert A bad disk shouldn't make Unraid crash, and this one doesn't look bad. Still, move the data first, remove the disk, and then try to re-sync parity; if it still crashes, it wasn't the disk.
April 26, 2019 Author

#    Attribute Name            Flag    Value  Worst  Threshold  Type      Updated  Failed  Raw Value
1    Raw read error rate       0x002f  200    200    051        Pre-fail  Always   Never   0
3    Spin up time              0x0027  176    174    021        Pre-fail  Always   Never   4158
4    Start stop count          0x0032  097    097    000        Old age   Always   Never   3449
5    Reallocated sector count  0x0033  200    200    140        Pre-fail  Always   Never   0
7    Seek error rate           0x002e  200    200    000        Old age   Always   Never   0
9    Power on hours            0x0032  040    040    000        Old age   Always   Never   43936 (5y, 4d, 16h)
10   Spin retry count          0x0032  100    100    000        Old age   Always   Never   0
11   Calibration retry count   0x0032  100    100    000        Old age   Always   Never   0
12   Power cycle count         0x0032  100    100    000        Old age   Always   Never   257
192  Power-off retract count   0x0032  200    200    000        Old age   Always   Never   153
193  Load cycle count          0x0032  199    199    000        Old age   Always   Never   3295
194  Temperature celsius       0x0022  117    088    000        Old age   Always   Never   30
196  Reallocated event count   0x0032  200    200    000        Old age   Always   Never   0
197  Current pending sector    0x0032  200    200    000        Old age   Always   Never   0
198  Offline uncorrectable     0x0030  100    253    000        Old age   Offline  Never   0
199  UDMA CRC error count      0x0032  200    200    000        Old age   Always   Never   0
200  Multi zone error rate     0x0008  200    200    000        Old age   Offline  Never   0

I'm running the check now; let's see what happens. If it passes, I'll try to reset the config and run a parity check. Edited April 26, 2019 by Jaster
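For anyone who wants to screen a SMART attribute table like the one in this post by script, here is a minimal Python sketch. The raw values are copied from the post; the set of "watched" attribute IDs and the function name `nonzero_watched` are assumptions made for illustration, not an Unraid or smartmontools rule.

```python
# Raw values copied from the SMART table in the post: (id, name, raw value).
ATTRIBUTES = [
    (1, "Raw read error rate", 0),
    (5, "Reallocated sector count", 0),
    (196, "Reallocated event count", 0),
    (197, "Current pending sector", 0),
    (198, "Offline uncorrectable", 0),
    (199, "UDMA CRC error count", 0),
    (200, "Multi zone error rate", 0),
]

# Assumption: attribute IDs often treated as warning signs when the raw
# value is non-zero. Adjust to taste; this is not an official list.
WATCHED_IDS = {5, 196, 197, 198, 199}


def nonzero_watched(attributes, watched=WATCHED_IDS):
    """Return the (id, name, raw) rows that are watched and have raw > 0."""
    return [row for row in attributes if row[0] in watched and row[2] > 0]


if __name__ == "__main__":
    flagged = nonzero_watched(ATTRIBUTES)
    if flagged:
        for attr_id, name, raw in flagged:
            print(f"{attr_id:>3} {name}: raw value {raw}")
    else:
        print("No watched attribute has a non-zero raw value.")
```

With the values from this post nothing is flagged, which fits the comment that the disk "doesn't look bad". An empty result only covers the attribute table; it says nothing about errors recorded elsewhere, such as the UNC @ LBA entries mentioned earlier in the thread.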
April 26, 2019 Author

1 minute ago, johnnie.black said: I already saw that on the diags.

Phase 1 - find and verify superblock...
        - block cache size set to 3062096 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 8 tail block 4
ALERT: The filesystem has valuable metadata changes in a log which is being ignored because the -n option was used. Expect spurious inconsistencies which may be resolved by first mounting the filesystem to replay the log.
        - scan filesystem freespace and inode maps...
agf_freeblks 122094651, counted 122094649 in ag 3
agf_freeblks 94873563, counted 94873565 in ag 1
agf_freeblks 121856187, counted 121856185 in ag 2
agi_freecount 1, counted 0 in ag 3
agi_freecount 1, counted 0 in ag 3 finobt
agi_freecount 1, counted 0 in ag 1
agi_freecount 1, counted 0 in ag 1 finobt
inode chunk claims untracked block, finobt block - agno 2, bno 3901920, inopb 8
inode chunk claims untracked block, finobt block - agno 2, bno 3901921, inopb 8
inode chunk claims untracked block, finobt block - agno 2, bno 3901922, inopb 8
inode chunk claims untracked block, finobt block - agno 2, bno 3901923, inopb 8
inode chunk claims untracked block, finobt block - agno 2, bno 3901924, inopb 8
inode chunk claims untracked block, finobt block - agno 2, bno 3901925, inopb 8
inode chunk claims untracked block, finobt block - agno 2, bno 3901926, inopb 8
inode chunk claims untracked block, finobt block - agno 2, bno 3901927, inopb 8
undiscovered finobt record, ino 2178699008 (2/31215360)
finobt ir_freecount/free mismatch, inode chunk 2/31215360, freecount 30 nfree 32
invalid inode count, inode chunk 2/31215360, count 0 ninodes 64
undiscovered finobt record, ino 2147483712 (2/64)
finobt ir_freecount/free mismatch, inode chunk 2/64, freecount 54 nfree 24
invalid inode count, inode chunk 2/64, count 0 ninodes 64
undiscovered finobt record, ino 2147484608 (2/960)
finobt ir_freecount/free mismatch, inode chunk 2/960, freecount 6 nfree 28
invalid inode count, inode chunk 2/960, count 0 ninodes 64
agi_freecount 1, counted 0 in ag 2
agi_freecount 1, counted 90 in ag 2 finobt
sb_ifree 9, counted 6
sb_fdblocks 338655161, counted 339149767
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
found inodes not in the inode allocation tree
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 3
        - agno = 1
        - agno = 2
No modify flag set, skipping phase 5
Inode allocation btrees are too corrupted, skipping phases 6 and 7
Maximum metadata LSN (1:3751) is ahead of log (1:8).
Would format log to cycle 4.
No modify flag set, skipping filesystem flush and exiting.

XFS_REPAIR Summary    Fri Apr 26 19:06:04 2019

Phase     Start           End             Duration
Phase 1:  04/26 19:06:03  04/26 19:06:04  1 second
Phase 2:  04/26 19:06:04  04/26 19:06:04
Phase 3:  04/26 19:06:04  04/26 19:06:04
Phase 4:  04/26 19:06:04  04/26 19:06:04
Phase 5:  Skipped
Phase 6:  Skipped
Phase 7:  Skipped

Total run time: 1 second

Run xfs_repair with -L?
April 26, 2019 Community Expert First run it without -n, and if it still asks for -L (it likely will), run it again with -L.
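The order of attempts described here can be written down as a small Python sketch. It only builds and prints the command lines; it does not run them. The device path `/dev/md4` is an assumption taken from the disk number discussed in this thread, and the function name `xfs_repair_steps` is made up for illustration.

```python
def xfs_repair_steps(device):
    """Return xfs_repair command lines in the order suggested in the thread."""
    return [
        # 1. Read-only check: -n reports problems but modifies nothing.
        ["xfs_repair", "-n", device],
        # 2. Actual repair attempt, without -n.
        ["xfs_repair", device],
        # 3. Only if step 2 refuses because of a dirty log: -L zeroes the
        #    log before repairing, which can lose the most recent metadata
        #    changes.
        ["xfs_repair", "-L", device],
    ]


if __name__ == "__main__":
    # Assumption: disk4 of the array, with the array started in
    # maintenance mode. Check the device name before running anything.
    for step, command in enumerate(xfs_repair_steps("/dev/md4"), start=1):
        print(f"Step {step}: {' '.join(command)}")
```

Printing the commands first, instead of executing them directly, leaves a moment to confirm that the device name really belongs to the disk that needs repair.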
April 26, 2019 Author I think I screwed it up (a bit): I copy/pasted the repair command with drive md1 instead of md4. I cancelled ([Ctrl]+[C]) and, as everything looked fine, I went on and fixed 4. After I made a new config, it told me disk1 is unmountable. When I try to stop the array, it "hangs" with: Array Stopping•Retry unmounting disk share(s)... Argh.
April 26, 2019 Community Expert disk1 should be fixable with xfs_repair; the new config should be done only after you copy disk4's data.
April 26, 2019 Author I can't run xfs_repair, as I can't get the array into maintenance mode. As d4 was repaired, I'll do a new config and include it in order to run a parity check, and hope. If it works, I'll unbalance all data off d4 and remove it.
April 26, 2019 Community Expert Only the filesystem was repaired (and it was the emulated disk's filesystem, not the actual disk's). It won't make any difference to a parity check, or to whether the server crashes or not, though like I said, I doubt it's disk related. Still, if you plan to remove disk4, there's no point in doing a new config with it. Edited April 26, 2019 by johnnie.black
April 26, 2019 Author I put all the disks back and am trying to run a parity check. Let's see what it does... Is there anything I can enable for some kind of "extended" monitoring?
April 26, 2019 Community Expert System notifications are enough to monitor the usual disk warning signs.
April 27, 2019 Author

Apr 27 20:19:16 tower kernel: ata3: hard resetting link
Apr 27 20:19:16 tower kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Apr 27 20:19:16 tower kernel: ata3.00: supports DRM functions and may not be fully accessible
Apr 27 20:19:16 tower kernel: ata3.00: NCQ Send/Recv Log not supported
Apr 27 20:19:16 tower kernel: ata3.00: supports DRM functions and may not be fully accessible
Apr 27 20:19:16 tower kernel: ata3.00: NCQ Send/Recv Log not supported
Apr 27 20:19:16 tower kernel: ata3.00: configured for UDMA/133
Apr 27 20:19:16 tower kernel: ata3: EH complete
Apr 27 20:19:16 tower kernel: ata3.00: exception Emask 0x10 SAct 0x200 SErr 0x400100 action 0x6 frozen
Apr 27 20:19:16 tower kernel: ata3.00: irq_stat 0x08000000, interface fatal error
Apr 27 20:19:16 tower kernel: ata3: SError: { UnrecovData Handshk }
Apr 27 20:19:16 tower kernel: ata3.00: failed command: WRITE FPDMA QUEUED
Apr 27 20:19:16 tower kernel: ata3.00: cmd 61/80:48:40:2f:76/00:00:03:00:00/40 tag 9 ncq dma 65536 out
Apr 27 20:19:16 tower kernel: res 40/00:48:40:2f:76/00:00:03:00:00/40 Emask 0x10 (ATA bus error)
Apr 27 20:19:16 tower kernel: ata3.00: status: { DRDY }
Apr 27 20:19:16 tower kernel: ata3: hard resetting link
Apr 27 20:19:17 tower kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Apr 27 20:19:17 tower kernel: ata3.00: supports DRM functions and may not be fully accessible
Apr 27 20:19:17 tower kernel: ata3.00: NCQ Send/Recv Log not supported
Apr 27 20:19:17 tower kernel: ata3.00: supports DRM functions and may not be fully accessible
Apr 27 20:19:17 tower kernel: ata3.00: NCQ Send/Recv Log not supported
Apr 27 20:19:17 tower kernel: ata3.00: configured for UDMA/133
Apr 27 20:19:17 tower kernel: ata3: EH complete
Apr 27 20:19:18 tower kernel: ata3.00: exception Emask 0x10 SAct 0x4 SErr 0x400100 action 0x6 frozen
Apr 27 20:19:18 tower kernel: ata3.00: irq_stat 0x08000000, interface fatal error
Apr 27 20:19:18 tower kernel: ata3: SError: { UnrecovData Handshk }
Apr 27 20:19:18 tower kernel: ata3.00: failed command: WRITE FPDMA QUEUED
Apr 27 20:19:18 tower kernel: ata3.00: cmd 61/80:10:40:cc:eb/00:00:02:00:00/40 tag 2 ncq dma 65536 out
Apr 27 20:19:18 tower kernel: res 40/00:10:40:cc:eb/00:00:02:00:00/40 Emask 0x10 (ATA bus error)
Apr 27 20:19:18 tower kernel: ata3.00: status: { DRDY }
Apr 27 20:19:18 tower kernel: ata3: hard resetting link
Apr 27 20:19:18 tower kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Apr 27 20:19:18 tower kernel: ata3.00: supports DRM functions and may not be fully accessible
Apr 27 20:19:18 tower kernel: ata3.00: NCQ Send/Recv Log not supported
Apr 27 20:19:18 tower kernel: ata3.00: supports DRM functions and may not be fully accessible
Apr 27 20:19:18 tower kernel: ata3.00: NCQ Send/Recv Log not supported
Apr 27 20:19:18 tower kernel: ata3.00: configured for UDMA/133
Apr 27 20:19:18 tower kernel: ata3: EH complete

The array is back and parity seems to be valid, but I do get some errors... How can I dig deeper? knowlage-diagnostics-20190427-2138.zip
April 28, 2019 Community Expert ata3 is the SSD; replace the cables. Samsung SSDs are particularly picky about cable quality.
April 28, 2019 Community Expert In the syslog, search for the ata# in question. You can also click on the little disk icon next to each disk on the main page to see that device's related log info.
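Searching the syslog for one ata port can also be scripted. Below is a minimal Python sketch that pulls out the messages for a given port and counts how often each one occurs. The sample lines are copied from the log posted in this thread; the function names `lines_for_port` and `count_messages` are made up for illustration.

```python
import re
from collections import Counter

# A few lines copied from the syslog excerpt posted in this thread.
SAMPLE = """\
Apr 27 20:19:16 tower kernel: ata3: hard resetting link
Apr 27 20:19:16 tower kernel: ata3.00: irq_stat 0x08000000, interface fatal error
Apr 27 20:19:16 tower kernel: ata3: SError: { UnrecovData Handshk }
Apr 27 20:19:16 tower kernel: ata3.00: failed command: WRITE FPDMA QUEUED
Apr 27 20:19:16 tower kernel: ata3: EH complete
"""


def lines_for_port(log_text, port):
    """Return the message part of every line tagged ataN: or ataN.MM:."""
    # The optional ".MM" suffix is the device on the port (e.g. ata3.00).
    pattern = re.compile(rf"\bata{port}(?:\.\d+)?: (.*)$")
    messages = []
    for line in log_text.splitlines():
        match = pattern.search(line)
        if match:
            messages.append(match.group(1))
    return messages


def count_messages(log_text, port):
    """Count how often each distinct message appears for the given port."""
    return Counter(lines_for_port(log_text, port))


if __name__ == "__main__":
    for message, count in count_messages(SAMPLE, 3).most_common():
        print(f"{count}x {message}")
```

To use it on a real log, read the syslog file into a string and pass it in place of `SAMPLE`. Continuation lines that carry no ata tag, such as the `res 40/00:...` lines in the excerpt, are not matched by this sketch.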