(Solved) Failed Disk (remove?)

April 26, 2019

My server has become unstable (crashed and parity checks lead to more crashes). After it came back up, I had a failed disk.

The disk looks "unmountable". As I have plenty of space (50% usage) I'm wondering if I should remove the disk, try to put it back into the array or do something else.

Is there a way to find out how "dead" the disk is?

knowlage-diagnostics-20190426-1816.zip

Edited April 30, 2019 by Jaster

April 26, 2019

Community Expert

There are a few relatively recent UNC @ LBA errors; you should run an extended SMART test.

April 26, 2019

Author

It's a very old 2TB disk, I guess it's time to remove it.

Is there a way to remove it and have the array reallocate the missing data somewhere else (as I said, I have plenty of space left)?

April 26, 2019

Community Expert

Not automatically, either check filesystem on the emulated disk and move the data to other disks, or mount the old disk with UD (Unassigned Devices) and copy its data to the array after doing a new config and re-syncing parity.

April 26, 2019

Author

5 minutes ago, johnnie.black said:

Not automatically, either check filesystem on the emulated disk and move the data to other disks

I do this by...? Hit the "check" button and then use unbalance?

April 26, 2019

Community Expert

2 minutes ago, Jaster said:

I do this by...?

https://wiki.unraid.net/Check_Disk_Filesystems#Checking_and_fixing_drives_in_the_webGui

2 minutes ago, Jaster said:

and then use unbalance?

It's an option.

April 26, 2019

Community Expert

Then you'll still need to do a new config and re-sync parity without that disk, though this doesn't bode well:

26 minutes ago, Jaster said:

crashed and parity checks lead to more crashes

April 26, 2019

Author

So what do you suggest?

April 26, 2019

Community Expert

I would first try to find out why the server is crashing: run memtest, check cooling, power supply, etc.

April 26, 2019

Author

It seems to be the disk, everything else is doing well. So I either replace or remove it. I'm sure it won't pass another parity run as I tried that a couple of times.

April 26, 2019

Community Expert

A bad disk shouldn't make Unraid crash, and this one doesn't look bad. Still, move the data first, remove the disk, and then try to re-sync parity; if it still crashes, it wasn't the disk.

April 26, 2019

Author

#	Attribute Name	Flag	Value	Worst	Threshold	Type	Updated	Failed	Raw Value
1	Raw read error rate	0x002f	200	200	051	Pre-fail	Always	Never	0
3	Spin up time	0x0027	176	174	021	Pre-fail	Always	Never	4158
4	Start stop count	0x0032	097	097	000	Old age	Always	Never	3449
5	Reallocated sector count	0x0033	200	200	140	Pre-fail	Always	Never	0
7	Seek error rate	0x002e	200	200	000	Old age	Always	Never	0
9	Power on hours	0x0032	040	040	000	Old age	Always	Never	43936 (5y, 4d, 16h)
10	Spin retry count	0x0032	100	100	000	Old age	Always	Never	0
11	Calibration retry count	0x0032	100	100	000	Old age	Always	Never	0
12	Power cycle count	0x0032	100	100	000	Old age	Always	Never	257
192	Power-off retract count	0x0032	200	200	000	Old age	Always	Never	153
193	Load cycle count	0x0032	199	199	000	Old age	Always	Never	3295
194	Temperature celsius	0x0022	117	088	000	Old age	Always	Never	30
196	Reallocated event count	0x0032	200	200	000	Old age	Always	Never	0
197	Current pending sector	0x0032	200	200	000	Old age	Always	Never	0
198	Offline uncorrectable	0x0030	100	253	000	Old age	Offline	Never	0
199	UDMA CRC error count	0x0032	200	200	000	Old age	Always	Never	0
200	Multi zone error rate	0x0008	200	200	000	Old age	Offline	Never	0

I'm running the check now, let's see what happens. If it passes, I'll try to reset the config and run a parity check.
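For anyone who wants to triage a SMART table like the one above in a script, here is a minimal Python sketch. It assumes the table is tab-separated exactly as pasted. The set of watched attribute IDs and the function names are choices made for this illustration, not anything defined by Unraid or smartmontools.

```python
import csv
import io

# Counters whose non-zero raw value commonly points at media or cabling trouble.
# Which IDs to watch is an assumption of this sketch.
WATCHED_IDS = {5, 196, 197, 198, 199}


def parse_smart_table(text):
    """Parse a tab-separated SMART attribute table into a list of dicts."""
    rows = list(csv.reader(io.StringIO(text.strip()), delimiter="\t"))
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body if row]


def flag_attributes(attrs):
    """Return the names of attributes that look suspicious."""
    flagged = []
    for a in attrs:
        attr_id = int(a["#"])
        value, threshold = int(a["Value"]), int(a["Threshold"])
        # "43936 (5y, 4d, 16h)" -> 43936
        raw = int(a["Raw Value"].split()[0])
        if threshold > 0 and value <= threshold:
            flagged.append(a["Attribute Name"])  # normalized value hit threshold
        elif attr_id in WATCHED_IDS and raw > 0:
            flagged.append(a["Attribute Name"])  # watched counter is non-zero
    return flagged


# Three rows copied from the table above, joined with tabs.
SAMPLE = "\n".join([
    "#\tAttribute Name\tFlag\tValue\tWorst\tThreshold\tType\tUpdated\tFailed\tRaw Value",
    "5\tReallocated sector count\t0x0033\t200\t200\t140\tPre-fail\tAlways\tNever\t0",
    "9\tPower on hours\t0x0032\t040\t040\t000\tOld age\tAlways\tNever\t43936 (5y, 4d, 16h)",
    "197\tCurrent pending sector\t0x0032\t200\t200\t000\tOld age\tAlways\tNever\t0",
])

if __name__ == "__main__":
    print(flag_attributes(parse_smart_table(SAMPLE)))
```

Every row in the table above passes both checks, so nothing would be flagged for this disk. The UNC errors mentioned earlier would normally be recorded in the SMART error log, which an attribute table like this does not include.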

Edited April 26, 2019 by Jaster

April 26, 2019

Community Expert

I already saw that on the diags.

April 26, 2019

Author

1 minute ago, johnnie.black said:

I already saw that on the diags.


Phase 1 - find and verify superblock...
        - block cache size set to 3062096 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 8 tail block 4
ALERT: The filesystem has valuable metadata changes in a log which is being
ignored because the -n option was used.  Expect spurious inconsistencies
which may be resolved by first mounting the filesystem to replay the log.
        - scan filesystem freespace and inode maps...
agf_freeblks 122094651, counted 122094649 in ag 3
agf_freeblks 94873563, counted 94873565 in ag 1
agf_freeblks 121856187, counted 121856185 in ag 2
agi_freecount 1, counted 0 in ag 3
agi_freecount 1, counted 0 in ag 3 finobt
agi_freecount 1, counted 0 in ag 1
agi_freecount 1, counted 0 in ag 1 finobt
inode chunk claims untracked block, finobt block - agno 2, bno 3901920, inopb 8
inode chunk claims untracked block, finobt block - agno 2, bno 3901921, inopb 8
inode chunk claims untracked block, finobt block - agno 2, bno 3901922, inopb 8
inode chunk claims untracked block, finobt block - agno 2, bno 3901923, inopb 8
inode chunk claims untracked block, finobt block - agno 2, bno 3901924, inopb 8
inode chunk claims untracked block, finobt block - agno 2, bno 3901925, inopb 8
inode chunk claims untracked block, finobt block - agno 2, bno 3901926, inopb 8
inode chunk claims untracked block, finobt block - agno 2, bno 3901927, inopb 8
undiscovered finobt record, ino 2178699008 (2/31215360)
finobt ir_freecount/free mismatch, inode chunk 2/31215360, freecount 30 nfree 32
invalid inode count, inode chunk 2/31215360, count 0 ninodes 64
undiscovered finobt record, ino 2147483712 (2/64)
finobt ir_freecount/free mismatch, inode chunk 2/64, freecount 54 nfree 24
invalid inode count, inode chunk 2/64, count 0 ninodes 64
undiscovered finobt record, ino 2147484608 (2/960)
finobt ir_freecount/free mismatch, inode chunk 2/960, freecount 6 nfree 28
invalid inode count, inode chunk 2/960, count 0 ninodes 64
agi_freecount 1, counted 0 in ag 2
agi_freecount 1, counted 90 in ag 2 finobt
sb_ifree 9, counted 6
sb_fdblocks 338655161, counted 339149767
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
found inodes not in the inode allocation tree
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 3
        - agno = 1
        - agno = 2
No modify flag set, skipping phase 5
Inode allocation btrees are too corrupted, skipping phases 6 and 7
Maximum metadata LSN (1:3751) is ahead of log (1:8).
Would format log to cycle 4.
No modify flag set, skipping filesystem flush and exiting.

        XFS_REPAIR Summary    Fri Apr 26 19:06:04 2019

Phase		Start		End		Duration
Phase 1:	04/26 19:06:03	04/26 19:06:04	1 second
Phase 2:	04/26 19:06:04	04/26 19:06:04
Phase 3:	04/26 19:06:04	04/26 19:06:04
Phase 4:	04/26 19:06:04	04/26 19:06:04
Phase 5:	Skipped
Phase 6:	Skipped
Phase 7:	Skipped

Total run time: 1 second

Run xfs_repair with -L?

April 26, 2019

Community Expert

First run it without -n; if it then asks for -L, and it likely will, use -L.
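The order suggested here (a plain repair first, -L only if xfs_repair asks for it) can be sketched as a small Python helper that builds the command lines. This is only an illustration: it assumes the array is started in maintenance mode and that disk N is exposed as /dev/mdN, as the md device names in this thread suggest. The function names are hypothetical, and the text matched in `needs_log_zeroing` is an assumption about how xfs_repair words its refusal.

```python
def repair_command(disk_number, destroy_log=False):
    """Build an xfs_repair command line for array disk N (assumed /dev/mdN)."""
    if not isinstance(disk_number, int) or disk_number < 1:
        raise ValueError("disk number must be a positive integer")
    cmd = ["xfs_repair"]
    if destroy_log:
        # -L zeroes the log; metadata changes that were never replayed are lost.
        cmd.append("-L")
    cmd.append(f"/dev/md{disk_number}")
    return cmd


def needs_log_zeroing(output):
    """Heuristic: did a plain run stop and point at the -L option?"""
    return "-L option" in output


if __name__ == "__main__":
    # Print the command so the device can be double-checked before running it.
    print(" ".join(repair_command(4)))
```

Printing the command before running it leaves a moment to confirm the disk number, since the same command pointed at a different md device would operate on that disk instead.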

April 26, 2019

Author

I think I screwed up (a bit): I copy/pasted the repair command with drive md1 instead of md4.

I canceled it ([Ctrl]+[C]) and, as everything looked fine, I went on and fixed disk 4.

After I made a new config, it told me disk1 is unmountable. When I try to stop the array, it "hangs" with "Array Stopping • Retry unmounting disk share(s)...". Argh.

April 26, 2019

Community Expert

disk1 should be fixable with xfs_repair; the new config should be done only after you copy disk4's data.

April 26, 2019

Author

I can't run xfs_repair because I can't get the array into maintenance mode.

As d4 was repaired, I'll do a new config and include it in order to run a parity check and hope. If it works, I'll unbalance all data off d4 and remove it.

April 26, 2019

Community Expert

Only the filesystem was repaired (and it was the emulated disk's filesystem, not the actual disk). That won't make any difference to a parity check, or to whether the server crashes, though as I said, I doubt the crashes are disk related. Still, if you plan to remove disk4, there's no point in doing a new config with it.

Edited April 26, 2019 by johnnie.black

April 26, 2019

Author

I put all the disks back and am trying to run a parity check. Let's see what it does...

Is there anything I can enable for some kind of "extended" monitoring?

April 26, 2019

Community Expert

System notifications are enough to monitor usual disk warning signs.

April 27, 2019

Author

Apr 27 20:19:16 tower kernel: ata3: hard resetting link
Apr 27 20:19:16 tower kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Apr 27 20:19:16 tower kernel: ata3.00: supports DRM functions and may not be fully accessible
Apr 27 20:19:16 tower kernel: ata3.00: NCQ Send/Recv Log not supported
Apr 27 20:19:16 tower kernel: ata3.00: supports DRM functions and may not be fully accessible
Apr 27 20:19:16 tower kernel: ata3.00: NCQ Send/Recv Log not supported
Apr 27 20:19:16 tower kernel: ata3.00: configured for UDMA/133
Apr 27 20:19:16 tower kernel: ata3: EH complete
Apr 27 20:19:16 tower kernel: ata3.00: exception Emask 0x10 SAct 0x200 SErr 0x400100 action 0x6 frozen
Apr 27 20:19:16 tower kernel: ata3.00: irq_stat 0x08000000, interface fatal error
Apr 27 20:19:16 tower kernel: ata3: SError: { UnrecovData Handshk }
Apr 27 20:19:16 tower kernel: ata3.00: failed command: WRITE FPDMA QUEUED
Apr 27 20:19:16 tower kernel: ata3.00: cmd 61/80:48:40:2f:76/00:00:03:00:00/40 tag 9 ncq dma 65536 out
Apr 27 20:19:16 tower kernel: res 40/00:48:40:2f:76/00:00:03:00:00/40 Emask 0x10 (ATA bus error)
Apr 27 20:19:16 tower kernel: ata3.00: status: { DRDY }
Apr 27 20:19:16 tower kernel: ata3: hard resetting link
Apr 27 20:19:17 tower kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Apr 27 20:19:17 tower kernel: ata3.00: supports DRM functions and may not be fully accessible
Apr 27 20:19:17 tower kernel: ata3.00: NCQ Send/Recv Log not supported
Apr 27 20:19:17 tower kernel: ata3.00: supports DRM functions and may not be fully accessible
Apr 27 20:19:17 tower kernel: ata3.00: NCQ Send/Recv Log not supported
Apr 27 20:19:17 tower kernel: ata3.00: configured for UDMA/133
Apr 27 20:19:17 tower kernel: ata3: EH complete
Apr 27 20:19:18 tower kernel: ata3.00: exception Emask 0x10 SAct 0x4 SErr 0x400100 action 0x6 frozen
Apr 27 20:19:18 tower kernel: ata3.00: irq_stat 0x08000000, interface fatal error
Apr 27 20:19:18 tower kernel: ata3: SError: { UnrecovData Handshk }
Apr 27 20:19:18 tower kernel: ata3.00: failed command: WRITE FPDMA QUEUED
Apr 27 20:19:18 tower kernel: ata3.00: cmd 61/80:10:40:cc:eb/00:00:02:00:00/40 tag 2 ncq dma 65536 out
Apr 27 20:19:18 tower kernel: res 40/00:10:40:cc:eb/00:00:02:00:00/40 Emask 0x10 (ATA bus error)
Apr 27 20:19:18 tower kernel: ata3.00: status: { DRDY }
Apr 27 20:19:18 tower kernel: ata3: hard resetting link
Apr 27 20:19:18 tower kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Apr 27 20:19:18 tower kernel: ata3.00: supports DRM functions and may not be fully accessible
Apr 27 20:19:18 tower kernel: ata3.00: NCQ Send/Recv Log not supported
Apr 27 20:19:18 tower kernel: ata3.00: supports DRM functions and may not be fully accessible
Apr 27 20:19:18 tower kernel: ata3.00: NCQ Send/Recv Log not supported
Apr 27 20:19:18 tower kernel: ata3.00: configured for UDMA/133
Apr 27 20:19:18 tower kernel: ata3: EH complete

The array is back and parity seems to be valid, but I do get some errors... How can I dig deeper?
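One way to get an overview of a log excerpt like the one above is to count the error events per ATA port. The Python sketch below does that for three message types taken from the excerpt. The regular expression, the event labels, and the function name are assumptions of this illustration, and it only handles lines in the `kernel: ataN...` format shown here.

```python
import re
from collections import Counter

# Matches "kernel: ata3: ..." as well as "kernel: ata3.00: ...".
LINE_RE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)")

# Substrings taken from the excerpt above, mapped to short labels.
EVENTS = {
    "hard resetting link": "link_reset",
    "failed command": "failed_command",
    "exception Emask": "exception",
}


def summarize(lines):
    """Count selected error events per ATA port."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # e.g. the "res ..." continuation lines carry no port prefix
        port, message = match.groups()
        for needle, label in EVENTS.items():
            if needle in message:
                counts[(port, label)] += 1
    return counts


# A few lines copied from the excerpt above.
SAMPLE = [
    "Apr 27 20:19:16 tower kernel: ata3.00: exception Emask 0x10 SAct 0x200 SErr 0x400100 action 0x6 frozen",
    "Apr 27 20:19:16 tower kernel: ata3: SError: { UnrecovData Handshk }",
    "Apr 27 20:19:16 tower kernel: ata3.00: failed command: WRITE FPDMA QUEUED",
    "Apr 27 20:19:16 tower kernel: res 40/00:48:40:2f:76/00:00:03:00:00/40 Emask 0x10 (ATA bus error)",
    "Apr 27 20:19:16 tower kernel: ata3: hard resetting link",
]

if __name__ == "__main__":
    for (port, label), n in sorted(summarize(SAMPLE).items()):
        print(port, label, n)
```

Run over a full syslog, a summary like this shows quickly whether the errors are confined to a single port, as they are in the excerpt.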

knowlage-diagnostics-20190427-2138.zip

April 28, 2019

Community Expert

ata3 is the SSD; replace the cables. Samsung SSDs are particularly picky about cable quality.

April 28, 2019

Author

Ok, thanks.

Where can I see the ata/port/drive mapping?

April 28, 2019

Community Expert

In the syslog, search for the ata#. You can also click on the little disk icon next to each disk on the main page to see that device's related log info.
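Searching the syslog for the ata# can also be scripted. The Python sketch below looks for the device identification line the kernel typically prints for each ATA port at boot and maps the port to the model string. The sample lines are hypothetical (they are not taken from this server's diagnostics), and the regular expression is an assumption about that line's format.

```python
import re

# Typical boot-time line: "ata3.00: ATA-9: <model>, <firmware>, max UDMA/133"
IDENT_RE = re.compile(r"kernel: (ata\d+)\.\d+: ATA-\d+: ([^,]+),")


def identify_ports(lines):
    """Map each ATA port (e.g. 'ata3') to the model string the kernel reported."""
    ports = {}
    for line in lines:
        match = IDENT_RE.search(line)
        if match:
            port, model = match.groups()
            ports[port] = model.strip()
    return ports


# Hypothetical sample lines, for illustration only.
SAMPLE = [
    "Apr 27 19:58:02 tower kernel: ata3.00: ATA-9: Samsung SSD 850 EVO 500GB, EMT02B6Q, max UDMA/133",
    "Apr 27 19:58:02 tower kernel: ata4.00: ATA-8: WDC WD20EARS-00MVWB0, 51.0AB51, max UDMA/133",
    "Apr 27 20:19:16 tower kernel: ata3: hard resetting link",
]

if __name__ == "__main__":
    for port, model in sorted(identify_ports(SAMPLE).items()):
        print(port, "->", model)
```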
