[SOLVED] Lost 173G from 500GB Cache



Last night I noticed many errors from my cache drives: the filesystem went read-only, Docker became unresponsive, and the logs started showing:

 

Oct 12 23:46:00 fractal kernel: ata8.00: exception Emask 0x0 SAct 0xfffff03f SErr 0x0 action 0x6 frozen
Oct 12 23:46:00 fractal kernel: ata8.00: failed command: WRITE FPDMA QUEUED
Oct 12 23:46:00 fractal kernel: ata8.00: cmd 61/18:00:a8:16:ff/00:00:0b:00:00/40 tag 0 ncq dma 12288 out
Oct 12 23:46:00 fractal kernel:         res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 12 23:46:00 fractal kernel: ata8.00: status: { DRDY }
Oct 12 23:46:00 fractal kernel: ata8.00: failed command: WRITE FPDMA QUEUED
Oct 12 23:46:00 fractal kernel: ata8.00: cmd 61/20:08:c0:16:ff/00:00:0b:00:00/40 tag 1 ncq dma 16384 out
Oct 12 23:46:00 fractal kernel:         res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)

and then:

Oct 13 00:27:29 fractal rsyslogd: action 'action-3-builtin:omfile' (module 'builtin:omfile') message lost, could not be processed. Check for additional error messages before this one. [v8.1908.0 try https://www.rsyslog.com/e/2027 ]
Oct 13 00:27:29 fractal rsyslogd: file '/mnt/user/meta/syslog-10.10.10.10.log'[2] write error - see https://www.rsyslog.com/solving-rsyslog-write-errors/ for help OS error: Read-only file system [v8.1908.0 try https://www.rsyslog.com/e/2027 ]
Oct 13 00:27:29 fractal rsyslogd: action 'action-3-builtin:omfile' (module 'builtin:omfile') message lost, could not be processed. Check for additional error messages before this one. [v8.1908.0 try https://www.rsyslog.com/e/2027 ]
Oct 13 00:27:29 fractal kernel: scsi_io_completion_action: 127 callbacks suppressed
Oct 13 00:27:29 fractal kernel: sd 7:0:0:0: [sdd] tag#5 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Oct 13 00:27:29 fractal kernel: sd 7:0:0:0: [sdd] tag#5 CDB: opcode=0x28 28 00 03 61 18 e0 00 00 20 00
Oct 13 00:27:29 fractal kernel: print_req_error: 133 callbacks suppressed
Oct 13 00:27:29 fractal kernel: print_req_error: I/O error, dev sdd, sector 56695008
Oct 13 00:27:29 fractal kernel: btrfs_dev_stat_print_on_error: 127 callbacks suppressed
Oct 13 00:27:29 fractal kernel: BTRFS error (device dm-8): bdev /dev/mapper/sdd1 errs: wr 74, rd 10858, flush 0, corrupt 0, gen 0
Oct 13 00:27:29 fractal kernel: sd 8:0:0:0: [sde] tag#25 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Oct 13 00:27:29 fractal kernel: sd 8:0:0:0: [sde] tag#25 CDB: opcode=0x28 28 00 0e 20 18 e0 00 00 20 00
Oct 13 00:27:29 fractal kernel: print_req_error: I/O error, dev sde, sector 236984544
Oct 13 00:27:29 fractal kernel: BTRFS error (device dm-8): bdev /dev/mapper/sde1 errs: wr 417, rd 9685, flush 0, corrupt 0, gen 0
Oct 13 00:27:29 fractal kernel: sd 7:0:0:0: [sdd] tag#3 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Oct 13 00:27:29 fractal kernel: sd 7:0:0:0: [sdd] tag#3 CDB: opcode=0x28 28 00 11 86 70 70 00 00 08 00
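
For anyone comparing the two devices, the per-device counters in the BTRFS error lines above can be pulled out with a short script. This is only a rough sketch: the regular expression is based purely on the line format shown in this log, and the function name `parse_btrfs_errors` is made up for illustration.

```python
import re

# Pattern based only on the "BTRFS error ... errs:" lines quoted in this thread.
BTRFS_ERR = re.compile(
    r"BTRFS error \(device (?P<fs>[^)]+)\): bdev (?P<bdev>\S+) errs: "
    r"wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+), "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)

COUNTERS = ("wr", "rd", "flush", "corrupt", "gen")


def parse_btrfs_errors(lines):
    """Return the most recent error counters seen for each block device."""
    latest = {}
    for line in lines:
        match = BTRFS_ERR.search(line)
        if match:
            latest[match.group("bdev")] = {
                name: int(match.group(name)) for name in COUNTERS
            }
    return latest


# Two lines copied from the syslog excerpt above.
sample = [
    "Oct 13 00:27:29 fractal kernel: BTRFS error (device dm-8): bdev /dev/mapper/sdd1 errs: wr 74, rd 10858, flush 0, corrupt 0, gen 0",
    "Oct 13 00:27:29 fractal kernel: BTRFS error (device dm-8): bdev /dev/mapper/sde1 errs: wr 417, rd 9685, flush 0, corrupt 0, gen 0",
]

if __name__ == "__main__":
    for bdev, counters in parse_btrfs_errors(sample).items():
        print(bdev, counters)
```

In these two lines, both devices show write and read errors while the flush, corrupt and gen counters are 0.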

I was able to shut down after grabbing diagnostics and went to bed. I woke up this morning to a working array, but the cache pool was missing 173G: it had gone from 500G to 327G.

 

Does anyone know what happened to my drives?

 

I fear that I am a victim of the excessive cache write bug and that it has killed my SSDs, although I would have expected them to simply die rather than lose space.

 

I've included the diags from last night and this morning after a successful boot.

 

I'm going to look into new SSDs (any suggestions?). My motherboard can also support 2x M.2 drives.

Thanks!

fractal-diagnostics-20201013-0027.zip fractal-diagnostics-20201013-0658.zip

Edited by shaunsund

Both cache devices dropped offline at the same time:

Oct 12 23:47:21 fractal kernel: ata7: hard resetting link
Oct 12 23:47:56 fractal kernel: ata7: softreset failed (1st FIS failed)
Oct 12 23:47:56 fractal kernel: ata7: limiting SATA link speed to 3.0 Gbps
Oct 12 23:47:56 fractal kernel: ata7: hard resetting link
Oct 12 23:47:56 fractal kernel: ata8: softreset failed (1st FIS failed)
Oct 12 23:47:56 fractal kernel: ata8: limiting SATA link speed to 3.0 Gbps
Oct 12 23:47:56 fractal kernel: ata8: hard resetting link
Oct 12 23:48:01 fractal kernel: ata8: softreset failed (1st FIS failed)
Oct 12 23:48:01 fractal kernel: ata8: reset failed, giving up
Oct 12 23:48:01 fractal kernel: ata8.00: disabled
Oct 12 23:48:01 fractal kernel: ata8: EH complete
Oct 12 23:48:01 fractal kernel: ata7: softreset failed (1st FIS failed)
Oct 12 23:48:01 fractal kernel: ata7: reset failed, giving up
Oct 12 23:48:01 fractal kernel: ata7.00: disabled

They are both on the same Asmedia controller, so possibly a controller issue, or a power problem if both share a power connector/splitter.
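
To check for this kind of simultaneous drop in a longer syslog, something like the following could group the `ataN.00: disabled` messages by timestamp. It is a minimal sketch that assumes the syslog line format shown in the excerpt above; the function name `disabled_ports` is just illustrative.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   Oct 12 23:48:01 fractal kernel: ata8.00: disabled
DISABLED = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: "
    r"(?P<port>ata\d+)\.\d+: disabled"
)


def disabled_ports(lines):
    """Map each timestamp to the ATA ports whose device was disabled then."""
    by_time = defaultdict(list)
    for line in lines:
        match = DISABLED.match(line)
        if match:
            by_time[match.group("ts")].append(match.group("port"))
    return dict(by_time)


# Lines copied from the excerpt above.
sample = [
    "Oct 12 23:48:01 fractal kernel: ata8: reset failed, giving up",
    "Oct 12 23:48:01 fractal kernel: ata8.00: disabled",
    "Oct 12 23:48:01 fractal kernel: ata8: EH complete",
    "Oct 12 23:48:01 fractal kernel: ata7: reset failed, giving up",
    "Oct 12 23:48:01 fractal kernel: ata7.00: disabled",
]

if __name__ == "__main__":
    for timestamp, ports in disabled_ports(sample).items():
        print(timestamp, ports)
```

More than one port listed under the same timestamp suggests something the devices share (controller, cable run or power) rather than a single failing drive.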

 

This by itself should not cause data loss, but I can only see cache usage after the reboot. I can't see what it was before, because the first diagnostics were naturally taken after both devices had dropped.

 

 

 


I've got 2 other drives on the same controller in addition to the SSDs, and there were no problems with those. It's curious that the other 2 drives didn't also complain.

 

Actually, I didn't get enough sleep (~4 hrs) and confused the free space with the filesystem size.
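
For anyone else who mixes these up, a quick way to see size, used and free side by side is Python's `shutil.disk_usage`. This is only a sketch: the default path `.` is a placeholder, and you would pass the mount point of your own cache pool instead.

```python
import shutil

GIB = 1024 ** 3


def usage_summary(path="."):
    """Return total size, used and free space for the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return {
        "size_gib": usage.total / GIB,
        "used_gib": usage.used / GIB,
        "free_gib": usage.free / GIB,
    }


if __name__ == "__main__":
    # "." is a placeholder; substitute the cache pool's mount point.
    for name, value in usage_summary(".").items():
        print(f"{name}: {value:.1f}")
```

The size figure should stay constant for a given pool, while used and free change as data is written, so a shrinking number is usually the free space.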

 

SMART reports are OK on each drive, and btrfs reports OK. It must be the motherboard, which seems even more likely given that it's 3 months out of warranty.

 

