Cache errors on UNRAID 6.8.3

I recently added a new JBOD to my server, along with some new 1TB cache drives (configured as BTRFS RAID 10). I was building parity on a second parity drive when my cache failed, leaving all my Docker containers dead. The drive's details tab shows "Unavailable - disk must be spun up". If I spin the drive up, it looks normal on the Main page, but when I go back to the details tab it shows the same "must be spun up" message. I have had this problem a few times recently, which is how I ended up with only 7 drives in a RAID 10 configuration for cache. I am currently trying to get mover to move all my files back to the array so I can rebuild the cache drives (and I am also manually moving them over to another Unraid box). I have attached diagnostics.

 

Oh yeah, I pulled the hot-swap drive to see if it would come back, and it did show back up in Unassigned Devices as drive "sdal"; it was originally "sdj". Unassigned Devices will not let me reassign the UUID to get the drive back into the cache pool while the array is online. In the process I inadvertently pulled cache drive "sdp", which is now also in Unassigned Devices as "sdai"; again, it will not let me reassign the UUID. With the normal array, if a drive is pulled and then reinserted, the array seems to bring it back, but the cache does not. Is that due to the differences between BTRFS RAID 10 and the Unraid XFS parity array?

 

On another note, since the data should still be on those two cache drives, is there a way to add them back to cache without losing the data? In my experience, every time you adjust the cache it just reformats the drive.

tower-diagnostics-20200713-1414.zip

The log is filled with errors like these for multiple devices:

Jul 12 23:52:13 Tower kernel: sd 7:0:18:0: Power-on or device reset occurred
Jul 12 23:52:18 Tower kernel: mpt2sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
### [PREVIOUS LINE REPEATED 1 TIMES] ###
Jul 12 23:52:19 Tower kernel: sd 7:0:18:0: Power-on or device reset occurred
Jul 12 23:52:19 Tower kernel: mpt2sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
### [PREVIOUS LINE REPEATED 1 TIMES] ###
Jul 12 23:52:20 Tower kernel: sd 7:0:18:0: Power-on or device reset occurred
Jul 12 23:52:22 Tower kernel: mpt2sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
### [PREVIOUS LINE REPEATED 2 TIMES] ###
Jul 12 23:52:22 Tower kernel: sd 7:0:1:0: Power-on or device reset occurred

 

This suggests a power/connection problem on that HBA. Check all cables and/or try a different PSU.
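
To see which devices behind the HBA are resetting most often, the reset lines above can be tallied per SCSI address. This is only a rough sketch: the function name `count_resets` and the `SAMPLE` list are made up for illustration, the sample is copied from the excerpt above, and the "PREVIOUS LINE REPEATED" markers are simply ignored, so the counts are a lower bound.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   "... kernel: sd 7:0:18:0: Power-on or device reset occurred"
RESET_RE = re.compile(
    r"kernel: sd (\d+:\d+:\d+:\d+): Power-on or device reset occurred"
)


def count_resets(lines):
    """Return a Counter mapping SCSI address (host:channel:target:lun) to reset count."""
    counts = Counter()
    for line in lines:
        match = RESET_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample lines taken from the log excerpt in this thread.
SAMPLE = [
    "Jul 12 23:52:13 Tower kernel: sd 7:0:18:0: Power-on or device reset occurred",
    "Jul 12 23:52:18 Tower kernel: mpt2sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)",
    "Jul 12 23:52:19 Tower kernel: sd 7:0:18:0: Power-on or device reset occurred",
    "Jul 12 23:52:20 Tower kernel: sd 7:0:18:0: Power-on or device reset occurred",
    "Jul 12 23:52:22 Tower kernel: sd 7:0:1:0: Power-on or device reset occurred",
]

if __name__ == "__main__":
    # Print the addresses with the most resets first.
    for address, n in count_resets(SAMPLE).most_common():
        print(f"{address}: {n} reset(s)")
```

Any iterable of lines works as input, so the same function can be fed the full syslog from the diagnostics. If the resets are spread across many targets rather than one, that points more toward the shared cabling, enclosure, or power than toward a single bad disk.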

 

After all those errors one of the cache devices ended up dropping offline:

Jul 12 23:53:07 Tower kernel: sd 7:0:15:0: Power-on or device reset occurred
Jul 12 23:53:07 Tower kernel: mpt2sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
Jul 12 23:53:08 Tower kernel: sd 7:0:16:0: Power-on or device reset occurred
Jul 12 23:53:08 Tower kernel: sd 7:0:20:0: device_unblock and setting to running, handle(0x001c)
Jul 12 23:53:08 Tower kernel: sd 7:0:20:0: [sdj] tag#6827 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Jul 12 23:53:08 Tower kernel: sd 7:0:20:0: [sdj] tag#6827 CDB: opcode=0x28 28 00 0a 3c 3a a0 00 00 20 00
Jul 12 23:53:08 Tower kernel: print_req_error: I/O error, dev sdj, sector 171719328
Jul 12 23:53:08 Tower kernel: sd 7:0:20:0: [sdj] tag#6822 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Jul 12 23:53:08 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdj1 errs: wr 0, rd 1, flush 0, corrupt 0, gen 0
Jul 12 23:53:08 Tower kernel: sd 7:0:20:0: [sdj] tag#6822 CDB: opcode=0x2a 2a 00 0d 21 8d c0 00 09 80 00
Jul 12 23:53:08 Tower kernel: print_req_error: I/O error, dev sdj, sector 220302784
Jul 12 23:53:08 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdj1 errs: wr 1, rd 1, flush 0, corrupt 0, gen 0
Jul 12 23:53:08 Tower kernel: sd 7:0:20:0: [sdj] tag#6828 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Jul 12 23:53:08 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdj1 errs: wr 2, rd 1, flush 0, corrupt 0, gen 0
Jul 12 23:53:08 Tower kernel: sd 7:0:20:0: [sdj] tag#6828 CDB: opcode=0x28 28 00 08 f6 5b 08 00 00 18 00
Jul 12 23:53:08 Tower kernel: print_req_error: I/O error, dev sdj, sector 150362888
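
The CDB bytes in the lines above can be decoded to confirm what was failing. Opcode 0x28 is READ(10) and 0x2a is WRITE(10) in the SCSI block command set, and bytes 2 to 5 of a 10-byte CDB hold the starting block address, which lines up with the sector reported by `print_req_error`. The helper below is an illustrative sketch (the name `decode_cdb10` is made up, and it only handles the two opcodes seen in this log):

```python
import re

# 10-byte SCSI opcodes that appear in the log excerpt above.
OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}

# Captures the hex bytes that follow "CDB: opcode=0x..".
CDB_RE = re.compile(r"CDB: opcode=0x[0-9a-f]{2} ((?:[0-9a-f]{2} ?)+)")


def decode_cdb10(line):
    """Return (operation, start_block, block_count) for a READ(10)/WRITE(10)
    CDB log line, or None if the line does not contain one."""
    match = CDB_RE.search(line)
    if not match:
        return None
    cdb = bytes.fromhex(match.group(1))
    if len(cdb) != 10 or cdb[0] not in OPCODES:
        return None
    start_block = int.from_bytes(cdb[2:6], "big")  # bytes 2-5: logical block address
    block_count = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return OPCODES[cdb[0]], start_block, block_count


if __name__ == "__main__":
    line = ("Jul 12 23:53:08 Tower kernel: sd 7:0:20:0: [sdj] tag#6827 "
            "CDB: opcode=0x28 28 00 0a 3c 3a a0 00 00 20 00")
    print(decode_cdb10(line))
```

For the three CDB lines in the excerpt, the decoded start blocks are 171719328, 220302784 and 150362888, the same sectors as in the I/O error lines. Both reads and writes to sdj were failing at that moment, which matches the BTRFS `rd` and `wr` counters climbing.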

See here for help with the cache issue, but you need to fix the HBA reset problems first.
