Two disabled disks, and cache drives not correct




I started the day by moving my appdata, system, and domains folders off my cache pool onto the array so that I could upgrade to bigger cache drives. I moved the two 120 GB SSDs into my DAS and added two new 256 GB SSDs to my server in their place. I booted up, selected the new SSDs as the disks for my appdata cache pool, and ran the mover. Everything was fine.

Then, some time later, my second parity drive ended up being disabled:

 

Mar  5 16:00:08 Skynet kernel: critical target error, dev sdy, sector 230862839 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 0
Mar  5 16:00:08 Skynet kernel: BTRFS warning (device sdy1): failed to trim 1 device(s), last error -121
Mar  5 16:00:08 Skynet root: /etc/libvirt: 920.8 MiB (965480448 bytes) trimmed on /dev/loop3
Mar  5 16:00:08 Skynet root: /var/lib/docker: 21.8 GiB (23369732096 bytes) trimmed on /dev/loop2
Mar  5 16:00:08 Skynet root: /mnt/skynetcache: 339.6 GiB (364687364096 bytes) trimmed on /dev/sdb1
Mar  5 16:00:13 Skynet  sSMTP[18250]: Creating SSL connection to host
Mar  5 16:00:13 Skynet  sSMTP[18250]: SSL connection using TLS_AES_256_GCM_SHA384
Mar  5 16:00:15 Skynet  sSMTP[18250]: Sent mail for [email protected] (221 2.0.0 closing connection i8-20020aa79088000000b00594235980e4sm5017473pfa.181 - gsmtp) uid=0 username=root outbytes=468
Mar  5 16:10:24 Skynet webGUI: Successful login user root from 192.168.1.228
Mar  5 16:15:30 Skynet kernel: mpt2sas_cm0: log_info(0x31120101): originator(PL), code(0x12), sub_code(0x0101)
Mar  5 16:15:30 Skynet kernel: mpt2sas_cm0: log_info(0x31120101): originator(PL), code(0x12), sub_code(0x0101)
Mar  5 16:15:30 Skynet kernel: mpt2sas_cm0: log_info(0x31120101): originator(PL), code(0x12), sub_code(0x0101)
Mar  5 16:15:30 Skynet kernel: mpt2sas_cm0: log_info(0x31120101): originator(PL), code(0x12), sub_code(0x0101)
Mar  5 16:15:30 Skynet kernel: mpt2sas_cm0: log_info(0x31120101): originator(PL), code(0x12), sub_code(0x0101)
Mar  5 16:15:30 Skynet kernel: mpt2sas_cm0: log_info(0x31120101): originator(PL), code(0x12), sub_code(0x0101)
Mar  5 16:15:30 Skynet kernel: mpt2sas_cm0: log_info(0x31120101): originator(PL), code(0x12), sub_code(0x0101)
Mar  5 16:15:30 Skynet kernel: mpt2sas_cm0: log_info(0x31120101): originator(PL), code(0x12), sub_code(0x0101)
Mar  5 16:15:30 Skynet kernel: mpt2sas_cm0: log_info(0x31120101): originator(PL), code(0x12), sub_code(0x0101)
Mar  5 16:15:30 Skynet kernel: mpt2sas_cm0: log_info(0x31120101): originator(PL), code(0x12), sub_code(0x0101)
Mar  5 16:15:30 Skynet kernel: mpt2sas_cm0: log_info(0x31120101): originator(PL), code(0x12), sub_code(0x0101)
Mar  5 16:15:30 Skynet kernel: mpt2sas_cm0: log_info(0x31120101): originator(PL), code(0x12), sub_code(0x0101)
Mar  5 16:15:30 Skynet kernel: mpt2sas_cm0: log_info(0x31120101): originator(PL), code(0x12), sub_code(0x0101)
Mar  5 16:15:30 Skynet kernel: mpt2sas_cm0: log_info(0x31120101): originator(PL), code(0x12), sub_code(0x0101)
Mar  5 16:15:30 Skynet kernel: sd 15:0:17:0: device_block, handle(0x001c)
Mar  5 16:15:30 Skynet kernel: sd 15:0:18:0: device_block, handle(0x001d)
Mar  5 16:15:30 Skynet kernel: sd 15:0:17:0: [sdu] tag#3035 UNKNOWN(0x2003) Result: hostbyte=0x0e driverbyte=DRIVER_OK cmd_age=0s
Mar  5 16:15:30 Skynet kernel: sd 15:0:17:0: [sdu] tag#3035 CDB: opcode=0x2a 2a 00 0f 6d 59 e5 00 00 03 00
Mar  5 16:15:30 Skynet kernel: I/O error, dev sdu, sector 2070597416 op 0x1:(WRITE) flags 0x0 phys_seg 3 prio class 0
Mar  5 16:15:30 Skynet kernel: md: disk29 write error, sector=2070597352
Mar  5 16:15:30 Skynet kernel: md: disk29 write error, sector=2070597360
Mar  5 16:15:30 Skynet kernel: md: disk29 write error, sector=2070597368
Mar  5 16:15:30 Skynet kernel: sd 15:0:17:0: [sdu] tag#3034 UNKNOWN(0x2003) Result: hostbyte=0x0e driverbyte=DRIVER_OK cmd_age=0s
Mar  5 16:15:30 Skynet kernel: sd 15:0:17:0: [sdu] tag#3034 CDB: opcode=0x2a 2a 00 00 5d c4 30 00 00 04 00
Mar  5 16:15:30 Skynet kernel: I/O error, dev sdu, sector 49160576 op 0x1:(WRITE) flags 0x0 phys_seg 4 prio class 0
Mar  5 16:15:30 Skynet kernel: md: disk29 write error, sector=49160512
Mar  5 16:15:30 Skynet kernel: md: disk29 write error, sector=49160520
Mar  5 16:15:30 Skynet kernel: md: disk29 write error, sector=49160528
Mar  5 16:15:30 Skynet kernel: md: disk29 write error, sector=49160536
Mar  5 16:15:31 Skynet kernel: sd 15:0:17:0: device_unblock and setting to running, handle(0x001c)
Mar  5 16:15:31 Skynet kernel: sd 15:0:17:0: Power-on or device reset occurred
Mar  5 16:15:31 Skynet kernel: sd 15:0:18:0: device_unblock and setting to running, handle(0x001d)
Mar  5 16:15:31 Skynet kernel: sd 15:0:18:0: Power-on or device reset occurred
Mar  5 16:15:33 Skynet flash_backup: adding task: /usr/local/emhttp/plugins/dynamix.my.servers/scripts/UpdateFlashBackup update
Mar  5 16:21:04 Skynet  rpc.mountd[6271]: v4.2 client detached: 0x6f4ca8e26404f5fe from "192.168.1.48:768"
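
A side note on reading the log above: the "CDB:" lines can be decoded and lined up with the sector numbers that follow them. The Python sketch below is not from the original post; it decodes the two WRITE(10) commands shown. The 4096-byte logical block size is an assumption inferred from the numbers (the kernel's 512-byte sector values are exactly 8 times the decoded LBAs), and the helper names are made up for illustration.

```python
import struct

# Assumption (inferred, not stated in the log): the drive uses 4096-byte
# logical blocks, while kernel messages count 512-byte sectors.
ASSUMED_LOGICAL_BLOCK_BYTES = 4096
KERNEL_SECTOR_BYTES = 512


def decode_write10(cdb_hex: str) -> tuple[int, int]:
    """Return (lba, block_count) from a 10-byte WRITE(10) CDB given as hex."""
    cdb = bytes.fromhex(cdb_hex)  # whitespace between bytes is allowed
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a WRITE(10) CDB")
    # Byte 0: opcode, byte 1: flags, bytes 2-5: LBA (big-endian),
    # byte 6: group number, bytes 7-8: transfer length, byte 9: control.
    lba, _group, block_count = struct.unpack(">IBH", cdb[2:9])
    return lba, block_count


def to_kernel_sector(lba: int) -> int:
    """Convert a device LBA to the 512-byte sector number the kernel logs."""
    return lba * (ASSUMED_LOGICAL_BLOCK_BYTES // KERNEL_SECTOR_BYTES)


# CDB bytes copied from the two "CDB: opcode=0x2a" lines in the log above.
for cdb in ("2a 00 0f 6d 59 e5 00 00 03 00", "2a 00 00 5d c4 30 00 00 04 00"):
    lba, blocks = decode_write10(cdb)
    print(lba, blocks, to_kernel_sector(lba))
# prints:
# 258824677 3 2070597416
# 6145072 4 49160576
```

The computed sectors match the two "I/O error, dev sdu, sector ..." lines, and the transfer lengths of 3 and 4 blocks match the 3 and 4 "md: disk29 write error" lines that follow each command. The md sector numbers are 64 lower than the device sectors, which would be consistent with a partition that starts at sector 64, though that is again an inference from the numbers.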



After seeing the logs, I assumed I had bumped a cable loose or something like that, so I stopped the array, shut the server down, and reseated all the drives. I booted back up and was presented with no errors, but the drive was still disabled. So I began to look up the process to re-enable the drive, and roughly 20-30 minutes later a data disk threw an error and was disabled too. I checked the logs and this is all I saw:

image.thumb.png.bc7101618873ae92a559a5188fb97fcc.png

 

So I panicked and decided to stop the array and shut the server down. The array stopped, or so I thought, and then I saw that at the bottom it said:

 Array Stopping•Sync filesystems...

 

After around 20 minutes of it being completely locked up like this, I did a hard shutdown. I pulled both the server and the DAS out, reseated every cable, and even moved some drives from the DAS to the server and from the server to the DAS to help with troubleshooting.

I booted everything back up and ran a SMART test on both drives (attached). ST1 is the parity drive, and H72 is the data disk.

I also just noticed that my two new cache drives (which hold my appdata, domains, and system) are showing up as incorrect.



I've uploaded my diagnostics. At this point I am totally lost as to what I should be doing. I think I have two issues here: 1) the two disabled disks, and 2) the cache drives reporting as "wrong" when they are not.

Any help is appreciated.

 

 

image.png

skynet-diagnostics-20230306-0325.zip

Link to comment
1 hour ago, JorgeB said:

SMART looks OK for both disks, and it looks more like a power/connection problem, but you can run a long SMART test on both to confirm. To fix the pool issue:

Unassign both pool devices, start the array, stop the array, re-assign both cache devices, start the array.

I will run a long SMART test on both drives, but I agree it looks more power/connection related. How would I know if I fixed the problem by reseating all the drives?
If I unassign, start, stop, and reassign the cache drives, will I lose any data?

How do I fix the disabled drives without losing any data (assuming the SMART tests come back OK)?

 

Link to comment
15 minutes ago, Neldonado said:

If I unassign, start, stop, and reassign the cache drives, will I lose any data?

No, it will import any existing pool.

 

16 minutes ago, Neldonado said:

How do I fix the disabled drives without losing any data (assuming the SMART tests come back OK)?

After the tests, reboot and post new diags after array start.

Link to comment
1 hour ago, JorgeB said:

No, it will import any existing pool.

 

After the tests, reboot and post new diags after array start.

In which order should I do these? Run the tests, reboot, start the array, post diags, and after that fix the cache? Should I start the array normally or in maintenance mode?

Link to comment
10 hours ago, JorgeB said:

Any order you want as long as diags after array start are the last thing.

 

Normal

 

I've posted my diags and the extended tests for my drives. Both appear to be fine from what I can tell. I stopped my array after getting the diagnostics. I am not sure that leaving it running is a good idea until I know what's going on and how to fix it.

 

skynet-diagnostics-20230306-2345.zip ST10000NM0226_ZA290T3P0000C905MDCF_35000c500a6c8fc23-20230306-1532.txt H7280A520SUN8.0T_001701PVZUGV_VLKVZUGV_35000cca260da2f70-20230306-1532.txt

Link to comment
14 hours ago, JorgeB said:

Both SMART tests aborted. Try again, and make sure to disable spin down if you have the SAS spin down plugin installed.

How long do the tests usually take? One of the drives has been at 100% for a few hours now but has not completed. The second is at 67%. I started them both over 12 hours ago.
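
For anyone who prefers to rerun and watch the extended tests from a terminal rather than the GUI, here is a minimal Python sketch around smartctl. The flags used (-t long to start an extended self-test, -l selftest to print the self-test log, -X to abort a running test) are standard smartctl options; the device path and the helper names are placeholders, not something taken from this thread.

```python
import subprocess


def start_long_test_cmd(device: str) -> list[str]:
    """Command line that starts an extended (long) SMART self-test."""
    return ["smartctl", "-t", "long", device]


def selftest_log_cmd(device: str) -> list[str]:
    """Command line that prints the drive's self-test log."""
    return ["smartctl", "-l", "selftest", device]


def abort_test_cmd(device: str) -> list[str]:
    """Command line that aborts a self-test that is still running."""
    return ["smartctl", "-X", device]


def run(argv: list[str]) -> str:
    """Run a command and return whatever it printed to stdout."""
    result = subprocess.run(argv, capture_output=True, text=True, check=False)
    return result.stdout


if __name__ == "__main__":
    device = "/dev/sdX"  # placeholder: substitute the real device node
    # Only print the commands here; call run(...) on a real system.
    for argv in (start_long_test_cmd(device), selftest_log_cmd(device)):
        print(" ".join(argv))
```

Checking the self-test log again after the test finishes shows whether it completed or was aborted.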

 

I also noticed this in the logs while waiting. The array is stopped.

 

H0c.png

Link to comment
38 minutes ago, JorgeB said:

Replace cables for

 

cache device: (sds) ADATA_SU655_2L1929199ERW

 

Or connect it to an onboard SATA port if available.

 

The emulated disk is mounting; assuming the contents look correct, you can rebuild on top (and re-sync parity2 at the same time).


I will try a new SAS breakout cable for that sds drive, or see if I have room to plug it directly into the mobo. Do SSDs not play nice?

Is there anything you can see that would tell you what caused these drives to be disabled in the first place? Just loose cables?

What's the safest way for me to go about rebuilding the contents, assuming it's all there?

 

Thank you so much for all the time you've spent helping me.

Link to comment
1 hour ago, Neldonado said:

Do SSDs not play nice?

Sometimes not. Also, trim won't work with that HBA; it will on the onboard SATA ports.
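
This ties in with the "failed to trim" warning and the DISCARD error in the first log. One rough way to see whether the kernel thinks a device behind a given controller can trim is to read its discard limit from sysfs: a value of 0 in queue/discard_max_bytes means discard is not available for that device. The Python sketch below is an illustration only; the function names are made up.

```python
from pathlib import Path


def limit_allows_discard(raw: str) -> bool:
    """Interpret the contents of a discard_max_bytes file (0 = no discard)."""
    return int(raw.strip()) > 0


def discard_report(sysfs_root: str = "/sys/block") -> dict[str, bool]:
    """Map each block device name to whether the kernel reports discard support."""
    report = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return report
    for limit_file in sorted(root.glob("*/queue/discard_max_bytes")):
        device = limit_file.parent.parent.name  # e.g. "sdy"
        report[device] = limit_allows_discard(limit_file.read_text())
    return report


if __name__ == "__main__":
    for device, supported in discard_report().items():
        print(f"{device}: {'trim available' if supported else 'no trim'}")
```

Running it before and after moving an SSD from the HBA to an onboard SATA port should show the difference for that device.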

 

1 hour ago, Neldonado said:

Is there anything you can see that would tell you what caused these drives to be disabled in the first place? Just loose cables?

Only what I mentioned initially: it looks more like a power/connection problem. It's difficult to say more than that. You can replace/swap cables/slots to rule that out; it could also be PSU related.

Link to comment
23 minutes ago, JorgeB said:

Sometimes not. Also, trim won't work with that HBA; it will on the onboard SATA ports.

 

Only what I mentioned initially: it looks more like a power/connection problem. It's difficult to say more than that. You can replace/swap cables/slots to rule that out; it could also be PSU related.

I agree, I think it's connection related. The fact is I was in there moving drives around just hours before, and the two drives were on different PSUs and different SAS breakout cables. I purchased a few replacement cables and will try to put the SSDs on the mobo or remove them from the equation entirely.

Can you recommend the safest method for me to rebuild the parity and data disk? 

Link to comment
14 minutes ago, JorgeB said:

Safest would be to rebuild the data disk to a spare; if there are no spares, at least check/replace cables before doing it.

As in, remove the disabled data disk and put in a new disk? I don't have an extra disk lying around, unless I used my cache drive to replace it. But then I would be moving data from my cache back to my emulated array.

Link to comment
On 3/8/2023 at 4:12 AM, JorgeB said:

Replace cables for

 

cache device: (sds) ADATA_SU655_2L1929199ERW

 

Or connect it to an onboard SATA port if available.

 

The emulated disk is mounting; assuming the contents look correct, you can rebuild on top (and re-sync parity2 at the same time).

Connected the SSDs to onboard SATA ports and rebuilt the data disk + parity. All is well! Thank you!

Link to comment
On 3/8/2023 at 10:40 AM, JorgeB said:

Rebuild on top of the existing one, was just mentioning the safest way.

SO!

 

It's been almost a week, and I just got a notification that the same parity drive was disabled. I attached a short SMART test of the disk, a screenshot of the errors, and my diagnostics. I haven't had any issues for over a week! All I did was invoke the mover maybe 30 minutes ago.

 

 

 

S8f.png

ST10000NM0226_ZA290T3P0000C905MDCF_35000c500a6c8fc23-20230315-1905.txt skynet-smart-20230315-1905.zip

Link to comment

Oops! I accidentally attached the SMART test twice. I've attached my diagnostics now. I went and moved some drives around, and a different drive started throwing errors, which made me think it was the cables. So I replaced both of the breakout cables. I saw no errors after this, so I started the parity rebuild. Fast forward 7-8 hours, and I now see that one of my cache drives is giving me errors. I am a bit lost at this point. I've included the current diagnostics as well.

 

R4b.png

skynet-diagnostics-20230315-1907.zip skynet-diagnostics-20230316-0345.zip

Link to comment
13 minutes ago, JorgeB said:
Mar 15 21:01:08 Skynet  emhttpd: import 33 cache device: (sdy) ST10000NM0226_ZA290T0B0000C905G44Y_35000c500a6c9041f

 

This device dropped offline; check/replace cables, then run a scrub.

I just replaced the breakout cables for this drive, even though it didn't cause issues before. I will check the connections again. Since the parity rebuild is 25% complete, should I let it finish before stopping and checking cables?

How do I go about running a scrub and what does that do?
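
As general background on the scrub question (not an answer taken from this thread): a btrfs scrub reads all data and metadata on the pool and verifies the checksums, repairing from a good copy where the pool has redundancy. From a terminal it is usually started with "btrfs scrub start <mountpoint>" and checked with "btrfs scrub status <mountpoint>", while "btrfs device stats <mountpoint>" prints per-device error counters. The Python sketch below picks the non-zero counters out of that stats output; the sample text and its values are hypothetical, and the function name is made up.

```python
import re

# One line of "btrfs device stats" output looks like:
#   [/dev/sdy1].write_io_errs    0
STAT_LINE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<name>\w+)\s+(?P<value>\d+)$")

# Hypothetical sample output, for illustration only (not from the diagnostics).
SAMPLE = """\
[/dev/sdy1].write_io_errs    12
[/dev/sdy1].read_io_errs     0
[/dev/sdy1].flush_io_errs    3
[/dev/sdy1].corruption_errs  0
[/dev/sdy1].generation_errs  0
"""


def nonzero_stats(text: str) -> dict[str, dict[str, int]]:
    """Return {device: {counter: value}} for every counter above zero."""
    bad: dict[str, dict[str, int]] = {}
    for line in text.splitlines():
        match = STAT_LINE.match(line.strip())
        if match and int(match["value"]) > 0:
            bad.setdefault(match["dev"], {})[match["name"]] = int(match["value"])
    return bad


if __name__ == "__main__":
    for device, counters in nonzero_stats(SAMPLE).items():
        print(device, counters)
```

Non-zero write or flush counters on a device that dropped offline would fit the cable explanation; a scrub after fixing the connection then checks whether the data on the pool is still consistent.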

Link to comment
