Power Flicker? Unable to stop the array



Hello all,

I think I had a little power flicker or something; I noticed that my server's fans started spinning a bit faster for a second, which prompted me to check the server. After pulling up the page, disk 8 is red-balled and all the other disks have 300-400ish errors. Wanting to prevent further errors, I tried to stop the array, and that's where it's stuck.

 

lsof | grep /mnt
lsof: WARNING: can't stat() xfs file system /mnt/disk8
      Output information may be incomplete.
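As a side note, when lsof can't stat() a filesystem like that, the same information can still be dug out of /proc directly, since every open file shows up as a symlink under /proc/PID/fd. A minimal sketch (demonstrated here against the current shell's own file descriptors rather than a live Unraid mount; on a server you would loop over /proc/[0-9]*/fd/* as root):

```shell
# Sketch: emulate the useful part of `lsof | grep /mnt` by reading /proc.
# Only the current shell's fds ($$) are scanned here as a demo.
for fd in /proc/$$/fd/*; do
  target=$(readlink "$fd" 2>/dev/null)
  case $target in
    /mnt/*) echo "pid $$ holds $target open" ;;
  esac
done
echo "scan complete"
```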

Before any of this happened, I suspected that disk 10 was on its way out due to 114 read errors that happened yesterday, though an extended SMART diagnostic came back good...

 

And now the parity drive is red-balled... HELP!

 

tower-diagnostics-20220729-1405.zip


Also there is this error...

 

Jul 29 12:10:01 Tower kernel: Buffer I/O error on dev md11, logical block 1953506608, async page read
Jul 29 12:10:01 Tower kernel: Buffer I/O error on dev md8, logical block 1953522944, async page read
Jul 29 12:10:01 Tower kernel: Buffer I/O error on dev md8, logical block 1953522945, async page read
Jul 29 12:10:01 Tower kernel: Buffer I/O error on dev md8, logical block 1953522946, async page read
Jul 29 12:10:01 Tower kernel: Buffer I/O error on dev md8, logical block 1953522947, async page read
Jul 29 12:10:01 Tower kernel: Buffer I/O error on dev md8, logical block 1953522948, async page read
Jul 29 12:10:01 Tower kernel: Buffer I/O error on dev md8, logical block 1953522949, async page read
Jul 29 12:10:01 Tower kernel: Buffer I/O error on dev md8, logical block 1953522950, async page read
Jul 29 12:10:01 Tower kernel: Buffer I/O error on dev md8, logical block 1953522951, async page read
Jul 29 12:10:01 Tower kernel: Buffer I/O error on dev md6, logical block 1953506608, async page read
Jul 29 12:10:01 Tower kernel: blk_update_request: I/O error, dev loop2, sector 41942912 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
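For what it's worth, errors like these can be tallied per device with a quick grep/sort/uniq pipeline over the syslog. A sketch with a few of the lines above inlined as sample input (on the server you would feed it /var/log/syslog instead):

```shell
# Count "Buffer I/O error" lines per md device; sample log lines are inlined
# here for illustration, normally this would read from /var/log/syslog.
printf '%s\n' \
  'Jul 29 12:10:01 Tower kernel: Buffer I/O error on dev md8, logical block 1953522944, async page read' \
  'Jul 29 12:10:01 Tower kernel: Buffer I/O error on dev md8, logical block 1953522945, async page read' \
  'Jul 29 12:10:01 Tower kernel: Buffer I/O error on dev md6, logical block 1953506608, async page read' |
  grep -oE 'dev md[0-9]+' | sort | uniq -c | sort -rn
```

With the full syslog this gives a quick per-disk error count, which makes it easier to see whether the errors are concentrated on one device or spread across the array.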

 

2 minutes ago, trurl said:

Give me a little while to go through these diagnostics; fortunately, we also have those from before the reboot.

 

Post a screenshot of Main - Array Devices

 

 

Before I forget to mention it.

 

Thank you for the quick assistance. 

It's been a while since I have had drive failures, but I've never had more than one die at a time.

 

I did recently move Unraid onto this hardware, an HP 1U server and a 24-bay DAS, both with dual PSUs. It hasn't given me any issues up until now.


Never got around to converting filesystem on disk3?

 

Your RAID controller is making it complicated to identify disks, one of many reasons RAID controllers are NOT recommended. And it looks like all the SAS disks have disconnected and reconnected as different devices, adding to the confusion. Are they all on that controller?

 

 

3 minutes ago, trurl said:

Are they all on that controller?

Correct. All of the drives are connected via the RAID controller.

 

When I moved everything over I had to use a different RAID controller (I thought I was using a recommended one from the Unraid wiki).

 

If I recall correctly, disks 9-11 are SAS and all the rest are SATA drives.


I don't know how to read SMART reports for SAS drives, or even if that RAID controller is giving useful SMART reports.

 

SMART report for parity looks fine, but it hasn't had an extended self-test recently.

 

SMART report for disk8 looks fine; no self-test has been run.

 

Not going to look at the others since there are so many. Do any have SMART warnings on the Dashboard page?

 

All 14 array data disks mounted, including emulated disk8, so that's good.

 

1 minute ago, mathomas3 said:

had to use a different RAID controller (thought I was using a recommended one from the Unraid wiki)

You must be looking at a very old wiki. No RAID controllers are recommended these days. Some can be flashed to IT mode so they are no longer RAID controllers.

 

2 minutes ago, trurl said:

Do any have SMART warnings on the Dashboard page?


Negative. SMART reports look good for the ones I have looked at.

Jul 14 00:00:01 Tower kernel: mdcmd (36): check NOCORRECT
Jul 14 00:00:01 Tower kernel: 
Jul 14 00:00:01 Tower kernel: md: recovery thread: check P Q ...
Jul 15 05:55:20 Tower kernel: sd 2:0:6:0: [sdg] tag#5891 Sense Key : 0x3 [current] [descriptor] 
Jul 15 05:55:20 Tower kernel: sd 2:0:6:0: [sdg] tag#5891 ASC=0x11 ASCQ=0x0 
Jul 15 05:55:20 Tower kernel: sd 2:0:6:0: [sdg] tag#5891 CDB: opcode=0x28 28 00 73 7d 14 85 00 00 80 00
Jul 15 05:55:20 Tower kernel: blk_update_request: critical medium error, dev sdg, sector 15500616856 op 0x0:(READ) flags 0x0 phys_seg 114 prio class 0
Jul 15 05:55:20 Tower kernel: md: disk10 read error, sector=15500616792
Jul 15 05:55:20 Tower kernel: md: disk10 read error, sector=15500616800

That looks like a disk problem, and similar errors continue for the rest of the parity check, but it completes:

Jul 15 06:08:55 Tower kernel: md: sync done. time=108534sec
Jul 15 06:08:55 Tower kernel: md: recovery thread: exit status: 0

There are some other syslog entries about the controller for disk7 after that, from while you were trying to shut things down.


Some things about your configuration are not ideal. Often when I see this in diagnostics I say it is unrelated to the problem, but in your case it does seem related.

 

Your appdata, domains, and system shares have files on the array, specifically on disk8.

 

Ideally, all files for these shares would be on the fast pool (cache) and set to stay there, so docker/VM performance isn't impacted by the slower array, and so array disks can spin down, since these files are always open.
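A quick way to check which array disks actually hold files for those shares is to loop over the per-disk mounts. A sketch using a throwaway directory tree in place of /mnt (the disk8/cache layout below is simulated; on the server you would set root=/mnt and skip the mkdir):

```shell
# Sketch: report which diskN directories contain each always-open share.
# A temp tree stands in for /mnt here; use root=/mnt on a real server.
root=$(mktemp -d)
mkdir -p "$root/disk8/appdata" "$root/cache/appdata"   # simulated layout
for share in appdata domains system; do
  for d in "$root"/disk*/"$share"; do
    if [ -d "$d" ]; then
      echo "$share has files on $(basename "$(dirname "$d")")"
    fi
  done
done
```

With root=/mnt this lists each share/disk combination that would need to be moved to the cache pool (the Mover handles that once the shares are set to prefer cache).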

Just now, trurl said:

Do you have a UPS?

I do have a UPS, though I need to get a larger one; I'm pushing this current one to the limit of what it can handle. For power, I connected one PSU to the UPS and the second PSU directly to the wall.

 

I'm thinking what happened is that the DAS lost power for a fraction of a second, causing the errors on the disks, and since parity and disk 8 were actively being accessed, they got disabled.

14 minutes ago, trurl said:

Some other syslog entries about the controller for disk7 after that while you were trying to shut things down.

Actually, I don't think you were shutting down yet; there was another syslog after that which I needed to go through in those first diagnostics.

 

There are some entries in both of those syslogs about the cache being full, but it isn't now; I don't know whether it was earlier or not.

 

That next syslog has similar controller problems with disk29 (parity2), more for disk8, and then it looks like all disks start acting up. Eventually you get a write error on disk0 (parity), disk8, and disk29 (parity2). Write errors always disable disks, but that is one too many, so parity2 doesn't get disabled.

 

 

3 minutes ago, trurl said:

Eventually you get a write error on disk0 (parity), disk8, and disk29 (parity2). Write errors always disable disks, but that is one too many, so parity2 doesn't get disabled.


So you think both parity drives are hosed? 

5 minutes ago, mathomas3 said:

When you say that the SAS disks have been disconnected and reconnected as different devices... what do you mean?

 

I didn't reassign the disks when everything was moved into the new box.

I mean while the server was running. Hardware issues made them disconnect, and then they reconnected, but Linux gave them new device designations, so Unraid doesn't recognize them anymore. That happened in those earlier diagnostics where things went south, and again in these current diagnostics.

 

Do you have any disks showing in Unassigned Devices?

 

Take a look at your latest diagnostics, in the smart folder. None of the SAS disks are assigned, and they have different sd designations than in your screenshot. You can also see what I mean about RAID controllers making it difficult to identify disks, since you have to actually open up each of those SMART reports to figure out which disk each one is.
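One way to speed up that matching is to pull just the serial number out of each report. A sketch that parses a smartctl -i style report with awk (the report text and values below are invented for illustration; on the server you would run `smartctl -i /dev/sdX` for each device and match the serials against the drive labels):

```shell
# Hypothetical sample of a SAS smartctl -i report (values invented):
report='Vendor:               SEAGATE
Product:              ST8000NM0075
Serial number:        ZA123XYZ'
# Extract the serial so each sdX can be matched to a physical disk.
# tolower() covers both "Serial number:" (SAS) and "Serial Number:" (ATA).
printf '%s\n' "$report" | awk -F':[ \t]*' 'tolower($1) ~ /serial/ {print $2}'
```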
