Power Flicker? Unable to stop the array



Hello all,

I think I had a little power flicker or something; I noticed that my server's fans started spinning a bit faster for a second, which prompted me to check the server. After pulling up the page, disk 8 is red-balled and all the other disks have 300-400ish errors. Wanting to prevent further errors, I tried to stop the array, and that's where it's stuck.

 

lsof | grep /mnt
lsof: WARNING: can't stat() xfs file system /mnt/disk8
      Output information may be incomplete.
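As a side note, when lsof can't stat() a filesystem like that, the same information can still be dug out of /proc directly, since every open file shows up as a symlink under /proc/PID/fd. A minimal sketch (demonstrated here against the current shell's own file descriptors rather than a live Unraid mount; on a server you would loop over /proc/[0-9]*/fd/* as root):

```shell
# Sketch: emulate the useful part of `lsof | grep /mnt` by reading /proc.
# Only the current shell's fds ($$) are scanned here as a demo.
for fd in /proc/$$/fd/*; do
  target=$(readlink "$fd" 2>/dev/null)
  case $target in
    /mnt/*) echo "pid $$ holds $target open" ;;
  esac
done
echo "scan complete"
```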

Before any of this happened, I suspected that disk 10 was on its way out due to 114 read errors that happened yesterday, though an extended SMART diagnostic came back good...

 

And now the parity drive is red-balled... HELP!

 

tower-diagnostics-20220729-1405.zip


Also there is this error...

 

Jul 29 12:10:01 Tower kernel: Buffer I/O error on dev md11, logical block 1953506608, async page read
Jul 29 12:10:01 Tower kernel: Buffer I/O error on dev md8, logical block 1953522944, async page read
Jul 29 12:10:01 Tower kernel: Buffer I/O error on dev md8, logical block 1953522945, async page read
Jul 29 12:10:01 Tower kernel: Buffer I/O error on dev md8, logical block 1953522946, async page read
Jul 29 12:10:01 Tower kernel: Buffer I/O error on dev md8, logical block 1953522947, async page read
Jul 29 12:10:01 Tower kernel: Buffer I/O error on dev md8, logical block 1953522948, async page read
Jul 29 12:10:01 Tower kernel: Buffer I/O error on dev md8, logical block 1953522949, async page read
Jul 29 12:10:01 Tower kernel: Buffer I/O error on dev md8, logical block 1953522950, async page read
Jul 29 12:10:01 Tower kernel: Buffer I/O error on dev md8, logical block 1953522951, async page read
Jul 29 12:10:01 Tower kernel: Buffer I/O error on dev md6, logical block 1953506608, async page read
Jul 29 12:10:01 Tower kernel: blk_update_request: I/O error, dev loop2, sector 41942912 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
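For what it's worth, errors like these can be tallied per device with a quick grep/sort/uniq pipeline over the syslog. A sketch with a few of the lines above inlined as sample input (on the server you would feed it /var/log/syslog instead):

```shell
# Count "Buffer I/O error" lines per md device; sample log lines are inlined
# here for illustration, normally this would read from /var/log/syslog.
printf '%s\n' \
  'Jul 29 12:10:01 Tower kernel: Buffer I/O error on dev md8, logical block 1953522944, async page read' \
  'Jul 29 12:10:01 Tower kernel: Buffer I/O error on dev md8, logical block 1953522945, async page read' \
  'Jul 29 12:10:01 Tower kernel: Buffer I/O error on dev md6, logical block 1953506608, async page read' |
  grep -oE 'dev md[0-9]+' | sort | uniq -c | sort -rn
```

With the full syslog this gives a quick per-disk error count, which makes it easier to see whether the errors are concentrated on one device or spread across the array.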

 

2 minutes ago, trurl said:

Give me a little while to go through these diagnostics; fortunately, we also have those from before the reboot.

 

Post a screenshot of Main - Array Devices

 

 

Before I forget to mention it.

 

Thank you for the quick assistance. 

It's been a while since I have had drive failures, but I've never had more than one die at a time.

 

I did recently move Unraid onto this hardware, an HP 1U server and a 24-bay DAS, both with dual PSUs. It hasn't given me any issues up until now.


Never got around to converting filesystem on disk3?

 

Your RAID controller is making it complicated to identify disks, one of many reasons RAID controllers are NOT recommended. And it looks like all the SAS disks have disconnected and reconnected as different devices, adding to the confusion. Are they all on that controller?

 

 

3 minutes ago, trurl said:

Are they all on that controller?

Correct. All of the drives are connected via the RAID controller.

 

When I moved everything over I had to use a different RAID controller (I thought I was using a recommended one from the Unraid wiki).

 

If I recall correctly, disks 9-11 are SAS and all the rest are SATA drives.


I don't know how to read SMART reports for SAS drives, or even if that RAID controller is giving useful SMART reports.

 

SMART report for parity looks fine, but it hasn't had an extended self-test recently.

 

SMART report for disk8 looks fine; no self-test has been run.

 

Not going to look at the others since there are so many. Do any have SMART warnings on the Dashboard page?

 

All 14 array data disks mounted, including emulated disk8, so that's good.

 

1 minute ago, mathomas3 said:

had to use a different RAID controller (thought I was using a recommended one from the Unraid wiki)

You must be looking at a very old wiki. No RAID controllers are recommended these days. Some can be flashed to IT mode so they are no longer RAID controllers.

 

2 minutes ago, trurl said:

Do any have SMART warnings on the Dashboard page?


Negative. SMART reports look good for the ones I have looked at.

Jul 14 00:00:01 Tower kernel: mdcmd (36): check NOCORRECT
Jul 14 00:00:01 Tower kernel: 
Jul 14 00:00:01 Tower kernel: md: recovery thread: check P Q ...
Jul 15 05:55:20 Tower kernel: sd 2:0:6:0: [sdg] tag#5891 Sense Key : 0x3 [current] [descriptor] 
Jul 15 05:55:20 Tower kernel: sd 2:0:6:0: [sdg] tag#5891 ASC=0x11 ASCQ=0x0 
Jul 15 05:55:20 Tower kernel: sd 2:0:6:0: [sdg] tag#5891 CDB: opcode=0x28 28 00 73 7d 14 85 00 00 80 00
Jul 15 05:55:20 Tower kernel: blk_update_request: critical medium error, dev sdg, sector 15500616856 op 0x0:(READ) flags 0x0 phys_seg 114 prio class 0
Jul 15 05:55:20 Tower kernel: md: disk10 read error, sector=15500616792
Jul 15 05:55:20 Tower kernel: md: disk10 read error, sector=15500616800

That looks like a disk problem, and similar errors continue for the rest of the parity check, but it completes:

Jul 15 06:08:55 Tower kernel: md: sync done. time=108534sec
Jul 15 06:08:55 Tower kernel: md: recovery thread: exit status: 0

There are some other syslog entries about the controller for disk7 after that, from while you were trying to shut things down.


Some things about your configuration are not ideal. Often when I see this in diagnostics I say it is unrelated to the problem, but in your case it does seem related.

 

Your appdata, domains, and system shares have files on the array, specifically on disk8.

 

Ideally, all files for these shares would be on the fast pool (cache) and set to stay there, so docker/VM performance isn't impacted by the slower array, and so array disks can spin down, since these files are always open.
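A quick way to check which array disks actually hold files for those shares is to loop over the per-disk mounts. A sketch using a throwaway directory tree in place of /mnt (the disk8/cache layout below is simulated; on the server you would set root=/mnt and skip the mkdir):

```shell
# Sketch: report which diskN directories contain each always-open share.
# A temp tree stands in for /mnt here; use root=/mnt on a real server.
root=$(mktemp -d)
mkdir -p "$root/disk8/appdata" "$root/cache/appdata"   # simulated layout
for share in appdata domains system; do
  for d in "$root"/disk*/"$share"; do
    if [ -d "$d" ]; then
      echo "$share has files on $(basename "$(dirname "$d")")"
    fi
  done
done
```

With root=/mnt this lists each share/disk combination that would need to be moved to the cache pool (the Mover handles that once the shares are set to prefer cache).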

Just now, trurl said:

Do you have a UPS?

I do have a UPS, though I need to get a larger one; I'm pushing this current one to the limit of what it can handle. For power, I connected one PSU to the UPS and the second PSU directly to the wall.

 

I'm thinking what happened is that the DAS lost power for a fraction of a second, causing the errors on the disks, and since parity and disk 8 were actively being accessed, they got disabled.

14 minutes ago, trurl said:

Some other syslog entries about the controller for disk7 after that while you were trying to shut things down.

Actually, I don't think you were shutting down yet; there was another syslog after that which I needed to go through in those first diagnostics.

 

There are some entries in both of those syslogs about the cache being full, but it isn't now; I don't know whether it was earlier or not.

 

That next syslog has similar controller problems with disk29 (parity2), more for disk8, and then it looks like all disks start acting up. Eventually you get a write error on disk0 (parity), disk8, and disk29 (parity2). Write errors always disable disks, but that is one too many, so parity2 doesn't get disabled.

 

 

3 minutes ago, trurl said:

Eventually you get a write error on disk0 (parity), disk8, and disk29 (parity2). Write errors always disable disks, but that is one too many, so parity2 doesn't get disabled.


So you think both parity drives are hosed? 

5 minutes ago, mathomas3 said:

When you say that the SAS disks have been disconnected and reconnected as different devices... what do you mean?

 

I didn't reassign the disks when everything was moved into the new box.

I mean while the server was running. Hardware issues made them disconnect, and then they reconnected, but Linux gave them new device designations, so Unraid doesn't recognize them anymore. That happened in those earlier diagnostics where things went south, and again in these current diagnostics.

 

Do you have any disks showing in Unassigned Devices?

 

Take a look at your latest diagnostics, in the smart folder. None of the SAS disks are assigned, and they have different sd designations than in your screenshot. You can also see what I mean about RAID controllers making it difficult to identify disks, since you have to actually open up each of those SMART reports to figure out which disk each one is.
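One way to speed up that matching is to pull just the serial number out of each report. A sketch that parses a smartctl -i style report with awk (the report text and values below are invented for illustration; on the server you would run `smartctl -i /dev/sdX` for each device and match the serials against the drive labels):

```shell
# Hypothetical sample of a SAS smartctl -i report (values invented):
report='Vendor:               SEAGATE
Product:              ST8000NM0075
Serial number:        ZA123XYZ'
# Extract the serial so each sdX can be matched to a physical disk.
# tolower() covers both "Serial number:" (SAS) and "Serial Number:" (ATA).
printf '%s\n' "$report" | awk -F':[ \t]*' 'tolower($1) ~ /serial/ {print $2}'
```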
