Disabled disk during array shutdown



Posted

I had Samba run into an issue and lock a temp file with DENY_ALL. The only way to resolve that is to restart Samba, so I decided to just shut down my VM and stop the array to restart it. Upon restarting the array I noticed disk1 was in an error state. I plan to just rebuild the disk because it looks healthy otherwise.
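For anyone hitting the same thing: `smbstatus -L` lists Samba's current file locks, including the deny mode. Here is a minimal sketch of filtering that output for DENY_ALL entries; the sample output is hypothetical, and the exact column layout varies by Samba version.

```python
# Sketch: filter `smbstatus -L` output for DENY_ALL share modes.
# SAMPLE below is hypothetical; the real column layout varies by
# Samba version, so match on the DENY_ALL token rather than columns.

SAMPLE = """\
Pid   Uid   DenyMode   Access    R/W     Oplock  SharePath  Name
1234  1000  DENY_ALL   0x2019f   RDWR    NONE    /mnt/user  scan.tmp
5678  1000  DENY_NONE  0x120089  RDONLY  NONE    /mnt/user  movie.mkv
"""

def deny_all_locks(output: str) -> list[str]:
    """Return the lock lines that hold a DENY_ALL share mode."""
    return [line for line in output.splitlines() if "DENY_ALL" in line]

print(deny_all_locks(SAMPLE))
```

On the server itself, `smbstatus -L | grep DENY_ALL` does the same thing in one line.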

 

Anyone know why this occurred? The syslog seems to show a read/write error during unmounting of disks... or am I reading the log incorrectly?

 

Diagnostics attached. Any insight is appreciated.

tower-diagnostics-20240411-2248.zip

Posted
6 hours ago, johnsanc said:

The syslog seems to show a read/write error during unmounting of disks...

Yep:

 

Apr 11 22:11:55 Tower emhttpd: shcmd (5654356): umount /mnt/disk1
Apr 11 22:11:56 Tower kernel: XFS (md1p1): Unmounting Filesystem
Apr 11 22:11:56 Tower kernel: md: disk1 read error, sector=8593267848
Apr 11 22:11:56 Tower kernel: md: disk1 write error, sector=8593267848
Apr 11 22:11:56 Tower emhttpd: shcmd (5654357): rmdir /mnt/disk1

 

And there's no associated controller error/timeout, so possibly some timing issue.

Posted

Hmm interesting. That's what I thought as well. Shouldn't Unraid control this though and ensure all other services are stopped before unmounting?

What is the best practice for an array shutdown? Do I need to manually stop all docker containers first?

I cannot think of anything else that would have been running during unmount except maybe the file integrity plugin for hashing.

Posted
17 minutes ago, johnsanc said:

Shouldn't Unraid control this though and ensure all other services are stopped before unmounting?

It does, and this is not a normal problem. I'm not sure if it was an Unraid issue, but it could have been; doing a shutdown/reboot with the array running should always be safe. It could be a one-time issue due to some timing.

Posted

Thanks. I'll just let this disk rebuild and keep an eye out to see if it happens again. I wish there was an easy way to see what file is associated with that sector, to diagnose what might have happened. Do you know if there is a good way to do this that I can check after the disk rebuilds?

  • 1 month later...
Posted (edited)

Well this exact same thing happened again, but this time with a different disk. Is there anything I can do to prevent this?

 

May 19 13:07:09 Tower emhttpd: shcmd (6720903): umount /mnt/disk6
May 19 13:07:09 Tower kernel: XFS (md6p1): Unmounting Filesystem
May 19 13:07:10 Tower kernel: md: disk6 read error, sector=3910242856
May 19 13:07:18 Tower kernel: md: disk6 write error, sector=3910242856
May 19 13:07:18 Tower kernel: md: disk6 read error, sector=0
May 19 13:07:19 Tower kernel: md: disk6 write error, sector=0
May 19 13:07:19 Tower emhttpd: shcmd (6720904): rmdir /mnt/disk6

 

Edited by johnsanc
Posted

Possibly not a coincidence: disk6 is the same model as disk1. I would try a couple of things for all disks of that model:

 

- disable spin down

and/or

- connect them to the onboard SATA, in case it's a compatibility issue with the LSI

 

Then re-test

  • 3 months later...
Posted (edited)

I was preparing to upgrade again to 6.12.13 and this happened again with disk6. I've attached my diagnostics; hopefully something stands out.

Here is what I did:

  • changed disk1 and disk6 to never spin down
  • manually spun up disk1 and disk6 on the main page
  • shut down my vms
  • stopped all dockers
  • stopped array - disk6 marked as disabled after stopping array
  • saved diagnostics (attached)

Once the array had stopped I went to check the main page, and I saw disk6 was marked as disabled again.

 

The disks rebuild perfectly fine every time, and as far as I am aware I've had no other errors whatsoever except during array shutdown.

A few questions:

  1. What exactly is Unraid doing while stopping the array that could cause this issue? It looks like it does reads and writes to a few sectors according to the log.
  2. For this particular disabled event, is it relatively safe to do a New Config, preserve all slots, select parity is valid, and restart the array without needing to rebuild? I really don't want to keep rebuilding disks every time I need to upgrade or restart the array for any reason.

 

I know you say it may be an issue with the LSI card and disk combo, but this really sounds like there should be some additional checks or something before shutting down the array and declaring disks disabled.

EDIT:
One more update. I did a new config with valid parity and started a parity check and it immediately found 2 sync errors but nothing else so far:

Aug 24 11:18:23 Tower kernel: mdcmd (38): check correct
Aug 24 11:18:23 Tower kernel: md: recovery thread: check P Q ...
Aug 24 11:18:23 Tower kernel: md: recovery thread: PQ corrected, sector=0
Aug 24 11:18:23 Tower kernel: md: recovery thread: PQ corrected, sector=64


I also tried a non-correcting parity check for the range 3910376000 to 3910377000 since the syslog showed a read/write error for 3910376504. Here are the results:

  • 514 sync errors detected as shown on the array main page
  • Syslog shows exactly 100 lines of errors starting at 3910376504 through 3910377296
  • Each error in the log is exactly 8 sectors apart
  • I was curious if maybe the log was capped at a certain number of error details, so I tried a shorter range. For some reason, even with shorter ranges, the Parity Problems Assistant plugin seems to check beyond the endpoint I specify, for example:
     
    Aug 24 12:04:17 Tower kernel: md: recovery thread: PQ incorrect, sector=3910377296
    Aug 24 12:04:17 Tower kernel: md: recovery thread: stopped logging
    Aug 24 12:04:23 Tower Parity Problem Assistant: Send notification: Partial parity check (Non-Correcting): Range: Sectors 3910376500-3910376550 (type=normal link=/Settings/Scheduler)
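Incidentally, the 8-sector spacing lines up with the drives' 4096-byte physical blocks: 8 × 512-byte sectors is exactly one 4 KiB block, which also accounts for the 100 log lines over that range. A quick check of the arithmetic:

```python
# Each logged error is 8 sectors apart; 8 sectors of 512 bytes is one
# 4 KiB block, matching the disks' 4096-byte physical block size, so
# each log line appears to cover one 4K block.
SECTOR_SIZE = 512
SPACING = 8
print(SPACING * SECTOR_SIZE)  # -> 4096

# 100 log lines from the first to the last errored sector:
first, last = 3910376504, 3910377296
print((last - first) // SPACING + 1)  # -> 100
```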

 

 

EDIT2:
I decided to do some investigative work to see if the data was correct or if parity was correct. I found what file occupies the offending area. It was a zip archive. I extracted the archive and lo and behold it is perfectly fine.

 

So I believe this means parity was incorrect. If that's the case, then a rebuild from parity would have reconstructed bad data, overwriting good data that was there previously. That is a scary thought, since I have rebuilt disks before and always assumed the parity was correct... but in situations like this I think the opposite occurred.

For now I'm exporting the File Integrity results and checking the exports for both disk1 and disk6 since they have both been rebuilt in the past for this array shutdown issue assuming parity was correct. I will report back in a day or two once the check has completed.

 

Could someone smarter than me weigh in on this? @JorgeB perhaps?

 

 

 

tower-diagnostics-20240824-1041.zip

Edited by johnsanc
Posted
18 hours ago, johnsanc said:

What exactly is Unraid doing while stopping the array that could cause this issue?

Difficult to say, but I would recommend connecting that disk to the onboard SATA and retest, it may be a compatibility issue with the controller.

  • 2 weeks later...
Posted

Just to follow up and close the loop on this. I believe there may have been a few issues at play.

  1. I had EPC enabled for these disks. I followed the instructions here. I used openSeaChest to check the features that were enabled on each of the drives and updated accordingly. This was the first time I used openSeaChest. It's a great tool, and it would be nice if some of its functions were integrated into Unraid, perhaps via Fix Common Problems.
  2. The LSI card with these disks had outdated firmware. I updated this to the most recent version.
  3. These Dell EMC disks also had outdated firmware. I updated from PAL7 to PAL9 because I noticed there was a fix for command timeouts, which I have observed before on these disks.
  • 1 month later...
Posted

Well unfortunately the issue is not resolved, but it does seem to be related to spinup / spindown.

 

After the last time I ran into this issue I did a new config, and I think my disk settings were reset to never spin down. Just a few days ago I changed my settings to spin down disks after about an hour. Shortly thereafter I ran into the issue again, with 2 disks being disabled within the span of a day or so.

 

Now I see errors like this whenever the issue occurs:
(Diagnostics also attached...)

Oct 18 01:50:48 Tower emhttpd: spinning down /dev/sdae
Oct 18 01:56:42 Tower emhttpd: spinning down /dev/sdw
Oct 18 02:00:52 Tower emhttpd: read SMART /dev/sdx
Oct 18 02:02:45 Tower emhttpd: read SMART /dev/sdi
Oct 18 03:02:15 Tower emhttpd: spinning down /dev/sdag
Oct 18 03:03:02 Tower emhttpd: spinning down /dev/sdx
Oct 18 03:06:30 Tower kernel: sd 15:0:12:0: attempting task abort!scmd(0x00000000aaca4222), outstanding for 15276 ms & timeout 15000 ms
Oct 18 03:06:30 Tower kernel: sd 15:0:12:0: [sdae] tag#374 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 40 e5 00
Oct 18 03:06:30 Tower kernel: scsi target15:0:12: handle(0x0016), sas_address(0x5001e677bdaadff4), phy(20)
Oct 18 03:06:30 Tower kernel: scsi target15:0:12: enclosure logical id(0x5001e677bdaadfff), slot(20) 
Oct 18 03:06:30 Tower kernel: sd 15:0:12:0: device_block, handle(0x0016)
Oct 18 03:06:33 Tower kernel: sd 15:0:12:0: device_unblock and setting to running, handle(0x0016)
Oct 18 03:06:33 Tower kernel: sd 15:0:12:0: [sdae] tag#373 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=DRIVER_OK cmd_age=18s
Oct 18 03:06:33 Tower kernel: sd 15:0:12:0: [sdae] tag#373 CDB: opcode=0x88 88 00 00 00 00 03 7c 27 10 28 00 00 00 08 00 00
Oct 18 03:06:33 Tower kernel: md: disk1 read error, sector=14967836648
Oct 18 03:06:33 Tower kernel: sd 15:0:12:0: task abort: SUCCESS scmd(0x00000000aaca4222)
Oct 18 03:06:33 Tower kernel: sd 15:0:12:0: [sdae] Synchronizing SCSI cache
Oct 18 03:06:33 Tower kernel: sd 15:0:12:0: [sdae] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=DRIVER_OK
Oct 18 03:06:33 Tower kernel: mpt2sas_cm1: mpt3sas_transport_port_remove: removed: sas_addr(0x5001e677bdaadff4)
Oct 18 03:06:33 Tower kernel: mpt2sas_cm1: removing handle(0x0016), sas_addr(0x5001e677bdaadff4)
Oct 18 03:06:33 Tower kernel: mpt2sas_cm1: enclosure logical id(0x5001e677bdaadfff), slot(20)
Oct 18 03:06:33 Tower kernel: mpt2sas_cm1: handle(0x16) sas_address(0x5001e677bdaadff4) port_type(0x1)
Oct 18 03:06:34 Tower kernel: md: disk1 write error, sector=14967836648
Oct 18 03:06:34 Tower kernel: scsi 15:0:18:0: Direct-Access     ATA      ST18000NM002J-2T PAL9 PQ: 0 ANSI: 6
Oct 18 03:06:34 Tower kernel: scsi 15:0:18:0: SATA: handle(0x0016), sas_addr(0x5001e677bdaadff4), phy(20), device_name(0x0000000000000000)
Oct 18 03:06:34 Tower kernel: scsi 15:0:18:0: enclosure logical id (0x5001e677bdaadfff), slot(20) 
Oct 18 03:06:34 Tower kernel: scsi 15:0:18:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
Oct 18 03:06:34 Tower kernel: scsi 15:0:18:0: qdepth(32), tagged(1), scsi_level(7), cmd_que(1)
Oct 18 03:06:34 Tower kernel: sd 15:0:18:0: Power-on or device reset occurred
Oct 18 03:06:34 Tower kernel: sd 15:0:18:0: Attached scsi generic sg31 type 0
Oct 18 03:06:34 Tower kernel: end_device-15:0:18: add: handle(0x0016), sas_addr(0x5001e677bdaadff4)
Oct 18 03:06:34 Tower kernel: sd 15:0:18:0: [sdaj] 35156656128 512-byte logical blocks: (18.0 TB/16.4 TiB)
Oct 18 03:06:34 Tower kernel: sd 15:0:18:0: [sdaj] 4096-byte physical blocks
Oct 18 03:06:34 Tower kernel: sd 15:0:18:0: [sdaj] Write Protect is off
Oct 18 03:06:34 Tower kernel: sd 15:0:18:0: [sdaj] Mode Sense: 7f 00 10 08
Oct 18 03:06:34 Tower kernel: sd 15:0:18:0: [sdaj] Write cache: enabled, read cache: enabled, supports DPO and FUA
Oct 18 03:06:34 Tower kernel: sdaj: sdaj1
Oct 18 03:06:34 Tower kernel: sd 15:0:18:0: [sdaj] Attached SCSI disk


 

tower-diagnostics-20241018-2242.zip

Posted

Not yet, I tried all the other suggested solutions from past threads like firmware updates and disabling EPC.

 

I am in the middle of a correcting parity check, and the concerning thing is that I have some random sync corrections, 92 so far and about 25% complete. They are strewn about and not all in one place.

  • The areas where the write errors occurred had no sync issues when I tested with Parity Problems Assistant with a sector range. Is that expected?
  • If Unraid detects an error and kicks the disks offline, why would I have sync issues in these other places?

I am doing a correcting parity check instead of disk rebuilds because frankly I am more confident in the data on the disks than I am with the parity. Last time this happened I did a new config and revalidated all my files with File Integrity exports, and not a single file had an issue. 

Posted

The latter. New config to re-enable the disk. There was nothing in my control writing to the array, since everything is set up to use the cache pool. I really have no idea what these random write attempts are when the errors occur during spinup, spindown, or array shutdown.

 

Frankly I'm leaning more toward only ever rebuilding disks if there's a complete disk failure. Situations like this seem like a disk rebuild from parity would do more harm than good.
