6.2 Beta 21 2nd drive parity disk timeouts



Over the weekend I upgraded from Beta 20 to Beta 21. I also removed a drive that didn't have any data on it and replaced it with what I would use as my 2nd parity drive. Ever since I did this, the server has randomly stopped responding. I'm not sure how long it takes for this to happen. The majority of the drives are on the onboard SATA ports. I do know the 2nd parity drive is on the RAID bus controller: Marvell Technology Group Ltd. 88SE9485 SAS/SATA 6Gb/s controller (rev c3) card.

 

I'm seeing this in the syslog.

Apr 11 04:30:37 Tower kernel: sas: Enter sas_scsi_recover_host busy: 0 failed: 0

Apr 11 04:30:37 Tower kernel: sas: ata7: end_device-5:0: dev error handler

Apr 11 04:30:37 Tower kernel: ata7.00: ATA-9: WDC WD40EFRX-68WT0N0,      WD-WCC4E0ETNU17, 82.00A82, max UDMA/133

Apr 11 04:30:37 Tower kernel: ata7.00: 7814037168 sectors, multi 0: LBA48 NCQ (depth 31/32)

Apr 11 04:30:37 Tower kernel: ata7.00: configured for UDMA/133

Apr 11 04:30:37 Tower kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1

Apr 11 04:30:37 Tower kernel: scsi 5:0:0:0: Direct-Access    ATA      WDC WD40EFRX-68W 0A82 PQ: 0 ANSI: 5

Apr 11 04:30:37 Tower kernel: sd 5:0:0:0: [sdh] 7814037168 512-byte logical blocks: (4.00 TB/3.64 TiB)

Apr 11 04:30:37 Tower kernel: sd 5:0:0:0: Attached scsi generic sg7 type 0

Apr 11 04:30:37 Tower kernel: sas: Enter sas_scsi_recover_host busy: 0 failed: 0

Apr 11 04:30:37 Tower kernel: sas: ata7: end_device-5:0: dev error handler

Apr 11 04:30:37 Tower kernel: sd 5:0:0:0: [sdh] 4096-byte physical blocks

Apr 11 04:30:37 Tower kernel: sas: ata8: end_device-5:1: dev error handler

Apr 11 04:30:37 Tower kernel: ata8.00: ATA-9: WDC WD30EFRX-68AX9N0,      WD-WMC1T0378623, 80.00A80, max UDMA/133

Apr 11 04:30:37 Tower kernel: ata8.00: 5860533168 sectors, multi 0: LBA48 NCQ (depth 31/32)

Apr 11 04:30:37 Tower kernel: ata8.00: configured for UDMA/133

Apr 11 04:30:37 Tower kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1

Apr 11 04:30:37 Tower kernel: sd 5:0:0:0: [sdh] Write Protect is off

Apr 11 04:30:37 Tower kernel: sd 5:0:0:0: [sdh] Mode Sense: 00 3a 00 00

Apr 11 04:30:37 Tower kernel: sd 5:0:0:0: [sdh] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

Apr 11 04:30:37 Tower kernel: scsi 5:0:1:0: Direct-Access    ATA      WDC WD30EFRX-68A 0A80 PQ: 0 ANSI: 5

Apr 11 04:30:37 Tower kernel: sd 5:0:1:0: [sdi] 5860533168 512-byte logical blocks: (3.00 TB/2.73 TiB)

Apr 11 04:30:37 Tower kernel: sd 5:0:1:0: Attached scsi generic sg8 type 0

Apr 11 04:30:37 Tower kernel: sd 5:0:1:0: [sdi] 4096-byte physical blocks

Apr 11 04:30:37 Tower kernel: sas: Enter sas_scsi_recover_host busy: 0 failed: 0

Apr 11 04:30:37 Tower kernel: sas: ata7: end_device-5:0: dev error handler

Apr 11 04:30:37 Tower kernel: sas: ata8: end_device-5:1: dev error handler

Apr 11 04:30:37 Tower kernel: sas: ata9: end_device-5:2: dev error handler

Apr 11 04:30:37 Tower kernel: ata9.00: ATA-7: Hitachi HDT725050VLA360,      VFD400R403SMYC, V56OA52A, max UDMA/133

Apr 11 04:30:37 Tower kernel: ata9.00: 976773168 sectors, multi 0: LBA48 NCQ (depth 31/32)

Apr 11 04:30:37 Tower kernel: ata9.00: configured for UDMA/133

Apr 11 04:30:37 Tower kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1

Apr 11 04:30:37 Tower kernel: sd 5:0:1:0: [sdi] Write Protect is off

Apr 11 04:30:37 Tower kernel: sd 5:0:1:0: [sdi] Mode Sense: 00 3a 00 00

Apr 11 04:30:37 Tower kernel: sd 5:0:1:0: [sdi] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

Apr 11 04:30:37 Tower kernel: scsi 5:0:2:0: Direct-Access    ATA      Hitachi HDT72505 A52A PQ: 0 ANSI: 5

Apr 11 04:30:37 Tower kernel: sd 5:0:2:0: [sdj] 976773168 512-byte logical blocks: (500 GB/466 GiB)

Apr 11 04:30:37 Tower kernel: sd 5:0:2:0: Attached scsi generic sg9 type 0

Apr 11 04:30:37 Tower kernel: sas: Enter sas_scsi_recover_host busy: 0 failed: 0

Apr 11 04:30:37 Tower kernel: sas: ata7: end_device-5:0: dev error handler

Apr 11 04:30:37 Tower kernel: sas: ata8: end_device-5:1: dev error handler

Apr 11 04:30:37 Tower kernel: sas: ata9: end_device-5:2: dev error handler

Apr 11 04:30:37 Tower kernel: sas: ata10: end_device-5:3: dev error handler

Apr 11 04:30:37 Tower kernel: ata10.00: ATA-8: OCZ-AGILITY3, OCZ-S412FE6GEZ2441GM, 2.22, max UDMA/133

Apr 11 04:30:37 Tower kernel: ata10.00: 234441648 sectors, multi 16: LBA48 NCQ (depth 31/32)

Apr 11 04:30:37 Tower kernel: ata10.00: configured for UDMA/133

Apr 11 04:30:37 Tower kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1

Apr 11 04:30:37 Tower kernel: sd 5:0:2:0: [sdj] Write Protect is off

Apr 11 04:30:37 Tower kernel: sd 5:0:2:0: [sdj] Mode Sense: 00 3a 00 00

Apr 11 04:30:37 Tower kernel: sd 5:0:2:0: [sdj] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

Apr 11 04:30:37 Tower kernel: sdh: sdh1

Apr 11 04:30:37 Tower kernel: sd 5:0:0:0: [sdh] Attached SCSI disk

Apr 11 04:30:37 Tower kernel: scsi 5:0:3:0: Direct-Access    ATA      OCZ-AGILITY3    2.22 PQ: 0 ANSI: 5

Apr 11 04:30:37 Tower kernel: sd 5:0:3:0: [sdk] 234441648 512-byte logical blocks: (120 GB/112 GiB)

Apr 11 04:30:37 Tower kernel: sd 5:0:3:0: Attached scsi generic sg10 type 0

Apr 11 04:30:37 Tower kernel: sd 5:0:3:0: [sdk] Write Protect is off

Apr 11 04:30:37 Tower kernel: sd 5:0:3:0: [sdk] Mode Sense: 00 3a 00 00

Apr 11 04:30:37 Tower kernel: sd 5:0:3:0: [sdk] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

Apr 11 04:30:37 Tower kernel: sdk: sdk1

Apr 11 04:30:37 Tower kernel: sd 5:0:3:0: [sdk] Attached SCSI disk

Apr 11 04:30:37 Tower kernel: sdj: sdj1

Apr 11 04:30:37 Tower kernel: sd 5:0:2:0: [sdj] Attached SCSI disk

Apr 11 04:30:37 Tower kernel: sdi: sdi1

Apr 11 04:30:37 Tower kernel: sd 5:0:1:0: [sdi] Attached SCSI disk

Apr 11 04:30:37 Tower kernel: BTRFS: device fsid 39df29e1-dd7e-48dd-8c0e-ff4f9d9353c1 devid 2 transid 1060577 /dev/sdk1
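To see at a glance which drives sit on which controller, the attach messages above can be grouped by SCSI host number. This is only a rough Python sketch that assumes the syslog line formats shown in this post; the function name `disks_by_host` and both regular expressions are made up for illustration and are not part of unRAID.

```python
import re
from collections import defaultdict

# Assumed format: "scsi 5:0:0:0: Direct-Access    ATA      WDC WD40EFRX-68W 0A82 PQ: 0 ANSI: 5"
SCSI_RE = re.compile(
    r"scsi (\d+):(\d+):(\d+):(\d+): Direct-Access\s+ATA\s+(.+?)\s+\S+\s+PQ:"
)
# Assumed format: "sd 5:0:0:0: [sdh] Attached SCSI disk"
SD_RE = re.compile(r"sd (\d+):(\d+):(\d+):(\d+): \[(sd[a-z]+)\]")


def disks_by_host(lines):
    """Group (device, model) pairs by SCSI host number (hypothetical helper)."""
    models = {}  # (host, channel, target, lun) -> model string
    names = {}   # (host, channel, target, lun) -> device name such as "sdh"
    for line in lines:
        m = SCSI_RE.search(line)
        if m:
            addr = tuple(int(g) for g in m.group(1, 2, 3, 4))
            models[addr] = m.group(5)
            continue
        m = SD_RE.search(line)
        if m:
            addr = tuple(int(g) for g in m.group(1, 2, 3, 4))
            names[addr] = m.group(5)
    hosts = defaultdict(list)
    for addr in sorted(names):
        hosts[addr[0]].append((names[addr], models.get(addr, "unknown")))
    return dict(hosts)


if __name__ == "__main__":
    # Two lines copied from the log above; a real run would read the syslog file.
    sample = [
        "Apr 11 04:30:37 Tower kernel: scsi 5:0:0:0: Direct-Access    ATA      WDC WD40EFRX-68W 0A82 PQ: 0 ANSI: 5",
        "Apr 11 04:30:37 Tower kernel: sd 5:0:0:0: [sdh] Attached SCSI disk",
    ]
    for host, disks in disks_by_host(sample).items():
        print(f"host {host}: {disks}")
```

In the excerpt above, all four disks (sdh, sdi, sdj, sdk) report host 5, which matches the `end_device-5:N` names in the sas lines.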

 

I already tried telling the disks not to spin down, thinking it might be related to the disks not spinning back up in a timely manner. But that wasn't it.

 

I ran out of time, but tonight I was going to reset the config and remove the 2nd parity drive to see if the server does the same thing. I think that might be the simplest step to try. Please let me know if there are any other logs that you want or other things to try.

Link to comment

So after a reboot the server was responsive for about 2 hours. Now it responds to pings, but the WebUI is slow. I tried to stop the array and it's stuck.

 

Syslog shows

Apr 11 08:59:46 Tower kernel: mdcmd (44): nocheck

Apr 11 08:59:46 Tower kernel: md: nocheck_array: check not active

Apr 11 08:59:46 Tower emhttp: shcmd (1890): /etc/rc.d/rc.libvirt stop |& logger

Nothing else follows. It's been that way for a good 20 minutes.

 

 

Link to comment

It would be better to provide your diagnostics.  NOTHING that you quoted above has any issues at all, as far as I can see.  There are no timeouts visible.  Both posts show normal messages.  The first post just shows the normal drive setup.  Perhaps your diagnostics will show more?
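To check a longer syslog for real trouble rather than normal setup messages, a simple keyword filter can help. The sketch below is in Python and is only illustrative: the pattern list is an assumption about what typical link or I/O problems look like in kernel logs, it is not exhaustive, and `suspicious_lines` is a made-up name.

```python
import re

# Assumed, non-exhaustive list of phrases that often accompany disk or link trouble.
SUSPECT = re.compile(
    r"timed out|timeout|hard resetting link|exception Emask|failed command|I/O error",
    re.IGNORECASE,
)


def suspicious_lines(lines):
    """Return the log lines that match any of the assumed trouble patterns."""
    return [line.rstrip("\n") for line in lines if SUSPECT.search(line)]


if __name__ == "__main__":
    # Lines copied from the first post; none of them match the patterns.
    quoted = [
        "Apr 11 04:30:37 Tower kernel: sas: Enter sas_scsi_recover_host busy: 0 failed: 0",
        "Apr 11 04:30:37 Tower kernel: sas: ata7: end_device-5:0: dev error handler",
        "Apr 11 04:30:37 Tower kernel: ata7.00: configured for UDMA/133",
    ]
    print(suspicious_lines(quoted))
```

Run against the lines quoted in the first two posts, a filter like this returns nothing, which is consistent with the observation that they show only normal drive setup.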

Link to comment

I just noticed that someone else is reporting a problem very similar to mine: https://lime-technology.com/forum/index.php?topic=48360.0

 

I have now downgraded to 6.1.9 and am seeing the exact same behavior that I was seeing in the latest beta. I'm still formulating my plan of how to attack this. A software fix would be great if it is software.

 

Attached are the logs.

 

I also noticed that the disks are all spun-up but the parity sync time is now at 66 days, 23 hours, 18 minutes. I hope it doesn't take that long.
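For a sense of scale, an estimate like that can be turned back into an implied transfer rate. The Python sketch below assumes the parity disk is the 4 TB drive from the first log (7814037168 sectors of 512 bytes) and that the estimate covered the whole disk; both are assumptions, and `implied_speed_mb_s` is a hypothetical helper name.

```python
def implied_speed_mb_s(size_bytes, days, hours, minutes):
    """Average speed in MB/s needed to cover size_bytes in the given time."""
    seconds = days * 86400 + hours * 3600 + minutes * 60
    return size_bytes / seconds / 1e6


if __name__ == "__main__":
    # Assumption: a 4 TB disk, sized from the sector count in the first syslog excerpt.
    size_bytes = 7814037168 * 512
    speed = implied_speed_mb_s(size_bytes, days=66, hours=23, minutes=18)
    print(f"implied average speed: {speed:.2f} MB/s")
```

Under those assumptions the implied rate is well under 1 MB/s, far below what a healthy spinning disk sustains, so an estimate like this points at something stalling the sync rather than at the disks' normal speed.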

tower-diagnostics-20160412-1644.zip

Link to comment

It did complete in a reasonable time frame. It was a full rebuild. I did a new config setup, thinking that the problem might have been related to the old config.

 

I think I'm going to hold off until the next release before I try again. My wife didn't like the server being down.

 

 


Link to comment

* In the second syslog, where I think you reported the most issues, a cron for DuckDNS is malfunctioning, with the ps report showing 47 cron calls for it (so far).  I don't know what's wrong, but guessing it's a misconfiguration.  None of them appear to be using any CPU, so may not be serious, but probably indicates each one is hanging.

 

* In ALL of your syslogs (both 6.1.9 and 6.2-beta21), the message "shfs/user: share cache full" is filling up the syslog, perhaps a thousand of them.  Sorry, I don't know which cache it is referring to.

 

* [minor] When you have a chance, at boot, go into the BIOS settings and look for the SATA mode, make sure it is set to AHCI.  Somewhere, it is set to an IDE emulating mode, for the last 2 onboard SATA ports.

 

* The following messages are showing in all your syslogs -

Apr 12 16:04:28 Tower logger: Installing user plugins

Apr 12 16:04:28 Tower logger: plugin: installing: /boot/config/plugins/dynamix.plg

Apr 12 16:04:28 Tower logger: plugin: not installing older version

Apr 12 16:04:28 Tower logger: Starting go script

Apr 12 16:04:28 Tower emhttp: unRAID System Management Utility version 6.1.9

Apr 11 09:28:47 Tower root: Installing user plugins

Apr 11 09:28:47 Tower root: plugin: installing: /boot/config/plugins/dynamix.plg

Apr 11 09:28:47 Tower root: plugin: not installing older version

Apr 11 09:28:47 Tower root: Starting go script

Apr 11 09:28:47 Tower emhttp: unRAID System Management Utility version 6.2.0-beta21

This tends to indicate a remnant of an older version still there, which can cause problems.

 

* At the end of the 6.1.9 syslog, there is a failure of TimeMachine and AFP.
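To see which messages are flooding a syslog, as with the "share cache full" lines mentioned above, the lines can be counted after stripping the timestamp and hostname. This is a minimal Python sketch that assumes the "Mon DD HH:MM:SS host" prefix seen in the logs in this thread; `top_messages` and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Assumed prefix: "Apr 12 16:04:28 Tower " (month, day, time, hostname).
PREFIX = re.compile(r"^\w{3}\s+\d+\s+\d\d:\d\d:\d\d\s+\S+\s+")


def top_messages(lines, n=5):
    """Return the n most frequent messages, ignoring timestamp and hostname."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if line:
            counts[PREFIX.sub("", line)] += 1
    return counts.most_common(n)


if __name__ == "__main__":
    # Made-up sample in the same format; a real run would read the syslog file.
    sample = [
        "Apr 12 16:05:01 Tower shfs/user: share cache full",
        "Apr 12 16:05:02 Tower shfs/user: share cache full",
        "Apr 12 16:04:28 Tower logger: Starting go script",
    ]
    for message, count in top_messages(sample):
        print(count, message)
```

A count like this makes it easy to tell a message that appears a thousand times from one that appears once at boot.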

Link to comment

* In the second syslog, where I think you reported the most issues, a cron for DuckDNS is malfunctioning, with the ps report showing 47 cron calls for it (so far).  I don't know what's wrong, but guessing it's a misconfiguration.  None of them appear to be using any CPU, so may not be serious, but probably indicates each one is hanging.

 

  I have already removed that container.

 

* In ALL of your syslogs (both 6.1.9 and 6.2-beta21), the message "shfs/user: share cache full" is filling up the syslog, perhaps a thousand of them.  Sorry, I don't know which cache it is referring to.

 

  I'm going to work on that one. I have a limit set up on my cache pool, if that is what it is referring to. I think it is hitting the limit on the pool. I would have assumed the mover would kick in.

 

* [minor] When you have a chance, at boot, go into the BIOS settings and look for the SATA mode, make sure it is set to AHCI.  Somewhere, it is set to an IDE emulating mode, for the last 2 onboard SATA ports.

 

  I hope to look at that this weekend. I have an AOC-SAS2LP-MV8 card in there, and I'm not sure whether those drives are on it. I will check everything on the system. Thanks for pointing that out.

 

* The following messages are showing in all your syslogs -

Apr 12 16:04:28 Tower logger: Installing user plugins

Apr 12 16:04:28 Tower logger: plugin: installing: /boot/config/plugins/dynamix.plg

Apr 12 16:04:28 Tower logger: plugin: not installing older version

Apr 12 16:04:28 Tower logger: Starting go script

Apr 12 16:04:28 Tower emhttp: unRAID System Management Utility version 6.1.9

Apr 11 09:28:47 Tower root: Installing user plugins

Apr 11 09:28:47 Tower root: plugin: installing: /boot/config/plugins/dynamix.plg

Apr 11 09:28:47 Tower root: plugin: not installing older version

Apr 11 09:28:47 Tower root: Starting go script

Apr 11 09:28:47 Tower emhttp: unRAID System Management Utility version 6.2.0-beta21

This tends to indicate a remnant of an older version still there, which can cause problems.

* At the end of the 6.1.9 syslog, there is a failure of TimeMachine and AFP.

 

With those other issues, I'm thinking of backing up the config-related items and basically doing a fresh install. I could try 6.2 again and see how it goes. Thanks for the help. I will update the post when I have more detail.

Link to comment
