
Read Errors/Disk failure since upgrading to 6.1


JohnO


Greetings,

 

I've been running unRAID for about 18 months, and 6.0 since it was released.  I upgraded to 6.1 yesterday.

 

Since the upgrade, unRAID has been reporting read errors, and now one of 4 disks is offline.

 

I have not physically touched the hardware in weeks.  The physical machine diagnostics (temperature, etc.) all seem fine.  The SMART reports all seem fine.

 

This unRAID environment is a guest on ESXi 6.0.  The disk controller and all attached disks are "passed through" to unRAID so that unRAID has complete control of the disks.

 

I restarted the VM after the upgrade, but have not restarted since.  Not sure if that would help or hurt at this point.

 

I'm tempted to roll back to 6.0 as this seems to be a mighty coincidence.

 

I've attached the diagnostics report.

 

Thanks for any ideas.

 

John

 

oshtank-diagnostics-20150906-0833.zip

Link to comment

Since you are running unRAID as a guest, I won't dig into this except to ask: are you sure there were no problems before the upgrade?  I don't think disk problems are a likely result of the upgrade.

 

I would not expect disk errors to show up as part of an upgrade either.  I guess my question is about how the disk error information is collected.  Could that have changed somehow?

 

Thanks to the daily email notifications, I am sure that unRAID was not reporting any disk errors before the upgrade.

 

Thanks,

 

John

Link to comment

I've restarted the unRAID VM and started the array.  The one drive is still "x'd" out and disabled.  The read error column is now zeroed out.

 

The SMART test results are all good.  Not sure what to do next.  Should I try to re-enable the disabled drive?
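For reference, the SMART data can also be pulled directly from the unRAID console with smartctl (smartmontools ships with unRAID 6, as far as I know); the device name below is only an example and will differ per system:

# quick overall health verdict for one drive (example device name)
smartctl -H /dev/sde
# full attribute and self-test report, handy to attach to a post
smartctl -a /dev/sde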

 

Thanks,

 

John

 

Link to comment

Also -- the syslog is filling with the following:

 

mptscsih: ioc0: attempting task abort! (sc=ffff88006aa2e000)
Sep 6 13:10:44 OshTank kernel: sd 3:0:3:0: [sde] tag#0 CDB: opcode=0x2a 2a 00 ae a8 66 a0 00 00 18 00
Sep 6 13:11:14 OshTank kernel: mptscsih: ioc0: WARNING - Issuing Reset from mptscsih_IssueTaskMgmt!! doorbell=0x24000000
Sep 6 13:11:14 OshTank kernel: mptbase: ioc0: Initiating recovery
Sep 6 13:11:23 OshTank kernel: mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff88006aa2e000)
Sep 6 13:16:02 OshTank kernel: mptscsih: ioc0: attempting task abort! (sc=ffff88006c453c80)
Sep 6 13:16:02 OshTank kernel: sd 3:0:2:0: [sdd] tag#0 CDB: opcode=0x28 28 00 3f 0f f1 48 00 00 08 00
Sep 6 13:16:32 OshTank kernel: mptscsih: ioc0: WARNING - Issuing Reset from mptscsih_IssueTaskMgmt!! doorbell=0x24000000
Sep 6 13:16:32 OshTank kernel: mptbase: ioc0: Initiating recovery
Sep 6 13:16:32 OshTank kernel: blk_update_request: I/O error, dev sde, sector 0
Sep 6 13:16:41 OshTank kernel: mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff88006c453c80)
Sep 6 13:16:41 OshTank kernel: mptscsih: ioc0: attempting task abort! (sc=ffff88006c453680)
Sep 6 13:16:41 OshTank kernel: sd 3:0:0:0: [sdb] tag#0 CDB: opcode=0x28 28 00 3f 0f f1 48 00 00 08 00
Sep 6 13:16:41 OshTank kernel: mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff88006c453680)
Sep 6 13:16:48 OshTank emhttp: cmd: /usr/local/emhttp/plugins/dynamix/scripts/tail_log syslog
Sep 6 13:17:12 OshTank kernel: mptscsih: ioc0: attempting task abort! (sc=ffff88006aa2e000)
Sep 6 13:17:12 OshTank kernel: sd 3:0:2:0: [sdd] tag#0 CDB: opcode=0x28 28 00 3f 10 0e 58 00 00 08 00
Sep 6 13:17:42 OshTank kernel: mptscsih: ioc0: WARNING - Issuing Reset from mptscsih_IssueTaskMgmt!! doorbell=0x24000000
Sep 6 13:17:42 OshTank kernel: mptbase: ioc0: Initiating recovery
Sep 6 13:17:51 OshTank kernel: mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff88006aa2e000)
Sep 6 13:17:51 OshTank kernel: mptscsih: ioc0: attempting task abort! (sc=ffff88006aa2e180)
Sep 6 13:17:51 OshTank kernel: sd 3:0:0:0: [sdb] tag#0 CDB: opcode=0x28 28 00 3f 10 0e 58 00 00 08 00
Sep 6 13:17:51 OshTank kernel: mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff88006aa2e180)
Sep 6 13:17:51 OshTank kernel: mptscsih: ioc0: attempting task abort! (sc=ffff88006aa2e480)
Sep 6 13:17:51 OshTank kernel: sd 3:0:3:0: [sde] tag#0 CDB: opcode=0x28 28 00 3f 10 0e 58 00 00 08 00
Sep 6 13:17:51 OshTank kernel: mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff88006aa2e480)

Link to comment

The problem strictly involves the SAS controller or its driver module, mptsas.  It was causing trouble almost from the moment you booted, before you even started the array - a never-ending series of aborts, resets, and recoveries.  It occurred both when the drives were idle and when there was I/O.  SMART requests often errored out.  Finally some of the I/O failed, probably requests that happened to land during a reset.  All 4 drives showed the same problems and read errors.  Disk 1 was probably only dropped because it tried to write at the wrong moment; otherwise it could have been any drive.  There are no issues with the drives themselves.
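If you want a rough count of how hard each drive was being hit, something along these lines against the extracted syslog should work (the file path is just an example; the patterns match the log lines quoted above):

# total controller resets
grep -c "Issuing Reset" syslog
# aborted commands per device - the CDB line names the device
grep "CDB: opcode" syslog | grep -o '\[sd[a-z]\]' | sort | uniq -c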

 

I cannot tell whether it's a problem with the card, or with the driver, or with the way it's passed through, or some other incompatibility.  It's obviously not working though, in the current configuration.

 

An observation - the system is allocated only 1.8GB.  That usually means about 200MB is reserved for the graphics adapter, which isn't used in unRAID.  I don't know how you change that in your VM, but unRAID only uses a text console, so it only needs the absolute minimum setting available.  That would free up almost 200MB more RAM for unRAID.

Link to comment

I cannot tell whether it's a problem with the card, or with the driver, or with the way it's passed through, or some other incompatibility.  It's obviously not working though, in the current configuration.

 

Thanks very much for the detailed feedback.

 

Since it seems like such a simple thing to try, I'd like to roll back to 6.0.1, which I had been running for some time without issue.

 

I see the /boot/previous folder, but I'm not sure of the correct way to roll back.  Is it as simple as copying the files to /boot and re-starting?  I'm guessing that even if it runs without errors, I'll still have to rebuild the one drive that is disabled.  Is that correct?

 

If I continue to see problems after rolling back, then I'll dig further. 

 

Does that plan of attack make sense to you?

 

An observation - the system is allocated only 1.8GB.  That usually means about 200MB is reserved for the graphics adapter, which isn't used in unRAID.  I don't know how you change that in your VM, but unRAID only uses a text console, so it only needs the absolute minimum setting available.  That would free up almost 200MB more RAM for unRAID.

 

I can certainly allocate more RAM.  My understanding was that it was not required, as I'm not using dockers, and other VMs within unRAID.  If you recommend I increase RAM, I can certainly increase it.

 

Thanks again for your assistance!

 

John

Link to comment

Since it seems like such a simple thing to try, I'd like to roll back to 6.0.1, which I had been running for some time without issue.

 

I see the /boot/previous folder, but I'm not sure of the correct way to roll back.  Is it as simple as copying the files to /boot and re-starting?  I'm guessing that even if it runs without errors, I'll still have to rebuild the one drive that is disabled.  Is that correct?

I had to check the /previous folder to see what's there.  It does just look like the previous main files, so you're correct, just copy them back and reboot.  All you really need are bzroot and bzimage.
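If you prefer to do it from the console, a rough sketch (the backup folder name is just an example - and keep copies of the current files first, in case you want to go forward again):

# keep the current 6.1 files around, just in case
mkdir -p /boot/backup-6.1
cp /boot/bzroot /boot/bzimage /boot/backup-6.1/
# copy the previous release back into place and reboot
cp /boot/previous/bzroot /boot/previous/bzimage /boot/
reboot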

 

An observation - the system is allocated only 1.8GB.  That usually means about 200MB is reserved for the graphics adapter, which isn't used in unRAID.  I don't know how you change that in your VM, but unRAID only uses a text console, so it only needs the absolute minimum setting available.  That would free up almost 200MB more RAM for unRAID.

 

I can certainly allocate more RAM.  My understanding was that it was not required, as I'm not using dockers, and other VMs within unRAID.  If you recommend I increase RAM, I can certainly increase it.

Without Dockers or VMs, you're probably fine with that.  I'm used to bare metal installations, not what's normal when virtualized.  I automatically react when I see a RAM figure somewhat short of a full gigabyte, because it usually means wasted memory - memory reserved for a graphics adapter that won't be used, when it could have been helping with unRAID caching or other things.  Now, you can help me learn - when you set up the RAM requirement for unRAID, did you set it to 1.8GB or 2.0GB?  In other words, are you limited to setting RAM in gigabyte increments or something smaller?

Link to comment

I had to check the /previous folder to see what's there.  It does just look like the previous main files, so you're correct, just copy them back and reboot.  All you really need are bzroot and bzimage.

 

I'm just heading off to sleep, so I'll try this tomorrow after work.  Thanks for the confirmation on that part of the process.

 

Now, you can help me learn - when you set up the RAM requirement for unRAID, did you set it to 1.8GB or 2.0GB?  In other words, are you limited to setting RAM in gigabyte increments or something smaller?

 

I'm using VMware ESXi, which is their "Enterprise-class" hypervisor.  In a smart move, VMware lets you use that hypervisor for free in small environments.  The configuration options are very granular.  I can allocate RAM in very small chunks - I manually selected 1856 MB, as you can see in the screenshot attached below from the VMware vSphere configuration client.

 

Thanks again,

 

John

unRAID-VM-Config.PNG

Link to comment

I had to check the /previous folder to see what's there.  It does just look like the previous main files, so you're correct, just copy them back and reboot.  All you really need are bzroot and bzimage.

 

Ok -- I rolled back to 6.0.1, and the errors seem to have stopped accumulating.

 

Of course, the one drive is still disabled.

 

What is the next recommended course of action?

 

Should I try to re-build the drive?

 

I have attached a fresh diagnostic captured after my rollback and reboot.

 

Thanks for any advice!

 

John

oshtank-diagnostics-20150910-2128.zip

Link to comment

I had to check the /previous folder to see what's there.  It does just look like the previous main files, so you're correct, just copy them back and reboot.  All you really need are bzroot and bzimage.

 

Ok -- I rolled back to 6.0.1, and the errors are still there - though they seem somewhat reduced.  Sigh.  Maybe the controller card is bad. :(  Not sure.

 

Suggestions? 

 

Here is what I see in the syslog now:

 

Sep 10 21:44:41 OshTank rpc.mountd[7924]: authenticated mount request from 192.168.62.113:804 for /mnt/user/Vault (/mnt/user/Vault)
Sep 10 21:45:58 OshTank kernel: mptscsih: ioc0: attempting task abort! (sc=ffff88006c3b4c00)
Sep 10 21:45:58 OshTank kernel: sd 3:0:2:0: [sdd] tag#0 CDB: opcode=0x28 28 00 75 1c 63 a8 00 00 08 00
Sep 10 21:46:28 OshTank kernel: mptscsih: ioc0: WARNING - Issuing Reset from mptscsih_IssueTaskMgmt!! doorbell=0x24000000
Sep 10 21:46:28 OshTank kernel: mptbase: ioc0: Initiating recovery
Sep 10 21:46:37 OshTank kernel: mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff88006c3b4c00)
Sep 10 21:46:37 OshTank kernel: mptscsih: ioc0: attempting task abort! (sc=ffff88006c3b4a80)
Sep 10 21:46:37 OshTank kernel: sd 3:0:0:0: [sdb] tag#0 CDB: opcode=0x28 28 00 ae a8 67 80 00 00 08 00
Sep 10 21:46:37 OshTank kernel: mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff88006c3b4a80)
Sep 10 21:46:37 OshTank kernel: mptscsih: ioc0: attempting task abort! (sc=ffff88006c784000)
Sep 10 21:46:37 OshTank kernel: sd 3:0:3:0: [sde] tag#0 CDB: opcode=0x28 28 00 b3 f1 90 18 00 00 08 00
Sep 10 21:46:37 OshTank kernel: mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff88006c784000)
Sep 10 21:47:11 OshTank kernel: mptscsih: ioc0: attempting task abort! (sc=ffff88006c784300)
Sep 10 21:47:11 OshTank kernel: sd 3:0:2:0: [sdd] tag#0 CDB: opcode=0x28 28 00 ae a9 08 88 00 04 00 00
Sep 10 21:47:41 OshTank kernel: mptscsih: ioc0: WARNING - Issuing Reset from mptscsih_IssueTaskMgmt!! doorbell=0x24000000
Sep 10 21:47:41 OshTank kernel: mptbase: ioc0: Initiating recovery
Sep 10 21:47:50 OshTank kernel: mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff88006c784300)
Sep 10 21:47:50 OshTank kernel: mptscsih: ioc0: attempting task abort! (sc=ffff88006c784000)
Sep 10 21:47:50 OshTank kernel: sd 3:0:0:0: [sdb] tag#0 CDB: opcode=0x28 28 00 ae a9 08 88 00 04 00 00
Sep 10 21:47:50 OshTank kernel: mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff88006c784000)
Sep 10 21:47:50 OshTank kernel: mptscsih: ioc0: attempting task abort! (sc=ffff88006c784c00)
Sep 10 21:47:50 OshTank kernel: sd 3:0:3:0: [sde] tag#0 CDB: opcode=0x28 28 00 ae a9 08 88 00 04 00 00
Sep 10 21:47:50 OshTank kernel: mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff88006c784c00)
Sep 10 21:48:26 OshTank kernel: mptscsih: ioc0: attempting task abort! (sc=ffff88006c784f00)
Sep 10 21:48:26 OshTank kernel: sd 3:0:2:0: [sdd] tag#0 CDB: opcode=0x28 28 00 ae d8 da 68 00 03 00 00
Sep 10 21:48:56 OshTank kernel: mptscsih: ioc0: WARNING - Issuing Reset from mptscsih_IssueTaskMgmt!! doorbell=0x24000000
Sep 10 21:48:56 OshTank kernel: mptbase: ioc0: Initiating recovery
Sep 10 21:49:05 OshTank kernel: mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff88006c784f00)
Sep 10 21:49:05 OshTank kernel: mptscsih: ioc0: attempting task abort! (sc=ffff88006c3b5500)
Sep 10 21:49:05 OshTank kernel: sd 3:0:0:0: [sdb] tag#0 CDB: opcode=0x28 28 00 ae d8 d9 68 00 04 00 00
Sep 10 21:49:05 OshTank kernel: mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff88006c3b5500)
Sep 10 21:49:05 OshTank kernel: mptscsih: ioc0: attempting task abort! (sc=ffff88006c3b4a80)
Sep 10 21:49:05 OshTank kernel: sd 3:0:3:0: [sde] tag#0 CDB: opcode=0x2a 2a 00 ae a9 a1 c8 00 04 00 00
Sep 10 21:49:05 OshTank kernel: mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff88006c3b4a80)

 

Suggestions?  If I were to get a replacement controller, could I just move the drives and expect them to be recognized correctly, or would I have to reformat everything and start from scratch?

 

Thanks for any advice.

 

John

oshtank-diagnostics-20150910-2150.zip

Link to comment

Ok -- I rolled back to 6.0.1, and the errors are still there - though they seem somewhat reduced.  Sigh.  Maybe the controller card is bad. :(  Not sure.

The syslog covers less than an hour, and for such a short span the errors from that card look just as bad.  I can't say whether the card is bad, or just needs newer firmware, or there's an incompatibility somewhere.

 

If I were to get a replacement controller, could I just move the drives and expect them to be recognized correctly, or would I have to reformat everything and start from scratch?

The drives will move without a hiccup, so long as the same flash drive is used.  Whenever unRAID boots, it assumes it's on a new system it has never seen before, and gets the drive assignments and configuration from the super.dat on the boot flash.  You could move all the drives to a completely different system, boot the flash, and unRAID would look exactly the same, everything intact.
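If you're curious, you can see the file it reads from on the flash - per the standard unRAID flash layout it lives under /boot/config:

# drive assignments and array configuration live on the boot flash
ls -l /boot/config/super.dat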

Link to comment

I can't say whether the card is bad, or just needs newer firmware, or there's an incompatibility somewhere.

 

OK - thanks for the info.  Time to start troubleshooting in earnest.

 

The drives will move without a hiccup, so long as the same flash drive is used.

 

Very good to know.  Thanks, Rob.

 

John

Link to comment

The syslog covers less than an hour, and for such a short span the errors from that card look just as bad.  I can't say whether the card is bad, or just needs newer firmware, or there's an incompatibility somewhere.

 

OK - this is looking better.

 

For my next step in troubleshooting, I shut down all my VMs, powered down the VM host, unplugged it from the UPS (it had not been unplugged in over a year), re-seated the disk controller card, checked all SATA connections (they all seemed tight), powered up the host, and restarted the unRAID VM.  After a few minutes without errors, I brought up my other VMs, including a Linux host that NFS-mounts the unRAID shares and uses them for CrashPlan backups.

 

It's been 45 minutes with no console errors!

 

There are 7 backup jobs running.  I also used my TiVo to pull a movie from the unRAID NAS.  No obvious problems there either.

 

Maybe this was some weird VM host or power issue?

 

I've attached the diagnostic logs for the time since boot.

 

John

oshtank-diagnostics-20150911-1008.zip

Link to comment

Syslog looks completely clean!  It hasn't been up very long, but there were always problems before now.  It's a great start anyhow.  Hope you'll have nothing else to report - nothing negative, that is!

 

My inclination is that it was a card connection issue, possibly causing flaky data or power problems at the card's connections.  But I don't know that for sure.

Link to comment

Syslog looks completely clean!

 

Syslog still looks clean (attached).

 

I had a similar issue a few weeks ago with a video card passed through to Windows.  I had done a Windows Update and restarted the VM, and the video card started to freeze up after about 30 seconds of use.  I assumed it was related to the Windows Update, in the same way that I initially thought my unRAID problem was related to the 6.1 upgrade.  After seeing unRAID working, I went ahead and re-added the video card to my Windows VM (via passthrough), and now it is working as well!

 

 

What would you recommend to bring the disabled drive back online?  The message says the device is disabled, and contents are emulated.  Should I follow the instructions to re-enable the drive that are here:

 

http://lime-technology.com/wiki/index.php?title=Troubleshooting#Re-enable_the_drive

 

or should I follow the steps to check and fix an XFS disk here:

 

http://www.lime-technology.com/wiki/index.php/Check_Disk_Filesystems#Checking_and_fixing_drives_in_the_webGui

 

or should I do something else entirely?

 

Thanks for any guidance!

 

John

 

 

oshtank-diagnostics-20150911-1306.zip

Link to comment

Sorry, I forgot about Disk 1.  You will need to re-enable it by rebuilding it onto itself, to pick up any writes that may have occurred since it was disabled.  They would be in the emulated version of Disk 1.

 

The standard procedure is to unassign the drive, start and stop the array (to get unRAID to forget the drive), then re-assign the drive and start the array, which should begin the rebuild.
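While the rebuild runs, it may also be worth watching the syslog to make sure the controller errors don't come back; a simple watch along these lines would do it:

# follow the live syslog and flag any recurrence of the controller trouble
tail -f /var/log/syslog | grep -iE 'mptscsih|mptbase|I/O error'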

Link to comment

Sorry, I forgot about Disk 1.  You will need to re-enable it by rebuilding it onto itself, to pick up any writes that may have occurred since it was disabled.  They would be in the emulated version of Disk 1.

 

The re-build is underway!  Thanks again for your help.

 

John

Link to comment

Parity Check underway.

 

The parity check completed successfully.  After the system had been running for a couple of hours, I upgraded to 6.1.2 and used the unRAID GUI to reboot.

 

I immediately started seeing the same sort of disk controller errors that started this journey!

 

I shut down the unRAID VM, and then did the same VMware host power-down I did yesterday.  After shutting everything down and physically removing power, I waited for about a minute, then powered everything back up.

 

That was about 8 hours ago.  All has been running clean since.

 

I don't know what it is, but something about the leap from 6.0.1 to 6.1.x really needed power removed from the server.  Just rebooting the VM was not enough to clear it out.

 

I've added my current diagnostic logs for completeness, but will mark this as RESOLVED!

 

Thanks RobJ!

 

John

 

 

oshtank-diagnostics-20150912-1754.zip

Link to comment

Archived

This topic is now archived and is closed to further replies.
