[SOLVED] Disk disabled error: MBR: unaligned



Hello

I'm running Unraid 4.7 and I have not monitored the system for a while.

This morning I found a disk disabled, along with the following errors:

 

Apr 23 21:19:54 Tower kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

Apr 23 21:19:54 Tower kernel: ata2.00: failed command: READ DMA EXT

Apr 23 21:19:54 Tower kernel: ata2.00: cmd 25/00:00:3f:1d:18/00:04:67:00:00/e0 tag 0 dma 524288 in

Apr 23 21:19:54 Tower kernel:          res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Apr 23 21:19:54 Tower kernel: ata2.00: status: { DRDY }

Apr 23 21:19:59 Tower kernel: ata2: link is slow to respond, please be patient (ready=0)

Apr 23 21:20:04 Tower kernel: ata2: soft resetting link

Apr 23 21:20:09 Tower kernel: ata2: link is slow to respond, please be patient (ready=0)

Apr 23 21:20:14 Tower kernel: ata2: SRST failed (errno=-16)

Apr 23 21:20:14 Tower kernel: ata2: soft resetting link

Apr 23 21:20:19 Tower kernel: ata2: link is slow to respond, please be patient (ready=0)

Apr 23 21:20:24 Tower kernel: ata2: SRST failed (errno=-16)

Apr 23 21:20:24 Tower kernel: ata2: soft resetting link

Apr 23 21:20:29 Tower kernel: ata2: link is slow to respond, please be patient (ready=0)

Apr 23 21:20:59 Tower kernel: ata2: SRST failed (errno=-16)

Apr 23 21:20:59 Tower kernel: ata2: soft resetting link

Apr 23 21:21:04 Tower kernel: ata2: SRST failed (errno=-16)

Apr 23 21:21:04 Tower kernel: ata2: reset failed, giving up

Apr 23 21:21:04 Tower kernel: ata2.00: disabled

Apr 23 21:21:04 Tower kernel: ata2.00: device reported invalid CHS sector 0

Apr 23 21:21:04 Tower kernel: ata2: EH complete

Apr 23 21:21:04 Tower kernel: sd 2:0:0:0: [sdb] Unhandled error code

Apr 23 21:21:04 Tower kernel: sd 2:0:0:0: [sdb] Result: hostbyte=0x04 driverbyte=0x00

Apr 23 21:21:04 Tower kernel: sd 2:0:0:0: [sdb] CDB: cdb[0]=0x28: 28 00 67 18 1d 3f 00 04 00 00

Apr 23 21:21:04 Tower kernel: end_request: I/O error, dev sdb, sector 1729633599

Apr 23 21:21:04 Tower kernel: sd 2:0:0:0: [sdb] Unhandled error code

Apr 23 21:21:04 Tower kernel: sd 2:0:0:0: [sdb] Result: hostbyte=0x04 driverbyte=0x00

Apr 23 21:21:04 Tower kernel: sd 2:0:0:0: [sdb] CDB: cdb[0]=0x28: 28 00 67 18 21 3f 00 03 f8 00

Apr 23 21:21:04 Tower kernel: end_request: I/O error, dev sdb, sector 1729634623

Apr 23 21:21:04 Tower kernel: sd 2:0:0:0: [sdb] Unhandled error code

Apr 23 21:21:04 Tower kernel: sd 2:0:0:0: [sdb] Result: hostbyte=0x04 driverbyte=0x00

Apr 23 21:21:04 Tower kernel: sd 2:0:0:0: [sdb] CDB: cdb[0]=0x28: 28 00 00 00 42 d7 00 00 40 00

Apr 23 21:21:04 Tower kernel: end_request: I/O error, dev sdb, sector 17111

Apr 23 21:21:04 Tower kernel: sd 2:0:0:0: [sdb] Unhandled error code

Apr 23 21:21:04 Tower kernel: sd 2:0:0:0: [sdb] Result: hostbyte=0x04 driverbyte=0x00

 

and further down

 

sd 2:0:0:0: [sdb] CDB: cdb[0]=0x28: 28 00 67 19 5f 9f 00 04 00 00

Apr 23 21:21:04 Tower kernel: end_request: I/O error, dev sdb, sector 1729716127

Apr 23 21:21:04 Tower kernel: md: disk6 read error

Apr 23 21:21:04 Tower kernel: handle_stripe read error: 1729633536/2, count: 1

Apr 23 21:21:04 Tower kernel: md: disk6 read error

Apr 23 21:21:04 Tower kernel: handle_stripe read error: 1729633544/2, count: 1

Apr 23 21:21:04 Tower kernel: md: disk6 read error

Apr 23 21:21:04 Tower kernel: handle_stripe read error: 1729633552/2, count: 1

Apr 23 21:21:04 Tower kernel: md: disk6 read error

Apr 23 21:21:04 Tower kernel: handle_stripe read error: 1729633560/2, count: 1

Apr 23 21:21:04 Tower kernel: md: disk6 read error

 

etc.
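For anyone reading the excerpt above, the CDB lines can be decoded by hand. The following is a minimal sketch, assuming the standard 10-byte SCSI READ(10) layout (opcode 0x28, a 4-byte big-endian block address in bytes 2-5, a 2-byte transfer length in bytes 7-8). The function name `decode_read10` is made up for illustration.

```python
def decode_read10(cdb_hex):
    """Return (lba, sector_count) from a READ(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)  # whitespace between byte pairs is ignored
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")    # bytes 2-5: logical block address
    count = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return lba, count


lba, count = decode_read10("28 00 67 18 1d 3f 00 04 00 00")
print(lba, count, count * 512)  # prints 1729633599 1024 524288
```

The decoded address matches the "I/O error, dev sdb, sector 1729633599" line, and 1024 sectors of 512 bytes match the "dma 524288 in" of the failed command. The first handle_stripe sector (1729633536) is exactly 63 lower, which would be consistent with a partition that starts at sector 63, although that is an inference and not something the log states.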

 

I'm attaching the full syslog below.... and I'm sorry for the trouble..

 

After reading the sections regarding alignment in Unraid 4.7, I'm uncertain (read: confused!) about what I need to do to hold onto my data, if it is still intact. I'd very much appreciate your guidance. Thank you!

hg

syslog.1.zip

Link to comment

Disk6 is disabled. Please post a SMART report for /dev/sdb.

Hello;

Thanks so much for taking the time... I have gone through some radical activities: pulling everything out of the case, cleaning out the thick dust (computers are like vacuum cleaners!) and reinstalling everything in a larger and better ventilated case. I just finished reinstalling everything and rebooting, and the problem persists (disk 6 disabled due to, I guess, a misaligned MBR). Clicking on the disk link, I'm redirected to another screen, which has just this on it:

 

disk6 settings

Partition format: MBR: unaligned  

File system type: reiserfs  

Spin down delay: Use default                

Spinup group(s): host1  

 

Checking the syslog more carefully, I couldn't find any obvious errors related to either the hard drive or the controller.

 

I found a few other (new, I guess) messages, and I cannot tell where they come from or what I need to do about them. I've attached the new syslog to this message, as well as a SMART report for each of the disks, including sdb. I've extracted those messages from the syslog and posted them below for visibility.

 

 

Apr 26 17:41:09 Tower kernel: PCI: Using MMCONFIG for extended config space

Apr 26 17:41:09 Tower kernel: ACPI Warning: Incorrect checksum in table [OEMB] - 0D, should be 08 (20090903/tbutils-314)

Apr 26 17:41:09 Tower kernel: ACPI: No dock devices found.

 

Apr 26 17:41:09 Tower kernel: pci 0000:00:10.0: supports D1 D2

Apr 26 17:41:09 Tower kernel: pci 0000:00:10.0: PME# supported from D0 D1 D2 D3hot D3cold

Apr 26 17:41:09 Tower kernel: pci 0000:00:10.0: PME# disabled

Apr 26 17:41:09 Tower kernel: pci 0000:00:10.1: reg 20 io port: [0x9800-0x981f]

Apr 26 17:41:09 Tower kernel: pci 0000:00:10.1: supports D1 D2

Apr 26 17:41:09 Tower kernel: pci 0000:00:10.1: PME# supported from D0 D1 D2 D3hot D3cold

Apr 26 17:41:09 Tower kernel: pci 0000:00:10.1: PME# disabled

Apr 26 17:41:09 Tower kernel: pci 0000:00:10.2: reg 20 io port: [0x9880-0x989f]

Apr 26 17:41:09 Tower kernel: pci 0000:00:10.2: supports D1 D2

Apr 26 17:41:09 Tower kernel: pci 0000:00:10.2: PME# supported from D0 D1 D2 D3hot D3cold

Apr 26 17:41:09 Tower kernel: pci 0000:00:10.2: PME# disabled

Apr 26 17:41:09 Tower kernel: pci 0000:00:10.3: reg 20 io port: [0x9c00-0x9c1f]

 

Apr 26 17:41:09 Tower kernel: system 00:05: iomem range 0xfecc0000-0xfecc0fff could not be reserved

Apr 26 17:41:09 Tower kernel: system 00:07: iomem range 0xfec00000-0xfec00fff could not be reserved

Apr 26 17:41:09 Tower kernel: system 00:07: iomem range 0xfee00000-0xfee00fff has been reserved

Apr 26 17:41:09 Tower kernel: system 00:0a: iomem range 0xe0000000-0xefffffff has been reserved

Apr 26 17:41:09 Tower kernel: system 00:0b: iomem range 0x0-0x9ffff could not be reserved

Apr 26 17:41:09 Tower kernel: system 00:0b: iomem range 0xc0000-0xcffff could not be reserved

Apr 26 17:41:09 Tower kernel: system 00:0b: iomem range 0xe0000-0xfffff could not be reserved

Apr 26 17:41:09 Tower kernel: system 00:0b: iomem range 0x100000-0x7fffffff could not be reserved

Apr 26 17:41:09 Tower kernel: system 00:0b: iomem range 0xfeb00000-0xffffffff could not be reserved

Apr 26 17:41:09 Tower kernel: pci 0000:03:04.0: BAR 6: address space collision on of device [0xfea00000-0xfea7ffff]

Apr 26 17:41:09 Tower kernel: pci 0000:03:05.0: BAR 6: address space collision on of device [0xfe980000-0xfe9fffff]

Apr 26 17:41:09 Tower kernel: pci 0000:03:06.0: BAR 6: address space collision on of device [0xfeac0000-0xfeadffff]

Apr 26 17:41:09 Tower kernel: pci 0000:03:07.0: BAR 6: address space collision on of device [0xfeae0000-0xfeae3fff]

Apr 26 17:41:09 Tower kernel: pci 0000:00:01.0: PCI bridge, secondary bus 0000:01

Apr 26 17:41:09 Tower kernel: pci 0000:00:01.0: IO window: 0xb000-0xbfff

Apr 26 17:41:09 Tower kernel: pci 0000:00:01.0: MEM window: 0xfe800000-0xfe8fffff

Apr 26 17:41:09 Tower kernel: pci 0000:00:01.0: PREFETCH window: 0xf4000000-0xf7ffffff

Apr 26 17:41:09 Tower kernel: pci 0000:00:02.0: PCI bridge, secondary bus 0000:02

Apr 26 17:41:09 Tower kernel: pci 0000:00:02.0: IO window: 0x1000-0x1fff

Any further suggestions for troubleshooting and resolution are greatly appreciated!

 

Thank you

logs.zip

Link to comment

The disk is disabled because a write to it failed. MBR: unaligned has nothing to do with a disk being disabled and is a correct setting for non-Advanced Format drives. The other messages you posted are informational startup messages.

 

ata2=disk6=sdb is disabled. In order to determine the status of this disk we require a SMART report. Please post a SMART report for this disk.
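To illustrate why "MBR: unaligned" is harmless on a non-Advanced Format drive, here is a minimal sketch of the alignment arithmetic. It assumes that "unaligned" means the partition starts at sector 63 and that the aligned alternative starts at sector 64; those start sectors are not stated in this thread. The helper name `is_4k_aligned` is made up for illustration.

```python
SECTOR_BYTES = 512      # logical sector size
PHYSICAL_BYTES = 4096   # physical sector size on an Advanced Format drive


def is_4k_aligned(start_sector):
    """True if the partition's first byte falls on a 4096-byte boundary."""
    return (start_sector * SECTOR_BYTES) % PHYSICAL_BYTES == 0


for start in (63, 64):  # assumed start sectors, see above
    print(start, is_4k_aligned(start))
# prints:
# 63 False
# 64 True
```

On a drive with 512-byte physical sectors every start sector sits on a physical boundary, so the distinction only affects performance on 4K drives and has no bearing on whether a disk gets disabled.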

Link to comment

The disk is disabled because a write to it failed. MBR: unaligned has nothing to do with a disk being disabled and is a correct setting for non-Advanced Format drives. The other messages you posted are informational startup messages.

 

ata2=disk6=sdb is disabled. In order to determine the status of this disk we require a SMART report. Please post a SMART report for this disk.

 

OK. Thanks for the clarification.

Sorry for not posting the SMART reports earlier. I picked the wrong file from my disk. I've just removed it and attached a new zip file to the previous message, containing both the SMART reports and the latest syslog. I hope they will point to the problem.

 

Also... I just realized that I actually can access the disk, even if it is marked RED, and disabled (in unmenu screen).

I can read the information on disk6, open directories and files, as if nothing happened. I'm confused!

 

 

Link to comment
Also... I just realized that I actually can access the disk, even if it is marked RED, and disabled (in unmenu screen).

I can read the information on disk6, open directories and files, as if nothing happened. I'm confused!

 

That's what the parity does.

 

Try stopping the array and unassigning the disk from the slot. Then, start the array and stop it again. Then, assign the disk again and start the array again allowing the disk to rebuild. If it was a connection problem then the disk will be rebuilt, otherwise the rebuild will fail.

 

Peter

 

Link to comment

The SMART report for Disk 6 looks very good, and the syslog looks great, with Disk 6 found and mounted like normal, no issues at all.  In fact it looks so good, I cannot tell from the syslog that it is marked Disabled!  ( I hope that in the future Tom will allow us to see a Disabled status in the syslog )

 

Your array was running fine until it suddenly lost contact with Disk 6 at 21:19:54.  It tried and tried but could not get any response from the drive, and in less than 90 seconds, had disabled the drive.  ALL subsequent errors can be ignored because you obviously cannot read or write to a drive that the operating system no longer considers present.  In my experience, a drive that suddenly stops responding, with no other errors, is usually completely fine, except in the somewhat rare case where it has suffered a catastrophic failure.  An attempt to retrieve a SMART report will then clearly indicate whether the drive is fine or not.  Because you have a clean SMART report, the drive is not at fault, and therefore the problem must have been an anomaly with the disk controller (VIA-based), or a cable that came loose.  Powering off, then checking the cables, then powering back on usually clears the issue.  If it happens again, you may have motherboard/controller issues.
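The "less than 90 seconds" figure can be checked against the first syslog excerpt in this thread, where the first ata2.00 exception is stamped 21:19:54 and "ata2.00: disabled" is stamped 21:21:04. A minimal sketch follows; syslog timestamps carry no year, so the year used here is an arbitrary placeholder, and the helper names are made up for illustration.

```python
from datetime import datetime

ASSUMED_YEAR = 2000  # arbitrary placeholder: the log lines do not include a year


def parse_syslog_time(stamp, year=ASSUMED_YEAR):
    """Parse a 'Mon DD HH:MM:SS' syslog timestamp (English month names)."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def elapsed_seconds(first, last):
    """Whole seconds between two syslog timestamps in the same year."""
    delta = parse_syslog_time(last) - parse_syslog_time(first)
    return int(delta.total_seconds())


print(elapsed_seconds("Apr 23 21:19:54", "Apr 23 21:21:04"))  # prints 70
```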

 

Since the drive is probably fine, you can use the Trust My Array procedure to more quickly restore the array - or let it rebuild per Peter's instructions.

Link to comment

Folks, thanks very much for your help. I went with rebuilding the drive, just because I read Peter's advice first. In retrospect, I should have plugged in a new, identical drive and kept the original drive untouched, in case the rebuild ran into problems. Knock on wood - the rebuild is still several hours away from completion.

 

While that is working away, I have two more questions, if you don't mind...:

1.- would you interpret the previously posted boot-up messages where memory cannot be reserved, and BAR6 - address space collision as unresolved conditions (by the kernel)? What about ACPI checksum error? Should PME be enabled in the BIOS?  Do these require my intervention in any way?

 

2.- since I don't have a backup for my unraid stored data, and rely exclusively on the parity drive to keep it intact, has anyone experience with mirroring the parity drive at the MB BIOS level? Would unraid recognize it as one drive only, and just use it as such? Is the parity drive mirroring offering any real (additional) protection?

 

Thanks again very much for your help!

Link to comment

Since the drive is probably fine, you can use the Trust My Array procedure to more quickly restore the array - or let it rebuild per Peter's instructions.

Make an informed decision... you can choose.

 

You can get back the data you were writing to the failed disk by NOT using the "trust-my-array" Procedure, and re-construct the failed drive as described earlier in this thread,

or

You can get back immediate parity protection by forcing the array to trust/think parity is correct, but know it will have errors, because a write to the physical drive failed, so it may not accurately be usable in recovery situations until after a full parity check is performed.

 

 

Link to comment

since I don't have a backup for my unraid stored data

A RAID array is NOT a replacement for a backup of your precious data.  All it would take is one lightning strike, flood, fire, etc. to cause you to lose all your files.  Please invest in a backup drive, stored off-site, for really important files.

has anyone experience with mirroring the parity drive at the MB BIOS level? Would unraid recognize it as one drive only, and just use it as such?
This has been done by at least one member, but it was done for speed, not for the additional protection.  The user had multiple disks being written at the same time.  It will usually eliminate the capability to spin down the drive or read its temperature.
Is the parity drive mirroring offering any real (additional) protection?

The parity disk is no more important than any other disk in your server when re-constructing a failed disk.  All are equally important.
Link to comment

Is the parity drive mirroring offering any real (additional) protection?

 

Nope, it offers almost nothing extra at all. The only case where it would help is if the parity and a data drive failed at the same time. You'd be better off mirroring the data drives. Well, you'd really be better off storing any critical data in an off-site backup.

 

Peter

 

Link to comment

Thanks for the backup advice. I know that I have to bite the bullet, and such events show it to be important.

I'm thinking of replicating the data over to a friend's server (setting up something reciprocal); I just haven't found the time to do it. The idea is to use rsync over ssh to send periodic updates once we have exchanged a copy of our data.
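A rough sketch of what such a periodic rsync-over-ssh job might look like, written as a small Python helper that only builds the command. The share path, the remote account and the destination directory are all hypothetical placeholders; the rsync options used (-a, -z, -e ssh, --dry-run) are standard ones.

```python
import shlex


def build_rsync_command(source_dir, remote, dest_dir, dry_run=True):
    """Build an rsync-over-ssh command line as a list of arguments."""
    cmd = ["rsync", "-az", "-e", "ssh"]  # archive mode, compression, ssh transport
    if dry_run:
        cmd.append("--dry-run")          # report what would be sent, copy nothing
    # A trailing slash on the source copies its contents, not the directory itself.
    cmd += [source_dir.rstrip("/") + "/", f"{remote}:{dest_dir}"]
    return cmd


# Hypothetical paths and host, for illustration only.
cmd = build_rsync_command("/mnt/user/photos", "friend@example.org", "/mnt/user/backup/photos")
print(shlex.join(cmd))
```

The resulting list could be passed to subprocess.run from a scheduled job once the dry run shows the expected file list.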

 

What is the Pro community on this forum using for backup of larger volumes of data? I've been putting this off for quite some time, for cost reasons, but also thinking that I will have to manage those backups (periodic verification, refresh, purging etc)

 

Coming back to the server: I guess the lack of comments means those boot-up messages didn't trigger any thoughts? Since I'm at it now, I'm trying to resolve everything I can in one swoop.

 

I'm also going to attach the screen with the SMART history, hopefully it will work... There are a couple of things I have kept ignoring, and I'm wondering what your opinion is...

 

- It didn't... the file is 340kb and it didn't load -

I hope you can pick it up from here: http://ifile.it/usoaqc5/Downloads.zip - found this free online storage site w/o a registration requirement. I hope it will work...

 

Thanks again very much for your time!

 

Link to comment

What is the Pro community on this forum using for backup of larger volumes of data? I've been putting this off for quite some time, for cost reasons, but also thinking that I will have to manage those backups (periodic verification, refresh, purging etc)

I have CrashPlan running on my server and it works to back up most of my stuff.  I do not keep backups of my DVD/TV shows as I can, if I have to, rip them all to the server again.  For the more important stuff (photos, documents, etc.) I use CrashPlan to back up to the cloud.

 

Link to comment

Hello to All and thanks for all the assistance.

 

My server is back to normal:

Array Status

STARTED, 10 disks in array.    Parity is Valid:.  Last parity check < 1 day ago .  Parity updated 1 times to address sync errors. 

I could not find out where the parity error came from, but I hope that, now that all the drive statuses are green, I'm back to normal.

 

If anyone has some energy left before I close this thread, I'd appreciate some feedback/advice on my previous post (boot-up messages and HDD SMART history warnings).

 

I've upgraded unmenu in the meantime, and it looks awesome. Thanks for this great add-on; it makes it easier to look at system details.

 

I'll take a closer look at CrashPlan for backup - at first sight it looked interesting, with all the data deduplication and everything.

 

hg

 

 

Link to comment

1.- would you interpret the previously posted boot-up messages where memory cannot be reserved, and BAR6 - address space collision as unresolved conditions (by the kernel)? What about ACPI checksum error? Should PME be enabled in the BIOS?  Do these require my intervention in any way?

 

Just ignore them, these are completely normal, and you can find variations of almost all of them in all other syslogs.  It is just the Linux OS adapting itself to your machine, determining what can be reserved and where, etc.

 

As to the SMART display you linked, it is in a form I am not familiar with, but enough info is there for me to say that there is nothing of concern visible.  Most of the warnings are about very high hours, but you probably knew that.  Many of the drives are quite old, with a lot of hours on them.  You can perhaps adjust the warning and error thresholds for them, to quash those warnings.  The only other ones I see are 1 Reallocated_sector count, 1 UDMA_CRC count, and 1 ATA_Error count, and none of them have changed over the entire time of monitoring, so not a current problem.
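The "has not changed over the monitoring period" check can be sketched as a comparison of two snapshots of raw attribute values. The attribute labels below follow the wording of this post and are not exact smartctl attribute names; the counts of 1 are the ones mentioned above, and the function name is made up for illustration.

```python
def changed_attributes(older, newer):
    """Return {name: (old, new)} for every attribute whose raw value differs."""
    return {
        name: (older.get(name), value)
        for name, value in newer.items()
        if older.get(name) != value
    }


# Snapshots keyed by attribute label, using the counts mentioned in this post.
first = {"Reallocated_sector": 1, "UDMA_CRC": 1, "ATA_Error": 1}
latest = {"Reallocated_sector": 1, "UDMA_CRC": 1, "ATA_Error": 1}
print(changed_attributes(first, latest))  # prints {}
```

An empty result means the counters are historical; a counter that keeps climbing between snapshots would be the thing to watch.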

 

Just a side note, I'm not sure it is useful to graph Power_On_Hours.  Isn't it just tracking how time passes - over time?  I suppose if one was shutting the machine off more one month than another month, it could be useful, but only from a server usage standpoint, hardly from a SMART standpoint.

Link to comment
