unraid keeps corrupting my ssd?

November 3, 2017

Starting at the beginning: I had two SSDs in a cache pool when I first installed unraid, a 128GB and a 64GB. I noticed that only half of the 128 was being used, so I removed the 64.

Then it showed 128gb capacity on cache.

Once the cache drive held 63GB or more, however, I started to get "full" errors sometimes, and VMs would pause, etc.

That was when I realized my cache drive was still only using a 64GB btrfs file system. I dealt with it for a while though, because I didn't want to take it out of service to redo it.

Fast forward a few months. Array drives are maxed out. 8 drives. 4 in each drive cage, connected via sff8087 to sata breakout cables to my IBM m1015. (IT mode)

Bought a proper 12 drive shelf. Connects via 3xsff8088 to a 8201-16e. All array drives show up, and all is well. Except when I connected my m1015 to the backplane of my server, and put my 128Gb ssd in a slot....it wasn't recognized. Didn't show up at all.

Hmm....that's strange. It was working on this same card, when it was in the drive cage. No matter....I have a 12 slot shelf now, and only 9 drives.....I'll just put it in slot number 12.

Detected it, showed "unmountable" and didn't mount it as cache.

I tried a bunch of things to get it to mount, but eventually gave up.

Remembering the less than ideal 64GB partition I'd been using, I thought, "Well, I'll just format it and start over....reinstall my dockers, and copy my vm .img file back to it."

Formatted, assigned as cache. All is well.

Until I try to install a docker:

Error: failed to register layer: chown /var/lib/docker/btrfs/subvolumes: read-only file system

Looking at forums, I found "read only file system" is usually because of corruption.

Looked in disk logs, and saw things like:

Nov 3 11:34:22 RITTERNET1 kernel: sd 1:0:7:0: [sdi] tag#13 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
Nov 3 11:34:22 RITTERNET1 kernel: sd 1:0:7:0: [sdi] tag#13 CDB: opcode=0x2a 2a 00 00 43 55 30 00 08 00 00
Nov 3 11:34:22 RITTERNET1 kernel: blk_update_request: I/O error, dev sdi, sector 4412720

Also:

Nov 3 11:34:22 RITTERNET1 kernel: Buffer I/O error on dev sdi1, logical block 550814, lost async page write
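As an aside for anyone reading these logs: the CDB bytes in the second kernel line can be decoded by hand. The sketch below assumes the standard 10-byte SCSI WRITE(10) layout (opcode 0x2a, a 4-byte big-endian LBA in bytes 2-5, a 2-byte transfer length in bytes 7-8). The function name `decode_cdb10` is made up for illustration.

```python
import re


def decode_cdb10(cdb_hex: str) -> dict:
    """Decode a 10-byte SCSI READ(10)/WRITE(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB")
    return {
        "opcode": cdb[0],                            # 0x2a is WRITE(10)
        "lba": int.from_bytes(cdb[2:6], "big"),      # starting logical block
        "blocks": int.from_bytes(cdb[7:9], "big"),   # transfer length in blocks
    }


# Log line copied from the post above.
line = "sd 1:0:7:0: [sdi] tag#13 CDB: opcode=0x2a 2a 00 00 43 55 30 00 08 00 00"
cdb_hex = re.search(r"opcode=0x[0-9a-f]{2}((?: [0-9a-f]{2})+)", line).group(1)
info = decode_cdb10(cdb_hex)
print(hex(info["opcode"]), info["lba"], info["blocks"])
# prints: 0x2a 4412720 2048
```

The decoded LBA, 4412720, matches the sector in the `blk_update_request` line, so the failure was on a write command sent to the drive rather than on something Docker-specific.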

So I thought my ssd must be bad. Replaced with brand new 850 pro 256gb......and got the same behavior.

All seems well, until I try to install a docker...

I tried to change it to reiserfs, format, then switch to xfs, format..... as soon as I try to install a docker.....errors.

It's like the docker partition is inherently corrupt!

Diagnostic files attached.

ritternet1-diagnostics-20171103-1216.zip

Edited November 3, 2017 by castanza128
clarity

November 3, 2017

I believe that in order to do a cache pool, the drives have to be the same size. I could be wrong.

November 3, 2017

Author

Just to be clear:

My problem is that I can't set docker or VM to store on my SSD.

Whether the SSD is mounted as cache or mounted separately with "Unassigned Devices", it always gives errors as soon as I try to install a docker.

XFS, ReiserFS, or BTRFS, it doesn't matter. It will act like a corrupt drive and go read-only.

It's not the drive, because I replaced it, and got the same behavior......and the old drive was removed, formatted ntfs, and is now happily working in another machine.

Edited November 3, 2017 by castanza128

November 3, 2017

Your log is full of these errors:

Nov  3 06:21:16 RITTERNET1 root: ACPI group processor / action LNXCPU:00 is not defined
Nov  3 06:21:16 RITTERNET1 root: ACPI group processor / action LNXCPU:01 is not defined
Nov  3 06:21:16 RITTERNET1 root: ACPI group processor / action LNXCPU:02 is not defined
Nov  3 06:21:16 RITTERNET1 root: ACPI group processor / action LNXCPU:03 is not defined
Nov  3 06:21:16 RITTERNET1 root: ACPI group processor / action LNXCPU:04 is not defined
Nov  3 06:21:16 RITTERNET1 root: ACPI group processor / action LNXCPU:05 is not defined
Nov  3 06:21:16 RITTERNET1 root: ACPI group processor / action LNXCPU:06 is not defined
Nov  3 06:21:16 RITTERNET1 root: ACPI group processor / action LNXCPU:07 is not defined
Nov  3 06:21:16 RITTERNET1 root: ACPI group processor / action LNXCPU:08 is not defined
Nov  3 06:21:16 RITTERNET1 root: ACPI group processor / action

And there's nothing logged after this, so try this: reboot, start the array, try to install a docker, then grab and post new diags.

November 3, 2017

Author

10 minutes ago, johnnie.black said:

Your log is full of these errors:


Nov  3 06:21:16 RITTERNET1 root: ACPI group processor / action LNXCPU:00 is not defined
Nov  3 06:21:16 RITTERNET1 root: ACPI group processor / action LNXCPU:01 is not defined
Nov  3 06:21:16 RITTERNET1 root: ACPI group processor / action LNXCPU:02 is not defined
Nov  3 06:21:16 RITTERNET1 root: ACPI group processor / action LNXCPU:03 is not defined
Nov  3 06:21:16 RITTERNET1 root: ACPI group processor / action LNXCPU:04 is not defined
Nov  3 06:21:16 RITTERNET1 root: ACPI group processor / action LNXCPU:05 is not defined
Nov  3 06:21:16 RITTERNET1 root: ACPI group processor / action LNXCPU:06 is not defined
Nov  3 06:21:16 RITTERNET1 root: ACPI group processor / action LNXCPU:07 is not defined
Nov  3 06:21:16 RITTERNET1 root: ACPI group processor / action LNXCPU:08 is not defined
Nov  3 06:21:16 RITTERNET1 root: ACPI group processor / action

And there's nothing logged after this, so try this: reboot, start the array, try to install a docker, then grab and post new diags.

I have always had that error, even when ssd was acting normally.

Fresh diags right after reboot, and docker install attempt:

diags.zip

Edited November 3, 2017 by castanza128

November 3, 2017

It's definitely a hardware problem: there are 16000+ CRC errors on the SSD, so it's likely a bad SATA cable.
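For anyone wanting to check this counter themselves, here is a minimal sketch that pulls the CRC error count out of `smartctl -A` text. It assumes the usual ten-column attribute table and that the counter is SMART attribute ID 199 (the attribute name varies by vendor, so the sketch matches on the ID). The `SAMPLE` text is illustrative only, not copied from the posted diagnostics, and `crc_error_count` is a hypothetical helper name.

```python
from typing import Optional

# Illustrative sample in the layout printed by `smartctl -A /dev/sdX`;
# the values are placeholders, NOT taken from the diagnostics in this thread.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1234
199 CRC_Error_Count         0x003e   099   099   000    Old_age   Always       -       16000
"""


def crc_error_count(smartctl_output: str) -> Optional[int]:
    """Return the raw value of SMART attribute 199, or None if it is absent."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns; the raw value is column 10.
        if len(fields) >= 10 and fields[0] == "199":
            return int(fields[9])
    return None


print(crc_error_count(SAMPLE))
```

A raw value that keeps climbing between checks usually points at the link (cable, backplane slot, or port) rather than the flash itself, which fits a replacement SSD showing the same behavior.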

P.S.: The LSI2008 doesn't support TRIM on most SSDs, including yours, so the SSD would be better off on another controller. Your onboard controller is set to IDE; if it supports AHCI, you should connect it there.
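One way to see whether the kernel will issue TRIM to a drive behind a given controller is to read the block queue's discard limit in sysfs. This is a sketch under the assumption that `/sys/block/<dev>/queue/discard_max_bytes` is present and that a value of 0 means discard is not available for that device. The device name `sdi` is taken from the logs earlier in the thread; `supports_discard` and its `sysfs_root` parameter are hypothetical names.

```python
from pathlib import Path


def supports_discard(device: str, sysfs_root: str = "/sys/block") -> bool:
    """Return True if the kernel reports a non-zero discard limit for device."""
    path = Path(sysfs_root) / device / "queue" / "discard_max_bytes"
    try:
        return int(path.read_text().strip()) > 0
    except (OSError, ValueError):
        # Missing device, unreadable file, or unexpected contents.
        return False


if __name__ == "__main__":
    # "sdi" is the SSD's device name in the logs above; adjust as needed.
    print(supports_discard("sdi"))
```

Comparing the result with the SSD on the HBA and then on an onboard AHCI port would show whether the controller is what blocks TRIM.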

November 3, 2017

52 minutes ago, castanza128 said:

I have always had that error, even when ssd was acting normally.

Forgot to say: I wasn't saying these errors are a problem, just that they spam the syslog and make it much harder to find the real problems, and because of them the syslog may fill up and stop logging.
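When reading a syslog flooded like this, filtering the known-noisy lines out first makes the real errors easier to spot. A rough sketch, assuming the spam always contains the literal text `ACPI group processor / action` as in the excerpt above (`drop_acpi_spam` is a made-up helper name):

```python
from typing import Iterable, List

NOISE = "ACPI group processor / action"


def drop_acpi_spam(lines: Iterable[str]) -> List[str]:
    """Return only the lines that are not the repeated ACPI messages."""
    return [line for line in lines if NOISE not in line]


# Lines copied from earlier posts in this thread.
log = [
    "Nov  3 06:21:16 RITTERNET1 root: ACPI group processor / action LNXCPU:00 is not defined",
    "Nov 3 11:34:22 RITTERNET1 kernel: blk_update_request: I/O error, dev sdi, sector 4412720",
    "Nov  3 06:21:16 RITTERNET1 root: ACPI group processor / action LNXCPU:01 is not defined",
]
for line in drop_acpi_spam(log):
    print(line)
```

This only helps when reading the log; it does not stop the messages from being written, so the syslog can still fill up.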

November 3, 2017

Author

21 minutes ago, johnnie.black said:

It's definitely a hardware problem: there are 16000+ CRC errors on the SSD, so it's likely a bad SATA cable.

P.S.: The LSI2008 doesn't support TRIM on most SSDs, including yours, so the SSD would be better off on another controller. Your onboard controller is set to IDE; if it supports AHCI, you should connect it there.

Next time I can get physical access to the server (in a few days), I'll try it on another port, or maybe move that m1015 card to another slot and see if I can get it to recognize the SSD.

It was strange that when I connected the m1015 to my backplane, it wouldn't detect the SSD in the first drive slot. Maybe it is conflicting with the lsi2008 because they are on the same PCIe riser? Worth a try.
