Unraid OS version 6.10.0 available


Recommended Posts

2 hours ago, Vr2Io said:

Just confirming this also affects NVMe

I would be cautious about suspecting a general issue after a single occurrence; lots of people are using NVMe devices, and I myself have eight between my two main servers. See if you can reproduce the issue with a different board; if not, then we could try to find someone with the same one. It could just be some defective hardware.

Link to comment
1 hour ago, JorgeB said:

I would be cautious about suspecting a general issue after a single occurrence; lots of people are using NVMe devices, and I myself have eight between my two main servers. See if you can reproduce the issue with a different board; if not, then we could try to find someone with the same one. It could just be some defective hardware.

That's fine, and I agree, so I tested on another platform, an AMD X399, since it happened to be free for testing. All the hardware is different, running OS 6.10.1 with IOMMU enabled.

 

The test result was even worse: the NVMe can't be mounted after formatting (I tried twice).

 

May 28 16:25:12 X399 unassigned.devices: Removing all partitions from disk '/dev/nvme0n1'.
May 28 16:25:25 X399 unassigned.devices: Format device '/dev/nvme0n1'.
May 28 16:25:25 X399 unassigned.devices: Device '/dev/nvme0n1' block size: 1000215216.
May 28 16:25:25 X399 unassigned.devices: Clearing partition table of disk '/dev/nvme0n1'.
May 28 16:25:25 X399 unassigned.devices: Clear partition result: 1+0 records in 1+0 records out 2097152 bytes (2.1 MB, 2.0 MiB) copied, 0.00448569 s, 468 MB/s
May 28 16:25:25 X399 unassigned.devices: Reloading disk '/dev/nvme0n1' partition table.
May 28 16:25:25 X399 unassigned.devices: Reload partition table result: /dev/nvme0n1:  re-reading partition table
May 28 16:25:25 X399 unassigned.devices: Creating Unraid compatible mbr partition on disk '/dev/nvme0n1'.
May 28 16:25:26 X399 kernel: nvme0n1: p1
May 28 16:25:26 X399 unassigned.devices: Create mbr partition table result: write mbr signature done 
May 28 16:25:26 X399 unassigned.devices: Reloading disk '/dev/nvme0n1' partition table.

May 28 16:25:26 X399 unassigned.devices: Reload partition table result: BLKRRPART failed: Device or resource busy  /dev/nvme0n1:  re-reading partition table
May 28 16:25:26 X399 unassigned.devices: Formatting disk '/dev/nvme0n1' with 'btrfs' filesystem.
May 28 16:25:31 X399 kernel: BTRFS: device fsid 1da2b2f0-b622-4cd3-83f3-1bfb840e7ba5 devid 1 transid 6 /dev/nvme0n1p1 scanned by mkfs.btrfs (43803)
May 28 16:25:34 X399 unassigned.devices: Reloading disk '/dev/nvme0n1' partition table.
May 28 16:25:34 X399 kernel: nvme0n1: p1
May 28 16:25:34 X399 unassigned.devices: Reload partition table result: /dev/nvme0n1:  re-reading partition table
May 28 16:25:41 X399 unassigned.devices: Adding partition 'nvme0n1p1'...
May 28 16:25:41 X399 unassigned.devices: Mounting partition 'nvme0n1p1' at mountpoint '/mnt/disks/SSD'...
May 28 16:25:41 X399 unassigned.devices: Mount drive command: /sbin/mount -t 'ntfs' -o rw,noatime,nodiratime,nodev,nosuid,nls=utf8,umask=000 '/dev/nvme0n1p1' '/mnt/disks/SSD'

May 28 16:25:41 X399 unassigned.devices: Mount of 'nvme0n1p1' failed: 'NTFS signature is missing. Failed to mount '/dev/nvme0n1p1': Invalid argument The device '/dev/nvme0n1p1' doesn't seem to have a valid NTFS. Maybe the wrong device is used? Or the whole disk instead of a partition (e.g. /dev/sda, not /dev/sda1)? Or the other way around? '
May 28 16:25:41 X399 unassigned.devices: Partition 'P02848106820' cannot be mounted.

After changing to 6.10.2, everything returned to normal. But when I restored 6.10.1 (X399 platform), the problem couldn't be reproduced even after rebooting again... anyway, I may do more testing on the original problem hardware (B365).

 

** The problem above appears to be a UD issue: the partition-table reload fails and it then tries to mount the btrfs partition as NTFS. **
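
For anyone hitting the same symptom, a quick sanity check (a minimal sketch; the device path and mountpoint are taken from the log above) is to ask blkid what is really on the partition before mounting, instead of trusting the type the tooling guessed:

blkid /dev/nvme0n1p1    # expected after the format above: TYPE="btrfs"
mkdir -p /mnt/disks/SSD
mount -t btrfs -o rw,noatime /dev/nvme0n1p1 /mnt/disks/SSD    # mount explicitly as btrfs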

 

During the B365 mobo upgrade I tested many items many times, and the corruption problem was never solved... both NVMe drives (WD & Samsung) were new.

Edited by Vr2Io
  • Like 1
Link to comment
2 hours ago, jkpe said:

Just to confirm... if I have been running 6.10.1 on a MicroServer Gen8 with a Xeon processor, should I definitely expect file corruption or just maybe? How widespread is it?

To add to this, is there an easy way to check for this with XFS (both cache and array are XFS for me)? I have only found info on this issue with btrfs. Should we assume that if a drive does have corruption, it was also written incorrectly to parity and wouldn't show up as an issue during a parity check?

Link to comment
20 hours ago, JorgeB said:

I would be cautious about suspecting a general issue after a single occurrence; lots of people are using NVMe devices, and I myself have eight between my two main servers. See if you can reproduce the issue with a different board; if not, then we could try to find someone with the same one. It could just be some defective hardware.

My NVMe cache (raid1) just went read-only this morning, probably a corrupted filesystem.

Can't say for sure it's related, but it's the first time in 5 years I've had it happen.

 

(currently on 6.10.2)

EDIT: after reboot cache seems to have come good

Edited by tjb_altf4
Link to comment
11 hours ago, jkpe said:

if I have been running 6.10.1 on a MicroServer Gen8 with a Xeon processor, should I definitely expect file corruption or just maybe?

I wouldn't say definitely. There have been multiple cases, but as expected we mostly hear from people that had issues, so it's unclear if all are affected; also, some cases are much more serious than others. I would say it's a good sign if you've been running the server for a few days without noticing any issues, like Docker stopping working, etc. If you want, post the diagnostics or PM them to me and I might be able to give you a better idea.

  • Like 2
Link to comment
9 hours ago, BiGBaLLA said:

To add to this, is there an easy way to check for this with XFS (both cache and array are XFS for me)? I have only found info on this issue with btrfs. Should we assume that if a drive does have corruption, it was also written incorrectly to parity and wouldn't show up as an issue during a parity check?

With btrfs it's easier to see if there have been issues: it records any corruption found, and it usually goes read-only quickly if the issue is bad enough. With XFS you might only notice problems in the most severe cases, like an unmountable filesystem or a lost+found appearing after running xfs_repair. Running a parity check is also a good idea.
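
For reference, a minimal sketch of those checks (paths and device names are examples; on Unraid the array disks are checked via their /dev/mdX devices, with the array started in maintenance mode for xfs_repair):

btrfs device stats /mnt/cache    # per-device error counters the filesystem has recorded
btrfs scrub start -B /mnt/cache    # foreground scrub of all data and metadata
btrfs scrub status /mnt/cache    # summary, including any corruption found
xfs_repair -n /dev/md1    # XFS no-modify check: reports problems without changing anything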

  • Like 1
Link to comment
9 hours ago, tjb_altf4 said:

My NVMe cache (raid1) just went read-only this morning, probably a corrupted filesystem.

Can't say for sure it's related, but it's the first time in 5 years I've had it happen.

 

(currently on 6.10.2)

EDIT: after reboot cache seems to have come good

 

If you saved the diags before rebooting, please post them, or PM them to me if you prefer.

Link to comment
2 hours ago, JorgeB said:

I wouldn't say definitely. There have been multiple cases, but as expected we mostly hear from people that had issues, so it's unclear if all are affected; also, some cases are much more serious than others. I would say it's a good sign if you've been running the server for a few days without noticing any issues, like Docker stopping working, etc. If you want, post the diagnostics or PM them to me and I might be able to give you a better idea.

Thank you.

I have run a scrub against both caches and both came back with 0 errors. I haven't noticed any issues. I have disabled VT-d this morning.

Link to comment
On 5/20/2022 at 8:12 PM, JorgeB said:

I would recommend anyone running an HP MicroServer Gen8 not to update for now. There have been multiple cases of filesystem corruption after updating, with both XFS and btrfs; it looks like the hardware doesn't get along with the new kernel. It's not clear if it's all models in general or just some specific BIOS/CPU combos, so if anyone updated without issues please post here.

Hi JorgeB, I do have an HP MicroServer Gen8 with an E3-1265L V2 processor and VT-d enabled, and have no such stability issues nor the errors you describe in the syslog with the 6.10 release.

 

Just to be sure, I followed your advice and disabled the VT-d feature, but I'm pretty sure I'm not affected at all (many days of uptime).

 

Can you explain how you came to the conclusion that the issue is caused by the new Linux kernel, and only affects Broadcom tg3 NICs on systems with VT-d enabled?

Link to comment

Looking into the syslog I found many of these. What do they mean?

 

May 22 07:58:52 Tower kernel: ACPI BIOS Error (bug): Could not resolve symbol [\_SB.PR00._CPC], AE_NOT_FOUND (20210730/psargs-330)
May 22 07:58:52 Tower kernel: ACPI Error: Aborting method \_SB.PR01._CPC due to previous error (AE_NOT_FOUND) (20210730/psparse-529)

 

Thanks in advance.

 

tower-diagnostics-20220531-0753.zip

Edited by hawihoney
Link to comment
53 minutes ago, hawihoney said:

Looking into the syslog I found many of these. What do they mean?

It can be one of two things:

1. BIOS bug: the errors come from the ACPI tables, so if the BIOS is buggy you can get these errors (you have BIOS 1.1; the latest version should be 1.2, so if you are confident you could try an update)

2. Kernel bug/outdated kernel: it may be fixed one day with a kernel update

 

_CPC is Continuous Performance Control, related to the CPU: with CPC enabled, the system will use the amd-pstate driver for AMD CPUs; otherwise it falls back to legacy ACPI P-states.

On Intel it enables features like Speed Shift (P-states) without consulting ACPI.

 

Having these errors should not compromise your experience at all.
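
If you're curious which performance-control path your system actually ended up on, a small sketch (standard sysfs paths; the driver name varies by CPU and kernel):

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver    # e.g. intel_pstate, amd-pstate or acpi-cpufreq (legacy P-states)
grep -c '_CPC' /var/log/syslog    # how often the BIOS bug is being logged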

Edited by ghost82
  • Thanks 1
Link to comment
10 hours ago, quack75 said:

Can you explain how you came to the conclusion that the issue is caused by the new Linux kernel, and only affects Broadcom tg3 NICs on systems with VT-d enabled?

 

My supposition was based on the cases found so far: all had VT-d enabled, and the only common hardware, besides a lot of Intel devices (which seemed unlikely to me as the source of the problem, or there would be a lot more affected servers), was the NICs. I can't say why your server is not affected, but @Thorsten just found what looks like the exact same issue reported in Ubuntu:

 

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1970453

 

This should hopefully help get to the bottom of the problem.

Link to comment
12 hours ago, quack75 said:

Hi JorgeB, I do have an HP MicroServer Gen8 with an E3-1265L V2 processor and VT-d enabled, and have no such stability issues nor the errors you describe in the syslog with the 6.10 release.

 

Just to be sure, I followed your advice and disabled the VT-d feature, but I'm pretty sure I'm not affected at all (many days of uptime).

 

Can you explain how you came to the conclusion that the issue is caused by the new Linux kernel, and only affects Broadcom tg3 NICs on systems with VT-d enabled?

Just to add to this - and I don't want to jinx myself - but I have run a scrub and a parity check and both came back fine (54TB array... phew!). I am running a Gen8, Xeon CPU E3-1240 V2, and VT-d was enabled.

 

Either way thank you @JorgeB for your help and advice.

Edited by jkpe
  • Like 1
Link to comment
21 hours ago, quack75 said:

Hi JorgeB, I do have an HP MicroServer Gen8 with an E3-1265L V2 processor and VT-d enabled, and have no such stability issues nor the errors you describe in the syslog with the 6.10 release.

 

9 hours ago, jkpe said:

Just to add to this - and I don't want to jinx myself - but I have run a scrub and a parity check and both came back fine (54TB array... phew!). I am running a Gen8, Xeon CPU E3-1240 V2, and VT-d was enabled.

 

Both @quack75 and @jkpe, do you mind confirming if you were using the onboard NIC, and not some add-on model?

 

Also please post the BIOS you have, e.g.:

HP ProLiant MicroServer Gen8, BIOS J06 04/04/2019

You can see it in the first few lines of the syslog.
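
For example (a sketch, assuming the syslog captured the kernel boot messages; the DMI line carries the model and BIOS info):

grep -m1 'DMI:' /var/log/syslog    # or, on a running system: dmesg | grep -m1 'DMI:'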

Link to comment
3 hours ago, JorgeB said:

Both @quack75 and @jkpe, do you mind confirming if you were using the onboard NIC, and not some add-on model?

 

Also please post the BIOS you have, e.g.:

HP ProLiant MicroServer Gen8, BIOS J06 04/04/2019

You can see it in the first few lines of the syslog.

Onboard NIC, yes; both ports are bonded. Onboard HBA, no; I'm using an LSI card for all drives.

BIOS:

HP ProLiant MicroServer Gen8, BIOS J06 04/04/2019

 

  • Like 1
Link to comment
16 hours ago, JorgeB said:

Both @quack75 and @jkpe, do you mind confirming if you were using the onboard NIC, and not some add-on model?

 

Also please post the BIOS you have, e.g.:

HP ProLiant MicroServer Gen8, BIOS J06 04/04/2019

You can see it in the first few lines of the syslog.

My BIOS is:

HP ProLiant MicroServer Gen8, BIOS J06 11/02/2015

 

I use the onboard NIC too, but I'm not bonding the two ports. One is the dedicated iLO port and the other is dedicated to Unraid.

For the HBA I'm using an add-on IBM M1015 card (aka LSI SAS9220-8i) for all drives, except for one SSD (the cache drive) which is plugged into the CD-ROM port.

Edited by quack75
  • Like 1
Link to comment

Thank you both. Unfortunately it doesn't make things any clearer; I'm trying to focus mostly on the MicroServer Gen8 since it's by far the most affected model, but I can't find anything in common to explain why some are not affected:

 

- jkpe's BIOS matches a lot of the affected models; quack75's has a different date, but since the other matches it doesn't really matter.

 

- some of the affected users have a bare stock server, without any extra cards/controllers; others have an extra NIC, controller, NVMe device, etc.

 

- affected servers all have Xeon CPUs, which are required for VT-d on these servers, and both Sandy Bridge and Ivy Bridge can be affected.

Link to comment

As the issue seems to be related to shared memory, could the type and amount of RAM in the server play some role?

I have 16GB of ECC memory in my G8.

 

Another question: could the bug be triggered by some specific usage in Unraid?

I'm using the XFS filesystem on all my drives except the cache, which uses btrfs.

And I'm using Docker containers but no VMs at all (and thus no passthrough of devices).

Link to comment
45 minutes ago, quack75 said:

I have 16GB of ECC memory in my G8.

Most of the ones I saw have the same.

 

45 minutes ago, quack75 said:

Another question: could the bug be triggered by some specific usage in Unraid?

Unlikely, especially if the exact same issue occurs with Ubuntu and ZFS.

 

Passthrough also appears not to be a factor: most of the affected users had VT-d enabled because it is by default, but some didn't even have any VMs.
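
For anyone who wants to check whether VT-d / the IOMMU is actually active on their server, a quick sketch (the exact messages vary by platform and kernel):

dmesg | grep -iE 'dmar|iommu'    # the kernel reports DMAR tables and IOMMU initialization here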

 

Link to comment

Well I did a parity check yesterday and for the first time in many years it found 23 parity errors...

It may be unrelated to the tg3 bug, but that's very unlikely...

 

After the parity was repaired, I ran an XFS check on the six disks of my array. I'm not sure about the result because the output of xfs_repair is unclear to me. All I can say is that no lost+found directory was created on the drives, so I hope my data is fine and I suffered no loss...
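
If it helps, a small sketch of how to double-check (the disk number is an example):

xfs_repair -n /dev/md1    # no-modify check, run with the array started in maintenance mode
ls -d /mnt/disk*/lost+found    # with the array started normally; any hit means files were orphaned at some point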

Link to comment
5 minutes ago, quack75 said:

Well I did a parity check yesterday and for the first time in many years it found 23 parity errors...

 

5 minutes ago, quack75 said:

All I can say is that no lost+found directory was created on the drives, so I hope my data is fine and I suffered no loss...

 

The finding in the first quote is what put you in the position described in the second quote. You are fortunate that you did not have a failure which required rebuilding a disk using that parity data. (Most of the time, it is the parity data that has the error, not a data error on the disks...)

 

If there are other folks out there operating their servers in this type of environment, please take a few minutes to set up a monthly parity check and the Notification app ( Settings >>> Notification settings ). Having a daily notification arrive saying your server is healthy only requires you to delete the notice, until the day comes when the notification informs you of a problem. Addressing it then will often be a minor task. Getting a 'notification' of problem(s) via the route of the server being unavailable/unreachable often means you have lost data.

Link to comment
1 hour ago, quack75 said:

Well I did a parity check yesterday and for the first time in many years it found 23 parity errors...

It may be unrelated to the tg3 bug, but that's very unlikely...

 

With a MicroServer Gen8 used with VT-d enabled it's most likely related, unless there was an unclean shutdown. You thought you weren't affected but possibly were, though very lightly; it's unclear for now why some users see severe issues after a couple of minutes of use while others see only minor or even no issues after a few days. As long as VT-d is now disabled, you shouldn't see any more issues.

 

1 hour ago, quack75 said:

It may be unrelated to the tg3 bug, but that's very unlikely...

Note that it's unclear for now if the NIC is the source of the problem; it has been found in all affected models so far, but it could just be a coincidence. In fact, the last case I've found might indicate that the NIC is not the problem. The only thing I know for sure for now is that there's an issue with the kernel and VT-d, and it mostly affects HP servers (not, for example, HP workstations, at least not so far), but there's also one Lenovo server, so it's not just HP.

Link to comment
