Duniac Posted January 30, 2020 (edited)
Hi All,
I have been using Unraid for a couple of months now and have 1 parity drive with 5 drives in the storage array (I would like to configure a second parity disk soon). This week disk 3 became disabled with errors in the logs (blk_update_request: i/o error). I researched the error and tried connecting the disk to another channel on the HBA controller, with no change. After changing the channel, I removed the disk from the array and ran pre-clear; the error appeared during the pre-read. I then ran pre-clear two more times and it completed successfully both times. At this point I hoped the problem was fixed, so I added the disk back to the array and the parity rebuild started, but it then failed with the same error.
I now have two notifications appearing:
Unraid array errors: 30-01-2020 16:55
Warning [UNRAID-MEDIA] - array has errors
Array has 3 disks with read errors
---------------------------
Unraid array errors: 30-01-2020 17:01
Warning [UNRAID-MEDIA] - array has errors
Array has 4 disks with read errors
Can someone please point me in the right direction? Is it possible the HBA controller is failing, or is there a problem with Unraid? Diagnostics attached. Thanks in advance.
unraid-media-diagnostics-20200130-1053.zip
JorgeB Posted January 30, 2020
Log is completely spammed with these errors from the HBA:
Jan 30 04:40:06 Unraid-Media kernel: mpt3sas 0000:04:00.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jan 30 04:40:06 Unraid-Media kernel: mpt3sas 0000:04:00.0: AER: device [1000:0087] error status/mask=00000001/00002000
Jan 30 04:40:06 Unraid-Media kernel: mpt3sas 0000:04:00.0: AER: [ 0] RxErr
Jan 30 04:40:17 Unraid-Media kernel: pcieport 0000:00:03.0: AER: Corrected error received: 0000:00:00.0
While technically this is not a problem, if nothing else it spams the log. Look for a BIOS update or try a different slot for the HBA to see if these errors stop. It's also a good idea to update the LSI to the latest firmware, since it's on a very old one; current is p20.00.07.00.
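For anyone who wants to see past the AER noise in a syslog like the one quoted above, here is a minimal Python sketch. It was not posted in this thread and assumes only the line shape visible in the excerpt (a driver name, a PCI address, then `AER:`). The function name `split_aer` and the last sample line are made up for illustration.

```python
import re
from collections import Counter

# Matches e.g. "kernel: mpt3sas 0000:04:00.0: AER:"; shape taken from the excerpt above.
AER_LINE = re.compile(
    r"kernel: (\S+) ([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]): AER:"
)

def split_aer(lines):
    """Count AER messages per (driver, PCI address) and collect all other lines."""
    per_device = Counter()
    other = []
    for line in lines:
        m = AER_LINE.search(line)
        if m:
            per_device[(m.group(1), m.group(2))] += 1
        else:
            other.append(line)
    return per_device, other

# The first four lines are the ones quoted in the post; the last one is a
# hypothetical placeholder standing in for any non-AER message.
SAMPLE = """\
Jan 30 04:40:06 Unraid-Media kernel: mpt3sas 0000:04:00.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jan 30 04:40:06 Unraid-Media kernel: mpt3sas 0000:04:00.0: AER: device [1000:0087] error status/mask=00000001/00002000
Jan 30 04:40:06 Unraid-Media kernel: mpt3sas 0000:04:00.0: AER: [ 0] RxErr
Jan 30 04:40:17 Unraid-Media kernel: pcieport 0000:00:03.0: AER: Corrected error received: 0000:00:00.0
Jan 30 04:40:18 Unraid-Media kernel: placeholder for some other message
""".splitlines()

per_device, other = split_aer(SAMPLE)
for (driver, address), count in per_device.items():
    print(driver, address, count)
print(len(other), "non-AER line(s) left to read")
```

To use it on a real log, pass the lines of the syslog file from the diagnostics zip instead of `SAMPLE`; the lines returned in `other` are the ones worth reading for actual disk errors.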
JorgeB Posted January 30, 2020
Most likely the actual problem: there's a known issue with some 10TB Seagates and LSI controllers, and Seagate released a firmware update to fix it:
https://apps1.seagate.com/downloads/certificate.html?key=1237891795995
Duniac Posted January 30, 2020 Author
Thanks for the suggestions. I will look at updating the firmware on the HBA controller and the HDD; hopefully this will correct the problem. In the meantime I have suspended all operations and turned off all containers, as I don't want any operations occurring while I have two bad disks.
Duniac Posted February 2, 2020 Author
Update: followed the instructions at https://wiki.unraid.net/UnRAID_Manual_-_FAQ
The drive was identified correctly and added to the array, and the parity rebuild began. However, the error re-appeared; see new diagnostics. I ran a SMART extended self-test, but the following message appeared: Interrupted (host reset). I also noticed that all of my docker containers have gone missing. It also seems that this drive is not being emulated at all, as I am missing a lot of content. I will be purchasing a new drive in about one week, but in the meantime I'll be running with no protection, and it seems that Unraid is totally shitting itself now! Any assistance anyone can provide will be much appreciated.
unraid-media-diagnostics-20200202-0741.zip
JorgeB Posted February 2, 2020
Multiple issues:
Feb 2 17:12:23 Unraid-Media kernel: XFS (md3): Metadata CRC error detected at xfs_dir3_block_read_verify+0x7c/0xc5 [xfs], xfs_dir3_block block 0x100000038
Feb 2 17:12:23 Unraid-Media kernel: XFS (md3): Unmount and run xfs_repair
Disk 3 had a corrupt file system.
Feb 2 17:59:08 Unraid-Media kernel: md: disk1 read error, sector=952405472
Feb 2 17:59:08 Unraid-Media kernel: md: disk1 read error, sector=952405480
Feb 2 17:59:08 Unraid-Media kernel: md: disk1 read error, sector=952405488
Feb 2 17:59:08 Unraid-Media kernel: md: recovery thread: multiple disk errors, sector=952405168
Feb 2 17:59:08 Unraid-Media kernel: md: disk1 read error, sector=952405496
Feb 2 17:59:08 Unraid-Media kernel: md: disk1 read error, sector=952405504
There were read errors on disk1 during the disk3 rebuild, so the rebuild will be corrupt. If the rebuild didn't finish, you can try again after fixing the disk1 issues and fixing the file system on disk3.
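To get a quick picture of which disks reported read errors during a rebuild, and over what sector range, the md lines in the syslog can be tallied. The following is a rough Python sketch, not a tool from this thread; it assumes the `md: diskN read error, sector=...` format shown in the excerpt above, and `tally_read_errors` is a hypothetical helper name.

```python
import re
from collections import Counter

# Per-disk read-error lines from the md driver, as they appear in the excerpt above.
MD_READ_ERROR = re.compile(r"md: (disk\d+) read error, sector=(\d+)")

def tally_read_errors(lines):
    """Return (error count per disk, {disk: (lowest sector, highest sector)})."""
    counts = Counter()
    ranges = {}
    for line in lines:
        m = MD_READ_ERROR.search(line)
        if not m:
            continue  # e.g. the "recovery thread: multiple disk errors" line
        disk, sector = m.group(1), int(m.group(2))
        counts[disk] += 1
        low, high = ranges.get(disk, (sector, sector))
        ranges[disk] = (min(low, sector), max(high, sector))
    return counts, ranges

# Sample data copied from the log lines quoted in the post.
SAMPLE = """\
Feb 2 17:59:08 Unraid-Media kernel: md: disk1 read error, sector=952405472
Feb 2 17:59:08 Unraid-Media kernel: md: disk1 read error, sector=952405480
Feb 2 17:59:08 Unraid-Media kernel: md: disk1 read error, sector=952405488
Feb 2 17:59:08 Unraid-Media kernel: md: recovery thread: multiple disk errors, sector=952405168
Feb 2 17:59:08 Unraid-Media kernel: md: disk1 read error, sector=952405496
Feb 2 17:59:08 Unraid-Media kernel: md: disk1 read error, sector=952405504
""".splitlines()

counts, ranges = tally_read_errors(SAMPLE)
for disk in sorted(counts):
    low, high = ranges[disk]
    print(f"{disk}: {counts[disk]} read errors, sectors {low} to {high}")
```

A tight cluster of sectors on one disk, as in this sample, is only a summary of what the log says; it does not by itself tell you whether the disk, the cabling, or the controller is at fault.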
Duniac Posted February 2, 2020 Author
12 hours ago, johnnie.black said: "There were read errors on disk1 during disk3 rebuild, so rebuild will be corrupt, if the rebuild didn't finish you can try again after fixing the disk1 issues, and fixing the file system on disk3."
How should I approach fixing the read errors that are occurring?
JorgeB Posted February 3, 2020
Disk looks healthy, but there are known issues with those models and LSI. Seagate released a firmware update for that; see if it helps.
https://apps1.seagate.com/downloads/certificate.html?key=1237891795995
Duniac Posted February 3, 2020 Author
12 hours ago, johnnie.black said: "Disk looks healthy, but there are known issues with those models and LSI, Seagate released a firmware update for that, see if that helps. https://apps1.seagate.com/downloads/certificate.html?key=1237891795995"
It does seem odd, given that I had been operating without problems for about six months. Should I look at replacing the controller, or will the firmware update be sufficient?
I will also be performing the following:
- Purchasing another drive to ensure dual parity
- Running diagnostics on the disk causing the problems
- Updating firmware on all disks
- Extracting the controller card and updating its firmware
- Reinstalling the controller card and disk
I will then preclear the new disk and the disk causing the problems, insert the new disk into the array in the position of the disk causing the problems, and make sure parity has synced correctly. After that I will add the disk which caused the problems as parity.
JorgeB Posted February 4, 2020
9 hours ago, Duniac said: "will the firmware update be sufficient?"
Should be; it was released specifically to fix that problem (possibly others also). It's also worth checking all connections. The first thing you want to do is rebuild disk3 (assuming the rebuild never completed), since the old rebuild was going to be mostly corrupt.
Duniac Posted February 7, 2020 Author
On 2/4/2020 at 6:54 PM, johnnie.black said: "Should be, it was released specifically to fix that problem (possibly others also), also worth checking all connections. First thing you want to do is to rebuild disk3 (assuming it never completed), since old rebuild was going to be mostly corrupt."
Looks like things are going from bad to worse. I've now lost disk 1 (Unmountable: No file system). Backing up data seems to be a lost cause at the moment, as nothing is copying and everything is showing read errors. I do have backups of critical data and hope not to lose everything, but it seems that is what is happening now. I still don't understand why these problems have suddenly appeared, but will try to move forward...
itimpi Posted February 7, 2020
11 minutes ago, Duniac said: "I've now lost disk 1 (Unmountable: No file system)."
As long as the disk has not physically failed this is normally easily rectified. The first thing, though, is to work out why you are getting errors reported in the first place.
JorgeB Posted February 7, 2020
45 minutes ago, Duniac said: "everything is showing read errors."
This is after the firmware update?
Duniac Posted February 14, 2020 Author
Hi All,
Couple of updates:
1. I have purchased a new drive to try to recover data.
2. I completed a firmware update on the disk which originally caused the problems. When I applied the firmware update, Seagate advised backing up the disk first, as the update may result in lost data.
itimpi: I have run a full scan of the disk which was experiencing the problems (it took about 15 hours), and no errors were found.
JorgeB: The errors I experienced were prior to extracting the disk to update the firmware. I haven't had time in the last week to investigate further and have had the machine turned off.
What is the best approach for trying to recover data? I have lost a lot and desperately need to recover some data which was added to the array and wasn't backed up. I have tried accessing the data from within Unraid by navigating to the disk, but nothing is appearing. Are there any tools which can be used to try to extract the data?
JorgeB Posted February 14, 2020
Please post current diags.