bonienl Posted May 12, 2019 (edited)

1 hour ago, barrygordon said:
I am surprised that there is no need to assign the new drive into the array

You do need to assign the new disk. Unraid will see the old disk as missing/wrong, and once the new disk is assigned, it understands that it is a replacement.
bonienl Posted May 12, 2019

1 hour ago, barrygordon said:
Because of the errors being reported on a parity check I am not sure what the best approach would be

On the Main page you can click on the "disk icon" of each drive. This opens a window with information specific to that drive. Do you see any errors in there (perhaps post a screenshot)?
Frank1940 Posted May 12, 2019

1 hour ago, barrygordon said:
If parity is valid then why will a parity check run again a short time later report the same scale of error (4000+).

Are you doing the Parity Check with this box (the 'Write corrections to parity' box) checked? This box is on the Main tab/page. If you are, what SATA expansion card are you using?

Also, just below the button that starts the Parity Check, there is another box named 'History' which gives the history of the previous Parity Checks. That will provide you (and us) with the actual results of those earlier parity checks. The counts should be stable. If they are not, we need to figure out why.
trurl Posted May 12, 2019

5 hours ago, barrygordon said:
It always comes back with parity valid and finding between 4000 and 4200 errors. I understand why it believes parity is valid if the corrections were written to the parity drive. If parity is valid then why will a parity check run again a short time later report the same scale of error (4000+). It is as if parity was not being corrected on the parity drive or some other disk(s) is failing constantly.

If you run a correcting parity check that has more than zero sync errors, you should run an additional non-correcting parity check. If that parity check still has more than zero sync errors, then you have a hardware problem that needs to be addressed.

Have you done a memtest lately? Bad RAM is a likely cause of recurring sync errors.

Post diagnostics, or at least a syslog, that contains the period where sync errors are found. And another syslog for a subsequent check if sync errors recur. Comparing those syslogs might give some clue to what is going on.
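The syslog comparison suggested above could be sketched roughly as follows. This is only an illustration, assuming sync errors are logged as lines of the form `md: recovery thread: P incorrect, sector=N` (or `P corrected, sector=N`), as in the syslog excerpts posted in this thread. The function names are hypothetical, and because the kernel stops logging after a number of entries, the comparison only covers the sectors that were actually logged.

```python
import re

# Matches the sync-error lines seen in this thread's syslog, e.g.
# "md: recovery thread: P incorrect, sector=17912" (or "P corrected").
SECTOR_RE = re.compile(
    r"recovery thread: P (?:incorrect|corrected), sector=(\d+)"
)


def sync_error_sectors(syslog_text):
    """Return the set of sector numbers logged as parity sync errors."""
    return {int(m.group(1)) for m in SECTOR_RE.finditer(syslog_text)}


def compare_checks(first_syslog, second_syslog):
    """Split logged sectors into those seen in both checks or in only one."""
    first = sync_error_sectors(first_syslog)
    second = sync_error_sectors(second_syslog)
    return {
        "both": sorted(first & second),
        "first_only": sorted(first - second),
        "second_only": sorted(second - first),
    }


if __name__ == "__main__":
    # Hypothetical file names: save the syslog covering each check separately.
    with open("syslog-check1.txt") as f1, open("syslog-check2.txt") as f2:
        result = compare_checks(f1.read(), f2.read())
    for key, sectors in result.items():
        print(f"{key}: {len(sectors)} logged sectors")
```

Posting the three counts alongside the two syslogs gives the thread something concrete to look at when the same checks are repeated.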
trurl Posted May 12, 2019

3 hours ago, Frank1940 said:
there is another box named 'History' which gives the history of the previous Parity Checks. That will provide you (and us) with the actual results of those earlier parity checks. The counts should be stable.

Not only should the counts be stable, they should be stable at exactly zero.
Frank1940 Posted May 12, 2019

23 hours ago, barrygordon said:
My only concern is that I ran a parity check which claimed there were many errors. It took approximately 10 hours and reported approximately 4000 errors. I do not remember it ever reporting errors before. I tried to read some data and it all read fine (actually just played some movies). I have stuff in the drives I no longer need so I am thinking about just deleting that stuff and rerunning a parity check. When I run the parity check should I "Write corrections to the parity drive"? Comments, advice greatly appreciated.
tower-syslog-20190511-1144.zip

11 minutes ago, trurl said:
Post diagnostics, or at least a syslog, that contains the period where sync errors are found. And another syslog for a subsequent check if sync errors recur.

A page back, the OP did post a syslog, which contains this section of errors (which I snipped for clarity):

May 10 13:28:37 Tower kernel: md: using 1536k window, over a total of 2930266532 blocks.
May 10 13:28:37 Tower kernel: md: recovery thread: P incorrect, sector=0
May 10 13:28:37 Tower kernel: md: recovery thread: P incorrect, sector=8
May 10 13:28:37 Tower kernel: md: recovery thread: P incorrect, sector=16
May 10 13:28:37 Tower kernel: md: recovery thread: P incorrect, sector=24
May 10 13:28:37 Tower kernel: md: recovery thread: P incorrect, sector=32
May 10 13:28:37 Tower kernel: md: recovery thread: P incorrect, sector=96
May 10 13:28:37 Tower kernel: md: recovery thread: P incorrect, sector=128
May 10 13:28:37 Tower kernel: md: recovery thread: P incorrect, sector=6888
May 10 13:28:37 Tower kernel: md: recovery thread: P incorrect, sector=6896
May 10 13:28:37 Tower kernel: md: recovery thread: P incorrect, sector=6904
<<<<< SNIP >>>>>
May 10 13:28:38 Tower kernel: md: recovery thread: P incorrect, sector=17888
May 10 13:28:38 Tower kernel: md: recovery thread: P incorrect, sector=17896
May 10 13:28:38 Tower kernel: md: recovery thread: P incorrect, sector=17904
May 10 13:28:38 Tower kernel: md: recovery thread: P incorrect, sector=17912
May 10 13:28:38 Tower kernel: md: recovery thread: P incorrect, sector=17920
May 10 13:28:38 Tower kernel: md: recovery thread: P incorrect, sector=17928
May 10 13:28:38 Tower kernel: md: recovery thread: stopped logging

One quick thing that I observed is that there is a pattern to these errors. Notice that they are separated by eight sectors. The entire section that was logged consisted of three blocks that have this pattern.
barrygordon Posted May 12, 2019

bonienl:
1- I would suggest that the instructions for replacing a failed drive or installing a larger drive should explicitly state the need to re-assign the slot with the new drive
2- I have attached a set of screen snapshots as requested

Frank1940(1):
1- Yes the write corrections to parity is checked
2- I have attached a listing of the screen from Tools>System Devices. The actual board that controls the drives is:
SAS/SATA HBA: SuperMicro AOC SASLP MV8
SAS-SAS Cables: 3 Ware SFF-8087 to SFF-8087
3- I have attached a copy of the "History" as requested

trurl:
1- I will run a non-correcting parity check and a RAM test later today. I had run a memory test earlier and it showed no memory errors.

Frank1940(2): I am not sure what the significance of the error spacing you have pointed out is. Is there any way of ascertaining which drive it is pertaining to?

Attachments: Tools_System Devices.txt, Parity History.txt, tower-syslog-20190512-1631.zip, unRaid screen snapshots.zip
Frank1940 Posted May 12, 2019

24 minutes ago, barrygordon said:
Frank1940(2): I am not sure what the significance of the error spacing you have pointed out is. Is there any way of ascertaining which drive it is pertaining to?

In the back of my mind, I seem to recall that this pattern of failures (or a very similar one) has shown up before. However, I don't remember how the problem was resolved, but it also seems to me that it was not the fault of the data on the disks but rather of the hardware. I was hoping that someone else might come across this thread, but that does not seem likely given how long it is. I am going to 'bump' @johnnie.black and see if he has any thoughts.

Johnnie, have a look at this post for what I am bumping you about:

5 hours ago, Frank1940 said:
A page back, the OP did post a syslog, which contains this section of errors (which I snipped for clarity):

May 10 13:28:37 Tower kernel: md: using 1536k window, over a total of 2930266532 blocks.
May 10 13:28:37 Tower kernel: md: recovery thread: P incorrect, sector=0
May 10 13:28:37 Tower kernel: md: recovery thread: P incorrect, sector=8
May 10 13:28:37 Tower kernel: md: recovery thread: P incorrect, sector=16
May 10 13:28:37 Tower kernel: md: recovery thread: P incorrect, sector=24
May 10 13:28:37 Tower kernel: md: recovery thread: P incorrect, sector=32
May 10 13:28:37 Tower kernel: md: recovery thread: P incorrect, sector=96
May 10 13:28:37 Tower kernel: md: recovery thread: P incorrect, sector=128
May 10 13:28:37 Tower kernel: md: recovery thread: P incorrect, sector=6888
May 10 13:28:37 Tower kernel: md: recovery thread: P incorrect, sector=6896
May 10 13:28:37 Tower kernel: md: recovery thread: P incorrect, sector=6904
<<<<< SNIP >>>>>
May 10 13:28:38 Tower kernel: md: recovery thread: P incorrect, sector=17888
May 10 13:28:38 Tower kernel: md: recovery thread: P incorrect, sector=17896
May 10 13:28:38 Tower kernel: md: recovery thread: P incorrect, sector=17904
May 10 13:28:38 Tower kernel: md: recovery thread: P incorrect, sector=17912
May 10 13:28:38 Tower kernel: md: recovery thread: P incorrect, sector=17920
May 10 13:28:38 Tower kernel: md: recovery thread: P incorrect, sector=17928
May 10 13:28:38 Tower kernel: md: recovery thread: stopped logging

One quick thing that I observed is that there is a pattern to these errors. Notice that they are separated by eight sectors. The entire section that was logged consisted of three blocks that have this pattern.
barrygordon Posted May 12, 2019

I have just started a memory test. When it is done I will report any errors, reboot/restart the array, and run a parity check WITHOUT writing corrections to the array.

Could a loose cable be causing the types of problems being seen? Right now it is a bitch to take the case out of the rack as it is quite heavy and I am restricted in what I can lift; however, I can solve that problem with a little carpentry.

I was wondering: The system has an internal hard drive which it does not use but can be booted from. I believe it has Windows XP on it, as it was built in 2009-2010. The drive is seen by unRaid as not part of the array. It is a Samsung 120 gig drive which the system recognizes as an unassigned drive (sdb), so it is obviously connected. It also has a DVD drive which I do not believe is currently connected.

If I boot Win XP will it muck up the hard drives in the hot-swap slots? I can always pull them out.

If I build Win 10 onto the Samsung drive and boot from it will it muck up the hard drives? By muck up, I mean write to them.

Can I make a self-contained version of Linux on a flash drive, boot it, and have it come up without affecting any of the array drives? Is there a chkdisk program or its equivalent that I could then use to check each of the array disks in a read-only mode? If that is all possible, since I have a spare spindle, can I make a copy of an array drive to a fresh drive using that system?

Is there a plugin that does this sort of thing?

If the SAS/SATA controller is bad I can get a new one for not too outrageous a price, but how do I determine if it is bad?

Advice? Comments?
trurl Posted May 13, 2019

2 hours ago, barrygordon said:
Could a loose cable be causing the types of problems being seen?

These will usually show up as I/O errors in the syslog.

I am not clear on why you would want to boot another OS. Other than the parity checks and some SMART warnings, are you having some other problem? Can you access all your files?
barrygordon Posted May 13, 2019

I am having no other problems. I can access data on each of the data drives. I have a lot of data that I don't need any longer and I plan to delete these files after the parity check has completed. I will delete them from my Win 10 system using the network access. Is there a way to delete files from the unRaid GUI?

With regard to booting another OS I was just curious. I have a very inquisitive mind.

Prior to all of this, I had both shares and disks visible when I looked at the Tower from my WIN system. The Tower was running unRaid 6.1.3. Now to get that view in my WIN system I need to enable both disk shares and user shares. I don't remember what was enabled when I was running 6.1.3; it might have been both. Other than the known bug regarding copying files onto themselves, is there any issue in enabling both? I do not use the command line of the unRaid (Tower) system; in fact, I don't even log in. I just use the graphical interface from my browser.

I am going to bed. Tomorrow is another day.
JorgeB Posted May 13, 2019

On 5/12/2019 at 10:31 PM, Frank1940 said:
May 10 13:28:38 Tower kernel: md: recovery thread: P incorrect, sector=17912

I didn't read the entire thread, but this was a non-correcting check; a correcting check would appear as:

P corrected, sector=17912

On 5/12/2019 at 10:31 PM, Frank1940 said:
Notice that they are separated by eight sectors.

This is normal: parity is checked for every standard 4K Linux block, and each block has 8 sectors (with standard 512e drives), so only the first sector of each block is logged.
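The eight-sector spacing follows from simple arithmetic: a 4096-byte block divided by 512-byte logical sectors gives 8 sectors per block. A small sketch that checks this against the sector numbers copied from the syslog excerpt quoted in this thread (the constant and function names are assumptions made for illustration):

```python
BLOCK_SIZE = 4096   # standard 4K Linux block, in bytes
SECTOR_SIZE = 512   # logical sector size on 512e drives, in bytes

SECTORS_PER_BLOCK = BLOCK_SIZE // SECTOR_SIZE

# Sector numbers copied from the syslog excerpt quoted in this thread.
# The excerpt was snipped between 6904 and 17888, so that gap is not meaningful.
LOGGED_SECTORS = [0, 8, 16, 24, 32, 96, 128, 6888, 6896, 6904,
                  17888, 17896, 17904, 17912, 17920, 17928]


def block_aligned(sectors, per_block=SECTORS_PER_BLOCK):
    """True if every logged sector is the first sector of a 4K block."""
    return all(s % per_block == 0 for s in sectors)


def gaps_in_blocks(sectors, per_block=SECTORS_PER_BLOCK):
    """Distance between consecutive logged sectors, measured in 4K blocks."""
    return [(b - a) // per_block for a, b in zip(sectors, sectors[1:])]


if __name__ == "__main__":
    print("sectors per block:", SECTORS_PER_BLOCK)
    print("all block-aligned:", block_aligned(LOGGED_SECTORS))
    print("gaps (in blocks):", gaps_in_blocks(LOGGED_SECTORS))
```

A gap of 1 means two adjacent 4K blocks both mismatched, which is why consecutive log lines differ by exactly 8.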
Frank1940 Posted May 13, 2019 (edited)

OK, I have done a bit of googling on the SASLP MV8 controller that you are using. (The Google search engine does a much better job of searching this forum than the forum's own search engine. Just put Unraid in as a parameter.) It uses a Marvell chipset, which has had issues with Unraid for quite some time. (Just look at the current 6.7.0 Release Thread!) Basically, it is best practice to replace any Marvell-based controller that exhibits any sort of problem with one made by LSI. See here for models:

https://forums.unraid.net/topic/69018-sata-controller-replacement-question-and-advice/?tab=comments#comment-630097

You can find these on E-bay for less than $100.00. You will find both new and used ones there. The used ones are coming from server farms that are being updated and are always genuine LSI-OEM parts. They do have to be cross-flashed with new firmware. (You always want the LSI card to be in "IT-Mode"!) You can do the cross-flashing yourself or buy a card that is already cross-flashed for $10-15 more.

Now for the new cards that you find on E-bay. There are counterfeit LSI cards out there. You want to avoid them as they have (basically) no warranty. When you buy on E-bay, vet the vendor. I always look for a USA vendor with a good rating who has sold a lot of these cards. (You will get a large number of listings on E-bay with "LSI IT-Mode" as a search parameter.) You can also find LSI cards on Amazon, but again vet the vendor (and be careful in reading the description, as many of these cards are strictly RAID cards).

PS: be sure to run memtest for 24 hours. Note that the built-in memtest will not check ECC memory.
trurl Posted May 13, 2019

9 hours ago, barrygordon said:
With regard to booting another OS I was just curious. I have a very inquisitive mind.

Windows has no native support for Unraid filesystems, so it won't be able to read or write the Unraid disks unless you tell it to format them or otherwise mess with them (repartition, etc.) in Disk Management.
itimpi Posted May 13, 2019

10 hours ago, barrygordon said:
With regard to booting another OS I was just curious. I have a very inquisitive mind.

If you want to try this, then I would suggest booting a Linux 'Live' distribution that can run from a DVD or USB stick without needing installation. Such a system can read Unraid disks without any problems.
barrygordon Posted May 13, 2019

trurl, That was my impression. I just wasn't sure if the mere booting of Windows would try to write to all the drives it could see.

itimpi, That was what I planned to do. As I mentioned, there is an internal drive (sdb) in the case that has the GRUB loader and two OSes, Win XP and Ubuntu. I am not sure what versions. If I were to pull all of the array drives and boot from that internal drive, selecting Ubuntu, I should be able to verify that it is intact and working. I could then do it again with all of the array drives plugged in, and Ubuntu should see them all.

The above is an activity for another day, another time.
barrygordon Posted May 14, 2019

I believe I am almost done dealing with my unRaid Tower, and can once again start treating it as an appliance. First I want to thank all in this community who contributed advice, but most of all for the lowering of my anxiety level.

I have run two successful parity checks (non-correcting) with zero errors. I have removed all sorts of older data which I no longer need. I have spot-checked each of the drives and I can read from them fine. The array has 6.09 TB of free space.

I have ordered the following equipment:
1- A 3 TB hard drive to replace Drive 5, which was causing sector reallocation notifications
2- A 6 TB hard drive to become my new parity drive
3- An LSI 6Gbps SAS HBA LSI 9201-8i

It is my intent to replace Drive 5 first, wait a while, and then replace the parity drive. I will keep the LSI controller, but I am not sure when I will install it. Obviously, if the current Marvell controller fails I will replace it.

My only concern is that the last time I ran a "Correcting" parity check there were over 4000 sync errors, but at the end of the check it stated that parity was valid. A little confusing.

Once again thanks to all, and if there are additional concerns/comments/advice just post them.

Barry Gordon
Frank1940 Posted May 14, 2019

41 minutes ago, barrygordon said:
My only concern is that the last time I ran a "Correcting" parity check There were over 4000 sync errors but at the end of the check, it stated that parity was valid. A little confusing.

What happens when you run a 'correcting parity check' is that any parity errors on the parity drive are overwritten with the results of the parity calculation, 'correcting the error'. As you can easily realize, this 'fixes' all of the parity errors, so there should now be zero parity errors. If you were to do a non-correcting parity check, it should now report zero errors. (By the way, it is a good idea to do so!)
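A toy illustration of why a non-correcting check should report zero errors right after a correcting one. It assumes single parity computed as the XOR of the data disks and models each disk as a short list of integers; this is a simplified sketch for illustration, not Unraid's actual md driver code, and all names are hypothetical.

```python
from functools import reduce
from operator import xor


def expected_parity(data_disks):
    """XOR the values at each position across all data disks."""
    return [reduce(xor, blocks) for blocks in zip(*data_disks)]


def parity_check(data_disks, parity, correct=False):
    """Count positions where stored parity differs from computed parity.

    With correct=True, mismatching parity positions are overwritten in place,
    which is what a 'correcting' check does in this toy model.
    """
    errors = 0
    for i, want in enumerate(expected_parity(data_disks)):
        if parity[i] != want:
            errors += 1
            if correct:
                parity[i] = want
    return errors


if __name__ == "__main__":
    data = [[1, 2, 3], [4, 5, 6]]      # two toy data disks
    parity = expected_parity(data)     # parity starts out in sync
    parity[1] ^= 0xFF                  # simulate one stale parity position

    print("correcting check errors:", parity_check(data, parity, correct=True))
    print("follow-up check errors: ", parity_check(data, parity))
```

In this model the follow-up check can only report errors again if something changes between the two passes, which matches the advice earlier in the thread: recurring sync errors after a correcting check point to a hardware problem rather than to the parity data itself.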
barrygordon Posted May 14, 2019

I have done two non-correcting parity checks: one before I deleted a bunch of old data, and one after. They both showed zero errors. I also have scheduled a non-correcting parity check to run monthly. I believe I am now covered.