Beginner needs advice. (SMART, setting up dual parity, disabled disks, temp, other weird bugs)


Georg

Hi community

 

I just joined the Unraid community and am trying to set up my newly built server. I tried to find solutions online but often ended up in outdated threads or Unraid wikis. Btw, I am using Unraid 6.7.2 on an ASRock X470 Taichi Ultimate with a Ryzen 7 2700X. It is connected via a used SAS 9300-16i HBA to a Supermicro backplane (BPN-SAS-846A).

 

1. When preclearing my first disks I noticed that it says:

 

1.png.b2a67dcd711e03ddb54d8b07ce938ca6.png

 

Is this normal? I precleared 4 external drives in their cases via USB and two IronWolf drives via the backplane. If I recall correctly, it said this only for the ones connected to the backplane. Now that all of them are connected via the backplane, all of them show the message above.

What are the consequences?

Does it need any action from my side?

 

 

2. While finalizing an extended SMART test, a drive showed a “UDMA CRC error count” error. I have read that this may be due to bad cabling. Is this correct?
Unfortunately, by the time I found out, I could no longer tell in which bay it had happened. So I tested all bays (one preclear per bay, 8 TB) without being able to reproduce the error. I feel it has nothing to do with the disk itself, as it has been precleared 4 times since then without another incident.

Are there other things that can produce that error, or can I do anything other than hope for the best?

 

 

3. System temp plugin: From the beginning I have never been able to get correct temperature values. They are all negative and do not change at all, e.g. the CPU always reads -62.5. I also could not install any driver other than the suggested one, although there are at least 4 types of fans in the system (3 of which are Noctua). (Assuming the driver refers to Noctua.) What am I doing wrong?

 

[Attached image]

 

My final goal is to have one of the fans follow the temperature of the hottest drive once the data issue is fixed. If the HBA card has a temperature reading as well, I would like to tie a fan to that too.

 

4. Additionally, the fan numbers (Fan #) detected by Unraid differ from the headers on the MB and from the BIOS. Can I change that?

 

5. At one point this was shown and I assumed it was just wrong data. After a restart it was gone and never occurred again. I assume this was harmless?
Right now I realize that I might have destroyed helpful information with that restart…

(Btw, it now has 16 GB of RAM.)

3.png.bfcf8895ff26873b5fe1d3b353029631.png

 

6. How do I set up dual parity correctly?
Just starting the array after preclearing and formatting the disks is not the solution. Neither is letting it run a parity check.

4.png.fc8af54dc1cb327bebb69c2434e584f0.png

 

I don’t know where the errors come from and I don’t know why parity 2 is disabled. I think the errors appeared after the second parity check, where I hoped it would build the second parity disk.

As of now, the second parity disk is disabled, as is one array drive (not the one with the SMART error). How do I get these going again, or what is the issue there?

It may be connected to that, but both the working and the disabled drives show this:

5.png.43a989773dd1acdaea762361256c2d09.png

 

Or could it be that the second parity does not allow empty bays between the disks?

 

7. In an earlier attempt there was suddenly data on the drives. What is this?

[Attached image]

 

8. Overall, this is the current status of my array, but I think a lot of it is not how it should be…

It would be very nice if you could give me some hints on where I should look to fix this problem. Btw, disk 4 is the one with the SMART errors.

 

7.png.947b65009faf746599d594039eae5317.png

 

latest SMART reports:
ST8000VN0022-2EL_ZA1FFL42-20191019-0038.txt

ST8000VN0022-2EL_ZA1FFM3K-20191019-0038.txt

WDC_WD80EZAZ-11T_7HK4H4UF-20191019-0037.txt

WDC_WD80EZAZ-11T_7HK8YNSF-20191019-0037.txt

WDC_WD80EZAZ-11T_7HK9A3HF-20191019-0037.txt

WDC_WD80EZAZ-11T_7SK7DBTW-20191019-0037.txt

 

I thought it could help to number the questions, to avoid misunderstandings if you refer to them in your answer.
Thank you very much in advance for your efforts!

 

 

Edited by Georg
Link to comment
14 hours ago, johnnie.black said:

Please post the diagnostics: Tools -> Diagnostics ideally containing a syslog from when the disks got disabled.

There you go:
atlas-diagnostics-20191023-2040.zip

 

Unfortunately, I have restarted the server since then as part of my troubleshooting attempts.
I could set up a "new config" and try to trigger it again. I would be surprised if some of the issues did not come back...
Thanks for investing the time!

Link to comment

There is no data yet; I wanted to have it stable first. So new config it is.
Regarding the cable, I already had that idea and ran an 8 TB preclear in each bay but could not trigger the error again. Do you think the cable is the cause of just the SMART errors (issue 2), or also of some of the others?

 

Should I do the new config and then run preclear in each bay again to test the cables again?


How would I correctly set up dual parity? Is it possible to leave bays empty between disks for later?

Link to comment

Thank you for your help!

 

17 hours ago, johnnie.black said:

UDMA CRC errors are almost always caused by a bad SATA cable, but can also be a bad enclosure, controller or port.

How am I going to find out which cable it is? 
I did a preclear in each bay and could not trigger an error. Do you have a better idea?
 

Btw, I did a new config now and ran the parity check. Everything was fine and looked good until I added a folder:

My only clue is that it might be connected to the trial key which expired today.
8.png.e216a4167d58defbf3cb267dc0a3eee7.png
9.png.96abc865732bca3f73497fad0e068c1a.png

10.thumb.PNG.49d90cb49dd6bf3fd1f220ab1da7eeca.PNG

atlas-diagnostics-20191026-0043.zip

 

Link to comment
5 hours ago, Georg said:

How am I going to find out which cable it is? 

The one connected to the disk that has growing CRC errors.

 

You're having errors in multiple disks at the same time:

 

Oct 26 00:47:43 Atlas kernel: md: disk9 read error, sector=12884901840
Oct 26 00:47:43 Atlas kernel: md: disk10 read error, sector=12884901840
Oct 26 00:47:43 Atlas kernel: md: disk11 read error, sector=12884901840
Oct 26 00:47:43 Atlas kernel: md: disk0 read error, sector=12884901840
Oct 26 00:47:43 Atlas kernel: md: disk29 read error, sector=12884901840

 

This is likely a controller, cable or power problem. Start replacing one thing at a time.
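
To check a longer syslog for the same pattern, here is a minimal Python sketch. It assumes the kernel lines have exactly the format quoted above; the function name `group_read_errors` is made up for illustration. It groups md read errors by timestamp and sector, so that several disks failing at the same moment stand out.

```python
import re
from collections import defaultdict

# Matches lines of the form quoted above, e.g.
# "Oct 26 00:47:43 Atlas kernel: md: disk9 read error, sector=12884901840"
MD_READ_ERROR = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+md:\s+"
    r"disk(?P<slot>\d+) read error, sector=(?P<sector>\d+)"
)


def group_read_errors(lines):
    """Map (timestamp, sector) to the sorted md slot numbers that reported an error."""
    groups = defaultdict(set)
    for line in lines:
        match = MD_READ_ERROR.match(line.strip())
        if match:
            key = (match["ts"], int(match["sector"]))
            groups[key].add(int(match["slot"]))
    return {key: sorted(slots) for key, slots in groups.items()}


# The five lines from the diagnostics quoted in this post.
SAMPLE = """\
Oct 26 00:47:43 Atlas kernel: md: disk9 read error, sector=12884901840
Oct 26 00:47:43 Atlas kernel: md: disk10 read error, sector=12884901840
Oct 26 00:47:43 Atlas kernel: md: disk11 read error, sector=12884901840
Oct 26 00:47:43 Atlas kernel: md: disk0 read error, sector=12884901840
Oct 26 00:47:43 Atlas kernel: md: disk29 read error, sector=12884901840
"""

if __name__ == "__main__":
    for (timestamp, sector), slots in group_read_errors(SAMPLE.splitlines()).items():
        print(f"{timestamp} sector={sector} slots={slots}")
```

One group containing many slots at the same second and sector points at something the disks share rather than at a single disk.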

 

 

Link to comment

Sorry but I don’t see which one has growing CRC errors. If we are talking about disk 8 (7HK8YNSF), it has had a count of 5 since the errors occurred a month ago. In the meantime I have moved disks many times and added, or at least physically touched, several components, and I have never noticed the count going up.

 

The disks mentioned below are connected to different cables (at least if I follow Unraid's numbering). 0 is the parity and is on one cable; 8, 9 and 10 are on another (but only two of them have errors); and 11 is on a third cable. Disk 29, however, I don’t know what this one is. Maybe the Unraid USB stick, which is connected to an internal USB 2.0 port.

1 hour ago, johnnie.black said:

 


Oct 26 00:47:43 Atlas kernel: md: disk9 read error, sector=12884901840
Oct 26 00:47:43 Atlas kernel: md: disk10 read error, sector=12884901840
Oct 26 00:47:43 Atlas kernel: md: disk11 read error, sector=12884901840
Oct 26 00:47:43 Atlas kernel: md: disk0 read error, sector=12884901840
Oct 26 00:47:43 Atlas kernel: md: disk29 read error, sector=12884901840

 

 

The two facts above, plus the fact that the error occurred seconds after I added a share, make me feel that it is not logical to blame the cables.

 

The cables in the system are not the cheapest ones from China (but still a bit cheaper than the Supermicro ones). I know price does not necessarily correspond to quality. It will take some time, but I will try to get a cable from another manufacturer to test with.
Now I hate myself for being cheap…

 

Can cable management also be a factor, in terms of creating interference?

 

Regarding the controller, I would not know where to start. Should I connect a few disks via SATA directly to the MB and see if I can trigger an error?

 

Power: My 850W power supply is definitely sufficient, with no big consumers in the system yet. The HBA should draw 27W (nominal) according to the manufacturer. It has a 6-pin power connector, but I left it unconnected because I concluded that the MB would provide this power via PCIe. Was my assumption wrong, or should I connect it to the power supply directly for troubleshooting?

Link to comment
26 minutes ago, Georg said:

Sorry but I don’t see which one has growing CRC errors.

You were the one who mentioned getting CRC errors. The CRC attribute doesn't reset, so if it's not increasing there's no problem; they are just old errors.
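
One way to check whether the count is growing is to compare attribute 199 in two SMART reports saved at different times. Below is a rough Python sketch, assuming the reports contain the usual ten-column attribute table printed by `smartctl -A`. The function names are hypothetical, and the sample rows are illustrative rather than copied from the attached reports; only the count of 5 comes from this thread.

```python
def crc_error_count(report_text):
    """Return the raw value of SMART attribute 199, or None if it is absent.

    Assumes each attribute row has ten whitespace-separated columns:
    ID, name, flag, value, worst, thresh, type, updated, when_failed, raw.
    """
    for line in report_text.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0] == "199":
            return int(fields[9])
    return None


def crc_is_growing(older_report, newer_report):
    """True if the CRC count in the newer report exceeds the older one."""
    old = crc_error_count(older_report)
    new = crc_error_count(newer_report)
    if old is None or new is None:
        raise ValueError("attribute 199 not found in one of the reports")
    return new > old


# Illustrative rows only; real reports contain many more attributes.
OLDER = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       5
"""
NEWER_SAME = OLDER
NEWER_WORSE = OLDER.replace("       5\n", "       7\n")

if __name__ == "__main__":
    print(crc_is_growing(OLDER, NEWER_SAME))
    print(crc_is_growing(OLDER, NEWER_WORSE))
```

If the count is unchanged between reports taken before and after heavy disk activity, the errors are old ones.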

 

26 minutes ago, Georg said:

Disk 29 however I don’t know what this one is

Disk29 is parity2
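
As a small illustration of how the slot numbers in those syslog lines map to array devices, here is a Python sketch based only on what is stated in this thread (disk0 is parity, disk29 is parity2). The helper name `md_slot_name` is made up, and treating every other slot as a data disk is an assumption.

```python
def md_slot_name(slot):
    """Translate an md slot number from the syslog into an array device name."""
    if slot == 0:
        return "parity"
    if slot == 29:
        return "parity2"
    # Assumption: every other slot number is a data disk with the same number.
    return f"disk{slot}"


if __name__ == "__main__":
    # Slots taken from the read errors quoted in this thread.
    for slot in (9, 10, 11, 0, 29):
        print(slot, "->", md_slot_name(slot))
```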

 

26 minutes ago, Georg said:

The two facts above, plus the fact that the error occurred seconds after I added a share, make me feel that it is not logical to blame the cables.

If they are on separate cables it's likely not a SATA cable problem; most likely it's the controller or power supply. It's a hardware issue and won't have anything to do with a share being created. Don't forget you already had disk errors before, when parity2 and disk1 got disabled. That is likely related to the same issue, though since you don't have the syslog from that time I can't say for sure.

 

 

 

Edited by johnnie.black