RobJ

Everything posted by RobJ

  1. Yeah, renaming the Cores does not seem to work; you have to rename the 'temps'. I haven't read enough to fully understand why. 'MB Temp' and 'CPU Temp' are the 2 labels that Dynamix is looking for, and they say nothing about the true source of the temp numbers. We use sensors.conf to force renaming the first 2 ports from their default labels to what Dynamix is looking for. I don't know what 'Physical id 0' is; you will have to do some research on coretemp. You might try the 'modinfo coretemp' command, which may produce some helpful info. Who knows, 'Physical id 0' might be the max of the 4 core temps! In your example it is, but that could be coincidence.
  2. Try a simple one like this:

         chip "coretemp-*"
             label temp1 "MB Temp"
             label temp2 "CPU Temp"

     If you get anything at all, then you can experiment with changing temp1 and temp2 to other things. Make sure this sensors.conf is put into /etc/sensors.d, then refresh the Dynamix screen. If it works, then make sure it is copied to /boot/config/plugins/dynamix for reinstallation after each boot.
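If you end up trying several temp1/temp2 combinations, it may be easier to generate the file text than to retype it. Here is a minimal Python sketch; the function name `render_sensors_conf` is made up for illustration, and it only builds the text (writing it to /etc/sensors.d and /boot/config/plugins/dynamix is left to you).

```python
def render_sensors_conf(chip, labels):
    """Build sensors.conf text that relabels inputs on one chip.

    chip   -- chip match pattern, e.g. "coretemp-*"
    labels -- mapping of input name (temp1, temp2, ...) to the label to show
    """
    lines = [f'chip "{chip}"']
    for feature, label in labels.items():
        lines.append(f'    label {feature} "{label}"')
    return "\n".join(lines) + "\n"


# The two labels Dynamix looks for, mapped to the first two inputs.
conf_text = render_sensors_conf(
    "coretemp-*",
    {"temp1": "MB Temp", "temp2": "CPU Temp"},
)
print(conf_text)
```

To experiment, change the keys (for example temp3 instead of temp2) and regenerate.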
  3. Good write up! I've added it to the UnRAID Topical Index, Backups topic. You may also want to examine it for previous discussions about backups.
  4. What happens when you try the modprobe it87 command at a command prompt? If it says 'device busy', you might try what pauven did here (near the end of the post). Or try what xamindar did here, forcing a different ID.
  5. Here's a quick and dirty setup for AMD-based boards that use the k8temp driver, which covers most of those with Athlon 64, FX, or Opteron CPUs. It does not show the motherboard temp, only the temps for the first 2 cores, but since the temps on the Dynamix screens aren't labeled, that's OK. It should work for most k8temp-compatible boards, but apparently some of these boards do not reliably show correct temps. Hopefully yours will, as mine does.

     * Install the Dynamix System Temp plugin, and reboot. You should see the 2 temp icons above and to the right, but with blank temps.
     * Copy the attached sensors.conf.txt file to the flash drive, to the path /boot/config/plugins/dynamix. In Windows, this path would be something like \\tower\flash\config\plugins\dynamix. Rename it to sensors.conf (remove the .txt).
     * At a console, copy the same sensors.conf to /etc/sensors.d, and refresh your Dynamix screen. (Or you can reboot again.)
     * If it worked, your Dynamix screen should display the 2 temps. They aren't labeled, but just remember that they are CPU core0 and core1.

     This is partly from memory. If I've missed a step, or something behaved differently than stated above, let me know, and I'll revise this.

     Why would someone do this instead of the longer but more complete Wiki instructions (found here)?

     * To see if it works first, before trying the longer procedure.
     * You don't feel like installing Perl and dealing with the technical questions of sensors-detect.
     * Because, like me, your board requires a driver not installed with UnRAID. My board needs the Fintek driver f71882fg. You can check here for more board and sensor chip driver info.

     Edit: Really sorry, I forgot one of the most important steps. Inserted above.

     sensors.conf.txt
  6. Wiki page was: Setting up lm_sensors to display CPU/motherboard temperatures in add-ons such as SimpleFeatures Renamed to: Setting up CPU and board temperature sensing I've updated the wiki page for Dynamix, added a few Dynamix specific notes, and renamed it. Well technically I moved it to the new name, with a redirect at the old name. So the old name still works, but lands you on the wiki page under the new name. If you wish, bonienl, you could edit your OP to point to the new name. Also, it would be nice if you could check it over for accuracy.
  7. In my experience, by far the most likely cause of all CRC errors is the cables. I'd concentrate most on them (besides it's the cheapest thing to fix!). You say you changed the cables. Were the replacements known to be high quality and good cables, tested on other drives? I strongly recommend never skimping on SATA cables, especially since very good quality ones can be obtained so cheaply at MonoPrice. Try swapping a suspect cable with one known to never cause errors, then see if the errors follow the cable. Make sure the cables are not tightly tied together, especially tied with power cables. They look very neat that way, but are more likely to suffer interference and crosstalk. Check the connectors on both ends, and the connections on the drives and the ports on the card, make sure they are clean, no apparent dust or corrosion. Make sure the power cables are well connected too, especially any power cable splitters. If in doubt, try swapping suspect connections with known good ones, and test. Another less likely possibility is the power supply. Flaky power can cause drive interface issues. I suppose overheating drives could too, but you would have mentioned that.
  8. No good ideas here. Just to eliminate a possibility, can you try re-running with adjusted email options? The -M 4 option runs some code not used by anything else, so try -M 3 and lower, and perhaps no email at all. Are you receiving the emails correctly, both with and without the -c option? Most likely, Joe L will have to help you, when he has time.
  9. Drive couldn't be more perfect (but you probably knew that, right?)
  10. The important number for those is the VALUE, which for both is 100, as in 100% perfect, can't be any more perfect than that. It's not that they are close to the threshold, but that the manufacturer has factory set the thresholds to start so close to 100, who knows why. Your SMART reports for that drive appear to be perfect, nothing to worry about.
  11. It's hard to conclude too much from just a short syslog excerpt. It's best to attach the entire syslog, zipped, plus a SMART report for the drive. There is evidence of a bad sector, plus some other failure, but I'd rather not make any conclusions without seeing the very first error reported, plus the SMART info.
  12. I did some research, and found only 2 possibilities - the drive SMART firmware tried to write to the SMART log but SMART was not enabled at that instant (possibly at drive startup), or there is a bug in the SMART firmware on that drive. Not something to worry about, as it only happened once (the second is a retry of the first), and that was a long time ago.
  13. I was too tired to take much time with it, sorry. This SMART report is intact, thanks.

     The line above shows 81 Current Pending sectors, a very ominous sign, especially when you only have 5 operational hours on the drive. It means it has already found 81 sectors that are probably bad. As near as I can tell, you have started Preclear 3 times, then aborted it quite early, probably because of the errors and how long it was taking. These short passes make this even more ominous: you shouldn't find even one error over the entire drive, and you found 81 in just the first 1 or 2 percent.

     What I based my opinion on was the syslog you attached and the 2 syslog excerpts you posted. They all show a series of errors logged by the exception handler. All of them are noted as 'media error', which means a problem with a physical sector on the drive surface. More specifically, the error flag raised for each of those sectors is 'UNC' (short for 'UNCorrectable'), which means the sector was so corrupted that even the embedded error correction info could not fix it.

     Because these first Preclear passes are just read passes, we CANNOT conclude for sure that the drive is bad until the drive attempts to fix those sectors by rewriting them. At that point, the drive will determine whether the magnetic media under the sector is good or bad, and either return the sector to service, validly rewritten, or remap it elsewhere (as a reallocated sector). The drive MAY be bad, but the SMART report is not showing any mechanical issues so far, so it's possible the magnetic surface is good but has been scrambled somehow??? Not likely, but possible.

     An immediate zeroing pass would probably be a good next step, skipping the Preclear Pre-Read and forcing writes to all sectors. It should rather quickly help you decide if the drive is worth further effort or not. Syntax I believe would be "preclear_disk.sh -W /dev/sdd".
  14. I would run another Preclear or 2 on it, check for additional reallocated sectors. If you can obtain clean results, no further adverse numbers, then it is usable. With that many reallocated sectors, I know that some users would prefer to reserve the drive for secondary uses, such as holding backups.
  15. You have a series of bad sectors on this drive, very early on it too. The SMART report was not very useful, as it was truncated on the right side at 80 columns, cutting off the RAW numbers. Not sure what did that.
  16. Quoting UhClem:

         Not this time ... Look at the fuller "picture" -- first, always be a little suspicious of numbers that are "all ones" (i.e., 0x0fffffff); then look carefully at the preceding commands in the error log for the conclusive clue. --UhClem

     Oops, you are absolutely right. I didn't recognize that number in decimal. Not sure what to make of it though, need a lot more context, more of the code path to here. If you check his last SMART report, the 5 last errors show alternating like the above, then simple reads, then repeat. If I had to guess (and that is all I can do here), I would say there is a firmware issue. The LBA, even if it is 0x0fffffff, appears to be valid, about at the 137GB point. But in this small context, it is probably a mask, and appears to be a part of an internal reset, possibly an internal crash? If my 'guess' is correct, then you cannot trust this drive. UhClem, I'd like to hear your opinion.
  17. Those ARE the bad blocks, in more detail and only the last 5. UNC is short for UNCorrectable, so "Error: UNC at LBA = 0x0fffffff = 268435455" roughly means "bad block at 268435455". You probably got a typical response, from any manufacturing rep. But I would expect SeaTools to provide a similar report.
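For anyone who wants to check the arithmetic on that LBA, here is a small Python sketch. It assumes 512-byte logical sectors (not stated in the thread), and `lba_to_bytes` is just a helper name for illustration.

```python
SECTOR_BYTES = 512  # assumption: traditional 512-byte logical sectors


def lba_to_bytes(lba, sector_bytes=SECTOR_BYTES):
    """Byte offset on the drive at which the given sector starts."""
    return lba * sector_bytes


lba = 0x0FFFFFFF

print(lba)                                  # 268435455
print(lba == 2**28 - 1)                     # True: all 28 bits set
print(f"{lba_to_bytes(lba) / 1e9:.1f} GB")  # 137.4 GB
```

So the decimal number in the error line is the largest value that fits in 28 bits, and under the 512-byte assumption it falls at roughly 137 GB into the drive. An "all ones" value like that is a reason to look at the surrounding commands before treating it as an ordinary bad block.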
  18. My guess is you received a drive with some problems, that had been 'repaired' by clearing the SMART tables, masking the problems, so the first test runs re-exposed the problems. The first 2 Preclears seemed to have dealt with most of them, and the third looks much better, but I'm not confident you've uncovered ALL of the marginal sectors yet. I'd run 2 or 3 more Preclears, and I'd only feel more confident if I had at least 2 passes with NO further changes, no more Current Pending sectors at any phase, no additional Uncorrectables, no additional Reallocated sectors. If interested and have time, you might also try a full badblocks run with the -w option. The other possibility is that it's a bad drive, and it's going to continue getting worse. I suspect that after another Preclear, you will either know it's bad or may decide that you aren't willing to trust the drive, even if it starts behaving, has clean reports.
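The "at least 2 passes with NO further changes" rule can be written down as a small check. This is only a sketch in Python: the attribute names follow the usual smartctl labels, the counts in `snapshots` are invented for illustration, and it compares end-of-pass raw counts only, so it cannot see pending sectors that appeared and cleared in the middle of a pass.

```python
WATCHED = (
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "Reallocated_Sector_Ct",
)


def clean_streak(snapshots, watched=WATCHED):
    """Count the most recent consecutive clean passes.

    snapshots -- one dict of raw SMART counts per finished pass, oldest first.
    A pass is clean if none of the watched counts changed from the previous
    pass and no sectors are left pending.
    """
    streak = 0
    for i in range(len(snapshots) - 1, 0, -1):
        prev, cur = snapshots[i - 1], snapshots[i]
        changed = any(cur[name] != prev[name] for name in watched)
        if changed or cur["Current_Pending_Sector"] != 0:
            break
        streak += 1
    return streak


# Hypothetical raw counts after four Preclear passes (made-up numbers).
snapshots = [
    {"Current_Pending_Sector": 8, "Offline_Uncorrectable": 8, "Reallocated_Sector_Ct": 0},
    {"Current_Pending_Sector": 0, "Offline_Uncorrectable": 0, "Reallocated_Sector_Ct": 8},
    {"Current_Pending_Sector": 0, "Offline_Uncorrectable": 0, "Reallocated_Sector_Ct": 8},
    {"Current_Pending_Sector": 0, "Offline_Uncorrectable": 0, "Reallocated_Sector_Ct": 8},
]

print(clean_streak(snapshots))  # 2
```

A streak of 2 or more is the point where I would start to feel more confident; anything less means another Preclear.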
  19. The drive in first pic is troubling, shows 22 new bad sectors AFTER the Post-Read, not good. Preclear it again. Only if after another one or 2 Preclears it appears to be fine should it be considered safe to use. The drive in second pic looks older with some wear, but may be OK at the moment. We really need to check the SMART reports for both drives. All of the reports including the Preclear reports are in a folder on the flash drive. I don't remember the path, but it has 'Preclear' in the name.
  20. Quoted from the thread:

         no, they look great.
         Probably... only time will tell.
         So I should not be worried about the raw read error rate or spin retry count?
         Spin_Retry_Count = 100 100 97 near_thresh 0
         End-to-End_Error = 100 100 99 near_thresh 0

     I could not improve on Joe's response, as he was absolutely correct. Just a tip, if you are going to question the experts, it would be good to understand the subject matter first? As you can see above, the VALUEs for Spin_Retry_Count and End-to-End_Error are 100, could not be more perfect. The VALUEs for Raw_Read_Error_Rate for 2 of the drives at THIS moment are a little below 100, but are not a problem, they are still a long way from 44.
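To make the VALUE versus threshold point concrete, here is a short Python sketch using the two lines quoted in that post. It assumes the three numbers on each line are the current VALUE, the previous VALUE, and the failure threshold, in that order; the helper names are made up for illustration.

```python
def headroom(value, threshold):
    """How many points the normalized VALUE can drop before reaching the threshold."""
    return value - threshold


def is_failing(value, threshold):
    """SMART flags an attribute as failed when VALUE is at or below the threshold."""
    return value <= threshold


# (attribute, current VALUE, failure threshold) taken from the quoted lines
readings = [
    ("Spin_Retry_Count", 100, 97),
    ("End-to-End_Error", 100, 99),
]

for name, value, threshold in readings:
    print(name, value, headroom(value, threshold), is_failing(value, threshold))
```

Both attributes sit at 100 and neither is failing. The headroom is small (3 and 1) only because the manufacturer set the thresholds that close to 100, not because the drive has degraded.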
  21. Like someone showing us their report card with all A+'s and all test scores equal to 100, you showed us your perfect SMART report indicating zero problems found and all VALUEs of 100 or better, and then you ask if anyone sees anything wrong??? I think most of us didn't know how to respond.
  22. The 3 SMART reports look fine, except that the 2 new NAS drives got rather hot. Their temps rose from 25 to 41 which is quite a jump. I suspect they were installed somewhere without any airflow. Nothing wrong with the drives though. That's only a small fragment of the syslog, so hard to draw any conclusions. Whatever drive is sdf has a lot of bad sectors, and the drive sdd has found a single bad sector. I can't tell which drives those are without the earlier part of the syslog, but you can identify them from the UnRAID webgui.
  23. Greetings and welcome! While the better funded users around here would probably respond that they prefer late model drives with pristine SMART reports, your 4 drives fit the other philosophy, of using UnRAID as a great place, a second life, for drives to retire to. Your drives do show some age and wear, but are still in very good shape, and the Preclears found no new problems. I would use them without hesitation. All but the first show a few remapped sectors, and one shows a higher than normal High_Fly_Writes count of 477, and one shows a lower than normal Seek_Error_Rate of 51, so I would monitor the SMART reports for these drives now and then, perhaps every few months?
  24. If the disk controller was not compatible with 3TB drives, then it would not work at all, that is, you would not see 3TB available. Yours does appear to work fine. I can't see any connection with the motherboard, because we need a suspect that is able to cause corruption and trouble WITHIN the drive, and that is a very short list. The only possibilities I can think of are power issues, heat issues, or a defective drive (bad firmware). Remember that if you discover a relation between a particular program running and the drive crashing, then that could only point to heavy power draw (perhaps pulling the voltage down too low), and that implicates the power supply (not the software or other hardware), as either inadequate or poorly designed. Aerocool makes interesting products (I like some of their cases), but I don't believe they are a good name in power supplies (I could be wrong, I haven't kept up with PSUs recently). I can't help being very suspicious of your PSU.