Everything posted by candre23

  1. Again, am I in any danger if I downgrade to the stable release? Do I just download 4.7 and overwrite what's on my boot drive?
  2. Oh dear god, what just happened? Last night I pulled two WD20EADS drives from my WHS box, stuck them in the unraid server, and started preclearing them. I woke up to discover that the unraid server was completely non-responsive: the web UI was down, and there was nothing on the local monitor. Crap. With no other options, I hit the reset button. It booted, but somehow the server took it upon itself to assign one of the new WD20EADS drives to the slot formerly occupied by the WD10EACS. The server is now stopped, the start button is grayed out, and it claims it is "upgrading disk". First, what's up with the total system crash? Even if Linux shits the bed, there should still be something on the local monitor. Second, why is unraid "upgrading" a disk without my consent? Third, why is it doing it with a drive that was, at best, only half precleared? And fourth, where the hell is my WD10EACS? Here's the log: http://pastebin.com/Wr0L810t
  3. The repair run fixed the same 11 errors that the last several scans found. With no more options for determining if I even have a problem (let alone determining the cause), I deleted the original data on the WHS that I had copied to the unraid server. WHS is now shuffling the remaining data around to remove a couple drives so I can continue with the migration. For better or worse, there's no going back now.
  4. "How many read-only parity checks did you run and get the same results?" The first sync was in repair mode. It found and "fixed" 12 errors. I have done three non-repair passes since then, and all three have found the same 11 errors. These coincide with 11 of the 12 from the first pass. Now I am running a repair-sync to fix the 11 errors that were broken by the very first sync. That mysterious 12th error hasn't popped up in any of the subsequent passes, so I think it was legitimately fixed by the first repair-sync.
  5. Sync complete, and it's the same 11 errors as the last several passes. I'm just going to go ahead and repair them and move on.
  6. I was very careful to go with well-tested hardware. The motherboard is a Biostar A760G M2+, the CPU is a Sempron 145, the SATA card is a SASLP-MV8, and the drives are two WD20EARS and one WD10EACS. There are several more drives (a mix of WD and Seagate) in my WHS box that I will be migrating over, but for now, it's just the 3 WD greens. RAM is a pair of 1GB A-DATA DDR2 800 sticks, and the boot drive is a 2GB Tuff-N-Tiny. Running unraid pro 5.0b4, and the default drive format is set to 4K aligned. Here's the full syslog since the last reboot earlier this evening, when I moved the drives to a different cage: http://pastebin.com/EJZ8XUzQ Currently 43% through the non-repair sync, and it's found 8 errors that match up with the previous syncs.
  7. Neither machine has been improperly shut down since all this started. Both servers are running on a UPS, so I know there haven't been any power issues (I checked the log on my desktop). I have shut the unraid box down a couple of times, but that was always a proper shutdown through the web UI, and never while it was copying or syncing. Actually, I still do have all the original files on the old WHS. I've been too nervous to delete them until I figure out the cause of the problem on the new unraid server. Ran a long SMART test on all three disks, and all came up clean (a small script for kicking these off is sketched after the post list). Ran a non-repair scan twice, both times coming up with the same 11 errors. All 11 errors coincide with errors that the first repair scan "fixed". The original repair scan actually found 12 errors, so I guess the one error that hasn't popped up in subsequent scans was legitimately fixed. This stinks. As far as I can tell, nothing is broken enough to be obvious. I think my next step will be to switch around some of the SATA cables, just to see if that makes a difference. EDIT: Rather than just swap cables, I swapped the trays into another cage that runs through the SASLP-MV8, reassigned the drives to the correct slots, and started yet another non-repair sync. This should rule out the motherboard SATA ports, the cables, and the backplane. If this comes up with the same 11 errors, I'm just going to fix them, move on with the migration, and hope for the best. There's not much else I can do.
  8. I let it run 12 full passes' worth of memtest, and no errors were found. It's currently 36% through a "do not fix" sync, and it has already found 7 errors. These errors are in the exact same spots as the first 7 errors of the last test. Now, the last sync I ran was supposed to "fix" any errors in the parity as it found them. If I'm understanding BJP999, then it actually screwed up the parity in those 12 places the first time, and this scan is finding those screwups (a toy illustration of this is sketched after the post list). It's still not telling me why it screwed up in the first place, though. If this scan finds the exact same 12 errors as the first scan, then I guess I just run it again (without fixing) to be sure. If that scan finds the exact same 12 errors again, then I run yet another scan (with fixing) to fix the original "fixes". Once that's done, I should have good parity. The only problem is that I still won't know why I had errors in the first place. What a pain.
  9. Been running memtest since it was suggested. On the 10th iteration and no errors yet. I only have three drives in the system at the moment, so narrowing it down shouldn't be too complicated. I guess I run memtest for a few more hours, and if there are no errors, I move to long SMART tests for each drive. If that shows nothing, then I just run non-correcting parity sync checks until some sort of pattern emerges and replace the drive that keeps having problems. If the problems are still random and affecting different disks, then I should assume the memory is the problem and replace it even though it tested fine. Do I have that right?
  10. I did that before I started. I only ran through two passes, but no errors were found. Rebooted the server and am running memtest again. I need to get this sorted out ASAP. The next time my WHS shuts down, it will likely not come back up. Is there any way to determine which disk is getting errors? You'd think the log would say which disk the errors were found on, but it only gives block offsets.
  11. First half of the story is here. Precleared the drives (no errors), started from scratch, and moved ~2.7 TB of data from WHS to unraid. Ran a parity check just to be sure all is well before deleting the data from WHS. It came up with 12 sync errors. I don't want to delete anything from the old server until I know it's safe on the new server. Should I be concerned that I keep getting a few sync errors? Is it just normal?
      EDIT: Relevant section of the syslog (a small script for pulling out and comparing these offsets is sketched after the post list):
      Feb 26 23:48:26 Tower kernel: md: recovery thread woken up ...
      Feb 26 23:48:26 Tower kernel: md: recovery thread checking parity...
      Feb 26 23:48:26 Tower kernel: md: using 1152k window, over a total of 1953514552 blocks.
      Feb 27 00:13:52 Tower kernel: md: parity incorrect: 247669256
      Feb 27 00:19:30 Tower kernel: md: parity incorrect: 302163048
      Feb 27 00:27:30 Tower kernel: md: parity incorrect: 378208880
      Feb 27 01:19:57 Tower kernel: md: parity incorrect: 853558232
      Feb 27 01:51:18 Tower kernel: md: parity incorrect: 1109535408
      Feb 27 02:00:59 Tower kernel: md: parity incorrect: 1185806184
      Feb 27 02:30:43 Tower kernel: md: parity incorrect: 1405480344
      Feb 27 02:31:18 Tower kernel: md: parity incorrect: 1409544312
      Feb 27 02:36:35 Tower kernel: md: parity incorrect: 1446585832
      Feb 27 03:12:19 Tower kernel: md: parity incorrect: 1680282472
      Feb 27 04:16:51 Tower kernel: mdcmd (50): spindown 2
      Feb 27 04:40:01 Tower logrotate: ALERT - exited abnormally.
      Feb 27 04:53:57 Tower kernel: md: parity incorrect: 2527535128
      Feb 27 05:03:43 Tower dhcpcd[1046]: eth0: renewing lease of 192.168.1.123
      Feb 27 05:03:43 Tower dhcpcd[1046]: eth0: acknowledged 192.168.1.123 from 192.168.1.1
      Feb 27 05:03:43 Tower dhcpcd[1046]: eth0: leased 192.168.1.123 for 86400 seconds
      Feb 27 06:27:39 Tower kernel: md: parity incorrect: 3405040336
      Feb 27 07:38:12 Tower kernel: md: sync done. time=28185sec
      Feb 27 07:38:12 Tower kernel: md: recovery thread sync completion status: 0
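
A few of the posts above come down to eyeballing whether the "parity incorrect" offsets match from one pass to the next. Here is a minimal Python sketch of automating that comparison, assuming a copy of the syslog has been saved after each pass; the file names below are made-up placeholders.

    import re

    PATTERN = re.compile(r"md: parity incorrect: (\d+)")

    def error_offsets(syslog_path):
        # Collect every block offset reported as "parity incorrect" in one syslog.
        offsets = set()
        with open(syslog_path) as f:
            for line in f:
                m = PATTERN.search(line)
                if m:
                    offsets.add(int(m.group(1)))
        return offsets

    # Hypothetical file names -- one saved syslog per parity pass.
    first_pass = error_offsets("syslog_repair_pass.txt")
    later_pass = error_offsets("syslog_check_pass.txt")

    print("offsets in both passes: ", sorted(first_pass & later_pass))
    print("only in the repair pass:", sorted(first_pass - later_pass))
    print("only in the later pass: ", sorted(later_pass - first_pass))

One thing the offsets cannot tell you is which disk is at fault: with single parity, a mismatch only means the parity block and the data blocks at that offset disagree, not which of them was written wrong.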
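
On the theory from posts 4 and 8 that the first correcting pass wrote bad parity and every later check then trips over the same offsets, here is a toy single-parity model in Python. It only illustrates the XOR arithmetic, not unraid's actual md driver, and the disk contents and corrupted offsets are invented.

    import random

    NUM_BLOCKS = 1000
    random.seed(0)

    # Three "data disks", one byte per block.
    disks = [[random.randrange(256) for _ in range(NUM_BLOCKS)] for _ in range(3)]

    def compute_parity(disks):
        # Single parity is the XOR of the data disks, block by block.
        return [a ^ b ^ c for a, b, c in zip(*disks)]

    def parity_check(disks, parity, fix=False):
        # Return the offsets where stored parity disagrees with the data;
        # with fix=True, rewrite parity at those offsets (a "correcting" pass).
        expected = compute_parity(disks)
        bad = []
        for i, (stored, good) in enumerate(zip(parity, expected)):
            if stored != good:
                bad.append(i)
                if fix:
                    parity[i] = good
        return bad

    parity = compute_parity(disks)

    # Pretend the first correcting pass wrote garbage at a few offsets
    # (e.g. the value was corrupted in RAM on its way to the parity disk).
    for i in (123, 456, 789):
        parity[i] ^= 0x01

    print(parity_check(disks, parity))            # [123, 456, 789]
    print(parity_check(disks, parity))            # same offsets every time
    print(parity_check(disks, parity, fix=True))  # rewrites them correctly
    print(parity_check(disks, parity))            # [] -- clean afterwards

If the data blocks themselves are fine and only the parity was mis-written, re-running the correcting check is enough to end up with good parity again, which is why the "verify the same offsets twice, then fix" plan in post 8 is reasonable; it just can't explain where the original corruption came from.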
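
For the long SMART tests mentioned in post 7, here is a rough sketch that drives smartctl (from smartmontools) through Python. It assumes smartctl is installed on the server and is run as root, and the device names are placeholders to be replaced with the real ones; the same thing can be done by typing the two smartctl commands by hand for each drive.

    import subprocess

    DRIVES = ["/dev/sda", "/dev/sdb", "/dev/sdc"]   # example names -- check yours first

    def start_long_test(dev):
        # "smartctl -t long" starts the drive's extended self-test; it runs
        # inside the drive itself and can take several hours on a 2TB disk.
        subprocess.run(["smartctl", "-t", "long", dev], check=True)

    def show_results(dev):
        # "smartctl -a" dumps the SMART attributes and the self-test log,
        # where the finished long test (and any read failures) will appear.
        result = subprocess.run(["smartctl", "-a", dev], capture_output=True, text=True)
        print("=====", dev, "=====")
        print(result.stdout)

    for dev in DRIVES:
        start_long_test(dev)

    # Come back after the tests have had time to finish, then:
    # for dev in DRIVES:
    #     show_results(dev)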