Crash on moving files from reiserfs to XFS


BillyJ

During the move from reiserfs to XFS my server suffered a CPU stall and a hard reset was the only option available.

 

It was in the process of moving a 2.7 TB Movies folder from Disk5 (4 TB, ReiserFS) to Disk6 (3 TB, XFS). I know it didn't complete the move; it was maybe 25% through. Used disk space is 2.90 TB.

 

I kick off the move via Midnight Commander using F6 and get prompted that the target already exists. At "Overwrite all targets?" I choose None.

 

Now I've got 6 TB (or close enough) of duplicate data, and there is no way there is enough free space to continue.

 

Does anyone have any ideas? Is the data in fact duplicated, so that I should be able to delete the Movies folder from Disk6 and restart a complete move?

 

Thanks

Will

Link to comment

I would use rsync to copy/move files from one disk to another via the command line.

 

rsync will compare modification time and size, then skip the files that already exist.

 

example usage would be

cd /mnt/disk5

rsync -avPX . /mnt/disk6

you can use the -n flag to do a dry run and see what would be copied (I almost always do that first)

 

rsync -n -avPX . /mnt/disk#

see what spits out

then do it without the -n

 

AFTER all is said and done you can do another rsync to compare files by checksum rather than modification time and size

 

do this with

 

rsync -n -rcvPX . /mnt/disk#

This will show you what might get copied because the files did not compare via checksum.

If you want to do the actual copy again for files that do not compare take off the dry run flag (-n)

 

My final command is usually to remove what compares with --remove-source-files.

 

rsync --remove-source-files -rcvPX . /mnt/disk#

 

This compares the tree by checksum, copies again any file that does not match, and removes each source file once an identical copy exists on the destination.

After that I do a final review down the tree with find to make sure there are no files left over.

 

find . -type f -ls

 

and a final cleanup with

 

find . -depth -type d -empty -ls -delete

which removes empty directories

 

Tips:

When doing this, always be careful of which directory you are in.

Always use the -n flag first to make sure you get the expected results.
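
One way to act on the first tip is a small guard that refuses to continue unless the shell is sitting in the directory you expect. This is only a sketch: require_cwd is a made-up helper name, not part of rsync or unRAID, and the disk paths in the usage comment are examples.

```shell
# Hypothetical helper: succeed only if the current directory is the
# one we expect, so "rsync ... ." cannot run from the wrong place.
require_cwd() {
    local expected actual
    # Resolve the expected directory to its physical path.
    expected=$(cd "$1" 2>/dev/null && pwd -P) || {
        echo "no such directory: $1" >&2
        return 1
    }
    actual=$(pwd -P)
    if [ "$actual" != "$expected" ]; then
        echo "refusing to run: in $actual, expected $expected" >&2
        return 1
    fi
}

# Example use (paths are illustrative; still a dry run because of -n):
#   cd /mnt/disk5 && require_cwd /mnt/disk5 && rsync -n -avPX . /mnt/disk6
```

Chaining with && means rsync never starts if the check fails.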

Link to comment
  • 2 weeks later...
  • 3 months later...

rsync -n -rcvPX . /mnt/disk#

This will show you what might get copied because the files did not compare via checksum.

If you want to do the actual copy again for files that do not compare take off the dry run flag (-n)

 

What would cause rsync errors?  I copied about 3TB over using rsync, and the above turned up about 500 errors.  I looked at a few, and it seems that there is a single byte that differs between source and destination in each of the files.  The vast majority of the files (there are 542000 of them) were copied fine.
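
For anyone wanting to check a suspect pair the same way, cmp -l lists every differing byte position, so counting its output lines shows whether a file really differs in just one byte. A minimal sketch, assuming both files have the same length; count_diff_bytes is a made-up name and the paths in the usage comment are placeholders.

```shell
# count_diff_bytes A B: print how many byte positions differ between
# two files. cmp -l prints one line per differing byte: the offset
# (decimal, starting at 1) and the two byte values in octal.
count_diff_bytes() {
    cmp -l "$1" "$2" | wc -l | tr -d ' '
}

# Example use (placeholder paths):
#   count_diff_bytes /mnt/disk5/some/file /mnt/disk6/some/file
#   cmp -l /mnt/disk5/some/file /mnt/disk6/some/file   # show the offsets
```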

 

The server has run flawlessly for years.  I installed a new 4TB drive (which is the destination here).  It passed a preclear before I started using it.  The source drive is reiserfs and the destination is xfs.  Copying the problem files is straightforward enough (and seems to work), but I'm not sure I trust this array any more.

 

 

Link to comment

Suggest you post more details about your system. What version of unRaid are you running? Motherboard? Controllers? A syslog might be helpful.

 

There was a bug in RFS from v6 (not 5) beta 7/8 that caused corruption. If you are running such a version, suggest you immediately upgrade. If not, the more details you can provide the better. What you are describing is not normal.

 

Update: thinking further - have you run a memory test? Faulty memory could cause such a symptom.

Link to comment

Did you ever run v5 beta 7/Beta 8?  Those releases had a reiserfs bug that could result in silent file corruption.

 

Isn't this irrelevant?  Whether my data (on the source disk) is corrupt or not shouldn't matter.  Even if it is corrupt, I'd expect v6 to be able to duplicate it perfectly to the new drive.

In any event, I don't think that I ever ran v5 beta 7 or 8.

 

Link to comment

Thankfully, whatever is going on isn't as 'scary' as it could be, since this system is not my primary unraid server.  This entire system is a backup for my primary unraid server, and even with a total loss of the system I wouldn't lose anything of value.

 

The system:

  Motherboard: Intel D975XBX
  CPU: Intel® Core™2 Extreme Processor X6800 @ 2.93GHz
  Ram: 8G
  Controller: m1015
  Drives:
    Array drives (connected to m1015 controller)
      Parity 4TB  WDC_WD40EZRX
      Disk1  3TB  ST3000DM001  reiserfs
      Disk2  3TB  WDC_WD30EFRX reiserfs
      Disk3  3TB  WDC_WD30EFRX reiserfs
      Disk4  3TB  ST3000DM001  reiserfs
      Disk5  4TB  WDC_WD40EZRX reiserfs
      Disk6  4TB  WDC_WD40EZRX xfs
    Non-array 
             64G  Crucial SSD (connected to m1015 controller)
             500G ST3500830A EIDE

 

It does look like the drives are running a bit hot -- parity is at 41 degrees Celsius

 

I haven't run a memory test since the box was set up.

 

Disk6 and the 500G EIDE drive were just installed as I upgraded to v6.  Maybe I have a loose cable somewhere.

 

I just tried some copies again, and had some crazy results... I now think that I am having read errors on disk5!...

root@BadAxe:~# rsync -crvaPX /mnt/disk5/tower_backup/tower/videos_home/* /mnt/disk6/t/tower_backup/tower/videos_home/
sending incremental file list
MP_Wedding.mpeg
  1,019,044,339 100%   17.23MB/s    0:00:56 (xfr#1, to-chk=85/87)
peyton.avi
  2,095,902,208 100%   15.79MB/s    0:02:06 (xfr#2, to-chk=83/87)

sent 3,115,711,126 bytes  received 62 bytes  8,715,276.05 bytes/sec
total size is 6,643,050,254  speedup is 2.13
root@BadAxe:~# rsync -crvaPX /mnt/disk5/tower_backup/tower/videos_home/* /mnt/disk6/t/tower_backup/tower/videos_home/
sending incremental file list
EA_Wedding/Wedding_320x240.avi
  1,245,246,976 100%   21.09MB/s    0:00:56 (xfr#1, to-chk=75/87)

sent 1,245,555,048 bytes  received 43 bytes  3,886,287.34 bytes/sec
total size is 6,643,050,254  speedup is 5.33
root@BadAxe:~#

 

What the above shows is that the first copy pass confirmed that the Wedding_320x240.avi file was ok, but the second pass detected that it needed to be copied again.  Definitely starting to sound like it could be memory or a disk read error.
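
The same symptom can be checked without copying anything: checksum one unchanged file several times and see whether the results agree. This is a rough sketch, not a diagnostic tool; read_is_stable is a made-up name, and a repeat read of a file that fits in RAM may be served from the page cache, so a mismatch points at "RAM or the read path" rather than at one specific component.

```shell
# read_is_stable FILE [PASSES]: checksum the same file PASSES times
# (default 3). Returns 0 if every pass agrees, 1 on a mismatch,
# 2 if the file cannot be read.
read_is_stable() {
    local file passes first next i
    file=$1
    passes=${2:-3}
    first=$(cksum < "$file") || return 2
    i=1
    while [ "$i" -lt "$passes" ]; do
        next=$(cksum < "$file") || return 2
        if [ "$next" != "$first" ]; then
            echo "MISMATCH on pass $((i + 1)): $file" >&2
            return 1
        fi
        i=$((i + 1))
    done
    return 0
}

# Example use (placeholder path):
#   read_is_stable /mnt/disk5/path/to/large_file 5 && echo "reads agree"
```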

 

I am not at home right now.  I will kick off a parity check though.  I think that a "pass" would mean that I have a RAM problem.  Parity problems wouldn't tell me much -- bad parity could be due to either drive or RAM problems.

 

Link to comment

Good news. You might want to consider moving off the EIDE drives, which are pretty long in the tooth at this point.

 

Well.. not great news though.  I still have a system which isn't working as it should...

 

Re the EIDE drive: It has low hours -- just trying to squeeze a bit of value from some old hardware.

Link to comment

Update: thinking further - have you run a memory test? Faulty memory could cause such a symptom.

 

We have a winner!

 

I downloaded and compiled memtester... which allows me to test RAM on a running system...

Licensed under the GNU General Public License version 2 (only).

pagesize is 4096
pagesizemask is 0xfffffffffffff000
want 6144MB (6442450944 bytes)
got  6144MB (6442450944 bytes), trying mlock ...locked.
Loop 1:
  Stuck Address       : testing   1FAILURE: possible bad address line at offset 0x1614d95d0.
Skipping to next test...
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
FAILURE: 0x1f50973e6e6cca44 != 0x1f50973e6e6eca44 at offset 0xa14d9dc8.
  Compare MUL         : FAILURE: 0x00000000 != 0x00020000 at offset 0xa14d9dc8.
  Compare DIV         :   Compare OR          : ok
  Compare AND         : ok
FAILURE: 0x7ef951d11c2495ac != 0x7ef951d11c2695ac at offset 0xa14d9dc8.
  Sequential Increment:   Solid Bits          : testing   0FAILURE: 0x00000000 != 0x00020000 at offset 0xa14d9dc8.
  Block Sequential    : testing   0FAILURE: 0x00000000 != 0x00020000 at offset 0xa14d9dc8.
  Checkerboard        : testing   1FAILURE: 0x5555555555555555 != 0x5555555555575555 at offset 0xa14d9dc8.
  Bit Spread          : testing   6

 

Definitely memory related.  The errors make me think I might have a poorly seated memory module.  The errors all seem to be in the same data line.
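
The "same data line" observation can be checked by XOR-ing each expected/actual pair from the memtester output above: every 1 bit in the result is a bit where the two values disagree. A small sketch using shell arithmetic; flipped_bits is a made-up name, and the values are copied from the log.

```shell
# flipped_bits EXPECTED ACTUAL: print the XOR of two values in hex.
# Each 1 bit in the result marks a bit position where they differ.
flipped_bits() {
    printf '0x%x\n' "$(( $1 ^ $2 ))"
}

# Pairs taken from the FAILURE lines in the memtester output:
flipped_bits 0x1f50973e6e6cca44 0x1f50973e6e6eca44   # prints 0x20000
flipped_bits 0x7ef951d11c2495ac 0x7ef951d11c2695ac   # prints 0x20000
flipped_bits 0x5555555555555555 0x5555555555575555   # prints 0x20000
```

All three pairs differ only in 0x20000, which is bit 17 counting from zero, the same bit reported in the 0x00000000 != 0x00020000 lines.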

Link to comment

memtest is included with unRAID. You select it from the boot menu. It doesn't work with a running system like that other tool you mentioned, but you might also try testing with memtest.

There are good reasons NOT to run it on a running system, especially one which is storing and serving valuable data. Not least of which is the need for the running system to occupy at least some of the ram that needs to be tested. Dedicated memory test programs that run by themselves are a much better option.
Link to comment

Hey -- "not smart" stings!

 

Recall...

Thankfully, whatever is going on isn't as 'scary' as it could be, since system is not my primary unraid server.  This entire system is a backup for my primary unraid server, and even with a total loss of the system I wouldn't lose anything of value.

 

I know the risks of running memory tests on an active system.  Before running this tool, I had already concluded that none of this data can be trusted, since I don't know how long the RAM has been failing.  I have the luxury of not having to trust this data since it's a backup system.  Once I get the RAM problem fixed, I'll either be deleting everything before the next backup, or using "rsync -c" to verify that it is a true image of the primary unraid server.

 

I sometimes wonder if I have gone a bit overboard...  I have a secondary unraid server that I use to periodically back up my primary unraid server.  The primary server is mostly backups of files stored on various computers, including snapshots and inbound crashplan backup files.  I'm exposed due to not having one of the servers off site, but otherwise feel pretty safe.

 

I did run memtest overnight last night -- it turned up 1600 errors in 10 hours, all related to occasional failures of one bit at addresses in a very narrow range of memory.

 

Link to comment
