case notes: migrating our SVN to SourceForge

Recently I endeavored to migrate all modules for a particular project from our SVN to SourceForge.  These are just some notes documenting what I went through along the way, so they may or may not apply to your situation.

Some highlights of our situation:

1. Our SVN was a mess to begin with.  People committing to it have varying degrees of experience with development tools, and typically follow whatever online tutorials they can find.  So most modules don’t follow the standard layout of trunk/, tags/ and branches/, and many are checked in to the wrong place.

2. Our SVN is a lab-wide SVN containing code modules from all projects.  For this particular project, our protocol is to check in all code modules under a directory bearing the project name, e.g.

svn+ssh://server/path/to/reporoot/COOLPROJECT/module1
svn+ssh://server/path/to/reporoot/COOLPROJECT/module2
...

3. Our project space on SourceForge is dedicated to this particular project, so after they are migrated over, the modules no longer need to be under the directory bearing the project name.

4.  Our project space on SourceForge was created as a beta (aka Allura) project.  Finding documentation for Allura projects isn’t terribly easy, and the documentation Google finds is usually for the classic projects.  My initial plan was to migrate one module at a time, so each developer could move their stuff over whenever they were ready to commit.  But it didn’t work out that way.  After speaking with Chris Tsai from SourceForge (thanks Chris!), I realized the migration had to be done for all modules in a relatively short period of time, and that it would require 2 steps.  The first step is to migrate from our SVN to an intermediate SourceForge “classic” SVN.  This can be done incrementally (i.e. module-by-module); however, this classic SVN is read-only via HTTP, and therefore is not usable.  Hence the second step, which is to import the classic SVN into Allura.  The second step must wait until the intermediate “classic” SVN is ready, because the import wipes out whatever is already in Allura, and the Allura SVN doesn’t allow users to import modules.

So to get started, the first thing is to tidy up our original SVN.  This mostly involves creating proper directory structures, moving things around to the correct place, and renaming some modules with more appropriate names.  Along the way, I also zapped a few directories, including target/ and build/ directories that shouldn’t have been checked in in the first place.  I could have used “svn move” and “svn rename” and “svn delete” commands, but instead, I took an easier way by using the Subclipse plugin for Eclipse and performed the tidy-ups from the GUI.

The second step is to do a dump of our SVN.  Doing a full dump (svnadmin dump /path/to/reporoot > fullsvn.dump) is always the safest, but the resulting dump file could be huge.  For example, a full dump of our SVN would be 17GB, and all I need are the things pertaining to this project.  So a better option is to dump only the range of revisions that covers this project directory, i.e.

svnadmin dump /path/to/reporoot -r 759:1848 > partialsvn.dump

where 759 is the revision in which the project-specific directory was created, and 1848 is the latest revision.  You can find these revision numbers by running

svn log svn+ssh://server/path/to/reporoot/COOLPROJECT

and looking at the first and last entries (svn log lists the newest revision first, so the last entry is the oldest).
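If scrolling through a long log by eye is error-prone, the same lookup can be scripted.  This is only a rough sketch: the function names are mine, it assumes the svn client is on the PATH, and it uses "svn log -q" so that only the "r1234 | author | date" header lines are printed.

```python
import re
import subprocess

# Matches the header line of each log entry, e.g. "r1848 | someone | 2012-07-04 ..."
REVISION_LINE = re.compile(r"^r(\d+) \| ", re.MULTILINE)


def revision_range(svn_log_output):
    """Return (oldest, newest) revision numbers found in svn log output."""
    revisions = [int(m.group(1)) for m in REVISION_LINE.finditer(svn_log_output)]
    if not revisions:
        raise ValueError("no revision header lines found")
    return min(revisions), max(revisions)


def project_revision_range(url):
    """Run 'svn log -q' on the project directory and return its revision range."""
    result = subprocess.run(
        ["svn", "log", "-q", url],
        check=True, capture_output=True, text=True,
    )
    return revision_range(result.stdout)


# Usage (needs access to the repository):
# first, last = project_revision_range("svn+ssh://server/path/to/reporoot/COOLPROJECT")
# then: svnadmin dump /path/to/reporoot -r {first}:{last} > partialsvn.dump
```

Taking the minimum and maximum avoids depending on the order in which the entries are listed.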

However, when I initially did this, I relied on the Subclipse plugin to view the revision history of the directory, and failed to realize that by default it only shows the 25 most recent revisions (there are buttons for loading the next 25, as well as loading all, but I didn’t know that then).  As a result, I used a much higher revision number as the lower bound.  The dump process generated a bunch of warning messages, complaining that some revisions referenced earlier revisions that were not included in the dump.

...
* Dumped revision 1788.
WARNING: Referencing data in revision 1696, which is older than the oldest
WARNING: dumped revision (1725). Loading this dump into an empty repository
WARNING: will fail.
WARNING: Referencing data in revision 1720, which is older than the oldest
WARNING: dumped revision (1725). Loading this dump into an empty repository
WARNING: will fail.
WARNING: Referencing data in revision 1721, which is older than the oldest
WARNING: dumped revision (1725). Loading this dump into an empty repository
WARNING: will fail.
* Dumped revision 1789.
* Dumped revision 1790.
* Dumped revision 1791.
...

All I had to do was redo the dump using the earliest revision so all referenced revisions were included.
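When there are many of these warnings, it helps to summarize them rather than read them one by one.  A small sketch, assuming the messages printed by svnadmin dump were saved to a text file and have the same wording as the excerpt above (the function name is made up for illustration):

```python
import re

REFERENCED = re.compile(r"Referencing data in revision (\d+)")
OLDEST_DUMPED = re.compile(r"dumped revision \((\d+)\)")


def summarize_dump_warnings(messages):
    """Return (oldest dumped revision or None, sorted referenced revisions)."""
    referenced = sorted({int(n) for n in REFERENCED.findall(messages)})
    oldest = OLDEST_DUMPED.search(messages)
    return (int(oldest.group(1)) if oldest else None), referenced


# Usage, with the messages captured via "svnadmin dump ... 2> dump.log":
# with open("dump.log") as f:
#     oldest, referenced = summarize_dump_warnings(f.read())
```

Note that the smallest referenced revision is not necessarily a safe lower bound, since that revision may itself refer to even older ones.  Starting from the revision that created the project directory remains the simplest fix.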

The third step is to use svndumpfilter to pick out the module(s) I need into a separate dump file.  I could’ve done this by including all modules in one command, but I prefer creating one dump file for each module since I need some additional manipulation afterwards.  Having one dump file per module also saves a lot of pain when something goes wrong during the load.

Here it pays to have some intimate knowledge about the modules, especially after things got moved around quite a bit.  Say there is a module that used to be named “module1”, and I decided to rename it to “higgs-boson-capture” as part of the tidy-up process.  I used the following command to separate this module into its own dump:

cat partialsvn.dump | svndumpfilter --drop-empty-revs include COOLPROJECT/module1 COOLPROJECT/higgs-boson-capture > higgsbosoncap.dump

If I didn’t know about the renaming of the module, and only included its latest name, then the history under COOLPROJECT/module1 would not be included in the dump, and when I tried to load it into the SourceForge SVN, I would get “file not found” errors, because some revisions reference paths under the previous name, COOLPROJECT/module1.
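With more than a handful of modules, it is easy to forget a previous name on the command line.  One way to guard against that is to keep the name history of each module in one place and build the svndumpfilter command from it.  The sketch below is only an illustration: the helper names are hypothetical, and --drop-empty-revs is left off by default given the trouble it caused me later.

```python
import subprocess


def svndumpfilter_args(project_dir, names, drop_empty_revs=False):
    """Build the svndumpfilter command for every name a module has ever had."""
    args = ["svndumpfilter"]
    if drop_empty_revs:
        args.append("--drop-empty-revs")
    args.append("include")
    args.extend(f"{project_dir}/{name}" for name in names)
    return args


def filter_module(source_dump, target_dump, project_dir, names):
    """Run svndumpfilter, reading source_dump and writing target_dump."""
    with open(source_dump, "rb") as src, open(target_dump, "wb") as dst:
        subprocess.run(
            svndumpfilter_args(project_dir, names),
            stdin=src, stdout=dst, check=True,
        )


# Usage, mirroring the command above:
# filter_module("partialsvn.dump", "higgsbosoncap.dump",
#               "COOLPROJECT", ["module1", "higgs-boson-capture"])
```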

The option “--drop-empty-revs” seemed like a good idea at the time, because it would produce a leaner dump file.  But later it turned out to be a bad idea, and I will get to that in just a moment.

Recall that earlier, in bullet points 2 and 3, I mentioned how the paths should change once the modules are migrated to SourceForge.  So instead of being COOLPROJECT/higgs-boson-capture, the path would simply be higgs-boson-capture.  If I just loaded the dump file as is, it would still be COOLPROJECT/higgs-boson-capture.  Changing the path boils down to modifying two things in the dump file, Node-path: and Node-copyfrom-path:.

A dump file can be huge, and also can be a mixture of text and binary contents; manually editing these lines would be next to impossible, not to mention most text editors can’t even handle large files.  Folks who are knowledgeable about “awk” and “sed” can probably fix this with relative ease, but I unfortunately don’t know anything about them, so instead, I created a utility in Java for replacing the node paths.  The code is available at http://sourceforge.net/p/azzurri/code-0/6/tree/nodepathreplace/.  It may not be the most efficient code, but gets the job done.

For our particular case, I needed to run the tool twice: one pass to change COOLPROJECT/module1 to module1, and another pass to change COOLPROJECT/higgs-boson-capture to higgs-boson-capture.  The order doesn’t matter, but you need to make sure the paths under all of the module’s previous names are changed.
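For readers who would rather not build the Java tool, the same idea can be sketched in a few lines of Python.  This is not a port of my utility, just an illustration under some assumptions: the dump is in the usual format where each record is a block of header lines, a blank line, and then exactly Content-length bytes of body.  Only the two path headers are rewritten; bodies are copied untouched, so file contents that happen to contain the text “Node-path:” are safe, but any paths stored inside property values are not changed.  The function names are made up.

```python
PATH_HEADERS = (b"Node-path: ", b"Node-copyfrom-path: ")


def rewrite_header(line, old_prefix, new_prefix):
    """Rewrite one header line if it is a path header under old_prefix."""
    for header in PATH_HEADERS:
        if line.startswith(header):
            path = line[len(header):].rstrip(b"\n")
            if path == old_prefix or path.startswith(old_prefix + b"/"):
                path = new_prefix + path[len(old_prefix):]
            return header + path + b"\n"
    return line


def replace_node_paths(src, dst, old_prefix, new_prefix):
    """Copy a dump from src to dst, renaming paths in record headers only."""
    while True:
        line = src.readline()
        if not line:            # end of file
            return
        if line == b"\n":       # blank separator between records
            dst.write(line)
            continue
        content_length = 0
        while line and line != b"\n":
            if line.startswith(b"Content-length: "):
                content_length = int(line.split(b": ", 1)[1].strip())
            dst.write(rewrite_header(line, old_prefix, new_prefix))
            line = src.readline()
        dst.write(line)         # the blank line that ends the header block
        remaining = content_length
        while remaining > 0:    # copy the body verbatim, in chunks
            chunk = src.read(min(remaining, 1 << 20))
            if not chunk:
                break
            dst.write(chunk)
            remaining -= len(chunk)


# Usage for the first of the two passes (the output file name is arbitrary):
# with open("higgsbosoncap.dump", "rb") as src, open("pass1.dump", "wb") as dst:
#     replace_node_paths(src, dst, b"COOLPROJECT/module1", b"module1")
```

Because the paths live in the headers and not in the bodies, the Content-length values stay valid after the rename.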

Then it is just a matter of following the instructions to create a shell session on SourceForge (http://sourceforge.net/apps/trac/sourceforge/wiki/Shell%20service) and to load the dump into the SVN (http://sourceforge.net/apps/trac/sourceforge/wiki/Subversion%20import%20instructions).  Of course, you need to SCP the dump file to SourceForge before you can load it.  The documentation I found about SourceForge SCP is for uploading files for release (going through the FRS, or File Release Server); for our purpose, we need to SCP the dump file to the current shell session:

# do the SCP from your local server and push the file to sourceforge
scp higgsbosoncap.dump higgs,coolproject@shell.sourceforge.net:/home/users/h/hi/higgs/

Change the username and project name accordingly, and also pay attention to the directory names after /home/users/.
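Judging from the example above, the directory appears to be built from the first letter of the username, then its first two letters, then the full username.  That pattern is my inference from a single path, not something I found documented, so treat this little helper (with hypothetical names) as a convenience for avoiding typos rather than a statement of how SourceForge lays out home directories.

```python
def shell_home(username):
    """Guess the shell home directory from the pattern seen in the example."""
    return f"/home/users/{username[0]}/{username[:2]}/{username}/"


def scp_target(username, project, host="shell.sourceforge.net"):
    """Build the 'user,project@host:path' destination for scp."""
    return f"{username},{project}@{host}:{shell_home(username)}"


# Usage:
# scp higgsbosoncap.dump <value of scp_target("higgs", "coolproject")>
```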

I did this for a couple of modules and all loaded fine, but then one module gave me an error saying “File not found” when I tried to load it.  I was 100% sure all its past names were included when I ran svndumpfilter, yet it complained a file was not found.  The dump file was small enough to be inspected with a text editor, and I noticed the offending revision was making a reference to a file path from a revision that did not exist in the file.  So I went back and re-did the svndumpfilter, but this time without the “--drop-empty-revs” option.  That indeed solved the problem.
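For dump files too large to open in an editor, the same inspection can be automated.  The sketch below walks the record headers and reports every Node-copyfrom-rev that points at a revision not present earlier in the file.  It assumes the same record layout I saw in my dumps (header lines, a blank line, then Content-length bytes of body) and a seekable file; the function names are mine.

```python
import io


def read_header_blocks(src):
    """Yield the headers of each record as a dict, skipping record bodies."""
    while True:
        line = src.readline()
        if not line:
            return
        if line == b"\n":
            continue
        headers = {}
        while line and line != b"\n":
            key, _, value = line.rstrip(b"\n").partition(b": ")
            headers[key] = value
            line = src.readline()
        # Jump over the body so its contents are never mistaken for headers.
        src.seek(int(headers.get(b"Content-length", b"0")), io.SEEK_CUR)
        yield headers


def dangling_copy_sources(src):
    """Return (node path, copyfrom revision) pairs whose source is missing."""
    present = set()
    dangling = []
    for headers in read_header_blocks(src):
        if b"Revision-number" in headers:
            present.add(int(headers[b"Revision-number"]))
        elif b"Node-copyfrom-rev" in headers:
            rev = int(headers[b"Node-copyfrom-rev"])
            if rev not in present:
                path = headers.get(b"Node-path", b"").decode("utf-8", "replace")
                dangling.append((path, rev))
    return dangling


# Usage:
# with open("higgsbosoncap.dump", "rb") as f:
#     for path, rev in dangling_copy_sources(f):
#         print(f"{path} copies from r{rev}, which is not in this dump")
```

An empty result does not guarantee a clean load, but a non-empty one points straight at the revisions worth checking before uploading the file.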

This was a good exercise, but not a fun one as I spent all day working on this on the 4th of July (and the next day), instead of seeing parades and fireworks.