Recently, I worked to recover a client’s corrupted SVN repository. While the best protection against repository corruption is good backups, these are not always up-to-date or intact. Unless there are backups, repository corruption will almost certainly result in some data loss.
However, by working around corrupt revisions, it may be possible to restore the repository to working order with only minor data loss and, if an up-to-date and intact working copy is available, to detect what data was lost.
Success will depend, of course, upon the level of SVN repository corruption. If most (or all) revisions are corrupt, not much can be done. When corruption occurs without backups, the task becomes recovering as much data as possible and identifying what data has been lost, rather than expecting full recovery.
The primary strategy for working around corruption is to create a new repository, omitting the specific revisions which were corrupted. This is achieved by creating an SVN dumpfile from the existing repository containing only valid revisions, then importing the dumpfile into a new repository.
Here are the steps for recovery:
1. Detect Corruption
Corrupted revisions usually come to light when attempting a full repository checkout or running operations that only affect one particular revision. A definitive diagnosis can be achieved using the `svnadmin verify` command against the repository in question.
If a corrupted revision is detected, `svnadmin` will print out an error message. Specific revisions or ranges can be checked with the `-r` option to `svnadmin verify`.
For example, let’s say we have a repository with 10 revisions. Revisions 4 and 7 are corrupted. We can check for corruption on the repository using `svnadmin verify`:
    jk@gerty ~/svn $ svnadmin verify myrepository
    * Verified revision 0.
    * Verified revision 1.
    * Verified revision 2.
    * Verified revision 3.
    svnadmin: E160004: Missing node-id in node-rev at r4 (offset 290)
The verification will stop at the first sign of corruption. (In this case, at revision 4.)
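Because a full verify halts at the first bad revision, it can help to check each revision on its own and collect every failure in one pass. The following is a minimal sketch of that idea, not a definitive tool: `find_bad_revisions` is a hypothetical helper name, and it assumes `svnlook` and `svnadmin` are on the `PATH`. Revisions that depend on a corrupt one may also be reported.

```shell
# Sketch: verify every revision individually so one failure does not hide
# later ones. find_bad_revisions is a hypothetical helper, not part of SVN.
find_bad_revisions() {
    repo=$1
    # svnlook youngest prints the number of the newest revision (HEAD).
    head=$(svnlook youngest "$repo") || return 1
    rev=0
    while [ "$rev" -le "$head" ]; do
        # -q suppresses the "Verified revision N." progress lines;
        # error text is discarded because only the exit status matters here.
        if ! svnadmin verify -q -r "$rev" "$repo" 2>/dev/null; then
            echo "$rev"
        fi
        rev=$((rev + 1))
    done
}
```

Calling `find_bad_revisions myrepository` prints one revision number per line for each revision that fails verification.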
2. Dump Valid Revisions
An SVN dumpfile is created using the `svnadmin dump` command against a repository. Specific revisions can be selected with the `-r` option to `svnadmin dump`, and revisions can be isolated so that they only include incremental changes using the `--incremental` option.
Continuing our example, let’s start dumping the repository:
    jk@gerty ~/svn $ svnadmin dump myrepository > dumpfile
    * Dumped revision 0.
    * Dumped revision 1.
    * Dumped revision 2.
    * Dumped revision 3.
    svnadmin: E160004: Missing node-id in node-rev at r4 (offset 290)
We can then move beyond that corrupt revision by using the `-r` option to select the revision beyond the corrupted one, and `--incremental` to only include changes made in the dumped revisions (instead of the whole source tree):

    jk@gerty ~/svn $ svnadmin dump --incremental -r 6:HEAD myrepository >> dumpfile
    * Dumped revision 6.
    svnadmin: E160004: Missing node-id in node-rev at r7 (offset 289)

Note that we use `>>` to append to the existing dumpfile, and that we start the range at revision 6 instead of 5 (since revision 4 is corrupted, the ‘incremental’ dump for revision 5 cannot be calculated).
Then we can dump the rest of the repository as follows:
    jk@gerty ~/svn $ svnadmin dump --incremental -r 9:HEAD myrepository >> dumpfile
    * Dumped revision 9.
    * Dumped revision 10.

Note again that we start the range at revision 9 instead of 8, because the incremental dump for revision 8 would depend on the corrupted revision 7.
Caveat: If revisions 4 and 7 only added new files to the source tree, revisions 5 and 8 could be dumped without the `--incremental` option as full revisions (avoiding comparison with revisions 4 and 7, which fail because of the corruption). However, in our example, revisions 4 and 7 modify existing files, so dumping without `--incremental` would cause an error when loading into a new repository because the files already exist in the source tree:
    svnadmin: E160020: File already exists: filesystem 'recoveredrepo/db', transaction '4-4', path 'file'
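When the corrupt revisions are already known, the ranges to dump can be worked out up front instead of by trial and error. The sketch below is one possible way to do that, under stated assumptions: `plan_dump_ranges` and `dump_ranges` are hypothetical helper names, the corrupt revisions are passed in ascending order, and the revision right after each corrupt one is skipped, as in the example above. Unlike the transcript, it uses `--incremental` for the first range as well, which should be equivalent because revision 0 is empty, and it stops each range before the corrupt revision rather than letting the dump fail.

```shell
# Print the ranges to dump, one per line, skipping each corrupt revision
# and the revision immediately after it.
# Usage: plan_dump_ranges HEAD BAD_REV [BAD_REV...]   (ascending order)
plan_dump_ranges() {
    head=$1
    shift
    start=0
    for bad in "$@"; do
        end=$((bad - 1))
        if [ "$start" -le "$end" ]; then
            echo "$start:$end"
        fi
        # Resume two revisions past the corrupt one.
        next=$((bad + 2))
        if [ "$next" -gt "$start" ]; then
            start=$next
        fi
    done
    if [ "$start" -le "$head" ]; then
        echo "$start:$head"
    fi
}

# Dump each range incrementally into a single dumpfile.
# Usage: dump_ranges REPO DUMPFILE RANGE [RANGE...]
dump_ranges() {
    repo=$1
    dumpfile=$2
    shift 2
    : > "$dumpfile"    # start with an empty dumpfile
    for range in "$@"; do
        svnadmin dump -q --incremental -r "$range" "$repo" >> "$dumpfile" || return 1
    done
}
```

For the example repository, `plan_dump_ranges 10 4 7` yields the ranges `0:3`, `6:6` and `9:10`, which can then be passed to `dump_ranges myrepository dumpfile`.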
3. Create a New Repository
A new SVN repository is easily created with `svnadmin create`:
    jk@gerty ~/svn $ svnadmin create recoveredrepo
4. Restore Valid Revisions from Dumpfile
Importing the dumpfile into the new SVN repository is also easily accomplished with `svnadmin load`:
    jk@gerty ~/svn $ cat dumpfile | svnadmin load recoveredrepo
    <<< Started new transaction, based on original revision 1
    ------- Committed revision 1 >>>

    <<< Started new transaction, based on original revision 2
    ------- Committed revision 2 >>>

    <<< Started new transaction, based on original revision 3
    ------- Committed revision 3 >>>

    <<< Started new transaction, based on original revision 4
    ------- Committed revision 4 >>>

    <<< Started new transaction, based on original revision 6
    ------- Committed new rev 5 (loaded from original rev 6) >>>

    <<< Started new transaction, based on original revision 7
    ------- Committed new rev 6 (loaded from original rev 7) >>>

    <<< Started new transaction, based on original revision 9
    ------- Committed new rev 7 (loaded from original rev 9) >>>

    <<< Started new transaction, based on original revision 10
    ------- Committed new rev 8 (loaded from original rev 10) >>>

Note that because of the omitted revisions, revision numbers in the new repository will not necessarily correspond to the old revision numbers.
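Steps 3 and 4 can be wrapped together so that the new repository is created, loaded, and then verified in one go. This is only a sketch: `rebuild_repository` is a hypothetical helper name, and the final `svnadmin verify` is an extra sanity check added here rather than part of the procedure above. Since `svnadmin load` reads the dumpfile from standard input, a redirect works in place of `cat`.

```shell
# Create a fresh repository, load the dumpfile into it, then verify it.
# Usage: rebuild_repository DUMPFILE NEWREPO
rebuild_repository() {
    dumpfile=$1
    newrepo=$2
    # Refuse to touch a path that already exists.
    if [ -e "$newrepo" ]; then
        echo "refusing to overwrite existing path: $newrepo" >&2
        return 1
    fi
    svnadmin create "$newrepo" || return 1
    # -q keeps the per-revision progress output quiet.
    svnadmin load -q "$newrepo" < "$dumpfile" || return 1
    # Extra check (an addition in this sketch): the new repository should verify cleanly.
    svnadmin verify -q "$newrepo"
}
```

For example, `rebuild_repository dumpfile recoveredrepo` would stand in for the two commands shown in steps 3 and 4.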
5. Compare the Recovered Repository
Now that we have recovered what we can from the corrupted SVN repository, it is time to perform a checkout:
    jk@gerty ~/svn $ svn co file:///Users/jk/svn/recoveredrepo recoveredwc
    Checked out revision 8.
We can also compare it with an up-to-date working copy (if one is available):
    jk@gerty ~/svn $ diff -r -x .svn recoveredwc/ originalwc/
    diff -r -x .svn recoveredwc/somefile originalwc/somefile
    1c1
    < Hello World
    ---
    > Nothing
Note that we omit comparison of the `.svn` directories with `-x .svn`.
From the comparison, we can see that some data in the new recovered working copy is different from the original working copy. If the original working copy was fully up-to-date, this may give us a hint as to what data was lost due to corrupted revisions.
While this solution would not detect data loss in the repository history, it would potentially allow us to detect data loss in the latest version of the files (which are often the most relevant).
Using the diff, we can determine which version of the file is most correct on a case-by-case basis. Any necessary updates can then be committed to the new recovered repository.
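For larger working copies, a list of the files that differ is often a more useful starting point than the full diff. The following sketch assumes a `diff` that supports `-r`, `-q` and `-x` (GNU and BSD versions do); `list_differences` is a hypothetical helper name.

```shell
# List files that differ between two working copies, ignoring .svn metadata.
# Usage: list_differences RECOVERED_WC ORIGINAL_WC
list_differences() {
    recovered=$1
    original=$2
    # -r recurses, -q reports only which files differ, -x .svn skips metadata.
    diff -r -q -x .svn "$recovered" "$original"
    status=$?
    # diff exits 0 for identical trees, 1 for differences, and >1 on errors;
    # only the last case is treated as a failure here.
    [ "$status" -le 1 ]
}
```

Running `list_differences recoveredwc originalwc` prints one line per differing file, plus an `Only in ...` line for any file present in just one of the two trees. Each reported file can then be reviewed individually.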