Last modified on 05/18/10 15:50:34

Dumpfilter for Subversion Repositories

Introduction

There is no normal user interface for manipulating revisions in a Subversion repository after they have been committed. All changes have to be checked in as a new revision. If someone, for example, commits a classified file by mistake, it can't be removed permanently from the repository. It will always be accessible under its commit revision.

There is a good reason why no such user interface exists. These changes should normally not be made by users, because they go against the version control philosophy that every change is remembered forever.

The only way to delete or manipulate old revisions is to dump the whole repository into a file, filter this file and recreate the repository from it. Subversion comes with the filter program svndumpfilter, which lets you include or exclude specific paths inside your dumpfile, but that's it. The good thing is that the dumpfile is in a text format (No, not ASCII! UTF-8 of course), so it can easily be filtered, even without having the specification (which I couldn't find anywhere on the WWW).

So what did I, as a Linux user, do when I figured out that I had accidentally checked in all my code with Windows line endings, and that the only way to fix this was to commit everything again with Unix line endings, which would double the repository size because every line (i.e. just its ending!) would have changed?
Right, I wrote a Perl script to fix this in the dumpfile! Er, ..
No, actually I did it properly and wrote a whole Perl module, so it can be used in more than one script and works a little bit like a dumpfile API.

Dumpfiles can be created and imported in the following way:

 # Create
 svnadmin dump /path/to/repository > dumpfile

 # Import
 svnadmin create /path/to/new/repository
 svnadmin load /path/to/new/repository < dumpfile

They can also be filtered on the fly:

 svnadmin create /path/to/new/repository
 svnadmin dump /path/to/repository | svndumpfilter_xxx - - | svnadmin load /path/to/new/repository

Short description of the Perl module

I called the module 'SVN::Dumpfilter'. A minimal usage example:

  use SVN::Dumpfilter qw(Dumpfilter);
  
  Dumpfilter($dumpfile_filename, $outfile_filename, \&callback_filter_subfunction);

It reads the given dumpfile, parses it and calls the callback function for every node entry. The function gets a hash reference to the node record. Every change, addition, deletion, ... of a file or directory results in a node inside the dumpfile. Revisions (i.e. the start of one) are nodes as well.
The node record is a hash which holds two other hash refs, an array ref and a scalar ref. An example of a normal node record is:

$href = {
          'content' => \'(content)', # scalar ref 
          'properties_order' => [],  # array ref
          'properties' => {},        # hash ref
          'header' => {              # hash ref
                        'Content-length' => '922',
                        'Text-content-length' => 922,
                        'Node-action' => 'change',
                        'Node-kind' => 'file',
                        'Node-path' => 'trunk/filename.pl',
                        'Text-content-md5' => 'c7ed3072d412de68da477350f8e8056f'
                      }
        };

The 'properties_order' array, which holds all property names, is there just to be able to output all properties in their original order, which makes testing the module much easier. For example, I wrote a null-filter which does nothing, so if the parser works correctly the output equals the input. A quite good way to test the code.
Newly added properties just have to be placed in the 'properties' hash; they don't have to be added to 'properties_order'.
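
The ordering rule described above can be sketched roughly as follows. This is only an illustration of the idea, not the module's actual code: the helper name ordered_property_names and the property name 'my:flag' are hypothetical, and listing newly added properties after the original ones in sorted order is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: return the property names in their original order,
# followed by any properties a filter added later.
sub ordered_property_names {
    my ($href) = @_;
    my %seen;
    # Original properties that still exist, in dumpfile order.
    my @names = grep { exists $href->{properties}{$_} && !$seen{$_}++ }
                @{ $href->{properties_order} || [] };
    # New properties missing from 'properties_order' (sorted: an assumption).
    push @names, grep { !$seen{$_}++ } sort keys %{ $href->{properties} };
    return @names;
}

my $record = {
    properties_order => [ 'svn:log', 'svn:author', 'svn:date' ],
    properties       => {
        'svn:log'    => 'Log message, ...',
        'svn:date'   => '2006-05-10T13:31:40.486172Z',
        'svn:author' => 'martin',
    },
};

# A filter adds a property without touching 'properties_order'.
$record->{properties}{'my:flag'} = 'yes';

# Prints: svn:log, svn:author, svn:date, my:flag
print join( ', ', ordered_property_names($record) ), "\n";
```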

A revision node looks like this:

$href = {
          'properties_order' => [
                                  'svn:log',
                                  'svn:author',
                                  'svn:date'
                                ],
          'properties' => {
                            'svn:log' => 'Log message, ...',
                            'svn:date' => '2006-05-10T13:31:40.486172Z',
                            'svn:author' => 'martin'
                          },
          'header' => {
                        'Content-length' => '151',
                        'Prop-content-length' => 151,
                        'Revision-number' => '58'
                      }

        };

So for example to access the file path use:

 $href->{header}->{'Node-path'}

To check whether it's a revision or a normal node, check for:

 exists $href->{header}->{'Revision-number'}
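
Put together, a callback might tell the two kinds of record apart like this. It is a minimal sketch that runs on hand-built records instead of a real dumpfile; the function name describe_record is hypothetical. Note that the header names contain a hyphen, so Perl does not treat them as barewords and they have to be quoted as hash keys.

```perl
use strict;
use warnings;

# Sketch of a callback body: describe the record it is handed.
sub describe_record {
    my ($href) = @_;
    my $header = $href->{header};

    # Revision records carry a 'Revision-number' header.
    if ( exists $header->{'Revision-number'} ) {
        return "revision $header->{'Revision-number'}";
    }

    # Everything else is treated as a normal node here.
    return join ' ', $header->{'Node-action'}, $header->{'Node-kind'},
                     $header->{'Node-path'};
}

# Hand-built records modelled on the examples above.
my $revision  = { header => { 'Revision-number' => '58' } };
my $file_node = {
    header => {
        'Node-action' => 'change',
        'Node-kind'   => 'file',
        'Node-path'   => 'trunk/filename.pl',
    },
};

print describe_record($revision),  "\n";    # revision 58
print describe_record($file_node), "\n";    # change file trunk/filename.pl
```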

If any property or the content was changed, the filter must call the function svn_recalc_prop_header or svn_recalc_textcontent_header, respectively. The function svn_recalc_content_header calls both functions and can be used as a shortcut. These functions recalculate the length and checksum headers to ensure a correct, non-corrupt output dumpfile.
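
To show why the recalculation matters, here is a rough sketch of a line-ending filter working on a hand-built node record. The function recalc_text_headers is a hypothetical stand-in written for this example, not the module's code; in a real filter you would call svn_recalc_textcontent_header instead (check the module for its exact calling convention). The sketch assumes that the content is a plain byte string and that 'Content-length' is the sum of the property and text lengths, which matches the example records above.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical stand-in for the recalculation step: refresh the headers
# that depend on the text content. Newer dumpfiles may carry additional
# checksum headers, which this sketch ignores.
sub recalc_text_headers {
    my ($href) = @_;
    my $header = $href->{header};
    my $length = length ${ $href->{content} };
    $header->{'Text-content-length'} = $length;
    $header->{'Text-content-md5'}    = md5_hex( ${ $href->{content} } );
    $header->{'Content-length'} =
        $length + ( $header->{'Prop-content-length'} || 0 );
    return;
}

# Sketch of a filter callback: convert Windows line endings to Unix ones.
# A real filter should skip binary files; this one does not check for them.
sub dos2unix_sketch {
    my ($href) = @_;
    my $header = $href->{header};
    return if exists $header->{'Revision-number'};        # revision record
    return unless ( $header->{'Node-kind'} || '' ) eq 'file';
    return unless ref $href->{content} eq 'SCALAR';       # no text content

    # Recalculate only if something was actually replaced.
    recalc_text_headers($href) if ${ $href->{content} } =~ s/\r\n/\n/g;
    return;
}

my $text = "print 1;\r\nprint 2;\r\n";    # 20 bytes with CRLF endings
my $file_node = {
    content          => \$text,
    properties_order => [],
    properties       => {},
    header           => {
        'Node-action'         => 'change',
        'Node-kind'           => 'file',
        'Node-path'           => 'trunk/filename.pl',
        'Content-length'      => 20,
        'Text-content-length' => 20,
        'Text-content-md5'    => md5_hex($text),
    },
};

dos2unix_sketch($file_node);
print $file_node->{header}{'Text-content-length'}, "\n";    # 18
```

Without the recalculation, the headers would still claim 20 bytes and the old checksum, and the filtered dumpfile would no longer be consistent with its own content.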

For more details, look at the code of the filters below. They also serve as examples.