When dealing with this, I found the following in the 2005-01-13 dump:

  archec@svega:~/bench/2005-01-13/RePEc/bru$ ls -l bruppp00.rdf
  -rw-r--r-- 1 archec archec 35394 Jul 13 2004 bruppp00.rdf
  archec@svega:~/bench/2005-01-13/RePEc/bru$ ls -l bruppp/bruppp03.rdf
  -rw-r--r-- 1 archec archec 35394 Jul 13 2004 bruppp/bruppp03.rdf

Not only the same size but the same file. Only the latter is legal
as the protocol goes. Now they seem to have the same date as well, but
I'm sure there will be cases when one or the other is older.

What to do? I think, first sort a RePEc archive by file modification
times. When duplicate contents are found, create a revisit record for
WARC, noting the modification time of the newer resource, and say it's
a copy of the older resource. Within a single dump, the file names have
to be different anyway. If, when examining the same archive from a
different tarball, the file name is the same, don't create a revisit
record. Just make sure to process the tarballs in order, and assume the
dates on the tarballs are correct. Warn if that assumption is violated,
i.e. if the mtime of a file in a later tarball is older than its mtime
in an earlier tarball.

The problem with that approach is that the preserved copy is not
necessarily the one that is protocol compliant.

The reason I chose to work on this archive at all is that it contains
files ending in ~, from emacs I guess. So I first wrote a function that
looks for rdf~ files and stores these as if they had no ~. But I think
sorting by time would not impact this order, since if the ~ file is a
genuine backup, it will be older.

-- 
Cheers,

Thomas Krichel http://openlib.org/home/krichel
 skype:thomaskrichel
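
A minimal Python sketch of the mtime-ordered duplicate detection described above, under some assumptions: one dump is unpacked under a single directory, contents are compared by SHA-256 digest, and the names find_revisits, normalized_name and mtime_regressions are hypothetical helpers invented for illustration, not existing code. It only reports which file would become a revisit record of which; writing the WARC records themselves is left out.

```python
import hashlib
from pathlib import Path


def normalized_name(path):
    """Drop a trailing '~' so an editor backup is stored under its real name."""
    name = Path(path).name
    return name[:-1] if name.endswith("~") else name


def find_revisits(root):
    """Return (newer, older) path pairs whose contents are byte-identical.

    Files are visited oldest first, so the first copy seen of any content
    is kept as the original and every later copy is a revisit candidate.
    """
    files = [p for p in Path(root).rglob("*") if p.is_file()]
    # Sort by modification time; ties are broken by path for determinism.
    files.sort(key=lambda p: (p.stat().st_mtime, str(p)))
    first_seen = {}  # content digest -> oldest path with that content
    revisits = []
    for path in files:
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in first_seen:
            revisits.append((path, first_seen[digest]))
        else:
            first_seen[digest] = path
    return revisits


def mtime_regressions(earlier, later):
    """Names whose mtime in the later tarball is older than in the earlier one.

    Both arguments map file name -> mtime (seconds since the epoch).
    """
    return sorted(
        name for name, mtime in later.items()
        if name in earlier and mtime < earlier[name]
    )
```

When two identical files also share an mtime, as in the listing above, this sketch falls back to path order, so it does not address the concern that the preserved copy may not be the protocol-compliant one.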