When dealing with this, I found the following in the 2005-01-13 dump:
archec@svega:~/bench/2005-01-13/RePEc/bru$ ls -l bruppp00.rdf
-rw-r--r-- 1 archec archec 35394 Jul 13 2004 bruppp00.rdf
archec@svega:~/bench/2005-01-13/RePEc/bru$ ls -l bruppp/bruppp03.rdf
-rw-r--r-- 1 archec archec 35394 Jul 13 2004 bruppp/bruppp03.rdf
They are not just the same size but the same file.
Only the latter is legal as far as the protocol goes.
Now they seem to have the same date as well but I'm sure
there will be cases when one or the other is older.
What to do?
I think we should first sort a RePEc archive by file modification time.
When duplicate contents are found, create a revisit record in the WARC,
noting the modification time of the newer resource, and say
it's a copy of the older resource. Within a single dump,
the file names have to be different anyway.
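The duplicate detection within one dump could look roughly like the sketch below. It only plans which files become resource records and which become revisit records; it does not write any WARC. The function names, the use of SHA-1 for comparison and the tie-break by path when two mtimes are equal are my assumptions, not settled choices.

```python
import hashlib
from pathlib import Path


def sha1_of(path):
    """Return the hex SHA-1 of a file's contents."""
    h = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def plan_records(archive_dir):
    """Yield (kind, path, mtime, refers_to) for every file, oldest first.

    kind is "resource" for contents seen for the first time and
    "revisit" for a later file whose contents match an earlier one.
    """
    files = [p for p in Path(archive_dir).rglob("*") if p.is_file()]
    # Sort by modification time; the path breaks ties (an arbitrary choice).
    files.sort(key=lambda p: (p.stat().st_mtime, str(p)))
    first_seen = {}  # digest -> path of the oldest file with that content
    for p in files:
        digest = sha1_of(p)
        mtime = p.stat().st_mtime
        if digest in first_seen:
            yield ("revisit", p, mtime, first_seen[digest])
        else:
            first_seen[digest] = p
            yield ("resource", p, mtime, None)
```

When both copies carry the same mtime, as in the bru example, the order is decided by the tie-break alone, so that case probably needs a deliberate rule.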
If, when examining the same archive from a different tarball,
the file name is the same, don't create a revisit record.
Just make sure to process the tarballs in order, and assume
the dates on the tarballs are correct. Warn if that assumption
is violated, i.e. if the mtime on a file in a later tarball
is older than its mtime in an earlier tarball.
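The ordering check between two tarballs could be sketched as follows. I assume here that each tarball has already been reduced to a dict mapping file names to mtimes in seconds since the epoch; the function name and that representation are made up for illustration.

```python
import warnings


def check_mtime_order(earlier, later):
    """Warn about files whose mtime goes backwards between two dumps.

    earlier and later map file names to mtimes, one dict per tarball,
    with `earlier` being the older dump. Returns the offending names.
    """
    offenders = []
    for name, mtime in later.items():
        if name in earlier and mtime < earlier[name]:
            warnings.warn(
                f"{name}: mtime {mtime} in later tarball is older than "
                f"{earlier[name]} in earlier tarball"
            )
            offenders.append(name)
    return offenders
```

Files that appear only in one of the two tarballs are ignored by this check.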
The problem with that approach is that the preserved copy
is not necessarily the one that is protocol compliant.
The reason I chose to work on this archive at all is that
it contains files ending in ~, from emacs I guess. So I
first wrote a function that looks for rdf~ files and stores
them as if they had no ~. But I think sorting by time
would not affect this order, since if the ~ file is a genuine
backup, it will be older.
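A minimal sketch of the ~ handling, to make the idea concrete. The names archived_name and versions are hypothetical, and the input is assumed to be a list of (file name, mtime) pairs.

```python
def archived_name(file_name):
    """Map an editor backup such as 'x.rdf~' to 'x.rdf'; leave others alone."""
    if file_name.endswith(".rdf~"):
        return file_name[:-1]
    return file_name


def versions(entries):
    """Group (file_name, mtime) pairs by archived name, oldest first."""
    grouped = {}
    for name, mtime in sorted(entries, key=lambda e: e[1]):
        grouped.setdefault(archived_name(name), []).append((name, mtime))
    return grouped
```

If the ~ file really is the older backup, it comes first in its group, so it would be stored as the earlier version of the same name.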
--
Cheers,
Thomas Krichel http://openlib.org/home/krichel
skype:thomaskrichel
I just started to write a test WARC. Here is the start of a
test file.
WARC/1.0^M
WARC-Type: warcinfo^M
WARC-Record-ID: <urn:uuid:1baaba9e-b976-11eb-aed6-901b0ef71694>^M
WARC-Date: 2021-05-20T14:17:28Z^M
Content-Type: application/warc-fields^M
Content-Length: 232^M
^M
operator: Thomas Krichel <krichel(a)openlib.org>
funder: Fondation Banque de France
project: Lebach, http://governance.repec.org/applications/lebach.docx
conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISino_28500_version1_latestdraft.pdf
^M
^M
WARC/1.0^M
WARC-Type: resource^M
WARC-Target-URI: file:///RePEc/aah/aarhec/aarhec1988.rdf^M
WARC-Date: 2004-04-01T21:21:24Z^M
WARC-Record-ID: <urn:uuid:1baabcec-b976-11eb-aed6-901b0ef71694>^M
Content-Type: application/octet-stream^M
WARC-Block-Digest: sha1:4SQLBS5JEULWYJ7JUJEO5XXCFL5FS7XM^M
Content-Length: 4683^M
^M
Template-Type: ReDIF-Paper 1.0^M
Title: TWO PAPERS ON THE TEST OF LUCAS VARIABILITY HYPOTHESIS.^M
Author-Name: CHRISTENSEN, M.^M
Author-Name: PALDAM, M.^M
Keywords: tests ; supply ; economic theory ; demand^M
Overall this is looking good.
Files can be stored as resource records. The URI is the file
path starting at RePEc. Sure, this is not an absolute file name,
but I don't think we need to be that pedantic. The time
on the resource is the file's time in the tarball that I have. I will take
care to also archive files with a ~ ending as if they were versions
of the file without the tilde.
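For reference, a resource record shaped like the one in the test file could be built along these lines. This is a sketch under assumptions: the function name is made up, the record ID uses a random (version 4) UUID rather than whatever generator produced the sample, and the digest is the base32-encoded SHA-1 of the block, which matches the form of the digest shown above.

```python
import base64
import hashlib
import uuid
from datetime import datetime, timezone


def resource_record(relative_path, payload, mtime):
    """Build one WARC/1.0 resource record as bytes.

    relative_path starts at 'RePEc/...', payload is the file contents,
    mtime is the file's modification time in seconds since the epoch.
    """
    digest = base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")
    date = datetime.fromtimestamp(mtime, tz=timezone.utc).strftime(
        "%Y-%m-%dT%H:%M:%SZ"
    )
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Target-URI: file:///{relative_path}",
        f"WARC-Date: {date}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",  # assumption: random UUID
        "Content-Type: application/octet-stream",
        f"WARC-Block-Digest: sha1:{digest}",
        f"Content-Length: {len(payload)}",
    ]
    # Header lines end in CRLF; an empty line separates them from the block.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # The content block is followed by two CRLF pairs.
    return head + payload + b"\r\n\r\n"
```

Content-Length counts the bytes of the block only, so the payload has to be read in binary mode.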
The UUIDs are almost the same; I still have to find out why.
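A possible explanation, offered as a guess: the two record IDs in the sample differ only in their first group, which is what time-based (version 1) UUIDs generated in quick succession on one machine look like. The check below only inspects the two IDs copied from the test file; it does not show how they were generated.

```python
import uuid

# The two record IDs copied from the test file above.
info_id = uuid.UUID("1baaba9e-b976-11eb-aed6-901b0ef71694")
resource_id = uuid.UUID("1baabcec-b976-11eb-aed6-901b0ef71694")

# Version field of each ID (1 means time-based).
print(info_id.version, resource_id.version)
# Whether both IDs carry the same node field.
print(info_id.node == resource_id.node)
# Difference between the embedded timestamps, in 100 ns units.
print(resource_id.time - info_id.time)
```

If that is the cause, switching to uuid.uuid4() would give IDs that look unrelated, though the version 1 ones are still distinct and valid.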
I intend to add the tarball date to the warcinfo fields.
The idea is to have one file per RePEc archive. Later, we will be
able to run this on a daily basis.
Comments on these choices are very welcome. A bad policy now
will be hard to undo!
I have written to Olaf and Jan about the need for me to have more
disk space. While I hope this project will save disk space it's
not enough. The problem is that darni is 95% full.
--
Cheers,
Thomas Krichel http://openlib.org/home/krichel
skype:thomaskrichel