I have migrated all the archec files from helos to tagol.
Now it makes sense to work on the web site. Expect action
in the next few days.
--
Cheers,
Thomas Krichel http://openlib.org/home/krichel
skype:thomaskrichel
Christian Zimmermann writes
> I may have older data than 2003. I cannot see file dates in the backup.
Well they ought to be there. Without a date on the file, we can't
date the records in the file. In a protocol violation, WARC-Dates
in ArchEc ReDIF data are the dates on the file, not the date I
capture the file. That is different in full-text ArchEc, where
the time of capture is used as usual.
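A minimal sketch of that choice, in Python. The helper name
warc_date_from_mtime is hypothetical and not part of the ArchEc
software. It only shows how a file's modification time, rather than
the time of capture, can be written in the UTC form WARC-Date uses.

```python
import os
from datetime import datetime, timezone


def warc_date_from_mtime(path):
    """Return the file's modification time as a WARC-Date string (UTC).

    Hypothetical helper: the real ArchEc code may do this differently.
    """
    mtime = os.stat(path).st_mtime
    stamp = datetime.fromtimestamp(mtime, tz=timezone.utc)
    return stamp.strftime("%Y-%m-%dT%H:%M:%SZ")
```

If the backup has lost the file dates, this function would return the
date of the restore, which is why the dates have to be in the backup.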
--
Cheers,
Thomas Krichel http://openlib.org/home/krichel
skype:thomaskrichel
This is long, and very important.
I have finished working on the tarballs. Here is the historic listing
repecsnapshot@helos:~$ ls -l archive/
total 1653215736
-rw-r--r-- 1 repecsnapshot repecsnapshot 484660039 Feb 14 2020 RePEc_2005-01-13.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 904066637 Feb 1 2007 RePEc_2007-02-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 941231076 Mar 1 2007 RePEc_2007-03-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 963658331 Apr 1 2007 RePEc_2007-04-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 985673832 May 1 2007 RePEc_2007-05-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 1456266164 Feb 1 2008 RePEc_2008-02-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 1473425648 Mar 1 2008 RePEc_2008-03-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 1862245883 Apr 5 2009 RePEc_2009-04-05.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 6194425874 May 1 2009 RePEc_2009-05-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 6449767907 Jun 1 2009 RePEc_2009-06-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 6494054786 Jul 1 2009 RePEc_2009-07-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 6454682521 Aug 1 2009 RePEc_2009-08-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 6488278395 Sep 1 2009 RePEc_2009-09-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 6537097218 Oct 1 2009 RePEc_2009-10-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 6602806339 Nov 1 2009 RePEc_2009-11-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 6633233978 Dec 1 2009 RePEc_2009-12-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 6674276232 Jan 1 2010 RePEc_2010-01-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 3056671753 Feb 1 2010 RePEc_2010-02-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 2271868077 Mar 1 2010 RePEc_2010-03-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 2299346881 Apr 1 2010 RePEc_2010-04-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 2351089160 May 1 2010 RePEc_2010-05-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 133496832 Jun 1 2010 RePEc_2010-06-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 2370259005 Jul 1 2010 RePEc_2010-07-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 2139186858 Aug 1 2010 RePEc_2010-08-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 2210199963 Sep 1 2010 RePEc_2010-09-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 2257736575 Oct 1 2010 RePEc_2010-10-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 2363757937 Nov 1 2010 RePEc_2010-11-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 2466406462 Dec 1 2010 RePEc_2010-12-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 2602037604 Jan 1 2011 RePEc_2011-01-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 2639670405 Feb 1 2011 RePEc_2011-02-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 2706271220 Mar 1 2011 RePEc_2011-03-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 2774583851 Apr 1 2011 RePEc_2011-04-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 2827230082 May 1 2011 RePEc_2011-05-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 2954008530 Jun 1 2011 RePEc_2011-06-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 3029854979 Jul 1 2011 RePEc_2011-07-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 3081255473 Aug 1 2011 RePEc_2011-08-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 3213419623 Sep 1 2011 RePEc_2011-09-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 3322078071 Oct 1 2011 RePEc_2011-10-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 3419023452 Nov 1 2011 RePEc_2011-11-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 3553889294 Dec 1 2011 RePEc_2011-12-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 3752766704 Jan 1 2012 RePEc_2012-01-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 3838959870 Feb 1 2012 RePEc_2012-02-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 3942625941 Mar 1 2012 RePEc_2012-03-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 3974520415 Apr 1 2012 RePEc_2012-04-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 4034926857 May 1 2012 RePEc_2012-05-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 4071460961 Jun 1 2012 RePEc_2012-06-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 4284563770 Jul 1 2012 RePEc_2012-07-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 4261127463 Aug 1 2012 RePEc_2012-08-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 4302407553 Sep 1 2012 RePEc_2012-09-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 4348557297 Oct 1 2012 RePEc_2012-10-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 4688488319 Nov 1 2012 RePEc_2012-11-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 5641946931 Jul 11 2016 RePEc_2013-02-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 5711299146 Feb 25 2013 RePEc_2013-02-25.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 6120639264 Mar 1 2013 RePEc_2013-03-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 6197183316 Apr 1 2013 RePEc_2013-04-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 6294581775 May 1 2013 RePEc_2013-05-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 6536699176 Jul 1 2013 RePEc_2013-07-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 6515265030 Aug 1 2013 RePEc_2013-08-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 6726546517 Sep 1 2013 RePEc_2013-09-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 7040050302 Oct 1 2013 RePEc_2013-10-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 7176390568 Nov 1 2013 RePEc_2013-11-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 7705018567 Jan 1 2014 RePEc_2014-01-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 7819037451 Feb 1 2014 RePEc_2014-02-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 8055801010 Mar 1 2014 RePEc_2014-03-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 8154382673 Apr 1 2014 RePEc_2014-04-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 54249581051 Sep 18 2015 RePEc_2015-09-18.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 54292904727 Oct 1 2015 RePEc_2015-10-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 54477600620 Nov 1 2015 RePEc_2015-11-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 17861613243 Dec 1 2015 RePEc_2015-12-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 18152473788 Jan 1 2016 RePEc_2016-01-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 18643207732 Feb 1 2016 RePEc_2016-02-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 19244499570 Mar 1 2016 RePEc_2016-03-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 12390121472 Apr 1 2016 RePEc_2016-04-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 19202734174 May 1 2016 RePEc_2016-05-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 19729705031 Jun 1 2016 RePEc_2016-06-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 19906775204 Jul 1 2016 RePEc_2016-07-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 12988989440 Aug 1 2016 RePEc_2016-08-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 20380834585 Sep 1 2016 RePEc_2016-09-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 20604648388 Oct 1 2016 RePEc_2016-10-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 21685010228 Nov 1 2016 RePEc_2016-11-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 22583953287 Dec 1 2016 RePEc_2016-12-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 23748121441 Jan 1 2017 RePEc_2017-01-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 23988179882 Feb 1 2017 RePEc_2017-02-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 23892989701 Mar 1 2017 RePEc_2017-03-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 24571501618 Apr 1 2017 RePEc_2017-04-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 25381949487 May 1 2017 RePEc_2017-05-01.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 32375688674 Jan 7 2020 RePEc_2020-01-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 32752245757 Feb 7 2020 RePEc_2020-02-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 33228125218 Mar 7 2020 RePEc_2020-03-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 33425579195 Apr 7 2020 RePEc_2020-04-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 33724114112 May 7 2020 RePEc_2020-05-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 36387840691 Jun 7 2020 RePEc_2020-06-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 36978171038 Jul 7 2020 RePEc_2020-07-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 37175093933 Aug 7 2020 RePEc_2020-08-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 37606509948 Sep 7 2020 RePEc_2020-09-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 37806459795 Oct 7 2020 RePEc_2020-10-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 38279176241 Nov 7 2020 RePEc_2020-11-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 38520442306 Dec 7 2020 RePEc_2020-12-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 39072868567 Jan 7 2021 RePEc_2021-01-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 39444298486 Feb 7 2021 RePEc_2021-02-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 39742557323 Mar 7 2021 RePEc_2021-03-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 39915211339 Apr 7 2021 RePEc_2021-04-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 50509905007 May 7 2021 RePEc_2021-05-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 50706129419 Jun 7 23:34 RePEc_2021-06-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 51138502107 Jul 7 23:20 RePEc_2021-07-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 52134005876 Aug 7 23:50 RePEc_2021-08-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 52359572651 Sep 7 23:51 RePEc_2021-09-07.tar.gz
-rw-r--r-- 1 repecsnapshot repecsnapshot 52793698280 Oct 8 00:00 RePEc_2021-10-07.tar.gz
repecsnapshot@helos:~$ du -sb archive/
1692892034551 archive/
So it's roughly 1.7 terabytes. Some of them appear to contain
Hollywood movies.
Much of the ArchEc work of extracting templates is automated in the
software I wrote. The template data is in the vault
archec@svega:~$ du -sb vault
9180530114 vault
So the vault is 0.54% of the tarballs.
All the non-template data in the archive had to be manually sorted
into material that we want to keep. It is in the cellar
archec@svega:~$ du -sb cellar/
61256595997 cellar/
Thus the vault is about 15% the size of the cellar (9180530114 /
61256595997 is roughly 0.15). But that comparison
is somewhat misleading, because the vault is compressed tarballs,
one per RePEc archive, whereas the cellar is by tarball date
archec@svega:~/cellar$ du -sb * | less
634287180 2005-01-13
536861971 2007-02-01
44834272 2007-03-01
25349991 2007-04-01
21045749 2007-05-01
498172493 2008-02-01
17374225 2008-03-01
464236588 2009-04-05
94967981 2009-05-01
27444649 2009-06-01
45358045 2009-07-01
36796993 2009-08-01
25261355 2009-09-01
52208902 2009-10-01
71659459 2009-11-01
27825748 2009-12-01
51560339 2010-01-01
57371378 2010-02-01
24226893 2010-03-01
25456810 2010-04-01
49648187 2010-05-01
5224014 2010-06-01
43502725 2010-07-01
38417643 2010-08-01
55986330 2010-09-01
51172577 2010-10-01
129154130 2010-11-01
81382474 2010-12-01
151121065 2011-01-01
33716019 2011-02-01
81951320 2011-03-01
78195257 2011-04-01
42583732 2011-05-01
99387926 2011-06-01
88551512 2011-07-01
63851460 2011-08-01
168786636 2011-09-01
89232615 2011-10-01
105594132 2011-11-01
210477319 2011-12-01
191575866 2012-01-01
114081797 2012-02-01
164878279 2012-03-01
60185290 2012-04-01
66670237 2012-05-01
53926838 2012-06-01
219640488 2012-07-01
109846291 2012-08-01
67363920 2012-09-01
55089588 2012-10-01
357429531 2012-11-01
779190379 2013-02-07
169804007 2013-02-25
293097014 2013-03-01
132990958 2013-04-01
126529750 2013-05-01
248474898 2013-07-01
91941447 2013-08-01
244424689 2013-09-01
355380515 2013-10-01
180926842 2013-11-01
491438262 2014-01-01
199507171 2014-02-01
304653827 2014-03-01
125102163 2014-04-01
11951277650 2015-09-18
54133492 2015-10-01
179301723 2015-11-01
216168974 2015-12-01
296785856 2016-01-01
602075445 2016-02-01
550662685 2016-03-01
203808483 2016-04-01
285364623 2016-05-01
392463192 2016-06-01
343421203 2016-07-01
569052659 2016-08-01
259768124 2016-09-01
340470924 2016-10-01
1353382811 2016-11-01
244735724 2016-12-01
221196926 2017-01-01
314834229 2017-02-01
203896839 2017-03-01
432300255 2017-04-01
113302552 2017-05-01
9293990047 2020-01-07
697589478 2020-02-07
230345565 2020-03-07
235876226 2020-04-07
344132561 2020-05-07
2754910067 2020-06-07
575383272 2020-07-07
457447081 2020-08-07
538063845 2020-09-07
505377160 2020-10-07
554300026 2020-11-07
298479749 2020-12-07
541246056 2021-01-07
732556684 2021-02-07
244025703 2021-03-07
361546881 2021-04-07
11993508900 2021-05-07
279982020 2021-06-07
599068731 2021-07-07
459875968 2021-08-07
246665705 2021-09-07
529427666 2021-10-07
and the data is not compressed.
Within each cellar date, I have taken care to create a symlink to an
earlier version if that is identical. But if I don't operate by
date, well then I can't have two versions of the same file, unless I
would build a warc archive similar to what I have in the
vault. Ideally, I would merge the cellar into the vault. At this
time, this is not done. Getting to this stage has already been a
Herculean task that has occupied me since July, and for which I am
paid only 2000 Euros.
So this is the cellar, and then there is the trash
archec@svega:~$ du -sb trash
150200223 trash
Of the trash, we only really have to note the checksums, but for the
vast majority, I actually have the full files, in a file name that
is the basename of the file prefixed by the SHA1 of the contents
archec@svega:~/trash$ ls -lrt | tail -5
-rw-r--r-- 1 archec archec 245 Oct 7 15:03 2KKDANE66S53P32ILOCPH5DYWCKHRFUD_borra_031.rdf
-rw-r--r-- 1 archec archec 245 Oct 7 15:03 2FFBOX4DCJNZSQOVT3F63L444KIXH3C4_2017a-10-195-255.rdf
-rw-r--r-- 1 archec archec 245 Oct 7 15:03 2DQ243MZNAAFZD7V4RAQPY46TZ5CZODI_borra_051.rdf
-rw-r--r-- 1 archec archec 245 Oct 7 15:03 2DH2QTRCGI6G5LZO4QKSTC67A5UWGJIP_2008-08.rdf
-rw-r--r-- 1 archec archec 245 Oct 7 15:03 24V7M67PQAPD45UA5SL4A7SQUK6N64QA_2014-05-505-526.rdf
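The naming scheme in this listing can be sketched as follows. This
assumes that the 32-character prefix is the base32 encoding of the
SHA1, as in the WARC-Block-Digest values. The function name
trash_name is hypothetical, and the underscore separator is read off
the listing above.

```python
import base64
import hashlib
import os


def trash_name(path, contents):
    """Prefix the basename with the base32-encoded SHA1 of the contents.

    Assumption: base32 is the encoding used; a SHA1 is 20 bytes, so the
    prefix is always 32 characters and needs no padding.
    """
    digest = base64.b32encode(hashlib.sha1(contents).digest()).decode("ascii")
    return digest + "_" + os.path.basename(path)
```

Two files with the same basename but different contents get
different names, and identical junk collapses into one file.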
The last file as an example:
archec@svega:~/trash$ cat 24V7M67PQAPD45UA5SL4A7SQUK6N64QA_2014-05-505-526.rdf
<html><head><title>Request Rejected</title></head><body>The requested URL was rejected. Please consult with your administrator.<br><br>Your support ID is: <17685956588297156091><br><br><a href='javascript:history.back();'>[Go Back]</body></html>archec@svega:~/trash$
This is from RePEc:bdr, an archive that recently expanded and has only
junk. Clearly this example shows an important weakness of rarch.
It currently has no support for reading a potential template file and
classifying it by its contents. It has not been an important issue, but
RePEc:bdr makes it important.
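A rough sketch of what such a check by contents could look like.
This is not how rarch works today. The function name
looks_like_redif and the two rules (reject anything that starts like
an HTML page, require at least one Template-Type line) are
assumptions for illustration only.

```python
import re

# A ReDIF template starts with a Template-Type field at the start of a line.
TEMPLATE_TYPE = re.compile(rb"^\s*Template-Type\s*:", re.IGNORECASE | re.MULTILINE)


def looks_like_redif(contents):
    """Guess from the bytes alone whether a file could hold ReDIF templates."""
    if contents.startswith(b"\xef\xbb\xbf"):
        contents = contents[3:]          # ignore a UTF-8 byte order mark
    head = contents.lstrip()[:512].lower()
    if head.startswith(b"<html") or head.startswith(b"<!doctype"):
        return False                     # an error page served as .rdf
    return TEMPLATE_TYPE.search(contents) is not None
```

Such a check would catch the "Request Rejected" page above. It would
also flag files that have no Template-Type at all, which would still
need a look by hand.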
There are a few rare occasions in the trash where the data contains
no Template-Type, but otherwise seems to be correct ReDIF data. I have
not corrected them manually. The only corrections I made were for
garbled UTF-16-LE data, where I took empty bytes out. And I think
once I manually changed en-dashes into normal dashes to be able to
incorporate a file. Clearly starting to fix more would result in me
not finishing by the end of the year. The only thing more that
could be done is to fix the few files without a Template-Type.
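For the UTF-16-LE case, taking the empty bytes out amounts to
something like the sketch below. The function name
strip_empty_bytes is hypothetical. It is only safe when the text is
in the ASCII range, because other characters use both bytes of each
UTF-16 unit.

```python
def strip_empty_bytes(data):
    """Drop NUL bytes, and a leading UTF-16-LE byte order mark if present.

    Assumption: the text is plain ASCII stored as UTF-16-LE, so every
    second byte is NUL and carries no information.
    """
    if data.startswith(b"\xff\xfe"):
        data = data[2:]
    return data.replace(b"\x00", b"")
```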
To summarize, we can use the shuftis. They contain summaries
of the records only. Example
archec@svega:~/shufti$ zcat cit.json.gz
{
"RePEc:cit": {
"2004-12-08T20:27:25Z": [
1264,
293
],
"2013-03-10T16:49:37Z": [
2862,
293
]
}
}
archec@svega:~/rarch/bin$ ./sumshu
4574547 11268778
The first number is the number of records, and the second the number
of instances of those records. So we have 2.46 instances per
record. Presumably, if I were to run this on a weekly basis, the
number of instances would increase.
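The counting can be sketched like this. It is not the actual sumshu
script. It assumes that every shufti is a gzipped JSON file shaped
like cit.json.gz above, that each top-level key is one record, and
that each timestamp below it is one instance. The function name
sum_shuftis is hypothetical.

```python
import gzip
import json
from pathlib import Path


def sum_shuftis(directory):
    """Count records and their instances over all *.json.gz shuftis."""
    records = 0
    instances = 0
    for path in sorted(Path(directory).glob("*.json.gz")):
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            shufti = json.load(handle)
        # Assumption: one top-level key per record, one nested key per instance.
        for versions in shufti.values():
            records += 1
            instances += len(versions)
    return records, instances
```

With the figures above, 11268778 / 4574547 rounds to 2.46 instances
per record.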
The immediate step is to remove the tarballs from being backed up by
aigtu. Aigtu was 98% full recently. Then, I will move archec to
helos, and delete tarballs there over time, starting with ones that
are not that useful, like the most recent ones. At the same time, I
will start a live survey of the data. The live survey will work in a
different way from the tarballs but the main infrastructure is there.
I give myself my heartfelt congratulations for this work.
--
Cheers,
Thomas Krichel http://openlib.org/home/krichel
skype:thomaskrichel
Hi gang,
I'm still working on the tarballs of ArchEc. Here is a
submission to a meeting that explains where I am heading
in non-technical terms. The technical details are still in
progress, even after two months. It's just that there is
a lot of junk to fight with in the RePEc tarballs.
-----------------------------------------------------------------
In digital archiving, it is quite common to start with records in
files. If each file contains just one record, then preserving the
files is essentially the same as preserving the records. But what if
the files contain several records? And what if the files change over
time? For example, the files may be harvested from a bunch of
sources. These sources may be creating poorly-formed records. Each
source offers files that contain records. We can harvest the current
versions of these files and they end up on our disk ... how to
preserve them?
One approach is to preserve the files. In that case we lose no
information. We can just look at what files are changing, and copy the
changed ones into an archival location. I see two problems. (1) If
there are files in which records are accumulating, we are wasting disk
space because we store the same record over and over again. (2) What
about consumers of our archival material? They presumably do not care
about the files. They want the records. If we just tell them, “hey,
here are the files, go figure”, we are not likely to get much buy-in.
Another approach is to split the files into records, and preserve the
records. Then we solve the two problems with the file-based
approach. But we get more serious problems. We lose any information
that was tied to the fact that records were in the same file. If the
software used to split the records was buggy, or if we change our
opinion about record borders, or the nature of records, we have no way
to get back. We could circumvent that with some form of clever
metadata. It has to be clever because the data may be broken in all
sorts of ways.
The prior art that led me to this problem is the ArchEc
http://archec.repec.org project. It is a humble effort to preserve
RePEc http://repec.org data. A first stage received €3000 funding from
the Fondation Banque de France pour la recherche économique. In this
stage, I worked on preserving full-text instances pointed to in the
RePEc data. In the current stage, funded with €2000 by the same
funder, the work ultimately aims to preserve the actual RePEc
records. But even in the funding application, available at
http://governance.repec.org/applications/lebach.docx, I only pledged
to work on preserving files, because I was unsure about how to
preserve records.
Well, on 2 August 2021, I had an ingenious idea how to preserve files
and records. I have implemented it. Thus, I have theory, software and
a wealth of experience that I will share during the talk. While the
work is obviously made for RePEc, the conceptual framework and the
methods used apply to any time-varying collections of records that are
in files. And the data moves into WARCs. I guess that the audience
will be familiar with that format.
--
Cheers,
Thomas Krichel http://openlib.org/home/krichel
skype:thomaskrichel
I just finished adding the 2007-02-01 date to the 2005-01-13,
so for the first time I have warcs for two dumps.
I have not deleted data yet, although I have the software done.
Neither have I worked on cleaning out junk from the remaining
data, once the assumed template files are in the warc.
I intend to manually work through non-decodable handles. This
could be a shitload of work.
--
Cheers,
Thomas Krichel http://openlib.org/home/krichel
skype:thomaskrichel
I'm returning to ArchEc after five months in ernad land.
I'm completely moving away from preserving just files, after getting
an idea on Monday August 2 about creating a format that would
preserve both files *AND* ReDIF records. Here is the first record I
wrote today in a WARC
WARC/1.0
WARC-Type: resource
WARC-Target-URI: RePEc:tcd:tcduee:991
WARC-Date: 2004-06-03T16:38:49Z
WARC-Record-ID: <urn:uuid:ca86a44f-1c74-4c21-a513-bafde962dd88>
Content-Type: application/octet-stream
WARC-Block-Digest: sha1:QG75JDQT6XITM67IS3FKNRTBMDMPSSCG
Content-Length: 1855
Template-Type: ReDIF-Paper 1.0
Author-Name: Drudy, P.J. and Punch, M
Author-Postal: Department of Economics, Trinity College, Dublin 2, Ireland
Title: The "Regional Problem", Urban Disadvantage and Development
Classification-Jel:
Abstract: Using a range of data on population, the labour force, employment, unemployment and incomes, Section 1 of this paper outlines the changing nature of the "regional problem" and offers an assessment of regional performance in Ireland over the last 25 years. In 1971 there was some justification for concluding that Dublin was performing well in comparison to other regions, particularly in the western and north-western parts of the country, but this generalisation is no longer tenable. Section 2 examines the problem of urban disadvantage with particular reference to the Dublin Region. This section also focuses on the meaning of development and whether the groups experiencing disadvantage benefit from the development process. The high levels of unemployment, educational disadvantage, lone-parent households, as well as the high proportion of people in the unskilled or semi-skilled social classes, all suggest that a substantial portion of the population has been largely excluded from the benefits of economic and social progress over the recent years.
Creation-Date: 1999
X-Acknowledgements: We are grateful for the helpful comments and encouragement of various colleagues in Trinity College and, in particular, Andrew MacLaran, Alan Matthews and Frances Ruane. Any remaining inadequacies are obviously our responsibility. The views expressed in this paper are those of the authors and do not reflect the views of the Department of Economics, Trinity College, Dublin.
File-URL: http://www.economics.tcd.ie/tep/tepno1PJ99.PDF
File-Format: application/pdf
Handle: RePec:tcd:tcduee:991
I abuse the RePEc handle (after normalizing it) as the WARC-Target-URI.
More on this soon.
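As an illustration of the normalizing: the record above says
"RePec:tcd:tcduee:991" in its Handle field, while the
WARC-Target-URI says "RePEc:tcd:tcduee:991". A minimal sketch that
covers just this case is below. The function name normalize_handle
is hypothetical, and the real normalization may do more than fix the
letter case of the prefix.

```python
def normalize_handle(handle):
    """Rewrite a leading 'repec' in any letter case as the canonical 'RePEc'.

    Assumption: only the prefix is normalized; the rest is left untouched.
    """
    handle = handle.strip()
    prefix, sep, rest = handle.partition(":")
    if sep and prefix.lower() == "repec":
        return "RePEc:" + rest
    return handle
```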
--
Cheers,
Thomas Krichel http://openlib.org/home/krichel
skype:thomaskrichel
When dealing with this, from the 2005-01-13 dump
archec@svega:~/bench/2005-01-13/RePEc/bru$ ls -l bruppp00.rdf
-rw-r--r-- 1 archec archec 35394 Jul 13 2004 bruppp00.rdf
archec@svega:~/bench/2005-01-13/RePEc/bru$ ls -l bruppp/bruppp03.rdf
-rw-r--r-- 1 archec archec 35394 Jul 13 2004 bruppp/bruppp03.rdf
Not only just the same size but the same file.
Only the latter is legal as far as the protocol goes.
Now they seem to have the same date as well but I'm sure
there will be cases when one or the other is older.
What to do?
I think, first sort a RePEc archive by file modification times.
When duplicate contents are found, create a revisit record for warc,
noting the modification time of the newer resource, and say
it's a copy of the older resource. Within a single dump,
the file names have to be different anyway.
If, when examining the same archive from a different tarball,
the file name is the same, don't create a revisit record.
Just make sure to process the tarballs in order, and assume
the dates on the tarballs are correct. Warn if that assumption
is violated, i.e. if the mtime on a file in a later tarball
is older than the mtime in an earlier tarball.
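The plan can be sketched as below. This is an illustration, not the
software I run. The names Planned, plan_records and mtime_warnings
are hypothetical, and breaking ties between equal mtimes by file
name is an assumption that the text above does not settle.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Planned:
    name: str        # file name relative to the archive
    mtime: float     # modification time taken from the tarball
    kind: str        # "resource" or "revisit"
    refers_to: str   # name of the older identical file, or ""


def plan_records(files):
    """files: iterable of (name, mtime, contents) for one archive in one dump."""
    seen = {}   # SHA1 digest -> name of the oldest file with those contents
    plan = []
    # Oldest first; equal mtimes are ordered by name (an assumption).
    for name, mtime, contents in sorted(files, key=lambda f: (f[1], f[0])):
        digest = hashlib.sha1(contents).hexdigest()
        if digest in seen:
            plan.append(Planned(name, mtime, "revisit", seen[digest]))
        else:
            seen[digest] = name
            plan.append(Planned(name, mtime, "resource", ""))
    return plan


def mtime_warnings(earlier, later):
    """Names whose mtime in the later tarball is older than in the earlier one.

    earlier, later: dicts mapping file name to mtime.
    """
    return sorted(name for name, mtime in later.items()
                  if name in earlier and mtime < earlier[name])
```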
The problem with that approach is that the preserved copy
is not necessarily the one that is protocol compliant.
The reason I chose to work on this archive at all is that
it contains files ending in ~, from emacs I guess. So I
first wrote a function that looks for rdf~ files and stores
these as if they had no ~. But I think sorting by time
would not impact this order since if the ~ is a genuine
backup, it will be older.
--
Cheers,
Thomas Krichel http://openlib.org/home/krichel
skype:thomaskrichel
I just started to write a test WARC. Here is the start of a
test file.
WARC/1.0^M
WARC-Type: warcinfo^M
WARC-Record-ID: <urn:uuid:1baaba9e-b976-11eb-aed6-901b0ef71694>^M
WARC-Date: 2021-05-20T14:17:28Z^M
Content-Type: application/warc-fields^M
Content-Length: 232^M
^M
operator: Thomas Krichel <krichel(a)openlib.org>
funder: Fondation Banque de France
project: Lebach, http://governance.repec.org/applications/lebach.docx
conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISino_28500_version1_latestdraft.pdf
^M
^M
WARC/1.0^M
WARC-Type: resource^M
WARC-Target-URI: file:///RePEc/aah/aarhec/aarhec1988.rdf^M
WARC-Date: 2004-04-01T21:21:24Z^M
WARC-Record-ID: <urn:uuid:1baabcec-b976-11eb-aed6-901b0ef71694>^M
Content-Type: application/octet-stream^M
WARC-Block-Digest: sha1:4SQLBS5JEULWYJ7JUJEO5XXCFL5FS7XM^M
Content-Length: 4683^M
^M
Template-Type: ReDIF-Paper 1.0^M
Title: TWO PAPERS ON THE TEST OF LUCAS VARIABILITY HYPOTHESIS.^M
Author-Name: CHRISTENSEN, M.^M
Author-Name: PALDAM, M.^M
Keywords: tests ; supply ; economic theory ; demand^M
Overall this is looking good.
Files can be stored as resource records. The URI is the file
starting with RePEc. Sure this is not an absolute file name
but I don't think we need to be that pedantic. The time
on the resource is the time in the tarball that I have. I will take
care to also archive files with a ~ ending as if they are versions
of the file without the tilde.
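These choices can be put together in a short sketch. It is not the
code that wrote the test file above. The function name
resource_record is hypothetical, and the use of uuid4 for the record
ID is my choice for the sketch only.

```python
import base64
import hashlib
import uuid
from datetime import datetime, timezone

CRLF = b"\r\n"


def resource_record(relative_path, mtime, contents):
    """Build one WARC/1.0 resource record for a file below RePEc/."""
    if relative_path.endswith("~"):
        relative_path = relative_path[:-1]   # a ~ backup counts as a version
    digest = base64.b32encode(hashlib.sha1(contents).digest()).decode("ascii")
    date = datetime.fromtimestamp(mtime, tz=timezone.utc)
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        "WARC-Target-URI: file:///" + relative_path,
        "WARC-Date: " + date.strftime("%Y-%m-%dT%H:%M:%SZ"),
        # Assumption: a random UUID; the test file above uses other IDs.
        "WARC-Record-ID: <urn:uuid:" + str(uuid.uuid4()) + ">",
        "Content-Type: application/octet-stream",
        "WARC-Block-Digest: sha1:" + digest,
        "Content-Length: " + str(len(contents)),
    ]
    head = CRLF.join(line.encode("utf-8") for line in headers)
    # Headers, an empty line, the block, then two CRLF to end the record.
    return head + CRLF + CRLF + contents + CRLF + CRLF
```

The contents go in unchanged, so the digest and the length are those
of the file as it sits in the tarball.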
The UUID is the same, I still have to find out why.
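One observation that may help here, offered as a guess and not as a
diagnosis: the two record IDs in the test file are not identical
byte for byte, they differ in the first group. Python's standard
uuid module can show which parts they share.

```python
import uuid

# The two record IDs copied from the test file above.
warcinfo_id = uuid.UUID("1baaba9e-b976-11eb-aed6-901b0ef71694")
resource_id = uuid.UUID("1baabcec-b976-11eb-aed6-901b0ef71694")

# The version number says how the IDs were generated.
print(warcinfo_id.version, resource_id.version)
# The node is the last group; compare it across the two IDs.
print(warcinfo_id.node == resource_id.node)
# Difference between the embedded timestamps, in units of 100 nanoseconds.
print(resource_id.time - warcinfo_id.time)
```

If IDs that look unrelated are wanted, uuid.uuid4() produces random
ones. Whether that is the right fix is an open question.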
I intend to add the tarball date to the warcinfo fields.
The idea is to have one file per RePEc archive. Later, we will be
able to run this on a daily basis.
Comments on these choices are very welcome. A bad policy now
will be hard to undo!
I have written to Olaf and Jan about the need for me to have more
disk space. While I hope this project will save disk space it's
not enough. The problem is that darni is 95% full.
--
Cheers,
Thomas Krichel http://openlib.org/home/krichel
skype:thomaskrichel
I have been talking for 51 minutes now with two representatives of
this company, copied here. I suggested they can use ArchEc data to
mine it for plagiarism, and pointed them to the vault and plind
data. They said they can't do without metadata. I said they could
construct a link from the handle in the plind, say to Econpapers.
This is a problem because the plind data may refer to handles that
are no longer valid. It's probably quite rare, but it still is an
issue. This could only be addressed by archiving the metadata in
WARCs, which is where I would like to go in the long run. I don't
myself know how long RePEc web sites keep data for handles that
drop out of RePEc.
If they need fielded data I could potentially write a converter for
them from ReDIF to whatever format they use, as a consultancy
job. But a link to econpapers should do for now.
--
Cheers,
Thomas Krichel http://openlib.org/home/krichel
skype:thomaskrichel