Thomas Krichel writes
> Could you be interested in ArchEc data?
Christoph Semken writes
> The description on the website is very brief. What is the difference with
> RePEc? Does it mainly add versioning? This could be interesting since I
> would not have to keep track of the changes myself (to notify users when an
> article they bookmarked changed).
Well, the idea is to archive RePEc. There are really two aspects to
that. One is to archive all RePEc metadata records and all versions
that have existed of these records. That part is not done. All we
have at this time is a series of dumps of the records, each taken at
a particular time. The other is to archive all full-text contents, in
the sense of the File-URL payloads in all RePEc archives, and all
versions thereof, within reason. That part is partially done.
Both parts are supposed to be merged, in the following sense. There
will be a single file per RePEc handle. It will contain all versions
of the metadata and all versions of the full text. The format will
be WARC, as developed by the Internet Archive. I have released
an initial set of WARCs and a set of plind files. These are PayLoad
INDex files, in JSON, one per series, that record where the full
text can be found in the WARCs.
If your institution could sponsor a server for ArchEc that would be
welcome. Even if you could rsync a copy that would be great.
I will send you a mail recently sent to ArchEc-run under separate
cover.
--
Cheers,
Thomas Krichel http://openlib.org/home/krichel
skype:thomaskrichel
Bryan Newbold writes
> RePEc has been on our list to harvest and crawl for some time,
RePEc is not something you can crawl, in the web sense of the term.
Your best bet is to download RePEc metadata and read that.
As Christian Zimmermann has hinted, I am working on ArchEc, a project
to digitally archive RePEc. I intend to archive all of RePEc, not
just working papers, but also the metadata records that we
have. Progress has been extremely modest because of lack of resources.
In 2019, we got a 3k Euro subsidy. I am in the second year of working
on this. The software I wrote is intended to be fairly generic, in
the sense that it takes a pile of metadata, extracts the full-text
links, and downloads what they point to. My severe mistake was to use
wget to do the actual downloads. A lot of the current code checks and
corrects wget output. But I have a first set of WARC files available.
You can get them with
krichel@trabbi/tmp$ rsync -av --exclude '*.cdx' rsync://archec.repec.org/vault/ vault/
The vault has the actual archives. I will have roughly 900k
docs available by the end of the year.
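To illustrate the "pile of metadata in, full-text links out" step
mentioned above, here is a minimal sketch in Python. It is not the
ArchEc code. It assumes the metadata is ReDIF-style "Field: value"
text in which each record starts with a Template-Type field, carries
its identifier in a Handle field, and lists full text in File-URL
fields. The function name, the sample records and the example.org URL
are made up for illustration, and continuation lines are ignored.

```python
import re

# Assumption: one "Field: value" pair per line, ReDIF style.
FIELD = re.compile(r"^([A-Za-z][A-Za-z0-9-]*):\s*(.*)$")


def extract_file_urls(redif_text):
    """Map each record handle to the list of its File-URL values."""
    records = []
    current = None
    for line in redif_text.splitlines():
        match = FIELD.match(line)
        if match is None:
            continue  # blank or continuation line, ignored in this sketch
        name = match.group(1).lower()
        value = match.group(2).strip()
        if name == "template-type":
            # Assumption: a Template-Type field opens a new record.
            current = {"handle": None, "urls": []}
            records.append(current)
        elif current is None:
            continue  # field seen before any record started
        elif name == "handle":
            current["handle"] = value
        elif name == "file-url" and value:
            current["urls"].append(value)
    # Records without a handle cannot be archived per handle, so drop them.
    return {r["handle"]: r["urls"] for r in records if r["handle"]}


# Hypothetical sample records, for illustration only.
sample = """Template-Type: ReDIF-Paper 1.0
Title: An example paper
File-URL: http://example.org/wp/0001.pdf
File-Format: application/pdf
Handle: RePEc:xxx:wpaper:0001

Template-Type: ReDIF-Paper 1.0
Title: A paper without full text
Handle: RePEc:xxx:wpaper:0002
"""

links = extract_file_urls(sample)
for handle, urls in links.items():
    print(handle, urls)
```

Keying the result by handle matches the plan of one archive file per
RePEc handle. A record with no File-URL still shows up, with an empty
list, since its metadata is to be archived too.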
I also provide payload indexing data "plind", aggregated by RePEc
series for user services to deliver full text out of the archives,
reachable by
krichel@trabbi/tmp$ rsync -av rsync://archec.repec.org/plind/ plind/
> Do you know anybody with familiarity of both RePEc and programming
> (eg, Python)
Me. I code in Python, XSLT, Perl, Javascript...
> who would be interested in taking a stab at this?
Interested yes, but if there is no funding for it, it will have to wait
until the rest of ArchEc is done. That's a matter of years rather
than months. That said, with a current income of about 600 Euros
a month, I'm ready to work.
--
Cheers,
Thomas Krichel http://openlib.org/home/krichel
skype:thomaskrichel