Bryan Newbold writes

> RePEc has been on our list to harvest and crawl for some time,

RePEc is not something you can crawl, in the web sense of the term. Your best bet is to download RePEc metadata and read that.

As Christian Zimmermann has hinted, I am working on ArchEc, a project to digitally archive RePEc. I intend to archive all of RePEc: not just working papers, but also the metadata records that we have. Progress has been extremely modest because of a lack of resources. In 2019, we got a 3k Euro subsidy. I am in the second year of working on this. The software I wrote is intended to be fairly generic, in the sense that it takes a pile of metadata and extracts full-text links. My severe mistake was to use wget to do the actual downloads. A lot of the current code checks and corrects wget output. But I have a first set of WARC files available. You can get them with

krichel@trabbi/tmp$ rsync -av --exclude '*.cdx' rsync://archec.repec.org/vault/ vault/

The vault has the actual archives. I will have roughly 900k docs available by the end of the year.

I also provide payload indexing data ("plind"), aggregated by RePEc series, for user services to deliver full text out of the archives, reachable by

krichel@trabbi/tmp$ rsync -av rsync://archec.repec.org/plind/ plind/

> Do you know anybody with familiarity with both RePEc and programming (eg, Python)

Me. I code in Python, XSLT, Perl, Javascript...

> who would be interested in taking a stab at this?

Interested, yes, but if there is no funding for it, it will have to wait until the rest of ArchEc is done. That's a matter of years rather than months. That said, with a current income of about 600 Euros a month, I'm ready to work.

--
Cheers,
Thomas Krichel http://openlib.org/home/krichel skype:thomaskrichel
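[The link-extraction step Thomas describes (take a pile of metadata, extract full-text links) could be sketched roughly as follows. The sample record and the regex-based parsing are illustrative assumptions only; real ReDIF records vary, and this is not ArchEc's actual code.]

```python
import re

# A made-up ReDIF-style record, for illustration only.
SAMPLE_REDIF = """\
Template-Type: ReDIF-Paper 1.0
Title: An Example Working Paper
Author-Name: Jane Doe
File-URL: https://example.org/papers/wp001.pdf
File-Format: application/pdf
Handle: RePEc:xxx:wpaper:001
"""

def extract_fulltext_links(redif_text):
    """Return all File-URL field values found in a ReDIF record."""
    return re.findall(r"^File-URL:\s*(\S+)\s*$", redif_text, flags=re.MULTILINE)

print(extract_fulltext_links(SAMPLE_REDIF))  # ['https://example.org/papers/wp001.pdf']
```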
ArchEc sounds great! I can offer a couple of resources that may be helpful:

- As you may already know, anybody can create an account on archive.org and upload content, which we will preserve and provide free access/bandwidth for. We have an API and command line tools. I can help with advice on splitting and organizing content if you are interested. This could be an extra mirror in parallel to ArchEc itself.
- I can run targeted crawls of large URL lists into wayback and provide public download access to the resulting WARC files. We don't always provide direct access to our WARCs, but can for specific projects like this. This might be helpful if wget is giving you trouble.
- I could also share our heritrix3 (Java web crawling framework) configuration that I use for paper crawling, though it is a bit of a complex tool to use.

For context, the path to ingesting content into fatcat (and scholar.archive.org) would involve:

- Python code to transform RePEc metadata into the fatcat schema, which is based on CSL-JSON. Possibly using a new persistent identifier for RePEc articles. Eg, like this: https://github.com/internetarchive/fatcat/blob/master/python/fatcat_tools/im...
- Python code to harvest new RePEc metadata records every day (for transformation, crawling, and ingest). Eg, similar to what we do for arxiv.org, pubmed, and crossref DOIs.

We have a pipeline to crawl URLs both in large batches and for daily updates, and are working on tooling that will cluster and/or merge different versions ("releases") of a paper ("work"). Eg, a draft and a final publication version.

--bryan

On Sat, Oct 3, 2020 at 10:07 PM, Thomas Krichel <krichel@openlib.org> wrote:
> Bryan Newbold writes
>
> > RePEc has been on our list to harvest and crawl for some time,
>
> RePEc is not something you can crawl, in the web sense of the term. Your best bet is to download RePEc metadata and read that.
>
> As Christian Zimmermann has hinted, I am working on ArchEc, a project to digitally archive RePEc. I intend to archive all of RePEc: not just working papers, but also the metadata records that we have. Progress has been extremely modest because of a lack of resources. In 2019, we got a 3k Euro subsidy. I am in the second year of working on this. The software I wrote is intended to be fairly generic, in the sense that it takes a pile of metadata and extracts full-text links. My severe mistake was to use wget to do the actual downloads. A lot of the current code checks and corrects wget output. But I have a first set of WARC files available. You can get them with
>
> krichel@trabbi/tmp$ rsync -av --exclude '*.cdx' rsync://archec.repec.org/vault/ vault/
>
> The vault has the actual archives. I will have roughly 900k docs available by the end of the year.
>
> I also provide payload indexing data ("plind"), aggregated by RePEc series, for user services to deliver full text out of the archives, reachable by
>
> krichel@trabbi/tmp$ rsync -av rsync://archec.repec.org/plind/ plind/
>
> > Do you know anybody with familiarity with both RePEc and programming (eg, Python)
>
> Me. I code in Python, XSLT, Perl, Javascript...
>
> > who would be interested in taking a stab at this?
>
> Interested, yes, but if there is no funding for it, it will have to wait until the rest of ArchEc is done. That's a matter of years rather than months. That said, with a current income of about 600 Euros a month, I'm ready to work.
>
> --
> Cheers,
> Thomas Krichel http://openlib.org/home/krichel skype:thomaskrichel
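[The ReDIF-to-CSL-JSON transform Bryan mentions above might look roughly like this. The field mapping and sample values are hypothetical illustrations, not the actual fatcat importer; CSL-JSON is assumed only in its broad shape (title, author, URL fields).]

```python
def redif_to_csl(record):
    """Map a parsed ReDIF record (dict of field name -> value) to a
    minimal CSL-JSON-style dict. The mapping here is illustrative only."""
    # Naive name split: everything after the last space is the family name.
    given, _, family = record.get("Author-Name", "").rpartition(" ")
    return {
        "id": record.get("Handle", ""),  # could become a new persistent identifier
        "type": "article",
        "title": record.get("Title", ""),
        "author": [{"family": family, "given": given}] if family else [],
        "URL": record.get("File-URL", ""),
    }

# Hypothetical parsed record:
sample = {
    "Handle": "RePEc:xxx:wpaper:001",
    "Title": "An Example Working Paper",
    "Author-Name": "Jane Doe",
    "File-URL": "https://example.org/papers/wp001.pdf",
}
print(redif_to_csl(sample)["title"])  # An Example Working Paper
```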
participants (2)
- Bryan Newbold
- Thomas Krichel