Bryan Newbold writes

> RePEc has been on our list to harvest and crawl for some time,

RePEc is not something you can crawl, in the web sense of the term. Your best bet is to download RePEc metadata and read that.

As Christian Zimmermann has hinted, I am working on ArchEc, a project to digitally archive RePEc. I intend to archive all of RePEc: not just working papers, but also the metadata records that we have. Progress has been extremely modest because of a lack of resources. In 2019, we got a 3k Euro subsidy. I am in the second year of working on this. The software I wrote is intended to be fairly generic, in the sense that it takes a pile of metadata and extracts full-text links. My severe mistake was to use wget to do the actual downloads. A lot of the current code checks and corrects wget output. But I have a first set of WARC files available. You can get them with

krichel@trabbi/tmp$ rsync -av --exclude '*.cdx' rsync://archec.repec.org/vault/ vault/

The vault has the actual archives. I will have roughly 900k docs available by the end of the year.

I also provide payload indexing data ("plind"), aggregated by RePEc series, for user services to deliver full text out of the archives, reachable by

krichel@trabbi/tmp$ rsync -av rsync://archec.repec.org/plind/ plind/

> Do you know anybody with familiarity with both RePEc and programming (eg, Python)

Me. I code in Python, XSLT, Perl, Javascript...

> who would be interested in taking a stab at this?

Interested, yes, but if there is no funding for it, it will have to wait until the rest of ArchEc is done. That's a matter of years rather than months. That said, with a current income of about 600 Euros a month, I'm ready to work.

--
Cheers,
Thomas Krichel http://openlib.org/home/krichel skype:thomaskrichel
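[The link-extraction step Thomas describes (take a pile of metadata, extract full-text links) could be sketched roughly as follows. The sample record and the regex-based parsing are illustrative assumptions only; real ReDIF records vary, and this is not ArchEc's actual code.]

```python
import re

# A made-up ReDIF-style record, for illustration only.
SAMPLE_REDIF = """\
Template-Type: ReDIF-Paper 1.0
Title: An Example Working Paper
Author-Name: Jane Doe
File-URL: https://example.org/papers/wp001.pdf
File-Format: application/pdf
Handle: RePEc:xxx:wpaper:001
"""

def extract_fulltext_links(redif_text):
    """Return all File-URL field values found in a ReDIF record."""
    return re.findall(r"^File-URL:\s*(\S+)\s*$", redif_text, flags=re.MULTILINE)

print(extract_fulltext_links(SAMPLE_REDIF))  # ['https://example.org/papers/wp001.pdf']
```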
ArchEc sounds great! I can offer a couple of resources that may be helpful:

- As you may already know, anybody can create an account on archive.org and upload content, which we will preserve and provide free access/bandwidth for. We have an API and command line tools. I can help with advice on splitting and organizing content if you are interested. This could be an extra mirror in parallel to ArchEc itself.
- I can run targeted crawls of large URL lists into wayback and provide public download access to the resulting WARC files. We don't always provide direct access to our WARCs, but can for specific projects like this. This might be helpful if wget is giving you trouble.
- I could also share our heritrix3 (Java web crawling framework) configuration that I use for paper crawling, though it is a bit of a complex tool to use.

For context, the path to ingesting content into fatcat (and scholar.archive.org) would involve:

- Python code to transform RePEc metadata into the fatcat schema, which is based on CSL-JSON. Possibly using a new persistent identifier for RePEc articles. Eg, like this: https://github.com/internetarchive/fatcat/blob/master/python/fatcat_tools/im...
- Python code to harvest new RePEc metadata records every day (for transformation, crawling, and ingest). Eg, similar to what we do for arxiv.org, pubmed, and crossref DOIs.

We have a pipeline to crawl URLs both in large batches and for daily updates, and are working on tooling that will cluster and/or merge different versions ("releases") of a paper ("work"). Eg, a draft and a final publication version.

--bryan

On Sat, Oct 3, 2020 at 10:07 PM, Thomas Krichel <krichel@openlib.org> wrote:
> Bryan Newbold writes
>
> > RePEc has been on our list to harvest and crawl for some time,
>
> RePEc is not something you can crawl, in the web sense of the term. Your best bet is to download RePEc metadata and read that.
>
> As Christian Zimmermann has hinted, I am working on ArchEc, a project to digitally archive RePEc. I intend to archive all of RePEc: not just working papers, but also the metadata records that we have. Progress has been extremely modest because of a lack of resources. In 2019, we got a 3k Euro subsidy. I am in the second year of working on this. The software I wrote is intended to be fairly generic, in the sense that it takes a pile of metadata and extracts full-text links. My severe mistake was to use wget to do the actual downloads. A lot of the current code checks and corrects wget output. But I have a first set of WARC files available. You can get them with
>
> krichel@trabbi/tmp$ rsync -av --exclude '*.cdx' rsync://archec.repec.org/vault/ vault/
>
> The vault has the actual archives. I will have roughly 900k docs available by the end of the year.
>
> I also provide payload indexing data ("plind"), aggregated by RePEc series, for user services to deliver full text out of the archives, reachable by
>
> krichel@trabbi/tmp$ rsync -av rsync://archec.repec.org/plind/ plind/
>
> > Do you know anybody with familiarity with both RePEc and programming (eg, Python)
>
> Me. I code in Python, XSLT, Perl, Javascript...
>
> > who would be interested in taking a stab at this?
>
> Interested, yes, but if there is no funding for it, it will have to wait until the rest of ArchEc is done. That's a matter of years rather than months. That said, with a current income of about 600 Euros a month, I'm ready to work.
>
> --
> Cheers,
> Thomas Krichel http://openlib.org/home/krichel skype:thomaskrichel
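[The ReDIF-to-CSL-JSON transform Bryan mentions above might look roughly like this. The field mapping and sample values are hypothetical illustrations, not the actual fatcat importer; CSL-JSON is assumed only in its broad shape (title, author, URL fields).]

```python
def redif_to_csl(record):
    """Map a parsed ReDIF record (dict of field name -> value) to a
    minimal CSL-JSON-style dict. The mapping here is illustrative only."""
    # Naive name split: everything after the last space is the family name.
    given, _, family = record.get("Author-Name", "").rpartition(" ")
    return {
        "id": record.get("Handle", ""),  # could become a new persistent identifier
        "type": "article",
        "title": record.get("Title", ""),
        "author": [{"family": family, "given": given}] if family else [],
        "URL": record.get("File-URL", ""),
    }

# Hypothetical parsed record:
sample = {
    "Handle": "RePEc:xxx:wpaper:001",
    "Title": "An Example Working Paper",
    "Author-Name": "Jane Doe",
    "File-URL": "https://example.org/papers/wp001.pdf",
}
print(redif_to_csl(sample)["title"])  # An Example Working Paper
```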
participants (2)
- Bryan Newbold
- Thomas Krichel