- ArchEc-run - RePEc Lists

Losheim report
by Thomas Krichel 13 Jan '21

It's here:

  http://archec.repec.org/losheim_report.docx

I would love somebody to give it a read through. I intend to send it on Thursday.

-- Cheers, Thomas Krichel http://openlib.org/home/krichel skype:thomaskrichel

restructure plind
by Thomas Krichel 13 Dec '20

Sorry for the noise from Woodside NY.

After a most inspiring conversation with JMBC today, I decided to make paper handles (papids) the keys in the plind JSON. I add the relative file as a field 'F' in the data for each payload. Yes, that duplicates that value, because all plods of a particular papid are in the same relfi ... but ok. It's impure but it seems to work.

-- Cheers, Thomas Krichel http://openlib.org/home/krichel skype:thomaskrichel
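A minimal sketch of how a plind keyed by papid might look once loaded in Python. It assumes each papid maps to a list of payload records and that each record repeats the relative file under 'F', as described in the message. The handles, the file names and the helper `relfis_by_papid` are invented for illustration; the real files may be laid out differently.

```python
import json

# Hypothetical plind fragment: papids are the keys, and every payload
# record repeats the relative file in 'F'. Handles and paths are made up.
SAMPLE = """
{
  "RePEc:aaa:wpaper:001": [
    {"F": "aaa/wpaper/001.warc"},
    {"F": "aaa/wpaper/001.warc"}
  ],
  "RePEc:aaa:wpaper:002": [
    {"F": "aaa/wpaper/002.warc"}
  ]
}
"""


def relfis_by_papid(plind):
    """Map each papid to the set of relative files its payloads point to."""
    return {papid: {rec["F"] for rec in recs} for papid, recs in plind.items()}


plind = json.loads(SAMPLE)
relfis = relfis_by_papid(plind)

# If all payloads of a papid live in the same relfi, each set has one
# element, which is the duplication the message calls "impure".
assert all(len(files) == 1 for files in relfis.values())
```

Under this assumed layout, a consumer can go from a paper handle to its payload records without a second lookup, at the cost of repeating 'F'.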

plind stats
by Thomas Krichel 12 Dec '20

I just wrote a few lines of Python to summarize the plind:

  archec@darni:$ plind_stats
  1071893 papers
  1447873 PDF payloads
  1226498 plodis

The plodi is a payload digest. So basically this tells you how many different payloads we have. My policy is to duplicate payloads if they belong to different papers. While this wastes disk space, anything else would make it harder for consumers of the data. Having hit over a million on all figures is good; it should make the funders happy.

-- Cheers, Thomas Krichel http://openlib.org/home/krichel skype:thomaskrichel
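The three figures can be illustrated with a short sketch. This is not the actual plind_stats script, which is not shown in the message. It assumes a plind that maps each papid to a list of payload records, and it assumes the digest sits under a key called 'd'; both the layout and that key name are hypothetical.

```python
def plind_summary(plind, digest_key="d"):
    """Return (papers, payloads, distinct digests) for one plind mapping.

    `plind` is assumed to map papid -> list of payload records, and
    `digest_key` is a hypothetical name for the field holding the plodi.
    """
    papers = len(plind)
    payloads = 0
    digests = set()
    for records in plind.values():
        payloads += len(records)
        digests.update(rec[digest_key] for rec in records)
    return papers, payloads, len(digests)


# Invented example: the same payload (digest "x1") is stored under two papers.
example = {
    "RePEc:aaa:wpaper:001": [{"d": "x1"}, {"d": "y2"}],
    "RePEc:aaa:wpaper:002": [{"d": "x1"}],
}
papers, payloads, plodis = plind_summary(example)
print(papers, "papers")
print(payloads, "PDF payloads")
print(plodis, "plodis")
```

In the invented example the shared payload is counted twice as a payload but once as a plodi, which is how the payload count can exceed the plodi count when payloads are duplicated across papers.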

about ArchEc
by Thomas Krichel 26 Oct '20

Thomas Krichel writes

> Could you be interested in ArchEc data?

Christoph Semken writes

> The description on the website is very brief. What is the difference with
> RePEc? Does it mainly add versioning? This could be interesting since I
> would not have to keep track of the changes myself (to notify users when an
> article they bookmarked changed).

Well, the idea is to archive RePEc. There are really two aspects to that.

One is to archive all RePEc metadata records and all versions that have existed of these records. That part is not done. All we have at this time is a series of dumps of the records at a particular time.

The other is to archive all full-text contents in the sense of the File-URL payloads in all RePEc archives, and all versions thereof, well, within reason. That part is partially done.

Both parts are supposed to be merged, in the following sense. There will be a single file per RePEc handle. It will contain all versions of the metadata and all versions of the full text. The format will be WARC, as developed by the Internet Archive.

I have released an initial set of WARCs and a set of plind files. These are PayLoad INDex files, in JSON, one per series, that say where full text can be found in the WARCs.

If your institution could sponsor a server for ArchEc that would be welcome. Even if you could rsync a copy that would be great. I will send you a mail recently sent to ArchEc-run under separate cover.

-- Cheers, Thomas Krichel http://openlib.org/home/krichel skype:thomaskrichel

Internet Archive and RePEc
by Thomas Krichel 06 Oct '20

Bryan Newbold writes

> RePEc has been on our list to harvest and crawl for some time,

RePEc is not something you can crawl, in the web sense of the term. Your best bet is to download RePEc metadata and read that.

As Christian Zimmermann has hinted, I am working on ArchEc, a project to digitally archive RePEc. I intend to archive all of RePEc, not just working papers, but also the metadata records that we have. Progress has been extremely modest because of lack of resources. In 2019, we got a 3k Euro subsidy. I am in the second year of working on this.

The software I wrote is intended to be fairly generic, in the sense that it takes a pile of metadata and extracts full text links. My severe mistake was to use wget to do the actual downloads. A lot of the current code is checks and corrections on wget output. But I have a first set of WARC files available. You can get them with

  krichel@trabbi/tmp$ rsync -av --exclude '*.cdx' rsync://archec.repec.org/vault/ vault/

The vault has the actual archives. I will have roughly 900k docs available by the end of the year.

I also provide payload indexing data "plind", aggregated by RePEc series, for user services to deliver full text out of the archives, reachable by

  krichel@trabbi/tmp$ rsync -av rsync://archec.repec.org/plind/ plind/

> Do you know anybody with familiarity of both RePEc and programming
> (eg, Python)

Me. I code in Python, XSLT, Perl, Javascript...

> who would be interested in taking a stab at this?

Interested yes, but if there is no funding for it, it will have to wait until the rest of ArchEc is done. That's a matter of years rather than months. That said, with a current income of about 600 Euros a month, I'm ready to work.

-- Cheers, Thomas Krichel http://openlib.org/home/krichel skype:thomaskrichel

vault and plind opened
by Thomas Krichel 25 Sep '20

I have just opened the vault and the plind for your rsyncing pleasure

  krichel@trabbi/tmp$ mkdir plind
  krichel@trabbi/tmp$ rsync -av rsync://archec.repec.org/plind/ plind/

The plind is the payload index. It says where in the vault file PDF data for papers can be found. The start of the payload is at 'b', the length at 'f'. The 'o' field has the PDF status.

  'm' according to mime type
  'a' it has something "%PDF" inside first 100 bytes
  'p' it has "PDF" in the futli, important for ftp
  'f' it has a URL starting with "ftp://"
  'r' is from a WARC resource record that contains a payload, i.e. not
      preceded by a WARC metadata record, or not concurrent to another
      record.

At this time a tiny fraction of the plind data is available. I'm still running a full set.

The vault contains the actual WARCs. I recommend excluding the cdx files

  krichel@trabbi/tmp$ mkdir vault
  krichel@trabbi/tmp$ rsync -av --exclude '*.cdx' rsync://archec.repec.org/vault/ vault/

At this time, there is no way to limit this to files that are actually mentioned in the plind. This is important since we hold PDF only for a minority of papers. Suggestions welcome.

-- Cheers, Thomas Krichel http://openlib.org/home/krichel skype:thomaskrichel
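A minimal sketch of how a consumer might pull one payload out of a vault file using the 'b' (start) and 'f' (length) fields. It assumes both values are byte counts into an uncompressed file and that the caller already knows which vault file the record refers to. The helper names `read_payload` and `looks_like_pdf` and the demo file are made up; the second helper only mirrors the "%PDF" in the first 100 bytes check behind the 'a' status.

```python
import os
import tempfile


def read_payload(vault_dir, relfi, start, length):
    """Read `length` bytes starting at byte offset `start` of a vault file."""
    path = os.path.join(vault_dir, relfi)
    with open(path, "rb") as fh:
        fh.seek(start)
        return fh.read(length)


def looks_like_pdf(payload):
    """True if "%PDF" occurs within the first 100 bytes of the payload."""
    return b"%PDF" in payload[:100]


# Demo against a throwaway file standing in for a vault WARC.
with tempfile.TemporaryDirectory() as vault:
    relfi = "demo.warc"  # invented file name
    header = b"WARC/1.0\r\n\r\n"
    body = b"%PDF-1.4 fake body"
    with open(os.path.join(vault, relfi), "wb") as fh:
        fh.write(header + body + b"\r\n\r\n")

    # A plind-style record: 'b' is the start, 'f' the length.
    record = {"b": len(header), "f": len(body)}
    payload = read_payload(vault, relfi, record["b"], record["f"])

print(looks_like_pdf(payload))
```

If the vault files turn out to be compressed, the offsets would need to be interpreted accordingly, so treat this as a starting point only.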

Re: [ArchEc] darni's third crash
by Thomas Krichel 11 Aug '20

I made another o/s update and rebooted.

We had similar crashes with aigtu. It once crashed while I was on the Trans-Siberian railway, and there I spilled tea on my laptop so when I got to Moscow I could not reboot. But strangely enough, aigtu has been well behaved for about 18 months or so now.

-- Cheers, Thomas Krichel http://openlib.org/home/krichel skype:thomaskrichel

darni crashed
by Thomas Krichel 30 Jul '20

At 23:45 last night. I woke up shortly after that. A hardware reset I performed about 25 minutes later brought it back. I did not find a trace in the log as to what may have caused the crash. Cheers, Thomas Krichel http://openlib.org/home/krichel skype:thomaskrichel

Re: [ArchEc] please fix this problem on ftp://openlib.org
by Thomas Krichel 08 Jul '20

Saiki, A writes

> I hope you remember this exchange of Emails. I found today that you are
> doing a new project
> http://archec.repec.org/public_html/index_2013-09-18.html
>
> Is this an archive of an older page?

Yes.

> My old profile (which contained my date of birth) was removed in
> March 2012, if I remember correctly. What is this, and is my old
> info (d.o.b) affected at all by this?

No. It's for document data.

-- Cheers, Thomas Krichel http://openlib.org/home/krichel skype:thomaskrichel

first updates
by Thomas Krichel 15 Jun '20

Quick update on ArchEc. Just now I'm running the first updates of existing warcs, according to an urgency index calculated with a customizable schedule. Took a very long time to get there. -- Cheers, Thomas Krichel http://openlib.org/home/krichel skype:thomaskrichel
