I have just opened the vault and the plind for your rsyncing pleasure:

  krichel@trabbi/tmp$ mkdir plind
  krichel@trabbi/tmp$ rsync -av rsync://archec.repec.org/plind/ plind/

The plind is the payload index. It says where in the vault file the PDF
data for papers can be found. The start of the payload is at 'b', the
length at 'f'. The 'o' field has the PDF status:

  'm'  it is a PDF according to the MIME type
  'a'  it has "%PDF" somewhere inside the first 100 bytes
  'p'  it has "PDF" in the futli, important for ftp
  'f'  it has a URL starting with "ftp://"
  'r'  it is from a WARC resource record that contains a payload,
       i.e. not preceded by a WARC metadata record, or not
       concurrent to another record

At this time a tiny fraction of the plind data is available. I'm still
running a full set.

The vault contains the actual WARCs. I recommend excluding the cdx files:

  krichel@trabbi/tmp$ mkdir vault
  krichel@trabbi/tmp$ rsync -av --exclude '*.cdx' rsync://archec.repec.org/vault/ vault/

At this time, there is no way to limit this to files that are actually
mentioned in the plind. This is important since we hold PDFs only for a
minority of papers. Suggestions welcome.

--
Cheers,

Thomas Krichel
http://openlib.org/home/krichel
skype:thomaskrichel
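
A minimal sketch of how a plind entry might be used to pull a payload out of a vault file. The message does not say how the plind is serialized, so the dictionary with 'b' and 'f' keys, the function names, and the stand-in file are all assumptions for illustration. It also assumes that 'b' and 'f' are plain byte offsets into the vault file as stored on disk, which may not hold if the WARCs are compressed.

```python
import os
import tempfile


def read_payload(vault_path, start, length):
    """Return `length` bytes of `vault_path`, beginning at byte offset `start`."""
    with open(vault_path, "rb") as fh:
        fh.seek(start)
        return fh.read(length)


def looks_like_pdf(payload, window=100):
    """Mirror the 'a' status: "%PDF" appears within the first `window` bytes."""
    return b"%PDF" in payload[:window]


def demo():
    # Stand-in for a vault file; a real one would be a WARC from the vault.
    header = b"WARC headers\r\n\r\n"
    pdf = b"%PDF-1.4 fake body"
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(header + pdf + b"\r\n\r\n")
        path = tmp.name
    try:
        # Hypothetical plind entry: 'b' is the payload start, 'f' its length.
        entry = {"b": len(header), "f": len(pdf)}
        payload = read_payload(path, entry["b"], entry["f"])
        return payload, looks_like_pdf(payload)
    finally:
        os.remove(path)


if __name__ == "__main__":
    payload, is_pdf = demo()
    print(len(payload), is_pdf)
```

Reading by offset and length avoids parsing the whole WARC, which is presumably the point of keeping a separate payload index.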