[ArchEc] current work

9 Oct 2021

      Hi gang,

  I'm still working on the tarballs of ArchEc. Here is a
  submission to a meeting that explains where I am heading
  in non-technical terms. The technical details are still in
  progress, even after two months. It's just that there is
  a lot of junk to fight with in the RePEc tarballs.

  -----------------------------------------------------------------

  In digital archiving, it is quite common to start with records in
  files. If each file contains just one record, then preserving the
  files is essentially the same as preserving the records. But what if
  the files contain several records? And what if the files change over
  time? This happens, for example, when the files are harvested from a
  bunch of sources. Each source offers files that contain records, and
  some sources may be creating poorly-formed records. We can harvest the
  current versions of these files and they end up on our disk ... how do
  we preserve them?

  One approach is to preserve the files. In that case we lose no
  information. We can just look at what files are changing, and copy the
  changed ones into an archival location. I see two problems. (1) If
  there are files in which records are accumulating, we are wasting disk
  space because we store the same record over and over again. (2) What
  about consumers of our archival material? They presumably do not care
  about the files. They want the records. If we just tell them, “hey,
  here are the files, go figure”, we are not likely to get much buy-in.
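
  To make the file-based approach concrete, here is a minimal sketch in
  Python. It is an illustration only, not the ArchEc software. The names
  archive_changed, harvest_dir and archive_dir are made up for this
  example, and I assume that a "changed" file is simply one whose SHA-256
  digest has not been archived yet.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def archive_changed(harvest_dir: Path, archive_dir: Path) -> list[Path]:
    """Copy every harvested file version that is not yet archived.

    Hypothetical layout: each distinct version is stored as
    <archive_dir>/<relative path>/<digest>, so an unchanged file
    is never copied twice. archive_dir must lie outside harvest_dir.
    """
    copied = []
    for src in sorted(harvest_dir.rglob("*")):
        if not src.is_file():
            continue
        dest = archive_dir / src.relative_to(harvest_dir) / file_digest(src)
        if dest.exists():
            continue  # this exact version is already preserved
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)
        copied.append(dest)
    return copied
```

  Note that a file to which one record is appended gets stored again in
  full, old records included. That is problem (1) in action.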

  Another approach is to split the files into records, and preserve the
  records. Then we solve the two problems with the file-based
  approach. But we get more serious problems. We lose any information
  that was tied to the fact that records were in the same file. If the
  software used to split the records was buggy, or if we change our
  opinion about record borders, or the nature of records, we have no way
  to get back. We could circumvent that with some form of clever
  metadata. It has to be clever because the data may be broken in all
  sorts of ways.
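
  One possible shape for such metadata, sketched in Python under loud
  assumptions: this is not the scheme I implemented, and the names
  RecordSlice and split_records are invented for this example. I assume
  a record starts at a line beginning with "Template-Type:", and I keep
  anything before the first such line as a slice of its own, so that no
  byte is dropped. Each slice remembers its file, the file's digest, and
  its byte offset and length.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordSlice:
    source_file: str    # name of the harvested file
    file_sha256: str    # digest of the whole file the slice came from
    offset: int         # byte offset of the slice inside the file
    length: int         # length of the slice in bytes
    slice_sha256: str   # digest of the slice's own bytes


def split_records(name: str, data: bytes,
                  marker: bytes = b"Template-Type:") -> list[RecordSlice]:
    """Cut a file into contiguous slices, one per presumed record."""
    file_hash = hashlib.sha256(data).hexdigest()
    # Collect the byte offset of every line that starts with the marker.
    starts, pos = [], 0
    for line in data.splitlines(keepends=True):
        if line.startswith(marker):
            starts.append(pos)
        pos += len(line)
    # Add both file ends so the slices cover every byte, junk included.
    bounds = sorted(set([0] + starts + [len(data)]))
    slices = []
    for begin, end in zip(bounds, bounds[1:]):
        chunk = data[begin:end]
        slices.append(RecordSlice(name, file_hash, begin, end - begin,
                                  hashlib.sha256(chunk).hexdigest()))
    return slices
```

  Because the slices are contiguous and carry offsets, gluing them back
  together in offset order gives the original file byte for byte. So a
  buggy split, or a changed opinion about record borders, can be redone
  later from the same bytes.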

  The prior art that led me to this problem is the ArchEc
  http://archec.repec.org project. It is a humble effort to preserve
  RePEc http://repec.org data. A first stage received €3000 funding from
  the Fondation Banque de France pour la recherche économique. In this
  stage, I worked on preserving full-text instances pointed to in the
  RePEc data. In the current stage, funded with €2000 by the same
  funder, the work ultimately aims to preserve the actual RePEc
  records. But even in the funding application, available at
  http://governance.repec.org/applications/lebach.docx, I only pledged
  to work on preserving files, because I was unsure about how to
  preserve records.

  Well, on 2 August 2021, I had an ingenious idea for how to preserve files
  and records. I have implemented it. Thus, I have theory, software and
  a wealth of experience that I will share during the talk. While the
  work is obviously made for RePEc, the conceptual framework and the
  methods used apply to any time-varying collections of records that are
  in files. The data ends up in WARCs. I guess that the audience
  will be familiar with that format.

-- 

  Cheers,

  Thomas Krichel                  http://openlib.org/home/krichel
                                              skype:thomaskrichel