Hi gang,

I'm still working on the tarballs of ArchEc. Here is a submission to a meeting that explains where I am heading, in non-technical terms. The technical details are still in progress, even after two months. It's just that there is a lot of junk to fight with in the RePEc tarballs.

-----------------------------------------------------------------

In digital archiving, it is quite common to start with records in files. If each file contains just one record, then preserving the files is essentially the same as preserving the records. But what if the files contain several records? And what if the files change over time? For example, the files may be harvested from a bunch of sources, and these sources may be creating poorly-formed records. Each source offers files that contain records. We can harvest the current versions of these files and they end up on our disk ... how to preserve them?

One approach is to preserve the files. In that case we lose no information. We can just look at what files are changing, and copy the changed ones into an archival location. I see two problems. (1) If there are files in which records are accumulating, we are wasting disk space because we store the same record over and over again. (2) What about consumers of our archival material? They presumably do not care about the files. They want the records. If we just tell them, “hey, here are the files, go figure”, we are not likely to get much buy-in.

Another approach is to split the files into records, and preserve the records. Then we solve the two problems with the file-based approach. But we get more serious problems. We lose any information that was tied to the fact that records were in the same file. If the software used to split the records was buggy, or if we change our opinion about record borders, or the nature of records, we have no way to get back. We could circumvent that with some form of clever metadata. It has to be clever because the data may be broken in all sorts of ways.
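
To make the trade-off between the two approaches concrete, here is a minimal sketch in Python. It is not the ArchEc software and not the idea announced below. It only illustrates one generic way to store each distinct record once while remembering, per harvest, which file held which records. The blank-line record separator, the function names, the file name and the harvest stamps are all hypothetical.

```python
import hashlib


def split_records(text):
    """Hypothetical splitter: assumes records are separated by blank lines."""
    return [chunk.strip() for chunk in text.split("\n\n") if chunk.strip()]


def archive_snapshot(files, store, manifests, stamp):
    """Store each distinct record once, and note per file which records it held.

    files:     dict mapping file name to file text, as harvested at `stamp`
    store:     dict mapping record digest to record text (shared across harvests)
    manifests: dict mapping (stamp, file name) to the ordered list of digests
    """
    for name, text in files.items():
        digests = []
        for record in split_records(text):
            digest = hashlib.sha256(record.encode("utf-8")).hexdigest()
            store.setdefault(digest, record)  # an unchanged record is not stored again
            digests.append(digest)
        manifests[(stamp, name)] = digests


def rebuild_file(store, manifests, stamp, name):
    """Rebuild the splitter's view of a file from its manifest."""
    return "\n\n".join(store[digest] for digest in manifests[(stamp, name)])


# Two harvests of the same (made-up) file; one record accumulates.
store, manifests = {}, {}
archive_snapshot({"papers.txt": "record one\n\nrecord two"},
                 store, manifests, "harvest-1")
archive_snapshot({"papers.txt": "record one\n\nrecord two\n\nrecord three"},
                 store, manifests, "harvest-2")
```

Note what this sketch does not solve: `rebuild_file` returns only what the splitter saw. If the splitter is buggy, or the file is broken in a way it does not anticipate, the original bytes cannot be recovered from the store and the manifests alone.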

The prior art that led me to this problem is the ArchEc project, http://archec.repec.org. It is a humble effort to preserve RePEc (http://repec.org) data. A first stage received €3000 funding from the Fondation Banque de France pour la recherche économique. In this stage, I worked on preserving full-text instances pointed to in the RePEc data. In the current stage, funded with €2000 by the same funder, the work ultimately aims to preserve the actual RePEc records. But even in the funding application, available at http://governance.repec.org/applications/lebach.docx, I only pledged to work on preserving files, because I was unsure about how to preserve records.

Well, on 2 August 2021, I had an ingenious idea for how to preserve both files and records. I have implemented it. Thus, I have theory, software and a wealth of experience that I will share during the talk. While the work is obviously made for RePEc, the conceptual framework and the methods used apply to any time-varying collection of records that are in files. The data moves into WARCs. I guess that the audience will be familiar with that format.

--

Cheers,

Thomas Krichel
http://openlib.org/home/krichel
skype:thomaskrichel