Ivan Kurmanov writes
I've been working on nebka for the last week or more, on the RAS database.
The first thing I did was modify the ACIS code to use Storable's nfreeze() function when storing data into the db. At that point I also found that part of the ACIS code on nebka was already using nfreeze().
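For reference, the change is just which Storable function does the writing. A minimal sketch (the record contents here are made up):

```perl
use strict;
use warnings;
use Storable qw(nfreeze thaw);

# nfreeze() writes in network (big-endian) byte order, so the frozen
# string can be thawed on a machine with a different native byte
# order; plain freeze() does not guarantee that.
my $record = { id => 'pkr1', name => 'Thomas Krichel' };
my $frozen = nfreeze( $record );

# thaw() reads both formats, so data already written with freeze()
# stays readable while new writes use the portable encoding.
my $copy = thaw( $frozen );
print $copy->{name}, "\n";
```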
Then I wrote a test script which runs through the specified database tables (MySQL) and checks that their values were written with Storable correctly.
Then for the last 5 days or so I was working on the update daemon database (Berkeley DB) of RAS on nebka. It also contains Storable-encoded strings, and it is important. First, I wrote a script (actually, a version of the same script mentioned above) to check the values in it (by a full scan, checking each one). That found some issues.
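The full-scan check could look roughly like this, assuming the daemon database is tied through DB_File as a btree (the file name here is hypothetical; the real file lives under ~/acis/RI/data):

```perl
use strict;
use warnings;
use Fcntl qw(O_RDONLY);
use DB_File;
use Storable qw(thaw);

# Hypothetical file name, for illustration only.
my %db;
tie %db, 'DB_File', 'records.db', O_RDONLY, 0644, $DB_BTREE
    or die "cannot open records.db: $!";

# Full scan: a value that thaw() cannot decode (it croaks, caught by
# eval) is either not Storable data at all or a corrupted frozen string.
while ( my ( $key, $value ) = each %db ) {
    my $data = eval { thaw( $value ) };
    print "bad value at key: $key\n" unless defined $data;
}
untie %db;
```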
In fact, that check found some serious data corruption: some keys were mixed with values, and some values were hopelessly broken.
I did suspect things were not OK, because when I do fewer updates (timeout longer than the default 1 week) things are not updated properly. I got serious complaints from CZ about it. I wanted to reduce the load on the box by setting a higher TOO_OLD.
Second, I wrote a script to correct the values that need correction (via nfreeze) and to remove the ones that cannot be corrected.
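A self-contained sketch of that repair pass, under the same DB_File assumption (the file name and seeded records are made up for the demo; the real script presumably handles more cases):

```perl
use strict;
use warnings;
use Fcntl qw(O_RDWR O_CREAT);
use DB_File;
use Storable qw(nfreeze thaw);

# Demo file; the real database lives under ~/acis/RI/data.
my $file = '/tmp/records-fix-demo.db';
my %db;
tie %db, 'DB_File', $file, O_RDWR|O_CREAT, 0644, $DB_BTREE
    or die "cannot open $file: $!";

# Seed one good frozen value and one broken one, for the demo.
$db{good} = nfreeze( { id => 1 } );
$db{bad}  = 'not a Storable string';

# The repair pass: re-freeze whatever still thaws, drop the rest.
for my $key ( keys %db ) {
    my $data = eval { thaw( $db{$key} ) };
    if ( defined $data ) {
        $db{$key} = nfreeze( $data );   # rewrite in portable format
    }
    else {
        delete $db{$key};               # hopeless: remove it
    }
}
print join( ',', sort keys %db ), "\n";  # only 'good' should survive
untie %db;
unlink $file;
```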
That work is now done. BUT at least part of the update daemon database there is still corrupted. The good news is that this part is not the RePEc part, and it would not (should not) cause any trouble to rebuild it. The bad news is that it is corrupted at the internal Berkeley DB level, and I do not know how to fix it. The corruption is reported by the db4.6_verify tool. The Berkeley DB documentation is not clear on how to fix it, or I'm not looking hard enough.
I do a db_dump foo | db_load foo_clean when I am desperate. On PubMed's 20 million records, that takes several days. I have been desperate many times.
Which leads me to a whole other topic: in the longer run we should avoid Berkeley DB and use something else instead. Luckily, this shouldn't be too hard to do, since all the Berkeley DB-related code is concentrated in a couple of modules or so. And there are alternatives, Kyoto Cabinet http://fallabs.com/kyotocabinet/ being one of them.
Or mongo, or couch.... I have had *tons* of trouble with BDB. I hate it. But still I don't think we should work on this now. CZ will kick our butts. You need to look at the ACIS code to find where the problem is. Blaming the problems on BDB is not the way forward, I think.
Now, I've moved out the corrupted BDB file ~/acis/RI/data/ACIS/records (renamed it).
I think the data is now ready to be migrated to the new server (again). The changes in ~/acis/ need to be copied over too, as well as the (partially) fixed ~/acis/RI/data contents. Dan, I suggest you do that.
I think he should not do this at all. Instead he should start with the text data and the most recent ACIS release, and build a clean dataset. I have done this today and I will forward my notes on it.
It may be a good idea to wait for the nearest run of the nightly script to create the RI database snapshots in ~/backup/2011/08/26 and to use those.
With these data and code changes, it should all run on the new server, but we'll do our tests and checks as soon as it is copied.
Any comments? Questions?
I bet that this will not work. Dan will still not be able to read your storables, and he will not read his own storables when he updates perl. Get rid of Storable. Keep BDB, for now. Cheers, Thomas Krichel http://openlib.org/home/krichel http://authorprofile.org/pkr1 skype: thomaskrichel