I've started working on preparing (fixing) the Storable-serialized data of RAS for proper (full) migration from nebka, and I was working with the live code and live database. And I've made a mistake. The mistake caused an important part of the data in the database -- the data column in the objects table -- to be overwritten with a value that was relevant to only one of these records. In other words, I've put something that looks like proper document details into the description of a large number of other documents. I don't know how many of the records were affected, but I estimate that probably at least several thousand.
When I realized what was going on, I aborted the operation and killed the mysql thread that was doing the job.
And before that I had also (via the same mistake) overwritten all institution details in the DB.
This corruption means that wrong data would be shown to the users, specifically in research profile suggestions and in the institutions search.
With Thomas' help, I've taken RAS down and have put the Service Temporarily Unavailable page online instead. At the same time I've disabled most of the RAS-related cronjobs in the aras account.
And I've started a full update of RePEc in the update daemon, which should overwrite the corrupted data with correct data taken from the files. But this update may take days to complete. That's why I've disabled the cronjobs: to have as few concurrent jobs as possible. I don't have a better estimate now. I'm watching the update daemon log, but I don't expect it to finish soon anyway.
-ivan
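A failure of this kind, where one UPDATE overwrites the data column for far more rows than intended, can often be caught by running the statement inside a transaction and checking the affected-row count before committing. The following is a minimal sketch of that guard, not the RAS code: it uses Python's built-in sqlite3 module only so that it is self-contained, the `id` column and the `guarded_update` helper are hypothetical, and with MySQL it would only help on a transactional storage engine such as InnoDB.

```python
import sqlite3


def guarded_update(conn, sql, params, max_rows):
    """Run one UPDATE in a transaction; roll back if it touches too many rows.

    `max_rows` is the largest number of rows the caller expects to change.
    (Hypothetical helper, for illustration only.)
    """
    cur = conn.cursor()
    try:
        # sqlite3 opens a transaction implicitly before the UPDATE runs.
        cur.execute(sql, params)
        if cur.rowcount > max_rows:
            raise RuntimeError(
                f"update touched {cur.rowcount} rows, expected at most {max_rows}"
            )
    except Exception:
        conn.rollback()  # leave the table exactly as it was
        raise
    conn.commit()
    return cur.rowcount
```

With this guard, a statement that is missing its WHERE clause raises an error and leaves the table unchanged instead of overwriting every row.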
That explains the complaints I got...
Please make sure to update the message on the front page. It dates from the last outage.
--
Christian Zimmermann FIGUGEGL! Economic Research Federal Reserve Bank of St. Louis P.O. Box 442 St. Louis MO 63166-0442 USA http://ideas.repec.org/zimm/
On Sat, 20 Aug 2011, Ivan Kurmanov wrote:
[...]
_______________________________________________ RAS-run mailing list RAS-run@lists.openlib.org http://lists.openlib.org/cgi-bin/mailman/listinfo/ras-run
'Christian Zimmermann' writes
That explains the complaints I got...
Please make sure to update the message on the front page. It dates from the last outage.
I wrote the site on snefru, but the DNS change goes to raneb. I have deleted the site on snefru so as to avoid a similar error in the future.
Cheers,
Thomas Krichel http://openlib.org/home/krichel http://authorprofile.org/pkr1 skype: thomaskrichel
Here is a status update on the issue.
The update process in the update daemon ran until midnight server time, but did not finish. By then it had processed 750 of 1356 archives. It was then interrupted by the nightly script job, which I had not disabled in crontab. The nightly script restarted the update daemon, which caused the update process to stop.
When I got up this morning, I requested an update again, with parameters to (hopefully) run quickly -- without thorough processing -- through the parts that were processed yesterday.
My estimate is that the processing will take 20-28 hours from now to finish. I'll make sure the nightly script does not interfere this time.
Then I'll do some checks and probably some more selective updates via the update daemon, and then we would be ready to put the service back online.
-ivan
On Sat, Aug 20, 2011 at 8:43 PM, Ivan Kurmanov <duraley@gmail.com> wrote:
[...]
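The interruption described in the status update, where the nightly script restarted the update daemon in the middle of a long update, is the kind of conflict an advisory lock file can prevent. Below is a minimal sketch, assuming both the long-running update and the nightly script can be wrapped in Python on a Unix host; the `try_lock` helper and the lock path are hypothetical and not part of the update daemon.

```python
import fcntl

LOCK_PATH = "/tmp/full-update.lock"  # hypothetical location


def try_lock(path):
    """Try to take an exclusive advisory lock on `path` without blocking.

    Returns the open file object (keep it open to hold the lock), or
    None if another holder already has the lock.
    """
    fh = open(path, "w")
    try:
        fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        fh.close()
        return None
    return fh


# The full update would hold the lock for its whole run:
#     lock = try_lock(LOCK_PATH)
#     ... run the update ...
#     lock.close()  # closing the file releases the lock
#
# The nightly script would skip the daemon restart while it is held:
#     if try_lock(LOCK_PATH) is None:
#         print("full update in progress, not restarting the daemon")
```

The lock is released automatically if the holding process exits or crashes, so a stale lock file on disk does not block later runs.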
Ivan Kurmanov writes
we would be ready to put the service back online.
I trust you will be able to reset the DNS. I have authorized
ssh-dss AAAAB3NzaC ... tnqH4Q== iku@yabloko.local
to enter binder@snefru.openlib.org. Kindly test this. The file ~/README explains what to do.
Cheers,
Thomas Krichel http://openlib.org/home/krichel http://authorprofile.org/pkr1 skype: thomaskrichel
Thanks, it works, and I've read the README. I'm not sure exactly what changes I should make. I guess repec.db is the file to edit, and in it I need to comment out the sorry line for authors?
-ivan
On Aug 21, 2011 12:52 PM, "Thomas Krichel" <krichel@openlib.org> wrote:
[...]
Ivan Kurmanov writes
Thanks, it works, and I've read the README. I'm not sure exactly what changes I should make. I guess repec.db is the file to edit, and in it I need to comment out the sorry line for authors?
change the line
authors IN A 128.252.177.191
authors IN A 137.99.31.70
Yes, and uncomment the one that has nebka's IP.
Cheers,
Thomas Krichel http://openlib.org/home/krichel http://authorprofile.org/pkr1 skype: thomaskrichel
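The zone-file change discussed here amounts to making sure exactly one of the A records for a name is active and the others are commented out. What follows is a minimal sketch of that edit, assuming one record per line and `;` as the comment character (as in BIND-style zone files). The `select_a_record` helper is hypothetical, the addresses are documentation placeholders from 192.0.2.0/24 rather than the ones in repec.db, and any further steps described in ~/README are not reproduced.

```python
import re


def select_a_record(zone_text, name, active_ip):
    """Leave only the A record `name -> active_ip` uncommented.

    Other A records for `name` are commented out with ';'.
    Lines for other names are returned untouched.
    """
    pattern = re.compile(
        r"^(?:;\s*)?(?P<record>" + re.escape(name) + r"\s+IN\s+A\s+(?P<ip>\S+).*)$"
    )
    out = []
    for line in zone_text.splitlines():
        m = pattern.match(line)
        if m is None:
            out.append(line)                      # unrelated line
        elif m.group("ip") == active_ip:
            out.append(m.group("record"))         # make sure it is active
        else:
            out.append("; " + m.group("record"))  # comment it out
    return "\n".join(out) + "\n"


zone = "authors IN A 192.0.2.1\n; authors IN A 192.0.2.2\n"
print(select_a_record(zone, "authors", "192.0.2.2"), end="")
```

Running the function twice with the same arguments gives the same result, so it is safe to re-apply.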
Thanks, Thomas.
Approx. 1300 archives have been processed. I estimate the processing will finish within an hour.
I have already restored the DNS for authors.repec.org and have updated the sorry cgi script with a message saying "Please try again in a few hours."
-ivan
On Sun, Aug 21, 2011 at 3:56 PM, Thomas Krichel <krichel@openlib.org> wrote:
[...]
Update: SERVICE RESTORED.
I've just got up. The database is clean now, i.e. fully updated. I've restored the acis cgi, the session clean-up and the APU. A little later I'll restore the other cron jobs too.
-ivan
On Mon, Aug 22, 2011 at 1:01 AM, Ivan Kurmanov <duraley@gmail.com> wrote:
[...]
Ivan Kurmanov writes
Update: SERVICE RESTORED.
My congratulations!
But the episode illustrates we need a testing server. I can set one up on holda.
Cheers,
Thomas Krichel http://openlib.org/home/krichel http://authorprofile.org/pkr1 skype: thomaskrichel
On Mon, Aug 22, 2011 at 8:35 AM, Thomas Krichel <krichel@openlib.org> wrote:
Ivan Kurmanov writes
Update: SERVICE RESTORED.
My congratulations!
Thanks for your support, Thomas. I'm sorry for this incident.
But the episode illustrates we need a testing server. I can set one up on holda.
In this particular case I don't quite see how this would have helped. My arrogance wouldn't let me think about how to test these changes before making them on production.
-i