I have investigated a little what the problem with nebka may be. Here is what we know:

- we have ext3 errors that gave some sort of panic in the night of Thursday to Friday.
- a reboot fixed it, the machine looked fine
- nebka went down again, approx 24 hours after the first crash
- Thomas was doing a complete backup of the machine with rsync at the time. He did not get to the original data of the machine, the aras account.
- We had a similar set of crashes in June 2006, that were diagnosed as an issue with a directory in CitEc that had too many files. At the time, I wrote:

  According to http://en.wikipedia.org/wiki/Ext3, the maximum number of files a directory can have is V*2^(-13), where V is the size of the volume in blocks. On raneb, this would be 56335 (V=461494280). On nebka, this is 8551 (V=70057172). This would mean we are still in trouble for both (we have 12000 NBER WPs). I hope I am misunderstanding.

So I investigated on raneb to see whether we have any overfull directories that may get mirrored to nebka. I found in the adrepec account

  ~/ftp/CitEc/nbr/nberwo
  ~/ftp/opt/CitEc/nbr/nberwo

which each have 10630 files. So if my forecast from 18 months ago is correct, we have the same problem as before, but in a subdirectory this time.

If this is correct: the solution, I think, is to have a larger volume. It turns out we have one for this machine, Bob sent it two months ago. We had to divert it for the machine running IDEAS because of a more serious HD problem. We have a new machine for IDEAS, we just need to configure it and transfer content, then the drive could be reallocated to nebka. I would just need Tim to get started on the new machine before I am back to Connecticut (January 28).

Does this make sense?
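The check described above can be sketched in shell. This is only an illustration under stated assumptions: the function names `max_dir_entries` and `overfull_dirs` are made up for this sketch, the limit is simply the V*2^(-13) rule as quoted in the message (not a verified ext3 property), and `-mindepth`/`-maxdepth` assume GNU or BSD find. Integer division rounds down, so it gives 56334 for raneb where the message quotes 56335.

```sh
#!/bin/sh
# max_dir_entries VOLUME_BLOCKS
# Applies the V * 2^(-13) rule quoted in the message: V / 8192, rounded down.
max_dir_entries() {
    echo $(( $1 / 8192 ))
}

# overfull_dirs ROOT LIMIT
# Prints "count<TAB>path" for every directory under ROOT that directly
# contains more than LIMIT entries (files and subdirectories).
# Assumption: no file names contain newlines, since entries are counted
# one per line.
overfull_dirs() {
    find "$1" -type d | while IFS= read -r od_dir; do
        od_n=$(find "$od_dir" -mindepth 1 -maxdepth 1 | wc -l | tr -d ' ')
        if [ "$od_n" -gt "$2" ]; then
            printf '%s\t%s\n' "$od_n" "$od_dir"
        fi
    done
}
```

Something like `overfull_dirs ~/ftp/CitEc "$(max_dir_entries 70057172)"` would then list any directory holding more entries than the nebka figure.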
In the immediate term, we would need to reboot the machine Monday, comment out all crontab jobs, investigate the true origin of the problem (we found it last year by looking at problematic inodes with fsck), and only then try to back up (only the aras account, in particular the userdata directory).

I will be on a train back to Paris again while the machine probably gets back up (Monday EST 10am-3pm), but I will check in as soon as possible once back in Paris.

Christian Zimmermann FIGUGEGL!
Department of Economics
University of Connecticut
341 Mansfield Road, Unit 1063
Storrs, CT 06269-1063
http://ideas.repec.org/zimm/
christian.zimmermann@uconn.edu
http://ideas.repec.org/e/pzi1.html
Christian Zimmermann writes
- a reboot fixed it, the machine looked fine - nebka went down again, approx 24 hours after the first crash
I just looked at this again. The crontab job I commented out on mutabor, which I think is responsible, is NOT an upload from mutabor to nebka but the reverse:

  #!/bin/sh
  rsync -t --log-format=%n aras@nebka.openlib.org:citec-export/* /home/adnetec/ras-exports/ | ~/Ivan/handle_ras_exports.pl /home/adnetec/ras-exports/
- Thomas was doing a complete backup of the machine with rsync at the time. He did not get to the original data of the machine, the aras account. - We had a similar set of crashes in June 2006, that were diagnosed as an issue with a directory in CitEc that had too many files. At the time, I wrote:
But this was an upload, and it was a number that was a lot bigger than the numbers we have now.
Does this make sense? In the immediate term, we would need to reboot the machine Monday, comment out all crontab jobs, investigate the true origin of the problem (we found it last year by looking at problematic inodes with fsck), and only then try to back up (only the aras account, in particular the userdata directory).
OK. In addition, I suggest you open an account at the ideas machine, to hold the most important data from acis and ras. This backup should be conducted every hour or so, in addition to backups to sahure (later to raneb) and fafner, done on alternate days.
I will be on a train back to Paris again while the machine probably gets back up (Monday EST 10am-3pm), but I will check in as soon as possible once back in Paris.
I will be at home on Monday night. I am 5 hours ahead of you, 11 hours ahead of EST. If I can be of any help any time, please don't hesitate to call me on my home number below. I can call you right back.

Cheers,

Thomas Krichel
http://openlib.org/home/krichel
RePEc:per:1965-06-05:thomas_krichel
phone: +7 383 330 6813
skype: thomaskrichel
On Sun, 20 Jan 2008, Thomas Krichel wrote:
Christian Zimmermann writes
- a reboot fixed it, the machine looked fine - nebka went down again, approx 24 hours after the first crash
I just looked at this again. The crontab job I commented out on mutabor, which I think is responsible, is NOT an upload from mutabor to nebka but the reverse:

  #!/bin/sh
  rsync -t --log-format=%n aras@nebka.openlib.org:citec-export/* /home/adnetec/ras-exports/ | ~/Ivan/handle_ras_exports.pl /home/adnetec/ras-exports/
- Thomas was doing a complete backup of the machine with rsync at the time. He did not get to the original data of the machine, the aras account.
That backup, in parallel to the rsync above, must have been working on the directories I mention below.
- We had a similar set of crashes in June 2006, that were diagnosed as an issue with a directory in CitEc that had too many files. At the time, I wrote:
But this was an upload, and it was a number that was a lot bigger than the numbers we have now.
But rsync is a huge resource hog, and we have less free space than 18 months ago. Looking at some literature on rsync, it turns out it holds the information about the whole directory tree in memory. So slicing things up can give welcome relief. Swap space would be grateful.
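One way to slice things up, as a rough sketch only: run a separate rsync for each top-level subdirectory, so that each run only has to hold the file list for one slice. The function name `sync_in_slices` and the `RSYNC` override variable are made up for this illustration; it assumes a local source tree (a remote source such as aras@nebka.openlib.org:... cannot be globbed this way), an existing destination directory, and it skips files sitting directly in the source root.

```sh
#!/bin/sh
# sync_in_slices SRC DEST
# Runs one rsync per top-level subdirectory of SRC instead of one rsync
# over the whole tree. Set RSYNC=echo to print the commands without
# copying anything.
# Assumptions: SRC is local, DEST already exists, and files directly
# inside SRC (not in a subdirectory) are not copied by this sketch.
sync_in_slices() {
    for ss_dir in "$1"/*/; do
        [ -d "$ss_dir" ] || continue          # no subdirectories: glob stayed literal
        ss_name=$(basename "$ss_dir")
        ${RSYNC:-rsync} -a "$ss_dir" "$2/$ss_name/" || return 1
    done
}
```

Running it first with `RSYNC=echo` shows which slices would be copied before anything touches the disk.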
Does this make sense? In the immediate term, we would need to reboot the machine Monday, comment out all crontab jobs, investigate the true origin of the problem (we found it last year by looking at problematic inodes with fsck), and only then try to back up (only the aras account, in particular the userdata directory).
OK.
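For the "comment out all crontab jobs" step, a minimal sketch, assuming a cron implementation whose `crontab -` installs a table read from standard input (as Vixie cron and its descendants do). The function name `comment_out_jobs` and the backup file name are hypothetical.

```sh
#!/bin/sh
# comment_out_jobs
# Filter: reads a crontab on stdin and writes it to stdout with every
# active line prefixed by '#'. Blank lines and existing comments pass
# through unchanged. Note that environment settings such as MAILTO=...
# are commented out as well.
comment_out_jobs() {
    sed -e '/^[[:space:]]*$/b' \
        -e '/^[[:space:]]*#/b' \
        -e 's/^/#/'
}

# Possible use, per account (keeps a copy of the original table first):
#   crontab -l > crontab.before-disable
#   comment_out_jobs < crontab.before-disable | crontab -
```

Keeping the saved copy means the jobs can be restored later with `crontab crontab.before-disable`.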
In addition, I suggest you open an account at the ideas machine, to hold the most important data from acis and ras. This backup should be conducted every hour or so, in addition to backups to sahure (later to raneb) and fafner, done on alternate days.
That is a possibility for the new machine. Not the current one, which has 1/3 of the disk space nebka has. But I absolutely refuse to use rsync. We have even debated moving RAS to the new machine, to economize on rack space. But we may want to have both for redundancy.
I will be on a train back to Paris again while the machine probably gets back up (Monday EST 10am-3pm), but I will check in as soon as possible once back in Paris.
I will be at home on Monday night. I am 5 hours ahead of you, 11 hours ahead of EST.
If I can be of any help any time, please don't hesitate to call me on my home number below. I can call you right back.
Cheers,
Thomas Krichel
http://openlib.org/home/krichel
RePEc:per:1965-06-05:thomas_krichel
phone: +7 383 330 6813
skype: thomaskrichel