Christian Zimmermann writes
> I am sorry to rely so much on you for this episode, knowing your busy
> schedule. If there is anything you cherish from Paris, I will be more than
> happy to bring it!
Let me suggest a Vacherin Mont d'Or from Vacroux, rue Daguerre,
near Denfert-Rochereau. He may not fancy it now, but he
surely will after he has eaten it. If you place some
next to nebka, the smell may even fix the disk.
Cheers,
Thomas Krichel http://openlib.org/home/krichel
RePEc:per:1965-06-05:thomas_krichel
phone: +7 383 330 6813 skype: thomaskrichel
Ruggieri, Timothy writes
> I went over and ran fsck on Nebka again yesterday. Rather than just
> leave the computer running and hope it doesn't die again, I started
> making an emergency backup to an external hard drive, just in case. I
> was unable to finish the backup before I had to leave, so I powered
> Nebka down, both to prevent any more disk corruption and to keep the
> parts of the backup I had already done from becoming outdated.
>
> I will finish backing Nebka up today and then start it up again. If it
> keeps dying, we will at least have all files as of this weekend.
Thank you so much Tim. This is so helpful to us. I'll
buy you beers for this next time you come to New York City.
Cheers,
Thomas Krichel http://openlib.org/home/krichel
I am finally in Paris after much delay. I see no sign of life for nebka.
What is the news?
Christian Zimmermann FIGUGEGL!
Department of Economics
University of Connecticut
341 Mansfield Road, Unit 1063
Storrs, CT 06269-1063
http://ideas.repec.org/zimm/ christian.zimmermann(a)uconn.edu
http://ideas.repec.org/e/pzi1.html
I have investigated a little what the problem with nebka may be. Here is
what we know:
- we had ext3 errors that caused some sort of panic during the night from
Thursday to Friday.
- a reboot fixed it, the machine looked fine
- nebka went down again, approx 24 hours after the first crash
- Thomas was doing a complete backup of the machine with rsync at the
time. He did not get to the original data of the machine, the aras
account.
- We had a similar set of crashes in June 2006, that were diagnosed as an
issue with a directory in CitEc that had too many files. At the time, I
wrote:
According to http://en.wikipedia.org/wiki/Ext3, the maximum number of
files a directory can have is V*2^(-13), where V is the size of the volume
in blocks. On raneb, this would be 56335 (V=461494280). On nebka, this is
8551 (V=70057172). This would mean we are still in trouble for both (we
have 12000 NBER WPs). I hope I am misunderstanding.
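
The figures quoted above can be checked with a few lines of Python. This is only a sketch of the arithmetic in the message: the function name is made up for illustration, and whether V*2^(-13) really bounds the number of files in a single directory is exactly the question the message leaves open.

```python
def estimated_file_limit(volume_blocks: int) -> float:
    """Return V * 2**-13 for a volume of V blocks (the formula quoted above)."""
    return volume_blocks / 2 ** 13


# Volume sizes in blocks, as quoted in the message.
volumes = {"raneb": 461494280, "nebka": 70057172}

for host, blocks in volumes.items():
    # Printed with two decimals so rounding choices stay visible.
    print(f"{host}: {estimated_file_limit(blocks):.2f}")
```

The results agree with the quoted 56335 and 8551 to within one file; the small differences come from rounding.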
So I investigated on raneb to see whether we have any overfull directories
that may get mirrored to nebka. I found in the adrepec account
~/ftp/CitEc/nbr/nberwo
~/ftp/opt/CitEc/nbr/nberwo
which each have 10630 files. So if my forecast from 18 months ago is
correct, we have the same problem as before, but in a subdirectory this
time.
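
A search like the one described above could be scripted roughly as follows. This is a minimal sketch, not the procedure actually used on raneb: the function name, the `~/ftp` starting point and the threshold of 8551 (the nebka estimate) are assumptions chosen for illustration.

```python
import os


def overfull_directories(root: str, threshold: int):
    """Yield (path, file_count) for each directory under root
    that directly contains more than `threshold` files."""
    for dirpath, _dirnames, filenames in os.walk(root):
        if len(filenames) > threshold:
            yield dirpath, len(filenames)


if __name__ == "__main__":
    # Hypothetical values: scan the ftp tree of the current account
    # and flag directories above the estimate computed for nebka.
    for path, count in overfull_directories(os.path.expanduser("~/ftp"), 8551):
        print(f"{count:8d}  {path}")
```

Only files directly inside a directory are counted, since the concern is the size of individual directories rather than of whole subtrees.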
If this is correct, the solution, I think, is to have a larger volume. It
turns out we have one for this machine; Bob sent it two months ago. We had
to divert it to the machine running IDEAS because of a more serious HD
problem. We have a new machine for IDEAS; we just need to configure it and
transfer content, and then the drive could be reallocated to nebka. I would
just need Tim to get started on the new machine before I am back in
Connecticut (January 28).
Does this make sense? In the immediate term, we would need to reboot the
machine Monday, comment out all crontab jobs, investigate the true origin
of the problem (we found it last year by looking at problematic inodes with
fsck), and only then try to back up (only the aras account, in particular
the userdata directory).
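
The "comment out all crontab jobs" step could be done by hand, or with something like the sketch below. It is an illustration only: the function name, the `#DISABLED# ` marker and the sample crontab (including its path) are hypothetical and not taken from nebka.

```python
def comment_out_jobs(crontab_text: str, marker: str = "#DISABLED# ") -> str:
    """Prefix every active crontab line with a marker so it can be restored later.

    Blank lines and existing comments are left untouched. Environment
    settings such as MAILTO=... are also disabled, which is harmless here.
    """
    out = []
    for line in crontab_text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            out.append(marker + line)
        else:
            out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical crontab content, for illustration only.
sample = "MAILTO=admin\n# nightly mirror\n15 3 * * * /home/adrepec/bin/mirror\n"
print(comment_out_jobs(sample), end="")
```

Using a distinctive marker rather than a bare `#` makes it easy to re-enable exactly the lines that were disabled once the disk problem is understood.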
I will be on a train back to Paris again while the machine probably gets
back up (Monday EST 10am-3pm), but I will check in as soon as
possible once back in Paris.
Christian Zimmermann FIGUGEGL!
Ruggieri, Timothy writes
> I went over to UITS and took a look at Nebka. As you suspected, there
> seems to have been a filesystem corruption problem of some kind. The
> console was full of EXT3 errors. I shut down the computer and forced a
> complete fsck on restart. After the disk check, Nebka seems to be
> working again. I took the liberty of creating an account for myself so
> I could log in remotely via SSH. SSH seems to be working, along with
> Apache, although when you connect to nebka.uconn.edu via the web you get
> the Apache startup page. I don't know anything about what services
> Nebka is offering, so I have not checked much further.
>
> Please try to connect to Nebka and see what, if anything, is still
> broken.
The disk has crashed again. I have no backup for
the crucial data in /home/aras. The most
important directories there are /home/aras/acis/userdata and
/home/aras/acis/backup.
If the machine is kept running, that data may
go away. If you guys don't have a local backup,
we would be in a severe fix.
If someone can make it there as soon as possible,
make a backup of /home/aras, and put in a new
disk with a basic debian o/s on it, I can then
work on restoring service. The disk that we
have there now is a goner.
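
Since the disk may fail at any moment, it makes sense to copy the most important directories first and to keep going past unreadable files. The following is a minimal sketch of that idea, assuming Python is available on a machine that can read the disk; the function name and the `/mnt/external` destination are hypothetical.

```python
import os
import shutil


def backup_in_priority_order(sources, dest_root):
    """Copy each source directory into dest_root, most important first.

    Returns a dict mapping each source to a list of error descriptions
    (an empty list means the copy completed without errors).
    """
    report = {}
    for src in sources:
        # Flatten the source path into a single directory name under dest_root.
        target = os.path.join(dest_root, src.strip(os.sep).replace(os.sep, "_"))
        try:
            shutil.copytree(src, target, symlinks=True)
            report[src] = []
        except shutil.Error as exc:
            # copytree continues past files it cannot copy and
            # reports all of them together at the end.
            errors = exc.args[0] if isinstance(exc.args[0], list) else [exc.args[0]]
            report[src] = [str(e) for e in errors]
        except OSError as exc:
            # For example: source missing, or target already exists.
            report[src] = [str(exc)]
    return report


if __name__ == "__main__":
    # Paths from the message above, in order of importance;
    # the destination is an assumed mount point for an external drive.
    wanted = ["/home/aras/acis/userdata", "/home/aras/acis/backup"]
    for src, errors in backup_in_priority_order(wanted, "/mnt/external").items():
        print(src, "ok" if not errors else f"{len(errors)} error(s)")
```

Files that could not be read are listed in the report rather than aborting the run, so a second pass after another fsck can target only what is missing.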
Cheers,
Thomas Krichel http://openlib.org/home/krichel
IKu writes
> It's down again. I've logged in successfully (ssh),
> but /bin/ls is unavailable (input/output error).
> /bin/mount unavailable.
> /bin/cat unavailable.
> /bin/bash is readable.
>
> perl runs.
>
> I don't know what to do. We need to do our
> best to recover/save RAS data.
I think it is still there. But we need somebody
to reboot, check the file system, and then save
/home/aras somewhere else.
We need this urgently.
root@nebka:/home/aras/acis/userdata#
is still there, but it may no longer be there later.
Cheers,
Thomas Krichel http://openlib.org/home/krichel
This is the log from a Skype chat with Ivan, before he
reported nebka down again:
[12:54:44] Иван В. Курманов: morning!
[12:54:53] Thomas Krichel: I am backing up
[12:54:57] … backing up
[12:55:01] Иван В. Курманов: did you have any luck making a backup of RAS?
[12:55:06] … ok
[12:55:13] Thomas Krichel: a night of panic is over
[12:55:24] … It's not done yet
[12:55:32] … it takes time to copy all the stuff.
[12:55:31] Иван В. Курманов: ok
[12:55:50] Thomas Krichel: but I have sahure and fafner to back it up independently.
[12:55:53] … I will check.
[12:55:59] … I promise
[12:56:07] Иван В. Курманов: what are you backing up?
[12:56:12] … what exactly?
[12:56:19] Thomas Krichel: /var /etc /root /home
[12:56:54] Иван В. Курманов: /home/aras/acis/userdata is of primary importance
[12:56:55] Thomas Krichel: I think we can soon delete /home/aacis
[12:57:21] … I can try to run an extra backup for that, say every hour
[12:57:51] Иван В. Курманов: /home/aras/backup contains some important bits
[12:58:20] Thomas Krichel: I know. I have written a nightly for awho to do them there too.
--
Now panic has set in again.
Cheers,
Thomas Krichel http://openlib.org/home/krichel
IKu writes
> It's down again. I've logged in successfully (ssh),
> but /bin/ls is unavailable (input/output error).
> /bin/mount unavailable.
> /bin/cat unavailable.
> /bin/bash is readable.
>
> perl runs.
>
> I don't know what to do. We need to do our
> best to recover/save RAS data.
Arrgh! /home/aras is not yet backed up.
I recommend shutting down. But I cannot shut
it down because it does not see shutdown anymore.
Cheers,
Thomas Krichel http://openlib.org/home/krichel
Ivan has just skyped me:
Иван В. Курманов: Thomas, there's a problem on nebka
[12:59:45] … RAS is down
[12:59:50] … and I can't ssh into the machine
[13:00:50] Thomas Krichel: probably an electricity problem
[13:01:02] Иван В. Курманов: no, the machine responds
[13:01:14] … it gets a 500 internal server error
[13:01:25] … and asks for a password when you try ssh
[13:01:36] Thomas Krichel: disk failure!
[13:01:49] Иван В. Курманов: it may be
[13:01:54] Thomas Krichel: most likely
[13:02:49] … There is nothing we can do but alert the ras-run list
--
Cheers,
Thomas Krichel http://openlib.org/home/krichel