I am heading now to the server farm to do hardware maintenance on nebka.
This may take a few hours. Kit or Sune, please redirect authors.repec.org
to repec.econ.uconn.edu until further notice. Thanks.
Christian Zimmermann FIGUGEGL!
Department of Economics
University of Connecticut
341 Mansfield Road, Unit 1063
Storrs, CT 06269-1063
http://ideas.repec.org/zimm/
christian.zimmermann(a)uconn.edu
http://ideas.repec.org/e/pzi1.html
If you are doing any work on nebka that presents a risk of panic or
shutdown, please wait until Tuesday PM (Eastern Time). I will be off
campus or teaching.
My barrage of emails seems to have worked: nebka's Barracuda
reputation is no longer poor.
Bob Parks writes
> Could very well be - the eduunix.ccut.edu is very good and I will go with
> your theory. I will be interested to know just
> how you rsync to the 143 gig and then make it bootable.
I did it here. My laptop froze to death. I bought a desktop, put
the disk from the laptop in it, and booted from the laptop disk.
After initializing the desktop disk as /dev/sdb (making sure I
put in a swap partition), I mounted the main partition of it at
/vol, then
rsync -va --exclude /vol --exclude /proc --exclude /sys / /vol
Then
grub-install --root-directory=/vol /dev/sdb
Then I edited /vol/etc/fstab, putting in the right device for the
swap (it did work after a few tries ;-). Took the laptop disk out,
put the desktop disk in its place, bingo.
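The copy step can be rehearsed on a toy directory tree; the paths
under /tmp below are illustrative stand-ins (on the real machine the
source is / and the target is the new disk mounted at /vol):

```shell
# Toy re-run of the clone: /tmp/src stands in for the root
# filesystem, /tmp/src/vol for the mount point of the new disk's
# main partition. All paths and file contents are illustrative.
mkdir -p /tmp/src/etc /tmp/src/proc /tmp/src/sys /tmp/src/vol
echo "/dev/sdb2 none swap sw 0 0" > /tmp/src/etc/fstab
echo "runtime junk" > /tmp/src/proc/stat
cd /tmp/src
# Same shape as the real command: copy everything, skipping the
# pseudo-filesystems and the target mount point itself. The
# leading / in each --exclude anchors the pattern at the transfer
# root, so only the top-level vol, proc and sys are skipped.
rsync -va --exclude /vol --exclude /proc --exclude /sys ./ ./vol/
ls ./vol
```

After the run, ./vol contains etc (with its fstab) but no proc, sys,
or nested vol, which is exactly what you want before making the
copy bootable.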
On nebka, the rsync is likely to fail. Therefore:
rid stop
shutdown -h now
Boot in single user mode, mount / read-only, and run
e2fsck -c -y /dev/sda
that is, use the badblocks program to check the disk and mark the
bad blocks as bad, so that they don't turn up in the filesystem.
If my theory is correct, and no other bad blocks appear (two big
ifs), you can start with the rsync.
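As a dry run, the same -c check can be exercised on a throwaway
filesystem image rather than a live disk (the image path and size
are made up, and mke2fs/e2fsprogs are assumed to be installed; on
nebka the target would be the real root device):

```shell
# Create a small ext2 image and run the same badblocks check on it.
dd if=/dev/zero of=/tmp/testfs.img bs=1M count=4 2>/dev/null
mke2fs -q -F /tmp/testfs.img
# -c invokes badblocks in read-only mode and records any bad blocks
# found in the filesystem's bad-block inode; -y answers yes to all
# prompts; -f forces a check even if the filesystem looks clean.
e2fsck -c -y -f /tmp/testfs.img
echo "check done"
```

On a fresh image the badblocks pass finds nothing, but it confirms
the command syntax before pointing it at the real disk.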
Cheers,
Thomas Krichel http://openlib.org/home/krichel
RePEc:per:1965-06-05:thomas_krichel
phone: +7 383 330 6813 skype: thomaskrichel
----- Forwarded message from Bob Parks <bparks(a)artsci.wustl.edu> -----
From: Bob Parks <bparks(a)artsci.wustl.edu>
To: Thomas Krichel <krichel(a)openlib.org>
Date: Thu, 28 Feb 2008 00:47:49 +0600
Subject: Re: [RAS] badblocks
Thomas Krichel wrote:
> Bob Parks writes
>
>
>> Yes, IMHO. As Christian wrote earlier about nebka, there are limits to
>> directory sizes. He seemed to indicate that a cron job
>> with du might have been the entire problem. We have had similar problems
>> in the past.
>
> my theory: du puts stress on the disk, it hits the bad block, and bang!
>
Possible, very possible.
>> There are bad blocks on every disk. Bad blocks, unless a large number,
>> do not show that the 'disk' is failing. And again, this is a mirror'ed
>> disk, two disks, in Raid 1, with a hardware controller. Now that I think
>> on it,
>> it is not clear what badblocks on what disk are being reported by the
>> Adaptec controller -
>>
>
> my theory: the disk is one disk to the o/s.
Yes it is, but a bad block is a physical disk concept - but who knows what
evil lurks in the depths.
>
>> Note that nearly identical hardware exists on Bill's RFE machine and
>> never an error. You have had problems
>> on nebka, and snefru (identical hardware) and raneb (very different
>> hardware). That alone leads me to suspect
>> software.
>>
>
> I don't remember a problem on snefru. The common file set are
> the adrepec files (common on raneb, sahure, fafner, nebka, mutabor) and
> the citec files, common on mutabor, raneb,
> snefru, sahure, fafner (Yes, I back up!).
> What I think is what's written in 27.2.4. badblocks and e2fsck
> of
> http://eduunix.ccut.edu.cn/index/html/linux/OReilly.LPI.Linux.Certification…
>
> They say
> When a disk is failing, it will usually get an exponential increase in
> bad blocks, and after a short while it will run out of spare blocks,
> whereupon you will get into trouble with your filesystems on that
> disk.
>
> It has already run out of spare blocks, that's why some
> bad blocks show up to the o/s.
>
Could very well be - the eduunix.ccut.edu is very good and I will go with
your theory. I will be interested to know just
how you rsync to the 143 gig and then make it bootable.
Bob
----- Forwarded message from Bob Parks <bparks(a)artsci.wustl.edu> -----
From: Bob Parks <bparks(a)artsci.wustl.edu>
To: Thomas Krichel <krichel(a)openlib.org>
Date: Wed, 27 Feb 2008 23:57:15 +0600
Subject: Re: [RAS] badblocks
Thomas Krichel wrote:
>
> Bob Parks writes
>
>
>> My memory is different about raneb.
>>
>
> We had a bad disk. When we replaced, it was fine.
>
>
>> All of the errors seem to melt away when some of the crons are disabled -
>> such as Christian did with the ones involving du.
>>
>>
>
> So what does this conclude? A software problem?
>
Yes, IMHO. As Christian wrote earlier about nebka, there are limits to
directory sizes. He seemed to indicate that a cron job
with du might have been the entire problem. We have had similar problems
in the past.
>
>> Get rid of THE disk does not compute. There are two disks, and in a
>> configuration that is as fault tolerant as
>> it gets.
>>
>
> But it will break at some stage. The badblocks show it's broken.
>
There are bad blocks on every disk. Bad blocks, unless there is a
large number of them, do not show that the 'disk' is failing. And
again, this is a mirrored disk: two disks in RAID 1 with a hardware
controller. Now that I think on it, it is not clear which bad blocks
on which disk are being reported by the Adaptec controller.
Note that nearly identical hardware exists on Bill's RFE machine and
never an error. You have had problems on nebka and snefru (identical
hardware) and raneb (very different hardware). That alone leads me
to suspect software.
>
>> Up to you and Christian but I believe this is not the solution. Bob
>>
>
> What is the solution?
>
>
As Christian has done, carefully bring the machine back to life
without all the crons and add the crons sparingly. I have not heard
of any more problems with nebka since he did that, and it is on the
same RAID 1 two-disk mirror.
If you do decide to make the 143 gig bootable, Christian should,
after a time, boot and enter the Adaptec controller, then break the
'container' which holds the two 68 gig disks. You can then have two
68 gig disks, check them individually, and gain 68 gig of space.
In the end, it is your choice.
Bob
----- End forwarded message -----
Bob Parks writes
> Yes, IMHO. As Christian wrote earlier about nebka, there are limits to
> directory sizes. He seemed to indicate that a cron job
> with du might have been the entire problem. We have had similar problems
> in the past.
my theory: du puts stress on the disk, it hits the bad block, and bang!
> There are bad blocks on every disk. Bad blocks, unless a large number, do
> not show that the 'disk' is failing. And again, this is a mirror'ed disk,
> two disks, in Raid 1, with a hardware controller. Now that I think on it,
> it is not clear what badblocks on what disk are being reported by the
> Adaptec controller -
my theory: the disk is one disk to the o/s.
> Note that nearly identical hardware exists on Bill's RFE machine and never
> an error. You have had problems
> on nebka, and snefru (identical hardware) and raneb (very different
> hardware). That alone leads me to suspect
> software.
I don't remember a problem on snefru. The common file set are
the adrepec files (common on raneb, sahure, fafner, nebka,
mutabor) and the citec files, common on mutabor, raneb,
snefru, sahure, fafner (Yes, I back up!).
What I think is what is written in section 27.2.4, "badblocks and
e2fsck", of
http://eduunix.ccut.edu.cn/index/html/linux/OReilly.LPI.Linux.Certification…
They say
When a disk is failing, it will usually get an exponential increase in
bad blocks, and after a short while it will run out of spare blocks,
whereupon you will get into trouble with your filesystems on that
disk.
It has already run out of spare blocks, that's why some
bad blocks show up to the o/s.
Bob Parks writes
> My memory is different about raneb.
We had a bad disk. When we replaced it, it was fine.
> All of the errors seem to melt away when some of the crons are disabled -
> such as Christian did with the ones involving du.
>
So what do we conclude from this? A software problem?
> Get rid of THE disk does not compute. There are two disks, and in a
> configuration that is as fault tolerant as
> it gets.
But it will break at some stage. The badblocks show it's broken.
> Up to you and Christian but I believe this is not the solution. Bob
What is the solution?
Christian Zimmermann writes
> I will be leaving campus in a little while, to return at 7:30 tomorrow
> morning. Wait until then if you want to reboot.
I will not do anything, unless you want me to set up the spam
filter.
Christian Zimmermann writes
> I am back and stand ready to run to the server farm if necessary.
With an 11-hour time difference, I was in bed. I have been thinking
a bit more.
I remember that when I had a similar problem with raneb, there were
only 12 or 40 bad blocks, but they caused the disk to crash. Now
that the offending disk has been replaced, it's all quiet on the
raneb front. I would therefore suggest that the troubles come from
the bad blocks.
The way I understand disks, decay is exponential. Most modern disks
have some spare space that is hidden from the o/s. When bad blocks
appear, the data is moved from the bad blocks to healthy blocks, in
a way that is transparent to the o/s. When there are too many bad
blocks, the o/s starts seeing them, and that's when Linux gets
rather merciless; it does not take hardware issues lightly.
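The exponential claim can be illustrated with a toy calculation (the
spare pool of 100 blocks and the weekly doubling are invented
numbers for illustration, not measurements from nebka):

```shell
# Toy model: a spare pool of 100 blocks and a bad-block count that
# doubles each week. The pool lasts only a handful of doublings.
awk 'BEGIN {
  bad = 1; spares = 100; week = 0
  while (bad <= spares) { week++; bad *= 2 }
  printf "spares exhausted in week %d (bad=%d)\n", week, bad
}'
# prints: spares exhausted in week 7 (bad=128)
```

The point is that once remapping starts, the visible symptoms arrive
suddenly, which matches bad blocks only now showing up to the o/s.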
So even with 3 bad blocks, we need to get rid of the disk; software
updates will not help.
e2fsck has a -c option that will scan for bad blocks and mark them
as bad, so that they are not used by the o/s. If we run this on
startup, with the root file system mounted read-only, it should mark
the bad blocks. If we then immediately (so that no further bad
blocks appear) rsync the files from sda to sdb, make sdb bootable,
and swap the disks to boot from sdb, we should be fine.
I did such an operation locally and can give further
instructions if you agree with the general course of
action.