I am heading now to the server farm to do hardware maintenance on nebka.
This may take a few hours. Kit or Sune, please redirect authors.repec.org
to repec.econ.uconn.edu until further notice. Thanks.
Christian Zimmermann FIGUGEGL!
Department of Economics
University of Connecticut
341 Mansfield Road, Unit 1063
Storrs, CT 06269-1063
http://ideas.repec.org/zimm/
christian.zimmermann(a)uconn.edu
http://ideas.repec.org/e/pzi1.html
If you are doing any work on nebka that presents a risk of panic or
shutdown, please wait until Tuesday PM (Eastern Time). I will be off
campus or teaching.
My barrage of emails seems to have worked: nebka's Barracuda
reputation is no longer poor.
Bob Parks writes
> Could very well be - the eduunix.ccut.edu is very good and I will go with
> your theory. I will be interested to know just
> how you rsync to the 143 gig and then make it bootable.
I did it here. My laptop froze to death. I bought a desktop, put
the disk from the laptop in it, and booted from the laptop disk.
After initializing the desktop disk as /dev/sdb (making sure I
put in a swap partition), I mounted the main partition of it at
/vol, then
rsync -va --exclude /vol --exclude /proc --exclude /sys / /vol
Then
grub-install --root-directory=/vol /dev/sdb
Then I edited /vol/etc/fstab, putting in the right device for the
swap (it did work after a few tries ;-). Took the laptop disk out,
put the desktop disk in its place, bingo.
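The copy step can be rehearsed on a toy directory tree; the paths
under /tmp below are illustrative stand-ins (on the real machine the
source is / and the target is the new disk mounted at /vol):

```shell
# Toy re-run of the clone: /tmp/src stands in for the root
# filesystem, /tmp/src/vol for the mount point of the new disk's
# main partition. All paths and file contents are illustrative.
mkdir -p /tmp/src/etc /tmp/src/proc /tmp/src/sys /tmp/src/vol
echo "/dev/sdb2 none swap sw 0 0" > /tmp/src/etc/fstab
echo "runtime junk" > /tmp/src/proc/stat
cd /tmp/src
# Same shape as the real command: copy everything, skipping the
# pseudo-filesystems and the target mount point itself. The
# leading / in each --exclude anchors the pattern at the transfer
# root, so only the top-level vol, proc and sys are skipped.
rsync -va --exclude /vol --exclude /proc --exclude /sys ./ ./vol/
ls ./vol
```

After the run, ./vol contains etc (with its fstab) but no proc, sys,
or nested vol, which is exactly what you want before making the
copy bootable.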
On nebka, the rsync is likely to fail. Therefore:
rid stop
shutdown -h now
Boot in single user mode, mount / read-only, and run
e2fsck -c -y /dev/sda
that is, use the badblocks program to check the disk and mark the
bad blocks as bad, so that they don't turn up in the filesystem.
If my theory is correct, and no other bad blocks appear (two big
ifs), you can start with the rsync.
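As a dry run, the same -c check can be exercised on a throwaway
filesystem image rather than a live disk (the image path and size
are made up, and mke2fs/e2fsprogs are assumed to be installed; on
nebka the target would be the real root device):

```shell
# Create a small ext2 image and run the same badblocks check on it.
dd if=/dev/zero of=/tmp/testfs.img bs=1M count=4 2>/dev/null
mke2fs -q -F /tmp/testfs.img
# -c invokes badblocks in read-only mode and records any bad blocks
# found in the filesystem's bad-block inode; -y answers yes to all
# prompts; -f forces a check even if the filesystem looks clean.
e2fsck -c -y -f /tmp/testfs.img
echo "check done"
```

On a fresh image the badblocks pass finds nothing, but it confirms
the command syntax before pointing it at the real disk.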
Cheers,
Thomas Krichel http://openlib.org/home/krichel
RePEc:per:1965-06-05:thomas_krichel
phone: +7 383 330 6813 skype: thomaskrichel
----- Forwarded message from Bob Parks <bparks(a)artsci.wustl.edu> -----
From: Bob Parks <bparks(a)artsci.wustl.edu>
To: Thomas Krichel <krichel(a)openlib.org>
Date: Thu, 28 Feb 2008 00:47:49 +0600
Subject: Re: [RAS] badblocks
Thomas Krichel wrote:
> Bob Parks writes
>
>
>> Yes, IMHO. As Christian wrote earlier about nebka, there are limits to
>> directory sizes. He seemed to indicate that a cron job
>> with du might have been the entire problem. We have had similar problems
>> in the past.
>
> my theory: du puts stress on the disk, it hits the bad block, and bang!
>
Possible, very possible.
>> There are bad blocks on every disk. Bad blocks, unless a large number,
>> do not show that the 'disk' is failing. And again, this is a mirror'ed
>> disk, two disks, in Raid 1, with a hardware controller. Now that I think
>> on it,
>> it is not clear what badblocks on what disk are being reported by the
>> Adaptec controller -
>>
>
> my theory: the disk is one disk to the o/s.
Yes it is, but a bad block is a physical disk concept - but who knows what
evil lurks in the depths.
>
>> Note that nearly identical hardware exists on Bill's RFE machine and
>> never an error. You have had problems
>> on nebka, and snefru (identical hardware) and raneb (very different
>> hardware). That alone leads me to suspect
>> software.
>>
>
> I don't remember a problem on snefru. The common file set are
> the adrepec files (common on raneb, sahure, fafner, nebka, mutabor) and
> the citec files, common on mutabor, raneb,
> snefru, sahure, fafner (Yes, I back up!).
> What I think is what's written in 27.2.4. badblocks and e2fsck
> of
> http://eduunix.ccut.edu.cn/index/html/linux/OReilly.LPI.Linux.Certification…
>
> They say
> When a disk is failing, it will usually get an exponential increase in
> bad blocks, and after a short while it will run out of spare blocks,
> whereupon you will get into trouble with your filesystems on that
> disk.
>
> It has already run out of spare blocks, that's why some
> bad blocks show up to the o/s.
>
Could very well be - the eduunix.ccut.edu is very good and I will go with
your theory. I will be interested to know just
how you rsync to the 143 gig and then make it bootable.
Bob
----- Forwarded message from Bob Parks <bparks(a)artsci.wustl.edu> -----
From: Bob Parks <bparks(a)artsci.wustl.edu>
To: Thomas Krichel <krichel(a)openlib.org>
Date: Wed, 27 Feb 2008 23:57:15 +0600
Subject: Re: [RAS] badblocks
Thomas Krichel wrote:
>
> Bob Parks writes
>
>
>> My memory is different about raneb.
>>
>
> We had a bad disk. When we replaced, it was fine.
>
>
>> All of the errors seem to melt away when some of the crons are disabled -
>> such as Christian did with the ones involving du.
>>
>>
>
> So what does this conclude? A software problem?
>
Yes, IMHO. As Christian wrote earlier about nebka, there are limits to
directory sizes. He seemed to indicate that a cron job
with du might have been the entire problem. We have had similar problems
in the past.
>
>> Get rid of THE disk does not compute. There are two disks, and in a
>> configuration that is as fault tolerant as
>> it gets.
>>
>
> But it will break at some stage. The badblocks show it's broken.
>
There are bad blocks on every disk. Bad blocks, unless there is a
large number of them, do not show that the 'disk' is failing. And
again, this is a mirrored disk: two disks in RAID 1 with a hardware
controller. Now that I think on it, it is not clear which bad blocks
on which disk are being reported by the Adaptec controller.
Note that nearly identical hardware exists on Bill's RFE machine and
never an error. You have had problems on nebka and snefru (identical
hardware) and raneb (very different hardware). That alone leads me
to suspect software.
>
>> Up to you and Christian but I believe this is not the solution. Bob
>>
>
> What is the solution?
>
>
As Christian has done, carefully bring the machine back to life
without all the crons and add the crons sparingly. I have not heard
of any more problems with nebka since he did that, and it is on the
same RAID 1 two-disk mirror.
If you do decide to make the 143 gig bootable, Christian should,
after a time, boot and enter the Adaptec controller, then break the
'container' which holds the two 68 gig disks. You can then have two
68 gig disks, check them individually, and gain 68 gig of space.
In the end, it is your choice.
Bob
----- End forwarded message -----
Bob Parks writes
> Yes, IMHO. As Christian wrote earlier about nebka, there are limits to
> directory sizes. He seemed to indicate that a cron job
> with du might have been the entire problem. We have had similar problems
> in the past.
my theory: du puts stress on the disk, it hits the bad block, and bang!
> There are bad blocks on every disk. Bad blocks, unless a large number, do
> not show that the 'disk' is failing. And again, this is a mirror'ed disk,
> two disks, in Raid 1, with a hardware controller. Now that I think on it,
> it is not clear what badblocks on what disk are being reported by the
> Adaptec controller -
my theory: the disk is one disk to the o/s.
> Note that nearly identical hardware exists on Bill's RFE machine and never
> an error. You have had problems
> on nebka, and snefru (identical hardware) and raneb (very different
> hardware). That alone leads me to suspect
> software.
I don't remember a problem on snefru. The common file set are
the adrepec files (common on raneb, sahure, fafner, nebka,
mutabor) and the citec files, common on mutabor, raneb,
snefru, sahure, fafner (Yes, I back up!).
What I think is what is written in section 27.2.4, "badblocks and
e2fsck", of
http://eduunix.ccut.edu.cn/index/html/linux/OReilly.LPI.Linux.Certification…
They say
When a disk is failing, it will usually get an exponential increase in
bad blocks, and after a short while it will run out of spare blocks,
whereupon you will get into trouble with your filesystems on that
disk.
It has already run out of spare blocks, that's why some
bad blocks show up to the o/s.
Bob Parks writes
> My memory is different about raneb.
We had a bad disk. When we replaced it, it was fine.
> All of the errors seem to melt away when some of the crons are disabled -
> such as Christian did with the ones involving du.
>
So what do we conclude from this? A software problem?
> Get rid of THE disk does not compute. There are two disks, and in a
> configuration that is as fault tolerant as
> it gets.
But it will break at some stage. The badblocks show it's broken.
> Up to you and Christian but I believe this is not the solution. Bob
What is the solution?
Christian Zimmermann writes
> I will be leaving campus in a little while, to return at 7:30 tomorrow
> morning. Wait until then if you want to reboot.
I will not do anything, unless you want me to set up the spam
filter.
Christian Zimmermann writes
> I am back and stand ready to run to the server farm if necessary.
With an 11-hour time difference, I was in bed. I have been thinking
a bit more.
I remember that when I had a similar problem with raneb, there were
only 12 or 40 bad blocks, but they caused the disk to crash. Now
that the offending disk has been replaced, it's all quiet on the
raneb front. I would therefore suggest that the troubles come from
the bad blocks.
The way I understand disks, decay is exponential. Most modern disks
have some spare space that is hidden from the o/s. When bad blocks
appear, the data is moved from the bad blocks to healthy blocks, in
a way that is transparent to the o/s. When there are too many bad
blocks, the o/s starts seeing them, and that's when Linux gets
rather merciless; it does not take hardware issues lightly.
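The exponential claim can be illustrated with a toy calculation (the
spare pool of 100 blocks and the weekly doubling are invented
numbers for illustration, not measurements from nebka):

```shell
# Toy model: a spare pool of 100 blocks and a bad-block count that
# doubles each week. The pool lasts only a handful of doublings.
awk 'BEGIN {
  bad = 1; spares = 100; week = 0
  while (bad <= spares) { week++; bad *= 2 }
  printf "spares exhausted in week %d (bad=%d)\n", week, bad
}'
# prints: spares exhausted in week 7 (bad=128)
```

The point is that once remapping starts, the visible symptoms arrive
suddenly, which matches bad blocks only now showing up to the o/s.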
So even with 3 bad blocks, we need to get rid of the disk; software
updates will not help.
e2fsck has a -c option that will scan for bad blocks and mark them
as bad, so that they are not used by the o/s. If we run this on
startup, with the root file system mounted read-only, it should mark
the bad blocks. If we then immediately (so that no further bad
blocks appear) rsync the files from sda to sdb, make sdb bootable,
and swap the disks to boot from sdb, we should be fine.
I did such an operation locally and can give further
instructions if you agree with the general course of
action.