Re: [RAS] nebka is now running

15 Mar 2008

      Christian Zimmermann writes
...
Tim finally got it to work. I must have done something in the RAID
configuration utility that erased the tables on sdb1.
Oh great.
...
The current state of the system is: kernel 2.6, ext3 filesystem with  
dir_index feature, empty sdb1, boot and root on sda1.
Note: Tim added the dir_index feature also to sda1. This allows better
handling of large directories, but works only with 2.6.
Tim is convinced, and I agree, that we do not have a hard drive problem.  
The problem is software related and has to do with the fact that there is 
an awful lot of disk I/O going on on this machine. We should assess all  
the rsyncs and such running and see whether they are necessary, and
whether they are needed at the current frequency.
Yes, it is an I/O-related software issue. Linux kernels don't
  handle hardware problems gracefully, but horribly. This also
  applies to bad disks. To solve the issue, you either rewrite
  the Linux kernel, or you get a new disk.
...
We should also be using the second drive to distribute the I/O load
optimally across the two drives. Say, put only /home on sdb1, or only  
/home/aras.
My sense is that this strategy would also be valid on raneb, snefru, etc., which
seem to have disk emergencies more often than usual...
No. There is no space on them; the close to 1TB of disk
  space on raneb and snefru is used up by backups. But there 
  is no backup of nebka because of your bedevilling of rsync.

  Snefru has had no disk problem. Raneb had it, it was bad blocks,
  changed disk, all clear. Chichek had it, it was bad blocks, changed
  disk, all clear. Fafner had it, it was bad blocks, changed disk,
  all clear. In the meantime, I keep backups. 

  The fact that we were able to do the entire rsync after marking
  the bad blocks as bad demonstrates that when the system does
  not hit the bad blocks, it works. When the next bad block comes along,
  it will go belly up again. In my experience, the more bad blocks
  you have, the more bad blocks you get.
...
1) Put the RePEc Author Service back online. We have recently had
15-40 new authors a day signing up. We do not want to discourage new
users.
2) Think hard about how to optimize disk load.
3) Only then implement the new strategy.
The first priority should be a complete backup, daily. More
  rsync, not less.

  Cheers,

  Thomas Krichel                    http://openlib.org/home/krichel
                                RePEc:per:1965-06-05:thomas_krichel
  phone: +7 383 330 6813                       skype: thomaskrichel