Re: kernel problems


Subject: Re: kernel problems
From: civileme (civileme@mandrakesoft.com)
Date: Sun Jul 14 2002 - 23:20:43 AKDT


On Sunday 14 July 2002 10:45 pm, Matthew Schumacher wrote:
> Hello all,
>
> Last week a mail server I manage suddenly failed so I tried to get a
> shell on the server and found that I could log in but bash would hang. I
> thought this was interesting, so I passed a 'ps -ef' through ssh to the
> machine to see if that would work. To my surprise it did. I got a list
> back of 248 processes. After further investigation it seems that my box
> hit a process limit and simply would not start a bash shell (but it did
> start a ps, which is odd). Anyway, I started passing kill commands through
> ssh to try and shut it down but kill wouldn't work, kill -9 wouldn't
> work, killall wouldn't work, shutdown wouldn't work, nothing I did would
> kill a process. After fighting with it for about an hour I drove to the
> co-lo room and hit the switch.
>
> Does anyone know what might cause this? Why would the kernel simply
> refuse to kill anything? Btw, the box normally has 110-130 processes
> running so something had to happen to cause it to hit 248. The extra
> processes looked to be stale sendmail/qpopper/imap processes.
>
> The machine is running Red Hat 7.3 with the Red Hat 2.4.18-4 kernel. I
> tried using a generic kernel I compiled, but my SCSI performance dropped
> by half. After some conversation with Alan Cox, he says that Red Hat
> applies some hi-mem/SCSI patches to their kernels which fix some
> performance problems with the generic kernel. I also had a lot of
> trouble using quotas under heavy load with the generic kernel, so alas I
> am running a Red Hat kernel.
>
> Anyway, I really can't deal with software failure on this machine.
> Hopefully someone will have a suggestion on how to troubleshoot this.
> If not maybe I'll start testing FreeBSD....
>
> Later,
>
> schu
>
>
> ---------
> To unsubscribe, send email to <aklug-request@aklug.org>
> with 'unsubscribe' in the message body.

Hmm,

I would have had webmin running to force a reboot (with sync), but I find it
strange that you did not try killall5.

What processor? I can think of a couple that have problems that would cause
something like this with different access modes on pages exactly 32K apart.

Also, SMP or UP, and what memory model?

And would you by chance have jabberd running on that thing?

Civileme




This archive was generated by hypermail 2a23 : Sun Jul 14 2002 - 23:20:47 AKDT