...making Linux just a little more fun!
jim ruxton [cinetron at passport.ca]
Hi I'm on a P4 running an smp kernel ie.
uname -a : Linux jims-laptop 2.6.20-16-386 #2 Sun Sep 23 19:47:10 UTC 2007 i686 GNU/Linux
For some reason occasionaly one of my cpu's runs away and starts showing 100% cpu usage, however top only shows the program using the most cpu at 7% or less. Where is that 93% going? This is driving me crazy. I have to reboot the machine to get it back to normal. I've tried Fedora and Kubuntu. Though Kubuntu appeared more stable it still happened. Played around with changing max_cstate without success.I've just downgraded to a 386 kernel to see if that helps. This isn't a real solution however. Do you have any suggestions? I've asked around on forums and irc channels without luck. Thanks. Jim
output of cat /proc/cpuinfo :
processor : 0 vendor_id : GenuineIntel cpu family : 15 model : 2 model name : Intel(R) Pentium(R) 4 CPU 3.06GHz stepping : 9 cpu MHz : 3059.352 cache size : 512 KB physical id : 0 siblings : 2 core id : 0 cpu cores : 1 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr bogomips : 6123.68 clflush size : 64
Mulyadi Santosa [mulyadi.santosa at gmail.com]
Hi...
> Hi I'm on a P4 running an smp kernel ie. > > uname -a : > > Linux jims-laptop 2.6.20-16-386 #2 Sun Sep 23 19:47:10 UTC 2007 i686 > GNU/Linux > > For some reason occasionaly one of my cpu's runs away and starts showing > 100% cpu usage, however top only shows the program using the most cpu at > 7% or less.
Looking at /proc/cpuinfo, sounds like you have P4 with hyperthread enabled, true? If so, this 7% max of CPU usage, is it only on one core or cumulative load over all cores?
> Where is that 93% going? This is driving me crazy. I have to > reboot the machine to get it back to normal. I've tried Fedora and > Kubuntu. Though Kubuntu appeared more stable it still happened. Played > around with changing max_cstate without success.I've just downgraded to > a 386 kernel to see if that helps. This isn't a real solution however. > Do you have any suggestions? I've asked around on forums and irc > channels without luck. Thanks. >My simple suggestion right now: please download latest stable kernel from kernel.org and compile by yourself. There is a chance you hit a (unknown?) kernel scheduler bug. Or it could be a power management problem (is it a mobile processor?).
Since I am not really knowledgeable in this field, try to fiddle with options during kernel configuration such as ACPI, APM, APIC support and so on.
let's see if that will cure your problem.
regards,
Mulyadi
Ben Okopnik [ben at linuxgazette.net]
On Mon, Oct 08, 2007 at 02:21:21PM -0400, jim ruxton wrote:
> Hi I'm on a P4 running an smp kernel ie. > > uname -a : > > Linux jims-laptop 2.6.20-16-386 #2 Sun Sep 23 19:47:10 UTC 2007 i686 > GNU/Linux > > For some reason occasionaly one of my cpu's runs away and starts showing > 100% cpu usage, however top only shows the program using the most cpu at > 7% or less. Where is that 93% going? This is driving me crazy. I have to > reboot the machine to get it back to normal. I've tried Fedora and > Kubuntu. Though Kubuntu appeared more stable it still happened. Played > around with changing max_cstate without success.I've just downgraded to > a 386 kernel to see if that helps. This isn't a real solution however. > Do you have any suggestions? I've asked around on forums and irc > channels without luck. Thanks.
One of the things that might be worth trying is reducing everything to a minimum and seeing if it still happens. That is, try booting into single mode and shutting down any non-critical processes, then just letting the system "cook" for a little while. If it happens again, you've most likely narrowed it down to either a) the kernel or b) the hardware. For troubleshooting the former, check out the kernel compilation options (I seem to recall a developer/debugging option accessible via 'make config'.) For the latter, try a hairdryer aimed at the offending CPU.
-- * Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *
Thomas Adam [thomas.adam22 at gmail.com]
On 08/10/2007, Ben Okopnik <ben at linuxgazette.net> wrote:
> One of the things that might be worth trying is reducing everything to a > minimum and seeing if it still happens. That is, try booting into single > mode and shutting down any non-critical processes, then just letting the > system "cook" for a little while. If it happens again, you've most > likely narrowed it down to either a) the kernel or b) the hardware. For > troubleshooting the former, check out the kernel compilation options (I > seem to recall a developer/debugging option accessible via 'make > config'.) For the latter, try a hairdryer aimed at the offending CPU.
Yes, and quite often the other main offender is choosing a wrong arch type for your system. That'll cause quite often either a high CPU load or massive amounts of disk access. In addition, specifying:
noapic
As a kernel parameter might also help.
-- Thomas Adam
René Pfeiffer [lynx at luchs.at]
Hello, Jim!
On Oct 08, 2007 at 1421 -0400, jim ruxton appeared and said:
> Hi I'm on a P4 running an smp kernel ie. > > uname -a : > > Linux jims-laptop 2.6.20-16-386 #2 Sun Sep 23 19:47:10 UTC 2007 i686 > GNU/Linux
Well, if you run a SMP kernel the output is usually something like this:
Linux nightfall 2.6.22.9 #1 SMP PREEMPT Tue Oct 2 22:02:38 CEST 2007 x86_64 GNU/Linux
uname shows a "SMP" next to the version. Double check if your kernel has activated SMP code. Your output of /proc/cpuinfo only shows one processor. You should have two (indicated by a different ID next to the field "processor").
It should be noted that although some P4 CPUs have hyperthreading not all mainboards or BIOS variants support it. I've seen P4s not capable of hyperthreading simply because of the hardware or the BIOS on the board.
> For some reason occasionaly one of my cpu's runs away and starts showing > 100% cpu usage, however top only shows the program using the most cpu at > 7% or less. Where is that 93% going? This is driving me crazy.
Does the system become unresponsive or sluggish?
Can you check if "top" shows any distribution of the load? My top and procinfo tools also show what the load is like, such as IRQ activity, waiting for I/O or other things. This is an output of procinfo on my SMP system (Intel Core 2 Duo):
lynx at nightfall:~$ procinfo Linux 2.6.22.9 (lynx at nightfall) (gcc [can't parse]) #??? 2CPU [nightfall.luchs.at] Memory: Total Used Free Shared Buffers Mem: 1028328 967664 60664 0 16480 Swap: 995988 972 995016 Bootup: Tue Oct 2 22:06:31 2007 Load average: 0.03 0.07 0.08 1/133 27736 user : 6:15:19.98 2.2% page in : 4400622 disk 1: 451r 214w nice : 0:01:00.35 0.0% page out: 3204961 disk 2: 6518r 4384w system: 2:15:43.08 0.8% page act: 858311 disk 3: 139898r 364672w IOwait: 0:06:14.76 0.0% page dea: 479245 hw irq: 0:25:24.70 0.1% page flt: 34457299 sw irq: 0:18:43.76 0.1% swap in : 25 idle : 11d 10:38:31.98 95.6% swap out: 233 uptime: 5d 23:37:23.38 context :109507629 [...]
There's user, nice, system, IOwait, the IRQ modes and idle time. And yes, now I see that procinfo also counts the CPUs by printing "2CPU". Using "x86info -v" is yet another way of checking SMP mode and CPU capabilities.
Best regards, Ren?.
Marty Leisner [leisner at rochester.rr.com]
jim ruxton <cinetron at passport.ca> writes on Mon, 08 Oct 2007 14:21:21 EDT
> Hi I'm on a P4 running an smp kernel ie. > > uname -a : > > Linux jims-laptop 2.6.20-16-386 #2 Sun Sep 23 19:47:10 UTC 2007 i686 > GNU/Linux > > For some reason occasionaly one of my cpu's runs away and starts showing > 100% cpu usage, however top only shows the program using the most cpu at > 7% or less. Where is that 93% going? This is driving me crazy. I have to > reboot the machine to get it back to normal. I've tried Fedora and > Kubuntu. Though Kubuntu appeared more stable it still happened. Played > around with changing max_cstate without success.I've just downgraded to > a 386 kernel to see if that helps. This isn't a real solution however. > Do you have any suggestions? I've asked around on forums and irc > channels without luck. Thanks. > Jim > > > output of cat /proc/cpuinfo : >
top also reports kernel, idle, user and a few others...
Are you 100% user? Is the system sluggish?
What type of load average are you seeing?
Also, if you're SMP there's an option in top to display each CPU (in recent procps tops, type 1).
[tops a lovely program -- there's been so many flavors since Bill Lefebvre version -- all slightly different...
Have you also run other load monitors as a sanity check?
marty
jim ruxton [cinetron at passport.ca]
Thanks Mulaydi.
> > Hi I'm on a P4 running an smp kernel ie. > > > > uname -a : > > > > Linux jims-laptop 2.6.20-16-386 #2 Sun Sep 23 19:47:10 UTC 2007 i686 > > GNU/Linux > > > > For some reason occasionaly one of my cpu's runs away and starts showing > > 100% cpu usage, however top only shows the program using the most cpu at > > 7% or less. > > Looking at /proc/cpuinfo, sounds like you have P4 with hyperthread > enabled, true? If so, this 7% max of CPU usage, is it only on one core > or cumulative load over all cores?No it is only on one core.
> > Where is that 93% going? This is driving me crazy. I have to > > reboot the machine to get it back to normal. I've tried Fedora and > > Kubuntu. Though Kubuntu appeared more stable it still happened. Played > > around with changing max_cstate without success.I've just downgraded to > > a 386 kernel to see if that helps. This isn't a real solution however. > > Do you have any suggestions? I've asked around on forums and irc > > channels without luck. Thanks. > > > My simple suggestion right now: please download latest stable kernel > from kernel.org and compile by yourself. There is a chance you hit a > (unknown?) kernel scheduler bug. Or it could be a power management > problem (is it a mobile processor?).It is a P4 on a laptop. I'm not sure how to tell whether or not it is a mobile version?
> > Since I am not really knowledgeable in this field, try to fiddle with > options during kernel configuration such as ACPI, APM, APIC support and > so on.I've tried with noapic as a kernal flag with no luck.
Jim
jim ruxton [cinetron at passport.ca]
Thanks Ben,
> > Hi I'm on a P4 running an smp kernel ie. > > > > uname -a : > > > > Linux jims-laptop 2.6.20-16-386 #2 Sun Sep 23 19:47:10 UTC 2007 i686 > > GNU/Linux > > > > For some reason occasionaly one of my cpu's runs away and starts showing > > 100% cpu usage, however top only shows the program using the most cpu at > > 7% or less. Where is that 93% going? This is driving me crazy. I have to > > reboot the machine to get it back to normal. I've tried Fedora and > > Kubuntu. Though Kubuntu appeared more stable it still happened. Played > > around with changing max_cstate without success.I've just downgraded to > > a 386 kernel to see if that helps. This isn't a real solution however. > > Do you have any suggestions? I've asked around on forums and irc > > channels without luck. Thanks. > > One of the things that might be worth trying is reducing everything to a > minimum and seeing if it still happens. That is, try booting into single > mode and shutting down any non-critical processes, then just letting the > system "cook" for a little while. If it happens again, you've most > likely narrowed it down to either a) the kernel or b) the hardware. For > troubleshooting the former, check out the kernel compilation options (I > seem to recall a developer/debugging option accessible via 'make > config'.) For the latter, try a hairdryer aimed at the offending CPU.I tried switching into Single User Mode once the 1 processor jumped to 100% but it didn't seem to change anything. I've tried unloading some modules but I keep getting Resource currently unavailable after running rmmod. Do you know what this would mean in this context? This glitch is hard to track down because sometimes it doesn't happen for a day or so after I've rebooted.
jim
jim ruxton [cinetron at passport.ca]
Thanks Ren? ,
> > > > uname -a : > > > > Linux jims-laptop 2.6.20-16-386 #2 Sun Sep 23 19:47:10 UTC 2007 i686 > > GNU/Linux > > Well, if you run a SMP kernel the output is usually something like this: > > Linux nightfall 2.6.22.9 #1 SMP PREEMPT Tue Oct 2 22:02:38 CEST 2007 > x86_64 GNU/Linux > > uname shows a "SMP" next to the version. Double check if your kernel has > activated SMP code. Your output of /proc/cpuinfo only shows one > processor. You should have two (indicated by a different ID next to the > field "processor").Hmm that is curious isn't it, beause htop and top show 2 processors.
> > It should be noted that although some P4 CPUs have hyperthreading not > all mainboards or BIOS variants support it. I've seen P4s not capable of > hyperthreading simply because of the hardware or the BIOS on the board.But the fact I see two processors in htop and top that are constantly changing in terms of load I would think that would mean HT is enabled?
> > > For some reason occasionaly one of my cpu's runs away and starts showing > > 100% cpu usage, however top only shows the program using the most cpu at > > 7% or less. Where is that 93% going? This is driving me crazy. > > Does the system become unresponsive or sluggish?Very sluggish to input ie. enetering text or backspacing is really slow. Sometimes it can't retrieve a webpage.
> > Can you check if "top" shows any distribution of the load? My top and > procinfo tools also show what the load is like, such as IRQ activity, > waiting for I/O or other things. This is an output of procinfo on my > SMP system (Intel Core 2 Duo):Below is my procinfo: (at this time things are not going bonkers. I'll run this again next time processor 1 runs away)
Linux 2.6.20-16-generic (root at terranova) (gcc 4.1.2 ) #2 2CPU [jims-laptop] Memory: Total Used Free Shared Buffers Mem: 515428 481384 34044 0 25356 Swap: 489940 0 489940 Bootup: Mon Oct 8 17:04:34 2007 Load average: 0.27 0.09 0.02 1/141 7093 user : 0:02:23.44 0.2% page in : 297863 disk 1: 14884r 12157w nice : 0:00:00.50 0.0% page out: 178564 system: 0:04:57.27 0.5% page act: 71635 IOwait: 0:01:38.46 0.1% page dea: 12933 hw irq: 0:00:07.60 0.0% page flt: 2111672 sw irq: 0:01:06.03 0.1% swap in : 0 idle : 15:36:19.55 98.9% swap out: 0 uptime: 7:53:13.86 context : 10939172 irq 0: 7098612 timer irq 16: 3368 yenta, Intel ICH5 irq 1: 3092 i8042 irq 17: 348537 uhci_hcd:usb1, nvidi irq 7: 0 parport0 irq 18: 3 uhci_hcd:usb2, ohci1 irq 8: 3 rtc irq 19: 0 uhci_hcd:usb3 irq 9: 7793376 acpi irq 20: 10624 eth0 irq 12: 478552 i8042 irq 21: 0 ehci_hcd:usb4 irq 14: 27046 libata irq 22: 1179944 eth1 irq 15: 132569 libata
Below is the result of x86info -v: This statement at the bottom is kind of cryptic:
WARNING: Detected SMP, but unable to access cpuid driver. Used Uniprocessor CPU routines. Results inaccurate.Any idea what it could mean?? Thanks again for the help.
x86info v1.18. Dave Jones 2001-2006 Feedback to <davej at redhat.com>. Found 2 CPUsCPU #1 /dev/cpu/0/cpuid: No such file or directory Found unknown cache descriptors: 40 50 5b 66 70 7b Family: 15 Model: 2 Stepping: 9 Type: 0 Brand: 9 CPU Model: Pentium 4 (Northwood) [D1] Original OEM Processor name string: Intel(R) Pentium(R) 4 CPU 3.06GHz Feature flags: Onboard FPU Virtual Mode Extensions Debugging Extensions Page Size Extensions Time Stamp Counter Model-Specific Registers Physical Address Extensions Machine Check Architecture CMPXCHG8 instruction Onboard APIC SYSENTER/SYSEXIT Memory Type Range Registers Page Global Enable Machine Check Architecture CMOV instruction Page Attribute Table 36-bit PSEs CLFLUSH instruction Debug Trace Store ACPI via MSR MMX support FXSAVE and FXRESTORE instructions SSE support SSE2 support CPU self snoop Hyper-Threading Thermal Monitor Pending Break Enable cntx-id xTPR Extended feature flags: Instruction trace cache: Size: 12K uOps 8-way associative. L1 Data cache: Size: 8KB Sectored, 4-way associative. line size=64 bytes. L2 unified cache: Size: 512KB Sectored, 8-way associative. line size=64 bytes. Instruction TLB: 4K, 2MB or 4MB pages, fully associative, 64 entries. Found unknown cache descriptors: 40 50 5b 66 70 7b Data TLB: 4KB or 4MB pages, fully associative, 64 entries. The physical package supports 2 logical processors -------------------------------------------------------------------------- CPU #2 Found unknown cache descriptors: 40 50 5b 66 70 7b Family: 15 Model: 2 Stepping: 9 Type: 0 Brand: 9 CPU Model: Pentium 4 (Northwood) [D1] Original OEM Processor name string: Intel(R) Pentium(R) 4 CPU 3.06GHz Feature flags: Onboard FPU Virtual Mode Extensions Debugging Extensions Page Size Extensions Time Stamp Counter Model-Specific Registers Physical Address Extensions Machine Check Architecture CMPXCHG8 instruction Onboard APIC SYSENTER/SYSEXIT Memory Type Range Registers Page Global Enable Machine Check Architecture CMOV instruction Page Attribute Table 36-bit PSEs CLFLUSH instruction Debug Trace Store ACPI via MSR MMX support FXSAVE and FXRESTORE instructions SSE support SSE2 support CPU self snoop Hyper-Threading Thermal Monitor Pending Break Enable cntx-id xTPR Extended feature flags: Instruction trace cache: Size: 12K uOps 8-way associative. L1 Data cache: Size: 8KB Sectored, 4-way associative. line size=64 bytes. L2 unified cache: Size: 512KB Sectored, 8-way associative. line size=64 bytes. Instruction TLB: 4K, 2MB or 4MB pages, fully associative, 64 entries. Found unknown cache descriptors: 40 50 5b 66 70 7b Data TLB: 4KB or 4MB pages, fully associative, 64 entries. The physical package supports 2 logical processors -------------------------------------------------------------------------- WARNING: Detected SMP, but unable to access cpuid driver. Used Uniprocessor CPU routines. Results inaccurate. ''
jim ruxton [cinetron at passport.ca]
Thanks Marty,
> > Hi I'm on a P4 running an smp kernel ie. > > > > uname -a : > > > > Linux jims-laptop 2.6.20-16-386 #2 Sun Sep 23 19:47:10 UTC 2007 i686 > > GNU/Linux > > > > For some reason occasionaly one of my cpu's runs away and starts showing > > 100% cpu usage, however top only shows the program using the most cpu at > > 7% or less. Where is that 93% going? This is driving me crazy. I have to > > reboot the machine to get it back to normal. I've tried Fedora and > > Kubuntu. Though Kubuntu appeared more stable it still happened. Played > > around with changing max_cstate without success.I've just downgraded to > > a 386 kernel to see if that helps. This isn't a real solution however. > > Do you have any suggestions? I've asked around on forums and irc > > channels without luck. Thanks. > > Jim > > > > > > output of cat /proc/cpuinfo : > > > > top also reports kernel, idle, user and a few others... > > Are you 100% user? Is the system sluggish?Yes I am the only user and the system is very sluggish.
> > What type of load average are you seeing?I see 100% on one cpu and 7% or so on the other. In terms of overall load, I'll check again next time the sytem goes crazy.
> > Also, if you're SMP there's an option in top to display each CPU (in recent procps tops, type 1).Yes I see that now , thanks. I had been using htops which shows both cpus automatically. I didn't know about entering 1 in top to get both cpus. Thanks. Jim.
> > Have you also run other load monitors as a sanity check?Just htops but I can tell as soon as it happens because everything crawls along including closing down the system.
Jim
Ramon van Alteren [ramon at forgottenland.net]
Ben Okopnik wrote:
> On Mon, Oct 08, 2007 at 02:21:21PM -0400, jim ruxton wrote: > >> Hi I'm on a P4 running an smp kernel ie. >> >> uname -a : >> >> Linux jims-laptop 2.6.20-16-386 #2 Sun Sep 23 19:47:10 UTC 2007 i686 >> GNU/Linux >> >> For some reason occasionaly one of my cpu's runs away and starts showing >> 100% cpu usage, however top only shows the program using the most cpu at >> 7% or less. Where is that 93% going? This is driving me crazy. I have to >> reboot the machine to get it back to normal. I've tried Fedora and >> Kubuntu. Though Kubuntu appeared more stable it still happened. Played >> around with changing max_cstate without success.I've just downgraded to >> a 386 kernel to see if that helps. This isn't a real solution however. >> Do you have any suggestions? I've asked around on forums and irc >> channels without luck. Thanks. >> > > One of the things that might be worth trying is reducing everything to a > minimum and seeing if it still happens. That is, try booting into single > mode and shutting down any non-critical processes, then just letting the > system "cook" for a little while. If it happens again, you've most > likely narrowed it down to either a) the kernel or b) the hardware. For > troubleshooting the former, check out the kernel compilation options (I > seem to recall a developer/debugging option accessible via 'make > config'.) For the latter, try a hairdryer aimed at the offending CPU. > >I've seen this behaviour before in combination with run-away acpi processes (acpid and friends) My own laptop (a dell D810 with a dual-core with Hyperthreading) shows the same behaviour if I try to hot-(un)dock it from it's docking station
Ramon
Ben Okopnik [ben at linuxgazette.net]
On Tue, Oct 09, 2007 at 12:48:33AM -0400, jim ruxton wrote:
> Ben Okopnik wrote: > > > > One of the things that might be worth trying is reducing everything to a > > minimum and seeing if it still happens. That is, try booting into single > > mode and shutting down any non-critical processes, then just letting the > > system "cook" for a little while. If it happens again, you've most > > likely narrowed it down to either a) the kernel or b) the hardware. For > > troubleshooting the former, check out the kernel compilation options (I > > seem to recall a developer/debugging option accessible via 'make > > config'.) For the latter, try a hairdryer aimed at the offending CPU. > > I tried switching into Single User Mode once the 1 processor jumped to > 100% but it didn't seem to change anything.
You need to go single-user before it happens. That way, you'll know if it's some program that you're running when you're not in single-user, which is the valuable piece of info here.
> I've tried unloading some > modules but I keep getting Resource currently unavailable after running > rmmod.
Sounds like your process table got filled up - that's when I've seen that kind of message.
> Do you know what this would mean in this context? This glitch is > hard to track down because sometimes it doesn't happen for a day or so > after I've rebooted.
Unfortunately, this kind of problem usually breaks down to one of two possibilities:
1) You're just Joe Average-User, messing around at home and trying to learn something. Because of this, you have the leisure to experiment as I suggested, so you spend several days twiddling and fiddling, and you discover a kernel bug that earns you world-wide recognition, undying fame, and a bouquet of roses from Linus Torvalds (*NOW* you'll be able to get all the girls!)
2) You work in the corporate environment, and your company hasn't quite yet spent the money on the spares that you've been begging them to get for years. You curse them, curse the damn computers, and start madly swapping hardware and kernels until it sorta kinda works (i.e., no crashes in a 24-hour period), and forget all about it until the next problem comes along.
Or there's always whatever intermediate version of these that you come up with for yourself. Take your pick.
-- * Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *
Mulyadi Santosa [mulyadi.santosa at gmail.com]
Hi Jim...
I am sorry for this late reply...
> No it is only on one core. >OK, I think you hit a classic Linux kernel "bug". Not a bug in its true meaning, but the inability of Linux kernel to recognize the specific nature of Hyperthread processor. AFAIK, HT is not like multi core CPU. You have one true core and a "sibling". This sibling is actually like...uhm can't find the right word...ok..let's just say a "dummy" processor. It's like additional pipelines but isn't really a complete processor. But in Linux kernel's point of view, the real core and the sibling are identical.
This brings one implication. During load balancing between real core and the sibling, sometimes Linux kernel brings heavy jobs to the sibling instead of toward the real core (the powerful one). I don't know if the recent kernel release has already fix this "bug", so you are free to test.
Or, if you're really desperate, try to stick with non SMP kernel. That way, your tasks will always get assigned to the real core, abandoning the weak sibling.
I hope this clears up your confusion.
regards,
Mulyadi