Wednesday, January 2, 2013

Crash Dump Analysis (Installing and Configuring Kdump and crash)


Kdump is a kernel crash dumping mechanism. It is very reliable because the crash dump is captured from the context of a freshly booted kernel and not from the context of the crashed kernel. Kdump uses kexec to boot into a second kernel whenever the system crashes. This second kernel, often called the crash kernel or capture kernel, boots with very little memory and captures the dump image.
The first kernel reserves a section of memory that the second kernel uses to boot. Kexec boots the capture kernel without going through the BIOS, so the contents of the first kernel's memory are preserved; that preserved memory is essentially the kernel crash dump.
Configuring and using kdump

[root@test ~]# yum install kexec-tools crash

[root@test ~]# vi /boot/grub/grub.conf
Append crashkernel=128M@16M to the kernel line:
kernel /boot/vmlinuz-2.6.18-128.el5 ro root=LABEL=/ rhgb quiet crashkernel=128M@16M
initrd /boot/initrd-2.6.18-128.el5.img
The crashkernel parameter has the form size@offset: here it reserves 128M of memory for the capture kernel, starting at physical offset 16M.

[root@test ~]# init 6
Reboot and boot the kernel entry that carries the crashkernel parameter.

Check whether the crash kernel memory was reserved correctly using the command below.

[root@gai-1399 ~]# cat /proc/iomem|grep Crash
 01000000-08ffffff : Crash kernel
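
As a cross-check, the range reported in /proc/iomem should correspond to the crashkernel=128M@16M parameter, read as size@offset. The sketch below assumes only the simple SIZE@OFFSET form with K, M or G suffixes. The helper names to_bytes and crashkernel_range are illustrative and not part of kexec-tools.

```shell
# Convert a size such as 128M or 16M to bytes (K, M and G suffixes only).
to_bytes() {
    local value number
    value=$1
    number=${value%[KMGkmg]}
    case $value in
        *[Kk]) echo $(( number * 1024 )) ;;
        *[Mm]) echo $(( number * 1024 * 1024 )) ;;
        *[Gg]) echo $(( number * 1024 * 1024 * 1024 )) ;;
        *)     echo "$number" ;;
    esac
}

# Print the physical range described by SIZE@OFFSET in the same
# start-end hex form that /proc/iomem uses.
crashkernel_range() {
    local spec size offset
    spec=$1
    size=$(to_bytes "${spec%@*}")
    offset=$(to_bytes "${spec#*@}")
    printf '%08x-%08x\n' "$offset" $(( offset + size - 1 ))
}

crashkernel_range 128M@16M   # prints 01000000-08ffffff
```

The printed range matches the "Crash kernel" line shown above: 16M is 0x01000000, and 128M further on, minus one byte, is 0x08ffffff.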

In the /etc/kdump.conf file we can specify the location where the dump file is stored. By default the dump file is stored under the /var/crash directory.

[root@test ~]# vi /etc/kdump.conf
path /var/crash
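
To confirm which directory is configured without opening an editor, the path directive can be read back from the file. This is a minimal sketch: kdump_path is a hypothetical helper name, it only looks at uncommented "path" lines, and it falls back to /var/crash when none is set or the file is unreadable.

```shell
# Print the dump directory configured in a kdump.conf-style file.
# Falls back to /var/crash when no "path" line is present.
kdump_path() {
    local conf path
    conf=${1:-/etc/kdump.conf}
    path=
    if [ -r "$conf" ]; then
        # Commented lines start with "#path", so they do not match "path".
        path=$(awk '$1 == "path" { p = $2 } END { print p }' "$conf")
    fi
    echo "${path:-/var/crash}"
}

kdump_path /etc/kdump.conf
```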

[root@test ~]# /etc/init.d/kdump start
Starting kdump:                                            [  OK  ]
[root@test ~]# /etc/init.d/kdump status
Kdump is operational

[root@test ~]# chkconfig --level 3 kdump on
[root@test ~]# chkconfig --list | grep kdump
kdump           0:off   1:off   2:off   3:on    4:off   5:off   6:off

[root@test ~]# cat /proc/cmdline
ro root=LABEL=/1 crashkernel=128M@16M rhgb quiet
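
In a script, the same check can be done by pulling the crashkernel value out of the command line. The following is a small sketch. The function name crashkernel_param is an illustrative choice, and it reads the command line on standard input so it can be tried on any string.

```shell
# Print the value of the crashkernel= parameter from a kernel command
# line read on standard input; prints nothing if it is absent.
crashkernel_param() {
    tr ' ' '\n' | sed -n 's/^crashkernel=//p'
}

echo "ro root=LABEL=/1 crashkernel=128M@16M rhgb quiet" | crashkernel_param   # prints 128M@16M
```

On a live system the input would come from /proc/cmdline, for example: crashkernel_param < /proc/cmdline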

[root@test ~]# yum install kernel-debug*

Installed:
  kernel-debug.i686 0:2.6.18-308.el5  kernel-debug-devel.i686 0:2.6.18-308.el5

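To analyze the crash dump we need the kernel-debuginfo package, which provides the vmlinux image with debugging symbols and should match the version of the kernel that crashed. Note that kernel-debuginfo is a separate package from the kernel-debug packages listed above. We can then analyse the vmcore file with the crash utility.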
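
Before triggering a test crash it can help to check that a debug vmlinux is present for the running kernel. The sketch below assumes the /usr/lib/debug/lib/modules/<release>/vmlinux layout used by the crash invocations in this post. debuginfo_vmlinux is a hypothetical helper name.

```shell
# Build the expected debuginfo vmlinux path for a kernel release
# (defaults to the running kernel).
debuginfo_vmlinux() {
    local release
    release=${1:-$(uname -r)}
    echo "/usr/lib/debug/lib/modules/$release/vmlinux"
}

vmlinux=$(debuginfo_vmlinux)
if [ -f "$vmlinux" ]; then
    echo "debuginfo vmlinux found: $vmlinux"
else
    echo "debuginfo vmlinux missing: $vmlinux"
fi
```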

[root@test ~]# echo 1 > /proc/sys/kernel/sysrq
[root@test ~]# echo c > /proc/sysrq-trigger

The above command makes the Linux kernel crash, and a YYYY-MM-DD-HH:MM/vmcore file is generated in the location selected in the configuration.

/proc/sysrq-trigger: by using the echo command to write to this file, a remote root user can execute most System Request Key commands remotely as if at the local terminal. To echo values to this file, /proc/sys/kernel/sysrq must be set to a value other than 0.

The system rebooted twice during this test.

[root@test ~]# cd /var/crash/
[root@test crash]# ls
2012-11-04-14:23  
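
Because the dump directories are named YYYY-MM-DD-HH:MM, they sort chronologically, so the newest vmcore can be picked up with a simple glob. A minimal sketch, assuming that naming scheme (latest_vmcore is an illustrative name):

```shell
# Print the path of the most recent vmcore under a crash directory.
# Relies on the YYYY-MM-DD-HH:MM directory names sorting by time.
latest_vmcore() {
    local dir f latest
    dir=${1:-/var/crash}
    latest=
    for f in "$dir"/*/vmcore; do
        if [ -f "$f" ]; then
            latest=$f      # glob results are sorted, so the last match wins
        fi
    done
    if [ -n "$latest" ]; then
        echo "$latest"
    fi
}

latest_vmcore /var/crash
```

The function prints nothing when no vmcore exists yet, which makes it easy to test in a script with [ -n "$(latest_vmcore)" ].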

[root@test ~]# crash /var/crash/2012-11-04-14:23/vmcore /usr/lib/debug/lib/modules/2.6.18-308.16.1.el5.centos.plus/vmlinux

To display the kernel message buffer, type the log command at the interactive prompt. 

crash> log

To display the kernel stack trace:
 
crash> bt

To display process status, use ps:
crash> ps

crash> help

*              files          mod            runq           union
alias          foreach        mount          search         vm
ascii          fuser          net            set            vtop
bt             gdb            p              sig            waitq
btop           help           ps             struct         whatis
dev            irq            pte            swap           wr
dis            kmem           ptob           sym            q
eval           list           ptov           sys
exit           log            rd             task
extend         mach           repeat         timer

crash version: 5.1.8-1.el5.centos   gdb version: 7.0
For help on any command above, enter "help <command>".
For help on input options, enter "help input".
For help on output options, enter "help output".

crash> exit

Below is a sample crash analysis of my system.

[root@test 2012-11-04-15:57]# crash vmcore /usr/lib/debug/lib/modules/2.6.18-308.16.1.el5.centos.plus/vmlinux

crash 5.1.8-1.el5.centos
Copyright (C) 2002-2011  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.

GNU gdb (GDB) 7.0
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "i686-pc-linux-gnu"...

WARNING: kernel version inconsistency between vmlinux and dumpfile

      KERNEL: /usr/lib/debug/lib/modules/2.6.18-308.16.1.el5.centos.plus/vmlinux
    DUMPFILE: vmcore
        CPUS: 2
        DATE: Sun Nov  4 15:57:11 2012
      UPTIME: 01:28:17
LOAD AVERAGE: 0.00, 0.00, 0.00
       TASKS: 175
    NODENAME: test.net
     RELEASE: 2.6.18-308.13.1.el5
     VERSION: #1 SMP Tue Aug 21 17:10:06 EDT 2012
     MACHINE: i686  (2926 Mhz)
      MEMORY: 1.9 GB
       PANIC: "SysRq : Trigger a crashdump"
         PID: 315
     COMMAND: "bash"
        TASK: f5b9caa0  [THREAD_INFO: eff0c000]
         CPU: 0
       STATE: TASK_RUNNING (SYSRQ)
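
The WARNING in this banner arises because the vmlinux path names release 2.6.18-308.16.1.el5.centos.plus while the dump's RELEASE field is 2.6.18-308.13.1.el5. A comparison like this could be scripted before starting crash. The sketch below uses hypothetical helper names and assumes the release is the directory name directly above vmlinux.

```shell
# Extract the kernel release from a path of the form
# /usr/lib/debug/lib/modules/<release>/vmlinux.
vmlinux_release() {
    basename "$(dirname "$1")"
}

# Report whether the vmlinux release equals the release recorded in the dump.
release_status() {
    local have want
    have=$(vmlinux_release "$1")
    want=$2
    if [ "$have" = "$want" ]; then
        echo "match"
    else
        echo "mismatch: vmlinux is $have, dump is $want"
    fi
}

release_status /usr/lib/debug/lib/modules/2.6.18-308.16.1.el5.centos.plus/vmlinux 2.6.18-308.13.1.el5
# prints: mismatch: vmlinux is 2.6.18-308.16.1.el5.centos.plus, dump is 2.6.18-308.13.1.el5
```

With mismatched symbols, addresses in the output may resolve to the wrong function names, so the analysis below should be read with that in mind.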

crash> log
Linux version 2.6.18-308.13.1.el5 (mockbuild@builder10.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-52)) #1 SMP Tue Aug 21 17:10:06 EDT 2012
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000010000 - 000000000009f800 (usable)
 BIOS-e820: 000000000009f800 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000e8000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 000000007b9a1740 (usable)
 BIOS-e820: 000000007b9a1740 - 000000007b9a37a0 (ACPI NVS)
 BIOS-e820: 000000007b9a37a0 - 000000007e000000 (reserved)
 BIOS-e820: 00000000f4000000 - 00000000f8000000 (reserved)
 BIOS-e820: 00000000fec00000 - 00000000fed40000 (reserved)
 BIOS-e820: 00000000fed45000 - 0000000100000000 (reserved)
1081MB HIGHMEM available.
896MB LOWMEM available.
found SMP MP-table at 000f9bf0
Using x86 segment limits to approximate NX protection
On node 0 totalpages: 506273
  DMA zone: 4096 pages, LIFO batch:0
  Normal zone: 225280 pages, LIFO batch:31
  HighMem zone: 276897 pages, LIFO batch:31
DMI 2.6 present.
DMI: Hewlett-Packard HP Compaq 6000 Pro MT PC/3048h, BIOS 786G2 v01.09 08/25/2009
Using APIC driver default
ACPI: RSDP (v000 COMPAQ                                ) @ 0x000e5810
ACPI: RSDT (v001 HPQOEM SLIC-BPC 0x20090825  0x00000000) @ 0x7b9c5840
ACPI: FADT (v001 COMPAQ EAGLLAKE 0x00000001  0x00000000) @ 0x7b9c58e8
ACPI: MADT (v001 COMPAQ EAGLLAKE 0x00000001  0x00000000) @ 0x7b9c595c
ACPI: ASF! (v032 COMPAQ EAGLLAKE 0x00000001  0x00000000) @ 0x7b9c59e0
ACPI: MCFG (v001 COMPAQ EAGLLAKE 0x00000001  0x00000000) @ 0x7b9c5a43
ACPI: TCPA (v001 COMPAQ EAGLLAKE 0x00000001  0x00000000) @ 0x7b9c5a7f
ACPI: HPET (v001 COMPAQ EAGLLAKE 0x00000001  0x00000000) @ 0x7b9c5c27
ACPI: DSDT (v001 COMPAQ DSDT_PRJ 0x00000001 MSFT 0x0100000e) @ 0x00000000
ACPI: PM-Timer IO Port: 0xf808
ACPI: Local APIC address 0xfee00000
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
Processor #0 7:7 APIC version 20
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled)
Processor #1 7:7 APIC version 20
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x00] disabled)
ACPI: LAPIC (acpi_id[0x04] lapic_id[0x00] disabled)
Using ACPI (MADT) for SMP configuration information
Allocating PCI resources starting at 80000000 (gap: 7e000000:76000000)
Detected 2926.118 MHz processor.
Built 1 zonelists.  Total pages: 506273
Kernel command line: ro root=LABEL=/1 crashkernel=128M@16M rhgb quiet
mapped APIC to ffffd000 (fee00000)
mapped IOAPIC to ffffc000 (fec00000)
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Initializing CPU#0
CPU 0 irqstacks, hard=c0774000 soft=c0754000
PID hash table entries: 4096 (order: 12, 16384 bytes)
Console: colour VGA+ 80x25
Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
Memory: 1868804k/2025092k available (2208k kernel code, 155032k reserved, 921k data, 232k init, 1107588k highmem)
Checking if this processor honours the WP bit even in supervisor mode... Ok.
hpet0: at MMIO 0xfed00000 (virtual 0xf8800000), IRQs 2, 8, 0, 0, 0, 0, 0, 0
hpet0: 8 64-bit timers, 14318180 Hz
Using HPET for base-timer
Calibrating delay loop (skipped), value calculated using timer frequency.. 5852.23 BogoMIPS (lpj=2926118)
Security Framework v1.0.0 initialized
SELinux:  Initializing.
SELinux:  Starting in permissive mode
selinux_register_security:  Registering secondary module capability
Capability LSM initialized as secondary
Mount-cache hash table entries: 512
CPU: After generic identify, caps: bfebfbff 20100000 00000000 00000000 0408e3bd 00000000 00000001
CPU: After vendor identify, caps: bfebfbff 20100000 00000000 00000000 0408e3bd 00000000 00000001
monitor/mwait feature present.
using mwait in idle threads.
CPU: L1 I cache: 32K, L1 D cache: 32K
CPU: L2 cache: 3072K
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 0
CPU: After all inits, caps: bfebf3ff 20100000 00000000 00000940 0408e3bd 00000000 00000001
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
Checking 'hlt' instruction... OK.
SMP alternatives: switching to UP code
ACPI: Core revision 20060707
ACPI Warning (utinit-0077): Invalid FADT value PM2_CNT_LEN=0 at offset 5A FADT=f7ff9780 [20060707]
CPU0: Intel(R) Core(TM)2 Duo CPU     E7500  @ 2.93GHz stepping 0a
SMP alternatives: switching to SMP code
Booting processor 1/1 eip 11000
CPU 1 irqstacks, hard=c0775000 soft=c0755000
Initializing CPU#1
Calibrating delay using timer specific routine.. 5851.96 BogoMIPS (lpj=2925983)
CPU: After generic identify, caps: bfebfbff 20100000 00000000 00000000 0408e3bd 00000000 00000001
CPU: After vendor identify, caps: bfebfbff 20100000 00000000 00000000 0408e3bd 00000000 00000001
monitor/mwait feature present.
CPU: L1 I cache: 32K, L1 D cache: 32K
CPU: L2 cache: 3072K
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 1
CPU: After all inits, caps: bfebf3ff 20100000 00000000 00000940 0408e3bd 00000000 00000001
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#1.
CPU1: Intel(R) Core(TM)2 Duo CPU     E7500  @ 2.93GHz stepping 0a
Total of 2 processors activated (11704.20 BogoMIPS).
ENABLING IO-APIC IRQs
..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
Using local APIC timer interrupts.
checking TSC synchronization across 2 CPUs: passed.
Brought up 2 CPUs
sizeof(vma)=84 bytes
sizeof(page)=32 bytes
sizeof(inode)=340 bytes
sizeof(dentry)=136 bytes
sizeof(ext3inode)=492 bytes
sizeof(buffer_head)=52 bytes
sizeof(skbuff)=176 bytes
migration_cost=36
checking if image is initramfs... it is
Freeing initrd memory: 2622k freed
HP Compaq Laptop series board detected. Selecting BIOS-method for reboots.
NET: Registered protocol family 16
ACPI: bus type pci registered
PCI: Using MMCONFIG
Setting up standard PCI resources
ACPI: Interpreter enabled
ACPI: Using IOAPIC for interrupt routing
ACPI: No dock devices found.
NetLabel: Initializing
NetLabel:  domain hash size = 128
NetLabel:  protocols = UNLABELED CIPSOv4
NetLabel:  unlabeled traffic allowed by default
pnp: 00:0c: ioport range 0x4d0-0x4d1 has been reserved
pnp: 00:0d: ioport range 0x400-0x41f could not be reserved
pnp: 00:0d: ioport range 0x420-0x43f has been reserved
pnp: 00:0d: ioport range 0x440-0x45f has been reserved
pnp: 00:0d: ioport range 0x460-0x47f has been reserved
pnp: 00:0d: ioport range 0x800-0x87f has been reserved
pnp: 00:0d: ioport range 0x880-0x8ff has been reserved
pnp: 00:0d: ioport range 0xf800-0xf81f could not be reserved
pnp: 00:0d: ioport range 0xf820-0xf83f could not be reserved
pnp: 00:0e: iomem range 0x0-0x9ffff could not be reserved
pnp: 00:0e: iomem range 0x100000-0x7dffffff could not be reserved
pnp: 00:0e: iomem range 0xe4000-0xfffff could not be reserved
pnp: 00:0e: iomem range 0xfec01000-0xfecfffff has been reserved
PCI: Bridge: 0000:00:1c.0
  IO window: disabled.
  MEM window: disabled.
  PREFETCH window: disabled.
PCI: Bridge: 0000:00:1c.1
  IO window: disabled.
  MEM window: disabled.
  PREFETCH window: disabled.
PCI: Bridge: 0000:00:1e.0
  IO window: disabled.
  MEM window: disabled.
  PREFETCH window: disabled.
PCI: Setting latency timer of device 0000:00:1c.0 to 64
ACPI: PCI Interrupt 0000:00:1c.1[B] -> GSI 21 (level, low) -> IRQ 169
ehci_hcd 0000:00:1a.7: debug port 1
PCI: cache line size of 32 is not supported by device 0000:00:1a.7
ehci_hcd 0000:00:1a.7: irq 209, io mem 0xf0526800
ehci_hcd 0000:00:1a.7: USB 2.0 started, EHCI 1.00, driver 10 Dec 2004
usb usb1: configuration #1 chosen from 1 choice
hub 1-0:1.0: USB hub found
hub 1-0:1.0: 6 ports detected
ACPI: PCI Interrupt 0000:00:1d.7[A] -> GSI 20 (level, low) -> IRQ 217
PCI: Setting latency timer of device 0000:00:1d.7 to 64
ehci_hcd 0000:00:1d.7: EHCI Host Controller
ehci_hcd 0000:00:1d.7: new USB bus registered, assigned bus number 2
ehci_hcd 0000:00:1d.7: debug port 1
PCI: cache line size of 32 is not supported by device 0000:00:1d.7
ehci_hcd 0000:00:1d.7: irq 217, io mem 0xf0526c00
ehci_hcd 0000:00:1d.7: USB 2.0 started, EHCI 1.00, driver 10 Dec 2004
usb usb2: configuration #1 chosen from 1 choice
hub 2-0:1.0: USB hub found
hub 2-0:1.0: 6 ports detected
ohci_hcd: 2005 April 22 USB 1.1 'Open' Host Controller (OHCI) Driver (PCI)
USB Universal Host Controller Interface driver v3.0
ACPI: PCI Interrupt 0000:00:1a.0[A] -> GSI 20 (level, low) -> IRQ 217
PCI: Setting latency timer of device 0000:00:1a.0 to 64
uhci_hcd 0000:00:1a.0: UHCI Host Controller
uhci_hcd 0000:00:1a.0: new USB bus registered, assigned bus number 3
uhci_hcd 0000:00:1a.0: irq 217, io base 0x00001120
usb usb3: configuration #1 chosen from 1 choice
hub 3-0:1.0: USB hub found
hub 3-0:1.0: 2 ports detected
ACPI: PCI Interrupt 0000:00:1a.1[B] -> GSI 21 (level, low) -> IRQ 169
PCI: Setting latency timer of device 0000:00:1a.1 to 64
uhci_hcd 0000:00:1a.1: UHCI Host Controller
uhci_hcd 0000:00:1a.1: new USB bus registered, assigned bus number 4
uhci_hcd 0000:00:1a.1: irq 169, io base 0x00001140
usb usb4: configuration #1 chosen from 1 choice
hub 4-0:1.0: USB hub found
hub 4-0:1.0: 2 ports detected
ACPI: PCI Interrupt 0000:00:1a.2[C] -> GSI 22 (level, low) -> IRQ 209
PCI: Setting latency timer of device 0000:00:1a.2 to 64
uhci_hcd 0000:00:1a.2: UHCI Host Controller
uhci_hcd 0000:00:1a.2: new USB bus registered, assigned bus number 5
uhci_hcd 0000:00:1a.2: irq 209, io base 0x00001160
usb usb5: configuration #1 chosen from 1 choice
hub 5-0:1.0: USB hub found
hub 5-0:1.0: 2 ports detected
input: ImPS/2 Logitech Wheel Mouse as /class/input/input1
ACPI: PCI Interrupt 0000:00:1d.0[A] -> GSI 20 (level, low) -> IRQ 217
PCI: Setting latency timer of device 0000:00:1d.0 to 64
uhci_hcd 0000:00:1d.0: UHCI Host Controller
uhci_hcd 0000:00:1d.0: new USB bus registered, assigned bus number 6
Bluetooth: RFCOMM ver 1.8
Bluetooth: HIDP (Human Interface Emulation) ver 1.1
Installing knfsd (copyright (C) 1996 okir@monad.swb.de).
NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
NFSD: starting 90-second grace period
Bridge firewalling registered
[drm] Initialized drm 1.0.1 20051102
ACPI: PCI Interrupt 0000:00:02.0[A] -> GSI 16 (level, low) -> IRQ 90
[drm] Initialized i915 1.8.0 20060929 on minor 0
set status page addr 0x01fff000
SysRq : Trigger a crashdump
crash> bt
PID: 315    TASK: f5b9caa0  CPU: 0   COMMAND: "bash"
 #0 [eff0cee8] crash_kexec at c0445cbc
 #1 [eff0cf2c] __handle_sysrq at c054ed56
 #2 [eff0cf58] uptime_read_proc at c04ad81e
 #3 [eff0cf64] proc_delete_inode at c04a8882
 #4 [eff0cf84] vfs_write at c047842e
 #5 [eff0cf9c] sys_write at c0478a55
 #6 [eff0cfb8] system_call at c0404f44
    EAX: ffffffda  EBX: 00000001  ECX: b7f72000  EDX: 00000002
    DS:  007b      ESI: 00000002  ES:  007b      EDI: b7f72000
    SS:  007b      ESP: bf88ff9c  EBP: bf88ffb8
    CS:  0073      EIP: 00b79f9e  ERR: 00000004  EFLAGS: 00000246
crash> vm
PID: 315    TASK: f5b9caa0  CPU: 0   COMMAND: "bash"
   MM       PGD      RSS    TOTAL_VM
f7882580  f07f7000  1564k    4744k
  VMA       START      END    FLAGS  FILE
f4597df4    16e000    16f000 8040075
f45934c4    48f000    499000     75  /lib/libnss_files-2.5.so
f4593ca4    499000    49b000 100073  /lib/libnss_files-2.5.so
f45979b0    a97000    ab1000    875  /lib/ld-2.5.so
f4597cf8    ab1000    ab3000 100873  /lib/ld-2.5.so
f4593a58    ab5000    c0c000     75  /lib/libc-2.5.so
f45937b8    c0c000    c0d000     70  /lib/libc-2.5.so
f45932cc    c0d000    c0e000 100071  /lib/libc-2.5.so
f459380c    c0e000    c10000 100073  /lib/libc-2.5.so
f4593374    c10000    c13000 100073
f4593a04    c40000    c43000     75  /lib/libdl-2.5.so
f4593b00    c43000    c45000 100073  /lib/libdl-2.5.so
f4597ba8   7d1f000   7d22000     75  /lib/libtermcap.so.2.0.8
f45938b4   7d22000   7d23000 100073  /lib/libtermcap.so.2.0.8
f4597f44   8047000   80f6000   1875  /bin/bash
f459795c   80f6000   80fb000 101873  /bin/bash
f4597b54   80fb000   8100000 100073
f4597ca4   9aab000   9aea000 100073
f4593f44  b7d53000  b7f53000     71  /usr/lib/locale/locale-archive
f4597710  b7f53000  b7f55000 100073
f79d0b00  b7f71000  b7f73000 100073
f7b83aac  bf87c000  bf892000 100173
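
The TOTAL_VM figure can be reproduced from the VMA list by summing END minus START for each mapping. The sketch below assumes the vm output has been saved to a file or piped in. vma_total_kb is an illustrative name, and any line whose second and third fields are not both hexadecimal is skipped.

```shell
# Sum the sizes of the VMAs in "crash> vm" output read on standard input
# and print the total in kilobytes.
vma_total_kb() {
    local total vma start end rest
    total=0
    while read -r vma start end rest; do
        case $start in
            ''|*[!0-9a-fA-F]*) continue ;;   # skip headers and blank lines
        esac
        case $end in
            ''|*[!0-9a-fA-F]*) continue ;;   # skip the MM/PGD/RSS summary line
        esac
        total=$(( total + (0x$end - 0x$start) / 1024 ))
    done
    echo "${total}k"
}

# First three mappings from the listing above: 4k + 40k + 8k.
vma_total_kb <<'EOF'
  VMA       START      END    FLAGS  FILE
f4597df4    16e000    16f000 8040075
f45934c4    48f000    499000     75  /lib/libnss_files-2.5.so
f4593ca4    499000    49b000 100073  /lib/libnss_files-2.5.so
EOF
# prints 52k
```

Summing all 22 mappings in the listing the same way gives 4744k, which agrees with the TOTAL_VM column.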


