Monday, September 3, 2012

Linux x86_64: Detecting Hardware Errors

A program such as mcelog decodes machine check events (hardware errors) on x86-64 machines running a 64-bit Linux kernel. It should be run regularly as a cron job on any x86-64 Linux system. This is useful for predicting server hardware failure before an actual server crash.

Install mcelog

Type the following command under RHEL / CentOS / Fedora Linux with a 64-bit kernel:
# yum install mcelog
Type the following command under Debian / Ubuntu Linux with a 64-bit kernel:
# apt-get update && apt-get install mcelog
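
Since mcelog only works with a 64-bit kernel, it can help to check the architecture before installing. The following is a minimal sketch and not part of the mcelog package: the function names arch_supported and install_command are made up for illustration, and it assumes uname -m reports x86_64 on a 64-bit x86 kernel.

```shell
#!/bin/sh
# Sketch only: arch_supported and install_command are hypothetical helper names.

# Succeeds only for the architecture string of a 64-bit x86 kernel.
arch_supported() {
    [ "$1" = "x86_64" ]
}

# Prints the install command for whichever package manager is on PATH.
install_command() {
    if command -v yum >/dev/null 2>&1; then
        echo "yum install mcelog"
    elif command -v apt-get >/dev/null 2>&1; then
        echo "apt-get update && apt-get install mcelog"
    else
        echo "no supported package manager found" >&2
        return 1
    fi
}

if arch_supported "$(uname -m)"; then
    echo "64-bit kernel detected; install with: $(install_command)"
else
    echo "mcelog needs a 64-bit x86 kernel; this one is $(uname -m)"
fi
```

The script only prints the command it would suggest; it does not install anything itself.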

Default Cronjob

By default, Debian / Ubuntu Linux installs the following cron entry in /etc/cron.d/mcelog, which runs mcelog every five minutes:
# /etc/cron.d/mcelog: crontab entry for the mcelog package
*/5 * * * * root test -x /usr/sbin/mcelog -a ! -e /etc/mcelog-disabled && /usr/sbin/mcelog --ignorenodev --filter >> /var/log/mcelog
CentOS / RHEL / Fedora Linux run an hourly cron job via /etc/cron.hourly/mcelog.cron:
/usr/sbin/mcelog --ignorenodev --filter >> /var/log/mcelog
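
The Debian / Ubuntu entry only runs mcelog when the binary is executable and the file /etc/mcelog-disabled does not exist. As a rough sketch, that guard can be written out long-hand as a script. The function name should_run_mcelog and the variables MCELOG_BIN, DISABLE_FLAG and LOG_FILE are made up here for illustration; the paths and mcelog flags are the ones from the cron entries above.

```shell
#!/bin/sh
# Sketch of the guard in the Debian / Ubuntu cron entry, written out long-hand.
# should_run_mcelog, MCELOG_BIN, DISABLE_FLAG and LOG_FILE are hypothetical names.
MCELOG_BIN="${MCELOG_BIN:-/usr/sbin/mcelog}"
DISABLE_FLAG="${DISABLE_FLAG:-/etc/mcelog-disabled}"
LOG_FILE="${LOG_FILE:-/var/log/mcelog}"

# Succeeds when the binary ($1) is executable and the flag file ($2) is absent.
should_run_mcelog() {
    [ -x "$1" ] && [ ! -e "$2" ]
}

if should_run_mcelog "$MCELOG_BIN" "$DISABLE_FLAG"; then
    # Same invocation as the cron entry: append decoded events to the log.
    "$MCELOG_BIN" --ignorenodev --filter >> "$LOG_FILE"
fi
```

With this guard, creating /etc/mcelog-disabled (for example with touch) switches the job off without removing the cron entry.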

How do I view error logs?

Use the tail or grep command to follow the log, list matching entries, or count them:

# tail -f /var/log/mcelog


# grep -i "hardware error" /var/log/mcelog


# grep -ci "hardware error" /var/log/mcelog

Alternatively, you can send an email alert when a hardware error is found on the system (write a shell script and call it via a cron job). Replace the placeholder address admin@example.com with your own:

# [ $(grep -ci "hardware error" /var/log/mcelog) -gt 0 ] && echo "Hardware Error Found $(hostname) @ $(date)" | mail -s 'H/w Error' admin@example.com
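
The one-liner can be expanded into a small script for cron to call. This is only a sketch under a few assumptions: the names count_hw_errors, check_and_alert, LOG_FILE and RECIPIENT are made up for illustration, the recipient defaults to the local root mailbox, and a working mail command is assumed to be installed.

```shell
#!/bin/sh
# Sketch of a cron-driven alert script. count_hw_errors, check_and_alert,
# LOG_FILE and RECIPIENT are hypothetical names; set RECIPIENT to a real address.
LOG_FILE="${LOG_FILE:-/var/log/mcelog}"
RECIPIENT="${RECIPIENT:-root}"

# Prints the number of log lines ($1 = log path) that mention a hardware error.
count_hw_errors() {
    if [ -r "$1" ]; then
        # grep -c exits non-zero when the count is 0, so ignore its status.
        grep -ci "hardware error" "$1" || true
    else
        echo 0
    fi
}

# Mails a one-line alert to $2 when the log at $1 contains any matches.
check_and_alert() {
    count=$(count_hw_errors "$1")
    if [ "$count" -gt 0 ]; then
        echo "Hardware Error Found $(hostname) @ $(date) ($count matching lines)" |
            mail -s 'H/w Error' "$2"
    fi
}

check_and_alert "$LOG_FILE" "$RECIPIENT"
```

Because /var/log/mcelog only grows, a script like this keeps sending mail on every run once an error has been logged, until the log is rotated or cleared.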

With this tool I was able to pick up a couple of hardware problems before they caused a kernel panic, i.e. a server crash.


  • You need a 64-bit Linux kernel and operating system to run mcelog. Machine checks can indicate failing hardware, system overheating, bad DIMMs, or other problems. Some MCEs are fatal and generally cannot be survived without a reboot and hardware replacement, but I was able to catch a lot of bad hardware before a crash with this tool.
