Troubleshooting PC crashing

troubleshoot
 
setup
 
pcbuild
 

My computer has been crashing randomly on a daily basis for a few weeks. Each crash happens after the system has been running for some time, anywhere from a couple of hours to a few days. This post documents the symptoms and the debugging methods, and lists tools for future reference.

Diagnosis

Failed Boot

A failure to boot is relatively easy to debug, assuming spare parts are available. The minimal set of components required to boot the system is:

  • Motherboard: it goes without saying that the motherboard is one of the essential (if not the most essential) parts of a PC build. I keep digital copies of almost all manuals (downloaded, or scanned if no digital copy is available online). The motherboard manual is the only exception on my list, as I have always found it helpful to keep a hard copy of it. The most important thing about a motherboard is its compatibility with the other components. PC Part Picker is the go-to website for checking compatibility (seriously, I’ve been proven wrong every single time my assessment differed from its results).
  • PSU: to get any beeps and blinks out of the motherboard, we need at least a Power Supply Unit. The PSU is really the last PC part you want to save money on; debugging PSU issues is totally not worth the time. A couple of things to consider for a PSU:
    1. First and foremost, wattage. The rule of thumb is to add up the total power consumption of the components and pick a PSU that covers the system requirement at 60% of its rated power. This not only provides a good buffer for miscalculation/deviation, but also leaves room for future upgrades without replacing the PSU (which makes upgrading components a lot more enjoyable). Online power supply calculators are useful for estimating the power your system requires.
    2. Power efficiency. Choosing a power-efficient PSU not only lowers your monthly electricity bill; it also reduces noise and heat, and heat is the number one killer of PC components.
  • CPU: finally we get to the CPU. I purposely put the CPU third to stress the importance of the above two. CPU-wise, it is important to have a powerful CPU fan and properly applied thermal paste to dissipate heat from the CPU.
  • RAM: RAM is the most common cause of failure. A RAM issue usually causes a beeping sound at boot-up. Oftentimes the RAM is simply not seated properly in its slot on the motherboard. A properly seated RAM stick should make a click sound when it is inserted into the slot. Another common issue is choosing the wrong slots: some motherboards require specific slots depending on the number of RAM sticks installed.
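The PSU wattage rule of thumb from the list above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this post, and the component wattages and the list of PSU ratings are hypothetical numbers, not measurements from my build.

```python
import math


def minimum_psu_wattage(component_watts, load_fraction=0.6):
    """Smallest PSU rating whose `load_fraction` share covers the total draw."""
    if not 0 < load_fraction <= 1:
        raise ValueError("load_fraction must be in (0, 1]")
    total = sum(component_watts.values())
    return math.ceil(total / load_fraction)


def pick_psu(component_watts, available_ratings, load_fraction=0.6):
    """Return the smallest available rating that satisfies the rule, or None."""
    needed = minimum_psu_wattage(component_watts, load_fraction)
    candidates = [rating for rating in sorted(available_ratings) if rating >= needed]
    return candidates[0] if candidates else None


# Hypothetical per-component power draw in watts (illustration only).
build = {"cpu": 105, "gpu": 220, "motherboard": 50, "ram": 10, "drives": 15, "fans": 10}

print(minimum_psu_wattage(build))                  # 410 W total -> 684
print(pick_psu(build, [450, 550, 650, 750, 850]))  # -> 750
```

Real component numbers should come from the manufacturers' specifications or an online power supply calculator.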

Contrary to common belief, a GPU is not a mandatory component for a system (though without integrated graphics you will need one to get a display). With the four main components addressed, we can hopefully boot the system, inspect it using peripherals, and debug the issues through the user interface.

Random crashes

Failures at startup can usually be observed immediately. Random crashes after the system has started up, however, are sometimes hard to deal with. The first step is to check the system logs.

  • On Linux, /var/log/syslog is the place to look. Keywords (searched case-insensitively) like “bug”, “error”, and “fail” are useful for identifying potential issues. There are a number of other useful logs in /var/log if you suspect that the issue is caused by a specific module (e.g. check kern.log if the issue started after a kernel upgrade).
  • On Windows, Event Viewer serves a similar purpose to syslog.
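As a rough sketch of the keyword search on Linux, the following Python script scans a log file for the three keywords, ignoring case. The helper names are my own, and the script assumes the log lives at /var/log/syslog, which may differ on your distribution. Reading the file may also need elevated permissions; the script simply returns nothing if it cannot read it.

```python
import re
from pathlib import Path

KEYWORDS = ("bug", "error", "fail")
PATTERN = re.compile("|".join(re.escape(word) for word in KEYWORDS), re.IGNORECASE)


def find_suspicious_lines(lines, pattern=PATTERN):
    """Yield (line_number, line) for every line that contains a keyword."""
    for number, line in enumerate(lines, start=1):
        if pattern.search(line):
            yield number, line.rstrip("\n")


def scan_log(path):
    """Scan one log file; return an empty list if it is missing or unreadable."""
    try:
        with Path(path).open(errors="replace") as handle:
            return list(find_suspicious_lines(handle))
    except OSError:
        return []


if __name__ == "__main__":
    # Assumed location; adjust for your system (e.g. /var/log/kern.log).
    for number, line in scan_log("/var/log/syslog"):
        print(f"{number}: {line}")
```

A plain substring match like this is noisy (it also hits harmless lines such as "failover"), so treat the output as a list of candidates to read, not as a verdict.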

Example: if you get lucky like I did, the system logs will give you a hint about the problem. Before diving into the diagnosis, here is the list of symptoms:

  • My system started crashing (won’t react to any input) after a kernel upgrade.
  • The crashes always happen after I leave my computer.
  • After a reboot, syslog shows a gap in the logs between some point in time and the reboot. The last few messages before the crash do not indicate a problem.
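The log gap in the last symptom can be located automatically. The sketch below assumes the classic syslog timestamp prefix ("Mon DD HH:MM:SS", which carries no year, so the year has to be supplied) and an arbitrary threshold of 10 minutes. Both are assumptions to adjust, and the function names are made up for illustration. It does not handle logs that span a year boundary.

```python
from datetime import datetime, timedelta


def parse_timestamp(line, year):
    """Parse a 'Mon DD HH:MM:SS' prefix; return None if the line has none."""
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def find_gaps(lines, year, threshold=timedelta(minutes=10)):
    """Return (before, after) pairs where consecutive entries are far apart."""
    gaps = []
    previous = None
    for line in lines:
        current = parse_timestamp(line, year)
        if current is None:
            continue  # continuation line or a different timestamp format
        if previous is not None and current - previous > threshold:
            gaps.append((previous, current))
        previous = current
    return gaps


if __name__ == "__main__":
    try:
        with open("/var/log/syslog", errors="replace") as handle:
            gaps = find_gaps(handle, year=datetime.now().year)
    except OSError:
        gaps = []  # file missing or not readable
    for before, after in gaps:
        print(f"no log lines between {before} and {after} ({after - before})")
```

The entries just before each reported gap are the ones worth reading closely.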

An experienced/knowledgeable person would have spotted a few hints here. Obviously I didn’t, so I dove into the syslog and found two suspicious log entries:

kernel: [73211.404903] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [(md-udevd):3226]

and

[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
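If messages like the first one show up repeatedly, it can help to pull out the CPU number, the stall duration, and the task name to see whether they follow a pattern. Here is a minimal sketch. The regular expression is written against the single soft lockup line quoted above, so it is an assumption that other kernels format the message the same way.

```python
import re
from collections import Counter

SOFT_LOCKUP = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<seconds>\d+)s! "
    r"\[(?P<task>.+):(?P<pid>\d+)\]"
)


def parse_soft_lockup(line):
    """Return the fields of a soft lockup message, or None for other lines."""
    match = SOFT_LOCKUP.search(line)
    if match is None:
        return None
    return {
        "cpu": int(match["cpu"]),
        "seconds": int(match["seconds"]),
        "task": match["task"],
        "pid": int(match["pid"]),
    }


def lockups_per_cpu(lines):
    """Count soft lockup messages per CPU number."""
    parsed = (parse_soft_lockup(line) for line in lines)
    return Counter(entry["cpu"] for entry in parsed if entry is not None)


line = "kernel: [73211.404903] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [(md-udevd):3226]"
print(parse_soft_lockup(line))
```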

A quick Google search for these messages (together with my CPU model) turned up many users experiencing the same issue. The root cause is that the kernel upgrade triggered a CPU bug that shows up when the CPU is in a C-state (a low-power idle state). The solution is to either

  1. upgrade the kernel; or
  2. disable C-states in the boot menu.
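To check which idle states the running kernel actually exposes (for example, to confirm that a change took effect), the cpuidle entries in sysfs can be read. This sketch only inspects the state and does not change anything. It assumes a Linux system with the cpuidle sysfs interface under /sys/devices/system/cpu/cpu0/cpuidle, and it returns an empty list if that directory is absent. The function name is hypothetical.

```python
from pathlib import Path

CPUIDLE_DIR = Path("/sys/devices/system/cpu/cpu0/cpuidle")


def list_idle_states(cpuidle_dir=CPUIDLE_DIR):
    """Return [(directory, state_name, disabled)] for each idle state of one CPU."""
    cpuidle_dir = Path(cpuidle_dir)
    if not cpuidle_dir.is_dir():
        return []
    state_dirs = [p for p in cpuidle_dir.glob("state*") if p.name[5:].isdigit()]
    states = []
    for state_dir in sorted(state_dirs, key=lambda p: int(p.name[5:])):
        try:
            name = (state_dir / "name").read_text().strip()
            disabled = (state_dir / "disable").read_text().strip() == "1"
        except OSError:
            continue  # entry missing or unreadable; skip it
        states.append((state_dir.name, name, disabled))
    return states


if __name__ == "__main__":
    for directory, name, disabled in list_idle_states():
        print(f"{directory}: {name} ({'disabled' if disabled else 'enabled'})")
```

The script reads cpu0 only. The other CPUs have their own cpuN/cpuidle directories with the same layout.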

Now that we solved the puzzle, let’s look back at the hints:

  • A gap in the logs, with the last messages not indicating a clear error, likely points to a failure in an essential component. In this case, the chances of the issue being in the GPU, peripherals, etc. are low.
  • Crashes happening only when the computer was left alone (sleep mode) were a subtle hint at the CPU C-state.

Testing

If system diagnosis yields less-than-desirable results and the issues remain unsolved, the next idea is to isolate the issue. One way to do that is to replace components with spare parts, which is a time-consuming and sometimes frustrating process. Rather than taking this passive approach, or if no spare parts are available, we can proactively stress test the individual components.

RAM

RAM failure is a common issue and there is a well-known tool for testing it: memtest86+. It stress tests the RAM with different access patterns and checks for errors. It is recommended to finish 8 full passes (which took me almost a full day) before eliminating RAM from the list of culprits.

CPU

The most common cause of CPU failure is overclocking it beyond its limits. It might run well for a while before it gives up. This is usually accompanied by overheating. TODO: tools for monitoring/recording temperature? (stress testers, e.g. apt install stress). Try downclocking the CPU first to see if the issues go away.
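Until a proper tool fills in the TODO above, one possible stopgap for recording temperatures on Linux is to poll the thermal zones in sysfs while a stress test runs. This is a sketch under assumptions: it relies on /sys/class/thermal/thermal_zone*/temp (values in millidegrees Celsius) being present, which depends on the hardware and drivers, and the function names are my own.

```python
import time
from pathlib import Path

THERMAL_DIR = Path("/sys/class/thermal")


def read_temperatures(thermal_dir=THERMAL_DIR):
    """Return {'thermal_zoneN:type': degrees_celsius} for every readable zone."""
    readings = {}
    for zone in sorted(Path(thermal_dir).glob("thermal_zone*")):
        try:
            kind = (zone / "type").read_text().strip()
            millidegrees = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            continue  # zone without a usable reading
        readings[f"{zone.name}:{kind}"] = millidegrees / 1000.0
    return readings


def log_temperatures(samples, interval_seconds, thermal_dir=THERMAL_DIR):
    """Print one timestamped reading per sample, sleeping between samples."""
    for index in range(samples):
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(stamp, read_temperatures(thermal_dir))
        if index < samples - 1:
            time.sleep(interval_seconds)


if __name__ == "__main__":
    log_temperatures(samples=3, interval_seconds=1)
```

Redirecting the output to a file gives a simple temperature record that can be lined up against the time of a crash.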

Hard drive

GPU