Solving Linux CPU 100% Issue with One Shell Script
As a Hong Kong server user, you may run into CPU problems. On our data platform server, CPU utilization peaked at 98.94% and stayed above 70% for an extended period. At first glance this looked like a hardware bottleneck that called for scaling. On reflection, though, our business system is neither high-concurrency nor CPU-intensive, so it should not have hit a hardware limit so quickly. The more likely cause was a problem somewhere in the business code logic.
Troubleshooting Steps
2.1 Identify High-Load Process PID
First, log in to the server and run the top command to check the server's overall state, then decide how to proceed based on what it shows.
Comparing the load average against the evaluation criterion for an 8-core machine (as a rule of thumb, a load average that stays above the core count means tasks are queuing) confirms that the server is under high load. Looking at the resource usage of each process, the process with PID 682 has a relatively high CPU percentage.
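The check described above can be sketched in shell. This is a minimal sketch, not the exact commands used during the incident: load_exceeds_cores and top_cpu_processes are hypothetical helper names, and the ps options assume the procps version of ps found on most Linux distributions.

```shell
# load_exceeds_cores LOAD CORES
# Succeeds (exit status 0) when the load average is above the core count.
load_exceeds_cores() {
  awk -v load="$1" -v cores="$2" 'BEGIN { exit !(load > cores) }'
}

# top_cpu_processes [N]
# Lists the N processes with the highest CPU share (default 5), plus a header.
# Assumes procps ps, which supports --sort.
top_cpu_processes() {
  ps -eo pid,pcpu,comm --sort=-pcpu | head -n "$(( ${1:-5} + 1 ))"
}
```

On a Linux host, load_exceeds_cores "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)" compares the 1-minute load average with the number of available cores, and top_cpu_processes 3 prints the three heaviest processes. Note that ps reports CPU share averaged over a process's lifetime, so top remains the better view of what is busy right now.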
2.2 Identify the Specific Abnormal Business
Here, we can use the pwdx command to find the working directory of the business process from its PID, and from that identify the project and the person in charge:
It can be concluded that the process corresponds to the web service of the data platform.
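For reference, pwdx prints the PID followed by that process's current working directory. The sketch below shows the same lookup done through /proc; proc_cwd is a hypothetical helper name, and the approach assumes a Linux /proc filesystem and permission to inspect the target process.

```shell
# pwdx 682 prints a line of the form "682: <working directory>".
# proc_cwd PID reads the same information straight from /proc.
proc_cwd() {
  readlink "/proc/$1/cwd"
}

# Example: working directory of the current shell.
# proc_cwd $$
```

The working directory is not always the install path of the service, but for services started from their project directory it is usually enough to identify the project.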
2.3 Locate Abnormal Threads and Specific Code Lines
The traditional approach generally involves 4 steps:
1. top, then press P to sort by CPU: 1040 // First, find the PID of the process with the highest load
2. top -Hp <process PID>: 1073 // Find the ID of the busiest thread in that process
3. printf "0x%x" <thread ID>: 0x431 // Convert the thread ID to hexadecimal for the jstack search that follows
4. jstack <process PID> | vim +/<hexadecimal thread ID> - // For example: jstack 1040 | vim +/0x431 -
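The four steps above can be chained in a few shell functions. This is a minimal sketch of the manual procedure, not the contents of any existing tool; the function names are hypothetical. It assumes procps ps (for -L and the tid column) and a JDK whose jstack prints native thread IDs in hexadecimal as nid=0x....

```shell
# tid_to_hex TID: decimal thread ID -> hexadecimal form used in jstack output.
tid_to_hex() {
  printf '0x%x\n' "$1"
}

# pick_busiest: reads "TID %CPU" lines on stdin, prints the TID with the
# highest %CPU.
pick_busiest() {
  awk 'NR == 1 || $2 > max { max = $2; tid = $1 } END { print tid }'
}

# busiest_tid PID: busiest thread of a process, according to ps.
busiest_tid() {
  ps -L -o tid=,pcpu= -p "$1" | pick_busiest
}

# show_busiest_stack PID: print the start of the stack of the busiest thread.
show_busiest_stack() {
  nid=$(tid_to_hex "$(busiest_tid "$1")")
  jstack "$1" | grep -A 20 "nid=$nid"
}

# Example with the numbers used in the steps above:
# tid_to_hex 1073          # prints 0x431
# show_busiest_stack 1040
```

ps reports CPU share averaged over each thread's lifetime, whereas top -Hp shows current usage, so for a short spike the thread ID read from top may be the more reliable input to tid_to_hex.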
However, when locating a problem online every second counts, and the 4 steps above are cumbersome and time-consuming. oldratlee of Taobao has wrapped this process into a tool, show-busy-java-threads.sh, which makes it easy to locate such problems online:
The output shows that a method in a time utility class accounts for a high percentage of CPU. With the specific method located, the next step is to check its code logic for performance issues.
Root Cause Analysis
The analysis and troubleshooting above finally traced the high server load and CPU usage to a problem in a time utility class.
- Abnormal method logic: converting a timestamp to the corresponding specific date and time format.
- Upper-layer call: calculating all seconds from midnight to the current time of the day, converting it to the corresponding format, and returning the result in a set.
- Logic layer: corresponds to the query logic of the real-time report of the data platform. The real-time report will query at fixed time intervals, and there are multiple (n times) method calls in a single query.
It follows that if the current time is 10 am, a single query performs 10*60*60*n = 36,000*n calculations, and the number of calculations per query grows linearly through the day as the time approaches midnight. Because real-time queries, real-time alarms, and other modules issue a large number of requests that each call this method multiple times, a large amount of CPU was occupied and wasted.
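The growth can be checked with a line of shell arithmetic. In this small sketch, calcs_per_query is a hypothetical helper; the hour is the number of whole hours since midnight and n is the number of method calls in a single query.

```shell
# calcs_per_query HOURS N
# One calculation per elapsed second, repeated for each of the N calls.
calcs_per_query() {
  echo $(( $1 * 60 * 60 * $2 ))
}

calcs_per_query 10 1   # prints 36000
calcs_per_query 23 5   # prints 414000
```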
Solution
After locating the problem, the first consideration was to reduce the number of calculations by optimizing the abnormal method. Further investigation showed that the logic layer never used the contents of the set returned by the method, only its size. Once this was confirmed, the calculation was replaced with a new, simpler method (current seconds - seconds at midnight), which removed the excess calculations. After the change went online, server load and CPU usage fell by a factor of 30 compared with the abnormal period and returned to normal. At this point, the problem was resolved.
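The original Java code is not shown here, so the following shell sketch only illustrates the shape of the change: producing one entry per elapsed second and then counting them, versus a single subtraction. Both function names are hypothetical, and the enumeration uses seq as a stand-in for building the set of formatted timestamps.

```shell
# Wasteful shape: materialise one entry per elapsed second, then only count them.
# Argument: seconds elapsed since midnight (at least 1).
size_by_enumeration() {
  seq 1 "$1" | wc -l | tr -d ' '
}

# Direct shape: the size is simply (current seconds - seconds at midnight).
# Arguments: current epoch seconds, epoch seconds at midnight.
size_by_subtraction() {
  echo $(( $1 - $2 ))
}

size_by_enumeration 36000                   # prints 36000
size_by_subtraction 1700036000 1700000000   # prints 36000
```

Both calls return the same number, but the first does work proportional to the time of day while the second is constant, which is why the replacement removes the linear growth described earlier.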
Summary
When writing code, implementing the business logic is not enough; you also need to pay attention to performance. Merely implementing a requirement and implementing it efficiently and elegantly reflect two very different levels of engineering ability, and the latter is an engineer's core competitive strength.
After the code is written, review it more than once and ask whether it could be implemented in a better way. Don't overlook small details in online problems: the devil is in the details. Engineers need the drive to get to the root of a problem and the spirit of pursuing excellence. Only in this way can they continue to grow and improve.
By using tools like show-busy-java-threads.sh and following a systematic troubleshooting approach, you can quickly identify and resolve CPU usage issues on your Linux servers. Always strive for well-optimized code to keep your systems stable and efficient.