本文为读书笔记,关于 Java 线程堆栈分析,在阅读并进行实验的基础上进行整理,如果有问题欢迎反馈

Java 虚拟机提供了线程转储(Thread dump)的后门,通过这个后门,可以将线程堆栈打印出来。这个后门就是通过向 Java 进程发送一个 QUIT 信号,Java 虚拟机收到该信号之后,将系统当前的 Java 线程调用堆栈打印出来。

kill -3 <java_pid> /jstack <java_pid>

1. Demo

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
public class MyTest {
Object obj1 = new Object();
Object obj2 = new Object();
public void fun1() {
synchronized(obj1) {
fun2();
}
}
public void fun2() {
synchronized(obj2) {
while(true) {
System.out.print("");
}
}
}
public static void main(String[] args) {
MyTest aa = new MyTest();
aa.fun1();
}
}

1.1 使用 kill -3 得到的结果

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
2017-12-26 10:59:58
Full thread dump Java HotSpot(TM) 64-Bit Server VM (24.76-b04 mixed mode):
"Attach Listener" daemon prio=10 tid=0x00007fd9d4001000 nid=0x6993 runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"Service Thread" daemon prio=10 tid=0x00007fda3c0a8800 nid=0x695f runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"C2 CompilerThread1" daemon prio=10 tid=0x00007fda3c0a6000 nid=0x695e waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"C2 CompilerThread0" daemon prio=10 tid=0x00007fda3c0a3000 nid=0x695d waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"Signal Dispatcher" daemon prio=10 tid=0x00007fda3c0a0800 nid=0x695c waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"Finalizer" daemon prio=10 tid=0x00007fda3c078000 nid=0x695b in Object.wait() [0x00007fda32e06000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x0000000758c04858> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
- locked <0x0000000758c04858> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151)
at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:209)
"Reference Handler" daemon prio=10 tid=0x00007fda3c076000 nid=0x695a in Object.wait() [0x00007fd9ec28e000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x0000000758c04470> (a java.lang.ref.Reference$Lock)
at java.lang.Object.wait(Object.java:503)
at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)
- locked <0x0000000758c04470> (a java.lang.ref.Reference$Lock)
"main" prio=10 tid=0x00007fda3c008800 nid=0x6950 runnable [0x00007fda4588a000]
java.lang.Thread.State: RUNNABLE
at MyTest.fun2(MyTest.java:12)
- locked <0x0000000758c4a7f8> (a java.lang.Object)
at MyTest.fun1(MyTest.java:6)
- locked <0x0000000758c4a7e8> (a java.lang.Object)
at MyTest.main(MyTest.java:18)
"VM Thread" prio=10 tid=0x00007fda3c071800 nid=0x6959 runnable
"GC task thread#0 (ParallelGC)" prio=10 tid=0x00007fda3c01e000 nid=0x6951 runnable
"GC task thread#1 (ParallelGC)" prio=10 tid=0x00007fda3c020000 nid=0x6952 runnable
"GC task thread#2 (ParallelGC)" prio=10 tid=0x00007fda3c022000 nid=0x6953 runnable
"GC task thread#3 (ParallelGC)" prio=10 tid=0x00007fda3c024000 nid=0x6954 runnable
"GC task thread#4 (ParallelGC)" prio=10 tid=0x00007fda3c026000 nid=0x6955 runnable
"GC task thread#5 (ParallelGC)" prio=10 tid=0x00007fda3c027800 nid=0x6956 runnable
"GC task thread#6 (ParallelGC)" prio=10 tid=0x00007fda3c029800 nid=0x6957 runnable
"GC task thread#7 (ParallelGC)" prio=10 tid=0x00007fda3c02b800 nid=0x6958 runnable
"VM Periodic Task Thread" prio=10 tid=0x00007fda3c0b3000 nid=0x6960 waiting on condition
JNI global references: 123
Heap
PSYoungGen total 150528K, used 7772K [0x0000000758c00000, 0x0000000763380000, 0x0000000800000000)
eden space 129536K, 6% used [0x0000000758c00000,0x0000000759397200,0x0000000760a80000)
from space 20992K, 0% used [0x0000000761f00000,0x0000000761f00000,0x0000000763380000)
to space 20992K, 0% used [0x0000000760a80000,0x0000000760a80000,0x0000000761f00000)
ParOldGen total 342528K, used 0K [0x000000060a400000, 0x000000061f280000, 0x0000000758c00000)
object space 342528K, 0% used [0x000000060a400000,0x000000060a400000,0x000000061f280000)
PSPermGen total 21504K, used 2429K [0x0000000605200000, 0x0000000606700000, 0x000000060a400000)
object space 21504K, 11% used [0x0000000605200000,0x000000060545f738,0x0000000606700000)

1.2 jstack 的结果

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
2017-12-26 11:04:54
Full thread dump Java HotSpot(TM) 64-Bit Server VM (24.76-b04 mixed mode):
"Attach Listener" daemon prio=10 tid=0x00007fd9d4001000 nid=0x6993 waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"Service Thread" daemon prio=10 tid=0x00007fda3c0a8800 nid=0x695f runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"C2 CompilerThread1" daemon prio=10 tid=0x00007fda3c0a6000 nid=0x695e waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"C2 CompilerThread0" daemon prio=10 tid=0x00007fda3c0a3000 nid=0x695d waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"Signal Dispatcher" daemon prio=10 tid=0x00007fda3c0a0800 nid=0x695c runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"Finalizer" daemon prio=10 tid=0x00007fda3c078000 nid=0x695b in Object.wait() [0x00007fda32e06000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x0000000758c04858> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
- locked <0x0000000758c04858> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151)
at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:209)
"Reference Handler" daemon prio=10 tid=0x00007fda3c076000 nid=0x695a in Object.wait() [0x00007fd9ec28e000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x0000000758c04470> (a java.lang.ref.Reference$Lock)
at java.lang.Object.wait(Object.java:503)
at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)
- locked <0x0000000758c04470> (a java.lang.ref.Reference$Lock)
"main" prio=10 tid=0x00007fda3c008800 nid=0x6950 runnable [0x00007fda4588a000]
java.lang.Thread.State: RUNNABLE
at MyTest.fun2(MyTest.java:12)
- locked <0x0000000758c4a7f8> (a java.lang.Object)
at MyTest.fun1(MyTest.java:6)
- locked <0x0000000758c4a7e8> (a java.lang.Object)
at MyTest.main(MyTest.java:18)
"VM Thread" prio=10 tid=0x00007fda3c071800 nid=0x6959 runnable
"GC task thread#0 (ParallelGC)" prio=10 tid=0x00007fda3c01e000 nid=0x6951 runnable
"GC task thread#1 (ParallelGC)" prio=10 tid=0x00007fda3c020000 nid=0x6952 runnable
"GC task thread#2 (ParallelGC)" prio=10 tid=0x00007fda3c022000 nid=0x6953 runnable
"GC task thread#3 (ParallelGC)" prio=10 tid=0x00007fda3c024000 nid=0x6954 runnable
"GC task thread#4 (ParallelGC)" prio=10 tid=0x00007fda3c026000 nid=0x6955 runnable
"GC task thread#5 (ParallelGC)" prio=10 tid=0x00007fda3c027800 nid=0x6956 runnable
"GC task thread#6 (ParallelGC)" prio=10 tid=0x00007fda3c029800 nid=0x6957 runnable
"GC task thread#7 (ParallelGC)" prio=10 tid=0x00007fda3c02b800 nid=0x6958 runnable
"VM Periodic Task Thread" prio=10 tid=0x00007fda3c0b3000 nid=0x6960 waiting on condition
JNI global references: 123

1.3 分析

用户线程为

1
2
3
4
5
6
7
"main" prio=10 tid=0x00007fda3c008800 nid=0x6950 runnable [0x00007fda4588a000] //main 表示线程名称,proi=1 表示线程优先级,tid=0x00007fda3c008800 表示线程 id,nid=0x6950 表示本地线程 id(这个查问题的时候会用到),runnable 表示线程状态,0x00007fda4588a000 表示内存地址
java.lang.Thread.State: RUNNABLE
at MyTest.fun2(MyTest.java:12) //第一个MyTest 表示类名,fun2 表示方法名,第二个 MyTest.java 表示正在调用的函数所在的源代码文件,12 表示行号
- locked <0x0000000758c4a7f8> (a java.lang.Object) //locked 表示持有锁,后面的 0x0000000758c4a7f8 表示锁的 ID
at MyTest.fun1(MyTest.java:6)
- locked <0x0000000758c4a7e8> (a java.lang.Object)
at MyTest.main(MyTest.java:18)

如何获得 Java 虚拟机对应的本地线程

1
2
1. jps //获取 Java 进程 ID
2. pstack <java_pid> 获得 Java 虚拟机的本地线程的堆栈

获得的本地线程栈如下所示

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
Thread 19 (Thread 0x7fda4588b700 (LWP 26960)): // Thread 19 表示线程名称,0x7fda4588b700 表示本地线程 id,LWP 26960 本地线程的另一种表示,LWP-light weight process(和上面的 nid 会一一对应,这里是十进制,上面的是十六进制)
#0 0x00007fda401b2c47 in ?? ()
#1 0x00007fda401ab8fc in ?? ()
#2 0x0000000758c4a7f8 in ?? ()
#3 0x00007fda4588a8a0 in ?? ()
#4 0x0000000000000001 in ?? ()
#5 0x00007fda4014f058 in ?? ()
#6 0x00007fda4588a8f8 in ?? ()
#7 0x00007fda4014f058 in ?? ()
#8 0x0000000758c4a7d0 in ?? ()
#9 0x0000000000000001 in ?? ()
#10 0x0000000758c4a7e8 in ?? ()
#11 0x00007fda4588a8a8 in ?? ()
#12 0x000000060545d7b0 in ?? ()
#13 0x00007fda4588a918 in ?? ()
#14 0x000000060545dc80 in ?? ()
#15 0x0000000000000000 in ?? ()
Thread 18 (Thread 0x7fda40148700 (LWP 26961)):
#0 0x00007fda4546b68c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fda44652543 in os::PlatformEvent::park() () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#2 0x00007fda4461987f in Monitor::IWait(Thread*, long) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#3 0x00007fda4461a00e in Monitor::wait(bool, long, bool) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#4 0x00007fda4438dad3 in GCTaskManager::get_task(unsigned int) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#5 0x00007fda4438f198 in GCTaskThread::run() () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#6 0x00007fda44658ca8 in java_start(Thread*) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#7 0x00007fda45467aa1 in start_thread () from /lib64/libpthread.so.0
#8 0x00007fda44d99aad in clone () from /lib64/libc.so.6
Thread 17 (Thread 0x7fda3a5b3700 (LWP 26962)):
#0 0x00007fda4546b68c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fda44652543 in os::PlatformEvent::park() () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#2 0x00007fda4461987f in Monitor::IWait(Thread*, long) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#3 0x00007fda4461a00e in Monitor::wait(bool, long, bool) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#4 0x00007fda4438dad3 in GCTaskManager::get_task(unsigned int) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#5 0x00007fda4438f198 in GCTaskThread::run() () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#6 0x00007fda44658ca8 in java_start(Thread*) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#7 0x00007fda45467aa1 in start_thread () from /lib64/libpthread.so.0
#8 0x00007fda44d99aad in clone () from /lib64/libc.so.6
Thread 16 (Thread 0x7fda3a4b2700 (LWP 26963)):
#0 0x00007fda4546b68c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fda44652543 in os::PlatformEvent::park() () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#2 0x00007fda4461987f in Monitor::IWait(Thread*, long) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#3 0x00007fda4461a00e in Monitor::wait(bool, long, bool) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#4 0x00007fda4438dad3 in GCTaskManager::get_task(unsigned int) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#5 0x00007fda4438f198 in GCTaskThread::run() () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#6 0x00007fda44658ca8 in java_start(Thread*) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#7 0x00007fda45467aa1 in start_thread () from /lib64/libpthread.so.0
#8 0x00007fda44d99aad in clone () from /lib64/libc.so.6
Thread 15 (Thread 0x7fda3a3b1700 (LWP 26964)):
#0 0x00007fda4546b68c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fda44652543 in os::PlatformEvent::park() () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#2 0x00007fda4461987f in Monitor::IWait(Thread*, long) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#3 0x00007fda4461a00e in Monitor::wait(bool, long, bool) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#4 0x00007fda4438dad3 in GCTaskManager::get_task(unsigned int) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#5 0x00007fda4438f198 in GCTaskThread::run() () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#6 0x00007fda44658ca8 in java_start(Thread*) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#7 0x00007fda45467aa1 in start_thread () from /lib64/libpthread.so.0
#8 0x00007fda44d99aad in clone () from /lib64/libc.so.6
Thread 14 (Thread 0x7fda3a2b0700 (LWP 26965)):
#0 0x00007fda4546b68c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fda44652543 in os::PlatformEvent::park() () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#2 0x00007fda4461987f in Monitor::IWait(Thread*, long) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#3 0x00007fda4461a00e in Monitor::wait(bool, long, bool) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#4 0x00007fda4438dad3 in GCTaskManager::get_task(unsigned int) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#5 0x00007fda4438f198 in GCTaskThread::run() () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#6 0x00007fda44658ca8 in java_start(Thread*) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#7 0x00007fda45467aa1 in start_thread () from /lib64/libpthread.so.0
#8 0x00007fda44d99aad in clone () from /lib64/libc.so.6
Thread 13 (Thread 0x7fda3a1af700 (LWP 26966)):
#0 0x00007fda4546b68c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fda44652543 in os::PlatformEvent::park() () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#2 0x00007fda4461987f in Monitor::IWait(Thread*, long) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#3 0x00007fda4461a00e in Monitor::wait(bool, long, bool) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#4 0x00007fda4438dad3 in GCTaskManager::get_task(unsigned int) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#5 0x00007fda4438f198 in GCTaskThread::run() () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#6 0x00007fda44658ca8 in java_start(Thread*) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#7 0x00007fda45467aa1 in start_thread () from /lib64/libpthread.so.0
#8 0x00007fda44d99aad in clone () from /lib64/libc.so.6
Thread 12 (Thread 0x7fda3a0ae700 (LWP 26967)):
#0 0x00007fda4546b68c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fda44652543 in os::PlatformEvent::park() () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#2 0x00007fda4461987f in Monitor::IWait(Thread*, long) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#3 0x00007fda4461a00e in Monitor::wait(bool, long, bool) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#4 0x00007fda4438dad3 in GCTaskManager::get_task(unsigned int) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#5 0x00007fda4438f198 in GCTaskThread::run() () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#6 0x00007fda44658ca8 in java_start(Thread*) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#7 0x00007fda45467aa1 in start_thread () from /lib64/libpthread.so.0
#8 0x00007fda44d99aad in clone () from /lib64/libc.so.6
Thread 11 (Thread 0x7fda39fad700 (LWP 26968)):
#0 0x00007fda4546b68c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fda44652543 in os::PlatformEvent::park() () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#2 0x00007fda4461987f in Monitor::IWait(Thread*, long) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#3 0x00007fda4461a00e in Monitor::wait(bool, long, bool) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#4 0x00007fda4438dad3 in GCTaskManager::get_task(unsigned int) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#5 0x00007fda4438f198 in GCTaskThread::run() () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#6 0x00007fda44658ca8 in java_start(Thread*) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#7 0x00007fda45467aa1 in start_thread () from /lib64/libpthread.so.0
#8 0x00007fda44d99aad in clone () from /lib64/libc.so.6
Thread 10 (Thread 0x7fda32f08700 (LWP 26969)):
#0 0x00007fda4546ba5e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fda44657c0f in os::PlatformEvent::park(long) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#2 0x00007fda44619b2e in Monitor::IWait(Thread*, long) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#3 0x00007fda4461a00e in Monitor::wait(bool, long, bool) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#4 0x00007fda447e1899 in VMThread::loop() () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#5 0x00007fda447e1ba0 in VMThread::run() () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#6 0x00007fda44658ca8 in java_start(Thread*) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#7 0x00007fda45467aa1 in start_thread () from /lib64/libpthread.so.0
#8 0x00007fda44d99aad in clone () from /lib64/libc.so.6
Thread 9 (Thread 0x7fd9ec28f700 (LWP 26970)):
#0 0x00007fda4546b68c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fda44652543 in os::PlatformEvent::park() () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#2 0x00007fda44646ddd in ObjectMonitor::wait(long, bool, Thread*) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#3 0x00007fda444b6298 in JVM_MonitorWait () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#4 0x00007fda4015bd98 in ?? ()
#5 0x00007fda3c0768d8 in ?? ()
#6 0x00007fd900000000 in ?? ()
#7 0x00007fda3c076000 in ?? ()
#8 0x00007fd9ec28e5d8 in ?? ()
#9 0x0000000605202798 in ?? ()
#10 0x00007fd9ec28e648 in ?? ()
#11 0x00000006052a9e68 in ?? ()
#12 0x0000000000000000 in ?? ()
Thread 8 (Thread 0x7fda32e07700 (LWP 26971)):
#0 0x00007fda4546b68c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fda44652543 in os::PlatformEvent::park() () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#2 0x00007fda44646ddd in ObjectMonitor::wait(long, bool, Thread*) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#3 0x00007fda444b6298 in JVM_MonitorWait () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#4 0x00007fda4015bd98 in ?? ()
#5 0x00007fda3c078a08 in ?? ()
#6 0x00007fda00000000 in ?? ()
#7 0x00007fda3c078000 in ?? ()
#8 0x00007fda32e065e8 in ?? ()
#9 0x0000000000000000 in ?? ()
Thread 7 (Thread 0x7fda32d06700 (LWP 26972)):
#0 0x00007fda4546da00 in sem_wait () from /lib64/libpthread.so.0
#1 0x00007fda446578ea in check_pending_signals(bool) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#2 0x00007fda44650b35 in signal_thread_entry(JavaThread*, Thread*) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#3 0x00007fda447930ff in JavaThread::thread_main_inner() () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#4 0x00007fda44793205 in JavaThread::run() () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#5 0x00007fda44658ca8 in java_start(Thread*) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#6 0x00007fda45467aa1 in start_thread () from /lib64/libpthread.so.0
#7 0x00007fda44d99aad in clone () from /lib64/libc.so.6
Thread 6 (Thread 0x7fda32c05700 (LWP 26973)):
#0 0x00007fda4546b68c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fda44652543 in os::PlatformEvent::park() () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#2 0x00007fda4461987f in Monitor::IWait(Thread*, long) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#3 0x00007fda4461a086 in Monitor::wait(bool, long, bool) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#4 0x00007fda44282648 in CompileQueue::get() () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#5 0x00007fda4428674a in CompileBroker::compiler_thread_loop() () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#6 0x00007fda447930ff in JavaThread::thread_main_inner() () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#7 0x00007fda44793205 in JavaThread::run() () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#8 0x00007fda44658ca8 in java_start(Thread*) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#9 0x00007fda45467aa1 in start_thread () from /lib64/libpthread.so.0
#10 0x00007fda44d99aad in clone () from /lib64/libc.so.6
Thread 5 (Thread 0x7fda32b04700 (LWP 26974)):
#0 0x00007fda4546b68c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fda44652543 in os::PlatformEvent::park() () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#2 0x00007fda4461987f in Monitor::IWait(Thread*, long) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#3 0x00007fda4461a086 in Monitor::wait(bool, long, bool) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#4 0x00007fda44282648 in CompileQueue::get() () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#5 0x00007fda4428674a in CompileBroker::compiler_thread_loop() () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#6 0x00007fda447930ff in JavaThread::thread_main_inner() () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#7 0x00007fda44793205 in JavaThread::run() () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#8 0x00007fda44658ca8 in java_start(Thread*) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#9 0x00007fda45467aa1 in start_thread () from /lib64/libpthread.so.0
#10 0x00007fda44d99aad in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x7fda32a03700 (LWP 26975)):
#0 0x00007fda4546b68c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fda44652543 in os::PlatformEvent::park() () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#2 0x00007fda4461987f in Monitor::IWait(Thread*, long) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#3 0x00007fda4461a00e in Monitor::wait(bool, long, bool) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#4 0x00007fda446e95a8 in ServiceThread::service_thread_entry(JavaThread*, Thread*) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#5 0x00007fda447930ff in JavaThread::thread_main_inner() () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#6 0x00007fda44793205 in JavaThread::run() () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#7 0x00007fda44658ca8 in java_start(Thread*) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#8 0x00007fda45467aa1 in start_thread () from /lib64/libpthread.so.0
#9 0x00007fda44d99aad in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7fda32902700 (LWP 26976)):
#0 0x00007fda4546ba5e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fda44657c0f in os::PlatformEvent::park(long) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#2 0x00007fda44619b2e in Monitor::IWait(Thread*, long) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#3 0x00007fda4461a00e in Monitor::wait(bool, long, bool) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#4 0x00007fda4478f121 in WatcherThread::sleep() const () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#5 0x00007fda4478f63e in WatcherThread::run() () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#6 0x00007fda44658ca8 in java_start(Thread*) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#7 0x00007fda45467aa1 in start_thread () from /lib64/libpthread.so.0
#8 0x00007fda44d99aad in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7fda32801700 (LWP 27027)):
#0 0x00007fda4546eb2d in accept () from /lib64/libpthread.so.0
#1 0x00007fda44136af1 in LinuxAttachListener::dequeue() () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#2 0x00007fda44136c8b in AttachListener::dequeue() () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#3 0x00007fda441357da in attach_listener_thread_entry(JavaThread*, Thread*) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#4 0x00007fda447930ff in JavaThread::thread_main_inner() () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#5 0x00007fda44793205 in JavaThread::run() () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#6 0x00007fda44658ca8 in java_start(Thread*) () from /usr/local/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#7 0x00007fda45467aa1 in start_thread () from /lib64/libpthread.so.0
#8 0x00007fda44d99aad in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7fda4588d700 (LWP 26959)):
#0 0x00007fda454682fd in pthread_join () from /lib64/libpthread.so.0
#1 0x00007fda45256505 in ContinueInNewThread0 () from /usr/local/jdk1.7.0_76/bin/../lib/amd64/jli/libjli.so
#2 0x00007fda4524b58a in ContinueInNewThread () from /usr/local/jdk1.7.0_76/bin/../lib/amd64/jli/libjli.so
#3 0x00007fda4524e0e0 in JLI_Launch () from /usr/local/jdk1.7.0_76/bin/../lib/amd64/jli/libjli.so
#4 0x0000000000400686 in main ()

线程状态中的 runnable 从虚拟机的角度来看,表示这个线程正在运行。但是出于 runnable 的线程不一定真正消耗 CPU。出于 runnable 的线程只能说明该线程没有阻塞在 java 的 wait 或者 sleep 方法上,同时也没有等待在锁上面。但是如果该线程调用了本地方法(比如 java.net.SocketInputStream.socketRead0) ,而本地方法出于等待状态,这个时候虚拟机是不知道本地代码中发生了什么,此时尽管当前线程实际上也是阻塞状态,但实际上显示出来的还是 runnable 状态,这种情况下是不消耗 CPU 的。如下线程堆栈所示

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
"Thread-243" prio=1 tid=0xa58f2048 nid=0x7ac2 runnable [0xaeedb000..0xaeedc480]
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at oracle.net.ns.Packet.receive(Unknown Source)
at oracle.net.ns.DataPacket.receive(Unknown Source)
at oracle.net.ns.NetInputStream.getNextPacket(Unknown Source)
at oracle.net.ns.NetInputStream.read(Unknown Source)
at oracle.jdbc.driver.T4CMAREngine.getNBytes(T4CMAREngine.java:1520)
at oracle.jdbc.driver.T4CMAREngine.unmarshalNBytes()
at oracle.jdbc.driver.T4CLongRawAccessor.readStreamFromWire()
at oracle.jdbc.driver.T4CLongRawAccessor.readStream()
at oracle.jdbc.driver.T4CInputStream.getBytes(T4CInputStream.java:70)
- locked <0x934f4258> (a oracle.jdbc.driver.T4CInputStream)
- locked <0x6b0dd600> (a oracle.jdbc.driver.T4CConnection)
at oracle.jdbc.driver.OracleInputStream.needBytes()
... ...
at org.hibernate.loader.Loader.list(Loader.java:1577)
at org.hibernate.loader.hql.QueryLoader.list()
at com.wes.timer.TimerTaskImpl.execute(TimerTaskImpl.java:627)
- locked <0x80df8ce8> (a com.wes.timer.TimerTaskImpl)
at com.wes.threadpool.RunnableWrapper.run(RunnableWrapper.java:209)
at com.wes.threadpool.PooledExecutorEx$Worker.run()
at java.lang.Thread.run(Thread.java:595)

那么 “.“ 和 “.“ 各表示什么含义呢? “.“ 表示当前正在执行类的初始化,”.“ 正在执行对象的构造函数。

除接口外,初始化一个类之前必须保证其直接父类已被初始化,并且该初始化过程是由 JVM 保证线程安全的。并非所有的类都拥有一个 方法,在以下条件中该类不会拥有 方法

  • 该类既没有声明任何类变量,也没有静态初始化语句
  • 该类声明了类变量,但没有明确使用类变量初始化语句或静态初始化语句初始化
  • 该类仅包含静态 final 变量的类变量初始化语句,并且类变量初始化语句是编译时常量表达式

2 锁的解读

多线程中,wait() 和 sleep() 的区别。wait() 和 sleep() 有一个共同点,就是二者都会把当前的线程阻塞住,我们称之为睡眠或者等待。但二者实际上是完全不同的两个函数,有着本质的区别:

  • 当线程执行到 wait() 方法上,当前线程会释放监视锁,此时其他线程可以占有该锁,一旦 wait() 方法执行完成,当前线程又继续持有该锁,直到执行完该锁的作用域。
    wait() 方法退出的原因:
    • 达到了等待时间,自动退出。如 wati(5000), 5秒后 wait 方法退出
    • 其他的线程调用了该锁的 notify() 方法。当如果多个线程在等待同一个锁,只有一个线程会被通知到
  • sleep() 与锁操作无关,如果该方法恰好在一个锁的保护范围之内,当前线程及时在执行 sleep() 的时候,仍然继续保持监视锁。该方法实际上仅仅完成等待或者睡眠的语义
wait_sleep.png

与锁相关的三个重要信息如下:

  • 当一个线程占有一个锁的时候,线程堆栈中会打印 – locked <0x22bffb60>
  • 当一个线程正在等待其他线程释放该锁,线程堆栈中会打印 – waiting to lock <0x22bffb60>
  • 当一个线程占有一个锁,但有执行到该锁的 wait() 上,线程堆栈中首先打印 locked, 然后又会打印 – waiting on <0x22c03c60>

3 线程状态的解读

3.1 RUNNABLE

RUNNABLE 从虚拟机的角度看,线程处于正在运行的状态,正在运行不一定消耗 CPU,比如下面的线程栈,在等待 IO

1
2
3
4
5
6
7
8
9
10
11
Thread-39" daemon prio=1 tid=0x08646590 nid=0x666d runnable [5beb7000..5beb88b8]
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:183)
at java.io.BufferedInputStream.read(BufferedInputStream.java:201)
- locked <0x47bfb940> (a java.io.BufferedInputStream)
at org.postgresql.PG_Stream.ReceiveChar(PG_Stream.java:141)
at org.postgresql.core.QueryExecutor.execute(QueryExecutor.java:68)
- locked <0x47bfb758> (a org.postgresql.PG_Stream)
at org.postgresql.Connection.ExecSQL(Connection.java:398)

下面的线程栈则消耗 CPU

1
2
3
4
5
6
7
8
9
10
11
12
13
"Thread-444" prio=1 tid=0xa4853568 nid=0x7ade runnable [0xafcf7000..0xafcf8680]
java.lang.Thread.State: RUNNABLE
//实实在在对应 CPU 指令
at org.apache.commons.collections.ReferenceMap.getEntry(Unknown Source)
at org.apache.commons.collections.ReferenceMap.get(Unknown Source)
at org.hibernate.util.SoftLimitMRUCache.get(SoftLimitMRUCache.java:51)
at org.hibernate.engine.query.QueryPlanCache.getNativeSQLQueryPlan()
at org.hibernate.impl.AbstractSessionImpl.getNativeSQLQueryPlan()
at org.hibernate.impl.AbstractSessionImpl.list()
at org.hibernate.impl.SQLQueryImpl.list(SQLQueryImpl.java:164)
at com.mogoko.struts.logic.user.LeaveMesManager.getCommentByShopId()
at com.mogoko.struts.action.shop.ShopIndexBaseInfoAction.execute()
......

下面的代码正在进行 JNI 本地方法调用,具体是否消耗 CPU,要看 TcpRecvExt 的实现,如果 TcpRecvExt 是纯运算代码,那么实实在在的消耗 CPU,如果 TcpRecvExt() 中存在挂起的代码,那么该线程尽管显示 RUNNABLE,实际也不消耗 CPU 的

1
2
3
4
5
6
"ClientReceiveThread" daemon prio=1 tid=0x99dbacf8 nid=0x7988 runnable [...]
java.lang.Thread.State: RUNNABLE
at com.pangu.network.icdcomm.htcpapijni.TcpRecvExt(Native Method)
at com.pangu.network.icdcomm.IcdComm.receive(IcdComm.java:60)
at com.msp.client.MspFactory$ClientReceiveThread.task(MspFactory.java:333)
at com.msp.system.TaskThread.run(TaskThread.java:94)

3.2 TIMED_WAITING(on object monitor)

TIMED_WAITING(on object monitor) 表示当前线程被挂起一段时间,说明该线程正在执行 object.wait(int time) 方法,不消耗 CPU

1
2
3
4
5
6
7
"JMX server" daemon prio=6 tid=0x0ad2c800 nid=0xdec in Object.wait() [...]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x03129da0> (a [I)
at com.sun.jmx.remote.internal.ServerComm$Timeout.run(ServerComm.java:150)
- locked <0x03129da0> (a [I)
at java.lang.Thread.run(Thread.java:620)

3.3 TIME_WAITING(sleeping)

TIME_WAITING(sleeping) 表示当前线程被挂起一段时间,正在执行 Thread.sleep(int time) 方法,不消耗 CPU

1
2
3
4
5
6
"Comm thread" daemon prio=10 tid=0x00002aaad4107400 nid=0x649f waiting on condition
[0x000000004133b000..0x000000004133ba00]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.mapred.Task$1.run(Task.java:282)
at java.lang.Thread.run(Thread.java:619)

3.4 TIME_WAITING(parking)

TIME_WAITING(parking) 表示线程被挂起,正在执行 Thread.sleep(int time) 方法,不消耗 CPU

1
2
3
4
5
6
7
8
9
10
11
"RMI TCP" daemon prio=6 tid=0x0ae3b800 nid=0x958 waiting on condition [0x17eff000..0x17effa94]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x02f49f58> (a java.util.concurrent.SynchronousQueue$TransferStack)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:179)
at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:424)
at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:323)
at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:871)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:495)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:693)
at java.lang.Thread.run(Thread.java:620)

3.5 WAITING(on object monitor)
WAITING(on object monitor) 表示当前线程被挂起,正在执行 obj.wait() 只能通过 notify() 唤醒,不消耗 CPU

1
2
3
4
5
6
7
8
"IPC Client" daemon prio=10 tid=0x00002aaad4129800 nid=0x649d in Object.wait() [0x039000..0x039d00]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00002aaab3acad18>; (aorg.apache.hadoop.ipc.Client$Connection)
at java.lang.Object.wait(Object.java:485)
at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:234)
- locked <0x00002aaab3acad18> (aorg.apache.hadoop.ipc.Client$Connection)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:273)

3.6 小结

处于 TIME_WAITING、WAITING 状态的线程不消耗 CPU,处于 RUNNABLE 状态的线程要结合线程代码性质判断,是否消耗 CPU

  • 如果是纯 Java 运算代码,消耗 CPU
  • 如果是网络 IO,很少消耗 CPU
  • 如果是本地代码,结合本地代码性质进行判断(通过 pstack/gstack 获取本地线程栈),如果是纯运算代码,消耗 CPU,如果被挂起,不消耗 CPU,如果是 IO,则不怎么消耗 CPU

4 接触线程堆栈进行问题分析

线程堆栈翻译系统当前时间正在执行什么代码。根据这些信息就可以知道系统当前到底在做什么。看堆栈一般从三个视角来分析:堆栈的局部信息、一次堆栈的统计信息(全局信息)、多个堆栈的对比信息

  1. 视角一 从一次的堆栈信息中,我们能直接获取以下直接的信息:
    a. 当前每一个线程的调用层次关系(调用上下文),即每个线程当前正在调用哪些函数
    b. 当前每个线程当前的状态:持有了哪些锁?在等待哪些锁?
  2. 视角二 从一次的堆栈信息中,我们还可以获取下面的统计方面信息:
    a. 当前锁的争用情况:
    i. 是不是很多线程在等待同一个锁,如果很多线程等待同一个锁,说明这个系统已经出现了性能瓶颈,并导致了锁竞争。还可能是某个线程长时间持有一个锁不释放(比如这个线程正陷入了死循环的代码或者正在请求一个资源,很长时间得不到唤醒)
    ii. 是否有死锁,哪些线程形成了死锁
    
    b. 当前大多数线程在干什么,即正在执行什么代码
    c. 当前线程总数量
  3. 视角三 从多次(即前后打印多次堆栈进行对比)的堆栈信息中,我们还可以获得下面的统计对比方面的信息:
    a. 一个线程是否在长期执行。如果每次打印的堆栈,某一个线程一直处于同样的调用上下文中,那么说明这个线程一直在执行这段代码,此时就要根据代码逻辑检查,这种长期执行是否合理?
    b. 某个线程是否存在长期获取不到锁的情况?线程是不是永远得不到唤醒?如果每次打印的堆栈,某一个线程一直在等待一个锁,那么就需要检查占有这个锁的线程为什么不释放锁?

通过多个视角进行观察,线程堆栈在定位如下类型的问题上非常有帮助:

  • 线程死锁分析(视角一)
  • Java 代码导致的 CPU 过高分析(视角三)
  • 死循环分析(视角三)
  • 资源不足分析(视角二)
  • 性能瓶颈分析(视角二和视角三)

4.1 线程死锁分析

死锁的原因:当两个或多个线程正在等待被对方占有的锁,死锁就会发生。死锁会导致两个线程无法继续运行,被永远挂起。下图中在时间点0,线程 0 占有了 lock0,线程1占有了 lock1,然后时间点 2 线程0 企图获取 lock 1,线程 1 企图获取 lock 0.由于这两个线程互相要等待被对方占有的锁,自己才能继续,因此造成死锁,二者永远没有机会继续运行下去。

dead_lock.png

一个简单的例子

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
package MyPackage;
public class Main {
public static void main(String[] args) {
Object lockobj1 = new Object();
Object lockobj2 = new Object();
TestThread1 thread1 = new TestThread1(lockobj1,lockobj2);
thread1.start();
TestThread2 thread2 = new TestThread2(lockobj1,lockobj2);
thread2.start();
}
}
package MyPackage;
public class TestThread1 extends Thread{
Object lock1 = null;
Object lock2 = null;
public TestThread1(Object lock1_,Object lock2_)
{
lock1 = lock1_;
lock2 = lock2_;
this.setName(this.getClass().getName());
}
public void run()
{
fun();
}
public void fun(){
synchronized(lock1){
try{
Thread.sleep(2);
}
catch(Exception e){
e.printStackTrace();
}
synchronized(lock2){
}
}
}
}
package MyPackage;
public class TestThread2 extends Thread{
Object lock1 = null;
Object lock2 = null;
public TestThread2(Object lock1_,Object lock2_)
{
lock1 = lock1_;
lock2 = lock2_;
this.setName(this.getClass().getName());
}
public void run()
{
fun();
}
public void fun(){
synchronized(lock2){
try{
Thread.sleep(2);
}
catch(Exception e){
e.printStackTrace();
}
synchronized(lock1){
}
}
}
}

jstack 得到结果如下(截取后的堆栈信息)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
Found one Java-level deadlock:
=============================
"MyPackage.TestThread2":
waiting to lock monitor 0x00007fa1b00062c8 (object 0x0000000758c4aa78, a java.lang.Object),
which is held by "MyPackage.TestThread1"
"MyPackage.TestThread1":
waiting to lock monitor 0x00007fa1b0004ed8 (object 0x0000000758c4aa88, a java.lang.Object),
which is held by "MyPackage.TestThread2"
Java stack information for the threads listed above:
===================================================
"MyPackage.TestThread2":
at MyPackage.TestThread2.fun(TestThread2.java:24)
- waiting to lock <0x0000000758c4aa78> (a java.lang.Object)
- locked <0x0000000758c4aa88> (a java.lang.Object)
at MyPackage.TestThread2.run(TestThread2.java:13)
"MyPackage.TestThread1":
at MyPackage.TestThread1.fun(TestThread1.java:24)
- waiting to lock <0x0000000758c4aa88> (a java.lang.Object)
- locked <0x0000000758c4aa78> (a java.lang.Object)
at MyPackage.TestThread1.run(TestThread1.java:13)
Found 1 deadlock.

要避免死锁的问题,唯一的办法就是修改代码。一个可靠的并发系统可以说是设计出来的,而不是通过改 Bug 改出来的,这一点与其他类型的 Bug 有很大的不同。另外,死锁的两个或多个线程是不消耗 CPU 的,有的人认为 CPU 100% 的使用率是线程死锁造成的,这个说法是不对的。无限循环(死循环),并且循环中代码都是 CPU 密集型,才有可能导致 CPU 的 100% 使用率,像 socket 或者数据库等 IO 操作是不怎么消耗 CPU 的。

4.2 Java 代码死循环等导致的 CPU 过高分析

当系统负载大的时候,CPU 的使用率会较高,但是不正确的代码也会导致 CPU 过高,比如死循环。当发生 CPU 过高的问题,我们需要能够分析 CPU 高的真正原因。通过多次打印堆栈,前后堆栈对比找到一直在运行的线程,这些线程都是可疑的线程,具体的步骤如下:

  • 通过 jstack 等获取第一次的堆栈信息
  • 等待一段时间,再获取第二次堆栈信息
  • 预处理两次堆栈信息,首先去掉处于 sleeping 或者 waiting 状态的线程,因为这种线程是不消耗 CPU 的
  • 比较第一次堆栈和第二次堆栈预处理后的线程,找出这段时间一直活跃的线程,如果两次堆栈中同一个线程处于同样的调用上下文,那么就应该列为重点怀疑对象。结合代码进行检查

如果通过堆栈定位,没有发现热点代码段,那么 CPU 过高可能是不恰当的内存设置导致的频繁 GC,而从导致 CPU 过高

第一次 jstack 的输出如下:

1
2
3
4
5
6
7
8
9
10
11
"Thread-444" prio=1 tid=0xa4853568 nid=0x7ade runnable [0xafcf7000..0xafcf8680]
at org.apache.commons.collections.ReferenceMap.getEntry(Unknown Source)
at org.apache.commons.collections.ReferenceMap.get(Unknown Source)
at org.hibernate.util.SoftLimitMRUCache.get(SoftLimitMRUCache.java:51)
at org.hibernate.engine.query.QueryPlanCache.getNativeSQLQueryPlan()
at org.hibernate.impl.AbstractSessionImpl.getNativeSQLQueryPlan()
at org.hibernate.impl.AbstractSessionImpl.list()
at org.hibernate.impl.SQLQueryImpl.list(SQLQueryImpl.java:164)
at com.mogoko.struts.logic.user.LeaveMesManager.getCommentByShopId()
at com.mogoko.struts.action.shop.ShopIndexBaseInfoAction.execute()
......

第二次 jstack 的输出如下:

1
2
3
4
5
6
7
8
9
10
11
"Thread-444" prio=1 tid=0xa4853568 nid=0x7ade runnable [0xafcf7000..0xafcf8680]
at org.apache.commons.collections.ReferenceMap.getEntry(Unknown Source)
at org.apache.commons.collections.ReferenceMap.get(Unknown Source)
at org.hibernate.util.SoftLimitMRUCache.get(SoftLimitMRUCache.java:51)
at org.hibernate.engine.query.QueryPlanCache.getNativeSQLQueryPlan()
at org.hibernate.impl.AbstractSessionImpl.getNativeSQLQueryPlan()
at org.hibernate.impl.AbstractSessionImpl.list()
at org.hibernate.impl.SQLQueryImpl.list(SQLQueryImpl.java:164)
at com.mogoko.struts.logic.user.LeaveMesManager.getCommentByShopId()
at com.mogoko.struts.action.shop.ShopIndexBaseInfoAction.execute()
... ...

在长达 5 分钟的时间里,一直在执行 org.apache.commons.collections.ReferenceMap.getEntry() 方法,说明这个方法一直没有结束。具体属于正常还是 Bug,需要结合源码进行判断。

导致死循环的代码属于代码的 Bug,这种类型的问题,重现比较难,但一旦重现,就比较容易解决。一般通过分析代码可以发现问题。导致死循环的原因大致有如下几个:

  • HashMap 等线程不安全的容器,用在多线程读/写的场合,导致 HashMap 的方法调用形成死循环(可以参考2015-08-19 Hadoop升级过程出现异常问题casestudy)
  • 多线程场合,对共享变量没有进行保护,导致数据混乱,从而使循环退出条件永远不满足,导致死循环的发生,比如:
    • for, while 循环中的退出条件永远不满足导致的死循环
    • 链表等数据结构首尾相连,导致遍历永远无法停止
  • 其他编码错误

4.3 高消耗 CPU 代码的常用分析方法

借助操作系统提供的性能分析工具进行 CPU 消耗分析。上面介绍了针对死循环导致 CPU 遍告的分析方法,但是有些场景没有死循环也会导致 CPU 变高,我们需要借助系统提供的一些性能分析工具进行分析。在 Linux 中使用 top -p $pid 进行,步骤如下

  1. top -p $pid
  2. 输入 “H” 查看该进程所有线程的统计情况(CPU 等)如下所示(对于某些操作系统版本不支持 top 命令进行线程统计的
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
下面 PID 表示线程 ID,即 LWPID
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
25059 sankuai 20 0 9918m 55m 8264 S 0.0 0.2 0:00.00 java
25060 sankuai 20 0 9918m 55m 8264 S 0.0 0.2 0:00.04 java
25061 sankuai 20 0 9918m 55m 8264 S 0.0 0.2 0:00.00 java
25062 sankuai 20 0 9918m 55m 8264 S 0.0 0.2 0:00.00 java
25063 sankuai 20 0 9918m 55m 8264 S 0.0 0.2 0:00.00 java
25064 sankuai 20 0 9918m 55m 8264 S 0.0 0.2 0:00.00 java
25065 sankuai 20 0 9918m 55m 8264 S 0.0 0.2 0:00.00 java
25066 sankuai 20 0 9918m 55m 8264 S 0.0 0.2 0:00.00 java
25067 sankuai 20 0 9918m 55m 8264 S 0.0 0.2 0:00.00 java
25068 sankuai 20 0 9918m 55m 8264 S 0.0 0.2 0:00.00 java
25069 sankuai 20 0 9918m 55m 8264 S 0.0 0.2 0:01.19 java
25070 sankuai 20 0 9918m 55m 8264 S 0.0 0.2 0:00.00 java
25071 sankuai 20 0 9918m 55m 8264 S 0.0 0.2 0:00.00 java
25072 sankuai 20 0 9918m 55m 8264 S 0.0 0.2 0:00.00 java
25073 sankuai 20 0 9918m 55m 8264 S 0.0 0.2 0:00.00 java
25074 sankuai 20 0 9918m 55m 8264 S 0.0 0.2 0:00.00 java
25075 sankuai 20 0 9918m 55m 8264 S 0.0 0.2 0:00.00 java
25076 sankuai 20 0 9918m 55m 8264 S 0.0 0.2 0:24.31 java
25077 sankuai 20 0 9918m 55m 8264 S 0.0 0.2 0:01.00 java
25078 sankuai 20 0 9918m 55m 8264 S 0.0 0.2 0:01.05 java
25141 sankuai 20 0 9918m 55m 8264 S 0.0 0.2 0:00.00 java

具体导致问题的代码可能是:

  1. 纯 Java 代码导致的 CPU 过高
  2. Java 代码中调用的 JNI 代码导致的 CPU 过高
  3. 虚拟机自身的代码导致的 CPU 过高,比如 GC 的 bug 等

可以如下进行分析:

  1. top -p $pid 获取最消耗 CPU 的本地线程 ID
  2. jstack 打印出 Java 堆栈
  3. pstack 打印本地线程堆栈
  4. 在 Java 线程堆栈中查找 nid=<第一步中最消耗 CPU 时间的线程 id>
    a. 如果找到的线程是纯 Java 代码,则该 Java 代码导致 CPU 过高
    b. 如果找到的是 Native code,说明导致 CPU 过高的问题在 JNI 调用中
    c. 如果找不到对于的线程,则有如下两种可能:
    i. JNI 调用中重新创建的线程来执行,在 Java 线程堆栈中不存在该线程的信息
    ii. 虚拟机自身的代码导致的 CPU 过高,如堆内存不够导致的频繁 fullgc,或虚拟机的 bug 等
    

4.4 资源不足等导致的性能下降分析

大多数时候资源不足和性能瓶颈是统一类问题。

  • 大量的线程停在同样的调用上下文上
  • 资源数量配置太少,系统当前压力比较大,资源不足导致某些线程不能及时获得资源而等待
  • 获得资源的线程把持资源时间太久(比如获取数据库连接),导致资源不足
  • 设计不合理导致资源占用时间过久,如 SQL 语句设计不恰当,或者没有索引导致的数据库访问太慢等
  • 资源用完后,某些异常情况下,没有关闭或回收,导致可用资源泄漏或减少

4.5 线程不退出导致的系统挂死分析

线程挂死,表现为每次打印线程堆栈,该线程都在同一个调用上下文,该类问题可以通过多次打印堆栈进行分析,导致线程挂死的原因有很多,如:

  • 线程正在执行死循环代码
  • 资源不足或资源泄漏,造成当前线程阻塞在锁对象上,长期得不到唤醒
  • 如果当前程序和外部通信,当外部程序挂起无返回时,也会导致当前线程挂起

4.6 其他问题分析

  • 多个锁导致的锁链分析 – 多个线程等待同一个锁
  • 进行性能瓶颈分析

4.7 线程堆栈不能分析什么问题

线程堆栈定位问题,只能定位在当前线程上留下痕迹的问题,如线程死锁,线程挂死等,另外,定位由于锁的设计不恰当导致的性能问题,线程堆栈也是最有效的工具,因为性能问题时时刻刻反映在当前的线程统计状况上。但线程堆栈对于不留痕迹的问题,就无能为力了。例如:

  • 线程为什么跑飞
  • 并发的 Bug 导致的数据混乱
  • 数据库梭镖的问题
  • 在实际的系统中,系统的问题分为几种类型:

堆栈中能够表现出问题的,使用线程堆栈进行定位

  • 无法在线程中留下痕迹的问题定位,需要依赖一个好的日志设计
  • 非常隐蔽的问题,只能依赖于丰富的代码经验,如多线程导致的数据混乱以及幽灵代码等

参考文献

《Java 问题定位技术》

Comments