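Linux has its own watchdog mechanism, but Android also implements a watchdog in user space to monitor the state of the system services running there.
In short, the principle is this: every monitored thread must feed the dog at regular intervals, otherwise the dog eats the system.
Watchdog initialization
Watchdog is a singleton, so I won't dwell on that. In the Watchdog constructor you can see: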
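The feed-the-dog idea can be shown with a toy in plain Java. This is only a sketch of the general principle, not Android's code: ToyWatchdog, feed and starving are made-up names, and the clock is passed in by hand so the behaviour is easy to follow. (The real Watchdog, as the rest of this post shows, posts a check to each thread instead of waiting for the thread to report in.)

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy illustration of "feed the dog or get eaten"; hypothetical, not the Android class. */
final class ToyWatchdog {
    private final long timeoutMillis;
    // Last time each monitored thread fed the dog, keyed by thread name.
    private final Map<String, Long> lastFed = new LinkedHashMap<>();

    ToyWatchdog(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** A monitored thread calls this periodically to feed the dog. */
    synchronized void feed(String name, long nowMillis) {
        lastFed.put(name, nowMillis);
    }

    /** Returns the names whose last feeding is at least timeoutMillis old. */
    synchronized List<String> starving(long nowMillis) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastFed.entrySet()) {
            if (nowMillis - e.getValue() >= timeoutMillis) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

With a 60 s timeout, a thread that fed at t=0 and never again shows up in starving(60_000), while one that fed again at t=50 s does not.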
super("watchdog");
// Initialize handler checkers for each common thread we want to check. Note
// that we are not currently checking the background thread, since it can
// potentially hold longer running operations with no guarantees about the timeliness
// of operations there.
// The shared foreground thread is the main checker. It is where we
// will also dispatch monitor checks and do other work.
mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
"foreground thread", DEFAULT_TIMEOUT);
mHandlerCheckers.add(mMonitorChecker);
// Add checker for main thread. We only do a quick check since there
// can be UI running on the thread.
mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
"main thread", DEFAULT_TIMEOUT));
// Add checker for shared UI thread.
mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
"ui thread", DEFAULT_TIMEOUT));
// And also check IO thread.
mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
"i/o thread", DEFAULT_TIMEOUT));
// And the display thread.
mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
"display thread", DEFAULT_TIMEOUT));
// Initialize monitor for Binder threads.
addMonitor(new BinderThreadMonitor());
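The source comments are clear. The constructor first calls the constructor of Thread (which Watchdog extends), then adds the foreground-thread HandlerChecker (mMonitorChecker), then adds checkers for the main, UI, I/O and display threads, and finally registers the Binder monitor on the foreground HandlerChecker.
Next, the init method, which is called from SystemServer.java: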
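As a rough illustration of the registration just described, here is a plain-Java sketch without any Android classes. Checker, Monitor and CheckerRegistrySketch are made-up names. The only things taken from the source are the thread labels, the 60 s DEFAULT_TIMEOUT, and the fact that monitors hang off the foreground checker.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the bookkeeping done in the Watchdog constructor. */
final class CheckerRegistrySketch {
    interface Monitor {
        void monitor();
    }

    /** One checker per watched thread; monitors are optional extra probes. */
    static final class Checker {
        final String name;
        final long waitMaxMillis;
        final List<Monitor> monitors = new ArrayList<>();

        Checker(String name, long waitMaxMillis) {
            this.name = name;
            this.waitMaxMillis = waitMaxMillis;
        }
    }

    static final long DEFAULT_TIMEOUT = 60_000;

    final List<Checker> checkers = new ArrayList<>();
    final Checker monitorChecker;

    CheckerRegistrySketch() {
        // The foreground checker comes first and is the one that carries monitors.
        monitorChecker = new Checker("foreground thread", DEFAULT_TIMEOUT);
        checkers.add(monitorChecker);
        for (String name : new String[] {"main thread", "ui thread", "i/o thread", "display thread"}) {
            checkers.add(new Checker(name, DEFAULT_TIMEOUT));
        }
    }

    /** Monitors are always attached to the foreground checker. */
    void addMonitor(Monitor m) {
        monitorChecker.monitors.add(m);
    }
}
```

The design choice worth noticing is that the other four checkers only prove their thread can still run a posted task, while lock probes are concentrated on one thread.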
watchdog.init(context, mActivityManagerService);
Here is the init method itself:
mResolver = context.getContentResolver();
mActivity = activity;
context.registerReceiver(new RebootRequestReceiver(),
new IntentFilter(Intent.ACTION_REBOOT),
android.Manifest.permission.REBOOT, null);
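It registers a broadcast receiver that listens for the reboot request (Intent.ACTION_REBOOT).
Running the watchdog
Watchdog extends Thread, so go straight to its run method: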
boolean waitedHalf = false;
while (true) {
....
}
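It first initializes a boolean, waitedHalf, which records whether half of the timeout has already been waited, and then enters an infinite loop. Here is what the loop body does: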
final ArrayList<HandlerChecker> blockedCheckers;
final String subject;
final boolean allowRestart;
int debuggerWasConnected = 0;
long timeout = CHECK_INTERVAL;
// Make sure we (re)spin the checkers that have become idle within
// this wait-and-check interval
for (int i=0; i<mHandlerCheckers.size(); i++) {
    HandlerChecker hc = mHandlerCheckers.get(i);
    hc.scheduleCheckLocked();
}
if (debuggerWasConnected > 0) {
    debuggerWasConnected--;
}
// NOTE: We use uptimeMillis() here because we do not want to increment the time we
// wait while asleep. If the device is asleep then the thing that we are waiting
// to timeout on is asleep as well and won't have a chance to run, causing a false
// positive on when to kill things.
long start = SystemClock.uptimeMillis();
while (timeout > 0) {
    if (Debug.isDebuggerConnected()) {
        debuggerWasConnected = 2;
    }
    try {
        wait(timeout);
    } catch (InterruptedException e) {
        Log.wtf(TAG, e);
    }
    if (Debug.isDebuggerConnected()) {
        debuggerWasConnected = 2;
    }
    timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
}
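blockedCheckers holds the HandlerCheckers that are blocked. timeout starts at CHECK_INTERVAL, which is 30 s, while DEFAULT_TIMEOUT is 60 s, so the loop runs once every half minute. In the for loop, scheduleCheckLocked is called on every HandlerChecker. Put simply, the dog says: I'm hungry, come and feed me. Here is the implementation: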
if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
    // If the target looper has recently been polling, then
    // there is no reason to enqueue our checker on it since that
    // is as good as it not being deadlocked. This avoid having
    // to do a context switch to check the thread. Note that we
    // only do this if mCheckReboot is false and we have no
    // monitors, since those would need to be executed at this point.
    mCompleted = true;
    return;
}
if (!mCompleted) {
    // we already have a check in flight, so no need
    return;
}
mCompleted = false;
mCurrentMonitor = null;
mStartTime = SystemClock.uptimeMillis();
mHandler.postAtFrontOfQueue(this);
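The first check: if the checker has no monitors and the target thread's message queue is idle, polling for the next message, then that thread is clearly not blocked, so the check is marked complete right away. If there are monitors, or the handler is busy processing something, we reach the second check. mCompleted defaults to true, so the first time through this branch is not taken; on later rounds it returns early if the previous check is still in flight. Falling through means the feeding begins: mCompleted is set to false, the start time is recorded, and the checker posts itself to the front of the handler's queue. That brings us to HandlerChecker's run method: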
final int size = mMonitors.size();
for (int i = 0 ; i < size ; i++) {
    synchronized (Watchdog.this) {
        mCurrentMonitor = mMonitors.get(i);
    }
    mCurrentMonitor.monitor();
}
synchronized (Watchdog.this) {
    mCompleted = true;
    mCurrentMonitor = null;
}
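It walks through all the monitors attached to the checker and calls monitor() on each one. An example shows what monitor() does: ActivityManagerService calls addMonitor to attach itself to the foreground HandlerChecker, so when that checker runs, it executes ActivityManagerService's monitor method: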
/** In this method we try to acquire our lock to make sure that we have not deadlocked */
public void monitor() {
    synchronized (this) { }
}
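See, it just acquires its own lock to find out whether it is deadlocked. So the feeding process is: every 30 seconds, Watchdog has each registered HandlerChecker run the monitor() method of its monitors, to see whether anything is deadlocked. If nothing is deadlocked, mCompleted naturally ends up true; if a check is still stuck, it stays false. Moving on: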
long start = SystemClock.uptimeMillis();
while (timeout > 0) {
    if (Debug.isDebuggerConnected()) {
        debuggerWasConnected = 2;
    }
    try {
        wait(timeout);
    } catch (InterruptedException e) {
        Log.wtf(TAG, e);
    }
    if (Debug.isDebuggerConnected()) {
        debuggerWasConnected = 2;
    }
    timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
}
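One detail of this loop is easy to miss: wait(timeout) may return early (a notify, an interrupt), so the remaining time is recomputed from the start timestamp until the whole interval has passed. Here is a minimal sketch of the same pattern in plain Java, assuming System.nanoTime() as a stand-in for SystemClock.uptimeMillis(). FullIntervalWait and its methods are made-up names.

```java
/** Hypothetical helper showing the "recompute the remaining wait" pattern. */
final class FullIntervalWait {

    /** Blocks on lock for at least intervalMillis even if woken early; returns elapsed ms. */
    static long waitFullInterval(Object lock, long intervalMillis) {
        long start = System.nanoTime();
        long remaining = intervalMillis;
        synchronized (lock) {
            while (remaining > 0) {
                try {
                    lock.wait(remaining);
                } catch (InterruptedException e) {
                    // Sketch only: keep waiting, much like the original, which just logs.
                }
                // Early wakeups land here; work out how much of the interval is left.
                remaining = intervalMillis - (System.nanoTime() - start) / 1_000_000;
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    /** Pokes the waiter after about 20 ms and reports how long it really waited. */
    static long demo() {
        final Object lock = new Object();
        Thread poker = new Thread(() -> {
            try {
                Thread.sleep(20);
            } catch (InterruptedException ignored) {
                // Nothing to clean up in this sketch.
            }
            synchronized (lock) {
                lock.notifyAll();
            }
        });
        poker.start();
        return waitFullInterval(lock, 150);
    }

    public static void main(String[] args) {
        System.out.println("waited at least 150 ms: " + (demo() >= 150));
    }
}
```

Without the recomputation, the early notifyAll() would cut the wait short and the checkers would be evaluated too soon.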
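Watchdog keeps its own time: it records start, then waits 30 seconds (wait(timeout)), recomputing the remaining time after each wakeup, and only once the full 30 seconds have really elapsed does it move on to the next stage: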
final int waitState = evaluateCheckerCompletionLocked();
if (waitState == COMPLETED) {
    // The monitors have returned; reset
    waitedHalf = false;
    continue;
} else if (waitState == WAITING) {
    // still waiting but within their configured intervals; back off and recheck
    continue;
} else if (waitState == WAITED_HALF) {
    if (!waitedHalf) {
        // We've waited half the deadlock-detection interval. Pull a stack
        // trace and wait another half.
        ArrayList<Integer> pids = new ArrayList<Integer>();
        pids.add(Process.myPid());
        ActivityManagerService.dumpStackTraces(true, pids, null, null,
                NATIVE_STACKS_OF_INTEREST);
        waitedHalf = true;
    }
    continue;
}
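Watchdog has now waited 30 seconds and arrives here. It first determines whether the feeding has completed by calling evaluateCheckerCompletionLocked, which takes the worst state across all checkers: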
state = Math.max(state, hc.getCompletionStateLocked());
getCompletionStateLocked looks like this:
if (mCompleted) {
    return COMPLETED;
} else {
    long latency = SystemClock.uptimeMillis() - mStartTime;
    if (latency < mWaitMax/2) {
        return WAITING;
    } else if (latency < mWaitMax) {
        return WAITED_HALF;
    }
}
return OVERDUE;
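If the check has completed, the state is COMPLETED and there is nothing more to say. Looking back at the branch above, for COMPLETED waitedHalf is reset to false and the loop continues: within those 30 seconds every thread fed the dog, so the next round begins.
Otherwise, take the current time minus the time the check was posted, that is, how long the check (for example a monitor waiting for its lock) has been pending. If that is less than half of the configured timeout, we can keep waiting, hence continue. If it is at least half but still less than the timeout, 30 s have already gone by, so some work is done: dumpStackTraces is called and waitedHalf is set to true.
Finally, if the elapsed time reaches the configured timeout, meaning the check we posted to the handler has gone unfinished for 60 seconds, the thread is judged to be blocked and the state is OVERDUE.
Once the state is OVERDUE, we can conclude that some thread has been stuck for more than 60 s, and it is time to eat the system: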
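The three thresholds are easier to see with concrete numbers. Below is a small stand-alone sketch of the same decision in plain Java. It assumes the four states are integers ordered by severity (which the Math.max aggregation above implies); the exact values 0 to 3 and the names CompletionStateSketch, stateOf and worst are my own.

```java
/** Hypothetical re-statement of the completion-state logic with the clock passed in. */
final class CompletionStateSketch {
    // Assumed ordering: a larger value is a worse state.
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    /** Classifies one checker from its completed flag and how long its check has been pending. */
    static int stateOf(boolean completed, long latencyMillis, long waitMaxMillis) {
        if (completed) {
            return COMPLETED;
        }
        if (latencyMillis < waitMaxMillis / 2) {
            return WAITING;
        }
        if (latencyMillis < waitMaxMillis) {
            return WAITED_HALF;
        }
        return OVERDUE;
    }

    /** The overall state is the worst state of any checker. */
    static int worst(int... states) {
        int result = COMPLETED;
        for (int s : states) {
            result = Math.max(result, s);
        }
        return result;
    }
}
```

With a 60 s limit, a check pending for 10 s is WAITING, one pending for 30 s is already WAITED_HALF, and one pending for 60 s is OVERDUE. A single slow checker is enough to set the overall state.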
blockedCheckers = getBlockedCheckersLocked();
subject = describeCheckersLocked(blockedCheckers);
allowRestart = mAllowRestart;
It collects the blocked checkers, builds a description of them, and prepares to restart:
...
Process.killProcess(Process.myPid());
System.exit(10);
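The omitted middle part dumps the current system state, writes the logs to a file, and notifies various other parties. Finally, Process.killProcess(Process.myPid()) is called. Watchdog runs inside the system process, so the pid being killed is that of the system process itself, and System.exit(10) follows as a fallback in case the kill does not take effect. Once the system process is gone the framework is restarted, which the user sees as the phone rebooting.
That is roughly the whole Watchdog flow. I wrote it down to get my own thinking straight.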
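To tie the pieces together, here is a toy version of the scheduleCheckLocked / run pair in plain Java. It is only a sketch under assumptions: a single-thread ExecutorService stands in for the Handler and its Looper, the isPolling shortcut is left out, there is one monitor (an empty synchronized block on a watched lock), and ToyHandlerChecker and demo are hypothetical names, not Android APIs.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Toy checker: posts itself to a worker thread and probes one lock there. */
final class ToyHandlerChecker implements Runnable {
    private final ExecutorService thread; // stand-in for the Handler of the watched thread
    private final Object watchedLock;     // the lock a monitor would try to take
    private boolean completed = true;     // true means no check is in flight

    ToyHandlerChecker(ExecutorService thread, Object watchedLock) {
        this.thread = thread;
        this.watchedLock = watchedLock;
    }

    synchronized void scheduleCheck() {
        if (!completed) {
            return; // a check is already in flight
        }
        completed = false;
        thread.execute(this);
    }

    @Override
    public void run() {
        synchronized (watchedLock) { } // blocks for as long as someone else holds the lock
        synchronized (this) {
            completed = true;
        }
    }

    synchronized boolean isCompleted() {
        return completed;
    }

    /** Returns {completed while the lock was held, completed after it was released}. */
    static boolean[] demo() {
        final Object lock = new Object();
        final CountDownLatch held = new CountDownLatch(1);
        final CountDownLatch release = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            synchronized (lock) {
                held.countDown();
                try {
                    release.await();
                } catch (InterruptedException ignored) {
                    // Fall through and release the lock.
                }
            }
        });
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            holder.start();
            held.await();                  // the lock is now stuck
            ToyHandlerChecker checker = new ToyHandlerChecker(worker, lock);
            checker.scheduleCheck();
            Thread.sleep(100);
            boolean whileHeld = checker.isCompleted();
            release.countDown();           // "deadlock" resolved
            holder.join();
            worker.shutdown();
            worker.awaitTermination(5, TimeUnit.SECONDS);
            return new boolean[] {whileHeld, checker.isCompleted()};
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        } finally {
            worker.shutdownNow();
        }
    }

    public static void main(String[] args) {
        boolean[] r = demo();
        System.out.println("completed while lock held: " + r[0] + ", after release: " + r[1]);
    }
}
```

While the holder thread keeps the lock, the check stays incomplete, which is exactly the condition the real Watchdog times with mStartTime and mWaitMax.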