How Watchdog Works | Gaozhipeng's Blog

How Watchdog Works

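Linux has its own watchdog mechanism, but Android also implements a watchdog in user space to monitor the state of the system services running there.
Put simply, the whole idea is this: every monitored thread has to feed the dog at regular intervals, otherwise the dog eats the system.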

Watchdog initialization

Watchdog is a singleton, so there is not much to say about that. Its constructor looks like this:

    super("watchdog");
    // Initialize handler checkers for each common thread we want to check.  Note
    // that we are not currently checking the background thread, since it can
    // potentially hold longer running operations with no guarantees about the timeliness
    // of operations there.

    // The shared foreground thread is the main checker.  It is where we
    // will also dispatch monitor checks and do other work.
    mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
            "foreground thread", DEFAULT_TIMEOUT);
    mHandlerCheckers.add(mMonitorChecker);
    // Add checker for main thread.  We only do a quick check since there
    // can be UI running on the thread.
    mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
            "main thread", DEFAULT_TIMEOUT));
    // Add checker for shared UI thread.
    mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
            "ui thread", DEFAULT_TIMEOUT));
    // And also check IO thread.
    mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
            "i/o thread", DEFAULT_TIMEOUT));
    // And the display thread.
    mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
            "display thread", DEFAULT_TIMEOUT));

    // Initialize monitor for Binder threads.
    addMonitor(new BinderThreadMonitor());

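The source comments are clear. The constructor first calls the constructor of Thread, which Watchdog extends, then creates the HandlerChecker for the foreground thread (mMonitorChecker), adds checkers for the main, UI, I/O and display threads, and finally registers the Binder thread monitor on the foreground HandlerChecker.
Next comes the init method, which is called from SystemServer.java:

    watchdog.init(context, mActivityManagerService);

The init method itself: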

    mResolver = context.getContentResolver();
    mActivity = activity;

    context.registerReceiver(new RebootRequestReceiver(),
            new IntentFilter(Intent.ACTION_REBOOT),
            android.Manifest.permission.REBOOT, null);

It registers a broadcast receiver that listens for reboot requests (Intent.ACTION_REBOOT).

Running the watchdog

Watchdog extends Thread, so go straight to its run method:

    boolean waitedHalf = false;
    while (true) {
    ....
    }

It first initializes a boolean, waitedHalf, and then enters an infinite loop. Here is what happens inside the loop:

    final ArrayList<HandlerChecker> blockedCheckers;
    final String subject;
    final boolean allowRestart;
    int debuggerWasConnected = 0;

    long timeout = CHECK_INTERVAL;
    // Make sure we (re)spin the checkers that have become idle within
    // this wait-and-check interval
    for (int i=0; i<mHandlerCheckers.size(); i++) {
        HandlerChecker hc = mHandlerCheckers.get(i);
        hc.scheduleCheckLocked();
    }

    if (debuggerWasConnected > 0) {
        debuggerWasConnected--;
    }

    // NOTE: We use uptimeMillis() here because we do not want to increment the time we
    // wait while asleep. If the device is asleep then the thing that we are waiting
    // to timeout on is asleep as well and won't have a chance to run, causing a false
    // positive on when to kill things.
    long start = SystemClock.uptimeMillis();
    while (timeout > 0) {
        if (Debug.isDebuggerConnected()) {
            debuggerWasConnected = 2;
        }
        try {
            wait(timeout);
        } catch (InterruptedException e) {
            Log.wtf(TAG, e);
        }
        if (Debug.isDebuggerConnected()) {
            debuggerWasConnected = 2;
        }
        timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
    }

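blockedCheckers holds the HandlerCheckers that are blocked. timeout starts at CHECK_INTERVAL, which is 30 s, while DEFAULT_TIMEOUT is 60 s, so the loop runs once every half minute. In the for loop, scheduleCheckLocked is called on every HandlerChecker. In other words the dog says: "I'm hungry, come and feed me." Here is the implementation: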

    if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
        // If the target looper has recently been polling, then
        // there is no reason to enqueue our checker on it since that
        // is as good as it not being deadlocked.  This avoid having
        // to do a context switch to check the thread.  Note that we
        // only do this if mCheckReboot is false and we have no
        // monitors, since those would need to be executed at this point.
        mCompleted = true;
        return;
    }

    if (!mCompleted) {
        // we already have a check in flight, so no need
        return;
    }

    mCompleted = false;
    mCurrentMonitor = null;
    mStartTime = SystemClock.uptimeMillis();
    mHandler.postAtFrontOfQueue(this);

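The first check: if the checker has no monitors and the target thread's message queue is polling, that is, idle and waiting for new messages, the thread is clearly not blocked, so feeding is marked complete straight away. If there are monitors, or the handler is busy processing something, execution reaches the second check. mCompleted defaults to true, so on the first pass this early return is not taken. Everything after it starts a feeding round: mCompleted is set to false, the start time is recorded, and the checker posts itself to the front of the handler's queue. That brings us to HandlerChecker's run method: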

    final int size = mMonitors.size();
    for (int i = 0 ; i < size ; i++) {
        synchronized (Watchdog.this) {
            mCurrentMonitor = mMonitors.get(i);
        }
        mCurrentMonitor.monitor();
    }

    synchronized (Watchdog.this) {
        mCompleted = true;
        mCurrentMonitor = null;
    }
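The handshake between scheduleCheckLocked() and run() can be tried outside Android. What follows is only a minimal sketch under my own assumptions, not the AOSP code: a single-thread ExecutorService stands in for the monitored thread's Handler and Looper, and CheckerSketch, scheduleCheck and demo are made-up names. One real difference: the executor queues the check behind pending work, whereas postAtFrontOfQueue puts it at the head of the queue.

```java
import java.util.Arrays;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for HandlerChecker; not the Android class.
class CheckerSketch implements Runnable {
    private final ExecutorService worker;  // plays the monitored thread
    private boolean completed = true;      // true until a check is in flight

    CheckerSketch(ExecutorService worker) { this.worker = worker; }

    // Same shape as scheduleCheckLocked(): post a check unless one is pending.
    synchronized void scheduleCheck() {
        if (!completed) return;            // a check is already in flight
        completed = false;
        worker.execute(this);              // FIFO here, front of queue in Android
    }

    // Runs on the monitored thread; reaching this point proves it is alive.
    @Override
    public void run() {
        synchronized (this) { completed = true; }
    }

    synchronized boolean isCompleted() { return completed; }

    // Blocks until everything queued so far has been executed.
    private static void drain(ExecutorService worker) {
        try {
            worker.submit(() -> { }).get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Returns {healthy round, while the thread is stuck, after it recovers}.
    static boolean[] demo() {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        CheckerSketch checker = new CheckerSketch(worker);
        try {
            checker.scheduleCheck();
            drain(worker);
            boolean healthy = checker.isCompleted();

            CountDownLatch release = new CountDownLatch(1);
            worker.execute(() -> awaitQuietly(release)); // simulate a stuck thread
            checker.scheduleCheck();                     // queued behind the stuck task
            boolean whileStuck = checker.isCompleted();

            release.countDown();
            drain(worker);
            boolean afterRelease = checker.isCompleted();
            return new boolean[] {healthy, whileStuck, afterRelease};
        } finally {
            worker.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo())); // prints [true, false, true]
    }
}
```

While the worker is stuck the flag stays false, and that is all the watchdog thread needs to observe from the outside.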

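HandlerChecker.run() walks through every monitor attached to the checker and calls its monitor method. An example shows what monitor does: ActivityManagerService calls addMonitor to attach itself to the foreground HandlerChecker, so when that HandlerChecker runs, it calls ActivityManagerService's monitor method: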

    /** In this method we try to acquire our lock to make sure that we have not deadlocked */
    public void monitor() {
        synchronized (this) { }
    }
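The empty synchronized block works as a lock probe: monitor() returns only once the lock can be acquired. Here is a small self-contained sketch of that idea, with hypothetical names (LockProbeSketch, probe, demo) and a helper thread plus join timeout standing in for the watchdog's own timing:

```java
import java.util.Arrays;
import java.util.concurrent.CountDownLatch;

// Hypothetical service with a monitor() in the style shown above.
class LockProbeSketch {
    private final Object lock = new Object();

    // Returns only once the lock can be acquired.
    void monitor() {
        synchronized (lock) { }
    }

    // Runs monitor() on a helper thread and reports whether it came back in time.
    boolean probe(long timeoutMs) {
        Thread t = new Thread(this::monitor);
        t.setDaemon(true);
        t.start();
        try {
            t.join(timeoutMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !t.isAlive();
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Returns {probe with the lock free, probe while another thread holds it}.
    static boolean[] demo() {
        LockProbeSketch service = new LockProbeSketch();
        boolean free = service.probe(2000);

        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            synchronized (service.lock) {  // grab the lock and sit on it
                held.countDown();
                awaitQuietly(release);
            }
        });
        holder.setDaemon(true);
        holder.start();
        awaitQuietly(held);                // the lock is definitely taken now

        boolean contended = service.probe(200);
        release.countDown();               // let the holder finish
        return new boolean[] {free, contended};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo())); // prints [true, false]
    }
}
```

The probe cannot tell a true deadlock from a lock that is merely held for a long time, which is one reason a timeout has to be part of the picture.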

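ActivityManagerService's monitor method simply takes its own lock to see whether it is deadlocked. So the feeding process is: every 30 seconds Watchdog has each registered HandlerChecker run the monitor method of its monitors, which shows whether any thread is deadlocked. If nothing is deadlocked, mCompleted is of course true; if a check is still running, it is still false. Now back to the run loop: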

    long start = SystemClock.uptimeMillis();
    while (timeout > 0) {
        if (Debug.isDebuggerConnected()) {
            debuggerWasConnected = 2;
        }
        try {
            wait(timeout);
        } catch (InterruptedException e) {
            Log.wtf(TAG, e);
        }
        if (Debug.isDebuggerConnected()) {
            debuggerWasConnected = 2;
        }
        timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
    }
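The reason for recomputing timeout on every pass is that wait() may return early, for example after a notify or an interrupt. A rough sketch of the same pattern in plain Java follows. WaitLoopSketch, waitFullInterval and poke are names I made up, and System.nanoTime() is only an assumed stand-in for SystemClock.uptimeMillis():

```java
// Hypothetical illustration of a "wait the full interval" loop.
class WaitLoopSketch {
    private final Object lock = new Object();

    private static long elapsedMs(long startNanos) {
        return (System.nanoTime() - startNanos) / 1_000_000L;
    }

    // Blocks for at least intervalMs, re-arming the wait after any early wakeup.
    long waitFullInterval(long intervalMs) {
        long start = System.nanoTime();
        long remaining = intervalMs;
        synchronized (lock) {
            while (remaining > 0) {
                try {
                    lock.wait(remaining);
                } catch (InterruptedException e) {
                    // Keep waiting; the original only logs the exception here.
                }
                remaining = intervalMs - elapsedMs(start);
            }
        }
        return elapsedMs(start);
    }

    // Wakes the waiter early, like a stray notify would.
    void poke() {
        synchronized (lock) {
            lock.notifyAll();
        }
    }

    // Disturbs the waiter a few times and returns how long it really waited.
    static long demo() {
        WaitLoopSketch w = new WaitLoopSketch();
        Thread poker = new Thread(() -> {
            for (int i = 0; i < 5; i++) {
                try {
                    Thread.sleep(20);
                } catch (InterruptedException e) {
                    return;
                }
                w.poke();
            }
        });
        poker.setDaemon(true);
        poker.start();
        return w.waitFullInterval(200);
    }

    public static void main(String[] args) {
        System.out.println("waited ms: " + demo());
    }
}
```

Even with five early wakeups, the value returned by demo() is never below 200, because the loop only exits once the measured elapsed time has reached the interval.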

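Watchdog keeps its own time, counting from start. It waits 30 seconds (wait(timeout)) and, once it has made sure that the full 30 seconds really have elapsed, moves on to the next stage: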

    final int waitState = evaluateCheckerCompletionLocked();
    if (waitState == COMPLETED) {
        // The monitors have returned; reset
        waitedHalf = false;
        continue;
    } else if (waitState == WAITING) {
        // still waiting but within their configured intervals; back off and recheck
        continue;
    } else if (waitState == WAITED_HALF) {
        if (!waitedHalf) {
            // We've waited half the deadlock-detection interval.  Pull a stack
            // trace and wait another half.
            ArrayList<Integer> pids = new ArrayList<Integer>();
            pids.add(Process.myPid());
            ActivityManagerService.dumpStackTraces(true, pids, null, null,
                    NATIVE_STACKS_OF_INTEREST);
            waitedHalf = true;
        }
        continue;
    }

Having waited 30 seconds, Watchdog arrives here. It first determines whether feeding has finished, using evaluateCheckerCompletionLocked:

    state = Math.max(state, hc.getCompletionStateLocked());

getCompletionStateLocked looks like this:

    if (mCompleted) {
        return COMPLETED;
    } else {
        long latency = SystemClock.uptimeMillis() - mStartTime;
        if (latency < mWaitMax/2) {
            return WAITING;
        } else if (latency < mWaitMax) {
            return WAITED_HALF;
        }
    }
    return OVERDUE;

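If the check has completed, the state is COMPLETED and there is nothing more to say. In the branch shown earlier, COMPLETED resets waitedHalf to false and continues: within those 30 seconds every thread fed the dog successfully, so the loop moves on to its next round.
Otherwise the latency is the current time minus the time feeding started, that is, how long the monitor has been stuck on its lock. If it is less than half the configured timeout, we can keep waiting and continue. If it is at least half but still less than the timeout, 30 s have already gone by, so Watchdog does some work: it calls dumpStackTraces and sets waitedHalf to true.
Finally, if the latency has reached the configured timeout, the runnable we posted to the handler has gone unprocessed for 60 seconds, so the thread is judged to be blocked and the state is OVERDUE.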
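The state logic is easy to check in isolation. Below is a minimal sketch of it as pure functions. The numeric values of the four constants are my assumption; all the text above implies is that they increase in severity, since Math.max is used to pick the worst state across checkers. CompletionStateSketch, stateOf and evaluate are hypothetical names.

```java
// Hypothetical, Android-free version of the completion-state logic.
class CompletionStateSketch {
    // Assumed ordering: a larger value means a worse state.
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    // State of one checker, given how long its check has been pending.
    static int stateOf(boolean completed, long latencyMs, long waitMaxMs) {
        if (completed) {
            return COMPLETED;
        }
        if (latencyMs < waitMaxMs / 2) {
            return WAITING;
        }
        if (latencyMs < waitMaxMs) {
            return WAITED_HALF;
        }
        return OVERDUE;
    }

    // Worst state across all checkers, in the spirit of the Math.max line.
    static int evaluate(boolean[] completed, long[] latencyMs, long waitMaxMs) {
        int state = COMPLETED;
        for (int i = 0; i < completed.length; i++) {
            state = Math.max(state, stateOf(completed[i], latencyMs[i], waitMaxMs));
        }
        return state;
    }

    public static void main(String[] args) {
        long waitMax = 60_000L; // 60 s, as in the walkthrough
        System.out.println(stateOf(false, 10_000L, waitMax)); // prints 1 (WAITING)
        System.out.println(stateOf(false, 30_000L, waitMax)); // prints 2 (WAITED_HALF)
        System.out.println(stateOf(false, 60_000L, waitMax)); // prints 3 (OVERDUE)
    }
}
```

With a 60 s timeout, one checker stuck for 30 s is enough to push the overall state to WAITED_HALF even if every other checker has completed.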

If the state is OVERDUE, some thread has been stuck for more than 60 s, and it is time to eat the system:

    blockedCheckers = getBlockedCheckersLocked();
    subject = describeCheckersLocked(blockedCheckers);
    allowRestart = mAllowRestart;

It collects the blocked checkers, builds a description of them, and gets ready to restart:

    ...
    Process.killProcess(Process.myPid());
    System.exit(10);

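The omitted part dumps the current system state, writes the logs to files and notifies various other parties. Finally, Process.killProcess(Process.myPid()) sends SIGKILL to the current process, which here is the system process (system_server). System.exit(10) comes after it only as a fallback in case the process is somehow still running. Once system_server dies, the Android runtime is restarted, which is what the user experiences as the phone rebooting.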

That is roughly the whole Watchdog flow. I wrote it down to get it straight in my own head.