watchdog 原理

Linux是有watchdog机制,但是Android在用户空间中也实现了自己的watchdog机制,来监控用户空间中系统服务的状态。
简单的说整个原理就是:需要监控的Thread必须要定时去喂狗,否则狗就会吃掉系统。

watchdog的初始化操作

watchdog是个单例模式,这里就不多说。在watchdog的构造方法里,可以看到

    super("watchdog");
    // Initialize handler checkers for each common thread we want to check.  Note
    // that we are not currently checking the background thread, since it can
    // potentially hold longer running operations with no guarantees about the timeliness
    // of operations there.

    // The shared foreground thread is the main checker.  It is where we
    // will also dispatch monitor checks and do other work.
    mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
            "foreground thread", DEFAULT_TIMEOUT);
    mHandlerCheckers.add(mMonitorChecker);
    // Add checker for main thread.  We only do a quick check since there
    // can be UI running on the thread.
    mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
            "main thread", DEFAULT_TIMEOUT));
    // Add checker for shared UI thread.
    mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
            "ui thread", DEFAULT_TIMEOUT));
    // And also check IO thread.
    mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
            "i/o thread", DEFAULT_TIMEOUT));
    // And the display thread.
    mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
            "display thread", DEFAULT_TIMEOUT));

    // Initialize monitor for Binder threads.
    addMonitor(new BinderThreadMonitor());

源码注释很清楚,首先初始化watchdog这个继承了Thread构造方法,添加进foregroundhandlerchecker,然后添加进main,ui,io,display这些线程,最后在foregroundhandlerchecker中添加了binder的monitor。
然后看一下init方法,该方法是在SystemServer.java中进行调用的

 watchdog.init(context, mActivityManagerService);

再看下init方法:

    mResolver = context.getContentResolver();
    mActivity = activity;

    context.registerReceiver(new RebootRequestReceiver(),
            new IntentFilter(Intent.ACTION_REBOOT),
            android.Manifest.permission.REBOOT, null);

注册了一个重启手机的广播监听。

watchdog的运行

watchdog是继承自Thread的,所以直接看run方法:

    boolean waitedHalf = false;
    while (true) {
    ....
    }

首先初始化一个 等待一半的bool值,然后进行一个无限循环操作,看看无限循环操作中做了什么:

final ArrayList<HandlerChecker> blockedCheckers;
final String subject;
final boolean allowRestart;
int debuggerWasConnected = 0;

    long timeout = CHECK_INTERVAL;
    // Make sure we (re)spin the checkers that have become idle within
    // this wait-and-check interval
    for (int i=0; i<mHandlerCheckers.size(); i++) {
        HandlerChecker hc = mHandlerCheckers.get(i);
        hc.scheduleCheckLocked();
    }

    if (debuggerWasConnected > 0) {
        debuggerWasConnected--;
    }

    // NOTE: We use uptimeMillis() here because we do not want to increment the time we
    // wait while asleep. If the device is asleep then the thing that we are waiting
    // to timeout on is asleep as well and won't have a chance to run, causing a false
    // positive on when to kill things.
    long start = SystemClock.uptimeMillis();
    while (timeout > 0) {
        if (Debug.isDebuggerConnected()) {
            debuggerWasConnected = 2;
        }
        try {
            wait(timeout);
        } catch (InterruptedException e) {
            Log.wtf(TAG, e);
        }
        if (Debug.isDebuggerConnected()) {
            debuggerWasConnected = 2;
        }
        timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
    }

blockedCheckers用来保存被阻塞的handlerchecker,timeout是30s,defaulttimeout是60s,意思就是每半分钟循环一次。 然后可以看到,for循环中每个handlerchecker调用schedulechecklocked方法,简单说就是,狗说,我饿了,你们快喂我。看代码实现:

if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
    // If the target looper has recently been polling, then
    // there is no reason to enqueue our checker on it since that
    // is as good as it not being deadlocked.  This avoid having
    // to do a context switch to check the thread.  Note that we
    // only do this if mCheckReboot is false and we have no
    // monitors, since those would need to be executed at this point.
    mCompleted = true;
    return;
}

if (!mCompleted) {
    // we already have a check in flight, so no need
    return;
}

mCompleted = false;
mCurrentMonitor = null;
mStartTime = SystemClock.uptimeMillis();
mHandler.postAtFrontOfQueue(this);

首先,第一个判断,monitor如果没有,并且当前线程的queue正在不停循环,等待新的message进入,那么,可以确保该线程没有阻塞,于是直接设置喂狗完成。而当monitor不为0,或者handler正在处理东西的时候,就会进入第二个判断,默认是true,所以第一次这就不会进入,这时候,进入到下面就代表开始喂狗,设置完成false,设置开始喂狗时间,然后把自己丢给这个handler,于是就可以看handlerchecker的run方法:

final int size = mMonitors.size();
for (int i = 0 ; i < size ; i++) {
    synchronized (Watchdog.this) {
        mCurrentMonitor = mMonitors.get(i);
    }
    mCurrentMonitor.monitor();
}

synchronized (Watchdog.this) {
    mCompleted = true;
    mCurrentMonitor = null;
}

获取绑定在身上所有的monitor,然后执行monitor方法,用一个例子来说明monitor方法是啥吧,activitymanagerservice通过调用addmonitor方法,将自己绑到foregroundhandlerchecker的身上,这个handlercheckerrun的时候会执行activitymanagerservice的monitor方法:

/** In this method we try to acquire our lock to make sure that we have not deadlocked */
public void monitor() {
    synchronized (this) { }
}

看,就是简单的锁一下自己,看看自己是否死锁了。 喂狗过程就是:每隔30秒,watchdog让注册在身上的handlerchecker去执行monitors的monitor方法,就是看看每个线程是否有死锁现象,如果没有死锁现象,当然mcomplete就是true,如果有还在执行的,当然就还是false了,下面接着看:

long start = SystemClock.uptimeMillis();
while (timeout > 0) {
    if (Debug.isDebuggerConnected()) {
        debuggerWasConnected = 2;
    }
    try {
        wait(timeout);
    } catch (InterruptedException e) {
        Log.wtf(TAG, e);
    }
    if (Debug.isDebuggerConnected()) {
        debuggerWasConnected = 2;
    }
    timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
}

watchdog会自己计时,start的时候开始的,然后等待30秒(wait(timeout)),最后确保确实是等待了30秒就可以进入下面的阶段:

final int waitState = evaluateCheckerCompletionLocked();
if (waitState == COMPLETED) {
    // The monitors have returned; reset
    waitedHalf = false;
    continue;
} else if (waitState == WAITING) {
    // still waiting but within their configured intervals; back off and recheck
    continue;
} else if (waitState == WAITED_HALF) {
    if (!waitedHalf) {
        // We've waited half the deadlock-detection interval.  Pull a stack
        // trace and wait another half.
        ArrayList<Integer> pids = new ArrayList<Integer>();
        pids.add(Process.myPid());
        ActivityManagerService.dumpStackTraces(true, pids, null, null,
                NATIVE_STACKS_OF_INTEREST);
        waitedHalf = true;
    }
    continue;
}

前面watchdog已经等待了30秒,于是来到了这里,首先确定喂食是否完成 evaluatecheckercompletionlocked:

 state = Math.max(state, hc.getCompletionStateLocked());

getCompletionStateLocked如下:

if (mCompleted) {
    return COMPLETED;
} else {
    long latency = SystemClock.uptimeMillis() - mStartTime;
    if (latency < mWaitMax/2) {
        return WAITING;
    } else if (latency < mWaitMax) {
        return WAITED_HALF;
    }
}
return OVERDUE;

如果完成了,那么就是complete没话讲,看到后面条件,如果是complete的,那么就把waithalf设置false,然后continue,说明30秒,所有线程喂狗成功,那么进入下一次轮回。
如果当前时间减去开始喂狗时间——monitor拿住锁的时间,如果小于设置时间的一半,那么我们还可以继续等,continue,但是如果大于一半,而不小于设置时间,那么说明我们已经走了30s所以左一些操作,dumpstacktraces,和设置等了一半为true。
如果到了最后,喂狗时间大于了设置时间,意思就是我们post的handler,已经60秒没有进行处理了,就判断该线程已经被阻塞,状态是overdue。

如果是overdue,那么就可以判定,有线程死锁超过了60s,是时候吃掉系统了。:

blockedCheckers = getBlockedCheckersLocked();
subject = describeCheckersLocked(blockedCheckers);
allowRestart = mAllowRestart;

获取被锁死的线程,描述这些线程,准备重启:

...
Process.killProcess(Process.myPid());
System.exit(10);

中间省略了dump当前系统状态,将log写入到文件中,通知其他各种,最后,killprocess()此时虽然pid是在系统进程中,但是不会重启,只有调用了system.exit后才会重启手机。

Watchdog的流程差不多就是这样。整理下来用于理清思路。