
ThreadMon: The Tool Itself
Existing Tools
Traditional performance debuggers (e.g. call profilers) offer little
help; simply knowing where a thread spends its time does not aid in
analyzing the model. While postmortem tracing tools such as tnfview
allow some performance analysis of specific programs, they offer
little insight into the effectiveness of the model itself. Moreover,
the sheer volume of data they generate makes it difficult to spot
detrimental anomalous performance behavior.
If existing tools make the model difficult to evaluate on a
uniprocessor, they make its evaluation on a multiprogrammed
multiprocessor virtually impossible. This kind of analysis requires
runtime correlation of thread, LWP, and CPU behavior. To this end, we
implemented ThreadMon, a tool that graphically displays the runtime
interactions in the Solaris implementation of the many-to-many
threads model.
ThreadMon Overview
ThreadMon displays runtime information for each user-level thread, LWP
and CPU. It provides not only the state information for each
of these computational elements, but also the mappings
between them: which thread is running on which LWP, and which LWP is
running on which CPU. Thus, to a large degree, one can watch the
scheduling decisions made by both the user-level threads package and
the kernel, and view how those decisions affect thread state, LWP
state, and most importantly, CPU usage. We have been able to use this
tool effectively to analyze the decisions made by the many-to-many
model.
Features
ThreadMon can display a variety of information about a multithreaded
program. The thread display shows what each thread is doing: the
percentage of time each thread has spent in the various user-thread
states during the last three seconds. The LWP display shows the
percentage of time each LWP spends in the various LWP (kernel)
states, and the CPU display details how the CPUs are spending their
time. ThreadMon can also display the synchronization primitives
discovered in a program. In one example program, these were found in
three modules (atexit, main, and drand48); expanding these modules
reveals more detail, showing synchronization primitives first used in
the routines fopen and drand48. The programmer has source code for
neither of these routines, so no names are available for their
synchronization primitives. Finally, ThreadMon can display the
mapping of threads to LWPs at one moment of a program's execution.
Implementation Details
To minimize probe effects, we did not want to display runtime data on
the same machine as the monitored program. Thus, ThreadMon consists of
two discrete parts: a library side that gathers data in the
monitored program, and a remote display side that presents
the data graphically.
In order to allow monitoring of arbitrary binaries, the library side
is implemented as a shared library. Thus, to monitor a program, the
user sets the LD_PRELOAD environment variable to point to the
ThreadMon library. This forces ThreadMon to be loaded before other
shared libraries. Once loaded, ThreadMon connects to the remote
display side and lets the program proceed. As the program runs,
ThreadMon wakes up every 10 milliseconds, gathers data, and forwards
that data to the display side. The monitor runs as a separate thread
bound to its own LWP. The gathering of data at the 10-millisecond
rate requires approximately ten percent of one CPU on a four-processor
40-MHz SPARCstation 10. In practice, we have found that this probe
effect is not significant enough to drastically change a program's
performance characteristics. However, for the skeptical, a nice fringe
benefit of ThreadMon is its ability to monitor itself: by examining
the thread and LWP which ThreadMon uses, the probe effect can be
measured.
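
As a concrete illustration, the following is a minimal sketch of how
such a preloaded monitor might be put together with the Solaris
threads API. The helper routines connect_to_display() and
sample_and_send() are hypothetical stand-ins for ThreadMon's own
data-gathering and wire code, and the #pragma init idiom assumes the
Sun C compiler.

    /*
     * Minimal sketch of a preloaded monitor library (hypothetical
     * helper names; not ThreadMon's actual interface).
     */
    #include <thread.h>        /* thr_create(), THR_BOUND, THR_DAEMON */
    #include <poll.h>

    extern int  connect_to_display(void);  /* assumed: socket to remote UI */
    extern void sample_and_send(int fd);   /* assumed: gather, ship one sample */

    static int display_fd;

    static void *
    monitor_loop(void *arg)
    {
        for (;;) {
            sample_and_send(display_fd);
            (void) poll(NULL, 0, 10);      /* sleep roughly 10 milliseconds */
        }
        /* NOTREACHED */
        return (NULL);
    }

    /* Run by the runtime linker when the library is loaded (Sun C idiom). */
    #pragma init(threadmon_init)
    static void
    threadmon_init(void)
    {
        display_fd = connect_to_display();

        /*
         * THR_BOUND gives the monitor its own LWP, so its probe effect
         * shows up in the display; THR_DAEMON keeps it from holding
         * the process open at exit.
         */
        (void) thr_create(NULL, 0, monitor_loop, NULL,
            THR_BOUND | THR_DAEMON, NULL);
    }

With the library built as, say, libthreadmon.so, an unmodified binary
can then be monitored by running it as LD_PRELOAD=./libthreadmon.so a.out.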
ThreadMon uses several OS services to perform data gathering:
- Interpositioning. The most important data is gathered by
interpositioning the library between the user-level threads library
and itself; that is, ThreadMon redefines many of the functions that
the user-level threads library uses internally to change the state of
threads and LWPs, recording each call before forwarding it (a sketch
appears after this list).
- Process file system [Faulkner and Gomes 1991]. The
/proc file system offers a wealth of performance
information. Specifically, the PIOCLUSAGE ioctl is used to determine
LWP states (a sketch appears after this list).
- Kernel statistics interface. The kstat interface is used
to obtain CPU usage statistics (a sketch appears after this list).
- Trace Normal Form. Unfortunately, there is no existing
operating-system service to determine the mappings between LWPs and
CPUs. To get this information, we used the TNF kernel probes present
in Solaris 2.5 and extrapolated the mapping information. For a variety
of reasons, this extrapolation is extremely expensive. The TNF
monitoring is off by default; when it is turned on, ThreadMon
typically consumes fifty percent of one CPU on a four-processor
SPARCstation 10.
- Mmapping of /dev/kmem. For some statistics, we have found
it significantly faster to delve straight into kernel memory (a
sketch appears after this list).
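
A sketch of the interpositioning technique follows. The interposed
symbol _thread_state_change and the bookkeeping hook
record_state_change() are hypothetical examples; the functions
ThreadMon actually redefines are internal to the threads library.

    /*
     * Sketch of interpositioning. Because the preloaded library comes
     * first in the link order, its definition of a symbol shadows the
     * one in libthread; dlsym(RTLD_NEXT, ...) then finds the original.
     */
    #include <dlfcn.h>

    extern void record_state_change(int);  /* assumed bookkeeping hook */

    void
    _thread_state_change(int new_state)    /* hypothetical symbol name */
    {
        static void (*real_fn)(int);

        if (real_fn == NULL)               /* find the next (real) definition */
            real_fn = (void (*)(int))dlsym(RTLD_NEXT,
                "_thread_state_change");

        record_state_change(new_state);    /* note the transition */
        (*real_fn)(new_state);             /* forward to libthread */
    }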
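Next, a sketch of per-LWP sampling through the ioctl-based /proc of
Solaris 2.x. The PIOCSTATUS/PIOCLUSAGE sequence, the buffer sizing,
and the fields printed reflect one reading of proc(4) rather than
ThreadMon's exact code.

    /*
     * Sketch of per-LWP sampling via the ioctl-based /proc.
     * PIOCSTATUS yields the LWP count; PIOCLUSAGE then fills a
     * prusage array whose first element (by one reading of proc(4))
     * summarizes defunct LWPs, followed by one entry per live LWP.
     */
    #include <sys/types.h>
    #include <sys/procfs.h>
    #include <sys/ioctl.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>
    #include <stdlib.h>

    void
    sample_lwps(pid_t pid)
    {
        char path[64];
        prstatus_t st;
        prusage_t *use;
        int fd, i;

        (void) sprintf(path, "/proc/%d", (int)pid);
        if ((fd = open(path, O_RDONLY)) == -1)
            return;
        if (ioctl(fd, PIOCSTATUS, &st) != -1) {
            use = malloc((st.pr_nlwp + 1) * sizeof (prusage_t));
            if (use != NULL && ioctl(fd, PIOCLUSAGE, use) != -1) {
                for (i = 1; i <= st.pr_nlwp; i++)  /* skip defunct summary */
                    (void) printf("lwp %d: usr %lds sys %lds slp %lds\n",
                        (int)use[i].pr_lwpid,
                        (long)use[i].pr_utime.tv_sec,
                        (long)use[i].pr_stime.tv_sec,
                        (long)use[i].pr_slptime.tv_sec);
            }
            free(use);
        }
        (void) close(fd);
    }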
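A sketch of CPU sampling through libkstat follows; it walks the kstat
chain looking for the raw cpu_stat entries. A real monitor would
difference successive samples to obtain the percentages that
ThreadMon displays.

    /*
     * Sketch of CPU-usage sampling via the kernel statistics
     * interface (compile with -lkstat). The raw "cpu_stat" kstats
     * carry per-CPU tick counters.
     */
    #include <kstat.h>
    #include <sys/sysinfo.h>
    #include <string.h>
    #include <stdio.h>

    void
    sample_cpus(void)
    {
        kstat_ctl_t *kc;
        kstat_t *ksp;
        cpu_stat_t cs;

        if ((kc = kstat_open()) == NULL)
            return;
        for (ksp = kc->kc_chain; ksp != NULL; ksp = ksp->ks_next) {
            if (strcmp(ksp->ks_module, "cpu_stat") != 0)
                continue;
            if (kstat_read(kc, ksp, &cs) == -1)
                continue;
            (void) printf("cpu %d: idle %lu usr %lu sys %lu wt %lu\n",
                ksp->ks_instance,
                (ulong_t)cs.cpu_sysinfo.cpu[CPU_IDLE],
                (ulong_t)cs.cpu_sysinfo.cpu[CPU_USER],
                (ulong_t)cs.cpu_sysinfo.cpu[CPU_KERNEL],
                (ulong_t)cs.cpu_sysinfo.cpu[CPU_WAIT]);
        }
        (void) kstat_close(kc);
    }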
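Finally, a sketch of the /dev/kmem approach, assuming the traditional
route of resolving a kernel symbol's address with nlist(3) against
/dev/ksyms and mapping the page that contains it; the details here
are illustrative only, not ThreadMon's exact code.

    /*
     * Sketch of reading a kernel variable by mapping /dev/kmem.
     * The caller supplies the kernel symbol name of interest.
     */
    #include <nlist.h>
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>

    int
    read_kernel_int(char *symbol, int *valp)
    {
        struct nlist nl[2];
        long pagesz = sysconf(_SC_PAGESIZE);
        off_t addr, base;
        char *page;
        int fd, ret = -1;

        nl[0].n_name = symbol;
        nl[1].n_name = NULL;
        if (nlist("/dev/ksyms", nl) != 0 || nl[0].n_value == 0)
            return (-1);
        if ((fd = open("/dev/kmem", O_RDONLY)) == -1)
            return (-1);

        addr = (off_t)nl[0].n_value;
        base = addr & ~((off_t)pagesz - 1);   /* mmap needs page alignment */
        page = mmap(NULL, (size_t)pagesz, PROT_READ, MAP_SHARED, fd, base);
        if (page != MAP_FAILED) {
            *valp = *(int *)(page + (addr - base));
            (void) munmap(page, (size_t)pagesz);
            ret = 0;
        }
        (void) close(fd);
        return (ret);
    }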
The text of this web document was taken from a paper by Bryan M.
Cantrill and Thomas W. Doeppner Jr., Department of Computer Science,
Brown University, Providence, RI 02912-1910.