
ThreadMon: The Tool Itself
Existing Tools
Traditional performance debuggers (e.g. call profilers) offer little
help; simply knowing where a thread spends its time does not aid in
analyzing the model. While postmortem tracing tools such as tnfview
allow some performance analysis of specific programs, they offer
little insight into the effectiveness of the model itself. Moreover,
the sheer volume of data they generate makes it difficult to spot
detrimental anomalous performance behavior.
If existing tools make the model difficult to evaluate on a
uniprocessor, they make its evaluation on a multiprogrammed
multiprocessor virtually impossible. This kind of analysis requires
runtime correlation of thread, LWP, and CPU behavior. To this end, we
implemented ThreadMon, a tool that graphically displays the runtime
interactions in the Solaris implementation of the many-to-many
threads model.
ThreadMon Overview
ThreadMon displays runtime information for each user-level thread, LWP
and CPU. It provides not only the state information for each
of these computational elements, but also the mappings
between them: which thread is running on which LWP, and which LWP is
running on which CPU. Thus, to a large degree, one can watch the
scheduling decisions made by both the user-level threads package and
the kernel, and view how those decisions affect thread state, LWP
state, and most importantly, CPU usage. We have been able to use this
tool effectively to analyze the decisions made by the many-to-many
model.
Features
ThreadMon can display a variety of information about a multithreaded
program. The thread display shows what each thread is doing: the
percentage of time each thread has spent in the various user-thread
states during the last three seconds. The LWP display shows the
percentage of time each LWP spends in the various LWP (kernel)
states, and the CPU display details how the CPUs are spending their
time. ThreadMon can also display the synchronization primitives
discovered in a program. In one example program, these were found in
three modules (atexit, main, and drand48); expanding these modules
reveals more detail, showing synchronization primitives first used in
the routines fopen and drand48. The programmer has source code for
neither of these routines, so no names are available for their
synchronization primitives. Finally, ThreadMon can display the
mapping of threads to LWPs at one moment of a program's execution.
Implementation Details
To minimize probe effects, we did not want to display runtime data on
the same machine as the monitored program. Thus, ThreadMon consists of
two discrete parts: a library side that gathers data in the
monitored program, and a remote display side that presents
the data graphically.
In order to allow monitoring of arbitrary binaries, the library side
is implemented as a shared library. Thus, to monitor a program, the
user sets the LD_PRELOAD environment variable to point to the
ThreadMon library. This forces ThreadMon to be loaded before other
shared libraries. Once loaded, ThreadMon connects to the remote
display side and lets the program proceed. As the program runs,
ThreadMon wakes up every 10 milliseconds, gathers data, and forwards
that data to the display side. The monitor runs as a separate thread
bound to its own LWP. The gathering of data at the 10-millisecond
rate requires approximately ten percent of one CPU on a four-processor
40-MHz SPARCstation 10. In practice, we have found that this probe
effect is not significant enough to drastically change a program's
performance characteristics. However, for the skeptical, a nice fringe
benefit of ThreadMon is its ability to monitor itself: by examining
the thread and LWP which ThreadMon uses, the probe effect can be
measured.
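
As a concrete illustration, the following is a minimal sketch of how
such a preloaded monitor might be put together with the Solaris
threads API. The helper routines connect_to_display() and
sample_and_send() are hypothetical stand-ins for ThreadMon's own
data-gathering and wire code, and the #pragma init idiom assumes the
Sun C compiler.

    /*
     * Minimal sketch of a preloaded monitor library (hypothetical
     * helper names; not ThreadMon's actual interface).
     */
    #include <thread.h>        /* thr_create(), THR_BOUND, THR_DAEMON */
    #include <poll.h>

    extern int  connect_to_display(void);  /* assumed: socket to remote UI */
    extern void sample_and_send(int fd);   /* assumed: gather, ship one sample */

    static int display_fd;

    static void *
    monitor_loop(void *arg)
    {
        for (;;) {
            sample_and_send(display_fd);
            (void) poll(NULL, 0, 10);      /* sleep roughly 10 milliseconds */
        }
        /* NOTREACHED */
        return (NULL);
    }

    /* Run by the runtime linker when the library is loaded (Sun C idiom). */
    #pragma init(threadmon_init)
    static void
    threadmon_init(void)
    {
        display_fd = connect_to_display();

        /*
         * THR_BOUND gives the monitor its own LWP, so its probe effect
         * shows up in the display; THR_DAEMON keeps it from holding
         * the process open at exit.
         */
        (void) thr_create(NULL, 0, monitor_loop, NULL,
            THR_BOUND | THR_DAEMON, NULL);
    }

With the library built as, say, libthreadmon.so, an unmodified binary
can then be monitored by running it as LD_PRELOAD=./libthreadmon.so a.out.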
ThreadMon uses several OS services to perform data gathering:
- Interpositioning. The most important data is gathered by
interpositioning the library between the user-level threads library
and itself; that is, ThreadMon redefines many of the functions that
the user-level threads library uses internally to change the state of
threads and LWPs, recording each call before forwarding it (a sketch
appears after this list).
- Process file system [Faulkner and Gomes 1991]. The
/proc file system offers a wealth of performance
information. Specifically, the PIOCLUSAGE ioctl is used to determine
LWP states (a sketch appears after this list).
- Kernel statistics interface. The kstat interface is used
to obtain CPU usage statistics (a sketch appears after this list).
- Trace Normal Form. Unfortunately, there is no existing
operating-system service to determine the mappings between LWPs and
CPUs. To get this information, we used the TNF kernel probes present
in Solaris 2.5 and extrapolated the mapping information. For a variety
of reasons, this extrapolation is extremely expensive. The TNF
monitoring is off by default; when it is turned on, ThreadMon
typically consumes fifty percent of one CPU on a four-processor
SPARCstation 10.
- Mmapping of /dev/kmem. For some statistics, we have found
it significantly faster to delve straight into kernel memory (a
sketch appears after this list).
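
A sketch of the interpositioning technique follows. The interposed
symbol _thread_state_change and the bookkeeping hook
record_state_change() are hypothetical examples; the functions
ThreadMon actually redefines are internal to the threads library.

    /*
     * Sketch of interpositioning. Because the preloaded library comes
     * first in the link order, its definition of a symbol shadows the
     * one in libthread; dlsym(RTLD_NEXT, ...) then finds the original.
     */
    #include <dlfcn.h>

    extern void record_state_change(int);  /* assumed bookkeeping hook */

    void
    _thread_state_change(int new_state)    /* hypothetical symbol name */
    {
        static void (*real_fn)(int);

        if (real_fn == NULL)               /* find the next (real) definition */
            real_fn = (void (*)(int))dlsym(RTLD_NEXT,
                "_thread_state_change");

        record_state_change(new_state);    /* note the transition */
        (*real_fn)(new_state);             /* forward to libthread */
    }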
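Next, a sketch of per-LWP sampling through the ioctl-based /proc of
Solaris 2.x. The PIOCSTATUS/PIOCLUSAGE sequence, the buffer sizing,
and the fields printed reflect one reading of proc(4) rather than
ThreadMon's exact code.

    /*
     * Sketch of per-LWP sampling via the ioctl-based /proc.
     * PIOCSTATUS yields the LWP count; PIOCLUSAGE then fills a
     * prusage array whose first element (by one reading of proc(4))
     * summarizes defunct LWPs, followed by one entry per live LWP.
     */
    #include <sys/types.h>
    #include <sys/procfs.h>
    #include <sys/ioctl.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>
    #include <stdlib.h>

    void
    sample_lwps(pid_t pid)
    {
        char path[64];
        prstatus_t st;
        prusage_t *use;
        int fd, i;

        (void) sprintf(path, "/proc/%d", (int)pid);
        if ((fd = open(path, O_RDONLY)) == -1)
            return;
        if (ioctl(fd, PIOCSTATUS, &st) != -1) {
            use = malloc((st.pr_nlwp + 1) * sizeof (prusage_t));
            if (use != NULL && ioctl(fd, PIOCLUSAGE, use) != -1) {
                for (i = 1; i <= st.pr_nlwp; i++)  /* skip defunct summary */
                    (void) printf("lwp %d: usr %lds sys %lds slp %lds\n",
                        (int)use[i].pr_lwpid,
                        (long)use[i].pr_utime.tv_sec,
                        (long)use[i].pr_stime.tv_sec,
                        (long)use[i].pr_slptime.tv_sec);
            }
            free(use);
        }
        (void) close(fd);
    }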
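A sketch of CPU sampling through libkstat follows; it walks the kstat
chain looking for the raw cpu_stat entries. A real monitor would
difference successive samples to obtain the percentages that
ThreadMon displays.

    /*
     * Sketch of CPU-usage sampling via the kernel statistics
     * interface (compile with -lkstat). The raw "cpu_stat" kstats
     * carry per-CPU tick counters.
     */
    #include <kstat.h>
    #include <sys/sysinfo.h>
    #include <string.h>
    #include <stdio.h>

    void
    sample_cpus(void)
    {
        kstat_ctl_t *kc;
        kstat_t *ksp;
        cpu_stat_t cs;

        if ((kc = kstat_open()) == NULL)
            return;
        for (ksp = kc->kc_chain; ksp != NULL; ksp = ksp->ks_next) {
            if (strcmp(ksp->ks_module, "cpu_stat") != 0)
                continue;
            if (kstat_read(kc, ksp, &cs) == -1)
                continue;
            (void) printf("cpu %d: idle %lu usr %lu sys %lu wt %lu\n",
                ksp->ks_instance,
                (ulong_t)cs.cpu_sysinfo.cpu[CPU_IDLE],
                (ulong_t)cs.cpu_sysinfo.cpu[CPU_USER],
                (ulong_t)cs.cpu_sysinfo.cpu[CPU_KERNEL],
                (ulong_t)cs.cpu_sysinfo.cpu[CPU_WAIT]);
        }
        (void) kstat_close(kc);
    }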
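Finally, a sketch of the /dev/kmem approach, assuming the traditional
route of resolving a kernel symbol's address with nlist(3) against
/dev/ksyms and mapping the page that contains it; the details here
are illustrative only, not ThreadMon's exact code.

    /*
     * Sketch of reading a kernel variable by mapping /dev/kmem.
     * The caller supplies the kernel symbol name of interest.
     */
    #include <nlist.h>
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>

    int
    read_kernel_int(char *symbol, int *valp)
    {
        struct nlist nl[2];
        long pagesz = sysconf(_SC_PAGESIZE);
        off_t addr, base;
        char *page;
        int fd, ret = -1;

        nl[0].n_name = symbol;
        nl[1].n_name = NULL;
        if (nlist("/dev/ksyms", nl) != 0 || nl[0].n_value == 0)
            return (-1);
        if ((fd = open("/dev/kmem", O_RDONLY)) == -1)
            return (-1);

        addr = (off_t)nl[0].n_value;
        base = addr & ~((off_t)pagesz - 1);   /* mmap needs page alignment */
        page = mmap(NULL, (size_t)pagesz, PROT_READ, MAP_SHARED, fd, base);
        if (page != MAP_FAILED) {
            *valp = *(int *)(page + (addr - base));
            (void) munmap(page, (size_t)pagesz);
            ret = 0;
        }
        (void) close(fd);
        return (ret);
    }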
The text of this web document was taken from a paper by Bryan M.
Cantrill and Thomas W. Doeppner Jr., Department of Computer Science,
Brown University, Providence, RI 02912-1910.