So I promised a while ago to describe my systemtap experience.
My initial intent was to use dtrace, and as it isn’t available on Linux, I spent some time playing with OpenSolaris. In the course of my web searches I stumbled across Systemtap, which is supposed to provide equivalent and sometimes better functionality. I won’t comment on that, because as soon as I found it, I abandoned dtrace. This is nothing against dtrace, but my primary development platform is Linux.
The task I was tackling was investigating the performance of my ARM4 agent. In particular, I wanted it to scale better on machines with multiple cores. Conventional debuggers don’t really help with this, and profilers only give you part of the picture… especially when multiple processes are involved. One question I had in particular was how much time I was spending waiting on mutual exclusions.
So I started playing with the example scripts. Overall it’s pretty easy. In fact, I probably spent as much time setting up a good test scenario as I did developing my systemtap scripts. Their examples for monitoring IPCs are pretty good. Not stellar, but good.
I got a few surprises. First, I was astonished to see how much time I was spending copying memory through message queues. This is by far the bulk of the time expended on a 2-core Opteron system. Second, I was astonished at just how difficult it is to monitor mutual exclusions!
Linux uses an exclusion mechanism called futexes. User space memory is used for the exclusion and the operations are fast. Unfortunately, at least with the version I was running on RHEL 5.2, you get no visibility into the user symbol tables. So your futex is displayed as a memory location. That’s pretty useless. I had to go back and modify my code to print all the relevant memory addresses just so I could see which one is which! This was a lot of work that definitely flies in the face of the non-intrusive measurement philosophy.
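To make that workaround concrete, here is a minimal sketch of the kind of address dump I mean. The mutex names (queue_lock, stats_lock) and both helper functions are hypothetical stand-ins, not the agent's real code. It also assumes the futex word lives inside the pthread_mutex_t object, so the address the script reports should land at or just past the address printed here; that depends on the pthread implementation.

```c
#include <pthread.h>
#include <stdio.h>

/* Hypothetical locks standing in for the agent's real mutexes. */
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t stats_lock = PTHREAD_MUTEX_INITIALIZER;

/* Format "lock <name> at <address>" into buf.
 * Returns the value snprintf returns (length that was needed). */
static int describe_lock(char *buf, size_t len, const char *name,
                         const pthread_mutex_t *m)
{
    return snprintf(buf, len, "lock %s at %p", name, (const void *)m);
}

/* Print one line per known lock so the hex addresses in the
 * systemtap report can be matched back to names by hand. */
static void dump_lock_addresses(FILE *out)
{
    char line[128];

    describe_lock(line, sizeof line, "queue_lock", &queue_lock);
    fprintf(out, "%s\n", line);

    describe_lock(line, sizeof line, "stats_lock", &stats_lock);
    fprintf(out, "%s\n", line);
}
```

Calling dump_lock_addresses(stderr) once at startup gives a small lookup table to compare against the script's output. It works, but it means touching the program being measured, which is exactly the intrusion I was hoping to avoid.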
Overall, I’d rate the experience satisfactory, but systemtap still needs improvement.
I found a few areas where I could tweak my code and get significant performance improvements. I saw some areas where the architecture could be improved, but that’s another problem for another day.
I see rumours on the web of user space taps. They may exist already, but if they’re in the version I used, it’s unclear how to use them. A far more important task for systemtap developers, in my mind, would be reading program symbol tables. Knowing a futex is generating a hot spot is useless to me if I can’t find where the futex is!
Throughout this process, I’ve been amazed at how “almost” the development tools on Linux are. CodeAnalyst (oprofile) and Systemtap are both tools I need, and both come up short. As good as Linux is for writing code, I can’t fully do the work that needs to be done using the tools available. I still wind up testing on other machines.
I’d give systemtap a B, and Linux in general a C.
Tags: ARM4, C++, CodeAnalyst, Development, DTrace, Linux, oprofile, Red Hat, Solaris, Systemtap
May 27, 2009 at 7:55 pm |
Thanks for trying systemtap!
> Unfortunately, at least with the version I was running on RHEL 5.2, you get no
> visibility into the user symbol tables. So your futex is displayed as a memory location
We are tracking a bug with the symname() function that should map pointers to
their nearest symbol (current user-thread or kernel-space). Please consider giving
more details about what you were trying to do on .
> Overall, I’d rate the experience satisfactory, but systemtap still needs improvement.
We agree.
May 27, 2009 at 8:16 pm |
Hmm… reading your comment, I must concede the fault may be mine in that I don’t think I put anything in my script that tried to map the symbol name. I don’t think I realized this was a requirement/possibility.
Here’s the script I used:
#! /usr/bin/env stap
# This script tries to identify contended user-space locks by hooking
# into the futex system call.
global thread_thislock # short
global thread_blocktime #
global FUTEX_WAIT = 0 /*, FUTEX_WAKE = 1 */
global lock_waits # long-lived stats on (tid,lock) blockage elapsed time
global process_names # long-lived pid-to-execname mapping
probe syscall.futex {
  if (op != FUTEX_WAIT) next # don't care about WAKE event originator
  t = tid ()
  process_names[pid()] = execname()
  thread_thislock[t] = $uaddr
  thread_blocktime[t] = gettimeofday_us()
}
probe syscall.futex.return {
  t = tid()
  ts = thread_blocktime[t]
  if (ts) {
    elapsed = gettimeofday_us() - ts
    lock_waits[pid(), thread_thislock[t]] <<< elapsed
    delete thread_blocktime[t]
    delete thread_thislock[t]
  }
}
probe end {
  foreach ([pid+, lock] in lock_waits)
    printf ("%s[%d] lock %p contended %d times, %d max us, %d min us, %d avg us\n",
            process_names[pid], pid, lock, @count(lock_waits[pid,lock]),
            @max(lock_waits[pid,lock]),
            @min(lock_waits[pid,lock]),
            @avg(lock_waits[pid,lock]))
}
All suggestions welcome. BTW, this is my second instance of tech support via blog… I like it! 😀
May 27, 2009 at 9:33 pm
To the extent that the futex addresses may relate to data symbols in your program
(embedded into pthread_mutex objects perhaps), a change along these lines should
before too long give you the right symbolic info.
from: thread_thislock[t] = $uaddr
to: thread_thislock[t] = symname($uaddr)
and in the printf() at the bottom,
from: lock %p
to: lock %s
May 27, 2009 at 11:47 pm |
Well, RHEL’s version is way too old for this (0.7.2), so I tried it on FC11 (0.9.2). Still no love – all I get is an address.
The issue may be that I’m using pthread mutexes, which in turn use futexes, rather than futexes directly, but I’d still hope to see pthread_symbol_name+offset. Since these are defined within my program, I wouldn’t think I’d need debug versions of pthreads or anything similar.
Any debug settings I can use to see what’s going on?
Also, I went back to the documentation and noticed that symname isn’t mentioned anywhere, so I feel less guilty about missing it now.
May 28, 2009 at 8:49 am |
You will need a more recent systemtap version, like 0.9.7, and
then change “symname” to “usymname” above (as my colleague
Mark Wielaard reminded me), and then it should start to work.
May 28, 2009 at 9:14 am |
And I forgot to put the new user symbol/backtrace tapset functions in the manual, so even in 0.9.7 you wouldn’t have found them… oops. Fixed 3 minutes ago in git, so the next time we regenerate the manual they will be included. For now please look at the documentation in the tapset sources themselves (/usr/share/systemtap/tapset/*context*stp). Sorry for the inconvenience.
May 28, 2009 at 10:54 am |
I mistyped the FC11 version number. It is 0.9.7. I tried usymname and, just for giggles, usymdata, but I’m still getting hex addresses.
May 28, 2009 at 11:58 am |
David, would you mind posting your pthready C program and the exact script you’re
trying to run to the mailing list?
May 28, 2009 at 12:17 pm |
Yes, I think we’ve overrun this thread. Give me a few minutes and I’ll get something posted.