Systemtap

So I promised a while ago to describe my systemtap experience.

My initial intent was to use dtrace, and as it isn’t available on Linux, I spent some time playing with OpenSolaris. In the course of my web searches I stumbled across Systemtap, which is supposed to provide equivalent, and sometimes better, functionality. I won’t comment on that, because as soon as I found it, I abandoned dtrace. This is nothing against dtrace, but my primary development platform is Linux.

The task I was tackling was investigating performance for my ARM4 agent. In particular, I wanted it to scale better on machines with multiple cores. Conventional debuggers don’t really help with this, and profilers only give you part of the picture… especially when multiple processes are involved. One question I had in particular was how much time I was spending waiting on mutual exclusions.

So I started playing with the example scripts. Overall it’s pretty easy. In fact, I probably spent as much time setting up a good test scenario as I did developing my systemtap scripts. Their examples for monitoring IPCs are pretty good. Not stellar, but good.

I got a few surprises. Primarily, I was astonished to see how much time I was spending copying memory using message queues. This is by far the bulk of the time expended on a 2-core Opteron system. Secondly, I was astonished at just how difficult it is to monitor mutual exclusions!

Linux uses an exclusion mechanism called futexes. User space memory is used for the exclusion and the operations are fast. Unfortunately, at least with the version I was running on RHEL 5.2, you get no visibility into the user symbol tables. So your futex is displayed as a memory location. That’s pretty useless. I had to go back and modify my code to print all the relevant memory addresses just so I could see which one is which! This was a lot of work that definitely flies in the face of the non-intrusive measurement philosophy.
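
For anyone wondering what that workaround looks like, here is a minimal sketch in C, not the actual agent code. The lock names (queue_lock, stats_lock) and the helpers format_lock_address() and dump_lock_addresses() are made up for illustration. It also assumes, as glibc appears to do on Linux, that the futex word sits at the start of the pthread_mutex_t, so the mutex address is the one that shows up as the futex address in the trace; that is an implementation detail, not something pthreads guarantees.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

/* Hypothetical locks standing in for the ones in the real program. */
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t stats_lock = PTHREAD_MUTEX_INITIALIZER;

/* Write "<name> <address>" into buf; returns whatever snprintf returns. */
static int format_lock_address(char *buf, size_t len,
                               const char *name, const pthread_mutex_t *lock)
{
    assert(buf != NULL && name != NULL && lock != NULL);
    return snprintf(buf, len, "%s %p", name, (const void *)lock);
}

/* Print one line per lock so trace addresses can be matched to names. */
static void dump_lock_addresses(FILE *out)
{
    char line[128];

    format_lock_address(line, sizeof line, "queue_lock", &queue_lock);
    fprintf(out, "%s\n", line);
    format_lock_address(line, sizeof line, "stats_lock", &stats_lock);
    fprintf(out, "%s\n", line);
}
```

Calling dump_lock_addresses(stderr) once at startup gives a list to compare against the addresses in the trace output. It works, but it means touching the program under test, which is exactly what I was hoping to avoid.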

Overall, I’d rate the experience satisfactory, but systemtap still needs improvement.

I found a few areas where I could tweak my code and get significant performance improvements. I saw some areas where the architecture could be improved, but that’s another problem for another day.

I see rumours on the web of user space taps. They may exist already, but if they’re in the version I used, it’s unclear how to use them. A far more important task for systemtap developers, in my mind, would be reading program symbol tables. Knowing a futex is generating a hot spot is useless to me if I can’t find where the futex is!

Throughout this process, I’ve been amazed at how “almost” the development tools on Linux are. CodeAnalyst (oprofile) and Systemtap are both tools I need, and both come up short. As good as Linux is for writing code, I can’t fully do the work that needs to be done using the tools available. I still wind up testing on other machines.

I’d give systemtap a B, and Linux in general a C.


9 Responses to “Systemtap”

  1. Frank Ch. Eigler Says:

    Thanks for trying systemtap!

    > Unfortunately, at least with the version I was running on RHEL 5.2, you get no
    > visibility into the user symbol tables. So your futex is displayed as a memory location

    We are tracking a bug with the symname() function that should map pointers to
    their nearest symbol (current user-thread or kernel-space). Please consider giving
    more details about what you were trying to do on .

    > Overall, I’d rate the experience satisfactory, but systemtap still needs improvement.

    We agree.

    • davidcarterca Says:

      Hmm… reading your comment, I must concede the fault may be mine in that I don’t think I put anything in my script that tried to map the symbol name. I don’t think I realized this was a requirement/possibility.

      Here’s the script I used:
      #! /usr/bin/env stap

      # This script tries to identify contended user-space locks by hooking
      # into the futex system call.

      global thread_thislock # short
      global thread_blocktime #
      global FUTEX_WAIT = 0 /*, FUTEX_WAKE = 1 */

      global lock_waits # long-lived stats on (pid,lock) blockage elapsed time
      global process_names # long-lived pid-to-execname mapping

      probe syscall.futex {
        if (op != FUTEX_WAIT) next # don't care about WAKE event originator
        t = tid ()
        process_names[pid()] = execname()
        thread_thislock[t] = $uaddr
        thread_blocktime[t] = gettimeofday_us()
      }

      probe syscall.futex.return {
        t = tid()
        ts = thread_blocktime[t]
        if (ts) {
          elapsed = gettimeofday_us() - ts
          lock_waits[pid(), thread_thislock[t]] <<< elapsed
          delete thread_blocktime[t]
          delete thread_thislock[t]
        }
      }

      probe end {
        foreach ([pid+, lock] in lock_waits)
          printf ("%s[%d] lock %p contended %d times, %d max us, %d min us, %d avg us\n",
                  process_names[pid], pid, lock, @count(lock_waits[pid,lock]),
                  @max(lock_waits[pid,lock]),
                  @min(lock_waits[pid,lock]),
                  @avg(lock_waits[pid,lock]))
      }

      All suggestions welcome. BTW, this is my second instance of tech support via blog… I like it! 😀

      • Frank Ch. Eigler Says:

        To the extent that the futex addresses may relate to data symbols in your program
        (embedded into pthread_mutex objects perhaps), a change along these lines should
        before too long give you the right symbolic info.

        from: thread_thislock[t] = $uaddr
        to: thread_thislock[t] = symname($uaddr)

        and in the printf() at the bottom,

        from: lock %p
        to: lock %s

  2. davidcarterca Says:

    Well, RHEL’s version is way too old for this (0.7.2), so I tried it on FC11 (0.9.2). Still no love – all I get is an address.

    The issue may be that I’m using pthread mutexes, which in turn use futexes, rather than futexes directly, but I’d still hope to see pthread_symbol_name+offset. Since these are defined within my program, I wouldn’t think I’d need debug versions of pthreads or anything similar.

    Any debug settings I can use to see what’s going on?

    Also, I went back to the documentation and noticed that symname isn’t mentioned anywhere, so I feel less guilty about missing it now.

  3. Frank Ch. Eigler Says:

    You will need a more recent systemtap version, like 0.9.7, and
    then change “symname” to “usymname” above (as my colleague
    Mark Wielaard reminded me), and then it should start to work.

    • Mark Wielaard Says:

      And I forgot to put the new user symbol/backtrace tapset functions in the manual, so even in 0.9.7 you wouldn’t have found them… oops. Fixed 3 minutes ago in git, so the next time we regenerate the manual they will be included. For now please look at the documentation in the tapset sources themselves (/usr/share/systemtap/tapset/*context*stp). Sorry for the inconvenience.

  4. davidcarterca Says:

    I mistyped the FC11 version number. It is 0.9.7. I tried usymname and, just for giggles, usymdata, but I’m still getting hex addresses.

  5. fche Says:

    David, would you mind posting your pthready C program and the exact script you’re
    trying to run to the mailing list?

  6. davidcarterca Says:

    Yes, I think we’ve overrun this thread. Give me a few minutes and I’ll get something posted.
