Posts Tagged ‘Systemtap’

Systemtap

May 27, 2009

So I promised a while ago to describe my systemtap experience.

My initial intent was to use dtrace, and as it isn’t available on Linux, I spent some time playing with OpenSolaris. In the course of my web searches I stumbled across Systemtap, which is supposed to provide at least equivalent, and sometimes better, functionality. I won’t comment on that, because as soon as I found it, I abandoned dtrace. This is nothing against dtrace, but my primary development platform is Linux.

The task I was tackling was investigating performance for my ARM4 agent. In particular, I wanted it to scale better when run on machines with multiple cores. Conventional debuggers don’t really do this, and profilers only give you part of the picture… especially when multiple processes are involved. One question I had in particular was how much time I was spending waiting on mutual exclusions.

So I started playing with the example scripts. Overall it’s pretty easy. In fact, I probably spent as much time setting up a good test scenario as I did developing my systemtap scripts. Their examples for monitoring IPCs are pretty good. Not stellar, but good.

I got a few surprises. Primarily, I was astonished to see how much time I was spending copying memory using message queues. This was by far the bulk of the time expended on a 2-core Opteron system. Secondly, I was astonished at just how difficult it is to monitor mutual exclusions!

Linux uses an exclusion mechanism called futexes. User space memory is used for the exclusion and the operations are fast. Unfortunately, at least with the version I was running on RHEL 5.2, you get no visibility into the user symbol tables. So your futex is displayed as a memory location. That’s pretty useless. I had to go back and modify my code to print all the relevant memory addresses just so I could see which one is which! This was a lot of work that definitely flies in the face of the non-intrusive measurement philosophy.
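The matching of addresses to names can at least be done by a script rather than by eye. Here is a minimal sketch, assuming the program logs one `name address` line per lock at startup and the trace output has been boiled down to one `address wait_time` line per futex wait. Both formats, and every name below, are hypothetical and not something systemtap produces out of the box.

```python
def parse_lock_table(lines):
    """Build {address: name} from lines like 'queue_lock 0x7f3a1c0010a8'.

    The 'name address' format is an assumption: it is whatever the
    instrumented program was modified to print.
    """
    table = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        name, addr = parts
        table[int(addr, 16)] = name
    return table


def label_futex_waits(trace_lines, table):
    """Sum wait time per lock from lines like '0x7f3a1c0010a8 1250'.

    Addresses missing from the table keep their hex form, so unknown
    futexes still show up in the result.
    """
    totals = {}
    for line in trace_lines:
        parts = line.split()
        if len(parts) != 2:
            continue
        addr = int(parts[0], 16)
        wait = int(parts[1])
        name = table.get(addr, hex(addr))
        totals[name] = totals.get(name, 0) + wait
    # Hottest lock first
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up sample data, for illustration only
    locks = parse_lock_table([
        "queue_lock 0x7f3a1c0010a8",
        "stats_lock 0x7f3a1c0010d0",
    ])
    trace = ["0x7f3a1c0010a8 1250", "0x7f3a1c0010d0 40", "0x7f3a1c0010a8 900"]
    for name, wait in label_futex_waits(trace, locks):
        print(name, wait)
```

This doesn’t remove the need to instrument the code. It only makes the output readable. Addresses are only meaningful within one run of one process, so the lock table has to come from the same run as the trace.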

Overall, I’d rate the experience satisfactory, but systemtap still needs improvement.

I found a few areas where I could tweak my code and get significant performance improvements. I saw some areas where the architecture could be improved, but that’s another problem for another day.

I see rumours on the web of user space taps. They may exist already, but if they’re in the version I used, it’s unclear how to use them. A far more important task for systemtap developers in my mind would be reading program symbol tables. Knowing a futex is generating a hot spot is useless to me if I can’t find where the futex is!

Throughout this process, I’ve been amazed at how “almost” the development tools on Linux are. CodeAnalyst (oprofile) and Systemtap are both tools I need, and both come up short. As good as Linux is for writing code, I can’t fully do the work that needs to be done using the tools available. I still wind up testing on other machines.

I’d give systemtap a B, and Linux in general a C.

Attack of the stupids

May 19, 2009

That was me attacking myself by the way.

I’ve had this long-standing stability issue in my ARM4 agent that has been driving me insane for a couple of years. I use the Berkeley database for my datastore. When backing up files, I saved all files, removed the instance databases while retaining the definitions, and re-created new instance databases. The problem was that when the databases were re-created, I was getting errors saying the files didn’t exist, and the program would hang.

First thing I did was look at the threading and error propagation to reduce the hangs, but I could never quite eliminate them. Next I went after what looked like a corrupted environment. I tried threading tools. I tried systemtap. Nothing was getting me anywhere.

Then, when looking for an unrelated API, I came across the db->remove() function call. You know, the one I should have been using rather than the file system’s unlink() call. Simple when you know.

In my defense, the code is several years old, and though I was the original author, it was the first Berkeley DB code I’d written. Still stupid, but the foolishness of youth. Except I’m not young. I’ll shut up now.

P.S. New release coming shortly 🙂

OpenSolaris – a first look

May 10, 2009

Well, first for me anyways.

I decided to have a look at OpenSolaris for one reason – dtrace. I’m trying to optimize some multithreaded software, and I need to have a peek at the semaphore usage. This seems like the ideal time to use dtrace.

I need a multi-core machine to do this. That eliminates my Sun Blade. Mac OS X has dtrace, but they’ve changed enough of the underlying OS, such as the removal of groups, to make it impractical for testing system software. I may still use it, but I wanted something better. Enter OpenSolaris.

There are other reasons to look at OpenSolaris as well, such as ZFS, but my needs are pretty focused.

So, lacking an unused machine, I proceeded to install in VMware Workstation. My first impression was that it was slow. It was using both CPUs near the limit, but it wasn’t doing much. But it installed, I logged in, and all was good. The next thing I did was assign it a fixed IP using bridged networking so I could SSH in. The network didn’t reinitialize right away, and although I know I could have just restarted the appropriate service, I elected for a reboot. That’s when things went horribly wrong. The machine VERY SLOWLY came back up, all the while maxing out the CPU. When it came up, logging in was impossible as each key press would repeat. After using SSH to gain access, I did a check using top and discovered X Windows and the kernel were using all the CPU, except the small amount used by the screensaver I’d already disabled. This certainly indicated some conflicts in resource management with VMware.

This time I did an orderly shutdown of the VM. Slow. Eventually it stopped and I was able to restart in text mode. It started fine. With great trepidation, I re-enabled X Windows. It started fine. Definitely some issues with running in a VM. All is good now.

Boy, this looks a lot like Linux. Gnome will do that. But there are other things as well. Bash is the shell of choice. I have that on my Sparc box as well, but here it’s also the default for root. Root without command line completion is just plain a pain in the a**. Paths are defaulted better too. That’s always been a pain in Solaris.

Which leads to packages. I like what they’ve done. One of the issues I’ve always had with Solaris is the patch problem. Difficult. Klunky. Hard to find. Given that Linux and others had solved this so well, they were definitely behind the times. Of course, as hobby systems that didn’t generate revenue, my Solaris boxes have never had any maintenance subscriptions.

OK, so I started writing this several days ago, and it has been languishing in my Drafts folder. There are a few reasons for this. One is I discovered Systemtap. More on this in another post. A second is I’ve got some of the portability issues for OS X solved. Shark was available to me again! Third, Solaris is still Solaris. It’s still got some annoyances in terms of path setup, and although it’s very much a usable operating system, I’ve no compelling reason to use it over Linux on my x86 boxes. My Sparc boxes are another story. Perhaps in a month or so.

Semi-final verdict: Much improved but no compelling reason to switch from Linux. Maybe Oracle will change that?