A while ago, we discussed some performance analysis basics:
- Define what your problem is.
- Figure out your goal: What metric needs to be in what ballpark for you to declare victory?
- Analyze your system from the inside out: CPU, RAM, disk, network. Your bottleneck is always in one of these four areas.
So what are the best commands for finding bottlenecks in each of the four categories above? Here’s part two of my Oracle Solaris Performance cheat sheet with some favorite tricks.
Does Your System Have Enough CPU Power?
This is usually the first suspicion when the performance isn’t where it should be:
“The CPU is too slow!”
And it’s often just plain wrong.
Let’s see how we can quickly answer the question: Do I have enough CPU power?
In the old days of single-core, single-CPU systems, we fired up top and watched the system load value, or the top processes’ CPU percentage. But in today’s multi-CPU, multi-core world, this doesn’t work anymore. The old concept of “load” is now misleading and quite useless if you want to assess whether your system has enough CPU power or not.
Here’s a more modern way:
constant@fridolin:~$ vmstat 5
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr cd s0 s2 -- in sy cs us sy id
0 0 0 446144 130076 23 100 0 1 3 0 12 7 -0 13 0 465 1352 1137 6 12 82
0 0 0 405376 90808 33 41 0 0 0 0 0 39 0 3 0 514 500 571 4 11 85
0 0 0 405296 90536 0 0 0 0 0 0 0 29 0 1 0 502 778 551 4 10 86
...
(Remember to ignore the first line of the output as it may contain accumulated data from an unknown sample size.)
Now watch the rightmost column (id), which is the system idle time in percent. Is it bigger than 0 most of the time? Then you have enough CPU power. It’s that simple. If idle time is 0 most of the time, buy a bigger CPU. If not, look elsewhere.
The above system has enough CPU: It’s idle more than 80% of the time, so even if something runs slowly, the CPU can’t be the cause in this case.
(Yes, life can be more complex than that, but remember, we’re talking about a cheat sheet here. This is the most useful approach for a majority of cases.)
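If you would rather not eyeball the column, the rule above can be sketched in a few lines of Python. This is only an illustration: the function names and the reading of “most of the time” as “more than half of the samples” are assumptions made for this example, and the parser simply treats every all-numeric line as a vmstat sample.

```python
# Sample taken from the vmstat output shown above.
SAMPLE_VMSTAT = """\
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr cd s0 s2 -- in sy cs us sy id
0 0 0 446144 130076 23 100 0 1 3 0 12 7 -0 13 0 465 1352 1137 6 12 82
0 0 0 405376 90808 33 41 0 0 0 0 0 39 0 3 0 514 500 571 4 11 85
0 0 0 405296 90536 0 0 0 0 0 0 0 29 0 1 0 502 778 551 4 10 86
"""


def idle_percentages(vmstat_output):
    """Return the idle percentage (last column) of every sample except the first."""
    samples = []
    for line in vmstat_output.splitlines():
        fields = line.split()
        try:
            numbers = [int(field) for field in fields]
        except ValueError:
            continue  # header line or anything else that is not pure numbers
        if numbers:
            samples.append(numbers[-1])
    # The first sample holds accumulated data, so it is dropped.
    return samples[1:]


def cpu_is_saturated(idle_values, threshold=0.5):
    """True if idle time is 0 in more than `threshold` of the samples.

    The 0.5 default is an assumption standing in for "most of the time".
    """
    if not idle_values:
        return False
    zero_count = sum(1 for value in idle_values if value == 0)
    return zero_count / len(idle_values) > threshold


if __name__ == "__main__":
    idle = idle_percentages(SAMPLE_VMSTAT)
    print(idle, "saturated" if cpu_is_saturated(idle) else "enough CPU")
```

To use it on a live system, you could capture a minute or so of vmstat 5 output into a file and pass the file contents to idle_percentages.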
How’s My Memory Doing?
Now that we’ve ruled out “not enough CPU horsepower” as the bottleneck, let’s look at the next layer: RAM. Do we have enough RAM? Or is the system starving for more memory, possibly resorting to using slow disks as a poor substitute for RAM? Again,
constant@fridolin:~$ vmstat 5
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr cd s0 s2 -- in sy cs us sy id
2 0 7 6472 30620 6 85 108 392 546 3060 2617 143 0 111 0 839 408 14606 3 40 57
0 0 7 8360 33960 10 51 89 155 1910 1816 19090 187 0 52 0 883 529 9512 5 36 59
0 0 7 12548 42948 19 48 66 215 215 1080 0 121 0 70 0 737 340 10273 3 31 66
1 0 7 13612 39916 38 90 106 0 0 632 0 171 0 56 0 900 616 10160 5 29 66
4 0 7 8060 29528 10 47 55 0 383 232 5514 112 0 77 0 854 739 6665 4 26 70
0 0 7 7312 38468 3 9 15 234 1500 0 17073 33 0 47 0 580 349 3993 2 25 73
0 0 7 8960 39460 17 46 55 0 0 0 0 101 0 37 0 744 529 7870 3 27 70
2 0 7 8836 37020 6 31 46 0 0 0 0 87 0 87 0 749 418 6033 3 20 77
is our friend. This time, let’s look at three values: swap, free and sr (the scan rate):
- swap: This is the amount of free virtual memory, in kilobytes.
- free: This is the amount of free physical memory, in kilobytes.
- scan rate (sr): This is the number of memory pages per second that the page scanner examines while it looks for lesser-used pages to free, making room for data that needs to be allocated from physical memory.
Again, the old adage was: If memory is full, you need more of it. But today that’s misleading: Modern operating systems tend to use as much memory as they can, to get the most out of the RAM you paid for. For example, ZFS uses as much free memory as possible as a read cache to save you from spending precious IOPS on disks. So if the “free mem” column in top is small, this is actually a good sign: It means that your RAM is doing useful stuff.
A better question to ask here: Is my memory system in trouble? That’s what the scan rate value is telling us: The bigger this value, the more stressed our memory subsystem is, because the OS is more and more busy scanning memory pages for expendable chunks so it can satisfy a high demand for fresh memory. If the scan rate is a single-digit value most of the time, you’re ok. If it shows large values over extended periods of time, you’ll likely benefit from some extra RAM in your system.
In the second vmstat example above, I created extra stress for the memory system by starting a ZFS scrub (filling up RAM), starting OpenOffice with a large presentation and asking GIMP to set up a new 8k x 8k picture for me. That resulted in several samples showing scan rates in the thousands, up to 19090. That’s certainly a situation where more RAM would have come in handy. The system was unusable, although the CPU was between 57% and 77% idle.
(Again, there’s a lot more detail that we don’t cover here, but we don’t want to make this post bigger than a good bedtime reading, do we?)
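The scan rate check can be sketched the same way. The following Python snippet is an illustration only: it finds the sr column by looking it up in the header line, and the function names and the limit of 10 (standing in for “single-digit value”) are assumptions made for this example.

```python
# Three of the samples from the second vmstat output shown above.
SAMPLE_VMSTAT_MEM = """\
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr cd s0 s2 -- in sy cs us sy id
2 0 7 6472 30620 6 85 108 392 546 3060 2617 143 0 111 0 839 408 14606 3 40 57
0 0 7 12548 42948 19 48 66 215 215 1080 0 121 0 70 0 737 340 10273 3 31 66
4 0 7 8060 29528 10 47 55 0 383 232 5514 112 0 77 0 854 739 6665 4 26 70
"""


def scan_rates(vmstat_output):
    """Return the sr value of every sample line, in order of appearance."""
    sr_index = None
    rates = []
    for line in vmstat_output.splitlines():
        fields = line.split()
        if "sr" in fields and "free" in fields:
            sr_index = fields.index("sr")  # column-name line
            continue
        if sr_index is None or len(fields) <= sr_index:
            continue  # not a sample line (prompt, group header, ...)
        try:
            rates.append(int(fields[sr_index]))
        except ValueError:
            continue
    return rates


def stressed_fraction(rates, limit=10):
    """Fraction of samples whose scan rate is `limit` or higher.

    A limit of 10 is an assumption meaning "no longer single-digit".
    """
    if not rates:
        return 0.0
    return sum(1 for rate in rates if rate >= limit) / len(rates)


if __name__ == "__main__":
    rates = scan_rates(SAMPLE_VMSTAT_MEM)
    print(rates, stressed_fraction(rates))
```

Unlike the CPU sketch, this one keeps every sample, so drop the first value yourself if your capture starts with the accumulated line.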
The nice thing about vmstat is that with just one command, you can easily assess if the CPU and RAM situation is ok or not, then move on to the next layer.
Or Is There a Disk Problem?
Now it gets interesting. Most if not all of the performance problems I see are disk I/O related, and there’s no indication that this is about to change.
You can get a quick overview of your IO situation by using:
constant@fridolin:~$ iostat -xzn 5
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
8.2 6.2 163.8 90.0 0.5 0.2 35.4 13.1 8 10 c3d0
1.4 12.2 30.0 81.4 0.1 0.2 8.9 13.0 3 7 c6t0d0
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
126.6 33.1 1613.0 400.3 3.5 1.6 21.9 9.8 75 81 c3d0
0.0 19.7 0.0 40.7 0.6 0.1 28.6 7.5 14 15 c6t0d0
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
33.4 2.0 242.5 14.4 7.1 2.0 200.0 56.4 100 100 c3d0
0.0 15.8 0.0 39.4 2.3 0.5 148.2 31.3 49 49 c6t0d0
Again, looking at simple performance numbers like reads/writes per second or even kilobytes read/written per second doesn’t tell you much. Are 126 reads fast? Or too slow? Wow, 1613k read per second. That’s a lot! Is it? Wait, what disks am I using again? (Answer: The above is a Solaris 11 Express system running on VirtualBox on my 3-year-old Mac.)
A more interesting figure to look at is wait: This is the number of IO operations that are waiting to be serviced. In other words: “wait” tells you the waiting queue length. If your queue length looks like the one in front of an Apple store on the day of the introduction of the new iPhone, you need to work on your disks (Here are a few suggestions if you use ZFS). If the wait value is in the single-digit range, then your problem may be elsewhere.
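Here is a rough Python sketch of how the wait column could be collected per device from captured iostat -xzn output. It is an illustration, not a definitive tool: the function name is made up for this example, and the parser assumes the column layout shown above, with device as the last column.

```python
# The first two reports from the iostat output shown above.
SAMPLE_IOSTAT = """\
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
8.2 6.2 163.8 90.0 0.5 0.2 35.4 13.1 8 10 c3d0
1.4 12.2 30.0 81.4 0.1 0.2 8.9 13.0 3 7 c6t0d0
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
126.6 33.1 1613.0 400.3 3.5 1.6 21.9 9.8 75 81 c3d0
0.0 19.7 0.0 40.7 0.6 0.1 28.6 7.5 14 15 c6t0d0
"""


def wait_queues(iostat_output):
    """Map each device name to the list of wait (queue length) values seen."""
    columns = None
    waits = {}
    for line in iostat_output.splitlines():
        fields = line.split()
        if "wait" in fields and "device" in fields:
            columns = fields  # column-name line, repeated for every report
            continue
        if columns is None or len(fields) != len(columns):
            continue  # "extended device statistics" or other non-data line
        row = dict(zip(columns, fields))
        try:
            value = float(row["wait"])
        except ValueError:
            continue
        waits.setdefault(row["device"], []).append(value)
    return waits


if __name__ == "__main__":
    for device, values in wait_queues(SAMPLE_IOSTAT).items():
        print(device, values, "max:", max(values))
```

Looking the column up by name rather than by position keeps the sketch working if you add or remove iostat flags that change the column order.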
Sometimes you want a more application-level view into your IO situation, and that is what the following command is about:
admin@krengi:~$ fsstat -F 5
new name name attr attr lookup rddir read read write write
file remov chng get set ops ops ops bytes ops bytes
0 0 0 0 0 0 0 0 0 0 0 ufs
0 0 0 0 0 0 0 0 0 0 0 proc
0 0 0 0 0 0 0 0 0 0 0 nfs
0 0 0 68 0 43 0 0 0 9 1.06K zfs
0 0 0 0 0 0 0 0 0 0 0 lofs
0 0 0 0 0 0 0 0 0 0 0 tmpfs
0 0 0 0 0 0 0 0 0 0 0 mntfs
0 0 0 0 0 0 0 0 0 0 0 nfs3
0 0 0 0 0 0 0 0 0 0 0 nfs4
0 0 0 0 0 0 0 0 0 0 0 autofs
(I threw away the first batch of data, which is always useless.)
Or, if the number of filesystems you’re interested in is limited:
admin@krengi:~$ fsstat zfs 5
new name name attr attr lookup rddir read read write write
file remov chng get set ops ops ops bytes ops bytes
2.08M 613K 171K 7.68G 2.25M 10.0G 43.3M 1.09G 1.97T 189M 638G zfs
0 0 0 74 0 79 0 35 608 18 860 zfs
0 0 0 67 0 39 0 0 0 1 112 zfs
0 0 0 71 0 73 0 1 4 1 112 zfs
This is another great way of quickly having a look at what’s up with your disk IO.
Are your users creating lots of files? Or are they modifying/removing/changing attributes a lot? What filesystems are causing the most IO load? How much IO goes through NFS and how much is local? All these questions can be easily answered with fsstat and a few flags.
Checking Out the Network
Finally, if your problem is neither on the CPU nor on the memory nor on the disk IO side, it may lie outside of your system, perhaps at the networking level. Again, there’s a favorite command that gets me a useful picture most of the time. For example, while streaming some video on my home server, I checked the effect on the network with this:
admin@krengi:~$ netstat -I e1000g0 5
input e1000g output input (Total) output
packets errs packets errs colls packets errs packets errs colls
417683472 4 384816503 0 0 420603019 4 387736050 0 0
5779 0 3282 0 0 5779 0 3282 0 0
6487 0 3556 0 0 6487 0 3556 0 0
3672 0 2351 0 0 3673 0 2352 0 0
Notice that netstat counts packets here, not MB/s. Network performance analysis and tuning is a science of its own, but with this command you can quickly assess what each networking interface is doing, and whether the packets they transmit are in the right ballpark. Maybe you have multiple network interfaces configured, but still all your data is sent through the same pipe?
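Since the counts are per interval, turning them into packets per second is simple arithmetic: divide by the interval length. A small Python sketch, under the assumption that the first numeric line is the total since boot and that every later line covers exactly one interval (the function name is made up for this example):

```python
# Taken from the netstat output shown above (interval: 5 seconds).
SAMPLE_NETSTAT = """\
input e1000g output input (Total) output
packets errs packets errs colls packets errs packets errs colls
417683472 4 384816503 0 0 420603019 4 387736050 0 0
5779 0 3282 0 0 5779 0 3282 0 0
6487 0 3556 0 0 6487 0 3556 0 0
"""


def packet_rates(netstat_output, interval):
    """Return (input, output) packets per second for each interval.

    Columns 0 and 2 are the input and output packet counts of the
    selected interface; the first numeric line is skipped because it
    holds the totals since boot.
    """
    rates = []
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 5 or not all(field.isdigit() for field in fields):
            continue  # header line
        rates.append((int(fields[0]) / interval, int(fields[2]) / interval))
    return rates[1:]


if __name__ == "__main__":
    for rx, tx in packet_rates(SAMPLE_NETSTAT, 5):
        print(f"in: {rx:.1f} pkts/s, out: {tx:.1f} pkts/s")
```

This still says nothing about bytes, because packet sizes vary; it only makes samples taken with different intervals comparable.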
Digging Deeper
So that’s it for my performance cheat sheet: vmstat for CPU and memory, iostat with the -xzn flags and fsstat for disk IO, and good old netstat -I for the network. This is the 20% effort solution, the minimum effective set of commands that will get you a quick overview of a system in 80% of the cases.
Now for that other 20% of more complicated cases, you will need some extra digging. If you want to learn more, here are a few useful pointers:
- The Solaris Internals Wiki has a great page about CPU/Processor Analysis (no link, solarisinternals.com no longer exists).
- dim_STAT (no link, dimitrik.free.fr no longer exists) is a complete toolset for collecting and analyzing system performance. It can generate both a high-level overview and a deep-down analysis of a system.
- Jörg wrote a nice article about fsstat (no link, page no longer exists), and he promised a little series of *stat articles. Jörg, why don’t you continue your series with some of your favorite tools? That would be cool!
Your Own Favorite Performance Tools
As we have seen, most of the time we can get away with some simple use of vmstat, iostat, fsstat and netstat. What are the tools that you like to use most of the time? What’s your own little set of cheat sheet performance tools? Share your own set of tools in the comments, and if Jörg is reading this: Please continue your Meet the stats series!