Pretty much everyone has heard of or tried pair programming. I personally have found it to be quite valuable. One very interesting thing I was reminded of while watching some people on my team pair is how dumb pair programming makes the typer. Conversely, it also makes the watcher (I don't know what the role is really called) considerably smarter.
I can vividly remember pairing with someone much smarter than me on a transaction logging system a few years back and noticing this effect. When he was typing, I was always the one coming up with the cool ideas. Of course it's not always an intelligence thing; around the office I have noticed multiple pairs demonstrate this. The typer becomes dumb. The non-typer makes excellent judgments and decisions.
I think the obvious explanation for this is just how much energy we put into things like typing, picking variable names, formatting, etc... I've often joked that we should hire secretaries with a little programming knowledge for each programmer.
Has anyone else out there noticed this? Is there anything to be gained from this observation? Or am I completely off base?
Scaling software should be an activity done with ease by Hardware, Cloud and System Administrators without knowledge of the applications being scaled.
Tuesday, September 26, 2006
Performance Tuning with Thread Dumps
As a follow-on to my earlier tuning post, I whipped up a quick sample to make the performance-tuning-with-thread-dumps point clearer. Take a look at the following code and make a guess at what you think the slowest part of the code is and why.
package softwareandstuff.performance;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
public class SomethingToRun {
public void run() {
    List<String> stuff = new ArrayList<String>();
    Random r = new Random();
    long total = 0;
    long count = 0;
    while (true) {
        long start = System.currentTimeMillis();
        String s = " " + r.nextInt() + " " + r.nextInt();
        // NOTE: the loop body below is a best-guess reconstruction; the original
        // post's code was garbled by the blog's HTML escaping.
        for (int i = 0; i < 1000; i++) {
            stuff.add(s + i);
        }
        while (!stuff.isEmpty()) {
            stuff.remove(0); // remove from the front of the list
        }
        total += System.currentTimeMillis() - start;
        count++;
        if (count % 1000 == 0) {
            System.out.println("Average: " + (total / count) + "ms over " + count + " runs");
        }
    }
}

public static void main(String[] args) {
    new SomethingToRun().run();
}
}
On my slow computer it took around 43 seconds per 1000.
Now use jps to find out what process id the little app has and do a "kill -3" 10 to 20 times fast in succession.
Note: jps requires JDK 1.5 to be installed. Figure out how to use Task Manager on Windows or ps on Unix-based platforms to find the process id if you are not on JDK 1.5.
steve:~ steve$ jps
7595 Jps
7594 SomethingToRun
7440 startup.jar
steve:~ steve$ kill -3 7594
Note: see the following link for more on how to do thread dumps and what they are: A little on thread dumps
You should see something like the following:
"Signal Dispatcher" daemon prio=9 tid=0x005098d0 nid=0x1820a00 waiting on condition [0x00000000..0x00000000]
"Finalizer" daemon prio=8 tid=0x00509030 nid=0x180fc00 in Object.wait() [0xf0a04000..0xf0a04ab0]
at java.lang.Object.wait(Native Method)
- waiting on <0x266101e0> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
- locked <0x266101e0> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
"Reference Handler" daemon prio=10 tid=0x00508c40 nid=0x180ec00 in Object.wait() [0xf0983000..0xf0983ab0]
at java.lang.Object.wait(Native Method)
- waiting on <0x26610260> (a java.lang.ref.Reference$Lock)
at java.lang.Object.wait(Object.java:474)
at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
- locked <0x26610260> (a java.lang.ref.Reference$Lock)
"main" prio=5 tid=0x005013b0 nid=0x1804600 runnable [0xf07ff000..0xf08002c0]
at softwareandstuff.performance.SomethingToRun.run(SomethingToRun.java:25)
at softwareandstuff.performance.SomethingToRun.main(SomethingToRun.java:36)
"VM Thread" prio=9 tid=0x00508460 nid=0x1803c00 runnable
Note: ignore any thread that is in Object.wait() for now. In fact, searching on runnable is usually your best bet for skipping to the things that are interesting from a performance perspective.
My guess is you will see that a decent percentage of the time is spent in the remove call. Hmmm, wonder why that is. Now go look at the implementation of remove on an ArrayList and it will all be clear. Then try using a LinkedList instead. For me that dropped it down to 6 seconds per 1000.
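The difference comes down to how the two lists implement removal at the head. Here's a minimal benchmark sketch you can run to see it for yourself (the class name and element counts are my own, not from the original post):

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class RemoveFromFront {
    // Fill a list with n elements, then time draining it by repeatedly removing index 0.
    static long drain(List<Integer> list, int n) {
        for (int i = 0; i < n; i++) {
            list.add(i);
        }
        long start = System.currentTimeMillis();
        while (!list.isEmpty()) {
            // ArrayList shifts every remaining element left by one on each call;
            // LinkedList just unlinks the head node.
            list.remove(0);
        }
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) {
        int n = 100000;
        System.out.println("ArrayList:  " + drain(new ArrayList<Integer>(), n) + "ms");
        System.out.println("LinkedList: " + drain(new LinkedList<Integer>(), n) + "ms");
    }
}
```

With n around 100,000 the ArrayList version should be dramatically slower, since remove(0) makes the drain quadratic in the list size.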
busy today but more to come...
Saturday, September 23, 2006
Performance Tuning Tips for Java - No tools required
Over the years I've developed a simple methodology for doing performance tuning. To tell the truth, I didn't even know it was a methodology until I realized that whenever people said "The software is slow!" I always responded with the same actions. In this article I'm going to talk about what I now fondly call "Performance tuning by Steve." (Can you tell I've seen one too many infomercials in my day?)
First things first:
Someone comes to you as a developer and says, "The software is too slow." They usually have between two and five guesses as to why it's slow, and maybe so do you. What do you do? I'll tell you what you NEVER do: you NEVER take action based on a guess of what the problem is. The only thing worse than an outsider's guesses as to why your software is slow is your own guesses. Do the investigation. I can't tell you how many times I hear people insist that some piece of software is slow because of networking, disk, or database even though the machine is CPU bound. Before I touch a line of code I always ask these two basic questions.
1) Are you CPU bound?
(If it is a multi-process test, I ask if they are CPU bound on any of the boxes involved in the test. If it is a multi-CPU machine, I ask if any one processor is pegged.)
The answer I usually get is "I don't know." Ahh, I say. Can you run your test again with vmstat (iostat on the Mac, and the Performance tab of Task Manager on Windows)? Don't forget to tee the output to a file.
2) Are you GC bound?
The answer I usually get: "I don't know." I usually follow with, "Can you run your test again with all processes having some variant of the -verbose:gc flag turned on? And don't forget to tee it to a file." If they are GC bound then one needs to do GC/garbage-creation tuning, which I will save for another day; it is a big topic on its own.
If they come back and say, "Yes, I'm CPU bound! And no, I'm not spending tons of time in GC," my mouth waters a bit. I scratch my chin gently and say, "Great! Can you take a series of thread dumps while your app is running?" At this point I get a funny look. Why? "I don't think it's deadlocked," they usually say. Well, this is an extremely effective way to see where an application is spending its time. In my experience this is more effective than even a good profiler. If you run a load test that is CPU bound and take a series of thread dumps, you will find out where your code is spending time. For a little more on thread dumps check this out: A little on thread dumps
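The same idea can be sketched in plain Java: sample the stacks of all threads a few times and count how often each top frame shows up, which is essentially what you do by eye when scanning repeated kill -3 dumps. This is my own illustration, not code from the post; it relies on Thread.getAllStackTraces() and Thread.getState(), both available since JDK 1.5:

```java
import java.util.HashMap;
import java.util.Map;

public class StackSampler {
    // Poor man's profiler: take repeated snapshots of every thread's top stack
    // frame and count how often each method appears.
    public static Map<String, Integer> sample(int samples, long intervalMs) throws InterruptedException {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (int i = 0; i < samples; i++) {
            for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
                Thread t = e.getKey();
                StackTraceElement[] stack = e.getValue();
                // Just like reading a real dump: only RUNNABLE threads matter
                // for a CPU-bound problem; skip everything sitting in wait.
                if (t.getState() == Thread.State.RUNNABLE && stack.length > 0) {
                    String top = stack[0].getClassName() + "." + stack[0].getMethodName();
                    Integer c = counts.get(top);
                    counts.put(top, c == null ? 1 : c + 1);
                }
            }
            Thread.sleep(intervalMs);
        }
        return counts;
    }

    public static void main(String[] args) throws InterruptedException {
        // Spin up a busy thread so there is something to catch in the act.
        Thread busy = new Thread(new Runnable() {
            public void run() {
                while (!Thread.currentThread().isInterrupted()) {
                    Math.sqrt(Math.random());
                }
            }
        });
        busy.setDaemon(true);
        busy.start();
        System.out.println(sample(20, 10));
    }
}
```

Methods that dominate the counts are where your CPU time is going; a handful of dumps taken by hand gives you the same signal with no code at all.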
I usually perform analysis on the thread dumps, fix the thing that is the worst offender, then try the test again. Rinse, dry, repeat...
Now you are not CPU bound anymore. What next? I'll tell you what still isn't next. DON'T GUESS! you will be wrong. I promise!
The trick here is to find the bottleneck. It is best to think of a non-CPU-bound performance problem as thread starvation, not a network problem and not a disk problem (at least at first; there are exceptions).
If you are using SEDA-style development (think work queues) then you're lucky. The next thing you have to do is just find out which queue is backing up. Really, any state-machine-style program can use this technique.
- Whichever queue is backing up is where your performance problem is.
- You may need to add more worker threads to the end of the Queue. Especially if the threads are doing blocking tasks of any kind.
- You may need to do a better job of separating out CPU tasks from blocking tasks.
* If you have a lot of disk writing to do, then the thread that writes to disk should do almost zero CPU work; otherwise you will get pipeline stalls going to disk. Big mistake.
- You may need to more effectively synchronize something if your threads are all blocked waiting for the same resource.
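A cheap way to find the queue that is backing up is to periodically log the depth of every work queue in the system. A minimal sketch (the monitor class and stage names here are made up for illustration):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueueDepthMonitor {
    private final Map<String, BlockingQueue<?>> queues = new LinkedHashMap<String, BlockingQueue<?>>();

    public void register(String name, BlockingQueue<?> q) {
        queues.put(name, q);
    }

    // One snapshot of every stage's backlog. Under load, the stage whose
    // depth keeps growing is your bottleneck.
    public String snapshot() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, BlockingQueue<?>> e : queues.entrySet()) {
            sb.append(e.getKey()).append("=").append(e.getValue().size()).append(" ");
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) throws InterruptedException {
        QueueDepthMonitor mon = new QueueDepthMonitor();
        BlockingQueue<String> parse = new ArrayBlockingQueue<String>(1000);
        BlockingQueue<String> write = new ArrayBlockingQueue<String>(1000);
        mon.register("parse", parse);
        mon.register("diskWrite", write);
        parse.put("job1");
        System.out.println(mon.snapshot());
    }
}
```

Dump a snapshot like this every few seconds during a load test; the stage name that keeps climbing tells you which worker pool to look at first.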
If you are not doing SEDA-style development, or if you are but have found the queue that is backing up and don't know why yet, then here is the next fun step.
Warning, it is time consuming but very effective.
Wrap large swaths of your slow code like so:
long start = System.currentTimeMillis();
.
.
-your code-
.
.
long t = System.currentTimeMillis() - start;
count++;
total += t;
if(count % 1000 == 0){ //<-- obviously the number you divide by here is dependent on how often the code is called
System.out.println("T1 Average: " + (total / count) + " count:" + count);
}
Do this in multiple parts of your code and then keep moving the timing in tighter and tighter until you have found the part or parts of your code that are taking the longest. This takes some practice. You'll probably get fooled by thread starvation a few times, and maybe by some blocking calls, i.e. network or disk. Keep your mind clear. Try to draw the handoffs between threads on a whiteboard.
One last thing. TAKE NOTES, lots of NOTES. Save the output from every run and make a little readme that reminds you what each experiment was and even any thoughts you had during said experiment.
Summary:
1) Figure out if you are CPU bound
2) Eliminate GC as a concern (tuning GC is a whole other article)
3) If you are cpu bound use thread dumps to figure out where time is being spent (ignore things in wait, only runnables)
4) If you are not then do the following:
- Still take thread dumps to understand where threads are blocked and where they are runnable
- Bottleneck analysis, find out where things are backing up moving through your state machine
- Use timings over many calls of each code path, then move those timings in on the areas that are slow until you figure out where the problem is.
5) Take lots of detailed notes and save all artifacts of each run.
Good luck