Wednesday, June 27, 2007

Why Your Distributed Performance Tests Are Lying to You: Anti-Patterns of Distributed Application Testing and Tuning - Part 1

Clustering and distributing Java applications has never been easier than it is today (see Terracotta). As a result, writing good distributed performance tests and tuning those applications is increasingly important. It is a skill in its own right, and many who do it can use a little help. Over my next few blog posts I'm going to cover a series of anti-patterns in this area. I'll follow up with a simple, open distributed testing framework that I hope can help people out (hint, hint: the testing framework itself is distributed, the better to test distributed apps).

Here are the first 3 anti-patterns...

Anti-pattern 1: Single-Node “Distributed” Testing

Description

Running your “distributed” performance test inside a single JVM.

Problem

Depending on the framework, this can tell you either: 1) nothing, because the clustering framework recognizes it has no partners and optimizes itself out, or 2) very little: at best it gives you an idea of the maximum theoretical read/write speed for that framework.

Solution

When evaluating the performance of any kind of clustering or distributed computing software, always use an absolute minimum of 2 nodes (preferably more).
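One way to keep yourself honest here is to have the test harness refuse to start with fewer than two nodes. Below is a minimal sketch of such a guard. The class name ClusterSizeGuard and the idea of passing node addresses as strings are my own assumptions for illustration; they are not part of Terracotta or any other clustering framework.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical guard for a test harness; not part of any clustering framework.
final class ClusterSizeGuard {
    static final int MIN_NODES = 2;

    // Returns the distinct node addresses, or throws if the run would not
    // actually be distributed (duplicates count as a single node).
    static Set<String> requireDistributed(List<String> nodeAddresses) {
        Set<String> distinct = new LinkedHashSet<>(nodeAddresses);
        if (distinct.size() < MIN_NODES) {
            throw new IllegalStateException(
                "Need at least " + MIN_NODES + " distinct nodes, got " + distinct.size());
        }
        return distinct;
    }
}
```

Calling ClusterSizeGuard.requireDistributed(...) before the first measurement means a misconfigured single-node run fails loudly instead of producing numbers that look plausible.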


Anti-pattern 2: Single-Computer “Distributed” Testing

Description

Putting all (or just too many) of the resources for a performance test on one machine.

Problem

This has two problems. First, distributed applications running on the same machine have different latency and networking characteristics than distributed applications on different machines. This can hide whole classes of problems, such as pipeline stalls and batching and windowing issues.

The second problem is a variation on another anti-pattern, resource contention, which I will discuss later. By running multiple JVMs on one machine you are now contending for CPU, disk, and network, and may also be affecting things like the context-switch rate.

Solution

The only real way to test a distributed application is to run it in a truly distributed way: on multiple machines. If you must have multiple nodes/JVMs on one machine, run one of the many resource-monitoring tools and confirm that you aren't resource constrained (I use iostat/vmstat for simple tests).
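If you want a rough check from inside the test JVM as well, the standard java.lang.management API exposes the system load average. The sketch below flags a host as CPU-constrained when load per core exceeds a threshold. The class name HostLoadCheck and the threshold are my own assumptions, and this only looks at CPU; disk and network still need something like iostat/vmstat.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper; only the OperatingSystemMXBean calls are standard JDK API.
final class HostLoadCheck {
    // True when the load average per core exceeds maxLoadPerCore.
    // A negative load average means the platform does not report one,
    // so no judgment is possible and we return false.
    static boolean isCpuConstrained(double loadAverage, int cores, double maxLoadPerCore) {
        if (loadAverage < 0 || cores <= 0) {
            return false;
        }
        return loadAverage / cores > maxLoadPerCore;
    }

    // Applies the check to the machine this JVM is running on.
    static boolean isThisHostConstrained(double maxLoadPerCore) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        return isCpuConstrained(
            os.getSystemLoadAverage(), os.getAvailableProcessors(), maxLoadPerCore);
    }
}
```

A threshold of about 1.0 per core is a common rule of thumb rather than a hard limit. If the check trips during a run, treat the results from that run as suspect.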


Anti-pattern 3: Multi-Node, Load Only One

Description

Testing with multiple nodes but sending load/work to only one of them, leaving the others hanging out doing little or nothing.

Problem

Depending on the distributed computing architecture chosen, the nodes that are not receiving load may actually be doing a lot of work. If that's the case, loading only one of the nodes gives a false sense of performance. Also, in some cases data is lazily loaded into nodes, so putting load on only one node could put you in the same boat as the single-node tester, where no actual clustering is happening.

Solution

When testing clustering software, make sure you are throwing load at all nodes.
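The simplest way to do that from a load driver is to rotate requests across every node. Here is a minimal round-robin sketch. The class name RoundRobinNodes and the use of plain strings for node addresses are assumptions for illustration, and a real test may want weighted or randomized distribution instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical load-driver helper: hands out target nodes in rotation
// so that every node in the cluster receives a share of the load.
final class RoundRobinNodes {
    private final List<String> nodes;
    private final AtomicLong counter = new AtomicLong();

    RoundRobinNodes(List<String> nodes) {
        if (nodes.isEmpty()) {
            throw new IllegalArgumentException("at least one node is required");
        }
        this.nodes = new ArrayList<>(nodes); // defensive copy
    }

    // Safe to call from many load-generating threads at once.
    String next() {
        long ticket = counter.getAndIncrement();
        int index = (int) Math.floorMod(ticket, (long) nodes.size());
        return nodes.get(index);
    }
}
```

Each load-generating thread calls next() before sending a request, so over any long run the nodes see roughly equal traffic.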


Be sure to check back soon as the next few anti-patterns will cover the data aspects of distributed performance testing...

2 comments:

  1. Steve,

    I enjoyed your three articles and wrote a post highlighting them. Are you planning something for the promised fourth part? (Or is this your equivalent of the Traveling Wilburys catalog? ;-)

    --Chris

  2. Yep, Part 4 is coming. I have whipped up some code but I have to write it up. Takes more time than the average blog :-). Sorry for the delay.
