Friday, July 13, 2007

More Lies - Distributed Performance Testing Anti-patterns Part 3 of 4

In part 3 out of 4 of this blog, much like parts 1 and 2 I will hit on anti-patterns that allow your performance testing of clustered and or distributed software to lie to you. I'll be following up part 3 of this blog, the last 4 anti-patterns, with a blog about a simple distributed testing framework I have begun. Hopefully enough of you will be interested, try it and maybe even contribute to it.

Anti-pattern 7:
In-memory vs. Distributed Performance Comparison

Description

Writing a test that compares the speed of adding objects to a local, in-memory data structure vs. adding objects to a clustered data structure.

Problem

To avoid suspense, I'll tell you the results of that test without running it. Adding things to a local, in-memory data structure stakes virtually no time at all. In-memory object changes happen so fast, they are hard to even measure. However, when you are making changes to a distributed data structure, no matter what, those state changes have to be shipped off to another location. This takes instructions to be executed to make this happen on top of the ones used for the original task. This isn't just slower, it is way slower. The comparison between in-memory object changes and distributed object changes is useless.

Solution

Figure out how much data you are going to be clustering and what the usage patterns of that data becoming clustered will be. Then simulate and time that. Once again, focus on total throughput with acceptable latency.


Anti-pattern 8: Ignore Real-world Cross-node Patterns

Description

Reading and writing the same data in every node.

Problem

Generally speaking, whether reading or writing, it is more expensive to access the same data concurrently across all nodes. Depending on the underlying clustering infrastructure, this can be more or less of a problem. If you are using an “everything everywhere” strategy, the performance hit of random access across all the data on all the nodes is less, but the “everything everywhere” sharing strategy generally does not scale well. Most other strategies perform better when data access is consistently read and or written from the same node

Solution

Write your performance tests in a way that allows you to set a percentage for locality of reference. Is an object accessed on the same node 80%, 90%, or 99% of the time? You should usually have some cross-node chatter, but usually not too much—although you should be as realistic to the problem you are trying to solve as possible.


Anti-pattern 9: Ignore Usage Patterns

Description

The performance test either just creates objects or just reads objects

Problem

In the real world, an application does a certain amount of reading, writing, and updating of shared objects. And those reads, writes, and updates are of certain sizes.

Solution

If your app likely changes only a few fields in a large object graph, then that is what your performance test should do. If your app is 90% read from multiple threads and 10% write from multiple threads than that is what your test should do. Make your test be true to what you need when it comes to data and usage.


Anti-pattern 10: Log Yourself to Death

Description

Last, but far from least, doing extra stuff like writing data out to a log chews up CPU. Logging too much in any performance test can render the test results meaningless.
This anti-pattern generally covers any extra CPU usage on a load-generating client that affects the performance test. In general, if one or more of your nodes is CPU bound in a cluster performance test, you likely have not maxed-out the performance of your cluster. Let me say that again, if you are resource constrained on any node, including your load generating nodes (but not including your server if one exists) then you are probably not maxing out what your cluster as a whole can handle. Investigate further.

Problem

If the individual load-generating nodes—or even the clustered nodes—are resource constrained, it is likely to create a false bottleneck in your test. You are trying to figure out the throughput of the cluster and your cluster nodes are likely busy doing other things like logging.

Solution

First, always have machine monitoring on all nodes in a performance test. Any time one of the nodes or load generators becomes resource constrained make sure you test with an additional node and see if it adds to the scale. If a node is unexpectedly resource constrained, then take a series of thread dumps (java only) and figure out where all the time is going.


Alright, that is the end of my anti-pattern list for now. I could probably come up with a few more but I'll save them for another day. The moral of this section of the blog is to be curious and skeptical with your testing results. Don't just ask what the numbers are. Find out why and you will end up a much happier person.

No comments:

Post a Comment