VSE Performance Tuning Guidelines


Article ID: 48518


Updated On:


CA Application Test, Service Virtualization, CA Continuous Application Insight (PathFinder)


This article summarizes common performance tuning steps and techniques to help debug VSE performance issues.





Component: ITKOVS



Preparing your machine for production/benchmarking


The number one performance constraint is the VSE and client-side logging setup. Under high load, logging can significantly slow performance. Logging is controlled in LISA_HOME/logging.properties; for production or benchmarking, ensure the VSE log level is set to WARN rather than INFO. WARN-level logging still alerts you to serious problems, such as heap sizing issues on the servers, without the per-transaction logging cost.

Recommended logging.properties configuration

The most important line is
log4j.logger.VSE=WARN, VSEAPP

logging.properties file

# see http://logging.apache.org/log4j/docs/index.html and http://logging.apache.org/log4j/docs/manual.html

# This file is polled regularly by LISA applications at runtime so you can add new DEBUG loggers
# or change system-wide logging without restarting LISA apps.


# The following lines adjust the log levels of third-party libraries used by LISA so that
# they don't clutter the logs with messages unrelated to LISA.

log4j.appender.A1.layout.ConversionPattern=%d [%t] %-5p %-30c - %m%n

# Keep a separate log for VSE transaction match/no-match events; this makes debugging much easier.
# Change INFO below to WARN for production systems: the logging is expensive and will slow down
# systems with high transaction rates. Do not simply comment out the following line; explicitly
# set the log level to OFF or WARN instead of INFO.
log4j.logger.VSE=INFO, VSEAPP
# If you want to add VSE logging to other log destinations, comment out the next line
log4j.appender.VSEAPP.layout.ConversionPattern=%d [%t] %-5p - %m%n

# If periodic thread dumps are enabled, this is where they will be sent. Note that the dumps are
# logged at INFO level, so to actually get them, change WARN in the next line to INFO. With LISA
# servers and/or a workstation running, you should get a thread dump in the named file within
# 30 seconds. Search for 'threadDump' in lisa.properties for more info. This makes it very simple
# to get a thread dump to debug performance issues: change WARN in the next line to INFO, wait a
# minute or two, then change it back to WARN to stop the dumps.
# You can also generate a point-in-time thread dump with the LISA_HOME/bin/ServiceManager application
# e.g. ServiceManager -threadDump tcp://hostname:2014/Simulator or use standard java tools such as jstack
log4j.logger.threadDumpLogger=WARN, THREAD_DUMPS
log4j.appender.THREAD_DUMPS.layout.ConversionPattern=%d [%t] %-5p - %m%n
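The point-in-time alternatives mentioned in the comments above (ServiceManager or standard JDK tools) can be scripted. This is a minimal sketch using jps/jstack; the 'vse' filter pattern and the output file name are illustrative assumptions, so adjust them for your installation.

```shell
# Capture a point-in-time thread dump of the VSE using standard JDK tools.
# The 'vse' filter and output file name are assumptions, not fixed names.
VSE_PID=$(jps -l 2>/dev/null | grep -i 'vse' | awk '{print $1}' | head -n 1)
if [ -n "$VSE_PID" ]; then
  jstack "$VSE_PID" > vse_threaddump.txt
  echo "Thread dump written to vse_threaddump.txt"
else
  echo "No VSE JVM found"
fi
```

The equivalent LISA-native command, per the comment above, is `ServiceManager -threadDump tcp://hostname:2014/Simulator`.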

VSE production settings

VSE is configured to clear its database cache every 10 seconds. This is a reasonable trade-off in a development environment where service images are being constantly updated but when you are benchmarking or going into production, you should tune this value to reflect the static nature of the service image database.

Preparing your VSE instance

LISA_HOME/lisa.properties describes the setting, and it's best to override it in LISA_HOME/local.properties as follows:

lisa.eclipselink.cache.timeout.ms=3600000 (5.x only, 6.x does not use a database)

By default, EclipseLink objects are cached for no longer than 10 seconds. This means that changes made to the service image database via the service image editor will propagate to a running virtual service model within 10 seconds (5 seconds on average), assuming the object is already cached by the VSE. If the object is not already cached, the propagation is immediate. 10 seconds is a reasonable trade-off in a development environment where you are making changes to service images. In a production environment with a heavily loaded VSE server, consider raising this value to at least a minute (60000 ms) or even an hour (3600000 ms) to reduce the number of SQL queries issued and thus increase performance.


It is sometimes worth increasing the maximum size of the internal event queue in local.properties. This queue defaults to 8192 entries and, under extreme load, can become a bottleneck; you would normally leave it alone. Currently, the only way to see whether it is the bottleneck is to run jstack against the VSE and count how many threads are blocked inside the EventHandler class waiting for the queue to be drained.
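The jstack check described above can be sketched as follows; the process-name pattern is an assumption and may need adjusting for your installation.

```shell
# Count VSE threads blocked inside the EventHandler class, waiting on the
# internal event queue. 'VirtualServiceEnvironment' is an assumed process-name
# pattern; the bracketed first letter keeps pgrep from matching this script itself.
VSE_PID=$(pgrep -f '[V]irtualServiceEnvironment' | head -n 1)
if [ -n "$VSE_PID" ]; then
  # A thread's State line precedes its stack frames, so look a few lines back.
  jstack "$VSE_PID" 2>/dev/null | grep -B 4 'EventHandler' | grep -c 'BLOCKED' || true
else
  echo "VSE process not found"
fi
```

A consistently high count under load suggests the event queue, rather than CPU, is the bottleneck.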


Disable Pathfinder. It adds noticeable overhead, especially when you are going for maximum throughput.

lisa.CycleExecHistory.buffer.size=2 (6.x does this for you so this is a 5.x thing only)

Reduce the amount of history that each VSE worker thread keeps. This reduces heap requirements and saves CPU.
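Taken together, the 5.x production overrides discussed above might look like this in local.properties (values are the examples from this article; the event queue property is not shown because its name is not given here):

```properties
# LISA_HOME/local.properties -- example production overrides (5.x; see notes above)

# Cache EclipseLink objects for an hour instead of the 10-second default
lisa.eclipselink.cache.timeout.ms=3600000

# Keep minimal per-thread execution history (6.x does this automatically)
lisa.CycleExecHistory.buffer.size=2
```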

Model settings

Virtual Service Models should, in general, be deployed to production with 0% think time and a Capacity of more than 1 (the default) but probably no more than 10 to 20, depending on how many other service models the VSE has deployed.

The Execution Mode should be set to "Most Efficient"; see also the discussion of recording the original service model (more flexible versus more efficient). A "most efficient" model is almost twice as fast as a "more flexible" model.

Capacity here really means how many threads there are to service requests for this service model. The screen capture above shows 20, but that service is running on a 32-way machine. A good rule of thumb for x86 machines is capacity = two times the number of cores. It is highly unlikely, unless the virtual service is I/O bound, that adding capacity beyond this will result in increased throughput: the service will normally (should normally) be CPU-bound.
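The rule of thumb above is simple arithmetic; this sketch computes a starting Capacity from the core count of the current host (the `nproc` fallback to 1 is a portability assumption):

```shell
# Rule of thumb from above: starting Capacity = 2 x CPU cores on x86.
CORES=$(nproc 2>/dev/null || echo 1)
CAPACITY=$((2 * CORES))
echo "Cores: $CORES, suggested starting Capacity: $CAPACITY"
```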

Under the hood

VSE works much the same way as most modern, high performance applications servers do. Each service has an endpoint, typically a network socket. VSE has one Selector Thread per service socket. The Selector Thread uses a Java "SelectableChannel" to very efficiently process inbound and outbound network IO. Inbound IO is handed to a thread pool bound to the service. The size of the thread pool is fixed by the service Capacity.

In practice, it is very difficult to saturate the thread pool with standard models and a capacity of 10, so you might want to start with a Capacity of only 5. Much work has gone into making the VSE execution path extremely "tight" and it is generally only when logging, XML parsing or call-outs to other systems get involved that the VSE request/response cycle is not CPU bound.

Of course, if you really do want to model the original service think time, then the Think Scale setting will be 100% and the worker threads will put the response on the DelayedResponseQueue and then return to the pool. But this document is about maximizing throughput, so you want a Think Scale of 0.

LISA load test clients

If you are using LISA to actually drive the load on the VSE (and there are many good reasons to do this), then you must ensure that your client side does not become the bottleneck too quickly. Unless you have hand-crafted Java load clients, no matter which framework you use, the client side eventually gets in the way unless you spread the load generators over several physical machines.

There are a few simple things you can do in your LISA staging doc and test case to ensure the client side does not slow down after only a few vusers.

  • If you must have a report in the staging doc, make it a LoadTest report. The default LISA report will kill you under load, it simply collects too much data and hits the reports DB too hard. It's a great report for functional testing but it's no good in performance tests.

  • Do not use global data sets in a performance test. They are effectively "remote singletons" and every instance will line up to access these remotely. When your vuser calls the singleton it also (potentially) serializes the entire TestExec state across the wire in case the data set refers to testexec properties. At the very least there is a small subset of the testexec that is serialized as part of the call, so do not use global data sets.

  • Use TestRunner to launch the test. There is less test event overhead. If you use Workstation and stage either locally or remotely make sure the Events tab in the GUI is set for the terse filter; otherwise your test will spend more time sending events to the coordinator and workstation than it does hitting the server.

  • Make sure you give your simulator VMs plenty of headroom. Out of the box they get 512m of heap: it is usually a good idea to do -Xms1024m -Xmx1024m.

  • Make sure you are using the Hotspot Server VM. This is the out of the box setting so you should be using it. Verify by looking in the various log files, early in the piece:
     vse_log.log:2009-08-31 19:50:11,641 [main] INFO com.itko.lisa.test.Environment - JVM info:
     java.vm.info=mixed mode,
     java.vm.name=Java HotSpot(TM) Server VM,
     java.vm.specification.name=Java Virtual Machine Specification,
     java.vm.specification.vendor=Sun Microsystems Inc.,
     java.vm.vendor=Sun Microsystems Inc.,
  • Design your LISA test such that it takes "a while" (but not too long) to execute. A test with a single step and several hundred vusers pounding it will spend most of its CPU cycles on setup and teardown, not generating the client load. Usually the easiest way to achieve this is to use a Counter data set to loop the test and finish at the end of the data set (say 1 to 10,000 in increments of 1).

Tuning Garbage Collection

JVM garbage collection has been observed to cause VSE performance issues, with responses sometimes taking several seconds. To troubleshoot and resolve this, add the following to VSE's vmoptions file. These options apply only to Sun's / Oracle's JVM.

-XX:+PrintGCDetails -XX:+PrintGCTimeStamps

Bounce VSE, and watch the output (stdout). You will see lines like this:

3603.403: [GC [PSYoungGen: 121111K->26065K(132608K)] 533555K->441076K(553280K), 0.0761500 secs] [Times: user=0.19 sys=0.00, real=0.11 secs]

3603.481: [Full GC [PSYoungGen: 26065K->0K(132608K)] [PSOldGen: 415010K->388919K(425792K)] 441076K->388919K(558400K) [PSPermGen: 55456K->55456K(59456K)], 3.6979719 secs] [Times: user=4.51 sys=0.09, real=4.70 secs]

In this case, the OldGen (Full GC) collections were causing VSE to slow down significantly: compare the 0.11-second real pause of the YoungGen-only collection with the 4.70-second real pause of the Full GC. Switching garbage collection strategies may help. Add these lines to your vmoptions file:


-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing -XX:MaxGCPauseMillis=500


Other options can be found at http://www.oracle.com/technetwork/java/gc-tuning-5-138395.html
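To see whether long Full GC pauses line up with slow responses, the captured output can be filtered. This is a minimal sketch; `gc_sample.log` is a stand-in for your captured VSE stdout, seeded with the two example lines above:

```shell
# Isolate Full GC events (and their real pause times) from captured VSE stdout.
# gc_sample.log is a stand-in for your captured console output.
cat > gc_sample.log <<'EOF'
3603.403: [GC [PSYoungGen: 121111K->26065K(132608K)] 533555K->441076K(553280K), 0.0761500 secs] [Times: user=0.19 sys=0.00, real=0.11 secs]
3603.481: [Full GC [PSYoungGen: 26065K->0K(132608K)] [PSOldGen: 415010K->388919K(425792K)] 441076K->388919K(558400K) [PSPermGen: 55456K->55456K(59456K)], 3.6979719 secs] [Times: user=4.51 sys=0.09, real=4.70 secs]
EOF
# Print only the Full GC lines; look for real= pauses of a second or more.
grep 'Full GC' gc_sample.log
```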

Surprisingly, it can also help to decrease the memory you have allocated to VSE.

Additional Information

Make sure the VS is running in Most Efficient mode, as other modes may impact performance.