Load test observability

For the last few sprints I've been involved in load testing a high-load distributed system, and I ran into several problems related mostly to the test infrastructure rather than to the system under test. Let me explain what I mean and how I am going to fix it.

The system under test can be described by the following scheme: HTTP request -> JVM app -> load data from the cache, or fetch it from a third-party system on a cache miss -> return the response to the user.
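In code terms, the middle step is a classic cache-aside lookup. Here is a minimal sketch of it; the class and member names are hypothetical stand-ins, not the real application's API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical cache-aside handler mirroring the request path above.
public class CacheAsideHandler {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> thirdPartyClient; // stand-in for the mocked system

    public CacheAsideHandler(Function<String, String> thirdPartyClient) {
        this.thirdPartyClient = thirdPartyClient;
    }

    public String handle(String key) {
        // Cache hit: answer without touching the third-party system.
        String cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        // Cache miss: call downstream (WireMock during load tests) and remember the result.
        String fresh = thirdPartyClient.apply(key);
        cache.put(key, fresh);
        return fresh;
    }
}
```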

WireMock problem

During load tests we usually mock the third-party system with WireMock and run the tests against the Java applications directly, bypassing the public API gateway. During the last load session I found that the Java application <=> WireMock response time was 3 seconds at maximum, and the 95th percentile was 2 seconds! After hours of debugging WireMock, rerunning tests again and again, and digging into thread dumps and heap dumps, I found out that the max open files limit on WireMock's Linux machine was too low (you can check the system-wide limit with cat /proc/sys/fs/file-max).

You might ask: why should I care about a file limit? In Linux, basically everything is a file, and TCP connections are files too, so WireMock was struggling to open new connections because of the file descriptor limit. The other issue was plain misconfiguration: the request journal was enabled, so a large part of WireMock's heap was taken up by recorded request data. After raising the file descriptor limit, disabling the request journal, and increasing WireMock's heap, I ended up with a 95th percentile request time below 300 ms, which is acceptable in my case.
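For reference, here is a minimal sketch of the journal fix when WireMock is started programmatically through its Java API. The port is a placeholder; the heap is sized on the JVM itself (e.g. with -Xmx), and the per-process file descriptor limit is raised at the OS level (e.g. via ulimit -n):

```java
import com.github.tomakehurst.wiremock.WireMockServer;

import static com.github.tomakehurst.wiremock.core.WireMockConfiguration.options;

public class LoadTestMockServer {
    public static void main(String[] args) {
        // Disabling the request journal stops WireMock from keeping every
        // received request in memory, which otherwise eats the heap under load.
        WireMockServer server = new WireMockServer(options()
                .port(8080)                 // placeholder port
                .disableRequestJournal());
        server.start();
    }
}
```

The standalone JAR has the same switch as the --no-request-journal command line flag.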

I had spent so much time on this that I didn't want to dig into the details blindly again. I decided that WireMock needs to be monitored the same way the application under test is monitored, so I created the wiremock-metrics extension, which exposes JVM and WireMock metrics in Prometheus format. In combination with node_exporter, I now have all the information about CPU, memory, the JVM, file limits, and so on. If you use a mock server in your load tests and don't collect any metrics from it, I highly encourage you to start: problems can hide in the mock server infrastructure rather than in the application under test itself.
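To give a feel for what exposing JVM metrics in Prometheus format involves, below is a generic sketch using the Prometheus Java simpleclient libraries; this is not the wiremock-metrics source, and the port is a placeholder:

```java
import io.prometheus.client.exporter.HTTPServer;
import io.prometheus.client.hotspot.DefaultExports;

public class MetricsEndpoint {
    public static void main(String[] args) throws Exception {
        // Register the standard JVM collectors: memory, GC, threads,
        // and process stats such as open file descriptors.
        DefaultExports.initialize();
        // Serve the metrics on /metrics so Prometheus can scrape them.
        HTTPServer metrics = new HTTPServer(9404); // placeholder port
    }
}
```

Scraping this endpoint together with node_exporter on the same machine gives you both the JVM view (heap, GC, the process's open file descriptors) and the OS view (CPU, memory, file limits).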

Load agent problem

After tuning WireMock, I noticed on the application's Grafana dashboard that the request rate was slowly decreasing over the 2-hour load test session. I found nothing in the application logs and metrics, so I suspected the problem was once again in the test infrastructure rather than in the application under test.

We use Gatling in our load tests with the constantUsersPerSec injection profile. It turned out that after some time Gatling was struggling to inject new users because of the same max file descriptors limit and a low heap configured for the Gatling JVM. So the load agent should be monitored as well. I haven't implemented this yet, but the approach will be similar to the WireMock solution: node_exporter for the Linux machine and a Prometheus Java agent (such as the JMX exporter) for Gatling's JVM.
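For context, an open injection profile like ours looks roughly like this in Gatling's Java DSL; the rate, duration, scenario name, and base URL below are placeholders, not our real test:

```java
import static io.gatling.javaapi.core.CoreDsl.*;
import static io.gatling.javaapi.http.HttpDsl.*;

import java.time.Duration;

import io.gatling.javaapi.core.ScenarioBuilder;
import io.gatling.javaapi.core.Simulation;

public class ConstantRateSimulation extends Simulation {

    ScenarioBuilder scn = scenario("constant-rate")   // placeholder scenario
            .exec(http("request").get("/"));

    {
        // A steady 50 new users per second for 2 hours. Every injected user
        // opens sockets, so the agent's file descriptor limit and JVM heap
        // must be sized for the whole session, not just the first minutes.
        setUp(scn.injectOpen(constantUsersPerSec(50).during(Duration.ofHours(2))))
                .protocols(http.baseUrl("http://app-under-test:8080")); // placeholder URL
    }
}
```

Hooking this JVM into Prometheus is then a matter of adding the exporter's -javaagent flag to the Gatling JVM options, next to a properly sized -Xmx.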