Over the last few months JIRA performance has gradually tanked. It's difficult to point to any particular inflexion point; it's just more users, projects and issues.
I had a number of changes in mind to help:
- Application server
- DBMS type and location
- Caching for static content
- Reducing the number of projects, groups etc.
- Optimising indexes
- Turning gzip compression on
- Memory and garbage collector settings
- Bypassing the Apache reverse proxy and going direct to Tomcat
- Using the AJP13 connector instead of (or as well as?) ProxyPass
BTW, if you're looking for instructions on where the "go-faster switch" is and can't be bothered to read the following, I'll tell you simply: replace the group jira-users in your roles or permission schemes with Anyone. If you're not using jira-users that way, then hopefully there will be something else of interest here, if only how to set up a simple load test to get some figures.
For the record our numbers at the time of testing are:
| Issues | Projects | Custom Fields | Workflows | Users | Groups |
|---|---|---|---|---|---|
| 112897 | 475 | 505 | 239 | 7613 | 698 |
Before changing anything it's important to have a benchmark. It's no good relying on user feedback: sure, they will whinge when performance is bad, but when it improves you ain't going to hear anything from them.
The weapons in my load-testing armoury were LoadRunner, JMeter, and Selenium RC. LoadRunner pissed me off right from the get-go with a 1.2 GB installation and general all-round suckiness. Selenium is very easy to use and can test AJAXy things by simulating actual browser usage, but you're limited by how many browsers you can actually run on your machine. JMeter doesn't have this issue because it just chucks HTTP packets at JIRA, and it also has nice facilities for rolling up and displaying your test results.
For complex load tests I'd use LoadRunner, for regression testing Selenium, but for a non-QA professional with a simple load test, JMeter does just fine.
In testing, your job is to try to simulate activity which relates to real-world conditions. So just hammering a single page is unlikely to yield useful results… filesystem caching, hotspot compilation and other things are going to give you distorted results, and suggest changes which may not benefit your real-life usage.
We use Apache as a reverse proxy as there are five different jira instances on the box. So I took the access logs for a recent three-day period, and sorted by request type and frequency. From this I collected a list of URLs for the most frequent queries, eg Issue Navigator requests, browse project, view issues etc.
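The log-mining step can be sketched like so. This is an illustrative reconstruction, not my actual script: the regex targets the standard Apache combined log format, and the sample lines and URLs are made up.

```python
import re
from collections import Counter

# Pull the most frequently requested URLs out of an Apache combined-format
# access log, so they can be fed to JMeter's CSV Data Set Config.
REQUEST = re.compile(r'"(GET|POST) (\S+) HTTP/[\d.]+"')

def top_urls(lines, n=70):
    counts = Counter()
    for line in lines:
        m = REQUEST.search(line)
        if m and m.group(1) == "GET":   # only replay read requests
            counts[m.group(2)] += 1
    return [url for url, _ in counts.most_common(n)]

sample = [
    '10.0.0.1 - - [01/Jan/2008:10:00:00 +0000] "GET /browse/PROJ-1 HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2008:10:00:01 +0000] "GET /browse/PROJ-1 HTTP/1.1" 200 512',
    '10.0.0.3 - - [01/Jan/2008:10:00:02 +0000] "POST /secure/CreateIssue.jspa HTTP/1.1" 302 0',
]
print(top_urls(sample))
```

The resulting list goes straight into the CSV file that the second thread group rotates through.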
The JMeter test looks like so. There are two thread groups which run concurrently. Both groups are set to loop forever, i.e. until I stop the test.
The first one simulates a user going to the dashboard, then creating an issue.
The second thread group (ignore the disabled Static Content one) simulates users logging in, then hitting a particular URL, be it an issue, a filter, browsing a project, and so on. The CSV Data Set Config has a list of 60 or 70 frequently used URLs that I extracted from the Apache access log. The first thread selects the first URL in the CSV file, the second selects the second and so on, and then rotates.
The first group has three threads (i.e. users) that ramp up with a 3-second delay. The second group has 10 users with a 5-second delay.
When run, this yields, very approximately, a similar proportion of readers to writers as is seen in our production system. It's a fairly crude approximation, but I think it's good enough to give a starting point for tweaking things.
The full JMX file can be found here.
Hint: If you are having problems or are not sure your test is running properly, then set a thread group to run with a single thread, and have it proxy through Fiddler, by starting it thus:
javaw -jar ApacheJMeter.jar -H localhost -P 8888
When you start running the tests you might notice a high degree of entrainment in the first thread group – even if you stagger the threads heavily, pretty soon they all get the same pages at the same time.
If your JIRA uses kerberos or NTLM authentication it may be as well to disable this and fall back to forms-based authentication, to keep things simple.
For your benchmark, you should use a copy of the production database. If I had used a blank database I wouldn't have come to my key finding. Also, it goes without saying that you should test on similar hardware for web and database server.
For my benchmark I'm using the same web server as production, SQL Server 2005, and a copy of the production database. Don't forget to turn email off (-Datlassian.mail.popdisabled=true -Datlassian.mail.senddisabled=true).
After running the test for 5 minutes or so, the Summary Report listener provides the following information:
| Label | # Samples | Average (ms) | Min (ms) | Max (ms) |
|---|---|---|---|---|
| Create Choose type | 12 | 15452 | 8320 | 23219 |
The key figure is the average request time across all sample types, and is the main figure I use for comparison going forward. Yes, that's simplistic, but probably "good enough" for my purposes. So we're looking at a 10-second average request time when the system is under load. That is fucked up.
The "Get Issue" label is actually getting one of the 60 or 70 URLs as mentioned above, which takes 9.3 seconds on average. Getting a dashboard and hitting the Create Issue link is truly bad, at over 30 seconds.
JMeter will also graph the results for you in various ways, but I didn't find this as informative as looking at the simple summary.
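To be explicit about what I'm comparing: that overall average is just the sample-count-weighted mean of the per-label averages, which is what JMeter's summary TOTAL row reports. A toy calculation, with figures that are illustrative rather than taken from the real report:

```python
def overall_average(rows):
    """Sample-count-weighted mean of per-label average times, in ms."""
    total = sum(n for n, _ in rows)
    return sum(n * avg for n, avg in rows) / total

# (samples, average ms) per label -- made-up figures
rows = [(12, 15452), (120, 9300), (30, 4000)]
print(round(overall_average(rows)))
```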
Now, I expected the bottleneck to be either the database, or filesystem access to the Lucene indexes. In fact it turns out that JIRA was CPU-bound during this test:
This is an HP DL385 with two dual-core AMD 2.2 GHz processors, so not too short of horsepower.
Slightly tangentially, I've also noticed that the Filter List Plugin for JIRA is extremely slow, at around 12 seconds per portlet. As it's not lazy-loading, it makes for a very bad end-user experience.
So the next weapon in your armoury should be a profiler… I'm using JProfiler. Profiling shows that much time is spent creating a list of projects that the current user has the BROWSE permission for. In fact this is required for almost every transaction. The list is cached, but only for the duration of the transaction, i.e. the page request.
The hotspots show:
Now, a bit more background. By default we set projects up with our standard role-based permissions scheme, so project admins can change roles themselves. Anything else is unmanageable from an admin-maintenance perspective with our volume of users.
By default we set projects up so they are world-browseable, in the spirit of information sharing. To do this we add the group jira-users to the Users role. If the project admins want a private project they can remove that group and replace it with users or other groups.
So my guess here is that the overhead is in checking if user X is in the group jira-users, for every project using this role-based scheme.
The nice thing about having a load test is you can make a guess, change something, then rerun the load test, as opposed to stepping through the code and proving it to yourself. I created a permission scheme that uses the group called Anyone for the BROWSE permission, changed all 450-odd projects to this scheme, and reran the test.
Results using the "Anyone" Scheme
| Label | # Samples | Average (ms) | Min (ms) | Max (ms) |
|---|---|---|---|---|
| Create Choose type | 6 | 3781 | 3054 | 4420 |
The average request time is 1.6 seconds. Now we're cooking. I think this is where I came up with the 6.5 times figure for improvement in the title.
Of course, there were many private projects, so I couldn't do this in production. However, I could change the scheme of all projects that were using the role-based scheme and also had jira-users in the Users role, which was around 260. I did this using the Groovy script in the appendix. Doing this gave a 4.5-times speed boost, and there is more tuning to do; still some more juice to be squeezed from this fuquer.
There is a disadvantage though – to make a public project private, it's no longer sufficient to remove the jira-users group – it now needs a permission scheme change (another admin request).
As a side-note, and notwithstanding that I haven't been through the code with a fine-tooth comb, it doesn't appear to be terribly efficient. The permissions list is regenerated for every transaction. Why not have a global cache of permissions by project by user, populated on JIRA startup and modified when groups, roles or schemes change? That would be a bitmap of size: active users * projects * permissions = approx 43 million in our case.
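As a sanity check on that figure (the per-project permission count of 12 is my assumption; JIRA has roughly that many permission types):

```python
users, projects, permissions = 7613, 475, 12  # permission count is a guess
bits = users * projects * permissions
print(bits)                      # ~43 million bits
print(bits / (8 * 1024 * 1024))  # size in MiB if stored as a raw bitmap
```

So even as a naive uncompressed bitmap it's only around 5 MiB of heap, which seems a small price for skipping the per-request regeneration.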
If you don't have access to a profiler you can get some way there using the in-built profiling option. For instance, the before timing for the permissions query is:
PERMISSIONS: 507 keys (4 unique) took 1004ms/1142ms : 87.91594% 251ms/query avg.
After changing schemes:
PERMISSIONS: 478 keys (1 unique) took 0ms/12359ms : 0.0% 0ms/query avg.
It turns out com.atlassian.jira.security.type.GroupDropdown#hasPermission has a little quick get-out for the Anyone group.
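A toy model of why this matters (the names and data structures here are mine, not JIRA's): the Anyone check can return immediately, while a group-based check costs a membership lookup for every project in the loop.

```python
GROUP_MEMBERS = {"jira-users": {"alice", "bob"}}  # stand-in user directory

def has_permission(group, user):
    # Model "Anyone" as group=None: grant with no lookup at all,
    # mirroring the early return in GroupDropdown#hasPermission.
    if group is None:
        return True
    return user in GROUP_MEMBERS.get(group, set())

print(has_permission(None, "whoever"))        # Anyone: no lookup needed
print(has_permission("jira-users", "alice"))  # group: membership lookup
```

Multiply the lookup cost by 475 projects per request and the profiler hotspot makes sense.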
Disabling Static Content
Each request, even if it results in a 304 (Not Modified) response, still causes work for Tomcat.
| Label | # Samples | Average (ms) | Min (ms) | Max (ms) |
|---|---|---|---|---|
| Create Choose type | 12 | 353 | 96 | 987 |
Now, this is 21% faster than the previous run, but that could be down to the way JMeter works. My reasoning for including this variable is that if what is effectively static content could be served up by something that handles it well, like Apache, or maybe GlassFish (the Grizzly engine is supposed to be an Apache-killer with the new Java I/O system), then you might see numbers like these even while serving the static content.
Something to look at another day anyway.
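If I do come back to it, the sort of thing I have in mind looks like this (paths and URLs are illustrative; needs mod_proxy, mod_alias and mod_expires):

```apache
# Serve JIRA's static resources straight from disk instead of Tomcat.
Alias /jira/images /var/www/jira-static/images
<Directory /var/www/jira-static>
    ExpiresActive On
    ExpiresDefault "access plus 1 day"
</Directory>

# The exclusion must come before the broader ProxyPass.
ProxyPass /jira/images !
ProxyPass /jira http://localhost:8080/jira
```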
Using Remote Postgres
This is using the same copy of the production database, includes the permissions changes made earlier, but getting static content again.
| Label | # Samples | Average (ms) | Min (ms) | Max (ms) |
|---|---|---|---|---|
| Create Choose type | 12 | 744 | 454 | 1117 |
Disappointingly, this was slower than with SQL Server. No doubt it would be faster using a co-hosted database, but perhaps not worth the headache of departing from a corporate standard. Also, many users write reports directly against a database view to overcome shortcomings in JIRA filtering and reporting, and the SQL dialect used by SQL Server is more widely known here than PostgreSQL's.
Memory and Garbage Collector Settings
Currently we have "-XX:MaxPermSize=128m -Xms512m -Xmx4096m".
I tried small, medium and large heaps. Gratifyingly, our current settings came out best. I have heard that setting a large heap can cause poor performance, but I cannot find any documentary evidence for this. The same is said of setting a large thread stack size, but we don't have that set.
Nor did I see evidence that starting with a smaller heap and allowing it to grow caused problems, though perhaps the differences were too small to show up.
I didn't do much with the GC. I'm of the opinion that at least in Java 5 and above it's going to make better choices than you will, and dicking with it is likely to cause more harm than good.
Enabling Gzip Compression
No statistically significant difference, although there might have been had I been running the test from a remote site. I can't decide whether there is a net benefit to enabling this.
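For reference, turning it on is a small change to the HTTP connector in Tomcat's server.xml (attribute names are per Tomcat 5.5/6; the MIME list and size threshold here are just examples):

```xml
<Connector port="8080"
           compression="on"
           compressionMinSize="2048"
           compressableMimeType="text/html,text/xml,text/css,text/javascript"/>
```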
Avoiding Apache Reverse Proxy
Here I want to test going direct to the Tomcat port and not using Apache. We use Apache as a reverse proxy because we have five different JIRA instances on the same box, with virtual host directives that proxy requests through to the correct port. This lets us avoid using port numbers in URLs, which look untidy and tie you to those ports forever.
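For context, each instance gets a name-based virtual host along these lines (hostname and port are made up):

```apache
<VirtualHost *:80>
    ServerName jira-dev.example.com
    ProxyPass        / http://localhost:8081/
    ProxyPassReverse / http://localhost:8081/
</VirtualHost>
```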
| Label | # Samples | Average (ms) | Min (ms) | Max (ms) |
|---|---|---|---|---|
| Create Choose type | 9 | 606 | 411 | 882 |
11% quicker is probably worth a closer look. If you only have a single instance and are using Apache just because you don't want to run the app server as root, there is probably a better way.
Conclusions
90% of the benefit came from the permission scheme changes, which fortunately don't seem to require any code changes. Atlassian really should look at this area, or perhaps document how to set things up "properly" if we are not doing the right thing.
And creating a load test, even a simplistic one, is essential before trying stuff out.
Next steps are to do load testing with Scarlet/Terracotta, and the commercial clustering solution.
Appendix: Change Permission Schemes
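The original script was Groovy against JIRA's internal API; since I can't reproduce it verbatim here, this is a plain-Python sketch of the selection logic it implemented: find every project on the standard role-based scheme whose Users role contains jira-users, as those are the ones safe to re-point at the Anyone scheme. All names and data structures are illustrative.

```python
def projects_to_switch(projects, role_members):
    """Projects on the role-based scheme whose Users role holds jira-users."""
    return [p["key"] for p in projects
            if p["scheme"] == "Standard Role-Based Scheme"
            and "jira-users" in role_members.get((p["key"], "Users"), set())]

# Illustrative data: one world-browseable project, one private project,
# and one on a different scheme entirely.
projects = [
    {"key": "PUB",  "scheme": "Standard Role-Based Scheme"},
    {"key": "PRIV", "scheme": "Standard Role-Based Scheme"},
    {"key": "OTH",  "scheme": "Some Other Scheme"},
]
role_members = {
    ("PUB", "Users"):  {"jira-users"},
    ("PRIV", "Users"): {"top-secret-group"},
}
print(projects_to_switch(projects, role_members))
# Each returned project would then be re-pointed at the "Anyone" scheme.
```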