How to make JIRA 6 1/2 times faster

Over the last few months JIRA performance has tanked, but gradually. It's difficult to think of any particular inflexion point, just more users, projects and issues.

I had a number of changes in mind to help:

  • Application server
  • DBMS type and location
  • Caching for static content
  • Reducing numbers of projects, groups etc.
  • Optimising indexes
  • Turning gzip compression on
  • Memory and garbage collector settings
  • Bypassing the Apache reverse proxy and going direct to Tomcat
  • Using the AJP13 connector instead of (as well as?) ProxyPass

BTW, if you're looking for the instructions on where the "go-faster switch" is and can't be bothered to read the following, I'll tell you simply that you should replace the group jira-users in your roles or permission schemes with Anyone. If you're not using that then hopefully there will be something else of interest here, if only how to set up a simple load test to get some figures.

For the record our numbers at the time of testing are:

Issues 112897   Projects 475   Custom Fields 505   Workflows 239   Users 7613   Groups 698

Load Testing

Before changing anything it's important you have a benchmark. It's no good relying on user feedback. Sure, they will whinge when performance is bad, but when it improves you ain't going to hear anything from them.

The weapons in my load testing armoury were LoadRunner, JMeter, and Selenium RC. LoadRunner pissed me off right from the get-go with a 1.2 GB installation and general all-round suckiness. Selenium is very easy to use and can test AJAXy things by simulating actual browser usage, but you're limited by how many browsers you can actually run on your machine. JMeter doesn't have this issue because it just chucks HTTP requests at JIRA, and it also has nice facilities for rolling up and displaying your test results.

For complex load tests I'd use LoadRunner, for regression testing Selenium, but for a non-QA professional with a simple load test, JMeter does just fine.

In testing, your job is to try to simulate activity which relates to real-world conditions. So just hammering a single page is unlikely to yield useful results… filesystem caching, hotspot compilation and other things are going to give you distorted results, and suggest changes which may not benefit your real-life usage.

We use Apache as a reverse proxy as there are five different jira instances on the box. So I took the access logs for a recent three-day period, and sorted by request type and frequency. From this I collected a list of URLs for the most frequent queries, eg Issue Navigator requests, browse project, view issues etc.
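
The log-crunching step can be sketched roughly as follows. This is a minimal Python sketch, not the script that was actually used: it assumes the access log is in Apache's common or combined format, and the names `top_requests`, `REQUEST_RE` and `STATIC_SUFFIXES`, as well as the sample log lines, are made up for illustration.

```python
import re
from collections import Counter

# Matches the quoted request part of a common/combined log line,
# e.g. "GET /browse/ABC-1 HTTP/1.1". The log format is an assumption.
REQUEST_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+"')

# Hypothetical filter: ignore static resources so only page requests are ranked.
STATIC_SUFFIXES = (".css", ".js", ".gif", ".png", ".jpg", ".ico")


def top_requests(lines, limit=70):
    """Return the most frequent (method, path) pairs, most common first."""
    counts = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match is None:
            continue  # skip lines that do not contain a recognisable request
        path = match.group("path")
        # Compare the suffix of the path without its query string.
        if path.split("?", 1)[0].lower().endswith(STATIC_SUFFIXES):
            continue
        counts[(match.group("method"), path)] += 1
    return counts.most_common(limit)


if __name__ == "__main__":
    # Made-up sample lines; in practice you would iterate over the log file.
    sample = [
        '10.0.0.1 - - [01/Jan/2009:10:00:00 +0000] "GET /browse/ABC-1 HTTP/1.1" 200 5120',
        '10.0.0.2 - - [01/Jan/2009:10:00:01 +0000] "GET /styles/global.css HTTP/1.1" 304 -',
        '10.0.0.3 - - [01/Jan/2009:10:00:02 +0000] "GET /browse/ABC-1 HTTP/1.1" 200 5120',
    ]
    for (method, path), hits in top_requests(sample):
        print(hits, method, path)
```

The resulting paths are the sort of thing that would go into the CSV file of frequently used URLs that drives the load test.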

The JMeter test looks like so. There are two thread groups which run concurrently. Both groups are set to loop forever, i.e. until I stop the test.

The first one simulates a user going to the dashboard, then creating an issue.

The second thread group (ignore the disabled Static Content one) simulates users logging in, then hitting a particular URL, be it an issue, a filter, browsing a project, and so on. The CSV Data Set Config has a list of 60 or 70 frequently used URLs that I extracted from the Apache access log. The first thread selects the first URL in the CSV file, the second selects the second and so on, and then rotates.

The first group has three threads (i.e. users) that ramp up with a 3-second delay. The second group has 10 users with a 5-second delay.

When run, this yields, very very approximately, a similar proportion of readers to writers as is seen in our production system. Although it's a fairly crude approximation, I think this is good enough to give a starting point for tweaking things.

The full JMX file can be found here.

Hint: If you are having problems or are not sure your test is running properly, then set a thread group to run with a single thread, and have it proxy through Fiddler, by starting it thus:

javaw -jar ApacheJMeter.jar -H localhost -P 8888

When you start running the tests you might notice a high degree of entrainment in the first thread group – even if you stagger the threads heavily, pretty soon they all get the same pages at the same time.

If your JIRA uses kerberos or NTLM authentication it may be as well to disable this and fall back to forms-based authentication, to keep things simple.

     

Benchmark Results

For your benchmark, you should use a copy of the production database. If I had used a blank database I wouldn't have come to my key finding. Also, it goes without saying that you should test on similar hardware for web and database server.

For my benchmark I'm using the same web server as production, SQL Server 2005, and a copy of the production database. Don't forget to turn email off (-Datlassian.mail.popdisabled=true -Datlassian.mail.senddisabled=true).

After running the test for 5 minutes or so, the Summary Report listener provides the following information:

Label               # Samples   Average (ms)   Min (ms)   Max (ms)
Login                       3             47         14         96
LoginFBrowse               10             64         10        247
Get issue                 531           9308       2514      37488
Dashboard                  15          32710      21064      49975
LazyLoaderPortlet          15          10362       6057      15021
CreateIssue Link           13          37400      28580      49606
Create Choose type         12          15452       8320      23219
Issue Details              12          14400       8600      18720
TOTAL                     611          10530         10      49975

The Average in the TOTAL row (10530 ms) is the average request time across all sample types, and is the main figure I use for comparison going forward. Yes that's simplistic, but probably "good enough" for my purposes. So, we're looking at a 10 second average request time, when the system is under load. That is fucked up.

The "Get Issue" label is actually getting one of the 60 or 70 URLs as mentioned above, which takes 9.3 seconds on average. Getting a dashboard and hitting the Create Issue link is truly bad, at over 30 seconds.
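
Worth knowing when reading these reports: the TOTAL average appears to be the sample-weighted mean of the per-label averages, so the 531 "Get issue" samples dominate it and the handful of very slow Dashboard and Create Issue requests barely move it. A small Python check using the figures from the benchmark table (the names `BENCHMARK` and `weighted_average` are made up for this sketch):

```python
# (label, samples, average_ms) rows copied from the benchmark Summary Report.
BENCHMARK = [
    ("Login", 3, 47),
    ("LoginFBrowse", 10, 64),
    ("Get issue", 531, 9308),
    ("Dashboard", 15, 32710),
    ("LazyLoaderPortlet", 15, 10362),
    ("CreateIssue Link", 13, 37400),
    ("Create Choose type", 12, 15452),
    ("Issue Details", 12, 14400),
]


def weighted_average(rows):
    """Mean request time across all samples, weighting each label by its sample count."""
    total_samples = sum(samples for _, samples, _ in rows)
    total_time_ms = sum(samples * average for _, samples, average in rows)
    return total_time_ms / total_samples


if __name__ == "__main__":
    # Compare with the Average column of the TOTAL row (10530).
    print(round(weighted_average(BENCHMARK)))  # 10530
```

If the slow interactive pages matter more to you than the bulk reads, it is worth tracking their rows separately rather than relying on the single TOTAL figure.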

JMeter will also graph the results for you in various ways, but I didn't find this as informative as looking at the simple summary.

Now, I expected the bottleneck to be either the database, or filesystem access to the Lucene indexes. In fact it turns out that JIRA was CPU-bound during this test:

This is an HP DL385 with two dual-core AMD 2.2 GHz processors, so not too short of horsepower.

Slightly tangentially, I've also noticed that the Filter List Plugin for JIRA is extremely slow, at around 12 seconds per portlet. As it's not lazy-loading it presents a very bad end-user experience.

So the next weapon in your armoury should be a profiler… I'm using JProfiler. Profiling shows that much time is spent creating a list of projects that the current user has the BROWSE permission for. In fact this is required for almost every transaction. The list is cached, but only for the duration of the transaction, i.e. the page request.

The hotspots show:

Now, a bit more background. By default we set projects up with our standard role-based permissions scheme, so project admins can change roles themselves. Anything else is unmanageable from an admin-maintenance perspective with our volume of users.

By default we set projects up so they are world-browseable, in the spirit of information sharing. To do this we add the group jira-users to the Users role. If the project admins want a private project they can remove that and replace it with users or other groups.

So my guess here is that the overhead is in checking if user X is in the group jira-users, for every project using this role-based scheme.

The nice thing about having a load test is you can make a guess, change something, then rerun the load test, as opposed to stepping through the code and proving it to yourself. I created a permission scheme that uses the group called Anyone for the BROWSE permission, changed all 450-odd projects to this scheme, and reran the test.

Results using the "Anyone" Scheme

Label               # Samples   Average (ms)   Min (ms)   Max (ms)
Login                       3             35         16         45
LoginFBrowse               10             14          8         39
Get issue                 709           1322         20      26670
Dashboard                   9          14120       9735      16551
LazyLoaderPortlet           9            367        221        680
CreateIssue Link            9          13444      10532      15855
Create Choose type          6           3781       3054       4420
Issue Details               6           3654       2929       5013
TOTAL                     761           1621          8      26670

The average request time is 1.6 seconds. Now we're cooking. I think this is where I came up with the 6.5 times figure for improvement in the title.
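
For the record, the arithmetic behind that figure, as a tiny Python sketch using the two TOTAL averages from the Summary Reports (the function name `speedup` is just illustrative):

```python
def speedup(before_ms, after_ms):
    """How many times faster the 'after' run is, based on average request time."""
    return before_ms / after_ms


if __name__ == "__main__":
    baseline_ms = 10530  # TOTAL average with jira-users in the Users role
    anyone_ms = 1621     # TOTAL average with the "Anyone" scheme
    print(f"{speedup(baseline_ms, anyone_ms):.1f}x")  # 6.5x
```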

Of course, there were many private projects, so I couldn't do this in production. However, I could change the scheme of all projects that were using the role-based scheme which also had jira-users in the Users role, which was around 260. I did this using the groovy script in the appendix. Doing this gave a 4.5 times speed boost, plus there is more tuning to do, still some more juice to be squeezed from this fuquer.

There is a disadvantage though – to make a public project private, it's no longer sufficient to remove the jira-users group – it now needs a permission scheme change (another admin request).

As a side-note, and notwithstanding that I haven't been through the code with a fine-tooth comb, the code doesn't appear to be terribly efficient. The permissions list is regenerated for every transaction. Why not have a global cache of permissions by project by user, which could be populated on jira startup, and modified when groups, roles or schemes are changed? That would be a bitmap with size: active users * projects * permissions = approx 43 million in our case.
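
To give a feel for what such a cache could look like, here is a minimal Python sketch of a flat bitmap with one bit per (user, project, permission) triple. It is purely illustrative and has nothing to do with JIRA's actual code: the class name `PermissionBitmap`, its methods, and the assumption that users, projects and permissions have already been mapped to dense integer indices are all made up. At one bit per entry, 43 million entries would come to a little over 5 MB.

```python
class PermissionBitmap:
    """One bit per (user, project, permission); all three are dense integer indices."""

    def __init__(self, users, projects, permissions):
        self.projects = projects
        self.permissions = permissions
        self.size_bits = users * projects * permissions
        # Round up to whole bytes; every bit starts as "not granted".
        self.bits = bytearray((self.size_bits + 7) // 8)

    def _index(self, user, project, permission):
        # Row-major position of the triple in the flat bit array.
        return (user * self.projects + project) * self.permissions + permission

    def grant(self, user, project, permission):
        i = self._index(user, project, permission)
        self.bits[i >> 3] |= 1 << (i & 7)

    def revoke(self, user, project, permission):
        i = self._index(user, project, permission)
        self.bits[i >> 3] &= ~(1 << (i & 7)) & 0xFF

    def has(self, user, project, permission):
        i = self._index(user, project, permission)
        return bool(self.bits[i >> 3] & (1 << (i & 7)))


if __name__ == "__main__":
    cache = PermissionBitmap(users=3, projects=4, permissions=5)
    cache.grant(2, 3, 4)
    print(cache.has(2, 3, 4), cache.has(2, 3, 3))  # True False
```

The hard part in practice would not be the lookup but keeping the bitmap correct whenever groups, roles or schemes change, which this sketch does not attempt.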

If you don't have access to a profiler you can get some way there using the in-built profiling option. For instance, the before timing for the permissions query is:

PERMISSIONS: 507 keys (4 unique) took 1004ms/1142ms : 87.91594% 251ms/query avg. 

After changing schemes:

PERMISSIONS: 478 keys (1 unique) took 0ms/12359ms : 0.0% 0ms/query avg.  
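
If you want to compare these lines across runs without eyeballing them, they are easy to pick apart. The following Python sketch assumes the line layout is exactly as shown in the two examples above; the names `PROFILE_RE` and `parse_profile_line` are made up. Note that the percentage looks like the time spent in that category divided by the total request time (1004/1142 is 87.9%), and the 251ms/query figure matches 1004 ms divided by the 4 unique keys.

```python
import re

# Layout inferred from the two example lines; treat it as an assumption.
PROFILE_RE = re.compile(
    r"(?P<name>\w+): (?P<keys>\d+) keys \((?P<unique>\d+) unique\) "
    r"took (?P<spent>\d+)ms/(?P<total>\d+)ms"
)


def parse_profile_line(line):
    """Return the counters from one profiling line, or None if it does not match."""
    match = PROFILE_RE.search(line)
    if match is None:
        return None
    spent_ms = int(match.group("spent"))
    total_ms = int(match.group("total"))
    return {
        "name": match.group("name"),
        "keys": int(match.group("keys")),
        "unique": int(match.group("unique")),
        "spent_ms": spent_ms,
        "total_ms": total_ms,
        # Fraction of the request time spent in this category.
        "share": spent_ms / total_ms if total_ms else 0.0,
    }


if __name__ == "__main__":
    before = "PERMISSIONS: 507 keys (4 unique) took 1004ms/1142ms : 87.91594% 251ms/query avg."
    after = "PERMISSIONS: 478 keys (1 unique) took 0ms/12359ms : 0.0% 0ms/query avg."
    for line in (before, after):
        print(parse_profile_line(line))
```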

It turns out com.atlassian.jira.security.type.GroupDropdown#hasPermission has a little quick get-out for the Anyone group.

Disabling Static Content

For each HTTP sampler I unchecked the box that says "Retrieve All Embedded Resources from HTML Files", so it's not getting associated stylesheets, javascript files, images etc.

Each request, even if it results in a 304 (not modified) response still causes work for Tomcat.

Label               # Samples   Average (ms)   Min (ms)   Max (ms)
Login                       3             52         16         98
LoginFBrowse               10             33          7        118
Get issue                1089           1048         22      10981
Dashboard                  15          15706      10442      19187
LazyLoaderPortlet          12            529        283       1590
CreateIssue Link           12           6636       4629       9643
Create Choose type         12            353         96        987
Issue Details              12           2694       1709       5338
TOTAL                    1165           1288          7      19187

Now, this is 21% faster than the previous run, but this could be due to the way JMeter works. My feeling behind including this variable is that if what is effectively static content could be served up by something that handles this well, like Apache, or maybe Glassfish (the Grizzly engine is supposed to be an Apache-killer with the new java I/O system), then you might get something like these numbers even when serving the static content.

Something to look at another day anyway.

Using Remote Postgres

This is using the same copy of the production database, includes the permissions changes made earlier, but getting static content again.

Label               # Samples   Average (ms)   Min (ms)   Max (ms)
Login                       3             15         14         17
LoginFBrowse               10             26          9         85
Get issue                 595           1516         21      26446
Dashboard                  15          10982       7354      15680
LazyLoaderPortlet          12            430        221        744
CreateIssue Link           12           6978       5296       8789
Create Choose type         12            744        454       1117
Issue Details              12           2440       2005       2984
TOTAL                     671           1780          9      26446

Disappointingly, this was slower than with SQL Server. No doubt it would be faster using a co-hosted database, but it's perhaps not worth the headache of departing from a corporate standard. Also, many users write reports directly against a database view, to overcome shortcomings in jira filtering and reporting, and the SQL dialect used by SQL Server is more commonly known than PostgreSQL's.

Memory Tweaks

Currently we have "-XX:MaxPermSize=128m -Xms512m -Xmx4096m".

I tried a small, medium and large heap. Gratifyingly, our current settings came out best. I have heard setting a large heap can cause poor performance but cannot find any documentary evidence for this. This is true for setting a large thread stack size but we don't have that set.

Nor did I see evidence that starting with a smaller heap size then allowing to grow caused problems, but perhaps the differences were too small to become evident.

I didn't do much with the GC. I'm of the opinion that at least in Java 5 and above it's going to make better choices than you will, and dicking with it is likely to cause more harm than good.

Enabling Gzip Compression

No statistically significant difference, although there may have been if I had been running the test from a remote site. I can't decide whether there is a net benefit to enabling this.

Avoiding Apache Reverse Proxy

Here I want to test going direct to the Tomcat port and not using Apache. We use Apache as a reverse proxy because we have 5 different jira instances on the same box, with virtual host directives that proxy requests through to the correct port. This lets us avoid using port numbers in URLs, which look untidy and tie you to those ports forever.

Label               # Samples   Average (ms)   Min (ms)   Max (ms)
Login                       3             11         11         12
LoginFBrowse               10             27          7        181
Get issue                 572           1206         21      22610
Dashboard                  11          13310       9551      15942
LazyLoaderPortlet           9            481        407        617
CreateIssue Link            9           6685       5401       7713
Create Choose type          9            606        411        882
Issue Details               9           1616       1313       1911
TOTAL                     632           1457          7      22610

11% quicker is probably worth a closer look. If you only have a single instance and are using Apache just because you don't want to run the app server as root, there is probably a better way.

Conclusion

90% of the benefit came from the permission scheme changes, which fortunately don't seem to require any code changes. Atlassian really should look at this area, or perhaps document how to set it up "properly" if we are not doing the right thing.

And creating a load test, even a simplistic one, is essential before trying stuff out.

Next steps are to do load testing with Scarlet/Terracotta, and the commercial clustering solution. 

Appendix

Change Permission Schemes 

16 comments to How to make JIRA 6 1/2 times faster

  • That was a great article to find waiting in my feeds today, thanks! Thorough, well-explained and with a very useful result.

    When I saw the 5 digit time values, I thought “that can’t be milliseconds, his users would be revolting!”. Then I remembered what the article was about. For what it’s worth, I’ve never seen a JIRA as slow as that.

    Just to spell it out for others reading this article, the Anyone group in JIRA is a standard group, so you don’t have to define it. The explicit steps to do what Jamie did are:

    1. Create a new permission scheme or copy an existing one
    2. In the Browse Projects row, if there is an entry “jira-users” then delete it
    3. Click Add and select Group, Anyone

    I see that Create Issue is still at 13s which is ridiculous. I’m really hoping someone from Atlassian will comment here or in their Developer’s blog at some point.

    As an admin aside, please could you post or refer to how you run that Groovy script you appended?

    ~Matt

  • anton@atlassian.com

    Hi Jamie,

    Thanks for sharing your findings! This was a great read.

    We have also found that checking group membership for a user can be quite a performance bottleneck if a group has a lot of users. (It’s O(n) at the moment). We are hoping to fix this.

    I would also be very interested to know what static content is being re-requested and what version of JIRA you are using?

    We did a bunch of work to ensure JS and CSS files are cached forever by the browsers. I would love to know what we missed.

    Cheers,
    Anton
    JIRA Dev Team Lead

  • Great post. We (JIRA Engine Room team) are aware of some of these problems and are hoping to address them at some point, but they aren’t scheduled just at the moment.

    As for GC and memory tuning, there are some excellent references around about how GC algorithms work, but – briefly – the larger the heap you allocate, the longer between full GCs and then the longer it takes to actually perform a full GC. GC tuning itself is _hard_ and requires a lot of suck it and see.

    Here’s a couple of good resources for tuning and troubleshooting Tomcat (a lot of it is actually generically applicable):
    http://www.springsource.com/files/uploads/tomcat/tomcatx-performance-tuning.pdf
    http://www.springsource.com/files/uploads/tomcat/tomcatx-troubleshooting-production.pdf

  • Hi all,

    @Anton:
    > I would also be very interested to know what static content is being re-requested and what version of JIRA you are using?

    We’re using 3.13 (.0). About the caching, sorry, this is my mistake… I see all the static stuff has a ten-year expiration and the browser doesn’t re-request it. I guess that’s why the build number is in every URL. It’s too embarrassing to explain why I came to this wrong conclusion.

    I don’t understand why my load test without the static content was significantly faster… I’m wondering if the HTTP Cache Manager in JMeter doesn’t work quite right, I will investigate this more.

    > We did a bunch of work to ensure JS and CSS files are cached forever by the browsers

    The JS produced by DWR (ie the /interface/ stuff) is never cached… I wonder if it could be? I have written a plugin which makes use of DWR on almost every page (although it was disabled for my tests above). Perhaps it could at least send a must-revalidate header to avoid resending the content, which should never change within the same build (unless people are hacking). Or have some version ID in the DWR interface classes that you bump on each change, that forms part of the URL.

    @Matt:

    “that can’t be milliseconds, his users would be revolting!”. – well, they are revolting actually ;-) But yes the first step of create issue is still very slow, I guess it’s computing the list of projects the user has Create perms in. I need to do more work on the permission schemes, and profile that operation too.

    > Just to spell it out for others reading this article

    Yep, thanks for that, I should have made that clear.

    > As an admin aside, please could you post or refer to how you run that Groovy script you appended?

    You can just paste the whole thing in to the groovy runner: http://blogs.onresolve.com/?p=55, or the older one on Atlassian’s confluence. Either should work.

    @Jed:

    > We (JIRA Engine Room team) are aware of some of these problems and are hoping to address them at some point, but they aren’t scheduled just at the moment.

    Firstly, thanks for commenting publicly (not that this is a mass-traffic site, but still). Many larger companies I’ve been unfortunate to work with certainly would not.

    I think it’s a shame that you are not making performance a top priority – adding features is great, but adequate performance should come first. We’ve got 9 licensed jira instances with around 700k issues on aggregate, if we don’t get to the bottom of the performance problems we can’t use jira, no matter how many killer features there are.

    There seems to be some sort of consensus in the community that JIRA doesn’t scale much beyond 100k issues, however there is nothing in your underlying technologies that should create this limitation IMHO. From an acquired company, I’ve just inherited 5 additional licensed jira installations (to add to my existing 5). We used 5 separate ones for reasons of segregation of duties, however they used 5 because of perceived performance problems. My guess is they just came up against the issue outlined above, ie performance problems were due to number of projects/groups/schemes, rather than number of issues. It would be great for admins like myself, and ultimately beneficial to you, to put to bed this common perception that jira can’t scale. By fixing it ;-)

    Thanks for the tuning links, they look helpful. I will look to do some more GC tuning, but only once the easier wins, like sorting out the schemes, have all been done.

    Unrelated to the above:

    “I have heard setting a large heap can cause poor performance but cannot find any documentary evidence for this”

    I should have mentioned that obviously the object heap size should be well within the constraints of the physical memory.

    cheers, jamie

  • Regarding the question about why the load tests are faster without static content… it seems that the HTTP Cache Manager in JMeter does a conditional GET for stuff already in its cache rather than skip the request altogether. So, it doesn’t really behave like a typical browser in that respect.

    So for more accurate load testing it’s probably best to just assume users have warm caches and skip the static content altogether.

    http://www.nabble.com/Far-future-Expires-header-and-HTTP-Cache-Manager-td21984255.html

    cheers, jamie

  • anton@atlassian.com

    Hi Jamie,

    Thanks for the update. That explains a lot.

    We have been trying to solve the DWR problem for a while. The JS file is user specific (from memory, the DWR guys say it’s for security reasons), which makes it uncacheable :( I guess DWR is written from a full AJAX app perspective where the file is requested once and then the page is never reloaded. Not what JIRA does. I believe DWR were planning to fix it at some stage, and I think it has not happened yet. Our plan is to (hopefully) stop using DWR.

    I certainly share your thoughts on performance. The problem with “top priority” is that there can only be one of them, and there are a lot (thousands) of things competing for that spot. If performance is what’s hurting, then certainly it is the performance that one would like attention to be paid to. If it’s lack of some feature, then it’s that feature, and so on. Our most challenging task is choosing a few tasks that we can do next out of (literally) thousands of things that could be improved.

    Having said all of this, we are hoping to tackle performance. This is why we formed an Engine Room sub-team for JIRA Development. The other trouble is that “performance” is a massive subject. Just as you cover in this blog, there is browser caching, server side processing, proxy to server communication, etc. The plan is to try and chip away at it with every release.

    Just a note on the current scalability point of JIRA. It does hugely depend on usage pattern (number of concurrent users, how things are set up – as your investigation has shown that the way permissions are organised can influence performance, etc). However, as a ballpark guideline, we currently advise to split JIRA instances once the number of issues is between 200,000 and 300,000.

    Again, thank you for taking the time to share your findings!

    Cheers,
    Anton

  • anton@atlassian.com

    The issue to watch for the group membership performance improvement is:
    http://jira.atlassian.com/browse/JRA-16744

    Cheers,
    Anton

  • Oh, pfft, the slowness in hitting the Create Issue link was down to the “My Projects” stuff – http://blogs.onresolve.com/?p=70.

    This is why you should never modify jira or install any plugins ;-)

    In my defence there is something wrong with the impl of com.atlassian.jira.security.roles.actor.GroupRoleActorFactory.GroupRoleActor#getGroup, as noted in the comments.

    The “My Projects” thing is too useful when you have 500 odd projects to be got rid of. Fortunately replacing “actor.getGroup().getName()” with “actor.getParameter()” in com.atlassian.jira.security.util.BuiltinGroupsAwareProjectSelector#getMyProjects brings that operation down to 1 second from 6 seconds, only marginally slower than without “my projects”.

  • nbelser

    We are still in the set up and evaluation phase for Jira and are using roles as a primary way of delineating permissions. I have a role called mycompany-users that I am using for the generic user permissions. In order to attain the performance improvement mentioned, should I get rid of that role and just assign those permissions to the ANYONE group or is it sufficient to assign the group ANYONE to the mycompany-users role?

  • First of all, don’t let the problems mentioned above put you off using jira. These only become problems when you get *large*.

    > I have a role called mycompany-users that I am using for the generic user permissions

    Do you mean a group? If so you should not put that as a role membership then use Roles in perm schemes, you should use the group Anyone (which is not a real group).

    > or is it sufficient to assign the group ANYONE to the mycompany-users role?

    You can’t actually do that…

    cheers, jamie

  • nbelser

    We do have a fairly large number of issues, compared to the average Jira customer: approx 70k at import time.

    What I meant to say is I have set up a permission scheme and have created a role called mycompany-users that I use for generic permissions such as ‘browse projects’, ‘edit own comments’. You are saying that I should just set those particular permissions to ‘anyone’ instead of using the role I created, which basically means that every user has those privileges for every project which uses that permission scheme (which means every project for us).

  • OK. Well, it depends on who is in the role mycompany-users for each project. If it’s just a few people or a small group it’s no problem. If however it’s jira-users or the equivalent then you should definitely replace it with anyone.

    But it’s a trade-off, by adding Anyone, you lose the ability for project admins to control access to their own projects. On the other hand in my case I got a 10x performance improvement.

    Also, the 70k issues is not itself significant as far as the above is concerned. Question is how many users do you have, and how many projects and groups?

    I would invest a couple of days in writing a load test and testing different scenarios.

    cheers, jamie

  • nbelser

    We have approximately 500 users. We have about 30 projects. Not sure about groups yet as we have not defined them. I am guessing it won’t be more than 50. I will take your advice and use jmeter. Thanks.

  • FWIW, if the biggest group will be around 500 users I don’t think you have anything to worry about, and the benefit of using roles will outweigh the slight gain you will get from the above. We only noticed problems starting around the 4 or 5 thousand user mark.
    jamie

  • nbelser

    ok, good to know. Thanks!

  • The whole users in JIRA thing just bit me again. My custom importer was doing very nicely until we ran it with a JIRA instance of 8000 users. Long pauses and head scratching followed until I noticed the network traffic peaks and that they corresponded with retrieving the list of users in a group via SOAP.
