Dan Mindler
Enterprise Systems & Services dmindler@rutgers.edu
uPortal 2:
Releases
uPortal 2.5.x
2.5.0 GA 5/26/05
2.5.1 RC1 available 7/05
2.5.1 GA ?
uPortal 2.4.x
2.4.3 GA 8/12/05
Quick Starts for 2.5.0 and 2.4.3 made available 9/4/05 (contributed by Vincent Mathieu, Vincent.Mathieu@univ-nancy2.fr)
uPortal 2.x Memory/Performance:
Status
uPortal 2.x Memory/Performance:
Background
Scott Battaglia presented overview on 3/21/05
( http://jasigch.princeton.edu:9000/display/UPC/March+2005+uPortal+Developers+Meeting+Minutes#March2005uPortalDevelopersMeetingMinutes-24PerfMem )
myRutgers has 10-15k unique users per day, with 30-40k total logins. Load is balanced per session across 4 machines
Symptoms indicated memory leak
Saw in production since 11/04
Used scripts to monitor JVM heap and mod_jk connections; restarted servers before an out-of-memory error occurred
Used tools (e.g., YourKit) to inspect the heap
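The monitoring described above was done with scripts outside the JVM. As a rough in-process sketch of the same check, the class below compares heap occupancy against a restart threshold. The class name HeapWatch and the 0.90 threshold are illustrative assumptions, not what myRutgers ran.

```java
class HeapWatch {
    /** Fraction of the maximum heap in use, between 0.0 and 1.0. */
    static double usedFraction(long usedBytes, long maxBytes) {
        return (double) usedBytes / (double) maxBytes;
    }

    /** Samples the running JVM through java.lang.Runtime. */
    static double currentUsedFraction() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return usedFraction(used, rt.maxMemory());
    }

    /** True when the heap is full enough that a restart should be scheduled. */
    static boolean needsRestart(double usedFraction, double threshold) {
        return usedFraction >= threshold;
    }

    public static void main(String[] args) {
        double f = currentUsedFraction();
        // 0.90 is an assumed threshold for illustration only.
        System.out.printf("heap used: %.1f%%, restart: %b%n", f * 100, needsRestart(f, 0.90));
    }
}
```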
uPortal 2.x Memory/Performance:
Initial Changes
Fixes before March, 2005:
Removed caching of IPersons from PersonDirectory
CError and CSecureInfo now pass events to wrapped channels
Restrict access to ChannelFactory’s channel cache, synchronized instantiateChannel method
Guest sessions created on time out
AbstractMultithreadedChannels were not cleaning out their channel state maps (2 of them)
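The state-map leak in the last item follows a common pattern: per-user state is added when a session starts but never removed when it ends. A minimal sketch of the cleanup, using hypothetical names rather than the actual uPortal classes:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class ChannelStateRegistry {
    // One entry per active user session; the key format is an assumption.
    private final Map<String, Object> stateByUser = new ConcurrentHashMap<>();

    void sessionStarted(String uid) {
        stateByUser.put(uid, new Object());
    }

    /** Without this removal, every session that ever existed stays reachable. */
    void sessionEnded(String uid) {
        stateByUser.remove(uid);
    }

    int size() {
        return stateByUser.size();
    }
}
```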
uPortal 2.x Memory/Performance:
More Changes
Switch homegrown ThreadPool to Backport Concurrent
Remove finalizer in UBC_Webmail
Update to AuthorizationImpl
uPortal 2.x Memory/Performance:
More leaks?
uPortal 2.x Memory/Performance:
Finalization Problems
An Object that overrides finalize() is placed on the finalizer queue once the garbage collector finds it is no longer referenced
A finalizer thread will sweep this list at its leisure (i.e., no guarantee of when)
Perhaps the GC thread is too busy because many temporary Objects are created
Target: finalizers, leaks and temporary objects
uPortal 2.x Memory/Performance:
Finalizers, Leaks and Temporary Objects
Target finalize() methods:
org.jasig.portal.MultipartDataSource
Xalan XRTreeFrag (now in Xalan 2.7)
Upgrade to JavaMail 1.3.2 and modify local copy to remove finalizers
Target Leaks/Performance:
org.jasig.portal.ChannelManager.java channelTable (Hashtable) not cleaned properly
org.jasig.portal.car.CarResources.java not closing stream
Turn off dynamic class reloading in the Tomcat Context definition (server.xml or the context descriptor, rather than web.xml):
<Context reloadable="false"/>
Target excessive temporary Objects:
( http://jasigch.princeton.edu:9000/pages/viewpage.action?pageId=9271 ):
org.jasig.portal.serialize.HTMLdtd.fromChar(int) – based on Xerces, but now deprecated – removed temp object creation
org.jasig.portal.utils.SubstitutionIntegerFilter – reduced temp object creation
org.jasig.portal.MediaManager – reduced number of MediaManager objects created
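As an illustration of the temporary-object reductions listed above, a lookup that builds a key object on every call can often be replaced by a table indexed directly by the primitive value. This is a sketch of that general technique only; the class EntityNames and its small table are hypothetical and are not the uPortal or Xerces code.

```java
class EntityNames {
    // Hypothetical subset of a character-to-entity-name table, for illustration.
    private static final String[] BY_CHAR = new String[256];
    static {
        BY_CHAR['<'] = "lt";
        BY_CHAR['>'] = "gt";
        BY_CHAR['&'] = "amp";
        BY_CHAR['"'] = "quot";
        BY_CHAR[0xA0] = "nbsp";
    }

    /** Returns the entity name for a character, or null; allocates nothing per call. */
    static String fromChar(int value) {
        if (value < 0 || value >= BY_CHAR.length) {
            return null;
        }
        return BY_CHAR[value];
    }
}
```

On a serializer's hot path, this kind of lookup runs once per output character, so avoiding one allocation per call adds up quickly.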
uPortal 2.x Memory/Performance:
Still Memory Problems
Server behaving better; some servers requiring restart in 3 hours, others in 3 days
Heap dumps still point to Objects on Finalizer queue – primarily JDK sockets (plain and SSL)
Though still running out of memory:
Unable to replicate in QA – indicates testing does not reflect usage
Unpredictable nature of heap growth - servers running similar traffic/load should show similar pattern in heap usage (not one failing in 3 hours and another in 3 days)
Tools no longer point to leaked Objects (other than on Finalizer queue)
So, why can’t the finalizer thread keep up!?!?
uPortal 2.x Memory/Performance:
Out Of Memory Graph
uPortal 2.x Memory/Performance:
Try, Try Again
Too many open files (configured for 8k)
Thought it was network sockets
During production run, got a snapshot of lsof (list of open files), indicating hundreds of:
java 6843 tomcat 30r VREG 85,7 409 4271365 /u01/app/jakarta-tomcat-5.0.25_load_test/work/Catalina/localhost/portallt/loader/com/swabunga/spell/engine/configuration.properties
Email uses Jazzy ( http://sourceforge.net/projects/jazzy ), which was not explicitly closing files ( com.swabunga.spell.engine.PropertyConfiguration.java )
Wrote/deployed a web-app that, given parameters:
spawns a thread to run System.gc() every configured milliseconds
spawns a thread to run System.runFinalization() every configured milliseconds – not guaranteed to run finalization, but system will make an effort
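A minimal sketch of such a periodic requester, assuming a modern JDK with java.util.concurrent (the original ran on an older JVM and its code is not shown in the source). The class name PeriodicRequester and the log wording are assumptions. The key design point is logging before and after each request, so a request that never returns shows up as a missing "complete" line.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class PeriodicRequester {
    private final String label;
    private final Runnable request;
    final List<String> log = Collections.synchronizedList(new ArrayList<String>());

    PeriodicRequester(String label, Runnable request) {
        this.label = label;
        this.request = request;
    }

    /** One cycle: log, run the request, log again once it returns. */
    void runOnce() {
        log.add("Requesting " + label);
        request.run();
        log.add(label + " request complete");
    }

    /** Repeats runOnce() on a single daemon thread, waiting periodMillis between runs. */
    ScheduledExecutorService start(long periodMillis) {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, label + "-requester");
            t.setDaemon(true);
            return t;
        });
        ses.scheduleWithFixedDelay(this::runOnce, 0, periodMillis, TimeUnit.MILLISECONDS);
        return ses;
    }

    public static void main(String[] args) throws InterruptedException {
        // System.gc() is only a hint to the JVM; System.runFinalization could be
        // passed the same way, though it is deprecated on recent JDKs.
        PeriodicRequester gc = new PeriodicRequester("gc", System::gc);
        ScheduledExecutorService ses = gc.start(200);
        Thread.sleep(500);
        ses.shutdownNow();
        gc.log.forEach(System.out::println);
    }
}
```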
uPortal 2.x Memory/Performance:
A Watched Pot…Boils
Patch solved problem of too many open files
Periodic run of finalization appeared ok:
2005-08-16 07:26:03.591: Thread-1: DEBUG: Requesting finalization
2005-08-16 07:26:03.593: Thread-1: DEBUG: Finalization request complete
2005-08-16 07:26:03.593: Thread-1: DEBUG: Finalization thread sleeping 900 seconds
Monitored tenured area of heap:
Heap
 def new generation total 873856K, used 62635K [0x75000000, 0xb5000000, 0xb5000000)
  eden space 699136K, 0% used [0x75000000, 0x75000000, 0x9fac0000)
  from space 174720K, 35% used [0x9fac0000, 0xa37eadb0, 0xaa560000)
  to space 174720K, 0% used [0xaa560000, 0xaa560000, 0xb5000000)
 tenured generation total 1048576K, used 287178K [0xb5000000, 0xf5000000, 0xf5000000)
  the space 1048576K, 27% used [0xb5000000, 0xc6872810, 0xc6872a00, 0xf5000000)
 compacting perm gen total 32768K, used 24113K [0xf5000000, 0xf7000000, 0xf9000000)
  the space 32768K, 73% used [0xf5000000, 0xf678c538, 0xf678c600, 0xf7000000)
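The percentages in the heap log follow from the used and total figures: 287178K of 1048576K tenured is 27%, and 24113K of 32768K perm gen is 73% (both figures match truncating division). The sketch below reproduces that arithmetic and, as an assumption about how one might watch the same pools from inside a current JVM, lists heap pools through the standard java.lang.management API. The class name PoolUsage is illustrative.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

class PoolUsage {
    /** Whole-number percentage, truncated toward zero. */
    static long percentUsed(long used, long capacity) {
        return used * 100 / capacity;
    }

    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue; // skip non-heap pools such as code cache
            }
            MemoryUsage u = pool.getUsage();
            if (u == null || u.getCommitted() <= 0) {
                continue; // pool not valid or nothing committed yet
            }
            System.out.println(pool.getName() + ": " + (u.getUsed() / 1024) + "K used, "
                    + percentUsed(u.getUsed(), u.getCommitted()) + "% of committed");
        }
    }
}
```

Pool names differ by JVM version and collector, so a monitor built this way should match on the pool it cares about rather than assume a fixed name.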
uPortal 2.x Memory/Performance:
Bingo!
Noticed the tenured space growing after successive GCs: heap leak occurring
Scanned the log for finalization messages:
2005-08-16 12:07:04.466: Thread-2: DEBUG: Requesting finalization
...
No debug print of finalization complete!!!!!
AHAA!!!
Finalization request not returning at the same time the tenured area of heap starts growing… Finalization thread is blocked
Issue a “kill -3” to obtain a Java Thread Dump
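kill -3 sends SIGQUIT, which makes the JVM print every thread's stack. Where signalling the process is inconvenient, a similar view is available from inside the JVM through Thread.getAllStackTraces(). The sketch below is an illustration under that assumption, with a hypothetical class name; unlike the kill -3 output it does not show which monitors each thread holds (ThreadMXBean.dumpAllThreads can report those).

```java
import java.util.Map;

class ThreadDump {
    /** Formats the stack of every live thread whose name contains the filter. */
    static String dump(String nameFilter) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            if (!t.getName().contains(nameFilter)) {
                continue;
            }
            sb.append('"').append(t.getName()).append("\" state=").append(t.getState()).append('\n');
            for (StackTraceElement frame : e.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Show only finalizer threads, the ones of interest in this investigation.
        System.out.print(dump("Finalizer"));
    }
}
```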
uPortal 2.x Memory/Performance:
Blocked Finalizer Threads
"Secondary finalizer" prio=5 tid=0x011ef348 nid=0x1477 runnable [d8000000..d80019c0]
    at java.lang.Object.wait(Native Method)
    - waiting on <0xeaf54f08> (a com.sshtools.j2ssh.transport.TransportProtocolAlgorithmSync)
    at com.sshtools.j2ssh.transport.TransportProtocolAlgorithmSync.lock(Unknown Source)
    - locked <0xeaf54f08> (a com.sshtools.j2ssh.transport.TransportProtocolAlgorithmSync)
    at com.sshtools.j2ssh.transport.TransportProtocolOutputStream.sendMessage(Unknown Source)
    - locked <0xeaf55810> (a com.sshtools.j2ssh.transport.TransportProtocolOutputStream)
    at com.sshtools.j2ssh.transport.TransportProtocolCommon.sendMessage(Unknown Source)
    - locked <0xeaf49ef0> (a com.sshtools.j2ssh.transport.TransportProtocolClient)
    at com.sshtools.j2ssh.transport.TransportProtocolCommon.sendDisconnect(Unknown Source)
    at com.sshtools.j2ssh.transport.TransportProtocolCommon.disconnect(Unknown Source)
    at com.sshtools.j2ssh.SshClient.disconnect(Unknown Source)
    at edu.columbia.filesystem.impl.SFTPFileSystemImpl.disconnect(SFTPFileSystemImpl.java:89)
    at edu.columbia.filesystem.impl.RemoteFileSystemImpl.finalize(RemoteFileSystemImpl.java:715)
    at java.lang.ref.Finalizer.invokeFinalizeMethod(Native Method)
    at java.lang.ref.Finalizer.runFinalizer(Finalizer.java:83)
    at java.lang.ref.Finalizer.access$100(Finalizer.java:14)
    at java.lang.ref.Finalizer$2.run(Finalizer.java:131)
    at java.lang.Thread.run(Thread.java:534)
uPortal 2.x Memory/Performance:
Investigation
Thread dump indicated two finalization threads, both blocked at the same point in code
Points to Briefcase channel, in the low level code to handle secure connections to ftp server (J2SSH http://sourceforge.net/projects/sshtools )
A lock release is NOT in a finally clause
A user who opens their briefcase, then proactively closes it causes the problem when the user logs/times out:
Proactive closing sends a close command over the socket
A finalizer in the HyperContent (uses J2SSH) issues close (throws Exception, not releasing lock)
A finalizer in briefcase issues close through HyperContent, which waits on the lock
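The root cause, a lock release outside a finally clause, can be shown in a few lines. This sketch uses java.util.concurrent's ReentrantLock as a stand-in; J2SSH uses its own synchronization class, and the names Transport, sendDisconnectLeaky and sendDisconnectSafe are hypothetical.

```java
import java.util.concurrent.locks.ReentrantLock;

class Transport {
    private final ReentrantLock lock = new ReentrantLock();
    private boolean socketOpen = true;

    /** Simulates the user proactively closing the connection. */
    void closeSocket() {
        socketOpen = false;
    }

    /** Leaky version: if write() throws, unlock() is skipped and the lock stays held. */
    void sendDisconnectLeaky() {
        lock.lock();
        write();
        lock.unlock();
    }

    /** Safe version: the finally clause releases the lock on every exit path. */
    void sendDisconnectSafe() {
        lock.lock();
        try {
            write();
        } finally {
            lock.unlock();
        }
    }

    private void write() {
        if (!socketOpen) {
            throw new IllegalStateException("socket already closed");
        }
    }

    boolean isLocked() {
        return lock.isLocked();
    }
}
```

After the leaky call fails, any other thread that tries to take the lock waits forever, which is the state the two finalizer threads were found in.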
uPortal 2.x Memory/Performance:
Wash, Rinse, Repeat
Need to re-create problem in QA
Why not caught in QA?
The briefcase channel utilizes sftp to an individual account, but a test sftp server does not exist, so a “real” user must be used – not part of the Test Scenario
A new test script is written to simulate one user logging into the portal and using the Briefcase
Problem is replicated consistently within 20 minutes using a 256MB heap
uPortal 2.x Memory/Performance:
Final Resolution
Though the J2SSH code is no longer actively maintained, a contribution is found to address this issue – made available 6/04
A HyperContent class is extended and a finalizer method is overridden to only close if the connection is open
The finalizer in the briefcase is removed
QA test now passes
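The "only close if the connection is open" guard amounts to making close idempotent. A minimal sketch of that idea, with a hypothetical class name and a counter standing in for the real disconnect message; it is not the HyperContent or J2SSH code:

```java
import java.util.concurrent.atomic.AtomicBoolean;

class GuardedConnection implements AutoCloseable {
    private final AtomicBoolean open = new AtomicBoolean(true);
    private int disconnectsSent = 0; // stands in for writing to the socket

    boolean isOpen() {
        return open.get();
    }

    @Override
    public void close() {
        // compareAndSet succeeds only for the first caller, so a later
        // close (for example from cleanup code) becomes a no-op.
        if (open.compareAndSet(true, false)) {
            disconnectsSent++;
        }
    }

    int disconnectsSent() {
        return disconnectsSent;
    }
}
```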
uPortal 2.x Memory/Performance:
Install Fix in Production
uPortal 2.x Memory/Performance:
Increase in Threads
Monitoring of servers shows a gradual increase in the number of threads (from initial 180 to over 1k)
kill -3 reveals hundreds of threads in JavaMail:
"JavaMail-EventQueue" daemon prio=5 tid=0x002da400 nid=0x534 in Object.wait() [aa881000..aa8819c0]
    at java.lang.Object.wait(Native Method)
    - waiting on <0xba6476f0> (a javax.mail.EventQueue)
    at java.lang.Object.wait(Object.java:429)
    at javax.mail.EventQueue.dequeue(Unknown Source)
    - locked <0xba6476f0> (a javax.mail.EventQueue)
    at javax.mail.EventQueue.run(Unknown Source)
    at java.lang.Thread.run(Thread.java:534)
Modified JavaMail to restore needed finalize() methods
Patch fixes thread problem, but…
uPortal 2.x Memory/Performance:
Heavy Traffic
New semester brings much more traffic
(and more problems):
Increase in portal logins (40k in one day, 3600/hour) overwhelms authentication infrastructure, causing mod_jk threads to block, resulting in restarts (and more logins, etc…)
Channel time out – traced to heavy load and use of SoftReference caches; worst case scenario caching
Heavy memory usage – many temporary objects; over 600 concurrent sessions results in over 700MB temporary objects every second
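One reading of "worst case scenario caching": the JVM clears SoftReferences when memory is tight, which is exactly when load is heaviest, so the cache misses and re-renders at the worst possible moment. The sketch below shows the general shape of a SoftReference cache; the class SoftCache and its miss counter are illustrative assumptions, not the uPortal implementation.

```java
import java.lang.ref.SoftReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

class SoftCache<K, V> {
    private final Map<K, SoftReference<V>> map = new ConcurrentHashMap<>();
    int misses = 0; // simple counter for the sketch; not thread-safe

    /** Returns the cached value, reloading it if absent or cleared by the GC. */
    V get(K key, Function<K, V> loader) {
        SoftReference<V> ref = map.get(key);
        V value = (ref == null) ? null : ref.get();
        if (value == null) {
            // Either never cached, or the collector cleared the reference
            // under memory pressure; the expensive load happens again.
            misses++;
            value = loader.apply(key);
            map.put(key, new SoftReference<>(value));
        }
        return value;
    }
}
```

Tracking the miss count under load is one way to check whether a soft cache is actually retaining anything when it matters.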
uPortal 2.x Memory/Performance:
Triage
Introduce LDAP connection pooling
Modify caching of layouts to share across users
Tune caching of heavily used channels
* Modifications not yet submitted to uPortal codebase
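The source does not say which LDAP client the portal uses. Assuming the JDK's built-in JNDI LDAP provider, connection pooling is switched on with a single environment property, as in this sketch. The class name LdapEnv and the example URL are placeholders.

```java
import java.util.Hashtable;
import javax.naming.Context;

class LdapEnv {
    /** Builds a JNDI environment that asks the LDAP provider to pool connections. */
    static Hashtable<String, String> pooledEnv(String url) {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, url);
        // Provider-specific property of the JDK's LDAP service provider.
        env.put("com.sun.jndi.ldap.connect.pool", "true");
        return env;
    }

    public static void main(String[] args) {
        // Placeholder URL; pass the result to new InitialDirContext(env) to connect.
        System.out.println(pooledEnv("ldap://ldap.example.edu:389"));
    }
}
```

Pool limits for this provider are set through JVM system properties such as com.sun.jndi.ldap.connect.pool.maxsize rather than through the per-context environment.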
uPortal 2.x Memory/Performance:
Going Forward…
Focus on heavily used channels (Email)
Revisit statistics to capture better data for monitoring/reporting/forecasting production servers
Implement better caching algorithms
Monitor/tune/implement pooling:
LDAP
DB
HTTP client connections
Migrate to JDK 1.5
Continue to monitor/tune heap
Large new area for temporary objects
Go on a Temporary Object Diet!
uPortal 2.x Memory/Performance:
Lessons Learned
QA should reflect production usage
Java “Memory Leak” becomes a catch-all phrase for many different types of problems
Useful tools:
Java Heap/CPU analyzer (e.g., YourKit)
lsof ( ftp://lsof.itap.purdue.edu/pub/tools/unix/lsof/ )
kill -3 thread dumps
Experience was invaluable:
Solved many memory leaks/performance issues in codebase contributing to a more stable portal
Skills can be applied to other Java-based apps