myRutgers Alert Sub-system

advertisement

uPortal 2 Status – 9/19/05

Dan Mindler

Enterprise Systems & Services dmindler@rutgers.edu

UPortal 2:

Releases

 uPortal 2.5.x

 2.5.0 GA 5/26/05

 2.5.1 RC1 available 7/05

 2.5.1 GA ?

 uPortal 2.4.x

 2.4.3 GA 8/12/05

 Quick Starts for 2.5.0 and 2.4.3 made available 9/4/05 (contributed by Vincent

Mathieu Vincent.Mathieu@univ-nancy2.fr

)

uPortal 2.x Memory/Performance:

Status

As of 8/31/05, all known memory leaks in myRutgers have been resolved

Finalizers did play a part

Many leaks have been plugged

Performance improvements have been made

But, there is room for improvement…

uPortal 2.x Memory/Performance:

Background

 Scott Battaglia presented overview on 3/21/05

( http://jasigch.princeton.edu:9000/display/UPC/March+2005+uPortal+Developer s+Meeting+Minutes#March2005uPortalDevelopersMeetingMinutes-24PerfMem )

 myRutgers has 10-15k unique users per day, with 30-40k total logins. Load balance per session to 4 machines

 Symptoms indicated memory leak

 Saw in production since 11/04

 Utilized scripting to monitor JVM heap and mod_jk connections; restart before the out-ofmemory occurred

 Used tools (e.g., YourKit) to inspect the heap

uPortal 2.x Memory/Performance:

Initial Changes

 Fixes before March, 2005:

 Removed caching of IPersons from

PersonDirectory

 CError and CSecureInfo now pass events to wrapped channels

 Restrict access to ChannelFactory’s channel cache, synchronized instantiateChannel method

 Guest sessions created on time out

 AbstractMultithreadedChannels were not cleaning out their channel state maps (2 of them)

uPortal 2.x Memory/Performance:

More Changes

Fixes discussed March, 2005:

 Switch homegrown ThreadPool to

Backport Concurrent

 Remove finalizer in UBC_Webmail

 Update to AuthorizationImpl

uPortal 2.x Memory/Performance:

More leaks?

Since March, 2005 changes, behavior better, but myRutgers JVMs still get bounced in anywhere from 1 to 3 hours

Finalizer queue still indicates many unfinalized Objects

Question: Why can’t the finalizer thread keep up?

uPortal 2.x Memory/Performance:

Finalization Problems

 If an Object implements public void finalize(), no longer referenced gets placed on finalizer queue after

 A finalizer thread will sweep this list at its’ leisure (i.e., no guarantee of when)

 Perhaps the GC thread is too busy, many temporary Objects created

 Target: finalizers, leaks and temporary objects

uPortal 2.x Memory/Performance:

Finalizers, Leaks and Temporary Objects

 Target finalizer() methods:

 org.jasig.portal.MultipartDataSource

 Xalan XRTreeFrag (now in Xalan 2.7)

 Upgrade to JavaMail 1.3.2 and modify local copy to remove finalizers

 Target Leaks/Performance:

 org.jasig.portal.ChannelManager.java channelTable (Hashtable) not cleaned properly

 org.jasig.portal.car.CarResources.java not closing stream

 Turn off dynamic class reloading in web.xml:

<Context reloadable="false"/>

 Target excessive temporary Objects:

( http://jasigch.princeton.edu:9000/pages/viewpage.action?pageId=9271

):

 org.jasig.portal.serialize.HTMLdtd.fromChar(int) – based on Xerces, but now deprecated – removed temp object creation

 org.jasig.portal.utils.SubstitutionIntegerFilter – reduced temp object creation

 org.jasig.portal.MediaManager – reduced number of MediaManager objects created

uPortal 2.x Memory/Performance:

Still Memory Problems

 Server behaving better; some servers requiring restart in 3 hours, others in 3 days

 Heap dumps still point to Objects on Finalizer queue – primarily JDK sockets (plain and SSL)

 Though still running out of a memory:

 Unable to replicate in QA – indicates testing does not reflect usage

 Unpredictable nature of heap growth - servers running similar traffic/load should show similar pattern in heap usage (not one failing in 3 hours and another in 3 days)

 Tools no longer point to leaked Objects (other than on Finalizer queue)

 So, why can’t the finalizer thread keep up!?!?

uPortal 2.x Memory/Performance:

Out Of Memory Graph

uPortal 2.x Memory/Performance:

Try, Try Again

 Too many open files (configured for 8k)

 Thought it was network sockets

 During production run, got a snapshot of lsof (list of open files), indicating hundreds of: java 6843 tomcat 30r VREG 85,7 409 4271365 /u01/app/jakarta-tomcat-

5.0.25_load_test/work/Catalina/localhost/portallt/loader/com/swabunga/spell/engine/configu ration.properties

 Email uses Jazzy ( http://sourceforge.net/projects/jazzy ) was not explicitly closing files

( com.swabunga.spell.engine.PropertyConfiguration.java

)

 Wrote/Deployed a web-app that given parameters:

 spawns a thread to run System.gc() every configured milliseconds

 spawns a thread to run System.runFinalization() every configured milliseconds – not guaranteed to run finalization, but system will make an effort

uPortal 2.x Memory/Performance:

A Watched Pot…Boils

 Patch solved problem of too many open files

 Periodic run of finalization appeared ok:

2005-08-16 07:26:03.591: Thread-1: DEBUG: Requesting finalization

2005-08-16 07:26:03.593: Thread-1: DEBUG: Finalization request complete

2005-08-16 07:26:03.593: Thread-1: DEBUG: Finalization thread sleeping 900 seconds

 Monitored tenured area of heap:

Heap def new generation total 873856K, used 62635K [0x75000000, 0xb5000000, 0xb5000000) eden space 699136K, 0% used [0x75000000, 0x75000000, 0x9fac0000) from space 174720K, 35% used [0x9fac0000, 0xa37eadb0, 0xaa560000) to space 174720K, 0% used [0xaa560000, 0xaa560000, 0xb5000000) tenured generation total 1048576K, used 287178K [0xb5000000, 0xf5000000, 0xf5000000) the space 1048576K, 27% used [0xb5000000, 0xc6872810, 0xc6872a00, 0xf5000000) compacting perm gen total 32768K, used 24113K [0xf5000000, 0xf7000000, 0xf9000000) the space 32768K, 73% used [0xf5000000, 0xf678c538, 0xf678c600, 0xf7000000)

uPortal 2.x Memory/Performance:

Bingo!

 Noticed the tenured space growing after successive GC’s: heap leak occurring

 Scanned the log for finalization messages:

2005-08-16 12:07:04.466: Thread-2: DEBUG: Requesting finalization

...

 No debug print of finalization complete!!!!!

AHAA!!!

 Finalization request not returning at the same time the tenured area of heap starts growing… Finalization thread is blocked

 Issue a “kill -3” to obtain a Java Thread

Dump

uPortal 2.x Memory/Performance:

Blocked Finalizer Threads

Finalizer threads stack traces:

"Secondary finalizer" prio=5 tid=0x011ef348 nid=0x1477 runnable [d8000000..d80019c0] at java.lang.Object.wait(Native Method)

- waiting on <0xeaf54f08> (a com.sshtools.j2ssh.transport.TransportProtocolAlgorithmSync) at com.sshtools.j2ssh.transport.TransportProtocolAlgorithmSync.lock(Unknown Source)

- locked <0xeaf54f08> (a com.sshtools.j2ssh.transport.TransportProtocolAlgorithmSync) at com.sshtools.j2ssh.transport.TransportProtocolOutputStream.sendMessage(Unknown Source)

- locked <0xeaf55810> (a com.sshtools.j2ssh.transport.TransportProtocolOutputStream) at com.sshtools.j2ssh.transport.TransportProtocolCommon.sendMessage(Unknown Source)

- locked <0xeaf49ef0> (a com.sshtools.j2ssh.transport.TransportProtocolClient) at com.sshtools.j2ssh.transport.TransportProtocolCommon.sendDisconnect(Unknown Source) at com.sshtools.j2ssh.transport.TransportProtocolCommon.disconnect(Unknown Source) at com.sshtools.j2ssh.SshClient.disconnect(Unknown Source) at edu.columbia.filesystem.impl.SFTPFileSystemImpl.disconnect(SFTPFileSystemImpl.java:89) at edu.columbia.filesystem.impl.RemoteFileSystemImpl.finalize(RemoteFileSystemImpl.java:715) at java.lang.ref.Finalizer.invokeFinalizeMethod(Native Method) at java.lang.ref.Finalizer.runFinalizer(Finalizer.java:83) at java.lang.ref.Finalizer.access$100(Finalizer.java:14) at java.lang.ref.Finalizer$2.run(Finalizer.java:131) at java.lang.Thread.run(Thread.java:534)

uPortal 2.x Memory/Performance:

Investigation

 Thread dump indicated two finalization threads, both blocked at the same point in code

 Points to Briefcase channel, in the low level code to handle secure connections to ftp server (J2SSH http://sourceforge.net/projects/sshtools )

 A lock release is NOT in a finally clause

 A user who opens their briefcase, then proactively closes it causes the problem when the user logs/times out:

 Proactive closing sends a close command over the socket

 A finalizer in the HyperContent (uses J2SSH) issues close (throws Exception, not releasing lock)

 A finalizer in briefcase issues close through

HyperContent, which waits on the lock

uPortal 2.x Memory/Performance:

Wash, Rinse, Repeat

 Need to re-create problem in QA

 Why not caught in QA?

 The briefcase channel utilizes sftp to an individual account, but a test sftp server does not exist, so a “real” user must be used – not part of the Test Scenario

 A new test script is written to simulate one user logging into the portal and using the Briefcase

 Problem is replicated consistently within 20 minutes using a 256MB heap

uPortal 2.x Memory/Performance:

Final Resolution

 Though the J2SSH code is no longer actively maintained, a contribution is found to address this issue – made available 6/04

 A HyperContent class is extended and a finalizer method is overridden to only close if the connection is open

 The finalizer in the briefcase is removed

 QA test now passes

uPortal 2.x Memory/Performance:

Install Fix in Production

Final set of fixes rolled into production 8/31/05

Servers have not run out of memory since release has been introduced

But…

uPortal 2.x Memory/Performance:

Increase in Threads

 Monitoring of servers show a gradual increase in the number of threads (from initial 180 to over 1k)

 kill -3 reveals hundreds of threads in

JavaMail:

"JavaMail-EventQueue" daemon prio=5 tid=0x002da400 nid=0x534 in Object.wait() [aa881000..aa8819c0] at java.lang.Object.wait(Native Method)

- waiting on <0xba6476f0> (a javax.mail.EventQueue) at java.lang.Object.wait(Object.java:429) at javax.mail.EventQueue.dequeue(Unknown Source)

- locked <0xba6476f0> (a javax.mail.EventQueue) at javax.mail.EventQueue.run(Unknown Source) at java.lang.Thread.run(Thread.java:534)

 Modified JavaMail to restore needed finalization() methods

 Patch fixes thread problem, but…

uPortal 2.x Memory/Performance:

Heavy Traffic

 New semester brings much more traffic

(and more problems):

 Increase in portal logins (40k in one day,

3600/hour) overwhelms authentication infrastructure, causing mod_jk threads blocking resulting in restarts (and more logins, etc…)

 Channel time out – traced to heavy load and use of SoftReference caches; worst case scenario caching

 Heavy memory usage – many temporary objects; over 600 concurrent sessions results in over 700MB temporary objects every second

uPortal 2.x Memory/Performance:

Triage

Address immediate concerns for heavy load*:

 Introduce LDAP connection pooling

 Modify caching of layouts to share across users

 Tune caching of heavily used channels

* Modifications not yet submitted to uPortal codebase

uPortal 2.x Memory/Performance:

Going Forward…

 Focus on heavily used channels (Email)

 Revisit statistics to capture better data for monitoring/reporting/forecasting production servers

 Implement better caching algorithms

 Monitor/tune/implement pooling:

 LDAP

 DB

 HTTP client connections

 Migrate to JDK 1.5

 Continue to monitor/tune heap

 Large new area for temporary objects

 Go on a Temporary Object Diet!

uPortal 2.x Memory/Performance:

Lessons Learned

 QA should reflect production usage

 Java “Memory Leaks” becomes a catchphrase for many different types of problems

 Useful tools:

 Java Heap/CPU analyzer (e.g., YourKit)

 lsof ( ftp://lsof.itap.purdue.edu/pub/tools/unix/lsof/ )

 kill -3 thread dumps

 Experience was invaluable:

 Solved many memory leaks/performance issues in codebase contributing to a more stable portal

 Skills can be applied to other Java-based apps

Download