enlivend wrote,
@ 2008-05-08 15:53:00
JFLI and the memory leaks of doom

Finally, I've got the OK to open source some of the work I was doing last year. First step is the fixes I made to jfli (Java Foreign Language Interface), now available for CVS download or as a tarball. It's the first time anyone has posted changes to this library in some years, so I took the bull by the horns and gave it version number 0.2.

For the record, this is what I have changed. It was an interesting object lesson in trying to get two GCs to be nice to each other.

Memory problems:

  1. add-special-free-action takes a symbol, not a function. If you give it a function it doesn't do anything. In this case, that meant that all global-refs were leaking on the Java side. That's a hell of a pile-up - run anything for long enough and Java will run out of memory.

  2. Lisp processes just accumulated in *process-envs*, which meant that the associated stacks etc. ended up leaking on _both_ sides of the fence, i.e. into both Java and Lisp. My first attempt at solving this involved a call to mp:ensure-process-cleanup but...

  3. Suppose a new thread allocates before mp:*current-process* has been set. Then delete-global-ref might be called, which invokes current-env for the first time on this thread, which in turn calls mp:ensure-process-cleanup with a null process. SEGV. Far-fetched? Well, it happened.

  4. The "access functions" (calls to anything seen in defvtable, i.e. all of the JNI calls into the JVM) failed to memoize a dereferenced foreign-slot-value which was the same every time, and so burned 56 unnecessary bytes per call (in Lisp). This was reclaimed by the GC but it messed up the allocation figures when I was out hunting for real leaks.

  5. Untimely Finalization

    Consider this little problem which exhibited pathological behaviour from time to time:

    • The special free actions are not run when a GC occurs inside mp:without-interrupts, because that could cause a deadlock if the action function claims any locks (e.g. uses hash-tables). Instead of freeing them, the GC just keeps them alive with their special free actions intact.
    • The system maintains a table of all of the objects marked for special free actions (so the GC can find them all easily). Unfortunately, flag-special-free-action takes O(n²) time for n objects in the table. ["Ouch", says Nick.]
    • This table is enlarged by flag-special-free-action, inside without-interrupts to make it atomic with respect to other calls to flag-special-free-action and finalization.
    • If you're unlucky, all of the GC operations are triggered by the enlargement of this table.

    This final aspect completed a vicious cycle: none of the special objects were ever freed, because all of the GC operations occurred inside without-interrupts and hence their special free actions could not be run at that time. Excessive allocation occurred, caused by the enlargement of the table, which was always filled again quickly for the same reason. (By "excessive", I mean images bloating to over 1GB in very short order.)

    The recommended solution from lispworks-support was to manually mark-and-sweep generation 0 every 1000 or so allocations. Without it, you'd occasionally get generation 0 trying to climb over 1GB while you're sat there wondering why your emacs is running so slow.
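
To make the workaround in item 5 a bit more concrete, here is a rough sketch of the counting logic, written in Python rather than Lisp so it can be run anywhere. It is not the jfli code: CollectEvery, note_allocation and make_wrapper are names invented for illustration, and gc.collect(0) merely stands in for a mark-and-sweep of generation 0. The idea, as I understand it, is that the collection gets triggered from ordinary code rather than from the table enlargement inside without-interrupts, so the pending free actions get a chance to run.

```python
import gc


class CollectEvery:
    """Trigger a youngest-generation collection once every `interval` allocations.

    Hypothetical helper for illustration only; not part of jfli or LispWorks.
    """

    def __init__(self, interval=1000, collect=lambda: gc.collect(0)):
        self.interval = interval
        self.collect = collect      # stand-in for a generation 0 mark-and-sweep
        self.count = 0              # allocations seen since the last collection
        self.collections = 0        # how many collections we have forced so far

    def note_allocation(self):
        self.count += 1
        if self.count >= self.interval:
            self.count = 0
            self.collections += 1
            self.collect()


ticker = CollectEvery(interval=1000)


def make_wrapper(payload):
    # Stand-in for creating an object that is flagged for a special free action.
    wrapper = {"payload": payload}
    ticker.note_allocation()
    return wrapper
```

The interval of 1000 is just the "1000 or so" from the advice above; it trades a little GC overhead against how much garbage can pile up between forced collections.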

Non-memory problems:

  1. The JFLI package exported box-integer and unbox-integer instead of the documented box-int and unbox-int. I restored the documented behaviour.

  2. I needed more configurable exception handling.

  3. No support for system building - you needed a live JVM connection in order to macroexpand source and so couldn't save the image (well, you could, but when you restarted it you wouldn't be able to connect).

Fixed by upgrading to LispWorks 5.1:

  1. Occurrences of java.lang.NullPointerException, java.lang.ArrayIndexOutOfBoundsException, etc. which had no explanation even after reading Sun Java sources.

    This turned out to be caused by a bug in the FLI, which could leave the CPU's direction flag set incorrectly in some cases. When the direction flag was set incorrectly, some optimized memory copying routines would corrupt adjacent objects. The bug, fixed in LispWorks 5.1, affected 32-bit x86 platforms running Linux, FreeBSD or Mac OS X (not Windows).
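
For anyone wondering how a stray direction flag can trash neighbouring objects, here is a toy simulation in Python. It assumes the copying routines use x86 string instructions along the lines of rep movsb, which the bug description above doesn't actually say, and the function names and memory layout are made up for the example. With the flag clear the copy walks upwards through memory; with it set, the same instruction walks downwards from the same starting addresses.

```python
def rep_movsb(memory, src, dst, count, direction_flag):
    """Toy model of x86 `rep movsb`: copy `count` bytes, one at a time.

    Both pointers step by +1 when the direction flag is clear and by -1
    when it is set, which is how the real instruction behaves.
    """
    step = -1 if direction_flag else 1
    for _ in range(count):
        memory[dst] = memory[src]
        src += step
        dst += step


def demo(direction_flag):
    # Hypothetical layout, four bytes each:
    #   [0:4]   padding      "----"
    #   [4:8]   source       "ABCD"
    #   [8:12]  neighbour    "nnnn"  (an unrelated object)
    #   [12:16] destination  "...."
    memory = bytearray(b"----ABCDnnnn....")
    rep_movsb(memory, src=4, dst=12, count=4, direction_flag=direction_flag)
    return bytes(memory)


print(demo(False))  # b'----ABCDnnnnABCD'  copy lands in the destination
print(demo(True))   # b'----ABCDn---A...'  neighbour overwritten, copy incomplete
```

In the second case only the first byte reaches the destination and the other three writes land in the neighbouring object.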





drj11.wordpress.com
2008-05-10 04:05 pm UTC
The Direction Flag thing is very similar to the recent GCC 4.x upgrade exposing a bug in many kernels: http://lwn.net/Articles/272048/


"The window of vulnerability is small, but was observed in SBCL"
enlivend
2008-05-11 07:17 pm UTC
More lisp?
