
Memory Leak when restoring a Backup #4890

Open
adamretter opened this issue Apr 27, 2023 · 41 comments

@adamretter
Member

adamretter commented Apr 27, 2023

When restoring a clean instance of eXist-db 6.2.0 from a backup we are able to observe what we believe to be a memory leak in the Full-Text and Range indexes of eXist-db; to be clear I am referring to the implementations of both of those indexes that are backed by Lucene.

The database backup is approximately 40,000 XML documents (35GB).

Having configured the JVM for eXist-db to -XX:+HeapDumpOnOutOfMemoryError and -XX:+ExitOnOutOfMemoryError, we have captured the heap dumps from the JVM and analyzed them with YourKit Java Profiler.

Heap Dump 1

@dariok provided me with the file: java_pid236689.hprof (md5sum: 54c669ddba7532bee2449922f2fba1d5)

dump1-part1
dump1-part2

This shows that, of the 2.9GB available to eXist-db, 1.1GB is retained (at the time that eXist-db ran out of memory) by the RangeIndex.

Heap Dump 2

@dariok provided me with the file: java_pid318495.hprof (md5sum: 9a99f30986fe2579b8c54efe6547e05b)

dump2
Re-running with a clean eXist-db instance and restoring the same backup again but with 6GB of memory available to eXist-db, we see that 2.5GB and 1.1GB are retained (at the time that eXist-db ran out of memory) by the LuceneIndex and RangeIndex respectively.

Observations

  1. The problem is not reproducible with eXist-db 6.0.1.
  2. The git history shows that the only changes to those indexes were the PR Field perf revised #4541 which first shipped in eXist-db 6.1.0.

For reference, the collection.xconf in use looks like:

<collection xmlns="http://exist-db.org/collection-config/1.0">
  <index xmlns:mets="http://www.loc.gov/METS/" xmlns:meta="https://github.com/dariok/wdbplus/wdbmeta" xmlns:tei="http://www.tei-c.org/ns/1.0" xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <lucene>
      <analyzer id="std" class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>

      <text qname="tei:persName"/>
      <text qname="tei:placeName"/>
      <text qname="tei:orgName"/>
      <text qname="tei:title"/>
      <text qname="tei:text"/>
      <text qname="tei:note"/>
      <text qname="tei:p"/>
      <text qname="tei:ab"/>
      <text qname="tei:opener"/>
      <text qname="tei:closer"/>
      <text qname="tei:postscript"/>
      <text qname="tei:cell"/>
      <text qname="tei:label"/>
      <text qname="tei:item"/>
      <text qname="tei:quote"/>
      <text qname="tei:head"/>
      <text qname="tei:l"/>
      <text qname="tei:w"/>

      <inline qname="tei:unclear"/>
      <inline qname="tei:supplied"/>
      <inline qname="tei:gap"/>
      <inline qname="tei:ex"/>
      <inline qname="tei:hi"/>
      
      <ignore qname="tei:listPerson"/>
      <ignore qname="tei:listPlace"/>
      <ignore qname="tei:listOrg"/>
      <ignore qname="tei:listBibl"/>
      <ignore qname="tei:note"/>

      <ignore qname="tei:pb"/>
      <ignore qname="tei:fw"/>
      <ignore qname="tei:lb"/>
      <ignore qname="tei:pb" /> 
      <ignore qname="tei:pc"/>
    </lucene>
    <range>
      <create qname="mets:file">
        <field name="mets-file-id" match="@ID" type="xs:string"/>
      </create>
      <create qname="meta:file">
        <field name="file-path" match="@path" type="xs:string"/>
      </create>
      <create qname="@ref" type="xs:string"/>
      <create qname="@cRef" type="xs:string"/>
      <create qname="@type" type="xs:string"/>
      <create qname="@place" type="xs:string"/>
      <create qname="tei:rs">
        <field name="rs-type" match="@type" type="xs:string"/>
        <field name="rs-ref" match="@ref" type="xs:string"/>
      </create>
      <create qname="@when" type="xs:date"/>
      <create qname="@notBefore" type="xs:date"/>
      <create qname="@notAfter" type="xs:date"/>
      <create qname="tei:w" type="xs:string"/>
    </range>
  </index>
</collection>
@adamretter changed the title from "Memory Leak in eXist-db (Lucene) Full-Text and Range Lucene Index" to "Memory Leak in eXist-db's (Lucene) Full-Text and Range Indexes" Apr 27, 2023
@joewiz
Member

joewiz commented Apr 27, 2023

Are the .xconf files available for reference? There are various options in eXist's Lucene-based full text and range index configuration files. It might be helpful to know which aspects were in play in this case.

@dariok

dariok commented Apr 28, 2023

@adamretter added the collection.xconf above. For one of my test runs, I reduced that one by removing some (more or less random) parts from both the full text and the range index config but that did not seem to change anything.

If you need any additional info, just let me know.

@joewiz
Member

joewiz commented May 13, 2023

Concerned that this issue might affect the release of a forthcoming update to my application, which depends upon the facets and fields features and fixes included in eXist 6.2.0, I performed some tests to assess whether my application is affected.

Starting from a fresh, stock instance of eXist 6.2.0 (with all defaults, including 2GB Xmx), I carried out these steps:

  1. Stored all history.state.gov collection configuration files (captured from /db/system/config on a complete hsg-project system, as shown in the attached zip file) to ensure that all indexes are configured before any data is stored.
  2. Uploaded all hsg-project repositories via WebDAV to store and index the data. (I then confirmed via monex's indexing pane that a sample facet index was successfully created.)
  3. Triggered a java heap dump via jmap -dump:format=b,file=/tmp/heapdump.bin <pid>.

In analyzing the heap dump, we determined that after indexing completed, the instance retained about 40 MB of memory for Lucene indexes, out of a total of 392 MB of retained memory used - about 10%.

Based on these results, it appears that Lucene is holding onto memory that it should release after indexing is complete. While the retained memory did not cause eXist to crash, these findings are consistent with Dario's, and may indicate that a leak is present.

I would note that @dariok's dataset is about 8x larger than mine. (Mine weighs in at 4.37 GB in 37k XML documents, whereas Dario's is 35 GB with 40k documents). His also has a more extensive and granular set of index definitions, including <tei:w> - i.e., every individual word - whereas mine is at the <tei:div> (article) level.

@reinhapa reinhapa self-assigned this Jun 23, 2023
@reinhapa
Member

I did some more heap dump digging and saw a lot of objects held in the BTreeCache, as well as references to it from a ThreadLocal:
image

Here are some more details:
image

@adamretter Is there a way to access this cache via JMX and try to flush it?

@adamretter
Member Author

adamretter commented Jun 28, 2023

@reinhapa As per my screenshot at the top, I also saw objects in the BTreeCache with a retained size of about 1GB; however, I believe that to be correct and in line with the sizing of the caches in the conf.xml files.

I don't know of a way to clear the BTreeCache via JMX (see the org.exist.management.impl package); however, it should not be too hard to expose something if you need it.
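As a purely illustrative sketch (the CacheFlushMXBean interface, the CacheFlush class and the ObjectName below are invented for this example, and the method bodies are placeholders for whatever eviction API the cache actually offers), exposing such an operation would just need the standard JMX plumbing alongside the existing MBeans:

import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical MBean interface - eXist-db does not currently ship this.
public interface CacheFlushMXBean {
    long getCachedEntryCount();
    void flush();
}

// Hypothetical implementation; flush() would need to delegate to whatever
// invalidation/eviction method the underlying cache really provides.
class CacheFlush implements CacheFlushMXBean {
    @Override
    public long getCachedEntryCount() {
        return 0L; // placeholder: report the cache's current entry count
    }

    @Override
    public void flush() {
        // placeholder: evict the cache entries here
    }

    // Registers the MBean so it appears in JConsole/VisualVM and can be
    // invoked remotely over JMX.
    static void register() throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        server.registerMBean(new CacheFlush(),
                new ObjectName("org.exist.management:type=CacheFlush"));
    }
}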

@line-o
Member

line-o commented Jul 5, 2023

I was able to reproduce the memory leak using the original dataset provided by @dariok. With enough resources, about 10 GB of RAM, the process did finish successfully. After restoring the entire dataset of 35 GB, containing 26,874 files (XML as well as .html, .xsl, and .svg), almost 5 GB of RAM was still retained.

Screenshot 2023-07-04 at 14 24 36

There are ~59,000 instances of TriggerStatePerThread$TriggerState.
Since there are two instances for each restored item, that amounts to ~29,500 CREATE_DOCUMENT triggers - one for the BEFORE and one for the AFTER phase.

Screenshot 2023-07-05 at 11 32 00

I was then wondering whether I was missing some 3,000 documents until I discovered that CREATE_COLLECTION was also invoked:

Screenshot 2023-07-05 at 12 08 21

The original collection.xconf defines an update trigger as follows:

<triggers>
    <trigger event="update" class="org.exist.collections.triggers.XQueryTrigger">
        <parameter name="url" value="xmldb:exist:///db/apps/path/to/module.xql"/>
    </trigger>
</triggers>

With the collection.xconf as it is in the issue description (without any triggers configured) the outcome was very different.
RAM usage stayed within reasonable bounds (<2 GB) and all memory was freed afterwards (see screenshot).

Screenshot 2023-07-05 at 09 26 17

My analysis indicates that TriggerStates are held in memory indefinitely and are the cause of the leak. This is especially interesting because the events that fired are not the ones defined in the configuration, so one would not expect them to be fired at all.

@line-o
Member

line-o commented Jul 5, 2023

Just to be clear: based on my findings, it is unlikely that the changes to fields are related to this issue.

@line-o
Member

line-o commented Jul 5, 2023

@joewiz does your dataset define triggers?

@joewiz
Member

joewiz commented Jul 5, 2023

@line-o Not in production, but yes in the version that is being prepared to deploy to production: https://github.com/HistoryAtState/frus/blame/master/volumes.xconf#L2-L9. This is the version that I used in the results I shared above.

@reinhapa
Member

reinhapa commented Jul 5, 2023

@line-o @adamretter when would/should such a trigger event be processed/completed?

@line-o
Member

line-o commented Jul 6, 2023

@reinhapa I don't know / have yet to familiarise myself with the Trigger implementation.

@line-o changed the title from "Memory Leak in eXist-db's (Lucene) Full-Text and Range Indexes" to "Memory Leak when restoring a Backup" Jul 6, 2023
@adamretter
Member Author

@reinhapa @line-o This is very different to the memory profiles that I showed above. The profiles I posted clearly show the memory usage to be within the RangeIndex and LuceneIndex. Perhaps @line-o has found another different memory leak...

@adamretter
Member Author

TriggerStates are held in memory indefinitely

These are ThreadLocal instances and the intention is that they should be cleared when each Trigger completes - see TriggerStatePerThread#clear().
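As a simplified sketch (not the real TriggerStatePerThread code), the pattern looks roughly like this; if clear() is never invoked after a trigger finishes, the per-thread state stays reachable for the lifetime of the worker thread:

import java.util.ArrayDeque;
import java.util.Deque;

// Simplified stand-in for a per-thread trigger-state holder. Each thread
// accumulates states in its own ThreadLocal deque while triggers execute.
final class PerThreadTriggerStates {
    private static final ThreadLocal<Deque<String>> STATES =
            ThreadLocal.withInitial(ArrayDeque::new);

    static void push(String state) {
        STATES.get().push(state); // e.g. "BEFORE CREATE_DOCUMENT"
    }

    // Must be invoked once the trigger has completed. If it is skipped,
    // the deque (and everything it references) remains reachable from the
    // worker thread's ThreadLocal map for as long as that thread lives -
    // on pooled threads, effectively forever.
    static void clear() {
        STATES.remove();
    }
}

If the restore runs on pooled broker threads, every trigger invocation that skips the clear() step leaves its state pinned, which would be consistent with the tens of thousands of retained TriggerState instances reported above.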

@reinhapa
Member

reinhapa commented Jul 7, 2023

@reinhapa @line-o This is very different to the memory profiles that I showed above. The profiles I posted clearly show the memory usage to be within the RangeIndex and LuceneIndex. Perhaps @line-o has found another different memory leak...

Hmm, if I understand @line-o correctly, the import succeeded when given enough memory, and the Range and Lucene indexes free up their memory after the restore, right?

If the memory at the end is back at "normal" levels again, I would not directly call that part a memory leak in the classic sense, but rather an uncontrolled temporary overshoot (that we definitely need to look into).

@adamretter those triggers should be processed and removed afterwards, right? As for the Full-Text and Range indexes: could it be that they are not "fast" enough in processing and we experience some sort of memory back pressure there?

@line-o
Member

line-o commented Jul 7, 2023

@reinhapa Correct: with enough resources, restoring the backup succeeds with or without triggers configured, and only TriggerStates are not removed from memory afterwards.

@wolfgangmm
Member

Please note that after removing the trigger – while leaving the rest of the configuration untouched – the restore process went smoothly and memory consumption stayed within lower bounds, as far as I could see in @line-o's reports during testing. The Lucene-based indexes thus seem to behave as expected under normal conditions, i.e. if memory is not under pressure due to other memory leaks.

@adamretter
Member Author

It is very strange that we are seeing different things from the same process. I will attempt to reproduce my findings again when I have a moment. If I can reproduce them, perhaps I should post a video, as presumably there must be something different between our approaches to reproducing this.

@wolfgangmm
Member

Maybe you get different results because you used -XX:+HeapDumpOnOutOfMemoryError, which means that the system was already in a state in which it was fighting for life, thus showing misleading results? @line-o on the other hand made sure the process would end by giving it enough memory, so this could make a difference.

@adamretter
Member Author

@wolfgangmm That's an interesting idea! I didn't understand from @line-o's report that that was what he had done. I will try with and without -XX:+HeapDumpOnOutOfMemoryError and see if that makes a difference and report back.

@line-o
Member

line-o commented Nov 17, 2023

It has been 4 months since I reported my findings. Was anybody able to falsify or reproduce them in the meantime?

@dizzzz
Member

dizzzz commented Apr 24, 2024

ping...

@adamretter
Member Author

adamretter commented Jun 6, 2024

I think there are clearly two different things being reported in this issue:

  1. The memory leak I initially reported in the Full-Text and Range indexes (implemented atop Lucene)
  2. The issue reported by @line-o regarding the TriggerStatePerThread possibly not being cleared correctly.

I will try and find the Heap Dump that @dariok sent me previously and take a look into (1) as soon as I have some free time. The first thing I will do is provide a checksum of the heap dump file, and clearer steps to reproduce the issue.

For (1) if we assume that @line-o used the same Heap Dump from @dariok as I did, then I think he must have taken different steps to me as we got quite different results. If @line-o could provide a checksum of his heap dump file (so we can ensure we are using the same heap dumps), and exact steps to reproduce what he is seeing, I could then look into that also...

@dizzzz
Member

dizzzz commented Jun 17, 2024

Hi Adam, would you be available to discuss this subject in an upcoming community call?

@line-o
Member

line-o commented Jun 17, 2024

I am not using a heap dump, but reproduced the issue by restoring the backup into an instance with 16 GB of RAM:

  • first attempt with the backup unchanged
  • second attempt without triggers defined in the collection.xconf

@reinhapa
Member

reinhapa commented Jun 17, 2024

I think there are clearly two different things being reported in this issue:

  1. The memory leak I initially reported in the Full-Text and Range indexes (implemented atop Lucene)
  2. The issue reported by @line-o regarding the TriggerStatePerThread possibly not being cleared correctly.

@adamretter in my tests using the backup file given to me by @line-o and @dariok, I could only observe memory issues when triggers were activated with the backup file. I would very much like to discuss potential further tests and possible solutions in an upcoming community call, as suggested by @dizzzz. I find it hard to discuss it here.

@adamretter
Member Author

The first thing I will do is provide a checksum of the heap dump file

I have added the exact details now at the top of this thread in the description of the issue.

Hi Adam, would you be available to discuss this subject in an upcoming community call?

@dizzzz No, I will not attend the community calls. I stopped attending them as it became a toxic environment where I no longer felt welcome; I raised that issue with several people, and nothing has been done about it. If you want to discuss this issue with me, you are welcome to contact me directly.

@reinhapa You are also welcome to contact me directly.

I think as I stated above, there are clearly two different technical issues. The issue that I opened is about the memory leak with Lucene in eXist-db versions after 6.0.1. The two heap dump files that I was provided that clearly show this are detailed at the top of this issue.

I think the potential other issue that @line-o identified should be moved into its own new issue as it is different to this one, and conflating them in this single issue thread just leads to confusion.

@adamretter
Member Author

adamretter commented Jul 11, 2024

@line-o @reinhapa Would you be able to open a new issue for the other memory leak you believe you are seeing, please? Copy all the details there and add anything else that is needed to reproduce it step by step. I would like to see if I can also reproduce that one...

@line-o
Member

line-o commented Jul 11, 2024

I am all for it @adamretter, will do. For the record: I can not reproduce the issue you describe.

@adamretter
Member Author

I can not reproduce the issue you describe.

Yes, I was already clear on that. I am not sure why, as I and others can reproduce it. Anyway, I will work on some way of presenting further evidence... perhaps a screen recording showing the issue.

@line-o
Member

line-o commented Jul 12, 2024

I haven't seen anyone else commenting here that they were able to reproduce the issue.

@adamretter
Member Author

I haven't seen anyone else commenting here that they were able to reproduce the issue.

Well, there are clearly two people reporting this here - me and Dario. In addition, there are some customers of ours (Evolved Binary) who won't comment on GitHub issues...

@line-o
Member

line-o commented Jul 15, 2024

The issue as described here did not lead to a memory leak in my testing. The trigger definition needs to be added, which is the case in @dariok's original backup.

@adamretter
Member Author

adamretter commented Jul 15, 2024

It does lead to a memory leak in my and others' testing, even without the triggers being defined.

I don't see how repeating yourself that you can't see the issue helps anyone at all. Let's move forward please - as requested please feel free to open a separate issue for the separate thing you are seeing @line-o

@line-o
Member

line-o commented Jul 15, 2024

@dariok since you were the first person to notice a problem when you restored your backup: did you retest without triggers?

@adamretter
Member Author

adamretter commented Aug 28, 2024

@line-o Separately to this issue, we have diagnosed your Trigger issue and have a bugfix ready to send in. As I previously requested ([1], [2], and [3]), please could you open a new issue for your Trigger issue; we can then send a PR for you with the fix marked as closing that issue.

@adamretter
Member Author

@reinhapa @dizzzz I haven't heard back from @line-o about this; do you know if he is on holiday? We would like to get this fix sent in ASAP!

@line-o
Member

line-o commented Sep 2, 2024

@adamretter It is correct that I am on holidays and I won't be able to open a separate issue this week. I believe I will find time to do this next week, however. If you want to publish your PR in the meantime @dizzzz and @reinhapa may already be able to review the fix and the issue number can then be added afterwards.

@adamretter
Member Author

@line-o we will await your issue next week

@adamretter
Member Author

@line-o If you are back from your holidays now, would you be able to open an issue please?

@adamretter
Member Author

@line-o Hope you are ok... can we get an update please?

@adamretter
Member Author

The concerns of @line-o about an issue in XQuery Triggers have now been moved into a separate issue here: #5459
