BUG: Entities automatically generated from mentions in an investigation during cross reference #2994

Open
tom-claessens opened this issue Apr 12, 2023 · 7 comments
Labels: bug (Things that should work, but don’t), Critical (Issue that requires prompt attention)

@tom-claessens

Describe the bug
When cross-referencing an investigation with a multitude of uploaded documents in Aleph, the mentions within the documents are automatically converted into entities.

To Reproduce
Steps to reproduce the behavior:

  1. Create an investigation
  2. Upload a set of documents within the investigation, wait for them to be indexed and ingested
  3. Start cross-reference
  4. Mentions from the documents will now automatically be added to the entities within the investigation

Expected behavior
Ideally, cross-referencing should happen without generating entities within the investigation, so that cross-referencing consists of:

  • Entities that already exist within the investigation are cross-referenced with other investigations, datasets and mentions in documents (within this investigation and other investigations/datasets)
  • Mentions within documents are cross-referenced with other investigations, datasets and mentions in documents (within other investigations/datasets), without the mentions being added to this investigation
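
The expected behavior described in the two points above can be sketched as a function without side effects. This is a toy model, not Aleph's code: `Investigation`, `normalize` and `xref_without_side_effects` are hypothetical names, and exact matching on normalized names stands in for Aleph's real matching.

```python
from dataclasses import dataclass, field


@dataclass
class Investigation:
    """Toy stand-in for an investigation (hypothetical, not Aleph's data model)."""
    entities: set[str] = field(default_factory=set)  # names of real entities
    mentions: set[str] = field(default_factory=set)  # names extracted from documents


def normalize(name: str) -> str:
    """Rough normalization: lowercase and collapse whitespace."""
    return " ".join(name.lower().split())


def xref_without_side_effects(
    inv: Investigation, other_names: set[str]
) -> list[tuple[str, str]]:
    """Return (kind, name) matches for entities and mentions.

    The investigation is only read, never modified, so no mention
    becomes an entity as a result of cross-referencing.
    """
    known = {normalize(n) for n in other_names}
    matches = [("entity", n) for n in sorted(inv.entities) if normalize(n) in known]
    matches += [("mention", n) for n in sorted(inv.mentions) if normalize(n) in known]
    return matches
```

In this model a matched mention shows up in the results as a `"mention"` match, while `inv.entities` stays exactly as it was before the run.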

Aleph version
Latest version. Problem is encountered within the Aleph instance of Follow the Money (NL).

Screenshots
Example of unwanted generated entities
Screenshot from 2023-04-12 13-08-25

@tom-claessens tom-claessens added bug Things that should work, but don’t triage These issues need to be reviewed by the Aleph team labels Apr 12, 2023
@tillprochaska
Contributor

Hi @tom-claessens, thanks for opening this issue!

  3. Start cross-reference
  4. Mentions from the documents will now automatically be added to the entities within the investigation

Just to clarify: when you say "cross-reference", did you only trigger the automatic cross-reference process by clicking the "compute" button in the cross-referencing section? Or did you also manually rate the cross-referencing results ("Same"/"Unsure"/"Different")?

@tom-claessens-ftm

Hi @tillprochaska ,

I think it happened in both situations. I'm not entirely sure, as both my colleagues and I have encountered it. I think most of us are not inclined to rate all cross-reference results, as there are sometimes thousands of results to rate. Does this mean that Aleph is supposed to add new entities from the cross-reference results that were manually rated as "Same"?

@tillprochaska
Contributor

I will need to reproduce the issue and get some more information from others, as I'm not super familiar with the feature. If this is only happening for xref matches that are rated manually, I could imagine that this is intended behavior. I'll get back to you when I have more information.

@tillprochaska
Contributor

tillprochaska commented Apr 18, 2023

I have been able to reproduce this issue:

  1. Uploaded a PDF document that contains names of companies.
  2. Waited for Aleph to finish processing the document.
  3. Viewed the document and ensured Aleph had extracted the names of the companies as mentions.
  4. Navigated to the XREF section and manually triggered XREF.
  5. Waited for the XREF to complete.
  6. Searched for schema:Company within the investigation.
  7. The search results include the mentions extracted from the document.

When viewing these entities, you can actually see that they are still linked to the source document using the companiesMentioned/mentionedBy properties:

Screen Shot 2023-04-18 at 17 10 57

For further debugging, these logs may help to find the relevant parts of the source code that trigger this behavior. Note that "[Test] Entities generated from mentions" is the title of the investigation I created for testing.

Screen Shot 2023-04-18 at 17 15 32

@tillprochaska
Contributor

tillprochaska commented Apr 19, 2023

Additional context from @brrttwrks:

I too was able to recreate it, but only for investigations. I did not see the same behavior for datasets.

For the dataset, I did not get any results from the xref, nor were any extracted entities 'reified'.

Firstly, I think the behavior should be the same. At least that is my expectation. That it isn't happening for datasets means that xref for leaks and many of our bigger investigations are missing possible matches.

Also, if, in datasets, Aleph is automatically creating actual entities, then xref won't work from just one side, but in both directions. However, this means that dataset mentions currently aren't matched in either direction.

@tillprochaska tillprochaska removed the triage These issues need to be reviewed by the Aleph team label Apr 19, 2023
@Rosencrantz Rosencrantz added triage These issues need to be reviewed by the Aleph team Critical Issue that requires prompt attention labels May 2, 2023
@Rosencrantz Rosencrantz removed the triage These issues need to be reviewed by the Aleph team label May 2, 2023
@tillprochaska
Contributor

tillprochaska commented May 24, 2023

I was able to confirm that the current behavior is indeed intended. It was implemented some time ago as an "experiment" with the expectation that there would be more iterations to refine the feature in the future, but that never happened.

The idea behind it was the following: When Aleph extracts mentions of names from a document and is then able to find similar Person/Company entities in other datasets (e.g. in a companies registry or census database), it is likely that that name is the name of a person or company, respectively.
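
The heuristic described above can be illustrated with a minimal sketch. It is not Aleph's implementation: `promote_matched_mentions`, the dictionary shape of a mention, the `source_document` key and the case-insensitive exact-name comparison are all assumptions made for illustration.

```python
def promote_matched_mentions(mentions, reference_names):
    """Split mentions into promoted entities and untouched mentions.

    `mentions` is a list of dicts such as
    {"name": ..., "schema": ..., "document": ...} (hypothetical shape).
    `reference_names` maps a schema name to the set of names known for
    that schema in other datasets. A mention is promoted only when its
    name also appears among the reference names of the same schema.
    """
    promoted, remaining = [], []
    for mention in mentions:
        names = reference_names.get(mention["schema"], set())
        known = {n.casefold() for n in names}
        if mention["name"].casefold() in known:
            # The thread notes that generated entities stay linked to the
            # document they came from; the key name here is made up.
            promoted.append(
                {
                    "schema": mention["schema"],
                    "name": mention["name"],
                    "source_document": mention["document"],
                }
            )
        else:
            remaining.append(mention)
    return promoted, remaining
```

Under these assumptions, a mention whose name also appears in a companies registry would be promoted to a Company entity, while an extracted string with no counterpart elsewhere stays a plain mention.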

We do, however, understand that the current behavior is confusing and inconsistent and can lead to cluttered investigations, and we will consider adjusting or removing it.

@tillprochaska
Contributor

One additional small detail I just observed:

When cross-referencing a collection with mentions, entities are created as outlined in this thread. When I then delete the entity that was automatically created, the respective cross-referencing match is deleted as well (makes sense). When I re-run the cross-referencing, the mention is ignored, i.e., two cross-referencing runs with the same data lead to different results.
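
One way to pin down this observation is a check that runs cross-referencing twice on identical input and compares the results. The sketch below is only an illustration: `run_xref` is a hypothetical callable returning match tuples, it does not call Aleph's API, and `ForgetfulXref` merely imitates the reported symptom without claiming anything about its actual cause.

```python
from typing import Callable, Hashable, Iterable


def xref_is_repeatable(
    run_xref: Callable[[], Iterable[Hashable]], runs: int = 2
) -> bool:
    """Return True if every run of `run_xref` yields the same set of matches."""
    results = [frozenset(run_xref()) for _ in range(runs)]
    return all(result == results[0] for result in results)


class ForgetfulXref:
    """Toy xref that ignores a mention once it has been processed."""

    def __init__(self, mentions):
        self.mentions = set(mentions)
        self.seen = set()

    def __call__(self):
        # Only mentions that were not handled in an earlier run produce matches.
        fresh = self.mentions - self.seen
        self.seen |= fresh
        return {("mention", name) for name in fresh}
```

A stateless xref passes the check, whereas `xref_is_repeatable(ForgetfulXref(["Acme Ltd"]))` returns `False` because the second run yields no matches.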
