Misaligned reference links in full text #35

Open
dennlinger opened this issue Sep 30, 2019 · 11 comments

@dennlinger

dennlinger commented Sep 30, 2019

For some of the decisions (e.g., this one), the references are not aligned at all with the corresponding occurrences in the text.

Is there any way to work with the data prior to the annotation (as it is available through the JSON), to potentially help with investigating this?

@malteos
Contributor

malteos commented Oct 1, 2019

Hi @dennlinger,

thanks for your bug report. We are already aware of this bug but have not been able to fix it yet (see openlegaldata/legal-reference-extraction#1).

If the original text without any annotation would help you, we could provide it as an additional field in the API response.

Best,
Malte

@dennlinger
Author

Hi Malte,
unfortunately, I hadn't seen that bug report before. I was mainly wondering whether you could provide some of the actual samples used in the dataset for the live webpage (raw HTML before processing, maybe from the case referenced in the bug report) to help with debugging.

The test cases provided in legal-reference-extraction seem simple enough at first glance, and I assume you are checking for correctness on those anyways. I'm aware of legal-datasets, but that one is unfortunately empty as well.

I think the feature is extremely helpful if working properly, and could potentially be extended, if you are willing to accept contributions on this issue.

Best,
Dennis

@malteos
Contributor

malteos commented Oct 2, 2019

Contributions are always welcome!

I'll try to update the API accordingly within the next week.

@malteos
Contributor

malteos commented Oct 4, 2019

The decision content which is currently available via the API does not contain any annotations. Thus, it should not be affected by the reference extraction bug. The API serializer returns the content field that holds the HTML as we obtained it from the source.

For the UI, all annotations are added later (see https://github.com/openlegaldata/oldp/blob/master/oldp/apps/cases/models.py#L186-L209).

@fchrubasik

After running some tests (for example on this document), it seems the references are misaligned because of an HTML offset: special characters like "ö" are encoded in the HTML as entities such as "&ouml;". The references are placed as if they were applied to plain text, without taking these longer encodings into account, which results in the misalignment.
I am currently working on a bugfix for this issue together with @dennlinger.
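
To illustrate the offset problem described above, here is a minimal Python sketch. It is not the fix that was eventually merged; the function names `decode_with_offsets` and `to_raw_span` and the sample sentence are made up for illustration. It assumes that references are located by character offsets in the decoded plain text and that only semicolon-terminated HTML entities differ between the two strings (HTML tags are not handled here).

```python
import html
import re

# Matches semicolon-terminated named, decimal, and hex entities (assumption:
# the source HTML always terminates entities with ';').
ENTITY = re.compile(r"&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def decode_with_offsets(raw: str) -> tuple[str, list[int]]:
    """Decode entities in `raw` and record where each decoded character starts.

    Returns (plain, offsets) where offsets[i] is the index in `raw` at which
    plain[i] begins. A final sentinel entry equal to len(raw) lets exclusive
    end positions be mapped as well.
    """
    chars: list[str] = []
    offsets: list[int] = []
    pos = 0
    while pos < len(raw):
        match = ENTITY.match(raw, pos)
        if match:
            decoded = html.unescape(match.group())
            # Every character produced by this entity maps to the entity start.
            for ch in decoded:
                chars.append(ch)
                offsets.append(pos)
            pos = match.end()
        else:
            chars.append(raw[pos])
            offsets.append(pos)
            pos += 1
    offsets.append(len(raw))
    return "".join(chars), offsets


def to_raw_span(offsets: list[int], start: int, end: int) -> tuple[int, int]:
    """Translate a [start, end) span in the plain text into a span in `raw`."""
    return offsets[start], offsets[end]


# Hypothetical sample: the same sentence with entity-encoded characters.
raw = "Gem&auml;&szlig; &sect; 5 BGB gilt"
plain, offsets = decode_with_offsets(raw)

ref = "\u00a7 5 BGB"  # the reference as found in the plain text
start = plain.find(ref)
end = start + len(ref)

naive = raw[start:end]  # plain-text offsets applied directly to the HTML
raw_start, raw_end = to_raw_span(offsets, start, end)
mapped = raw[raw_start:raw_end]  # offsets corrected for entity lengths

print(repr(naive), repr(mapped))
```

Applying the plain-text offsets directly to `raw` selects the wrong characters, because each entity before the reference shifts the true position to the right. Translating the span through `offsets` first restores the alignment.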

@malteos
Contributor

malteos commented May 4, 2020

Hi @fchrubasik & @dennlinger

thanks again for your contribution! The last months have been really busy over here so I only today managed to finally deploy your changes to production. I'm really sorry for that!

I'm currently reprocessing all our documents with the changes (that might take 10hrs or so).

Did you end up doing anything with the citation data?

Best,
Malte

@malteos malteos closed this as completed May 4, 2020
@dennlinger
Author

Hi, thanks for incorporating the changes!
So far we haven't directly used the citations from openlegaldata, but had a Thesis project by another student working on Bafin data and European Directives.
As for this patch, let me know if there are any problems coming up. I think there is a chance that depending on your input format, some files are still processed incorrectly, but I'll happily check a bunch of documents once the changes are live. ;-)

Cheers,
Dennis

@dennlinger
Author

Not sure where to follow up on this, but the references still seem to be misaligned on the live server. Did we miss anything in the original bugfix that might still cause this?

@malteos
Contributor

malteos commented Sep 5, 2020

The case mentioned in the issue seems to have all references correct (https://de.openlegaldata.io/case/bag-2019-07-11-6-azr-4017). Do you have an example of references that are still misaligned?

@dennlinger
Author

I was specifically looking at the most recent "Urteil" at the time of writing (https://de.openlegaldata.io/case/bverwg-2020-08-06-6-b-1120). Great to see that the original issue is fixed, though!

@malteos
Contributor

malteos commented Sep 21, 2020

OK. Then let's reopen this one.

@malteos malteos reopened this Sep 21, 2020