Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use pyld as the json-ld parser for rdflib #1836

Draft
wants to merge 27 commits into
base: main
Choose a base branch
from

Conversation

edmondchuc
Copy link
Contributor

@edmondchuc edmondchuc commented Apr 16, 2022

Summary of changes

This PR integrates pyld as the json-ld parser for rdflib. Pyld aims to conform to JSON-LD 1.1 and newer.

This change means that rdflib now depends on pyld.

Failing tests (count: 0)

Update: Tests passing.


Possibly related issues

I plan to look at the list of issues that are relevant to json-ld and see if this PR fixes them.

Checklist

  • Checked that there aren't other open pull requests for
    the same change.
  • Added tests for any changes that have a runtime impact.
  • Checked that all tests and type checking passes.
  • For changes that have a potential impact on users of this project:
    • Updated relevant documentation to avoid inaccuracies.
    • Considered adding additional documentation.
    • Considered adding an example in ./examples for new features.
    • Considered updating our changelog (CHANGELOG.md).
  • Considered granting push permissions to the PR branch,
    so maintainers can fix minor issues and keep your PR up to date.

@edmondchuc
Copy link
Contributor Author

pre-commit.ci autofix

@aucampia
Copy link
Member

one concern I have with this is that pyld was last updated in 2020, I don't mind this in principle, but I would prefer to actually incorporate their codebase rather than just use their parser, because if they are not maintaining it then it becomes a liability.

There is also the possiblity of incorporating this as an extra, where someone can install rdflib with pyld extra and then use the pyld parser instead.

@nicholascar
Copy link
Member

We don’t want to have more than one equivalent JSON-LD parser to maintain so if we adopt pyLD, we should drop our current one.

If pyLD is maintained and better than our current one and it can be integrated successfully, we should swap current for it.

If pyLD is not maintained, we could incorporated it and switch maintenance effort to it instead of current.

Let’s put this to @aucampia @ashleysommer @gjhiggins @white-gecko @niklasl and the maintainers of pyLD too and see what the feeling is!

@edmondchuc
Copy link
Contributor Author

pyld passes 98% of the tests for JSON-LD 1.1's Transform JSON-LD to RDF section.

From memory, rdflib's current JSON-LD parser for JSON-LD 1.1 is incomplete. I am not sure if this PR RDFLib/rdflib-jsonld#74 ever made it in when rdflib-jsonld was merged with rdflib core.

one concern I have with this is that pyld was last updated in 2020, I don't mind this in principle, but I would prefer to actually incorporate their codebase rather than just use their parser, because if they are not maintaining it then it becomes a liability.

That's fair and I totally agree. I should really set this PR to draft as its intention is really to start a conversation on how best to get rdflib to conform to JSON-LD 1.1 for parsing.

I also have ideas around taking pyld's implementation and modifying bits of it to integrate into rdflib. For example, one thing I don't like with this PR's implementation is its inefficiency in the parse() method. It first reads in the data into pyld's data structure before transforming it into a string of quads and then reading it in to rdflib's graph.

A more efficient implementation would be to monkey patch https://github.com/digitalbazaar/pyld/blob/316fbc2c9e25b3cf718b4ee189012a64b91f17e7/lib/pyld/jsonld.py#L3529 and directly add the triples into an rdflib graph instead of a pyld graph.

Let me know your thoughts. Cheers.

@edmondchuc edmondchuc self-assigned this Apr 17, 2022
@edmondchuc edmondchuc marked this pull request as draft April 17, 2022 13:16
@edmondchuc
Copy link
Contributor Author

In 91bf6ba, I monkey patched a few pyld functions to avoid creating pyld data structures. The parser's parse method now adds statements from the JSON-LD data directly into rdflib's store.

Slight speed improvement based on a JSON-LD dataset with 49,990 statements and a file size of 6.4MB.

Previous time:

real    0m9.703s
user    0m9.505s
sys     0m0.198s

real    0m9.795s
user    0m9.597s
sys     0m0.190s

real    0m9.748s
user    0m9.551s
sys     0m0.176s

New time:

real    0m8.969s
user    0m8.845s
sys     0m0.125s

real    0m8.965s
user    0m8.869s
sys     0m0.097s

real    0m9.030s
user    0m8.912s
sys     0m0.118s

@ghost
Copy link

ghost commented Apr 17, 2022

From memory, rdflib's current JSON-LD parser for JSON-LD 1.1 is incomplete.

I'd argue this perception is supported by the extensive list of skipped expected test failures

@edmondchuc
Copy link
Contributor Author

pre-commit.ci autofix

@edmondchuc
Copy link
Contributor Author

  • I need to fix the build errors in RTD.

There also needs to be a discussion to decide where the pyld and c14n packages should be. Currently the code is structured as:

rdflib/
├─ plugins/
│  ├─ parsers/
│  │  ├─ pyld/
│  │  │  ├─ pyld/
│  │  │  ├─ c14n/

I'm thinking it might be suitable to have them located under rdflib.tools or rdflib.extras.

@niklasl
Copy link
Member

niklasl commented Apr 18, 2022

Yes, this is an interesting move, which I do believe warrants a discussion before finalizing.

I was involved in a mail thread discussing what to do with rdflib-jsonld some years ago, before that code base was copied into RDFLib and the separate package was retired. I gather there's been no definitive changes, yet, to the decisions made back then, and that this PR is a continuation thereof?

It is a bit "after the fact" though in terms of rdflib-jsonld now having been copied into RDFLib. Dependers of RDFLib may now rely on these innards, so further deleting and introducing packages and modules does present some quite problematic instabilities.

IIRC I actually thought it might be worthwhile to do what this PR started out to do, i.e. depend on PyLD instead. But it appeared that the direction was to keep maintaining RDFLib's own implementation (which I did start, and had an original go at upgrading to 1.1 back then).

Apart from the relative merits and complexities of various implementations though, now we have the case of coyping code from another, and external, package to include within RDFLib itself.

I originally suggested keeping the JSON-LD processor as an external dependency, as JSON-LD processing is an involved affair which entails more than "just" maintaining a parser and serializer. Also, copying in code bases from other packages should not be done lightly, as that effectively creates a fork of it within the RDFLib code base. If it is done anyway for various reasons, and even while honoring the license of that external package, it might be argued that we're complicating matters with forks. Remember that JSON-LD may evolve further, not the least with RDF-star on the horizon of possible W3C endorsement somewhere down the line.

While I can see it both ways, I am still generally in favour of having an external dependency. (RDFLib does depend on external packages for e.g. XML and JSON parsing, HTTP processing, etc. It can be debated whether those are categorically different, but I believe the line to be drawn isn't as clear as it might first appear.)

It might be easier for RDFLib to swap out processors if the JSON-LD processor package is kept as an external dependency. Of course, that requires a stable interoperability API, which I'd be glad to discuss further. The rdflib-jsonld package was specifically designed to keep down unnecessary complexities by integrating tightly with RDFLib internals, but that might be possible to achieve on various levels, depending on how thin the contract with an external processor needs to be (and what the actual overhead will be in practise).

A middle ground could be to make an explicit fork of any external dependency into the RDFLib Github organization. (Specifically PyLD, as rdflib-jsonld was already owned by the RDFLib org.) Maintenance and API improvements there could go back upstream if so desired. Or for that matter, look into if PyLD needs maintainers and contribute there if changes are needed in order for RDFLib to fulfill its needs.

Speaking of implementations, I would like to point to my own TRLD as another candidate for discussion (perhaps elsewhere). It is faithfully following the JSON-LD 1.1 algorithms. While it is orthogonal and complementary to RDFLib in some ways, in others—for specific purposes which don't require a full RDF toolchain— it represents an alternative. It also has the explicit goal of being transpilable; currently to Java and JS. Its very permissive 0BSD license doesn't even require attribution (although it is appreciated of course). If we were to pursue the "copy some of the code" path, and in spite of my own opinion (attempting to honor a narrow ideal), that interestingly allows for anyone to take the parts needed and remake them as desired.

I'd gladly discuss these matters more in depth. Is this PR a good place though, or do we need a more general issue/discussion?

@aucampia
Copy link
Member

It is a bit "after the fact" though in terms of rdflib-jsonld now having been copied into RDFLib. Dependers of RDFLib may now rely on these innards, so further deleting and introducing packages and modules does present some quite problematic instabilities.

I think in part this is somewhat an issue on what the should be considered part of the public API of RDFLib, in my view the plugins should not be used directly, but I do understand that there may be cases where they are used directly but without any clear instruction on this you do have a fair point, by all indications it is part of the public API. And as you allude to further in your message, there is more the JSON-LD than parsing and serializing, and these other things should be part of the public API and should be maintained as such.

I originally suggested keeping the JSON-LD processor as an external dependency, as JSON-LD processing is an involved affair which entails more than "just" maintaining a parser and serializer.

To me part of the challenge with external libraries, or additional libraries is that it is quite resource intensive to just maintain RDFLib, to maintain the functionality as separate repos has a lot of overheads.

Also, copying in code bases from other packages should not be done lightly, as that effectively creates a fork of it within the RDFLib code base.

The challenge with pyld is that, even though it may be more compliant, it has not been maintained since 2020, so to me it is not really an option to depend on it without moving it into our codebase, I agree though, this takes on a significant burden, it is actually not easy for me to say at this point in time if it is worth it, I don't know the exactly level of compliance of rdflib's JSON-LD support or what it would take to make it fully compliant. To me the higher priority is getting a better picture on this, getting the JSON-LD test suite running entirely with failing tests marked with xfails so we know where we stand.

To summarise:

  • To me depending on https://github.com/digitalbazaar/pyld is not an option as I consider it dead given the last commit was on 2020-08-06.
  • We do have a challenge with JSON-LD 1.1 compliance that we should address - I'm not opposed to fixing the implementation we have, I think it is fixable, but I don't know who will have time when to fix it, it is not something that is very high on my list of priorities and if I were to be realistic I don't see this happening in the next 6 - 12 months.
  • We already have many repos in RDFLib which is not actively maintained, and may actually have been in a better state if they were in the main rdflib repo as they would benefit from general quality of life improvements we are making in RDFLib that we don't have time to replicate in all of them.
  • There are valid concerns regarding our public API and this change that we need to figure out.

I don't really know the best course of action, I suspect using https://github.com/niklasl/trld may be it. We need to reach out to the people form https://github.com/digitalbazaar/pyld to see what they think, maybe it is best to just ping them here: @dlongley @davidlehn @gkellogg

@gkellogg
Copy link

I actually authored most of the work to bring pyld up to 1.1 compliance although under contract to do so. Even though it's not my main focus, I would certainly be willing participate in helping to maintain pyld as part of RDFlib.

Trying to maintain two different Python libraries for JSON-LD seems like a pointless effort, and pyld has had a fair amount of work put into it to make it work similar to jsonld.js, which has some nice performance tweaks.

cc/ @davidlehn @azaroth

@nicholascar
Copy link
Member

Thanks for the eagerness @gkellogg, @edmondchuc for kicking this off and the recent integration work and @niklasl for the multiple implementation!

We will need to talk about this a bit more I think to see if we can all get behind one implementation. RDFLib maintainers should be having a phone call shortly, so we will discuss this then and report any thought back.

If anyone can think of anyone else that should be included here, please do bring them in too.

@rob-metalinkage
Copy link

see #1993 about ability to parse plain JSON using contexts not directly referenced, or in embedded objects.

@ghost
Copy link

ghost commented Jun 17, 2022

fwiw, pyld currently has a dependency tax: https://github.com/digitalbazaar/pyld/blob/master/requirements.txt

requests
aiohttp; python_version >= '3.5'
lxml
cachetools
frozendict ¹

¹ There's an unmerged PR changing the dependency on the (as-was unmaintained, now picked-up) frozendict to immutabledict

The api tests have some failures/errors:

FAILED (failures=11, errors=3, skipped=30)

@niklasl
Copy link
Member

niklasl commented Mar 19, 2023

I just want to note that anyone needing advanced JSON-LD support for RDFLib today can leverage what is already in place together with a dedicated JSON-LD processor. Here is a full example using RDFLib and PyLD together (using TRLD would work pretty much the same).

from rdflib import Graph

indata = {
    "@context": {
        "@version": 1.1,
        "dc": "http://purl.org/dc/terms/"
    },
    "@id": "http://example.org/about",
    "dc:title": {"@language": "en", "@value": "Someone's Homepage"},
    "dc:hasFormat": {"dc:format": "text/turtle"},
}

ctx = indata["@context"]

g = Graph()


# RDFLib's builtin json-ld support handles most forms, but is not a full-featured
# JSON-LD processor.
# g.parse(indata=indata, format='json-ld')

# Using PyLD to ensure full support of all compact forms.
from pyld.jsonld import expand  # type: ignore[import]

# The first integration point: expand the input data to ensure all data is captured.
g.parse(data={'@graph': expand(indata, None)}, format='json-ld')  # type: ignore[arg-type]

# Note: if you want prefix bindings from the compact form in the graph, use this:
g.parse(data={"@context": ctx}, format='json-ld')  # type: ignore[arg-type]

# Look at the results.
print(g.serialize(format='turtle'))


# To use the builtin automatic compaction, uncomment this.
# print(g.serialize(format='json-ld', auto_compact=True))

# Otherwise, get expanded JSON-LD from RDFLib.
from rdflib.plugins.serializers.jsonld import from_rdf
outdata = from_rdf(g)

# Then use the compaction you need from the processor (the second integration point).
from pyld.jsonld import compact
outdata = compact(outdata, {"@context": ctx})


# Serialize the native Python data structure as JSON text:
import json
print(json.dumps(outdata, indent=2))

There is just one internal detail exposed, the import of from_rdf. We could add a Graph.to_python() method to get a fairly simple form of JSON-LD-shaped structure out "for free", with the specific intention of letting JSON-LD processors work on that, rather than attempt to track every feature of them within RDFLib. I'm not sure it is a great idea, but one to consider (and when doing so, whether it should have an optional auto_compact flag, with is nice but more complex).

I do stand by the point that copying in PyLD (or TRLD for that matter) takes on a significant burden, and I would very much advice making PR:s or otherwise engaging in the maintenance of these separate JSON-LD processors. (TRLD is separate from PyLD for many reasons, one being that I need a very controlled source base to be able to transpile it into Java and Javascript.)

RDF graphs and JSON-LD are distinct enough in usage, in that "RDF users" would prefer getting the data out of the syntactic bounds and work with it as a proper graph, whereas "JSON users" may prefer the compact "dictionary tree" form along with one or more application-specific shortcuts/idiosyncrasies (such as @index, @nest or various type coercions), which JSON-LD caters for with some quite advanced mechanisms. As such, JSON-LD is not only an RDF syntax, but (for better or worse depending on your needs and goals) a particular way of working with linked data (formally based on RDF but with usage patterns if its own). I believe dealing with this difference is best done within distinct libraries (developed and maintained in sync with the evolution of JSON-LD).

That said, the upcoming RDF-star will require quite some work to cleanly integrate into all of these libraries, as it aims not to build upon RDF but to extend its semantics. I still believe that the arguments for separation stands though, as compact and framed forms are the ones most affected by further processing complexities.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants