-
Notifications
You must be signed in to change notification settings - Fork 560
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use pyld as the json-ld parser for rdflib #1836
base: main
Are you sure you want to change the base?
Conversation
pre-commit.ci autofix |
for more information, see https://pre-commit.ci
one concern I have with this is that pyld was last updated in 2020, I don't mind this in principle, but I would prefer to actually incorporate their codebase rather than just use their parser, because if they are not maintaining it then it becomes a liability. There is also the possiblity of incorporating this as an extra, where someone can install rdflib with pyld extra and then use the pyld parser instead. |
We don’t want to have more than one equivalent JSON-LD parser to maintain so if we adopt pyLD, we should drop our current one. If pyLD is maintained and better than our current one and it can be integrated successfully, we should swap current for it. If pyLD is not maintained, we could incorporated it and switch maintenance effort to it instead of current. Let’s put this to @aucampia @ashleysommer @gjhiggins @white-gecko @niklasl and the maintainers of pyLD too and see what the feeling is! |
pyld passes 98% of the tests for JSON-LD 1.1's Transform JSON-LD to RDF section. From memory, rdflib's current JSON-LD parser for JSON-LD 1.1 is incomplete. I am not sure if this PR RDFLib/rdflib-jsonld#74 ever made it in when rdflib-jsonld was merged with rdflib core.
That's fair and I totally agree. I should really set this PR to draft as its intention is really to start a conversation on how best to get rdflib to conform to JSON-LD 1.1 for parsing. I also have ideas around taking pyld's implementation and modifying bits of it to integrate into rdflib. For example, one thing I don't like with this PR's implementation is its inefficiency in the A more efficient implementation would be to monkey patch https://github.com/digitalbazaar/pyld/blob/316fbc2c9e25b3cf718b4ee189012a64b91f17e7/lib/pyld/jsonld.py#L3529 and directly add the triples into an rdflib graph instead of a pyld graph. Let me know your thoughts. Cheers. |
In 91bf6ba, I monkey patched a few pyld functions to avoid creating pyld data structures. The parser's parse method now adds statements from the JSON-LD data directly into rdflib's store. Slight speed improvement based on a JSON-LD dataset with 49,990 statements and a file size of 6.4MB. Previous time:
New time:
|
I'd argue this perception is supported by the extensive list of skipped expected test failures |
pre-commit.ci autofix |
for more information, see https://pre-commit.ci
…to pyld-parser
There also needs to be a discussion to decide where the pyld and c14n packages should be. Currently the code is structured as:
I'm thinking it might be suitable to have them located under |
Yes, this is an interesting move, which I do believe warrants a discussion before finalizing. I was involved in a mail thread discussing what to do with rdflib-jsonld some years ago, before that code base was copied into RDFLib and the separate package was retired. I gather there's been no definitive changes, yet, to the decisions made back then, and that this PR is a continuation thereof? It is a bit "after the fact" though in terms of rdflib-jsonld now having been copied into RDFLib. Dependers of RDFLib may now rely on these innards, so further deleting and introducing packages and modules does present some quite problematic instabilities. IIRC I actually thought it might be worthwhile to do what this PR started out to do, i.e. depend on PyLD instead. But it appeared that the direction was to keep maintaining RDFLib's own implementation (which I did start, and had an original go at upgrading to 1.1 back then). Apart from the relative merits and complexities of various implementations though, now we have the case of coyping code from another, and external, package to include within RDFLib itself. I originally suggested keeping the JSON-LD processor as an external dependency, as JSON-LD processing is an involved affair which entails more than "just" maintaining a parser and serializer. Also, copying in code bases from other packages should not be done lightly, as that effectively creates a fork of it within the RDFLib code base. If it is done anyway for various reasons, and even while honoring the license of that external package, it might be argued that we're complicating matters with forks. Remember that JSON-LD may evolve further, not the least with RDF-star on the horizon of possible W3C endorsement somewhere down the line. While I can see it both ways, I am still generally in favour of having an external dependency. (RDFLib does depend on external packages for e.g. XML and JSON parsing, HTTP processing, etc. It can be debated whether those are categorically different, but I believe the line to be drawn isn't as clear as it might first appear.) It might be easier for RDFLib to swap out processors if the JSON-LD processor package is kept as an external dependency. Of course, that requires a stable interoperability API, which I'd be glad to discuss further. The rdflib-jsonld package was specifically designed to keep down unnecessary complexities by integrating tightly with RDFLib internals, but that might be possible to achieve on various levels, depending on how thin the contract with an external processor needs to be (and what the actual overhead will be in practise). A middle ground could be to make an explicit fork of any external dependency into the RDFLib Github organization. (Specifically PyLD, as rdflib-jsonld was already owned by the RDFLib org.) Maintenance and API improvements there could go back upstream if so desired. Or for that matter, look into if PyLD needs maintainers and contribute there if changes are needed in order for RDFLib to fulfill its needs. Speaking of implementations, I would like to point to my own TRLD as another candidate for discussion (perhaps elsewhere). It is faithfully following the JSON-LD 1.1 algorithms. While it is orthogonal and complementary to RDFLib in some ways, in others—for specific purposes which don't require a full RDF toolchain— it represents an alternative. It also has the explicit goal of being transpilable; currently to Java and JS. Its very permissive 0BSD license doesn't even require attribution (although it is appreciated of course). If we were to pursue the "copy some of the code" path, and in spite of my own opinion (attempting to honor a narrow ideal), that interestingly allows for anyone to take the parts needed and remake them as desired. I'd gladly discuss these matters more in depth. Is this PR a good place though, or do we need a more general issue/discussion? |
I think in part this is somewhat an issue on what the should be considered part of the public API of RDFLib, in my view the plugins should not be used directly, but I do understand that there may be cases where they are used directly but without any clear instruction on this you do have a fair point, by all indications it is part of the public API. And as you allude to further in your message, there is more the JSON-LD than parsing and serializing, and these other things should be part of the public API and should be maintained as such.
To me part of the challenge with external libraries, or additional libraries is that it is quite resource intensive to just maintain RDFLib, to maintain the functionality as separate repos has a lot of overheads.
The challenge with pyld is that, even though it may be more compliant, it has not been maintained since 2020, so to me it is not really an option to depend on it without moving it into our codebase, I agree though, this takes on a significant burden, it is actually not easy for me to say at this point in time if it is worth it, I don't know the exactly level of compliance of rdflib's JSON-LD support or what it would take to make it fully compliant. To me the higher priority is getting a better picture on this, getting the JSON-LD test suite running entirely with failing tests marked with xfails so we know where we stand. To summarise:
I don't really know the best course of action, I suspect using https://github.com/niklasl/trld may be it. We need to reach out to the people form https://github.com/digitalbazaar/pyld to see what they think, maybe it is best to just ping them here: @dlongley @davidlehn @gkellogg |
I actually authored most of the work to bring pyld up to 1.1 compliance although under contract to do so. Even though it's not my main focus, I would certainly be willing participate in helping to maintain pyld as part of RDFlib. Trying to maintain two different Python libraries for JSON-LD seems like a pointless effort, and pyld has had a fair amount of work put into it to make it work similar to jsonld.js, which has some nice performance tweaks. cc/ @davidlehn @azaroth |
Thanks for the eagerness @gkellogg, @edmondchuc for kicking this off and the recent integration work and @niklasl for the multiple implementation! We will need to talk about this a bit more I think to see if we can all get behind one implementation. RDFLib maintainers should be having a phone call shortly, so we will discuss this then and report any thought back. If anyone can think of anyone else that should be included here, please do bring them in too. |
see #1993 about ability to parse plain JSON using contexts not directly referenced, or in embedded objects. |
fwiw, pyld currently has a dependency tax: https://github.com/digitalbazaar/pyld/blob/master/requirements.txt
¹ There's an unmerged PR changing the dependency on the (as-was unmaintained, now picked-up) The api tests have some failures/errors:
|
I just want to note that anyone needing advanced JSON-LD support for RDFLib today can leverage what is already in place together with a dedicated JSON-LD processor. Here is a full example using RDFLib and PyLD together (using TRLD would work pretty much the same). from rdflib import Graph
indata = {
"@context": {
"@version": 1.1,
"dc": "http://purl.org/dc/terms/"
},
"@id": "http://example.org/about",
"dc:title": {"@language": "en", "@value": "Someone's Homepage"},
"dc:hasFormat": {"dc:format": "text/turtle"},
}
ctx = indata["@context"]
g = Graph()
# RDFLib's builtin json-ld support handles most forms, but is not a full-featured
# JSON-LD processor.
# g.parse(indata=indata, format='json-ld')
# Using PyLD to ensure full support of all compact forms.
from pyld.jsonld import expand # type: ignore[import]
# The first integration point: expand the input data to ensure all data is captured.
g.parse(data={'@graph': expand(indata, None)}, format='json-ld') # type: ignore[arg-type]
# Note: if you want prefix bindings from the compact form in the graph, use this:
g.parse(data={"@context": ctx}, format='json-ld') # type: ignore[arg-type]
# Look at the results.
print(g.serialize(format='turtle'))
# To use the builtin automatic compaction, uncomment this.
# print(g.serialize(format='json-ld', auto_compact=True))
# Otherwise, get expanded JSON-LD from RDFLib.
from rdflib.plugins.serializers.jsonld import from_rdf
outdata = from_rdf(g)
# Then use the compaction you need from the processor (the second integration point).
from pyld.jsonld import compact
outdata = compact(outdata, {"@context": ctx})
# Serialize the native Python data structure as JSON text:
import json
print(json.dumps(outdata, indent=2)) There is just one internal detail exposed, the import of I do stand by the point that copying in PyLD (or TRLD for that matter) takes on a significant burden, and I would very much advice making PR:s or otherwise engaging in the maintenance of these separate JSON-LD processors. (TRLD is separate from PyLD for many reasons, one being that I need a very controlled source base to be able to transpile it into Java and Javascript.) RDF graphs and JSON-LD are distinct enough in usage, in that "RDF users" would prefer getting the data out of the syntactic bounds and work with it as a proper graph, whereas "JSON users" may prefer the compact "dictionary tree" form along with one or more application-specific shortcuts/idiosyncrasies (such as That said, the upcoming RDF-star will require quite some work to cleanly integrate into all of these libraries, as it aims not to build upon RDF but to extend its semantics. I still believe that the arguments for separation stands though, as compact and framed forms are the ones most affected by further processing complexities. |
Summary of changes
This PR integrates pyld as the json-ld parser for rdflib. Pyld aims to conform to JSON-LD 1.1 and newer.
This change means that rdflib now depends on pyld.
Failing tests (count: 0)
Update: Tests passing.
Possibly related issues
I plan to look at the list of issues that are relevant to json-ld and see if this PR fixes them.
Checklist
the same change.
./examples
for new features.CHANGELOG.md
).so maintainers can fix minor issues and keep your PR up to date.