Can we have a property rearrangement in the SampleProcessing object? #664
Comments
Happy new year! ⭐ Any thoughts on this? |
@gszep your request is not in line with the current architecture of the AIRR Schema:
|
Hi @gszep It might be helpful for you to explain what you are trying to accomplish, and we can describe how to do that with the current AIRR data model.
This relationship already exists, except it is in the rearrangement record. In the rearrangement, the |
@bussec A |
@schristley I am writing an AIRR-compliant HDF5 file format where
This facilitates lazy, parallel, constant-time random access to metadata and rearrangement data. Since the metadata and rearrangements are stored together (HDF5 must be self-describing), I need somewhere to put the rearrangement data (as a field accessible within |
The AIRR Standard was designed to be able to capture both cases, with the structure of the Repertoire object flexible enough to do that. The original design was done this way because we did not want to dictate the structural relationships between these objects, and instead leave them up to the study designer and/or data curator. The Repertoire object currently has two definitions/uses (yes, this is a flaw in the design). It was initially thought of in the more biological sense, where it would represent all of the B cells/T cells in a single subject at the current time point. But it is also used as a general grouping of samples in a variety of different ways. So, at least from a standards definition, you can do what makes sense to you. For example, in the iReceptor repositories we have a 1:1 relationship between Repertoire and Sample (the If you are writing code that handles general AIRR Repertoire data as input, you have to handle all cases, which can be quite challenging... |
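To illustrate the "handle all cases" point, here is a minimal sketch that treats a Repertoire with one sample and a Repertoire with several samples the same way, by always iterating over the `sample` array. The toy data and the helper name `iter_sample_keys` are hypothetical; `repertoire_id`, `sample`, `sample_id` and `sample_processing_id` are AIRR field names, and the sketch assumes `sample_processing_id` is only unique within its repertoire, so the pair is used as the key.

```python
from typing import Dict, Iterator, List, Optional, Tuple


def iter_sample_keys(repertoires: List[Dict]) -> Iterator[Tuple[str, Optional[str]]]:
    """Yield (repertoire_id, sample_processing_id) for every sample entry.

    Works whether a repertoire holds one sample (1:1) or many (1:N),
    because it never assumes the length of the `sample` array.
    """
    for rep in repertoires:
        for sample in rep.get("sample", []):
            yield rep["repertoire_id"], sample.get("sample_processing_id")


# Hypothetical toy metadata: rep-1 is 1:1, rep-2 groups two samples.
repertoires = [
    {
        "repertoire_id": "rep-1",
        "sample": [{"sample_processing_id": "sp-1", "sample_id": "s1"}],
    },
    {
        "repertoire_id": "rep-2",
        "sample": [
            {"sample_processing_id": "sp-1", "sample_id": "s2"},
            {"sample_processing_id": "sp-2", "sample_id": "s3"},
        ],
    },
]

keys = list(iter_sample_keys(repertoires))
```

Note that `sp-1` appears in both repertoires here, which is why the key is the pair rather than `sample_processing_id` alone.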
This is what |
This is interesting. I've used HDF5 a few times but only as a user, and only with small datasets. How large can HDF5 files become, multi-TB? I know the HD means hierarchical data, but I'm not sure what flexibility it has to represent data models. Is it as flexible as JSON? Does HDF5 prefer data to be in normal form, like in an SQL database? As there is no AIRR standard HDF5 file format, you cannot technically be AIRR-compliant ;-D but I assume you mean to be "compliant" with the AIRR Data Model. The AIRR Data Model is flexible enough that it can be optimized for different file formats. Can HDF5 represent compound indexes like SQL? That is, most AIRR objects require at least two identifiers, for rearrangements that's
Ok, I don't know what HDF5 attributes and groups are, but a quick google shows me that this looks like best practice. How are you handling the JSON? Are you using some tool like this or is there a better representation?
one dimensional? hmm, the
This sounds like a hash function. Also random access implies lookup by The However, just the two identifiers is making one assumption, which is that all repertoires only have a single
Where does the user get the list of identifiers in the first place? Some repertoires can have millions of rearrangements, so we don't want to store this list of (compound?) identifiers in the Repertoire object, at least not for file formats like JSON. That blows up the size. But maybe this is reasonable for HDF5? Could you give a code example to do a simple HDF5 metadata query on repertoires, like on subject age or sex, then using that to access the rearrangements? Or at least how you imagine it working. Something like what we have in the python docs. |
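For reference, the access pattern being asked about (filter repertoires on metadata, then fetch their rearrangements) can be sketched without committing to any file format. This is only an illustration under assumptions: plain Python lists stand in for whatever the HDF5 layout would be, the data is made up, and `build_index` and `select_rows` are hypothetical helpers. The index lives outside the Repertoire objects, keyed on the compound identifier, so the metadata stays small.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Key = Tuple[str, str]  # (repertoire_id, sample_processing_id)

# Hypothetical toy data using AIRR field names.
REPERTOIRES = [
    {"repertoire_id": "rep-1",
     "subject": {"subject_id": "sub-1", "sex": "female"},
     "sample": [{"sample_processing_id": "sp-1"}]},
    {"repertoire_id": "rep-2",
     "subject": {"subject_id": "sub-2", "sex": "male"},
     "sample": [{"sample_processing_id": "sp-1"}]},
]
REARRANGEMENTS = [
    {"sequence_id": "seq-1", "repertoire_id": "rep-1",
     "sample_processing_id": "sp-1", "junction_aa": "CASSLGTDTQYF"},
    {"sequence_id": "seq-2", "repertoire_id": "rep-2",
     "sample_processing_id": "sp-1", "junction_aa": "CASSPGQGNEQFF"},
    {"sequence_id": "seq-3", "repertoire_id": "rep-1",
     "sample_processing_id": "sp-1", "junction_aa": "CASSQDRGYEQYF"},
]


def build_index(rows: List[Dict]) -> Dict[Key, List[int]]:
    """Map each compound key to the row numbers of its rearrangements."""
    index: Dict[Key, List[int]] = defaultdict(list)
    for row_number, row in enumerate(rows):
        key = (row["repertoire_id"], row["sample_processing_id"])
        index[key].append(row_number)
    return index


def select_rows(repertoires: List[Dict], index: Dict[Key, List[int]],
                predicate: Callable[[Dict], bool]) -> List[int]:
    """Return row numbers of rearrangements whose repertoire passes `predicate`."""
    selected: List[int] = []
    for rep in repertoires:
        if not predicate(rep):
            continue
        for sample in rep["sample"]:
            key = (rep["repertoire_id"], sample["sample_processing_id"])
            selected.extend(index.get(key, []))
    return selected


index = build_index(REARRANGEMENTS)
rows = select_rows(REPERTOIRES, index, lambda rep: rep["subject"]["sex"] == "female")
sequence_ids = [REARRANGEMENTS[i]["sequence_id"] for i in rows]
```

In a real file the row numbers would presumably be positions in an on-disk dataset; how they map onto HDF5 groups and datasets is not specified in this thread.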
There has been some recent dialog about storing AIRR Data in the h5ad file format, which is an HDF5-based format for storing AnnData files (https://anndata.readthedocs.io/en/latest/index.html), which in turn is used in packages like scirpy and other scverse packages (https://scverse.org/) to process single-cell omics files. If you are looking at storing AIRR Data in HDF5, you might want to look at what is going on in this area, as they might have solved this. Certainly a fair bit of thought has gone into it. The nice thing is that if you are also considering single-cell data, this gets you both! |
I should add that we are using the h5ad format in our single-cell data export and analysis of single-cell data from the AIRR Data Commons. We are not using the AIRR extension. We are just storing the Cell/GEX in an h5ad file, as it facilitates analysis with tools like Conga and CellTypist, which we currently have integrated into the iReceptor Gateway (https://gateway.ireceptor.org). |
I would like to request a rearrangement field in SampleProcessing as follows
This way each sample is linked to a rearrangement in a 1-to-1 relationship 🙏🏼