
Consider external serialization libraries #180

Open
alecandido opened this issue Dec 10, 2022 · 7 comments
Labels
enhancement (New feature or request) · output (Output format and management) · refactor (Refactor code)

Comments

@alecandido
Member

alecandido commented Dec 10, 2022

At the moment, I have written in #172 a mini-serialization library based on dataclasses; it is currently called DictLike.

Now, dataclasses already natively provides asdict() and a from_dict() analogue (i.e. MyDataClass(**mydict)).
This was the initial motivation to go down this road.
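The native round-trip mentioned above can be sketched with the standard library alone (the class and field names here are illustrative, not the project's actual types):

```python
# Minimal sketch of the dataclass "serialization" round-trip:
# asdict() to go to a plain dict, MyDataClass(**mydict) to come back.
from dataclasses import asdict, dataclass


@dataclass
class MyDataClass:
    name: str
    value: int


original = MyDataClass(name="a", value=1)
mydict = asdict(original)          # serialize to a plain dict
restored = MyDataClass(**mydict)   # the "from_dict" analogue
assert restored == original
```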

This is convenient, since it lets serialization preserve type information (the information lives in the runtime structure), and it is the new incarnation of the "input layer" currently implemented in yadism.
The idea is to progressively improve input checking by relying on custom type definitions, such that the internal library can make assumptions about those values.

Unfortunately, type hints and runtime classes are not exactly the same thing, and proper serialization with generic type hints is complex (consider that a union is a type hint, not a class, including MyType | None, formerly typing.Optional[MyType], and so is list[int], formerly typing.List[int]).
So taking care of internal types ourselves does not come at zero cost, especially because many typing features were introduced in later releases, while we need compatibility with py3.8 at the time of writing. This reflects the fact that types in Python, together with their whole ecosystem (of which dataclasses is part), are rather recent.
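The hint-vs-class mismatch can be seen directly with the standard library: a subscripted generic cannot be used with isinstance(), and has to be introspected instead (typing.List is used here for py3.8 compatibility):

```python
# Generic hints are not classes: they must be taken apart with
# get_origin()/get_args() rather than used for runtime checks.
from typing import List, Optional, get_args, get_origin

hint = List[int]
assert get_origin(hint) is list   # the runtime class behind the hint
assert get_args(hint) == (int,)   # the type parameters

# Optional[X] is a Union under the hood, again not a class:
assert type(None) in get_args(Optional[str])

# Using the hint directly as a class fails at runtime:
try:
    isinstance([], List[int])
except TypeError:
    print("subscripted generics cannot be used with isinstance()")
```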

In order to reduce the maintenance burden of the serialization part, it is worth considering external libraries, especially if they are popular enough.
A good example would be lidatong/dataclasses-json, which provides a @dataclass_json decorator that is the equivalent of my DictLike base class.
It is also compatible with py3.7 and even py3.6 (with the backport of dataclasses, which entered the standard library in py3.7).
Unfortunately, the project does not look so lively.

Other options are welcome.

@alecandido
Member Author

alecandido commented Dec 10, 2022

Even better: pydantic seems to do it, so we might get it together with #177.

@alecandido
Member Author

alecandido commented Dec 10, 2022

https://www.attrs.org/en/stable/why.html

attrs is another candidate, together with its companion cattrs.

@alecandido
Member Author

alecandido commented Dec 10, 2022

https://threeofwands.com/why-i-use-attrs-instead-of-pydantic/

This actually convinced me that we want pydantic rather than attrs, since we want to validate runcards. And if someone saves me the burden of writing the validator (or part of it), I am only grateful.
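A hedged sketch of the appeal: pydantic generates the validator from the type annotations. The Runcard fields below are purely illustrative, not the project's actual runcard schema (assumes pydantic is installed; works with both v1 and v2):

```python
# pydantic builds the validation logic from the annotated fields,
# so ill-formed runcards are rejected without hand-written checks.
from pydantic import BaseModel, ValidationError


class Runcard(BaseModel):
    order: int   # e.g. a perturbative order (illustrative field)
    pdf: str     # e.g. a PDF set name (illustrative field)


# Well-formed input passes validation:
card = Runcard(order=2, pdf="NNPDF40")
assert card.order == 2

# Ill-formed input is rejected for free:
rejected = False
try:
    Runcard(order="not-a-number", pdf="NNPDF40")
except ValidationError:
    rejected = True
assert rejected
```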

@felixhekhorn felixhekhorn added enhancement New feature or request refactor Refactor code output Output format and management labels Dec 12, 2022
@Fatal1ty

Hi, @alecandido

In order to reduce the burden of maintenance of the serialization part, it is worth to consider external libraries, especially if they are popular enough.

This is a good idea, especially if you are interested in serialization of variadic generics from the new PEP 646. They have a lot of edge cases, but I managed to cope with them in mashumaro, which I would suggest you try even if you have more common use cases.

@alecandido
Member Author

alecandido commented Dec 20, 2022

@Fatal1ty another rather nasty type that requires some special care here is npt.NDArray. Have you ever tested it?

@Fatal1ty

@Fatal1ty another rather nasty type that requires some special care here is npt.NDArray. Have you ever tested it?

For non-standard types like this one, a custom universal serialization strategy can be registered. But since npt.NDArray is not the same as np.ndarray, at the moment such a strategy has to be registered for every scalar type separately:

import io
from dataclasses import dataclass

import numpy as np
import numpy.typing as npt
from mashumaro import DataClassDictMixin
from mashumaro.config import BaseConfig
from mashumaro.types import SerializationStrategy


class NDArraySerializationStrategy(SerializationStrategy):
    def serialize(self, value: np.ndarray) -> str:
        # dump the array in .npy format and hex-encode the bytes
        tmp_io = io.BytesIO()
        np.save(tmp_io, value, allow_pickle=False)
        return tmp_io.getvalue().hex()

    def deserialize(self, value: str) -> np.ndarray:
        # decode the hex string and reload the .npy payload
        tmp_io = io.BytesIO(bytes.fromhex(value))
        return np.load(tmp_io, allow_pickle=False)


@dataclass
class C(DataClassDictMixin):
    x: npt.NDArray[np.float64]

    class Config(BaseConfig):
        serialization_strategy = {
            npt.NDArray[np.float64]: NDArraySerializationStrategy(),
        }

This could be unhandy when you have a lot of scalar variations, so I am thinking about allowing a strategy to be set for the origin type (np.ndarray in this case).

@felixhekhorn
Contributor

A commit from an earlier attempt is c6c2fa6.
