Free type variables and extracting type information #1
For 1, you should be able to just use a regular type variable. Perhaps it would help if you explained in more detail what you are trying to do. |
Thanks for the quick response!
I thought this might be the case but I was unsure.
For (2): for the sake of this question, I guess you could say that what I'm trying to do is equivalent to printing the type of an expression after type-checking. |
If you want to print out the type, then you just follow the graph nodes and print it as desired. Note that fully simplifying it can take exponential time in pathological cases. One of the key reasons that the base algorithm has worst case O(n^3) time is that it doesn't try to simplify anything, but you'll probably want to simplify it for display purposes. |
Apologies, I'm still a little confused. In the demo on this page: https://blog.polybdenum.com/2020/08/08/subtype-inference-by-example-part-6-numeric-types-and-operators.html I can type in the following expression without a type error.
Is this intentional? This seems like a malformed program that the type checker is accepting.
I'm not entirely sure what you mean by this. The existing type checker doesn't provide a way to 'follow' a |
Ah, the code is implicitly generating sum types? In this case, I am, if anything, more confused as to how one extracts that information from the type checker core. |
I think it's best to explain in more detail what you are trying to actually accomplish so we can figure out what the best way to do it is. Printing out inferred types is generally not useful, because the inferred types are often too large and complex for a human to understand anyway. |
To clarify: in practice (at least, in past compiler projects) I've done this by taking the inferred types from the typechecker and 'baking' them into concrete types (instantiating polymorphic functions as necessary) in order to produce a program where every term is associated with a concrete type, so that code generation can begin. My problem is that, currently, I cannot see a simple way to extract this sort of type information from |
The type system shown in cubiml was designed for transpilation to JS and was not designed for efficient static compilation. If that's what you want to do, you'll probably want to use a different type system. You can still use the same algorithm, and the same code, but you'll want to define the actual types and subtype relations differently. For example, you'll probably want to get rid of width subtyping for records. Doing so of course opens up new issues, like how to identify which record type a given field access is supposed to belong to. I'd recommend looking at OCaml and Haskell, which have wrestled with these issues. OCaml doesn't allow anonymous record types at all, while Haskell requires all field names to be unique. |
Once you've done that, you can extract all the type information you want from the type graph. For a variable, just look at the set of VTypeHeads and/or UTypeHeads that flow to it. |
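As a hedged sketch of that extraction step (the names below are hypothetical, and cubiml's real `VTypeHead` carries far more structure), a static backend that forbids implicit unions could "bake" a variable's type by requiring exactly one head to flow to it:

```rust
// Illustrative sketch: turning the set of value heads that flow to a
// variable into a single concrete type for code generation. This enum
// is a drastically simplified stand-in for cubiml's real VTypeHead.
#[derive(Debug)]
enum VTypeHead {
    Int,
    Str,
}

#[derive(Debug, PartialEq)]
enum Concrete {
    Int,
    Str,
}

// A static backend that forbids implicit unions can only "bake" a
// concrete type when exactly one head flows to the variable.
fn bake(heads: &[VTypeHead]) -> Result<Concrete, String> {
    match heads {
        [VTypeHead::Int] => Ok(Concrete::Int),
        [VTypeHead::Str] => Ok(Concrete::Str),
        [] => Err("unconstrained variable".to_string()),
        _ => Err(format!("ambiguous heads: {:?}", heads)),
    }
}

fn main() {
    assert_eq!(bake(&[VTypeHead::Int]), Ok(Concrete::Int));
    assert!(bake(&[VTypeHead::Int, VTypeHead::Str]).is_err());
    assert!(bake(&[]).is_err());
    println!("ok");
}
```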
Thanks for the information! I think perhaps I'll take some inspiration from |
First, let me thank you for the tutorial and sample code. This kind of system has practically all of the qualities I wanted for my toy language, and the tutorial is extremely thorough, informative and persuasive. :) A few (naive) questions:
Thanks again! |
To @devdtc (and @zesterer may be interested). I don't mean to derail this discussion, but you may be interested to have a look at my Simpler-sub project, which proposes a very simple algorithm for subtype inference by starting from Simple-sub (an analogue to cubiml) and cutting some corners. |
The simplest approach would be to just hack in a special Use type head that contains mutable state which tracks what types have been checked against it previously and returns a type error in the case of "incompatible" types. A more technically correct approach would be to use a hybrid of unification and biunification depending on which behavior you want in which parts of the type system, but that makes the typechecker more complex.

It is a vague question. Assuming you want to generate code based on inferred types, then that is a big no-no. The most important rule of global type inference is that types are derived from code, rather than the other way around. If you have both code derived from types and types derived from code, then your system no longer has well defined or predictable behavior. One way around this is to have a stratified type system, where you have two different kinds of types, one that is explicitly specified and usable in code generation and one that is inferred from the code.

That being said, there are ways to do the sort of thing derive macros are used for without the actual code generation part. For example, you could arrange things so that the same code can be used for any component type, thus obviating the problem.

Yes, that is correct. You would just say

This is something I've been thinking a lot about, but it's a tricky subject, so I'll just do a brief brain dump. The fundamental problem with existing systems languages is that their type systems combine multiple levels of abstraction, leading to much confusion and bugs. I think the real way forward is to have a language which makes the different levels of abstraction explicit. I'm not sure how that might look - the particular design would be a UX question. But what you really need is some way to specify how types behave logically, as well as controlling how they are represented after optimization, etc.
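A minimal sketch of that first suggestion (hypothetical names, and plain mutable state rather than whatever the real typechecker would use): a "use" head that remembers the first value head checked against it and rejects any later, different head.

```rust
// Illustrative sketch of the "special Use type head with mutable state"
// idea: record the first value head checked against this use site, and
// reject any subsequent, different head. Names are hypothetical.
#[derive(Debug, Clone, PartialEq)]
enum VHead {
    Int,
    Str,
}

struct ExactUse {
    seen: Option<VHead>, // mutable state: what has flowed here before
}

impl ExactUse {
    fn check(&mut self, v: &VHead) -> Result<(), String> {
        match &self.seen {
            Some(prev) if prev != v => {
                Err(format!("incompatible types: {:?} vs {:?}", prev, v))
            }
            Some(_) => Ok(()),
            None => {
                self.seen = Some(v.clone());
                Ok(())
            }
        }
    }
}

fn main() {
    let mut u = ExactUse { seen: None };
    assert!(u.check(&VHead::Int).is_ok()); // first head is recorded
    assert!(u.check(&VHead::Int).is_ok()); // same head: fine
    assert!(u.check(&VHead::Str).is_err()); // different head: rejected
    println!("ok");
}
```

This gives unification-like "exactly one type" behavior at selected use sites while leaving the rest of the biunification machinery untouched, which is the hack being described; the hybrid unification/biunification approach mentioned above would be the principled version.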
Depending on the focus of the language, you might start with just high level types and have optional annotations to provide optimization hints when the default behavior doesn't do what you want. Or you might do the reverse - start with low level physical types only and add the ability to build logical refinement types on top of them. It's all up to you.

Of course, realistic compilers have many different optimization passes, and the different passes often themselves operate at different levels of abstraction. To be fully correct, you'd want to be able to expose and control behavior at each intermediate stage. But this would obviously get out of hand pretty quickly, and I'm not sure what a good way to do it would be. On the bright side, even trying already puts you ahead of the state of the art. Real world compilers like LLVM are full of bugs because they can't even keep the levels of abstraction consistent within the compiler itself, let alone what is exposed to the programmer.

Another problem is that a cubiml-like approach might work if all you care about is emulating C, but any serious language designer will want a Rust-like lifetime system at the very least, and I do not think there is any way to have full, decidable lifetime inference. This means that you're going to need to require type annotations, at least for lifetimes, and that means a lot of language design questions start going down different paths. I'm not sure what a good solution to this problem is.

@LPTK The problem with exposing inferred types is that they can be exponentially large, and hence you automatically get exponential complexity. What makes cubiml unique is the worst-case cubic type inference, but that necessarily comes at the cost of not normalizing (and hence not exposing) inferred types. That being said, I'm not sure which approach is best.
But losing the polynomial complexity guarantees is a major downside, and it means that you'll probably need the programmer to supply at least partial type annotations in order to keep compilation time manageable. And that in turn opens the thorny question of how to specify them and how to teach the programmer when and where they need to add annotations. Anyway, I hope that helped. Please let me know if you have any more questions. |
This is fantastic and much more than I expected--thank you both! @LPTK I saw you post about simple-sub on r/ProgrammingLanguages but I had not seen simpler-sub, which might be a great fit. Will definitely take a look! @Storyyeller this is very helpful (especially the brain dump!). A few follow-ups:

1. Another more basic part of this question that I probably should have started with is: what is the meaning of a type like

2. That makes a lot of sense. Maybe a better thing to do is have macros like derive work on the AST of the type alias definition or use some EDSL for types that the macro can understand.

5. This is a very interesting way to frame the problem. By performant code though, I didn't mean quite to the extent of Rust. My impression is you probably don't need to complicate the system too much to get good, predictable performance. C#/F# get by for gaming because their value structs allow performance-critical stuff to be written in an efficient way that minimizes allocations. Maybe all that's needed is a distinction between something akin to cubiml objects and simple C-like structs that do not support subtyping and can be stored inline in arrays. GC overhead might be mitigated by configurable collection strategies (e.g. something like a function that calls a closure and tells the GC to use an arena and defer collections until it returns, at which point it can just free the arena). Anyway, just some ideas. :) I think you're right that there probably isn't a good way to give full control over the generated code while maintaining the level of abstraction of languages like the MLs. My

Re. the cost of normalizing inferred types--it should still be feasible to (eventually) have the compiler do it, even if only for the REPL, right? I think that's the only thing I'd want it for, playing around with Cubiml.
I get the argument that if the system is intuitive enough, printing types everywhere shouldn't be necessary, but it'd be helpful for people learning the system to see what's going on. zesterer's example is one that was counterintuitive to me coming from Haskell/OCaml (and having never used TS), for instance. |
I should have mentioned that in Simpler-sub, you don't need to have a top type. You can actually decide to reject things like

This actually allows you to implement types that have different, incompatible representations, and to compile them efficiently. I think this is specifically what you were asking for. A term like

In Simple-sub and cubiml, we could also achieve the same thing, but it'd be quite a bit more complicated. Simple-sub already normalizes inferred types in the simplification phase, so such bad least-upper bounds could be reported at simplification time. But there are subtleties around record types which require complicating the engine further (specifically, when computing the LUB of
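A rough sketch of this "no top type" idea, with hypothetical names: a least-upper-bound computation that fails outright on incompatible heads instead of widening to top, while records keep only their common fields, joined pointwise (one of the record subtleties alluded to above).

```rust
// Illustrative sketch: a least-upper-bound (LUB) computation that,
// instead of falling back to a top type, reports an error for
// incompatible types. Names are hypothetical, not Simpler-sub's API.
#[derive(Debug, Clone, PartialEq)]
enum Ty {
    Int,
    Str,
    Record(Vec<(String, Ty)>),
}

fn lub(a: &Ty, b: &Ty) -> Result<Ty, String> {
    match (a, b) {
        (Ty::Int, Ty::Int) => Ok(Ty::Int),
        (Ty::Str, Ty::Str) => Ok(Ty::Str),
        // Records: keep only the common fields, joining them pointwise.
        (Ty::Record(xs), Ty::Record(ys)) => {
            let mut fields = Vec::new();
            for (name, tx) in xs {
                if let Some((_, ty)) = ys.iter().find(|(n, _)| n == name) {
                    fields.push((name.clone(), lub(tx, ty)?));
                }
            }
            Ok(Ty::Record(fields))
        }
        _ => Err(format!("no least upper bound for {:?} and {:?}", a, b)),
    }
}

fn main() {
    assert_eq!(lub(&Ty::Int, &Ty::Int), Ok(Ty::Int));
    // Something like `if c then 1 else "x"` would be rejected here
    // instead of silently producing a useless union/top type.
    assert!(lub(&Ty::Int, &Ty::Str).is_err());
    println!("ok");
}
```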
I'm not convinced this really is a problem in practice. The original HM type inference system itself has exponential complexity, and that has not prevented its use in languages like OCaml, where programmers often leave no type annotations at all in their code bases. Human-written code does not tend to exhibit that sort of pathological complexity. Now, subtyping adds its own potential pathological cases, but I don't see why things would be fundamentally different in this context. If you always normalize and simplify types, you keep them small (or at a size similar to what you'd get in ML), and so the size of the program doesn't matter that much – it's the size of individual definitions that tends to matter, and these are generally not very big (or the program needs some refactoring): a few hundred lines of code at most.
To be clear, I'm pretty sure Simple-sub has the same complexity as cubiml. It's the simplification and normalization that's more expensive. But again, if program definitions remain of reasonable size and you do simplify and normalize each inferred type, you don't actually trigger pathological cases, in my (admittedly limited) practical experience. |
I suspect that the main reason that it is not a problem in "real world" codebases is that 1) in practice, people supply type annotations everywhere anyway and 2) in practice, people make relatively little use of the type system compared to what is possible and 3) even if they did, subtyping adds a lot of power compared to unification, and that comes at a (compile time) performance cost. The whole point of having worst case guarantees is that they let you actually make meaningful statements about the properties of the system, rather than just saying "well it works in practice if you don't actually make use of it much". You could say that about anything! |
I don't think that's it. There are many significant codebases in languages like SML, OCaml, F#, and even Haskell (before the convention to annotate the signatures of exported definitions was established) where basically no type annotations are provided. I have never heard of type inference being a bottleneck in these cases. In practice, it still behaves close to linearly with the size of the program, since the size of individual definitions is bounded.
Again looking at OCaml, you can see that they essentially emulate subtyping with row polymorphism in many places of the language – most notably in polymorphic variants and object types. Pure row polymorphism has some usability limitations, but note that for both of these features, OCaml enables recursive types by default, which places their expressiveness pretty close to MLsub. While it's true not many people use these features often, I have never heard of them causing type inference bottlenecks.
A complexity analysis is not very useful if it doesn't take into account the actual shape of the data that's meant to be processed. It just turns out that the shapes of programs humans write have very strong invariants, so that should be what's optimized for, IMHO. I agree it's nice to have worst-case complexity bounds in any algorithm (even when they won't normally be exercised), but it's a pretty big tradeoff that I wouldn't be willing to make. Most practical languages have horrible worst-case type checking complexity — in fact, most languages we actually use have undecidable type checking, including Rust, OCaml, and Haskell! https://3fx.ch/typing-is-hard.html#how-hard-is-type-checking-your-favorite-language |
I'm starting to write an implementation of cubiml, for learning purposes, and I think what people are missing in the if-then-number-else-string scenario is that if no code ever uses the result, of course it will type-check. That's just part of the deal with this kind of type-checker. If some code does use the value, it had better use it in a way where it can be a string or an int. At least, that's my understanding. And I suppose it's equivalent to Top "operationally" if you can't do anything with this type of value besides pass it around, including distinguishing it from other sorts of values besides int and string... though in another sense, it is not the same set as the set of all values, which would include functions, records, etc, right?

Anyway, despite the repeated assurances that the types from this checker are unwieldy and useless--and should not be shown to the user or used for further compiler features--I am still hoping to leverage them to provide an experience like the type-checking in TypeScript or Scala. I'm curious to look at them. |
That is correct. |
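A small sketch of that reading, with hypothetical names: the two branch heads simply accumulate on the if-expression's result, and an error only surfaces when some use site constrains what may flow into it.

```rust
// Illustrative sketch (hypothetical names): the result of
// `if c then 1 else "x"` just collects both value heads. No error is
// raised until some *use* constrains what may flow into it.
#[derive(Debug, PartialEq)]
enum VHead {
    Int,
    Str,
}

// A use site like the operand of `+`, which only accepts ints.
fn check_int_use(flows: &[VHead]) -> Result<(), String> {
    for h in flows {
        if *h != VHead::Int {
            return Err(format!("{:?} flows into an int-only use", h));
        }
    }
    Ok(())
}

fn main() {
    // Both branch heads flow to the if-expression's result.
    let branches = vec![VHead::Int, VHead::Str];
    // Merely binding or passing the value around checks nothing,
    // so the program type-checks; using it as a number does not:
    assert!(check_int_use(&branches).is_err());
    assert!(check_int_use(&[VHead::Int]).is_ok());
    println!("ok");
}
```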
Hi,
I'm very interested in using cubic biunification in a project of mine. However, I've two questions about the algorithm that I've not been able to figure out from the codebase and associated blog posts alone.