Tags: Minep/lunaix-os
A Total Overhaul of Lunaix's Virtual Memory Model (#26)

* * Introducing a new declarative pte manipulation toolset.

    Prior to this patch, the original page table API was a simple, straightforward, and yet very verbose design, which can be seen in the following characteristics:

    1. `vmm_set_mapping` is the only way provided to set a pte in the page table. It requires explicitly specifying the physical address, the virtual address and the pte attributes, which was by design, for the sake of comprehensiveness. However, we found that it is always accompanied by cumbersome address calculations and nasty pte bit-masking just to get these arguments right, especially when doing non-trivial mappings.
    2. The existing design assumes strict 2-level paging and a fixed 4K page size, tightly coupled with x86's 32-bit paging. This makes it impossible to extend beyond these assumptions, for example, adding huge pages or supporting any non-x86 mmu.
    3. The interface for page table manipulation is not centralised; there is a vast number of eccentric and odd APIs dangling in the kboot area.

    In light of these limitations, we have redesigned the entire virtual memory interface. We realised that the pointer to a pte already encodes enough information to complete any pte read/write at any level, and that pointer arithmetic automatically yields a valid pointer to the desired pte, allowing us to remove the bloat of invoking vmm_set_mapping. Architecture-dependent information related to PTEs is abstracted away from the generic kernel code base, giving a purely declarative PTE construction and page table manipulation.

  * Refactoring done on making kboot use the new api.
  * Refactoring done on pfault handler.

* * Correct ptep address deduction to take account of pte size, which previously resulted in an unaligned ptw write
  * Correct the use of memset and tlb invalidation when zeroing a newly allocated pagetable.
Deduce the next-level ptep and use it accordingly
  * Simplify the pre-boot stuff (boot.S): move the setting of CRx into a more readable form.
  * Allocate a new stack residing in higher-half memory for the bootstrapping stage, allowing us to free the bootctx safely before getting into lunad
  * Adjust the bootctx helpers to work with the new vmm api.
  * (LunaDBG) update the mm lookup to detect huge-page mappings correctly

* * Dynamically allocate the page table when a ptep triggers a page fault for pointing to a pte that does not have a containing page table. Previously we always assumed that the table is allocated before the pte is written. This on-demand allocation greatly reduces the overhead, as we otherwise need to go through all n levels just to ensure the hierarchy.
  * Page fault handling procedure is refactored; we put all the important information, such as the faulting pte and eip, into a dedicated struct fault_context.
  * State the definitions we have invented, to make things clear.
  * Rewrite the vmap function with the new ptep feature; the reduction in LoC and complexity is significant.

* * Use huge pages to perform fast and memory-efficient identity mapping of the physical address space (first 3GiB). Doing so enables us to eliminate the need for selective mapping based on the bootloader's mem_map.
  * Correct the address calculation in __alloc_contig_ptes
  * Change the behavior of the previous pagetable_alloc to offload most pte setting to its caller, which makes it more portable. We also renamed it to 'vmm_alloc_page'
  * Perform some formatting to make things easier to read.

* * Rewrite the vms duplication and deletion. Using the latest vmm refactoring, the implementation is much cleaner and more intuitive than before, although the LoC is slightly higher.
The rewritten versions are named `vmscpy` and `vmsfree`, as they remove the assumption that the source vms is VMS_SELF
  * Add `pmm_free_one` to allow the user to free a pmm page based on its attribute. This is intended to solve the recently discovered leak of physical pages, where the old pmm_free_page lacked the ability to free a PP_FGLOCKED page allocated to a page table, resulting in pages that couldn't be freed by any means.
  * Rename some functions for better clarity.

* * Rewrite vmm_lookupat with the new pte interface
  * Adjust the memory layout such that the guest vms mount point is shifted to just before the vms self mount point. This removes the effort to locate it and skip it during vmscpy
  * Add an empty thread obj as a placeholder, to prevent writes to an undefined location when a context save/store happens before the threaded environment is initialized

* * Fix more issues related to the recent refactoring
    1. introduce pte_mkhuge to mark a pte as a leaf, which previously confused the use of the PS bit that has another interpretation on a last-level pte
    2. fix the address increment in vmap
    3. invalidate the tlb cache whenever we dynamically allocate a page.
  * (LunaDBG) rewrite the vm probing, employing the latest pte interface and making it much more efficient by actually doing a page walk rather than scanning linearly

* * Fix an issue where the bootstrap stack is too small, so that the overflow corrupts adjacent kernel structures
  * Add assertions in pmm to enforce better consistency and invariants
  * Page fault handler is now aware of ptep faults and assigns suitable permissions for level creation and page pre-allocation
  * Ensure the mappings on dest_mnt are properly invalidated in the TLB cache after we set up the vms to be copied to.
  * (LunaDBG) Fix the ptep calculation at a specified level when querying an individual pte

* * Rework the vms mount; mounts now have a more unified interface, which removes the burden of passing vm_mnt on each function call.
It also allows us to track any dangling mount points
  * Fix an issue where dup_kernel_stack uses the stack top as the start address for copying, which causes the subsequent exec address to be corrupted
  * Fix ptep_step_out failing on non-VMS_SELF mount points
  * Change the way assertion failures are reported: they are now reported directly without going into another sys-trap, thus preserving the stack around the failing point to ease debugging.

* * ensure the tail pte check is performed regardless of the pte value when doing page table walking (e.g., vmsfree and vmscpy), which previously was a bug
  * the self-mount point was located incorrectly, causing the wrong one to be freed (vmsfree)
  * ensure we unref the physical page only when the corresponding pte is present (thus the pa is meaningful)
  * add a flag in fault_context to indicate the mem-access privilege level
  * address an issue where the stack start ptep calculation is offset by 1, causing a destroyed thread to accidentally free an adjacent one's kernel stack

* * Purge the old page.h

* * Refactor fault.c to remove unneeded things from the arch-dependent side.
  * (LunaDBG) add utilities to interpret pte values and manipulate the ptep

* * Add generic definition for arch-dependent pagetable
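
The claim that a pointer to a pte already encodes everything needed to reach any level can be illustrated with a small host-side sketch. It assumes x86 32-bit 2-level paging with the page directory mounted onto itself in its last slot (base `0xFFC00000`); the constant and function names here are hypothetical and are not taken from the Lunaix source, and addresses are modelled as plain `uint32_t` values so the arithmetic can be checked on any machine.

```c
#include <stdint.h>

/* Assumptions for illustration only (not Lunaix's actual definitions):
 * 4K pages, 4-byte ptes, page directory self-mounted in slot 1023. */
#define SK_PAGE_SHIFT 12u
#define SK_PTE_SIZE   4u
#define SK_SELF_MNT   0xFFC00000u

typedef uint32_t sk_vaddr_t;

/* Address of the last-level pte that maps `va`. Scaling the page number
 * by the pte size is what keeps the resulting address pte-aligned. */
static sk_vaddr_t pte_addr_of(sk_vaddr_t va)
{
    return SK_SELF_MNT + (va >> SK_PAGE_SHIFT) * SK_PTE_SIZE;
}

/* One level up: the entry that maps the table holding `ptep`.
 * Under a self-mount this is the same computation applied to the
 * pte's own address. */
static sk_vaddr_t pte_addr_parent(sk_vaddr_t ptep)
{
    return pte_addr_of(ptep);
}

/* Inverse direction: the virtual address mapped by the pte at `ptep`. */
static sk_vaddr_t va_mapped_by(sk_vaddr_t ptep)
{
    return ((ptep - SK_SELF_MNT) / SK_PTE_SIZE) << SK_PAGE_SHIFT;
}
```

In this model, advancing a pte address by one entry (4 bytes) corresponds to advancing the mapped virtual address by one page, which is why plain pointer arithmetic on a `pte_t *` is enough to walk a range of mappings.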
Support for multi-threading and pthread interface (POSIX.1-2008) (#23)

This patch brings functional multi-threading support to the Lunaix kernel, together with the essential syscalls to support POSIX's pthread interface.

About the threading model in Lunaix

Like the Linux kernel, the threading feature is built upon the existing multi-processing infrastructure. However, unlike Linux, which uses a lazier yet clever approach and implements threads as specialized processes, Lunaix implements threading in a way that reflects its orthodox definition. This requires a from-scratch, massive refactoring of the existing process model. Doing this allows us to make things clearer and pursue the true lightweightness that threads are supposed to have.

Kernel thread and preemptive kernel

As a natural result of our implementation, we have implemented the concept of kernel threads, which are subsidiaries of a special process (pid=0) that runs in kernel mode. Treating the kernel as a dedicated process, rather than a process parasite, enables us to implement the advanced feature of a preemptive kernel. Unlike in Linux, where the kernel is preemptive anywhere, in Lunaix only functions called directly from a kernel thread can be preempted, which allows more fine-grained control. This reduces the effort of refactoring and eases the writing of new kernel code, for which the non-preemptive assumption can be kept.

Spawning and forking

This patch introduces a set of tools for performing remote virtual memory space transactions, allowing the kernel to inject data into another address space. These will be used as infrastructure for kernel-level support of the `posix-spawn` interface, which creates a process from scratch rather than forking from another, allowing us to skip duplicating the process's VM space and reduce overhead.

LunaDBG

LunaDBG has been refactored to be modular and arch-agnostic.
A new set of commands has been added:

  mm: a sophisticated tool for examining page table mappings and performing physical memory profiling (for detailed usage, see the upcoming documentation)
  sched: tools for examining the scheduler context, listing all threads and processes

--------------

All changes included in this patch:

* * Signal mechanism refactor. The sigctx of proc_info is changed to a pointer reference, and the sigact array now stores references. We can therefore keep the overall proc_info and sigctx size small and avoid the internal fragmentation of allocating a large cake piece. Some refactoring was also done on signal-related methods/structs to improve overall readability
  * Temporary removal of x87 state from context switching until a space-efficient workaround emerges
  * Add a check for kernel-originated seg-faults and halt the kernel (for debugging), as by assumption the kernel mapping will always be present (at least for now, as page swapping and staging are not implemented in Lunaix yet).
  * Re-group the fork-related functions into a dedicated fork.c file
  * Fix an incorrect check on the privilege level of the interrupt context when printing tracing

* * Make proc_mm a pointer reference to further reduce the single allocation size as well as make things more flexible
  * Remove the need for a pid when allocating physical memory. Due to the complexity and dynamics of physical page ownership, there is no point in doing such checking and tracking.
  * Add some shortcuts for accessing commonly used proc_mm fields, to avoid nasty chains of cascading '->' for the sake of readability.

* * Introducing struct thread to represent a lightweight schedulable element. `struct thread` abstracts the execution context out of the process, while the latter is now composed only of descriptors to program resources (e.g., files, memory, installed signal handlers).
This makes it possible to duplicate concurrent execution flows while imposing much less kernel overhead (e.g., the cost of a context switch, old-fashioned fork()-assisted concurrency). Such a change to the process model requires a vast amount of refactoring in almost every subsystem that uses processes directly, as well as additional tools to create the initial process. This commit only contains preliminary refactoring work; some parts that require additional workarounds are commented out and marked with `FIXME`
  * Other refactoring and cleaning has been done to improve the semantics of certain pieces of code.

* * Process and thread spawning. This reduces the system overhead introduced by invoking fork to create a process. However, creating a process that houses an executable image is not implemented, as it requires remote injection of the user stack, which is still under consideration
  * Introducing kernel process and kernel threads. Prior to the threading patch, the dummy process was a terrible mimic of a kernel process, used merely as a fallback when no schedulable process could be found. This made it rather useless and a waste of kernel object pool space. The introduction of threads and the new scheduler design brings fully functioning kernel threads to Lunaix, whose preemptiveness opens the opportunity to integrate advanced, periodic, event-driven kernel tasks (such as memory region coalescing, lightweight irq handlers)
  * Some minor refactorings are also performed to make things cleaner
  * Update the virtual memory layout to reflect the current development

* * Fix the issue of the transfer context being injected at the wrong address because the page offset was somehow not considered
  * Fix the refactoring and various compile-time errors
  * Adjust lunadbg to work with the latest threading refactoring.
  * Also fix the issue that lunadbg's llist iterator made a false assumption about the offset of the embedded llist_header.
  * Rename spawn_thread -> create_thread.
And introduce spawn_kthread to spawn a kernel thread within the kernel process.
  * Fix the issue in vmm_lookupat that ignored the present bit when doing pte probing
  * Leave some holes for later investigation

* * Make the threading feature work properly
  * Fixed left-over issues as well as newly found ones:
    1. incorrect check of the thread state in 'can_schedule', leaving the scheduler unable to select a suitable thread even though one exists
    2. double free of struct v_file when destroying a process, caused by a missing vfs_ref_file in elf32_openat
    3. destory_thread removed the wrong thread from the global thread list
    4. thread-local kernel and user thread don't get released when destroying a thread
    5. lunad should spawn a new process for the user space initd rather than kexec on the current kernel process
    6. guard the end of the thread entry function with 'thread_exit' to prevent running over into the abyss.
    7. fix tracing.c printing the wrong context entering-leaving order
    8. zero-fill the first pde when duplicating the vm space to avoid garbage interfering with the vmm

* * Allow each process to store its executable arguments
  * Refactor the lunadbg toolset (done: process, thread and scheduler)

* * Fix: can_schedule() should check against the thread's state rather than the process state
  * Remove the hack of using ebp in 'syscall_hndlr', to prevent it from interfering with the stack-walker
  * Fine-tune the output of the tracer when encountering an unknown symbol
  * (LunaDBG) Add implementation for examining sigctx

* * Add related syscalls to manipulate threads
  * Factor the access of the frame pointer and return address out into abi.h
  * Shrink the default per-thread user stack size to 256K, to account for the shortage of 32-bit virtual address space.
  * Add a check for kernel preemptible function context
  * Add different test cases to exercise various pthread_* apis

* * (My Little Pthread Test) Fix all sorts of issues found in the current threading model implementation with a set of simple pthread tests.
  * Add more sanity checks to tracing and the pfault handler, to keep them from spamming the output stream when the failure is severe enough to cause infinite nesting (e.g., when the vm mapping of the kernel stack gets messed up)
  * Add a guard page at the end of the thread-local kernel and user stacks to detect stack overflow
  * Remove an unwanted interrupt enablement in ps2kbd.c (whose behaviour is undefined in the booting stage)
  * Temporarily fix issues with the vmr dump py utils (they need to adapt to the new design sooner or later)
  * Specify a cpu model for QEMU, which makes things more deterministic

* * Change the mmap flag for creating the thread-local user stack to non-FIXED. A previous experiment showed that in high-concurrency situations, the calculation of the ustack location for a new thread is affected and risks smashing an existing thread's ustack, causing undefined behaviour on return from the kernel (as the stack address is implied from proc_info::threads_count). For this reason it should be treated as a hint to mem_map rather than a hard requirement.
  * Re-implement the VMR allocation algorithm so that it takes the vicinity of the hinted address as the high-priority search area, rather than dumbly starting from the beginning.
  * Remove the undesired pmm_free_page from __dup_kernel_stack, as we now skip the first 4MiB when duplicating the page table. Thus the ref counters for these physical pages are already 1 after fork. This has been identified as the root cause of a randomly appearing segfault during memory-intensive tasks such as test_pthread, as these falsely released physical pages get repurposed. However, this also raises a question about Lunaix's memory utilisation, as the next-free strategy is unlikely to revisit previously allocated pages when there is plenty of free space ahead. More effort should be put into investigating memory performance.
  * Added more assertions and checks to enhance robustness and ease debugging.
  * Adjust some output formats and refactor the test_pthread code.

* * (lunadbg) `mm` command for probing page tables and physical memory profiling
  * Add missing syscall-table doc
  * Add more test cases related to pthread

* * (LunaDBG) decouple the pte-related operations as an arch-dependent feature
  * (LunaDBG) adjust the output format

* * (LunaDBG) Refactor VMR dump

* * Adjust the thread id generation to lower the duplication ratio
  * Cap the thread limit per process
  * (LunaDBG) Fix the issue with the display of percentages in pmem profiling
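
The hint-first VMR search mentioned in this patch (treating the requested stack address as a hint and searching near it before anything else) could look roughly like the sketch below. This is one possible reading, not the Lunaix implementation: it scans upward from the hint and falls back to a scan from the lowest address only when nothing fits. `struct region`, `scan_gap`, `find_free_hinted` and the demo data are all hypothetical names introduced for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* An occupied range [start, end). Hypothetical type, not Lunaix's mm_region. */
struct region {
    uintptr_t start;
    uintptr_t end;
};

/* Scan upward from `from` for the first free gap of `len` bytes below
 * `limit`. `r` must be sorted by address and non-overlapping.
 * Returns the gap's start, or 0 when nothing fits (address 0 is assumed
 * never to be a valid result). */
static uintptr_t scan_gap(const struct region *r, size_t n,
                          uintptr_t from, uintptr_t limit, uintptr_t len)
{
    uintptr_t cur = from;

    for (size_t i = 0; i < n; i++) {
        if (r[i].end <= cur)
            continue;              /* region lies entirely below the cursor */
        if (r[i].start >= cur && r[i].start - cur >= len)
            break;                 /* the gap in front of this region fits */
        cur = r[i].end;            /* otherwise jump past the region */
    }
    return (cur <= limit && limit - cur >= len) ? cur : 0;
}

/* Prefer a gap at or above `hint`; fall back to scanning from `base`. */
static uintptr_t find_free_hinted(const struct region *r, size_t n,
                                  uintptr_t base, uintptr_t limit,
                                  uintptr_t hint, uintptr_t len)
{
    uintptr_t start = hint < base ? base : hint;
    uintptr_t addr = scan_gap(r, n, start, limit, len);

    return addr ? addr : scan_gap(r, n, base, limit, len);
}

/* Demo address space [0x1000, 0x10000) with two occupied regions. */
static const struct region demo_regions[] = {
    { 0x1000, 0x3000 },
    { 0x8000, 0x9000 },
};

static uintptr_t demo_find(uintptr_t hint, uintptr_t len)
{
    return find_free_hinted(demo_regions, 2, 0x1000, 0x10000, hint, len);
}
```

With the demo data, a hint that collides with an occupied region (`0x8000`) is moved to the next free address above it rather than failing, which is the behaviour a non-FIXED mapping needs when two threads compute the same stack location.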
feat: a better boot command line parser fix: bugs in term interfacing
feat: gfxm: a layer provides user space access to low level interfacing of graphic adapter chore: clear things up
regression: test serial port r/w. fix: uart register bitmap fix: refine context switch trace message feat: add a dedicated program to host all test routines
feat: kernel stack tracing refactor: move cpu.h to arch specific
refactor: decouple the executable file implementations with execve functionality.
regression: mmap for fd fix: replace the %ebp register with %esi for passing the 5th arg when switching to the syscall dispatcher. feat: support for anonymous mapping refactor: mm_region interfaces refactor: page fault handler clean up. refactor: resolve cyclic dependencies between mm.h and fs.h refactor: rename readdir to sys_readdir to distinguish it from readdir(3) wip refactor: separating syscall definitions to userspace.
feat: (iso9660) rock ridge extension fix: (pcache) over-reading the page cache