Tags: Minep/lunaix-os

feat/vmm-rework

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
A Total Overhaul on the Lunaix's Virtual Memory Model (#26)

* * Introducing a new declarative pte manipulation toolset.
  Prior to this patch, the original page table API was a simple and
  straightforward, yet very verbose design, as can be seen from the
  following characteristics:

        1. `vmm_set_mapping` is the only way provided to set a pte
           in the page table. It requires explicitly specifying the
           physical address, virtual address and pte attributes, which
           was by design, for the sake of comprehensiveness. However,
           we found that it is always accompanied by cumbersome address
           calculations and nasty pte bit-masking just to get these
           arguments right, especially when doing non-trivial mappings.

        2. The existing design assumes strict 2-level paging and a fixed
           4K page size, tightly coupled with x86's 32-bit paging. This
           makes it impossible to extend beyond these assumptions, for
           example, adding huge pages or supporting any non-x86 mmu.

        3. The interface for page table manipulation is not centralised;
           there is a vast number of eccentric, odd APIs dangling
           in the kboot area.

  In light of these limitations, we have redesigned the entire virtual
  memory interface, based on the realisation that a pointer to a pte
  already encodes enough information to complete any pte read/write at
  any level, and that pointer arithmetic automatically yields a valid
  pointer to the desired pte. This allows us to remove the bloat of
  invoking vmm_set_mapping.

  Architecture-dependent information related to PTEs is abstracted
  away from the generic kernel code base, giving a purely declarative
  PTE construction and page table manipulation.

* Refactor kboot to use the new API.

* Refactor the pfault handler.
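
The observation above, that a pointer to a pte already identifies the mapping it controls, can be illustrated with x86's 32-bit recursive (self-mounted) page table. This is a minimal sketch and not Lunaix's actual API: the mount base `TOY_MNT` and the helpers `toy_ptep_of`, `toy_ptep_va` and `toy_ptep_step_out` are hypothetical names, and addresses are modelled as plain 32-bit integers so the arithmetic can be checked on any host.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the page directory is self-mounted in its last slot, so all
 * page tables appear as one linear array of ptes starting at TOY_MNT. */
#define TOY_MNT      0xFFC00000u
#define TOY_PTE_SIZE 4u   /* size of a 32-bit x86 pte */
#define TOY_PG_SHIFT 12u  /* 4KiB pages */

/* Address of the last-level pte that maps `va`. */
uint32_t toy_ptep_of(uint32_t va)
{
    return TOY_MNT + (va >> TOY_PG_SHIFT) * TOY_PTE_SIZE;
}

/* Inverse: the virtual address controlled by the pte stored at `ptep`. */
uint32_t toy_ptep_va(uint32_t ptep)
{
    assert(ptep >= TOY_MNT && ptep % TOY_PTE_SIZE == 0);
    return ((ptep - TOY_MNT) / TOY_PTE_SIZE) << TOY_PG_SHIFT;
}

/* The entry one level up (the pde) is simply the pte that maps `ptep`. */
uint32_t toy_ptep_step_out(uint32_t ptep)
{
    return toy_ptep_of(ptep);
}
```

Under these assumptions, `toy_ptep_of(0xC0100000)` is `0xFFF00400`, advancing that pointer by one pte (4 bytes) moves the target address forward by one page, and stepping out yields the pde at `0xFFFFFC00`. The index has to be scaled by the pte size when it is turned into an address.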

* * Correct ptep address deduction to take the pte size into account,
  which previously resulted in an unaligned pte write

* Correct the use of memset and tlb invalidation when zeroing a
  newly allocated page table. Deduce the next-level ptep and use it
  accordingly

* Simplify the pre-boot stuff (boot.S) and move the setting of CRx into
  a more readable form.

* Allocate a new stack residing in higher-half memory for the bootstrapping
  stage, allowing us to free the bootctx safely before getting into lunad

* Adjust the bootctx helpers to work with the new vmm api.

* (LunaDBG) update the mm lookup to detect the huge-page mapping
  correctly

* * Dynamically allocate the page table when a ptep triggers a page fault
  by pointing to a pte that does not have a containing page table.
  Previously we always assumed that the table was allocated before the
  pte was written. This on-demand allocation greatly reduces the
  overhead, as we no longer need to go through all n levels just to
  ensure the hierarchy exists.

* The page fault handling procedure is refactored: we put all the
  important information, such as the faulting pte and eip, into a
  dedicated struct fault_context.

* State the definitions we have invented, to make things clear.

* Rewrite the vmap function with the new ptep feature; the reduction in
  LoC and complexity is significant.
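
The on-demand table allocation described above can be modelled in user space. The sketch below is only an illustration under stated assumptions: `struct toy_vms`, `toy_ptep_fault` and `toy_set_pte` are hypothetical names, a NULL check stands in for the page fault that a real store through a ptep would raise, and `calloc` stands in for the kernel's page allocator.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_ENTRIES 1024u

typedef uint32_t toy_pte_t;

/* A toy 2-level address space: tables[i] stays NULL until first needed. */
struct toy_vms {
    toy_pte_t *tables[TOY_ENTRIES];
    unsigned   tables_allocated;
};

/* Stands in for the page-fault path: allocate the missing, zeroed table. */
int toy_ptep_fault(struct toy_vms *vms, unsigned l0_index)
{
    assert(l0_index < TOY_ENTRIES);
    toy_pte_t *table = calloc(TOY_ENTRIES, sizeof(*table));
    if (!table)
        return -1;
    vms->tables[l0_index] = table;
    vms->tables_allocated++;
    return 0;
}

/* Write a pte without pre-building or walking the hierarchy first. */
int toy_set_pte(struct toy_vms *vms, uint32_t va, toy_pte_t val)
{
    unsigned l0 = va >> 22;
    unsigned l1 = (va >> 12) & (TOY_ENTRIES - 1);

    /* In a kernel this check is implicit: the store itself would fault. */
    if (!vms->tables[l0] && toy_ptep_fault(vms, l0) != 0)
        return -1;

    vms->tables[l0][l1] = val;
    return 0;
}
```

Two writes that land in the same 4MiB region allocate a single table in this model; a table is only created for regions that are actually touched.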

* * Use huge pages to perform fast and memory-efficient identity mapping
  of the physical address space (first 3GiB). Doing so enables us to
  eliminate the need for selective mapping based on the bootloader's mem_map.

* Correct the address calculation in __alloc_contig_ptes

* Change the behavior of the former pagetable_alloc to offload most
  pte setting to its caller, which makes it more portable. We also renamed
  it to 'vmm_alloc_page'

* Perform some formatting to make things easier to read.
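
The huge-page identity mapping mentioned above can be sketched as follows for x86 32-bit paging with 4MiB pages. This is an illustrative sketch, not the code in this patch: `toy_identity_map` and the `TOY_*` macros are hypothetical names, while the bit positions (present, writable, page-size) follow the x86 pde layout.

```c
#include <assert.h>
#include <stdint.h>

/* x86 32-bit pde bits (with PSE enabled): present, writable, page-size. */
#define TOY_PDE_P  0x001u
#define TOY_PDE_RW 0x002u
#define TOY_PDE_PS 0x080u

#define TOY_HUGE_SHIFT 22u /* 4MiB huge pages */

/* Identity-map [0, bytes) with 4MiB pages; returns the number of pdes used. */
unsigned toy_identity_map(uint32_t pd[1024], uint32_t bytes)
{
    unsigned count = bytes >> TOY_HUGE_SHIFT;

    /* The range must be a whole number of huge pages. */
    assert((bytes & ((1u << TOY_HUGE_SHIFT) - 1)) == 0);

    for (unsigned i = 0; i < count; i++)
        pd[i] = ((uint32_t)i << TOY_HUGE_SHIFT)
                | TOY_PDE_P | TOY_PDE_RW | TOY_PDE_PS;
    return count;
}
```

Mapping 3GiB this way takes 768 entries in a single page directory, whereas 4KiB pages would need 768 separate page tables (3MiB of table memory).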

* * Rewrite the vms duplication and deletion. Using the latest vmm
  refactoring, the implementation is much cleaner and more intuitive
  than before, although the LoC is slightly higher. The rewritten
  versions are named `vmscpy` and `vmsfree`, as they remove the
  assumption that the source vms is VMS_SELF

* Add `pmm_free_one` to allow the user to free a pmm page based on its
  attribute. This is intended to solve the recently discovered leakage
  of physical pages, where the old pmm_free_page lacked the ability
  to free a PP_FGLOCKED page allocated to a page table, thus
  resulting in pages that couldn't be freed by any means.

* Rename some functions for better clarity.
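
The leak described above can be modelled with a small sketch. Everything here is an assumption made for illustration: the real signatures and semantics of `pmm_free_page` and `pmm_free_one` are not shown in this message, the value of `TOY_PP_FGLOCKED` is made up, and the exact-match rule in `toy_free_one` is just one plausible reading of "free the page based on the attribute".

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: the attribute bit value is invented for this sketch. */
#define TOY_PP_FGLOCKED 0x1u

struct toy_ppage {
    unsigned refs; /* 0 means the page is free */
    unsigned attr;
};

/* Old behaviour as described: a locked page can never be released. */
bool toy_free_page(struct toy_ppage *pp)
{
    assert(pp);
    if (pp->refs == 0 || (pp->attr & TOY_PP_FGLOCKED))
        return false;
    pp->refs--;
    return true;
}

/* Attribute-aware free: the caller states which attribute it expects the
 * page to carry, and the page is released only when that matches. */
bool toy_free_one(struct toy_ppage *pp, unsigned attr)
{
    assert(pp);
    if (pp->refs == 0 || pp->attr != attr)
        return false;
    if (--pp->refs == 0)
        pp->attr = 0; /* a free page carries no attributes */
    return true;
}
```

In this model a page-table page marked `TOY_PP_FGLOCKED` is rejected by `toy_free_page`, but can be released by a caller that explicitly passes `TOY_PP_FGLOCKED` to `toy_free_one`.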

* * Rewrite the vmm_lookupat with new pte interface

* Adjust the memory layout such that the guest vms mount point is
  shifted to just before the vms self-mount point. This removes the
  effort of locating it and skipping it during vmscpy

* Add an empty thread obj as a placeholder, to prevent writes to an undefined
  location when a context save/store happens before the threaded environment
  is initialized

* * Fix more issues related to recent refactoring

     1. introduce pte_mkhuge to mark a pte as a leaf; previously this
        was conflated with the PS bit, which has another interpretation
        on a last-level pte

     2. fix the address increment in vmap

     3. invalidate the tlb cache whenever we dynamically allocate
        a page.

* (LunaDBG) rewrite the vm probing, employing the latest pte interface
  and making it much more efficient by actually doing a page walk rather
  than scanning linearly

* * Fix an issue where the bootstrap stack was so small that an overflow
  corrupted adjacent kernel structures

* Add assertions in pmm to enforce better consistency and invariants

* The page fault handler is now aware of ptep faults and assigns suitable
  permissions for level creation and page pre-allocation

* Ensure the mappings on dest_mnt are properly invalidated in the TLB cache
  after we set up the vms to be copied to.

* (LunaDBG) Fix the ptep calculation at specified level when querying an
  individual pte

* * Rework the vms mounts: they now have a more unified interface,
  which removes the burden of passing vm_mnt on each function call.
  It also allows us to track any dangling mount points

* Fix an issue where dup_kernel_stack used the stack top as the start
  address when copying, which caused the subsequent exec address to be
  corrupted

* Fix ptep_step_out failing on a non-VMS_SELF mount point

* Change the way assertion failures are reported: they are now
  reported directly without going into another sys-trap, thus
  preserving the stack around the failing point to ease our debugging
  experience.
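
The dup_kernel_stack fix above concerns where a stack copy has to start. As a minimal sketch, assuming a downward-growing stack whose "top" pointer sits one past its highest byte (`toy_dup_stack` is a hypothetical helper, not the kernel function):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy a downward-growing stack of `size` bytes. Both `*_top` pointers
 * point one past the highest byte, so the copy must start at top - size;
 * starting at top would read and write beyond the stack. */
void toy_dup_stack(void *dst_top, const void *src_top, size_t size)
{
    assert(dst_top && src_top);
    memcpy((char *)dst_top - size, (const char *)src_top - size, size);
}
```

For two 8-byte buffers `src` and `dst`, `toy_dup_stack(dst + 8, src + 8, 8)` leaves `dst` identical to `src` without touching memory outside either buffer.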

* * ensure the tail pte check is performed regardless of the pte value when
  walking the page table (e.g., vmsfree and vmscpy). Not doing so was
  previously a bug

* the self-mount point was located incorrectly, causing the wrong one
  to be freed (vmsfree)

* ensure we unref the physical page only when the corresponding pte is
  present (thus the pa is meaningful)

* add a flag in fault_context to indicate the mem-access privilege level

* address an issue where the stack start ptep calculation was off by 1, causing
  a destroyed thread to accidentally free the adjacent one's kernel stack
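
The privilege-level flag added to fault_context above maps naturally onto the x86 page-fault error code, where bit 0 reports a present page, bit 1 a write access and bit 2 a user-mode access. The sketch below is hypothetical: the field names, `TOY_MNT` and `toy_make_fault_context` are assumptions and do not reflect the actual layout of Lunaix's `struct fault_context`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* x86 page-fault error code bits. */
#define TOY_PF_PRESENT 0x1u
#define TOY_PF_WRITE   0x2u
#define TOY_PF_USER    0x4u

/* Assumption: recursive page-table mount base, as on 32-bit x86. */
#define TOY_MNT 0xFFC00000u

struct toy_fault_context {
    uint32_t fault_va;    /* faulting address (CR2 on x86) */
    uint32_t fault_ptep;  /* address of the pte that maps fault_va */
    uint32_t eip;         /* instruction that faulted */
    bool     was_present; /* protection fault rather than missing page */
    bool     is_write;
    bool     from_user;   /* privilege level of the memory access */
};

struct toy_fault_context toy_make_fault_context(uint32_t cr2, uint32_t eip,
                                                uint32_t errcode)
{
    struct toy_fault_context ctx = {
        .fault_va    = cr2,
        .fault_ptep  = TOY_MNT + (cr2 >> 12) * 4u,
        .eip         = eip,
        .was_present = (errcode & TOY_PF_PRESENT) != 0,
        .is_write    = (errcode & TOY_PF_WRITE) != 0,
        .from_user   = (errcode & TOY_PF_USER) != 0,
    };

    /* The pte of any address lies inside the 4MiB mount window. */
    assert(ctx.fault_ptep >= TOY_MNT);
    return ctx;
}
```

Collecting these values once lets the rest of a handler branch on named fields instead of re-decoding the raw error code.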

* * Purge the old page.h

* * Refactor fault.c to remove unneeded things from the arch-dependent side.

* (LunaDBG) add utilities to interpret pte value and manipulate the ptep

* * Add generic definition for arch-dependent pagetable

feat/threading

Support to multi-threading and pthread interface (POSIX.1-2008) (#23)

This patch brings functional multi-threading support to the Lunaix kernel,
together with the essential syscalls to support POSIX's pthread interface.

About the threading model in Lunaix
Like the Linux kernel, the threading feature is built upon the existing
multi-processing infrastructure. However, unlike Linux, which takes a lazier
yet clever approach and implements threads as a specialized process,
Lunaix implements threading in a way that directly reflects its orthodox
definition. This required a massive, from-scratch refactoring of the existing
process model. Doing so allows us to make things clearer and pursue the true
lightweightness that threads are supposed to have.

Kernel thread and preemptive kernel
As a natural result of our implementation, we have implemented the concept
of kernel threads, which are subsidiaries of a special process (pid=0) that runs
in kernel mode. Treating the kernel as a dedicated process, rather than as a
parasite on other processes, enables us to implement the advanced feature of a
preemptive kernel. Unlike in Linux, where the kernel is preemptible anywhere,
in Lunaix only functions called directly from a kernel thread can be
preempted, which allows more fine-grained control. This reduces
the effort of refactoring and eases the writing of new kernel code, for which the
non-preemptive assumption can be kept.

Spawning and forking
This patch introduces a set of tools for performing remote virtual memory space
transactions, allowing the kernel to inject data into another address space. They
will be used as infrastructure for kernel-level support of the `posix-spawn`
interface, which creates a process from scratch rather than forking from another.
This allows us to skip duplicating the process's VM space and reduces overhead.

LunaDBG
LunaDBG has been refactored to be modular and arch-agnostic. A new set of
commands has been added:

        mm: a sophisticated tool for examining page table mappings and performing
            physical memory profiling (for detailed usage, see the upcoming documentation)
     sched: tools for examining the scheduler context, listing all threads and
            processes

--------------
All changes included in this patch:

* * Signal mechanism refactor

   The sigctx of proc_info is changed to a pointer reference, and the
   sigact array now stores references as well. Therefore we can keep
   the overall proc_info and sigctx size small, and thus avoid the internal
   fragmentation of allocating a large cake piece.

   Some refactoring was also done on signal-related methods/structs to improve
   overall readability

* Temporarily remove the x87 state from context switching until a space-efficient
  workaround emerges

* Add a check for kernel-originated seg-faults and halt the kernel (for debugging),
  as by assumption the kernel mapping will always be present (at least for now, as
  page swapping and staging are not implemented in Lunaix yet).

* Re-group the fork related functions to a dedicated fork.c file

* Fix an incorrect check on the privilege level of the interrupt context when
  printing tracing
  printing tracing

* * Make proc_mm a pointer reference to further reduce the single allocation size
  as well as to make things more flexible

* Remove the need for a pid when allocating physical memory. Due to the complexity and
  the dynamics of physical page ownership, there is no point in doing such checking
  and tracking.

* Add some shortcuts for accessing commonly used proc_mm fields, to avoid nasty
  chains of cascading '->' for the sake of readability.

* * Introducing struct thread to represent a lightweight schedulable element.

  `struct thread` abstracts the execution context out of the process, while the
  latter is now composed only of descriptors to program resources (e.g., files, memory,
  installed signal handlers). This makes it possible to duplicate concurrent
  execution flows while imposing less kernel overhead (e.g., the cost of context
  switches, old-fashioned fork()-assisted concurrency).

  Such a change to the process model requires a vast amount of refactoring in almost every
  subsystem that uses processes directly, as well as additional
  tools to create the initial process. This commit only contains preliminary
  refactoring work; some parts that require additional workarounds are commented out and
  marked with `FIXME`

* Other refactoring and cleaning has been done to improve the semantics of certain
  pieces of code.

* * Process and thread spawning. This reduces the system overhead
  introduced by invoking fork to create a process. However, creating
  a process that houses an executable image is not implemented, as it requires
  remote injection of the user stack, which is still under consideration

* Introducing the kernel process and kernel threads. Prior to the threading
  patch, the dummy process was a terrible mimic of a kernel process
  and was used merely as a fallback when no schedulable process could be found.
  This made it rather useless and a waste of kernel object pool space.
  The introduction of threads and the new scheduler design gives Lunaix
  a fully functioning kernel thread, whose preemptiveness opens up the
  opportunity to integrate advanced, periodic, event-driven kernel
  tasks (such as memory region coalescing, lightweight irq handlers)

* Some minor refactorings are also performed to make things cleaner

* Update the virtual memory layout to reflect the current development

* * Fix the issue of the transfer context being injected at the wrong address
  because the page offset was somehow not considered

* Fix the refactoring and various compile-time errors

* Adjust lunadbg to work with the latest threading refactoring.

* Also fix the issue that lunadbg's llist iterator made a false
  assumption about the offset of the embedded llist_header.

* Rename spawn_thread -> create_thread. And introduce spawn_kthread
  to spawn a kernel thread within kernel process.

* Fix the issue of vmm_lookupat ignoring the present bit when
  doing pte probing

* Leaves some holes for later investigations

* * Make the threading feature work properly

* Fixed left-over issues as well as newly found ones:

    1. incorrect check on the thread state in 'can_schedule', leaving the
       scheduler unable to select a suitable thread even though one
       exists

    2. double free of struct v_file when destroying a process, caused
       by a missing vfs_ref_file in elf32_openat

    3. destory_thread removed the wrong thread from the global thread list

    4. thread-local kernel and user thread don't get released when
       destroying a thread

    5. lunad should spawn a new process for the user space initd rather than
       kexec on the current kernel process

    6. guard the end of the thread entry function with 'thread_exit'
       to prevent running over into the abyss.

    7. fix tracing.c printing the wrong context entering-leaving order

    8. zero-fill the first pde when duplicating the vm space to prevent
       garbage from interfering with the vmm

* * Allow each process to store their executable arguments

* Refactor the lunadbg toolset (done: process, thread and scheduler)

* * Fix: can_schedule() should check against the thread's state rather than the process state

* Remove the hack of using ebp in 'syscall_hndlr', to prevent it from
  interfering with the stack-walker

* Fine-tune the output of the tracer when encountering an unknown symbol

* (LunaDBG) Add implementation for examining sigctx
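
The can_schedule() fix above follows from the split between processes and threads: once the execution context lives in the thread, schedulability is a per-thread property. A minimal sketch, with hypothetical types and state names that are not taken from the Lunaix source:

```c
#include <assert.h>
#include <stdbool.h>

enum toy_state { TOY_READY, TOY_RUNNING, TOY_BLOCKED, TOY_TERMINATED };

/* The process only describes resources; it may still carry a state. */
struct toy_proc {
    enum toy_state state;
};

/* The thread carries the execution context and its own state. */
struct toy_thread {
    struct toy_proc *process;
    enum toy_state   state;
};

/* Decide on the thread's own state. Checking t->process->state instead
 * would hide a ready thread whenever a sibling thread is running. */
bool toy_can_schedule(const struct toy_thread *t)
{
    assert(t && t->process);
    return t->state == TOY_READY;
}
```

With a process marked `TOY_RUNNING` because one of its threads is on the CPU, a sibling thread in `TOY_READY` is still selectable, while a `TOY_BLOCKED` sibling is not.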

* * Add related syscall to manipulate threads

* Factorise the access of frame pointer and return address to abi.h

* Shrink the default per-thread user stack size to 256K, to account
  for the shortage of 32-bit virtual address space.

* Add check to kernel preemptible function context

* Add different test cases to exercise various pthread_* api

* * (My Little Pthread Test) Fix all sorts of issues found in the current threading model implementation
  with a set of simple pthread tests.

* Add more sanity checks to the tracing and pfault handler, to keep them from spamming the output stream when
  the failure is severe enough to cause infinite nesting (e.g., when the vm mapping of the kernel stack gets
  messed up)

* Add a guard page at the end of the thread-local kernel and user stacks to detect stack overflow

* Remove an unwanted interrupt enablement in ps2kbd.c (whose behaviour is undefined in the booting
  stage)

* Temporarily fix issues with the vmr dump py utils (they need to adapt to the new design sooner or later)

* Specify a cpu model for QEMU, which makes things more deterministic

* * Change the mmap flag for creating the thread-local user stack to non-FIXED.
  A previous experiment showed that under high concurrency,
  the calculation of the ustack location for a new thread can be affected and
  risks smashing an existing thread's ustack, causing undefined behaviour
  when returning from the kernel (as the stack address is implied from
  proc_info::threads_count). For this reason it should be treated as a
  hint to mem_map rather than a hard requirement.

* Re-implement the VMR allocation algorithm so that it treats the vicinity of
  the hinted address as a high-priority search area, rather than dumbly starting
  from the beginning.

* Remove the undesired pmm_free_page from __dup_kernel_stack, as we
  now skip the first 4MiB when duplicating the page table, and thus the
  ref counters for these physical pages are already 1 after fork. This
  has been identified as the root cause of a randomly appearing segfault
  during memory-intensive tasks such as test_pthread, as these falsely
  released physical pages get repurposed. However, this also raises
  a question about Lunaix's memory utilisation, as the next-free strategy
  is unlikely to revisit previously allocated pages when there is plenty of free
  space ahead. More effort should be put into investigating memory
  performance.

* Added more assertion and checks to enhance the robustness and ease
  the debugging experience.

* Adjust some output format and refactor the test_pthread code.
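
The hint-first VMR search described above can be sketched as a gap search over a sorted region list. This is one simple reading of "vicinity of the hinted address" (search upward from the hint, then fall back to the start of the address range); the actual algorithm in this patch may differ, and `struct toy_region`, `toy_scan` and `toy_find_free` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Occupied regions [start, end), sorted by start and non-overlapping. */
struct toy_region {
    uintptr_t start, end;
};

/* First gap of at least `len` bytes at or above `from`, ending no later
 * than `limit`. Returns 0 when no such gap exists. */
uintptr_t toy_scan(const struct toy_region *r, size_t n, uintptr_t from,
                   uintptr_t limit, uintptr_t len)
{
    uintptr_t cur = from;

    for (size_t i = 0; i < n; i++) {
        if (r[i].end <= cur)
            continue; /* region lies entirely below the cursor */
        if (r[i].start >= cur && r[i].start - cur >= len)
            break;    /* the gap before this region is large enough */
        cur = r[i].end; /* otherwise skip past the region */
    }
    return (cur <= limit && limit - cur >= len) ? cur : 0;
}

/* Prefer addresses at or above the hint; only then search from `base`. */
uintptr_t toy_find_free(const struct toy_region *r, size_t n, uintptr_t hint,
                        uintptr_t base, uintptr_t limit, uintptr_t len)
{
    assert(base != 0 && base <= limit);
    if (hint < base)
        hint = base;

    uintptr_t found = toy_scan(r, n, hint, limit, len);
    return found ? found : toy_scan(r, n, base, limit, len);
}
```

Because the hint is only a starting point, a request whose hinted slot is occupied slides to the next gap instead of overwriting an existing region, which is the property the non-FIXED ustack mapping relies on.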

* * (lunadbg) `mm` command for probing page table and physical memory profiling

* Add missing syscall-table doc

* Add more test cases related to pthread

* * (LunaDBG) decouple the pte related operation as arch-dependent feature

* (LunaDBG) adjust the output format

* * (LunaDBG) Refactor VMR dump

* * Adjust the thread id generation to lower the duplication ratio

* Capped the thread limit per process

* (LunaDBG) Fix the issue with display of percentage in pmem profiling

feat/posix-term

Verified

This commit was signed with the committer’s verified signature.
Minep Lunaixsky
feat: a better boot command line parser

fix: bugs in term interfacing

feat/kcmd

Minep Lunaixsky
feat: a better boot command line parser

fix: bugs in term interfacing

feat/dev-vga

Minep Lunaixsky
feat: gfxm: a layer provides user space access to low level interfacing of graphic adapter

chore: clear things up

feat/serial

Minep Lunaixsky
regression: test serial port r/w.

fix: uart register bitmap
fix: refine context switch trace message
feat: add a dedicated program to host all test routines

feat/kernel-trace

Minep Lunaixsky
feat: kernel stack tracing

refactor: move cpu.h to arch specific

feat/execv

Minep Lunaixsky
refactor: decouple the executable file implementations with execve functionality.

feat/mmap

regression: mmap for fd

fix: replace the %ebp register with %esi for passing the 5th arg when switching to the syscall dispatcher.
feat: support for anonymous mapping
refactor: mm_region interfaces
refactor: page fault handler clean up.
refactor: resolve cyclic dependencies between mm.h and fs.h
refactor: rename readdir to sys_readdir to distinguish readdir(3)
wip refactor: separating syscall definitions to userspace.

feat/iso9660

feat: (iso9660) rock ridge extension

fix: (pcache) over-reading the page cache