Merge tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma

Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
 "There are three implementations for NUMA balancing, this tree
  (balancenuma), numacore which has been developed in tip/master and
  autonuma which is in aa.git.

  In almost all respects balancenuma is the dumbest of the three because
  its main impact is on the VM side with no attempt to be smart about
  scheduling.  In the interest of getting the ball rolling, it would be
  desirable to see this much merged for 3.8 with the view to building
  scheduler smarts on top and adapting the VM where required for 3.9.

  The most recent set of comparisons available from different people are

    mel:    https://lkml.org/lkml/2012/12/9/108
    mingo:  https://lkml.org/lkml/2012/12/7/331
    tglx:   https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

  The results are a mixed bag.  In my own tests, balancenuma does
  reasonably well.  It's dumb as rocks and does not regress against
  mainline.  On the other hand, Ingo's tests show that balancenuma is
  incapable of converging for these workloads driven by perf, which is bad
  but is potentially explained by the lack of scheduler smarts.  Thomas'
  results show balancenuma improves on mainline but falls far short of
  numacore or autonuma.  Srikar's results indicate we all suffer on a
  large machine with imbalanced node sizes.

  My own testing showed that recent numacore results have improved
  dramatically, particularly in the last week but not universally.
  We've butted heads heavily on system CPU usage and high levels of
  migration even when it shows that overall performance is better.
  There are also cases where it regresses.  Of interest is that for
  specjbb in some configurations it will regress for lower numbers of
  warehouses and show gains for higher numbers which is not reported by
  the tool by default and sometimes missed in reports.  Recently I
  reported for numacore that the JVM was crashing with
  NullPointerExceptions but currently it's unclear what the source of
  this problem is.  Initially I thought it was in how numacore batch
  handles PTEs but I no longer think this is the case.  It's possible
  numacore is just able to trigger it due to higher rates of migration.

  These reports were quite late in the cycle so I/we would like to start
  with this tree as it contains much of the code we can agree on and has
  not changed significantly over the last 2-3 weeks."

* tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
  mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
  mm/rmap: Convert the struct anon_vma::mutex to an rwsem
  mm: migrate: Account a transhuge page properly when rate limiting
  mm: numa: Account for failed allocations and isolations as migration failures
  mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
  mm: numa: Add THP migration for the NUMA working set scanning fault case.
  mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
  mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
  mm: sched: numa: Control enabling and disabling of NUMA balancing
  mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
  mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely task<->node relationships
  mm: numa: migrate: Set last_nid on newly allocated page
  mm: numa: split_huge_page: Transfer last_nid on tail page
  mm: numa: Introduce last_nid to the page frame
  sched: numa: Slowly increase the scanning period as NUMA faults are handled
  mm: numa: Rate limit setting of pte_numa if node is saturated
  mm: numa: Rate limit the amount of memory that is migrated between nodes
  mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
  mm: numa: Migrate pages handled during a pmd_numa hinting fault
  mm: numa: Migrate on reference policy
  ...
torvalds committed Dec 16, 2012
2 parents 11520e5 + 4fc3f1d commit 3d59eeb
Showing 45 changed files with 1,938 additions and 172 deletions.
3 changes: 3 additions & 0 deletions Documentation/kernel-parameters.txt
@@ -2032,6 +2032,9 @@ bytes respectively. Such letter suffixes can also be entirely omitted.

nr_uarts= [SERIAL] maximum number of UARTs to be registered.

numa_balancing= [KNL,X86] Enable or disable automatic NUMA balancing.
Allowed values are enable and disable.

numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
one of ['zone', 'node', 'default'] can be specified
This can be set from sysctl after boot.
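As a sketch of how the numa_balancing= option above gets wired up: a __setup() handler of roughly this shape parses the value at boot. This is an illustration only; set_numabalancing_state() is an assumed helper, and the actual parser added on the scheduler side of this series may differ in detail.

    /*
     * Sketch only: parse numa_balancing= from the kernel command line.
     * set_numabalancing_state() is assumed; the real implementation in
     * this series may differ.
     */
    static int __init setup_numabalancing(char *str)
    {
            if (!str)
                    return 0;
            if (!strcmp(str, "enable")) {
                    set_numabalancing_state(true);  /* turn balancing on */
                    return 1;
            }
            if (!strcmp(str, "disable")) {
                    set_numabalancing_state(false); /* turn balancing off */
                    return 1;
            }
            return 0;                               /* unrecognized value */
    }
    __setup("numa_balancing=", setup_numabalancing);

Booting with numa_balancing=disable then leaves the feature compiled in but inactive.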
1 change: 1 addition & 0 deletions arch/sh/mm/Kconfig
@@ -111,6 +111,7 @@ config VSYSCALL
config NUMA
bool "Non Uniform Memory Access (NUMA) Support"
depends on MMU && SYS_SUPPORTS_NUMA && EXPERIMENTAL
select ARCH_WANT_NUMA_VARIABLE_LOCALITY
default n
help
Some SH systems have many various memories scattered around
2 changes: 2 additions & 0 deletions arch/x86/Kconfig
@@ -22,6 +22,8 @@ config X86
def_bool y
select HAVE_AOUT if X86_32
select HAVE_UNSTABLE_SCHED_CLOCK
select ARCH_SUPPORTS_NUMA_BALANCING
select ARCH_WANTS_PROT_NUMA_PROT_NONE
select HAVE_IDE
select HAVE_OPROFILE
select HAVE_PCSPKR_PLATFORM
17 changes: 15 additions & 2 deletions arch/x86/include/asm/pgtable.h
@@ -404,7 +404,14 @@ static inline int pte_same(pte_t a, pte_t b)

static inline int pte_present(pte_t a)
{
-	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
+	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
+			       _PAGE_NUMA);
}

#define pte_accessible pte_accessible
static inline int pte_accessible(pte_t a)
{
return pte_flags(a) & _PAGE_PRESENT;
}

static inline int pte_hidden(pte_t pte)
@@ -420,7 +427,8 @@ static inline int pmd_present(pmd_t pmd)
* the _PAGE_PSE flag will remain set at all times while the
* _PAGE_PRESENT bit is clear).
*/
-	return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE);
+	return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE |
+				 _PAGE_NUMA);
}

static inline int pmd_none(pmd_t pmd)
@@ -479,6 +487,11 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)

static inline int pmd_bad(pmd_t pmd)
{
#ifdef CONFIG_NUMA_BALANCING
/* pmd_numa check */
if ((pmd_flags(pmd) & (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA)
return 0;
#endif
return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE;
}

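The new pte_accessible() helper exists because a pte that is merely _PAGE_PROTNONE/_PAGE_NUMA was never visible to the hardware page walker, so tearing it down does not require a TLB flush. A sketch of the kind of caller this enables, modelled on the generic ptep_clear_flush() touched elsewhere in this merge (the exact shape may differ):

    pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
                           pte_t *ptep)
    {
            pte_t pte = ptep_get_and_clear(vma->vm_mm, address, ptep);

            /* Only a pte the hardware could have cached needs flushing. */
            if (pte_accessible(pte))
                    flush_tlb_page(vma, address);
            return pte;
    }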
20 changes: 20 additions & 0 deletions arch/x86/include/asm/pgtable_types.h
@@ -64,6 +64,26 @@
#define _PAGE_FILE (_AT(pteval_t, 1) << _PAGE_BIT_FILE)
#define _PAGE_PROTNONE (_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)

/*
* _PAGE_NUMA indicates that this page will trigger a numa hinting
* minor page fault to gather numa placement statistics (see
* pte_numa()). The bit picked (8) is within the range between
* _PAGE_FILE (6) and _PAGE_PROTNONE (8) bits. Therefore, it doesn't
* require changes to the swp entry format because that bit is always
* zero when the pte is not present.
*
* The bit picked must always be zero both when the pmd is present and
* when it is not present, so that we don't lose information when we
* set it while atomically clearing the present bit.
*
* Because we share the same bit (8) with _PAGE_PROTNONE, this can be
* interpreted as _PAGE_NUMA only in places that _PAGE_PROTNONE
* couldn't reach, like handle_mm_fault() (see access_error in
* arch/x86/mm/fault.c, the vma protection must not be PROT_NONE for
* handle_mm_fault() to be invoked).
*/
#define _PAGE_NUMA _PAGE_PROTNONE

#define _PAGE_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | \
_PAGE_ACCESSED | _PAGE_DIRTY)
#define _KERNPG_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | \
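Since the shared-bit encoding above is the crux of the series, a self-contained user-space toy (not kernel code; the flag values mirror the x86 definitions) can demonstrate that pte_mknuma() hides a page from the hardware while the VM still considers it present:

    #include <assert.h>
    #include <stdint.h>

    #define _PAGE_PRESENT   (1ULL << 0)
    #define _PAGE_PROTNONE  (1ULL << 8)
    #define _PAGE_NUMA      _PAGE_PROTNONE  /* shared bit, as above */

    typedef uint64_t pteval_t;

    static int pte_present(pteval_t pte)
    {
            return (pte & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_NUMA)) != 0;
    }

    static int pte_numa(pteval_t pte)
    {
            return (pte & (_PAGE_NUMA | _PAGE_PRESENT)) == _PAGE_NUMA;
    }

    static pteval_t pte_mknuma(pteval_t pte)
    {
            pte |= _PAGE_NUMA;
            return pte & ~_PAGE_PRESENT;
    }

    int main(void)
    {
            pteval_t pte = _PAGE_PRESENT;   /* an ordinary mapped pte */

            pte = pte_mknuma(pte);
            assert(!(pte & _PAGE_PRESENT)); /* hardware faults on access */
            assert(pte_numa(pte));          /* ...and we can tell why */
            assert(pte_present(pte));       /* the VM still sees it mapped */
            return 0;
    }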
8 changes: 7 additions & 1 deletion arch/x86/mm/pgtable.c
@@ -301,6 +301,13 @@ void pgd_free(struct mm_struct *mm, pgd_t *pgd)
free_page((unsigned long)pgd);
}

/*
* Used to set accessed or dirty bits in the page table entries
* on other architectures. On x86, the accessed and dirty bits
* are tracked by hardware. However, do_wp_page calls this function
* to also make the pte writeable at the same time the dirty bit is
* set. In that case we do actually need to write the PTE.
*/
int ptep_set_access_flags(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep,
pte_t entry, int dirty)
@@ -310,7 +317,6 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
if (changed && dirty) {
*ptep = entry;
pte_update_defer(vma->vm_mm, address, ptep);
-	flush_tlb_page(vma, address);
}

return changed;
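For context, the caller pattern that still needs the pte write is the do_wp_page() one named in the comment above; roughly, as a sketch of the assumed shape rather than the verbatim mm/memory.c code:

    entry = pte_mkdirty(pte_mkwrite(orig_pte));
    if (ptep_set_access_flags(vma, address, ptep, entry, 1))
            update_mmu_cache(vma, address, ptep);
    /*
     * No flush_tlb_page() here any more: a stale, more restrictive TLB
     * entry on another CPU at worst causes a spurious fault that is
     * fixed up lazily.
     */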
110 changes: 110 additions & 0 deletions include/asm-generic/pgtable.h
@@ -219,6 +219,10 @@ static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
#define move_pte(pte, prot, old_addr, new_addr) (pte)
#endif

#ifndef pte_accessible
# define pte_accessible(pte) ((void)(pte),1)
#endif

#ifndef flush_tlb_fix_spurious_fault
#define flush_tlb_fix_spurious_fault(vma, address) flush_tlb_page(vma, address)
#endif
@@ -580,6 +584,112 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
#endif
}

#ifdef CONFIG_NUMA_BALANCING
#ifdef CONFIG_ARCH_USES_NUMA_PROT_NONE
/*
* _PAGE_NUMA works identical to _PAGE_PROTNONE (it's actually the
* same bit too). It's set only when _PAGE_PRESENT is not set and it's
* never set if _PAGE_PRESENT is set.
*
* pte/pmd_present() returns true if pte/pmd_numa returns true. Page
* fault triggers on those regions if pte/pmd_numa returns true
* (because _PAGE_PRESENT is not set).
*/
#ifndef pte_numa
static inline int pte_numa(pte_t pte)
{
return (pte_flags(pte) &
(_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
}
#endif

#ifndef pmd_numa
static inline int pmd_numa(pmd_t pmd)
{
return (pmd_flags(pmd) &
(_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
}
#endif

/*
* pte/pmd_mknuma sets the _PAGE_ACCESSED bitflag automatically
* because they're called by the NUMA hinting minor page fault. If we
* didn't set the _PAGE_ACCESSED bitflag here, the TLB miss handler
* would be forced to set it later while filling the TLB after we
* return to userland. That would trigger a second write to memory
* that we optimize away by setting _PAGE_ACCESSED here.
*/
#ifndef pte_mknonnuma
static inline pte_t pte_mknonnuma(pte_t pte)
{
pte = pte_clear_flags(pte, _PAGE_NUMA);
return pte_set_flags(pte, _PAGE_PRESENT|_PAGE_ACCESSED);
}
#endif

#ifndef pmd_mknonnuma
static inline pmd_t pmd_mknonnuma(pmd_t pmd)
{
pmd = pmd_clear_flags(pmd, _PAGE_NUMA);
return pmd_set_flags(pmd, _PAGE_PRESENT|_PAGE_ACCESSED);
}
#endif

#ifndef pte_mknuma
static inline pte_t pte_mknuma(pte_t pte)
{
pte = pte_set_flags(pte, _PAGE_NUMA);
return pte_clear_flags(pte, _PAGE_PRESENT);
}
#endif

#ifndef pmd_mknuma
static inline pmd_t pmd_mknuma(pmd_t pmd)
{
pmd = pmd_set_flags(pmd, _PAGE_NUMA);
return pmd_clear_flags(pmd, _PAGE_PRESENT);
}
#endif
#else
extern int pte_numa(pte_t pte);
extern int pmd_numa(pmd_t pmd);
extern pte_t pte_mknonnuma(pte_t pte);
extern pmd_t pmd_mknonnuma(pmd_t pmd);
extern pte_t pte_mknuma(pte_t pte);
extern pmd_t pmd_mknuma(pmd_t pmd);
#endif /* CONFIG_ARCH_USES_NUMA_PROT_NONE */
#else
static inline int pmd_numa(pmd_t pmd)
{
return 0;
}

static inline int pte_numa(pte_t pte)
{
return 0;
}

static inline pte_t pte_mknonnuma(pte_t pte)
{
return pte;
}

static inline pmd_t pmd_mknonnuma(pmd_t pmd)
{
return pmd;
}

static inline pte_t pte_mknuma(pte_t pte)
{
return pte;
}

static inline pmd_t pmd_mknuma(pmd_t pmd)
{
return pmd;
}
#endif /* CONFIG_NUMA_BALANCING */

#endif /* CONFIG_MMU */

#endif /* !__ASSEMBLY__ */
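Tying the helpers together: on a hinting fault the handler clears _PAGE_NUMA, asks the policy layer where the page should live, and migrates it if it is misplaced. A condensed sketch of the do_numa_page()-style flow this series introduces (locking, refcounting and fault statistics elided; mpol_misplaced() and migrate_misplaced_page() as declared later in this diff):

    static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
                            unsigned long addr, pte_t pte, pte_t *ptep)
    {
            struct page *page;
            int target_nid;

            /* Make the pte ordinary again so we don't refault forever. */
            pte = pte_mknonnuma(pte);
            set_pte_at(mm, addr, ptep, pte);
            update_mmu_cache(vma, addr, ptep);

            page = vm_normal_page(vma, addr, pte);
            if (!page)
                    return 0;

            /* Ask the memory policy where this page ought to be... */
            target_nid = mpol_misplaced(page, vma, addr);
            if (target_nid == -1)
                    return 0;       /* already well placed */

            /* ...and try to move it there; failure is tolerated. */
            migrate_misplaced_page(page, target_nid);
            return 0;
    }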
16 changes: 14 additions & 2 deletions include/linux/huge_mm.h
@@ -31,7 +31,8 @@ extern int move_huge_pmd(struct vm_area_struct *vma,
unsigned long new_addr, unsigned long old_end,
pmd_t *old_pmd, pmd_t *new_pmd);
extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-			unsigned long addr, pgprot_t newprot);
+			unsigned long addr, pgprot_t newprot,
+			int prot_numa);

enum transparent_hugepage_flag {
TRANSPARENT_HUGEPAGE_FLAG,
@@ -111,7 +112,7 @@ extern void __split_huge_page_pmd(struct vm_area_struct *vma,
#define wait_split_huge_page(__anon_vma, __pmd) \
do { \
pmd_t *____pmd = (__pmd); \
-	anon_vma_lock(__anon_vma); \
+	anon_vma_lock_write(__anon_vma); \
anon_vma_unlock(__anon_vma); \
BUG_ON(pmd_trans_splitting(*____pmd) || \
pmd_trans_huge(*____pmd)); \
@@ -171,6 +172,10 @@ static inline struct page *compound_trans_head(struct page *page)
}
return page;
}

extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, pmd_t pmd, pmd_t *pmdp);

#else /* CONFIG_TRANSPARENT_HUGEPAGE */
#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
#define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
@@ -209,6 +214,13 @@ static inline int pmd_trans_huge_lock(pmd_t *pmd,
{
return 0;
}

static inline int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, pmd_t pmd, pmd_t *pmdp)
{
return 0;
}

#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

#endif /* _LINUX_HUGE_MM_H */
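The huge-pmd variant is dispatched from the generic fault path; the check is roughly of this shape (a sketch — the real test sits in handle_mm_fault() in mm/memory.c):

    if (pmd_trans_huge(orig_pmd) && pmd_numa(orig_pmd))
            return do_huge_pmd_numa_page(mm, vma, address, orig_pmd, pmd);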
8 changes: 6 additions & 2 deletions include/linux/hugetlb.h
@@ -87,7 +87,7 @@ struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
pud_t *pud, int write);
int pmd_huge(pmd_t pmd);
int pud_huge(pud_t pmd);
-void hugetlb_change_protection(struct vm_area_struct *vma,
+unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
unsigned long address, unsigned long end, pgprot_t newprot);

#else /* !CONFIG_HUGETLB_PAGE */
@@ -132,7 +132,11 @@ static inline void copy_huge_page(struct page *dst, struct page *src)
{
}

-#define hugetlb_change_protection(vma, address, end, newprot)
+static inline unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
+		unsigned long address, unsigned long end, pgprot_t newprot)
+{
+	return 0;
+}

static inline void __unmap_hugepage_range_final(struct mmu_gather *tlb,
struct vm_area_struct *vma, unsigned long start,
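The new return value lets change_protection() — which calls hugetlb_change_protection() for hugetlb VMAs — report how many pages it actually touched, and the NUMA pte scanner uses that count to pace itself. A sketch of the consumer side (assumed shape of the task_numa_work() loop added by this series):

    /* Mark a range prot_numa and account how many ptes were updated. */
    pages -= change_prot_numa(vma, start, end);
    if (pages <= 0)
            goto out;       /* enough ptes marked for this scan pass */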
8 changes: 8 additions & 0 deletions include/linux/mempolicy.h
@@ -188,6 +188,8 @@ static inline int vma_migratable(struct vm_area_struct *vma)
return 1;
}

extern int mpol_misplaced(struct page *, struct vm_area_struct *, unsigned long);

#else

struct mempolicy {};
@@ -307,5 +309,11 @@ static inline int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol,
return 0;
}

static inline int mpol_misplaced(struct page *page, struct vm_area_struct *vma,
unsigned long address)
{
return -1; /* no node preference */
}

#endif /* CONFIG_NUMA */
#endif
46 changes: 44 additions & 2 deletions include/linux/migrate.h
@@ -23,6 +23,15 @@ typedef struct page *new_page_t(struct page *, unsigned long private, int **);
#define MIGRATEPAGE_BALLOON_SUCCESS 1 /* special ret code for balloon page
* successful migration case.
*/
enum migrate_reason {
MR_COMPACTION,
MR_MEMORY_FAILURE,
MR_MEMORY_HOTPLUG,
MR_SYSCALL, /* also applies to cpusets */
MR_MEMPOLICY_MBIND,
MR_NUMA_MISPLACED,
MR_CMA
};

#ifdef CONFIG_MIGRATION

@@ -32,7 +41,7 @@ extern int migrate_page(struct address_space *,
struct page *, struct page *, enum migrate_mode);
extern int migrate_pages(struct list_head *l, new_page_t x,
unsigned long private, bool offlining,
-		enum migrate_mode mode);
+		enum migrate_mode mode, int reason);
extern int migrate_huge_page(struct page *, new_page_t x,
unsigned long private, bool offlining,
enum migrate_mode mode);
@@ -54,7 +63,7 @@ static inline void putback_lru_pages(struct list_head *l) {}
static inline void putback_movable_pages(struct list_head *l) {}
static inline int migrate_pages(struct list_head *l, new_page_t x,
unsigned long private, bool offlining,
-		enum migrate_mode mode) { return -ENOSYS; }
+		enum migrate_mode mode, int reason) { return -ENOSYS; }
static inline int migrate_huge_page(struct page *page, new_page_t x,
unsigned long private, bool offlining,
enum migrate_mode mode) { return -ENOSYS; }
@@ -83,4 +92,37 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
#define fail_migrate_page NULL

#endif /* CONFIG_MIGRATION */

#ifdef CONFIG_NUMA_BALANCING
extern int migrate_misplaced_page(struct page *page, int node);
extern bool migrate_ratelimited(int node);
#else
static inline int migrate_misplaced_page(struct page *page, int node)
{
return -EAGAIN; /* can't migrate now */
}
static inline bool migrate_ratelimited(int node)
{
return false;
}
#endif /* CONFIG_NUMA_BALANCING */

#if defined(CONFIG_NUMA_BALANCING) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
extern int migrate_misplaced_transhuge_page(struct mm_struct *mm,
struct vm_area_struct *vma,
pmd_t *pmd, pmd_t entry,
unsigned long address,
struct page *page, int node);
#else
static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm,
struct vm_area_struct *vma,
pmd_t *pmd, pmd_t entry,
unsigned long address,
struct page *page, int node)
{
return -EAGAIN;
}
#endif /* CONFIG_NUMA_BALANCING && CONFIG_TRANSPARENT_HUGEPAGE */

#endif /* _LINUX_MIGRATE_H */
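With the extra argument, every caller now states why it is migrating, which feeds the migration tracepoints and per-reason accounting. A hedged sketch of the NUMA-fault caller, modelled on migrate_misplaced_page() in mm/migrate.c (alloc_misplaced_dst_page() and the exact flags are assumptions):

    LIST_HEAD(migratepages);

    list_add(&page->lru, &migratepages);
    nr_remaining = migrate_pages(&migratepages, alloc_misplaced_dst_page,
                                 node, false, MIGRATE_ASYNC,
                                 MR_NUMA_MISPLACED);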