Memory Spaces and mmap in the Linux Kernel

To learn about the Linux kernel's memory-management mechanisms, I worked through the source code alongside a couple of books and took a rough pass over the kernel's memory machinery and the mmap path. This post is a brief record of what I learned; it does not yet cover the slab allocator and related topics. My experience reading kernel source is limited, so please point out any mistakes!

The kernel source referenced in this post is Linux 4.19.65 / Linux 5.0.5.

References: Linux Device Drivers, 3rd edition, Chapter 15

Understanding the Linux Kernel, Chapters 2 and 8

https://blog.csdn.net/gatieme/article/details/52384075

https://blog.csdn.net/gatieme/article/details/52384636

Organization of physical memory

Page frames and struct page

Every page frame (physical page) is described by a struct page. To save memory, the structure makes heavy use of unions.

Some of the more important fields:

Adapted from: https://blog.csdn.net/gatieme/article/details/52384636

Field Meaning
flags Holds the page's status bits; the flags are defined in linux/page-flags.h
lru List head used to keep the page on various lists so that pages can be grouped by category; the main users are the buddy allocator, the slab allocator, and pages used by user space or as page cache
_refcount Reference count: the number of references to this page inside the kernel. It is incremented before the page is operated on and decremented afterwards; when it drops to 0 nothing references the page any more, so it can be unmapped/freed
_mapcount Mapping count: how many page tables (i.e. how many processes) currently share this page
virtual On systems where all physical memory is directly mapped into the kernel address space, the virtual address can be derived from the physical one; pages in the highmem region, however, cannot be permanently mapped into the kernel's virtual address space, so virtual records the kernel virtual address of such a page
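As a small illustration of _refcount: kernel code normally manipulates the counter only through helpers such as get_page()/put_page(), never directly. A minimal sketch (the helper name is made up for illustration):

/* Hypothetical helper: pin one freshly allocated page, use it, release it. */
#include <linux/gfp.h>
#include <linux/mm.h>

static int demo_page_refcount(void)
{
	struct page *page = alloc_page(GFP_KERNEL);	/* _refcount == 1 */
	if (!page)
		return -ENOMEM;

	get_page(page);		/* take an extra reference: _refcount == 2 */
	/* ... the page can safely be used here ... */
	put_page(page);		/* drop our extra reference: _refcount == 1 */

	__free_page(page);	/* drop the last reference; the page is freed */
	return 0;
}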

Some possible values of the flags field:

// /include/linux/mm_types.h

struct page {
unsigned long flags; /* Atomic flags, some possibly
* updated asynchronously */
//存放页的状态
/*
* Five words (20/40 bytes) are available in this union.
* WARNING: bit 0 of the first word is used for PageTail(). That
* means the other users of this union MUST NOT use the bit to
* avoid collision and false-positive PageTail().
*/
union {
struct { /* Page cache and anonymous pages */
/**
* @lru: Pageout list, eg. active_list protected by
* zone_lru_lock. Sometimes used as a generic list
* by the page owner.
*/
struct list_head lru;
/* See page-flags.h for PAGE_MAPPING_FLAGS */
struct address_space *mapping;
pgoff_t index; /* Our offset within mapping. */
/**
* @private: Mapping-private opaque data.
* Usually used for buffer_heads if PagePrivate.
* Used for swp_entry_t if PageSwapCache.
* Indicates order in the buddy system if PageBuddy.
*/
unsigned long private;
};
struct { /* slab, slob and slub */
union {
struct list_head slab_list; /* uses lru */
struct { /* Partial pages */
struct page *next;
#ifdef CONFIG_64BIT
int pages; /* Nr of pages left */
int pobjects; /* Approximate count */
#else
short int pages;
short int pobjects;
#endif
};
};
struct kmem_cache *slab_cache; /* not slob */
/* Double-word boundary */
void *freelist; /* first free object */
union {
void *s_mem; /* slab: first object */
unsigned long counters; /* SLUB */
struct { /* SLUB */
unsigned inuse:16;
unsigned objects:15;
unsigned frozen:1;
};
};
};
struct { /* Tail pages of compound page */
unsigned long compound_head; /* Bit zero is set */

/* First tail page only */
unsigned char compound_dtor;
unsigned char compound_order;
atomic_t compound_mapcount;
};
struct { /* Second tail page of compound page */
unsigned long _compound_pad_1; /* compound_head */
unsigned long _compound_pad_2;
struct list_head deferred_list;
};
struct { /* Page table pages */
unsigned long _pt_pad_1; /* compound_head */
pgtable_t pmd_huge_pte; /* protected by page->ptl */
unsigned long _pt_pad_2; /* mapping */
union {
struct mm_struct *pt_mm; /* x86 pgds only */
atomic_t pt_frag_refcount; /* powerpc */
};
#if ALLOC_SPLIT_PTLOCKS
spinlock_t *ptl;
#else
spinlock_t ptl;
#endif
};
struct { /* ZONE_DEVICE pages */
/** @pgmap: Points to the hosting device page map. */
struct dev_pagemap *pgmap;
unsigned long hmm_data;
unsigned long _zd_pad_1; /* uses mapping */
};

/** @rcu_head: You can use this to free a page by RCU. */
struct rcu_head rcu_head;
};

union { /* This union is 4 bytes in size. */
/*
* If the page can be mapped to userspace, encodes the number
* of times this page is referenced by a page table.
*/
atomic_t _mapcount;

/*
* If the page is neither PageSlab nor mappable to userspace,
* the value stored here may help determine what this page
* is used for. See page-flags.h for a list of page types
* which are currently stored here.
*/
unsigned int page_type;

unsigned int active; /* SLAB */
int units; /* SLOB */
};

/* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
atomic_t _refcount;

#ifdef CONFIG_MEMCG
struct mem_cgroup *mem_cgroup;
#endif

/*
* On machines where all RAM is mapped into kernel address space,
* we can simply calculate the virtual address. On machines with
* highmem some memory is mapped into kernel virtual memory
* dynamically, so we need a place to store that address.
* Note that this field could be 16 bits on x86 ... ;)
*
* Architectures with slow multiplication can define
* WANT_PAGE_VIRTUAL in asm/page.h
*/
#if defined(WANT_PAGE_VIRTUAL)
void *virtual; /* Kernel virtual address (NULL if
not kmapped, ie. highmem) */
#endif /* WANT_PAGE_VIRTUAL */

#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
int _last_cpupid;
#endif
} _struct_page_alignment;

NUMA and UMA models

We usually think of a computer's memory as a uniform, shared resource: no matter where the memory cell sits and no matter which CPU accesses it, we expect the access to take the same amount of time. Unfortunately this assumption does not hold on some architectures, for example multiprocessor Alpha or MIPS machines.

Memory nodes (node)

Linux therefore supports the NUMA (Non-Uniform Memory Access) model, in which the time a given CPU needs to access a memory cell can vary. The system's physical memory is partitioned into several nodes; within a single node, the time any given CPU needs to access any page of that node is the same, while different CPUs may see different access times. For each CPU, the kernel tries to minimize slow accesses, which requires carefully choosing where the kernel data structures most frequently referenced by that CPU are stored.

NUMA is mainly found on server hardware; in the UMA (Uniform Memory Access) model there is only a single node.

The memory within each node is further divided into several zones.

Memory zones (zone)

In an idealized architecture a page frame is simply a unit of memory storage that can be used for anything: any kind of data page could be loaded into any page frame, without restriction.

Real architectures, however, come with hardware constraints that limit how page frames may be used. In particular, the Linux kernel must cope with two hardware constraints of the 80x86 architecture:

  • The DMA processors of the ISA bus can only address the first 16 MB of RAM
  • On 32-bit machines with large amounts of RAM, the CPU cannot directly access all physical memory, because the linear address space is too small (only 4 GB on 32 bits)

Similarly, on 64-bit machines, 32-bit DMA devices can only address memory below 4 GB.

The kernel defines the following zones:

// /include/linux/mmzone.h

enum zone_type {
#ifdef CONFIG_ZONE_DMA
/*
* ZONE_DMA is used when there are devices that are not able
* to do DMA to all of addressable memory (ZONE_NORMAL). Then we
* carve out the portion of memory that is needed for these devices.
* The range is arch specific.
*
* Some examples
*
* Architecture Limit
* ---------------------------
* parisc, ia64, sparc <4G
* s390, powerpc <2G
* arm Various
* alpha Unlimited or 0-16MB.
*
* i386, x86_64 and multiple other arches
* <16M.
*/
ZONE_DMA,
#endif
#ifdef CONFIG_ZONE_DMA32
/*
* x86_64 needs two ZONE_DMAs because it supports devices that are
* only able to do DMA to the lower 16M but also 32 bit devices that
* can only do DMA areas below 4G.
*/
ZONE_DMA32,
#endif
/*
* Normal addressable memory is in ZONE_NORMAL. DMA operations can be
* performed on pages in ZONE_NORMAL if the DMA devices support
* transfers to all addressable memory.
*/
ZONE_NORMAL,
#ifdef CONFIG_HIGHMEM
/*
* A memory area that is only addressable by the kernel through
* mapping portions into its own address space. This is for example
* used by i386 to allow the kernel to address the memory beyond
* 900MB. The kernel will set up special mappings (page
* table entries on i386) for each page that the kernel needs to
* access.
*/
ZONE_HIGHMEM,
#endif
ZONE_MOVABLE,
#ifdef CONFIG_ZONE_DEVICE
ZONE_DEVICE,
#endif
__MAX_NR_ZONES

};

ZONE_DMA and ZONE_DMA32 are both carved out for DMA devices. ZONE_NORMAL can be linearly mapped straight into the fourth gigabyte of the linear address space (i.e. the kernel portion of a process's address space on 32-bit x86), so the kernel can access it directly, whereas pages in ZONE_HIGHMEM cannot be addressed by the kernel directly (they lie beyond what the kernel's direct linear mapping can cover). On 64-bit architectures ZONE_HIGHMEM is always empty.
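The zone a physical page comes from is selected through the GFP flags passed to the page allocator. A minimal sketch (the function name is hypothetical):

#include <linux/gfp.h>
#include <linux/mm.h>

/* Hypothetical example: allocate one page suitable for an ISA DMA device
 * (ZONE_DMA) and one ordinary user page (highmem is allowed if present). */
static struct page *demo_zone_alloc(void)
{
	struct page *dma_page  = alloc_pages(GFP_KERNEL | GFP_DMA, 0);
	struct page *user_page = alloc_pages(GFP_HIGHUSER, 0);

	if (dma_page)
		__free_pages(dma_page, 0);

	return user_page;	/* caller would __free_pages() this when done */
}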

Process address space

Virtual memory areas (VMA)

A VMA represents one homogeneous region of a process's virtual address space: a range of addresses that share the same permission flags and are backed by the same file or swap space. You can loosely think of it as a "segment", but it is more accurately described as a memory object with its own attributes. Each VMA is described by a vm_area_struct.

vm_area_struct

Each VMA has a corresponding vm_area_struct containing the VMA's start and end (virtual) addresses, a pointer to the mapped file (which may be NULL), its permissions and attributes (vm_flags), the address space it belongs to (struct mm_struct *vm_mm), and a set of function pointers used to operate on the VMA (const struct vm_operations_struct *vm_ops), among other things. The presence of vm_ops shows that a VMA is really a kernel object.

Also note that the VMAs are kept on a doubly linked list (vm_next/vm_prev) and, as the vm_rb field shows, in a red-black tree as well.

// /include/linux/mm_types.h
/*
* This struct defines a memory VMM memory area. There is one of these
* per VM-area/task. A VM area is any part of the process virtual memory
* space that has a special rule for the page-fault handlers (ie a shared
* library, the executable area etc).
*/
struct vm_area_struct {
/* The first cache line has the info for VMA tree walking. */

unsigned long vm_start; /* Our start address within vm_mm. */
unsigned long vm_end; /* The first byte after our end address
within vm_mm. */

/* linked list of VM areas per task, sorted by address */
struct vm_area_struct *vm_next, *vm_prev;

struct rb_node vm_rb;

/*
* Largest free memory gap in bytes to the left of this VMA.
* Either between this VMA and vma->vm_prev, or between one of the
* VMAs below us in the VMA rbtree and its ->vm_prev. This helps
* get_unmapped_area find a free area of the right size.
*/
unsigned long rb_subtree_gap;

/* Second cache line starts here. */

struct mm_struct *vm_mm; /* The address space we belong to. */
pgprot_t vm_page_prot; /* Access permissions of this VMA. */
unsigned long vm_flags; /* Flags, see mm.h. */

/*
* For areas with an address space and backing store,
* linkage into the address_space->i_mmap interval tree.
*/
struct {
struct rb_node rb;
unsigned long rb_subtree_last;
} shared;

/*
* A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
* list, after a COW of one of the file pages. A MAP_SHARED vma
* can only be in the i_mmap tree. An anonymous MAP_PRIVATE, stack
* or brk vma (with NULL file) can only be in an anon_vma list.
*/
struct list_head anon_vma_chain; /* Serialized by mmap_sem &
* page_table_lock */
struct anon_vma *anon_vma; /* Serialized by page_table_lock */

/* Function pointers to deal with this struct. */
const struct vm_operations_struct *vm_ops;

/* Information about our backing store: */
unsigned long vm_pgoff; /* Offset (within vm_file) in PAGE_SIZE
units */
struct file * vm_file; /* File we map to (can be NULL). */
void * vm_private_data; /* was vm_pte (shared mem) */

atomic_long_t swap_readahead_info;
#ifndef CONFIG_MMU
struct vm_region *vm_region; /* NOMMU mapping region */
#endif
#ifdef CONFIG_NUMA
struct mempolicy *vm_policy; /* NUMA policy for the VMA */
#endif
struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
} __randomize_layout;
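To make the organization concrete, here is a minimal kernel-style sketch (names hypothetical) that walks a task's VMA list in this kernel version and prints each region, roughly what /proc/<pid>/maps reports:

#include <linux/mm.h>
#include <linux/mm_types.h>
#include <linux/printk.h>

/* Hypothetical helper: dump every VMA of an mm via the vm_next linked list.
 * The caller is assumed to hold mm->mmap_sem for reading. */
static void demo_dump_vmas(struct mm_struct *mm)
{
	struct vm_area_struct *vma;

	for (vma = mm->mmap; vma; vma = vma->vm_next)
		pr_info("vma %lx-%lx flags %lx file %p\n",
			vma->vm_start, vma->vm_end,
			vma->vm_flags, vma->vm_file);
}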

vm_operations_struct

A set of function pointers; every VMA carries its own pointer to such a structure.

/*  /include/linux/mm.h  */

/*
* These are the virtual MM functions - opening of an area, closing and
* unmapping it (needed to keep files on disk up-to-date etc), pointer
* to the functions called when a no-page or a wp-page exception occurs.
*/
struct vm_operations_struct {
void (*open)(struct vm_area_struct * area);
void (*close)(struct vm_area_struct * area);
int (*split)(struct vm_area_struct * area, unsigned long addr);
int (*mremap)(struct vm_area_struct * area);
vm_fault_t (*fault)(struct vm_fault *vmf);
vm_fault_t (*huge_fault)(struct vm_fault *vmf,
enum page_entry_size pe_size);
void (*map_pages)(struct vm_fault *vmf,
pgoff_t start_pgoff, pgoff_t end_pgoff);
unsigned long (*pagesize)(struct vm_area_struct * area);

/* notification that a previously read-only page is about to become
* writable, if an error is returned it will cause a SIGBUS */
vm_fault_t (*page_mkwrite)(struct vm_fault *vmf);

/* same as page_mkwrite when using VM_PFNMAP|VM_MIXEDMAP */
vm_fault_t (*pfn_mkwrite)(struct vm_fault *vmf);

/* called by access_process_vm when get_user_pages() fails, typically
* for use by special VMAs that can switch between memory and hardware
*/
int (*access)(struct vm_area_struct *vma, unsigned long addr,
void *buf, int len, int write);

/* Called by the /proc/PID/maps code to ask the vma whether it
* has a special name. Returning non-NULL will also cause this
* vma to be dumped unconditionally. */
const char *(*name)(struct vm_area_struct *vma);

#ifdef CONFIG_NUMA
/*
* set_policy() op must add a reference to any non-NULL @new mempolicy
* to hold the policy upon return. Caller should pass NULL @new to
* remove a policy and fall back to surrounding context--i.e. do not
* install a MPOL_DEFAULT policy, nor the task or system default
* mempolicy.
*/
int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);

/*
* get_policy() op must add reference [mpol_get()] to any policy at
* (vma,addr) marked as MPOL_SHARED. The shared policy infrastructure
* in mm/mempolicy.c will do this automatically.
* get_policy() must NOT add a ref if the policy at (vma,addr) is not
* marked as MPOL_SHARED. vma policies are protected by the mmap_sem.
* If no [shared/vma] mempolicy exists at the addr, get_policy() op
* must return NULL--i.e., do not "fallback" to task or system default
* policy.
*/
struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
unsigned long addr);
#endif
/*
* Called by vm_normal_page() for special PTEs to find the
* page for @addr. This is useful if the default behavior
* (using pte_page()) would not find the correct page.
*/
struct page *(*find_special_page)(struct vm_area_struct *vma,
unsigned long addr);
};

Memory mappings and struct mm_struct

Every process has an mm_struct describing the memory mappings of its virtual address space. It contains the process's list of VMAs as well as other memory-management data structures such as a semaphore (mmap_sem) and spinlocks.

// /include/linux/mm_types.h
struct mm_struct {
struct {
struct vm_area_struct *mmap; /* list of VMAs */
struct rb_root mm_rb;
u64 vmacache_seqnum; /* per-thread vmacache */
#ifdef CONFIG_MMU
unsigned long (*get_unmapped_area) (struct file *filp,
unsigned long addr, unsigned long len,
unsigned long pgoff, unsigned long flags);
#endif
unsigned long mmap_base; /* base of mmap area */
unsigned long mmap_legacy_base; /* base of mmap area in bottom-up allocations */
#ifdef CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
/* Base adresses for compatible mmap() */
unsigned long mmap_compat_base;
unsigned long mmap_compat_legacy_base;
#endif
unsigned long task_size; /* size of task vm space */
unsigned long highest_vm_end; /* highest vma end address */
pgd_t * pgd;

/**
* @mm_users: The number of users including userspace.
*
* Use mmget()/mmget_not_zero()/mmput() to modify. When this
* drops to 0 (i.e. when the task exits and there are no other
* temporary reference holders), we also release a reference on
* @mm_count (which may then free the &struct mm_struct if
* @mm_count also drops to 0).
*/
atomic_t mm_users;

/**
* @mm_count: The number of references to &struct mm_struct
* (@mm_users count as 1).
*
* Use mmgrab()/mmdrop() to modify. When this drops to 0, the
* &struct mm_struct is freed.
*/
atomic_t mm_count;

#ifdef CONFIG_MMU
atomic_long_t pgtables_bytes; /* PTE page table pages */
#endif
int map_count; /* number of VMAs */

spinlock_t page_table_lock; /* Protects page tables and some
* counters
*/
struct rw_semaphore mmap_sem;

struct list_head mmlist; /* List of maybe swapped mm's. These
* are globally strung together off
* init_mm.mmlist, and are protected
* by mmlist_lock
*/


unsigned long hiwater_rss; /* High-watermark of RSS usage */
unsigned long hiwater_vm; /* High-water virtual memory usage */

unsigned long total_vm; /* Total pages mapped */
unsigned long locked_vm; /* Pages that have PG_mlocked set */
unsigned long pinned_vm; /* Refcount permanently increased */
unsigned long data_vm; /* VM_WRITE & ~VM_SHARED & ~VM_STACK */
unsigned long exec_vm; /* VM_EXEC & ~VM_WRITE & ~VM_STACK */
unsigned long stack_vm; /* VM_STACK */
unsigned long def_flags;

spinlock_t arg_lock; /* protect the below fields */
unsigned long start_code, end_code, start_data, end_data;
unsigned long start_brk, brk, start_stack;
unsigned long arg_start, arg_end, env_start, env_end;

unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */

/*
* Special counters, in some configurations protected by the
* page_table_lock, in other configurations by being atomic.
*/
struct mm_rss_stat rss_stat;

struct linux_binfmt *binfmt;

/* Architecture-specific MM context */
mm_context_t context;

unsigned long flags; /* Must use atomic bitops to access */

struct core_state *core_state; /* coredumping support */
#ifdef CONFIG_MEMBARRIER
atomic_t membarrier_state;
#endif
#ifdef CONFIG_AIO
spinlock_t ioctx_lock;
struct kioctx_table __rcu *ioctx_table;
#endif
#ifdef CONFIG_MEMCG
/*
* "owner" points to a task that is regarded as the canonical
* user/owner of this mm. All of the following must be true in
* order for it to be changed:
*
* current == mm->owner
* current->mm != mm
* new_owner->mm == mm
* new_owner->alloc_lock is held
*/
struct task_struct __rcu *owner;
#endif
struct user_namespace *user_ns;

/* store ref to file /proc/<pid>/exe symlink points to */
struct file __rcu *exe_file;
#ifdef CONFIG_MMU_NOTIFIER
struct mmu_notifier_mm *mmu_notifier_mm;
#endif
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
pgtable_t pmd_huge_pte; /* protected by page_table_lock */
#endif
#ifdef CONFIG_NUMA_BALANCING
/*
* numa_next_scan is the next time that the PTEs will be marked
* pte_numa. NUMA hinting faults will gather statistics and
* migrate pages to new nodes if necessary.
*/
unsigned long numa_next_scan;

/* Restart point for scanning and setting pte_numa */
unsigned long numa_scan_offset;

/* numa_scan_seq prevents two threads setting pte_numa */
int numa_scan_seq;
#endif
/*
* An operation with batched TLB flushing is going on. Anything
* that can move process memory needs to flush the TLB when
* moving a PROT_NONE or PROT_NUMA mapped page.
*/
atomic_t tlb_flush_pending;
#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
/* See flush_tlb_batched_pending() */
bool tlb_flush_batched;
#endif
struct uprobes_state uprobes_state;
#ifdef CONFIG_HUGETLB_PAGE
atomic_long_t hugetlb_usage;
#endif
struct work_struct async_put_work;

#if IS_ENABLED(CONFIG_HMM)
/* HMM needs to track a few things per mm */
struct hmm *hmm;
#endif
} __randomize_layout;

/*
* The mm_cpumask needs to be at the end of mm_struct, because it
* is dynamically sized based on nr_cpu_ids.
*/
unsigned long cpu_bitmap[];
};

mmap

For a driver, memory mapping gives user programs the ability to access device memory directly.

Mapping (mmap-ing) a device means associating a region of the process's virtual address space (a VMA) directly with device memory; when user code reads or writes that VMA, it is really accessing the device memory mapped behind it.
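From user space this is just an ordinary mmap() call on the device file. A minimal sketch (the device node and mapping length are made up):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/* Hypothetical device node and mapping size. */
	int fd = open("/dev/mydev", O_RDWR);
	if (fd < 0)
		return 1;

	unsigned char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
				MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	p[0] = 0x42;			/* this write lands in the mapped device memory */
	printf("first byte: %#x\n", p[0]);

	munmap(p, 4096);
	close(fd);
	return 0;
}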

The mmap() method is one of the members of struct file_operations, so each file (device) can provide its own mmap() implementation, and the mmap system call eventually ends up calling this file_operations->mmap() method.

Also, during the mmap system call the kernel does a lot of work before it reaches file_operations->mmap(), so the prototype of file_operations->mmap() differs from the prototype of the mmap system call:

//file_operations->mmap:
int (*mmap) (struct file *, struct vm_area_struct *);

//syscall mmap:
mmap (caddr_t addr, size_t len, int prot, int flags, int fd, off_t offset)

When a user process calls mmap to map device memory into its address space, the kernel creates a new VMA to represent the mapping, and the device must support mmap (i.e. implement the mmap method) to help initialize that VMA. The prototype of file_operations->mmap() reflects this: it takes only a struct file * and a struct vm_area_struct *, so what file_operations->mmap() has to do is essentially finish setting up the VMA, for example building the appropriate page tables, initializing vm_area_struct->vm_file from the struct file * argument, or, when necessary, replacing vma->vm_ops with a new set of operations.

Building the page tables

There are two ways to build the page tables: all at once through the kernel helper remap_pfn_range(), or one page at a time through the VMA's nopage() method (whose role is played by fault() in current kernels); each approach has its own advantages and disadvantages.

remap_pfn_range()

Its prototype is shown below; I will not dig into its implementation here.

/*  /mm/memory.c  */
/**
* remap_pfn_range - remap kernel memory to userspace
* @vma: user vma to map to
* @addr: target user address to start at
* @pfn: physical address of kernel memory
* @size: size of map area
* @prot: page protection flags for this mapping
*
* Note: this is only safe if the mm semaphore is held when called.
*/
int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn, unsigned long size, pgprot_t prot)
{
...
}
EXPORT_SYMBOL(remap_pfn_range);

One file_operations->mmap() implementation I found in the source tree that builds the page tables with remap_pfn_range() is shown below; as you can see, remap_pfn_range() makes setting up a VMA's page tables quick and simple.

/*  /drivers/char/mem.c  */
static int mmap_mem(struct file *file, struct vm_area_struct *vma)
{
size_t size = vma->vm_end - vma->vm_start;
phys_addr_t offset = (phys_addr_t)vma->vm_pgoff << PAGE_SHIFT;

/* Does it even fit in phys_addr_t? */
if (offset >> PAGE_SHIFT != vma->vm_pgoff)
return -EINVAL;

/* It's illegal to wrap around the end of the physical address space. */
if (offset + (phys_addr_t)size - 1 < offset)
return -EINVAL;

if (!valid_mmap_phys_addr_range(vma->vm_pgoff, size))
return -EINVAL;

if (!private_mapping_ok(vma))
return -ENOSYS;

if (!range_is_allowed(vma->vm_pgoff, size))
return -EPERM;

if (!phys_mem_access_prot_allowed(file, vma->vm_pgoff, size,
&vma->vm_page_prot))
return -EINVAL;

vma->vm_page_prot = phys_mem_access_prot(file, vma->vm_pgoff,
size,
vma->vm_page_prot);

vma->vm_ops = &mmap_mem_ops;

/* Remap-pfn-range will mark the range VM_IO */
if (remap_pfn_range(vma,
vma->vm_start,
vma->vm_pgoff,
size,
vma->vm_page_prot)) {
return -EAGAIN;
}
return 0;
}
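For a driver that wants to expose a buffer of its own rather than raw physical memory, the pattern is similar: allocate a physically contiguous buffer and hand its page frame number to remap_pfn_range(). A minimal sketch, with the buffer name and size made up (it also ignores the mmap offset for simplicity):

#include <linux/fs.h>
#include <linux/io.h>
#include <linux/mm.h>
#include <linux/slab.h>

#define MYDEV_BUF_SIZE PAGE_SIZE	/* hypothetical buffer size */

static void *mydev_buf;			/* assumed kmalloc()'d at probe time */

static int mydev_mmap(struct file *file, struct vm_area_struct *vma)
{
	unsigned long size = vma->vm_end - vma->vm_start;
	unsigned long pfn  = virt_to_phys(mydev_buf) >> PAGE_SHIFT;

	if (size > MYDEV_BUF_SIZE)
		return -EINVAL;

	/* Map the whole buffer into the caller's VMA in one go. */
	return remap_pfn_range(vma, vma->vm_start, pfn, size,
			       vma->vm_page_prot);
}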

nopage

LDD3 also describes a second way of establishing the mapping: the nopage method. For example, when the mremap system call grows a VMA, nopage is invoked to set up the new pages. When I looked for it in the 4.19 sources, however, the member no longer exists: nopage was removed from vm_operations_struct long ago, and its role is now played by the fault() method. The definition given in the book is:

struct page *(*nopage)(struct vm_area_struct *vma,
unsigned long address, int *type);

When a process touches a page that is not yet present in memory, the corresponding nopage (today: fault) method is called.
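In current kernels the same "one page at a time" idea is expressed through vm_operations_struct->fault. A minimal sketch of a fault handler for a driver that backs its VMA with a single driver-owned page (all names are hypothetical):

#include <linux/mm.h>

static struct page *mydev_page;		/* assumed allocated elsewhere, e.g. alloc_page() */

/* Called on the first access to a not-yet-mapped page of the VMA. */
static vm_fault_t mydev_vm_fault(struct vm_fault *vmf)
{
	if (vmf->pgoff != 0)		/* this sketch only backs offset 0 */
		return VM_FAULT_SIGBUS;

	get_page(mydev_page);		/* the core mm drops this reference on unmap */
	vmf->page = mydev_page;
	return 0;
}

static const struct vm_operations_struct mydev_vm_ops = {
	.fault = mydev_vm_fault,
};

static int mydev_fault_mmap(struct file *file, struct vm_area_struct *vma)
{
	vma->vm_ops = &mydev_vm_ops;	/* pages are filled in lazily by mydev_vm_fault() */
	return 0;
}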

A walk through the kernel's mmap code

A user-space function xxx corresponds to sys_xxx at the system-call layer, so the mmap system call is served in the kernel by sys_mmap().

The implementation of sys_mmap is architecture-specific, but all of them end up calling ksys_mmap_pgoff(). Also, the offset argument passed to mmap is converted into a page count along the way: by the time ksys_mmap_pgoff() sees it, the offset is expressed in pages.

For example, on x86, arch/x86/um/sys_call_table_64.c contains the macro definition #define sys_mmap old_mmap, so sys_mmap is served by old_mmap.

mm/mmap.c

SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg)
{
struct mmap_arg_struct a;

if (copy_from_user(&a, arg, sizeof(a)))
return -EFAULT;
if (offset_in_page(a.offset))
return -EINVAL;

return ksys_mmap_pgoff(a.addr, a.len, a.prot, a.flags, a.fd,
a.offset >> PAGE_SHIFT);
}
#endif /* __ARCH_WANT_SYS_OLD_MMAP */

As you can see, offset is shifted right by PAGE_SHIFT (12) bits, turning it into a page number.
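For instance, with 4 KB pages a byte offset of 0x3000 becomes page number 3; a tiny illustration (values made up):

#define DEMO_PAGE_SHIFT 12				/* 4 KB pages */
unsigned long off   = 0x3000;				/* byte offset passed to mmap() */
unsigned long pgoff = off >> DEMO_PAGE_SHIFT;		/* == 3, the page-granular offset */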

arch/ia64/kernel/sys_ia64.c also implements sys_mmap; it is essentially the same as old_mmap above: shift offset right by PAGE_SHIFT and call ksys_mmap_pgoff().

arch/ia64/kernel/sys_ia64.c

asmlinkage unsigned long
sys_mmap (unsigned long addr, unsigned long len, int prot, int flags, int fd, long off)
{
if (offset_in_page(off) != 0)
return -EINVAL;

addr = ksys_mmap_pgoff(addr, len, prot, flags, fd, off >> PAGE_SHIFT);
if (!IS_ERR((void *) addr))
force_successful_syscall_return();
return addr;
}

ksys_mmap_pgoff()

mm/mmap.c

unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
unsigned long prot, unsigned long flags,
unsigned long fd, unsigned long pgoff)
{
struct file *file = NULL;
unsigned long retval;

if (!(flags & MAP_ANONYMOUS)) {
audit_mmap_fd(fd, flags);
file = fget(fd);
if (!file)
return -EBADF;
if (is_file_hugepages(file))
len = ALIGN(len, huge_page_size(hstate_file(file)));
retval = -EINVAL;
if (unlikely(flags & MAP_HUGETLB && !is_file_hugepages(file)))
goto out_fput;
} else if (flags & MAP_HUGETLB) {
struct user_struct *user = NULL;
struct hstate *hs;

hs = hstate_sizelog((flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
if (!hs)
return -EINVAL;

len = ALIGN(len, huge_page_size(hs));
/*
* VM_NORESERVE is used because the reservations will be
* taken when vm_ops->mmap() is called
* A dummy user value is used because we are not locking
* memory so no accounting is necessary
*/
file = hugetlb_file_setup(HUGETLB_ANON_FILE, len,
VM_NORESERVE,
&user, HUGETLB_ANONHUGE_INODE,
(flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
if (IS_ERR(file))
return PTR_ERR(file);
}

flags &= ~(MAP_EXECUTABLE | MAP_DENYWRITE);

retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff);
out_fput:
if (file)
fput(file);
return retval;
}

ksys_mmap_pgoff() first turns the file descriptor from the mmap arguments into the target file's struct file, performs a few checks, and then calls vm_mmap_pgoff().

mm/mmap.c

unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot,
unsigned long flag, unsigned long pgoff)
{
unsigned long ret;
struct mm_struct *mm = current->mm;
unsigned long populate;
LIST_HEAD(uf);

ret = security_mmap_file(file, prot, flag);
if (!ret) {
if (down_write_killable(&mm->mmap_sem))
return -EINTR;
ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff,
&populate, &uf);
up_write(&mm->mmap_sem);
userfaultfd_unmap_complete(mm, &uf);
if (populate)
mm_populate(ret, populate);
}
return ret;
}

vm_mmap_pgoff() in turn calls do_mmap_pgoff(), an inline function that is a thin wrapper around do_mmap().

include/linux/mm.h

static inline unsigned long
do_mmap_pgoff(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot, unsigned long flags,
unsigned long pgoff, unsigned long *populate,
struct list_head *uf)
{
return do_mmap(file, addr, len, prot, flags, 0, pgoff, populate, uf);
}

do_mmap

The interesting work happens in do_mmap():

mm/mmap.c

unsigned long do_mmap(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot,
unsigned long flags, vm_flags_t vm_flags,
unsigned long pgoff, unsigned long *populate,
struct list_head *uf)
{
struct mm_struct *mm = current->mm; //get the current process's mm_struct
int pkey = 0;

*populate = 0;

if (!len)
return -EINVAL; //len must not be zero

/*
* Does the application expect PROT_READ to imply PROT_EXEC?
*
* (the exception is when the underlying filesystem is noexec
* mounted, in which case we dont add PROT_EXEC.)
*/
if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC))
if (!(file && path_noexec(&file->f_path)))
prot |= PROT_EXEC;

/* force arch specific MAP_FIXED handling in get_unmapped_area */
if (flags & MAP_FIXED_NOREPLACE)
flags |= MAP_FIXED;

if (!(flags & MAP_FIXED))
addr = round_hint_to_min(addr);

/* Careful about overflows.. */
len = PAGE_ALIGN(len); //round len up to a page boundary
if (!len)
return -ENOMEM;

/* offset overflow? */
if ((pgoff + (len >> PAGE_SHIFT)) < pgoff)
return -EOVERFLOW; //unsigned addition wraps around on overflow; a wrapped (smaller) sum is exactly what this check detects

/* Too many mappings? */
if (mm->map_count > sysctl_max_map_count)
return -ENOMEM; //the number of VMAs per mm (mm->map_count) is capped

/* Obtain the address to map to. we verify (or select) it and ensure
* that it represents a valid section of the address space.
*/
addr = get_unmapped_area(file, addr, len, pgoff, flags); //find a suitable linear address
if (offset_in_page(addr)) //#define offset_in_page(p) ((unsigned long)(p) & ~PAGE_MASK) // PAGE_MASK=0XFFFFF000
return addr; //get_unmapped_area also performs this check internally and returns -EINVAL if the address is not page-aligned

if (flags & MAP_FIXED_NOREPLACE) { // like MAP_FIXED, but does not clobber an existing mapping
struct vm_area_struct *vma = find_vma(mm, addr);

if (vma && vma->vm_start < addr + len)
return -EEXIST; //the target range is already occupied, so fail immediately
}

if (prot == PROT_EXEC) {
pkey = execute_only_pkey(mm);
if (pkey < 0)
pkey = 0;
}

/* Do simple checking here so the lower-level routines won't have
* to. we assume access permissions have been handled by the open
* of the memory object, so we don't do any here.
*/
vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;

if (flags & MAP_LOCKED)
if (!can_do_mlock())
return -EPERM;

if (mlock_future_check(mm, vm_flags, len))
return -EAGAIN;

if (file) {
struct inode *inode = file_inode(file);
unsigned long flags_mask;

if (!file_mmap_ok(file, inode, pgoff, len)) //sanity-check len against the file
return -EOVERFLOW;

flags_mask = LEGACY_MAP_MASK | file->f_op->mmap_supported_flags;

switch (flags & MAP_TYPE) {
case MAP_SHARED:
/*
* Force use of MAP_SHARED_VALIDATE with non-legacy
* flags. E.g. MAP_SYNC is dangerous to use with
* MAP_SHARED as you don't know which consistency model
* you will get. We silently ignore unsupported flags
* with MAP_SHARED to preserve backward compatibility.
*/
flags &= LEGACY_MAP_MASK;
/* fall through */
case MAP_SHARED_VALIDATE:
if (flags & ~flags_mask)
return -EOPNOTSUPP;
if ((prot&PROT_WRITE) && !(file->f_mode&FMODE_WRITE))
return -EACCES;

/*
* Make sure we don't allow writing to an append-only
* file..
*/
if (IS_APPEND(inode) && (file->f_mode & FMODE_WRITE))
return -EACCES;

/*
* Make sure there are no mandatory locks on the file.
*/
if (locks_verify_locked(file))
return -EAGAIN;

vm_flags |= VM_SHARED | VM_MAYSHARE;
if (!(file->f_mode & FMODE_WRITE))
vm_flags &= ~(VM_MAYWRITE | VM_SHARED);

/* fall through */
case MAP_PRIVATE:
if (!(file->f_mode & FMODE_READ))
return -EACCES;
if (path_noexec(&file->f_path)) {
if (vm_flags & VM_EXEC)
return -EPERM;
vm_flags &= ~VM_MAYEXEC;
}

if (!file->f_op->mmap)
return -ENODEV;
if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP))
return -EINVAL;
break;

default:
return -EINVAL;
}
} else {
switch (flags & MAP_TYPE) {
case MAP_SHARED:
if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP))
return -EINVAL;
/*
* Ignore pgoff.
*/
pgoff = 0;
vm_flags |= VM_SHARED | VM_MAYSHARE;
break;
case MAP_PRIVATE:
/*
* Set pgoff according to addr for anon_vma.
*/
pgoff = addr >> PAGE_SHIFT;
break;
default:
return -EINVAL;
}
}

/*
* Set 'VM_NORESERVE' if we should not account for the
* memory use of this mapping.
*/
if (flags & MAP_NORESERVE) {
/* We honor MAP_NORESERVE if allowed to overcommit */
if (sysctl_overcommit_memory != OVERCOMMIT_NEVER)
vm_flags |= VM_NORESERVE;

/* hugetlb applies strict overcommit unless MAP_NORESERVE */
if (file && is_file_hugepages(file))
vm_flags |= VM_NORESERVE;
}

addr = mmap_region(file, addr, len, vm_flags, pgoff, uf); //create and initialize the vma and set up the mapping
if (!IS_ERR_VALUE(addr) &&
((vm_flags & VM_LOCKED) ||
(flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
*populate = len;
return addr;
}

do_mmap() roughly does the following:

  • It first fetches the current process's mm_struct.
  • The len argument (the length of the mapping) is rounded up to a page boundary.
  • Various sanity checks are performed on the arguments.
  • get_unmapped_area(file, addr, len, pgoff, flags) is called to find and return a suitable linear address. It prefers file->f_op->get_unmapped_area() and falls back to current->mm->get_unmapped_area when the file does not define one.
  • If the address returned by get_unmapped_area is not page-aligned, it is in fact an error code such as -EINVAL and is returned immediately.
  • Inside get_unmapped_area, if addr is 0 or the range (addr, addr+len) overlaps an existing mapping, a suitable region is searched for in the red-black tree of the process's vm_area_structs rooted at mm_struct->mm_rb; if addr is non-zero and (addr, addr+len) is free, addr is returned as-is. If MAP_FIXED is set in flags (MAP_FIXED means "map exactly at the addr I passed in", even displacing what is there), addr is returned directly without checking whether the range is occupied; if it is occupied, the old mapping is torn down later in mmap_region(). A small user-space illustration follows this list.
  • With the linear address in hand, the vm_flags value is computed from prot and flags...
  • Finally, mmap_region() is called to carry out the mapping.
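The difference between a plain hint address and MAP_FIXED described above can be observed from user space; a small sketch (the hint address is only illustrative):

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 4096;
	void *hint = (void *)0x700000000000UL;	/* made-up hint address */

	/* With a plain hint the kernel may pick another address if the range
	 * is unsuitable (get_unmapped_area chooses one from the rb-tree). */
	void *a = mmap(hint, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* With MAP_FIXED the mapping is placed exactly at hint, replacing
	 * anything that was already mapped there. */
	void *b = mmap(hint, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

	printf("hint=%p a=%p b=%p\n", hint, a, b);
	return 0;
}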

mmap_region

mm/mmap.c

unsigned long mmap_region(struct file *file, unsigned long addr,
unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
struct list_head *uf)
{
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma, *prev;
int error;
struct rb_node **rb_link, *rb_parent;
unsigned long charged = 0;

/* Check against address space limit. */
if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) {
unsigned long nr_pages;

/*
* MAP_FIXED may remove pages of mappings that intersects with
* requested mapping. Account for the pages it would unmap.
*/
nr_pages = count_vma_pages_range(mm, addr, addr + len);

if (!may_expand_vm(mm, vm_flags,
(len >> PAGE_SHIFT) - nr_pages))
return -ENOMEM;
}

/* Clear old maps */
while (find_vma_links(mm, addr, addr + len, &prev, &rb_link,
&rb_parent)) {
if (do_munmap(mm, addr, len, uf))
return -ENOMEM;
}

/*
* Private writable mapping: check memory availability
*/
if (accountable_mapping(file, vm_flags)) {
charged = len >> PAGE_SHIFT;
if (security_vm_enough_memory_mm(mm, charged))
return -ENOMEM;
vm_flags |= VM_ACCOUNT;
}

/*
* Can we just expand an old mapping?
*/
vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
NULL, file, pgoff, NULL, NULL_VM_UFFD_CTX);
if (vma)
goto out;

/*
* Determine the object being mapped and call the appropriate
* specific mapper. the address has already been validated, but
* not unmapped, but the maps are removed from the list.
*/
vma = vm_area_alloc(mm);
if (!vma) {
error = -ENOMEM;
goto unacct_error;
}

vma->vm_start = addr;
vma->vm_end = addr + len;
vma->vm_flags = vm_flags;
vma->vm_page_prot = vm_get_page_prot(vm_flags);
vma->vm_pgoff = pgoff;

if (file) {
if (vm_flags & VM_DENYWRITE) {
error = deny_write_access(file);
if (error)
goto free_vma;
}
if (vm_flags & VM_SHARED) {
error = mapping_map_writable(file->f_mapping);
if (error)
goto allow_write_and_free_vma;
}

/* ->mmap() can change vma->vm_file, but must guarantee that
* vma_link() below can deny write-access if VM_DENYWRITE is set
* and map writably if VM_SHARED is set. This usually means the
* new file must not have been exposed to user-space, yet.
*/
vma->vm_file = get_file(file);
error = call_mmap(file, vma); //file->f_op->mmap(file, vma): this is where the mapping itself is actually established (page tables built, etc.)
if (error)
goto unmap_and_free_vma;

/* Can addr have changed??
*
* Answer: Yes, several device drivers can do it in their
* f_op->mmap method. -DaveM
* Bug: If addr is changed, prev, rb_link, rb_parent should
* be updated for vma_link()
*/
WARN_ON_ONCE(addr != vma->vm_start);

addr = vma->vm_start;
vm_flags = vma->vm_flags;
} else if (vm_flags & VM_SHARED) {
error = shmem_zero_setup(vma);
if (error)
goto free_vma;
} else {
vma_set_anonymous(vma);
}

vma_link(mm, vma, prev, rb_link, rb_parent); //insert the vma into the linked list and the red-black tree
/* Once vma denies write, undo our temporary denial count */
if (file) {
if (vm_flags & VM_SHARED)
mapping_unmap_writable(file->f_mapping);
if (vm_flags & VM_DENYWRITE)
allow_write_access(file);
}
file = vma->vm_file;
out:
perf_event_mmap(vma);

vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
if (vm_flags & VM_LOCKED) {
if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
is_vm_hugetlb_page(vma) ||
vma == get_gate_vma(current->mm))
vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
else
mm->locked_vm += (len >> PAGE_SHIFT);
}

if (file)
uprobe_mmap(vma);

/*
* New (or expanded) vma always get soft dirty status.
* Otherwise user-space soft-dirty page tracker won't
* be able to distinguish situation when vma area unmapped,
* then new mapped in-place (which must be aimed as
* a completely new data area).
*/
vma->vm_flags |= VM_SOFTDIRTY;

vma_set_page_prot(vma);

return addr;

unmap_and_free_vma:
vma->vm_file = NULL;
fput(file);

/* Undo any partial mapping done by a device driver. */
unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
charged = 0;
if (vm_flags & VM_SHARED)
mapping_unmap_writable(file->f_mapping);
allow_write_and_free_vma:
if (vm_flags & VM_DENYWRITE)
allow_write_access(file);
free_vma:
vm_area_free(vma);
unacct_error:
if (charged)
vm_unacct_memory(charged);
return error;
}

mmap_region() mainly initializes the vm_area_struct object, performs the mapping, and inserts the object into the linked list and the red-black tree.

The mapping itself is performed by calling file->f_op->mmap(file, vma); in other words, the device driver decides how to establish the virtual-to-physical mapping (i.e. build the page tables). If the device has its own physical memory, it can call the kernel helpers to map it directly; if not, it can allocate physical memory and build the page tables as described above; or it can set up no mapping at all and simply return virtual addresses that are not yet backed by physical memory, so that the first real access triggers a fault, and only then is physical memory allocated and mapped.
