Linux内核-进程、线程初探

记录一下Linux的进程/线程的一些基础学习,有一些内容是从一些参考资料中拿来的,根据我自己的需要整合了一下,也会贴一些源码和一些自己的理解。如有理解不透彻导致的错误,恳请指正。

参考资料

https://www.cnblogs.com/wipan/p/9488318.html

http://kernel.pursuitofcloud.org/531016

https://stackoverflow.com/questions/33982789/difference-between-euid-suid-and-ruid-in-linux-systems

《深入理解Linux内核》(第三版)

源码:Linux-4.19.65

一些概念、数据结构

进程、线程

通常把进程定义为程序执行的一个实例,因此,如果16个用户同事运行vi,那么就有16个独立的进程(尽管它们共享同一个可执行代码)。在Linux源代码中,常把进程称为任务(task)或线程(thread)。

在linux系统里,进程和线程都是通过task_struct结构体来描述。

进程是资源分配的基本单位,线程是调度的基本单位

CPU对任务进行调度时,可调度的基本单位 (dispatchable entity)是线程。如果一个进程中没有其他线程,可以理解成这个进程中只有一个主线程,这个主进程独享进程中的所有资源。

内核调度的对象是根据task_struct结构体。可以说是线程,而不是进程。

进程之间不共享地址空间,而线程与创建它的进程是共享地址空间的。

Linux系统 对线程和进程并不特别区分。线程仅仅被视为一个与其他线程共享某些资源的进程。每个线程都拥有唯一自己的task_struct。

一个进程由几个线程组成,每个线程都代表进程的一个执行流。

不仅仅要有资源,还需要有进程的描述,例如:pid

进程的个体间是完全独立的,而线程间是彼此依存,并且共享资源。多进程环境中,任何一个进程的终止,不会影响到其他非子进程。而多线程环境中,父线程终止,全部子线程被迫终止(没有了资源)。

pid、lwp、tid、tgid、

进程标识符 - Process ID(PID),可用来标识进程。PID存放在进程描述符pid字段中。PID是顺序编号的,新创建的进程PID通常是前一个进程的PID加1。

一个多线程引用程序中的所有线程都必须有相同的PID。因此,Linux引入线程组的表示。一个线程组中的所有线程使用和该线程组的领头线程(thread group leader)相同的PID,也就是该组中第一个轻量级进程的PID,它被存入进程描述符的tgid字段中。

  • pid: 进程ID。
  • lwp: 线程ID。在用户态的命令(比如ps)中常用的显示方式。
  • tid: 线程ID,等于lwp。tid在系统提供的接口函数中更常用,比如syscall(SYS_gettid)和syscall(__NR_gettid)。
  • tgid: 线程组ID,也就是线程组leader的进程ID,等于pid。

实际上我看源码,好像只在进程描述符中找到pidtgid,那么所谓的tid,应该就是描述符中的pid,所谓的pid,其实就是描述符中的tgid。主线程中,pid = tgid;而在非主线程中,pid != tgid,tgid = 主线程tgid = 主线程pid,我是这么理解的,如有错误恳请指正。。。

以下是一张图,描述了父子进程、线程之间的关系,来自https://stackoverflow.com/questions/9305992/if-threads-share-the-same-pid-how-can-they-be-identified

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
                      USER VIEW
<-- PID 43 --> <----------------- PID 42 ----------------->
+---------+
| process |
_| pid=42 |_
_/ | tgid=42 | \_ (new thread) _
_ (fork) _/ +---------+ \
/ +---------+
+---------+ | process |
| process | | pid=44 |
| pid=43 | | tgid=42 |
| tgid=43 | +---------+
+---------+
<-- PID 43 --> <--------- PID 42 --------> <--- PID 44 --->
KERNEL VIEW

上图很好地描述了用户视角(user view)和内核视角(kernel view)看到线程的差别:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
                      USER VIEW
<-- PID 43 --> <----------------- PID 42 ----------------->
+---------+
| process |
_| tid=42 |_
_/ | tgid=42 | \_ (new thread) _
_ (fork) _/ +---------+ \
/ +---------+
+---------+ | process |
| process | | tid=44 |
| tid=43 | | tgid=42 |
| tgid=43 | +---------+
+---------+
<-- TID 43 --> <--------- TID 42 --------> <--- TID 44 --->
KERNEL VIEW

上图很好地描述了用户视角(user view)和内核视角(kernel view)看到线程的差别:

  • 从用户视角出发,在pid 42中产生的tid 44线程,属于tgid(线程组leader的进程ID) 42。甚至用ps和top的默认参数,你都无法看到tid 44线程。
  • 从内核视角出发,tid 42和tid 44是独立的调度单元,可以把他们视为”pid 42”和”pid 44”。

进程描述符 task_struct

进程描述符都是task_struct类型结构,它的字段包含了与一个进程相关的所有信息,它是相当复杂的。不仅包含了很多进程属性的字段,而且一些字段还包括了指向其他数据结构的指针。每一个进程、线程都会有对应一个task_struct

进程控制块PCB

内核对进程的大部分引用是通过进程描述符指针进行的。

内核还定义了task_t数据类型等同于struct tasl_struct

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
/* /inlude/linux/sched.h */

struct task_struct {
#ifdef CONFIG_THREAD_INFO_IN_TASK
/*
* For reasons of header soup (see current_thread_info()), this
* must be the first element of task_struct.
*/
struct thread_info thread_info; //线程描述符
#endif
/* -1 unrunnable, 0 runnable, >0 stopped: */
volatile long state; //进程的状态

/*
* This begins the randomizable portion of task_struct. Only
* scheduling-critical items should be added above here.
*/
randomized_struct_fields_start

void *stack;
atomic_t usage;
/* Per task flags (PF_*), defined further below: */
unsigned int flags;
unsigned int ptrace;

#ifdef CONFIG_SMP
struct llist_node wake_entry;
int on_cpu;
#ifdef CONFIG_THREAD_INFO_IN_TASK
/* Current CPU: */
unsigned int cpu;
#endif
unsigned int wakee_flips;
unsigned long wakee_flip_decay_ts;
struct task_struct *last_wakee;

/*
* recent_used_cpu is initially set as the last CPU used by a task
* that wakes affine another task. Waker/wakee relationships can
* push tasks around a CPU where each wakeup moves to the next one.
* Tracking a recently used CPU allows a quick search for a recently
* used CPU that may be idle.
*/
int recent_used_cpu;
int wake_cpu;
#endif
int on_rq;

int prio;
int static_prio;
int normal_prio;
unsigned int rt_priority;

const struct sched_class *sched_class;
struct sched_entity se;
struct sched_rt_entity rt;
#ifdef CONFIG_CGROUP_SCHED
struct task_group *sched_task_group;
#endif
struct sched_dl_entity dl;

#ifdef CONFIG_PREEMPT_NOTIFIERS
/* List of struct preempt_notifier: */
struct hlist_head preempt_notifiers;
#endif

#ifdef CONFIG_BLK_DEV_IO_TRACE
unsigned int btrace_seq;
#endif

unsigned int policy;
int nr_cpus_allowed;
cpumask_t cpus_allowed;

#ifdef CONFIG_PREEMPT_RCU
int rcu_read_lock_nesting;
union rcu_special rcu_read_unlock_special;
struct list_head rcu_node_entry;
struct rcu_node *rcu_blocked_node;
#endif /* #ifdef CONFIG_PREEMPT_RCU */

#ifdef CONFIG_TASKS_RCU
unsigned long rcu_tasks_nvcsw;
u8 rcu_tasks_holdout;
u8 rcu_tasks_idx;
int rcu_tasks_idle_cpu;
struct list_head rcu_tasks_holdout_list;
#endif /* #ifdef CONFIG_TASKS_RCU */

struct sched_info sched_info;

struct list_head tasks;
#ifdef CONFIG_SMP
struct plist_node pushable_tasks;
struct rb_node pushable_dl_tasks;
#endif

struct mm_struct *mm; //指向内存区描述符
struct mm_struct *active_mm;

/* Per-thread vma caching: */
struct vmacache vmacache;

#ifdef SPLIT_RSS_COUNTING
struct task_rss_stat rss_stat;
#endif
int exit_state;
int exit_code;
int exit_signal;
/* The signal sent when the parent dies: */
int pdeath_signal;
/* JOBCTL_*, siglock protected: */
unsigned long jobctl;

/* Used for emulating ABI behavior of previous Linux versions: */
unsigned int personality;

/* Scheduler bits, serialized by scheduler locks: */
unsigned sched_reset_on_fork:1;
unsigned sched_contributes_to_load:1;
unsigned sched_migrated:1;
unsigned sched_remote_wakeup:1;
/* Force alignment to the next boundary: */
unsigned :0;

/* Unserialized, strictly 'current' */

/* Bit to tell LSMs we're in execve(): */
unsigned in_execve:1;
unsigned in_iowait:1;
#ifndef TIF_RESTORE_SIGMASK
unsigned restore_sigmask:1;
#endif
#ifdef CONFIG_MEMCG
unsigned in_user_fault:1;
#ifdef CONFIG_MEMCG_KMEM
unsigned memcg_kmem_skip_account:1;
#endif
#endif
#ifdef CONFIG_COMPAT_BRK
unsigned brk_randomized:1;
#endif
#ifdef CONFIG_CGROUPS
/* disallow userland-initiated cgroup migration */
unsigned no_cgroup_migration:1;
#endif
#ifdef CONFIG_BLK_CGROUP
/* to be used once the psi infrastructure lands upstream. */
unsigned use_memdelay:1;
#endif

unsigned long atomic_flags; /* Flags requiring atomic access. */

struct restart_block restart_block;

pid_t pid; //全局的进程号
pid_t tgid; //全局的线程组标识符

#ifdef CONFIG_STACKPROTECTOR
/* Canary value for the -fstack-protector GCC feature: */
unsigned long stack_canary;
#endif
/*
* Pointers to the (original) parent process, youngest child, younger sibling,
* older sibling, respectively. (p->father can be replaced with
* p->real_parent->pid)
*/

/* Real parent process: */
struct task_struct __rcu *real_parent;

/* Recipient of SIGCHLD, wait4() reports: */
struct task_struct __rcu *parent;

/*
* Children/sibling form the list of natural children:
*/
struct list_head children;
struct list_head sibling;
struct task_struct *group_leader;

/*
* 'ptraced' is the list of tasks this task is using ptrace() on.
*
* This includes both natural children and PTRACE_ATTACH targets.
* 'ptrace_entry' is this task's link on the p->parent->ptraced list.
*/
struct list_head ptraced;
struct list_head ptrace_entry;

/* PID/PID hash table linkage. */
struct pid *thread_pid;
struct hlist_node pid_links[PIDTYPE_MAX];
struct list_head thread_group;
struct list_head thread_node;

struct completion *vfork_done;

/* CLONE_CHILD_SETTID: */
int __user *set_child_tid;

/* CLONE_CHILD_CLEARTID: */
int __user *clear_child_tid;

u64 utime;
u64 stime;
#ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
u64 utimescaled;
u64 stimescaled;
#endif
u64 gtime;
struct prev_cputime prev_cputime;
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
struct vtime vtime;
#endif

#ifdef CONFIG_NO_HZ_FULL
atomic_t tick_dep_mask;
#endif
/* Context switch counts: */
unsigned long nvcsw;
unsigned long nivcsw;

/* Monotonic time in nsecs: */
u64 start_time;

/* Boot based time in nsecs: */
u64 real_start_time;

/* MM fault and swap info: this can arguably be seen as either mm-specific or thread-specific: */
unsigned long min_flt;
unsigned long maj_flt;

#ifdef CONFIG_POSIX_TIMERS
struct task_cputime cputime_expires;
struct list_head cpu_timers[3];
#endif

/* Process credentials: */

/* Tracer's credentials at attach: */
const struct cred __rcu *ptracer_cred;

/* Objective and real subjective task credentials (COW): */
const struct cred __rcu *real_cred;

/* Effective (overridable) subjective task credentials (COW): */
const struct cred __rcu *cred;

/*
* executable name, excluding path.
*
* - normally initialized setup_new_exec()
* - access it with [gs]et_task_comm()
* - lock it with task_lock()
*/
char comm[TASK_COMM_LEN];

struct nameidata *nameidata;

#ifdef CONFIG_SYSVIPC
struct sysv_sem sysvsem;
struct sysv_shm sysvshm;
#endif
#ifdef CONFIG_DETECT_HUNG_TASK
unsigned long last_switch_count;
unsigned long last_switch_time;
#endif
/* Filesystem information: */
struct fs_struct *fs; //当前目录

/* Open file information: */
struct files_struct *files; //指向文件描述符的指针

/* Namespaces: */
struct nsproxy *nsproxy;

/* Signal handlers: */
struct signal_struct *signal;
struct sighand_struct *sighand;
sigset_t blocked;
sigset_t real_blocked;
/* Restored if set_restore_sigmask() was used: */
sigset_t saved_sigmask;
struct sigpending pending;
unsigned long sas_ss_sp;
size_t sas_ss_size;
unsigned int sas_ss_flags;

struct callback_head *task_works;

struct audit_context *audit_context;
#ifdef CONFIG_AUDITSYSCALL
kuid_t loginuid;
unsigned int sessionid;
#endif
struct seccomp seccomp;

/* Thread group tracking: */
u32 parent_exec_id;
u32 self_exec_id;

/* Protection against (de-)allocation: mm, files, fs, tty, keyrings, mems_allowed, mempolicy: */
spinlock_t alloc_lock;

/* Protection of the PI data structures: */
raw_spinlock_t pi_lock;

struct wake_q_node wake_q;

#ifdef CONFIG_RT_MUTEXES
/* PI waiters blocked on a rt_mutex held by this task: */
struct rb_root_cached pi_waiters;
/* Updated under owner's pi_lock and rq lock */
struct task_struct *pi_top_task;
/* Deadlock detection and priority inheritance handling: */
struct rt_mutex_waiter *pi_blocked_on;
#endif

#ifdef CONFIG_DEBUG_MUTEXES
/* Mutex deadlock detection: */
struct mutex_waiter *blocked_on;
#endif

#ifdef CONFIG_TRACE_IRQFLAGS
unsigned int irq_events;
unsigned long hardirq_enable_ip;
unsigned long hardirq_disable_ip;
unsigned int hardirq_enable_event;
unsigned int hardirq_disable_event;
int hardirqs_enabled;
int hardirq_context;
unsigned long softirq_disable_ip;
unsigned long softirq_enable_ip;
unsigned int softirq_disable_event;
unsigned int softirq_enable_event;
int softirqs_enabled;
int softirq_context;
#endif

#ifdef CONFIG_LOCKDEP
# define MAX_LOCK_DEPTH 48UL
u64 curr_chain_key;
int lockdep_depth;
unsigned int lockdep_recursion;
struct held_lock held_locks[MAX_LOCK_DEPTH];
#endif

#ifdef CONFIG_UBSAN
unsigned int in_ubsan;
#endif

/* Journalling filesystem info: */
void *journal_info;

/* Stacked block device info: */
struct bio_list *bio_list;

#ifdef CONFIG_BLOCK
/* Stack plugging: */
struct blk_plug *plug;
#endif

/* VM state: */
struct reclaim_state *reclaim_state;

struct backing_dev_info *backing_dev_info;

struct io_context *io_context;

/* Ptrace state: */
unsigned long ptrace_message;
siginfo_t *last_siginfo;

struct task_io_accounting ioac;
#ifdef CONFIG_TASK_XACCT
/* Accumulated RSS usage: */
u64 acct_rss_mem1;
/* Accumulated virtual memory usage: */
u64 acct_vm_mem1;
/* stime + utime since last update: */
u64 acct_timexpd;
#endif
#ifdef CONFIG_CPUSETS
/* Protected by ->alloc_lock: */
nodemask_t mems_allowed;
/* Seqence number to catch updates: */
seqcount_t mems_allowed_seq;
int cpuset_mem_spread_rotor;
int cpuset_slab_spread_rotor;
#endif
#ifdef CONFIG_CGROUPS
/* Control Group info protected by css_set_lock: */
struct css_set __rcu *cgroups;
/* cg_list protected by css_set_lock and tsk->alloc_lock: */
struct list_head cg_list;
#endif
#ifdef CONFIG_INTEL_RDT
u32 closid;
u32 rmid;
#endif
#ifdef CONFIG_FUTEX
struct robust_list_head __user *robust_list;
#ifdef CONFIG_COMPAT
struct compat_robust_list_head __user *compat_robust_list;
#endif
struct list_head pi_state_list;
struct futex_pi_state *pi_state_cache;
#endif
#ifdef CONFIG_PERF_EVENTS
struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts];
struct mutex perf_event_mutex;
struct list_head perf_event_list;
#endif
#ifdef CONFIG_DEBUG_PREEMPT
unsigned long preempt_disable_ip;
#endif
#ifdef CONFIG_NUMA
/* Protected by alloc_lock: */
struct mempolicy *mempolicy;
short il_prev;
short pref_node_fork;
#endif
#ifdef CONFIG_NUMA_BALANCING
int numa_scan_seq;
unsigned int numa_scan_period;
unsigned int numa_scan_period_max;
int numa_preferred_nid;
unsigned long numa_migrate_retry;
/* Migration stamp: */
u64 node_stamp;
u64 last_task_numa_placement;
u64 last_sum_exec_runtime;
struct callback_head numa_work;

/*
* This pointer is only modified for current in syscall and
* pagefault context (and for tasks being destroyed), so it can be read
* from any of the following contexts:
* - RCU read-side critical section
* - current->numa_group from everywhere
* - task's runqueue locked, task not running
*/
struct numa_group __rcu *numa_group;

/*
* numa_faults is an array split into four regions:
* faults_memory, faults_cpu, faults_memory_buffer, faults_cpu_buffer
* in this precise order.
*
* faults_memory: Exponential decaying average of faults on a per-node
* basis. Scheduling placement decisions are made based on these
* counts. The values remain static for the duration of a PTE scan.
* faults_cpu: Track the nodes the process was running on when a NUMA
* hinting fault was incurred.
* faults_memory_buffer and faults_cpu_buffer: Record faults per node
* during the current scan window. When the scan completes, the counts
* in faults_memory and faults_cpu decay and these values are copied.
*/
unsigned long *numa_faults;
unsigned long total_numa_faults;

/*
* numa_faults_locality tracks if faults recorded during the last
* scan window were remote/local or failed to migrate. The task scan
* period is adapted based on the locality of the faults with different
* weights depending on whether they were shared or private faults
*/
unsigned long numa_faults_locality[3];

unsigned long numa_pages_migrated;
#endif /* CONFIG_NUMA_BALANCING */

#ifdef CONFIG_RSEQ
struct rseq __user *rseq;
u32 rseq_len;
u32 rseq_sig;
/*
* RmW on rseq_event_mask must be performed atomically
* with respect to preemption.
*/
unsigned long rseq_event_mask;
#endif

struct tlbflush_unmap_batch tlb_ubc;

struct rcu_head rcu;

/* Cache last used pipe for splice(): */
struct pipe_inode_info *splice_pipe;

struct page_frag task_frag;

#ifdef CONFIG_TASK_DELAY_ACCT
struct task_delay_info *delays;
#endif

#ifdef CONFIG_FAULT_INJECTION
int make_it_fail;
unsigned int fail_nth;
#endif
/*
* When (nr_dirtied >= nr_dirtied_pause), it's time to call
* balance_dirty_pages() for a dirty throttling pause:
*/
int nr_dirtied;
int nr_dirtied_pause;
/* Start of a write-and-pause period: */
unsigned long dirty_paused_when;

#ifdef CONFIG_LATENCYTOP
int latency_record_count;
struct latency_record latency_record[LT_SAVECOUNT];
#endif
/*
* Time slack values; these are used to round up poll() and
* select() etc timeout values. These are in nanoseconds.
*/
u64 timer_slack_ns;
u64 default_timer_slack_ns;

#ifdef CONFIG_KASAN
unsigned int kasan_depth;
#endif

#ifdef CONFIG_FUNCTION_GRAPH_TRACER
/* Index of current stored address in ret_stack: */
int curr_ret_stack;
int curr_ret_depth;

/* Stack of return addresses for return function tracing: */
struct ftrace_ret_stack *ret_stack;

/* Timestamp for last schedule: */
unsigned long long ftrace_timestamp;

/*
* Number of functions that haven't been traced
* because of depth overrun:
*/
atomic_t trace_overrun;

/* Pause tracing: */
atomic_t tracing_graph_pause;
#endif

#ifdef CONFIG_TRACING
/* State flags for use by tracers: */
unsigned long trace;

/* Bitmask and counter of trace recursion: */
unsigned long trace_recursion;
#endif /* CONFIG_TRACING */

#ifdef CONFIG_KCOV
/* Coverage collection mode enabled for this task (0 if disabled): */
unsigned int kcov_mode;

/* Size of the kcov_area: */
unsigned int kcov_size;

/* Buffer for coverage collection: */
void *kcov_area;

/* KCOV descriptor wired with this task or NULL: */
struct kcov *kcov;
#endif

#ifdef CONFIG_MEMCG
struct mem_cgroup *memcg_in_oom;
gfp_t memcg_oom_gfp_mask;
int memcg_oom_order;

/* Number of pages to reclaim on returning to userland: */
unsigned int memcg_nr_pages_over_high;

/* Used by memcontrol for targeted memcg charge: */
struct mem_cgroup *active_memcg;
#endif

#ifdef CONFIG_BLK_CGROUP
struct request_queue *throttle_queue;
#endif

#ifdef CONFIG_UPROBES
struct uprobe_task *utask;
#endif
#if defined(CONFIG_BCACHE) || defined(CONFIG_BCACHE_MODULE)
unsigned int sequential_io;
unsigned int sequential_io_avg;
#endif
#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
unsigned long task_state_change;
#endif
int pagefault_disabled;
#ifdef CONFIG_MMU
struct task_struct *oom_reaper_list;
#endif
#ifdef CONFIG_VMAP_STACK
struct vm_struct *stack_vm_area;
#endif
#ifdef CONFIG_THREAD_INFO_IN_TASK
/* A live task holds one reference: */
atomic_t stack_refcount;
#endif
#ifdef CONFIG_LIVEPATCH
int patch_state;
#endif
#ifdef CONFIG_SECURITY
/* Used by LSM modules for access restriction: */
void *security;
#endif

/*
* New fields for task_struct should be added above here, so that
* they are included in the randomized portion of task_struct.
*/
randomized_struct_fields_end

/* CPU-specific state of this task: */
struct thread_struct thread;

/*
* WARNING: on x86, 'thread_struct' contains a variable-sized
* structure. It *MUST* be at the end of 'task_struct'.
*
* Do not put anything below here!
*/
};

线程描述符 thread_info

task_struct结构体中有一个struct thread_info thread_info字段

进程描述符 task_struct存放在动态内存中(由Slab分配器分配),而线程描述符thread_info和内核堆栈一起放在内核地址空间(内核虚拟地址)的开始连续两个页框内(8kb大)。

关于为啥要有了一个task_struct还要有一个thread_info,网上有说法是:

为什么需要thread_info
内核还需要存储每个进程的PCB信息,linux内核是支持不同体系的的, 但是不同的体系结构可能进程需要存储的信息不尽相同,这就需要我们实现一种通用的方式,我们将体系结构相关的部分和无关的部门进行分离
用一种通用的方式来描述进程, 这就是struct task_struct,而thread_info就保存了特定体系结构的汇编代码段需要访问的那部分进程的数据,我们在thread_info中嵌入指向task_struct的指针,则我们可以很方便的通过thread_info来查找task_struct。
————————————————
版权声明:本文为CSDN博主「JeanCheng」的原创文章,遵循 CC 4.0 BY-SA 版权协议,转载请附上原文出处链接及本声明。
原文链接:https://blog.csdn.net/gatieme/article/details/51577479

自己翻了一下源码,确实在不同的架构下,会有自己对thread_info的定义

arch/x86/include/asm/thread_info.h :

arch/arc/include/asm/thread_info.h

/arch/arm/include/asm/thread_info.h

进程链表

Linux内核定义了一个list_head结构,这个结构有两个字段next和prev,分别表示通用双向链表向前和向后的指针元素。不过,值得特别关注的是,list_head字段中的指针指向的是另一个list_head字段的地址,而不是含有该list_head结构的整个数据结构的地址。

进程链表是一个双向循环链表,它把所有进程的描述符链接起来。每个task_struct结构都包含一个list_head类型的字段tasks,这个结构的prev和next分别指向前面和后面的task_struct元素。

TASK_RUNNING状态的进程链表

当内核寻找一个新进程在CPU上运行时,必须只考虑可运行进程(即处在TASK_RUNNING状态的进程)。

为了提高调度程序运行速度,Linux会建立多个可运行进程链表,每种进程优先权对应一个不同的链表。每个task_struct描述符包含一个list_head类型的字段run_list

1
2
3
4
5
6
7
8
9
10
11
12
13
/* /inlude/linux/sched.h */

struct task_struct {
...
struct sched_rt_entity rt;
...
};

struct sched_rt_entity {
struct list_head run_list; //可运行进程链表
...
...
}

例如,如果进程的优先权等于k,run_list字段就会把该进程链入优先权为k的可运行链表中。

安全上下文 cred

进程描述符中,有两个字段const struct cred __rcu *real_cred;const struct cred __rcu *cred,它们都是cred类型的结构体指针,这个结构体记录了一些当前进程的安全凭证/权限信息。定义如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
/* /include/linux/cred.h */
/*
* The security context of a task
*
* The parts of the context break down into two categories:
*
* (1) The objective context of a task. These parts are used when some other
* task is attempting to affect this one.
*
* (2) The subjective context. These details are used when the task is acting
* upon another object, be that a file, a task, a key or whatever.
*
* Note that some members of this structure belong to both categories - the
* LSM security pointer for instance.
*
* A task has two security pointers. task->real_cred points to the objective
* context that defines that task's actual details. The objective part of this
* context is used whenever that task is acted upon.
*
* task->cred points to the subjective context that defines the details of how
* that task is going to act upon another object. This may be overridden
* temporarily to point to another security context, but normally points to the
* same context as task->real_cred.
*/
struct cred {
atomic_t usage;
#ifdef CONFIG_DEBUG_CREDENTIALS
atomic_t subscribers; /* number of processes subscribed */
void *put_addr;
unsigned magic;
#define CRED_MAGIC 0x43736564
#define CRED_MAGIC_DEAD 0x44656144
#endif
kuid_t uid; /* real UID of the task */
kgid_t gid; /* real GID of the task */
kuid_t suid; /* saved UID of the task */
kgid_t sgid; /* saved GID of the task */
kuid_t euid; /* effective UID of the task */
kgid_t egid; /* effective GID of the task */
kuid_t fsuid; /* UID for VFS ops */
kgid_t fsgid; /* GID for VFS ops */
unsigned securebits; /* SUID-less security management */
kernel_cap_t cap_inheritable; /* caps our children can inherit */
kernel_cap_t cap_permitted; /* caps we're permitted */
kernel_cap_t cap_effective; /* caps we can actually use */
kernel_cap_t cap_bset; /* capability bounding set */
kernel_cap_t cap_ambient; /* Ambient capability set */
#ifdef CONFIG_KEYS
unsigned char jit_keyring; /* default keyring to attach requested
* keys to */
struct key __rcu *session_keyring; /* keyring inherited over fork */
struct key *process_keyring; /* keyring private to this process */
struct key *thread_keyring; /* keyring private to this thread */
struct key *request_key_auth; /* assumed request_key authority */
#endif
#ifdef CONFIG_SECURITY
void *security; /* subjective LSM security */
#endif
struct user_struct *user; /* real user ID subscription */
struct user_namespace *user_ns; /* user_ns the caps and keyrings are relative to. */
struct group_info *group_info; /* supplementary groups for euid/fsgid */
/* RCU deletion */
union {
int non_rcu; /* Can we skip RCU deletion? */
struct rcu_head rcu; /* RCU deletion hook */
};
} __randomize_layout;

read_cred与cred

根据源码中的注释,我是这样理解这两个字段的:

task->real_cred points to the objective context that defines that task’s actual details. The objective part of this context is used whenever that task is acted upon.

real_cred指向客观上下文,这个cred定义了这个进程客观拥有的一些细节(就是那些各种ID和别的一些安全信息),当这个进程被使用(别的对象对当前对象进行操作时,我的理解是,操作当前对象所需要的权限、安全信息),用到的是这个real_cred。

task->cred points to the subjective context that defines the details of how that task is going to act upon another object. This may be overridden temporarily to point to another security context, but normally points to the same context as task->real_cred.

而cred字段则指向主观上下文,也就是当前对象对别的对象进行操作时需要参考的cred(我的理解是,当前对象正在拥有的权限、安全信息),它通常和real_cred指向同一个cred结构体,但也可能临时指向另一个cred。

以上是我的个人理解,如有错误请指正!

UID、SUID、EUID

cred结构体中维护了三种uid,分别是 uid、suid和euid,简述一下它们之间的不同:(这里我也不是理解的很清楚。。参考自 https://stackoverflow.com/questions/33982789/difference-between-euid-suid-and-ruid-in-linux-systems

  • uid:real UID of the task,进程真正的uid,即为创建该进程的用户的uid
  • euid:effective UID of the task,有效的UID,用于进程访问资源时的访问检查,大多数情况下,EUID是同于UID的,但是也可以不同。
  • suid:saved UID of the task,保留的UID,例如,当一个特权进程需要临时降低其权限时,将其euid更改为非特权的UID,然后将原来的EUID保存到SUID,当需要恢复权限时,将EUID改为SUID中保存的UID即可。

kernel pwn 中的提权

可以通过执行commit_creds(prepare_kernel_cred(0))来获得root权限(root的uid、gid均为0)

下面根据源码看一下这俩个内核函数。

prepare_kernel_cred()

根据源码注释中的描述,这个函数返回一个cred结构体,可以用于代替进程原来的cred以便能够完成需要不同subjective context的任务。如果提供了参数@daemon,那么security data将来源于此,而这个参数也可为空,然后内容字段会被设置成0(uid/gid都是0,就是root权限咯?)

这个函数定义在/kernel/cred.c

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
/* /kernel/cred.c */
/**
* prepare_kernel_cred - Prepare a set of credentials for a kernel service
* @daemon: A userspace daemon to be used as a reference
*
* Prepare a set of credentials for a kernel service. This can then be used to
* override a task's own credentials so that work can be done on behalf of that
* task that requires a different subjective context.
*
* @daemon is used to provide a base for the security record, but can be NULL.
* If @daemon is supplied, then the security data will be derived from that;
* otherwise they'll be set to 0 and no groups, full capabilities and no keys.
*
* The caller may change these controls afterwards if desired.
*
* Returns the new credentials or NULL if out of memory.
*
* Does not take, and does not return holding current->cred_replace_mutex.
*/
struct cred *prepare_kernel_cred(struct task_struct *daemon)
{
const struct cred *old;
struct cred *new;

new = kmem_cache_alloc(cred_jar, GFP_KERNEL);
if (!new)
return NULL;

kdebug("prepare_kernel_cred() alloc %p", new);

if (daemon)
old = get_task_cred(daemon);
else
old = get_cred(&init_cred);

validate_creds(old);

*new = *old;
new->non_rcu = 0;
atomic_set(&new->usage, 1);
set_cred_subscribers(new, 0);
get_uid(new->user);
get_user_ns(new->user_ns);
get_group_info(new->group_info);

#ifdef CONFIG_KEYS
new->session_keyring = NULL;
new->process_keyring = NULL;
new->thread_keyring = NULL;
new->request_key_auth = NULL;
new->jit_keyring = KEY_REQKEY_DEFL_THREAD_KEYRING;
#endif

#ifdef CONFIG_SECURITY
new->security = NULL;
#endif
if (security_prepare_creds(new, old, GFP_KERNEL) < 0)
goto error;

put_cred(old);
validate_creds(new);
return new;

error:
put_cred(new);
put_cred(old);
return NULL;
}
EXPORT_SYMBOL(prepare_kernel_cred);

commit_creds()

根据源码注释的描述,这个函数会将当前进程的real_credcred都设置成一组新的cred。

这个函数也定义在/kernel/cred.c

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
/* /kernel/cred.c */
/**
* commit_creds - Install new credentials upon the current task
* @new: The credentials to be assigned
*
* Install a new set of credentials to the current task, using RCU to replace
* the old set. Both the objective and the subjective credentials pointers are
* updated. This function may not be called if the subjective credentials are
* in an overridden state.
*
* This function eats the caller's reference to the new credentials.
*
* Always returns 0 thus allowing this function to be tail-called at the end
* of, say, sys_setgid().
*/
int commit_creds(struct cred *new)
{
struct task_struct *task = current;
const struct cred *old = task->real_cred;

kdebug("commit_creds(%p{%d,%d})", new,
atomic_read(&new->usage),
read_cred_subscribers(new));

BUG_ON(task->cred != old);
#ifdef CONFIG_DEBUG_CREDENTIALS
BUG_ON(read_cred_subscribers(old) < 2);
validate_creds(old);
validate_creds(new);
#endif
BUG_ON(atomic_read(&new->usage) < 1);

get_cred(new); /* we will require a ref for the subj creds too */

/* dumpability changes */
if (!uid_eq(old->euid, new->euid) ||
!gid_eq(old->egid, new->egid) ||
!uid_eq(old->fsuid, new->fsuid) ||
!gid_eq(old->fsgid, new->fsgid) ||
!cred_cap_issubset(old, new)) {
if (task->mm)
set_dumpable(task->mm, suid_dumpable);
task->pdeath_signal = 0;
/*
* If a task drops privileges and becomes nondumpable,
* the dumpability change must become visible before
* the credential change; otherwise, a __ptrace_may_access()
* racing with this change may be able to attach to a task it
* shouldn't be able to attach to (as if the task had dropped
* privileges without becoming nondumpable).
* Pairs with a read barrier in __ptrace_may_access().
*/
smp_wmb();
}

/* alter the thread keyring */
if (!uid_eq(new->fsuid, old->fsuid))
key_fsuid_changed(task);
if (!gid_eq(new->fsgid, old->fsgid))
key_fsgid_changed(task);

/* do it
* RLIMIT_NPROC limits on user->processes have already been checked
* in set_user().
*/
alter_cred_subscribers(new, 2);
if (new->user != old->user)
atomic_inc(&new->user->processes);
rcu_assign_pointer(task->real_cred, new);
rcu_assign_pointer(task->cred, new);
if (new->user != old->user)
atomic_dec(&old->user->processes);
alter_cred_subscribers(old, -2);

/* send notifications */
if (!uid_eq(new->uid, old->uid) ||
!uid_eq(new->euid, old->euid) ||
!uid_eq(new->suid, old->suid) ||
!uid_eq(new->fsuid, old->fsuid))
proc_id_connector(task, PROC_EVENT_UID);

if (!gid_eq(new->gid, old->gid) ||
!gid_eq(new->egid, old->egid) ||
!gid_eq(new->sgid, old->sgid) ||
!gid_eq(new->fsgid, old->fsgid))
proc_id_connector(task, PROC_EVENT_GID);

/* release the old obj and subj refs both */
put_cred(old);
put_cred(old);
return 0;
}
EXPORT_SYMBOL(commit_creds);

综上,通过prepare_kernel_cred(0)获得一个root的cred,然后再用commit_creds()将其安装到当前进程,即commit_creds(prepare_kernel_cred(0)),这样就可以提权啦!

0%