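[Container Internals] Linux Namespaces Explained

1. Introduction

Namespaces are a Linux kernel feature that partitions kernel resources: one set of processes sees only the resources associated with it, while another set of processes sees a different set of resources. Processes in different namespaces each appear to have their own independent copy of a global system resource. Changing a resource inside one namespace affects only the processes in that namespace and has no effect on processes in other namespaces.

The feature works by associating a set of resources and a set of processes with the same namespace, while different namespaces refer to distinct resources. Some kinds of resources may exist in multiple namespaces, such as process IDs, hostnames, user IDs, file names, and names associated with network access and inter-process communication.

Namespaces are a fundamental building block of containers on Linux. A Linux system starts with a single namespace of each type, shared by all processes. Processes can create additional namespaces and join different namespaces.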

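2. Types

As of kernel 5.6 there are 8 kinds of namespaces. They all work the same way: each process is associated with one namespace of each kind and can only see or use the resources belonging to that namespace and, where applicable, its descendant namespaces. In this way each process has its own view of the system's resources. Which resource is isolated depends on the kind of namespace created for a given group of processes.

Overview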

type	flag	function
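mnt	CLONE_NEWNS	Controls mount points
pid	CLONE_NEWPID	Gives processes an independent set of process IDs (PIDs)
net	CLONE_NEWNET	Isolates the network stack: network devices, IP addresses, ports, and so on
uts	CLONE_NEWUTS	Lets different processes on one system see different host and domain names
user	CLONE_NEWUSER	Provides privilege isolation and user identity isolation across sets of processes
ipc	CLONE_NEWIPC	Isolates processes from SysV-style inter-process communication
cgroup	CLONE_NEWCGROUP	Hides the identity of the control group a process belongs to
time	CLONE_NEWTIME	Lets different processes see different system times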
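
The table can be mirrored in a few lines of C. This is only a sketch: the helper name ns_flag_for_type() is made up for illustration, and the two fallback #defines assume the values from the kernel's uapi sched.h for systems whose headers predate those flags.

```c
#include <linux/sched.h> /* CLONE_NEW* flag values */
#include <string.h>      /* strcmp, size_t */

/* Fallbacks for older headers (assumed values from the kernel uapi header). */
#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000
#endif
#ifndef CLONE_NEWTIME
#define CLONE_NEWTIME 0x00000080
#endif

/* Hypothetical helper: map a namespace type name from the table to its
 * clone flag. Returns 0 for an unknown name. */
static int ns_flag_for_type(const char *type)
{
	static const struct {
		const char *name;
		int flag;
	} table[] = {
		{ "mnt",    CLONE_NEWNS },
		{ "pid",    CLONE_NEWPID },
		{ "net",    CLONE_NEWNET },
		{ "uts",    CLONE_NEWUTS },
		{ "user",   CLONE_NEWUSER },
		{ "ipc",    CLONE_NEWIPC },
		{ "cgroup", CLONE_NEWCGROUP },
		{ "time",   CLONE_NEWTIME },
	};

	for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++)
		if (strcmp(type, table[i].name) == 0)
			return table[i].flag;
	return 0;
}
```

Several flags can be OR-ed together, for example ns_flag_for_type("uts") | ns_flag_for_type("ipc"), and passed to clone() or unshare() in one call.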

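Mount (mnt) namespace

Mount namespaces control mount points. Processes in different mount namespaces see different file system hierarchies. Calling mount() or umount() inside a mount namespace only affects the file system view of that namespace. When a new mount namespace is created, the mount points of the current namespace are copied into it, but mount points created afterwards do not propagate between namespaces (unless shared subtrees are used, which allow mount points to propagate between namespaces).

The clone flag that creates this kind of namespace is CLONE_NEWNS, short for "NEW NameSpace". The name is not descriptive (it does not say which kind of namespace is created) because mount namespaces were the first kind of namespace and the designers did not anticipate that there would be others.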

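Process ID (pid) namespace

A PID namespace gives its processes a set of process IDs (PIDs) that is independent of other namespaces. PID namespaces are nested: when a new process is created it gets a PID in every namespace from its current namespace up to the initial PID namespace. The initial PID namespace can therefore see all processes, although with different PIDs than the ones seen inside the other namespaces.

The first process created in a PID namespace gets process ID 1 and receives most of the special treatment of the ordinary init process. Most notably, all orphaned processes in the namespace are attached to it. (An orphan process is one that keeps running after its parent has finished or terminated; attaching it to a new parent is called adoption.) This also means that when the PID 1 process terminates, all processes in its PID namespace and in any descendant namespaces are terminated immediately.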
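
The "first process gets PID 1" rule can be observed with a small user-space sketch. It assumes glibc's clone() wrapper and a caller that is allowed to create PID namespaces (normally CAP_SYS_ADMIN); the helper names are made up for illustration.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs as the first process of the new PID namespace and reports the PID
 * it sees for itself through its exit status. */
static int report_own_pid(void *arg)
{
	(void)arg;
	return (int)getpid() & 0xff;
}

/* Hypothetical helper: start a child in a new PID namespace and return the
 * PID the child observed, or -1 if the namespace could not be created
 * (typically EPERM for an unprivileged caller). */
static int pid_inside_new_pidns(void)
{
	const size_t stack_size = 64 * 1024;
	char *stack = malloc(stack_size);
	if (!stack)
		return -1;

	/* clone() expects the top of the stack on architectures where the
	 * stack grows downwards. */
	pid_t child = clone(report_own_pid, stack + stack_size,
			    CLONE_NEWPID | SIGCHLD, NULL);
	if (child == -1) {
		free(stack);
		return -1;
	}

	int status = 0;
	pid_t waited = waitpid(child, &status, 0);
	free(stack);
	if (waited != child || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

The parent still sees the child under an ordinary PID (the value returned by clone()), while the child sees itself as PID 1. This is the nesting described above.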

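Network (net) namespace

Network namespaces isolate the network stack: network devices, IP addresses, ports, and so on. Each network interface (physical or virtual) exists in exactly one namespace and can be moved between namespaces. Each namespace has its own private set of IP addresses, its own routing table, socket list, connection tracking table, firewall, and other network-related resources. Processes in a network namespace can have their own (virtual) network devices, and ports in different network namespaces do not conflict.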

UTS namespace

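UTS (UNIX Time-Sharing) namespaces let a single system present different host and domain names to different processes. When a process creates a new UTS namespace, the hostname and domain name of the new namespace are copied from the corresponding values in the caller's UTS namespace.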

User ID (user) namespace

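User namespaces, complete as of kernel 3.8, provide privilege isolation and user identity isolation across sets of processes. With administrative assistance, a container can be built that appears to have administrative rights without the user's processes actually being given elevated privileges. Like PID namespaces, user namespaces are nested: every new user namespace is considered a child of the user namespace that created it.

A user namespace contains a mapping table that converts user IDs from the container's point of view to the system's point of view. This allows, for example, the root user to have user ID 0 inside the container while the system treats it as user ID 1,400,000 for ownership checks. A similar table is used for group ID mapping and ownership checks.

To support privilege isolation for administrative operations, every namespace of every type is considered owned by a user namespace, namely the user namespace that was active when it was created. A user with administrative privileges in the appropriate user namespace is allowed to perform administrative operations in the namespaces it owns. For example, a process with administrative permission to change the IP address of a network interface may do so as long as its own user namespace is the same as (or an ancestor of) the user namespace that owns the network namespace. The initial user namespace therefore has administrative control over all namespaces of all types in the system.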
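
The mapping table is a list of ranges, one per line of /proc/<pid>/uid_map, each of the form "inside outside count". The sketch below shows how such a table translates an ID. The struct and function names are hypothetical, and the range length of 65536 used in the note afterwards is an assumed example value, not something the kernel mandates.

```c
#include <stddef.h>

/* One mapping range, mirroring a line of /proc/<pid>/uid_map. */
struct id_map_entry {
	unsigned int inside;  /* first ID as seen inside the namespace */
	unsigned int outside; /* first ID as seen from the parent namespace */
	unsigned int count;   /* number of consecutive IDs in the range */
};

/* Translate an ID seen inside the namespace to the ID seen outside.
 * Returns 0 and stores the result in *out, or -1 if the ID is unmapped. */
static int map_id_to_outside(const struct id_map_entry *map, size_t n,
			     unsigned int id, unsigned int *out)
{
	for (size_t i = 0; i < n; i++) {
		if (id >= map[i].inside && id - map[i].inside < map[i].count) {
			*out = map[i].outside + (id - map[i].inside);
			return 0;
		}
	}
	return -1;
}
```

With the single range { 0, 1400000, 65536 }, container UID 0 translates to 1400000, matching the example in the text, and any ID at or above 65536 is unmapped.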

Interprocess Communication (ipc) namespace

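IPC namespaces isolate processes from SysV-style inter-process communication. This prevents processes in different IPC namespaces from using, for example, the SHM family of functions to establish a shared memory region between them. Instead, each process can use the same identifier for a shared memory region and end up with two distinct regions.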

Control group (cgroup) namespace

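The cgroup namespace type hides the identity of the control group a process belongs to. A process in such a namespace that checks which control group any process belongs to sees a path that is relative to the control group that was set when the namespace was created, which hides the true position and identity of that control group. This namespace type has existed since Linux 4.6, released in March 2016.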

Time namespace

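Time namespaces work in a way similar to UTS namespaces and let different processes see different system times. More precisely, they virtualize the offsets of the CLOCK_MONOTONIC and CLOCK_BOOTTIME clocks.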

3. Manipulation

For every process, the Linux kernel exposes one symbolic link per namespace type under /proc/<pid>/ns/.


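The target of each link has the form xxx:[inode number], where xxx is the namespace type and the inode number identifies the namespace. The inode number is the same for every process in that namespace, so the inode number that one of these links points to uniquely identifies a namespace.

Namespaces can be manipulated directly with the following three system calls:

clone: the flags argument specifies which new namespaces the new process should be placed in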
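
A short sketch of how a program might read one of these links and pull the inode number out of it. The function names are hypothetical, and the code assumes /proc is mounted.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Parse a link target such as "uts:[4026531838]" into its type and inode
 * number. Returns 0 on success, -1 if the text does not have that shape. */
static int parse_ns_link(const char *target, char *type, size_t type_len,
			 unsigned long long *inode)
{
	const char *colon = strchr(target, ':');
	if (!colon || colon == target || (size_t)(colon - target) >= type_len)
		return -1;

	char closing = 0;
	if (sscanf(colon, ":[%llu%c", inode, &closing) != 2 || closing != ']')
		return -1;

	memcpy(type, target, (size_t)(colon - target));
	type[colon - target] = '\0';
	return 0;
}

/* Read the inode number of the calling process's namespace of the given
 * type ("uts", "net", ...). Returns 0 on success, -1 on failure. */
static int self_ns_inode(const char *type, unsigned long long *inode)
{
	char path[64], target[64], parsed[32];

	snprintf(path, sizeof(path), "/proc/self/ns/%s", type);
	ssize_t len = readlink(path, target, sizeof(target) - 1);
	if (len < 0)
		return -1;
	target[len] = '\0'; /* readlink() does not NUL-terminate */
	return parse_ns_link(target, parsed, sizeof(parsed), inode);
}
```

Two processes are in the same namespace of a given type exactly when these inode numbers are equal.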

SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
		 int __user *, parent_tidptr,
		 unsigned long, tls,
		 int __user *, child_tidptr)
{
	struct kernel_clone_args args = {
		.flags		= (lower_32_bits(clone_flags) & ~CSIGNAL),
		.pidfd		= parent_tidptr,
		.child_tid	= child_tidptr,
		.parent_tid	= parent_tidptr,
		.exit_signal	= (lower_32_bits(clone_flags) & CSIGNAL),
		.stack		= newsp,
		.tls		= tls,
	};

	return kernel_clone(&args);
}

pid_t kernel_clone(struct kernel_clone_args *args)
{
	u64 clone_flags = args->flags;
	struct completion vfork;
	struct pid *pid;
	struct task_struct *p;
	int trace = 0;
	pid_t nr;

	/*
	 * For legacy clone() calls, CLONE_PIDFD uses the parent_tid argument
	 * to return the pidfd. Hence, CLONE_PIDFD and CLONE_PARENT_SETTID are
	 * mutually exclusive. With clone3() CLONE_PIDFD has grown a separate
	 * field in struct clone_args and it still doesn't make sense to have
	 * them both point at the same memory location. Performing this check
	 * here has the advantage that we don't need to have a separate helper
	 * to check for legacy clone().
	 */
	if ((args->flags & CLONE_PIDFD) &&
	    (args->flags & CLONE_PARENT_SETTID) &&
	    (args->pidfd == args->parent_tid))
		return -EINVAL;

	/*
	 * Determine whether and which event to report to ptracer. When
	 * called from kernel_thread or CLONE_UNTRACED is explicitly
	 * requested, no event is reported; otherwise, report if the event
	 * for the type of forking is enabled.
	 */
	if (!(clone_flags & CLONE_UNTRACED)) {
		if (clone_flags & CLONE_VFORK)
			trace = PTRACE_EVENT_VFORK;
		else if (args->exit_signal != SIGCHLD)
			trace = PTRACE_EVENT_CLONE;
		else
			trace = PTRACE_EVENT_FORK;

		if (likely(!ptrace_event_enabled(current, trace)))
			trace = 0;
	}

	p = copy_process(NULL, trace, NUMA_NO_NODE, args);
	add_latent_entropy();

	if (IS_ERR(p))
		return PTR_ERR(p);

	/*
	 * Do this prior waking up the new thread - the thread pointer
	 * might get invalid after that point, if the thread exits quickly.
	 */
	trace_sched_process_fork(current, p);

	pid = get_task_pid(p, PIDTYPE_PID);
	nr = pid_vnr(pid);

	if (clone_flags & CLONE_PARENT_SETTID)
		put_user(nr, args->parent_tid);

	if (clone_flags & CLONE_VFORK) {
		p->vfork_done = &vfork;
		init_completion(&vfork);
		get_task_struct(p);
	}

	wake_up_new_task(p);

	/* forking complete and child started to run, tell ptracer */
	if (unlikely(trace))
		ptrace_event_pid(trace, pid);

	if (clone_flags & CLONE_VFORK) {
		if (!wait_for_vfork_done(p, &vfork))
			ptrace_event_pid(PTRACE_EVENT_VFORK_DONE, pid);
	}

	put_pid(pid);
	return nr;
}

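unshare: lets a process (or thread) disassociate parts of its execution context that it currently shares with other processes (or threads). Its main use is to let a process control its shared execution context without creating a new process. In effect, the caller leaves its original namespaces and moves into newly created ones.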

SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags)
{
	return ksys_unshare(unshare_flags);
}

/*
 * unshare allows a process to 'unshare' part of the process
 * context which was originally shared using clone. copy_*
 * functions used by kernel_clone() cannot be used here directly
 * because they modify an inactive task_struct that is being
 * constructed. Here we are modifying the current, active,
 * task_struct.
 */
int ksys_unshare(unsigned long unshare_flags)
{
	struct fs_struct *fs, *new_fs = NULL;
	struct files_struct *new_fd = NULL;
	struct cred *new_cred = NULL;
	struct nsproxy *new_nsproxy = NULL;
	int do_sysvsem = 0;
	int err;

	/*
	 * If unsharing a user namespace must also unshare the thread group
	 * and unshare the filesystem root and working directories.
	 */
	if (unshare_flags & CLONE_NEWUSER)
		unshare_flags |= CLONE_THREAD | CLONE_FS;
	/*
	 * If unsharing vm, must also unshare signal handlers.
	 */
	if (unshare_flags & CLONE_VM)
		unshare_flags |= CLONE_SIGHAND;
	/*
	 * If unsharing a signal handlers, must also unshare the signal queues.
	 */
	if (unshare_flags & CLONE_SIGHAND)
		unshare_flags |= CLONE_THREAD;
	/*
	 * If unsharing namespace, must also unshare filesystem information.
	 */
	if (unshare_flags & CLONE_NEWNS)
		unshare_flags |= CLONE_FS;

	err = check_unshare_flags(unshare_flags);
	if (err)
		goto bad_unshare_out;
	/*
	 * CLONE_NEWIPC must also detach from the undolist: after switching
	 * to a new ipc namespace, the semaphore arrays from the old
	 * namespace are unreachable.
	 */
	if (unshare_flags & (CLONE_NEWIPC|CLONE_SYSVSEM))
		do_sysvsem = 1;
	err = unshare_fs(unshare_flags, &new_fs);
	if (err)
		goto bad_unshare_out;
	err = unshare_fd(unshare_flags, NR_OPEN_MAX, &new_fd);
	if (err)
		goto bad_unshare_cleanup_fs;
	err = unshare_userns(unshare_flags, &new_cred);
	if (err)
		goto bad_unshare_cleanup_fd;
	err = unshare_nsproxy_namespaces(unshare_flags, &new_nsproxy,
					 new_cred, new_fs);
	if (err)
		goto bad_unshare_cleanup_cred;

	if (new_cred) {
		err = set_cred_ucounts(new_cred);
		if (err)
			goto bad_unshare_cleanup_cred;
	}

	if (new_fs || new_fd || do_sysvsem || new_cred || new_nsproxy) {
		if (do_sysvsem) {
			/*
			 * CLONE_SYSVSEM is equivalent to sys_exit().
			 */
			exit_sem(current);
		}
		if (unshare_flags & CLONE_NEWIPC) {
			/* Orphan segments in old ns (see sem above). */
			exit_shm(current);
			shm_init_task(current);
		}

		if (new_nsproxy)
			switch_task_namespaces(current, new_nsproxy);

		task_lock(current);

		if (new_fs) {
			fs = current->fs;
			spin_lock(&fs->lock);
			current->fs = new_fs;
			if (--fs->users)
				new_fs = NULL;
			else
				new_fs = fs;
			spin_unlock(&fs->lock);
		}

		if (new_fd)
			swap(current->files, new_fd);

		task_unlock(current);

		if (new_cred) {
			/* Install the new user namespace */
			commit_creds(new_cred);
			new_cred = NULL;
		}
	}

	perf_event_namespaces(current);

bad_unshare_cleanup_cred:
	if (new_cred)
		put_cred(new_cred);
bad_unshare_cleanup_fd:
	if (new_fd)
		put_files_struct(new_fd);

bad_unshare_cleanup_fs:
	if (new_fs)
		free_fs_struct(new_fs);

bad_unshare_out:
	return err;
}
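
From user space the same path is reached through the unshare() wrapper. The sketch below is an illustration only: it unshares the UTS namespace and changes the hostname inside it, then checks that the original namespace is unaffected. The fork() is there only so the demo leaves the calling process in its original namespace; unshare() itself always acts on the caller. The helper name is hypothetical, and the call normally needs CAP_SYS_ADMIN.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: in a forked child, unshare the UTS namespace and
 * rename the host there. Returns 1 if the caller's hostname is unchanged
 * afterwards, 0 if it changed, and -1 if the child could not unshare or
 * rename (typically EPERM) or another step failed. */
static int hostname_survives_unshare(const char *new_name)
{
	char before[256] = "", after[256] = "";

	if (gethostname(before, sizeof(before) - 1) != 0)
		return -1;

	pid_t child = fork();
	if (child == -1)
		return -1;
	if (child == 0) {
		/* Child: leave the shared UTS namespace, then rename the
		 * host inside the private copy. */
		if (unshare(CLONE_NEWUTS) != 0)
			_exit(1);
		if (sethostname(new_name, strlen(new_name)) != 0)
			_exit(1);
		_exit(0);
	}

	int status = 0;
	if (waitpid(child, &status, 0) != child)
		return -1;
	if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
		return -1;

	if (gethostname(after, sizeof(after) - 1) != 0)
		return -1;
	return strcmp(before, after) == 0;
}
```

When run with sufficient privileges, hostname_survives_unshare("ns-demo") is expected to return 1, because the new hostname exists only in the child's private UTS namespace.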

setns: enters a specific namespace through a file descriptor

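SYSCALL_DEFINE2(setns, int, fd, int, nstype)
{
	struct task_struct *tsk = current;
	struct nsproxy *new_nsproxy;
	struct file *file;
	struct ns_common *ns;
	int err;

	/* fd must refer to a /proc/<pid>/ns/<type> file */
	file = proc_ns_fget(fd);
	if (IS_ERR(file))
		return PTR_ERR(file);

	/* If the caller named a type, it must match the namespace's type */
	err = -EINVAL;
	ns = get_proc_ns(file_inode(file));
	if (nstype && (ns->ops->type != nstype))
		goto out;

	/* Copy the current nsproxy, then install the target namespace in it */
	new_nsproxy = create_new_namespaces(0, tsk, current_user_ns(), tsk->fs);
	if (IS_ERR(new_nsproxy)) {
		err = PTR_ERR(new_nsproxy);
		goto out;
	}

	err = ns->ops->install(new_nsproxy, ns);
	if (err) {
		free_nsproxy(new_nsproxy);
		goto out;
	}
	switch_task_namespaces(tsk, new_nsproxy);
out:
	fput(file);
	return err;
}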
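
A user-space counterpart, as a minimal sketch: open the namespace file of a target process and pass the descriptor to setns(). The helper name join_namespace() is hypothetical. The call usually needs CAP_SYS_ADMIN, and some namespace types have extra restrictions (for example, a multithreaded process cannot join a user namespace).

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: move the calling thread into the namespace of the
 * given type that process `pid` belongs to. `type` is the file name under
 * /proc/<pid>/ns/ and `nstype` the matching CLONE_NEW* flag, or 0 to accept
 * any type. Returns 0 on success, -1 on failure with errno set. */
static int join_namespace(pid_t pid, const char *type, int nstype)
{
	char path[64];
	int n = snprintf(path, sizeof(path), "/proc/%ld/ns/%s", (long)pid, type);
	if (n < 0 || (size_t)n >= sizeof(path)) {
		errno = ENAMETOOLONG;
		return -1;
	}

	int fd = open(path, O_RDONLY | O_CLOEXEC);
	if (fd == -1)
		return -1;

	int ret = setns(fd, nstype);
	int saved_errno = errno;
	close(fd); /* membership persists after the descriptor is closed */
	errno = saved_errno;
	return ret;
}
```

For example, join_namespace(target_pid, "net", CLONE_NEWNET) would place the caller in the network namespace of target_pid, where target_pid stands for whatever process the caller wants to join.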

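When a namespace is no longer referenced it is destroyed; what happens to the resources it contains depends on the namespace type. A namespace can be referenced in three ways:

by a process that belongs to the namespace
by an open file descriptor for the namespace's file, /proc/<pid>/ns/<ns-kind>
by a bind mount of the namespace's file, /proc/<pid>/ns/<ns-kind>

4. Source definitions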

/*
 * A structure to contain pointers to all per-process
 * namespaces - fs (mount), uts, network, sysvipc, etc.
 *
 * The pid namespace is an exception -- it's accessed using
 * task_active_pid_ns. The pid namespace here is the
 * namespace that children will use.
 *
 * 'count' is the number of tasks holding a reference.
 * The count for each namespace, then, will be the number
 * of nsproxies pointing to it, not the number of tasks.
 *
 * The nsproxy is shared by tasks which share all namespaces.
 * As soon as a single namespace is cloned or unshared, the
 * nsproxy is copied.
 */
struct nsproxy {
	atomic_t count;
	struct uts_namespace *uts_ns;
	struct ipc_namespace *ipc_ns;
	struct mnt_namespace *mnt_ns;
	struct pid_namespace *pid_ns_for_children;
	struct net 	     *net_ns;
	struct time_namespace *time_ns;
	struct time_namespace *time_ns_for_children;
	struct cgroup_namespace *cgroup_ns;
};

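/*
 * nsproxy has no user_namespace member because the user namespace is
 * special: it is used for credential checks, whereas an nsproxy is shared
 * by all processes that share the same set of namespaces. The user
 * namespace is instead reached through the cred member of
 * struct task_struct, which holds a struct user_namespace *user_ns
 * used for identity checks.
 */
struct user_namespace {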
     struct uid_gid_map	uid_map;
     struct uid_gid_map	gid_map;
     struct uid_gid_map	projid_map;
     atomic_t		count;
     struct user_namespace	*parent;
     int			level;
     kuid_t			owner;
     kgid_t			group;
     struct ns_common	ns;
     unsigned long		flags;

     /* Register of per-UID persistent keyrings for this namespace */
     #ifdef CONFIG_PERSISTENT_KEYRINGS
     struct key		*persistent_keyring_register;
     struct rw_semaphore	persistent_keyring_register_sem;
     #endif
 };

struct mnt_namespace {
	atomic_t		count;
	struct ns_common	ns;
	struct mount *	root;
	struct list_head	list;
	struct user_namespace	*user_ns;
	struct ucounts		*ucounts;
	u64			seq;	/* Sequence number to prevent loops */
	wait_queue_head_t poll;
	u64 event;
	unsigned int		mounts; /* # of mounts in the namespace */
	unsigned int		pending_mounts;
};

struct uts_namespace {
	struct kref kref;
	struct new_utsname name;
	struct user_namespace *user_ns;
	struct ucounts *ucounts;
	struct ns_common ns;
};

struct ipc_namespace {
	refcount_t	count;
	struct ipc_ids	ids[3];

	int		sem_ctls[4];
	int		used_sems;

	unsigned int	msg_ctlmax;
	unsigned int	msg_ctlmnb;
	unsigned int	msg_ctlmni;
	atomic_t	msg_bytes;
	atomic_t	msg_hdrs;

	size_t		shm_ctlmax;
	size_t		shm_ctlall;
	unsigned long	shm_tot;
	int		shm_ctlmni;
	/*
	 * Defines whether IPC_RMID is forced for _all_ shm segments regardless
	 * of shmctl()
	 */
	int		shm_rmid_forced;

	struct notifier_block ipcns_nb;

	/* The kern_mount of the mqueuefs sb. We take a ref on it */
	struct vfsmount	*mq_mnt;

	/* # queues in this ns, protected by mq_lock */
	unsigned int    mq_queues_count;

	/* next fields are set through sysctl */
	unsigned int    mq_queues_max;   /* initialized to DFLT_QUEUESMAX */
	unsigned int    mq_msg_max;      /* initialized to DFLT_MSGMAX */
	unsigned int    mq_msgsize_max;  /* initialized to DFLT_MSGSIZEMAX */
	unsigned int    mq_msg_default;
	unsigned int    mq_msgsize_default;

	/* user_ns which owns the ipc ns */
	struct user_namespace *user_ns;
	struct ucounts *ucounts;

	struct ns_common ns;
};

struct pid_namespace {
	struct kref kref;
	struct idr idr;
	struct rcu_head rcu;
	unsigned int pid_allocated;
	struct task_struct *child_reaper;
	struct kmem_cache *pid_cachep;
	unsigned int level;
	struct pid_namespace *parent;
#ifdef CONFIG_PROC_FS
	struct vfsmount *proc_mnt;
	struct dentry *proc_self;
	struct dentry *proc_thread_self;
#endif
#ifdef CONFIG_BSD_PROCESS_ACCT
	struct fs_pin *bacct;
#endif
	struct user_namespace *user_ns;
	struct ucounts *ucounts;
	struct work_struct proc_work;
	kgid_t pid_gid;
	int hide_pid;
	int reboot;	/* group exit code if this pidns was rebooted */
	struct ns_common ns;
};

struct cgroup_namespace {
	refcount_t		count;
	struct ns_common	ns;
	struct user_namespace	*user_ns;
	struct ucounts		*ucounts;
	struct css_set          *root_cset;
};

struct net {
	/* To decide when the network namespace should be freed. */
	refcount_t		passive;
	/* To decide when the network namespace should be shut down. */
	refcount_t		count;
	spinlock_t		rules_mod_lock;

	atomic64_t		cookie_gen;

	struct list_head	list;		/* list of network namespaces */
	/*
	 * To be linked to call pernet exit methods on dead net
	 * (pernet_ops_rwsem read locked), or to unregister pernet ops
	 * (pernet_ops_rwsem write locked).
	 */
	struct list_head	exit_list;
	struct llist_node	cleanup_list;	/* namespaces on death row */

	struct user_namespace   *user_ns;	/* Owning user namespace */
	struct ucounts		*ucounts;
	spinlock_t		nsid_lock;
	struct idr		netns_ids;

	struct ns_common	ns;

	struct proc_dir_entry 	*proc_net;
	struct proc_dir_entry 	*proc_net_stat;

#ifdef CONFIG_SYSCTL
	struct ctl_table_set	sysctls;
#endif

	struct sock 		*rtnl;			/* rtnetlink socket */
	struct sock		*genl_sock;

	struct uevent_sock	*uevent_sock;		/* uevent socket */

	struct list_head 	dev_base_head;
	struct hlist_head 	*dev_name_head;
	struct hlist_head	*dev_index_head;
	unsigned int		dev_base_seq;	/* protected by rtnl_mutex */
	int			ifindex;
	unsigned int		dev_unreg_count;

	/* core fib_rules */
	struct list_head	rules_ops;

	struct list_head	fib_notifier_ops;  /* Populated by register_pernet_subsys() */
	struct net_device       *loopback_dev;          /* The loopback */
	struct netns_core	core;
	/* ... remaining fields omitted ... */
};