tl;dr:
I developed a coverage-guided JavaScript (v8) fuzzer similar to Fuzzilli (but without an intermediate language and developed in Python). I participated with the fuzzer in the Fuzzilli Research Grant Program from Google Project Zero.
Most of my ideas are not fully implemented yet and the fuzzer still lacks a lot of fine-tuning because I didn't have that much time for the project. I nevertheless try to describe my ideas in this blog post.
You can find the source code of the fuzzer here.
The story behind this blog post
In this blog post I want to summarize the
results, ideas, failures, and everything else I learned along my journey to
fuzz JavaScript engines (mainly Chrome’s v8 engine).
Everything started with my master's thesis, which I wrote on browser exploits. Between December 2019 and September 2020, I researched the topic of browser and JavaScript engine exploitation. In the end, I analyzed 56 different browser exploits which had been exploited in the past five years. I re-developed most of these exploits, and my goal was to find similarities which can help fuzzers identify variations of these exploitable bugs. For example, by analyzing the PoCs, it can be seen that most vulnerability classes share a similar structure and that a fuzzer doesn't need to rediscover this structure again and again. Instead, the structure of the vulnerability class can be hardcoded in the fuzzer and only the remaining required code lines need to be fuzzed. This reduces the search space and should therefore make fuzzing more efficient. You can find my master's thesis in the docs folder of the fuzzer project.
To prove (or disprove) my ideas, I developed a fuzzer. I had two options: modify the well-working Fuzzilli fuzzer, or develop a fuzzer from scratch. In the end, I developed a new fuzzer, basically because I thought that by developing a completely new fuzzer I would likely find vulnerabilities different from the ones Fuzzilli finds. In the end, this choice had advantages and disadvantages:
Advantages:
- The fuzzer is different from Fuzzilli and therefore finds different bugs.
- I could design the fuzzer as I like.
- I learned a lot.
- The fuzzer is developed in Python, so maybe other researchers prefer to use a Python fuzzer over a Swift fuzzer (Fuzzilli) because more security researchers are familiar with Python.
Disadvantages:
- Developing everything from scratch was very time-consuming. I greatly underestimated the required time.
- In the end it turned out that Fuzzilli's design is very good. A lot of its design choices make sense in hindsight. For example, Fuzzilli operates on an intermediate language, while my fuzzer applies mutations directly on the JavaScript code. This helps my fuzzer find very "exotic" testcases (but they don't really occur that often...), however, it requires that I parse JavaScript code myself, which is a ton of work to implement (and it's error-prone). Moreover, other fuzzers such as DIE or CodeAlchemist use a third-party library to parse JavaScript. I decided to implement everything myself, which again resulted in a lot of work. Another difference is that I keep a state which describes a testcase, and this means that I must implement every mutation twice: once on the JavaScript code and once on the state (and an error in either of them would de-synchronize the state from the testcase).
I also talked to a lot of researchers who develop browser exploits professionally. Literally everyone told me that I should not develop a fuzzer and should instead focus on code review. However, I had to develop a fuzzer because I couldn't write a master's thesis on "I performed code review and found X bugs". In the end, I think I would have found way more bugs with code review in the same amount of time, so I would also suggest that others do code review instead.
For the evaluation of my master's thesis I ran my fuzzer in a VM on my home computer for one week. Obviously, running a fuzzer on just one core in a VM for just one week is not enough (just to compare numbers: I performed 8 467 311 executions in that week; Samuel, the developer of Fuzzilli, told me that he typically performs at least 1 billion executions before he stops his fuzzer; in the Project Zero supported research, which I explain in the next chapters, I also performed 15 billion executions with my fuzzer). But yeah, I found an exploitable v8 vulnerability in this single week. It was a duplicate (of CRBUG 1072171) and another researcher had already found the same vulnerability before with a modified version of Fuzzilli, but at least my fuzzer found something.
Google Project Zero Fuzzilli Research Program
In October 2020 Google announced the Fuzzilli Research Grant Program – a program which grants researchers up to 5 000 $ in
the form of Google Cloud credits for fuzzing.
Previously, I had run my fuzzer only on a single CPU for just one week. The fuzzer was just a simple PoC and most of my ideas were not implemented at that time. That meant I would have to add a lot of improvements so that running the fuzzer on such a huge infrastructure would pay off. And I had to add them in my spare time (nights/weekends) because in my day job I work as a pentester.
I still decided to apply for the project.
In the end, I got the funding from the research program. I decided to start at the end of February 2021 and I had to finish my research within 6 months.
Unfortunately, I didn't have as much time as I was expecting. Because of Corona, a lot of companies decided to do pentests, which meant our company had a lot of work and I often worked late into the night. Moreover, I unexpectedly had to relocate to another apartment, and my dad got sick, so I had to take care of him and help my mother. So, on many weekends/nights I couldn't work on the fuzzer. I had a lot more ideas but just didn't find the time to implement all of them. That's also the reason why the code isn't really that nice at the moment...
The design of the fuzzer
Currently, the fuzzer is a purely mutation-based fuzzer; however, I plan to change this soon (more on this in the generator subchapter).
Most JavaScript fuzzers don't use coverage feedback or a corpus of input files. To my knowledge, Fuzzilli was the first famous (public) fuzzer which applied the idea of feedback-based fuzzing to JavaScript engines. In my opinion that was a good idea, although coverage-based fuzzing works better when fuzzing file formats. The reason is that JavaScript fuzzing (or browser fuzzing in general) especially requires reaching different "memory states" and not only different "code paths". For example, in file format fuzzing, corner cases can often be found by just taking different if-branches in the code. Coverage feedback helps to find testcases which trigger different if-branches and is therefore very effective in file format fuzzing.
Applied to JavaScript fuzzing, the coverage feedback helps to detect “unique JavaScript code” like unique ways to call a function. However, to trigger bugs in JavaScript engines, it’s required to find the correct combination of these code lines and the correct structure of a testcase. The coverage-feedback doesn’t help with that. What we would need is some other form of execution feedback, but we currently don’t have that.
My idea was therefore to create a very good corpus of JavaScript testcases – each triggering unique coverage in the JavaScript engine. Then, I extract all JavaScript operations, function calls, numbers, strings and so on out of the corpus and use them during fuzzing. Moreover, the fuzzer applies the mutations (like inserting an identified operation on a variable) on testcases from the corpus.
Traditional fuzzers implement these operations manually in the fuzzer. They have a list of available methods (or a grammar), maybe together with the number and types of the arguments. Then they add a random method call at a random code line with random values for the specified argument data types.
However, it doesn't make sense to add these calls randomly. Take, for example, the Math.sign() function. It doesn't make sense to add Math.sign(123), then Math.sign(12515), then Math.sign(1251) and so on. It makes a lot more sense to only add calls to a new testcase which each result in unique behavior, like Math.sign(1), then Math.sign(-1), then Math.sign(0) and then Math.sign(-0). And maybe sometimes, but very rarely, add completely random calls.
Moreover, it's very likely that these traditional fuzzers miss some variable operations because they implement them manually. For example, "variable1 instanceof Array" would also be an operation which can trigger side effects, and it would have to be added manually (because of the unique syntax of this operation: it is not a function invocation).
The advantage of automatically creating a database / catalog of operations from a corpus is that important operations are not forgotten.
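To make this more concrete, here is a minimal sketch of how such an operations database could be built from a corpus directory. This is a simplified illustration, not the fuzzer's actual implementation; the file layout, the regex heuristics and the VAR placeholder convention are assumptions for this example.

import os
import re

def extract_operations(corpus_dir):
    # Collect candidate operations (function calls, instanceof checks, ...) from all
    # corpus files so the mutator can later insert them into other testcases.
    operations = set()
    for filename in os.listdir(corpus_dir):
        path = os.path.join(corpus_dir, filename)
        with open(path, encoding="utf-8", errors="ignore") as f:
            for line in f:
                line = line.strip().rstrip(";")
                # Rough heuristic: keep lines which look like a call or an instanceof check
                if re.search(r"\w+\s*\(.*\)", line) or " instanceof " in line:
                    # Generalize concrete variable names (var_1_, var_2_, ...) to a placeholder
                    operations.add(re.sub(r"\bvar_\d+_\b", "VAR", line))
    return operations

During fuzzing, the mutator would pick a random entry, replace the VAR placeholder with a variable of a matching data type, and insert the resulting line into the current testcase.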
One could argue that these operations can also be extracted from a set of testcases (without a corpus minimized using coverage feedback). One could download all regression and test files from a JavaScript engine and create the database based on these testcases. However, then a lot of useless operations would be stored in the database and fuzzing would not be efficient. The database could contain 100 Math.sign() calls with different arguments and just one of them is a call to Math.sign(-0); it would therefore be unlikely that the fuzzer adds this one operation because there are so many others. If, on the other hand, coverage feedback is used, then the database would just contain 4 Math.sign() calls: Math.sign(1), Math.sign(-1), Math.sign(0) and Math.sign(-0), because they have unique coverage (maybe plus some other variations like object arguments, but I want to keep the discussion simple). The likelihood that the fuzzer adds an important operation would therefore be 1:4 instead of 1:100.
Another advantage is that I can create tests for my database/catalog. I can analyze previously exploited vulnerabilities and manually extract important operations from them. Then, I can create a test suite which checks whether these operations are stored in my generated operations database or not. If not, I know that my code or corpus is currently not good enough, I'm maybe missing other important operations, and I can debug why this occurs (is the feedback not good enough? Does my fuzzer not generate testcases with these calls? And so on).
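As an illustration, such a test can be as simple as the following sketch (the concrete operation strings and the set-of-strings database format are assumptions, not the real test suite):

def test_database_contains_important_operations(operations_db):
    # Operations manually extracted from previously exploited PoCs (examples only)
    important_operations = [
        "Math.sign(-0)",
        "VAR instanceof Array",
        "Object.freeze(VAR)",
    ]
    missing = [op for op in important_operations if op not in operations_db]
    # If something is missing, the corpus or the coverage feedback is not good enough yet
    assert not missing, "operations missing from database: %r" % missing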
My general approach was to first develop a very basic fuzzer and get all components working. Then, subsequently improve all the components. This is exactly where I am right now. Everything is implemented, but important components still need a lot of improvements or fine-tuning. The following components are currently implemented in the fuzzer:
Generator:
Mutator:
In this case I can't inject the callback because the array is not assigned to a variable. Moreover, in a possible callback I can't modify the length of the array because I can't access the array. It's therefore important to rewrite the code before I start to inject callbacks.
And in the next step I must assign the objects to variables, so that I can access them within callbacks:
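The concrete code is shown as images in the post; as a stand-in, here is a simplified illustration of the idea, with the JavaScript snippets kept as Python strings (the variable naming and the toJSON callback are just examples, not the fuzzer's exact output):

# Original testcase: the array is an anonymous argument, so no callback can reach it.
original = "JSON.stringify([1, 2, 3]);"

# Step 1: rewrite the code so the object is assigned to a variable first.
rewritten = (
    "var var_1_ = [1, 2, 3];\n"
    "JSON.stringify(var_1_);"
)

# Step 2: now a callback can be injected which accesses (and e.g. shrinks) the array.
with_callback = (
    "var var_1_ = [1, 2, 3];\n"
    "var_1_.toJSON = function() {\n"
    "    var_1_.length = 0;   // side effect during JSON.stringify()\n"
    "    return 1;\n"
    "};\n"
    "JSON.stringify(var_1_);"
)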
The CALLSTACK-HASH function traverses the C++ callstack backwards and calculates a hash of it. Since the callstack should contain the JSON.stringify() function, its hash should differ from the hashes of other callstacks which are, for example, triggered via a valueOf() function in an addition. If the template corpus already contains an entry for the hash, the new entry is skipped.
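A simplified sketch of the deduplication side (assuming the instrumented engine already reports the C++ callstack, for example as a list of return addresses; the function and variable names here are placeholders):

import hashlib

template_corpus = {}   # callstack hash -> template testcase

def callstack_hash(frames):
    # Hash the reported frames from the outermost to the innermost one ("backwards")
    h = hashlib.sha1()
    for return_address in reversed(frames):
        h.update(return_address.to_bytes(8, "little"))
    return h.hexdigest()

def maybe_add_template(frames, template):
    key = callstack_hash(frames)
    if key in template_corpus:
        return False               # hash already known (e.g., valueOf() in an addition) -> skip
    template_corpus[key] = template
    return True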
I'm also often intentionally parsing strings manually instead of using a regex, because in my experience the regex approach is very slow. I'm therefore often using complex manual string-parsing code which is error-prone, but a lot faster.
In the following paragraphs I want to give an example of why profiling the code of a fuzzer is important.
Late one night, I planned to implement a mutation strategy which replaces one variable in a code line with another one if both have the same data type. I thought that this mutation is important because I'm combining multiple input testcases by appending or inserting one testcase into the other. Moreover, when adding variable operations, the operation often uses additional other variables. It's therefore important to exchange variables from two different testcases (or operations) to really combine and mix the testcases (e.g., code lines from testcase1 should access variables from testcase2 and vice versa).
My tired self (it was really late at night) implemented this operation using the following algorithm:
- Iterate over all variables in the final (combined) testcase (Iteration variable: variable_name1)
- Iterate a second time over all available variables (Iteration variable: variable_name2)
- Iterate over all data types and line numbers of variable_name1 (a variable can have different data types in the same line; e.g.: when the code line is in a loop)
- Iterate over all data types and line numbers of variable_name2
- If the line numbers and data types match, then store variable_name1, variable_name2 and the line number as possible candidates
- At the end select a random entry from all candidates
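In code, this slow version looks roughly like this (a simplified sketch; in reality the state is a more complex object, here it is just a dict which maps a variable name to a list of (line_number, data_type) entries):

import random

def find_replacement_candidates_slow(variables):
    candidates = []
    for variable_name1, entries1 in variables.items():           # loop 1
        for variable_name2, entries2 in variables.items():       # loop 2
            if variable_name1 == variable_name2:
                continue
            for line_number1, data_type1 in entries1:             # loop 3
                for line_number2, data_type2 in entries2:         # loop 4
                    if line_number1 == line_number2 and data_type1 == data_type2:
                        candidates.append((variable_name1, variable_name2, line_number1))
    return random.choice(candidates) if candidates else None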
Well, some days later I was confused why my fuzzer was suddenly so slow.
Using my profile_performance_of_fuzzer.sh
script (which internally uses cProfile), I got the following output:
The script starts the fuzzer in a generate-only mode which just generates testcases as fast as possible without actually executing them, and it measures the execution time of each function. The above output shows that the total runtime was 969 seconds and that 729 of these seconds were spent in this one function. This obviously indicates a problem because I had something like 20 mutation strategies at that time and this one consumed most of the time. Moreover, I could only generate ~60 testcases per second, and when I removed this mutation strategy, the fuzzer was back at 240 generated testcases per second.
I then rewrote the algorithm to:
- Iterate over all variables in the final (combined) testcase (Iteration variable: variable_name1)
- Iterate over all data types and line numbers of variable_name1
- Use the data type and line number as a dictionary key and store the variable name in a list.
- At the end, iterate over all dictionary entries. If the list contains 2 or more entries, then store them as possible candidates.
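The same logic as a single pass, again as a sketch with the same simplified data structure:

import random
from collections import defaultdict

def find_replacement_candidates_fast(variables):
    groups = defaultdict(list)          # (line_number, data_type) -> list of variable names
    for variable_name, entries in variables.items():
        for line_number, data_type in entries:
            groups[(line_number, data_type)].append(variable_name)

    candidates = []
    for (line_number, _), names in groups.items():
        if len(names) >= 2:             # at least two variables share line number and data type
            name1, name2 = random.sample(names, 2)
            candidates.append((name1, name2, line_number))
    return random.choice(candidates) if candidates else None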
Both algorithms implement the same operation and produce the same output; however, the first algorithm uses four nested loops and is therefore extremely slow. The new algorithm resulted in 190 generated testcases per second, which is a lot better, but still not 240 testcases per second:
A short note on the other entries in the image:
When I insert a new line into the testcase (e.g., when I add a variable operation), the function state_update_all_line_numbers() gets called. It must update all line numbers of all variables (e.g., which variable has which data type in which line), which is obviously a time-consuming task. This was already described in one of the previous paragraphs and will be changed soon.
The "list.append" entry shows up because I apply mutations on lists, which obviously results in list operations. For the future, I plan to maybe switch to another data type, e.g., a mutable string.
The "re.findall" entry also shows that regex operations are very slow: I tried to avoid them as much as possible and it's still the function with the 6th longest runtime.
I already knew that the two other functions, get_number_token_occurences_all() and get_index_of_next_symbol_not_within_string(), are slow. I use these functions to parse the testcase. For example, when parsing a for-loop whose body starts with a { block, I need to find the corresponding } closing symbol (e.g., to move the for-loop around), and I must ignore } symbols within strings or } symbols which belong to other, deeper blocks. Here is another example: if I want to move a line around and the line is "var_1_=1;}", then I can't perform this operation because the line contains a control-flow character (}). On the other hand, a line like "var_1_='xx}xx'" is allowed, and a line like "if(…) { var_1_=1;}" is also allowed. The parsing function is pretty complex, with a lot of deeply nested if-conditions. I basically must handle all possible corner cases, and since I perform the operations on a corpus of regression/test cases, my input files trigger really all kinds of edge cases. Because of the long runtime of this parsing function, I re-implemented it in C code.
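To give an idea of what this parsing involves, here is a heavily simplified sketch of the brace-matching part (the real function handles many more corner cases, such as template literals and comments, and was later ported to C):

def get_index_of_matching_closing_brace(content, open_brace_index):
    depth = 0
    in_string = None                     # the currently open quote character, if any
    i = open_brace_index
    while i < len(content):
        c = content[i]
        if in_string is not None:
            if c == "\\":
                i += 2                   # skip escaped characters inside the string
                continue
            if c == in_string:
                in_string = None         # string literal ends
        elif c in ("'", '"', "`"):
            in_string = c                # string literal starts; ignore braces inside it
        elif c == "{":
            depth += 1
        elif c == "}":
            depth -= 1
            if depth == 0:
                return i                 # the matching closing brace of the block
        i += 1
    return -1                            # unbalanced input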
Here is another performance measure result:
I modified the fuzzer to add 5 new variables in every iteration and removed all other mutations. Then get_index_of_next_symbol_not_within_string() becomes the bottleneck. The reason is that I recalculate the positions of all control-flow characters (e.g., "{" and "}") when I add a new variable, because I need this information to derive the scope of the variable (to fill the state object correctly; I currently don't have access to the "flow" as described before, so I need to calculate it).
When I measured the performance with this fuzzer configuration, I achieved a speed of 130 testcases per second. I then re-implemented the same function in C code and called the C implementation from Python. Initially, the result of the C implementation differed because C handles Unicode characters differently: in my C code I'm counting bytes, and if the testcase contains, for example, a 3-byte Unicode character, it counts 3 positions. In Python this is just one character, and therefore the returned offset didn't match.
My stupid first solution was to just replace all Unicode characters with a space in Python and
then call the C implementation:
content_without_unicode = ''.join([i if ord(i) < 128 else ' ' for i in content])
The code worked, but the speed was just 140 testcases per second because the above list comprehension also consumes a lot of time. I therefore removed it and instead used the C function mbrlen(), which calculates the number of bytes a multi-byte character consumes. This pure C implementation then had a generation speed of 555 testcases per second, which is more than enough.
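Calling such a C helper from Python could, for example, be done via ctypes. The following is a hedged sketch: the library name, the signature and the argument types are my own assumptions for illustration, not the fuzzer's real interface.

import ctypes

lib = ctypes.CDLL("./libfast_parsing.so")    # hypothetical shared library with the C helpers
lib.get_index_of_next_symbol_not_within_string.restype = ctypes.c_long
lib.get_index_of_next_symbol_not_within_string.argtypes = [
    ctypes.c_char_p,    # testcase content as UTF-8 bytes; the C side handles multi-byte characters (e.g. via mbrlen())
    ctypes.c_long,      # start offset
    ctypes.c_char,      # the symbol to search for, e.g. b'}'
]

def find_symbol(content, start_offset, symbol):
    return lib.get_index_of_next_symbol_not_within_string(
        content.encode("utf-8"), start_offset, symbol.encode("ascii"))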
Originally, when creating 50 000 testcases, I spent 272 seconds in this function. With the new C implementation I can perform the same task in just 5 seconds, which is ~50 times faster, and it only took me ~1 hour to port the code to C.
Here is one last example (of my bad code):
The above figure also shows that get_line_number_of_offset() is a new bottleneck (with the used fuzzer configuration). The purpose of this function is to translate an offset in the testcase to the corresponding line number. I was already using an "optimized" version of the function at the time, but the above output showed that it should be improved further. My first, very slow version was:
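(The exact code is shown in a screenshot; the following is only a simplified sketch of such a slow approach, not the exact implementation.)

def get_line_number_of_offset(content, offset):
    current_offset = 0
    for line_number, line in enumerate(content.split("\n")):
        current_offset += len(line) + 1        # +1 for the newline character
        if offset < current_offset:
            return line_number
    return -1

The whole testcase is split and walked for every single query, which adds up quickly when the function is called thousands of times per generated testcase.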
When generating 50 000 testcases, the overhead of this function was 108 seconds.
My
rewritten, faster Python implementation was:
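(Again only a simplified sketch of the idea rather than the exact code: one plausible way to get such a speed-up in pure Python is to let CPython's C string routines do the counting.)

def get_line_number_of_offset(content, offset):
    # count the newline characters before the offset; str.count() runs in C and is
    # therefore much faster than iterating over the lines in Python
    return content.count("\n", 0, offset)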
This faster Python version reduced the execution time to 31 seconds. After implementing the function in C, the execution time dropped to 7 seconds, which is ~15 times faster than the original code.
My key takeaways:
- Don't blindly try to improve functions which you think are the bottleneck. Instead, always profile your code first and improve the code which really is the bottleneck.
- First use cProfile to get a general overview of functions with slow performance. Then use line_profiler to understand the bottleneck inside the function in depth (see the sketch after this list).
- Instead of developing the fuzzer late at night at 1-2 am, I switched to going to bed earlier and developing for 1-2 hours in the morning with a fresh mind.
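For reference, here is a minimal sketch of such a generate-only profiling run (the generation function is a placeholder; the real fuzzer obviously applies its mutation pipeline there):

import cProfile
import pstats

def create_next_fuzzed_testcase():
    # placeholder: the real fuzzer would pick a corpus entry and apply mutations here
    return "var var_1_ = 1;\n"

def generate_testcases_only(number_of_testcases):
    for _ in range(number_of_testcases):
        create_next_fuzzed_testcase()   # testcases are generated but never executed in v8

profiler = cProfile.Profile()
profiler.enable()
generate_testcases_only(50000)
profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)

# For a single suspicious function, line_profiler shows the per-line costs:
#   decorate the function with @profile and run: kernprof -l -v fuzzer.py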
Obtaining a good JavaScript corpus
At the start I mentioned that I fuzzed for just one week on one core for my master's thesis. Well, that's not 100% correct: I had previously used computation resources to create an initial corpus of unique JavaScript testcases.
At the beginning, I used the AWS cloud for that – as
a student I got something like 200 $ of free credits which I used to create the
corpus.
I also used GCP (Google Cloud Platform) because I got the Fuzzilli Research Program credits for GCP. If you sign up at GCP, you also get ~250 € of free credits. I used 136 € of these credits during the development/improvement of my fuzzer before I started the Fuzzilli Research Program.
Abraham Lincoln once said: "Give me six hours to chop down a tree, and I will spend the first four sharpening the axe." Applied to my fuzzing project, "sharpening the axe" would be the time I spent on creating a good input corpus. I spent ~6 months creating my input corpus. The implementation of the mutation strategies, on the other hand, only took me several days.
Another interesting fact is that coverage feedback is not required all the time during fuzzing. Extracting coverage feedback makes fuzzing slow. For example, when I achieve 200 executions per second without coverage feedback (with a testcase which just executes one operation), I get approximately 60 executions per second when I enable coverage feedback. When I enable additional flags to get better coverage feedback (v8_enable_builtins_profiling=true), I only achieve 20-30 executions per second. As a side note: I'm performing in-memory fuzzing, which means I typically restart the engine only after several hundred executions. When I restart the engine after every execution, the speed drops to 0.5-2 executions per second. I do this during testcase analysis and deterministic fuzzing. Therefore, the overall speed is sometimes even below 20-30 executions per second during fuzzing. During fuzzing I'm also operating on bigger testcases, which means the execution speed was typically between 6-20 exec/sec/core.
I only need the full coverage feedback (which is slow) while creating a good input corpus. I created this input corpus from publicly available regression and unit tests and from corpus files created by other fuzzers, and later applied my own fuzzing on them. As soon as the input corpus is good enough, I can switch to fuzzing without the coverage feedback. At that point, I want to find "combinations" of the identified testcases/operations, and code- or edge-coverage feedback doesn't help here. Therefore, fuzzing at full speed can be used.
This also means that I can use different
mutation-strategies in the different phases. In the first phase while improving
the corpus, I mainly use mutation strategies which likely find testcases with
new coverage. In the second phase (when
coverage feedback is disabled), I mainly try to create testcases based on the
identified structures from different vulnerability classes or in general prefer
mutations which likely lead to vulnerabilities (like the insertion of minus
zero values).
Fuzzing numbers before I started the project
I can’t write too much about my AWS fuzzing
numbers because I didn’t measure them.
On GCP I used the free credits (136 €). At that time, I was using default systems (non-preemptible systems, which are more expensive, with 2 cores per machine, optimized for CPU computations). I started to measure fuzzing numbers in January 2021, but I was already fuzzing in December 2020.
In January 2021 I:
- performed approximately 170 million fuzzing executions against v8
- I used 2628 CPU hours (that’s approximately 14 days of fuzzing on 8 cores; 8 cores were the limit with the free credits)
- That's ~18 executions/second/core. This is a little bit slower than I initially mentioned because I restarted the v8 engine more frequently (in total I had 4 429 050 engine restarts). Typically, I perform more executions in-memory, which leads to a faster fuzzing speed.
- In these 14 days of fuzzing, I found 54 different crashing testcases. These 54 crashes can be traced back to 12 unique bugs in the v8 code. However, most of them were related to the special instrumentation flag (v8_enable_builtins_profiling=true) which I used. I therefore consider none of these bugs exploitable because they can only be triggered in a debug version of v8, and I didn't report most of them. You can find an example of such a bug in CRBUG 1170776.
- 80.79% of the generated testcases were valid testcases which didn't throw an exception. 1.12% of the testcases resulted in a timeout. However, in this fuzzing phase I was applying only one mutation per testcase. Testcases were not merged and multiple mutations were not applied.
- I applied just one mutation at a time to better measure the mutations. This allows me to compare mutations with each other, which helps to decide which strategies I need to improve first and how frequently each technique should be applied. For example, techniques which often lead to hangs and which never contributed to crashes or new coverage should not be invoked too often.
Here are some mutation statistics:
- I expected that the mutation strategy with the lowest success rate would be "insert a random operation". At that time, I didn't have a database of operations; I just extracted all function names of all data types and called a function with random arguments (the number and types of the arguments were also not correct, just random). This resulted in a 67.46% success rate. However, the mutation "duplicate a code line" resulted in only a 64.58% success rate and was therefore my worst strategy. This technique also resulted in a hang in 4.37% of the iterations. Hangs (aka timeouts) are the worst thing that can happen because they require that the engine be restarted, which is very time-consuming. I could therefore greatly improve the performance of my fuzzer by improving this one mutation strategy.
- Another interesting result: one of my fuzzing strategies wraps a number, string, or variable within a function invocation. If it's a number, and the number is passed as an argument to the function and the function returns the unmodified number, this results in a hang in 2.08% of cases. However, in all other cases (e.g., if it's a string instead of a number, or if the number is hardcoded in the function's return statement and not passed as an argument), only ~0.5% of the testcases result in a hang. This raises the question: why? How can even wrapping a value in such a way lead to a hang? And why more frequently in this case? For example, this one case resulted in 18 000 engine restarts. If I had fixed this case before starting fuzzing, I could have performed an additional ~3 million executions (without coverage feedback) instead of wasting the time on engine restarts.
- Another strategy with a similarly high number of hangs is "modify number", and it's obvious why this often leads to hangs: the strategy replaces a number with a random other number. If it replaces numbers in a loop header, it can easily lead to endless loops and therefore to hangs. I therefore need to take care of such cases in my mutation implementation.
- Using the measurements, I also detected several bugs in my code. For example, one mutation resulted in exceptions 100% of the time. The mutation added a call to "%ArrayBufferDetached(array)". However, the function requires an array buffer as argument and not an array, and therefore it always resulted in an exception. By the way, that's a good example of why manually adding these operations is error-prone. If I had instead used a database of operations (as mentioned above), this flaw would not have occurred. I manually implemented some operations which I often saw in vulnerability PoCs to increase the likelihood of these operations. I later added the automatically generated database of variable operations in March 2021.
- The mutation strategy with the highest success rate was "materialize value". In 97.02% of cases it resulted in a testcase without an exception, it detected 2 crashes, and it found several testcases with new behavior.
Here are some statistics on the testcases:
- First, I looked for testcases which often result in a hang – because they heavily affect the fuzzer speed. I therefore sorted the tagging output file to display the testcases which had the most hangs. One testcase which resulted in a lot of hangs was:
- If a mutation modifies one of the numbers to a bigger value, it immediately results in a very long runtime, and therefore the testcase has a lot of hangs. One possibility to fix the mutation strategy is to not insert too-big numbers in such testcases. However, a better solution is an algorithm which detects which testcases frequently result in hangs; the fuzzer should then skip ("permanently disable") these testcases (see the sketch after this list). A similar idea was also implemented for testcases which often throw exceptions.
- Here is another code line from such an often-hanging-testcase:
- It's obvious that modifying a number in this code can immediately lead to a timeout (endless loop). Detecting this condition within the mutation strategy (to skip "number modification" mutations) is hard, and therefore it's better to detect which testcases very often lead to hangs and then skip them. The question is only: should I detect this generically or per mutation strategy? Should all mutations be stopped on the testcase, or just specific ones? With the above testcase, only the "modify number" mutations must be skipped. Currently, I permanently disable these testcases, but I will maybe improve this later.
- My corpus contained 9787 testcases at that time. If I had just skipped the 3 testcases with the most hangs and fuzzed the other 9784 files, I could have avoided 42 665 engine restarts. Let's assume an engine restart takes 1 second; then I could have performed 170.8 instead of 170 million executions. If I had skipped all testcases which resulted in a hang in more than 15% of cases (that's 50 testcases), I could have performed 174 instead of 170 million executions. If I had skipped all testcases which resulted in a hang in more than 5% of cases (that's 735 testcases), I could have performed 189 instead of 170 million executions.
- On the other hand: if I start to improve the "replace/modify number" mutation and can adjust the logic to create 50% fewer timeouts (e.g., by not setting numbers in loop headers to too-high values), then I can perform 185 instead of 170 million executions. So it may make more sense to skip only the testcases which very often result in hangs (let's say more than 30% of the time) and focus my time on improving the mutation strategies.
- Here is another testcase (excerpt) which resulted in an exception in 93.85% of cases when a "replace number" mutation was applied:
- Typically, the result of var_1_ after the loop is 42999999 and therefore the exception is not thrown. However, when the fuzzer starts to mutate just one of the shown numbers, the result immediately changes and the throw line gets executed. To solve this problem, the testcase minimizer should have removed the throw code line because it's never executed; however, it didn't, because the code is for some reason required to trigger the unique behavior. A solution is therefore to try to replace all throw code lines within the minimizer/standardizer or to wrap them in try-catch.
- Here is another problematic testcase:
When the fuzzer starts to modify the string, it immediately leads to an exception because JSON.parse() throws an exception if the argument has an incorrect format. There are several JavaScript functions which throw an exception if the passed arguments are wrong. These functions must be handled specially by the mutation engine.
- Another example:
If the fuzzer modifies the string in the 2nd code line it immediately leads to an exception. There are a lot of other such examples.
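Here is a sketch of the bookkeeping that such a "permanently disable" mechanism needs. The 30% threshold corresponds to the number discussed above; the rest (names, the minimum-executions guard) is my own simplified illustration.

from collections import defaultdict

HANG_DISABLE_THRESHOLD = 0.30    # disable testcases which hang in more than 30% of executions
MIN_EXECUTIONS = 50              # don't judge a testcase before it was mutated a few times

executions = defaultdict(int)
hangs = defaultdict(int)
disabled_testcases = set()

def report_execution_result(testcase_name, was_hang):
    executions[testcase_name] += 1
    if was_hang:
        hangs[testcase_name] += 1
    if executions[testcase_name] >= MIN_EXECUTIONS:
        hang_rate = hangs[testcase_name] / executions[testcase_name]
        if hang_rate > HANG_DISABLE_THRESHOLD:
            disabled_testcases.add(testcase_name)   # skip this testcase in future iterations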
Importing a Corpus from Fuzzilli
In February 2021, Samuel (the developer of Fuzzilli) shared a recently generated corpus with me, which he created in a fuzzing session between February 1 and 8. With my coverage feedback implementation (with v8_enable_builtins_profiling=true and with testcases which throw exceptions being ignored), his corpus achieved 23.3593% coverage and contained 21 895 testcases.
My corpus contained 9 312 testcases at that time and resulted in 27.728% coverage; however, it doesn't really make sense to compare these numbers. Samuel's corpus was completely created by his fuzzer, whereas my corpus was created from regression tests and testcases from browser test suites. Moreover, my corpus was improved by fuzzing a specific v8 version, while Samuel fuzzed another v8 version. It's likely that Samuel's corpus therefore performs better on that other v8 version.
After importing the files from Samuel my corpus contained 13 111 testcases with a coverage of 29.2974%.
Developing statistics (April 2021)
In April 2021 I implemented the operations-databases. Here are the stats of a short fuzzing session from April 2021:
- performed executions: 8 940 320
- Execution speed: 16.87 exec / sec / core
- Total runtime: 6 days 3 hours 11 minutes 28 seconds
- Total new corpus entries: 994
- Total crashes: 57
- Total success: 7 050 609 (81.34 %)
- Total exceptions: 1 462 483 (16.87 %)
- Total timeouts: 154 601 (1.78 %)
- Total engine restarts: 270 023
I performed just one mutation per iteration. This, together with the tagging engine, helped me to detect flaws in the mutation implementations.
First fuzzing session (~20 €)
On 02.06.2021 I started ~10 preemptible VMs. At the start I got 4100 € from Google. 30 € were used for development/testing. For the first fuzzing session I used ~20 €. Here are some statistics:
- performed executions: ~49 million
- Execution speed: 6.34 exec / sec / core
- Total runtime: 89 days 8 hours 45 minutes 48 seconds
- Total new corpus entries: 3 108
- Total crashes: 159
- Total success: 27 938 970 (60.52 %)
- Total exceptions: 15 774 476 (34.17 %)
- Total timeouts: 2 448 084 (5.30 %)
- Total engine restarts: ~3 million
The fuzzing speed was very slow because I used full instrumentation (v8_enable_builtins_profiling=true) together with a lot of mutations and combined testcases. The resulting testcases were therefore pretty big and slow, but I found 3 108 testcases with new behavior.
I fuzzed v8 version 8.7.220.29, which was used by Chrome 87.0.4280.88. I initially thought it would be smart to fuzz the v8 version used by Chrome users because vulnerabilities in it would have the highest impact. However, the downside is that you get a lot of duplicates because you find old bugs which are already fixed in the current v8 version.
Another reason why I decided to fuzz a slightly older v8 version at the beginning: some vulnerabilities were already publicly known in this version, and I could therefore check whether my fuzzer can find these bugs.
Second fuzzing session (~423 €)
On 04.06.2021 I switched to the newer v8 version 9.1.269.28 (from Chrome 91.0.4472.77; the newest version at that time); however, I didn't immediately have time to start fuzzing. On 14.06.2021 I again started a fuzzing session, this time for ~432 €. Every hour I added 6 new fuzzer instances until I had approximately 100 instances (~400 CPU cores distributed over different regions) running. I stopped the fuzzing on 21.06.2021.
I couldn't add more than 6 fuzzer instances per hour because I created a snapshot of my base fuzzer machine, and starting new instances from such a snapshot is limited to 6 per hour in GCE for some reason (but I could have created more snapshots).
- performed executions: 1 046 702 688 (~ one billion executions)
- Execution speed: 6.78 exec / sec / core
- Total runtime: 42 888 CPU hours (4.9 CPU years)
- Total new corpus entries: 1486
- Total crashes: 3775
- Total success: 610 040 565 (58.79 %)
- Total exceptions: 368 028 182 (35.47 %)
- Total timeouts: 59 603 483 (5.74 %)
- Total engine restarts: 62 099 301
The 3 775 crash files can be traced back to 8 different root causes and therefore to 8 different bugs. I give an overview of them at the end of the post.
At the end of this fuzzing session my corpus contained ~18 000 testcases. None of the fuzzing sessions up to this point used a template corpus. I used this corpus to generate a template corpus with 360 000 template files (~2.5 GB file size on disk).
Third fuzzing session (~3500 €)
I planned my last fuzzing session for August 2021 because my credits were available until August 18, 2021. I wanted to finish fuzzing by August 11 because I wanted to have a one-week buffer (e.g., if I forgot to delete something, I wouldn't have to pay for it myself). The problem was that I didn't have a lot of time back then, and extending the credits deadline was not possible. Moreover, I also had some other problems related to GCE. Spending 3500 € in 1-2 weeks on GCE is not that simple: you have quota limitations on a lot of resources. For example:
- You can just spawn 5-6 instances from a snapshot per hour.
- You also have CPU quotas, which means you can only start 72 CPU cores (per region) at the same time. Samuel told me at the beginning that I should file a request to increase my quota limit. However, I was using two different Google accounts. For some reason, all GCE pages were shown for account1 (which has the credits) and only the GCE quotas page was loaded with the 2nd account - for whatever reason! The page doesn't have a visual clue that it is loaded with a different account, and with this account it just showed an empty page without any quotas. I therefore thought that the quotas had been removed from my account... Only when I tried to start more instances did I see that I still had quotas applied. Well, I thought that I would just file a request and ask GCE to update my quota to something like 3000 CPU cores, and told them that I'm doing a project for Google and must be able to spend ~3500 € in 1-2 weeks. The answer from them was just a simple "no no no, you won't get 3000 CPUs". I then tried to get an 800-1000 CPU quota update for multiple regions (as recommended by Samuel) and that worked out. I was then able to start more instances, but I had to distribute the workload between regions (I used the 5 cheapest regions).
- Well, that worked for some time, until I discovered that IP addresses are also limited. At some point you can't start instances anymore because you don't have any free IP addresses left. My initial plan was to use small fuzzing instances (with 2-4 CPU cores), but then you run out of IP addresses. Even slightly increasing your IP-address quota limit in GCE will be declined, so I had to change my setup to use 16-CPU-core systems instead.
- And another problem I learned the hard way (unrelated to GCE): on Linux it's possible to have enough free disk space but still run out of inodes, for example if you store hundreds of thousands of JavaScript files in a corpus during fuzzing... When you run out of inodes, you can't save files (aka new crashes) anymore...
- Another thing to consider was that downloading results from GCE buckets takes 1-2 days, so I also had to account for these days.
- And the last problem was the cost calculation: if I start X instances and they cost something like 10 000 € per day, I would quickly exceed my credits and would have to pay the rest myself. I therefore tried to start slowly (but at the same time fast enough to be able to spend several thousand euros before the deadline).
Here are the statistics of this session:
- performed executions: 8 604 798 966 (~8 billion)
- Execution speed: 16.67 exec / sec / core
- Total runtime: 5973 days 20 hours 40 minutes 29 seconds
- Total new corpus entries: 0 (coverage feedback was not enabled)
- Total crashes: 15 884
- Total success: 4 570 082 178 (53.11 %)
- Total exceptions: 3 696 324 963 (42.96 %)
- Total timeouts: 338 342 843 (3.93 %)
- Total engine restarts: 343 851 753
Some recommendations and lessons learned:
- Ensure that you request a CPU quota upgrade to 800-1000 CPUs for multiple regions early on.
- Plan your fuzzer for systems with 16 or more cores (you don't have an infinite number of IP addresses available).
- I would recommend not using GCE buckets to store results (or data which must be accessed frequently).
- If you want to fuzz v8, fuzz the latest version (and not the version from the latest Chrome release). I would not use "v8_enable_builtins_profiling=true" anymore because I found a lot of bugs specific to this setting.
- Develop your crash triage scripts before you start fuzzing, otherwise reporting bugs will take very long which increases the chance of duplicates.
- I found most of the bugs pretty early. I think it was not necessary to spend 3000 € on a fuzzing session because I found nearly all bugs within the first 300-500 €. This could also indicate some problems with the search space of my fuzzer.
- Don't try to insert funny Unicode symbols with a JavaScript fuzzer. I thought it would be a good idea to add symbols which change length and so on, but in the end these symbols caused way more problems in my own Python code than anywhere else.
- I still think that the ideas behind my fuzzer are good, but the current implementation is not. A lot of fine-tuning is still missing. Nevertheless, I would recommend doing source code analysis instead of JavaScript fuzzing.
- In the end I only found three new bugs, but I'm still happy that Google gave me the opportunity to run my fuzzer on their infrastructure. Personally, I hope that Google continues to support researchers with funding like this. It helps both sides: Google gets bug reports "for free" and researchers can use a huge infrastructure to test new ideas.
This post is already pretty long, but since the funding requires an analysis of the found vulnerabilities, I provide a short overview here. I'm only describing the "interesting" crashes, otherwise the post would become really long.
It's also important to mention that my goal was to find bugs with a "structure" from a vulnerability class; however, I didn't find such a bug. The simple explanation is that I hadn't finished the implementation of this idea when I started the last fuzzing session.
I'm describing the bugs in the order in which I found them. If you just want to read about the two security-related bugs for which I got a bounty, jump to the last two bugs.
Math.max() missing -0 case:
This bug was a duplicate of CRBUG 1072171. I found it during my master thesis evaluation (before the Google funding) in v8 8.1.307.28. The PoC from the bug tracker does not work in this v8 version; my PoC is slightly different and therefore works in this version. It was rather trivial to find this bug because I just had to perform a single mutation on the regression testcase of this bug. It's therefore a good example of why having an input corpus of regression tests is useful. The bug occurred because the compiler had annotated incorrect return values for Math.max() (-0 was missing).
OOB Array Access:
This bug initially looked promising; however, it can only be triggered if the v8_enable_builtins_profiling configuration is set to true. The PoC leads to an integer overflow in the profiling code which triggers the bug and allows accessing data out-of-bounds (z should be 0 but is 39017943).
Truncation bug in Simplified Lowering:
This bug was a duplicate and had already been fixed by this commit. I found the bug in the 2nd fuzzing session (Chrome 91.0.4472.77; v8 9.1.269.28). In this session I fuzzed the v8 version from 10.05.2021; however, the above-mentioned commit had already fixed the bug on 21.04.2021. So I just re-discovered an already known bug because I was fuzzing a v8 version that was too old, since I had thought it was a good idea to fuzz the v8 version from the current Chrome release. I would therefore recommend to always fuzz the latest v8 code.
It's also interesting that I found this bug 2 324 times and that I triggered it already after a few minutes, because it's really trivial to find. This was kind of surprising to me because this was a v8 version used in a Chrome release. I thought that Google would fuzz a v8 version for some days before shipping it in Google Chrome.
The following code triggers the bug. It's basically enough to perform two multiplications inside loops to trigger compilation.
Bug: 32-bit value in register is not zero-extended:
This one is interesting because I found just a single crash file triggering this bug. However, the bug was not reproducible in the latest v8 version and had therefore likely already been fixed.
Here is the minimized crash file:
In v8 9.1.269.28 it aborts in a debug-build with this error:
Bug: Double Free in tcache 2
I found this crash 18 times. By slightly modifying the PoC, it's possible to trigger a segmentation fault, a double free or a heap metadata corruption (top or !prev). The following PoC triggers, for example, the double free:
The bug had again already been fixed by this commit (because I fuzzed a v8 version that was too old...). I also found some other DCHECK failures in the .maximize() and .minimize() Intl functions, but they were not security related.
Bug in Object.freeze (Segmentation Fault):
This bug was found 405 times and was again not reproducible in the latest v8 version. The bug can be triggered in v8 9.1.269.28 with the following PoC:
By slightly modifying the PoC, it's possible to trigger a segmentation fault or a Stacktrace-message.
DCHECK stack position > stack guard in Isolate::StackOverflow
The PoC triggers the following DCHECK in the StackOverflow function:
Debug check failed: GetCurrentStackPosition() >= stack_guard()->real_climit() - 8 * KB (140729015611600 vs. 140729015656416).
According to the v8 developers the DCHECK is not security related.
v8 Typer CHECK InductionVariablePhiTypeIsPrefixedPoint Fails
This one triggers a CHECK failure in the Typer:
Check failed: visitor.InductionVariablePhiTypeIsPrefixedPoint(induction_var);
According to the developers it's not security related. The PoC is:
To trigger the bug/CHECK, it's basically enough to trigger compilation of a for-loop and to use the "new Number()" syntax instead of plain numbers in the loop-header.
Bug in d8 NormalizePath(): Empty vector is used
This bug is not so interesting because it's a bug in the d8-binary and therefore doesn't affect Chrome.
CHECK Fail: IsStruct_NonInline() / Wrong SFI in eval():
I received 5 000 $ for this bug report. The bug is rather complex and very specific to v8-internal code; I would therefore recommend checking the v8 developers' commits from the bug tracker for an analysis. The code was basically using a wrong code pattern which skipped stack frames. A simplified PoC to trigger the bug is:
Two memory corruptions in BigInt ToStringFormatter():
I submitted both bugs in this bug tracker and got a 5 000 $ bounty for them. The problems occurred when .toString() was invoked on huge BigInt values.
The first PoC is:
eval('100000000000000000000000000000000000000000000'.repeat(20)+'0n').toLocaleString();
Numbers can be converted between different bases, e.g., our decimal system works with base 10 and hex works with base 16. BigInt values are internally stored with base 2^32. The .toString() implementation must therefore convert the base-2^32 value to a decimal value. V8 has a special algorithm which implements this and which contained the bugs. The basic idea is that a value can be split into pieces and each piece can be converted separately (strongly simplified). For demonstration, let's assume a value with base 2^32 which is represented by ABC. Let's further assume that you can convert A, B and C separately (A, B and C are just placeholders for some huge numbers). Let's say A converts to the decimal number 170, B to 003 and C to 701. Then the .toString() output would become 170003701. The problem occurred in the conversion of the B part: B translates in this example to the number 3, but it must be padded on the left side with two zeros because it's stored in the middle of A and C and the zeros are therefore important. The code forgot to write the 0 values and left the memory uninitialized. Since BigInt numbers allow up to 1 million digits (B pieces in the above example), it was possible to leak a lot of memory this way (e.g., map pointers and so on).
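The padding problem is easy to demonstrate with a simplified Python analogue. This is not v8's algorithm, just an illustration of the piecewise conversion idea with decimal chunks of three digits:

def to_decimal_chunked(n, chunk_digits=3):
    base = 10 ** chunk_digits
    chunks = []
    while n > 0:
        n, remainder = divmod(n, base)
        chunks.append(remainder)
    chunks.reverse()
    if not chunks:
        return "0"
    # The most significant chunk is written as-is; every following chunk must be
    # left-padded with zeros ("003" instead of "3"). Forgetting the padding is the
    # conceptual equivalent of the v8 bug (which left the memory uninitialized instead).
    result = str(chunks[0])
    for chunk in chunks[1:]:
        result += str(chunk).zfill(chunk_digits)
    return result

print(to_decimal_chunked(170003701))   # "170003701"; without the padding it would be "1703701"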
You can also turn this bug from an information leak into an out-of-bounds write. The following PoC writes OOB (one zero was added to the string, which triggers a different code path):
eval('1000000000000000000000000000000000000000000000'.repeat(20)+'0n').toLocaleString();
With this PoC, the code writes the '0' filler string out of bounds in this loop. It's hard (but not impossible?) to actually exploit this bug, but I thought that exploiting the info leak is likely easier.
It's also interesting to mention that the code used raw C pointers. Recently, Samuel implemented a v8 sandbox, and attacking raw C pointers would be one of my first ideas to bypass it.
Congratulations, you reached the end of the blog post ;)