tl;dr:
I developed a coverage-guided JavaScript (v8) fuzzer similar to Fuzzilli (but without an intermediate language and developed in Python). I participated with the fuzzer in the Fuzzilli Research Grant Program from Google Project Zero.
Most of my ideas are not fully implemented yet and the fuzzer still lacks a lot of fine-tuning because I didn't have that much time for the project. I nevertheless try to describe my ideas in this blog post.
You can find the source code of the fuzzer here.
The story behind this blog post
In this blog post I want to summarize the
results, ideas, failures, and everything else I learned along my journey to
fuzz JavaScript engines (mainly Chrome’s v8 engine).
Everything started with my master's thesis, which I wrote on browser exploits. Between December 2019 and September 2020, I researched the topic of browser and JavaScript engine exploitation. In the end, I analyzed 56 different browser exploits which had been exploited in the past five years. I re-developed most of these exploits, and my goal was to find similarities which can help fuzzers identify variations of these exploitable bugs. For example, by analyzing the PoCs, it can be seen that most vulnerability classes share a similar structure and that a fuzzer doesn't need to rediscover this structure again and again. Instead, the structure of the vulnerability class can be hardcoded in the fuzzer and only the remaining required code lines need to be fuzzed. This reduces the search space and should therefore make fuzzing more efficient. You can find my master's thesis in the docs folder of the fuzzer project.
To prove (or disprove) my ideas, I developed a fuzzer. I had two options: modify the well-working Fuzzilli fuzzer, or develop a fuzzer from scratch. In the end, I developed a new fuzzer, basically because I thought that by developing a completely new fuzzer I would likely find vulnerabilities different from the ones Fuzzilli finds. In the end, this choice had advantages and disadvantages:
Advantages:
- The fuzzer is different from Fuzzilli and therefore finds different bugs.
- I could design the fuzzer as I like.
- I learned a lot.
- The fuzzer is developed in Python, so maybe other researchers prefer to use a Python fuzzer over a Swift fuzzer (Fuzzilli) because more security researchers are familiar with Python.
Disadvantages:
- Developing everything from scratch was very time-consuming. I greatly underestimated the required time.
- In the end it turned out that Fuzzilli's design is very good. A lot of its design choices make sense in hindsight. For example, Fuzzilli operates on an intermediate language, while my fuzzer applies mutations directly on the JavaScript code. This helps my fuzzer find very "exotic" testcases (but they don't really occur that often...), however, it requires that I parse JavaScript code myself, which is a ton of work to implement (and it's error-prone). Moreover, other fuzzers such as DIE or CodeAlchemist use a third-party library to parse JavaScript. I decided to implement everything myself, which again resulted in a lot of work. Another difference is that I keep a state which describes a testcase, and this means that I must implement every mutation twice: once on the JavaScript code and once on the state (and an error in either of them would de-synchronize the state from the testcase).
I also talked to a lot of researchers who develop browser exploits professionally. Literally everyone told me that I should not develop a fuzzer and should instead focus on code review. However, I had to develop a fuzzer because I couldn't write a master's thesis on "I performed code review and found X bugs". In the end, I think I would have found way more bugs with code review in the same amount of time, so I would also suggest that others do code review instead.
For the evaluation of my master's thesis I ran my fuzzer in a VM on my home computer for one week. Obviously, running a fuzzer on just one core in a VM for just one week is not enough (just to compare numbers: I performed 8 467 311 executions in that week; Samuel, the developer of Fuzzilli, told me that he typically performs at least 1 billion executions before he stops his fuzzer; in the Project Zero supported research, which I explain in the next chapters, I also performed 15 billion executions with my fuzzer). But yeah, I found an exploitable v8 vulnerability in this single week. It was a duplicate (of CRBUG 1072171) and another researcher had already found the same vulnerability before with a modified version of Fuzzilli, but at least my fuzzer found something.
Google Project Zero Fuzzilli Research Program
In October 2020 Google announced the Fuzzilli Research Grant Program – a program which grants researchers up to 5 000 $ in
the form of Google Cloud credits for fuzzing.
Previously, I had run my fuzzer only on a single CPU for just one week. The fuzzer was just a simple PoC and most of my ideas were not implemented at that time. That meant I would have to add a lot of improvements so that running the fuzzer on such a huge infrastructure would pay off. And I had to add them in my spare time (nights/weekends) because in my day job I work as a pentester.
I still decided to apply for the project.
In the end, I got the funding from the research program. I decided to start at the end of February 2021 and I had to finish my research within 6 months.
Unfortunately, I didn't have as much time as I was expecting. Because of Corona, a lot of companies decided to do pentests, which meant our company had a lot of work and I often worked late into the night. Moreover, I unexpectedly had to relocate to another apartment, and my dad got sick, so I had to take care of him and help my mother. So, on many weekends/nights I couldn't work on the fuzzer. I had a lot more ideas but just didn't find the time to implement all of them. That's also the reason why the code isn't really that nice at the moment...
The design of the fuzzer
Currently, the fuzzer is a purely mutation-based fuzzer; however, I plan to change this soon (more on this in the generator subchapter).
Most JavaScript fuzzers don't use coverage feedback or a corpus of input files. To my knowledge, Fuzzilli was the first famous (public) fuzzer which applied the idea of feedback-based fuzzing to JavaScript engines. In my opinion that was a good idea, although coverage-based fuzzing works better when fuzzing file formats. The reason is that JavaScript fuzzing (or browser fuzzing in general) especially requires reaching different "memory states" and not only different "code paths". For example, in file format fuzzing, corner cases can often be found by just taking different if-branches in the code. Coverage feedback helps to find testcases which trigger different if-branches and is therefore very effective in file format fuzzing.
Applied to JavaScript fuzzing, the coverage feedback helps to detect “unique JavaScript code” like unique ways to call a function. However, to trigger bugs in JavaScript engines, it’s required to find the correct combination of these code lines and the correct structure of a testcase. The coverage-feedback doesn’t help with that. What we would need is some other form of execution feedback, but we currently don’t have that.
My idea was therefore to create a very good corpus of JavaScript testcases – each triggering unique coverage in the JavaScript engine. Then, I extract all JavaScript operations, function calls, numbers, strings and so on out of the corpus and use them during fuzzing. Moreover, the fuzzer applies the mutations (like inserting an identified operation on a variable) on testcases from the corpus.
Traditional fuzzers implement these operations manually in the fuzzer. They have a list of available methods (or a grammar), maybe together with the number and types of the arguments. Then they add a random method call at a random code line with random values for the specified argument data types.
However, it doesn't make sense to add these calls randomly. Take, for example, the Math.sign() function. It doesn't make sense to add Math.sign(123), then Math.sign(12515), then Math.sign(1251) and so on. It makes a lot more sense to only add calls to a new testcase which each result in unique behavior, like Math.sign(1), then Math.sign(-1), then Math.sign(0) and then Math.sign(-0). And maybe sometimes, but very rarely, add completely random calls.
Moreover, it's very likely that these traditional fuzzers miss some variable operations because they implement them manually. For example, "variable1 instanceof Array" would also be an operation which can trigger side effects, and it would have to be added manually (because of the unique syntax of this operation: it is not a function invocation).
The advantage of automatically creating a database / catalog of operations from a corpus is that important operations are not forgotten.
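To make this more concrete, here is a minimal sketch of how such an operations database could be built from a corpus directory. This is a simplified illustration, not the fuzzer's actual implementation; the file layout, the regex heuristics and the VAR placeholder convention are assumptions for this example.

import os
import re

def extract_operations(corpus_dir):
    # Collect candidate operations (function calls, instanceof checks, ...) from all
    # corpus files so the mutator can later insert them into other testcases.
    operations = set()
    for filename in os.listdir(corpus_dir):
        path = os.path.join(corpus_dir, filename)
        with open(path, encoding="utf-8", errors="ignore") as f:
            for line in f:
                line = line.strip().rstrip(";")
                # Rough heuristic: keep lines which look like a call or an instanceof check
                if re.search(r"\w+\s*\(.*\)", line) or " instanceof " in line:
                    # Generalize concrete variable names (var_1_, var_2_, ...) to a placeholder
                    operations.add(re.sub(r"\bvar_\d+_\b", "VAR", line))
    return operations

During fuzzing, the mutator would pick a random entry, replace the VAR placeholder with a variable of a matching data type, and insert the resulting line into the current testcase.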
One could argue that these operations can also be extracted from a set of testcases (without a corpus minimized using coverage feedback). One could download all regression and test files from a JavaScript engine and create the database based on these testcases. However, then a lot of useless operations would be stored in the database and fuzzing would not be efficient. The database could contain 100 Math.sign() calls with different arguments and just one of them is a call to Math.sign(-0); it would therefore be unlikely that the fuzzer adds this one operation because there are so many others. If, on the other hand, coverage feedback is used, then the database would just contain 4 Math.sign() calls: Math.sign(1), Math.sign(-1), Math.sign(0) and Math.sign(-0), because they have unique coverage (maybe plus some other variations like object arguments, but I want to keep the discussion simple). The likelihood that the fuzzer adds an important operation would therefore be 1:4 instead of 1:100.
Another advantage is that I can create tests for my database/catalog. I can analyze previously exploited vulnerabilities and manually extract important operations from them. Then, I can create a test suite which checks whether these operations are stored in my generated operations database or not. If not, I know that my code or corpus is currently not good enough, I'm maybe missing other important operations, and I can debug why this occurs (is the feedback not good enough? Does my fuzzer not generate testcases with these calls? And so on).
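As an illustration, such a test can be as simple as the following sketch (the concrete operation strings and the set-of-strings database format are assumptions, not the real test suite):

def test_database_contains_important_operations(operations_db):
    # Operations manually extracted from previously exploited PoCs (examples only)
    important_operations = [
        "Math.sign(-0)",
        "VAR instanceof Array",
        "Object.freeze(VAR)",
    ]
    missing = [op for op in important_operations if op not in operations_db]
    # If something is missing, the corpus or the coverage feedback is not good enough yet
    assert not missing, "operations missing from database: %r" % missing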
My general approach was to first develop a very basic fuzzer and get all components working. Then, subsequently improve all the components. This is exactly where I am right now. Everything is implemented, but important components still need a lot of improvements or fine-tuning. The following components are currently implemented in the fuzzer:
Generator:
Mutator:
In this case I can't inject the callback because the array is not assigned to a variable. Moreover, in a possible callback I can't modify the length of the array because I can't access the array. It's therefore important to rewrite the code before I start to inject callbacks.
And in the next step I must assign the objects to variables, so that I can access them within callbacks:
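The concrete code is shown as images in the post; as a stand-in, here is a simplified illustration of the idea, with the JavaScript snippets kept as Python strings (the variable naming and the toJSON callback are just examples, not the fuzzer's exact output):

# Original testcase: the array is an anonymous argument, so no callback can reach it.
original = "JSON.stringify([1, 2, 3]);"

# Step 1: rewrite the code so the object is assigned to a variable first.
rewritten = (
    "var var_1_ = [1, 2, 3];\n"
    "JSON.stringify(var_1_);"
)

# Step 2: now a callback can be injected which accesses (and e.g. shrinks) the array.
with_callback = (
    "var var_1_ = [1, 2, 3];\n"
    "var_1_.toJSON = function() {\n"
    "    var_1_.length = 0;   // side effect during JSON.stringify()\n"
    "    return 1;\n"
    "};\n"
    "JSON.stringify(var_1_);"
)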
The CALLSTACK-HASH function traverses the C++ callstack backwards and calculates a hash of it. Since the callstack should contain the JSON.stringify() function, its hash should differ from the hashes of other callstacks which are, for example, triggered via a valueOf() function in an addition. If the template corpus already contains an entry for the hash, the new entry is skipped.
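A simplified sketch of the deduplication side (assuming the instrumented engine already reports the C++ callstack, for example as a list of return addresses; the function and variable names here are placeholders):

import hashlib

template_corpus = {}   # callstack hash -> template testcase

def callstack_hash(frames):
    # Hash the reported frames from the outermost to the innermost one ("backwards")
    h = hashlib.sha1()
    for return_address in reversed(frames):
        h.update(return_address.to_bytes(8, "little"))
    return h.hexdigest()

def maybe_add_template(frames, template):
    key = callstack_hash(frames)
    if key in template_corpus:
        return False               # hash already known (e.g., valueOf() in an addition) -> skip
    template_corpus[key] = template
    return True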
I'm also often intentionally parsing strings manually instead of using a regex, because in my experience the regex approach is very slow. I'm therefore often using complex manual string-parsing code which is error-prone, but a lot faster.
In the following paragraphs I want to give an example of why profiling the code of a fuzzer is important.
Late one night, I planned to implement a mutation strategy which replaces one variable in a code line with another one if both have the same data type. I thought that this mutation is important because I'm combining multiple input testcases by appending or inserting one testcase into the other. Moreover, when adding variable operations, the operation often uses additional other variables. It's therefore important to exchange variables from two different testcases (or operations) to really combine and mix the testcases (e.g., code lines from testcase1 should access variables from testcase2 and vice versa).
My tired self (it was really late at night) implemented this operation using the following algorithm:
- Iterate over all variables in the final (combined) testcase (Iteration variable: variable_name1)
- Iterate a second time over all available variables (Iteration variable: variable_name2)
- Iterate over all data types and line numbers of variable_name1 (a variable can have different data types in the same line; e.g.: when the code line is in a loop)
- Iterate over all data types and line numbers of variable_name2
- If the line numbers and data types match, then store variable_name1, variable_name2 and the line number as possible candidates
- At the end select a random entry from all candidates
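In code, this slow version looks roughly like this (a simplified sketch; in reality the state is a more complex object, here it is just a dict which maps a variable name to a list of (line_number, data_type) entries):

import random

def find_replacement_candidates_slow(variables):
    candidates = []
    for variable_name1, entries1 in variables.items():           # loop 1
        for variable_name2, entries2 in variables.items():       # loop 2
            if variable_name1 == variable_name2:
                continue
            for line_number1, data_type1 in entries1:             # loop 3
                for line_number2, data_type2 in entries2:         # loop 4
                    if line_number1 == line_number2 and data_type1 == data_type2:
                        candidates.append((variable_name1, variable_name2, line_number1))
    return random.choice(candidates) if candidates else None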
Well, some days later I was confused why my fuzzer was suddenly so slow.
Using my profile_performance_of_fuzzer.sh
script (which internally uses cProfile), I got the following output:
The script starts the fuzzer in a generate-only mode which just generates testcases as fast as possible without actually executing them, and it measures the execution time of each function. The above output shows that the total runtime was 969 seconds and that 729 of these seconds were spent in this one function. This obviously indicates a problem because I had something like 20 mutation strategies at that time and this one consumed most of the time. Moreover, I could only generate ~60 testcases per second, and when I removed this mutation strategy, the fuzzer was back at 240 generated testcases per second.
I then rewrote the algorithm to:
- Iterate over all variables in the final (combined) testcase (Iteration variable: variable_name1)
- Iterate over all data types and line numbers of variable_name1
- Use the data type and line number as a dictionary key and store the variable name in a list.
- At the end, iterate over all dictionary entries. If the list contains 2 or more entries, then store them as possible candidates.
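The same logic as a single pass, again as a sketch with the same simplified data structure:

import random
from collections import defaultdict

def find_replacement_candidates_fast(variables):
    groups = defaultdict(list)          # (line_number, data_type) -> list of variable names
    for variable_name, entries in variables.items():
        for line_number, data_type in entries:
            groups[(line_number, data_type)].append(variable_name)

    candidates = []
    for (line_number, _), names in groups.items():
        if len(names) >= 2:             # at least two variables share line number and data type
            name1, name2 = random.sample(names, 2)
            candidates.append((name1, name2, line_number))
    return random.choice(candidates) if candidates else None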
Both algorithms implement the same operation and produce the same output; however, the first algorithm uses four nested loops and is therefore extremely slow. The new algorithm resulted in 190 generated testcases per second, which is a lot better, but still not 240 testcases per second:
A short note on the other entries in the image:
When I insert a new line into the testcase (e.g., when I add a variable operation), the function state_update_all_line_numbers() gets called. It must update all line numbers of all variables (e.g., which variable has which data type in which line), which is obviously a time-consuming task. This was already described in one of the previous paragraphs and will be changed soon.
The "list.append" entry shows up because I apply mutations on lists, which obviously results in list operations. For the future, I plan to maybe switch to another data type, e.g., a mutable string.
The "re.findall" entry also shows that regex operations are very slow: I tried to avoid them as much as possible and it's still the function with the 6th longest runtime.
I already knew that the two other functions, get_number_token_occurences_all() and get_index_of_next_symbol_not_within_string(), are slow. I use these functions to parse the testcase. For example, when parsing a for-loop whose body starts with a { block, I need to find the corresponding } closing symbol (e.g., to move the for-loop around), and I must ignore } symbols within strings or } symbols which belong to other, deeper blocks. Here is another example: if I want to move a line around and the line is "var_1_=1;}", then I can't perform this operation because the line contains a control-flow character (}). On the other hand, a line like "var_1_='xx}xx'" is allowed, and a line like "if(…) { var_1_=1;}" is also allowed. The parsing function is pretty complex, with a lot of deeply nested if-conditions. I basically must handle all possible corner cases, and since I perform the operations on a corpus of regression/test cases, my input files trigger really all kinds of edge cases. Because of the long runtime of this parsing function, I re-implemented it in C code.
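To give an idea of what this parsing involves, here is a heavily simplified sketch of the brace-matching part (the real function handles many more corner cases, such as template literals and comments, and was later ported to C):

def get_index_of_matching_closing_brace(content, open_brace_index):
    depth = 0
    in_string = None                     # the currently open quote character, if any
    i = open_brace_index
    while i < len(content):
        c = content[i]
        if in_string is not None:
            if c == "\\":
                i += 2                   # skip escaped characters inside the string
                continue
            if c == in_string:
                in_string = None         # string literal ends
        elif c in ("'", '"', "`"):
            in_string = c                # string literal starts; ignore braces inside it
        elif c == "{":
            depth += 1
        elif c == "}":
            depth -= 1
            if depth == 0:
                return i                 # the matching closing brace of the block
        i += 1
    return -1                            # unbalanced input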
Here is another performance measure result:
I modified the fuzzer to add 5 new variables in every iteration and removed all other mutations. Then get_index_of_next_symbol_not_within_string() becomes the bottleneck. The reason is that I recalculate the positions of all control-flow characters (e.g., "{" and "}") when I add a new variable, because I need this information to derive the scope of the variable (to fill the state object correctly; I currently don't have access to the "flow" as described before, so I need to calculate it).
When I measured the performance with this fuzzer configuration, I achieved a speed of 130 testcases per second. I then re-implemented the same function in C code and called the C implementation from Python. Initially, the result of the C implementation differed because C handles Unicode characters differently: in my C code I'm counting bytes, and if the testcase contains, for example, a 3-byte Unicode character, it counts 3 positions. In Python this is just one character, and therefore the returned offset didn't match.
My stupid first solution was to just replace all Unicode characters with a space in Python and
then call the C implementation:
content_without_unicode = ''.join([i if ord(i) < 128 else ' ' for i in content])
The code worked, but the speed was just 140 testcases per second because the above list comprehension also consumes a lot of time. I therefore removed it and instead used the C function mbrlen(), which calculates the number of bytes a multi-byte character consumes. This pure C implementation then had a generation speed of 555 testcases per second, which is more than enough.
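Calling such a C helper from Python could, for example, be done via ctypes. The following is a hedged sketch: the library name, the signature and the argument types are my own assumptions for illustration, not the fuzzer's real interface.

import ctypes

lib = ctypes.CDLL("./libfast_parsing.so")    # hypothetical shared library with the C helpers
lib.get_index_of_next_symbol_not_within_string.restype = ctypes.c_long
lib.get_index_of_next_symbol_not_within_string.argtypes = [
    ctypes.c_char_p,    # testcase content as UTF-8 bytes; the C side handles multi-byte characters (e.g. via mbrlen())
    ctypes.c_long,      # start offset
    ctypes.c_char,      # the symbol to search for, e.g. b'}'
]

def find_symbol(content, start_offset, symbol):
    return lib.get_index_of_next_symbol_not_within_string(
        content.encode("utf-8"), start_offset, symbol.encode("ascii"))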
Originally, when creating 50 000 testcases, I spent 272 seconds in this function. With the new C implementation I can perform the same task in just 5 seconds, which is ~50 times faster, and it only took me ~1 hour to port the code to C.
Here is one last example (of my bad code):
The above figure also shows that get_line_number_of_offset() is a new bottleneck (with the used fuzzer configuration). The purpose of this function is to translate an offset in the testcase to the corresponding line number. I was already using an "optimized" version of the function at the time, but the above output showed that it should be improved further. My first, very slow version was:
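(The exact code is shown in a screenshot; the following is only a simplified sketch of such a slow approach, not the exact implementation.)

def get_line_number_of_offset(content, offset):
    current_offset = 0
    for line_number, line in enumerate(content.split("\n")):
        current_offset += len(line) + 1        # +1 for the newline character
        if offset < current_offset:
            return line_number
    return -1

The whole testcase is split and walked for every single query, which adds up quickly when the function is called thousands of times per generated testcase.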
When generating 50 000 testcases, the overhead of this function was 108 seconds.
My
rewritten, faster Python implementation was:
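(Again only a simplified sketch of the idea rather than the exact code: one plausible way to get such a speed-up in pure Python is to let CPython's C string routines do the counting.)

def get_line_number_of_offset(content, offset):
    # count the newline characters before the offset; str.count() runs in C and is
    # therefore much faster than iterating over the lines in Python
    return content.count("\n", 0, offset)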
This faster Python version reduced the execution time to 31 seconds. After implementing the function in C, the execution time dropped to 7 seconds, which is ~15 times faster than the original code.
My key takeaways:
- Don't blindly try to improve functions which you think are the bottleneck. Instead, always profile your code first and improve the code which really is the bottleneck.
- First use cProfile to get a general overview of functions with slow performance. Then use line_profiler to understand the bottleneck inside the function in depth (see the sketch after this list).
- Instead of developing the fuzzer late at night at 1-2 am, I switched to going to bed earlier and developing for 1-2 hours in the morning with a fresh mind.
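For reference, here is a minimal sketch of such a generate-only profiling run (the generation function is a placeholder; the real fuzzer obviously applies its mutation pipeline there):

import cProfile
import pstats

def create_next_fuzzed_testcase():
    # placeholder: the real fuzzer would pick a corpus entry and apply mutations here
    return "var var_1_ = 1;\n"

def generate_testcases_only(number_of_testcases):
    for _ in range(number_of_testcases):
        create_next_fuzzed_testcase()   # testcases are generated but never executed in v8

profiler = cProfile.Profile()
profiler.enable()
generate_testcases_only(50000)
profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)

# For a single suspicious function, line_profiler shows the per-line costs:
#   decorate the function with @profile and run: kernprof -l -v fuzzer.py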
Obtaining a good JavaScript corpus
At the start I mentioned that I fuzzed for just one week on one core for my master's thesis. Well, that's not 100% correct: I had previously used computation resources to create an initial corpus of unique JavaScript testcases.
At the beginning, I used the AWS cloud for that – as
a student I got something like 200 $ of free credits which I used to create the
corpus.
I also used GCP (Google Cloud Platform) because I got the Fuzzilli Research Program credits for GCP. If you sign up at GCP, you also get ~250 € of free credits. I used 136 € of these credits during the development/improvement of my fuzzer before I started the Fuzzilli Research Program.
Abraham Lincoln once said: "Give me six hours to chop down a tree, and I will spend the first four sharpening the axe." Applied to my fuzzing project, "sharpening the axe" would be the time I spent on creating a good input corpus. I spent ~6 months creating my input corpus. The implementation of the mutation strategies, on the other hand, only took me several days.
Another interesting fact is that coverage feedback is not required all the time during fuzzing. Extracting coverage feedback makes fuzzing slow. For example, when I achieve 200 executions per second without coverage feedback (with a testcase which just executes one operation), I get approximately 60 executions per second when I enable coverage feedback. When I enable additional flags to get better coverage feedback (v8_enable_builtins_profiling=true), I only achieve 20-30 executions per second. As a side note: I'm performing in-memory fuzzing, which means I typically restart the engine only after several hundred executions. When I restart the engine after every execution, the speed drops to 0.5-2 executions per second. I do this during testcase analysis and deterministic fuzzing. Therefore, the overall speed is sometimes even below 20-30 executions per second during fuzzing. During fuzzing I'm also operating on bigger testcases, which means the execution speed was typically between 6-20 exec/sec/core.
I only need the full coverage feedback (which is slow) while creating a good input corpus. I created this input corpus from publicly available regression and unit tests and from corpus files created by other fuzzers, and later applied my own fuzzing on them. As soon as the input corpus is good enough, I can switch to fuzzing without the coverage feedback. At that point, I want to find "combinations" of the identified testcases/operations, and code- or edge-coverage feedback doesn't help here. Therefore, fuzzing at full speed can be used.
This also means that I can use different
mutation-strategies in the different phases. In the first phase while improving
the corpus, I mainly use mutation strategies which likely find testcases with
new coverage. In the second phase (when
coverage feedback is disabled), I mainly try to create testcases based on the
identified structures from different vulnerability classes or in general prefer
mutations which likely lead to vulnerabilities (like the insertion of minus
zero values).
Fuzzing numbers before I started the project
I can’t write too much about my AWS fuzzing
numbers because I didn’t measure them.
On GCP I used the free credits (136 €). At that time, I was using default systems (non-preemptible systems, which are more expensive, with 2 cores per machine, optimized for CPU computations). I started to measure fuzzing numbers in January 2021, but I was already fuzzing in December 2020.
In January 2021 I:
- performed approximately 170 million fuzzing executions against v8
- I used 2628 CPU hours (that’s approximately 14 days of fuzzing on 8 cores; 8 cores were the limit with the free credits)
- That's ~18 executions/second/core. This is a little bit slower than I initially mentioned because I restarted the v8 engine more frequently (in total I had 4 429 050 engine restarts). Typically, I perform more executions in-memory, which leads to a faster fuzzing speed.
- In these 14 days of fuzzing, I found 54 different crashing testcases. These 54 crashes can be traced back to 12 unique bugs in the v8 code. However, most of them were related to the special instrumentation flag (v8_enable_builtins_profiling=true) which I used. I therefore consider none of these bugs exploitable because they can only be triggered in a debug version of v8, and I didn't report most of them. You can find an example of such a bug in CRBUG 1170776.
- 80.79% of the generated testcases were valid testcases which didn't throw an exception. 1.12% of the testcases resulted in a timeout. However, in this fuzzing phase I was applying only one mutation per testcase. Testcases were not merged and multiple mutations were not applied.
- I applied just one mutation at a time to better measure the mutations. This allows me to compare mutations with each other, which helps to decide which strategies I need to improve first and how frequently each technique should be applied. For example, techniques which often lead to hangs and which never contributed to crashes or new coverage should not be invoked too often.
Here are some mutation statistics:
- I expected that the mutation strategy with the lowest success rate would be "insert a random operation". At that time, I didn't have a database of operations; I just extracted all function names of all data types and called a function with random arguments (the number and types of the arguments were also not correct, just random). This resulted in a 67.46% success rate. However, the mutation "duplicate a code line" resulted in only a 64.58% success rate and was therefore my worst strategy. This technique also resulted in a hang in 4.37% of the iterations. Hangs (aka timeouts) are the worst thing that can happen because they require that the engine be restarted, which is very time-consuming. I could therefore greatly improve the performance of my fuzzer by improving this one mutation strategy.
- Another interesting result: one of my fuzzing strategies wraps a number, string, or variable within a function invocation. If it's a number, and the number is passed as an argument to the function and the function returns the unmodified number, this results in a hang in 2.08% of cases. However, in all other cases (e.g., if it's a string instead of a number, or if the number is hardcoded in the function's return statement and not passed as an argument), only ~0.5% of the testcases result in a hang. This raises the question: why? How can even wrapping a value in such a way lead to a hang? And why more frequently in this case? For example, this one case resulted in 18 000 engine restarts. If I had fixed this case before starting fuzzing, I could have performed an additional ~3 million executions (without coverage feedback) instead of wasting the time on engine restarts.
- Another strategy with a similarly high number of hangs is "modify number", and it's obvious why this often leads to hangs: the strategy replaces a number with a random other number. If it replaces numbers in a loop header, it can easily lead to endless loops and therefore to hangs. I therefore need to take care of such cases in my mutation implementation.
- Using the measurements, I also detected several bugs in my code. For example, one mutation resulted in exceptions 100% of the time. The mutation added a call to "%ArrayBufferDetached(array)". However, the function requires an array buffer as argument and not an array, and therefore it always resulted in an exception. By the way, that's a good example of why manually adding these operations is error-prone. If I had instead used a database of operations (as mentioned above), this flaw would not have occurred. I manually implemented some operations which I often saw in vulnerability PoCs to increase the likelihood of these operations. I later added the automatically generated database of variable operations in March 2021.
- The mutation strategy with the highest success rate was "materialize value". In 97.02% of cases it resulted in a testcase without an exception, it detected 2 crashes, and it found several testcases with new behavior.
Here are some statistics on the testcases:
- First, I looked for testcases which often result in a hang – because they heavily affect the fuzzer speed. I therefore sorted the tagging output file to display the testcases which had the most hangs. One testcase which resulted in a lot of hangs was:
- If a mutation modifies one of the numbers to a bigger value, it immediately results in a very long runtime, and therefore the testcase has a lot of hangs. One possibility to fix the mutation strategy is to not insert too-big numbers in such testcases. However, a better solution is an algorithm which detects which testcases frequently result in hangs; the fuzzer should then skip ("permanently disable") these testcases (see the sketch after this list). A similar idea was also implemented for testcases which often throw exceptions.
- Here is another code line from such an often-hanging-testcase:
- It's obvious that modifying a number in this code can immediately lead to a timeout (endless loop). Detecting this condition within the mutation strategy (to skip "number modification" mutations) is hard, and therefore it's better to detect which testcases very often lead to hangs and then skip them. The question is only: should I detect this generically or per mutation strategy? Should all mutations be stopped on the testcase, or just specific ones? With the above testcase, only the "modify number" mutations must be skipped. Currently, I permanently disable these testcases, but I will maybe improve this later.
- My corpus contained 9787 testcases at that time. If I had just skipped the 3 testcases with the most hangs and fuzzed the other 9784 files, I could have avoided 42 665 engine restarts. Let's assume an engine restart takes 1 second; then I could have performed 170.8 instead of 170 million executions. If I had skipped all testcases which resulted in a hang in more than 15% of cases (that's 50 testcases), I could have performed 174 instead of 170 million executions. If I had skipped all testcases which resulted in a hang in more than 5% of cases (that's 735 testcases), I could have performed 189 instead of 170 million executions.
- On the other hand: if I start to improve the "replace/modify number" mutation and can adjust the logic to create 50% fewer timeouts (e.g., by not setting numbers in loop headers to too-high values), then I can perform 185 instead of 170 million executions. So it may make more sense to skip only the testcases which very often result in hangs (let's say more than 30% of the time) and focus my time on improving the mutation strategies.
- Here is another testcase (excerpt) which resulted in an exception in 93.85% of cases when a "replace number" mutation was applied:
- Typically, the result of var_1_ after the loop is 42999999 and therefore the exception is not thrown. However, when the fuzzer starts to mutate just one of the shown numbers, the result immediately changes and the throw line gets executed. To solve this problem, the testcase minimizer should have removed the throw code line because it's never executed; however, it didn't, because the code is for some reason required to trigger the unique behavior. A solution is therefore to try to replace all throw code lines within the minimizer/standardizer or to wrap them in try-catch.
- Here is another problematic testcase:
When the fuzzer starts to modify the string, it immediately leads to an exception because JSON.parse() throws an exception if the argument has an incorrect format. There are several JavaScript functions which throw an exception if the passed arguments are wrong. These functions must be handled specially by the mutation engine.
- Another example:
If the fuzzer modifies the string in the 2nd code line it immediately leads to an exception. There are a lot of other such examples.
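Here is a sketch of the bookkeeping that such a "permanently disable" mechanism needs. The 30% threshold corresponds to the number discussed above; the rest (names, the minimum-executions guard) is my own simplified illustration.

from collections import defaultdict

HANG_DISABLE_THRESHOLD = 0.30    # disable testcases which hang in more than 30% of executions
MIN_EXECUTIONS = 50              # don't judge a testcase before it was mutated a few times

executions = defaultdict(int)
hangs = defaultdict(int)
disabled_testcases = set()

def report_execution_result(testcase_name, was_hang):
    executions[testcase_name] += 1
    if was_hang:
        hangs[testcase_name] += 1
    if executions[testcase_name] >= MIN_EXECUTIONS:
        hang_rate = hangs[testcase_name] / executions[testcase_name]
        if hang_rate > HANG_DISABLE_THRESHOLD:
            disabled_testcases.add(testcase_name)   # skip this testcase in future iterations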
Importing a Corpus from Fuzzilli
In February 2021, Samuel (the developer of Fuzzilli) shared a recently generated corpus with me, which he created in a fuzzing session between February 1 and 8. With my coverage feedback implementation (with v8_enable_builtins_profiling=true and with testcases which throw exceptions being ignored), his corpus achieved 23.3593% coverage and contained 21 895 testcases.
My corpus contained 9 312 testcases at that time and resulted in 27.728% coverage; however, it doesn't really make sense to compare these numbers. Samuel's corpus was completely created by his fuzzer, whereas my corpus was created from regression tests and testcases from browser test suites. Moreover, my corpus was improved by fuzzing a specific v8 version, while Samuel fuzzed another v8 version. It's likely that Samuel's corpus therefore performs better on that other v8 version.
After importing the files from Samuel my corpus contained 13 111 testcases with a coverage of 29.2974%.
Developing statistics (April 2021)
In April 2021 I implemented the operations-databases. Here are the stats of a short fuzzing session from April 2021:
- performed executions: 8 940 320
- Execution speed: 16.87 exec / sec / core
- Total runtime: 6 days 3 hours 11 minutes 28 seconds
- Total new corpus entries: 994
- Total crashes: 57
- Total success: 7 050 609 (81.34 %)
- Total exceptions: 1 462 483 (16.87 %)
- Total timeouts: 154 601 (1.78 %)
- Total engine restarts: 270 023
I performed just one mutation per iteration. This, together with the tagging engine, helped me to detect flaws in the mutation implementations.
First fuzzing session (~20 €)
On 02.06.2021 I started ~10 preemptible VMs. At the start I got 4100 € from Google. 30 € were used for development/testing. For the first fuzzing session I used ~20 €. Here are some statistics:
- performed executions: ~49 million
- Execution speed: 6.34 exec / sec / core
- Total runtime: 89 days 8 hours 45 minutes 48 seconds
- Total new corpus entries: 3 108
- Total crashes: 159
- Total success: 27 938 970 (60.52 %)
- Total exceptions: 15 774 476 (34.17 %)
- Total timeouts: 2 448 084 (5.30 %)
- Total engine restarts: ~3 million
The fuzzing speed was very slow because I used full instrumentation (v8_enable_builtins_profiling=true) together with a lot of mutations and combined testcases. The resulting testcases were therefore pretty big and slow, but I found 3 108 testcases with new behavior.
I fuzzed v8 version 8.7.220.29, which was used by Chrome 87.0.4280.88. I initially thought it would be smart to fuzz the v8 version used by Chrome users because vulnerabilities in it would have the highest impact. However, the downside is that you get a lot of duplicates because you find old bugs which are already fixed in the current v8 version.
Another reason why I decided to fuzz a slightly older v8 version at the beginning: some vulnerabilities were already publicly known in this version, and I could therefore check whether my fuzzer can find these bugs.
Second fuzzing session (~423 €)
On 04.06.2021 I switched to the newer v8 version 9.1.269.28 (from Chrome 91.0.4472.77; the newest version at that time); however, I didn't immediately have time to start fuzzing. On 14.06.2021 I again started a fuzzing session, this time for ~432 €. Every hour I added 6 new fuzzer instances until I had approximately 100 instances (~400 CPU cores distributed over different regions) running. I stopped the fuzzing on 21.06.2021.
I couldn't add more than 6 fuzzer instances per hour because I created a snapshot of my base fuzzer machine, and starting new instances from such a snapshot is limited to 6 per hour in GCE for some reason (but I could have created more snapshots).
- performed executions: 1 046 702 688 (~ one billion executions)
- Execution speed: 6.78 exec / sec / core
- Total runtime: 42 888 CPU hours (4.9 CPU years)
- Total new corpus entries: 1486
- Total crashes: 3775
- Total success: 610 040 565 (58.79 %)
- Total exceptions: 368 028 182 (35.47 %)
- Total timeouts: 59 603 483 (5.74 %)
- Total engine restarts: 62 099 301
The 3 775 crash files can be traced back to 8 different root causes and therefore to 8 different bugs. I give an overview of them at the end of the post.
At the end of this fuzzing session my corpus contained ~18 000 testcases. None of the fuzzing sessions up to this point used a template corpus. I used this corpus to generate a template corpus with 360 000 template files (~2.5 GB file size on disk).
Third fuzzing session (~3500 €)
I planned my last fuzzing session for August 2021 because my credits were available until August 18, 2021. I wanted to finish fuzzing by August 11 because I wanted to have a one-week buffer (e.g., if I forgot to delete something, I wouldn't have to pay for it myself). The problem was that I didn't have a lot of time back then, and extending the credits deadline was not possible. Moreover, I also had some other problems related to GCE. Spending 3500 € in 1-2 weeks on GCE is not that simple: you have quota limitations on a lot of resources. For example:
- You can just spawn 5-6 instances from a snapshot per hour.
- You also have CPU quotas, which means you can only start 72 CPU cores (per region) at the same time. Samuel told me at the beginning that I should file a request to increase my quota limit. However, I was using two different Google accounts. For some reason, all GCE pages were shown for account1 (which has the credits) and only the GCE quotas page was loaded with the 2nd account - for whatever reason! The page doesn't have a visual clue that it is loaded with a different account, and with this account it just showed an empty page without any quotas. I therefore thought that the quotas had been removed from my account... Only when I tried to start more instances did I see that I still had quotas applied. Well, I thought that I would just file a request and ask GCE to update my quota to something like 3000 CPU cores, and told them that I'm doing a project for Google and must be able to spend ~3500 € in 1-2 weeks. The answer from them was just a simple "no no no, you won't get 3000 CPUs". I then tried to get an 800-1000 CPU quota update for multiple regions (as recommended by Samuel) and that worked out. I was then able to start more instances, but I had to distribute the workload between regions (I used the 5 cheapest regions).
- Well, that worked for some time, until I discovered that IP addresses are also limited. At some point you can't start instances anymore because you don't have any free IP addresses left. My initial plan was to use small fuzzing instances (with 2-4 CPU cores), but then you run out of IP addresses. Even slightly increasing your IP-address quota limit in GCE will be declined, so I had to change my setup to use 16-CPU-core systems instead.
- And another problem I learned the hard way (unrelated to GCE): on Linux it's possible to have enough free disk space but still run out of inodes, for example if you store hundreds of thousands of JavaScript files in a corpus during fuzzing... When you run out of inodes, you can't save files (aka new crashes) anymore...
- Another thing to consider was that downloading results from GCE buckets takes 1-2 days, so I also had to account for these days.
- And the last problem was the cost calculation: if I start X instances and they cost something like 10 000 € per day, I would quickly exceed my credits and would have to pay the rest myself. I therefore tried to start slowly (but at the same time fast enough to be able to spend several thousand euros before the deadline).
Here are the statistics of this session:
- performed executions: 8 604 798 966 (~8 billion)
- Execution speed: 16.67 exec / sec / core
- Total runtime: 5973 days 20 hours 40 minutes 29 seconds
- Total new corpus entries: 0 (coverage feedback was not enabled)
- Total crashes: 15 884
- Total success: 4 570 082 178 (53.11 %)
- Total exceptions: 3 696 324 963 (42.96 %)
- Total timeouts: 338 342 843 (3.93 %)
- Total engine restarts: 343 851 753
Some recommendations and lessons learned:
- Ensure that you request a CPU quota upgrade to 800-1000 CPUs for multiple regions early on.
- Plan your fuzzer for systems with 16 or more cores (you don't have an infinite number of IP addresses available).
- I would recommend not using GCE buckets to store results (or data which must be accessed frequently).
- If you want to fuzz v8, fuzz the latest version (and not the version from the latest Chrome release). I would not use "v8_enable_builtins_profiling=true" anymore because I found a lot of bugs specific to this setting.
- Develop your crash triage scripts before you start fuzzing, otherwise reporting bugs will take very long which increases the chance of duplicates.
- I found most of the bugs pretty early. I think it was not necessary to spend 3000 € on a fuzzing session because I found nearly all bugs within the first 300-500 €. This could also indicate some problems with the search space of my fuzzer.
- Don't try to insert funny Unicode symbols with a JavaScript fuzzer. I thought it would be a good idea to add symbols which change length and so on, but in the end these symbols caused way more problems in my own Python code than anywhere else.
- I still think that the ideas behind my fuzzer are good, but the current implementation is not. A lot of fine-tuning is still missing. Nevertheless, I would recommend doing source code analysis instead of JavaScript fuzzing.
- In the end I only found three new bugs, but I'm still happy that Google gave me the opportunity to run my fuzzer on their infrastructure. Personally, I hope that Google continues to support researchers with funding like this. It helps both sides: Google gets bug reports "for free" and researchers can use a huge infrastructure to test new ideas.
This post is already pretty long, but since the funding requires an analysis of the found vulnerabilities, I provide a short overview here. I'm only describing the "interesting" crashes, otherwise the post would become really long.
It's also important to mention that my goal was to find bugs with a "structure" from a vulnerability class; however, I didn't find such a bug. The simple explanation is that I hadn't finished the implementation of this idea when I started the last fuzzing session.
I'm describing the bugs in the order in which I found them. If you just want to read about the two security-related bugs for which I got a bounty, jump to the last two bugs.
Math.max() missing -0 case:
This bug was a duplicate of CRBUG 1072171. I found it during my master thesis evaluation (before the Google funding) in v8 8.1.307.28. The PoC from the bug tracker does not work in this v8 version; my PoC is slightly different and therefore works in this version. It was rather trivial to find this bug because I just had to perform a single mutation on the regression testcase of this bug. It's therefore a good example of why having an input corpus of regression tests is useful. The bug occurred because the compiler had annotated incorrect return values for Math.max() (-0 was missing).
OOB Array Access:
This bug initially looked promising; however, it can only be triggered if the v8_enable_builtins_profiling configuration is set to true. The PoC leads to an integer overflow in the profiling code which triggers the bug and allows accessing data out-of-bounds (z should be 0 but is 39017943).
Truncation bug in Simplified Lowering:
This bug was a duplicate and had already been fixed by this commit. I found the bug in the 2nd fuzzing session (Chrome 91.0.4472.77; v8 9.1.269.28). In this session I fuzzed the v8 version from 10.05.2021; however, the above-mentioned commit had already fixed the bug on 21.04.2021. So I just re-discovered an already known bug because I was fuzzing a v8 version that was too old, since I had thought it was a good idea to fuzz the v8 version from the current Chrome release. I would therefore recommend to always fuzz the latest v8 code.
It's also interesting that I found this bug 2 324 times and that I triggered it already after a few minutes, because it's really trivial to find. This was kind of surprising to me because this was a v8 version used in a Chrome release. I thought that Google would fuzz a v8 version for some days before shipping it in Google Chrome.
The following code triggers the bug. It's basically enough to perform two multiplications inside loops to trigger compilation.
Bug: 32-bit value in register is not zero-extended:
This one is interesting because I found just a single crash file triggering this bug. However, the bug was not reproducible in the latest v8 version and had therefore likely already been fixed.
Here is the minimized crash file:
In v8 9.1.269.28 it aborts in a debug-build with this error:
Bug: Double Free in tcache 2
I found this crash 18 times. By slightly modifying the PoC, it's possible to trigger a segmentation fault, a double free or a heap metadata corruption (top or !prev). The following PoC triggers, for example, the double free:
The bug had again already been fixed by this commit (because I fuzzed a v8 version that was too old...). I also found some other DCHECK failures in the .maximize() and .minimize() Intl functions, but they were not security related.
Bug in Object.freeze (Segmentation Fault):
This bug was found 405 times and was again not reproducible in the latest v8 version. The bug can be triggered in v8 9.1.269.28 with the following PoC:
By slightly modifying the PoC, it's possible to trigger a segmentation fault or a Stacktrace-message.
DCHECK stack position > stack guard in Isolate::StackOverflow
The PoC triggers the following DCHECK in the StackOverflow function:
Debug check failed: GetCurrentStackPosition() >= stack_guard()->real_climit() - 8 * KB (140729015611600 vs. 140729015656416).
According to the v8 developers the DCHECK is not security related.
v8 Typer CHECK InductionVariablePhiTypeIsPrefixedPoint Fails
This one triggers a CHECK failure in the Typer:
Check failed: visitor.InductionVariablePhiTypeIsPrefixedPoint(induction_var);
According to the developers it's not security related. The PoC is:
To trigger the bug/CHECK, it's basically enough to trigger compilation of a for-loop and to use the "new Number()" syntax instead of plain numbers in the loop-header.
Bug in d8 NormalizePath(): Empty vector is used
This bug is not so interesting because it's a bug in the d8-binary and therefore doesn't affect Chrome.
CHECK Fail: IsStruct_NonInline() / Wrong SFI in eval():
I received 5 000 $ for this bug report. The bug is rather complex and very specific to v8-internal code; I would therefore recommend checking the v8 developers' commits from the bug tracker for an analysis. The code was basically using a wrong code pattern which skipped stack frames. A simplified PoC to trigger the bug is:
Two memory corruptions in BigInt ToStringFormatter():
I submitted both bugs in this bug tracker and got a 5 000 $ bounty for them. The problems occurred when .toString() was invoked on huge BigInt values.
The first PoC is:
eval('100000000000000000000000000000000000000000000'.repeat(20)+'0n').toLocaleString();
Numbers can be converted between different bases, e.g., our decimal system works with base 10 and hex works with base 16. BigInt values are internally stored with base 2^32. The .toString() implementation must therefore convert the base-2^32 value to a decimal value. V8 has a special algorithm which implements this and which contained the bugs. The basic idea is that a value can be split into pieces and each piece can be converted separately (strongly simplified). For demonstration, let's assume a value with base 2^32 which is represented by ABC. Let's further assume that you can convert A, B and C separately (A, B and C are just placeholders for some huge numbers). Let's say A converts to the decimal number 170, B to 003 and C to 701. Then the .toString() output would become 170003701. The problem occurred in the conversion of the B part: B translates in this example to the number 3, but it must be padded on the left side with two zeros because it's stored in the middle of A and C and the zeros are therefore important. The code forgot to write the 0 values and left the memory uninitialized. Since BigInt numbers allow up to 1 million digits (B pieces in the above example), it was possible to leak a lot of memory this way (e.g., map pointers and so on).
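The padding problem is easy to demonstrate with a simplified Python analogue. This is not v8's algorithm, just an illustration of the piecewise conversion idea with decimal chunks of three digits:

def to_decimal_chunked(n, chunk_digits=3):
    base = 10 ** chunk_digits
    chunks = []
    while n > 0:
        n, remainder = divmod(n, base)
        chunks.append(remainder)
    chunks.reverse()
    if not chunks:
        return "0"
    # The most significant chunk is written as-is; every following chunk must be
    # left-padded with zeros ("003" instead of "3"). Forgetting the padding is the
    # conceptual equivalent of the v8 bug (which left the memory uninitialized instead).
    result = str(chunks[0])
    for chunk in chunks[1:]:
        result += str(chunk).zfill(chunk_digits)
    return result

print(to_decimal_chunked(170003701))   # "170003701"; without the padding it would be "1703701"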
You can also turn this bug from an information leak into an out-of-bounds write. The following PoC writes OOB (one zero was added to the string, which triggers a different code path):
eval('1000000000000000000000000000000000000000000000'.repeat(20)+'0n').toLocaleString();
With this PoC, the code writes the '0' filler string out of bounds in this loop. It's hard (but not impossible?) to actually exploit this bug, but I thought that exploiting the info leak is likely easier.
It's also interesting to mention that the code used raw C pointers. Recently, Samuel implemented a v8 sandbox, and attacking raw C pointers would be one of my first ideas to bypass it.
Congratulations, you reached the end of the blog post ;)