Transcript for:
Cloudflare Incident: February 18, 2017

February 18, 2017. At Cloudflare HQ, work began winding down this Friday afternoon as the weekend approached. Morale was high and no one was prepared for the disaster that was about to happen. The engineers were excited to go home early and enjoy a beautiful weekend free of any operational issues.

Suddenly, at 4:11 PM Pacific time at the friendly neighborhood Google complex, one of the Googlers working in Project Zero, a security research team, discovered a severe issue with Cloudflare's systems. He immediately reached out through the most sensible channel for something this urgent, and first contact was made minutes later. It was now 4:32 PM, and the alarming details of the report were made clear to Cloudflare, suggesting a possible widespread data leak.

You may have seen Cloudflare's DDoS mitigation service before. It is built on top of their primary product, a content delivery network, or CDN. CDNs came into existence in the 1990s to speed up the delivery of internet content. They're kind of like distribution centers: Amazon isn't just going to have a single warehouse in the middle of the United States that every delivery driver starts from. There are many spread all across the country, and they store, or should I say cache, commonly sold items to minimize delivery time. Similarly, it makes no sense to deliver internet content to all users across the world from a single centralized source. A CDN will have many points of presence across the world, with edge servers that cache content from the origin server. When a user makes a request for a particular website, the request is directed to the nearest edge server, where the content is most likely already cached.

It was here that Cloudflare not only returned the requested website but also cookies, keys, and other sensitive customer data. This is what it would look like. If you knew where to look, plenty of useful information could be extracted from the leaked memory: full HTTPS requests, IP addresses, responses, passwords. And who knows how long this exploit had been out there; bad actors could have already compromised thousands of companies. Cloudflare's monitoring evidently did not self-detect this issue either, as a third party had to identify it and reach out to them.

Data leakage like this can come with hefty consequences: FTC fines, lawsuits, increased audits. But most importantly of all, it degrades customer trust. No customer trust, no customers. No customers, no revenue. No revenue, no Taco Tuesdays. To make matters worse, search engines like Google also regularly index and cache websites, so this leaked data could also be accessed through Google's cache.

4:40 PM. Now this was serious business. Everyone immediately assembled in San Francisco, maybe even with some cross-company action with the Google employees. The engineers noticed in the dashboards that the occurrence of this bug seemed to correlate with usage of the email obfuscation feature, which was also immediately suspect, as it had recently been deployed with a partial migration to a new HTML parser. Either way, every feature that Cloudflare ships comes with a feature flag, and the engineers immediately flipped what they called the global kill, which would prevent all customers from using the feature.
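As a rough illustration of that mechanism, here is a minimal sketch of a feature flag with a global kill; the names and structure are hypothetical, not Cloudflare's actual configuration system. The idea is simply that a single switch, checked on every request, overrides whatever an individual customer has enabled.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical feature flag: a per-customer setting plus a global kill
   that, once flipped, turns the feature off for everyone. */
struct feature_flag {
    const char *name;
    bool        global_kill;
};

static struct feature_flag email_obfuscation = { "email_obfuscation", false };

/* Per-request check: the global kill overrides the customer's own setting. */
static bool feature_enabled(const struct feature_flag *f, bool customer_enabled)
{
    if (f->global_kill)
        return false;
    return customer_enabled;
}

int main(void)
{
    printf("before global kill: %d\n", feature_enabled(&email_obfuscation, true));
    email_obfuscation.global_kill = true;   /* the switch the engineers flipped */
    printf("after global kill:  %d\n", feature_enabled(&email_obfuscation, true));
    return 0;
}
```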
By 5:22 PM PST, about an hour after the initial report, email obfuscation had been disabled worldwide. However, the bug was still occurring. On the other side of the Atlantic, the London team had joined the call. All hands on deck: it was time to spend the Friday night debugging and rethinking life.

8:24 PM PST, four hours in. Another two features were found to be problematic: automatic HTTPS rewrites and server-side excludes. Automatic HTTPS rewrites was shut down immediately with its global kill, but server-side excludes was such an old feature that it predated the practice of deploying with global kills. The engineers were at a crossroads here. They could release a patch for this feature to allow it to be turned off, but that would take some time to implement and deploy. Alternatively, they could spend that time root-causing the issue and deploying a single proper fix. But the root cause was not apparent, so the engineers began working on the global kill for server-side excludes and readied it for deployment.

11:22 PM PST, seven hours in. As the night progressed, the streets outside the San Francisco office grew quieter. With daybreak in London, the engineers there were more than ready to sign off and get some much-needed sleep. The patch to turn off server-side excludes was finally deployed worldwide, but there was still much work to be done. Cached data from search engines still needed to be purged, and without knowing the true root cause, recurrence was still within the realm of possibility.

But what could have caused this? Well, edge servers contain software to perform all kinds of operations on the content they deliver, and this was the clear common denominator among the three aforementioned features: they all parsed and modified the returned HTML content in some way. Email obfuscation would erase any email addresses in the returned webpage if the requester's source IP was deemed suspicious. Server-side excludes is very similar: it can automatically hide content wrapped in a special tag from suspicious source IPs. Automatic HTTPS rewrites would simply rewrite any HTTP links embedded in the returned website to HTTPS. Furthermore, these three features all used the new HTML parser mentioned earlier: cf-html. The engineers, however, found nothing suspicious in the code despite thorough verification.

It wasn't until the next few days that the root cause was made clear.

Now, Cloudflare had originally been using a parser generated with Ragel, and they were looking to migrate to something simpler and more maintainable. It was in this self-described ancient piece of software that the bug took root. Ragel is a parser language that no one knows how to pronounce, which works by defining finite state machines with regular expressions and performing various actions based on the match results. You can think of it like those flow charts, where we start at one state and transfer to different states based on various conditions. For example, here is a machine which matches consecutive numbers and letters. In practice, you can see the Ragel code is embedded within C here using the double percent signs, and it can then compile down to C, C++, Java, and so on. So why Ragel? It's actually fairly readable and concise after a bit of getting used to, and I'd imagine it's very performant.

So the HTML web page consumed by the Ragel parser is represented by a series of data buffers, with each buffer containing a portion of the HTML code. Each time the Ragel parser is invoked to consume a buffer, the user needs to pass in data pointers initialized to the beginning and end of the buffer. Ragel uses p to iterate through the buffer and pe to tell when the buffer has been fully parsed.
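To make that concrete, here is a minimal sketch along the lines of the on-screen example, not anything from Cloudflare's codebase: a machine matching runs of digits followed by lowercase letters, written in Ragel embedded in C between the double-percent markers and driven with the p and pe pointers just described.

```c
/* digits_letters.rl -- a minimal sketch, not Cloudflare's parser.
   Build with something like:  ragel digits_letters.rl -o digits_letters.c  */
#include <stdio.h>
#include <string.h>

%%{
    machine digits_letters;

    # Accept any number of runs of digits followed by lowercase letters,
    # e.g. "12ab345xyz".
    main := ( digit+ lower+ )*;
}%%

%% write data;

int main(int argc, char **argv)
{
    if (argc < 2)
        return 1;

    int  cs;                            /* current state of the machine */
    char *p  = argv[1];                 /* data pointer                 */
    char *pe = p + strlen(argv[1]);     /* data end pointer             */

    %% write init;                      /* set cs to the start state    */
    %% write exec;                      /* run the machine over p..pe   */

    printf("%s\n", cs >= digits_letters_first_final ? "matched" : "did not match");
    return 0;
}
```

Ragel turns the embedded machine into plain C, which is where the performance comes from: the regular expressions are compiled down to a state table or straight-line code rather than being interpreted at runtime.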
In Cloudflare's case, one of the things they wanted to parse was HTML attributes within script tags, such as type or src. Taking a look at the Ragel code, this script_consume_attr machine will try to match this regular expression: attribute characters followed by a space, "/", or ">".

Then we have a few actions. This is an entering action, which is performed when starting the machine; it simply logs that the machine is running. The @ symbol marks a finishing action, which is performed when the machine completes successfully. Here we call fhold, which is equivalent to p-- and will move the pointer back by one. This is likely because the script_tag_parse machine it then jumps to needs to consume the space, "/", or ">" character that the attribute machine has already matched, as those are also part of the tag. There is also a local error action, which is performed when an attribute fails to match: it logs the failure and then recurses, trying to parse the next attribute.

Going back to the success case: after exiting back to script_tag_parse, the parser machines will continue until the end of the buffer is reached. But how do we know we've reached the end of the buffer? Well, if the data pointer p is equal to the data end pointer pe, then we have surely reached the end of the buffer.

So it turns out that something very bad happens if there is an unfinished attribute at the very end of a web page. When this happens, the failure to match occurs while the data pointer p is equal to the data end pointer pe. The parser then re-invokes itself, now at risk of parsing undefined heap memory. Let's see if the buffer end check saves us. Aw man, the pre-increment causes p to skip over pe and never be equal to it. Rookie mistake.

But wait, this is a bug in the old parser, which had been in use for years. Has Cloudflare been leaking data all this time? No. It was actually the migration to the new parser that triggered the issue. Going back to the buffer overrun we were talking about before: if there are more buffers to come, an unfinished tag could just be due to the rest of the element being in the next buffer, so the error action will not be invoked. The error action is only triggered on an unfinished match within the very last buffer, as there is no more data at that point to complete the match. This is why, in the example, the unfinished attribute is at the very end of the page, that is, at the very end of the last possible buffer.

However, the key here is that historically, when only the old parser was used, it would always receive an extra dummy last buffer that had no content. Why? No particular reason, it just did. This meant that for a website that ended with an unfinished tag, the unfinished tag would sit in the second-to-last buffer, and the error action would not be called. Then, since the last buffer was empty, the parser would also not be invoked again. After the new parser was introduced, this behavior changed: the empty last buffer was no longer present in the buffer sequence passed to Ragel, causing the unfinished tag to land in the last buffer and making the overrun possible. Perhaps the new parser cleaned up the empty last buffer before passing data to the old one. This also meant that the bug could only occur when a customer enabled features which, in combination, used both the old and new parsers.
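To make the pointer arithmetic concrete, here is a minimal C sketch of the failure mode. It is not Cloudflare's generated code and the buffer contents are made up; it only shows why an increment followed by an equality-only end-of-buffer check cannot stop a pointer that re-enters the loop already sitting at the end.

```c
#include <stdio.h>

/* A single array stands in for process memory: the first 15 bytes are the
   HTML buffer, ending in an unfinished attribute, and the bytes after it
   model whatever happened to sit next to it on the heap (hypothetical). */
int main(void)
{
    char memory[] = "type=\"text/java" "SECRET-COOKIE";
    char *pe = memory + 15;        /* data end: one past the HTML buffer */
    char *p;

    /* Normal case: p starts inside the buffer, so "++p == pe" eventually
       fires and the scan stops exactly at the end. */
    for (p = memory; ; ) {
        if (++p == pe)
            break;
    }
    printf("normal case stopped at offset %ld\n", (long)(p - memory));

    /* Failure case: the error action re-enters the machine while p is
       already equal to pe (an unfinished attribute in the very last
       buffer).  The first ++p steps past pe, the equality test can never
       fire again, and the scan wanders into the adjacent bytes. */
    p = pe;
    do {
        if (++p == pe)             /* never true once p has passed pe    */
            break;
    } while (*p != '\0');          /* stops only on the neighboring data */
    printf("failure case stopped at offset %ld (buffer ends at 15)\n",
           (long)(p - memory));

    /* A ">=" comparison would have caught the runaway pointer at once. */
    printf("p >= pe? %s\n", p >= pe ? "yes" : "no");
    return 0;
}
```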
So what can we learn from this failure? Well, here we see a classic example of backwards compatibility: no matter how dumb the behavior of something is, if it has been set in stone for a long time and you change it, something is definitely going to break. However, it's not always so easy to maintain backwards compatibility. Obviously, Microsoft can easily choose not to deprecate Windows' ability to run 32-bit programs, but cf-html removing the last buffer, or perhaps more accurately, not adding the extra dummy buffer back for no reason, is something that can easily be overlooked. And it was not just this, but also a bug in the existing code plus a very specific type of input that, in combination, caused the data leak. When you consider even larger systems with dozens of interlocking components, each with millions of possible inputs, it's clear that there will inevitably be bugs in all software.

So what can be done to minimize impact? Cloudflare mentions fuzzing the generated code to search for pointer overruns, as well as building test cases for malformed web pages; a rough sketch of what such a harness could look like is included at the end. There are also various memory management techniques that can reduce impact, and this could likely also have been caught by static code analysis.

Perhaps another thing worth pointing out is best practices. The coding standards for Ragel are not very clear, but from my limited experimentation, I don't think it is possible for Ragel to naturally overrun the buffer. It's possible to under-run the buffer by spamming fhold, but Ragel's default behavior seems to make overrunning impossible: there's no Ragel command to force iteration of the data pointer, and when Ragel naturally iterates the data pointer forward, it always explicitly checks whether it has reached the data end. This points to Cloudflare potentially having gone in and modified the compiled C code rather than the Ragel code itself, something that would obviously not be Ragel best practice.

Two days later, pointer checks to detect memory leaks were rolled out, and three days later, the engineers determined it was safe enough to re-enable the three aforementioned features. Cloudflare then worked with the various search engines to purge their caches of affected websites.

In terms of overall impact, the evidence suggests that it was quite small. There were quite a few conditions that needed to be met for the bug to manifest, and Cloudflare claims there is no evidence of the bug being leveraged for any attacks. We know that 0.6 percent of Cloudflare websites ended with unfinished tags and that the bug occurred more than 18 million times. It is reasonable to say that Cloudflare just got really lucky. In fact, one of the features which could trigger this bug was available as far back as November 2016. Had this exploit fallen into the wrong hands, or had it occurred more recently now that Cloudflare is so much bigger, there may not have been such a happy ending.
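As mentioned above, here is a rough sketch of the fuzzing idea. The scanner being fuzzed is entirely hypothetical, not Cloudflare's code: it just looks for the space, "/", or ">" that ends an attribute using the same increment-then-compare pattern, and retries once on failure the way the error action re-entered the machine. The harness throws randomly truncated attribute-like buffers at it and flags any run where the pointer ends up past the end.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical toy scanner standing in for the generated parser.  It uses
   "++p == pe" as its only end-of-buffer test; hard_end is a safety bound
   so the sketch itself never reads outside the backing array. */
static const char *consume_attr(const char *p, const char *pe,
                                const char *hard_end)
{
    for (int attempt = 0; attempt < 2; attempt++) {
        while (p < hard_end) {
            if (*p == ' ' || *p == '/' || *p == '>')
                return p;              /* attribute terminated           */
            if (++p == pe)             /* equality-only end check        */
                break;
        }
    }
    return p;
}

int main(void)
{
    char mem[64];
    srand(42);

    for (int i = 0; i < 100000; i++) {
        /* Fill the whole backing array, then treat only the first `len`
           bytes as the buffer handed to the scanner. */
        for (size_t j = 0; j < sizeof mem; j++)
            mem[j] = "abcdefghijklmnop =/>"[rand() % 20];
        size_t len = 1 + (size_t)(rand() % 16);

        const char *pe  = mem + len;
        const char *out = consume_attr(mem, pe, mem + sizeof mem);
        if (out > pe) {
            printf("overrun on input \"%.*s\": pointer %ld byte(s) past end\n",
                   (int)len, mem, (long)(out - pe));
            return 1;
        }
    }
    puts("no overruns found");
    return 0;
}
```

A real harness would drive the Ragel-generated parser itself, feed it whole malformed pages, and assert after every call that the data pointer has not passed the end of the buffer.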