Transcript for:
Cloudflare Incident: February 18, 2017

February 18, 2017. At Cloudflare HQ, work began winding down this Friday afternoon as the weekend approached. Morale was high and no one was prepared for the disaster that was about to happen. The engineers were excited to go home early and enjoy a beautiful weekend free of any operational issues.

Suddenly, at 4:11 PM Pacific time at the friendly neighborhood Google complex, one of the Googlers working in Project Zero, a security research team, discovered a severe issue with Cloudflare's systems. He immediately reached out through the most sensible channel for something this urgent, and first contact was made minutes later. It was now 4:32 PM, and the alarming details of the report were made clear to Cloudflare, suggesting a possible widespread data leak.

You may have seen Cloudflare's DDoS mitigation service before. It is built on top of their primary product, a content delivery network, or CDN. CDNs came into existence in the 1990s to speed up the delivery of internet content. They're kind of like distribution centers: Amazon isn't just going to have a single warehouse in the middle of the United States that every delivery driver starts from. There are many spread all across the country, and they store, or should I say cache, commonly sold items to minimize delivery time. Similarly, it makes no sense to deliver internet content to all users across the world from a single centralized source. A CDN will have many points of presence across the world, with edge servers that cache content from the origin server. When a user makes a request for a particular website, the request is directed to the nearest edge server, where the content is most likely already cached.

It was here that Cloudflare not only returned the requested website but also cookies, keys, and other sensitive customer data. This is what it would look like. If you knew where to look, plenty of useful information could be extracted from the leaked memory: full HTTPS requests, IP addresses, responses, passwords. And who knows how long this exploit had been out there; bad actors could have already compromised thousands of companies. Cloudflare's monitoring evidently did not self-detect this issue either, as a third party had to identify it and reach out to them.

Data leakage like this can come with hefty consequences: FTC fines, lawsuits, increased audits. But most importantly of all, it degrades customer trust. No customer trust, no customers. No customers, no revenue. No revenue, no Taco Tuesdays. To make matters worse, search engines like Google also regularly index and cache websites, so this leaked data could also be accessed through Google's cache.

4:40 PM. Now this was serious business. Everyone immediately assembled in San Francisco, maybe even with some cross-company action with the Google employees. The engineers noticed in the dashboards that the occurrence of this bug seemed to correlate with usage of the email obfuscation feature, which was also immediately suspect, as it had recently been deployed with a partial migration to a new HTML parser. Either way, every feature that Cloudflare ships comes with a feature flag, and the engineers immediately flipped what they called the global kill, which would prevent all customers from using the feature.
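As a rough illustration of that mechanism, here is a minimal sketch of a feature flag with a global kill; the names and structure are hypothetical, not Cloudflare's actual configuration system. The idea is simply that a single switch, checked on every request, overrides whatever an individual customer has enabled.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical feature flag: a per-customer setting plus a global kill
   that, once flipped, turns the feature off for everyone. */
struct feature_flag {
    const char *name;
    bool        global_kill;
};

static struct feature_flag email_obfuscation = { "email_obfuscation", false };

/* Per-request check: the global kill overrides the customer's own setting. */
static bool feature_enabled(const struct feature_flag *f, bool customer_enabled)
{
    if (f->global_kill)
        return false;
    return customer_enabled;
}

int main(void)
{
    printf("before global kill: %d\n", feature_enabled(&email_obfuscation, true));
    email_obfuscation.global_kill = true;   /* the switch the engineers flipped */
    printf("after global kill:  %d\n", feature_enabled(&email_obfuscation, true));
    return 0;
}
```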
By 5:22 PM PST, about an hour after the initial report, email obfuscation had been disabled worldwide. However, the bug was still occurring. On the other side of the Atlantic, the London team had joined the call. All hands on deck: it was time to spend the Friday night debugging and rethinking life.

8:24 PM PST, four hours in. Another two features were found to be problematic: automatic HTTPS rewrites and server-side excludes. Automatic HTTPS rewrites was shut down immediately with its global kill, but server-side excludes was such an old feature that it predated the practice of deploying with global kills. The engineers were at a crossroads here. They could release a patch for this feature to allow it to be turned off, but that would take some time to implement and deploy. Alternatively, they could spend that time root-causing the issue and deploying a single proper fix. But the root cause was not apparent, so the engineers began working on the global kill for server-side excludes and readied it for deployment.

11:22 PM PST, seven hours in. As the night progressed, the streets outside the San Francisco office grew quieter. With daybreak in London, the engineers there were more than ready to sign off and get some much-needed sleep. The patch to turn off server-side excludes was finally deployed worldwide, but there was still much work to be done. Cached data from search engines still needed to be purged, and without knowing the true root cause, recurrence was still within the realm of possibility.

But what could have caused this? Well, edge servers contain software to perform all kinds of operations on the content they deliver, and this was the clear common denominator among the three aforementioned features: they all parsed and modified the returned HTML content in some way. Email obfuscation would erase any email addresses in the returned webpage if the requester's source IP was deemed suspicious. Server-side excludes is very similar: it can automatically hide content wrapped in a special tag from suspicious source IPs. Automatic HTTPS rewrites would simply rewrite any HTTP links embedded in the returned website to HTTPS. Furthermore, these three features all used the new HTML parser mentioned earlier: cf-html. The engineers, however, found nothing suspicious in the code despite thorough verification.

It wasn't until the next few days that the root cause was made clear.

Now, Cloudflare had originally been using a parser generated with Ragel, and they were looking to migrate to something simpler and more maintainable. It was in this self-described ancient piece of software that the bug took root. Ragel is a parser language that no one knows how to pronounce, which works by defining finite state machines with regular expressions and performing various actions based on the match results. You can think of it like those flow charts, where we start at one state and transfer to different states based on various conditions. For example, here is a machine which matches consecutive numbers and letters. In practice, you can see the Ragel code is embedded within C here using the double percent signs, and it can then compile down to C, C++, Java, and so on. So why Ragel? It's actually fairly readable and concise after a bit of getting used to, and I'd imagine it's very performant.

So the HTML web page consumed by the Ragel parser is represented by a series of data buffers, with each buffer containing a portion of the HTML code. Each time the Ragel parser is invoked to consume a buffer, the user needs to pass in data pointers initialized to the beginning and end of the buffer. Ragel uses p to iterate through the buffer and pe to tell when the buffer has been fully parsed.
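To make that concrete, here is a minimal sketch along the lines of the on-screen example, not anything from Cloudflare's codebase: a machine matching runs of digits followed by lowercase letters, written in Ragel embedded in C between the double-percent markers and driven with the p and pe pointers just described.

```c
/* digits_letters.rl -- a minimal sketch, not Cloudflare's parser.
   Build with something like:  ragel digits_letters.rl -o digits_letters.c  */
#include <stdio.h>
#include <string.h>

%%{
    machine digits_letters;

    # Accept any number of runs of digits followed by lowercase letters,
    # e.g. "12ab345xyz".
    main := ( digit+ lower+ )*;
}%%

%% write data;

int main(int argc, char **argv)
{
    if (argc < 2)
        return 1;

    int  cs;                            /* current state of the machine */
    char *p  = argv[1];                 /* data pointer                 */
    char *pe = p + strlen(argv[1]);     /* data end pointer             */

    %% write init;                      /* set cs to the start state    */
    %% write exec;                      /* run the machine over p..pe   */

    printf("%s\n", cs >= digits_letters_first_final ? "matched" : "did not match");
    return 0;
}
```

Ragel turns the embedded machine into plain C, which is where the performance comes from: the regular expressions are compiled down to a state table or straight-line code rather than being interpreted at runtime.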
In Cloudflare's case, one of the things they wanted to parse was HTML attributes within script tags, such as type or src. Taking a look at the Ragel code, this script_consume_attr machine will try to match this regular expression: attribute characters followed by a space, "/", or ">".

Then we have a few actions. This is an entering action, which is performed when starting the machine; it simply logs that the machine is running. The @ symbol marks a finishing action, which is performed when the machine completes successfully. Here we call fhold, which is equivalent to p-- and will move the pointer back by one. This is likely because the script_tag_parse machine it then jumps to needs to consume the space, "/", or ">" character that the attribute machine has already matched, as those are also part of the tag. There is also a local error action, which is performed when an attribute fails to match: it logs the failure and then recurses, trying to parse the next attribute.

Going back to the success case: after exiting back to script_tag_parse, the parser machines will continue until the end of the buffer is reached. But how do we know we've reached the end of the buffer? Well, if the data pointer p is equal to the data end pointer pe, then we have surely reached the end of the buffer.

So it turns out that something very bad happens if there is an unfinished attribute at the very end of a web page. When this happens, the failure to match occurs while the data pointer p is equal to the data end pointer pe. The parser then re-invokes itself, now at risk of parsing undefined heap memory. Let's see if the buffer end check saves us. Aw man, the pre-increment causes p to skip over pe and never be equal to it. Rookie mistake.

But wait, this is a bug in the old parser, which had been in use for years. Has Cloudflare been leaking data all this time? No. It was actually the migration to the new parser that triggered the issue. Going back to the buffer overrun we were talking about before: if there are more buffers to come, an unfinished tag could just be due to the rest of the element being in the next buffer, so the error action will not be invoked. The error action is only triggered on an unfinished match within the very last buffer, as there is no more data at that point to complete the match. This is why, in the example, the unfinished attribute is at the very end of the page, that is, at the very end of the last possible buffer.

However, the key here is that historically, when only the old parser was used, it would always receive an extra dummy last buffer that had no content. Why? No particular reason, it just did. This meant that for a website that ended with an unfinished tag, the unfinished tag would sit in the second-to-last buffer, and the error action would not be called. Then, since the last buffer was empty, the parser would also not be invoked again. After the new parser was introduced, this behavior changed: the empty last buffer was no longer present in the buffer sequence passed to Ragel, causing the unfinished tag to land in the last buffer and making the overrun possible. Perhaps the new parser cleaned up the empty last buffer before passing data to the old one. This also meant that the bug could only occur when a customer enabled features which, in combination, used both the old and new parsers.
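To make the pointer arithmetic concrete, here is a minimal C sketch of the failure mode. It is not Cloudflare's generated code and the buffer contents are made up; it only shows why an increment followed by an equality-only end-of-buffer check cannot stop a pointer that re-enters the loop already sitting at the end.

```c
#include <stdio.h>

/* A single array stands in for process memory: the first 15 bytes are the
   HTML buffer, ending in an unfinished attribute, and the bytes after it
   model whatever happened to sit next to it on the heap (hypothetical). */
int main(void)
{
    char memory[] = "type=\"text/java" "SECRET-COOKIE";
    char *pe = memory + 15;        /* data end: one past the HTML buffer */
    char *p;

    /* Normal case: p starts inside the buffer, so "++p == pe" eventually
       fires and the scan stops exactly at the end. */
    for (p = memory; ; ) {
        if (++p == pe)
            break;
    }
    printf("normal case stopped at offset %ld\n", (long)(p - memory));

    /* Failure case: the error action re-enters the machine while p is
       already equal to pe (an unfinished attribute in the very last
       buffer).  The first ++p steps past pe, the equality test can never
       fire again, and the scan wanders into the adjacent bytes. */
    p = pe;
    do {
        if (++p == pe)             /* never true once p has passed pe    */
            break;
    } while (*p != '\0');          /* stops only on the neighboring data */
    printf("failure case stopped at offset %ld (buffer ends at 15)\n",
           (long)(p - memory));

    /* A ">=" comparison would have caught the runaway pointer at once. */
    printf("p >= pe? %s\n", p >= pe ? "yes" : "no");
    return 0;
}
```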
So what can we learn from this failure? Well, here we see a classic example of backwards compatibility: no matter how dumb the behavior of something is, if it has been set in stone for a long time and you change it, something is definitely going to break. However, it's not always so easy to maintain backwards compatibility. Obviously, Microsoft can easily choose not to deprecate Windows' ability to run 32-bit programs, but cf-html removing the last buffer, or perhaps more accurately, not adding the extra dummy buffer back for no reason, is something that can easily be overlooked. And it was not just this, but also a bug in the existing code plus a very specific type of input that, in combination, caused the data leak. When you consider even larger systems with dozens of interlocking components, each with millions of possible inputs, it's clear that there will inevitably be bugs in all software.

So what can be done to minimize impact? Cloudflare mentions fuzzing the generated code to search for pointer overruns, as well as building test cases for malformed web pages; a rough sketch of what such a harness could look like is included at the end. There are also various memory management techniques that can reduce impact, and this could likely also have been caught by static code analysis.

Perhaps another thing worth pointing out is best practices. The coding standards for Ragel are not very clear, but from my limited experimentation, I don't think it is possible for Ragel to naturally overrun the buffer. It's possible to under-run the buffer by spamming fhold, but Ragel's default behavior seems to make overrunning impossible: there's no Ragel command to force iteration of the data pointer, and when Ragel naturally iterates the data pointer forward, it always explicitly checks whether it has reached the data end. This points to Cloudflare potentially having gone in and modified the compiled C code rather than the Ragel code itself, something that would obviously not be Ragel best practice.

Two days later, pointer checks to detect memory leaks were rolled out, and three days later, the engineers determined it was safe enough to re-enable the three aforementioned features. Cloudflare then worked with the various search engines to purge their caches of affected websites.

In terms of overall impact, the evidence suggests that it was quite small. There were quite a few conditions that needed to be met for the bug to manifest, and Cloudflare claims there is no evidence of the bug being leveraged for any attacks. We know that 0.6 percent of Cloudflare websites ended with unfinished tags and that the bug occurred more than 18 million times. It is reasonable to say that Cloudflare just got really lucky. In fact, one of the features which could trigger this bug was available as far back as November 2016. Had this exploit fallen into the wrong hands, or had it occurred more recently now that Cloudflare is so much bigger, there may not have been such a happy ending.
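As mentioned above, here is a rough sketch of the fuzzing idea. The scanner being fuzzed is entirely hypothetical, not Cloudflare's code: it just looks for the space, "/", or ">" that ends an attribute using the same increment-then-compare pattern, and retries once on failure the way the error action re-entered the machine. The harness throws randomly truncated attribute-like buffers at it and flags any run where the pointer ends up past the end.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical toy scanner standing in for the generated parser.  It uses
   "++p == pe" as its only end-of-buffer test; hard_end is a safety bound
   so the sketch itself never reads outside the backing array. */
static const char *consume_attr(const char *p, const char *pe,
                                const char *hard_end)
{
    for (int attempt = 0; attempt < 2; attempt++) {
        while (p < hard_end) {
            if (*p == ' ' || *p == '/' || *p == '>')
                return p;              /* attribute terminated           */
            if (++p == pe)             /* equality-only end check        */
                break;
        }
    }
    return p;
}

int main(void)
{
    char mem[64];
    srand(42);

    for (int i = 0; i < 100000; i++) {
        /* Fill the whole backing array, then treat only the first `len`
           bytes as the buffer handed to the scanner. */
        for (size_t j = 0; j < sizeof mem; j++)
            mem[j] = "abcdefghijklmnop =/>"[rand() % 20];
        size_t len = 1 + (size_t)(rand() % 16);

        const char *pe  = mem + len;
        const char *out = consume_attr(mem, pe, mem + sizeof mem);
        if (out > pe) {
            printf("overrun on input \"%.*s\": pointer %ld byte(s) past end\n",
                   (int)len, mem, (long)(out - pe));
            return 1;
        }
    }
    puts("no overruns found");
    return 0;
}
```

A real harness would drive the Ragel-generated parser itself, feed it whole malformed pages, and assert after every call that the data pointer has not passed the end of the buffer.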