Coding Horror

programming and human factors

Sins of Software Security

I picked up a free copy of 19 Deadly Sins of Software Security at a conference last year. I didn't expect the book to be good because it was a free giveaway item from one of the vendor booths. But I paged through it on the flight home, and I was pleasantly surprised. It's actually quite good.


Software security isn't exactly my favorite topic, so holding my interest is no mean feat. It helps that the book is mercifully brief and to the point, and filled with practical examples and citations. It's an excellent cross-platform, language-agnostic checksheet of common software security risks.

Here's a brief summary of each of the 19 sins, along with a count of the number of vulnerabilities I found in the Common Vulnerabilities and Exposures database for each one.

Sin | Affected Languages | Description | Exploit count
Buffer Overflows | C, C++ | A buffer overrun occurs when a program allows input to write beyond the end of the allocated buffer. Results in anything from a crash to the attacker gaining complete control of the operating system. Many famous exploits are based on buffer overflows, such as the Morris worm. | 3,326
Format String Problems | C, C++ | The standard format string libraries in C/C++ include some potentially dangerous commands (particularly %n). If you allow untrusted user input to pass through a format string, this can result in anything from arbitrary code execution to spoofing user output. | 411
Integer Overflows | C, C++, others | Failure to range check on integer types. This can cause integer overflow crashes and logic errors. In C/C++, integer overflows can be turned into a buffer overrun and arbitrary code execution, but all languages are prone to denial of service and logic errors. | 288
SQL Injection | All | Forming SQL statements with untrusted user input means users can "inject" their own commands into your SQL statements. This puts your data at risk, and can even lead to complete server and network compromise. | 2,225
Command Injection | All | Occurs when untrusted user input is passed to a compiler or interpreter, or worse, a command line shell. Potential risk depends on the context. | 193
Failing to Handle Errors | Most | A broad category of problems related to a program's error handling strategy; anything that leads to the program crashing, aborting, or restarting is potentially a denial of service issue and therefore can be a security problem, particularly on servers. | 80
Cross-Site Scripting (XSS) | Any web-facing | A web application takes some input from the user, fails to validate it, and echoes that input directly back to the web page. Because this code is running in the context of your web site, it can do anything your website could do, including retrieving cookies, modifying the HTML DOM, and so forth. | 2,996
Failing to Protect Network Traffic | All | Most programmers underestimate the risk of transmitting data over the network, even if that data is not private. Attackers can eavesdrop, replay, spoof, tamper with, or otherwise hijack any unprotected data sent over the wire. | 26
Use of Magic URLs and Hidden Form Fields | Any web-facing | Passing sensitive or secure information via the URL querystring or hidden HTML form fields, sometimes with lousy or ineffectual "encryption" schemes. Attackers can use these fields to hijack or manipulate a browser session. | 33
Improper use of SSL and TLS | All | Using most SSL and TLS APIs requires writing a lot of error-prone code. If programmers aren't careful, they will have an illusion of security in place of the actual security promised by SSL. Attackers can use certificates from lax authorities, subtly invalid certificates, or stolen/revoked certificates, and it's up to the developer to write the code to check for that. | 123
Use of Weak Password-Based Systems | All | Anywhere you are using passwords, you need to seriously consider the risks inherent to all password-based systems. Risks like phishing, social engineering, eavesdropping, keyloggers, brute force attacks, and so on. And then you have to worry about how users choose passwords, and where to store them securely on the server. Passwords are a necessary evil, but tread carefully. | 1,235
Failing to Store and Protect Data Securely | All | Information spends more time stored on disk than in transit. Consider filesystem permissions and encryption for any data you're storing. And try to avoid hardcoding "secrets" into your code or configuration files. | 56
Information Leakage | All | The classic trade-off between giving the user helpful information, and preventing attackers from learning about the internal details of your system. Was the password invalid, or the username? | 26
Improper File Access | All | 1) There is often a window of vulnerability between time of check and time of use (TOCTOU) in the filesystem, so an attacker can slip changes in, particularly if the files are accessed over the network. 2) The "it isn't really a file" problem; you may think you have a file, but attackers may substitute a link to another file, or a device name, or a pipe. 3) Allowing users control over the complete filename and path of files used by the program; this can lead to directory traversal attacks. | 5, 58
Trusting Network Name Resolution | All | It's simple to override and subvert DNS on a server or workstation with a local HOSTS file. How do you really know you're talking to the real "secureserver.com" when you make an HTTP request? | 20
Race Conditions | All | A race condition is when two different execution contexts are able to change a resource and interfere with each other. If attackers can force a race condition, they can execute a denial of service attack. Unfortunately, writing properly concurrent code is incredibly difficult. | 139
Unauthenticated Key Exchange | All | Exchanging a private key without properly authenticating the entity/machine/service that you're exchanging the key with. To have a secure session, both parties need to agree on the identity of the opposing party. You'd be shocked how often this doesn't happen. | 1
Cryptographically Strong Random Numbers | All | Imagine you're playing poker online. The computer shuffles and deals the cards. You get your cards, and then another program tells you what's in everybody else's hands. Random numbers are similarly fundamental to cryptography; they're used to generate things like keys and session identifiers. An attacker who can predict numbers-- even with only a slight probability of success-- can often leverage this information to breach the security of a system. | 5
Poor Usability | All | Security is always extra complexity and pain for the user. It's up to us software developers to go out of our way to make it as painless as it can reasonably be. Security only works if the secure way also happens to be the easy way. | All
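To make the SQL Injection entry above concrete, here's a minimal sketch using Python's built-in sqlite3 module. The `users` table and the three function names are made up for illustration; they don't come from the book. The only difference between the two lookups is whether the untrusted input is concatenated into the SQL text or bound as a parameter.

```python
import sqlite3


def make_demo_db():
    # Hypothetical in-memory database with two users, for demonstration only.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users (name) VALUES (?)",
                     [("alice",), ("bob",)])
    return conn


def find_user_unsafe(conn, name):
    # Vulnerable: untrusted input is spliced directly into the SQL text,
    # so quote characters in `name` can change the meaning of the query.
    query = "SELECT id, name FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, name):
    # Safer: the driver binds `name` as a value; it is never parsed as SQL.
    query = "SELECT id, name FROM users WHERE name = ?"
    return conn.execute(query, (name,)).fetchall()
```

Feed both functions the input `nobody' OR '1'='1`. The unsafe version returns every row in the table, because the injected `OR` clause is always true. The parameterized version returns nothing, because no user has that literal name.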

It's true that C and C++ have a heavy cross to bear. But only 3 of the 19 sins can be completely lumped on the plate of K&R. The other 16 apply almost everywhere, to any developer writing code on any platform. It's a sobering thought.
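Directory traversal, listed under Improper File Access above, is a good example of a sin that has nothing to do with C. Here's a rough sketch of one common defense in Python: resolve the requested path and refuse anything that lands outside a base directory. The function name `resolve_inside` is hypothetical, and this only addresses the third item in that row; it does nothing about the time-of-check/time-of-use window.

```python
from pathlib import Path


def resolve_inside(base_dir, user_path):
    """Return the resolved path only if it stays inside base_dir."""
    base = Path(base_dir).resolve()
    # Joining with an absolute user_path discards base entirely,
    # which the containment check below also catches.
    candidate = (base / user_path).resolve()
    if candidate != base and base not in candidate.parents:
        raise ValueError("path escapes base directory")
    return candidate
```

With this in place, a request for `reports/q1.txt` resolves normally, while `../../etc/passwd` or an absolute path like `/etc/passwd` raises `ValueError` instead of reaching the filesystem.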

The usability sin is the one that's most interesting to me. Usability is tough under the best of conditions-- and security is the worst of conditions, at least from the user's perspective. It's quite a challenge. There are a few great links in the book on the topic of security usability.

You can certainly find other books that go much deeper on particular aspects of software security. But if you're looking for an excellent primer on the entire gamut of security problems that could potentially afflict your project, 19 Deadly Sins of Software Security is an excellent starting point.


When In Doubt, Make It Public

Marc Hedlund offered some unique advice to web entrepreneurs last month:

One of my favorite business model suggestions for [web] entrepreneurs is to find an old UNIX command that hasn't yet been implemented on the web, and fix that.

To illustrate, Marc provides a list of UNIX commands with their corresponding web implementations:

UNIX command | Web implementation
talk, finger | ICQ
LISTSERV | DejaNews
ls | Yahoo! directory
find, grep | Google
rn | Bloglines
pine | Google Mail
mount | Amazon S3
bash | Yahoo! Pipes
wall | Twitter

Jason Kottke noted that most successful "new" business models on the web aren't new at all-- they're simply taking what was once private and making it public and permanent:

Blogger = public email messages. (1999) Instead of "Dear Bob, Check out this movie." it's "Dear People I May or May Not Know Who Are Interested in Film Noir, check out this movie. If you like it, maybe we can be friends."

Flickr = public photo sharing. (2004) Flickr co-founder Caterina Fake said in a recent interview: "When we started the company, there were dozens of other photosharing companies such as Shutterfly, but on those sites there was no such thing as a public photograph -- it didn't even exist as a concept -- so the idea of something 'public' changed the whole idea of Flickr."

YouTube = public home videos. (2005) Bob Saget was onto something.

Twitter = public IM. (2006) I don't think it's any coincidence that one of the people responsible for Blogger is also responsible for Twitter.

But you don't have to found a new Web 2.0 company to benefit from the power of public information. Even brick and mortar companies are finally realizing that the age-old principle of "secret by default" may not be the best policy today:

Companies used to assume that details about their internal workings were valuable precisely because they were secret. If you were cagey about your plans, you had the upper hand; if you kept your next big idea to yourself, people couldn't steal it. Now, billion-dollar ideas come to CEOs who give them away; corporations that publicize their failings grow stronger. Power comes not from your Rolodex but from how many bloggers link to you - and everyone trembles before search engine rankings.

Power, it seems, comes from public information. Secrets are only a source of powerlessness. Just ask Brad Abrams, who poses this rhetorical question:

If no one knows you did X, did you really get all the benefits for doing X?

I think Brad is being a bit too cautious here. I'll go one step further. Until you've..

  • Written a blog entry about X
  • Posted Flickr photos of X
  • Uploaded a video of X to YouTube
  • Typed a Twitter message about X

.. did X really happen at all?

This is not to say we should fill the world with noise on every mundane aspect of our existence. But who decides what is mundane? Who decides what is interesting? Everything's interesting to someone, even if that someone is only you and a few other people in the world.

It's my firm belief that the inclusionists are winning. We live in a world of infinitely searchable micro-content, and every contribution, however small, enriches all of us. But more selfishly, if you're interested in deriving maximum benefit from your work, there's no substitute for making it public and findable. Obscurity sucks. But obscurity by choice is irrational. When in doubt, make it public.


Reddit: Language vs. Platform

My previous entry, Twitter: Service vs. Platform, was widely misunderstood. I suppose I only have myself to blame, so I'll try to clarify with another example.

Consider Reddit. The Reddit development team switched from Lisp to Python late in 2005:

If Lisp is so great, why did we stop using it? One of the biggest issues was the lack of widely used and tested libraries. Sure, there is a CL library for basically any task, but there is rarely more than one, and often the libraries are not widely used or well documented. Since we're building a site largely by standing on the shoulders of others, this made things a little tougher. There just aren't as many shoulders on which to stand.

On that note, if you have been considering writing a web application in Lisp, go for it. It will be tough if you're not already a Lisper, but you will learn a lot along the way, and it will be worth it I am sure. Lisp is especially great for projects where the end goal is unknown because it's so easy to steer in different directions. Lisp will never get in your way, although sometimes the environment will.

Language performance is a red herring. That's especially true when we're comparing dynamic languages like Ruby, Lisp, and Python that will never be known for their high octane, nitro burnin' performance levels. I assumed Alex Payne knew that when he chose to specifically call out Ruby language performance, but maybe I assumed wrong.

When you choose a language, like it or not, you've chosen a platform. And as Steve so patiently and calmly explained to all the Lisp enthusiasts, the platform around the language, more than the language itself, sets the tone for your development experience. The availability of common, popular libraries and the maturity of the development environment end up trumping any particular significance the language holds.

That's why the Reddit switch makes good business sense: they didn't change languages; they changed platforms. At the point at which your choice of platform starts to jeopardize your service, you switch platforms, exactly as Reddit did. Your users don't give a damn what framework and language you're using. The only people who care about that stuff are other software developers. And God help you if your users are software developers; then you're really in trouble.

But things aren't all roses in Python-land either. The Reddit developers initially used a Rails-like web application framework, with decidedly mixed results:

The framework that seems most promising is Django and indeed the authors of reddit initially attempted to rewrite their site in it. I was curious about their experience, so I carefully followed them along, trying to help them out.

Django seemed great from the outside: a nice-looking website, intelligent and talented developers, and a seeming surplus of nice features. The developers and community are extremely helpful and responsive to patches and suggestions. And all the right goals are espoused in their philosophy documents and FAQs. Unfortunately, however, they seem completely incapable of living up to them.

While Django claims that it's "loosely coupled", using it pretty much requires fitting your code into Django's worldview. Django insists on executing your code itself, either through its command-line utility or a specialized server handler called with the appropriate environment variables and Python path. When you start a project, by default Django creates folders nested four levels deep for your code and while you can move around some files, I had trouble figuring out which ones and how.

Django's philosophy says "Explicit is better than implicit", but Django has all sorts of magic. Database models you create in one file magically appear someplace else deep inside the Django module with a different name. When your model function is called, new things have been added to its variable-space and old ones removed. (I'm told they're currently working on fixing both of these, though.)

Note that any analogies I'm drawing between Rails and Django here are purely intentional.

Not that there's anything wrong with adopting a web application framework. But at least in Python you have a choice of web application frameworks. Instead of investing in the Django worldview, the Reddit team decided that the lighter weight web.py better suited their needs. Similarly, some ASP.NET developers reject the entire page lifecycle model, preferring to write their own HttpHandlers and HttpModules for finer-grained control over what's happening on their website. And that's fine; the ASP.NET platform accommodates both camps of developers.

It's true that Twitter represents an extreme case, but it sure looks like the Twitter developers could benefit from a choice of web application frameworks, too. In the end, it's about choice and flexibility. Not just in the language, but in the platform that inevitably comes along with any language.


Twitter: Service vs. Platform

Twitter is a victim of its own success. The site has massive scaling problems, to the tune of 11,000 pageviews per second. According to this interview with a Twitter developer, a lot of the scaling problems are attributable to Twitter's choice of platform:

By various metrics Twitter is the biggest Rails site on the net right now. Running on Rails has forced us to deal with scaling issues - issues that any growing site eventually contends with - far sooner than I think we would on another framework.

The common wisdom in the Rails community at this time is that scaling Rails is a matter of cost: just throw more CPUs at it. The problem is that more instances of Rails (running as part of a Mongrel cluster, in our case) means more requests to your database. At this point in time there's no facility in Rails to talk to more than one database at a time. The solutions to this are caching the hell out of everything and setting up multiple read-only slave databases, neither of which are quick fixes to implement. So it's not just cost, it's time, and time is that much more precious when people can['t] reach your site.

None of these scaling approaches are as fun and easy as developing for Rails. All the convenience methods and syntactical sugar that makes Rails such a pleasure for coders ends up being absolutely punishing, performance-wise. Once you hit a certain threshold of traffic, either you need to strip out all the costly neat stuff that Rails does for you (RJS, ActiveRecord, ActiveSupport, etc.) or move the slow parts of your application out of Rails, or both.

It's also worth mentioning that there shouldn't be doubt in anybody's mind at this point that Ruby itself is slow. It's great that people are hard at work on faster implementations of the language, but right now, it's tough. If you're looking to deploy a big web application and you're language-agnostic, realize that the same operation in Ruby will take less time in Python. All of us working on Twitter are big Ruby fans, but I think it's worth being frank that this isn't one of those relativistic language issues. Ruby is slow.

I've often said that performance doesn't always matter. But if, like Twitter, your business model is predicated on how fast your users can press the Refresh button in their browser, you could be in serious trouble if your service becomes popular.
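Caching, one of the two fixes named in the interview, matters most for exactly this refresh-heavy pattern. Here's a minimal sketch of a read-through cache with a short time-to-live. `TTLCache` and the loader function are hypothetical names for illustration, not anything from Twitter's or Rails' code; the loader stands in for whatever database query builds the page.

```python
import time


class TTLCache:
    """Read-through cache: serve a stored value until it expires,
    then call the loader (e.g. a database query) again."""

    def __init__(self, loader, ttl_seconds=5.0, clock=time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested
        self._entries = {}           # key -> (expires_at, value)
        self.misses = 0              # how many times the loader was called

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]          # still fresh: no database request
        self.misses += 1
        value = self._loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value
```

With a five-second TTL, a user hammering Refresh costs one loader call per five seconds for that key instead of one per request. The trade-off is that readers may see data up to five seconds stale, and a single in-process dictionary like this does nothing for a cluster of separate server processes; a shared cache would be needed there.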

What I find particularly amusing is the performance comparison with Python. It's hard to believe that Python is that much faster than Ruby. Python, like Ruby, is an interpreted language, and interpreted languages are so slow that if you have to ask how much performance you're giving up, you can't afford it. Consider this chart from Code Complete 2.0:

Language | Type of Language | Execution Time Relative to C++
C++ | Compiled | 1:1
Visual Basic | Compiled | 1:1
C# | Compiled | 1:1
Java | Byte code | 1.5:1
PHP | Interpreted | > 100:1
Python | Interpreted | > 100:1

I realize that Web 2.0 is built on the back of the cheap "whatever box" server. Twitter is probably the perfect storm of refresh-heavy design coupled with exponential growth. Most websites wish they were so lucky.

To be fair, it sounds like most of Twitter's problems are database problems, so maybe it doesn't matter what language they use. But it does make you wonder: what's more important-- the service, or the platform you deliver that service on?

In the case where the latter is jeopardizing the former, I think it's pretty clear where your allegiances should lie. Your users don't care how cool the Rails platform is-- but they sure do care about consistent availability of your service.

Update: This entry isn't as clear as it could be. See my followup to this post for a better explanation of my position.


The Pernicious Issue of Software Patents

A reddit user recently invoked link necromancy on a 1994 Donald Knuth letter to the U.S. Patent Office:

When I think of the computer programs I require daily to get my own work done, I cannot help but realize that none of them would exist today if software patents had been prevalent in the 1960s and 1970s. Changing the rules now will have the effect of freezing progress at essentially its current level. If present trends continue, the only recourse available to the majority of America's brilliant software developers will be to give up software or to emigrate. The U.S.A. will soon lose its dominant position.

Please do what you can to reverse this alarming trend. There are far better ways to protect the intellectual property rights of software developers than to take away their right to use fundamental building blocks.

You have to respect the opinion of Donald Knuth, because he's our homeboy.


Still, opinions vary. The software patent debate merits an entire Wikipedia article, and the ensuing comment debate on Reddit represents plenty of opposing viewpoints.

Paul Graham, surprisingly, thinks software patents don't matter:

I'm not saying secrecy would be worse than patents, just that we couldn't discard patents for free. Businesses would become more secretive to compensate, and in some fields this might get ugly. Nor am I defending the current patent system. There is clearly a lot that's broken about it. But the breakage seems to affect software less than most other fields.

In the software business I know from experience whether patents encourage or discourage innovation, and the answer is the type that people who like to argue about public policy least like to hear: they don't affect innovation much, one way or the other. Most innovation in the software business happens in startups, and startups should simply ignore other companies' patents. At least, that's what we advise, and we bet money on that advice.

Paul Heckel goes so far as to say responsible, rational use of software patents may actually encourage innovation:

In brief, what superficially looks like another problem to be dealt with in the increasingly competitive, commodities oriented software business, will prove to be what makes products less price competitive. Many industries have worked on this basis all along: patents make industries more diverse in their offerings, more profitable, more innovative, and ultimately will make the U.S. more competitive.

The essence of this article is simple: Software intellectual property issues are not inherently different in substance from other technologies; what motivates people is not inherently different; industry life cycle is not inherently different; marketing and business strategies and tactics are not inherently different; the law and policy issues are not inherently different. The technology is not even new. Software has been around for 40 years. The issues may be new to those who had no experience of them, but the only thing that is different is that software is a mass market industry for the first time and real money is at stake.

As much as I respect Knuth, I have to agree that the problem with software patents isn't the patents themselves. It's the sloppy, haphazard way the patents are granted and enforced. If anything needs reforming, it's the U.S. Patent Office.
