Chapter 14. Databases

Databases are one of computing’s most important inventions. They allow applications to store massive quantities of information, with the ability to search through millions of items and retrieve the ones you need in a fraction of a second. A high-quality database can scale to large numbers of concurrent end users, while providing very reliable storage, even in the face of system crashes. And even if you don’t need the scalability, databases still look compelling if your program needs to remember data for any length of time—applications that store valuable information usually rely on databases.

The .NET Framework provides several different ways to communicate with databases. We will mainly be looking at its most recently introduced data access mechanism, the Entity Framework, and how that works with the LINQ features of C#. But first, we’ll take a quick look at all of the database features of the .NET Framework, to put the Entity Framework in context.

The main focus of this chapter, the Entity Framework, was first released as part of Service Pack 1 for Visual Studio 2008, which emerged less than a year after the initial (pre-Service-Pack) release of Visual Studio 2008. This was remarkable, since that first release had already introduced a brand-new data access feature, LINQ to SQL, but then Microsoft has released a lot of data access technologies over the years.

While the pace of change can sometimes seem daunting, each new piece has been a useful advance, and despite the new APIs, the data access services that appeared in .NET v1.0 are still relevant today. So we’re not in a state of continuous revolution—new features mostly add layers of functionality. This means it’s useful to understand all the parts in order to know what to choose for your applications, so we’ll review what each is for and how the pieces build on one another.

.NET v1 provided a set of data access services called ADO.NET.[34] In more recent years, ADO.NET seems to have grown into an umbrella term—as new data access features have been added, most (but not all) appear in the ADO.NET section of the documentation. But to understand the layers, it’s worth starting with the two parts that were in the first version: interfaces for querying and updating databases, and classes that support disconnected use of data.

ADO.NET defines a family of interfaces that provide a uniform way to perform basic operations such as executing queries, inserting new rows into database tables, and updating or deleting existing rows. Some data access features are common to many different programming systems—if you’re familiar with ODBC, or with Java’s JDBC, you could think of these ADO.NET interfaces as being the .NET equivalent of those APIs.

These interfaces provide the most direct and efficient way to access the basic services offered by relational databases, which is why the other data access features we’ll be looking at in this chapter do not replace this part of ADO.NET. They build on this low-level feature to provide higher-level services.

Because it’s not the main focus of this chapter, we won’t go into too much detail on how this part of ADO.NET works, and will instead just provide a quick taste. Table 14-1 shows the main ADO.NET base classes that represent the various things needed to get a database to do some work.

Example 14-1 shows the typical pattern of communication. It starts by creating a connection object—a SqlConnection here because this code connects to SQL Server, but for other databases you’d use other types derived from DbConnection, such as OracleConnection. Next, it builds a command object, setting its CommandText to the SQL we want the database to execute. This particular example is a parameterized command—it selects addresses from a specified state, so we supply the command with a parameter object specifying the state. Then we execute the command by calling ExecuteReader, using the data reader object it returns to iterate through the rows produced by the query, and we print out the values. (This particular example assumes you have a SQL Server instance called .\SQLEXPRESS. If you installed the full edition or developer edition of SQL Server, specify just . instead of .\SQLEXPRESS. See Getting up and running with SQL Server 2008 Express for information on getting the samples installed.)

You might be wondering why we’re fiddling around with parameter objects when it would have been simpler to just put the state directly into the SQL string. This particular example hardcodes the state, so that would have worked, but the technique here would be important if the value was picked at runtime. In general, building SQL queries with string concatenation is a dangerous thing to do—if any of the text comes from outside your code (e.g., from a form on a web page, or part of a URL) your code will be vulnerable to a SQL injection attack. Imagine that Example 14-1 was part of a web application, and state here came from part of a URL such as http://example.com/showinfo?state=California. Users are free to modify URLs—you can just type them into the address bar—so a malicious user might decide to modify that part of the URL. If the code just took the string from the URL and concatenated it directly into the SQL, we would effectively be giving anyone with an Internet connection the ability to run arbitrary SQL commands on our database—SQL queries can contain multiple commands, so users would be able to add extra commands to run after the SELECT. Parameters are one way to avoid this, because the value of a parameter will not be treated as SQL. So it’s a good idea to get in the habit of using parameters whenever some part of the query needs to change at runtime.

The API we used here directly reflects the steps needed to communicate with a database, so we have to write a lot of code to bridge between the queries, parameters, and columns of the database world and the world of C#. Just as a sneak preview, Example 14-2 shows the equivalent code using the Entity Framework. Notice that instead of having to build a parameter object for a parameterized query, we’ve just been able to use a LINQ where clause and the C# == comparison syntax. (The Entity Framework performs a parameterized query under the covers, so this is safe from SQL injection attacks.) Also notice that all the database columns are available as object properties, so we don’t have to call GetString or similar helpers to retrieve column values.

Example 14-1 has one obvious benefit in exchange for the complexity: we have complete control over the SQL query. You can’t see the SQL in Example 14-2 because it gets generated for you. In general, the low-level ADO.NET API gives us more direct access to database features—for example, with SQL Server you can arrange to be notified when a query you executed earlier will now return different results due to changes in the database. (This can be useful in caching systems—ASP.NET’s cache can take advantage of this. It needs to be used with care, however, because it requires you to keep a database connection open at all times, which can cause severe scaling problems.)

Another potential benefit is that Example 14-1 does not require the application to commit to the Entity Framework’s way of doing things. Not all applications necessarily want to use databases in the way the Entity Framework chooses to.

The use of this old-style part of ADO.NET usually comes down to a need for control over some specific aspects of data access, or occasionally because it offers performance benefits in certain specialized scenarios. But for the majority of developers, this style of data access will be unnecessarily low-level and verbose.

These interfaces are not the only part of ADO.NET v1—it has another piece whose job is to manage data after the query that fetched it has completed.

ADO.NET defines the DataSet class, which is a collection of DataTable objects. A DataTable is an in-memory copy of some tabular data. Typically, this would be fetched from a database table or view, although it’s possible to build up a DataTable from any information source—it provides methods for creating new rows from scratch.

A DataSet can be a convenient way of loading a small subset of a database’s contents into client-side code, enabling information to be browsed locally using data binding. It also supports some basic client-side processing of a kind that might normally be done in the database—you can perform searching, filtering, and sorting, for example. In Windows GUIs, moving this sort of work to the client side can improve responsiveness—the user doesn’t have to wait for the database to respond to see results. This is what’s meant by disconnected operation—you can still work with the data even after closing the connection to the database.

DataSet objects are serializable, so it’s possible to save one to disk or send it across a network. It can use an XML representation, which in theory makes it possible for non-.NET code to access the information in a DataSet. However, while it’s certainly workable, in practice this seems not to be a popular technique. This may be because the XML representation is relatively complex and unique to the DataSet, so there’s not much support for it outside of the .NET Framework.

Visual Studio is able to generate derived classes to build a so-called strongly typed DataSet, whose tables offer row objects with .NET properties representing columns in the corresponding database table. Strongly typed DataSets are often used to reduce the amount of code required to bridge between C# and the database. However, since LINQ to SQL and LINQ to Entities came along, this use of DataSets has become less popular, because the LINQ-based approaches offer the same benefit but are typically easier to use. So DataSets are somewhat out of favor today.

The low-level ADO.NET data access interfaces were the main way to access data in .NET right up until .NET 3.5 and Visual Studio 2008 shipped, bringing LINQ.

As we saw in Chapter 8, LINQ lets you perform tasks with collections of data including filtering, sorting, and grouping. In that chapter, we were working only with objects, but these are exactly the jobs that databases are good at. Moreover, one of the motivations behind LINQ’s design was to make it easier to use databases from code. As you can see in Example 14-2, LINQ blends data access seamlessly into C# code—this database example looks very similar to the object examples we saw in the earlier chapter.

Example 14-2 uses LINQ to Entities—a LINQ provider for the Entity Framework. The Entity Framework didn’t appear until Service Pack 1 of Visual Studio 2008, and there’s an older database LINQ provider called LINQ to SQL that appeared in the first Visual Studio 2008 release.

LINQ to SQL works only with SQL Server and SQL Server Compact 3.5, and has a fairly narrow goal. It aims to reduce the overhead involved in writing data access code by providing a convenient C# syntax for working with the data in a SQL Server database.

The Entity Framework is similar, but it adds a couple of additional features. First, it is designed to support multiple database vendors—it has an open provider model, enabling support to be written for any database, and you can get providers for most popular databases. Second, the Entity Framework allows the .NET representation to have a different structure from your database schema if necessary. You can define a conceptual model whose entities do not necessarily correspond directly to rows of particular tables—an entity might include data that spans multiple tables in the database itself. This entity can then be represented as a single object.

Of course, it’s possible to have your conceptual model correspond exactly to your database model—you’re free to create a straightforward mapping where one entity represents one row in one table. Used in this way the Entity Framework, in conjunction with LINQ to Entities, makes LINQ to SQL look redundant. So why do we have both?

The main reason LINQ to SQL exists is that it was ready when Visual Studio 2008 shipped, whereas Microsoft hadn’t finished the Entity Framework at that point. LINQ was a major part of that release, and since one of the main motivations behind LINQ was data access, shipping without a LINQ-based data access feature would have been a bit of a letdown. LINQ to SQL was developed by a different team (it came from the LINQ team, and not the data access group), and it was ready earlier, due no doubt in part to its less ambitious goals.

Microsoft has stated that while both technologies are fully supported, the Entity Framework is where the majority of its efforts will now be focused. Visual Studio 2010 adds a few new LINQ to SQL features, but LINQ to Entities will see more development in the long run.

That’s why this chapter’s focus is the Entity Framework (although a lot of the concepts here apply equally to both technologies). That being said, both authors really like LINQ to SQL. In scenarios where we’re using SQL Server and where we don’t need the conceptual model and mapping features of the Entity Framework, we’re both more inclined to use LINQ to SQL because of its simplicity and because we’ve already learned how to use it. But if you learn only one data access technology for .NET, the Entity Framework looks like the better choice for the long term.

By the time Microsoft shipped the Entity Framework, various third-party options for mapping relational data into object models had been around for a while. We’re not going to talk about them in this book, but it’s useful to be aware that the Entity Framework isn’t the only game in town.

Perhaps the best known alternative is NHibernate. This is a .NET version of Hibernate, a popular Java ORM (Object Relational Mapper). NHibernate had already been around for a few years by the time the Entity Framework emerged (and its Java progenitor is considerably older). So in many respects it’s a more mature and more fully featured ORM than the Entity Framework. On the other hand, NHibernate predates LINQ (and Java currently has nothing resembling LINQ), so at the time of this writing, its LINQ support is somewhat limited.

Many other ORMs are available for .NET, some free and some commercial. They are too numerous to mention here, as a quick web search will confirm.

Most communication with databases happens over specialized, vendor-specific protocols. Firewalls are usually configured not to let such protocols through, and with good reason: from a security perspective, making your database directly accessible on the Internet tends to look like a very bad idea. Nonetheless, some people want to do this, and there are scenarios in which it’s not the terrible idea it might first seem, particularly if you can exercise sufficient control over what gets exposed.

With WCF Data Services, you can present a relational data store over HTTP and XML or JSON. You can be selective about what data you expose and to whom. Moreover, the model you present doesn’t necessarily have to be the same as your underlying database structure. In fact, there doesn’t have to be a database involved at all—there’s a provider model that enables you to present any data through this mechanism, as long as you can find a way to make it look like relational data.

You will normally use WCF Data Services in conjunction with the Entity Framework—you can define the entities you’d like to present over HTTP, and use the framework’s mapping services to bridge between that and the underlying data store. So we’ll be looking at these services in more detail later in the chapter, once we’ve finished exploring the Entity Framework.

The focus of WCF Data Services is different than for the other data access features we’ve discussed so far—it’s mainly about presenting data on the network, where everything else has been about consuming data. However, there’s also a client-side component that provides LINQ-based querying for such services. While it’s part of the WCF Data Services technology, it’s optional—you’re not obliged to use it on the client. And this client doesn’t strictly require WCF Data Services on the server—the client-side parts could be used against any service that exposes data in the same way.

The full .NET Framework is designed to work with a wide range of databases. The simple ADO.NET data access we started with uses interfaces to allow database vendors to supply their own database-specific implementations. Likewise, the Entity Framework is database-agnostic—it has an open provider model designed to allow support for any relational database to be added. Of course, Microsoft ships a provider for its own database, SQL Server, but other suppliers offer providers for various databases, including Oracle, MySQL, PostgreSQL, SQLite, Sybase, and DB2.

In this book, we will use SQL Server. The examples work with SQL Server, which is available for free. (Some editions of Visual Studio will automatically install SQL Server 2008 Express for you by default.) The Express edition of SQL Server is the same database engine as the “real” versions, but with some limits on database size and with some of the more advanced features missing. Despite being a trimmed down version, it’s easily capable of supporting substantial websites. It can also be used on client applications written with WPF or Windows Forms, to support client-side data stores or caching, although it can complicate the installation process for such an application—installing a SQL Server instance is not a trivial task.

If you want to follow the examples in this chapter, not only will you need a copy of SQL Server 2008 Express installed, but you’ll also need to install a sample database. We’ll be using the lightweight version of the Adventure Works database available from http://msftdbprodsamples.codeplex.com/.

Getting this sample up and running is slightly fiddly, because there are numerous different versions of the Adventure Works sample—there are full and lightweight versions for both SQL Server 2005 and SQL Server 2008, and each version of SQL Server comes in various editions, not all of which put their datafiles in the same place. Because of all the variations, it’s quite easy to find that the sample database has failed to appear even though the installation appeared to proceed without error.

Moreover, the steps required to install the database change from time to time, as new versions are released. We had been planning to provide detailed steps here, but while we were writing this book, changes to the database installer rendered the first set of instructions we had produced useless. Since that could well happen again between us finishing the book and you reading it, we’re providing the instructions as part of the sample code you can download for this book from the O’Reilly website so that we can update them when necessary. You can find these at http://oreilly.com/catalog/9780596159832/.

Now that we’ve finished a quick survey of the data access features available in .NET and we’ve seen how to get the sample database installed, let’s look at the Entity Framework in more detail. We’ll start with the model at the heart of the framework.



[34] The name is a little confusing. In a sense, ADO.NET is a successor to ADO (ActiveX Data Objects), a data access system that was around before .NET. So ADO.NET does for . NET what ADO did for Visual Basic 6. But they are quite different technologies—ADO.NET makes no use of ADO, or ActiveX. ADO.NET can use OLE DB, the technology underpinning ADO, but native ADO.NET providers are preferred—the OLE DB provider is mainly for legacy sources.