LINQ

LINQ, or Language Integrated Query, allows you to write structured type-safe queries over local object collections and remote data sources.

LINQ lets you query any collection implementing IEnumerable<>, whether an array, list, XML DOM, or remote data source (such as a table in SQL Server). LINQ offers the benefits of both compile-time type checking and dynamic query composition.

Note

A good way to experiment with LINQ is to download LINQPad. LINQPad lets you interactively query local collections and SQL databases in LINQ without any setup and is preloaded with numerous examples.

The basic units of data in LINQ are sequences and elements. A sequence is any object that implements the generic IEnumerable<T> interface, and an element is each item in the sequence. In the following example, names is a sequence, and Tom, Dick, and Harry are elements:

string[] names = { "Tom", "Dick", "Harry" };

A sequence such as this we call a local sequence because it represents a local collection of objects in memory.

A query operator is a method that transforms a sequence. A typical query operator accepts an input sequence and emits a transformed output sequence. The Enumerable class in System.Linq contains around 40 query operators, all implemented as static extension methods. These are called standard query operators.
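To make the idea of a query operator concrete, here is a minimal sketch of a filtering operator. MyWhere is a hypothetical name, and it is written as a local function for brevity; the real Enumerable.Where is an extension method and is implemented differently in detail.

```csharp
using System;
using System.Collections.Generic;

string[] names = { "Tom", "Dick", "Harry" };

// Hypothetical operator (not part of System.Linq): accepts an input
// sequence and emits only the elements that satisfy the predicate.
IEnumerable<T> MyWhere<T> (IEnumerable<T> source, Func<T,bool> predicate)
{
  foreach (T element in source)
    if (predicate (element))
      yield return element;
}

string filtered = string.Join ("|", MyWhere (names, n => n.Length > 3));
Console.WriteLine (filtered);    // Dick|Harry
```

Because the sketch uses yield return, it also runs only when its output is enumerated.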

An important feature of many query operators is that they execute not when constructed, but when enumerated (in other words, when MoveNext is called on its enumerator). Consider the following query:

var numbers = new List<int>();
numbers.Add (1);

IEnumerable<int> query = numbers.Select (n => n * 10);
numbers.Add (2);    // Sneak in an extra element

foreach (int n in query)
  Console.Write (n + "|");          // 10|20|

The extra number that we sneaked into the list after constructing the query is included in the result, because the query does not actually execute until the foreach statement runs. This is called deferred or lazy evaluation. Deferred execution decouples query construction from query execution, allowing you to construct a query in several steps, as well as making it possible to query a database without retrieving all the rows to the client. All standard query operators provide deferred execution, with the following exceptions: operators that return a single element or scalar value (the element operators, aggregation operators, and quantifiers), and the conversion operators ToArray, ToList, ToDictionary, and ToLookup.

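The conversion operators are useful, in part, because they defeat lazy evaluation. This can be useful when you want to freeze or cache the results at a certain point in time, or when you want to avoid re-executing a computationally intensive query or a query against a remote data source.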

The following example illustrates the ToList operator:

var numbers = new List<int>() { 1, 2 };

List<int> timesTen = numbers
  .Select (n => n * 10)
  .ToList();    // Executes immediately into a List<int>

numbers.Clear();
Console.WriteLine (timesTen.Count);      // Still 2
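The contrast between a deferred query and a ToList snapshot can be seen side by side in the following sketch. The variable names (deferred, snapshot, live, frozen) are illustrative only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var numbers = new List<int> { 1, 2 };

IEnumerable<int> deferred = numbers.Select (n => n * 10);  // Not yet executed
List<int> snapshot = deferred.ToList();                    // Executes now

numbers.Add (3);

// The deferred query is re-evaluated each time it is enumerated,
// so it sees the new element; the snapshot does not.
string live   = string.Join ("|", deferred);    // 10|20|30
string frozen = string.Join ("|", snapshot);    // 10|20

Console.WriteLine (live + "  " + frozen);
```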

The standard query operators (as implemented in the System.Linq.Enumerable class) can be divided into 12 categories, summarized in Table 1-1.

Tables 1-2 through 1-13 summarize each of the query operators. The operators shown in bold have special support in C# (see Query Expressions).

In addition to these, Framework 4.0 provides a new Zip operator, which enumerates two sequences in step (like a zipper), returning a sequence based on applying a function over each element pair.
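A short sketch of Zip follows; the sample data is made up for illustration. Note that enumeration stops when the shorter sequence is exhausted, so the extra element in words is never used.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

int[] numbers  = { 3, 5, 7 };
string[] words = { "three", "five", "seven", "ignored" };

// Pair up elements in step and combine each pair with the lambda.
IEnumerable<string> zip = numbers.Zip (words, (n, w) => n + "=" + w);

string zipped = string.Join ("|", zip);
Console.WriteLine (zipped);      // 3=three|5=five|7=seven
```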

To build more complex queries, you chain query operators together. For example, the following query extracts all strings containing the letter “a,” sorts them by length, and then converts the results to uppercase:

string[] names = { "Tom","Dick","Harry","Mary","Jay" };

IEnumerable<string> query = names
  .Where   (n => n.Contains ("a"))
  .OrderBy (n => n.Length)
  .Select  (n => n.ToUpper());

foreach (string name in query)
  Console.Write (name + "|");

// RESULT: JAY|MARY|HARRY|

Where, OrderBy, and Select are all standard query operators that resolve to extension methods in the Enumerable class. The Where operator emits a filtered version of the input sequence; OrderBy emits a sorted version of its input sequence; Select emits a sequence where each input element is transformed or projected with a given lambda expression (n.ToUpper(), in this case). Data flows from left to right through the chain of operators, so the data is first filtered, then sorted, then projected. The end result resembles a production line of conveyor belts, as illustrated in Figure 1-6.

Deferred execution is honored throughout the chain of operators, so no filtering, sorting, or projecting takes place until the query is actually enumerated.
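One way to observe this is to build the same query in separate steps and count how often the filter runs. This is only a sketch; the counter (calls) and the intermediate variable names are assumptions introduced for the demonstration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

string[] names = { "Tom","Dick","Harry","Mary","Jay" };
int calls = 0;    // Counts how many times the Where predicate runs

IEnumerable<string> filtered =
  names.Where (n => { calls++; return n.Contains ("a"); });
IEnumerable<string> sorted     = filtered.OrderBy (n => n.Length);
IEnumerable<string> finalQuery = sorted.Select (n => n.ToUpper());

int callsBeforeEnumeration = calls;              // Still 0: nothing has run

string result = string.Join ("|", finalQuery);   // Enumeration happens here

Console.WriteLine (callsBeforeEnumeration);      // 0
Console.WriteLine (calls);                       // 5
Console.WriteLine (result);                      // JAY|MARY|HARRY
```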

So far, we’ve written queries by calling extension methods in the Enumerable class. In this book, we describe this as fluent syntax. C# also provides special language support for writing queries, called query expressions. Here’s the preceding query expressed as a query expression:

using System.Linq;
...

string[] names = { "Tom","Dick","Harry","Mary","Jay" };

IEnumerable<string> query =
  from n in names
  where n.Contains ("a")
  orderby n.Length
  select n.ToUpper();

A query expression always starts with a from clause, and ends with either a select or group clause. The from clause declares a range variable (in this case, n) which you can think of as traversing the input collection—rather like foreach. Figure 1-7 illustrates the complete syntax.

The compiler processes query expressions by translating them to fluent syntax. It does this in a fairly mechanical fashion—much like it translates foreach statements into calls to GetEnumerator and MoveNext:

IEnumerable<string> query = names
  .Where   (n => n.Contains ("a"))
  .OrderBy (n => n.Length)
  .Select  (n => n.ToUpper());

The Where, OrderBy, and Select operators then resolve, using the same rules that would apply if the query were written in fluent syntax. In this case, they bind to extension methods in the Enumerable class (assuming you’ve imported the System.Linq namespace) because names implements IEnumerable<string>. The compiler doesn’t specifically favor the Enumerable class, however, when translating query syntax. You can think of the compiler as mechanically injecting the words “Where,” “OrderBy,” and “Select” into the statement, and then compiling it as though you’d typed the method names yourself. This offers flexibility in how they resolve—the operators in LINQ to SQL and Entity Framework queries, for instance, bind instead to the extension methods in the Queryable class.

The let keyword introduces a new variable alongside the range variable. For instance, suppose we want to list all names whose length without vowels is greater than two characters:

string[] names = { "Tom","Dick","Harry","Mary","Jay" };

IEnumerable<string> query =
  from n in names
  let vowelless = Regex.Replace (n, "[aeiou]", "")
  where vowelless.Length > 2
  orderby vowelless
  select n + " - " + vowelless;

The output from enumerating this query is:

Dick - Dck
Harry - Hrry
Mary - Mry

The let clause performs a calculation on each element, without losing the original element. In our query, the subsequent clauses (where, orderby, and select) have access to both n and vowelless. A query can include any number of let clauses, and they can be interspersed with additional where and join clauses.

The compiler translates the let keyword by projecting into a temporary anonymous type that contains both the original and transformed elements:

IEnumerable<string> query = names
 .Select (n => new
   {
     n = n,
     vowelless = Regex.Replace (n, "[aeiou]", "")
   }
 )
 .Where (temp0 => (temp0.vowelless.Length > 2))
 .OrderBy (temp0 => temp0.vowelless)
 .Select (temp0 => ((temp0.n + " - ") + temp0.vowelless))

If you want to add clauses after a select or group clause, you must use the into keyword to “continue” the query. For instance:

from c in "The quick brown tiger".Split()
select c.ToUpper() into upper
where upper.StartsWith ("T")
select upper

// RESULT: "THE", "TIGER"

Following an into clause, the previous range variable is out of scope.

The compiler translates queries with an into keyword simply into a longer chain of operators:

"The quick brown tiger".Split()
  .Select (c => c.ToUpper())
  .Where (upper => upper.StartsWith ("T"))

(It omits the final Select(upper=>upper) because it’s redundant.)

A query can include multiple generators (from clauses). For example:

int[] numbers = { 1, 2, 3 };
string[] letters = { "a", "b" };

IEnumerable<string> query = from n in numbers
                            from l in letters
                            select n.ToString() + l;

The result is a cross product, rather like you’d get with nested foreach loops:

"1a", "1b", "2a", "2b", "3a", "3b"

When there’s more than one from clause in a query, the compiler emits a call to SelectMany:

IEnumerable<string> query = numbers.SelectMany (
  n => letters,
  (n, l) => (n.ToString() + l));

SelectMany performs nested looping. It enumerates every element in the source collection (numbers), transforming each element with the first lambda expression (letters). This generates a sequence of subsequences, which it then enumerates. The final output elements are determined by the second lambda expression (n.ToString()+l).
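The nested-loop behavior described above can be checked with a small sketch that builds the same result both ways. The names viaLoops and viaSelectMany are illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

int[] numbers = { 1, 2, 3 };
string[] letters = { "a", "b" };

// Hand-written nested loops
var viaLoops = new List<string>();
foreach (int num in numbers)
  foreach (string letter in letters)
    viaLoops.Add (num.ToString() + letter);

// The equivalent SelectMany call
IEnumerable<string> viaSelectMany = numbers.SelectMany (
  n => letters,
  (n, l) => n.ToString() + l);

bool same = viaLoops.SequenceEqual (viaSelectMany);
Console.WriteLine (string.Join (",", viaLoops));   // 1a,1b,2a,2b,3a,3b
Console.WriteLine (same);                          // True
```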

If you subsequently apply a where clause, you can filter the cross product and project a result akin to a join:

string[] players = { "Tom", "Jay", "Mary" };

IEnumerable<string> query =
  from name1 in players
  from name2 in players
  where name1.CompareTo (name2) < 0
  orderby name1, name2
  select name1 + " vs " + name2;

RESULT: { "Jay vs Mary", "Jay vs Tom", "Mary vs Tom" }

The translation of this query into fluent syntax is more complex, requiring a temporary anonymous projection. The ability to perform this translation automatically is one of the key benefits of query expressions.
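As a rough illustration, the players query could be written by hand in fluent syntax along the following lines. This approximates what the compiler produces rather than reproducing it exactly; the lambda parameter name pair and the variable fluentQuery are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

string[] players = { "Tom", "Jay", "Mary" };

IEnumerable<string> fluentQuery = players
  // Carry both range variables forward in a temporary anonymous type
  .SelectMany (name1 => players, (name1, name2) => new { name1, name2 })
  .Where   (pair => pair.name1.CompareTo (pair.name2) < 0)
  .OrderBy (pair => pair.name1)
  .ThenBy  (pair => pair.name2)
  .Select  (pair => pair.name1 + " vs " + pair.name2);

string matches = string.Join ("|", fluentQuery);
Console.WriteLine (matches);   // Jay vs Mary|Jay vs Tom|Mary vs Tom
```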

The expression in the second generator is allowed to use the first range variable:

string[] fullNames =
  { "Anne Williams", "John Fred Smith", "Sue Green" };

IEnumerable<string> query =
  from fullName in fullNames
  from name in fullName.Split()
  select name + " came from " + fullName;

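Anne came from Anne Williams
Williams came from Anne Williams
John came from John Fred Smith
Fred came from John Fred Smith
Smith came from John Fred Smith
Sue came from Sue Green
Green came from Sue Green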

This works because the expression fullName.Split() emits a sequence (an array of strings).

Multiple generators are used extensively in database queries, to flatten parent-child relationships and to perform manual joins.

LINQ provides joining operators for performing keyed lookup-based joins. The joining operators support only a subset of the functionality you get with multiple generators/SelectMany, but are more performant with local queries because they use a hashtable-based lookup strategy rather than performing nested loops. (With LINQ to SQL and Entity Framework queries, the joining operators have no advantage over multiple generators.)

The joining operators support equi-joins only (i.e., the joining condition must use the equality operator). There are two methods: Join and GroupJoin. Join emits a flat result set whereas GroupJoin emits a hierarchical result set.

The syntax for a flat join is:

from outer-var in outer-sequence
join inner-var in inner-sequence
  on outer-key-expr equals inner-key-expr

For example, given the following collections:

var customers = new[]
{
      new { ID = 1, Name = "Tom" },
      new { ID = 2, Name = "Dick" },
      new { ID = 3, Name = "Harry" }
};
var purchases = new[]
{
      new { CustomerID = 1, Product = "House" },
      new { CustomerID = 2, Product = "Boat" },
      new { CustomerID = 2, Product = "Car" },
      new { CustomerID = 3, Product = "Holiday" }
};

we could perform a join as follows:

IEnumerable<string> query =
  from c in customers
  join p in purchases on c.ID equals p.CustomerID
  select c.Name + " bought a " + p.Product;

The compiler translates this to:

customers.Join (                // outer collection
  purchases,                    // inner collection
  c => c.ID,                    // outer key selector
  p => p.CustomerID,            // inner key selector
  (c, p) =>                     // result selector
     c.Name + " bought a " + p.Product
);

Here’s the result:

Tom bought a House
Dick bought a Boat
Dick bought a Car
Harry bought a Holiday

With local sequences, the join operators are more efficient at processing large collections than SelectMany because they first preload the inner sequence into a keyed hashtable-based lookup. With a database query, however, you could achieve the same result equally efficiently, as follows:

from c in customers
from p in purchases
where c.ID == p.CustomerID
select c.Name + " bought a " + p.Product;
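The lookup strategy that makes local joins efficient can be sketched by hand with ToLookup. This is an illustration of the idea under that assumption, not the actual implementation of Enumerable.Join; the variable names lookup and results are made up.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var customers = new[]
{
  new { ID = 1, Name = "Tom" },
  new { ID = 2, Name = "Dick" },
  new { ID = 3, Name = "Harry" }
};
var purchases = new[]
{
  new { CustomerID = 1, Product = "House" },
  new { CustomerID = 2, Product = "Boat" },
  new { CustomerID = 2, Product = "Car" },
  new { CustomerID = 3, Product = "Holiday" }
};

// Preload the inner sequence once into a keyed lookup...
var lookup = purchases.ToLookup (p => p.CustomerID);

// ...then probe it for each outer element instead of rescanning purchases.
// A missing key yields an empty sequence rather than an exception.
var results = new List<string>();
foreach (var c in customers)
  foreach (var p in lookup [c.ID])
    results.Add (c.Name + " bought a " + p.Product);

foreach (string line in results)
  Console.WriteLine (line);
```

The output matches that of the Join query shown earlier.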

GroupJoin does the same work as Join, but instead of yielding a flat result, it yields a hierarchical result, grouped by each outer element.

The query expression syntax for GroupJoin is the same as for Join, but is followed by the into keyword. Here’s a basic example, using the customers and purchases collections we set up in the previous section:

var query =                 // Element types are anonymous, so use var
  from c in customers
  join p in purchases on c.ID equals p.CustomerID
  into custPurchases
  select custPurchases;   // custPurchases is a sequence

Note

An into clause translates to GroupJoin only when it appears directly after a join clause. After a select or group clause it means query continuation. The two uses of the into keyword are quite different, although they have one feature in common: they both introduce a new query variable.

The result is a sequence of sequences, which we could enumerate as follows:

foreach (var purchaseSequence in query)
  foreach (var p in purchaseSequence)
    Console.WriteLine (p.Product);

This isn’t very useful, however, because purchaseSequence has no reference to the outer customer. More commonly, you’d reference the outer range variable in the projection:

from c in customers
join p in purchases on c.ID equals p.CustomerID
into custPurchases
select new { CustName = c.Name, custPurchases };

We could obtain the same result (but less efficiently, for local queries) by projecting into an anonymous type which included a subquery:

from c in customers
select new
{
  CustName = c.Name,
  custPurchases =
    purchases.Where (p => c.ID == p.CustomerID)
}
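For reference, the join...into query that projects CustName and custPurchases corresponds roughly to a direct GroupJoin call such as the one sketched here. The formatting of the output lines, and the names groupJoinQuery and lines, are assumptions added for the demonstration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var customers = new[]
{
  new { ID = 1, Name = "Tom" },
  new { ID = 2, Name = "Dick" },
  new { ID = 3, Name = "Harry" }
};
var purchases = new[]
{
  new { CustomerID = 1, Product = "House" },
  new { CustomerID = 2, Product = "Boat" },
  new { CustomerID = 2, Product = "Car" },
  new { CustomerID = 3, Product = "Holiday" }
};

var groupJoinQuery = customers.GroupJoin (
  purchases,                        // inner collection
  c => c.ID,                        // outer key selector
  p => p.CustomerID,                // inner key selector
  (c, custPurchases) =>             // result selector: one result per customer
    new { CustName = c.Name, custPurchases });

var lines = new List<string>();
foreach (var item in groupJoinQuery)
  lines.Add (item.CustName + ": " +
    string.Join (", ", item.custPurchases.Select (p => p.Product)));

foreach (string line in lines)
  Console.WriteLine (line);

// Tom: House
// Dick: Boat, Car
// Harry: Holiday
```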

The orderby keyword sorts a sequence. You can specify any number of expressions upon which to sort:

string[] names = { "Tom","Dick","Harry","Mary","Jay" };

IEnumerable<string> query = from n in names
                            orderby n.Length, n
                            select n;

This sorts first by length, then name, so the result is:

Jay, Tom, Dick, Mary, Harry

The compiler translates the first orderby expression to a call to OrderBy, and subsequent expressions to a call to ThenBy:

IEnumerable<string> query = names
  .OrderBy (n => n.Length)
  .ThenBy (n => n)

The ThenBy operator refines (not replaces) the previous sorting.
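The difference between refining and replacing shows up if you compare ThenBy with a second OrderBy call. The following sketch uses the same names array; refined and replaced are illustrative variable names.

```csharp
using System;
using System.Linq;

string[] names = { "Tom","Dick","Harry","Mary","Jay" };

// ThenBy keeps the length ordering and breaks ties alphabetically.
string refined = string.Join (",",
  names.OrderBy (n => n.Length).ThenBy (n => n));

// A second OrderBy re-sorts the whole sequence alphabetically,
// so length is no longer the primary sort key.
string replaced = string.Join (",",
  names.OrderBy (n => n.Length).OrderBy (n => n));

Console.WriteLine (refined);     // Jay,Tom,Dick,Mary,Harry
Console.WriteLine (replaced);    // Dick,Harry,Jay,Mary,Tom
```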

You can include the descending keyword after any of the orderby expressions:

orderby n.Length descending, n

This translates to:

.OrderByDescending (n => n.Length).ThenBy (n => n)

GroupBy organizes a flat input sequence into sequences of groups. For example, the following groups a sequence of names by their length:

string[] names = { "Tom","Dick","Harry","Mary","Jay" };

var query = from name in names
            group name by name.Length;

The compiler translates this query into this:

IEnumerable<IGrouping<int,string>> query =
  names.GroupBy (name => name.Length);

Here’s how to enumerate the result:

foreach (IGrouping<int,string> grouping in query)
{
  Console.Write ("\r\n Length=" + grouping.Key + ":");
  foreach (string name in grouping)
    Console.Write (" " + name);
}

 Length=3: Tom Jay
 Length=4: Dick Mary
 Length=5: Harry

Enumerable.GroupBy works by reading the input elements into a temporary dictionary of lists so that all elements with the same key end up in the same sublist. It then emits a sequence of groupings. A grouping is a sequence with a Key property:

public interface IGrouping <TKey,TElement>
  : IEnumerable<TElement>, IEnumerable
{
  // Key applies to the subsequence as a whole
  TKey Key { get; }
}
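The dictionary-of-lists idea can be sketched by hand as follows. This is a simplified illustration, not the actual Enumerable.GroupBy implementation; the names groups and keyOrder are assumptions, and keyOrder exists only to reproduce the first-seen ordering of keys.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

string[] names = { "Tom","Dick","Harry","Mary","Jay" };

var groups   = new Dictionary<int, List<string>>();
var keyOrder = new List<int>();       // Keys in order of first appearance

foreach (string name in names)
{
  int key = name.Length;
  if (!groups.TryGetValue (key, out var list))
  {
    list = new List<string>();        // First element with this key
    groups [key] = list;
    keyOrder.Add (key);
  }
  list.Add (name);                    // Same key, same sublist
}

List<string> lines = keyOrder
  .Select (k => "Length=" + k + ": " + string.Join (" ", groups [k]))
  .ToList();

foreach (string line in lines)
  Console.WriteLine (line);

// Length=3: Tom Jay
// Length=4: Dick Mary
// Length=5: Harry
```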

By default, the elements in each grouping are untransformed input elements, unless you specify an elementSelector argument. The following projects each input element to uppercase:

from name in names
group name.ToUpper() by name.Length

which translates to this:

names.GroupBy (
  name => name.Length,
  name => name.ToUpper() )

The subcollections are not emitted in order of key. GroupBy does no sorting (in fact, it preserves the original ordering). To sort, you must add an OrderBy operator (which means first adding an into clause, because group...by ordinarily ends a query):

from name in names
group name.ToUpper() by name.Length into grouping
orderby grouping.Key
select grouping

Query continuations are often used in a group...by query. The next query keeps only the groups that have exactly two elements in them:

from name in names
group name.ToUpper() by name.Length into grouping
where grouping.Count() == 2
select grouping

OfType and Cast accept a nongeneric IEnumerable collection and emit a generic IEnumerable<T> sequence that you can subsequently query:

var classicList = new System.Collections.ArrayList();
classicList.AddRange ( new int[] { 3, 4, 5 } );
IEnumerable<int> sequence1 = classicList.Cast<int>();

This is useful because it allows you to query collections written prior to C# 2.0 (when IEnumerable<T> was introduced), such as ControlCollection in System.Windows.Forms.

Cast and OfType differ in their behavior when encountering an input element that’s of an incompatible type: Cast throws an exception whereas OfType ignores the incompatible element.
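A small sketch makes the difference visible. The incompatible element (a string added to a list of boxed ints) and the variable names are chosen purely for illustration.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

var classicList = new ArrayList();
classicList.AddRange (new int[] { 3, 4, 5 });
classicList.Add ("six");                    // Incompatible with int

// OfType silently skips the string.
string viaOfType = string.Join (",", classicList.OfType<int>());

// Cast throws when enumeration reaches the string.
bool castThrew = false;
try
{
  List<int> viaCast = classicList.Cast<int>().ToList();
}
catch (InvalidCastException)
{
  castThrew = true;
}

Console.WriteLine (viaOfType);    // 3,4,5
Console.WriteLine (castThrew);    // True
```

Because Cast is itself deferred, the exception surfaces only when the sequence is enumerated (here, by ToList).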

The rules for element compatibility follow those of C#’s is operator. Here’s the internal implementation of Cast:

public static IEnumerable<TSource> Cast <TSource>
             (IEnumerable source)
{
  foreach (object element in source)
    yield return (TSource)element;
}

C# supports the Cast operator in query expressions: simply insert the element type immediately after the from keyword:

from int x in classicList ...

This translates to:

from x in classicList.Cast <int>() ...