[Update June 20, 2008: Audio version added:]
You can also download the mp3 file to your iPod, Zune, or other mp3 player: LinqLinqSayNoMore.mp3
An interesting question that someone asked me recently, indicating perhaps that the information from Microsoft about it isn't getting across very well, was "What is LINQ?" To me there are two aspects of LINQ that I find compelling, and in narrowing it down to just those two aspects, I'm probably doing it just as much a disservice as I implied Microsoft were doing just then.
The thing is, there is a lot of information, very detailed information, about certain attributes of LINQ, but it tends to focus on one or maybe two areas like query expressions and LINQ to SQL, and then things like the var keyword get mixed in somehow.
Let's start from the beginning. C# and VB and a lot of other languages we learned at our mother's knee are known as imperative languages. Well, they're also known by a number of other monikers invoking how "typed" they are, but I'll ignore that for now.
Imperative programming languages involve the programmer detailing to the nth degree exactly what the program should do. In essence, the programmer writes a set of commands for the computer to execute. The order of the commands or statements he writes dictates the order in which they are executed. (To be strict, .NET does allow the compiler and JITter to reorganize these statements internally to a certain degree, but since many people don't understand what this can mean, we'll ignore it.)
If the programmer needs to iterate over a bunch of records from the database, for example to find all the records that match some criteria, he has to write a loop of some description, worry about the bounds of the loop, and then work out the statements that need to be executed inside the loop for each record. Of course, he also has to worry about where to put the records that match, which involves other issues and problems that need to be solved.
Compare that mode of working with SQL. This is a declarative language, and furthermore one that operates on sets of data. Here if you want the set of records that match some criteria, you write a WHERE clause in your SQL statement and don't even worry how the database engine works out or stores the set of records you needed. Sure it may be done with a loop, or it may be several loops, one per CPU, or it may be some other mechanism entirely. The result set may be in memory, it may be on disk, or it may be a bunch of pointers to the original records. What matters to the programmer is the end result: what looks to be a set of records that he can then manipulate.
The first aspect of LINQ that I find fascinating is the melding of a query syntax that returns sets of objects (a declarative language) into an imperative language with all its loops and ifs and whatnots.
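To make the contrast concrete, here is a minimal sketch that filters the same in-memory list both ways. The customers array, its Name and SalesTotal properties, and the figures are all invented for illustration, and the code assumes C# top-level statements (C# 9 or later) purely to keep it short.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample data; names and figures are invented for illustration.
var customers = new[]
{
    new { Name = "Acme",    SalesTotal = 2500000m },
    new { Name = "Globex",  SalesTotal = 400000m },
    new { Name = "Initech", SalesTotal = 1000000m },
};

// Imperative: we manage the loop, its bounds, the test, and the storage for matches.
var matchesFromLoop = new List<string>();
for (int i = 0; i < customers.Length; i++)
{
    if (customers[i].SalesTotal >= 1000000m)
    {
        matchesFromLoop.Add(customers[i].Name);
    }
}

// Declarative: we state what we want and leave the iteration to LINQ.
var matchesFromQuery = from c in customers
                       where c.SalesTotal >= 1000000m
                       select c.Name;

Console.WriteLine(string.Join(", ", matchesFromLoop));  // Acme, Initech
Console.WriteLine(string.Join(", ", matchesFromQuery)); // Acme, Initech
```

Both versions produce the same names; the difference is in who worries about the loop and the intermediate list.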
The C# designers have been very clever in this respect. First of all, they introduced the language extensions known as Language Integrated Query (or LINQ) that everyone first thinks of when they see the letters L, I, N, Q. LINQ in this regard is a language within a language, a way of abstracting out the SQLness and the XPathness from their respective query universes.
To go along with this is a much enhanced set of generic interfaces and types, especially centered around IEnumerable<T>. This interface in particular has had new extension methods written (like Where(), Select(), and so on, all within the Enumerable class in System.Linq) that take a delegate, usually implemented as an anonymous method or lambda expression, to act on each item in the collection, and that return another collection. Then they married this up with a type inference system, automatic type creation (anonymous types), and the var keyword.
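As a rough illustration of how these pieces fit together, the sketch below writes one query twice: once in query syntax and once as direct calls to the Where() and Select() extension methods. The sample data and the City property are invented for the example, and the code again assumes C# top-level statements.

```csharp
using System;
using System.Linq;

// Hypothetical sample data, invented for illustration.
var customers = new[]
{
    new { Name = "Acme",    City = "Denver", SalesTotal = 2500000m },
    new { Name = "Globex",  City = "Austin", SalesTotal = 400000m },
    new { Name = "Initech", City = "Austin", SalesTotal = 1000000m },
};

// Query syntax. The elements of the result are of an anonymous type that the
// compiler creates for us, which is why var is needed to declare the variable.
var viaQuery = from c in customers
               where c.City == "Austin"
               select new { c.Name, c.SalesTotal };

// The same query as calls to the Enumerable extension methods, passing lambdas.
var viaMethods = customers
    .Where(c => c.City == "Austin")
    .Select(c => new { c.Name, c.SalesTotal });

foreach (var item in viaMethods)
{
    Console.WriteLine($"{item.Name}: {item.SalesTotal}");
}
```

The query syntax is translated by the compiler into method calls of the second form, so the two are interchangeable; which one reads better is largely a matter of taste.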
All in all, this is the LINQ that people think of, and it is most amazing. The ability to specify a query that will return some collection of objects (sometimes the objects are of some type that has been created specially for this query) and then be able to act on the set in toto using some of the new methods and passing anonymous methods along for the ride too, is a quite staggering achievement.
But that's not all, as they say in the best ads.
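When you write a LINQ query, what happens to it? Well, the one and only thing that happens until run-time is that the compiler parses and compiles the LINQ expression. It does this for two reasons. The first is that LINQ expressions get checked at compile-time and will trigger compile errors should the expression be invalid. The second is that, for providers built on IQueryable<T> such as LINQ to SQL, the compiler converts the LINQ expression into an expression tree, a data structure describing the query that the provider can examine at run-time (more fabulous .NET Framework code has been written to create, navigate and execute expression trees -- they're not linked to LINQ especially). To be strict, queries over plain in-memory IEnumerable<T> collections are compiled into ordinary delegates rather than expression trees, but what follows applies to both.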
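Expression trees can be tried out on their own, without any query at all. The following is a small sketch, assuming nothing beyond System.Linq.Expressions; the isBigSpender name and the threshold are invented for the example.

```csharp
using System;
using System.Linq.Expressions;

// Assigning a lambda to Expression<Func<...>> asks the compiler for a data
// structure that describes the code, rather than for executable code.
Expression<Func<int, bool>> isBigSpender = total => total >= 1000000;

// The tree can be inspected at run-time...
var body = (BinaryExpression)isBigSpender.Body;
Console.WriteLine(body.NodeType);                   // GreaterThanOrEqual
Console.WriteLine(isBigSpender.Parameters[0].Name); // total

// ...and compiled into a delegate only when we actually want to run it.
Func<int, bool> compiled = isBigSpender.Compile();
Console.WriteLine(compiled(2500000)); // True
```

A LINQ provider does much the same inspection, only instead of compiling the tree it walks it and translates it into its own query language.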
Of course, it won't actually execute the expression tree (or, equivalently, execute the query) until run-time, since the data source is only known at that point. But the second most amazing thing about LINQ, in my view, is that the LINQ expression is only executed when it needs to be, at the absolute last possible moment. In other words, if you have
var bestCustomers = from c in customers
                    where c.SalesTotal >= 1000000
                    select c;
Then at run-time bestCustomers will remain in a kind of limbo not exactly full of data -- a bit like Schroedinger's cat, neither here nor there -- until you actually try and do something with it, like print out the customers. If it turns out that, through the execution of the program, you don't actually reference the data in the bestCustomers collection, it won't be there, and the query won't be executed.
Something even more fascinating: if I write this
var superCustomers = from c in bestCustomers
                     where c.SalesTotal >= 10000000
                     select c;
Then this declaration of superCustomers still won't cause the bestCustomers query to be executed. Trying to enumerate superCustomers will cause both queries to be executed, with the one for bestCustomers feeding the one for superCustomers (a provider such as LINQ to SQL may even fold the two into a single query). But if I don't enumerate superCustomers, this won't happen.
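You can watch this happen with in-memory data. The sketch below is only an illustration: it swaps the customer records for a plain array of invented sales totals, and adds a hypothetical IsBest() helper whose only job is to count how often the first filter actually runs.

```csharp
using System;
using System.Linq;

// Invented sales totals standing in for customer records.
int[] salesTotals = { 400000, 1000000, 2500000, 15000000 };

int predicateCalls = 0;

// A filter with a visible side effect, so we can see when it runs.
bool IsBest(int total)
{
    predicateCalls++;
    return total >= 1000000;
}

var bestCustomers = from t in salesTotals
                    where IsBest(t)
                    select t;

var superCustomers = from t in bestCustomers
                     where t >= 10000000
                     select t;

// Both queries are declared, but neither has run yet.
int callsBeforeEnumeration = predicateCalls;
Console.WriteLine(callsBeforeEnumeration); // 0

// Enumerating superCustomers finally pulls data through both filters.
var results = superCustomers.ToList();

Console.WriteLine(predicateCalls); // 4: each total was tested once
Console.WriteLine(results.Count);  // 1: only 15000000 passes both filters
```

Note that calling ToList() again would run the filters again; the query variable holds the recipe, not the results.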
Microsoft refers to this as deferred execution, and MSDN has this to say: "This method [Select()] is implemented using deferred execution. The immediate return value is an object that stores all the information required to perform the action. The query represented by this method is not executed until the object is enumerated either by calling its GetEnumerator method directly or by using foreach in Visual C# or For Each in Visual Basic."
So if you have an amazingly ornate LINQ query expression (and I'm thinking of Luke Hoban's ray tracer code implemented as a LINQ expression as an example), then it will only suck up CPU time if you need the result of the expression. This deferred execution is to me an astounding achievement, something you should strive for in your own code.
Even better: the objects in the collection that results from the LINQ select operator may not all be resolved at once. They may be resolved one by one as you use the enumerator, or even in batches. It doesn't matter to you how it's implemented; all you know is that if you enumerate the objects, you will get them.
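For in-memory sequences, the one-by-one behavior is easy to observe. In this sketch the Source() iterator is a made-up stand-in for a data source of 1000 items, and the produced counter records how many of them were actually handed out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

int produced = 0;

// A hypothetical data source that hands out its items one at a time.
IEnumerable<int> Source()
{
    for (int i = 1; i <= 1000; i++)
    {
        produced++;
        yield return i;
    }
}

// Ask for only the first two even numbers.
var firstTwoEven = Source().Where(n => n % 2 == 0).Take(2).ToList();

Console.WriteLine(string.Join(", ", firstTwoEven)); // 2, 4
Console.WriteLine(produced); // far fewer than 1000
```

Because each item flows through the whole chain before the next one is requested, the source stops being asked for more as soon as Take(2) is satisfied.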
After that, it's all a bit of a downer: the LINQ provider will convert the expression tree into something that its domain understands and then execute that. So, for example, LINQ to SQL will convert the expression tree into an optimized SQL statement and execute it against the database. This is pretty good code too, but for some reason it doesn't float my boat as much as the other two aspects of LINQ.
So there you have it: LINQ in the eyes of the beholder named Julian.