Chapter 11. Files and Streams

Almost all programmers have to deal with storing, retrieving, and processing information in files at some time or another. The .NET Framework provides a number of classes and methods we can use to find, create, read, and write files and directories. In this chapter, we’ll look at some of the most common.

Files, though, are just one example of a broader group of entities that can be opened, read from, and/or written to in a sequential fashion, and then closed. .NET defines a common contract, called a stream, that is offered by all types that can be used in this way. We’ll see how and why we might access a file through a stream, and then we’ll look at some other types of streams, including a special storage medium called isolated storage which lets us save and load information even when we are in a lower-trust environment (such as the Silverlight sandbox). Finally, we’ll look at some of the other stream implementations in .NET by way of comparison. (Streams crop up in all sorts of places, so this chapter won’t be the last we see of them—they’re important in networking, for example.)

Inspecting Directories and Files

We, the authors of this book, have often heard our colleagues ask for a program to help them find duplicate files on their system. Let’s write something to do exactly that. We’ll pass the names of the directories we want to search on the command line, along with an optional switch to determine whether we want to recurse into subdirectories. In the first instance, we’ll do a very basic check for similarity based on filenames and sizes, as these are relatively cheap to compare. Example 11-1 shows our Main method.

Example 11-1. Main method of duplicate file finder

static void Main(string[] args)
{
    bool recurseIntoSubdirectories = false;

    if (args.Length < 1)
    {
        ShowUsage();
        return;
    }

    int firstDirectoryIndex = 0;

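    // See if we're being asked to recurse. (args[0] is safe to read here
    // because we've already checked that there is at least one argument.)
    if (args[0] == "/sub")
    {
        // The switch on its own isn't enough: we also need a directory.
        if (args.Length < 2)
        {
            ShowUsage();
            return;
        }
        recurseIntoSubdirectories = true;
        firstDirectoryIndex = 1;
    }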

    // Get list of directories from command line.
    var directoriesToSearch = args.Skip(firstDirectoryIndex);

    List<FileNameGroup> filesGroupedByName =
        InspectDirectories(recurseIntoSubdirectories, directoriesToSearch);

    DisplayMatches(filesGroupedByName);

    Console.ReadKey();
}

The basic structure is pretty straightforward. First we inspect the command-line arguments to work out which directories we’re searching. Then we call InspectDirectories (shown later) to build a list of all the files in those directories. This groups the files by filename (without the full path) because we do not consider two files to be duplicates if they have different names. Finally, we pass this list to DisplayMatches, which displays any potential matches in the files we have found. DisplayMatches refines our test for duplicates further—it considers two files with the same name to be duplicates only if they have the same size. (That’s not foolproof, of course, but it’s surprisingly effective, and we will refine it further later in the chapter.)

Let’s look at each of these steps in more detail.

The code that parses the command-line arguments checks that we’ve provided at least one directory (in addition to the /sub switch, if present). If we haven’t, it prints some usage instructions using the method shown in Example 11-2.

Example 11-2. Showing command line usage

private static void ShowUsage()
{
    Console.WriteLine("Find duplicate files");
    Console.WriteLine("====================");
    Console.WriteLine(
        "Looks for possible duplicate files in one or more directories");
    Console.WriteLine();
    Console.WriteLine(
        "Usage: findduplicatefiles [/sub] DirectoryName [DirectoryName] ...");
    Console.WriteLine("/sub - recurse into subdirectories");
    Console.ReadKey();
}

The next step is to build a list of files grouped by name. We define a couple of classes for this, shown in Example 11-3. We create a FileNameGroup object for each distinct filename. Each FileNameGroup contains a nested list of FileDetails, providing the full path of each file that has that name, and also the size of that file.

Example 11-3. Types used to keep track of the files we’ve found

class FileNameGroup
{
    public string FileNameWithoutPath { get; set; }
    public List<FileDetails> FilesWithThisName { get; set; }
}

class FileDetails
{
    public string FilePath { get; set; }
    public long FileSize { get; set; }
}

For example, suppose the program searches two folders, c:\One and c:\Two, and suppose both of those folders contain a file called Readme.txt. Our list will contain a FileNameGroup whose FileNameWithoutPath is Readme.txt. Its nested FilesWithThisName list will contain two FileDetails entries, one with a FilePath of c:\One\Readme.txt and the other with c:\Two\Readme.txt. (And each FileDetails will contain the size of the relevant file in FileSize. If these two files really are copies of the same file, their sizes will, of course, be the same.)
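To make that concrete, here is a minimal sketch that builds the Readme.txt group by hand, using the two types from Example 11-3. The size of 1,024 bytes, the ReadmeExample class, and the ContainsPossibleDuplicates helper are all assumptions made for illustration. In particular, the helper is just one plausible way to express the "same name, same size" test, and is not necessarily how the program's DisplayMatches method goes about it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class FileNameGroup
{
    public string FileNameWithoutPath { get; set; }
    public List<FileDetails> FilesWithThisName { get; set; }
}

class FileDetails
{
    public string FilePath { get; set; }
    public long FileSize { get; set; }
}

static class ReadmeExample
{
    // Builds the group described in the text. The file size (1,024 bytes)
    // is a made-up value; the real program reads it from the filesystem.
    public static FileNameGroup BuildReadmeGroup()
    {
        return new FileNameGroup
        {
            FileNameWithoutPath = "Readme.txt",
            FilesWithThisName = new List<FileDetails>
            {
                new FileDetails { FilePath = @"c:\One\Readme.txt", FileSize = 1024 },
                new FileDetails { FilePath = @"c:\Two\Readme.txt", FileSize = 1024 }
            }
        };
    }

    // Hypothetical helper: returns true if at least two files in the group
    // have the same size, making them candidates for being duplicates.
    public static bool ContainsPossibleDuplicates(FileNameGroup group)
    {
        return group.FilesWithThisName
            .GroupBy(details => details.FileSize)
            .Any(sizeGroup => sizeGroup.Count() > 1);
    }

    static void Main()
    {
        FileNameGroup group = BuildReadmeGroup();

        Console.WriteLine(group.FileNameWithoutPath);
        foreach (FileDetails details in group.FilesWithThisName)
        {
            Console.WriteLine("  {0} ({1} bytes)",
                details.FilePath, details.FileSize);
        }
        Console.WriteLine("Possible duplicates: {0}",
            ContainsPossibleDuplicates(group));
    }
}
```

If we changed one of the two FileSize values, the helper would no longer report a possible duplicate, even though both files are still called Readme.txt.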

We build these lists in the InspectDirectories method, which is shown in Example 11-4. This contains the meat of the program, because this is where we search the specified directories for files. Quite a lot of the code is concerned with the logic of the program, but this is also where we start to use some of the file APIs.

Example 11-4. InspectDirectories method

private static List<FileNameGroup> InspectDirectories(
    bool recurseIntoSubdirectories,
    IEnumerable<string> directoriesToSearch)
{
    var searchOption = recurseIntoSubdirectories ?
        SearchOption.AllDirectories : SearchOption.TopDirectoryOnly;

    // Get the path of every file in every directory we're searching.
    var allFilePaths = from directory in directoriesToSearch
                       from file in Directory.GetFiles(directory, "*.*",
                                                        searchOption)
                       select file;

    // Group the files by local filename (i.e. the filename without the
    // containing path), and for each filename, build a list containing the
    // details for every file that has that filename.
    var fileNameGroups = from filePath in allFilePaths
                         let fileNameWithoutPath = Path.GetFileName(filePath)
                         group filePath by fileNameWithoutPath into nameGroup
                         select new FileNameGroup
                         {
                             FileNameWithoutPath = nameGroup.Key,
                             FilesWithThisName =
                              (from filePath in nameGroup
                               let info = new FileInfo(filePath)
                               select new FileDetails
                               {
                                   FilePath = filePath,
                                   FileSize = info.Length
                               }).ToList()
                         };

    return fileNameGroups.ToList();
}

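To get Example 11-4 to compile, you’ll need to add the following using directive at the top of the file (in addition to the directives for System.Collections.Generic and System.Linq, which the code also relies on):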

using System.IO;

Three parts of Example 11-4 use the System.IO namespace to work with files and directories: the call to Directory.GetFiles, the call to Path.GetFileName, and the use of FileInfo to obtain each file’s length. We’ll start by looking at the use of the Directory class.
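Before we do, it may help to see the System.IO calls from Example 11-4 in isolation. The following is a rough sketch rather than part of the duplicate finder: it creates a scratch directory under the system's temporary folder, writes two small files into it, and then lists them. The SystemIoSketch class, the CountFiles helper, the directory layout, and the file contents are all assumptions made for illustration.

```csharp
using System;
using System.IO;

static class SystemIoSketch
{
    // Hypothetical helper: counts the files found under 'root', either in
    // that directory alone or in its whole subtree, depending on 'option'.
    public static int CountFiles(string root, SearchOption option)
    {
        return Directory.GetFiles(root, "*.*", option).Length;
    }

    static void Main()
    {
        // Scratch directory with a randomly generated name, containing
        // one subdirectory.
        string root = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        string sub = Path.Combine(root, "Sub");
        Directory.CreateDirectory(sub);

        try
        {
            File.WriteAllText(Path.Combine(root, "Readme.txt"), "hello");
            File.WriteAllText(Path.Combine(sub, "Readme.txt"), "hello");

            // GetFiles returns full paths; Path.GetFileName strips the
            // containing directory, and FileInfo.Length gives the size.
            foreach (string filePath in Directory.GetFiles(
                root, "*.*", SearchOption.AllDirectories))
            {
                var info = new FileInfo(filePath);
                Console.WriteLine("{0} -> {1} ({2} bytes)",
                    filePath, Path.GetFileName(filePath), info.Length);
            }

            Console.WriteLine("Top directory only: {0} file(s)",
                CountFiles(root, SearchOption.TopDirectoryOnly));
            Console.WriteLine("All directories: {0} file(s)",
                CountFiles(root, SearchOption.AllDirectories));
        }
        finally
        {
            // Remove the scratch directory and everything in it.
            Directory.Delete(root, true);
        }
    }
}
```

The two counts differ because SearchOption.TopDirectoryOnly ignores the file in the subdirectory, which is exactly the distinction the /sub switch controls in our program.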