Understanding the ingest_directory() function

The ingest_directory() function handles the input mode for our script and recursively captures the metadata of files from a user-supplied root directory. On line 158, we create our database cursor, followed on line 159 by a count variable that tracks the number of files stored in the Files table:

148 def ingest_directory(conn, target, custodian_id):
149     """
150     The ingest_directory function reads file metadata and stores
151     it in the database
152     :param conn: The sqlite3 database connection object
153     :param target: The path for the root directory to
154         recursively walk
155     :param custodian_id: The custodian ID
156     :return: None
157     """
158     cur = conn.cursor()
159     count = 0

The most important part of this function is the for loop on line 160. This loop uses the os.walk() function to traverse the provided directory tree, yielding one tuple for each directory it visits. Each tuple has three components, generally named root, folders, and files. The root value is a string holding the path of the directory being visited during the current loop iteration; as we descend into subfolders, their names are appended to it. The folders and files variables provide lists of the folder names and filenames within the current root, respectively. Although these variables may be renamed as you see fit, this naming convention avoids shadowing built-in names such as dir (and file, which is a built-in in Python 2). In this instance, though, we do not need the folders list from os.walk(), so we will name it as a single underscore (_):

160     for root, _, files in os.walk(target):

Binding an unused value to a single underscore is a common convention. To keep that signal clear, reserve the single underscore for data you do not use. Where possible, design your own functions so that they do not return unwanted values.
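To make the shape of the values produced by os.walk() concrete, here is a minimal sketch that walks a throwaway directory tree. The list_files() helper and the temporary files are assumptions made for illustration only; they are not part of the script discussed in this chapter.

```python
import os
import tempfile


def list_files(target):
    """Return the full path of every file found beneath target (hypothetical helper)."""
    found = []
    # os.walk() yields one (root, folders, files) tuple per directory visited.
    # The folders list is not needed here, so it is bound to an underscore.
    for root, _, files in os.walk(target):
        for file_name in files:
            found.append(os.path.join(root, file_name))
    return sorted(found)


# Build a small temporary tree to walk: a.txt and sub/b.txt
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    for rel in ("a.txt", os.path.join("sub", "b.txt")):
        with open(os.path.join(tmp, rel), "w") as handle:
            handle.write("example")
    # Strip the temporary prefix so the result is easy to read
    walked = [os.path.relpath(path, tmp) for path in list_files(tmp)]
    print(walked)
```

Note that the file inside sub is reported with sub already joined onto its root, which is the behavior described above.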

Within the loop, we begin iterating over the files list to access information about each file. On line 162, we create a file-specific dictionary, meta_data, to store the collected information, as follows:

161         for file_name in files:
162             meta_data = dict()

On line 163, we use a try-except clause to catch any exceptions. We know we said not to do that, but hear us out first. This catch-all is in place so that an error caused by any single discovered file does not crash the script. Instead, the filename and error are written to the log, the file is skipped, and execution continues. This can help an examiner quickly locate and troubleshoot specific files. This matters because filesystem flags and naming conventions on Windows systems can cause errors in Python, while macOS and Linux/UNIX systems produce different errors, making it hard to predict every situation in which the script could crash. This is an excellent example of why logging is important, as we can review the errors that our script has generated.
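The log-and-continue pattern described here can be sketched as follows. This is only an illustration under stated assumptions: the collect_sizes() helper, the logger name, and the use of os.stat() are hypothetical stand-ins, and the except clause of the actual script is not shown in this section.

```python
import logging
import os
import tempfile

# Hypothetical logger name; the real script configures its own logging
logger = logging.getLogger("ingest_sketch")


def collect_sizes(paths):
    """Return a {path: size} dictionary and the number of paths skipped."""
    sizes = {}
    skipped = 0
    for path in paths:
        try:
            sizes[path] = os.stat(path).st_size
        except Exception as e:
            # Catch-all: record which file failed and why, then move on
            # to the next file instead of letting the script crash.
            logger.error("Could not process %s: %s", path, e)
            skipped += 1
            continue
    return sizes, skipped


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.txt")
    with open(good, "w") as handle:
        handle.write("12345")
    missing = os.path.join(tmp, "missing.txt")  # deliberately never created
    sizes, skipped = collect_sizes([good, missing])
    print(len(sizes), skipped)
```

The missing file triggers an exception, but the loop still processes the remaining paths and the log entry records which file to investigate.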

Within the try-except clause, we store the different properties of the file's metadata under dictionary keys. To begin, we record the filename and full path on lines 164 and 165. Note how the dictionary keys share their names with the columns they belong to in the Files table. This format will make our lives easier later in the script. The file path is built with the os.path.join() function, which combines separate path components into a single path using the operating system's specific path separator.

On line 167, we gather the file extension by using the os.path.splitext() function, which splits the filename at the last . it contains. Since this function returns a two-item tuple containing the base name and the extension, we select the last element to ensure that we store the extension. In some situations, the file may not have an extension (for example, a .DS_Store file), in which case the last value in the returned tuple is an empty string. Be aware that this script does not check file signatures to confirm that the file type matches the extension; the process of checking file signatures can be automated:

163             try:
164                 meta_data['file_name'] = file_name
165                 meta_data['file_path'] = os.path.join(root,
166                     file_name)
167                 meta_data['extension'] = os.path.splitext(
168                     file_name)[-1]
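To see how os.path.splitext() treats different filenames, the following sketch applies the same three assignments to a few sample names. The describe() helper, the evidence root, and the sample filenames are assumptions chosen for illustration.

```python
import os


def describe(root, file_name):
    """Build the same three dictionary keys for one file (hypothetical helper)."""
    meta_data = dict()
    meta_data['file_name'] = file_name
    meta_data['file_path'] = os.path.join(root, file_name)
    # splitext() returns a (base, extension) tuple; [-1] selects the extension
    meta_data['extension'] = os.path.splitext(file_name)[-1]
    return meta_data


# Sample names: a normal file, a double extension, a dotfile, and no extension
names = ("report.pdf", "archive.tar.gz", ".DS_Store", "README")
examples = {name: describe("evidence", name)['extension'] for name in names}

for name, extension in examples.items():
    print(repr(name), '->', repr(extension))
```

Only the final suffix is captured for archive.tar.gz, and both .DS_Store and README yield an empty string because a leading period is not treated as the start of an extension.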