Loading the data

Before we can perform computations on the data, it must be loaded from a storage location (usually a database or a real-time data feed) into a computing workspace. Workspaces allow the user to manipulate the data and build models using popular languages and frameworks such as R, Python, Hadoop, and Spark. Many commercial databases have specialized functionality to facilitate loading data into workspaces. The languages used for machine learning also have functions that read from text files and that connect to and read from databases. Sometimes the user may prefer to perform data quality control and cleansing directly in the database. This typically includes steps such as building a patient index, normalizing the data, and cleaning the data. In Chapter 4, Computing Foundations – Databases, we discuss the manipulation of databases using the Structured Query Language (SQL), and in Chapter 5, Computing Foundations – Introduction to Python, we discuss methods for loading the data into a Python workspace.
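As a rough illustration of the two loading routes mentioned above (reading from a text file and reading from a database), the following minimal sketch uses only Python's standard library. The table name patients, the columns patient_id and age, the sample values, and the helper functions load_from_text and load_from_database are all hypothetical and chosen only for illustration; an in-memory SQLite database and an in-memory string stand in for a real database and a real file.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for the contents of a text file.
CSV_TEXT = "patient_id,age\n1,34\n2,58\n"


def load_from_text(text):
    """Parse delimited text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def load_from_database(conn):
    """Query the (hypothetical) patients table and return rows as dicts."""
    conn.row_factory = sqlite3.Row  # lets us access columns by name
    cursor = conn.execute(
        "SELECT patient_id, age FROM patients ORDER BY patient_id"
    )
    return [dict(row) for row in cursor.fetchall()]


# Build a small in-memory database so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patients (patient_id INTEGER PRIMARY KEY, age INTEGER)"
)
conn.executemany("INSERT INTO patients VALUES (?, ?)", [(1, 34), (2, 58)])

text_rows = load_from_text(CSV_TEXT)  # values arrive as strings
db_rows = load_from_database(conn)    # values keep their database types
conn.close()
```

Note that the text route yields every value as a string, while the database route preserves the column types. That difference is one reason some type conversion and cleaning is usually needed after loading from text files.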