In Chapter 2, we explained the concept of Automated Machine Learning and provided a brief overview of the automated ML tool on Microsoft Azure Machine Learning. In this chapter, we look at how to get started with Azure Machine Learning and, subsequently, automated ML. Before going into details of automated ML, we’ll first discuss some of the common challenges that enterprises face in their machine learning projects, to better understand these issues.
When solving problems with machine learning, we begin with a problem statement in terms of what we are trying to optimize. Next, we look for a dataset that will help us solve the problem. We begin looking at the data and use a data manipulation library like Pandas. We look at missing values, distribution of data, and errors in the data. We try to join multiple datasets. When we think we have a good enough dataset to get underway, we split it into train, test, and validation datasets, typically in a ratio of 70:20:10. This helps avoid overfitting, which basically means we’re not using the same dataset for training and testing. We use the train dataset to train the machine learning algorithm. The test dataset is used for testing the machine learning model after training is complete, to ascertain how well the algorithm performed.
We establish a metric to determine algorithm performance and keep iterating until we get a good algorithm. Then we use the validation dataset to check the algorithm’s performance. Sometimes, the validation dataset might not be in the main dataset, in which case we can split the original dataset for training and testing in an 80:20 ratio. All of these datasets should be representative samples of the main dataset to avoid skewed data (also known as bias). As you can see, this process is iterative and can be time-consuming. Figure 3-1 shows a summary of the manual process.
Let’s look at some of the other challenges a data scientist can face when embarking on a machine learning project.
Data scientists in the enterprise can work solo or in teams. Nowadays, machine learning projects are more complicated, and data scientists often collaborate. However, it might not be easy for data scientists to share results and review code.
Other challenges that data scientists face when working together are how to track the machine learning experiments and then track the history of multiple iterations (runs) within each experiment. There are additional challenges to having a training environment that can scale horizontally and vertically. When we need more nodes in a cluster, we want to scale it horizontally, and when we need more CPU or memory, we scale each node vertically.
After the trained model satisfies the business criteria, the next step is to operationalize it so that we can use it for predictions. This is also known as deployment of the model. A model can be deployed as a web service for real-time scoring or as a batch-scoring model for scoring in bulk. Figure 3-2 shows a summary of the steps a data scientist might perform, from training to deployment. Now, let’s understand how Azure Machine Learning and automated ML help address some of these challenges.
The Azure Machine Learning service helps data scientists track experiments and iterations (runs) in a cloud-based workspace. It is a machine learning platform designed to help with end-to-end (E2E) machine learning.
To use the Azure Machine Learning service (including the Azure Machine Learning SDK, and automated ML), you must have an Azure Machine Learning workspace. This workspace resides in your Azure subscription and you can use it as a single collaborative place to track experiments and do more things. Anyone with owner or contributor access to the resource group can create the workspace in an Azure resource group. You can create it using Microsoft Azure portal or using the Python SDK. Later in this chapter, we review in detail the steps of installing the SDK.
When creating the workspace for the first time, you need to register a few resource providers (RP) in the Azure subscription. Let’s register the RPs needed to use the Azure Machine Learning workspace. You can search for this RPs in the Azure portal under your subscription. Here are the steps to do this:
Open the Azure portal, and then go to your subscription, as shown in Figure 3-3.
On the left side of the pane, browse to “Resource providers,” as shown in Figure 3-4.
In the filter, in the Search box, type “machinelearning,” and then register the Machine Learning RPs, as shown in Figure 3-5.
Register the KeyVault, ContainerRegistry, and ContainerInstance RPs, as shown in Figures 3-6 and 3-7.
Now we’re ready to create an Azure ML workspace. To do this, we need Contributor or Owner access to the Azure resource group. After we have confirmed the appropriate access, let’s create the Azure Machine Learning workspace:
Go to the Azure portal and search for “machine learning service workspaces,” as shown in Figure 3-8.
Fill in the required details in the pane, as shown in Figures 3-9 through 3-11.
You can choose an existing resource group or create a new one.
Click Create when you have made all your selections.
In the upper part of the notification pane that then opens, click the bell icon (Figure 3-12) to go to the newly created Azure Machine Learning workspace.
As shown in Figure 3-13, the main page of the Azure Machine Learning workspace shows you how to get started and all of the assets that are a part of the workspace. When we run our automated ML experiment, the details will show up in the Experiments section.
After you create this workspace, you can use it for training, deployment, and more machine learning–related activities. For the remainder of this chapter, we focus on using automated ML.
There are multiple ways a data scientist or artificial intelligence (AI) developer can use automated ML. It comes packaged as part of the Azure Machine Learning SDK. It can be installed in any Python environment as a PyPi package.
Here we use Azure Notebooks (a Jupyter environment in the cloud) to run an E2E experiment with automated ML. When used with Azure Notebooks, the SDK is preinstalled in the environment. Let’s create a project:
Start Azure Notebooks by going to https://notebooks.azure.com, as shown in Figure 3-14. Click the Try It Now button and sign in.
From your profile page, you can view the Azure Notebooks projects (Figure 3-15).
Run the compute as your notebook server, as depicted in Figure 3-16.
Once you open the notebook (see Figure 3-17), it spins up the Jupyter kernel. You can execute the code in the cell by pressing Shift + Enter.
As shown in Figures 3-18 and 3-19, you begin by authorizing the environment to access the Azure subscription and thus the Azure Machine Learning workspace that you created earlier.
Now, instantiate the Azure Machine Learning workspace by providing the subscription, resource group, and workspace name as shown in Figures 3-20 and 3-21. Begin by importing the libraries and then use the get
method to instantiate the workspace object, which can then be used by automated ML and other related activities.
Define an experiment within the Azure Machine Learning workspace to get started with automated ML, as shown in Figure 3-22.
From the dataset that will be used for automated ML training, we create the DataFrames for the feature columns and prediction label. These DataFrames are represented as X and y in the automated ML configuration. The configuration takes various other parameters, as shown in Figures 3-23 and 3-24.
In addition to the experiment type, these parameters define the constraints that help control the time it takes and the money we spend on training. Details of these parameters are available in the official Azure documentation.
Submit this training and monitor the progress of the experiment in the notebook by using a widget, or through the Azure portal in your Azure Machine Learning workspace, as shown in Figures 3-25 through 3-27. This shows the metric score, status, and duration of the experiment run. These metrics can be useful to find what automated ML tried and the result of each iteration.
We can explore details in the child runs (iteration) by checking graphs of the true value and predicted value, as shown in Figures 3-28 and 3-29.
You can export the trained model from any of the child runs, as shown in Figure 3-30. Using Azure Machine Learning, you can deploy this model to the cloud or edge for making predictions. You also can deploy it to another environment of your choice. You can take advantage of the benefits of containerizing the model and then deploying it as a real-time web service or as a batch service using Azure Machine Learning. (We examine deployment in Chapter 5.)
Alternatively, after the training is complete, you can select the best model writing Python code as shown in Figure 3-31.
When you go to the main experiments page in your Azure Machine Learning workspace, you can look at all the experiments that you have run as well as their child runs. The portal automatically sorts the child runs based on the metric you are optimizing. In Figure 3-32, you can see a summary of the experiment run. It has various panes to show the run config and the run results. The best pipeline is shown at the top in Figure 3-33.
As of this writing, a new cloud-based notebook server is available in preview. This secure, cloud-based Azure workstation provides a Jupyter notebook server, JupyterLab, and a fully prepared machine learning environment. You can learn more about it in the Azure Machine Learning Notebooks documentation.
In this chapter, you learned about the Azure Machine Learning workspace and how to get started with automated ML by using Azure Notebooks. In Chapter 8, you’ll use more coding environments to run automated ML experiments.