One of the most important skills for a data scientist is the ability to frame a real-world problem as a standard data science task. Most data science projects can be classified as belonging to one of four general classes of task: clustering, anomaly detection, association-rule mining, and prediction.
Understanding which task a project is targeting can help with many project decisions. For example, training a prediction model requires that each of the instances in the data set include the value of the target attribute. So knowing that the project is doing prediction gives guidance (through requirements) in terms of data set design. Understanding the task also informs which ML algorithm(s) to use. Although there are a large number of ML algorithms, each algorithm is designed for a particular data-mining task. For example, ML algorithms that generate decision-tree models are designed primarily for prediction tasks. There is a many-to-one relationship between ML algorithms and tasks, so knowing the task doesn’t tell you exactly which algorithm to use, but it does define the set of algorithms that are designed for that task. Because the data science task affects both the data set design and the selection of ML algorithms, the decision regarding which task the project will target has to be made early in the project life cycle, ideally during the business-understanding phase of the CRISP-DM life cycle. To provide a better understanding of each of these tasks, this chapter describes how some standard business problems map to tasks.
One of the most frequent application areas of data science in business is to support marketing and sales campaigns. Designing a targeted marketing campaign requires an understanding of the target customer. Most businesses have a diverse range of customers with a variety of needs, so using a one-size-fits-all approach is likely to fail with a large segment of a customer base. A better approach is to try to identify a number of customer personas or customer profiles, each of which relates to a significant segment of the customer base, and then to design targeted marketing campaigns for each persona. These personas can be created using domain expertise, but it is generally a good idea to base the personas on the data that the business has about its customers. Human intuition about customers can often miss important nonobvious segments or not provide the level of granularity that is required for nuanced marketing. For example, Meta S. Brown (2014) reports how in one data science project the well-known stereotype soccer mom (a suburban homemaker who spends a great deal of time driving her children to soccer or other sports practice) didn’t resonate with a customer base. However, using a data-driven clustering process identified more focused personas, such as mothers working full-time outside the home with young children in daycare, mothers who work part-time and have high-school-age children, and women who are interested in food and health and do not have children. These customer personas define clearer targets for marketing campaigns and may highlight previously unknown segments in the customer base.
The standard data science approach to this type of analysis is to frame the problem as a clustering task. Clustering involves sorting the instances in a data set into subgroups containing similar instances. Usually clustering requires an analyst to first decide on the number of subgroups she would like identified in the data. This decision may be based on domain knowledge or informed by project goals. A clustering algorithm is then run on the data with the desired number of subgroups input as one of the algorithm’s parameters. The algorithm then creates that number of subgroups by grouping instances based on the similarity of their attribute values. Once the algorithm has created the clusters, a human domain expert reviews the clusters to interpret whether they are meaningful. In the context of designing a marketing campaign, this review involves checking whether the groups reflect sensible customer personas or identify new personas not previously considered.
The range of attributes that can be used to describe customers for clustering is vast, but some typical examples include demographic information (age, gender, etc.), location (ZIP code, rural or urban address, etc.), transactional information (e.g., what products or services they have purchased), the revenue the company generates from them, how long they have been customers, whether they are a member of a loyalty-card scheme, whether they have ever returned a product or made a complaint about a service, and so on. As is true of all data science projects, one of the biggest challenges with clustering is to decide which attributes to include and which to exclude so as to get the best results. Making this decision on attribute selection will involve iterations of experiments and human analysis of the results of each iteration.
The best-known ML algorithm for clustering is the k-means algorithm. The k in the name signals that the algorithm looks for k clusters in the data. The value of k is predefined and is often set through a process of trial-and-error experimentation with different values of k. The k-means algorithm assumes that all the attributes describing the customers in the data set are numeric. If the data set contains nonnumeric attributes, then these attributes need to be mapped to numeric values in order to use k-means, or the algorithm will need to be amended to handle these nonnumeric values. The algorithm treats each customer as a point in a point cloud (or scatterplot), where the customer’s position is determined by the attribute values in her profile. The goal of the algorithm is to find the position of each cluster’s center in the point cloud. There are k clusters, so there are k cluster centers (or means)—hence the name of the algorithm.
The k-means algorithm begins by selecting k instances to act as initial cluster centers. Current best practice is to use an algorithm called “k-means++” to select the initial cluster centers. The rationale behind k-means++ is that it is a good idea to spread out the initial cluster centers as much as possible. So in k-means++ the first cluster center is set by randomly selecting one of the instances in the data set. The second and subsequent cluster centers are set by selecting an instance from the data set with probability proportional to the squared distance between that instance and its closest existing cluster center. Once all k cluster centers have been initialized, the algorithm works by iterating through a two-step process: first, assigning each instance to the nearest cluster center, and then, second, updating the cluster center to be in the middle of the instances assigned to it. In the first iteration the instances are assigned to the nearest cluster center returned by the k-means++ algorithm, and then these cluster centers are moved so that they are positioned at the center of the instances assigned to them. Moving the cluster centers is likely to move them closer to some instances and farther away from other instances (including farther away from some instances assigned to the cluster center). The instances are then reassigned, again to the closest updated cluster center. Some instances will remain assigned to the same cluster center, and others may be reassigned to a new cluster center. This process of instance assignment and center updating continues until no instances are assigned to a new cluster center during an iteration. The k-means algorithm is nondeterministic, meaning that different starting positions for the cluster centers will likely produce different clusters.
As a result, the algorithm is typically run several times, and the results of these different runs are then compared to see which clusters appear most sensible given the data scientist’s domain knowledge and understanding.
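The two-step process just described can be sketched in a few lines of code. The following is a minimal illustration rather than a production implementation: the two-attribute customer profiles are made up, and the choice of five random restarts is arbitrary.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two equal-length attribute vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: pick the first center at random, then pick each
    subsequent center with probability proportional to the squared distance
    from the closest center chosen so far."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(dist(p, c) for c in centers) ** 2 for p in points]
        centers.append(rng.choices(points, weights=weights)[0])
    return centers

def kmeans(points, k, seed=0):
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    assignment = None
    while True:
        # Step 1: assign each instance to its nearest cluster center.
        new_assignment = [min(range(k), key=lambda i: dist(p, centers[i]))
                          for p in points]
        if new_assignment == assignment:  # no instance changed cluster: done
            return centers, assignment
        assignment = new_assignment
        # Step 2: move each center to the mean of its assigned instances.
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:
                centers[i] = [sum(col) / len(members) for col in zip(*members)]

def inertia(points, centers, assignment):
    """Within-cluster sum of squared distances -- lower is better."""
    return sum(dist(p, centers[a]) ** 2 for p, a in zip(points, assignment))

# Toy customer profiles (spend, visits): two obvious groups.
customers = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
             [8.0, 8.0], [8.5, 7.8], [7.9, 8.3]]

# The algorithm is nondeterministic, so run it several times and keep
# the clustering with the lowest inertia.
centers, labels = min((kmeans(customers, k=2, seed=s) for s in range(5)),
                      key=lambda result: inertia(customers, *result))
```

Keeping the run with the lowest inertia is one common way to compare the results of the multiple runs mentioned above; in practice a data scientist would also inspect the clusters by eye.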
When a set of clusters for customer personas has been deemed to be useful, the clusters are often given names to reflect the main characteristics of the cluster persona. Each cluster center defines a different customer persona, with the persona description generated from the attribute values of the associated cluster center. The k-means algorithm is not required to return equal-size clusters, and, in fact, it is likely to return different-size clusters. The sizes of the clusters can be useful, though, because they can help to guide marketing. For example, the clustering process may reveal small, focused clusters of customers that current marketing campaigns are missing. Or an alternative strategy might be to focus on clusters that contain customers that generate a great deal of revenue. Whatever marketing strategy is adopted, understanding the segments within a customer base is the prerequisite to marketing success.
One of the advantages of clustering as an analytics approach is that it can be applied to most types of data. Because of its versatility, clustering is often used as a data-exploration tool during the data-understanding stage of many data science projects. Clustering is also useful across a wide range of domains. For example, it has been used to analyze students in a given course in order to identify groups of students who need extra support or prefer different learning approaches. It has also been used to identify groups of similar documents in a corpus, and in science it has been used in bioinformatics to analyze gene sequences in microarray analysis.
Anomaly detection or outlier analysis involves searching for and identifying instances that do not conform to the typical data in a data set. These nonconforming cases are often referred to as anomalies or outliers. Anomaly detection is often used in analyzing financial transactions in order to identify potential fraudulent activities and to trigger investigations. For example, anomaly detection might uncover fraudulent credit card transactions by identifying transactions that have occurred in an unusual location or that involve an unusually large amount compared to other transactions on a particular credit card.
The first approach that most companies typically use for anomaly detection is to manually define a number of rules based on domain expertise that help with identifying anomalous events. This rule set is often defined in SQL or in another language and is run against the data in the business databases or data warehouse. Some programming languages have begun to include specific commands to facilitate the coding of these types of rules. For example, database implementations of SQL now include a MATCH_RECOGNIZE function to facilitate pattern matching in data. A common pattern in credit card fraud is that when a credit card gets stolen, the thief first checks that the card is working by purchasing a small item on the card, and then if that transaction goes through, the thief follows it as quickly as possible with the purchase of an expensive item before the card is canceled. The MATCH_RECOGNIZE function in SQL enables database programmers to write scripts that identify sequences of transactions on a credit card that fit this pattern and either block the card automatically or trigger a warning to the credit-card company. Over time, as more anomalous transactions are identified—for example, by customers reporting fraudulent transactions—the set of rules identifying anomalous transactions is expanded to handle these new instances.
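The small-test-purchase-then-large-purchase pattern can, of course, also be expressed outside SQL. The following sketch implements the same rule in Python; the thresholds and the one-hour window are illustrative assumptions that a real system would take from domain experts.

```python
from datetime import datetime, timedelta

# Illustrative thresholds -- real values would come from domain experts.
SMALL_MAX = 5.00              # a "test" purchase of a few dollars
LARGE_MIN = 500.00            # a suspiciously large follow-up purchase
WINDOW = timedelta(hours=1)   # how quickly the large purchase follows

def flag_test_then_spend(transactions):
    """Flag the stolen-card pattern: a small test purchase followed
    quickly by a large one. `transactions` is a list of
    (timestamp, amount) tuples for one card, sorted by time."""
    flags = []
    for (t1, small), (t2, large) in zip(transactions, transactions[1:]):
        if small <= SMALL_MAX and large >= LARGE_MIN and t2 - t1 <= WINDOW:
            flags.append((t1, t2))
    return flags

txns = [
    (datetime(2024, 3, 1, 10, 0), 42.50),
    (datetime(2024, 3, 1, 18, 0), 1.99),     # small test purchase...
    (datetime(2024, 3, 1, 18, 12), 899.00),  # ...then a big one, 12 minutes later
]
suspicious = flag_test_then_spend(txns)
```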
The main drawback with a rule-based approach to anomaly detection is that defining rules in this way means that anomalous events can be identified only after they have occurred and have come to the company’s attention. Ideally, most organizations would like to be able to identify anomalies when they first happen or if they have happened but have not been reported. In some ways, anomaly detection is the opposite of clustering: the goal of clustering is to identify groups of similar instances, whereas the goal of anomaly detection is to find instances that are dissimilar to the rest of the data in the data set. By this intuition, clustering can also be used to automatically identify anomalies. There are two approaches to using clustering for anomaly detection. The first assumes that the normal data will cluster together and that the anomalous records will fall into separate clusters. The clusters containing the anomalous records will be small and so will be clearly distinct from the large clusters for the main body of the records. The second approach is to measure the distance between each instance and the center of its cluster. The farther away the instance is from the center of the cluster, the more likely it is to be anomalous and thus to need investigation.
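The second approach can be sketched very simply. The code below scores each instance by its distance from a single cluster center (here just the mean of the data); a real analysis would measure distances from each instance’s own cluster center, but the intuition is the same. The toy transaction values are invented.

```python
import math

def center(points):
    """Mean of a set of attribute vectors -- a single cluster center."""
    return [sum(col) / len(points) for col in zip(*points)]

def anomaly_scores(points):
    """Score each instance by its distance from the cluster center:
    the farther away, the more likely it is anomalous."""
    c = center(points)
    return [math.sqrt(sum((x - y) ** 2 for x, y in zip(p, c)))
            for p in points]

# Typical transactions (amount, items) cluster together; one is far away.
transactions = [[48.0, 1.0], [52.0, 1.2], [50.0, 0.9],
                [51.0, 1.1], [900.0, 6.0]]
scores = anomaly_scores(transactions)
most_anomalous = max(range(len(scores)), key=scores.__getitem__)
```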
Another approach to anomaly detection is to train a prediction model, such as a decision tree, to classify instances as anomalous or not. However, training such a model normally requires a training data set that contains both anomalous records and normal records. Also, it is not enough to have just a few instances of anomalous records; in order to train a normal prediction model, the data set needs to contain a reasonable number of instances from each class. Ideally, the data set should be balanced; in a binary-outcome case, balance would imply a 50:50 split in the data. In general, acquiring this type of training data for anomaly detection is not feasible: by definition, anomalies are rare events, occurring maybe in 1 to 2 percent or less of the data. This data constraint precludes the use of normal, off-the-shelf prediction models. There are, however, ML algorithms known as one-class classifiers that are designed to deal with the type of imbalanced data that are typical of anomaly-detection data sets.
The one-class support-vector machine (SVM) algorithm is a well-known one-class classifier. In general terms, the one-class SVM algorithm examines the data as one unit (i.e., a single class) and identifies the core characteristics and expected behavior of the instances. The algorithm will then indicate how similar or dissimilar each instance is from the core characteristics and expected behavior. This information can then be used to identify instances that warrant further investigation (i.e., the anomalous records). The more dissimilar an instance is, the more likely that it should be investigated.
The fact that anomalies are rare means that they can be easy to miss and difficult to identify. As a result, data scientists often combine a number of different models to detect anomalies. The idea is that different models will capture different types of anomalies. In general, these models are used to supplement the known rules within the business that already define various types of anomalous activity. The different models are integrated into a decision-management solution that enables the predictions from each of the models to feed into a decision on the final predicted outcome. For example, if a transaction is identified as fraudulent by only one out of four models, the decision system may decide that it isn’t a true case of fraud, and the transaction can be ignored. Conversely, however, if three or four out of the four models have identified the transaction as possible fraud, then the transaction would be flagged for a data scientist to investigate.
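The voting logic of such a decision-management solution reduces to a few lines. The three-of-four threshold below mirrors the example above; it is an illustrative choice, not a fixed rule.

```python
def fraud_decision(model_flags, threshold=3):
    """Combine the verdicts of several anomaly models: flag the
    transaction for investigation only if at least `threshold` of
    the models consider it fraudulent."""
    return sum(model_flags) >= threshold

# Four models vote on two transactions.
ignore_it = fraud_decision([True, False, False, False])    # only 1 of 4
investigate = fraud_decision([True, True, True, False])    # 3 of 4
```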
Anomaly detection can be applied to many problem domains beyond credit card fraud. More generally, it is used by clearinghouses to identify financial transactions that require further investigation as potentially fraudulent or as cases of money laundering. It is used in insurance-claims analysis to identify claims that are not in keeping with a company’s typical claims. In cybersecurity, it is used to identify network intrusions by detecting possible hacking or untypical behavior by employees. In the medical domain, identifying anomalies in medical records can be useful for diagnosing disease and in studying treatments and their effects on the body. Finally, with the proliferation of sensors and the increasing usage of Internet of Things technology, anomaly detection will play an important role in monitoring data and alerting us when abnormal sensor events occur and action is required.
A standard strategy in sales is cross-selling, or suggesting to customers who are buying products that they may also want to purchase other related or complementary products. The idea is to increase the customers’ overall spending by getting them to purchase more products and at the same time to improve customer service by reminding customers of products they probably wanted to buy but may have forgotten. The classic example of the cross-sell is when a waiter in a hamburger restaurant asks a customer who has just ordered a hamburger, “Do you want fries with that?” Supermarkets and retail businesses know that shoppers purchase products in groups, and they use this information to set up cross-selling opportunities. For example, supermarket customers who buy hot dogs are also likely to purchase ketchup and beer. Using this type of information, a store can plan the layout of the products. Locating hot dogs, ketchup, and beer near each other in the store helps customers to collect this group of items quickly and may also boost the store’s sales because customers who are purchasing hot dogs might see and purchase the ketchup and beer that they forgot they needed. Understanding these types of associations between products is the basis of all cross-selling.
Association-rule mining is an unsupervised-data-analysis technique that looks to find groups of items that frequently co-occur. The classic case of association mining is market-basket analysis, wherein retail companies try to identify sets of items that are purchased together, such as hot dogs, ketchup, and beer. To do this type of data analysis, a business keeps track of the set (or basket) of items that each customer bought during each visit to the store. Each row in the data set describes one basket of goods purchased by a particular customer on a particular visit to the store. So the attributes in the data set are the products the store sells. Given these data, association-rule mining looks for items that co-occur within each basket of goods. Unlike clustering and anomaly detection, which focus on identifying similarities or differences between instances (or rows) in a data set, association-rule mining focuses on looking at relationships between attributes (or columns) in a data set. In a general sense, it looks for correlations—measured as co-occurrences—between products. Using association-rule mining, a business can start to answer questions about its customers’ behaviors by looking for patterns that may exist in the data. Questions that market-basket analysis can be used to answer include: Did a marketing campaign work? Have this customer’s buying patterns changed? Has the customer had a major life event? Does the product location affect buying behavior? Who should we target with our new product?
The Apriori algorithm is the main algorithm used to produce the association rules. It has a two-step process: first, it finds the itemsets that occur together frequently in the data set (the frequent itemsets); second, it generates association rules from these frequent itemsets.
The Apriori algorithm generates association rules that express probabilistic relationships between items in frequent itemsets. An association rule is of the form “IF antecedent, THEN consequent.” It states that an item or group of items, the antecedent, implies the presence of another item in the same basket of goods, the consequent, with some probability. For example, a rule derived from a frequent itemset containing A, B, and C might state that if A and B are included in a transaction, then C is likely to also be included:
IF {hot-dogs, ketchup}, THEN {beer}.
This rule indicates that customers who are buying hot dogs and ketchup are also likely to buy beer. A frequently cited illustration of the power of association-rule mining is the beer-and-diapers story, which describes how an unknown US supermarket in the 1980s used an early computer system to analyze its checkout data and identified an unusual association between diapers and beer in customer purchases. The theory developed to understand this rule was that families with young children were preparing for the weekend and knew that they would need diapers and would be socializing at home. The store placed the two items near each other, and sales soared. The beer-and-diapers story has been debunked as apocryphal, but it is still a useful example of the potential benefits of association-rule mining for retail businesses.
Two main statistical measures are linked with association rules: support and confidence. The support percentage of an association rule—or the ratio of transactions that include both the antecedent and consequent to the total number of transactions—indicates how frequently the items in the rule occur together. The confidence percentage of an association rule—or the ratio of the number of transactions that include both the antecedent and consequent to the number of transactions that include the antecedent—is the conditional probability that the consequent will occur given the occurrence of the antecedent. So, for example, a confidence of 75 percent for the association rule relating hot dogs and ketchup with beer would indicate that in 75 percent of cases where customers purchased both hot dogs and ketchup, they also purchased beer. The support score of a rule simply records the percentage of baskets in the data set where the rule holds. For example, a support of 5 percent indicates that 5 percent of all the baskets in the data set contain all three items in the rule “hot dogs, ketchup, and beer.”
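These two measures are straightforward to compute directly from a set of baskets. The sketch below uses a toy four-basket data set; a real market-basket analysis would, of course, operate on far more transactions and would use Apriori to avoid enumerating every itemset.

```python
def support(baskets, itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(baskets, antecedent, consequent):
    """Conditional probability of the consequent given the antecedent:
    support(antecedent and consequent) / support(antecedent)."""
    return (support(baskets, antecedent | consequent)
            / support(baskets, antecedent))

baskets = [
    {"hot-dogs", "ketchup", "beer"},
    {"hot-dogs", "ketchup", "beer"},
    {"hot-dogs", "ketchup"},
    {"milk", "bread"},
]
s = support(baskets, {"hot-dogs", "ketchup", "beer"})       # 2 of 4 baskets
c = confidence(baskets, {"hot-dogs", "ketchup"}, {"beer"})  # 2 of 3 baskets
```

Here the rule IF {hot-dogs, ketchup}, THEN {beer} has 50 percent support and roughly 67 percent confidence on these four baskets.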
Even a small data set can result in the generation of a large number of association rules. In order to control the complexity of the analysis of these rules, it is usual to prune the generated rule set to include only rules that have both a high support and a high confidence. Rules that don’t have high support or confidence are not interesting either because the rule covers only a very small percentage of baskets (low support) or because the relationship between the items in the antecedent and the consequent is weak (low confidence). Rules that are trivial or inexplicable should also be pruned. Trivial rules represent associations that are obvious and well known to anyone who understands the business domain. An inexplicable rule represents associations that are so strange that it is difficult to understand how to convert the rule into a useful action for the company. It is likely that an inexplicable rule is the result of an odd data sample (i.e., the rule represents a spurious correlation). Once the rule set has been pruned, the data scientist can then analyze the remaining rules to understand what products are associated with each other and apply this new information in the organization. Organizations will typically use this new information to determine store layout or to run targeted marketing campaigns for their customers. These campaigns can involve updates to their websites to include recommended products, in-store advertisements, direct mailings, the cross-selling of other products by check-out staff, and so on.
Association mining becomes more powerful if the baskets of items are connected to demographic data about the customer. This is why so many retailers run loyalty-card schemes: such schemes allow them not only to connect different baskets of goods to the same customer through time but also to connect baskets of goods to the customer’s demographics. Including this demographic information in the association analysis enables the analysis to be focused on particular demographics, which can further help marketing and targeted advertising. For example, demographic-based association rules can be used with new customers, for whom the company has no buying-habit information but does have demographic information. An example of an association rule augmented with demographic information might be
IF gender(male) and age(< 35) and {hot-dogs, ketchup}, THEN {beer}.
[Support = 2%, Confidence = 90%.]
The standard application area for association-rule mining focuses on what products are in the shopping basket and what products are not in the shopping basket. This assumes that the products are purchased in one visit to the store or website, an assumption that holds in most retail and related settings. However, association-rule mining is also useful in a range of domains outside of retail. For example, in the telecommunications industry, applying association-rule mining to customer usage helps telecommunications companies to design how to bundle different services together into packages. In the insurance industry, association-rule mining is used to see if there are associations between products and claims. In the medical domain, it is used to check if there are interactions between existing and new treatments and medicines. And in banking and financial services, it is used to see what products customers typically have and whether these products can be applied to new or existing customers. Association-rule mining can also be used to analyze purchasing behavior over a period of time. For example, customers tend to buy products X and Y today, and in three months’ time they buy product Z. This time period can be considered a shopping basket, although it is one that spans three months. Applying association-rule mining to this kind of temporally defined basket expands the application areas of association-rule mining to include maintenance schedules, the replacement of parts, service calls, financial products, and so on.
A standard business task in customer-relationship management is to estimate the likelihood that an individual customer will take an action. The term propensity modeling is used to describe this task because the goal is to model an individual’s propensity to do something. This action could be anything from responding to marketing to defaulting on a loan or leaving a service. The ability to identify customers who are likely to leave a service is particularly important to cell phone service companies. It costs a cell phone service company a substantial amount of money to attract new customers. In fact, it is estimated that it generally costs five to six times more to attract a new customer than it does to retain an established one (Verbeke et al. 2011). As a result, many cell phone service companies are very keen to retain their current customers. However, they also want to minimize costs. So although it would be easy to retain customers by simply giving all customers reduced rates and great phone upgrades, this is not a realistic option. Instead, they want to target the offers they give their customers to just those customers who are likely to leave in the near future. If they can identify a customer who is about to leave a service and persuade that customer to stay, perhaps by offering her an upgrade or a new billing package, then they can save the difference between the price of the enticement they gave the customer and the cost of attracting a new customer.
The term customer churn is used to describe the process of customers leaving one service and joining another. So the problem of predicting which customers are likely to leave in the near future is known as churn prediction. As the name suggests, this is a prediction task. The prediction task is to classify a customer as being a churn risk or not. Many companies use this kind of analysis to predict customer churn in the telecommunications, utilities, banking, insurance, and other industries. A growing area that companies are focusing on is the prediction of staff turnover or staff churn: which staff are likely to leave the company within a certain time period.
When a prediction model returns a label or category for an input, it is known as a classification model. Training a classification model requires historic data, where each instance is labeled to indicate whether the target event has happened for that instance. For example, customer-churn classification requires a data set in which each customer (one row per customer) is assigned a label indicating whether he or she has churned. The data set will include an attribute, known as the target attribute, that lists this label for each customer. In some instances, assigning a churn label to a customer record is a relatively straightforward task. For example, the customer may have contacted the organization and explicitly canceled his subscription or contract. However, in other cases the churn event may not be explicitly signaled. For example, not all cell phone customers have a monthly contract. Some customers have a pay-as-you-go (or prepay) contract in which they top up their account at irregular intervals when they need more phone credit. Defining whether a customer with this type of contract has churned can be difficult: Has a customer who hasn’t made a call in two weeks churned, or is it necessary for a customer to have a zero balance and no activity for three weeks before she is considered to have churned? Once the churn event has been defined from a business perspective, it is then necessary to implement this definition in code in order to assign a target label to each customer in the data set.
Another complicating factor in constructing the training data set for a churn-prediction model is that time lags need to be taken into account. The goal of churn prediction is to model the propensity (or likelihood) that a customer will churn at some point in the future. As a consequence, this type of model has a temporal dimension that needs to be considered during the creation of the data set. The set of attributes in a propensity-model data set is drawn from two separate time periods: the observation period and the outcome period. The observation period is when the values of the input attributes are calculated. The outcome period is when the target attribute is calculated. The business goal of creating a customer-churn model is to enable the business to carry out some sort of intervention before the customer churns—in other words, to entice the customer to stay with the service. This means that the prediction about the customer churning must be made sometime in advance of the customer’s actually leaving the service. The length of this period is the length of the outcome period, and the prediction that the churn model returns is actually that a customer will churn within this outcome period. For example, the model might be trained to predict that the customer will churn within one month or two months, depending on the speed of the business process to carry out the intervention.
Defining the outcome period affects what data should be used as input to the model. If the model is designed to predict that a customer will churn within two months from the day the model is run on that customer’s record, then when the model is being trained, the input attributes that describe the historic customers who have already churned should be calculated using only the data that were available about those customers two months prior to their leaving the service. The input attributes describing currently active customers should similarly be calculated with the data available about these customers’ activity two months earlier. Creating the data set in this way ensures that all the instances in the data set, including both churned and active customers, describe the customers at the time in their individual customer journeys that the model is being designed to make a prediction about them: in this example, two months before they churn or stay.
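This windowing logic can be sketched as a simple filter on each customer’s event history. The 60-day outcome period and the billing events below are illustrative assumptions, not values from a real project.

```python
from datetime import date, timedelta

OUTCOME_PERIOD = timedelta(days=60)  # "within two months" -- illustrative

def observation_events(events, reference_date):
    """Keep only the events the model would have seen at prediction time:
    everything up to `reference_date` minus the outcome period. For a
    churned customer the reference date is the churn date; for an active
    customer it is the date the data set is built."""
    cutoff = reference_date - OUTCOME_PERIOD
    return [(d, amount) for d, amount in events if d <= cutoff]

# A churned customer's billing events; she churned on 1 June.
events = [(date(2024, 1, 15), 40.0),
          (date(2024, 3, 20), 45.0),
          (date(2024, 5, 10), 5.0)]   # falls inside the outcome period
usable = observation_events(events, reference_date=date(2024, 6, 1))
```

The May event is excluded: it occurred inside the two-month outcome period, so the model must not be allowed to see it when its input attributes are computed.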
Nearly all customer-propensity models will use attributes describing the customer’s demographic information as input: age, gender, occupation, and so on. In scenarios relating to an ongoing service, they are also likely to include attributes describing the customer’s position in the customer life cycle: coming on board, standing still midcycle, approaching end of a contract. There are also likely to be attributes that are specific to the industry. For example, typical attributes used in telecommunication industry customer-churn models include the customer’s average bill, changes in billing amount, average usage, staying within or generally exceeding plan minutes, the ratio of calls within the network to those outside the network, and potentially the type of phone used. However, the specific attributes used in each model will vary from one project to the next. Gordon Linoff and Michael Berry (2011) report that in one churn-prediction project in South Korea, the researchers found it useful to include an attribute that described the churn rate associated with a customer’s phone (i.e., What percentage of customers with this particular phone churned during the observation period?). However, when they went to build a similar customer-churn model in Canada, the handset/churn-rate attribute was useless. The difference was that in South Korea the cell phone service company offered large discounts on new phones to new customers, whereas in Canada the same discounts were offered to both existing and new customers. The overall effect was that in South Korea phones going out of date drove customer churn; people were incentivized to leave one operator for another in order to avail themselves of discounts, but in Canada this incentive to leave did not exist.
Once a labeled data set has been created, the next major stage is to use an ML algorithm to build the classification model. During modeling, it is good practice to experiment with a number of different ML algorithms to find out which algorithm works best on the data set. Once the final model has been selected, the likely accuracy of the predictions of this model on new instances is estimated by testing it on a subset of the data set that was not used during the model-training phase. If a model is deemed accurate enough and suitable for the business need, the model is then deployed and applied to new data either in a batch process or in real time. A really important part of deploying the model is ensuring that the appropriate business processes and resources are put in place so that the model is used effectively. There is no point in creating a customer-churn model unless there is a process whereby the model's predictions trigger customer interventions that help the business retain its customers.
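The experiment-then-hold-out pattern described above can be sketched as follows, assuming scikit-learn is available. The data set is synthetic and the two candidate algorithms are chosen only for illustration; a real project would compare whichever algorithms suit the task.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-in for a labeled churn data set.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Hold back a test set that plays no part in training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Try several algorithms and compare them on the held-out data.
candidates = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
}
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

best = max(scores, key=scores.get)  # the model that would go forward to deployment
```

Because the test set was never seen during training, the winning model's score is an honest estimate of how it will perform on new customers.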
In addition to predicting the classification label, prediction models can also give a measure of how confident the model is in the prediction. This measure is called the prediction probability and will have a value between 0 and 1. The higher the value, the more likely the prediction is correct. The prediction-probability value can be used to prioritize which customers to focus on. For example, in customer-churn prediction the organization wants to concentrate on the customers who are most likely to leave. By using the prediction probability and sorting the churners based on this value, a business can focus on the key customers (those most likely to leave) first before moving on to customers with a lower prediction-probability score.
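Prioritizing by prediction probability amounts to a simple descending sort. The customer IDs, probabilities, and the 0.5 cutoff below are all illustrative.

```python
# (customer_id, prediction probability of churning) — illustrative scores.
predictions = [
    ("cust_01", 0.91), ("cust_02", 0.34),
    ("cust_03", 0.77), ("cust_04", 0.08),
]

# Sort descending so the customers most likely to churn come first.
priority_list = sorted(predictions, key=lambda p: p[1], reverse=True)

# The retention team works down the list, starting with the most at-risk
# customers; here a 0.5 cutoff marks who gets contacted first.
top_targets = [cust for cust, prob in priority_list if prob >= 0.5]
```

In practice the cutoff is a business decision: it trades off the cost of an intervention against the expected value of retaining the customer.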
Price prediction is the task of estimating the price at which a product will sell at a particular point in time. The product could be a car, a house, a barrel of oil, a stock, or a medical procedure. Having a good estimate of what something will cost is obviously valuable to anyone who is considering buying the item. The accuracy of a price-prediction model is domain dependent. For example, due to the variability in the stock market, predicting the price of a stock tomorrow is very difficult. By comparison, it may be easier to predict the price of a house at an auction because house prices fluctuate much more slowly than stock prices.
The fact that price prediction involves estimating the value of a continuous attribute means that it is treated as a regression problem. A regression problem is structurally very similar to a classification problem; in both cases, the data science solution involves building a model that can predict the missing value of an attribute given a set of input attributes. The only difference is that classification involves estimating the value of a categorical attribute and regression involves estimating the value of a continuous attribute. Regression analysis requires a data set where the value of the target attribute for each of the historic instances is listed. The multi-input linear-regression model introduced in chapter 4 illustrated the basic structure of a regression model, with most other regression models being variants of this approach. The basic structure of a regression model for price prediction is the same no matter what product it is applied to; all that varies are the name and number of the attributes. For example, to predict the price of a house, the input would include attributes such as the size of the house, the number of rooms, the number of floors, the average house price in the area, the average house size in the area, and so on. By comparison, to predict the price of a car, the attributes would include the age of the car, the number of miles on the odometer, the engine size, the make of the car, the number of doors, and so on. In each case, given the appropriate data, the regression algorithm works out how each of the attributes contributes to the final price.
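A minimal version of the house-price example can be sketched with a linear-regression model, assuming scikit-learn is available. The two input attributes (floor area and number of rooms) are a small subset of those listed above, and the prices are toy values.

```python
from sklearn.linear_model import LinearRegression

# Each row: [floor area in square meters, number of rooms];
# target: sale price. All values are invented for illustration.
X = [[50, 2], [70, 3], [90, 3], [110, 4], [130, 5]]
y = [150_000, 200_000, 240_000, 290_000, 340_000]

# The algorithm works out how each attribute contributes to the price.
model = LinearRegression().fit(X, y)

# Estimate the price of a new house from its attributes.
estimate = model.predict([[100, 4]])[0]
```

Swapping in car attributes (age, mileage, engine size, and so on) changes only the inputs; the structure of the regression model stays the same.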
As has been the case with all the examples given throughout this chapter, the application example of using a regression model for price prediction is illustrative only of the type of problem that it is appropriate to frame as a regression-modeling task. Regression prediction can be applied to a wide variety of other real-world problems. Typical regression-prediction problems include estimating profit, sales value and volume, size, demand, distance, and dosage.