Index

Symbols and Numerics

# symbol, 120, 124

* (asterisk), in SQL, 147

2 × 4 matrix, generating with NumPy, 115

A

absolute macros, Microsoft Excel, 159–160
accountability for AI solutions, 319–320
accreditations, 346–348
accuracy, degree of, 52
accuracy in representations, 164
activation function, 45
activists, data art for, 163
actors, 200
Adobe Analytics, 239
advertising, 230–231, 283–284
affective computing, 267, 268
AI. See artificial intelligence
algorithms for machine learning
- classification
  - general discussion, 90–91
  - instance-based learning classifiers, 90
  - overfitting in, 92–93
  - overgeneralization in, 92–93
  - overview, 89
- clustering
  - DBScan, 87–88
  - general discussion, 78
  - with hierarchical algorithms, 84–87
  - kernel density estimation, 84
  - k-means algorithm, 82–84
  - overview, 79–81
  - similarity metrics, 81–82
- decision trees, 44, 46, 88–89
- nearest neighbor
  - average nearest neighbor algorithms, 94–97, 101–102
  - k-nearest neighbor algorithms, 90, 97–100, 101
  - overview, 93–94
  - solving real-world problems with, 100–102
- random forest, 89
- selecting based on function, 44–48
- supervised, 42
- unsupervised, 43
alternatives analysis, 335–336
Amazon Redshift, 30–31
Amazon Web Services S3 platform, 23
analysts, data showcasing for, 163
annotations, including in data visualization, 185
AP (Associated Press) case study, 224–228
Apache Cassandra, 32
Apache Flume, 22
Apache Kafka, 22
Apache Spark, 35, 48–49
Apache Sqoop, 22
app development competitions, 371
appendixes, in technical plan, 334–335
Apple, 285–286
application files, 11
application programming interface (API), 371, 379
application step, in machine learning, 40
applications. See also Microsoft Excel; Structured Query Language
- CARTO, 388–390
- in data science strategy, 138–139
- DataWrangler, 383
- Gephi, 384–386
- ImageQuilts, 382–383
- Infogram, 394–395
- overview, 137, 381–382
- Piktochart, 395–396
- RAWGraphs, 392–393
- Shiny by RStudio, 387–388
- Tableau Public, 390–391
- WEKA, 386
architecture standards, 296
area chart, 171, 172
arguments of functions, 124, 147
ARIMA (AutoRegressive Integrated Moving Average), 132
ARMA (autoregressive moving average), 75–76
arrays, NumPy, 114–115
art, data. See data art
artificial intelligence (AI)
- accountability, assessing, 319–320
- bias in systems, assessing, 322–323
- case studies
  - AP content generation, 224–228
  - call center operations, 257–262
  - debt collection processes, 211–216
  - logistical efficiencies, 216–217
  - real-time optimized logistics routing, 217–222
- ethics
  - assessing, 318–323
  - collecting information about, 307–308
  - in technical plan, 330
- explainability, assessing, 320–321
- hype about, 275–277
- overview, 42
Artists of Data Science podcast, 345–346
assessment
- of AI ethics, 318–323
- of data governance, 323–325
- of data privacy policies, 325
- data skill gap analysis, 317–318
Associated Press (AP) case study, 224–228
association rule learning algorithms, 44, 46
asterisk (*), in SQL, 147
atomic vectors, 122, 123
attribute value, 93
attributes, 112, 129–131
audience, as currency, 344
audience for data visualization
- designing for, 164–167
- emotional response, inducing, 168–169
- identifying, 162
- logical and calculating response, inducing, 167–168
Automated Insights’ Wordsmith platform, 225–228
automation in business operations, 210. See also debt collection processes case study
autoregression techniques, 75
AutoRegressive Integrated Moving Average (ARIMA), 132
autoregressive moving average (ARMA), 75–76
autoreply functionality, Gmail, 47
average nearest neighbor algorithms, 94–97, 101–102

B

B2B (business-to-business) company, 207
bar chart, 171, 172, 395, 396
behavior-based learning model, 43
BernoulliNB, 56
best practices, 306
BI (business intelligence), 247–250, 262–263
bias in AI systems, assessing, 322–323
big data, 11. See also data storage
- defining, 19–20
- differences between data approaches, 24–28
- real-time processing framework, 35
- sources of, 23–24
- three Vs of, 21–23
big data paradigm, 24
BigQuery, Google, 31
bi-grams, 271–272
binary membership, 66
binning, 56
binomial distributions, 55
bootcamps, 348–349
Borne, Kirk, 355
bottom-up dendrograms, 85
boundary smoothing, 99
BPA (business process automation), 210
brackets, in R programming language, 122
brainstorming
- for data visualization, 164, 165–166
- with POTI model, 338–339
brick-and-mortar store, 233
bubble plot, 174, 175
built-in Python functions, 110–111
business acumen
- data roles, support for generating profit, 192–194
- defining, 191
- documentation for data implementers, 202–203
- documentation for data leaders, 199–202
- generating profit, 192–197
- increasing, 195–196
- leadership skills, fortifying, 196–197
- overview, 189
- STAR framework, 197–199
- subject matter expertise versus, 190
business consulting, 364
business intelligence (BI), 247–250, 262–263
business managers, feedback from, 305
business mission, 330
business model, choosing, 357–359
business operations, improving, 210
business process automation (BPA), 210
Business Science University, 365
business vision, 294–296, 330
businesses, starting
- business model, choosing, 357–359
- data science entrepreneurs, 364–366
- Kam Lee, example of, 361–364
- overview, 357
- revenue model, choosing, 359–361
business-to-business (B2B) company, 207

C

call center operations case study, 257–262
- need, 257
- results, 258
- solution, 257
- technology stack, 262
- use case diagram, 261
- use cases, 258–260
Cambridge Analytica scandal, 238, 283–284
Canada Open Data website, 371–372
career paths
- accreditations, 346–348
- businesses, starting
  - business model, choosing, 357–359
  - data science entrepreneurs, 364–366
  - Kam Lee, example of, 361–364
  - overview, 357
  - revenue model, choosing, 359–361
- career accelerators, 348–349
- coding bootcamps, 348–349
- common, 341–343
- data implementers, 345–346
- data leaders, 354–356
- general discussion, 15–18
- generating wealth, 343–345
- networking and relationship-building, 349–350
- overview, 341
- project portfolio, building, 351–354
- thought leadership, 350–351
CARTO, 388–390
case studies
- AP content generation, 224–228
- call center operations, 257–262
- debt collection processes, 211–216
- defining, 202
- documentation for data implementers, 202–203
- documentation for data leaders, 199–202
- logistical efficiencies, 216–217
- real-time optimized logistics routing, 217–222
- STAR framework, 197–199
cases, in machine learning, 41
cash, as currency, 344
Cassandra, Apache, 32
categorical distributions, 55
CCTV cameras, use of k-nearest neighbor algorithms by, 101
Central Bank of Malaysia, 266–267
centroids, 82
channels
- mapping, 233–234
- performance, building around, 235
- scoring, 231, 235–237
Character data type, SQL, 144
chart graphics, standard, 171–173
charts, Microsoft Excel, 152, 154–156
choropleth map, 389
churn, 14
circle of influence, 363
civic hackers, 370
classes, 112–113, 121
classification algorithms
- versus clustering, 90
- general discussion, 90–91
- instance-based learning classifiers, 90
- nearest neighbor analysis
  - average, 94–97, 101–102
  - k-nearest, 90, 97–100, 101
  - overview, 93–94
  - solving real-world problems with, 100–102
- overfitting in, 92–93
- overgeneralization in, 92–93
- overview, 89
classification methods, 77
click-streams, 22–23
clinical informatics scientists, 14
choropleth map, 181, 182
cloud storage solutions, 28–32
cloud-warehouse solutions, 30–31
clustering
- algorithms, 79–81
- classification versus, 90
- DBScan, 87–88
- general discussion, 78
- with hierarchical algorithms, 84–87
- kernel density estimation, 84
- k-means algorithm, 82–84
- overview, 77
- selecting algorithms based on function, 44, 46
- similarity metrics, 81–82
clusters, 80
coding, 13, 103, 104. See also Python; R programming language
coding bootcamps, 348–349
coding documentation, 202–203
coding portfolio, building, 351–354
Cogito, 257–262
cognitive bias, 322
collaborative data visualization platforms, 390–391
collecting data, 11
collective outliers, 71, 87
colon operator, 129
Color Scales conditional formatting, Microsoft Excel, 154
column indexes, in SQL, 145–146
comma-separated values (CSV), 11
comments, in code, 120, 124
commissioning manager, 308
commodity servers, 33
communicating data insights, 14–15
company, researching
- business vision, mission, and values, 294–296
- data ethics, 306–308
- data resources, inventorying, 298–302
- data science team, unifying, 292–293
- data technologies, inventorying, 296–298
- efficient process for, 308–310
- overview, 291–292
- people-mapping, 303–304
- project pitfalls, avoiding, 305–306
company culture, 334
company email list, 233
company website, 233
comparative graphics, 173–176
competitions, app development, 371
computer vision technology, 210
concatenate function, 123
conditional formatting, Microsoft Excel, 152, 154
conditional probability, 55–56
connective edge, 94
constant time series, 74
constraints, in SQL, 145
consulting and advising businesses, 358–359
contact strategy, 212
content analysis, 269
content creation, AI-assisted
- Associated Press case study, 224–228
- GPT-3, 222–224
- for marketing improvements, 230
- overview, 222
context, adding to data visualization, 184–186
contextual outliers, 71
continuous probability distribution, 54
copyleft, 369
core samples, 87
correlation
- overview, 56
- Pearson, 56–58
- Spearman's rank, 58–59
cosine similarity, 81
Creative Commons licenses, 370
credit card use patterns, 101
CRM (customer relationship management), 100
crowdsourcing, 378
CSV (comma-separated values), 11
culture, organizational, 334
cumulative variance explained (CVE), 60–62, 64–65
currency, types of, 344
current state of company, assessing, 312
current state summary statement, 329
custom Python functions, 111
customer acquisition, 237
customer avatar, designing, 236
customer churn analysis, 230
customer relationship management (CRM), 100
customer retention, 237
customer service support, automated, 210
CVE (cumulative variance explained), 60–62, 64–65

D

DaaS (Data-as-a-Service) platform, 284–285
Dancho, Matt, 365
dashboard design, 167
data analytics
- appropriate use of, 262–263
- BI versus, 249–250
- common challenges in, 252–253
- data-wrangling, 253–254
- Google Analytics, 250–251
- overview, 249
- types of analytics, 252
data art
- data graphics for, 171, 184
- designing for, 164, 166–167
data cleaning, 279, 379
data companies, use of personal data by, 286
data dictionary, requesting, 298–300, 315
data engineering, 7, 8, 26–28, 364
data entrepreneurs
- career paths, 343
- operational improvements, 207
- overview, 17–18
- real-life examples, 364–366
- separating dedicated team members as, 293
data ethics. See also data privacy
- AI ethics
  - assessing, 318–323
  - collecting information about, 307–308
  - in technical plan, 330
- of NLP, 274
data frame objects, 122, 123
Data Futurology Podcast, 355
data governance council, 324
data governance policies, assessing, 308, 323–325
data governance standards, 324
data graphics
- comparative graphics, 173–176
- context, adding, 184–186
- overview, 170–171
- spatial plots and maps, 180–183
- standard chart graphics, 171–173
- statistical plots, 176–179
- testing, 183–184
- tools for
  - CARTO, 388–390
  - Infogram, 394–395
  - Piktochart, 395–396
  - RAWGraphs, 392–393
  - Shiny applications by RStudio, 387–388
  - Tableau Public, 390–391
- topology structures, 179–180
data implementers
- career paths, 341–342
- decision support, 263
- documentation for, 202–203
- operational improvements, 206
- overview, 16
- real-life examples, 345–346
- separating dedicated team members as, 293
data infrastructure architecture, 296–297
data ingestion, 22, 280
data insights, 8, 9–10, 14–15
data integrity, 142
data journalists, 14
data lake, 23
data leaders
- career paths, 342
- data storytelling by, 163
- documentation for, 199–202
- operational improvements, 206
- overview, 16–17
- real-life examples, 354–356
- separating dedicated team members as, 293
data literacy training, 333
data mart, 23
data mining, 280
data monetization
- AI hype, 275–277
- Clive Humby, 278
- data privacy initiatives, 285–288
- data products, 282
- data resources, 283–285
- data services, 278–281
- overview, 275
data munging, 151
data normalization, 279
data partnerships, 284
data points, 41
data policies, 324
data preparation services, 279–280
data preprocessing, 269
data privacy
- advertising, 283–284
- breaches of, 263
- company data ethics, researching, 306–307
- data partnerships, 284
- demand for, 238
- GDPR, 320–321
- latest developments, 285–288
- policies, assessing, 307, 325
data privacy policy, 307, 325
data processing, 279
data product manager, 14, 239
data products, 238–239, 282
data quality issues, 300–302
data resources
- alternatives analysis, 336
- current, in technical plan, 331
- inventorying, 298–302
- monetization of, 283–285
- recommendations, in technical plan, 332
data roles, support for profit generation, 192–194
data science. See also career paths
- applying to subject areas, 13–14
- data engineering versus, 8
- for decision support, 254–263
- defining, 8, 25
- key components, 10–15
- making use of, 8–10
- overview, 7–8
- statistics versus, 13
Data Science Association (DSA), 349
data science careers. See career paths
data science strategy, 104, 138–139
data science team
- recommendations in technical plan, 333–334
- unifying, 292–293
data scientists
- communicating data insights, 14–15
- defining, 8, 25, 27–28
- key components of role, 10–15
data service providers, directory of, 281
data services, 278–281
data showcasing
- coding portfolio, building, 352
- data graphics for, 171
- designing for, 166–167
- overview, 163
data silos, 300–302
data skillset
- alternatives analysis, 336
- data skill gap analysis, 317–318
- of relevant personnel, surveying, 304
- in technical plan, 331, 333
data sources, 23–24
data storage
- in cloud, 28–32
- cloud-warehouse solutions, 30–31
- Hadoop, 33–34
- HDFS, 33–34
- Kubernetes, 30
- MapReduce, 33
- MPP platforms, 34
- NoSQL databases, 31–32
- on-premise solutions, 32–34
- overview, 28
- serverless computing solutions, 29
data storytelling
- coding portfolio, building, 352
- data graphics for, 171
- designing for, 162–163, 166–167
- overview, 14–15
Data Strategy Plan Template, 328–335
data structures, 107
Data Studio, Google, 239
data superhero archetypes, 15–18, 163
data taxonomy services, 279
data technology
- alternatives analysis, 336
- assessing use case impact on, 315
- brainstorming recommendations for, 339
- current, in technical plan, 331
- inventorying, 296–298
- recommendations, in technical plan, 332, 333
data transformation, 279
data types
- in Python, 106–109
- in SQL, 144
data variety, 22–23
data velocity, 21–22
data visualization, 14–15. See also data graphics
- annotations, including, 185
- brainstorming, 165–166
- context, adding, 184–186
- data art, 164
- data showcasing, 163
- data storytelling, 162–163
- design style, choosing, 167–169
- emotional response, design for, 168–169
- ggplot2 package, 132, 133–134
- graphical elements, including, 186
- logical and calculating response, design for, 167–168
- main types of, 162–164
- MatPlotLib library, 117, 118–119
- in Microsoft Excel, 154–156
- overview, 161–162
- persuasive design, 186
- process for creating, 162
- purpose, defining, 166
- target audience, designing for, 164–167
- testing data graphics, 183–184
- tools for
  - CARTO, 388–390
  - RAWGraphs, 392–393
  - Shiny applications by RStudio, 387–388
  - Tableau Public, 390–391
- type of, choosing, 166–167
data volume, 21
data warehouse systems, 23, 30–31
Data-as-a-Service (DaaS) platform, 284–285
DATAcated, 364–365
data-exploration tools, 384–386
DataFrame object, Pandas library, 117
Data.gov program, 370–371
data.gov.uk, 372–373
DATA.NASA.GOV, 374–375
dataset X, 67–68
DataWrangler, 383
data-wrangling, 253–254, 352, 383
Datawrapper, 396
Date data type, SQL, 144
DATE function, SQL, 148
Date-Time data type, SQL, 144
DBScan (density-based spatial clustering of applications with noise), 87–88
debt collection processes case study, 211–216
- result, 212–213
- solution, 211–212
- technology stack, 216
- use case, 214–215
- use case diagram, 215
decision engines, 103
decision makers, data storytelling for, 162–163
decision support system
- business intelligence, 247–249, 262–263
- data analytics
  - appropriate use of, 262–263
  - BI versus, 249–250
  - common challenges in, 252–253
  - data-wrangling, 253–254
  - Google Analytics, 250–251
  - overview, 249
  - types of analytics, 252
- data science for
  - appropriate use of, 262–263
  - call center operations case study, 257–262
  - data sources, 255–256
  - overview, 254–255
- improving, 245–246
- overview, 245
decision trees, 44, 46, 88–89
decision-making
- FMCDM, 67
- MCDM, 65–67
deep learning, 45, 46, 47–48, 227, 267
defect analysis, 209
degree of accuracy, 52
dendrograms, 85, 86
density, 83, 84
density smoothing function, 84
density-based spatial clustering of applications with noise (DBScan), 87–88
dependent variable (DV), 41
descriptive analytics, 60, 64
descriptive statistics, 52–53
design style of data visualization, choosing, 167–169
designing SQL database, 144–147
diagnostic analytics, 64–65
dictionaries, Python, 109
digital product, 238–239, 282
dimensionality reduction
- factor analysis, 63
- overview, 59
- principal component analysis, 64–65
- selecting algorithms, 44, 46
- singular value decomposition, 59–62
directional movement, 83
directors of data science, 14
directory of online data service providers, 281
dirty data, compressing, 60, 61
disclaimers analysis, 269
discrete probability distribution, 54
dividend performance, 66
document analysis, automated, 210
documentation
- accountability for AI solutions, 320
- current state of company, assessing, 312
- data governance policies, 323–325
- for data implementers, 202–203
- for data leaders, 199–202
- data privacy policies, 325
- overview, 199
document-oriented database, 32
domain-specific data science, 25
dot notation, 114
DSA (Data Science Association), 349
DStreams, 48
duplication of effort, 302
DV (dependent variable), 41
dynamic pricing, 231

E

earnings growth potential, 66
earnings quality rating, 66
eigenvector, 62
email list, company, 233
emotionally provocative data visualizations, 168–169
employees
- hiring new, 317, 334
- interviewing intended users, 337–338
- recommendations in technical plan, 333
- skillsets, surveying, 304
ensemble algorithms, 45, 46
enterprise IT architecture, 296
entity analysis, 271–272
error propagation, 88
essential context, establishing, 206–207
e-surveillance platform, 268
ethics, data. See data ethics
Euclidean distance metric, 81
Excel, Microsoft
- Charting tool, 154–156
- Conditional Formatting feature, 154
- filtering in, 153–154
- macros, 158–160
- overview, 151–152
- PivotTables, 157–158
- quick data analysis with, 152–153
executing data science project, 339–340
executive summary, 329
expectation value, 54
explainability of AI solutions, 320–321
Exversion, 379
eyeballing clusters, 80, 84, 85

F

FaaS (Function as a Service), 29
Facebook DeepFace, 47
factor analysis, 63, 132
fault-tolerance, in HDFS, 34
feature engineering, 42
feature selection, 40, 42, 98
features, 41, 95
feedback, collecting, 305
fee-for-service revenue model, 361
file formats, 11–12
filters, in Microsoft Excel, 152, 153–154
financial services industry
- fraud prevention with NLP
  - content analysis, 269
  - data preprocessing, 269
  - disclaimers analysis, 269
  - metadata analysis, 269
  - normalization, 270
  - overview, 267–268
  - phrase and entity analysis, 271–272
  - token analysis, 271
  - trade-off between risk and reward, 272–273
- lending risk, decreasing, 266–267
- overview, 265
Flores, Felipe, 355
Flume, Apache, 22
FMCDM (fuzzy multiple criteria decision-making), 67
for loop, 110
forecast package, 132
foreign key, 141, 142
Forrester Research, Inc., 273
four Ps, 240–241
fourth industrial revolution, 208–209
fraud prevention with NLP, 267–274
- content analysis, 269
- data preprocessing, 269
- disclaimers analysis, 269
- metadata analysis, 269
- normalization, 270
- overview, 267–268
- phrase and entity analysis, 271–272
- token analysis, 271
- trade-off between risk and reward, 272–273
full outer JOIN function, SQL, 148, 149
Function as a Service (FaaS), 29
function call, 125
functions
- in Python, 106, 110–112
- in R, 123–127
- in SQL, 147–151
Funnel Gorgeous, 365
future state of company, 312
future state vision statement, 330
fuzzy multiple criteria decision-making (FMCDM), 67

G

Gantt chart, 174, 176
GaussianNB, 56
General Architecture for Text Engineering (GATE), 151
General Data Protection Regulation (GDPR), 320–321, 330
generic vectors, 122
geometric metrics, 81
Gephi, 384–386
ggplot2 package, 133–134
GitHub portfolio, 353–354
Gmail, 47
goals of data science projects, 332
Google Analytics, 239, 250–251
Google BigQuery, 31
Google Data Studio, 239
Google Sheets, 152
government data, open, 370–373
GPT-3, 222–224, 230
graph mesh network topology, 180
graph models, 180
graphics, data. See data graphics
GraphX library, Apache Spark, 48
Grayeb, Jennifer, 365
grocery retail, use of average nearest neighbor algorithms by, 101–102
GROUP function, SQL, 150
Guru Path of the Data Science Bootcamp, Data Science Dojo, 349

H

Hadoop, 20, 33–34, 35
Hadoop distributed file system (HDFS), 22, 23, 33–34
hairball graph, 385, 386
hardware companies, use of personal data by, 286
hash symbol, 120, 124
HAVING function, SQL, 150
HDFS blocks, 33
Heartbeat algorithm, TrueAccord, 211–216
help function, SciPy library, 117
hidden layer, 45
hierarchical clustering algorithms, 79, 84–87
hierarchical tree topology, 180, 181
high-variety data, 22
hiring new employees, 317, 334
histogram, 176–177, 178
HR managers, feedback from, 305
Humana case study, 257–262
- need, 257
- results, 258
- solution, 257
- technology stack, 262
- use case diagram, 261
- use cases, 258–260
Humby, Clive, 275, 278
hyperparameters, 203
hypertargeted advertising, 230–231

I

icons, used in book, 4
igraph package, 134
ImageQuilts, 382–383
implementation plan. See technical plan
implementing data use cases, 222
implicit risk in AI, 319
independent variable (IV), 41
index, 141
inferential statistics, 52–53
Infogram, 394–395
infographic tools, 393–396
information, in POTI modeling, 315, 339
information products, 282
information products businesses, 358
information redundancy, 63
in-memory computing, 35, 48, 144
inner JOIN function, SQL, 148, 149
instance
- in machine learning, 41
- in R, 121
instance-based algorithms, 44, 46
instance-based learning classifiers, 90
instantiating objects, 112
interactive mode, R programming language, 121
Internet of things, 23
interpreter, SQL, 147
inter-quartile range (IQR), 71–72
interviews, conducting, 300, 305, 337–338
investments, evaluating potential, 66–67
IPUMS, 374
IT professionals, feedback from, 305
iterating, in R programming language, 127–129
IV (independent variable), 41

J

Jaccard distance metric, 82
Jee, Ken, 346
JOIN function, SQL, 148

K

Kafka, Apache, 22
kernel density estimation (KDE), 84
kernel smoothing methods, 84
key-value pair, 31
key-value stores, 32
k-means clustering algorithm, 82–84, 209
k-nearest neighbor classification algorithm (kNN), 90, 97–100, 101
Knoema, 376–377
Kozyrkov, Cassie, 354
Kubeflow product, 30
Kubernetes, 30, 32

L

labeled data, 90
lake, data, 23
latency, 21
latent variables, 60, 63
lazy learners, 90, 97
lead scoring, 231
leadership skills, fortifying, 196–197
learning step, in machine learning, 40
Lee, Kam, 361–364
Lee, Vincent, 266–267
left JOIN function, SQL, 148
len function, 108
lending risk, decreasing, 266–267
LendUp case study, 211–216
- result, 212–213
- solution, 211–212
- technology stack, 216
- use case, 214–215
- use case diagram, 215
libraries, Python
- MatPlotLib, 118–119
- NumPy, 114–116
- overview, 114
- Pandas, 117–118
- Scikit-learn, 119–120
- SciPy, 116–117
licensing revenue model, 360
lifetime value forecasting (LTV), 230
line chart, 173
Line Chart feature, Microsoft Excel, 156
linear algebra
- factor analysis, 63
- overview, 59
- principal component analysis, 64–65
- singular value decomposition, 59–62
linear regression, 67–69
linear relationships, 57
linear topological structure, 179
lists, 108, 122, 123
live events, 234
local maximum density, 83
local minimum density, 83, 84
logical and calculating response, data visualization design for, 167–168
logistic regression, 69, 132
logistical operations, improving
- overview, 216–217
- real-time optimized logistics routing case study, 217–222
loops, using in Python, 109–110
low value of big data, 21
low-code environment, 138–139
low-density regions, 83
lowest-hanging-fruit use case, 291, 311
- AI ethics, assessing, 318–323
- data governance, assessing, 323–325
- data privacy policies, assessing, 325
- data skill gap analysis, 317–318
- quick-win use cases, selecting, 313–316
- reviewing documentation, 312
LTV (lifetime value forecasting), 230

M

Ma, Danny, 345
machine learning. See also clustering; nearest neighbor analysis
- Apache Spark, generating real-time analytics with, 48–49
- classification algorithms, 89–93
- coding portfolio, building, 352
- decision trees, 44, 46, 88–89
- defining, 26, 40
- learning styles, 42–43
- overview, 39
- processes, 40–41
- random forest algorithms, 89
- regression methods, 67–70
- reinforcement learning, 43
- selecting algorithms based on function, 44–48
- supervised algorithms, 42
- unsupervised algorithms, 43
- use cases, 40
- vocabulary associated with, 41–42
- WEKA application, 386
machine learning engineers, 14, 26, 27–28
machine learning model selection, 280
machine learning model-tuning, 280
macros, Microsoft Excel, 158–160
management plan. See technical plan
Manhattan distance metric, 81
manuals for AI systems, 321
manufacturing operations, improving, 208–210
many-to-many relationship structure, 180
map layers, 389
Mapbox, 396
mapping application, 388–390
mapping channels, 233–234
MapReduce, 33, 35
market basket analysis, 231
marketing channels, omnichannel analytics of, 233–238
marketing data scientists, 14
marketing improvements
- data products, 238–239
- marketing mix modeling, 239–243
- omnichannel analytics
  - channel performance, building around, 235
  - channels, mapping, 233–234
  - data privacy, 238
  - defining, 233
  - overview, 232
  - scoring channels, 235–237
- overview, 229
- popular use cases for, 229–232
marketing mix modeling (MMM), 239–243
marketing professionals, feedback from, 305
marketing strategy development, 10
mass customization production, 208–209
massively parallel processing (MPP) platforms, 26, 34
mathematical modeling, 12
MatPlotLib library, 105, 118–119
matrix
- generating with NumPy, 115
- R, 122
MCDM (multiple criteria decision-making), 65–67
media content creation. See content creation, AI-assisted
metadata analysis, 269
metadata repository, 298–300
methods, Python, 112
microbatch processing, 48
microdata, 376
Microsoft Excel
- Charting tool, 154–156
- Conditional Formatting feature, 154
- filtering in, 153–154
- macros, 158–160
- overview, 151–152
- PivotTables, 157–158
- quick data analysis with, 152–153
Minkowski distance, 81
mission statement, 294–296
MLlib submodule, Apache Spark, 49
mlogit (multinomial logit model), 132
MMM (marketing mix modeling), 239–243
model building services, 280–281
model overfitting, 92–93
model overgeneralization, 92–93
model selection, machine learning, 280
model-tuning, machine learning, 280
monetization, data. See data monetization
money, as currency, 344
MongoDB, 32
moving average techniques, 75
MPP (massively parallel processing) platforms, 26, 34
multicollinearity, 70
multidimensional arrays, generating with NumPy, 114–115
multidimensional datasets, 59
multi-label learning, 99–100
multinomial logit model (mlogit), 132
MultinomialNB, 56
multiple criteria decision-making (MCDM), 65–67
multiple criteria evaluation, 65
multiple dependencies, in SQL, 145
multiple linear regression, 68
multivariate analysis, 132
multivariate normality (MVN), 64
multivariate outlier detection, 73
munging, data, 151
MySQL, 141

N

Naïve Bayes method, 44, 46, 55–56
NASA open data, 374–375
natural language processing (NLP), 230, 267–274
- content analysis, 269
- data preprocessing, 269
- disclaimers analysis, 269
- metadata analysis, 269
- normalization, 270
- overview, 267–268
- phrase and entity analysis, 271–272
- token analysis, 271
- trade-off between risk and reward, 272–273
Natural Language Toolkit, Python, 151
n-dimensional arrays, generating with NumPy, 114–115
n-dimensional plot, 81
nearest neighbor analysis
- average nearest neighbor algorithms, 94–97, 101–102
- k-nearest neighbor algorithms, 90, 97–100, 101
- overview, 93–94
- solving real-world problems with, 100–102
neighborhood clustering algorithms, 87–88
network analysis, 384
network graph analysis, 134–135
Network Planning Tools (NPT), 218–222
network topologies, 384
networking, 349–350
neural networks, 45, 46, 47, 210
n-grams, 271–272
Nimble Company, The, 365
NLP. See natural language processing
no-code environment, 138–139
nodes, 33
noise, 53, 99
noncore samples, 87
nonglobular clustering, 86–87
non-interactive mode, R programming language, 121
nonlinear relationships, 58, 59
nonredundancy of columns, in SQL, 145
nonstationary processes in time series, 74–75
normal distributions, 55
normalization, 145–147, 270
NoSQL databases, 31–32
Notion, 203
NPT (Network Planning Tools), 218–222
NULL values, SQL, 145, 147
numbers data type, Python, 107
Numerical data type, SQL, 144
NumPy library, 114–116

O

object-oriented language, 121
objects
- in Python, 105–106
- in R, 121–122, 129–131
observations, 41, 94, 95
OLS (ordinary least squares) regression methods, 70
omnichannel analytics
- channel performance, building around, 235
- channels, mapping, 233–234
- data privacy, 238
- defining, 233
- overview, 232
- scoring channels, 235–237
online data service providers, directory of, 281
on-premise storage solutions, 32–34
open data resources
- Canada Open Data website, 371–372
- Data.gov program, 370–371
- data.gov.uk, 372–373
- Exversion, 379
- Knoema, 376–377
- NASA, 374–375
- OSM, 380
- overview, 369–370
- Quandl, 378–379
- US Census Bureau data, 373–374
- World Bank Open Data page, 375–376
open government license, 372, 373
open movement, 369
OpenAI, 223
open-source SQL implementations, 141
OpenStreetMap (OSM), 380
operational improvement
- in business operations, 210
- content creation, AI-assisted
  - AP content generation rates, increasing, 224–228
  - GPT-3, 222–224
  - overview, 222
- data science contributions, 207–208
- debt collection processes case study, 211–216
- essential context, establishing, 206–207
- logistical efficiencies, 216–217
- in manufacturing operations, 208–210
- overview, 205
- real-time optimized logistics routing case study, 217–222
operators, in R programming language, 124–127
optical character recognition, 217
optimal use case, selecting
- AI ethics, assessing, 318–323
- data governance, assessing, 323–325
- data privacy policies, assessing, 325
- data skill gap analysis, 317–318
- overview, 311
- quick-win, selecting, 313–316
- reviewing documentation, 312
ordinal variables, 55
ordinary least squares (OLS) regression methods, 70
organization, in POTI modeling, 315, 339
organizational charts, requesting, 303–304
organizational culture, 334
organizational structure, in technical plan, 330–331
OSM (OpenStreetMap), 380
outer JOIN function, SQL, 148
outlier detection, 65
- DBScan for, 87
- extreme values, analyzing, 70–71
- in Microsoft Excel, 154–156
- with multivariate analysis, 73
- types of outliers, 71
- with univariate analysis, 71–72
overfitting, model, 92–93
overgeneralization, 92–93

P

packages, R programming language, 131–135
packed circle diagram, 174, 175
paid ads, 234
Pandas library, 117–118
parallel distributed processing, 33
parallel processing, 30
partitional clustering algorithms, 79
patterns in time series, identifying, 74–75
PCA (principal component analysis), 60, 64–65, 73
Peak Volume Alignment Tool (PVAT), 218
Pearson correlation, 56–58
people-mapping, 303–304
perceptron, 45
personal data. See data privacy
persuasive design, in data visualization, 186
phrase analysis, 271–272
pie chart, 173, 174
Piktochart, 395–396
PivotCharts, Microsoft Excel, 158
PivotTables, Microsoft Excel, 157–158
placing product, 241
plan of action. See technical plan
PLC (programmable logic controller), 209
point map, 181, 182
point outliers, 71
point pattern data, 135
policing applications, 10
polymorphic classes, 121
polymorphic functions, 131
population, 53
portfolio, building, 351–354
post conditions, 200
PostgreSQL, 141
POTI model, 314–316, 338–339
preconditions, 200
predictant, 41, 67–68
predictive analytics, 64–65
predictive applications, 10, 30
predictive maintenance, 217
prescriptive analytics, 64–65
presentation skills, 163
price, 240–241
pricing, dynamic, 231
primary key, 141, 142, 145
principal component analysis (PCA), 60, 64–65, 73
principal components, 64
print function, 110–111, 121
probability
- conditional, 55–56
- distributions, 53–55
- inferential statistics, 52–53
processes, in POTI modeling, 314, 338
processing data, 35
product development, 237
product features, 240
production forecasting, 210
profit-forming data science projects, 192–197
- data roles, support for, 192–194
- documentation for data implementers, 202–203
- documentation for data leaders, 199–202
- STAR framework, 197–199
programmable logic controller (PLC), 209
programming languages. See also Python; R programming language
- in data science strategy, 104
- overview, 13
- Visual Basic for Applications, 158, 160
project milestones, keeping close eye on, 340
promotion, 241
psychographics, 165
purpose of data visualization, defining, 164, 166
PVAT (Peak Volume Alignment Tool), 218
pyplot function, 118–119
Python
- classes, 112–113
- in data science strategy, 104
- data types, 106–109
- dictionaries, 109
- functions, 106, 110–112
- general discussion, 104–106
- libraries
  - MatPlotLib, 118–119
  - NumPy, 114–116
  - overview, 114
  - Pandas, 117–118
  - Scikit-learn, 119–120
  - SciPy, 116–117
- lists, 108
- loops, using, 109–110
- Natural Language Toolkit, 151
- numbers data type, 107
- objects, 105–106
- overview, 13
- sets, 109
- strings, 107–108
- tuples, 108–109

Q

Quality Control Charts package (qcc), 132
Quandl, 378–379
quantitative methods, 12
querying data, 11
question-and-asset request database, 308–309
quick-win use cases, selecting, 313–316

R

R programming language
- basic vocabulary, 121–124
- in data science strategy, 104
- functions, 123–124
- functions and operators, methods for using, 124–127
- general discussion, 120
- ggplot2 package, 133–134
- igraph package, 134
- iterating in, 127–129
- objects in, 121–122, 129–131
- overview, 13
- spatstat package, 135
- statistical analysis packages, 131–133
- statnet package, 134–135
r variable, 56–58
random forest algorithms, 89
random sampling, 40–41
random variable, 54
ranking and scoring data, 280
raster surface map, 181, 183
RAWGraphs, 392–393
RDA (Research Data Alliance), 349
RDBMS (relational database management system), 8, 22–23, 26, 31, 141–143
Read phase, Synthesys AI, 269–271
real-time big data analytics, generating with Apache Spark, 48–49
real-time optimized logistics routing case study, 217–222
- result, 218–222
- solution, 218
- technology stack, 221–222
- use case, 219
- use case diagram, 220
real-time processing framework, 35
recommendation engines, 229–230
recommendations summary, 329
recommending plan of action. See technical plan
recyclability, 128
Redshift, Amazon, 30–31
redundancy, in HDFS, 34
reference table, 299
referral sites, 234
regression algorithms, 46
regression methods
- linear, 67–69
- logistic, 69
- in marketing mix modeling, 241–242
- ordinary least squares, 70
- overview, 67
- production forecasting, 210
- selecting algorithms, 44
regularization algorithms, 44, 46
reinforcement learning, 43
relational database management system (RDBMS), 8, 22–23, 26, 31, 141–143
relationship-building, 349–350
Relative macros, Microsoft Excel, 159–160
Remember icon, 4
report writing, automated, 210
Research Data Alliance (RDA), 349
researching company
- business vision, mission, and values, 294–296
- data ethics, 306–308
- data resources, inventorying, 298–302
- data science team, unifying, 292–293
- data technologies, inventorying, 296–298
- efficient process for, 308–310
- overview, 291–292
- people-mapping, 303–304
- project pitfalls, avoiding, 305–306
residuals, 68
Resolve phase, Synthesys AI, 271–272
resources, open data. See open data resources
retail stores, use of k-nearest neighbor algorithms by, 101
revenue model, choosing, 359–361
right JOIN function, SQL, 148
risk priority number (RPN), 242
robot workcell, 209

S

SaaS (Software as a Service), 26, 282
SaaS business model, 359
SafeGraph, 282, 284–285
Sahota, Harpreet, 345–346
sales calls, 234
sales channels, omnichannel analytics approach for, 233–238
sales professionals, feedback from, 305
sample, 53
scatterplot, 177, 178
scatterplot charts, 133–134
scatterplot matrix, 177, 179
Scikit-learn library, 119–120
SciPy library, 116–117
scoring channels, 231, 235–237
scraping websites, 14
script files, 11
sculpting data, 383
search engine optimization (SEO), 234
seasonality, 74
security cameras, use of k-nearest neighbor algorithms by, 101
security of cloud storage, 29
SELECT function, SQL, 147–148
self-learning networks, 45
Self-Taught Data Scientist Curriculum, 362
self-tuning vision systems, 210
semistructured data, 8, 22–23
sentiment analysis, 267, 269–273
SEO (search engine optimization), 234
sequences, 95
Series object, Pandas library, 117
serverless computing solutions, 29
service development, 237
service-based businesses, 357–358
services revenue model, 361
sets, Python, 109
shared variance, 63
Sheets, Google, 152
Shiny applications, RStudio, 387–388
showcasing, data. See data showcasing
silhouette coefficient, 83
similarity metrics, 81–82
single-link algorithm, 94
singular value decomposition (SVD), 59–62
skills
- alternatives analysis, 336
- coding portfolio, building, 351–354
- data skill gap analysis, 317–318
- of relevant personnel, surveying, 304
- in technical plan, 331, 333
- upgrading, 9
Smart-Reply, Gmail, 47
SME (subject matter expert), 10–11, 13–14, 190
Smith, Heather, 355–356
Snowflake service, 31
social media, website traffic from, 234
social network analysis, 134–135
software, as data product, 238–239
software applications. See applications
Software as a Service (SaaS), 26, 282
software feature, as data product, 238–239
sources of big data, 23–24
Spark, Apache, 35, 48–49
Spark SQL, 48
sparse matrices, compressing, 60, 61
spatial data analytics, 388–390
spatial map, 180–183
spatial plot, 180–183
spatial point pattern analysis, 135
spatstat package, 135
Spearman's rank correlation, 58–59
SQL. See Structured Query Language
Sqoop, Apache, 22
St. Lawrence, Sadie, 365
stacked chart, 174, 176
stakeholder management, 163
stakeholders, feedback from, 305
standard chart graphics, 171–173
STAR framework, 295–296
- assessing current state, 312
- data skill gap analysis, 317–318
- general discussion, 197–199
- recommending plan of action, 328–329
- Survey step, 313
start-ups
- business model, choosing, 357–359
- data science entrepreneurs, 364–366
- Kam Lee, example of, 361–364
- overview, 357
- revenue model, choosing, 359–361
state machine, 212
statistical plots, 176–179
statistics
- versus data science, 13
- defining, 52
- deriving insights from, 12–13
- descriptive, 52–53
- inferential, 52–53
statnet package, 134–135
STEM degrees, 346–348
stochastic approach, 12
storing data. See data storage
storytelling, data. See data storytelling
Strachnyi, Kate, 364–365
strategic plan. See technical plan
Streaming module, Apache Spark, 48
strings, Python, 107–108
structured data, 8, 22–23
Structured Query Language (SQL)
- constraints, designing, 145
- data types, defining, 144
- database design, 144–147
- functions, 147–151
- general discussion, 139–141
- normalization, 145–147
- open-source implementations, 141
- overview, 11, 13
- RDBMSs, understanding, 141–143
- text mining in, 151
subject areas, applying data science to, 13–14
subject matter expert (SME), 10–11, 13–14, 190
subject-matter segregation, in SQL, 146
subscriptions revenue model, 360–361
success scenario, 201
sum function, 108
superhero archetypes, 15–18, 163
supervised machine learning, 42, 77
surveys, conducting, 300, 305
survival analysis, 42
SVD (singular value decomposition), 59–62
SWOT analysis, 312
Sydney Data Science, 345
Synthesys AI solution, 267–274
- content analysis, 269
- data preprocessing, 269
- disclaimers analysis, 269
- metadata analysis, 269
- normalization, 270
- overview, 267–268
- phrase and entity analysis, 271–272
- token analysis, 271

T

Tableau Public, 390–391
tangible products, 238–239
target audience, designing for, 164–167
target variable, 41
technical plan
- alternatives analysis, 335–336
- executing, 339–340
- general discussion, 327–328
- interviewing intended users, 337–338
- outline for, 329–335
- POTI modeling future state, 338–339
- purpose of, 327
Technical Stuff icon, 4
technology, data. See data technology
technology stacks in case studies
- AP content generation rates, increasing, 228
- call center operations, 262
- debt collection processes, 216
- real-time optimized logistics routing, 221–222
test set, 40, 92
testing data graphics, 183–184
Text data type, SQL, 144
text mining, in SQL, 151
thought leadership, 350–351
three Vs of big data, 21–23
throughput, 22
time, as currency, 344
time series analysis, 73–76
time-series data, 111, 376
time-to-market, 207, 359
Tip icon, 4
Titanic catastrophe, survival rates from, 88–89
token analysis, 271
tools, data science
- CARTO, 388–390
- DataWrangler, 383
- Gephi, 384–386
- ImageQuilts, 382–383
- Infogram, 394–395
- overview, 381–382
- Piktochart, 395–396
- RAWGraphs, 392–393
- Shiny by RStudio, 387–388
- Tableau Public, 390–391
- WEKA application, 386
top-down dendrograms, 85
topology structures, 179–180
training and retraining machine learning models, 280
training recommendations, 333
training set, 40, 92
tree map, 174, 177
tree network topology, 180
trended time series, 74
trends, identifying, 154–156
tri-grams, 271–272
TrueAccord, 211–216
Tukey boxplotting, 71–72
Tukey outlier labeling, 71–72
tuples, 33, 95, 108–109
Turing test, 223

U

UCLA Extension: 10-week data science intensive course, 348
under-pricing, 241
uniform probability distribution, 53
unit sales revenue model, 360
United Parcel Service (UPS) case study, 217–222
- result, 218–222
- solution, 218
- technology stack, 221–222
- use case, 219
- use case diagram, 220
univariate analysis
- outlier detection, 71–72
- time series data, modeling, 75–76
unstructured data, 8, 22–23
unsupervised machine learning, 43, 78, 80
upgrading data skills, 9
UPS case study. See United Parcel Service case study
US Census Bureau data, 373–374
use case diagram
- AP content generation case study, 227
- call center operations case study, 261
- debt collection processes case study, 215
- overview, 201
- real-time optimized logistics routing case study, 220
use cases
- AP content generation rates, increasing, 226
- call center operations case study, 258–260
- debt collection processes case study, 214–215
- defining, 200
- documentation for data implementers, 202–203
- documentation for data leaders, 199–202
- elements to include in, 200
- for marketing
  - data products, 238–239
  - marketing mix modeling, 239–243
  - omnichannel analytics, 232–238
  - popular, 229–232
- operational improvement, by industry, 208
- optimal, selecting
  - AI ethics, assessing, 318–323
  - data governance, assessing, 323–325
  - data privacy policies, assessing, 325
  - data skill gap analysis, 317–318
  - overview, 311
  - quick-win, selecting, 313–316
  - reviewing documentation, 312
- real-time optimized logistics routing case study, 219
- STAR framework, 197–199
- in technical plan, 330

V

values of company, 294–296
value-to-data-quantity ratio, 21
Vanderplas, Jake, 353–354
variance, shared, 63
variety, data, 22–23
VBA (Visual Basic for Applications), 158, 160
vectorization, 127
vectors
- generating with NumPy, 115
- R, 121–122, 123
velocity, data, 21–22
vertically stacked card infographics, 394
Visual Basic for Applications (VBA), 158, 160
visual estimation of clusters, 80, 84, 85
volume, data, 21

W

Waikato Environment for Knowledge Analysis (WEKA), 386
Warning icon, 4
wealth, generating, 343–345
web analytics, 232. See also omnichannel analytics
web-based data visualization, 392–393
web-based documents, 12
Weber, Eric, 354
web-scraping tools, 382–383
website, company, 233
weighted average, 54
WEKA (Waikato Environment for Knowledge Analysis), 386
while loop, 110
Wilson, Zach, 346
Women in Data, 350, 365
word cloud, 174, 177
World Bank Open Data page, 375–376
wrappers, 379

Y

YARN, 33, 35

Z

zero-sum system, 65