- * (asterisk), in SQL, 147
- 2 × 4 matrix, generating with NumPy, 115
A
- Absolute macros, Microsoft Excel, 159–160
- accountability for AI solutions, 319–320
- accreditations, 346–348
- accuracy, degree of, 52
- accuracy in representations, 164
- activation function, 45
- activists, data art for, 163
- actors, 200
- Adobe Analytics, 239
- advertising, 230–231, 283–284
- affective computing, 267, 268
- AI. See artificial intelligence
- algorithms for machine learning
- classification
- general discussion, 90–91
- instance-based learning classifiers, 90
- overfitting in, 92–93
- overgeneralization in, 92–93
- overview, 89
- clustering
- DBScan, 87–88
- general discussion, 78
- with hierarchical algorithms, 84–87
- kernel density estimation, 84
- k-means algorithm, 82–84
- overview, 79–81
- similarity metrics, 81–82
- decision trees, 44, 46, 88–89
- nearest neighbor
- average nearest neighbor algorithms, 94–97, 101–102
- k-nearest neighbor algorithms, 90, 97–100, 101
- overview, 93–94
- solving real-world problems with, 100–102
- random forest, 89
- selecting based on function, 44–48
- supervised, 42
- unsupervised, 43
- alternatives analysis, 335–336
- Amazon Redshift, 30–31
- Amazon Web Services S3 platform, 23
- analysts, data showcasing for, 163
- annotations, including in data visualization, 185
- AP (Associated Press) case study, 224–228
- Apache Cassandra, 32
- Apache Flume, 22
- Apache Kafka, 22
- Apache Spark, 35, 48–49
- Apache Sqoop, 22
- app development competitions, 371
- appendixes, in technical plan, 334–335
- Apple, 285–286
- application files, 11
- application programming interface (API), 371, 379
- application step, in machine learning, 40
- applications. See also Microsoft Excel; Structured Query Language
- CARTO, 388–390
- in data science strategy, 138–139
- DataWrangler, 383
- Gephi, 384–386
- ImageQuilts, 382–383
- Infogram, 394–395
- overview, 137, 381–382
- Piktochart, 395–396
- RAWGraphs, 392–393
- Shiny by RStudio, 387–388
- Tableau Public, 390–391
- WEKA, 386
- architecture standards, 296
- area chart, 171, 172
- arguments of functions, 124, 147
- ARIMA (AutoRegressive Integrated Moving Average), 132
- ARMA (autoregressive moving average), 75–76
- arrays, NumPy, 114–115
- art, data. See data art
- artificial intelligence (AI)
- accountability, assessing, 319–320
- bias in systems, assessing, 322–323
- case studies
- AP content generation, 224–228
- call center operations, 257–262
- debt collection processes, 211–216
- logistical efficiencies, 216–217
- real-time optimized logistics routing, 217–222
- ethics
- assessing, 318–323
- collecting information about, 307–308
- in technical plan, 330
- explainability, assessing, 320–321
- hype about, 275–277
- overview, 42
- Artists of Data Science podcast, 345–346
- assessment
- of AI ethics, 318–323
- of data governance, 323–325
- of data privacy policies, 325
- data skill gap analysis, 317–318
- Associated Press (AP) case study, 224–228
- association rule learning algorithms, 44, 46
- asterisk (*), in SQL, 147
- atomic vectors, 122, 123
- attribute value, 93
- attributes, 112, 129–131
- audience, as currency, 344
- audience for data visualization
- designing for, 164–167
- emotional response, inducing, 168–169
- identifying, 162
- logical and calculating response, inducing, 167–168
- Automated Insights’ Wordsmith platform, 225–228
- automation in business operations, 210. See also debt collection processes case study
- autoregression techniques, 75
- AutoRegressive Integrated Moving Average (ARIMA), 132
- autoregressive moving average (ARMA), 75–76
- autoreply functionality, Gmail, 47
- average nearest neighbor algorithms, 94–97, 101–102
B
- B2B (business-to-business) company, 207
- bar chart, 171, 172, 395, 396
- behavior-based learning model, 43
- BernoulliNB, 56
- best practices, 306
- BI (business intelligence), 247–250, 262–263
- bias in AI systems, assessing, 322–323
- big data, 11. See also data storage
- defining, 19–20
- differences between data approaches, 24–28
- real-time processing framework, 35
- sources of, 23–24
- three Vs of, 21–23
- big data paradigm, 24
- BigQuery, Google, 31
- bi-grams, 271–272
- binary membership, 66
- binning, 56
- binomial distributions, 55
- boot camps, 348–349
- Borne, Kirk, 355
- bottom-up dendrograms, 85
- boundary smoothing, 99
- BPA (business process automation), 210
- brackets, in R programming language, 122
- brainstorming
- for data visualization, 164, 165–166
- with POTI model, 338–339
- brick-and-mortar store, 233
- bubble plot, 174, 175
- built-in Python functions, 110–111
- business acumen
- data roles, support for generating profit, 192–194
- defining, 191
- documentation for data implementers, 202–203
- documentation for data leaders, 199–202
- generating profit, 192–197
- increasing, 195–196
- leadership skills, fortifying, 196–197
- overview, 189
- STAR framework, 197–199
- subject matter expertise versus, 190
- business consulting, 364
- business intelligence (BI), 247–250, 262–263
- business managers, feedback from, 305
- business mission, 330
- business model, choosing, 357–359
- business operations, improving, 210
- business process automation (BPA), 210
- Business Science University, 365
- business vision, 294–296, 330
- businesses, starting
- business model, choosing, 357–359
- data science entrepreneurs, 364–366
- Kam Lee, example of, 361–364
- overview, 357
- revenue model, choosing, 359–361
- business-to-business (B2B) company, 207
C
- call center operations case study, 257–262
- need, 257
- results, 258
- solution, 257
- technology stack, 262
- use case diagram, 261
- use cases, 258–260
- Cambridge Analytica scandal, 238, 283–284
- Canada Open Data website, 371–372
- career paths
- accreditations, 346–348
- businesses, starting
- business model, choosing, 357–359
- data science entrepreneurs, 364–366
- Kam Lee, example of, 361–364
- overview, 357
- revenue model, choosing, 359–361
- career accelerators, 348–349
- coding bootcamps, 348–349
- common, 341–343
- data implementers, 345–346
- data leaders, 354–356
- general discussion, 15–18
- generating wealth, 343–345
- networking and relationship-building, 349–350
- overview, 341
- project portfolio, building, 351–354
- thought leadership, 350–351
- CARTO, 388–390
- case studies
- AP content generation, 224–228
- call center operations, 257–262
- debt collection processes, 211–216
- defining, 202
- documentation for data implementers, 202–203
- documentation for data leaders, 199–202
- logistical efficiencies, 216–217
- real-time optimized logistics routing, 217–222
- STAR framework, 197–199
- cases, in machine learning, 41
- cash, as currency, 344
- Cassandra, Apache, 32
- categorical distributions, 55
- CCTV cameras, use of k-nearest neighbor algorithms by, 101
- Central Bank of Malaysia, 266–267
- centroids, 82
- channels
- mapping, 233–234
- performance, building around, 235
- scoring, 231, 235–237
- Character data type, SQL, 144
- chart graphics, standard, 171–173
- charts, Microsoft Excel, 152, 154–156
- choropleth map, 389
- churn, 14
- circle of influence, 363
- civic hackers, 370
- classes, 112–113, 121
- classification algorithms
- versus clustering, 90
- general discussion, 90–91
- instance-based learning classifiers, 90
- nearest neighbor analysis
- average, 94–97, 101–102
- k-nearest, 90, 97–100, 101
- overview, 93–94
- solving real-world problems with, 100–102
- overfitting in, 92–93
- overgeneralization in, 92–93
- overview, 89
- classification methods, 77
- click-streams, 22–23
- clinical informatics scientists, 14
- Choropleth map, 181, 182
- cloud storage solutions, 28–32
- cloud-warehouse solutions, 30–31
- clustering
- algorithms, 79–81
- classification versus, 90
- DBScan, 87–88
- general discussion, 78
- with hierarchical algorithms, 84–87
- kernel density estimation, 84
- k-means algorithm, 82–84
- overview, 77
- selecting algorithms based on function, 44, 46
- similarity metrics, 81–82
- clusters, 80
- coding, 13, 103, 104. See also Python; R programming language
- coding bootcamps, 348–349
- coding documentation, 202–203
- coding portfolio, building, 351–354
- Cogito, 257–262
- cognitive bias, 322
- collaborative data visualization platforms, 390–391
- collecting data, 11
- collective outliers, 71, 87
- colon operator, 129
- Color Scales conditional formatting, Microsoft Excel, 154
- column indexes, in SQL, 145–146
- comma-separated values (CSV), 11
- comments, in code, 120, 124
- commissioning manager, 308
- commodity servers, 33
- communicating data insights, 14–15
- company, researching
- business vision, mission, and values, 294–296
- data ethics, 306–308
- data resources, inventorying, 298–302
- data science team, unifying, 292–293
- data technologies, inventorying, 296–298
- efficient process for, 308–310
- overview, 291–292
- people-mapping, 303–304
- project pitfalls, avoiding, 305–306
- company culture, 334
- company email list, 233
- company website, 233
- comparative graphics, 173–176
- competitions, app development, 371
- computer vision technology, 210
- concatenate function, 123
- conditional formatting, Microsoft Excel, 152, 154
- conditional probability, 55–56
- connective edge, 94
- constant time series, 74
- constraints, in SQL, 145
- consulting and advising businesses, 358–359
- contact strategy, 212
- content analysis, 269
- content creation, AI-assisted
- Associated Press case study, 224–228
- GPT-3, 222–224
- for marketing improvements, 230
- overview, 222
- context, adding to data visualization, 184–186
- contextual outliers, 71
- continuous probability distribution, 54
- copyleft, 369
- core samples, 87
- correlation
- overview, 56
- Pearson, 56–58
- Spearman's rank, 58–59
- cosine similarity, 81
- Creative Commons licenses, 370
- credit card use patterns, 101
- CRM (customer relationship management), 100
- crowdsourcing, 378
- CSV (comma-separated values), 11
- culture, organizational, 334
- cumulative variance explained (CVE), 60–62, 64–65
- currency, types of, 344
- current state of company, assessing, 312
- current state summary statement, 329
- custom Python functions, 111
- customer acquisition, 237
- customer avatar, designing, 236
- customer churn analysis, 230
- customer relationship management (CRM), 100
- customer retention, 237
- customer service support, automated, 210
- CVE (cumulative variance explained), 60–62, 64–65
D
- DaaS (Data-as-a-Service) platform, 284–285
- Dancho, Matt, 365
- dashboard design, 167
- data analytics
- appropriate use of, 262–263
- BI versus, 249–250
- common challenges in, 252–253
- data-wrangling, 253–254
- Google Analytics, 250–251
- overview, 249
- types of analytics, 252
- data art
- data cleaning, 279, 379
- data companies, use of personal data by, 286
- data dictionary, requesting, 298–300, 315
- data engineering, 7, 8, 26–28, 364
- data entrepreneurs
- career paths, 343
- operational improvements, 207
- overview, 17–18
- real-life examples, 364–366
- separating dedicated team members as, 293
- data ethics. See also data privacy
- AI ethics
- assessing, 318–323
- collecting information about, 307–308
- in technical plan, 330
- data frame objects, 122, 123
- Data Futurology Podcast, 355
- data governance council, 324
- data governance policies, assessing, 308, 323–325
- data governance standards, 324
- data graphics
- comparative graphics, 173–176
- context, adding, 184–186
- overview, 170–171
- spatial plots and maps, 180–183
- standard chart graphics, 171–173
- statistical plots, 176–179
- testing, 183–184
- tools for
- CARTO, 388–390
- Infogram, 394–395
- Piktochart, 395–396
- RAWGraphs, 392–393
- Shiny applications by RStudio, 387–388
- Tableau Public, 390–391
- topology structures, 179–180
- data implementers
- career paths, 341–342
- decision support, 263
- documentation for, 202–203
- operational improvements, 206
- overview, 16
- real-life examples, 345–346
- separating dedicated team members as, 293
- data infrastructure architecture, 296–297
- data ingestion, 22, 280
- data insights, 8, 9–10, 14–15
- data integrity, 142
- data journalists, 14
- data lake, 23
- data leaders
- career paths, 342
- data storytelling by, 163
- documentation for, 199–202
- operational improvements, 206
- overview, 16–17
- real-life examples, 354–356
- separating dedicated team members as, 293
- data literacy training, 333
- data mart, 23
- data mining, 280
- data monetization
- AI hype, 275–277
- Clive Humby, 278
- data privacy initiatives, 285–288
- data products, 282
- data resources, 283–285
- data services, 278–281
- overview, 275
- data munging, 151
- data normalization, 279
- data partnerships, 284
- data points, 41
- data policies, 324
- data preparation services, 279–280
- data preprocessing, 269
- data privacy
- advertising, 283–284
- breaches of, 263
- company data ethics, researching, 306–307
- data partnerships, 284
- demand for, 238
- GDPR, 320–321
- latest developments, 285–288
- policies, assessing, 307, 325
- data privacy policy, 307, 325
- data processing, 279
- data product manager, 14, 239
- data products, 238–239, 282
- data quality issues, 300–302
- data resources
- alternatives analysis, 336
- current, in technical plan, 331
- inventorying, 298–302
- monetization of, 283–285
- recommendations, in technical plan, 332
- data roles, support for profit generation, 192–194
- data science. See also career paths
- applying to subject areas, 13–14
- data engineering versus, 8
- for decision support, 254–263
- defining, 8, 25
- key components, 10–15
- making use of, 8–10
- overview, 7–8
- statistics versus, 13
- Data Science Association (DSA), 349
- data science careers. See career paths
- data science strategy, 104, 138–139
- data science team
- recommendations in technical plan, 333–334
- unifying, 292–293
- data scientists
- communicating data insights, 14–15
- defining, 8, 25, 27–28
- key components of role, 10–15
- data service providers, directory of, 281
- data services, 278–281
- data showcasing
- coding portfolio, building, 352
- data graphics for, 171
- designing for, 166–167
- overview, 163
- data silos, 300–302
- data skillset
- alternatives analysis, 336
- data skill gap analysis, 317–318
- of relevant personnel, surveying, 304
- in technical plan, 331, 333
- data sources, 23–24
- data storage
- in cloud, 28–32
- cloud-warehouse solutions, 30–31
- Hadoop, 33–34
- HDFS, 33–34
- Kubernetes, 30
- MapReduce, 33
- MPP platforms, 34
- NoSQL databases, 31–32
- on-premise solutions, 32–34
- overview, 28
- serverless computing solutions, 29
- data storytelling
- coding portfolio, building, 352
- data graphics for, 171
- designing for, 162–163, 166–167
- overview, 14–15
- Data Strategy Plan Template, 328–335
- data structures, 107
- Data Studio, Google, 239
- data superhero archetypes, 15–18, 163
- data taxonomy services, 279
- data technology
- alternatives analysis, 336
- assessing use case impact on, 315
- brainstorming recommendations for, 339
- current, in technical plan, 331
- inventorying, 296–298
- recommendations, in technical plan, 332, 333
- data transformation, 279
- data types
- in Python, 106–109
- in SQL, 144
- data variety, 22–23
- data velocity, 21–22
- data visualization, 14–15. See also data graphics
- annotations, including, 185
- brainstorming, 165–166
- context, adding, 184–186
- data art, 164
- data showcasing, 163
- data storytelling, 162–163
- design style, choosing, 167–169
- emotional response, design for, 168–169
- ggplot2 package, 132, 133–134
- graphical elements, including, 186
- logical and calculating response, design for, 167–168
- main types of, 162–164
- MatPlotLib library, 117, 118–119
- in Microsoft Excel, 154–156
- overview, 161–162
- persuasive design, 186
- process for creating, 162
- purpose, defining, 166
- target audience, designing for, 164–167
- testing data graphics, 183–184
- tools for
- CARTO, 388–390
- RAWGraphs, 392–393
- Shiny applications by RStudio, 387–388
- Tableau Public, 390–391
- type of, choosing, 166–167
- data volume, 21
- data warehouse systems, 23, 30–31
- Data-as-a-Service (DaaS) platform, 284–285
- DATAcated, 364–365
- data-exploration tools, 384–386
- DataFrame object, Pandas library, 117
- Data.gov program, 370–371
- data.gov.uk, 372–373
- DATA.NASA.GOV, 374–375
- dataset X, 67–68
- DataWrangler, 383
- data-wrangling, 253–254, 352, 383
- Datawrapper, 396
- Date data type, SQL, 144
- DATE function, SQL, 148
- Date-Time data type, SQL, 144
- DBScan (density-based spatial clustering of applications with noise), 87–88
- debt collection processes case study, 211–216
- result, 212–213
- solution, 211–212
- technology stack, 216
- use case, 214–215
- use case diagram, 215
- decision engines, 103
- decision makers, data storytelling for, 162–163
- decision support system
- business intelligence, 247–249, 262–263
- data analytics
- appropriate use of, 262–263
- BI versus, 249–250
- common challenges in, 252–253
- data-wrangling, 253–254
- Google Analytics, 250–251
- overview, 249
- types of analytics, 252
- data science for
- appropriate use of, 262–263
- call center operations case study, 257–262
- data sources, 255–256
- overview, 254–255
- improving, 245–246
- overview, 245
- decision trees, 44, 46, 88
- decision-making
- deep learning, 45, 46, 47–48, 227, 267
- defect analysis, 209
- degree of accuracy, 52
- dendrograms, 85, 86
- density, 83, 84
- density smoothing function, 84
- density-based spatial clustering of applications with noise (DBScan), 87–88
- dependent variable (DV), 41
- descriptive analytics, 60, 64
- descriptive statistics, 52–53
- design style of data visualization, choosing, 167–169
- designing SQL database, 144–147
- diagnostic analytics, 64–65
- dictionaries, Python, 109
- digital product, 238–239, 282
- dimensionality reduction
- factor analysis, 63
- overview, 59
- principal component analysis, 64–65
- selecting algorithms, 44, 46
- singular value decomposition, 59–62
- directional movement, 83
- directors of data science, 14
- directory of online data service providers, 281
- dirty data, compressing, 60, 61
- disclaimers analysis, 269
- discrete probability distribution, 54
- dividend performance, 66
- document analysis, automated, 210
- documentation
- accountability for AI solutions, 320
- current state of company, assessing, 312
- data governance policies, 323–325
- for data implementers, 202–203
- for data leaders, 199–202
- data privacy policies, 325
- overview, 199
- document-oriented database, 32
- domain-specific data science, 25
- dot notation, 114
- DSA (Data Science Association), 349
- Dstreams, 48
- duplication of effort, 302
- DV (dependent variable), 41
- dynamic pricing, 231
E
- earnings growth potential, 66
- earnings quality rating, 66
- eigenvector, 62
- email list, company, 233
- emotionally provocative data visualizations, 168–169
- employees
- hiring new, 317, 334
- interviewing intended users, 337–338
- recommendations in technical plan, 333
- skillsets, surveying, 304
- ensemble algorithms, 45, 46
- enterprise IT architecture, 296
- entity analysis, 271–272
- error propagation, 88
- essential context, establishing, 206–207
- e-surveillance platform, 268
- ethics, data. See data ethics
- Euclidean distance metric, 81
- Excel, Microsoft
- Charting tool, 154–156
- Conditional Formatting feature, 154
- filtering in, 153–154
- macros, 158–160
- overview, 151–152
- PivotTables, 157–158
- quick data analysis with, 152–153
- executing data science project, 339–340
- executive summary, 329
- expectation value, 54
- explainability of AI solutions, 320–321
- Exversion, 379
- eyeballing clusters, 80, 84, 85
F
- FaaS (Function as a Service), 29
- Facebook DeepFace, 47
- factor analysis, 63, 132
- fault-tolerance, in HDFS, 34
- feature engineering, 42
- feature selection, 40, 42, 98
- features, 41, 95
- feedback, collecting, 305
- fee-for-service revenue model, 361
- file formats, 11–12
- filters, in Microsoft Excel, 152, 153–154
- financial services industry
- fraud prevention with NLP
- content analysis, 269
- data preprocessing, 269
- disclaimers analysis, 269
- metadata analysis, 269
- normalization, 270
- overview, 267–268
- phrase and entity analysis, 271–272
- token analysis, 271
- trade-off between risk and reward, 272–273
- lending risk, decreasing, 266–267
- overview, 265
- Flores, Felipe, 355
- Flume, Apache, 22
- FMCDM (fuzzy multiple criteria decision-making), 67
- for loop, 110
- forecast package, 132
- foreign key, 141, 142
- Forrester Research, Inc., 273
- four Ps, 240–241
- fourth industrial revolution, 208–209
- fraud prevention with NLP, 267–274
- content analysis, 269
- data preprocessing, 269
- disclaimers analysis, 269
- metadata analysis, 269
- normalization, 270
- overview, 267–268
- phrase and entity analysis, 271–272
- token analysis, 271
- trade-off between risk and reward, 272–273
- full outer JOIN function, SQL, 148, 149
- Function as a Service (FaaS), 29
- function call, 125
- functions
- Funnel Gorgeous, 365
- future state of company, 312
- future state vision statement, 330
- fuzzy multiple criteria decision-making (FMCDM), 67
G
- Gantt chart, 174, 176
- GaussianNB, 56
- General Architecture for Text Engineering (GATE), 151
- General Data Protection Regulation (GDPR), 320–321, 330
- generic vectors, 122
- geometric metrics, 81
- Gephi, 384–386
- ggplot2 package, 133–134
- GitHub portfolio, 353–354
- Gmail, 47
- goals of data science projects, 332
- Google Analytics, 239, 250–251
- Google BigQuery, 31
- Google Data Studio, 239
- Google Sheets, 152
- government data, open, 370–373
- GPT-3, 222–224, 230
- graph mesh network topology, 180
- graph models, 180
- graphics, data. See data graphics
- GraphX library, Apache Spark, 48
- Grayeb, Jennifer, 365
- grocery retail, use of average nearest neighbor algorithms by, 101–102
- GROUP function, SQL, 150
- Guru Path of the Data Science Bootcamp, Data Science Dojo, 349
H
- Hadoop, 20, 33–34, 35
- Hadoop distributed file system (HDFS), 22, 23, 33–34
- hairball graph, 385, 386
- hardware companies, use of personal data by, 286
- hash symbol, 120, 124
- HAVING function, SQL, 150
- HDFS blocks, 33
- Heartbeat algorithm, TrueAccord, 211–216
- help function, SciPy library, 117
- hidden layer, 45
- hierarchical clustering algorithms, 79, 84–87
- hierarchical tree topology, 180, 181
- high-variety data, 22
- hiring new employees, 317, 334
- histogram, 176–177, 178
- HR managers, feedback from, 305
- Humana case study, 257–262
- need, 257
- results, 258
- solution, 257
- technology stack, 262
- use case diagram, 261
- use cases, 258–260
- Humby, Clive, 275, 278
- hyperparameters, 203
- hypertargeted advertising, 230–231
I
- icons, used in book, 4
- igraph package, 134
- ImageQuilts, 382–383
- implementation plan. See technical plan
- implementing data use cases, 222
- implicit risk in AI, 319
- independent variable (IV), 41
- index, 141
- inferential statistics, 52–53
- Infogram, 394–395
- infographic tools, 393–396
- information, in POTI modeling, 315, 339
- information products, 282
- information products businesses, 358
- information redundancy, 63
- in-memory computing, 35, 48, 144
- inner JOIN function, SQL, 148, 149
- instance
- in machine learning, 41
- in R, 121
- instance-based algorithms, 44, 46
- instance-based learning classifiers, 90
- instantiating objects, 112
- interactive mode, R programming language, 121
- Internet of things, 23
- interpreter, SQL, 147
- inter-quartile range (IQR), 71–72
- interviews, conducting, 300, 305, 337–338
- investments, evaluating potential, 66–67
- IPUMS, 374
- IT professionals, feedback from, 305
- iterating, in R programming language, 127–129
- IV (independent variable), 41
J
- Jaccard distance metric, 82
- Jee, Ken, 346
- JOIN function, SQL, 148
K
- Kafka, Apache, 22
- kernel density estimation (KDE), 84
- kernel smoothing methods, 84
- key-value pair, 31
- key-value stores, 32
- k-means clustering algorithm, 82–84, 209
- k-nearest neighbor classification algorithm (kNN), 90, 97–100, 101
- Knoema, 376–377
- Kozyrkov, Cassie, 354
- KubeFlow product, 30
- Kubernetes, 30, 32
L
- labeled data, 90
- lake, data, 23
- latency, 21
- latent variables, 60, 63
- lazy learners, 90, 97
- lead scoring, 231
- leadership skills, fortifying, 196–197
- learning step, in machine learning, 40
- Lee, Kam, 361–364
- Lee, Vincent, 266–267
- left JOIN function, SQL, 148
- len function, 108
- lending risk, decreasing, 266–267
- LendUp case study, 211–216
- result, 212–213
- solution, 211–212
- technology stack, 216
- use case, 214–215
- use case diagram, 215
- libraries, Python
- MatPlotLib, 118–119
- NumPy, 114–116
- overview, 114
- Pandas, 117–118
- Scikit-learn, 119–120
- SciPy, 116–117
- licensing revenue model, 360
- lifetime value forecasting (LTV), 230
- line chart, 173
- Line Chart feature, Microsoft Excel, 156
- linear algebra
- factor analysis, 63
- overview, 59
- principal component analysis, 64–65
- singular value decomposition, 59–62
- linear regression, 67–69
- linear relationships, 57
- linear topological structure, 179
- lists, 108, 122, 123
- live events, 234
- local maximum density, 83
- local minimum density, 83, 84
- logical and calculating response, data visualization design for, 167–168
- logistic regression, 69, 132
- logistical operations, improving
- overview, 216–217
- real-time optimized logistics routing case study, 217–222
- loops, using in Python, 109–110
- low value of big data, 21
- low-code environment, 138–139
- low-density regions, 83
- lowest-hanging-fruit use case, 291, 311
- AI ethics, assessing, 318–323
- data governance, assessing, 323–325
- data privacy policies, assessing, 323–325
- data skill gap analysis, 317–318
- quick-win use cases, selecting, 313–316
- reviewing documentation, 312
- LTV (lifetime value forecasting), 230
M
- Ma, Danny, 345
- machine learning. See also clustering; nearest neighbor analysis
- Apache Spark, generating real-time analytics with, 48–49
- classification algorithms, 89–93
- coding portfolio, building, 352
- decision trees, 44, 46, 88–89
- defining, 26, 40
- learning styles, 42–43
- overview, 39
- processes, 40–41
- random forest algorithms, 89
- regression methods, 67–70
- reinforcement learning, 43
- selecting algorithms based on function, 44–48
- supervised algorithms, 42
- unsupervised algorithms, 43
- use cases, 40
- vocabulary associated with, 41–42
- WEKA application, 386
- machine learning engineers, 14, 26, 27–28
- machine learning model selection, 280
- machine learning model-tuning, 280
- macros, Microsoft Excel, 158–160
- management plan. See technical plan
- Manhattan distance metric, 81
- manuals for AI systems, 321
- manufacturing operations, improving, 208–210
- many-to-many relationship structure, 180
- map layers, 389
- Mapbox, 396
- mapping application, 388–390
- mapping channels, 233–234
- MapReduce, 33, 35
- market basket analysis, 231
- marketing channels, omnichannel analytics of, 233–238
- marketing data scientists, 14
- marketing improvements
- data products, 238–239
- marketing mix modeling, 239–243
- omnichannel analytics
- channel performance, building around, 235
- channels, mapping, 233–234
- data privacy, 238
- defining, 233
- overview, 232
- scoring channels, 235–237
- overview, 229
- popular use cases for, 229–232
- marketing mix modeling (MMM), 239–243
- marketing professionals, feedback from, 305
- marketing strategy development, 10
- mass customization production, 208–209
- massively parallel processing (MPP) platforms, 26, 34
- mathematical modeling, 12
- MatPlotLib library, 105, 118–119
- matrix
- generating with NumPy, 115
- R, 122
- MCDM (multiple criteria decision-making), 65–67
- media content creation. See content creation, AI-assisted
- metadata analysis, 269
- metadata repository, 298–300
- methods, Python, 112
- microbatch processing, 48
- microdata, 376
- Microsoft Excel
- Charting tool, 154–156
- Conditional Formatting feature, 154
- filtering in, 153–154
- macros, 158–160
- overview, 151–152
- PivotTables, 157–158
- quick data analysis with, 152–153
- Minkowski distance, 81
- mission statement, 294–296
- MLlib submodule, Apache Spark, 49
- mlogit (multinomial logit model), 132
- MMM (marketing mix modeling), 239–243
- model building services, 280–281
- model overfitting, 92–93
- model overgeneralization, 92–93
- model selection, machine learning, 280
- model-tuning, machine learning, 280
- monetization, data. See data monetization
- money, as currency, 344
- MongoDB, 32
- moving average techniques, 75
- MPP (massively parallel processing) platforms, 26, 34
- multicollinearity, 70
- multidimensional arrays, generating with NumPy, 114–115
- multidimensional datasets, 59
- multi-label learning, 99–100
- multinomial logit model (mlogit), 132
- MultinomialNB, 56
- multiple criteria decision-making (MCDM), 65–67
- multiple criteria evaluation, 65
- multiple dependencies, in SQL, 145
- multiple linear regression, 68
- multivariate analysis, 132
- multivariate normality (MVN), 64
- multivariate outlier detection, 73
- munging, data, 151
- MySQL, 141
N
- Naïve Bayes method, 44, 46, 55–56
- NASA open data, 374–375
- natural language processing (NLP), 230, 267–274
- content analysis, 269
- data preprocessing, 269
- disclaimers analysis, 269
- metadata analysis, 269
- normalization, 270
- overview, 267–268
- phrase and entity analysis, 271–272
- token analysis, 271
- trade-off between risk and reward, 272–273
- Natural Language Toolkit, Python, 151
- n-dimensional arrays, generating with NumPy, 114–115
- n-dimensional plot, 81
- nearest neighbor analysis
- average nearest neighbor algorithms, 94–97, 101–102
- k-nearest neighbor algorithms, 90, 97–100, 101
- overview, 93–94
- solving real-world problems with, 100–102
- neighborhood clustering algorithms, 87–88
- network analysis, 384
- network graph analysis, 134–135
- Network Planning Tools (NPT), 218–222
- network topologies, 384
- networking, 349–350
- neural networks, 45, 46, 47, 210
- n-grams, 271–272
- Nimble Company, The, 365
- NLP. See natural language processing
- no-code environment, 138–139
- nodes, 33
- noise, 53, 99
- noncore samples, 87
- nonglobular clustering, 86–87
- non-interactive mode, R programming language, 121
- nonlinear relationships, 58, 59
- nonredundancy of columns, in SQL, 145
- nonstationary processes in time series, 74–75
- normal distributions, 55
- normalization, 145–147, 270
- NoSQL databases, 31–32
- Notion, 203
- NPT (Network Planning Tools), 218–222
- NULL values, SQL, 145, 147
- numbers data type, Python, 107
- Numerical data type, SQL, 144
- NumPy library, 114–116
O
- object-oriented language, 121
- objects
- observations, 41, 94, 95
- OLS (ordinary least squares) regression methods, 70
- omnichannel analytics
- channel performance, building around, 235
- channels, mapping, 233–234
- data privacy, 238
- defining, 233
- overview, 232
- scoring channels, 235–237
- online data service providers, directory of, 281
- on-premise storage solutions, 32–34
- open data resources
- Canada Open Data website, 371–372
- Data.gov program, 370–371
- data.gov.uk, 372–373
- Exversion, 379
- Knoema, 376–377
- NASA, 374–375
- OSM, 380
- overview, 369–370
- Quandl, 378–379
- US Census Bureau data, 373–374
- World Bank Open Data page, 375–376
- open government license, 372, 373
- open movement, 369
- OpenAI, 223
- open-source SQL implementations, 141
- OpenStreetMap (OSM), 380
- operational improvement
- in business operations, 210
- content creation, AI-assisted
- AP content generation rates, increasing, 224–228
- GPT-3, 222–224
- overview, 222
- data science contributions, 207–208
- debt collection processes case study, 211–216
- essential context, establishing, 206–207
- logistical efficiencies, 216–217
- in manufacturing operations, 208–210
- overview, 205
- real-time optimized logistics routing case study, 217–222
- operators, in R programming language, 124–127
- optical character recognition, 217
- optimal use case, selecting
- AI ethics, assessing, 318–323
- data governance, assessing, 323–325
- data privacy policies, assessing, 325
- data skill gap analysis, 317–318
- overview, 311
- quick-win, selecting, 313–316
- reviewing documentation, 312
- ordinal variables, 55
- ordinary least squares (OLS) regression methods, 70
- organization, in POTI modeling, 315, 339
- organizational charts, requesting, 303–304
- organizational culture, 334
- organizational structure, in technical plan, 330–331
- OSM (OpenStreetMap), 380
- outer JOIN function, SQL, 148
- outlier detection, 65
- DBScan for, 87
- extreme values, analyzing, 70–71
- in Microsoft Excel, 154–156
- with multivariate analysis, 73
- types of outliers, 71
- with univariate analysis, 71–72
- overfitting, model, 92–93
- overgeneralization, 92–93
P
- packages, R programming language, 131–135
- packed circle diagram, 174, 175
- paid ads, 234
- Pandas library, 117–118
- parallel distributed processing, 33
- parallel processing, 30
- partitional clustering algorithms, 79
- patterns in time series, identifying, 74–75
- PCA (principal component analysis), 60, 64–65, 73
- Peak Volume Alignment Tool (PVAT), 218
- Pearson correlation, 56–58
- people-mapping, 303–304
- perceptron, 45
- personal data. See data privacy
- persuasive design, in data visualization, 186
- phrase analysis, 271–272
- pie chart, 173, 174
- Piktochart, 395–396
- PivotCharts, Microsoft Excel, 158
- PivotTables, Microsoft Excel, 157–158
- placing product, 241
- plan of action. See technical plan
- PLC (programmable logic controller), 209
- point map, 181, 182
- point outliers, 71
- point pattern data, 135
- policing applications, 10
- polymorphic classes, 121
- polymorphic functions, 131
- population, 53
- portfolio, building, 351–354
- post conditions, 200
- PostgreSQL, 141
- POTI model, 314–316, 338–339
- preconditions, 200
- predictant, 41, 67–68
- predictive analytics, 64–65
- predictive applications, 10, 30
- predictive maintenance, 217
- prescriptive analytics, 64–65
- presentation skills, 163
- price, 240–241
- pricing, dynamic, 231
- primary key, 141, 142, 145
- principal component analysis (PCA), 60, 64–65, 73
- principal components, 64
- print function, 110–111, 121
- probability
- conditional, 55–56
- distributions, 53–55
- inferential statistics, 52–53
- processes, in POTI modeling, 314, 338
- processing data, 35
- product development, 237
- product features, 240
- production forecasting, 210
- profit-forming data science projects, 192–197
- data roles, support for, 192–194
- documentation for data implementers, 202–203
- documentation for data leaders, 199–202
- STAR framework, 197–199
- programmable logic controller (PLC), 209
- programming languages. See also Python; R programming language
- in data science strategy, 104
- overview, 13
- Visual Basic for Applications, 158, 160
- project milestones, keeping close eye on, 340
- promotion, 241
- psychographics, 165
- purpose of data visualization, defining, 164, 166
- PVAT (Peak Volume Alignment Tool), 218
- pyplot function, 118–119
- Python
- classes, 112–113
- in data science strategy, 104
- data types, 106–109
- dictionaries, 109
- functions, 106, 110–112
- general discussion, 104–106
- libraries
- MatPlotLib, 118–119
- NumPy, 114–116
- overview, 114
- Pandas, 117–118
- Scikit-learn, 119–120
- SciPy, 116–117
- lists, 108
- loops, using, 109–110
- Natural Language Toolkit, 151
- numbers data type, 107
- objects, 105–106
- overview, 13
- sets, 109
- strings, 107–108
- tuples, 108–109
Q
- Quality Control Charts package (qcc), 132
- Quandl, 378–379
- quantitative methods, 12
- querying data, 11
- question-and-asset request database, 308–309
- quick-win use cases, selecting, 313–316
R
- R programming language
- basic vocabulary, 121–124
- in data science strategy, 104
- functions, 123–124
- functions and operators, methods for using, 124–127
- general discussion, 120
- ggplot2 package, 133–134
- igraph package, 134
- iterating in, 127–129
- objects in, 121–122, 129–131
- overview, 13
- spatstat package, 135
- statistical analysis packages, 131–133
- statnet package, 134–135
- r variable, 56–58
- random forest algorithms, 89
- random sampling, 40–41
- random variable, 54
- ranking and scoring data, 280
- raster surface map, 181, 183
- RAWGraphs, 392–393
- RDA (Research Data Alliance), 349
- RDBMS (relational database management system), 8, 22–23, 26, 31, 141–143
- Read phase, Synthesys AI, 269–271
- real-time big data analytics, generating with Apache Spark, 48–49
- real-time optimized logistics routing case study, 217–222
- result, 218–222
- solution, 218
- technology stack, 221–222
- use case, 219
- use case diagram, 220
- real-time processing framework, 35
- recommendation engines, 229–230
- recommendations summary, 329
- recommending plan of action. See technical plan
- recyclability, 128
- Redshift, Amazon, 30–31
- redundancy, in HDFS, 34
- reference table, 299
- referral sites, 234
- regression algorithms, 46
- regression methods
- linear, 67–69
- logistic, 69
- in marketing mix modeling, 241–242
- ordinary least squares, 70
- overview, 67
- production forecasting, 210
- selecting algorithms, 44
- regularization algorithms, 44, 46
- reinforcement learning, 43
- relational database management system (RDBMS), 8, 22–23, 26, 31, 141–143
- relationship-building, 349–350
- Relative macros, Microsoft Excel, 159–160
- Remember icon, 4
- report writing, automated, 210
- Research Data Alliance (RDA), 349
- researching company
- business vision, mission, and values, 294–296
- data ethics, 306–308
- data resources, inventorying, 298–302
- data science team, unifying, 292–293
- data technologies, inventorying, 296–298
- efficient process for, 308–310
- overview, 291–292
- people-mapping, 303–304
- project pitfalls, avoiding, 305–306
- residuals, 68
- Resolve phase, Synthesys AI, 271–272
- resources, open data. See open data resources
- retail stores, use of k-nearest neighbor algorithms by, 101
- revenue model, choosing, 359–361
- right JOIN function, SQL, 148
- risk priority number (RPN), 242
- robot workcell, 209
S
- SaaS (Software as a Service), 26, 282
- SaaS business model, 359
- SafeGraph, 282, 284–285
- Sahota, Harpreet, 345–346
- sales calls, 234
- sales channels, omnichannel analytics approach for, 233–238
- sales professionals, feedback from, 305
- sample, 53
- scatterplot, 177, 178
- scatterplot charts, 133–134
- scatterplot matrix, 177, 179
- Scikit-learn library, 119–120
- SciPy library, 116–117
- scoring channels, 231, 235–237
- scraping websites, 14
- script files, 11
- sculpting data, 383
- search engine optimization (SEO), 234
- seasonality, 74
- security cameras, use of k-nearest neighbor algorithms by, 101
- security of cloud storage, 29
- SELECT function, SQL, 147–148
- self-learning networks, 45
- Self-Taught Data Scientist Curriculum, 362
- self-tuning vision systems, 210
- semistructured data, 8, 22–23
- sentiment analysis, 267, 269–273
- SEO (search engine optimization), 234
- sequences, 95
- Series object, Pandas library, 117
- serverless computing solutions, 29
- service development, 237
- service-based businesses, 357–358
- services revenue model, 361
- sets, Python, 109
- shared variance, 63
- Sheets, Google, 152
- Shiny applications, RStudio, 387–388
- showcasing, data. See data showcasing
- silhouette coefficient, 83
- similarity metrics, 81–82
- single-link algorithm, 94
- singular value decomposition (SVD), 59–62
- skills
- alternatives analysis, 336
- coding portfolio, building, 351–354
- data skill gap analysis, 317–318
- of relevant personnel, surveying, 304
- in technical plan, 331, 333
- upgrading, 9
- Smart-Reply, Gmail, 47
- SME (subject matter expert), 10–11, 13–14, 190
- Smith, Heather, 355–356
- Snowflake service, 31
- social media, website traffic from, 234
- social network analysis, 134–135
- software, as data product, 238–239
- software applications. See applications
- Software as a Service (SaaS), 26, 282
- software feature, as data product, 238–239
- sources of big data, 23–24
- Spark, Apache, 35, 48–49
- Spark SQL, 48
- sparse matrices, compressing, 60, 61
- spatial data analytics, 388–390
- spatial map, 180–183
- spatial plot, 180–183
- spatial point pattern analysis, 135
- spatstat package, 135
- Spearman's rank correlation, 58–59
- SQL. See Structured Query Language
- Sqoop, Apache, 22
- St. Lawrence, Sadie, 365
- stacked chart, 174, 176
- stakeholder management, 163
- stakeholders, feedback from, 305
- standard chart graphics, 171–173
- STAR framework, 295–296
- assessing current state, 312
- data skill gap analysis, 317–318
- general discussion, 197–199
- recommending plan of action, 328–329
- Survey step, 313
- start-ups
- business model, choosing, 357–359
- data science entrepreneurs, 364–366
- Kam Lee, example of, 361–364
- overview, 357
- revenue model, choosing, 359–361
- state machine, 212
- statistical plots, 176–179
- statistics
- versus data science, 13
- defining, 52
- deriving insights from, 12–13
- descriptive, 52–53
- inferential, 52–53
- statnet package, 134–135
- STEM degrees, 346–348
- stochastic approach, 12
- storing data. See data storage
- storytelling, data. See data storytelling
- Strachnyi, Kate, 364–365
- strategic plan. See technical plan
- Streaming module, Apache Spark, 48
- strings, Python, 107–108
- structured data, 8, 22–23
- Structured Query Language (SQL)
- constraints, designing, 145
- data types, defining, 144
- database design, 144–147
- functions, 147–151
- general discussion, 139–141
- normalization, 145–147
- open-source implementations, 141
- overview, 11, 13
- RDBMSs, understanding, 141–143
- text mining in, 151
- subject areas, applying data science to, 13–14
- subject matter expert (SME), 10–11, 13–14, 190
- subject-matter segregation, in SQL, 146
- subscriptions revenue model, 360–361
- success scenario, 201
- sum function, 108
- superhero archetypes, 15–18, 163
- supervised machine learning, 42, 77
- surveys, conducting, 300, 305
- survival analysis, 42
- SVD (singular value decomposition), 59–62
- SWOT analysis, 312
- Sydney Data Science, 345
- Synthesys AI solution, 267–274
- content analysis, 269
- data preprocessing, 269
- disclaimers analysis, 269
- metadata analysis, 269
- normalization, 270
- overview, 267–268
- phrase and entity analysis, 271–272
- token analysis, 271
T
- Tableau Public, 390–391
- tangible products, 238–239
- target audience, designing for, 164–167
- target variable, 41
- technical plan
- alternatives analysis, 335–336
- executing, 339–340
- general discussion, 327–328
- interviewing intended users, 337–338
- outline for, 329–335
- POTI modeling future state, 338–339
- purpose of, 327
- Technical Stuff icon, 4
- technology, data. See data technology
- technology stacks in case studies
- AP content generation rates, increasing, 228
- call center operations, 262
- debt collection processes, 216
- real-time optimized logistics routing, 221–222
- test set, 40, 92
- testing data graphics, 183–184
- Text data type, SQL, 144
- text mining, in SQL, 151
- thought leadership, 350–351
- three Vs of big data, 21–23
- throughput, 22
- time, as currency, 344
- time series analysis, 73–76
- time-series data, 111, 376
- time-to-market, 207, 359
- Tip icon, 4
- Titanic catastrophe, survival rates from, 88–89
- token analysis, 271
- tools, data science
- CARTO, 388–390
- DataWrangler, 383
- Gephi, 384–386
- ImageQuilts, 382–383
- Infogram, 394–395
- overview, 381–382
- Piktochart, 395–396
- RAWGraphs, 392–393
- Shiny by RStudio, 387–388
- Tableau Public, 390–391
- WEKA application, 386
- top-down dendrograms, 85
- topology structures, 179–180
- training and retraining machine learning models, 280
- training recommendations, 333
- training set, 40, 92
- tree map, 174, 177
- tree network topology, 180
- trended time series, 74
- trends, identifying, 154–156
- tri-grams, 271–272
- TrueAccord, 211–216
- Tukey boxplotting, 71–72
- Tukey outlier labeling, 71–72
- tuples, 33, 95, 108–109
- Turing test, 223
- 2 × 4 matrix, generating with NumPy, 115
U
- UCLA Extension: 10-week data science intensive course, 348
- under-pricing, 241
- uniform probability distribution, 53
- unit sales revenue model, 360
- United Parcel Service (UPS) case study, 217–222
- result, 218–222
- solution, 218
- technology stack, 221–222
- use case, 219
- use case diagram, 220
- univariate analysis
- outlier detection, 71–72
- time series data, modeling, 75–76
- unstructured data, 8, 22–23
- unsupervised machine learning, 43, 78, 80
- upgrading data skills, 9
- UPS case study. See United Parcel Service case study
- US Census Bureau data, 373–374
- use case diagram
- AP content generation case study, 227
- call center operations case study, 261
- debt collection processes case study, 215
- overview, 201
- real-time optimized logistics routing case study, 220
- use cases
- AP content generation rates, increasing, 226
- call center operations case study, 258–260
- debt collection processes case study, 214–215
- defining, 200
- documentation for data implementers, 202–203
- documentation for data leaders, 199–202
- elements to include in, 200
- for marketing
- data products, 238–239
- marketing mix modeling, 239–243
- omnichannel analytics, 232–238
- popular, 229–232
- operational improvement, by industry, 208
- optimal, selecting
- AI ethics, assessing, 318–323
- data governance, assessing, 323–325
- data privacy policies, assessing, 325
- data skill gap analysis, 317–318
- overview, 311
- quick-win, selecting, 313–316
- reviewing documentation, 312
- real-time optimized logistics routing case study, 219
- STAR framework, 197–199
- in technical plan, 330
V
- values of company, 294–296
- value-to-data-quantity ratio, 21
- Vanderplas, Jake, 353–354
- variance, shared, 63
- variety, data, 22–23
- VBA (Visual Basic for Applications), 158, 160
- vectorization, 127
- vectors
- velocity, data, 21–22
- vertically stacked card infographics, 394
- Visual Basic for Applications (VBA), 158, 160
- visual estimation of clusters, 80, 84, 85
- volume, data, 21
W
- Waikato Environment for Knowledge Analysis (WEKA), 386
- Warning icon, 4
- wealth, generating, 343–345
- web analytics, 232. See also omnichannel analytics
- web-based data visualization, 392–393
- web-based documents, 12
- Weber, Eric, 354
- web-scraping tools, 382–383
- website, company, 233
- weighted average, 54
- WEKA (Waikato Environment for Knowledge Analysis), 386
- while loop, 110
- Wilson, Zach, 346
- Women in Data, 350, 365
- word cloud, 174, 177
- World Bank Open Data page, 375–376
- wrappers, 379