Index
Clojure for Data Science
Table of Contents
Clojure for Data Science
Credits
About the Author
Acknowledgments
About the Reviewer
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. Statistics
Downloading the sample code
Running the examples
Downloading the data
Inspecting the data
Data scrubbing
Descriptive statistics
The mean
Interpreting mathematical notation
The median
Variance
Quantiles
Binning data
Histograms
The normal distribution
The central limit theorem
Poincaré's baker
Generating distributions
Skewness
Quantile-quantile plots
Comparative visualizations
Box plots
Cumulative distribution functions
The importance of visualizations
Visualizing electorate data
Adding columns
Adding derived columns
Comparative visualizations of electorate data
Visualizing the Russian election data
Comparative visualizations
Probability mass functions
Scatter plots
Scatter transparency
Summary
2. Inference
Introducing AcmeContent
Download the sample code
Load and inspect the data
Visualizing the dwell times
The exponential distribution
The distribution of daily means
The central limit theorem
Standard error
Samples and populations
Confidence intervals
Sample comparisons
Bias
Visualizing different populations
Hypothesis testing
Significance
Testing a new site design
Performing a z-test
Student's t-distribution
Degrees of freedom
The t-statistic
Performing the t-test
Two-tailed tests
One-sample t-test
Resampling
Testing multiple designs
Calculating sample means
Multiple comparisons
Introducing the simulation
Compile the simulation
The browser simulation
jStat
B1
Scalable Vector Graphics
Plotting probability densities
State and Reagent
Updating state
Binding the interface
Simulating multiple tests
The Bonferroni correction
Analysis of variance
The F-distribution
The F-statistic
The F-test
Effect size
Cohen's d
Summary
3. Correlation
About the data
Inspecting the data
Visualizing the data
The log-normal distribution
Visualizing correlation
Jittering
Covariance
Pearson's correlation
Sample r and population rho
Hypothesis testing
Confidence intervals
Regression
Linear equations
Residuals
Ordinary least squares
Slope and intercept
Interpretation
Visualization
Assumptions
Goodness-of-fit and R-square
Multiple linear regression
Matrices
Dimensions
Vectors
Construction
Addition and scalar multiplication
Matrix-vector multiplication
Matrix-matrix multiplication
Transposition
The identity matrix
Inversion
The normal equation
More features
Multiple R-squared
Adjusted R-squared
Incanter's linear model
The F-test of model significance
Categorical and dummy variables
Relative power
Collinearity
Multicollinearity
Prediction
The confidence interval of a prediction
Model scope
The final model
Summary
4. Classification
About the data
Inspecting the data
Comparisons with relative risk and odds
The standard error of a proportion
Estimation using bootstrapping
The binomial distribution
The standard error of a proportion formula
Significance testing proportions
Adjusting standard errors for large samples
Chi-squared multiple significance testing
Visualizing the categories
The chi-squared test
The chi-squared statistic
The chi-squared test
Classification with logistic regression
The sigmoid function
The logistic regression cost function
Parameter optimization with gradient descent
Gradient descent with Incanter
Convexity
Implementing logistic regression with Incanter
Creating a feature matrix
Evaluating the logistic regression classifier
The confusion matrix
The kappa statistic
Probability
Bayes theorem
Bayes theorem with multiple predictors
Naive Bayes classification
Implementing a naive Bayes classifier
Evaluating the naive Bayes classifier
Comparing the logistic regression and naive Bayes approaches
Decision trees
Information
Entropy
Information gain
Using information gain to identify the best predictor
Recursively building a decision tree
Using the decision tree for classification
Evaluating the decision tree classifier
Classification with clj-ml
Loading data with clj-ml
Building a decision tree in clj-ml
Bias and variance
Overfitting
Cross-validation
Addressing high bias
Ensemble learning and random forests
Bagging and boosting
Saving the classifier to a file
Summary
5. Big Data
Downloading the code and data
Inspecting the data
Counting the records
The reducers library
Parallel folds with reducers
Loading large files with iota
Creating a reducers processing pipeline
Curried reductions with reducers
Statistical folds with reducers
Associativity
Calculating the mean using fold
Calculating the variance using fold
Mathematical folds with Tesser
Calculating covariance with Tesser
Commutativity
Simple linear regression with Tesser
Calculating a correlation matrix
Multiple regression with gradient descent
The gradient descent update rule
The gradient descent learning rate
Feature scaling
Feature extraction
Creating a custom Tesser fold
Creating a matrix-sum fold
Calculating the total model error
Creating a matrix-mean fold
Applying a single step of gradient descent
Running iterative gradient descent
Scaling gradient descent with Hadoop
Gradient descent on Hadoop with Tesser and Parkour
Parkour distributed sources and sinks
Running a feature scale fold with Hadoop
Running gradient descent with Hadoop
Preparing our code for a Hadoop cluster
Building an uberjar
Submitting the uberjar to Hadoop
Stochastic gradient descent
Stochastic gradient descent with Parkour
Defining a mapper
Parkour shaping functions
Defining a reducer
Specifying Hadoop jobs with Parkour graph
Chaining mappers and reducers with Parkour graph
Summary
6. Clustering
Downloading the data
Extracting the data
Inspecting the data
Clustering text
Set-of-words and the Jaccard index
Tokenizing the Reuters files
Applying the Jaccard index to documents
The bag-of-words and Euclidean distance
Representing text as vectors
Creating a dictionary
Creating term frequency vectors
The vector space model and cosine distance
Removing stop words
Stemming
Clustering with k-means and Incanter
Clustering the Reuters documents
Better clustering with TF-IDF
Zipf's law
Calculating the TF-IDF weight
k-means clustering with TF-IDF
Better clustering with n-grams
Large-scale clustering with Mahout
Converting text documents to a sequence file
Using Parkour to create Mahout vectors
Creating distributed unique IDs
Distributed unique IDs with Hadoop
Sharing data with the distributed cache
Building Mahout vectors from input documents
Running k-means clustering with Mahout
Viewing k-means clustering results
Interpreting the clustered output
Cluster evaluation measures
Inter-cluster density
Intra-cluster density
Calculating the root mean square error with Parkour
Loading clustered points and centroids
Calculating the cluster RMSE
Determining optimal k with the elbow method
Determining optimal k with the Dunn index
Determining optimal k with the Davies-Bouldin index
The drawbacks of k-means
The Mahalanobis distance measure
The curse of dimensionality
Summary
7. Recommender Systems
Download the code and data
Inspect the data
Parse the data
Types of recommender systems
Collaborative filtering
Item-based and user-based recommenders
Slope One recommenders
Calculating the item differences
Making recommendations
Practical considerations for user and item recommenders
Building a user-based recommender with Mahout
k-nearest neighbors
Recommender evaluation with Mahout
Evaluating distance measures
The Pearson correlation similarity
Spearman's rank similarity
Determining optimum neighborhood size
Information retrieval statistics
Precision
Recall
Mahout's information retrieval evaluator
F-measure and the harmonic mean
Fall-out
Normalized discounted cumulative gain
Plotting the information retrieval results
Recommendation with Boolean preferences
Implicit versus explicit feedback
Probabilistic methods for large sets
Testing set membership with Bloom filters
Jaccard similarity for large sets with MinHash
Reducing pair comparisons with locality-sensitive hashing
Bucketing signatures
Dimensionality reduction
Plotting the Iris dataset
Principle component analysis
Singular value decomposition
Large-scale machine learning with Apache Spark and MLlib
Loading data with Sparkling
Mapping data
Distributed datasets and tuples
Filtering data
Persistence and caching
Machine learning on Spark with MLlib
Movie recommendations with alternating least squares
ALS with Spark and MLlib
Making predictions with ALS
Evaluating ALS
Calculating the sum of squared errors
Summary
8. Network Analysis
Download the data
Inspecting the data
Visualizing graphs with Loom
Graph traversal with Loom
The seven bridges of Königsberg
Breadth-first and depth-first search
Finding the shortest path
Minimum spanning trees
Subgraphs and connected components
SCC and the bow-tie structure of the web
Whole-graph analysis
Scale-free networks
Distributed graph computation with GraphX
Creating RDGs with Glittering
Measuring graph density with triangle counting
GraphX partitioning strategies
Running the built-in triangle counting algorithm
Implement triangle counting with Glittering
Step one – collecting neighbor IDs
Steps two, three, and four – aggregate messages
Step five – dividing the counts
Running the custom triangle counting algorithm
The Pregel API
Connected components with the Pregel API
Step one – map vertices
Steps two and three – the message function
Step four – update the attributes
Step five – iterate to convergence
Running connected components
Calculating the size of the largest connected component
Detecting communities with label propagation
Step one – map vertices
Step two – send the vertex attribute
Step three – aggregate value
Step four – vertex function
Step five – set the maximum iterations count
Running label propagation
Measuring community influence using PageRank
The flow formulation
Implementing PageRank with Glittering
Sort by highest influence
Running PageRank to determine community influencers
Summary
9. Time Series
About the data
Loading the Longley data
Fitting curves with a linear model
Time series decomposition
Inspecting the airline data
Visualizing the airline data
Stationarity
De-trending and differencing
Discrete time models
Random walks
Autoregressive models
Determining autocorrelation in AR models
Moving-average models
Determining autocorrelation in MA models
Combining the AR and MA models
Calculating partial autocorrelation
Autocovariance
PACF with Durbin-Levinson recursion
Plotting partial autocorrelation
Determining ARMA model order with ACF and PACF
ACF and PACF of airline data
Removing seasonality with differencing
Maximum likelihood estimation
Calculating the likelihood
Estimating the maximum likelihood
Nelder-Mead optimization with Apache Commons Math
Identifying better models with Akaike Information Criterion
Time series forecasting
Forecasting with Monte Carlo simulation
Summary
10. Visualization
Download the code and data
Exploratory data visualization
Representing a two-dimensional histogram
Using Quil for visualization
Drawing to the sketch window
Quil's coordinate system
Plotting the grid
Specifying the fill color
Color and fill
Outputting an image file
Visualization for communication
Visualizing wealth distribution
Bringing data to life with Quil
Drawing bars of differing widths
Adding a title and axis labels
Improving the clarity with illustrations
Adding text to the bars
Incorporating additional data
Drawing complex shapes
Drawing curves
Plotting compound charts
Output to PDF
Summary
Index