Preface

The term “Big Data” refers to data that is so large, fast or complex that it’s difficult or impossible to process using traditional methods.

How can you perform statistical analysis of Big Data using software that you can download for free?

Big Data is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software (Wikipedia, 2019a).

This book provides a multitude of web access points for downloads of “Free and open source software” (FOSS).

Open source software (OSS) is software with source code that anyone can inspect, modify, and enhance. According to Wikipedia (2019b), open source software is a type of computer software in which source code is released under a license in which the copyright holder grants users the rights to study, change, and distribute the software to anyone and for any purpose. Open source software may be developed in a collaborative public manner.

The role of open source software in Big Data storage has been presented by others, such as Ahuja et al. (2018) in Segall & Cook (2018). Several listings of open source software and tools are available: Big Data Tools (Eduonix, 2018), Top Open Source Tools for Big Data (Harvey, 2012), Open Source Big Data Analysis Platforms and Tools (Harvey, 2016a), Open Source Big Data Business Intelligence Tools and Resources (Harvey, 2016b; I-Intelligence, 2016), Open Source File Systems and Programming Languages (Harvey, 2016c), Open Source Big Data Tools for Transfer and Aggregation (Harvey, 2016d), Open Source Data Mining Tools (Harvey, 2016e), and Open Source Big Data Databases (Harvey, 2016f).

Open source topics discussed in the literature include forging a future with open source (Brasseur, 2018), how developers promote open source projects (Borges & Valente, 2019), bots that coordinate work in open source software projects (Hukal et al., 2019), 23 open source free statistical, data analysis, and notebook projects for data scientists (Mu, 2019), the innovations of open source (Riehle, 2019), open source license compliance (Schottle, 2019), and how to select open source components (Spinellis, 2019).

This book first explains what open source software (OSS) is and provides an introductory background on Big Data with many examples.

This book is motivated by the popularity of open source software (OSS) as a way to solve problems in computer science. One key challenge is analyzing Big Data, given the large volumes of data that organizations process. Researchers and professionals need research on the foundations of open source software programs and on how these programs can successfully analyze statistical data.

The book is organized into seven chapters. Table 1 indicates which of the seven chapters cover each of the topics listed, and a description of each chapter follows the table.

Table 1. Topics covered in each chapter

TOPIC	Chapter 1	Chapter 2	Chapter 3	Chapter 4	Chapter 5	Chapter 6	Chapter 7
Cluster Analysis				X
Data Analytics	X	X		X	X
Data Visualization	X		X	X	X	X	X
Fatality Rate Modeling	X				X	X
High Performance Computing	X	X
Machine Learning						X	X
Neural Networks				X	X		X
Python			X			X	X
R Programming			X	X	X
Statistical Coding		X	X	X	X	X
Time Series Forecasting						X	X

Chapter 1 discusses what Open Source Software (OSS) is, its relationship to Big Data, how it differs from other types of software, and its software development cycle. Open Source Software is a type of computer software whose source code is released under a license in which the copyright holder grants users the rights to study, change, and distribute the software to anyone and for any purpose. Big Data are data sets so voluminous and complex that traditional data processing application software is inadequate to deal with them.

Chapter 1 also discusses what Big Data is, its characteristics, and how the information revolution of Big Data is transforming our lives. Big Data can be discrete or continuous stream data, and can be accessed using many kinds of computing devices, ranging from supercomputers and personal workstations to mobile devices and tablets. The chapter discusses how Fog Computing can be performed with Cloud Computing as a mechanism for visualization of Big Data. Examples of visualization techniques for Big Data transmitted by devices connected through the Internet of Things (IoT) are presented for real data from the Fatality Analysis Reporting System (FARS), managed by the National Highway Traffic Safety Administration (NHTSA) of the United States Department of Transportation (USDoT). Chapter 1 also presents a summary of additional web-based Big Data visualization software.

Chapter 2 discusses open source software (OSS) and associated technologies for the processing of Big Data. This includes discussions of Hadoop-related projects and representative OSS for the Big Data stack. The current top open source data tools and frameworks are presented, such as SMACK, an acronym for the open source technologies Spark, Mesos, Akka, Cassandra, and Kafka, which together compose the ingestion, aggregation, analysis, and storage layers for Big Data processing. An introduction to open source statistical software is presented with tabular summaries and categories for each of 38 OSS packages, including descriptive features and URLs for free downloads. The current challenges of Big Data and open source software (OSS) are also discussed.

Chapter 3 introduces the two most popular Open Source Statistical Software (OSSS) packages, R and Python, along with their Integrated Development Environments (IDEs) and Graphical User Interfaces (GUIs). Additional OSSS, such as JASP, PSPP, GRETL, SOFA Statistics, Octave, KNIME, and Scilab, are also introduced in this chapter with function descriptions and modeling examples. Chapter 3 is intended as a reference that helps readers select appropriate open source software when a statistical analysis task arises. Each software package is described in words, and its working platform is presented together with selected numerical, descriptive, and analysis examples. Readers can thereby gain a direct and in-depth understanding of each software package and its functional highlights.

Chapter 4 discusses several popular clustering functions and open source software packages in R, and the feasibility of using them on larger datasets. These include the kmeans() function, the pvclust package, and the DBSCAN (Density-based Spatial Clustering of Applications with Noise) package, which implement K-means, hierarchical, and density-based clustering, respectively. Dimension reduction methods such as PCA (Principal Component Analysis) and SVD (Singular Value Decomposition), as well as the choice of distance measure, are explored as methods to improve the performance of hierarchical and model-based clustering methods on larger datasets. These methods are illustrated through an application to a dataset of RNA-Sequencing expression data for cancer patients obtained from the Cancer Genome Atlas Kidney Clear Cell Carcinoma (TCGA-KIRC) data collection from The Cancer Imaging Archive (TCIA) (Akin, 2016).
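To make the K-means idea in the Chapter 4 description concrete, the following is a minimal plain-Python sketch of Lloyd's algorithm, one common way to compute a K-means clustering. It is an illustration only: the function names, the toy two-dimensional points, and the starting centroids are assumptions made for this example, and it is not the algorithm that kmeans() in R necessarily runs by default, nor the code used in Chapter 4.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda i: squared_distance(p, centroids[i]))
        for p in points
    ]


def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            # Keep an empty cluster's centroid where it was.
            new_centroids.append(old)
            continue
        new_centroids.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
        )
    return new_centroids


def kmeans(points, initial_centroids, max_iterations=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(max_iterations):
        labels = assign(points, centroids)
        new_centroids = update(points, labels, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assign(points, centroids)


# Toy data (hypothetical): two well-separated groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, labels = kmeans(points, initial_centroids=[(1, 1), (8, 8)])
print(labels)
print(centroids)
```

Every iteration compares each point with each centroid, which is one reason reducing the number of dimensions before clustering can help on larger datasets.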

Chapter 5 demonstrates the descriptive and statistical modeling functions in R. Data on fatal automobile accidents in the United States are extracted from the Fatality Analysis Reporting System (FARS). A model is built to identify the significant factors contributing to death when a fatal crash happens. First, descriptive analysis is performed with basic R functions and packages. Then, a Generalized Linear Model (GLM) with a logit link function is explored and constructed. Finally, multiple validation metrics are introduced and calculated to ensure the reasonableness and accuracy of the predictions. The focus of Chapter 5 is to demonstrate the power and flexibility of the most popular Open Source Statistical Software (OSSS) through a real data analysis.
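As a brief illustration of what a logit link means in the Chapter 5 description, the sketch below shows how a linear predictor is mapped to a probability and back. It is written in Python rather than R, and the function names, coefficients, and feature values are hypothetical placeholders chosen for this example. They are not estimates from FARS data.

```python
import math


def logit(p):
    """Link function: map a probability in (0, 1) to log-odds on the real line."""
    return math.log(p / (1.0 - p))


def inverse_logit(eta):
    """Inverse link: map a linear predictor back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def predict_probability(intercept, coefficients, features):
    """Probability implied by a logit-link GLM for a single observation."""
    eta = intercept + sum(b * x for b, x in zip(coefficients, features))
    return inverse_logit(eta)


# Hypothetical model: an intercept and two coefficients, applied to one observation.
intercept = -1.0
coefficients = [0.8, -0.5]
features = [1.0, 2.0]

p = predict_probability(intercept, coefficients, features)
print(p)         # a probability strictly between 0 and 1
print(logit(p))  # recovers the linear predictor, here -1.2 up to rounding
```

Because the link keeps every prediction between 0 and 1, a model of this form suits a binary outcome such as whether a person involved in a crash died.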

Chapter 6 introduces the history of Python and the Integrated Development Environments (IDEs) and code editors that serve as its development environments. The history tells how Python grew from the ABC programming language in the Netherlands into a community with developers from different areas, and later became one of the most popular programming languages in the world. Popular IDEs and code editors for professional developers and beginners are also introduced, with their advantages and disadvantages. Chapter 6 also introduces Python libraries that can be used in statistical analysis and gives a simple case showing how these methods can be applied.

Chapter 7 compares the performance of multiple Big Data techniques for time series forecasting with that of traditional time series models on three Big Data sets. The traditional time series models, Autoregressive Integrated Moving Average (ARIMA) and exponential smoothing, are used as baselines against Big Data analysis methods from machine learning. These Big Data techniques include Regression Trees, Support Vector Machines (SVM), Multilayer Perceptrons (MLP), Recurrent Neural Networks (RNN), and Long Short-Term Memory neural networks (LSTM). Across the three time series data sets used (unemployment rate, bike rentals, and transportation), this study finds that LSTM neural networks performed the best. In conclusion, this study points out that Big Data machine learning algorithms applied to time series can outperform traditional time series models. The computations in this work are done in Python, one of the most popular open source platforms for data science and Big Data analysis.
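For readers unfamiliar with the exponential smoothing baseline named in the Chapter 7 description, here is a minimal plain-Python sketch of simple exponential smoothing with one-step-ahead forecasts. The function names, the smoothing parameter alpha, and the short toy series are assumptions made for illustration. They are not the models, settings, or data used in Chapter 7.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed level after each observation.

    The level after observation t serves as the forecast for observation t + 1.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]  # initialise the level with the first observation
    levels = [level]
    for y in series[1:]:
        level = alpha * y + (1.0 - alpha) * level
        levels.append(level)
    return levels


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Toy series (hypothetical values).
series = [10, 12, 11, 13]
levels = simple_exponential_smoothing(series, alpha=0.5)

# One-step-ahead forecasts: the level at t-1 predicts the value at t.
forecasts = levels[:-1]
mae = mean_absolute_error(series[1:], forecasts)
next_forecast = levels[-1]

print(levels)
print(mae, next_forecast)
```

A baseline like this gives an error figure, here the mean absolute error, against which a machine learning model can be compared on the same series.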

Ahuja, R., Malik, J., Tyagi, R., & Brinda, R. (2018). Role of Open Source Software in Big Data Storage. In Segall, R. S., & Cook, J. (Eds.), Handbook of Big Data Storage and Visualization Technologies (pp. 123–151). IGI Global. doi:10.4018/978-1-5225-3142-5.ch005

Borges, H. S., & Valente, M. T. (2019). How do Developers Promote Open Source Projects? Computer, 52(August), 27–33. doi:10.1109/MC.2018.2888770

Brasseur, V. M. (2018). Forge your Future with Open Source. Pragmatic Bookshelf.

Eduonix. (2018). 10 Popular Open Source Big Data Tools. Retrieved December 14, 2019 from https://blog.eduonix.com/bigdata-and-hadoop/10-popular-open-source-big-data-tools/

Harvey, C. (2012). 50 Top Open Source Tools for Big Data. Datamation, 4. Retrieved from https://www.datamation.com/data-center/50-top-open-source-tools-for-big-data-1.html

Harvey, C. (2016a). 5 Open Source Big Data Analysis Platforms and Tools. Datamation, 14. Retrieved from https://www.datamation.com/data-center/slideshows/5-open-source-big-data-analysis-platforms-and-tools.html

Harvey, C. (2016b). 7 Open Source Big Data Business Intelligence Tools. Datamation, 14. Retrieved from https://www.datamation.com/data-center/slideshows/7-open-source-big-data-business-intelligence-tools.html

Harvey, C. (2016c). 5 Open Source File Systems and Programming Languages. Datamation, 14. Retrieved from https://www.datamation.com/data-center/slideshows/5-open-source-big-data-file-systems-and-programming-languages.html

Harvey, C. (2016d). 5 Open Source Big Data Tools: Transfer and Aggregate. Datamation. Retrieved December 15, 2019 from https://www.datamation.com/data-center/slideshows/5-open-source-big-data-tools-transfer-and-aggregate.html

Harvey, C. (2016e). 8 Open Source Big Data Mining Tools. Datamation, 14. Retrieved from https://www.datamation.com/data-center/slideshows/8-open-source-big-data-mining-tools.html

Harvey, C. (2016f). 16 Open Source Big Data Databases. Datamation, 16. Retrieved from https://www.datamation.com/data-center/slideshows/16-open-source-big-data-databases.htm

Hukal, P., Berente, N., Germonprez, M., & Schecter, A. (2019). Bot Coordinating Work in Open Source Software Projects. Computer, 52, 52–60. doi:10.1109/MC.2018.2885970

I-Intelligence. (2016). Open Source Intelligence Tools and Resource Handbook. Academic Press.

Mu, H. (2019). 23 Open-source Free Statistical, Data analysis and Notebook Projects for Data Scientists. Retrieved December 14, 2019 from https://medevel.com/open-source-data-science-analysis/

Riehle, D. (2019). The Innovations of Open Source. Computer, 52, 59–63. doi:10.1109/MC.2019.2898163

Schottle, H. (2019). Open Source License Compliance. Computer, 52, 63–67. doi:10.1109/MC.2019.2915690

Spinellis, D. (2019). How to Select Open Source Components. Computer, 52, 103–106. doi:10.1109/MC.2019.2940809

Wikipedia. (2019a). Big Data. Retrieved December 17, 2019 from https://en.wikipedia.org/wiki/Big_data

Wikipedia. (2019b). Open source software. Retrieved December 17, 2019 from https://en.wikipedia.org/wiki/Open-source_software