Chapter 20

Ten Free or Low-Cost Data Science Tools and Applications

IN THIS CHAPTER

check Using open-source tools for scraping, collecting, and handling data

check Exploring your data with free open-source tools

check Getting creative with free and low-cost data visualization tools

check Having fun with infographics

Because data collection, analysis, and visualization comprise the crux of the data scientist’s toolkit, it should come as no surprise that you can use quite a few free SaaS tools to carry out these tasks with greater ease. These simple applications can sometimes be useful to more advanced data scientists, but at other times, data science experts simply need more technical tools to help them delve deeper into datasets. In this chapter, I present ten free or low-cost applications you can use to complete data science tasks. You can download and install many of these applications on your personal computer, and most of the downloadable applications are available for multiple operating systems.

Warning Always read and understand the licensing requirements of any app you use. Protect yourself by determining how you’re allowed to use the products you create with that app.

Technicalstuff For more free and low-cost data science and analytics tool recommendations, be sure to check out the ones I’ve left over on the companion website, https://businessgrowth.ai.

Scraping, Collecting, and Handling Data Tools

Whether you need data to support a business analysis or for a new market research data product you’re building, web-scraping can help you track down interesting and unique data sources. In web-scraping, you set up automated programs and then let them scour the web for the data you need. In this section, you get a quick peek at some amazing free tools you can use to capture data in the wild and start wrangling it into a useful format.

Sourcing and aggregating image data with ImageQuilts

ImageQuilts (http://imagequilts.com) is a Chrome extension developed in part by the legendary Edward Tufte, one of the first great pioneers in data visualization. (He popularized the use of the data-ink ratio to judge the effectiveness of charts.)

The task that ImageQuilts performs is deceptively simple to describe but quite complex to implement. ImageQuilts makes collages of tens of images and then pieces them all together into one “quilt” that’s composed of multiple rows of equal height. This task can be complex because the source images are almost never the same height. ImageQuilts scrapes and resizes the images before stitching them together into a single output image. The image quilt shown in Figure 20-1 was derived from a Labeled for Reuse search for the term data science at Google Images.
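To see why equal-height rows take some work, here is a minimal Python sketch of the resizing arithmetic. It is not ImageQuilts’ actual code: the function names, the sample image sizes, and the simple greedy row-packing rule are all assumptions made for illustration.

```python
def scale_to_height(size, target_height):
    """Return (width, height) scaled to target_height, keeping the aspect ratio."""
    width, height = size
    new_width = max(1, round(width * target_height / height))
    return (new_width, target_height)


def build_rows(sizes, row_height, max_row_width):
    """Greedily pack resized images into rows no wider than max_row_width."""
    rows, current, current_width = [], [], 0
    for size in sizes:
        resized = scale_to_height(size, row_height)
        # Start a new row when this image would overflow the current one.
        if current and current_width + resized[0] > max_row_width:
            rows.append(current)
            current, current_width = [], 0
        current.append(resized)
        current_width += resized[0]
    if current:
        rows.append(current)
    return rows


# Hypothetical source image sizes in pixels, as (width, height).
source_sizes = [(800, 600), (300, 600), (1000, 500), (400, 400)]
quilt_rows = build_rows(source_sizes, row_height=100, max_row_width=400)
for row in quilt_rows:
    print(row)
```

Every image in a row ends up with the same height, and only the widths differ, which is what lets the rows line up into a quilt.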

ImageQuilts even allows you to choose the order of images or to randomize them. You can use the tool to drag-and-drop any image to any location, remove an image, zoom all images at the same time, or zoom each image individually. You can even use the tool to convert between image colors — from color to grayscale or inverted color (which is handy for making contact sheets of negatives, if you’re one of those rare people who still processes analog photography).

FIGURE 20-1: An ImageQuilts output from the Google Images search term data science.

Wrangling data with DataWrangler

DataWrangler (http://vis.stanford.edu/wrangler) is an online tool that’s supported by the University of Washington Interactive Data Lab. (At the time DataWrangler was developed, this group was called the Stanford Visualization Group.) This same group developed Lyra, an interactive data visualization environment you can use to create complex visualizations without programming experience.

If your goal is to sculpt your dataset — clean things up by moving them around like a sculptor would (split this part in two, slice off that bit and move it over there, push this down so that everything below it gets shifted to the right, and so on) — DataWrangler is the tool for you.

You can do manipulations with DataWrangler similar to the ones you can do in Excel using Visual Basic for Applications (VBA). For example, you can use either DataWrangler or Excel with VBA to copy, paste, and format information from lists you scrape from the Internet.

DataWrangler even suggests actions based on your dataset and can repeat complex actions across entire datasets — actions such as eliminating skipped rows, splitting data from one column into two, and turning a header into column data. DataWrangler can also show you where your dataset is missing data.
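If you’re curious what these actions amount to under the hood, the following Python sketch performs three of them by hand: dropping empty rows, splitting one column into two, and flagging missing values. It uses only the standard library. The sample data and column names are made up, and this is not how DataWrangler itself is implemented.

```python
import csv
import io

# Hypothetical scraped list with one empty spacer row and one missing value.
raw_text = """name,location
Ada Lovelace,"London, UK"
,
Grace Hopper,"New York, US"
Alan Turing,
"""

rows = list(csv.DictReader(io.StringIO(raw_text)))

# 1. Eliminate rows in which every field is empty.
rows = [r for r in rows if any(value.strip() for value in r.values())]

# 2. Split one column ("location") into two ("city" and "country").
for r in rows:
    city, _, country = r.pop("location").partition(",")
    r["city"], r["country"] = city.strip(), country.strip()

# 3. Report where data is missing, as (name, column) pairs.
missing = [(r["name"], col) for r in rows for col, value in r.items() if not value]
print(missing)
```

A tool like DataWrangler spares you from writing this kind of code, but seeing it spelled out makes it easier to judge whether a suggested action does what you expect.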

Remember Missing data can indicate a formatting error that needs to be cleaned up.

Data-Exploration Tools

When I talk about data science, I tend to talk a lot about the free tools that people can use to visualize their data. And although visualization can help clarify and communicate your data’s meaning, you need to make sure that the data insights you’re communicating are correct — that requires great care and attention in the data analysis phase. In this section, I introduce you to a few free tools you can use for some advanced data analysis and data science tasks.

Getting up to speed in Gephi

Remember back in school when you were taught how to use graph paper to do math and then were told to draw graphs of the results? Well, apparently that nomenclature is incorrect. Those things with an x-axis and a y-axis are called charts. Graphs are network topologies — the same type of network topologies I talk about in Chapter 8.

If this book is your first introduction to network topologies, welcome to this weird and wonderful world. You’re in for a voyage of discovery. Gephi (http://gephi.github.io) is an open-source software package you can use to create graph layouts and then manipulate them to get the clearest and most effective results. The kinds of connection-based visualizations you can create in Gephi are useful in all types of network analyses — from social media data analysis to an analysis of protein interactions or horizontal gene transfers between bacteria.

To illustrate a network analysis, imagine that you want to analyze the interconnectedness of people in your social networks. You can use Gephi to quickly and easily present the different aspects of interconnectedness between your Facebook friends. So, imagine that you’re friends with Alice. You and Alice share 10 of the same friends on Facebook, but Alice has an additional 200 friends with whom you’re not connected. One of the friends that you and Alice share is named Bob. You and Bob share 20 of the same friends on Facebook also, but Bob has only 5 friends in common with Alice. On the basis of shared friends, you can easily surmise that you and Bob are the most similar, but you can use Gephi to visually graph the friend links between you, Alice, and Bob.
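The shared-friends reasoning is easy to check with a few lines of Python before you draw anything. The sketch below uses a scaled-down, made-up friend list (the numbers are much smaller than in the example above) and counts common friends with set intersections.

```python
from itertools import combinations

# Hypothetical, scaled-down friend lists; all names are invented.
friends = {
    "you":   {"alice", "bob", "carol", "dan", "erin"},
    "alice": {"you", "bob", "carol", "frank", "gina"},
    "bob":   {"you", "alice", "dan", "erin"},
}


def shared_friends(a, b):
    """Count the friends a and b have in common, excluding a and b themselves."""
    return len((friends[a] & friends[b]) - {a, b})


# Compare every pair of people and find the pair with the largest overlap.
overlap = {pair: shared_friends(*pair) for pair in combinations(friends, 2)}
most_similar = max(overlap, key=overlap.get)
print(overlap)
print(most_similar)
```

In this toy network, you and Bob come out as the most similar pair, matching the pattern in the example. A graph layout shows the same relationships at a glance once the network grows too large to read from a table.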

In another example, imagine that you have a graph showing which characters appear in the same chapter as which other characters in Victor Hugo’s immense novel Les Misérables. (Actually, you don’t have to imagine it — Figure 20-2 shows just such a graph, created in the Gephi application.) The larger bubbles indicate that these characters appear most often — the more lines attached to a bubble, the more they co-occur with others. The big bubble in the center left is, of course, Jean Valjean.

FIGURE 20-2: A moderate-size graph on characters in the novel Les Misérables.

When you use Gephi, the application automatically colors your data according to different criteria. In the upper left corner of Figure 20-2, the cluster of characters in blue (the somewhat darker color in this black-and-white image) consists of characters who mostly appear only with each other. (They’re the friends of Fantine, such as Félix Tholomyès; if you’ve only seen the musical, note that they don’t appear in that production.) These characters are connected to the rest of the book’s characters by way of only a single character, Fantine. If a group of characters appeared only together and never with any other characters, they would form a separate cluster of their own, unattached to the rest of the graph.
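The ideas in this example (lines between co-occurring characters, and clusters that may or may not attach to the rest of the graph) can be sketched in plain Python. The chapter casts below are invented for illustration and are not the real Les Misérables data, and the code is a rough model of the concept, not Gephi’s own algorithm.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical chapter casts; each set lists who appears in one chapter.
chapters = [
    {"Valjean", "Fantine"},
    {"Fantine", "Tholomyes", "Listolier"},
    {"Valjean", "Javert"},
    {"Stranger A", "Stranger B"},  # never appear with anyone else
]

# Build an undirected graph with one edge per pair sharing a chapter.
neighbors = defaultdict(set)
for cast in chapters:
    for a, b in combinations(sorted(cast), 2):
        neighbors[a].add(b)
        neighbors[b].add(a)


def components(graph):
    """Return the connected components as a list of sets (depth-first search)."""
    seen, result = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(graph[node] - group)
        seen |= group
        result.append(group)
    return result


# Degree = number of lines attached to each character's bubble.
degree = {name: len(adjacent) for name, adjacent in neighbors.items()}
clusters = components(neighbors)
print(degree)
print(clusters)
```

Here the two invented strangers form a detached cluster of their own, while Fantine’s friends stay attached to the main graph only through Fantine.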

To take one final example, check out Figure 20-3, which shows a graph of the US power grid and the degrees of interconnectedness between thousands of power-generation and power-distribution facilities. This type of graph is commonly referred to as a hairball graph, for obvious reasons. You can make it less dense and more visually clear, but making those kinds of adjustments is as much of an art as it is a science. The best way to learn is through practice, using trial-and-error.

FIGURE 20-3: A Gephi hairball graph of the US power grid.

Machine learning with the WEKA suite

Machine learning is the class of artificial intelligence that’s dedicated to developing and applying algorithms to data so that the algorithms can automatically learn and detect patterns in large datasets. Waikato Environment for Knowledge Analysis (WEKA; www.cs.waikato.ac.nz/ml/weka) is a popular suite of tools that’s useful for machine learning tasks. It was written in Java and developed at the University of Waikato, New Zealand.

You can use the stand-alone WEKA application to analyze patterns in your datasets and then visualize those patterns in all sorts of interesting ways. For advanced users, WEKA’s true value is derived from its suite of machine learning algorithms that you can use to cluster or categorize your data. WEKA even allows you to run different machine learning algorithms in parallel to see which ones perform most efficiently. WEKA can be run through a graphical user interface (GUI) or by command line. Thanks to the well-written Weka Wiki documentation, the learning curve for WEKA isn’t as steep as you might expect for a piece of software this powerful.

Designing Data Visualizations

Ready for the artsy stuff? Here are some tools you can use to create neat-looking web-based data visualizations.

Getting Shiny by RStudio

Once upon a time, you needed to know how to use a statistics-capable programming language like R if you wanted to do any kind of serious data analysis. And, if you needed to make interactive web visualizations, you’d have to know how to code in languages like JavaScript or PHP. Of course, if you wanted to do both simultaneously, you’d have to know how to code in two or three additional programming languages. In other words, SaaS data visualization based on statistical analyses was a cumbersome task.

The good news is that the situation has changed. Because of the work of a few dedicated developers, the walls between analysis and presentation have crumbled. After the 2012 launch of RStudio’s Shiny package (http://shiny.rstudio.com), both statistical analysis and SaaS data visualization can be carried out in the same framework.

RStudio, by far the most popular integrated development environment (IDE) for R, developed the Shiny package to allow R users to create web apps. Apps made in Shiny run on a web server and are interactive: You can move sliders, select check boxes, or click the data itself in order to manipulate said data to your heart’s desire. Because these apps run on a server, they’re considered live: When you make changes to the underlying data, those changes are automatically reflected in the appearance of the data visualization. Web apps created in Shiny are also reactive. In other words, their output updates instantly in response to a user interaction, without the user having to click a Submit button.
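To get a feel for what reactive means, consider this tiny Python sketch of the idea. It is not Shiny code, and it doesn’t use Shiny’s API: The ReactiveValue class and its method names are hypothetical, invented only to show an output that re-renders whenever its input changes.

```python
class ReactiveValue:
    """Holds a value and re-runs registered callbacks whenever it changes."""

    def __init__(self, value):
        self._value = value
        self._observers = []

    def observe(self, callback):
        """Register a callback and run it once with the current value."""
        self._observers.append(callback)
        callback(self._value)

    def set(self, value):
        """Store a new value and notify every observer immediately."""
        self._value = value
        for callback in self._observers:
            callback(value)


rendered = []  # stands in for what the page would display
slider = ReactiveValue(10)
slider.observe(lambda n: rendered.append(f"histogram with {n} bins"))
slider.set(25)  # simulates the user dragging a slider
print(rendered)
```

No Submit step appears anywhere: Changing the slider value is enough to trigger a fresh render.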

If you want to quickly use a few lines of code to instantly generate a SaaS data visualization application, use R’s Shiny package. What’s more, if you want to customize your SaaS data visualization app to be more aesthetically appealing, you can do that by simply editing the HTML, CSS, and JavaScript that underlies the Shiny application.

Remember Because Shiny produces server-side web apps, you need a server host and the know-how to host your web app on a server before you can make useful web apps by using the package.

Tip Over on https://businessgrowth.ai, I keep a listing of my favorite Shiny applications. They’re useful if you’re considering using Shiny to build your first data product.

Tip RStudio runs the public web server ShinyApps.io (www.shinyapps.io). You can use that server to host an app for free, or you can pay to host there if your requirements are more resource-intensive. The starter level of service costs $9 per month and promises you 100 hours of application runtime per month.

Mapmaking and spatial data analytics with CARTO

If you’re not a professional programmer or cartographer, just know that CARTO is about the most powerful online mapping solution that’s available. People in information services, software engineering, media and entertainment, and urban development industries often use CARTO for digital visual communications.

By using CARTO, you can create a heat map by simply uploading or linking to a list of spatial coordinates. Likewise, if you want to create a choropleth map to show values for quantitative attributes, simply upload or link to a set of spatial coordinates that includes attribute data. (Choropleth maps show spatial data plotted out according to area boundary polygons rather than by point, line, or raster coverage. I discuss them in greater detail in Chapter 8.)
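Behind every choropleth map sits a simple step: Each area’s value is assigned to a class, and each class gets a fill color. The Python sketch below shows one common way to do that, using equal-count (quantile) classes. The county names, values, and colors are made up, and CARTO’s own classification options may work differently.

```python
# Hypothetical attribute values per area (for example, movers per county).
values = {"County A": 120, "County B": 450, "County C": 80,
          "County D": 990, "County E": 300, "County F": 610}

# Three hypothetical fill colors, lightest to darkest.
palette = ["#deebf7", "#9ecae1", "#3182bd"]


def quantile_classes(data, n_classes):
    """Rank areas by value and split the ranking into n_classes equal groups."""
    ranked = sorted(data, key=data.get)
    return {area: rank * n_classes // len(ranked)
            for rank, area in enumerate(ranked)}


classes = quantile_classes(values, len(palette))
fills = {area: palette[cls] for area, cls in classes.items()}
for area in sorted(fills):
    print(area, values[area], fills[area])
```

The mapping application then paints each boundary polygon with its assigned color, which is why a choropleth needs attribute data in addition to the spatial coordinates.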

CARTO allows you to overlay markers and shapes on all sorts of interesting base maps. You can use it to make anything from simple outline maps of geographic regions to stylish, antiqued, glossy map books that come out looking like magazines. You can even use it to generate street maps from satellite imagery. CARTO’s geocoding functionality is so well implemented that you can drill down to a location using individual addresses, postal codes, or even IP addresses.

Remember To get going in CARTO, you need to first set up a user account. You can do that via the CARTO home page (at https://carto.com).

More advanced users can use CARTO to

  • Link to SQL databases.
  • Customize Cascading Style Sheets (CSS).
  • Incorporate other chart types in the form of superimposed graphs, outlines, and 3-dimensional surface plots.

Figure 20-4 shows a CARTO choropleth map of a sample dataset derived from the number of people who moved within the same county in the US in 2011, and Figure 20-5 shows a bubble map of the same dataset.

FIGURE 20-4: An interactive choropleth map in CARTO.

Remember CARTO is interactive: It allows you to click features to see attribute information and turn map layers on and off in the same map interface.

Technicalstuff Map layers are spatial datasets that represent different features on a map. In shared areas, layers often overlap one another in the same spatial region. To better understand this concept, think again about a map that shows an election outcome. This type of map has a States layer and a Political Party layer. The States layer shows you the name and spatial boundary of the state. The Political Party layer, geographically overlaid on top of the States layer, tells you, state by state, how the majority of voters voted in the election. Although the layers overlap in physical location, both the States layer and the Political Party layer are based on separate, individual datasets. This is how layers work in mapping applications.

FIGURE 20-5: An interactive bubble map in CARTO.

Talking about Tableau Public

Collaborative data visualization platforms are SaaS platforms you can use to design data visualizations and then share those visualizations with other platform users to get their feedback on the visualization’s design or on the data insights conveyed.

Collaborative data visualization platforms have been described as the YouTube of data visualization, but these platforms are in reality far more interactive than YouTube. A collaborative data visualization platform is like a version of YouTube that lets you instantly copy and edit every video using your own software tools and then republish the video via your own social channels.

Collaborative platforms are quite efficient and effective for working in teams. Rather than have to email versions back and forth, or (heaven forbid) learn a dedicated version-control system like Git, you and your teammates can use the platform’s sharing features to work on visualizations as a team.

Even if you don’t need or want to work with collaborators, collaborative platforms still have much to offer in the way of useful data analysis and visualization tools. These tools are often as powerful as (and sometimes even more powerful than) comparable desktop packages — just keep in mind that they often require users to publicly share their data and results so that others can view, modify, or use those results for their specific needs.

Tip Many sites offer free plans but allow you to keep your work private only if you purchase a paid account.

Tableau Public (www.tableausoftware.com/public), a free desktop application, aims to be a complete package for chart-making. If its name sounds familiar, it may be because Tableau Public is the free version of the popular Tableau Desktop software program. As part of the freeware limitation, the application doesn’t let you save files locally to your computer. All your work must be uploaded to Tableau Public’s cloud server, unless you purchase the software.

Tableau Public creates three levels of document: the worksheet, the dashboard, and the story. In the worksheet, you can create individual charts from data you’ve imported from Access, Excel, or a text-format .csv file. You can then use Tableau Public to easily do things such as choose between different data graphic types or drag columns to different axes or subgroups.

Warning You have to deal with a bit of a learning curve when working with the flow of the application and its nomenclature — for example, dimensions are categorical data, and measures are numerical data.

Tableau Public offers many different default chart types: bar charts, scatter plots, line charts, bubble charts, Gantt charts, and even geographical maps. It can even look at the type of data you have and suggest types of charts you can use to best represent it. For example, imagine that you have two dimensions and one measure. In this situation, a bar chart is a popular choice because you have two categories of data and only one numeric measure for those two categories. But if you have two dimensions and two measures, a scatter plot might be a good option because the scatter plot data graphic allows you to visualize two sets of numerical data for two categories of data.
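The rule of thumb in this example can be written down as a short Python sketch. It is only an illustration of the reasoning above, not Tableau Public’s actual recommendation logic, and the function names and sample columns are hypothetical.

```python
def column_role(values):
    """Treat a column as a measure if every value is numeric, else a dimension."""
    numeric = all(isinstance(v, (int, float)) and not isinstance(v, bool)
                  for v in values)
    return "measure" if numeric else "dimension"


def suggest_chart(table):
    """Apply the rule of thumb from the text; other cases are left open."""
    roles = [column_role(col) for col in table.values()]
    dims, measures = roles.count("dimension"), roles.count("measure")
    if dims == 2 and measures == 1:
        return "bar chart"
    if dims == 2 and measures == 2:
        return "scatter plot"
    return "no suggestion"


# Hypothetical columns, keyed by column name.
sales = {
    "region":  ["East", "West", "East", "West"],
    "product": ["Tea", "Tea", "Coffee", "Coffee"],
    "revenue": [120.0, 95.5, 210.0, 180.25],
}
print(suggest_chart(sales))

# Adding a second numeric column changes the suggestion.
sales_with_units = {**sales, "units": [10, 8, 15, 12]}
print(suggest_chart(sales_with_units))
```

The first table has two dimensions and one measure, so the sketch suggests a bar chart. Adding the units column gives it two measures, and the suggestion switches to a scatter plot.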

You can use a Tableau Public dashboard to combine charts with text annotations or with other data charts. You can also use the dashboard to add interactive filters, such as check boxes or sliders, so that users can interact with your data to visualize only certain time series or categories. With a Tableau Public story, you can combine several dashboards into a sort of slide show presentation that shows a linear story revealed through your data.

Finally, you can use Tableau Public’s online gallery to collaborate and share all the worksheets, dashboards, and stories you generate within the application. You can also embed them into websites that link back to the Tableau Public cloud server.

Using RAWGraphs for web-based data visualization

You can use RAWGraphs, a unique and unusual web application, to make artistic and creative visualizations from your dataset. RAWGraphs’ layout provides you with a simple drag-and-drop interface you can use to make unique and interesting data visualizations with just a few clicks of the mouse. If you want to get funky and cool with your data visualization but you lack the time or money it takes to learn how to code this sort of thing for yourself, RAWGraphs is the perfect data visualization alternative.

Remember As I said, RAWGraphs is unique: It doesn’t even offer standard bar chart visualizations. It does, however, offer clustered force diagrams, Voronoi tessellations, Reingold-Tilford trees, and other, less-well-known chart types.

To use RAWGraphs, first go to the RAWGraphs home page (at https://rawgraphs.io/) and then navigate to the Use It Now! button. You don’t even need to create an account to use the application — just copy and paste your raw data into the application, and then choose the optimal chart types for that data. RAWGraphs makes it easy to choose between chart types by telling you the precise number of quantitative attributes, categorical attributes, and labels that are required to generate each plot.

This service wasn’t designed for novices, but its simple, straightforward interface makes it a fun, user-friendly application for playing with your data and figuring out how to generate unique chart types. Even if you don’t know a convex hull from a hexagonal bin, you can play around with settings, drag columns from place to place, and view how those changes affect the overall visualization. With enough practice, you may even end up using some of the visualization strategies you learn from RAWGraphs in other contexts.

Remember You can have fun getting cool and funky with visualization design, but always make sure that your visual result is easy to understand for the average viewer.

Figure 20-6 shows a diagram I created in RAWGraphs of the dataset I used for my CARTO example, earlier in this chapter. (Note: This is just about the only type of visualization RAWGraphs offers that would work with such a simple dataset!)

FIGURE 20-6: A diagram from RAWGraphs.

Communicating with Infographics

Although the contextual difference between an infographic and a data visualization is often clear, even data visualization professionals can sometimes have a hard time distinguishing between the two. A good rule of thumb is that if the data graphics are primarily produced in an automated fashion using a data graphing application, it’s a data visualization. But if you use a custom graphic design tool, such as Photoshop or Illustrator, to produce the final product, it’s an infographic.

This categorization grows a bit more complicated, though. An infographic often incorporates one or more charts, making it more difficult to determine the manner in which the visualization was produced. Complicating the issue, online infographic design applications, such as Piktochart and Infogram, have dual functionality that allows for automated data graphing and customizable, artistic graphic design.

Remember An even broader rule of thumb is that if the visualization looks artfully designed, it’s an infographic, but if it looks rather plain and analytical, it’s a data visualization.

Although infographics can be dynamic or static, when you’re designing a graphic for print, a slide for PowerPoint, or an image for social media syndication, just stick with static infographics. If you want to tell a story with your data or create data art, use a dynamic infographic.

Tip You can easily and directly embed static graphics into a social media post. Social content that has an embedded graphic tends to get more attention and engagement than social content that’s posted as text-only.

Applications used to create infographics provide many more creative alternatives than do traditional data visualization applications. In fact, this is as good a time as any to introduce you to a few of the better applications that are available for infographic design. Read on for all the details.

Making cool infographics with Infogram

You can use the online tool Infogram to make aesthetically appealing, vertically stacked card infographics. These visualizations are composed of a series of cards, each with its own set of data graphics. Because the cards are stacked vertically, one on top of the other, the end infographic is often longer than it is wide.

Infogram offers a variety of trendy color schemes, design schemes, and chart types. With Infogram, you can import your own images to make an infographic that’s much more personalized. Infogram also provides you with sharing capabilities so that you can spread an infographic quickly and easily across social channels or via private email.

You can use Infogram to create stylish infographics that display bar charts, column charts, pie charts, line charts, area charts, scatter charts, bubble charts, pictorials, hierarchical charts, tables, progress displays, word clouds, tree maps, or even financial charts. To get started using Infogram, just head over to the home page (at https://infogram.com) and register for an account. Its freemium plan is robust enough to supply all your more basic infographic-making needs.

Figure 20-7 shows a bar chart of the (by now familiar) in-county moving dataset in Infogram.

Tip If you want to check out some great Infogram examples before you get started, you can view a live feed of featured infographics at Infogram’s Featured Infographics page (https://infogram.com/examples).

FIGURE 20-7: A bar chart in Infogram.

Making cool infographics with Piktochart

The Piktochart web application provides an easy-to-use interface that people like you and me can use to quickly create beautiful infographics. Piktochart offers a large selection of attractive templates, but be warned that only members who have paying accounts can access most of these templates. These templates are a great option if you want to save time and money on design but need to produce documents in an infographic format. Piktochart offers more creative flexibility than other comparable web applications, which makes Piktochart useful in a wide range of industries, from nonprofit grassroots to media and entertainment.

You can use Piktochart to make either static or dynamic infographics, and you can also link your infographics to Google Sheets for live updating. Piktochart offers the usual array of chart types, in addition to more infographic-oriented types, such as Venn diagrams, gauges, and matrixes.

If you use the free version of Piktochart to create your infographic, be warned that your infographic will be made available to the public. If you sign up for a paid account, however, you have the option of keeping your work private. You can register for Piktochart on the application’s home page at http://piktochart.com.

Using Piktochart, you can create infographics that display bar charts, triangle charts, line charts, area charts, scatter charts, pie charts, Venn diagrams, matrixes, pyramids, gauges, donuts, swatches, and icons. Figure 20-8 shows a Piktochart version of a bar chart of the in-county moving dataset example.

FIGURE 20-8: A bar chart in Piktochart.

Tip This list is awesome, but at only ten tools, it’s too short. So many other tool options are available to you out there. A few other places you might want to explore include Datawrapper (www.datawrapper.de), which is used by The New York Times, Wired, and Fortune for their data visualizations. It lets you create charts, choropleth maps, tables, and so much more! Additionally, there is Mapbox (www.mapbox.com), which is useful for creating high-quality geographic data and shapefiles for GIS.