Appendix A
Resources
Staying current on tools and trends is always a challenge in emerging technologies. There’s good news, though. An abundance of resources, vendor research, webcasts, and standards groups (groups that promote best practices) await you online, where you can learn new tools and keep up on the latest vendor news and offerings.
Vendor Websites
With more than 1,000 companies that have big data products or services, I can’t list them all. Instead, the following list includes the major players in the big data space:
- Amazon Web Services (http://aws.amazon.com/big-data): Amazon Web Services offers a set of cloud services around big data storage and computing resources.
- Big Data University (www.bigdatauniversity.com): Big Data University is an online community dedicated to educating people on big data. It’s funded by IBM and is free.
- Birst (www.birst.com): Birst is a cloud-based business intelligence and analytics tool.
- Cloudera (www.cloudera.com): Cloudera provides software, training, and support for the Apache Hadoop framework.
- Couchbase (http://couchbase.com): Couchbase is a NoSQL document database for interactive web applications.
- EMC (http://bigdatablog.emc.com): EMC is a hardware vendor providing storage for on-site big data analytics processing.
- Google (http://research.google.com): Google has a host of tools and services that help developers deliver big data solutions. BigQuery is one of those products, focused solely on big data. Google also offers visualization tools and cloud computing services (see Chapter 7).
- Hortonworks (http://hortonworks.com): Hortonworks provides a distribution of open-source Apache Hadoop for use in the enterprise.
- IBM Netezza (www.netezza.com): Netezza is an on-site data warehouse appliance.
- Informatica (www.informatica.com/bigdata): Informatica provides tools for data integration and migration. Its big data offerings help couple traditional databases and NoSQL data stores to make data integration easy for big data processing.
- Jaspersoft (http://jaspersoft.com): Jaspersoft provides open-source analytics tools for data visualization from the dashboard.
- MapR (http://mapr.com): MapR provides a complete distribution of Apache Hadoop.
- Microsoft (www.microsoft.com/enterprise/it-trends/big-data): Microsoft offers big data analytics through its Azure cloud platform and Azure HDInsight products.
- MicroStrategy (www.microstrategy.com): MicroStrategy delivers business intelligence and analytics tools to large and medium-size businesses.
- MongoDB (www.mongodb.com): This is an open-source document-centric NoSQL database. It’s the most popular NoSQL database today.
- Oracle (www.oracle.com/us/technologies/big-data): Oracle is the creator of the world’s most widely used relational database management system (RDBMS). It also creates in-memory database systems and manages the open-source MySQL database management system.
- Pentaho (www.pentaho.com): This is an open-source business analytics tool.
- Predictive Analytics Today (www.predictiveanalyticstoday.com/top-30-software-for-text-analysis-text-mining-text-analytics): This is a curated software list for text analytics.
- Qlik (www.qlik.com): Qlik is business intelligence and visualization software.
- RapidMiner (www.rapidminer.com): RapidMiner is an open-source analytics modeling software application. It’s used for statistical analysis.
- SAP (www.sap.com/solution/big-data): SAP is one of the world’s largest enterprise software firms. It provides business intelligence tools, cloud services, SAP HANA, and in-memory database systems for big data analytics.
- SAS (www.sas.com/en_us/insights/big-data.html): The world’s largest privately held software company, SAS provides the premier statistical analytics software package.
- Splunk (www.splunk.com): Splunk is a big data analytics tool used for the analysis and collection of machine data.
- Spotfire (http://spotfire.tibco.com): Spotfire, now owned by Tibco, provides business intelligence and visualization tools.
- Sumo Logic (www.sumologic.com): Sumo Logic is a cloud-based analytics engine that specializes in log file analysis.
- Tableau Software (www.tableausoftware.com): Tableau produces dynamic and interactive visualization software for big data and analytics.
- Teradata (http://bigdata.teradata.com): Teradata builds high-end data warehouse software for high-performance analytics systems.
Bookmark the VentureBeat article on the big data ecosystem at www.venturebeat.com/2014/05/11/the-state-of-big-data-in-2014-chart. This article has a chart of the current big data ecosystem, version 3.0 (two versions on from the famous 1.0 list of ecosystem companies published a few years ago). In technology, an ecosystem is the collection of companies, products, and services in a given industry (in this case, big data).
Standards Organizations
As technologies gain traction within organizations, people tend to organize groups to drive standards and best practices. Several active groups working today foster standards around security, exchanges, and best practices:
- The Cloud Security Alliance Big Data Working Group (BDWG; https://cloudsecurityalliance.org/research/big-data): The BDWG is a subgroup within the Cloud Security Alliance whose purpose is to find scalable and secure solutions to use big data in the cloud.
- The National Institute of Standards and Technology (www.nist.gov): NIST is a technology agency that has been part of the U.S. federal government since 1901. Its goal is to build standards for measurement that propel U.S. firms into a leadership position globally. For big data, the focus of NIST is to enable research and the discovery of best practices.
- OASIS group (www.oasis-open.org/committees/tc_cat.php?cat=bigdata): This group promotes open standards in the information industry.
- The Object Management Group (OMG; www.omg.org): The OMG is a consortium of member firms that collaborate to solidify standards across multiple technology sectors.
- The Open Data Foundation (www.opendatafoundation.org): The Open Data Foundation is a nonprofit working on a global scale to drive metadata standards for the use of statistical data across every industry. Metadata is information about data; it describes data to ensure compatibility between systems and fosters standards among companies and industries.
- The Open Group (http://blog.opengroup.org/tag/big-data): The Open Group works with customers and vendors to build standards and interoperability.
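As a rough illustration of the metadata idea mentioned in the Open Data Foundation entry above, the following Python sketch keeps a small metadata record alongside a dataset and uses it to check that each row matches the description. The field names (title, columns, name, type) and the sample rows are hypothetical and aren’t taken from any published metadata standard.

```python
# Hypothetical metadata record: it describes the data without containing it.
metadata = {
    "title": "Monthly sales",
    "columns": [
        {"name": "month", "type": "str"},
        {"name": "units", "type": "int"},
    ],
}

# Sample rows (made up for illustration).
rows = [
    {"month": "2014-01", "units": 120},
    {"month": "2014-02", "units": 95},
]

# Map the type names used in the metadata to Python types.
TYPES = {"str": str, "int": int}


def conforms(row, meta):
    """Return True if a row has exactly the columns and types the metadata describes."""
    expected = {col["name"]: TYPES[col["type"]] for col in meta["columns"]}
    return row.keys() == expected.keys() and all(
        isinstance(row[name], typ) for name, typ in expected.items()
    )


if __name__ == "__main__":
    for row in rows:
        print(row, conforms(row, metadata))
```

Two systems that agree on a record like this can exchange the rows and verify them independently, which is the kind of compatibility that metadata standards aim for.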
Open-Source Projects
The Apache Software Foundation is a nonprofit group that administers the community development of key technologies within big data. Because Apache solutions are delivered under the Apache License, users and organizations can utilize, change, and distribute the software without having to pay royalties. Many very important big data projects are vetted by the Apache Foundation.
The following is a list of important open-source projects related to big data, most of them Apache Foundation projects:
- Accumulo (http://accumulo.apache.org): A secure implementation of Google’s Bigtable design.
- Cassandra (http://cassandra.apache.org): A distributed database system originally developed by Facebook with a focus on fast access.
- CouchDB (http://couchdb.apache.org): A NoSQL document-oriented database.
- Flume (http://flume.apache.org): This is a distributed system to collect, aggregate, and transport large amounts of log data generated from web site traffic.
- Hadoop (http://hadoop.apache.org): Large-scale processing of distributed datasets, including the following:
  - Hadoop Common: The collection of software libraries used by the other modules
  - Hadoop Distributed File System (HDFS): The core file system, which stores data across many computers to provide huge capacity and throughput when aggregated
  - Hadoop YARN: A resource manager for scheduling and streamlining MapReduce jobs
  - Hadoop MapReduce: A model for processing large datasets
- HBase (http://hbase.apache.org): Manages real-time read/write access to big data files.
- Mahout (http://mahout.apache.org): An open-source machine-learning framework.
- MongoDB (www.mongodb.org): A document-oriented database system that is cross-platform (able to run on many operating systems).
- Python (www.python.org): A powerful scripting programming language.
- Solr (http://lucene.apache.org/solr): Supports full text search for text analytics.
- Spark (http://spark.apache.org): An Apache project that can run on top of the Hadoop Distributed File System (HDFS); it can deliver up to 100 times the speed of traditional MapReduce systems such as Hadoop.
- Sqoop (http://sqoop.apache.org): Allows relational data to be moved into a Hadoop data store.
- ZooKeeper (http://zookeeper.apache.org): A framework to create and manage redundant software systems.
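To make the MapReduce model from the Hadoop entry above a little more concrete, here is a minimal single-machine sketch in Python. It isn’t Hadoop’s API; the function names (map_phase, shuffle, reduce_phase, word_count) and the sample text are invented for illustration, and a real cluster would spread each phase across many computers.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one line of text."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data big tools", "data tools data"]
    print(word_count(sample))
```

Because each call to map_phase looks at only one line and each call to reduce_phase looks at only one key, the work can be split up and run in parallel, which is what lets the model scale to large datasets.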
Big Data Conferences and Trade Shows
Conferences and trade shows are an excellent way to stay connected to the community, get hands-on experience in boot camps and labs, and hear from the leaders in the industry. New conferences are added every year, but here are a few important ones to attend:
- AWS re:Invent: Amazon Web Services’ annual user event in Las Vegas. This popular event draws thousands of cloud and big data professionals and AWS partners, and covers the latest in AWS technology.
- Big Data TechCon: Big Data TechCon is a big data conference with hands-on training, seminars, and an expo of the latest in big data technology. It’s usually held in the fall in San Francisco.
- The Data Warehousing Institute (TDWI): TDWI hosts many events throughout the year. It’s supported widely by key vendors in the industry.
- IEEE International Congress on Big Data: Provides an international forum to explore big data topics.
- Gigaom Structure Data: A conference on big data’s impact on the information economy.
- Hadoop World: Collocated with Strata, Hadoop World is the biggest conference for Hadoop users and is sponsored by leading Hadoop vendors.
- O’Reilly Strata Conference: An extremely popular series that hosts multiple events around the world promoting big data, education, and research.
Strata has a great Twitter feed (http://twitter.com/strataconf), which is kept current with information about technologies, speakers, and trends.
Leading Analyst Research Groups
Huge resources are spent on research to help businesses make informed decisions about technology. The following research groups are major influencers in the world of technology and big data. Take time to familiarize yourself with their reports and analysis. You can use this information to spot trends in the market, select the best technology for your needs, and stay current with leading-edge thinking. The term thought leader has lost much of its value today, but these firms are among those that can still lay claim to it:
- Aberdeen Group (www.aberdeen.com): This group provides custom business intelligence research to help companies improve their overall performance.
- The Data Warehousing Institute (www.tdwi.org): A nonprofit organization that has provided training, certification, conferences, and research for all things data since 1995.
- Forrester Research (www.forrester.com): Forrester Research is a technology market research firm. It publishes Forrester Playbooks to guide firms through technology adoption. You can find a list of publications at https://www.forrester.com/marketing/playbooks-all.html. The Forrester Wave is a tool to evaluate technology vendors.
- Gartner (www.gartner.com): Gartner is a leading technology research and analysis firm, famous for its technology hype cycles and Magic Quadrant research methodology. The hype cycles have been used for years to help professionals understand the maturity of a technology by mapping its development on an adoption curve. The Magic Quadrant places technology vendors in a matrix that measures their maturity.
- International Data Corporation (IDC; www.idc.com): IDC is an independent research firm covering many global market segments, including big data.
- McKinsey Global Institute (www.mckinsey.com/insights/mgi): The McKinsey Global Institute is part of McKinsey & Company, one of the world’s most trusted advisory firms. Its influence reaches across private, public, and social sectors. The 2011 McKinsey Report on big data was a seminal piece of published research, coinciding with a tidal wave of interest in big data globally.