10

Know Your Data

“There’s a tendency to think all the HR data need to be together in one place to get started, but this is not the case. Don’t fall into the trap of unnecessarily postponing analytics.”

—Peter O’Hanlon
Founder and Managing Director, Lever Analytics

Data are the very foundation of analytics. Without data, there are no analytics. As we emphasized in Chapter 7, “Set Your Direction,” the starting point for analytics should always reflect your vision and include a clear articulation of what you are striving to accomplish. However, achieving that vision and mission is not possible without data.

This chapter covers the following:

• A practical approach to data

• Data challenges and solutions

• Types and sources of data

• Data governance

A Pragmatic View of Data

All analytics projects require data, but they do not require data perfection. High data quality should always be a goal, but the pursuit of complete, perfectly clean data shouldn’t be an impediment to progress or a reason not to undertake an analytics project. In many cases, data are incomplete, inconsistently defined, outdated, missing, “dirty” (containing errors of some sort), or stored in multiple disconnected systems. The challenges are real and numerous, but they are not insurmountable.

Techniques do exist for dealing with all these issues and more. Those who have faced these problems successfully agree that the goal is to do your best with the resources you have. You will make progress. You will bring valuable insights. And you will improve as you go. Laurie Bassi, CEO of McBassi & Company, emphasizes, “Do what you can with what you’ve got. You can still move forward.”

The good news is that most organizations have plenty of data for workforce analytics and ample opportunities to address business questions with existing data. It’s important to note that many organizations have restrictions on who can access certain data and for what purpose, and sound justification for data requests is needed. Be sure to allot sufficient time to address data-related issues, but by all means, forge ahead.

Solving Data Quality Challenges

Data analysts routinely take several steps to assess data quality and determine the best path forward. This section discusses common approaches. In addition to understanding these fundamentals, it can be helpful to remain current on the latest views of data challenges and solutions by participating in relevant online forums and user groups.

“The biggest challenge? Data. Not many organizations have a global data warehouse. Data aggregation, data cleansing, having a single trusted source—these are the things we spend most of our time on with clients, not analysis. The latter turns out to be straightforward once the data are in shape.”

—Michael Bazigos
Managing Director and Global Head of Organizational
Analytics & Change Tracking, Accenture Strategy

What Is Good Enough?

The usefulness of analytics hinges on the quality of the data being analyzed and the relevance of the data to the business problem. The well-worn phrase “garbage in, garbage out” is wholly appropriate in the context of workforce analytics. That said, all hope is not lost when faced with imperfect datasets—expecting data nirvana is unrealistic. Don’t become so consumed with trying to fill all data gaps and fix all problems that you lose sight of the overall analytics objectives. Data issues will always arise. When you have confirmed the relevance of the data, the question becomes, how do you know whether the data quality is good enough for the project you are undertaking?

To answer this question, you must get familiar with the data. In many cases, this means learning from others’ expertise. You need to know what to look for when examining the data so that you can proceed with analysis. As a very simple example, suppose you have a data element that represents people’s ages, and you see negative values (for example, –11, –29, –4). You know something is wrong. Most people can be considered subject matter experts in such a common domain (that is, we all understand the age variable and what values are associated with it; an age of –11 does not make sense).

Now suppose you have another data element representing sales. What if you see negative values in this field (such as –$151,783 or –$22.99)? Does this indicate an error, too? Perhaps, but if you check with experts in sales operations, you might learn that negative sales values are, in fact, valid and represent a cancelled order or a renegotiated price from a previous transaction. Negative sales numbers might seem wrong at first (surely the organization didn’t pay people to take their products), but investing time to understand the data before analyzing it can clarify the situation and confirm the validity of the data. Working with subject matter experts helps you educate yourself about what to look for and how best to address any problems or unusual values in the dataset (and also avoids rework and the need for post-analysis troubleshooting).

Automated data profiling can also help in overcoming data challenges. Data profiling refers to checking datasets for allowable values, logic, and consistency. Data profiling tools (available as open source software or from vendors) analyze data for consistency with business rules and provide recommendations on areas to investigate further in a dataset.
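As a simple illustration of the idea, a profiling pass can be sketched in a few lines of Python. The records, field names, and allowable-value rules below are invented for the example; real profiling tools apply far richer checks.

```python
# Hypothetical data-profiling sketch: each rule encodes an allowable-value
# check, and any violation is flagged for further investigation rather than
# silently fixed. All names and values are invented.

records = [
    {"id": "A1", "age": 34, "sales": 1200.00},
    {"id": "A2", "age": -11, "sales": -151783.00},  # age fails; sales may be a valid cancellation
    {"id": "A3", "age": 29, "sales": 22.99},
]

rules = {
    "age": lambda v: 16 <= v <= 80,    # plausible working ages
    "sales": lambda v: v is not None,  # negatives allowed pending confirmation from sales operations
}

def profile(records, rules):
    """Return (record id, field, value) for every rule violation found."""
    violations = []
    for rec in records:
        for field, check in rules.items():
            if not check(rec[field]):
                violations.append((rec["id"], field, rec[field]))
    return violations

print(profile(records, rules))  # -> [('A2', 'age', -11)]
```

Note that the negative sales value deliberately passes: as discussed above, whether it is an error is a question for the subject matter experts, not the profiling rules.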

After profiling your data, how do you determine whether the data are “good enough” to proceed with analysis? Again, you turn to the data owners. They are best positioned to know when the data quality is sufficient to produce useful results. If you can expose people to their own data for validation, even better. People will often spot errors in the information about them, and providing this type of visibility can serve as a useful crosscheck of data validity.

Common Data Challenges and Solutions

What if you determine that the data are not good enough to proceed with analysis? The first step is to understand the challenges. Sometimes the data element you want to analyze has missing values for some cases. Sometimes the data haven’t been refreshed and, therefore, are not reflecting the most recent values. In some cases, the data you want to analyze do not even exist. Each of these scenarios might seem frustrating and even daunting, but there is almost always a way forward. Following are some examples to illustrate both the challenges and the corresponding solutions for addressing them.

Missing Data

Suppose you want to determine whether experiences at work differ for men and women. You might choose to conduct a survey to answer this question. Several participants in the survey might choose not to answer the question asking them to indicate their gender, resulting in output like Figure 10.1. With missing data such as this, you must determine how to proceed with the analysis.

image

Figure 10.1 Example survey data with missing values.

The missing data challenge is common. Depending on the specifics (how many data points are missing, the nature of what’s missing), you can apply methods to account for the missing data. The first step is determining the cause of missing values. Understanding the reasons data are missing is very important and can serve as a guide to determining whether the dataset can be used as is.

One consideration is whether the data are missing for random reasons or reasons that actually relate to the topic of study. If some data fields are blank because of intermittent data entry errors, for example, these are likely random. In contrast, suppose that you are studying attitudes about privacy. The people most concerned about data privacy are less likely to respond to certain questions on a survey because they do not trust what will happen with their data. In that case, the very topic being studied is directly related to the cause of missing data—that is, respondents who are sensitive to privacy issues intentionally skip questions they believe could be used to identify them. Ignoring the missing data and proceeding with the analysis would likely lead to incorrect conclusions in this case. In this type of scenario, the best solution might be to use a proxy variable (that is, a different variable) for the desired measure or to identify a different method to study the topic of interest. For this specific example, observing people’s actual online behaviors (what privacy settings they choose, for example) might be a better course than asking for opinions via a survey. Although an observational study could be more difficult to conduct, the increase in validity will yield far superior insights.

If you have verified that you can proceed with the analysis, you must determine the best approach to address the missing values. One option is simply to eliminate cases with missing values from the analysis, although this can reduce the representativeness of the sample (causing biased results) and the overall sample size (which is undesirable—in general, the more data, the better). Alternatively, missing values can be “filled in” by estimating them, using appropriate assumptions or modeling techniques (such as regression equations to estimate the missing values).
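The two options can be illustrated with a small sketch, using invented survey scores in which None marks a missing response; real projects would typically use richer imputation models than a simple mean.

```python
# A minimal sketch of the two approaches described above: listwise deletion
# drops cases with missing values, while mean imputation fills in an estimate.
# The scores are invented for illustration.
from statistics import mean

scores = [4.0, 3.5, None, 5.0, None, 4.5]  # None marks a missing response

# Option 1: listwise deletion (a smaller, possibly biased sample)
complete = [s for s in scores if s is not None]

# Option 2: mean imputation (keeps the sample size, but shrinks variance)
fill = mean(complete)
imputed = [s if s is not None else fill for s in scores]

print(len(complete))  # 4 cases survive deletion
print(imputed)        # missing values replaced with 4.25
```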

Dealing with missing data requires a degree of expertise, and data profiling tools (discussed earlier) can help. If your team is inexperienced in dealing with missing data, you might also want to seek guidance and counsel from data experts elsewhere in your organization. Marketing and finance functions often have people with these skills. Another option that has served many practitioners well is to hire a data scientist directly onto the workforce analytics team. If direct hiring is not feasible, working with an intern offers a cost-effective way for some analytics functions to acquire the needed expertise and keep projects on track. Interns bring the added benefit of exposure to the latest analytical thinking and techniques. As another solution, companies can contract external partners to help with data issues.

Outdated Data

You might have access to the data you need for your analysis, but some of the values might not be up-to-date. As an example, suppose you want to determine whether compensation is related to productivity (to find out whether more highly paid people produce more). To do this analysis, you get a data extract from the organization’s core Human Resources Information System (HRIS). You learn that the dataset does not reflect recent off-cycle salary increases because the compensation system (which records salaries) has not yet synchronized with the core system. If you are able to source this information only from the HRIS, the current salary will be incorrect for the people who were part of this off-cycle salary program. You then need to determine the implications for your analysis.

Judgment is needed to determine how much of an issue this is for your analysis and conclusions. If the number of outdated values is large enough to appreciably influence the outcome of the analysis, you will want to make every effort to obtain updated information. A sensitivity analysis can be helpful in determining whether updated values will significantly change the findings. If updates are required, work with data owners to identify the best option (some of which are described here).
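A sensitivity check of this kind can be sketched as follows; the salary and productivity figures are invented, and a Pearson correlation stands in for whatever analysis is actually planned.

```python
# A rough sensitivity-analysis sketch: run the analysis once with the
# outdated values and once with plausible corrected values, then compare
# the results. All figures are invented.
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

productivity = [100, 120, 140, 160, 180]
salary_stale = [50, 55, 60, 65, 70]  # before the off-cycle increases
salary_fresh = [50, 58, 60, 68, 70]  # after two known raises

r_stale = pearson(salary_stale, productivity)
r_fresh = pearson(salary_fresh, productivity)
print(round(r_stale, 3), round(r_fresh, 3))
```

If the two results are close, the outdated values are unlikely to change the findings, and waiting for a refresh may be unnecessary.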

If data refresh cycles are frequent (or imminent), your best course of action might be simply to wait for the next refresh. For example, if the compensation system feeds the core HRIS monthly and you need the very latest compensation data, find the specific update schedules and determine whether it makes sense to wait. If waiting for a refresh does not fit with your timeline, you might be able to obtain access to the data from a different, more updated source for specific variables (for example, the source compensation system itself); then you can merge this data extraction with your master dataset.

Another option is to manually update values, if necessary. If you opt for this approach, make sure your updates match the subsequent source system updates. You always want to be consistent with a single trusted source of data. Technology known as change data capture (CDC) can automate the data update process. The goal of CDC is to ensure that data are synchronized across the organization; it achieves this by replicating data changes from a source system to other systems. Updates can be scheduled for specific points in time or even real time, and CDC mitigates risks associated with manual updates.
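The CDC idea can be illustrated with a toy sketch: changes captured from the source system are replayed against a downstream copy so the two stay synchronized. Real CDC tools work from database transaction logs; the records and field names here are invented.

```python
# Toy change-data-capture sketch: captured changes from the source system
# (here, two off-cycle raises) are replayed onto the downstream dataset.
# All identifiers and values are invented.

source_changes = [
    ("update", "emp_042", {"salary": 95000}),  # off-cycle raise
    ("update", "emp_108", {"salary": 71000}),
]

downstream = {
    "emp_042": {"salary": 88000},
    "emp_108": {"salary": 69000},
}

def apply_changes(target, changes):
    """Replay each captured change onto the downstream copy."""
    for op, key, fields in changes:
        if op == "update":
            target[key].update(fields)
    return target

apply_changes(downstream, source_changes)
print(downstream["emp_042"]["salary"])  # the raise is now reflected downstream
```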

If none of these options is feasible, you must collect new data for precisely the information you need. Depending on the importance of the analysis and the timing, this might be a worthwhile endeavor.

No Data Available

Sometimes no systems or databases have the data you need for an analysis. Imagine that employees in a specific part of the business are quitting their jobs at a high rate, and you need to determine the cause. You have several factors to consider, but perhaps the sponsor of the project is particularly interested in looking at promotion history (that is, when and how often people have been promoted to the next job level). However, you learn that no system has recorded promotion information. The data you need for the analysis simply do not exist.

Does a lack of data mean that you cannot consider this variable in your analysis? As with other data challenges, one solution is to initiate a new data collection effort. However, you might be able to find a better option with a bit of creativity and ingenuity: You might be able to approximate the data you need by using a combination of variables that do exist. For example, if you need data on people’s promotion history, you could look for instances in which people had a title change and a corresponding salary change. If someone got a new job title and a raise at exactly the same time, this combination of events is a strong indicator of a promotion and can be used to create what you need without any incremental data collection. Another creative approach is to consider external publicly available data as a proxy. For example, this could be represented by job title changes posted on LinkedIn.
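The promotion proxy just described can be sketched in a few lines; the event records and the rule itself are invented for illustration.

```python
# Sketch of the promotion proxy: a job title change plus a salary increase
# in the same period is treated as a likely promotion. The records and the
# rule are invented for the example.

events = [
    {"emp": "E1", "period": "2016-04", "title_changed": True,  "salary_delta": 6000},
    {"emp": "E2", "period": "2016-04", "title_changed": True,  "salary_delta": 0},     # lateral move
    {"emp": "E3", "period": "2016-07", "title_changed": False, "salary_delta": 2000},  # merit raise only
]

def likely_promotions(events):
    """Flag events where a title change and a raise coincide."""
    return [e for e in events if e["title_changed"] and e["salary_delta"] > 0]

print([e["emp"] for e in likely_promotions(events)])  # -> ['E1']
```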

Additional Data Challenges

Although missing, outdated, or unavailable data are obvious challenges to overcome, it’s important to be aware of less obvious data challenges as well. Specifically, characteristics of the data themselves can potentially lead an analyst astray if they are overlooked. The following sections discuss these types of challenges.

Non-normal Data Distributions

Many commonly used statistics (for example, mean-difference tests and regression) are based on assumptions about the data being analyzed. One important assumption is that the data are normally distributed: If you took many samples, calculated the mean (average) score from each sample, and plotted all the means in a frequency graph, they would look like a bell-shaped curve. Not all variables are normally distributed, though. Consider net worth as an example: A small number of people have extremely high values relative to the rest of the population, giving the distribution a long right tail rather than a symmetric bell shape.

Two common indicators of non-normal distributions are skewness and kurtosis, which are measures of a distribution’s shape. Skewness measures lack of symmetry in a data distribution (as in the net worth example); zero skewness indicates perfect symmetry, as would be expected in a normal distribution. Kurtosis reflects whether the lengths of the tails in a data distribution are extreme; zero kurtosis indicates tails that are neither longer nor shorter than would be expected in a normal distribution. See Figure 10.2 for examples of skewness and kurtosis.

image

Figure 10.2 Examples of skewness and kurtosis.

Why does this matter? If you use statistics that require normality, but this assumption is not met, the statistical tests could yield misleading results. The whole point of statistics is to build a fact base on which to inform decisions; if the fact base is inaccurate because analysis tools were misused, it runs counter to that goal. As described earlier, tests are available to determine whether your data meet the normality assumption. If you suspect they do not (see Figure 10.3 for another example of a non-normal distribution), you can either apply corrections to approximate normality in the data or use alternative statistical techniques for analysis. Experts such as data scientists, analysts, and industrial-organizational psychologists (within your team, from other functions in the organization, or outside partners) can offer guidance on how best to proceed.

image

Figure 10.3 Example of a non-normal distribution of data.
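For readers who want to see the measure in action, skewness and one common correction (a log transform for positively skewed data) can be sketched as follows; the data values are invented.

```python
# Sketch of the skewness measure (the third standardized moment, zero for
# symmetric data) and a log transform as one common correction for positive
# skew. The raw values are invented, net-worth-like data.
import math
from statistics import mean, pstdev

def skewness(xs):
    """Population skewness: average cubed standardized deviation."""
    m, s = mean(xs), pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

raw = [1, 2, 2, 3, 3, 3, 4, 5, 8, 20]      # right-skewed: one extreme value
logged = [math.log(x) for x in raw]        # log transform compresses the tail

print(round(skewness(raw), 2))
print(round(skewness(logged), 2))          # closer to zero after the transform
```

Whether a transform like this is appropriate depends on the analysis, which is where the experts mentioned above can advise.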

Data Outliers

Outliers are values that are abnormally high or low relative to most other values in a sample of data. Identifying outliers in your data is important because a few extreme values can alter the results considerably. Sometimes outliers are legitimate values; other times, they are the result of a data error. Either way, they can lead to misleading conclusions. Tests are available to detect outliers. At a minimum, always examine the distributions of your data as a check before you run statistical tests (see the examples in Figure 10.4 to get a sense of what to look for when plotting data graphically). If you identify outliers, you need to make an informed decision on whether to include or exclude them in the analysis. Including extreme values might mask an important relationship or insight. Excluding them might hide a meaningful variation. Consult the data owners to help with this decision.

image

Figure 10.4 Examples of data outliers: Single Variable Histogram (top) and Bivariate Scatter-plot (bottom).
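One widely used screen for outliers is the interquartile-range (IQR) rule, which flags values beyond 1.5 times the IQR from the quartiles. A minimal sketch, with invented salary figures:

```python
# IQR outlier screen: values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are
# flagged for review, not automatically removed. The salaries are invented.
from statistics import quantiles

salaries = [52000, 54000, 55000, 56000, 58000, 59000, 61000, 250000]

q1, _, q3 = quantiles(salaries, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [s for s in salaries if s < low or s > high]
print(outliers)  # the 250,000 value stands out; whether to keep it is a judgment call
```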

Inconsistent Data Definitions

Another frequently encountered challenge is inconsistent definitions of the same data elements. As an example, you might want to join two or more different datasets, linking them with a common identifier. In one dataset, the identifier might be a six-character alphanumeric employee identification number. In the second dataset, the identifier might be the same employee identification number with a three-character alphanumeric country indicator appended, making it a nine-character identifier. And in yet a third dataset, the identifier might be a government-issued number such as a Social Security number.

How can you connect these datasets in this scenario? The first and second datasets are relatively straightforward but require a transformation to create a new field in the second dataset. Let’s assume you are analyzing data from one country only, so the country indicator is unnecessary. Using computer programming or data modeling software, you can automate the removal of the three-character country indicator from each case, resulting in an exact match of employee identifier in the first two datasets. (Alternatively, if you need the country indicator, you can automate the addition of that indicator to the first dataset, although that requires a few more steps.) For the third dataset, you can use a lookup function from data in the core HRIS that equates Social Security number with employee identification number. Apply a matching algorithm, and you then have three datasets with identical employee identifiers, allowing you to focus on the analyses of interest.
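The identifier-alignment steps just described can be sketched as follows; the identifiers, field names, and lookup table are all invented for illustration.

```python
# Sketch of aligning three datasets on one employee identifier: strip the
# three-character country suffix from the second, and map government IDs to
# employee IDs in the third via an HRIS lookup. All values are invented.

ds1 = {"X4T9B2": {"salary": 85000}}         # six-character employee ID
ds2 = {"X4T9B2USA": {"rating": 3}}          # same ID with country suffix appended
ds3 = {"123-45-6789": {"engagement": 4.2}}  # government-issued number

hris_lookup = {"123-45-6789": "X4T9B2"}     # SSN -> employee ID, from the core HRIS

# Normalize ds2 by removing the trailing three-character country indicator
ds2_norm = {k[:-3]: v for k, v in ds2.items()}

# Normalize ds3 via the HRIS lookup table
ds3_norm = {hris_lookup[k]: v for k, v in ds3.items()}

# All three datasets now share one key and can be merged record by record
merged = {
    emp: {**ds1.get(emp, {}), **ds2_norm.get(emp, {}), **ds3_norm.get(emp, {})}
    for emp in ds1
}
print(merged["X4T9B2"])  # -> {'salary': 85000, 'rating': 3, 'engagement': 4.2}
```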

Data inconsistency can also arise when information is not entered in a standardized way across the organization. Andre Obereigner, Manager for Workforce Analytics at Groupon, explains: “It’s very important to get high-level agreement of HR metrics—which are most relevant for the organization and how we define them. For example, consider headcount: It seems easy, but does it include contingent workers? What about interns? Or are you only considering regular fixed-term workers? What about people on leave—are they headcount or not? Define the metrics you are using and write that definition down. It can be very challenging if different parts of the organization use different definitions.”

Andre managed these metric challenges by educating the local HR teams through guidelines and videos showing how to record data in a standard way, and he advised on steps to ensure data quality. Andre also noted the importance of senior management support: When the head of HR is very focused on data quality, this also becomes a priority for the HR community.

Ultimately, establishing robust data governance processes is good practice, as discussed later in this chapter. This helps in improving data quality and gaining efficiencies for the long term.

Data Types and Sources

To get the most benefit from workforce analytics, it’s important to think broadly about the types of data to incorporate into analyses. Traditional sources such as employee information stored in the core HRIS and other HR systems should certainly be considered in scope, along with non-HR data such as financial performance and customer satisfaction. Also think beyond your organization’s walls—for example, relevant social media data. New technologies such as sensors and “smart” devices are continually creating additional data sources to consider as well.

DON’T LET THE LACK OF ONE INTEGRATED HRIS STOP YOU

Mariëlle Sonnenberg has been leading the Global HR Strategy and Analytics team at Wolters Kluwer since 2013. In the last few years, she has achieved a strong reputation both internally and externally. Although many workforce analytics practitioners begin their work using the data in the core HRIS for their analyses, Mariëlle did not have a single enterprise-wide source of HR data, and she was able to use that to her advantage.

Instrumental to Mariëlle’s early success was the fact that she was not beholden to the enormous amount of reporting, metrics, and data that consume some analytics professionals when they have, or are implementing, an integrated HRIS.

“When I started, we had many different systems that were not all integrated. We got our headcount information from our finance reporting systems. There were no specific analytical capabilities within the HR function. I therefore started by looking at questions asked of HR leaders, mostly customer-related questions—for example, how much revenue-generating capabilities do we have? I went to our finance system, which is accurate, and I started with their definitions. I began reporting to the CHRO and the CEO based on finance information.”

While Mariëlle brought together the data she needed, she didn’t spend time implementing a single HRIS. Instead, she concentrated on the most important metrics to the business and used the most appropriate system for that metric when she needed it.

“I remember a question our CEO had around varying personnel costs across countries and businesses,” Mariëlle says. “The analysis we conducted had a great impact, and it prompted discussions about which workforce-related metrics would enhance our operational focus to drive cost and margin improvements.”

Mariëlle could answer the personnel costs question without the need for an integrated HRIS. Instead of holding her back, the lack of such a system enabled her to focus only on core metrics and business questions like the one from the CEO. “This is where I got lucky,” she says. “We had nothing in terms of systems (so no complexity), and rather than consuming time with endless amounts of reports and data, I focused on a few important topics.”

Data from Inside the HR Function

An organization’s core HRIS provides a ready source of relevant and analyzable data, with information such as tenure, promotion history, compensation information, job category, educational background, and various demographic variables. Additional data typically available in HR systems (often outside the HRIS) include learning history, performance ratings, aptitude scores, personality scores, skills, competencies, and employee engagement scores. As Chapter 11, “Know Your Technology,” discusses, you can extract these data elements directly from the HRIS and other systems, or you can access them through data feeds to reporting systems or data warehouses. These elements will likely form the core of many workforce analyses.

Data from Outside the HR Function

Workforce analytics should strive to show a link between HR data and key business metrics. To do so, you need to bring together data that are often housed in different systems within the organization, such as financial or customer databases. This can be tricky because different systems have different owners. Patrick Coolen, Head of People Analytics at ABN AMRO, describes this challenge: “Sometimes we had trouble getting the data from the business. They said that they wanted to do it and they had the data. But it can be a struggle to get the right data in the right format on time. Management can say it is okay, but you also need a specialist to deliver the data to your team. And only then can you start connecting and cleaning the data.”

Advice from practitioners on accessing needed data is to build strong, trusting relationships early on with the owners of the different data sources. You will need to partner with the data owners throughout the analytics process, so these relationships are essential. In addition to obtaining access to the data, you will likely need help interpreting the data (as illustrated by the negative sales numbers referred to earlier).

Expect this to be an iterative process. As you build positive relationships, set expectations that you will likely have periodic questions about the data over the course of the project, and secure permission for follow-up discussions and queries. Ultimately, it is best to establish roles (such as a data steward) and practices (such as building a business glossary or data dictionary, as discussed in the data governance section later in this chapter) to achieve efficiencies through repeatable data management processes.

Initial contact with data owners typically comes from senior sponsors of the project. The sponsors identify the people in the organization who can provide data access. They can also remove any roadblocks encountered along the way. As Michael Bazigos, Managing Director and Global Head of Organizational Analytics & Change Tracking at Accenture Strategy, describes: “The stakeholders defined the problem and provided the political juice to access the data, and when we had trouble, we were able to get help from the sponsor. The road to change is lined with a thousand guardians of the past.”

Nontraditional Data Sources

The early twenty-first century has been a time of data proliferation. It stretches the imagination to think about the volume and types of data the future will hold, especially with the emergence and growth of the Internet of Things (such as wearables, sensors, and tracking devices) and social media. This, of course, presents a tremendous opportunity for workforce analytics. In addition to the types of data that might typically be considered for analysis (employee engagement scores, years of service, performance ratings, revenue and market share, to name a few), new insights can come from tapping into less traditional sources of data.

Consider social network postings. Not only was this information virtually nonexistent (or outside the mainstream) until the early 2000s, but the technology to analyze such large amounts of unstructured data was also out of reach for most organizations. Subsequently, entire businesses have been built around the ability to capture and react to real-time insights (for example, the job review website Glassdoor can reveal to potential employees what it’s like to work for any one of thousands of companies). With this evolution, new sources of insights have opened up to organizations from publicly available external social media sites, internal intranets, and collaboration software.

An example from IBM illustrates the potential power of this data source. In 2015, IBM had a policy of not reimbursing expenses associated with rideshare services (due to employee safety concerns in this newly emerging mode of transportation). Using an internal social media platform, an employee expressed frustration with this policy. The posting quickly went viral internally, with many employees responding and expressing support for ridesharing. An online petition arose. Within hours the company’s leadership became aware of this trending topic, thanks to the use of an analytics tool that detects rare events. And within 24 hours, the issue was discussed and a resolution agreed: The policy was changed to allow for rideshare service expense reimbursement (for more details, see Business Insider, 3 June 2015). This is an example of the need for evolving types of analytics for emerging types of data. Organizations are advised to keep pace with developments such as these so they can remain well informed and respond to the internal and external environment in a timely manner.

Another nontraditional source is employee benefits call center data. Recorded information can be analyzed to determine which aspects of their benefits package employees find confusing or need assistance in navigating. Information such as this represents a ready source of “employee voice” data that has already been collected and stored.

Metadata of website activity (also known as click stream or click path data) is another nontraditional data source to consider. This refers to tracking web page viewing behavior (for example, where and when people click on a page, time spent on a page, and viewing patterns). Companies can gain insight into the type of online content that is more or less valuable to employees, and not always in the ways expected. As an example, a company’s IT support function might be happy to know that people are spending a great deal of time on the information and support pages of the website. This could indicate that the content on those pages is helpful and worth visiting. However, an alternative explanation is that people are spending time there because they are having difficulty finding the information they need on the main website pages. A traditional data collection method (for example, focus groups) can complement the valuable web tracking data in this situation and help explain the online behavior being observed.
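Deriving a simple time-on-page metric from raw click events can be sketched as follows; the timestamps and page names are invented.

```python
# Sketch of deriving time-on-page from raw click-stream events: the time
# spent on each page is inferred from the gap to the next click. The
# timestamps and page names are invented.
from datetime import datetime

clicks = [
    ("2016-05-01 09:00:00", "home"),
    ("2016-05-01 09:00:20", "it-support"),
    ("2016-05-01 09:06:20", "it-support/faq"),
]

def time_on_page(clicks):
    """Seconds spent on each page, inferred from the gap to the next click."""
    times = [datetime.strptime(ts, "%Y-%m-%d %H:%M:%S") for ts, _ in clicks]
    return {
        page: (times[i + 1] - times[i]).total_seconds()
        for i, (_, page) in enumerate(clicks[:-1])
    }

print(time_on_page(clicks))  # six minutes on it-support: helpful content, or hard to navigate?
```

As the paragraph above notes, the numbers alone cannot distinguish engagement from confusion; complementary methods such as focus groups help interpret them.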

Recruiting is another area that can apply this technology to good effect. Evaluating a candidate’s click path provides a view of the candidate’s experience, shows where candidates drop out of the application process, and sheds light on the technology’s effectiveness in attracting candidates to jobs. Thanks to machine-learning technology, all this insight can be “learned” by cognitive systems, resulting in work experiences customized to an individual employee’s needs. Technology is potentially transforming the very nature of work and the way it gets done.

Bringing Together Different Data Sources

To realize the potential impact of all these data sources, it is helpful to connect them. The reality is, data are typically stored in multiple places in any given organization, and systems are likely even more disparate if a company has had one or more acquisitions in its past. Knitting together data sources should not be the first order of business (as Chapter 9, “Get a Quick Win,” underscores, you are better served demonstrating the value of workforce analytics by first successfully executing a lower complexity project), but tremendous benefit can come from creating a mechanism for connecting data sources.

Cloud technology helps address this challenge, as discussed in Chapter 11. Data service providers can also be of great help, particularly in knowing what questions to ask, advising on best practices, facilitating the process, and defining reference architectures (descriptions of system structures) that account for various data types. The goal is to develop a manageable, ongoing approach that avoids having to re-create datasets time and again.

Data Governance

After the first successful analytics project has been implemented and recognized, it’s important to take the time to invest in data governance. Data governance refers to comprehensive strategies, policies, standards, and rules for managing data in your organization. This includes decisions and agreements on all things data—what data elements to measure and store, how to define each element, who is responsible for integrity and maintenance, who can access the data, and more.

“Data governance has become a hot topic in HR. Ten years ago, it wasn’t a consideration. Data quality is recognized as important today, and it’s essential in getting to a level of analytics maturity with defined and repeatable systems.”

—Jeremy Shapiro
Head of Talent Analytics, Morgan Stanley

One important aspect of data governance is ensuring accountability for data quality. Establishing roles such as data stewards can accomplish this. Another recommended practice is to establish a data dictionary or business glossary for each data source used. A data dictionary includes the business and technical definitions of the elements within a dataset, reducing the need to repeatedly define datasets. The data steward is typically responsible for establishing the business glossary.
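A data dictionary need not start as anything elaborate. The sketch below shows one minimal way to structure entries in code; the field names, definitions, and steward assignment are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class DictionaryEntry:
    """One entry in a data dictionary / business glossary."""
    element: str          # technical field name in the source system
    business_name: str    # name the business uses for this element
    definition: str       # agreed-upon business definition
    allowed_values: str   # valid values or range
    steward: str          # person accountable for this element's quality

data_dictionary = [
    DictionaryEntry(
        element="term_date",
        business_name="Termination date",
        definition="Last day of employment, inclusive.",
        allowed_values="Valid date on or after hire date; empty for active employees",
        steward="HR data steward",
    ),
    DictionaryEntry(
        element="attrition_rate",
        business_name="Attrition rate",
        definition="Voluntary terminations in period divided by average headcount in period.",
        allowed_values="0.0 to 1.0",
        steward="HR data steward",
    ),
]

# Index by business name so anyone can look up the single agreed definition.
glossary = {e.business_name: e.definition for e in data_dictionary}
print(glossary["Attrition rate"])
```

Even a spreadsheet with these same columns serves the purpose; what matters is that every metric has exactly one agreed definition and one accountable steward.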

Data governance can be established incrementally, starting inside the HR function. It helps build a deeper understanding of the data, guides decisions on data management and access, and creates end-to-end data lifecycle management across various systems. Data governance is part of the larger workforce analytics governance discussed in Chapter 14, “Establish an Operating Model.” Taking the time to get this right (and agreed upon) will serve you well for subsequent data analytics projects.

Remember the Basics

The world of Big Data and nontraditional data sources opens up a new realm of possibilities for workforce analytics, but it is important not to lose sight of the basics. In some cases, collecting a “small” new dataset might be better than analyzing a “big” existing dataset that does not contain the necessary variables or cases to answer your question accurately. Max Blumberg, Founder of Blumberg Partnership Limited, advocates: “Big Data is very fashionable at the moment but sometimes not very practical. Fishing in Big Data for relationships that may or may not exist is not the best use of time if there are business problems to be solved. Instead, if the analytics resources are put to work on specific problem solving, analytics teams are likely to see a much better ROI.”

A DATA DICTIONARY BRINGS YOU CREDIBILITY

Giovanni Everduin, Head of Strategic HR, Communications, and Change, used his management consulting experience to help bring credibility to the HR function at Tanfeeth,2 a relatively new business services company based in Dubai with approximately 2,300 employees.

“When I arrived in 2011, I noticed there was no discipline for uploading HR data at the company. We couldn’t even answer a simple question like, how many people do we have today?”

The company was growing incredibly quickly and needed to get a good grip on its basic data to be able to forecast and predict workforce costs. In addition, the CEO at Tanfeeth wanted to make the company a data-driven organization. Giovanni explains that HR couldn’t initially operate like that: “We were in a meeting and the CEO asked what the attrition rate was. Three people came up with three different answers. All were correct, but each used a different definition; it was embarrassing. We went away and worked with the finance function to get a clearly agreed definition. Getting that level of clarity and agreement is so important. Without that, it is just confusing.”

Instead of jumping straight into the analytics, Giovanni focused on building a data dictionary. He advises using best-practice definitions for every metric and data element: “There are best practices I had from my work as a consultant, but you don’t need that background. Just do a Google search and you’ll find good definitions. Or go to SHRM3 or CIPD.4”

Having a clearly agreed-upon set of definitions for all data elements and HR metrics allows the analytics function to build credibility and avoid the sort of situation Giovanni found himself in.

The best approach might be to conduct an experiment to answer your specific question. For example, suppose a company is no longer satisfied with its performance management system, and the HR team designs what it believes is a better approach. How will they know whether the new approach works better than the old? The best way to answer this question is to conduct an experiment (or a quasi-experiment, if a true experiment is not feasible). See Chapter 5, “Basics of Data Analysis,” for more information on research designs.

In the performance management example, the HR team can implement the new approach in one business unit while other business units continue with the old approach. The team needs to define the desired outcome (for example, employees will be more motivated to improve their performance in the new system versus the old system). Next, the team needs to measure employees’ motivation before and after experiencing the new approach, as well as measure similar attitudes, at a similar time, for the control group of employees experiencing the old approach. Finally, the team needs to compare the measurements. If statistical analysis shows that employees in the new approach are more motivated relative to both their baseline levels and the control group, strong evidence exists to indicate that the new approach is better (at least, in terms of employee motivation).
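The comparison described above amounts to a difference-in-differences calculation: the change in the treatment group minus the change in the control group. A minimal sketch follows, with hypothetical motivation scores; a real analysis would also test whether the difference is statistically significant (see Chapter 5):

```python
# Hypothetical motivation scores (1-5 scale), measured before and after
# the new performance management approach was introduced.
treatment_pre  = [3.1, 3.4, 2.9, 3.2, 3.0]   # business unit with the new approach
treatment_post = [3.9, 4.1, 3.6, 4.0, 3.8]
control_pre    = [3.2, 3.0, 3.3, 3.1, 3.4]   # business units keeping the old approach
control_post   = [3.3, 3.1, 3.2, 3.2, 3.5]

def mean(xs):
    return sum(xs) / len(xs)

# Difference-in-differences: change in treatment minus change in control.
# This isolates the effect of the new approach from changes affecting everyone.
did = (mean(treatment_post) - mean(treatment_pre)) - (mean(control_post) - mean(control_pre))
print(round(did, 2))
```

A positive value here means motivation rose more in the group experiencing the new approach than in the control group, which is the pattern of evidence the HR team is looking for.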

The term A/B testing, often used in marketing and web development, refers to a randomized experiment with two groups: a control group that experiences the current design and an experimental group that receives some variation of the design (the A group and B group, thus the name A/B testing). In the context of web design, the objective is to introduce a change to a web page and determine whether that specific change results in corresponding changes in an outcome (such as click-through rates for advertisements). By randomly assigning users to the A and B groups, and ensuring that the only difference in the web page is the one thing you are varying (for example, advertisement placement), you can confidently attribute any differences observed in click-through rates to the change.

Similar approaches can be applied in the workplace. As an example, during benefits enrollment cycles, a company could randomly assign people to enrollment activities using either a traditional approach (the A group) or a digital assistant (the B group) and then track differences in benefit choices between the groups. If employees in Group B choose best-fit plans at a higher rate than employees in Group A, that is evidence that the digital assistant is beneficial.
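The group comparison can be made more rigorous with a standard two-proportion z-test. The sketch below uses hypothetical group sizes and best-fit plan counts; the random assignment step mirrors the design described above:

```python
import math
import random

random.seed(42)

# Randomly assign employees to the A (traditional) or B (digital assistant) group.
employees = [f"emp{i}" for i in range(200)]
random.shuffle(employees)
group_a, group_b = employees[:100], employees[100:]

# Hypothetical outcomes: how many in each group chose a best-fit plan.
best_fit_a, n_a = 52, len(group_a)
best_fit_b, n_b = 67, len(group_b)

def two_proportion_z(x1, n1, x2, n2):
    """Two-proportion z-statistic for H0: p1 == p2, using the pooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p2 - p1) / se

z = two_proportion_z(best_fit_a, n_a, best_fit_b, n_b)
print(round(z, 2))  # |z| > 1.96 suggests a real difference at the 5% level
```

With these illustrative numbers, the z-statistic exceeds 1.96, so the difference between the groups would be unlikely to arise by chance alone.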

Deciding Between Big and Small Data

Previous sections discussed examples of Big Data (for example, social media postings) and more traditional small data (such as survey responses), as well as various data challenges that you will likely encounter along the way. How do you decide the best way forward?

A pragmatic approach is recommended. Start with a clear idea of the questions you are trying to answer and find the best available data sources at your disposal to answer those questions. Then dive in with eyes wide open. Apply analytics to the best of your ability, being aware of potential pitfalls (highlighted in this chapter) and heeding the advice of experts as you go. But don’t get paralyzed into inaction; have rigorous debates, challenge yourself, and triangulate where you can (that is, use multiple data sources and techniques to point to the same conclusion). Know your audience and over-deliver on data quality, if necessary for stakeholder buy-in. But don’t lose sight of the purpose of the project, and don’t let the data become the project. In sum, strike the right balance. You will be much better off than if you rely purely on intuition and assumptions.

Summary

Data are essential building blocks for workforce analytics, and relevant, high-quality data are needed for quality results. The following guidance helps you strike the right balance between ensuring data quality and progressing the workforce analytics agenda:

• Build relationships with data owners to facilitate data access and learn the details of the datasets (such as allowable data values, methods for interpreting the data, and ways to spot errors); utilize data profiling technology to assist with data checking.

• Check data for missing values, determine reasons for the missing data, and take appropriate corrective action (drawing on expertise as needed).

• Verify that you have the most current and complete version of the data needed for the analysis.

• Determine whether you can create the data you need from the data you have, or find a proxy for the data you need.

• Consider a full spectrum of data sources and choose those that best answer your questions.

• Take the time to establish data governance processes, including data steward roles, to ensure ongoing data quality.

• Participate in online forums and user groups to stay current with the latest views on data challenges and solutions.

• Hire or partner with a data scientist to assist with data decisions; an intern can be a cost-effective approach.

• Recognize that data collection and analysis will be iterative and that you will refine and improve as you go.
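Several of the checks in the list above (missing values, allowable data values) can be automated even without dedicated profiling software. The sketch below uses hypothetical employee records and validation rules:

```python
# Hypothetical employee records pulled from an HR system.
records = [
    {"id": 1, "hire_date": "2015-03-01", "department": "Finance", "fte": 1.0},
    {"id": 2, "hire_date": None,         "department": "HR",      "fte": 0.5},
    {"id": 3, "hire_date": "2018-07-15", "department": None,      "fte": 1.3},
]

REQUIRED = ["hire_date", "department"]

def profile(records):
    """Count missing required fields and flag values outside the allowed range."""
    missing = {field: 0 for field in REQUIRED}
    out_of_range = []
    for rec in records:
        for field in REQUIRED:
            if rec.get(field) is None:
                missing[field] += 1
        if not (0.0 < rec["fte"] <= 1.0):   # FTE must lie between 0 and 1
            out_of_range.append(rec["id"])
    return missing, out_of_range

missing, out_of_range = profile(records)
print(missing)        # counts of missing required fields
print(out_of_range)   # record IDs with out-of-range values
```

Running checks like these before analysis, and discussing the flagged records with the data owner, is a lightweight way to act on the guidance above without waiting for a formal profiling tool.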

 

1 Wolters Kluwer is a global leader in information services and solutions for professionals in the health, tax and accounting, risk and compliance, finance, and legal sectors. The company serves customers in more than 180 countries, maintains operations in more than 40 countries, and employs 19,000 people worldwide. It is headquartered in the Netherlands and is listed on Euronext Amsterdam (www.wolterskluwer.com).

2 Tanfeeth is a large-scale business service partner based in Dubai that handles back-office operations for the Emirates NBD Group. Tanfeeth was established on September 19, 2011 (www.emiratesnbd.com).

3 Society for Human Resource Management (headquartered in Virginia, United States).

4 Chartered Institute of Personnel and Development (headquartered in London, United Kingdom).