16 Further Exercises for Practice

This final chapter contains a set of scenarios and exercises that will allow you to put into further practice some of the techniques you have learned in this book, and are supplementary to the exercises provided at the end of each of the earlier chapters. All the exercises are based on data that is available in the peopleanalyticsdata package in R, or alternatively can be downloaded from the internet. While the scenarios are fictitious, they are intended to represent typical questions and situations that arise when doing statistical modeling in people analytics.

As you work through these scenarios, I encourage you to document your work using either an R Markdown document or a Jupyter Notebook if you prefer. This will help you keep a record of your method, approach and code in case you need to put it into practice again in the future. It will also make it easy for you to share your work with others (for example by putting it in a Github repository), which will allow you to collaborate, discuss and open your work to critique. If you are starting out on your analytics journey, exposing your work to others is one of the best ways to learn. If you are more experienced, then there are others that will undoubtedly benefit from seeing how you went about solving these problems.

16.1 Analyzing graduate salaries

Graduate salary levels are important economic indicators of the value of tertiary education. They can provide important insight about the value of education to employers or to the economy as a whole. When studied in detail they can highlight which particular disciplines have higher or lower levels of demand for graduates, and they can be important factors in determining what subjects or majors students choose to specialize in.

Government agencies and educational institutions will analyze graduate salaries regularly to help critique or validate policy or strategy. Employers will also regularly study publicly available graduate salary information to help them benchmark their compensation and benefits against the external market for graduates.

16.1.1 The `graduates` data set

The graduates data set contains information on graduates currently in the United States across 173 specific subject majors grouped into 16 disciplines of study. This data set is sourced from the FiveThirtyEight data repository¹.

Load the graduates data set via the peopleanalyticsdata package or download it from the internet². The fields in the graduate data set are as follows:

Major is the specific subject major.
Discipline is the broad subject discipline.
Total is the number of graduates of working age in the US.
Unemployment_rate is the proportion of graduates currently unemployed.
Median_salary is the current median salary of those employed in US dollars.

16.1.2 Discussion questions

What kind of outcome is the Median_salary column?
Which of the variables in the data set would you be interested in using to explain the Median_salary outcome? Why?
Are there any transformations you would consider on any of the input variables to help with interpreting the model?
What type of model would you use to try to explain the Median_salary outcome using these variables?
Describe the data type of each of the input variables. How would you interpret the coefficients of the model for each of these data types?

16.1.3 Data exercises

Perform an exploratory data analysis on the data set. Do you see any interesting patterns?
Conduct appropriate hypothesis tests on any of the input variables that interest you to determine if they relate to statistically significant differences in the Median_salary outcome.
If you wish to, perform transformations on the data to help with the interpretation of model results. For example, are the numerical scales intuitive for interpretation?
Run an appropriate multivariate model to determine which input variables have a significant effect on median graduate salary.
Articulate the results of your model, including an estimate of the effect of the input variables and the overall fit and goodness-of-fit of the model.

16.2 Analyzing a recruiting process

Organizations are often very interested in analyzing data from recruiting processes, usually with a couple of goals in mind. Firstly, there is an interest in whether the process is efficient and effective. Secondly, there is an interest in whether the process is fair to different groups and a foundation for the recruiting of a sufficiently diverse set of employees.

The efficiency and effectiveness of a recruiting process can depend heavily on how it is organized, what methods are used and whether those methods are helpful in determining hiring decisions. Statistics from individual elements of the process such as interviews or assessments are often studied. Models can be built to determine which elements influence the decision to hire or not to hire. In an ideal world, some sort of future job performance outcome would be particularly useful in studying the efficiency and effectiveness of a recruiting process, but often this is a very difficult thing to do, especially if the process is very selective. Hiring only a small proportion of applicants usually results in a statistical phenomenon known as range restriction, where the statistics of those hired fall in a very narrow range that makes useful analysis extremely challenging. For this reason, many organizations focus primarily on the final hiring decision as an outcome of interest.

Understanding fairness in a recruiting process usually involves studying how the statistics of that process differ between subgroups of interest and whether any of the differences are significant enough to infer potential bias. Understanding whether these differences are attributable to a particular element of the process, whether it be a test or the rating behaviors of interviewers, is also important in determining whether specific action can be taken to remedy the situation.

16.2.1 The `recruiting` data set

The recruiting data set contains information on 966 applicants who went through the final stages of a recruiting process for graduate positions at a large international financial services company. The recruiting process operates as follows:

Applications are screened according to a number of criteria, including their SAT scores and their undergraduate GPA as well as an online aptitude test they are requested to take and numerous other judgments made by the individuals screening the applications.
Applicants who pass the screening stage are invited for three interviews, two of them with line managers and a third with a human resources professional. Different line managers or HR professionals conduct the interviews on different interview days. Each interviewer independently gives an applicant a score on a Likert scale of 1–5 indicating increasing positive sentiment.
Interviewers and human resources professionals gather to discuss each case and make a final decision on whether to hire or not to hire. All the information used in screening and evaluating applicants is made available to decision makers during this discussion.

Load the recruiting data set via the peopleanalyticsdata package or download it from the internet³. The fields in the recruiting data set are as follows:

gender is the gender of the applicant.
sat is the SAT score of the applicant.
gpa is the Undergraduate GPA of the applicant.
apttest is the score on the aptitude test given to the applicant.
int1 is the rating of the first line manager interviewer.
int2 is the rating of the second line manager interviewer.
int3 is the rating of the human resources interviewer.
hired is a binary indicator of whether a decision was made to hire the applicant.

16.2.2 Discussion questions

Considering the way the recruiting process works, what kinds of inferential analysis or modeling would you be interested in applying to help understand its efficiency and effectiveness?
What kind of model is most appropriate for explaining the hired outcome?
One of your stakeholders is suggesting that the aptitude test is a waste of time and that the information it provides can already be gleaned from the applicants’ SAT scores and GPA. What kind of statistical analysis or model would help you confirm or reject this?
Do you think that collinearity might pose a risk in this data? If so, what variables would concern you?
What kind of hypothesis test would you use to determine if the hiring outcome may be different by gender?
What kind of hypothesis test would you use to determine if the aptitude test score may be different by gender?
How would you go about determining if any gender difference in the hiring outcome can be attributed to a specific part of the process?

16.2.3 Data exercises

Perform an exploratory data analysis on the recruiting data set. Be sure to convert data to the best type for your purposes.
Develop a model to test how the aptitude test results are explained by an applicant’s SAT and GPA. What can you conclude from this? Have you considered possible collinearity in this model?
Develop a model to explain how all the inputs in the hiring decision (interview ratings, aptitude test, SAT and GPA) influence the hiring decision. Reduce this to the most parsimonious model you are comfortable with. What can you conclude from this model about the role of the different elements of the recruiting process in the final hiring decision?
Test whether there is a statistically significant difference in the hiring outcome for males versus females.
By adding gender into your model from Data Exercise 3, determine what element or elements of the recruiting process may be related to any differences in gender in the hiring outcome.

16.3 Analyzing the drivers of performance ratings

In many organizations and for many job types, promotion and performance ratings are the primary indicators of the success of employees. However, promotion is not always available to employees and can be very dependent on role and timing. Since performance ratings are usually generated on a regular basis, it is usually these that garner the most attention in the analysis of success.

However, performance ratings are not perfect indicators of reality. They are usually the result of some judgment from one or more evaluators. Part of that judgment will be informed by data and part will be informed by contextual considerations or personal preferences outside of the data. Therefore it is frequently of interest to analyze performance ratings as the outcome of a decision making process. Such analysis can inform us as to what parts of the evaluation process are operating as intended and what parts are not. Multivariate models around performance can help us understand the degree to which the evaluation process is data driven, the degree to which unfairness might exist in some evaluation decisions and what might be the source of that unfairness.

16.3.1 The `employee_performance` data set

The employee_performance data set contains data on the most recent performance evaluations of 366 salespeople in a technology company. Each employee is evaluated by their manager, who considers certain performance indicators together with their own judgment and awards a performance rating from 1 to 3, where 1 means ‘Needs improvement’, 2 means ‘Performing well’ and 3 means ‘Outstanding’.

Load the employee_performance data set via the peopleanalyticsdata package or download it from the internet⁴. The fields in the employee_performance data set are as follows:

sales: The sales in millions of dollars made by the salesperson during the evaluation period
new_customers: The number of new customers acquired by the salesperson during the evaluation period
region: The region that the salesperson operates in: North, South, East or West
gender: The gender of the salesperson
rating: The performance rating awarded by the manager

16.3.2 Discussion questions

What hypothesis test should be used to determine if there is a significant relationship between sales and performance rating?
What hypothesis test should be used to determine if there is a similar distribution of performance ratings between the four regions?
What type of an outcome is the performance rating? What kind of model is appropriate for explaining what influences performance ratings?
Which input variables would you want to be significant and which would you want to be insignificant in your model in order to support an argument that the evaluation process is fair and relevant to the job?
What assumptions would you need to check after you have run your model to have confidence that you can trust your inferences?

16.3.3 Data exercises

Run separate and appropriate hypothesis tests on each of the input variables to determine if they have a significant effect on the performance outcome.
Prepare your data for running an appropriate multivariate model to explain the performance outcome. Be sure to convert to appropriate data types.
Run an appropriate multivariate model to explain the performance outcome. Report on which variables are significant and estimate the effects of the significant variables.
Determine and comment on the overall fit and goodness-of-fit of your model. Use this to make a comment on how ‘data driven’ you believe the evaluation process to be.
Perform appropriate checks on the assumptions of your model. What approach might you take if any of these tests fail?

16.4 Analyzing promotion differences between groups

As mentioned in the previous section, promotion is a more challenging outcome to analyze because it can happen at different times for different people or groups. Nevertheless, over certain time periods—usually several years—organizations will be interested in understanding what affects the likelihood of promotion among their employees.

It is particularly interesting to compare subgroups of employees that have common characteristics to see if there is a difference in their likelihood of promotion over a specified time period. This is highly analogous to the study of retention or attrition in that promotion can be considered a singular event that can happen to different individuals at different points in time. Like retention or attrition, we are not only interested in whether this event occurred as at the end of a period of time, we are also interested in when it occurred throughout the period.

16.4.1 The `promotion` data set

The promotion data set contains data on 1134 individuals who joined a retailer in an entry-level job, and tracks them for up to eight subsequent years post joining. For each individual the data records whether or not they were promoted, and if so in which year the promotion occurred, where the date of their joining is Year 0. Once promotion occurs, an individual is no longer tracked. If an individual was not promoted, then the year in which the last record occurred is captured.

Load the promotion data set via the peopleanalyticsdata package or download it from the internet⁵. The fields in the data are as follows:

diverse indicates whether or not the individual is a member of one of the organization’s diversity programs.
flexible indicates whether the individual worked on a part-time program for at least 6 months.
store indicates if the individual joined the company in a retail store position.
promoted indicates whether the individual was promoted.
year indicates the year in which the last record was made of the individual.

16.4.2 Discussion questions

What type of analysis is most appropriate to understand if there is a difference in promotion likelihood for employees who are on flexible hours, in diversity programs or who work in-store?
What type of illustration would you use to show whether each of these three factors individually have an effect on promotion likelihood?
What type of model would be most appropriate to determine the combined effects of all of these factors on promotion likelihood?
How would you go about determining whether any differences in promotion likelihood in a given group (such as the flexible working group) is mediated by membership of another group?
What assumptions would you need to check to validate that your analysis is trustworthy?

16.4.3 Data exercises

Run an exploratory data analysis to understand any general patterns of interest in the data.
Perform an analysis and generate an appropriate graph to illustrate the impact of flexible working on the likelihood of promotion. Determine if there is a statistically significant effect.
Repeat this analysis to determine the impact of diversity program membership.
Run an appropriate multivariate model to determine how all three variables of flexible working, diversity and in-store working affect the likelihood of promotion. Remember to check the assumptions of your model.
How would you explain your conclusions? Are there any corrective actions that this analysis might point to?

16.5 Analyzing feedback on learning programs

Assessing the effectiveness of learning programs remains one of the most challenging problems in people analytics. As with many challenging analytics problems, measurement is the key issue. It is exceptionally difficult to track and measure the impact of learning on the future day-to-day success of the individuals who have had access to it. While it makes sense to try to understand the influence of learning on important outcomes like employment, promotion or attrition, these outcomes can often be too distant, too generic in nature and too influenced by context and other factors to expect specific learning participation to show any meaningful influence on them.

Because of the challenges in objectively measuring the impact of learning, analysts often rely on the reaction and feedback of learning participants as an important measure of the success of learning programs. If the content of the program is known to be important to future work and related to future success, and if the participants report that the program was effective for them, then this can create a compelling argument for the success of the program.

16.5.1 The `learning` data set

The learning data set contains 4974 instances of feedback from 326 different participants in a range of learning programs offered by an executive education provider. Each row of data represents the feedback of a specific participant on a specific program that they participated in. Participants were not required to respond to all feedback questions and any question where no response was given is indicated as NA.

Load the learning data set via the peopleanalyticsdata package or download it from the internet⁶. The fields in the learning data set are as follows:

idcode is the unique ID of the participant.
rec is a binary value indicating whether the participant would recommend the program to others.
rel is a rating from the participant on the relevance of the program to their work, where 1 is Very Low and 5 is Very High.
fun is a rating on how enjoyable and fun the participant found the program, where 1 is Very Low and 5 is Very High
clar is a rating from the participant on the clarity of the content and teaching in the program, where 1 is Very Low and 5 is Very High.
home is a rating from the participant on the quality of the homework or project work in the program, where 1 is Very Low and 5 is Very High.
class is a rating from the participant on the quality of the overall class who attended the program, where 1 is Very Low and 5 is Very High.
fac is a rating from the participant on the quality of the program faculty and instructors, where 1 is Very Low and 5 is Very High.

16.5.2 Discussion questions

What kind of outcome is the rec column and which kind of model best suits this outcome type?
Describe the nature of the hierarchy in this data set.
Describe what question we would be answering if we ignored the hierarchy in modeling what influences rec.
Describe what question we would be answering if we considered the hierarchy in modeling what influences rec.
What kind of model would you use to explicitly consider this hierarchy in modeling what influences rec? What kinds of parameters would you experiment with in running this model?
Describe what you would expect to see in the output of a model that considered the hierarchy in modeling what influences rec.

16.5.3 Data exercises

Prepare and run a model to determine which elements of feedback influence whether or not the program will be recommended to others.
Prepare and run a separate model to determine which elements of feedback influence a participant in deciding if they would recommend a program to others.
Experiment with the model from Data Exercise 2 by adjusting which parameters you model at the participant level.
Describe the different outputs of your models from Data Exercises 2 and 3 and how to interpret them.
Compare the outputs of your models from Data Exercises 2 and 3 to those from Data Exercise 1. How might your conclusions differ between the two modeling approaches?

https://github.com/fivethirtyeight/data↩︎
http://peopleanalytics-regression-book.org/data/graduates.csv↩︎
http://peopleanalytics-regression-book.org/data/recruiting.csv↩︎
http://peopleanalytics-regression-book.org/data/employee_performance.csv↩︎
http://peopleanalytics-regression-book.org/data/promotion.csv↩︎
http://peopleanalytics-regression-book.org/data/learning.csv↩︎