This article was originally published on Medium on July 26, 2018.
Artificial Intelligence and machine learning seem to be perennially in the eye of a storm, with new cases of unintended or ill-designed negative outcomes emerging nearly every week. A groundswell of actors — academics, activists, and professionals in the community — are coming together to try to prevent harm to citizens individually, and to society collectively. There is the Fairness, Accountability and Transparency conference, the Partnership on AI industry initiative, Accenture’s new AI ethics toolkit, and much more. AI training conferences are beginning to include ethics panels.
Much of the focus on remedies necessarily comes from within the data science practitioner community. Less attention has been paid to ways of introducing tomorrow’s data scientists and analysts, graduating with advanced degrees from our leading universities, to the ethical dimensions of their work. To further this goal, the Markkula Center for Applied Ethics at Santa Clara University (SCU) recently conducted a pilot experiment in a machine learning class taught at the Leavey School of Business.
The goal and the opportunity
Our goal was to work with a professor and expert on machine learning, Sanjiv Das, and pilot an experiment in a mainline class with real-world use cases, i.e., big datasets from the financial, consumer, or criminal justice sphere, rich with both prediction possibilities and ethical concerns. Das is William and Janice Terry Professor of Finance and Business Analytics at Leavey School of Business.
There is an interesting opportunity lurking in graduate-level data science classes in universities. Theory and practice converge in a setting where experiments on models can be run easily. Students use R or Python systems to set up and execute machine learning pipelines all the way from data preparation to prediction and analysis, and draw comparisons across a range of “competing” models. While industry-level compute power and data engineering producing hundreds of millions of real-time predictions may be missing, the fundamentals of training data preparation, model development, and validation are very real for students.
Cover slide of one of the student presentations done on June 14th 2018, at the end of the machine learning class. Credit: Chungfeng Wu, Joohyeun Yi, Katharine Grant, Ming-Chang Chiang, Yuexi Li.
The approach, the datasets, and the ethics context
As part of our pilot, we reached out to ProPublica’s data store and identified three datasets. The idea was to find datasets representative of known or very likely areas of bias. We selected datasets on mortgage lending, small business lending, and criminal recidivism:
1. Federal Home Mortgage Loan data going back two decades, with roughly 11.8 million records. The ethical context of this database is that it may be a record of systematic bias in how different types of people are treated when applying for a mortgage. Discovery of bias here would indicate not only that unethical decisions had been made systematically in the past, but also that they could possibly be corrected for in the future.
2. The infamous COMPAS criminal recidivism dataset used in Florida’s Broward County for decisions about the pre-trial release of criminal defendants, based on scoring their likelihood of reoffending. The ethical context here is clear — as ProPublica has reported, in the COMPAS system, people were treated unfairly based on characteristics that aligned with race more than with their actual potential for recidivism. As with mortgage and business loans, though more directly affecting people’s bodies and freedom, such unfair biasing creates a systematically unjust world where people are not treated equally and lives can be ruined for no reason at all.
3. The Small Business Administration’s loan approval dataset, also going back two decades, with 1.3 million records. As with the mortgage database, discovery of bias here would indicate systematic past wrongdoing, and perhaps then open up the possibility of correcting this wrong for the future.
We offered the students the choice of one of these three datasets for their project. For each dataset, we gave them descriptions with additional language contextualizing the ethics concerns in terms of bias in the training data and fairness in predictions.
Das sized up the opportunity this way: “In each of these three areas, prediction (classification) models can be trained to decide who to grant loans or parole. It is well-known that these models are biased because they are trained on historical data, where race, age, and gender were a large part of the decision modeling process. Machine learning models learn to use these data variables to make decisions and perpetuate the original biases in the data. Removing these variables makes the models more fair but also less accurate in prediction. Therein lies the fairness-accuracy tradeoff”.
Given this tradeoff, is there a way to improve data analysis and reduce bias while still retaining accuracy? This is where a new, promising tool comes in; we introduced the students to Themis-ml, an open-source, fairness-aware machine learning Python library developed by Niels Bantilan and demonstrated at the Fairness, Accountability and Transparency conference in 2018. Themis-ml lets you modify your machine learning pipeline at various stages for discrimination detection and fairness in predictions. Bantilan describes an ML pipeline as consisting of five steps: “Data ingestion, data preprocessing, model training, model evaluation, and prediction generation on new examples”.
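To make those five stages concrete, here is a minimal sketch of a conventional pipeline in plain scikit-learn, with comments marking Bantilan’s steps and the points where a fairness-aware intervention, such as those Themis-ml provides, could be hooked in. This is an illustration only, not the library’s own API; the file name, column names, and model choice are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1. Data ingestion (hypothetical file and columns, assumed numeric)
df = pd.read_csv("loans.csv")
y = df["approved"]                 # favorable outcome: loan approved
s = df["protected_group"]          # sensitive attribute, e.g. race or gender
X = df.drop(columns=["approved"])  # note: historical features often include s

# 2. Data preprocessing -- a fairness-aware step here could relabel or
#    reweight training examples so the target is less correlated with s
X_tr, X_te, y_tr, y_te, s_tr, s_te = train_test_split(X, y, s, random_state=0)

# 3. Model training -- or swap in a fairness-constrained estimator
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# 4. Model evaluation -- report fairness metrics (e.g. the mean difference
#    between groups, shown further below) alongside accuracy
print("accuracy:", accuracy_score(y_te, model.predict(X_te)))

# 5. Prediction generation on new examples -- a post-processing step here
#    could adjust individual predictions to shrink group-level disparities
predictions = model.predict(X_te)
```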
Ethically, this opportunity to detect bias and determine what needs to be done to remove it gives data scientists the potential to correct for major systemic injustices that exist in our world.
Eight student groups — roughly thirty-five students out of a class of 70+ — chose to work on the three datasets. Of these, the three groups that selected the Small Business Loans dataset quickly found that the data lacked the features needed to warrant an ethics analysis, in particular demographics of the loan applicants. They decided to proceed with the dataset as they would with any other and did a conventional ML project. Five groups selected the other datasets and proceeded with the ethics angle.
What the students attempted from an ethics perspective
Students began by exploring the datasets and drafting their central inquiry question. Crucial to this was the fairness vs. accuracy tradeoff across different models and their modifications.
“They used Themis-ML to measure the extent of fairness/bias in prediction models, and then correct the bias using different approaches such as adjusting the data or the eventual predictions”, said Das. “Their goal was to examine the change in fairness and accuracy of the models from applying bias corrections, and hopefully, achieve a result where huge improvements in fairness come from minor degradations in model accuracy.”
In the process, students used the same vocabulary and metrics that machine learning professionals use to evaluate their data and models. They also identified a useful metrics vocabulary that helps connect the ethics and data worlds. For instance, ‘mean-difference’ emerges as a measure of group-level discrimination, and ‘consistency’ as a measure of individual-level discrimination (Section 4.5 of the Themis-ml paper).
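Both metrics are simple enough to state in code. Purely as an illustration (not the students’ code or the library’s exact API), here is a small sketch following the definitions in the Themis-ml paper: mean difference compares positive-outcome rates across groups, while consistency asks whether individuals with similar features receive similar outcomes.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mean_difference(y, s):
    """Group-level discrimination: P(y=1 | s=0) - P(y=1 | s=1), where s=1
    marks the protected group and y=1 is the favorable outcome."""
    y, s = np.asarray(y), np.asarray(s)
    return y[s == 0].mean() - y[s == 1].mean()

def consistency(X, y, k=5):
    """Individual-level measure: 1 minus the average gap between each
    outcome and the mean outcome of its k nearest neighbors in feature
    space (1.0 means similar individuals are treated identically)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    return 1.0 - np.abs(y - y[idx[:, 1:]].mean(axis=1)).mean()

# Toy example: a 0.75 approval rate for the advantaged group versus 0.25
# for the protected group gives a mean difference of 0.5.
y = [1, 1, 1, 0, 1, 0, 0, 0]
s = [0, 0, 0, 0, 1, 1, 1, 1]
print(mean_difference(y, s))  # 0.5
```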
In other words, their findings and projects could be expected to stand up to scrutiny in any normal presentation on machine learning work. They presented their findings on the last day of class in a marathon presentation session along with all the other machine learning projects that did not have an ethics angle.
What the students found
On home mortgages, students found that fairness-aware approaches such as relabeling can help reduce bias against race and ethnicity while maintaining accuracy. They also identified an opportunity to reduce bias and actually improve accuracy, an important discovery.
On criminal recidivism, students found the Themis-ml package was relatively effective in mitigating bias. In one model, they found a substantial reduction in discrimination on recidivism scores (mean difference), with very little corresponding loss in accuracy.
Cover slide of another student presentation on June 14th 2018, at the end of the machine learning class. Credit: Priyanka Pandey, Behrang Zandi, Elva Shen, Gohar Eloyan.
Across the ethics-focused projects there was a commonality. For each dataset, there seem to be specific model and sensitive-variable combinations that perform better on the fairness vs. accuracy tradeoff line, i.e., where fairness is improved without compromising accuracy.
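To show what landing well on that line can look like, here is a hedged, self-contained sketch on synthetic data (no real loan or recidivism records). It trains a baseline classifier on deliberately biased labels, then retrains after a simple relabeling of the training data in the spirit of the pre-processing corrections the students applied (re-implemented from scratch here, not via Themis-ml), and prints accuracy next to mean difference for both.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic "historical" data with an injected bias: the protected group
# (s = 1) receives fewer favorable outcomes than its features warrant.
n = 4000
s = rng.integers(0, 2, n)
x = rng.normal(size=(n, 3))
y = ((x @ np.array([1.0, 0.5, -0.5]) - 0.8 * s
      + rng.normal(scale=0.5, size=n)) > 0).astype(int)
X = np.column_stack([x, s])   # the sensitive attribute sits in the features

def mean_difference(y, s):
    """Positive-rate gap between the advantaged and protected groups."""
    return y[s == 0].mean() - y[s == 1].mean()

def relabel(X, y, s):
    """Relabeling sketch: flip just enough training labels, chosen by a
    ranker's scores, to equalize positive rates across groups before the
    downstream model is retrained."""
    p = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]
    y = y.copy()
    flips = int(round(mean_difference(y, s) * (s == 0).sum() * (s == 1).sum() / len(y)))
    if flips > 0:
        up = np.where((s == 1) & (y == 0))[0]     # protected, negative label
        down = np.where((s == 0) & (y == 1))[0]   # advantaged, positive label
        y[up[np.argsort(-p[up])[:flips]]] = 1     # promote highest-scoring
        y[down[np.argsort(p[down])[:flips]]] = 0  # demote lowest-scoring
    return y

X_tr, X_te, y_tr, y_te, s_tr, s_te = train_test_split(X, y, s, random_state=0)
models = {
    "baseline ": LogisticRegression(max_iter=1000).fit(X_tr, y_tr),
    "relabeled": LogisticRegression(max_iter=1000).fit(X_tr, relabel(X_tr, y_tr, s_tr)),
}
for name, model in models.items():
    pred = model.predict(X_te)
    print(name, "accuracy:", round(accuracy_score(y_te, pred), 3),
          " mean difference:", round(float(mean_difference(pred, s_te)), 3))
```

In a setup like this, one would expect the relabeled model’s mean difference to shrink at a modest accuracy cost measured against the biased original labels, which is the kind of comparison the student groups were asked to make.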
What the students shared with us about their learnings
We briefly surveyed the students after their presentations. We asked them what surprised them most, what surprised them least, and how they might advise fellow data science professionals who have not had this exposure. We received a range of reactions:
Reflecting on adjusting the sample proportions of black and non-black offenders in the criminal recidivism dataset, Behrang Zandi said, “Once we forced an equal number of black/nonblack (offenders) in the sample, the results turned more favorable towards black.”
“Some fairness-aware approaches work really well. I always thought that there is a tradeoff between fairness and accuracy, and I am surprised that that’s not always the case”, said Ka Yan Yeung, who worked on home mortgage data.
Yeung also pointed out that credit scores were not disclosed as part of the federal dataset due to privacy and legal considerations, but are relied upon by banks for mortgage loan approvals. “The argument that there’s bias in the loan approval process is weakly supported without credit score data,” she added, as a caveat. Yet credit scores might also reflect bias.
“Using mean difference is a great way to determine differences among different groups. This is important for then determining the disadvantaged vs advantaged class”, said Kailin Hu, who also worked on the home mortgage data.
“Understanding why we care about bias and how the different techniques try to remove it from your predictions” matters, said Leo Barbosa, who worked on the criminal recidivism dataset. He added that eliminating bias is likely to be important in a legal and compliance context for organizations.
What the faculty felt
Das says that several learning outcomes emerged from the exercise, not all of which were anticipated at the outset. He summarized his detailed take in five points.
“First, the machine learning class contains an important new element, i.e., the consideration of data ethics. Whereas only five of the 25 class projects were about machine bias and fairness issues, these projects were among the better ones. As of now, we know of no ML courses that include an ethics component that is rigorous and statistically based, so this constitutes a unique offering.
“Second, we learnt that technical skills in data science may be rigorously applied to issues in public policy and ethics, a great example of how big data brings computer science and ethical science together.
“Third, students saw very clearly that data science is about people and solving first-order problems. The role of data science in the human domain is inextricably and unavoidably woven with ethical matters.
“Fourth, the timing of the project was fortuitous, in the arena of internet manipulation and fake news, especially around local tech companies like Facebook and Google.
“Fifth, students also anticipated that skills in the area of fairness-aware machine learning might lead to interesting new job opportunities in tech firms as they seek data science staff with training in ethics.”
Overall, said Das, the learning outcomes were much higher than anticipated, and this was a great example of the benefits of interdisciplinary collaboration.
What the Markkula Center learnt
These students now have a direct understanding of ways to detect and correct for biases found in datasets. This is crucial for the integrity of the data analysis process itself, and it also provides insight into the operations of the systems providing the data. In this way, their knowledge can correct for bias they detect at their level, but it also offers the possibility of returning to the sources of the data and helping them correct their operations to eliminate this bias at the point of origin.
These students not only demonstrated to themselves that datasets record unjust biases, but also shared this understanding with their classmates.
On July 6th, we met for an informal lunch at the Markkula Center with some of the students and tabulated our findings from their post-class survey responses. We had asked them 1) how they might use their learnings, 2) what surprised them, 3) what did not surprise them, and 4) what insight they might share with fellow data science professionals who may not have had this exposure.
Students who are better able to recognize the big-picture aspects of their work are more able to perceive ethically relevant situations, think about how to solve them, and lead effective responses to these problems. The students who took this course are now aware of something that many data science students are not: that data can record and perpetuate serious injustices, and that through careful consideration those injustices may be accounted for in analysis and perhaps even, with further effort in the real world, corrected in reality.
This experiment also demonstrated the nature of systematic bias in many ethically significant datasets. This evidence should give us the motivation to try to fix the systems that are creating and perpetuating these injustices. Rather than being a tool that perpetuates bias and covers it up with a layer of math, ML can serve as a way to clearly and unequivocally expose, and possibly correct, historical injustice. The next question is whether we will take the initiative to act on this information.
Connecting back to the Markkula Framework for ethical decision making
It turns out there is a direct and natural tie-in between the Markkula Center for Applied Ethics’ Framework for Ethical Decision Making and the work the SCU students actually did. Here is the framework as a set of bullets, and there is more here.
- Recognize an Ethical Issue
- Get the Facts
- Evaluate Alternative Actions
- Make a Decision and Test It
- Act and Reflect on the Outcome
In our experiment, we tipped off the students that ethically relevant things might be going on in the datasets, and then they stepped in to establish the facts, based on their ML analyses. They could then think about solutions to the problem and evaluate those possible actions. They made decisions in their projects about how to deal with the historical data, such as potentially correcting factors to reduce bias. Lastly, while their “actions” were only on a dataset in class, their work led to their own reflections on these clear injustices, as well as to sharing this knowledge with their fellow students.
Caveats
This is merely a beginning from the vantage point of introducing ethics-based constraints into machine learning projects taught in classes. It is only one experiment; much work remains to be done, both in the educational context and in the actual application of AI/ML in various social contexts.
Conclusion
With interested and expert faculty, a mainline machine learning class is easily amenable to ethics experimentation. “Ethics” is a broad and sweeping term, and in machine learning projects, localizing the vocabulary matters. Thus, for us, the framing and terminology used in the Themis-ml package seemed an easier entry point for hands-on, measurable work. It builds confidence in introducing ethical analysis into a machine learning pipeline.
This approach — giving specific dataset projects with anticipated bias issues and a little ethics context and guidance — was well received by the students. An ongoing supply of such datasets and guidance might be useful for future classes.
“We would consider this very much as an experiment and encourage other data science professors to try it out and send us feedback,” says Irina Raicu, director of Internet Ethics at the Markkula Center, who attended the student presentations along with the authors. Raicu wrote her own perspective on this class, which you can find here.
Subramaniam Vincent and Brian Patrick Green
Subramaniam Vincent is Tech Lead for the Trust Project at the Markkula Center for Applied Ethics. Brian Patrick Green is Director for Technology Ethics at the center.
Irina Raicu provided inputs for the project and editorial suggestions for this article.
Credits
- Niels Bantilan, author of the Themis-ml fairness-aware machine learning library package.
- Santa Clara University students: Ka Yan Yeung, Sydney Akers, Kailin Hu, Jiying Lu, Chunfeng Wu, Joohyeun Yi, Katharine Grant, Ming-Chang Chiang, Yuexi Li, Priyanka Pandey, Behrang Zandi, Elva Shen, Gohar Eloyan, Leo Barbosa, and Carlos Rengifo.
- ProPublica Data Store, for giving us academic licenses to two datasets (Home Mortgage Disclosure Act, Small Business Loans) compiled in collaboration with IRE and NICAR.
References
- Themis-ml: A Fairness-aware Machine Learning Interface for End-to-end Discrimination Discovery and Mitigation, Niels Bantilan.
- An Introduction to Data Ethics, Teaching module, Shannon Vallor, William J. Rewak, S.J. Professor of Philosophy, Santa Clara University.
Further Reading
- Fairness in Machine Learning: Lessons from Political Philosophy, Reuben Binns, Department of Computer Science, University of Oxford, United Kingdom, Proceedings of Machine Learning Research, Conference on Fairness, Accountability, and Transparency.
- Fairness and Machine Learning: Limitations and Opportunities, Solon Barocas, Moritz Hardt, Arvind Narayanan. (Open textbook, work in progress, requesting feedback.)