Wordlists packaged and shared among researchers to measure bias in online texts often contain words, or “seeds,” with built-in biases and stereotypes, which could skew their conclusions, according to new Cornell research.
For example, the presence of the seed term “mom” in a text analysis exploring gender in domestic work would skew the results toward women.
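To make the problem concrete, here is a minimal sketch (with hypothetical seed lists and an invented passage, not the paper's actual data) of how a single gender-loaded seed like “mom” can tip a simple lexicon-based count:

```python
# Toy illustration: one biased seed term skews a lexicon-based gender count.
from collections import Counter

# Hypothetical seed sets; note "mom" is also a domestic-work term.
female_seeds = {"she", "her", "woman", "mom"}
male_seeds = {"he", "his", "man", "dad"}

def gender_score(text):
    """Return (female seed hits, male seed hits) in the text."""
    counts = Counter(text.lower().split())
    f = sum(counts[w] for w in female_seeds)
    m = sum(counts[w] for w in male_seeds)
    return f, m

# A passage about domestic work that uses "mom" generically.
passage = "every mom and dad shares chores but the mom label dominates chore ads"
print(gender_score(passage))  # (2, 1): the passage reads as female-skewed
```

Because “mom” belongs to both the gender lexicon and the domestic-work vocabulary, any text about domestic work registers as female-associated by construction.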
“We need to know what biases are encoded in the models and the datasets. Our paper takes a step back and takes a critical look at the measurement tools themselves,” said Maria Antoniak, a doctoral student and first author of “Bad Seeds: Evaluating Lexical Methods for Bias Measurement,” presented in August at the joint meeting of the Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing.
“What we find is that there can be biases there, too. Even the tools we use to take these measurements are constructed in particular ways,” Antoniak said.
Antoniak co-wrote “Bad Seeds” with her advisor, David Mimno, associate professor in the Department of Information Science in the Cornell Ann S. Bowers College of Computing and Information Science.
“The seeds can contain biases and stereotypes, especially when they are packaged for future researchers to use,” she said. “Some seeds are undocumented or buried deep in the code. If you just use the tool, you’ll never know.”
In the digital humanities and the broader field of natural language processing (NLP), researchers use computational power to analyze written language, mining thousands of digitized volumes and millions of words to find patterns that inform a wide range of studies.
It is through this type of computational analysis that Cornell’s digital humanities and NLP researchers have learned about gender bias in sports journalism, the remarkable skill of ancient Greek writers in emulating their predecessors, the best approaches for calming a person who reaches out to a crisis text line, and the culture that informed British fiction in the late 19th and early 20th centuries.
In previous research, Antoniak analyzed birth stories posted online and uncovered new parents’ feelings of helplessness in the delivery room. More recently, in a paper published at this month’s ACM Conference on Computer-Supported Cooperative Work and Social Computing (CSCW), she and her Cornell co-authors analyzed an online book-review community to understand how users refine and redefine literary genres.
This type of text analysis can also be used to measure bias across an entire digital library, or corpus, whether it’s all of Wikipedia, for example, or the collected works of Shakespeare. To do this, researchers use lexicons, or banks of source words and terms. These lexicons are not always vetted: some are crowdsourced, hand-selected by researchers, or drawn from previous research.
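One common way such seed lexicons are used is to compare a target word’s similarity to two opposing seed sets in a word-embedding space. The sketch below illustrates the idea with tiny hand-made 2-D vectors standing in for real embeddings trained on a corpus; the vectors, words, and function names are all illustrative assumptions, not the paper’s method:

```python
# Hedged sketch: seed-set association via cosine similarity, using toy
# hand-made "embeddings" (axis 0 ~ female-associated, axis 1 ~ male-associated).
import math

def cos(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

emb = {
    "she": (0.9, 0.1), "her": (0.8, 0.2),
    "he": (0.1, 0.9), "him": (0.2, 0.8),
    "laundry": (0.7, 0.3),  # hypothetical target word from a domestic-work study
}

def bias(target, seeds_a, seeds_b):
    """Mean similarity to seeds_a minus mean similarity to seeds_b."""
    t = emb[target]
    sim_a = sum(cos(t, emb[w]) for w in seeds_a) / len(seeds_a)
    sim_b = sum(cos(t, emb[w]) for w in seeds_b) / len(seeds_b)
    return sim_a - sim_b

score = bias("laundry", ["she", "her"], ["he", "him"])
print(score > 0)  # True: "laundry" sits closer to the female seed set
```

The measured bias is entirely relative to the chosen seed sets, which is exactly why unvetted or undocumented seeds can quietly shape a study’s conclusions.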
Antoniak’s motivation to look for bias in lexicons came after she saw wobbly results in her own research when using an online lexicon of seed terms.
“I trusted the words, which came from trusted authors, but when I looked at the lexicon, it wasn’t what I expected,” Antoniak said. “The original researchers may have done a fabulous job validating their seeds for their datasets. That doesn’t mean you can just pick them up and apply them to the next case.”
As detailed in “Bad Seeds,” the seeds used to measure bias may themselves carry cultural and cognitive biases.
“The aim is not to undermine the results, but to help researchers think about the potential risks of the seed sets used for bias detection,” she said. “Investigate them and test them yourself to make sure the results are trustworthy.”
As part of her findings, Antoniak recommends that digital humanities and NLP researchers trace the origins of seed sets, examine and test them manually, and document all seeds and the rationale behind them.
This research is supported by the National Science Foundation.
Louis DiPietro is a communications specialist at Cornell Ann S. Bowers College of Computing and Information Science.