Quantify Systematics from Mislabeled Tables in Supervised Learning
Many real-world classification problems use ground-truth labels created by human annotators. However, observed data is never perfect, and even labels assigned by perfect annotators can be systematically biased due to the poor quality of the data they are labeling. This bias does not arise from annotator measurement error; it is intrinsic to the observational data. We present a method for de-biasing labels that simultaneously learns a classification model, estimates the intrinsic biases in the ground truth, and provides new de-biased labels. We test our algorithm on simulated and real data and show that it is superior to standard denoising algorithms, such as instance-weighted logistic regression. We apply our technique to galaxy images and find that morphologies from supervised machine learning trained on features such as color, shape, and concentration show significantly less bias than morphologies from expert or citizen-science classifiers. This result holds even when there is underlying bias in the training sets used in the supervised machine learning process.
Chris Miller is a leader in astroinformatics, which mixes computer science, advanced statistics, and data mining to answer key cosmological questions. His specialty is using galaxy clusters to trace the distribution of matter in the universe. After years exploiting the Sloan Digital Sky Survey, he is now heavily involved in the Dark Energy Survey and Dark Energy Spectroscopic Survey, two of the largest current astronomical survey efforts. Professor Miller used his galaxy-cluster research to support the Big Bang theory by aligning findings from opposing cosmological epochs. He was the first to see the signatures of sound waves from the very early universe that were "frozen into" the matter-density distribution that we observe today. His analysis of the current universe synced neatly with the acoustic oscillations of the early universe detected in the cosmic microwave background, and demonstrated the power of combining big-survey data with focused observational follow-up data. He has published in a variety of venues outside his own fields of physics and astronomy, including NIPS, ICPR, The Annals of Applied Statistics, and Statistical Science.