Lu Wang earns CAREER Award to summarize long text with machine learning
Wang hopes that, by summarizing longer documents, she can make a new class of information more accessible to a variety of audiences.
Tired of skimming a long report or textbook trying to find the point? A new NSF-funded project aims to make customizable summaries of long documents to help different audiences get what they need at a glance. Prof. Lu Wang has earned an NSF CAREER Award for the project, which will tackle problems of a new scale in the area of text summarization.
“My long-term goal is to dismantle the barrier that information overload presents to knowledge acquisition,” says Wang. “We have so much to learn, but so little time. How can we learn more efficiently?”
To this point, document summarization research has been restricted to short documents and rigid, one-size-fits-all summaries. Wang hopes that, by tackling longer works, she can make a new class of information, including technical papers, financial reports, and even books, more accessible to a variety of audiences. The ability of users to customize their summaries will be central to this solution, allowing users with different technical proficiencies to request the level of detail that matches their interest in the topic.
The resulting machine learning model should also allow tweaks to the language style, Wang says, while still capturing the style of the original text. For example, “when you talk to a person who has a college degree and has learned a lot about machine learning, the way you talk to them will be very different because they know the terminology.”
Wang faces several key challenges along the way. One major obstacle to summarizing long documents is a lack of training data – models need to learn from pairs of documents and pre-written summaries, which take much longer to produce for long documents than for short news articles. The training data also has to be high quality for the customization goal to work intuitively, so that the resulting outputs are coherent and connected to the longer source material. To overcome this, Wang intends to design a model that can learn effectively from a smaller training sample.
The resulting model should also be more widely applicable to other problems in natural language processing, according to Wang.
Wang intends to make use of this project’s educational applications as a part of the CAREER grant. The summarization tools developed, Wang says, can help students with reading comprehension by automatically providing a text’s key points, help them more effectively summarize their learning by providing examples, and help higher-level academics quickly learn the writing style of a new subject.
As part of her grant, Wang also intends to expand her efforts to recruit and mentor a diverse group of STEM students, through organizations such as Women in Science and Engineering, Girls Encoded, and M-Engin.