$1M NSF grant supports new system for gathering, structuring data with ease

The team's new tool will combine software and data to make gathering structured data dramatically easier.


Massive collections of structured data, called knowledge networks, help pull together the information generated in day-to-day life into a useful dataset for research. Projects like Wikidata, which stores the structured data of its Wikimedia sister projects including Wikipedia, Wikivoyage, Wiktionary, and Wikisource, abstract the knowledge of the web into highly structured information about a very wide range of topics.

Knowledge networks have made many new and compelling applications possible, such as structured search engine results and voice assistants. But the sheer size of these networks means a great deal of work goes into them: the most useful networks have been very difficult and expensive to construct, which sets a high bar for new ones that cover different topics.

Researchers at the University of Michigan believe this process doesn’t need to be so burdensome. With an NSF grant from the organization’s Convergence Accelerator program, Prof. Mike Cafarella intends to create a combination of software and data that should make novel knowledge network systems dramatically easier to produce. The project, “Simultaneous Knowledge Network Programming and Extraction,” was funded as part of the Accelerator’s Harnessing the Data Revolution track.

The team plans to work on developing new, faster knowledge networks for a variety of specific fields of research. In their first effort, they’ll focus on economics. With a proposed integrated knowledge network and tool system, Cafarella believes they have the potential to dramatically improve the ease of performing higher-quality economic measurement and analysis.

The knowledge network tool could improve economic research that benefits national prosperity, and could broaden understanding of economic phenomena and of data as it is gathered.

“The even greater value of the effort,” Cafarella says in his proposal, “will be a tool that allows knowledge networks on any topic to be developed more easily and with less programming expertise.”

If successful, this toolset could readily be abstracted to make gathering structured data easier for nearly any field of study.

“Although knowledge networks are thought to be key to future data-enabled discovery,” the proposal continues, “knowledge network-driven applications have generally not been developed using a reproducible system.”

To rectify this, the project will build a knowledge application development system that should make knowledge applications easier to write, existing knowledge networks easier to improve, and entirely novel knowledge networks easier to construct. The team’s effort builds on a novel and extremely succinct form of programming they have developed, which allows simultaneous programming and extraction of relevant information that contributes to a knowledge network. The proposed simultaneous programming and extraction system will help construct knowledge networks and will also improve their data quality by providing additional weak supervision for the information extraction pipelines commonly used to produce them.
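
As a rough illustration of what weak supervision means for an information extraction pipeline, the Python sketch below combines several noisy heuristic rules (often called labeling functions) by majority vote to label candidate sentences without hand annotation. It is not the team's system. The rule names, keywords, and example sentences are invented for illustration only.

```python
from collections import Counter
from typing import Callable, List, Optional

# Hypothetical task: does a sentence report an economic measurement?
# Each rule returns True, False, or ABSTAIN (no opinion).
ABSTAIN = None


def lf_has_percent(sentence: str) -> Optional[bool]:
    """Vote True if the sentence mentions a percentage."""
    text = sentence.lower()
    return True if "%" in text or "percent" in text else ABSTAIN


def lf_mentions_indicator(sentence: str) -> Optional[bool]:
    """Vote True if the sentence names a common economic indicator."""
    keywords = ("unemployment", "inflation", "gdp")  # illustrative list
    return True if any(k in sentence.lower() for k in keywords) else ABSTAIN


def lf_is_question(sentence: str) -> Optional[bool]:
    """Vote False for questions, which rarely state a measurement."""
    return False if sentence.strip().endswith("?") else ABSTAIN


LABELING_FUNCTIONS: List[Callable[[str], Optional[bool]]] = [
    lf_has_percent,
    lf_mentions_indicator,
    lf_is_question,
]


def weak_label(sentence: str) -> Optional[bool]:
    """Combine the rules by majority vote; return None on no votes or a tie."""
    votes = [v for v in (lf(sentence) for lf in LABELING_FUNCTIONS)
             if v is not ABSTAIN]
    if not votes:
        return None
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # rules disagree evenly, so leave the example unlabeled
    return ranked[0][0]


if __name__ == "__main__":
    for s in ("Unemployment fell to 4 percent last quarter.",
              "Is inflation rising?",
              "The committee met on Tuesday."):
        print(repr(s), "->", weak_label(s))
```

Labels produced this way are noisy, but they are cheap to generate in bulk and can supplement hand-labeled examples when training an extractor. A production pipeline would typically weight the rules by estimated accuracy instead of counting votes equally.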

The system will be tested on real data and users in the field of economics, but the methods and tools will not be topic-specific. Cafarella believes his new system will be widely applicable to knowledge networks in many topical domains.
