Disambiguation of Entity Association Statements
Navy SBIR 2010.2 - Topic N102-176
ONR - Mrs. Tracy Frost - [email protected]
Opens: May 19, 2010 - Closes: June 23, 2010

N102-176 TITLE: Disambiguation of Entity Association Statements

TECHNOLOGY AREAS: Information Systems, Human Systems

ACQUISITION PROGRAM: PM Intel

OBJECTIVE: Advances have been made in our ability to express large, disparate unstructured data sources (e.g. text, images, audio) as connected entity graphs in the Resource Description Framework (RDF). There remain practical problems, however, in working with the large RDF data stores that can easily be generated from even modest-sized data stores. Due to entity and association uncertainty, current implementations of RDF data stores become filled with redundant statements, preventing the expression of a large data corpus as one connected graph. The objective of this topic is to develop algorithms and techniques for level 1 fusion of association statements in a large RDF data store.

DESCRIPTION: There are a number of technologies and systems today that perform named entity disambiguation in support of entity extraction on structured and unstructured data. Reference sets are used to resolve ambiguity and return results. Other implementations may be as basic as providing a list of possible results and leaving the ambiguity to be resolved by the information seeker, as is done in Wikipedia. It has proved beneficial to anchor terms in a recognized vocabulary and ontology. For instance, WordNet offers a lexical database of words for the English language. This topic seeks to tailor or develop disambiguation algorithms that can be effectively applied to large RDF graphs.

The objective of a large RDF data store is to enable the warfighter or analyst to find everything known about a specific entity (person, group, place, object, event/behavior) rapidly and accurately. The search strategies are dependent on a level of clustering of related RDF that has not to date been demonstrated. Redundant entities and associations cause broken connections.

The goal of this topic is to develop and demonstrate a new class of level 1 fusion (disambiguation) algorithms that can be applied to large RDF data stores. Offerors may examine tagging RDF if that is found to support the overall objective. The offeror can also use information contained within the triple itself.

Research is needed to expand entity disambiguation concepts into the domain of large association (RDF) data stores. The offeror needs to assume that an RDF data store of interest is populated with disparate statements derived from a wide variety of data stores. The goal of the topic is to generate, from a large RDF data store, a single connected graph that contains no redundant entities and no missed connections. In order to achieve the necessary information refinement on a topic and support evolution of RDF knowledge bases, examples of areas which need to be addressed include: 1) dealing with entity uncertainty, 2) dealing with entity information from different knowledge bases that results in a contradiction, 3) creation and updating of statements regarding an entity in a knowledge base or common feature space that do not contradict existing statements on that entity, and 4) deletion of an entity or entity statements without breaking other associations that may refer to that entity.
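As an illustration only, the merging of redundant entities described above can be sketched in Python as follows. The sketch assumes that some upstream disambiguation step has already produced pairs of identifiers judged to denote the same entity; how those pairs are found is the research problem and is not shown. Triples are modeled as plain (subject, predicate, object) tuples, and the `ex:` identifiers, `UnionFind`, and `merge_entities` are hypothetical names chosen for this example, not part of the topic.

```python
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


class UnionFind:
    """Groups identifiers that were judged to denote the same entity."""

    def __init__(self) -> None:
        self.parent: Dict[str, str] = {}

    def find(self, x: str) -> str:
        # Unknown identifiers start out as their own representative.
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node directly at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a: str, b: str) -> None:
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            # Keep the lexicographically smaller identifier as the
            # canonical one so the result is deterministic.
            if rb < ra:
                ra, rb = rb, ra
            self.parent[rb] = ra


def merge_entities(triples: Iterable[Triple],
                   same_as: Iterable[Tuple[str, str]]) -> List[Triple]:
    """Rewrite triples onto canonical identifiers and drop duplicates."""
    uf = UnionFind()
    for a, b in same_as:
        uf.union(a, b)
    seen: Set[Triple] = set()
    merged: List[Triple] = []
    for s, p, o in triples:
        # Assumption: objects are entity identifiers; a literal that is
        # not in any equivalence pair is returned unchanged by find().
        t = (uf.find(s), p, uf.find(o))
        if t not in seen:
            seen.add(t)
            merged.append(t)
    return merged


# Hypothetical data: two spellings of one person.
triples = [
    ("ex:J_Smith", "ex:memberOf", "ex:GroupA"),
    ("ex:John_Smith", "ex:memberOf", "ex:GroupA"),
    ("ex:John_Smith", "ex:locatedIn", "ex:CityX"),
]
same_as = [("ex:J_Smith", "ex:John_Smith")]
merged = merge_entities(triples, same_as)
```

In this toy case the two `ex:memberOf` statements collapse into one and the `ex:locatedIn` statement is reattached to the canonical identifier. Contradictions between sources and safe deletion of entities, also listed above, would need additional logic that this sketch does not attempt.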

The Navy is interested in innovative R&D that involves technical risk. Proposed work should have technical and scientific merit. Creative solutions are desired.

PHASE I: Develop algorithms that can identify redundant statements and missed connections in a large RDF data store. Measure and show clear progress in RDF statement disambiguation and in fixing missed connections. Perform a proof of concept against a data store containing tens of thousands of statements. Results from the model development and tests are to be documented in a technical report and presented at a selected conference.
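One possible way to measure progress toward a single connected graph, offered here only as a sketch, is to count the connected components of the entity graph before and after disambiguation. The Python example below treats each triple as an undirected edge between subject and object; the function name `count_components` and the `ex:` identifiers are assumptions made for this example, and the topic does not prescribe any particular metric.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def count_components(triples: Iterable[Triple]) -> int:
    """Count connected components, ignoring edge direction and predicate."""
    adj: Dict[str, Set[str]] = defaultdict(set)
    for s, _p, o in triples:
        adj[s].add(o)
        adj[o].add(s)

    seen: Set[str] = set()
    components = 0
    for start in adj:
        if start in seen:
            continue
        components += 1
        # Breadth-first search marks everything reachable from start.
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return components


# Hypothetical data: the same person under two identifiers splits the graph.
before = [
    ("ex:J_Smith", "ex:memberOf", "ex:GroupA"),
    ("ex:John_Smith", "ex:locatedIn", "ex:CityX"),
]
# After the two identifiers are resolved to one, the graph is connected.
after = [
    ("ex:J_Smith", "ex:memberOf", "ex:GroupA"),
    ("ex:J_Smith", "ex:locatedIn", "ex:CityX"),
]
components_before = count_components(before)
components_after = count_components(after)
```

A falling component count indicates that missed connections are being repaired, but it does not show that the merges are correct, so it would need to be paired with a precision check against labeled data.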

PHASE II: Produce a prototype system that is scalable to very large data stores. The prototype system will be able to automatically process and display/catalog information on numerous topics defined by the user in near real-time. The model(s) and techniques are to handle forms of data other than text, including audio and image sources. Context-based tie points that can be developed on text, audio, and images will be demonstrated in the prototype. The prototype should be a software application that is compatible with a service-oriented architecture and demonstrated against real tactical data sources (secret level).

PHASE III: Produce a system capable of deployment and operational evaluation. The system should address topics or themes that are specific to developing a terrorist threat assessment or identification of techniques, tactics, and procedures based on system developed tie points. Tie points will be presented in human understandable form. The system should be modified to operate in accordance with guidelines provided by a program of record.

PRIVATE SECTOR COMMERCIAL POTENTIAL/DUAL-USE APPLICATIONS: There are many commercial applications, including credit card fraud detection, business activity monitoring, and security monitoring, that would benefit from advanced data enterprise library services. Presently, there is a strong need to protect military and civilian personnel from terrorist attack by analyzing large data stores. To facilitate interoperability, the systems should operate in a net-centric environment and provide reliable performance. Commercial value and cost savings are achieved by operation in a distributed service-oriented architecture with other applications.

REFERENCES:
1. Paolo Bouquet, Luciano Serafini, and Heiko Stoermer. "Introducing Context into RDF Knowledge Bases", in Proceedings of SWAP 2005, the 2nd Italian Semantic Web Workshop, Trento, Italy, December 14-16, 2005. CEUR Workshop Proceedings, ISSN 1613-0073. http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-166/70.pdf

2. Barbara Bazzanella, Paolo Bouquet, and Heiko Stoermer. "A Cognitive Contribution to Entity Representation and Matching", Technical Report DISI-09-004, Ingegneria e Scienza dell'Informazione, University of Trento. 2009. http://eprints.biblio.unitn.it/archive/00001540/

3. Deepak P, Jyothi John, Sandeep Parameswaran. "Context Disambiguation in Web Search Results", in Proceedings of the IEEE International Conference on Web Services 2004 (ICWS'04) 0-7695-2167-3/04.

4. Smith, Barry, Lowell Viznor and James Schoening, "Universal Core Semantic Layer", OIC2009, http://c4i.gmu.edu/OIC09/papers/OIC2009_5_SmithEtAll.pdf

KEYWORDS: correlation; data fusion; terrorist threats; human language; entity disambiguation; entity extraction

** TOPIC AUTHOR (TPOC) **
DoD Notice:  
Between April 21 and May 19, 2010, you may talk directly with the Topic Authors to ask technical questions about the topics. For reasons of competitive fairness, direct communication between proposers and topic authors is not allowed starting May 19, 2010, when DoD begins accepting proposals for this solicitation.
However, proposers may still submit written questions about solicitation topics through the DoD's SBIR/STTR Interactive Topic Information System (SITIS), in which the questioner and respondent remain anonymous and all questions and answers are posted electronically for general viewing until the solicitation closes. All proposers are advised to monitor SITIS (10.2 Q&A) during the solicitation period for questions and answers, and other significant information, relevant to the SBIR 10.2 topic under which they are proposing.

If you have general questions about the DoD SBIR program, please contact the DoD SBIR Help Desk at (866) 724-7457 or email weblink.