Connecting Disparate Documents Enabled by Semantic Search
Navy SBIR 2010.2 - Topic N102-180
ONR - Mrs. Tracy Frost - [email protected]
Opens: May 19, 2010 - Closes: June 23, 2010

N102-180 TITLE: Connecting Disparate Documents Enabled by Semantic Search

TECHNOLOGY AREAS: Information Systems, Human Systems

ACQUISITION PROGRAM: PM IDF&D

OBJECTIVE: Leverage technologies that analyze documents (similarity, theme and entity discovery, etc.) to mature evidence search technologies that match against example relational evidence or ontology-based search terms, regardless of where in an enterprise the information is stored.

DESCRIPTION: Currently, information is collected through various ingest mechanisms, resulting in numerous text-based reports, spreadsheets, databases, images, and videos. The goal of the ISR Enterprise is to integrate these diverse data sets in order to provide the analyst/warfighter with richer situational awareness, delivered in a concise report. To accomplish this, an enhanced and more automated method is needed to find specific information located in the documents that are most closely related to a subject or to another document. Currently, documents can be found by keyword searches, and document similarity can be addressed by theme extraction or by looking at word usage rates. These techniques allow for document clustering, but fall short of the requirement for semantic searches. The goal of the topic is to combine methods such as keyword, theme, and proper name extraction with social network analysis metrics in order to more rigorously and accurately compute the closeness and betweenness of entities and concepts. By representing an entire corpus of disparate information sources as a graph, related evidence can be found using standard social network metrics. Social Network Analysis concepts can be leveraged in the exploration of such linked data sets.
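
As a rough illustration of the graph representation described above, the following Python sketch links entities that co-occur in a document and computes closeness and betweenness with breadth-first search. The sample documents, entity names, and function names are hypothetical, and entity extraction is assumed to have been performed already by an upstream tool. Co-occurrence is only one possible way to define a link.

```python
from collections import defaultdict, deque
from itertools import combinations

# Hypothetical output of an upstream entity extractor: document id -> entities.
DOCS = {
    "report_1": ["Alice", "Bob"],
    "report_2": ["Bob", "Carol"],
    "sheet_1": ["Carol", "Dave"],
}

def build_graph(docs):
    """Link two entities whenever they appear in the same document."""
    graph = defaultdict(set)
    for entities in docs.values():
        for a, b in combinations(sorted(set(entities)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)

def bfs(graph, source):
    """Return hop distances, shortest-path counts, predecessors, visit order."""
    dist = {source: 0}
    sigma = defaultdict(int)
    sigma[source] = 1
    preds = defaultdict(list)
    order = []
    queue = deque([source])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
            if dist[w] == dist[v] + 1:
                sigma[w] += sigma[v]
                preds[w].append(v)
    return dist, sigma, preds, order

def closeness(graph, node):
    """(reachable nodes) / (sum of hop distances); 0.0 for an isolated node."""
    dist, _, _, _ = bfs(graph, node)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

def betweenness(graph):
    """Unnormalized betweenness for an undirected graph (Brandes' algorithm)."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        _, sigma, preds, order = bfs(graph, s)
        delta = defaultdict(float)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return {v: b / 2 for v, b in score.items()}  # each pair is counted twice
```

The sample data forms a simple chain (Alice, Bob, Carol, Dave), so Bob and Carol each receive a betweenness of 2.0 while Alice and Dave receive 0.0. Bob's closeness is 0.75 and Alice's is 0.5.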

Semantic Search Technologies
When data enters an IT-based system, it initially exists on its own, unrelated to other information both internal and external to the system. A more complete intelligence picture is formed when new information is linked to existing internal information and external sources. Automated fusion of data from various sources will assist evidence collection and entity profile formation by acting like a zipper between disparate data sets. Enterprise ontologies can be used as the basis of this capability by creating a structure by which diverse information can be related. One of the main challenges in accomplishing this task is discovering what critical information remains unknown and disparate. Automatically providing greater fidelity to existing information by discovering relationships between pieces of information will reduce the amount of time it takes to form a complete profile.

Social Network Analysis (SNA) techniques can assist by proactively monitoring changes in related information. By understanding the relationships between pieces of information, the system can incorporate new information in the right places. The result is a reduction in the amount of time it takes to relate new information. This effort looks to leverage existing document clustering techniques that discover entities, themes, and associations. The current data classifying products act naively against new data sets, ignoring the data that has already been collected (i.e., the previous states of entity relationships). However, adding information and relating it to existing information changes the graphical structure of the data (i.e., the relationships between entities and concepts). SNA metrics quantify the relationships and enable matching capability.
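
The point that new information changes the graph structure of previously collected data can be sketched as follows. This is a minimal illustration only: the class name, the co-occurrence linking rule, and the choice of degree centrality as the tracked metric are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations

class EntityGraph:
    """Keeps co-occurrence links between entities across ingested documents."""

    def __init__(self):
        self.neighbors = defaultdict(set)

    def degree_centrality(self):
        """Fraction of the other known entities each entity is linked to."""
        n = len(self.neighbors)
        if n < 2:
            return {e: 0.0 for e in self.neighbors}
        return {e: len(nbrs) / (n - 1) for e, nbrs in self.neighbors.items()}

    def ingest(self, entities):
        """Add one document's entities; return {entity: (old, new)} for changes."""
        before = self.degree_centrality()
        for e in entities:
            self.neighbors[e]  # register the entity even if it has no links yet
        for a, b in combinations(sorted(set(entities)), 2):
            self.neighbors[a].add(b)
            self.neighbors[b].add(a)
        after = self.degree_centrality()
        return {
            e: (before.get(e), after[e])
            for e in after
            if before.get(e) != after[e]
        }
```

Ingesting a document mentioning Alice and Bob, followed by one mentioning Bob and Carol, changes Alice's score from 1.0 to 0.5 even though Alice does not appear in the second document. A system that ignored previously collected data would miss this shift.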

Rather than improving the field of Natural Language Processing (NLP), this effort will utilize, as much as possible, existing NLP tools that extract relational data from documents, in order to explore the possibilities of applying Social Network Analysis to such data. Further, the effort will utilize ontologies and reasoning capabilities enabled by semantic web concepts for analyzing the relationships.

The focus of this effort will be on evaluating the Social Network Analysis techniques applicable to linked data in an Information System, where the data collected is representative of real world events as observed by humans and sensors. Level 2 fusion, where relations between objects are established, will be relied upon to provide the necessary data structures. The effort will look to build upon Level 2 fusion (Situation Assessment) mechanisms, to further increase understanding of the information.

Continuously computing social network metrics against linked data will help classify entities and structures. It is then possible to compare detected structures to known social network structures. Social network theory has thus far provided us with ways of understanding community structures and roles played by individual entities given such structures. For example, knowledge flows through a community of individuals can be analyzed, and the constraints certain network structures place on information flow are known. The hypothesis of this effort is that such knowledge of social networks can be applied to linked data in an automated collection system aggregating information from diverse sources.
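
One simple way to compare a detected structure with known network structures is to reduce each to a summary statistic and pick the nearest match. The sketch below uses Freeman degree centralization for that purpose. The reference shapes, the detected graph, and the nearest-value matching rule are hypothetical choices for illustration; a real comparison would likely need several metrics.

```python
def degree_centralization(graph):
    """Freeman degree centralization: 1.0 for a star, 0.0 when all degrees match."""
    n = len(graph)
    if n < 3:
        return 0.0
    degrees = [len(nbrs) for nbrs in graph.values()]
    top = max(degrees)
    return sum(top - d for d in degrees) / ((n - 1) * (n - 2))

def closest_signature(graph, signatures):
    """Name the known structure whose centralization is nearest the observed one."""
    observed = degree_centralization(graph)
    return min(signatures, key=lambda name: abs(signatures[name] - observed))

# Hypothetical reference shapes, given as adjacency sets.
STAR = {
    "hub": {"a", "b", "c", "d"},
    "a": {"hub"}, "b": {"hub"}, "c": {"hub"}, "d": {"hub"},
}
RING = {
    "a": {"b", "e"}, "b": {"a", "c"}, "c": {"b", "d"},
    "d": {"c", "e"}, "e": {"d", "a"},
}
SIGNATURES = {
    "star (one broker)": degree_centralization(STAR),
    "ring (no broker)": degree_centralization(RING),
}

# A detected structure: one well-connected entity plus one peripheral link.
DETECTED = {
    "x": {"p", "q", "r", "s"},
    "p": {"x", "q"}, "q": {"x", "p"},
    "r": {"x"}, "s": {"x"},
}

if __name__ == "__main__":
    print(closest_signature(DETECTED, SIGNATURES))
```

For the detected graph shown, the centralization is 10/12, so the nearest signature is the star.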

The result of the effort is intended to be a prototype system that can perform such analysis on an on-going basis. One of the primary tasks of the effort will be researching methods of comparing structure found in data sets of linked information to existing known social network structures/signatures. The system should be able to classify entities based on their relations to other entities and present this information to the user in a meaningful way. It should be able to represent, in understandable plain language, the characteristics of entities and groups to users searching for information. The characteristics of the entities are the result of the analysis of the metrics that are continually being computed by the system. The system can, thus, perform entity classification, and report on an entity's position/role in a given network. The evidence search system, as envisioned, aims to provide a mechanism that can assist in finding relevant evidence in the absence of direct ties between clues.
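
The plain-language reporting of an entity's position or role could, under simple assumptions, be driven by thresholds on normalized metrics. In the sketch below, the function name, the 0.5 cutoff, and the role labels are illustrative assumptions rather than validated values.

```python
def describe_entity(name, degree, betweenness, cutoff=0.5):
    """Turn normalized metrics (0.0 to 1.0) into a one-sentence role description.

    The cutoff and the role wording are illustrative assumptions.
    """
    connected = degree >= cutoff
    bridging = betweenness >= cutoff
    if connected and bridging:
        role = "a central hub that also links otherwise separate groups"
    elif bridging:
        role = "a broker that connects groups despite having few direct ties"
    elif connected:
        role = "a well-connected member of a single group"
    else:
        role = "a peripheral entity with few ties"
    return (
        f"{name} appears to be {role} "
        f"(degree {degree:.2f}, betweenness {betweenness:.2f})."
    )
```

Because the metrics are continually recomputed, the same function would yield a different description for an entity whose position in the network has shifted.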

Additionally, this effort looks to be an enabler for future research. Areas of future research include longitudinal and multi-modal network analysis of data sets containing information gathered by both humans and automated sensors. Another area includes "what if" exploratory analysis of such networks, which can present potential scenarios to users.

Phase I: Provide a proof-of-concept demonstration, against one data store, of the utility of coupling state-of-the-art text analytics with social network analysis metrics. Examine and report on the technical risk of developing a real-time enterprise application service. Compare computed entity-to-entity and document-to-document closeness/betweenness with subject matter expert assessment.
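
One possible way to compare computed closeness/betweenness values with subject matter expert assessments is a rank correlation, which asks only whether the two orderings agree. The sketch below computes Spearman's coefficient in plain Python. The choice of Spearman correlation and the function names are assumptions made for illustration; the topic does not prescribe a comparison method.

```python
def ranks(values):
    """Average ranks (1 = smallest); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of at least two items")
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all values are tied")
    return cov / (var_x * var_y) ** 0.5
```

A coefficient near 1.0 would indicate that the computed metric orders entity pairs much as the experts do, while a value near 0.0 would indicate no agreement in ordering.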

Phase II: Produce a prototype system that includes both real-time document preprocessing and document (entity/concept) social network analysis metrics. Demonstrate that the "accuracy" of semantic searches and the efficiency of data mining are enhanced by this method.

Phase III: Mature the prototype developed under Phase II while showing continued improvement against key metrics (search accuracy, data retrieval). Support a transition to a Distributed Common Ground System (DCGS) program of record.

PRIVATE SECTOR COMMERCIAL BENEFIT: The commercial market for semantic search and smart data retrieval capabilities is expanding at a rapid pace. The development of a tool that can automatically and correctly retrieve relevant related information from documents has both military and commercial value.

REFERENCES:
1. M. Girvan and M. E. J. Newman, "Community structure in social and biological networks," Proceedings of the National Academy of Sciences of the United States of America 99, no. 12 (2002): 7821-7826.

2. Bettina Hoser et al., "Semantic Network Analysis of Ontologies," in The Semantic Web: Research and Applications, vol. 4011, Lecture Notes in Computer Science (Springer Berlin / Heidelberg, 2006), 514-529.

3. Xiang Zhang, Gong Cheng, and Yuzhong Qu, "Ontology summarization based on rdf sentence graph," Proceedings of the 16th international conference on World Wide Web (2007): 707-716.

4. R. Lempel and S. Moran, "The stochastic approach for link-structure analysis (SALSA) and the TKC effect," Computer Networks 33, no. 1-6 (June 2000): 387-401.

5. Heiner Stuckenschmidt, "Network Analysis as a Basis for Partitioning Class Hierarchies," in Proceedings of the ISWC 2005 Workshop on Semantic Network Analysis, vol. 171 (presented at the SNA 2005 Semantic Network Analysis, Galway, Ireland, 2005), 43-54.

6. Silvio Peroni, Enrico Motta, and Mathieu d'Aquin, "Identifying Key Concepts in an Ontology, through the Integration of Cognitive Principles with Statistical and Topological Measures," in The Semantic Web, vol. 5367, Lecture Notes in Computer Science (Springer Berlin / Heidelberg, 2008), 242-256.

7. Styliani Kleanthous and Vania Dimitrova, "Modelling Semantic Relationships and Centrality to Facilitate Community Knowledge Sharing," in Adaptive Hypermedia and Adaptive Web-Based Systems, vol. 5149, Lecture Notes in Computer Science (Springer Berlin / Heidelberg, 2008), 123-132.

8. D.V. Kalashnikov et al., "Web People Search via Connection Analysis," Knowledge and Data Engineering, IEEE Transactions on 20, no. 11 (November 2008): 1550-1565.

KEYWORDS: semantic; social network analysis; search; closeness; clustering; information

** TOPIC AUTHOR (TPOC) **
DoD Notice:  
Between April 21 and May 19, 2010, you may talk directly with the Topic Authors to ask technical questions about the topics. For reasons of competitive fairness, direct communication between proposers and topic authors is not allowed starting May 19, 2010, when DoD begins accepting proposals for this solicitation.
However, proposers may still submit written questions about solicitation topics through the DoD's SBIR/STTR Interactive Topic Information System (SITIS), in which the questioner and respondent remain anonymous and all questions and answers are posted electronically for general viewing until the solicitation closes. All proposers are advised to monitor SITIS (10.2 Q&A) during the solicitation period for questions and answers, and other significant information, relevant to the SBIR 10.2 topic under which they are proposing.

If you have general questions about the DoD SBIR program, please contact the DoD SBIR Help Desk at (866) 724-7457 or by email.