Connecting Disparate Documents Enabled by Semantic Search
Navy SBIR 2010.2 - Topic N102-180
ONR - Mrs. Tracy Frost - [email protected]
Opens: May 19, 2010 - Closes: June 23, 2010

N102-180 TITLE: Connecting Disparate Documents Enabled by Semantic Search

TECHNOLOGY AREAS: Information Systems, Human Systems

ACQUISITION PROGRAM: PM IDF&D

OBJECTIVE: Leverage technologies that analyze documents (similarity, theme and entity discovery, etc.) to mature evidence search technologies that match against example relational evidence or ontology-based search terms, regardless of where in an enterprise the information is stored.

DESCRIPTION: Currently, information is collected through various ingest mechanisms, resulting in numerous text-based reports, spreadsheets, databases, images, and videos. The goal of the ISR Enterprise is to integrate these diverse data sets in order to provide the analyst/warfighter with richer situational awareness, delivered in a concise report. To accomplish this, an enhanced and more automated method is needed to find specific information located in the documents that are most closely related to a subject or to another document. Currently, documents can be found by keyword searches, and document similarity can be addressed by theme extraction or by looking at word usage rates. These techniques allow for document clustering, but fall short of the requirement for semantic searches.

The goal of the topic is to combine methods such as keyword, theme, and proper name extraction with social network analysis metrics in order to compute the closeness and betweenness of entities and concepts more rigorously and accurately. By representing an entire corpus of disparate information sources as a graph, related evidence can be found using standard social network metrics. Social Network Analysis concepts can be leveraged in the exploration of such linked data sets. Semantic search technologies and Social Network Analysis (SNA) techniques can assist by proactively monitoring changes in related information.
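The graph representation described above can be sketched in a few lines. This is a minimal illustration only, not a method prescribed by the topic: it assumes an upstream extraction tool has already produced a list of entities per document, links entities that co-occur in a document, and computes closeness by breadth-first search. The document identifiers, entity names, and function names are all hypothetical.

```python
from collections import deque
from itertools import combinations


def build_graph(doc_entities):
    """Link two entities whenever they are mentioned in the same document."""
    graph = {}
    for entities in doc_entities.values():
        for entity in entities:
            graph.setdefault(entity, set())
        for a, b in combinations(set(entities), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def hop_distances(graph, source):
    """Breadth-first search: number of links from source to each reachable entity."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def closeness(graph, node):
    """Reachable entities divided by their total hop distance (0.0 if isolated)."""
    dist = hop_distances(graph, node)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


# Hypothetical extractor output: document id -> entities found in that document.
doc_entities = {
    "report_1": ["Alice", "Bob"],
    "report_2": ["Bob", "Carol"],
    "spreadsheet_3": ["Carol", "Dave"],
}
graph = build_graph(doc_entities)
scores = {entity: closeness(graph, entity) for entity in graph}
```

In this toy corpus, Alice and Dave never share a document, yet the graph connects them through Bob and Carol. That is the kind of indirect tie a keyword search alone would not surface.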
By understanding the relationships among pieces of information, the system can incorporate new information in the right places. The result is a reduction in the time it takes to relate new information. This effort looks to leverage existing document clustering techniques that discover entities, themes, and associations. Current data classification products act naively against new data sets, ignoring the data that has already been collected (i.e., the previous states of entity relationships). However, adding information and relating it to existing information changes the graph structure of the data (i.e., the relationships between entities and concepts). SNA metrics quantify the relationships and enable matching capability.

Rather than improving the field of Natural Language Processing (NLP), this effort will utilize, as much as possible, existing NLP tools that extract relational data from documents, in order to explore the possibilities of applying Social Network Analysis to such data. Further, the effort will utilize ontologies and reasoning capabilities enabled by semantic web concepts for analyzing the relationships.

The focus of this effort will be on evaluating the Social Network Analysis techniques applicable to linked data in an Information System, where the data collected is representative of real-world events as observed by humans and sensors. Level 2 fusion, where relations between objects are established, will be relied upon to provide the necessary data structures. The effort will look to build upon Level 2 fusion (Situation Assessment) mechanisms to further increase understanding of the information. Continuously computing social network metrics against linked data will help classify entities and structures. It is then possible to compare detected structures to known social network structures.
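To illustrate how relating new information to existing information changes the graph and its metrics, the sketch below recomputes betweenness before and after one new document is ingested and reports which entities shifted. It is an assumption-laden example: the betweenness routine follows Brandes' well-known algorithm for unweighted graphs, the entity names are hypothetical, and a real system would likely update metrics incrementally rather than recompute them from scratch.

```python
from collections import deque
from itertools import combinations


def add_document(graph, entities):
    """Add one document's entities, linking every pair that co-occurs in it."""
    for entity in entities:
        graph.setdefault(entity, set())
    for a, b in combinations(set(entities), 2):
        graph[a].add(b)
        graph[b].add(a)


def betweenness(graph):
    """Brandes' algorithm for an unweighted, undirected graph."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        order = []                      # nodes in order of discovery
        preds = {v: [] for v in graph}  # predecessors on shortest paths
        sigma = {v: 0 for v in graph}   # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):       # accumulate dependencies back toward s
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: b / 2 for v, b in score.items()}


def metric_shift(before, after):
    """Entities whose score changed, mapped to the size of the change."""
    return {v: after[v] - before.get(v, 0.0)
            for v in after if after[v] != before.get(v, 0.0)}


# Hypothetical previously collected state: a chain Alice - Bob - Carol - Dave.
graph = {}
for entities in (["Alice", "Bob"], ["Bob", "Carol"], ["Carol", "Dave"]):
    add_document(graph, entities)
before = betweenness(graph)

# A newly ingested report mentions Alice and Dave together.
add_document(graph, ["Alice", "Dave"])
after = betweenness(graph)
shift = metric_shift(before, after)
```

In this example the new report closes the chain into a ring, so Bob and Carol lose their exclusive bridging position while Alice and Dave gain some. A system that ignored the previously collected state would not detect that change.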
Social network theory has thus far provided us with ways of understanding community structures and roles played by individual entities given such structures. For example, knowledge flows through a community of individuals can be analyzed, and the constraints certain network structures place on information flow are known. The hypothesis of this effort is that such knowledge of social networks can be applied to linked data in an automated collection system aggregating information from diverse sources. The result of the effort is intended to be a prototype system that can perform such analysis on an on-going basis.

One of the primary tasks of the effort will be researching methods of comparing structure found in data sets of linked information to existing known social network structures/signatures. The system should be able to classify entities based on their relations to other entities and present this information to the user in a meaningful way. It should be able to represent, in understandable plain language, the characteristics of entities and groups to users searching for information. The characteristics of the entities are the result of the analysis of the metrics that are continually being computed by the system. The system can, thus, perform entity classification, and report on an entity's position/role in a given network.

The evidence search system, as envisioned, aims to provide a mechanism that can assist in finding relevant evidence in the absence of direct ties between clues. Additionally, this effort looks to be an enabler for future research. Areas of future research include longitudinal and multi-modal network analysis of data sets containing information gathered by both humans and automated sensors. Another area includes "what if" exploratory analysis of such networks, which can present potential scenarios to users.
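As a rough illustration of turning computed metrics into a plain-language report on an entity's role, the sketch below applies simple threshold rules to a degree count and a betweenness score. The role names, thresholds, and wording are hypothetical placeholders chosen for the example. The topic does not specify them, and a real system would presumably derive them from known network signatures.

```python
# Hypothetical role descriptions; the vocabulary is an assumption, not from the topic.
ROLE_TEXT = {
    "isolate": "has no recorded links to other entities",
    "broker": "sits on many of the shortest paths between other entities",
    "hub": "is directly linked to many entities",
    "member": "has few links and rarely lies between other entities",
}


def classify_role(degree, betweenness, hub_degree=3, broker_betweenness=1.0):
    """Assign a coarse role from two metrics using illustrative thresholds."""
    if degree == 0:
        return "isolate"
    if betweenness >= broker_betweenness:
        return "broker"
    if degree >= hub_degree:
        return "hub"
    return "member"


def describe(name, degree, betweenness):
    """Render the classification as a sentence a searcher can read."""
    role = classify_role(degree, betweenness)
    return (f"{name} {ROLE_TEXT[role]}, so it is classified as '{role}' "
            f"(degree {degree}, betweenness {betweenness:.1f}).")


# Hypothetical metric values, as if taken from a continuously updated metric store.
report = [
    describe("Bob", degree=2, betweenness=2.0),
    describe("Alice", degree=1, betweenness=0.0),
]
```

Keeping the classification rules separate from the sentence templates lets either be replaced independently, for example when thresholds are later tuned against subject matter expert assessments.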
Phase I: Provide a proof-of-concept demonstration, against one data store, of the utility of coupling state-of-the-art text analytics with social network analysis metrics. Examine and report on the technical risk of developing a real-time enterprise application service. Compare computed entity-to-entity and document-to-document closeness/betweenness with subject matter expert assessment.

Phase II: Produce a prototype system that includes both real-time document preprocessing and document (entity/concept) social network analysis metrics. Demonstrate that the "accuracy" of semantic searches and the efficiency of data mining are enhanced by this method.

Phase III: Mature the prototype developed under Phase II while showing continued improvement against key metrics (search accuracy, data retrieval). Support a transition to a Distributed Common Ground Station (DCGS) program of record.

PRIVATE SECTOR COMMERCIAL BENEFIT: The commercial market for semantic search and smart data retrieval capabilities is expanding at a rapid pace. The development of a tool that can automatically and correctly retrieve relevant related information from documents has both military and commercial value.

REFERENCES:
2. Bettina Hoser et al., "Semantic Network Analysis of Ontologies," in The Semantic Web: Research and Applications, vol. 4011, Lecture Notes in Computer Science (Springer Berlin / Heidelberg, 2006), 514-529.
3. Xiang Zhang, Gong Cheng, and Yuzhong Qu, "Ontology Summarization Based on RDF Sentence Graph," Proceedings of the 16th International Conference on World Wide Web (2007): 707-716.
4. R. Lempel and S. Moran, "The Stochastic Approach for Link-Structure Analysis (SALSA) and the TKC Effect," Computer Networks 33, no. 1-6 (June 2000): 387-401.
5. Heiner Stuckenschmidt, "Network Analysis as a Basis for Partitioning Class Hierarchies," in Proceedings of the ISWC 2005 Workshop on Semantic Network Analysis, vol. 171 (presented at the SNA 2005 Semantic Network Analysis, Galway, Ireland, 2005), 43-54.
6. Silvio Peroni, Enrico Motta, and Mathieu d'Aquin, "Identifying Key Concepts in an Ontology, through the Integration of Cognitive Principles with Statistical and Topological Measures," in The Semantic Web, vol. 5367, Lecture Notes in Computer Science (Springer Berlin / Heidelberg, 2008), 242-256.
7. Styliani Kleanthous and Vania Dimitrova, "Modelling Semantic Relationships and Centrality to Facilitate Community Knowledge Sharing," in Adaptive Hypermedia and Adaptive Web-Based Systems, vol. 5149, Lecture Notes in Computer Science (Springer Berlin / Heidelberg, 2008), 123-132.
8. D.V. Kalashnikov et al., "Web People Search via Connection Analysis," IEEE Transactions on Knowledge and Data Engineering 20, no. 11 (November 2008): 1550-1565.

KEYWORDS: semantic; social network analysis; search; closeness; clustering; information