Embedded Space Analytics
Navy STTR 2016.A - Topic N16A-T020
ONR - Ms. Dusty Lang - firstname.lastname@example.org
Opens: January 11, 2016 - Closes: February 17, 2016
N16A-T020 TITLE: Embedded Space Analytics
TECHNOLOGY AREA(S): Information Systems
ACQUISITION PROGRAM: FNT FY15-02 DF Naval Tactical Cloud, PMMI (MCSC), DCGS-N (PMW 120)
OBJECTIVE: Develop a capability to detect people, places and events of interest from big data by developing anomaly detection and supervised learning algorithms that can operate effectively on compressed data and data embeddings.
DESCRIPTION: Model-based understanding technology has enabled machines to generate huge graphs (millions of nodes and billions of edges) from diverse sources (structured and unstructured data). Even for capable machines, real-time operation of complex anomaly detection and supervised learning algorithms requires a reduction in data volume. A variety of data embedding algorithms achieve a reduction in graph dimensionality through unsupervised or semi-supervised techniques. The goal of this topic is to mature algorithms that interpret the significance of the movement of entity vector representations over time in embedded spaces. Specific technical challenges include the development of: 1) learning algorithms that allow data vectors to be characterized in a behavior space; 2) anomaly detection algorithms that generate useful alerts about real entities; 3) supervised learning algorithms that predict the meaning behind the movement of entities in embedded spaces; and 4) system scoring of the confidence of the knowledge generated.
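As an illustration of challenge 2, the sketch below flags entities whose embedding vector moved unusually far between two snapshots. It is a minimal example under stated assumptions, not a method prescribed by this topic: the function name `flag_movers`, the dictionary-of-vectors snapshot format, and the 3.5 cutoff on the modified z-score are all hypothetical choices made for illustration.

```python
import math
from statistics import median


def flag_movers(snapshot_t0, snapshot_t1, threshold=3.5):
    """Return {entity: score} for entities that moved unusually far.

    Each snapshot maps an entity id to its embedding vector (assumed format).
    The score is the modified z-score 0.6745 * (d - median) / MAD of the
    Euclidean displacement d, so a few large movers do not distort the baseline.
    """
    shared = sorted(set(snapshot_t0) & set(snapshot_t1))
    moves = {e: math.dist(snapshot_t0[e], snapshot_t1[e]) for e in shared}
    if not moves:
        return {}
    med = median(moves.values())
    mad = median(abs(d - med) for d in moves.values())
    if mad == 0:
        # Limitation of this sketch: with no spread in movement, nothing is ranked.
        return {}
    scores = {e: 0.6745 * (d - med) / mad for e, d in moves.items()}
    return {e: s for e, s in scores.items() if s > threshold}
```

Only entities present in both snapshots are compared. An operational system would also need to handle entities that appear or disappear, and embeddings that are re-trained between snapshots and therefore not directly comparable.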
Enabled by service-oriented architecture (SOA) and cloud architectures, intelligence programs have unprecedented access to big data whose detailed content is represented by even larger graphs. Advances in dynamic graph analysis are needed to show the military value of holding and indexing these big data stores for strategic and tactical use cases. Algorithms to generate lower dimensional graphs, and supervised learning algorithms that can be applied to data vector representations, currently exist. Progress has been made on analyzing static embedded data representations, for example inferring missing data and making classification decisions. More work is needed, however, to link vector positions to real-world meaning, particularly over time. Dynamic analysis techniques are less developed but are needed to generate time-sensitive alerts from streaming data, such as for change detection and event discovery. Research institutions and universities are active in the development of unsupervised (e.g., anomaly detection and data embeddings) and supervised learning algorithms. The dynamic algorithms needed to understand the movement of entity vector representations over time are a natural extension of their current research activity.
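The kind of time-sensitive alerting described above can be sketched for a single entity as a running baseline of its typical step size in the embedded space, with an alert raised when one step greatly exceeds that baseline. This is only an illustrative sketch: the class name `DriftMonitor` and the parameters `alpha`, `ratio`, and `warmup` are hypothetical, and the exponentially weighted moving average is one possible baseline rather than a technique named in this topic.

```python
import math


class DriftMonitor:
    """Tracks one entity's embedding over a stream and alerts on abrupt moves."""

    def __init__(self, alpha=0.3, ratio=4.0, warmup=3):
        self.alpha = alpha    # smoothing weight for the typical step size
        self.ratio = ratio    # alert when a step exceeds ratio * typical step
        self.warmup = warmup  # number of steps to observe before alerting
        self.last = None      # most recent vector seen
        self.typical = None   # running estimate of an ordinary step size
        self.steps = 0

    def update(self, vector):
        """Consume the next vector; return True if the move looks abrupt."""
        if self.last is None:
            self.last = list(vector)
            return False
        step = math.dist(self.last, vector)
        self.last = list(vector)
        self.steps += 1
        alert = (
            self.steps > self.warmup
            and self.typical is not None
            and self.typical > 0
            and step > self.ratio * self.typical
        )
        if self.typical is None:
            self.typical = step
        elif not alert:
            # Fold only ordinary steps into the baseline so a jump does not inflate it.
            self.typical = self.alpha * step + (1 - self.alpha) * self.typical
        return alert
```

Each call to `update` costs time proportional to the vector dimension and keeps constant state per entity, which is the property that matters for streaming use. Thresholds would need tuning against real data.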
A mature system should also be easy to use and compatible with the computational architecture of a transition program of record.
Tasks to consider include the following: 1) Entity/relationship declarations in support of knowledge discovery that are task/mission essential; 2) Unsupervised and semi-supervised methods for the construction of embedding spaces from very large graphs that are rational and human understandable (not black boxes); 3) Supervised learning algorithm development to support dynamic inferencing of embedded spaces; and 4) Visualizations of high-order embedded spaces at lower dimensions that are instructive to users.
PHASE I: For a bounded set of data and information requirements, show an embedded space representation of a large graph and train classifiers to learn the relationships between embeddings and real world entity descriptors. Produce a use case and workflows relevant to a military customer and/or commercial market. Provide a proof of concept demonstration for identified transition targets. During the Phase I effort, performers are expected to identify metrics to validate performance of analytic products with the goal of reducing the technical risk associated with building a working prototype should work progress. Performers should produce Phase II plans with a technology roadmap and milestones.
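One simple way to relate embeddings to real-world entity descriptors, as Phase I asks, is a nearest-centroid classifier over labeled vectors. The sketch below is an assumption-laden illustration rather than a required approach: the function names `fit_centroids` and `predict` are hypothetical, and the returned confidence is a heuristic distance ratio between the two closest classes, not a calibrated probability.

```python
import math


def fit_centroids(vectors, labels):
    """Average the embedding vectors of each label into one centroid per label."""
    sums, counts = {}, {}
    for vec, label in zip(vectors, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def predict(centroids, vector):
    """Return (label, confidence) for the centroid nearest to the vector.

    Confidence is d_second / (d_best + d_second): 0.5 when the two closest
    classes are equally near, approaching 1.0 as the best class dominates.
    """
    ranked = sorted(
        ((math.dist(center, vector), label) for label, center in centroids.items()),
        key=lambda pair: pair[0],
    )
    best_distance, best_label = ranked[0]
    if len(ranked) == 1:
        return best_label, 1.0
    second_distance = ranked[1][0]
    total = best_distance + second_distance
    confidence = 0.5 if total == 0 else second_distance / total
    return best_label, confidence
```

A low confidence value marks vectors that sit between classes, which is one candidate input to the performance metrics performers are expected to identify.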
PHASE II: Produce a prototype system based on the preliminary design from Phase I. The prototype should enable users to infer information not overtly evident in the data and provide measures of effectiveness. In Phase II, performers may be given data by the Government to validate capabilities. The small business should assume that the prototype system will need to run as a distributed application in a cloud architecture that could scale to millions of nodes and billions of edges. Phase II deliverables will include a working prototype of the system, software documentation including a user's manual, and a demonstration using operational data or accurate surrogates of operational data.
PHASE III DUAL USE APPLICATIONS: Based on the Phase II effort, deliver to the Navy a system capable of deployment and operational evaluation. The system should consume available operational and open source data sets and focus on areas/missions that are of interest to specific transition programs or commercial applications. The system needs to have an easy-to-use human systems interface. The software and hardware should be modified to operate in accordance with guidelines provided by the transition sponsor. Internet search engines would benefit from the maturation of data retrieval based on distances between concepts in embedded spaces. Currently, information retrieval is limited to word searches with some support for graph searches. Information retrieval based on second or higher order association (similarity between feature vectors) would transform content delivery by improving "you might also like" recommendations.
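Retrieval by similarity between feature vectors, as opposed to word matching, can be sketched as a ranking of stored items by cosine similarity to a query vector. This is a minimal illustration only: the names `cosine` and `most_similar` and the in-memory dictionary index are hypothetical, and a system at the scale described in Phase II would need an approximate nearest-neighbor index rather than this exhaustive scan.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(index, query, k=3):
    """Return the names of the k items in the index closest to the query vector."""
    scored = [(cosine(vector, query), name) for name, vector in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]
```

Two items can rank as close neighbors without sharing any keyword, provided the embedding places them near each other.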
REFERENCES:
1. Alexei Pozdnoukhov, "Dynamic network data exploration through semi-supervised functional embedding", ACM GIS '09, Seattle, WA, 2009.
2. Jian Tang, et al., "LINE: Large-Scale Information Network Embedding", WWW 2015, May 18-22, 2015, Florence, Italy. http://arxiv.org/pdf/1503.03578v1.pdf
3. Onur Savas, et al., "Tactical Big Data Analytics: Challenges, Use Cases, and Solutions", Big Data Analytics Workshop in conjunction with ACM Sigmetrics 2013, June 21, 2013.
4. Thomas Hansmann and P. Niemeyer, "Big Data - Characterizing an Emerging Research Field using Topic Models", IEEE WIC ACM Int. Joint Conference on Web Intelligence and Intelligent Agent Technologies, 2014.
5. Amr Osman, et al., "Towards Real-Time Analytics in the Cloud", 2013 IEEE 9th World Congress on Services, 2013.
6. Amr Ahmed, et al., "Distributed Large-Scale Natural Graph Factorization", Int. World Wide Web Conference Committee (IW3C2), Rio de Janeiro, Brazil, May 13-17, 2013.
KEYWORDS: Data embeddings; Graph theory; Data science; Advanced analytics; Cloud computing; Unsupervised learning; Supervised learning
TPOC-1: Martin Kruger
TPOC-2: Scott McGirr
TPOC-3: Joan Kaina
Questions may also be submitted through the DoD SBIR/STTR SITIS website.