2902 - Efficient parallel abstraction of heterogeneous data for AI-based exploration
Modern data lakes support heterogeneous formats, from the widely used relational format to more flexible and less regular ones such as XML, graphs, etc. The ConnectionLens project (https://team.inria.fr/cedar/connectionlens/) integrates data of any format (relational, CSV, JSON, XML, RDF) into a directed data graph model. With the help of AI (Information Extraction), this graph is enriched with the entities that the data contains, such as People, Organizations, Locations, emails, URIs, etc. This system can be seen as a “data lake” platform. Based on ConnectionLens, the Abstra tool (https://team.inria.fr/cedar/projects/abstra/) automatically computes, from the directed data graph, an abstraction very close to an entity-relationship (ER) model. Thus, Abstra automatically identifies:
Currently, Abstra can identify only binary relationships (between two entities). Furthermore, it can abstract only one dataset at a time.
The CEDAR team developed ConnectionLens and Abstra mainly for users who are not experts in data-related technologies, in particular journalists from Le Monde, ICIJ, etc. A major challenge with data lakes is their size, which sometimes reaches terabytes of data. Thus, the goal of this internship will be to scale Abstra to make it usable for such data lakes by:
The selected student should have a strong background in algorithms, databases, and systems, as well as good programming skills. Development can take place in Java (preferred) and SQL, in a collaborative team that synchronizes and reconciles code versions using Git.