Please use this identifier to cite or link to this item: https://hdl.handle.net/1889/4262
Full metadata record
DC Field: Value (Language)
dc.contributor.advisor: Poggi, Agostino
dc.contributor.advisor: Pardalos, Panos M.
dc.contributor.author: Lombardo, Gianfranco
dc.date.accessioned: 2021-04-08T09:50:09Z
dc.date.available: 2021-04-08T09:50:09Z
dc.date.issued: 2021
dc.identifier.uri: https://hdl.handle.net/1889/4262
dc.description.abstract: Big data is what machine learning models need to learn concepts and tasks with sufficient generalization margins. It can also be valuable for data mining applications whose goal is to extract latent knowledge. Much of the data we can collect every day is unstructured: no predefined schema is present. Examples include documents, messages, and interactions on social media platforms and e-commerce websites. The unstructured nature of these data inevitably adds challenges to the cases above. Machine learning models need input data as real-valued vectors (a feature or design matrix) for supervised and unsupervised learning. The same issue arises when we are interested in quantitative information (e.g., the similarity between two or more documents, or other statistics). Another interesting case concerns network-science techniques for analyzing data that have (or suggest) a graph structure. Many systems can be efficiently described as nodes that interact with each other (e.g., social networks, recommendation systems, interactions among proteins, financial markets). In this case, we cannot apply machine learning techniques directly because of the lack of a vector representation. Advances in the field of neural networks make it possible to learn feature vectors directly from the input data distribution. This task can be seen as a pre-processing step in a modern machine learning project, and it involves machine learning itself. This automatic feature extraction is also known as "representation learning". We can summarize it as learning a vector representation of input data in a supervised or unsupervised way. We can learn these representations, or embeddings, for different kinds of data: words, entire documents, nodes or edges in a graph, images, and signals.

Moreover, embeddings encode data while preserving and providing additional information; for example, semantically similar words end up close in their vector representations. The choice of how to achieve this preliminary task influences all subsequent stages of a typical data-mining pipeline. The challenge consists in preserving as much information as possible while looking for a structure. Second, given the promising results achieved recently, it is interesting to exploit the resulting vector spaces for knowledge extraction as well. In light of the current challenges and advances in the field of representation learning with unstructured data, my research activity has focused on this topic. In particular, this thesis reports the results achieved in two main directions:

- Using representation learning techniques based on neural networks to extract latent knowledge from documents: if neural networks can capture fundamental aspects of data by learning a different representation in the output layer, and considering that this representation makes classification or clustering easier, can we exploit these techniques to extract new knowledge? Part of my research tries to answer this question. In particular, two use cases are reported: the first analyzes scientific documents from the public repository Scopus, combining word embeddings and human mobility metrics. The second uses neural network embeddings to extract knowledge from the financial reports of thousands of American companies in the stock market.

- Overcoming current limits of representation learning on graphs: machine learning models can benefit from input data derived from graph structures. However, to apply most of the available models it is necessary to obtain a vector form for nodes and edges. The most promising approach is neural network embedding, and the state of the art is represented by the Node2Vec algorithm. However, there are still two open problems in this field: scalability (learning a representation of large-scale graphs) and the lack of support for dynamic contexts: if a new node joins the network, it is necessary to recompute the representation of the entire graph. Part of my doctorate addresses these two problems. A first contribution is an actor-based version of Node2Vec that overcomes scalability issues by distributing the bottlenecks among agents that organize themselves with different behaviors to achieve the embedding of large-scale graphs. A second contribution is the development of a novel algorithm for incremental feature learning over graphs. The algorithm exploits properties of scale-free graphs to encode new nodes without retraining the model over all the nodes. It computes a light embedding over the 20% of nodes with the highest degree, and then performs a supervised alignment by solving the orthogonal Procrustes problem. (en_US)
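The supervised alignment step mentioned in the abstract, solving the orthogonal Procrustes problem, has a well-known closed-form solution via SVD. The sketch below is a hypothetical illustration of that generic technique, not the thesis code; the function name `procrustes_align` and the toy data are this editor's assumptions:

```python
import numpy as np

def procrustes_align(A, B):
    """Return the orthogonal matrix R minimizing ||A @ R - B||_F
    (orthogonal Procrustes problem), via SVD of the cross-covariance."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

rng = np.random.default_rng(0)
B = rng.normal(size=(50, 8))                        # anchor-node embeddings in the old space
R_true, _ = np.linalg.qr(rng.normal(size=(8, 8)))   # unknown orthogonal transform
A = B @ R_true.T                                    # same anchors in the new "light" space

R = procrustes_align(A, B)
print(np.allclose(A @ R, B, atol=1e-8))             # prints True: the spaces are aligned
```

In the incremental setting the abstract describes, A would hold the light embedding of shared high-degree nodes and B their embedding in the existing space; once R is found, any newly embedded node can be mapped into the old space without retraining.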
dc.language.iso: Inglese (en_US)
dc.publisher: Università degli Studi di Parma. Dipartimento di Ingegneria e architettura (en_US)
dc.relation.ispartofseries: Dottorato di ricerca in Tecnologie dell'informazione (en_US)
dc.rights: © Gianfranco Lombardo, 2021 (en_US)
dc.rights: Attribution-NonCommercial-NoDerivatives 4.0 International
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/4.0/
dc.subject: word embedding (en_US)
dc.subject: neural networks (en_US)
dc.subject: machine learning (en_US)
dc.subject: representation learning (en_US)
dc.subject: graph embedding (en_US)
dc.subject: knowledge discovery (en_US)
dc.title: Neural network embedding: representation learning and latent knowledge extraction for data mining applications (en_US)
dc.type: Doctoral thesis (en_US)
dc.subject.miur: ING-INF/05 (en_US)
Appears in Collections:Tecnologie dell'informazione. Tesi di dottorato

Files in This Item:
- Tesi_PHD_Gianfranco_Lombardo.pdf (3.42 MB, Adobe PDF) - View/Open
- Relazione finale Dottorato in Tecnologie dell’Informazione (2017-2020).pdf (75.62 kB, Adobe PDF, Restricted Access) - Request a copy


This item is licensed under a Creative Commons License.