Please use this identifier to cite or link to this item: https://hdl.handle.net/1889/4262
Full metadata record
DC Field: Value (Language)
dc.contributor.advisor: Poggi, Agostino
dc.contributor.advisor: Pardalos, Panos M.
dc.contributor.author: Lombardo, Gianfranco
dc.date.accessioned: 2021-04-08T09:50:09Z
dc.date.available: 2021-04-08T09:50:09Z
dc.date.issued: 2021
dc.identifier.uri: https://hdl.handle.net/1889/4262
dc.description.abstract: Big data is what machine learning models need to learn concepts and tasks with sufficient generalization margins. It can also be valuable for data mining applications whose goal is to extract latent knowledge. Much of the data we can collect every day is unstructured: no predefined schema is present. Examples include documents, messages, and interactions on social media platforms and e-commerce websites. The unstructured nature of these data inevitably adds challenges to the cases above. Machine learning models need input data as real-valued vectors (a feature or design matrix) for supervised and unsupervised learning. The same issue arises when we are interested in quantitative information (e.g., the similarity between two or more documents, or other statistics). Another interesting case concerns network-science techniques for analyzing data that have (or suggest) a graph structure. Many systems can be efficiently described as nodes that interact with each other (e.g., social networks, recommendation systems, interactions among proteins, financial markets). In this case, we cannot apply machine learning techniques directly because of the lack of a vector representation. Advances in the field of neural networks make it possible to learn feature vectors directly from the input data distribution. This task can be seen as a pre-processing step in a modern machine learning project, and it involves machine learning itself. This automatic feature extraction is also known as "representation learning". We can summarize it as learning a vector representation of input data in a supervised or unsupervised way. We can learn these representations, or embeddings, for different kinds of data: words, entire documents, nodes or edges in a graph, images, and signals.

Moreover, embeddings encode data while preserving and providing additional information; for example, semantically similar words end up close in their vector representations. The choice of how to achieve this preliminary task influences all subsequent stages of a typical data-mining pipeline. The challenge consists in preserving as much information as possible while looking for a structure. Second, given the promising results achieved recently, it is interesting to exploit the resulting vector spaces for knowledge extraction as well. In light of the current challenges and advances in the field of representation learning with unstructured data, my research activity has focused on this topic. In particular, this thesis reports the results achieved in two main directions:

- Using representation learning techniques based on neural networks to extract latent knowledge from documents: if neural networks can capture fundamental aspects of data by learning a different representation in the output layer, and considering that this representation makes classification or clustering easier, can we exploit these techniques to extract new knowledge? Part of my research tries to answer this question. In particular, two use cases are reported: the first analyzes scientific documents from the public repository Scopus, combining word embeddings and human mobility metrics. The second uses neural network embeddings to extract knowledge from the financial reports of thousands of American companies in the stock market.

- Overcoming current limits of representation learning on graphs: machine learning models can benefit from input data derived from graph structures. However, to apply most of the available models it is necessary to obtain a vector form for nodes and edges. The most promising approach is neural network embedding, and the state of the art is represented by the Node2Vec algorithm. However, there are still two open problems in this field: scalability (learning a representation of large-scale graphs) and the lack of support for dynamic contexts: if a new node joins the network, it is necessary to recompute the representation of the entire graph. Part of my doctorate addresses these two problems. A first contribution is an actor-based version of Node2Vec that overcomes scalability issues by distributing the bottlenecks among agents that organize themselves with different behaviors to achieve the embedding of large-scale graphs. A second contribution is the development of a novel algorithm for incremental feature learning over graphs. The algorithm exploits properties of scale-free graphs to encode new nodes without retraining the model over all the nodes. It computes a light embedding over the 20% of nodes with the highest degree, and then performs a supervised alignment by solving the orthogonal Procrustes problem. (en_US)
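The supervised alignment step mentioned in the abstract, solving the orthogonal Procrustes problem, has a well-known closed-form solution via SVD. The sketch below is a hypothetical illustration of that generic technique, not the thesis code; the function name `procrustes_align` and the toy data are this editor's assumptions:

```python
import numpy as np

def procrustes_align(A, B):
    """Return the orthogonal matrix R minimizing ||A @ R - B||_F
    (orthogonal Procrustes problem), via SVD of the cross-covariance."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

rng = np.random.default_rng(0)
B = rng.normal(size=(50, 8))                        # anchor-node embeddings in the old space
R_true, _ = np.linalg.qr(rng.normal(size=(8, 8)))   # unknown orthogonal transform
A = B @ R_true.T                                    # same anchors in the new "light" space

R = procrustes_align(A, B)
print(np.allclose(A @ R, B, atol=1e-8))             # prints True: the spaces are aligned
```

In the incremental setting the abstract describes, A would hold the light embedding of shared high-degree nodes and B their embedding in the existing space; once R is found, any newly embedded node can be mapped into the old space without retraining.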
dc.language.iso: Inglese (en_US)
dc.publisher: Università degli Studi di Parma. Dipartimento di Ingegneria e architettura (en_US)
dc.relation.ispartofseries: Dottorato di ricerca in Tecnologie dell'informazione (en_US)
dc.rights: © Gianfranco Lombardo, 2021 (en_US)
dc.rights: Attribution-NonCommercial-NoDerivatives 4.0 International
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/4.0/
dc.subject: word embedding (en_US)
dc.subject: neural networks (en_US)
dc.subject: machine learning (en_US)
dc.subject: representation learning (en_US)
dc.subject: graph embedding (en_US)
dc.subject: knowledge discovery (en_US)
dc.title: Neural network embedding: representation learning and latent knowledge extraction for data mining applications (en_US)
dc.type: Doctoral thesis (en_US)
dc.subject.miur: ING-INF/05 (en_US)
Appears in Collections:Tecnologie dell'informazione. Tesi di dottorato

Files in This Item:
- Tesi_PHD_Gianfranco_Lombardo.pdf (3.42 MB, Adobe PDF) - View/Open
- Relazione finale Dottorato in Tecnologie dell’Informazione (2017-2020).pdf (75.62 kB, Adobe PDF, Restricted Access) - Request a copy


This item is licensed under a Creative Commons License.