1 Introduction
The discovery of interactions among entities is one of the main link prediction tasks over knowledge graphs. Specifically, drug-target interaction discovery, i.e., identifying the proteins that are targets of drugs, is a crucial task, given that, on average, bringing a new drug to the market costs billion and takes more than 10 years. Several approaches have been defined to tackle the problem of drug-target interaction discovery (e.g., [2, 4]). Albeit effective, existing approaches are not able to exploit the semantics encoded in the main features of the drugs or targets to enhance prediction. We present SimTransE, an approach that exploits both the similarities between entities, e.g., drugs and targets, and their connections in a knowledge graph. SimTransE uses these features to represent entities in a vector space. SimTransE is based on TransE, which uses gradient descent to learn embeddings from the relations stated in a knowledge graph. Similarly, SimTransE optimizes the distance between embeddings considering the existing interactions between drugs and targets; additionally, it takes into consideration domain similarity values between drugs and between targets. Embeddings generated by SimTransE are used to predict new interactions by applying the homophily principle. We conduct an empirical evaluation to assess the quality of SimTransE with respect to TransE over benchmarks of interactions between drugs and targets. The observed results suggest that considering similarity empowers SimTransE and allows for the discovery of interactions between drugs and targets that could not be identified by the baseline version of TransE.
2 The SimTransE Approach

Fig. 1. The Architecture. SimTransE receives an RDF knowledge graph and similarities among its entities. The output is a set of predicted interactions.
2.1 Architecture
The SimTransE architecture comprises a pipeline with three main components. Figure 1 shows the interaction between these components and the data flowing among them. The Data Processor receives an RDF graph and creates dictionaries and matrices understandable by SimTransE. First, three entity dictionaries are created, i.e., left entities (the subjects), right entities (the objects), and relational entities. These dictionaries are used throughout the pipeline to create vector embeddings. Secondly, two different sets of binary sparse matrices are created, representing the positive and the negative interactions of entities. Lastly, similarity matrices are built, i.e., given m left entities and n right entities, we prepare two square matrices, of sizes m × m and n × n, that keep the pairwise similarity scores among left entities and among right entities, respectively. The Model Trainer component receives as input the entity and interaction dictionaries and the similarity matrices. The Model Trainer resorts to stochastic gradient descent to optimize the position and direction of the embeddings in a vector space. It uses interactions and similarities between entities to solve the optimization problem and generates embeddings as output (Table 1 shows the SimTransE interaction and objective functions). The Predictor component takes as input the generated embedding vectors, interactions, and thresholds. Using the embeddings and thresholds, this component iterates over all the entities and identifies the interactions of each entity with every other entity. The Predictor component also calculates precision and recall, as well as the Area Under the Receiver Operating Characteristic Curve (AUC) and the Area Under the Precision-Recall Curve (AUPRC).
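The Data Processor step can be illustrated with a minimal Python sketch. It is not the SimTransE implementation: the function names, the example triples, the storage of a sparse binary matrix as a set of (row, column) index pairs, and the treatment of every unobserved pair as a negative interaction are all assumptions made for illustration.

```python
def build_dictionaries(triples):
    """Map subjects (left), objects (right), and relations to integer ids.

    Hypothetical helper: ids are assigned in order of first appearance.
    """
    left, right, rel = {}, {}, {}
    for s, p, o in triples:
        left.setdefault(s, len(left))
        rel.setdefault(p, len(rel))
        right.setdefault(o, len(right))
    return left, right, rel


def build_interactions(triples, left, right):
    """Sparse binary matrix of observed interactions as (row, col) pairs."""
    return {(left[s], right[o]) for s, _, o in triples}


# Toy RDF-like triples (invented for this example).
triples = [
    ("drug:A", "interactsWith", "target:X"),
    ("drug:A", "interactsWith", "target:Y"),
    ("drug:B", "interactsWith", "target:Y"),
]

left, right, rel = build_dictionaries(triples)
positives = build_interactions(triples, left, right)

# Assumption: every pair that is not observed is treated as a negative.
all_pairs = {(i, j) for i in range(len(left)) for j in range(len(right))}
negatives = all_pairs - positives

print(left, right, rel)
print(sorted(positives), sorted(negatives))
```

In this toy graph, the only unobserved pair is (drug:B, target:X), so it is the single entry of the negative matrix.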
2.2 Learning Vector Embeddings
State-of-the-art approaches use only connectivity patterns between entities to learn the embeddings and perform predictions. Using just interactions among entities is not enough in real-world applications where domain-specific knowledge plays a relevant role (e.g., during the prediction of drug-target interactions [8]). There are very few known interactions, so the positive and negative classes are highly imbalanced, which impacts the accuracy of the predictions. To tackle this imbalance, SimTransE incorporates into the learning process not only the interactions between entities but also the similarities between them. SimTransE duplicates examples of the positive class and adds a set of positive examples generated using the similarity matrices. The similarity score is used as the weight of each generated example in the learning process.
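One possible reading of this step is sketched below in Python. The sketch is an assumption, not the SimTransE objective: the propagation rule (a known pair is copied to every sufficiently similar drug or target), the `min_sim` cut-off, and the use of the similarity score as a multiplier of a margin-based ranking loss are hypothetical choices. Only the distance ||h + r - t|| is taken from TransE.

```python
def augment_positives(positives, drug_sim, target_sim, min_sim=0.8):
    """Return {(drug, target): weight}; known pairs keep weight 1.0.

    Hypothetical rule: if (d, t) is known and d2 is similar to d, then
    (d2, t) becomes a positive example weighted by sim(d, d2); likewise
    for targets similar to t.
    """
    weighted = {pair: 1.0 for pair in positives}
    for d, t in positives:
        for d2, s in enumerate(drug_sim[d]):
            if d2 != d and s >= min_sim:
                weighted[(d2, t)] = max(weighted.get((d2, t), 0.0), s)
        for t2, s in enumerate(target_sim[t]):
            if t2 != t and s >= min_sim:
                weighted[(d, t2)] = max(weighted.get((d, t2), 0.0), s)
    return weighted


def distance(h, r, t):
    """TransE-style L1 distance ||h + r - t||_1."""
    return sum(abs(hi + ri - ti) for hi, ri, ti in zip(h, r, t))


def weighted_margin_loss(weight, pos_dist, neg_dist, margin=1.0):
    """Margin ranking loss scaled by the example weight (an assumption)."""
    return weight * max(0.0, margin + pos_dist - neg_dist)


# Toy data: one known interaction, two drugs, two targets.
drug_sim = [[1.0, 0.9], [0.9, 1.0]]
target_sim = [[1.0, 0.3], [0.3, 1.0]]
examples = augment_positives({(0, 0)}, drug_sim, target_sim)
print(examples)
```

Here drug 1 is similar enough to drug 0 to inherit its interaction with target 0 at weight 0.9, whereas target 1 falls below the cut-off and yields no new example.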


Table 1. SimTransE interaction and objective functions to learn embeddings (columns: interaction functions; objective functions).
2.3 Predicting Links
The fundamental task of link prediction is to identify a relation between two entities. Yang et al. [9] formally define link prediction as a task over a network G = (V, E), where V is the set of nodes and E is the set of edges. The main challenge in this task is to predict whether there is or will be a link e(u, v) between a pair of nodes u and v, where u, v ∈ V and e(u, v) ∉ E. To perform link prediction, SimTransE uses the trained vector embeddings and calculates the distance of each entity to every other entity with respect to the relation between them. Based on this calculated distance and a given threshold, SimTransE decides whether the input entities are related or not. SimTransE ranks each entity on the basis of distance and assigns a probability by comparing it with the distance of other entities. If this probability is greater than the given threshold, then SimTransE includes the link in the output.
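The prediction step can be sketched as follows. This is a minimal illustration under stated assumptions: the text does not specify how distances are turned into probabilities, so the min-max normalization used here, the function names, and the toy embeddings are hypothetical.

```python
def l1_distance(h, r, t):
    """TransE-style L1 distance ||h + r - t||_1."""
    return sum(abs(a + b - c) for a, b, c in zip(h, r, t))


def predict_links(head, relation, candidates, threshold=0.5):
    """Rank candidate tails by distance and keep those above the threshold.

    Assumption: a distance is mapped to a score in [0, 1] by comparing it
    with the smallest and largest distance among all candidates.
    """
    dists = {name: l1_distance(head, relation, vec)
             for name, vec in candidates.items()}
    lo, hi = min(dists.values()), max(dists.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all distances tie
    scores = {name: 1.0 - (d - lo) / span for name, d in dists.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, score) for name, score in ranked if score > threshold]


# Toy embeddings (invented): one drug, one relation, three targets.
drug = [0.0, 0.0]
interacts = [1.0, 0.0]
targets = {"X": [1.0, 0.0], "Y": [1.0, 1.0], "Z": [3.0, 0.0]}

strict = predict_links(drug, interacts, targets, threshold=0.5)
loose = predict_links(drug, interacts, targets, threshold=0.4)
print(strict, loose)
```

Lowering the threshold admits target Y, whose distance lies halfway between the closest and the farthest candidate; this is the precision/recall trade-off that the threshold controls.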
To evaluate link prediction, we measure: Precision, the ratio of correctly predicted interactions to total predicted interactions; Recall, the ratio of correctly predicted interactions to expected interactions; Area Under the Precision-Recall Curve (AUPRC), which does not consider true negatives, since neither precision nor recall considers them; and, finally, Area Under the ROC Curve (AUC), which we use since it works best when the dataset exhibits the problem of imbalanced classes [9] (Fig. 2).
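For concreteness, Precision, Recall, and AUC can be computed as in the sketch below. It uses only the Python standard library, and the predicted pairs, scores, and labels are invented. The AUC is computed through its rank interpretation, i.e., the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one.

```python
def precision_recall(predicted, actual):
    """Precision and recall of a set of predicted pairs against the truth."""
    tp = len(predicted & actual)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    return precision, recall


def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


predicted = {("A", "X"), ("A", "Y"), ("B", "Y")}
actual = {("A", "X"), ("B", "Y"), ("B", "Z")}
precision, recall = precision_recall(predicted, actual)

auc = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
print(precision, recall, auc)
```

The pairwise formulation is quadratic in the number of examples; it is chosen here for readability rather than efficiency.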
3 Empirical Evaluation

Fig. 2. SimTransE exhibits good performance in both datasets.
Results and Discussion: From the output of SimTransE, we calculated true and false positives and true and false negatives. From these values, we derived Precision, Recall, AUC, and AUPRC. We apply a blocking method on the generated similarity-based interactions through percentiles; four percentiles are considered: 80, 90, 95, and 100. Link prediction is validated following 10-fold cross-validation, and we report the mean across the results of the ten folds. Based on the observed outcomes, we can positively answer RQ1, i.e., SimTransE performs well on all the datasets and outperforms the baseline method TransE in all cases. These results suggest that similarities between entities, e.g., drugs and targets, have a positive impact on both the learning process and the link prediction task. We observe, as well, that by increasing the number of connections between drugs and targets (e.g., by using SemEP results), the effectiveness of the approach improves even further. A few interactions are not predicted properly although they are present in the training set. For most of them, we find that drugs and targets with few interactions are difficult for SimTransE to learn. This situation improves after using the interactions predicted by SemEP. Therefore, RQ2 is positively answered too.
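The percentile-based blocking can be read in more than one way. The sketch below shows one hypothetical interpretation, in which only the similarity-based interactions whose weight reaches the p-th percentile of all generated weights are kept. The nearest-rank percentile definition, the function names, and the toy weights are assumptions.

```python
def percentile_cutoff(values, p):
    """Nearest-rank percentile for an integer p in (0, 100]."""
    ordered = sorted(values)
    # Integer ceiling of p * n / 100, with a minimum rank of 1.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def block_by_percentile(weighted_pairs, p):
    """Keep the pairs whose weight is at or above the p-th percentile."""
    cutoff = percentile_cutoff(weighted_pairs.values(), p)
    return {pair: w for pair, w in weighted_pairs.items() if w >= cutoff}


# Ten similarity-based interactions with weights 0.1, 0.2, ..., 1.0.
generated = {("drug", i): i / 10 for i in range(1, 11)}

kept = {p: len(block_by_percentile(generated, p)) for p in (80, 90, 95, 100)}
print(kept)
```

Under this reading, a higher percentile keeps fewer but more similar generated interactions, so the percentile trades the amount of added training signal against its reliability.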
4 Conclusions
In this paper, we presented SimTransE, a method that analyzes interactions in knowledge graphs to predict links, based on the vectorization of the entities. To learn the embeddings, SimTransE uses not only the interactions among entities but also the similarity values between them. To test the accuracy of SimTransE, we compared its results against TransE, a prediction model for translational embeddings that uses only interactions among entities. SimTransE exhibited high accuracy and competitive results, and outperformed TransE, one of the state-of-the-art approaches. The observed results suggest that combining interaction- and similarity-related semantics in the embeddings empowers the prediction model over knowledge graphs. In future work, we plan to conduct a more exhaustive evaluation to guarantee the reproducibility of the results, as well as a comparison with other embedding creation models, e.g., TransH [6] and TransG [7].
