Paper deep dive
A Multi-Label Temporal Convolutional Framework for Transcription Factor Binding Characterization
Pietro Demurtas, Ferdinando Zanchetta, Giovanni Perini, Rita Fioresi
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%
Last extracted: 3/22/2026, 6:31:14 AM
Summary
This paper introduces a multi-label Temporal Convolutional Network (TCN) framework for predicting Transcription Factor (TF) binding sites on DNA sequences. By treating TF binding as a multi-label classification problem rather than binary, the model captures cooperative regulatory mechanisms and correlations between TFs. The TCN architecture outperforms traditional RNN-based baselines in both performance and stability, while explainability methods like Integrated Gradients and TF-MoDISco are used to identify biologically meaningful motifs.
Entities (5)
Relation Signals (3)
Transcription Factor → regulates → Gene Expression
confidence 100% · Transcription factors (TFs) regulate gene expression through complex and co-operative mechanisms.
Temporal Convolutional Network → predicts → Transcription Factor
confidence 95% · Our deep learning models are based on Temporal Convolutional Networks (TCNs), which are able to predict multiple TF binding profiles
Temporal Convolutional Network → outperforms → Recurrent Neural Network
confidence 90% · The TCN-based model outperformed the baseline on almost all metrics across all datasets and labels.
Cypher Suggestions (2)
Find all deep learning architectures used for TF binding prediction · confidence 90% · unvalidated
MATCH (a:Architecture)-[:USED_FOR]->(t:Task {name: 'TF binding prediction'}) RETURN a.name
Identify biological entities regulated by Transcription Factors · confidence 85% · unvalidated
MATCH (tf:TranscriptionFactor)-[:REGULATES]->(target) RETURN tf.name, target.name
Abstract
Abstract: Transcription factors (TFs) regulate gene expression through complex and co-operative mechanisms. While many TFs act together, the logic underlying TF binding and their interactions is not fully understood yet. Most current approaches for TF binding site prediction focus on individual TFs and binary classification tasks, without a full analysis of the possible interactions among various TFs. In this paper we investigate DNA TF binding site recognition as a multi-label classification problem, achieving reliable predictions for multiple TFs on DNA sequences retrieved from public repositories. Our deep learning models are based on Temporal Convolutional Networks (TCNs), which are able to predict multiple TF binding profiles, capturing correlations among TFs and their cooperative regulatory mechanisms. Our results suggest that multi-label learning with reliable predictive performance can reveal biologically meaningful motifs and co-binding patterns consistent with known TF interactions, while also suggesting novel relationships and cooperation among TFs.
Tags
Links
- Source: https://arxiv.org/abs/2603.12073v1
- Canonical: https://arxiv.org/abs/2603.12073v1
PDF not stored locally. Use the link above to view on the source site.
Full Text
46,763 characters extracted from source content.
A Multi-Label Temporal Convolutional Framework for Transcription Factor Binding Characterization
P. Demurtas, F. Zanchetta, G. Perini, R. Fioresi

Abstract
Transcription factors (TFs) regulate gene expression through complex and cooperative mechanisms. While many TFs act together, the logic underlying TF binding and their interactions is not fully understood yet. Most current approaches for TF binding site prediction focus on individual TFs and binary classification tasks, without a full analysis of the possible interactions among various TFs. In this paper we investigate DNA TF binding site recognition as a multi-label classification problem, achieving reliable predictions for multiple TFs on DNA sequences retrieved from public repositories. Our deep learning models are based on Temporal Convolutional Networks (TCNs), which are able to predict multiple TF binding profiles, capturing correlations among TFs and their cooperative regulatory mechanisms. Our results suggest that multi-label learning with reliable predictive performance can reveal biologically meaningful motifs and co-binding patterns consistent with known TF interactions, while also suggesting novel relationships and cooperation among TFs.

1 Introduction
Transcription factors (TFs) rarely act alone; they frequently operate through cooperative mechanisms, forming complexes such as homo/hetero dimers, where distinct combinations of TFs can elicit different regulatory effects [24, 33, 37]. In prokaryotes, individual TFs typically recognize relatively long DNA motifs, which are often sufficient to uniquely identify their target genes. In contrast, TFs in organisms with larger and more complex genomes bind shorter DNA sequences, which are insufficient to define unique genomic locations on their own.
Moreover, the development and maintenance of multicellular organisms require the emergence of intricate molecular systems capable of implementing combinatorial regulatory logic [26]. To overcome these challenges, eukaryotic organisms have evolved mechanisms for cooperative DNA recognition involving multiple TFs, through direct and indirect protein-protein interactions [31], indirect interactions mediated by chromatin architecture, or through co-binding to adjacent or partially overlapping DNA motifs [2]. Each interaction mechanism confers distinct regulatory properties to the resulting TF complex [26]. A representative example of TF cooperation is the formation of functional heterodimers [37]. Several eukaryotic TFs are unable to bind DNA as monomers and instead require physical interaction with another TF, often a member of the same family, to form a functional dimer capable of recognizing specific DNA sequences.

Figure 1: E2F4-DP2-DNA complex [44]

A classical example of such a mechanism is the MYC/MAX heterodimer, which plays a central role in transcriptional regulation [3, 1]. While homo/hetero dimerization represents a common form of TF cooperation, it is only a subset of a much broader spectrum of regulatory complexes, whose combinatorial logic and mechanisms of action remain largely unexplored [12]. In this paper we investigate the prediction of DNA binding for multiple TFs as a multi-label classification problem via deep learning. This perspective enables the simultaneous prediction of binding events for multiple TFs and provides an opportunity to explore correlations among TFs that reflect cooperative or combinatorial regulatory mechanisms, like the complex in Fig. 1, which was characterized via a costly experimental procedure, and to go beyond it. Our techniques can indeed refine and guide investigation, complementing expensive lab protocols.
To date, deep learning approaches for DNA-binding recognition have focused on binary classification tasks, where the goal is to predict whether a given genomic region is bound by a single TF [39, 42, 18].

Figure 2: Multiple TFs binding to DNA

In the present work, we take advantage of novel deep learning architectures specifically designed for sequential data modelling, namely Temporal Convolutional Networks (TCNs), for multi-label TF binding site prediction, see Figure 2. TCNs exhibit a decisive advantage over traditional recurrent architectures, such as Recurrent Neural Networks (RNNs), which have been widely applied to sequence modelling tasks, including biological sequence analysis. RNNs, however, suffer from well-known limitations, including vanishing and exploding gradients [21], limited parallelizability, and difficulties in capturing long-range dependencies. More recently, attention-based models, most notably Transformers [36], have achieved state-of-the-art performance across a wide range of sequence and language modelling tasks [8, 16]. Their ability to model global dependencies and their scalability have made them highly successful in many domains. Nevertheless, attention-based architectures [11] come with significant drawbacks, including substantial data requirements, high computational costs during training, and limited interpretability. These issues are particularly problematic in biological applications, where data are often noisy or scarce and model transparency is essential. Temporal Convolutional Networks (TCNs) address the limitations of recurrent models by enabling parallel computation and stable gradient propagation [29], and are particularly suited for biological data analysis, where they can outperform attention-based models. TCNs were first introduced as a generative audio model named WaveNet [35] and later appeared as an action segmentation model in Lea et al. [25].
The general architecture proposed by Bai et al. [6], however, differs from those architectures by being much simpler. In fact, the model proposed by Bai et al., besides the basic concept of temporal convolutions listed before, uses only depth, dilation and residual connections to build the effective history of the model. The effective history is defined as the ability of the network to look into the past to make a prediction; in other words, it is the size of the context window. TCNs have been successfully applied in several fields, including classification of satellite image time series [29], clinical length-of-stay and mortality prediction [7], and energy-related time series forecasting [23]. Moreover, compared to attention-based approaches such as Transformer-based architectures [36], TCNs typically require less data and offer a favorable trade-off between model capacity and efficiency. In addition, their convolutional backbone is naturally well-suited to modelling biological sequences. Key properties such as tunable receptive fields allow TCN-based architectures to capture long-range dependencies while maintaining architectural simplicity, making them especially amenable to downstream explainability analyses. In the present study, we investigate whether deep learning models, such as RNNs and TCNs, can learn correlations among TF labels solely from DNA sequence data obtained from public repositories. In addition, we apply explainability methods to trained models to assess whether they can reveal biologically relevant sequence features and plausible interactions among TFs. Through this integrated biological and computational approach, we seek to gain new insights into the cooperative nature of transcriptional regulation.

2 Materials and Methods
We describe the datasets and the deep learning architectures we used in our analysis of Transcription Factor Binding Sites (TFBS).
To start with, we describe separately the datasets we created from raw ChIP-seq data for our multi-label TFBS recognition study, and the dataset, not curated by us, that we used to benchmark our algorithms. Note that we used the former to solve a multi-label classification problem, while we used the latter to solve a binary classification problem, as it was created for that purpose (see [41]).

2.1 Datasets for multi-label TFBS recognition
For the problem of multi-label TFBS recognition we constructed 3 datasets using publicly available ChIP-seq experiments provided by the ENCODE Consortium on the ENCODE portal [14]. In particular, we used all the ChIP-seq profiles available on the ENCODE portal matching the selection criteria we now describe. Starting from MYC, a well-characterized TF with known cooperative behaviour, we followed two different approaches to determine which other TFs to include in the dataset. The first approach yields the datasets we call D-5TF-3CL and D-7TF-4CL. We obtain them by selecting 5 and 7 additional TFs based on motif enrichment analysis (SEA) in MYC-bound regions and data availability across 3 and 4 cell lines, respectively, as depicted in Table 1, where Group 0 represents D-5TF-3CL and Group 1 represents D-7TF-4CL. The second approach yields the dataset we call H-M-E2F, obtained by selecting a hand-crafted set of TFs with putative interactions with MYC. More specifically, we downloaded ChIP-seq data targeting E2F1, E2F6, E2F8 and MYC in the K562 cell line from the ENCODE Regulation 'TF Clusters' track. In summary, the enrichment-driven approach resulted in the D-5TF-3CL and D-7TF-4CL datasets, while the hand-curated selection of TFs resulted in the H-M-E2F dataset.
TF      K562     GM12878  HepG2    H1-hESC  A549     HeLa-S3
ATF3    17244    1677     3291     4808     6580     -
SP1     7206     18248    25477    15110    -        -
TAF1    15246    14278    16659    20547    9984     16100
USF2    3083     9022     6291     6952     -        12306
c-Myc   109625   3690     4413     5768     -        13061
ELF1    27780    23008    18001    -        8611     -
MAZ     33323    18952    12090    -        -        13409
IRF1    32550    -        -        -        -        -
ETS1    10726    4120     -        -        5525     -
ELK1    2961     5584     -        -        -        4809
E2F4    8181     3440     -        -        -        2831
IRF4    -        17771    -        -        -        -
SP2     3124     -        2626     2469     -        -
CTCF    -        -        191734   171742   180057   149989
IRF3    -        -        684      -        -        1587
ELK4    -        -        -        -        -        5916
STAT2   4963     -        -        -        -        -
ATF2    -        23467    -        5998     -        -
SP4     -        -        -        5752     -        -

Table 1: ChIP-seq peaks available for each TF with respect to the cell line. Group 0 and Group 1 are noted in color.

We collected ChIP-seq profiles for the selected TFs from the ENCODE portal [14]. Peak intersections were then computed across the selected profiles. For each overlapping region, we extracted a 1000 bp sequence centred at the midpoint. Each sequence was encoded with one-hot encoding and labelled with the TFs corresponding to the intersecting peaks. The datasets thus obtained consist of a set of sequence features X and a set of labels Y, where x_i ∈ X, 0 ≤ i ≤ |X|, represents the sequence vector of the i-th sample, and y_i = [l_i^0, ..., l_i^k] ∈ Y, 0 ≤ i ≤ |Y|, represents the label vector consisting of k binary variables l_i^k encoding the presence/absence of the k TFs; as usual, |X| and |Y| denote the number of samples in the dataset X with labels Y. It is worth noting that while in the binary classification setting we distinguish between bound (positive) and unbound (negative) sequences, in our main formalization there is no absolute negative or unbound label; each sequence example is labelled with one or more TFs. We train our proposed deep learning models to jointly learn the probability of the label vector y_i, modelling each l_i^k as an individual binary prediction, therefore predicting binding specificities for all TFs simultaneously.
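The encoding step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the TF panel, the handling of unknown bases, and the window helper are assumptions made for the example.

```python
import numpy as np

TF_PANEL = ["E2F1", "E2F6", "E2F8", "MYC"]   # illustrative label order
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def centered_window(peak_start, peak_end, length=1000):
    """Coordinates of a 1000 bp window centred at the midpoint of a region."""
    mid = (peak_start + peak_end) // 2
    return mid - length // 2, mid + length // 2

def one_hot(seq):
    """Encode a DNA sequence as a (4, L) one-hot matrix x_i.
    Unknown bases (e.g. N) are left as all-zero columns (an assumption)."""
    x = np.zeros((4, len(seq)), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        if base in BASE_INDEX:
            x[BASE_INDEX[base], pos] = 1.0
    return x

def label_vector(bound_tfs):
    """Multi-label target y_i = [l_i^0, ..., l_i^k] over the TF panel."""
    return np.array([float(tf in bound_tfs) for tf in TF_PANEL], dtype=np.float32)

start, end = centered_window(12_400, 12_600)   # midpoint 12500 -> (12000, 13000)
x = one_hot("ACGTN")                           # (4, 5) matrix; 'N' column is all zeros
y = label_vector({"MYC", "E2F1"})              # [1, 0, 0, 1]
```

Each (x_i, y_i) pair produced this way feeds a model that treats every l_i^k as an independent binary output, matching the multi-label formulation above.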
2.2 Dataset for binary TFBS recognition
To benchmark our algorithms we used the dataset curated by Zeng et al. [41], which includes 165 ChIP-seq datasets selected by the authors of op. cit. to ensure diversity across cell lines. The source of this data is the ENCODE consortium, the same source we used to create our multi-label datasets. This curated dataset was provided to us by the authors of [41] and was already labeled by them. Each dataset was provided to us already split into train (80%) and test (20%) sets with positive and negative instances. During training and development we further split each train set to obtain a validation set consisting of 20% of the initial train set. DNA sequences are 101 bp long and labeled binarily to indicate Transcription Factor Binding Site (TFBS) presence.

2.3 Deep Learning Architectures
We designed and evaluated several Deep Learning (DL) models to address the main goal of this work, namely the TF binding prediction task as a multi-label problem. We implemented a Temporal Convolutional Network (TCN)-based model alongside a hybrid baseline model based upon Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs). The baseline model is almost identical to the TCN model except for the TCN blocks, which are substituted by two layers of Bi-LSTM with hidden_dimension = 50. A visual representation can be found in Figure 6, and additional architecture details in Table 2. TCNs are convolutional architectures for sequence modelling [6] which have demonstrated strong performance across different domains. TCNs' convolutional backbone enables effective modelling of long-range dependencies, arguably granting them an advantage over RNNs. In addition, convolutional architectures are characterized by a strong inductive bias suitable for local pattern detection, making them particularly effective for biological sequence analysis.
TCNs have two distinguishing features:
• All the convolution operations in the network are causal: there is no information flowing from future to past.
• The architecture can take as input a sequence of arbitrary length and map it to an output sequence of the same length, similarly to RNNs.
We will now briefly introduce the main features of the TCN architecture proposed by Bai et al. [6], which we used as the foundation to develop our temporal convolutional models. Causal Convolutions. The classic convolution layer is rendered causal by appropriate use of padding. In fact, the convolution operation is allowed access only to past events by zero-padding the sequence asymmetrically at the start. The padding thus ensures that there is no information leakage from future elements of the sequence, as shown in Fig. 3.

Figure 3: Causal convolutions by the use of padding; on the left, 1D convolution with "valid" padding; on the right, 1D convolution with left padding enforcing causality [22].

The choice of appropriate padding also ensures that the output size matches the input size of the sequence. Dilated Convolutions. The effective history grows linearly with the depth of the network, making it challenging to achieve a sufficiently large context window for longer sequences. The effective history of the network also depends on the convolutions' receptive field; for this reason, dilated convolutions represent an ideal solution to this challenge. Dilated convolutions in fact enable an exponentially larger receptive field compared to classic convolutions [38], thus increasing the effective history of the network without increasing its depth. The use of dilated convolutions in TCNs was first introduced in [35]. More formally, a dilated convolution is defined as:

(x ∗_d F)(s) = Σ_{i=0}^{k−1} F(i) · x_{s−d·i}

where d represents the dilation factor and k represents the kernel size. When d = 1, dilated convolution and regular convolution are equivalent.
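The causal dilated convolution above, and the resulting growth of the effective history, can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation; the receptive-field helper assumes dilations doubling per level with two convolutions per residual block, in the style of Bai et al.

```python
import numpy as np

def causal_dilated_conv(x, f, d=1):
    """(x *_d f)(s) = sum_{i=0}^{k-1} f(i) * x[s - d*i].
    Asymmetric zero left-padding keeps the output the same length as the
    input and prevents any information flow from future positions."""
    k = len(f)
    pad = (k - 1) * d
    xp = np.concatenate([np.zeros(pad), np.asarray(x, dtype=float)])
    return np.array([sum(f[i] * xp[pad + s - d * i] for i in range(k))
                     for s in range(len(x))])

def receptive_field(kernel_size, num_blocks, convs_per_block=2):
    """Effective history of a TCN stack whose blocks use dilations 1, 2, 4, ...
    Each causal conv with kernel k and dilation d extends it by (k - 1) * d."""
    rf = 1
    for level in range(num_blocks):
        rf += convs_per_block * (kernel_size - 1) * 2 ** level
    return rf

y = causal_dilated_conv([1, 2, 3, 4, 5, 6, 7, 8], f=[1.0, 1.0], d=2)
# y[s] = x[s] + x[s-2] (zeros before the start): [1, 2, 4, 6, 8, 10, 12, 14]
```

With kernel size 3 and six blocks, the receptive field already spans a few hundred positions, which is why depth and dilation together, rather than depth alone, build the context window.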
Residual Connections. Residual blocks were first introduced by He et al. [20]. Residual connections work by adding an identity mapping parallel to the convolutional block and then summing it with the output. More formally, for a generic layer F with input x, the output O of the residual block can be defined as:

O = σ(F(x) + x)

where σ(·) is the activation function of choice. It is worth noting that x and F(x) might not have the same dimensions, in which case a linear projection is added.

Figure 4: A residual block

Residual blocks allow the layers to learn just the residual feature map, while the previous one is carried by the identity connection. Empirical results show that convolutional layers perform much better when learning the residual feature map rather than the whole feature map itself. This enables layers to learn modifications to the identity mapping instead of the full transformation. In addition, residual connections stabilize larger networks by allowing easier propagation of a fine-tuned signal; in fact, propagating the input unmodified requires just setting all the layer's weights to 0, which is much simpler than learning the identity operation. Due to the dependence of the temporal convolution's effective history on the depth of the network, residual connections are an optimal choice for TCNs, as they allow for deeper architectures.

Figure 5: Temporal Convolutional Networks as proposed by Bai et al. [6].

Parameter          H-M-E2F   D-7TF-4CL   D-5TF-3CL
batch size         64        64          64
dropout ratio      0.5       0.5         0.5
epochs             50        50          50
learning rate      0.00258   0.00508     0.00219
MLP hidden size    100       100         100
CNN kernel number  32        32          32
CNN layers         2         2           2
TCN kernel size    32        32          32
TCN block number   6         5           6

Table 2: Model hyperparameters.

2.4 Attribution Methods
To gain insight into the DNA patterns learned by the models, we applied several explainability techniques.
Attribution scores that quantify the contribution of each nucleotide to the model's output were computed using Integrated Gradients [34], with di-nucleotide shuffled sequences as baselines. Attribution was performed separately for each target TF. Subsequently, we used TF-MoDISco [32] to identify and extract informative seqlets, short genomic sequences with high information content, from the attribution maps.

2.5 Implementation
We implemented our models with the PyTorch [4] framework, performing hyperparameter tuning using the Tree of Parzen Estimators [9] provided by the hyperopt library [10]. In addition, we also employed the MLFlow [40], Pandas [27], scikit-learn [28] and NumPy [5] Python libraries during the development of our models and training scripts. We trained all our models with the Adam optimizer, with weight decay set to 0, in conjunction with a custom learning rate scheduler. The scheduler is composed of a linear warmup phase for the first 20% of training epochs followed by cosine annealing to mitigate overfitting. During training we also employed early stopping with a patience mechanism. The final hyperparameters can be found in Table 2.

Figure 6: Diagrams of the implemented models' architectures. (a) TCN-based model: DNA sequence → embedding layer → CNN layers → TCN layers → classifier. (b) Hybrid CNN-RNN baseline: DNA sequence → embedding layer → CNN layers → Bi-LSTM layer → classifier.

3 Results and discussion
We detail our results on binary TF classification, as a sanity check and benchmark, and then on multi-label TF binding site classification, the main goal of our work.

3.1 Binary classification as benchmarking
In order to assess the general capabilities of the TCN architecture for the classification of genetic sequences, we also tested our model in the binary classification setting.
To this end, we used the dataset described in Section 2.2, which has been widely used for DNA–TF binding site prediction [43, 17].

       AP     AU-ROC  Accuracy
mean   0.88   0.87    0.80
std    0.11   0.11    0.11
min    0.49   0.49    0.48
25%    0.84   0.83    0.75
50%    0.91   0.90    0.82
75%    0.96   0.95    0.89
max    0.99   0.99    0.95

Table 3: Descriptive statistics for the distributions of AP, AU-ROC and Accuracy across the 165 binary datasets.

Figure 7: Violin plots of AUC, AP and accuracy (left), dataset size (middle) and model size in terms of number of parameters (right) across the 165 binary datasets.

Overall, the TCN-based model adapted to binary classification achieved satisfactory performance, detailed in Table 3 and Fig. 7, comparable and in line with the state of the art of TFBS binary classification [43, 17]. The comparison is unfavorable for our model, as we are comparing it with deep learning models specifically developed for binary classification, some of which are also trained with DNA shape data to further improve on the task, and thus have access to more data modalities besides sequence features. It is important to note, as highlighted by Figure 8, that while there is a moderate correlation between metrics and dataset size, namely Pearson coefficients of 0.61, 0.56 and 0.57 for accuracy, AP and AU-ROC respectively, the TCN-based model also obtained outstanding results on small datasets, achieving an AP below 0.7 on only 13 small datasets out of 165. These results confirm the suitability of our architecture for TFBS classification tasks in general; moreover, the model exhibited robust performance on several small datasets, showing satisfactory behaviour even when trained in a regime of data scarcity.

Figure 8: Joint distribution plot of AP and dataset size. Marginal boxplots show the marginal distributions of AP (x-axis) and dataset size (y-axis).
The joint plot shows the bivariate distribution as a scatterplot annotated with contour lines computed by Kernel Density Estimation (KDE).

3.2 Multi-label classification results
We trained and tested our TCN-based model as well as the Bi-LSTM-based baseline on the three multi-label datasets. We evaluated the performance of the trained models using both label-specific and summary metrics. As label-specific metrics we used F1-score [13], precision and recall [30]. As summary metrics we adopted the Area Under the Receiver Operating Characteristic curve (AUC or AU-ROC) [13] and Average Precision (AP), defined as:

AP = Σ_n (R_n − R_{n−1}) P_n

where P_n and R_n are respectively the precision and recall at the n-th decision threshold. AP approximates the area under the precision-recall curve without using linear interpolation. It has been noted that estimating the area under the curve with linear interpolation results in an overly optimistic metric [15, 19], while AP is a more conservative approximation. In addition, it is important to note that, while it is a widespread and generally accepted metric, AU-ROC is not suited to comparing performance across different datasets. This is because its baseline value depends on the dataset composition, making comparison across differently skewed datasets rather challenging. This is particularly relevant in cases of severe class imbalance, as in ours. Overall, the TCN-based model outperformed the baseline on almost all metrics across all datasets and labels. The TCN-based model obtained a significant overall gain over the RNN-based model, both in terms of performance and stability, as can be seen from the plot of F1 scores in Figure 9. We will now discuss in detail the performance of both models on each dataset.

Figure 9: F1-score comparison across the 3 different datasets.

The H-M-E2F dataset.
The H-M-E2F dataset is the least complex in terms of number of labels and the smallest in terms of number of sequences. The TCN-based model demonstrated a clear and substantial gain over the baseline, achieving satisfactory performance in areas where the baseline model proved inadequate. In terms of AP and AUC, the TCN model achieved a stable and considerable gain. On the label-specific metrics, the highest gain in F1 and recall was obtained on the most frequent class, namely MYC; this improvement is also reflected in the samples average, which shows the highest increase among the considered averaging methods, as expected. The class with the highest precision improvement, on the other hand, is E2F1. The performance on E2F8, while showing a gain of similar magnitude to the other labels, remains underwhelming and significantly lower than the performance obtained on the other labels.

Label         TCN f1      TCN prec    TCN rec     RNN f1      RNN prec    RNN rec     Δf1     Δprec   Δrec    support
E2F1          0.68±0.04   0.69±0.09   0.70±0.14   0.48±0.01   0.39±0.00   0.62±0.03   +0.20   +0.30*  +0.08   3876
E2F6          0.75±0.02   0.79±0.04   0.72±0.07   0.63±0.01   0.66±0.02   0.61±0.03   +0.12   +0.13   +0.11   4338
E2F8          0.43±0.02   0.34±0.03   0.58±0.09   0.31±0.00   0.24±0.00   0.45±0.03   +0.12   +0.10   +0.13   1977
MYC           0.76±0.03   0.80±0.02   0.73±0.06   0.51±0.03   0.71±0.00   0.40±0.04   +0.25*  +0.09   +0.33*  7012
macro avg     0.65±0.02   0.66±0.03   0.68±0.06   0.48±0.01   0.50±0.00   0.52±0.02   +0.17   +0.16   +0.16   17203
micro avg     0.69±0.01   0.68±0.04   0.70±0.04   0.50±0.01   0.49±0.00   0.51±0.02   +0.19   +0.19   +0.19
samples avg   0.68±0.02   0.72±0.04   0.72±0.03   0.42±0.01   0.45±0.01   0.48±0.02   +0.26   +0.27   +0.24
weighted avg  0.70±0.01   0.72±0.03   0.70±0.04   0.51±0.01   0.57±0.00   0.51±0.02   +0.19   +0.15   +0.19

Summary metrics: APS 0.73±0.01 (TCN) vs 0.52±0.00 (RNN), gain +0.21; AUC 0.80±0.01 (TCN) vs 0.59±0.00 (RNN), gain +0.21.

Table 4: Performance of the implemented models on the H-M-E2F dataset. The reported metrics are averaged across 5 runs and reported alongside standard deviations. The rightmost columns show the gain (Δ) of the TCN-based model over the baseline; the highest gain for each metric is marked with *.

The D-5TF-3CL dataset. The D-5TF-3CL dataset is the biggest dataset taken into consideration, both in terms of number of labels and of training sequences. The TCN-based model also achieved a substantial gain over the baseline model on this dataset, with higher gains compared to the H-M-E2F dataset. It is important to note, however, that the higher gains are largely due to the baseline performance being considerably lower than on the H-M-E2F dataset. This is particularly evident when comparing the AP scores of the two models on the two datasets: the TCN-based model's AP scores are roughly comparable, while the baseline model clearly struggles more on D-5TF-3CL. Considering label-specific metrics, it is particularly noteworthy that the highest gain for each metric was obtained on the same label, USF2.
This is particularly interesting because USF2 is the least frequent class in D-5TF-3CL; this suggests that the TCN-based model's gains are not solely attributable to an improved overall capacity to leverage training data, and that the TCN-based model is able to capture and learn label-specific features that the recurrent baseline cannot, even with fewer examples. This in turn suggests that the USF2 label is characterized by distinctive sequence features that a recurrent architecture cannot fully learn. It is also worth noting that the performance of the TCN-based model is more stable compared to the baseline, as can be seen from the metrics' standard deviations.

Label         TCN f1      TCN prec    TCN rec     RNN f1      RNN prec    RNN rec     Δf1     Δprec   Δrec    support
ATF3          0.50±0.02   0.45±0.08   0.59±0.13   0.22±0.01   0.15±0.01   0.45±0.20   +0.28   +0.30   +0.14   3257
ELF1          0.78±0.01   0.86±0.02   0.71±0.03   0.57±0.05   0.59±0.05   0.57±0.14   +0.21   +0.27   +0.14   9660
MAZ           0.71±0.00   0.65±0.01   0.78±0.02   0.57±0.04   0.58±0.05   0.58±0.14   +0.14   +0.07   +0.20   9350
SP1           0.61±0.01   0.53±0.01   0.71±0.03   0.32±0.16   0.36±0.07   0.41±0.26   +0.29   +0.17   +0.30   7200
TAF1          0.68±0.00   0.61±0.01   0.76±0.01   0.61±0.05   0.51±0.10   0.78±0.09   +0.07   +0.10   −0.02   5859
USF2          0.64±0.02   0.54±0.05   0.80±0.04   0.13±0.08   0.09±0.05   0.28±0.19   +0.51*  +0.45*  +0.52*  2460
c-Myc         0.55±0.00   0.47±0.01   0.66±0.02   0.40±0.04   0.34±0.01   0.50±0.15   +0.15   +0.13   +0.16   6891
macro avg     0.64±0.01   0.59±0.02   0.72±0.03   0.40±0.02   0.37±0.03   0.51±0.05   +0.24   +0.22   +0.21   44677
micro avg     0.65±0.01   0.60±0.01   0.72±0.02   0.44±0.03   0.38±0.02   0.54±0.07   +0.21   +0.22   +0.18
samples avg   0.64±0.01   0.63±0.01   0.75±0.01   0.33±0.05   0.30±0.05   0.49±0.09   +0.31   +0.33   +0.26
weighted avg  0.66±0.01   0.62±0.01   0.72±0.02   0.46±0.02   0.44±0.03   0.54±0.07   +0.20   +0.18   +0.18

Summary metrics: APS 0.69±0.01 (TCN) vs 0.37±0.00 (RNN), gain +0.32; AUC 0.84±0.00 (TCN) vs 0.60±0.01 (RNN), gain +0.24.

Table 5: Performance of the implemented models on the D-5TF-3CL (Group 0) dataset. The reported metrics are averaged across 5 runs and reported alongside standard deviations. The rightmost columns show the gain (Δ) of the TCN-based model over the baseline; the highest gain for each metric is marked with *.

The D-7TF-4CL dataset. The D-7TF-4CL dataset lies in a middle ground between the H-M-E2F and D-5TF-3CL datasets, both in terms of labels and of training samples. In fact, it has one more label than H-M-E2F while containing almost twice as many samples. Broadly speaking, both models follow the same trend observed for D-5TF-3CL, with the TCN-based model outperforming the baseline. The AP scores of both models are also comparable with those achieved on the D-5TF-3CL dataset. As before, the highest gain was achieved on the least frequent class, USF2, for both F1 and precision, while the highest gain in terms of recall was obtained on c-Myc.
Label        | TCN f1 / precision / recall       | RNN f1 / precision / recall       | Δ(TCN−RNN) f1 / precision / recall | support
ATF3         | 0.56±0.01 / 0.53±0.06 / 0.61±0.08 | 0.23±0.05 / 0.18±0.01 / 0.40±0.18 | +0.33 / +0.35 / +0.21              | 3952
SP1          | 0.67±0.00 / 0.68±0.02 / 0.65±0.02 | 0.53±0.03 / 0.45±0.01 / 0.67±0.10 | +0.14 / +0.23 / −0.02              | 9452
TAF1         | 0.78±0.00 / 0.77±0.04 / 0.79±0.05 | 0.70±0.11 / 0.77±0.05 / 0.67±0.18 | +0.08 / +0.00 / +0.12              | 8615
USF2         | 0.73±0.02 / 0.69±0.06 / 0.78±0.05 | 0.26±0.01 / 0.16±0.00 / 0.67±0.11 | +0.47* / +0.53* / +0.11            | 3457
c-Myc        | 0.61±0.01 / 0.57±0.03 / 0.65±0.04 | 0.35±0.08 / 0.37±0.01 / 0.36±0.13 | +0.26 / +0.20 / +0.29*             | 7889
macro avg    | 0.67±0.01 / 0.65±0.03 / 0.70±0.03 | 0.42±0.04 / 0.38±0.01 / 0.55±0.05 | +0.25 / +0.27 / +0.15              | 33365
micro avg    | 0.67±0.00 / 0.65±0.03 / 0.70±0.02 | 0.44±0.03 / 0.36±0.01 / 0.56±0.05 | +0.23 / +0.29 / +0.14              |
samples avg  | 0.68±0.01 / 0.70±0.02 / 0.74±0.02 | 0.41±0.02 / 0.35±0.01 / 0.58±0.05 | +0.27 / +0.35 / +0.16              |
weighted avg | 0.67±0.00 / 0.66±0.02 / 0.70±0.02 | 0.47±0.04 / 0.45±0.01 / 0.56±0.05 | +0.20 / +0.21 / +0.14              |

Summary metrics:
APS | 0.73±0.01 | 0.38±0.00 | +0.35
AUC | 0.84±0.00 | 0.58±0.01 | +0.26

Table 6: Performance of the implemented models on the g1 dataset. The reported metrics are averaged across 5 runs and are reported alongside their standard deviation. The rightmost part of the table shows the gain of the TCN-based model over the baseline; the highest gain for each metric is marked with an asterisk.

Metric       | Δ E2F f1 / precision / recall | Δ g0 f1 / precision / recall | Δ g1 f1 / precision / recall
macro avg    | +0.17 / +0.16 / +0.16         | +0.24 / +0.22 / +0.21        | +0.25 / +0.27 / +0.15
micro avg    | +0.19 / +0.19 / +0.19         | +0.21 / +0.22 / +0.18        | +0.23 / +0.29 / +0.14
samples avg  | +0.26 / +0.27 / +0.24         | +0.31 / +0.33 / +0.26        | +0.27 / +0.35 / +0.16
weighted avg | +0.19 / +0.15 / +0.19         | +0.20 / +0.18 / +0.18        | +0.20 / +0.21 / +0.14
APS          | +0.21                         | +0.32                        | +0.35
AUC          | +0.21                         | +0.24                        | +0.26

Table 7: Gain (TCN − RNN) for the three datasets, side by side.

3.3 Attributions

In order to understand what the models have been learning and to gain insight into the effectiveness of our trained TCN-based model, we applied explainability techniques. We derived attribution scores for the TCN-based model trained on the H-M-E2F dataset with Integrated Gradients and used TF-MoDISco [32] to identify the most informative seqlets.

(a) Activity heatmap of all the relevant seqlets (x-axis) identified by TF-MoDISco from the attribution scores. Each seqlet is associated with an activity pattern across the labels (y-axis). Positive values show increased affinity for the label, i.e. the presence of the seqlet positively influences the model's prediction towards the corresponding label; conversely, negative values show a negative influence of the seqlet on the prediction of the associated label. It is worth noting that several seqlets exhibit similar activity patterns across all labels, suggesting an underlying biological mechanism. Additional information can be found in the TF-MoDISco documentation.
(b) Motif logo of an identified seqlet corresponding to the MYC consensus sequence. (c) Motif logo of an identified seqlet corresponding to the E2F6 consensus sequence.

Figure 10: Preliminary results of the attribution pipeline.

The heatmap and sequence logos in Figure 10 clearly show that the model is correctly capturing at least part of the underlying biological mechanism. In fact, the motif logos obtained from the attribution pipeline represent two well-known motifs belonging to consensus sequences of labels present in the training dataset, namely MYC and E2F6. The nature of the task at hand does not allow for a seamless integration of existing attribution pipelines: due to the multi-label nature of our task, attribution methods would preferably be modified and applied accordingly. The proper development of a sound attribution pipeline devised specifically for multi-label data, however, is out of the scope of this study and will be considered in future work.

4 Conclusions

We address the question of multiple Transcription Factor (TF) DNA-binding recognition modelled via multiple labels, going beyond the binary, single-TF prediction paradigm. Our deep learning framework based on Temporal Convolutional Networks (TCNs) achieves effective learning in settings characterized by limited and noisy biological data, with significant predictive performance. Beyond predictive accuracy, we applied explainable artificial intelligence methods to extract biologically meaningful insights, such as sequence motifs, from our models. Our findings suggest that multi-label deep learning models for TFBS prediction can serve not just as a predictive tool, but also as a hypothesis-generating framework for studying transcriptional regulation. We plan to deepen our investigation in the future towards a more comprehensive understanding of gene regulatory networks and their underlying cooperative mechanisms.
Furthermore, we plan to develop an attribution framework specifically tailored to the multi-label setting, to fully leverage all the information and insights learned by the trained models.

5 Data availability

All the experimental data used to construct our datasets are publicly available from the ENCODE Consortium [14]. In order to construct the two data-driven datasets, we downloaded all available ChIP-Seq experiments for the TFs and cell lines listed in Table 1. For the manually curated dataset, we downloaded ChIP-Seq experiments targeting the previously specified transcription factors in the K562 cell line. The dataset used to test the capability of our model in the binary setting is the one proposed by [41]. The full list of ENCODE identifiers is available upon request.

6 Code availability

The code is fully available upon request to the authors.

7 Conflict of interest

All authors declare no conflict of interest.

References

[1] S. E. Ahmadi, S. Rahimi, B. Zarandi, R. Chegeni, and M. Safa (2021) MYC: a multipurpose oncogene with prognostic and therapeutic implications in blood malignancies. Journal of Hematology & Oncology 14 (1), p. 121. Cited by: §1. [2] U. Alon (2007) Network motifs: theory and experimental approaches. Nature Reviews Genetics 8 (6), p. 450–461. Cited by: §1. [3] B. Amati, S. Dalton, M. W. Brooks, T. D. Littlewood, G. I. Evan, and H. Land (1992) Myc and max form a sequence-specific dna-binding protein complex. Nature 359, p. 423–426. Cited by: §1. [4] J. Ansel, E. Yang, H. He, N. Gimelshein, A. Jain, M. Voznesensky, B. Bao, P. Bell, D. Berard, E. Burovski, et al. (2024) PyTorch 2: faster machine learning through dynamic python bytecode transformation and graph compilation. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, p. 929–947. Cited by: §2.5. [5] C. R. Harris et al. (2020) Array programming with NumPy. Nature 585 (7825), p. 357–362.
Cited by: §2.5. [6] S. Bai, J. Z. Kolter, and V. Koltun (2018) An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271. Cited by: §1, Figure 5, §2.3. [7] B. P. Bednarski, A. D. Singh, W. Zhang, W. M. Jones, A. Naeim, and R. Ramezani (2022) Temporal convolutional networks and data rebalancing for clinical length of stay and mortality prediction. Scientific Reports 12 (1), p. 21247. Cited by: §1. [8] Y. Bengio, R. Ducharme, and P. Vincent (2000) A neural probabilistic language model. Advances in neural information processing systems 13. Cited by: §1. [9] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl (2011) Algorithms for hyper-parameter optimization. Advances in neural information processing systems 24. Cited by: §2.5. [10] J. Bergstra, D. Yamins, and D. Cox (2013) Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures. In International conference on machine learning, p. 115–123. Cited by: §2.5. [11] G. Brauwers and F. Frasincar (2021) A general survey on attention mechanisms in deep learning. IEEE Transactions on Knowledge and Data Engineering 35 (4), p. 3279–3298. Cited by: §1. [12] N. E. Buchler, U. Gerland, and T. Hwa (2003) On schemes of combinatorial transcription logic. Proceedings of the National Academy of Sciences 100 (9), p. 5136–5141. Cited by: §1. [13] P. Christen, D. J. Hand, and N. Kirielle (2024) A review of the f-measure: its history, properties, criticism, and alternatives. ACM Computing Surveys 56 (3), p. 73:1–73:24. Cited by: §3.2. [14] The ENCODE Project Consortium (2012) An integrated encyclopedia of dna elements in the human genome. Nature 489 (7414), p. 57–74. Cited by: §2.1, §5. [15] J. Davis and M.
Goadrich (2006) The relationship between precision-recall and roc curves. In Proceedings of the 23rd international conference on Machine learning, p. 233–240. Cited by: §3.2. [16] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics. Cited by: §1. [17] P. Ding, Y. Wang, X. Zhang, X. Gao, G. Liu, and B. Yu (2023) DeepSTF: predicting transcription factor binding sites by interpretable deep neural networks combining sequence and shape. Briefings in Bioinformatics 24 (4), p. bbad231. Cited by: §3.1. [18] R. Fioresi, P. Demurtas, and G. Perini (2022) Deep learning for myc binding site recognition. Frontiers in Bioinformatics 2, p. 1015993. Cited by: §1. [19] P. Flach and M. Kull (2015) Precision-recall-gain curves: pr analysis done right. Advances in neural information processing systems 28. Cited by: §3.2. [20] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, p. 770–778. Cited by: §2.3. [21] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), p. 1735–1780. Cited by: §1. [22] D. Kondratyuk, L. Yuan, Y. Li, L. Zhang, M. Tan, M. Brown, and B. Gong (2021) MoViNets: mobile video networks for efficient video recognition. arXiv preprint arXiv:2103.11511. Cited by: Figure 3. [23] P. Lara-Benítez, M. Carranza-García, J. M. Luna-Romera, and J. C. Riquelme (2020) Temporal Convolutional Networks Applied to Energy-Related Time Series Forecasting. Applied Sciences 10 (7), p. 2322. Cited by: §1. [24] D. S. Latchman (1997) Transcription factors: an overview. The international journal of biochemistry & cell biology 29 (12), p. 1305–1312. Cited by: §1. [25] C. Lea, R. Vidal, A. Reiter, and G. D.
Hager (2016) Temporal convolutional networks: a unified approach to action segmentation. arXiv preprint arXiv:1608.08242. Cited by: §1. [26] E. Morgunova and J. Taipale (2017) Structural perspective of cooperative transcription factor binding. Current opinion in structural biology 47, p. 1–8. Cited by: §1. [27] The pandas development team (2020) pandas-dev/pandas: pandas. Zenodo. Cited by: §2.5. [28] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, p. 2825–2830. Cited by: §2.5. [29] C. Pelletier, G. Webb, and F. Petitjean (2019) Temporal Convolutional Neural Network for the Classification of Satellite Image Time Series. Remote Sensing 11 (5), p. 523. Cited by: §1. [30] D. M. Powers (2020) Evaluation: from precision, recall and f-measure to roc, informedness, markedness and correlation. arXiv preprint arXiv:2010.16061. Cited by: §3.2. [31] F. Reiter, S. Wienerroither, and A. Stark (2017) Combinatorial function of transcription factors and cofactors. Current Opinion in Genetics & Development 43, p. 73–81. Cited by: §1. [32] A. Shrikumar, K. Tian, Ž. Avsec, A. Shcherbina, A. Banerjee, M. Sharmin, S. Nair, and A. Kundaje (2018) Technical note on transcription factor motif discovery from importance scores (tf-modisco) version 0.5.6.5. arXiv preprint arXiv:1811.00416. Cited by: §2.4, §3.3. [33] F. Spitz and E. E. M. Furlong (2012) Transcription factors: from enhancer binding to developmental control. Nature Reviews Genetics 13 (9), p. 613–626. Cited by: §1. [34] M. Sundararajan, A. Taly, and Q. Yan (2017) Axiomatic attribution for deep networks. In International conference on machine learning, p. 3319–3328.
Cited by: §2.4. [35] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu (2016) WaveNet: a generative model for raw audio. arXiv preprint arXiv:1609.03499. Cited by: §1, §2.3. [36] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. Advances in neural information processing systems 30. Cited by: §1. [37] Z. Xie, I. Sokolov, M. Osmala, X. Yue, G. Bower, J. P. Pett, Y. Chen, K. Wang, A. D. Cavga, A. Popov, S. A. Teichmann, E. Morgunova, E. Z. Kvon, Y. Yin, and J. Taipale (2025) DNA-guided transcription factor interactions extend human gene regulatory code. Nature 641, p. 1329–1338. Cited by: §1. [38] F. Yu and V. Koltun (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122. Cited by: §2.3. [39] H. Yuan, M. Kshirsagar, L. Zamparo, Y. Lu, and C. Leslie (2019) BindSpace decodes transcription factor binding signals by large-scale sequence embedding. Nature Methods 16, p. 1–4. Cited by: §1. [40] M. Zaharia, A. Chen, A. Davidson, A. Ghodsi, S. A. Hong, A. Konwinski, S. Murching, T. Nykodym, P. Ogilvie, M. Parkhe, et al. (2018) Accelerating the machine learning lifecycle with MLflow. IEEE Data Eng. Bull. 41 (4), p. 39–45. Cited by: §2.5. [41] H. Zeng, M. D. Edwards, G. Liu, and D. K. Gifford (2016) Convolutional neural network architectures for predicting DNA–protein binding. Bioinformatics 32 (12), p. i121–i127. Cited by: §2.2, §2, §5. [42] Y. Zhang, Z. Wang, Y. Zeng, Y. Liu, S. Xiong, M. Wang, J. Zhou, and Q. Zou (2022) A novel convolution attention model for predicting transcription factor binding sites by combination of sequence and shape. Briefings in Bioinformatics 23 (1). Cited by: §1. [43] Y. Zhang, Z. Wang, F. Ge, X. Wang, Y. Zhang, S. Li, Y. Guo, J. Song, and D. Yu (2024) MLSNet: a deep learning model for predicting transcription factor binding sites. Briefings in Bioinformatics 25 (6), p.
bbae489. Cited by: §3.1. [44] N. Zheng, E. Fraenkel, C. O. Pabo, and N. P. Pavletich (1999) Structural basis of dna recognition by the heterodimeric cell cycle transcription factor e2f–dp. Genes & Development 13 (6), p. 666–674. Cited by: Figure 1.