UNIVERSIDAD DE COSTA RICA
SISTEMA DE ESTUDIOS DE POSGRADO

USO DE MODELOS DE DEEP LEARNING EN LA ESTIMACIÓN DE PROBABILIDADES DE CONVERSIÓN Y CONTRIBUCIÓN DE LOS CANALES DE COMUNICACIÓN EN CAMPAÑAS PUBLICITARIAS

Trabajo Final de Investigación Aplicada sometido a la consideración de la Comisión del Programa de Estudios de Posgrado en Matemática para optar al grado y título de Maestría Profesional en Métodos Matemáticos y Aplicaciones

Alexa Sánchez Brenes

Ciudad Universitaria Rodrigo Facio, Costa Rica
2025

Este trabajo final de investigación aplicada fue aceptado por la Comisión del Programa de Estudios de Posgrado en Matemática de la Universidad de Costa Rica, como requisito parcial para optar al grado y título de Maestría Profesional en Métodos Matemáticos y sus Aplicaciones.

Dr. Alexander Ramírez González, Representante de la Decanatura, Sistema de Estudios de Posgrado
Dr. Maikol Solís Chacón, Profesor Guía
Dr. Hugo Solís Sánchez, Lector
MSc. Juan Felipe González Évora, Lector
Dr. Dario Alberto Mena Arias, Director del Programa de Posgrado
Alexa Sánchez Brenes, Sustentante

Contents

Acta de defensa ii
Summary v
List of Tables vi
List of Figures vii
1 Introduction 1
  1.1 Introduction 1
  1.2 Objectives 2
2 Background of the Study 3
  2.1 Literature Review 3
  2.2 Neural Networks 4
  2.3 Deep Neural Net with Attention for Multi-channel Multi-touch Attribution Model 7
    2.3.1 Deep sequential model 7
    2.3.2 Attention mechanism 10
    2.3.3 Binary classification problem 12
  2.4 Encoder-Decoder 12
  2.5 Accuracy Metrics for Imbalanced Datasets 13
3 Methodology 17
  3.1 Dataset 17
  3.2 Samples Definition 19
  3.3 Model Parameterization 19
    3.3.1 Sequence Length 19
    3.3.2 Hyperparameters 23
  3.4 Model structure and computational tools 25
    3.4.1 Data Processing 25
    3.4.2 Attention Layer 27
    3.4.3 Long-Short Term Memory 28
    3.4.4 Encoder-Decoder 30
4 Results and Discussion 32
  4.1 Data Processing 32
  4.2 Heuristic Models 34
  4.3 Model 1: Long-Short Term Memory 35
  4.4 Model 2: Encoder-Decoder Modification 41
5 Conclusions 45
References 47

Resumen

En este trabajo de investigación se muestran modelos de deep learning para estimar la probabilidad de conversión y la contribución de los canales de comunicación en campañas publicitarias.
La relevancia de este tipo de metodologías surge como respuesta a la necesidad de optimizar los recursos en el diseño de campañas publicitarias, ante el crecimiento en la variedad de medios por los que los consumidores interactúan con la publicidad. Con base en el modelo propuesto en Li et al. (2018), se implementaron modelos secuenciales de redes neuronales, específicamente Long Short-Term Memory (LSTM). Este modelo, además de permitir estimar la probabilidad de conversión de los potenciales clientes que mantienen contacto con una campaña publicitaria, incorpora un mecanismo de atención. Dichos mecanismos de atención permiten estimar la relevancia que tienen los canales de comunicación en la decisión de conversión. Por otra parte, se incorporó una estructura de Encoder-Decoder a la arquitectura original del modelo antes descrito.

Ambos modelos secuenciales mostraron un poder predictivo similar; sin embargo, de acuerdo con el Precision-Recall AUC, el modelo original con redes LSTM fue mejor. Además, los resultados demostraron que los canales con mayor número de interacciones, como Facebook y Paid Search, no siempre son los más influyentes en la conversión final. En contraste, los modelos secuenciales identificaron a Instagram y Online Display como los canales con mayor impacto en la toma de decisiones de los consumidores. Finalmente, si bien la incorporación del Encoder-Decoder no mostró cambios importantes en la relevancia de los canales de comunicación, sí presentó una mejora notable en los tiempos de ejecución.

List of Tables

1 Original Data Sample 18
2 Distribution of Clients Across Different Sequence Lengths 20
3 Overview of Input Datasets 32
4 Channel Tokenization 33
5 Padding-Truncated Density Vectors 34
6 Evaluation Metrics for LSTM Model Across Different Sequences 36
7 Cross-Validation Performed for the LSTM Selected Model 38
8 Evaluation Metrics for Encoder-Decoder Model Across Different Sequences 41
9 Cross-Validation Performed for the Encoder-Decoder Selected Model 42

List of Figures

1 LSTM Architecture 8
2 Daily Positive Conversions (July 2018) 18
3 Proportion of interactions by channel 19
4 Comparison of PR AUC Across Different Sequence Lengths 21
5 Comparison of Execution Time Across Different Sequence Lengths 22
6 Architecture of the Attention Layer 27
7 Architecture of the LSTM model 30
8 Architecture of the Encoder-Decoder model 31
9 Contribution based on Heuristic Models 35
10 Training vs Validation 37
11 Actual vs Predicted Classifications 39
12 Performance Across Thresholds 39
13 Contribution of Communication Channels 40
14 Training vs Validation 42
15 Actual vs Predicted Classifications 43
16 Performance Across Thresholds 43
17 Contribution of Communication Channels 44

Autorización para digitalización y comunicación pública de Trabajos Finales de Graduación del Sistema de Estudios de Posgrado en el Repositorio Institucional de la Universidad de Costa Rica.

Yo, Alexa Sánchez Brenes, con cédula de identidad 304670905, en mi condición de autor del TFG titulado USO DE MODELOS DE DEEP LEARNING EN LA ESTIMACIÓN DE PROBABILIDADES DE CONVERSIÓN Y CONTRIBUCIÓN DE LOS CANALES DE COMUNICACIÓN EN CAMPAÑAS PUBLICITARIAS.

Autorizo a la Universidad de Costa Rica para digitalizar y hacer divulgación pública de forma gratuita de dicho TFG a través del Repositorio Institucional u otro medio electrónico, para ser puesto a disposición del público según lo que establezca el Sistema de Estudios de Posgrado. SI ( X ) NO ( )

*En caso de la negativa favor indicar el tiempo de restricción: ________ año(s).

Este Trabajo Final de Graduación será publicado en formato PDF, o en el formato que en el momento se establezca, de tal forma que el acceso al mismo sea libre, con el fin de permitir la consulta e impresión, pero no su modificación.

Manifiesto que mi Trabajo Final de Graduación fue debidamente subido al sistema digital Kerwá y su contenido corresponde al documento original que sirvió para la obtención de mi título, y que su información no infringe ni violenta ningún derecho a terceros. El TFG además cuenta con el visto bueno de mi Director(a) de Tesis o Tutor(a) y cumplió con lo establecido en la revisión del Formato por parte del Sistema de Estudios de Posgrado.

FIRMA ESTUDIANTE

Nota: El presente documento constituye una declaración jurada, cuyos alcances aseguran a la Universidad que su contenido sea tomado como cierto. Su importancia radica en que permite abreviar procedimientos administrativos, y al mismo tiempo genera una responsabilidad legal para que quien declare contrario a la verdad de lo que manifiesta pueda, como consecuencia, enfrentar un proceso penal por delito de perjurio, tipificado en el artículo 318 de nuestro Código Penal. Lo anterior implica que el estudiante se vea forzado a realizar su mayor esfuerzo para que no sólo incluya información veraz en la Licencia de Publicación, sino que también realice diligentemente la gestión de subir el documento correcto en la plataforma digital Kerwá.

1 Introduction

1.1 Introduction

During the past decade, the number and variety of media through which consumers can be exposed to advertising campaigns have increased and diversified significantly. These media are commonly referred to as communication channels. A touch point is defined as the moment when a consumer interacts with advertising information related to a product or service through a communication channel, such as an online graphic ad. When a consumer acquires the product or service offered through the advertising information, it is said to result in a positive conversion.
The wide variety of communication channels through which advertising campaigns can reach consumers has made it essential for marketing specialists to apply tools that predict customer conversions, in order to identify the most effective channels and touch points and to optimize budget allocation. This has led to the development of various Marketing Attribution Models, which aim to understand the impact of each interaction with an advertisement on the final decision to purchase a specific product or service.

This document explores how heuristic and Markovian models are capable of providing insight into the relevance of the communication channels through which an advertising campaign is spread. However, these methodologies do not offer information on conversion rates or consider sequential information. Therefore, the use of neural network models is proposed, specifically employing the LSTM architecture. This approach ensures that historical customer information is considered both in predicting conversion rates and in identifying the most relevant channels for business actions, facilitated by the incorporation of attention mechanisms.

Furthermore, a modification of the original architecture is proposed through an Encoder-Decoder framework. This adjustment demonstrates improved performance through faster execution times and yields slight enhancements in conversion rate prediction results.

Lastly, we will elaborate on the challenges associated with obtaining high-quality public datasets for implementing sequential models, in contrast with the more prevalent availability of data examples used in heuristic and Markovian applications.

1.2 Objectives

General Objective

Implement deep learning models to predict the conversion rate and the contribution of communication channels in the final decision-making of consumers when acquiring or not acquiring a product or service (positive or negative conversion) offered in an advertising campaign.

Specific Objectives

• Apply the necessary structure to the database to execute a Deep Neural Net with Attention for Multi-channel Multi-touch Attribution model and identify the most influential communication channels in the conversion probability of each consumer.
• Develop a modification of the basic Deep Neural Net with Attention for Multi-channel Multi-touch Attribution model by incorporating an LSTM Encoder-Decoder.
• Evaluate the performance of the models to identify changes in the relevance of communication channels and in execution times.

2 Background of the Study

2.1 Literature Review

Last-Touch Attribution and First-Touch Attribution are examples of marketing attribution models based on simple heuristic rules that assign 100% of the impact on the consumer's final decision to the last or first advertising touch point they interacted with, respectively. These methodologies have been criticized for completely ignoring the effects that other intermediate touch points may have on the final decision and for leading to biased estimates (Kumar, 2021). In response to these criticisms, the literature proposes the use of models for sequential information, which consider all consumer interactions with different touch points and provide additional information such as the time elapsed between each interaction.
In particular, Multi-touch Attribution Models characterize a more realistic view of the customer journey by assigning weight based on the true influ- ence of each touch point on the final decision to acquire a product or service. Markov Chain models are an example of sequential methodologies proposed in the lit- erature. In Kakalejcik et al. (2018), they implement a Markov Chain model with a data set from Google Analytics to understand the path through communication channels that customers follow with positive conversions. More sophisticated models, such as the one proposed in Abhishek Vibhanshu (2012), use a spatially structured Markov Chain model to capture the dynamics of individual consumer behavior and infer the conversion rate. Other models, like Vector Auto Regressive models implemented in de Haan et al. (2016), study the relative effectiveness of various online marketing channels and analyze the duration of the impacts of this advertising information on consumers. Furthermore, in Danaher and Danaher (2013) an ensemble model was recommended to evaluate the rel- ative effectiveness of multiple advertising media, analyzing the incidence of purchases using a Probit model and applying a Tobit model of the second type to estimate the outcome of purchases. Finally, li et al. (2018) suggests a sequential model based on recurrent neural networks called Deep neural net with attention for multitouch attribution. The added value of these models is that they allow for robust capturing of long-term dependencies, providing a better, understanding of the effects and contributions of consumer interaction with com- 4 munication channels. Furthermore, with the application of models like this, it is possible to quantify the impact of each communication channel on a consumer’s positive conver- sion, providing additional information that can be considered in optimizing resources for future advertising campaigns. Furthermore, the previously mentioned model incorporates attention mechanisms, which are particularly advantageous in LSTM models to capture fine-grained dependencies in sequential data (Bahdanau, Cho, & Bengio, 2014). This selective focus on relevant parts of the input sequence enhances both comprehensibility and performance, making these models highly effective in natural language processing tasks such as machine translation and sentiment analysis. Additionally, the Deep neural net with attention for multitouch attribution model includes a time decay function to account for the assumption that the influence of each touch- point diminishes as the time between the interaction and the final purchase decision increases. Assumption of time decay is common in the literature, yet, there are multiple factors that could have a more significant impact. For instance, Dimitrios Buhalis (2021) points out that a touch point does not always generate a positive impact on consumer decision-making and consumers’ previous experiences related to brands may influence their decisions. On the other hand, in Thaichon and Quach (2016) emphasizes how the form and characteristics of marketing communications can create needs in consumers, producing a positive effect on a consumer’s journey. 2.2 Neural Networks According to the authors in Hastie et al. (2017), the term “neural network" has broadened to encompass a wide variety of models and learning methods. 
To introduce key concepts and definitions, this paper presents a brief description of a single hidden layer back-propagation network, also known as a single-layer perceptron. Neural networks are a set of algorithms designed to recognize relationships in a given set of training data, modeled after the way human neurons process information. The following equation represents how a single-layer neural network processes input data to produce a predicted output value,

\hat{y} = \sigma(w^T x + b).    (1)

To generate equation (1), a training vector x is input into the neural network during a process called forward propagation. As the data passes through the layer, the parameters w^T and b, known as weights and biases, are applied to map the relationship between the inputs and the target outputs. This is achieved by calculating the weighted sum of the inputs at each neuron and applying a non-linear activation function (usually chosen to be a sigmoid) to produce the output. In multiple-layer neural networks, this process continues layer by layer until the final output is produced.

A loss function is subsequently defined to compare the target and predicted output values. During model training, this loss function needs to be minimized. The standard approach for minimizing it is gradient descent, commonly referred to as backpropagation in this context. Given the compositional structure of the model, the gradient can be efficiently derived using the chain rule for differentiation. This process involves performing a forward and backward sweep through the network, tracking only the quantities local to each unit.

For binary classification problems, the Binary Cross-Entropy loss function is frequently used. According to the authors in Hastie et al. (2017), it is defined as follows:

-\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right],    (2)

where N is the number of samples, y_i is the true label for the i-th sample (either 0 or 1), and \hat{y}_i is the predicted probability of the positive class for the i-th sample.

Likewise, activation functions are essential in neural networks since they introduce the non-linearity that allows the network to learn complex patterns and relationships in the data. Activation functions determine whether a node should be activated by evaluating the importance of the neuron's input through mathematical operations. As mentioned before, they transform the weighted sum of the inputs of a node into an output value, which is then fed to the next layer or used as the final output. There are various activation functions, including binary, linear, and numerous non-linear types, among which the following are particularly notable:

Sigmoid. This function transforms any real input value into an output within the range of 0 to 1. This property makes it particularly suitable for classification tasks and for models requiring a probability prediction as output. However, a drawback is that the gradient can vanish to zero for very low and very high input values, which can delay the model's ability to improve during training. It can be expressed as

f(x) = \frac{1}{1 + e^{-x}}.    (3)

Hyperbolic tangent (Tanh). This function has a range from -1 to 1 and is commonly implemented in the hidden layers of neural networks. It helps center the data around zero, thereby facilitating easier learning for subsequent layers. This centering property improves the efficiency and performance of neural network models.
Mathematically, it can be represented as

f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}.    (4)

Rectified linear unit (ReLU). The range of this function is [0, ∞), and it is extensively utilized in deep learning, particularly in convolutional neural networks. It supports backpropagation due to its simple derivative, which accelerates the convergence of gradient descent towards the global minimum of the loss function. Its efficiency arises from its ability to avoid activating all neurons simultaneously, unlike the Sigmoid or Tanh functions, making it computationally efficient. However, this characteristic also implies that some neurons might not be updated or activated, potentially leading to dead neurons that never get activated. It can be expressed as

f(x) = \max\{0, x\}.    (5)

The following section describes the architecture of a more complex model, which consists of different structures of neural networks. In particular, this model is composed of recurrent neural networks, a type of network designed to process sequential data and address problems associated with natural language processing and time-series analysis.

2.3 Deep Neural Net with Attention for Multi-channel Multi-touch Attribution Model

The basic Deep Neural Net with Attention for Multi-channel Multi-touch Attribution model consists of three stages. The first stage involves a deep sequential model based on a Long-Short Term Memory (LSTM) recurrent neural network, aimed at capturing the long-term dependencies in sequential observations. Subsequently, an attention mechanism is introduced, assigning weights to the touchpoints that make up the representation of a consumer's path to conversion and determining which ones are more influential. Next, the problem of conversion attribution, viewed as a binary classification problem, is addressed by incorporating the results obtained in the previous stages. Each of these stages is described in detail below.

2.3.1 Deep sequential model

In a Long Short-Term Memory recurrent neural network (LSTM), input values are sent to nodes called neurons, which are grouped into different layers (Sagheer & Kotb, 2019). These neural networks are typified as recurrent because the inputs received by the nodes located in one layer are the outputs generated by the neurons in the previous layer, weighted by a value assigned by a non-linear function, in what is known as the activation process. Simultaneously, the weights assigned by the neurons are modified based on the error obtained, depending on how much each neuron contributed to the previous result, through the process of backpropagation.

One of the principal features of this type of neural network is its ability to learn long-term sequential processes. In LSTM networks, each node is composed of four layers, and through each corresponding gate a process of updating information is carried out and stored in what is known as the cell state. Moreover, each gate contains a non-linear activation function, typically a logistic sigmoid or hyperbolic tangent function. Additionally, the components of this type of neural network are the forget gate f_t, the input gate i_t, the cell state c_t, the output gate o_t, the hidden state of the previous step h_{t-1}, and the input x_t at iteration t.
The determination of every element in the neural network during each iteration t is outlined as follows,

f_t = \sigma(W^f \cdot [h_{t-1}, x_t] + b^f)
i_t = \sigma(W^i \cdot [h_{t-1}, x_t] + b^i)
c_t = f_t \odot c_{t-1} \oplus i_t \odot \tanh(W^c \cdot [h_{t-1}, x_t] + b^c)
o_t = \sigma(W^o \cdot [h_{t-1}, x_t] + b^o)
h_t = o_t \odot \tanh(c_t)

where the W and b terms are weights and biases, and [h_{t-1}, x_t] is the concatenation of the previous hidden state and the current input. In addition, matrix multiplication is denoted with the dot operator (·), while pointwise multiplication and pointwise addition are represented as ⊙ and ⊕, respectively. The sigmoid activation and hyperbolic tangent functions are given by σ and tanh, respectively.

Figure 1: LSTM Architecture

As shown in Figure 1, in the first gate f_t a decision is made regarding whether to retain or discard the information coming from the previous hidden state h_{t-1}. This decision is governed by a sigmoid activation function within the forget gate, generating a value ranging from zero to one. A value of one implies that all information stored in the cell state is retained, whereas a value of zero signifies the discarding of all prior information. Following this, the definition of the new information to be stored in the cell state involves two steps. Initially, employing another sigmoid activation function, the input gate i_t determines which values are to be updated. Subsequently, a tanh function generates a vector c̃ comprising new candidate values that could be incorporated into the cell state. Afterward, the memory cell c_{t-1} undergoes an update to become c_t, defined by a combination of the information chosen to persist and the newly introduced information. Next, the final output of this node, denoted as h_t, is defined. To achieve this, a sigmoid activation function is applied to determine which information from the cell state will persist. Finally, a hyperbolic tangent function is computed on the cell state, and the result of this transformation is multiplied by the output of the sigmoid gate.

The LSTM implementation used in constructing the Deep Neural Net with Attention for Multi-channel Multi-touch Attribution model is outlined below, along with the preliminary steps to prepare the input for this stage.

Input Layer

The input data for the model consists of a set of sequences P formed by the touchpoints to which each consumer was exposed over a specific period. Touchpoints are denoted as x_t ∈ R^n, where n represents the number of enabled communication channels for a specific marketing campaign. Thus, if T is the length of the sequence and t represents the relative order of the event in the sequence, instead of the absolute event occurrence time, then a single customer sequence path can be defined as

P_i = x_0, ..., x_T, where t ∈ [0, T].

As an illustration, let us examine a sequence P_i associated with a client and given by the following touchpoints: video, Facebook, Instagram, Searchpaid and Facebook. In this instance, the set of enabled touchpoints has dimension n = 4 and the length of the sequence is T = 5.

Embedding Layer

In this step, each x_t is converted into a density vector e_t using an embedding matrix W_e. This matrix maps word vectors to continuous representations, where each row corresponds to the continuous representation of a specific word. During training, these vectors are refined through backpropagation to capture semantic relationships between words. Each density vector is obtained through the following formula:

e_t = W_e x_t, where W_e ∈ R^{n × v_e}.
This technique is used to encode words into a sequence of numerical indices. Individual words are represented as vectors with real values in a predefined vector space of dimension v_e, with the aim of transforming higher-dimensional data into a lower-dimensional vector space. In simple words, it assigns each word a dense vector, positioning similar words closer together in that space. Using the earlier example, the density vector is specified as e_t = [1, 2, 4, 3, 2], in which 1 represents video; 2, Facebook; 4, Instagram; and 3, Searchpaid.

LSTM Layer

LSTM neural networks allow us to incorporate contextual information from the historical observations. Through a non-linear operator H, often implemented as a recurrent neural network, specifically an LSTM in our context, each block iteratively updates the current hidden state h_t by using the information of the embedding layer output e_1, ..., e_T and the previous hidden state h_{t-1}, as depicted in the following formula,

h_t = H(e_t, h_{t-1}), where t ∈ [0, T].

2.3.2 Attention mechanism

The primary goal of this stage is to identify the most influential communication channels contributing to client conversions. An attention mechanism is a technique that enhances model performance by focusing on relevant information within the input data. It enables models to selectively attend to different parts of the input, assigning varying degrees of importance to different elements. This is achieved by generating attention weights for various features of the input data, allowing the model to utilize the most pertinent parts of the input sequence. These weights are applied in a weighted combination of all the input vectors, with higher weights attributed to more relevant vectors. Consequently, the attention mechanism determines the level of importance each element contributes to the model's output, thereby improving the model's ability to capture complex patterns and relationships.

To start, the attention layer processes the hidden state h_t at time step t through a single-layer Multilayer Perceptron (MLP). Such networks utilize the backpropagation algorithm and are designed to approximate continuous functions and address problems that are not linearly separable (Abirami & Chitra, 2020). In this case, the hidden state h_t is transformed into v_t by the expression tanh(W_v h_t + b_v), where W_v is a weight matrix, b_v is a bias vector added to the weighted input, and tanh is the hyperbolic tangent activation function.

Following this, the importance level a_t of the new representation of the communication channel v_t is computed. The normalized importance weight a_t is obtained through a softmax function, ensuring positive values for a_t by design. This construction guarantees that the contribution of each touchpoint remains positive. Next, the context vector s is defined as the convex combination of the h_t weighted by the a_t obtained in the previous step. Intuitively, s can be interpreted as a high-level representation of the customer's journey through the different touchpoints, combining hidden outputs and attention weights. The complete process of the attention mechanism is described through the following equations,

v_t = \tanh(W_v h_t + b_v)    (6)
a_t = \frac{\exp(v_t)}{\sum_t \exp(v_t)}    (7)
s = \sum_t a_t h_t    (8)

In Li et al. (2018), the authors propose some modifications to the model; they suggest that the timing of each interaction within the same sequence P could impact the consumer's final conversion.
In order to incorporate a time penalty into the attention mechanism, they introduce the term λT_t in the definition of a_t. The parameter λ can be predefined based on the observed conversion trends of customers in previous marketing campaigns, or it can be initialized randomly in the model and adjusted during training. Meanwhile, T_t refers to the time lapse between the contact with communication channel x_t and x_{t+1}. With this modification, a_t can be rewritten as

a_t = \frac{\exp(v_t - \lambda T_t)}{\sum_t \exp(v_t - \lambda T_t)}.    (9)

2.3.3 Binary classification problem

The final step involves addressing the conversion attribution problem. Consumers' journeys through touch points conclude when the consumer decides whether to acquire the offered good or service. Therefore, the conversion attribution problem can be treated as a binary classification problem. In this case, the probability of a customer ending with a positive conversion is determined through a sigmoid function fed with the vector s obtained through the aforementioned processes. Explicitly, the probability of positive conversion is described as follows,

p = \mathrm{sigmoid}(\sigma(W^T s + b)),

where σ(·) is the activation function of equation (5), and W and b are weights and biases, respectively. In particular, in the conversion attribution problem, when consumers have had exposure to advertising channels, the probability of their journey concluding in a positive conversion is always higher than the probability for those consumers who were not exposed. That is why the use of the activation function in equation (5) is proposed.

2.4 Encoder-Decoder

The Encoder-Decoder is an unsupervised learning method in neural networks that aims to learn a compressed representation of the input data, capable of capturing the nonlinear patterns inherent in the data. Typically, it is trained as part of a larger model that attempts to recreate the input by extracting its fundamental features. According to Sagheer and Kotb (2019), the first part of this model, the encoder, transforms an input sequence into a lower-dimensional representation in a latent space. This can be expressed as

h(x) = f(W_1 x + b_1),

where W_1 is a weight matrix, b_1 is a bias vector, and f is referred to as the encoder function, which, in this case, is an LSTM layer. During the second step of the model, the decoder maps the latent representation h(x) back to a reconstruction x̂ with the original dimension of the input sequence x,

\hat{x} = g(W_2 h(x) + b_2),

where W_2 is a weight matrix, b_2 is a bias vector, and g is an activation function. The discrepancy between the input and the reconstructed input is commonly referred to as the reconstruction error; the Encoder-Decoder model is trained to minimize a loss function given by

L(x, \hat{x}) = \| x - \hat{x} \|^2.

In particular, there is an architecture called the LSTM Encoder-Decoder that enables the model to handle variable-length input sequences and predict or generate variable-length output sequences (Shi & Liu, 2022). In this structure, the LSTM network is integrated into both the encoding function f(·) and the decoding function g(·). This allows effective utilization of the temporal information in sequential input data. In this type of Encoder-Decoder model, the encoder compresses the information from the entire input sequence into a fixed-dimensional vector derived from the sequence of LSTM hidden states. This representation is obtained from the last hidden state of the encoding part. In contrast, the decoding part utilizes a single LSTM layer to predict the output sequence.
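To make the attention computations of Sections 2.3.2 and 2.3.3 concrete before turning to evaluation metrics, the following minimal NumPy sketch applies equations (6) to (9) and the sigmoid head to a toy sequence of hidden states. The shapes, the random weights W_v, b_v, W, b, the penalty λ and the time gaps T_t are illustrative only and are not taken from the thesis code.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_context(H, W_v, b_v, lam, T_gaps):
    # Equations (6)-(9): H holds one LSTM hidden state per touchpoint, shape (T, d)
    v = np.tanh(H @ W_v + b_v).squeeze(-1)      # eq. (6), one score per time step
    scores = v - lam * T_gaps                   # time-decay penalty lambda * T_t
    a = np.exp(scores) / np.exp(scores).sum()   # eq. (9), softmax over the sequence
    s = (a[:, None] * H).sum(axis=0)            # eq. (8), context vector
    return a, s

rng = np.random.default_rng(0)
T, d = 5, 8                                     # toy journey: 5 touchpoints, hidden size 8
H = rng.normal(size=(T, d))                     # stand-in for the LSTM outputs h_1, ..., h_T
W_v, b_v = rng.normal(size=(d, 1)), np.zeros(1)
T_gaps = np.array([4.0, 2.0, 1.0, 0.5, 0.0])    # illustrative gaps between consecutive touchpoints

a, s = attention_context(H, W_v, b_v, lam=0.1, T_gaps=T_gaps)
W, b = rng.normal(size=d), 0.0
p = sigmoid(np.maximum(0.0, W @ s + b))         # conversion probability, with the ReLU of eq. (5)
print(a.round(3), round(float(p), 3))

Because the attention weights sum to one, each a_t can be read directly as the relative importance assigned to touchpoint t within the journey.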
2.5 Accuracy Metrics for Imbalanced Datasets In accordance with C. Ferri (2009), the precise evaluation of learned models remains a pivotal focus within the domain of pattern recognition. This study, specifically cen- tered on classifiers, undertakes a thorough exploration of a variety of metrics designed to assess different facets of model performance. Our research aims to provide an exhaus- tive analysis and categorization of these metrics, bringing to light both their theoretical foundations and practical implications in the evaluation of classifiers. The authors introduce three distinct groups of metrics. The first group are metrics based 14 on a threshold and a qualitative understanding of error; examples such as accuracy, F score, and the Kappa statistic fall into this category. These metrics aim to minimize the number of errors, with specific measures within this group being more suitable for scenarios involving balanced or imbalanced datasets, signal or fault detection, and in- formation retrieval tasks. The second group is defined by the metrics based on a probabilistic understanding of error; metrics like mean absolute error, mean squared error, and cross entropy belong to this group. This set of metrics proves valuable when evaluating the reliability of classifiers, offering insights not only into instances of failure, but also into the classifier’s ability to choose the correct class with varying probabilities, be it high or low. And the third group is characterized by metrics based on how well the model ranks are listed, among which AUC stands out. Closely tied to the concept of separability, this metric is of significance in various applications where classifiers play a pivotal role in selecting optimal instances from a table or ensuring effective class separation. Given that only about 0.1 of the total clients had a positive conversion, there is a signifi- cant class imbalance present in the database. Thus, it becomes crucial to carefully select the appropriate metrics to evaluate model performance. The authors in Thölke et al. (2023), underscore the limitations of the widely used accuracy metric, particularly in sce- narios with greater class imbalance. This metric, weighs the ratios of correct predictions per class proportionally to the class size, resulting in a notable neglect of performance on the minority class. When a binary classification model consistently favors the majority class, it generates an artificially inflated decoding accuracy that predominantly reflects the imbalance between the two classes, rather than indicating genuine and universally applicable discriminatory capability. Before discussing the most suitable metrics for evaluating imbalanced class scenarios, it is essential to introduce certain concepts using the elements of a confusion matrix. The confusion matrix is a tool used to assess the performance of a classification model on a test dataset, displaying the counts of correct and incorrect predictions categorized by actual and predicted classes. This matrix provides a detailed breakdown of true positives, true negatives, false positives, and false negatives, enabling a comprehensive evaluation of the performance of the model. 
• True Positives (TP): the number of instances where the model correctly predicted the positive class.
• True Negatives (TN): the number of instances where the model correctly predicted the negative class.
• False Positives (FP): the number of instances where the model incorrectly predicted the positive class.
• False Negatives (FN): the number of instances where the model incorrectly predicted the negative class.

According to Brownlee (2019), two metrics that are particularly useful for evaluating imbalanced classification are precision and recall. In addition to these metrics, we also present the F1 score, which provides a single metric that balances both concerns, offering a comprehensive measure of a model's performance in imbalanced classification scenarios.

1. Precision. This metric summarizes the ratio of true positive predictions to the total predicted positives; in other words, it measures the accuracy of the positive predictions.

Precision = \frac{TP}{TP + FP}

2. Recall. This metric represents the true positive rate, also known as sensitivity, and indicates how effectively the model predicts the positive class. It is interpreted as the proportion of actual positive instances that are correctly identified by the model. It provides insight into the model's ability to detect positive cases, making it particularly useful in scenarios where capturing all positive instances is crucial.

Recall = \frac{TP}{TP + FN}

3. F1. This metric evaluates the overall performance of a classification model. It is the harmonic mean of precision and recall and is valuable for evaluating classification models, especially in cases of imbalanced data, as it considers both false positives and false negatives.

F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}

Moreover, maximizing this metric can be an effective strategy for setting a classification threshold in imbalanced classification scenarios (Lipton, Elkan, & Narayanaswamy, 2014). By optimizing the threshold to maximize the F1 score, we ensure the model maintains a good balance between precision and recall.

4. F Beta. This score is an extended adaptation of the F1 score. It incorporates a weighting factor, denoted as β, to refine the relative impact of precision and recall on the overall metric. It represents a weighted harmonic mean of both precision and recall, reaching its optimal value at 1 and its poorest at 0.

F_\beta = \frac{(1 + \beta^2) \cdot \text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}

5. Area under the Precision-Recall curve (PR AUC). This metric is designed for assessing binary classification models. It is particularly beneficial for imbalanced classes (Boyd, 2013), as it emphasizes the classifier's efficacy in addressing the minority class. It provides a nuanced representation of the balance between precision and recall at diverse decision thresholds. A substantial area beneath the curve signifies elevated values for both precision and recall, indicating a classifier that not only delivers accurate outcomes (high precision) but also captures a significant portion of all positive results (high recall).

3 Methodology

3.1 Dataset

Obtained from Huyton (2021), the dataset used to implement the proposed methodology is publicly accessible. It is a test dataset that has been employed in the development of non-sequential Multi-touch Attribution models, including heuristic models and Markov chain models. Due to the limited availability of public datasets for sequential models, the aforementioned dataset was selected.
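As a minimal illustration of how this public dataset can be loaded and ordered before any modeling, the following pandas sketch assumes a local CSV export with the columns described in the next paragraph; the file name and the exact column labels are illustrative and may differ in the raw download.

import pandas as pd

# Hypothetical local export of the Huyton (2021) dataset; adjust the path and
# the column names to match the actual file
df = pd.read_csv("attribution_data.csv", parse_dates=["timestamp"])

# One row per interaction, ordered chronologically within each consumer (cookie)
df = df.sort_values(["cookie", "timestamp"]).reset_index(drop=True)
print(df[["cookie", "timestamp", "interaction", "channel", "conversion"]].head())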
The database comprises around 586,737 interactions involving 240,108 consumers across five distinct communication channels throughout July, 2018. Communication channels includes Paid Search, Facebook, Instagram, Online Display and Online Video. The most relevant variables contained in the data set are described as follows. • Cookie: Unique identifier for each consumer. • Timestamp: Date and time when the consumer interacted with any of the com- munication channels. • Interaction: Variable indicating the type of interaction, “conversion” when the in- teraction leads to a positive conversion, “impression” otherwise. • Conversion: Flag with a value of 1 when there is a positive conversion. • Channel: Communication channel through which the interaction occurred, in- cluding Instagram, Online Display, Paid Search, Facebook, and Online Video. Table 1 displays sample of the database structure to be utilized. For each cookie, assumed to represent an individual consumer, the interactions with the available communication channels are observed along with the corresponding dates. Additionally, the last inter- action for each consumer determines whether there is a positive or negative conversion for that user. 18 Cookie Timestamp Interaction Channel Conversion 00073CFE3FoFCn70fBhB3kfon 2018-07-21T10:52:04Z impression Instagram 0 00079hhBkDF3k3kDkiFi9EFAD 2018-07-10T11:11:24Z impression Paid Search 0 0007iiAiFh3ifoo9Ehn3ABB0F 2018-07-09T16:57:18Z impression Instagram 0 0007iiAiFh3ifoo9Ehn3ABB0F 2018-07-17T16:00:58Z impression Facebook 0 0007iiAiFh3ifoo9Ehn3ABB0F 2018-07-17T16:01:44Z impression Facebook 0 0007iiAiFh3ifoo9Ehn3ABB0F 2018-07-18T17:17:24Z impression Instagram 0 0007o0nfoh9o79DDfD7DAiEnE 2018-07-12T08:07:08Z impression Facebook 0 0007oEBhnoF97AoEE3BCkFnhB 2018-07-06T13:45:29Z conversion Paid Search 1 00090n9EBBEkA000C7Cik999D 2018-07-05T06:53:53Z conversion Facebook 1 000A9AfDohfiBAFB0FDf3kDEE 2018-07-24T00:09:46Z impression Online Video 0 000A9AfDohfiBAFB0FDf3kDEE 2018-07-27T21:08:17Z impression Online Video 0 000A9AfDohfiBAFB0FDf3kDEE 2018-07-27T22:36:07Z impression Online Video 0 Table 1: Original Data Sample During the study period, the total number of conversions fluctuated, with a peak of 13,657 conversions recorded on July 29th; with a significant decline in conversions was observed by July 31st. On the other hand, figure 2 illustrates that the highest number of positive conversions occurred between July 11th and 19th, 2018, displaying a decreasing trend in the subsequent days, except for July 28th. Figure 2: Daily Positive Conversions (July 2018) 19 The influence of each communication channel on consumers ultimate conversion is a pivotal aspect of our outlined objectives. Consequently, a detailed analysis of the inter- action patterns becomes crucial. As depicted in Figure 3, it is evident that Paid Search has the highest number of interactions, followed by Facebook with 28% and 29%, respec- tively. In contrast, Online Video registers the lowest number of interactions. In particular, the substantial proportion of interactions on Paid Search and Facebook does not neces- sarily ensure that these channels will be the most decisive in determining consumer conversion. Figure 3: Proportion of interactions by channel 3.2 Samples Definition The dataset was divided into two portions: 90% of the total records were allocated for the training sample, while the remaining 10% were reserved for the test sample. 
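A minimal scikit-learn sketch of this stratified 90/10 split, together with the imbalance-aware metrics of Section 2.5, is shown below. Here X, y and model are placeholders for the processed sequences, the conversion labels and a fitted classifier, and the 0.5 threshold is used only for illustration.

from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score, average_precision_score

# Stratified split so both samples preserve the low positive-conversion rate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=1119)

# Evaluate a fitted model on the held-out 10%
y_score = model.predict(X_test).ravel()          # predicted conversion probabilities
y_pred = (y_score >= 0.5).astype(int)            # illustrative decision threshold

print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1       :", f1_score(y_test, y_pred))
print("PR AUC   :", average_precision_score(y_test, y_score))

average_precision_score is one common way to summarize the area under the precision-recall curve, which is the PR AUC value reported for the models in the following sections.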
3.3 Model Parameterization

3.3.1 Sequence Length

As LSTM networks fall under the category of recurrent neural networks, it becomes imperative to examine the sequence length, denoting the maximum number of interactions a client undergoes. Since the sequence length dictates the extent to which the network can retain historical information and propagate gradients over time, opting for a longer sequence helps the model capture prolonged dependencies and acquire intricate patterns. However, longer sequences increase the required computational resources and the risk of gradient issues. Conversely, a shorter sequence length expedites training and mitigates gradient problems, yet compromises the contextual depth and expressive capacity of the model. Therefore, the selection of both the maximum and minimum number of interactions considered in the models is crucial.

Table 2 presents the distribution of clients according to the length of the interaction sequences found in the database. Although approximately 70% of the customers had between one and two interactions with the touch points, we focus only on the clients who interacted at least three times with the potential touch points. This focus is due to the fact that, as previously mentioned, shorter sequences could compromise the model's capacity. Additionally, sequences consisting of one or two interactions might be associated with clients who already have a particular engagement with the product or a latent necessity to acquire it.

Sequence length   Negative conversion   Positive conversion   Total     Percentage dist.
1                 116,047               7,417                 123,464   0.55
2                 48,554                3,326                 51,880    0.22
3                 22,903                1,950                 24,853    0.10
4                 11,927                1,179                 13,106    0.06
5                 7,015                 802                   7,817     0.03
6                 4,346                 585                   4,931     0.02
7                 2,821                 459                   3,280     0.014
8                 2,043                 313                   2,356     0.010
9                 1,439                 266                   1,705     0.007
10                1,059                 196                   1,255     0.005
11                819                   180                   999       0.004
12                637                   124                   761       0.003
> 12              2,859                 842                   3,701     0.15

Table 2: Distribution of Clients Across Different Sequence Lengths

It is important to define the minimum and maximum number of interactions required to train the model. We define min_seq_length as the minimum number of interactions a client must have with the communication channels to be considered for model training. Besides, max_seq_length is defined as the maximum number of interactions to be considered for each individual. Thus, if we take the combination (3, 7), only individuals with at least three interactions with the communication channels are considered, and from the total of their interactions the model is trained on the last seven observed.

Within the subset of customers who had at least three interactions with the communication channels, 87% had a maximum of eight interactions. Therefore, to ensure a representative population for training the model, the behavior of the models is analyzed for the possible combinations of min_seq_length and max_seq_length between three and eight.

Figure 4 presents a comparison of the PR AUC achieved by the LSTM and Encoder-Decoder models. The plot displays the PR AUC metric for both models across all combinations of minimum (rows) and maximum (columns) sequence lengths ranging from 3 to 8. The slope of the arrow highlights the model that achieved the highest PR AUC for each combination of minimum and maximum sequence lengths. In this way, a positive slope indicates that the Encoder-Decoder model achieved a higher PR AUC compared to the LSTM for the observed combination.
Figure 4: Comparison of PR AUC Across Different Sequence Lengths Analyzing the models individually, the combinations of sequence lengths that gener- ate the highest PR AUC for the LSTM are as follows {(3, 6), (3, 8), (5, 7), (5, 8), (6, 8)}. Meanwhile, for the Encoder-Decoder the best combinations of minimum and maximum 22 sequence lengths are {(5, 8), (6, 5), (7, 6), (8, 3), (8, 8)}. In general, Encoder-Decoder achieves better results when considering higher minimum sequences length than those observed in the best combinations obtened for the LSTM. On the other hand, we can compare the PR AUC of the models regardless of where the highest values of each of them are. We see that LSTM has a higher PR AUC than the Encode-Decoder in sequences with larger minimum and maximum length. For example, when the minimum sequence length is greater than four and the maximum sequence length is greater than six, LSTM perform better. Meanwhile, for cases in which the minimum sequence length is smaller than four and the maximum sequence length is smaller than six, the encoder is better, although in those quadrants is not where this model achieves the highest PR AUC. On the other hand, when comparing the execution times of the models, the Encoder- Decoder demonstrates superior performance, delivering the shortest execution times. This is a significant advantage, as the datasets used to train these models are typically large. Furthermore, sequential models generally have high execution times, making this factor a critical consideration in practice when selecting a model. Figure 5: Comparison of Execution Time Across Different Sequence Lengths 23 3.3.2 Hyperparameters Several hyperparameters are essential for defining the structure and behavior of the LSTM network, as well as the complementary layers required to fit the model. Specifi- cally, for the LSTM layer, the most significant hyperparameters included in the Python function LSTM, from Tensor Flow package, are as follows: Neurons. Determines the number of neurons in a layer, moreover, this parameter defines the dimension of the output space and input shape, including the number of time steps and features per time step for sequences. It is widely recognized that insufficient hidden units may hinder the model’s ability to capture data non linearity’s, while an excess can be managed with regularization. The number of hidden layers is determined by background knowledge and experimentation. Return Sequences. When set to True, the LSTM layer returns the hidden state output for each input time step; besides, when is set to False, it returns the final hidden state value at the last time step. Return State. If True, the LSTM layer returns the last cell state, in addition to the output. Activation. Defines the activation function to use for the LSTM cells. Dropout. It is used to prevent overfitting during training by deactivating ran- domly selected neurons, thereby reducing the model’s sensitivity to the specific weights of individual neurons. This parameter is a value between 0 and 1, repre- senting the fraction of units to drop during the linear transformation of the inputs. Recurrent Dropout. Dropout rate for recurrent connections also helps prevent overfitting and is specifically tailored for gated architectures. In the case of LSTM memory cells, it is applied to the updates, taking a value between 0 and 1, repre- senting the fraction of units to drop during the linear transformation of the recur- rent state. Kernel Regularizer. 
Typically regularization involves adding penalty factors to network layers to modify weight propagation, aiding optimal model convergence. L1 regularization applies the absolute values of weights, while L2 regularization applies the squares of weights. When Kernel regularizer is used, penalty terms are 24 added to the kernel layers, affecting the weights of the neural network, while the bias component remains unchanged. Activity Regularizer. Regularization method applied to the output of the layer. If True, the states for the model will be preserved between batches during training. Furthermore, to prevent overfitting in the model, Batch Normalization layers were added to the model architecture. This technique is employed in deep neural networks to stabilize training by normalizing the inputs of each batch so that they have a mean near zero and a standard deviation close to one. By minimizing internal covariate shift, it ensures that the input distribution to each layer stays consistent throughout training, thereby enhancing the model’s stability and promoting faster convergence. After defin- ing the architecture of the LSTM model, the next step is to compile it, which can be done using the Python function compile. This function prepares the model for training by configuring the loss function, optimizer, and evaluation metrics. Binary Cross Entropy loss, a common choice for binary classification problems, was employed, as outlined in equation (2). The Adam optimizer, a stochastic gradient descent method that utilizes adaptive estimation of first-order and second-order moments, was chosen for optimiza- tion. Finally, accuracy was specified as the metric in the compile function to monitor the model’s performance during training and evaluation. Subsequently, the fitting process of the model needs to be configured. The fit method is employed to input the training data into the model, specifying the number of epochs and the batch size. Through this process, the model iteratively adjusts its weights based on the training data in order to minimize the loss function. Next parameters required to be configured for this step, Learning Rate. is generally set to a small positive value, usually between 0.0 and 0.1; controls the frequency of parameter updates in response to the loss gradi- ent and dictates the magnitude of adjustments made to the model’s weights after each training batch. A higher learning rate speeds up training but risks improper convergence or divergence. Conversely, a lower learning rate facilitates smoother convergence but prolongs the training process due to smaller steps towards the loss function’s minimum. Number of Epochs. it refers to the total number of complete passes through the 25 entire training dataset during the training process. In simple terms, one epoch means that every sample in the training dataset has been used once to update the internal model parameters. Usually, is increased until the validation accuracy begins to decline, indicating potential overfitting. Batch Size. correspond to the defining the number of samples processed before updating the internal model parameters. Controls how many training samples are processed in one iteration of the model’s training. 3.4 Model structure and computational tools This section describes the architecture and tools used to develop the models. The models and data wrangling were implemented using the Python programming language, based on the implementation detailed in (Jeremite, 2019). 
This base code was modified to in- clude changes highlighted in the following sections. Among the key libraries used in the implementation are TensorFlow, Keras, Scikit-learn, and Matplotlib. 3.4.1 Data Processing Several steps were applied to ensure that the data are properly formatted and balanced for training. Key steps include data cleaning, filtering, tokenizing path data, and split- ting the dataset into training and testing sets. As showed in Listing 1, categorical data is converted into vectors, filtering the data to include only those cases where the se- quences meet a defined sequence length. Then, path data is tokenized into sequences of integers using the fit_on_texts function from the Tokenizer class. After that, the pad sequences function is used to pad or truncate the sequences to a specified length defined by the parameters max_seq_length and min_seq_length. Data is split into training and testing sets, and the sequences are converted into one-hot encoded arrays, making them suitable for input into a machine learning model. Additionally, the implementation of the Synthetic Minority Oversampling Technique (SMOTE) was included in the code as a possible solution to address class imbalance in the data. 26 1 def process_data(dt, max_seq, min_seq): 2 cat(dt, ’path’, "leng_path", s=’>’) 3 dt_original = data[data.leng_path >= min_seq] 4 dt_original = dt_original.reset_index() 5 y = dt_original.total_conversions 6 text = dt_original.path 7 8 tokenizer = Tokenizer() 9 tokenizer.fit_on_texts(text) 10 vocab_size = len(tokenizer.word_index) + 1 11 encoded_docs = tokenizer.texts_to_sequences(text) 12 padded_docs = tf.keras.utils.pad_sequences(encoded_docs, maxlen = max_seq, padding =’pre’, truncating = ’pre’) 13 14 X_train, X_test, Y_train, Y_test = train_test_split(padded_docs, y, test_size = 0.1, random_state = 1119, stratify = y) 15 _, paths = train_test_split(text, y, test_size = 0.1, random_state = 1119, stratify = y) 16 17 X = np.array([to_categorical(doc, num_classes = vocab_size) for doc in padded_docs], ndmin = 3) 18 X_tr = np.array([to_categorical(doc, num_classes = vocab_size) for doc in X_train], ndmin = 3) 19 X_te = np.array([to_categorical(doc, num_classes = vocab_size) for doc in X_test], ndmin = 3) 20 21 oversampler = SMOTE() 22 X_Train_O, Y_Train_O = oversampler.fit_resample(X_train, Y_train) 23 24 return [X, y, X_train, X_test, Y_train, Y_test, X_Train_O, Y_Train_O] Listing 1: Data Wrangling 27 3.4.2 Attention Layer The attention mechanism is implemented using a series of sequential layers, as illustrated in Figure 6. Figure 6: Architecture of the Attention Layer As illustrated in Listing 2, line eight, a Repeat Vector layer repeats the input vector st−1, representing the hidden state from the previous time step, for the length of the sequence. This repetition ensures that the shape of the input vector matches the shape of the hidden states. Next, the Concatenate layer combines st−1 with the hidden states ht along the last axis, effectively merging the past hidden state with the current hidden states. Following this, a hyperbolic tangent activation function,tanh, is applied to the concate- nated tensor. This operation computes the intermediate energies e, which quantify the significance of each hidden state relative to the current state. Then, a ReLU activation function is applied to transform the intermediate energies e into the energies vt, as de- fined in equation (6). These transformed energies are subsequently employed to calculate the attention weights. 
3.4.2 Attention Layer

The attention mechanism is implemented using a series of sequential layers, as illustrated in Figure 6.

Figure 6: Architecture of the Attention Layer

As illustrated in Listing 2, line eight, a RepeatVector layer repeats the input vector st−1, representing the hidden state from the previous time step, for the length of the sequence. This repetition ensures that the shape of the input vector matches the shape of the hidden states. Next, the Concatenate layer combines st−1 with the hidden states ht along the last axis, effectively merging the past hidden state with the current hidden states. Following this, a hyperbolic tangent activation function, tanh, is applied to the concatenated tensor. This operation computes the intermediate energies e, which quantify the significance of each hidden state relative to the current state. Then, a ReLU activation function is applied to transform the intermediate energies e into the energies vt, as defined in equation (6). These transformed energies are subsequently employed to calculate the attention weights.

Although the analyses conducted do not quantify the impact of the time-decay factor, the original code includes a Subtract layer that refines the energy values by subtracting a time-decay factor t0 from them. After this adjustment, the Activation layer applies a Softmax activation function to normalize the energies, resulting in attention weights at that sum to 1 across the sequence length, as described in equation (9). Finally, the Dot layer computes the dot product of the attention weights at and the hidden states ht, producing the context vector s. This context vector is generated by weighting each hidden state ht according to its respective attention weight at, as indicated in equation (8).

1  def one_step_attention(h_t, s_prev, t):
2      concatenator = Concatenate(axis = -1)
3      dense_tanh = Dense(10, activation = "tanh")
4      dense_relu = Dense(1, activation = "relu")
5      softmax_activation = Activation(softmax)              # custom softmax over the time axis (axis = 1), defined in the base code
6      dot_product = Dot(axes = 1)
7
8      s_prev_repeated = RepeatVector(h_t.shape[1])(s_prev)  # repeat s_prev once per time step
9      concatenated = concatenator([s_prev_repeated, h_t])
10     e = dense_tanh(concatenated)                          # intermediate energies e
11     v_t = dense_relu(e)                                   # energies v_t, equation (6)
12     v_t = Subtract(name = 'timeDecay')([v_t, t])          # subtract the time-decay factor
13     a_t = softmax_activation(v_t)                         # attention weights, equation (9)
14     s = dot_product([a_t, h_t])                           # context vector, equation (8)
15
16     return s

Listing 2: Attention Layer

3.4.3 Long-Short Term Memory

Three input layers are defined: one for the main sequence data, another for the initial hidden state, and a third for incorporating an optional time-decay factor. Subsequently, a four-layer stacked LSTM network structure is established, each layer configured with 420 neurons, a recurrent dropout rate of 0.05, and an L2 kernel regularizer. Additionally, since customer interactions with various touchpoints may exhibit temporal dependencies where the entire sequence and order of touchpoints are significant, the return sequences parameter has been set to true. This ensures that the network produces an output at each time step, enabling the capture of contributions from each touchpoint in the sequence. This capability allows subsequent layers, such as the attention and dense layers, to effectively analyze these dependencies. Furthermore, batch normalization is applied after each LSTM layer to stabilize and accelerate training by normalizing the outputs. Listing 3 shows an extract of the code used to create this architecture.

from tensorflow.keras.layers import Input, LSTM, BatchNormalization, Dense, Flatten
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l2

# Input layers
input_att = Input(shape=(max_seq_length, vocab_size))
s_prev = Input(shape=(240,))
t = Input(shape=(max_seq_length, 1))

# Stacked LSTM layers with BatchNormalization applied after each one
def build_lstm_block(inputs):
    x = LSTM(240, dropout=0.05, recurrent_dropout=0.05,
             kernel_regularizer=l2(0.01), return_sequences=True)(inputs)
    return BatchNormalization()(x)

h_1 = build_lstm_block(input_att)
h_2 = build_lstm_block(h_1)
h_3 = build_lstm_block(h_2)
h_4 = build_lstm_block(h_3)
ht = BatchNormalization()(h_4)

# Attention context computation (one_step_attention is defined in Listing 2)
s = one_step_attention(ht, s_prev, t)
c = Flatten()(s)

# Final output layer
out_att = Dense(1, activation="sigmoid")(c)

# Build and compile the model
model = Model([input_att, s_prev, t], out_att)
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

Listing 3: LSTM

The attention mechanism calculates a context vector as described previously. To produce the final model output, a dense layer with a sigmoid activation function is appended, yielding a single output value suitable for binary classification tasks.
Subsequently, the model architecture is defined, specifying the inputs and the final output. The model is then compiled using the Keras compile function, which configures the learning process with a binary cross-entropy loss function, the Adam optimizer, and accuracy as the evaluation metric. Finally, the model is trained using the fit function from Keras, employing 175 epochs, a batch size of 90, and a learning rate of 0.001. Figure 7 illustrates the previously defined structure.

Figure 7: Architecture of the LSTM model

3.4.4 Encoder-Decoder

The initial structure is similar to the one described earlier: the same three inputs are defined, and the first part of the network consists of a two-layer stacked LSTM network. Each LSTM layer is configured with 380 neurons, a recurrent dropout rate of 0.05, and an L2 kernel regularizer, with return sequences set to true and batch normalization applied after each layer.

The change in the structure is introduced with an encoder LSTM layer that uses a ReLU activation function and return sequences set to false. This setup is appropriate when the aim is to emphasize the overall sequence impact rather than individual time steps. Moreover, this approach simplifies the model architecture, and summarizing sequences can improve efficiency in both training and inference, thereby reducing computational complexity. In addition, this layout is used when the main objective is to determine whether a conversion happens at the end of the customer's interactions.

Another change occurs in the structure of the attention layer: the decoder, obtained by repeating the encoded summary over the sequence length, is defined before the dot product with the attention weights that produces the context vector s. The rest of the network architecture remains similar to the LSTM described earlier. The primary objective is to evaluate the significance of the channels based on the condensed sequence information.

# Input layers
input_att = Input(shape=(max_seq_length, vocab_size))
s_prev = Input(shape=(neuronasEncoder,))   # neuronasEncoder = 380
t = Input(shape=(max_seq_length, 1))

def build_lstm_block(inputs):
    x = LSTM(neuronasEncoder, dropout=0.05, recurrent_dropout=0.05,
             kernel_regularizer=l2(0.01), return_sequences=True)(inputs)
    return BatchNormalization()(x)

h_1 = build_lstm_block(input_att)
h_2 = build_lstm_block(h_1)
ht = BatchNormalization()(h_2)

# Encoder: condenses the whole sequence into a single vector
encoded = LSTM(neuronasEncoder, activation='relu', return_sequences=False)(ht)

# Decoder: the encoded summary repeated for every time step
decoder = RepeatVector(max_seq_length)(encoded)

# Attention energies computed from the previous state and the decoded summary
# (the attention layers are the same as those in Listing 2)
concatenator = Concatenate(axis=-1)
dense_tanh = Dense(10, activation="tanh")
dense_relu = Dense(1, activation="relu")
softmax_activation = Activation(softmax)   # custom softmax over the time axis

s_prev_repeated = RepeatVector(max_seq_length)(s_prev)
concatenated = concatenator([s_prev_repeated, decoder])
e = dense_tanh(concatenated)
v_t = dense_relu(e)
v_t = Subtract(name='timeDecay')([v_t, t])
a_t = softmax_activation(v_t)

# Context vector and final output
s = Dot(axes=1)([a_t, decoder])
c = Flatten()(s)
out_att = Dense(1, activation="sigmoid", name='single_output')(c)

Listing 4: Encoder-Decoder

Figure 8 illustrates the architecture of this model.

Figure 8: Architecture of the Encoder-Decoder model

4 Results and Discussion

4.1 Data Processing

This section describes the data transformation process required to structure the database for the implementation of the proposed models. The initial step involves organizing the database so that each row contains the information pertaining to an individual client. To achieve this, it is necessary to aggregate the various channels that a client has visited into a new column named Path.
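As a rough illustration (not part of the original implementation), this aggregation could be done with pandas as follows; the column names cookie and channel are assumptions, and interactions are assumed to be already ordered by timestamp.

import pandas as pd

# Assumed raw structure: one row per interaction, ordered by time of contact.
raw = pd.DataFrame({
    "cookie":  ["A", "A", "A", "B", "B"],
    "channel": ["facebook", "instagram", "searchpaid", "video", "instagram"],
})

# Aggregate every client's channels, in order, into a single Path string.
paths = (raw.groupby("cookie")["channel"]
            .apply(lambda s: ">".join(s))
            .reset_index(name="path"))
print(paths)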
The subsequent stage entails computing the Last time lapse column, representing the elapsed time in seconds between a client's latest interaction with a channel and their previous interactions with other channels. Finally, the Conversion column indicates whether, after all the interactions a client had, a conversion took place. The following is a demonstrative example of the database structure achieved through the described transformations.

Cookie | Path | Last time lapse | Conversion
0007iiAiFh3ifoo9Ehn3ABB0F | facebook>instagram>searchpaid>facebook | 588241,532798,515807,0 | 1
00079hhBkDF3k3kDkiFi9EFAD | video>instagram>instagram | 18337,178,0 | 0
00073CFE3FoFCn70fBhB3kfon | video>facebook>instagram>searchpaid>facebook | 588241,532798,515807,0 | 1
000A9AfDohfiBAFB0FDf3kDEE | facebook>video>searchpaid>video>instagram | 38337,7898,0 | 0
oooh73D7h9hCh03EfhBBhECnB | facebook>video>searchpaid>facebook | 288241,532798,600504,895807,0 | 0
oooh7FDi0hBnEDBii70hfEf93 | video>video>searchpaid>video>instagram | 157941,157304,130504,90007,0 | 0
ooohkBnnfDooo3hfCnfDfiEiB | instagram>video>searchpaid>video | 607976,402797,110504,0 | 0
oooiA9fi99FiAioAo97DohkF3 | facebook>video>searchpaid>video | 157941,105991,100504,96407,0 | 0
oooiCAf0Dno3Dfi7h7io9kCk9 | facebook>video>searchpaid>video>video>facebook | 607976,139794,41504,0 | 1
oooiFD9977iFC9DC3E3D000Ff | instagram>video>searchpaid>facebook>instagram>instagram | 157751,132995,100032,0 | 0
oooik3A7A7FA9oof3hDfin7CB | video>video>searchpaid>video>instagram | 237941,190007,132799,130504,118241,0 | 1
oook0nnhoCo0BoEAho7E9nfEC | instagram>video>searchpaid>video | 607976,593388,588241,319501,210804,0 | 1
ooooohAFofEnonEikhAi3fF9o | facebook>video>searchpaid>video | 157941,128798,110504,90007,0 | 0
000C9BiBFhoFhC7noEFAA7no7 | Online Video>Online Video>Online Video>Online Video>Online Video>facebook>searchpaid>instagram | 662205,659121,605944,417688,396693,345469,315380,0 | 0

Table 3: Overview of Input Datasets

After organizing the data as described earlier, the subsequent step requires identifying clients with a minimum of three interactions with potential touchpoints. Following this, essential adjustments are applied to both the Path column and the Last Time Lapse column.

To process the paths, it is necessary to implement a tokenizer transformation. The main objective of tokenization is to convert continuous text into discrete units, facilitating analysis and processing; this type of transformation is a crucial step in natural language processing and text analysis. In this instance, the LSTM network architecture requires its input via channel tokenization. To accomplish this, the Keras Tokenizer class, along with its corresponding methods, was used. Notably, the texts_to_sequences method was employed to convert the paths, expressed as texts, into sequences of integers, guided by the vocabulary built by the tokenizer.
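For illustration only, the following minimal snippet shows the behaviour of the Keras Tokenizer on two of the example paths; the exact integer assigned to each channel depends on channel frequencies in the full dataset, so the mapping in the comments is merely indicative.

from tensorflow.keras.preprocessing.text import Tokenizer

paths = ["facebook>instagram>searchpaid>facebook",
         "video>instagram>instagram"]

# '>' is among the Tokenizer's default filter characters, so each path
# is split into individual channel names automatically.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(paths)
print(tokenizer.word_index)                 # channel -> integer index (frequency ordered)
print(tokenizer.texts_to_sequences(paths))  # e.g. [[2, 1, 3, 2], [4, 1, 1]]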
Each channel within the paths received a unique integer value, resulting in the conversion of each path into a density vector, as exemplified in the following illustration.

Path | Density vector
facebook, instagram, searchpaid, facebook | [2, 4, 3, 2]
video, facebook, instagram, searchpaid, facebook | [1, 2, 4, 3, 2]
facebook, video, searchpaid, video, instagram | [2, 1, 3, 1, 4]
facebook, video, searchpaid, facebook | [2, 1, 3, 2]
video, video, searchpaid, video, instagram | [1, 1, 3, 1, 4]
instagram, video, searchpaid, video | [4, 1, 3, 1]
facebook, video, searchpaid, video | [2, 1, 3, 1]
facebook, video, searchpaid, video, video, facebook | [2, 1, 3, 1, 1, 2]
instagram, video, searchpaid, facebook, instagram, instagram | [4, 1, 3, 2, 4, 4]
video, video, searchpaid, video, instagram | [1, 1, 3, 1, 4]
instagram, video, searchpaid, video | [4, 1, 3, 1]
video, video, video, video, video, facebook, searchpaid, instagram | [5, 5, 5, 5, 5, 2, 3, 4]

Table 4: Channel Tokenization

In order to address the diverse lengths of paths across clients, it is essential to establish a framework that standardizes these sizes. The initial phase of this process involves determining the maximum size of an interaction sequence, denoted as max_seq_length. This parameter is a prerequisite for the pad_sequences function in the Keras library. Using this function enables the creation of uniform-length paths by either completing or truncating them. For paths shorter than max_seq_length, padding is applied with a specified value until the desired length is reached. In contrast, when sequences exceed the defined max_seq_length, truncation takes place to adhere to the specified length, resulting in the removal of values from the beginning of the sequence. Based on the previous example, if max_seq_length is set to seven, the density vectors are transformed into uniform-length vectors, as shown below.

Path | Density vector | Uniformed vector
facebook, instagram, searchpaid, facebook | [2, 4, 3, 2] | [0, 0, 0, 2, 4, 3, 2]
video, facebook, instagram, searchpaid, facebook | [1, 2, 4, 3, 2] | [0, 0, 1, 2, 4, 3, 2]
facebook, video, searchpaid, video, instagram | [2, 1, 3, 1, 4] | [0, 0, 2, 1, 3, 1, 4]
facebook, video, searchpaid, facebook | [2, 1, 3, 2] | [0, 0, 0, 2, 1, 3, 2]
video, video, searchpaid, video, instagram | [1, 1, 3, 1, 4] | [0, 0, 1, 1, 3, 1, 4]
instagram, video, searchpaid, video | [4, 1, 3, 1] | [0, 0, 0, 4, 1, 3, 1]
facebook, video, searchpaid, video | [2, 1, 3, 1] | [0, 0, 0, 2, 1, 3, 1]
facebook, video, searchpaid, video, video, facebook | [2, 1, 3, 1, 1, 2] | [0, 2, 1, 3, 1, 1, 2]
instagram, video, searchpaid, facebook, instagram, instagram | [4, 1, 3, 2, 4, 4] | [0, 4, 1, 3, 2, 4, 4]
video, video, searchpaid, video, instagram | [1, 1, 3, 1, 4] | [0, 0, 1, 1, 3, 1, 4]
instagram, video, searchpaid, video | [4, 1, 3, 1] | [0, 0, 0, 4, 1, 3, 1]
facebook, video, searchpaid, video | [2, 1, 3, 1] | [0, 0, 0, 2, 1, 3, 1]
video, video, video, video, video, facebook, searchpaid, instagram | [1, 1, 1, 1, 1, 2, 3, 4] | [1, 1, 1, 1, 2, 3, 4]

Table 5: Padding-Truncated Density Vectors
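The padding and truncation behaviour of Table 5 can be reproduced with a minimal call to pad_sequences, as sketched below using two of the density vectors above.

import tensorflow as tf

docs = [[2, 4, 3, 2],                 # shorter than max_seq_length: padded with zeros
        [1, 1, 1, 1, 1, 2, 3, 4]]     # longer than max_seq_length: truncated at the start

padded = tf.keras.utils.pad_sequences(docs, maxlen=7, padding="pre", truncating="pre")
print(padded)
# [[0 0 0 2 4 3 2]
#  [1 1 1 1 2 3 4]]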
4.2 Heuristic Models

According to Table 2, among the converting customers (7,417), 42% had a single interaction with one of the touchpoints; of these, 32% interacted with Paid Search, while 27% interacted exclusively through Facebook. Conversely, 32% of customers had two interactions with the touchpoints; among them, 24% engaged exclusively with Paid Search, while 16% interacted only through Facebook. Overall, 77% of these customers had at least one interaction with either Paid Search or Facebook.

Comparing this distribution with the channel attribution provided by the heuristic models in Figure 9, we find a consistent trend: the majority of these models attribute greater importance to the Paid Search and Facebook channels. The results from the heuristic models demonstrate how such models tend to skew their conclusions by relying only on partial information about customer behavior. Moreover, as previously mentioned, customers who had only one or two interactions are typically not the most relevant for the business case under analysis and were, in fact, excluded from the study due to their predisposition to acquire the offered good or service. This leads to the conclusion that, in cases where the objective is to attract new customers, who generally require more interactions with the product before converting, heuristic models may yield partial and inadequate conclusions regarding the allocation of resources to communication channels.

On the other hand, while Markov models are a more sophisticated tool, they also have disadvantages compared to sequential models. These include simplifying assumptions, such as hidden states being discrete and finite and observations being conditionally independent given the hidden states. Moreover, they are susceptible to both overfitting and underfitting, as selecting the appropriate number of hidden states and prior distributions requires careful consideration of the parameters, and they depend on the quantity of observed data. Furthermore, neither Markovian nor heuristic models are capable of generating a prediction of the conversion rate.

Figure 9: Contribution based on Heuristic Models

4.3 Model 1: Long-Short Term Memory

This section presents the results obtained from the original sequential model applied in Li et al. (2018). Among the advantages of this model are its ability to consider sequential data and to generate predictions of the customer conversion rate.

False negatives in this problem correspond to customers who convert but whom the model predicts as non-converters. In turn, recall is the ratio of accurately predicted converting customers to the total number of converting customers, making it a crucial metric for evaluation. This significance stems from the relatively low cost of reaching out to non-converting customers compared to the potential loss incurred by failing to engage with converting ones. In addition, contactability depends on the channels used to propagate the advertising campaign. Consequently, understanding the channels through which converting customers interacted is vital for informed decision-making and resource allocation optimization. Hence, the emphasis lies on seeking a model with a good recall measure rather than on maximizing precision. Furthermore, considering the inherent imbalance in the problem under investigation, the ROC AUC alone may not be a suitable metric to evaluate model performance.
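For reference, these evaluation metrics can be computed with scikit-learn roughly as in the following sketch; y_true and y_prob are illustrative names for the test labels and the predicted conversion probabilities, and the 0.5 decision threshold is only an example.

from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score, average_precision_score)

y_pred = (y_prob >= 0.5).astype(int)   # illustrative decision threshold

precision = precision_score(y_true, y_pred)
recall    = recall_score(y_true, y_pred)
f1        = f1_score(y_true, y_pred)
roc_auc   = roc_auc_score(y_true, y_prob)
pr_auc    = average_precision_score(y_true, y_prob)  # common estimate of the PR AUC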
As discussed previously, the combinations of minimum and maximum sequence lengths that generated the best preliminary results were (3, 6), (3, 8), (5, 7), (5, 8), and (6, 8). Table 6 presents the main indicators to be considered in the selection of the final model for these combinations: the execution time, precision, recall, F1 score, positive class distribution, PR AUC, and ROC AUC over the test sample. It is worth mentioning that Delta has been defined as the disparity between the positive class percentage and the area under the precision-recall curve (PR AUC). It can be observed that the combination (5, 8) achieves the highest PR AUC. Although other combinations, such as (5, 7) and (6, 8), have a higher recall value of 0.66, taking into account the execution time and the other performance indicators, (5, 8) was selected as the optimal combination.

Min. | Max. | Exec. (h) | Precision | Recall | F1 | Positive class | PR AUC | ROC AUC
3 | 6 | 3.1 | 0.17 | 0.46 | 0.24 | 0.11 | 0.33 | 0.59
3 | 8 | 3.9 | 0.16 | 0.47 | 0.24 | 0.11 | 0.27 | 0.59
5 | 7 | 1.6 | 0.18 | 0.66 | 0.28 | 0.14 | 0.30 | 0.59
5 | 8 | 1.8 | 0.19 | 0.54 | 0.28 | 0.14 | 0.38 | 0.58
6 | 8 | 1.3 | 0.18 | 0.66 | 0.29 | 0.16 | 0.26 | 0.56

Table 6: Evaluation Metrics for LSTM Model Across Different Sequences

To improve prediction performance, alternatives commonly used for imbalanced datasets were explored. Among them, an oversampling methodology, the Synthetic Minority Oversampling Technique (SMOTE), a notable example of this family of methods (Bao et al., 2020), was attempted. SMOTE generates additional samples for minority classes through linear interpolation; this involves creating synthetic instances along lines connecting minority class samples and their neighboring instances. The algorithm selects a subset of data from the minority classes, creates synthetic examples, and integrates them into the original dataset. This augmented dataset then acts as a training sample for classification models, effectively mitigating the overfitting concerns associated with simplistic random oversampling techniques. However, the results obtained by implementing this technique showed worse performance than those obtained previously.

Although the prediction performance for the data used is not optimal, the sequential LSTM model provides insights into the conversion rate, offering an advantage not provided by heuristic or Markovian models. The remainder of this section presents the results obtained from the selected model. The learning curves, depicted in Figure 10, illustrate the model's performance on the training dataset. Notably, the learning curves begin to stabilize after approximately 50 epochs. The smooth progression of these curves indicates that the model is improving consistently over time, reflecting stable learning behavior.

Ideally, the validation loss curve should be slightly below the training loss curve, which would indicate effective regularization and strong generalization to unseen data. Furthermore, when the validation and training loss curves closely overlap, it suggests that the model is well balanced, avoiding both overfitting and underfitting, as indicated by similar performance on both the training and validation sets.

Figure 10: Training vs Validation

To address the potential issue of overfitting in the model, K-Fold Cross-Validation with k = 5 was implemented. A key advantage of this methodology is its effectiveness in mitigating overfitting. The technique involves partitioning the data into multiple folds or subsets. In each iteration, one fold is used as the validation set while the model is trained on the remaining folds. This process is repeated such that each fold serves as the validation set exactly once.
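A sketch of this cross-validation loop is shown below. It uses a stratified variant of K-Fold, a natural choice for the imbalanced target (whether the original implementation stratified the folds is not specified), and assumes a hypothetical build_model helper that wraps the construction and compilation of the model in Listing 3; the multiple model inputs are abbreviated to a single numpy array X for readability.

from sklearn.model_selection import StratifiedKFold

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1119)

for fold, (train_idx, val_idx) in enumerate(kfold.split(X, y), start=1):
    model = build_model()          # hypothetical helper wrapping Listing 3
    model.fit(X[train_idx], y[train_idx], epochs=175, batch_size=90, verbose=0)
    loss, acc = model.evaluate(X[val_idx], y[val_idx], verbose=0)
    print(f"Fold {fold}: accuracy = {acc:.2f}, loss = {loss:.2f}")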
Table 7 presents the behavior of the key performance metrics across the different folds. Generally, the results exhibit stability, yet there is some variation in the recall, with values ranging from a minimum of 0.52 to a maximum of 0.81. This variation is also reflected in the area under the precision-recall curve (PR AUC), which is affected by the fluctuations in recall and ranges between 0.17 and 0.36. The average PR AUC is 0.22, which is 16 percentage points lower than the value shown in Table 6. Since cross-validation offers a more reliable performance measure, the lower average PR AUC across folds indicates that the model could benefit from further tuning, regularization, or an evaluation of the representativeness of the test set.

Fold | Accuracy | Loss | Precision | Recall | F1 | ROC AUC | PR AUC
1 | 0.86 | 0.40 | 0.15 | 0.81 | 0.25 | 0.54 | 0.17
2 | 0.85 | 0.42 | 0.18 | 0.65 | 0.28 | 0.57 | 0.19
3 | 0.86 | 0.40 | 0.17 | 0.64 | 0.27 | 0.57 | 0.17
4 | 0.86 | 0.40 | 0.17 | 0.52 | 0.26 | 0.56 | 0.36
5 | 0.86 | 0.41 | 0.20 | 0.55 | 0.29 | 0.58 | 0.20
Average | 0.86 | 0.40 | 0.17 | 0.63 | 0.27 | 0.57 | 0.22

Table 7: Cross-Validation Performed for the LSTM Selected Model

As shown in Figure 11, given the nature of the problem, prioritizing a high recall value over precision is preferred. In this case, the precision is 0.19, while the recall is 0.54. This recall value indicates that the model is able to identify 0.54 of the cases with positive conversion; in other words, 0.46 of the positive conversion cases are not being correctly classified by the model.

Figure 11: Actual vs Predicted Classifications

On the other hand, Figure 12 demonstrates that the model is effectively identifying positive instances at a rate substantially higher than the base rate of the positive class, which is 0.14. A PR AUC of 0.38 reflects the trade-off between the precision of positive predictions and the ability to identify positive conversions. Although a higher PR AUC would be ideal, a value of 0.38 still indicates that the model is making meaningful predictions, identifying positive cases better than would be expected by chance, despite the imbalance in the dataset.

Figure 12: Performance Across Thresholds

As previously mentioned, the primary objective of attribution modeling extends beyond mere model prediction. A robust representation of dynamic pathways holds significant value for future business decisions and budget optimization in strategic decision-making processes. The relevance of the channels is shown in Figure 13; based on the results obtained in the attention layer for the converting clients, relative weights were assigned. The findings indicate that Instagram, with a contribution of 0.34, has the most significant impact on clients with positive conversions, followed by Online Display with 0.31, Facebook with 0.15, and Paid Search with 0.13.

A variation in the importance of the channels can be observed compared to the results from the heuristic models, which identify Facebook, Paid Search, and Online Video as the most significant channels. For instance, the LSTM sequential model assigns Facebook about half the importance that the heuristic models do. It is important to note that Facebook was the touchpoint with the most interactions not only in the subset of clients who had one or two interactions with communication channels.
Among clients with at least three interactions, who were included in the calibration of the sequential models, 30% of their total interactions occurred via Facebook. These results underscore the value of sequential models in attributing key communication channels, enabling more accurate outcomes by using historical customer behavior.

Figure 13: Contribution of Communication Channels

4.4 Model 2: Encoder-Decoder Modification

As shown in Table 8, the combinations of minimum and maximum sequence lengths that generated the best preliminary results were {(5, 8), (6, 5), (7, 6), (8, 3), (8, 8)}. In this case, the (8, 3) combination has the highest PR AUC; this model was selected as the best option because, in addition to showing the best fitting metrics, it aligns well with business needs in which the collected data often include a large proportion of long interaction sequences.

Min. | Max. | Exec. (h) | Precision | Recall | F1 | Positive class | PR AUC | ROC AUC
5 | 8 | 1.5 | 0.20 | 0.64 | 0.30 | 0.14 | 0.25 | 0.61
6 | 5 | 0.7 | 0.18 | 0.76 | 0.30 | 0.16 | 0.31 | 0.57
7 | 6 | 0.6 | 0.19 | 0.82 | 0.31 | 0.17 | 0.28 | 0.55
8 | 3 | 0.3 | 0.20 | 0.75 | 0.32 | 0.18 | 0.32 | 0.55
8 | 8 | 0.6 | 0.21 | 0.73 | 0.32 | 0.18 | 0.25 | 0.56

Table 8: Evaluation Metrics for Encoder-Decoder Model Across Different Sequences

The two main differences and benefits identified with the modification of the encoder in the original base sequential model are that the Encoder-Decoder model provides better fitting results with long sequences and that, as can be verified, its execution times are generally significantly shorter than those of the previous model.

From Figure 14, a reduction in the loss value can be observed, indicating increasingly accurate model predictions. A decreasing training error indicates effective learning from the training data; moreover, as the number of epochs increases, the learning curves begin to stabilize, indicating that the model's performance is starting to converge. If these curves did not stabilize, it could indicate that the model requires more epochs to reach convergence, or it could be due to overfitting or underfitting issues.

Figure 14: Training vs Validation

On the other hand, Table 9 demonstrates consistent model performance across the five folds, indicating stability with minimal variation in key metrics such as accuracy, loss, and F1 score. The consistent accuracy and relatively stable precision, recall, and ROC AUC values suggest that the model generalizes well across different subsets of the data. This stability implies that the model is not overly sensitive to variations in the training data, which is a positive sign for its robustness and reliability when applied to unseen data.

Fold | Accuracy | Loss | Precision | Recall | F1 | ROC AUC | PR AUC
1 | 0.84 | 0.44 | 0.19 | 0.70 | 0.30 | 0.55 | 0.31
2 | 0.82 | 0.48 | 0.20 | 0.76 | 0.32 | 0.55 | 0.32
3 | 0.82 | 0.46 | 0.21 | 0.71 | 0.32 | 0.56 | 0.21
4 | 0.82 | 0.47 | 0.21 | 0.76 | 0.32 | 0.56 | 0.32
5 | 0.81 | 0.48 | 0.22 | 0.69 | 0.31 | 0.55 | 0.28
Average | 0.82 | 0.46 | 0.20 | 0.72 | 0.31 | 0.55 | 0.29

Table 9: Cross-Validation Performed for the Encoder-Decoder Selected Model

Similar to the previous model's results, due to the nature of the problem it is preferable to prioritize a high recall value over precision. According to Figure 15, the model achieves a precision of 0.20 and a recall of 0.75, indicating that the model correctly identifies 0.75 of the positive conversion cases.
Figure 15: Actual vs Predicted Classifications

Figure 16 shows that the model effectively identifies positive instances at a rate much higher than the 0.18 base rate of the positive class. A PR AUC of 0.32 still indicates that the model is making meaningful predictions and identifying positive cases better than would be expected by chance.

Figure 16: Performance Across Thresholds

Another difference observed with this model is its reduced emphasis on Facebook, leading to its exclusion from the top three most important channels. However, as highlighted by the sequential models analyzed, Instagram and Online Display consistently demonstrate the greatest influence on customer conversion. The results obtained from both sequential models demonstrate that the channels with the highest number of interactions are not necessarily the most significant in the consumer's final decision. Consequently, the insights generated by these models become valuable tools for optimizing resources when designing advertising campaigns.

Figure 17: Contribution of Communication Channels

5 Conclusions

The comparison of sequential models, such as the LSTM and the Encoder-Decoder, shows that channels with the highest number of interactions, like Facebook and Paid Search, are not always the most influential in customer conversions. These models capture long-term dependencies and provide more accurate conversion predictions, offering deeper insights into channel contribution than heuristic models. For example, with the LSTM model it was concluded that Instagram and Online Display were on average the most influential communication channels, thereby fulfilling the first proposed objective.

As part of the second proposed objective, the modification of the Deep Neural Net with Attention for Multi-channel Multi-touch Attribution model through the inclusion of an LSTM Encoder-Decoder was analyzed; it was observed that this modification performed better with longer sequences. This model provided more precise fitting results and shorter execution times, making it suitable for cases with more extended customer interactions and complex interaction patterns across multiple channels.

Regarding the last objective, the PR AUC of the models was analyzed. When comparing the PR AUC of the models across different ranges of sequence lengths, regardless of their highest values, several distinctions can be observed. The PR AUC of the LSTM outperforms that of the Encoder-Decoder in scenarios where the minimum sequence length exceeds four and the maximum sequence length exceeds six. For instance, for the combinations of maximum and minimum sequence (8,5) and (8,6), the PR AUC of the LSTM is higher than that of the Encoder-Decoder by 0.13 and 0.08, respectively.

Conversely, in cases where the minimum sequence length is less than four and the maximum sequence length is less than six, the Encoder-Decoder shows better performance. For example, for the combination of maximum and minimum sequence (3,4), the PR AUC of the Encoder-Decoder is higher than that of the LSTM by 0.03. It is important to note that the Encoder-Decoder's superior performance in these shorter sequence ranges does not correspond to its highest PR AUC values. Nevertheless, the application of the Encoder-Decoder allows accelerated training of the model and yields better results in predicting a conversion in the earlier interactions of the consumer, which allows the user to make decisions without waiting for a longer interaction history before predicting consumer behavior.
Furthermore, both models identified Instagram and Online Display as the most influential channels. These results show that the channels with the highest number of interactions are not necessarily the most significant in the consumer's final decision. Although sequential models provide more robust results, it is crucial to consider the type of information that needs to be collected for proper training.

For future implementations, it is recommended to consider a data set collected specifically for sequential models. Such a data set would provide more personalized data to capture customer interaction patterns over time, allowing for more effective model calibration and improved predictive performance. In this particular case, the accuracy of the conversion rate predictions could have been improved if a publicly available data set specifically designed for calibrating sequential models had been available.

References

Abhishek, V., Fader, P., & Hosanagar, K. (2012). The long road to online conversion: A model of multi-channel attribution. doi: 10.2139/ssrn.2158421

Abirami, S., & Chitra, P. (2020). Chapter fourteen - Energy-efficient edge based real-time healthcare support system (Vol. 117, No. 1; P. Raj & P. Evangeline, Eds.). Elsevier. Retrieved from https://www.sciencedirect.com/science/article/pii/S0065245819300506 doi: 10.1016/bs.adcom.2019.09.007

Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv:1409.0473. Retrieved from https://api.semanticscholar.org/CorpusID:11212020

Bao, F., Wu, Y., Li, Z., Li, Y., Liu, L., & Chen, G. (2020). Effect improved for high-dimensional and unbalanced data anomaly detection model based on KNN-SMOTE-LSTM. Complexity, 2020, 1-17. doi: 10.1155/2020/9084704

Boyd, K., Eng, K. H., & Page, C. D. (2013). Area under the precision-recall curve: Point estimates and confidence intervals. In Machine Learning and Knowledge Discovery in Databases (pp. 451-466). Springer Berlin Heidelberg.

Brownlee, J. (2019). Imbalanced classification with Python: Better metrics, balance skewed classes, cost-sensitive learning. Machine Learning Mastery.

Ferri, C., Hernández-Orallo, J., & Modroiu, R. (2009). An experimental comparison of performance measures for classification. Pattern Recognition Letters, 30(1), 27-38. Retrieved from https://www.sciencedirect.com/science/article/pii/S0167865508002687 doi: 10.1016/j.patrec.2008.08.010

Danaher, P., & Danaher, T. (2013). Comparing the relative effectiveness of advertising channels: A case study of a multimedia blitz campaign. Journal of Marketing Research, 50, 517-534. doi: 10.1509/jmr.12.0241

de Haan, E., Wiesel, T., & Pauwels, K. (2016). The effectiveness of different forms of online advertising for purchase conversion in a multiple-channel attribution framework. International Journal of Research in Marketing, 33(3), 491-507. Retrieved from https://www.sciencedirect.com/science/article/pii/S0167811615001421 doi: 10.1016/j.ijresmar.2015.12.001

Buhalis, D., & Volchek, K. (2021). Bridging marketing theory and big data analytics: The taxonomy of marketing attribution. International Journal of Information Management, 56. doi: 10.1016/j.ijinfomgt.2020.102253
Hastie, T., Tibshirani, R., & Friedman, J. (2017). The elements of statistical learning: Data mining, inference, and prediction. New York, NY, USA: Springer New York Inc.

Huyton, H. (2021). Multitouch attribution modelling. Retrieved from https://www.kaggle.com/code/hughhuyton/multitouch-attribution-modelling (Accessed: 2023)

Jeremite, T. L. (2019). FFDNA.py: Channel attribution model. Retrieved from https://github.com/jeremite/channel-attribution-model/blob/master/FFDNA.py (Accessed: 2023)

Kakalejcik, L., Ferencova, M., Angelo, P., & Bucko, J. (2018). Multichannel marketing attribution using Markov chains. Statistika: Statistics and Economy Journal, 101.

Kumar, V., & Stewart, D. W. (2021). Marketing accountability for marketing and non-marketing outcomes. Emerald Publishing Limited.

Li, N., Arava, S. K., Dong, C., Yan, Z., & Pani, A. (2018). Deep neural net with attention for multi-channel multi-touch attribution. doi: 10.48550/arXiv.1809.02230

Lipton, Z. C., Elkan, C., & Narayanaswamy, B. (2014). Optimal thresholding of classifiers to maximize F1 measure. Berlin, Heidelberg: Springer Berlin Heidelberg.

Sagheer, A., & Kotb, M. (2019). Unsupervised pre-training of a deep LSTM-based stacked autoencoder for multivariate time series forecasting problems. Scientific Reports, 9, 19038. doi: 10.1038/s41598-019-55320-6

Shi, Z., Mamun, A. A., Kan, C., Tian, W., & Liu, C. (2022). An LSTM-autoencoder based online side channel monitoring approach for cyber-physical attack detection in additive manufacturing. Journal of Intelligent Manufacturing. doi: 10.1007/s10845-021-01879-9

Thaichon, P., & Quach, S. (2016). Online marketing communications and childhood's intention to consume unhealthy food. Australasian Marketing Journal (AMJ), 24. doi: 10.1016/j.ausmj.2016.01.007

Thölke, P., Mantilla-Ramos, Y.-J., Abdelhedi, H., Maschke, C., Dehgan, A., Harel, Y., ... Jerbi, K. (2023). Class imbalance should not throw you off balance: Choosing the right classifiers and performance metrics for brain decoding with imbalanced data. NeuroImage, 277, 120253. Retrieved from https://www.sciencedirect.com/science/article/pii/S1053811923004044 doi: 10.1016/j.neuroimage.2023.120253