UNIVERSIDAD DE COSTA RICA
SISTEMA DE ESTUDIOS DE POSGRADO

USO DE MODELOS DE DEEP LEARNING EN LA ESTIMACIÓN DE PROBABILIDADES DE CONVERSIÓN Y CONTRIBUCIÓN DE LOS CANALES DE COMUNICACIÓN EN CAMPAÑAS PUBLICITARIAS

Trabajo Final de Investigación Aplicada sometido a la consideración de la Comisión del Programa de Estudios de Posgrado en Matemática para optar al grado y título de Maestría Profesional en Métodos Matemáticos y Aplicaciones

Alexa Sánchez Brenes

Ciudad Universitaria Rodrigo Facio, Costa Rica
2025

Este trabajo final de investigación aplicada fue aceptado por la Comisión del Programa de Estudios de Posgrado en Matemática de la Universidad de Costa Rica, como requisito parcial para optar al grado y título de Maestría Profesional en Métodos Matemáticos y sus Aplicaciones.

Dr. Alexander Ramírez González, Representante de la Decanatura, Sistema de Estudios de Posgrado
Dr. Maikol Solís Chacón, Profesor Guía
Dr. Hugo Solís Sánchez, Lector
MSc. Juan Felipe González Évora, Lector
Dr. Dario Alberto Mena Arias, Director del Programa de Posgrado
Alexa Sánchez Brenes, Sustentante

Contents

Acta de defensa ii
Summary v
List of Tables vi
List of Figures vii
1 Introduction 1
  1.1 Introduction 1
  1.2 Objectives 2
2 Background of the Study 3
  2.1 Literature Review 3
  2.2 Neural Networks 4
  2.3 Deep Neural Net with Attention for Multi-channel Multi-touch Attribution Model 7
    2.3.1 Deep sequential model 7
    2.3.2 Attention mechanism 10
    2.3.3 Binary classification problem 12
  2.4 Encoder-Decoder 12
  2.5 Accuracy Metrics for Imbalanced Datasets 13
3 Methodology 17
  3.1 Dataset 17
  3.2 Samples Definition 19
  3.3 Model Parameterization 19
    3.3.1 Sequence Length 19
    3.3.2 Hyperparameters 23
  3.4 Model structure and computational tools 25
    3.4.1 Data Processing 25
    3.4.2 Attention Layer 27
    3.4.3 Long-Short Term Memory 28
    3.4.4 Encoder-Decoder 30
4 Results and Discussion 32
  4.1 Data Processing 32
  4.2 Heuristic Models 34
  4.3 Model 1: Long-Short Term Memory 35
  4.4 Model 2: Encoder-Decoder Modification 41
5 Conclusions 45
References 47

Resumen

En este trabajo de investigación se muestran modelos de deep learning para estimar la probabilidad de conversión y la contribución de los canales de comunicación en campañas publicitarias.
La relevancia de este tipo de metodologías surge como respuesta a la necesidad de optimizar los recursos en el diseño de campañas publicitarias, ante el crecimiento en la variedad de medios por los que los consumidores interactúan con la publicidad. Con base en el modelo propuesto en Li et al. (2018), se implementaron modelos secuenciales de redes neuronales, específicamente Long Short-Term Memory (LSTM). Este modelo, además de permitir estimar la probabilidad de conversión de los potenciales clientes que mantienen contacto con una campaña publicitaria, incorpora un mecanismo de atención. Dichos mecanismos de atención permiten estimar la relevancia que tienen los canales de comunicación en la decisión de conversión. Por otra parte, se incorporó una estructura de Encoder-Decoder a la arquitectura original del modelo antes descrito.

Ambos modelos secuenciales mostraron un poder predictivo similar; sin embargo, de acuerdo con el Precision-Recall AUC, el modelo original con redes LSTM fue mejor. Además, los resultados demostraron que los canales con mayor número de interacciones, como Facebook y Paid Search, no siempre son los más influyentes en la conversión final. En contraste, los modelos secuenciales identificaron a Instagram y Online Display como los canales con mayor impacto en la toma de decisiones de los consumidores. Finalmente, si bien la incorporación del Encoder-Decoder no mostró cambios importantes en la relevancia de los canales de comunicación, sí presentó una mejora notable en los tiempos de ejecución.

List of Tables

1 Original Data Sample 18
2 Distribution of Clients Across Different Sequence Lengths 20
3 Overview of Input Datasets 32
4 Channel Tokenization 33
5 Padding-Truncated Density Vectors 34
6 Evaluation Metrics for LSTM Model Across Different Sequences 36
7 Cross-Validation Performed for the LSTM Selected Model 38
8 Evaluation Metrics for Encoder-Decoder Model Across Different Sequences 41
9 Cross-Validation Performed for the Encoder-Decoder Selected Model 42

List of Figures

1 LSTM Architecture 8
2 Daily Positive Conversions (July 2018) 18
3 Proportion of interactions by channel 19
4 Comparison of PR AUC Across Different Sequence Lengths 21
5 Comparison of Execution Time Across Different Sequence Lengths 22
6 Architecture of the Attention Layer 27
7 Architecture of the LSTM model 30
8 Architecture of the Encoder-Decoder model 31
9 Contribution based on Heuristic Models 35
10 Training vs Validation 37
11 Actual vs Predicted Classifications 39
12 Performance Across Thresholds 39
13 Contribution of Communication Channels 40
14 Training vs Validation 42
15 Actual vs Predicted Classifications 43
16 Performance Across Thresholds 43
17 Contribution of Communication Channels 44

Autorización para digitalización y comunicación pública de Trabajos Finales de Graduación del Sistema de Estudios de Posgrado en el Repositorio Institucional de la Universidad de Costa Rica.

Yo, Alexa Sánchez Brenes, con cédula de identidad 304670905, en mi condición de autor del TFG titulado USO DE MODELOS DE DEEP LEARNING EN LA ESTIMACIÓN DE PROBABILIDADES DE CONVERSIÓN Y CONTRIBUCIÓN DE LOS CANALES DE COMUNICACIÓN EN CAMPAÑAS PUBLICITARIAS.

Autorizo a la Universidad de Costa Rica para digitalizar y hacer divulgación pública de forma gratuita de dicho TFG a través del Repositorio Institucional u otro medio electrónico, para ser puesto a disposición del público según lo que establezca el Sistema de Estudios de Posgrado. SI ( X ) NO ( )

*En caso de la negativa favor indicar el tiempo de restricción: ________ año(s).

Este Trabajo Final de Graduación será publicado en formato PDF, o en el formato que en el momento se establezca, de tal forma que el acceso al mismo sea libre, con el fin de permitir la consulta e impresión, pero no su modificación.

Manifiesto que mi Trabajo Final de Graduación fue debidamente subido al sistema digital Kerwá y su contenido corresponde al documento original que sirvió para la obtención de mi título, y que su información no infringe ni violenta ningún derecho a terceros. El TFG además cuenta con el visto bueno de mi Director(a) de Tesis o Tutor(a) y cumplió con lo establecido en la revisión del Formato por parte del Sistema de Estudios de Posgrado.

FIRMA ESTUDIANTE

Nota: El presente documento constituye una declaración jurada, cuyos alcances aseguran a la Universidad que su contenido sea tomado como cierto. Su importancia radica en que permite abreviar procedimientos administrativos, y al mismo tiempo genera una responsabilidad legal para que quien declare contrario a la verdad de lo que manifiesta pueda, como consecuencia, enfrentar un proceso penal por delito de perjurio, tipificado en el artículo 318 de nuestro Código Penal. Lo anterior implica que el estudiante se vea forzado a realizar su mayor esfuerzo para que no sólo incluya información veraz en la Licencia de Publicación, sino que también realice diligentemente la gestión de subir el documento correcto en la plataforma digital Kerwá.

1 Introduction

1.1 Introduction

During the past decade, the number and variety of media through which consumers can be exposed to advertising campaigns have increased and diversified significantly. These media are commonly referred to as communication channels. A touch point is defined as the moment when a consumer interacts with advertising information related to a product or service through a communication channel, such as an online graphic ad. When a consumer acquires the product or service offered through the advertising information, it is said to result in a positive conversion.
The wide variety of communication channels through which advertising campaigns can reach consumers has made it essential for marketing specialists to apply tools that predict customer conversions, in order to identify the most effective channels and touch points and to optimize budget allocation. This has led to the development of various Marketing Attribution Models, which aim to understand the impact of each interaction with an advertisement on the final decision to purchase a specific product or service.

This document explores how heuristic and Markovian models are capable of providing insight into the relevance of the communication channels through which an advertising campaign is spread. However, these methodologies do not offer information on conversion rates or consider sequential information. Therefore, the use of neural network models is proposed, specifically employing the LSTM architecture. This approach ensures that historical customer information is considered both in predicting conversion rates and in identifying the most relevant channels for business actions, facilitated by the incorporation of attention mechanisms.

Furthermore, a modification of the original architecture is proposed through an Encoder-Decoder framework. This adjustment demonstrates improved performance through faster execution times and yields slight enhancements in conversion rate prediction results.

Lastly, we will elaborate on the challenges associated with obtaining high-quality public datasets for implementing sequential models, in contrast with the more prevalent availability of data examples used in heuristic and Markovian applications.

1.2 Objectives

General Objective

Implement deep learning models to predict the conversion rate and the contribution of communication channels in the final decision-making of consumers when acquiring or not acquiring a product or service (positive or negative conversion) offered in an advertising campaign.

Specific Objectives

• Apply the necessary structure to the database to execute a Deep Neural Net with Attention for Multi-channel Multi-touch Attribution model and identify the most influential communication channels in the conversion probability of each consumer.
• Develop a modification of the basic Deep Neural Net with Attention for Multi-channel Multi-touch Attribution model by incorporating an LSTM Encoder-Decoder.
• Evaluate the performance of the models to identify changes in the relevance of communication channels and in execution times.

2 Background of the Study

2.1 Literature Review

Last-Touch Attribution and First-Touch Attribution are examples of marketing attribution models based on simple heuristic rules that assign 100% of the impact on the consumer's final decision to the last or first advertising touch point they interacted with, respectively. These methodologies have been criticized for completely ignoring the effects that other intermediate touch points may have on the final decision and for leading to biased estimates (Kumar, 2021). In response to these criticisms, the literature proposes the use of models for sequential information, which consider all consumer interactions with different touch points and provide additional information such as the time elapsed between each interaction.
In particular, Multi-touch Attribution Models characterize a more realistic view of the customer journey by assigning weight based on the true influ- ence of each touch point on the final decision to acquire a product or service. Markov Chain models are an example of sequential methodologies proposed in the lit- erature. In Kakalejcik et al. (2018), they implement a Markov Chain model with a data set from Google Analytics to understand the path through communication channels that customers follow with positive conversions. More sophisticated models, such as the one proposed in Abhishek Vibhanshu (2012), use a spatially structured Markov Chain model to capture the dynamics of individual consumer behavior and infer the conversion rate. Other models, like Vector Auto Regressive models implemented in de Haan et al. (2016), study the relative effectiveness of various online marketing channels and analyze the duration of the impacts of this advertising information on consumers. Furthermore, in Danaher and Danaher (2013) an ensemble model was recommended to evaluate the rel- ative effectiveness of multiple advertising media, analyzing the incidence of purchases using a Probit model and applying a Tobit model of the second type to estimate the outcome of purchases. Finally, li et al. (2018) suggests a sequential model based on recurrent neural networks called Deep neural net with attention for multitouch attribution. The added value of these models is that they allow for robust capturing of long-term dependencies, providing a better, understanding of the effects and contributions of consumer interaction with com- 4 munication channels. Furthermore, with the application of models like this, it is possible to quantify the impact of each communication channel on a consumer’s positive conver- sion, providing additional information that can be considered in optimizing resources for future advertising campaigns. Furthermore, the previously mentioned model incorporates attention mechanisms, which are particularly advantageous in LSTM models to capture fine-grained dependencies in sequential data (Bahdanau, Cho, & Bengio, 2014). This selective focus on relevant parts of the input sequence enhances both comprehensibility and performance, making these models highly effective in natural language processing tasks such as machine translation and sentiment analysis. Additionally, the Deep neural net with attention for multitouch attribution model includes a time decay function to account for the assumption that the influence of each touch- point diminishes as the time between the interaction and the final purchase decision increases. Assumption of time decay is common in the literature, yet, there are multiple factors that could have a more significant impact. For instance, Dimitrios Buhalis (2021) points out that a touch point does not always generate a positive impact on consumer decision-making and consumers’ previous experiences related to brands may influence their decisions. On the other hand, in Thaichon and Quach (2016) emphasizes how the form and characteristics of marketing communications can create needs in consumers, producing a positive effect on a consumer’s journey. 2.2 Neural Networks According to the authors in Hastie et al. (2017), the term “neural network" has broadened to encompass a wide variety of models and learning methods. 
To introduce key concepts and definitions, this paper presents a brief description of a single hidden layer back-propagation network, also known as a single-layer perceptron. Neural networks are a set of algorithms designed to recognize relationships in a given set of training data, modeled after the way human neurons process information. The following equation represents how a single-layer neural network processes input data to produce a predicted output value,

\hat{y} = \sigma(w^T x + b).    (1)

To generate equation (1), a training vector x is input into the neural network during a process called forward propagation. As the data passes through the layer, the parameters w^T and b, known as weights and biases, are applied to map the relationship between the inputs and the target outputs. This is achieved by calculating the weighted sum of the inputs at each neuron and applying a non-linear activation function (usually chosen to be a sigmoid) to produce the output. In multiple-layer neural networks, this process continues layer by layer until the final output is produced.

A loss function is subsequently defined to compare the target and predicted output values. During model training, this loss function needs to be minimized. The standard approach for minimizing it is gradient descent, commonly referred to as backpropagation in this context. Given the compositional structure of the model, the gradient can be efficiently derived using the chain rule for differentiation. This process involves performing a forward and backward sweep through the network, tracking only the quantities local to each unit.

For binary classification problems, the Binary Cross-Entropy loss function is frequently used. According to the authors in Hastie et al. (2017), it is defined as follows:

-\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right],    (2)

where N is the number of samples, y_i is the true label for the i-th sample (either 0 or 1), and \hat{y}_i is the predicted probability of the positive class for the i-th sample.

Likewise, activation functions are essential in neural networks since they introduce the non-linearity that allows the network to learn complex patterns and relationships in the data. Activation functions determine whether a node should be activated by evaluating the importance of the neuron's input through mathematical operations. As mentioned before, they transform the weighted sum of the inputs of a node into an output value, which is then fed to the next layer or used as the final output. There are various activation functions, including binary, linear, and numerous non-linear types, among which the following are particularly notable:

Sigmoid. This function transforms any real input value into an output within the range of 0 to 1. This property makes it particularly suitable for classification tasks and for models requiring a probability prediction as output. However, a drawback is that the gradient can vanish to zero for very low and very high input values, which can delay the model's ability to improve during training. It can be expressed as

f(x) = \frac{1}{1 + e^{-x}}.    (3)

Hyperbolic tangent (Tanh). This function has a range from -1 to 1 and is commonly implemented in the hidden layers of neural networks. It helps center the data around zero, thereby facilitating easier learning for subsequent layers. This centering property improves the efficiency and performance of neural network models.
Mathematically, it can be represented as

f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}.    (4)

Rectified linear unit (ReLU). The range of this function is [0, ∞), and it is extensively utilized in deep learning, particularly in convolutional neural networks. It supports backpropagation due to its simple derivative, which accelerates the convergence of gradient descent towards the global minimum of the loss function. Its efficiency arises from its ability to avoid activating all neurons simultaneously, unlike the Sigmoid or Tanh functions, making it computationally efficient. However, this characteristic also implies that some neurons might not be updated or activated, potentially leading to dead neurons that never get activated. It can be expressed as

f(x) = \max\{0, x\}.    (5)

The following section describes the architecture of a more complex model, which consists of different structures of neural networks. In particular, this model is composed of recurrent neural networks, a type of network designed to process sequential data and address problems associated with natural language processing and time-series analysis.

2.3 Deep Neural Net with Attention for Multi-channel Multi-touch Attribution Model

The basic Deep Neural Net with Attention for Multi-channel Multi-touch Attribution model consists of three stages. The first stage involves a deep sequential model based on a Long-Short Term Memory (LSTM) recurrent neural network, aimed at capturing the long-term dependencies in sequential observations. Subsequently, an attention mechanism is introduced, assigning weights to the touchpoints that make up the representation of a consumer's path to conversion and determining which ones are more influential. Next, the problem of conversion attribution, viewed as a binary classification problem, is addressed by incorporating the results obtained in the previous stages. Each of these stages is described in detail below.

2.3.1 Deep sequential model

In a Long Short-Term Memory recurrent neural network (LSTM), input values are sent to nodes called neurons, which are grouped into different layers (Sagheer & Kotb, 2019). These neural networks are typified as recurrent because the inputs received by the nodes located in one layer are the outputs generated by the neurons in the previous layer, weighted by a value assigned by a non-linear function, in what is known as the activation process. Simultaneously, the weights assigned by the neurons are modified based on the error obtained, depending on how much each neuron contributed to the previous result, through the process of backpropagation.

One of the principal features of this type of neural network is its ability to learn long-term sequential processes. In LSTM networks, each node is composed of four layers, and through each corresponding gate a process of updating information is carried out and stored in what is known as the cell state. Moreover, each gate contains a non-linear activation function, typically a logistic sigmoid or hyperbolic tangent function. Additionally, the components of this type of neural network are the forget gate f_t, the input gate i_t, the cell state c_t, the output gate o_t, the hidden state of the previous step h_{t-1}, and the input x_t at iteration t.
The determination of every element in the neural network during each iteration t is outlined as follows,

f_t = \sigma(W^f \cdot [h_{t-1}, x_t] + b^f)
i_t = \sigma(W^i \cdot [h_{t-1}, x_t] + b^i)
c_t = f_t \odot c_{t-1} \oplus i_t \odot \tanh(W^c \cdot [h_{t-1}, x_t] + b^c)
o_t = \sigma(W^o \cdot [h_{t-1}, x_t] + b^o)
h_t = o_t \odot \tanh(c_t)

where the W and b terms are weights and biases, and [h_{t-1}, x_t] is the concatenation of the previous hidden state and the current input. In addition, matrix multiplication is denoted with the dot operator (·), while pointwise multiplication and pointwise addition are represented as ⊙ and ⊕, respectively. The sigmoid activation and hyperbolic tangent functions are given by σ and tanh, respectively.

Figure 1: LSTM Architecture

As shown in Figure 1, in the first gate f_t a decision is made regarding whether to retain or discard the information coming from the previous hidden state h_{t-1}. This decision is governed by a sigmoid activation function within the forget gate, generating a value ranging from zero to one. A value of one implies that all information stored in the cell state is retained, whereas a value of zero signifies the discarding of all prior information. Following this, the definition of the new information to be stored in the cell state involves two steps. Initially, employing another sigmoid activation function, the input gate i_t determines which values are to be updated. Subsequently, a tanh function generates a vector c̃ comprising new candidate values that could be incorporated into the cell state. Afterward, the memory cell c_{t-1} undergoes an update to become c_t, defined by a combination of the information chosen to persist and the newly introduced information. Next, the final output of this node, denoted as h_t, is defined. To achieve this, a sigmoid activation function is applied to determine which information from the cell state will persist. Finally, a hyperbolic tangent function is computed on the cell state, and the result of this transformation is multiplied by the output of the sigmoid gate.

The LSTM implementation used in constructing the Deep Neural Net with Attention for Multi-channel Multi-touch Attribution model is outlined below, along with the preliminary steps to prepare the input for this stage.

Input Layer

The input data for the model consists of a set of sequences P formed by the touchpoints to which each consumer was exposed over a specific period. Touchpoints are denoted as x_t ∈ R^n, where n represents the number of enabled communication channels for a specific marketing campaign. Thus, if T is the length of the sequence and t represents the relative order of the event in the sequence, instead of the absolute event occurrence time, then a single customer sequence path can be defined as

P_i = x_0, ..., x_T, where t ∈ [0, T].

As an illustration, let us examine a sequence P_i associated with a client and given by the following touchpoints: video, Facebook, Instagram, Searchpaid and Facebook. In this instance, the set of enabled touchpoints has dimension n = 4 and the length of the sequence is T = 5.

Embedding Layer

In this step, each x_t is converted into a density vector e_t using an embedding matrix W_e. This matrix maps word vectors to continuous representations, where each row corresponds to the continuous representation of a specific word. During training, these vectors are refined through backpropagation to capture semantic relationships between words. Each density vector is obtained through the following formula:

e_t = W_e x_t, where W_e ∈ R^{n × v_e}.
This technique is used to encode words into a sequence of numerical indices. Individual words are represented as vectors with real values in a predefined vector space of dimension v_e, with the aim of transforming higher-dimensional data into a lower-dimensional vector space. In simple words, it assigns each word a dense vector, positioning similar words closer together in that space. Using the earlier example, the density vector is specified as e_t = [1, 2, 4, 3, 2], in which 1 represents video; 2, Facebook; 4, Instagram; and 3, Searchpaid.

LSTM Layer

LSTM neural networks allow us to incorporate contextual information from the historical observations. Through a non-linear operator H, often implemented as a recurrent neural network, specifically an LSTM in our context, each block iteratively updates the current hidden state h_t by using the information of the embedding layer output e_1, ..., e_T and the previous hidden state h_{t-1}, as depicted in the following formula,

h_t = H(e_t, h_{t-1}), where t ∈ [0, T].

2.3.2 Attention mechanism

The primary goal of this stage is to identify the most influential communication channels contributing to client conversions. An attention mechanism is a technique that enhances model performance by focusing on relevant information within the input data. It enables models to selectively attend to different parts of the input, assigning varying degrees of importance to different elements. This is achieved by generating attention weights for various features of the input data, allowing the model to utilize the most pertinent parts of the input sequence. These weights are applied in a weighted combination of all the input vectors, with higher weights attributed to more relevant vectors. Consequently, the attention mechanism determines the level of importance each element contributes to the model's output, thereby improving the model's ability to capture complex patterns and relationships.

To start, the attention layer processes the hidden state h_t at time step t through a single-layer Multilayer Perceptron (MLP). Such networks utilize the backpropagation algorithm and are designed to approximate continuous functions and address problems that are not linearly separable (Abirami & Chitra, 2020). In this case, the hidden state h_t is transformed into v_t by the expression tanh(W_v h_t + b_v), where W_v is a weight matrix, b_v is a bias vector added to the weighted input, and tanh is the hyperbolic tangent activation function.

Following this, the importance level a_t of the new representation of the communication channel v_t is computed. The normalized importance weight a_t is obtained through a softmax function, ensuring positive values for a_t by design. This construction guarantees that the contribution of each touchpoint remains positive. Next, the context vector s is defined as the convex combination of the h_t weighted by the a_t obtained in the previous step. Intuitively, s can be interpreted as a high-level representation of the customer's journey through the different touchpoints, combining hidden outputs and attention weights. The complete process of the attention mechanism is described through the following equations,

v_t = \tanh(W_v h_t + b_v)    (6)
a_t = \frac{\exp(v_t)}{\sum_t \exp(v_t)}    (7)
s = \sum_t a_t h_t    (8)

In Li et al. (2018), the authors propose some modifications to the model; they suggest that the timing of each interaction within the same sequence P could impact the consumer's final conversion.
In order to incorporate a time penalty into the attention mechanism, they introduce the term λT_t in the definition of a_t. The parameter λ can be predefined based on the observed conversion trends of customers in previous marketing campaigns, or it can be initialized randomly in the model and adjusted during training. Meanwhile, T_t refers to the time lapse between the contact with communication channel x_t and x_{t+1}. With this modification, a_t can be rewritten as

a_t = \frac{\exp(v_t - \lambda T_t)}{\sum_t \exp(v_t - \lambda T_t)}.    (9)

2.3.3 Binary classification problem

The final step involves addressing the conversion attribution problem. Consumers' journeys through touch points conclude when the consumer decides whether to acquire the offered good or service. Therefore, the conversion attribution problem can be treated as a binary classification problem. In this case, the probability of a customer ending with a positive conversion is determined through a sigmoid function fed with the vector s obtained through the aforementioned processes. Explicitly, the probability of positive conversion is described as follows,

p = \mathrm{sigmoid}(\sigma(W^T s + b)),

where σ(·) is the activation function of equation (5), and W and b are weights and biases, respectively. In particular, in the conversion attribution problem, when consumers have had exposure to advertising channels, the probability of their journey concluding in a positive conversion is always higher than the probability for those consumers who were not exposed. That is why the use of the activation function in equation (5) is proposed.

2.4 Encoder-Decoder

The Encoder-Decoder is an unsupervised learning method in neural networks that aims to learn a compressed representation of the input data, capable of capturing the nonlinear patterns inherent in the data. Typically, it is trained as part of a larger model that attempts to recreate the input by extracting its fundamental features. According to Sagheer and Kotb (2019), the first part of this model, the encoder, transforms an input sequence into a lower-dimensional representation in a latent space. This can be expressed as

h(x) = f(W_1 x + b_1),

where W_1 is a weight matrix, b_1 is a bias vector, and f is referred to as the encoder function, which, in this case, is an LSTM layer. During the second step of the model, the decoder maps the latent representation h(x) back to a reconstruction x̂ with the original dimension of the input sequence x,

\hat{x} = g(W_2 h(x) + b_2),

where W_2 is a weight matrix, b_2 is a bias vector, and g is an activation function. The discrepancy between the input and the reconstructed input is commonly referred to as the reconstruction error; the Encoder-Decoder model is trained to minimize a loss function given by

L(x, \hat{x}) = \| x - \hat{x} \|^2.

In particular, there is an architecture called the LSTM Encoder-Decoder that enables the model to handle variable-length input sequences and predict or generate variable-length output sequences (Shi & Liu, 2022). In this structure, the LSTM network is integrated into both the encoding function f(·) and the decoding function g(·). This allows effective utilization of the temporal information in sequential input data. In this type of Encoder-Decoder model, the encoder compresses the information from the entire input sequence into a fixed-dimensional vector derived from the sequence of LSTM hidden states. This representation is obtained from the last hidden state of the encoding part. In contrast, the decoding part utilizes a single LSTM layer to predict the output sequence.
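To make the attention computations of Sections 2.3.2 and 2.3.3 concrete before turning to evaluation metrics, the following minimal NumPy sketch applies equations (6) to (9) and the sigmoid head to a toy sequence of hidden states. The shapes, the random weights W_v, b_v, W, b, the penalty λ and the time gaps T_t are illustrative only and are not taken from the thesis code.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_context(H, W_v, b_v, lam, T_gaps):
    # Equations (6)-(9): H holds one LSTM hidden state per touchpoint, shape (T, d)
    v = np.tanh(H @ W_v + b_v).squeeze(-1)      # eq. (6), one score per time step
    scores = v - lam * T_gaps                   # time-decay penalty lambda * T_t
    a = np.exp(scores) / np.exp(scores).sum()   # eq. (9), softmax over the sequence
    s = (a[:, None] * H).sum(axis=0)            # eq. (8), context vector
    return a, s

rng = np.random.default_rng(0)
T, d = 5, 8                                     # toy journey: 5 touchpoints, hidden size 8
H = rng.normal(size=(T, d))                     # stand-in for the LSTM outputs h_1, ..., h_T
W_v, b_v = rng.normal(size=(d, 1)), np.zeros(1)
T_gaps = np.array([4.0, 2.0, 1.0, 0.5, 0.0])    # illustrative gaps between consecutive touchpoints

a, s = attention_context(H, W_v, b_v, lam=0.1, T_gaps=T_gaps)
W, b = rng.normal(size=d), 0.0
p = sigmoid(np.maximum(0.0, W @ s + b))         # conversion probability, with the ReLU of eq. (5)
print(a.round(3), round(float(p), 3))

Because the attention weights sum to one, each a_t can be read directly as the relative importance assigned to touchpoint t within the journey.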
2.5 Accuracy Metrics for Imbalanced Datasets In accordance with C. Ferri (2009), the precise evaluation of learned models remains a pivotal focus within the domain of pattern recognition. This study, specifically cen- tered on classifiers, undertakes a thorough exploration of a variety of metrics designed to assess different facets of model performance. Our research aims to provide an exhaus- tive analysis and categorization of these metrics, bringing to light both their theoretical foundations and practical implications in the evaluation of classifiers. The authors introduce three distinct groups of metrics. The first group are metrics based 14 on a threshold and a qualitative understanding of error; examples such as accuracy, F score, and the Kappa statistic fall into this category. These metrics aim to minimize the number of errors, with specific measures within this group being more suitable for scenarios involving balanced or imbalanced datasets, signal or fault detection, and in- formation retrieval tasks. The second group is defined by the metrics based on a probabilistic understanding of error; metrics like mean absolute error, mean squared error, and cross entropy belong to this group. This set of metrics proves valuable when evaluating the reliability of classifiers, offering insights not only into instances of failure, but also into the classifier’s ability to choose the correct class with varying probabilities, be it high or low. And the third group is characterized by metrics based on how well the model ranks are listed, among which AUC stands out. Closely tied to the concept of separability, this metric is of significance in various applications where classifiers play a pivotal role in selecting optimal instances from a table or ensuring effective class separation. Given that only about 0.1 of the total clients had a positive conversion, there is a signifi- cant class imbalance present in the database. Thus, it becomes crucial to carefully select the appropriate metrics to evaluate model performance. The authors in Thölke et al. (2023), underscore the limitations of the widely used accuracy metric, particularly in sce- narios with greater class imbalance. This metric, weighs the ratios of correct predictions per class proportionally to the class size, resulting in a notable neglect of performance on the minority class. When a binary classification model consistently favors the majority class, it generates an artificially inflated decoding accuracy that predominantly reflects the imbalance between the two classes, rather than indicating genuine and universally applicable discriminatory capability. Before discussing the most suitable metrics for evaluating imbalanced class scenarios, it is essential to introduce certain concepts using the elements of a confusion matrix. The confusion matrix is a tool used to assess the performance of a classification model on a test dataset, displaying the counts of correct and incorrect predictions categorized by actual and predicted classes. This matrix provides a detailed breakdown of true positives, true negatives, false positives, and false negatives, enabling a comprehensive evaluation of the performance of the model. 
• True Positives (TP): the number of instances where the model correctly predicted the positive class.
• True Negatives (TN): the number of instances where the model correctly predicted the negative class.
• False Positives (FP): the number of instances where the model incorrectly predicted the positive class.
• False Negatives (FN): the number of instances where the model incorrectly predicted the negative class.

According to Brownlee (2019), two metrics that are particularly useful for evaluating imbalanced classification are precision and recall. In addition to these metrics, we also present the F1 score, which provides a single metric that balances both concerns, offering a comprehensive measure of a model's performance in imbalanced classification scenarios.

1. Precision. This metric summarizes the ratio of true positive predictions to the total predicted positives; in other words, it measures the accuracy of the positive predictions.

Precision = \frac{TP}{TP + FP}

2. Recall. This metric represents the true positive rate, also known as sensitivity, and indicates how effectively the model predicts the positive class. It is interpreted as the proportion of actual positive instances that are correctly identified by the model. It provides insight into the model's ability to detect positive cases, making it particularly useful in scenarios where capturing all positive instances is crucial.

Recall = \frac{TP}{TP + FN}

3. F1. This metric evaluates the overall performance of a classification model. It is the harmonic mean of precision and recall and is valuable for evaluating classification models, especially in cases of imbalanced data, as it considers both false positives and false negatives.

F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}

Moreover, maximizing this metric can be an effective strategy for setting a classification threshold in imbalanced classification scenarios (Lipton, Elkan, & Narayanaswamy, 2014). By optimizing the threshold to maximize the F1 score, we ensure the model maintains a good balance between precision and recall.

4. F Beta. This score is an extended adaptation of the F1 score. It incorporates a weighting factor, denoted as β, to refine the relative impact of precision and recall on the overall metric. It represents a weighted harmonic mean of both precision and recall, reaching its optimal value at 1 and its poorest at 0.

F_\beta = \frac{(1 + \beta^2) \cdot \text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}

5. Area under the Precision-Recall curve (PR AUC). This metric is designed for assessing binary classification models. It is particularly beneficial for imbalanced classes (Boyd, 2013), as it emphasizes the classifier's efficacy in addressing the minority class. It provides a nuanced representation of the balance between precision and recall at diverse decision thresholds. A substantial area beneath the curve signifies elevated values for both precision and recall, indicating a classifier that not only delivers accurate outcomes (high precision) but also captures a significant portion of all positive results (high recall).

3 Methodology

3.1 Dataset

Obtained from Huyton (2021), the dataset used to implement the proposed methodology is publicly accessible. It is a test dataset that has been employed in the development of non-sequential Multi-touch Attribution models, including heuristic models and Markov chain models. Due to the limited availability of public datasets for sequential models, the aforementioned dataset was selected.
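As a minimal illustration of how this public dataset can be loaded and ordered before any modeling, the following pandas sketch assumes a local CSV export with the columns described in the next paragraph; the file name and the exact column labels are illustrative and may differ in the raw download.

import pandas as pd

# Hypothetical local export of the Huyton (2021) dataset; adjust the path and
# the column names to match the actual file
df = pd.read_csv("attribution_data.csv", parse_dates=["timestamp"])

# One row per interaction, ordered chronologically within each consumer (cookie)
df = df.sort_values(["cookie", "timestamp"]).reset_index(drop=True)
print(df[["cookie", "timestamp", "interaction", "channel", "conversion"]].head())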
The database comprises around 586,737 interactions involving 240,108 consumers across five distinct communication channels throughout July, 2018. Communication channels includes Paid Search, Facebook, Instagram, Online Display and Online Video. The most relevant variables contained in the data set are described as follows. • Cookie: Unique identifier for each consumer. • Timestamp: Date and time when the consumer interacted with any of the com- munication channels. • Interaction: Variable indicating the type of interaction, “conversion” when the in- teraction leads to a positive conversion, “impression” otherwise. • Conversion: Flag with a value of 1 when there is a positive conversion. • Channel: Communication channel through which the interaction occurred, in- cluding Instagram, Online Display, Paid Search, Facebook, and Online Video. Table 1 displays sample of the database structure to be utilized. For each cookie, assumed to represent an individual consumer, the interactions with the available communication channels are observed along with the corresponding dates. Additionally, the last inter- action for each consumer determines whether there is a positive or negative conversion for that user. 18 Cookie Timestamp Interaction Channel Conversion 00073CFE3FoFCn70fBhB3kfon 2018-07-21T10:52:04Z impression Instagram 0 00079hhBkDF3k3kDkiFi9EFAD 2018-07-10T11:11:24Z impression Paid Search 0 0007iiAiFh3ifoo9Ehn3ABB0F 2018-07-09T16:57:18Z impression Instagram 0 0007iiAiFh3ifoo9Ehn3ABB0F 2018-07-17T16:00:58Z impression Facebook 0 0007iiAiFh3ifoo9Ehn3ABB0F 2018-07-17T16:01:44Z impression Facebook 0 0007iiAiFh3ifoo9Ehn3ABB0F 2018-07-18T17:17:24Z impression Instagram 0 0007o0nfoh9o79DDfD7DAiEnE 2018-07-12T08:07:08Z impression Facebook 0 0007oEBhnoF97AoEE3BCkFnhB 2018-07-06T13:45:29Z conversion Paid Search 1 00090n9EBBEkA000C7Cik999D 2018-07-05T06:53:53Z conversion Facebook 1 000A9AfDohfiBAFB0FDf3kDEE 2018-07-24T00:09:46Z impression Online Video 0 000A9AfDohfiBAFB0FDf3kDEE 2018-07-27T21:08:17Z impression Online Video 0 000A9AfDohfiBAFB0FDf3kDEE 2018-07-27T22:36:07Z impression Online Video 0 Table 1: Original Data Sample During the study period, the total number of conversions fluctuated, with a peak of 13,657 conversions recorded on July 29th; with a significant decline in conversions was observed by July 31st. On the other hand, figure 2 illustrates that the highest number of positive conversions occurred between July 11th and 19th, 2018, displaying a decreasing trend in the subsequent days, except for July 28th. Figure 2: Daily Positive Conversions (July 2018) 19 The influence of each communication channel on consumers ultimate conversion is a pivotal aspect of our outlined objectives. Consequently, a detailed analysis of the inter- action patterns becomes crucial. As depicted in Figure 3, it is evident that Paid Search has the highest number of interactions, followed by Facebook with 28% and 29%, respec- tively. In contrast, Online Video registers the lowest number of interactions. In particular, the substantial proportion of interactions on Paid Search and Facebook does not neces- sarily ensure that these channels will be the most decisive in determining consumer conversion. Figure 3: Proportion of interactions by channel 3.2 Samples Definition The dataset was divided into two portions: 90% of the total records were allocated for the training sample, while the remaining 10% were reserved for the test sample. 
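A minimal scikit-learn sketch of this stratified 90/10 split, together with the imbalance-aware metrics of Section 2.5, is shown below. Here X, y and model are placeholders for the processed sequences, the conversion labels and a fitted classifier, and the 0.5 threshold is used only for illustration.

from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score, average_precision_score

# Stratified split so both samples preserve the low positive-conversion rate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=1119)

# Evaluate a fitted model on the held-out 10%
y_score = model.predict(X_test).ravel()          # predicted conversion probabilities
y_pred = (y_score >= 0.5).astype(int)            # illustrative decision threshold

print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1       :", f1_score(y_test, y_pred))
print("PR AUC   :", average_precision_score(y_test, y_score))

average_precision_score is one common way to summarize the area under the precision-recall curve, which is the PR AUC value reported for the models in the following sections.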
3.3 Model Parameterization

3.3.1 Sequence Length

As LSTM networks fall under the category of recurrent neural networks, it becomes imperative to examine the sequence length, denoting the maximum number of interactions a client undergoes. Since the sequence length dictates the extent to which the network can retain historical information and propagate gradients over time, opting for a longer sequence helps the model capture prolonged dependencies and acquire intricate patterns. However, longer sequences increase the required computational resources and the risk of gradient issues. Conversely, a shorter sequence length expedites training and mitigates gradient problems, yet compromises the contextual depth and expressive capacity of the model. Therefore, the selection of both the maximum and minimum number of interactions considered in the models is crucial.

Table 2 presents the distribution of clients according to the length of the interaction sequences found in the database. Although approximately 70% of the customers had between one and two interactions with the touch points, we focus only on the clients who interacted at least three times with the potential touch points. This focus is due to the fact that, as previously mentioned, shorter sequences could compromise the model's capacity. Additionally, sequences consisting of one or two interactions might be associated with clients who already have a particular engagement with the product or a latent necessity to acquire it.

Sequence length   Negative conversion   Positive conversion   Total     Percentage dist.
1                 116,047               7,417                 123,464   0.55
2                 48,554                3,326                 51,880    0.22
3                 22,903                1,950                 24,853    0.10
4                 11,927                1,179                 13,106    0.06
5                 7,015                 802                   7,817     0.03
6                 4,346                 585                   4,931     0.02
7                 2,821                 459                   3,280     0.014
8                 2,043                 313                   2,356     0.010
9                 1,439                 266                   1,705     0.007
10                1,059                 196                   1,255     0.005
11                819                   180                   999       0.004
12                637                   124                   761       0.003
> 12              2,859                 842                   3,701     0.15

Table 2: Distribution of Clients Across Different Sequence Lengths

It is important to define the minimum and maximum number of interactions required to train the model. We define min_seq_length as the minimum number of interactions a client must have with the communication channels to be considered for model training. Besides, max_seq_length is defined as the maximum number of interactions to be considered for each individual. Thus, if we take the combination (3, 7), only individuals with at least three interactions with the communication channels are considered, and from the total of their interactions the model is trained on the last seven observed.

Within the subset of customers who had at least three interactions with the communication channels, 87% had a maximum of eight interactions. Therefore, to ensure a representative population for training the model, the behavior of the models is analyzed for the possible combinations of min_seq_length and max_seq_length between three and eight.

Figure 4 presents a comparison of the PR AUC achieved by the LSTM and Encoder-Decoder models. The plot displays the PR AUC metric for both models across all combinations of minimum (rows) and maximum (columns) sequence lengths ranging from 3 to 8. The slope of the arrow highlights the model that achieved the highest PR AUC for each combination of minimum and maximum sequence lengths. In this way, a positive slope indicates that the Encoder-Decoder model achieved a higher PR AUC compared to the LSTM for the observed combination.
Figure 4: Comparison of PR AUC Across Different Sequence Lengths Analyzing the models individually, the combinations of sequence lengths that gener- ate the highest PR AUC for the LSTM are as follows {(3, 6), (3, 8), (5, 7), (5, 8), (6, 8)}. Meanwhile, for the Encoder-Decoder the best combinations of minimum and maximum 22 sequence lengths are {(5, 8), (6, 5), (7, 6), (8, 3), (8, 8)}. In general, Encoder-Decoder achieves better results when considering higher minimum sequences length than those observed in the best combinations obtened for the LSTM. On the other hand, we can compare the PR AUC of the models regardless of where the highest values of each of them are. We see that LSTM has a higher PR AUC than the Encode-Decoder in sequences with larger minimum and maximum length. For example, when the minimum sequence length is greater than four and the maximum sequence length is greater than six, LSTM perform better. Meanwhile, for cases in which the minimum sequence length is smaller than four and the maximum sequence length is smaller than six, the encoder is better, although in those quadrants is not where this model achieves the highest PR AUC. On the other hand, when comparing the execution times of the models, the Encoder- Decoder demonstrates superior performance, delivering the shortest execution times. This is a significant advantage, as the datasets used to train these models are typically large. Furthermore, sequential models generally have high execution times, making this factor a critical consideration in practice when selecting a model. Figure 5: Comparison of Execution Time Across Different Sequence Lengths 23 3.3.2 Hyperparameters Several hyperparameters are essential for defining the structure and behavior of the LSTM network, as well as the complementary layers required to fit the model. Specifi- cally, for the LSTM layer, the most significant hyperparameters included in the Python function LSTM, from Tensor Flow package, are as follows: Neurons. Determines the number of neurons in a layer, moreover, this parameter defines the dimension of the output space and input shape, including the number of time steps and features per time step for sequences. It is widely recognized that insufficient hidden units may hinder the model’s ability to capture data non linearity’s, while an excess can be managed with regularization. The number of hidden layers is determined by background knowledge and experimentation. Return Sequences. When set to True, the LSTM layer returns the hidden state output for each input time step; besides, when is set to False, it returns the final hidden state value at the last time step. Return State. If True, the LSTM layer returns the last cell state, in addition to the output. Activation. Defines the activation function to use for the LSTM cells. Dropout. It is used to prevent overfitting during training by deactivating ran- domly selected neurons, thereby reducing the model’s sensitivity to the specific weights of individual neurons. This parameter is a value between 0 and 1, repre- senting the fraction of units to drop during the linear transformation of the inputs. Recurrent Dropout. Dropout rate for recurrent connections also helps prevent overfitting and is specifically tailored for gated architectures. In the case of LSTM memory cells, it is applied to the updates, taking a value between 0 and 1, repre- senting the fraction of units to drop during the linear transformation of the recur- rent state. Kernel Regularizer. 
Typically regularization involves adding penalty factors to network layers to modify weight propagation, aiding optimal model convergence. L1 regularization applies the absolute values of weights, while L2 regularization applies the squares of weights. When Kernel regularizer is used, penalty terms are 24 added to the kernel layers, affecting the weights of the neural network, while the bias component remains unchanged. Activity Regularizer. Regularization method applied to the output of the layer. If True, the states for the model will be preserved between batches during training. Furthermore, to prevent overfitting in the model, Batch Normalization layers were added to the model architecture. This technique is employed in deep neural networks to stabilize training by normalizing the inputs of each batch so that they have a mean near zero and a standard deviation close to one. By minimizing internal covariate shift, it ensures that the input distribution to each layer stays consistent throughout training, thereby enhancing the model’s stability and promoting faster convergence. After defin- ing the architecture of the LSTM model, the next step is to compile it, which can be done using the Python function compile. This function prepares the model for training by configuring the loss function, optimizer, and evaluation metrics. Binary Cross Entropy loss, a common choice for binary classification problems, was employed, as outlined in equation (2). The Adam optimizer, a stochastic gradient descent method that utilizes adaptive estimation of first-order and second-order moments, was chosen for optimiza- tion. Finally, accuracy was specified as the metric in the compile function to monitor the model’s performance during training and evaluation. Subsequently, the fitting process of the model needs to be configured. The fit method is employed to input the training data into the model, specifying the number of epochs and the batch size. Through this process, the model iteratively adjusts its weights based on the training data in order to minimize the loss function. Next parameters required to be configured for this step, Learning Rate. is generally set to a small positive value, usually between 0.0 and 0.1; controls the frequency of parameter updates in response to the loss gradi- ent and dictates the magnitude of adjustments made to the model’s weights after each training batch. A higher learning rate speeds up training but risks improper convergence or divergence. Conversely, a lower learning rate facilitates smoother convergence but prolongs the training process due to smaller steps towards the loss function’s minimum. Number of Epochs. it refers to the total number of complete passes through the 25 entire training dataset during the training process. In simple terms, one epoch means that every sample in the training dataset has been used once to update the internal model parameters. Usually, is increased until the validation accuracy begins to decline, indicating potential overfitting. Batch Size. correspond to the defining the number of samples processed before updating the internal model parameters. Controls how many training samples are processed in one iteration of the model’s training. 3.4 Model structure and computational tools This section describes the architecture and tools used to develop the models. The models and data wrangling were implemented using the Python programming language, based on the implementation detailed in (Jeremite, 2019). 
This base code was modified to in- clude changes highlighted in the following sections. Among the key libraries used in the implementation are TensorFlow, Keras, Scikit-learn, and Matplotlib. 3.4.1 Data Processing Several steps were applied to ensure that the data are properly formatted and balanced for training. Key steps include data cleaning, filtering, tokenizing path data, and split- ting the dataset into training and testing sets. As showed in Listing 1, categorical data is converted into vectors, filtering the data to include only those cases where the se- quences meet a defined sequence length. Then, path data is tokenized into sequences of integers using the fit_on_texts function from the Tokenizer class. After that, the pad sequences function is used to pad or truncate the sequences to a specified length defined by the parameters max_seq_length and min_seq_length. Data is split into training and testing sets, and the sequences are converted into one-hot encoded arrays, making them suitable for input into a machine learning model. Additionally, the implementation of the Synthetic Minority Oversampling Technique (SMOTE) was included in the code as a possible solution to address class imbalance in the data. 26 1 def process_data(dt, max_seq, min_seq): 2 cat(dt, ’path’, "leng_path", s=’>’) 3 dt_original = data[data.leng_path >= min_seq] 4 dt_original = dt_original.reset_index() 5 y = dt_original.total_conversions 6 text = dt_original.path 7 8 tokenizer = Tokenizer() 9 tokenizer.fit_on_texts(text) 10 vocab_size = len(tokenizer.word_index) + 1 11 encoded_docs = tokenizer.texts_to_sequences(text) 12 padded_docs = tf.keras.utils.pad_sequences(encoded_docs, maxlen = max_seq, padding =’pre’, truncating = ’pre’) 13 14 X_train, X_test, Y_train, Y_test = train_test_split(padded_docs, y, test_size = 0.1, random_state = 1119, stratify = y) 15 _, paths = train_test_split(text, y, test_size = 0.1, random_state = 1119, stratify = y) 16 17 X = np.array([to_categorical(doc, num_classes = vocab_size) for doc in padded_docs], ndmin = 3) 18 X_tr = np.array([to_categorical(doc, num_classes = vocab_size) for doc in X_train], ndmin = 3) 19 X_te = np.array([to_categorical(doc, num_classes = vocab_size) for doc in X_test], ndmin = 3) 20 21 oversampler = SMOTE() 22 X_Train_O, Y_Train_O = oversampler.fit_resample(X_train, Y_train) 23 24 return [X, y, X_train, X_test, Y_train, Y_test, X_Train_O, Y_Train_O] Listing 1: Data Wrangling 27 3.4.2 Attention Layer The attention mechanism is implemented using a series of sequential layers, as illustrated in Figure 6. Figure 6: Architecture of the Attention Layer As illustrated in Listing 2, line eight, a Repeat Vector layer repeats the input vector st−1, representing the hidden state from the previous time step, for the length of the sequence. This repetition ensures that the shape of the input vector matches the shape of the hidden states. Next, the Concatenate layer combines st−1 with the hidden states ht along the last axis, effectively merging the past hidden state with the current hidden states. Following this, a hyperbolic tangent activation function,tanh, is applied to the concate- nated tensor. This operation computes the intermediate energies e, which quantify the significance of each hidden state relative to the current state. Then, a ReLU activation function is applied to transform the intermediate energies e into the energies vt, as de- fined in equation (6). These transformed energies are subsequently employed to calculate the attention weights. 
3.4.2 Attention Layer

The attention mechanism is implemented using a series of sequential layers, as illustrated in Figure 6.

Figure 6: Architecture of the Attention Layer

As illustrated in Listing 2, line eight, a RepeatVector layer repeats the input vector st−1, representing the hidden state from the previous time step, for the length of the sequence. This repetition ensures that the shape of the input vector matches the shape of the hidden states. Next, the Concatenate layer combines st−1 with the hidden states ht along the last axis, effectively merging the past hidden state with the current hidden states. Following this, a hyperbolic tangent activation function, tanh, is applied to the concatenated tensor. This operation computes the intermediate energies e, which quantify the significance of each hidden state relative to the current state. Then, a ReLU activation function is applied to transform the intermediate energies e into the energies vt, as defined in equation (6). These transformed energies are subsequently employed to calculate the attention weights.

Although the analyses conducted do not quantify the impact of the time-decay factor, the original code includes a Subtract layer that refines the energy values by subtracting a time-decay factor t0 from them. After this adjustment, the Activation layer applies a Softmax activation function to normalize the energies, resulting in attention weights at that sum to 1 across the sequence length, as described in equation (9). Finally, the Dot layer computes the dot product of the attention weights at and the hidden states ht, producing the context vector s. This context vector is generated by weighting each hidden state ht according to its respective attention weight at, as indicated in equation (8).

1  def one_step_attention(h_t, s_prev, t):
2      concatenator = Concatenate(axis = -1)
3      dense_tanh = Dense(10, activation = "tanh")
4      dense_relu = Dense(1, activation = "relu")
5      softmax_activation = Activation(softmax)              # custom softmax over the time axis (axis = 1), defined in the base code
6      dot_product = Dot(axes = 1)
7
8      s_prev_repeated = RepeatVector(h_t.shape[1])(s_prev)  # repeat s_prev once per time step
9      concatenated = concatenator([s_prev_repeated, h_t])
10     e = dense_tanh(concatenated)                          # intermediate energies e
11     v_t = dense_relu(e)                                   # energies v_t, equation (6)
12     v_t = Subtract(name = 'timeDecay')([v_t, t])          # subtract the time-decay factor
13     a_t = softmax_activation(v_t)                         # attention weights, equation (9)
14     s = dot_product([a_t, h_t])                           # context vector, equation (8)
15
16     return s

Listing 2: Attention Layer

3.4.3 Long-Short Term Memory

Three input layers are defined: one for the main sequence data, another for the initial hidden state, and a third for incorporating an optional time-decay factor. Subsequently, a four-layer stacked LSTM network structure is established, each layer configured with 420 neurons, a recurrent dropout rate of 0.05, and an L2 kernel regularizer. Additionally, since customer interactions with various touchpoints may exhibit temporal dependencies where the entire sequence and order of touchpoints are significant, the return sequences parameter has been set to true. This ensures that the network produces an output at each time step, enabling the capture of contributions from each touchpoint in the sequence. This capability allows subsequent layers, such as the attention and dense layers, to effectively analyze these dependencies. Furthermore, batch normalization is applied after each LSTM layer to stabilize and accelerate training by normalizing the outputs. Listing 3 shows an extract of the code used to create this architecture.

from tensorflow.keras.layers import Input, LSTM, BatchNormalization, Dense, Flatten
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l2

# Input layers
input_att = Input(shape=(max_seq_length, vocab_size))
s_prev = Input(shape=(240,))
t = Input(shape=(max_seq_length, 1))

# Stacked LSTM layers with BatchNormalization applied after each one
def build_lstm_block(inputs):
    x = LSTM(240, dropout=0.05, recurrent_dropout=0.05,
             kernel_regularizer=l2(0.01), return_sequences=True)(inputs)
    return BatchNormalization()(x)

h_1 = build_lstm_block(input_att)
h_2 = build_lstm_block(h_1)
h_3 = build_lstm_block(h_2)
h_4 = build_lstm_block(h_3)
ht = BatchNormalization()(h_4)

# Attention context computation (one_step_attention is defined in Listing 2)
s = one_step_attention(ht, s_prev, t)
c = Flatten()(s)

# Final output layer
out_att = Dense(1, activation="sigmoid")(c)

# Build and compile the model
model = Model([input_att, s_prev, t], out_att)
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

Listing 3: LSTM

The attention mechanism calculates a context vector as described previously. To produce the final model output, a dense layer with a sigmoid activation function is appended, yielding a single output value suitable for binary classification tasks.
Subsequently, the model architecture is defined, specifying the inputs and the final output. The model is then compiled using the Keras compile function, which configures the learning process with a binary cross-entropy loss function, the Adam optimizer, and accuracy as the evaluation metric. Finally, the model is trained using the fit function from Keras, employing 175 epochs, a batch size of 90, and a learning rate of 0.001. Figure 7 illustrates the previously defined structure.

Figure 7: Architecture of the LSTM model

3.4.4 Encoder-Decoder

The initial structure is similar to the one described earlier: the same three inputs are defined, and the first part of the network consists of a two-layer stacked LSTM network. Each LSTM layer is configured with 380 neurons, a recurrent dropout rate of 0.05, and an L2 kernel regularizer, with return sequences set to true and batch normalization applied after each layer.

The change in the structure is introduced with an encoder LSTM layer that uses a ReLU activation function and return sequences set to false. This setup is appropriate when the aim is to emphasize the overall sequence impact rather than individual time steps. Moreover, this approach simplifies the model architecture, and summarizing sequences can improve efficiency in both training and inference, thereby reducing computational complexity. In addition, this layout is used when the main objective is to determine whether a conversion happens at the end of the customer's interactions.

Another change occurs in the structure of the attention layer: the decoder, obtained by repeating the encoded summary over the sequence length, is defined before the dot product with the attention weights that produces the context vector s. The rest of the network architecture remains similar to the LSTM described earlier. The primary objective is to evaluate the significance of the channels based on the condensed sequence information.

# Input layers
input_att = Input(shape=(max_seq_length, vocab_size))
s_prev = Input(shape=(neuronasEncoder,))   # neuronasEncoder = 380
t = Input(shape=(max_seq_length, 1))

def build_lstm_block(inputs):
    x = LSTM(neuronasEncoder, dropout=0.05, recurrent_dropout=0.05,
             kernel_regularizer=l2(0.01), return_sequences=True)(inputs)
    return BatchNormalization()(x)

h_1 = build_lstm_block(input_att)
h_2 = build_lstm_block(h_1)
ht = BatchNormalization()(h_2)

# Encoder: condenses the whole sequence into a single vector
encoded = LSTM(neuronasEncoder, activation='relu', return_sequences=False)(ht)

# Decoder: the encoded summary repeated for every time step
decoder = RepeatVector(max_seq_length)(encoded)

# Attention energies computed from the previous state and the decoded summary
# (the attention layers are the same as those in Listing 2)
concatenator = Concatenate(axis=-1)
dense_tanh = Dense(10, activation="tanh")
dense_relu = Dense(1, activation="relu")
softmax_activation = Activation(softmax)   # custom softmax over the time axis

s_prev_repeated = RepeatVector(max_seq_length)(s_prev)
concatenated = concatenator([s_prev_repeated, decoder])
e = dense_tanh(concatenated)
v_t = dense_relu(e)
v_t = Subtract(name='timeDecay')([v_t, t])
a_t = softmax_activation(v_t)

# Context vector and final output
s = Dot(axes=1)([a_t, decoder])
c = Flatten()(s)
out_att = Dense(1, activation="sigmoid", name='single_output')(c)

Listing 4: Encoder-Decoder

Figure 8 illustrates the architecture of this model.

Figure 8: Architecture of the Encoder-Decoder model

4 Results and Discussion

4.1 Data Processing

This section describes the data transformation process required to structure the database for the implementation of the proposed models. The initial step involves organizing the database so that each row contains the information pertaining to an individual client. To achieve this, it is necessary to aggregate the various channels that a client has visited into a new column named Path.
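As a rough illustration (not part of the original implementation), this aggregation could be done with pandas as follows; the column names cookie and channel are assumptions, and interactions are assumed to be already ordered by timestamp.

import pandas as pd

# Assumed raw structure: one row per interaction, ordered by time of contact.
raw = pd.DataFrame({
    "cookie":  ["A", "A", "A", "B", "B"],
    "channel": ["facebook", "instagram", "searchpaid", "video", "instagram"],
})

# Aggregate every client's channels, in order, into a single Path string.
paths = (raw.groupby("cookie")["channel"]
            .apply(lambda s: ">".join(s))
            .reset_index(name="path"))
print(paths)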
The subsequent stage entails computing the Last time lapse column, representing the elapsed time in seconds between a client's latest interaction with a channel and their previous interactions with other channels. Finally, the Conversion column indicates whether, after all the interactions a client had, a conversion took place. The following is a demonstrative example of the database structure achieved through the described transformations.

Cookie | Path | Last time lapse | Conversion
0007iiAiFh3ifoo9Ehn3ABB0F | facebook>instagram>searchpaid>facebook | 588241,532798,515807,0 | 1
00079hhBkDF3k3kDkiFi9EFAD | video>instagram>instagram | 18337,178,0 | 0
00073CFE3FoFCn70fBhB3kfon | video>facebook>instagram>searchpaid>facebook | 588241,532798,515807,0 | 1
000A9AfDohfiBAFB0FDf3kDEE | facebook>video>searchpaid>video>instagram | 38337,7898,0 | 0
oooh73D7h9hCh03EfhBBhECnB | facebook>video>searchpaid>facebook | 288241,532798,600504,895807,0 | 0
oooh7FDi0hBnEDBii70hfEf93 | video>video>searchpaid>video>instagram | 157941,157304,130504,90007,0 | 0
ooohkBnnfDooo3hfCnfDfiEiB | instagram>video>searchpaid>video | 607976,402797,110504,0 | 0
oooiA9fi99FiAioAo97DohkF3 | facebook>video>searchpaid>video | 157941,105991,100504,96407,0 | 0
oooiCAf0Dno3Dfi7h7io9kCk9 | facebook>video>searchpaid>video>video>facebook | 607976,139794,41504,0 | 1
oooiFD9977iFC9DC3E3D000Ff | instagram>video>searchpaid>facebook>instagram>instagram | 157751,132995,100032,0 | 0
oooik3A7A7FA9oof3hDfin7CB | video>video>searchpaid>video>instagram | 237941,190007,132799,130504,118241,0 | 1
oook0nnhoCo0BoEAho7E9nfEC | instagram>video>searchpaid>video | 607976,593388,588241,319501,210804,0 | 1
ooooohAFofEnonEikhAi3fF9o | facebook>video>searchpaid>video | 157941,128798,110504,90007,0 | 0
000C9BiBFhoFhC7noEFAA7no7 | Online Video>Online Video>Online Video>Online Video>Online Video>facebook>searchpaid>instagram | 662205,659121,605944,417688,396693,345469,315380,0 | 0

Table 3: Overview of Input Datasets

After organizing the data as described earlier, the subsequent step requires identifying clients with a minimum of three interactions with potential touchpoints. Following this, essential adjustments are applied to both the Path column and the Last Time Lapse column.

To process the paths, it is necessary to implement a tokenizer transformation. The main objective of tokenization is to convert continuous text into discrete units, facilitating analysis and processing; this type of transformation is a crucial step in natural language processing and text analysis. In this instance, the LSTM network architecture requires its input via channel tokenization. To accomplish this, the Keras Tokenizer class, along with its corresponding methods, was used. Notably, the texts_to_sequences method was employed to convert the paths, expressed as texts, into sequences of integers, guided by the vocabulary built by the tokenizer.
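For illustration only, the following minimal snippet shows the behaviour of the Keras Tokenizer on two of the example paths; the exact integer assigned to each channel depends on channel frequencies in the full dataset, so the mapping in the comments is merely indicative.

from tensorflow.keras.preprocessing.text import Tokenizer

paths = ["facebook>instagram>searchpaid>facebook",
         "video>instagram>instagram"]

# '>' is among the Tokenizer's default filter characters, so each path
# is split into individual channel names automatically.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(paths)
print(tokenizer.word_index)                 # channel -> integer index (frequency ordered)
print(tokenizer.texts_to_sequences(paths))  # e.g. [[2, 1, 3, 2], [4, 1, 1]]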
Each channel within the paths received a unique integer value, resulting in the conversion of each path into a density vector, as exemplified in the following illustration.

Path | Density vector
facebook, instagram, searchpaid, facebook | [2, 4, 3, 2]
video, facebook, instagram, searchpaid, facebook | [1, 2, 4, 3, 2]
facebook, video, searchpaid, video, instagram | [2, 1, 3, 1, 4]
facebook, video, searchpaid, facebook | [2, 1, 3, 2]
video, video, searchpaid, video, instagram | [1, 1, 3, 1, 4]
instagram, video, searchpaid, video | [4, 1, 3, 1]
facebook, video, searchpaid, video | [2, 1, 3, 1]
facebook, video, searchpaid, video, video, facebook | [2, 1, 3, 1, 1, 2]
instagram, video, searchpaid, facebook, instagram, instagram | [4, 1, 3, 2, 4, 4]
video, video, searchpaid, video, instagram | [1, 1, 3, 1, 4]
instagram, video, searchpaid, video | [4, 1, 3, 1]
video, video, video, video, video, facebook, searchpaid, instagram | [5, 5, 5, 5, 5, 2, 3, 4]

Table 4: Channel Tokenization

In order to address the diverse lengths of paths across clients, it is essential to establish a framework that standardizes these sizes. The initial phase of this process involves determining the maximum size of an interaction sequence, denoted as max_seq_length. This parameter is a prerequisite for the pad_sequences function in the Keras library. Using this function enables the creation of uniform-length paths by either completing or truncating them. For paths shorter than max_seq_length, padding is applied with a specified value until the desired length is reached. In contrast, when sequences exceed the defined max_seq_length, truncation takes place to adhere to the specified length, resulting in the removal of values from the beginning of the sequence. Based on the previous example, if max_seq_length is set to seven, the density vectors are transformed into uniform-length vectors, as shown below.

Path | Density vector | Uniformed vector
facebook, instagram, searchpaid, facebook | [2, 4, 3, 2] | [0, 0, 0, 2, 4, 3, 2]
video, facebook, instagram, searchpaid, facebook | [1, 2, 4, 3, 2] | [0, 0, 1, 2, 4, 3, 2]
facebook, video, searchpaid, video, instagram | [2, 1, 3, 1, 4] | [0, 0, 2, 1, 3, 1, 4]
facebook, video, searchpaid, facebook | [2, 1, 3, 2] | [0, 0, 0, 2, 1, 3, 2]
video, video, searchpaid, video, instagram | [1, 1, 3, 1, 4] | [0, 0, 1, 1, 3, 1, 4]
instagram, video, searchpaid, video | [4, 1, 3, 1] | [0, 0, 0, 4, 1, 3, 1]
facebook, video, searchpaid, video | [2, 1, 3, 1] | [0, 0, 0, 2, 1, 3, 1]
facebook, video, searchpaid, video, video, facebook | [2, 1, 3, 1, 1, 2] | [0, 2, 1, 3, 1, 1, 2]
instagram, video, searchpaid, facebook, instagram, instagram | [4, 1, 3, 2, 4, 4] | [0, 4, 1, 3, 2, 4, 4]
video, video, searchpaid, video, instagram | [1, 1, 3, 1, 4] | [0, 0, 1, 1, 3, 1, 4]
instagram, video, searchpaid, video | [4, 1, 3, 1] | [0, 0, 0, 4, 1, 3, 1]
facebook, video, searchpaid, video | [2, 1, 3, 1] | [0, 0, 0, 2, 1, 3, 1]
video, video, video, video, video, facebook, searchpaid, instagram | [1, 1, 1, 1, 1, 2, 3, 4] | [1, 1, 1, 1, 2, 3, 4]

Table 5: Padding-Truncated Density Vectors
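The padding and truncation behaviour of Table 5 can be reproduced with a minimal call to pad_sequences, as sketched below using two of the density vectors above.

import tensorflow as tf

docs = [[2, 4, 3, 2],                 # shorter than max_seq_length: padded with zeros
        [1, 1, 1, 1, 1, 2, 3, 4]]     # longer than max_seq_length: truncated at the start

padded = tf.keras.utils.pad_sequences(docs, maxlen=7, padding="pre", truncating="pre")
print(padded)
# [[0 0 0 2 4 3 2]
#  [1 1 1 1 2 3 4]]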
4.2 Heuristic Models

According to Table 2, among the converting customers (7,417), 42% had a single interaction with one of the touchpoints; of these, 32% interacted with Paid Search, while 27% interacted exclusively through Facebook. Conversely, 32% of customers had two interactions with the touchpoints; among them, 24% engaged exclusively with Paid Search, while 16% interacted only through Facebook. Overall, 77% of these customers had at least one interaction with either Paid Search or Facebook.

Comparing this distribution with the channel attribution provided by the heuristic models in Figure 9, we find a consistent trend: the majority of these models attribute greater importance to the Paid Search and Facebook channels. The results from the heuristic models demonstrate how such models tend to skew their conclusions by relying only on partial information about customer behavior. Moreover, as previously mentioned, customers who had only one or two interactions are typically not the most relevant for the business case under analysis and were, in fact, excluded from the study due to their predisposition to acquire the offered good or service. This leads to the conclusion that, in cases where the objective is to attract new customers, who generally require more interactions with the product before converting, heuristic models may yield partial and inadequate conclusions regarding the allocation of resources to communication channels.

On the other hand, while Markov models are a more sophisticated tool, they also have disadvantages compared to sequential models. These include simplifying assumptions, such as hidden states being discrete and finite and observations being conditionally independent given the hidden states. Moreover, they are susceptible to both overfitting and underfitting, as selecting the appropriate number of hidden states and prior distributions requires careful consideration of the parameters, and they depend on the quantity of observed data. Furthermore, neither Markovian nor heuristic models are capable of generating a prediction of the conversion rate.

Figure 9: Contribution based on Heuristic Models

4.3 Model 1: Long-Short Term Memory

This section presents the results obtained from the original sequential model applied in Li et al. (2018). Among the advantages of this model are its ability to consider sequential data and to generate predictions of the customer conversion rate.

False negatives in this problem correspond to customers who convert but whom the model predicts as non-converters. In turn, recall is the ratio of accurately predicted converting customers to the total number of converting customers, making it a crucial metric for evaluation. This significance stems from the relatively low cost of reaching out to non-converting customers compared to the potential loss incurred by failing to engage with converting ones. In addition, contactability depends on the channels used to propagate the advertising campaign. Consequently, understanding the channels through which converting customers interacted is vital for informed decision-making and resource allocation optimization. Hence, the emphasis lies on seeking a model with a good recall measure rather than on maximizing precision. Furthermore, considering the inherent imbalance in the problem under investigation, the ROC AUC alone may not be a suitable metric to evaluate model performance.
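For reference, these evaluation metrics can be computed with scikit-learn roughly as in the following sketch; y_true and y_prob are illustrative names for the test labels and the predicted conversion probabilities, and the 0.5 decision threshold is only an example.

from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score, average_precision_score)

y_pred = (y_prob >= 0.5).astype(int)   # illustrative decision threshold

precision = precision_score(y_true, y_pred)
recall    = recall_score(y_true, y_pred)
f1        = f1_score(y_true, y_pred)
roc_auc   = roc_auc_score(y_true, y_prob)
pr_auc    = average_precision_score(y_true, y_prob)  # common estimate of the PR AUC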
As discussed previously, the combinations of minimum and maximum sequence lengths that generated the best preliminary results were (3, 6), (3, 8), (5, 7), (5, 8), and (6, 8). Table 6 presents the main indicators to be considered in the selection of the final model for these combinations: the execution time, precision, recall, F1 score, positive class distribution, PR AUC, and ROC AUC over the test sample. It is worth mentioning that Delta has been defined as the disparity between the positive class percentage and the area under the precision-recall curve (PR AUC). It can be observed that the combination (5, 8) achieves the highest PR AUC. Although other combinations, such as (5, 7) and (6, 8), have a higher recall value of 0.66, taking into account the execution time and the other performance indicators, (5, 8) was selected as the optimal combination.

Min. | Max. | Exec. (h) | Precision | Recall | F1 | Positive class | PR AUC | ROC AUC
3 | 6 | 3.1 | 0.17 | 0.46 | 0.24 | 0.11 | 0.33 | 0.59
3 | 8 | 3.9 | 0.16 | 0.47 | 0.24 | 0.11 | 0.27 | 0.59
5 | 7 | 1.6 | 0.18 | 0.66 | 0.28 | 0.14 | 0.30 | 0.59
5 | 8 | 1.8 | 0.19 | 0.54 | 0.28 | 0.14 | 0.38 | 0.58
6 | 8 | 1.3 | 0.18 | 0.66 | 0.29 | 0.16 | 0.26 | 0.56

Table 6: Evaluation Metrics for LSTM Model Across Different Sequences

To improve prediction performance, alternatives commonly used for imbalanced datasets were explored. Among them, an oversampling methodology, the Synthetic Minority Oversampling Technique (SMOTE), a notable example of this family of methods (Bao et al., 2020), was attempted. SMOTE generates additional samples for minority classes through linear interpolation; this involves creating synthetic instances along lines connecting minority class samples and their neighboring instances. The algorithm selects a subset of data from the minority classes, creates synthetic examples, and integrates them into the original dataset. This augmented dataset then acts as a training sample for classification models, effectively mitigating the overfitting concerns associated with simplistic random oversampling techniques. However, the results obtained by implementing this technique showed worse performance than those obtained previously.

Although the prediction performance for the data used is not optimal, the sequential LSTM model provides insights into the conversion rate, offering an advantage not provided by heuristic or Markovian models. The remainder of this section presents the results obtained from the selected model. The learning curves, depicted in Figure 10, illustrate the model's performance on the training dataset. Notably, the learning curves begin to stabilize after approximately 50 epochs. The smooth progression of these curves indicates that the model is improving consistently over time, reflecting stable learning behavior.

Ideally, the validation loss curve should be slightly below the training loss curve, which would indicate effective regularization and strong generalization to unseen data. Furthermore, when the validation and training loss curves closely overlap, it suggests that the model is well balanced, avoiding both overfitting and underfitting, as indicated by similar performance on both the training and validation sets.

Figure 10: Training vs Validation

To address the potential issue of overfitting in the model, K-Fold Cross-Validation with k = 5 was implemented. A key advantage of this methodology is its effectiveness in mitigating overfitting. The technique involves partitioning the data into multiple folds or subsets. In each iteration, one fold is used as the validation set while the model is trained on the remaining folds. This process is repeated such that each fold serves as the validation set exactly once.
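A sketch of this cross-validation loop is shown below. It uses a stratified variant of K-Fold, a natural choice for the imbalanced target (whether the original implementation stratified the folds is not specified), and assumes a hypothetical build_model helper that wraps the construction and compilation of the model in Listing 3; the multiple model inputs are abbreviated to a single numpy array X for readability.

from sklearn.model_selection import StratifiedKFold

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1119)

for fold, (train_idx, val_idx) in enumerate(kfold.split(X, y), start=1):
    model = build_model()          # hypothetical helper wrapping Listing 3
    model.fit(X[train_idx], y[train_idx], epochs=175, batch_size=90, verbose=0)
    loss, acc = model.evaluate(X[val_idx], y[val_idx], verbose=0)
    print(f"Fold {fold}: accuracy = {acc:.2f}, loss = {loss:.2f}")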
Table 7 presents the behavior of the key performance metrics across the different folds. Generally, the results exhibit stability, yet there is some variation in the recall, with values ranging from a minimum of 0.52 to a maximum of 0.81. This variation is also reflected in the area under the precision-recall curve (PR AUC), which is affected by the fluctuations in recall and ranges between 0.17 and 0.36. The average PR AUC is 0.22, which is 16 percentage points lower than the value shown in Table 6. Since cross-validation offers a more reliable performance measure, the lower average PR AUC across folds indicates that the model could benefit from further tuning, regularization, or an evaluation of the representativeness of the test set.

Fold | Accuracy | Loss | Precision | Recall | F1 | ROC AUC | PR AUC
1 | 0.86 | 0.40 | 0.15 | 0.81 | 0.25 | 0.54 | 0.17
2 | 0.85 | 0.42 | 0.18 | 0.65 | 0.28 | 0.57 | 0.19
3 | 0.86 | 0.40 | 0.17 | 0.64 | 0.27 | 0.57 | 0.17
4 | 0.86 | 0.40 | 0.17 | 0.52 | 0.26 | 0.56 | 0.36
5 | 0.86 | 0.41 | 0.20 | 0.55 | 0.29 | 0.58 | 0.20
Average | 0.86 | 0.40 | 0.17 | 0.63 | 0.27 | 0.57 | 0.22

Table 7: Cross-Validation Performed for the LSTM Selected Model

As shown in Figure 11, given the nature of the problem, prioritizing a high recall value over precision is preferred. In this case, the precision is 0.19, while the recall is 0.54. This recall value indicates that the model is able to identify 0.54 of the cases with positive conversion; in other words, 0.46 of the positive conversion cases are not being correctly classified by the model.

Figure 11: Actual vs Predicted Classifications

On the other hand, Figure 12 demonstrates that the model is effectively identifying positive instances at a rate substantially higher than the base rate of the positive class, which is 0.14. A PR AUC of 0.38 reflects the trade-off between the precision of positive predictions and the ability to identify positive conversions. Although a higher PR AUC would be ideal, a value of 0.38 still indicates that the model is making meaningful predictions, identifying positive cases better than would be expected by chance, despite the imbalance in the dataset.

Figure 12: Performance Across Thresholds

As previously mentioned, the primary objective of attribution modeling extends beyond mere model prediction. A robust representation of dynamic pathways holds significant value for future business decisions and budget optimization in strategic decision-making processes. The relevance of the channels is shown in Figure 13; based on the results obtained in the attention layer for the converting clients, relative weights were assigned. The findings indicate that Instagram, with a contribution of 0.34, has the most significant impact on clients with positive conversions, followed by Online Display with 0.31, Facebook with 0.15, and Paid Search with 0.13.

A variation in the importance of the channels can be observed compared to the results from the heuristic models, which identify Facebook, Paid Search, and Online Video as the most significant channels. For instance, the LSTM sequential model assigns Facebook about half the importance that the heuristic models do. It is important to note that Facebook was the touchpoint with the most interactions not only in the subset of clients who had one or two interactions with communication channels.
Among clients with at least three interactions, who were included in the calibration of the sequential models, 30% of their total interactions occurred via Facebook. These results underscore the value of sequential models in attributing key communication channels, enabling more accurate outcomes by using historical customer behavior.

Figure 13: Contribution of Communication Channels

4.4 Model 2: Encoder-Decoder Modification

As shown in Table 8, the combinations of minimum and maximum sequence lengths that generated the best preliminary results were {(5, 8), (6, 5), (7, 6), (8, 3), (8, 8)}. In this case, the (8, 3) combination has the highest PR AUC; this model was selected as the best option because, in addition to showing the best fitting metrics, it aligns well with business needs in which the collected data often include a large proportion of long interaction sequences.

Min. | Max. | Exec. (h) | Precision | Recall | F1 | Positive class | PR AUC | ROC AUC
5 | 8 | 1.5 | 0.20 | 0.64 | 0.30 | 0.14 | 0.25 | 0.61
6 | 5 | 0.7 | 0.18 | 0.76 | 0.30 | 0.16 | 0.31 | 0.57
7 | 6 | 0.6 | 0.19 | 0.82 | 0.31 | 0.17 | 0.28 | 0.55
8 | 3 | 0.3 | 0.20 | 0.75 | 0.32 | 0.18 | 0.32 | 0.55
8 | 8 | 0.6 | 0.21 | 0.73 | 0.32 | 0.18 | 0.25 | 0.56

Table 8: Evaluation Metrics for Encoder-Decoder Model Across Different Sequences

The two main differences and benefits identified with the modification of the encoder in the original base sequential model are that the Encoder-Decoder model provides better fitting results with long sequences and that, as can be verified, its execution times are generally significantly shorter than those of the previous model.

From Figure 14, a reduction in the loss value can be observed, indicating increasingly accurate model predictions. A decreasing training error indicates effective learning from the training data; moreover, as the number of epochs increases, the learning curves begin to stabilize, indicating that the model's performance is starting to converge. If these curves did not stabilize, it could indicate that the model requires more epochs to reach convergence, or it could be due to overfitting or underfitting issues.

Figure 14: Training vs Validation

On the other hand, Table 9 demonstrates consistent model performance across the five folds, indicating stability with minimal variation in key metrics such as accuracy, loss, and F1 score. The consistent accuracy and relatively stable precision, recall, and ROC AUC values suggest that the model generalizes well across different subsets of the data. This stability implies that the model is not overly sensitive to variations in the training data, which is a positive sign for its robustness and reliability when applied to unseen data.

Fold | Accuracy | Loss | Precision | Recall | F1 | ROC AUC | PR AUC
1 | 0.84 | 0.44 | 0.19 | 0.70 | 0.30 | 0.55 | 0.31
2 | 0.82 | 0.48 | 0.20 | 0.76 | 0.32 | 0.55 | 0.32
3 | 0.82 | 0.46 | 0.21 | 0.71 | 0.32 | 0.56 | 0.21
4 | 0.82 | 0.47 | 0.21 | 0.76 | 0.32 | 0.56 | 0.32
5 | 0.81 | 0.48 | 0.22 | 0.69 | 0.31 | 0.55 | 0.28
Average | 0.82 | 0.46 | 0.20 | 0.72 | 0.31 | 0.55 | 0.29

Table 9: Cross-Validation Performed for the Encoder-Decoder Selected Model

Similar to the previous model's results, due to the nature of the problem it is preferable to prioritize a high recall value over precision. According to Figure 15, the model achieves a precision of 0.20 and a recall of 0.75, indicating that the model correctly identifies 0.75 of the positive conversion cases.
Figure 15: Actual vs Predicted Classifications

Figure 16 shows that the model effectively identifies positive instances at a rate much higher than the 0.18 base rate of the positive class. A PR AUC of 0.32 still indicates that the model is making meaningful predictions and identifying positive cases better than would be expected by chance.

Figure 16: Performance Across Thresholds

Another difference observed with this model is its reduced emphasis on Facebook, leading to its exclusion from the top three most important channels. However, as highlighted by the sequential models analyzed, Instagram and Online Display consistently demonstrate the greatest influence on customer conversion. The results obtained from both sequential models demonstrate that the channels with the highest number of interactions are not necessarily the most significant in the consumer's final decision. Consequently, the insights generated by these models become valuable tools for optimizing resources when designing advertising campaigns.

Figure 17: Contribution of Communication Channels

5 Conclusions

The comparison of sequential models, such as the LSTM and the Encoder-Decoder, shows that channels with the highest number of interactions, like Facebook and Paid Search, are not always the most influential in customer conversions. These models capture long-term dependencies and provide more accurate conversion predictions, offering deeper insights into channel contribution than heuristic models. For example, with the LSTM model it was concluded that Instagram and Online Display were on average the most influential communication channels, thereby fulfilling the first proposed objective.

As part of the second proposed objective, the modification of the Deep Neural Net with Attention for Multi-channel Multi-touch Attribution model through the inclusion of an LSTM Encoder-Decoder was analyzed; it was observed that this modification performed better with longer sequences. This model provided more precise fitting results and shorter execution times, making it suitable for cases with more extended customer interactions and complex interaction patterns across multiple channels.

Regarding the last objective, the PR AUC of the models was analyzed. When comparing the PR AUC of the models across different ranges of sequence lengths, regardless of their highest values, several distinctions can be observed. The PR AUC of the LSTM outperforms that of the Encoder-Decoder in scenarios where the minimum sequence length exceeds four and the maximum sequence length exceeds six. For instance, for the combinations of maximum and minimum sequence (8,5) and (8,6), the PR AUC of the LSTM is higher than that of the Encoder-Decoder by 0.13 and 0.08, respectively.

Conversely, in cases where the minimum sequence length is less than four and the maximum sequence length is less than six, the Encoder-Decoder shows better performance. For example, for the combination of maximum and minimum sequence (3,4), the PR AUC of the Encoder-Decoder is higher than that of the LSTM by 0.03. It is important to note that the Encoder-Decoder's superior performance in these shorter sequence ranges does not correspond to its highest PR AUC values. Nevertheless, the application of the Encoder-Decoder allows accelerated training of the model and yields better results in predicting a conversion in the earlier interactions of the consumer, which allows the user to make decisions without waiting for a longer interaction history before predicting consumer behavior.
Furthermore, both models identified Instagram and Online Display as the most influential channels. These results show that the channels with the highest number of interactions are not necessarily the most significant in the consumer's final decision. Although sequential models provide more robust results, it is crucial to consider the type of information that needs to be collected for proper training.

For future implementations, it is recommended to consider a data set collected specifically for sequential models. Such a data set would provide more personalized data to capture customer interaction patterns over time, allowing for more effective model calibration and improved predictive performance. In this particular case, the accuracy of the conversion rate predictions could have been improved if a publicly available data set specifically designed for calibrating sequential models had been available.

References

Abhishek, V., Fader, P., & Hosanagar, K. (2012). The long road to online conversion: A model of multi-channel attribution. doi: 10.2139/ssrn.2158421

Abirami, S., & Chitra, P. (2020). Chapter fourteen - Energy-efficient edge based real-time healthcare support system (Vol. 117, No. 1; P. Raj & P. Evangeline, Eds.). Elsevier. Retrieved from https://www.sciencedirect.com/science/article/pii/S0065245819300506 doi: 10.1016/bs.adcom.2019.09.007

Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv:1409.0473. Retrieved from https://api.semanticscholar.org/CorpusID:11212020

Bao, F., Wu, Y., Li, Z., Li, Y., Liu, L., & Chen, G. (2020). Effect improved for high-dimensional and unbalanced data anomaly detection model based on KNN-SMOTE-LSTM. Complexity, 2020, 1-17. doi: 10.1155/2020/9084704

Boyd, K., Eng, K. H., & Page, C. D. (2013). Area under the precision-recall curve: Point estimates and confidence intervals. In Machine Learning and Knowledge Discovery in Databases (pp. 451-466). Springer Berlin Heidelberg.

Brownlee, J. (2019). Imbalanced classification with Python: Better metrics, balance skewed classes, cost-sensitive learning. Machine Learning Mastery.

Ferri, C., Hernández-Orallo, J., & Modroiu, R. (2009). An experimental comparison of performance measures for classification. Pattern Recognition Letters, 30(1), 27-38. Retrieved from https://www.sciencedirect.com/science/article/pii/S0167865508002687 doi: 10.1016/j.patrec.2008.08.010

Danaher, P., & Danaher, T. (2013). Comparing the relative effectiveness of advertising channels: A case study of a multimedia blitz campaign. Journal of Marketing Research, 50, 517-534. doi: 10.1509/jmr.12.0241

de Haan, E., Wiesel, T., & Pauwels, K. (2016). The effectiveness of different forms of online advertising for purchase conversion in a multiple-channel attribution framework. International Journal of Research in Marketing, 33(3), 491-507. Retrieved from https://www.sciencedirect.com/science/article/pii/S0167811615001421 doi: 10.1016/j.ijresmar.2015.12.001

Buhalis, D., & Volchek, K. (2021). Bridging marketing theory and big data analytics: The taxonomy of marketing attribution. International Journal of Information Management, 56. doi: 10.1016/j.ijinfomgt.2020.102253
Hastie, T., Tibshirani, R., & Friedman, J. (2017). The elements of statistical learning: Data mining, inference, and prediction. New York, NY, USA: Springer New York Inc.

Huyton, H. (2021). Multitouch attribution modelling. Retrieved from https://www.kaggle.com/code/hughhuyton/multitouch-attribution-modelling (Accessed: 2023)

Jeremite, T. L. (2019). FFDNA.py: Channel attribution model. Retrieved from https://github.com/jeremite/channel-attribution-model/blob/master/FFDNA.py (Accessed: 2023)

Kakalejcik, L., Ferencova, M., Angelo, P., & Bucko, J. (2018). Multichannel marketing attribution using Markov chains. Statistika: Statistics and Economy Journal, 101.

Kumar, V., & Stewart, D. W. (2021). Marketing accountability for marketing and non-marketing outcomes. Emerald Publishing Limited.

Li, N., Arava, S. K., Dong, C., Yan, Z., & Pani, A. (2018). Deep neural net with attention for multi-channel multi-touch attribution. doi: 10.48550/arXiv.1809.02230

Lipton, Z. C., Elkan, C., & Narayanaswamy, B. (2014). Optimal thresholding of classifiers to maximize F1 measure. Berlin, Heidelberg: Springer Berlin Heidelberg.

Sagheer, A., & Kotb, M. (2019). Unsupervised pre-training of a deep LSTM-based stacked autoencoder for multivariate time series forecasting problems. Scientific Reports, 9, 19038. doi: 10.1038/s41598-019-55320-6

Shi, Z., Mamun, A. A., Kan, C., Tian, W., & Liu, C. (2022). An LSTM-autoencoder based online side channel monitoring approach for cyber-physical attack detection in additive manufacturing. Journal of Intelligent Manufacturing. doi: 10.1007/s10845-021-01879-9

Thaichon, P., & Quach, S. (2016). Online marketing communications and childhood's intention to consume unhealthy food. Australasian Marketing Journal (AMJ), 24. doi: 10.1016/j.ausmj.2016.01.007

Thölke, P., Mantilla-Ramos, Y.-J., Abdelhedi, H., Maschke, C., Dehgan, A., Harel, Y., ... Jerbi, K. (2023). Class imbalance should not throw you off balance: Choosing the right classifiers and performance metrics for brain decoding with imbalanced data. NeuroImage, 277, 120253. Retrieved from https://www.sciencedirect.com/science/article/pii/S1053811923004044 doi: 10.1016/j.neuroimage.2023.120253