UNIVERSIDAD DE COSTA RICA
SISTEMA DE ESTUDIOS DE POSGRADO

MOLECULAR DETERMINANTS OF ANTIBIOTIC TOLERANCE IN THE HIGH-RISK Pseudomonas aeruginosa AG1 BY A MULTI-OMICS APPROACH: FROM THE GENOME TO THE TRANSCRIPTOMIC NETWORK IN RESPONSE TO CIPROFLOXACIN.

DETERMINANTES MOLECULARES DE LA TOLERANCIA A LOS ANTIBIÓTICOS EN LA CEPA DE ALTO RIESGO Pseudomonas aeruginosa AG1 MEDIANTE UN ENFOQUE MULTI-ÓMICO: DEL GENOMA A LA RED TRANSCRIPTÓMICA EN RESPUESTA A LA CIPROFLOXACINA.

Thesis submitted to the Commission of the Graduate Program of the Doctorate in Sciences (Programa de Posgrado de Doctorado en Ciencias) in candidacy for the degree and title of Doctorate in Sciences

JOSE ARTURO MOLINA MORA

Ciudad Universitaria Rodrigo Facio, Costa Rica
2020

TABLE OF CONTENTS

DEDICATION/DEDICATORIA
ACKNOWLEDGMENTS/AGRADECIMIENTOS
APPROVAL PAGE (HOJA DE APROBACIÓN)
TABLE OF CONTENTS
RESUMEN (SUMMARY IN SPANISH)
SUMMARY
  Abbreviations
  List of Figures
INTRODUCTION
  Antibiotic resistance and tolerance
  Pseudomonas aeruginosa AG1 (PaeAG1)
  The multi-omics approach to study PaeAG1
JUSTIFICATION
RESEARCH QUESTION
HYPOTHESIS
RESEARCH OBJECTIVES
  General objective
  Specific objectives
CHAPTER 1
  High quality 3C de novo assembly and annotation of a multidrug resistant ST-111 Pseudomonas aeruginosa genome: Benchmark of hybrid and non-hybrid assemblers
CHAPTER 2
  Genomic context of the two integrons of ST-111 Pseudomonas aeruginosa AG1: a VIM-2-carrying old-acquaintance and a novel IMP-18-carrying integron
CHAPTER 3
  Two-dimensional gel electrophoresis (2D-GE) image analysis based on CellProfiler: Pseudomonas aeruginosa AG1 as model
CHAPTER 4
  A first Pseudomonas aeruginosa perturbome: Identification of core genes related to multiple perturbations by a machine learning approach
CHAPTER 5
  Transcriptomic determinants of the response of ST-111 Pseudomonas aeruginosa AG1 to ciprofloxacin identified by a top-down systems biology approach
GENERAL DISCUSSION AND CONCLUSIONS
SUPPLEMENTARY MATERIAL
REFERENCES

Authorization for the digitization and public communication of Final Graduation Projects (Trabajos Finales de Graduación, TFG) of the Sistema de Estudios de Posgrado in the Institutional Repository of the Universidad de Costa Rica.

I, Jose Arturo Molina Mora, holder of identity card number 1-1252-0428, in my capacity as author of the TFG entitled "Molecular determinants of antibiotic tolerance in the high-risk Pseudomonas aeruginosa AG1 by a multi-omics approach: from the genome to the transcriptomic network in response to ciprofloxacin", authorize the Universidad de Costa Rica to digitize this TFG and make it publicly available free of charge through the Institutional Repository or another electronic medium, so that it is placed at the disposal of the public as established by the Sistema de Estudios de Posgrado.

YES [x]   NO [ ]*
*If not, please indicate the restriction period: ____ year(s).

This Final Graduation Project will be published in PDF format, or in whatever format is established at the time, so that access to it is free, in order to allow consultation and printing but not modification.

I declare that my Final Graduation Project was duly uploaded to the Kerwá digital system, that its content corresponds to the original document used to obtain my degree, and that its information does not infringe or violate any third-party rights. The TFG also has the approval of my Thesis Director or Tutor and complied with the format review requirements of the Sistema de Estudios de Posgrado.

STUDENT INFORMATION:
Full name: Jose Arturo Molina Mora.
Student ID number: A33282.
Identity card number: 1-1252-0428.
E-mail: jose.molinamora@ucr.ac.cr.
Date: 14 January 2020.
Telephone number: 88859445.
Name of the Thesis Director or Tutor: Dr. Fernando García Santamaría.

STUDENT SIGNATURE

Note: This document constitutes a sworn statement, whose scope assures the University that its content is taken as true. Its importance lies in the fact that it shortens administrative procedures and, at the same time, creates legal liability: whoever declares contrary to the truth may consequently face criminal proceedings for the offense of perjury, defined in Article 318 of our Penal Code. This implies that the student must make every effort not only to include truthful information in the Publication License, but also to diligently upload the correct document to the Kerwá digital platform.

INTRODUCTION

Biological systems rely on the DNA → RNA → protein information transfer paradigm that determines the phenotype of an organism. The comprehensive or global assessment of a set of molecules, which requires interpretation of molecular intricacy, is referred to as "-omics" (Subramanian, Verma, Kumar, Jere, & Anamika, 2020). Classical -omics levels refer to genomics, transcriptomics, proteomics, and metabolomics, but the spectrum of omics has been extended to other biological data such as epigenomics, phenomics, perturbomics, lipidomics, venomics, and many others. The current high-throughput nature of these techniques, as well as their increased accessibility in terms of time and cost, has expanded the volume of information that can be gathered in individual studies including multiple omics levels, an approach known as "multi-omics".
Multi-omics can provide a greater understanding of the flow of information in biological systems, from the original biological set-up or condition (genetic, environmental, or developmental) to the functional consequences or relevant interactions (Civelek & Lusis, 2014; Hasin, Seldin, & Lusis, 2017). This makes it possible to draw more comprehensive conclusions on biological processes, for which these data sets must be integrated and analyzed as a holistic system (Subramanian et al., 2020). Also, integrated approaches that combine individual omics data help to bridge the gap from genotype to phenotype and are considered a promising strategy to understand the complexity of biological systems and unravel the mechanisms underlying the biological condition of interest (Civelek & Lusis, 2014; Subramanian et al., 2020).

In this context, a comprehensive multi-omics approach was implemented in this work to study molecular determinants of antibiotic tolerance in a model of Pseudomonas aeruginosa, including genomics, transcriptomics, perturbomics, proteomics, and phenomics as the main omics levels.

Antibiotic resistance and tolerance

Antimicrobial resistance is the ability of a microbe to grow in an inhibitory concentration of an antibiotic, explained by inherited mechanisms (Berti & Hirsch, 2020; Brauner, Fridman, Gefen, & Balaban, 2016). Tolerance is generally used to describe the ability of microorganisms to survive transient exposure to bactericidal antibiotics. It may or may not be inherited, involves a reduced rate of antimicrobial killing, and is often achieved by slowing down cell growth (Berti & Hirsch, 2020; Brauner et al., 2016).

Antibiotic resistance is a major threat to public health because it compromises the administration of appropriate antibiotic therapy.
This reduces the therapeutic options to treat infections, increasing patient morbidity and mortality (Farajzadeh Sheikh et al., 2019; Woodford, Turton, & Livermore, 2011), and raises the costs of health services. The situation is aggravated by the emergence of strains resistant to multiple antibiotics (Firme, Kular, Lee, & Song, 2010), the limited knowledge of interactions with pathogens and of the mechanisms of action of antimicrobial agents, and the reduced development of new antibiotics (Brazas, Brazas, Hancock, & Hancock, 2005).

The use of antibiotics below the minimum inhibitory concentration (MIC), that is, at sub-inhibitory concentrations, also contributes to antibiotic resistance because it selects pre-existing resistant organisms and allows the strains to continue growing (McVicker et al., 2014). Since sub-inhibitory antibiotic concentrations are found in many natural environments, bacteria can naturally trigger mechanisms of tolerance and resistance (Andersson & Hughes, 2014). However, the fundamental mechanisms of bacterial response to antibiotics have not been fully elucidated (Stewart et al., 2015).

Since in this study we consider not only inherited mechanisms (genomic level, focused on resistance) but also transcriptomic and phenotypical observations using sub-inhibitory antibiotic concentrations (with mechanisms that can be related to tolerance), we use the term "tolerance", following Ciofu & Tolker-Nielsen (2019), to refer to all the molecular responses that bacteria display when exposed to antibiotics.

Pseudomonas aeruginosa AG1 (PaeAG1)

Pseudomonas aeruginosa is an opportunistic and versatile pathogen able to survive in a wide variety of environments (Klockgether et al., 2010). With a large genome (6-7.5 Mb), P. aeruginosa strains dedicate a large proportion of the genome (>8%) to regulatory functions (Cabot et al., 2016), resulting in a diversity of metabolic capabilities and responses to stress. Because of these features, P.
aeruginosa is responsible for infections among immunocompromised hosts (Lu et al., 2016) and nosocomial infections (Fernández, Corral-Lugo, & Krell, 2018). However, the treatment of P. aeruginosa infections is challenging due to its many intrinsic and acquired mechanisms of resistance (Toval et al., 2015), resulting in significant morbidity and mortality. According to the World Health Organization (WHO), resistance to carbapenems in P. aeruginosa, Acinetobacter baumannii, and the Enterobacteriaceae family is considered a critical issue in the context of antibiotic resistance, and these organisms are classified in the Priority 1 group (World Health Organization, 2017).

In Costa Rica, the isolation of carbapenem-resistant P. aeruginosa strains is relatively common in some major hospitals, with a prevalence of up to 63.1%, as previously reported (Toval et al., 2015), much higher than the frequencies observed in other countries (Hong et al., 2015). The Costa Rican strain P. aeruginosa AG1 (PaeAG1) was identified as the first report of a P. aeruginosa isolate carrying both VIM-2 and IMP-18 genes encoding metallo-β-lactamase (MBL) enzymes, both with carbapenemase activity (Toval et al., 2015). Later, another isolate from the United Kingdom with the same enzymes was reported (Turton et al., 2015). PaeAG1 was grown from a sputum sample of a patient from the Intensive Care Unit in the San Juan de Dios Hospital (San José, Costa Rica) in 2010. This strain is resistant to multiple antibiotics, namely β-lactams (including carbapenems), aminoglycosides, and fluoroquinolones, and is susceptible only to colistin.

Figure 1. General workflow to study molecular determinants of antibiotic tolerance in the high-risk P. aeruginosa AG1 by a multi-omics approach.
This study is based on five main steps: genome assembly and annotation, pan-genome analysis and integron architecture, proteomic profiling after antibiotic exposure, identification of the core perturbome, and the response to ciprofloxacin at the transcriptomic and phenomic levels.

The first analysis of the genes in PaeAG1 by Sanger sequencing (primer walking method) confirmed that the VIM-2 and IMP-18 genes are encoded in class 1 integrons (NCBI accessions KC907378 and KC907377) (Toval et al., 2015). In addition, at the phenomic level, a preliminary comparison to the reference strain (P. aeruginosa PAO1) showed that PaeAG1 has particular features after exposure to different antibiotics, including pigment production, biofilm formation, phage plaque induction, and others (Chinchilla, 2018; Toval et al., 2015).

The multi-omics approach to study PaeAG1

In view of the genomic and phenomic features of PaeAG1, we were interested in studying PaeAG1 in depth using a multi-omics approach. To this end, the strategy was developed in five main steps, each one materialized as a scientific paper and a chapter in this thesis (Figure 1).

First, genome sequencing was done using short- and long-read technologies. Although a reference genome is available for the P. aeruginosa group (strain PAO1), a de novo strategy to assemble (or build) the PaeAG1 genome was required, since it was initially estimated that PaeAG1 has ~1.0 Mb of additional DNA sequence in its genome.

Figure 2. Definition of the 3C criterion: Contiguity, Correctness and Completeness. Benchmarking of multiple assemblies can be done using metrics related to the number of pieces obtained versus expected (contiguity), the fidelity of the assembly compared to the actual sequence (correctness), and the ability to reconstruct a minimum set of expected genes that are vital to the species (completeness). More details in Chapter 1.
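The contiguity component of the 3C criterion can be illustrated with a minimal sketch. It assumes that contiguity is summarized by the number of contigs relative to the expected number of pieces and by the N50 statistic; N50 is a common choice, but the exact metrics used in Chapter 1 are not restated here, so the function name, the returned fields, and the contig lengths below are all hypothetical.

```python
def contiguity_metrics(contig_lengths, expected_pieces=1):
    """Summarize the contiguity of one assembly (illustrative metrics only).

    contig_lengths: contig sizes in bp.
    expected_pieces: number of replicons expected (assumed 1 for a single
    circular chromosome).
    """
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    # N50: length of the contig at which the cumulative sum of the sorted
    # contigs first reaches half of the total assembly length.
    running = 0
    n50 = lengths[-1]
    for length in lengths:
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {
        "n_contigs": len(lengths),
        "excess_pieces": len(lengths) - expected_pieces,
        "total_length": total,
        "n50": n50,
    }


# Hypothetical contig lengths (bp) for a fragmented and a closed assembly.
fragmented = contiguity_metrics([3_000_000, 2_000_000, 1_500_000, 500_000])
closed = contiguity_metrics([7_000_000])
print(fragmented)
print(closed)
```

Under these assumptions, an assembly with fewer excess pieces and a larger N50 would rank better on contiguity; correctness and completeness need separate metrics and are not covered by this sketch.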
As detailed in Chapter 1, a benchmark of non-hybrid (using a single DNA sequencing technology) and hybrid (using both short- and long-read data) assemblers was required to select the optimal approach. To make this possible, the 3C criterion (i.e. contiguity, correctness, and completeness) was conceptualized as a set of metrics that can be used to benchmark genome assemblies and select the best approach (Figure 2).

The final assembly (GenBank CP045739), obtained with a hybrid approach, revealed that PaeAG1 has not only the expected gene content for the P. aeruginosa group but also specific elements that are absent in the reference genome: 57 genomic islands (corresponding to ~1.0 Mb of DNA sequence and >1000 genes) harboring the two complete class 1 integrons, six prophages, mobile genetic elements, and some virulence factors (Figure 3). In addition, PaeAG1 has 58 resistance genes, a non-functional CRISPR-Cas system (which may explain the high content of genomic islands), and the molecular genotyping profile of a high-risk sequence type 111 (ST-111) strain.

Figure 3. Assembly and annotation of the P. aeruginosa AG1 genome. Circularized genome showing genomic islands harboring phages, integrons and other elements. Details in Chapter 1 and (J.-A. Molina-Mora, Campos-Sánchez, Rodríguez, Shi, & García, 2020).

These results are key inputs for the subsequent analyses of the multi-omics approach. If mapping to the reference genome had been selected instead of a de novo assembly, the gene content of the extra 1.0 Mb of DNA sequence could not have been revealed. In this regard, Chapter 2 focuses on the two PaeAG1 integrons and Chapter 5 reveals the role of phages in the response to ciprofloxacin. Importantly, these integrons and phages are absent in the reference genome.

In order to describe the landscape of the genomic regions associated with the two integrons of PaeAG1, a comparative genomic strategy was performed as a second main step (Chapter 2).
It was first demonstrated, using RT-qPCR, that VIM-2 and IMP-18 are inducible genes under exposure to carbapenems. We then described the phylogenetic relationships among all the complete genomes of P. aeruginosa strains using a pan-genome analysis. This led to the identification not only of the core and the accessory genome for this group, but also of other strains sharing the PaeAG1 genomic islands. Phylogenetically related strains were also classified as ST-111 clones, although these strains showed variant profiles of the PaeAG1 genomic island content.

ST-111 is a lineage that belongs to the high-risk group in P. aeruginosa (Oliver, Mulet, López-Causapé, & Juan, 2015), which is frequently associated with epidemics where multidrug resistance confounds treatment (Petitjean et al., 2017). Many P. aeruginosa high-risk clones carry genomic determinants of antibiotic resistance such as carbapenemases or extended-spectrum β-lactamases (Oliver et al., 2015). Since PaeAG1 has special genomic features regarding antibiotic multi-resistance, with the carbapenemase activity conferred by the VIM-2 and IMP-18 genes, the profile of genomic island content in phylogenetically related genomes was used to gain insights into the evolution and landscape of the genomic regions around the MBL-carrying integrons of PaeAG1. Thus, specific genomic regions associated with the two integrons were reconstructed and characterized to compare the gene content and architecture in closely related genomes (Figure 4).

Figure 4. Architecture of the genomic regions containing the MBL-carrying integrons. The genomic region containing the old-acquaintance VIM-2-carrying integron is also present in other ST-111 strains. The architecture of the IMP-18-carrying integron and surrounding regions is shown with an arrangement that is reported here for the first time. Details in Chapter 2 (J.-A. Molina-Mora, Garcia-Batan, & Garcia, 2020).
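The split between core and accessory genome mentioned in the pan-genome analysis can be sketched as a set operation. This is a minimal illustration, assuming each genome has already been reduced to a set of gene-family identifiers and that "core" means present in every genome; the strain and gene names are hypothetical, and this is not the clustering pipeline used in Chapter 2.

```python
def split_pan_genome(gene_content):
    """Split a pan-genome into core and accessory gene families.

    gene_content: dict mapping strain name -> set of gene-family IDs.
    Returns (pan, core, accessory) as sets.
    """
    if not gene_content:
        raise ValueError("at least one genome is required")
    genomes = [set(genes) for genes in gene_content.values()]
    pan = set().union(*genomes)          # every family seen in any genome
    core = set.intersection(*genomes)    # families shared by all genomes
    accessory = pan - core               # families missing from at least one
    return pan, core, accessory


def strains_sharing(gene_content, family):
    """List the strains that carry a given gene family (e.g. an island gene)."""
    return sorted(s for s, genes in gene_content.items() if family in genes)


# Hypothetical presence/absence data for three strains.
content = {
    "strain_A": {"g1", "g2", "g3"},
    "strain_B": {"g1", "g2", "g4"},
    "strain_C": {"g1", "g3", "g4", "g5"},
}
pan, core, accessory = split_pan_genome(content)
print(sorted(core), sorted(accessory))
print(strains_sharing(content, "g4"))
```

Real pan-genome tools usually relax the strict "present in all genomes" rule with a prevalence threshold to tolerate assembly gaps; the strict intersection is used here only to keep the idea visible.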
The genomic region associated with the VIM-2-carrying integron (identified as an In59-like element, INTEGRALL-database http://integrall.bio.ua.pt/) was found in its entirety in the other two ST-111 strains, and it is therefore considered an old-acquaintance integron. In the case of the IMP-18-carrying integron, neither the integron architecture nor the surrounding genomic region had been reported before. The IMP-18-carrying integron was considered a new element and registered as the mobile element In1666.

Jointly, the chromosome assembly and the comparative genomics defined the molecular arsenal of PaeAG1 at the genomic level, including multiple genomic determinants of virulence, mobile elements, and antibiotic resistance genes.

On the other hand, in the context of antibiotic resistance, different assays have been performed in PaeAG1 to study its tolerance to antibiotics. Antibiotic susceptibility testing was reported before (Chinchilla, 2018; Toval et al., 2015), and differential expression of the MBLs has been tested in response not only to carbapenems, as demonstrated in Chapter 2, but also to other antibiotics (Chinchilla, 2018).

At the proteomic level, the protein content of PaeAG1 under exposure to antibiotics was investigated. Two-dimensional gel electrophoresis (2D-GE) analysis was implemented using different imaging and machine learning algorithms, as presented in Chapter 3. The pipeline to analyze 2D-GE images has also been used to study two PaeAG1 subclones, C25 and C50, as shown in "Two-Dimensional Gel Electrophoresis Image Analysis of Two Pseudomonas aeruginosa Clones" (José Arturo Molina-Mora, Chinchilla-Montero, Castro-Peña, & García, 2020).

Figure 5. Clustering analysis of the proteomic profiling of PaeAG1 exposed to the antibiotics ciprofloxacin (CIP), imipenem (IPM) and tobramycin (TOB). The proteomic profile after CIP exposure remains close to the control, unlike the profiles after TOB and IPM exposure.
Details in Chapter 3 (Jose Arturo Molina-Mora, Chinchilla-Montero, Castro-Peña, & Garcia, 2020).

Figure 6. Assessment of PaeAG1 growth curves after treatment with ciprofloxacin (CIP), imipenem (IPM) and tobramycin (TOB) using different concentrations. Concentration-dependent effects were evidenced for CIP but not for the other antibiotics (Jose Arturo Molina-Mora et al., 2020).

For PaeAG1, the results reveal that the global proteomic profile after exposure to a sub-inhibitory ciprofloxacin (CIP) concentration remains close to the control (LB medium, without antibiotics), in contrast with the results obtained with tobramycin and imipenem, as shown in Figure 5. This means that the effects of ciprofloxacin at the proteomic level are smaller than the changes caused by the other antibiotics. This finding is of particular interest when compared with the growth curves, which showed a concentration-dependent effect for PaeAG1 exposed to sub-inhibitory CIP concentrations, but not for sub-inhibitory concentrations of tobramycin (TOB) or imipenem (IPM) (Figure 6).

Thus, to investigate the association between PaeAG1 growth and sub-inhibitory CIP concentrations, two main transcriptomic analyses were performed: i) the identification of the core perturbome in the P. aeruginosa group and ii) transcriptomic profiling of PaeAG1 after exposure to CIP.

As detailed in Chapter 4, the study of the molecular response to diverse perturbations (including CIP), termed the perturbome, was carried out for P. aeruginosa with the reference strain. This makes it possible to generate the landscape of the central regulatory mechanisms of the stress response at the transcriptomic level in this bacterial group. Tolerance to stress conditions is vital for the survival of organisms, including bacteria under diverse environmental conditions (including antibiotics) (DeLong, 2012). Thus, to identify the core perturbome of P.
aeruginosa, a machine learning approach was implemented to recognize gene expression patterns among public transcriptomic data sets, similar to other studies (Cornforth et al., 2018; Glaab, Bacardit, Garibaldi, & Krasnogor, 2012; Ma, Xin, Feldmann, & Wang, 2014; Zhao et al., 2016). In this regard, only a few studies have used machine learning methods on biological data to describe the effects of multiple perturbations in complex biological systems (Bermingham et al., 2015; Caldera et al., 2019), and so far none in P. aeruginosa. In a subsequent analysis, the specific case of CIP exposure was used to standardize a systems biology pipeline to build large-scale molecular networks, as shown in Supplementary Material 2, "Gene Expression Dynamics Induced by Ciprofloxacin and Loss of LexA Function in Pseudomonas aeruginosa PAO1 Using Data Mining and Network Analysis" (J.A. Molina-Mora, Campos-Sanchez, & Garcia, 2018).

Figure 7. Distribution of the core perturbome of P. aeruginosa on a basal network of functional associations. Pleiotropic effects are revealed for core perturbome genes. The support indicates the number of algorithms that identified a gene as a relevant element of the perturbome. Details in Chapter 4 (J. Molina-Mora et al., 2020).

The analysis of the central molecular response to perturbations, by both machine learning and large-scale networks, showed that the stress response in P. aeruginosa is pleiotropic and composed of at least 118 genes, of which 46 have strong support. Specific effects on gene networks were reflected as changes in gene expression profiles and in the complexity of molecular regulation.

With the landscape of the core perturbome for P. aeruginosa identified, the study continued with the particular response to CIP in PaeAG1, as the final main step (Chapter 5).
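The notion of support described for the core perturbome, that is, the number of algorithms that identified a gene as relevant, can be sketched as a simple vote count. The algorithm names, gene names, and the threshold below are hypothetical placeholders; the actual feature-selection methods and the cut-off that defines "strong support" are those given in Chapter 4 and are not reproduced here.

```python
from collections import Counter


def gene_support(selections):
    """Count how many algorithms selected each gene.

    selections: dict mapping algorithm name -> iterable of selected genes.
    Each algorithm contributes at most one vote per gene.
    """
    support = Counter()
    for genes in selections.values():
        support.update(set(genes))
    return support


def filter_by_support(support, min_support):
    """Return genes selected by at least `min_support` algorithms, sorted."""
    return sorted(gene for gene, votes in support.items() if votes >= min_support)


# Hypothetical output of three feature-selection algorithms.
selections = {
    "algorithm_1": ["geneA", "geneB", "geneC"],
    "algorithm_2": ["geneA", "geneC"],
    "algorithm_3": ["geneA", "geneD", "geneD"],
}
support = gene_support(selections)
print(dict(support))
print(filter_by_support(support, min_support=2))
```

Raising `min_support` trades sensitivity for robustness: genes recovered by many independent algorithms are less likely to reflect the bias of any single method.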
Knowledge of the core perturbome was necessary to differentiate the pathways and responses that are shared with other perturbations and, more importantly, to identify the responses that are exclusive to CIP in PaeAG1. As detailed before, growth of this strain was reduced as sub-inhibitory CIP concentrations were increased. Thus, we identified the transcriptomic determinants associated with the response to CIP in PaeAG1. To address this, we used transcriptomic profiling by RNA sequencing and network analysis, applying a top-down systems biology approach.

In order to study in detail the performance of different approaches for transcriptomic data analysis, four different pipelines were assessed. Benchmarking of all pipelines was done using bioinformatics and biological criteria according to the genome analysis, phenotypes, and expert knowledge (Figure 8). The pipeline using EDGE-pro was selected as the best one based on several criteria, including body coverage and mapping. See Chapter 5 for details. With these pipelines, transcriptomic determinants were identified.

Figure 8. Benchmark of four pipelines for RNA-Seq data analysis to study PaeAG1 after CIP exposure. Pipelines using mapping to the genome or transcriptome with different quantification steps were implemented to identify differentially expressed genes in PaeAG1 after exposure to CIP.

Transcriptomic determinants included classical elements of the core perturbome of P. aeruginosa, with down-regulation of pathways related to energy metabolism, ribosomal activity, and DNA metabolism, most of them related to the reduction in bacterial growth. In addition, an exclusive feature, phage induction, was suggested by the up-regulation of phage genes, which form two well-defined clusters at the network level (Figure 9).

Figure 9. Large-scale network of differentially expressed genes of PaeAG1 after CIP exposure.
Multiple elements of virulence, phage, and pathways were found to be modulated by the antibiotic, revealing pleiotropic effects at the transcriptomic level. Details in Chapter 5 (Jose Arturo Molina-Mora et al., 2020).

To validate the effect of CIP on phage induction, we applied a phage plaque assay (at the phenomic level), which showed an exponential induction as the CIP concentration was increased. Since these phages are absent in the reference genome, the de novo genome assembly was, again, a critical step to obtain biological insights for PaeAG1.

Although PaeAG1 is resistant to CIP, a sub-inhibitory concentration of this antibiotic can induce a pleiotropic effect at the transcriptomic level, including pathways of the core perturbome and phage induction. In the latter case, the subsequent bacterial cell lysis explains the concentration-dependent reduction in the growth curve caused by CIP. This phenomenon is particular to CIP and was not found for imipenem or tobramycin, as shown in this study.

Phage induction by CIP can be used as a complementary strategy to fight pseudomonal infections. The fact that PaeAG1 phages are resident elements of the genome, and not exogenous elements as in other studies (Fothergill et al., 2011; Kamal & Dennis, 2015), represents an advantage for eventual further implementations. In the context of another study in our group, these results of phage induction were tested in an in vivo murine model (Morales-Berrocal, 2016). Very promising results were obtained with CIP injection after P. aeruginosa infection: mortality of infected mice was reduced from 70% to 30% and bacterial counts in organs dropped, while a significant increase in phage counts was evidenced (Figure 10). Specific details will eventually be presented as part of another work.
Future studies will also evaluate the modulation of the CIP response using genetic engineering (knock-out, knock-down, and the like), other omics approaches (proteomics, ChIP-Seq, etc.), and other in vivo models.

In summary, by using a multi-omics approach, it was possible to study molecular determinants of antibiotic tolerance in PaeAG1. Genome assembly using a benchmark strategy led to a high-quality sequence. A de novo approach allowed the assembly of around 1.0 Mb of sequence that is absent in the reference genome. These exclusive regions are composed of 57 genomic islands harboring two MBL-carrying integrons, phages, and many other genes. Comparison to all available complete sequences showed that the genomes could be grouped by MLST profile, including a clear ST-111 cluster containing PaeAG1. In addition, a landscape of the genomic regions surrounding the integrons was described, in which an IMP-18-carrying integron was characterized for the first time. The multi-resistance profile, the antibiotic resistance genes, the MLST profile, the clusters of the pan-genome analysis, and the architecture of the integrons define the genomic determinants of PaeAG1.

Figure 10. Preliminary results of the in vivo murine model to evaluate phage induction by CIP as a strategy to fight pseudomonal infections. Upon CIP treatment, the mortality of infected mice was reduced, together with a reduction in bacterial counts and an increase in phage counts in organs.

In order to study the central response to perturbations in the P. aeruginosa group (the core perturbome) and to identify gene expression patterns, we used a machine learning approach. Pathways of energy metabolism, ribosomal activity, DNA metabolism, and others were enriched. Similar enriched pathways were found for the specific case of PaeAG1 exposed to CIP, but particular genes (absent in the reference strain, such as phage genes) were also identified.
Phage induction upon CIP treatment, suggested by the up-regulation of phage genes, was validated at the phenomic level. Particular key genes, gene clusters, and pathways were recognized as transcriptomic determinants of antibiotic tolerance in PaeAG1.

Together, these genomic and transcriptomic elements are molecular determinants of antibiotic tolerance and resistance in PaeAG1, which in part define the high-risk condition of this strain that enables it to conquer nosocomial environments with a multi-resistance profile.

JUSTIFICATION

This research proposal aims to fill an information gap regarding the molecular determinants associated with tolerance and resistance to antibiotics at the genomic and transcriptomic levels in P. aeruginosa AG1. The initial studies determined that this bacterium has the characteristics of a high-risk clone, given its success in conquering nosocomial environments and its multi-resistance profile (including resistance to carbapenems). The latter allows PaeAG1 to be classified as a critical, Priority 1 organism according to the WHO.

Furthermore, because it was initially estimated that this bacterium contained an additional 1.0 Mb of DNA sequence relative to the reference genome P. aeruginosa PAO1, a multi-omics strategy was established to avoid losing genomic information. This is expected to be a crucial point due to the particular PaeAG1 features that differentiate it from the reference strain. To this end, a de novo genome assembly and subsequent comparative genomic analyses can identify the genomic determinants associated with tolerance to antibiotics.

After proteomic profiling using 2D-GE and comparison of the response to antibiotics, the definition of the central response to disturbances, or core perturbome, in the P. aeruginosa group at the transcriptomic level allows the identification of the metabolic pathways associated with the stress response. On account of the complexity and amount of data associated with this task, a machine learning strategy was required.
For the specific case of PaeAG1 exposed to CIP, differential expression analyses were performed with RNA sequencing, large-scale molecular network analysis, and experimental validation at the phenomic level. The particular genes, gene clusters, and metabolic pathways of the core perturbome in P. aeruginosa and of the response to ciprofloxacin in PaeAG1 constitute the transcriptomic determinants of antibiotic tolerance in this strain.

Taken together, the multi-omics approach (at the genomics, transcriptomics, perturbomics, proteomics, and phenomics levels), combined with sequence bioinformatics analyses, machine learning, and systems biology, provided the means required to identify and characterize the molecular determinants associated with tolerance to antibiotics in PaeAG1.

RESEARCH QUESTION

Which general genomic determinants, and which transcriptomic determinants associated with ciprofloxacin exposure, mediate tolerance to antibiotics in P. aeruginosa AG1?

HYPOTHESIS

Molecular determinants that define antibiotic tolerance in P. aeruginosa AG1 can be identified and characterized at the genomic and transcriptomic levels using a multi-omics approach.

RESEARCH OBJECTIVES

General objective

To identify and characterize the genomic and transcriptomic determinants associated with tolerance to antibiotics in Pseudomonas aeruginosa AG1 using a multi-omics approach.

Specific objectives

1. To assemble and annotate the P. aeruginosa AG1 genome using a benchmarking strategy, in order to characterize the gene content and genomic determinants associated with its multidrug-resistance and other phenotypes.
2. To compare the P. aeruginosa AG1 genome against other P. aeruginosa sequences using comparative genomics to describe the pan-genome, phylogenetic relationships, genomic island content, and architecture of the genomic regions associated with the VIM-2- and IMP-18-carrying integrons.
3. To identify genes associated with multiple perturbations in P.
aeruginosa to describe transcriptomic determinants of the central molecular response (perturbome) using a machine learning approach.
4. To identify transcriptomic determinants using RNA-Seq profiling and network analysis by a top-down systems biology approach to characterize the response to ciprofloxacin in P. aeruginosa AG1.

CHAPTER 1

High quality 3C de novo assembly and annotation of a multidrug resistant ST-111 Pseudomonas aeruginosa genome: Benchmark of hybrid and non-hybrid assemblers

Molina-Mora, J.-A., Campos-Sánchez, R., Rodríguez, C., Shi, L., & García, F. (2020). High quality 3C de novo assembly and annotation of a multidrug resistant ST-111 Pseudomonas aeruginosa genome: Benchmark of hybrid and non-hybrid assemblers. Scientific Reports, 10(1), 1392. https://doi.org/10.1038/s41598-020-58319-6
https://www.nature.com/articles/s41598-020-58319-6

Summary

Genotyping methods and genome sequencing are indispensable to reveal the genomic structure of bacterial species displaying a high level of genome plasticity. However, genome reconstruction, or assembly, is not straightforward due to data complexity, including repeats and the mobile and accessory genetic elements of bacterial genomes. Moreover, since the solution to this problem is strongly influenced by sequencing technology, bioinformatics pipelines, and the selection criteria used to assess assemblers, there is no systematic way to select a priori the optimal assembler and parameter settings. To assemble the genome of P. aeruginosa strain AG1, short reads (Illumina) and long reads (Oxford Nanopore) sequencing data were used in 13 different non-hybrid and hybrid approaches. PaeAG1 is a multiresistant high-risk sequence type 111 (ST-111) clone that was isolated from a Costa Rican hospital, and it was the first reported isolate of P. aeruginosa carrying both VIM-2 and IMP-18 genes encoding metallo-β-lactamase (MBL) enzymes.
To assess the assemblies, multiple metrics regarding contiguity, correctness and completeness (the 3C criterion, as we define here) were used to benchmark the 13 approaches and select a definitive assembly. In addition, annotation was done to identify genes (coding and RNA regions) and to describe the genomic content of PaeAG1. Whereas long reads and hybrid approaches showed better performance in terms of contiguity, higher correctness and completeness metrics were obtained for short reads only and hybrid approaches. A manually curated and polished hybrid assembly gave rise to a single circular sequence with 100% of core genes and known regions identified, >98% of reads mapped back, no gaps, and uniform coverage. The strategy followed to obtain this high-quality 3C assembly is detailed in the manuscript and we provide readers with an all-in-one script to replicate our results or to apply it to other troublesome cases.

The final 3C assembly revealed that the PaeAG1 genome has 7,190,208 bp, a 65.7% GC content and 6,709 genes (6,620 coding sequences), many of which are included in multiple mobile genomic elements, such as 57 genomic islands, six prophages, and two complete integrons with blaVIM-2 and blaIMP-18 MBL genes. Up to 250 and 60 of the predicted genes are anticipated to play a role in virulence (adherence, quorum sensing and secretion) or antibiotic resistance (β-lactamases, efflux pumps, etc.), respectively. Altogether, the assembly and annotation of the PaeAG1 genome provide new perspectives to continue studying the genomic diversity and gene content of this important human pathogen.

SCIENTIFIC REPORTS | (2020) 10:1392 | https://doi.org/10.1038/s41598-020-58319-6

José Arturo Molina-Mora (1)*, Rebeca Campos-Sánchez (2), César Rodríguez (1), Leming Shi (3) & Fernando García (1)

(1) Centro de Investigación en Enfermedades Tropicales, Facultad de Microbiología, Universidad de Costa Rica, San José, Costa Rica. (2) Centro de Investigación en Biología Celular y Molecular, Facultad de Microbiología, Universidad de Costa Rica, San José, Costa Rica. (3) Human Phenome Institute of Fudan University, Shanghai, China. *email: jose.molinamora@ucr.ac.cr

Genotyping methods and genome sequencing are indispensable to reveal the genomic structure and evolution of bacterial clones with high resolution [1]. In this sense, production of large amounts of short sequencing data from genomes (reads) has been facilitated by continuous advances in Next Generation Sequencing (NGS) technologies. This includes short read sequencing technologies (a few hundred bp read length) such as Illumina and long read sequencing technologies (several hundred kb read length) such as Pacific Biosciences (PacBio) single-molecule real-time (SMRT) and Oxford Nanopore Technology (ONT) [2]. Using sequencing data, it is expected that full-length chromosomes can be produced when the genome is fully sequenced and assembled [3].
However, genome reconstruction, or assembly, is not straightforward due to data complexity. This is a challenging problem that requires time and expertise [4]. If a reference genome is available, an assembly can be made by comparison or direct mapping; otherwise, a de novo assembly is required, in which only the information obtained from reads is used to reconstruct the genome, without prior knowledge of its organization [5]. In de novo assembly, sequences (reads) are grouped into contigs using graph-based algorithms such as Overlap-Layout-Consensus, de Bruijn and greedy approaches [5,6]. Contigs are then assembled into scaffolds (supercontigs or metacontigs). Alternatively, some de novo assemblers use reference genomes to resolve specific inconsistencies or for scaffolding [5]. Reconstruction can be favored by prior information, such as expected genome size, GC content and repetitive region content, as it helps to choose the best strategy to follow.

Even though many algorithms for de novo genome assembly are available, performance is completely dependent on the data (short or long reads, instruments, technology), genomic complexity (repeats, number of chromosomes or plasmids) and complementary algorithms (pre-processing, databases, annotations, etc.) [7]. Therefore, for a specific genome and dataset, selecting the optimal assembly strategy is not a trivial task because there is no systematic way to determine which assembler and which parameter settings must be selected [8].

Since a key first requirement in the study of genomes is accuracy [9], short reads technologies are preferred because they produce high-fidelity reads [10]. Also, the low cost and high accuracy of Illumina sequencing make it well suited to high-throughput bacterial genomics [10]. However, genomes present complex repeat structures that are difficult for assemblers to resolve. As reported, if the repeats are longer than the reads, genomic regions sharing perfect repeats can be indistinguishable [6].
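The repeat problem described above can be illustrated with a toy de Bruijn graph. The following Python sketch is illustrative only and is not part of the published pipeline: the sequence, the read lengths and the helper names (`sliding_reads`, `de_bruijn_graph`, `branching_nodes`) are hypothetical, and reads are assumed to be error-free. It builds a graph whose nodes are (k-1)-mers and reports the nodes with more than one successor, which is where an assembler cannot decide how to extend a contig.

```python
from collections import defaultdict


def sliding_reads(genome, read_length):
    """Error-free reads starting at every position (an idealized assumption)."""
    return [genome[i:i + read_length]
            for i in range(len(genome) - read_length + 1)]


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Nodes with more than one successor, i.e. ambiguous extensions."""
    return sorted(node for node, nxt in graph.items() if len(nxt) > 1)


# Hypothetical toy sequence: the 5-bp repeat ACGAT occurs twice,
# followed by T the first time and by C the second time.
genome = "TACGATTGCACGATCAGG"

# Reads of 6 bp with k = 4: nodes (3 bp) are shorter than the repeat.
short_graph = de_bruijn_graph(sliding_reads(genome, 6), k=4)
print(branching_nodes(short_graph))  # ['GAT']

# Reads of 10 bp with k = 8: nodes (7 bp) span the whole repeat.
long_graph = de_bruijn_graph(sliding_reads(genome, 10), k=8)
print(branching_nodes(long_graph))   # []
```

With 6-bp reads the graph forks at GAT because both copies of the repeat look identical; once the reads (and k) exceed the repeat length, every node has a single successor and the toy sequence can be traversed unambiguously.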
With this, resolving a full genome is a challenging issue for short reads approaches. Consequently, most available bacterial genomes are incomplete [11], highly fragmented, and of low quality [3].

Long reads, by contrast, can exceed the length of repeats in a typical bacterial genome, facilitating genome assembly [10]. Long reads technology offers an important advantage for complex genomes with a high level of repetitive elements or genome duplication [7]. Thus, the use of long reads data has improved de novo genome assemblies by raising contiguity, resolving fragmented regions, and closing gaps [12]. However, these third-generation sequencing methods have a relatively high sequencing error rate [8], estimated at up to 15% and comprising random as well as systematic errors [10,12]. In addition, long reads sequencing has a higher cost per base than Illumina platforms [11].

Combining reads of different lengths and from different sequencing platforms in so-called hybrid approaches often counterbalances the drawbacks of each method [4]. The growing interest in hybrid assemblies is justified by the popularity, cost and accuracy of short reads sequencing, plus the capacity of long reads to resolve repetitive regions and genomic structures [10]. In some cases, a hybrid approach is sufficient to produce a single, closed sequence of the microbial genome [13]. However, to accurately assemble a genome, neither the optimum combination and coverage of long and short reads nor the minimum required length of long reads is known a priori [9].

Because of this, hybrid and non-hybrid assemblies must be evaluated individually in order to select the best assembly conditions, and different metrics and tools are available for this purpose. However, no single strategy is considered universal and sufficient to benchmark assemblies [3,14].
Benchmarking of assemblies can be achieved using metrics related to contigs and scaffolds (contiguity), the ability to complete the whole structure of the genome (completeness), and the accuracy of the assembly (correctness). Although most studies of assemblies exploit these parameters to evaluate the performance of assemblers [3,8,10,15–17], here we define the general assessment by the “3C criterion” as all metrics required to evaluate and benchmark genome assemblies using contiguity, completeness and correctness metrics, as detailed:

Contiguity: it evaluates the assembly in terms of the number and size of contigs and scaffolds [6], the pieces found in an assembly. Metrics include statistics related to maximum length, average length, combined total length, and contig N50 (length-weighted median of ordered contigs or scaffolds) [2]. However, contiguity metrics need to be interpreted with caution because they do not contain information on assembly accuracy and completeness [4].

Correctness: it refers to how accurately those pieces represent the sequenced genome [16]; it is generally accepted that correctness should be prioritized over contiguity [12]. However, correctness is difficult to evaluate if a preliminary reference genome is not available, which is a particular problem for de novo assembly [6]. Mapping and comparison to a reference or draft genome (or a consensus sequence) can be used to detect misassemblies, including mismatches, indels, and misjoins [8].

Completeness: it assesses how much of the genome is represented by the pieces of the assembly [16]. This implies evaluating the ability to assemble not only all the genes, but also to resolve all complicated regions, including repetitive sequences and, if expected, circularization of the genome. The most important metric for this case is the “completeness score”, calculated by examining conserved single-copy ortholog genes [18].
In addition, information from known sequences, unexpected variations in coverage, and remapping of reads allow the consistency of the genome to be analyzed and potentially poorly assembled regions to be identified [5,19].

Thus, to develop a strategy to assemble a bacterial genome using non-hybrid and hybrid approaches as well as the 3C criterion, we used an ST-111 strain of Pseudomonas aeruginosa. P. aeruginosa is a Gram-negative bacterium and a well-known opportunistic pathogen [20]. It is responsible for acute and chronic nosocomial and community infections in immune-compromised patients [21]. However, the treatment of P. aeruginosa infections is challenging due to many intrinsic and acquired mechanisms of resistance [22], including the production of β-lactamases, antibiotic-modifying enzymes and target alteration.

Multi-resistance in P. aeruginosa is becoming increasingly serious, not only due to resistance to classical β-lactams, aminoglycosides and fluoroquinolones, but also due to resistance to last-resort treatments including carbapenems (β-lactams) and colistin, which causes great difficulties in clinical treatment [23,24]; resistance to these antibiotics emerges as a last line of bacterial defense that compromises the treatment of infections [24]. Many bacterial clones with carbapenemase-producing features are recognized as high-risk clones [25]. A high-risk clone is a multidrug-resistant clone with highly efficient transmission and/or maintenance among humans or animals [26], playing a major role in the spread of resistance in hospitals and other environments [27], and with a flexible ability to accumulate and switch resistance [28]. However, the term high-risk is not necessarily associated with severity [26].
A limited number of Pseudomonas aeruginosa genotypes (mainly ST-111, ST-175, and ST-235) are recognized as high-risk clones, and they are responsible for epidemics of nosocomial infections by multidrug-resistant or extensively drug-resistant strains worldwide [29]. In Costa Rica, isolation of carbapenem-resistant P. aeruginosa strains is relatively common in some major hospitals, as we reported before [22], most of them carrying one blaVIM and one blaIMP carbapenemase allele, with a prevalence of up to 63.1% [22], much higher than the frequencies observed in other countries [30].

The Costa Rican multi-resistant strain P. aeruginosa AG1 (PaeAG1) was isolated from a sputum sample of a patient with pneumonia from the Intensive Care Unit of the San Juan de Dios Hospital (San José, Costa Rica) in 2010. PaeAG1 has a resistance phenotype to β-lactams (including carbapenems), aminoglycosides and fluoroquinolones, showing susceptibility only to colistin. In addition, PaeAG1 was identified as the first report worldwide of a strain carrying both blaIMP-18 (or IMP-18) and blaVIM-2 (VIM-2) genes, coding for metallo-β-lactamases (MBL) with carbapenemase activity [22]. PaeAG1 is a high-risk clone with an ST-111 genotyping profile, which includes strains with a phenotype of extreme resistance to antibiotics that are responsible for various types of infections in hospitals and spread rapidly between individuals [29,31]. Sanger sequencing confirmed that the blaVIM-2 and blaIMP-18 genes of strain AG1 (Accessions KC907377 and KC907378) are encoded in class 1 integrons, likely in two different structures [22]. In addition, preliminary experimental assays suggested that no plasmids are present [22].

We were interested in assembling and annotating the genome of the clinical isolate PaeAG1 due to its importance as a high-risk clone with multi-resistance to antibiotics and to identify molecular determinants related to the ability to conquer nosocomial environments, virulence and other phenotypes.
Thus, the aims of our study were: (i) to assemble the PaeAG1 genome using short and long reads data by multiple hybrid and non-hybrid approaches, (ii) to benchmark assemblers and select the best genome assembly approach using the 3C criterion, and (iii) to annotate the PaeAG1 genome to characterize and identify general gene content and genomic determinants associated with its multidrug-resistance and virulence phenotypes.

Methods

The general pipeline followed to assemble the PaeAG1 genome by hybrid and non-hybrid approaches is shown in Fig. 1. Complete details of the settings of the implemented algorithms are shown in the supplementary material “Scripts for bioinformatics analysis”.

Bacterial isolate. The Costa Rican PaeAG1 strain was isolated in 2010 from a sputum sample of a patient with pneumonia from the Intensive Care Unit of the San Juan de Dios Hospital (San José, Costa Rica). This isolate has phenotypic resistance (AST-GN cards, bioMérieux Vitek) to β-lactams, aminoglycosides and fluoroquinolones, shows susceptibility only to colistin and expresses metallo-β-lactamase activity (E-test MBL strips, AB Biodisk), as reported [22].

Bacterial growth and DNA isolation. PaeAG1 cells were grown overnight in Luria-Bertani broth (LB) medium at 37 °C with shaking. Then, cells were collected by centrifugation and genomic DNA was isolated with the QIAGEN DNeasy Kit (QIAGEN, UK) following the manufacturer’s instructions. The yield of genomic DNA obtained was determined using a Nanodrop (Nanodrop 2000, Thermo Scientific, UK) and by Qubit Fluorometric Quantitation (Qubit 3.0 Fluorometer, Thermo Scientific). DNA integrity was verified by electrophoresis using 0.7% agarose gels.

Whole genome sequencing using short reads. Genomic DNA was sequenced using Illumina technology (Illumina Inc.) at Macrogen, Korea. The sequencing library was prepared using the TruSeq DNA Sample Prep kit with the standard Illumina DNA shotgun library preparation protocol.
DNA fragmentation was achieved by ultrasonication, followed by adapter ligation and PCR enrichment. Paired-end reads of 101 bp were generated using a HiSeq 2000 sequencing instrument. Sequence files were evaluated using FastQC v0.11.7 [32] before and after trimming. Reads were trimmed (including adapter removal) using Trimmomatic v0.38 [33] to discard sequences with a per-base sequence quality score <30. After selection, 7.4 Gb of sequences were kept, with 14 million read pairs and a mean coverage >400X according to the expected genome size (approx. 7 Mb).

Whole genome sequencing using long reads. Genomic DNA was sequenced to obtain long reads using Oxford Nanopore technology at NextOmics (Wuhan, China). Sequencing libraries were prepared according to the ONT 1D ligation library protocol SQK-LSK109. A FLO-MIN-106 flowcell and the standard 48-hour run script with active channel selection enabled were used to sequence reads in a GridION instrument. Poretools v0.6.0 [34] was used to extract and evaluate reads by quality before and after trimming. Adapters were removed using Porechop v0.2.3 (github.com/rrwick/Porechop) and trimming was done using Filtlong v0.2.0 (github.com/rrwick/Filtlong). Reads with mean quality weight <10 and/or shorter than 1 kb were discarded. The final dataset consisted of 4.5 Gb of sequence, with 259,491 reads in total, a mean read length of 17,343 bp, a longest read of 201,659 bp, and a final mean coverage >560X.

Figure 1. General bioinformatic pipeline to assemble, compare and annotate the Pseudomonas aeruginosa AG1 genome using short and long reads as well as hybrid approaches.

Short reads genome assembly. Six de Bruijn graph based assemblers were used with default parameters and without the reference-guided option, if applicable.
The classical assemblers included in the study were Velvet v1.2.10 [35], SPAdes v3.13.0 [36], IDBA v1.1.3 [37], and Megahit v1.1.3 [38]. Two newer assemblers were also included: SKESA v2.3.0 [39] and Unicycler v0.4.7 [11]. To estimate the best k-mer length for de novo genome assembly with Velvet, KmerGenie 1.7051 was used [40]. Other algorithms selected the best k-mer length values automatically, if needed. Assembly sequences were kept at contig level with a minimum size of 1,000 bp.

Long reads genome assembly. Three graph-based long read assemblers were used: Canu 1.8 [41], Flye 2.3.7 [42] and Unicycler v0.4.7 [11]. Default parameters were used, and neither a reference genome nor alternative sequencing data were considered. Only contigs longer than 1,000 bp were kept.

Hybrid genome assembly. Three graph-based hybrid approaches were applied. Default parameters without a reference sequence were used to run IDBA-hyb v1.1.1 (https://github.com/loneknightpy/idba), Unicycler v0.4.7 [11] and SPAdes v3.13.0 [43]. Only contigs longer than 1,000 bp were kept.

Scaffolding. Prior to the final version of the genome assembly of PaeAG1, BLASTn (blast.ncbi.nlm.nih.gov/Blast.cgi) was used to search for the closest genome according to contig sequences. All assemblies at contig level were assembled into scaffolds with MeDuSa v1.6 [44], using the closest genome as the reference sequence (P. aeruginosa strain RIVM-EMC2982, more details in Results). When the final version was achieved, scaffolding and benchmarking were done using the definitive version of the PaeAG1 genome with the same scaffolder.

3C Benchmark of approaches and selection of best assembly. Benchmarking of all assemblers was done according to the 3C criterion, as follows:

Contiguity. Genome assembly statistics about quality and contiguity were assessed using QUAST 5.0.1 [14] at both contig and scaffold levels.
Assembler outputs were compared with regard to total assembly length (expected: around 7 Mb), number of contigs/scaffolds (one sequence expected), N50 (expected: as large as possible, close to genome size), NG50 (as large as possible), and others.

Completeness. Four strategies were implemented to assess completeness. First, single-copy ortholog gene sets were searched (expected: 100%) in the assemblies using the BUSCO tool [45] within the gVolante platform (https://gvolante.riken.jp) [18], comparing gene content against 40 genes of the bacteria database (available at https://busco.ezlab.org/v1/). Second, we checked the ability of the assemblers to reproduce the complete sequences of the two class 1 integrons of PaeAG1 previously obtained by Sanger sequencing (KC907377 and KC907378). The third analysis used Circlator [19] to assess the replicon circularization achieved by assemblers that gave rise to single sequences (expected: a circular sequence). The last approach calculated the percentage of genomic and transcriptomic reads mapping to each genome reconstruction (expected: >95% mapping). To this end, short and long reads were remapped to the assemblies using BWA 0.7.17 [46]. In addition, 12 read files from an RNA-Seq experiment (triplicates of the same strain under four experimental conditions with or without ciprofloxacin) were mapped to the assemblies using HISAT2 v2.1.0 [47]. Qualimap v2.2.2 [48] was used to calculate coverage and percentage of mapped reads, and the comparison was compiled in a single report using MultiQC v1.7 [49].

Correctness. Two strategies were used to evaluate correctness. The first one was to estimate error rates, check for uniform coverage, and detect false variants of short reads that mapped to the polished genome (see below, expected: 0% errors). This was done using Qualimap results.
The second strategy was to calculate the percentage of identity of local alignments between known Sanger sequences (integrons, expected: 100% identity) of PaeAG1 and the final assembly (BLASTn).

All the above criteria were considered to select the best assembly. This draft genome was polished and curated (next section) and the new version was included as an extra 13th assembly. We used all quantitative data to run a Principal Components Analysis (PCA), which was implemented in R software v3.5.1 (www.r-project.org/) using the caret package (caret.r-forge.r-project.org/). This allowed the global profiles and performance of the assemblers to be compared. The final version of the genome assembly was also included as an independent unit. De novo assembly graphs were visualized using Bandage v0.8.1 [50]. Finally, assembled sequences were visualized and compared against the final assembly using the BLAST Ring Image Generator (BRIG) tool v0.95 [51].

Curation and polishing of the definitive genome assembly. Final adjustments of the selected genome assembly were made manually based on the assembly graph, read coverage and distribution. Pilon 1.23 [52] with BWA-mapped reads was used to automatically polish the assemblies. After this, a final polished assembly was obtained. Remapping of short and long reads, as well as all metric calculations and 3C criterion evaluations, were done again.

Comparative genome analysis. BLASTn of the complete sequence was run again to find the closest genome, which, together with the genome of the reference strain P. aeruginosa PAO1, was compared using Mauve v2.4.0 [53] to determine the level of synteny and to describe the global genomic structure. Also, in order to compare the PaeAG1 genome with other ST-111 strains, a phylogenetic analysis was done using all the available complete sequences of ST-111 P. aeruginosa genomes. The reference strain P. aeruginosa PAO1 was also included.
All the records were retrieved from the Pseudomonas Genomes Database (PGDB, pseudomonas.com), and Roary v3.12.0 [54] was run with default parameters to establish relationships between strains using gene content by a pan-genome analysis. Scripts supplied with the program were used to create plots.

Whole genome annotation. For all assemblies, gene prediction and gene annotation were performed using Prokka v1.13.3 [55] and a custom database created with the genome of P. aeruginosa PAO1 and the closest annotated strain to PaeAG1 as primary sources for annotation, or the default bacterial database provided with the software distribution. Also, Clusters of Orthologous Groups (COG), Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways were searched using EggNOG (http://eggnogdb.embl.de/) [56] for all coding sequences (CDS).

Specific genome annotation. Specific annotation and searching for specific genomic determinants were only done for the definitive final assembly. Default parameters were used in all cases. In silico serotyping was done using PAst v1.0 (https://cge.cbs.dtu.dk/services/PAst-1.0/) and multilocus sequence typing [57] using MLST v2.0 (https://cge.cbs.dtu.dk/services/MLST/). Antimicrobial resistance genes were detected using the RGI tool v5.1.0 (Resistance Gene Identifier, https://card.mcmaster.ca/analyze/rgi) and ResFinder v3.2 (https://cge.cbs.dtu.dk/services/ResFinder/). CRISPR-Cas arrays were investigated using CRISPRCasFinder v1.1.2 (https://crisprcas.i2bc.paris-saclay.fr/CrisprCasFinder/Index). The virulome was identified using the Virulence Factor DataBase (VFDB, http://www.mgc.ac.cn/VFs/).

For mobilome delimitation, genomic islands were identified using IslandViewer v4 (www.pathogenomics.sfu.ca/islandviewer/). PHASTER was used to find prophages (phaster.ca/) [58] and integrons were searched using IntegronFinder v2.0 [59]. The results of this series of searches were visualized in the genome using BRIG.

Table 1. Comparison of contiguity and annotation of P. aeruginosa AG1 genome assembly by different approaches*. SR: short reads only approaches; LR: long reads only approaches; Hyb: hybrid approaches.

| 3C criterion and level | Metric | Velvet (SR) | SPAdes (SR) | IDBA (SR) | Megahit (SR) | SKESA (SR) | Unicycler (SR) | Canu (LR) | Flye (LR) | Unicycler (LR) | IDBA (Hyb) | SPAdes (Hyb) | Unicycler (Hyb) | Final assembly |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Contiguity, contigs assembly | Contigs | 227 | 89 | 127 | 125 | 217 | 113 | 2 | 1 | 5 | 121 | 16 | 1 | 1 |
| | Total length | 7027785 | 7094145 | 7090598 | 7103650 | 7047434 | 7074438 | 7121028 | 7209472 | 7465726 | 7092836 | 7188777 | 7189601 | 7190208 |
| | GC (%) | 65.79 | 65.73 | 65.74 | 65.73 | 65.77 | 65.77 | 65.66 | 65.59 | 65.64 | 65.74 | 65.68 | 65.71 | 65.71 |
| | N50 | 65258 | 223421 | 170948 | 168521 | 68375 | 151417 | 4329427 | 7209472 | 7178173 | 141288 | 1593634 | 7189601 | 7190208 |
| | L50 | 33 | 11 | 14 | 14 | 34 | 15 | 1 | 1 | 1 | 15 | 2 | 1 | 1 |
| Contiguity, scaffolding | Scaffolds | 1 | 10 | 10 | 10 | 2 | 1 | 1 | 1 | 1 | 10 | 10 | 1 | 1 |
| | N50 & NG50 | 7039385 | 7078855 | 7079244 | 7091835 | 7056837 | 7080238 | 7121028 | 7209472 | 7465826 | 7082290 | 7171429 | 7189601 | 7190208 |
| | Genome fraction (%) | 97.714 | 98.362 | 98.293 | 98.484 | 98.054 | 98.382 | 99.381 | 99.991 | 100 | 98.356 | 99.717 | 99.992 | 100 |
| | NA50 | 177145 | 375326 | 491929 | 478607 | 708585 | 709611 | 4328063 | 7207242 | 7177177 | 477586 | 3956502 | 7189601 | 7190208 |
| | LA50 | 12 | 6 | 5 | 5 | 4 | 4 | 1 | 1 | 1 | 4 | 1 | 1 | 1 |
| | N's per 100 kbp | 217.06 | 52.13 | 77.51 | 75.96 | 151.6 | 81.92 | 0 | 0 | 1.34 | 74.67 | 5.56 | 0 | 0 |
| Correctness, scaffolding | Misassemblies | 81 | 22 | 37 | 33 | 24 | 19 | 1 | 0 | 4 | 26 | 2 | 0 | 0 |
| | Unaligned mis. contigs | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| | Mismatches per 100 kbp | 6.56 | 2.42 | 4.88 | 1.61 | 1.84 | 0.48 | 35.94 | 28.01 | 101.21 | 3.68 | 11.33 | 0.07 | 0 |
| | Indels per 100 kbp | 6.49 | 0.41 | 0.67 | 0.28 | 1.79 | 0.34 | 324.66 | 284.54 | 186.53 | 1 | 1.14 | 0 | 0 |
| Completeness, 40 core genes (BUSCO) | Fragmented genes | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 9 | 9 | 0 | 0 | 0 | 0 |
| | Intact genes | 40 | 40 | 40 | 40 | 40 | 40 | 20 | 13 | 23 | 40 | 40 | 40 | 40 |
| | Lost genes | 0 | 0 | 0 | 0 | 0 | 0 | 16 | 18 | 8 | 0 | 0 | 0 | 0 |
| | Completeness score (strict, %) | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 32.5 | 57.7 | 100 | 100 | 100 | 100 |
| Whole genome annotation | CDS | 6574 | 6554 | 6543 | 6565 | 6540 | 6567 | 11229 | 9565 | 9089 | 6559 | 6605 | 6621 | 6620 |
| | Contigs | 1 | 10 | 10 | 10 | 2 | 1 | 1 | 1 | 1 | 10 | 10 | 1 | 1 |
| | rRNA | 2 | 5 | 5 | 5 | 3 | 3 | 12 | 12 | 12 | 4 | 14 | 12 | 12 |
| | tmRNA | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 |
| | tRNA | 70 | 62 | 69 | 70 | 61 | 70 | 72 | 65 | 75 | 69 | 76 | 76 | 76 |
| | Mean length of CDS (bp) | 938.34 | 957.54 | 956.28 | 954.9 | 950.19 | 953.49 | 499.35 | 607.14 | 664.14 | 955.11 | 963.51 | 961.89 | 961.86 |
| Completeness & correctness, integron blaVIM-2 | Identity (%) | 100.0 | 99.5 | 99.8 | 100.0 | 100.0 | 99.7 | 99.488 | 99.257 | 99.843 | 99.753 | 99.778 | 99.778 | 100 |
| | Coverage | 0.5 | 0.7 | 0.6 | 0.4 | 0.5 | 0.6 | 1.0 | 1.0 | 1.0 | 0.6 | 0.9 | 0.9 | 1.0 |
| Completeness & correctness, integron blaIMP-18 | Identity (%) | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 99.515 | 98.744 | 99.728 | 100 | 100 | 100 | 100 |
| | Coverage | 0.6 | 0.9 | 0.9 | 0.9 | 0.8 | 0.8 | 1.0 | 1.0 | 1.0 | 0.8 | 1.0 | 1.0 | 1.0 |

*In the published table, best and worst values for some metrics are marked in bold or italics, respectively.

Results

In order to assemble the genome of P.
aeruginosa AG1, an exhaustive workflow was implemented using hybrid and non-hybrid approaches, with Illumina short reads and Oxford Nanopore long reads sequencing data. The general protocol is presented in Fig. 1. After sequencing and four bioinformatic steps, a single circular sequence was obtained and annotated.

Benchmarking of hybrid and non-hybrid assemblers: a winner?

The PaeAG1 genome assemblies obtained with the different approaches were evaluated using the 3C criterion. The final version, curated and polished, is presented as the last case. The contiguity and completeness criteria were initially the most important for the selection of the draft assembly; a final polishing strategy then focused on ensuring correctness (see next section). A summary of the most important metrics related to these criteria is presented in Table 1. Metrics related to scaffolding were obtained using the final assembly as reference, although various attempts to create scaffolds were made with closely related genomes.

Regarding contiguity, the short reads only approaches showed a lower performance (89 to 227 contigs and 1–10 scaffolds) than the approaches that exploit long reads (1 to 5 contigs and one scaffold in all cases) or the hybrid methods (1–121 contigs and 1–10 scaffolds). Performance profiles of the assemblers are compared in Table 1 and Fig. 2. Short reads assemblies are similar to each other according to Table 1 and the PCA (Fig. 2a). The long reads approaches, hybrid or not, also performed similarly to each other at this contiguity level. Differences depending on technology and assembly strategy are recognized in the metrics and global profiles in the PCA, in the gaps in the assembly and in the assembly graphs (Fig. 2).

Figure 2. General comparison of P. aeruginosa AG1 genome assemblies. (a) Relationship between different assemblers by PCA using contiguity and annotation features. (b) Completeness evaluation and comparison for all different approaches using the final assembly as reference. (c) De novo assembly graph of three different approaches by short reads, long reads or hybrid assemblers. More details in Supplementary Fig. S1.

Only two assemblers generated a single contig: one is a long reads only approach (Flye) and the other is a hybrid assembler (Unicycler). The hybrid assembler IDBA obtained metrics equivalent to those of its short reads only mode (127 contigs for short reads only and 121 contigs for the hybrid approach), and also similar to Megahit (125 contigs and other metrics). Velvet and SKESA had the highest contig counts, 227 and 217 respectively. The total assembly length was similar among the 13 assemblies (7.0–7.2 Mb in all cases, except for long reads only Unicycler with 7.47 Mb), while the N50 value tended to be much shorter for short reads assemblies (65–223 kb) than for long reads assemblies (4.3–7.2 Mb). However, at the scaffold level N50 values were comparable among all cases (>7.0 Mb). At this same level, all assemblies covered virtually the entire final genome, although the lowest values were obtained for short reads only approaches (>97%).

As to correctness, long reads only assemblies were linked to high rates of mismatches (28–101 per 100 kb) and indels (186–324 per 100 kb), which were not solved by subsequent polishing steps (as in Unicycler). Better values were obtained for the approaches using short reads, hybrid (0.07–11.3 mismatches and 0–1.14 indels per 100 kb) or not (0.48–6.6 mismatches and 0.3–6.5 indels per 100 kb).
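The contiguity and correctness figures quoted above (N50, L50, and rates per 100 kbp) follow from simple definitions. The following is a minimal Python sketch of those definitions, not the evaluation software used in the study; the function names and the toy contig lengths are assumptions introduced only for illustration, while the 5 mismatches over 7,190,208 bp come from Table 1 and the text.

```python
from typing import List, Tuple


def n50_l50(contig_lengths: List[int]) -> Tuple[int, int]:
    """Return (N50, L50) for a list of contig lengths.

    N50 is the length of the contig at which the running sum of lengths,
    taken from longest to shortest, first reaches half of the total
    assembly length. L50 is the number of contigs needed to get there.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise ValueError("contig_lengths must not be empty")


def rate_per_100_kbp(events: int, aligned_bases: int) -> float:
    """Normalize a count of mismatches or indels to events per 100 kbp."""
    return events / aligned_bases * 100_000


# Hypothetical toy assembly of six contigs (not data from the study).
toy_n50, toy_l50 = n50_l50([80, 70, 50, 40, 30, 20])

# A single circular contig, as in the final assembly: N50 equals its length.
final_n50, final_l50 = n50_l50([7_190_208])

# Five corrected bases over the final genome length, as reported in Table 1.
final_mismatch_rate = round(rate_per_100_kbp(5, 7_190_208), 2)  # 0.07
```

For a fragmented assembly, a small N50 with a large L50 indicates that many short contigs are needed to cover half of the genome, which is the pattern seen for the short reads only approaches in Table 1.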
In addition, although long reads only assemblies generated sequences of approximately the same length as the other approaches, their annotations revealed high CDS numbers (9,089–11,229, in contrast with the approximately 6,540–6,620 for short reads and hybrid approaches). Specific analysis of the sequences showed a low mean CDS length (approximately 500–660 bp) for the long reads only assemblers compared to the short reads only or hybrid cases (mean of about 955 bp, which is an expected value for PaeAG1), suggesting fragmentation of CDS in the long reads assemblies.

Evaluation of 40 core genes using the BUSCO tool and the completeness score showed a 100% performance for short reads only and hybrid assemblers. However, in long reads only approaches only 13 to 23 intact core genes were identified (32.5–57.7%).

Regarding the PaeAG1 integron sequences obtained by Sanger sequencing (lengths greater than 2,500 bp and 3,000 bp), the short reads only assemblies had low coverage (0.4–0.9), specifically in regions with repetitions. On the other hand, the long reads assemblies had the best performance (1.0 in all cases), and the use of long reads in the hybrid approaches improved the assembly of these repetitive zones (0.9–1.0 for all cases, except IDBA with 0.6–0.8).

Using all this information, the global profiles of the assemblies were compared using a PCA. The full table used for the PCA and the component values are provided in the “Supplementary Material PCA data”. As presented in Fig. 2a, there is a separation between the profiles of the short reads only assemblies (green color) and the others, creating two clusters. Also, the unpolished and polished Unicycler assemblies remained close to each other, as might be expected.

Enhancing the winner: polishing of hybrid Unicycler assembly
The assembly obtained directly from the hybrid Unicycler approach was selected as the winner because it best fulfilled the 3C criterion, and it was used for downstream analyses. However, a review of the assembly was required given: (i) the missing coverage for one of the known integron sequences (Table 1) and (ii) the presence of a zone with an irregular, non-uniform distribution in the remapping of long reads (Supplementary Fig. S1a, left). For this reason, manual curation was required.

Curation was carried out with the help of the known sequences of the integrons, the assembly graphs, and the long reads only assemblies (because long reads could assemble that region). A detailed explanation of the curation is provided in the “Supplementary Material Manual curation” file, including a graphical representation. After curation, a final polishing step using short reads was carried out to guarantee completeness. Only 5 bases were modified, which is reflected in the mismatch rate of the hybrid Unicycler assembly: (5 / 7,190,208) × 100,000 ≈ 0.07 per 100 kbp (Table 1). When reads were remapped, regular and uniform coverage was detected, even in the conflictive zone (Supplementary Fig. S1a, right). Furthermore, the known integron sequences showed complete identity and coverage (Table 1, last column).

With this improved version of the assembly, in addition to the PCA comparison, all assemblies were aligned against the final assembly to highlight the regions that were problematic to assemble. As shown in Fig. 2b, some gaps were evident in all assemblies derived from short reads only, and these gaps were not always resolved by the hybrid approaches. However, for most assemblers, the use of long reads (alone or in hybrid mode) improved those regions. The benchmark of all assemblers in a specific conflictive region is presented in Supplementary Fig. S1b. The assembly graphs of three cases are presented in Fig.
2c, showing the variable ability of assemblers to solve the de novo assembly problem.

3C assessment of PaeAG1 final genome assembly

To assess the final assembly of the PaeAG1 genome, the 3C criterion was re-evaluated:

Contiguity. The final assembly was built with hybrid Unicycler, with curation and polishing steps, but without the need for a reference genome. Full contiguity was achieved: a single circular sequence was obtained.

Completeness. With all the elements evaluated, completeness is considered maximal. This includes circularization of the sequence, 100% identity and coverage of the known integron sequences, and a 100% completeness score for the 40 expected genes (single-copy ortholog set). Regarding the remapping of genomic reads, 99.85% of the short reads were mapped, with an average coverage of 403X (see coverage graph in Supplementary Fig. S1c, left). For long reads, 97.81% were mapped to the genome, with an average coverage of 560X (Supplementary Fig. S1c, right). Additional RNA-Seq data from the same PaeAG1 strain achieved a mapping of 98.6% of the reads.

Correctness. The polishing rounds that Unicycler includes and the additional polishing with short reads after curation guarantee the maximum accuracy of the genome assembly.

Thus, the circular assembled genome was built according to the 3C criterion: high contiguity, completeness and correctness were achieved.

Annotation of PaeAG1 genome

The PaeAG1 genome is composed of a single circular sequence of 7,190,208 bp, with 65.71% GC content (Fig. 3a). A total of 6,620 CDS, 12 rRNA, 76 tRNA and 1 tmRNA (6,709 genes in total) were determined (Table 1). In addition, when orthologous groups and functional annotation were analyzed, 2,197 genes were associated with Gene Ontology terms, 5,537 with defined COGs, and 3,060 with KEGG. As shown in Fig. 3b, specific annotation of different genomic determinants was done, including antibiotic resistance genes, mobilome, virulence factors and others.
Regarding antibiotic resistance gene profiling, genetic determinants of resistance to β-lactams, aminoglycosides, fluoroquinolones, fosfomycin, phenicols and sulphonamides were found. By mechanism, 60 resistance-associated genes were identified, including 44 efflux pumps and 8 associated with drug inactivation, among them the blaVIM-2 and blaIMP-18 gene alleles. Also, six determinants of target alteration and two of target replacement were identified. More details are shown in Supplementary Table S1.

In the case of virulence factors, P. aeruginosa AG1 has more than 250 genomic determinants in 11 classes or enriched groups, including adherence (flagella, type IV pili biosynthesis and motility), antimicrobial activity (phenazine biosynthesis), antiphagocytosis (alginate production), iron uptake (pyochelin and pyoverdine), enzymes (phospholipases), biosurfactant (rhamnolipid biosynthesis), quorum sensing, proteases, regulation of two-component systems, type three secretion systems (T3SS) and toxins (exotoxin A). More details are shown in Supplementary Table S2.

In the study of the mobilome, a diversity of elements was identified. At the genomic island level, a total of 57 laterally acquired regions (size >10 kb) were identified (light blue in Fig. 3a), which correspond to drastic changes in the average GC composition. Six prophages (including two intact) were identified. The two complete integrons already described were also found. Consistent with this diversity of mobile elements, no complete or functional CRISPR-Cas systems were recognized.

Figure 3. Annotation of P. aeruginosa AG1 genome. (a) Circularized genome showing phage and integron locations. (b) Specific annotation of different genomic determinants, including number of elements. (c) Genome synteny comparison among three strains of P.
aeruginosa: PAO1 (general reference), AG1 (our assembly) and RIVM-EMC2982 (closest one to PaeAG1 according to BLAST analysis).

Using BLASTn, RIVM-EMC2982 (accession CP016955.1; 7,380,063 bp, 65.7% GC content; Prokka annotation: 6,871 CDS, 76 tRNAs, 1 tmRNA and 12 rRNAs), a ST-111 and blaVIM-2-carrying strain, was identified as the closest genome to PaeAG1 (query cover 99%, identity 100%). Both strains have the same number of RNA genes. Synteny comparison of the nucleotide sequences of both strains revealed 99% identity and 92% coverage when comparing PaeAG1 against RIVM-EMC2982.

In addition, the PaeAG1 genome (7.2 Mb) was compared against strains PAO1 (6.3 Mb) and RIVM-EMC2982 (7.4 Mb). As shown in Fig. 3c, the genomic blocks contrast with those of the general reference of the P. aeruginosa group, PAO1, which differs by almost 1 Mb in genome size and by around 1,000 genes. In the comparison with RIVM-EMC2982, the general profile by blocks showed similar arrangements in both strains, congruent with their genome sizes and content of mobile determinants.

In addition, the gene content of ST-111 strains was compared for phylogenetic analysis. A total of 9 complete genomes were available in PGDB, all with variable genome size (6.7–7.3 Mb) and gene content (6,200–7,400 genes). Pan-genome analysis revealed a total of 10,637 genes, which separate the strains into two clusters, one of them including PaeAG1 and P. aeruginosa RIVM-EMC2982 (Fig. 4a). The reference strain PAO1 was found to be completely separated from the group. Regarding the core genome, 4,783 genes (45% of total genes) were identified (present in at least 10 of the 11 sequences). One third of the genes were identified in only one of the strains. More details are shown in Fig. 4b,c. Interestingly, PaeAG1 is the only isolate that carries the blaIMP-18 gene, in contrast to blaVIM-2, which was present in most of the strains.

Discussion

P.
aeruginosa is an opportunistic pathogen able to adapt to different environments, and it causes a variety of acute and chronic infections. PaeAG1 is a clinical isolate from a Costa Rican hospital with a profile of multi-resistance to antibiotics. In this context, concern over the increasing prevalence in hospitals of high-risk clones, including Pseudomonas aeruginosa, has prompted the use of typing methods and sequencing strategies to study the genomic epidemiology of bacterial clones at high resolution [1]. Interested in the assembly and annotation of the PaeAG1 genome, we implemented different approaches using short and long reads, and we benchmarked them using the 3C criterion.

Figure 4. Pan-genome analysis of ST-111 P. aeruginosa strains. (a) Clustering of strains according to their gene content profile. A total of 10,637 genes were identified. (b) Distribution of the gene content across all the strains; the core genome is composed of 4,783 genes (45% of total genes). (c) Distribution of the number of genes by number of genomes.

Benchmark of hybrid and non-hybrid assemblies

Of the more than 50 assemblies we ran for pipeline standardization (considering different pre-processing, assembly and annotation steps), the best cases per assembler were compared. In total, 12 approaches were presented, and the best one was included as a 13th case after polishing and curation. According to the global profiles given by the metrics and the 3C benchmark, variable results were obtained (Table 1 and Fig. 2).

Regarding contiguity, fewer contigs were assembled using long reads or hybrid approaches than using short reads. As reported, assembly contiguity and genome size seem not to be correlated [60]. This is verified in our case, and dependency on technology seems more evident.
Also, the choice of algorithm led to different contiguity, even for the same type of approach. The use of long reads (in non-hybrid or hybrid methods) improved contiguity metrics, solving most of the conflictive regions that short reads could not assemble.

In the case of correctness, long reads only approaches presented critical accuracy problems. As in our study, in a recent study using Unicycler in all cases, error rates for short reads and hybrid assemblies were similar but were much higher for long reads assemblies [1]. Even though we had ultra-deep coverage for both sequencing technologies, this may not be enough to correct errors in long reads only assemblies. This is probably due to systematic errors that have been detected in long reads sequencers, which are not compensated even by increased sequencing depth [10]. In addition, our long reads only assemblies tended to be larger (total length), and duplication in different contigs was recognized. This has been previously reported for long read assemblers [10], and it could be a major obstacle for polishing the genome [12], compromising accuracy.

To assess completeness, we implemented an analysis of expected gene content by searching for single-copy orthologs [61]. Short reads only and hybrid approaches achieved the assembly of 100% of core genes, but long reads only approaches performed poorly. Also, despite the larger number of CDS for long reads, incomplete assembly of genes was evidenced. Fragmentation of genes was confirmed by comparing the average size of all those elements. In long reads only assemblies the average CDS size was approximately 500–660 bp, whereas for all other approaches this value was around 955 bp (Table 1). The average CDS size of the closest genome to PaeAG1, RIVM-EMC2982, is 955 bp, while that of strain PAO1 is 1,000 bp.
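The fragmentation argument above can be expressed as a simple check of mean CDS length against an expected value. The sketch below is illustrative only: the mean lengths are copied from Table 1 and the 955 bp reference is the value quoted for RIVM-EMC2982, but the function name and the 0.75 ratio threshold are assumptions chosen for this example, not criteria used in the study.

```python
from typing import Dict, List

# Mean CDS length (bp) per assembly, as listed in Table 1.
MEAN_CDS_LENGTH_BP: Dict[str, float] = {
    "Velvet (short reads)": 938.34,
    "SPAdes (short reads)": 957.54,
    "IDBA (short reads)": 956.28,
    "Megahit (short reads)": 954.9,
    "SKESA (short reads)": 950.19,
    "Unicycler (short reads)": 953.49,
    "Canu (long reads)": 499.35,
    "Flye (long reads)": 607.14,
    "Unicycler (long reads)": 664.14,
    "IDBA (hybrid)": 955.11,
    "SPAdes (hybrid)": 963.51,
    "Unicycler (hybrid)": 961.89,
    "Final assembly": 961.86,
}


def flag_fragmented(
    mean_lengths: Dict[str, float],
    expected_bp: float = 955.0,   # mean CDS length of the closest genome
    min_ratio: float = 0.75,      # assumed cut-off, for illustration only
) -> List[str]:
    """Return assemblies whose mean CDS length is far below the expected one."""
    cutoff = expected_bp * min_ratio
    return sorted(name for name, mean in mean_lengths.items() if mean < cutoff)


suspect_assemblies = flag_fragmented(MEAN_CDS_LENGTH_BP)
```

With these inputs only the three long reads only assemblies fall below the cut-off, in line with the interpretation that their higher CDS counts reflect genes split by indel errors rather than additional gene content.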
This observation has been briefly reported before [62]. The incompleteness of a genome assembly will not matter if genome structure is not the focus of a study [9], but this is not the case for PaeAG1, where the reconstruction of genomic events would be crucial to understand the special features of this strain.

When all features of the assemblies are included in the PCA, the general profiles of the short reads approaches define a separate cluster, and the long reads and hybrid methods define another one (Fig. 2a). Considering all the metrics of the 3C criterion, the SPAdes and Unicycler hybrid approaches clearly outperformed the non-hybrid methods. This can be explained because reference-free genome assembly is feasible using the best features of both short and long reads technologies [9]. The IDBA assembler is a particular case, as its results remained the same with the hybrid or the non-hybrid approach.

Regarding other works on the algorithms we evaluated, different results have been found depending on the data and genome complexity. However, since the introduction of the Unicycler assembler, a latest-generation algorithm, most studies have suggested that Unicycler outperforms other approaches [1,10,11,63]. In the case of IDBA and Velvet, performance was comparable to SPAdes when it was introduced [36]. Megahit, an assembler for metagenomes that also works for single genomes [38], has also been used in recent studies, mainly related to microbial communities or particular strains [64]. Fewer works using SKESA have been reported, but its performance seems to be better than that of SPAdes and Megahit in some cases [39]. For short reads only or hybrid assemblies, SPAdes is still used to assemble genomes [36,39,65]. In a recent study in which Unicycler was not included, SPAdes had better results than the other assemblers [3]. For long reads, Canu has been successfully implemented in different studies [10,12,41], showing good performance in benchmarks (although most of them did not include the Unicycler assembler).
Flye has been used in recent studies [66,67], including a case where Canu, Flye and Unicycler (using long reads only and hybrid approaches) had very similar performance [68]. Comparison between Unicycler, SPAdes and Canu has shown that in some cases Canu and SPAdes are not able to circularize the final assembly, unlike Unicycler [11]. In another study with long reads only, Canu was the best-ranked assembler using the Escherichia coli genome [12].

All these variable results (in our benchmark and in the literature) are congruent with several reports about the diversity of assemblers, which have been developed to generate high-quality de novo assemblies but whose output is very different because of algorithmic differences, data source and genomic complexity [2]. This complicates the selection of an appropriate strategy. Thus, more capable assemblers are still needed in terms of capabilities, accuracy and the way they deal with genomic features [3].

Regarding the differences in cost between both technologies (considering only the sequencing step and no other complementary costs), Illumina short reads sequencing ($1500) was around three times as expensive as ONT sequencing ($500). In our case, the hybrid approach had a cost of around $2000 for both technologies. Although we had ultra-deep sequencing data for both platforms, the minimal coverage requirements for PaeAG1 genome assembly are not known; lower coverage could significantly reduce the sequencing price. This cost is higher than in other studies, but those sequenced hundreds of samples [69,70], in contrast with our case, in which a single genome was sequenced (increasing costs).

Regarding conflictive regions, each assembler implements slightly different heuristics to deal with repetitions in the genome, uneven coverage, sequencing errors and chimeric reads [8].
Efforts to generate complete genome sequences with repetitive regions have been hampered by the dramatic expansion of mobile elements, especially when short read sequencing methodologies are used [13]. In the PaeAG1 genome assembly, different complicated regions were identified when short reads only approaches (all methods) and hybrid IDBA were used, creating gaps in an incomplete assembly (Fig. 2b). Although PaeAG1 does not really have a repeat-dense genome, mobile elements add repetitive sequences. This complicated the assembly of its genome using short reads only approaches. All these regions were apparently solved by the long reads only approaches and by hybrid SPAdes and hybrid Unicycler. These results are expected according to previous reports and the differences between the technologies. Long reads technologies span repeat regions [63] and permit the bridging of repetitive sequences [65].

However, evaluation of the remapping of reads to the selected assembly (hybrid Unicycler, according to the 3C criterion) revealed a variation in coverage in one specific region, as shown in Supplementary Fig. S1a (left), with an irregular and non-uniform distribution of reads. This conflictive region was preliminarily annotated as a flanking repetitive sequence of one of the integrons (containing the blaVIM-2 gene). This is a common phenomenon in regions carrying antimicrobial resistance determinants, which are often flanked by repetitive insertion sequences and can be difficult to assemble using short reads because the reads are very short compared to the repetitions [10]. In our case, the conflictive region is part of the known region of the integron (approx. 2,500 bp, sequenced using the Sanger method), and 100% of the short reads had a size of 101 bp. Although this region was identified in a hybrid approach, this problem is a current limitation of the algorithms [11], and a curation step was required.
The unresolved repetitive region caused short reads to be mapped incorrectly [9], evidenced as a coverage peak of reads in the remaining conflictive region of the PaeAG1 genome assembly. In addition, this is congruent with the alignment of the known sequence against the assembly. At least 12% of the blaVIM-2-carrying integron sequence was lost in the hybrid approaches, including hybrid Unicycler (Table 1). We can conclude that those identical flanking regions of the integrons were not well assembled using short reads. Long reads approaches were able to cover both regions completely.

The compromised ability of the Unicycler algorithm to assemble this conflictive region in hybrid mode is related to the approach. In general, hybrid assembly can be accomplished with either a short-read-first or a long-read-first approach. In the short-read-first method, contigs are assembled using short reads, followed by scaffolding using long reads [11]. Drawbacks of this approach include scaffolding mistakes and structural errors (misassemblies) in the sequence [71]. This could be the reason for the behavior in the conflictive region in our case, because Unicycler in hybrid mode is a short-read-first approach. In this context, the genome assembly problem is an open issue because it is an NP-hard problem, and no universal solution to find the optimal route in graph-based approaches is available, a situation that is aggravated by repetitive regions. To deal with repetitive sequences in the genome, Unicycler determines the occurrence (multiplicity) of contigs in the assembly using both depth and connectivity with a greedy algorithm, and a bridging step is used to connect contigs and solve repeats using paired-end short reads [11].
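The idea of inferring contig multiplicity from read depth can be illustrated with a deliberately simplified sketch. This is not Unicycler's algorithm, which also uses graph connectivity and a greedy propagation step; it only shows the depth-based part, assuming that most contigs are single copy so that the median depth approximates the single-copy depth. The contig names and depth values are hypothetical.

```python
from statistics import median
from typing import Dict


def estimate_multiplicity(depths: Dict[str, float]) -> Dict[str, int]:
    """Estimate how many times each contig occurs in the genome.

    Assumption for this sketch: the median depth across contigs is the
    depth of a single-copy contig, so multiplicity is the rounded ratio
    of a contig's depth to that median (never less than 1).
    """
    single_copy_depth = median(depths.values())
    return {
        name: max(1, round(depth / single_copy_depth))
        for name, depth in depths.items()
    }


# Hypothetical read depths (X) for five contigs of an assembly graph.
example_depths = {
    "contig_1": 400.0,
    "contig_2": 410.0,
    "contig_3": 395.0,
    "repeat_A": 805.0,   # roughly twice the typical depth
    "repeat_B": 1190.0,  # roughly three times the typical depth
}

example_multiplicity = estimate_multiplicity(example_depths)
```

In this toy case the two high-depth contigs are assigned multiplicities of 2 and 3. A collapsed repeat such as the integron flanking region would show up in the same way, as a contig whose depth is a multiple of the single-copy depth, which is consistent with the coverage peak described above.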
However, because the algorithm used by Unicycler is a greedy approach, an optimal solution is not guaranteed, and assembly errors can be introduced. Thus, additional steps, such as manual curation, are required.

In this sense, manual curation is a common practice to finish genomes due to the complexity of genomic data, which algorithms cannot always handle [9,10]. In another study that compared long reads only and hybrid assemblies, manual curation implied the recovery of lost sequences of up to 18 kbp for some assemblies [10]. The same situation was reported for another ST-111 P. aeruginosa strain, where the flanking regions of the blaVIM-2 gene were broken during assembly [72]. In other studies, no polishing strategy improved the completeness of assemblies [65].

To improve the genome assembly of PaeAG1, curation was done with the help of the known sequences of PaeAG1 (Sanger sequencing), the assembly graphs and the long reads only assemblies. After this polishing step, remapping showed a uniform distribution of reads (Supplementary Fig. S1a, right) and complete matching (100% identity and coverage) of the known sequences of the integrons, as expected.

At the assembly graph level, when the topological structure of the assembly is analyzed for short reads assemblies (Fig. 2c, short reads), a collapsed graph is evident, where sequences are shown as cycles due to repeats or small sequences shared by many reads at the same time. This means that there is insufficient information to disambiguate the repeats or shared sequences in the graph. This problem was solved when long reads were used, with no cycles for the long reads approaches (although the case shown had two contigs) and a complete circularized genome for the final hybrid assembly.

Assessment of the genome assembly of PaeAG1

Based on the best overall quality statistics and polishing, the hybrid approach using Unicycler was selected as the final assembly of the PaeAG1 genome using the 3C criterion.
In our initial efforts to assemble the genome using only short reads, most assemblers generated more than 100 contigs, and using the RIVM-EMC2982 strain as reference (selected after a full-genome BLASTn of the contigs), scaffolding finished with 1 sequence and 22 gaps in the case of Unicycler. In order to improve the genome assembly, ONT technology was used to produce long reads, and new evaluations were made using both long reads only and hybrid methods.

On the other hand, although contiguity, completeness and correctness are all frequently evaluated in genome assembly studies [3,8,12,15–17], no explicit conceptualization of the “3C criterion” has been achieved. Here we emphasize its use to refer to the classical metrics and comparisons.

The final assessment of the definitive PaeAG1 genome assembly showed ultra-deep coverage for both short (>400X) and long reads (>560X) technologies. It also achieved high performance according to the 3C criterion: (i) full contiguity, with a single circular genome without gaps; (ii) correctness based on short reads remapping and polishing, achieving full accuracy (including the known sequences of the strain); and (iii) completeness according to the identification of 100% of the expected core gene set and the percentage of remapping of genomic reads, as well as the mapping of reads from RNA-Seq technology.

Altogether, the use of a hybrid strategy allowed the PaeAG1 genome to be inferred by a de novo (reference-free) assembly approach, which represents a key element in the study of this strain due to its exclusive genomic features [9].

To our knowledge, this is the first genome assembly of a ST-111 P. aeruginosa strain using a hybrid approach. The first hybrid assemblies for P. aeruginosa strains of other classes were published recently [23,73,74]. In order to evaluate our pipeline on these publicly available sequencing data, we applied our hybrid approach to the two cases with Illumina and ONT sequencing technologies.
For the P. aeruginosa strain Houston-1 [73], we were able to reproduce the assembly of the chromosome and the plasmid with our approach. For the P. aeruginosa strain CRPA [23], the published draft genome was composed of three contigs, and with our approach we were able to finish it into two contigs, representing an improvement in the assembly. More details of the assemblies of these two strains are shown at the end of the Supplementary Material Manual curation.

Annotation of the PaeAG1 genome and epidemiological insights

In order to identify the main features of the PaeAG1 genome, including its architecture, composition and functions, genome characterization and annotation were done. The PaeAG1 chromosome is a large circular sequence of 7,190,208 bp, larger than that of the reference strain PAO1 and similar in size to other ST-111 strains [31,75]. The same pattern was found for the GC content of 65.7%. This relatively large genome in P. aeruginosa has been associated with the ability to thrive in a repertoire of hosts and environments [21].

The general annotation of the genome revealed that PaeAG1 contains 6,709 genes (including 6,620 CDS), of which 2,197 are associated with Gene Ontology terms, 3,060 with KEGG elements and 5,537 with COGs. Similarly to what was reported for the first whole-genome sequencing of a P. aeruginosa strain [76], genome analysis of PaeAG1 shows determinants associated with versatility and a successful ability to conquer multiple niches in nature. For example, broad capabilities to transport and metabolize organic substances, the presence of chemotaxis systems, biofilm production and efflux systems have been described, and all of them were annotated for PaeAG1.

Genome sequence analysis using molecular typing methods showed that PaeAG1 has a ST-111 profile and O12 serotype. ST-111 is a lineage that belongs to the O12 serotype, which has been associated with multidrug resistance and expansion in hospitals for decades [28,72,75]. Thus, the emergence of high-risk clones, including the ST-111 clones of P.
aeruginosa, undermines the available therapeutic strategies and therefore compromises public health. The presence of this kind of high-risk clone in Costa Rican hospitals is a nationwide concern because isolates producing MBL and particular virulence factors cause serious infections that are difficult to treat [77]. This same ST-111 profile has been identified in most of the MBL-producing P. aeruginosa strains in the United Kingdom [75], as well as in the Netherlands [77].

Annotation of virulence factors found the classical elements of the P. aeruginosa group [78], including elements related to adherence, antiphagocytosis, iron uptake, phospholipases, biosurfactant, quorum sensing, proteases, regulation, secretion systems, and toxins. Some particular virulence factors of PaeAG1 are substrates for the type I protein secretion system T1SS (alkaline protease aprA), T2SS (elastases LasA and LasB, exotoxin A and phospholipases PlcH, PlcN, and PlcB) and T3SS (ExoS, ExoT, and ExoY) [78]. It has been reported that secretion of ExoS is predominantly identified in invasive P. aeruginosa strains [78]. Recently, this determinant was identified in two blaVIM-2-carrying strains in the Netherlands, one serotype O12 and ST-111 isolate (P. aeruginosa Carb01–63) and another O11 strain of ST-446 (P. aeruginosa S04 90) [31]. In PaeAG1, a potential invasive role of the strain may be related to the presence of this element.

In the context of mobile genetic elements, a large number of determinants were identified in the chromosome of PaeAG1, including multiple genomic islands, six prophages and two integrons. The comparison of PaeAG1 against the reference of the P. aeruginosa group, PAO1, and against the closest strain to PaeAG1, RIVM-EMC2982, is consistent with genome size and mobile element content.
The PAO1 reference has a 6.3 Mb genome, whereas PaeAG1 has almost 1 Mb more (around 1,000 genes). This difference is congruent with the high content of genomic islands and other mobile elements in PaeAG1, which is reduced in PAO1. RIVM-EMC2982 (ST-111, blaVIM-2-positive) was identified as the strain closest to PaeAG1, and a similar profile of genomic blocks was recognized (Fig. 3c). Closer analysis showed some different genomic arrangements, including differences in the composition of mobile elements and the absence of blaIMP-18 in RIVM-EMC2982.

All six prophages of PaeAG1 are also found in the RIVM-EMC2982 genome (ten prophages in total), with the same integrity status. However, prophage presence varies among ST-111 strains, which has been discussed as difficult to interpret because of the transient nature of phages or methodological issues [72]. In addition, such high numbers of prophages might be related to the absence of CRISPR-Cas systems in the genome [31], as is the case for PaeAG1. Compromised CRISPR-Cas defense systems have been associated with a greater ability to acquire mobile elements carrying antibiotic resistance genes in P. aeruginosa and other organisms [79].

Regarding the integrons of PaeAG1, the earlier identification of the class I integron genes intI1, sul1 and qacEΔ1 suggested two integron-like structures carrying the VIM-2 and IMP-18 genes [22]. This was confirmed by Sanger sequencing of both integrons. In our assembly, the same two complete integrons with the same structure were found, one carrying blaVIM-2 and the other blaIMP-18. This is congruent with previous studies showing that these two genes are regularly identified in integrons in P. aeruginosa [30,31,80].
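As an illustration of how marker genes can flag an integron-like region in an annotation, the following minimal sketch scans a gene list for intI1 with qacEΔ1 and sul1 nearby. It is not the procedure used in this study: the `Gene` record, the function name, the coordinates, the `max_span` threshold and the ASCII spelling `qacEdelta1` are all assumptions made for the example.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    """Hypothetical annotation record: gene name plus coordinates in bp."""
    name: str
    start: int
    end: int


# Marker genes named in the text for class I integrons.
INTEGRASE = "intI1"
CONSERVED_MARKERS = {"qacEdelta1", "sul1"}  # ASCII stand-in for qacEΔ1


def find_class1_integron_like(genes: List[Gene], max_span: int = 10_000) -> List[List[str]]:
    """Return the gene names of every window around an intI1 gene that also
    contains both conserved markers. max_span is an assumed search radius."""
    ordered = sorted(genes, key=lambda g: g.start)
    hits = []
    for gene in ordered:
        if gene.name != INTEGRASE:
            continue
        low, high = gene.start - max_span, gene.end + max_span
        window = [g for g in ordered if g.start >= low and g.end <= high]
        if CONSERVED_MARKERS <= {g.name for g in window}:
            hits.append([g.name for g in window])
    return hits


# Hypothetical toy annotation: one complete structure and one lone integrase.
toy_genes = [
    Gene("geneA", 100, 900),
    Gene("intI1", 1_000, 2_000),
    Gene("blaVIM-2", 2_100, 2_900),
    Gene("qacEdelta1", 3_000, 3_300),
    Gene("sul1", 3_400, 4_200),
    Gene("intI1", 50_000, 51_000),
    Gene("geneB", 51_100, 52_000),
]

for region in find_class1_integron_like(toy_genes, max_span=5_000):
    print(region)
```

A marker-based scan like this only suggests candidate regions; as the text notes, the structures in PaeAG1 were confirmed by sequencing.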
In more detail, VIM (Verona integron-encoded metallo-β-lactamase) enzymes have the same hydrolytic spectrum as the IMP-type enzymes, and blaVIM-2 in particular is responsible for multiple outbreaks, being the most widespread MBL in P. aeruginosa [30]. Strains carrying VIM-2 have been identified around the world [75,80–83]. In the United Kingdom, a study of 87 ST-111 P. aeruginosa strains found that 73 isolates carried VIM-2, others carried different IMPs, and one isolate had both VIM-2 and IMP-18, the second report of a clone carrying both MBLs [75]. In an outbreak in the Netherlands, another strain (Carb01–63, isolated from drains and sinks in a hospital) had an ST-111 profile and was closely related to RIVM-EMC2982 [31]. All three strains (PaeAG1, Carb01–63 and RIVM-EMC2982, which fall in the same group in the phylogenetic analysis) are resistant to multiple antibiotics and carry the blaVIM-2 allele. For the imipenemase encoded by blaIMP-18, outbreak reports and information on the genetic context are limited in P. aeruginosa, with some cases in the United States [84], Mexico [85], France [81] and Puerto Rico [86].

Regarding other antibiotic resistance determinants, the annotation also included serine- and metallo-β-lactamases (PDC-3, OXA-2, as well as VIM-2 and IMP-18), porins and efflux pumps (including the mexAB–oprM, mexCD–oprJ, mexEF–oprN and mexHI–opmD operons). All of them may contribute to the multi-resistance phenotype of PaeAG1.

As revealed by the pan-genome analysis of ST-111 members, variable gene content separates the strains into relatively independent groups. The strains (including PaeAG1) belong to the O12 serotype, which has been associated with multidrug resistance and nosocomial expansion [28,29]. PaeAG1 was close to the main group of five isolates, which includes P. aeruginosa RIVM-EMC2982 (the closest to PaeAG1 by BLAST analysis) and Carb01–63.
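To make the idea of grouping strains by gene content concrete, here is a minimal sketch that compares strains through their gene presence/absence sets using the Jaccard distance and lists the genes found in a single strain. It is an illustration under stated assumptions, not the pan-genome pipeline used in this work: the strain labels, the gene sets and the function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """Return 1 - |A & B| / |A | B|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def pairwise_distances(gene_sets: Dict[str, Set[str]]) -> Dict[Tuple[str, str], float]:
    """Jaccard distance for every unordered pair of strains."""
    return {
        (s1, s2): jaccard_distance(gene_sets[s1], gene_sets[s2])
        for s1, s2 in combinations(sorted(gene_sets), 2)
    }


def strain_specific_genes(gene_sets: Dict[str, Set[str]], strain: str) -> Set[str]:
    """Genes present in `strain` and absent from every other strain."""
    others: Set[str] = set()
    for name, genes in gene_sets.items():
        if name != strain:
            others |= genes
    return gene_sets[strain] - others


# Hypothetical toy gene sets (not real annotations).
toy = {
    "strain_A": {"core1", "core2", "blaVIM-2", "blaIMP-18"},
    "strain_B": {"core1", "core2", "blaVIM-2"},
    "strain_C": {"core1", "core2"},
}

distances = pairwise_distances(toy)
for pair, dist in distances.items():
    print(pair, round(dist, 3))
print(strain_specific_genes(toy, "strain_A"))
```

Smaller distances indicate more similar gene content, so strains sharing accessory genes fall closer together than strains that share only the core.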
Although all the strains (except the reference) are part of the same group, differences in gene content are a remarkable feature; notably, PaeAG1 was the only strain carrying blaIMP-18. In contrast, ST-111 strains have frequently been associated with blaVIM-2, as mentioned above [28,75]. Less commonly associated β-lactamase genes include VIM-4 and other IMP-type enzymes, and some strains carry only extended-spectrum β-lactamases without carbapenemase activity (such as VEB-1 and OXA) [75]. Owing to differences in genome size (6.7–7.3 Mb) and gene content, as well as the particular genomic features of these strains (genomic island composition and evolution, mobile elements, integrons, phages and others), further analyses are required to describe the high plasticity of this group.

Conclusions
Advances in sequencing technology play an increasing and decisive role in infection investigations and in tracking the evolution of international lineages of high-risk bacterial clones in the clinical context over long periods and in great detail [87]. However, genome assembly is not trivial: it is challenged by the sequencing technology, the genomic features and the bioinformatic algorithms, and it remains an open problem. Exhaustive comparison of different assembly strategies, together with their assessment, gives a better way to approach the real genome sequence. Benchmarking using the 3C criterion is a consensus approach that includes different levels and aims of comparison for the robust selection of a final assembly.
In our case, a hybrid assembly was the best approach to achieve a single circular sequence with high 3C quality for the genome of a high-risk P. aeruginosa strain. Thus, the best features of short- and long-read sequencing technologies are combined and their drawbacks are compensated.
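The contiguity side of such a benchmark can be illustrated with a small calculation. The sketch below is a minimal example, assuming that contiguity is one of the properties compared and that it is summarized with the widely used N50 statistic. The function name `n50` and the fragmented draft contig lengths are hypothetical; only the 7,190,208 bp chromosome length comes from the text.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Length of the contig at which the cumulative size, summed from the
    longest contig downwards, first reaches half of the total assembly size."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable: cumulative sum always reaches the total")


# Hypothetical fragmented draft whose contigs add up to the reported size.
draft = [3_000_000, 2_500_000, 1_000_000, 690_208]
# Finished assembly: one circular chromosome of the size reported in the text.
finished = [7_190_208]

print("draft N50:", n50(draft))
print("finished N50:", n50(finished))
```

A single-contig assembly has an N50 equal to its full length, which is why contiguity alone cannot rank assemblies and is usually read together with measures of correctness and completeness.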
The assembly of the PaeAG1 genome is a first and important step towards understanding the genomic architecture of an ST-111 high-risk strain. Annotation can reveal the genomic content and the molecular determinants related to phenotypes, which for PaeAG1 relate mainly to multi-resistance and virulence. This highlights the need for more studies that combine epidemiological information with both high-throughput technologies and conventional methods to understand the molecular mechanisms and phenotypes, to support clinical decisions, and to fight and, hopefully, overcome the antibiotic multi-resistance problem.

Data availability
Input and output data for the PCA are provided as Supplementary Material PCA data. The details of the approach for the manual curation are available in the Supplementary Material Manual Curation. Scripts for the bioinformatics analysis are provided as supplementary material and are also available at https://github.com/josemolina6/PaeAG1_genome/blob/master/Script_for_bioinformatic_analysis.sh. To run the analysis of the 3C criterion specifically, a simplified script is available at https://github.com/josemolina6/PaeAG1_genome/blob/master/Script_3C_evaluation.sh. The annotated final assembly of the PaeAG1 chromosome was deposited in GenBank under the accession number CP045739. Short-read and long-read raw data were uploaded to the NCBI Sequence Read Archive (SRA) and are available under the accession numbers SRX7088413 and SRX7088414, respectively. A full table with all the details of the genome annotation is provided as supplementary material and is also available at https://github.com/josemolina6/PaeAG1_genome. Annotation files in different formats, as well as the FASTA files of all the assemblies, are available at the same link.

Received: 14 November 2019; Accepted: 6 January 2020; Published: xx xx xxxx

References
1. Gonzales Decano, A. et al.
Complete Assembly of Escherichia coli Sequence Type 131 Genomes Using Long Reads Demonstrates Antibiotic Resistance Gene Variation within Diverse Plasmid and Chromosomal Contexts. mSphere 4 (2019). 2. Kwon, D., Lee, J. & Kim, J. GMASS: A novel measure for genome assembly structural similarity. BMC Bioinformatics 20, 1–9 (2019). 3. Yahav, T. & Privman, E. A comparative analysis of methods for de novo assembly of hymenopteran genomes using either haploid or diploid samples. Sci. Rep. 9, 1–10 (2019). 4. Ekblom, R. & Wolf, J. B. W. A field guide to whole-genome sequencing, assembly and annotation. Evol. Appl. 7, 1026–1042 (2014). 5. Aguilar-Bultet, L. & Falquet, L. Secuenciación y ensamblaje de novo de genomas bacterianos: una alternativa para el estudio de nuevos patógenos. Rev. Salud Anim. 37, 125–132 (2015). 6. Miller, J. R., Koren, S. & Sutton, G. Assembly algorithm for Next-Generation Sequencing data. Genomics 95, 315–327 (2010). 7. Bellec, A., Courtial, A., Cauet, S. & Rodde, N. Long Read Sequencing Technology to Solve Complex Genomic Regions Assembly in Plants. J. Next Gener. Seq. Appl. 3 (2016). 8. Alhakami, H., Mirebrahim, H. & Lonardi, S. A comparative evaluation of genome assembly reconciliation tools. Genome Biol. 18, 1–14 (2017). 9. Wang, W. et al. Assembly of chloroplast genomes with long- and short-read data: A comparison of approaches using Eucalyptus pauciflora as a test case. BMC Genomics 19, 1–15 (2018). 10. Wick, R. R., Judd, L. M., Gorrie, C. L. & Holt, K. E. Completing bacterial genome assemblies with multiplex MinION sequencing. Microb. Genomics 3 (2017). 11. Wick, R. R., Judd, L. M., Gorrie, C. L. & Holt, K. E. Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads. PLoS Comput. Biol. 13, 1–22 (2017). 12. Jayakumar, V. & Sakakibara, Y. Comprehensive evaluation of non-hybrid genome assembly tools for third-generation PacBio long-read sequence data. Brief. Bioinform. 20, 866–876 (2019). 13. Batty, E. M. et al.
Long-read whole genome sequencing and comparative analysis of six strains of the human pathogen Orientia tsutsugamushi. PLoS Negl. Trop. Dis. 12, 1–17 (2018). 14. Gurevich, A., Saveliev, V., Vyahhi, N. & Tesler, G. QUAST: quality assessment tool for genome assemblies. Bioinformatics 29, 1072–1075 (2013). 15. Michael, T. P. et al. High contiguity Arabidopsis thaliana genome assembly with a single nanopore flow cell. Nat. Commun. 9, 1–8 (2018). 16. Broad Institute. GAEMR. Available at: http://software.broadinstitute.org/software/gaemr/ (Accessed: 30th July 2019) (2019). 17. Nadalin, F., Vezzi, F. & Policriti, A. GapFiller: a de novo assembly approach to fill the gap within paired reads. BMC Bioinformatics 13, S8 (2012). 18. Nishimura, O., Hara, Y. & Kuraku, S. gVolante for standardizing completeness assessment of genome and transcriptome assemblies. Bioinformatics 33, 3635–3637 (2017). 19. Liao, Y. C. et al. Completing bacterial genome assemblies: strategy and performance comparisons Oxford Nanopore MinION sequencing and genome assembly Circlator: automated circularization of genome assemblies using long sequencing reads Versatile genome assembly evaluation. 2016–2017 (2019). 20. Duan, J., Jiang, W., Cheng, Z., Heikkila, J. J. & Glick, B. R. The Complete Genome Sequence of the Plant Growth-Promoting Bacterium Pseudomonas sp. UW4. PLoS One 8 (2013). 21. Freschi, L. et al. The Pseudomonas aeruginosa Pan-Genome Provides New Insights on Its Population Structure, Horizontal Gene Transfer, and Pathogenicity. Genome Biol. Evol. 11, 109–120 (2019). 22. Toval, F. et al. Predominance of carbapenem-resistant Pseudomonas aeruginosa isolates carrying blaIMP and blaVIM metallo-β-lactamases in a major hospital in Costa Rica. J. Med. Microbiol. 64, 37–43 (2015). 23. Yu, X. et al. Long-read Nanopore Sequencing-based Draft Genome of a Carbapenem-resistant Pseudomonas aeruginosa. J. Glob. Antimicrob. Resist. https://doi.org/10.1016/j.jgar.2019.05.023 (2019). 24. Farajzadeh Sheikh, A.
et al. Molecular epidemiology of colistin-resistant Pseudomonas aeruginosa producing NDM-1 from hospitalized patients in Iran. Iran. J. Basic Med. Sci. 22, 38–42 (2019). 25. Miriagou, V. et al. Acquired carbapenemases in Gram-negative bacterial pathogens: detection and surveillance issues. Clin. Microbiol. Infect. 16, 112–22 (2010). 26. Baquero, F., Coque, T. M. & de la Cruz, F. Ecology and Evolution as Targets: the Need for Novel Eco-Evo Drugs and Strategies To Fight Antibiotic Resistance. Antimicrob. Agents Chemother. 55, 3649–3660 (2011). 27. Willems, R. J. L., Hanage, W. P., Bessen, D. E. & Feil, E. J. Population biology of Gram-positive pathogens: high-risk clones for dissemination of antibiotic resistance. FEMS Microbiol. Rev. 35, 872–900 (2011). 28. Woodford, N., Turton, J. F. & Livermore, D. M. Multiresistant Gram-negative bacteria: the role of high-risk clones in the dissemination of antibiotic resistance. FEMS Microbiol. Rev. 35, 736–755 (2011). 29. Mulet, X. et al. Biological markers of Pseudomonas aeruginosa epidemic high-risk clones. Antimicrob. Agents Chemother. 57, 5527–5535 (2013). 30. Hong, D. J. et al. Epidemiology and characteristics of metallo-β-lactamase-producing Pseudomonas aeruginosa. Infect. Chemother. 47, 81–97 (2015). 31. van der Zee, A. et al. Spread of carbapenem resistance by transposition and conjugation among Pseudomonas aeruginosa. Front. Microbiol. 9, 1–11 (2018). 32. Andrews, S. FastQC A Quality Control tool for High Throughput Sequence Data. Available at: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ (Accessed: 10th April 2018) (2010). 33. Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114–2120 (2014). 34. Loman, N. J. & Quinlan, A. R. Poretools: a toolkit for analyzing nanopore sequence data.
Bioinformatics 30, 3399–3401 (2014). 35. Zerbino, D. R. & Birney, E. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18, 821–829 (2008). 36. Bankevich, A. et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19, 455–77 (2012). 37. Peng, Y., Leung, H. C. M., Yiu, S. M. & Chin, F. Y. L. IDBA – A Practical Iterative de Bruijn Graph De Novo Assembler. In 426–440, https://doi.org/10.1007/978-3-642-12683-3_28 (Springer, Berlin, Heidelberg, 2010). 38. Li, D., Liu, C.-M., Luo, R., Sadakane, K. & Lam, T.-W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 31, 1674–1676 (2015). 39. Souvorov, A., Agarwala, R. & Lipman, D. J. SKESA: strategic k-mer extension for scrupulous assemblies. Genome Biol. 19, 153 (2018). 40. Chikhi, R. & Medvedev, P. Informed and automated k-mer size selection for genome assembly. Bioinformatics 30, 31–37 (2014). 41. Koren, S. et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 27, 722–736 (2017). 42. Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. A. Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol. 37, 540–546 (2019). 43. Antipov, D., Korobeynikov, A., McLean, J. S. & Pevzner, P. A. hybridSPAdes: an algorithm for hybrid assembly of short and long reads. Bioinformatics 32, 1009–1015 (2016). 44. Bosi, E. et al. MeDuSa: a multi-draft based scaffolder. Bioinformatics 31, 2443–2451 (2015). 45. Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212 (2015). 46. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–60 (2009). 47. Kim, D., Langmead, B. & Salzberg, S. L. 
HISAT: a fast spliced aligner with low memory requirements. Nat. Methods 12, 357–360 (2015). 48. Okonechnikov, K., Conesa, A. & García-Alcalde, F. Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics 32, 292–4 (2016). 49. Ewels, P., Magnusson, M., Lundin, S. & Käller, M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics 32, 3047–3048 (2016). 50. Wick, R. R., Schultz, M. B., Zobel, J. & Holt, K. E. Bandage: interactive visualization of de novo genome assemblies: Fig. 1. Bioinformatics 31, 3350–3352 (2015). 51. Alikhan, N.-F., Petty, N. K., Ben Zakour, N. L. & Beatson, S. A. BLAST Ring Image Generator (BRIG): simple prokaryote genome comparisons. BMC Genomics 12, 402 (2011). 52. Walker, B. J. et al. Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement. PLoS One 9, e112963 (2014). 53. Darling, A. C. E., Mau, B., Blattner, F. R. & Perna, N. T. Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res. 14, 1394–403 (2004). 54. Page, A. J. et al. Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics 31, 3691–3693 (2015). 55. Seemann, T. Prokka: rapid prokaryotic genome annotation. Bioinformatics 30, 2068–2069 (2014). 56. Huerta-Cepas, J. et al. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 47, D309–D314 (2019). 57. Larsen, M. V. et al. Multilocus sequence typing of total-genome-sequenced bacteria. J. Clin. Microbiol. 50, 1355–61 (2012). 58. Arndt, D. et al. PHASTER: a better, faster version of the PHAST phage search tool. Nucleic Acids Res. 44, W16–W21 (2016). 59. Cury, J., Jové, T., Touchon, M., Néron, B. & Rocha, E. P. Identification and analysis of integrons and cassette arrays in bacterial genomes. Nucleic Acids Res. 44, 4539–4550 (2016). 60. Koren, S. et al. 
Reducing assembly complexity of microbial genomes with single-molecule sequencing. Genome Biol. 14, R101 (2013). 61. Wang, W. et al. Data descriptor: The sequence and de novo assembly of hog deer genome. Sci. Data 6, 4–11 (2019). 62. Kirkegaard, R. What is a good genome assembly? – Albertsen Lab. Available at: https://albertsenlab.org/what-is-a-good-genome-assembly/ (Accessed: 9th August 2019) (2019). 63. Peter, S. et al. Tracking of antibiotic resistance transfer and rapid plasmid evolution in a hospital setting by Nanopore sequencing. bioRxiv 639609, https://doi.org/10.1101/639609 (2019). 64. Learman, D. R. et al. Comparative genomics of 16 Microbacterium spp. that tolerate multiple heavy metals and antibiotics. PeerJ 6, e6258 (2019). 65. Nicholls, S. M., Quick, J. C., Tang, S. & Loman, N. J. Ultra-deep, long-read nanopore sequencing of mock microbial community standards. Gigascience 8, 1–9 (2019). 66. Schmid, M. et al. Pushing the limits of de novo genome assembly for complex prokaryotic genomes harboring very long, near identical repeats. Nucleic Acids Res. 46, 8953–8965 (2018). 67. Bishara, A. et al. High-quality genome sequences of uncultured microbes by assembly of read clouds. Nat. Biotechnol. https://doi.org/10.1038/nbt.4266 (2018). 68. Ring, N. et al. Resolving the complex Bordetella pertussis genome using barcoded nanopore sequencing. Microb. Genomics 4 (2018). 69. De Maio, N. et al. Comparison of long-read sequencing technologies in the hybrid assembly of complex bacterial genomes. Microb. Genomics 5, e000294 (2019). 70. Risse, J. et al. A single chromosome assembly of Bacteroides fragilis strain BE1 from Illumina and MinION nanopore sequencing data. Gigascience 4, 60 (2015). 71. Koren, S. et al. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat. Biotechnol. 30, 693–700 (2012).
72. Witney, A. A. et al. Genome sequencing and characterization of an extensively drug-resistant sequence type 111 serotype O12 hospital outbreak strain of Pseudomonas aeruginosa. Clin. Microbiol. Infect. 20, O609–O618 (2014). 73. Spinler, J. K., Raza, S., Runge, J. K. & Luna, R. A. Complete Genome Sequence of the Multidrug-Resistant Pseudomonas aeruginosa Endemic Houston-1 Strain, Isolated from a Pediatric Patient with Cystic Fibrosis and Assembled Using Oxford Nanopore and Illumina Sequencing. Microbiol. Resour. Announc. 8 (2019). 74. Magalhães, B., Senn, L. & Blanc, D. S. High-Quality Complete Genome Sequences of Three Pseudomonas aeruginosa Isolates Retrieved from Patients Hospitalized in Intensive Care Units. Microbiol. Resour. Announc. 8 (2019). 75. Turton, J. F. et al. High-resolution analysis by whole-genome sequencing of an international lineage (Sequence Type 111) of Pseudomonas aeruginosa associated with metallo-carbapenemases in the United Kingdom. J. Clin. Microbiol. 53, 2622–2631 (2015). 76. Olson, M. V. et al. Complete genome sequence of Pseudomonas aeruginosa PAO1, an opportunistic pathogen. Nature 406, 959–964 (2000). 77. Van der Bij, A. K. et al. Metallo-β-lactamase-producing Pseudomonas aeruginosa in the Netherlands: the nationwide emergence of a single sequence type. Clin. Microbiol. Infect. 18, E369–E372 (2012). 78. Bleves, S. et al. Protein secretion systems in Pseudomonas aeruginosa: A wealth of pathogenic weapons. Int. J. Med. Microbiol. 300, 534–543 (2010). 79. Pawluk, A., Bondy-Denomy, J., Cheung, V. H. W., Maxwell, K. L. & Davidson, A. R. A new group of phage anti-CRISPR genes inhibits the type I-E CRISPR-Cas system of Pseudomonas aeruginosa. MBio 5, e00896 (2014). 80. Giakkoupi, P. et al.
Spread of Integron-Associated VIM-Type Metallo-β-Lactamase Genes among Imipenem-Nonsusceptible Pseudomonas aeruginosa Strains in Greek Hospitals. J. Clin. Microbiol. 41, 822 (2003). 81. Hocquet, D. et al. Nationwide investigation of extended-spectrum beta-lactamases, metallo-beta-lactamases, and extended-spectrum oxacillinases produced by ceftazidime-resistant Pseudomonas aeruginosa strains in France. Antimicrob. Agents Chemother. 54, 3512–5 (2010). 82. Poirel, L. et al. Characterization of VIM-2, a carbapenem-hydrolyzing metallo-beta-lactamase and its plasmid- and integron-borne gene from a Pseudomonas aeruginosa clinical isolate in France. Antimicrob. Agents Chemother. 44, 891–7 (2000). 83. Poirel, L. et al. Characterization of Class 1 Integrons from Pseudomonas aeruginosa That Contain the blaVIM-2 Carbapenem-Hydrolyzing β-Lactamase Gene and of Two Novel Aminoglycoside Resistance Gene Cassettes. Antimicrob. Agents Chemother. 45, 546–552 (2001). 84. Borgianni, L. et al. Genetic Context and Biochemical Characterization of the IMP-18 Metallo-β-Lactamase Identified in a Pseudomonas aeruginosa Isolate from the United States. Antimicrob. Agents Chemother. 55, 140–145 (2011). 85. Garza-Ramos, U. et al. Metallo-β-lactamase IMP-18 is located in a class 1 integron (In96) in a clinical isolate of Pseudomonas aeruginosa from Mexico. Int. J. Antimicrob. Agents 31, 78–80 (2008). 86. Martínez, T., Vazquez, G. J., Aquino, E. E., Goering, R. V. & Robledo, I. E. Two novel class I integron arrays containing IMP-18 metallo-β-lactamase gene in Pseudomonas aeruginosa clinical isolates from Puerto Rico. Antimicrob. Agents Chemother. 56, 2119–21 (2012). 87. Dößelmann, B. et al. Rapid and Consistent Evolution of Colistin Resistance in Extensively Drug-Resistant Pseudomonas aeruginosa during Morbidostat Culture. Antimicrob. Agents Chemother. 61, e00043–17 (2017).
Acknowledgements
We thank all members of both the Centro de Investigación en Enfermedades Tropicales (Universidad de Costa Rica, Costa Rica) and the PGx group of The Human Phenome Institute (Fudan University, Shanghai, China) for their support in the experimental and bioinformatics analyses, respectively. This work was funded by project “B8114 Definición de la red transcriptómica y de las alteraciones genómicas inducidas por ciprofloxacina en Pseudomonas aeruginosa AG1”, Vicerrectoría de Investigación, Universidad de Costa Rica (period 2017–2019).

Author contributions
J.M.M., C.R. and F.G. participated in the conception and design of the study and in data selection. J.M.M. implemented the bioinformatics analysis. J.M.M., R.C.S., C.R. and L.S. participated in the interpretation of the bioinformatics results. J.M.M., C.R. and F.G. participated in the interpretation of the data in the biological context. J.M.M. drafted the manuscript and all authors were involved in its revision. All authors read and approved the final manuscript.

Competing interests
The authors declare no competing interests.

Additional information
Supplementary information is available for this paper at https://doi.org/10.1038/s41598-020-58319-6.
Correspondence and requests for materials should be addressed to J.A.M.-M.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
© The Author(s) 2020

CHAPTER 2
Genomic context of the two integrons of ST-111 Pseudomonas aeruginosa AG1: a VIM-2-carrying old-acquaintance and a novel IMP-18-carrying integron

Molina-Mora, J.-A., Garcia-Batan, R., & Garcia, F. (2020). Genomic context of the two integrons of ST-111 Pseudomonas aeruginosa AG1: A VIM-2-carrying old-acquaintance and a novel IMP-18-carrying integron. Research Square (Pre-Print). https://doi.org/10.21203/RS.3.RS-41474/V1

Summary
P. aeruginosa AG1 is a high-risk ST-111 strain with resistance to multiple antibiotics, including carbapenems through the activity of the VIM-2 and IMP-18 metallo-β-lactamases. These genes are harbored in two class 1 integrons, which belong to genomic islands. However, the genomic context related to these determinants in PaeAG1 is unclear. Thus, we implemented a comparative genomic approach to define and update the phylogenetic relationship among complete P. aeruginosa genomes and genotyping profiles using a pan-genome analysis. We also studied the presence of the PaeAG1 genomic islands in other strains and the architecture of the genomic regions around the integrons. With 211 strains, the pan-genome analysis revealed that complete genome sequences are able to separate clones by MLST, including an ST-111 cluster with PaeAG1.
The PaeAG1 genomic islands were found to define a diverse presence/absence pattern among related genomes, but their content was related to phylogenetic relationships. Finally, landscape reconstruction of specific genomic regions showed that the VIM-2-carrying integron (In59-like) is an old-acquaintance element harbored in a known genomic region that is completely found in two other ST-111 strains. In addition, PaeAG1 has an exclusive genomic region containing a novel IMP-18-carrying integron (registered as In1666), with an arrangement never reported before. Altogether, we provide new insights into the genomic determinants associated with resistance to carbapenems in this high-risk P. aeruginosa using comparative genomics.

[Graphical abstract. Panels: VIM-2 and IMP-18 expression after imipenem exposure (RT-qPCR); pan-genome analysis: selection of related genomes; PaeAG1 genomic islands in other related strains; reconstruction of genomic regions associated with VIM-2- and IMP-18-carrying integrons.]

Highlights
Pseudomonas aeruginosa AG1 (PaeAG1) carries VIM-2 and IMP-18 genes, which are induced by carbapenems
Pan-genome analysis is able to separate strains by MLST profile
Few PaeAG1 genomic islands were found in other related genomes
The VIM-2-carrying integron (In59-like) is an old-acquaintance element
A novel IMP-18-carrying integron (registered as In1666) was described for the first time

Genomic context of the two integrons of ST-111 Pseudomonas aeruginosa AG1: a VIM-2-carrying old-acquaintance and a novel IMP-18-carrying integron

Authors:
Jose Arturo Molina Mora, M.Sc.*
Research Center in Tropical Diseases (CIET), University of Costa Rica, Costa Rica
Email:
jose.molinamora@ucr.ac.cr
* Corresponding author

Diana Chinchilla-Montero, M.Sc.
Research Center in Tropical Diseases (CIET), University of Costa Rica, Costa Rica
Email: dchinchilla@inciensa.sa.cr

Raquel García Batán, M.D.
Research Center in Tropical Diseases (CIET), University of Costa Rica, Costa Rica
Email: raquel.garcia@ucr.ac.cr

Fernando García, Ph.D.
Research Center in Tropical Diseases (CIET), University of Costa Rica, Costa Rica
Email: fernando.garcia@ucr.ac.cr

Abstract
Pseudomonas aeruginosa is an opportunistic and versatile organism responsible for infections among immunocompromised hosts. This pathogen has high intrinsic resistance to most antimicrobials. P. aeruginosa AG1 (PaeAG1) is a Costa Rican high-risk ST-111 strain with resistance to multiple antibiotics, including carbapenems due to the activity of both the VIM-2 and IMP-18 metallo-β-lactamases (MBLs). These genes are harbored in two class 1 integrons, each belonging to one of the 57 PaeAG1 genomic islands. However, the genomic context related to these determinants in PaeAG1 and other P. aeruginosa strains is unclear. Thus, we first assessed by RT-qPCR the transcriptional activity of the VIM-2 and IMP-18 genes upon exposure to imipenem (a carbapenem). To select genomes related to PaeAG1, we then implemented a pan-genome analysis to define and update the phylogenetic relationship among complete P. aeruginosa genomes. We also studied the presence of the PaeAG1 genomic islands in the related strains, and finally we described the architecture and possible evolutionary steps of the genomic regions around the VIM-2- and IMP-18-carrying integrons.
Expression of the VIM-2 and IMP-18 genes was shown to be induced after imipenem exposure.
In a subsequent comparative genomics analysis with 211 strains, the P. aeruginosa pan-genome revealed that complete genome sequences are able to separate clones by MLST profile, including a clear ST-111 cluster with PaeAG1. The PaeAG1 genomic islands were found to define a diverse presence/absence pattern among related genomes. Finally, landscape reconstruction of genomic regions showed that the VIM-2-carrying integron (In59-like) is an old-acquaintance element harbored in a known region that is completely found in two other ST-111 strains. In addition, PaeAG1 has an exclusive genomic region containing a novel IMP-18-carrying integron (registered as In1666), with an arrangement never reported before. Altogether, we provide new insights into the genomic determinants associated with resistance to carbapenems in this high-risk P. aeruginosa using comparative genomics.

Keywords: Pseudomonas aeruginosa AG1, ST-111, IMP-18, VIM-2, pan-genome, genomic islands

Abbreviations
FDR: False Discovery Rate
GI: Genomic island
GIC: Genomic islands cluster
IMP: Imipenemase
KEGG: Kyoto Encyclopedia of Genes and Genomes
MBLs: Metallo-β-lactamases
MLST: Multilocus sequence typing
PaeAG1: Pseudomonas aeruginosa AG1
ST: Sequence type
VIM: Verona integron-encoded MBLs
WHO: World Health Organization

1. INTRODUCTION
Pseudomonas aeruginosa is an opportunistic and versatile pathogen able to survive in a wide variety of environments (Klockgether et al., 2010). With a large genome (6-7.5 Mb), P.
aeruginosa strains devote a large proportion of the genome (>8%) to regulatory functions (Cabot et al., 2016), resulting in a diversity of metabolic capabilities and responses to stress (J. A. Molina-Mora et al., 2020; J. Molina-Mora et al., 2020). Due to these features, P. aeruginosa is responsible for infections among immunocompromised hosts (Lu et al., 2016) and nosocomial infections (Fernández, Corral-Lugo, & Krell, 2018). This pathogen has high intrinsic resistance to most antimicrobials used in therapeutic practice (Brazas, Brazas, Hancock, & Hancock, 2005), and many infections are caused by multidrug-resistant or extensively drug-resistant strains (Oliver, Mulet, López-Causapé, & Juan, 2015). This severely compromises the selection of appropriate treatments (X. Mulet et al., 2013), causing significant morbidity and mortality. According to the World Health Organization (WHO), resistance to carbapenems in P. aeruginosa, Acinetobacter baumannii and the Enterobacteriaceae family is considered a critical issue in the context of antibiotic resistance, and these organisms are classified in the Priority 1 group (World Health Organization, 2017).

P. aeruginosa AG1 (PaeAG1) is a P. aeruginosa strain isolated from an immunocompromised patient in a Costa Rican hospital in 2010 (Toval et al., 2015). This strain has resistance to β-lactams (including carbapenems), aminoglycosides, and fluoroquinolones, being sensitive only to colistin. It was the first reported P. aeruginosa isolate carrying both VIM-2 and IMP-18 genes, which encode metallo-β-lactamase (MBL) enzymes, both with carbapenemase activity (Toval et al., 2015). As shown in our previous works, including the genome assembly (GenBank CP045739) (J.-A.
Molina-Mora, Campos-Sánchez, Rodríguez, Shi, & García, 2020), these genes belong to two independent class 1 integrons, each contained in one of the 57 predicted genomic islands of PaeAG1 (J.-A. Molina-Mora et al., 2020; Toval et al., 2015). Other elements such as six phages, mobile genetic elements and some virulence factors are also harbored in genomic islands. Ciprofloxacin exposure in PaeAG1 induces a very complex phage activity, affecting growth despite the strain being sensitive to this antibiotic (J. A. Molina-Mora et al., 2020). In addition, PaeAG1 has a non-functional CRISPR-Cas system, and molecular genotyping by multilocus sequence typing (MLST) classifies PaeAG1 as a high-risk sequence type 111 (ST-111) strain.

ST-111 is a lineage that belongs to the O12 serotype and is characterized by a multi-resistance profile and the ability to colonize nosocomial environments (X. Mulet et al., 2013; Turton et al., 2015; Witney et al., 2014; Woodford, Turton, & Livermore, 2011). Together with the ST-235 and ST-175 genotypes, ST-111 belongs to the high-risk group in P. aeruginosa (Oliver et al., 2015). High-risk clones are frequently associated with epidemics where multidrug resistance confounds treatment (Petitjean et al., 2017).

In this context, P. aeruginosa high-risk clones are considered part of a non-clonal epidemic population structure (Oliver et al., 2015; Petitjean et al., 2017), many carrying genomic determinants such as carbapenemases or extended-spectrum β-lactamases (Oliver et al., 2015). Carbapenemases include Ambler class A enzymes such as KPC and GES variants, Ambler class B MBLs (IMP, VIM, SPM, GIM, NDM and FIM types), and Ambler class D (OXA variants) enzymes (Farajzadeh Sheikh et al., 2019; Hong et al., 2015). In Costa Rica, isolation of carbapenem-resistant P.
aeruginosa strains is relatively common in some major hospitals, as we reported, most of them carrying VIM or IMP alleles, with a prevalence of up to 63.1% (Toval et al., 2015). This is much higher than the frequencies observed in other countries (Hong et al., 2015).

VIM and IMP genes, as well as other MBLs, are frequently found as part of gene cassettes carried by integrons (Walsh, 2005; Zhao & Hu, 2011), leading to the dissemination of multidrug resistance among Gram-negative bacteria (Jones-Dias et al., 2016). Thus, there is a growing interest in reconstructing the genomic context of mobile elements (in particular integrons) to gain insights into bacterial evolution and its association with human activities, as well as to identify possible ways to mitigate antibiotic resistance (Ghaly, Chow, Asher, Waldron, & Gillings, 2017). However, the genomic context of integrons in P. aeruginosa high-risk clones has been examined in only a few studies (Chowdhury et al., 2016).

In this sense, comparative genomic strategies can provide insights not only into gene content, architecture and evolutionary details, but also into the dynamics of mobile genetic elements, pathogenicity determinants, and other features (Peter et al., 2019). Several genome-level studies have described the molecular diversity of P. aeruginosa (including high-risk clones) using different comparative approaches (Xavier Mulet et al., 2013; Petitjean et al., 2017; Turton et al., 2015).

Since PaeAG1 has special genomic features regarding antibiotic multi-resistance, including VIM-2 and IMP-18 genes with carbapenemase activity, 57 genomic islands and an ST-111 profile, we hypothesized that comparative genomics can reveal insights into the evolution and landscape of the genomic regions around the MBL-carrying integrons of PaeAG1.
Thus, the aim of the study was to compare the PaeAG1 genome against other P. aeruginosa sequences using comparative genomics to describe phylogenetic relationships, genomic island content and the architecture of the genomic regions associated with the VIM-2- and IMP-18-carrying integrons of PaeAG1. We first demonstrated that VIM-2 and IMP-18 are functional genes that can be induced after treatment with imipenem (a carbapenem antibiotic). We then analyzed all the complete P. aeruginosa genomes using a pan-genome approach to identify genomes related to PaeAG1, revealing that whole genome sequences are able to separate clones by MLST profile (ST). Afterward, PaeAG1 genomic islands were searched for in the related genomes, including all the ST-111 genomes, and diverse presence/absence patterns were found. Finally, the specific genomic regions associated with the two integrons were reconstructed and characterized to compare gene content and architecture in close genomes. The genomic region associated with the VIM-2-carrying integron (In59-like) was found complete in two other ST-111 strains (i.e. it is an old-acquaintance integron), whereas an IMP-18-carrying integron (registered as In1666), with an architecture never reported before, was found when the landscape of the related genomic region was described.

2. MATERIALS AND METHODS

2.1 Bacterial isolate
The PaeAG1 strain is a Costa Rican isolate with resistance to β-lactams (including carbapenems, imipenem MIC >32 µg/mL), aminoglycosides, and fluoroquinolones, being sensitive only to colistin. We recently assembled and annotated the PaeAG1 genome (J.-A. Molina-Mora et al., 2020), and the data are available in GenBank under accession CP045739 (BioProject PRJNA587210).
2.2 RT-qPCR for VIM-2 and IMP-18 expression after imipenem exposure
To study the expression of the VIM-2 and IMP-18 genes upon imipenem exposure in PaeAG1, growth curve and RT-qPCR experiments were performed.

Growth curves assay: Three aliquots of pre-cultured PaeAG1 cells were added to fresh Lysogeny Broth (LB) to an initial optical density (OD600nm) of 0.01. Each aliquot was treated with 0.0 (control), 25.0 or 50.0 µg/mL of imipenem. Growth was monitored at 0, 2, 4, 6, 8, 12 and 16 hours. The assay was performed in triplicate. Two aliquots, at 6 and 12 hours, were taken for the RT-qPCR assay, as follows.

RNA isolation: Aliquots taken 6 and 12 hours after imipenem exposure were preserved using the RNAprotect reagent (QIAGEN). Total RNA was extracted using the RNeasy Mini kit (QIAGEN, UK) following the manufacturer's instructions. Subsequently, RNA was transcribed into cDNA with the Maxima H Minus First Strand cDNA Synthesis kit ( ). At the different steps, the quality and quantity of extracted RNA or cDNA were determined using a Nanodrop (Nanodrop 2000, ).

Primer sequences: Primer sequences for the target VIM-2 and IMP-18 genes and the reference gene rpoD were taken from the literature (Kim, Kim, & Choi, 2003; Mendes et al., 2007; Savli et al., 2003). See

RT-qPCR: The standard curve method was implemented to quantify the expression of target and reference genes. Thermocycling was performed on the StepOnePlus Real-Time PCR System ( Inc.). For the VIM-2 and IMP-18 genes, the assay was run with a denaturation at 95°C (10 min) and 35 amplification cycles of 94°C (20 s), 53°C (45 s), and 72°C (30 s), with data acquisition at 72°C.
For the rpoD gene, conditions were denaturation at 95°C (10 min) and 45 amplification cycles of 95°C (15 s), 20°C (10 s), and 72°C (15 s), with data acquisition at 72°C. Melt curve data were used to determine whether only the correct product had been amplified.

Relative gene expression analysis: Gene expression of VIM-2 and IMP-18 in the experimental samples was normalized to the rpoD housekeeping gene. The data were analyzed using the delta-delta Ct method (12). The change in gene expression within samples (time and antibiotic concentration) was calculated relative to the control (without imipenem), and a two-way ANOVA test was performed between conditions (95% confidence level).

2.3 Datasets of complete P. aeruginosa genome sequences
To compare all the complete genomic sequences of P. aeruginosa by a pan-genome analysis, metadata (including strain names, alternative ID, gene content, MLST profile, and others) and sequence ( formats) files were retrieved from the Pseudomonas Genome Database (PGDB, https://pseudomonas.com).

2.4 Comparative genomic analysis by a pan-genome approach
Since differences in annotation were identified for many sequences, even in exactly the same genomic regions, we decided to identify and annotate genes in all the complete genomic sequences using the same approach. To achieve this, gene prediction and annotation were done using Prokka v1.13.3 (with --genus Pseudomonas --species aeruginosa and default values for the other parameters) (Seemann, 2014). The Prokka annotation files (in gbk ) were used for the phylogenetic analysis by a pan-genome approach based on gene content in the Roary program v3.12.0 (Page et al., 2015). The resulting tree was visualized using the Interactive Tree Of Life tool (iTOL, https://itol.embl.de/) v5 (Letunic & Bork, 2019), and strain names and MLST profiles were incorporated for each strain.
For strains with unknown MLST, the profile was verified using the complete genome sequence approach (Larsen et al., 2012) in the MLST tool v2.0 (https://cge.cbs.dtu.dk/services/MLST/). For a functional analysis of all core genes, STRINGdb (https://string-db.org/) was used to identify significantly enriched KEGG pathways (false discovery rate cutoff, FDR < 0.05).

2.5 Comparative analysis of the presence of PaeAG1 genomic islands in other strains
The 57 PaeAG1 genomic islands were previously identified using IslandViewer v4 (www.pathogenomics.sfu.ca/islandviewer/), as we reported recently (J.-A. Molina-Mora et al., 2020). The genomic island coordinates were downloaded from the same platform, and the corresponding sequences were obtained using the getfasta function in bedtools software v2.29.2 (Quinlan & Hall, 2010). The distribution of genomic islands along the genome was visualized using the BLAST Ring Image Generator (BRIG) tool v0.95 (Alikhan, Petty, Ben Zakour, & Beatson, 2011).

To determine the presence and frequency of these genomic islands in other strains, a comparative analysis based on sequence alignment was done. We implemented a BLASTn pipeline to align the PaeAG1 genomic island sequences against the complete genome sequences of all strains. A genomic island was considered present in a strain when the alignment reached a minimum coverage of 95% (overlap between query and subject sequences) and a minimum sequence identity of 80%; otherwise, it was considered absent. The final comparison of presence/absence of genomic islands was done for selected strains (see Results) using a small phylogenetic tree and a heatmap, which were visualized using the phylo.heatmap function from the phytools package v0.7-20 (https://www.rdocumentation.org/packages/phytools) in the R software (https://www.r-project.org/).
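The presence/absence rule described in section 2.5 can be illustrated with a minimal Python sketch. It is not the pipeline used in the study: it assumes that each BLASTn result has already been reduced to one alignment per island and genome, and that coverage is measured as the aligned fraction of the island (query) length. The Hit class, the function names and the toy values are hypothetical and serve only to illustrate the thresholds.

```python
from dataclasses import dataclass

MIN_COVERAGE = 0.95   # fraction of the island length that must align (from the text)
MIN_IDENTITY = 80.0   # minimum percent identity of the alignment (from the text)


@dataclass
class Hit:
    """One alignment of a genomic island (query) against a genome (subject).

    Hypothetical container; field names are illustrative only.
    """
    island: str
    strain: str
    identity: float  # percent identity
    q_start: int     # 1-based query coordinates, inclusive
    q_end: int


def is_present(hit: Hit, island_length: int) -> bool:
    """Apply the coverage and identity thresholds to a single alignment."""
    covered = abs(hit.q_end - hit.q_start) + 1
    return (covered >= MIN_COVERAGE * island_length
            and hit.identity >= MIN_IDENTITY)


def presence_matrix(hits, island_lengths, strains):
    """Return {island: {strain: 0 or 1}}; islands without a passing hit stay 0."""
    matrix = {gi: {s: 0 for s in strains} for gi in island_lengths}
    for hit in hits:
        if is_present(hit, island_lengths[hit.island]):
            matrix[hit.island][hit.strain] = 1
    return matrix


# Toy data, invented for illustration (not results from the study).
lengths = {"GI_a": 1000, "GI_b": 2000}
toy_hits = [
    Hit("GI_a", "strain1", 99.9, 1, 1000),  # full coverage, high identity
    Hit("GI_a", "strain2", 99.0, 1, 900),   # coverage 90%: below threshold
    Hit("GI_b", "strain1", 75.0, 1, 2000),  # identity 75%: below threshold
]
toy_matrix = presence_matrix(toy_hits, lengths, ["strain1", "strain2"])
```

A binary matrix of this shape is the kind of input that a presence/absence heatmap next to a phylogenetic tree requires; fragmented alignments (several partial hits for one island) would need to be merged before applying the rule, which this sketch does not attempt.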
2.6 Landscape of genomic regions associated with the two class 1 integrons of PaeAG1
Two complete and independent class 1 integrons were previously identified in PaeAG1, one carrying the VIM-2 gene and another harboring the IMP-18 gene (J.-A. Molina-Mora et al., 2020). To better understand the possible evolutionary history of these integrons and their potential for lateral transfer, we reconstructed the genetic landscape of the genomic regions around these elements. The identity of the integrons was investigated using the INTEGRALL database (http://integrall.bio.ua.pt). For the new integron (see Results), the same database was used for the registry and the integron number assignment.

Since the two integrons are absent in the reference strain Pae-PAO1, an alignment of the genomic regions (BLASTn) and another of the amino acid (AA) sequences (BLASTp) were used to identify the limits of the complete inserted region in PaeAG1. Each of the two inserted regions was composed of two or more consecutive genomic islands (grouped or with overlapping regions), as obtained in our previous study (J.-A. Molina-Mora et al., 2020). Thus, the regions were called GICVIM-2 (genomic island cluster containing the VIM-2-carrying integron) and GICIMP-18 (genomic island cluster harboring the IMP-18-carrying integron).

Once the insertions were delimited in PaeAG1 and the insertion point in the reference genome was identified, we expanded the loci to cover three coding genes on each side. A final alignment (BLASTn) of the expanded GICVIM-2 and GICIMP-18 regions was done against selected genomes. Genomes were selected based on the phylogenetic relationships of strains close to PaeAG1 (pan-genome analysis) and the presence/absence profile of the PaeAG1 genomic islands in other strains.
All the syntenic regions of the selected strains were compared using annotation files in Easyfig software v2.2.3 (Sullivan, Petty, & Beatson, 2011), which made it possible to visualize alignments, gene content and identity, exclusive or shared elements by strain, and possible evolutionary steps, among other features.

3. RESULTS

3.1 Expression of VIM-2 and IMP-18 genes is induced after imipenem treatment in PaeAG1
To assess the functional activity of the VIM-2 and IMP-18 genes, RT-qPCR was performed. Exposure to imipenem had no effect on the growth curves of PaeAG1 (Fig. 1-A). Evaluation of gene expression after exposure to imipenem (Fig. 1-B-C) showed that VIM-2 and IMP-18 increased their expression by at least 1.7-fold (relative to the control) at 6 hours, but only 1.1-fold at 12 hours. This observation was independent of the imipenem concentration (25 or 50 µg/mL): according to the two-way ANOVA, changes in relative expression by time, but not by concentration, were significant for each gene.

3.2 Pan-genome analysis with the complete genome sequences defines P. aeruginosa clusters which correlate with the MLST genotyping profile
To select genomes related to PaeAG1, a total of 211 strains (including PaeAG1) were selected to compare their genomic composition. Supplementary file 1 All_strains_information.xlsx contains the list of all the selected genomes, ID, strain, MLST profile, and other details. Gene content was compared using a pan-genome approach. A total of 2726 genes were identified as part of the core-genome (present in >99% of strains). More details of the results and complementary plots are provided in Supplementary file 2 Pan-genome analysis results.xlsx.
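As an illustration of the core-genome criterion, the following Python sketch selects the genes carried by at least a given percentage of strains from a gene presence/absence table. It is a simplified stand-in and not the Roary implementation; the function name, the dictionary layout and the toy data are assumptions, and the threshold is treated as inclusive. For 211 genomes this makes no difference, since 99% of 211 is 208.89 and either reading requires at least 209 strains.

```python
def core_genes(presence, n_strains, min_percent=99):
    """Return the sorted genes carried by at least min_percent of the strains.

    presence: dict mapping gene name -> set of strain names carrying the gene
    (a hypothetical layout chosen for this sketch).
    Integer arithmetic is used to avoid floating-point edge cases.
    """
    return sorted(
        gene for gene, carriers in presence.items()
        if len(carriers) * 100 >= min_percent * n_strains
    )


# Toy data, invented for illustration (not results from the study).
toy_strains = [f"s{i}" for i in range(200)]
toy_presence = {
    "geneA": set(toy_strains),        # 200/200 strains: core
    "geneB": set(toy_strains[:199]),  # 199/200 = 99.5%: core
    "geneC": set(toy_strains[:100]),  # 100/200 = 50%: accessory
}
toy_core = core_genes(toy_presence, len(toy_strains))
```

Genes failing the criterion form the accessory genome, which is where the genomic islands discussed in the following sections are expected to fall.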
Enrichment analysis of KEGG pathways for all core genes (Table S1) found 42 biological processes implicated in several metabolic routes related to energy (carbon, fatty acids, amino acids), DNA and RNA, ribosomal activity, protein synthesis, and others.

As shown in Fig. 2, similarity in genomic composition by pan-genome analysis defines a phylogenetic tree able to separate groups that can in turn be described by the MLST genotyping profile. Although we identified a total of 67 different MLST profiles (plus unknown cases), many of them occurred at low frequency. For example, 35 different ST classes had only a single strain (35 strains, 17% of all genomes), and 88 strains (42%) belonged to the 56 ST profiles with fewer than five genomes. In addition, 44 strains (21%) had an allelic composition with an unknown ST profile. On the other hand, a total of 79 genomes (37%) corresponded to 11 ST classes with five or more strains. The latter were highlighted using a different color for each ST profile (as shown in Fig. 2), whereas strains belonging to low-frequency ST profiles were all given the same color. Representative genomes such as the reference strain Pae-PAO1 (ST-549, purple cluster) and Pae-UCBPP-PA14 (ST-253, yellow group) were identified in the main ST groups.

Regarding PaeAG1, this strain was located in the same group as the other nine ST-111 strains, in a clearly separated cluster (green). Two other ST profiles (low-frequency ST-234 and ST-654) and one strain of unknown profile (Pae-Pa84) clustered close to this group. The whole group of these related strains, together with the reference strain Pae-PAO1, was used for subsequent analyses, including their phylogenetic relationships.
For other high-risk clones, a single ST-175 genome was identified, and a clear cluster was found for the ten ST-235 genomes (including other genomes with unknown profile).

3.3 Varying profiles of the presence/absence of the 57 PaeAG1 genomic islands are found in the ST-111 strains and related genomes
A comparative analysis based on sequence alignment was run to determine the presence and frequency of the PaeAG1 genomic islands in other phylogenetically related strains. The genomic island loci were previously predicted (J.-A. Molina-Mora et al., 2020). We first represented the distribution of the genomic islands along the PaeAG1 genome, as presented in Fig. 3. Many of the islands were located together, either overlapping or arranged consecutively. We therefore use the term genomic islands cluster (GIC) to refer to such a group of islands. In Fig. 3, GICs correspond to the genomic regions labeled with the joined names of the genomic islands, for example GI48-49 for islands GI48 and GI49. In some cases, each genomic island in a cluster can be differentially distributed among the genomes (for example, GI48 is present in PaeAG1 and Pae-97, but GI49 is only found in PaeAG1, Fig. 4). For this reason, we neither redefined the loci nor merged the islands.

The analysis of the presence/absence of PaeAG1 genomic islands in other ST-111 strains and related genomes is shown in Fig. 4. Profiles for all 211 genomes are available in Supplementary file 1 All_strains_information.xlsx, including total counts of strains by genomic island and total genomic islands per genome. The closest genomes to PaeAG1 (Pae-RIVM-EMC2982 and Pae-Carb0163) had the most similar profiles of genomic island content (carrying 41 genomic islands), but different patterns were obtained for other ST-111 strains.
None of the islands is present in the reference genome Pae-PAO1, and only a few genomic islands are present, rarely, in non-ST-111 strains.

On the other hand, two genomic islands stand out because they carry the two PaeAG1 integrons. The GI27 genomic island harbors the VIM-2-carrying integron, while the IMP-18-carrying integron belongs to GI49. As shown in Fig. 4, GI27 (red) is present in PaeAG1 and two other ST-111 strains, and it is absent from the remaining 208 genomes. GI49 (blue) is unique to PaeAG1 and is absent from all the other 210 strains in the study. Additionally, each of these genomic islands is associated with a GIC, GI27-30 and GI48-49 respectively (Fig. 3). Given the importance of these genomic regions for studying the integrons, we specifically called them GICVIM-2 (genomic island cluster containing the VIM-2-carrying integron) and GICIMP-18 (genomic island cluster harboring the IMP-18-carrying integron).

Based on phylogenetic relationships, ST profile and genomic island content, we selected specific genomes to compare the GICs associated with the integrons. As shown in Fig. 4, the four genomic islands of GICVIM-2 (GI27-30) are differentially present in the genomes. For example, GI28 and GI29 are present in eight strains, but GI27 in three and GI30 in four. To specifically compare the genomic regions of GICVIM-2, we used the reference Pae-PAO1, Pae-RIVM-EMC2982 (with the four genomic islands), and Pae-AR445 (with three of the genomic islands). In the case of GICIMP-18, the two islands GI48 and GI49 are absent in other ST-111 strains, but GI48 is present in Pae-97. Apart from this case, no other strain among the 211 genomes was found to harbor these islands.
To compare the genomic regions, the reference genome Pae-PAO1, Pae-RIVM-EMC2982 as the closest genome, and Pae-97 (the only genome sharing a section of the GIC) were used.

3.4 GICVIM-2 is a known region containing the old-acquaintance VIM-2-carrying integron in PaeAG1
With the aim of describing the possible evolutionary history of the VIM-2-carrying integron in PaeAG1, we described the architecture of the genomic region delimited by GICVIM-2 (including three flanking genes on each side: 35,798 bp and 32 protein-coding genes). Using Pae-PAO1 as reference, we found that the genomic insertion occurred in the middle of the PA2229 gene, as shown at the top of Fig. 5. The insertion was largely present in Pae-AR445 (coverage 94% and identity 99.97% relative to the PaeAG1 region), but without most of the integron (the integrase intI1 and sul1 are present, unlike the gene cassette including VIM-2). In contrast, the full region was identified in Pae-RIVM-EMC2982, with 100% coverage and 99.99% identity. The only two variants identified in the full region were non-synonymous mutations, with an amino acid change in PaeAG1_03254 (transcriptional regulator merD, 99.0% identity) and PaeAG1_03255 (mercuric reductase merA, 99.8% identity). See Table 2 and supplementary Table S2 for more details.

Although not shown in Fig. 5, the alignment was also done for Pae-Carb0163, which has the same profile of genomic island content as Pae-RIVM-EMC2982. In this case, 100% coverage and 99.87% identity (45 variants) were obtained in the GICVIM-2 region; most of the variants resulted in a change in the amino acid sequence of PaeAG1_03245 (aacA29a, part of the integron, with 95.8% identity, resulting in the aacA29e allele), but three other proteins were also affected (mercuric reductase, integrase IntI and a transposase). See supplementary Table S2 for more details.
Regarding the gene content (Table 2), this genomic insertion contains the complete integron carrying the VIM-2 gene. The composition of this integron is described in Fig. 5 (bottom); it contains the classical elements of a class 1 integron, intI1, attI, sul1 and the gene cassette (with aacA29a-b and VIM-2), and is classified as In59-like. Furthermore, GICVIM-2 harbors other mobile genetic elements, including transposase and recombinase modules. Other coding modules are associated with mercury metabolism or remain unknown (hypothetical proteins). Details of the protein alignment of PaeAG1 against four genomes are also provided (supplementary Table S2). The reconstruction of the evolutionary steps related to the formation of this genomic region involves four transposons (Tn402, Tn21-like, and one disrupted and one complete Tn4661), as shown in Fig. 7-A. See details in Discussion.

Considering the full coverage and very high identity in at least two genomes, Pae-RIVM-EMC2982 and Pae-Carb0163, GICVIM-2 can be considered a genomic region present in two well-known VIM-2+ strains, with this gene located in an old-acquaintance class 1 integron (In59-like).

3.5 GICIMP-18 is a PaeAG1 exclusive genomic region harboring a new IMP-18-carrying integron
In a similar way, we compared four genomes to describe the architecture of the genomic region delimited by GICIMP-18 (including three flanking genes on each side: 30,258 bp and 29 protein-coding genes). Using Pae-PAO1 as reference, we found that the genomic insertion occurred between the genes PA4704 and PA4705, as shown at the top of Fig. 6. Genomic islands GI48-49 are absent in the Pae-RIVM-EMC2982 and Pae-Carb0163 genomes (the latter not shown in the figure).

A BLAST search with GICIMP-18 identified the highest-scoring sequence in the Pae-97 genome (ST-234).
Thus, since Pae-97 carries GI48, the syntenic comparison was done using this genome (Fig. 6). The analysis revealed 77% coverage with 99.92% identity. The Pae-97 integron also contains intI1, aacA genes and another allele of the IMP gene (IMP-1), all in a different arrangement.

Regarding gene content (Table 3), this genomic insertion contains the complete integron carrying the IMP-18 gene. The composition of this integron is described in Fig. 6 (bottom); it contains intI1, attI, sul1 and the gene cassette (IMP-18, gcuD, OXA-2 and aacA4). GICIMP-18 also has genes coding for endonucleases and recombinases, as well as hypothetical proteins. Details of the protein alignment of PaeAG1 against the four genomes are also provided (see supplementary Table S3).

Considering the absence of the complete region in other genomes and the fact that the architecture of this integron is reported here for the first time, GICIMP-18 can be considered a PaeAG1-exclusive region harboring a new IMP-18-carrying integron. This integron was registered as In1666 in the INTEGRALL database. The formation of the GICIMP-18 region seems to involve at least three mobile elements (the new integron In1666, insertion sequence IS1326 and transposon TnAs3), as shown in Fig. 7-B. However, the lack of information about the role of other elements (regions without matching sequences) makes it difficult to complete the possible evolutionary steps related to this genomic region.

In summary, the pan-genome analysis led us to identify that genomic content can separate groups according to the ST profile (MLST genotyping). All the ST-111 strains, including PaeAG1, fell in the same phylogenetic group, but different presence/absence profiles of the PaeAG1 genomic islands were identified in other strains, even for grouped genomic islands, the GICs.
Analysis of the landscape of the GICVIM-2 and GICIMP-18 regions revealed one known and one new arrangement of genomic sequences in PaeAG1, harboring two independent MBL-carrying integrons. The IMP-18-carrying integron has a unique and exclusive composition, reported here for the first time.

4. DISCUSSION

Antibiotic multi-resistance is a major threat to public health because of its continuous emergence, worldwide spread, and increasing prevalence (Hong et al., 2015). With a high-risk ST-111 profile, PaeAG1 is a critical organism due to its resistance to multiple antibiotics, in particular carbapenems (World Health Organization, 2017). In our study, we first demonstrated that expression of the VIM-2 and IMP-18 genes (with carbapenemase activity) is induced after imipenem exposure, indicating that they are functional genes. To describe the genomic context associated with these MBLs, we performed a pan-genome analysis, a comparison of genomic islands between representative strains and the reconstruction of the surrounding genomic regions.

In the pan-genome analysis, we were able not only to reveal that whole genome sequences can separate clones by ST profile (MLST), but also to identify core and accessory genes. Other pan-genome analyses in P. aeruginosa also found clusters that could be identified by the ST profile (Aguilar-Rodea et al., 2017; Weiser et al., 2019). While multiple comparative genomic analyses (many using a pan-genome approach) have been reported for P.
aeruginosa (Aguilar-Rodea et al., 2017; Chowdhury et al., 2016; Freschi et al., 2019; Gomila, Peña, Mulet, Lalucat, & García-Valdés, 2015; Hilker et al., 2015; Mosquera-Rendón et al., 2016; Ozer, Allen, & Hauser, 2014; Poulsen et al., 2019; Valot et al., 2015; Weiser et al., 2019; Wendt & Heo, 2016), most of them include incomplete, fragmented or draft genomes, or sequences of a few genes. In 2015, complete genomes were used in a similar approach, but only 17 genomes were available (NCBI), of which only three corresponded to high-risk clones (Valot et al., 2015). Thus, our analysis provides an update of the general status of the relationships among the 211 available complete genomes by pan-genome analysis.

In relation to gene content among all strains, we identified a total of 2726 genes as part of the core-genome (>99% of strains), similar to the result of a comparable approach (Mosquera-Rendón et al., 2016). Other studies have suggested a higher number of core genes (4000-5300) (Hilker et al., 2015; Ozer et al., 2014; Valot et al., 2015; Weiser et al., 2019). The relatively high number of conserved genes in the core-genome can be associated with the ability to conquer multiple environments and to facilitate infectious capability towards a large set of hosts (Valot et al., 2015). According to the functional analysis, 42 KEGG pathways (energy metabolism, nucleic acids, amino acids, ribosomal activity, and many others) were found among the enriched routes for all the core genes, with functions that are in line with other similar pan-genome studies (Mosquera-Rendón et al., 2016; Valot et al., 2015).

The P.
aeruginosa genome is composed of a mosaic structure including the large core-genome (Valot et al., 2015), into which regions of genomic plasticity lead to the insertion of blocks of genes belonging to the accessory genome (Mathee et al., 2008). In the case of PaeAG1 and other ST-111 strains, the genome sequence is around 1.0 Mb longer than the reference genome Pae-PAO1, a difference that is reflected as genomic islands distributed along the genome.

Pae-RIVM-EMC2982 and Pae-Carb0163 (the closest genomes to PaeAG1) had the most similar profiles, carrying 41 of the 57 genomic islands. As highlighted in Results, many genomic islands formed clusters (GICs, Fig. 3 and 4), including the genomic island clusters harboring the two integrons (GICVIM-2 and GICIMP-18). Groups of genomic islands have been reported before as integrative and conjugative elements or ICEs (Petitjean et al., 2017), but the ICEs predicted in PaeAG1 (using the ICEberg 2.0 platform, https://db-mml.sjtu.edu.cn/ICEfinder/ICEfinder.html) overlap with other GICs, and none overlaps with GICVIM-2 or GICIMP-18. Since the size and content of the core-genome are not well established (Valot et al., 2015), prediction methods are required to define accessory regions, but the outcome depends on the algorithm (Ozer et al., 2014), which could explain the differences in the predicted GICs.

On the other hand, this prominent number of genomic islands in PaeAG1 and other ST-111 strains can be explained by the absence of a functional CRISPR-Cas system (a bacterial defense system against foreign DNA) and the consequent high number of successful horizontal gene transfer events (Petitjean et al., 2017). This genome plasticity of individual strains represents an advantage for P. aeruginosa to meet the needs for survival in virtually any environment (Mathee et al., 2008).
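The core/accessory partition discussed above depends on a frequency cut-off, which is one reason why core-genome sizes differ between studies. The following is a minimal sketch of that partition, not the pipeline used in this work: the function name, the gene names and the toy counts are hypothetical, and the 0.99 fraction is an assumed threshold chosen to mirror the ">99% of strains" criterion mentioned in the text.

```python
from typing import Dict, Set, Tuple


def split_core_accessory(
    presence: Dict[str, Set[str]],
    n_genomes: int,
    core_fraction: float = 0.99,
) -> Tuple[Set[str], Set[str]]:
    """Partition gene clusters into core and accessory sets.

    `presence` maps a gene-cluster name to the set of genome IDs carrying it.
    A cluster is called core when it occurs in at least `core_fraction` of
    the genomes (assumed cut-off; pan-genome tools differ on this point).
    """
    if n_genomes <= 0:
        raise ValueError("n_genomes must be positive")
    core: Set[str] = set()
    accessory: Set[str] = set()
    for gene, genomes in presence.items():
        if len(genomes) / n_genomes >= core_fraction:
            core.add(gene)
        else:
            accessory.add(gene)
    return core, accessory


# Toy data only (hypothetical gene names and counts, not results of the study).
N_GENOMES = 211
all_ids = [f"genome_{i}" for i in range(N_GENOMES)]
toy_presence = {
    "geneA": set(all_ids),         # present in 211 of 211 genomes
    "geneB": set(all_ids[:209]),   # present in 209 of 211 genomes
    "geneC": set(all_ids[:208]),   # present in 208 of 211 genomes
    "blaVIM-2": set(all_ids[:3]),  # present in 3 of 211 genomes
}
core, accessory = split_core_accessory(toy_presence, N_GENOMES)
print(sorted(core))
print(sorted(accessory))
```

With 211 genomes, a 0.99 cut-off means a gene must be present in at least 209 of them. Lowering `core_fraction` moves genes from the accessory to the core set, which illustrates how the choice of threshold alone can change the reported core-genome size.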

In the context of carbapenem resistance, genes encoding MBLs are usually found as gene cassettes in class 1 integrons (Jones-Dias et al., 2016; Walsh, 2005). This allows rapid dissemination in the clinical setting due to the selective pressure exerted by antibiotic use (Sánchez-Martinez et al., 2010), which is aggravated because carbapenems represent the last therapeutic resource to treat P. aeruginosa infections (Toval et al., 2015). While multiple studies correlate antibiotic resistance with the presence of integrons, the genetic context surrounding class 1 integrons is often not investigated in P. aeruginosa, as remarked before (Chowdhury et al., 2016).

Carbapenem resistance in PaeAG1 was shown to be explained by the activity of two MBLs (VIM-2 and IMP-18) (Toval et al., 2015), each gene harbored in an independent class 1 integron (J.-A. Molina-Mora et al., 2020; Toval et al., 2015).

Evaluation of the sequence showed that GICVIM-2 is also present in Pae-RIVM-EMC2982 (100% coverage and 99.99% identity) and Pae-Carb0163 (100% coverage and 99.87% identity) at the chromosomal level. However, a study including these strains showed that the VIM-2-carrying integron and surrounding regions (~30 kb, equivalent to GICVIM-2) were shared with a plasmid of ST-446 P. aeruginosa S04-90 with 99% identity. Based on this identity, mobilization of the fragment between plasmids and chromosomes may have occurred recently (van der Zee et al., 2018).

In the same study, analysis of the genome landscape showed that the region (equivalent to GICVIM-2) corresponded to a DNA segment acting as a composite transposon, composed of four different transposons (Tn402, Tn21-like, a disrupted Tn4661 and a complete Tn4661). The class 1 integron carrying VIM-2 is contained in the Tn402 transposon (Gillings, 2017; van der Zee et al., 2018).
Evolutionary details are fully described by van der Zee et al. (2018). GICVIM-2 carries the genes involved in its own transposition module (transposases such as TniB and TnpA) and a mercury resistance module, as described in other similar transposons and insertion sequences (Chowdhury et al., 2016; Ghaly et al., 2017; Jones-Dias et al., 2016; Liebert, Hall, & Summers, 1999; van der Zee et al., 2018). The presence of gene cassettes unrelated to antibiotic resistance can be the result of anthropogenic settings (Ghaly et al., 2017) and selection pressures in environments polluted with heavy metals and other substances such as mercury, arsenic and disinfectants (Gillings et al., 2015).

Regarding the VIM-2-carrying integron, this element is an In59-like integron. In59 was first reported two decades ago in France (Poirel et al., 2001) and then worldwide (Gillings, 2017; Samuelsen et al., 2010; Toval et al., 2015; van der Zee et al., 2018). Among the 211 strains in our study, VIM-2 was only present in PaeAG1 and the two closest genomes (all ST-111). Differences in aacA29 genes defined the aacA29e allele found in Pae-Carb0163 (van der Zee et al., 2018), in contrast to aacA29a-b in PaeAG1, all coding for aminoglycoside acetyltransferases. Since the GICVIM-2 sequence and architecture are completely found in two other VIM-2+/ST-111 strains, the VIM-2-carrying integron (In59-like) can be considered an old-acquaintance element in a well-known genomic context.

Additionally, the genomic context defined by GICIMP-18 was also analyzed. Using Pae-PAO1 as reference, the GICIMP-18 insertion was found to have occurred at a specific point (prrH) between PA4704 and PA4705 (Fig. 6).
This region contains three genes for regulatory small RNAs (prrF1, prrH and prrF2), which are involved in iron homeostasis under iron-depleted conditions (Reinhart et al., 2017) or in avoiding iron toxicity (Reinhart et al., 2015).

While the complete GICIMP-18 (composed of the GI48-GI49 genomic islands) was not found in any of the other strains, the GI48 section was found in the Pae-97 strain (ST-234, with a class 1 integron), a genome close to the ST-111 group (Fig. 2 and 3). Sequence comparison of GICIMP-18 and Pae-97 showed 77% coverage and 99.92% identity. The gene composition of GICIMP-18 includes an endonuclease and recombinase module, the class 1 integron, the transposase TniB and hypothetical proteins.

In relation to the integron harbored in GICIMP-18, the IMP-18-carrying element is composed of intI1, the gene cassette (carrying IMP-18, gcuD and OXA-2), aacA4 and sul1. In another strain, similar genes with a different arrangement (in order: IMP-18, a disrupted aacA43, OXA-2 and gcuD) were reported for the first time in the In706 integron in 2012 (Martínez, Vazquez, Aquino, Goering, & Robledo, 2012). Pae-97 contains a class 1 integron, but with a different arrangement carrying the IMP-1 allele (without OXA or gcuD genes). Other studies found multiple strains carrying both IMP-18 and OXA-2 (without gcuD or aacA4) in Mexican isolates as part of the In169 (Sánchez-Martinez et al., 2010) and In1215 (López-García et al., 2018) integrons, including some located in plasmids.
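The coverage and identity values quoted above for GICVIM-2 and GICIMP-18 summarize pairwise sequence comparisons. The sketch below shows one common way such summaries can be derived from a set of local alignment blocks. It is an illustration under stated assumptions, not the procedure used in this study: the `Hit` record, the function name and the toy numbers are hypothetical, and identity is computed as a simple aggregate in which overlapping blocks are counted once each.

```python
from typing import List, NamedTuple, Tuple


class Hit(NamedTuple):
    """One local alignment block (hypothetical record layout)."""
    q_start: int  # 1-based inclusive start on the query region
    q_end: int    # 1-based inclusive end on the query region
    matches: int  # identical positions within the block
    aln_len: int  # alignment length of the block, gaps included


def coverage_and_identity(hits: List[Hit], query_len: int) -> Tuple[float, float]:
    """Return (query coverage %, aggregate identity %) for a set of blocks."""
    if query_len <= 0:
        raise ValueError("query_len must be positive")
    # Merge overlapping query intervals so shared positions are counted once.
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted((h.q_start, h.q_end) for h in hits):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    total_aln = sum(h.aln_len for h in hits)
    identity = 100.0 * sum(h.matches for h in hits) / total_aln if total_aln else 0.0
    return 100.0 * covered / query_len, identity


# Toy example: a 1000-bp query region with two overlapping blocks
# (invented numbers, not data from the study).
toy_hits = [Hit(1, 500, 499, 500), Hit(401, 770, 370, 370)]
cov, ident = coverage_and_identity(toy_hits, 1000)
print(f"coverage = {cov:.1f}%, identity = {ident:.2f}%")
```

Merging the intervals before summing matters: without it, the 100 bp shared by the two toy blocks would be counted twice and coverage would be overstated. A region that is only partially shared, as GI48 is in Pae-97, shows up as high identity over the aligned part together with coverage well below 100%.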

Given the lack of information about the genomic context of many IMP-carrying integrons (such as region GICIMP-18, unlike GICVIM-2), and the particular architecture of the class 1 integron in PaeAG1 with the gene cassette IMP-18/gcuD/OXA-2/aacA4, we consider that this IMP-18-carrying integron (registered as In1666) is a novel element that we report here for the first time.

In the partial reconstruction of the evolutionary steps related to the GICIMP-18 region, the integron In1666, the insertion sequence IS1326 and the transposon TnAs3 seem to play a key role in the current state of this genomic region. Both IS1326 and TnAs3 have been reported in different integrons and high-plasticity regions (He et al., 2016; Jones-Dias et al., 2016; Liebert et al., 1999; Szuplewska, Czarnecki, & Bartosik, 2014). Further analyses are required to complete the evolutionary steps that have defined this genomic region, as well as the implications for multiresistance in PaeAG1.

Jointly, identification of the landscape of the genomic context defined by GICVIM-2 and GICIMP-18 provides insights into the dissemination and evolution of mobile elements, in this particular case integrons carrying MBLs. Since MBL-producing P. aeruginosa is able to produce epidemic outbreaks and is responsible for the dissemination of carbapenem resistance worldwide (Castanheira, Deshpande, Costello, Davies, & Jones, 2014), it is worrisome that strains such as PaeAG1 are able to circulate among Costa Rican hospitals. This can be correlated with the high prevalence of carbapenem-resistant strains in Costa Rica, many carrying VIM or IMP genes (Toval et al., 2015).
Future work is necessary to activate the surveillance system in order to evaluate whether other circulating strains carry these two elements, to identify their possible dissemination, and hence to carry out an adequate infection control program in medical centers.

5. CONCLUSIONS

PaeAG1 is a high-risk and critical organism due to its resistance to carbapenems by the activity of the VIM-2 and IMP-18 enzymes, each harbored in a class 1 integron. To describe the genomic context associated with these integrons, we first verified the functionality of VIM-2 and IMP-18 after imipenem exposure. We then analyzed 211 complete genome sequences using a pan-genome analysis, separating strains by MLST profile. Analysis of the 57 PaeAG1 genomic islands showed a varying pattern of presence/absence among all the strains, in particular for the closest genomes to PaeAG1. Two selected genomic island clusters, GICVIM-2 and GICIMP-18, were studied in depth. The GICVIM-2 sequence was completely found in two other known ST-111 strains, which contained the VIM-2-carrying integron as an old-acquaintance In59-like element. GICIMP-18 was partially found in another genome, but the IMP-18-carrying integron has an architecture never reported before and is considered a novel integron, In1666. We provide new insights into the genomic determinants associated with this high-risk P. aeruginosa clone and its resistance to carbapenems using comparative genomics.

Ethical approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Availability of data and material

All the strains we used in this study were obtained from NCBI.
All the IDs are available in the Supplementary_file 1 (All_strains_information). Genome sequence and annotation files are available from our previous work at https://github.com/josemolina6/PaeAG1_genome, and in GenBank under the accession number CP045739.

Declaration of Competing Interest

The authors declare that there is no conflict of interest.

Acknowledgements

This work was funded by projects genómicas inducidas por ciprofloxacina en Pseudomonas aeruginosa AG1", B8152 "proNGS 1.0: Implementación y evaluación de protocolos de análisis de datos de tecnologías NGS y afines para el estudio de sistemas biológicos complejos", Vicerrectoría de Investigación, Universidad de Costa Rica (period 2017-2020).

We thank all members of Centro de Investigación en Enfermedades Tropicales (Universidad de Costa Rica, Costa Rica) for their support in the activities associated with the projects.

REFERENCES

Aguilar-Rodea, P., Zúñiga, G., Rodríguez-Espino, B. A., Cervantes, A. L. O., Arroyo, A. E. G., Moreno- -Guadarrama, N. (2017). Identification of extensive drug resistant Pseudomonas aeruginosa strains: New clone ST1725 and high-risk clone ST233. PLoS ONE, 12(3), 2007–2013. https://doi.org/10.1371/journal.pone.0172882
Alikhan, N.-F., Petty, N. K., Ben Zakour, N. L., & Beatson, S. A. (2011). BLAST Ring Image Generator (BRIG): simple prokaryote genome comparisons. BMC Genomics, 12(1), 402. https://doi.org/10.1186/1471-2164-12-402
Brazas, M. D., & Hancock, R. E. W. (2005). Ciprofloxacin Induction of a Susceptibility Determinant in Pseudomonas aeruginosa. Antimicrobial Agents and Chemotherapy, 49(8), 3222–3227. https://doi.org/10.1128/AAC.49.8.3222
Cabot, G., Zamorano, L., Moyà, B., Juan, C., Navas, A., Blázquez, J., & Oliver, A. (2016).
Evolution of Pseudomonas aeruginosa antimicrobial resistance and fitness under low and high mutation rates. Antimicrobial Agents and Chemotherapy, 60(3), 1767–1778. https://doi.org/10.1128/AAC.02676-15
Castanheira, M., Deshpande, L. M., Costello, A., Davies, T. A., & Jones, R. N. (2014). Epidemiology and carbapenem resistance mechanisms of carbapenem-non-susceptible Pseudomonas aeruginosa collected during 2009-11 in 14 European and Mediterranean countries. Journal of Antimicrobial Chemotherapy, 69(7), 1804–1814. https://doi.org/10.1093/jac/dku048
Djordjevic, S. P. (2016). Genomic islands 1 and 2 play key roles in the evolution of extensively drug-resistant ST235 isolates of Pseudomonas aeruginosa. Open Biology, 6(3). https://doi.org/10.1098/rsob.150175
Farajzadeh Sheikh, A., Shahin, M., Shokoohizadeh, L., Halaji, M., Shahcheraghi, F., & Ghanbari, F. (2019). Molecular epidemiology of colistin-resistant Pseudomonas aeruginosa producing NDM-1 from hospitalized patients in Iran. Iranian Journal of Basic Medical Sciences, 22(1), 38–42. https://doi.org/10.22038/ijbms.2018.29264.7096
Fernández, M., Corral-Lugo, A., & Krell, T. (2018). The plant compound rosmarinic acid induces a broad quorum sensing response in Pseudomonas aeruginosa PAO1. Environmental Microbiology, 20(12), 4230–4244. https://doi.org/10.1111/1462-2920.14301
Freschi, L., Vincent, A. T., Jeukens, J., Emond-Rheault, J. G., Kukavica- Levesque, R. C. (2019). The Pseudomonas aeruginosa Pan-Genome Provides New Insights on Its Population Structure, Horizontal Gene Transfer, and Pathogenicity. Genome Biology and Evolution, 11(1), 109–120. https://doi.org/10.1093/gbe/evy259
Ghaly, T. M., Chow, L., Asher, A. J., Waldron, L. S., & Gillings, M. R. (2017).
Evolution of class 1 integrons: Mobilization and dispersal via food-borne bacteria. PLoS ONE, 12(6), 1–11. https://doi.org/10.1371/journal.pone.0179169
Gillings, M. R. (2017). Class 1 integrons as invasive species. Current Opinion in Microbiology, 38, 10–15. https://doi.org/10.1016/j.mib.2017.03.002
Gillings, M. R., Gaze, W. H., Pruden, A., Smalla, K., Tiedje, J. M., & Zhu, Y.-G. (2015). Using the class 1 integron-integrase gene as a proxy for anthropogenic pollution. The ISME Journal, 9(6), 1269–1279. https://doi.org/10.1038/ismej.2014.226
Gomila, M., Peña, A., Mulet, M., Lalucat, J., & García-Valdés, E. (2015). Phylogenomics and systematics in Pseudomonas. Frontiers in Microbiology, 6(MAR), 1–13. https://doi.org/10.3389/fmicb.2015.00214
He, S., Chandler, M., Varani, A. M., Hickman, A. B., Dekker, J. P., & Dyda, F. (2016). Mechanisms of evolution in high-consequence drug resistance plasmids. MBio, 7(6), 1987–2003. https://doi.org/10.1128/mBio.01987-16
Hilker, R., Munder, A., Klockgether, J., (2015). Interclonal gradient of virulence in the Pseudomonas aeruginosa pangenome from disease and environment. Environmental Microbiology, 17(1), 29–46. https://doi.org/10.1111/1462-2920.12606
Hong, D. J., Bae, I. K., Jang, I. H., Jeong, S. H., Kang, H. K., & Lee, K. (2015). Epidemiology and characteristics of metallo-β-lactamase-producing Pseudomonas aeruginosa. Infection and Chemotherapy, 47(2), 81–97. https://doi.org/10.3947/ic.2015.47.2.81
Jones-Dias, D., Manageiro, V., Ferreira, E., Barreiro, P., Vieira, L., Moura, I. B., & Caniça, M. (2016). Architecture of class 1, 2, and 3 integrons from gram negative bacteria recovered among fruits and vegetables. Frontiers in Microbiology, 7(SEP), 1–13. https://doi.org/10.3389/fmicb.2016.01400
Kim, S.-M., Kim, E.-C., & Choi, S.-Y. (2003).
Typing by Pulsed Field Gel Electrophoresis and Detection of Metallo-β-lactamase Gene Against Acinetobacter baumannii from Clinical Specimens. Korean J Clin Lab Sci, 35(2), 90–98. Retrieved from http://www.kjcls.org/journal/view.html?spage=90&volume=35&number=2
Tümmler, B. (2010). Genome diversity of Pseudomonas aeruginosa PAO1 laboratory strains. Journal of Bacteriology, 192(4), 1113–1121. https://doi.org/10.1128/JB.01515-09
(2012). Multilocus sequence typing of total-genome-sequenced bacteria. Journal of Clinical Microbiology, 50(4), 1355–1361. https://doi.org/10.1128/JCM.06094-11
Letunic, I., & Bork, P. (2019). Interactive Tree Of Life (iTOL) v4: recent updates and new developments. Nucleic Acids Research, 47(W1), W256–W259. https://doi.org/10.1093/nar/gkz239
Liebert, C. A., Hall, R. M., & Summers, A. O. (1999). Transposon Tn21, Flagship of the Floating Genome. Microbiology and Molecular Biology Reviews, 63(3), 507–522. https://doi.org/10.1128/mmbr.63.3.507-522.1999
López-García, A., Rocha-Gracia, R. del C., Bello-López, E., Juárez-Zelocualtecalt, C., Sáenz, Y., Castañeda- Lozano-Zarain, P. (2018). Characterization of antimicrobial resistance mechanisms in carbapenem-resistant Pseudomonas aeruginosa carrying IMP variants recovered from a Mexican Hospital. Infection and Drug Resistance, 11, 1523. https://doi.org/10.2147/IDR.S173455
Lu, P., Wang, Y., Zhang, Y., Hu, Y., Thompson, K. M., & Chen, S. (2016). RpoS-dependent sRNA RgsA regulates Fis and AcpP in Pseudomonas aeruginosa. Molecular Microbiology, 102(2), 244–259. https://doi.org/10.1111/mmi.13458
Martínez, T., Vazquez, G. J., Aquino, E. E., Goering, R. V, & Robledo, I. E. (2012). Two novel class I integron arrays containing IMP-18 metallo-β-lactamase gene in Pseudomonas aeruginosa clinical isolates from Puerto Rico.
Antimicrobial Agents and Chemotherapy, 56(4), 2119–2121. https://doi.org/10.1128/AAC.05758-11
(2008). Dynamics of Pseudomonas aeruginosa genome evolution. Proceedings of the National Academy of Sciences, 105(8), 3100–3105. https://doi.org/10.1073/PNAS.0711982105
Tufik, S. (2007). Rapid detection and identification of metallo-β-lactamase-encoding genes by multiplex real-time PCR assay and melt curve analysis. Journal of Clinical Microbiology, 45(2), 544–547. https://doi.org/10.1128/JCM.01728-06
Molina-Mora, J.-A., Campos-Sánchez, R., Rodríguez, C., Shi, L., & García, F. (2020). High quality 3C de novo assembly and annotation of a multidrug resistant ST-111 Pseudomonas aeruginosa genome: Benchmark of hybrid and non-hybrid assemblers. Scientific Reports, 10(1), 1392. https://doi.org/10.1038/s41598-020-58319-6
Molina-Mora, J. A., Chinchilla, D., Chavarría, M., Ulloa, A., Campos-Sanchez, R., Mora- - 111 Pseudomonas aeruginosa AG1 to ciprofloxacin identified by a top-down systems biology approach. Scientific Reports, 10, 1–23. https://doi.org/10.1038/s41598-020-70581-2
Molina-Mora, J., Montero-Manso, P., Batán, R. G., Sánchez, R. C., Fernández, J. V., & García, F. (2020). A first Pseudomonas aeruginosa perturbome: Identification of core genes related to multiple perturbations by a machine learning approach. BioRxiv, 2020.05.05.078477. https://doi.org/10.1101/2020.05.05.078477
Mosquera-Rendón, J., Rada-Bravo, A. M., Cárdenas-Brito, S., Corredor, M., Restrepo-Pineda, E., & Benítez-Páez, A. (2016). Pangenome-wide and molecular evolution analyses of the Pseudomonas aeruginosa species. BMC Genomics, 17(1), 1–14. https://doi.org/10.1186/s12864-016-2364-4
Mulet, X., Cabot, G., Ocampo- Network for Research in Infectious Diseases (REIPI). (2013).
Biological Markers of Pseudomonas aeruginosa Epidemic High-Risk Clones. Antimicrobial Agents and Chemotherapy, 57(11), 5527–5535. https://doi.org/10.1128/AAC.01481-13
Mulet, Xavier, Cabot, G., Ocampo- Oliver, A. (2013). Biological markers of Pseudomonas aeruginosa epidemic high-risk clones. Antimicrobial Agents and Chemotherapy, 57(11), 5527–5535. https://doi.org/10.1128/AAC.01481-13
Oliver, A., Mulet, X., López-Causapé, C., & Juan, C. (2015). The increasing threat of Pseudomonas aeruginosa high-risk clones. Drug Resistance Updates, 21–22, 41–59. https://doi.org/10.1016/j.drup.2015.08.002
Ozer, E. A., Allen, J. P., & Hauser, A. R. (2014). Characterization of the core and accessory genomes of Pseudomonas aeruginosa using bioinformatic tools Spine and AGEnt. BMC Genomics, 15(1), 737. https://doi.org/10.1186/1471-2164-15-737
(2015). Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics, 31(22), 3691–3693. https://doi.org/10.1093/bioinformatics/btv421
Tracking of antibiotic resistance transfer and rapid plasmid evolution in a hospital setting by Nanopore sequencing. BioRxiv, 639609. https://doi.org/10.1101/639609
Petitjean, M., Martak, D., Silvant, A., Bertrand, X., Valot, B., & Hocquet, D. (2017). Genomic characterization of a local epidemic Pseudomonas aeruginosa reveals specific features of the widespread clone ST395. Microbial Genomics, 3(10), e000129. https://doi.org/10.1099/mgen.0.000129
Poirel, L., Lambert, T., Turkoglu, S., Ronco, E., Gaillard, J., & Nordmann, P. (2001). Characterization of Class 1 Integrons from Pseudomonas aeruginosa That Contain the blaVIM-2 Carbapenem-Hydrolyzing β-Lactamase Gene and of Two Novel Aminoglycoside Resistance Gene Cassettes. Antimicrobial Agents and Chemotherapy, 45(2), 546–552.
https://doi.org/10.1128/AAC.45.2.546-552.2001
Poulsen, B. E., Yang, R., Clatworthy, A. E., Wh (2019). Defining the core essential genome of Pseudomonas aeruginosa. Proceedings of the National Academy of Sciences of the United States of America, 116(20), 10072–10080. https://doi.org/10.1073/pnas.1900570116
Quinlan, A. R., & Hall, I. M. (2010). BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 26(6), 841–842. https://doi.org/10.1093/bioinformatics/btq033
Reinhart, A. A., Nguyen, A. T., Brewer, L. K., Bevere, J., Jone - Sherrouse, A. G. (2017). The Pseudomonas aeruginosa PrrF Small Acute Murine Lung Infection. Infection and Immunity, 85(5), 1–15. https://doi.org/10.1128/IAI.00764-16
- Sherrouse, A. G. (2015). The prrF-encoded small regulatory RNAs are required for iron homeostasis and virulence of Pseudomonas aeruginosa. Infection and Immunity, 83(3), 863–875. https://doi.org/10.1128/IAI.02707-14
Sam Giske, C. G. (2010). Molecular epidemiology of metallo-β-lactamase-producing Pseudomonas aeruginosa isolates from Norway and Sweden shows import of international clones and local clonal expansion. Antimicrobial Agents and Chemotherapy, 54(1), 346–352. https://doi.org/10.1128/AAC.00824-09
Sánchez-Martinez, G., Garza-Ramos, U. J., Reyna-Flores, F. L., Gaytán-Martínez, J., Lorenzo-Bautista, I. G., & Silva-Sanchez, J. (2010). In169, A New Class 1 Integron that Encoded blaIMP-18 in a Multidrug-Resistant Pseudomonas aeruginosa Isolate from Mexico. Archives of Medical Research, 41(4), 235–239. https://doi.org/10.1016/j.arcmed.2010.05.006
Savli, H., Karadenizli, A., Kolayli, F., Gundes, S., Ozbek, U., & Vahaboglu, H. (2003).
Expression stability of six housekeeping genes: a proposal for resistance gene quantification studies of Pseudomonas aeruginosa by real-time quantitative RT-PCR. Journal of Medical Microbiology, 52(5), 403–408. https://doi.org/10.1099/jmm.0.05132-0
Seemann, T. (2014). Prokka: rapid prokaryotic genome annotation. Bioinformatics, 30(14), 2068–2069. https://doi.org/10.1093/bioinformatics/btu153
Sullivan, M. J., Petty, N. K., & Beatson, S. A. (2011). Easyfig: a genome comparison visualizer. Bioinformatics, 27(7), 1009. https://doi.org/10.1093/BIOINFORMATICS/BTR039
Szuplewska, M., Czarnecki, J., & Bartosik, D. (2014). Autonomous and non-autonomous Tn3-family transposons and their role in the evolution of mobile genetic elements. Mobile Genetic Elements, 4(6), 1–4. https://doi.org/10.1080/2159256x.2014.998537
Toval, F., Guzmán-Marte, A., Madriz, V., Somogyi, T., Rodríguez, C., & García, F. (2015). Predominance of carbapenem-resistant Pseudomonas aeruginosa isolates carrying blaIMP and blaVIM metallo-β-lactamases in a major hospital in Costa Rica. Journal of Medical Microbiology, 64(1), 37–43. https://doi.org/10.1099/jmm.0.081802-0
Turton, J. F., Wright, L., Underwood, A., Witney, A. A., Chan, Y. T., Al- N. (2015). High-resolution analysis by whole-genome sequencing of an international lineage (Sequence Type 111) of Pseudomonas aeruginosa associated with metallo-carbapenemases in the United Kingdom. Journal of Clinical Microbiology, 53(8), 2622–2631. https://doi.org/10.1128/JCM.00505-15
Valot, B., Guyeux, C., Rolland, J. Y., Mazouzi, K., Bertrand, X., & Hocquet, D. (2015). What It Takes to Be a Pseudomonas aeruginosa? The Core Genome of the Opportunistic Pathogen Updated. PLOS ONE, 10(5), e0126468. https://doi.org/10.1371/journal.pone.0126468
van der Zee, A., Kraak, W. B., Burggraaf, A., Goessens, W.
H. F., Pirovano, W., Ossewaarde, J. M., & Tommassen, J. (2018). Spread of carbapenem resistance by transposition and conjugation among Pseudomonas aeruginosa. Frontiers in Microbiology, 9(SEP), 1–11. https://doi.org/10.3389/fmicb.2018.02057
Walsh, T. R. (2005). The emergence and implications of metallo-β-lactamases in Gram-negative bacteria. Clinical Microbiology and Infection, Supplement, 11(6), 2–9. https://doi.org/10.1111/j.1469-0691.2005.01264.x
Weiser, R., Green, A. E., Bull, M. J., Cunningham-Oakes, E., Jolley, K. A., Maiden, M Mahenthiralingam, E. (2019). Not all Pseudomonas aeruginosa are equal: Strains from industrial sources possess uniquely large multireplicon genomes. Microbial Genomics, 5(7). https://doi.org/10.1099/mgen.0.000276
Wendt, M., & Heo, G.-J. (2016). Multilocus sequence typing analysis of Pseudomonas aeruginosa isolated from pet Chinese stripe-necked turtles (Ocadia sinensis). Laboratory Animal Research, 32(4), 208. https://doi.org/10.5625/lar.2016.32.4.208
Witney, A. A., Gould, K. A., Pope, C. F., (2014). Genome sequencing and characterization of an extensively drug-resistant sequence type 111 serotype O12 hospital outbreak strain of Pseudomonas aeruginosa. Clinical Microbiology and Infection, 20(10), O609–O618. https://doi.org/10.1111/1469-0691.12528
Woodford, N., Turton, J. F., & Livermore, D. M. (2011). Multiresistant Gram-negative bacteria: the role of high-risk clones in the dissemination of antibiotic resistance. FEMS Microbiology Reviews, 35(5), 736–755. https://doi.org/10.1111/j.1574-6976.2011.00268.x
World Health Organization. (2017). Guidelines for the prevention and control of carbapenem-resistant Enterobacteriaceae, Acinetobacter baumannii and Pseudomonas aeruginosa in health care facilities. Geneva.
Retrieved from https://apps.who.int/iris/bitstream/handle/10665/259462/9789241550178-eng.pdf?sequence=1&ua=1
Zhao, W. H., & Hu, Z. Q. (2011). IMP-type metallo-β-lactamases in Gram-negative bacilli: Distribution, phylogeny, and association with integrons. Critical Reviews in Microbiology, 37(3), 214–226. https://doi.org/10.3109/1040841X.2011.559944

FIGURES AND TABLES LEGENDS:

Fig. 1. VIM-2 and IMP-18 expression after imipenem exposure. RT-qPCR was performed to assess the transcriptional activity of the VIM-2 and IMP-18 genes. PaeAG1 was exposed to two imipenem concentrations, which showed no effect on the growth curves (A). Relative gene expression showed that the higher induction occurs 6 hours after exposure, both for VIM-2 (B) and for IMP-18 (C). Relative expression differed significantly by time but not by concentration (p<0.05).

Fig. 2. Comparative genomic analysis of 211 P. aeruginosa strains. Using a pan-genome analysis strategy, the complete genomes were compared, and the gene composition defined groups that can in turn be described by the MLST genotyping profile. ST groups with fewer than 5 strains are shown in beige and cases with unknown ST are shown in gray. ST groups with 5 or more strains are shown in colors. The PaeAG1 strain and all the other ST-111 strains are located in a clearly separated cluster, shown in green.

Fig. 3. Distribution of genomic islands of PaeAG1 along the genome. The 57 predicted genomic islands are distributed along the PaeAG1 genome, most of them forming groups of two or more islands in a row (genomic island clusters, GICs), which are jointly named with a single label.

Fig. 4.
Comparative analysis of the presence/absence of PaeAG1 genomic islands in other ST-111 strains and representative genomes. The 57 genomic islands were searched in the genomes of the other ST-111 strains, the reference strain PAO1 (ST-549) and three other strains close to the ST-111 group (see Fig. 2). The GI27 genomic island includes the VIM-2-carrying integron and is present in PaeAG1 and two other ST-111 strains, while GI49 (blue), harboring the IMP-18-carrying integron, is unique to PaeAG1 and is not present in any of the other 210 strains in the study. Other genomic islands linked to GICVIM-2 and GICIMP-18 have a different pattern of occurrence between strains.

Fig. 5. Description of the architecture of the genomic region GICVIM-2 containing the old-acquaintance VIM-2-carrying integron. The genomic region GICVIM-2 is absent in the reference sequence Pae-PAO1, whereas it is mostly present in Pae-AR445, but without most of the integron. Full coverage of the region was identified in Pae-RIVM-EMC2982. The architecture of the VIM-2-carrying integron is shown.

Fig. 6. Description of the architecture of the exclusive genomic region GICIMP-18 containing the new IMP-18-carrying integron. The genomic region GICIMP-18 is absent in the reference sequence Pae-PAO1 and in the Pae-RIVM-EMC2982 strain, whereas it is partially present in Pae-97. The architecture of the IMP-18-carrying integron is shown with an arrangement that is reported here for the first time.

Fig. 7. Possible evolutionary steps associated with the genomic regions of the VIM-2- and IMP-18-carrying integrons. Different mobile elements are involved in the current state of the genomic region, being completely described for GICVIM-2 (A) and partially for GICIMP-18 (B).

Table 1. Primer sequences used for RT-qPCR experiments.
Table 2. Annotation of protein-coding genes of the genomic region GICVIM-2 associated with the VIM-2-carrying integron.

Table 3. Annotation of protein-coding genes of the genomic region GICIMP-18 associated with the IMP-18-carrying integron.

SUPPLEMENTARY MATERIAL

Supplementary Table S1
Supplementary Table S2
Supplementary Table S3
Supplementary_file 1 All_strains_information
Supplementary_file 2 Pan-genome analysis results

Table 1. Primer sequences used for RT-qPCR experiments.

| Gene | Primer | Sequence | Final concentration | Amplicon length |
|---|---|---|---|---|
| IMP-18 | Forward | GAATAG(A/G)(A/G)TGGCTTAA(C/T)TCTC | 1 µM | 188 bp |
| IMP-18 | Reverse | CCAAAC(C/T)ACTA(G/C)GTTATC | 1 µM | 188 bp |
| VIM-2 | Forward | CCGCGTCTATCATGGCTATT | 0.1 µM | 181 bp |
| VIM-2 | Reverse | ATGAGACCATTGGACGGGTA | 0.1 µM | 181 bp |
| rpoD | Forward | GGGCGAAGAAGGAAATGGTC | 1 µM | 178 bp |
| rpoD | Reverse | CAGGTGGCGTAGGTGGAGAA | 1 µM | 178 bp |

Table 2. Annotation of protein-coding genes of the genomic region GICVIM-2 associated with the VIM-2-carrying integron

| PaeAG1 | Pae-RIV-EMC2982 Gene* | Identity | Annotation | Name | RefSeq |
|---|---|---|---|---|---|
| PaeAG1_03237 | EMC2982_03491 | 100.0 | PslA, psl cluster plays a role in cell-cell and/or cell-surface interaction in biofilm formation | pslA (PA2231) | NP_250921.1; WP_003111160.1 |
| PaeAG1_03238 | EMC2982_03490 | 100.0 | Hypothetical protein PA2230 | PA2230 | NP_250920.1; WP_003122761.1 |
| PaeAG1_03239 | EMC2982_03489 | 100.0 | Hypothetical protein PA2229 | PA2229 | NP_250919.1; WP_003113716.1 |
| PaeAG1_03240 | EMC2982_03488 | 100.0 | Hypothetical protein | HP | WP_034066849.1 |
| PaeAG1_03241 | EMC2982_03487 | 100.0 | Transposase TnpA | tnpA | WP_003460108.1 |
| PaeAG1_03242 | EMC2982_03486 | 100.0 | Transposase TnpR | tnpR | WP_000147567.1; YP_005211182.1 |
| PaeAG1_03243 | EMC2982_03485 | 100.0 | Transposase TnpM | tnpM | WP_004217866.1 |
| PaeAG1_03244 | EMC2982_03484 | 100.0 | Class I integron integrase IntI | intI | YP_005221021.1 |
| PaeAG1_03245 | EMC2982_03483 | 100.0 | 6'-N-aminoglycoside acetyltransferase type I aacA29a | aacA29a | WP_032490447.1 |
| PaeAG1_03246 | EMC2982_03482 | 100.0 | Carbapenem-hydrolyzing metallo-beta-lactamase VIM-2 | VIM-2 | WP_032491390.1 |
| PaeAG1_03247 | EMC2982_03481 | 100.0 | 6'-N-aminoglycoside acetyltransferase type I aacA29b | aacA29b | WP_032490447.1 |
| PaeAG1_03248 | EMC2982_03480 | 100.0 | Sulfonamide-resistant dihydropteroate synthase Sul1 | sul1 | WP_000259031.1 |
| PaeAG1_03249 | EMC2982_03479 | 100.0 | Acetyltransferase | Acetyltransferase | WP_000376623.1 |
| PaeAG1_03250 | EMC2982_03478 | 100.0 | Transposase TniB | tniB | WP_003107582.1; WP_021264342.1 |
| PaeAG1_03251 | EMC2982_03477 | 100.0 | Transposase TniA | tniA | WP_000179844.1; YP_008766137.1 |
| PaeAG1_03252 | EMC2982_03476 | 100.0 | Hypothetical protein | urf2 | WP_000204520.1 |
| PaeAG1_03253 | EMC2982_03475 | 100.0 | Mercury resistance protein merE | merE | WP_000993386.1; YP_789372.1 |
| PaeAG1_03254 | EMC2982_03474 | 99.0 | Transcriptional regulator merD | merD | WP_001277456.1; YP_789373.1 |
| PaeAG1_03255 | EMC2982_03473 | 99.8 | Mercuric reductase merA | merA | WP_000105636.1; YP_789374.1 |
| PaeAG1_03256 | EMC2982_03472 | 100.0 | Transposase | tnpA | WP_003111042.1; WP_003460108.1 |
| PaeAG1_03257 | EMC2982_03471 | 100.0 | TpnA repressor protein | tnpC | WP_003111043.1; NP_745109.1 |
| PaeAG1_03258 | EMC2982_03470 | 100.0 | Hypothetical protein | HP | WP_003111045.1 |
| PaeAG1_03259 | EMC2982_03469 | 100.0 | Hypothetical protein | HP | WP_003111046.1 |
| PaeAG1_03260 | EMC2982_03468 | 100.0 | Homospermidine synthase (HPS) | HPS | WP_003111047.1 |
| PaeAG1_03261 | EMC2982_03467 | 100.0 | Hypothetical protein | HP | WP_003111048.1 |
| PaeAG1_03262 | EMC2982_03466 | 100.0 | Hypothetical protein | HP | WP_003111049.1 |
| PaeAG1_03263** | EMC2982_03465 | 100.0 | Recombinase | Recombinase | WP_003111050.1 |
| PaeAG1_03265** | EMC2982_03463 | 100.0 | Hypothetical protein | HP | WP_010792965.1 |
| PaeAG1_03266 | EMC2982_03462 | 100.0 | Hypothetical protein | HP | WP_003092560.1 |
| PaeAG1_03267 | EMC2982_03461 | 100.0 | Hypothetical protein PA2229 | PA2229 | NP_250919.1; WP_003113716.1 |
| PaeAG1_03268 | EMC2982_03460 | 100.0 | Hypothetical protein PA2228 | PA2228 | NP_250918.1; WP_003113715.1 |
| PaeAG1_03269 | EMC2982_03459 | 100.0 | AraC-type transcriptional regulator VqsM | vqsM (PA2227) | NP_250917.1; WP_003113714.1 |

Notes: with our annotation (see Methods). See Supplementary Table S1 for locus in PGDB annotation file and amino-acid comparison against other genomes. **PaeAG1_03264 is a tRNA, i.e. not included here.

Table 3. Annotation of protein-coding genes of the genomic region GICIMP-18 associated with the IMP-18-carrying integron

| PaeAG1 | Pae-97 Gene* | Identity | Annotation | Name | RefSeq |
|---|---|---|---|---|---|
| PaeAG1_05736 | Pa97_05533 | 100.0 | Hypothetical protein PA4702 | PA4702 | NP_253390.1; WP_003095090.1 |
| PaeAG1_05737 | Pa97_05534 | 100.0 | Hypothetical protein PA4703 | PA4703 | NP_253391.1; WP_003095094.1 |
| PaeAG1_05738 | Pa97_05535 | 100.0 | cAMP-binding protein A PA4704, cbpA | cbpA (PA4704) | NP_253392.1; WP_003095096.1 |
| PaeAG1_05739 | Pa97_05536 | 100.0 | Recombinase | Recombinase | WP_023442562.1 |
| PaeAG1_05740 | Pa97_05537 | 100.0 | helix-turn-helix transcriptional regulator (HTH-TR) | HTH-TR | WP_003148665.1 |
| PaeAG1_05741 | Pa97_05538 | 99.8 | Hypothetical protein | HP | WP_137462639.1 |
| PaeAG1_05742 | Pa97_05539 | 100.0 | Hypothetical protein | HP | WP_071567699.1 |
| PaeAG1_05743 | Pa97_05540 | 100.0 | Hypothetical protein | HP | WP_042855636.1 |
| PaeAG1_05744 | Pa97_05541 | 100.0 | Type I restriction endonuclease subunit R | hsdR | WP_042855635.1; YP_005974822.1 |
| PaeAG1_05745 | Pa97_05542 | 100.0 | Hypothetical protein | HP | WP_003148682.1 |
| PaeAG1_05746 | Pa97_05543 | 100.0 | restriction endonuclease subunit S | hsdS | WP_079393399.1; YP_005974824.1 |
| PaeAG1_05747 | Pa97_05544 | 100.0 | type I restriction-modification system (RMS) subunit M | hsdM | WP_003148685.1; YP_005974823.1 |
| PaeAG1_05748 | Pa97_05545 | 100.0 | recombinase family protein | Recombinase | WP_003148687.1 |
| PaeAG1_05749 | Pa97_05546 | 100.0 | class 1 integron integrase IntI1 | intI | YP_005221021.1 |
| PaeAG1_05750 | Pa97_05548 (IMP-1) | 80.5 | subclass B1 metallo-beta-lactamase IMP-18 | IMP-18 | WP_060614779.1 |
| PaeAG1_05750.1 | CP913_RS21750 | 36.4 | DUF1010 domain-containing protein gcuD | gcuD | WP_001336345.1 |
| PaeAG1_05751 | Pa97_05547 | 36.4 | oxacillin-hydrolyzing class D beta-lactamase OXA-2 | OXA-2 | WP_034033256.1 |
| PaeAG1_05751.1 | CP913_RS28765 | 99.4 | Aminoglycoside N(6')-acetyltransferase type 1 aacA4 | aacA4 | WP_003159191.1 |
| PaeAG1_05752 | Pa97_04840 | 100.0 | sulfonamide-resistant dihydropteroate synthase Sul1 | sul1 | WP_000259031.1 |
| PaeAG1_05753 | Pa97_04839 | 100.0 | GNAT family N-acetyltransferase | GNAT | WP_000376623.1 |
| PaeAG1_05754 | Pa97_05603 | 44.4 | ATP-binding protein, protease istD | istD | WP_000983249.1 |
| PaeAG1_05755 | Pa97_04622 | 44.1 | Transposase istA | istA | WP_001324342.1; WP_000996451.1 |
| PaeAG1_05756 | Pa97_05551 | 100.0 | Transposase TniB | tniB | WP_003107582.1; WP_021264342.1 |
| PaeAG1_05757 | Pa97_05552 | 100.0 | Transposase TniA | tniA | WP_000179844.1; YP_008766137.1 |
| PaeAG1_05758 | Pa97_05553 | 100.0 | Hypothetical protein | HP | WP_003157545.1 |
| PaeAG1_05759 | Pa97_05554 | 99.6 | Hypothetical protein | HP | WP_003157546.1 |
| PaeAG1_05760 | Pa97_05555 | 97.6 | iron(III) ABC transporter PhuW | phuW | NP_253393.1; WP_003113451.1 |
| PaeAG1_05761 | Pa97_05556 | 99.6 | heme ABC transporter ATP-binding protein PhuV | phuV | NP_253394.1; WP_003095098.1 |
| PaeAG1_05762 | Pa97_05557 | 100.0 | iron ABC transporter permease PhuU | phuU | NP_253395.1; WP_003121063.1 |

Notes: Pa97 with our annotation (see Methods). See Supplementary Table S2 for locus in PGDB annotation file and amino-acid comparison against other genomes. locus refers to the PGDB annotation file with a better score due to annotation algorithms differences.

CHAPTER 3

Two-dimensional gel electrophoresis (2D-GE) image analysis based on CellProfiler: Pseudomonas aeruginosa AG1 as model

Molina-Mora, J. A., Chinchilla-Montero, D., Castro-Peña, C., & Garcia, F. (2020). Two-dimensional gel electrophoresis (2D-GE) image analysis based on CellProfiler: Pseudomonas aeruginosa AG1 as model. Medicine, IN-PRESS.

Summary

Using the bacterial strain Pseudomonas aeruginosa AG1 as a model, we obtained images from two-dimensional gel electrophoresis (2D-GE) of periplasmic protein profiles when the strain was exposed to multiple antibiotics. As reported, 2D-GE is an indispensable technique for the study of proteomes of biological systems, providing an assessment of changes in protein abundance under various experimental conditions.
However, due to the complexity of 2D-GE gels, there is no systematic, automatic and reproducible protocol for image analysis, and specific implementations are required for each context. In addition, practically all available solutions are commercial, which implies high cost and little flexibility to modulate the parameters of the algorithms. We therefore implemented and evaluated an image analysis protocol based on the open-source software CellProfiler. First, a preprocessing step included a bUnwarpJ-ImageJ pipeline for aligning 2D-GE images. Then, using CellProfiler, we standardized two pipelines for spot identification. Total spot recognition was achieved using segmentation by intensity, and its performance was evaluated by comparison with a reference protocol. In a second pipeline with the same program, differential identification of spots was addressed by comparing pairs of protein profiles. Owing to the characteristics of the programs used, our workflow can automatically analyze a large number of images and is parallelizable, which is an advantage over other implementations. Finally, we compared six experimental conditions of the bacterial strain in the presence or absence of antibiotics, determining the relationships among protein profiles by applying the clustering algorithms PCA (Principal Component Analysis) and HC (Hierarchical Clustering). Results revealed that the global proteomic profile after exposure to a sub-inhibitory ciprofloxacin (CIP) concentration remains close to the control (LB medium, without antibiotics), in contrast with the results obtained with tobramycin and imipenem. This means that the effects of ciprofloxacin at the proteomic level are smaller than the changes caused by the other antibiotics.
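As a rough illustration of the comparison step summarized above, the following minimal sketch counts spots per sector on an 11 × 11 grid and groups the resulting profiles by agglomerative clustering on Euclidean distances (the sectoring and the distance are the ones stated in section 2.5). It is not the R implementation used in this work: the function names, the average-linkage rule and the toy spot coordinates are assumptions made only for illustration, and the PCA step is omitted.

```python
from math import sqrt

GRID = 11  # 11 x 11 = 121 sectors, as stated in section 2.5


def sector_counts(spots, width, height, grid=GRID):
    """Count spots per sector; spots are (x, y) centroids in pixels."""
    counts = [0] * (grid * grid)
    for x, y in spots:
        col = min(int(x * grid / width), grid - 1)
        row = min(int(y * grid / height), grid - 1)
        counts[row * grid + col] += 1
    return counts


def euclidean(u, v):
    """Euclidean distance between two equally long count vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def hierarchical_clustering(profiles):
    """Agglomerative clustering of named count vectors.

    Average linkage is an assumption of this sketch (the linkage is not
    stated in the text). Returns the merges in order, each as
    (cluster_a, cluster_b, distance), with clusters as tuples of names.
    """
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                total = sum(
                    euclidean(profiles[a], profiles[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                d = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical spot centroids on aligned 110 x 110 pixel images (toy data).
gels = {
    "LB": [(10, 10), (50, 50), (90, 90)],
    "CIP": [(12, 14), (50, 50), (90, 80)],
    "TOB": [(10, 90), (20, 90), (90, 10), (95, 5)],
}
profiles = {name: sector_counts(spots, 110, 110) for name, spots in gels.items()}
merges = hierarchical_clustering(profiles)
for left, right, dist in merges:
    print(left, right, round(dist, 3))
```

With real data, the spot centroids would come from the per-object measurements exported by the spot identification pipelines, and each gel would contribute one 121-value vector to the comparison.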
Medicine (open access). Quality Improvement Study.

Two-dimensional gel electrophoresis (2D-GE) image analysis based on CellProfiler: Pseudomonas aeruginosa AG1 as model

Jose Arturo Molina-Mora, MSc∗, Diana Chinchilla-Montero, MSc, Carolina Castro-Peña, MSc, Fernando García, PhD

Abstract

Two-dimensional gel electrophoresis (2D-GE) is an indispensable technique for the study of proteomes of biological systems, providing an assessment of changes in protein abundance under various experimental conditions. However, due to the complexity of 2D-GE gels, there is no systematic, automatic, and reproducible protocol for image analysis and specific implementations are required for each context. In addition, practically all available solutions are commercial, which implies high cost and little flexibility to modulate the parameters of the algorithms. Using the bacterial strain Pseudomonas aeruginosa AG1 as a model, we obtained images from 2D-GE of periplasmic protein profiles when the strain was exposed to multiple conditions, including antibiotics. Then, we proceeded to implement and evaluate an image analysis protocol with open-source software, CellProfiler. First, a preprocessing step included a bUnwarpJ-Image pipeline for aligning 2D-GE images. Then, using CellProfiler, we standardized two pipelines for spots identification. Total spots recognition was achieved using segmentation by intensity, whose performance was evaluated when compared with a reference protocol. In a second pipeline with the same program, differential identification of spots was addressed when comparing pairs of protein profiles. Due to the characteristics of the programs used, our workflow can automatically analyze a large number of images and it is parallelizable, which is an advantage with respect to other implementations.
Finally, we compared six experimental conditions of bacterial strain in the presence or absence of antibiotics, determining protein profiles relationships by applying clustering algorithms PCA (Principal Components Analysis) and HC (Hierarchical Clustering).

Abbreviations: 2D-GE = two-dimensional gel electrophoresis, ANOVA = analysis of variance, CIP = Ciprofloxacin, FDR = false discovery rate, HC = hierarchical clustering, IMP = Imipenem, PCA = Principal Component Analysis, pI = isoelectric point, TOB = Tobramycin.

Keywords: 2D-GE, bUnwarpJ, CellProfiler, image analysis, proteomics, Pseudomonas aeruginosa

Editor: Duane R. Hospenthal.

Ethics approval and consent to participate: Not applicable. Since our study is focused on a bacterial strain, ethical approval was not necessary, as evaluated by the scientific committee of the Research Center in Tropical Diseases (CIET), University of Costa Rica, Costa Rica. Neither humans nor animals were used in this study.

This work was funded by project B7124 "Analysis of the extracellular proteome of Pseudomonas aeruginosa AG1", registered in the University of Costa Rica (period 2017–2019).

The authors have no conflicts of interest to disclose.

Data availability: Image analysis pipelines for both “Identification of total spots” and “Differential identification of spots” are available at: https://github.com/josemolina6/2D-GE. Both need to be opened using CellProfiler software. The datasets generated during and/or analyzed during the present study are publicly available.

Research Center in Tropical Diseases (CIET), University of Costa Rica, San Pedro, Costa Rica.

∗ Correspondence: Jose Arturo Molina-Mora, Research Center in Tropical Diseases (CIET), University of Costa Rica, San Pedro, Costa Rica (e-mail: jose.molinamora@ucr.ac.cr).

Copyright © 2020 the Author(s). Published by Wolters Kluwer Health, Inc. This is an open access article distributed under the terms of the Creative Commons Attribution-Non Commercial License 4.0 (CCBY-NC), where it is permissible to download, share, remix, transform, and buildup the work provided it is properly cited. The work cannot be used commercially without permission from the journal.

How to cite this article: Molina-Mora JA, Chinchilla-Montero D, Castro-Peña C, García F. Two-dimensional gel electrophoresis (2D-GE) image analysis based on CellProfiler: Pseudomonas aeruginosa AG1 as model. Medicine 2020;99:49(e23373).

Received: 25 October 2019 / Received in final form: 20 October 2020 / Accepted: 27 October 2020

http://dx.doi.org/10.1097/MD.0000000000023373

1. Introduction

Proteomics is a field of study of the omic sciences that focuses on the analysis of the complete set of proteins produced in a cell, tissue, or organism at a given moment, that is, proteomes. The evaluation of protein profiles of biological samples, either by the presence or absence of proteins or by the measurement of their relative abundance, can help to understand cellular processes, including those associated with pathologies or particular biological conditions, and to understand molecular mechanisms of biological relevance.[1] However, since cells can produce thousands of proteins, the processing of protein information is complex.
In this sense, two-dimensional gel electrophoresis (2D-GE) has become a method of choice for proteomic studies since its introduction more than 40 years ago.[2] Its continued use is explained in part by its high performance in the separation of complex protein mixtures.[3] The use of 2D-GE gels allows the comparison of complex protein profiles, first separating them by isoelectric point (pI) and then by molecular weight.[4] With this, proteins are separated as spots, which are revealed with stains such as Coomassie blue or silver stain, and images of the gel are then captured. These images are analyzed to identify the spots and study the protein content, and to continue with subsequent proteomic studies by other strategies.[1]

However, due to the anomalies present in the images of 2D-GE gels, there is still no reliable, automatic and highly reproducible pipeline for 2D-GE image analysis.[4] At a strictly experimental level, the challenges of this type of technique include experimental variation (reagents, running conditions, etc), particular mobility of the proteins, deformation of the gel and the high probability of finding several proteins in the same space of the plane of the gel.[5] At the level of image analysis, the difficulties are greater, including anomalies such as the presence of vertical and horizontal stripes, noise around protein spots, diffuse spots and background noise, fusions of spots, artifacts due to the presence of dust or bubbles, saturation of certain spots and lack of linear intensity of protein spots.[1,3]

At the preprocessing level, one of the basic tasks is the alignment of images, in which one of the images is intentionally deformed to match its spots with those of the other image. This is done with a transformation that optimizes a measure of similarity, which in turn quantifies the quality of the alignment.[6] Then, algorithms are implemented to detect protein spots, that is, to recognize objects by segmentation and define the limits of each spot, many of them with methods based on intensity, form, or hybrid strategies.[3] In a subsequent step, the level of protein expression is quantified according to the intensity and the number of pixels.[1] If required, a differential expression analysis can be performed by comparing conditions, in which multivariate statistical criteria are used, including analysis of variance (ANOVA) according to the size and intensity of the spot, strategies for the correction of P values such as FDR (false discovery rate) or machine learning algorithms for clustering or classifying protein profiles.[5]

For the implementation of these analysis modules, there are software packages, practically all of them commercial. This has the disadvantage that many are aimed at a particular proteomics market and are tied to the purchase of equipment, which makes them even more expensive. These commercial solutions include PDQuest, ImageMaster2D, ProteomeWeaver, ProteinMine, Delta2D and Melanie, among others,[1] which generally contain modules for the alignment of the images to be compared, automatic identification and editing of spots, counting, quantification of intensity, and area calculation by spot. Among the free software options, ImageJ[7] has been widely used for the analysis of images of biological origin, but automation is limited, given that its approach is one of individual analysis, as has been described.[8,9] In the approach of Natale and collaborators, a protocol was implemented with ImageJ for the study of spots in 2D-GE gels, applicable to pairs of images but with limited scalability to large sets of images.[4]

Thus, the aim of this work was to implement and evaluate an image analysis protocol with open-source software for identifying spots in 2D-GE gel images, with the potential to analyze a large number of images automatically and, given the computational requirements, to be parallelized. For this, we standardized experimental protocols for the study of the periplasmic proteins of Pseudomonas aeruginosa AG1 under various conditions of exposure to antibiotics. This bacterium is an opportunistic pathogen that survives in a diversity of environments, including hospital environments.[10] Specifically, our study model is the strain P aeruginosa AG1, a Costa Rican isolate[11] with a multiresistance profile to antibiotics and with clonal MLST (https://pubmlst.org/) categorized as ST111, which implies a high risk for public health because of its resistance to therapy and association with nosocomial infections.

With this bacterial model, the proteins from the experimental assays were separated using 2D-GE gels and revealed with silver staining. After capturing the respective images, we implemented a preprocessing step that included an initial phase of image alignment using the script bUnwarpJ[12] in the program ImageJ[7,13]; this package has the ability to align hundreds of images to the same reference in one step. Subsequently, we performed spot identification with two protocols using the program CellProfiler.[8,14] A first protocol was established to identify total spots in the images of the gels, and it was contrasted with a reference analysis with the commercial program Melanie (https://2d-gel-analysis.com/). In a second implementation, differential identification of spots between experimental conditions was performed, separating the common spots from the exclusive ones. Finally, a comparison of several experimental conditions was carried out with two clustering algorithms, showing the similarity of protein profiles of P aeruginosa AG1 exposed to antibiotics. To the best of our knowledge, the CellProfiler program has not been used for the identification of spots on 2D-GE gels, although it has been used to recognize biological objects (cells, complete organisms, tumors, colonies of microorganisms and others) in hundreds of images, which makes this implementation promising for the analysis of hundreds of gels in proteomics studies.

2. Methods

2.1. Experimental assays for 2D-GE gels

For the extraction and analysis of periplasmic proteins of P aeruginosa AG1, cultures were used at exponential phase in LB medium (Luria Bertani, 2 clones) and in LB medium supplemented with subinhibitory concentrations of the antibiotics ciprofloxacin (CIP, 12.5 µg/mL), tobramycin (TOB, 62.5 and 125 µg/mL), and imipenem (IMP, 25 and 50 µg/mL). The marker “IEF 3–10 SERVA liquid mix” (with proteins of known size and isoelectric point) was used as migration control. After pre-cultivation for 16 h under the corresponding conditions, the bacteria were cultured for 6 h at 37°C under agitation. After verifying their exponential growth by optical density, the samples were centrifuged at 10,000 rpm for 30 min and the supernatant was discarded.

For the extraction of periplasmic proteins with chloroform, pellets were washed with sterile PBS 1× and then filtered 0.01 M Tris–hydrochloride pH 8.0 and chloroform were added. After an incubation, the sample was centrifuged and the supernatant stored at −80°C. For protein precipitation, the supernatants were treated with methanol and chloroform. After vigorous stirring and a strong centrifugation, the separation was achieved in 2 phases, an upper one of methanol/water and a lower one of chloroform. The periplasmic protein fraction was found in the middle of both phases and was finally precipitated with more alcohol and centrifugation. After the supernatant was removed, the protein pellet was dried and resuspended in 0.05% SDS lysis buffer, obtaining the protein extract of interest. This is a modified protocol of Ames et al.[15]

Finally, the protein separation in two-dimensional gel was performed by adding the proteins to Isoelectric Focusing (IEF) strips, which were hydrated for 24 h at room temperature. Then, the proteins were separated using a non-linear 3 to 11 pH gradient, following the manufacturer’s instructions (GE HealthCare Immobiline Dry Strip Gels™). For the second dimension (molecular weight), the IEF strips were incubated in equilibrium buffer (50 mM Tris–HCl, 6 M Urea, 30% glycerol and 2% SDS) with 4-dithiothreitol (DTT), for 10 min, before separation in a SDS-GE gradient of 4% to 20% for 90 min at 150 V. PageRuler Protein Ladder (Fermentas) was used as a molecular weight marker. All gels were visualized with silver stain. The bands were observed in the ChemiDoc photo viewer (BioRad).

2.2. Preprocessing of 2D-GE images by alignment

Due to the conditions inherent in the assembly of 2D-GE gels, the images require preprocessing by alignment (Fig. 1). Thus, we implemented the protocol detailed by Natale and collaborators[4] using the bUnwarpJ[12] package in the ImageJ program. Using 5 reference points, placed on spots known to be common between the images, we proceeded to the deformation of the larger images to align them with the spots of the smaller image, with the “degree of deformation” parameter set to fine. After the deformation, the aligned images were saved for the following analysis steps.

2.3. Identification of total spots

In order to identify the totality of visible protein spots in the gels, an image analysis protocol was implemented using the CellProfiler program (https://CellProfiler.org/). As detailed in Figure 1 (middle-left), the protocol consisted of 5 steps:
1. the inversion of the images to enable recognition,
2. the implementation of an object recognition, evaluating different parameters and recognition algorithms and segmentation,
3. improving the identification by manual editing,
4. calculating different metrics by object and, finally,
5. visualizing the recognition in the images.

Similarly, the automatic protocol of a specialized program for 2D gels, Melanie (https://2d-gel-analysis.com/), was used to compare the performance of our protocol, contrasting the number of recognized elements and the intensity measured with a linear regression.

2.4. Differential identification of spots

To compare the differential expression of proteins between experimental conditions, we proceeded to implement an analysis of pairs of images (Fig. 1 middle-right, also see Figure 4A for the case of two clones of the control condition). The steps for this process included:
1. the inversion of aligned images,
2. creating a new image of spots commonly shared by the images, preserving the minimum value of pixels in the same location,
3. automatic identification and manual editing of primary objects (same as the protocol of total spots), and
4. the elimination of common spots of each image.

With this, we obtained images of gels with common spots eliminated, so in a next step we performed
5. the identification and editing of primary objects of the exclusive spots of each gel,
6. calculation of metrics for each spot, and finally,
7. the representation of common and exclusive spots for each image.

With this, each image of each condition showed the spots present in both conditions (configured to be marked in red) or exclusive to each gel (blue or green colors in each image).

2.5. Comparison of gels from multiple experimental conditions

In order to compare different profiles of periplasmic proteins in various conditions of antibiotic exposure in P aeruginosa AG1, we proceeded to run two machine learning algorithms for clustering: a Principal Component Analysis (PCA) and a Hierarchical Clustering (HC) analysis (Fig. 1, bottom). To address this, the images were first aligned (as previously described) and then divided into 121 sectors (11 × 11 quadrants) and, given that the location of each spot was available as coordinates, the spots were counted for each of the zones. This information was used to implement the clustering algorithms, which used Euclidean distance for the dissimilarity and default parameters of the Caret package (http://caret.r-forge.r-project.org/) in the R program (https://www.r-project.org/).

3. Results

In order to establish an automatic procedure for the identification of protein spots in 2D-GE gels, we first proceeded with the generation of images from experimental assays with the periplasmic proteins of P aeruginosa AG1, in conditions with or without antibiotics. Then, we proceeded with the analysis of images, including alignment, identification of total spots and validation, differential identification when comparing pairs of conditions and finally analysis by clustering, as summarized in Figure 1.

Figure 1. General workflow by image analysis for identifying and comparing spots in 2D-GE gel images.

To align and compare the protein migration profile in 2D-GE gels, the bUnwarpJ package was used to deform the larger images and align them to a reference. In the case presented in Figure 2A, which starts with two images of different sizes (two clones of the strain in control condition), five points of reference or common denominator are established between the images, which are used by the algorithm to optimize the alignment by calculating a deformation field and a deformation grid (Fig. 2B). With this, the larger image is reduced to align and make the spots comparable between conditions (the image was cropped to visualize the distribution, Fig. 2C).

Figure 2. Alignment of 2D-GE gel images by warping method with bUnwarpJ pipeline (gels of protein profiles from two clones of the same strain, Pseudomonas aeruginosa AG1). (A) Raw images showing differences by size and scale. Color marks define the reference points for warping. (B) Deformation field (left) and deformation grid (right) of the larger image to align it to the smaller one. (C) Aligned 2D-GE gel images after warping and cropping (two clones).

Using the CellProfiler software, two spot recognition protocols were implemented. In the first one, for the identification of total spots, it was established that the optimal conditions were the use of a global algorithm (assuming relatively homogeneous background pixels and other parameters with default values), sizes of 40 to 100 pixels for the objects and the use of intensity to recognize and segment objects. Thus, after the inversion of the image and the recognition of objects, the recognized objects were presented on the original image (Fig. 3A left). When performing the comparison with an automatic protocol with the Melanie program (used as a reference for validation), it was verified that the protocol we implemented had the same ability to identify spots (Fig. 3A, right). A total of 124 spots was counted with both protocols (this value was controlled with the manual editing available both in our protocol and in the Melanie pipeline, and it includes cases of proteins grouped as a single spot in cases of large spots). Given that the boundaries or edges of recognition of an object varied between protocols, we proceeded to perform a linear regression between the intensity values, determining that the intensity behavior between the algorithms is linear (Fig. 3B).

In a second protocol with the same program, we proceeded with the differential identification of spots when comparing pairs of gels, obtained from two clones of the same strain P aeruginosa AG1 in LB medium condition. The identification of objects was done with the algorithm and intensity conditions described for the previous case, both for common spots and exclusive spots. Common spots were obtained by creating a new image, preserving the lower pixel value of the two images (so if a spot was present in both conditions, the image created would have a high value). Then, the spots were identified and labeled as proteins common to both conditions. Using the MaskImage function, these common objects were eliminated and, in a new recognition for each image, it was possible to identify the exclusive elements of each gel. Using colors, each type of object, common spots (red) or exclusive spots (green or blue), was marked on the images, showing that for this case the majority of proteins were shared by the two clones of the bacteria (Fig. 4B and C).

Finally, with the identification of spots made for each gel in different conditions including antibiotics, we proceeded to the comparison of the protein profiles. First, the images were divided into zones, and the number of spots was counted. Then, the PCA and HC clustering algorithms were evaluated, finding that the profiles given by different antibiotics generate more differences than the concentration of the antibiotic. In the case of the PCA (Fig. 5A), the first two components (with a cumulative variation between both >60%) show a similar relationship between the control with LB medium and the case of ciprofloxacin. This relationship is maintained when evaluating HC (Fig. 5B), but the relationship between imipenem and its two concentrations shows minor differences. In addition, for this same case, the division by zones shows the sectors of gels with
Medicine (2020) 99:49 Medicine Figure 3. Total spots identification by a CellProfiler pipeline and comparison with Melanie pipeline. (A) CellProfiler pipeline (left) vrs Melanie software (right) for segmentation of objects and final identification after manual edition. (B) Comparison of spots intensity using the CellProfiler pipeline and Melanie software. similar or very different compartment (potentially useful to select existing implementations, although there are some investigations zones for subsequent analysis, see discussion). The HC results are in methods of analysis of gels 2D-GE work directly at the level of shown with the respective gels in Figure 5C. pixels, most focus on recognizing spots on gel to describe the abundance in each condition.[3] Despite this, there is no protocol 4. Discussion for universal or consensus analysis, and multiple limitations arereported in various processing steps.[4] At commercial field, the Proteomics is considered an essential field for the systematic available programs have additional drawbacks of having a high analysis of biological systems, an assessment of changes in the cost, in addition to many of them are for sale with hardware abundance of proteins that occur in living organisms and that can equipment, which restricts the possibilities of use. In addition, be studied at various levels.[3] The two-dimensional gel due to its nature, the private code of the implementations is not electrophoresis 2D-GE, separating the proteins according to available, which prevents knowing the details of strategy at the their isoelectric point and molecular weight, is still used in level of algorithms and makes the modification impossible for proteomics laboratories due to the relative ease of implementa- specific applications. 
In addition, some limitations of commercial tion in terms of execution and cost, the capacity of solve and or open access programs include the limited number of images to visualize miles of proteins in a single run and it is compatible with analyze. other high-performance protein techniques, such as mass With the aim of implementing and evaluating an image spectrometry.[1] 2D-GE and subsequent strategies have been analysis protocol for the recognition of spots in 2D-GE gels implemented in recent studies using bacterial models, including images, using open-source software, parallelizable, and applica- application of protein phosphorylation (phosphoproteomics) ble to hundreds of images, we obtained experimental data of in Bacillus anthracis[16,17] or biotechnological applications in protein profiles of P aeruginosa AG1 under standardized Xanthomonas campestris.[18] conditions with or without antibiotics. The general protocol After the experimental phase, the visualization of the proteins was presented in Figure 1. Although it is possible to find is done with the particular stains and gel images are captured, variations between runs for the same sample, in our work, we which must be analyzed qualitatively and quantitatively for the used data from different samples but the same run. Comparison extraction of biologically relevant protein information. Of the of other protein concentrations, experimental conditions, or 6 Figure 4. Spots differential identification and comparison of 2D-GE gel images from two experimental conditions (clones from same strain). (A) General pipeline for identifying common (red, 124 spots) and exclusive spots (blue or green), which was applied to two different proteomic profiles, Clone 1 with 11 exclusive spots (B) or Clone 2 with 14 exclusive spots (C), respectively. Molina-Mora et al. Medicine (2020) 99:49 www.md-journal.com 7 Molina-Mora et al. Medicine (2020) 99:49 Medicine Figure 5. 
Machine learning approach of clustering analysis for comparing 2D-GE gel images from multiple experimental conditions. (A) PCA algorithm, (B) HC analysis showing zones and spots count, and (C) HC showing images. replicates are known to produce changes in the proteomic profile, with a pipeline of the commercial software Melanie, showing an and further analyses are required to study these effects and the equivalent performance when comparing the intensity obtained performance of our pipeline considering this. per object. Due to the fact that in both protocols a module of Images were aligned with the bUnwarpJ package in the ImageJ manual editing of the identification is implemented, the count of program. This step is required as preprocessing of data since the elements was intentionally controlled according to expert final performance depends to a great extent on the quality of the criteria, for a total of 124 spots. Similar results in performance images to be processed. This processing includes the alignment of have been previously reported when an analysis with ImageJ was images to match the corresponding protein points of different compared with Melanie,[4] but as mentioned before, with limited conditions.[1] In our case, the larger image was adjusted to the number of images to be processed. In the case of CellProfiler, smaller one and as an example the case of two protein profiles of automation is an essential component from its design, as well as two clones of the bacteria was presented in the control condition the option to parallelize in computer clusters.[14] with LB culture medium (Fig. 2). Although in our final In a second protocol (Fig. 
4), we implemented a procedure to implementation we use 6 images when aligning, the alignment differentially recognize the expression of proteins in pairs of of hundreds or thousands of images is possible using a single experimental conditions, allowing us to identify common and reference, as we did in another application with data of cell exclusive spots of experimental conditions. To do this, our cultures followed over time, aligning 600 images to the initial strategy was based on the construction of a new image using the image (unpublished data), showing the potential of using this minimum value of pixels of the two images aligned and inverted, package for the analysis of multiple gel images. Other using the MathImage function of the program. In this way applications with other types of images show this fact.[9,19,20] common spots were preserved. The recognition by segmentation After the preprocessing, we carried out the implementation of based on intensity allowed the identification of objects, which two protocols with CellProfiler software. Particular features of were later excluded in each. In a second phase, the remaining this software are discussed below. In a first approach, we spots were recognized in each image, to then differentially recognized total spots (Fig. 3), allowing the counting of spots and represent the edges of the shared and exclusive spots. the quantification of the area and intensity integrated by each To the best of our knowledge, there are no approximations one. Additionally, we compared the performance of this protocol that allow the display of common and exclusive spots 8 Molina-Mora et al. Medicine (2020) 99:49 www.md-journal.com automatically, given that it is regularly done manually. This obtain new findings of the biological relationships to molecular information is used to identify proteins differentially expressed in level that provide insights to begin to explain the mechanisms the conditions studied. 
However, our approach is very robust of tolerance to antibiotics and the modulation of biological considering only the presence or absence of spots, and true cases processes in response to cellular stress. of differential expression with significant changes in intensity are not contemplated, so we consider that this protocol allows the 5. Conclusions differential identification of spots, but not properly the differen- tial protein expression analysis. This last type of analysis is In the context of proteomics and its importance for the study of carried out by commercially available packages, but they are different biological conditions, our implementation of the image mainly based on the intensity and area, and due to the analysis of gels 2D-GE offers an opportunity to continue with preprocessing of the image in terms of image contrast, dimensions studies of analysis of protein profiles. Using the open-source and other modifications, the normalization and transformation software, CellProfiler (and bUnwarpJ for preprocessing), we of data it remains a challenge.[1] achieved thealignmentof images, the identificationof spots and the Regarding the CellProfiler program and its convenience for this final comparison of protein profiles. These workflow also allow implementation, this software offers the management of analyze a large number of images automatically aswell as enabling hundreds of thousands of images, freely available and with an the parallelization in computational clusters to counteract the open and flexible code platform to share, test, and develop new complexityof processing this typeofdata.Regarding thebiological methods by experts in image analysis. 
In addition, it offers an meaning, exposure to ciprofloxacin inP aeruginosaAG1 showed a easy-to-use interface and the possibility of implementing in similar pattern to control without treatment, and other groups computational clusters.[8] In addition, due to its nature of were generated according to the antibiotic class. This information automation, the program is capable of handling hundreds of will be integratedwithothermolecular analyses using antibiotics in thousands of images, which high performance infrastructure is thismultiresistant strain to gain insights regarding themechanisms required for massive analyzes, such as those implemented at the of tolerance to antibiotics and the modulation of biological omics level. Although many of the applications of the CellProfiler processes in response to cellular stress. program are formulated for cells, other applications have been implemented at the level of recognition of complete organism in Acknowledgments images, such as the parasite Caenorhabditis elegans,[21] or complete tumors, colonies of yeast or bacteria, and other images We thank the student Daniel Solano Alvarado for his collabora- of biological origin, as evidenced in the Educational Modules tion as an assistant in various activities of the project. section of the web page (https://CellProfiler.org/outreach/). In contrast, we finally carried out the implementation of a Author contributions machine learning analysis to compare protein profiles from gel images using PCA and HC clustering algorithms. This type of Conceptualization: Fernando García. strategy has been previously implemented with PCA and heuristic Formal analysis: Jose Arturo Molina-Mora, Diana Chinchilla- clustering algorithms,[22–24] as well as supervised classification Montero, Carolina Castro-Peña. 
algorithms to separate conditions, including Support Vector Methodology: Jose Arturo Molina-Mora, Diana Chinchilla- Machine.[23] Other approaches have implemented comparison Montero, Carolina Castro-Peña, Fernando García. modules using directly the properties of intensity, brightness, and Software: Jose Arturo Molina-Mora. contrast of images to contrast with databases,[25] or, other levels Supervision: Fernando García. of proteomic analysis, such as mass spectrometry.[26] Regarding Visualization: Jose Arturo Molina-Mora. the methodology used in our case of division by zones and Writing – original draft: Jose Arturo Molina-Mora. grouping of regions with a similar profile, this strategy can be Writing – review & editing: Jose Arturo Molina-Mora, Diana used to make subsequent decisions of work in proteomics Chinchilla-Montero, Carolina Castro-Peña, FernandoGarcía. laboratories, where the task after the gels is the selection of spots and continue with identification applications with techniques References such as HPLC or mass spectrometry.[1] [1] Goez MM, Torres-Madroñero MC, Röthlisberger S, et al. Preprocessing In the biological aspect according to the results obtained when of 2-dimensional gel electrophoresis images applied to proteomic comparing 6 experimental conditions with P aeruginosa AG1 analysis: a review. Genomics Proteomics Bioinformatics 2018;16:63–72. bacteria with or without antibiotic, it was possible to identify the [2] O’Farrell PH. High resolution two-dimensional electrophoresis of relationships between the total protein expression profiles. Both proteins. J Biol Chem 1975;250:4007–21. [3] Silva TS, RichardN, Dias JP, et al. Data visualization and feature selection with the results of the analysis by PCA and by HC, it is concluded methods in gel-based proteomics. Curr Protein Pept Sci 2014; 15:4–22. that there is greater similarity between the profiles obtained for [4] Natale M, Maresca B, Abrescia P, et al. 
Image analysis workflow for 2-D the same antibiotic at different concentrations, and that they are electrophoresis gels based on imageJ. Proteomics Insights 2011;4:37–49. separated from the conditions of other antibiotics, congruent [5] Abdallah C, Dumas-Gaudot E, Renaut J, et al. Gel-based and gel-free according to the mechanisms of action of each type of antibiotic. quantitative proteomics approaches at a glance. Int J Plant Genomics2012;2012. In the case of ciprofloxacin, its profile was separated to a greater [6] Dowsey AW, Morris JS, Gutstein HB, et al. Informatics and statistics for degree from the other antibiotics and was grouped with the analyzing 2-D gel electrophoresis images. Methods Mol Biol 2010; control with LB medium. 604:239–55. Because the bacterial strain P aeruginosa AG1 is resistant to [7] Abramoff MD, Magalhães PJ, Ram SJ. Image processing with Image. J Biophotonics Int 2004;11:36–42. those antibiotics, this information and subsequent analysis at the [8] Lamprecht M, Sabatini D, Carpenter A. CellProfilerTM: free, versatile proteomic level, together with other genomic, transcriptomic, software for automated biological image analysis. Biotechniques 2007; and phenomic analyzes that we are conducting, will allow us to 42:71–5. 9 Molina-Mora et al. Medicine (2020) 99:49 Medicine [9] Schindelin J, Rueden CT, Hiner MC, et al. The ImageJ ecosystem: [18] Schulte F, Hardt M, Niehaus K. A robust protocol for the isolation of an open platform for biomedical image analysis. Mol Reprod Dev cellular proteins from Xanthomonas campestris to analyze the methio- 2015;82:518–29. nine effect in 2D-gel experiments. Electrophoresis 2017;38:2603–9. [10] Cirz RT, O’Neill BM, Hammond JA, et al. Defining the Pseudomonas [19] Kindle LM, Kakadiaris IA, Ju T, et al. A semiautomated approach for aeruginosa SOS response and its role in the global response to the artefact removal in serial tissue cryosections. J Microsc 2011;241:200–6. antibiotic ciprofloxacin. 
J Bacteriol 2006;188:7101–10. [20] Seiler C, Reyes M. Displacement Vector Field Regularization for [11] Toval F, Guzmán-Marte A, Madriz V, et al. Predominance of Modelling of Soft Tissue Deformations; 2008. Available at: https:// carbapenem-resistant Pseudomonas aeruginosa isolates carrying blaIMP christofseiler.github.io/MasterThesis.pdf. [Accessed June 4, 2019]. and blaVIM metallo-b-lactamases in a major hospital in Costa Rica. J [21] Moy TI, Conery AL, Larkins-Ford J, et al. High throughput screen for Med Microbiol 2015;64:37–43. novel antimicrobials using a whole animal infection model. ACS Chem [12] Arganda-carreras I, Sorzano COS, Kybic J, et al. bUnwarp: Consistent Biol 2009;4:527. and Elastic Registration in ImageJ. Methods and Applications. Image [22] Appel R, Hochstrasser D, Roch C, et al. Automatic classification of two- (Rochester, NY); 2006. dimensional gel electrophoresis pictures by heuristic clustering analysis: a [13] Collins T. ImageJ for microscopy. Biotechniques 2007;43(S1):S25–30. step toward machine learning. Electrophoresis 1988;9:136–42. [14] Kamentsky L, Jones TR, Fraser A, et al. Improved structure, function and [23] Supek F, Peharec P, Krsnik-Rasol M, et al. Enhanced analytical power of compatibility for CellProfiler: modular high-throughput image analysis SDS-PAGEusingmachine learning algorithms. Proteomics 2008;8:28–31. software. Bioinformatics 2011;27:1179–80. [24] Castillejo MÁ, Fernández-Aparicio M, Rubiales D. Proteomic analysis [15] Ames GF, Prody C, Kustu S. Simple, rapid, and quantitative release of by two-dimensional differential in gel electrophoresis (2D DIGE) of the periplasmic proteins by chloroform. J Bacteriol 1984;160:1181–3. early response of Pisum sativum to Orobanche crenata. J Exp Bot [16] Virmani R, Sajid A, Singhal A, et al. The Ser/Thr protein kinase PrkC 2012;63:107–19. imprints phenotypic memory in Bacillus anthracis spores by phosphory- [25] Kush A, Raghava GPS. AC2DGel: analysis and comparison of 2D Gels. 
J lating the glycolytic enzyme enolase. J Biol Chem 2019;294:8930–41. Proteomics Bioinform 2008;01:043–6. [17] Arora G, Sajid A, Virmani R, et al. Ser/Thr protein kinase PrkC-mediated [26] Kelchtermans P, Bittremieux W, De Grave K, et al. Machine learning regulation of GroEL is critical for biofilm formation in Bacillus anthracis. applications in proteomics research: how the past can boost the future. Npj Biofilms Microbiomes 2017;3:7. Proteomics 2014;14:353–66. 10 100 CHAPTER 4 A first Pseudomonas aeruginosa perturbome: Identification of core genes related to multiple perturbations by a machine learning approach Molina-Mora, J., Montero-Manso, P., Batán, R. G., Sánchez, R. C., Fernández, J. V., & García, F. (2020). A first Pseudomonas aeruginosa perturbome: Identification of core genes related to multiple perturbations by a machine learning approach. BioRxiv, 2020.05.05.078477. https://doi.org/10.1101/2020.05.05.078477 101 Summary Tolerance to stress conditions is vital for organismal survival, including bacteria under specific environmental conditions, antibiotics and other perturbations. Some studies have described common modulation and shared genes during stress response to different types of disturbances (termed as perturbome), leading to the idea of a central control at the molecular level. We implemented a robust machine learning approach to identify and describe genes associated with multiple perturbations or perturbome in a Pseudomonas aeruginosa PAO1 model. Using public transcriptomic data, we evaluated six approaches to rank and select genes: using two methodologies, data single partition (SP method) or multiple partitions (MP method) for training and testing datasets, we evaluated three classification algorithms (SVM Support Vector Machine, KNN K-Nearest neighbor and RF Random Forest). Gene expression patterns and topological features at systems level were include to describe the perturbome elements. 
We were able to select and describe 46 core response genes associated with multiple perturbations in Pseudomonas aeruginosa PAO1, which can be considered a first report of the P. aeruginosa perturbome. Molecular annotations, patterns in expression levels and topological features in molecular networks revealed biological functions of biosynthesis, binding and metabolism, many of them related to DNA damage repair and aerobic respiration in the context of tolerance to stress. We also discuss different issues related to the implemented and assessed algorithms, including normalization analysis, data partitioning, classification approaches and metrics. Altogether, this work offers a different and robust framework to select genes using a machine learning approach.

A first Pseudomonas aeruginosa perturbome: Identification of core genes related to multiple perturbations by a machine learning approach

Jose Arturo Molina Mora (corresponding author)
Research Center in Tropical Diseases (CIET), University of Costa Rica, Costa Rica
Email: jose.molinamora@ucr.ac.cr

Pablo Montero-Manso
Department of Mathematics, University of A Coruña, Spain
Email: pmonteromanso@udc.es

Raquel García Batán
Research Center in Tropical Diseases (CIET), University of Costa Rica, Costa Rica
Email: raquel.garcia@ucr.ac.cr

Rebeca Campos Sánchez
Research Center in Cellular and Molecular Biology (CIBCM), University of Costa Rica, Costa Rica
Email: rebeca.campos@ucr.ac.cr

Jose Vilar Fernández
Department of Mathematics, University of A Coruña, Spain
Email: josevilarf@udc.es

Fernando García Santamaría
Research Center in Tropical Diseases (CIET), University of Costa Rica, Costa Rica
Email: fernando.garcia@ucr.ac.cr

Abstract

Tolerance to stress conditions is vital for organismal survival, including bacteria under specific environmental conditions, antibiotics and other perturbations. Some studies have described common modulation and shared genes during stress response to different types of disturbances (termed the perturbome), leading to the idea of a central control at the molecular level. We implemented a robust machine learning approach to identify and describe genes associated with multiple perturbations (the perturbome) in a Pseudomonas aeruginosa PAO1 model. Using microarray datasets from the Gene Expression Omnibus (GEO), we evaluated six approaches to rank and select genes: using two methodologies, a single data partition (SP method) or multiple partitions (MP method) for training and testing datasets, we evaluated three classification algorithms (SVM Support Vector Machine, KNN K-Nearest neighbor and RF Random Forest). Gene expression patterns and topological features at systems level were included to describe the perturbome elements. We were able to select and describe 46 core response genes associated with multiple perturbations in Pseudomonas aeruginosa PAO1, which can be considered a first report of the P. aeruginosa perturbome. Molecular annotations, patterns in expression levels and topological features in molecular networks revealed biological functions of biosynthesis, binding and metabolism, many of them related to DNA damage repair and aerobic respiration in the context of tolerance to stress. We also discuss different issues related to the implemented and assessed algorithms, including data partitioning, classification approaches and metrics. Altogether, this work offers a different and robust framework to select genes using a machine learning approach.

Key words: Perturbations, Pseudomonas aeruginosa, machine learning, gene selection, perturbome.

1. Introduction

Cell stress can be defined as a wide range of molecular changes that cells undergo in response to environmental, physical, chemical or biological stressors; sensing and responding to stress is critical for survival [1].
These biological functions and metabolic activities are executed through complex physical and regulatory interactions of genes that resemble networks [2]. Additionally, tolerance to stress conditions (i.e. stressors, perturbations or disturbances) is vital for organismal survival, including bacteria under diverse environmental conditions, including antibiotics [1].

Several studies have revealed diverse molecular levels that can explain the general response to disturbances in many organisms. However, detailed mechanisms related to responses to perturbations remain poorly understood [3]. Available reports usually focus on specific stressors and relatively few studies have focused on common, central and universal determinants affected by multiple perturbations [3]. This concept has recently been referred to as the perturbome [4,5]. For example, in eukaryotic organisms, including plants and human cancer models, some studies have described diverse stress-response genes as common modulators for different types of disturbances, suggesting a central control mechanism [2,4–6]. In prokaryotic models, similar findings have been reported for Escherichia coli [3,7].

Additionally, comprehensive study of gene interactions allows the identification of functional relationships among genes [8], their products and the underlying biological phenomena that are critical to understand phenotypes under different biological conditions [9,10]. In the context of cell stress, the response to different environmental or experimental stimuli can be recognized by distinct gene expression patterns. This can be inferred from transcriptomic profiling data and functional associations using high throughput molecular technologies such as microarrays or RNASeq [2]. However, a challenge with these technologies is the large amount of highly complex data generated. Specialized bioinformatics analyses are required to reduce noise and select the relevant information that distinguishes the molecular determinants of particular biological conditions. Thus, a primary objective of transcriptomic profiling is to find an optimal subset of genes that could be used to characterize and classify unknown samples [11]. This gene selection is neither obvious nor simple, given the thousands of genes to select from [12].

To study the central response determinants to different perturbations in a living organism, we used the model Pseudomonas aeruginosa PAO1 (reference strain). P. aeruginosa is a Gram-negative gamma-proteobacterium with a noteworthy metabolic versatility and adaptability, enabling colonization of diverse niches and infection of plants, animals and humans alike [8,13]. In this species, the molecular mechanisms associated with most biological processes remain unclear, which limits the ability to modulate responses, including the susceptibility to stressors, environmental factors and experimental conditions.

Several studies have used machine learning algorithms at the transcriptomic level to recognize gene expression patterns [2,14–16]. However, for many common biological contexts, the applicability and utility of these machine learning approaches have not been fully explored [17], for example in the exposure to multiple stressors and the molecular response. To our knowledge, only a few studies have used feature selection methods on biological data to describe the effects of multiple perturbations in complex biological systems [4,6], and so far none in P. aeruginosa. A related work in P. aeruginosa used a machine learning approach to identify sets of genes that correlate with multiple culture media, but without other conditions [18].

The use of data from microarrays and other high throughput technologies involves challenges for machine learning approaches. These include the curse of dimensionality [19,20], normalization of raw values to compare samples [21,22], data partitions for training and testing models [23,24], and evaluation of performance [21,25]. Although the results of comparisons between machine learning algorithms vary widely [11,17,20,26,27], Support Vector Machine (SVM) [28], K-Nearest neighbor (KNN) [27] and Random Forest (RF) [29] have been successfully used with microarray gene expression data, allowing the recognition of emerging patterns [26].

Here, we hypothesize that perturbations on living cells will trigger global reprogramming of multiple molecular determinants that can be sensed at the transcriptional level. The initial response after an acute stress will then expand, producing the global molecular rearrangement. Then, pleiotropic and specific effects on gene networks will be reflected as changes in gene expression profiles and in the complexity of molecular regulation at other levels. Therefore, by using a machine learning approach, common molecular features (for all stressors) could be identified as a central or core determinant (see Figure 1-A-B).

Thus, the aims of this study were (i) to implement a machine learning approach to select genes from microarray expression data, and (ii) to identify and describe a subset of genes that can be associated with multiple perturbations (the perturbome) in P. aeruginosa, i.e. the core response components.

Figure 1. General pipeline to identify core response genes in P. aeruginosa by a machine learning approach. (A) Schematic representation of the hypothesis for identifying core response determinants when bacteria are exposed to multiple perturbations. (B) Workflow of the machine learning approach using microarray data and model fitting by SP and MP methods for identifying and describing core response genes. (C) Representation of data partition methods, SP and MP, including subsamples for testing and an internal 10-fold cross validation for the training data set.
2. Materials and Methods

The overall methodology of the study is presented in Figure 1-B. In brief, after data selection, normalization and gene selection were run to define the perturbome. To achieve this, we considered two different procedures to split the data into training and testing datasets for the machine learning approaches (SVM, KNN and RF): a single partition (SP method) and multiple partitions (MP method). Relations between genes were represented using both large scale and small world networks, and a final comparison between conditions, an analysis of differential expression and gene annotation were performed.

2.1 Datasets

In order to compare gene expression profiles of strain P. aeruginosa PAO1 when exposed to multiple perturbations, the GEO database (https://www.ncbi.nlm.nih.gov/geo/) was used for a systematic selection of datasets. An initial search by organism and mRNA profiles on the GPL84 platform (Affymetrix Pseudomonas aeruginosa PAO1 Array, with all 5549 protein-coding sequences) identified 156 series of datasets with 1310 samples (Date of Access: January 25th 2018). In a second step, series were kept only if their experimental design included perturbations, leaving 47 series. Finally, to make datasets comparable by experimental conditions, series were selected only if they had similar culture conditions (Luria Bertani LB medium and exponential phase when measuring the mRNA profile) and if a control condition was available (without any perturbation or treatment). The final dataset was composed of 10 series with 71 samples (Series GSE2430, GSE3090, GSE4152, GSE5443, GSE7402, GSE10605, GSE12738, GSE13252, GSE14253 and GSE36753).
Some series included temporal measurements, which we considered as separate perturbations, resulting in replicates of 10 controls and 14 perturbations: azithromycin with 2 series (AZM-a and AZM-b) [30,31], hydrogen peroxide (H2O2) [32], copper (Cu) [33], sodium hypochlorite (NaClO) [34], ortho-phenylphenol (OPP) at 20 and 60 minutes [35], colistin (COL) [36], chlorhexidine diacetate (CDA) at 10 and 60 minutes [37], E-4-bromo-5-bromomethylene-3-methylfuran-2-5H-one (BF8) [38] and ciprofloxacin (CIP) at 0, 30 and 120 minutes [39].

2.2 Pre-processing and comparison of global transcriptomic profiles

To compare all 71 samples, the first analysis was a pre-processing step using Bioconductor 3.8 (https://www.bioconductor.org/) in the R software (Version 3.5) with classical functions for microarrays. The Robust MultiArray Average (RMA) algorithm in the Affy package was used for background correction, normalization, and summarization. Subsequently, clustering algorithms were implemented in order to compare global transcriptomic profiles between perturbations and controls. Principal Component Analysis (PCA) and Hierarchical Clustering (HC) were run with default parameters using the Caret Package (caret.r-forge.r-project.org/) in R software.

In order to robustly select a number of genes that could separate experimental conditions (controls and perturbations) and to identify the core response of P. aeruginosa, two feature selection protocols were implemented, as detailed below.

2.3 Gene ranking and selection by Single Partition (SP) method

With the aim of identifying genes whose expression values were commonly related to multiple perturbations, a first approach was implemented considering a particular partition of the dataset into training and testing sets (Figure 1-C). The single partition was established using the last replicate of each experiment, in both control and perturbation.
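The hold-out rule just described can be sketched in a few lines. This is only an illustration in Python (the study itself worked in R), assuming each sample is labelled with its experimental condition and a replicate number; the function name and the condition names below are hypothetical.

```python
from typing import Dict, List, Tuple

Sample = Tuple[str, int]  # (experimental condition, replicate number)


def single_partition(samples: List[Sample]) -> Tuple[List[Sample], List[Sample]]:
    """Hold out the last (highest-numbered) replicate of every condition."""
    last_replicate: Dict[str, int] = {}
    for condition, replicate in samples:
        last_replicate[condition] = max(replicate, last_replicate.get(condition, replicate))
    train = [s for s in samples if s[1] != last_replicate[s[0]]]
    test = [s for s in samples if s[1] == last_replicate[s[0]]]
    return train, test


# Hypothetical sample sheet: one control and two perturbation conditions.
samples = [
    ("control", 1), ("control", 2), ("control", 3),
    ("CIP_30min", 1), ("CIP_30min", 2),
    ("CIP_120min", 1), ("CIP_120min", 2), ("CIP_120min", 3),
]
train, test = single_partition(samples)
print(len(train), "training samples;", len(test), "testing samples")
```

Under this rule the testing set always contains exactly one sample per condition, whatever the number of replicates.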
Because there were 14 perturbations and 10 controls (71 samples including replicates), a total of 24 samples were included in the testing dataset and the remaining 47 samples in the training dataset (66% for training and 34% for testing).

Using this partition, genes were ranked by a machine learning approach using three classification algorithms: SVM, KNN and RF. A custom R script called these functions through the Caret package. For all three algorithms, default parameters were used for training and 10-fold cross validation (Figure 1-B) was included. After this, a variable importance metric was calculated for all genes using the varImp function, associating a specific value with each gene. For SVM and KNN, the same importance is calculated because the function is model-free for these cases (as detailed in the Caret package), so both algorithms yield the same ranked list of genes, although performance metrics are calculated separately for each algorithm.

As similarly reported [11], to evaluate the number of genes that should be selected in the top group (the first K ranked genes, as candidates for the core response by each algorithm), multiple classification models were systematically run, starting with only the highest-ranked gene. In brief, after training with one gene, model performance was evaluated by calculating metrics on the testing dataset. Then, training was run again with the next ranked gene added, and new metrics were calculated. This process was repeated until all the ranked genes had been included. Metrics included accuracy (correct classification percentage), kappa value (inter-rater classification agreement), sensitivity, specificity, precision, recall, prevalence and F1 score (harmonic mean of precision and recall).
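As an illustration of how these metrics follow from a two-class confusion matrix, the sketch below computes them in plain Python. It is not the authors' R/Caret code; the function name, the label strings and the example predictions are assumptions made for this example. Sensitivity and recall are the same quantity, so only one is computed.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[str], y_pred: Sequence[str],
                   positive: str = "perturbation") -> Dict[str, float]:
    """Classification metrics for a two-class problem from true and predicted labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    n = tp + fp + tn + fn
    accuracy = (tp + tn) / n
    # Cohen's kappa: observed agreement corrected for agreement expected by chance.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - expected) / (1 - expected) if expected < 1 else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # also called recall
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else 0.0)
    return {
        "accuracy": accuracy, "kappa": kappa, "sensitivity": sensitivity,
        "specificity": specificity, "precision": precision,
        "prevalence": (tp + fn) / n, "f1": f1,
    }


# Hypothetical test set shaped like the SP partition: 14 perturbations, 10 controls.
y_true = ["perturbation"] * 14 + ["control"] * 10
y_pred = ["perturbation"] * 12 + ["control"] * 2 + ["control"] * 9 + ["perturbation"]
metrics = binary_metrics(y_true, y_pred)
print({name: round(value, 3) for name, value in metrics.items()})
```

In this invented example 12 of 14 perturbations and 9 of 10 controls are classified correctly, so accuracy is 21/24 = 0.875 and kappa is about 0.75.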
The K value of top genes was selected according to the following criteria: (i) stability of the metrics (with priority given to accuracy, kappa and F1) as ranked genes were added, as suggested in [11]; (ii) a consensus value suitable for all three algorithms (allowing a 10% tolerance); and (iii) the smallest possible number of genes. After the K value was selected, the ROC (receiver-operating characteristic) curve and the AUC (area under the curve) value were calculated for each algorithm. Finally, the top K genes selected by each algorithm were compared in terms of metrics and gene lists.

2.4 Gene ranking and selection by Multiple Partitions (MP) method

To identify genes related to multiple perturbations independently of a single, specific partition, a second method using multiple random partitions was implemented (Figure 1-C). A random selection of data for the training and testing sets was made using the createDataPartition function. The partition was set to 80% (57 samples) for the training set and the remainder (14 samples) for the testing set, with experimental conditions equally distributed. Then, the SVM, KNN and RF protocols (under the same conditions described for the SP method, with 10-fold cross validation) were run, ending with a final ranking of genes obtained with the varImp function. Using only the top K ranked genes (with K determined by the criteria described for the SP method), new training/testing sets were used to evaluate the performance of the models, and each metric was stored together with the list of the K ranked genes. This full process was automatically repeated 100 times using the replicate function, each run starting with a new random partition and finishing with the list of K genes and the metrics associated with that partition.
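The repeated-partition loop of the MP method, together with the per-gene frequency count used to summarize the runs, can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the study's R script: `stratified_split` only approximates what createDataPartition does, `rank_genes` uses the absolute difference of class means as a stand-in for varImp, model fitting and test-set evaluation are omitted, and all names and toy values are hypothetical.

```python
import random
from collections import Counter


def stratified_split(labels, train_fraction, rng):
    """Random train/test indices that keep class proportions similar."""
    by_class = {}
    for i, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(i)
    train, test = [], []
    for idx in by_class.values():
        idx = idx[:]
        rng.shuffle(idx)
        cut = round(train_fraction * len(idx))
        train += idx[:cut]
        test += idx[cut:]
    return sorted(train), sorted(test)


def rank_genes(expr, labels, train_idx):
    """Stand-in importance (assumption): |difference of the two class means|."""
    classes = sorted(set(labels))  # assumes exactly two classes
    scores = {}
    for gene, values in expr.items():
        means = []
        for c in classes:
            vals = [values[i] for i in train_idx if labels[i] == c]
            means.append(sum(vals) / len(vals))
        scores[gene] = abs(means[0] - means[1])
    return sorted(scores, key=scores.get, reverse=True)


def top_k_frequency(expr, labels, k, n_runs=100, train_fraction=0.8, seed=0):
    """Count how often each gene enters the top K across random partitions."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_runs):
        train_idx, _test_idx = stratified_split(labels, train_fraction, rng)
        counts.update(rank_genes(expr, labels, train_idx)[:k])
    return counts


# Toy expression matrix (hypothetical values): gene -> one value per sample.
labels = ["control"] * 5 + ["perturbation"] * 5
expr = {
    "geneA": [1.0, 1.1, 0.9, 1.0, 1.2, 5.0, 5.1, 4.9, 5.2, 5.0],
    "geneB": [2.0, 2.1, 1.9, 2.0, 2.2, 2.1, 2.0, 2.2, 1.9, 2.1],
    "geneC": [3.0, 3.2, 2.8, 3.1, 2.9, 3.0, 2.9, 3.1, 3.2, 2.8],
}
freq = top_k_frequency(expr, labels, k=1, n_runs=100)
```

Ranking by frequency across partitions rather than by a single importance score is what makes the final list less dependent on any one split of the data.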
Finally, for each algorithm, the data from all runs were analyzed to determine how frequently each gene appeared (table function) and to calculate the average and dispersion of the metrics across the 100 runs. After this comparison by frequency, the definitive list of the K most frequent genes was established for each algorithm.

2.5 Identification of core response genes

After the top K genes had been selected for each algorithm by the SP and MP methods, the gene lists were compared with Venn diagrams, built with Venn-tool (http://bioinformatics.psb.ugent.be/webtools/Venn/), in order to identify all candidate genes. Genes identified by at least four of the six algorithm implementations (three algorithms in each of the two methods) were considered part of the perturbome; this guarantees that a gene was identified by both methods and by at least two different classification algorithms. Gene relationships were represented as molecular networks using a large scale model (a top-down systems biology approach). The network was built from protein-protein interaction (PPI) graphs in the PseudomonasNet database (www.inetbio.org/pseudomonasnet/), then downloaded and visualized with Cytoscape (https://cytoscape.org/).

2.6 Description and comparison of core response genes

To describe and compare the genes associated with the core response of P. aeruginosa across experimental conditions, four analyses were carried out. First, clustering by PCA and HC was evaluated again, this time considering only the genes of the core response. Based on the distribution observed in the PCA, centroids were represented using the Kmeans algorithm. Second, using the PseudomonasNet database, a small world network was built with the genes of the core response and then exported into Cytoscape. Information on topological features (including connectivity) and the expression levels of the kmedoids were incorporated into different versions of the network.
Third, a classic analysis of differentially expressed genes (DEGs, p<0.05) was implemented in R with the limma package (https://www.rdocumentation.org/packages/limma/versions/3.28.14), using the empirical Bayes moderated t-statistic (eBayes) with the Benjamini and Hochberg method for p value correction [40]. This allowed us to compare our results with a classical approach to gene expression analysis.

Finally, to give a biological interpretation to the selected genes, to determine the expression levels reported in databases and to study the metabolic pathways involved under each perturbation, an exhaustive annotation was made using PseudomonasDB (gene ontology), the GEO database and the specific literature. This information was integrated with the results of all the analyses and with the DEGs in order to fully describe the genes that make up the core response, or perturbome, of P. aeruginosa PAO1.

Figure 2. Normalization and comparison of samples by global profiles using all genes. (A) Dispersion of sample intensities, showing a similar distribution between samples. (B-C) Global profiles were compared by both PCA and HC clustering algorithms, showing mixed patterns between classes.

3. Results

3.1 Perturbome genes of P. aeruginosa can be identified by a machine learning approach using SP and MP methods

A total of 71 samples from 10 controls and 14 perturbations were considered for the study, with comparable expression levels (Figure 2-A). Global transcriptomic profiles (all 5549 genes) were compared by both PCA and HC algorithms, revealing a mixed pattern (no separation) between perturbations and controls. Two samples (BF8 and Control 8) showed extreme global profiles (Figure 2 B-C). Two machine learning methods (SP and MP) were implemented in order to robustly rank and select genes associated with multiple perturbations in P. aeruginosa. In each method, three classification algorithms were evaluated: SVM, KNN and RF.
Metrics for RF are shown in Figure 3, and those for SVM and KNN in supplementary Figure S1.

The first method (SP) ranked genes by variable importance using a single, specific data partition. After the ranking was established, multiple classification models were run with the ranked genes (Figure 3-A and supplementary Figure S1-A-C). For each classification model, the stability of the three metrics was evaluated in order to select a K value of genes suitable for all algorithms at the same time. For SVM and RF, the metrics were stable from at least the first 51 ranked genes onward, whereas for KNN this occurred at 45 genes. Allowing a 10% tolerance over the highest of these values (51 × 1.1 ≈ 56), K=56 was selected as the number of top genes included as preliminary candidates for the core response according to each algorithm. With this value, the metrics of each algorithm were compared (Table S1). For example, accuracy in the SP method was 0.79, 0.71 and 0.75 for SVM, KNN and RF, respectively. SVM performed better according to kappa, sensitivity, recall and F1 scores, whereas RF gave higher specificity and precision. The ROC curve and AUC value were also calculated (Figure 3-B and supplementary Figure S1-B-D). The best performance was obtained for RF with AUC = 0.92, followed by SVM with 0.82 and KNN with 0.76. Since the importance metric for SVM and KNN is the same, they shared the same list of top 56 genes. Comparison between implementations showed that 21 genes were identified by all three algorithms, 35 exclusively by RF and the same number exclusively by SVM/KNN. In total, 91 genes were identified by at least one of the algorithms. The list of genes and the importance values of the top 56 ranked genes are presented in Figure 3-C for RF and Figure S1-E for SVM/KNN.

Figure 3. Evaluation of SP method for gene ranking by importance, case for RF algorithm.
(A) Accuracy, F1 and kappa values after iterations of classification with the top 200 genes (adding genes one at a time). (B) ROC plot using the selected top 56 genes to evaluate the performance of the algorithm. (C) Ranking and importance value of the top 56 genes. Similar results are shown for the SVM and KNN algorithms in supplementary Figure S1.

In a second approach, the MP method was implemented using multiple random partitions. The same SVM, KNN and RF algorithms were evaluated by running 100 iterations with different partitions, and the 56 most frequent genes for each algorithm were selected. Details of ranking and frequency are shown in Figure 4-A (SVM) and supplementary Figure S2 (KNN and RF), and metrics for all 100 iterations are presented in Table S1, Figure 4-B and supplementary Figure S2. Accuracies were 0.66, 0.69 and 0.70 for SVM, KNN and RF, respectively. Specific values for kappa, precision, recall and F1 score were obtained for each algorithm. When the gene lists were compared, 29 genes were identified by all implementations at the same time, and 23 by both SVM and KNN.

To identify preliminary core response genes in P. aeruginosa, the lists of top 56 genes selected by each algorithm and method were represented jointly in a Venn diagram (Figure 5-B). Note that the same set was used for SVM and KNN, since both procedures lead to the same genes. A total of 118 different genes were identified, 15 of which were identified simultaneously by all the algorithms. The distribution of all 118 genes on the large scale molecular network is presented in Figure 5-A. The results show that the selected genes are connected but do not form a defined cluster; instead, they seem to be associated with different subgroups of highly connected genes. The final version of the core perturbome was established by selecting the genes recognized by at least four algorithms. A total of 46 genes were finally associated with the core response.
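The consensus rule used here (keep a gene when at least four of the six algorithm/method lists contain it) amounts to a simple membership count. A minimal Python sketch follows; the function name `consensus_genes`, the list labels and the placeholder gene names are hypothetical and do not correspond to the genes or lists reported in this study.

```python
from collections import Counter


def consensus_genes(gene_lists, min_support=4):
    """Return genes present in at least `min_support` of the given lists."""
    counts = Counter()
    for genes in gene_lists.values():
        counts.update(set(genes))  # each list votes once per gene
    return sorted(g for g, c in counts.items() if c >= min_support)


# Toy lists with placeholder gene names (not the study's results).
gene_lists = {
    "SP_SVM": ["geneA", "geneB", "geneC"],
    "SP_KNN": ["geneA", "geneB", "geneC"],
    "SP_RF": ["geneA", "geneC", "geneD"],
    "MP_SVM": ["geneA", "geneB", "geneE"],
    "MP_KNN": ["geneA", "geneB", "geneF"],
    "MP_RF": ["geneA", "geneD", "geneE"],
}
core = consensus_genes(gene_lists, min_support=4)
```

Because each method contributes only three lists, a threshold of four can only be reached by a gene that appears in lists from both methods, which is the property the selection rule relies on.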
In the small world network representation, the relationships between the 46 genes revealed different topological patterns of connectivity between molecules, with nuoC and nuoF showing the highest connectivity (connection degree). In addition, only six genes had no connections to the others (Figure 5-C).

Figure 4. Evaluation of MP method for gene ranking by frequency, case for SVM algorithm. (A) Ranking of top 56 genes by frequency after iterations of 100 data partitions and classification model fitting. (B) Dispersion of metrics across 100 iterations. Similar results are shown for KNN and RF algorithms in supplementary Figure S2.

Figure 5. Identification, systems level description and global profile comparison given by core response genes. (A) Distribution of the 118 preliminary selected genes, and the number of algorithms that identified each of them, on a basal large scale network of functional associations in P. aeruginosa. (B) Set comparison of the genes given by the classification algorithms of the SP and MP methods, with a final selection of 46 genes that were identified by at least 4 algorithms. (C) Small world network showing relationships between the 46 core response genes and the connectivity metric for each gene. (D-E) Global profile comparison of samples by both PCA and HC clustering algorithms, showing separation of conditions. For reference, the centroid of each cluster is plotted as a triangle. More details are shown in supplementary Figure S3.

3.2 Comparisons of core response genes show separation of global transcriptomic profiles according to experimental conditions and biological functions related to tolerance to stress

After the core response genes were selected, controls and perturbations were compared to characterize these genes by global profiles. First, clustering algorithms were run to compare global profiles using the 46 genes of the perturbome. In the case of PCA (Figure 5-D), the distribution of samples made it possible to differentiate controls from perturbations.
A Kmedoid algorithm (k = 2) identified two clusters enriched in samples of each condition: one consisting entirely of perturbation samples (11 samples, blue color) and the other consisting mostly of control samples (10 controls and 3 perturbations, red color). For the blue cluster, the kmedoid was sample CIP-120min, whereas control 6 was selected for the other cluster. In the case of HC (Figure 5-E), the same distribution of samples was obtained. A supplementary analysis of gene expression was included, comparing levels for all core response genes (Figure S3-A) and comparing the expression levels of the kmedoids on the small world network (Figure S3-B-C).

Figure 6. Annotation of core response genes and comparison with a differential expression analysis. (A) General annotation profile of the core response genes, showing associated biological processes. (B) Comparison of the genes identified by our machine learning approach and the list of DEGs. (C) General annotation of DEGs, showing a profile similar to that of our approach. Specific annotation per gene is shown in supplementary Table S2.

Gene annotation revealed that the biological processes related to most of the genes are metabolism, molecule binding and biosynthesis (Figure 6-A). Specific information about participation in DNA damage response, DNA repair and response to general/specific stimuli was also searched for each gene, showing that most of the genes participate in such processes. Literature support shows variable patterns of expression depending on the perturbation, as shown in supplementary Table S2. Finally, to compare the results of the machine learning strategy with another approach, a differential expression analysis was run. A total of 101 DEGs were identified, of which 33 were shared with the core response (Figure 6-B). This means that 72% of the core response genes (33 of 46) were also identified by a single, independent method.
The annotation profile of the DEGs (Figure 6-C) showed a pattern similar to that of our machine learning approach.

4. Discussion

Living organisms face external and internal conditions that compromise cellular functions at the molecular, metabolic and structural levels, disrupting their homeostasis [41,42]. The cell stress response is crucial for organismal survival, and complex networks are usually involved in the molecular mechanisms related to tolerance [1]. However, few studies have identified a central and possibly universal regulation of the response to multiple disturbances, a concept termed the perturbome [4,5]. A common molecular response was previously reported as a network of a common set of genes and pathways that can be generically associated with multiple perturbations in plants [7], pathogenic bacteria [3] or cell line models [4], among others. In our approach using P. aeruginosa, which is applicable to other organisms, we hypothesized the existence of a set of core genes that regulate the response to stressors in a generic sense across different pathways. P. aeruginosa has a high proportion (about 5%) of its genome dedicated to regulatory mechanisms, which probably explains its adaptability to such a broad range of growth conditions [25]. Since P. aeruginosa PAO1 is a clinical isolate with a profile of multiresistance to many antibiotics [43], characterizing the molecular mechanisms involved in tolerance to stressors in this strain could eventually help to modulate sensitivity and overcome resistance. Exhaustive integration of -omics data and network analysis is required in order to clarify the molecular mechanisms related to stress conditions and eventually use them to modulate the cell response.

4.1 Insights of algorithms to identify core response genes or perturbome

In our study, the initial global transcriptomic profiles showed mixed patterns, with samples not separating according to their experimental condition.
Because of its complexity, raw microarray data contained noise and redundant information, which can explain the poor resolution of these clustering algorithms in identifying classes [20]. Thus, a feature (gene) selection analysis was implemented not only to identify genes associated with multiple perturbations in P. aeruginosa, but also to improve the performance of predictive/descriptive models capable of separating the control and perturbation categories. In our case, these potential patterns were investigated with a robust machine learning approach by implementing six protocols, using the SP and MP methods and the SVM, KNN and RF algorithms in each case. This was a critical step because gene selection from microarray data is complex per se. Feature reduction remains a challenging task in transcriptomic studies because there are thousands of genes to select from, and it introduces an additional layer of complexity into the modelling task [12,44]. To avoid bias and overfitting, diverse data partitioning strategies such as bootstrapping, random partitions and cross validation are recommended. In general, these methodologies can robustly reduce variance and the influence of noise, outliers and the absence of ground truth sets [2,24,45].

The single partition (SP) method consisted of one particular, fixed dataset for training (with internal cross validation) and another for testing, and it is probably the most common approach used in machine learning. In the multiple partitions (MP) method, 100 random partitions of the dataset were run. The MP method split the data at two levels (multiple partitions and the internal cross validation). This method can be considered an ensemble based on different data partitioning, as previously proposed [23]. Datasets were divided using multiple random partitioning procedures and then genes were ranked.
After all runs, a final feature subset is determined by calculating the frequency of features across all the runs [23]. An equivalent approach was implemented by Pai and collaborators to classify gene expression data in a cancer model [46]. However, one possible problem with the MP approach is that cross validation results may depend on the similarity of the testing and training sets. A classification method is only expected to learn how to predict on unseen samples that are drawn from the same distribution as the training samples [24,45], and the MP method could violate this assumption. The SP method guarantees it, because its testing set was always built using one replicate of each perturbation and control. Thus, both the SP and MP methods are required to robustly select features.

On the other hand, many algorithms for dimension reduction have been proposed [6,19,20], but no standard machine learning algorithm can be singled out, because evaluations yield highly variable performance metrics [26]. Many studies have shown that SVM, KNN and RF generally outperform other traditional supervised classifiers [17,26,47]. In our case, a variable pattern was found for the different metrics in all evaluated implementations. For example, based on accuracy, SVM for the SP method and RF for the MP method obtained the highest scores, which agrees with other studies using machine learning and biological data [26,46]. In our subsequent analysis, we were interested in identifying a consensus list of candidate genes for the core response in P. aeruginosa, which resulted in the 46 genes of the perturbome. When global profiles were compared using these genes (PCA and HC), the control and perturbation classes were clearly separated (Figure 5-D). This is a modest number of elements (less than 1% of the full dataset of 5549 genes), but it is in line with other studies, including machine learning methods [11,17] and other approaches [11,22,48,49].
In addition, 72% of the core response genes were also identified as DEGs with similar annotation profiles. The differences can be explained by significant fluctuations in differential expression results, as previously reported, mainly because differential expression is not a consensus strategy (it is based only on the p-value) and does not incorporate estimates of test performance (true positive/negative rates and other metrics) into the results [2].

4.2 Biological insights of the core response genes in P. aeruginosa: the perturbome

Core response genes, or the perturbome, can be related to a central regulation network and act as a convergence point of signal transduction, transcriptional regulation and stress-related pathways, as has been suggested [2,4,5,42]. Annotation of the 46 genes shows that most of them are functionally related to biosynthesis, molecule binding and metabolism (including an important number of hits for lipids), with additional functions in the regulation of DNA damage repair, response to stimuli and aerobic respiration. Interestingly, these processes are represented by genes with high connectivity in the small world network. For example, the main functions associated with fabA are lipid metabolism and fatty acid and lipid biosynthesis, whereas those of nuoC are aerobic respiration, electron/proton transport and DNA damage. Finally, nuoF is associated with aerobic respiration, electron/proton transport and the Krebs cycle. In this sense, cells are equipped with systems and mechanisms to recover from environmental stress and stimuli and to maintain all necessary physiological functions [50]. Other common stimuli such as low ATP, slow growth and ROS production can also occur before cells express stress-specific factors, while mediating a common effect. For example, the response to stress includes modulation of energy production and of aerobic electron transfer chain components. As it has been reported in E.
coli, aerobic electron transport chain components are down-regulated in response to growth arrest [51]. This is consistent with the global expression profiles of the 46 core response genes. Regulation of lipid metabolism is also relevant for survival in the wide range of environmental conditions in which bacteria thrive [52], even for biofilm-living forms [3] such as P. aeruginosa. The core response genes fabG, lpxA and PA5174 could be involved in this process. In the case of DNA damage repair (including the cycB and gltP genes), responses mediated by SOS and rpoS help to maintain genome integrity, colonization and virulence [39,53]. These responses are activated under multiple disturbances and, by lowering energy production and shutting down metabolism, promote antibiotic resistance and biofilm formation [3]. Other related pathways for some specific genes included the regulation of transcription during stress by RNA-binding proteins, in order to reprogram or shut down translation and to rescue ribosomes stalled by a variety of induced mechanisms [54]. Three core response genes (rpmH, tsf and PA2735) were annotated with these functions. Taken together, the relatively low diversity of metabolic functions and pathways makes sense as a way to ensure redundancy and robustness in the response to stimuli.

Similar results regarding enriched pathways have been obtained in other studies with eukaryotic models, including perturbed human cell lines [4,5,42], Arabidopsis thaliana under physical and genotoxic stresses [2] and a genome-wide association study of a generic response to stress conditions [6]. In the case of prokaryotic organisms, two studies have used Escherichia coli as a model to identify differentially expressed genes after exposure to stress conditions [3] and to create networks associated with the response to fluctuating environments [7].
Differences with other organisms and disturbances may suggest that the cell stress response is organism-specific, although heterogeneity has also been suggested as a reasonable explanation, given the differences in response within an apparently homogeneous cell population [42,55]. As in our case, the response to stimuli and stressors is orchestrated by a pleiotropic modulation [56] that can be associated with a central regulation. Alternative mechanisms such as cross stress protection (the ability of one stress condition to provide protection against other stressors) [7], the role of sigma factors and specific two component systems [3] can contribute to explaining this phenomenon. The molecular response can regulate multiple biological activities, including metabolism, replication, transcription and translation, changes in membrane composition, motility, modification of gene expression, expression of virulence factors, multi-drug resistant phenotypes and biofilm formation, among others [42].

Taken together, the results of our study suggest that identifying the core response genes associated with multiple perturbations, or perturbome, in P. aeruginosa can define a central network available to modulate a basic response that includes biological functions such as biosynthesis, binding and metabolism, many of them related to DNA damage repair and aerobic respiration. To our knowledge, this study can be considered a first report of the P. aeruginosa perturbome. Further analyses are required to explore the potential use of the perturbome network to modulate (positively or negatively) the response to disturbances, to model molecular circuits, to identify possible biomarker genes of stressed states and to experimentally validate our findings. In addition, this approach can be used to model the perturbome in other P. aeruginosa strains, as we hope to do soon with a genome we recently described [13], and in other organisms.

5. Conclusions

A robust machine learning approach was implemented to identify and describe core response genes to multiple perturbations in P. aeruginosa. Using public microarray data, two independent partition strategies (single and multiple, in the SP and MP methods respectively) and three classification algorithms, we were able to identify 46 perturbome elements. Both network analysis and functional annotation of these genes showed coordinated modulation of biological processes in response to multiple perturbations, including metabolism, biosynthesis and molecule binding, associated with DNA damage repair and aerobic respiration, all probably related to tolerance to stressors, growth arrest and molecular regulation. We also discussed different issues related to implementing and assessing algorithms for normalization, data partitioning, classification and metrics.

Abbreviations

AUC: Area under the curve
AZM: Azithromycin
BF8: E-4-bromo-5-bromomethylene-3-methylfuran-2-5H-one
CDA: Chlorhexidine diacetate
CIP: Ciprofloxacin
COL: Colistin
Cu: Copper
DEGs: Differentially expressed genes
GEO: Gene Expression Omnibus
H2O2: Hydrogen peroxide
HC: Hierarchical Clustering
KNN: K-Nearest Neighbor
LB: Luria Bertani
mRNA: Messenger RNA
NaClO: Sodium hypochlorite
OPP: Ortho-phenylphenol
PCA: Principal Component Analysis
PPI: Protein-protein interaction
RF: Random Forest
ROC: Receiver-operating characteristic
SVM: Support Vector Machine

Declarations

Ethics approval and consent to participate
Not Applicable

Consent for Publication
Not Applicable

Availability of data and material
The datasets generated and/or analysed during the current study are available in:
Public raw data used in this study: GEO database (https://www.ncbi.nlm.nih.gov/geo/), data Series GSE2430, GSE3090, GSE4152, GSE5443, GSE7402, GSE10605, GSE12738, GSE13252, GSE14253 and GSE36753.
Normalized data and R Scripts: https://github.com/josemolina6/CoreResponsePae

Competing interests
No competing interests to declare.

Funding
This work was funded by projects B8114 Definición de la red transcripcional y de las alteraciones genómicas inducidas por la ciprofloxacina en Pseudomonas aeruginosa AG1 and B8152 proNGS 1.0: Implementación y evaluación de protocolos de análisis de datos de tecnologías NGS y afines para el estudio de sistemas biológicos complejos, Universidad de Costa Rica (period 2017-2020), Costa Rica.

Authors' contributions
JMM and FGS participated in the conception, design of the study and data acquisition. JMM, PMM, RGB and JVF participated in data analysis and interpretation. JMM, RCS, JVF and FGS participated in the interpretation of the data analysis. JMM drafted the manuscript and all authors were involved in its revision. All authors read and approved the final manuscript.

Acknowledgements
We thank students Iosif Forero Trelles, Daniel Solano Alvarado and Daniela Aguilar Orozco for their collaboration in different activities of the project.

References

1. DeLong, E.F. Prokaryotes: prokaryotic physiology and biochemistry; Springer, 2012; ISBN 9783642301407.
2. Ma, C.; Xin, M.; Feldmann, K.A.; Wang, X. Machine Learning-Based Differential Network Analysis: A Study of Stress-Responsive Transcriptomes in Arabidopsis. Plant Cell 2014, 26, 520–537, doi:10.1105/tpc.113.121913.
3. Nagar, S.D.; Aggarwal, B.; Joon, S.; Bhatnagar, R.; Bhatnagar, S. A Network Biology Approach to Decipher Stress Response in Bacteria Using Escherichia coli As a Model. Omi. A J. Integr. Biol. 2016, 20, 310–324, doi:10.1089/omi.2016.0028.
4. Caldera, M.; Müller, F.; Kaltenbrunner, I.; Licciardello, M.P.; Lardeau, C.H.; Kubicek, S.; Menche, J. Mapping the perturbome network of cellular perturbations. Nat. Commun. 2019, 10, doi:10.1038/s41467-019-13058-9.
5. Sadeh, S.; Clopath, C. Theory of Neuronal Perturbome: Linking Connectivity to Coding via Perturbations. bioRxiv 2020, 2020.02.20.954222, doi:10.1101/2020.02.20.954222.
6. Bermingham, M.L.; Pong-Wong, R.; Spiliopoulou, A.; Hayward, C.; Rudan, I.; Campbell, H.; Wright, A.F.; Wilson, J.F.; Agakov, F.; Navarro, P.; et al. Application of high-dimensional feature selection: Evaluation for genomic prediction in man. Sci. Rep. 2015, 5, 1–12, doi:10.1038/srep10312.
7. Dragosits, M.; Mozhayskiy, V.; Quinones-Soto, S.; Park, J.; Tagkopoulos, I. Evolutionary potential, cross-stress behavior and the genetic basis of acquired stress resistance in Escherichia coli. Mol. Syst. Biol. 2014, 9, 643–643, doi:10.1038/msb.2012.76.
8. Nogales, J.; Guðmundsson, S.; Duque, E.; Ramos, J.L.; Palsson, B.Ø. Expanding the computable reactome in Pseudomonas putida reveals metabolic cycles providing robustness. bioRxiv 2017, 139121, doi:10.1101/139121.
9. Kc, K.; Li, R.; Cui, F.; Haake, A.R. GNE: A deep learning framework for gene network inference by aggregating biological information. Bioinformatics 2018, 1–9.
10. Szklarczyk, D.; Gable, A.L.; Lyon, D.; Junge, A.; Wyder, S.; Huerta-Cepas, J.; Simonovic, M.; Doncheva, N.T.; Morris, J.H.; Bork, P.; et al. STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. 2019, 47, D607–D613, doi:10.1093/nar/gky1131.
11. Li, Y.; Wang, N.; Perkins, E.J.; Zhang, C.; Gong, P. Identification and optimization of classifier genes from multi-class earthworm microarray dataset. PLoS One 2010, 5, 1–9, doi:10.1371/journal.pone.0013715.
12. Saeys, Y.; Inza, I.; Larranaga, P. A review of feature selection techniques in bioinformatics. Bioinformatics 2007, 23, 2507–2517, doi:10.1093/bioinformatics/btm344.
13. Molina-Mora, J.-A.; Campos-Sánchez, R.; Rodríguez, C.; Shi, L.; García, F. High quality 3C de novo assembly and annotation of a multidrug resistant ST-111 Pseudomonas aeruginosa genome: Benchmark of hybrid and non-hybrid assemblers. Sci. Rep. 2020, 10, 1392, doi:10.1038/s41598-020-58319-6.
14. Zhao, W.; Chen, J.J.; Perkins, R.; Wang, Y.; Liu, Z.; Hong, H.; Tong, W.; Zou, W.; Metzker, M.; Didelot, X.; et al. A novel procedure on next generation sequencing data analysis using text mining algorithm. BMC Bioinformatics 2016, 17, 213, doi:10.1186/s12859-016-1075-9.
15. Cornforth, D.M.; Dees, J.L.; Ibberson, C.B.; Huse, H.K.; Mathiesen, I.H.; Kirketerp-Møller, K.; Wolcott, R.D.; Rumbaugh, K.P.; Bjarnsholt, T.; Whiteley, M. Pseudomonas aeruginosa transcriptome during human infection. Proc. Natl. Acad. Sci. U. S. A. 2018, 115, doi:10.1073/pnas.1717525115.
16. Glaab, E.; Bacardit, J.; Garibaldi, J.M.; Krasnogor, N. Using rule-based machine learning for candidate disease gene prioritization and sample classification of cancer gene expression data. PLoS One 2012, 7, doi:10.1371/journal.pone.0039932.
17. Leung, R.K.K.; Wang, Y.; Ma, R.C.W.; Luk, A.O.Y.; Lam, V.; Ng, M.; So, W.Y.; Tsui, S.K.W.; Chan, J.C.N. Using a multi-staged strategy based on machine learning and mathematical modeling to predict genotype-phenotype risk patterns in diabetic kidney disease: A prospective case-control cohort analysis. BMC Nephrol. 2013, 14, 1, doi:10.1186/1471-2369-14-162.
18. Tan, J.; Doing, G.; Lewis, K.A.; Price, C.E.; Chen, K.M.; Cady, K.C.; Perchuk, B.; Laub, M.T.; Hogan, D.A.; Greene, C.S. Unsupervised Extraction of Stable Expression Signatures from Public Compendia with an Ensemble of Neural Networks. Cell Syst. 2017, 5, 63–71.e6, doi:10.1016/j.cels.2017.06.003.
19. Kar, S.C. Comparing Prediction Accuracy for Supervised Techniques in Gene Expression Data. Math. Theory Model. 2014, 4, 108–116.
20. Raza, K.; Hasan, A. A Comprehensive Evaluation of Machine Learning Techniques for Cancer Class Prediction Based on Microarray Data. Int. J. Bioinform. Res. Appl. 2015, 11, 397–416, doi:10.1504/IJBRA.2015.071940.
21. Savli, H.; Karadenizli, A.; Kolayli, F.; Gundes, S.; Ozbek, U.; Vahaboglu, H. Expression stability of six housekeeping genes: a proposal for resistance gene quantification studies of Pseudomonas aeruginosa by real-time quantitative RT-PCR. J. Med. Microbiol. 2003, 52, 403–408, doi:10.1099/jmm.0.05132-0.
22. Casares, F.M. A Simple Method for Optimization of Reference Gene Identification and Normalization in DNA Microarray Analysis. Med. Sci. Monit. Basic Res. 2016, 22, 45–52, doi:10.12659/MSMBR.897644.
23. Yang, P.; Zhou, B.B.; Yang, J.Y.-H.; Zomaya, A.Y. Stability of feature selection algorithms and ensemble feature selection methods in bioinformatics. Biol. Knowl. Discov. Handb. Preprocessing, mining, postprocessing Biol. data 2013, 333–352, doi:10.1002/9781118617151.ch14.
24. Tabe-Bordbar, S.; Emad, A.; Zhao, S.D.; Sinha, S. A closer look at cross-validation for assessing the accuracy of gene regulatory networks and models. Sci. Rep. 2018, 8, 1–11, doi:10.1038/s41598-018-24937-4.
25. Alqarni, B.; Colley, B.; Klebensberger, J.; McDougald, D.; Rice, S.A. Expression stability of 13 housekeeping genes during carbon starvation of Pseudomonas aeruginosa. J. Microbiol. Methods 2016, 127, 182–187, doi:10.1016/j.mimet.2016.06.008.
26. Noi, P.T.; Kappas, M. Comparison of random forest, k-nearest neighbor, and support vector machine classifiers for land cover classification using sentinel-2 imagery. Sensors (Switzerland) 2018, 18, doi:10.3390/s18010018.
27. Li, L.; Weinberg, C.R.; Darden, T.A.; Pedersen, L.G. Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method. Bioinformatics 2001, 17, 1131–42.
28. Vapnik, V.N. Estimation of dependences based on empirical data; Springer-Verlag, 1982; ISBN 0387907335.
29. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32, doi:10.1023/A:1010933404324.
30. Nalca, Y.; Jänsch, L.; Bredenbruch, F.; Geffers, R.; Buer, J.; Häussler, S. Quorum-sensing antagonistic activities of azithromycin in Pseudomonas aeruginosa PAO1: a global approach. Antimicrob Agents Chemother 2008, 50, 1680–1688, doi:10.1128/AAC.50.5.1680.
31. Kai, T.; Tateda, K.; Kimura, S.; Ishii, Y.; Ito, H.; Yoshida, H.; Kimura, T.; Yamaguchi, K. A low concentration of azithromycin inhibits the mRNA expression of N-acyl homoserine lactone synthesis enzymes, upstream of lasI or rhlI, in Pseudomonas aeruginosa. Pulm. Pharmacol. Ther. 2009, 22, 483–486, doi:10.1016/j.pupt.2009.04.004.
32. Chang, W.; Small, D.A.; Toghrol, F.; Bentley, W.E. Microarray analysis of Pseudomonas aeruginosa reveals induction of pyocin genes in response to hydrogen peroxide. BMC Genomics 2005, 6, 1–14, doi:10.1186/1471-2164-6-115.
33. Teitzel, G.M.; Geddie, A.; De Long, S.K.; Kirisits, M.J.; Whiteley, M.; Parsek, M.R. Survival and Growth in the Presence of Elevated Copper: Transcriptional Profiling of Copper-Stressed Pseudomonas aeruginosa. J. Bacteriol. 2006, 188, 7242–7256, doi:10.1128/JB.00837-06.
34. Small, D.A.; Chang, W.; Toghrol, F.; Bentley, W.E. Comparative global transcription analysis of sodium hypochlorite, peracetic acid, and hydrogen peroxide on Pseudomonas aeruginosa. Appl. Microbiol. Biotechnol. 2007, 76, 1093–1105, doi:10.1007/s00253-007-1072-z.
35. Nde, C.W.; Jang, H.-J.; Toghrol, F.; Bentley, W.E. Toxicogenomic response of Pseudomonas aeruginosa to ortho-phenylphenol. BMC Genomics 2008, 9, 473, doi:10.1186/1471-2164-9-473.
36. Cummins, J.; Reen, F.J.; Baysse, C.; Mooij, M.J.; O’Gara, F. Subinhibitory concentrations of the cationic antimicrobial peptide colistin induce the pseudomonas quinolone signal in Pseudomonas aeruginosa. Microbiology 2009, 155, 2826–2837, doi:10.1099/mic.0.025643-0.
37. Nde, C.W.; Jang, H.J.; Toghrol, F.; Bentley, W.E. Global transcriptomic response of Pseudomonas aeruginosa to chlorhexidine diacetate. Environ. Sci. Technol.
2009, 43, 8406–8415, doi:10.1021/es9015475. 38. Pan, J.; Bahar, A.A.; Syed, H.; Ren, D. Reverting Antibiotic Tolerance of Pseudomonas aeruginosa PAO1 Persister Cells by (Z)-4-bromo-5-(bromomethylene)-3-methylfuran-2(5H)-one. PLoS One 2012, 7, doi:10.1371/journal.pone.0045778. 39. Cirz, R.T.; O’Neill, B.M.; Hammond, J.A.; Head, S.R.; Romesberg, F.E. Defining the Pseudomonas aeruginosa SOS response and its role in the global response to the antibiotic ciprofloxacin. J. Bacteriol. 2006, 188, 7101–7110, doi:10.1128/JB.00807-06. 40. Smyth, G.K. Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments. Stat. Appl. Genet. Mol. Biol. 2004, 3, 1–25, doi:10.2202/1544-6115.1027. 41. Richter, K.; Haslbeck, M.; Buchner, J. The Heat Shock Response: Life on the Verge of Death. Mol. Cell 2010, 40, 253–266, doi:10.1016/j.molcel.2010.10.006. 42. Vihervaara, A.; Duarte, F.M.; Lis, J.T. Molecular mechanisms driving transcriptional stress responses. Nat. Rev. Genet. 2018, 19, 385–397, doi:10.1038/s41576-018-0001-6. 43. Holloway, B.W. Genetic Recombination in Pseudomonas aeruginosa. Microbiology 1955, 13, 572–581, doi:10.1099/00221287-13-3-572. 44. Piao, J.; Sun, J.; Yang, Y.; Jin, T.; Chen, L.; Lin, Z. Target gene screening and evaluation of prognostic values in non-small cell lung cancers by bioinformatics analysis. Gene 2018, 647, 306–311, doi:10.1016/j.gene.2018.01.003. 45. Touw, W.G.; Bayjanov, J.R.; Overmars, L.; Backus, L.; Boekhorst, J.; Wels, M.; van Hijum, S.A.F.T. Data mining in the Life Sciences with Random Forest: a walk in the park or lost in the jungle? Brief. Bioinform. 2013, 14, 315– 326, doi:10.1093/bib/bbs034. 128 46. Pai, S.; Hui, S.; Isserlin, R.; Shah, M.A.; Kaka, H.; Bader, G.D. netDx: interpretable patient classification using integrated patient similarity networks. Mol. Syst. Biol. 2019, 15, e8497, doi:10.15252/msb.20188497. 47. Park, H.; Shimamura, T.; Imoto, S.; Miyano, S. 
Adaptive NetworkProfiler for Identifying Cancer Characteristic- Specific Gene Regulatory Networks. J. Comput. Biol. 2017, 25, cmb.2017.0120, doi:10.1089/cmb.2017.0120. 48. Falciani, F.; Diab, A.; Sabine, V.; Williams, T.; Ortega, F.; George, S.; Chipman, J. Hepatic transcriptomic profiles of European flounder (Platichthys flesus) from field sites and computational approaches to predict site from stress gene responses following exposure to model toxicants. Aquat. Toxicol. 2008, 90, 92–101, doi:10.1016/j.aquatox.2008.07.020. 49. Nota, B.; Verweij, R.A.; Molenaar, D.; Ylstra, B.; van Straalen, N.M.; Roelofs, D. Gene Expression Analysis Reveals a Gene Set Discriminatory to Different Metals in Soil. Toxicol. Sci. 2010, 115, 34–40, doi:10.1093/toxsci/kfq043. 50. Krämer, R. Bacterial stimulus perception and signal transduction: Response to osmotic stress. Chem. Rec. 2010, 10, 217–229, doi:10.1002/tcr.201000005. 51. Schurig-Briccio, L.A.; Farías, R.N.; Rodríguez-Montelongo, L.; Rintoul, M.R.; Rapisarda, V.A. Protection against oxidative stress in Escherichia coli stationary phase by a phosphate concentration-dependent genes expression. Arch. Biochem. Biophys. 2009, 483, 106–110, doi:10.1016/j.abb.2008.12.009. 52. Parsons, J.B.; Rock, C.O. Bacterial lipids: metabolism and membrane homeostasis. Prog. Lipid Res. 2013, 52, 249–76, doi:10.1016/j.plipres.2013.02.002. 53. Storvik, K.A.M.; Foster, P.L. RpoS, the stress response sigma factor, plays a dual role in the regulation of Escherichia coli’s error-prone DNA polymerase IV. J. Bacteriol. 2010, 192, 3639–44, doi:10.1128/JB.00358-10. 54. Starosta, A.L.; Lassak, J.; Jung, K.; Wilson, D.N. The bacterial translation stress response. FEMS Microbiol. Rev. 2014, 38, 1172–1201, doi:10.1111/1574-6976.12083. 55. Adamson, B.; Norman, T.M.; Jost, M.; Cho, M.Y.; Nuñez, J.K.; Chen, Y.; Villalta, J.E.; Gilbert, L.A.; Horlbeck, M.A.; Hein, M.Y.; et al. 
A Multiplexed Single-Cell CRISPR Screening Platform Enables Systematic Dissection of the Unfolded Protein Response. Cell 2016, 167, 1867–1882.e21, doi:10.1016/j.cell.2016.11.048. 56. Molina-Mora, J.A.; Campos-Sanchez, R.; Garcia, F. Gene Expression Dynamics Induced by Ciprofloxacin and Loss of Lexa Function in Pseudomonas aeruginosa PAO1 Using Data Mining and Network Analysis. In Proceedings of the 2018 IEEE International Work Conference on Bioinspired Intelligence (IWOBI); IEEE, 2018; pp. 1–7. 129 CHAPTER 5 Transcriptomic determinants of the response of ST-111 Pseudomonas aeruginosa AG1 to ciprofloxacin identified by a top-down systems biology approach Molina-Mora, J. A., Chinchilla, D., Chavarría, M., Ulloa, A., Campos-Sanchez, R., Mora-Rodríguez, R. A., Shi, L., García, F. (2020). Transcriptomic determinants of the response of ST-111 Pseudomonas aeruginosa AG1 to ciprofloxacin identified by a top-down systems biology approach. Scientific Reports, 10, 1 23. https://doi.org/10.1038/s41598-020-70581-2 https://www.nature.com/articles/s41598-020-70581-2 130 Summary Ciprofloxacin (CIP) is an antibiotic commonly used to treat P. aeruginosa infections, and it is known to produce DNA damage, triggering a complex molecular response. In order to evaluate the effects of a sub-inhibitory CIP concentration on the multi-resistant PaeAG1, growth curves using increasing CIP concentrations were compared. We then measured gene expression using RNA-Seq at three time points (0, 2.5 and 5 hours) after CIP exposure to identify the transcriptomic determinants of the response (i.e. hub genes, gene clusters and enriched pathways). Changes in expression were determined using differential expression analysis and network analysis using a top- down systems biology approach. A hybrid model using database-based and co-expression analysis approaches was implemented to predict gene-gene interactions. 
We observed a reduction of the growth rate as the sub-inhibitory CIP concentration was increased. In the transcriptomic analysis, CIP treatment resulted over time in the differential expression of 518 genes, showing a complex impact at the molecular level. The transcriptomic determinants were 14 hub genes, multiple gene clusters at different levels (associated with hub genes or as co-expression modules) and 15 enriched pathways. Down-regulation of genes implicated in several metabolic pathways, virulence elements and ribosomal activity was observed. In contrast, genes for amino acid catabolism, the RpoS factor, proteases, and phenazines were up-regulated. Remarkably, >80 resident-phage genes were up-regulated after CIP treatment, which was validated at the phenomic level using a phage plaque assay. Thus, both a reduction of the growth rate and an increase in phage induction were evidenced as the CIP concentration was increased. In summary, transcriptomic and network analyses, together with the growth curves and phage plaque assays, provide evidence that PaeAG1 presents a complex, concentration-dependent response to sub-inhibitory CIP exposure, showing pleiotropic effects at the systems level. Manipulation of these determinants, such as phage genes, could be used to gain more insight into the regulation of responses in PaeAG1 and to identify possible therapeutic targets. To our knowledge, this is the first report of the transcriptomic analysis of the CIP response in a ST-111 high-risk P. aeruginosa strain, in particular using a top-down systems biology approach.

www.nature.com/scientificreports

Transcriptomic determinants of the response of ST-111 Pseudomonas aeruginosa AG1 to ciprofloxacin identified by a top-down systems biology approach

José Arturo Molina-Mora1*, Diana Chinchilla-Montero1, Maribel Chavarría-Azofeifa1, Alejandro J. Ulloa-Morales2, Rebeca Campos-Sánchez3, Rodrigo Mora-Rodríguez1, Leming Shi4 & Fernando García1

Pseudomonas aeruginosa is an opportunistic pathogen that thrives in diverse environments and causes a variety of human infections. Pseudomonas aeruginosa AG1 (PaeAG1) is a high-risk sequence type 111 (ST-111) strain isolated from a Costa Rican hospital in 2010. PaeAG1 has both blaVIM-2 and blaIMP-18 genes encoding metallo-β-lactamases, and it is resistant to β-lactams (including carbapenems), aminoglycosides, and fluoroquinolones. Ciprofloxacin (CIP) is an antibiotic commonly used to treat P. aeruginosa infections, and it is known to produce DNA damage, triggering a complex molecular response. In order to evaluate the effects of a sub-inhibitory CIP concentration on PaeAG1, growth curves using increasing CIP concentrations were compared. We then measured gene expression using RNA-Seq at three time points (0, 2.5 and 5 h) after CIP exposure to identify the transcriptomic determinants of the response (i.e. hub genes, gene clusters and enriched pathways). Changes in expression were determined using differential expression analysis and network analysis using a top-down systems biology approach. A hybrid model using database-based and co-expression analysis approaches was implemented to predict gene–gene interactions. We observed a reduction of the growth rate as the sub-inhibitory CIP concentration was increased. In the transcriptomic analysis, CIP treatment resulted over time in the differential expression of 518 genes, showing a complex impact at the molecular level. The transcriptomic determinants were 14 hub genes, multiple gene clusters at different levels (associated with hub genes or as co-expression modules) and 15 enriched pathways. Down-regulation of genes implicated in several metabolic pathways, virulence elements and ribosomal activity was observed. In contrast, genes for amino acid catabolism, the RpoS factor, proteases, and phenazines were up-regulated. Remarkably, > 80 resident-phage genes were up-regulated after CIP treatment, which was validated at the phenomic level using a phage plaque assay. Thus, both a reduction of the growth rate and an increase in phage induction were evidenced as the CIP concentration was increased. In summary, transcriptomic and network analyses, together with the growth curves and phage plaque assays, provide evidence that PaeAG1 presents a complex, concentration-dependent response to sub-inhibitory CIP exposure, showing pleiotropic effects at the systems level. Manipulation of these determinants, such as phage genes, could be used to gain more insight into the regulation of responses in PaeAG1 and to identify possible therapeutic targets. To our knowledge, this is the first report of the transcriptomic analysis of the CIP response in a ST-111 high-risk P. aeruginosa strain, in particular using a top-down systems biology approach.

1 Centro de Investigación en Enfermedades Tropicales (CIET), Facultad de Microbiología, Universidad de Costa Rica, San José, Costa Rica. 2 Chemical Genomics Centre (CGC), Max-Planck-Institute for Molecular Physiology, Dortmund, Germany. 3 Centro de Investigación en Biología Celular y Molecular (CIBCM), Facultad de Microbiología, Universidad de Costa Rica, San José, Costa Rica. 4 Human Phenome Institute (HuPI), Fudan University, Shanghai, China. *email: jose.molinamora@ucr.ac.cr

Scientific Reports | (2020) 10:13717 | https://doi.org/10.1038/s41598-020-70581-2
Abbreviations

CIP        Ciprofloxacin
DEGs       Differentially expressed genes
GGI        Gene–gene interaction
GSEA       Gene set enrichment analysis
KEGG       Kyoto Encyclopedia of Genes and Genomes
MIC        Minimum inhibitory concentration
MLST       Multilocus sequence typing
OD600nm    Optical density measured at 600 nm
PaeAG1     Pseudomonas aeruginosa strain AG1
PCA        Principal components analysis
QC         Quality control
RNA-Seq    RNA sequencing
ST-111     Sequence type 111
WGCNA      Weighted gene co-expression network analysis
WHO        World Health Organization

Pseudomonas aeruginosa is a ubiquitous Gram-negative organism which thrives in diverse environments and acts as an opportunistic pathogen [1]. The ability of this pathogen to cause a variety of human infections is facilitated by its nutritional versatility [2], resistance to a wide spectrum of antibiotics, and virulence factors [3,4].

Pseudomonas aeruginosa AG1 (PaeAG1) is a multiresistant high-risk sequence type 111 (ST-111) strain (GenBank CP045739) [5]. It was isolated from a Costa Rican hospital and was the first reported P. aeruginosa isolate carrying both blaVIM-2 and blaIMP-18 genes, which encode metallo-β-lactamase enzymes (carbapenemases) and are located in two independent integrons [5,6]. PaeAG1 is resistant to β-lactams (including carbapenems), aminoglycosides, and fluoroquinolones, being sensitive only to colistin. In addition to this multidrug resistance, as in other P. aeruginosa strains, the ability to colonize nosocomial environments makes this strain a high-risk clone [7]. Owing to this antibiotic resistance profile, including resistance to carbapenems, PaeAG1 is classified as a Priority 1 (critical) organism according to the World Health Organization (WHO) [8].

Antibiotic resistance is a major threat to public health because it compromises the administration of appropriate antibiotic therapy and reduces the therapeutic options to treat infections, increasing patient morbidity and mortality [9,10].
This situation is aggravated by the emergence of strains resistant to multiple antibiotics [11], limited knowledge of interactions with pathogens and of the mechanisms of action of antimicrobial agents, and limited development of new antibiotics [12]. Use of antibiotics below the minimum inhibitory concentration (MIC), i.e. at sub-inhibitory concentrations, also contributes to antibiotic resistance, as such concentrations allow strains to continue growing and can select for pre-existing resistant organisms [13]. Since sub-inhibitory antibiotic concentrations are found in many natural environments, bacteria can naturally trigger mechanisms of tolerance [14]. However, the fundamental mechanisms of bacterial tolerance to antibiotics have not been fully elucidated [15].

It has been shown that the perturbation induced by many antibiotics leads to stress conditions in prokaryotic cells [16], which can induce DNA damage [17]. Stressors activate the regulation of gene expression or the activity and stability of existing proteins to induce adaptation mechanisms [16]. Organisms have evolved numerous DNA repair pathways to eliminate DNA damage and restart DNA replication [18]. Regulatory networks of transcriptional responses to DNA damage involve not only DNA repair enzymes, but also diverse proteins with roles in cell division, metabolism modulation, genetic rearrangements and exchange, mutation, and virulence factor production [19].

Ciprofloxacin (CIP) is a fluoroquinolone antibiotic used to treat P. aeruginosa infections [20]. CIP is well known to produce DNA damage by inhibiting DNA gyrase and topoisomerase IV, leading to DNA strand breaks [21]. Mutations in the genes encoding these enzymes confer CIP resistance by reducing drug affinity [22]. CIP has been used to study stress responses in this bacterial group [12,23], in particular the induction of the SOS response as a mechanism of DNA damage repair [17,24,25]. In P. aeruginosa, the SOS response regulon is composed of 15 genes, including recA and lexA [26].
Upon DNA damage, RecA recognizes single-stranded DNA (ssDNA), forms filaments on it, and induces the autocleavage of the repressor LexA. This response leads to the expression of genes related to DNA damage repair [27]. Other LexA-like repressors are regulated during SOS activation, including elements of phages and pyocins [19]. SOS also mediates responses involving transfer of resistance elements, generation of mutations and evolution of resistance [26], as well as the appearance of persister cells [24].

However, modulation of stress responses after DNA damage is not limited to the SOS response. RpoS is a general stress sigma factor (σS) known as a central element in a regulatory network that governs the expression of stationary-phase-induced genes [28] to maintain cell viability [29]. This regulator is strongly induced when cells are exposed to various stress conditions, including antibiotics, pH downshift, starvation, and hyperosmolarity [30]. RpoS regulates more than 50 genes in Pseudomonas aeruginosa [31], including virulence factors [32].

The SOS and RpoS regulons are complementary mechanisms that respond to certain stresses and protect bacteria from DNA damage [33]. Lon protease [11] and AmpR [34] can modulate both the SOS and RpoS regulons. In addition, both responses can regulate key genes such as polB [18], iraD [19], and dinB [33]. The connection between the RpoS and SOS responses seems to be associated with a mechanism to maximize survival and fitness of cells, and to maintain genome stability [18]. These responses can modulate virulence factors (including quorum sensing and biofilm formation), and increase homologous recombination and mutation frequencies [33,35]. However, other SOS- and RpoS-independent mechanisms are also known to be present in bacteria [36], including P. aeruginosa after CIP treatment [12,26], with variable results depending on the strain and showing a mosaic response [12].

Although the full mechanisms of all these molecular responses are not well understood, it is known that cells respond to stress conditions through complex regulatory systems that control gene expression [37]. Since a key objective in biological research is to describe molecular interactions [38], network analysis is a common approach to describe complex biological systems and to mathematically model gene–gene interactions (GGI) with graphical representations (genes as nodes and interactions as edges) [39]. Molecules are thereby studied not only individually; emergent properties are also identified to describe and understand the complexity of the gene network response to the stress condition. From a top-down systems biology perspective, which starts from “whole”-omics data to identify specific determinants or elements of biological importance, the functional status of genes can be evaluated by constructing large-scale networks [40]. For this purpose, data from high-throughput technologies such as microarrays and RNA sequencing (RNA-Seq) can be used to describe molecular interactions at the transcriptomic level [38,41]. Thus, to understand or infer mechanisms associated with the transcriptional response, it is possible to build gene regulatory networks either using databases or based on co-expression data [39,42,43]. These networks provide insight into the response to stress conditions [44], leading to the identification of gene clusters or even hub genes as candidate biomarkers or modulators with the potential to become key therapeutic targets [43,45].

In P. aeruginosa, rapid adaptation to stress conditions is partially explained by the modulation of the global gene expression, which represents around 8% of all coding genes [3].
This regulation induces pleiotropic effects on its genomic regulatory network [46], as previously shown using systems biology [47] and the transcriptomic profiling of the response to CIP [12,26,48].

In this work we first evaluated PaeAG1 growth at sub-inhibitory CIP concentrations, showing growth reduction as the CIP concentration was increased. We hypothesized that after exposing PaeAG1 to ciprofloxacin, even at sub-inhibitory concentrations, transcriptomic determinants would be triggered, including bacterial growth modulators. Thus, the aim was to identify transcriptomic determinants associated with the response to CIP in PaeAG1 using RNA-Seq profiling and network analysis by a top-down systems biology approach. Results showed that PaeAG1 generates a complex response to CIP exposure, evidencing pleiotropic effects involving the regulation of multiple hub genes, gene clusters and enriched pathways (transcriptomic determinants), many of them related to growth. As evidenced at the transcriptomic and the phenomic levels, phage induction was a particular trait modulated by CIP in a concentration-dependent manner and correlated with bacterial growth reduction.

Methods

The general pipeline followed in this study to identify the transcriptomic determinants associated with the response to CIP in PaeAG1 is shown in Fig. 1.

Bacterial isolate. The PaeAG1 strain is a Costa Rican multiresistant isolate from a sputum sample of a patient with pneumonia at the Intensive Care Unit of the San Juan de Dios Hospital (San José, Costa Rica) [6]. PaeAG1 exhibits resistance to β-lactams (including carbapenems; meropenem MIC 32 µg/mL and imipenem MIC > 32 µg/mL), aminoglycosides (gentamicin MIC 128 µg/mL and tobramycin MIC > 192 µg/mL) and fluoroquinolones (ciprofloxacin MIC 32 µg/mL), and it is only sensitive to colistin (colistin MIC 2 µg/mL). We recently assembled and annotated the PaeAG1 genome [5], and the genome sequence and annotation are available in GenBank under accession CP045739 (BioProject PRJNA587210).
Growth curves assay. Three independent cultures of PaeAG1 cells were grown to exponential phase overnight in lysogeny broth (LB) at 37 °C with shaking (pre-culture to reach mid-log phase). Then, five aliquots were added to 50 mL of fresh LB broth to an initial optical density measured at 600 nm (OD600nm) of 0.01. Each sample was treated with a specific CIP concentration of 0.0 (control), 5.0, 12.5, 25.0 or 50.0 µg/mL (final concentrations). Growth of cultures was monitored by OD600nm at 0, 2, 4, 6, 8, 12 and 16 h. Different CIP concentrations were compared by assessing growth curve kinetics, including lag and exponential phases. As a complementary assay, two other antibiotics, imipenem (carbapenem) and tobramycin (aminoglycoside), were evaluated under exactly the same growth conditions, with antibiotic concentrations chosen according to the MIC. See Results and Supplementary Figure S1 for details.

The growth curves were statistically compared to the control growth curve using a two-way ANOVA with Bonferroni post-tests (95% confidence level, i.e. α = 0.05), similar to [49], using time and concentration as factors. We also ran an unpaired t-test (95% confidence level) comparing the area under the curve (AUC) of each growth curve against the control, similar to [50]. Analyses were done using Prism (GraphPad Software, Inc., La Jolla, CA).

To perform the transcriptomic assay, we used the results from the growth curves to select a specific sub-inhibitory CIP concentration at which there were no major changes in the growth rate after treatment.

RNA isolation and RNA sequencing. In order to evaluate the molecular response of PaeAG1 to a sub-inhibitory CIP concentration, a transcriptomic assay was designed using RNA-Seq technology, as described below.

Growth conditions. PaeAG1 cells were grown under the same conditions as detailed above, but treatment was done using a single CIP concentration of 12.5 µg/mL (see “Results” for details of concentration selection).
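The area-under-curve comparison described for the growth curves assay can be illustrated with a minimal Python sketch. The OD600nm readings below are invented placeholders, not data from this study, and the function names are hypothetical. The original analysis was run in Prism, so this only shows the arithmetic behind a trapezoidal AUC and a pooled-variance unpaired t statistic (no p-value is computed).

```python
from math import sqrt
from statistics import mean, variance

# Sampling times used for the growth curves (hours)
TIMES_H = [0, 2, 4, 6, 8, 12, 16]


def auc_trapezoid(times, values):
    """Area under a curve by the trapezoidal rule."""
    return sum(
        (t1 - t0) * (v0 + v1) / 2
        for t0, t1, v0, v1 in zip(times, times[1:], values, values[1:])
    )


def unpaired_t_statistic(a, b):
    """Student's unpaired t statistic using a pooled variance estimate."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))


# Hypothetical OD600nm readings: three replicate cultures per condition
control = [
    [0.01, 0.05, 0.30, 0.90, 1.50, 2.10, 2.30],
    [0.01, 0.06, 0.32, 0.95, 1.55, 2.15, 2.35],
    [0.01, 0.04, 0.28, 0.85, 1.45, 2.05, 2.25],
]
treated = [
    [0.01, 0.03, 0.15, 0.50, 0.90, 1.40, 1.60],
    [0.01, 0.03, 0.17, 0.55, 0.95, 1.45, 1.65],
    [0.01, 0.02, 0.13, 0.45, 0.85, 1.35, 1.55],
]

auc_control = [auc_trapezoid(TIMES_H, curve) for curve in control]
auc_treated = [auc_trapezoid(TIMES_H, curve) for curve in treated]
t_stat = unpaired_t_statistic(auc_control, auc_treated)
print("AUC control:", auc_control)
print("AUC treated:", auc_treated)
print("t statistic:", t_stat)
```

Summarizing each curve by its AUC reduces the comparison to one number per replicate, which is what makes a simple two-group test applicable.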
Immediately after adding treatment, an aliquot was taken as control (time 0 h), and cells were kept growing for 2.5 and 5 h (times were selected according to preliminary results of phage induction, see “Methods” for Phage plaque assays). This was done with three independent cultures for a total of nine aliquots, three replicates per time.

Figure 1. General pipeline to identify the transcriptomic determinants of the response of P. aeruginosa AG1 to ciprofloxacin (CIP). After growth curves assessment, a specific CIP concentration was used to sequence RNA (RNA-Seq) at 0, 2.5 and 5 h after exposure. DEGs were identified and used to build GGI networks. Transcriptomic determinants were identified by network analysis. Findings were verified at phenomic level using a phage plaque assay.

RNA isolation. Aliquots from the cultures were preserved in two volumes of RNAprotect reagent (QIAGEN) and cells were stored at 4 °C until RNA extraction. At the end of the sample collection period, total RNA was extracted using the RNeasy Mini kit (QIAGEN, UK) following the manufacturer’s instructions. RiboZero Gold (Epicentre) was used to deplete bacterial rRNA from total RNA samples according to the manufacturer’s instructions. The quality and quantity of extracted RNA were determined using a Nanodrop (Nanodrop 2000, Thermo Scientific, UK). The RNA integrity was analyzed using an Agilent 2100 Bioanalyzer (Agilent Technologies, USA) to obtain the RNA integrity number (RIN) for all samples.

RNA sequencing. For RNA sequencing, the TruSeq Stranded Total RNA library preparation kit (Illumina, USA) was used to generate cDNA (amplification with 13 PCR cycles) and libraries for 2 × 51 bp paired-end reads. Libraries were prepared and sequenced at the Genome Technology Center, New York University (New York, USA) on the Illumina HiSeq 2500 platform.
Sequencing generated more than 120 Gb of sequences (> 300 million reads in total) for all samples.

RNA-Seq data analysis. With the aim of quantifying transcripts and identifying DEGs in PaeAG1 after CIP treatment, RNA-Seq data were analyzed in three steps: quality control, read mapping to the genome for transcript quantification, and differential expression analysis.

Quality control (QC). QC was done before and after trimming/filtering. Reads were trimmed using Trimmomatic v0.38 [51] to discard sequences with a per-base Phred quality score < 30 and to enforce a minimum read length of 35 bases. Reads were filtered using BBDuk (https://jgi.doe.gov/data-and-tools/bb-tools/) to remove adapters and reads mapping to rRNA. Sequence files were evaluated using FastQC v0.11.7 [52] to obtain general quality control metrics. To evaluate the origin of the read sequences, FastQ-Screen [53] was used to quantify the proportion of reads that mapped to reference genomes (human, mouse, and adapter contaminants, included by default) and to prokaryotic sequences specifically added for this work (PaeAG1 and E. coli genomes, and 16S and 23S rRNA databases). Reports were merged using MultiQC [54] to summarize all individual results. After selection, the nine samples yielded an average of approximately 60 million reads each.

Reads mapping and transcript quantification. We used EDGE-pro v1.3.1 software to map RNA-Seq reads to the PaeAG1 genome (GenBank CP045739), filter out multi-aligned reads, and estimate the expression level of each gene as counts [55]. This program was run with the default parameters, using Bowtie2 [56] as the read alignment algorithm. The script “edgeToDeseq.perl”, provided with the software, was used to convert raw counts (EDGE-pro output) to a count-table format for further differential expression analysis.
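To make the trimming thresholds from the quality control step concrete, the following Python sketch decodes Phred+33 quality strings and applies a simplified filter. It is an illustration under stated assumptions, not Trimmomatic's actual algorithm: the function names are hypothetical, and the rule used here (clip low-quality bases from the 3' end, then drop reads shorter than 35 bases) is only one plausible reading of the thresholds quoted above. The relation between a Phred score Q and the base-calling error probability p is Q = -10 log10(p), so Q = 30 corresponds to p = 0.001.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string (Phred+33 by default) into scores."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Base-calling error probability for a Phred score: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def trim_and_filter(seq, qual, min_q=30, min_len=35):
    """Clip 3'-end bases below min_q, then drop reads shorter than min_len.

    Simplified stand-in for a real trimmer; returns None for discarded reads.
    """
    scores = phred_scores(qual)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    if end < min_len:
        return None
    return seq[:end], qual[:end]


# Hypothetical 40-base read: 36 high-quality bases followed by 4 poor ones
read_seq = "A" * 40
read_qual = "I" * 36 + "5" * 4  # 'I' encodes Q40, '5' encodes Q20
print(trim_and_filter(read_seq, read_qual))
print(error_probability(30))
```

A read with only 30 good bases would be discarded by the same call, because the clipped length falls below the 35-base minimum.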
Quality control of alignments per sample was done using the Qualimap RNA-Seq tool [57] to assess mapping quality, and the RSeQC package [58] to estimate transcript coverage uniformity (gene body coverage) and the transcript integrity number (TIN). The genome annotation files in the formats required for these analyses are available at https://github.com/josemolina6/PaeAG1_genome.

Differential expression analysis. We used raw counts of transcripts to estimate differential expression. For this purpose, the DESeq2 package v1.26.0 [59] in R v3.5.1 [60] was used, based on negative binomial generalized linear models with default settings. DESeq2-based normalization, absolute expression comparisons by the regularized log transformation (rlog), Principal Component Analysis (PCA), count dispersion plots and clustering analysis were run in the same program. The triplicates at each time point after PaeAG1 exposure to CIP were treated as one factor level. Differential expression analysis was done by comparing the 2.5 h or 5 h data against the initial time point at 0 h. Hypothesis testing to select differentially expressed genes (DEGs) was done using the Benjamini–Hochberg adjustment (to control the false discovery rate, FDR) and the log2[FoldChange] (logFC) of transformed and normalized mean counts. Genes were considered up-regulated if logFC > 1 or down-regulated if logFC < -1, with an adjusted p-value < 0.05 in both cases. Gene list comparisons by Venn diagrams were performed using the Draw Venn Diagram Tool (https://bioinformatics.psb.ugent.be/webtools/Venn/).

Annotation of differentially expressed genes. DEG annotation was retrieved from our previous work [5] on the assembly and annotation of the PaeAG1 genome (GenBank CP045739). Particular features of each gene (including molecular function, product, gene size and domains, and subcellular location of proteins) were explored in more detail in the Pseudomonas Genome Database (https://www.pseudomonas.com/) [61].
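The DEG selection rule stated in the differential expression analysis (Benjamini–Hochberg adjusted p-value < 0.05 together with logFC > 1 or < -1) can be sketched in a few lines of Python. This is a plain re-implementation of the Benjamini–Hochberg step-up adjustment for illustration only; it is not DESeq2's code, the function names are hypothetical, and the gene labels and values are invented.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def classify_gene(log_fc, padj, lfc_cut=1.0, alpha=0.05):
    """Label a gene as 'up', 'down' or 'ns' (not significant)."""
    if padj < alpha and log_fc > lfc_cut:
        return "up"
    if padj < alpha and log_fc < -lfc_cut:
        return "down"
    return "ns"


# Hypothetical genes with log2 fold changes and raw p-values
genes = ["geneA", "geneB", "geneC", "geneD"]
log_fcs = [2.3, -1.8, 0.4, 1.2]
raw_p = [0.0004, 0.003, 0.02, 0.3]

padj = benjamini_hochberg(raw_p)
for name, lfc, q in zip(genes, log_fcs, padj):
    print(name, round(q, 4), classify_gene(lfc, q))
```

Requiring both criteria keeps genes whose change is statistically supported and also large enough (at least two-fold) to be of likely biological interest.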
In addition, general regulators of the DEGs were investigated using the PseudomonasNet tool (https://www.inetbio.org/pseudomonasnet/Network_regulon_form.php) with a p-value < 0.05 in a context-centric analysis. Using the same platform, it was possible to identify which DEGs and which of their regulators corresponded to transcription factor genes.

Analysis of DNA–protein interactions. For selected genes, protein-DNA binding sites were investigated. The CollecTF database (https://www.collectf.org/) was used first to search for consensus DNA binding sequences of the protein of interest and to identify modulated genes. If no information was available, promoter consensus sequences were taken from specific studies and binding sites were identified using the motif-based sequence analysis tool Find Individual Motif Occurrences (FIMO) from the MEME Suite (https://meme-suite.org/tools/fimo).

In order to identify DEGs acting as molecular determinants (hub genes, gene clusters and key pathways) of the response to CIP in PaeAG1, a large scale gene–gene interaction (GGI) network of DEGs was built using a top-down systems biology approach. Connections between genes were predicted using two independent methods, one database-based and the other based on co-expression analysis, detailed as follows.

Database-based method for gene–gene interaction prediction and network construction. With the aim of obtaining high-confidence GGIs between DEGs using a database-based method, the Search Tool for the Retrieval of Interacting Genes database (STRINGdb)⁶² was used to construct a large scale GGI network for the DEGs using default parameters. All genes that were DEGs at either of the two time points were used to build the main network. The resulting graph was exported and then visualized and topologically analyzed using Cytoscape software⁶³.

Co-expression analysis and co-expression network construction.
To incorporate more interactions between DEGs, a data-driven systems biology approach was implemented using co-expression analysis with all the normalized counts of DEGs, as in recent studies⁴⁵,⁶⁴⁻⁶⁶.

Modules identification using co-expression analysis. The weighted gene co-expression network analysis (WGCNA) package⁴³ was run in R. Briefly, a matrix of Pearson correlations between all pairs of genes was calculated. The adjacency matrix was then constructed by applying a soft threshold to the correlation matrix with a power of β = 9, chosen as the saturation level under the scale-free topology criterion. The topological overlap matrix was then calculated. Hierarchical clustering was used to generate a dendrogram that groups highly co-expressed genes into clusters called modules (arbitrarily represented by colors), using the default dynamic tree cut algorithm. The default colors assigned to modules were kept.

Association of co-expression modules and traits. A t-test was used to evaluate the association between the modules (using the module eigengene, ME, the first principal component of the module expression matrix) and traits of PaeAG1 according to the experimental design. For this, the time points (the experimental factor levels 0, 2.5 and 5 h) and the phage induction data at 2.5 and 5 h after exposure to 12.5 µg/mL CIP were incorporated as traits (see "Phage plaque assay" section in "Methods").

Co-expression network. To visualize the whole network with modules shown by color, the WGCNA "exportNetworkToCytoscape" function was run with a correlation threshold of 0.985 and weight = false, building an un-weighted graph of highly connected genes under a very strict correlation criterion. The data-driven graph was visualized using Cytoscape.

Integrated DEGs network construction.
The final GGI network of DEGs was constructed by joining the files of the well-known interactions predicted by the STRING database and the strict data-driven interactions obtained from co-expression analysis (un-weighted graph). The definitive graph was visualized using Cytoscape software. Topological metrics of the graph were obtained using the default apps available in Cytoscape.

Enrichment analysis. For the gene set enrichment analysis (GSEA), STRINGdb was used to identify significantly enriched pathways according to the KEGG database, using a cutoff of FDR < 0.05. This analysis was run for the complete lists of DEGs at 2.5 h, DEGs at 5 h, and genes of each co-expression module. Enrichment results were incorporated into the DEGs network using the Cytoscape app Omics Visualizer (https://apps.cytoscape.org/apps/omicsvisualizer).

Hub genes identification. In order to identify central or hub genes in the DEGs network of PaeAG1 after exposure to CIP, the cytoHubba app⁶⁷ was run in Cytoscape. The bottleneck and betweenness methods were applied with default parameters. The top 10 nodes (genes) were selected for each method using the calculated metrics. All genes selected by either method were labeled as hub genes. In addition, cytoHubba was used to build two subnetworks from the hub genes, one with the selected elements only, and another including the first-stage nodes (in direct connection with hub genes) to identify gene clusters. KEGG annotation information was kept from the DEGs network. Expression profiles of hub genes were compared to expression levels obtained in other representative studies, covering the following stressors: Cu (copper)⁶⁸, CIP (ciprofloxacin)²⁶, COL (colistin)⁶⁹, AZM (azithromycin)⁷⁰ and H₂O₂ (hydrogen peroxide)⁷¹. The comparison used the general direction of expression change (down, up or variable regulation).

Phage plaques assay (validation assay at the phenomic level).
To validate the transcriptomic results, which showed an up-regulation of phage genes in PaeAG1 after exposure to CIP, we implemented a phage plaque assay, performed in triplicate. To assess the effect of CIP on phage induction, different CIP concentrations were evaluated. Imipenem and tobramycin were also evaluated as supplementary assays.

Growth conditions were the same as described in the "Growth curve assays" section up to the addition of the different antibiotic concentrations. From that point, cultures were kept growing for five hours, and phages were then isolated and quantified for each sample. During standardization, it was determined that five hours after CIP exposure was the minimum time for clear detection of phage plaques (see supplementary Figure S1-B for details). Phage plaque counts at 2.5 h and 5 h for 12.5 µg/mL CIP were used to associate phage induction with co-expression modules (detailed in the "Co-expression analysis" section).

Phages isolation. Previously described protocols⁷²,⁷³ were adapted. Briefly, the culture was centrifuged for 20 min at 4,000 rpm, 40 mL of the supernatant was taken and 1 mL of chloroform was added to residual bacterial cells. After overnight incubation, cell debris was removed by centrifugation for 20 min at 3,000 rpm. The supernatant was filtered through a 0.45 μm filter to select phages. A volume of 30 mL of the filtered supernatant was mixed with 7.5 mL of polyethylene glycol (20%) and NaCl (2.5 M) to precipitate the phages. After overnight incubation, the sample was centrifuged for 30 min at 4,000 rpm, the supernatant was discarded and the pellet was resuspended in 250 µL of phage buffer (10 mM MgSO₄, 10 mM Tris–HCl and 150 mM NaCl).

Phages quantification. Phages were quantified as Plaque Forming Units (PFU) using P. aeruginosa PAO1 as host cells. The number of PFU was determined using the double-agar-layer method⁷⁴.
Briefly, the medium was composed of two agar layers, the first at 1.5% and the second at 0.5% agar concentration. P. aeruginosa PAO1 and phages were added to the second layer, and phage plaques were visualized after incubation for 24 h at 25 °C. An exponential regression between the CIP concentrations and the PFU was run to relate CIP exposure to phage induction.

Figure 2. In vitro effects of ciprofloxacin on the growth curve of PaeAG1. The growth rate decreased as the CIP concentration increased. Area under the curve (AUC) was compared using a t-test (p < 0.05), showing a statistical difference between all curves when compared to the control (0.0 µg/mL). Similarly, two-way ANOVA found differences in OD600nm and time for each case.

Ethical considerations. No animals or human participants were included in this study. Both the scientific committee of the Centro de Investigación en Enfermedades Tropicales (CIET) and the Vicerrectoría de Investigación of the Universidad de Costa Rica approved the study and the access to the PaeAG1 strain from the CIET collection of bacterial specimens.

Results

Concentration-dependent effect of CIP compromises the growth rate of PaeAG1. To evaluate the effects of CIP on the growth rate of PaeAG1, increasing concentrations of the antibiotic were added to exponential-phase PaeAG1, and growth was monitored for 16 h. As shown in Fig. 2, OD600nm values were highly consistent between replicates (error bars represent standard deviation). All CIP curves showed a statistically significant difference in OD600nm compared to the control (p < 0.05 for both AUC and two-way ANOVA). The lag phase for the control and the two lower CIP concentrations (5 and 12.5 µg/mL) lasted approximately 4 h, while the higher CIP concentration of 25.0 µg/mL showed a lag phase of 8 h.
Kinetics in the exponential phase were more variable. Cell growth decreased for 12.5 µg/mL CIP from 12 h onwards in comparison to 0 or 5.0 µg/mL, and the decrease was more evident at the same time point for 25 µg/mL. At 50.0 µg/mL (higher than the MIC), growth was drastically impaired and no exponential growth was observed. These results indicate that higher CIP concentrations have a stronger effect on the growth rate, even at sub-inhibitory concentrations (MIC for ciprofloxacin: 32 µg/mL).

The growth effects of two other antibiotics (imipenem and tobramycin) were also evaluated (supplementary Figure S1C–E, left). Unlike CIP, neither antibiotic changed the growth curves at the different sub-inhibitory concentrations.

Given the significant changes in the growth curves with CIP (with respect to the control), and considering the need for enough cell mass for RNA-Seq analysis, 12.5 µg/mL CIP was used to evaluate the transcriptomic response of PaeAG1 to a sub-inhibitory concentration of the antibiotic.

RNA-Seq analysis identifies 518 DEGs in PaeAG1 over time after exposure to CIP. A transcriptomic analysis was conducted to evaluate the molecular response of PaeAG1 to a sub-inhibitory CIP concentration. To this end, samples were taken at 0 (control), 2.5 and 5 h after CIP treatment. To ensure exponential growth at these times, the growth curve was monitored using OD600nm measurements (which reproduced Fig. 2), in addition to Colony Forming Unit (CFU) counts, as shown in supplementary Figure S1A.

After RNA extraction, an RNA integrity number (RIN) > 9 was obtained for all samples and paired-end RNA sequencing was performed. For all samples, quality control of the raw sequence data showed good results in terms of mean quality (> 30), no adapters, and no reads mapping to rRNA after filtering. Read mapping quality control showed that 98.6% of reads mapped to the PaeAG1 genome, with the expected uniform gene body coverage and TIN > 90 for all samples.
Details of the assessment of the transcriptomic data (counts per gene) are shown in supplementary Figure S2.

DEGs were identified by comparing the 2.5 h or 5 h time points against the initial 0 h time point after CIP exposure (Fig. 3A,B). As shown in Table 1, 355 DEGs were identified at 2.5 h, with 204 (57.5%) up-regulated and 151 (42.5%) down-regulated. At 5 h, 248 (56.6%) genes were up-regulated, while 190 (43.4%) were down-regulated, for a total of 438 DEGs. A total of 518 DEGs were found at either time point (union ∪), as shown in Fig. 3C and Table 1. These represent around 7% of the genes of PaeAG1. In addition, as presented in Fig. 3D, a total of 85 DEGs (at either time) belong to phages (27.6% of the 308 phage genes identified in the PaeAG1 genome), most of them up-regulated, as shown in Table 2, Fig. 4 and supplementary Figure S3. The regulated phages include phiCTX, F10, JBD44 and JDO24, for which 3, 10, 65 and 7 DEGs, respectively, were observed at either time (Table 2).

Figure 3. Differential expression analysis in PaeAG1 exposed to ciprofloxacin compared to initial time 0 h. Selection of DEGs according to adjusted p-value (p < 0.05) and logFC (logFC < −1 or logFC > 1) at 2.5 h (A) or 5 h (B) post-exposure to the antibiotic. (C) Venn diagram comparing the DEGs at the two evaluated times, with 275 shared genes (intersection) and a total of 518 genes at either time (union) with respect to time 0 h (control). More details in Table 1. (D) Venn diagram comparing DEGs with phage genes or virulence factors (more details in Table 2). (E) Heatmap of normalized counts and gene clustering of the total 518 DEGs at the three evaluated time points.
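The selection rule used in panels A and B (Benjamini–Hochberg adjusted p-value < 0.05 together with logFC > 1 or logFC < −1) can be illustrated with a minimal Python sketch. It is not the DESeq2 implementation, which performs additional steps before reporting adjusted p-values. The function names and the four example genes are hypothetical and serve only to show the rule.

```python
from typing import List


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values stay monotone (each one is capped by the one ranked above it).
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def classify(log_fc: float, padj: float,
             lfc_cut: float = 1.0, alpha: float = 0.05) -> str:
    """Apply the thresholds stated in the text to a single gene."""
    if padj < alpha and log_fc > lfc_cut:
        return "up"
    if padj < alpha and log_fc < -lfc_cut:
        return "down"
    return "not DE"


if __name__ == "__main__":
    # Hypothetical raw p-values and logFC values for four genes.
    raw_p = [0.001, 0.02, 0.04, 0.5]
    log_fc = [2.64, -1.37, 1.80, 0.20]
    for lfc, padj in zip(log_fc, benjamini_hochberg(raw_p)):
        print(f"logFC={lfc:+.2f}  padj={padj:.4f}  ->  {classify(lfc, padj)}")
```

In this toy input the third gene has a large logFC but is not selected, because its adjusted p-value rises just above 0.05 after correction.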
Table 1. Comparison of DEGs of PaeAG1 at 2.5 and 5 h after treatment with ciprofloxacin, including counts of down- or up-regulated genes, shared genes (intersection) and total genes at both times (union).

Sets | DEGs 2.5 h | DEGs 5 h | 2.5 h ∩ 5 h | 2.5 h ∪ 5 h
Up regulated genes | 204 | 248 | 153 | 299
Down regulated genes | 151 | 190 | 118 | 223
Total DEGs | 355 | 438 | 275 | 518

Table 2. Comparison of DEGs of PaeAG1 at 2.5 and 5 h after treatment with ciprofloxacin, and specific phages or categories of virulence factors, including shared genes (intersection) and total genes at both times (union), the regulation and the type of elements. *Based on logFC of genes for both times 2.5 and 5 h. Type of elements is also shown.

Type | Specific elements | Total genes (in PaeAG1 genome) | DEGs 2.5 h | DEGs 5 h | 2.5 h ∩ 5 h | 2.5 h ∪ 5 h | Regulation* and observations
Antibiotic resistance | Total | 56 | 3 | 2 | 2 | 3 | Down, lactamases
Phages | PPpW | 12 | 0 | 0 | 0 | 0 | No DEGs
Phages | phiCTX | 25 | 2 | 3 | 2 | 3 | Up
Phages | F10 | 62 | 1 | 9 | 0 | 10 | Up
Phages | JBD44 | 105 | 34 | 65 | 34 | 65 | Up
Phages | JDO24 | 59 | 4 | 7 | 4 | 7 | Up
Phages | phi3 | 45 | 0 | 0 | 0 | 0 | No DEGs
Phages | Total | 308 | 41 | 84 | 40 | 85 | –
Virulence factors | Adherence | 96 | 11 | 19 | 11 | 19 | Down
Virulence factors | Antimicrobial activity | 17 | 1 | 6 | 1 | 6 | Up, phenazines
Virulence factors | Antiphagocytosis | 25 | 0 | 0 | 0 | 0 | No DEGs
Virulence factors | Phospholipases | 3 | 0 | 0 | 0 | 0 | No DEGs
Virulence factors | Biosurfactant | 3 | 0 | 0 | 0 | 0 | No DEGs
Virulence factors | Iron uptake | 28 | 0 | 1 | 0 | 1 | Up, pyochelin
Virulence factors | Protease | 4 | 1 | 2 | 1 | 2 | Up, elastases
Virulence factors | Quorum sensing | 5 | 0 | 1 | 0 | 1 | Up, RhlR
Virulence factors | Regulation (GacS/GacA system) | 2 | 0 | 0 | 0 | 0 | No DEGs
Virulence factors | Secretion system | 63 | 0 | 2 | 0 | 2 | Down, T3SS
Virulence factors | Toxins | 4 | 0 | 1 | 0 | 1 | Up, hydrogen cyanide
Virulence factors | Total | 250 | 13 | 32 | 13 | 32 | –

Of the 250 known virulence factors of PaeAG1, 32 (12.8%) were identified as DEGs at any of the assessed times (arrowheads in Fig. 4 and supplementary Figure S3). These virulence factors are mainly associated with adherence (19 genes) and phenazines (6 genes) (see Table 2).
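The set sizes in Table 1 and the percentages quoted in the text can be cross-checked with a short Python sketch. The counts are copied from Tables 1 and 2; the helper names are hypothetical. The only relation used is the inclusion-exclusion identity |A ∪ B| = |A| + |B| − |A ∩ B|.

```python
def union_size(n_a: int, n_b: int, n_shared: int) -> int:
    """Inclusion-exclusion: size of the union of two gene sets."""
    return n_a + n_b - n_shared


def percent(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place, as reported in the text."""
    return round(100.0 * part / whole, 1)


# (DEGs 2.5 h, DEGs 5 h, intersection, union) as listed in Table 1.
TABLE_1 = {
    "up": (204, 248, 153, 299),
    "down": (151, 190, 118, 223),
    "total": (355, 438, 275, 518),
}


def table_1_is_consistent() -> bool:
    """Check that every reported union matches inclusion-exclusion."""
    return all(
        union_size(n25, n5, shared) == union
        for n25, n5, shared, union in TABLE_1.values()
    )


if __name__ == "__main__":
    print("Table 1 consistent:", table_1_is_consistent())
    print("Up-regulated at 2.5 h (%):", percent(204, 355))
    print("Phage genes that are DEGs (%):", percent(85, 308))
    print("Virulence factors that are DEGs (%):", percent(32, 250))
```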
Regarding antibiotic resistance genes, only three out of the 56 genes were found to be differentially expressed (Table 2).

A heatmap of normalized counts and gene clustering of the total 518 DEGs is shown in Fig. 3E. Well-defined clusters with similar expression patterns were found for both genes and samples.

Of all the DEGs at 2.5 h, seven corresponded to transcription factors, including psrA, rpoH and prtN. At 5 h, 14 DEGs, including psrA, rpoH, prtN, rpoS, rhlR and ptrB, were identified as transcription factors. All transcription factors activated at 2.5 h remained active at 5 h (Supplementary Table S2). Identification of regulators by a context-centric analysis revealed a total of 22 transcription factors modulating the DEGs at 2.5 h, most of which were also among the 28 transcription factors identified for the DEGs at 5 h (see Supplementary Table S2).

Genes of the SOS response were not identified as DEGs. The rpoS factor was up-regulated at 5 h. Because LexA (SOS response) and RpoS play a preponderant role in the response to CIP in P. aeruginosa, we further investigated the DNA binding sites for these elements. The CollecTF database provided the consensus binding sequence for LexA as CTG-TATAA-ATATA-CAG, as described previously²⁶. The analysis showed that LexA modulates all 15 genes of the SOS response in P. aeruginosa, and identified further binding sequences in the promoter regions of psrA (coding for a transcription factor, as described above), grpE, hemO and other genes. In PaeAG1, the psrA and grpE genes were up-regulated at 2.5 and 5 h after CIP treatment. For RpoS, no sequence information was available in CollecTF, so we used the RpoS-dependent promoter consensus sequence CTATACT reported previously⁷⁵. A total of 49 RpoS sites were predicted in promoter regions of PaeAG1 genes, but none of these genes were DEGs.
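The idea of locating a consensus binding sequence in promoter regions can be sketched in Python as a scan of both strands that tolerates a fixed number of mismatches. This is a deliberate simplification and not the method used here: FIMO scores candidate sites against a position-specific motif model rather than counting mismatches. The function names and the example promoter are hypothetical. The LexA consensus is the one quoted above, written without hyphens.

```python
from typing import List, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_mismatches(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def find_sites(sequence: str, motif: str,
               max_mismatches: int = 0) -> List[Tuple[int, str, int]]:
    """Return (start, strand, mismatches) for every window within the limit.

    The '+' strand compares each window to the motif itself; the '-' strand
    compares it to the reverse complement of the motif.
    """
    sequence, motif = sequence.upper(), motif.upper()
    patterns = (("+", motif), ("-", reverse_complement(motif)))
    hits = []
    for start in range(len(sequence) - len(motif) + 1):
        window = sequence[start:start + len(motif)]
        for strand, pattern in patterns:
            distance = count_mismatches(window, pattern)
            if distance <= max_mismatches:
                hits.append((start, strand, distance))
    return hits


LEXA_CONSENSUS = "CTGTATAAATATACAG"

if __name__ == "__main__":
    # Hypothetical promoter fragment carrying one exact LexA site.
    promoter = "GGG" + LEXA_CONSENSUS + "TTT"
    for start, strand, distance in find_sites(promoter, LEXA_CONSENSUS):
        print(f"site at {start} on strand {strand} with {distance} mismatches")
```

Raising `max_mismatches` makes the scan more permissive, at the cost of more spurious hits. A scored motif model handles that trade-off more carefully.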
Figure 4. Gene–gene interaction (GGI) large scale network of differentially expressed genes in PaeAG1 after ciprofloxacin treatment, using a database-based method for prediction of interactions. Interactions between genes were predicted using STRINGdb. All genes that were DEGs at either time (2.5 or 5 h) were included to build the network. A total of 342 genes were connected (66.0% of all DEGs), with 1685 edges in total (unconnected nodes are not shown). The logFC is shown for 5 h. Gray nodes represent genes that were differentially expressed only at 2.5 h (i.e. no logFC value is displayed at 5 h). Details of the network by time are shown in supplementary Figure S3. Phage genes, virulence factors and antibiotic resistance genes are represented as triangles, arrowheads and rhomboids, respectively. Red tones indicate down-regulation and blue tones up-regulation.

Network analysis shows pleiotropic effects of CIP exposure in PaeAG1. Using a top-down systems biology approach, a large scale GGI network of DEGs was built to identify molecular determinants associated with the response to CIP in PaeAG1.

GGI predictions by a database-based model. All 518 DEGs were incorporated as nodes, with high-confidence connections or interactions as edges. A total of 342 nodes (66.0% of all DEGs) were connected to at least one other gene, and 1685 edges were established (Fig. 4). When selecting DEGs for each time, 248 (69.9%) of the 355 DEGs at 2.5 h were connected, with a total of 1,156 edges (supplementary Figure S2A). Of the 438 DEGs at 5 h, 284 (64.8%) were connected, with 1,041 edges in total (supplementary Figure S2B). As shown in Fig. 4, some virulence factor determinants (adherence) and antibiotic resistance genes were down-regulated after CIP treatment, while phage genes and other virulence factors (phenazines) were up-regulated. In addition, gene clusters of highly connected DEGs showed the same expression pattern, suggesting coordinated regulation. The unconnected genes (107 DEGs at 2.5 h and 154 at 5 h) reflect limitations of the database (incomplete inclusion of phage genes) or the current state of the gene annotation (no information, hypothetical protein, etc.). To improve the associations between genes by creating more connections, a data-driven co-expression analysis was run.

Figure 5. Co-expression analysis to identify modules of genes and the data-driven co-expression network in PaeAG1 after ciprofloxacin treatment. (A) Identification of modules (clusters shown by colors) from genes with correlated expression (across times 0, 2.5 and 5 h) and clustering analysis after WGCNA was implemented. (B) Association of modules with traits, showing relations of the turquoise and blue modules with exposure time to the antibiotic and phage induction. (C) Data-driven co-expression network using the correlation of gene expression from the WGCNA analysis (correlation > 0.985). A total of 388 DEGs were connected, with a total of 1,073 edges. Only correlated genes are shown. More details in supplementary Figure S3A. Phage genes, virulence factors and antibiotic resistance genes are represented as triangles, arrowheads and rhomboids, respectively.

Co-expression analysis. Modules of highly connected genes (represented using color groups) were created using normalized counts for all 518 DEGs. As shown in Fig. 5A, genes were clustered into four main modules, each showing similar expression across samples.
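The soft-thresholding step behind these modules (Pearson correlation raised to the power β = 9, as stated in Methods) can be sketched in plain Python. The sketch assumes an unsigned network, where adjacency is |r|^β. The text does not state the network type, so this is an assumption. The expression vectors are hypothetical, and the later steps of the analysis (topological overlap, dynamic tree cut) are not reproduced.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def soft_threshold_adjacency(r: float, beta: int = 9) -> float:
    """Unsigned soft-threshold adjacency: |r| ** beta (assumed network type)."""
    return abs(r) ** beta


if __name__ == "__main__":
    # Hypothetical normalized counts for three genes across nine samples.
    gene_a = [1.0, 1.2, 0.9, 2.1, 2.0, 2.3, 3.1, 2.9, 3.2]
    gene_b = [2.1, 2.3, 1.9, 4.0, 4.2, 4.4, 6.0, 6.1, 6.3]
    gene_c = [3.0, 1.0, 2.5, 2.0, 3.5, 1.5, 2.2, 3.1, 1.8]
    for name, other in (("a-b", gene_b), ("a-c", gene_c)):
        r = pearson(gene_a, other)
        print(name, round(r, 3), round(soft_threshold_adjacency(r), 5))
```

Raising correlations to a high power keeps strong relationships and suppresses weak ones: with β = 9, a correlation of 0.9 maps to an adjacency of about 0.39, while 0.5 maps to about 0.002.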
The turquoise module contained 239 genes, the blue module 124, the brown module 114 and the yellow module 39. In the co-expression network (Fig. 5C), a total of 388 DEGs (74.9% of the 518 DEGs) were connected, with a total of 1,073 edges. Of these interactions, 385 were also found using the database-based model, and 688 novel gene interactions were suggested by our co-expression analysis. The turquoise module includes most of the phage genes and virulence factors.

Integrated GGI network of DEGs. The connections between genes predicted by the database-based model and by the co-expression analysis were integrated to build a definitive large scale network, shown in Fig. 6. A total of 449 DEGs (86.7%) were connected, in contrast with the 342 nodes in the preliminary network, an increase of about 20 percentage points. In addition, 2,373 edges were identified: 1685 from the database-based method (solid lines in the network) and the 688 new interactions suggested by the co-expression analysis (dashed lines). Furthermore, a separate cluster with high connectivity between phage genes was observed (cluster of blue triangles, Fig. 6 top left). Remarkably, this cluster appears to have a critical bottleneck at the fahA gene: many genes are connected to this node but, for the majority of the cluster nodes, this gene is the only connection to the rest of the network. The cluster therefore forms a clearly separated module. In addition, another smaller and less distinct cluster of phage genes was formed (Fig. 6 bottom left).

The same GGI network is presented in supplementary Figure S4A to show the distribution of genes by co-expression module. A high level of functional interaction between genes across different clusters is observed. The logFC values at 5 h are shown in the network in Figure S4B.

Figure 6. Definitive large scale network of DEGs, identification of hub genes and associated groups in PaeAG1 after treatment with ciprofloxacin. The network shows all 518 DEGs and their interactions (449 genes have at least one connection). Known interactions according to STRINGdb (database-based method) are shown as solid lines and data-driven interactions from the co-expression analysis as dashed lines. Enriched nodes associated with KEGG annotation are colored according to each pathway (more details in Table 3). Phage genes, virulence factors and antibiotic resistance genes are represented as triangles, arrowheads and rhomboids, respectively. Other genes are represented as ellipses.

Enrichment analysis. In order to gain insight into the biological meaning of the DEGs, gene set enrichment analysis (GSEA) was performed. The 518 DEGs were found to be implicated in a total of 15 KEGG pathways (Figs. 6 and 7, and Table 3). The enriched pathways included ribosomal functions, RNA degradation, biosynthesis of antibiotics, fatty acid metabolism, propanoate metabolism, fatty acid biosynthesis, quorum sensing, amino acid degradation, carbon metabolism and citrate cycle, butanoate metabolism and phenazine biosynthesis, among others (see Fig. 7). Details of gene counts, FDR and regulation are shown in Table 3. Additionally, the analysis by co-expression module (Table 3) showed that some modules are enriched in specific pathways. For example, the blue module is down-regulated for ribosomal activity and RNA degradation (functions exclusive to this module), while the yellow module covers multiple but tightly related pathways, most of them belonging to interconnected metabolic pathways and down-regulated.

Figure 7. Identification of hub genes and first-stage subnetwork of their associated groups in PaeAG1 after treatment with ciprofloxacin. (A) Hub genes identified using cytoHubba (betweenness and bottleneck methods) in the network of DEGs (large nodes). Details in Table 4. (B) Nodes that directly interact with the 14 hub genes were used to build a first-stage subnetwork. Node shapes and colors are the same as described in Fig. 6.

Only 14 hub genes are able to represent the key pathways regulated by CIP in PaeAG1. With the aim of identifying inter-modular key or central genes in the DEGs network of PaeAG1 after exposure to CIP, a hub gene identification analysis was conducted. This approach revealed 14 connected hub genes (Fig. 7A and details in Table 4). Two genes, identified as PaeAG1_03660 and PaeAG1_03610, are part of the phage JBD44 and were up-regulated at 5 h. Topologically, they are part of the two phage gene clusters identified in the main network (Fig. 6).
Two genes, sdhB and sdhC (down-regulated), have functions related to carbon and butanoate metabolism, and biosynthesis of secondary metabolites.

Table 3. Pathways related to DEGs network of PaeAG1 exposed to ciprofloxacin, according to KEGG annotation. Annotation of modules of co-expressed genes and the general regulation are also included. *Based on logFC of DEGs at both times 2.5 and 5 h.

KEGG term ID | Term description | Total gene count | DEGs 2.5 h: observed gene count | DEGs 2.5 h: FDR | DEGs 5 h: observed gene count | DEGs 5 h: FDR | Modules | Regulation (% DEGs)*
paeb01130 | Biosynthesis of antibiotics | 266 | 30 | 0.0015 | 34 | 0.00047 | Brown, Yellow | Down (61%)
paeb01110 | Biosynthesis of secondary metabolites | 320 | 30 | 0.0352 | 31 | 0.0205 | Yellow | Down (70%)
paeb00650 | Butanoate metabolism | 37 | 8 | 0.0133 | 9 | 0.0068 | Yellow | Down (55%)
paeb01200 | Carbon metabolism | 126 | 15 | 0.0258 | 18 | 0.0068 | Yellow | Down (80%)
paeb00020 | Citrate cycle (TCA cycle) | 30 | 7 | 0.0158 | 8 | 0.0068 | Yellow | Down (75%)
paeb00061 | Fatty acid biosynthesis | 27 | 7 | 0.0131 | 9 | 0.0014 | Yellow | Down (100%)
paeb01212 | Fatty acid metabolism | 49 | 8 | 0.0309 | 10 | 0.0068 | Yellow | Down (90%)
paeb00405 | Phenazine biosynthesis | 20 | 5 | 0.0309 | 6 | 0.0127 | Brown | Up (100%)
paeb00640 | Propanoate metabolism | 47 | 12 | 0.00061 | 16 | 3.87e-06 | Brown, Yellow | Variable (50/50)
paeb03060 | Protein export | 15 | 5 | 0.026 | 3 | 0.0014 | Yellow | Down (100%)
paeb02024 | Quorum sensing | 86 | 11 | 0.0317 | 14 | 0.0068 | Brown | Up (69%)
paeb03010 | Ribosome | 55 | 27 | 1.95e-14 | 27 | 2.63e-13 | Blue | Down (100%)
paeb03018 | RNA degradation | 17 | 5 | 0.0258 | 5 | 0.0273 | Blue | Down (60%)
paeb00072 | Synthesis and degradation of ketone bodies | 10 | 4 | 0.0258 | 4 | 0.0273 | Brown, Turquoise | Up (100%)
paeb00280 | Valine, leucine and isoleucine degradation | 46 | 11 | 0.0015 | 11 | 0.0023 | Brown, Turquoise | Up (82%)
Interestingly, the ribosomal protein L32 (rpmF, down-regulated), a chaperonin (groL, up-regulated) and the sigma factor (rpoS, up-regulated) were also identified as single molecular determinants of the network. In addition, the fahA gene, which codes for the fumarylacetoacetase enzyme and was recognized above as a bottleneck for the phage gene cluster, was identified as a hub gene.

Analysis of the gene clusters of first-stage connected genes (Fig. 7B) showed not only the same profile of enriched pathways as for the hub genes (Fig. 7A), but also other pathways such as lipid metabolism, phenazine biosynthesis and quorum sensing. These groups include many phage elements, virulence factors and multiple uncharacterized genes, as well as one antibiotic resistance gene (PaeAG1_05751). The logFC values at 5 h are shown in Figure S4C.

Six hub genes were consistently identified by both the bottleneck and betweenness approaches (Table 4). Together with rpoS and groL, eight hub genes (57%) are part of the turquoise module, and all of them are up-regulated by CIP. The remaining genes are part of the brown (4) and blue (2) modules. Only four genes were down-regulated, three of them belonging to the brown module.

To compare the expression profiles of hub genes with other studies, Table 4 includes information on how other perturbations or stressors of P. aeruginosa modulate the expression of these genes. The effects of CIP on hub genes in our results were similar to those in a previous report²⁶. The effects of azithromycin seem to be opposite to those of CIP for these genes. More variable results were found for other perturbations (e.g. colistin, copper and H₂O₂), and lecB was the only hub gene that was up-regulated for all perturbations.

Thus, as expected, hub genes are strongly linked both to elements of highly connected gene clusters and to the key pathways in the response to CIP.
Together, these three elements (hub genes, gene clusters and enriched pathways) represent the determinants of the response to CIP in PaeAG1, many of them related to the modulation of bacterial growth, as initially hypothesized.

Concentration-dependent effect of CIP on PaeAG1 phage induction. According to the transcriptomic analysis, phage genes were up-regulated in PaeAG1 under 12.5 μg/mL CIP treatment. To validate these results at the phenomic level, lytic plaque formation was evaluated using a phage plaque assay. As shown in Fig. 8A, after treatment with 12.5 µg/mL CIP, phage induction increased tenfold (1,000 PFU/mL) with respect to the control condition without antibiotics, in agreement with the molecular findings. More drastic changes were observed at higher concentrations: more than 10,000 and 100,000 PFU/mL were quantified for PaeAG1 after treatment with 25.0 and 50.0 µg/mL CIP, respectively. Figure 8C shows phage plaques on a culture plate during in vitro assays. Unlike CIP, when the same analysis was done for imipenem and tobramycin (supplementary assay), no induction was observed. Indeed, a slight reduction was observed for imipenem (Supplementary Figure S1C–E, right).
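An exponential dose-response of this kind can be fitted by ordinary least squares on log-transformed counts, as in the Python sketch below. Both the fitting procedure and the data points are assumptions made for illustration. The PFU values are round numbers of the same order of magnitude as those quoted above, not the measured data, and the text does not specify how the regression was computed.

```python
import math
from typing import Sequence, Tuple


def fit_exponential(x: Sequence[float],
                    y: Sequence[float]) -> Tuple[float, float, float]:
    """Fit y = a * exp(b * x) by linear least squares on ln(y).

    Returns (a, b, r_squared), where r_squared is computed on the log scale.
    All y values must be strictly positive.
    """
    ln_y = [math.log(value) for value in y]
    n = len(x)
    mean_x, mean_ln_y = sum(x) / n, sum(ln_y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_ln_y) for xi, yi in zip(x, ln_y))
    syy = sum((yi - mean_ln_y) ** 2 for yi in ln_y)
    b = sxy / sxx
    a = math.exp(mean_ln_y - b * mean_x)
    r_squared = (sxy ** 2) / (sxx * syy)
    return a, b, r_squared


if __name__ == "__main__":
    # Illustrative values only: CIP in µg/mL, phage counts in PFU/mL.
    cip = [0.0, 12.5, 25.0, 50.0]
    pfu = [1e2, 1e3, 1e4, 1e5]
    a, b, r_squared = fit_exponential(cip, pfu)
    print(f"PFU ~ {a:.1f} * exp({b:.4f} * CIP), R^2 (log scale) = {r_squared:.3f}")
```

Fitting on the log scale weights each order of magnitude equally. This seems a reasonable choice when counts span several decades, although a direct nonlinear fit would give somewhat different parameters.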
Table 4. Characterization of hub genes in the DEGs network of PaeAG1 after treatment with ciprofloxacin. *Cases with gray numbers refer to genes which were not selected as a DEG at that time (logFC and adjusted p-value). **Cases with "–" refer to no annotation information. ***Results from other studies: ↑ up-regulated, ↓ down-regulated, ↕ variable regulation or "–" no information. All results from GEO-NCBI according to stress conditions: Cu (copper) from (Teitzel et al., 2006), CIP (ciprofloxacin) from (Cirz, O'Neill, Hammond, Head, & Romesberg, 2006), COL (colistin) from (Cummins, Reen, Baysse, Mooij, & O'Gara, 2009), AZM (azithromycin) from (Kai et al., 2009) and H₂O₂ (hydrogen peroxide) from (Chang, Small, Toghrol, & Bentley, 2005).

PaeAG1 Locus ID | Gene name | Betweenness score* | Bottleneck score* | logFC 2.5 h* | logFC 5 h* | Co-expression module | KEGG annotation** | Annotation details | Other studies***
PaeAG1_01864 | acpP (PA2966) | 6,268.3 | 17 | 2.64 | 3.63 | Turquoise | Metabolic pathways, biosynthesis of antibiotics | Acyl carrier protein; fatty acid biosynthesis | ↑ AZM; ↕ CIP COL
PaeAG1_06246 | ygaU | 6,340.4 | – | 1.79 | 0.9 | Blue | – | LysM domain/BON superfamily protein | –
PaeAG1_04068 | sdhB (PA1584) | 6,401.4 | – | −1.37 | −1.17 | Blue | Biosynthesis of antibiotics, Carbon metabolism, Citrate cycle (TCA cycle), Butanoate metabolism, Biosynthesis of secondary metabolites | Succinate dehydrogenase and fumarate reductase iron-sulfur family protein | ↑ COL AZM; ↓ CIP Cu
PaeAG1_04991 | prpC (PA0795) | 6,485.7 | 14 | 1.58 | 1.7 | Turquoise | Propanoate metabolism | Belongs to the citrate synthase family | ↑ H₂O₂ CIP; ↓ AZM; ↕ COL
PaeAG1_03610 | DR97_5412 | 7,285.4 | 15 | 0.9 | 1.84 | Turquoise | – | Phage: JBD44; Tail tape measure protein | –
PaeAG1_05221 | groL or groEL (PA4385) | 8,440.2 | – | 1.16 | 1.21 | Turquoise | RNA degradation | 60 kDa chaperonin; Prevents misfolding and promotes the refolding and proper assembly of unfolded polypeptides generated under stress conditions | ↑ CIP Cu; ↓ AZM; ↕ H₂O₂
PaeAG1_04071 | sdhC (PA1581) | 8,716.8 | – | −1.68 | −1.54 | Brown | Biosynthesis of antibiotics, Carbon metabolism, Citrate cycle (TCA cycle), Butanoate metabolism, Biosynthesis of secondary metabolites | Succinate dehydrogenase, cytochrome b556 subunit | ↑ AZM; ↓ CIP Cu
PaeAG1_03660 | PaeAG1_03660 | 9,477.2 | 17 | 1.05 | 1.23 | Turquoise | – | Phage: JBD44 | –
PaeAG1_03555 | fahA (PA2008) | 11,245.9 | 16 | 1.19 | 1.93 | Turquoise | Tyrosine metabolism | Fumarylacetoacetase | ↑ CIP; ↕ COL; ↓ Cu AZM
PaeAG1_01837 | lecB (PA3361) | 13,150.8 | 17 | 1.71 | 3.88 | Turquoise | Quorum sensing | Fucose-binding lectin PA-IIL | ↑ CIP COL AZM
PaeAG1_01229 | DR97_3944 | – | 15 | 1.3 | 1.45 | Brown | – | Uncharacterized protein | –
PaeAG1_01591 | rpoS (PA3622) | – | 15 | 1.03 | 1.49 | Turquoise | Transcription machinery | RNA polymerase sigma factor RpoS | ↑ COL CIP; ↓ AZM; ↕ Cu
PaeAG1_01361 | DR97_4078 | – | 19 | −1.22 | −1.48 | Brown | – | Uncharacterized protein | –
PaeAG1_02250 | rpmF (PA2970) | – | 22 | −1.17 | −1.39 | Brown | Ribosome | Ribosomal protein L32; Belongs to the bacterial ribosomal protein bL32 family | ↓ CIP H₂O₂; ↕ COL Cu

Figure 8. Phage plaques assay of PaeAG1 after exposure to ciprofloxacin. (A) Phages of PaeAG1 are induced under CIP exposure, with higher induction of phage plaques at higher concentrations of the drug, as shown by the exponential regression in (B). (C) Example of visualization of phage plaques on a culture plate during in vitro assays.

The analysis of modules against traits of PaeAG1 (phage production and time after CIP exposure) is presented in Fig. 5B. This analysis revealed a significant association of gene expression in the blue module with the changes at 2.5 h after CIP treatment and the low phage induction at this same time point.
In a similar way, the turquoise module was significantly associated with changes in gene expression at 5 h and with stronger phage induction. Other modules were not directly associated with these traits.

Altogether, these results indicate that phage induction in PaeAG1 is strongly dependent on CIP concentration, as shown by an exponential regression (R² = 0.97) in Fig. 8B.

Discussion

P. aeruginosa is a remarkable organism that can successfully resist, adapt, and survive in a wide variety of environments [29]. This versatility is conferred by the large proportion (> 8%) of regulatory genes encoded in its large genome (6–7.5 Mb; 7.2 Mb in the case of PaeAG1) [5,22]. PaeAG1 is a high-risk ST-111 strain isolated from an immunocompromised patient in a Costa Rican hospital, with resistance to multiple antibiotics including CIP and carbapenems.

Although many P. aeruginosa strains are resistant to CIP [6,10,12,48] and other antibiotics, the effects of sub-lethal concentrations on the development of antibiotic resistance had been ignored for decades due to the assumption that resistance emerges only with lethal concentrations (> MIC) [14]. Therefore, we evaluated the effect of different CIP concentrations on the PaeAG1 growth rate (Fig. 2). We detected a concentration-dependent reduction of the growth rate as the CIP concentration increased, similar to another study with CIP in P. aeruginosa [12]. We then employed RNA-Seq analysis to investigate the influence of a sub-inhibitory CIP concentration on the gene expression of PaeAG1 and its relationship with bacterial growth, similar to recent studies in P. aeruginosa [76,77] and other bacteria [16,44,78–81]. Differential expression analysis (Fig. 3) highlighted 518 DEGs at 2.5 and 5 h. Contrasting results have been previously reported in P. aeruginosa after CIP exposure, with some variations attributed mainly to differences in CIP concentration, time after exposure and/or the technical approach [12,26,48].
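As an illustration of the exponential regression reported for phage plaques versus CIP concentration (Fig. 8B), the following minimal Python sketch fits y = a·exp(b·x) by ordinary least squares on ln(y) and reports R² on the log scale. This is one common way to fit such a curve and is not necessarily the procedure used for Fig. 8B. The function name `fit_exponential` and all concentration and plaque values are hypothetical placeholders, not data from this study.

```python
import math


def fit_exponential(x, y):
    """Fit y = a * exp(b * x) by least squares on ln(y).

    Returns (a, b, r_squared), with r_squared computed on the log scale.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    if any(value <= 0 for value in y):
        raise ValueError("all y values must be positive to take logarithms")

    log_y = [math.log(value) for value in y]
    n = len(x)
    mean_x = sum(x) / n
    mean_log_y = sum(log_y) / n

    # Sums of squares and cross-products for simple linear regression.
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_xy = sum((xi - mean_x) * (li - mean_log_y) for xi, li in zip(x, log_y))
    s_yy = sum((li - mean_log_y) ** 2 for li in log_y)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("x and ln(y) must each take at least two distinct values")

    b = s_xy / s_xx                     # slope on the log scale
    log_a = mean_log_y - b * mean_x     # intercept on the log scale
    residual = sum((li - (log_a + b * xi)) ** 2 for xi, li in zip(x, log_y))
    r_squared = 1.0 - residual / s_yy
    return math.exp(log_a), b, r_squared


if __name__ == "__main__":
    # Hypothetical CIP concentrations (arbitrary units) and plaque counts.
    cip_concentration = [0.25, 0.5, 1.0, 2.0, 4.0]
    plaque_count = [4, 5, 9, 21, 160]
    a, b, r2 = fit_exponential(cip_concentration, plaque_count)
    print(f"y = {a:.2f} * exp({b:.2f} * x), R^2 = {r2:.3f}")
```

Because the fit is performed on ln(y), plates with zero plaques cannot be used directly and would need to be handled separately; this is a simplification of the sketch.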
We used a top-down systems biology approach to build the interaction network across the 518 DEGs. Interactions were modeled using a database-based method and co-expression analysis. A total of 14 hub genes, multiple gene clusters and 15 KEGG pathways were associated with the molecular response to CIP, many of them related to bacterial growth, in line with other studies [26,82,83]. The discovery and description of these strong relationships between genes not only provided biological insights into molecular regulation under stress conditions [42], but also helped to reduce data complexity to a few central elements [40], as in other studies in P. aeruginosa PAO1 [47] and E. coli [40].

Sigma factor RpoS as a hub gene. Not surprisingly, one of the hub genes identified in PaeAG1 after CIP treatment was rpoS. This gene was only up-regulated at 5 h after exposure, suggesting a late regulation in comparison with other DEGs. RpoS is considered a master regulator of the general stress response [35], which is induced when bacterial growth decreases, or under starvation, antibiotics and osmotic or oxidative stress [18]. In addition, RpoS participates in the protection of cellular macromolecules [18], modulation of metabolism, virulence, and changes in cell envelope and morphology [11]. The overexpression of RpoS suggests that bacteria enter a stationary phase-like state upon stress conditions, as reported previously [44]. This is further supported by the significantly reduced bacterial growth observed under CIP treatment at various concentrations.

According to growth curves, PaeAG1 was in exponential phase at the time points used for the transcriptomic analysis (Fig. 2 and supplementary Figure S1A). This is a key point to ensure that RpoS induction (and the response as a whole) is explained by the antibiotic and not by stationary-phase entry (i.e.
experimental design). The dependence of the observed changes on CIP treatment was further supported by the fact that the curves under the same conditions showed no changes for the antibiotics imipenem or tobramycin (supplementary Figure S1C–E). Other fluoroquinolones were not tested for their effect on the production of phages in PaeAG1.

In addition, DNA binding site analysis using the consensus sequence described in [75] revealed 49 sites for RpoS in PaeAG1 genes; however, none of these genes were DEGs. In the same work, RpoS regulated 772 genes at the stationary phase, of which 41 genes (5%) were identified as DEGs in our study. Since our analysis was performed at the exponential phase, the small number of common genes could be attributed to the growth phase differences between the studies. In another study that used a de novo approach to identify binding sites by ChIP-Seq, RpoS was shown to have 199 binding motifs in P. aeruginosa PA14 [37], including six transcription factors. In PaeAG1, 23 of these 199 genes corresponded to promoter regions of DEGs, including the RhlR and RpoS (itself) transcription factor genes. This suggests that 12% of the RpoS regulon was modulated by CIP in PaeAG1. Interestingly, context-centric analysis revealed that up to 28 transcription factors (including RpoS) are associated with the response to CIP, regulating gene expression with pleiotropic consequences and defining a crosstalk among factors in P. aeruginosa [37].

On the other hand, the RpoS response contributes to the robustness of bacterial cells facing stress conditions, acting synergistically with the SOS response [18]. Although the SOS response is known to be induced by CIP in P. aeruginosa and other bacteria [20,26,27,84], in this study it was not significantly induced in response to CIP treatment at 2.5 and 5 h. The absence of SOS induction may be due to the timing and concentration of CIP treatment. In E.
coli, dynamic models have shown that the response to cell stress is very fast, and stability of the SOS response can be achieved in minutes, around 30 min according to [85] or up to 90 min according to [86], until homeostasis is recovered or stronger stress responses are induced. Also, the SOS regulon of P. aeruginosa was established using a supra-inhibitory CIP concentration (8 × MIC) at 30 and 120 min [26]. These differences in concentration and time (0.4 × MIC at 2.5 and 5 h for PaeAG1) could explain the absence of SOS elements among the DEGs. Our results are similar to those of a proteomic study in P. aeruginosa in which profiles at 1.5, 5.5 and 14.5 h after CIP treatment were evaluated: neither LexA nor other SOS proteins were differentially expressed, except for RecA, which was found to be up-regulated [87].

Phage induction as a response determinant. Regarding phage genes, two gene clusters with hub genes were defined in PaeAG1 after CIP treatment. Phage induction is known to be modulated upon stress conditions, including the SOS response [88]. As found recently for some antimicrobials, phage activity is a product of pleiotropic regulation [89]. In the presence of sub-lethal concentrations of certain antibiotics, phages have been observed to be induced or to form larger phage plaques [88,90]. Under fluoroquinolone exposure, P. aeruginosa DNA is damaged and the SOS response is triggered. In a manner similar to LexA cleavage, the phage repressor cleavage reaction is stimulated by activated RecA, allowing virus assembly [91,92] and killing of the bacterium [93]. In some cases, alternative RecA-independent mechanisms have been described [91,94].

PaeAG1 has six prophages in the genome, including two complete elements [5]. After CIP exposure, 85 phage genes were up-regulated, most of them from JBD44 (65 of the 105 JBD44 genes). In the co-expression analysis, when the association between modules and traits was assessed, the turquoise module (Fig.
5) was significantly related to CIP exposure time and phage induction, indicating coordinated expression of the genes in this cluster in association with these traits (Fig. 5B).

Although general information on PaeAG1 phages is scarce, there is evidence to suggest that JBD44 is one of the most prevalent phages in P. aeruginosa [95]. Effects of JBD44 induction on growth have been previously described in P. aeruginosa PAO1, showing that JBD44 expression significantly decreased the growth of PAO1, unlike other phages [96]. Similarly, SOS-mediated phage induction has been reported in P. aeruginosa PAO1 [12,26] and LESB58 [97]. In addition, an evaluation of the effects of several antibiotics found that CIP and norfloxacin (another fluoroquinolone) caused a high level of phage induction, but variable results were found for other antibiotics [92]. As observed in our experiments, no induction was found for imipenem or tobramycin (supplementary Figure S1C–E).

The underlying relationship between the up-regulation of multiple phage genes in PaeAG1 after CIP exposure and the effect on bacterial lysis was validated by assessing the effect of CIP concentration on phage induction. A concentration-dependent effect of CIP on both growth curves (rate reduction, Fig. 2) and phage plaque formation (exponential increase, Fig. 8) was demonstrated. This validated the transcriptomic finding of up-regulation of phage genes in PaeAG1.

In congruence with this and with the enriched pathways in PaeAG1, it has been reported that cells can adapt to stresses by disrupting their own metabolism in a way that impairs the success of phage activity [98]. This implies that effects are observed not only on the host cell fate but also on the modulation of different responses, including RpoS regulation. These changes can be a product of tight modulation of functions reliant on molecular interactions from both phage and bacteria [99].
Similarly, as phages generally appear to consume amino acid metabolites [100], the bacterial up-regulation of genes involved in amino acid catabolism has been suggested as a strategy for reducing infection success [98] and disrupting phage propagation [100]. Blasdel et al. (2017) found that the maiA, fahA, hmgA and hpd genes of tyrosine catabolism were up-regulated by P. aeruginosa during phage activity [98]. In our study, all four genes were up-regulated, including fahA as a hub gene and a bottleneck element for the main phage gene cluster, indicating a catabolic effect after exposure to CIP that may be related to phage induction. More details of the fahA gene are discussed later.

Although different possibilities for the regulation of phage genes have been suggested, most of the predicted phage genes of PaeAG1 cannot be associated with a putative function, as in other studies [26]. This complicates the interpretation of the results for particular genes [99]. Validation of phage induction at the phenomic level, in congruence with the transcriptomic results, suggests that modulation of phages by CIP (but not by imipenem or tobramycin, as discussed before) in PaeAG1 is possible. This is particularly relevant since this strain is an ST-111 high-risk clone and a Priority 1 (critical) organism (resistant to carbapenems) according to the WHO [8]. Such modulation could be exploited by targeting phage production as a therapeutic option, with the advantage that the induced phages are resident elements of the genome and not exogenous elements as in other studies. Thus, treatment of antibiotic-resistant bacterial infections can potentially be improved by combining phage therapy and traditional antibiotics, regardless of whether cells are growing in biofilms or as planktonic bacteria [88].
In addition, phage therapy can be used as a bactericidal element against multiresistant strains [93]. However, this does not necessarily apply to all P. aeruginosa strains, since phage induction in other cases (with different strains and antibiotics) has been shown to be variable [92].

Other transcriptomic determinants. Of the 15 pathways recognized as enriched in PaeAG1 after CIP treatment, ribosomal activity, RNA degradation and several metabolic routes were the most prominently enriched. A reduction over time in the abundance of ribosomal proteins and proteins implicated in cell division indicates a shift by tolerant cells away from growth [87], as evidenced by the changes in the growth curves under different CIP concentrations in PaeAG1.

In the case of ribosomal activity, a cluster is clearly recognized in the whole network and in the subnetwork of hub genes, where the rpmF gene is the down-regulated hub element (logFC of −1.17 and −1.39 in Table 4). The rpmF gene encodes the 50S ribosomal subunit protein L32, which is involved in protein synthesis and membrane lipid synthesis [101]. It is also involved in multidrug tolerance by modulating biofilm formation and persister cell induction [102].

Regarding metabolism, several reports have shown a down-regulation of energy production and of carbohydrate, amino acid and lipid metabolism [15,36,87,103,104]. Five hub genes (sdhB, sdhC, prpC, acpP and fahA) are particularly associated with metabolism. For instance, fahA is key in the inhibition of amino acid metabolism [105], coding for a fumarylacetoacetase necessary for the tyrosine catabolism pathway. In addition, fahA is a topological bottleneck in the networks (Fig. 6A–C), separating the main phage gene cluster from the rest of the nodes. As detailed before, regulation of this gene could be used to restrict amino acid access for the phage, thus restraining full phage activity [98].
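The idea of a topological bottleneck, as described above for fahA, can be illustrated with betweenness centrality on a toy network. The following Python sketch is illustrative only: it implements Brandes' algorithm for unweighted, undirected graphs with the standard library, and it is not the cytoHubba procedure used to score the hub genes in Table 4. The node names (`phage_1`, `bridge`, `other_1`, and so on) and the edge list are hypothetical; `bridge` simply plays the role of a node that separates one cluster from the rest of the network.

```python
from collections import deque


def build_adjacency(edges):
    """Build a symmetric adjacency map {node: set(neighbours)} from an edge list."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def betweenness_centrality(adjacency):
    """Unnormalised betweenness (Brandes' algorithm), unweighted undirected graph.

    Each unordered pair of endpoints is counted once.
    """
    scores = {node: 0.0 for node in adjacency}
    for source in adjacency:
        # Breadth-first search: count shortest paths and record predecessors.
        visit_order = []
        predecessors = {node: [] for node in adjacency}
        path_count = {node: 0 for node in adjacency}
        distance = {node: -1 for node in adjacency}
        path_count[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            visit_order.append(v)
            for w in adjacency[v]:
                if distance[w] < 0:              # first time w is reached
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:  # v lies on a shortest path to w
                    path_count[w] += path_count[v]
                    predecessors[w].append(v)
        # Back-propagate pair dependencies from the farthest nodes inwards.
        dependency = {node: 0.0 for node in adjacency}
        for w in reversed(visit_order):
            for v in predecessors[w]:
                dependency[v] += path_count[v] / path_count[w] * (1.0 + dependency[w])
            if w != source:
                scores[w] += dependency[w]
    # In an undirected graph every pair was visited from both endpoints.
    return {node: value / 2.0 for node, value in scores.items()}


if __name__ == "__main__":
    # Hypothetical toy network: two triangles joined through a single node.
    toy_edges = [
        ("phage_1", "phage_2"), ("phage_1", "phage_3"), ("phage_2", "phage_3"),
        ("phage_1", "bridge"), ("bridge", "other_1"),
        ("other_1", "other_2"), ("other_1", "other_3"), ("other_2", "other_3"),
    ]
    toy_scores = betweenness_centrality(build_adjacency(toy_edges))
    for node, value in sorted(toy_scores.items(), key=lambda item: -item[1]):
        print(f"{node}\t{value:.1f}")
```

In this toy network every shortest path between the two triangles passes through `bridge`, so it receives the highest score, which is the sense in which a single node can separate a gene cluster from the rest of a network.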
In the case of RNA degradation pathways, we identified groL (or groEL) as a hub gene, a homolog of heat shock protein 60 [106]. DnaK and GroL are major ubiquitous chaperones that play crucial roles in promoting protein folding during normal growth and under stress conditions [107] such as oxidative stress, antibiotics or heat [26,107,108]. In PaeAG1, both chaperones were up-regulated.

In relation to virulence factors, CIP modulated adherence and phenazines. A total of 19 DEGs implicated in adherence were identified, with down-regulation observed for LPS O-antigen, flagella, and type IV pili biosynthesis elements. Similar results were found for P. aeruginosa after CIP treatment in another study [26]. Under other stress conditions, this down-regulation has been suggested to be a mechanism to avoid biofilm formation as a possible way to escape as planktonic cells [46] and, in general, to modulate mechanisms for colonization, survival and invasion within the host tissues [93]. Regarding phenazines, six genes were up-regulated. This profile is associated with tolerance to oxidative stress, iron availability, biofilms, virulence and killing of microbial competitors [109]. Phenazine biosynthesis is regulated by the Rhl [76] and PQS [110] quorum sensing systems in P. aeruginosa. The rhlR gene was found to be up-regulated, suggesting a possible regulation of the phenazines.

More details of specific genes and their relationship with other virulence factors, antibiotic resistance and other responses (all with few DEGs) are discussed in the supplementary material “Extended discussion: Other transcriptomic determinants of PaeAG1 in response to CIP”.

Altogether, the transcriptomic analysis in PaeAG1 allowed us to identify key molecular determinants of the response to CIP, many of them related to bacterial growth, such as RpoS and phage induction. This is consistent with our hypothesis that the transcriptomic response to CIP is related to the modulation of bacterial growth.
After a DNA damage response is induced by sub-inhibitory CIP treatment, subsequent pathway modulation and transcriptional changes define changes in bacterial growth. A conceptual representation of these results is shown in Fig. 9, aiming to integrate our results, literature reports and possible unknown connections.

All these features are particularly relevant for high-risk strains, such as PaeAG1. As has been suggested, the biological markers of P. aeruginosa high-risk clones could be useful for the future design of specific treatments and infection control strategies [7]. Thus, more detailed analyses are needed to study the different levels of transcriptomic regulation in PaeAG1, including targeted expression analysis, other stress conditions, genetic and phenotypic variability, validation of the effect and power of hub genes, exploration of the relationship between the presence of specific virulence traits and severity, and phage induction as a potential therapy.

Figure 9. Conceptualization of the effects of ciprofloxacin treatment in PaeAG1 at the molecular level. DNA damage triggers an increase in RecA, which promotes the cleavage of different repressors such as LexA, inducing the SOS response, as well as of phage induction repressors and other elements. The general stress induces the RpoS response, modulating different responses and virulence factors. Other modulators induce changes in the metabolic state of cells and the expression of virulence factors, as well as the down-shift in ribosomal activity. Together, all changes imply modulation of multiple responses with pleiotropic effects at the molecular level and regulation of phenotypes to face the stress imposed by the antibiotic.
Conclusions

In this work, we report a concentration-dependent reduction of the PaeAG1 growth rate with increasing sub-inhibitory CIP concentrations, based on a comparison of growth curves. The RNA-Seq analysis of PaeAG1 after treatment with a sub-inhibitory CIP concentration allowed us to identify 518 DEGs across the 2.5 and 5 h time points. Using a top-down systems biology approach, we identified diverse transcriptomic determinants: 14 hub genes, multiple gene clusters and 15 enriched pathways. These included down-regulation of pathways related to metabolism, ribosomal activity and adherence factors, most of them related to the reduction of bacterial growth. Phages, phenazines and specific virulence factors were found to be up-regulated. In most cases, hub genes and complex relationships were identified, showing pleiotropic effects that are mainly illustrated by clusters of highly connected genes. Two particular clusters of phage genes were up-regulated by CIP. The effect of CIP on phage induction was validated at the phenomic level with a phage plaque assay, which showed an exponential increase in induction as the CIP concentration increased.

To our knowledge, this is the first report of the analysis of the CIP response in an ST-111 high-risk P. aeruginosa strain, in particular by a combined strategy using a top-down systems biology approach. This led us to identify transcriptomic determinants in response to CIP, including the induction of resident phages as a potential therapeutic strategy to overcome antibiotic resistance.

Data availability

The RNA-Seq raw data and processed transcript quantification files are available at the NCBI Gene Expression Omnibus (GEO) database under accession number GSE139866. Processed data and scripts for bioinformatics analyses (RNA-Seq data, differential expression using DESeq2 and co-expression analyses) are available at https://github.com/josemolina6/PaeAG1_CIP_RNA-Seq.
Genome sequence and annotation files in all required formats for mapping and quality control of the RNA-Seq read alignment are available from our previous work at https://github.com/josemolina6/PaeAG1_genome. More details of the genome assembly and annotation are given in [5].

Received: 11 March 2020; Accepted: 25 June 2020

References

1. Lyczak, J. B., Cannon, C. L. & Pier, G. B. Establishment of Pseudomonas aeruginosa infection: Lessons from a versatile opportunist. Microbes Infect. 2, 1051–1060 (2000).
2. Goldberg, J. B. ‘Pseudomonas ’99, The Seventh International Congress on Pseudomonas: biotechnology and pathogenesis’, organized by the American Society for Microbiology, was held in Maui, HI, USA, 1–5 September 1999. Trends Microbiol. 8, 55–57 (2000).
3. Wu, W. & Jin, S. PtrB of Pseudomonas aeruginosa suppresses the type III secretion system under the stress of DNA damage. J. Bacteriol. 187, 6058–6068 (2005).
4. Silby, M. W., Winstanley, C., Godfrey, S. A. C., Levy, S. B. & Jackson, R. W. Pseudomonas genomes: Diverse and adaptable. FEMS Microbiol. Rev. 35, 652–680 (2011).
5. Molina-Mora, J.-A., Campos-Sánchez, R., Rodríguez, C., Shi, L. & García, F. High quality 3C de novo assembly and annotation of a multidrug resistant ST-111 Pseudomonas aeruginosa genome: Benchmark of hybrid and non-hybrid assemblers. Sci. Rep. 10, 1392 (2020).
6. Toval, F. et al. Predominance of carbapenem-resistant Pseudomonas aeruginosa isolates carrying blaIMP and blaVIM metallo-β-lactamases in a major hospital in Costa Rica. J. Med. Microbiol. 64, 37–43 (2015).
7. Mulet, X. et al. Biological markers of Pseudomonas aeruginosa epidemic high-risk clones. Antimicrob. Agents Chemother. 57, 5527–5535 (2013).
8. World Health Organization.
Guidelines for the prevention and control of carbapenem-resistant Enterobacteriaceae, Acinetobacter baumannii and Pseudomonas aeruginosa in health care facilities. (2017).
9. Woodford, N., Turton, J. F. & Livermore, D. M. Multiresistant Gram-negative bacteria: The role of high-risk clones in the dissemination of antibiotic resistance. FEMS Microbiol. Rev. 35, 736–755 (2011).
10. Farajzadeh Sheikh, A. et al. Molecular epidemiology of colistin-resistant Pseudomonas aeruginosa producing NDM-1 from hospitalized patients in Iran. Iran. J. Basic Med. Sci. 22, 38–42 (2019).
11. Firme, M., Kular, H., Lee, C. & Song, D. RpoS contributes to variations in the survival pattern of Pseudomonas aeruginosa in response to ciprofloxacin. J. Exp. Microbiol. Immunol. 14, 21–27 (2010).
12. Brazas, M. D. & Hancock, R. E. W. Ciprofloxacin induction of a susceptibility determinant in Pseudomonas aeruginosa. Antimicrob. Agents Chemother. 49, 3222–3227 (2005).
13. McVicker, G. et al. Clonal expansion during Staphylococcus aureus infection dynamics reveals the effect of antibiotic intervention. PLoS Pathog. 10, 2 (2014).
14. Andersson, D. I. & Hughes, D. Microbiological effects of sublethal levels of antibiotics. Nat. Rev. Microbiol. 12, 465–478 (2014).
15. Stewart, P. S. et al. Contribution of stress responses to antibiotic tolerance in Pseudomonas aeruginosa biofilms. Antimicrob. Agents Chemother. 59, 3838–3847 (2015).
16. Matern, W. M., Rifat, D., Bader, J. S. & Karakousis, P. C. Gene enrichment analysis reveals major regulators of Mycobacterium tuberculosis gene expression in two models of antibiotic tolerance. Front. Microbiol. 9, 1–10 (2018).
17. Hocquet, D. et al. Evidence for induction of integron-based antibiotic resistance by the SOS response in a clinical setting. PLoS Pathog. 8, 2 (2012).
18. Dapa, T., Fleurier, S., Bredeche, M.-F. & Matic, I.
The SOS and RpoS regulons contribute to bacterial cell robustness to genotoxic stress by synergistically regulating DNA polymerase Pol II. Genetics 206, 1349–1360 (2017).
19. Kreuzer, K. N. DNA damage responses in prokaryotes: Regulating gene expression, modulating growth patterns, and manipulating replication forks. Cold Spring Harbor Perspect. Biol. https://doi.org/10.1101/cshperspect.a012674 (2013).
20. Valencia, E. Y., Esposito, F., Spira, B., Blázquez, J. & Galhardo, R. S. Ciprofloxacin-mediated mutagenesis is suppressed by subinhibitory concentrations of amikacin in Pseudomonas aeruginosa. Antimicrob. Agents Chemother. https://doi.org/10.1128/AAC.02107-16 (2016).
21. Siqueira, V. L. D. et al. Structural changes and differentially expressed genes in Pseudomonas aeruginosa exposed to meropenem-ciprofloxacin combination. Antimicrob. Agents Chemother. 58, 3957–3967 (2014).
22. Cabot, G. et al. Evolution of Pseudomonas aeruginosa antimicrobial resistance and fitness under low and high mutation rates. Antimicrob. Agents Chemother. 60, 1767–1778 (2016).
23. Knezevic, P., Curcin, S., Aleksic, V., Petrusic, M. & Vlaski, L. Phage-antibiotic synergism: A possible approach to combatting Pseudomonas aeruginosa. Res. Microbiol. 164, 55–60 (2013).
24. Dörr, T., Lewis, K. & Vulić, M. SOS Response induces persistence to fluoroquinolones in Escherichia coli. PLoS Genet. 5, e1000760 (2009).
25. Recacha, E. et al. Quinolone resistance reversion by targeting the SOS response. MBio 8, 2 (2017).
26. Cirz, R. T., O’Neill, B. M., Hammond, J. A., Head, S. R. & Romesberg, F. E. Defining the Pseudomonas aeruginosa SOS response and its role in the global response to the antibiotic ciprofloxacin. J. Bacteriol. 188, 7101–7110 (2006).
27. Breidenstein, E. B. M., Bains, M. & Hancock, R. E. W. Involvement of the lon protease in the SOS response triggered by ciprofloxacin in Pseudomonas aeruginosa PAO1. Antimicrob. Agents Chemother. 56, 2879–2887 (2012).
28.
Shiba, T., Tsutsumi, K., Ishige, K. & Noguchi, T. Inorganic polyphosphate and polyphosphate kinase: Their novel biological functions and applications. Biochem. 65, 315–323 (2000).
29. Suh, S. J. et al. Effect of rpoS mutation on the stress response and expression of virulence factors in Pseudomonas aeruginosa. J. Bacteriol. 181, 3890–3897 (1999).
30. Weber, H. et al. Genome-wide analysis of the general stress response network in Escherichia coli: sigmaS-dependent genes, promoters, and sigma factor selectivity. J. Bacteriol. 187, 1591–1603 (2005).
31. Kayama, S. et al. The role of rpoS gene and quorum-sensing system in ofloxacin tolerance in Pseudomonas aeruginosa. FEMS Microbiol. Lett. 298, 184–192 (2009).
32. Hong, S. H., Wang, X., O’Connor, H. F., Benedik, M. J. & Wood, T. K. Bacterial persistence increases as environmental fitness decreases. Microb. Biotechnol. 5, 509–522 (2012).
33. Baharoglu, Z. & Mazel, D. SOS the formidable strategy of bacteria against aggressions. FEMS Microbiol. Rev. 38, 2 (2014).
34. Balasubramanian, D. et al. The regulatory repertoire of Pseudomonas aeruginosa AmpC β-lactamase regulator AmpR includes virulence genes. PLoS ONE 7, 2 (2012).
35. Nguyen, H. et al. Negative control of RpoS synthesis by the sRNA ReaL in Pseudomonas aeruginosa. Front. Microbiol. 9, 1–10 (2018).
36. Müller, A. U., Imkamp, F. & Weber-Ban, E. The mycobacterial LexA/RecA-independent DNA damage response is controlled by PafBC and the pup-proteasome system. Cell Rep. 23, 3551–3564 (2018).
37. Schulz, S. et al. Elucidation of sigma factor-associated networks in Pseudomonas aeruginosa reveals a modular architecture with limited and function-specific crosstalk. PLoS Pathog. 11, 1–21 (2015).
38. van Dam, S., Võsa, U., van der Graaf, A., Franke, L. & de Magalhães, J. P. Gene co-expression analysis for functional classification and gene-disease predictions. Brief. Bioinform. 19, 575–592 (2018).
39. Linde, J., Schulze, S., Henkel, S. G. & Guthke, R.
Data- and knowledge-based modeling of gene regulatory networks: An update. EXCLI J. 14, 346–378 (2015).
40. Liu, W. et al. Construction and analysis of gene co-expression networks in Escherichia coli. Cells 7, 19 (2018).
41. Khaledi, A. et al. Transcriptome profiling of antimicrobial resistance in Pseudomonas aeruginosa. Antimicrob. Agents Chemother. 60, 4722–4733 (2016).
42. Fang, G. et al. Transcriptomic and phylogenetic analysis of a bacterial cell cycle reveals strong associations between gene co-expression and evolution. BMC Genom. 14, 2 (2013).
43. Langfelder, P. & Horvath, S. WGCNA: An R package for weighted correlation network analysis. BMC Bioinform. 9, 2 (2008).
44. Lovelace, A. H., Smith, A. & Kvitko, B. H. Pattern-triggered immunity alters the transcriptional regulation of virulence-associated genes and induces the sulfur starvation response in Pseudomonas syringae pv. tomato DC3000. Mol. Plant-Microbe Interact. 31, 750–765 (2018).
45. Dai, H., Zhou, J. & Zhu, B. Gene co-expression network analysis identifies the hub genes associated with immune functions for nocturnal hemodialysis in patients with end-stage renal disease. Med. (United States) 97, 1–8 (2018).
46. Chan, K.-G. et al. Transcriptome analysis of Pseudomonas aeruginosa PAO1 grown at both body and elevated temperatures. PeerJ 4, e2223 (2016).
47. Anupama, R., Sajitha Lulu, S., Mukherjee, A. & Babu, S. Cross-regulatory network in Pseudomonas aeruginosa biofilm genes and TiO2 anatase induced molecular perturbations in key proteins unraveled by a systems biology approach. Gene 647, 289–296 (2018).
48. Molina-Mora, J. A., Campos-Sanchez, R. & Garcia, F. Gene Expression Dynamics Induced by Ciprofloxacin and Loss of Lexa Function in Pseudomonas aeruginosa PAO1 Using Data Mining and Network Analysis.
in 2018 IEEE International Work Conference on Bioinspired Intelligence (IWOBI) 1–7 (IEEE, 2018). doi: 10.1109/IWOBI.2018.8464130
49. Stojakovic, A., Mastronardi, C. A., Licinio, J. & Wong, M.-L. Long-term consumption of high-fat diet impairs motor coordination without affecting the general motor activity. J. Transl. Sci. 5, 1–10 (2018).
50. Bjursell, M. et al. Ageing Fxr deficient mice develop increased energy expenditure, improved glucose control and liver damage resembling NASH. PLoS ONE 8, 2 (2013).
51. Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114–2120 (2014).
52. Andrews, S. FastQC a quality control tool for high throughput sequence data. (2010). Available at: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/. (Accessed: 10th April 2018)
53. Wingett, S. W. & Andrews, S. FastQ Screen: A tool for multi-genome mapping and quality control. F1000Research 7, 1338 (2018).
54. Ewels, P., Magnusson, M., Lundin, S. & Käller, M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics 32, 3047–3048 (2016).
55. Magoc, T., Wood, D. & Salzberg, S. L. EDGE-pro: estimated degree of gene expression in prokaryotic genomes. Evol. Bioinform. Online 9, 127–136 (2013).
56. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
57. Okonechnikov, K., Conesa, A. & García-Alcalde, F. Qualimap 2: Advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics 32, 292–294 (2016).
58. Wang, L., Wang, S. & Li, W. RSeQC: Quality control of RNA-seq experiments. Bioinformatics 28, 2184–2185 (2012).
59. Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).
60. R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. (2020).
61.
Winsor, G. L. et al. Enhanced annotations and features for comparing thousands of Pseudomonas genomes in the Pseudomonas genome database. Nucleic Acids Res. 44, D646–D653 (2016).
62. Szklarczyk, D. et al. STRING v11: Protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. 47, D607–D613 (2019).
63. Shannon, P. et al. Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498–2504 (2003).
64. Mine, A. et al. The defense phytohormone signaling network enables rapid, high-amplitude transcriptional reprogramming during effector-triggered immunity. Plant Cell 30, 1199–1219 (2018).
65. Wang, X. et al. Weighted gene co-expression network analysis for identifying hub genes in association with prognosis in Wilms tumor. Mol. Med. Rep. 19, 2041–2050 (2019).
66. Cao, L. et al. Identification of hub genes and potential molecular mechanisms in gastric cancer by integrated bioinformatics analysis. PeerJ 6, e5180 (2018).
67. Chin, C.-H. et al. cytoHubba: identifying hub objects and sub-networks from complex interactome. BMC Syst. Biol. 8, S11 (2014).
68. Teitzel, G. M. M. et al. Survival and growth in the presence of elevated copper: Transcriptional profiling of copper-stressed Pseudomonas aeruginosa. J. Bacteriol. 188, 7242–7256 (2006).
69. Cummins, J., Reen, F. J., Baysse, C., Mooij, M. J. & O’Gara, F. Subinhibitory concentrations of the cationic antimicrobial peptide colistin induce the pseudomonas quinolone signal in Pseudomonas aeruginosa. Microbiology 155, 2826–2837 (2009).
70. Kai, T. et al. A low concentration of azithromycin inhibits the mRNA expression of N-acyl homoserine lactone synthesis enzymes, upstream of lasI or rhlI, Pseudomonas aeruginosa. Pulm. Pharmacol. Ther. 22, 483–486 (2009).
71. Chang, W., Small, D. A., Toghrol, F. & Bentley, W. E.
Microarray analysis of Pseudomonas aeruginosa reveals induction of pyocin genes in response to hydrogen peroxide. BMC Genom. 6, 1–14 (2005).
72. Ceyssens, P.-J. Isolation and characterization of lytic bacteriophages infecting Pseudomonas aeruginosa (Katholieke Universiteit Leuven, Flanders, 2009).
73. Schwab, K. J., De Leon, R. & Sobsey, M. D. Concentration and purification of beef extract mock eluates from water samples for the detection of enteroviruses, hepatitis A virus, and Norwalk virus by reverse transcription-PCR. Appl. Environ. Microbiol. 61, 531–537 (1995).
74. Paterson, W. D., Douglas, R. J., Grinyer, I. & McDermott, L. A. Isolation and preliminary characterization of some Aeromonas salmonicida bacteriophages. J. Fish. Res. Board Canada 26, 629–632 (1969).
75. Schuster, M., Hawkins, A. C., Harwood, C. S. & Greenberg, E. P. The Pseudomonas aeruginosa RpoS regulon and its relationship to quorum sensing. Mol. Microbiol. 51, 973–985 (2004).
76. Kumar, S. S., Penesyan, A., Elbourne, L. D. H., Gillings, M. R. & Paulsen, I. T. Catabolism of Nucleic acids by a cystic fibrosis Pseudomonas aeruginosa isolate: An adaptive pathway to cystic fibrosis sputum environment. Front. Microbiol. 10, 1–14 (2019).
77. Fernández, M., Corral-Lugo, A. & Krell, T. The plant compound rosmarinic acid induces a broad quorum sensing response in Pseudomonas aeruginosa PAO1. Environ. Microbiol. 20, 4230–4244 (2018).
78. Salmon-Divon, M., Zahavi, T. & Kornspan, D. Transcriptomic analysis of the Brucella melitensis Rev.1 vaccine strain in an acidic environment: Insights into virulence attenuation. Front. Microbiol. 10, 1–12 (2019).
79. Thode, S. K. et al. Construction of a fur null mutant and RNA-sequencing provide deeper global understanding of the Aliivibrio salmonicida Fur regulon. PeerJ 2017, 2 (2017).
80. Mets, T. et al. Fragmentation of Escherichia coli mRNA by MazF and MqsR. Biochimie 156, 79–91 (2019).
81. Cabezas, C. E. et al.
The transcription factor SlyA from Salmonella Typhimurium regulates genes in response to hydrogen peroxide and sodium hypochlorite. Res. Microbiol. 169, 263–278 (2018).
82. Fornelos, N., Browning, D. F. & Butala, M. The use and abuse of LexA by mobile genetic elements. Trends Microbiol. 24, 391–401 (2016).
83. Stockwell, V. O. & Loper, J. E. The sigma factor RpoS is required for stress tolerance and environmental fitness of Pseudomonas fluorescens Pf-5. Microbiology 151, 3001–3009 (2005).
84. Goerke, C., Koller, J. & Wolz, C. Ciprofloxacin and trimethoprim cause phage induction and virulence modulation in Staphylococcus aureus. Antimicrob. Agents Chemother. 50, 171–177 (2006).
85. Friedman, N., Vardi, S., Ronen, M., Alon, U. & Stavans, J. Precise temporal modulation in the response of the SOS DNA repair network in individual bacteria. PLoS Biol. 3, e238 (2005).
86. Ronen, M., Rosenberg, R., Shraiman, B. I. & Alon, U. Assigning numbers to the arrows: Parameterizing a gene regulation network by using accurate expression kinetics. Proc. Natl. Acad. Sci. USA. 99, 10555–10560 (2002).
87. Babin, B. M. et al. Selective proteomic analysis of antibiotic-tolerant cellular subpopulations in Pseudomonas aeruginosa biofilms. MBio 8, 2 (2017).
88. Kamal, F. & Dennis, J. J. Burkholderia cepacia complex phage-antibiotic synergy (PAS): Antibiotics stimulate lytic phage activity. Appl. Environ. Microbiol. 81, 1132–1138 (2015).
89. Burmeister, A. R. et al. Pleiotropy complicates a trade-off between phage resistance and antibiotic resistance. Proc. Natl. Acad. Sci. USA. https://doi.org/10.1073/pnas.1919888117 (2020).
90. Ryan, E. M., Alkawareek, M. Y., Donnelly, R. F. & Gilmore, B. F. Synergistic phage-antibiotic combinations for the control of Escherichia coli biofilms in vitro. FEMS Immunol. Med. Microbiol. 65, 395–398 (2012).
91.
Golais, F., Hollý, J. & Vítkovská, J. Coevolution of bacteria and their viruses. Folia Microbiol. (Praha) 58, 177–186 (2013).
92. Fothergill, J. L. et al. Effect of antibiotic treatment on bacteriophage production by a cystic fibrosis epidemic strain of Pseudomonas aeruginosa. Antimicrob. Agents Chemother. 55, 426–428 (2011).
93. Chatterjee, M. et al. Antibiotic resistance in Pseudomonas aeruginosa and alternative therapeutic options. Int. J. Med. Microbiol. 306, 48–58 (2016).
94. Rozanov, D. V., D’Ari, R. & Sineoky, S. P. RecA-independent pathways of lambdoid prophage induction in Escherichia coli. J. Bacteriol. 180, 6306–6315 (1998).
95. Xie, X. T. Characterization of the fecal virome and fecal virus shedding patterns of commercial mink (Neovison vison) (University of Guelph, Guelph, 2017).
96. Tsao, Y. F. et al. Phage morons play an important role in Pseudomonas aeruginosa phenotypes. J. Bacteriol. 200, 1–15 (2018).
97. Winstanley, C. et al. Newly introduced genomic prophage islands are critical determinants of in vivo competitiveness in the Liverpool epidemic strain of Pseudomonas aeruginosa. Genome Res. 19, 12–23 (2008).
98. Blasdel, B. G., Chevallereau, A., Monot, M., Lavigne, R. & Debarbieux, L. Comparative transcriptomics analyses reveal the conservation of an ancestral infectious strategy in two bacteriophage genera. ISME J. 11, 1988–1996 (2017).
99. Chevallereau, A. et al. Next-generation-omics approaches reveal a massive alteration of host RNA metabolism during bacteriophage infection of Pseudomonas aeruginosa. PLoS Genet. 12, 1–20 (2016).
100. De Smet, J. et al. High coverage metabolomics analysis reveals phage-specific alterations to Pseudomonas aeruginosa physiology during infection. ISME J. 10, 1823–1835 (2016).
101. Podkovyrov, S. & Larson, T. J. Lipid biosynthetic genes and a ribosomal protein gene are cotranscribed. FEBS Lett. 368, 429–431 (1995).
102. Liu, S. et al.
Identification of novel genes including rpmF and yjjQ critical for Type II 1 persister formation in Escherichia coli. bioRxiv https://doi.org/10.1101/310961 (2018).
103. Cornforth, D. M. et al. Pseudomonas aeruginosa transcriptome during human infection. Proc. Natl. Acad. Sci. U. S. A. 115, 2 (2018).
104. Quintana, J., Novoa-Aponte, L. & Argüello, J. M. Copper homeostasis networks in the bacterium Pseudomonas aeruginosa. J. Biol. Chem. 292, 15691–15704 (2017).
105. Zheng, X., Su, Y., Chen, Y., Huang, H. & Shen, Q. Global transcriptional responses of denitrifying bacteria to functionalized single-walled carbon nanotubes revealed by weighted gene-coexpression network analysis. Sci. Total Environ. 613–614, 1240–1249 (2018).
106. Shin, H., Jeon, J., Lee, J.-H., Jin, S. & Ha, U.-H. Pseudomonas aeruginosa GroEL stimulates production of PTX3 by activating the NF-κB pathway and simultaneously downregulating MicroRNA-9. Infect. Immun. 85, 2 (2017).
107. Ito, F., Tamiya, T., Ohtsu, I., Fujimura, M. & Fukumori, F. Genetic and phenotypic characterization of the heat shock response in Pseudomonas putida. Microbiologyopen 3, 922–936 (2014).
108. Michta, E. et al. Proteomic approach to reveal the regulatory function of aconitase AcnA in oxidative stress response in the antibiotic producer Streptomyces viridochromogenes Tü494. PLoS ONE 9, 1 (2014).
109. Wang, Y., Kern, S. E. & Newman, D. K. Endogenous phenazine antibiotics promote anaerobic survival of Pseudomonas aeruginosa via extracellular electron transfer. J. Bacteriol. 192, 365–369 (2010).
110. Higgins, S. et al. Differential regulation of the phenazine biosynthetic operons by quorum sensing in Pseudomonas aeruginosa PAO1-N. Front. Cell. Infect. Microbiol. 8, 2 (2018).

Acknowledgements
We thank the students John Rodríguez Fernández and Daniel Ulate Rodríguez for their collaboration in multiple activities of the project.
We also thank the Genome Technology Center of New York University for generating all sequencing data used in this work, all members of Centro de Investigación en Enfermedades Tropicales (Universidad de Costa Rica, Costa Rica) for their support in the experimental assays, and the PGx group of The Human Phenome Institute (Fudan University, Shanghai, China) for their support in the bioinformatic analysis. Finally, we thank Rachel O’Dea for the critical reading of this work. This work was funded by project “B8114 Definición de la red transcriptómica y de las alteraciones genómicas inducidas por ciprofloxacina en Pseudomonas aeruginosa AG1”, Vicerrectoría de Investigación, Universidad de Costa Rica (period 2017–2019).

Author contributions
J.M.M. and F.G. participated in the conception and design of the study and in data selection. D.C.M., M.C.A. and A.U.M. ran the experimental assays. J.M.M. implemented all bioinformatic analyses. J.M.M., R.C.S., R.M.R. and L.S. were involved in the interpretation of the bioinformatic analyses. J.M.M. and F.G. participated in the interpretation of the data in the biological context. J.M.M. drafted the manuscript and all authors were involved in its revision. All authors read and approved the final manuscript.

Competing interests
The authors declare no competing interests.

Additional information
Supplementary information is available for this paper at https://doi.org/10.1038/s41598-020-70581-2.
Correspondence and requests for materials should be addressed to J.A.M.-M.
Reprints and permissions information is available at www.nature.com/reprints.

Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2020

GENERAL DISCUSSION AND CONCLUSIONS

Antibiotic resistance is a major threat to public health because of its continuous emergence, worldwide spread, and increasing prevalence (Hong et al., 2015). Unlike highly host-adapted pathogens and symbionts, which undergo genome reduction, P. aeruginosa is a versatile environmental organism that continually expands its genomic repertoire (Mathee et al., 2008). With a high-risk ST-111 profile, PaeAG1 is a critical organism given its resistance to multiple antibiotics, including carbapenems (World Health Organization, 2017). In this context, a comprehensive multi-omics approach was implemented to study the molecular determinants of antibiotic tolerance in this strain.

The PaeAG1 genome assembly was a first and important step toward understanding the genomic architecture of an ST-111 high-risk strain. A de novo approach was preferred since PaeAG1 has around 1.0 Mb of additional DNA sequence when compared to the reference genome.
These exclusive regions are composed of 57 genomic islands harboring two MBL-carrying integrons, prophages, and many other genes. The annotation also revealed the genomic content and the molecular determinants related to phenotypes, which for PaeAG1 are mainly multi-resistance and virulence. As shown here, advances in sequencing technology play a decisive role in infection investigation and in tracking the evolution of international lineages of high-risk bacterial clones in clinical contexts over long periods and in great detail (Dößelmann et al., 2017). However, genome assembly is not trivial: it is challenged by sequencing technology, genomic features, and the bioinformatics algorithms themselves, making it a real and open problem. An exhaustive comparison and assessment of different assembly strategies gives a better way to approach the real genome sequence. Benchmarking using the 3C criterion is a consensus approach that includes different levels and aims of comparison for the robust selection of a final assembly. A hybrid assembly was the best approach to achieve a single circular sequence with high-quality 3C for the genome of a high-risk P. aeruginosa strain. Thus, the best features of short- and long-read sequencing technologies are included and their drawbacks are compensated.

Second, since PaeAG1 is a high-risk and critical organism due to its resistance to carbapenems, we performed a comparative genomic analysis to describe the genomic context associated with the MBL-carrying integrons. We analyzed 211 complete genome sequences using a pan-genome analysis, separating strains by MLST profile. Then, the analysis of the 57 PaeAG1 genomic islands showed a varying presence/absence pattern among all the strains, in particular for the genomes closest to PaeAG1. Two selected genomic island clusters, GICVIM-2 and GICIMP-18, were studied in depth.
The GICVIM-2 sequence was found in full in two other known ST-111 strains, which contained the VIM-2-carrying integron as a previously described In59-like element. GICIMP-18 was partially found in another genome, but the IMP-18-carrying integron has an architecture never reported before and was therefore considered a novel integron, In1666. Using comparative genomics, we provided new insights into the genomic determinants associated with this high-risk P. aeruginosa clone and its resistance to carbapenems.

Third, proteomic profiles of PaeAG1 after exposure to antibiotics showed that the effects of ciprofloxacin are similar to the control without antibiotics, in contrast with the results for other antibiotics and with the growth curves. In a subsequent analysis, we used a machine learning approach to study the core perturbome, that is, the central response to multiple perturbations in the P. aeruginosa group, and to identify gene expression patterns. Using public microarray data, two independent partition strategies (single and multiple, with the SP and MP methods respectively), and three classification algorithms, we identified 46 perturbome elements. Both network analysis and functional annotation of these genes showed coordinated modulation of biological processes in response to multiple perturbations (including metabolism, biosynthesis and molecule binding, associated with DNA damage repair and aerobic respiration), all related to tolerance to stressors, growth arrest, and molecular regulation.

In the last step, the particular gene expression response to CIP in PaeAG1 was studied using RNA-Seq. Growth curves showed a concentration-dependent reduction of the PaeAG1 growth rate with increasing sub-inhibitory CIP concentrations. The RNA-Seq analysis of PaeAG1 after treatment with a sub-inhibitory CIP concentration allowed us to identify 518 DEGs over time, at 2.5 and 5 h.
Using a top-down systems biology approach, we identified diverse transcriptomic determinants: 14 hub genes, multiple gene clusters, and 15 enriched pathways. These included down-regulation of pathways related to metabolism, ribosomal activity, and adherence factors, most of them related to bacterial growth reduction. Phages, phenazines, and specific virulence factors were found to be up-regulated. In most cases, hub genes and complex relationships were identified, showing pleiotropic effects that are mainly illustrated by clusters of highly connected genes. Two particular clusters of phage genes were up-regulated by CIP. The effect of CIP on phage induction was validated at the phenomic level with a phage plaque assay, which showed an exponential induction as the CIP concentration increased. To our knowledge, this is the first report of the analysis of the CIP response in an ST-111 high-risk P. aeruginosa strain, in particular by a combined strategy using a top-down systems biology approach. This led us to identify transcriptomic determinants in response to CIP, including resident phage induction as a potential therapeutic strategy to overcome antibiotic resistance.

Together, these genomic and transcriptomic elements are molecular determinants of antibiotic tolerance and resistance in PaeAG1. This is particularly relevant for critical clones with the ability to conquer nosocomial environments and to develop a multi-resistance profile. As has been suggested, the biological markers of high-risk clones could be useful for the future design of specific treatments and infection control strategies (Mulet et al., 2013).
Thus, in order to study the implications of these genomic and transcriptomic determinants in PaeAG1, more detailed analyses are needed, including: different levels of molecular regulation, other expression analyses (including the proteomic level), other stress conditions to define the perturbome, genetic and phenotypic variability, validation of the effect and power of hub genes, modeling of molecular circuits, exploration of the relationship between the presence of specific virulence traits and severity, and phage induction as a potential therapy to overcome resistance.

Finally, as shown here, the study of the molecular determinants in PaeAG1 was possible thanks to the integration of sequencing data, phenotypes, and bioinformatics pipelines. Given the complexity of the data and the dependence of results on the algorithms used, benchmarking strategies were required to analyze the data and to select the best protocols according to different criteria. Although we studied a bacterial genome (small in comparison to eukaryotic models), high-performance computational infrastructure was necessary, mainly for the comparative genomic and transcriptomic analyses. In addition, isolation and antibiotic resistance profiling, genome and RNA sequencing, as well as proteomic and other phenomic assays have been implemented over the last 10 years to study this bacterial model, implying a cost that can be estimated at more than $30,000, considering only sequencing and experimental assays. All these considerations remind us that these types of projects demand high-performance computational infrastructure, best bioinformatics practices, and investment in scientific research in general.
SUPPLEMENTARY MATERIAL

Two-dimensional gel electrophoresis image analysis of two Pseudomonas aeruginosa clones

Jose Arturo Molina-Mora¹,², Diana Chinchilla-Montero³, Carolina Castro-Peña¹,², Fernando García¹,²

¹ Centro de Investigación en Enfermedades Tropicales, Universidad de Costa Rica, San José, Costa Rica
² Facultad de Microbiología, Universidad de Costa Rica, San José, Costa Rica
³ Instituto Costarricense de Investigación y Enseñanza en Nutrición y Salud (INCIENSA), Tres Ríos, Costa Rica
jose.molinamora@ucr.ac.cr

Abstract. A classical strategy to analyze the protein content of a biological sample is two-dimensional gel electrophoresis (2D-GE). This technique separates proteins by both isoelectric point and molecular weight, and images are taken for subsequent analyses. However, 2D-GE images require standardized image analysis because gels are susceptible to deformation and images contain overlapping spots and stripes, fuzzy and unstained spots, and other artifacts. This represents a difficulty for end users (researchers), who demand free and user-friendly solutions. We have previously reported the standardization of a protocol to analyze 2D-GE images, and in the current study we applied it to two new bacterial isolates, Pseudomonas aeruginosa C25 and C50. We first extracted periplasmic proteins after exposure to antibiotics, and we then ran a 2D-GE analysis. Images were analyzed using our standardized protocol, achieving the identification of protein spots using CellProfiler after a pre-processing step. Strains were compared using differential spot analysis, which revealed a specific pattern in protein expression between the bacteria. These results will help to study the biological meaning of these strains using proteomic profiling under different conditions.

Keywords: 2D-GE, Image analysis, CellProfiler, P. aeruginosa C25, P. aeruginosa C50.
1 Introduction

The study of the protein content of biological systems is the main subject of proteomics. This includes not only identifying the particular proteins whose expression can explain a biological context, but also comparing conditions to recognize differential proteomic patterns [1].

A classical strategy to analyze the proteomic profile of a sample is two-dimensional gel electrophoresis (2D-GE) [2]. This technique separates proteins in a layer of polyacrylamide gel by both isoelectric point (pI, the pH at which a molecule is electrically neutral) and molecular weight [3], creating spots that are then stained.

2D-GE image analysis is not straightforward: it requires standardized protocols [3] because gels are susceptible to deformation and images contain overlapping spots and stripes, fuzzy and unstained spots, and other artifacts [1], [4]. This represents a difficulty for end users (such as microbiologists, biologists, and researchers in general), who demand user-friendly solutions. However, user-friendly software is usually available only as expensive commercial packages, and free options regularly require command-line work, which is a drawback for researchers.

In this scenario, we have previously reported the standardization of a protocol to analyze 2D-GE images using the Costa Rican bacterium Pseudomonas aeruginosa AG1 as a model [5]. In this work, we applied our protocol to two new isolates, P. aeruginosa C25 and C50, two clones obtained from the former strain when exposed to high concentrations of the antibiotic ciprofloxacin. P. aeruginosa is an opportunistic bacterium able to infect immunocompromised hosts and is frequently associated with antibiotic multiresistance [6]. The three Costa Rican isolates have a multiresistance profile. They are categorized as high-risk clones because they derive from a strain causing infections in hospitals.
Thus, the goal of this study was to implement and assess an image analysis protocol, based on our previously reported protocol, to identify protein spots in 2D-GE gel images from the two P. aeruginosa strains C25 and C50.

To achieve this, we first extracted periplasmic proteins of P. aeruginosa C25 and C50 after exposure to antibiotics, and we then ran a 2D-GE analysis. Images were analyzed using our standardized protocol, identifying spots with CellProfiler. Then, conditions were compared using differential spot analysis.

2 Methods

For the extraction of periplasmic proteins of P. aeruginosa C25 and C50, we followed the protocol by [5], [7]. Briefly, cells were cultured until the exponential phase in LB medium. The 2D-GE was performed using strips for separation by isoelectric point (GE HealthCare Immobiline Dry Strip Gels™), and an SDS-GE gradient was used for the molecular weight separation. Images were taken using a ChemiDoc™ photo viewer (BioRad®).

The processing step included an image alignment using the bUnwarpJ package in the ImageJ program [8]. In this program, five spots were used as references for the deformation of images and to achieve the alignment.

Identification of spots was done using our previously reported protocol [5]. Briefly, CellProfiler (https://CellProfiler.org/) was used to analyze images with the following steps: image inversion, primary object recognition and segmentation, manual editing, intensity measurement, and visualization of objects.

To compare 2D-GE images, a differential spot analysis was implemented. Pairs of images were compared to identify shared spots using an analysis of primary objects (segmentation) of overlapping spots, identification of exclusive spots in each image using the non-overlapping regions, and the subsequent representation of spot borders, separating shared spots (red circles) from exclusive spots (green or blue circles).

Fig. 1. Example of two-dimensional gel electrophoresis (2D-GE) of P.
aeruginosa C25 (A) and C50 (B). Assays were performed after cells were grown in LB medium.

3 Results and discussion

Proteomics is considered an essential field for the systematic analysis of biological systems: it assesses changes in the abundance of proteins that occur in living organisms and that can be studied at various levels [4].

Two-dimensional gel electrophoresis (2D-GE) is a classical technique used to analyze the protein content of biological samples [1]. Here we first performed a 2D-GE assay for the bacterial clones P. aeruginosa C25 and C50, as shown in Fig. 1-A-B.

However, 2D-GE image analysis requires specific protocols due to image complexity [3]. For this reason, we previously established a standardized protocol to identify protein spots using CellProfiler and other image analysis tools [5]. For the pre-processing step, the bUnwarpJ package in the ImageJ program was used to align images. In this pipeline, five points shared by the target image (to be modified) and a reference image are selected as landmarks for the alignment, creating a deformation field and grid (Fig. 2-A-B).

Fig. 2. Analysis of 2D-GE images. Examples of deformation field (A) and deformation grid (B) to align images against a reference in the pre-processing step. (C) Example of a raw image used for the identification of spots using the CellProfiler pipeline, with the result shown in (D).

As shown in Fig. 2-C-D, identification of spots was achieved using the CellProfiler software. Different metrics were used to optimize the segmentation algorithm, as previously described [5]. Although automatic spot recognition is sensitive to complex regions, manual editing helped to solve these drawbacks. Commercial solutions have similar tools to deal with these particular features, which are common in 2D-GE image analyses [3].
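The spot identification and differential spot steps described in the Methods can be illustrated with a minimal sketch. It is not the CellProfiler pipeline itself: a fixed threshold and a simple flood fill stand in for the primary object segmentation, the two toy images are assumed to be already aligned, and all names (`label_spots`, `split_shared_exclusive`, `GEL_A`, `GEL_B`, the threshold of 128) are hypothetical choices made for illustration only.

```python
from collections import deque


def label_spots(gel, threshold=128):
    """Return {spot_id: set of (row, col)} for dark spots on a light gel image."""
    # Step 1: invert the image so that dark protein spots become bright objects.
    inverted = [[255 - px for px in row] for row in gel]
    n_rows, n_cols = len(inverted), len(inverted[0])
    seen, spots = set(), {}
    for r in range(n_rows):
        for c in range(n_cols):
            if inverted[r][c] <= threshold or (r, c) in seen:
                continue
            # Step 2: a flood fill (4-connectivity) gathers one connected spot.
            pixels, queue = set(), deque([(r, c)])
            seen.add((r, c))
            while queue:
                y, x = queue.popleft()
                pixels.add((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    inside = 0 <= ny < n_rows and 0 <= nx < n_cols
                    if inside and (ny, nx) not in seen and inverted[ny][nx] > threshold:
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            spots[len(spots) + 1] = pixels
    return spots


def split_shared_exclusive(spots, other_spots):
    """Classify spot ids as shared (overlapping a spot of the other gel) or exclusive."""
    other_pixels = set().union(*other_spots.values())
    shared = {i for i, px in spots.items() if px & other_pixels}
    return shared, set(spots) - shared


# Toy 4 x 6 grayscale images (0 = black, 255 = white), assumed already aligned.
W = 255
GEL_A = [
    [W, W, W, W, W, W],
    [W, 20, 20, W, W, 30],
    [W, 20, 20, W, W, 30],
    [W, W, W, W, W, W],
]
GEL_B = [
    [W, W, W, W, W, W],
    [W, W, 25, W, W, W],
    [W, W, 25, W, W, W],
    [40, W, W, W, W, W],
]

SPOTS_A = label_spots(GEL_A)
SPOTS_B = label_spots(GEL_B)
SHARED_A, EXCLUSIVE_A = split_shared_exclusive(SPOTS_A, SPOTS_B)
SHARED_B, EXCLUSIVE_B = split_shared_exclusive(SPOTS_B, SPOTS_A)
```

In this toy case each image has two spots, one that overlaps a spot in the other image (shared) and one that does not (exclusive). Real gels would additionally need the manual editing and intensity measurement steps, which are outside the scope of this sketch.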
With a modified protocol, the pipeline was also able to recognize shared and exclusive spots when the proteomic profiles of the two strains were compared. For this, a new consensus image was built using image (pixel) operations, making it possible to identify common spots, which were detected in the same way as before but using the new image. After subtraction of the shared spots, exclusive spots were marked and a final visualization was produced on the initial images, as shown in Fig. 3.

Fig. 3. Example of the differential spot identification with 2D-GE images from the two P. aeruginosa strains C25 (A) and C50 (B). Shared spots are marked with red circles, and exclusive spots with blue or green circles.

CellProfiler is a well-known tool for cell imaging, for example for microscopy images. However, as we have demonstrated before [5] and here, its algorithms can also be used to recognize spots in 2D-GE images. See our previous work for details of the implementation, the pipeline, and the comparison of samples [5].

In summary, in this work we presented a new analysis of 2D-GE images using a standardized protocol to identify spots and compare conditions by proteomic profile. This was done using two P. aeruginosa clones, in which it was possible to identify both shared and exclusive spots. Although this work is focused on the image analysis, these results will help us to apply this protocol to study P. aeruginosa strains under different experimental conditions, including antibiotics or other stressors and their effect on the proteomic profile of the bacteria.

References

[1] M. M. Goez, M. C. Torres-Madroñero, S. Röthlisberger, and E. Delgado-Trejos, “Preprocessing of 2-Dimensional Gel Electrophoresis Images Applied to Proteomic Analysis: A Review,” Genomics Proteomics Bioinformatics, vol. 16, no. 1, pp. 63–72, 2018.
[2] P. H.
O’Farrell, “High resolution two-dimensional electrophoresis of proteins.,” J. Biol. Chem., vol. 250, no. 10, pp. 4007–21, May 1975. [3] M. Natale, B. Maresca, P. Abrescia, and E. M. Bucci, “Image analysis workflow for 2- D electrophoresis gels based on imageJ,” Proteomics Insights, vol. 4, pp. 37–49, 2011. [4] T. S. Silva, N. Richard, J. P. Dias, and P. M. Rodrigues, “Data visualization and feature selection methods in gel-based proteomics.,” Curr. Protein Pept. Sci., vol. 15, no. 1, pp. 4–22, Feb. 2014. [5] J. A. Molina-Mora, D. Chinchilla-Montero, C. Castro-Peña, and F. Garcia, “Two- dimensional gel electrophoresis (2D-GE) image analysis based on CellProfiler,” Medicine., vol. IN-PRESS, 2020. [6] R. T. Cirz, B. M. O’Neill, J. A. Hammond, S. R. Head, and F. E. Romesberg, “Defining the Pseudomonas aeruginosa SOS response and its role in the global response to the antibiotic ciprofloxacin,” J. Bacteriol., vol. 188, no. 20, pp. 7101–7110, Oct. 2006. [7] G. F. Ames, C. Prody, and S. Kustu, “Simple, rapid, and quantitative release of periplasmic proteins by chloroform.,” J. Bacteriol., vol. 160, no. 3, pp. 1181–3, Dec. 1984. [8] I. Arganda-Carreras, C. O. S. Sorzano, J. Kybic, and C. Ortiz-de-solorzano, “bUnwarpJ : Consistent and Elastic Registration in ImageJ . Methods and Applications .,” Image (Rochester, N.Y.), 2006. Gene expression dynamics induced by ciprofloxacin and loss of LexA function in Pseudomonas aeruginosa PAO1 using data mining and network analysis Molina-Mora J.A. Campos-Sánchez R. Research Center on Tropical Diseases (CIET) Research Center in Cellular and Molecular Biology Faculty of Microbiology, University of Costa Rica (UCR) (CIBCM) San José, Costa Rica Faculty of Microbiology, University of Costa Rica (UCR) jose.molinamora@ucr.ac.cr San José, Costa Rica García F. 
Research Center on Tropical Diseases (CIET), Faculty of Microbiology, University of Costa Rica (UCR), San José, Costa Rica

Abstract: Pseudomonas aeruginosa is an opportunistic pathogen that causes a variety of infections in humans and frequently develops mechanisms of resistance to antibiotics, which makes its treatment difficult. In this study we applied gene expression analysis using data mining techniques and network analysis to evaluate the temporal effects of exposure to ciprofloxacin and the changes caused by the loss of function of LexA, a regulator of the SOS response to cellular stress. Initially, global differential expression profiles using clustering algorithms suggested that the effects of antibiotic exposure were determined primarily by time and not by loss of LexA function. This was verified by attribute selection and differential expression analysis among conditions, where differences between strains did not exceed 3.3% of genes, whereas up to 21% of genes differed over time. Network analysis likewise showed a significant increase in topological metrics when temporal changes were evaluated. Functional annotation showed metabolic pathways enriched over time but not when comparing strains. Overall, the results revealed that the response to ciprofloxacin tends to be exacerbated over time and that it remains stable in the face of the loss of LexA function.

Keywords: P. aeruginosa; Data mining; Network analysis; Differential expression; Ciprofloxacin.

I. INTRODUCTION

Pseudomonas aeruginosa is a Gram-negative bacterium, metabolically and genomically versatile, found in natural environments, but it also causes infections in animals and plants [1]. In humans, it is an important opportunistic pathogen, being the third most common cause of nosocomial infections [2]. Many P. aeruginosa infections can be controlled with antibiotics, but are difficult to eradicate [3], due in part to the ability of this pathogen to undergo progressive modifications that facilitate infection and persistence, among which its ability to adapt to environmental stress stands out [4], [5].

As in other bacterial groups, cellular stress induces changes in DNA architecture, either by direct damage to DNA or indirectly in the replication process as a result of stress, which culminates in the exposure of single-stranded DNA (ssDNA), the start signal of the SOS response [6]. In this process, the protein RecA binds to ssDNA, mediating recombinational repair, but it also binds to LexA, the SOS repressor, and induces its autocleavage. The loss of repression by LexA results in the induction of proteins that mediate the SOS response for DNA repair and regulate damage tolerance mechanisms [7].

Ciprofloxacin, an antibiotic of the fluoroquinolone family classically used for the treatment of P. aeruginosa infections, is an inducer of the SOS response in this bacterium. The antibiotic alters the activity of the bacterial enzymes DNA gyrase and topoisomerase IV, so it affects the correct replication of DNA, its recombination, repair and transcription [8], [9]. This condition activates the SOS response; however, in P. aeruginosa the SOS response has been characterized as being mediated by 15 genes, which is far fewer than reported for other bacterial groups [5]. In addition to the SOS response, P. aeruginosa generates a LexA-independent response after exposure to ciprofloxacin [2], [5].

The biological aim of the present study is to describe the dynamics of the global differential expression response to perturbation with ciprofloxacin and the effects of loss of LexA function in P. aeruginosa. For this, curated data of the reference strain P. aeruginosa PAO1 were used. In addition, data of a PAO1 mutant, produced by mutagenesis with loss of function of LexA, were available. Both strains were exposed to ciprofloxacin, and global differential expression profiles were obtained at 0, 30 and 120 minutes post exposure using microarray technology.

Due to the type of high-throughput technology used, the amount of data (5900 genes per replicate, for 12 samples) and the complexity and diversity of the data available, an analysis based on data mining was required for classification, clustering and selection of genes. Additionally, an analysis for the creation, interpretation and evaluation of networks with a large-scale systems biology approach was implemented.

II. MATERIALS AND METHODS

A. Data source

Data from 12 gene expression microarrays (GPL84 Affymetrix P. aeruginosa Array, with 5900 genes per sample) were available in the NCBI database, accession number GSE5443 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE5443). The data included duplicates of each sample at 3 time points (0, 30 and 120 minutes) for two strains, PAO1 (wild type, called WT strain) and an isogenic mutant strain with impaired LexA activity (S125A in the lexA gene, called mlexA strain), which were exposed to ciprofloxacin.

B. Normalization and evaluation of quality

The temporal data from microarray readouts were analyzed with two quality protocols. First, an RMA algorithm (Robust Multi-array Average, molmine.com/magma/loading/rma.htm) was applied to create an expression matrix from the raw data. In short, the unprocessed intensity values were corrected for background noise, transformed logarithmically (log2) and normalized by quantiles. Next, a linear model was fitted to the normalized data to obtain an adjusted measure of expression for each set of probes.

Second, in order to carry out a global study of the differential expression profiles, a hierarchical clustering (HC) algorithm and a PCA (Principal Components Analysis) were implemented. In both cases, data were loaded into MATLAB and predefined functions were called directly (pca() and clustering()). For visualization, PCA components were exported as a table and the three components with the greatest variation were plotted in 3 dimensions using the Plot3D function of the Wolfram Mathematica software. In the case of HC, a 95% confidence level and the Euclidean metric were used.

C. Expression level comparisons with subsets of genes

In order to evaluate expression profile variations among strains and times, we selected defined sets of genes using three different methodologies.

First, for a strict quantitative comparison based on housekeeping genes, as normally done in classic studies of gene expression, we compared the relative expression values of the housekeeping genes proC and rpoD. Both genes have been previously reported as the most stable for P. aeruginosa under perturbations [10]. This was done to verify that there was no significant difference in these genes under the effects of ciprofloxacin, mutation and time. The differential expression protocol is detailed in the next section.

Second, an algorithm for ranking and selection of attributes was applied, using the classes (experimental conditions) to verify that it was possible to separate the conditions by selecting groups of genes. For this, we applied a classification with the Support Vector Machine (SVM) algorithm [11], creating a ranked list of all genes (tolerance=1.0e-10, complexity=1). Using quartiles as the parameter for selecting attributes, we eliminated the last 25% of the ranked list of genes and evaluated the remaining top 75% with HC. The elimination was repeated leaving 50% and 25% of the top genes, repeating the clustering by HC at each step until the classes (strains and times) were separated.

Third, in order to validate the loss of LexA function in the mutated strain, we compared the expression levels of the 15 genes associated with the SOS response at all times and for both strains (with respect to the initial time for WT). In addition, for the time 120 minutes, the connections were represented as a non-directed graph to compare the expression values between strains. To do this, Cytoscape (http://www.cytoscape.org) was used to create the subnetwork of 15 genes, colored according to the level of expression.

D. Differential expression analysis

In order to identify differentially expressed genes among the evaluated conditions, the algorithm of Benjamini & Hochberg [12] was implemented, using as a criterion a two-fold change with respect to the control (all with p < 0.001). The relative comparison was made in two different ways. First, the gene expression profile of the mlexA mutant strain was compared with the corresponding WT strain at the same time, which made it possible to evaluate the differences between the strains at each time when exposed to ciprofloxacin.

Figure 1. Evaluation of quality, distribution and global differential expression profiles of samples. Panels: (A) dispersion of intensities; (B) hierarchical clustering (HC); (C) principal component analysis (PCA).
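The selection rule described in section D (Benjamini-Hochberg adjustment combined with a two-fold change) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names, the toy values, and the choice to apply the 0.001 cut-off to the adjusted p-values are all hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    n = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def select_de_genes(log2_fold_changes, pvalues, lfc_cutoff=1.0, alpha=0.001):
    """Flag genes with at least a two-fold change (|log2FC| >= 1) and an
    adjusted p-value below alpha (assumed cut-off, for illustration only)."""
    adjusted = benjamini_hochberg(pvalues)
    return [abs(lfc) >= lfc_cutoff and adj < alpha
            for lfc, adj in zip(log2_fold_changes, adjusted)]


if __name__ == "__main__":
    # Toy values, not taken from GSE5443.
    flags = select_de_genes([1.5, -2.0, 0.3, 1.2], [1e-5, 1e-6, 1e-7, 0.5])
    print(flags)
```

A gene is flagged only when both conditions hold, so a large fold change with a weak p-value, or a tiny p-value with a small fold change, is left out.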
In Figure 1, quality was evaluated using the dispersion of intensities for all samples after applying the Robust Multi-array Average (RMA) algorithm, showing similar distributions (A). Global differential expression profiles were compared by two clustering techniques. HC shows separation of samples by time but not by strain (B). Similar results were found when a PCA was applied (C).

In order to study the dynamics over time in the mlexA strain, a second analysis was performed comparing the differential expression of the mlexA mutant strain at different times with the WT strain (time 0 minutes).

E. Annotation and construction of gene networks

In order to annotate and characterize the differentially expressed genes, an ontology analysis was carried out by biological process and metabolic pathways using the PANTHER database (http://pantherdb.org/). The genes, both up- and down-regulated, were directly incorporated into the functional modules of the resource with the specifications for P. aeruginosa. In addition, using the PseudomonasNET database (www.inetbio.org/pseudomonasnet/), we performed a screening of biological functions and their relationships at the network level. To do this, the genes were incorporated and prioritized by candidate functions for P. aeruginosa using the Gene-centric search module. The resulting network was exported in graph format and imported into Cytoscape. The network analysis was set up to visualize expression levels. The expression data matrix was adjusted to the PseudomonasNET identifiers, and settings were selected to differentiate down- and up-regulated genes by color.

III. RESULTS

A. Clustering algorithms suggest that global differential expression due to exposure to ciprofloxacin in P. aeruginosa tends to differ by time rather than by loss of LexA function

With the aim of studying the time effect of exposure to ciprofloxacin and the loss of LexA function in P. aeruginosa, we analyzed microarray data. For this, we implemented different analyses to evaluate dispersion, quality and comparability of the total data. First, an RMA algorithm was applied to evaluate the dispersion of the total data and to obtain normalized, logarithmically transformed and adjusted values (Figure 1-A). The general dispersion of all the tests shows equivalence in the intensity signals for each of the samples and their replicates, which is consistent with the criteria proposed by Bolstad et al. regarding the correction of variation [13].

Second, a hierarchical clustering (HC) algorithm based on Euclidean distance was applied to the data set in order to study the global differential expression profiles and the relationship between conditions and replicates. According to the HC result, the experimental conditions are separated by time but not clearly by strain (Figure 1-B). A second evaluation with the PCA algorithm provided results congruent with HC (Figure 1-C). Moreover, both algorithms grouped the samples at time 0 and 30 minutes into one cluster.

Third, in order to evaluate the variation of differential expression profiles from the perspective of housekeeping genes, we compared the expression of proC and rpoD. There were no statistically significant differences among expression values (presented as a ratio between conditions) for either gene (p > 0.001) under the perturbations by ciprofloxacin, the mutation and the different times (Table 1).

TABLE 1. Relative comparison of expression of housekeeping genes proC and rpoD, as logFC (p-value)

Condition                   proC                rpoD
mlexA_0min/WT_0min          0.2619 (0.4483)     0.1043 (0.7342)
mlexA_30min/WT_30min        0.1150 (0.8076)     0.0439 (0.9015)
mlexA_120min/WT_120min      -0.4476 (0.1859)    0.3062 (0.3237)
mlexA_30min/WT_0min         0.0337 (0.9373)     0.4769 (0.2138)
mlexA_120min/WT_0min        -0.3480 (0.2838)    0.4040 (0.2112)

Finally, given the biological importance of LexA in the SOS response, we verified its inactivity in the mlexA strain by comparing the expression level of the 15 characterized genes in this pathway (relative to the initial profile of WT). As shown in Figure 2-A, the loss of LexA function affects the expression of the SOS genes, whereas all of them are active in the WT at 30 and 120 minutes. The representation as a non-directed graph shows the same observations for 120 minutes, although some connections are unknown (Figure 2-B-C).

Figure 2. Comparison of logFC of strains mlexA and WT for genes of the SOS response. Panels: (A) values of logFC for genes of the SOS response; (B) mlexA strain at time 120 min; (C) WT strain at time 120 min. All logFC values were obtained by relative comparison with WT-0min. In strain mlexA, which has no LexA function, the SOS response is completely inactive as expected, in contrast to WT, which has high levels of expression (A). Values were also represented using the relations between genes by building a subnetwork of the SOS response (B-C).

Altogether, these results suggest that differential expression profiles of the samples exposed to ciprofloxacin differ more clearly by time than by strain. To study the genes differentially expressed after exposure to ciprofloxacin, we conducted two comparisons: (i) strain-to-strain at the same time, and (ii) a dynamical (time course) analysis of the mlexA strain relative to the initial profile of WT, as detailed in the next sections.

B. SVM and differential expression analysis show that exposure to ciprofloxacin has a similar effect independently of LexA activity in P. aeruginosa

In order to contrast the differential expression profiles of the mlexA and WT strains at each specific time and to verify that there are few differences among them, we performed a screening analysis based on data mining to select attributes. For this, the variation of differential expression profiles was evaluated with defined sets of genes by first ranking all genes using the SVM algorithm, then eliminating the last genes from the ranking, and finally evaluating the clustering with the HC algorithm. Because the successive elimination of genes was done by quartiles, it was necessary to remove the last three quartiles of ranked genes (leaving the top 25% of genes, or the top quartile) to generate a separation of classes. This means that differences in expression level between strains are defined by less than 25% of genes. As shown in Figure 3-A, the experimental classes were separated and the groups were generated first by time and then by strain. This is consistent with the previous results for global profiles; however, the separation of strains is now clearly observed.

In addition, an analysis of differential expression between the mlexA strain and WT was carried out at the same time point in order to estimate accurately the differences and variation with statistical meaning at single-gene resolution and not by global profiles (Figure 3-B-C). The statistically significant differences (at least 2-fold change with p < 0.001) provided evidence that a discrete number of genes would be affected by the loss of function of LexA. This in turn corresponded to a minimal number of pathways affected. For example, at 30 minutes, a total of 109 genes were differentially expressed, belonging to 8 metabolic pathways, while at 120 minutes 195 genes were identified, corresponding to only 7 pathways. Moreover, only 4 genes were identified at all 3 times and no metabolic pathway was common to all 3 times. With these results, we concluded that a low number of transcripts, at most 195 genes (about 3.3%), differentiate the strains when exposed to ciprofloxacin.

Notably, when screening was done for expression networks using the PseudomonasNET database (networks not shown), we observed that differentially expressed genes were not significantly associated with any particular metabolic pathway at any time (Table 2, last row). This is consistent with previous results and suggests that expression differences between the two strains (given by LexA activity) at any time have no major biological effects (they have a similar response). Topological metrics of the networks are included and compared in Table 2 to contrast with the networks obtained by time (next section).

Figure 3. Differentially expressed genes by strain, using the WT strain at the same time for comparisons. Panels: (A) comparison of differential expression profiles post-attribute selection; (B) differentially expressed genes between strains; (C) comparison by strains at each time. Because the initial analysis showed fewer differences between strains than by time, a feature selection was done with the SVM algorithm. An analysis by quartiles shows that no more than 25% of top-ranked genes can separate conditions; however, in order to incorporate variation at the single-gene level, a differential expression analysis was done, showing relatively few changes (differences were no higher than 3.3% when comparing mlexA and WT) (B-C).

Panel B of Figure 3, differentially expressed genes between strains:

Conditions                  Down    Up     Pathways (number)
mlexA-0min/WT-0min          67      79     4
mlexA-30min/WT-30min        55      54     8
mlexA-120min/WT-120min      115     80     7

Panel A of Figure 4, differentially expressed genes by time:

Conditions                  Down    Up
mlexA-0min/WT-0min          67      79
mlexA-30min/WT-0min         288     161
mlexA-120min/WT-0min        614     609

Panel B of Figure 4, diversity of pathways: 4 pathways at 0 min, 22 pathways at 30 min, 38 pathways at 120 min. Panel C of Figure 4: comparison by time.

Figure 4. Differentially expressed genes of strain mlexA by time, using WT-0min for comparison.
In Figure 4, differences show an increase over time (A) with enrichment of pathways (B). A total of 280 differentially expressed genes were found to be shared between 30 and 120 minutes (C).

C. Large-scale network approach and differential expression analysis reveal time-intensified effects in P. aeruginosa mlexA after exposure to ciprofloxacin

In order to analyze the temporal dynamics of differential gene expression in the P. aeruginosa mlexA strain after exposure to ciprofloxacin, we compared the expression of this strain at all times with respect to the WT strain at 0 minutes, using the same criteria applied in the previous analysis of gene expression. As shown in Figure 4-A, the number of differentially expressed genes roughly triples from time 0 to 30 minutes (from 146 to 449) and increases more than eight-fold between time 0 and 120 minutes (from 146 to 1223). Between 30 and 120 minutes, 280 differentially expressed genes were shared, representing more than 62% of the genes at 30 minutes (Figure 4-C). At 120 minutes, 1223 genes were differentially expressed, representing 20.7% of the genes, which contrasts with the analyses between strains, where the differences did not exceed 3.3%.

In addition, ontologies with characterized genes showed a significant increase in the metabolic pathways involved, with 22 pathways at 30 minutes and 38 pathways at 120 minutes, as detailed in Figure 4-B.

When annotation and screening of expression networks were done with the PANTHER and PseudomonasNET databases, the differentially expressed genes at 0 minutes were not significantly associated with any metabolic pathway; however, this changes at 30 and 120 minutes (Figure 5 and Table 2, last row).

In a general analysis of the networks obtained for each of the comparisons, as shown in Table 2, the topological metrics revealed relevant changes with the time of exposure to the antibiotic in the mlexA strain, but no significant changes when considering the difference between strains at the same time. For example, when comparing the networks obtained between strains at each time (networks not shown), the number of nodes and edges oscillated but with relatively stable variations compared to the other cases, and the same was true for the average node degree (1.92 at the beginning, then 1.49 at 30 minutes and 2.41 at 120 minutes). Despite this observation, all networks generated presented significant relationships (based on the PPI enrichment p-value) but none were significantly associated with any metabolic pathway (based on functional enrichment).

When comparing the times for the mlexA strain, the changes were significant, with a drastic increase in various metrics. For example, between 0 and 120 minutes, the number of nodes changed from 146 to 1223 and the average number of connections per node increased from 1.92 to 16.1.

Figure 5. Transcriptional networks of differentially expressed genes of strain mlexA by time (0, 30 and 120 min), using WT-0min for comparison. Identification of one cluster was possible at times 30 and 120 minutes (node color encodes logFC: up-regulated genes are shown in green and down-regulated genes in red).

TABLE 2.
Comparison of topological metrics of networks created using the differentially expressed genes

Topological metric                      mlexA-0min/   mlexA-30min/   mlexA-120min/   mlexA-30min/   mlexA-120min/
                                        WT-0min       WT-30min       WT-120min       WT-0min        WT-0min
Number of nodes                         146           109            193             448            1223
Number of edges                         140           81             233             2080           9841
Average node degree                     1.92          1.49           2.41            9.29           16.1
Average local clustering coefficient    0.506         0.39           0.431           0.476          0.384
PPI enrichment p-value                  5.99e-08      1.42e-08       2.58e-12        < 1.0e-16      < 1.0e-16
Functional enrichments detected         No            No             No              Yes            Yes

At 30 and 120 minutes, using the annotation with PANTHER (Figure 4-B), the diversity of metabolic pathways was significantly linked to the differential expression profile by functional enrichment.

On the other hand, in order to compare the expression values of the down- and up-regulated genes in the network, the preliminary graph was imported into Cytoscape and edited to incorporate expression data. When representing the relative expression values (Figure 5, up-regulated genes are shown in green and down-regulated genes in red), we observed a random distribution of genes, except for a group of up-regulated genes that formed a cluster (at 30 and 120 minutes). These clusters are also formed when networks for WT are created at 30 and 120 minutes (with differential expression relative to WT-0min), so they are independent of LexA activity (networks not shown). When the genes that made up these clusters were characterized, the functional annotation revealed that the majority corresponded to hypothetical proteins of P. aeruginosa (not characterized), phage-associated proteins (mostly hypothetical), pyocin metabolism, transcription, SOS response and other metabolic processes (Table 3).

TABLE 3. Cluster genes annotation of mlexA strain (30 min and 120 min)

Functional annotation           Number of genes
                                30 min    120 min
Bacteriophage protein           10        10
Hypothetical protein            39        40
Pyocin metabolism               5         5
SOS response regulator          4         -
Transcriptional regulator       3*        3*
Others (with only 1 gene)       7         2
Total                           66        58

* Two genes were also counted in pyocin metabolism

IV. DISCUSSION

The complexity of biological systems and the amount of data obtained with high-throughput technologies continue to represent a limitation when extracting relevant information [14]. This is also true for prokaryotic biological systems like P. aeruginosa, whose physiology and regulatory mechanisms at the global level are barely understood. At the transcriptomic level, the need to associate RNA molecules to decipher their complex interactions can be addressed with pattern recognition within the data sets. This can be complemented with the knowledge stored in databases to characterize interactions and analyze them as a complex network.

P. aeruginosa is a bacterium of high relevance for human health due to the infections it causes and the common loss of susceptibility to antibiotics leading to multiple drug resistance [2], [15], [16]. This is worsened by the absence of new antimicrobials and the inability to control the development of antibiotic resistance. Moreover, the lack of knowledge of antibiotic-pathogen interactions and of the mechanisms of action of antimicrobial agents at the complete system level delays the formulation of new strategies to control infections [15].

The study model, the strain PAO1 with loss of LexA function, is of particular relevance because LexA is a regulator of the SOS response, which constitutes a mechanism of tolerance to DNA damage. In PAO1, the SOS response is induced by exposure to ciprofloxacin.

To perform an initial evaluation, the data were normalized with RMA, an algorithm regularly used to evaluate the quality and comparability of the data [13]. This process helps ensure that the differences in expression are biologically meaningful, referred to as interesting variation, in contrast with variation introduced during sample preparation, array manufacture and array processing (labeling, hybridization, and scanning), referred to as obscuring variation [17].

Notably, clustering algorithms such as PCA and HC (Figure 1) showed that the effect of the mutation is small compared to the effect of time when the bacterium is exposed to ciprofloxacin. In the PCA, the trajectory of global profiles is the same for the two strains, to the point that it is not possible to clearly differentiate the strains. Additionally, the global profiles were validated with housekeeping genes (Table 1), showing no significant differences between conditions, as expected.

When comparing the two strains at each time, with or without LexA function, transcriptomic responses showed less than 3.3% differences between strains when a single-gene analysis was performed. The initial screening was done by gene ranking and elimination of quartiles, indicating that no more than 25% of the genes could differentiate the classes using the SVM algorithm.

This combination of algorithms and analyses has not been previously reported for expression analyses of P. aeruginosa, since they are regularly performed separately. The aim was to verify the variations among conditions with two different techniques, where global profiles are used first (using SVM by quartiles) and then the individual gene level is applied for separation of classes (with analysis of differential expression).

These results are consistent with what was expected for P. aeruginosa, because LexA seems to regulate a discrete number of genes, including 15 of the SOS response, as well as others in various metabolic pathways [5]. Therefore, global profiles in response to ciprofloxacin do not allow a clear separation of the strains, i.e. both are affected in a similar way. The functional effects of LexA loss were verified by comparing the strains, as shown in Figure 2. In other organisms such as Escherichia coli and Bacillus subtilis [5], the SOS response also involves a relatively small number of genes, although higher than for P. aeruginosa, so the loss of LexA function could also have discrete effects on the global differential expression profiles. Additionally, no significantly enriched pathways were found for the differentially expressed genes, and the topological metrics of the correlation networks oscillated within relatively narrow ranges (Table 2).

However, in the temporal comparison of the mlexA strain with the initial profile of WT, a significant increase in differentially expressed genes is evident at 30 minutes and is much larger at 120 minutes, in the latter case reaching almost 21% of differences compared to the initial control. For instance, the gene networks showed sharp increases in the number of genes and of interactions among them (according to the topological metrics), which corresponded to an increase in enriched metabolic pathways. In addition, a cluster was identified at 30 and 120 minutes, whose characterization revealed many hypothetical proteins. This gap in knowledge regarding the function and importance of these proteins is a limitation of the current state of knowledge in P. aeruginosa, because although there is experimental evidence that they are related, it is not possible to characterize them completely from the curated databases.

As previously reported, the knowledge of the transcriptional dynamics during exposure to ciprofloxacin contributes to the understanding of the pathogenicity of the bacterium through the biological processes regulated (phages, mobility, toxin production, others) and can potentially offer alternatives to modulate the stress response [5], particularly the SOS response. This has an impact not only on the susceptibility to antibiotics, but could also regulate other biological processes such as the induction of errors in the DNA polymerases and the emergence of spontaneous mutations [18]. Such mechanisms are of particular importance in P. aeruginosa since resistance to multiple drugs originates mainly from point mutations [19]. Moreover, the SOS response affects other determinants, such as the production of pyocins and the transfer and expression of exogenous resistance genes [18]. Potentially, all these biological processes could be targets for modulation of the SOS response based on the knowledge obtained from this and other studies.

REFERENCES

[1] M. V. Olson et al., “Complete genome sequence of Pseudomonas aeruginosa PAO1, an opportunistic pathogen,” Nature, vol. 406, no. 6799, pp. 959–964, Aug. 2000.
[2] E. B. M. Breidenstein, M. Bains, and R. E. W. Hancock, “Involvement of the lon protease in the SOS response triggered by ciprofloxacin in Pseudomonas aeruginosa PAO1,” Antimicrob. Agents Chemother., vol. 56, no. 6, pp. 2879–2887, 2012.
[3] E. Drenkard and F. M. Ausubel, “Pseudomonas biofilm formation and antibiotic resistance are linked to phenotypic variation,” Nature, vol. 416, no. 6882, pp. 740–743, Apr. 2002.
[4] J. R. Govan and V. Deretic, “Microbial pathogenesis in cystic fibrosis: mucoid Pseudomonas aeruginosa and Burkholderia cepacia,” Microbiol. Rev., vol. 60, no. 3, pp. 539–74, Sep. 1996.
[5] R. T. Cirz, B. M. O’Neill, J. A. Hammond, S. R. Head, and F. E. Romesberg, “Defining the Pseudomonas aeruginosa SOS response and its role in the global response to the antibiotic ciprofloxacin,” J. Bacteriol., vol. 188, no. 20, pp. 7101–7110, 2006.
[6] E. Recacha et al., “Quinolone Resistance Reversion by Targeting the SOS Response,” MBio, vol. 8, no. 5, 2017.
[7] C. Y. Mo, L. D. Birdwell, and R. M. Kohli, “Specificity Determinants for Autoproteolysis of LexA, a Key Regulator of Bacterial SOS Mutagenesis,” Biochemistry, vol. 53, no. 19, pp. 3158–3168, May 2014.
[8] R. L. Gibson, J. L. Burns, and B. W. Ramsey, “Pathophysiology and Management of Pulmonary Infections in Cystic Fibrosis,” Am. J. Respir. Crit. Care Med., vol. 168, no. 8, pp. 918–951, Oct. 2003.
[9] K. Drlica and X. Zhao, “DNA gyrase, topoisomerase IV, and the 4-quinolones,” Microbiol. Mol. Biol. Rev., vol. 61, no. 3, pp. 377–92, Sep. 1997.
[10] H. Savli, A. Karadenizli, F. Kolayli, S. Gundes, U. Ozbek, and H. Vahaboglu, “Expression stability of six housekeeping genes: a proposal for resistance gene quantification studies of Pseudomonas aeruginosa by real-time quantitative RT-PCR,” J. Med. Microbiol., vol. 52, no. 5, pp. 403–408, May 2003.
[11] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, “Gene Selection for Cancer Classification using Support Vector Machines,” Mach. Learn., vol. 46, no. 1/3, pp. 389–422, 2002.
[12] Y. Benjamini and Y. Hochberg, “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing,” Journal of the Royal Statistical Society. Series B (Methodological), vol. 57, pp. 289–300, 1995.
[13] B. M. Bolstad, T. P. Speed, R. A. Irizarry, and M. Astrand, “A comparison of normalization methods for high density oligonucleotide array data based on variance and bias,” Bioinformatics, vol. 19, no. 2, pp. 185–193, 2003.
[14] J. A. Molina-Mora, M. Kop-Montero, I. Quirós-Fernández, S. Quiros, J. L. Crespo-Mariño, and R. A. Mora-Rodríguez, “A hybrid mathematical modeling approach of the metabolic fate of a fluorescent sphingolipid analogue to predict cancer chemosensitivity,” Comput. Biol. Med., vol. 97, pp. 8–20, Jun. 2018.
[15] M. D. Brazas and R. E. W. Hancock, “Ciprofloxacin Induction of a Susceptibility Determinant in Pseudomonas aeruginosa,” Antimicrob. Agents Chemother., vol. 49, no. 8, pp. 3222–3227, 2005.
[16] F. Toval, A. Guzman-Marte, V. Madriz, T. Somogyi, C. Rodriguez, and F. Garcia, “Predominance of carbapenem-resistant Pseudomonas aeruginosa isolates carrying blaIMP and blaVIM metallo-β-lactamases in a major hospital in Costa Rica,” J. Med. Microbiol., vol. 64, no. Pt_1, pp. 37–43, Jan. 2015.
[17] A. J. Hartemink, D. K. Gifford, T. S. Jaakkola, and R. A. Young, “Combining location and expression data for principled discovery of genetic regulatory network models,” Pac. Symp. Biocomput., pp. 437–49, 2002.
[18] E. Y. Valencia, F. Esposito, B. Spira, J. Blázquez, and R. S. Galhardo, “Ciprofloxacin-mediated mutagenesis is suppressed by subinhibitory concentrations of amikacin in Pseudomonas aeruginosa,” Antimicrob. Agents Chemother., no. December, p. AAC.02107-16, 2016.
[19] E. B. M. Breidenstein, C. de la Fuente-Núñez, and R. E. W. Hancock, “Pseudomonas aeruginosa: all roads lead to resistance,” Trends Microbiol., vol. 19, no. 8, pp. 419–426, Aug. 2011.

REFERENCES

Andersson, D. I., & Hughes, D. (2014). Microbiological effects of sublethal levels of antibiotics. Nature Reviews Microbiology, 12(7), 465–478. https://doi.org/10.1038/nrmicro3270

Bermingham, M. L., Pong- C. S. (2015). Application of high-dimensional feature selection: Evaluation for genomic prediction in man. Scientific Reports, 5, 1–12. https://doi.org/10.1038/srep10312

Berti, A. D., & Hirsch, E. B. (2020, January 10). Tolerance to antibiotics affects response. Science. American Association for the Advancement of Science. https://doi.org/10.1126/science.aba0150

Brauner, A., Fridman, O., Gefen, O., & Balaban, N. Q. (2016, May 1). Distinguishing between resistance, tolerance and persistence to antibiotic treatment. Nature Reviews Microbiology. Nature Publishing Group. https://doi.org/10.1038/nrmicro.2016.34

Brazas, M. D., & Hancock, R. E. W. (2005). Ciprofloxacin Induction of a Susceptibility Determinant in Pseudomonas aeruginosa. Antimicrobial Agents and Chemotherapy, 49(8), 3222–3227. https://doi.org/10.1128/AAC.49.8.3222

Cabot, G., Zamorano, L., Moyà, B., Juan, C., Navas, A., Blázquez, J., & Oliver, A. (2016).
Evolution of Pseudomonas aeruginosa antimicrobial resistance and fitness under low and high mutation rates. Antimicrobial Agents and Chemotherapy, 60(3), 1767 1778. https://doi.org/10.1128/AAC.02676-15.Address Caldera, M., Müller, F., Kaltenbrunner, I., Licciardello, M. P., Lardeau, C. H., Kubicek, S., & Menche, J. (2019). Mapping the perturbome network of cellular perturbations. Nature Communications, 10(1). https://doi.org/10.1038/s41467-019-13058-9 Chinchilla, D. (2018). Patrones de expresión de los genes de las metalo-b-lactamasas blaIMP-18 y 175 blaVIM-2 e IMP-18 en la cepa Pseudomonas aeruginosa AG1 resistente a carbapenems. Tesis del Posgrado en Microbiología con énfasis en Bacteriología. Universidad de Costa Rica, San José, Costa Rica. Ciofu, O., & Tolker-Nielsen, T. (2019, May 3). Tolerance and resistance of pseudomonas aeruginosabiofilms to antimicrobial agents-how P. aeruginosaCan escape antibiotics. Frontiers in Microbiology. Frontiers Media S.A. https://doi.org/10.3389/fmicb.2019.00913 Civelek, M., & Lusis, A. J. (2014). Systems genetics approaches to understand complex traits. Nature Reviews. Genetics, 15(1), 34 48. https://doi.org/10.1038/nrg3575 Cornforth, D. M., Dees, J. L., Ibberson, C. B., Huse, H. K., Mathiesen, I. H., Kirketerp- Whiteley, M. (2018). Pseudomonas aeruginosa transcriptome during human infection. Proceedings of the National Academy of Sciences of the United States of America, 115(22). https://doi.org/10.1073/pnas.1717525115 DeLong, E. F. (2012). . Springer. Dößelmann, B., Willmann, M., Steglich, M., Bunk, B., Nübel, U., Peter, S., & Neher, R. A. (2017). Rapid and Consistent Evolution of Colistin Resistance in Extensively Drug-Resistant Pseudomonas aeruginosa during Morbidostat Culture. Antimicrobial Agents and Chemotherapy, 61(9), e00043-17. https://doi.org/10.1128/AAC.00043-17 Farajzadeh Sheikh, A., Shahin, M., Shokoohizadeh, L., Halaji, M., Shahcheraghi, F., & Ghanbari, F. (2019). 
Molecular epidemiology of colistin-resistant Pseudomonas aeruginosa producing NDM-1 from hospitalized patients in Iran. Iranian Journal of Basic Medical Sciences, 22(1), 38 42. https://doi.org/10.22038/ijbms.2018.29264.7096 Fernández, M., Corral-Lugo, A., & Krell, T. (2018). The plant compound rosmarinic acid induces a broad quorum sensing response in Pseudomonas aeruginosa PAO1. Environmental Microbiology, 20(12), 4230 4244. https://doi.org/10.1111/1462-2920.14301 176 Firme, M., Kular, H., Lee, C., & Song, D. (2010). RpoS Contributes to Variations in the Survival Pattern of Pseudomonas aeruginosa in Response to Ciprofloxacin. Journal of Experimentall Microbiology and Immunology (JEMI), 14(April), 21 27. Retrieved from https://microbiology.ubc.ca/sites/default/files/roles/drupal_ungrad/JEMI/14/JEMI14_21- 27.pdf Fothergill, J. L., Mowat, E., Walshaw, M. J., Ledson, M. J., James, C. E., & Winstanley, C. (2011). Effect of antibiotic treatment on bacteriophage production by a cystic fibrosis epidemic strain of Pseudomonas aeruginosa. Antimicrobial Agents and Chemotherapy, 55(1), 426 428. https://doi.org/10.1128/AAC.01257-10 Glaab, E., Bacardit, J., Garibaldi, J. M., & Krasnogor, N. (2012). Using rule-based machine learning for candidate disease gene prioritization and sample classification of cancer gene expression data. PLoS ONE, 7(7). https://doi.org/10.1371/journal.pone.0039932 Hasin, Y., Seldin, M., & Lusis, A. (2017). Multi-omics approaches to disease. Genome Biology, 18(1), 83. https://doi.org/10.1186/s13059-017-1215-1 Hong, D. J., Bae, I. K., Jang, I. H., Jeong, S. H., Kang, H. K., & Lee, K. (2015). Epidemiology and characteristics of metallo-ß-lactamase-producing Pseudomonas aeruginosa. Infection and Chemotherapy, 47(2), 81 97. https://doi.org/10.3947/ic.2015.47.2.81 Kamal, F., & Dennis, J. J. (2015). Burkholderia cepacia complex phage-antibiotic synergy (PAS): Antibiotics stimulate lytic phage activity. Applied and Environmental Microbiology, 81(3), 1132 1138. 
https://doi.org/10.1128/AEM.02850-14 B. (2010). Genome diversity of Pseudomonas aeruginosa PAO1 laboratory strains. Journal of Bacteriology, 192(4), 1113 1121. https://doi.org/10.1128/JB.01515-09 Lu, P., Wang, Y., Zhang, Y., Hu, Y., Thompson, K. M., & Chen, S. (2016). RpoS-dependent sRNA RgsA 177 regulates Fis and AcpP in Pseudomonas aeruginosa. Molecular Microbiology, 102(2), 244 259. https://doi.org/10.1111/mmi.13458 Ma, C., Xin, M., Feldmann, K. A., & Wang, X. (2014). Machine Learning-Based Differential Network Analysis: A Study of Stress-Responsive Transcriptomes in Arabidopsis. The Plant Cell, 26(2), 520 537. https://doi.org/10.1105/tpc.113.121913 Dynamics of Pseudomonas aeruginosa genome evolution. Proceedings of the National Academy of Sciences, 105(8), 3100 3105. https://doi.org/10.1073/PNAS.0711982105 McVicker, G., Prajsnar, T. K., Williams, A., Wagner, N. L., Boots, M., Renshaw, S. A., & Foster, S. J. (2014). Clonal Expansion during Staphylococcus aureus Infection Dynamics Reveals the Effect of Antibiotic Intervention. PLoS Pathogens, 10(2). https://doi.org/10.1371/journal.ppat.1003959 Molina-Mora, J.-A., Campos-Sánchez, R., Rodríguez, C., Shi, L., & García, F. (2020). High quality 3C de novo assembly and annotation of a multidrug resistant ST-111 Pseudomonas aeruginosa genome: Benchmark of hybrid and non-hybrid assemblers. Scientific Reports, 10(1), 1392. https://doi.org/10.1038/s41598-020-58319-6 Molina-Mora, J.-A., Garcia-Batan, R., & Garcia, F. (2020). From pan-genome to the genomic context of the two integrons of ST-111 Pseudomonas aeruginosa AG1: A VIM-2-carrying old- acquaintance and a novel IMP-18-carrying integron. Research Square (Pre-Print). https://doi.org/10.21203/RS.3.RS-41474/V1 Molina-Mora, J., Montero-Manso, P., Batán, R. G., Sánchez, R. C., Fernández, J. V., & García, F. (2020). A first Pseudomonas aeruginosa perturbome: Identification of core genes related to multiple perturbations by a machine learning approach. 
BioRxiv, 2020.05.05.078477. https://doi.org/10.1101/2020.05.05.078477 178 Molina-Mora, J.A., Campos-Sanchez, R., & Garcia, F. (2018). Gene Expression Dynamics Induced by Ciprofloxacin and Loss of Lexa Function in Pseudomonas aeruginosa PAO1 Using Data Mining and Network Analysis. In 2018 IEEE International Work Conference on Bioinspired Intelligence (IWOBI) (pp. 1 7). IEEE. https://doi.org/10.1109/IWOBI.2018.8464130 Molina-Mora, Jose Arturo, Chinchilla-Montero, D., Castro-Peña, C., & Garcia, F. (2020). Two- dimensional gel electrophoresis (2D-GE) image analysis based on CellProfiler: Pseudomonas aeruginosa AG1 as model. Medicine, IN-PRESS. Molina-Mora, Jose Arturo, Chinchilla-Montero, D., Castro-Peña, C., & García, F. (2020). Two- dimensional gel electrophoresis image analysis of two Pseudomonas aeruginosa clones. 2020 IEEE International Work Conference on Bioinspired Intelligence (IWOBI), 1 6. Molina-Mora, Jose Arturo, Chinchilla, D., Chavarría, M., Ulloa, A., Campos-Sanchez, R., Mora- -111 Pseudomonas aeruginosa AG1 to ciprofloxacin identified by a top-down systems biology approach. Scientific Reports, 10, 1 23. https://doi.org/10.1038/s41598-020-70581-2 Morales-Berrocal, M. (2016). Descripción de un modelo infeccioso murino de Pseudomonas aeruginosa AG1. Tesis para optar por el grado de Licenciatura en Microbiología y Química Clínica. Universidad de Costa Rica, San José, Costa Rica. Mulet, X., Cabot, G., Ocampo- Spanish Network for Research in Infectious Diseases (REIPI). (2013). Biological Markers of Pseudomonas aeruginosa Epidemic High-Risk Clones. Antimicrobial Agents and Chemotherapy, 57(11), 5527 5535. https://doi.org/10.1128/AAC.01481-13 e Progress of Multi-Omics Technologies: Determining Function in Lactic Acid Bacteria Using a Systems Level Approach. Frontiers in Microbiology, 10, 3084. https://doi.org/10.3389/fmicb.2019.03084 179 Oliver, A., Mulet, X., López-Causapé, C., & Juan, C. (2015). 
The increasing threat of Pseudomonas aeruginosa high-risk clones. Drug Resistance Updates, 21 22, 41 59. https://doi.org/10.1016/j.drup.2015.08.002 Petitjean, M., Martak, D., Silvant, A., Bertrand, X., Valot, B., & Hocquet, D. (2017). Genomic characterization of a local epidemic Pseudomonas aeruginosa reveals specific features of the widespread clone ST395. Microbial Genomics, 3(10), e000129. https://doi.org/10.1099/mgen.0.000129 Stewart, P. S., Franklin, M. J., Williamson, K. S., Folsom, J. P., Boegli, L., & James, G. A. (2015). Contribution of stress responses to antibiotic tolerance in Pseudomonas aeruginosa biofilms. Antimicrobial Agents and Chemotherapy, 59(7), 3838 3847. https://doi.org/10.1128/AAC.00433-15 Subramanian, I., Verma, S., Kumar, S., Jere, A., & Anamika, K. (2020). Multi-omics Data Integration, Interpretation, and Its Application. Bioinformatics and Biology Insights, 14, 1177932219899051. https://doi.org/10.1177/1177932219899051 Toval, F., Guzmán-Marte, A., Madriz, V., Somogyi, T., Rodríguez, C., & García, F. (2015). Predominance of carbapenem-resistant Pseudomonas aeruginosa isolates carrying blaIMP and blaVIM metallo- -lactamases in a major hospital in Costa Rica. Journal of Medical Microbiology, 64(1), 37 43. https://doi.org/10.1099/jmm.0.081802-0 Turton, J. F., Wright, L., Underwood, A., Witney, A. A., Chan, Y. T., Al- (2015). High-resolution analysis by whole-genome sequencing of an international lineage (Sequence Type 111) of pseudomonas aeruginosa associated with metallo-carbapenemases in the United Kingdom. Journal of Clinical Microbiology, 53(8), 2622 2631. https://doi.org/10.1128/JCM.00505-15 Woodford, N., Turton, J. F., & Livermore, D. M. (2011). Multiresistant Gram-negative bacteria: the 180 role of high-risk clones in the dissemination of antibiotic resistance. FEMS Microbiology Reviews, 35(5), 736 755. https://doi.org/10.1111/j.1574-6976.2011.00268.x World Health Organization. (2017). 
Guidelines for the prevention and control of carbapenem- resistant Enterobacteriaceae, Acinetobacter baumannii and Pseudomonas aeruginosa in health care facilities. Geneva. Retrieved from https://apps.who.int/iris/bitstream/handle/10665/259462/9789241550178- eng.pdf?sequence=1&ua=1 on next generation sequencing data analysis using text mining algorithm. BMC Bioinformatics, 17(1), 213. https://doi.org/10.1186/s12859-016-1075-9