Getting stakeholders acquainted with the rationale behind the construct of the English language proficiency test of the University of Costa Rica for the Ministry of Education of Costa Rica

Un acercamiento al constructo de la prueba de dominio lingüístico del idioma inglés desarrollada por la Universidad de Costa Rica para el Ministerio de Educación de Costa Rica

Received: May 27, 2021. Accepted: September 21, 2021.
Estudios de Lingüística Aplicada, año 40, número 75, julio de 2022, pp. 119–143
doi: 10.22201/enallt.01852647p.2022.75.1013

Walter Araya Garita, Universidad de Costa Rica, Facultad de Letras, Escuela de Lenguas Modernas, walter.arayagarita@ucr.ac.cr
José Fabián Elizondo González, Universidad de Costa Rica, Facultad de Letras, Escuela de Lenguas Modernas, josefabian.elizondo@ucr.ac.cr
Ana Carolina González Ramírez, Universidad de Costa Rica, Facultad de Letras, Escuela de Lenguas Modernas, ana.gonzalezramirez@ucr.ac.cr

Abstract

This paper gathers evidence to build a content-validity argument for the language proficiency test developed by the University of Costa Rica for high school students. It analyzes the theoretical foundations supporting the construction and administration of a custom-made language test in its localized context, including a description of the test and the input of external stakeholders. Additionally, the paper offers suggestions and recommendations for future testing experiences of this type, paving the way for researchers who intend to pursue this area of interest.

Keywords: language proficiency; standardized testing; content validity; localization; foreign language assessment

1. Introduction

In 2016, the Costa Rican Ministry of Education (Ministerio de Educación Pública; MEP, for its abbreviation in Spanish) modified the English curriculum nationwide. In a national effort coined the Alliance for Bilingualism, the MEP implemented multiple changes in the English programs to meet cultural, societal, and financial demands as a strategy to transform Costa Rica into a bilingual country, which in turn would attract investors, generate jobs, revitalize the economy, and foster study opportunities abroad (Azofeifa, 2019).
Additionally, in 2019 the MEP decided to eliminate its traditional national English tests, which were pass-or-fail, multiple-choice reading comprehension tests. Before 2019, senior high school students were required to score at least 70 out of 100 to be eligible for graduation. Students failing to meet this score had to retake the test as many times as needed to reach the minimum passing score, which prevented them from starting college or getting a job. Instead of administering the traditional test, however, the MEP opted for a language proficiency test, diagnostic in nature as defined by Brown and Abeywickrama (2019: 10). This new test does not evaluate content but rather the English performance of students, based on the descriptors of the Common European Framework of Reference for Languages (CEFR) (Cordero, 2019).

Despite the multiple language certifications currently available, none seems to meet the MEP's specific requirements and needs. First, test administration and the publication of results must fit the MEP's annual schedule, which requires test administrators to carry out all related activities within a very tight time window. Second, the uneven distribution and availability of resources pose challenges for public education institutions. Finally, the MEP's budget restrictions must also be considered when choosing among the different options for English certification.

In an attempt to address these needs, the School of Modern Languages (Escuela de Lenguas Modernas; ELM, for its abbreviation in Spanish) of the University of Costa Rica (UCR) decided to create a more locally sensitive option: the Language Proficiency Test (prueba de dominio lingüístico; PDL). Over the past 30 years, the ELM has accumulated vast experience in designing and administering a variety of reliable language tests and in providing valid evidence to support the interpretation of their scores according to their intended uses. The language proficiency test designed for the English for Specific Purposes program of the National Council of University Presidents (a test for faculty members) and the official certification of translators and interpreters, among others, attest to the ELM's expertise in the field, broadened by the guidance of international language testing professionals. The School not only administers its own language tests but is also an internationally recognized center for some renowned language certifications. This recognition has required many faculty members to receive extensive training and gain valuable experience as authorized certified examiners. In addition, the ELM possesses the know-how for administering large-scale, high-stakes tests nationwide, such as the Entrance Examination designed by the Psychological Research Institute and the School of Statistics at the UCR. More recently, the ELM has started digitizing some of its tests to facilitate their administration and the recording of results.
This test automation has widened the options for test administration to three possible formats: online, offline, and hybrid (a combination of offline and online delivery that requires minimal bandwidth), making the test suitable for the digital infrastructure of Costa Rica's public high schools, which may or may not have a strong Internet connection.
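As an illustration of how the choice among these three formats might be operationalized per school, consider the following sketch. It is hypothetical: the bandwidth threshold, function name, and parameters are invented for the example and are not part of the UCR's or the MEP's specifications.

```python
# Hypothetical sketch: choosing a delivery format per school based on measured
# connectivity. Thresholds and names are invented for illustration only.
def choose_format(bandwidth_kbps: float, connection_is_stable: bool) -> str:
    """Map a school's connectivity profile to one of the three PDL formats."""
    if bandwidth_kbps <= 0:
        return "offline"   # no Internet: test runs fully on local machines
    if bandwidth_kbps < 1_000 or not connection_is_stable:
        # minimal bandwidth: items cached locally, results synced online later
        return "hybrid"
    return "online"        # strong, stable connection: fully online delivery

for profile in [(0, False), (400, True), (5_000, True)]:
    print(profile, "->", choose_format(*profile))
# (0, False) -> offline; (400, True) -> hybrid; (5000, True) -> online
```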
2. Literature review

Standardized language testing may seem overwhelming and intimidating to many stakeholders,¹ especially educators, whose work effectiveness is at stake, and their students. As Shohamy (2001, as cited in Fulcher, 2010: 8) argued, "one reason why test takers and teachers dislike tests so much is that they are a means of control". Students conceive this type of testing as a punishment focused on identifying their weaknesses and areas for improvement; teachers, on the other hand, see it as a sign that their supervisors do not trust their expertise. Brown and Abeywickrama (2019: 1) summarized this view when they stated that students and teachers, both stakeholders of the assessment process, "are not likely to view a test as positive, pleasant, or affirming". Despite these negative perspectives, different scholars have viewed standardized language assessments as a means through which distributive justice, as Messick (1989, as cited in Fulcher, 2010: 4) proposed, could be achieved. Fulcher (2010: 4) further acknowledged the importance of testing as a necessary and widely accepted basis on which high-stakes decisions can be made, not only for those in charge of designing the instruments but also for test users, who need to see that the test is designed, administered, scored, and reported fairly and equitably.

¹ The term stakeholder is understood as "[any] person whose interests should be taken into consideration" (Coombe, 2018: 38). Furthermore, the four types of stakeholders described by Hooge and Helderman (2008, as mentioned in Hooge, Burns & Wilkoszewski, 2012: 13) are also used in this paper, to wit: primary, internal, vertical, and horizontal stakeholders.

The concept of language competence assessment has been continuously redefined over time, according to the needs of users and the evolution of language teaching and learning theories. Authors such as Oller (1979, as cited in Brown & Abeywickrama, 2019: 13) stated that during the 1970s and 1980s, language competence was viewed as "a unified set of interacting activities that could not be tested separately". Cloze and dictation exercises, in which several skills were assessed simultaneously, embodied this concept of language competence. However, during the mid-1980s, Canale and Swain (1980, as cited in Brown & Abeywickrama, 2019: 14) recommended a shift from this structure-centered approach to assessment towards a more communicative one, involving the kinds of real-life tasks that language learners may eventually face. Accordingly, Savignon (1985: 131) agreed that "communicative competence certainly requires more than knowledge of surface features of sentence-level grammar". What is more, when it comes to authenticity in assessment, Bachman (1990) and Weir (1990: 86, as cited in Brown & Abeywickrama, 2019: 16) highlighted the importance of asking questions such as "where, when, how, with whom, and why language is to be used, and on what topics, and with what effect" in order to measure language competence. Supporting this view, Jamieson, Eignor, Grabe and Kunnan (2008: 57) asserted that communicative competence "accounts for language performance across a wide range of contexts, includes complex abilities responsible for a particular range of goals and takes into account relevant contexts". More recently, Bachman and Palmer (2010, as cited in Brown & Abeywickrama, 2019: 15) included "the need for a correspondence between language test performance and language use" among the fundamental principles of language testing. This more realistic communicative view of language assessment permeates some of the most renowned language tests currently on the market.

Today, communicative language competence is assessed more holistically. Since proficiency in a given language goes beyond knowing its grammar, other equally (if not more) important features should also be accounted for when testing language proficiency. In fact, as Badger and Yan (2012: 7) stated, "the main feature of the pedagogic orientation of a CLT [communicative language teaching] course is students' ability to use the second language (L2), rather than knowledge about language, with a balance between the four skills". Along these same lines, the Common European Framework of Reference for Languages (CEFR) (Council of Europe, 2002) provides a framework that lists the necessary communicative language activities and strategies, as well as the communicative language competences (linguistic, sociolinguistic, and pragmatic), that should be considered when designing language assessment instruments. Likewise, the American Council on the Teaching of Foreign Languages (ACTFL) shares the CEFR's emphasis on communication, expanding it into an intercultural communication approach. More recently, ACTFL coined the term intercultural communicative competence, defined as "using language skills, and cultural knowledge and understanding, in authentic contexts to effectively interact with people. It is not simply knowing about the language and about the products and practices of a culture" (Van Houten & Shelton, 2018: 35). Hence, it is evident that the concept of mastering a second or foreign language keeps changing as new theories continue to evolve.

One may think that the analysis and construction of standardized tests has reached a stagnation point; however, the validation of language assessments is an ongoing process (Chapelle, 2012; Brown & Abeywickrama, 2019).
A key step in the test validation process entails defining validity: "an overall evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of interpretations and actions based on test scores or other modes of assessment" (Messick, 1989, as cited in Messick, 1995: 741). Therefore, this concept "is not a property of the test or assessment as such, but rather of the meaning of the test scores" (Messick, 1996: 245). Recently, this view has been supported and expanded in the Standards for Educational and Psychological Testing (American Educational Research Association [AERA], American Psychological Association [APA], National Council on Measurement in Education [NCME], & Joint Committee on Standards for Educational and Psychological Testing [US], 2014: 11), which further highlights the importance of "accumulating relevant evidence to provide a sound scientific basis for the proposed score interpretations". The second step entails collecting evidence to build a validity argument, which analyzes that evidence "to make a case justifying the score-based inferences and the intended uses of the test" (Carr, 2015: 331). To operationalize this validity argument, the Standards for Educational and Psychological Testing outline five sources of validity evidence: evidence based on response processes, evidence based on internal structure, evidence of relations to other variables, evidence for validity and consequences of testing, and content-based evidence (AERA et al., 2014).

In light of the vast scope of validation processes, the numerous types of evidence available, and the multiple research approaches that can be adopted, this paper focuses primarily on gathering content-based evidence, aiming to meet some of the validity standards stated above. First, a test can claim content validity "if [it] actually samples the subject matter about which conclusions are to be drawn, and if it requires the test-taker to perform the behavior measured" (Brown & Abeywickrama, 2019: 32). To add further detail to the elements to be analyzed to demonstrate content validity, the Standards for Educational and Psychological Testing (AERA et al., 2014: 14) state that "test content refers to the themes, wording, and format of the items, tasks, or questions on a test". Content validity is also linked to being ecologically sensitive: "serving the local needs of teachers and learners. What this means in practice is that the outcomes of testing — whether these are traditional 'scores' or more complex profiles of performance — are interpreted in relation to a specific learning environment" (Fulcher, 2010: 2). More recently, other authors have also studied this concept; some, such as Coombe (2018: 28), have even renamed it localization, defining it as follows:
"A test that is designed to cater to the local needs of the test population. This may mean choosing appropriate cultural topics and making sure the processes of test design, piloting, administration and scoring reflect local needs and expectations. In more recent localization movements, this has also involved localization of language use in context to include the spread and changing shape of English in countries that use English as an official language."

Based on the concepts and context provided above, it is possible to affirm that the UCR's language proficiency test supports this first claim: it serves the local needs of teachers and learners in Costa Rican high schools, since test users can interpret students' scores and performance profiles in relation to this specific learning environment.

Last, the final building block of a solid validity argument for a standardized test consists of aligning the test with language proficiency descriptors, such as those provided by the CEFR, and with the characteristics of test takers and administrators (O'Sullivan, 2016: 148). Similarly, the Association of Language Testers in Europe (2020: 26) further emphasizes that linking the test to a theoretical construct is a minimum standard in test construction. Hence, a second claim to be made about the UCR's language proficiency test is that it is properly aligned with the CEFR language proficiency descriptors.

3. Description of the test

3.1. Validity argument

For the purpose of the validity argument in this paper, we provide the following descriptions and analyses, as suggested in the Standards for Educational and Psychological Testing (section 4.1).

3.1.1. Purpose of the test

The purpose of the PDL-MEP test is to assess Costa Rican high school students regarding their understanding and production of non-technical English related to both regional and global contexts pertaining to formal and informal socio-interpersonal, transactional, and academic domains, using the CEFR descriptors as reference. The test is merely diagnostic: all senior high school students in Costa Rica who take it will learn their proficiency level in reading and listening comprehension skills according to the CEFR. It is not a language certification test; hence, it should not be used for college admissions, visa applications, or job applications.

3.1.2. Score interpretation

As indicated by the MEP's authorities, the purpose of this new test is to determine students' language proficiency as a means to diagnose the efficacy of the language programs recently adopted in Costa Rica, as well as students' stage of language development. Hence, testees should interpret the results as a reflection of progress in their foreign language education, one that evidences the areas where they perform strongly and those needing improvement. Although students are not required to obtain a particular score to graduate from high school, they must take the test as a graduation requirement.

Based on the scores obtained nationwide, the MEP might make the necessary adjustments to better achieve its goal: getting students to perform at the B1 level by the end of their high school years.
To illustrate such adjustments, if the test results show a clear lack of command of B2-level tasks, MEP teachers will be able to use this information to address the identified deficiencies by reinforcing in class the tasks where performance was unsatisfactory. Internal stakeholders, as outlined by Hooge, Burns and Wilkoszewski (2012: 13), might use the test results to determine, for example, where to recruit new bilingual personnel or whether to invest in additional language programs for underprivileged populations.

3.2. The constructs of the test

3.2.1. Reading comprehension

Reading comprehension proficiency is defined as demonstrating an understanding of non-technical texts in English related to both regional and global contexts that pertain to formal and informal socio-interpersonal, transactional, and academic domains, taking the CEFR descriptors as reference. The contents to be included are determined following the MEP guidelines. Furthermore, the skills assessed range from recognizing "familiar words accompanied by pictures, such as a fast-food restaurant menu illustrated with photos or a picture book using familiar vocabulary" to understanding "in detail lengthy, complex texts, whether or not they relate to [examinees'] own area of speciality" (North, Piccardo & Goodier, 2018: 60). Finally, some of the strategies to be demonstrated by testees are included in the CEFR descriptors, such as skimming, scanning, understanding a writer's tone and humor, and identifying attitudes and implied opinions (CEFR, 2018).

3.2.2. Listening comprehension

Listening comprehension proficiency is defined as demonstrating an understanding of non-technical English aural texts related to both regional and global contexts that pertain to formal and informal socio-interpersonal, transactional, and academic domains, using the CEFR descriptors as reference. The contents to be included are determined following the MEP guidelines. Some of the skills to be assessed range from recognizing "numbers, prices, dates, and days of the week, provided they are delivered slowly and clearly in a defined, familiar, everyday context" to following "extended speech even when it is not clearly structured and when relationships are only implied and not signaled explicitly" (CEFR, 2018: 55). Last, some of the strategies to be demonstrated by testees are encompassed in the CEFR descriptors, such as understanding the main ideas and specific details, making inferences, and discerning speakers' attitudes (CEFR, 2018).

3.3. Claims

3.3.1. Claim 1 (MEP)

The UCR language test gives the MEP valid and reliable information about the English performance of students regarding nationwide language standards and CEFR proficiency bands, including communicative activities, strategies, and language competences. Based on this information, the MEP can report students' language performance by classroom, school, district, and region. With this in mind, the Ministry will be able to design strategies to focus on the areas most in need of support regarding language proficiency.
3.3.2. Claim 2 (teachers)

The UCR language test gives teachers valid and reliable information about the English performance of students regarding nationwide standards and CEFR proficiency bands, including communicative activities, strategies, and language competences. Based on this information, teachers can adjust classroom activities (both formative and summative) to meet the standards established by the MEP.

3.3.3. Claim 3 (parents and students)

The UCR language test gives parents and students valid and reliable information about the English performance of students regarding nationwide language standards and CEFR proficiency bands, including communicative activities, strategies, and language competences. Based on this information, these stakeholders can determine students' progress across the entire education system.

3.4. Rationale

Given the scenario described above, the Ministry of Education and the University of Costa Rica agreed to build, administer, and deliver the results of an English language proficiency test that would represent a customized and convenient option for both institutions. Accordingly, the MEP would have an instrument that meets its specific needs, and the UCR would have another opportunity to give back to society the knowledge it has acquired over the years through research and its socially oriented education programs. Moreover, as part of a well-documented and reliable process, "language testers shall endeavor to communicate the information they produce to all relevant stakeholders in as meaningful a way as possible" (International Language Testing Association, 2018: 2). This transparency is particularly important in documenting such a pioneering initiative, whereby a Ministry of Education in a Latin American country enforces a national policy of bilingualism in conjunction with a public higher education institution through a large-scale language test.

This article aims to gather evidence to build a PDL-MEP content validity argument through an analysis of the theoretical foundations supporting the construction and administration of a customized standardized language test and its localized context, including a description of the test and the input of the internal stakeholders. Additionally, this article provides suggestions and recommendations for future testing experiences of this type while setting the grounds for future researchers who intend to follow this academic approach.

The literature review presented above supports both the claims established in the validity argument proposed here and the claims that define the construct of the UCR Language Proficiency Test (content domain).

3.5. Localized test context
3.5.1. MEP diagnostic test for high school students

To meet the localization principles mentioned above, the national English language proficiency test will assess the reading and listening comprehension skills of high school students (as per the MEP's request) based on the themes, domains, and scenarios set by the Ministry of Education in its Programas de Estudio de Inglés (English Language Programs), which have been aligned with the CEFR guidelines (Ministerio de Educación Pública, 2016). The topics addressed in this document include, but are not limited to, conflict resolution, democracy and democratic principles, economic development, environmental sustainability, blurring of national borders, and human rights defense and protection (Ministerio de Educación Pública, 2016: 13). Three axes encompass all these topics: global citizenship with local belonging, education for sustainable development, and new digital citizenship (Ministerio de Educación Pública, 2016: 13, 55). The contexts or domains where the target language is to be used, as selected for this test, include the socio-interpersonal, transactional, and academic domains (Ministerio de Educación Pública, 2016: 38). Among the multiple scenarios provided by the MEP for all secondary education levels, the test may include, for example, "Enjoying Life" (7th grade), "Going Shopping" (8th grade), "Lights, Camera, Action" (9th grade), "Stories Come in All Shapes and Sizes" (10th grade), and "The Earth – Our Gift and Our Responsibility" (11th grade). Therefore, a third claim is that the UCR language proficiency test items address the topics and themes identified as important by the Ministry of Education, including, but not limited to, conflict resolution, democracy and democratic principles, economic development and environmental sustainability, blurring of national borders, and defense and protection of human rights. A fourth claim is that the test items address the three axes identified by the Ministry of Education: global citizenship with local belonging, education for sustainable development, and new digital citizenship. Finally, a fifth claim is that the test items reflect the three domains in which students must demonstrate English language proficiency: the socio-interpersonal, transactional, and academic domains.

The contextual needs addressed above will be further operationalized through items and instructions with simple and clear wording that not only meet the expected language proficiency levels of testees but also comply with the design requirements of trained language specialists. Brown and Abeywickrama (2019: 74) warned against wordiness, redundancy, and unnecessarily complex lexical items that might confuse the testee. Consequently, to ensure that test takers can understand what is expected of them, the language complexity of the instrument should correspond to that described in the respective CEFR band. The Text Inspector tool (Cambridge University Press, 2015) will be used to guarantee this match.
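A lexical check of this kind can be sketched as follows. This is a simplified, hypothetical illustration, not Text Inspector's actual interface: the wordlist and function name are invented, and a real check would draw on a full CEFR-graded vocabulary resource such as the English Vocabulary Profile.

```python
# Hypothetical sketch of a CEFR lexical-level check (not Text Inspector's API).
import re
from typing import List

CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]

# Toy wordlist for illustration only; a real resource covers thousands of entries.
CEFR_WORDLIST = {
    "menu": "A1", "price": "A1", "shop": "A1",
    "environment": "B1", "sustainability": "C1",
}

def words_above_band(item_text: str, target_band: str) -> List[str]:
    """Return the words in an item whose CEFR level exceeds the target band."""
    limit = CEFR_ORDER.index(target_band)
    tokens = re.findall(r"[a-z]+", item_text.lower())
    # Off-list words default to A1 here; a real check would flag them for review.
    return sorted({t for t in tokens
                   if CEFR_ORDER.index(CEFR_WORDLIST.get(t, "A1")) > limit})

# An item aimed at B1 readers should not rely on C1 vocabulary:
print(words_above_band("Choose the best menu price for sustainability.", "B1"))
# -> ['sustainability']
```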
Finally, as mandated by the Standards for Educational and Psychological Testing, items will be designed and proofread by trained specialists, some of whom are native English speakers and whose teaching expertise is also valuable (AERA et al., 2014: 75).

The format of the items, tasks, and questions will follow the guidelines in the Programas de Estudio de Inglés (Ministerio de Educación Pública, 2016). For example, the MEP (2016: 44) suggests the following items to assess reading comprehension proficiency in the classroom: "reading aloud, multiple choice, and picture-cued items. Selective reading performances are gap filling, matching tasks, and editing". The test will prioritize those tasks that lend themselves to large-scale, standardized, computer-based scenarios. The MEP also provides a list of possible tasks to assess the listening skill, such as summarizing, note taking, and identifying specific information (Ministerio de Educación Pública, 2016: 42). These tasks should all mimic real-life scenarios similar to those that students encounter in the classroom over their learning process.

3.5.2. Stakeholders' expectations

To further contextualize the test and accurately place it within the given ecosystem, stakeholders' views should also be carefully considered and analyzed. This analysis includes the information provided by two of the most relevant decision-makers: the national coordinator of the Alliance for Bilingualism at the MEP and the director of the School of Modern Languages at the UCR. The former acknowledged that the transition from a traditional reading comprehension test to a skills-based one took the MEP approximately 11 years, led by the Dirección de Gestión y Evaluación de la Calidad [Quality Management and Evaluation Department]. The first administration of the test aims to diagnose the English proficiency level of high school students; comparing these results against those to be obtained in the future will demonstrate the impact of the recently introduced Programas de Estudio de Inglés. Said impact could then be further analyzed through predictive validity evidence studies. In the words of the national coordinator of the Alliance for Bilingualism at the MEP (personal communication, November 5, 2020), this first test administration will also determine whether or not the MEP has the adequate physical and digital infrastructure to administer the test to more than 60 000 students. The coordinator also stated that this test will help other interested organizations evaluate their capacity to adapt to the potential scenarios that may arise when monitoring their tests at the MEP. For example, some students might need to take the test at facilities other than their school due to poor or no Internet connectivity. Considering the multiple circumstances that regions or institutions may face (e.g., an insufficient number of working computers, lack of personnel to supervise the test administration, or the variety of school types), several administration schedules should be arranged, for example, three or four different agreed-upon times according to the particular requirements of each institution.
Additionally, this stakeholder emphasized some of the requirements to be met by any candidate testing organization when working with the MEP: availability of immediate technical support, verifiable language testing experience, standardized administration protocols, and provision and modification of physical resources according to the particular needs of testees.

The director of the School of Modern Languages at the UCR, the second stakeholder interviewed in this study, highlighted several points (personal communication, December 15, 2020). The School of Modern Languages presents itself as a valuable participant in the process because of its significant technical and human capacity and its previous experience in standardized language assessment. The director underlined that the School has the necessary capacity, in terms of technology and human resources, to administer such a high-stakes test successfully. However, he also recognized that further support and investment from UCR authorities would be advisable to improve computer laboratories, security protocols, and data collection and analysis instruments. He also acknowledged the added value that collaborating with other University departments could provide in the future. In terms of staff, the institution has properly trained personnel for the successful administration of the test in any of the formats requested by the MEP, although additional support for the continuous training of these professionals would be advisable.

Regarding the School's capacity, the director affirmed that the University can administer this test at large scale three times a year, specifically examining 5 000 MEP students per day, with results delivered within three weeks.
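If these figures hold, and given the cohort of more than 60 000 students mentioned earlier, each administration window would require roughly 60 000 / 5 000 = 12 testing days, in addition to the three weeks needed to deliver results.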
Moreover, the School's technological know-how and experience in testing facilitate any specific adaptations and modifications that the MEP may require. The director also highlighted the importance of familiarizing the target population with the test by providing a customized digital mock test to bridge the gap between students' experience with classroom testing and computerized testing.

The director expressed his confidence in the reliability of the online listening and reading tests based on the evidence gathered, but he also considers that using printed instruments could be feasible, and equally reliable, depending on the needs of the target population. Both formats can be equally reliable in terms of data collection, provided they fulfill the safety and procedural protocols designed by the UCR.

Production skills may be evaluated in the future, although further training and studies are necessary to ensure the reliability and validity of the inferences based on student performance. As a novel feature, he elaborated on the possibility of using artificial intelligence in the future as a tool for developing and scoring UCR language tests.

This stakeholder believes the test should accurately measure testees' language level by evaluating their understanding of everyday and academic English, which is the core of the construct approved by the MEP for this test. This goal will be achieved using an instrument that includes 40 to 60 items per skill and takes approximately 60 to 70 minutes to complete per macro skill. The items may include multiple-choice, sequencing, matching, drag-and-drop, and short-answer formats; the specific items selected will depend on how they perform in pilot testing. Finally, the director added that the reading and listening source texts will be authentic, ecologically sensitive material matching the CEFR levels the test intends to assess. The operationalization of this test and its construct have been approved by the MEP.
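For illustration, the design parameters reported above can be gathered into a draft blueprint structure. The sketch below is hypothetical (the class and field names are invented, not the official PDL specification) and simply encodes the figures the director provided.

```python
# Illustrative draft blueprint encoding the parameters reported by the director;
# names and structure are hypothetical, not the official PDL specification.
from dataclasses import dataclass, field
from typing import List, Tuple

ITEM_TYPES = ["multiple-choice", "sequencing", "matching",
              "drag-and-drop", "short-answer"]

@dataclass
class SkillSection:
    skill: str                           # "reading" or "listening"
    items: Tuple[int, int] = (40, 60)    # 40-60 items per skill
    minutes: Tuple[int, int] = (60, 70)  # ~60-70 minutes per macro skill
    item_types: List[str] = field(default_factory=lambda: list(ITEM_TYPES))

# The current construct covers the two receptive skills only.
blueprint = [SkillSection("reading"), SkillSection("listening")]

for section in blueprint:
    lo, hi = section.items
    print(f"{section.skill}: {lo}-{hi} items, "
          f"{section.minutes[0]}-{section.minutes[1]} min")
```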
4. Next steps

Given the large-scale nature of this assessment and its implications, and as a pioneering enterprise, the institutions involved should work together in organizing the test logistics. The expertise of the parties (in this case, the MEP and the UCR) is key to successfully attaining the goals set by localizing the test. This would require the University of Costa Rica to conduct the following:

1) Arrange meetings with MEP representatives and decision-makers to agree on the operationalization of the test features.
2) Carry out needs analyses:
   a. Survey teachers of English in Costa Rican high schools regarding, among other topics, their technological literacy, type of instruction, expectations of the test, and attitude towards standardized testing.
   b. Gather information on students' views on and experience with standardized and computer-based testing, their familiarity with online testing and item types, and their preferred topics for English language proficiency assessment, among others.
   c. Interview regional and national English advisors to collect information regarding the availability of the human resources and infrastructure required for reliable test monitoring.
   d. Ensure that all students are treated fairly throughout the assessment process, having an unobstructed opportunity to demonstrate their level of English language proficiency.
   e. Conduct additional analyses of the MEP's English curricula to determine additional topics and domains that could be tested.
3) Organize and hold massive training programs for all stakeholders in this ecosystem.
4) Design the test around the language proficiency concept and its two main pillars according to the CEFR: communicative language activities and strategies, and communicative language competences.
5) Create a draft test blueprint to share with stakeholders to obtain their feedback before starting the item development phase.
6) Share the draft blueprint with key stakeholders and have them complete a survey to gather their opinion about its adequacy.
7) Pilot-test items with the real population and conduct statistical analyses to assess their usefulness and reliability before the official test administration. This step would ensure these items are fair to various subgroups (e.g., male/female, urban/suburban/rural, different racial/ethnic groups, low/high socioeconomic status) by conducting differential item functioning (DIF) analyses, as sketched below.
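One common technique for such DIF analyses is the Mantel-Haenszel procedure, which compares the odds of answering an item correctly across two groups after matching examinees on total score. The following sketch is a minimal illustration with fabricated toy data; it is not the project's actual analysis code, and the function name and data layout are invented for the example.

```python
# Minimal sketch of a Mantel-Haenszel DIF check; illustrative only, with
# made-up data, not the PDL's actual analysis pipeline.
from collections import defaultdict

def mh_odds_ratio(responses):
    """responses: list of (total_score, group, correct) tuples, where group is
    'ref' or 'focal' and correct is 0/1. Examinees are stratified by total
    score, the matching criterion, before the odds are compared."""
    strata = defaultdict(lambda: {("ref", 1): 0, ("ref", 0): 0,
                                  ("focal", 1): 0, ("focal", 0): 0})
    for score, group, correct in responses:
        strata[score][(group, correct)] += 1

    num = den = 0.0
    for cells in strata.values():
        t = sum(cells.values())
        if t == 0:
            continue
        num += cells[("ref", 1)] * cells[("focal", 0)] / t
        den += cells[("ref", 0)] * cells[("focal", 1)] / t
    # Ratios far from 1 flag potential DIF (ETS delta = -2.35 * ln(ratio)).
    return num / den if den else float("nan")

# Toy data: at equal ability (same total score), both groups behave alike,
# so the MH odds ratio should be near 1 and no DIF is flagged.
data = [(s, g, c) for s in (10, 20, 30)
        for g in ("ref", "focal")
        for c in (0, 1)
        for _ in range(25)]
print(round(mh_odds_ratio(data), 2))  # -> 1.0
```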
As Brown and Abeywickrama (2019), Fulcher (2010), and Coombe (2018) argued, building a localized validity argument for a national standardized test from scratch requires multiple steps and studies involving massive amounts of fieldwork to meet the particular needs and characteristics of the context and population assessed. By localizing the English proficiency test to meet the specific needs of Costa Rican high school students, the latter should come to consider the test fair after realizing it was not developed lightly but resulted from careful consideration and design, as Fulcher (2010: 4) recommended. In turn, this may help neutralize the generalized negative perceptions of standardized testing. Since this is a customized test, it will need to address the country's needs, lacks, and wants in foreign language standardized testing by basing its tests on the MEP's unit contents, theoretical constructs, and item familiarity.

The locality-sensitive assessment will be produced in parallel with the new national policy of bilingualism (Ministerio de Educación Pública, 2016), under which, in agreement with Badger and Yan (2012) as well as Brown and Abeywickrama (2019), students should learn to use the language. This rationale underlies the UCR's choice of the skills-based assessment provided by the CEFR, which emphasizes and evaluates testees' competences. The customized nature of the test would not only reduce the anxiety and fear of those involved (managers and students) but would also help obtain more precise evidence of testees' performance in the receptive language skills. This is indeed the current communicative concept of language assessment, as advocated by Canale and Swain (1980), Brown and Abeywickrama (2019), and Jamieson et al. (2008).

The results from this test will be diagnostic (see Brown & Abeywickrama, 2019: 10), which would, in turn, provide authorities with a clearer perspective of the system's strengths and weaknesses insofar as such interpretation is aligned with the theoretical construct of the test (Carr, 2015; Fulcher, 2010).

Since validation is a never-ending process (Chapelle, 2008; Brown & Abeywickrama, 2019), this pioneering nationwide standardized testing exercise is an ongoing project that has only just taken its first step in standardized language testing in Costa Rica and Latin America.

5. Recommendations

The following recommendations are addressed to researchers developing localized, standardized language tests.

1) Researchers should review international guidelines on developing standardized tests, such as those issued by institutions like ILTA, ALTE, and the APA. Guides such as the Standards for Educational and Psychological Testing are user-friendly starting points for researchers in the field.
2) Localizing a standardized language test requires more than designing an assessment instrument for a specific population. As outlined above, this continuous process should be carried out hand in hand with stakeholders from the beginning, especially students. Given the short- and long-term impact of these tests, and since multiple actors will be involved in the process, researchers are advised to consider the opinion of all stakeholders before making any decisions.
3) Researchers should treat the input of some stakeholders with caution. For example, some may over- or underrepresent the needs, lacks, or preferences of their particular context. Consequently, it is imperative to corroborate the information with real-time observations and multiple sources to confirm the test requirements.
4) Investigators are advised to seek the assistance of language-testing specialists during the development of their own standardized tests. These specialists can help investigators solve issues they may have already dealt with previously. There is nothing wrong with asking for help when it comes to such high-stakes tests.
5) Institutions aiming to develop standardized language tests may consider certifying their language professionals in the different areas they intend to test. For instance, ACTFL offers international certification for professionals who want to become official certified testers of English (for oral and written production). Having certified testers on the team constructing the test would provide valuable support to the process of developing, pilot-testing, and assessing the performance of items designed to measure those skills.
6) If an institution is planning to develop a standardized language test, it should consider the available human resources. Since this is a continuous process, it is convenient to have separate team members in charge of the different tasks related to the test, so as not to burden them with excessive workloads. For example, one group of language specialists could be dedicated to item writing; another, to item analysis; and yet another, to collecting evidence for the multiple claims. Assigning all of these tasks to the same team may lead to burnout among its members.

6. References

American Educational Research Association (AERA), American Psychological Association (APA), National Council on Measurement in Education (NCME), & Joint Committee on Standards for Educational and Psychological Testing (US) (2014). Standards for educational and psychological testing. Washington: American Psychological Association.

Association of Language Testers in Europe (2020). ALTE principles of good practice. https://pt.alte.org/resources/Documents/ALTE%20Principles%20of%20Good%20Practice%20Online%20version%20Proof%204.pdf

Azofeifa, Mauricio (2019, February 11). Más de 5.300 estudian inglés gracias a Alianza para el Bilingüismo (ABi). Ministerio de Educación Pública, Gobierno de Costa Rica. https://www.mep.go.cr/noticias/mas-5300-estudian-ingles-gracias-alianza-bilingueismo-abi

Bachman, Lyle (1990). Fundamental considerations in language testing. New York: Oxford University Press.
Badger, Richard, & Yan, Xiaobiao (2012). To what extent is communicative language teaching a feature of IELTS classes in China? In Jenny Osborne & IDP: IELTS Australia (Eds.), IELTS research reports 2012 (Vol. 13, pp. 1–44). Australia, United Kingdom: IDP: IELTS Australia Pty Limited, British Council.

Brown, H. Douglas, & Abeywickrama, Priyanvada (2019). Language assessment: Principles and classroom practices (3rd ed.). Hoboken: Pearson Education.

Cambridge University Press (2015). Text Inspector. https://languageresearch.cambridge.org/wordlists/text-inspector

Carr, Nathan T. (2015). Designing and analyzing language tests. Oxford: Oxford University Press.

Chapelle, Carol A. (2008). The TOEFL validity argument. In Carol A. Chapelle, Mary K. Enright & Joan M. Jamieson (Eds.), Building a validity argument for the Test of English as a Foreign Language (pp. 319–352). New York: Routledge.

Chapelle, Carol A. (2012). Validity argument for language assessment: The framework is simple… Language Testing, 29(1), 19–27.

Coombe, Christine (2018). An A to Z of second language assessment: How language teachers understand assessment concepts. London: British Council.

Cordero, Monserrat (2019, November 29). MEP: colegios públicos tienen nivel básico en dominio del inglés. Semanario Universidad. https://semanariouniversidad.com/ultima-hora/mep-colegios-publicos-tienen-nivel-basico-en-dominio-del-ingles/

Council of Europe (2002). Common European framework of reference for languages: Learning, teaching, assessment. Cambridge: Press Syndicate of the University of Cambridge.

Fulcher, Glenn (2010). Practical language testing. London: Hodder Education.

Hooge, Edith; Burns, Tracy, & Wilkoszewski, Harald (2012). Looking beyond the numbers: Stakeholders and multiple school accountability (OECD Education Working Papers No. 85). Paris: OECD. http://dx.doi.org/10.1787/5k91dl7ct6q6-en

International Language Testing Association (2018). ILTA code of ethics. https://www.iltaonline.com/page/CodeofEthics

Jamieson, Joan M.; Eignor, Daniel; Grabe, William, & Kunnan, Antony John (2008). Frameworks for a new TOEFL. In Carol A. Chapelle, Mary K. Enright & Joan M. Jamieson (Eds.), Building a validity argument for the Test of English as a Foreign Language (pp. 55–95). New York: Routledge.

Messick, Samuel (1995). Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741–749. https://doi.org/10.1037/0003-066X.50.9.741

Messick, Samuel (1996). Validity and washback in language testing. Language Testing, 13(3), 241–256. doi:10.1177/026553229601300302

Ministerio de Educación Pública (2016). Programas de estudio de inglés: tercer ciclo y educación diversificada. Costa Rica: Imprenta Nacional.

North, Brian; Piccardo, Enrica, & Goodier, Tim (2018). Common European Framework of Reference for Languages: Learning, teaching, assessment. Companion volume with new descriptors. Strasbourg: Council of Europe.

O'Sullivan, Barry (2016). Adapting tests to the local context. In British Council new directions in language assessment: JASELE journal special edition (pp. 145–158). Tokyo: Japan Society of English Language Education, British Council.
Savignon, Sandra J. (1985). Evaluation of communicative competence: The ACTFL provisional proficiency guidelines. The Modern Language Journal, 69(2), 129–134. doi:10.1111/j.1540-4781.1985.tb01928.x

Van Houten, Jacque, & Shelton, Kathleen (2018, January). Leading with culture. The Language Educator. https://www.actfl.org/sites/default/files/tle/TLE_JanFeb18_Article.pdf