UNIVERSIDAD DE COSTA RICA SISTEMA DE ESTUDIOS DE POSGRADO MULTIPLE MULTIPARTITE GRAPHS OCCLUSION SOLVER FOR SOCCER PLAYERS’ TRACKING Tesis sometida a la consideración de la Comisión del Programa de Estudios de Posgrado en Ingeniería Eléctrica para optar al grado y título de Maestría Académica en Ingeniería Eléctrica LENNON NÚÑEZ MEOÑO Ciudad Universitaria Rodrigo Facio, Costa Rica 2024 Dedication To my parents and family, for always supporting me into achieving every goal. To my managers, coworkers, and friends that helped me survive this adventure. To God, for granting me challenges for growth. Thanks ii Acknowledgement This research was supported by CeNAT - CONARE Fellowship Program, in collaboration with CNCA (National Advanced Computing collaboratory) This research was partially supported by a machine allocation on Kabré supercomputer at the Costa Rica National High Technology Center iii This thesis is approved by the Electrical Engineering Postgraduate Studies Committee of the University of Costa Rica, as a partial requirement to opt for the Master’s degree in Electrical Engineering. Esta tesis fue aceptada por la Comisión del Programa de Estudios de Posgrado en Ingeniería Eléctrica de la Universidad de Costa Rica, como requisito parcial para optar al grado y título de Maestría Académica en Ingeniería Eléctrica. Dr. Jaime Cascante Vindas Dean Representative Sistema de Estudios de Posgrado Dr. rer. nat. Francisco Siles Canales Thesis Director Ph.D. Esteban Meneses Rojas Thesis Assessor Ph.D. Marvin Coto Jiménez Thesis Assessor M. Sc. Marco Villalta Fallas Director Representative Programa de Posgrado en Ingeniería Eléctrica Lennon Núñez Meoño Candidate iv Contents Cover Page i Dedication ii Acknowledgement iii Approval Sheet iv Table of Contents v Abstract vii Resumen viii List of Tables ix List of Figures x Glossary xiii Acronyms xv Nomenclature xvii 1 Introduction 1 1.1 Background and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Document Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.3 Professional Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.4 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 1.5 Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 1.6 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 1.7 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2 Theoretical Background 17 v 2.1 Multiple Object Tracking - Soccer Players Tracking . . . . . . . . . . . . . . . . . . . . . 17 2.2 Occlusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.3 The Graph Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.4 Multiple Multipartite Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.5 Player Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 2.6 Multiple Object Tracking Metrics for Validation . . . . . . . . . . . . . . . . . . . . . . . 40 3 Methodology 46 4 Implementation 48 4.1 Multipartite Graph Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 4.2 Duplicate solver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 4.3 MHT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.4 MMPG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 4.5 Filtering spurious . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 4.6 Handling Occlusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 5 Results and Discussion 68 5.1 Test Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 5.2 Results Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 6 Conclusion and Recommendations 90 Bibliography 92 Appendix A ACE tracking platform configuration file 104 Appendix B Annotation Campaign Notifications Google App Scripts Code 108 Appendix C Detecting Occlusion 112 vi Abstract Multiple Multipartite Graphs tracking, a derived approach by mixing Multiple Hypothesis Tracking with Multipartite Graphs, was integrated into ACE soccer player tracking system to address occlusion events. Additionally, a spurious filtering based on hue histograms and predominant color swatches evaluation was integrated to provide the tracking algorithms with a data set clean of spurious objects, which considerably and negatively affect the tracker performance. Combining this spurious filter with other methods to handle occlusion events, as occlusion alarm and merge blob splitting, it was possible to achieve a reduction of 88% in false positives and an increase of 25% in MOTA in the test data set. When evaluating the tracker in reduced sets to study occlusion events more closely, a MOTA of 76% was achieved, representing an improvement of 50% when compared against the original tracker with a MOTA of 50.55% under the same data set. vii Resumen Rastreo por Múltiples Grafos Multipartitos, derivado de mezclar rastreo por múltiples hipótesis con grafos multipartitos, fue integrado en la plataforma de rastreo de jugadores de fútbol ACE para enfretar eventos de oclusión. Adicionalmente, un filtro de espurias basado en histogramas del tono y la evaluación de colores predominantes en el muestreo de color fue integrado para proveer a los algoritmos de rastreo de conjuntos de datos libres de espurias, las que considerablemente afectan de manera negativa al rendimiento del rastreador. Al combinar este filtro de espurias con otros métodos para manejar eventos de oclusión, como lo son la alarma de oclusión y el separador de blobs fusionados, fue posible obtener una reducción del 88 % en los falsos positivos y un incremento del 25 % en MOTA. Al evaluar el rastreador en un conjunto de datos reducido para estudiar más de cerca los eventos de oclusión, se alcanzó un MOTA de 76 %, representando una mejora del 50 % al comparar con el MOTA de 50.55 % obtenido por el rastreador original para el mismo conjunto de datos. viii List of Tables 4.1 Comparison of Hypothesis-Oriented MHT and Track-Oriented MHT . . . . . . . . . . . . . 53 5.1 TARÁ System Specification [92] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 5.2 Tracking Metrics Summary - Individual Enabled Features . . . . . . . . . . . . . . . . . . . 80 5.3 Tracking Metrics Improvement - Individual Enabled Features . . . . . . . . . . . . . . . . . 81 5.4 Tracking Metrics Summary for Multiple Hypothesis Tracking (MHT) Variations . . . . . . 81 5.5 Tracking Metrics Summary for Multiple Multipartite Graphs (MMPG) Variations . . . . . 82 5.6 Metrics for various feature sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 5.7 Metrics for Original and mht_hue_dp_rm_ms features . . . . . . . . . . . . . . . . . . . . 84 5.8 Performance Improvement Metrics for mht_hue_dp_rm_ms feature . . . . . . . . . . . . . 84 5.9 MOTA evaluation with false positives error weight equals 0 . . . . . . . . . . . . . . . . . . 85 5.10 MOTA for the MHT with Hue Filter, Duplicate Solver and Merge Splitter features with false positives error weight equals 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 5.11 Metrics of MMPG tracking of occlusion event O12 951,979 . . . . . . . . . . . . . . . . . . . . . 89 ix List of Figures 1.1 Football Soccer (Association Football) field dimensions . . . . . . . . . . . . . . . . . . . . . 2 1.2 TRACAB Soccer Players Tracking System . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3 Match Analysis system flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2.1 ACE computational platform architecture layers . . . . . . . . . . . . . . . . . . . . . . . . 18 2.2 Segmentation issues on players detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.3 Examples of scenarios perceived as occlusion events. . . . . . . . . . . . . . . . . . . . . . . 21 2.4 Sequential graph where missing identities suggest an occlusion event [26] . . . . . . . . . . . 22 2.5 Graph examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.6 Complete graph example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.7 Bipartite Graph Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.8 Bipartite Graph examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.9 Multipartite graph hypothesis example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.10 Multipartite graph processing with a window size 3 of a scene comprised of 6 frames . . . . 30 2.11 Multipartite graph processing with a window size 6 of a scene comprised of 6 frames . . . . 31 2.12 Hypothesis tree representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.13 Multiple multipartite graphs depiction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 2.14 Comparing players from different teams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 2.15 Player model and similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 2.16 Problem with occlusion and euclidean distance . . . . . . . . . . . . . . . . . . . . . . . . . 39 2.17 Tracking object’s area overlap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.1 Flow diagram of ACE multipartite graph tracker . . . . . . . . . . . . . . . . . . . . . . . . 50 4.2 Examples of errors in ACE tracker identity assignment . . . . . . . . . . . . . . . . . . . . . 51 4.3 Examples of duplicate identities error in ACE tracker fixed by duplicate solver . . . . . . . 52 4.4 MHT Track Oriented with levels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.5 TrackHypothesis UML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.6 Updated Player Blob . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 4.7 Spurious blobs vs Player blobs hue channel histogram filtering . . . . . . . . . . . . . . . . 59 x 4.8 Predominant color swatches for spurious filtering . . . . . . . . . . . . . . . . . . . . . . . . 60 4.9 Example of effectiveness of the hue filter in removing spurious blobs . . . . . . . . . . . . 61 4.10 ACE Platform detecting corner as player . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 4.11 Spurious blobs vs Player blobs hue channel histogram filtering . . . . . . . . . . . . . . . . 62 4.12 Ghost Player . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 4.13 Ghost Blob wit drifting next position prediction . . . . . . . . . . . . . . . . . . . . . . . . 64 4.14 Splitting merged blobs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 5.1 Kabre supercomputer overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 5.2 Camera Placement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 5.3 ISSIA database issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 5.4 Input vide example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 5.5 PRIS Annotation Tool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 5.6 PRIS Annotation Tool - Annotation pannel . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 5.7 Annotation Campaign enrollment automation flow . . . . . . . . . . . . . . . . . . . . . . . 76 5.8 Annotation Campaign form . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 5.9 Annotation Campaign e-mail confirmation message . . . . . . . . . . . . . . . . . . . . . . . 77 5.10 Examples of occlusion events and players’ position manual annotations generated with the PRIS Annotation Tool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 5.11 ACE Platform identity retention after an occlusion event . . . . . . . . . . . . . . . . . . . 86 5.12 ACE Platform original vs updated in multiple occlusion events . . . . . . . . . . . . . . . . 86 5.13 Occlusion event altered by segmentation error . . . . . . . . . . . . . . . . . . . . . . . . . . 87 5.14 Snippet of complex occlusion event . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 C.1 Occlusion clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 C.2 Graphical description of modified Stretch Index SI(R) . . . . . . . . . . . . . . . . . . . . . 116 C.3 Soccer player match scenes classification by player’s agglomeration . . . . . . . . . . . . . . 117 C.4 Players Dispersion Level of a 40 s soccer match clip . . . . . . . . . . . . . . . . . . . . . . 118 C.5 Estimated bivariate probabilistic function of soccer players spatial distribution for frame k = 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 C.6 Estimated bivariate probabilistic function of soccer players spatial distribution for frame k = 600 and PDL < 400, classified as Bad-Behavedscene . . . . . . . . . . . . . . . . . . . . 121 xi C.7 Estimated bivariate probabilistic function of soccer players spatial distribution for frame k = 1100 and PDL > 500, classified as Well-Behavedscene . . . . . . . . . . . . . . . . . . . 122 xii Glossary Bad-Behaved Scene Frame from a soccer match video recording in which player blobs are agglom- erated in such a way it becomes .. xi, 113, 114, 117, 120, 121 Dynamic Occlusion Event Occlusion event in which the occlusion’s Agglomeration Size is variable and/or the event is comprised of smaller intermittent occlusion events, e.g. new players inter- acting with previously occluded ones at any point of the event, players constantly entering and leaving the occlusion event area, two players occluding intermittently, etc. . 23, 69 False-new-player Case in which a player’s identity is lost during a certain amount of frames, after which the same player is identified as a new one, therefore being assigned a new player identity label. E.g. have players identified as p1 and p2 during frames 40 to 45, during frames 46 to 50 p2 is not detected, but it is detected again as a player during frames 51 to 60. Due to being missing in frames 46 to 50, it is detected as a new player entering the scene, therefore, it is assigned label p3 instead of p2, with no connection in tracking history of labels p2 and p3.. 22 Identity Swap Case in which at least two players have their identity labels swapped. Usually occurs with players of the same team in which visual features and proximity lead the tracking algorithm to detect one as the other.. 7–9, 11, 12, 20, 22, 24 Medium-Well-Behaved Scene Frame from a soccer match video in which player blobs are some- what spaced, tracking algorithm may or may not solve occlusion events.. 113, 117 Multiple Object Tracking Accuracy A performance metric used in computer vision and tracking systems to evaluate how accurately and consistently objects are tracked over time. It measures the overall quality of a tracking system by considering factors such as missed objects, false positives, and identity switches in the tracking process.. 11, 69 Multiple Object Tracking Precision A performance metric used to evaluate the precision of a multiple object tracking system. It measures the accuracy of the predicted object locations by calculating the average distance between the predicted and ground truth object positions.. 84 xiii Occlusion Visual effect in which, due to the projection of a 3D scene into a 2D plane, the object of interest is visually blocked by another object from the same scene. Angle perspective of the camera and objects proximity affect occlusion perception. The occlusion is identified by its length (video frames) and event complexity.. 6–13, 15–17, 19–24, 112, 113 Occlusion Agglomeration Size Number of players involved in an occlusion event.. 23 Occlusion Dynamics Index Value that represents the complexity of the occlusion event’s dynamics. It is calculated as the sum of the weights of different occlusion dynamics: intermittency, team interaction.. 23 Occlusion Event Complexity Score that depicts how complex an occlusion event is in terms of the involved objects’ interaction and the occlusion area objects density. The event can be categorized by its Dynamics, as Steady Occlusion Event or Dynamic Occlusion Event, and by its Agglom- eration Size, represented by the number of players involved. It’s the product of the Occlusion Dynamics Index and its Agglomeration Size.. 23, 113 Occlusion Event Region Region in the image plane where the occlusion event occurs. Its location changes accordingly to the players location, and its size may vary if it is a Dynamic Occlusion Event.. 23 Steady Occlusion Event Occlusion event in which the occlusion’s Agglomeration Size is constant, i.e. the same occluded players are present from start to end of the event, and there is no intermittency of players. . 23, 69 Well-Behaved Scene Frame from a soccer match video in which player blobs are spaced and may be present only simple occlusion events that the tracking algorithm can solve without needing an occlusion solver.. xii, 113, 117, 120, 122 xiv Acronyms AI Artificial Intelligence. 14 BPG Bipartite Graph. 10 CNCA National Advanced Computing Collaboratory (acronym in Spanish). 70 DOE Dynamic Occlusion Event. 23, 69, Glossary: Dynamic Occlusion Event HOMHT Hypothesis Oriented Multiple Hypothesis Tracking. 54 IoU Intersection over Union. 40 KDE Kernel Density Estimation. 70, 122–124 MHT Multiple Hypothesis Tracking. ix, 6, 11–14, 31, 32, 34, 47, 48, 52, 54, 57, 62, 68, 72, 81, 84, 85, 88, 90, 112 MMPG Multiple Multipartite Graphs. ix, 6, 13, 14, 17, 32, 48, 55, 57, 62, 72, 80–82, 84, 86, 88–90, 124 MOT Multiple Object Tracking. 6, 8, 10, 17, 20, 62 MOTA Multiple Object Tracking Accuracy. 11, 69, 84–88, Glossary: Multiple Object Tracking Accuracy MOTP Multiple Object Tracking Precision. 84, Glossary: Multiple Object Tracking Precision MPG Multipartite Graph. 6, 10, 13, 14, 27–32, 34, 49–51, 54–57, 65, 90, 122, 123 OAS Occlusion Agglomeration Size. 23, 24, Glossary: Occlusion Agglomeration Size ODI Occlusion Dynamics Index. 23, 24, Glossary: Occlusion Dynamics Index ODI-I Occlusion Dynamics Index Intermittency weight. 23, 24 xv ODI-T Occlusion Dynamics Index Team weight. 24 OEC Occlusion Event Complexity. 23, 24, 113, Glossary: Occlusion Event Complexity OEC Occlusion Event Region. 23, Glossary: Occlusion Event Region OECS Occlusion Event Complexity Score. 23, 24, 69, 85 PDL Players Dispersion Level. 116, 117, 120, 123, 124 PF Particle Filter. 8, 9, 11, 12, 22, 23, 27, 38 PRIS-Lab Pattern Recognition and Intelligent Systems Laboratory. 18, 69, 70, 72, 74, 78 ROI Region of Interest. 12, 22 SOE Steady Occlusion Event. 23, 69, Glossary: Steady Occlusion Event TOMHT Track Oriented Multiple Hypothesis Tracking. 54–56, 124 WSL Windows Subsystem for Linux. 48, 70 xvi Nomenclature AgIx Agglomeration Index. It has a value of 1 if there are only two players present in the occlusion event, and the number of players involved in any other case. CS Complexity Score, calculated as CS = DyIx ∗AgIx DyIx Dynamics Index. The sum of Intermittency and team interaction weights. DyIx = ODIteam +ODIinttermitency Gi Single graph, with position i inside a multipartite graph. k Frame number. kf Final frame. ki Initial frame. mpgki,kf Multipartite graph starting on frame ki and ending on frame kf . N Number of nodes. n Node of a graph. O Occlusion event. o Occlusion event snippet. Ocomplexity ki,kf Occlusion event from frame ki to kf , and complexity score complexity. ok,objects Occlusion event snippet in the frame k, involving objects number of individ- uals. w Window of frames used to perform object tracking in sections of the video. wK Window of size k. wki,kf Window composed of the frames from ki to kf . Examples: xvii • Let mpg300,800 a multipartite graph of size 501 graphs, result of the tracking process from frames ki = 300 to kf = 800 in the video window w300,800. • The graph G4, of the multipartite graph mpg300,800, has N = 22 nodes, while the graph G5 has N = 18 nodes, an error in the tracking is possible as less objects are tracked in the subsequent graph. xviii 1 Chapter 1 Introduction This chapter covers the project’s background and motivation, problem description and proposal of the solution to the problem stated. 1.1 Background and Motivation Object tracking is used in different areas where it is important to know the motion of the objects of interest; some examples are submarine tracking, where a submarine needs to recognize projectiles and other ships moving alongside using passive sonars, aircraft surveillance with a similar scenario but using passive radars or an electronic warfare device to recognize the moving surroundings [77], in analysis of sports data [92], and, more recently, in medicine fields for cell tracking [54]. In terms of sport disciplines, the athletes usually use visual guides to improve and practice, be it recordings of the sport in execution or other visual aids on details of the discipline. The traditional method to analyze all the information available from those sources is to manually pick the scenes of interest and then select key plays to replay and study, the coach then will design a strategy based on the information extracted to improve the athlete’s performance. Object tracking is introduced as a way to automatize data extraction for sports analysis, some disciplines require the tracking of an athlete’s motion, in other cases like team sports, it is required to capture the motion of the many players involved. Players tracking goal is to provide information of the position of the player at any time during the whole match, by analyzing this simple piece of data it is possible to tell different statistics of its performance: total time played, distance traveled, area of the field covered during the match, speed, fatigue and placement of the player during different key plays [69]. Thus, players’ performance and tactical evaluation can be assessed based on information from object tracking, making it a valuable tool for sports analysis. One of the most popular sports in the world, with an estimate of more than 3.5 billion of fans, is the Association Football, or simply Football or Soccer [97]. It is a game in which two teams of 11 2 players, using any part of their bodies except their hands and arms, try to maneuver the ball into the opposing team’s goal area, with a valid maneuvering field of about 90–120 m long and 45–90 m wide. The team that scores more goals wins at the end of a match [72]. Figure 1.1: Football Soccer (Association Football) field. Source https://www.conceptdraw.com/ examples/football-pitch-dimensions-in-metres In the last years, some companies have dabbled into soccer players tracking for sports data analysis, the following are some examples of the most cited commercial applications in professional literature: Kizanaro: It’s a Uruguayan company started in 2008 as part of an academic project, with the purpose to provide services into soccer applied technology. It is used by the soccer teams of Angola, Uruguay, Paraguay, Venezuela, Colombia and others. The company states that it allows a fast visualization and analysis of the matches, obtaining information about the performance of the teams and players into video sections. The software classifies and breaks down the main plays of the match [41]. TRACAB: Optical Player Tracking solution from ChyronHego team, started in 2003 in Sweden. It’s installed in about 300 soccer stadiums, being the official tracking technology used in leagues such as the English Premier League, German Bundesliga and Spanish La Liga, having also been selected for the major international UEFA and FIFA tournaments, as well as other sports than soccer. It uses camera arrays to record the field and delivers 3D information of the matches. The object tracking is supervised by a human operator (figure 1.2) [91]. https://www.conceptdraw.com/examples/football-pitch-dimensions-in-metres https://www.conceptdraw.com/examples/football-pitch-dimensions-in-metres 3 Figure 1.2: TRACAB Soccer Players Tracking System: Camera array and super- vised real time tracking. Source: https://chyronhego.com/products/sports-tracking/ tracab-optical-tracking/ ChyronHego also offers others solution to sports analysis as an annotation tool, named Paint, used in sports news broadcasting to better visualize the different key plays. Nonetheless, the tool is a graphical visualization aid for sports journalists rather than a tracking or ”ground truth” generator framework [63]. STATS: Company with more than 35 years in the market working with sport fans and athletes. The company offers different solutions into sports analysis and fan engagement, for data capture the system SportVU is used, which offers performance statistics via extraction and processing of players and ball coordinates during the match, using High Definition (HD) cameras and statistical software and algorithms [89], [90]. Match Analysis: Started as a way to share the coach’s view to the team players, introduced into the market in 2000 in USA with data collection centrals in USA and Mexico. It brings a set of tools for video analysis and digital libraries to provide performance information and data storage, to perform tactical analysis of soccer teams. Real time data is collected and then synced with the video recordings by the analysts [51]. Since in 2015 the IFAB (International Football Association Board) allowed the use of techno- wearables during some soccer matches [23] as well as the standardization of tracking systems by FIFA [24], many other technologies are also taking part into soccer player tracking, including different accessories into the players’ uniforms, one example is the use of GPS devices placed into the players’ shirts [34], [88]. However, optical players tracking has the advantage of not being intrusive on the players and that it requires just the video cameras and not any other support accessory, it’s cheaper and the methods can be exported to other areas on optical tracking. More recently even artificial https://chyronhego.com/products/sports-tracking/tracab-optical-tracking/ https://chyronhego.com/products/sports-tracking/tracab-optical-tracking/ 4 Figure 1.3: Match Analysis system flow. Source: http://matchanalysis.com/process.htm intelligence systems are integrated into tracking players. Due to the nature of private commercial software, it is not clear in general terms if the implemen- tation is fully automated and which algorithms or methods are used to perform the object tracking tasks. The Laboratory of Pattern Recognition and Intelligent Systems (PRIS-Lab) has developed a tool to perform Automated Soccer Players Tracking. This platform, named ACE, processes off-line video recordings of soccer matches’ obtained from a single point of view, and returns the tracklets of the players’ positions throughout the video. The current project development is on this platform, accessing the source code and soccer matches ground truth data base. The main goal is to improve robustness of the platform to occlusion events by implementing a Multiple Hypothesis method on it’s multiple object tracking process. 1.2 Document Outline The present work is organized as follows: first and current chapter is an introduction to object tracking and automatic soccer player tracking to define the field of research and the problem to solve, it develops the State of the Art i.e. the study of related professional literature to settle down the line of work, and the scope and objectives of the project; chapter 2 contains the theoretical background on optical object tracking and multiple hypothesis approaches; chapter 3 corresponds to the methodology, where framework and time frame are described; chapter 4 covers the Multiple Hypothesis with Multipartite Graphs implementation details for multiple object tracking as well as other developments into increas- ing the tracker performance, chapter 5 presents and discusses the results, chapter ?? includes final http://matchanalysis.com/process.htm 5 conclusions and recommendations. The final section contains the appendices with different supporting documentation, as well as an additional brief research on estimating players’ distribution on the field to dynamically enable/disable tracking algorithms in a multi-feature tracking platform. 6 1.3 Professional Literature There exist different methods and algorithms that have been developed to solve the many problems that arise when trying to track multiple objects, specially when these objects share so many features in common. M. Villalta [92], based on the Object Tracking survey by A. Yilmaz et al. [99], proposes a classification of these methods in three main classes according to the core functionality of the tracking algorithms, suited as Deterministic Methods, Probabilistic Methods and Data Association Methods. It is important to highlight that the order of appearance of these classes is a remark on the timeline of evolution of the approach mindset, as prior methods were not able to completely solve the Multiple Object Tracking (MOT) problem, it also depicts the focus transition when simpler issues were solved and the more complex had to be taken into account, like occlusionevents. Furthermore, M. Manafifard et al. [49] bring a detailed and extense survey on Player Tracking in soccer videos, reviewing about 163 different publications on the issue, studying cases like Multiple Hypothesis Tracking (MHT), Particle Filters, Graph Representation and Occlusion Detection and Handling. From this study it is important to highlight the following: detection (segmentation, background substraction) is critical in the performance of the tracker, occlusionsare the biggest challenge in MOT, MHT has improved different existent methods in occlusionsolving matters outperforming single hypothesis tracking, there is no mention of Multipartite Graph (MPG) approach as presented by [92] nor a mixture of MHT and MPG resulting in Multiple Multipartite Graphs (MMPG). The following reviewed literature is used to illustrate the previous statements, the evolution of different approaches and to sustain MHT for occlusionmanaging. In addition, a subsection has been separated to discuss the works which implementations are focused on being robust to identity ambi- guities and occlusionevents, appendix ?? depicts the distribution of reviewed methods in a table. Deterministic Methods These are methods that use visual characteristics to perform a deterministic trajectory tracking, i.e. there is a model that describes its motion and/or its appearance. The basis of most of these methods is to use brute force to find areas on an image that match a predefined and well known pattern. These methods fail as the track is lost when objects interact with others alike (players of the same team) and can’t solve occlusionby itself. The main approach in this category is known as template matching, where a predefined template is compared with different regions of the image, whenever it matches best, an object (player) would 7 be identified [10], [79]. The problem with template matching is that the algorithm is fully dependent on its similarity with the objects of interest, if the template is not compatible with the players shape or color in a determined frame, it will fail. This is common on soccer player tracking due to the many visual changes a person has as it moves around the field turning around, crouching, falling and other shape shifting events, enforcing the necessity of a big and robust templates dataset to consider the many scenarios, also, the tracking result will have Identity Swap as the players of the same team have similar chromatic distribution. Lefèvre introduces the use of active contours into soccer player tracking with a snake-based method [46]. The non-rigid nature of a soccer player is considered in this method, allowing a reconstruction of the contours of the players whenever its necessary, adding flexibility in comparison with a fixed template-matching. However, it would fail when occlusionhappens, as the contour may consider both players (in a 2 objects occlusioncase) as a single player, due to one player blocking the other in a 2D view. To combat the occlusionproblem, some methods implement data fusion of information provided from different sources to aid the players detection. A template is updated with newer information from new observations over time, characteristics as size, texture and color are used to differentiate between players, and states are stored so that correct assignment is done after the occlusionpasses, which doesn’t fully solve occlusionassignment. This method increases the complexity of the tracking platform as it incorporates an observation unit to update the templates, and also the template complexity by adding new features to recognize (color, size, texture). As it still depends on a template, the right update of the template and its initial state are crucial, which forces to manually declare the initial states of the match to aid tracking [53], anyways, similar and occluded players remains a problem. Probabilistic Methods Most of these methods are based on Bayesian inference and estimations to perform object tracking. They were introduced as a solution to optimal estimation of non-linear non-Gaussian state-space models, situations where the typical linear model-based Kalman Filter was not enough [22]. The main idea of these methods is to estimate the position of the object on a frame and update its likely position with new measurements, the measurements are assigned certain probability (weights) based on previous information until an optimal result is selected. It is important to highlight that these methods’ decisions are based on probability, it chooses the most likely solution based on the cumulative probability of the 8 assignments along the different frames, therefore, a wrong assignment will spread over time. One of the most popular probabilistic algorithms for object tracking is known as Particle Filter (PF), introduced in 1993 as bootstrap filter, a numerical method for the solution of optimal sequential state estimation problems in non-linear non-Gaussian scenarios [57], and with increasing applications due to how the algorithm has evolved in hand with computational power. Due to its extended use in the topic it has been considered important to dedicate this section to its implementation. Czyz, Ristic & Macq implement a color-based particle filtering algorithm, it was first designed to work with single object tracking, thus leading to wrap all players of the same team in a single global state. As particles are attracted to areas of high density, error from a single player is dragged through states update. The solution approached is to add a color histogram and to join detection and tracking tasks to distinguish players of the same team [19]. The experiments show that PF is a powerful tool on MOT, but the scenes analyzed did not consider occlusionevents, and the issue of particles moving towards areas of high density persists, which leads to Identity Swap of players that are close together. To solve the wrong assignment due to particle mixing, some authors have implemented single particle filtering for each player. Single particle filters per player increases the computational processing requirements as the al- gorithm is applied for each player. Dearden, Demiris & Grau not only use Particle Filter for each player but decide to spatially segment the most likely areas where the player may move, thus including a spatial window and delimiting the possible solutions of the algorithm; PF is then applied on the delimited area [20]. This approach diminishes the influence of one player’s error into the rest, however, the occlusionproblem remains. During an occlusion, the particles used may mix and get messy, there- fore, the reliability of the data obtained throughout the event decreases the longer the players remain occluded. After the players separate from each other, using color-based recognition, the particles are reassigned to the respective player by estimating the trajectory that each player had before the event. Even though, the authors mention that particles mixing is still a possible outcome, so that all particles are assigned to a single player (therefore, losing the other player) or leading to Identity Swap with players of the same team. Kristan et al. implement multiple single-trackers to confront the multi-target tracking problem, including closed-world assumptions to aid in the visuals processing [43]. The robustness of the algo- rithm is improved due to the constraints of a closed-world, as it was tested for indoor matches, however, those assumptions cannot be considered when working with outdoor sports, as outdoor conditions are 9 not easy predictable (e.g. a cloud’s shadow may cover the field totally or partially for a short or long period of time, it may rain or get foggy), also, the camera is statically placed above the court hanging from the ceiling, position that cannot be achieved in an outdoors stadium. A major problem when dealing with Particle Filters is the degradation of particle assignment, specially when dealing with occlusion, as particles are wrongly assigned due to lose of information in the involved frames, which leads to losing players’ identity after an occlusionevent (Identity Swap and new player condition after a split), therefore, PF alone cannot solve the problem. Itoh et al. [35] introduce a time-situation graph beforehand to an occlusionevent, the graph is used to solve the players’ tracking lost after an occlusion, the quantity of players involved in the event is derived from the graph association prior and after the players interaction, thus allowing to assign the “new” players detected after the occlusionto the existing ones after the occlusion, fixing the splitting problem, but the swapping remains as the algorithm is based on PF. Similar approaches introduce the use of graphs to retain the states of the players so that they are not lost after an occlusionor in boundaries conditions as when player exits the frame (for broadcast recordings) or exits the field limits. Data Association Methods These methods address the object tracking as to find the optimal solution to a data association problem, where the data to associate is obtained from the object detection step. The main goal of these methods is to solve the wrong assignment and object track losing due to occlusionor similar players collision, most of the time used as a complement to a probabilistic/deterministic approach. The simplest implementation of a data association method is using the Kalman filter due to its recursive nature, used to update the object-track assignments [75]. It is stated that for multiple-object tracking, there should be an additional computing to identify each detected object before actually tracking them, thus the method results are dependent on the right object detection step frame to frame. The preferred representation of data for these methods is via a graph, were usually the nodes represent different states of a single player or team, and edges are weights that connect one state with the possible next. Soomro, K. et al. [86] implement a graph approach where each team is represented by a graph, in which the nodes correspond to player positions and the edge weights depend on spatial inter-player distance. They use color-based tracking to differentiate between teams to assign the player to one or another graph, and teams formations data to aid into the estimation of the position of the 10 players, with the assumption that players remain in an organized manner, therefore, it is expected that error increases when players won’t completely follow the initial formation. It also uses only video clips from broadcast matches with an overhead view, occlusionis dismissed. Figueroa et al. use the graph representation for player tracking by focusing on the occlusionevents, where the nodes are used to represent different blobs to distinguish the players, aided with features as mentioned in above studies, like multi-camera view and color. Moreover, the study adds the use of morphological features for blob separation (like a pixeled player’s size and shape model); the algorithm uses a backward and forward graph representation to optimize the graph and aid on multiple cameras to simplify the graph in occlusionevents. This is done by selecting the best view depending on the distance of the blob to the center of the field of camera view and which one brings the most information of isolation of a player, i.e. if in a view the occlusionoccurs, other camera can be used to verify if a player becomes isolated in that view [25], [26]. Wei-Lwun et al. use the graph representation to keep track of the previous assignments, using Bipartite Graphs (BPGs) to guarantee that one-to-one assignment is done by graph constrains; the graphs allow continuity from existing tracks, in contrast with other methods which nature tends to the creation of new tracks anytime an object is detected (as with particle filtering after occlusionwhen dealing with MOT). Wei’s implementation considers the players as entities, each containing a set of features (faces, numbers on uniform, skin or hair color) that compose the Conditional Random Field (CRF), thus integrating different sources of information to improve the player recognition and tracking [96]. Siles, F. follows a similar approach by combining a set of features (color, texture, shape and kinematic model of soccer players) to the edges of a BPG to aid the tracklet-player assignment [81]. Later, Villalta, M. expanded the graph usage from a BPG to a MPG, the assignment is the solution of a graph energy minimization problem [92]. Villalta also includes a single panoramic view of the whole field by stitching the images from two 4K cameras used to record the matches. This approach eliminated problems related to resolution or temporal segmentation needed for clips from TV broadcast, however, occlusionproblems remain as the optical capture is from a single point of view, even though the use of MPGs improves the correct assignment of blobs identities, occlusionis not fully solved. It is also important to highlight that using two 4K images recording impacts in the amount of data to process (2x4k cameras of 3840x2160 px each, recording at 30 fps, leading to a load of 995 328 000 pixels per frame), thus, the use of high capacity computers like clusters or parallelization of the algorithm is needed to speed-up the process [93]. 11 The typical implementation of a graph tracker requires to divide the whole video track into several blocks to process, where the initial state of the next block is defined by the graph solved from the previous one. There are two considerations on this: first, an error on any assignment (new blobs, Identity Swap, etc) will spread to the next block, second, the graph solving is a local solution for the block being contemplated, there exists information that’s ignored on the current block because it is contained in further blocks, specially in occlusionmatters. Gedikli, S. et al. [30] implement a probabilistic method for estimating the trajectory of the players adding a MHT algorithm. The MHT preserves multiple hypotheses and their evolution through a defined temporal sequence (window), at the end of the window a decision is made with the best fitting hypothesis; Gedikli states that MHT was able solve some misclassifications result from occlusionevents that the typical tracker could not solve. Multiple Hypothesis, initially presented as a solution for noisy and complex object tracking scenes in 1979 by Reid [1], is formerly widely used in scanner applications [6], where different measures are received on each scan cycle and spreading of a wrong assignment is critical in application such as projectile tracking, submarine location and others with a tight error acceptance. Moreover, it is recently extended to scenes considered complex due to the dynamic interaction between the objects to be tracked, like pedestrians and sport players tracking, where it outperforms single hypothesis algorithms [49], [100]. Due to its high computational cost, it is usually used as a complementary aid for other algorithms, commonly particle filters and/or a graph representation, instead of being used as the core algorithm of the tracker. Breitenstein’s [14] experiments show that adding a MHT layer improves the results from a single algorithms implementation, in this case using MHT as an addition to PF, in particular in soccer with a Multiple Object Tracking Accuracy (MOTA) higher than 80% for the scenes selected for evaluation, however, identity switches persist due to the use of particle filters and robustness is low during the initial sequences. Therefore, most improvements on tracking results due to robustness on occlusion events are found from a mixture of algorithms, usually by mixing MHT with other algorithms (Kalman filter, PF, ) or similar approaches like a multi-layer outline, or, with PF based algorithms [9], [58], [101], where MHT has proven to be comparable to ”stat-of-the-art methods on standard benchmark datasets” even in its early 90’s implementation [40], [49]. 12 Robustness against identity ambiguity occlusionevent, as stated before with different examples, is one of the biggest problems in optical object tracking. Of the several ways to fight it, some examples are: improving the current algorithm with more features, omit occlusionand care only for previous and posterior identities, increase points of view adding more cameras, and treat the tracking as an optimization problem via multiple layers of assignment (typically via MHT or graphs plus additional criteria). Hess & Fern improve the PF with their Discriminatively Trained Particle Filters [32], which is an alternative for the typical particle filtering approach by training the data set of the players with features as appearance (size, color), motion (via 11 different probabilistic features) and player interaction (to prevent particles mixing between different players), all features constantly updated frame to frame. This method increases the complexity of implementation of the algorithm as it requires to update the features and train the data set for each player, but shows great robustness when occlusionoccurs, especially considering that the algorithm was implemented in American Football, sport with more contact between players than Association Football. Ok, Seo & Hong attack the particles mixing problem when using PFs by implementing an Occlusion Alarm Probability (OAP) [60], intended to repel the particles from sticking between players in multi- target tracking, thus preventing particles mixing and Identity Swap. The results show that there are some occlusioncases in which the repel leads to unassigned particles to one of the players, after the occlusionends, the player previously hidden is now considered as a new player (blob) entering the game due to the missing information from its previous states during the occlusion, this due to not considering tracking splits scenarios. Sabirin et al. [74] detect early occlusionevents by defining a rectangular Region of Interest (ROI) to separate each player, rather than a tight contour, when two or more of these regions are overlapped by a 60% of their area, an occlusion is detected. In general, occluded players are considered a big single ROI during the occlusion event, their identities are assigned after the event, when there’s enough information to identify the players during the occlusion, single regions are assigned preserving the shapes from previous frames, updated just after the event finishes. There’s no complete identity solution during an occlusionevent, making the algorithm robust to occlusions (the identities are recovered after the event) but not occlusionsolving. It is important to highlight that Sabirin included players’ texture as a feature for object identification, alongside other features as position and estimated motion based on optical flow in the hue channel, to identify different players, showing it as an useful feature allowing easily 13 differentiation from different teams and also being the key on the algorithms robustness to occlusion. However, the method is dependent on the segmentation results, as many others, thus the importance of efforts on improving the frames segmentation. Morais and Xu [55], [98], although working with the indoor sport futsal, both suggest the use of multiple cameras to improve the players distinction, as different planes will add position information that a single plane cannot easily obtain, the positions of the players are estimated using homography. Iwase, S. & Saito, H. implement multiple view images in outdoors scenarios locating cameras around the game field [36]. Even though the use of multiple cameras has proven to improve the segmentation results and diminish the occlusionissues (especially when the field is surrounded by cameras and 3D information is obtained), it is dependent on the optimal different views needed to prevent occlusion, which can anyway happen for non-rigid 3D objects in motion, e.g. one or more players may fall over a single player so that it is completely hidden from any camera, or there may be a group of more than two players colliding, which leads to players ambiguity associations. [42] also mixes information from multiple camera views into an MHT environment, using 3D information to aid correct identification of players during an occlusionevent. To prevent occlusionusing multiple cameras, new research questions may arise like: how many cameras? And, where to place them to cover the most, if not all, the 3D space? Thus, making the solution hardware dependant. In [49], the previous questions are exemplified by a variety of camera configurations, many different ways to place the camera and occlusion persists being a nuisance. Jiang [38] represents the possible locations of a single object as nodes of a graph, building a multi-layer graph with the occupation probability of the object, a set of graphs would represent the multiple objects to track. Shitrit et al. [80], in a similar way, implement a graph representation for players’ tracklets, reducing the problem to a multi-commodity network flow optimization, using K- shortest path to solve the assignments on a multi-layer graph. These representations work similar to a MHT approach, where different tracks are preserved and optimized to find the best later on, instead of picking local bests on each iteration of the algorithms. The above implementations resemblance a MPG with MHT approach, differing in the way the graph is constructed, which is considered on the current work’s implementation as a MMPG. In [50], it is used an appearance-based model MHT, in which they consider appearance and motion cues to refine players’ identification. The combination of different features in order to improve the segmentation, as stated by Manafifard, is a needed modification to data association, as these 14 methods usually start by considering that the nodes to associate (in a graph representation) are the right measurements. They show that an appearance-based model MHT performs better than a simple MHT algorithm, giving it the a better discrimination of players and false alarms. This consideration is taken into account in the current project, where the nodes of the graph are not only a label and a position, but a set of optical features that weight on the association solution phase. MHT outperforms those methods using single hypothesis in the case of occlusion and noise [49], however, it is still not a flawless method, as it depends on the estimates of the measurements within an occluding blob, where a sudden change on velocity (reality differs highly from estimation) or a simple bad estimate results in tracking error [39]. Also, it has a high computational cost that makes it unattractive for real-time applications, nonetheless, it sets an interesting paradigm from which a multi-layer structure for assignment solving can be implemented [38], [80], which sets a different way to resolve the typical graph optimization problem. The previous leads to the idea of implementing a multi-hypothesis/multi-layer structure as a mixture of MHT and MPG, a Multiple Multipartite Graphs (MMPG) tracker. Artificial Intelligence More recently, with the approvals and standardization in the use of performance and tracking tools in sports like soccer [23], [24] and the raise of Artificial Intelligence (AI), AI powered tools have been incorporated to aid in player’s performance evaluations via automated optical tracking, from the generation of ground truth data sets to train, test and validate the tracking solutions [61], to the integration of AI to track the players real-time from soccer matches recordings, detecting off-sides, goals and tracing players’ trajectories [2], [4], [66], [67], [78], [85]. However, it is important to highlight these solutions do not rely solely on artificial intelligence, they follow the thread of combining different tools and methods into a robust and more complete platform, including but not limited to wearable trackers, installation of multiple cameras mounted underneath the roof of the stadiums ([78] use 12 dedicated tracking cameras) and human aid to monitor the results and make corrections as needed, since even with AI errors can occur, as recorded on 2020 when an AI camera operator confused the head of a bald referee as the soccer ball [3]. 15 1.4 Problem Description Research question Will a Multiple Multipartite Graphs approach improve precision of soccer players’ tracking in comparison to the current method based on single Multipartite Graphs? Problem Automatic multiple object tracking is not a flawless process, there are different events that make it a difficult task, in particular, the occlusion events. When performing optical object tracking, occlusion events cause information losing that leads to wrong or incomplete tracking results. Many algorithms have been implemented to solve this issue, but they fail one way or another, making it necessary to add a final layer of human aided correction, thus current tracking platforms are not fully automatic. 1.5 Hypothesis By implementing a Multiple Multipartite Graphs data association tracking method, it is possible to improve the occlusion event management of the tracker, which will lead to a metric value of MOTA greater than 75% on a whole match. 1.6 Objectives General Objective Implement a Multiple Multipartite Graphs occlusion solver to manage occlusion events on automatic soccer players’ tracking from Ultra High Definition video. Specific Objectives • Develop the framework needed for the Multiple Multipartite Graphs tracker. • Design a metric to detect simple occlusion events on soccer matches. • Implement Multiple Multipartite Graphs occlusion solver. • Validate the algorithm against annotated ground truth datasets. • Disseminate research to scientific community. 16 1.7 Scope The current project focuses on the occlusion events problematic, it departs from the previous ACE platform version which includes multipartite graphs to perform players’ tracking, using panoramic field view in ultra high definition (UHD) images from the stitching of two 4k recordings of each half of the soccer field. The platform will transform from using single multipartite graphs to a Multiple Hypoth- esis Tracking scheme, self-named Multiple Multipartite Graphs. Some other changes are considered to improve the platform, according to different specific needs of the project encountered during its implementation. 17 Chapter 2 Theoretical Background This chapter covers important concepts and key theory regarding Multiple Object Tracking (MOT), Occlusion, graph representation and the intuition on Multiple Multipartite Graphs (MMPG), as well other concepts related to complementary integrations to ACE tracking platform with the goal of improving the tracking results. 2.1 Multiple Object Tracking - Soccer Players Tracking To perform soccer players tracking, a single player must be distinguishable from any other player along the match. This task, as easy as it may seem for the trained eye, is a hard task in object tracking, taking into account the next considerations: not relevant objects (e.g. goal posts, banners, audience) must be filtered from the objects of interest (players, ball, referee), each player must be identified in each single frame, the same player must be identified throughout the video (consecution of frames), the referee is not a player (can be or not removed from the tracklets depending on the desired analysis), there are active players substitutions during the match with fresh players, both teams fight for the same ball leading to occlusion and tons of players interaction, and many others. To cover all of these and more, object tracking is separated into different steps dedicated to specific tracking tasks. Three key steps can be recognized: detection of objects of interest, tracking of such objects from frame to frame, and analysis of object tracks to recognize their behavior. The whole tracking task can be defined as “the problem of estimating the trajectory of an object in the image plane as it moves around a scene” [99]. Tracking objects can be messy due to loss of information caused by projecting the 3D world onto a 2D image (some platforms like SportVU [90] operate directly with 3D data from multiple views coupling), noise in images, resolution of the video recordings, complex object motion, scene illumination changes, partial and full object occlusions, and other problems application dependent; each step has its own difficulties. The three key steps mentioned above can be recognized in the implementation of the computational 18 platform ACE. The Pattern Recognition and Intelligent Systems Laboratory (PRIS-Lab) develops a computational platform for soccer digital video analysis, called ACE, platform with the task of providing automated soccer matches analysis to human motion scientist, coaches, teams and general public. This platform is implemented in a multi-layer architecture comprised of two main stages: perception and cognitive [82]. First stage is covered by the temporal and spatial segmentation blocks. The temporal segmenta- tion performs detection and classification of scenes with players information (applicable to videos of broadcasted soccer games) to remove those recording scenes that have no information of interest, like advertising spaces; if the soccer game is recorded in panoramic view of the whole field, this block is not needed. The spatial segmentation detects and localizes the objects of interest in each frame of the scene. The second stage is covered by the trajectory tracking and semantics of the game’s tracked activity. The trajectory tracking is where the position-object assignment is performed, using as input data the results from object segmentation. The semantic analysis performs a reinterpretation of the trajectories to detect actions and classify events, this block is responsible of generating statistics, occupational map and other information to provide to the customer [81]. Figure 2.1: ACE computational platform architecture layers Different methods and algorithms are used to try to solve each object tracking step. As the project focuses on the players trajectory tracking, the current scope prioritizes methods and algorithms directly related to the trajectory-object association. The inputs of the algorithms to study are the objects spatially segmented by the previous block in ACE’s flow, i.e. player blobs, block which graphically identifies them by drawing a contour around the player’s detected shape, the centroid of the contour 19 is used as the estimated position of the player. It is important to highlight the influence of the results of the segmentation task into the trajectory tracking, any error in the players’ segmentation and detection will highly lead to a wrongful tracking. Details of the base implementation of ACE platform are discussed in [92]. To understand the complexity of assigning the positions detected in a set of frames to a player and identify its trajectory, it is important to know how a typical soccer game is broken down. First, the game officially starts with all 22 players ”easily” distinguished as they are separated one another following a game strategy formation, both teams start located in opposite sides of the field. As the game begins, the players start to move over the field reacting to the game events as the match unfolds, a player may decide to stay in position, move arms or crouch (shape change), jump, go forward or backward, left or right, follow a player or the ball; as the ball moves from one side of the field to the other, the player may change its previous intended trajectory to another; the motion is errant and not much predictable without the semantics of the game, even though there is a general strategic plan. If the players would not interact one with another, the trajectory-player assignment may not be as hard, but, as the game definition states, both teams are fighting for the same ball, this leads to more than one player sharing a close range to the ball as they struggle to control it, possibly generating occlusion when one player is placed behind another (from the perspective of the 2D image recorded from the same view plane); which player is which? Using the colors of the soccer suits is useful only if the players are of different teams, but if the players are of the same it won’t help much, even less if it is confused with the background (see Figure 2.2). Figure 2.2: Players detection issues due to visual properties, scene conditions and segmentation process. Results from ACE platform, 2018 Automatic object tracking has to deal with previous mentioned issues and many others. Most of 20 these issues correspond to the first stage, specifically from spatial segmentation, as they are dependent on the image characteristics. A perfect segmentation or feature extraction would simplify the second stage, where the association algorithms could operate with clean and organized information. However, as these issues are innate to image processing, data association and object identification have to deal with messy and flawed detected objects. Even though there are different methods that try to improve the quality of the detection phase results, some events persist to the other phases, like occlusion events, resulting in the identity assignment stage consuming data with those problems, causing the previously mentioned errors like Identity Swap, missing identities and spurious objects. 2.2 Occlusion When using images, the features extracted depict visual characteristics of the scene under evaluation, one way or another, they describe each object contained in the scene. However, images are but a projection of a 3D world into a 2D plane, where these visual characteristics are transformed and even hidden, making them hard to extract sometimes. Typical features extracted in this context relay on pixel color (using histograms of color in different representations like YUV, RGB or HSV) to detect areas of interest in the image (background extraction, possible objects detection) and shape/size of the objects to filter spurious detections. However, some information is lost in the 3D to 2D world transformation, sometimes vital information like depth location of the objects, which leads to occlusion events. In an occlusion event, objects that share a common axis location, but different depth from the camera, appear to share the same spatial location, as depth information is lost in the plane of the image. When this occurs, it is hard to identify which object is which, as figures and colors are mixed and the common segmentation methods fail to distinguish each object involved; this particular scenario, as stated in [49], is the biggest challenge in MOT. Figure 2.3 illustrates occlusion events on real soccer matches, even though two close players are not visually blocked for the human eye, its proximity and visual features can lead either segmentation or tracking algorithms to give a faulty result (figure 2.3a). Some other events are more complex and practically impossible to automatically correct (figure 2.3b). 21 (a) Minimum proximity between players is considered as an occlusion event by the algorithm, which groups both players as a single one. [74] (b) Four players occlusion on TV broadcast video. Figure 2.3: Examples of scenarios perceived as occlusion events. The existence of occlusion events, and a segmentation process that is easily affected by it and many other noisy situations, makes it a need for the identity assignment phase to be robust in order to deliver a clean tracking output, it must deal with false detections and missing information. However, such a perfectly robust algorithm does not currently exist, and the current ones which outputs have the higher accuracy percentages are not fully reliable on its own. Recent approaches tend to mix different methods to aid the core algorithm into solving assignments, be it hardware solutions like increasing the points of view, or a mixture of feature models and tools [45], [49]. In terms of occlusion , three ways to confront the event were identified: no-handling of the event, event detection, and event solving. No-handling implies that occlusion is not considered as an input event for the identity assignment. These methods usually rely on the ability of the algorithm to make corrections or its robustness to false detections. As discussed previously, there is no flawless algorithm, and this approach will lead to an incomplete tracking history that requires human aided cleaning of the results. Other way of not handling the occlusion when tracking is trying to prevent it from happening when capturing the scene, this can be done by increasing the points of view to perform a 3D projection of the field. With depth information, most occlusion scenarios are dismissed in exchange of having to care about homography and camera placement. However, even in a 3D projection some occlusion can prove to be challenging (like in figure 2.3b). More recently, with advancements on technology and its usage approval in official games [23], [24], tracking is aided, if not even replaced, by alternatives to optical image segmentation, the use of GPS trackers provides an accurate occlusion-free extraction of the players’ position at any point of the game with a highest refresh rate than that provided by 30 or 60 FPS videos, simplifying the tracking challenge to matching the GPS tracked position to coordinate in an image [88]. 22 Event detection is the ability to recognize occlusion events, usually indirectly by human interpre- tation of tracking information, but to decide to dismiss it, i.e. identities are assigned before and after the frames in which the occlusion occurs, but during the event there is no verification of the identities. This can be seen in [25], [26], where the missing nodes in the sequential graph hints an occlusion event (the graph passes from having two identified players to only one during many frames, and then again having two players, see figure 4.1). However, there is no occlusion handling, the missing information is not corrected and the tracking output only shows the player missing in some frames, but it is con- sidered enough to know the players’ position before and after the occlusion , as sometimes the events are only a few frames long, in the best of cases. However, scenes are not always so simple and there occurs identity swapping and false-new-player labeling after the occlusion event. Figure 2.4: Sequential graph where missing identities suggest an occlusion event [26] Some occlusion detection techniques are used to improve the main algorithm behavior, like in [60] where occlusions are detected by a proximity alarm in order to disperse the particles from the Particle Filter (PF) that is used to track players. Another example is [74], that uses occlusion detection to trigger occlusion handling. Occlusion handling refers to trying to solve identity assignment during an occlusion event, and/or to verify and correct identity assignment before and after the occlusion occurs. Some examples are [25], that builds forward and backward graphs of the frames with missing information in order to merge data prior and after the occlusion occurs; [74] uses different player attributes (like color and movement direction) and prior-occlusion Region of Interest (ROI) placement to define the position of missing players during an occlusion event. Handling and solving occlusion requires at least three phases: detection of the event (in which frame it starts and when it ends), revealing hidden players, and identity preservation during the event. It is important to note that this is a simplification of what it would take to solve occlusion , as there are different degrees of complexity in which it can occur depending on the number of players involved and if there are other dynamics like players intermittently occluded while others are permanently occluded. 23 Occlusion event classification and identification Occlusion events are different and vary in complexity. Simple events as a short sequence of two players can be solved in most cases, for example with a forward-backward sequential graph [25] or with PF [32], however, even in these scenarios occlusion solving effectivity is not 100% and there exists many other scenarios more complex that are yet to be confronted. An occlusion event O can be identified by its length in video frames, and its complexity. The length can be represented either by the number of frames or by the initial and final frames ki, kf . The Occlusion Event Complexity (OEC) is a score that depicts how complex an occlusion event is in terms of the involved objects’ interaction and the occlusion area objects density, where O1 5,30 is an occlusion event that occurs from frame k = 5 to k = 30 and has an Occlusion Event Complexity Score (OECS) CS = 1, where the OECS is the product of the Occlusion Dynamics Index (ODI) and its Occlusion Agglomeration Size (OAS), to be detailed later. To study the event’s complexity, it can be categorized by its Dynamics as Steady Occlusion Event (SOE) or Dynamic Occlusion Event (DOE), and by its OAS, represented by the number of players involved. From the dynamics perspective, it considers the interactions of the players during the event. It may happen that during the whole occlusion there were only two players present, both move to posses the ball, fight for it for some frames, and then one takes possession of the ball and runs faster than the other bringing an end to the occlusion . The previous example depicts an event in which the OAS is constant, but it may vary in more complex events. In the case of an occlusion that starts with two players for a couple of frames and later changes to three players when a new one gets close to aid its teammate, the OAS varies, and thus the complexity of the scene increases. In this case, let OCS 5,30 be the occlusion event described previously, where oCS1 5,10 and oCS2 11,30 are the sub-events that comprise OCS 5,30 (event starts with two players and ends with three), the OAS is useful to break off a complex event into several simpler ones. However, the event can get even more complex by simply adding intermittency to one of the players. If we take the previous example and add intermittency, let’s say that the third player enters and leaves the Occlusion Event Region (OER) several times, the OEC increases as the tracking algorithm must deal with an intermittent merge/split for several continuous frames, result of the third player interacting with the previously occluded pair. If there are many intermittent players, the complexity of the scene increases even more. Thus, we can identify a SOE as one that has no intermittency, with an Occlusion Dynamics Index Intermittency weight (ODI-I) of 0; and a DOE as one with intermittency, 24 with an ODI-I of the number of intermittent players. Other factor to consider for the OEC is if the players interacting are from the same team, as similar color distribution leads to Identity Swap, it is easier to distinguish players from different teams. In this sense, we define an Occlusion Dynamics Index Team weight (ODI-T) of 1 if there are players from different teams, an a ODI-T of 2 otherwise. The final ODI is the summatory of the ODI-I and ODI-T, e.g. an occlusion event with two players of the same team (ODI-T of 2) where one is intermittent (ODI-I of 1) would have an ODI of DyIx = 3. To calculate the final OECS of this example, we use its OAS, with a value of AgIx = 1 as there are only two players present, and multiply it by the DyIx, the OECS of this simple event would be CS = DyIx ∗ AgIx = 3 ∗ 1 = 3; if we take an scenario with the same OAS but with no intermittency and players from different teams (DyIx = 1) the final OECS is CS = 1, which is the simplest possible occlusion event (a CS = 0 would mean no occlusion is happening). CS = DyIx ∗AgIx = (ODI − T +ODII) ∗ [1 + (x− 2) · 1x 6=2] (2.1) This project will focus on occlusion events with an OECS CS = 1, 2, which are simple occlusion events, cases where CS >= 3 will be tested, as those are interesting and challenging scenarios (specially if the high score is due to intermittency), but no specific result is pursued over it. The most common OECS are yet to be obtained from scene labeling on real matches, to be shown in further results. 2.3 The Graph Representation A graph G is a mathematical structure that shows relations between the objects it comprises, it mainly consists of two things, a set V = V (G) whose elements are called vertices, points, or nodes of G, and a set E = E(G) of unordered pairs of distinct vertices called edges of G, such a graph is denoted by G(V,E). If there is an edge e = {u, v} (the edge e connects vertices u and v), then vertices u and v are adjacent and are the endpoints of e [73]. Graphs can be pictured by diagrams where each vertex v in V is represented by a dot or a small circle, and each edge e = {v1, v2} is represented by a curve which connects its endpoints v1 and v2 (see figure 2.5 for a generic example of a graph). 25 v1 v2 v3 v4 v5 e1 e2 e3 e4 e5 e6e7 e8 e9 v1 v2 v3 v4 v5 5 8 10 9 3 124 1 9 Figure 2.5: Graph G pictured by its vertices and edges relations (Left), example of a weighted graph (Right). If graph G is assigned data to its edges and/or vertices, it is called a labeled graph, in particular, G is called a weighted graph if each edge e is assigned a number w(e) called the weight or length of e, and thus an edge-weighted graph (see figure 2.5). A complete graph is a graph G in which every vertex is connected to every other vertex in G (see figure 2.6). A graph is directed if it is made up of a set of vertices connected by edges, where the edges have a direction associated with them. v1 v2 v3 v4 e1 e2 e3 e4 e5 e6 Figure 2.6: Complete graph example A multipartite or k-partite graph G is one in which its vertices V can be partitioned into k different independent subsets such that vertices of the same subset are not adjacent. If the amount of subsets 26 k = 2, it is said to be a bipartite graph, if k ≥ 3, it is said to be a k-partite graph. It is a complete k-partite graph if every pair of graph vertices in the k sets are adjacent, if the subsets have p, q, ..., r vertices respectively, the complete k-partite graph is denoted Kp,q,...,r (see figure 2.7). u1 u2 u3 u4 u5 v1 v2 v3 v4 U V Figure 2.7: Bipartite Graph Example In the case of ACE platform, a series of k frames are represented by a directed edge-weighted k-partite graph mpg, where the vertices (nodes) of each subset k represent a set of characteristics for each object of interest (players, ball and referee), and the weighted edges represent a cost of making the nodes adjacent (i.e. the weight of node ui from subset k-1 representing the same object as node vj from subset k). The mpg is initiated as a complete graph, and is then minimized by a shortest-path algorithm to obtain a graph where ideally each node has exactly one adjacency (see figure 2.8). If there were no faults when tracking the players, each subset k would have the same exact number of nodes (not considering player expulsions or replacements) and the ideal minimized graph would be the common output, however, that is not the case in a real scenario. 27 u1 u2 u3 u4 v1 v2 v3 v4 U V u1 u2 u3 u4 v1 v2 v3 v4 U V Figure 2.8: Bipartite graph example with 4 nodes in each subset. Left: Complete Bipartite Graph. Right: Minimized Ideal Bipartite Graph. 2.4 Multiple Multipartite Graphs Object tracking involves detecting and identifying the object of interest in an image k, and preserving its identity in the next images {k + 1, ..., k + i} from a sequence of consecutive images (detecting and identifying the same object as itself in each image). Different features are extracted to estimate the object’s position in the image, in this sense, the identity assignation and position estimation are considered to be an assignment hypothesis. Different considerations are taken to determine if that hypothesis actually represents the desired object, in a probabilistic tracking approach, each possible assignment with a high probability is a hypothesis (for example each particle in a PF[22]), where the tracking result can be a single hypothesis with the highest probability, the average of the hypotheses with highest probability, or a region of high probability. In a data association tracking method, in particular a Multipartite Graph (MPG) approach, each node adjacency combination is a hypothesis (see figure 2.9), it can be said that a complete MPG is one that represents the whole hypothesis space. 28 u1 u2 u3 v1 v2 v3 w1 w2 w3 U V V Hypothesis 1 u1 u2 u3 v1 v2 v3 w1 w2 w3 U V V Hypothesis 2 u1 u2 u3 v1 v2 v3 w1 w2 w3 U V V Hypothesis 3 Figure 2.9: 3 assignment hypothesis examples for a multipartite graph of 3 subsets with 3 nodes each. When considering the features extracted and other constraints (like distance and speed constraints) many of those hypotheses can be easily discarded, as they are not possible due to known physical interactions (e.g. a player cannot be on one side of the field in frame k and on the other side of the filed in frame k+1, if consecutive frames represent a lapse less than a second). For the plausible hypotheses, usually a shortest-path algorithm is used to determine the best fit minimized graph according to the different weights assigned to each edge (from features and constraints). However, the usual approach is to build a MPG for a fixed window size of K frames, and pick the very best assignment hypothesis and discard the rest, which would be the final solution if we suppose that the data from which the MPG was built is accurate with no noisy, spurious or missing identities, what happens in real case scenarios, thus the local best hypothesis may not be the global best, or even the best for a MPG that 29 considers the frames from the same window and the next one. One way to illustrate it is by comparing the hypothesis decision by shortest-path to what some- times occurs with a GPS map navigator. It can occur that the driver is suggested to take an exit to go through a path which is just 1 second faster than the current course, the driver may decide not to do so considering that 1 second is not gain enough and that taking the exit would lead to a known hill that the tracker dismisses as it lacks topography information, leading to a slow down and therefore lasting even more than estimated, the algorithm is designed to pick the shortest path from the currently available data. It can be argued that incorporating a way to calculate the estimated time of arrival (ETA) considering the hill slow down would improve ETA’s accuracy, let’s now consider the case in which the navigator does not know beforehand every route from a map, but it generates the best route evaluating a portion of the map by fixed size windows as the car advances, then, until the window with the portion of the map that includes the hill is processed the algorithm won’t take into account its existence, it may still offer the driver to continue to the hill, but it is only after the algorithm detects the hill that it can conclude that the current path does not lead to the global optimum. Similarly when tracking football players (specially if the final goal is to track live players), when the current window of frames is being processed, the algorithm does not know how the batch relates to the next, if there is loss of information in the next batch or if the current one is missing a player that is detected in the next; the shortest-path approach picks the local best hypothesis, which may not coincide with a global best hypothesis. The above scenario may result in cuts in motion flow, for example a motion scene being composed of two MPG due to window sizing and sequential graphs creation instead of a single MPG, an occlusion event could be split into several MPG with its own faulty local optimum that makes no sense once put together; in the end, the final graph representing the whole match is a chain of local bests hypothesis added sequentially, which may not have the same results as a MPG made from the entirety of the match’s frames, more study is required to understand the extent of the effect of different window sizes. As of now, the tracking phase is fed an optimum initial state that may or may not include errors and assume it as the truth so as to link each sequential batch (hence the importance of a clean segmentation and detection before the tracking phase to prevent spreading errors the tracking algorithms will take as truths). Figures 2.10 and 2.11 compares the general processing of the MPGs of a scene comprised of 6 frames with a window size 3 and 6, respectively. 30 u1 u2 u3 v1 v2 v3 w1 w2 w3 x1 x2 x3 y1 y2 y3 z1 z2 z3 U V W X Y Z MPG frames 1-3 u1 u2 u3 v1 v2 v3 w1 w2 w3 x1 x2 x3 y1 y2 y3 z1 z2 z3 U V W X Y Z Reducing MPG frames 1-3 w1 w2 w3 x1 x2 x3 y1 y2 y3 z1 z2 z3 W X Y Z MPG frames 4-6 w1 w2 w3 x1 x2 x3 y1 y2 y3 z1 z2 z3 W X Y Z Reducing MPG frames 4-6 Figure 2.10: Multipartite graph processing with a window size 3 of a scene comprised of 6 frames. Subsets U, V and W represent the first 3 frames of the scene, as the window size is 3, these subsets comprise the first MPG, it is minimized and the final subset W is the starting point for the minimization of the next MPG for the last 3 frames of the scene. 31 u1 u2 u3 v1 v2 v3 w1 w2 w3 x1 x2 x3 y1 y2 y3 z1 z2 z3 U V W X Y Z u1 u2 u3 v1 v2 v3 w1 w2 w3 x1 x2 x3 y1 y2 y3 z1 z2 z3 U V W X Y Z Figure 2.11: Multipartite graph processing with a window size 6 of a scene comprised of 6 frames. Subsets U, V, W, X, Y and Z represent the whole scene in a single MPG, which is then minimized with the whole scene’s information. Considering the local optimum as an assignment hypothesis (one possible assignment solution based on the available data), let’s suppose the tracking can be improved by holding the assignment, or hypothesis selection, until there is more information for an actual best pick (we have knowledge that there is a hill coming, so let’s be flexible with the MPGs optimization). Keeping a set of best local optimums instead of only the best local optimum grants a set of alternative assignments the tracker can piece together into building a wider optimum without committing to the initial assignment based on limited scene information. The idea of propagating multiple assignments hypothesis was first presented in 1974 by Singer, Sea and Housewright [83] to be used in dense multitarget environments, and later expanded in 1979 by Reid [1], presented as Multiple Hypothesis Tracking (MHT) algorithm, a systematic approach to solve tracking of multiple objects in a complex scenarios. The algorithm mainly consists on creating assignation hypothesis and propagating them as new measurements and detections are incorporated into the assignment space, usually depicted as a decision tree that encompasses the hypothesis space (figure 2.15). Those discarded hypothesis (branches) are pruned from the hypothesis tree, each new measure adds new information and thus creates a new subset of hypotheses (hypothesis branches). 32 Figure 2.12: Hypothesis Tree: tree representation of formed hypotheses. [6] There exist different implementations of a MHT algorithm [1], [6], [8], [13], [18], [29], as well as variations depending on the area of research, improvements or the algorithm it is mixed with [14], [30], [42], [50], [58], [101], but two key aspect remains, the first one is its high complexity and computational cost due to the hypothesis creation process (hypothesis can become virtually infinite if not limited) that makes it a requirement to use a robust pruning method and to limit its length with a fixed maximum size. This issue is similar to the one previously discussed with MPG, however, due to its complexity and cost, it is not desired to use MHT for the entirety of the match but rather in sections where it may come handy, like in occlusion events (here it is paramount the idea of occlusion events detection and classification). The second aspect regarding MHT is its tree-like shape, comparing the main structure of a Hy- pothesis Oriented hypothesis tree [6] with an MPG (figures 2.9 and 2.11), where each branch of the tree is a hypothesis and each possible assignment combination in a MPG is a hypothesis (taking each subset as a single node of a hypothesis branch), the possible MPGs can be arranged each as a branch of a hypothesis tree, thus making the whole hypothesis tree into an MMPG space (figure 2.13), where hypothesis pruning would mean to discard an MPG, and hypothesis propagation the incorporation of new subsets, i.e. expanding each hypothesis/MPG with new measurements from the following video frames. Each new subsequent MPG window adds a new set of hypothesis branches to the existent ones, making the hypothesis tree grow. 33 u1 u2 u3 v1 v2 v3 w1 w2 w3 U1 V 1 W1 Hypothesis 1 u1 u2 u3 v1 v2 v3 w1 w2 w3 U2 V 2 W2 Hypothesis 2 u1 u2 u3 v1 v2 v3 w1 w2 w3 U3 V 3 W3 Hypothesis 3 ——————————————————————————————————————————- u1 u2 u3 v1 v2 v3 w1 w2 w3 U V 1 W1 Hypothesis 2 v1 v2 v3 w1 w2 w3 V 2 W2 Hypothesis 1 v1 v2 v3 w1 w2 w3 V 3 W3 Hypothesis 3 34 ——————————————————————————————————————————- U V3 W3 Hypothesis 2 V2 W2 Hypothesis 1 V1 W1Hypothesis 3 Figure 2.13: Depiction of Multiple Multipartite Graphs structure based on a Hypothesis Tree, where the first common subset is the starting point, each branch is a hypothesis composed of a possible reduced MPG Following the previous intuition, an MHT-like structure can be obtained by an MPG of MPGs, where each MHT represents a hypothesis branch. 2.5 Player Identification An essential aspect of a correct object tracking is the ability to retain the unique identity of each object throughout the video. In optical object tracking, characteristics extracted from the video can be used as a first layer of attributes that distinguishes one object from another; properties derived from the first set of attributes, as speed or trajectory direction, can be used as complementary attributes. Moreover, building a set of attributes can be extended to groups of objects to determine belonging to a certain group (spurious blobs vs player blobs, referees vs players, team A vs team B, etc). Player Model From a 2D image object identification standpoint, an object of interest is but a cluster of pixels that standout from the rest of pixels in the image. From these pixels some attributes can be extracted to build the object’s characteristics that make it a unique object. Directly from each pixel we can extract the color information in different formats, hue, saturation, its position relative to the global frame of reference; from the cluster of pixels we can calculate histograms of visual properties, centroid, contour, area, perimeter, aspect ratio, and more. This first set of attributes becomes useful when identifying objects in the same image scene, as the slightest variance can be used to distinguish between similar objects. These attributes can also be used to filter between spurious identifications and actual objects 35 of interest, e.g. using aspect ratio to remove small objects detected from the scene’s background noise, or using the HSV histogram to filter false positive clusters of pixels that belong to the soccer field markings and not to a player. After a proper tracking of an object across multiple sequential images, some other attributes can be estimated from the historical data, from the record of previous positions the speed and direction of the object can be estimated, which can be used to predict the object’s next position. Team Model Similarly to an individual player, a group of players can also be identified by a set of attributes. In the case of soccer, the game is designed to have the players belonging to one of two teams, where the players of the same team wear uniforms of visual similarity (color, shape, patterns) with the exception of the goal keepers. Sometimes a spurious blob can pass a filter designed from considering individual objects, i.e. a player is expected to have some aspect ratio or a size within some margins, but spurious blobs can satisfy these conditions with unexpected shapes and produce false positives. Using the attributes derived from the common properties of a group of known players of the same team can be used as an additional filter where a blob is a player blob only if it fits the expected pattern of one of the two teams. This useful team model, however, depends on the correct identification of a player, which can lead to a chicken and egg situation. In the case of knowing beforehand the initial position of the players, this piece of information can be used to create a trusty team model, but a bit of caution to keep in mind is that as the match advances and the conditions change (players’ being replaced, the uniforms getting dirty changing the color distribution, abrupt scene light changes) this model can become obsolete, thus appropriate tools to keeping this model valid are required to make the most out of it. 36 (a) Yellow player with its hue channel his- togram (b) Blue player with its hue chan- nel histogram Figure 2.14: Comparing players from different teams Player similarity Having determined the model of properties that describes a player, its identity is spread to the next frame by comparing its set of attributes with the objects of the next frame that might represent the same player. If the objects seem closely related, the identity is spread, otherwise no connection is established between the objects. Figure 2.15: Using similarity between blob models to perform identity assignment across sequential frames Each attribute is not treated equally, some have a greater impact in determining if the two objects represent the same player, as is the case of the position, taking into account the physical limitations of the players and the rate of the video recording (for example 30 FPS, two consecutive frames represent 1/30 of a second) it is impossible for one player to be positioned in one corner of the field in one frame and in the opposite corner in the next frame, the huge difference in the object’s position rules 37 out the possibility of them representing the same player. In the other hand, other attributes are less influential when determining if the objects represent the same player, as shape or the contained area, as the players are expected to be humans running and kicking the ball with two legs. Not to say these traits are not useful or cannot be used to distinguish players, but differences between detected objects are expected to be low when considering this kind attributes. Each attribute has its strengths and weaknesses when used for identifying objects (position helps filtering distanced players but becomes a lesser indicator when comparing objects close together), it becomes beneficial to use as many traits as available to cover a wider range of situations. Since some attributes can be used more generally than others, these are integrated into comparing object’s similarity with a weighted value that can be adjusted to determine the influence of each attribute. Thus, we can define the dissimilarity between two objects A and B as the weighted difference between the object attributes, where the lesser the value the more the similarity. D(A,B) = n∑ i=1 wi∆(Ai, Bi) (2.2) Where D(A,B) is the total dissimilarity between objects A and B, n is the number of attributes, wi is the weight associated with the i-th attribute pair (Ai, Bi), and ∆(Ai, Bi) is the delta (difference) between the i-th attribute of object A and object B. Next Position Prediction When tracking projectiles or objects expected not to change much its trajectory [58], [77], a clear mathematical model that describes the object’s trajectory is quite useful, whenever the optical tracking fails, the tracker can rely on the model to predict the object’s position in subsequent frames until it is detected again by the tracker. The trajectory model can also be used to trace back in time the objects motion to estimate the trajectory in a time frame not present in the current data set. When tracking objects with more complex motions or movement freedom, these models can become complex to determine. In the case of human motion, in spite of having a general sense of where the person might go based on an objective (the players follow the ball from one side of the field to another), unknown factors can unexpectedly alter the behavior of the person, and thus its motion (the player tripped over, slowed down due to fatigue, got distracted by a flying bird, does not know where the ball is at and so it is not following it, it is positioning itself in advance of a future play). 38 PF comes handy in situations like these [32], [58], [65], where instead of predicting the players’ next position, a cloud of points represent the space of possible next positions, the player is then predicted to be where most of these points cluster. This capability is not currently incorporated into ACE platform, and integrating it is a whole project on itself. Assuming some constraints in the motion freedom of a soccer player, a simpler approach can be used. Considering the input video has a rate of 30 FPS, where the time change between two consecutive frames is 1/30 of a second, taking into account the distance from which the camera is set from the field to record the match, the pixel velocity can be calculated to describe the motion of players in the scene, which can be used to expose how much change are we to expect in between frames for a player in movement [33]. Keeping things simple, we can assume a rectilinear motion for immediate frames k and k+1, by knowing the player’s direction and speed from the history of positions the immediate next position can be estimated. How well the prediction fits depends on how well we can estimate the player’s speed and direction, which depends on the length of the previous positions history, too short a history makes the prediction too sensitive to new position information (direction may change on every new prediction) and too long a history will include outdated position information that may prevent the direction to be updated (the player moved to the right of the field for too long and then decides to turn the other side, the history of ’right direction’ weight more in predicting the new direction than the new ’left direction’). Object Proximity As stated before, distance between the position of objects is a useful metric that can be used to rule out relationships between objects that are too far apart. In the case of occlusion probability, the closeness between two objects can be used as an early indicator of an occlusion event. However, the distance between two objects is usually measured between the objects’ centroids, but the occlusion can occur with any part of the region each object holds, a bigger object can occlude a smaller object even when the euclidean distance of the centroids is great. Figure 2.16 illustrates this problem by comparing two set of objects with the same euclidean distance from their centroids, top shows the bigger object partially occluding the smaller one, bottom shows no interaction between objects of same size. 39 Figure 2.16: Problem with occlusion and euclidean distance as proximity metric To address the previous scenario, the proximity between two objects is defined taking into account not only the (euclidean) distance between the objects, but also the size ratio between the objects (width and height ratios), including weighted values to provide flexibility to the calculation on how each attribute influences the proximity. proximity = distance× (wdistance + (wsizeWidth × sizeRatioWidth + wsizeHeight × sizeRatioHeight)) (2.3) Where: • distance is the Euclidean distance between the centers of the two objects, with (xA, yA) and (xB , yB) the positions of objects A and B, respectively