The largest collection ever of spatio-temporal data of soccer matches has been made public on Scientific Data: a crucial resource for the development of sports analytics.
In the history of soccer, we remember the exploits of great champions - Maradona, Baggio, Ronaldo. Nonetheless, we can learn essential lessons from the careers of lesser-known players, too.
Carlos Henrique Raposo, also known as Kaiser, was a Brazilian professional footballer for more than 20 years. He played as a forward for ten different clubs in Brazil, Argentina, Mexico, the USA, and France. His career, however, has one peculiarity: Carlos Kaiser played just two official matches. By befriending famous footballers and asking them to recommend him to the managers of their new clubs, Kaiser managed to change teams almost every year. Once hired by a club, he feigned injuries throughout the season, thus hiding his mediocre footballing talent. This intricate network of lies and social relationships allowed him to survive in the football world for around 20 years.
Although the case of Kaiser is one in a million, the history of soccer is not new to sensational purchases that then turned out to be resounding failures. Not even talented and legendary managers were immune to these situations (who remembers Luther Blissett?). A key reason behind these blunders is the lack of data describing the performance of players throughout their careers. Had such data been available, it would have been possible to track the evolution of Kaiser’s performance in matches and training sessions, probably highlighting his inadequacy to play soccer at a high level.
Nowadays, we have tools to discover effective players and thus to avoid failed purchases and unusual situations like Kaiser's. Indeed, massive data about the performance of players are collected by specialized companies, thanks to sensing technologies that provide high-fidelity data streams extracted from every match. In particular, the so-called soccer-logs (also known as match event data) describe the events that occur during a match and are collected through proprietary tagging software. Each match event contains information about its type (pass, shot, foul, tackle, etc.), a timestamp, the player(s) involved, the position on the field, and additional information (e.g., pass accuracy). The volume and complexity of these data provide an unprecedented opportunity to observe the performance of players and teams during a match and to track their evolution during a season. Here is an example of a visualization, based on soccer-logs, of the evolution of the performance of players over an entire season: https://playerank.d4science.org/.
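To make the structure of a match event more concrete, here is a minimal Python sketch of how such a record could be represented and queried. The class name, the field names, the "accurate" key, and the sample values are all hypothetical, chosen only to mirror the attributes listed above; they are not the schema of the released data set.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MatchEvent:
    """Hypothetical match event; field names are illustrative assumptions."""
    event_type: str        # e.g. "pass", "shot", "foul", "tackle"
    timestamp_s: float     # seconds since the start of the half
    player_id: int         # identifier of the player performing the event
    x: float               # position on the field (assumed 0-100 scale)
    y: float
    extra: dict = field(default_factory=dict)  # additional info, e.g. accuracy


def pass_accuracy(events: list, player_id: int) -> Optional[float]:
    """Fraction of a player's passes flagged as accurate, or None if no passes."""
    passes = [e for e in events
              if e.event_type == "pass" and e.player_id == player_id]
    if not passes:
        return None
    accurate = sum(1 for e in passes if e.extra.get("accurate", False))
    return accurate / len(passes)


# Made-up sample events, for illustration only.
sample_events = [
    MatchEvent("pass", 12.4, 10, 45.0, 30.0, {"accurate": True}),
    MatchEvent("pass", 20.1, 10, 52.0, 41.0, {"accurate": False}),
    MatchEvent("shot", 33.7, 10, 88.0, 50.0),
    MatchEvent("pass", 40.2, 7, 60.0, 20.0, {"accurate": True}),
]

print(pass_accuracy(sample_events, 10))
```

Computing the same quantity match after match is the kind of simple tracking that lets one follow how a player's performance evolves during a season.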
Unfortunately, soccer-logs are private data owned by the companies that collect them. Acquiring these data for research purposes is difficult, and it represents a considerable cost for companies and especially for researchers in the field of sports analytics. The lack of public soccer-logs therefore constitutes a severe limitation to the development of sports analytics.
This is why, in collaboration with the company Wyscout/Hudl, we make publicly available an extensive collection of soccer-logs that covers seven prominent male soccer competitions. The collection was used recently during the Soccer Data Challenge initiative organized by the European project SoBigData and, to the best of our knowledge, it is the largest collection of soccer-logs ever released to the public. These data are hugely beneficial to the scientific community because they can help foster sports analytics research in several directions, such as the ones we sketch below.
Performance and tactical analysis. The evaluation of performance is crucial for many actors in the sports industry: from managers who want to monitor the quality of their players to scouts who aim to improve talent discovery. In this regard, we recently developed PlayeRank, an open algorithm based on soccer-logs that automatically evaluates the quality of the performance of players during a season. The automatic discovery of tactics is also crucial in soccer: while tactical analyses are currently performed by reviewing match videos, soccer-logs can be used to discover tactics automatically, simplifying the complex process of match analysis. Our collection can serve as a common ground to compare and validate different solutions to these problems.
Complex Systems analysis. Two soccer teams in a match represent a complex system whose global behavior depends in subtle ways on the dynamics of the interactions among the players. Soccer-logs enable the representation of a team as a network, in which nodes represent players and edges represent interactions between them, usually passes. Soccer-logs also allow the definition of different types of interactions between both teammates and opponents by relying on the several event types they encode. Such richness of information, combined with the dichotomous nature of soccer matches (where collaboration and competition coexist), provides an unprecedented opportunity to investigate novel aspects of the dynamics of complex networks.
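The network representation described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: each pass is given as a (passer, receiver) pair, whereas in real soccer-logs the receiver may have to be inferred (for example from the following event), and the function names and player labels are hypothetical.

```python
from collections import Counter


def build_pass_network(passes):
    """Build a directed, weighted passing network.

    `passes` is an iterable of (passer, receiver) pairs (an assumed input
    format). Returns a Counter mapping each directed edge to the number of
    passes exchanged along it.
    """
    edges = Counter()
    for passer, receiver in passes:
        edges[(passer, receiver)] += 1
    return edges


def weighted_degrees(edges):
    """Return (out_degree, in_degree): passes made and received per player."""
    out_degree, in_degree = Counter(), Counter()
    for (passer, receiver), weight in edges.items():
        out_degree[passer] += weight
        in_degree[receiver] += weight
    return out_degree, in_degree


# Made-up passes among three players of the same team, for illustration only.
sample_passes = [("A", "B"), ("B", "C"), ("A", "B"), ("C", "A"), ("B", "A")]

network = build_pass_network(sample_passes)
passes_made, passes_received = weighted_degrees(network)

print(network[("A", "B")])   # weight of the edge from A to B
print(passes_made["B"])      # total passes made by B
```

Other event types, such as duels or fouls, could be encoded in the same way as additional edge types, including edges between opponents.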
Science of Success. The possibility to track player and team performance creates the opportunity to explore the relationship between performance and success, where a team's success can be understood as its outcome in a competition and a player's success as their popularity or market value. While this relationship has been investigated for individual sports, apart from a few attempts there is not much work on soccer, partly due to the absence of public performance datasets. Our dataset gives the unprecedented opportunity to answer fascinating questions such as: What are the tactical patterns of successful teams? What are the factors influencing a player's popularity and market value? To what extent is success predictable from observable performance?
We hope our open data collection can stimulate the creativity of scientists all around the world and foster the development of new ideas, methods, and analyses that can help strengthen the emerging field of sports analytics. Nowadays, by tracking the performance of a player over time, we may avoid one-in-a-million situations like Kaiser's. Unless the new Kaisers are good data hackers.
Luca Pappalardo and Paolo Cintia
Reference: L. Pappalardo et al., A public data set of spatio-temporal match events in soccer competitions (2019) Scientific Data, DOI: 10.1038/s41597-019-0247-7, https://www.nature.com/articles/s41597-019-0247-7