Prof. Roberto Di Cosmo
Software Heritage
France
Abstract: There is a strong interplay between software development and machine learning:
AI models are providing new tools to develop software, while the inclusion of large publicly available codebases in training datasets helps improve large language models' reasoning abilities, well beyond coding tasks.
In the specific domain of source code, the issue of transparency of the training dataset assumes a special weight in the broader debate around open versus closed models.
Software Heritage, launched by Inria in partnership with UNESCO, has been building the largest archive of publicly available source code for nearly a decade. Today it provides a Software Hash Identifier for each of the more than 50 billion software artifacts it has collected from over 300 million projects, ensuring availability, guaranteeing integrity, and enabling traceability of all its contents. Because of the core values that inform its approach to open access and code preservation, it is naturally concerned by these challenges.
In this talk we will start from the principled stance on the use of the Software Heritage archive for training models, report on the lessons learned from the collaboration with the BigCode project that created StarCoder2, and then focus on the challenges, ethical considerations, and technical limitations that arise in the current approaches to using open codebases in AI, in particular when it comes to transparency, accountability, and resource efficiency.
These limitations underscore the need for a Code Commons: a dedicated initiative to expand Software Heritage into a central resource for transparency, quality, accountability, and sustainability in machine learning on code.
By promoting transparency and responsible stewardship, Software Heritage aims to help researchers, developers, and organizations navigate the challenges of AI in code-based applications. This talk invites all stakeholders to collaborate on this ambitious vision.
Bio: An alumnus of the Scuola Normale Superiore di Pisa, with a PhD in Computer Science from the University of Pisa, Roberto Di Cosmo was associate professor for almost a decade at Ecole Normale Supérieure in Paris. In 1999, he became a Computer Science full professor at University Paris Diderot, where he was head of doctoral studies for Computer Science from 2004 to 2009. President of the board of trustees and scientific advisory board of the IMDEA Software institute and chair of the Software chapter of the National Committee for Open Science in France, he is currently on leave at Inria.
His research activity spans theoretical computing, functional programming, parallel and distributed programming, the semantics of programming languages, type systems, rewriting and linear logic, and, more recently, the new scientific problems posed by the general adoption of Free Software, with a particular focus on static analysis of large software collections. He has published over 20 international journal articles and 50 international conference articles.
In 2008, he created and coordinated the European research project Mancoosi, which had a budget of 4.4 M€ and brought together 10 partners to improve the quality of package-based open source software systems.
Following the evolution of our society under the impact of IT with great interest, he is a long-term Free Software advocate, contributing to its adoption since 1998 with the best-seller Hijacking the world, seminars, articles and software. In October 2007, he created the Free Software thematic group of Systematic, which helped fund over 50 Open Source research and development collaborative projects for a consolidated budget of over 200 M€. From 2010 to 2018, he was director of IRILL, a research structure dedicated to Free and Open Source Software quality.
In 2015, he created, and now directs, Software Heritage, an initiative to build the universal archive of all publicly available source code, in partnership with UNESCO.