Automatically generating and summarizing open source project documentation for newcomers

Grant number: 15/07399-1
Support type: Scholarships in Brazil - Post-Doctorate
Effective date (Start): July 01, 2015
Effective date (End): January 31, 2016
Field of knowledge: Physical Sciences and Mathematics - Computer Science - Computing Methodologies and Techniques
Cooperation agreement: Coordination of Improvement of Higher Education Personnel (CAPES)
Principal Investigator: Marco Aurélio Gerosa
Grantee: Christoph Treude
Home Institution: Instituto de Matemática e Estatística (IME). Universidade de São Paulo (USP). São Paulo, SP, Brazil


Summary: Newcomers to open source projects face many barriers, including unclear documentation, difficulty finding a task to start with, and technical hurdles, according to evidence from previous research. Many of these challenges can be addressed by producing documentation that is specifically aimed at and customized for newcomers. In this project, we propose techniques to automatically identify, extract, generate, summarize, and present documentation that is relevant to newcomers in open source projects. We will use a variety of existing and novel natural language processing techniques to automatically parse documentation from a wide range of sources for each open source project, including the official documentation on the project's website, its issue tracker, forums, blogs, and question-and-answer websites such as Stack Overflow. We propose to build on existing summarization techniques and to develop novel supervised techniques for identifying the information that is most relevant to open source project newcomers. Our main contribution will be the design, development, and evaluation of a fully automated documentation generator for open source project newcomers. This approach will lower the barriers newcomers face when attempting to make their first contribution to an open source project, make it more likely that they will remain active contributors, and ultimately make open source projects more accessible to individuals from outside the projects. The work will also inform academia and industry about what knowledge newcomers to open source projects need, and it will yield insights into the application of natural language processing techniques to software documentation in English and Portuguese. (AU)