Welcome

This website is for the Code Review Open Platform (CROP).


CROP is an open source dataset of code review data intended for supporting software engineering researchers and practitioners. For each software system in CROP, we link code review data to complete versions of the system's code base at the time of review. CROP currently provides data for 11 software systems, accounting for a total of 50,959 code reviews and 144,906 revisions.


Here you can find useful information, such as how to download the dataset, understand its structure and find support.


About

The Code Review Open Platform, a.k.a. CROP, is an open source dataset of code review data. CROP collects code review information from open source software systems and links this data to complete versions of the code base for each of these systems. CROP was first designed by Matheus Paixao as part of his PhD thesis in the CREST Centre at University College London. Dr. Jens Krinke, Donggyun Han and Prof. Mark Harman have also contributed to the first incarnation of CROP.


For a given software system, CROP contains code review data and versions of the code base for every revision ever submitted for review, including intermediary revisions before merging and even revisions that were abandoned by the system's developers. Each version represents a complete snapshot of the system's code base, so that each revision of the system is fully buildable, compilable and testable.


By leveraging the data contained in CROP, software engineering researchers and practitioners can perform empirical studies to assess how effective the code review process is for different aspects of software development. Since CROP provides complete snapshots of the software system, these experiments can be enhanced by using a wide range of approaches for static and dynamic analysis.


Moreover, during code review, developers are constantly providing reasoning and rationale for the changes they make in the system, both when they submit code for review and when they inspect code from their peers. Thus, the data contained in CROP is a valuable source of knowledge regarding motivation for and explanation of software changes.


Download

The CROP dataset can be downloaded here. You'll be taken to a Zenodo page that hosts the complete dataset.


In the metadata.zip file, you will find the CSV files that hold the metadata for each system in CROP. In the discussion.zip file, you will find the code review data in the form of discussion files. In the git_repos.zip file, you will find the git repositories that give access to the complete code base for each revision submitted for review.
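As a minimal sketch of unpacking the three archives after download, the following uses Python's standard zipfile module. The archive names come from the text above; the function name, the download and target directories, and the choice to extract each archive into a folder named after it are assumptions made for illustration, and the internal layout of the archives is not assumed.

```python
import zipfile
from pathlib import Path

# Archive names as given in the text above.
ARCHIVES = ["metadata.zip", "discussion.zip", "git_repos.zip"]


def extract_archives(download_dir, target_dir):
    """Extract every CROP archive found in download_dir into target_dir.

    Each archive is unpacked into a sub-directory named after the archive
    (an assumption of this sketch). Returns the names that were extracted.
    """
    extracted = []
    for name in ARCHIVES:
        archive = Path(download_dir) / name
        if archive.is_file():
            with zipfile.ZipFile(archive) as zf:
                zf.extractall(Path(target_dir) / archive.stem)
            extracted.append(name)
    return extracted
```

Calling extract_archives("downloads", "crop") with hypothetical directory names would skip any archive that has not been downloaded yet and report the ones it unpacked.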


For more information on the structure of CROP's dataset, please see our structure section.


Data Currently in CROP

At present, CROP contains data from two open source communities: Eclipse and Couchbase. For each of these communities, we provide data for its most popular projects in terms of the number of code reviews performed: 4 projects from Eclipse and 7 from Couchbase.


The following table reports statistics concerning the data collected for each of these 11 systems, where the Eclipse projects are presented in the upper section of the table and the Couchbase projects in the lower section.


System         Time Span          #Reviews  #Revisions  Language    Community
egit           Oct-09 to Nov-17      5,336      13,211  Java        Eclipse
jgit           Sep-09 to Nov-17      5,382      14,027  Java        Eclipse
linuxtools     Jun-12 to Nov-17      4,129      11,763  Java        Eclipse
platform.ui    Feb-13 to Nov-17      4,756      14,115  Java        Eclipse
ns_server      Apr-10 to Nov-17      8,944      30,063  JavaScript  Couchbase
testrunner     Oct-10 to Apr-16     10,421      24,207  Python      Couchbase
ep-engine      Feb-11 to Nov-17      6,475      22,885  C++         Couchbase
indexing       Mar-14 to Nov-17      3,240       8,316  Go          Couchbase
java-client    Jan-12 to Nov-17        916       2,635  Java        Couchbase
jvm-core       Apr-14 to Nov-17        841       2,301  Java        Couchbase
spymemcached   May-10 to Jul-17        519       1,383  Java        Couchbase
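As a quick arithmetic check, the per-system figures in the table can be summed and compared with the totals quoted in the welcome text (11 systems, 50,959 code reviews and 144,906 revisions). The numbers below are copied from the table; the variable names are only illustrative.

```python
from collections import Counter

# (system, reviews, revisions, community), copied from the table above.
SYSTEMS = [
    ("egit", 5336, 13211, "Eclipse"),
    ("jgit", 5382, 14027, "Eclipse"),
    ("linuxtools", 4129, 11763, "Eclipse"),
    ("platform.ui", 4756, 14115, "Eclipse"),
    ("ns_server", 8944, 30063, "Couchbase"),
    ("testrunner", 10421, 24207, "Couchbase"),
    ("ep-engine", 6475, 22885, "Couchbase"),
    ("indexing", 3240, 8316, "Couchbase"),
    ("java-client", 916, 2635, "Couchbase"),
    ("jvm-core", 841, 2301, "Couchbase"),
    ("spymemcached", 519, 1383, "Couchbase"),
]

total_reviews = sum(reviews for _, reviews, _, _ in SYSTEMS)
total_revisions = sum(revisions for _, _, revisions, _ in SYSTEMS)
per_community = Counter(community for _, _, _, community in SYSTEMS)

print(len(SYSTEMS), total_reviews, total_revisions)  # 11 50959 144906
print(per_community["Eclipse"], per_community["Couchbase"])  # 4 7
```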

Dataset Structure

The CROP dataset is organised in three main directories: Metadata, Git Repos and Discussion. Details for each directory are as follows:


Metadata

For each software system in CROP, we provide a CSV file that contains the general metadata for each review and revision stored in CROP. The CSV is primarily used to navigate through the data and to locate, for each revision, the code review discussion files and the corresponding versions of the code base in the git repository. Each row in the CSV corresponds to a single revision submitted for code review, where you will find the following information:


id: a unique id to identify the revision within a specific community
review_number: the number of the review that the revision is part of
revision_number: the number of the revision in the specific review
author: the author of the revision
status: the status of the revision
change_id: the change id of this revision
before_commit_id: the commit id that represents the version of the system before the revision took place
after_commit_id: the commit id that represents the version of the system after the revision took place
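The fields above can be read with Python's standard csv module. The following is a minimal sketch that assumes each CSV is comma-separated and starts with a header row using exactly the field names listed; the sample rows are made up for illustration and are not taken from CROP.

```python
import csv
import io
from collections import defaultdict

# Hypothetical two-row sample using the column names listed above;
# the values are invented for illustration only.
SAMPLE_CSV = """id,review_number,revision_number,author,status,change_id,before_commit_id,after_commit_id
1,100,1,alice,MERGED,I0001,aaa111,bbb222
2,100,2,alice,MERGED,I0001,aaa111,ccc333
"""


def load_revisions(csv_file):
    """Read one metadata CSV and group its rows by review number."""
    reviews = defaultdict(list)
    for row in csv.DictReader(csv_file):
        reviews[row["review_number"]].append(row)
    # Order the revisions of each review by their revision number.
    for rows in reviews.values():
        rows.sort(key=lambda r: int(r["revision_number"]))
    return dict(reviews)


reviews = load_revisions(io.StringIO(SAMPLE_CSV))
for revision in reviews["100"]:
    print(revision["revision_number"], revision["after_commit_id"])
```

For a real file, the same function can be given an open file object instead of the in-memory sample.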


Git Repos

CROP provides git repositories that recreate the projects’ reviewing history to include all the revisions submitted for code review. Each repository has a single master branch, where the before and after versions of the source code for each revision were committed sequentially, based on the review and revision numbers. Such versions are accessible through the commit ids provided in the projects’ CSV file, as discussed above.
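Given the before_commit_id and after_commit_id of a revision, the change it introduced can be inspected with a standard git diff between the two commits. The sketch below only wraps the git command line; the function names and the repository path are hypothetical, and git must be installed for diff_revision to run.

```python
import subprocess


def git_diff_args(repo_dir, before_commit_id, after_commit_id):
    """Build the git command that summarises the changes between two snapshots."""
    return ["git", "-C", repo_dir, "diff", "--stat",
            before_commit_id, after_commit_id]


def diff_revision(repo_dir, before_commit_id, after_commit_id):
    """Run the diff inside repo_dir and return git's textual output."""
    result = subprocess.run(
        git_diff_args(repo_dir, before_commit_id, after_commit_id),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if git reports a failure
    )
    return result.stdout
```

A complete snapshot of a revision can be obtained in the same way by running git checkout with the chosen commit id in the repository.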


Discussion

This is the directory in which CROP stores the discussion files for each revision. The directory follows a tree structure, organised by review number, in which the discussion files for each revision are contained in the directory of its respective review.


A discussion file presents the reviewing data in the following order: first, the description of the revision is presented, which denotes the commit message of the revision. Such a message includes the revision’s change-id and author. The comments that were made during review by other developers are presented next. In the discussion file, CROP includes the author of the comment and the respective message.
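As a rough sketch of navigating this tree, the function below lists the discussion files stored for one review. It assumes that each review's directory is named after its review number; the names and format of the individual discussion files are not assumed, and the function name is hypothetical.

```python
from pathlib import Path


def discussion_files(discussion_root, review_number):
    """Return the files under one review's directory, sorted by name.

    Assumes the directory of a review is named after its review number.
    Returns an empty list when no such directory exists.
    """
    review_dir = Path(discussion_root) / str(review_number)
    if not review_dir.is_dir():
        return []
    return sorted(p for p in review_dir.iterdir() if p.is_file())
```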


Disclaimer

CROP was first published and described in this research paper. Since its first publication, CROP has evolved and changed in its content and structure. Although the paper is the official guideline for the CROP dataset, this website will always describe the most up-to-date version of the dataset.


Data Protection

All data in CROP has been extracted from publicly available data from Eclipse and Couchbase. No other data sources have been used.


Pseudonymisation

The authors of CROP are committed to avoiding the publication of any personally identifiable information in the dataset. To this end, we pseudonymised the personal data in CROP to the best of our ability.


We have addressed data anonymisation as follows:

First, we replaced all names in the metadata and code review discussion files with randomly generated names. Second, all email addresses in the discussion were also made anonymous.
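To illustrate the general idea of these two steps, here is a minimal sketch in Python. It is not the procedure used to build CROP: the class, the alias format, the email pattern and the "<email>" placeholder are all assumptions made for this example.

```python
import re
import secrets

# A simple, deliberately loose pattern for email addresses (an assumption
# of this sketch, not a full address grammar).
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class Pseudonymiser:
    """Maps each real name to one stable, randomly generated alias."""

    def __init__(self):
        self._aliases = {}

    def alias(self, name):
        """Return the same random alias every time the same name is seen."""
        if name not in self._aliases:
            self._aliases[name] = "user_" + secrets.token_hex(4)
        return self._aliases[name]

    def scrub(self, text, known_names):
        """Mask email addresses, then replace known names with aliases."""
        text = EMAIL_PATTERN.sub("<email>", text)
        # Replace longer names first so that a short name contained in a
        # longer one does not break the longer match.
        for name in sorted(known_names, key=len, reverse=True):
            text = text.replace(name, self.alias(name))
        return text
```

Keeping one mapping per dataset means that the same person receives the same alias in the metadata and in the discussion files, which preserves the link between a revision's author and their comments.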


Licenses

All software contained in CROP is released under its original license. Indeed, the original license files and headers are included in every version of the systems' code base in CROP.


Opt out

We understand that being part of an open source dataset might have consequences for your online privacy. For this reason (and also to comply with legal data processing requirements), if you recognise your data in the dataset, you can opt out of CROP. If you want to opt out, please send us an email.


Support

To support the users of the dataset, CROP provides a mailing list for users to get in touch with CROP's creators and maintainers.


Feel free to join the mailing list if you have any questions about the data or want to report any inconsistency you find. Suggestions for improvement and general discussions are always welcome.


Contribute to the CROP

The source code used to create CROP is fully available at CROP's GitHub project. The team behind CROP is always open to contributions from users and the community. Become part of CROP now!


Replicate the CROP

CROP's source code can also be used to replicate the dataset. However, the software projects contained in CROP are active, with new code being submitted and reviewed on a regular basis. Therefore, in case of replication, the dataset will always be slightly different from the official version currently on the website.


Please note that the scripts provided for replication will generate a non-anonymised version of the data.


If you have any questions on how to use CROP's source code to replicate the dataset, feel free to contact CROP's team.


CROP's Team

Matheus Paixao
Dr. Jens Krinke
Prof. Mark Harman
Donggyun Han


Cite CROP

If you want to cite CROP in your research paper, please use the following:


Bibtex

@inproceedings{Paixao2018,
author = {Paixao, Matheus and Krinke, Jens and Han, DongGyun and Harman, Mark},
title = {CROP: Linking Code Reviews to Source Code Changes},
booktitle = {International Conference on Mining Software Repositories},
series = {MSR},
year = {2018},
}


Formatted Citation

Matheus Paixao, Jens Krinke, DongGyun Han, and Mark Harman. 2018. CROP: Linking Code Reviews to Source Code Changes. In International Conference on Mining Software Repositories (MSR).


Publications

This is a list of publications that use the CROP dataset. If you have a published piece of work that uses CROP and it is not listed below, feel free to contact us. Your publication will be included in the list soon.


Journal Articles

Matheus Paixao, Jens Krinke, DongGyun Han, Chaiyong Ragkhitwetsagul, Mark Harman. 2019. In IEEE Transactions on Software Engineering (TSE). Preprint


Conference Papers

Luca Pascarella, Davide Spadini, Fabio Palomba, Alberto Bacchelli. 2020. On The Effect Of Code Review On Code Smells. In IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). Preprint


Matheus Paixao, Paulo Henrique Maia. 2019. Rebasing in Code Review Considered Harmful: A Large-scale Empirical Investigation. In International Conference on Source Code Analysis and Manipulation (SCAM). Preprint


Matheus Paixao, Jens Krinke, DongGyun Han, and Mark Harman. 2018. CROP: Linking Code Reviews to Source Code Changes. In International Conference on Mining Software Repositories (MSR). Preprint


Contact

You can contact CROP's team through the following channels:


CROP's mailing list
CROP's GitHub project