
Research Projects 

The mission of the Software Evolution and Analysis Laboratory is to improve developer productivity.
 

Automated Debugging and Testing for Big Data Analytics



An abundance of data in science, engineering, national security, and health care has led to the emerging field of big data analytics. The current big data computing model lacks the kinds of debugging features found in traditional desktop computing, forcing data scientists to debug by trial and error. To address this challenge, we designed interactive debugging, data provenance, delta debugging, taint analysis, flow analysis, symbolic execution, and fuzz testing for Apache Spark.
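As an illustration of one of the techniques listed above, the following is a minimal sketch of delta debugging applied to input records. It is a simplified variant of the classic ddmin algorithm, not the lab's Spark tooling: the name `ddmin`, the `fails` callback, and the use of a plain Python list in place of a distributed dataset are all assumptions made for this example.

```python
def ddmin(records, fails):
    """Shrink `records` to a smaller list for which `fails` still returns True.

    `fails` is a hypothetical callback that reruns the job on a candidate
    input and reports whether the failure is still observed.
    """
    assert fails(records), "the full input must reproduce the failure"
    n = 2  # current number of partitions
    while len(records) >= 2:
        chunk = max(1, len(records) // n)
        subsets = [records[i:i + chunk] for i in range(0, len(records), chunk)]
        reduced = False
        for i, subset in enumerate(subsets):
            # Everything except the i-th subset.
            complement = [r for j, s in enumerate(subsets) if j != i for r in s]
            if fails(subset):
                records, n, reduced = subset, 2, True
                break
            if fails(complement):
                records, n, reduced = complement, max(n - 1, 2), True
                break
        if not reduced:
            if n >= len(records):
                break  # already at single-record granularity
            n = min(len(records), n * 2)  # retry with finer partitions
    return records


# Example: the failure is triggered by any negative record.
culprit = ddmin([3, 7, -1, 4, 9, 2], lambda rs: any(r < 0 for r in rs))
```

In a real big data setting each call to `fails` would rerun a job, so the number of reruns, not the list manipulation, dominates the cost.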

Domain-Aware Natural Test Generation


Despite the availability of automated test generation techniques such as fuzzing, symbolic execution, random testing, search-based testing, and property-based testing, it is difficult for a developer to comprehend generated tests and act on them. While developers desire domain-aware test inputs that preserve grammars and well-formedness semantics, there is no easy means to bootstrap an efficient and effective domain-aware test generator. Building on our experience designing domain-aware test generators, our goal is to automatically synthesize, with the help of generative AI, a domain-aware test generator that produces diverse synthetic tests and helps developers root-cause why tests fail.
  • DepFuzz: Co-Dependence Aware Mutation for Big Data Analytics, FSE 2023
  • NaturalFuzz: Mix and Match Mutation for Big Data Analytics, ASE 2023
  • HFuzz: Fuzzing with and for Hard Acceleration, FSE 2023
  • Sibyl: Sibylvariant Transformations for Text Data, ACL 2022


Software Developer Tools for Heterogeneous Computing Applications

Specialized hardware accelerators like GPUs and FPGAs have become a prominent part of the current computing landscape. However, developing heterogeneous applications is limited to a small subset of programmers with specialized hardware knowledge. To democratize heterogeneous computing, our goal is to design a new wave of refactoring, testing, and debugging tools for heterogeneous application development.

Mining, Assessing, and Visualizing Code Examples at Internet Scale

There is a growing interest in leveraging large collections of open-source repositories such as GitHub. Currently, it is difficult for a user to understand the commonalities and variances among a massive number of related code examples.
To tackle the new frontier of mining software repositories research, we design ultra-scale API usage mining, interactive visualization, code search, and recommendation.

Java Bytecode Debloating for Size Reduction and Security 

Modern software is bloated. Demand for new functionality has led developers to include more and more features, many of which become unneeded or unused. This phenomenon, known as software bloat, results in software consuming more resources and an unnecessary increase in attack surfaces.

To address this problem, we developed an end-to-end bytecode debloating framework called JDebloat. It augments traditional static reachability analysis with dynamic profiling, and it accounts for new dynamic language features in modern Java. This work is motivated and sponsored by the Office of Naval Research Total Protection Cyber Platform program and has made a tech transfer impact on the Navy. Information on debloating can be found here.
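The idea of augmenting static reachability with dynamic profiling can be sketched as follows. This is an illustrative toy, not JDebloat's implementation: the call graph is a hypothetical dictionary of method names, and the function names and the choice to treat profiled methods as extra entry points are assumptions made for this example.

```python
from collections import deque


def reachable_methods(call_graph, entry_points):
    """Breadth-first static reachability over a call graph (method -> callees)."""
    seen = set(entry_points)
    queue = deque(entry_points)
    while queue:
        method = queue.popleft()
        for callee in call_graph.get(method, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen


def methods_to_keep(call_graph, entry_points, profiled_methods):
    """Treat methods observed at run time as additional entry points."""
    return reachable_methods(call_graph, list(entry_points) + list(profiled_methods))


# Hypothetical program: C.viaReflection is invoked reflectively, so no static
# call edge leads to it, but a profiling run observed it being executed.
call_graph = {
    "Main.main": ["A.run"],
    "A.run": ["B.helper"],
    "C.viaReflection": ["D.inner"],
    "E.unused": [],
}
all_methods = {"Main.main", "A.run", "B.helper", "C.viaReflection", "D.inner", "E.unused"}

static_only = all_methods - reachable_methods(call_graph, ["Main.main"])
with_profile = all_methods - methods_to_keep(call_graph, ["Main.main"], {"C.viaReflection"})
```

Here `static_only` would mark the reflectively invoked methods as removable, while `with_profile` removes only the method that is neither statically reachable nor observed at run time.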

Data Scientists in Software Teams: Backgrounds, Activities, Tools, Challenges and Best Practices 

I initiated an academia-industry coalition to investigate the emerging role of data scientists. We conducted an in-depth study of the emerging roles of data scientists, followed by a large-scale survey of 793 professional data scientists.
This quantification and sub-categorization of data scientists is important: although many companies are hiring data scientists and universities are creating new graduate programs, we lack a scientific understanding of who data scientists are.
  • The Emerging Roles of Data Scientists, ICSE 2016
  • A Large Scale Survey with 793 Data Scientists, TSE 2018

Code Clone Detection, Management, and Removal 


Code duplication created by copy and paste is common in large software, and changing software often requires systematic edits: similar but not identical enhancements, refactorings, and bug fixes to many similar methods. We developed novel techniques for example-based program transformation, clone removal, differential testing, and code review.
  • Clone Transplantation and Differential Testing, ICSE 2017
  • Interactive Clone Search, ICSE 2015
  • Learning Transformation from Multiple Examples, ICSE 2013
  • Generating Transformation from a Single Example, PLDI 2011
The following techniques find copy-and-paste bugs and reconstruct clone evolution.

Refactoring Automation, Inspection, Testing, and Studies 


Refactoring is a technique for cleaning up legacy code to ease bug fixes or feature additions. To create a scientific foundation for refactoring, we quantified the impact of a multi-year Windows re-architecting effort: we analyzed version history data, conducted a survey of over 300 developers, and interviewed the architects and development leads to assess the impact of refactoring on size, churn, complexity, test coverage, failure, and organization metrics.

  • A Field Study of Refactoring at Microsoft, FSE 2012, TSE 2014
  • API Refactoring and Bug Fixes, ICSE 2011, Nominated for ACM SIGSOFT Distinguished Paper Award
  • API Stability and Adoption, ICSM 2013, Most Influential Paper Award from ICSME 2013
The following techniques find refactoring bugs.

Logical Program Differencing

We invented a suite of analysis tools that can help programmers investigate code modifications. We also developed RefFinder, a logic-query approach to refactoring reconstruction. Our insight was that the skeleton of refactoring edits can be expressed as a logical constraint.
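To make the idea of a refactoring skeleton as a logical constraint concrete, here is a small sketch of a "move method" rule evaluated over facts extracted from two program versions. The fact schema (class name, method name, body) and the rule itself are assumptions made for this example and do not reproduce RefFinder's actual predicates or rule set.

```python
def find_move_method(old_methods, new_methods):
    """Return (method, from_class, to_class) triples satisfying the rule:

        deleted(c1, m, body) AND added(c2, m, body) AND c1 != c2

    Each fact is a hypothetical (class_name, method_name, body) tuple.
    """
    deleted = old_methods - new_methods  # facts true only in the old version
    added = new_methods - old_methods    # facts true only in the new version
    moves = []
    for (c1, m1, b1) in deleted:
        for (c2, m2, b2) in added:
            if m1 == m2 and b1 == b2 and c1 != c2:
                moves.append((m1, c1, c2))
    return sorted(moves)


# Hypothetical fact bases for two versions of a program.
old_version = {
    ("Order", "total", "return sum(items)"),
    ("Order", "show", "print(self)"),
}
new_version = {
    ("Invoice", "total", "return sum(items)"),
    ("Order", "show", "print(self)"),
}
moves = find_move_method(old_version, new_version)
```

A logic-query engine would express the same rule declaratively and let it compose with others; the nested loops here only make the constraint explicit.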