Publications
-
Towards Neural Synthesis for SMT-Assisted Proof-Oriented Programming, S. Chakraborty, G. Ebner, S. Bhat, S. Fakhoury, S. Fatima, S. Lahiri, N. Swamy. Accepted to be published at the 47th International Conference on Software Engineering (ICSE'25). [abstract] [pdf] [slide] [code]
Abstract : Proof-oriented programs mix computational content with proofs of program correctness. However, the human effort involved in programming and proving is still substantial, despite the use of Satisfiability Modulo Theories (SMT) solvers to automate proofs in languages such as F⋆. Seeking to spur research on using AI to automate the construction of proof-oriented programs, we curate a dataset of 600K lines of open-source F⋆ programs and proofs, including software used in production systems ranging from Windows and Linux to Python and Firefox. Our dataset includes around 32K top-level F⋆ definitions, each representing a type-directed program and proof synthesis problem: producing a definition given a formal specification expressed as an F⋆ type. We provide a program-fragment checker that queries F⋆ to check the correctness of candidate solutions. We believe this is the largest corpus of SMT-assisted program proofs coupled with a reproducible program-fragment checker. Grounded in this dataset, we investigate the use of AI to synthesize programs and their proofs in F⋆, with promising results. Our main finding is that the performance of fine-tuned smaller language models (such as Phi-2 or StarCoder) compares favorably with that of large language models (such as GPT-4), at a much lower computational cost. We also identify various type-based retrieval augmentation techniques and find that they boost performance significantly. With detailed error analysis and case studies, we identify potential strengths and weaknesses of models and techniques and suggest directions for future improvements.
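To make the check loop concrete, here is a minimal Python sketch of how a program-fragment checker might validate a candidate definition by invoking the F⋆ type checker. The `fstar.exe`-on-PATH invocation and the `check_fstar_candidate` harness are illustrative assumptions; the paper's actual checker replays each definition in its original build context.

```python
import subprocess
import tempfile
import textwrap
from pathlib import Path

def check_fstar_candidate(module_name: str, candidate: str) -> bool:
    """Write a candidate F* definition into a fresh module and ask the
    F* type checker to verify it. Returns True iff checking succeeds."""
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / f"{module_name}.fst"  # file name must match module
        source.write_text(f"module {module_name}\n\n{candidate}\n")
        result = subprocess.run(
            ["fstar.exe", str(source)],  # assumes the F* binary is on PATH
            capture_output=True, text=True, timeout=60,
        )
        return result.returncode == 0

# Example: a definition whose type doubles as its specification.
candidate = textwrap.dedent("""
    let rec factorial (n: nat) : nat =
      if n = 0 then 1 else n * factorial (n - 1)
""")
print(check_fstar_candidate("Scratch", candidate))
```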
-
Ranking LLM-Generated Loop Invariants for Program Verification, S. Chakraborty, S. K Lahiri, S. Fakhoury, M. Musuvathi, A. Lal, A. Rastogi, A. Senthilnathan, R. Sharma, N. Swamy. Published in the Findings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP-Findings'23). [abstract] [pdf] [slide] [code]
-
GrACE: Generation using Associated Code Edits, P. Gupta, A. Khare, Y. Bajpai, S. Chakraborty, S. Gulwani, A. Kanade, A. Radhakrishna, G. Soares, A. Tiwari. Published at the ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE'23). [abstract] [pdf] [slide] [code]
-
CONCORD: Clone-aware Contrastive Learning for Source Code, Y. Ding, S. Chakraborty, L. Buratti, S. Pujar, A. Morari, G. Kaiser, B. Ray. Published at the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA'23). [Won ACM SIGSOFT Distinguished Paper Award]. [abstract] [pdf] [slide] [code]
Abstract : Deep Learning (DL) models to analyze source code have shown immense promise during the past few years. More recently, self-supervised pre-training has gained traction for learning generic code representations valuable for many downstream SE tasks, such as clone and bug detection. While previous work successfully learned from different code abstractions (e.g., token, AST, graph), we argue that it is also essential to factor in how developers code day-to-day when learning general-purpose representations. On the one hand, human developers tend to write repetitive programs referencing existing code snippets from the current codebase or online resources (e.g., the Stack Overflow website) rather than implementing functions from scratch; such behaviors result in a vast number of code clones. On the other hand, a deviant clone introduced by mistake might trigger malicious program behaviors. Thus, as a proxy to incorporate developers' coding behavior into the pre-training scheme, we propose to include code clones and their deviants. In particular, we propose CONCORD, a self-supervised pre-training strategy to place benign clones closer in the representation space while moving deviants further apart. We show that CONCORD's clone-aware pre-training drastically reduces the need for expensive pre-training resources while improving the performance of downstream SE tasks. We also empirically demonstrate that CONCORD can improve existing pre-trained models to learn better representations that consequently become more efficient in both identifying semantically equivalent programs and differentiating buggy from non-buggy code.
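As an illustration of the clone-aware objective described above, here is a minimal PyTorch sketch of an InfoNCE-style loss that pulls a benign clone's embedding toward its anchor and pushes the buggy deviant away. It is a simplified assumption for illustration, not CONCORD's exact loss.

```python
import torch
import torch.nn.functional as F

def clone_aware_contrastive_loss(anchor, clone, deviant, temperature=0.07):
    """Treat the benign clone as the positive and the deviant as the
    negative for each anchor program embedding."""
    anchor = F.normalize(anchor, dim=-1)
    clone = F.normalize(clone, dim=-1)
    deviant = F.normalize(deviant, dim=-1)
    pos = (anchor * clone).sum(-1) / temperature    # similarity to the clone
    neg = (anchor * deviant).sum(-1) / temperature  # similarity to the deviant
    logits = torch.stack([pos, neg], dim=-1)
    labels = torch.zeros(anchor.size(0), dtype=torch.long)  # clone is class 0
    return F.cross_entropy(logits, labels)

# Toy usage with random "encoder outputs" for a batch of 4 programs.
a, c, d = (torch.randn(4, 256) for _ in range(3))
print(clone_aware_contrastive_loss(a, c, d))
```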
-
On ML-Based Program Translation: Perils and Promises. Aniketh Malyala, Katelyn Zhou, Baishakhi Ray, Saikat Chakraborty. Accepted to be published at the New Ideas and Emerging Results track of the 45th International Conference on Software Engineering (ICSE-NIER'23). [abstract] [pdf] [slide] [code]
Abstract : With the advent of new and advanced programming languages, it becomes imperative to migrate legacy software to new programming languages. Unsupervised Machine Learning-based program translation could play an essential role in such migration, even without a sufficiently sizeable reliable corpus of parallel source code. However, these translators are far from perfect due to their statistical nature. This work investigates unsupervised program translators, examining where and why they fail. Through in-depth error analysis of such failures, we have identified that the cases where such translators fail follow a few particular patterns. With this insight, we develop a rule-based program mutation engine, which pre-processes the input code if the input follows specific patterns and post-processes the output if the output follows certain patterns. We show that our code processing tool, in conjunction with the program translator, can form a hybrid program translator and significantly improve the state-of-the-art. In the future, we envision an end-to-end program translation tool where programming domain knowledge can be embedded into an ML-based translation pipeline using pre- and post-processing steps.
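The hybrid pipeline can be pictured as a thin rule layer wrapped around the neural translator. Below is a minimal Python sketch; the regex rules and the stubbed `java_to_python_stub` translator are illustrative assumptions, not the paper's mined patterns or underlying model.

```python
import re

def java_to_python_stub(code: str) -> str:
    # Stand-in for an unsupervised neural translator; it just echoes
    # its input so the sketch stays runnable.
    return code

PRE_RULES = [
    # Hypothetical rule: rewrite an increment the model tends to mishandle.
    (re.compile(r"(\w+)\s*\+\+\s*;"), r"\1 += 1;"),
]
POST_RULES = [
    # Hypothetical rule: strip stray trailing semicolons from emitted Python.
    (re.compile(r";\s*$", re.MULTILINE), ""),
]

def hybrid_translate(code: str) -> str:
    """Apply pattern-triggered rewrites before and after the statistical
    translator, mirroring the hybrid pipeline the paper proposes."""
    for pattern, repl in PRE_RULES:
        code = pattern.sub(repl, code)
    out = java_to_python_stub(code)
    for pattern, repl in POST_RULES:
        out = pattern.sub(repl, out)
    return out

print(hybrid_translate("count++;"))
```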
-
Summarize and Generate to Back-translate: Unsupervised Translation of Programming Languages. Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang. Accepted to be published at the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL'23). [abstract] [pdf] [slide] [code]
Abstract : Back-translation is widely known for its effectiveness in neural machine translation when there is little to no parallel data. In this approach, a source-to-target model is coupled with a target-to-source model trained in parallel. The target-to-source model generates noisy sources, while the source-to-target model is trained to reconstruct the targets, and vice versa. Recently developed multilingual pre-trained sequence-to-sequence models for programming languages have proven very effective for a broad spectrum of downstream software engineering tasks. Hence, training them to build programming language translation systems via back-translation is compelling. However, these models cannot be further trained via back-translation since they learn to output sequences in the same language as the inputs during pre-training. As an alternative, we propose performing back-translation via code summarization and generation. In code summarization, a model learns to generate natural language (NL) summaries given code snippets. In code generation, the model learns to do the opposite. Therefore, target-to-source generation in back-translation can be viewed as target-to-NL-to-source generation. We show that our proposed approach performs competitively with state-of-the-art methods. We have made the code publicly available.
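A minimal sketch of the proposed target-to-NL-to-source loop follows; the `summarize` and `generate` stubs are assumptions standing in for the actual summarization and generation models, so the sketch stays runnable.

```python
def summarize(code: str) -> str:
    # Stand-in for a code-to-NL summarization model.
    return "returns the sum of two numbers"

def generate(summary: str, language: str) -> str:
    # Stand-in for an NL-to-code generation model.
    return {"java": "int add(int a, int b) { return a + b; }",
            "python": "def add(a, b):\n    return a + b"}[language]

def back_translate_via_nl(target_code: str, source_language: str):
    """Summarize the target program, then generate a (noisy) source
    program from the summary. The pair (noisy source, original target)
    becomes synthetic parallel data for the source-to-target translator."""
    summary = summarize(target_code)
    noisy_source = generate(summary, source_language)
    return noisy_source, target_code

print(back_translate_via_nl("def add(a, b):\n    return a + b", "java"))
```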
-
AVATAR: A Parallel Corpus for Java-Python Program Translation. Wasi Uddin Ahmad, Md Golam Rahman Tushar, Saikat Chakraborty, Kai-Wei Chang. Accepted to be published in the Findings of the Association for Computational Linguistics (ACL-Findings'23). [abstract] [pdf] [slide] [code]
Abstract : Program translation refers to migrating source code from one programming language to another. It has tremendous practical value in software development, as porting software across languages is time-consuming and costly. Automating program translation is of paramount importance in software migration, and recently researchers explored unsupervised approaches due to the unavailability of parallel corpora. However, the availability of pre-trained language models for programming languages enables supervised fine-tuning with a small number of labeled examples. Therefore, we present AVATAR, a collection of 9,515 programming problems and their solutions written in two popular languages, Java and Python. AVATAR is collected from competitive programming sites, online platforms, and open-source repositories. Furthermore, AVATAR includes unit tests for 250 examples to facilitate functional correctness evaluation. We benchmark several pre-trained language models fine-tuned on AVATAR. Experimental results show that the models still fall short of generating functionally accurate code.
-
NatGen: Generative pre-training by "Naturalizing" source code. Saikat Chakraborty, Toufique Ahmed, Yangruibo Ding, Premkumar Devanbu, and Baishakhi Ray. Published and presented at the ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE'22). [abstract] [pdf] [slide] [code]
Abstract : Pre-trained generative language models (e.g., PLBART, CodeT5, SPT-Code) for source code have yielded strong results on several tasks in the past few years, including code generation and translation. These models have adopted varying pre-training objectives to learn the statistics of code construction from very large-scale corpora in a self-supervised fashion; the success of pre-trained models largely hinges on these pre-training objectives. This paper proposes a new pre-training objective, "naturalizing" source code, exploiting code's bimodal, dual-channel (formal & natural channels) nature. Unlike natural language, code's dual-channel nature allows us to generate semantically equivalent code at scale. We introduce six classes of semantic-preserving transformations to produce un-natural forms of code, and then train our model to recover the more natural, original programs written by developers. Learning to generate equivalent but more natural code, at scale, over large corpora of open-source code, without explicit manual supervision, helps the model learn to both ingest & generate code. We fine-tune our model on three generative Software Engineering tasks: code generation, code translation, and code refinement with limited human-curated labeled data, and achieve state-of-the-art performance rivaling CodeT5. We show that our pre-trained model is especially competitive at zero-shot and few-shot learning, and better at learning code properties (e.g., syntax, data flow).
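To illustrate one semantic-preserving transformation of this kind, here is a toy Python sketch that "un-naturalizes" augmented assignments; the model would then be trained to recover the original, more natural form. This particular rewrite rule is our own illustrative example, not necessarily one of the paper's six transformation classes.

```python
import ast

class UnAugment(ast.NodeTransformer):
    """Rewrite `x += e` as `x = x + e`, a semantics-preserving
    'un-naturalizing' transformation. (Safe for simple name targets;
    in-place operators on mutable objects would need more care.)"""
    def visit_AugAssign(self, node):
        if not isinstance(node.target, ast.Name):
            return node  # leave attribute/subscript targets unchanged
        return ast.copy_location(
            ast.Assign(
                targets=[ast.Name(id=node.target.id, ctx=ast.Store())],
                value=ast.BinOp(
                    left=ast.Name(id=node.target.id, ctx=ast.Load()),
                    op=node.op,
                    right=node.value)),
            node)

src = "total = 0\nfor x in range(5):\n    total += x\n"
tree = UnAugment().visit(ast.parse(src))
ast.fix_missing_locations(tree)
print(ast.unparse(tree))  # the model learns to map this back to `src`
```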
-
Towards Learning (Dis)-Similarity of Source Code from Program Contrasts. Yangruibo Ding, Luca Buratti, Saurabh Pujar, Alessandro Morari, Baishakhi Ray, and Saikat Chakraborty. Published in the 60th Annual Meeting of the Association for Computational Linguistics (ACL'22). [abstract] [pdf] [slide] [code]
Abstract : Understanding the functional (dis)-similarity of source code is significant for code modeling tasks such as software vulnerability and code clone detection. We present DISCO (DISsimilarity of COde), a novel self-supervised model focusing on identifying (dis)similar functionalities of source code. Different from existing works, our approach does not require a huge amount of randomly collected datasets. Rather, we design structure-guided code transformation algorithms to generate synthetic code clones and inject real-world security bugs, augmenting the collected datasets in a targeted way. We propose to pre-train the Transformer model with such automatically generated program contrasts to better identify similar code in the wild and differentiate vulnerable programs from benign ones. To better capture the structural features of source code, we propose a new cloze objective to encode the local tree-based context (e.g., parent or sibling nodes). We pre-train our model with a much smaller dataset, the size of which is only 5% of the state-of-the-art models' training datasets, to illustrate the effectiveness of our data augmentation and the pre-training approach. The evaluation shows that, even with much less data, DISCO can still outperform the state-of-the-art models in vulnerability and code clone detection tasks.
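In the spirit of the structure-guided transformations above, the following toy Python sketch builds a program contrast from one snippet: a behavior-preserving clone (positive) and an injected-bug variant (negative). Both transformations are illustrative assumptions, far simpler than DISCO's.

```python
import ast

def make_clone(src: str) -> str:
    """Synthetic positive: rename a variable throughout, preserving
    behavior (a toy stand-in for structure-guided clone generation)."""
    tree = ast.parse(src)
    for node in ast.walk(tree):
        if isinstance(node, ast.Name) and node.id == "total":
            node.id = "acc"
    return ast.unparse(tree)

class FlipComparison(ast.NodeTransformer):
    """Synthetic negative: inject an off-by-one-style bug by turning
    `<` into `<=`, loosely mimicking real-world bug injection."""
    def visit_Compare(self, node):
        node.ops = [ast.LtE() if isinstance(op, ast.Lt) else op
                    for op in node.ops]
        return node

src = "total = 0\ni = 0\nwhile i < 10:\n    total += i\n    i += 1"
positive = make_clone(src)                                      # equivalent
negative = ast.unparse(FlipComparison().visit(ast.parse(src)))  # buggy
print(positive, negative, sep="\n---\n")
```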
-
Retrieval Augmented Code Generation and Summarization. M. Rizwan Parvez, Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, & Kai-Wei Chang (2021). Accepted to be published in the Findings of Empirical Methods in Natural Language Processing (EMNLP-Findings'21). [abstract] [pdf] [slide] [code]
Abstract : Software developers write a lot of source code and documentation during software development. Intrinsically, developers often recall parts of source code or code summaries that they had written in the past while implementing software or documenting them. To mimic developers' code or summary generation behavior, we propose a retrieval augmented framework, REDCODER, that retrieves relevant code or summaries from a retrieval database and provides them as a supplement to code generation or summarization models. REDCODER is unique in two respects. First, it extends the state-of-the-art dense retrieval technique to search for relevant code or summaries. Second, it can work with retrieval databases that include unimodal (only code or natural language description) or bimodal instances (code-description pairs). We conduct experiments and extensive analysis on two benchmark datasets of code generation and summarization in Java and Python, and the promising results endorse the effectiveness of our proposed retrieval augmented framework.
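A minimal sketch of the retrieve-then-augment step: embed the query, fetch the nearest database entries, and prepend them to the generator's input. The toy character-count embedding below is an assumption standing in for REDCODER's learned dense retriever.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Stand-in for a learned bi-encoder: a normalized byte histogram.
    vec = np.zeros(128)
    for ch in text.encode():
        vec[ch % 128] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

database = [
    "def binary_search(arr, key): ...",
    "def quick_sort(arr): ...",
    "def fibonacci(n): ...",
]

def retrieve_and_augment(query: str, k: int = 1) -> str:
    """Retrieve the top-k most similar database entries and prepend
    them to the query, forming the augmented generator input."""
    sims = [float(embed(query) @ embed(doc)) for doc in database]
    top = sorted(range(len(database)), key=lambda i: -sims[i])[:k]
    retrieved = "\n".join(database[i] for i in top)
    return f"{retrieved}\n# --- query ---\n{query}"

print(retrieve_and_augment("sort an array with quicksort"))
```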
-
On Multi-Modal Learning of Editing Source Code. Saikat Chakraborty & Baishakhi Ray (2021). Accepted to be published at the 36th IEEE/ACM International Conference on Automated Software Engineering (ASE'21). [abstract] [pdf] [slide] [code]
Abstract : In recent years, Neural Machine Translator (NMT) models have shown promise in automatically editing source code. A typical NMT-based code editor considers only the code that needs to be changed as input and presents developers with a ranked list of patched code to choose from, where the correct one may not always be at the top of the list. While NMT-based code editing systems generate a broad spectrum of plausible patches, the correct one depends on the developers' requirements and often on the context where the patch is applied. Thus, NMT models can benefit if developers provide hints in natural language or supply the patch context.
As a proof of concept, in this research, we leverage three modalities of information: edit location, edit code context, and commit messages (as a proxy for developers' hints in natural language) to automatically generate edits with NMT models. To that end, we build MODIT, a multi-modal NMT-based code editing engine. With in-depth investigation and analysis, we show that developers' hints as an input modality can narrow the search space for patches and help outperform state-of-the-art models by generating correctly patched code in the top-1 position.
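A minimal sketch of how the three modalities might be serialized into a single encoder input; the `<s>` separator and field order are illustrative assumptions, not MODIT's exact serialization.

```python
def build_modit_input(code_to_edit: str, context: str, commit_msg: str) -> str:
    """Concatenate the three input modalities (code to edit, commit
    message, surrounding context) into one sequence for the encoder."""
    return f"{code_to_edit} <s> {commit_msg} <s> {context}"

print(build_modit_input(
    code_to_edit="return a - b;",
    context="int add(int a, int b) { return a - b; }",
    commit_msg="fix add to return the sum instead of the difference",
))
```
-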
Deep learning based vulnerability detection: Are we there yet? Saikat Chakraborty, Rahul Krishna, Yangruibo Ding, & Baishakhi Ray (2021). Published in IEEE Transactions on Software Engineering (TSE). [abstract] [pdf] [slide] [code]
Abstract : Automated detection of software vulnerabilities is a fundamental problem in software security. Existing program analysis techniques either suffer from high false positives or false negatives. Recent progress in Deep Learning (DL) has resulted in a surge of interest in applying DL for automated vulnerability detection. Several recent studies have demonstrated promising results achieving an accuracy of up to 95% at detecting vulnerabilities. In this paper, we ask, “how well do the state-of-the-art DL-based techniques perform in a real-world vulnerability prediction scenario?”. To our surprise, we find that their performance drops by more than 50%. A systematic investigation of what causes such precipitous performance drop reveals that existing DL-based vulnerability prediction approaches suffer from challenges with the training data (e.g., data duplication, unrealistic distribution of vulnerable classes, etc.) and with the model choices (e.g., simple token-based models). As a result, these approaches often do not learn features related to the actual cause of the vulnerabilities. Instead, they learn unrelated artifacts from the dataset (e.g., specific variable/function names, etc.). Leveraging these empirical findings, we demonstrate how a more principled approach to data collection and model design, based on realistic settings of vulnerability prediction, can lead to better solutions. The resulting tools perform significantly better than the studied baseline—up to 33.57% boost in precision and 128.38% boost in recall compared to the best performing model in the literature. Overall, this paper elucidates existing DL-based vulnerability prediction systems’ potential issues and draws a roadmap for future DL-based vulnerability prediction research.
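As one example of the training-data issues discussed above, here is a minimal sketch of near-duplicate filtering by hashing normalized code. The normalization is a simple illustrative stand-in, not the paper's exact deduplication procedure.

```python
import hashlib
import re

def normalized_fingerprint(code: str) -> str:
    """Hash code after stripping comments and whitespace, so textually
    trivial variants collapse to the same key."""
    code = re.sub(r"/\*.*?\*/", "", code, flags=re.DOTALL)  # block comments
    code = re.sub(r"//[^\n]*", "", code)                    # line comments
    code = re.sub(r"\s+", "", code)                         # all whitespace
    return hashlib.sha256(code.encode()).hexdigest()

samples = [
    "int add(int a, int b) { return a + b; }",
    "int add(int a, int b) {\n  return a + b;  // sum\n}",
]
unique = {normalized_fingerprint(s): s for s in samples}
print(len(samples), "samples ->", len(unique), "after deduplication")
```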
-
Unified Pre-training for Program Understanding and Generation. Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, & Kai-Wei Chang (2021). Published in the 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL'21). [abstract] [pdf] [slide] [code]
Abstract : Code summarization and generation empower conversion between programming language (PL) and natural language (NL), while code translation avails the migration of legacy code from one PL to another. This paper introduces PLBART, a sequence-to-sequence model capable of performing a broad spectrum of program and language understanding and generation tasks. PLBART is pre-trained on an extensive collection of Java and Python functions and associated NL text via denoising autoencoding. Experiments on code summarization in the English language, code generation, and code translation in seven programming languages show that PLBART outperforms or rivals state-of-the-art models. Moreover, experiments on discriminative tasks, e.g., program repair, clone detection, and vulnerable code detection, demonstrate PLBART's effectiveness in program understanding. Furthermore, analysis reveals that PLBART learns program syntax, style (e.g., identifier naming convention), and logical flow (e.g., an if block inside an else block is equivalent to an else if block), all of which are crucial to program semantics, and thus excels even with limited annotations.
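A minimal sketch of the denoising idea: corrupt the input token sequence and train the model to reconstruct the original. The uniform token masking below illustrates only one of PLBART's noising strategies (it also uses token deletion and span infilling), so treat it as a simplified assumption.

```python
import random

def add_denoising_noise(tokens, mask_rate=0.35, mask_token="<mask>"):
    """Randomly mask tokens; the pre-training objective is then to
    reconstruct the original, uncorrupted sequence."""
    random.seed(0)  # deterministic for the example
    return [mask_token if random.random() < mask_rate else t for t in tokens]

tokens = "def add ( a , b ) : return a + b".split()
print(add_denoising_noise(tokens))
```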
-
Codit: Code editing with tree-based neural models. Saikat Chakraborty, Yangruibo Ding, Miltos Allamanis, & Baishakhi Ray (2020). Published in IEEE Transactions on Software Engineering (TSE). [abstract] [pdf] [slide] [code]
Abstract : The way developers edit day-to-day code tends to be repetitive, often reusing existing code elements. Many researchers have tried to automate repetitive code changes by learning from specific change templates applied to a limited scope. The advancement of deep neural networks and the availability of vast open-source evolutionary data open up the possibility of automatically learning those templates from the wild. However, deep neural network based modeling for code changes, and for code in general, introduces some specific problems that need attention from the research community. For instance, compared to natural language, source code vocabulary can be significantly larger. Further, good changes in code do not break its syntactic structure. Thus, deploying state-of-the-art neural network models without adapting the methods to the source code domain yields sub-optimal results. To this end, we propose a novel tree-based neural network system to model source code changes and learn code change patterns from the wild. Specifically, we propose a tree-based neural machine translation model to learn the probability distribution of changes in code. We realize our model with a change suggestion engine, CODIT, train it on more than 24k real-world changes, and evaluate it on 5k patches. Our evaluation shows the effectiveness of CODIT in learning and suggesting patches. CODIT can also learn specific bug-fix patterns from bug-fixing patches and can fix 25 of the 80 bugs in Defects4J.
-
A transformer-based approach for source code summarization. Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, & Kai-Wei Chang (2020). Published in the 58th Annual Meeting of the Association for Computational Linguistics (ACL'20). [abstract] [pdf] [slide] [code]
Abstract : Generating a readable summary that describes the functionality of a program is known as source code summarization. In this task, learning code representation by modeling the pairwise relationship between code tokens to capture their long-range dependencies is crucial. To learn code representation for summarization, we explore the Transformer model, which uses a self-attention mechanism and has been shown to be effective in capturing long-range dependencies. In this work, we show that, although the approach is simple, it outperforms the state-of-the-art techniques by a significant margin. We perform extensive analysis and ablation studies that reveal several important findings, e.g., that absolute encoding of source code tokens' positions hinders summarization performance, while relative encoding significantly improves it. We have made our code publicly available to facilitate future research.
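The relative encoding favored by the ablations can be sketched as a clipped offset matrix in the style of Shaw et al. (2018), which this line of work builds on; a minimal NumPy illustration (the clipping distance is an illustrative choice):

```python
import numpy as np

def relative_positions(seq_len: int, max_distance: int = 4) -> np.ndarray:
    """Entry (i, j) indexes a learned embedding by the clipped offset
    j - i, so attention depends on relative rather than absolute
    token positions."""
    idx = np.arange(seq_len)
    rel = np.clip(idx[None, :] - idx[:, None], -max_distance, max_distance)
    return rel + max_distance  # shift to non-negative embedding indices

print(relative_positions(6))
```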
-
Toward Optimal Selection of Information Retrieval Models for Software Engineering Tasks. M. Masudur Rahman, Saikat Chakraborty, Gail Kaiser, & Baishakhi Ray (2019, September). Published in the 19th IEEE International Working Conference on Source Code Analysis and Manipulation (SCAM'19), pp. 127-138. [pdf] [slide] [code]
-
Building language models for text with named entities. M. Rizwan Parvez, Saikat Chakraborty, Baishakhi Ray, & Kai-Wei Chang (2018). Published in the 56th Annual Meeting of the Association for Computational Linguistics (ACL'18). [pdf] [slide] [code]
-
Which similarity metric to use for software documents? A study on information retrieval based software engineering tasks. M. Masudur Rahman, Saikat Chakraborty, & Baishakhi Ray (2018, May). In Proceedings of the 40th International Conference on Software Engineering: Companion Proceedings (ICSE-poster). [pdf]