Human-machine Collaborative Vulnerability Discovery for C/C++
Abstract
Software security auditing, which includes code vulnerability detection, severity and impact analysis, proof-of-concept exploitation, patching, and patch efficacy assessment, is a critical step in the software development and maintenance process. If not properly assessed, large applications can turn into a Pandora’s box that can bring destruction to the cyber world if exploited. Unfortunately, due to lack of expertise, the increasing size of code bases, the frequency of releases, and cost constraints, software producers are finding it increasingly difficult to complete comprehensive end-to-end security assessments prior to software release and deployment. More reliable, robust, extensible, and automated tools are needed to meet this demand at scale. This dissertation proposes Cross Domain Code Property Graphs (CDCPG), a new relational graph representation of software artifacts (source and assembly code) that can be mined to find violations of code safety and security properties. The approach encapsulates lower-level constraint checking to present auditors with a high-level correspondence, facilitating discovery and analysis of vulnerabilities that are concealed or invisible at the source level without binary analysis. In contrast with state-of-the-art static analysis tools, the approach presents generated alarms sorted by a vulnerability score calculated from the feature set. The proposed point-of-interest (POI) triaging method can detect 80% of reported vulnerable functions on average by exploring 34% of high-risk functions, and nearly 100% by exploring 60%. A CDCPG prototype implemented for the DARPA CHESS (Computers and Humans Exploring Software Security) program is evaluated, with results demonstrating high accuracy and effectiveness on large code bases compared to prior work.
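The triage figures above (the share of vulnerable functions found after exploring a given share of the ranked list) can be read as recall under an exploration budget. The following is a minimal sketch of that measurement, not the dissertation's implementation: the function name `recall_at_fraction`, the `(name, score, is_vulnerable)` tuple layout, and the example scores are all hypothetical.

```python
from math import ceil

def recall_at_fraction(scored_functions, fraction):
    """Share of vulnerable functions found in the top `fraction` of a ranking.

    scored_functions: list of (name, score, is_vulnerable) tuples.
    This layout is an assumption made for illustration only.
    """
    # Highest score first, mirroring alarms sorted by vulnerability score.
    ranked = sorted(scored_functions, key=lambda f: f[1], reverse=True)
    # Number of functions the auditor is willing to inspect.
    budget = min(len(ranked), ceil(fraction * len(ranked)))
    total_vulnerable = sum(1 for f in ranked if f[2])
    if total_vulnerable == 0:
        return 0.0
    found = sum(1 for f in ranked[:budget] if f[2])
    return found / total_vulnerable

# Hypothetical scores and labels; not data from the dissertation.
functions = [
    ("parse_header", 0.90, True),
    ("copy_payload", 0.80, False),
    ("read_chunk", 0.70, True),
    ("log_event", 0.60, False),
    ("init_state", 0.50, False),
    ("free_buffer", 0.40, True),
    ("print_usage", 0.30, False),
    ("get_version", 0.20, False),
    ("set_flag", 0.10, False),
    ("noop", 0.05, False),
]

half = recall_at_fraction(functions, 0.5)
full = recall_at_fraction(functions, 1.0)
```

In this toy ranking, inspecting the top half of the list surfaces two of the three vulnerable functions; a better scoring function moves the curve toward full recall at a smaller budget.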
Collaboration with other auditing approaches (e.g., fuzzing-based approaches) is facilitated by broadcasting the detected POIs, which contain both source- and binary-level locations. This location information improves the effectiveness of trace analyses, such as inter-procedural control-flow graph construction and analysis. The research also resulted in a new repository of real-world vulnerable function data, which is the largest of its kind published in the open literature. Based on this raw data, this dissertation further develops a framework to extract features from software artifacts to create tabular data sets. The framework exhibits high utility for feature-based learning models to predict vulnerabilities, solving an open problem that has prevented existing vulnerability classification tools from being effectively evaluated on real-world open source data that is heavily imbalanced. The approach introduces a semi-supervised learning methodology that leverages triplet mixup data augmentation to address the imbalanced data problem in tabular security data sets.
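The abstract does not define triplet mixup, so the following is only one plausible reading, stated as an assumption: each synthetic minority-class row is a convex combination of three real minority-class rows, with weights drawn from a symmetric Dirichlet distribution. The name `triplet_mixup`, its parameters, and the sample feature rows are hypothetical, and the dissertation's actual method may differ (for example, it may combine mixup with a triplet loss instead).

```python
import random

def triplet_mixup(minority_rows, n_new, alpha=1.0, seed=0):
    """Generate n_new synthetic rows by mixing three minority-class rows.

    ASSUMPTION: "triplet mixup" is interpreted here as a three-way convex
    combination; this is an illustration, not the author's definition.
    """
    if len(minority_rows) < 3:
        raise ValueError("need at least three minority-class rows")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b, c = rng.sample(minority_rows, 3)
        # Normalized gamma draws give Dirichlet(alpha, alpha, alpha) weights.
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(3)]
        total = sum(draws)
        w = [d / total for d in draws]
        synthetic.append(
            [w[0] * x + w[1] * y + w[2] * z for x, y, z in zip(a, b, c)]
        )
    return synthetic

# Hypothetical numeric feature rows for vulnerable functions.
vulnerable_rows = [
    [12.0, 3.0, 0.40],
    [30.0, 7.0, 0.90],
    [18.0, 2.0, 0.10],
    [25.0, 5.0, 0.60],
]
augmented = triplet_mixup(vulnerable_rows, n_new=5)
```

Because the weights are non-negative and sum to one, every synthetic feature value stays within the range spanned by the real minority-class rows, which keeps the augmented data plausible for purely numeric tabular features.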