Automated extraction of data constraints from software documentation
Data constraints are crucial business rules that specify the values allowed or required for the data used in a software system. These constraints are typically described in textual software artifacts (e.g., requirements documents, design documents, or user manuals). Previous research on data constraints in software has focused on their implementation in code, either to identify inconsistencies or to support traceability. This thesis contributes to the existing knowledge by studying 548 data constraints described in the documentation of nine systems. We identified and documented 15 linguistic discourse patterns that stakeholders use to describe data constraints in natural language. In a comprehensive study, we explored the use of these discourse patterns, along with linguistic elements and the operands of the data constraints and their types, as features for automatically classifying sentence fragments as data constraint descriptions. The best combination of features and learner achieves 70.87% precision and 59.73% recall (64.76% F1). These findings are a significant step toward the automated identification and extraction of data constraints from natural language text, which is a prerequisite for automating traceability to code and for generating tests for these constraints.