Session Chair: Elena Troubitsyna
Session Chair: Helene Waeselynck
Session Chairs: John Knight and Michel Cukier
Session Chair: Karthik Pattabiraman
Session Chair: Stefano Russo
Session Chair: Tadashi Dohi
Session Chair: Marco Vieira
Session Chair: Roberto Natella
Session Chair: Henrique Madeira
Session Chair: Mohamed Kaaniche
Session Chair: Nuno Antunes
Session Chair: Michael Lyu
Session Chair: Alexander Romanovsky
Session Chair: Karama Kanoun
Session Chair: Bojan Cukic
On adaptive sampling-based testing for software reliability assessment Roberto Pietrantuono and Stefano Russo
Assessing the reliability of software programs during validation is a challenging task for engineers. The assessment is not only required to be unbiased, but it also needs to provide tight variance (hence, a tight confidence interval) with as few test cases as possible. Statistical sampling is a theoretically sound approach for reliability testing, but it is often impractical in its current form because of the large number of test cases required to achieve the desired confidence levels, especially when the software contains few residual faults.
We claim that the potential of statistical sampling methods is largely underestimated. This paper presents an adaptive sampling-based testing (AST) strategy for reliability assessment. A two-stage conceptual framework is defined, where adaptiveness is included to uncover residual faults earlier, while various sampling-based techniques are proposed to improve efficiency (in terms of the variance vs. number of test cases trade-off) by better exploiting the information available to the tester. An empirical study is conducted to assess AST performance and compare the proposed sampling techniques to each other on real programs.
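The AST algorithm itself is not spelled out in the abstract. Purely as a hypothetical illustration of the flavour of two-stage, sampling-based reliability estimation it builds on, the following Python sketch splits a test budget across input partitions in two stages, a small pilot stage followed by a variance-driven allocation; the partitions, operational-profile weights, and failure rates are invented.

    import random
    import statistics

    def two_stage_reliability_estimate(partitions, weights, budget, pilot_per_part=20):
        """Two-stage stratified estimate of the profile-weighted failure probability.

        partitions: callables that draw a random test from one input partition and
                    return 1 on failure, 0 on success (hypothetical oracle).
        weights:    operational-profile probability of each partition (sums to 1).
        budget:     total number of test executions allowed.
        """
        # Stage 1: a small pilot sample per partition to estimate variability.
        pilot = [[p() for _ in range(pilot_per_part)] for p in partitions]
        stds = [statistics.pstdev(s) or 1e-6 for s in pilot]

        # Stage 2: spend the remaining budget with a Neyman-style allocation,
        # giving more tests to partitions with higher estimated variability.
        remaining = budget - pilot_per_part * len(partitions)
        norm = sum(w * s for w, s in zip(weights, stds))
        extra = [max(0, round(remaining * w * s / norm)) for w, s in zip(weights, stds)]

        estimates = []
        for part, pilot_outcomes, n_extra in zip(partitions, pilot, extra):
            outcomes = pilot_outcomes + [part() for _ in range(n_extra)]
            estimates.append(sum(outcomes) / len(outcomes))

        # Profile-weighted failure probability; reliability is its complement.
        return 1.0 - sum(w * p for w, p in zip(weights, estimates))

    # Hypothetical usage: three partitions with different (unknown) failure rates.
    random.seed(1)
    parts = [lambda r=r: 1 if random.random() < r else 0 for r in (0.001, 0.01, 0.05)]
    print(two_stage_reliability_estimate(parts, [0.7, 0.2, 0.1], budget=3000))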
Experience Report: Automated System Level Regression Test Prioritization using multiple factors Per Erik Strandberg, Daniel Sundmark, Wasif Afzal, Thomas J. Ostrand and Elaine J. Weyuker.
We propose a new method of determining an effective ordering of regression test cases, and describe its implementation as an automated tool called SuiteBuilder developed by Westermo Research and Development AB. The tool generates an efficient order to run the cases in an existing test suite by using expected or observed test duration and combining priorities of multiple factors associated with test cases, including previous fault detection success, interval since last executed, and modifications to the code tested. The method and tool were developed to address problems in the traditional process of regression testing, such as lack of time to run a complete regression suite, failure to detect bugs in time, and tests that are repeatedly omitted. The tool has been integrated into the existing nightly test framework for Westermo software that runs on large-scale data communication systems. In experimental evaluation of the tool, we found significant improvement in regression testing results. The re-ordered test suites finish within the available time, the majority of fault-detecting test cases are located in the first third of the suite, no important test case is omitted, and the necessity for manual work on the suites is greatly reduced.
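Westermo's actual scoring formula is not given in the abstract; the sketch below is only a minimal illustration of how duration, fault history, staleness, and change impact might be combined into a priority and cut against a nightly time budget (the weights and test names are hypothetical).

    from dataclasses import dataclass

    @dataclass
    class TestCase:
        name: str
        duration: float           # expected or last observed run time (minutes)
        fault_history: float      # fraction of recent runs in which this test failed
        days_since_run: int       # interval since the test was last executed
        touches_changed_code: bool

    def prioritize(tests, time_budget, weights=(0.5, 0.3, 0.2)):
        """Order tests by a weighted multi-factor priority and cut at the time budget."""
        w_fault, w_stale, w_change = weights

        def score(t):
            staleness = min(t.days_since_run / 30.0, 1.0)   # saturate after a month
            return (w_fault * t.fault_history
                    + w_stale * staleness
                    + w_change * (1.0 if t.touches_changed_code else 0.0))

        selected, used = [], 0.0
        for t in sorted(tests, key=score, reverse=True):
            if used + t.duration <= time_budget:
                selected.append(t)
                used += t.duration
        return selected

    suite = [
        TestCase("routing_regression", 40, 0.2, 3, True),
        TestCase("web_ui_smoke", 15, 0.0, 20, False),
        TestCase("firewall_failover", 60, 0.6, 1, True),
    ]
    print([t.name for t in prioritize(suite, time_budget=90)])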
Frequent Subgraph based Familial Classification of Android Malware Ming Fan, Jun Liu, Xiapu Luo, Kai Chen, Tianyi Chen, Zhenzhou Tian, Xiaodong Zhang, Qinghua Zheng and Ting Liu.
The rapid growth of Android malware poses great challenges to anti-malware systems because the sheer number of malware samples overwhelms malware analysis systems. A promising approach for speeding up malware analysis is to classify malware samples into families so that the common features of malware belonging to the same family can be exploited for malware detection and inspection. However, the accuracy of existing classification solutions is limited for two reasons. First, since the majority of Android malware is constructed by inserting malicious components into popular apps, the malware's legitimate part may misguide the classification algorithms. Second, polymorphic variants of Android malware can evade detection by employing transformation attacks. In this paper, we propose a novel approach that constructs frequent subgraphs (fregraphs) to represent the common behaviors of malware samples in the same family for familial classification of Android malware. Moreover, we propose and develop FalDroid, an automatic system for classifying Android malware according to fregraphs, and apply it to 6,565 malware samples from 30 families. The experimental results show that FalDroid can correctly classify 94.5% of malware samples into their families, using around 4.4 seconds per app.
SCOUT: A Multi-objective Method to Select Components in Designing Unit Testing Eduardo Freitas, Celso Camilo-Junior and Auri Vincenzi.
The creation of a unit test suite is preceded by the selection of which components (code units) should be tested. This selection is a significant challenge, usually made based on team members' experience or guided by defect prediction or fault localization models. We modeled the selection of components for unit testing with limited resources as a multi-objective problem, addressing two different objectives: maximizing benefits and minimizing testing cost. To measure the benefit of a component, we made use of metrics from static analysis (cost of future maintenance), dynamic analysis (risk of fault and frequency of calls), and business value. We tackled gaps and challenges in the literature to formulate an effective method, the Selector of Software Components for Unit Testing (SCOUT). SCOUT provides automated extraction of all necessary data followed by a multi-objective optimization process. SCOUT is a method able to assist testers in different domains; the Android platform was chosen to perform our experiments, taking nine leading open-source applications as our subjects. SCOUT was compared with two of the most frequently used strategies in terms of efficacy. We also compared the effectiveness and efficiency of seven algorithms in solving a multi-objective component selection problem. Our experiments were performed under different scenarios and reveal the potential of SCOUT in reducing market vulnerability compared to other approaches. To the best of our knowledge, SCOUT is the first method to assist software testing managers, in an automated way, in selecting components for the development of unit tests, combining static and dynamic metrics and business value.
The Effect of Test Suite Type on Regression Test Selection Nima Dini, Allison Sullivan, Milos Gligoric and Gregg Rothermel.
Regression test selection (RTS) techniques reduce the cost of regression testing by running only test cases related to code modifications. RTS techniques have been extensively researched, and the effects of several context factors on techniques have been empirically studied, but no prior work has explored the effects that might arise due to differences in types of test suites. We believe such differences may matter, and thus, we designed an empirical study to investigate them. Specifically, we consider two types of test suites obtained with automated test case generation techniques---feedback-directed random techniques and search-based techniques---along with manually written test suites. We assess the effects of these test suite types on two RTS techniques: a "fine-grained" technique that selects test cases based on dependencies tracked at the method level and a "coarse-grained" technique that selects test cases based on dependencies tracked at the file level. We performed our study on eight open-source projects across 800 commits. Our results show that on average, fine-grained RTS was more effective for test suites created by search-based test case generation techniques whereas coarse-grained RTS was more effective for test suites created by feedback-directed random techniques, and that commits affect RTS techniques differently for different types of test suites.
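Both granularities boil down to selecting the tests whose tracked dependencies intersect the changed entities; the sketch below illustrates only this selection step with hypothetical file- and method-level dependency maps (the techniques in the study track dependencies through instrumentation, which is not modelled here).

    def select_tests(dependencies, changed):
        """Select every test whose tracked dependencies overlap the changed entities."""
        changed = set(changed)
        return sorted(t for t, deps in dependencies.items() if deps & changed)

    # Coarse-grained RTS: dependencies tracked at the file level.
    file_deps = {
        "AccountTest": {"Account.java", "Ledger.java"},
        "ReportTest":  {"Report.java"},
    }
    # Fine-grained RTS: dependencies tracked at the method level.
    method_deps = {
        "AccountTest": {"Account.withdraw", "Ledger.post"},
        "ReportTest":  {"Report.render"},
    }
    print(select_tests(file_deps, ["Account.java"]))       # ['AccountTest']
    print(select_tests(method_deps, ["Account.deposit"]))  # [] - finer tracking selects fewer tests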
WAP: Novel Automatic Test Generation Technique Based on Moth Flame Optimization Aya Metwally, Eman Hosam, Marwa Hassan and Sarah Rashad.
In this work, we present a novel technique for automatic test data generation that generates the whole test suite in a single run. The novelty of our proposed technique lies mainly in (i) using the Moth Flame Optimization (MFO) algorithm for the first time in automatic test generation, and (ii) introducing the use of a generic objective function that is independent of the method under test (MUT). The proposed objective function dynamically evaluates the fitness of each solution, i.e., test case, with respect to the test suite formed so far, based on the effectiveness of adding this test case to the currently formed suite. This dynamic approach replaces the use of a static fitness function that evaluates the fitness of each solution independently of other solutions. The proposed technique tries to find a small test suite with maximum coverage by iteratively eliminating test cases that do not contribute to the overall coverage of the test suite. The results show that our technique outperforms a random generator, with improvements of up to two orders of magnitude. It also outperforms a Genetic Algorithm (GA) on four out of five benchmark methods, achieving improvements between 74% and 83% in the number of generated test cases.
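The sketch below illustrates only the dynamic-fitness idea described above (a candidate is worth the coverage it adds to the suite formed so far), not the MFO metaheuristic itself; the candidate tests and branch-coverage sets are hypothetical.

    def marginal_fitness(candidate_branches, covered):
        """Dynamic fitness: a test case is only worth the coverage it adds
        on top of the suite formed so far."""
        return len(candidate_branches - covered)

    def form_suite(candidates):
        """Greedy sketch: keep candidates with positive marginal fitness and
        discard those that contribute nothing to the overall coverage."""
        covered, suite = set(), []
        # Consider candidates with larger coverage first, purely as a heuristic.
        for name, branches in sorted(candidates.items(), key=lambda kv: -len(kv[1])):
            if marginal_fitness(branches, covered) > 0:
                suite.append(name)
                covered |= branches
        return suite, covered

    # Hypothetical branch-coverage sets for generated test cases.
    cands = {"t1": {1, 2, 3}, "t2": {2, 3}, "t3": {4}, "t4": {1, 4}}
    print(form_suite(cands))   # keeps "t1" plus one test covering branch 4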
Distance-integrated Combinatorial Testing Eun-Hye Choi, Cyrille Artho, Takashi Kitamura, Osamu Mizuno and Akihisa Yamada.
This paper proposes a novel approach to combinatorial test generation, which increases not only the number of new combinations but also the distance between test cases. We applied our distance-integrated approach to a state-of-the-art greedy algorithm for traditional combinatorial test generation, using two distance metrics: Hamming distance and a modified chi-square distance. Experimental results on numerous benchmark models show that combinatorial test suites generated by our approach with both distance metrics can improve interaction coverage for higher interaction strengths with low computational overhead.
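As a small illustration of one of the two metrics, the sketch below computes the Hamming distance between test cases (tuples of parameter values) and the distance of a candidate to its nearest neighbour in the suite; the parameters and values are hypothetical and the greedy generation algorithm itself is not shown.

    def hamming(t1, t2):
        """Number of parameters on which two test cases differ."""
        return sum(a != b for a, b in zip(t1, t2))

    def min_distance(candidate, suite):
        """Distance of a candidate to the closest test case already in the suite."""
        return min((hamming(candidate, t) for t in suite), default=len(candidate))

    # Two candidates that cover the same number of new pairs; prefer the more distant one.
    suite = [("linux", "ipv4", "tcp"), ("linux", "ipv6", "udp")]
    c1 = ("linux", "ipv4", "udp")
    c2 = ("bsd",   "ipv6", "tcp")
    print(min_distance(c1, suite), min_distance(c2, suite))   # 1 2 -> c2 is preferred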
Evaluating the Effects of Compiler Optimizations on Mutation Testing at the Compiler IR Level Farah Hariri, August Shi, Hayes Converse, Darko Marinov and Sarfraz Khurshid.
Software testing is one of the most widely used approaches for improving software reliability. The effectiveness of testing depends to a large extent on the quality of test suites. Researchers have developed various techniques to evaluate the quality of test suites. Of these techniques, mutation testing is generally considered to be the most advanced. A key result of applying mutation testing to a given test suite is the mutation score, representing the percentage of mutants killed by the test suite. Ideally, the mutation score is computed ignoring the mutants that are semantically equivalent to the original code under test or duplicates of one another. In this paper, we investigate a new perspective on mutation testing: evaluating how the mutation testing process and its results are affected by compiler optimizations, i.e., semantics-preserving program transformations applied to improve the program's performance. Our study targets LLVM, a popular compiler infrastructure that supports multiple source and target languages. We use 18 Coreutils programs in our evaluation and find some surprising relations between the number of mutants (including results on equivalent and duplicated mutants) and mutation scores on unoptimized and optimized programs.
Automatically Classifying Test Results by Semi-Supervised Learning Marc Roper and Rafig Almaghairbe.
A key component of software testing is deciding whether a test case has passed or failed: an expensive and error-prone manual activity. We present an approach to automatically classify passing and failing executions using semi-supervised learning on dynamic execution data (test inputs/outputs and execution traces). A small proportion of the test data is labelled as passing or failing and used in conjunction with the unlabelled data to build a classifier which labels the remaining outputs (classifying them as passing or failing tests). A range of learning algorithms is investigated using several faulty versions of three systems along with varying types of data (inputs/outputs alone, or in combination with execution traces) and different labelling strategies (both failing and passing tests, and passing tests alone). The results show that in many cases labelling just a small proportion of the test cases – as low as 10% – is sufficient to build a classifier that is able to correctly categorise the large majority of the remaining test cases. This has important practical potential: when checking the test results from a system, a developer need only examine a small proportion of these and use this information to train a learning algorithm to automatically classify the remainder.
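A minimal sketch of the general idea (not the authors' exact algorithms, features, or subject systems), using scikit-learn's self-training wrapper on synthetic execution features with roughly 10% of the runs labelled:

    import numpy as np
    from sklearn.semi_supervised import SelfTrainingClassifier
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)

    # Hypothetical execution data: one row per test run (e.g., output length,
    # a numeric output value, trace length); 1 = failing, 0 = passing.
    X = rng.normal(size=(200, 3))
    y_true = (X[:, 0] + 0.5 * X[:, 2] > 1.0).astype(int)   # synthetic "failures"

    y = np.full(200, -1)                                # -1 marks unlabelled runs
    labelled = rng.choice(200, size=20, replace=False)  # label about 10% of them
    y[labelled] = y_true[labelled]

    clf = SelfTrainingClassifier(DecisionTreeClassifier(max_depth=3))
    clf.fit(X, y)

    unlabelled = y == -1
    acc = (clf.predict(X)[unlabelled] == y_true[unlabelled]).mean()
    print(f"accuracy on the unlabelled executions: {acc:.2f}")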
Using A Cognitive Psychology Perspective on Errors to Improve Requirements Quality: An Empirical Investigation Vaibhav Anu, Gursimran Walia, Wenhua Hu, Jeffrey Carver and Gary Bradshaw.
Software inspections are an effective method for early detection of faults present in software development artifacts (e.g., requirements and design documents). However, many faults are left undetected due to the lack of focus on the underlying sources of faults (i.e., what caused the injection of the fault?). To address this problem, work done by psychologists on analyzing the failures of human cognition (i.e., human errors) is used in this research to help inspectors detect errors and corresponding faults (manifestations of errors) in requirements documents. We hypothesize that fault detection performance will demonstrate significant gains when using a formal taxonomy of human errors (the underlying source of faults). This paper describes a newly developed Human Error Taxonomy (HET) and a formal Error-Abstraction and Inspection (EAI) process to improve the fault detection performance of inspectors during requirements inspections. A controlled empirical study evaluated the usefulness of HET and EAI compared to fault-based inspection. The results verify our hypothesis and provide useful insights into commonly occurring human errors that contributed to requirement faults, along with areas to further refine both the HET and the EAI process.
On the Personality Traits of GitHub Contributors Ayushi Rastogi and Nachiappan Nagappan.
People's personality has the potential to explain their behavior in different situations. This understanding of behavior is required to understand the intricacies of a team, which can then be used to optimize work performance. As an initial step towards optimizing work performance, in this paper we explore the inferential power of personality traits in explaining the behavior of contributors in various contexts of software development on GitHub. Analyses of 243 actively discussed projects showed that contributors with extreme (high or low) levels of contributions are more neurotic than contributors with a medium level of contributions. Analyses of 423 active contributors showed that contributors become more conscientious, more extroverted, and less agreeable over the years of participation. The findings of this study match our expectations and are promising for further exploration.
WAP: Understanding the brain at software debugging João Durães, Henrique Madeira, João Castelhano, Catarina Duarte and Miguel Castelo Branco.
We propose that understanding functional patterns of activity in mapped brain regions associated with code comprehension tasks and, more specifically, with the activity of finding bugs in traditional code inspections could reveal useful insights to improve software reliability and the software development process in general. This includes helping to select the best professionals for the debugging effort, improving the conditions for code inspections, and identifying new directions for training code reviewers. This paper presents an interdisciplinary study that analyzes brain activity during code inspection tasks using functional magnetic resonance imaging (fMRI), a well-established tool in cognitive neuroscience research. We used several programs in which realistic bugs, representing the most frequent types of software faults found in the field, were injected. The code inspectors involved in the research include programmers with different levels of expertise and experience in real code reviews. The goal is to understand brain activity patterns associated with code comprehension tasks and, more specifically, the brain activity when the code reviewer identifies a bug in the code (the 'eureka' moment), which can be a true positive or a false positive. Our results confirm that brain areas associated with language processing and mathematics are highly active during code reviewing and show that there are specific brain activity patterns that can be related to the decision-making moment of suspicion/bug detection. Importantly, the activity at the anterior insula region, which we find to play a relevant role in the process of identifying software bugs, is positively correlated with the precision of bug detection by the inspectors. This finding provides a new perspective on the role of this region in error awareness and monitoring, and on its potential value in predicting the quality of bug removal.
RRF: A Race Reproduction Framework for use in Debugging Process-Level Races Supat Rattanasuksun, Tingting Yu, Witawas Srisa-An and Gregg Rothermel.
Process-level races are endemic in modern systems. These races are difficult to debug because they are sensitive to execution events such as interrupts and scheduling. Unless a process interleaving that can result in the race can be found, it cannot be reproduced and cannot be corrected. In practice, however, the number of interleavings that can occur among processes is large, and the patterns of interleavings can be complex. Thus, approaches for reproducing process-level races to date are often ineffective. In this paper, we present RRF, a race reproduction framework that can help software engineers reproduce reported process-level races, enabling them to potentially debug these races. RRF performs a hybrid analysis by leveraging existing static program analysis tools, dynamic kernel event reporting tools, and yield points to provide the observability and controllability needed to reproduce races. We conducted an empirical study to evaluate RRF; our results show that RRF can be effective for reproducing races.
Cause Points Analysis for Effective Handling of Alarms Tukaram Muske and Uday P. Khedker.
Static analysis tools are widely used in practice to improve the quality and reliability of software through early detection of defects. However, the number of alarms generated is a major concern because of the cost incurred in the manual inspection required to partition them into true errors and false positives. In this paper, we propose a static analysis to identify the causes of alarms generated by a client static analysis. This simplifies the manual inspections and reduces the cost involved. The proposed analysis involves the following: (1) modeling the basic reasons for alarms as alarm cause points of several types, (2) ranking these cause points based on three different metrics, and (3) a workflow in which a user answers queries about the cause points and the answers are used in a subsequent round of the client analysis. The collaboration between the user and the client analysis helps the tool resolve the unknowns encountered during the analysis and weed out alarms. It also helps the user expedite the manual inspection of alarms. Further, the ranking of cause points helps to prioritize the alarms. Our experimental evaluation in several settings demonstrated that our approach (a) reduces manual effort by 23% to 72% depending on various parameters, with an average reduction of 42%, and (b) is also effective in identifying the alarms that are more likely to be true errors.
ORPLocator: Identifying Reading Points of Configuration Options via Static Analysis Zhen Dong, Artur Andrzejak, David Lo and Diego Elias Damasceno Costa.
Configuration options are widely used for customizing the behavior and initial settings of software applications, server processes, and operating systems. Their distinctive property is that each option is processed, defined, and described in different parts of a software project, namely in code, in configuration files, and in documentation. This creates a challenge for maintaining project consistency as it evolves. It also promotes inconsistencies leading to misconfiguration issues in production scenarios.
We propose an approach for detection of inconsistencies between source code and documentation based on static analysis. Our approach automatically identifies source code locations where options are read, and for each such location retrieves the name of the option. Inconsistencies are then detected by comparing the results against the option names listed in documentation.
We evaluated our approach on multiple components of Apache Hadoop, a complex framework with more than 800 options. Our tool ORPLocator was able to successfully locate at least one read point for 93% to 96% of documented options within four Hadoop components. A comparison with a previous state-of-the-art technique shows that our tool produces more accurate results. Moreover, our evaluation has uncovered 4 previously unknown, real-world inconsistencies between documented options and source code.
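The consistency check at the heart of the approach is a comparison between option names read in code and option names listed in documentation. The sketch below is only a loose stand-in (ORPLocator resolves read points through static analysis of the configuration API, not regular expressions); the code snippets and option names are hypothetical.

    import re

    def options_read_in_code(source_snippets):
        """Rough stand-in for read-point detection: collect string literals passed
        to typical option-reading calls (hypothetical call pattern)."""
        call = re.compile(r'(?:conf\.get\w*|getProperty)\(\s*"([^"]+)"')
        found = set()
        for text in source_snippets:
            found.update(call.findall(text))
        return found

    code = ['int n = conf.getInt("dfs.replication", 3);',
            'String dir = conf.get("dfs.data.dir");']
    documented = {"dfs.replication", "dfs.datanode.data.dir"}

    read = options_read_in_code(code)
    print("documented but never read:", documented - read)
    print("read but undocumented:   ", read - documented)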
Proving Concurrent Data Structures Linearizable Vineet Singh, Iulian Neamtiu and Rajiv Gupta.
Linearizability of concurrent data structure implementations is notoriously hard to prove. Consequently, current verification techniques can only prove linearizability for certain classes of data structures. We introduce a generic, sound, and practical technique to statically check the linearizability of concurrent data structure implementations. Our technique involves specifying the concurrent operations as a list of sub-operations and passing this specification on to an automated checker that verifies linearizability using relationships between individual sub-operations. We have proven the soundness of our technique. Our approach is expressive: we have successfully verified the linearizability of 12 popular concurrent data structure implementations including algorithms that are considered to be challenging to prove linearizable such as elimination back-off stack, lazy linked list, and time-stamped stack. Our checker is effective, as it can verify the specifications in less than a second.
Detecting, Exposing, and Classifying Sequential Consistency Violations Mohammad Majharul Islam and Abdullah Muzahid.
Sequential Consistency (SC) is the most intuitive memory model for parallel programs. However, modern architectures aggressively reorder and overlap memory accesses, causing SC violations. An SC violation is virtually always a bug. Most prior schemes either search the entire state space of a program or use a constraint solver to find SC violations. A promising recent scheme uses an active testing technique but fails to be effective for SC violations involving a larger number of threads and variables, and larger codebases. We propose Orion, the first active testing technique that can detect, expose, and classify arbitrary SC violations in any program. Orion works in two phases. In the first phase, it finds potential SC violation cycles by focusing on racing accesses. In the second phase, it exposes each SC violation cycle by enforcing the exact scheduling order. We present a detailed design of Orion in the paper. We tested different concurrent algorithms, bug kernels, SPLASH2 and PARSEC applications, and an open source program, Apache. We experimented with the TSO and PSO memory models. We detected and exposed 60 SC violations, of which 15 involve more than two processors and variables. Orion exposes SC violations quickly and with high probability. Compared to a state-of-the-art active testing technique, it has a much better SC violation detection ability.
Approximate Lock: Trading off Accuracy for Performance by Skipping Critical Sections Riad Akram, Mohammad Mejbah Ul Alam and Abdullah Muzahid.
Approximate computing is gaining a lot of traction due to its potential for improving performance and, consequently, energy efficiency. This project explores the potential for approximating locks. We start from the observation that many applications can tolerate occasional skipping of computations done inside a critical section protected by a lock. This means that for certain critical sections, when the enclosed computation is occasionally skipped, the application suffers some quality degradation in the final outcome but never crashes or deadlocks. To exploit this opportunity, we propose Approximate Lock (ALock). The thread executing ALock checks whether a certain condition (e.g., high contention, long waiting time) is met and, if so, the thread returns without acquiring the lock. We modify some selected critical sections using ALock so that those sections are skipped when ALock returns without acquiring the lock. We experimented with 14 programs from the PARSEC, SPLASH2, and STAMP benchmarks. We found a total of 37 locks that can be transformed into ALock. ALock provides performance improvements for 10 applications, ranging from 1.8% to 164.4%, with at least 80% accuracy.
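A minimal Python analogue of the ALock idea (the paper targets lock-based C benchmarks; the timeout condition, the tolerant critical section, and the skip counter below are illustrative):

    import threading

    class ALock:
        """Approximate lock sketch: give up after a short wait instead of blocking.
        Callers must tolerate the occasionally skipped computation."""
        def __init__(self, max_wait=0.001):
            self._lock = threading.Lock()
            self.max_wait = max_wait
            self.skipped = 0   # updated without synchronisation; indicative only

        def acquire_or_skip(self):
            acquired = self._lock.acquire(timeout=self.max_wait)
            if not acquired:
                self.skipped += 1
            return acquired

        def release(self):
            self._lock.release()

    histogram = [0] * 8
    alock = ALock()

    def work(values):
        for v in values:
            if alock.acquire_or_skip():     # under contention the update is skipped
                try:
                    histogram[v % 8] += 1   # critical section that tolerates skips
                finally:
                    alock.release()

    threads = [threading.Thread(target=work, args=(range(10000),)) for _ in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(sum(histogram), "of 40000 updates kept;", alock.skipped, "skipped")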
Combining Word Embedding with Information Retrieval to Recommend Similar Bug Reports Xinli Yang, David Lo, Xin Xia, Lingfeng Bao and Jianling Sun.
Similar bugs are bugs that require handling of many common code files. Developers can often fix similar bugs in a shorter time and with higher quality since they can focus on fewer code files. Therefore, similar bug recommendation is a meaningful task that can improve development efficiency. Rocha et al. proposed the first similar bug recommendation system, named NextBug. Although NextBug performs better than REP, a state-of-the-art duplicate bug detection technique, its performance is not optimal, and more work is needed to improve its effectiveness. Technically, it is also rather simple, as it relies only upon a standard information retrieval technique, i.e., cosine similarity. In this paper, we propose a novel approach to recommend similar bugs. The approach combines a traditional information retrieval technique and a word embedding technique, and takes bug titles and descriptions as well as bug product and component information into consideration. To evaluate the approach, we use datasets from two popular open-source projects, i.e., Eclipse and Mozilla, each of which contains bug reports whose bug IDs range over [1, 400000]. The results show that our approach improves on the performance of NextBug statistically significantly and substantially for both projects.
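A minimal sketch of the combination idea, assuming scikit-learn and gensim are available; the bug texts, the equal weighting of the two similarity signals, and the omission of product/component filtering are simplifications introduced here for illustration:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    from gensim.models import Word2Vec

    reports = [
        "crash when saving editor preferences on startup",
        "editor preferences dialog crashes after save",
        "network timeout while syncing large repository",
    ]
    tokens = [r.split() for r in reports]

    # Information retrieval side: TF-IDF cosine similarity between report texts.
    ir_sim = cosine_similarity(TfidfVectorizer().fit_transform(reports))

    # Word embedding side: average word vectors from a tiny corpus-trained model.
    w2v = Word2Vec(tokens, vector_size=32, min_count=1, seed=1)
    doc_vecs = np.vstack([np.mean([w2v.wv[w] for w in t], axis=0) for t in tokens])
    emb_sim = cosine_similarity(doc_vecs)

    combined = 0.5 * ir_sim + 0.5 * emb_sim     # illustrative weighting
    query = 0                                   # recommend reports similar to report 0
    ranking = np.argsort(-combined[query])[1:]  # drop the query itself
    print("recommended order:", ranking.tolist())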
Experience Report: Understanding Cross-Platform App Issues From User Reviews Yichuan Man, Cuiyun Gao, Michael R. Lyu and Jiuchun Jiang.
App developers publish apps on different platforms, such as Google Play, App Store, and Windows Store, to maximize user volumes and potential revenues. Due to the different characteristics of the platforms and different user preferences (e.g., Android is more customized than iOS), app test cases for these three platforms should also be designed differently. Comprehensive app testing can be time-consuming for developers. Therefore, understanding the differences in app issues across these platforms can facilitate the testing process.
In this paper, we propose a novel framework named CrossMiner to analyze the essential app issues and explore whether the app issues manifest differently on the three platforms. Based on five million user reviews, the framework automatically captures the distributions of seven app issues, i.e., “battery”, “crash”, “memory”, “network”, “privacy”, “spam”, and “UI”. We discover that the apps for different platforms indeed exhibit different issue distributions, which can be employed by app developers to schedule and design their test cases. Verification based on the official user forums also demonstrates the effectiveness of our framework. Furthermore, we identify that the issues related to “crash” and “network” are of greater concern to users than the other issues on all three platforms. To assist developers in gaining deeper insight into the user issues, we also prioritize the user reviews corresponding to the issues. Overall, we aim at understanding the differences between issues on different platforms and facilitating the testing process for app developers.
CoLUA: Automatically Predicting Configuration Bug Reports and Extracting Configuration Options Wei Wen, Tingting Yu and Jane Hayes.
Configuration bugs are among the dominant causes of software failures. Software organizations often use bug-tracking systems to manage bug reports collected from developers and users. In order for software developers to understand and reproduce configuration bugs, it is vital for them to know whether a bug in a bug report is related to configuration issues; this is often not easily discerned due to a lack of easy-to-spot terminology in the bug reports. In addition, to locate and fix a configuration bug, a developer needs to know which configuration options are associated with the bug. To address these two problems, we introduce CoLUA, a two-step automated approach that combines natural language processing, information retrieval, and machine learning. In the first step, CoLUA selects features from the textual information in the bug reports and uses various machine learning techniques to build classification models; developers can use these models to label a bug report as either a configuration bug report or a non-configuration bug report. In the second step, CoLUA identifies which configuration options are involved in the labeled configuration bug reports. We evaluate CoLUA on 900 bug reports from three large open source software systems. The results show that CoLUA predicts configuration bug reports with high accuracy and effectively identifies the configuration options involved in them.
Anomaly Detection and Root Cause Localization in Virtual Network Functions Carla Sauvanaud, Kahina Lazri, Mohamed Kaaniche and Karama Kanoun.
The maturity of hardware virtualization has motivated Communication Service Providers (CSPs) to apply this paradigm to network services. Virtual Network Functions (VNFs) result from this trend and raise new dependability challenges related to network softwarisation that are still not thoroughly explored. This paper describes a new approach to detect Service Level Agreement (SLA) violations and preliminary symptoms of SLA violations. Another major objective of our approach is to help CSP administrators identify the anomalous VM at the origin of a detected SLA violation, which should enable them to proactively plan appropriate recovery strategies. To this end, we make use of virtual machine (VM) monitoring data and perform both a per-VM and an ensemble analysis. Our approach includes a supervised machine learning algorithm as well as fault injection tools. The experimental testbed consists of a virtual IP Multimedia Subsystem developed by the Clearwater project. Experimental results show that our approach achieves high precision and recall with a low false alarm rate, and can pinpoint the anomalous VNF VM at the root of SLA violations. It can also detect preliminary symptoms of high workloads triggering SLA violations.
Experience Report: System Log Analysis for Anomaly Detection Shilin He, Jieming Zhu, Pinjia He and Michael R. Lyu.
Anomaly detection plays an important role in the management of modern large-scale distributed systems. Logs, which record system runtime information, are widely used for anomaly detection. Traditionally, developers (or operators) often inspect the logs manually with keyword search and rule matching. The increasing scale and complexity of modern systems, however, make the volume of logs explode, which renders manual inspection infeasible. To reduce manual effort, many anomaly detection methods based on automated log analysis have been proposed. However, developers may still have no idea which anomaly detection methods they should adopt, because there is a lack of reviews and comparisons of these methods. Moreover, even if developers decide to employ an anomaly detection method, re-implementation requires a nontrivial effort. To address these problems, we provide a detailed review and evaluation of six state-of-the-art log-based anomaly detection methods, including three supervised methods and three unsupervised methods, and also release an open-source toolkit allowing ease of reuse. These methods have been evaluated on two publicly available production log datasets, with a total of 15,923,592 log messages and 365,298 anomaly instances. We believe that our work, with the evaluation results and corresponding findings, can provide guidelines for the adoption of these methods and references for future development.
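As a small, hypothetical illustration of the pipeline these methods share (crude log parsing into event templates, event-count vectors per session, then a detector), not of any specific technique in the study:

    import re
    from collections import Counter
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    def event_template(line):
        """Crude log-parsing stand-in: mask numbers and block ids to get an event type."""
        return re.sub(r"blk_-?\d+|\d+", "*", line.strip())

    def session_vector(lines):
        """Event-count vector for one session (e.g., one HDFS block)."""
        return Counter(event_template(l) for l in lines)

    sessions = [
        ["Receiving block blk_1 src 10.0.0.1", "Received block blk_1 of size 512"],
        ["Receiving block blk_2 src 10.0.0.2", "Exception in receiveBlock for blk_2"],
    ]
    labels = [0, 1]   # 0 = normal, 1 = anomalous (hypothetical ground truth)

    vec = DictVectorizer()
    X = vec.fit_transform(session_vector(s) for s in sessions)
    clf = LogisticRegression().fit(X, labels)

    new = ["Receiving block blk_9 src 10.0.0.9", "Exception in receiveBlock for blk_9"]
    print(clf.predict(vec.transform(session_vector(new))))   # likely [1] (anomalous)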
SV-AF – A Security Vulnerability Analysis Framework Sultan Alqahtani, Ellis E. Eghan and Juergen Rilling.
The globalization of the software industry has introduced widespread use of system components across traditional system boundaries. Due to this global reuse, vulnerabilities and security concerns are no longer limited in scope to individual systems but can now affect global software ecosystems. While known vulnerabilities and security concerns are reported in specialized vulnerability databases, these repositories often remain information silos. In this research, we introduce a modeling approach that eliminates these silos by linking security knowledge with other software artifacts to improve traceability and trust in software products.
In our approach, we introduce a Security Vulnerabilities Analysis Framework (SV-AF) to support evidence-based vulnerability detection. Two case studies are presented to illustrate the applicability of the approach. In these case studies, we link the NVD vulnerability database and the Maven build repository to trace vulnerabilities across repository and project boundaries. In our analysis, we identify that 750 Maven project releases are directly affected by known security vulnerabilities and, by considering transitive dependencies, an additional 415,604 Maven projects can be identified as potentially affected by these vulnerabilities.
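A minimal sketch of the transitive-dependency reasoning behind that last number (SV-AF itself links ontology models over NVD and Maven data; the coordinates and the vulnerable artifact below are hypothetical):

    from collections import deque

    # Hypothetical Maven-style dependency edges: project -> artifacts it depends on.
    depends_on = {
        "app:web:1.0": ["org.example:http-client:2.3"],
        "org.example:http-client:2.3": ["commons-codec:commons-codec:1.6"],
        "app:batch:1.0": ["commons-codec:commons-codec:1.6"],
    }
    directly_vulnerable = {"commons-codec:commons-codec:1.6"}   # e.g., linked to an NVD entry

    def potentially_affected(depends_on, vulnerable):
        """Mark every project that reaches a vulnerable artifact through its dependencies."""
        # Invert the edges so we can walk from vulnerable artifacts to their dependents.
        dependents = {}
        for project, deps in depends_on.items():
            for dep in deps:
                dependents.setdefault(dep, []).append(project)
        affected, queue = set(vulnerable), deque(vulnerable)
        while queue:
            for parent in dependents.get(queue.popleft(), []):
                if parent not in affected:
                    affected.add(parent)
                    queue.append(parent)
        return affected - vulnerable

    print(sorted(potentially_affected(depends_on, directly_vulnerable)))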
Goal-driven deception tactics design Cristiano De Faveri, Ana Moreira and Vasco Amaral.
Deception-based defense relies on intentional actions that induce attackers to make erroneous inferences. Existing deception approaches are included in the software development process in an ad hoc fashion, and are fundamentally realized as single tools or as entire solutions repackaged as honeypot machines. We propose a systematic goal-driven approach to include deception tactics early in the software development process so that conflicts and risks can be found in the initial phases of development, reducing the cost of ill-planned decisions. The process integrates three phases: system modeling (producing a goal model of the application domain), security modeling (producing a threat model specifying the typical security concerns from the attacker's perspective), and deception modeling (producing a deception tactic model, a variability model, and deception story models). The feasibility of the proposed approach is shown via a case study where deception defense strategies are designed for a student presence control system at our university.
Quantifying the Attack Detection Accuracy of Intrusion Detection Systems in Virtualized Environments Aleksandar Milenkoski, K. R. Jayaram, Nuno Antunes, Marco Vieira and Samuel Kounev.
With the widespread adoption of virtualization, intrusion detection systems (IDSes) are increasingly being deployed in virtualized environments. When securing an environment, IT security officers are often faced with the question of how accurate deployed IDSes are at detecting attacks. To this end, metrics for assessing the attack detection accuracy of IDSes have been developed. However, these metrics are defined with respect to a fixed set of hardware resources available to the tested IDS. Therefore, IDSes deployed in virtualized environments featuring elasticity (i.e., on-demand allocation or deallocation of virtualized hardware resources during system operation) cannot be evaluated in an accurate manner using existing metrics. In this paper, we demonstrate the impact of elasticity on IDS attack detection accuracy. In addition, we propose a novel metric and measurement methodology for accurately quantifying the accuracy of IDSes deployed in virtualized environments featuring elasticity. We demonstrate their practical use through case studies involving commonly used IDSes.
Using Approximate Bayesian Computation to Empirically Test Email Malware Propagation Models Relevant to Common Intervention Actions Edward Condon and Michel Cukier.
There are different ways for malware to spread from device to device. Some methods depend on the presence of a vulnerability that can be exploited along with some action taken by a user of the device. Malware propagating through email is one such example. While existing research has explored potential factors and models for simulating this form of propagation, these potential factors and models have yet to be empirically tested and supported using field-collected incident data. We review a common model for simulating the spread of email malware and use simulations to illustrate the potential impacts of connection topologies and different distributions of associated user actions. We use simulations to examine the potential impact of two types of commonly available interventions (patching vulnerable devices and blocking the transmission of infected messages) in combination with different connection topologies and different distributions of user actions. Finally, we explore the use of Approximate Bayesian Computation (ABC) as a method to compare simulation results to empirical data, to assess different model features, and to infer corresponding model parameter values from field-collected email malware incident data.
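A self-contained sketch of ABC rejection sampling against a toy email-propagation model (the model, the uniform prior, the observed outbreak size, and the tolerance are all invented here for illustration and are much simpler than the models discussed above):

    import random

    def simulate_outbreak(p_open, n_users=200, contacts=10, seed_infected=5, rounds=4):
        """Toy email-malware spread: each infected user mails `contacts` random users,
        who become infected if they open the attachment (probability p_open)."""
        infected = set(range(seed_infected))
        for _ in range(rounds):
            new = set()
            for _ in infected:
                for _ in range(contacts):
                    target = random.randrange(n_users)
                    if target not in infected and random.random() < p_open:
                        new.add(target)
            infected |= new
        return len(infected)

    def abc_rejection(observed, n_draws=2000, tolerance=10):
        """Approximate Bayesian Computation: keep parameter draws whose simulated
        outbreak size is close to the field-observed size."""
        accepted = []
        for _ in range(n_draws):
            p = random.uniform(0.0, 0.5)   # uniform prior on the open probability
            if abs(simulate_outbreak(p) - observed) <= tolerance:
                accepted.append(p)
        return accepted

    random.seed(7)
    posterior = abc_rejection(observed=60)
    if posterior:
        print(len(posterior), "accepted draws, posterior mean p_open =",
              round(sum(posterior) / len(posterior), 3))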
MHCP Model for Quality Evaluation for Software Structure Based on Software Complex Network Yuwei Yang, Jun Ai, Xuelin Li and W. Eric Wong.
Accidents caused by defective software systems have long been a nightmare. Though engineers utilize advanced techniques and rigorous quality control procedures, we still have to admit that the increasing complexity and expanding scale of software systems make it extremely difficult to guarantee high-quality deliverables. Since large-scale software systems exhibit the characteristics of complex networks, applying the principles of complex networks to evaluate the quality of software systems has attracted attention from both academia and industry. Unfortunately, most current research studies focus only on one or a limited number of attributes of software structures, which makes them ineffective in providing comprehensive and insightful quality evaluation for software structures. To overcome this problem, we propose an approach based on various software structural characteristics to evaluate software structures from the modularity, hierarchy, complexity, and fault propagation points of view. A model based on these four aspects is proposed to better understand software structural quality. A prediction model is also proposed to provide insights into the nature of software evolution and its current status. Experiments using two software projects were performed against thresholds obtained by evaluating more than 5,000 versions of open source projects. Our results suggest that the approach described in this paper can help analyze real-world software projects for better quality evaluation.
The Impact of Feature Selection on Defect Prediction Performance: An Empirical Comparison Zhou Xu, Jin Liu, Gege An and Zijiang Yang.
Software defect prediction aims to determine whether a software module is defect-prone by constructing prediction models. The performance of such models is susceptible to the high dimensionality of the datasets, which may include irrelevant and redundant features. Feature selection is applied to alleviate this issue. Because many feature selection methods have been proposed, there is an imperative need to analyze and compare them. Prior empirical studies may have potential controversies and limitations, such as contradictory results, usage of private datasets, and inappropriate statistical test techniques. This observation leads us to conduct a careful empirical study to reinforce the confidence of the experimental conclusions by considering several potential sources of bias, such as noise in the datasets and the dataset types. In this paper, we investigate the impact of 32 feature selection methods on defect prediction performance over two versions of the NASA dataset (i.e., the noisy and clean NASA datasets) and the open source AEEEM dataset. We use a state-of-the-art double Scott-Knott test technique to analyze these methods. Experimental results show that the effectiveness of these feature selection methods on defect prediction performance varies significantly over all the datasets.
Experience Report: Practical Software Availability Prediction in Telecommunication Industry Kazu Okumoto.
A large number of software reliability growth models have been developed since the early 1970s, but there is no single model that can be used in every situation. Predicting software availability based on test defect data can be challenging. This paper provides a methodology to approach it. The proposed approach has been successfully applied to key telecommunication products over several years. A piecewise application of exponential models is used to precisely capture an entire defect trend from internal test phases to the customer site test and operation phase. The need for multiple curves is explained in terms of software content changes and test resource allocation, such as testers, lab time, and test cases. We then present how to predict software reliability and availability based on test defect data. Actual defect and field outage data from several releases of two large-scale software development projects are used to illustrate and validate the proposed approach.
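As a minimal illustration of the exponential building block used in such piecewise fits, the sketch below fits a single Goel-Okumoto-style curve, m(t) = a(1 - e^(-bt)), to one hypothetical test phase (assuming SciPy is available); the paper's approach chains several such curves across phases.

    import numpy as np
    from scipy.optimize import curve_fit

    def goel_okumoto(t, a, b):
        """Exponential SRGM mean value function: expected cumulative defects by time t."""
        return a * (1.0 - np.exp(-b * t))

    # Hypothetical weekly cumulative defect counts from one test phase.
    weeks = np.arange(1, 13, dtype=float)
    defects = np.array([12, 25, 35, 44, 50, 55, 58, 61, 63, 64, 65, 66], dtype=float)

    (a, b), _ = curve_fit(goel_okumoto, weeks, defects, p0=(80.0, 0.1))
    print(f"total expected defects a = {a:.1f}, detection rate b = {b:.3f}")
    print(f"estimated residual defects = {a - defects[-1]:.1f}")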
Failure Models for Testing Continuous Controllers Dominik Holling, Alexandru-Alvin Stanescu, Kristian Beckers, Alexander Pretschner and Matthias Gemmar.
Ranging from temperature control to safety-critical applications, continuous controllers are used in a plethora of applications and are becoming increasingly complex. In turn, testing continuous control systems is also becoming more complex. In particular, application-specific manual formal analysis or testing of the complete input range becomes infeasible.
We present a comprehensive failure-based testing methodology and a respective automated tool for continuous controllers. Our methodology is based on an existing automated approach, testing stability, liveness, smoothness and responsiveness in a single value-response scenario only. We performed a practitioner survey and literature review in the domain revealing the quality criteria steadiness and reliability to be vital for meaningful testing of continuous controllers. In addition, we identified 4 further scenarios including disturbance response for comprehensive testing.
We contribute a library of failure models and quality criteria for the automated testing of continuous control systems that is more complete than in previous approaches. On the grounds of our comprehensive experiments on 9 real-world control systems, our results demonstrate that our failure-based testing methodology provides better worst cases than manual testing (effectiveness) within an adequate time frame (efficiency) for any configuration used in our experiments (reproducibility).
Evaluation Metrics of Service-Level Reliability Monitoring Rules of a Big Data Service Keun Soo Yim.
This paper presents new metrics to evaluate the reliability monitoring rules of a large-scale big data service. Our target service uses manually tuned, service-level reliability monitoring rules. Using the measurement data, we identify two key technical challenges in operating our target monitoring system. In order to improve the operational efficiency, we characterize how those rules were manually tuned by the domain experts. The characterization results provide useful information to operators supposed to regularly tune such rules. Using the actual production failure data, we evaluate the same monitoring rules by using standard metrics and the presented metrics. Our evaluation results show the strengths and weaknesses of each metric and show that the presented metrics can further help operators recognize when and which rules need to be re-tuned.
Bear: A Framework for Understanding Application Sensitivity to OS (Mis)Behavior Ruimin Sun, Andrew Lee, Aokun Chen, Donald E. Porter, Matt Bishop and Daniela Oliveira.
Applications are generally written assuming a predictable and well-behaved OS. In practice, they experience unpredictable misbehavior at the OS level and across OSes: different OSes can handle network events differently, APIs can behave differently across OSes, and OSes may be compromised or buggy. This unpredictability is challenging because its sources typically manifest during deployment and are hard to reproduce. This paper introduces Bear, a framework for statistical analysis of application sensitivity to OS unpredictability that can help developers build more resilient software, discover challenging bugs, and identify the scenarios that most need validation. Bear analyzes a program with a set of perturbation strategies on a set of commonly used system calls in order to discover the most sensitive system calls for each application, the most impactful strategies, and how they predict abnormal program outcomes. We evaluated Bear with 113 CPU- and IO-bound programs, and our results show that null memory dereferencing and erroneous buffer operations are the most impactful strategies for predicting abnormal program execution, and that their impact increases ten-fold as the workload increases (e.g., from 10 to 1,000 network requests). Generic system calls are more sensitive than specialized system calls; for example, write and sendto can both be used to send data through a socket, but the sensitivity of write is twice that of sendto. System calls with an array parameter (e.g., read) are more sensitive to perturbations than those having a struct parameter with a buffer (e.g., readv). Moreover, the fewer parameters a system call has, the more sensitive it is.
Dodging Unsafe Update Points in Java Dynamic Software Updating Systems Walter Cazzola and Mehdi Jalili.
Dynamic Software Updating (DSU) provides mechanisms to update a program without stopping its execution. An indiscriminate update that does not consider the current state of the computation potentially undermines the stability of the running application. Automatically determining a safe moment at which to update the running system is still an open problem, often neglected by existing DSU systems. This paper proposes a mechanism to support the choice of a safe update point by marking which points can be considered unsafe and should therefore be dodged during the update. The method is based on decorating the code with specific metadata that can be used to find the right moment to perform the update. The proposed approach has been implemented as an external component that can be plugged into any DSU system. The approach is demonstrated on the evolution of the HSQLDB system from two distinct versions to their next update.
Fixing Resource Leaks in Android Apps with Light-weight Static Analysis and Low-overhead Instrumentation Jierui Liu, Tianyong Wu, Jun Yan and Jian Zhang.
Fixing bugs according to bug reports is labor-intensive work for developers, and automatic techniques can effectively decrease the manual effort. A feasible solution is to fix specific bugs by static analysis and code instrumentation. In this paper, we present a lightweight approach to fix the resource leak bugs that exist widely in Android apps, while guaranteeing that the patches do not interrupt the normal execution of the original program. The approach first performs a lightweight static analysis and then carefully designs concise patch code that is inserted into the bytecode. When the program is running, the patches trace the state of leaked resources and release them at a proper time. Our experiments on dozens of real-world apps show that our approach can effectively fix resource leaks with negligible extra execution time and less than 4% extra code, in a few seconds.
Predicting Consistent Clone Change Fanlong Zhang, Siau-Cheng Khoo and Xiaohong Su.
Code clones, being an inevitable by-product of rapid software development, can impact software quality. The introduction of code clone groups and clone genealogies enables software developers to be aware of the presence of and changes to clones as a collective group; they also allow developers to understand how clone groups evolve throughout the software life cycle. Due to the similarity of code within a clone group, a change in one piece of the code may require developers to make changes to other clones in the group. Failure to make consistent changes to a clone group when necessary is commonly known as a "clone consistency-defect", which can adversely impact software reliability. We propose an approach to predict the clone consistency-requirement at the time when changes have been made to a clone group. Our predictor is a Bayesian network implemented in WEKA. We build a variant of clone genealogies to collect all consistent/inconsistent changes to clone groups, and extract three sets of attributes from clone groups as input for predicting consistent clone change: code attributes, context attributes, and evolution attributes. We conduct experiments on three open source projects. These experiments show that our approach has high precision and recall in predicting the clone consistency-requirement. This holistic approach can aid developers in maintaining code clone changes and avoiding potential clone consistency-defects, which can improve software quality and reliability.
Switching to Git: the Good, the Bad, and the Ugly Sascha Just, Kim Herzig, Jacek Czerwonka and Brendan Murphy.
Since its introduction 10 years ago, GIT has taken the world of version control systems (VCS) by storm. Its success is partly due to creating opportunities for new usage patterns that empower developers to work more efficiently. Users can manipulate version archives from VCS like GIT in nearly any way, even re-write history. The resulting change in both user behavior and the way GIT stores changes impacts data mining and data analytics procedures [6], [13]. While some of these unique characteristics can be managed by adjusting mining and analytical techniques, others can lead to severe data loss, creating challenges to established development process analytics. This paper is based on our experience in attempting to provide process analysis for Microsoft product teams such as WINDOWS and OFFICE, which are currently switching to using GIT as their primary version control. We illustrate how GIT's mechanisms and usage patterns create a need for changing well-established data analytic processes. We offer insights into those internal GIT concepts that are most difficult to interpret through data mining, to raise awareness of how certain GIT operations may damage information about historical code changes and make it hard or even impossible to provide a precise and continuous data flow for some types of process analytics. To that end, we provide a list of common GIT usage patterns with a description of how these operations impact data mining applications. Finally, we provide examples of how one may counteract the effects of such destructive operations in the future. We further provide a new algorithm to detect integration paths that is specific to distributed version control systems like GIT, which allows us to reconstruct the information that is crucial to most development process analytics.
Does geographical distance affect distributed development teams: How aggregation bias in software artifacts causes contradicting findings Thanh H. D. Nguyen, Bram Adams and Ahmed E. Hassan.
Does geographic distance affect distributed software development teams? Researchers have been mining software artifacts to find evidence that geographic distance between software team members introduces delay in communication and deliverables. While some studies found that geographical distance negatively impacts software teams, other studies dispute this finding. It has been speculated that various confounding factors are the reason for the contradicting findings. For example, newer tools and practices that enable team members to communicate and collaborate more effectively, might have negated the effects of distance in some studies.
In this study, we examine an alternate theory to explain the contradicting findings: the different aggregations of the software artifacts used in past studies. We call this type of bias the aggregation bias. We replicated the previous studies on detecting evidence of delay in communication using data from a large commercial distributed software project. We use two different levels of artifacts in this study: the class files and the components that are aggregations of the class files. Our results show that the effect of distance does appear in the low-level artifacts. However, the effect does not appear in the aggregated artifacts. Since mining software artifacts has become a popular methodology for conducting research in software engineering, this result calls for careful attention to the use of aggregated artifacts in software studies.
API Failures in Cloud Environments - An Empirical Study on OpenStack Pooya Musavi, Bram Adams and Foutse Khomh.
Stories about service outages in cloud environments have been making the headlines recently. In many cases, the reliability of cloud infrastructure Application Programming Interfaces (APIs) was at fault. Hence, understanding the factors affecting the reliability of these APIs is important to improve the availability of cloud services. In this study, we mined bugs of 25 modules within the 5 most important OpenStack APIs to understand API failures and their characteristics. Our results show that in OpenStack, only one third of all API-related changes are due to bug fixes, with 7% of all fixes even changing the API interface, potentially breaking clients. Through qualitative analysis of 230 sampled API failures, we observed that the majority of API-related bugs are due to small programming faults. Fortunately, the subject, message, and stack trace, as well as the reply lag between comments included in these failures' bug reports, provide a good indication of the cause of the failure.
Domain Arguments in Safety Critical Software Development Jonathan Rowanhill and John Knight.
This paper explores domain arguments—arguments about why techniques, processes, and designs possess the properties that their domain experts believe them to have. An elicitation technique for their recovery from domain documents is presented. This is followed by a demonstrated application of the technique to several domain artifacts from aviation engineering. The elicited arguments are presented and analyzed for their properties. The inherent importance of such arguments is discussed, as well as their potential contribution to system assurance arguments such as the safety case.
Model-based Test Automation of a Concurrent Flight Software Bus: An Experience Report Dharmalingam Ganesan, Mikael Lindvall, Susanne L Strege and Walter Moleski.
Many systems make use of concurrent tasks; however, it is often difficult to test concurrent designs. Therefore, many test cases are simplified and do not fully test all concurrency aspects of the system. We encountered this problem when analyzing test cases for concurrent flight software at NASA. To address this problem, we developed and evaluated a model-based testing (MBT) technique for testing concurrent systems. Using MBT, the tester creates a model based on the requirements of the system under test (SUT) and lets the computer generate innumerable test cases automatically from the model. We evaluate the effectiveness of the technique using Microsoft's Spec Explorer MBT tool. We apply the technique to NASA's Core Flight Software (cFS) software bus module API, which is based on a concurrent publisher-subscriber architecture style and is a safety-critical system. We describe how we created a test automation architecture for testing concurrent inter-task communication as carried out by the software bus. We also investigate the types of issues the technique can find as well as the degree of code coverage it can achieve.
Peeking into the Past: Efficient Checkpoint-assisted Time-traveling Debugging Armando Miraglia, Dirk Vogt, Herbert Bos, Andy S. Tanenbaum and Cristiano Giuffrida.
Debugging long-lived latent software bugs that manifest themselves only long after their introduction in the system is hard. Even state-of-the-art record/replay debugging techniques are of limited use to identify the root cause of long-lived latent bugs in general and event-driven bugs in particular. We propose DeLorean, a new end-to-end solution for time-travelling debugging based on fast memory checkpointing. Our design trades off replay guarantees with efficient support for history-aware debug queries (or time-travelling introspection) and provides novel analysis tools to diagnose event-driven latent software bugs. DeLorean imposes low run-time performance and memory overhead while preserving in memory as much history information as possible by deduplicating and/or compressing the collected data. We evaluate DeLorean by extensive experimentation, exploring the performance-memory tradeoffs in different configurations and comparing our results against state-of-the-art solutions. We show that DeLorean can efficiently support high-frequency checkpoints and store millions of them in memory.
Risk Assessment of User-Defined Security Configurations for Android Devices Daniel Vecchiato, Marco Vieira and Eliane Martins.
The widespread use of mobile devices, such as smartphones and tablets, and their advancing capabilities, ranging from taking photos to accessing banking accounts, make them an attractive target for attackers. This, together with the fact that users frequently store critical information on such devices and that many organizations allow employees to use their personal devices to access the enterprise information infrastructure and applications, makes security assessment a key need.
This paper proposes an approach for assessing the security risk posed by user-defined configurations in Android devices. The approach is based on analysis of the risk (impact and likelihood) of user misconfigurations harming the device or the user. The impact and likelihood values are defined based on a Multiple-Criteria Decision Analysis (MCDA) performed on inputs provided by a set of security experts. A case study considering the user-defined configurations of 561 Android devices is presented, showing that the majority of users neglect important and basic security configurations and that the proposed approach can be used in practice to characterize the security risk level of such devices.
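A small, hypothetical illustration of the kind of impact-times-likelihood aggregation such an assessment can use (the settings and the scores below are invented, not the experts' MCDA-derived values):

    # Hypothetical expert scores (1 = low, 5 = high) per security-relevant setting.
    EXPERT_SCORES = {
        # setting:                 (impact, likelihood of harm when misconfigured)
        "screen_lock_disabled":    (5, 4),
        "unknown_sources_enabled": (4, 3),
        "developer_mode_enabled":  (3, 2),
        "bluetooth_always_on":     (2, 3),
    }

    def device_risk(user_config):
        """Sum impact x likelihood over every setting the user left in an unsafe state."""
        risk = sum(i * l for s, (i, l) in EXPERT_SCORES.items() if user_config.get(s))
        worst = sum(i * l for i, l in EXPERT_SCORES.values())
        return risk, risk / worst      # absolute score and normalised risk level

    config = {"screen_lock_disabled": True, "bluetooth_always_on": True,
              "unknown_sources_enabled": False, "developer_mode_enabled": False}
    print(device_risk(config))        # (26, ~0.59)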
Software Aging Analysis of the Android Mobile OS Antonio Ken Iannillo, Domenico Cotroneo, Roberto Natella, Roberto Pietrantuono and Francesco Fucci.
Mobile devices are significantly complex, feature-rich, and heavily customized; thus, they are prone to software reliability and performance issues. This paper considers the problem of software aging in the Android mobile OS, which causes a device to gradually degrade in responsiveness and eventually fail. We present a methodology to identify factors (such as workloads and device configurations) and resource utilization metrics that are correlated with software aging. Moreover, we performed an empirical analysis of recent Android devices, finding that software aging actually affects them. The analysis pointed out processes and components of the Android OS affected by software aging, and identified metrics useful as indicators of software aging for scheduling software rejuvenation actions.
Pretect: Detecting Poor-Responsive UI in Android Applications Yu Kang, Yangfan Zhou, Min Gao, Yixia Sun and Michael Lyu.
Good user interface (UI) design is key to successful mobile apps. UI latency, which can be considered as the time between the commencement of a UI operation and its intended UI update, is a critical consideration for app developers. The current literature still lacks a comprehensive study on how much UI latency a user can tolerate or how to identify UI design defects that cause intolerably long UI latency. As a result, apps with bad UIs are still common in app markets, leading to extensive user complaints. This paper examines user expectations of UI latency and develops a tool to pinpoint intolerable UI latency in Android apps. To this end, we design an app to conduct a user survey of app UI latency. Through the survey, we find the relationship between user patience and UI latency. Therefore, a timely screen update (e.g., a loading animation) is critical for heavyweight UI operations (i.e., those that incur a long execution time before the final UI update is available). We then design a tool that, by monitoring UI inputs and updates, can detect apps that do not follow this criterion. The survey and the tool are released online as open source. We also apply the tool to many real-world apps. The results demonstrate the effectiveness of the tool in combating app UI design defects.