Toolchain

Cyber-Physical Systems (CPS) are commonly used in mission-critical and safety-critical applications that demand high reliability and strong safety assurance. These systems frequently operate in highly uncertain environments where it is infeasible to explicitly design for every possible situation. Assuring safety in such systems requires supporting evidence from testing data, formal verification, expert analysis, and other sources. Data-driven methods, such as machine learning, are increasingly applied in CPS development to address these challenges.

img/ALC-Toolchain-Init-Overview.png

Fig. 1 ALC Toolchain Design Flow

The Assurance-based Learning-enabled Cyber-Physical Systems (ALC) toolchain is an integrated set of tools and corresponding workflows specifically tailored for model-based development of CPS that utilize learning-enabled components (LECs). Machine learning infers relationships from data instead of deriving them from analytical models, leading many systems employing LECs to rely almost entirely on testing results as the primary source of assurance evidence. However, test data alone is generally insufficient for assuring safety-critical systems, since testing cannot cover every possible edge case. The toolchain supports tasks including architectural system modeling, construction of experimental data sets and LEC training sets, performance evaluation using formal verification methods, and system safety assurance monitoring. Figure 1 (above) shows the general order of activities for each of these steps. Each step of the process can be refined through iteration to adjust parameters, retrain LECs, expand testing solution spaces, and so on.

Evidence used for safety assurance should be traceable and reproducible. Since LECs are trained from data rather than derived from analytical models, the quality of an LEC depends on the history and quality of its training data. It is therefore necessary to maintain data provenance when working with LECs so that models remain reproducible. Manual data management across the complex tool suites often used for CPS development is a time-consuming and error-prone process, and the problem is even more pronounced for systems using LECs, where training data and the resulting trained models must also be properly managed. With this toolchain, all generated artifacts - including system models, simulation data, and trained networks - are automatically stored as accessible data sets and managed to support both traceability and reproducibility.

The design process begins with initial modeling of how the system hardware and software components act and interact. A system architecture model based on SysML Internal Block Diagrams allows the user to describe the system architecture in terms of the underlying components (hierarchical blocks) and their interaction via signal, energy, and material flows. System configuration instances are then defined to provide parameters that can be adjusted during testing, allowing exploration of the system use cases and optimization of the system elements to meet the desired requirements, as sketched after this paragraph.
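As a rough illustration only (the class names, component names, and parameter values below are hypothetical, not the toolchain's actual modeling API), blocks, flows, and a configuration instance could be captured as:

    from dataclasses import dataclass, field

    @dataclass
    class Block:
        """A hierarchical system component, as in a SysML Internal Block Diagram."""
        name: str
        children: list = field(default_factory=list)

    @dataclass
    class Flow:
        """A signal, energy, or material flow between two blocks."""
        source: str
        target: str
        kind: str  # "signal", "energy", or "material"

    @dataclass
    class Configuration:
        """A system configuration instance: concrete parameter values for one test."""
        name: str
        parameters: dict

    # Illustrative unmanned-vehicle example
    vehicle = Block("Vehicle", children=[Block("Sonar"), Block("LEC_Controller"), Block("Thruster")])
    flows = [
        Flow("Sonar", "LEC_Controller", "signal"),
        Flow("LEC_Controller", "Thruster", "signal"),
    ]
    baseline = Configuration("baseline", {"max_speed": 2.0, "sensor_noise_std": 0.05})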

Once the system elements have been modeled, relevant data is generated to support testing and evaluation of system performance. LECs are built using either supervised or reinforcement learning techniques. Once an LEC is created, it can be retrained under different system scenarios and configurations to optimize the system. Assurance-monitoring LECs are created in parallel, either from the training data used to build the system LECs (for supervised learning) or from the trained LECs themselves (for reinforcement learning). The sketch following this paragraph shows what a minimal supervised training loop might look like.
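The following is a minimal sketch of supervised LEC training, assuming PyTorch; the network, data shapes, and file names are placeholders rather than anything prescribed by the toolchain:

    import torch
    import torch.nn as nn

    # Stand-ins for logged sensor features and recorded control commands.
    X = torch.randn(1024, 8)
    y = torch.randn(1024, 1)

    # A toy LEC: regress a control output from sensor features.
    lec = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
    opt = torch.optim.Adam(lec.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    for epoch in range(20):
        opt.zero_grad()
        loss = loss_fn(lec(X), y)
        loss.backward()
        opt.step()

    # Retraining from a previously trained LEC amounts to loading its weights first:
    # lec.load_state_dict(torch.load("parent_lec.pt"))
    torch.save(lec.state_dict(), "lec.pt")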

Verification, validation, and assurance testing provide methods to assess the trained models and their ability to meet the system requirements and execute the desired tasks safely. A fundamental problem with LECs is that the training set is finite and may not capture all situations the system encounters at operation time. In such unknown situations, the LEC may produce incorrect or unacceptable results without the rest of the system being aware of it. Continuously monitoring the LEC's performance and the level of confidence in its output enables assurance monitoring, which oversees the LEC and gives a clear indication of problematic situations, as sketched after this paragraph. Formal verification methods and testing evaluation metrics can be used to characterize the solution space achievable with the trained LEC and the robustness of the system under adversarial attacks. This information can also be referenced as evidence in static system assurance arguments.
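As one concrete (hypothetical) shape such a monitor can take, the sketch below follows the spirit of inductive conformal prediction: nonconformity scores are calibrated offline on held-out data, and an alarm is raised when a new input scores higher than nearly everything seen during calibration. The score distribution and threshold here are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    # Stand-in for nonconformity scores (e.g., |prediction error|) on a held-out calibration set.
    calibration_scores = rng.exponential(scale=1.0, size=500)

    def p_value(score, cal=calibration_scores):
        """Fraction of calibration scores at least as large as the new score."""
        return (np.sum(cal >= score) + 1) / (len(cal) + 1)

    def assurance_monitor(score, epsilon=0.05):
        """Raise an alarm when the LEC behavior is unlike anything seen in calibration."""
        return "ALARM" if p_value(score) < epsilon else "OK"

    print(assurance_monitor(0.8))  # typical score -> OK
    print(assurance_monitor(9.0))  # extreme score -> ALARM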

Each portion of the model can be used to iterate on the design: improving the LEC models, adjusting system design parameters to determine their impact, or altering the testing scenarios to include solution spaces not originally covered in the design process in order to expose performance issues. Workflow tools are available to simplify and automate these system-level iterative tasks.

Toolchain Resources

img/Toolchain-Deployment.png

Fig. 2 ALC Toolchain Resource Usage

The toolchain is built on the WebGME infrastructure, which provides a web-based, collaborative environment where changes are automatically and immediately propagated to all active users. User-created system models, data collection, and testing activities are created and managed on the WebGME servers, accessed through web browsers from remote terminals, as shown in Figure 2. To promote reproducibility and maintain data provenance, all models, training data, and contextual data are stored in a version-controlled database, and data management is automated.

The toolchain supports embedded Jupyter notebooks within the context of an experiment, training, or evaluation model. Users can configure the code in a Jupyter notebook to execute the model, which allows them to launch execution instances interactively and debug their code when required. It also allows users to write custom code to evaluate system performance; an example of such a cell follows this paragraph.
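For instance, an evaluation cell might load the results of an experiment run and compute a custom performance metric; the file name and column names here are hypothetical, not part of the toolchain:

    import pandas as pd

    # Hypothetical results file produced by an experiment run.
    results = pd.read_csv("experiment_results.csv")

    # Custom evaluation: fraction of time steps where the cross-track error
    # stayed within a 0.5 m safety bound (column name is illustrative).
    within_bound = (results["cross_track_error"].abs() < 0.5).mean()
    print(f"Within-bound ratio: {within_bound:.2%}")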

Whenever any model is executed, all parameters and configuration data needed to repeat the execution are stored in a metadata file alongside the results. This metadata file also contains references to any artifacts used as inputs to the model in order to maintain data provenance. Metadata files for LEC training contain the Uniform Resource Identifier (URI) of each data file used in the training set, as well as a copy of the parent LEC metadata if training was continued from a previously trained model. Similarly, the metadata for an evaluation experiment references any trained LECs used in the experiment. This ensures that the complete history of any artifact can be traced back to the original data, regardless of how many iterations of the design cycle are required, as illustrated below. Additionally, the toolchain includes a dataset manager for viewing and analyzing this lineage.
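The records below illustrate this idea; the field names and URI scheme are assumptions made for the sake of the example, not the toolchain's actual metadata schema:

    import json

    # Each metadata record references its input data sets by URI and, when
    # training continued from an earlier model, embeds the parent's metadata.
    lec_v2 = {
        "artifact": "lec_v2",
        "training_data": ["alc://data/run_017", "alc://data/run_023"],
        "parent_lec": {"artifact": "lec_v1", "training_data": ["alc://data/run_004"]},
    }

    def trace_lineage(meta):
        """Walk parent references to recover every data set behind an artifact."""
        uris = list(meta.get("training_data", []))
        if meta.get("parent_lec"):
            uris += trace_lineage(meta["parent_lec"])
        return uris

    print(json.dumps(trace_lineage(lec_v2), indent=2))
    # -> every URI contributing to lec_v2, across all training iterations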

The ALC toolchain allows remote deployment of computationally intensive tasks on appropriately equipped execution servers. This enables CPS developers to configure and launch computationally intensive system executions (simulations) and training runs on powerful machines from local web browsers, while collaborating with a distributed team of developers. The execution server is typically a remote machine with some form of hardware acceleration, such as graphics processing units (GPUs), digital signal processors (DSPs), FPGAs, or ASICs.

Large data sets (e.g., simulation data and trained LEC models) are stored on dedicated fileservers. Each data set is linked to a corresponding metadata file, which is returned to the WebGME server and stored in the version-controlled model database. The metadata files provide enough information to retrieve a particular data set from the fileserver when needed for other tasks such as LEC training, performance evaluation, or LEC deployment. When experiment results are uploaded to the fileserver, the configuration files and other artifacts used to execute the experiments are stored with the generated data, allowing the experiment to be repeated and any generated data to be reproduced as needed. This pattern of uploading data to a dedicated server and storing only the corresponding metadata in the model frees WebGME from handling large files and improves efficiency as well as model scalability.

Additionally, WebGME provides a version control scheme similar to Git: model updates are stored in a tree structure and assigned a SHA-1 hash, and for each update only the differences between the current and previous state of the model are stored. This allows the model to be reverted to any previous state in its history by rolling back changes until the hash corresponding to the desired state is reached. User access to the model data sets is available in the section labeled DataSets.
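To make the versioning scheme concrete, here is a minimal sketch of hash-linked, diff-based storage in the style described above (WebGME's actual storage format is more involved than this):

    import hashlib
    import json

    history = {}  # SHA-1 hash -> commit record

    def commit(diff, parent=None):
        """Store only the difference from the parent state, keyed by a SHA-1 hash."""
        body = {"diff": diff, "parent": parent}
        h = hashlib.sha1(json.dumps(body, sort_keys=True).encode()).hexdigest()
        history[h] = body
        return h

    def state_at(h):
        """Rebuild a model state by replaying diffs from the root up to commit h."""
        chain = []
        while h is not None:
            chain.append(history[h]["diff"])
            h = history[h]["parent"]
        state = {}
        for diff in reversed(chain):  # apply oldest first
            state.update(diff)
        return state

    h1 = commit({"max_speed": 2.0})
    h2 = commit({"max_speed": 2.5}, parent=h1)
    print(state_at(h2))  # current view:  {'max_speed': 2.5}
    print(state_at(h1))  # reverted view: {'max_speed': 2.0}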

Publications

  1. D. Stojcsics, D. Boursinos, N. Mahadevan, X. Koutsoukos, and G. Karsai, Fault-Adaptive Autonomy in Systems with Learning-Enabled Components, Sensors (Basel, Switzerland), vol. 21, no. 18, p. 6089, Sep. 2021.
  2. C. Hartsell, N. Mahadevan, H. Nine, T. Bapty, A. Dubey, and G. Karsai, Workflow Automation for Cyber Physical System Development Processes, in 2020 IEEE Workshop on Design Automation for CPS and IoT (DESTION), 2020.
  3. S. Ramakrishna, C. Hartsell, M. P. Burruss, G. Karsai, and A. Dubey, Dynamic-weighted simplex strategy for learning enabled cyber physical systems, Journal of Systems Architecture, vol. 111, p. 101760, 2020.
  4. C. Hartsell, N. Mahadevan, S. Ramakrishna, A. Dubey, T. Bapty, T. T. Johnson, X. D. Koutsoukos, J. Sztipanovits, and G. Karsai, Model-based design for CPS with learning-enabled components, in Proceedings of the Workshop on Design Automation for CPS and IoT, DESTION@CPSIoTWeek 2019, Montreal, QC, Canada, 2019, pp. 1–9.
  5. S. Ramakrishna, A. Dubey, M. P. Burruss, C. Hartsell, N. Mahadevan, S. Nannapaneni, A. Laszka, and G. Karsai, Augmenting Learning Components for Safety in Resource Constrained Autonomous Robots, in IEEE 22nd International Symposium on Real-Time Distributed Computing, ISORC 2019, Valencia, Spain, May 7-9, 2019, 2019, pp. 108–117.
  6. C. Hartsell, N. Mahadevan, S. Ramakrishna, A. Dubey, T. Bapty, T. T. Johnson, X. D. Koutsoukos, J. Sztipanovits, and G. Karsai, CPS Design with Learning-Enabled Components: A Case Study, in Proceedings of the 30th International Workshop on Rapid System Prototyping, RSP 2019, New York, NY, USA, October 17-18, 2019, 2019, pp. 57–63.