Toolchain

Cyber-Physical Systems (CPS) are commonly used in mission-critical and safety-critical applications that demand high reliability and strong safety assurance. These systems frequently operate in highly uncertain environments where it is infeasible to explicitly design for every possible situation. Assuring safety in such systems requires supporting evidence from testing data, formal verification, expert analysis, and other sources. Data-driven methods, such as machine learning, are increasingly applied in CPS development to address these challenges.

img/ALC-Toolchain-Init-Overview.png

Fig. 1 ALC Toolchain Design Flow

The Assurance-based Learning-enabled Cyber-Physical Systems (ALC) toolchain is an integrated set of tools and corresponding workflows specifically tailored for model-based development of CPS that utilize learning-enabled components (LECs). Machine learning infers relationships from data instead of deriving them from analytical models, leading many systems employing LECs to rely almost entirely on testing results as the primary source of assurance evidence. However, test data alone is generally insufficient for assuring safety-critical systems, since testing cannot cover every possible edge case. The toolchain supports tasks including architectural system modeling, construction of experimental data sets and LEC training sets, performance evaluation using formal verification methods, and system safety assurance monitoring. Figure 1 (above) shows the general order of activities for each of these steps. Each step of the process can be refined through iteration to adjust parameters, retrain LECs, expand testing solution spaces, and so on.

Evidence used for safety assurance should be traceable and reproducible. Since LECs are trained from data rather than derived from analytical models, the quality of an LEC depends on the history and quality of its training data. It is therefore necessary to maintain data provenance when working with LECs so that models remain reproducible. Manual data management across the complex tool suites often used for CPS development is a time-consuming and error-prone process, and the problem is even more pronounced for systems using LECs, where training data and the resulting trained models must also be properly managed. With this toolchain, all generated artifacts - including system models, simulation data, and trained networks - are automatically stored as accessible data sets and managed to support both traceability and reproducibility.

The design process begins with initial modeling of how the system hardware and software components act and interact. A system architecture model based on SysML Internal Block Diagrams allows the user to describe the system architecture in terms of the underlying components (hierarchical blocks) and their interaction via signal, energy, and material flows. System configuration instances are then defined to provide parameters that can be adjusted during testing, allowing exploration of the system use cases and optimization of the system elements to meet the desired requirements, as sketched after this paragraph.
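As a rough illustration only (the class names, component names, and parameter values below are hypothetical, not the toolchain's actual modeling API), blocks, flows, and a configuration instance could be captured as:

    from dataclasses import dataclass, field

    @dataclass
    class Block:
        """A hierarchical system component, as in a SysML Internal Block Diagram."""
        name: str
        children: list = field(default_factory=list)

    @dataclass
    class Flow:
        """A signal, energy, or material flow between two blocks."""
        source: str
        target: str
        kind: str  # "signal", "energy", or "material"

    @dataclass
    class Configuration:
        """A system configuration instance: concrete parameter values for one test."""
        name: str
        parameters: dict

    # Illustrative unmanned-vehicle example
    vehicle = Block("Vehicle", children=[Block("Sonar"), Block("LEC_Controller"), Block("Thruster")])
    flows = [
        Flow("Sonar", "LEC_Controller", "signal"),
        Flow("LEC_Controller", "Thruster", "signal"),
    ]
    baseline = Configuration("baseline", {"max_speed": 2.0, "sensor_noise_std": 0.05})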

Once the system elements have been modeled, relevant data is generated to support testing and evaluation of system performance. LECs are built using either supervised or reinforcement learning techniques. Once an LEC is created, it can be retrained under different system scenarios and configurations to optimize the system. Assurance-monitoring LECs are created in parallel, either from the training data used to build the system LECs (for supervised learning) or from the trained LECs themselves (for reinforcement learning). The sketch following this paragraph shows what a minimal supervised training loop might look like.
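The following is a minimal sketch of supervised LEC training, assuming PyTorch; the network, data shapes, and file names are placeholders rather than anything prescribed by the toolchain:

    import torch
    import torch.nn as nn

    # Stand-ins for logged sensor features and recorded control commands.
    X = torch.randn(1024, 8)
    y = torch.randn(1024, 1)

    # A toy LEC: regress a control output from sensor features.
    lec = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
    opt = torch.optim.Adam(lec.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    for epoch in range(20):
        opt.zero_grad()
        loss = loss_fn(lec(X), y)
        loss.backward()
        opt.step()

    # Retraining from a previously trained LEC amounts to loading its weights first:
    # lec.load_state_dict(torch.load("parent_lec.pt"))
    torch.save(lec.state_dict(), "lec.pt")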

Verification, validation, and assurance testing provide methods to assess the trained models and their ability to meet the system requirements and execute the desired tasks safely. A fundamental problem with LECs is that the training set is finite and may not capture all situations the system encounters at operation time. In such unknown situations, the LEC may produce incorrect or unacceptable results without the rest of the system being aware of it. Continuously monitoring the LEC's performance and the level of confidence in its output enables assurance monitoring, which oversees the LEC and gives a clear indication of problematic situations, as sketched after this paragraph. Formal verification methods and testing evaluation metrics can be used to characterize the solution space achievable with the trained LEC and the robustness of the system under adversarial attacks. This information can also be referenced as evidence in static system assurance arguments.
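As one concrete (hypothetical) shape such a monitor can take, the sketch below follows the spirit of inductive conformal prediction: nonconformity scores are calibrated offline on held-out data, and an alarm is raised when a new input scores higher than nearly everything seen during calibration. The score distribution and threshold here are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    # Stand-in for nonconformity scores (e.g., |prediction error|) on a held-out calibration set.
    calibration_scores = rng.exponential(scale=1.0, size=500)

    def p_value(score, cal=calibration_scores):
        """Fraction of calibration scores at least as large as the new score."""
        return (np.sum(cal >= score) + 1) / (len(cal) + 1)

    def assurance_monitor(score, epsilon=0.05):
        """Raise an alarm when the LEC behavior is unlike anything seen in calibration."""
        return "ALARM" if p_value(score) < epsilon else "OK"

    print(assurance_monitor(0.8))  # typical score -> OK
    print(assurance_monitor(9.0))  # extreme score -> ALARM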

Each portion of the model can be used to iterate on the design: improving the LEC models, adjusting system design parameters to determine their impact, or altering the testing scenarios to include solution spaces not originally covered in the design process in order to expose performance issues. Workflow tools are available to simplify and automate these system-level iterative tasks.

Toolchain Resources

img/Toolchain-Deployment.png

Fig. 2 ALC Toolchain Resource Usage

The toolchain is built on the WebGME infrastructure, which provides a web-based, collaborative environment where changes are automatically and immediately propagated to all active users. User-created system models, data collection, and testing activities are created and managed on the WebGME servers, accessed through web browsers from remote terminals, as shown in Figure 2. To promote reproducibility and maintain data provenance, all models, training data, and contextual data are stored in a version-controlled database, and data management is automated.

The toolchain supports embedded Jupyter notebooks within the context of an experiment, training, or evaluation model. Users can configure the code in a Jupyter notebook to execute the model, which allows them to launch execution instances interactively and debug their code when required. It also allows users to write custom code to evaluate system performance; an example of such a cell follows this paragraph.
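For instance, an evaluation cell might load the results of an experiment run and compute a custom performance metric; the file name and column names here are hypothetical, not part of the toolchain:

    import pandas as pd

    # Hypothetical results file produced by an experiment run.
    results = pd.read_csv("experiment_results.csv")

    # Custom evaluation: fraction of time steps where the cross-track error
    # stayed within a 0.5 m safety bound (column name is illustrative).
    within_bound = (results["cross_track_error"].abs() < 0.5).mean()
    print(f"Within-bound ratio: {within_bound:.2%}")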

Whenever any model is executed, all parameters and configuration data needed to repeat the execution are stored in a metadata file alongside the results. This metadata file also contains references to any artifacts used as inputs to the model in order to maintain data provenance. Metadata files for LEC training contain the Uniform Resource Identifier (URI) of each data file used in the training set, as well as a copy of the parent LEC metadata if training was continued from a previously trained model. Similarly, the metadata for an evaluation experiment references any trained LECs used in the experiment. This ensures that the complete history of any artifact can be traced back to the original data, regardless of how many iterations of the design cycle are required, as illustrated below. Additionally, the toolchain includes a dataset manager for viewing and analyzing this lineage.
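The records below illustrate this idea; the field names and URI scheme are assumptions made for the sake of the example, not the toolchain's actual metadata schema:

    import json

    # Each metadata record references its input data sets by URI and, when
    # training continued from an earlier model, embeds the parent's metadata.
    lec_v2 = {
        "artifact": "lec_v2",
        "training_data": ["alc://data/run_017", "alc://data/run_023"],
        "parent_lec": {"artifact": "lec_v1", "training_data": ["alc://data/run_004"]},
    }

    def trace_lineage(meta):
        """Walk parent references to recover every data set behind an artifact."""
        uris = list(meta.get("training_data", []))
        if meta.get("parent_lec"):
            uris += trace_lineage(meta["parent_lec"])
        return uris

    print(json.dumps(trace_lineage(lec_v2), indent=2))
    # -> every URI contributing to lec_v2, across all training iterations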

The ALC toolchain allows remote deployment of computationally intensive tasks on appropriately equipped execution servers. This enables CPS developers to configure and launch computationally intensive system executions (simulations) and training runs on powerful machines from local web browsers, while collaborating with a distributed team of developers. The execution server is typically a remote machine with some form of hardware acceleration, such as graphics processing units (GPUs), digital signal processors (DSPs), FPGAs, or ASICs.

Large data sets (e.g., simulation data and trained LEC models) are stored on dedicated fileservers. Each data set is linked to a corresponding metadata file, which is returned to the WebGME server and stored in the version-controlled model database. The metadata files provide enough information to retrieve a particular data set from the fileserver when needed for other tasks such as LEC training, performance evaluation, or LEC deployment. When experiment results are uploaded to the fileserver, the configuration files and other artifacts used to execute the experiments are stored with the generated data, allowing the experiment to be repeated and any generated data to be reproduced as needed. This pattern of uploading data to a dedicated server and storing only the corresponding metadata in the model frees WebGME from handling large files and improves efficiency as well as model scalability.

Additionally, WebGME provides a version control scheme similar to Git: model updates are stored in a tree structure and assigned a SHA-1 hash, and for each update only the differences between the current and previous state of the model are stored. This allows the model to be reverted to any previous state in its history by rolling back changes until the hash corresponding to the desired state is reached. User access to the model data sets is available in the section labeled DataSets.
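To make the versioning scheme concrete, here is a minimal sketch of hash-linked, diff-based storage in the style described above (WebGME's actual storage format is more involved than this):

    import hashlib
    import json

    history = {}  # SHA-1 hash -> commit record

    def commit(diff, parent=None):
        """Store only the difference from the parent state, keyed by a SHA-1 hash."""
        body = {"diff": diff, "parent": parent}
        h = hashlib.sha1(json.dumps(body, sort_keys=True).encode()).hexdigest()
        history[h] = body
        return h

    def state_at(h):
        """Rebuild a model state by replaying diffs from the root up to commit h."""
        chain = []
        while h is not None:
            chain.append(history[h]["diff"])
            h = history[h]["parent"]
        state = {}
        for diff in reversed(chain):  # apply oldest first
            state.update(diff)
        return state

    h1 = commit({"max_speed": 2.0})
    h2 = commit({"max_speed": 2.5}, parent=h1)
    print(state_at(h2))  # current view:  {'max_speed': 2.5}
    print(state_at(h1))  # reverted view: {'max_speed': 2.0}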

Publications

  1. D. Stojcsics, D. Boursinos, N. Mahadevan, X. Koutsoukos, and G. Karsai, Fault-Adaptive Autonomy in Systems with Learning-Enabled Components, Sensors (Basel, Switzerland), vol. 21, no. 18, p. 6089, Sep. 2021.
  2. C. Hartsell, N. Mahadevan, H. Nine, T. Bapty, A. Dubey, and G. Karsai, Workflow Automation for Cyber Physical System Development Processes, in 2020 IEEE Workshop on Design Automation for CPS and IoT (DESTION), 2020.
  3. S. Ramakrishna, C. Hartsell, M. P. Burruss, G. Karsai, and A. Dubey, Dynamic-weighted simplex strategy for learning enabled cyber physical systems, Journal of Systems Architecture, vol. 111, p. 101760, 2020.
  4. C. Hartsell, N. Mahadevan, S. Ramakrishna, A. Dubey, T. Bapty, T. T. Johnson, X. D. Koutsoukos, J. Sztipanovits, and G. Karsai, Model-based design for CPS with learning-enabled components, in Proceedings of the Workshop on Design Automation for CPS and IoT, DESTION@CPSIoTWeek 2019, Montreal, QC, Canada, 2019, pp. 1–9.
  5. S. Ramakrishna, A. Dubey, M. P. Burruss, C. Hartsell, N. Mahadevan, S. Nannapaneni, A. Laszka, and G. Karsai, Augmenting Learning Components for Safety in Resource Constrained Autonomous Robots, in IEEE 22nd International Symposium on Real-Time Distributed Computing, ISORC 2019, Valencia, Spain, May 7-9, 2019, 2019, pp. 108–117.
  6. C. Hartsell, N. Mahadevan, S. Ramakrishna, A. Dubey, T. Bapty, T. T. Johnson, X. D. Koutsoukos, J. Sztipanovits, and G. Karsai, CPS Design with Learning-Enabled Components: A Case Study, in Proceedings of the 30th International Workshop on Rapid System Prototyping, RSP 2019, New York, NY, USA, October 17-18, 2019, 2019, pp. 57–63.