Job performance data in XDMoD is obtained from collection software running on the compute nodes of the HPC resource. The software architecture comprises three major components: collection software that runs on the compute nodes and records performance metrics, summarization software that combines the node-level metrics with job accounting data to produce a summary for each HPC job, and an Open XDMoD module that ingests and displays the job-level summaries. The job-level summary records are stored in a MongoDB document database.
The Job Performance (SUPReMM) module in XDMoD can be extended to support different data collection packages. The XDMoD team recommends using the open-source Performance Co-Pilot (PCP) software as the data source for new installations, and this document assumes that PCP is used as the data source. For more information about other data sources, please contact the development team via the mailing list or email (contact details are on the main overview page).
A simplified high-level overview of the SUPReMM data flow is shown in Figure 1 below. The diagram shows the principal features of the collection and processing architecture; the data aggregation and display components present in Open XDMoD are not shown.
Figure 1. SUPReMM data flow diagram
Performance Co-Pilot (PCP) runs on every compute node and is configured to log metric data every 30 seconds, as well as at the start and end of each HPC job via hooks in the job prolog and epilog scripts. The PCP data are logged to a shared filesystem.
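The 30-second sampling is typically expressed in a pmlogger configuration file. The snippet below is a minimal sketch only: the file path and the metric list are illustrative placeholders, not the metric set shipped with the SUPReMM configuration templates.

```
# /etc/pcp/pmlogger/pmlogger-supremm.config (illustrative path)
# Sample a small set of example metrics every 30 seconds.
log mandatory on every 30 seconds {
    kernel.percpu.cpu.user
    kernel.percpu.cpu.sys
    mem.util.used
    network.interface.in.bytes
    network.interface.out.bytes
    disk.dev.read_bytes
    disk.dev.write_bytes
}

# Record static hardware inventory once per archive.
log mandatory on once {
    hinv.ncpu
    hinv.physmem
}
```

The job prolog and epilog hooks typically invoke pmlogger with a single-sample option (for example `pmlogger -s 1`) so that an extra archive record is written exactly at the job boundaries.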
The accounting logs from the resource manager are ingested into the XDMoD datawarehouse. These accounting logs include information about the start and end times of each HPC job as well as the compute nodes that were assigned to the job.
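Accounting ingestion uses the standard Open XDMoD shredder and ingestor commands. The sketch below assumes a Slurm-managed resource named `myresource` and a hypothetical accounting dump file; adjust the format flag and path for your resource manager.

```
# Shred the resource manager accounting log (Slurm format assumed)
xdmod-shredder -r myresource -f slurm -i /var/log/slurm/accounting.log

# Load the shredded records into the XDMoD datawarehouse
xdmod-ingestor
```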
The summarization software runs periodically via cron. It combines the accounting information from the XDMoD datawarehouse with the metric data in the PCP archives to generate a job-level summary for each HPC job. These job-level summaries are stored in a MongoDB document database.
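As an illustrative sketch, the summarization step might be scheduled as follows. This assumes the open-source SUPReMM summarization package, whose `indexarchives.py` script indexes the PCP archives on the shared filesystem and whose `summarize_jobs.py` script writes the job-level summaries to MongoDB; the schedule, user, and install paths are placeholders.

```
# /etc/cron.d/supremm (illustrative schedule and user)
# Index newly written PCP archives from the shared filesystem
0 1 * * * supremm /usr/bin/indexarchives.py
# Generate job-level summaries and store them in MongoDB
0 2 * * * supremm /usr/bin/summarize_jobs.py
```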
The summarized job data is then ingested into the XDMoD datawarehouse for display in the web-based user interface.
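This final ingestion and aggregation step is also typically scheduled via cron. The line below is a sketch that assumes the SUPReMM module's `aggregate_supremm.sh` script is used to pull the job summaries from MongoDB into the datawarehouse and aggregate them; the time, user, and path are placeholders.

```
# /etc/cron.d/xdmod-supremm (illustrative schedule and user)
0 3 * * * xdmod /usr/bin/aggregate_supremm.sh
```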