Getting Started
Welcome to the ACCESS-MOPPy Getting Started guide!
This section will walk you through the initial setup and basic usage of ACCESS-MOPPy, a tool designed to post-process ACCESS model output and produce CMIP-compliant datasets. You’ll learn how to configure your environment, prepare your data, and run the CMORisation workflow using both the Python API and Dask for scalable processing.
Set up configuration
When you first import access_moppy in a Python environment, the package will automatically create a user.yml file in your home directory (~/.moppy/user.yml). During this initial setup, you will be prompted to provide some basic information, including:
- Your name
- Your email address
- Your work organization
- Your ORCID
This information is stored in user.yml and will be used as global attributes in the files generated during the CMORisation process. This ensures that each CMORised file includes metadata identifying who performed the CMORisation, allowing us to track data provenance and follow up with the responsible person if needed.
Dask support
ACCESS-MOPPy supports Dask for parallel processing, which can significantly speed up the CMORisation workflow, especially when working with large datasets. To use Dask with ACCESS-MOPPy, you can create a Dask client to manage distributed computation. This allows you to take advantage of multiple CPU cores or even a cluster of machines, depending on your setup.
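from dask.distributed import Client

# Start a local cluster with one thread per worker
client = Client(threads_per_worker=1)
# In a notebook, evaluating the client displays a summary of the cluster
client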
Data selection
The ACCESS_ESM_CMORiser class (described in detail below) takes as input a list of paths to NetCDF files containing the raw model output variables to be CMORised. The CMORiser does not assume any specific folder structure, DRS (Data Reference Syntax), or file naming convention. It is the user's responsibility to ensure that the provided files contain the original variables required for CMORisation.
This design is intentional: ACCESS-NRI plans to integrate ACCESS-MOPPy into extended workflows that leverage the [ACCESS-NRI Intake Catalog](https://github.com/ACCESS-NRI/access-nri-intake-catalog) or evaluation frameworks such as [ESMValTool](https://www.esmvaltool.org/) and [ILAMB](https://www.ilamb.org/). By decoupling file selection from the CMORiser, ACCESS-MOPPy can be flexibly used in a variety of data processing and evaluation pipelines.
import glob
files = glob.glob("../../Test_data/esm1-6/atmosphere/aiihca.pa-0961*_mon.nc")
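Because file selection is left to the user, it can help to check the result of the glob before handing it to the CMORiser. The sketch below is a minimal, hypothetical helper (select_input_files is not part of ACCESS-MOPPy) that sorts the matches and fails early when a pattern matches nothing.

```python
import glob


def select_input_files(pattern):
    """Return the sorted list of files matching a glob pattern.

    Hypothetical helper, not part of ACCESS-MOPPy: it only wraps glob.glob
    so that an empty selection is caught before CMORisation starts.
    """
    matches = sorted(glob.glob(pattern))
    if not matches:
        raise FileNotFoundError(f"No files match pattern: {pattern}")
    return matches
```

It could be used in place of the bare glob call, for example files = select_input_files("../../Test_data/esm1-6/atmosphere/aiihca.pa-0961*_mon.nc").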
Parent experiment information
In CMIP workflows, providing parent experiment information is required for proper data provenance and traceability. This metadata describes the relationship between your experiment and its parent (for example, a historical run branching from a piControl simulation), and is essential for CMIP data publication and compliance.
However, for some applications—such as when using ACCESS-MOPPy to interact with evaluation frameworks like [ESMValTool](https://www.esmvaltool.org/) or [ILAMB](https://www.ilamb.org/)—strict CMIP compliance is not always necessary. In these cases, you may choose to skip providing parent experiment information to simplify the workflow.
If you choose to skip this step, ACCESS-MOPPy will issue a warning to let you know that, if you write the output to disk, the resulting file may not be compatible with CMIP requirements for publication. This flexibility allows you to use ACCESS-MOPPy for rapid evaluation and prototyping, while still supporting full CMIP compliance when needed.
parent_experiment_config = {
"parent_experiment_id": "piControl",
"parent_activity_id": "CMIP",
"parent_source_id": "ACCESS-ESM1-5",
"parent_variant_label": "r1i1p1f1",
"parent_time_units": "days since 0001-01-01 00:00:00",
"parent_mip_era": "CMIP6",
"branch_time_in_child": 0.0,
"branch_time_in_parent": 54786.0,
"branch_method": "standard"
}
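Before passing a parent configuration to the CMORiser, a quick check for missing keys can catch typos. The following is a minimal sketch: the key list simply mirrors the example dictionary above, and treating every one of those keys as required is an assumption rather than a documented ACCESS-MOPPy rule. The helper name missing_parent_keys is hypothetical.

```python
# Keys taken from the example dictionary above; treating all of them as
# required is an assumption, not a statement about the ACCESS-MOPPy API.
EXPECTED_PARENT_KEYS = (
    "parent_experiment_id",
    "parent_activity_id",
    "parent_source_id",
    "parent_variant_label",
    "parent_time_units",
    "parent_mip_era",
    "branch_time_in_child",
    "branch_time_in_parent",
    "branch_method",
)


def missing_parent_keys(config):
    """Return the expected parent-experiment keys that are absent from config."""
    return [key for key in EXPECTED_PARENT_KEYS if key not in config]


# An incomplete configuration: only two of the expected keys are present
example = {"parent_experiment_id": "piControl", "parent_activity_id": "CMIP"}
print(missing_parent_keys(example))
```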
Set up the CMORiser for CMORisation
To begin the CMORisation process, you need to create an instance of the ACCESS_ESM_CMORiser class. This class requires several key parameters, including the list of input NetCDF files and metadata describing your experiment.
A crucial parameter is the compound_name, which should be specified using the full CMIP convention: table.variable (for example, Amon.rsds). This format uniquely identifies the variable, its frequency (e.g., monthly, daily), and the associated CMIP table, ensuring that all requirements for grids and metadata are correctly handled. Using the full compound name helps avoid ambiguity and guarantees that the CMORiser applies the correct standards for each variable.
You can also provide additional metadata such as experiment_id, source_id, variant_label, and grid_label to ensure your output is CMIP-compliant. Optionally, you may include parent experiment information for full provenance tracking.
from access_moppy import ACCESS_ESM_CMORiser
cmoriser = ACCESS_ESM_CMORiser(
input_data=files,
compound_name="Amon.rsds",
experiment_id="historical",
source_id="ACCESS-ESM1-5",
variant_label="r1i1p1f1",
grid_label="gn",
activity_id="CMIP",
cmip_version="CMIP6", # Optional, default is CMIP6
parent_info=parent_experiment_config # <-- This is optional, can be skipped if not needed
)
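To make the table.variable convention concrete, here is a small sketch that splits a CMIP6-style compound name into its two parts. The function split_compound_name is a hypothetical illustration; ACCESS-MOPPy performs its own parsing internally.

```python
def split_compound_name(compound_name):
    """Split a CMIP6-style compound name such as 'Amon.rsds' into (table, variable).

    Hypothetical helper for illustration only.
    """
    table, sep, variable = compound_name.partition(".")
    # Reject names with no dot, an empty part, or more than one dot
    if not sep or not table or not variable or "." in variable:
        raise ValueError(f"Expected 'table.variable', got: {compound_name!r}")
    return table, variable


table, variable = split_compound_name("Amon.rsds")
print(table, variable)  # prints: Amon rsds
```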
Choosing CMIP6 vs CMIP6Plus vs CMIP7
------------------------------------
From a user point of view, vocabulary selection is controlled by the ``cmip_version`` argument in ``ACCESS_ESM_CMORiser``:
- ``cmip_version="CMIP6"`` (default) uses CMIP6 controlled vocabularies
- ``cmip_version="CMIP6Plus"`` uses CMIP6Plus controlled vocabularies
- ``cmip_version="CMIP7"`` uses CMIP7 controlled vocabularies
.. code-block:: python
from access_moppy import ACCESS_ESM_CMORiser
# CMIP6 (default)
cmip6_cmoriser = ACCESS_ESM_CMORiser(
input_data=files,
compound_name="Amon.rsds",
experiment_id="historical",
source_id="ACCESS-ESM1-5",
variant_label="r1i1p1f1",
grid_label="gn",
activity_id="CMIP",
cmip_version="CMIP6",
)
# CMIP6Plus
cmip6plus_cmoriser = ACCESS_ESM_CMORiser(
input_data=files,
compound_name="Amon.rsds",
experiment_id="historical",
source_id="ACCESS-CM2",
variant_label="r1i1p1f1",
grid_label="gn",
activity_id="CMIP",
cmip_version="CMIP6Plus",
)
Use ``source_id``, ``experiment_id``, and other metadata values that exist in the selected controlled vocabulary set. CMIP6 and CMIP6Plus entries are not always identical.
Exploring Variable Mappings
ACCESS-MOPPy provides an enhanced variable mapping display that helps you understand how your raw model variables are mapped to CMIP-compliant variables. The variable_mapping attribute provides a rich, interactive display in Jupyter notebooks that shows:
Variable metadata (CF standard names, units, dimensions)
Mapping completeness and validation status
Model-specific mapping information
Easy-to-read tabular format with color coding
# Display the variable mapping with enhanced formatting
cmoriser.variable_mapping
The variable mapping display shows:
Variable Name: The CMIP variable name (e.g., rsds - surface downwelling shortwave flux)
CF Standard Name: The Climate and Forecast conventions standard name
Units: Expected units for the CMIP-compliant variable
Dimensions: How the data dimensions map between raw and CMIP formats
Model Info: Shows the ACCESS model version used for this mapping
You can also access the raw mapping data programmatically:
# Access the raw mapping dictionary if needed for programmatic use
print("Variable:", list(cmoriser.variable_mapping.keys()))
print("CF Standard Name:", cmoriser.variable_mapping['rsds']['CF standard Name'])
print("Units:", cmoriser.variable_mapping['rsds']['units'])
print("Compound name:", cmoriser.variable_mapping.compound_name)
print("Model ID:", cmoriser.variable_mapping.model_id)
The VariableMapping class both acts as a dictionary-like interface for programmatic access and provides rich visual feedback in Jupyter environments, helping users understand and validate their variable mappings before processing.
Running the CMORiser
To start the CMORisation process, simply call the run() method on your cmoriser instance as shown below. This step may take some time, especially if you are processing a large number of files.
We recommend using the [dask-labextension](https://github.com/dask/dask-labextension) with JupyterLab to monitor the progress of your computation. The extension provides a convenient dashboard to track task progress and resource usage directly within your notebook interface.
cmoriser.run()
In-memory processing with xarray and Dask
The CMORisation workflow processes data entirely in memory using xarray and Dask. This approach enables efficient parallel computation and flexible data manipulation, but requires that your system has enough memory to handle the size of your dataset.
Once the CMORisation is complete, you can access the resulting dataset by calling the to_dataset() method on your cmoriser instance. The returned object is a standard xarray dataset, which means you can slice, analyze, or further process the data using familiar xarray operations.
ds = cmoriser.to_dataset()
ds
Writing the output to a NetCDF file
To save your CMORised data to disk, use the write() method of the cmoriser instance. This will create a NetCDF file with all attributes set according to the CMIP Controlled Vocabulary, ensuring compliance with CMIP metadata standards.
After writing the file, we recommend validating it using [PrePARE](https://github.com/PCMDI/cmor/tree/master/PrePARE), a tool provided by PCMDI to check the conformity of CMIP files. PrePARE will help you identify any issues with metadata or file structure before publication or further analysis.
cmoriser.write()
CMIP7 Support with Full Branded Names
ACCESS-MOPPy also supports the new CMIP7 compound name format, which uses full branded names instead of the table.variable format used in CMIP6. CMIP7 introduces a more descriptive naming convention that includes detailed information about the data processing and grid specifications.
The CMIP7 format follows the pattern: realm.variable.operation.frequency.domain (e.g., atmos.tas.tavg-h2m-hxy-u.mon.glb)
This provides more explicit information about:
Realm: Model component (atmos, ocean, land, etc.)
Variable: The physical quantity
Operation: Temporal and spatial processing applied
Frequency: Data output frequency
Domain: Spatial domain specification
Using CMIP7 Compound Names
Here’s how to use CMIP7 compound names with ACCESS-MOPPy:
# Example: CMIP7 compound name for atmospheric temperature
cmip7_cmoriser = ACCESS_ESM_CMORiser(
input_data=files, # Can reuse the same atmospheric files
compound_name="atmos.tas.tavg-h2m-hxy-u.mon.glb", # CMIP7 full branded name
experiment_id="piControl-spinup",
source_id="pcmdi-test-1-0", # CMIP7 test source identifier
variant_label="r1i1p1f1",
grid_label="gn",
activity_id="CMIP",
parent_info=parent_experiment_config,
cmip_version="CMIP7" # Explicit CMIP7 support
)
# Display the CMIP7 variable mapping
cmip7_cmoriser.variable_mapping
CMIP6 vs CMIP7 Compound Name Comparison
The table below shows the key differences between CMIP6 and CMIP7 compound name formats:
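| Aspect | CMIP6 Format | CMIP7 Format |
|---|---|---|
| Structure | table.variable | realm.variable.operation.frequency.domain |
| Example | Amon.rsds | atmos.tas.tavg-h2m-hxy-u.mon.glb |
| Information | Table and variable only | Detailed processing and grid info |
| Length | Compact | More descriptive |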
CMIP7 Compound Name Breakdown:
atmos: Atmospheric realm
tas: Near-surface air temperature
tavg-h2m-hxy-u: Time-averaged, 2-meter height, horizontal grid, unmasked
mon: Monthly frequency
glb: Global domain
ACCESS-MOPPy automatically handles the mapping between these formats and the underlying model variables.
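The five-part pattern can be illustrated with a short parser. This is a sketch under the realm.variable.operation.frequency.domain layout described above; BrandedName and parse_cmip7_name are hypothetical names and not part of ACCESS-MOPPy.

```python
from typing import NamedTuple


class BrandedName(NamedTuple):
    realm: str
    variable: str
    operation: str
    frequency: str
    domain: str


def parse_cmip7_name(compound_name):
    """Split a CMIP7 full branded name into its five dot-separated parts."""
    parts = compound_name.split(".")
    if len(parts) != 5 or not all(parts):
        raise ValueError(f"Expected five dot-separated fields, got: {compound_name!r}")
    return BrandedName(*parts)


name = parse_cmip7_name("atmos.tas.tavg-h2m-hxy-u.mon.glb")
print(name.realm, name.variable, name.frequency, name.domain)  # atmos tas mon glb
# The operation field is itself a hyphen-separated list of labels
print(name.operation.split("-"))  # ['tavg', 'h2m', 'hxy', 'u']
```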
Batch Processing with PBS
For large-scale CMORisation workflows, ACCESS-MOPPy provides a batch processing system designed for PBS-based HPC environments like NCI Gadi. This system allows you to process multiple variables in parallel, each running as a separate PBS job with its own Dask cluster.
Configuration File
Create a YAML configuration file specifying your batch processing parameters:
# batch_config.yml
# List of variables to process
variables:
- Amon.pr
- Omon.tos
- Amon.tauu
- Amon.ts
- Omon.zos
# CMIP6 metadata
experiment_id: piControl
source_id: ACCESS-ESM1-5
variant_label: r1i1p1f1
grid_label: gn
activity_id: CMIP
# Input and output paths
input_folder: "/g/data/p73/archive/CMIP7/ACCESS-ESM1-6/spinup/JuneSpinUp-JuneSpinUp-bfaa9c5b"
output_folder: "/scratch/tm70/rb5533/moppy_output"
# File patterns for each variable (relative to input_folder)
file_patterns:
  Amon.pr: "output[0-4][0-9][0-9]/atmosphere/netCDF/*mon.nc"
  Omon.tos: "output[0-4][0-9][0-9]/ocean/ocean-2d-surface_temp-1monthly-mean*.nc"
  Amon.tauu: "output[0-4][0-9][0-9]/atmosphere/netCDF/*mon.nc"
  Amon.ts: "output[0-4][0-9][0-9]/atmosphere/netCDF/*mon.nc"
  Omon.zos: "output[0-4][0-9][0-9]/ocean/ocean-2d-sea_level-1monthly-mean*.nc"
# PBS job configuration
queue: normal
cpus_per_node: 14
mem: 32GB
jobfs: 100GB
walltime: "02:00:00"
scheduler_options: "#PBS -P tm70"
storage: "gdata/p73+gdata/tm70+scratch/tm70"
# Environment setup for each job
worker_init: |
  source /g/data/tm70/rb5533/miniforge3/bin/activate
  conda activate esmvaltool_dev
# Optional: Wait for all jobs to complete before exiting
wait_for_completion: false
Running Batch CMORisation
Submit your batch job using the command-line interface:
moppy-cmorise batch_config.yml
This command will:
Initialize a tracking database in your output directory to monitor job progress
Start a Streamlit dashboard at http://localhost:8501 for real-time monitoring
Create and submit PBS jobs for each variable in your configuration
Generate job scripts in a local cmor_job_scripts/ directory
Monitoring Progress
The batch system includes several monitoring tools:
- Streamlit Dashboard
A web-based dashboard automatically starts at http://localhost:8501, showing:
Real-time status of all CMORisation tasks
Progress tracking (pending, running, completed, failed)
Filtering options by status and experiment
Task completion times and error logs
- Terminal Dashboard (``moppy-tui``)
A rich-based terminal alternative to the Streamlit dashboard, useful when opening a browser to a Gadi login node is awkward. It reads the same tracking database, supports interactive paging (j/k, n/p, g/G, arrow / PgUp / PgDn keys), and offers --once / --json modes for scripts and cron jobs.
On Gadi, moppy-tui comes preinstalled with the conda/analysis3 module:
module use /g/data/xp65/public/modules
module load conda/analysis3
moppy-tui --db <output_folder>/cmor_tasks.db
Elsewhere, install the optional extra:
pip install "access_moppy[tui]"
See Batch Processing Guide for the full feature list.
- Command Line Monitoring
Monitor PBS jobs directly:
# Check job status
qstat -u $USER
# Monitor specific jobs (job IDs provided by moppy-cmorise)
qstat 12345678 12345679 12345680
- Database Tracking
The system maintains an SQLite database at {output_folder}/cmor_tasks.db that tracks:
Task status for each variable
Start and completion times
Error messages for failed tasks
Experiment metadata
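Since the tracking database is a regular SQLite file, it can be inspected with Python's standard sqlite3 module. The table and column names are not documented in this guide, so the sketch below assumes nothing about the schema and only lists the tables present; list_tables is a hypothetical helper.

```python
import sqlite3


def list_tables(db_path):
    """Return the names of the tables in an SQLite database file."""
    # Open read-only so that inspecting the file cannot modify it
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

Calling list_tables on {output_folder}/cmor_tasks.db gives a starting point for writing your own queries against whatever tables the batch system created.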
File Organization
The batch system organizes files as follows:
your_work_directory/
├── batch_config.yml              # Your configuration file
├── cmor_job_scripts/             # Generated PBS and Python scripts
│   ├── cmor_Amon_pr.sh           # PBS script for Amon.pr
│   ├── cmor_Amon_pr.py           # Python script for Amon.pr
│   ├── cmor_Amon_pr.out          # Job stdout
│   ├── cmor_Amon_pr.err          # Job stderr
│   └── ...
└── output_folder/                # Your specified output directory
    ├── cmor_tasks.db             # Progress tracking database
    └── CMIP6/                    # CMORised output files (if drs_root specified)
        └── CMIP/
            └── ACCESS-NRI/
                └── ACCESS-ESM1-5/
                    └── ...
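When chasing a failed job, it is handy to go from a compound name to its log files. The sketch below assumes the cmor_<table>_<variable> naming seen in the layout above (cmor_Amon_pr.*); that pattern is inferred from a single example and is not a documented guarantee. The helper job_files is hypothetical.

```python
from pathlib import Path


def job_files(compound_name, script_dir="cmor_job_scripts"):
    """Guess the per-variable job file paths from a compound name.

    Assumes the cmor_<table>_<variable> pattern shown in the layout above.
    """
    stem = "cmor_" + compound_name.replace(".", "_")
    return {
        ext: Path(script_dir) / f"{stem}.{ext}"
        for ext in ("sh", "py", "out", "err")
    }


# Path of the stderr log that would be expected for Amon.pr
print(job_files("Amon.pr")["err"])
```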
Configuration Options
Required Parameters:
variables: List of variables to process (format: table.variable)
experiment_id, source_id, variant_label, grid_label: CMIP6 metadata
input_folder: Root directory containing input files
output_folder: Directory for CMORised output
File Pattern Mapping:
file_patterns: Dictionary mapping variables to glob patterns (relative to input_folder)
PBS Configuration:
queue: PBS queue name (default: "normal")
cpus_per_node: Number of CPUs per job (default: 4)
mem: Memory per job (default: "16GB")
jobfs: Local scratch space (optional)
walltime: Job time limit (default: "01:00:00")
scheduler_options: Additional PBS directives
storage: Required storage systems
Environment Setup:
worker_init: Shell commands to set up the environment in each job
Optional Parameters:
activity_id: CMIP activity (default: derived from experiment)
drs_root: Enable CMIP6 DRS directory structure
wait_for_completion: Wait for all jobs before exiting (default: false)
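A configuration can be sanity-checked before any jobs are submitted. The sketch below works on a plain dictionary (for example, one loaded from the YAML file) and reports missing required parameters and variables without a file pattern. Whether moppy-cmorise performs the same checks is not stated here, and requiring one pattern per variable is an assumption; check_batch_config is a hypothetical helper.

```python
REQUIRED_KEYS = (
    "variables",
    "experiment_id",
    "source_id",
    "variant_label",
    "grid_label",
    "input_folder",
    "output_folder",
)


def check_batch_config(config):
    """Return a list of human-readable problems found in a batch config dict."""
    problems = [
        f"missing required key: {key}" for key in REQUIRED_KEYS if key not in config
    ]
    patterns = config.get("file_patterns") or {}
    for variable in config.get("variables") or []:
        if variable not in patterns:
            problems.append(f"no file pattern for variable: {variable}")
    return problems


# Deliberately incomplete example: no output_folder, no pattern for Omon.tos
config = {
    "variables": ["Amon.pr", "Omon.tos"],
    "experiment_id": "piControl",
    "source_id": "ACCESS-ESM1-5",
    "variant_label": "r1i1p1f1",
    "grid_label": "gn",
    "input_folder": "/path/to/input",
    "file_patterns": {"Amon.pr": "output*/atmosphere/netCDF/*mon.nc"},
}
for problem in check_batch_config(config):
    print(problem)
```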
Best Practices
- Resource Planning:
Use jobfs for large datasets to improve I/O performance
Adjust cpus_per_node and mem based on your data size
Set appropriate walltime based on dataset complexity
- File Access:
Ensure input_folder and output_folder are on shared filesystems
Use relative paths in file_patterns for portability
Test file patterns with a small subset first
- Error Handling:
Monitor the dashboard for failed jobs
Check job stderr files in cmor_job_scripts/ for detailed error messages
Failed jobs can be resubmitted by running moppy-cmorise again
Example PBS Configuration for NCI Gadi:
queue: normal
cpus_per_node: 16
mem: 64GB
jobfs: 200GB
walltime: "04:00:00"
scheduler_options: "#PBS -P tm70"
storage: "gdata/p73+gdata/tm70+scratch/tm70"
worker_init: |
  module load netcdf/4.7.4
  source /g/data/tm70/rb5533/miniforge3/bin/activate
  conda activate esmvaltool_dev
Troubleshooting
Common Issues:
Database access errors: Ensure output_folder is on a shared filesystem accessible from compute nodes
File not found errors: Verify file_patterns match your actual file structure using ls or find
Memory errors: Increase mem or reduce cpus_per_node for memory-intensive variables
Environment errors: Check worker_init commands work on compute nodes
Permission errors: Ensure write access to output_folder and job script directory
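As an alternative to ls or find, file patterns can be checked from Python before submitting jobs. This is a minimal sketch using pathlib; count_matches is a hypothetical helper, and it assumes the patterns are relative to input_folder as in the configuration file.

```python
from pathlib import Path


def count_matches(input_folder, file_patterns):
    """Return {variable: number of files matching its pattern under input_folder}."""
    root = Path(input_folder)
    return {
        variable: sum(1 for path in root.glob(pattern) if path.is_file())
        for variable, pattern in file_patterns.items()
    }
```

A count of zero for a variable points to a pattern that does not match the actual directory layout.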
For advanced usage and troubleshooting, see the example configuration at src/access_moppy/examples/batch_config.yml