Getting Started
===============

Welcome to the ACCESS-MOPPy Getting Started guide!

This section will walk you through the initial setup and basic usage of ACCESS-MOPPy, a tool designed to post-process ACCESS model output and produce CMIP-compliant datasets. You’ll learn how to configure your environment, prepare your data, and run the CMORisation workflow using both the Python API and Dask for scalable processing.

.. contents:: Table of Contents
   :local:
   :depth: 2

Set up configuration
--------------------

When you first import `access_moppy` in a Python environment, the package will automatically create a `user.yml` file in your home directory (`~/.moppy/user.yml`).
During this initial setup, you will be prompted to provide some basic information, including:
- Your name
- Your email address
- Your work organization
- Your ORCID

This information is stored in `user.yml` and will be used as global attributes in the files generated during the CMORisation process. This ensures that each CMORised file includes metadata identifying who performed the CMORisation, allowing us to track data provenance and follow up with the responsible person if needed.

Dask support
------------

ACCESS-MOPPy supports Dask for parallel processing, which can significantly speed up the CMORisation workflow, especially when working with large datasets. To use Dask with ACCESS-MOPPy, you can create a Dask client to manage distributed computation. This allows you to take advantage of multiple CPU cores or even a cluster of machines, depending on your setup.

.. code-block:: python

   import dask.distributed as dask
   client = dask.Client(threads_per_worker=1)
   client

Data selection
--------------

``ACCESS_ESM_CMORiser`` offers two ways to supply input data:

**Option 1 — Auto-discovery (recommended):** pass the root directory of your
payu archive via ``input_folder``.  ACCESS-MOPPy reads the variable's mapping
entry and the model's ``file_discovery`` configuration to locate the relevant
files automatically — no manual ``glob`` required.

.. code-block:: python

   # Point to the archive root; ACCESS-MOPPy finds the files for you.
   archive_root = "/g/data/p73/archive/CMIP7/ACCESS-ESM1-6/historical/MyRun"

Optionally, restrict the search to a specific time window:

.. code-block:: python

   # Only files within 1900–1950 will be included.
   archive_root = "/g/data/p73/archive/CMIP7/ACCESS-ESM1-6/historical/MyRun"
   start_year, end_year = 1900, 1950

**Option 2 — Explicit file list:** collect the files yourself and pass them
via ``input_data``.  This is useful when the auto-discovery pattern does not
match your archive layout, or when you want fine-grained control.

.. code-block:: python

   import glob
   files = glob.glob("../../Test_data/esm1-6/atmosphere/aiihca.pa-0961*_mon.nc")

Parent experiment information
-----------------------------

In CMIP workflows, providing parent experiment information is required for proper data provenance and traceability. This metadata describes the relationship between your experiment and its parent (for example, a historical run branching from a piControl simulation), and is essential for CMIP data publication and compliance.

However, for some applications—such as when using ACCESS-MOPPy to interact with evaluation frameworks like [ESMValTool](https://www.esmvaltool.org/) or [ILAMB](https://www.ilamb.org/)—strict CMIP compliance is not always necessary. In these cases, you may choose to skip providing parent experiment information to simplify the workflow.

If you choose to skip this step, ACCESS-MOPPy will issue a warning to let you know that, if you write the output to disk, the resulting file may not be compatible with CMIP requirements for publication. This flexibility allows you to use ACCESS-MOPPy for rapid evaluation and prototyping, while still supporting full CMIP compliance when needed.

.. code-block:: python

   parent_experiment_config = {
       "parent_experiment_id": "piControl",
       "parent_activity_id": "CMIP",
       "parent_source_id": "ACCESS-ESM1-5",
       "parent_variant_label": "r1i1p1f1",
       "parent_time_units": "days since 0001-01-01 00:00:00",
       "parent_mip_era": "CMIP6",
       "branch_time_in_child": 0.0,
       "branch_time_in_parent": 54786.0,
       "branch_method": "standard"
   }

Set up the CMORiser for CMORisation
------------------------------------

To begin the CMORisation process, you need to create an instance of the `ACCESS_ESM_CMORiser` class. This class requires several key parameters, including the input data and metadata describing your experiment.

A crucial parameter is the `compound_name`, which should be specified using the full CMIP convention: `table.variable` (for example, `Amon.rsds`). This format uniquely identifies the variable, its frequency (e.g., monthly, daily), and the associated CMIP table, ensuring that all requirements for grids and metadata are correctly handled. Using the full compound name helps avoid ambiguity and guarantees that the CMORiser applies the correct standards for each variable.

You can also provide additional metadata such as `experiment_id`, `source_id`, `variant_label`, and `grid_label` to ensure your output is CMIP-compliant. Optionally, you may include parent experiment information for full provenance tracking.

Using ``input_folder`` (auto-discovery):

.. code-block:: python

   from access_moppy import ACCESS_ESM_CMORiser

   cmoriser = ACCESS_ESM_CMORiser(
       input_folder=archive_root,   # archive root — files discovered automatically
       start_year=1900,             # optional: restrict to 1900–1950
       end_year=1950,
       compound_name="Amon.rsds",
       experiment_id="historical",
       source_id="ACCESS-ESM1-5",
       variant_label="r1i1p1f1",
       grid_label="gn",
       activity_id="CMIP",
       cmip_version="CMIP6",        # Optional, default is CMIP6
       parent_info=parent_experiment_config,  # Optional
   )

Or, using an explicit file list (``input_data``):

.. code-block:: python

   from access_moppy import ACCESS_ESM_CMORiser

   cmoriser = ACCESS_ESM_CMORiser(
       input_data=files,
       compound_name="Amon.rsds",
       experiment_id="historical",
       source_id="ACCESS-ESM1-5",
       variant_label="r1i1p1f1",
       grid_label="gn",
       activity_id="CMIP",
       cmip_version="CMIP6",        # Optional, default is CMIP6
       parent_info=parent_experiment_config,  # Optional
   )

Choosing CMIP6 vs CMIP6Plus vs CMIP7
-------------------------------------

From a user point of view, vocabulary selection is controlled by the ``cmip_version`` argument in ``ACCESS_ESM_CMORiser``:

- ``cmip_version="CMIP6"`` (default) uses CMIP6 controlled vocabularies
- ``cmip_version="CMIP6Plus"`` uses CMIP6Plus controlled vocabularies
- ``cmip_version="CMIP7"`` uses CMIP7 controlled vocabularies

.. code-block:: python

   from access_moppy import ACCESS_ESM_CMORiser

   # CMIP6 (default)
   cmip6_cmoriser = ACCESS_ESM_CMORiser(
       input_data=files,
       compound_name="Amon.rsds",
       experiment_id="historical",
       source_id="ACCESS-ESM1-5",
       variant_label="r1i1p1f1",
       grid_label="gn",
       activity_id="CMIP",
       cmip_version="CMIP6",
   )

   # CMIP6Plus
   cmip6plus_cmoriser = ACCESS_ESM_CMORiser(
       input_data=files,
       compound_name="Amon.rsds",
       experiment_id="historical",
       source_id="ACCESS-CM2",
       variant_label="r1i1p1f1",
       grid_label="gn",
       activity_id="CMIP",
       cmip_version="CMIP6Plus",
   )

Use ``source_id``, ``experiment_id``, and other metadata values that exist in the selected controlled vocabulary set. CMIP6 and CMIP6Plus entries are not always identical.

Exploring Variable Mappings
---------------------------

ACCESS-MOPPy provides an enhanced variable mapping display that helps you understand how your raw model variables are mapped to CMIP-compliant variables. The `variable_mapping` attribute provides a rich, interactive display in Jupyter notebooks that shows:

- Variable metadata (CF standard names, units, dimensions)
- Mapping completeness and validation status
- Model-specific mapping information
- Easy-to-read tabular format with color coding

.. code-block:: python

   # Display the variable mapping with enhanced formatting
   cmoriser.variable_mapping

The variable mapping display shows:

- **Variable Name**: The CMIP variable name (e.g., rsds - surface downwelling shortwave flux)
- **CF Standard Name**: The Climate and Forecast conventions standard name
- **Units**: Expected units for the CMIP-compliant variable
- **Dimensions**: How the data dimensions map between raw and CMIP formats
- **Model Info**: Shows the ACCESS model version used for this mapping

You can also access the raw mapping data programmatically:

.. code-block:: python

   # Access the raw mapping dictionary if needed for programmatic use
   print("Variable:", list(cmoriser.variable_mapping.keys()))
   print("CF Standard Name:", cmoriser.variable_mapping['rsds']['CF standard Name'])
   print("Units:", cmoriser.variable_mapping['rsds']['units'])
   print("Compound name:", cmoriser.variable_mapping.compound_name)
   print("Model ID:", cmoriser.variable_mapping.model_id)

The VariableMapping class acts as both a dictionary-like interface for programmatic access and provides rich visual feedback in Jupyter environments to help users understand and validate their variable mappings before processing.

Running the CMORiser
--------------------

To start the CMORisation process, simply call the `run()` method on your `cmoriser` instance as shown below. This step may take some time, especially if you are processing a large number of files.

We recommend using the [dask-labextension](https://github.com/dask/dask-labextension) with JupyterLab to monitor the progress of your computation. The extension provides a convenient dashboard to track task progress and resource usage directly within your notebook interface.

.. code-block:: python

   cmoriser.run()

In-memory processing with xarray and Dask
-----------------------------------------

The CMORisation workflow processes data entirely in memory using `xarray` and Dask. This approach enables efficient parallel computation and flexible data manipulation, but requires that your system has enough memory to handle the size of your dataset.

Once the CMORisation is complete, you can access the resulting dataset by calling the `to_dataset()` method on your `cmoriser` instance. The returned object is a standard xarray dataset, which means you can slice, analyze, or further process the data using familiar xarray operations.

.. code-block:: python

   ds = cmoriser.to_dataset()
   ds

Writing the output to a NetCDF file
-----------------------------------

To save your CMORised data to disk, use the `write()` method of the `cmoriser` instance. This will create a NetCDF file with all attributes set according to the CMIP Controlled Vocabulary, ensuring compliance with CMIP metadata standards.

After writing the file, we recommend validating it using [PrePARE](https://github.com/PCMDI/cmor/tree/master/PrePARE), a tool provided by PCMDI to check the conformity of CMIP files. PrePARE will help you identify any issues with metadata or file structure before publication or further analysis.

.. code-block:: python

   cmoriser.write()

CMIP7 Support with Full Branded Names
======================================

ACCESS-MOPPy also supports the new CMIP7 compound name format, which uses full branded names instead of the table.variable format used in CMIP6. CMIP7 introduces a more descriptive naming convention that includes detailed information about the data processing and grid specifications.

The CMIP7 format follows the pattern: ``realm.variable.operation.frequency.domain`` (e.g., ``atmos.rsds.tavg-u-hxy-u.mon.GLB``)

This provides more explicit information about:

- **Realm**: Model component (atmos, ocean, land, etc.)
- **Variable**: The physical quantity
- **Operation**: Temporal and spatial processing applied
- **Frequency**: Data output frequency
- **Domain**: Spatial domain specification

Using CMIP7 Compound Names
---------------------------

Here's how to use CMIP7 compound names with ACCESS-MOPPy:

.. code-block:: python

   # Example: CMIP7 compound name for surface downwelling shortwave radiation
   cmip7_cmoriser = ACCESS_ESM_CMORiser(
       input_data=files,  # Can reuse the same atmospheric files
       compound_name="atmos.rsds.tavg-u-hxy-u.mon.GLB",  # CMIP7 full branded name
       experiment_id="piControl-spinup",
       source_id="ACCESS-ESM1-6",
       variant_label="r1i1p1f1",
       grid_label="gn",
       activity_id="CMIP",
       cmip_version="CMIP7"  # Explicit CMIP7 support
   )

   # Display the CMIP7 variable mapping
   cmip7_cmoriser.variable_mapping

CMIP6 vs CMIP7 Compound Name Comparison
----------------------------------------

The table below shows the key differences between CMIP6 and CMIP7 compound name formats:

.. list-table:: CMIP6 vs CMIP7 Compound Names
   :widths: 25 35 40
   :header-rows: 1

   * - Aspect
     - CMIP6 Format
     - CMIP7 Format
   * - **Structure**
     - ``table.variable``
     - ``realm.variable.operation.frequency.domain``
   * - **Example**
     - ``Amon.tas``
     - ``atmos.rsds.tavg-u-hxy-u.mon.GLB``
   * - **Information**
     - Table and variable only
     - Detailed processing and grid info
   * - **Length**
     - Compact
     - More descriptive

**CMIP7 Compound Name Breakdown:**

- ``atmos``: Atmospheric realm
- ``tas``: Near-surface air temperature
- ``tavg-h2m-hxy-u``: Time-averaged, 2-meter height, horizontal grid, unstructured
- ``mon``: Monthly frequency
- ``glb``: Global domain

ACCESS-MOPPy automatically handles the mapping between these formats and the underlying model variables.

----

Batch Processing with PBS
=========================

For large-scale CMORisation workflows, ACCESS-MOPPy provides a batch processing system designed for PBS-based HPC environments like NCI Gadi. This system allows you to process multiple variables in parallel, each running as a separate PBS job with its own Dask cluster.

Configuration File
------------------

Create a YAML configuration file specifying your batch processing parameters:

.. code-block:: yaml

   # batch_config.yml
   # List of variables to process
   variables:
     - Amon.pr
     - Omon.tos
     - Amon.tauu
     - Amon.ts
     - Omon.zos

   # CMIP7 metadata
   experiment_id: piControl
   source_id: ACCESS-ESM1-6
   variant_label: r1i1p1f1
   grid_label: gn
   activity_id: CMIP
   cmip_version: CMIP7

   # Input and output paths
   input_folder: "/g/data/p73/archive/CMIP7/ACCESS-ESM1-6/spinup/JuneSpinUp-JuneSpinUp-bfaa9c5b"
   output_folder: "/scratch/tm70/rb5533/moppy_output"

   # File patterns for each variable (relative to input_folder)
   file_patterns:
     Amon.pr: "output[0-4][0-9][0-9]/atmosphere/netCDF/*mon.nc"
     Omon.tos: "output[0-4][0-9][0-9]/ocean/ocean-2d-surface_temp-1monthly-mean*.nc"
     Amon.tauu: "output[0-4][0-9][0-9]/atmosphere/netCDF/*mon.nc"
     Amon.ts: "output[0-4][0-9][0-9]/atmosphere/netCDF/*mon.nc"
     Omon.zos: "output[0-4][0-9][0-9]/ocean/ocean-2d-sea_level-1monthly-mean*.nc"

   # PBS job configuration
   queue: normal
   cpus_per_node: 14
   mem: 32GB
   jobfs: 100GB
   walltime: "02:00:00"
   scheduler_options: "#PBS -P tm70"
   storage: "gdata/p73+gdata/tm70+scratch/tm70"

   # Environment setup for each job
   worker_init: |
     source /g/data/tm70/rb5533/miniforge3/bin/activate
     conda activate esmvaltool_dev

   # Optional: Wait for all jobs to complete before exiting
   wait_for_completion: false

Running Batch CMORisation
--------------------------

Submit your batch job using the command-line interface:

.. code-block:: bash

   moppy-cmorise batch_config.yml

This command will:

1. **Initialize a tracking database** in your output directory to monitor job progress
2. **Start a Streamlit dashboard** at http://localhost:8501 for real-time monitoring
3. **Create and submit PBS jobs** for each variable in your configuration
4. **Generate job scripts** in a local `cmor_job_scripts/` directory

Monitoring Progress
-------------------

The batch system includes several monitoring tools:

**Streamlit Dashboard**
   A web-based dashboard automatically starts at http://localhost:8501, showing:

   - Real-time status of all CMORisation tasks
   - Progress tracking (pending, running, completed, failed)
   - Filtering options by status and experiment
   - Task completion times and error logs

**Terminal Dashboard (``moppy-tui``)**
   A ``rich``-based terminal alternative to the Streamlit dashboard, useful
   when opening a browser to a Gadi login node is awkward.  Reads the same
   tracking database, supports interactive paging (``j/k``, ``n/p``,
   ``g/G``, arrow / PgUp / PgDn keys), and offers ``--once`` / ``--json``
   modes for scripts and cron jobs.

   On Gadi, ``moppy-tui`` comes preinstalled with the
   ``conda/analysis3`` module:

   .. code-block:: bash

      module use /g/data/xp65/public/modules
      module load conda/analysis3
      moppy-tui --db <output_folder>/cmor_tasks.db

   Elsewhere, install the optional extra: ``pip install "access_moppy[tui]"``.

   See :doc:`batch_processing` for the full feature list.

**Command Line Monitoring**
   Monitor PBS jobs directly:

   .. code-block:: bash

      # Check job status
      qstat -u $USER

      # Monitor specific jobs (job IDs provided by moppy-cmorise)
      qstat 12345678 12345679 12345680

**Database Tracking**
   The system maintains an SQLite database at `{output_folder}/cmor_tasks.db` that tracks:

   - Task status for each variable
   - Start and completion times
   - Error messages for failed tasks
   - PBS job IDs and experiment metadata

   At monitor finalisation it also writes
   `{output_folder}/moppy_batch_report.json`, a schema-versioned JSON export
   of the tracker state for after-the-fact completion checks and provenance.
   When available, the report includes filtered structured PBS metadata for
   each sub-job, including exit status, queue/project, requested resources,
   and resources used.
   You can regenerate this report later with:

   .. code-block:: bash

      moppy-batch-report --db <output_folder>/cmor_tasks.db

File Organization
-----------------

The batch system organizes files as follows:

.. code-block:: text

   your_work_directory/
   ├── batch_config.yml                    # Your configuration file
   ├── cmor_job_scripts/                   # Generated PBS and Python scripts
   │   ├── cmor_Amon_pr.sh                 # PBS script for Amon.pr
   │   ├── cmor_Amon_pr.py                 # Python script for Amon.pr
   │   ├── cmor_Amon_pr.out                # Job stdout
   │   ├── cmor_Amon_pr.err                # Job stderr
   │   └── ...
   └── output_folder/                      # Your specified output directory
       ├── cmor_tasks.db                   # Progress tracking database
       ├── moppy_batch_report.json         # Final JSON coordination report
       └── CMIP6/                          # CMORised output files (if drs_root specified)
           └── CMIP/
               └── ACCESS-NRI/
                   └── ACCESS-ESM1-5/
                       └── ...

Configuration Options
----------------------

**Required Parameters:**

- ``variables``: List of variables to process (format: ``table.variable``)
- ``experiment_id``, ``source_id``, ``variant_label``, ``grid_label``: CMIP metadata
- ``input_folder``: Root directory containing input files
- ``output_folder``: Directory for CMORised output

**File Pattern Mapping:**

- ``file_patterns``: Dictionary mapping variables to glob patterns (relative to ``input_folder``)

**PBS Configuration:**

- ``queue``: PBS queue name (default: "normal")
- ``cpus_per_node``: Number of CPUs per job (default: 4)
- ``mem``: Memory per job (default: "16GB")
- ``jobfs``: Local scratch space (optional)
- ``walltime``: Job time limit (default: "01:00:00")
- ``scheduler_options``: Additional PBS directives
- ``storage``: Required storage systems

**Environment Setup:**

- ``worker_init``: Shell commands to set up the environment in each job

**Optional Parameters:**

- ``activity_id``: CMIP activity (default: derived from experiment)
- ``drs_root``: Enable CMIP6 DRS directory structure
- ``wait_for_completion``: Wait for all jobs before exiting (default: false)

Best Practices
--------------

**Resource Planning:**
   - Use ``jobfs`` for large datasets to improve I/O performance
   - Adjust ``cpus_per_node`` and ``mem`` based on your data size
   - Set appropriate ``walltime`` based on dataset complexity

**File Access:**
   - Ensure ``input_folder`` and ``output_folder`` are on shared filesystems
   - Use relative paths in ``file_patterns`` for portability
   - Test file patterns with a small subset first

**Error Handling:**
   - Monitor the dashboard for failed jobs
   - Check job stderr files in ``cmor_job_scripts/`` for detailed error messages
   - Failed jobs can be resubmitted by running ``moppy-cmorise`` again

**Example PBS Configuration for NCI Gadi:**

.. code-block:: yaml

   queue: normal
   cpus_per_node: 16
   mem: 64GB
   jobfs: 200GB
   walltime: "04:00:00"
   scheduler_options: "#PBS -P tm70"
   storage: "gdata/p73+gdata/tm70+scratch/tm70"
   worker_init: |
     module load netcdf/4.7.4
     source /g/data/tm70/rb5533/miniforge3/bin/activate
     conda activate esmvaltool_dev

Troubleshooting
---------------

**Common Issues:**

1. **Database access errors**: Ensure ``output_folder`` is on a shared filesystem accessible from compute nodes

2. **File not found errors**: Verify ``file_patterns`` match your actual file structure using ``ls`` or ``find``

3. **Memory errors**: Increase ``mem`` or reduce ``cpus_per_node`` for memory-intensive variables

4. **Environment errors**: Check ``worker_init`` commands work on compute nodes

5. **Permission errors**: Ensure write access to ``output_folder`` and job script directory

For advanced usage and troubleshooting, see the example configuration at ``src/access_moppy/examples/batch_config.yml``