Getting Started with AI-toolkit

The Jupyter Notebooks typically follow the same structure of five major sections, each supported by markdown cells, comments and plots:

  • Import necessary libraries and mltools
  • Load data (s3 object store)
    • Initialize data mask
  • Assign train/test split
    • Random sampling
    • Block sampling
  • Model set-up (linear regression with 1 node / shallow neural network)
  • Model training and testing
    • Iteration over chunks of the datacubes and subsequent sampling
    • Preprocessing (filtering NaNs, standardization, normalization)
    • Get train/test data
    • Generate training batches using existing data loading and transformation mechanisms from Keras and PyTorch (DataGenerator, DataLoader)
    • Train model, return error and loss
    • Evaluate model
    • Plot results
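The training and testing steps listed above can be sketched in plain NumPy. This is a minimal illustration, not the toolkit's implementation: it uses neither mltools, Keras nor PyTorch, the data cube is synthetic, and the helper names (`drop_nans`, `batches`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for two data cube variables (time, lat, lon); not real data.
x_cube = rng.normal(15.0, 5.0, size=(4, 20, 30))
y_cube = 2.0 * x_cube + 1.0 + rng.normal(0.0, 0.1, size=x_cube.shape)
x_cube[0, :5, :5] = np.nan  # simulate masked / missing pixels


def drop_nans(x, y):
    """Flatten both variables and keep only samples valid in both (hypothetical helper)."""
    x, y = x.ravel(), y.ravel()
    valid = ~(np.isnan(x) | np.isnan(y))
    return x[valid], y[valid]


def batches(x, y, batch_size, rng):
    """Yield shuffled mini-batches (stand-in for a DataLoader / DataGenerator)."""
    order = rng.permutation(x.size)
    for start in range(0, x.size, batch_size):
        idx = order[start:start + batch_size]
        yield x[idx], y[idx]


# Preprocessing: filter NaNs, random train/test split, standardization.
x_all, y_all = drop_nans(x_cube, y_cube)
train = rng.random(x_all.size) < 0.8           # True = train, False = test
mu, sigma = x_all[train].mean(), x_all[train].std()
x_std = (x_all - mu) / sigma                   # statistics from training data only
x_train, y_train = x_std[train], y_all[train]
x_test, y_test = x_std[~train], y_all[~train]

# Model: linear regression with a single node (one weight, one bias).
w, b, lr = 0.0, 0.0, 0.1
losses = [float(np.mean((w * x_train + b - y_train) ** 2))]  # loss before training
for epoch in range(5):
    for xb, yb in batches(x_train, y_train, 64, rng):
        err = w * xb + b - yb
        w -= lr * 2.0 * np.mean(err * xb)      # gradient of the MSE w.r.t. w
        b -= lr * 2.0 * np.mean(err)           # gradient of the MSE w.r.t. b
    losses.append(float(np.mean((w * x_train + b - y_train) ** 2)))

# Evaluation on the held-out test samples.
test_rmse = float(np.sqrt(np.mean((w * x_test + b - y_test) ** 2)))
print(f"final train loss={losses[-1]:.4f}  test RMSE={test_rmse:.4f}")
```

In the actual notebooks, the batch generator and the model would be replaced by the corresponding scikit-learn, PyTorch or TensorFlow components.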

Machine learning in the Earth system context must respect the basic principles of geo-data and go well beyond naive applications of standard methods. Neighbouring points in a datacube are often strongly correlated, so purely random sampling can place nearly identical samples in both the training and the test set. To avoid this auto-correlation during the training phase of the model, data sampling is preferably guided by a block sampling strategy. Data blocks are rectangular chunks of datacubes that vary in size and in number of data points.
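As a rough illustration of block sampling, the sketch below assigns whole rectangular blocks of a (lat, lon) grid to either the training or the test set. The function name `block_split_mask` and its parameters are hypothetical and are not part of mltools; the toolkit's own block sampling may differ in detail.

```python
import numpy as np


def block_split_mask(shape, block_size, train_fraction, seed=0):
    """Return a boolean (lat, lon) mask: True = train, False = test.

    Every rectangular block is assigned as a whole to one of the two sets,
    so neighbouring pixels inside a block never end up on both sides.
    """
    rng = np.random.default_rng(seed)
    n_lat, n_lon = shape
    b_lat, b_lon = block_size
    # Number of blocks per axis (ceiling division so edge blocks are covered).
    nb_lat = -(-n_lat // b_lat)
    nb_lon = -(-n_lon // b_lon)
    block_is_train = rng.random((nb_lat, nb_lon)) < train_fraction
    # Expand block-level decisions to pixel level, then crop to the grid size.
    mask = np.repeat(np.repeat(block_is_train, b_lat, axis=0), b_lon, axis=1)
    return mask[:n_lat, :n_lon]


mask = block_split_mask(shape=(20, 30), block_size=(5, 10), train_fraction=0.8)
print(mask.shape, f"train share = {mask.mean():.2f}")
```

Such a mask can be broadcast along the time axis of a datacube to select training and test samples. With few blocks, the realized train share can deviate noticeably from `train_fraction`.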

The workflow is implemented (Credit: DeepESDL) for three Python-based machine learning libraries (scikit-learn, PyTorch, TensorFlow) based on a generic use case and will be provided as Jupyter Notebooks.

Machine Learning workflow on Analysis Ready Data Cubes (Credit: DeepESDL)