New discoveries #20

SatlasPretrain dataset, Data-Centric Land Cover Classification Challenge, Map-sat, How to deploy an ML model to Amazon SageMaker, From edge detection to deep learning & TorchGeo v0.5.0

Oct 03, 2023

Welcome to the 20th edition of the newsletter. I'm delighted to share that the newsletter now has almost 7k subscribers 🔥 This month's newsletter features the new SatlasPretrain dataset, designed for training foundational geospatial models. Building on last month's focus on the Prithvi-100M model, there's growing momentum in developing foundational datasets and models, hinting at far-reaching consequences for the field. This newsletter also highlights a generative AI paper, signalling yet another ascending trend with notable implications for the domain. Overall, I anticipate that 2023 could be a transformative year for deep learning with remote sensing imagery.

SatlasPretrain dataset

SatlasPretrain is a large-scale pre-training dataset which consists of Sentinel-2 images, NAIP images, corresponding labels, and metadata. It includes 302M labels under 137 categories and seven label types: points, polygons, polylines, properties, segmentation labels, regression labels, and classification labels. Rather than being tied to individual remote sensing images, the labels are associated with geographic coordinates (i.e., longitude and latitude positions) and time ranges. This enables methods to make predictions from multiple images across time, as well as leverage long-range spatial context from neighbouring images.
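To make the idea of geographically and temporally indexed labels concrete, here is a minimal sketch. The `GeoLabel` and `ImageTile` records and the `images_for_label` helper are hypothetical names invented for illustration; they are not the SatlasPretrain schema or API. The sketch only shows why a label keyed on position and time range can be paired with several images rather than one.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class GeoLabel:
    """Hypothetical label record: a category tied to a place and a time range."""
    category: str
    lon: float
    lat: float
    start: datetime
    end: datetime


@dataclass(frozen=True)
class ImageTile:
    """Hypothetical image footprint: a lon/lat bounding box and a capture time."""
    name: str
    min_lon: float
    min_lat: float
    max_lon: float
    max_lat: float
    captured: datetime


def images_for_label(label, tiles):
    """Return every tile that covers the label's position within its time range."""
    return [
        t for t in tiles
        if t.min_lon <= label.lon <= t.max_lon
        and t.min_lat <= label.lat <= t.max_lat
        and label.start <= t.captured <= label.end
    ]


label = GeoLabel("wind_turbine", -3.2, 55.9, datetime(2022, 1, 1), datetime(2022, 12, 31))
tiles = [
    ImageTile("s2_jan", -4.0, 55.0, -3.0, 56.0, datetime(2022, 1, 15)),
    ImageTile("s2_jul", -4.0, 55.0, -3.0, 56.0, datetime(2022, 7, 15)),
    ImageTile("s2_old", -4.0, 55.0, -3.0, 56.0, datetime(2020, 7, 15)),
    ImageTile("s2_far", 10.0, 45.0, 11.0, 46.0, datetime(2022, 7, 15)),
]
matches = images_for_label(label, tiles)
print([t.name for t in matches])  # ['s2_jan', 's2_jul']
```

Because the lookup goes from label to images, one label can supervise a whole time series of images over the same location.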

Utilising SatlasPretrain for pre-training enhances average performance on seven downstream tasks by 18% over ImageNet and 6% over DOTA and iSAID. Weights for models based on SatlasNet, pre-trained on SatlasPretrain, are publicly available. Satlas, the platform, currently hosts four geospatial data products (wind turbines, solar farms, offshore platforms, and tree cover), which are generated using these pre-trained weights.

The comprehensive nature of the SatlasPretrain dataset sets a new standard for pre-training datasets in the geospatial domain. What further distinguishes it is the geospatial and temporal indexing of these labels, which enables the training of models capable of making predictions that consider long-range spatial context and changes over time. I look forward to seeing what innovative approaches are enabled by the availability of this dataset.

Data-Centric Land Cover Classification Challenge

The data-centric land cover classification challenge, part of the Workshop on Machine Vision for Earth Observation and Environment Monitoring (MVEO) at the British Machine Vision Conference (BMVC) 2023, targets novel data-centric methods for selecting a core set of training samples in semantic segmentation tasks.

Participants are tasked with developing a ranking strategy that assigns a score to each sample from a pool of candidates, based on the sample's importance to training. The generated score/ranking will then inform the selection of a core set of training samples for a pre-defined U-Net classifier. Success is gauged by training the classifier multiple times with datasets of varying sizes according to the ranking/scores (e.g., using the top 1000 and top 500 samples) and calculating the average Jaccard index on an undisclosed test dataset.
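The evaluation loop described above can be sketched roughly as follows. This is an illustration under assumptions, not the organisers' code: `select_core_set`, `evaluate_ranking`, and the `train_and_predict` callback are hypothetical names, and the Jaccard index here is computed as the mean per-class intersection over union on flat label sequences, which may differ in detail from the official metric.

```python
def jaccard_index(pred, target, num_classes):
    """Mean per-class intersection over union for two flat label sequences."""
    scores = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes absent from both prediction and target
            scores.append(inter / union)
    return sum(scores) / len(scores) if scores else 0.0


def select_core_set(scores, k):
    """Return the ids of the k highest-scoring samples (higher = more important)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


def evaluate_ranking(scores, budgets, train_and_predict, target, num_classes):
    """Average the Jaccard index over several training-set sizes.

    `train_and_predict` is a hypothetical callback that trains the fixed
    classifier on the selected sample ids and returns test predictions.
    """
    results = []
    for k in budgets:
        pred = train_and_predict(select_core_set(scores, k))
        results.append(jaccard_index(pred, target, num_classes))
    return sum(results) / len(results)


# Toy example: four candidate samples with importance scores.
sample_scores = {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.7}
print(select_core_set(sample_scores, 2))  # ['a', 'd']
```

Only the scores are under the participant's control; the classifier and training budgets stay fixed, so any gain has to come from choosing better data.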

Given that competitions often drive entrants towards high-metric but impractical ensembles, this challenge's focus on optimising the dataset for a commonly used classifier is a welcome shift.

🖥️ Website
🗓️ Submission Deadline: Sunday, 15 October 2023
💻 Code

Map-sat

Generative AI has captured mainstream media attention, largely focusing on text generation services like ChatGPT. However, significant strides have also been made in image generation, where diffusion models are now rivaling GANs in popularity. This shift is attributed to their relative ease of training and ability to produce competitively high-quality outputs.

The paper 'Generate Your Own Scotland: Satellite Image Generation Conditioned on Maps' showcases how ControlNet, a technique for adding conditioning inputs to a pre-trained diffusion model, can be trained on OpenStreetMap data to produce highly realistic satellite images. When compared to actual imagery, the generated images are nearly indistinguishable to the untrained eye.

This synthetic imagery can be leveraged to augment real-world images during machine learning model training, enhancing performance. Several studies have confirmed the advantages of this mixed-data approach over using solely real images. However, the potential for malicious applications of deep-fake satellite imagery also exists, underlining the importance of parallel efforts to reliably detect generated images.

To learn more about generative methods, I highly recommend the course DiffusionFastForward by Mikolaj Czerkawski.

How to deploy an ML model to Amazon SageMaker

If you've built a machine learning model, the next step is to get it off your laptop and into use. This isn't always straightforward, especially for complex models like those used in semantic segmentation. I faced this challenge too, which led me to Francesco Pochetti's work.

Francesco has put together a 2-hour video tutorial that guides you through training a semantic segmentation model and deploying it on AWS SageMaker via a custom Docker container. This method is increasingly common, but it's not without its intricacies. I'm collaborating with Francesco to share this valuable guide with you. You can find the details in the link below.
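For a flavour of what a custom container involves: SageMaker expects the container to answer health checks on GET /ping and inference requests on POST /invocations, both on port 8080. The sketch below shows that contract using only the Python standard library. It is not the tutorial's code; `predict`, `route`, and `serve` are hypothetical names, the model call is a placeholder, and a real deployment would normally sit behind a production web server.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(payload):
    """Placeholder for real model inference; returns a dummy response."""
    return {"num_inputs": len(payload.get("instances", []))}


def route(method, path, body=b""):
    """Map a request to a (status_code, response_bytes) pair."""
    if method == "GET" and path == "/ping":
        return 200, b""  # health check: an empty 200 signals the container is ready
    if method == "POST" and path == "/invocations":
        try:
            payload = json.loads(body or b"{}")
        except json.JSONDecodeError:
            return 400, b'{"error": "invalid JSON"}'
        if not isinstance(payload, dict):
            return 400, b'{"error": "expected a JSON object"}'
        return 200, json.dumps(predict(payload)).encode()
    return 404, b""


class Handler(BaseHTTPRequestHandler):
    def _respond(self, body=b""):
        status, out = route(self.command, self.path, body)
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(out)))
        self.end_headers()
        self.wfile.write(out)

    def do_GET(self):
        self._respond()

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self._respond(self.rfile.read(length))


def serve(port=8080):
    """Start the server; SageMaker sends its requests to port 8080."""
    HTTPServer(("0.0.0.0", port), Handler).serve_forever()
```

Calling `serve()` from the container's entry point would start the server. Keeping the routing in a plain function makes the request handling easy to test without building the image.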

📺 How to deploy an ML model to Amazon SageMaker

From edge detection to deep learning

Dilsad Unsal, the creator of the HRPlanesv2 dataset—High Resolution Satellite Imagery for Aircraft Detection—has published an article that provides the background and context for this important dataset. Serving as the first instalment in a series, the article brings readers up to speed on the uses of object detection and traces the evolution of techniques from edge-based methods (illustrated above) to contemporary deep learning approaches. Additionally, the article outlines the key elements that make up a high-quality training dataset, a topic slated for more detailed exploration in future articles.

📖 Article on Medium
💻 HRPlanesv2 Dataset on Github

TorchGeo v0.5.0

TorchGeo is a Python library providing datasets, samplers, transforms, and pre-trained models specific to geospatial data. TorchGeo v0.5.0 encompasses over 8 months of hard work and new features contributed by 20 users from around the world.

Support for the PyTorch Lightning CLI → replace ad-hoc notebooks with Python scripts run with configuration files ⚡
You can now easily pre-train models with trainers for self-supervised learning (SSL) techniques like BYOL, MoCo, and SimCLR
SSL models pre-trained on Landsat
New utilities for splitting GeoDatasets including random_bbox_assignment and time_series_split
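To illustrate the idea behind a time-based split, here is a small standalone sketch that cuts a time range into consecutive train, validation, and test intervals. It does not use TorchGeo and is not how `time_series_split` is implemented; `split_time_range` is a hypothetical helper, and the real utility operates on GeoDataset objects (see the release notes for its actual arguments).

```python
from datetime import datetime


def split_time_range(start, end, fractions):
    """Cut [start, end] into consecutive intervals sized by the given fractions."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    total = end - start
    intervals, cursor = [], start
    for i, frac in enumerate(fractions):
        # Pin the final boundary to `end` to avoid floating-point drift.
        stop = end if i == len(fractions) - 1 else cursor + total * frac
        intervals.append((cursor, stop))
        cursor = stop
    return intervals


train, val, test = split_time_range(
    datetime(2020, 1, 1), datetime(2020, 1, 11), [0.8, 0.1, 0.1]
)
print(train, val, test)
```

Splitting by time rather than at random keeps later imagery out of the training set, which gives a more honest estimate of how a model will perform on future acquisitions.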

This release is quite significant for me personally, as I recently adopted the LightningCLI. I still use Jupyter notebooks for initial prototyping, but when the training code is stable I switch to scripts which are run with configuration files. This approach is compatible with git version control and integrates nicely with my metric & artefact logging solution.

📃 Release notes
📺 YouTube video: TorchGeo with Caleb Robinson

Poll

In the previous poll I asked where people use GPUs. The majority (57%) use a GPU that they have physical access to (i.e. in the office or lab), and only 28% use a cloud-based GPU. In the context of a worldwide shortage of GPUs, this perhaps shouldn’t come as a surprise, although I had expected the uptake of cloud GPUs to be higher. In this poll I want to understand the appeal of owning a GPU.

satellite-image-deep-learning
