In this episode, Robin catches up with Alistair Francis and Mikolaj Czerkawski to learn about Major TOM, a significant new public dataset of Sentinel-2 imagery. Noteworthy for its immense size at 45 TB, Major TOM also introduces a set of standards for dataset filtering and integration with other datasets. The authors' aim in releasing this dataset is to foster a community-centred ecosystem of datasets, open to bias evaluation and adaptable to new domains and sensors. The potential of Major TOM to spur innovation in our field is truly exciting. Note you can also view the video of this recording on YouTube here. The video also includes a demonstration of accessing the dataset and a walkthrough of the associated Jupyter notebooks.
Alistair Francis is a Research Fellow at the European Space Agency’s Φ-lab in Frascati, Italy. Having studied for his PhD at the Mullard Space Science Laboratory, UCL, his research is focused on image analysis problems in remote sensing, using a variety of supervised, self-supervised and unsupervised approaches to tackle problems such as cloud masking, crater detection and land use mapping. Through this work, he has been involved in the creation of several public datasets for both Earth Observation and planetary science.
Mikolaj Czerkawski is a Research Fellow at the European Space Agency’s Φ-lab in Frascati, Italy. He received the B.Eng. degree in electronic and electrical engineering in 2019 from the University of Strathclyde in Glasgow, United Kingdom, and the Ph.D. degree in 2023 at the same university, specialising in applications of computer vision to Earth observation data. His research interests include image synthesis, generative models, and use cases involving restoration tasks of satellite imagery. Furthermore, he is a keen supporter and contributor to open-access and open-source models and datasets in the domain of AI and Earth observation.
Very interesting work! I have been thinking about using this dataset, but could you give an example of a machine learning project that integrates this data? I have some ideas, but I would not like to misuse the data or apply it outside of its intended use.
Hi! Thank you! So far we are focusing heavily on unlabelled data, since it's a necessary starting point - we might expand to labelled tasks soon (hopefully with some helping hands from the community).
For the unlabelled use cases, I really recommend playing around with self-supervised learning and generative models! We were actually thinking of showing some examples in our project, but didn't prioritise it to avoid confusion (Major TOM is mostly about data).
For self-supervised learning, there are popular approaches that could be worth a try, like SimCLR or masked vision transformers.
https://lightning.ai/docs/pytorch/LTS/notebooks/course_UvA-DL/13-contrastive-learning.html
https://github.com/facebookresearch/mae
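To give a feel for what the contrastive route involves, here is a minimal sketch of the NT-Xent loss used by SimCLR, written in plain Python so it runs without any dependencies. It is not taken from Major TOM or from the tutorial linked above; the function names and the default temperature of 0.5 are assumptions chosen for illustration. In practice the embeddings would come from an encoder applied to two augmented views of each image patch, and you would use a tensor library rather than lists.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def nt_xent_loss(z_a, z_b, temperature=0.5):
    """NT-Xent loss over a batch of paired embeddings.

    z_a[i] and z_b[i] are embeddings of two augmented views of sample i.
    Hypothetical helper for illustration, not part of any library.
    """
    n = len(z_a)
    z = [l2_normalize(v) for v in list(z_a) + list(z_b)]
    total = 0.0
    for i in range(2 * n):
        positive = (i + n) % (2 * n)  # index of the other view of the same sample
        # Similarities to every other embedding in the batch (self excluded).
        logits = [dot(z[i], z[k]) / temperature for k in range(2 * n) if k != i]
        pos_logit = dot(z[i], z[positive]) / temperature
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum_exp - pos_logit
    return total / (2 * n)
```

The loss is low when the two views of the same sample are close to each other and far from everything else in the batch, which is the training signal that lets an encoder learn from unlabelled imagery.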
For generative modelling, I can't help but recommend my own course on diffusion models that includes example training notebooks and explains everything from scratch:
https://github.com/mikonvergence/DiffusionFastForward
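As a taste of the generative side, below is a minimal sketch of the forward (noising) step in the standard DDPM formulation, where a clean sample x0 is mixed with Gaussian noise as x_t = sqrt(ᾱ_t)·x0 + sqrt(1 − ᾱ_t)·ε. It is written in plain Python and is not code from the course linked above; the function names and the linear schedule endpoints (1e-4 to 0.02) are assumptions for illustration. A denoising network would then be trained to predict the returned noise from the noisy sample and the timestep.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances, one per diffusion step."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Running product of (1 - beta): the fraction of signal kept at each step."""
    alpha_bars, product = [], 1.0
    for beta in betas:
        product *= 1.0 - beta
        alpha_bars.append(product)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t from x0 in one shot; returns the noisy sample and the noise used."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, noise)]
    return x_t, noise


if __name__ == "__main__":
    alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
    pixels = [0.2, -0.5, 0.9]  # stand-in for normalised pixel values
    noisy, eps = add_noise(pixels, 500, alpha_bars, random.Random(0))
    print(noisy)
```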
I realise it would be nice to have some examples that integrate Major TOM with a ready-to-use training pipeline. Hopefully we can deliver something like that soon enough!