Earth observation satellites generate vast volumes of data, and the volume is growing at an exponential rate.
The existing archive is already huge, gathered by a fleet of satellites such as the US Landsat missions, which have imaged the planet on a 16-day revisit cycle since the early 1970s, and the MODIS instruments. Newer satellites add data even faster: the Japanese Himawari-8 captures a fresh image of its field of view every 10 minutes. Individual files are big as well; the European-led Sentinel satellites capture images with a resolution of 10 metres.
Scientists wanting to analyse and compare different regions, and to track changes over time, face a major data wrangling task. There are different file formats and coordinate projections to contend with, and the data must be drawn from multiple datasets.
Once an “occupational necessity” taking up half a researcher’s time, the satellite data wrangling exercise has become so unwieldy that many projects are now “not feasible”, explains the NCI, the National Computational Infrastructure, based at ANU.
To solve the problem, Geoscience Australia is working on a Digital Earth Australia (DEA) platform, which will offer ‘analysis ready’ data in an easy-to-use, Google Maps-style user interface.
Powering DEA is a new data service developed by NCI called GSKY, which was revealed this week.
NCI’s research engagement and initiatives team approached the problem by “reconsidering the options for on-demand processing”, the organisation said.
“In essence, GSKY combines all the perks of user interfaces found in contemporary mapping frontends – kind of similar to Google Maps – with the higher dimensional geospatial data that is stored at NCI. Even if the requests encompass a large geographical area, NCI’s service can process queries in milliseconds – or virtually instantaneously, as far as the user is concerned,” an NCI spokesperson explained.
The DEA featured in this year's federal budget, receiving $15.3 million in funding, with a further $36.9 million to follow over the next four years.
“Digital Earth Australia is a revolutionary project, and adding NCI’s GSKY functionality will greatly enhance its usefulness,” said DEA program director, Dr Trevor Dhu.
“Previously, if researchers wanted to access our data they had to know where and how that data had been stored, hand-select the relevant files, and either come to NCI or have somewhere to download their data for their own analysis. GSKY has provided us with an easy way to get this data into the hands of a wide range of users and is the first step in truly unlocking the power of satellite data in Australia,” he said.
It is hoped the DEA, and the ‘real-time analysis’ GSKY enables, will have significant benefits for government and industry, particularly agriculture and mining.
It will “allow a new class of data-intensive workflows and fast-track answers to important questions, such as identifying areas where vegetation cover shows bushfire risk”, the NCI said.
There is also benefit for the spatial information and services industry. Studies have predicted that access to new satellite imagery could create 12,600 new jobs in the development of applications and services based on the data.
“With the buzz surrounding machine learning, deep learning and artificial intelligence, GSKY offers a tantalising glimpse into the future of this flourishing computer science field,” the NCI spokesperson said.
“Just like humans, the algorithms that enable machine learning require access to large pre-prepared data collections – something that GSKY makes short work of.”
GSKY relies on several software systems developed by the NCI team, and better ways to organise datasets. Here they explain how it works:
For each data source, the underlying data are first organised into versioned timeseries datasets. NCI then uses software called MAS (Metadata Attribute Search) that scans and stores all metadata associated with the data files and makes it available to software (such as GSKY) requiring extremely fast and deep search. MAS is kept up-to-date by using ‘crawlers’ to seek out new or modified data across NCI’s petabytes of data collections.
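MAS itself is not shown in the article; as a rough illustration of the idea only, here is a minimal Python sketch (all names and fields hypothetical) of a metadata catalogue that a crawler populates and that answers fast spatial and temporal searches over file records:

```python
from dataclasses import dataclass

@dataclass
class FileRecord:
    path: str
    start_time: str   # ISO date, e.g. "2017-01-01"
    end_time: str
    bbox: tuple       # (min_lon, min_lat, max_lon, max_lat)

class MetadataIndex:
    """Toy stand-in for a MAS-style metadata catalogue."""

    def __init__(self):
        self.records = []

    def crawl(self, records):
        # A real crawler would walk the filesystem and parse each
        # file's embedded metadata; here the records are supplied directly.
        self.records.extend(records)

    def search(self, bbox, start, end):
        # Return files whose bounding box and time range overlap the query.
        def overlaps(r):
            lon0, lat0, lon1, lat1 = r.bbox
            qlon0, qlat0, qlon1, qlat1 = bbox
            spatial = lon0 <= qlon1 and qlon0 <= lon1 \
                and lat0 <= qlat1 and qlat0 <= lat1
            temporal = r.start_time <= end and start <= r.end_time
            return spatial and temporal
        return [r for r in self.records if overlaps(r)]
```

The point is that once metadata lives in an index like this, a query touches only catalogue records, never the petabytes of files themselves.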
GSKY’s underlying compute engine implements a scalable, distributed processing workflow that takes advantage of parallel processing across hundreds of CPU cores.
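The article does not detail the engine's internals, but the fan-out-and-gather pattern it describes can be sketched with Python's standard process pool (the function names are illustrative, and the per-tile work is a trivial placeholder):

```python
from concurrent.futures import ProcessPoolExecutor

def process_tile(tile):
    # Placeholder for the real per-tile work (reprojection, band
    # arithmetic, compositing); here it just averages the values.
    return sum(tile) / len(tile)

def run_workflow(tiles, workers=4):
    # Fan the tiles out across worker processes (one per CPU core in
    # a real deployment) and gather results in submission order.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_tile, tiles))
```

A production system would distribute this across nodes rather than local processes, but the shape of the workflow is the same.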
The internal GSKY pipeline is composed of distinct modules that are networked to create a manageable workflow.
The first step of the workflow is to determine the user’s intentions through the given parameters – that is, the location and timeframe of the requested information. The engine examines this request, and uses the indexing system to identify the files that contain the relevant data.
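The request format is not specified in the article; assuming a web-map-style query string (parameter names are illustrative), step one might look like:

```python
from urllib.parse import parse_qs

def parse_request(query):
    # Determine the user's intent from the given parameters:
    # the bounding box (location) and the timeframe of the request.
    params = parse_qs(query)
    bbox = tuple(float(v) for v in params["bbox"][0].split(","))
    return {"bbox": bbox, "time": params["time"][0]}
```

The resulting location and timeframe are what the engine hands to the indexing system to identify the files that contain the relevant data.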
The second step then extracts the required data from each file found and applies the transformations needed to fit the user’s request. This module is special in that its workload can be distributed amongst many compute nodes in the cluster, a design decision the developers made to mitigate the high CPU and I/O usage they observed when running this module.
Remote Procedure Calls (RPCs) split the work across a cluster of dedicated real-time nodes, and then reconstruct the information once the processing is completed.
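GSKY's actual RPC layer is not shown here; as a hypothetical sketch, the split-and-reconstruct pattern looks like this, with `node_extract` standing in for the remote call each node services:

```python
def split_work(files, n_nodes):
    # Partition the file list into contiguous, roughly equal shards,
    # one per real-time node.
    size = -(-len(files) // n_nodes)  # ceiling division
    return [files[i:i + size] for i in range(0, len(files), size)]

def node_extract(shard):
    # Stand-in for the remote procedure each node executes: extract
    # and transform the data held in its shard of files.
    return [f"data({f})" for f in shard]

def reconstruct(per_node_results):
    # Reassemble the per-node results into one ordered result set
    # once the processing is completed.
    merged = []
    for result in per_node_results:
        merged.extend(result)
    return merged
```

In a real deployment the call between `split_work` and `reconstruct` crosses the network; the orchestration logic is otherwise unchanged.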
These individual pieces are then merged, scaled and rendered, and finally sent back to applications through commonly used network query interfaces as either images or raw data files.
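As a final illustration, again hypothetical and using plain lists rather than a real raster library, merging a row of tiles and scaling the raw values into image pixel intensities could look like:

```python
def merge_tiles(tiles):
    # Stitch a row of tiles side by side; each tile is a list of rows.
    n_rows = len(tiles[0])
    return [sum((tile[r] for tile in tiles), []) for r in range(n_rows)]

def scale_to_byte(grid, vmin, vmax):
    # Linearly rescale raw data values into 0-255 pixel intensities,
    # ready to be encoded as an image and returned to the client.
    span = (vmax - vmin) or 1
    return [[round(255 * (v - vmin) / span) for v in row] for row in grid]
```

Whether the scaled grid is encoded as an image or returned as a raw data file then depends on what the requesting application asked for.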