Tackling duplicate data in the NT justice system
- 15 March, 2018 12:58
Get caught breaking the law in the Northern Territory and chances are your details will end up within the government’s Integrated Justice Information System.
The system records and manages justice information for police, courts and corrections, all the way from the initial arrest and apprehension of a person through to court appearances, judges’ rulings, chasing fines and, depending on the crime, prisoner management.
Originally implemented as a mainframe application in 1992, over time it has unfurled into three systems: Police Real-time Online Management Information System (PROMIS), IJIS and the Integrated Offender Management System (IOMS), with individuals’ data stored in and pulled from numerous sources.
There are large mainframe applications – like the ones used for fines recovery and motor vehicle registration – and multiple data warehouses. “We’ve got heaps of them,” says NT government solutions architect Liz Shenton, “big ones, small ones, we’ve got databases we call data warehouses. Everyone’s got one.”
The aim is for the system to provide ‘fluid and governed’ data exchange across multiple agencies and business processes.
“The reality – lots of paper, lots of manual processes, many points of entry for the same data across multiple systems, multiple versions of the truth and a myriad of systemic data quality problems, and aging business applications,” Shenton explained at a Gartner Data and Analytics Summit in Sydney last month.
Duplicate data had become rife. For example, the system held records on 560,000 individuals across the territory.
“The territory has a population of around 240,000,” Shenton said. “Between them they had 1.5m addresses. And there were 650k names – so everybody’s got a name and a bit… So we had a bit to work with.”
In 2013, a $16 million modernisation of IJIS began – Project Veritas. Central to its success was a Master Data Management (MDM) system.
“Duplicate parties were everyone’s pain point. We wanted to identify duplicates, resolve them and ultimately prevent them being created,” Shenton added.
The MDM solution is now being rolled out in earnest. In getting to this point, Shenton and her team have had to overcome resistance from conservative stakeholders, tackle tricky technical challenges and manage nuances of the data that are unique to the territory.
The implications of misidentifying a duplicate and deleting the record of an individual in the justice system are severe.
“It extends far beyond people having the same first name, last name, date of birth, driver’s licence. There are a lot more things that have to be considered,” Shenton explained.
“And the risks and implications of getting it wrong are far greater – there’s lawsuits, there’s coronials, there’s all sorts of nasty things you don’t want to consider. So it is really important, and the police and other agencies are extremely conservative when it comes to accepting that someone is a duplicate,” she said.
For that reason, the MDM initiative established two key principles: The data would remain and always be owned by the agencies, and the agency systems would be considered the source of truth.
The MDM would be advisory, Shenton said: “we don’t insist two things are the same, we suggest they are”.
Following a number of years of consultancy, proof of concept work and enterprise architecture design, the territory government put the MDM system to tender in 2015. Orchestra Networks was selected to provide the solution, with Ascention chosen to provide support.
“The main thing that we looked for was something that was simple, simple enough for us to use, not to be tied to vendors for forever and a day. Something our local resources could pick up and take forward,” Shenton said.
Work began by harmonising and standardising reference data — the set of permissible values to be used by other data fields.
“When you’re describing a party there are a whole heap of attributes that are code tables, they’re common data that everybody uses across the agency. But there are a lot of ways of describing brown hair or blue eyes, they’re all different for the different systems,” Shenton said.
In some cases – ethnicity and country of origin for example – the Australian Bureau of Statistics standard terms were adopted. Standard policing terms were also adopted, such as the way they describe tattoos and their location.
Other descriptors proved more tricky to standardise. Tribal languages, for example, were not found easily in ABS codesets.
“There were some – like relationships, which should be, you’d imagine, fairly standard, but for some reason, in the mainframe system, they were free text,” Shenton said. “There were many ways of spelling ‘uncle’ for some reason. And ‘wife, spouse, former wife, wife number 2, wife relation’ and so on. And my personal favourite – ‘as above’.”
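To make the harmonisation step concrete, here is a minimal Python sketch of mapping free-text relationship values onto a single standard code set. The code values, the synonym table and the function name `normalise_relationship` are illustrative assumptions; the article does not describe the NT government's actual code tables.

```python
from typing import Optional

# Hypothetical synonym table: free-text variants -> one standard code.
# The real reference data used in the project is not described in the article.
RELATIONSHIP_SYNONYMS = {
    "uncle": "UNCLE",
    "unkle": "UNCLE",
    "uncel": "UNCLE",
    "wife": "SPOUSE",
    "spouse": "SPOUSE",
    "wife relation": "SPOUSE",
    "former wife": "FORMER_SPOUSE",
}


def normalise_relationship(raw: str) -> Optional[str]:
    """Return a standard code, or None when the value needs manual review."""
    key = " ".join(raw.lower().split())  # lowercase and collapse whitespace
    return RELATIONSHIP_SYNONYMS.get(key)
```

Values that cannot be mapped, such as ‘as above’, come back as `None` in this sketch so that a person can review them rather than having the code guess.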
As work to identify duplicate individuals in the systems began, Shenton and her team discovered some “very unique challenges”.
“A lot of dates of birth are missing, a lot of DOBs are guessed, a lot of DOBs are unknown,” she said. “In Indigenous communities often they’ll know they were born in the dry season in 1940 and that’s about as good as they’re ever going to get. So the standard practice is just use the first of the first in whatever year they might be,” Shenton explained.
In the data, that has meant a huge number of individuals share the same birthday.
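One way a matching process might account for this is to treat a 1 January date of birth as weaker evidence than any other matching date. The Python sketch below illustrates that idea only; the function names and the ‘strong’, ‘weak’, ‘mismatch’ and ‘unknown’ labels are assumptions, not the rules used in the NT system.

```python
from datetime import date
from typing import Optional


def is_placeholder(dob: date) -> bool:
    """Treat 1 January as a possible 'day and month unknown' placeholder."""
    return dob.month == 1 and dob.day == 1


def dob_evidence(a: Optional[date], b: Optional[date]) -> str:
    """Classify how much a pair of dates of birth supports a match."""
    if a is None or b is None:
        return "unknown"    # a missing date says nothing either way
    if a.year != b.year:
        return "mismatch"
    if is_placeholder(a) or is_placeholder(b):
        return "weak"       # only the year can be trusted
    return "strong" if a == b else "mismatch"
```

Under these assumptions, two people both recorded as born on 1 January 1940 would count as only a weak match on date of birth, so other attributes would have to carry the decision.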
Using location has also proved problematic. The Northern Territory is the least populated of Australia’s states and territories – with fewer than half as many people as Tasmania – and half of its residents are concentrated in Darwin.
Those living in rural communities often have an address relating to a town hundreds of kilometres away.
“Often it will say in a system ‘House 3, Yuendumu, via Alice Springs’. While Yuendumu is 300km to the west, the actual town is Alice Springs,” Shenton said.
“So when our statistics people look at where crimes occur they’re getting Alice Springs, not Yuendumu, because it’s sitting in address line one.”
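The ‘via’ problem can be illustrated with a short Python sketch that separates the actual community from the routing town. It assumes a comma-separated address whose last part may begin with ‘via’; the `ParsedAddress` type and `parse_via_address` function are hypothetical, and real address data would need far more cases than this.

```python
import re
from typing import NamedTuple, Optional


class ParsedAddress(NamedTuple):
    premises: str          # e.g. "House 3"
    locality: str          # where the person actually lives
    via: Optional[str]     # routing town, if one was given


VIA_PREFIX = re.compile(r"^via\s+", flags=re.IGNORECASE)


def parse_via_address(raw: str) -> ParsedAddress:
    """Split 'premises, locality, via town' into its separate parts."""
    parts = [p.strip() for p in raw.split(",") if p.strip()]
    via = None
    if parts and VIA_PREFIX.match(parts[-1]):
        via = VIA_PREFIX.sub("", parts.pop())
    locality = parts.pop() if parts else ""
    return ParsedAddress(", ".join(parts), locality, via)
```

With the example from the article, this sketch would report Yuendumu as the locality and keep Alice Springs only as the routing town.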
Names in Indigenous communities can also be fluid, with monikers being passed on if there is a death in the family for example.
Always in synch
The Orchestra Networks EBX MDM solution applies set logic rules to the data to identify duplicates. Things like equivalent date of birth and similar names come first. Matched candidates are then funneled through further rules – Do they have the same driver’s licence? An exact date of birth match? An address in common? And so on.
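The two-stage approach described above can be sketched in Python as a coarse candidate filter followed by a set of confirming checks. This is an illustration under stated assumptions, not how EBX works internally: the `Party` fields, the 0.8 name-similarity threshold, the use of birth year as the ‘equivalent’ date test and the use of the standard library's `difflib.SequenceMatcher` are all choices made for the example.

```python
from dataclasses import dataclass
from datetime import date
from difflib import SequenceMatcher
from typing import List, Optional


@dataclass
class Party:
    name: str
    dob: Optional[date] = None
    licence: Optional[str] = None
    address: Optional[str] = None


def is_candidate(a: Party, b: Party, name_threshold: float = 0.8) -> bool:
    """Stage 1: same birth year (assumed test) and similar names."""
    if a.dob is None or b.dob is None or a.dob.year != b.dob.year:
        return False
    ratio = SequenceMatcher(None, a.name.lower(), b.name.lower()).ratio()
    return ratio >= name_threshold


def match_reasons(a: Party, b: Party) -> List[str]:
    """Stage 2: collect the indicators that support a suggested match."""
    reasons = []
    if a.licence and a.licence == b.licence:
        reasons.append("same driver's licence")
    if a.dob and a.dob == b.dob:
        reasons.append("exact date of birth")
    if a.address and a.address == b.address:
        reasons.append("address in common")
    return reasons


def suggest_match(a: Party, b: Party) -> Optional[List[str]]:
    """Return supporting reasons for a candidate pair, or None if not a candidate."""
    return match_reasons(a, b) if is_candidate(a, b) else None
```

The function returns the reasons rather than a yes or no answer, in keeping with the advisory principle: an agency deciding whether to merge can see why the pair was suggested.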
“Out of the end of the sausage machine we’ll get a list of likely candidates, with varying degrees of reliability they are a match and indicators why we think they’re a match. It’s a many-pronged attack. We’re constantly refining it. We draw a ring around everyone we think to be the same, that’s our golden record,” she added.
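Drawing ‘a ring around everyone we think to be the same’ can be pictured as grouping records that are linked by suggested matches. The Python sketch below uses a union-find structure for that grouping; it is an assumed technique chosen for illustration, and the record identifiers and the `group_matches` function are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def group_matches(ids: Sequence[str], pairs: Sequence[Tuple[str, str]]) -> List[List[str]]:
    """Group record ids so that every suggested pair ends up in one group."""
    parent: Dict[str, str] = {i: i for i in ids}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # join the two groups

    groups: Dict[str, List[str]] = defaultdict(list)
    for i in ids:
        groups[find(i)].append(i)
    return sorted(sorted(g) for g in groups.values())
```

Nothing in the source records is changed here. The grouping sits above them, so the link between records survives even when an agency chooses not to merge.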
As per the principles laid out at the beginning of the project, duplicates are not just deleted, but the agencies are alerted to any doubling.
“If they don’t want to merge two people together – sometimes they can’t – we maintain the fact they are most likely to be the same person. We know at this top level they are one,” Shenton said.
“If the source systems choose to merge that gets reflected back up to us and one of those people [the duplicate] will disappear and the attributes will merge into another. We’re always in synch.”