This year marked the conclusion of one of the largest technology projects ever undertaken by the National Library of Australia, with the organisation completing an overhaul of its core digital library systems.
The NLA’s Digital Library Infrastructure Replacement Program (DLIR) represented a major investment in the organisation’s ongoing ability to grow and manage its massive collections, as well as to collect new and emerging digital formats.
The five-stage, six-year DLIR program kicked off in 2011. A core goal of the DLIR was replacing key digital library applications that had reached end of life, with the first three stages focused on replacing legacy systems and building systems to manage digitisation. The last two stages focused on new digital collecting capabilities, the library’s chief information officer, David Wong, told Computerworld.
Wong said the applications had been built using “legacy programming languages” and employed in-house-developed frameworks, which the CIO noted are now considered a “big no-no”. The age and nature of the underlying codebases meant both maintenance and upgrades were a struggle.
But it was not just ease of maintenance that drove DLIR. The library also had to enhance digital preservation of material, manage at-risk digital formats, and ensure that its systems were able to scale to meet growing demands.
“There were things that were missing, from a functionality viewpoint, in digital preservation,” Wong said. “Our systems just didn't meet the expectations of our digital preservation area.”
The NLA wanted more from its digital library systems “to ensure content could be collected, stored, preserved and accessible for generations to come,” the CIO said.
“It’s a bit of a funny point because IT people tend to have a different view to librarians when it comes to digital preservation,” he explained. “We think if the content is digitised and accessible, then Bob's your uncle, that’s all there is to it.”
Librarians have to look at things differently, Wong explained, including recording not just bibliographic metadata but provenance and preservation metadata.
“Then they need to make sure the formats the content is stored in are formats that will last,” the CIO said.
“A lot of content requires specific software to access it,” he added. “Like a WordStar file needs WordStar. To a digital preservation specialist, having things described properly and in the right formats is very important. From an IT perspective, we tend to oversimplify things and say, ‘You can just get an emulator.’”
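The librarian's view Wong describes can be sketched in code. The record below is purely illustrative: its field names are hypothetical and only loosely inspired by preservation-metadata practice (such as the PREMIS data dictionary); it is not the NLA's actual schema. It shows the three layers Wong distinguishes: bibliographic, provenance and preservation metadata.

```python
from dataclasses import dataclass, field

@dataclass
class PreservationRecord:
    # Bibliographic metadata: what the item is
    title: str
    creator: str
    # Provenance metadata: where the item came from and what has happened to it
    source: str
    events: list = field(default_factory=list)
    # Preservation metadata: how to keep the item accessible over time
    file_format: str = ""        # e.g. a MIME type
    format_registry_id: str = "" # e.g. an identifier in a format registry such as PRONOM

    def record_event(self, event: str) -> None:
        """Append a provenance event so future custodians can audit the file's history."""
        self.events.append(event)

rec = PreservationRecord(
    title="Oral history interview",
    creator="J. Citizen",
    source="donated hard drive",
    file_format="application/x-wordstar",  # an at-risk format needing migration or emulation
)
rec.record_event("ingested")
rec.record_event("migrated to PDF/A")
```

The point of the `events` list is the one Wong makes: a digitised file alone is not enough; the record of what was done to it, and a durable description of its format, are what keep it usable for generations.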
In addition to the digital preservation of material, the library has in recent years had to face the challenge of collecting new digital formats.
“I think it’s one thing digitising physical content; it’s another thing collecting new formats, like e-books, social media, websites,” Wong explained.
In 2015, changes to the Copyright Act enabled the NLA to significantly broaden its efforts to preserve Australia’s digital cultural heritage.
Those changes gave the library the ability to collect digital-only publications in a way that it had previously collected physical publications.
The legislation enabled the NLA to issue a request for electronic material that it considers belongs in its Australian collections, regardless of whether that material was first published on a locally hosted website or one hosted in another country.
Although the changes to the Copyright Act gave the library the legislative mandate to collect digital publications, it still needed to build, through the DLIR, the systems to support formats such as e-books and PDFs.
But even without bringing new and emerging media into the picture, the library’s systems needed to be able to scale significantly, Wong said.
“There’s a lot of content out there and our collection doubles every four years,” he explained. “That’s just via business-as-usual collection growth — so when you add in new collection types and content, the rate of increase of the collection size is not quite exponential but it’s increasing far more rapidly.”
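The arithmetic behind "doubles every four years" can be made concrete: it implies a compound annual growth rate of roughly 19 per cent, before any new collection types are added. A quick sketch (normalised figures, not actual NLA collection sizes):

```python
# Doubling every four years implies an annual growth factor of 2^(1/4).
annual_growth = 2 ** (1 / 4) - 1   # ~0.189, i.e. roughly 19% per year

size = 1.0                         # normalised collection size today
for year in range(8):              # project eight years of business-as-usual growth
    size *= 1 + annual_growth

print(round(size, 2))              # 4.0 — two doublings in eight years
```

New content types on top of this baseline are what push the growth rate beyond business-as-usual, as Wong notes.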
The emergence of the Internet of Things and ultra-high-definition video formats will add to the challenge.
“We need infrastructure to support not only increased volumes of content but also user growth and computational power needs,” Wong said.
A final driver for the program was the changes in end-user expectations, the CIO said. People are increasingly seeking to access content on an array of mobile devices and expect rich, interactive experiences.
Building digital infrastructure
Originally $7 million was earmarked for DLIR but the program grew to around $15 million — reasonable, Wong said, given that there was nothing on the market that would meet the organisation’s requirements.
Although an approach to market by the NLA drew offers to build the systems the library needed, the cost would have been far higher, he added.
“Building in-house, and buying components where mature market solutions were available, to be part of a broader digital library ecosystem, was a good approach,” he said.
Of the $15 million, around $3 million was spent on sourcing off-the-shelf commercial solutions.
The CIO believes that getting an external software implementer to build the systems would have at least doubled the cost.
Some of the commercial vendors in the space have products that are “very rudimentary,” he added. “When we look at what state libraries and other overseas libraries have achieved, there have been numerous advantages with our approach.”
Java and MySQL (as well as an open source graph database) played key roles in building out the systems. The library team employed open source technologies whenever possible and sought to use established frameworks.
“We tried finding a framework for content repositories but for the size of the collection and for our requirements we didn't find anything out there that was suitable,” the CIO said.
“We had to do ‘bake your own’ quite often. To be honest, I was a little bit uncomfortable with writing so much code, but looking at the NLA and tradition and history, we’ve built systems for many years and have successfully scaled them and maintained them.”
The NLA’s new ecosystem of digital library applications delivers end-to-end support for digitising content, collection management, metadata and digital content storage, content delivery, and content discovery.
On top of that there are user engagement features, such as those incorporated in the library’s Trove search engine that allow readers to correct the OCR output for digitised publications as well as tag, comment and make lists.
“There are various workflows in the range of systems to manage different content types,” Wong said. For example, the NLA’s system for harvesting web content is specific to web archives collection.
“Same with the other content types that we've got — there are workflows optimised to manage books and journals, and different ones to manage oral history and unpublished manuscripts,” the CIO said.
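The design Wong describes — a dedicated, optimised pipeline per content type rather than one generic workflow — can be sketched as a simple dispatch table. The content types echo those in the article, but every step name here is invented for illustration; none of this reflects the NLA's actual systems.

```python
# Hypothetical per-content-type workflows; step names are illustrative only.
WORKFLOWS = {
    "web_archive":  ["harvest", "quality_check", "index", "deliver"],
    "book_journal": ["scan", "ocr", "catalogue", "store", "deliver"],
    "oral_history": ["digitise_audio", "transcribe", "catalogue", "store"],
    "manuscript":   ["image", "describe", "store", "deliver"],
}

def steps_for(content_type: str) -> list:
    """Return the optimised pipeline for a content type, or fail loudly if unknown."""
    try:
        return WORKFLOWS[content_type]
    except KeyError:
        raise ValueError(f"no workflow defined for {content_type!r}")

print(steps_for("web_archive"))
```

The trade-off the article goes on to discuss is visible even in this toy version: each pipeline can be tuned to its content type, at the cost of maintaining several pipelines instead of one generic one.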
The library considered shifting to a single, unified system, with a single workflow, content repository and delivery system. “I think you have the whole conundrum of, when you try to build something generic, how complex does it get?” Wong said.
“The other thing is, over time, even though our systems were old and end of life, they were developed and improved over time — so there’s a lot of IP in the workflows, a lot of optimisation. To throw that away and just chuck it all into a generic workflow, it means you lose a lot of the efficiency.
“We’ve tried to find a balance between keeping it simple and customising. I think we've reduced the number of workflows but still support the needs of specific collections and content types.”
With one of its most significant technology programs now behind it, the NLA’s technology team has the space to plot out future enhancements to the library’s systems.
Like other CIOs, Wong said he faces the challenge of “keeping the lights on” while also attempting to find opportunities to innovate.
“We’re in the process now of trying to think about what we need to do, going forward,” Wong said.
Because the library has relied so much on in-house development efforts, it needs to work on a sustainable approach to maintaining its systems over the medium and long term.
There’s also significant duplication in the organisation’s application portfolio, with Wong noting that the library’s structure is well-established and, like many organisations, “faces challenges operating in both the traditional and digital spaces”. Addressing that will be a big priority for his team, the CIO said.
He said a key driver for improvement at the moment is the user expectations that come from the popularity of Internet giants such as Google, Facebook and Netflix.
“People expect Google-like search when they're searching for stuff,” he said. “People expect Facebook-like for social. And people expect Amazon’s buying experience; you get customised recommendations and it’s easy to buy things with one click.
“People use those sites and then come to our services and, over time, they're going to expect us to be like that.”
The CIO said he’s assessing the potential application of machine learning approaches across a range of areas. Opportunities include running automatic image captioning across the NLA’s substantial image collections as well as automatically cataloguing articles, and improving Trove’s OCR quality.
Another area of development that he’s got his eye on is digital assistants and bots. “Rather than have people call librarians to ask questions, perhaps we can have a digital assistant that can respond to the queries, tapping into the vast indexes and content stores that we have,” the CIO said.