Patrick Egan | Digital Humanist and Ethnomusicologist in Cork, Ireland

Digital Archiving

In June 2021, I began a four month position with the Cork LGBT Archive, working for my friend and colleague, Orla Egan. We both attended UCC together since 2014, and shared an office during the past number of years. Orla’s work has always interested me, as the archive that she works on originated in a basement! Since I have known Orla, she has been sorting, digitising, promoting and sharing all sorts of interesting insight into this fascinating project.

CAHG Scotland & SCA

As digital archivist, my job has been to oversee the workflow of digitisation from box to digital repository. I also liaise with the Digital Repository of Ireland (DRI) on best practices, and have been set up by Orla on the webinar series by the Community Archives and Heritage Group (CAHG) Scotland & Scottish Council on Archives (SCA). The CAHG & SCA deliver training and networking events to support and provide skills to volunteers in community heritage groups (more about them here: https://www.scottisharchives.org.uk/). The 2021 series deals with the following topics:

  • How to digitise your archive collections.
  • Increase visibility of your digital collections.
  • Strategically promote your digital collections and optimise your collection for discovery via Google and other search tools.
  • Increase your engagement with your local community and beyond.

Armed with these connections, some research, webinars, library experience and a good deal of enthusiasm, I was ready to make a start with the Cork LGBT Archive project.

Week 1: An Overview

First and foremost, to keep track of steps involved with the project, a series of promotions were in order. Sharing tweets about the first weeks working with the Cork LGBT Archive, and representation with the Irish Heritage Council were important. The main aim was to document the process alongside other members of the team as we went along. The promotion was backed up with these detailed blog posts about the steps involved in the project.

In week one, three important first steps were to:

* plan for the project review every month
* manage ingesting documents into the DRI
* investigate formats that were needed for this ingestion

The DRI provides a very helpful formats guide, which allowed us to understand their requirements for Cork LGBT Archive to be ingested into their repository. The general guideline was that PDF/A was a preferred format for archiving, and that PDF was also acceptable. Some initial explorations into this revealed that:

  • PDF/A is good for some images and not for others (see: http://www.archives.nysed.gov/records/mr_advisories_pdfa.shtml)
  • PDF for multiple images is okay, but PDF/A is better. It allows you to add extra “technical” metadata to the PDF file so that the metadata can be portable – it can move with the file wherever it goes. PDF/A also embeds fonts, which makes sure it is reproducible across systems. It is also strict – it doesn’t allow embeddings other than images (videos etc.).

Organising a workflow

The most important step in the digitisation and ingestion process for the DRI was to establish a way of organising how every part of the process fitted together. This has obvious advantages, “fail to plan, plan to fail”, but it also has others – for speed, efficiency, accuracy and quality.

Audit of the Cork LGBT Archive

  • Firstly, what material has been digitised, and if material exists elsewhere, will it be needed for this phase of the project
  • Do we need to update new scans with consistency of what has already been completed? How might this work on an ongoing basis?
  • What is the process for digitisation with this scanner (Epson DS50000)
  • What are the goals of these outputs – how much more material in the DRI and/or the Archive?
  • Will this phase produce best practices / guidelines to enable consistently in the future?

Backup

The main goal of backup is to ensure that the Cork LGBT Archive is safe from data loss. In order to achieve this, we looked to best practices in establishing the most reasonable methods that a community archive could align with for cost, quality and consistency. We decided to work out the “3 2 1” rule of data backup for this project (https://www.armstrongarchives.com/3-2-1-rule-data-backup/)

This rule basically means three things – 3 copies, 2 on media, 1 offsite.


For the Cork LGBT Archive, this meant –

  • Where is data currently stored?
  • How is it currently stored?
  • How can we meaningfully back up the new data as it comes in, without creating a whole new system?

To answer the first question, it was determined that there were three locations:
1) On a recently purchased HP laptop
2) On a 4TB external hard-drive
3) In the cloud (Dropbox)

The recently purchased laptop had a capacity of 256GB which was determined to be unfeasible as a permanent backup. There were 2TB storage capacity on cloud storage through Dropbox and 300GB was already in use. Traversing different storage locations was an important aspect of the process of backup. The External Hard Drive contained 4TB, and as each item was saved, it could be sent to Dropbox.

A further external hard-drive could then be bought to store the final permanent backup off-site and backed up every month.

The next question we arrived at was, what type of software did we want to use to monitor changes in backups, and how would this workflow be established. I had already used AOMEI Backup on my own machine, but our team member Jacob suggested to use GitHub.

The Digital Repository of Ireland (DRI)

It was determined that a previous ingest of material from the Cork LGBT Archive was carried out with the DRI, where they already had a repository on their website. Orla had previously worked with Anita, Kathryn and Kevin at the DRI to make this happen. This setup allowed me to become familiar with the way that material has been presented there and the way that the backend looked.#

Day 1

I identified that there are some aspects to the original workflow that could be improved. The first is the scanning of documents. In the case of scanning to jpeg, it was determined that the project workflow could be made more efficient with the use of batch operations to convert TIFF files to JPEG in new destination folders. Carrying out this type of batch operation meant that this would cut the scanning workload in half, as previously both JPEG and TIFF files were created during the digitisation process. It was determined that automating made no difference to the quality of the operation as the file names of TIFFs and JPEG could be the same (and were previously).


Secondly, it was found that booklets that were originally scanned as JPEGs had been scanned to the PDF format, and this made one file out of many. This was determined to be a reasonable compromise, as PDF/A is not currently available without subscription on Adobe software. Also, the benefits of PDF/A are not relevant, as the project does not have a need for portable fonts, or for technical metadata (that is metadata which is stored inside of the PDF file itself). *See later posts for updates on this.

Europeana: DRI data had already been harvested for Europeana. We needed to make sure new material was also ingested. So a key goal was to find out if this was to ask staff at the DRI if Cork LGBT material gets updated with Europeana, and what the protocol is for new items that are added to a collection after the first ingest.

Developing a Catalogue

It was found that Cork Museum is currently carrying out a project to catalogue a number of documents that are from the Cork LGBT Archive, which will include pencil marks itemizing each document that is catalogued. It was determined that a catalogue could be made of these documents for retrieval later on by researchers. It was suggested by Orla that these documents could be used to form a catalogue in such formats as EAD, ISAD(G) or some other suitable archiving standard, and that this catalogue numbering system could then be used in other resources of the Cork LGBT Archive situated either on the DRI website or on the corklgbtarchive.com website within the “Related To” Dublin Core metadata.

Goals for this section became to:

  • Find out how you can start a catalogue. What way will the numbering system work? How do we get those numbers into all digital systems?
  • Meet with Orla and staff from the museum to discuss
  • Research suitable finding aids for this type of material (interns at the museum had started to add identifiers in pencil marks on material)
  • An identification system was needed to let people know what is in the museum and on DRI (maybe an item number added to the “related to” field of the Dublin Core metadata here), so that they could get access to each document

Subject headings

The metadata subject heading options for LGBT are varied, but two main ontologies are available and currently in use: Homosaurus and the Library of Congress authorities which are available at https://authorities.loc.gov

Example 1: A list of Irish organisations is needed – an archives guide, for how to put names of people (eg egan, orla 1966-)
Example 2: there are no “dike bars” in Homosaurus

It was found that some of the subject headings that were previously missing from the Library of Congress can be found under a different heading name (for example “Dike Bars” can be found under “Lesbian Bars”). In this case, there can be confusion around what heading should be used, as it may be political or a preference depending on who is being referred to. It has been made clear from the Cork LGBT Archive team that previous attempts to add information were not carried out, as the information was not made available to the person(s) contributing items to the Archive. It is recommended therefore to print out a series (maybe one page list) of possible subject headings (with brief descriptions) that are already either in Homosaurus or on the Library of Congress Subject Headings resource in order to make prompts available to the person who is scanning / contributing material.

Digital Repository of Ireland

Through investigating what the DRI recommended, it was found that we could scan in high-res JPEG, in TIFF and PDF. The latter is an acceptable format, even though PDF/A is preferred.

A number of smaller issues also arose when discussing the PDF format. Without the Adobe Pro software to rotate PDFS, a previous staff member would get her partner to rotate the PDF instead.

In a previous batch ingest for the DRI, the Cork LGBT Archive team had utilised a file for the operation. Similar to the audit of the state of the Cork LGBT Archive then, the Archive’s presence on the DRI resource needed to be understood. Part of this understanding was to investigate the system that was set up for workflow in previous ingests. For a start, it was determined that this involved DCMES xml files

Another minor issue was to define a method for file naming, as this had not been made clear. It was determined that complying to a standard was best, and this involved some important rules. A helpful guide for best practices is located here:
https://libguides.princeton.edu/c.php?g=102546&p=930626

At this point 397 items were already on the Cork LGBT Archive website


A question to be asked was: how were booklet PDF’s originally made, and what was the process? Could it be that we use this process: Scan > Save as images > Open Acrobat > Add images > Send to DRI > Save in archive? For the new scanner that was acquired, we needed to think about PDF/A and the auto-rotate feature. Our scanner (Epson WorkForce DS-50000) facilitated both PDF/A creation, and auto-rotate. It also came with a “Dual Image Output”. However, there were other problems with this workflow. The process of cropping images manually was used in the older version of this scanner software, but did not seem to be available to us in the newer model. Test scans revealed some issues with the auto-crop, which will be explored in the next blog post.

Conclusion

In all, the first weeks of this project demonstrated a wide variety of different processes at work, and a lot of variables to manage. Within each stage of the process (from box to digital repository), there were specific challenges and a number of “to do” lists, each one having a potential impact.

On the other hand, this archive is not represented by any institution, and so there is a great deal of flexibility on what can happen with material and the workflow that is involved with the project. This has been the interesting part of the Cork LGBT Archive, and gives it an “edge”.

More to come in the next blog post!

Leave a Reply

Your email address will not be published. Required fields are marked *