Collections as Data (2017)

Collections as Data – Hackathon / Collaborative Workshop
November 29 @ 10:00 am – November 30 @ 5:00 pm
NUI Galway, Galway, Ireland
http://mooreinstitute.ie/event/collections-data-hackathon-collaborative-workshop/

Introductions

This event, hosted by the Irish Research Council, DARIAH and the Moore Institute at NUI Galway, was a two-day Collaborative Workshop / Hackathon / Exploration of creativity using humanities research data. The objective was to collaborate in small groups of researchers and practitioners over two days to explore and create. NUIG’s plan was to explore what people from diverse backgrounds can create when they work together. Each group would consist of three participants: a humanities researcher, a developer / engineer and a designer. The end goal was for participants to walk away with a community of support, and an idea of the possibilities of using collections as data.

This discussion is divided as follows:

I begin by introducing ideas and advice from the hackathon speakers, then summarise highlights from the questions from the floor, and then discuss our ideas as projects. Once these ideas had been expanded and a plan drafted, we began to explore the data from our selected collections. At the beginning of day two our work took new departures as new knowledge was generated. Near the end of day two we performed our archive, which sets up the conclusion.

The morning kicked off with introductions from hackathon organiser David Kelly, who then went on to introduce Prof Seán Ryder and Prof Daniel Carey.

Opening Speakers – Prof Daniel Carey and DH Manager David Kelly

Our opening speakers Prof Daniel Carey and David Kelly highlighted the importance of this event, with references to the state of play in the area of digital humanities in Galway. They explained the vision that they have for digital explorations of collection data, and thanked participants for attending.

Speaker – Prof Seán Ryder

In his introduction to the topic of digital archives, Prof Seán Ryder elaborated on the enormous possibilities for digital ‘exploitation’ (in the good sense): the way that you can re-purpose and aggregate materials. He mentioned how certain types of analysis are now available which are not possible in analogue form – and “that’s the exciting part”. However, he also noted that it is daunting because it requires a whole range of skills and teamwork. It is about aggregating the skill sets of different people, and the challenge is to bring together people who don’t ordinarily get a chance to work together.

He explained that it is exciting because everyone has ideas about what they could do, but daunting because you need to find a common language. The traits that he outlined as necessary were to be:

1) open
2) willing to listen
3) playful
4) experimental

With different methodologies, processes and goals in mind, participants would need to find ways to come together and to build on these experiences. He concluded by commenting that hopefully we will build on these learning experiences for the future.

Next we were introduced to three speakers from three distinct areas – Justin Tonra (humanities scholar), Niall O’Leary (digital consultant) and Andrea Fitzpatrick (artist/designer) in order to present some perspectives on collaboration between different groups.

Speaker – Dr Justin Tonra

Justin introduced himself and mentioned his lack of slides, which correlated with the “ad-hoc spirit of hackathons”. In a show of hands, it was noted that quite a few had been to a hackathon before. Dr Tonra mentioned the “un-conference” idea and the similarities with hackathons in their unplanned and ad-hoc nature, seeing it as an “unprescribed” event.

He thanked participants for attending, and artists in particular. He mentioned the absence of a fine arts programme at NUI Galway, with interdisciplinarity being “somewhat stymied” by that absence (aside from GMIT), particularly as Galway is a city which prides itself on the arts.

Dr Tonra’s example projects introduced stimulating ideas for what can be achieved when working with data.

1) His “Personæ: A Character-Visualisation Tool for Dramatic Texts” (http://www.davidkelly.ie/projects/personae/) showed individual characters’ involvement in a Shakespearean play. This produced patterns and correlations whilst looking into the play text. Dr Tonra described how this model could be used for other dramatic texts.

2) The Ossian project, a network analysis project, focused on literature of the Romantic period and looked at the Ossian poems, which were published from 1760 to the late 1770s. These poems were recently linked to the Irish Ossian tales. Dr Tonra worked with scholars at University of Coventry and Oxford University who were interested in comparative mythology. Mapping the network structure of these texts enabled the researchers to compare a “fundamental level of ‘unobservables’ to the close reader”. The concept was to see how Ossian related to the Irish tales and to other European narratives.

3) Transcribe Bentham (University College London). The concept was to use crowdsourcing to transcribe 60,000 un-transcribed manuscripts by Jeremy Bentham. Dr Tonra used his background in text encoding and designed the toolbar for transcription. Involved in this project were philosophers, digital humanities scholars and librarians, and different people on the project had different priorities. But the “collaboral benefit” of this project was the flexibility in re-tuning your “dial”, or your research question, to other people’s priorities. In this way, agreement on or synthesis of varied research interests was brought to the fore.

Dr Tonra concluded with a phrase from Jeremy Bentham, “Many hands make light work, many hands together make merry work”, and shared a view that “the benefits of collaboration involve novel outcomes from co-ordinating and integrating a number of different voices.” Concepts to take away were the sense that participants needed to be aware of broader expertise, to combine mutual support and new perspectives on a range of issues. Good and clear personal relationships, careful planning and having clear roles and responsibilities for the group were also advised. An emphasis on focusing on the big picture was also highlighted.

Speaker – Consultant Niall O’Leary

Niall O’Leary introduced three recent projects that he is currently involved in:
1) Corpas Stairiúil na Gaeilge – an Irish historical dictionary
2) eSenchas – Dealing with Irish language manuscripts
3) Southem (Settler and Indigenous Writing in the British-Controlled Southern Hemisphere and Straits Settlements from 1780-1870)

Niall presented his own background and experiences as “chequered” in that he started in English and Philosophy, then film and eventually returned to IT, screen writing and web development. He also worked at the Digital Humanities Observatory.

His delivery was advisory in nature but also practical, as he mentioned situations which arose during the development of the above projects. He noted that it is useful to have somebody on a team who can speak to both worlds, the digital or IT on one side and the humanities on the other, and who understands the jargon of each. Niall advised that one of the main things was not to use jargon.

A sense that one needed to get outside one’s own practice pervaded the next section as he noted, “Once you get past the detail of one’s own discipline you start to see that we share common goals.” These growing pains were detailed through each project development discussed.

In the Corpas Stairiúil na Gaeilge project, he explained how he focused on face-to-face meetings, keeping things as general as possible. He described how a “show and tell” attitude and a listening approach allowed him to see where the lexicographers were coming from. As well as agreeing project specifications, participants need to be clear about project goals and about the meaning of the words used to describe the project. The project plan, timeline and dependencies were also noted as highly important.

Southem is his most recent project and had not yet been published, so I decided not to comment fully on this work. However, a few things may be mentioned about it. He highlighted the importance of finding out in clear detail what others wanted from a project. One aspect of O’Leary’s work that I could relate to is that the PDF you are sometimes given by someone outside the web development world does not contain the exact details that a coder needs. This is mostly because there is a lack of crossover between the two areas. The problem was used to illustrate that establishing specific sets of deliverables is an extremely important part of the project development process. He then stressed that the team should aim for face-to-face meetings instead of email.

eSenchas – Irish language manuscript project at Cambridge
In this project, there were very few face-to-face meetings. There was also a need for patience when working on the project. Listening was seen as crucial to project success, and accepting the challenge was also of great importance. Gathering issues in batches was seen as the best way to move through such a project.

Overall, Niall provided a deep insight into the development process and how to make things clear and transparent. He advised us to learn about people and to look at the circumstances within which they are working, to manage expectations, and to explain very clearly what you will do. Most of all, he advised us to allocate roles and stick to them, and not to move the goalposts. As we will see below, this came into conflict with how artists like to operate in a creative space.

Speaker – Director Andrea Fitzpatrick

Lastly, Andrea Fitzpatrick, who is director of the Cimera Art and Science Programme, introduced her projects as an artist. The Enso Art project was presented first. She explained how this project worked by identifying common areas that could be highlighted between collaborators. Andrea noted how the emphasis with this project was on the process and not the product, which resonated greatly with the way that this hackathon event needed to progress.

Andrea noted that “We wanted each person to step outside their own frame of reference in order to find a common ground with which to work on”. She highlighted here that “so much is lost” when you are assigned to and stick to a role. In her approach, she explained that as artists, sticking to roles is quite problematic, and that a freer, more dynamic set of roles is the way forward in order to collaborate on an artistic level.

She was specific in reference to the way that it would be “nice” if each person moved outside of their place instead of the artist being given their role to “illustrate someone else’s ideas”. She expressed how this could be made more exciting if people decided to do something else, by stepping outside of themselves. She reminisced on an older project as a “time of possibilities, people sharing ideas” and how “there’s something very … rare about that”.

What Fitzpatrick liked about some of her projects was that the emphasis was not on the end product, and in that way, the participants were left with “so many possibilities”.

Two other projects were also introduced, “Symbiotica” and “Artisanal Labs, Zurich”. Of note with these spaces is that the artist is given free rein to produce art on its own terms, and more importantly, not as part of commercial enterprise.

Questions from the Floor

The friction relating to the “use” of artists for illustrative work at the end of the development of a project was raised during questions by some of the artists participating. One artist asked, “what is the actual outcome you are hoping for? … in my work it is completely about communication.. can you verbalise what is the artist’s role?”

Another commentator from the humanities claimed that “Working within disciplinary boundaries there are greater strictures on how you can do that” – philosophers for example have to respond to these questions based on their discipline, but artists have only themselves to base this upon, therefore giving artists more freedom. A lull ensued.

Another digital humanist described how their work involved “beautiful” works of art created through natural language processing, which was countered by a comment, “yeah well is art just beautiful then?”

The DHer replied that “I’m not saying that art is always something that’s beautiful but maybe my conception of art is different than yours and then I think that what is beautiful is art”.

This friction was also highlighted by one of the digital humanities participants who explained how he as a coder did not interact with artists and therefore wasn’t fully aware of the “role” that the artist might play in a hackathon setting. This led to a great sense of unease, which I felt was a positive ice-breaker for the questions of roles in the event.

In conclusion, the last speaker summed up some of the positive aspects of the significance of including artists by explaining that, “art is exceptionally good at handling ambiguity”.

Ideas as Projects

As a digital humanist / coder I immediately set about thinking about ideas for a collaborative project. Some sample datasets that were built by researchers at NUI Galway included:

  • Duanaire – a collection of datasets related to Irish economic history. One example is Customs15, which is made up of quantitative trade data spanning over 100 years, with data on locations and types of goods, along with high-resolution digitised images of the original source manuscripts.
  • Earlier Latin Manuscripts – A collection of data and high-resolution images of Latin manuscripts published before the year 800.
  • Landed Estates – Data, including location and images, on landed estates and historic houses in Ireland (c.1700 – 1914).
  • Tim Robinson Archive – An index describing 567 town-lands in Aran and Connemara. This draws together information on the language of local place-names, folktales, and historical, geological, archaeological and botanical information from each town-land.

There was also an NUI Galway library list of available data, which included: Abbey Theatre Minute Books, The Balfour Albums, Brendan Duddy Papers, Cusack Papers and the Ritchie-Pickow papers.

Some external datasets listed were: Europeana, UCD Folklore, Tate, V&A, DRI, Guardian News, New York Times, Gutenberg and Spotify.

In thinking of a conceptual idea here, I decided to concentrate on linking collections through topics which were related but not necessarily correct representations when linked together. For example, the Abbey Theatre Minute Books might have been linked to the YouTube or Spotify API in order to perform songs that might have featured in strategic conversation at the Abbey. This might have been linked to pantomime memorabilia of a similar time period from the V&A. Through “playing” with data like this, I imagined that I might re-create a general sense of the data within these collections without the representation necessarily needing to be a “correct” linking of data.

I therefore came up with the concept of “skewing truths in an archive”. This idea resonated with some artists, and coders were also interested. It meant that intentional “truths” placed upon the structures of data collections would be performed by us in order to highlight how data may be useful or nervously inconsistent as bearers of “truths”.

Exploring the Data

We discussed a proposed outcome for the project, and named it the “Pantomime Archive”, which would be an abstract “performance” of collection data that would highlight most frequently appearing song titles and participants at Abbey Theatre meetings. The idea would be that each song and participant would be weighted through a ratio of appearance in a presentation.
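
The weighting idea can be sketched in a few lines of Python. This is a minimal illustration, not code from the hackathon: the function name appearance_weights and the sample mentions are hypothetical, and it assumes that the "ratio of appearance" simply means each item's share of all recorded mentions.

```python
from collections import Counter


def appearance_weights(mentions):
    """Return each item's share of all mentions (the weights sum to 1)."""
    counts = Counter(mentions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {item: count / total for item, count in counts.items()}


# Hypothetical mentions, invented for illustration (not from the minute books).
sample_mentions = ["Song A", "Song B", "Song A", "Participant X"]
weights = appearance_weights(sample_mentions)

for item, weight in sorted(weights.items(), key=lambda pair: -pair[1]):
    print(f"{item}: {weight:.2f}")
```

A presentation could then give each song or participant screen time or visual size in proportion to its weight.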

We found that the concept of bringing collections together was great in theory. However, as we began to explore the data in more detail, it quickly became apparent that a large percentage of the minute notes concerned play titles rather than songs. We discussed this finding and considered how we might approach the collections from a different angle.

New Departures

Our collaborator Lucas from the Insight Centre had written some Python code to scrape all of the information in the minute books from the web. As this was completed in less than thirty minutes, we were in a position to make some inferences about all seven books. First of all, we carried out some close reading and noticed that W B Yeats was mentioned a number of times. This led us to ask whether Lady Gregory had been mentioned and, if not, why not.

We found that Yeats had been mentioned more than five times as often as Lady Gregory. Immediately we thought of gender imbalance, of the possibility that the data was incorrect, and that the algorithms in general were skewing the truth about the data.

Lucas decided to process the same set of data through a natural language processing tool, the Stanford CoreNLP toolkit, in order to explore this data from a different angle. As a result, it became apparent that W B Yeats had signed off on the minutes of a large number of meetings, whilst Lady Gregory had only signed off four times in total.

This led us to reflect on the data and to think about the people at these meetings. I began to run some concordance analysis on the frequency of surnames, extracting data on the three most mentioned surnames from each of the seven books. This revealed that Yeats did indeed become heavily involved in writing the meeting minutes. More importantly, however, we saw that Yeats stopped writing minutes after book five (in 1936) and Ernest Blythe took over.

Screenshot: concordance data from 1936-37 showing the three most frequently mentioned surnames in the minute book from that era. Note the absence of W B Yeats.
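
For readers who want to try a similar surname count without concordance software, here is a rough Python sketch. It is an assumption-laden illustration rather than the analysis we ran: the function top_surnames, the list of surnames to look for and the sample text are all hypothetical, and it assumes each minute book is available as plain text.

```python
import re
from collections import Counter


def top_surnames(text, surnames, n=3):
    """Count whole-word, case-insensitive mentions of each surname
    and return the n most frequent as (surname, count) pairs."""
    counts = Counter()
    for name in surnames:
        pattern = r"\b" + re.escape(name) + r"\b"
        counts[name] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts.most_common(n)


# Invented sample text standing in for one minute book.
sample_book = (
    "Mr Yeats proposed the programme. Lady Gregory agreed. "
    "Mr Yeats signed the minutes. Mr Blythe raised the accounts. "
    "Yeats and Gregory left early."
)

print(top_surnames(sample_book, ["Yeats", "Gregory", "Blythe"], n=2))
```

Running the same function over the text of each book in turn would give one short ranking per book, which is the kind of comparison described above.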

 

After this discovery, a number of Python scripts were run in order to plot the frequency of Yeats’s involvement with the minute books. Lucas added the following data:

Graph showing the mentions of Yeats within each minute book. Notice the absence of mentions before 1918, and then a significant drop-off after his growing involvement before the 1936-37 era.

As each tool was used to explore the data, we noted where our digital tools failed to reveal “true” representations of the data. We reflected after each stage of this process with our other collaborator Gianna. Very often, using two different digital tools revealed errors in our approach. This forced us to return to the data to re-jig our code / concordance search but, more importantly, to reflect on the intention of each part of the hacking process.

A number of other interesting discoveries were then encountered through further concordances. Firstly, we noted some key words that we thought would be pertinent to the data. Issues, decisions and other descriptions of actions taken were mined to see where they were most prominent within the data. A knowledge of the role of the minute taker became another topic of discussion.

The following words were selected for concordance: Raise, Proposal, Difficulty, Decision. Others considered included Aim, Should, Repertoire. Within each concordance of the first set, we found variants of each word. We discovered this by viewing each keyword in the context of the line where the software had found it (KWIC). It was necessary to re-run the concordance software with the root words as indexes (for example, rais would catch both raising and raised). Some other problematic results inhibited our “true” search of the data, such as cases where Raise did not mean that a question was raised, but that a physical floor needed to be raised or that salaries could not be raised.
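
The root-word search can be approximated with a regular expression. The sketch below is a hypothetical stand-in for the concordance software, which is not named here: the function kwic, the context width and the sample lines are all invented for illustration. It matches any word that begins with the given root and returns it with its surrounding context, so that ambiguous hits can be checked by eye.

```python
import re


def kwic(lines, stem, width=30):
    """Return (line_number, left_context, match, right_context) for every
    word that starts with `stem`, ignoring case."""
    pattern = re.compile(r"\b" + re.escape(stem) + r"\w*", flags=re.IGNORECASE)
    hits = []
    for number, line in enumerate(lines, start=1):
        for match in pattern.finditer(line):
            left = line[max(0, match.start() - width):match.start()]
            right = line[match.end():match.end() + width]
            hits.append((number, left, match.group(0), right))
    return hits


# Invented lines, not quotations from the minute books.
sample_lines = [
    "The question of fees was raised by the board.",
    "Raising the stage floor was approved.",
    "Salaries could not be raised this year.",
    "No praise was recorded.",
]

for number, left, word, right in kwic(sample_lines, "rais"):
    print(f"{number:>3} | {left:>30} [{word}] {right}")
```

The word boundary keeps "praise" out of the results, but the search still cannot tell a raised question from a raised floor or a raised salary. That distinction needs a human reading the context column.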

It was surprising to see how the word “diffic(ulty)” performed. We noticed that in each successive book, the word became more frequent.

Another exploration that we conducted concerned the frequency of lines within each meeting minute book. We found that through web scraping we could extract all lines of the minute books, but a problem lay in the way that transcribers had reproduced what was written in the books. In some cases, new lines were entered by the transcribers themselves. This introduced another layer to the “intentionality” involved with the collection. Furthermore, this produced different data, which skewed the true number of lines that we could attribute to each minute contributor.
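
To show how much a line count depends on the treatment of line breaks, here is a small Python sketch. It is only an illustration: the function count_lines and the sample text are hypothetical, and the "merged" figure rests on a crude assumption that a line not ending in a full stop, question mark or exclamation mark continues on the next line. We had no reliable way of telling a transcriber's line break from the minute taker's, so this heuristic should not be read as a solution.

```python
def count_lines(text):
    """Count lines three ways: raw, non-blank, and merged.

    "Merged" joins a line to the next one when it does not end in
    terminal punctuation (an assumed heuristic, not a reliable rule).
    """
    raw = text.splitlines()
    non_blank = [line for line in raw if line.strip()]

    merged = 0
    entry_open = False  # True while the previous line looks unfinished
    for line in non_blank:
        if not entry_open:
            merged += 1
        entry_open = not line.rstrip().endswith((".", "?", "!"))

    return {"raw": len(raw), "non_blank": len(non_blank), "merged": merged}


# Invented transcription fragment for illustration.
sample_text = (
    "The board met on Friday\n"
    "and agreed the programme.\n"
    "\n"
    "Mr Yeats signed the minutes.\n"
)

print(count_lines(sample_text))
```

Three different totals from the same short passage make the point: any figure for lines written per contributor depends on a counting rule that someone chose.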

From the data in the concordances, the Python code and the subsequent Excel listing, some further observations were made. The following D3 visualisation was built to show the number of lines found within the minute books (in blue) vs the number of times that plays were mentioned within each book (in orange).

At this point, we began to think of our project as a self-reflexive evaluation of us as “explorers” and that the project title might be called “Selected Authors from the Archive of Intentional Neutrality” – we were attempting to be neutral in our digital exploration but it was failing under many errors in the exploration.

Performing our Archive

In order to ask further questions about what we had explored, we decided to represent this data on physical books from the library. With access to the Hardiman resources, we were in a position to gather books that related specifically to the variety of subjects that we had explored. For instance, gender imbalance prompted us to select the text “Gender Injustice”, whilst “The Irish Storyteller” served as a reminder that the type of data we had chosen to look at may be more ambiguous in content than we perceive it to be. Similarly, the book “Information is Beautiful” seemed to sit in eerie contrast to the errors and achievements that we had just experienced with the data we had explored.

This then allowed us to “project” text onto those textbooks, in order to highlight the ways in which “intentionality” is affixed through the structures of our media formats, but how they can also be perceived as flexible, more ephemeral processes of knowledge making.

Collections as Data – Hackathon Part III from patrick egan on Vimeo.

A number of other projects at the final presentation echoed the interesting process of exploration that we had encountered by working together. Some were more straightforward in their ideas of what they were searching for, and their methodologies differed significantly from ours. Others had extracted data which was clearly useful to researchers and the public in general (https://samehkamaleldin.github.io/irish-military-females-map/).

Collection as Data workshop: Mapping of addresses of Irish females who participated in the Easter Rising, the War of Independence and the Civil War in Ireland, 1916-1923, according to military service pensions archives.

Conclusion

The aim of this event was to explore what people from diverse backgrounds can create when they work together. During this exploration, a number of spontaneous changes in direction occurred: our project began by attempting to link archives but instead reconfigured metadata from one archive. Through this process a range of inconsistencies arose from the use of digital tools, and critical reflection on the results was therefore essential to our understanding of how we perceive data from three different perspectives. In some cases I became the coder, in others I became the guide between artist and coder.

The end goal for this hackathon was for participants to walk away with a community of support, and an idea of the possibilities of using collections as data. It is clear that a number of pertinent issues exist with regard to communication between engineers and artists. However, it is only through a comprehensive understanding of each other’s perspectives that we can really develop collections data. This conversation must continue. Perceptions of data must be transparent, but more importantly, so too must understandings of the processes involved. As humanists, scientists and artists, we must begin to explore how we want to relate to digital collections.

Our project entitled “Selected Authors from the Archive of Intentional Neutrality”

Patrick Egan
04-12-2017