Liza Harrell-Edge is currently Manager of Digital Initiatives at the New School Archives and Special Collections. She previously worked at NYU’s Fales Library on collections including the Kathleen Hanna Papers, the Erich Remarque Papers and the Sylvester Manor Archive.
Using Beautiful Soup with Python for Webscraping
BY KAREN H.
Topic(s):
- Introduction to the process of webscraping, using Python and Beautiful Soup
Audience:
- People who want to understand the process for extracting data from web pages, especially in situations when direct access to the backend database might not be possible;
- People who program in Python and want to know more about the HTML parser Beautiful Soup;
- Digital humanists, scientists, infographic designers, etc.
Distilling nlp4arc 💦
Notes by Katie M.
This past week I traveled to the University of North Carolina for nlp4arc, an intimate symposium marking the start of BitCurator NLP (this Andrew W. Mellon funded project is aimed at developing a suite of natural language processing tools for archives). The meeting opened with 11 presentations by educators and archivists who shared their experiences building and applying NLP to analyze digital collections. Our second half was scheduled to be more of an ‘unconference,’ with group-selected topics of interest to be discussed in smaller circles. Unfortunately, midway through, the university announced Chapel Hill’s water supply was being shut off immediately due to a county-wide water emergency, forcing us to evacuate while discussing things like the frozen NYPL in The Day After Tomorrow, and “preppers.”
Despite this interruption, we had enough time to review active and closed projects, and walk away with ideas that should be considered or incorporated into future software. Here were my personal takeaways:
Your name is a small part of your identity
Daniel Pitti chose a more theoretical approach to his talk, focusing on the challenge of identity and, in the context of NLP tools, the limitations of a ‘name’ entity. He described the makeup of an individual as being part physical person (what we see when we people-watch) and many parts social person (work-self, hobbies-self, friend-self, etc.), none of which are represented by a name.
“To form a “reliable” identity we must triangulate across multiple sources providing mutually corroborating facts and contexts assembling fragments into a constellation that “identifies” that person.”
Are we looking for questions or answers?
This point was expressed by attendee Stephanie Haas, a UNC professor with over 20 years of NLP research and experience. When conversation circled around the responsibility of an archivist versus that of a researcher, she responded by questioning our expectations of natural language processing. Effective platforms may expose new lines of inquiry through dynamic arrangement, but we may never find an application that will allow us to touch a document just once.
Communities sustain projects
—(this practical advice is a point I continue to revisit)
Our final presentation was delivered by Carl Wilson, tech lead of OpenPreservation.org. He mentioned a number of projects that he described as fascinating and complex but ultimately unsuccessful. Many projects mentioned over the course of the morning contained a common thread of frustration with being unable to sustain the work, citing issues like tech challenges, lack of funding and low use. Yet Wilson makes the point that when communities care, anything is sustainable. If a user community is too exclusive, it resists the kind of expansion and care that arises through community-formed documentation, bug reports, feature requests, etc.
On that note, I was left considering how BitCurator NLP is currently at a stage which holds the most potential: the beginning. At the next symposium maybe the conversation will be interdisciplinary, inviting non-archivist/academic voices to discuss their experiences (and more diverse as well; ten of eleven nlp4arc speakers were male). This is an opportunity to develop a platform that will be accessible to a community of users, not just select experts.
Tools mentioned:
veraPDF
ArchExtract
Voyant
Stanford NLP
CMU Sphinx
ePADD
Conference Redux: Molly went to INST-INT in New Orleans
Calling all in need of some Monday inspiration! Last week I attended INST-INT in New Orleans from January 22 – 24. About 300 artists, designers, activists, and engineers (plus a couple librarians, woot woot!) gathered in an intimate jazz market for two days of talks and demonstrations about the Art of Interactivity, interspersed with evening musical programs and design demos at venues around New Orleans. The scale, complexity, and creativity of the work on display were truly mind-blowing, ranging from musical swings in city centers to self-sustaining waterpods to folded paper structures that turn into a planetarium with the help of your smartphone’s flashlight. Some of the projects were serious, others playful, some massive, others tiny, some machine-based, others decidedly non-digital.
One of the major conference takeaways for me was how the essence of interaction is collaborative, and therefore it often takes large teams to pull off any one of these installations. As Rafael Lozano-Hemmer emphasized in his presentation, too often interactive media art is categorized as visual arts, whereas in truth it is closer to film: it is both time-based and event-based. Therefore each project should include a list of credits, like a film does, attributing credit to all of the people whose effort it took to create it.
Since these projects are better shown than told, I’ve included a list of videos about some of my favorites in this blog post for you to explore:
Melissa Mongiat and Mouna Andraos presented on the work they did with their Daily Tous Les Jours studio to build a collective musical instrument using swings in multiple cities:
Dr. Rebecca Fiebrink showed us how to use Wekinator to create musical synthesis:
Mary Mattingly presented about her work on a self-sustaining Waterpod in New York City:
Waterpod Project from Mary Mattingly on Vimeo.
Rafael Lozano-Hemmer gave an inspiring talk about the broader meanings of interactive media art and showed us many of the installations that his prolific studio has worked on over the past two years, including this one, “Call on Water,” which writes words from the poems of Mexican writer Octavio Paz in mid-air with plumes of air from a water basin:
L05, Wesley Taylor, and ill Weaver of the Detroit-based art collective Complex Movements shared their project, Beware of the Dandelions:
Delaney Martin and Jay Pennington shared the work they did to create a Music Box Roving Village in New Orleans and elsewhere:
New Orleans Airlift – Music Box Roving Village: City Park 2015 from New Orleans Airlift on Vimeo.
Refik Anadol walked us through his journey to the work he does today, including this public art installation at the 350 Mission building in San Francisco:
Virtual Depictions: San Francisco / Public Art Project from Refik Anadol on Vimeo.
Kelli Anderson rounded off the amazing three days with a presentation of the creative work she’s done with paper as an interactive medium, including engineering paper into a working camera:
There were also many amazing five-minute “show and tell” presentations. Here are a couple of the projects presented:
Diffusion Choir by Hypersonic, Sosolimited, and Plebian Design:
DRYADS from Dave and Gabe (with digital fabrication help from Gamma NYC):
I hope you enjoy these videos as much as I do. Until next time,
Molly
p.s. I also gave a quick lightning talk about my project at METRO, calling for collaborators and ideas (which actually worked, thanks to everyone at INST-INT who approached me!).
Catch up on the latest episodes of Library Bytegeist! 🎧
You can stay updated with monthly audio stories from the libraries, archives, and museums of New York City by following our podcast on SoundCloud, iTunes, or Stitcher. Here are summaries of our last three episodes, produced and hosted by Molly Schwartz as part of her METRO fellowship project:
Episode: #4 Talking Pop-up Media Migration with the XFR Collective’s Rachel Mattson
In this episode, Molly talks with Dr. Rachel Mattson about her work as a member of the XFR Collective and about how this all-volunteer group of over 14 members does the work that it does, partnering with artists, activists, individuals, and groups to preserve at-risk audiovisual media (especially unseen, unheard, or marginalized works, like this gay wedding celebration) by providing low-cost digitization services. Please read below for more information about the XFR Collective and the tools we used to produce this podcast.
Here is a link to a rough transcript of the episode: docs.google.com/document/d/1Yf3iD…/edit?usp=sharing
Related Articles and Links
Email Mis/Management 📂
(NOTES by KATIE M.)
Seemingly reckless email use had a major impact on this year’s presidential campaign—this was framed as both a matter of secrecy and irresponsible record keeping—but a central issue is the sticky nature of the format. As a new technology, email in its founding years was disregarded as an informal communication mode until significant legal cases in the early 2000s raised the status of information transmitted in this form. Advancements in the format have only increased the impact of mismanagement as messages more easily proliferate and storage costs drop. Let’s review some recent instances where email records have made headlines.
2012
The Account of Richard Windsor
EPA Administrator Lisa Jackson becomes the target of an audit after she is discovered to have a private account under the nom de plume “Richard Windsor,” a combination of her former residence and family dog’s name. In defense, “assigning a secondary email account to the administrator at EPA is not new to this administration. The intent, the agency says, is for the administrator to have a manageable email account in addition to the one that is openly available to the public.” http://politi.co/2gPAY3i
2013
The IRS Gets Personal
Lois Lerner, an IRS official, is asked to turn over six years of official correspondence after she is suspected of having used a personal email address to discuss the targeting of specific political groups interested in tax-exempt status, namely those related to the Tea Party movement. http://bit.ly/2gc9vca
2013
Sekisui Medical America v. Hart
Sekisui fails to provide evidentiary documents during a financial dispute with a client, claiming “during that period, the business unit’s HR director deleted the relevant emails because they were cluttering the company’s servers.” This results in a renewed awareness of the need to redefine rules governing electronic discovery, producing a number of legal and social conclusions.
- Legally, there isn’t a difference between negligent destruction to save space or to hide evidence. As Judge Shira Scheindlin states, “The law does not require a showing of malice to establish intentionality with respect to the spoliation of evidence.” http://bit.ly/2gcKrlm
- Further, while organizations retain responsibility for their electronic records, legal sanctions regarding electronic discovery need to be updated once more to fit with the evolution and proliferation of email. http://bit.ly/2gN8x7i
- Emails need to remain in their original format. Sekisui’s practice of printing important emails for their archive instead of retaining the electronic form contributed to a significant loss of data. This was supported in court by referencing a tool newly developed by the MIT Media Lab to visualize a person’s professional and personal history using compressed email metadata: https://immersion.media.mit.edu/.
2014-2016
hdr22@clintonemail.com
In 2009, Hillary Rodham Clinton becomes secretary of state and begins using hdr22@clintonemail.com for personal email. After a committee is formed in 2014 to investigate the 2012 Benghazi attacks, the State Department requests all of her emails, and private and public reviews of her correspondence ensue. One year later, the investigation is continued by the FBI, impacting HDR’s presidential campaign in 2016. http://nyti.ms/1TNFlMg
2014-2016
A federal investigation into Bridgegate uses NJ Gov. Chris Christie’s personal and government email accounts for evidence. Two and a half years later, messages from a private account between Christie and his wife contribute further information about his involvement in the matter. http://bit.ly/2g0XoRy
These are extraordinary examples of unexceptional issues related to records management: blended personal and professional correspondence, reactions to the cost of space, and unclear legal retention periods. Legal aspects may similarly inform policies within arts organizations but collecting for institutional memory requires separate attention to long-term preservation and access. Over the past few months, I’ve been considering the retention policies surrounding electronic records in place at BAM and SRGM in an effort to understand how they may or may not be compatible with the everyday habits of staff. It has exposed how email, as a digital format, exists on the periphery of electronic records management in the broad category of ‘correspondence.’ In a way, this neglects the complexity of a mode that allows a thick web of overlapping data to form through attachments, forwards, personal talk, and professional updates. I thought it might be helpful to review recommendations provided to government staff during email management training.
“Digital curation is not simply a matter for those charged with care of resources at the end of their active lives, for the term ‘digital curation’ refers to the ongoing management of digital materials for both current AND future use. Curation issues are relevant from day one of the records life-cycle, from creation through to curation and including re-use of the data.”
-Maureen Pennock (2006)
1. Is it a record?
As Geof Huth notes in his recommendations for digital appraisal, in the context of an institution addressing its own records, appraisal is usually an extension of records scheduling. Using the example of an arts institution, a retention schedule may suggest the following record categories for a valued creative department:
- Collection Object Files & Artist Files
- Exhibition Files
- Correspondence: significant, routine, and related to other records (*including email)
- Education Programming
- Research Files
Organizing messages and evaluating the significance of content delivered through email is easier when reviewed in relationship to each record category, rather than as an extension of ‘correspondence.’ For example, receiving an email with a link to download images of artworks may fall under ‘Research’ rather than ‘Correspondence,’ and can be deleted after extracting the files. This approach relies on communication and collaboration between all participating departments to curate content accordingly.
- Collection Object Files & Artist Files: Information on artworks and artists in collection, including documentation on the installation of an artwork. For permanent collection works, includes treatment reports and incident reports related to artworks. Also includes artist interviews.
- Exhibition Files: Information collected and created by conservation related to exhibitions including traveling, foundation and affiliate exhibitions. For loans, includes treatment reports and incident reports related to artworks.
- Correspondence: Correspondence that documents important activities, events, operations, policy changes, etc. Correspondence with artists (email, audio/video recordings).
- Education Programming:
  Programming: Includes documentation [correspondence, contracts, planning, lectures, surveys] on programming such as symposiums, and conversations with contemporary artists.
  Visual Materials: Includes photographs, videos, and/or films of programs, residencies, publicity, etc. as well as photo permissions.
- Research Files: Information collected and created by curatorial related to collections, provenance, and other topics.
2. Is it related to your job?
- Personal mail may seem like the most obvious non-work related message type. A simple solution would be to make a list of friends and family who you correspond with regularly and filter those messages into their own folder.
- CCs are often courtesy copies shared to keep different parties up to date on a project, but a copy does not need to be maintained. Unless you go on to contribute to the conversation in a notable way, CCs can be deleted.
- Unsolicited messages, even when work-related, can be deleted. This includes newsletters, office updates, PR announcements, etc.
3. Are you the custodian?
- The person who wrote and sent the email can be considered the custodian of the record copy.
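The three questions above can be read as a rough decision procedure. Here is a minimal Python sketch of that reading. The `Message` class, the `triage` function, the label strings, and the example addresses are all hypothetical names invented for illustration; they do not come from ePADD or from any institution's actual retention policy, and a real schedule would need far more nuance.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    """Hypothetical, stripped-down stand-in for an email message."""
    sender: str
    to: list = field(default_factory=list)
    cc: list = field(default_factory=list)
    unsolicited: bool = False  # newsletters, office updates, PR announcements


def triage(msg, me, personal_contacts, contributed=False):
    """Suggest a handling label for one message (illustrative only)."""
    # Question 2: personal mail is filtered into its own folder.
    if msg.sender in personal_contacts:
        return "personal-folder"
    # Question 2: unsolicited messages can be deleted, even when work-related.
    if msg.unsolicited:
        return "delete"
    # Question 3: whoever wrote and sent the message holds the record copy.
    if msg.sender == me:
        return "keep-record-copy"
    # Question 2: courtesy copies can go unless you contributed notably.
    if me in msg.cc and not contributed:
        return "delete"
    # Anything else returns to question 1: check it against the record categories.
    return "review-against-schedule"


me = "curator@example.org"
family = {"sibling@example.com"}

# A courtesy copy on a thread I never replied to.
print(triage(Message(sender="colleague@example.org", cc=[me]), me, family))  # prints "delete"
```

The order of the checks is a judgment call: personal mail is separated first so that it never reaches the work-related rules, and anything the rules do not settle falls through to a manual review against the retention schedule.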
These recommendations are meant to be applied at the creation stage to influence routine maintenance of email correspondence. Additionally, desktop and cell phone applications offer private access to the creator, which leads to the perception of email existing as a data source for personal reference only, unlike departmental records which are usually saved to a shared drive. Training can alter this view by making connections between these records and the development of an organization’s history. Tips for curating email have contributed to my review of two email accounts provided by BAM and SRGM, the first containing a valuable record of a project’s development, and the other a broader reflection of the institution formed over the course of a director’s career. Over the next posts, I will be exploring what it has been like to use ePADD to appraise records while considering a digital curation (and risk management) framework, and how conversations with staff and other archivists have continued to reveal the complex qualities of this format.
Huth, G. (2016). Appraisal and acquisition strategies. Chicago: Society of American Archivists.
Pennock, M. (2006). Preservation of e-mail messages. DCC Digital Curation Manual, S.Ross, M.Day (eds). Retrieved from http://www.dcc.ac.uk/resource/curation-manual/chapters/curating-e-mails
Texas State Library and Archives Commission (2013). Email management [PowerPoint slides]. Retrieved from https://slrmtraining.tsl.texas.gov/index.php?
Toward a Mindful Media Studio 📻
(hei there this is molly here)
In this post I’ll give you a little overview of my fellowship project designing a media studio at METRO’s new offices at 599 11th Avenue over the next nine months, as well as some background about my inspiration and motivation for this work. In the spirit of the studio, I am producing a podcast with audio stories from across the libraries and archives of New York City. Tune into Library Bytegeist and look out for updates here as studio construction continues:
My Fellowship Project: Designing a Media Studio
The purpose of the studio will be to provide a welcoming and inclusive space that is equipped with different kinds of media, old and new and in-between, analog and digital and virtual. This includes digital forensics equipment, a graphics station, and media for the production of audio, video, and web content. In an open loft space with huge windows overlooking the Hudson River on one side and the bustling streets of Hell’s Kitchen on the other, METRO’s Media Studio will be a physical space where the library, archive, and museum professionals from across METRO’s 250+ member organizations can gather, experiment, and create with tools we use to tell our stories and our histories in an increasingly media-saturated world.
As Erik Boekesteijn from the innovative Doklab at the Delft Public Library in the Netherlands put it, “librarians of today are the media guides of tomorrow.” With multimedia creation becoming ever more accessible and diverse, librarians are faced with the task of becoming fluent users of and able guides to digital, analog, and virtual media that are constantly changing and evolving. Libraries and archives have always brought a level of institutionalized structure and mindful intention to the collection, organization, and preservation of, and access to, the different media that tell our stories and document our pasts, and the studio will be a space to help them do so in the age of smartphones, virtual reality, and whatever comes next.
Analog ←→ Digital: it’s a spectrum
Although it might not seem obvious from my intention to design a tech-filled studio, my fellowship project was inspired by one reverse pitch in particular, submitted by the Art Resources Transfer. The title of this pitch was “Creating access through print collections: the role of books and print literacy today.” The pitch was calling for new ways to share the Art Resources Transfer’s amazing print collections with a public that seems increasingly occupied with new digital media. Even as excited as I get about the possibilities of new technologies, especially as they could be applied in the cultural sector (virtual reality trips to the museums of the past! open access to scanned books!), I came to library work due to my longtime love of books and print literature. Research in psychology and cognitive science has only come to confirm something that I have known for a long time from my own experience: reading print literature does something to the brain that increases focus, attention, absorption, and relaxation in beneficial ways. That is part of why I like it so much. But similar brain states, characterized by psychologist Mihaly Csikszentmihalyi as a state of flow, can also be activated by participation in a wide range of activities, such as playing chess, going jogging, working on a puzzle, playing a musical instrument, or doing yoga.
As Csikszentmihalyi outlines in his book “Flow: The Psychology of Optimal Experience,” this state of flow, when we lose our sense of self and feel deep senses of exhilaration and enjoyment in our human experience, is reached through experiences rather than through the elusive satisfaction of new material possessions.1 That is why, in the design of our media studio, I would like to focus less on the characteristics of the cool new media that we buy and more on the types of experiences we create when we use the equipment in the studio to create media that tell the stories of the libraries, archives, and museums of New York City and Westchester County.
I was struck by the ways that discourses around media and technology oftentimes characterize the formats that we use to communicate information as being new or old, advanced or obsolete, an inevitable one-way street of technological progress in which different types of media are seen as competitors vying for our attention. In such a zero-sum paradigm, the increasing popularity of video production is inherently threatening to print literature and the internet threatens the existence of libraries. While I think that time has proven such predictions to be mostly alarmist and unfounded, they hold powerful sway over how we view media, in ways that I think limit how we experiment within and among different types of media. Let’s question and blur some of these false dichotomies by turning away from looking at media as pieces of technological equipment, and toward looking at media as human experiences.
Mindful Media + Intermediation
It is interesting to think about what we are actually seeking when we use media: are we discovering knowledge, forging human connections, understanding our past, escaping into an alternative reality? Thinking about media usage as an experience lets us bring some awareness to how we use it in our professional and personal lives, in a similar way that mindfulness practices help us bring awareness to our feelings, thoughts, and bodily sensations. This could open up new possibilities for more intentional, and potentially creative, usages of media. Therefore I had the idea: what if, instead of conceiving of our studio space as a new media lab, we conceive of it as a mindful media studio? Such a distinction gets more at the philosophy behind building a studio than what the space looks like in practice. Both types of spaces would probably have computers, cameras, printers, recorders, etc. But by focusing on the ultimate usage of the media, we can design a space that is more iterative, low-cost, low-stakes, high-impact, and flexible, as our Executive Director Nate Hill described in his blog post. We want to create a space that allows for the creative reuse and appropriation of materials in a way that works for our member institutions as the technologies inevitably change over time.
To give an example of the type of creative reuse I mean, I would like to turn to something that is both a popular internet phenomenon and the subject of my research as a PhD student: ASMR videos.
ASMR stands for “autonomous sensory meridian response” and is used to describe a type of video that triggers relaxing tingles for the viewer through the use of aural and visual stimuli. ASMR videos have become incredibly popular on YouTube, the most watched ones garnering upwards of 2 million views. Here is an example of what one looks like (I strongly suggest putting on your headphones):
To come back to the question of how we can creatively think about the role of printed books and how we engage with them in an era when libraries are increasingly focused on digital technologies, these types of ASMR videos are extremely interesting. These are examples of print literature being used within videos purely for the relaxing effect of their materiality as human hands tap on the covers, flip the pages, and trace the words. Many ASMR videos feature other forms of media, things like keyboards, smartphones, and computer screens, but ASMR creators use them as purely material objects rather than as information-carrying materials. It is a classic example of remediation, in which a medium uses the format of another medium as its content, like the use of videos, pictures, and words as the content of much of the material found on the web, or the ornate decorated letters in medieval texts, or the icons on your computer shaped like paper files and floppy disks. Media scholars Jay David Bolter and Richard Grusin describe the way that old and new media are constantly blurring and interacting with each other, especially as digital media continues to disrupt the landscape:
Older electronic and print media are seeking to reaffirm their status within our culture as digital media challenge that status. Both new and old media are invoking the twin logics of immediacy and hypermediacy in their efforts to remake themselves and each other.2
The Mindful Media Studio will ideally be a playground for this kind of thinking and experimentation. Libraries, archives, and museums are unique in that they have rich stores of diverse materials that have been collected, organized, and preserved over decades. The possibilities for what we could do when we rethink, blend, and remix these materials across genres and sensory experiences are extremely exciting and will hopefully lead us to use new technologies as tools for mindfully engaging with our cultural histories.
Thanks for reading, listening, and everything you do,
Molly
1. Csikszentmihalyi, Mihaly (1990). Flow: The Psychology of Optimal Experience. New York: Harper and Row.
2. Bolter, Jay David and Grusin, Richard (1998). Remediation: Understanding New Media. The MIT Press. 5.
A Look at Institutional Email 🖥
Hello!
In this first post, I’d like to give a bit of background about my proposal, and the logic guiding which tools, software, and research I’ve chosen to discuss in the posts to follow.
SIP AIP DIP Anxiety
My project developed out of pitches submitted by two archives engaged in recent efforts to strengthen (and simplify) their born-digital workflows, the Brooklyn Academy of Music (BAM) Hamm Archives and the Solomon R. Guggenheim Museum (SRGM). As a performing arts venue and an art museum, these two institutions operate on similar cycles of exhibitions/performances, during which high-value institutional records are created regularly by programming and curatorial staff. For both, institutional email preservation was highlighted as an area in need of attention. As at many of their contemporaries, only informal or very broad record retention policies exist for email, and internal education about organization hasn’t been consistent. Although both have similar goals for realistic incorporation of email into record management schedules, and options for access, they each offer very different examples of accounts and staff management. This suggested an opportunity to consider a cross-organization framework for email archiving.
Our First Podcast: “Library Bytegeist” is live!
This is an introduction and discussion with the three METRO fellows: Katie Martinez, Karen Hwang, and Molly Schwartz. They talk about why Google searches don’t take you inside library catalogs, how email preservation is becoming more of a priority, and what the opportunities are for libraries and archives to adopt new media for storytelling. Subscribe to our RSS feed on SoundCloud and never miss an episode: https://soundcloud.com/librarybytegeist
Intro and closing music is “Magic” by Otis McDonald.
Congratulations, Karen
Karen is on maternity leave until January 2017, and she will be extending the end of her fellowship until August 2017. You can look forward to seeing her on the blog after the holidays!