{"id":30883,"date":"2016-10-05T11:00:22","date_gmt":"2016-10-05T15:00:22","guid":{"rendered":"http:\/\/libapps.libraries.uc.edu\/liblog\/?p=30883"},"modified":"2016-10-04T09:12:00","modified_gmt":"2016-10-04T13:12:00","slug":"behind-the-scenes-with-ucs-digital-archivist-finding-the-needle-in-the-haystack","status":"publish","type":"post","link":"https:\/\/libapps.libraries.uc.edu\/liblog\/2016\/10\/behind-the-scenes-with-ucs-digital-archivist-finding-the-needle-in-the-haystack\/","title":{"rendered":"Behind the Scenes with UC\u2019s Digital Archivist: Finding the Needle in the Haystack"},"content":{"rendered":"<p><em>By Eira Tansey, Digital Archivist\/Records Manager<\/em><\/p>\n<p>A constant challenge for digital archivists is identifying potentially sensitive material within born-digital archives. This content may be information that fits a known pattern (for example, a 3-2-4 number that likely indicates the presence of a social security number), or sensitive keywords that indicate the presence of a larger body of sensitive information (for example, the keywords \u201cevaluation\u201d and \u201ccandidate\u201d in close proximity to each other may indicate the presence of an evaluation form for a possible job applicant).<\/p>\n<p>Digital archivists use a number of tools to screen for potentially sensitive information. When this information is found, depending on the type of information, institutional policy, legal restrictions, and ethical issues, archivists may redact the information, destroy it, or limit access to it (either by user, or according to a certain period of time).<!--more--><\/p>\n<p>At the University of Cincinnati, we have a number of information security policies in place, including the Data Protection Policy (<a href=\"http:\/\/www.uc.edu\/content\/dam\/uc\/ucit\/docs\/itpolicies\/9-1-1-B_DataClassificationAndDataTypes.pdf\">http:\/\/www.uc.edu\/content\/dam\/uc\/ucit\/docs\/itpolicies\/9-1-1-B_DataClassificationAndDataTypes.pdf<\/a>). 
This policy defines the difference between restricted and controlled data. Restricted data includes information such as social security numbers and credit card data.<\/p>\n<p>One tool that many digital archivists use is Bulk Extractor, which is packaged with Archivematica [https:\/\/wiki.archivematica.org\/Release_1.5] and BitCurator [http:\/\/wiki.bitcurator.net\/index.php?title=Software], two very popular suites of open-source tools for processing and working with born-digital archives. Bulk Extractor uses a combination of pattern-recognition filters (for example, it will look for [text]@[text].[text] to flag email addresses) and customized lists of keywords that you set up yourself. Bulk Extractor is available for download here: <a href=\"http:\/\/digitalcorpora.org\/downloads\/bulk_extractor\/\">http:\/\/digitalcorpora.org\/downloads\/bulk_extractor\/<\/a><\/p>\n<p>Depending on the types of filters you choose to run, Bulk Extractor spits out a series of reports after you run it across a set of files. 
If a report is empty, it means Bulk Extractor did not find anything matching that filter.<\/p>\n<p><a href=\"https:\/\/libapps.libraries.uc.edu\/liblog\/wp-content\/uploads\/2016\/09\/liblog_post3_01.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-30884\" src=\"https:\/\/libapps.libraries.uc.edu\/liblog\/wp-content\/uploads\/2016\/09\/liblog_post3_01.jpg\" alt=\"Files - Bitcurator\" width=\"800\" height=\"446\" srcset=\"https:\/\/libapps.libraries.uc.edu\/liblog\/wp-content\/uploads\/2016\/09\/liblog_post3_01.jpg 1200w, https:\/\/libapps.libraries.uc.edu\/liblog\/wp-content\/uploads\/2016\/09\/liblog_post3_01-155x86.jpg 155w, https:\/\/libapps.libraries.uc.edu\/liblog\/wp-content\/uploads\/2016\/09\/liblog_post3_01-300x167.jpg 300w, https:\/\/libapps.libraries.uc.edu\/liblog\/wp-content\/uploads\/2016\/09\/liblog_post3_01-768x428.jpg 768w, https:\/\/libapps.libraries.uc.edu\/liblog\/wp-content\/uploads\/2016\/09\/liblog_post3_01-341x190.jpg 341w\" sizes=\"auto, (max-width: 800px) 100vw, 800px\" \/><\/a><\/p>\n<p>On the other hand, if it does pick up a keyword, you can then review the match to ensure that it\u2019s not a false positive. Unfortunately, I haven\u2019t yet figured out a way to automate this step. 
Here\u2019s an example using some keywords. Although these may indicate personnel files, reviewing this particular example in context shows that the word appears in a grant application.<\/p>\n<p><a href=\"https:\/\/libapps.libraries.uc.edu\/liblog\/wp-content\/uploads\/2016\/09\/liblog_post3_02a.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-30887\" src=\"https:\/\/libapps.libraries.uc.edu\/liblog\/wp-content\/uploads\/2016\/09\/liblog_post3_02a.jpg\" alt=\"Reviewing Files - Bitcurator\" width=\"816\" height=\"767\" srcset=\"https:\/\/libapps.libraries.uc.edu\/liblog\/wp-content\/uploads\/2016\/09\/liblog_post3_02a.jpg 816w, https:\/\/libapps.libraries.uc.edu\/liblog\/wp-content\/uploads\/2016\/09\/liblog_post3_02a-155x146.jpg 155w, https:\/\/libapps.libraries.uc.edu\/liblog\/wp-content\/uploads\/2016\/09\/liblog_post3_02a-202x190.jpg 202w, https:\/\/libapps.libraries.uc.edu\/liblog\/wp-content\/uploads\/2016\/09\/liblog_post3_02a-768x722.jpg 768w\" sizes=\"auto, (max-width: 816px) 100vw, 816px\" \/><\/a><\/p>\n<p>Digital archivists are particularly attuned to the challenges of dealing with sensitive information, and this topic remains an important one within the archival community. This is an area where we could use more tools that specifically address the archival community\u2019s needs for identifying and redacting or removing sensitive information. One such example so far is BitCurator Access (<a href=\"https:\/\/github.com\/bitcurator\/bca-redtools\">https:\/\/github.com\/bitcurator\/bca-redtools<\/a>). Hopefully more will be in development as time goes on.<\/p>\n<p>The Archives and Rare Books Library is located on the 8<sup>th<\/sup> floor of Blegen Library.\u00a0 We are open Monday through Friday, 8:00 am-5:00 pm.\u00a0 You can also call us at 513.556.1959, email us at <a href=\"mailto:archives@ucmail.uc.edu\">archives@ucmail.uc.edu<\/a>, visit us on the web at 
<a href=\"http:\/\/www.libraries.uc.edu\/arb.html\">http:\/\/www.libraries.uc.edu\/arb.html<\/a>, or have a look at our Facebook page, <a href=\"https:\/\/www.facebook.com\/ArchivesRareBooksLibraryUniversityOfCincinnati\">https:\/\/www.facebook.com\/ArchivesRareBooksLibraryUniversityOfCincinnati<\/a>.<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>By Eira Tansey, Digital Archivist\/Records Manager A constant challenge for digital archivists is identifying potentially sensitive material within born-digital archives. This content may be information that fits a known pattern (for example, a 3-2-4 number that likely indicates the presence &hellip; <a href=\"https:\/\/libapps.libraries.uc.edu\/liblog\/2016\/10\/behind-the-scenes-with-ucs-digital-archivist-finding-the-needle-in-the-haystack\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":5,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3,68,13],"tags":[1180,86,66],"class_list":["post-30883","post","type-post","status-publish","format-standard","hentry","category-arb","category-digital-collections","category-uclibraries","tag-digital-archives","tag-records-management","tag-university-archives"],"_links":{"self":[{"href":"https:\/\/libapps.libraries.uc.edu\/liblog\/wp-json\/wp\/v2\/posts\/30883","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/libapps.libraries.uc.edu\/liblog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/libapps.libraries.uc.edu\/liblog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/libapps.libraries.uc.edu\/liblog\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/libapps.libraries.uc.edu\/liblog\/wp-json\/wp\/v2\/comments?post=30883"}],"version-history":[{"count":0,"href":"https:\/\/libapps.libraries.uc.edu\/liblog\/wp-json\/wp\/v2\/posts\
/30883\/revisions"}],"wp:attachment":[{"href":"https:\/\/libapps.libraries.uc.edu\/liblog\/wp-json\/wp\/v2\/media?parent=30883"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/libapps.libraries.uc.edu\/liblog\/wp-json\/wp\/v2\/categories?post=30883"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/libapps.libraries.uc.edu\/liblog\/wp-json\/wp\/v2\/tags?post=30883"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}