PPTc - EL Training

Data management

DocLing 2016

David Nathan

Two most valuable strategies

 design and use a filename system

 work out (‘model’) your basic units of documentation and the relationships between them

- if you get these right, it will do the “heavy lifting” of your data management strategy

- data and metadata are intertwined, points in a spectrum rather than different things

Three most important qualities

 consistency

 documentation of conventions, structures, methods

 machine readability

“computer programs can act on data in terms of its proper structures and categories” an example

Data management

 understand and model the data (units, relationships)

 use appropriate data structure methods – in both file contents and organisation

 use appropriate and conventional data encoding methods (e.g. Unicode)

 be explicit and consistent

 plan for flow of data, working with others, across different systems

 document steps, decisions, conventions, structures

 think ahead to archiving

Managing data in your computer

 design a well-organised system of folders so that you can always find your stuff according to what it is, not:

 where the software decided to put it

 what the software decided to call it

 when/where you last used it

 what someone else called it

File structures and names

 design folder structure as a logical hierarchy that suits your goals, content and work style

 have documentary materials within one overall directory (e.g. for backup)

 make directories for relevant categories, e.g. sessions, media types, dates

 design it so that you will always be able to find things

 you may need to restructure at different points in your project, e.g. move from date-based to session-based structures

Designing a file/folder structure

 it should relate to reality

 locations should make sense, so you (and others) will know where to look for things

(where do you keep your passport; favourite cup?)

 the best location is “the place that one would naturally look to find it”

3 methods of linking or ‘bundling’ related files

 tree of distinguishing folder names

 one folder with distinguishing filenames

 one folder with numerical filenames

… what else is needed?

On identifiers

 real world objects are uniquely identified because they are physically unique - an unlabelled cassette is poorly identified

 digital objects have no physical existence - they depend on identifiers that we give them

 three types of identifiers:

 semantic

 keys

 relative

On identifiers

semantic, e.g.

 Nelson Mandela

 The Sound of Music

 SA_JA_Bongo_Palace_Land Dispute Trial_015_29-04-2010.wav

On identifiers

keys (disambiguators), e.g.

 1137204 (a student number)

0803 211 6148 (a telephone number) p12893fh23.pdf (some system's reference number)

On identifiers

relative, e.g.

 67 High Street

 the secretary

 index.html

 metadata.xls

On identifiers

 your collection may have a mix of these but it is important to be aware of their differences and limitations, for example:

 semantic identifiers: invite name clashes

 keys: a program or process might depend on the identifier to work properly

 relative identifiers: if you move them, you probably change or destroy their meaning

Digital objects and identities

 a digital object’s identity includes its location

 a file’s full identity = path + filename

 the path is a representation of the volume and the directory (folder) hierarchy

 if the full identity is unambiguous then everything can be fine, compare:

 c:\dogs\spaniels\rover.jpg

 c:\cars\british\rover.jpg

 or lectures\syntax\2013-02-12\notes.doc
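The idea that a file's full identity is path + filename can be sketched in Python. This is only an illustration using the two example paths from this slide; nothing is read from disk.

```python
from pathlib import PureWindowsPath

# The two example files: same filename, different folders.
dog = PureWindowsPath(r"c:\dogs\spaniels\rover.jpg")
car = PureWindowsPath(r"c:\cars\british\rover.jpg")

same_name = dog.name == car.name   # the filenames alone clash
same_identity = dog == car         # the full identities (path + filename) do not

print(dog.parent, "+", dog.name)
print("same filename:", same_name)
print("same identity:", same_identity)
```

If either file is moved or copied out of its folder, only the filename travels with it, and the clash becomes real.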

Digital objects and identities

 but semantic identifiers are potentially ‘dangerous’, because just adding more chunks to disambiguate them will not work:

 2015\rover.jpg

 2015\white_rover.jpg

 therefore, domains that do not offer semantic uniqueness may need identifiers which are either keys, or relative identifiers

And now to file names

 (having said all that)

 filenames are only filenames, and do not necessarily provide information

 common mistaken assumptions:

 that a filename “dp_verbs_39.wav” means there is an entity “dp_verbs_39”

 that files are logically linked just by sharing some part of their filenames

- these are only true if your system ensures it (and if you state it explicitly)

File naming

 filenames that are unsystematic or are non-standard will cause problems, eventually

 unsystematic file naming might be (just) OK if

 you already have many files

 you have a working method that already does everything you need to do

 your “system” will do everything you need to do in the future

Manage file names from the start

 a new file:

 don’t just accept the default filename or location suggested by the application when you first save the file

 put it where it belongs, immediately. If necessary, create the place (directory/path) where it belongs

 name it according to your naming system!

 if you have an inventory/index of files, add an entry for the new file

Filename rules

 all filenames should have correct extensions

 each filename should have only one ".", before the extension

 use only ASCII characters (US keyboard)

 use only letters, numbers, hyphens (-) and underscores (_)

 keep filenames short, just long enough to contain the necessary identifier - don't fill them up with lots of information about the content (that is metadata!)

 (advised) use only lower case letters
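The rules above can be checked by a program. The following is a minimal sketch, not a definitive checker: the function name `check_filename` and the length limit of 40 characters are assumptions made for illustration, and it cannot tell whether an extension is the correct one for the file's content.

```python
import re

# Letters, digits, hyphens and underscores only.
ALLOWED = re.compile(r"[A-Za-z0-9_-]+")

def check_filename(name, max_length=40):
    """Return a list of rule violations; an empty list means the name passes."""
    problems = []
    root, dot, extension = name.rpartition(".")
    if not (root and dot and extension):
        problems.append("needs a root and an extension separated by '.'")
    elif "." in root:
        problems.append("more than one '.'")
    if not name.isascii():
        problems.append("non-ASCII characters")
    elif not ALLOWED.fullmatch(name.replace(".", "")):
        problems.append("characters other than letters, digits, '-' and '_'")
    if name != name.lower():
        problems.append("upper-case letters (advised against)")
    if len(name) > max_length:
        problems.append("longer than max_length")
    return problems

print(check_filename("paaka_063.wav"))
print(check_filename("field notes.doc"))
```

Running a check like this over a whole folder listing is one way to find problem names before they cause trouble.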

How about these file names?

1. ready.audio.wav

2. ReAlLyhArDtOReAd.txt

3. éclair.jpg

4. e'clair.jpg

5. french-cake.jpeg

6. french-cake.jaypeg

7. -2011.psd

8. lexicon-master

9. ɘɫ I ɲʰ.eaf

10. ice cream.doc

11. Obama.TXT

12. オバマ .txt

Make filenames sortable

 make filenames usefully sortable:

20100119lecture.doc

20100203lecture.doc

gr_transcription_1.txt

gr_transcription_12.txt

gr_transcription_5.txt

gr_transcription_9.txt

gr_transcription_001.txt

gr_transcription_005.txt

gr_transcription_009.txt

gr_transcription_012.txt
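The difference between the two lists above can be reproduced with a plain alphabetical sort. This is a small sketch; the helper name `pad_number` and the default width of 3 digits are assumptions chosen to match the example.

```python
import re

unpadded = ["gr_transcription_1.txt", "gr_transcription_12.txt",
            "gr_transcription_5.txt", "gr_transcription_9.txt"]

def pad_number(filename, width=3):
    """Zero-pad the run of digits that sits directly before the extension."""
    return re.sub(r"(\d+)(\.[^.]+)$",
                  lambda m: m.group(1).zfill(width) + m.group(2),
                  filename)

padded = [pad_number(name) for name in unpadded]

print(sorted(unpadded))  # 12 sorts before 5 and 9
print(sorted(padded))    # 001, 005, 009, 012
```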

Associating files

 you can make resources sortable together by giving them the same filename root (the part before the extension), or part of the root:

gr_reefs.wav

gr_reefs.eaf

gr_reefs.txt

paaka_photo001.jpg

paaka_photo002.jpg

paaka_txt_conv203.wav

paaka_txt_conv203.eaf

paaka_txt_lex.doc

 document your conventions and system if you do this
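If files are associated by a shared root, a program can bundle them. The sketch below assumes the convention is exactly "same root, different extension"; the function name `group_by_root` is made up for illustration.

```python
from collections import defaultdict
from pathlib import PurePath

def group_by_root(filenames):
    """Map each filename root (the part before the extension) to its files."""
    groups = defaultdict(list)
    for name in sorted(filenames):
        groups[PurePath(name).stem].append(name)
    return dict(groups)

files = ["gr_reefs.wav", "gr_reefs.eaf", "gr_reefs.txt",
         "paaka_txt_conv203.wav", "paaka_txt_conv203.eaf"]
bundles = group_by_root(files)

for root, members in bundles.items():
    print(root, members)
```

This only works because the convention is applied consistently, which is why it needs to be documented.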

Avoid metadata in filenames

 avoid putting metadata into filenames. A filename is an identifier, not a data container

 better to use a simple (semantic) filename or a key (i.e. meaningless) filename, and then create a metadata table to contain all the relevant information

 a table can properly express all the information, contain links etc, and is extensible for further metadata

Avoid metadata in filenames

 e.g. Paaka_Reefs_Dan_BH_3Oct97.wav

 better:

 paaka_063.wav

plus

 paaka_063.txt

paaka_063.txt

language    topic                 speaker       location      date
Paakantyi   Reefs at Mutawintyi   Dan Herbert   Broken Hill   1997-10-03
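A metadata table like this is easy for a program to read. The sketch below uses a CSV layout and a "file" column as assumptions (the slide does not prescribe a format); the values are those of the example as read from the slide.

```python
import csv
import io

# Stand-in for a metadata file; the "file" column is an assumed addition
# that ties each row to the recording it describes.
METADATA_CSV = """file,language,topic,speaker,location,date
paaka_063.wav,Paakantyi,Reefs at Mutawintyi,Dan Herbert,Broken Hill,1997-10-03
"""

def load_metadata(text):
    """Index the metadata rows by filename."""
    return {row["file"]: row for row in csv.DictReader(io.StringIO(text))}

metadata = load_metadata(METADATA_CSV)
print(metadata["paaka_063.wav"]["topic"])
print(metadata["paaka_063.wav"]["date"])
```

New columns can be added later without renaming a single file.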

A filenaming system

 carefully design a filename system for your data and document the system so that somebody else can understand it

 one documenter’s new system: aaa_bb_cc_yyyy-mm-dd_nnn.wav

A filenaming system

 aaa_bb_cc_yyyy-mm-dd_nnn.wav

 aaa = village code

 bb = (main) speaker code

 cc = genre/event code

 yyyy-mm-dd = date (why this order?)

 nnn = optional number (e.g. 001)

 .wav = correct extension for file content type
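A documented system like this one can be checked and unpacked by a program. The sketch below is one possible reading: it assumes the codes are lower-case letters, and the example filename "kam_ja_st_2010-04-29_015.wav" is invented for illustration.

```python
import re
from datetime import date

# aaa_bb_cc_yyyy-mm-dd_nnn.wav, with _nnn optional.
PATTERN = re.compile(
    r"(?P<village>[a-z]{3})_(?P<speaker>[a-z]{2})_(?P<genre>[a-z]{2})_"
    r"(?P<date>\d{4}-\d{2}-\d{2})(?:_(?P<number>\d{3}))?\.wav"
)

def parse_filename(name):
    """Return the parts of a conforming filename, or None if it does not conform."""
    match = PATTERN.fullmatch(name)
    if match is None:
        return None
    fields = match.groupdict()
    try:
        fields["date"] = date.fromisoformat(fields["date"])
    except ValueError:  # e.g. month 13
        return None
    return fields

print(parse_filename("kam_ja_st_2010-04-29_015.wav"))
print(parse_filename("Paaka_Reefs_Dan_BH_3Oct97.wav"))
```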

Documenting the filename system

 describe the system

- how would you describe it?

- where would you put the description?

 document the codes – this is probably part of your metadata

On changing file names

 decide whether it is possible, and weigh the benefits against the side effects (e.g. loss of links in ELAN files)

 design a system first

 don’t change names in situ – copy data set and gradually migrate it to your new system

 document file name changes

 if possible, automate or copy and paste filenames

 if possible, use machine processes, e.g. system filename listings, XLS formulas
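The advice above (copy rather than rename in place, document every change, use machine processes) could be sketched as follows. The function name `migrate` and the layout of the log file are assumptions; the old-to-new name table would come from your own system.

```python
import csv
import shutil
from pathlib import Path

def migrate(source_dir, target_dir, name_map, log_path):
    """Copy files into target_dir under new names and log every change.

    name_map maps old filenames to new filenames. The originals are not touched.
    """
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    with open(log_path, "w", newline="", encoding="utf-8") as log_file:
        log = csv.writer(log_file)
        log.writerow(["old_name", "new_name"])
        for old_name, new_name in sorted(name_map.items()):
            destination = target_dir / new_name
            if destination.exists():
                # Refuse to overwrite: two old files must not share a new name.
                raise FileExistsError(f"{destination} already exists")
            shutil.copy2(source_dir / old_name, destination)
            log.writerow([old_name, new_name])
```

A call might look like `migrate("old", "new", {"Paaka_Reefs_Dan_BH_3Oct97.wav": "paaka_063.wav"}, "renames.csv")`. Links inside other files (e.g. ELAN files) are not updated by this sketch and would still need attention.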

Different types of metadata

 there are many types of metadata

 different types of materials may have different metadata

 e.g. metadata for photos and videos may have technical parameters, lists of people appearing

 e.g. metadata for transcriptions may have date, version, who transcribed, notes on progress

Meta-documentation

 you should keep an updated description of the methods, conventions, abbreviations you use

 .. so somebody could fully understand (and use) your data and methods in your absence

Your collection catalogue

 first, define your collection/corpus/project as some coherent (logical) set of materials

 your collection catalogue/inventory/index is a type of metadata

 this should list and describe all files in your collection

 it usually contains the categories of information that are relevant for many files

Your collection catalogue

 you could have one large catalogue that covers every file, or

 you could have a catalogue that is subdivided according to types of files, and/or groups of resources

 there is no “one size fits all” solution!

Making an “active” catalogue

 this is not necessary, but may be useful

 if you use a spreadsheet, you can embed links to actual files to make using your collection easier

 Excel formula

=hyperlink(address, display-text)

 useful methods for getting file listings

 “Open command window here” (Windows 7: SHIFT + right-click)

 Karen’s Directory Printer
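A file listing with ready-made link formulas can also be produced by a short script. This is a sketch under assumptions: the function name `build_catalogue` and the column names are invented, and it assumes the spreadsheet program evaluates the `=HYPERLINK(...)` text as a formula and resolves the relative paths from where the catalogue is saved.

```python
import csv
from pathlib import Path

def build_catalogue(collection_dir, catalogue_path):
    """Write one row per file: relative path, filename, and a HYPERLINK formula."""
    collection_dir = Path(collection_dir)
    rows = []
    for path in sorted(collection_dir.rglob("*")):
        if path.is_file():
            relative = path.relative_to(collection_dir).as_posix()
            formula = f'=HYPERLINK("{relative}", "{path.name}")'
            rows.append([relative, path.name, formula])
    with open(catalogue_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["path", "filename", "link"])
        writer.writerows(rows)
    return rows
```

If the catalogue is saved inside the collection folder, a later run will list the catalogue itself, so it may be simpler to keep it alongside the collection or to filter it out.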

My cells have multiple values!

 example: speakers in a recording

 speakers are probably not ‘atomic’ – they have other attributes

 create a separate “speakers” sheet

 give each speaker an ID (number or initials)

 use the IDs in the original sheet, with delimiter (implements one to many)

 (better) make another sheet to associate recordings with speakers (implements many to many)
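The "another sheet" approach can be pictured as three small tables. In the sketch below all IDs, names and filenames are hypothetical; the point is only that each row of the association table holds one recording and one speaker, so no cell has multiple values.

```python
# "speakers" sheet: one row per speaker (hypothetical data).
speakers = {
    "S01": {"name": "Speaker One"},
    "S02": {"name": "Speaker Two"},
}

# Association sheet: one row per (recording, speaker) pair.
recording_speakers = [
    ("rec_001.wav", "S01"),
    ("rec_001.wav", "S02"),
    ("rec_002.wav", "S01"),
]

def speakers_in(recording):
    """All speaker IDs for one recording."""
    return [sid for rec, sid in recording_speakers if rec == recording]

def recordings_of(speaker_id):
    """All recordings that one speaker appears in."""
    return [rec for rec, sid in recording_speakers if sid == speaker_id]

print(speakers_in("rec_001.wav"))
print(recordings_of("S01"))
```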

Data/file versions

 whether you need to distinguish or keep versions depends on your purposes

 by suffixing filename, e.g.

 fugu1.txt

fugu2.txt

or

 fugu_1.txt

fugu_2.txt

 which of the above methods is better?

Data/file versions

 fugu_14022013.txt

fugu_20130214.txt

14022013_fugu.txt

20130214_fugu.txt

 which of the above would be best?

Managing data/file versions

 do you need to keep every version?

 it may be OK to keep “original” plus current

 if information is regularly updated, corrected, you can keep 1 filename and put dates in the document itself, or record dates in a catalogue/metadata file

 however, a series of files may have inherent value, e.g. your transcriptions/annotations, as your understanding and analysis changes, so

 date and keep files

 use different tiers in ELAN?

Character encoding

 if your document contains any characters other than those on a US keyboard, use a Unicode character encoding such as UTF-8

 how can I tell if characters in my MS Word document are encoded as UTF-8?

 save as plain text and check options

 copy into plain text editor such as Notepad++
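For a file that has already been saved as plain text, a quick programmatic check is to try decoding it as UTF-8. This is a minimal sketch; the function names are made up, and a pass only shows that the bytes are valid UTF-8 (plain ASCII passes too), not that every character displays as intended.

```python
def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

def file_is_utf8(path):
    """Read a file as raw bytes and test whether it is valid UTF-8."""
    with open(path, "rb") as handle:
        return is_valid_utf8(handle.read())

print(is_valid_utf8("éclair ɲʰ".encode("utf-8")))
print(is_valid_utf8("éclair".encode("latin-1")))
```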

Character encoding, useful tools

 Notepad++ http://notepad-plus-plus.org/

 for Mac, use: TextWrangler http://www.barebones.com/products/textwrangler/

 SIL ViewGlyph http://scripts.sil.org/cms/scripts/page.php?item_id=ViewGlyph_home

 BabelMap http://www.babelstone.co.uk/software/babelmap.html

 TypeIt (view and write IPA) http://ipa.typeit.org/full/

 browsers such as Firefox and Chrome are useful for checking and reporting character encoding

Transferring data

 ensure your computer is not a “walled garden”

 you can use

 drives/devices (but avoid DVDs!!)

 email

 upload to website (where available)

 send links

 “cloud” e.g. Carbonite, Dropbox, collaboration software

 some of these could be considered backup but not true archiving

Sharing

 can we work in a shared, collaborative space?

 Google Docs

 Dropbox

 blogs, Tumblr, wikis etc can have shared “authors”, and contributors with particular roles

 also there is dedicated collaboration software (usually $$$)

Exercise - now it’s your turn!

 Practical exercise for DocLing 2016

Data management & archiving

Work in pairs

Go to http://www.eltraining.org/courses/docling/2016/exercise/

Download the file, unzip it, and place it in a working folder

• exercise.zip

 This is dummy data - the content is not important for the exercise

 Look through all the files to see what files are present

 Find the metadata file

 Do the following:

 identify the problems and errors with the data set

 work out strategies for dealing with the problems

 work out strategies for documenting the changes you make

 fix the problems and errors (as much as possible)

 add columns to the metadata for date and location

 modify the metadata to create links to the audio files
