Introduction to Digital Humanities Textbook by imansalehian

2A. HTML: STRUCTURED DATA, CONTENT MODELLING, INTERPRETATION, AND DISPLAY

All content in digital formats can be characterized as structured or unstructured data. In actuality, all data is structured: even typing on a keyboard “structures” a text as an alphabetic file, tying each keystroke to a character code such as ASCII. The distinction of one letter from another, or from a number, structures the data at the primary level. But the concept of “structured data” refers to a second level of organization that allows data to be managed or manipulated through that extra structure. Common ways to structure data are to introduce mark-up using tags, to use comma-separated values, or to use other data structures.

The distinction between structured and unstructured data has ramifications for the ways information can be used, analyzed, and displayed. Structured data is given explicit formal properties by means of the secondary levels of organization, or encoding, referred to above. These use extra elements (such as tags, to be discussed below), data structures (tables, spreadsheets, databases), or other means to add an extra level of interpretation or value to the data. The term unstructured data is generally used to refer to texts, images, sound files, or other digitally encoded information that has not had a secondary structure imposed upon it.

Sidebar Example: Think about the text of Romeo and Juliet. Every line in the play is structured by virtue of being alphabetic. But the text is also divided into lines spoken by characters, stage directions, and information about the act, scene, and so on. If we want to find any instance of “Juliet,” a simple string search will locate the name. That is a search operation on unstructured data. But if we want to be able to pull all of the lines spoken by Juliet, we would have to introduce a tag, such as <proper_name>, into the text.
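The contrast in the sidebar can be made concrete with a short sketch. It is only an illustration under stated assumptions: the `<scene>` and `<speech>` elements, the `speaker` attribute, and the two function names are hypothetical choices made for this example, not a tagging scheme prescribed by the text or by any standard. The first function performs a plain string search on unstructured text; the second relies on the added tags to retrieve the lines a character speaks.

```python
import xml.etree.ElementTree as ET

# Unstructured version: the excerpt is just a run of characters.
plain_text = (
    "ROMEO: But soft, what light through yonder window breaks? "
    "It is the east, and Juliet is the sun. "
    "JULIET: O Romeo, Romeo, wherefore art thou Romeo?"
)

def find_mentions(text, name):
    """Return the character offsets of every case-sensitive match of `name`."""
    positions = []
    start = text.find(name)
    while start != -1:
        positions.append(start)
        start = text.find(name, start + 1)
    return positions

# Structured version: hypothetical tags record who speaks each line.
tagged_text = """
<scene act="2" number="2">
  <speech speaker="Romeo">But soft, what light through yonder window breaks? It is the east, and Juliet is the sun.</speech>
  <speech speaker="Juliet">O Romeo, Romeo, wherefore art thou Romeo?</speech>
</scene>
"""

def lines_spoken_by(xml_text, speaker):
    """Return the text of every <speech> element whose speaker attribute matches."""
    root = ET.fromstring(xml_text)
    return [
        speech.text.strip()
        for speech in root.iter("speech")
        if speech.get("speaker") == speaker
    ]

if __name__ == "__main__":
    # The string search finds where the name occurs, but not who is speaking.
    print(find_mentions(plain_text, "Juliet"))
    # The tag-based query returns only the lines assigned to Juliet.
    print(lines_spoken_by(tagged_text, "Juliet"))
```

The string search locates the name where Romeo mentions Juliet, yet it has no way of knowing which lines belong to her. Only the second level of organization, the speaker information carried by the tags, makes that question answerable.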
The degree of granularity introduced by the structure will determine how much control we have over the manipulation and/or analysis. Every line could be marked for attributes such as class, race, or gender, but if we then wanted to sort or analyze all of the lines with obscene language, this set of tags, or structures, would be of no use. Every act of structuring introduces another level of interpretation, and is itself an act of interpretation, with powerful implications.

The most ubiquitous and familiar form of mark-up is HTML (HyperText Markup Language), which was created to standardize the display of files carried over the internet, read by browsers, and displayed on screens. Many scholarly projects make use of other forms of markup language, and the principles that are fundamental to HTML transfer to their use, even if each markup language is different. The mark-up standard from which HTML derives, SGML (Standard Generalized Markup Language), predates the Web and, technically, should be considered a metalanguage: a language used to describe other languages. Mark-up languages were designed to standardize communication on the Web, and, in essence, to make files display in the same way across different browsers and platforms. Good resources for understanding