This is a backup resource for the rest of the site, intended to provide index-relevant explanations, rather than universal definitions for terms that might be unfamiliar.
Hyperlinked terms in the definitions are themselves glossary entries. The terms are explained in terms of their likely significance for indexers, rather than being strictly defined. The latest additions are indicated by an icon to aid their identification.
- analytical indexing
- The most advanced form of human indexing, in which the indexer analyses text for meaning, augments the author's terminology choices to anticipate the readers' likely preferences, applies tests of significance and uniqueness and assists the index user with cross-references, subentries and often qualifying information. Compared with search boxes, keyword occurrence and word-spotting indexes, it adds the greatest possible value and delivers the widest possible range of retrieval options to the indexed document.
- API (Application Programming Interface)
- Software generally needs two sorts of interface: the user interface facilitates communication with humans; the API, interactions with other software. So, for example, though hidden from most users, APIs are involved in almost all online transactions: if you buy something on eBay or from an online retailer and pay via PayPal or use a credit card subject to bank authorisation, several proprietary systems will have to interact and share data seamlessly, without user input. There are a number of standard APIs for different operating systems and programming languages.
- apps
- Though simply an abbreviation for application software, this has the everyday meaning of a downloadable application specifically for use on mobile devices.
- attributes
- In markup languages, attributes are name=value pairs included within the elements themselves to supply variable data. For example, in HTML
<a href="anotherpage.html">click here</a>
the anchor element (<a>) is a link to another page and contains the 'href' attribute, which tells the browser where to find the linked file. In an XML example
<indexterm sortas="three-dimensional">3-D</indexterm>
the attribute 'sortas' tells the rendering software to override the default alphabetisation and position the index term '3-D' as though it were spelled out.
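The effect of a 'sortas' attribute can be sketched in a few lines of Python. This is only an illustration using the standard library's ElementTree module; the two extra index terms and the sort_key helper are invented for the example and do not come from any particular indexing tool.

```python
import xml.etree.ElementTree as ET

# The XML example from the entry above, plus two invented terms for comparison.
terms = [
    ET.fromstring('<indexterm sortas="three-dimensional">3-D</indexterm>'),
    ET.fromstring("<indexterm>tables</indexterm>"),
    ET.fromstring("<indexterm>tagging</indexterm>"),
]

def sort_key(term):
    """Use the 'sortas' attribute if present, otherwise the displayed text."""
    return term.get("sortas", term.text).lower()

# Sort on the key, but display the original text of each term.
ordered = [term.text for term in sorted(terms, key=sort_key)]
print(ordered)  # ['tables', 'tagging', '3-D']
```

Without the attribute, '3-D' would sort ahead of both other terms because digits precede letters in the default character order.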
- Big Data
- Used to describe very large or complex datasets, often derived by automated data capture, that resist analysis by conventional database management tools. If they can be successfully interrogated, for example by techniques like data mining, they hold out the promise that correlations may emerge without any underlying hypothesis (so that a machine might, for example, have flagged up the link between cigarette smoking and lung cancer unprompted). There are a number of interesting mathematical approaches to deriving meaningful information from the mass of raw data, although in some cases slow solutions like crowd-sourcing analysis have been employed.
- CFI (Canonical Fragment Identifier)
- EPUB's automatically-generated location marker for a stable text-based document, where each location is related to package-specific subdivisions like chapters and paragraphs.
- chapter-style index
- EPUB use: an index display option with fixed alphabetical presentation. IDPF's 2012 Charter document has: "An index presented in a book's content as a chapter, accessed from the table of contents and from special menus or icons. It can be paged through and browsed as normal content, with hyperlinks back into the book's content, and cross-reference hyperlinks to other areas of the index". Perhaps the key aspect is accessibility from the ToC, since this is a common difficulty with early ebook readers.
- cloud computing
- An internet-based delivery model for IT services where data storage and processing facilities are provided at undefined data centre locations remote from the customer. The users may access cloud services through a web browser but their experience should otherwise be indistinguishable from having the data and software on their own systems. Some concerns have been expressed over contract terms, data protection, security and service levels.
- concordances
- Listings, now usually computer-generated, of terms in a document with the location of each of their occurrences, regardless of context. Thus, for example, the phrase 'fought like lions' would be listed under 'lions' in a concordance but not in an analytical index. Normally very frequent and uninformative terms are suppressed. Concordance generation is the basis of numerous keyword-retrieval and semi-automatic indexing systems, and can be considered the antithesis of analytical indexing. Word-spotting indexes bear some similarity to concordances.
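A minimal sketch of concordance generation follows, assuming a document held as a mapping of page numbers to text. The stopword list, the sample pages and the build_concordance function are all invented for illustration; real systems use far longer stopword lists and more careful tokenisation.

```python
import re
from collections import defaultdict

# Illustrative stopword list only: frequent, uninformative words to suppress.
STOPWORDS = {"the", "a", "on", "like", "and", "of"}

def build_concordance(pages):
    """Map each remaining word to the sorted page numbers where it occurs."""
    concordance = defaultdict(set)
    for page_number, text in pages.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in STOPWORDS:
                concordance[word].add(page_number)
    return {word: sorted(nums) for word, nums in sorted(concordance.items())}

pages = {
    12: "The soldiers fought like lions.",
    40: "Lions live on the savannah.",
}
concordance = build_concordance(pages)
print(concordance["lions"])  # [12, 40]
```

Page 12 is listed under 'lions' purely because the word occurs there, which is exactly the behaviour an analytical indexer would avoid.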
- crowd-sourcing
- Either the soliciting of ideas from, or the delegation of tiresome or routine tasks to, a large group of interested and computer-literate members of the public. Depending on how broadly one applies the term, it might include competitions and cooperative tasks like so-called citizen science, where seti@home and galaxyzoo.org permit the analysis of large datasets (see also Big Data) not accessible to machine processing. Another example might be the editorial control of Wikipedia.
- controlled vocabularies
- Lists of approved terms, and relationships between them, which must be used to describe publications or objects in a specialised subject area or collection consistently. The objective is to ensure that indexers and cataloguers use predictable descriptions so as to collocate all treatments of a topic, and to help those searching by translating their query formulations into the preferred standard terminology. Two important varieties of controlled vocabulary are the thesaurus and the ontology. Controlled vocabularies are sometimes called authority lists.
- CSS (Cascading Style Sheets)
- A stylesheet language controlling the rendering of HTML that avoids the need to style each element individually (inline styling) and assists in maintaining a consistent appearance throughout a website, regardless of content changes.
- CUP-XML
- A proprietary system devised by Cambridge University Press using XML to facilitate content repurposing. The derived indexing system allows indexers to work on CUP books without meeting any XML or indeed having the XML version of the text leave CUP's offices. Indexers mark text locations in an electronic or paper copy of the book using unique numeric codes chosen by themselves, then supply a standalone index using these codes in place of page numbers. In this form, CUP-XML is not embedded indexing and requires no knowledge of XML on the indexer's part.
- CVS (Concurrent Versions System)
- A software tool to manage changes to a developing project (e.g. a book or website) so as to avoid or reconcile conflicting changes. A CVS is extremely valuable whenever a remote indexer is working on anything short of a definitive version of a text, along with editors, proofreaders, graphic designers etc.
- data mining
- The automated analysis of large datasets according to multiple criteria with the aim of extracting patterns and correlations that were previously neither apparent nor expected. In the analysis of Big Data, it can be seen as an alternative approach to crowd-sourcing. Commercially, the importance of data mining lies in predicting future behaviours, for example of customers by novel analyses of historic records like till receipts.
- DocBook
- An XML-derived semantic markup language for presentation-neutral description of documents, capable of interconversion with a number of publishing formats. DocBook is regulated by the OASIS consortium.
- DTD (Document Type Definition)
- A declaration of document type and markup conventions, normally found early in the code, to control the precise rendering of a markup language document. Where that language is XML, XML Schema is another way of achieving the same thing.
- DTTF (Digital Trends Task Force)
- The Digital Trends Task Force, established by the American Society for Indexing in 2011 to gather information on digital publishing practices as they affect indexes, interface with leading digital publishing companies, eReader hardware and software suppliers to ensure inclusion of usable indexes, and inform ASI members regarding digital trends in a timely manner. More information at
- eBooks
- A form of electronic publication, usually corresponding to a printed book but downloadable for display on a handheld device (a dedicated eReader, smartphone or tablet computer). Normally eBooks are encoded in an HTML-like markup language as a single file and offer variable text sizes, so pagination is not fixed. This means that pages are not useable as index locators. However, with index-worthy material now more likely to be offered as an eBook, there are signs that more publishers are willing to investigate index provision.
- elements
- The individual units of a markup language, on which rendering software like web browsers and stylesheets operate, usually comprising a pair of 'tags' delimited by angle brackets together with their content. For example: <p>Hello World!</p>
- embedded indexing
- A technique in which the actual text of index entries is electronically attached to the text regions which they indicate, so that no separate index file is supplied. Instead, the index is generated on demand to match the current pagination. The constituents of the index are ever present and resistant to changes in pagination or even order of the text. Most present-day word processing and page-layout software provides some inbuilt embedded indexing functionality, though most implementations are fairly cumbersome (especially in their treatment of ranges). Conventional indexing is sometimes described as standalone indexing to contrast it with embedding.
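The idea of regenerating an index to match the current pagination can be sketched as follows. The {XE:...} marker syntax, the sample paragraphs and the generate_index function are hypothetical, chosen only to show the principle; they do not reproduce the format of any real word processor or page-layout program.

```python
import re
from collections import defaultdict

# Hypothetical marker syntax: an index entry embedded in the running text.
MARKER = re.compile(r"\{XE:([^}]+)\}")

def generate_index(pages):
    """Build the index from the embedded markers for a given pagination."""
    index = defaultdict(set)
    for number, text in enumerate(pages, start=1):
        for term in MARKER.findall(text):
            index[term].add(number)
    return {term: sorted(nums) for term, nums in sorted(index.items())}

paragraphs = [
    "Cats{XE:cats} sleep a lot.",
    "Dogs{XE:dogs} bark.",
    "Cats{XE:cats} purr.",
]

# The same marked-up text laid out in two different ways.
two_per_page = [" ".join(paragraphs[0:2]), paragraphs[2]]
one_per_page = paragraphs

index_a = generate_index(two_per_page)
index_b = generate_index(one_per_page)
print(index_a)  # {'cats': [1, 2], 'dogs': [1]}
print(index_b)  # {'cats': [1, 3], 'dogs': [2]}
```

Because the entries travel with the text, repagination changes only the locators, not the index terms themselves.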
- EPUB
- An XHTML-based standard for electronic publishing, used for tablet-based eBook readers but not by the Amazon Kindle.
- A blanket term covering all electronically delivered publications formatted and accessible as individual titles equivalent to hardcopy books. The two main classes are eBooks (essentially versions downloadable to a handheld eReader like the Amazon Kindle or Apple's iPad) and online books which are usually paginated books, accessible via a publisher's website as PDFs. Confusion with the EPUB standard and either class makes the use of the term (and especially its abbreviated forms) undesirable.
- eReaders
- A handheld device, designed specifically to display eBooks, often using reflective eInk technology, usually with variable font sizes which means the page-like appearance is variable and unlikely to correspond to the pagination of any printed version. The Amazon Kindle family dominates the UK market; the Barnes & Noble Nook and Sony Reader are important in the US while the Kobo, with its touchpad approach, is another popular device.
- hidden text
- Describes the formatting, commenting and other additional information not normally visible (unless its display is explicitly selected) in the conventional view of a marked-up document like a word processing file. Hidden text includes control characters like line-feed and paragraph markers, as well as embedded index terms in for example Microsoft Word. The normal view of a section of text might be:
The cat sat on the mat.
while the view showing hidden text could be:
- HTML (HyperText Markup Language)
- The markup language that drives the World Wide Web and enables a hyperlinked multimedia internet experience, by specifying a mix of content-descriptive and presentational information and links to external files. In website indexing and eBooks, hyperlinks can be used in place of locators to take readers from index entries to the relevant text treatments.
- HTML5
- HTML5 is the latest definition of HTML, extending it to include video, audio and scalable graphics, without the need for add-on software such as Adobe Flash. It is still officially in development with its target date for Recommendation being 2014. The latest Kindle 8/Kindle Fire format is based on HTML5.
- IDPF (International Digital Publishing Forum)
- The IDPF 'develops and maintains the EPUB standard format for reflowable digital books and other digital publications that are interoperable between disparate reading devices and applications [and] provides a forum that fosters enhanced communication between all stakeholders in the emerging global digital publishing industry'. The website URL is http://idpf.org.
- index locator search
- An option, available only with electronic text, whereby linked or embedded index terms applied to a particular section of text can be displayed, providing both a documentary context and potential lead terms for a diverging index exploration.
- legacy indexing
- The conversion of a human-compiled analytical index prepared for a printed book to a linked index to the equivalent ebook by semi-automated means.
- linked index
- A form of electronic index where the index term is hyperlinked to the relevant text location by using the anchor element of a markup language (XML, HTML or XHTML). If the href=# attribute in the index and the name attribute in the text location match, clicking on the index term or an accompanying surrogate locator will open the appropriate section of text on the reader's screen.
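The matching of href and name attributes described above can be checked mechanically. The sketch below uses Python's standard html.parser module; the AnchorCollector class, the broken_links function and the sample page are invented for illustration and make no claim about how any eBook production tool works.

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect name targets and same-document href links from anchor elements."""

    def __init__(self):
        super().__init__()
        self.targets = set()  # values of name attributes in the text
        self.links = set()    # fragment parts of href="#..." in the index

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        if attributes.get("name"):
            self.targets.add(attributes["name"])
        href = attributes.get("href") or ""
        if href.startswith("#"):
            self.links.add(href[1:])

def broken_links(html):
    """Return index links that have no matching target in the text."""
    collector = AnchorCollector()
    collector.feed(html)
    return sorted(collector.links - collector.targets)

page = (
    '<p><a name="cats">Cats</a> are discussed here.</p>'
    '<ul><li><a href="#cats">cats</a></li>'
    '<li><a href="#dogs">dogs</a></li></ul>'
)
print(broken_links(page))  # ['dogs']
```

Here the entry for 'dogs' would lead nowhere, because no text location carries a matching name attribute.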
- Kindle
- Amazon's eBook reader, which drives its display with a subset of HTML so that website-type indexing is feasible. Nevertheless early downloads have been predominantly of fiction, and other titles not requiring indexes. Because page size is variable under the reader's control, Kindles use a system of 'locations' to specify approximate position, and Amazon states that index use is not recommended, although indexes of varying degrees of usefulness do appear.
- markup languages
- Document description languages, now usually complying with the SGML standard, which allow different types of content to be selected, grouped and subjected to consistent operations like display, editing and sorting. The best known examples are HTML, which encodes website pages, and XML.
- mashups
- In the wider world, a website aggregating data from disparate sources (often using numerous APIs) to provide an enriched user experience, for example a map plus live video or real-time exchange rates. In specialised indexing usage, mashups are aggregates of the existing indexes to a selection of documents, potentially making all their contents searchable in one pass. Though attractive to publishers keen to derive extra value from their back catalogues, they have obvious risks, because analytical index term choice is affected by numerous document-specific factors like authorial preference, academic level, geographical usages and the conventional suppression of the metatopic. These can give rise either to clashing redirections (like electronic circular references) or else retrieval performance inferior (because of inconsistent choices) to simple word occurrence searching.
Subsets are on-demand selections from larger datasets, most commonly chapters from one or more complete books or ebooks. They suffer even more from the same shortcoming, because the degree of analysis in human indexing is always influenced by documentary context, which is lost when making subsets.
- metadata
- In EPUB terms, the kind of bibliographic data found in library catalogues, and therefore amenable to standardisation initiatives like Dublin Core and ONIX, used to identify a specific document and version.
- metatopic
- Essentially the main subject of a document and, as such, conventionally avoided in subject, as opposed to name, indexes. Otherwise virtually the whole index to a book about growing oranges might appear under the letter 'O'. Normally more specific entries are chosen as facilitating more efficient retrieval, though it is permissible to wonder whether all readers understand this convention.
- name-spotting
- An extreme form of word-spotting, where index terms are restricted to proper names, rather than based on an analysis of meaning and subject content. Name-spotting is undesirable, even for documents like biographies, where names are important. Otherwise, analytical indexing may feature the deliberate provision of a separate, fully-researched name index, but this would normally complement rather than replace a full subject index.
- namespaces (XML)
- Used to disambiguate potentially clashing element and attribute names, the namespace is usually declared early in an XML document and takes the form of a URI, its format resembling the URL familiar as a website address.
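As an illustration of how namespaces keep clashing names apart, the sketch below parses a small XML document with Python's standard ElementTree module. The document and the http://example.org/person namespace are made up for the example; the Dublin Core URI is the published one for its element set.

```python
import xml.etree.ElementTree as ET

# Two different 'title' elements, distinguished only by their namespaces.
DOC = """<book xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:ex="http://example.org/person">
  <dc:title>Indexing Basics</dc:title>
  <ex:title>Dr</ex:title>
</book>"""

root = ET.fromstring(DOC)

# ElementTree expands each prefix to '{namespace URI}localname'.
tags = [child.tag for child in root]
print(tags)

book_title = root.find("{http://purl.org/dc/elements/1.1/}title").text
honorific = root.find("{http://example.org/person}title").text
print(book_title, "/", honorific)  # Indexing Basics / Dr
```

The prefixes (dc, ex) are only shorthand within the document; it is the URI that actually identifies the namespace.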
- OASIS (Organization for the Advancement of Structured Information Standards)
- A not-for-profit consortium promoting worldwide standards in various applications including security, web services and electronic publishing. DocBook is an OASIS-supported standard. Their website is at
- online books
- Books accessible from a publisher's website, usually for a fee and normally in paginated format as PDF files but occasionally as markup. They are usually designed to be read on a PC, as opposed to eBooks which are intended to be downloaded to a suitable eReader.
- ontology
- An ontology is a form of knowledge representation, which basically means it is a description or model of the concepts and their relationships in a specific field. It defines a controlled vocabulary for users who need to share information in a specific domain (e.g. medicine) by providing a mechanism to capture information about objects and the relationships between them. An ontology is more complex than a thesaurus in having specific associative relationships in addition to preferred/non-preferred (equivalence relationships) and broader terms/narrower terms (hierarchical relationships). It will include:
Classes (or concepts)
Subclasses (narrower terms)
Properties or attributes (shared by the class)
Relationships (ways of interacting)
Instances (individuals/individual examples)
In a simple example:
Subclass: Viral infection
Subclass: Cold sores
Relationship: to Antivirals ('has treatment' is the relationship)
Relationship: to Mouth swab ('has diagnostic' is the relationship)
Relationship: to Oral Herpes ('has synonym' is the relationship = equivalence/UF relationship of thesauri)
- parsers (XML)
- Software tools for checking the well-formedness and sometimes the validity of an XML document.
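A well-formedness check can be sketched with the parser in Python's standard library. The is_well_formed function is an invented helper; note that ElementTree checks well-formedness only and does not validate a document against a DTD or schema.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True

good = is_well_formed("<p>Hello World!</p>")
bad = is_well_formed("<p><b>Hello World!</p></b>")  # tags incorrectly nested
print(good, bad)  # True False
```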
- PDF (Portable Document Format)
- Adobe's proprietary file format, designed to display documents (text plus graphics and images) the same way regardless of hardware platform. PDFs have become the near-universal standard for the transmission of paginated documents.
- placeholders
- A placeholder is a symbol in a logical or mathematical expression that can be replaced by the name of any member of a specified set or, in the case of indexing, some form of dummy content, reserving space for information that will become available only later (for example, a yet-to-be-determined page number in an electronic document destined to be paginated).
- PTG (Publishing Technology Group)
- The Publishing Technology Group, a working party established by the Society of Indexers in 2011, initially to prepare briefing notes for publishers approaching the Society of Indexers for advice on incorporating indexes in electronic documents or converting indexed hardcopy documents for electronic delivery. Its remit was redefined on 26 October 2011 to include promoting indexing in the digital context and keeping members up to date.
- rendering
- The presentation of a marked up document in the intended form, e.g. converting HTML code to a human-readable web page by a browser.
- reverse indexing
- see index locator search
- schemas (XML)
- Schemas are a set of constraints on an XML document's structure, content, and the ordering of its elements, of which DTDs are one common and straightforward form. XML Schema is the name of a specific schema language.
- semantic web
- A development of the World Wide Web providing content labelled according to its meaning rather than just its format, the semantic web should allow computers to deliver a more integrated and enriched user experience. Though progress towards achieving this vision has been slow, some software components are in place, including RDF (Resource Description Framework), OWL (Web Ontology Language) and XML. For an up-to-date overview, see 'The semantic web: an introduction for information professionals', by Matt Moore, The Indexer, 30(1), 38-43 (March 2012).
- semi-automatic indexing
- A number of systems, such as TExtract and IndDoc, will analyse machine-readable text and produce a list of word and phrase occurrences judged significant by inbuilt criteria which, it is claimed, can save indexers a considerable amount of work. In fact, because these systems are occurrence-based, they apply a different model for generating terms from that used by human indexers and the initial result is likely to resemble a concordance or at best a word-spotting index. Semi-automatic indexing systems should not be confused with indexing software or add-ons that automate operations after the indexer has discerned meaning and selected candidate terms.
- SGML (Standard Generalized Markup Language)
- The earliest document markup language to appear, SGML introduced the use of universal codes to identify document components according to their format to facilitate their exchange in machine-readable form. SGML now features in the ISO 8879 standard that formulates requirements for all other markup languages.
- standalone indexes
- Commonly the conventional form of back-of-the-book index, in which a separate index file is compiled, giving access to the significant treatments of topics using some form of locator (usually the page number). The term is used chiefly to distinguish this traditional approach from an embedded index. However, the EPUB Indexes Charter Proposal (http://code.google.com/p/epub-revision/wiki/IndexesCharterProposal) offers a completely different definition: a publication that consists only of one or more indexes to other EPUBs or external targets.
- stylesheets
- Software imposing a consistent rule set on one particular rendering of markup, whether HTML as the set of matching pages in a website or XML in virtually any desired format. CSS is the commonest stylesheet language for HTML; XSLT the commonest for XML.
- subsets
- see mashups
- tagging (publishers)
- Marking the precise location of a treatment or a term occurrence, often in as yet unpaginated text, by means of one or more codes, which appear as the locator in an accompanying standalone index. Though they may subsequently be converted into page numbers or some other locator form like hyperlinks, they can indicate the location and extent of the treatment to a greater degree of precision than would a page number.
- tags (markup languages)
- One component of markup language elements, the other being content, tags are delimited with angle brackets and fall into three groups: start tags, end tags and empty-element tags. Their formats are
<element>, </element> and <element/> respectively. The term has a number of different meanings but its use by indexers is especially undesirable, not least because of possible confusion with the use of tagging by publishers.
- thesaurus
- Though Roget's is a list of near synonyms, in information retrieval a thesaurus is a form of controlled vocabulary which builds in more structure by linking terms with broader and narrower meanings within a hierarchy, redirecting from non-preferred to preferred terms and indicating near synonyms with see also links. For example (BT = broader term, NT = narrower term, UF = used for, SA = see also):
asteroids
    BT solar system
    USE minor planets
asteroid belt objects
    SA Kuiper belt
minor planets
    BT solar system
    UF asteroids
    NT asteroid belt objects
    NT near-earth objects
Though invaluable for controlling document collections and periodical runs, thesauri are much less suitable for books, eBooks or website indexing. Even if they are available to guide the searcher (without which they are useless), they don't reflect authorial preferences, can't accommodate new coinages or very specialised terminology, and are usually insufficiently detailed to allow subtle arguments to be specified.
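How a thesaurus translates a searcher's query into preferred terminology can be sketched as a simple lookup. The data structure below is one possible reading of the example entries above (an assumption, since the original layout of that example is not certain), and the preferred_term and expand_query functions are invented for illustration.

```python
# One reading of the example entries, keyed by term.
# USE points to the preferred term; NT lists narrower terms.
THESAURUS = {
    "asteroids": {"BT": ["solar system"], "USE": "minor planets"},
    "asteroid belt objects": {"SA": ["Kuiper belt"]},
    "minor planets": {
        "BT": ["solar system"],
        "UF": ["asteroids"],
        "NT": ["asteroid belt objects", "near-earth objects"],
    },
}

def preferred_term(query):
    """Follow a USE redirection if the query is a non-preferred term."""
    return THESAURUS.get(query, {}).get("USE", query)

def expand_query(query):
    """Return the preferred term followed by its narrower terms."""
    term = preferred_term(query)
    return [term] + THESAURUS.get(term, {}).get("NT", [])

expanded = expand_query("asteroids")
print(expanded)
# ['minor planets', 'asteroid belt objects', 'near-earth objects']
```

A search for the non-preferred 'asteroids' is thus redirected to 'minor planets' and widened to its narrower terms, which is the collocation that a controlled vocabulary is meant to provide.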
- topic maps
- A model of knowledge representation involving topics, associations and occurrences that began with work on back-of-the-book indexes. It predates XML but has been adapted to it (as the XTM language), and represents an alternative to the RDF (Resource Description Framework) approach to the semantic web. The standard ISO/IEC 13250:2003 covers topic maps and the relevance to indexes is usefully discussed in 'The medium is not the message: topic maps and the separation of presentation and content in indexes', by Richard Northedge, The Indexer, 26(2), 60-64 (June 2008).
- URL (Uniform Resource Locator)
- The character string that defines the location of an Internet resource. A URL is one kind of Uniform Resource Identifier (URI); a Uniform Resource Name (URN), which identifies a resource without saying where to find it, is another. In http://www.indexers.org.uk/index.php the first part names the protocol and the host machine, and the last part gives the path to the resource on that host. Intended to make sense to human users, the host names in URLs are automatically translated to numeric IP addresses by software. Especially cumbersome URLs can usefully be replaced using the service at http://tinyurl.com.
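The component parts of the example address can be inspected with Python's standard urllib.parse module. This is only a small sketch of how software sees the string; the variable names are arbitrary.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://www.indexers.org.uk/index.php")

scheme = parts.scheme  # the protocol used to fetch the resource
host = parts.netloc    # the host name, later resolved to an IP address
path = parts.path      # the location of the resource on that host

print(scheme, host, path)  # http www.indexers.org.uk /index.php
```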
- validity
- To be valid, a marked up XML document must contain a reference to a schema or DTD that declares its elements and attributes and must follow the grammatical rules specified in that schema.
- W3C (World Wide Web Consortium)
- The World Wide Web Consortium, which maintains standards governing, and documentation defining, most markup languages.
- well-formedness
- A required format of an XML document, meaning that it meets the very strict requirements of XML syntax rules with regard to criteria like matching tags, legal characters and correct nesting.
- Wi-Fi
- Radio-based wireless local area networks, allowing internet connections to PC and smartphone users through simple proximity (the range depending on intervening barriers like walls) and increasingly offered by hotels and on public transport. The contactless connection brings obvious security vulnerabilities.
- word-spotting
- A pejorative description of a form of indexing in which entries are restricted to the terms used by the author(s), rather than being expanded to include alternatives likely to be sought by the document's intended readers. The latter is analytical indexing. Extreme forms are name-spotting and concordance generation.
- World Wide Web
- The 'Web' is the global, interconnected mass of multi-media documents stored on millions of machines across the world connected by the Internet, made possible by Tim Berners-Lee's work on HTML at CERN in 1990. Web page addresses normally have the prefix http:// (or https:// for its encrypted variant), indicating the hypertext transfer protocol which dictates how they are handled by the Internet. Though dominant, 'the web' is not synonymous with the Internet, which of course predates it and still supports other types of exchange, for example the File Transfer Protocol, indicated by ftp:// and used for the rapid movement of static documents and data.
- XHTML (Extensible HyperText Markup Language)
- One of a growing family of XML-derived markup languages, this is essentially a stricter form of HTML, meeting XML standards of well-formedness. XHTML is supported by a W3C Recommendation and is the basis, for example, of the EPUB standard for eBooks.
- XML (Extensible Markup Language)
- A flexible and powerful data description language specification - with a derived family of specialised languages - used worldwide to provide inter-convertibility for documents and web services. The markup comprises elements, whose grammar, relationships and rendering rely on supporting schemas and stylesheets. Since Office 2007, Microsoft Office applications have been XML-based. XML can be used for embedded indexing; see for example 'XML indexing', by Michele Combs, The Indexer, 30(1), 47-52 (March 2012).
- XPath
- A language used in examining the elements and attributes of XML documents in a systematic way, XPath is used in the production of XSLT stylesheets and therefore contributes to the rendering of XML documents.
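Two simple XPath-style queries are sketched below using Python's standard ElementTree module, which implements only a limited subset of XPath. The sample document is invented, reusing the 'indexterm' element and 'sortas' attribute from the attributes entry.

```python
import xml.etree.ElementTree as ET

DOC = (
    "<chapter><p>A "
    "<indexterm sortas='three-dimensional'>3-D</indexterm> view of "
    "<indexterm>maps</indexterm>.</p></chapter>"
)
root = ET.fromstring(DOC)

# Select every indexterm element anywhere below the root.
all_terms = [term.text for term in root.findall(".//indexterm")]

# Select only those indexterm elements that carry a sortas attribute.
with_sortas = [term.get("sortas") for term in root.findall(".//indexterm[@sortas]")]

print(all_terms)    # ['3-D', 'maps']
print(with_sortas)  # ['three-dimensional']
```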
- XSLT (Extensible Stylesheet Language Transformations)
- The stylesheet language used to render a document marked up in XML, bearing a similar relation to XML as CSS does to HTML but being much more powerful and flexible. In the case of a printed book, the XSLT needs to generate the page layout and also collect the index terms, sort and format them, suppress duplication and add locators and lay out the resulting index.
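The index-building steps listed above (collect, sort, suppress duplication, add locators) can be sketched in Python as a stand-in for what an XSLT stylesheet would do. This is not XSLT, and the 'book', 'page' and 'indexterm' element names, the 'n' and 'sortas' attributes and the build_index function are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical paginated document with embedded index terms.
DOC = """<book>
  <page n="1"><p><indexterm>oranges</indexterm> and <indexterm>lemons</indexterm></p></page>
  <page n="2"><p><indexterm>oranges</indexterm> again, and more
    <indexterm>oranges</indexterm>;
    <indexterm sortas="three-dimensional">3-D</indexterm> models</p></page>
</book>"""

def build_index(xml_text):
    """Collect index terms, merge duplicates, sort them and attach locators."""
    root = ET.fromstring(xml_text)
    locators = defaultdict(set)  # a set suppresses repeated page numbers
    sort_keys = {}
    for page in root.iter("page"):
        number = int(page.get("n"))
        for term in page.iter("indexterm"):
            locators[term.text].add(number)
            sort_keys[term.text] = term.get("sortas", term.text).lower()
    lines = []
    for text in sorted(locators, key=sort_keys.get):
        pages = ", ".join(str(n) for n in sorted(locators[text]))
        lines.append(f"{text}, {pages}")
    return lines

index_lines = build_index(DOC)
print("\n".join(index_lines))
# lemons, 1
# oranges, 1, 2
# 3-D, 2
```

Page layout is omitted here; the sketch covers only the collection, sorting and locator steps.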