11 Implementation of the query system22

Compatibly with other query languages, the EQL defines the user a front-end interface and infers the query’s results from its semantics. However, a query language does not define any data structures or specify how the query engine is to be implemented. As mentioned in Chapter 2, a major user requirement was database portability, simple package installation, tolerable run times over complex queries, and a system that did not rely on external software at runtime. The only available back-end implementation that met those needs and was also available as an R package at the time was (R)SQLite (Hipp and Kennedy (2007), Wickham, James, and Falcon (2014)). As (R)SQLite is a relational database management system, emuR’s query system could not be implemented so as to use directly the primary data sources of an emuDB, that is, the JSON files described in Chapter 5. A syncing mechanism that maps the primary data sources to a relational form for querying purposes had to be implemented. This relational form is referred to as the emuDBcache in the context of an emuDB. The data sources are synchronized while an emuDB is being loaded and when changes are made to the annotation files. To address load time issues, we implemented a file check-sum mechanism which only reloads and synchronizes annotation files that have a changed MD5-sum (Rivest 1992). Figure 11.1 is a schematic representation of how the various emuDB interaction functions interact with either the file representation or the relational cache.

Schematic architecture of `emuDB` interaction functions of the `emuR` package. textcolor{three-color-c2}{Orange} paths show examples of functions interacting with the files of the `emuDB`, while  extcolor{three-color-c1}{green} paths show functions accessing the relational annotation structure. Actions like saving a changed annotation using the `EMU-webApp` first save the `_annot.json` to disk then update the relational annotation structure.

Figure 11.1: Schematic architecture of emuDB interaction functions of the emuR package. textcolor{three-color-c2}{Orange} paths show examples of functions interacting with the files of the emuDB, while extcolor{three-color-c1}{green} paths show functions accessing the relational annotation structure. Actions like saving a changed annotation using the EMU-webApp first save the _annot.json to disk then update the relational annotation structure.

Despite the disadvantages of cache invalidation problems, there are several advantages to having an object relational mapping between the JSON-based annotation structure of an emuDB and a relation table representation. One is that the user still has full access to the files within the directory structure of the emuDB. This means that external tools can be used to script, manipulate or simply interact with these files. This would not be the case if the files were stored in databases in a way that requires (semi-)advanced programming knowledge that might be beyond the capabilities of many users. Moreover, we can provide expert users with the option of using other relational database engines such as PostgreSQL, including all their performance-tweaking abilities, as their relational cache. This is especially valuable for handling very large speech databases.

The relational form of the annotation structure is split into six tables in the relational database to avoid data redundancy. The six tables are:

  1. emu_db: containing emuDB information (columns: uuid, name),
uuid name
0fc618dc-8980-414d-8c7a-144a649ce199 ae
  1. session: containing session information (columns: db_uuid, name),
db_uuid name
0fc618dc-8980-414d-8c7a-144a649ce199 0000
  1. bundle: containing bundle information (columns: db_uuid, session, name, annotates, sample_rate, md5\_annot\_json),
db_uuid session name annotates sample_rate md5_annot_json
0fc618dc-8980-414d-8c7a-144a649ce199 0000 msajc003 msajc003.wav 20000 785c7cdb6d4bd5e8b5cd7c56a5946ddf
  1. items: containing all annotation items of emuDB (columns: db_uuid, session, bundle, item_id, level, type, seq_idx, sample_rate, sample_point, sample_start, sample_dur),
db_uuid session bundle item_id level type seq_idx sample_rate sample_point sample_start sample_dur
0fc618dc-8980-414d-8c7a-144a649ce199 0000 msajc003 147 Phonetic SEGMENT 1 20000 NA 3749 1389
  1. labels: containing all labels belonging to all items (columns: db_uuid, session, bundle, item_id, label_idx, name, label), and
db_uuid session bundle item_id label_idx name label
0fc618dc-8980-414d-8c7a-144a649ce199 0000 msajc003 147 1 Phonetic V
  1. links: containing all links between annotation items of emuDB (columns: db_uuid, session, bundle, from_id, to_id, label).
db_uuid session bundle from_id to_id label
0fc618dc-8980-414d-8c7a-144a649ce199 0000 msajc003 8 7 NA
7 NA

While performing a query the engine uses an aggregate key to address every annotation item and its labels (db_uuid, session, bundle, item_id) and a similar aggregate key to dereference the links (db_uuid, session, bundle, from_id / to_id) which connect items. As the records in relational tables are not intrinsically ordered a further aggregate key is used to address the annotation item via its index and level (uuid, session, bundle, level / seq_idx). This is used, for example, during sequential queries to provide an ordering of the individual annotation items. It is worth noting that a plethora of other tables are created at query time to store various temporary results of a query. However, these tables are created as temporary tables during the query and are deleted on completion which means they are not permanently stored in the emuDBcache.

11.1 Query expression parser

The query engine parses an EQL query expression while simultaneously executing partial query expressions. This ad-hoc string evaluation parsing strategy is different from multiple other query systems which incorporate a query planner stage to pre-parse and optimize the query execution stage (e.g., Hipp and Kennedy (2007), Conway et al. (2016)). Although no pre-optimization can be performed, this strategy simplifies the execution of a query as it follows a constant heuristic evaluation strategy. This section describes this heuristic evaluation and parsing strategy based on the EQL expression [[Syllable == W -> Syllable == W] ^ [Phoneme == @ -> #Phoneme == s]].

The main strategy of the query expression parser is to recursively parse and split an EQL expression into left and right sub-expressions until a so-called Simple Query (SQ term is found and can be executed (see EBNF in Appendix 17 for more information on the elements comprising the EQL). This is done by determining the operator which is the first to be evaluated on the current expression. This operator is determined by the sub-expression grouping provided by the bracketing. Each sub-expression is then considered to be a fully valid EQL expression and once again parsed. Figure 11.2, which is split into seven stages (marked S1-S7), shows the example EQL expression being parsed (S1-S3) and the resulting items being merged to meet the requirements of the individual operator (S4-S6) of the original query. S1 to S3 show the splitting operator character (e.g., textcolor{three_color_c3}{->} in purple) which splits the expression into a textcolor{three_color_c1}{left} (green) and textcolor{three_color_c2}{right} (orange) sub-expression.

Example of how the query expression parser parses and evaluates an EQL expression and merges the result according to the respective EQL operators.

Figure 11.2: Example of how the query expression parser parses and evaluates an EQL expression and merges the result according to the respective EQL operators.

The result modifier symbol (#) is noteworthy for its extra treatment by the query engine as it places an exact copy of the items marked by it into its own intermediary result storage (see #sitems node on S7 in Figure 11.2). After performing the database operations necessary to do the various merging operation which are performed on the intermediary results, this storage is updated by removing items from it that are no longer present due to the merging operation. As a final step, the query engine evaluates if there are items present in the intermediary result storage created by the presence of the result modifier symbol. If so, these items are used to create an emuRsegs object by deriving the time information and extracting the necessary information from the intermediate result storage. If no items are present in the result modifier storage, the query engine uses the items provided by the final merging procedure in S3 instead (which is not the case in the example used in Figure 11.2).

A detailed description of how this query expression parser functions is presented in a pseudo code representation in Algorithms 11.3 and 11.423. For simplicity, this representation ignores the treatment of the result modifier symbol (#) and focuses on the parsing and evaluation strategy of the query expression parser. As stated previously, the presence of the result modifier before an SQ triggers the query engine to place a copy of the result of that SQ into an additional result table, which is then updated throughout the rest of the query. The starting point for every query is the query() function (see Algorithm 11.4). This function places the filtered items, links and labels entries that are relevant for the current query into temporary tables. Depending on which query terms and operators are found, the EQL query engine uses the various sub-routines displayed in Algorithms 11.3 and 11.4 to parse and evaluate the EQL expression.

Pseudo Code for Query Engine Algorithm - Part 1

Figure 11.3: Pseudo Code for Query Engine Algorithm - Part 1

Pseudo Code for Query Engine Algorithm - Part 2

Figure 11.4: Pseudo Code for Query Engine Algorithm - Part 2


  1. Sections of this chapter have been published in Winkelmann, Harrington, and Jänsch (2017).↩︎

  2. The R code that implements this pseudo code can be found here: https://github.com/IPS-LMU/emuR/blob/master/R/emuR-query.database.R.↩︎