Appendix 1: Glossary

accessory genome: The collection of genes that are not present in all genomes in a given set of genomes (see also core genome, pan-genome)

alignment: A method of comparing two or more sequences, usually displayed as rows in a matrix or array; may be either local or global

assembly: A construct formed by joining overlapping sequence reads into contiguous strings of sequence (contigs) that acts as a representation of the organism's genome; may be either a partial (draft) assembly or a complete (closed) assembly

bioinformatics: The processes and algorithms involved in managing and analysing sequencing data

BLAST: Basic Local Alignment Search Tool – an algorithm that uses local alignment to match sequences against a database for identification (see also local alignment)

consensus: The base called by the majority of overlapping sequences or reads at a given locus; can also be expressed as a percentage, e.g. 90% consensus = 9 out of every 10 reads called the same base at that particular locus

contig: A contiguous sequence formed by the consensus sequence from overlapping reads

core genome: The collection of genes that are present in all genomes in a given set of genomes (see also accessory genome, pan-genome)

coverage: The proportion of an entire genome sequence covered by mapped reads; sometimes also used to refer to "depth of coverage" (see depth)

depth: The number of overlapping reads that map to a given locus, e.g. 100x depth = 100 reads mapped to that particular locus

exome: The collective protein-coding regions of a genome

genome: The total genetic content of an organism

genomics: The study and analysis of genomes

global alignment: The process of aligning each locus in two or more sequences of similar length for comparison, resulting in an alignment of sequences of equal length (see also local alignment)

GUI: Graphical User Interface – an interactive program interface that utilises graphical elements such as icons and "point-and-click" options in a window to facilitate ease of use

GWAS: Genome-Wide Association Study – a study that attempts to match genetic alleles with a specified phenotype within a population

haploid: A single copy of each chromosome per genome (see also polyploid)

homopolymer: A sequence string of identical bases, e.g. GGGGGGGG or AAAA

horizontal gene transfer: Any lateral transfer of genetic material, including through conjugation, transduction, transformation or plasmid transfer

indel: A mutation that involves the insertion and/or deletion of bases

insertion sequence: A short sequence of DNA that can move from one position on a chromosome to another; also known as an IS element

local alignment: The process of aligning each locus in a shorter sequence against a corresponding section of a longer sequence or database for comparison (see also global alignment, BLAST)

locus: A position in a genome sequence occupied by a single nucleotide

metagenomics: The study and analysis of all the genomic content in a sample (frequently involves multiple organisms)

models of DNA evolution: Models that aim to predict the rate and direction of nucleotide substitution during evolution, e.g. Jukes & Cantor (JC69), Generalised Time-Reversible (GTR) and Hasegawa, Kishino & Yano (HKY85); usually incorporated into maximum likelihood and Bayesian methods to estimate a phylogeny

molecular clock: The assumed natural background rate of evolutionary change in a genome

N50: A genome assembly metric – the contig length at which the contigs of that length or longer together comprise ≥50% of the total sum of contig lengths, and the contigs of that length or shorter also together comprise ≥50% of the total sum of contig lengths

neighbour-joining: A rapid clustering method to draw a phylogenetic tree; probabilistic alternatives include maximum likelihood and Bayesian analysis

open source: Software whose source code is publicly available and free to use

pan-genome: The total collection of genes present in a given set of genomes. Pan-genome = core genome + accessory genome

Phred-score: A per-base score indicating the quality and accuracy of sequencing; Q10 = 1 error per 10 bases, Q20 = 1 error per 100 bases, Q30 = 1 error per 1000 bases

phylogeny: The history of the genetic evolution of a group of organisms; usually represented by a phylogenetic tree, where branches represent the genetic differences between organisms

polyploid: Multiple copies of each chromosome per genome (see also haploid)

read clipping: A post-sequencing bioinformatics quality-control process where raw sequence reads are filtered and trimmed to remove indexing adaptors attached to the DNA sequence during library preparation, and to remove low-quality sequence regions

read mapping: The process of aligning/matching sequence reads against corresponding regions of a reference genome

reads: The raw, usually overlapping, unanalysed sequence output from a genome sequencer that can be either mapped to a reference genome or assembled de novo into contigs; length varies depending on the sequencer and intended study

recombination: Any process involving the horizontal exchange of genes between two organisms

reference genome: A representative (usually complete/closed) genome

scaffold: A genome framework comprising several contigs separated by expected gaps of a known size

SNP: Single nucleotide polymorphism – a single base variation

transcriptome: The sum of all transcribed RNA produced by a set of organisms
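
The consensus and depth entries can be illustrated with a small calculation. The following is a minimal sketch only, assuming the base calls from all reads overlapping one locus have already been collected into a string; the function name `consensus_at_locus` and the example data are hypothetical and do not come from any particular tool.

```python
from collections import Counter


def consensus_at_locus(base_calls):
    """Return (consensus base, depth, percent consensus) for one locus.

    base_calls: a string or list with one base call per overlapping read.
    """
    depth = len(base_calls)  # depth = number of reads covering the locus
    if depth == 0:
        raise ValueError("no reads cover this locus")
    # The most frequently called base is taken as the consensus.
    # Note: with three or more competing bases this may be a plurality
    # rather than a strict majority, and ties are broken arbitrarily.
    base, count = Counter(base_calls).most_common(1)[0]
    return base, depth, 100 * count / depth


# Hypothetical example: 10 reads cover the locus, 9 call "A" and 1 calls "G".
base, depth, percent = consensus_at_locus("AAAAAAAAAG")
print(f"consensus={base} depth={depth}x percent={percent:.0f}%")
```

In this example the locus has 10x depth and a 90% consensus for A, matching the "9 out of every 10 reads" wording of the glossary entry.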
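
The N50 entry is easier to follow with a worked calculation. The sketch below assumes the assembly is available simply as a list of contig lengths; the function name `n50` and the example lengths are illustrative assumptions rather than the output of a specific assembler.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (in bases)."""
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig to the shortest, accumulating length.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # The first contig that brings the running sum to >= 50% of the
        # total assembly length defines the N50.
        if 2 * running >= total:
            return length


# Hypothetical assembly of seven contigs, 300 bases in total.
example = [80, 70, 50, 40, 30, 20, 10]
print(n50(example))  # 80 + 70 = 150, which is 50% of 300, so N50 = 70
```

Here the contigs of length 70 or longer sum to 150 (50% of the total), and the contigs of length 70 or shorter sum to 220, so both halves of the glossary definition hold.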
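
The relationship "pan-genome = core genome + accessory genome" maps directly onto set operations. The following is a minimal sketch, assuming each genome has already been reduced to a set of gene names; the function name `partition_pan_genome`, the strain labels and the gene lists are hypothetical examples.

```python
def partition_pan_genome(genomes):
    """Split the genes of several genomes into pan, core and accessory sets.

    genomes: dict mapping a genome name to an iterable of gene names.
    """
    gene_sets = [set(genes) for genes in genomes.values()]
    if not gene_sets:
        raise ValueError("at least one genome is required")
    pan = set().union(*gene_sets)        # genes present in any genome
    core = set.intersection(*gene_sets)  # genes present in every genome
    accessory = pan - core               # genes missing from at least one genome
    return pan, core, accessory


# Hypothetical gene content for three strains.
genomes = {
    "strain_A": {"gyrA", "recA", "blaTEM"},
    "strain_B": {"gyrA", "recA", "tetA"},
    "strain_C": {"gyrA", "recA"},
}
pan, core, accessory = partition_pan_genome(genomes)
print(sorted(core))       # genes shared by all three strains
print(sorted(accessory))  # genes found in only some strains
```

Note that the core and accessory sets depend entirely on which genomes are included: adding a genome can only shrink the core set or leave it unchanged.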
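
The Q10/Q20/Q30 values in the Phred-score entry follow from the standard relationship Q = -10 × log10(P), where P is the probability that a base call is wrong. The sketch below shows the conversion in both directions. The function names are hypothetical, and the decoding helper assumes the common Phred+33 ASCII encoding used in FASTQ files, which this glossary does not itself specify.

```python
import math


def phred_to_error_probability(q):
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a Phred quality score."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in the range (0, 1]")
    return -10 * math.log10(p)


def decode_phred33(quality_string):
    """Decode a quality string, ASSUMING Phred+33 ASCII encoding."""
    return [ord(char) - 33 for char in quality_string]


for q in (10, 20, 30):
    p = phred_to_error_probability(q)
    print(f"Q{q}: about 1 error per {round(1 / p)} bases")

# Hypothetical quality string: "+", "5", "?" and "I" encode Q10, Q20, Q30 and Q40.
print(decode_phred33("+5?I"))
```

Because the scale is logarithmic, each increase of 10 in the score corresponds to a tenfold drop in the expected error rate.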