Abstract
Emergent software systems are composed of elementary building blocks, many of which have variants that perform better or worse in different deployment contexts. Genetic Improvement (GI) for source code has been proposed for creating and curating collections of such blocks, but combining new code synthesis with genetic mutation and crossover results in large, complex search spaces. A range of methods have been proposed to aid such a search. In particular, the notion of species has appeared in the context of Genetic Algorithms (GAs), where it identifies individuals with similar genotypes in order to control competition, encourage the exploration of distant local optima, maintain diversity and avoid premature convergence. In this paper we examine a species definition for GI for source code, a domain with specific features: genotype similarity is largely irrelevant; distance between individuals is undefined; and the fitness landscape is extremely rugged. We propose a phenotypic species definition that captures an algorithm’s functional phenotypic characteristics while excluding its nonfunctional phenotypic characteristics (and its particular representation in source code). We introduce our proposal in a GI scenario for a hash table, where species are characterised by divergence in probability distributions.
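To make the idea of species characterised by divergence in probability distributions concrete, the following is a minimal illustrative sketch rather than the method used in this paper. It assumes that the functional phenotype of a hash function can be approximated by its empirical distribution over buckets, that Jensen-Shannon divergence is the divergence measure, and that individuals are grouped greedily under a fixed threshold. The names `bucket_distribution`, `js_divergence` and `assign_species`, the candidate functions and the threshold value are all hypothetical.

```python
import math
from collections import Counter


def bucket_distribution(hash_fn, keys, n_buckets):
    """Empirical probability of each bucket when hash_fn is applied to keys."""
    counts = Counter(hash_fn(k) % n_buckets for k in keys)
    return [counts.get(b, 0) / len(keys) for b in range(n_buckets)]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits; 0 for identical distributions, at most 1."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability in x contribute nothing to the sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return (kl(p, m) + kl(q, m)) / 2


def assign_species(distributions, threshold):
    """Greedy grouping (an assumption): an individual joins the first species
    whose representative lies within the threshold, else founds a new species."""
    representatives, labels = [], []
    for d in distributions:
        for i, r in enumerate(representatives):
            if js_divergence(d, r) <= threshold:
                labels.append(i)
                break
        else:
            representatives.append(d)
            labels.append(len(representatives) - 1)
    return labels


# Three hypothetical candidate hash functions over integer keys.
candidates = [
    lambda k: k,             # spreads keys evenly over the buckets
    lambda k: k + 3,         # different source code, same bucket distribution
    lambda k: (k // 4) * 4,  # concentrates keys in two buckets
]
keys = range(64)
dists = [bucket_distribution(f, keys, 8) for f in candidates]
labels = assign_species(dists, threshold=0.1)
print(labels)  # the first two candidates share a species, the third does not
```

In this sketch the first two candidates differ in their source code but fall into the same species because their bucket distributions coincide, while the third forms a species of its own. This mirrors the intent of excluding representation from the species definition, under the assumptions stated above.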