This Go project uses the genetic solver from my previous post to evolve a regular expression that matches all items in a set of wanted strings without matching any items in the set of unwanted strings. For example:
wanted := []string{"00", "01", "10"}
unwanted := []string{"11"}
We desire a regular expression that finds strings from wanted without finding those from unwanted. We’ll start by determining the unique set of characters in the wanted strings:
func getUniqueCharacters(wanted []string) string {
uniqueCharacters := make(map[string]bool)
characters := ""
for _, item := range wanted {
for i := 0; i < len(item); i++ {
token := item[i : i+1]
if !uniqueCharacters[token] {
characters += token
uniqueCharacters[token] = true
}
}
}
return characters
}
And we’ll combine those with a set of regex special characters to get our gene set:
const regexSpecials = "[]()|?*+" geneSet := getUniqueCharacters(wanted) + regexSpecials
Next, when we receive a candidate regex from the solver we’ll verify that it is a valid regex by trying to compile it.
func isValidRegex(candidate string) bool {
if strings.Contains(candidate, "()") || strings.Contains(candidate, "??") {
return false
}
_, err := regexp.Compile(candidate)
return err == nil
}
We also need a function to calculate the fitness of a given candidate based on the number of wanted and unwanted strings it matches.
func calculate(wanted, unwanted []string, geneSet string, candidate string) int {
if !isValidRegex(candidate) {
return math.MaxInt32
}
regex := regexp.MustCompile("^(" + candidate + ")$")
fitness := 0
for _, item := range wanted {
if !regex.MatchString(item) {
fitness++
}
}
for _, item := range unwanted {
if regex.MatchString(item) {
fitness += 10
}
}
return fitness
}
Next we’ll need a way to visualize what’s happening:
disp := func(candidate string) {
fmt.Println(candidate,
"\t",
calc(candidate),
"\t",
time.Since(start))
}
Finally we glue it all together as follows:
package main
import (
"github.com/handcraftsman/GeneticGo"
"fmt"
"math"
"regexp"
"strings"
"time"
)
const regexSpecials = "[]()|?*+"
func main() {
wanted := []string{"AL", "AK", "AS", "AZ", "AR"}
unwanted := []string{"AA"}
geneSet := getUniqueCharacters(wanted) + regexSpecials
calc := func(candidate string) int {
return calculate(wanted, unwanted, geneSet, candidate)
}
start := time.Now()
disp := func(candidate string) {
fmt.Println(candidate,
"\t",
calc(candidate),
"\t",
time.Since(start))
}
var solver = new(genetic.Solver)
solver.MaxSecondsToRunWithoutImprovement = 3
solver.LowerFitnessesAreBetter = true
solver.MaxRoundsWithoutImprovement = 10
var best = solver.GetBestUsingHillClimbing(calc, disp, geneSet, 10, 1, 0)
fitness := calc(best)
if fitness == 0 {
println("\nsolved with: " + best)
} else {
println("\nfailed to find a solution")
}
fmt.Print("Total time: ")
fmt.Println(time.Since(start))
}
Note: depending on the speed of your processor you may be able to reduce MaxSecondsToRunWithoutImprovement. For this problem I’ve set it to 1/10 second.
solver.MaxSecondsToRunWithoutImprovement = .1
Output:
$ go run samples/regex.go | 3 0 10 2 1.0001ms 1*0+ 1 131.0075ms 0*1?0* 0 335.0192ms solved with: 0*1?0* Total time: 335.0192ms
Cool!
Now, here’s a somewhat more interesting problem – repetition:
wanted := []string{"01", "0101", "010101"}
unwanted := []string{"0011"}
solver.MaxSecondsToRunWithoutImprovement = .1
Output:
$ go run samples/regex.go 0 3 1ms 01 2 1ms (01)+ 0 215.0123ms solved with: (01)+ Total time: 216.0123ms
Nailed it! And one more based on U.S. state mail codes:
wanted := []string{"AL","AK","AS","AZ","AR"}
unwanted := []string{"AA"}
solver.MaxSecondsToRunWithoutImprovement = .5
Output:
$ go run samples/regex.go R 5 0 AS 4 1.0001ms AK?L? 3 1.0560604s AS*L?K* 2 2.1301218s A[SKLZ] 1 2.62215s A[LRSZK] 0 3.1391795s solved with: A[LRSZK] Total time: 3.6392081s
The source code for this project is available from my GitHub repository.
Go on to part 5.
Very cool, I’ve spent the last 20 minutes playing with it; quite fun.
I was trying to figure out why I wasn’t getting repeatable results, and noticed a couple things.
1. Solver.quit is read/written in a way that isn’t guaranteed to be recognized across goroutines. You can use sync/atomic or a Mutex, but it’s likely that a WaitGroup, a channel, or maybe a Tomb (http://gopkgdoc.appspot.com/pkg/launchpad.net/tomb) along with some reorganization would be better.
2. You use println. Probably shouldn’t. fmt.Println is almost always a better choice. Or, in a library, an optionally null log.Logger might be better, as some users may not want things to go to stdout/stderr.
3. It’d be better to have the Solver maintain it’s own *rand.Rand instead of seeding the global one. In fact, some of the generator goroutines could probably have one to themselves and you could avoid some contention/locking.
4. incrementStrategyUseCount and getNextStrategyIndex seem to be using the strategySuccessSum and strategy counts in a racy way.
5. Incrementally building strings with bytes.Buffer tends to be much more efficient than loops with +=.
Thank you for your constructive comments. I’ll definitely take a look at all the things you suggested. I’ve opened them as issues: https://github.com/handcraftsman/GeneticGo/issues