# Edit distance implementation

Posted on

Problem

This problem is taken from the book Introduction to Algorithms, Third Edition By Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Clifford Stein:

In order to transform one source string of text x[1..m] to a target
string y[1..n], we can perform various transformation operations.
Our goal is, given x and y, to produce a series of transformations
that change x to y. We use an array ́z—assumed to be large enough to
hold all the characters it will need—to hold the intermediate results.
Initially, z is empty, and at termination, we should have z[j] = y[j]
for j = 1, 2,…, n. We maintain current indices i into x and j into
z, and the operations are allowed to alter z and these indices.
Initially, i = j = 1. We are required to examine every character in x
during the transformation, which means that at the end of the sequence
of transformation operations, we must have i = m + 1.

We may choose from among six transformation operations:

Copy a character from x to z by setting ́z[j] = x[i] and then incrementing both i and j. This operation examines x[i].

Replace a character from x by another character c, by setting z[j] = c, and then incrementing both i and j . This operation examines x[i].

Delete a character from x by incrementing i but leaving j alone. This operation examines x[i].

Insert the character c into z by setting z[j] = c and then incrementing j , but leaving i alone. This operation examines no
characters of x.

Twiddle (i.e., exchange) the next two characters by copying them from x to z but in the opposite order; we do so by setting z[j] = x[i
+ 1] and ́z[j + 1] = x[i] and then setting i = i + 2 and j = j + 2. This operation examines x[i] and x[i + 1].

Kill the remainder of x by setting i = m + 1. This operation examines all characters in x that have not yet been examined. This
operation, if performed, must be the final operation.

a. Given two sequences x[1..m] and y[1..n] and set of transformation
operation costs, the edit distance from x to y is the cost of the
least expensive operation sequence that transforms x to y. Describe a
dynamic-programming algorithm that finds the edit distance from
x[1..m] to y[1..n] and prints an optimal operation sequence. Analyze
the running time and space requirements of your algorithm

My solution is O(nm+m+n)$O\left(nm+m+n\right)$$O(nm + m+n)$.

## Code:

extern crate rand;

use std::collections::HashMap;
use std::collections::VecDeque;

#[derive(Debug, Copy, Clone, PartialEq, Eq, Hash)]
enum EditOp {
Copy,
Replace,
Delete,
Insert,
Twiddle,
Kill,
}

impl EditOp {
fn valid(&self,
i       : usize,
j       : usize,
iprev   : char,
jprev   : char,
icurrent: char,
jcurrent: char) -> bool {

let threshold = |p| i != p || j == p;
match *self {
EditOp::Copy    => threshold(1)
&& icurrent == jcurrent,

EditOp::Replace => threshold(1),
EditOp::Delete  => i > 1,

EditOp::Twiddle => i > 1 && j > 1
&& threshold(2)
&& icurrent == jprev
&& iprev == jcurrent,
_ => true,
}
}

fn displacement(&self) -> (usize, usize) {
match *self {
EditOp::Copy    => (1, 1),
EditOp::Replace => (1, 1),
EditOp::Delete  => (1, 0),
EditOp::Insert  => (0, 1),
EditOp::Twiddle => (2, 2),
EditOp::Kill    => (0, 0),
}
}

fn apply(&self,
from_itr: &mut std::str::Chars,
to_itr  : &mut std::str::Chars,
result  : &mut String) {
match *self {
EditOp::Copy    => {
result.push(from_itr.next().unwrap());
to_itr.next();

},
EditOp::Replace => {
result.push(to_itr.next().unwrap());
from_itr.next();
},
EditOp::Delete  => {
from_itr.next();
},
EditOp::Insert  => {
result.push(to_itr.next().unwrap());
},
EditOp::Twiddle => {
let second = from_itr.next().unwrap();
let first = from_itr.next().unwrap();
result.push(first);
result.push(second);
to_itr.next();
to_itr.next();
},
EditOp::Kill => {},
}
}
}

fn create_tables(from_len: usize,
to_len  : usize,
costs   : &HashMap<EditOp, usize>)
-> (Vec<Vec<EditOp>>, Vec<Vec<usize>>) {

let mut op_table   = vec![ vec![EditOp::Kill; to_len]; from_len ];
let mut cost_table = vec![ vec![0usize; to_len]; from_len ];
let delete_cost    = costs.get(&EditOp::Delete).map(|c| *c).unwrap_or(0);

for (i, c) in (1.. from_len).map(|i|(i, i * delete_cost)) {
cost_table[i][0] = c;
op_table[i][0]   = EditOp::Delete;
}
(op_table , cost_table)
}

fn edit_distance(from : &str,
to   : &str,
costs: &HashMap<EditOp, usize>)
-> (usize, VecDeque<EditOp>){

if costs.is_empty() {
panic!("map of costs is empty");
}

let from_len = from.chars().count();
let to_len   = to.chars().count();

let (mut op_table, mut cost_table) = create_tables(from_len + 1,
to_len + 1,
costs);
let (mut lasti, mut lastj)         = (' ', ' ');

for (i, ichar) in from.chars().enumerate() {
for (j, jchar) in to.chars().enumerate() {
let (i, j) = (i + 1, j + 1);
let (op, cost) = costs.iter()
.filter(|&(x, _)| !x.eq(&EditOp::Kill))
.filter(|&(x, _)| x.valid(i, j, lasti, lastj,
ichar, jchar))
.map(|(x, c)| {
let (di, dj) = x.displacement();
(x, cost_table[i - di][j - dj] + c)
})
.min_by_key(|&(_, c)| c)
.unwrap();

cost_table[i][j] = cost;
op_table[i][j] = *op;
lastj = jchar;
}
lasti = ichar;
}

let no_kill_cost    = cost_table[from_len][to_len];
let (i, cost, kill) = costs.get(&EditOp::Kill)
.map(|kc| min_with_kill(&cost_table,
from_len,
to_len, *kc))
.unwrap_or((from_len, no_kill_cost, None));

(cost, build_operations(kill, &op_table, i, to_len))
}

fn edit(from: &str, to: &str, ops: &VecDeque<EditOp>) -> String {
let mut result   = String::new();
let mut from_itr = from.chars();
let mut to_itr   = to.chars();

for op in ops {
op.apply(&mut from_itr, &mut to_itr, &mut result);
}
result
}

fn min_with_kill(cost_table: &Vec<Vec<usize>>,
from_size : usize,
to_size   : usize,
kill_cost : usize ) -> (usize, usize, Option<EditOp>) {

let no_kill_cost  = cost_table[from_size][to_size];
(1..from_size).map(|i|(i, cost_table[i][to_size] + kill_cost) )
.map(|(i, c)|(i, c, Some(EditOp::Kill)))
.chain(Some((from_size, no_kill_cost, None)).into_iter())
.min_by_key(|&(_, cost, _)| cost).unwrap()
}

fn build_operations(kill    : Option<EditOp>,
op_table: &Vec<Vec<EditOp>>,
i       : usize,
j       : usize) -> VecDeque<EditOp> {

let itr = std::iter::repeat(()).scan((i, j), |s,_| {
let op       = op_table[s.0][s.1];
let (id, jd) = op.displacement();
*s = (s.0 - id, s.1 - jd);
Some(op)
}).take_while(|op| !op.eq(&EditOp::Kill));

let mut stack = VecDeque::new();
for op in kill.into_iter().chain(itr) {
stack.push_front(op);
}
stack
}

fn main() {

let mut map = HashMap::new();
map.insert(EditOp::Copy, 2);
map.insert(EditOp::Replace, 3);
map.insert(EditOp::Delete, 2);
map.insert(EditOp::Insert, 4);
map.insert(EditOp::Twiddle, 1);
map.insert(EditOp::Kill, 1);

let from = "▓▓▓▓n
▒▒▒▓▓n
▒▒▒▒▒▓n
▒▒▒▒▒▒▓n
▒▒▒▒▒▒▓n
▒▒▒▒▒▒▒▓n
▒▒▒▒▒▒▒▓▓▓n
▒▓▓▓▓▓▓░░░▓n
▒▓░░░░▓░░░░▓n
▓░░░░░░▓░▓░▓n
▓░░░░░░▓░░░▓n
▓░░▓░░░▓▓▓▓n
▒▓░░░░▓▒▒▒▒▓n
▒▒▓▓▓▓▒▒▒▒▒▓n
▒▒▒▒▒▒▒▒▓▓▓▓n
▒▒▒▒▒▓▓▓▒▒▒▒▓n
▒▒▒▒▓▒▒▒▒▒▒▒▒▓n
▒▒▒▓▒▒▒▒▒▒▒▒▒▓n
▒▒▓▒▒▒▒▒▒▒▒▒▒▒▓n
▒▓▒▓▒▒▒▒▒▒▒▒▒▓n
▒▓▒▓▓▓▓▓▓▓▓▓▓n
▒▓▒▒▒▒▒▒▒▓n
▒▒▓▒▒▒▒▒▓";

let to = "│▒│ /▒/n
│▒│/▒/n
│▒/▒/─┬─┐n
│▒│▒|▒│▒│n
┌┴─┴─┐-┘─┘n
│▒┌──┘▒▒▒│n
└┐▒▒▒▒▒▒  ";

let (c, s) = edit_distance(&from, &to, &map);
let result = edit(&from, &to, &s);

println!("cost = {}", c);
println!("{:?}", s);
println!("{}", result);
}


## Test:

use rand::Rng;

fn random_string(size: usize) -> String {
(0..size).map(|_|rand::random::<char>()).collect()
}
#[test]
fn test_for_accurate_edition(){

let mut map = HashMap::new();
map.insert(EditOp::Copy, 2);
map.insert(EditOp::Replace, 3);
map.insert(EditOp::Delete, 2);
map.insert(EditOp::Insert, 4);
map.insert(EditOp::Twiddle, 1);
map.insert(EditOp::Kill, 1);

let size_generator = || rand::thread_rng().gen_range(100, 500);
for _ in 0..100 {
let from = random_string(size_generator());
let to = random_string(size_generator());
let (_, s) = edit_distance(&from, &to, &map);
let result = edit(&from, &to, &s);
assert_eq!(to, result);
}
}


Solution

Here’s a lightweight review…

1. You can combine imports with {}: use std::collections::{HashMap, VecDeque};
2. Consider importing enum variants in methods that exhaustively match on a lot of them. This helps avoid duplication and rightward pushing of code.
3. Accept a generic type that implements an iterator, instead of the specific std::str::Chars.
4. Introduce more and unique types, especially when you have repeated parallel names (i, iprev, icurrent). This also allows you to move functionality into methods.
5. Introduce type aliases. This is a tiny step towards official types, but gives an opportunity to introduce names and shorten them.
6. Instead of VecDeque, use a normal Vec, then you can use Iterator::collect and reverse it.
7. Use a == &b instead of a.eq(&b).
8. I always recommend using expect instead of unwrap. When the panic happens, you will be much happier having some context of where the failure is.
9. Maybe introduce a dedicated struct for the cost mapping. Then you can implement Default, and there’s no chance of missing a key.
10. Use &[T] instead of &Vec<T>. Faster and more usable.
11. [Clippy] Combine identical match arms with Foo | Bar.
12. [Clippy] Use map_or instead of map().unwrap_or.
13. [Clippy] Can use cloned instead of |x| *x.
14. [Clippy] Calls to edit and edit_distance have extra references.
15. Check out rustfmt, which, let’s say disagrees, with some of your style choices.

Unfortunately, I found the code very dense, which made reading over it fairly difficult. It’s possible some more verbose names could help. Here’s a cherry-picked sample of code that takes a second to mentally process :

.map(|(x, c)| {
let (di, dj) = x.displacement();
(x, cost_table[i - di][j - dj] + c)
})


extern crate rand;

use std::collections::HashMap;

#[derive(Debug, Copy, Clone, PartialEq, Eq, Hash)]
enum EditOp {
Copy,
Replace,
Delete,
Insert,
Twiddle,
Kill,
}

#[derive(Debug, Copy, Clone)]
struct FindBetterName {
idx: usize,
prev: char,
current: char,
}

impl FindBetterName {
fn valid_twiddle(&self, other: &FindBetterName) -> bool {
self.idx > 1 && other.idx > 1 && self.current == other.prev && self.prev == other.current
}
}

impl EditOp {
fn valid(&self, i: FindBetterName, j: FindBetterName) -> bool {
use EditOp::*;

let threshold = |p| i.idx != p || j.idx == p;

match *self {
Copy => threshold(1) && i.current == j.current,
Replace => threshold(1),
Delete => i.idx > 1,
Twiddle => threshold(2) && i.valid_twiddle(&j),
_ => true,
}
}

fn displacement(&self) -> (usize, usize) {
use EditOp::*;

match *self {
Copy | Replace => (1, 1),
Delete => (1, 0),
Insert => (0, 1),
Twiddle => (2, 2),
Kill => (0, 0),
}
}

fn apply<F, T>(&self, mut from_itr: F, mut to_itr: T, result: &mut String)
where F: Iterator<Item = char>,
T: Iterator<Item = char>
{
use EditOp::*;

match *self {
Copy => {
result.push(from_itr.next().unwrap());
to_itr.next();
}
Replace => {
result.push(to_itr.next().unwrap());
from_itr.next();
}
Delete => {
from_itr.next();
}
Insert => {
result.push(to_itr.next().unwrap());
}
Twiddle => {
let second = from_itr.next().unwrap();
let first = from_itr.next().unwrap();
result.push(first);
result.push(second);
to_itr.next();
to_itr.next();
}
Kill => {}
}
}
}

type CostMap = HashMap<EditOp, usize>;

fn create_tables(from_len: usize,
to_len: usize,
costs: &CostMap)
-> (Vec<Vec<EditOp>>, Vec<Vec<usize>>) {
let mut op_table = vec![ vec![EditOp::Kill; to_len]; from_len ];
let mut cost_table = vec![ vec![0usize; to_len]; from_len ];
let delete_cost = costs.get(&EditOp::Delete).cloned().unwrap_or(0);

for (i, c) in (1..from_len).map(|i| (i, i * delete_cost)) {
cost_table[i][0] = c;
op_table[i][0] = EditOp::Delete;
}

(op_table, cost_table)
}

fn edit_distance(from: &str, to: &str, costs: &CostMap) -> (usize, Operations) {
if costs.is_empty() {
panic!("map of costs is empty");
}

let from_len = from.chars().count();
let to_len = to.chars().count();

let (mut op_table, mut cost_table) = create_tables(from_len + 1, to_len + 1, costs);
let (mut lasti, mut lastj) = (' ', ' ');

for (i, ichar) in from.chars().enumerate() {
for (j, jchar) in to.chars().enumerate() {
let (i, j) = (i + 1, j + 1);
let ii = FindBetterName {
idx: i,
prev: lasti,
current: ichar,
};
let jj = FindBetterName {
idx: j,
prev: lastj,
current: jchar,
};

let (op, cost) = costs.iter()
.filter(|&(x, _)| x != &EditOp::Kill)
.filter(|&(x, _)| x.valid(ii, jj))
.map(|(x, c)| {
let (di, dj) = x.displacement();
(x, cost_table[i - di][j - dj] + c)
})
.min_by_key(|&(_, c)| c)
.unwrap();

cost_table[i][j] = cost;
op_table[i][j] = *op;
lastj = jchar;
}
lasti = ichar;
}

let no_kill_cost = cost_table[from_len][to_len];
let (i, cost, kill) = costs.get(&EditOp::Kill)
.map_or((from_len, no_kill_cost, None), |kc| {
min_with_kill(&cost_table, from_len, to_len, *kc)
});

(cost, build_operations(kill, &op_table, i, to_len))
}

type Operations = Vec<EditOp>;
type OperationsSlice<'a> = &'a [EditOp];

fn edit(from: &str, to: &str, ops: OperationsSlice) -> String {
let mut result = String::new();
let mut from_itr = from.chars();
let mut to_itr = to.chars();

for op in ops {
op.apply(&mut from_itr, &mut to_itr, &mut result);
}
result
}

fn min_with_kill(cost_table: &[Vec<usize>],
from_size: usize,
to_size: usize,
kill_cost: usize)
-> (usize, usize, Option<EditOp>) {

let no_kill_cost = cost_table[from_size][to_size];
(1..from_size)
.map(|i| (i, cost_table[i][to_size] + kill_cost))
.map(|(i, c)| (i, c, Some(EditOp::Kill)))
.chain(Some((from_size, no_kill_cost, None)).into_iter())
.min_by_key(|&(_, cost, _)| cost)
.unwrap()
}

fn build_operations(kill: Option<EditOp>,
op_table: &[Vec<EditOp>],
i: usize,
j: usize)
-> Operations {

let itr = std::iter::repeat(())
.scan((i, j), |s, _| {
let op = op_table[s.0][s.1];
let (id, jd) = op.displacement();
*s = (s.0 - id, s.1 - jd);
Some(op)
})
.take_while(|op| op != &EditOp::Kill);

let mut stack: Vec<_> = kill.into_iter().chain(itr).collect();
stack.reverse();
stack
}

fn main() {
let mut cost_map = HashMap::new();
cost_map.insert(EditOp::Copy, 2);
cost_map.insert(EditOp::Replace, 3);
cost_map.insert(EditOp::Delete, 2);
cost_map.insert(EditOp::Insert, 4);
cost_map.insert(EditOp::Twiddle, 1);
cost_map.insert(EditOp::Kill, 1);

let from = "▓▓▓▓n
▒▒▒▓▓n
▒▒▒▒▒▓n
▒▒▒▒▒▒▓n
▒▒▒▒▒▒▓n
▒▒▒▒▒▒▒▓n
▒▒▒▒▒▒▒▓▓▓n
▒▓▓▓▓▓▓░░░▓n
▒▓░░░░▓░░░░▓n
▓░░░░░░▓░▓░▓n
▓░░░░░░▓░░░▓n
▓░░▓░░░▓▓▓▓n
▒▓░░░░▓▒▒▒▒▓n
▒▒▓▓▓▓▒▒▒▒▒▓n
▒▒▒▒▒▒▒▒▓▓▓▓n
▒▒▒▒▒▓▓▓▒▒▒▒▓n
▒▒▒▒▓▒▒▒▒▒▒▒▒▓n
▒▒▒▓▒▒▒▒▒▒▒▒▒▓n
▒▒▓▒▒▒▒▒▒▒▒▒▒▒▓n
▒▓▒▓▒▒▒▒▒▒▒▒▒▓n
▒▓▒▓▓▓▓▓▓▓▓▓▓n
▒▓▒▒▒▒▒▒▒▓n
▒▒▓▒▒▒▒▒▓";

let to = "│▒│ /▒/n
│▒│/▒/n
│▒/▒/─┬─┐n
│▒│▒|▒│▒│n
┌┴─┴─┐-┘─┘n
│▒┌──┘▒▒▒│n
└┐▒▒▒▒▒▒  ";

let (c, s) = edit_distance(from, to, &cost_map);
let result = edit(from, to, &s);

println!("cost = {}", c);
println!("{:?}", s);
println!("{}", result);
}


It seems my solution is incorrect for some inputs.

The main problem is in the valid and create_tables functions.

For some reason I added a several conditions in valid so that i will reach zero always after j and this is wrong. For example "a" -> "ha" with insert and copy should be valid. But for this to work properly all j so that i == 0 most have insert in the table (The only way to edit an empty string is with insert).

So valid and create_table will end up like this:

fn create_tables(from_len: usize,
to_len  : usize,
costs   : &HashMap<EditOp, usize>)
-> (Vec<Vec<EditOp>>, Vec<Vec<usize>>) {

let mut op_table   = vec![ vec![EditOp::Kill; to_len]; from_len ];
let mut cost_table = vec![ vec![0usize; to_len]; from_len ];
let delete_cost    = *costs.get(&EditOp::Delete).expect("delete most be in costs");
let insert_cost    = *costs.get(&EditOp::Insert).expect("insert most be in costs");

for (i, c) in (1.. from_len).map(|i|(i, i * delete_cost)) {
cost_table[i][0] = c;
op_table[i][0]   = EditOp::Delete;
}
for (j, c) in (1.. to_len).map(|j|(j, j * insert_cost)) {
cost_table[0][j] = c;
op_table[0][j]   = EditOp::Insert;
}
(op_table , cost_table)
}

fn valid(&self,
i       : usize,
j       : usize,
iprev   : char,
jprev   : char,
icurrent: char,
jcurrent: char) -> bool {

match *self {
EditOp::Copy    => icurrent == jcurrent,
EditOp::Twiddle => i > 1 && j > 1
&& icurrent == jprev
&& iprev == jcurrent,
_ => true,
}
}