Boundary
We say a boundary is some condition that splits an identifier into a list of words.
Some boundaries discard a substring at the location where they match (a delimiter character, for example), and others do not.
A boundary consists of three parts:
- the condition: a function that takes the slice of the identifier from location i to the end of the identifier, and returns true if the boundary condition is met
- start: the index of the start of the boundary, relative to the slice of the identifier
- len: the length of the boundary
To identify the locations of an identifier that contain boundaries, we iterate through substrings of the identifier. For i from 0 to the length of identifier I, we create a slice S = I[i:]. If this slice meets the condition, we have identified a split in the identifier. The end of the previous word is at index i+start (exclusive) and the start of the next word is at index i+start+len (inclusive).
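The iteration described above can be sketched in Python. This is a minimal illustration, not a reference implementation: the function name find_splits, its parameter names, and the hyphen delimiter used in the usage lines are all assumptions made for the example.

```python
from typing import Callable, List, Tuple


def find_splits(
    identifier: str,
    condition: Callable[[str], bool],
    start: int,
    length: int,
) -> List[Tuple[int, int]]:
    """Return (end_of_previous_word, start_of_next_word) for each boundary hit."""
    splits = []
    for i in range(len(identifier)):
        s = identifier[i:]  # the slice S = I[i:]
        if condition(s):
            # The previous word ends at i + start (exclusive);
            # the next word begins at i + start + length (inclusive).
            splits.append((i + start, i + start + length))
    return splits


# Hypothetical hyphen delimiter: start 0, length 1.
hyphen_splits = find_splits("a-bc-d", lambda s: s[0] == "-", 0, 1)
print(hyphen_splits)  # [(1, 2), (4, 5)]
```

Each pair gives the exclusive end of one word and the inclusive start of the next, so the characters between the two indices are the ones the boundary discards.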
Splitting
We say splitting is the process of converting an identifier into a list of words.
In pseudocode, you could implement splitting by a boundary b as follows.
function split(I: Identifier, b: Boundary) {
    words = []
    last_word_start = 0
    n = length(I)
    for i in 0..n {
        S = I[i:]    // slice from index i to the end of the identifier
        if b.condition(S) {
            last_word_end = i + b.start
            words.append( I[last_word_start:last_word_end] )
            last_word_start = i + b.start + b.len
        }
    }
    // everything after the last boundary is the final word
    words.append(I[last_word_start:])
    return words
}
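For readers who want to run the pseudocode, here is one possible Python translation. It is a sketch under a few assumptions: the Boundary dataclass and its field names are illustrative (len is renamed to length so it does not clash with the Python builtin), and the identifier is passed in explicitly as an argument.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Boundary:
    condition: Callable[[str], bool]  # receives the slice I[i:]
    start: int   # offset of the boundary within the slice
    length: int  # number of characters the boundary discards


def split(identifier: str, b: Boundary) -> List[str]:
    words = []
    last_word_start = 0
    for i in range(len(identifier)):
        if b.condition(identifier[i:]):
            last_word_end = i + b.start
            words.append(identifier[last_word_start:last_word_end])
            last_word_start = i + b.start + b.length
    # Everything after the last boundary is the final word.
    words.append(identifier[last_word_start:])
    return words


# Hypothetical dot delimiter, used only to exercise the function.
dot = Boundary(condition=lambda s: s[0] == ".", start=0, length=1)
print(split("a.b.c", dot))  # ['a', 'b', 'c']
```

An identifier with no boundary at all comes back as a single-element list, because the final append always runs.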
Delimiter Example
Suppose we wanted to split a snake case identifier into words. The boundary can be defined as follows:
- condition: is the first character in the slice equal to an underscore? S[0] == "_"
- start: 0
- len: 1
Examine the identifier I = last_byte_count. This boundary condition would be true at index 4. This means the start of the boundary would be at 4 + start, or 4. The length of the boundary is 1, so the end of the boundary is 4 + 1, or 5. Using these values to split the word, we would get I[:4] = last and I[5:] = byte_count. We would continue until we reach the end of the string.
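The arithmetic in this example can be checked with plain Python slices. The variable names below (hits, head, tail) are chosen for illustration only.

```python
identifier = "last_byte_count"
start, length = 0, 1  # boundary parameters from the definition above

# Indices i where the slice identifier[i:] begins with an underscore.
hits = [i for i in range(len(identifier)) if identifier[i:][0] == "_"]

first = hits[0]
head = identifier[: first + start]           # end of the previous word, exclusive
tail = identifier[first + start + length:]   # start of the next word, inclusive

print(hits)  # [4, 9]
print(head)  # last
print(tail)  # byte_count
```

The second hit, at index 9, is the boundary reached when we continue through the rest of the string; it separates byte from count.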
Camel Case Example
Camel case is delimited by an empty string. We can define the boundary as follows:
- condition: is the first character lowercase and the second character uppercase? S[0].is_lower() and S[1].is_upper()
- start: 1
- len: 0
Because our condition checks the character before and after the split, we set a start of 1. And we don’t want to remove any characters while splitting, so the length is 0.
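A small Python sketch of the camel case boundary follows. The helper names split and lower_then_upper are assumptions for this example. One detail the sketch adds is a length guard: the final slice contains a single character, so the condition has to confirm that a second character exists before inspecting it.

```python
from typing import Callable, List


def split(identifier: str, condition: Callable[[str], bool],
          start: int, length: int) -> List[str]:
    words = []
    last_word_start = 0
    for i in range(len(identifier)):
        if condition(identifier[i:]):
            words.append(identifier[last_word_start:i + start])
            last_word_start = i + start + length
    words.append(identifier[last_word_start:])
    return words


def lower_then_upper(s: str) -> bool:
    # Guard: the last slice has only one character, so s[1] would not exist.
    return len(s) >= 2 and s[0].islower() and s[1].isupper()


camel_words = split("lastByteCount", lower_then_upper, start=1, length=0)
print(camel_words)  # ['last', 'Byte', 'Count']
```

With a length of 0 no characters are dropped, so joining the resulting words reproduces the original identifier.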