Search Pattern Improvement
Vysci opened this issue · 1 comments
Currently search patterns is per word basis, enhance it to support phrases (including spaces)
I've been thinking about how to do this and have started really looking at the code and how the message parsing works. Ive got a rough solution coded out but still needs alot of work.
Here's a couple of notes on the issue:
- The locale specific "tagStrings"(what i call them) are already space separated, so we'd need another special/magic char to indicate "space"
- my preference is to use
+
since it makes the most sense. - i.e; the message should match "this" plus "that", or "other =>
tagString = "this+that other"
- my preference is to use
- We could use a word version of a Suffix Trie, for "efficient" pattern matching where we want to find sequences of words.
- This might only be viable for roman and cyrillic alphabets (maybe korean but thats not supported by addon atm).
- I think an actual suffix trie might work for chinese since each "character" is usually a word.
- The Current algorithm for categorizing looks something like:
- pre-process tags and create a
[tag] => CategoryKey
map for everything inTags.lua
- New chat messages passed to
GBB.GetMessageWordList
(formerly named SplitToNoNb)- creates differently processed versions (9) of the message. ie numbers removed, punctuation removed, spaces over punctuations, etc
- splits every processed copy @ word level and collects all unique words as a list
- list of words from the message are iterated and checked against the
[tag] => CategoryKey
map to see what categories matches the message. - limitations:
- the map only supports single word keys, in order to support multiple words there would need to be another copied string that splits the message at every other "space". ie
"this is a test" => "this is", "is a", "a test"
- the map only supports single word keys, in order to support multiple words there would need to be another copied string that splits the message at every other "space". ie
- pre-process tags and create a
- The trie based algorithm works similarly.
- process
Tags.lua
and create Trie data structure for all phrases. A phrase would be 1 or more words connected by+
(or any punctuation) - in the case where there a no multi word sequences the trie pretty much behaves like the previous map.
[tag] => CategoryKey
- however when a build the trie if a tag with sequence is seen (say
"molten+core mc rag"
): - Incoming message is processed in similarly to before, instead of making an entire copy of the string, we only copy the bits a string and append to the original.
- eg: before
"lf1m zul'aman" -> ["lf1m zul'aman", "lf m zul'aman", "lfm zul'aman", "lf1m zul aman", "lf1m zulaman]
(separate strings to parse through) - after
'"lf1m zul'aman" -> "lf1m lf m lfm zul'aman zul aman zulaman"
1 string - limitations:
- not as accurate since its processing alot less string variations (can be improved with better regex pattern [WIP])
- process
- There needs to be some guidelines for how tags should be written:
- ex1: incoming message have all their "apostrophes" normalized so
"zul'aman zul`aman zul´aman
are redundant tags. - ex2: We should make note that any punctuation characters on incoming messages (barring chinese punctuation) are replaced with a single space.
- Therefore:
"zul'aman, zul`aman" => zul aman
- so tags should just use the sequence of
zul+aman
as that would match any form ofzul{%p}aman
where%p
is an appropriate punctuation character.
- Therefore:
- For the time being we can do some backward compatibility when pre-processing tags to build the Trie by treating [
´
] as the+
character which we will be using for joining sequences/chains of words as single patterns for categorization. - ex3: Incoming message are also "transliterated" which removes any accents from latin characters. For this reason, there makes no sense in having tags with accents in them
- ex1: incoming message have all their "apostrophes" normalized so
performance profiling
Trie (with no sequences) vs Regular Map. Note similar performance scaling with slight overhead. The scaling remains linear with sequences added since the incomming message is parsed incrementally.
GBB.GetMessageWordList optimization (slightly less accurate but better performant)