string searching algorithm

artem · Sep 3, 2003

Hi guys

I have got the task where single string has to be matched and it is identifier must be returned . Implementation via binary tree is not acceptable as it takes memory a lot . Hashing can be done but the problem is that set of string is undertermined and can be changed during live program execution , so I have to maintain flexible collision list for hashing algorithm .
Could someone recommend hash function with predetermined number of collisions for undetermined number input elements (strings) where string length can vary in range 2-32 characters . There could be up to 300 strings to be matched and algorithm must be quick enough to handle matching in real time . String length can vary between 2- 32 characters .
Special literature for hashing will also be appreciated .

Current implementation uses hybrid n-sized tree with linked element list for each tree node where sequential search is done to find match for current n-th position character . This algorith has deterministic time for execution buit it is quite long and inefficient .

Thank you in advance

luben111 · Sep 3, 2003

Hello,

I made some time ago a project , where I had to detect 60 different strings with different lenght and quickly to reply. Actually the time for responses was so limited that I really was in trouble.

The solution was more then simple - CRC16 - I make CRC 16 of the comming string of data. When I get CR/LF I compare the CRC with the etalons in ROM... works great.

If you need more reliable working (actually CRC16 is more then enought) you can use CRC32.

Using table approach makes calculation of CRC for single byte 10-12 instructions!

Hope this helps
Luben

artem · Sep 4, 2003

Hi Luben

Thank you for reply . It is interresting , but littled bit different from what I am dealing . May be I did not explain this correctly , input string could be one which is not in the hash table , so hashing function should recognize that it is not in the table , even if key produced by hash function will match string already inserted into the hash table.

For fixed number of input string there is simple algorithm provided by GNU , generates C source code needed for predefined set of strings - gpref utility , but it is not suitable for my case .

luben111 · Sep 4, 2003

Maybe I didn't explain in details:

You compare not strings - char, by char, but you compare whole strings (finished with CR/LF). You compare in fact two 16 bit digits - very simple and fast comparision.

There is no difference how you compare - char by char, or you get CRC16 of the comming string and you compare it with the etalons in ROM (for every string 2 bytes.. that's much less then to keep whole strings).

You don't need to knwo exactly the input string - you just will know that it matches the strings (their CRC) in table or not. You can add new strings too...

I agree that the approach sounds crazy, but really works reliable in my project for GSM module control.

Regards
Luben

artem · Sep 4, 2003

Hi Luben

I know that I will firstly produce CRC16 of input string to be matched , then I can index it via table (but as it is 2 bytes table will need 65535 entries ) or compare one by one all CRC codes precalculated for strings to be matched and stored as constants in ROM .

I ment that the CRC code computed for one string (let say "+CMTI" unsolicited response in your appl) is not unique and there could be other string which gives exactly the same CRC code . If this string
(which has the same CRC has code as the
registered in hash table "+CMTI" string)
is not in the hash table , output from CRC function will point to the +CMTI string .
When input set of strings apriori is not deterministic, there is probability that CRC code got from hash function could
point to registered one - collision case .
Another question ,does CRC16 algorithm perfectly distribute all GSM Modem AT response strings into unique key's and if there are collisions ,how many collisions you have for all GSM Modem response strings?

luben111 · Sep 4, 2003

Hello,

Seems that you misundersand me....

>>>I know that I will firstly produce CRC16 of input string to be matched , then I can index it via table (but as it is 2 bytes table will need 65535 entries ) or compare one by one all CRC codes precalculated for strings to be matched and stored as constants in ROM .

No... I didn't mean that. I have CRC16 calculator, so I can easy calculate (by hand or with program) of all desired strings the CRC (CRC16 calculator is somewhere posted in electroda like file). So, if I have to search within 300 strings, I need 300 16 bit words = 600 bytes!!!! No one of the known algorythms can produce such compact and quick compare.. no one!

>>>I ment that the CRC code computed for one string (let say "+CMTI" unsolicited response in your appl) is not unique and there could be other string which gives exactly the same CRC code . If this string
(which has the same CRC has code as the
registered in hash table "+CMTI" string)
is not in the hash table , output from CRC function will point to the +CMTI string .

I know what you meant... but the possibility 2 strings to have same CRC16 is exactly 1/65000. If you use CRC32 practically you'll never have problems.

>>>Another question ,does CRC16 algorithm perfectly distribute all GSM Modem AT response strings into unique key's and if there are collisions ,how many collisions you have for all GSM Modem response strings?

No one... I never had problems. I used AT89C2041 to track over 60 responses with very different lenght. In addition this was on 12 MHz (1 MIPS) - no one algorythm was able to do this in such compact volume. I tried everything in vain - the processor was too slowly (and we had not to change it).

As I told you, it's very crazy approach - but just try it. For me it works great - I never had problems... I don't compare the strings, I don't gather the string in memory (no buffer required), I don't keep the etalons in memory.... what you wish more?... maybe one beer

Regards
Luben

luben111 · Sep 4, 2003

I forgot something...

This approach requires special delimiter string, some character that will tell you where is the end of string. In GSM modules this is no probllem - CR/LF

Until you get the delimiter you just put the comming data to CRC16 register - no comparisions - very fast!

When you get the delimiter, only then you compare the CRC16 value in the register with the etalons in ROM. If you need 300 commands, you need 600 bytes. You compare 2 16 bit digits. Actually the strings could be very long, but for every string you need only 2 bytes. If you're scared - use CRC32 - I swear that you'll never get equal CRC32 for two strings..... just try by hand to do this.

As I told you the possibility 2 strings to have same CRC is VERY, VERY, VERY small. Just forget about this! I never had such luck to see 2 strings with the same CRC... If this happens, you can easy add one additional character in register (to start not from zero in beginning) and the problem will disappear. Trust me, it's amazing approach and after I discovered it I never use other one.

artem · Sep 5, 2003

Example

I have tried another approach just to test and did not use has table . The source code is in attachment . The lookup tale is built by another fucntions which takes strings and identifiers as argument . This file contains just string matching and lookup table contant . You can compile and test the program, may be perfomr some benchmarking towards your algorithm .

luben111 · Sep 5, 2003

I drop a look on your tables....

Actually to calculate CRC16 quickly (table approach) you need 512 bytes of ROM for fast calculating of CRC16. So, for limited number of strings my approach could yield bigger code.

For the speed there is no comparision. I don't compare the strings until I get the delimiter (this saves a lot of time).

When delimiter found I compare two 16 bit digits in table with clear structure - word array. The strings with different lenght create or "ugly" array, or array with too many zeroes. You can calculate alone that to compare one 16 bit word with array of 16 bit word sis very fast procedure... and it's made once per incomming string. Until there is no delimiter, the string comparision almost doesn't consume uP resources.

As I remember the string approach in my system was just impossible to run on speeds over 1200 (on 89C4051 / 20MHz). After moving to CRC16 I was able to reach speeds of 56000K! I'm talking about reliable operation - when I flood the module with random non-stopping data stream - if the module survived - it's OK, if even only one missed byte appeared - the speed is higher then it can take.

But if you're not sure about my approach, feel free to use the classic one. I can tell only that my approach in big arrays could be several times faster then the string approach. And I agree that some very small possibility exists 2 strings to yield teh same CRC16 (1/65000).

If you don't care for the speed you can calculate the CRC with program - then you have almost equal to string approach speed (but still faster) and amazing compression of the ROM memory.

If you need more information - email me and I can send you sources of my project. It's GSM answering machine - when somebody calls teh module it hangs and starts conversation. The user can cancel the call aor break the conversation. Maybe ~70 such modules are working permanently (almost 24 hours per day) right now and I never had any problems with them.... This was my last project without RTOS, after this project I switched completely to RTOS. The algorythm described with words sounds very simple, but after you make it on blocks you'll see that linear programming can support many parts of the algorythm... For sure you'll need RTOS as well.

Best regards
Luben Hristov
luel@mbox.cit.bg

artem · Sep 5, 2003

Hi

No I do not mean that your approach is good or bad - it is just to share idea .
The lookup table in my program is not just couple of strings , it is sorted linked list/ tree based data structure . I have just changed the program to run it standalone and made some tests for most of the used AT solic and unsolic responses.
What I have found is that max number of iterations in strtoid() function is about 30 . One iteration includes just one byte comparision and pointer adjustment (which is sum operation) into the lookup table .
The number of iterations (comparisions ) is dependent on content
equality between keywords , because of sorting principle used when building lookup table . For worst case where let say 50 letters used to encode 16 character length string , number of iterations will be 50*16 =
800 byte comparision , but it is far away from real cases .

It is also interresting to see your implementation . Could you please post it ?

Also , the function of course is invoked after delimiter will be found , so
input to strtoid is null terminated character string .

See the source file it could be interresting .I have used bc45 or bcc55 .

artem · Sep 5, 2003

Sorry problem with file

luben111 · Sep 5, 2003

Here is the source code of the file.... CRC16 , GSM command decoding and smart response.

Actually this is not exactly the last version - here you'll see decoding of 30-40 commands and the speed is still 9600 , but it's tested in real object and even currently works in some devices.

Hope this will help you...

Best regards
Luben

Welcome to EDAboard.com

string searching algorithm

artem

Advanced Member level 4

luben111

Advanced Member level 1

artem

Advanced Member level 4

luben111

Advanced Member level 1

artem

Advanced Member level 4

luben111

Advanced Member level 1

luben111

Advanced Member level 1

artem

Advanced Member level 4

luben111

Advanced Member level 1

artem

Advanced Member level 4

artem

Advanced Member level 4

luben111

Advanced Member level 1

Similar threads

Part and Inventory Search

Welcome to EDABoard.com

Sponsor

Connect with us

Online statistics

Forum statistics