I'm new to programming and would like help on going through the process of thinking through a problem. My teacher gives us problems that involve things like creating a program that converts numbers into words, and to do that was hard and took awhile. I would like help with thinking through how to solve a problem.
This is an example of a program that I'd need help thinking through. I wanted to create a spell checker, so i would need to take in what the user types, break it into words, convert it to either upper or lower case, search a large index and then print out the result. How could i break the sentence up into words? When the user is entering a word, the text is being entered into a temporary variable and when they hit the space bar take the word and perform the search? I know that there are some methods available that do this for you but I'd like to create my own methods that do this.
Then if i wanted to create database search, how would i do it? I would have to have the database of words in order, but i can't start from the beginning and go through the list of words in order, because it'd take to long. So would the best way be to start in the middle of the database and compare the word to the word in the database by it and if it's less then split the group on the left side and search the middle and repeat again and again until i found my search, kind of like divide and conquer? I have heard of hash numbers what is it exactly and why do it?
The full answer to this question could take an hour or more - it's not a simple question and more than I can fully answer in a lot of detail. However, I've implemented a spellchecker on a hand-held computer and will give you some advice.
There are several phases to spell checking. Having words to input to it is one. Most simple spell checkers do their work one word at time. Looking up the word is next, and as you mention, this needs to be as fast as possible or a 1000-word file with only 10mS per word would take 10 seconds to check. If you get fancy, you want to add "Skip words" and "Add words" to skip recurring misses and to save words not in the dictionary. Getting super fancy you can try to suggest close words to what was entered. This is difficult to do, but really quite fun so I hope you have the time to work on this part.
Breaking up words: Normally you take sequential letters only. As soon as you have 2 in a row, you have a word candidate. If you see whitespace (spaces, tabs, returns), punctuation, or numbers you have word (unless a number was hit and you'd skip to the next letter follow any of these and start over). If you're providing options, often being able to skip all capital words is one you want for input parsing. Usually you find words by brute force. You can copy to a buffer holding the word. I used a pointer to the start of the word and kept the word length and used that. Copying is expensive time-wise, even on a PC.
Looking up words: The trick here is the dictionary. Really compressed dictionaries can hold 100,000-200,000 words in a few hundred kilobytes. For your case, let's not be so fancy at first. I would read in the dictionary (whatever word list you can find) and store these in a binary tree, with one tree per starting letter. For each letter, continue down the tree one letter at a time. apple would be in the a tree with nodes to -> p -> p -> l -> e -> end. Your leaves should hold more than one leaf (since many words being with p, s, etc. For each letter in the input word, you see if you can get to an ending leaf. You need an end marker to define sub-words. apple and applesauce should match, so you need a marker on e's node pointer that indicates a valid word end is there for apple. Building the trees is the hard part. Traversing the tree is trivial and where you get the speed. You mention divide and conquer and a binary tree does that. If what I mention here is too fancy for you, you can store every word in a list, sorted, and then using a binary search is also fast. If you do this, use qsort and bsearch (usually fast implementations in a compiler) to sort the list and to search the list. If the list is on disk. Presort it and simply load it using an array of pointers to the loaded words. But this uses a lot of storage. My tree idea uses much more memory for a small list but a huge list will end up being 20-50 times smaller.
Giving suggestions: This is usually done by changing letters in the input word and repetitively trying to find matches. It gets more involved to insert and delete characters. The spellchecker in Microsoft word is sensational when it comes to suggestions. My guess is they have hundreds of man months invested in that piece of code. But this is advanced and all I'll say about it.
Good luck with this - I hope I was helpful.