Tokenization

From Mario Fan Games Galaxy Wiki
Revision as of 21:41, 5 January 2009 by Xgoff (talk | contribs)

Tokenization is the process of splitting a string into smaller substrings, called tokens. This is often done with regular expressions that search for certain characters called delimiters, which mark the split points. A familiar delimiter in everyday writing is the space, which separates the words of a sentence. Sentences themselves are delimited by punctuation marks such as periods, question marks, and exclamation points, and paragraphs are delimited by line breaks.
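
As a rough illustration of splitting on delimiters, here is a minimal Python sketch. The function name `tokenize` and the sample strings are made up for this example; it assumes every delimiter is a single character and that empty pieces between adjacent delimiters are not wanted.

```python
import re

def tokenize(text, delimiters=" "):
    """Split text at any run of the given delimiter characters.

    Hypothetical helper for illustration: each character in
    `delimiters` is treated as a split point, and empty pieces
    are dropped.
    """
    # re.escape makes characters such as "." and "?" literal, so they
    # are safe to place inside a character class.
    pattern = "[" + re.escape(delimiters) + "]+"
    return [piece for piece in re.split(pattern, text) if piece]

# Words are delimited by spaces.
words = tokenize("Mario jumps over the pipe")

# Sentences are delimited by punctuation; strip the leftover spaces.
sentences = [s.strip() for s in tokenize("It's-a me! Where is Luigi? Here.", ".!?")]

print(words)
print(sentences)
```

Splitting on a run of delimiters (the trailing `+` in the pattern) is one design choice among several. If empty tokens matter, for example empty fields in comma-separated data, the `+` and the filter would have to go.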

Once a string is split, a parser decides what to do with the substrings. Sometimes the substrings are all that is needed and nothing further is done with them; in other cases the substrings carry meaning for the program, such as tags and attributes in HTML or BBCode.
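
The sketch below shows one way the substrings might carry meaning, using a simplified BBCode-like syntax. The tag pattern, the function names `tokenize_bbcode` and `classify`, and the sample input are all assumptions made for this example. Real BBCode dialects differ, and a full parser would also have to check that tags are properly nested.

```python
import re

# Assumed tag shapes for this sketch: [name], [name=value] or [/name].
# The outer capturing group makes re.split keep the tags in its result.
TAG_SPLIT = re.compile(r"(\[/?[a-zA-Z]+(?:=[^\]]*)?\])")
TAG_PARTS = re.compile(r"\[(/?)([a-zA-Z]+)(?:=([^\]]*))?\]")

def tokenize_bbcode(text):
    """Split text into tag tokens and plain-text tokens."""
    return [token for token in TAG_SPLIT.split(text) if token]

def classify(token):
    """Decide what a token means: an opening tag, a closing tag, or text."""
    match = TAG_PARTS.fullmatch(token)
    if match is None:
        return ("text", token)
    closing, name, value = match.groups()
    if closing:
        return ("close", name.lower())
    return ("open", name.lower(), value)

tokens = tokenize_bbcode("[b]Hello[/b] [url=http://example.com]link[/url]")
for token in tokens:
    print(classify(token))
```

Here the tokenizer only finds the pieces. The `classify` step plays the role of the parser by attaching a meaning to each piece.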