Tokenization

From Mario Fan Games Galaxy Wiki

Tokenization is the process of dividing a string (text) into several substrings, called tokens.

Dividing a string

To divide a string, special characters known as delimiters are used. Common delimiters include the space, which divides words, and the line break, which divides paragraphs.

These three words

May be divided into three substrings with the space as the delimiter:

These
three
words
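The split above can be sketched in a few lines of Python (the wiki text does not name a language, so this is just one illustration):

```python
# Divide a string into substrings, using the space as the delimiter.
text = "These three words"
tokens = text.split(" ")
print(tokens)  # ['These', 'three', 'words']
```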

Delimiters may also be invisible. Regular expressions (regex) provide zero-width delimiters such as the word boundary, which matches the position between a word character and a non-word character — for instance, the spot between a space and the first letter of the word just after it.
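A short Python sketch of such an invisible delimiter, using the regex word-boundary pattern `\b` (assuming Python's `re` module; the wiki text does not specify a regex flavor):

```python
import re

# \b is a zero-width "word boundary": it matches the invisible position
# between a word character and a non-word character (such as a space).
text = "These three words"
# Find every run of word characters bounded by word boundaries:
tokens = re.findall(r"\b\w+\b", text)
print(tokens)  # ['These', 'three', 'words']
```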

Markup

A string may contain markup: special character sequences used to add effects to the text. Markup is found in such languages as HTML (Hypertext Markup Language) and BBCode.

This markup can be found using delimiters and replaced using regular expressions.

For instance, in MediaWiki, if the before string is ...

''Italic text''

... then MediaWiki will parse the string, italicizing the text between the '' characters and removing those characters:

Italic text
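A simplified sketch of this kind of markup replacement in Python, using a regular expression to find the delimiters and substitute HTML tags (this is an illustration, not MediaWiki's actual parser):

```python
import re

# Find text wrapped in '' delimiters and replace the markup with
# HTML italic tags.
before = "''Italic text''"
after = re.sub(r"''(.*?)''", r"<i>\1</i>", before)
print(after)  # <i>Italic text</i>
```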