Lua: Patterns

From Mario Fan Games Galaxy Wiki
This article assumes the use of Lua 5.1.

Information may not be accurate or may need revision if you are using a different version.

Lua provides a lightweight pattern-matching facility that fills the role regular expressions play in other languages. It is not a true regex engine and is less full-featured than most (for example, it has no alternation operator), but it is good enough for most tasks. Patterns are accepted by several of the string functions (string.find, string.match, string.gmatch, and string.gsub), which makes fairly advanced string parsing possible.
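
As a rough illustration of where patterns are accepted, here is a minimal sketch using the four standard string functions that take a pattern in Lua 5.1. The sample string and variable names are invented for this example.

```lua
-- Sketch: the standard string functions that accept patterns (Lua 5.1).
-- The sample text is made up for illustration.
local s = "Mario has 3 lives"

-- string.find returns the start and end indices of the first match.
local first, last = string.find(s, "%d")

-- string.match returns the matched text itself.
local digit = string.match(s, "%d")

-- string.gsub returns the new string and the number of substitutions.
local replaced, count = string.gsub(s, "%d", "9")

-- string.gmatch iterates over every match; here, every single letter.
local letters = 0
for _ in string.gmatch(s, "%a") do
  letters = letters + 1
end

print(first, last, digit, replaced, count, letters)
```

All four functions interpret the same pattern syntax, so a pattern worked out with one of them can generally be reused with the others.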

Lua patterns are ordinary strings, but their content is formatted so specifically that it is almost a mini-language of its own. Certain symbols are reserved, and certain special combinations of characters are called classes. A class represents a group of characters, such as letters, digits, or punctuation. Classes are also locale-dependent, meaning %a under one locale may stand for a different set of characters than %a under another. The built-in classes include:

  • .: a single dot represents any character; you'll want to be careful when using this because it can be difficult to "control"
  • %a: represents all letters, uppercase and lowercase
  • %l: represents lowercase letters only
  • %u: represents uppercase letters only
  • %d: represents digits
  • %x: represents hexadecimal digits (0-9, a-f, and A-F)
  • %w: represents alphanumeric characters (combination of %a and %d)
  • %c: represents control characters
  • %s: represents whitespace characters (space, tab, newline, etc.)
  • %p: represents punctuation characters
  • %z: represents the null character (byte value 0); in Lua 5.1 a pattern cannot contain an embedded zero, so this must be used instead of '\0' or '\000' when that character is needed
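
The classes above can be tried out with a short sketch like the following. The sample string and the first_of helper are hypothetical, and the results assume the default "C" locale.

```lua
-- Sample text is made up for this example; results assume the "C" locale.
local sample = "World 1-2: 0x1F coins!"

-- Hypothetical helper: return the first character matching a class.
local function first_of(class)
  return string.match(sample, class)
end

print(first_of("%u"))  --> W
print(first_of("%l"))  --> o
print(first_of("%d"))  --> 1
print(first_of("%p"))  --> the hyphen in "1-2"
print(first_of("%x"))  --> d (the "d" in "World" counts as a hex digit)

-- gsub's second result is the number of matches, handy for counting.
local _, digits = string.gsub(sample, "%d", "")
print(digits)          --> 4
```

Note how %x picks up the "d" in "World" before it ever reaches "0x1F"; a class matches single characters and knows nothing about the surrounding context.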

Any other character simply represents itself, with the exception of the magic characters ^$()%.[]*+-? which have special meanings inside a pattern. The regular backslash escapes of Lua string literals may also be used, since they are processed before the pattern is interpreted. % can be used to escape any character that isn't a number or letter (i.e., %% stands for '%' itself, and %. for a literal dot). Writing one of the above class letters in uppercase gives the complement of that class, so %D represents non-digits.
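
A minimal sketch of escaping and of complemented classes; the strings and variable names are invented for this example.

```lua
local file = "level1.lua"

-- A bare dot matches any character, so the first match is at position 1.
local any_start = string.find(file, ".")

-- "%." matches only a literal dot.
local dot_start = string.find(file, "%.")

-- "%%" matches a literal percent sign.
local percent_start = string.find("100% done", "%%")

-- An uppercase class letter complements the class: %D is any non-digit.
local non_digit = string.match("42nd", "%D")

-- %S is any non-whitespace character (only gsub's first result is kept).
local masked = string.gsub("a b\tc", "%S", "x")

print(any_start, dot_start, percent_start, non_digit, masked)
```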

A set is a custom class, written as a list of characters and classes between matching square brackets; it matches any single character belonging to one of its members. A hyphen, when used between characters, represents a range, so [0-9] would match any digit. Any classes or characters within the set are combined; i.e., [%a%d] means the same thing as %w. Be careful of locales, however: [a-z] may not necessarily be the same thing as %l! As with classes, you can get the complement of a set by placing ^ right after the opening square bracket.
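
The set syntax can be sketched as follows. The inputs are made up, and the results assume the default "C" locale.

```lua
-- [0-9] is a range; here it picks out the same character as %d.
local d1 = string.match("abc7", "[0-9]")
local d2 = string.match("abc7", "%d")

-- Members of a set are combined: [%a%d] behaves like %w.
local w1 = string.match("_x", "[%a%d]")
local w2 = string.match("_x", "%w")

-- A custom set: count vowels using gsub's second return value.
local _, vowels = string.gsub("Mushroom", "[aeiou]", "")

-- ^ right after the opening bracket complements the set.
local hashed = string.gsub("a b c", "[^%s]", "#")

print(d1, d2, w1, w2, vowels, hashed)
```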

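An undocumented pattern item in Lua 5.1 is %f, the so-called 'frontier pattern'. It is written directly before a set, as in %f[set], and matches the empty string at any position where the next character belongs to the set and the previous character does not; the start and end of the string are treated as if they were the null character. For example, %f[%a] detects the beginning of a word, and %f[%A] detects the end. Being undocumented in 5.1, its behavior is not guaranteed to be correct in all cases in that version; later versions of Lua (5.2 onward) document it officially.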
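
A small sketch of the frontier pattern in action, relying on the behavior described above (which Lua 5.1 does not formally document). The strings and variable names are invented for this example.

```lua
local text = "the cat sat"

-- %f[%a]: empty match where a letter follows a non-letter (word start).
local starts, n_starts = string.gsub(text, "%f[%a]", "<")

-- %f[%A]: empty match where a non-letter follows a letter (word end).
local ends, n_ends = string.gsub(text, "%f[%A]", ">")

-- Using both around a literal gives a whole-word replacement.
local swapped, n_swapped =
  string.gsub("cat concat cats", "%f[%a]cat%f[%A]", "dog")

print(starts, n_starts)    --> <the <cat <sat   3
print(ends, n_ends)        --> the> cat> sat>   3
print(swapped, n_swapped)  --> dog concat cats  1
```

Because the frontier matches an empty string, it marks a position without consuming any characters, which is why "concat" and "cats" are left alone in the last call.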

Designing an effective pattern can be a complex ordeal of trial and error; two patterns that look like they perform the same task may in fact behave subtly differently, for example when the classes they use overlap.