Regular expressions

From Mario Fan Games Galaxy Wiki

Regular expressions, or regex, are a compact notation for describing patterns in text, used to search, extract, and replace parts of strings. Regex makes complicated string manipulation and extraction much easier, since there is no need to write large chunks of code that scan the string for patterns by hand. Regex is almost a programming language in itself, and some expressions (patterns) can be quite complex.

Regex takes some practice to use, as patterns have an unamusing habit of behaving in unexpected ways. A pattern may also turn out to need more complexity than was first assumed.

Implementations

Lua

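Lua has a limited form of regex called pattern matching, since a full regex implementation would be larger than all of Lua's standard libraries combined. Lua patterns still cover the most common regex functionality, though some features are missing (for example, there is no alternation with '|', and quantifiers cannot be applied to a group). Separate regex libraries are available for those cases.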
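As a rough orientation, the sketch below shows three functions from Lua's standard string library that accept patterns: string.find, string.match, and string.gsub. The sample string and variable names are made up for illustration.

```lua
local s = "score=1200 lives=3"

-- string.find returns the start and end indices of the first match
local first, last = s:find("%d+")
print(first, last)   -- 7 and 10

-- string.match returns the captures (or the whole match if there are none)
local key, value = s:match("(%a+)=(%d+)")
print(key, value)    -- score and 1200

-- string.gsub replaces every match and also returns the replacement count
local masked, count = s:gsub("%d", "#")
print(masked, count) -- score=#### lives=# and 5
```

string.gmatch, used in the function below, is the iterator counterpart of string.match: each loop step yields the captures of the next match.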

function csv_to_table(csv_string) -- converts comma-separated values to a table
    local t = { }
    for value in csv_string:gmatch("(%w+),?") do
        table.insert(t, value)
    end
    return t
end

-- usage
test = csv_to_table("1,2, 3,   4, 5,6") -- just a whitespace-handling test
print(table.concat(test, " ")) -- prints "1 2 3 4 5 6"

The string "(%w+),?" is the actual pattern. It may look like random gibberish, but everything in it has a specific meaning. %w is a character class that stands for alphanumeric characters: uppercase and lowercase letters and digits. The comma stands for itself. Now for the symbols: '+' matches 1 or more repetitions of the previous character or class, taking as many as it can and stopping at the first character that doesn't belong to it. '?' matches 0 or 1 occurrence of the previous character or class. '(' and ')' constitute a capture: whatever matches between them is what gets returned (or in this case, assigned to 'value'); anything matched outside the parentheses (there can be more than one set) is not returned.
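Each of the pieces described above can be tried in isolation. The following is a small sketch using string.match; the sample strings and variable names are invented for the demonstration.

```lua
-- '+' needs at least one character from the class and takes as many as it can
local run = ("abc123, x"):match("%w+")
print(run)          -- abc123

-- '?' makes the comma optional, so both inputs match
local with_comma = ("abc,"):match("%w+,?")
local without_comma = ("abc"):match("%w+,?")
print(with_comma, without_comma)   -- abc, and abc

-- with a capture, only the part inside the parentheses is returned
local captured = ("abc,"):match("(%w+),?")
print(captured)     -- abc

-- two sets of parentheses give two return values
local left, right = ("key,value"):match("(%w+),(%w+)")
print(left, right)  -- key and value
```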

Basically, this pattern does the following: it finds a group of 1 or more alphanumeric characters, then consumes a comma if one follows (since we're using that to separate values), and returns the capture. Whitespace around a value is ignored because it is not part of the %w set: gmatch simply moves past any characters that cannot start a match.

If you haven't figured it out already, this pattern is somewhat simple and could be expanded to handle different types of values more effectively; for instance, it doesn't recognize decimal points, so "3.14" comes back as two values, "3" and "14". You may notice the commas are actually optional because we used '?'. Without the '?', the last value, though valid, wouldn't be returned unless it had a comma after it. That would be an inconvenience to the user, since trailing commas aren't the most common thing ever. One workaround is to have the function check the input string for a trailing comma and add one if there is none, then apply the pattern without the '?'. Or you could try coming up with a better pattern; in any case, this pattern works for simple alphanumeric values and is fine for program-generated output.
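One possible take on the workaround and the "better pattern" idea is sketched below. The function name csv_to_table_fields is hypothetical, and the approach is only one option: it adds a trailing comma when needed, requires the comma in the pattern, captures everything between commas, and trims whitespace afterwards. It still does not handle quoted values or commas inside values.

```lua
function csv_to_table_fields(csv_string) -- hypothetical variant of csv_to_table
    -- make sure the last value is followed by a comma
    if csv_string:sub(-1) ~= "," then
        csv_string = csv_string .. ","
    end
    local t = { }
    -- "[^,]*" matches any run of non-comma characters, so "1.5" stays whole
    for field in csv_string:gmatch("([^,]*),") do
        -- strip leading and trailing whitespace from the field
        local trimmed = field:match("^%s*(.-)%s*$")
        table.insert(t, trimmed)
    end
    return t
end

-- usage
local result = csv_to_table_fields("1.5, 2,  abc")
print(table.concat(result, " ")) -- prints "1.5 2 abc"
```

Unlike the original pattern, this version keeps empty fields (as empty strings) and keeps spaces inside a value, which may or may not be what you want.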