

SIMPLEXpress
Regular expressions simplified.
Regular expressions are an incredibly powerful tool, but let's face it: they're not the easiest to write or read, and sometimes they're too powerful for what we need.
Simple.
SIMPLEXpress uses a visible and predictable syntax to make writing expressions easier. The syntax design trades some brevity for huge gains in readability. Nearly everything is literal, except within a "unit".
Units always begin with one of two characters, '^' for a match-only unit or '~' for a snag unit, and end with a '/'. Only three symbols are always reserved: '^' and '~' to start units, and '%' to escape them. This ability to easily "switch" between literal and symbolic modes means less escaping, less memorizing reserved symbols, and less referencing online documentation and tutorials.
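To make the literal/unit switch concrete, here is a minimal Python sketch that splits a pattern into literal segments and units. It is an illustration only, not the SIMPLEXpress implementation: the function name `split_pattern` and the segment labels are hypothetical, and it assumes that inside a unit '%' also protects a following '/'.

```python
def split_pattern(pattern):
    """Split a pattern into ('literal' | 'match' | 'snag', text) segments."""
    segments = []
    literal = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "%" and i + 1 < len(pattern) and pattern[i + 1] in "^~":
            # Escaped unit starter outside a unit: keep it as a literal character.
            literal.append(pattern[i + 1])
            i += 2
        elif ch in "^~":
            if literal:
                segments.append(("literal", "".join(literal)))
                literal = []
            kind = "match" if ch == "^" else "snag"
            body = []
            i += 1
            while i < len(pattern) and pattern[i] != "/":
                if pattern[i] == "%" and i + 1 < len(pattern):
                    # Assumption: '%' inside a unit protects the next character,
                    # so an escaped '/' does not close the unit.
                    body.append(pattern[i:i + 2])
                    i += 2
                else:
                    body.append(pattern[i])
                    i += 1
            if i >= len(pattern):
                raise ValueError("unterminated unit: missing closing '/'")
            i += 1  # skip the closing '/'
            segments.append((kind, "".join(body)))
        else:
            literal.append(ch)
            i += 1
    if literal:
        segments.append(("literal", "".join(literal)))
    return segments
```

With this sketch, `split_pattern("abc^(123)?/")` yields one literal segment (`abc`) followed by one match-only unit (`(123)?`); everything outside a unit passes through untouched.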
Expanded.
In regular expressions, you have to write your own complicated test cases for character sets, and Unicode support is only available through third-party libraries.
On the other hand, SIMPLEXpress has dozens of character classes, called "specifiers". Most also have sub-classes for more fine-grained matching. All of these specifiers automatically kick in within a unit. For example, you could use 'o' to match all common math operators, 'lu' to match all uppercase letters, and 'al' to match any alphanumeric character that isn't uppercase.
You can even match a range of Unicode characters with the specifier "u123-456"!
Express.
SIMPLEXpress was designed specifically for lexing and parsing, tasks for which regular expressions are infamously ill-suited. With that goal in mind, SIMPLEXpress is built to be fast and efficient.
That second reserved symbol, '~', is used to "snag" units and arbitrary literal segments, which can then be returned on demand. This, paired with efficiency, makes SIMPLEXpress ideal for language parsing.
Syntax
Below is the basic syntax for SIMPLEXpress, according to the most recent specification draft. (Subject to change.)
Operators
Only the '^' and '~' symbols are hard-reserved. All the rest only work as described within a unit ('^.../'), excepting the '%', which works outside units when preceding a hard-reserved character.
Symbol | Usage | Example |
---|---|---|
^ | Start unit | ^.../ |
~ | Snag, aka capture group | ~.../ |
[ ] | Set: Match any one of the unit values within. Space delimited. | ^[(abc) (123)]/ matches 'abc' or '123'. |
< > | Literal Set: Any literal character within. | ^<abc>/ matches 'a', 'b', or 'c'. |
( ) | Group: Allows for literal characters, strings, and further nested units within a unit. | ^(a)(bc)?/ matches either 'a' or 'abc'. |
% | Escape following character (literal). Also works outside of units when preceding '^' or '~'. | ^<12>/^%*?/ matches '1', '2', '1*', or '2*'. |
{ } | Exclusion: Anything within is checked but is not returned as part of the result. Parallels regex "lookahead" and "lookbehind". | ~{(abc)}/123 matches 'abc123', but returns only '123'. |
. | Matches any character. | ^./23 matches 'z23' and anything else with a single character followed by '23'. |
+ | Multiple (one or more). | ^(abc)+/ matches 'abc', 'abcabc', and so forth. |
? | Optional. | abc^(123)?/ matches 'abc' and 'abc123'. |
* | Optional multiple (zero or more). | abc^(123)*/ matches 'abc', 'abc123', 'abc123123', and so forth. |
#1, #2-3 (etc) | Exact number or range of matches. | abc^(123)#2-3/ only matches 'abc123123' or 'abc123123123'. |
! | NOT operator. | ^!<abc>/ only matches a single character that is NOT 'a', 'b', or 'c'. |
$ | Line boundary (line beginning/end). | ^$/abc^$/ only matches 'abc' if it is the entirety of the line. |
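For readers coming from regular expressions, the table's examples can be compared against rough equivalents in Python's `re` module. The mapping below is a hand-written approximation based on the examples above, not output from any SIMPLEXpress tool; the names `APPROXIMATIONS`, `fully_matches`, and `first_result` are hypothetical.

```python
import re

# Keys: SIMPLEXpress examples from the table.
# Values: hand-written stand-ins in Python `re` syntax (approximations only).
APPROXIMATIONS = {
    "^[(abc) (123)]/": r"(?:abc|123)",
    "^<abc>/": r"[abc]",
    "^(a)(bc)?/": r"a(?:bc)?",
    "^<12>/^%*?/": r"[12]\*?",
    "~{(abc)}/123": r"(?<=abc)123",  # lookbehind: use first_result() for this one
    "^./23": r".23",
    "^(abc)+/": r"(?:abc)+",
    "abc^(123)?/": r"abc(?:123)?",
    "abc^(123)*/": r"abc(?:123)*",
    "abc^(123)#2-3/": r"abc(?:123){2,3}",
    "^!<abc>/": r"[^abc]",
    "^$/abc^$/": r"^abc$",
}


def fully_matches(example, text):
    """True if the whole text matches the regex stand-in for a table example."""
    return re.fullmatch(APPROXIMATIONS[example], text) is not None


def first_result(example, text):
    """Return the first matched substring, or None if nothing matches."""
    found = re.search(APPROXIMATIONS[example], text)
    return found.group() if found else None
```

For instance, `first_result("~{(abc)}/123", "abc123")` returns only `'123'`, mirroring the exclusion row, while `fully_matches("abc^(123)#2-3/", "abc123")` is false because the group must repeat two or three times.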
Specifiers
All specifiers start with a single letter, and only function within a unit. Lowercase is a match, uppercase inverts the logic (A = NOT alphanumeric).
Specifier | Usage |
---|---|
a | alphanumeric |
c | classification (Reserved for later expanded character classes, such as 'c_hangal' for Hangal characters). |
d | digit |
e | extended Latin |
g | Greek/Coptic |
i | IPA (International Phonetic Alphabet) |
l | Latin letter |
n | newline ('\n') |
o | math operator |
p | punctuation |
r | carriage return ('\r') |
s | literal space |
t | tab |
u# | unicode (accepts 'u78' or 'u57-78') |
w | whitespace |
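As an illustration of the 'u#' row, the following Python sketch parses a specifier such as 'u78' or 'u57-78' and tests a character against it. It is a sketch under stated assumptions rather than the library's behavior: the function names are hypothetical, and the numbers are assumed to be decimal code points with an inclusive range, which the draft does not state explicitly.

```python
def parse_unicode_specifier(spec):
    """Parse 'u78' or 'u57-78' into an inclusive (low, high) code point pair."""
    if not spec.startswith("u") or len(spec) < 2:
        raise ValueError(f"not a unicode specifier: {spec!r}")
    low, sep, high = spec[1:].partition("-")
    # Assumption: values are decimal code points; the draft may intend another base.
    low_cp = int(low)
    high_cp = int(high) if sep else low_cp
    if low_cp > high_cp:
        raise ValueError("range start exceeds range end")
    return low_cp, high_cp


def matches_unicode_specifier(spec, char):
    """True if char's code point falls inside the specifier's range."""
    low, high = parse_unicode_specifier(spec)
    return low <= ord(char) <= high
```

Under these assumptions, `matches_unicode_specifier("u57-78", "A")` is true because the code point of 'A' is 65, which lies between 57 and 78.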
Most specifiers can also include 'u' or 'l' after the first character to indicate uppercase or lowercase. For example, '^au/' indicates alphanumeric uppercase, while '^gl/' indicates Greek/Coptic lowercase. If case doesn't apply to the specifier, the suffix is ignored and no error is raised.
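The specifier rules (first letter selects the class, an uppercase first letter inverts it, an optional 'u' or 'l' suffix filters by case) can be sketched in Python as follows. This is a hedged illustration covering only a few classes: `BASE_CLASSES` and `specifier_predicate` are hypothetical names, ASCII-only `str` methods stand in for the real character classes, and applying inversion after the case suffix is an assumption the draft does not settle.

```python
# Stand-ins for a handful of specifiers; the real classes may differ.
BASE_CLASSES = {
    "a": lambda ch: ch.isascii() and ch.isalnum(),  # alphanumeric
    "d": lambda ch: ch.isascii() and ch.isdigit(),  # digit
    "l": lambda ch: ch.isascii() and ch.isalpha(),  # Latin letter
    "n": lambda ch: ch == "\n",                     # newline
    "r": lambda ch: ch == "\r",                     # carriage return
    "s": lambda ch: ch == " ",                      # literal space
    "t": lambda ch: ch == "\t",                     # tab
    "w": lambda ch: ch.isspace(),                   # whitespace
}


def specifier_predicate(spec):
    """Build a single-character test from a specifier such as 'a', 'lu', or 'A'."""
    if not spec:
        raise ValueError("empty specifier")
    head, suffix = spec[0], spec[1:]
    if head.lower() not in BASE_CLASSES:
        raise ValueError(f"unsupported specifier in this sketch: {spec!r}")
    base = BASE_CLASSES[head.lower()]
    if suffix == "u":
        # 'u' drops lowercase characters; caseless characters pass unchanged.
        test = lambda ch: base(ch) and not ch.islower()
    elif suffix == "l":
        # 'l' drops uppercase characters; caseless characters pass unchanged.
        test = lambda ch: base(ch) and not ch.isupper()
    elif suffix == "":
        test = base
    else:
        raise ValueError(f"unknown suffix in {spec!r}")
    if head.isupper():
        # Assumption: an uppercase first letter inverts the whole test.
        return lambda ch: not test(ch)
    return test
```

Because the case filters only reject characters of the opposite case, a suffix on a caseless class such as 'du' has no effect, which matches the "ignored, no error" rule described above.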
FAQs: SIMPLEXpress