  1. Solution Center
  2. SOL-40

Unmatched national characters in regular expressions

    Details

    • Type: Explanation
    • Status: Published
    • Affects Version/s: EXASolution 4.2.4, EXASOL 6.0.0, Exasol 6.1.0, EXASolution 5.0.0
    • Fix Version/s: None
    • Component/s: EXASolution
    • Labels:
    • Symptoms:
      Regular expressions do not recognize non-ASCII national characters in character classes.

      Example:

      select regexp_substr('dejá vu', '[[:lower:]]+');
      --> "dej"
      
    • Explanation:
      Character classes written in [[:group:]] notation are not Unicode-aware and are highly locale-dependent. In practice, they only work for the ASCII part of Unicode (characters 0 through 127).
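
      To make the ASCII limitation concrete, the following sketch contrasts a pure-ASCII input with the accented input from the symptom description. The unaccented string 'deja vu' is an illustrative value that does not appear in the original report.

```sql
-- Pure ASCII input: every letter lies in the range 0 through 127,
-- so [[:lower:]] matches the whole first word.
select regexp_substr('deja vu', '[[:lower:]]+');
--> "deja"

-- Accented input from the symptom description: the match stops
-- before the non-ASCII character "á".
select regexp_substr('dejá vu', '[[:lower:]]+');
--> "dej"
```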

    • Solution:
      Because EXASolution implements the PCRE syntax for regular expressions, use Unicode character properties instead, written in \p notation:

      select regexp_substr('dejá vu', '[\p{Ll}]+');
      --> "dejá"
      

      Here, "Ll" denotes the property (L)etter, (l)owercase.

      A list of the classes (character properties) defined by the Unicode standard can be found in Table 12 of http://www.unicode.org/reports/tr44/#Property_Values
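
      Other general-category values from that table can be used in the same way. The statements below are a minimal sketch, assuming that the PCRE build in use supports these properties just as it supports \p{Ll}. The input strings are made up for illustration, and the results have not been verified against a specific version, so none are shown.

```sql
-- \p{L}: any letter, regardless of case (general category Letter).
-- Intended to match the accented first word as a whole.
select regexp_substr('Déjà vu', '\p{L}+');

-- \p{Lu}: uppercase letters only (general category Uppercase_Letter).
select regexp_substr('ÉCOLE normale', '\p{Lu}+');

-- \p{Nd}: decimal digits (general category Decimal_Number).
select regexp_substr('Straße 42', '\p{Nd}+');
```

      Since \p{...} is already a character class on its own, the enclosing square brackets are only needed when several properties or characters are combined, for example '[\p{Ll}\p{Nd}]+'.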

    • Category 1:
      SQL

          People

          • Assignee:
            CaptainEXA Captain EXASOL
            Reporter:
            CaptainEXA Captain EXASOL