  1. Solution Center
  2. SOL-40

Unmatched national characters in regular expressions

    Details

    • Type: Explanation
    • Status: Published
    • Affects Version/s: EXASolution 4.2.4, EXASOL 6.0.0, Exasol 6.1.0, EXASolution 5.0.0, Exasol 6.2.x
    • Fix Version/s: None
    • Component/s: EXASolution
    • Explanation:

      Background

      Regular expressions do not match non-ASCII national characters (such as á) with POSIX character classes like [[:lower:]].

      Example:

      select regexp_substr('dejá vu', '[[:lower:]]+');
      --> "dej"

      Explanation

      Character classes written in [[:class:]] notation are not Unicode-aware and are highly locale-dependent. In practice, they only cover the ASCII part of Unicode (code points 0 through 127).

      Because EXASolution implements the PCRE syntax for regular expressions, use Unicode character properties instead, written in \p notation:

      select regexp_substr('dejá vu', '[\p{Ll}]+');
      --> "dejá"
      

      Here, "Ll" denotes the property (L)etter, (l)owercase.
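
      The same \p notation accepts other Unicode general categories. The statements below are a minimal sketch, not taken from the original article: the string literals are made-up sample data, and the comments describe what each pattern is intended to match, assuming the PCRE property support described above.

      ```sql
      -- \p{L}: any letter, regardless of case
      select regexp_substr('dejá vu', '\p{L}+');

      -- \p{Lu} followed by \p{Ll}+: one uppercase letter, then lowercase letters
      -- (intended to pick up a capitalized word such as 'Ärger')
      select regexp_substr('viel Ärger', '\p{Lu}\p{Ll}+');

      -- \P{L} (capital P) negates the property: remove everything that is not a letter
      select regexp_replace('dejá vu!', '\P{L}', '');
      ```

      Two-letter codes such as "Lu" or "Ll" select a single subcategory, while the one-letter code "L" covers all letter subcategories at once.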

      Additional References

      The general categories (character properties) defined by the Unicode standard are listed in Table 12 of Unicode Standard Annex #44: http://www.unicode.org/reports/tr44/#Property_Values

       

    • Category 1:
      SQL

          People

          • Assignee:
            CaptainEXA Captain EXASOL
          • Reporter:
            CaptainEXA Captain EXASOL