Package: Util.Strings

Dependencies

with Ada.Characters.Handling;
with Ada.Strings.Fixed;
with Ada.Strings.Maps;

Description

Copyright © 2001, 2002 by Thomas Wolf.
This piece of software is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2, or (at your option) any later version. This software is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License with this distribution, see file "GPL.txt". If not, write to the Free Software Foundation, 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
As a special exception from the GPL, if other files instantiate generics from this unit, or you link this unit with other files to produce an executable, this unit does not by itself cause the resulting executable to be covered by the GPL. This exception does not however invalidate any other reasons why the executable file might be covered by the GPL.

Version: 1.1

Author:
Thomas Wolf (TW) <twolf AT acm DOT org>
Purpose:
Various string utilities not provided in the standard library. Some standard operations are also repeated here, so that a single "with" provides everything one needs.
Tasking Semantics
Neither task- nor abortion-safe.
Storage Semantics
No dynamic storage allocation.

Header

package Util.Strings is
 
pragma Elaborate_Body;

Known child units

Util.Strings.DOS_Match (function instantiation)
Util.Strings.Unix_Match (function instantiation)

Exceptions

Illegal_Pattern renames Ada.Strings.Pattern_Error
Raised by Wildcard_Match if a pattern is malformed.

Constants and Named Numbers

Backward : constant Ada.Strings.Direction := Ada.Strings.Backward;
Blanks : constant Ada.Strings.Maps.Character_Set;
Anything that can be considered white space: not just a blank, but also tabs, non-breaking spaces, carriage returns, and so on.
Both : constant Ada.Strings.Trim_End := Ada.Strings.Both;
Forward : constant Ada.Strings.Direction := Ada.Strings.Forward;
Left : constant Ada.Strings.Trim_End := Ada.Strings.Left;
Letters : constant Ada.Strings.Maps.Character_Set;
7-bit ASCII letters, i.e. A-Z and a-z.
No_Escape : constant Character := Character'Val (0);
This constant is used to indicate to the string parsing operations Get_String and In_String that string delimiters cannot be escaped.
No_Set_Inverter : constant Character := Character'Val (0);
Null_Set : constant Ada.Strings.Maps.Character_Set :=
  Ada.Strings.Maps.Null_Set;
Right : constant Ada.Strings.Trim_End := Ada.Strings.Right;
Shell_Quotes : constant Ada.Strings.Maps.Character_Set;
Quotes typically recognized by command shells: double, single, and back quote.
String_Quotes : constant Ada.Strings.Maps.Character_Set;
Typical string quotes: double and single quotes.

Other Items:

function To_Lower (Ch : in Character) return Character
  renames Ada.Characters.Handling.To_Lower;

function To_Upper (Ch : in Character) return Character
  renames Ada.Characters.Handling.To_Upper;

function To_Lower (S : in String) return String
  renames Ada.Characters.Handling.To_Lower;

function To_Upper (S : in String) return String
  renames Ada.Characters.Handling.To_Upper;

function To_Mixed (S : in String) return String;
Maps every character immediately following an underscore ('_'), a period ('.'), or white space (as defined by Blanks above) to upper case, and all others to lower case.

function Count
  (Src : in String;
   Ch  : in Character)
  return Natural;
Returns the number of occurrences of Ch in the string Src.

function Count
  (Source  : in String;
   Pattern : in String)
  return Natural;
As Ada.Strings.Fixed.Count, but without mapping and therefore way faster.

function Index
  (Src : in String;
   Ch  : in Character;
   Dir : in Ada.Strings.Direction := Forward)
  return Natural;
Returns the index of the first (or last, if Dir is Backward) occurrence of Ch in the string Src, or zero if no occurrence of this character can be found.

function First_Index
  (Src : in String;
   Ch  : in Character)
  return Natural;
As Index, but hard-wired to searching forward.

function Last_Index
  (Src : in String;
   Ch  : in Character)
  return Natural;
As Index, but hard-wired to searching backward.

function First_Index
  (Source  : in String;
   Pattern : in String)
  return Natural;
As Index, but hard-wired to searching forward. Way faster than Ada.Strings.Fixed.Index, also because no mapping is applied.

function Last_Index
  (Source   : in String;
   Pattern  : in String)
  return Natural;
As Index, but hard-wired to searching backward. Way faster than Ada.Strings.Fixed.Index, also because no mapping is applied.

function Index
  (Source  : in String;
   Pattern : in String;
   Dir     : in Ada.Strings.Direction := Forward)
  return Natural;
As Ada.Strings.Fixed.Index, but hard-wired to not using a mapping.
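
The following is a minimal usage sketch of the counting and searching operations above, assuming the package is available as described. The procedure name Index_Demo and the sample text are illustrative only; the expected values follow from the descriptions above and hold when assertions are enabled (for example with GNAT's -gnata switch).

```ada
with Util.Strings; use Util.Strings;

procedure Index_Demo is
   --  Sample text; a string literal has 'First = 1.
   Path : constant String := "foo.bar.baz";
begin
   --  Character searches.
   pragma Assert (Count (Path, '.') = 2);
   pragma Assert (Index (Path, '.') = 4);              --  first '.'
   pragma Assert (Index (Path, '.', Backward) = 8);    --  last '.'
   pragma Assert (First_Index (Path, '.') = 4);
   pragma Assert (Last_Index (Path, '.') = 8);
   pragma Assert (Index (Path, '/') = 0);              --  not found

   --  Substring searches; no character mapping is applied.
   pragma Assert (Count (Path, "ba") = 2);
   pragma Assert (First_Index (Path, "ba") = 5);
   pragma Assert (Last_Index (Path, "ba") = 9);
   pragma Assert (Index (Path, "ba", Backward) = 9);
   null;
end Index_Demo;
```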

function Is_Prefix
  (Source : in String;
   Prefix : in String)
  return Boolean;
Returns True if Source starts with Prefix, False otherwise.

function Is_Suffix
  (Source : in String;
   Suffix : in String)
  return Boolean;
Returns True if Source ends with Suffix, False otherwise.
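
As an illustration, Is_Prefix and Is_Suffix might be used as follows. This is a sketch only: the procedure name Affix_Demo and the file name are made up, and the assertions are checked only when assertions are enabled.

```ada
with Util.Strings; use Util.Strings;

procedure Affix_Demo is
   File : constant String := "util-strings.ads";
begin
   pragma Assert (Is_Prefix (File, "util-"));
   pragma Assert (Is_Suffix (File, ".ads"));
   pragma Assert (not Is_Suffix (File, ".adb"));

   --  "ab" does not start with the longer string "abc".
   pragma Assert (not Is_Prefix ("ab", "abc"));
   null;
end Affix_Demo;
```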

function Is_Blank
  (Ch : in Character)
  return Boolean;
Returns Ada.Strings.Maps.Is_In (Ch, Blanks).

function Is_In
  (Set : in Ada.Strings.Maps.Character_Set;
   Ch  : in Character)
  return Boolean;
Returns Ada.Strings.Maps.Is_In (Ch, Set). Provided mainly because I very often mix up the order of the arguments.

function Trim
  (S    : in String;
   Side : in Ada.Strings.Trim_End := Both)
  return String;
Removes all characters in Blanks (declared above) from the specified end or ends of the string.

function Trim
  (S     : in String;
   Left  : in Ada.Strings.Maps.Character_Set;
   Right : in Ada.Strings.Maps.Character_Set := Null_Set)
  return String
  renames Ada.Strings.Fixed.Trim;
Removes all leading characters that are in Left and all trailing characters that are in Right. The point of this renaming is the default value of the Right parameter.
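
The two Trim variants can be contrasted with a short sketch. The names Trim_Demo, Tab and Dashes are illustrative assumptions; the set-based results follow the semantics of Ada.Strings.Fixed.Trim, which this function renames.

```ada
with Ada.Strings.Maps;
with Util.Strings; use Util.Strings;

procedure Trim_Demo is
   Tab    : constant Character := Character'Val (9);
   Dashes : constant Ada.Strings.Maps.Character_Set :=
     Ada.Strings.Maps.To_Set ('-');
begin
   --  Blank-based Trim: tabs count as white space, too.
   pragma Assert (Trim (Tab & " hello  ") = "hello");
   pragma Assert (Trim ("  hello  ", Left) = "hello  ");
   pragma Assert (Trim ("  hello  ", Right) = "  hello");

   --  Set-based Trim: Right defaults to Null_Set, so only leading
   --  characters are removed unless a second set is given.
   pragma Assert (Trim ("--hello--", Dashes) = "hello--");
   pragma Assert (Trim ("--hello--", Dashes, Dashes) = "hello");
   null;
end Trim_Demo;
```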

procedure Get_String
  (S        : in     String;
   From, To :    out Natural;
   Delim    : in     Character := '"';
   Escape   : in     Character := No_Escape);
Returns in From and To the indices of the beginning and end of the next string in S.

A string is defined as a sequence of characters enclosed by Delim; any occurrences of Delim after the first Delim that are immediately preceded by Escape do not terminate the string but are part of the string's content. The value of Escape selects the quoting convention:
  • Escape /= Delim: Delimiters that are part of the string must follow an Escape immediately. Two Escapes in a row are considered one literal Escape. For instance, with Delim = '"' and Escape = '\', the operation recognizes C strings.
  • Escape = Delim: Delimiters that are part of the string must be doubled, as in Ada strings.
  • Escape = No_Escape: Strings cannot contain instances of the delimiter. The second occurrence of a delimiter in S is the string end.

If no string is found, both From and To are zero.

If an unterminated string is found, From is the index of the opening occurrence of Delim, and To is zero.

Otherwise, a string was found, and From and To are the indices of the opening and closing occurrences of Delim, respectively.
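
The three outcomes and the three escape conventions of Get_String can be sketched as follows. The procedure name Get_String_Demo and the sample texts are illustrative; the expected indices are derived from the description above (string literals start at index 1), not from running the original implementation.

```ada
with Util.Strings; use Util.Strings;

procedure Get_String_Demo is
   From, To : Natural;
begin
   --  Text: say "hi" now   (default: no escaping)
   Get_String ("say ""hi"" now", From, To);
   pragma Assert (From = 5 and To = 8);

   --  Text: say "hi   (unterminated string)
   Get_String ("say ""hi", From, To);
   pragma Assert (From = 5 and To = 0);

   --  No delimiter at all.
   Get_String ("plain text", From, To);
   pragma Assert (From = 0 and To = 0);

   --  C convention. Text: "a\"b" x
   Get_String ("""a\""b"" x", From, To, Delim => '"', Escape => '\');
   pragma Assert (From = 1 and To = 6);

   --  Ada convention. Text: "a""b" x
   Get_String ("""a""""b"" x", From, To, Delim => '"', Escape => '"');
   pragma Assert (From = 1 and To = 6);
end Get_String_Demo;
```

When a string is found, its content without the delimiters is the slice S (From + 1 .. To - 1).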


function In_String
  (S      : in String;
   Delim  : in Character := '"';
   Escape : in Character := No_Escape)
  return Boolean;
Returns True if S ends within an unterminated "string" (as described above), and False otherwise.

function Skip_String
  (S      : in String;
   Delim  : in Character := '"';
   Escape : in Character := No_Escape)
  return Natural;
Returns the index of the closing occurrence of Delim of the string in S. S (S'First) should be the opening occurrence of Delim. The semantics of Delim and Escape are as for Get_String.

Returns zero if no closing occurrence of Delim can be found in S.
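
A brief sketch of In_String and Skip_String with the default delimiter and no escaping. The procedure name Scan_Demo and the sample texts are illustrative, and the expected results follow from the descriptions above.

```ada
with Util.Strings; use Util.Strings;

procedure Scan_Demo is
begin
   --  Text: x := "abc     (the line ends inside a string)
   pragma Assert (In_String ("x := ""abc"));

   --  Text: x := "abc";   (the string is properly terminated)
   pragma Assert (not In_String ("x := ""abc"";"));

   --  Text: "abc" rest    (the closing delimiter is at index 5)
   pragma Assert (Skip_String ("""abc"" rest") = 5);

   --  Text: "abc          (no closing delimiter)
   pragma Assert (Skip_String ("""abc") = 0);
   null;
end Scan_Demo;
```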


function Quote
  (S      : in String;
   Delim  : in Character;
   Escape : in Character)
  return String;
Quotes a string. S is supposed to contain the string's contents (without the delimiters). Any embedded delimiter is quoted as follows:
  • If Escape = No_Escape, S is returned.
  • If Escape = Delim, all occurrences of Delim in S are replaced by two Delims.
  • Otherwise, an Escape is inserted before any occurrence of Delim or Escape in S.

function Unquote
  (S      : in String;
   Delim  : in Character;
   Escape : in Character)
  return String;
Unquotes embedded delimiters in a string. S is supposed to contain the string's contents without the bounding delimiters.
  • If Escape = No_Escape, S is returned.
  • If Escape = Delim, all non-overlapping occurrences of two consecutive Delims in S are replaced by a single Delim.
  • Otherwise, any non-overlapping occurrence of two Escapes in S is replaced by a single Escape, and any occurrence of an Escape immediately followed by a Delim is replaced by a single Delim.

In all cases, the following is true:

    Unquote (Quote (S, Delim, Escape), Delim, Escape) = S
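
The round-trip property can be illustrated for each convention. This is a sketch under the rules listed above; the procedure name Quote_Demo and the sample content are made up for the example.

```ada
with Util.Strings; use Util.Strings;

procedure Quote_Demo is
   No_Esc  : constant Character := No_Escape;
   --  Content text: a"b\c
   Content : constant String := "a""b\c";
begin
   --  Ada convention: the delimiter is doubled. Result text: a""b\c
   pragma Assert (Quote (Content, '"', '"') = "a""""b\c");

   --  C convention: Escape goes before Delim and Escape.
   --  Result text: a\"b\\c
   pragma Assert (Quote (Content, '"', '\') = "a\""b\\c");

   --  No escaping: S is returned unchanged.
   pragma Assert (Quote (Content, '"', No_Esc) = Content);

   --  Round trips.
   pragma Assert (Unquote (Quote (Content, '"', '"'), '"', '"') = Content);
   pragma Assert (Unquote (Quote (Content, '"', '\'), '"', '\') = Content);
   null;
end Quote_Demo;
```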
  

function Unquote_All
  (S      : in String;
   Quotes : in Ada.Strings.Maps.Character_Set;
   Escape : in Character := No_Escape)
  return String;
Unquotes all non-overlapping occurrences of strings within S delimited by any character in Quotes. If Escape = No_Escape, the Ada convention (embedded delimiters must be doubled) is assumed, otherwise, embedded delimiters must be escaped by Escape.

function Identifier
  (S : in String)
  return Natural;
If S starts with an identifier, returns the index of the identifier's last character. Otherwise, returns zero. For the purpose of this function, an identifier has the following syntax:
     Identifier = Letter {Letter | Digit | '_'}.
     Letter     = 'A' .. 'Z' | 'a' .. 'z'.
     Digit      = '0' .. '9'.
  

Note that this is the Ada 95 syntax, except that multiple underscores in a row are allowed.
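
A few sample calls, derived from the grammar above, may make the return value concrete. The procedure name Identifier_Demo and the inputs are illustrative.

```ada
with Util.Strings; use Util.Strings;

procedure Identifier_Demo is
begin
   --  "Foo_Bar" occupies indices 1 .. 7.
   pragma Assert (Identifier ("Foo_Bar := 1") = 7);

   --  Multiple underscores in a row are accepted.
   pragma Assert (Identifier ("a__b") = 4);

   --  The identifier ends before the first character outside the grammar.
   pragma Assert (Identifier ("x1.y") = 2);

   --  An identifier must start with a letter.
   pragma Assert (Identifier ("9lives") = 0);
   pragma Assert (Identifier ("_tmp") = 0);
   null;
end Identifier_Demo;
```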


function Next_Non_Blank
  (S : in String)
  return Natural;
Returns the index of the first character in S such that Is_Blank (S (I)) = False, or zero if no such character exists in S.

function Next_Blank
  (S : in String)
  return Natural;
Returns the index of the first character in S for which Is_Blank (S (I)) = True, or zero if there is no such character in S.
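
Next_Non_Blank and Next_Blank can be combined to pick the first word out of a line. This sketch assumes that the returned indices refer to the bounds of the string actually passed, including slices; the names Word_Demo, Line, First and Stop are illustrative.

```ada
with Util.Strings; use Util.Strings;

procedure Word_Demo is
   Line  : constant String := "   alpha beta";
   First : constant Natural := Next_Non_Blank (Line);
   Stop  : Natural;
begin
   pragma Assert (First = 4);              --  the 'a' of "alpha"
   pragma Assert (Next_Blank (Line) = 1);  --  Line starts with a blank
   pragma Assert (Next_Blank ("abcd") = 0);
   pragma Assert (Next_Non_Blank ("   ") = 0);

   --  Search for the blank that ends the first word. The slice keeps
   --  the indices of Line (assumption stated in the lead-in).
   Stop := Next_Blank (Line (First .. Line'Last));
   pragma Assert (Stop = 9);
   pragma Assert (Line (First .. Stop - 1) = "alpha");
end Word_Demo;
```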

function Replace
  (Source : in String;
   What   : in String;
   By     : in String)
  return String;
Replaces all non-overlapping occurrences of What in Source by By. Occurrences of What in By are not replaced recursively, as this would lead to an infinite recursion anyway.
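
The non-recursive behavior can be seen in a short sketch; the procedure name Replace_Demo and the inputs are illustrative.

```ada
with Util.Strings; use Util.Strings;

procedure Replace_Demo is
begin
   --  By contains What, but the inserted text is not scanned again.
   pragma Assert (Replace ("a-b-c", "-", "--") = "a--b--c");

   --  Replacing by the empty string deletes all occurrences.
   pragma Assert (Replace ("a-b-c", "-", "") = "abc");

   --  Without any occurrence of What, Source is returned unchanged.
   pragma Assert (Replace ("abc", "x", "y") = "abc");
   null;
end Replace_Demo;
```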

generic
   Any_One      : in Character := '?';
   Zero_Or_More : in Character := '*';
   Set_Inverter : in Character := '!';
   Has_Char_Set : in Boolean   := True;
   Has_Escape   : in Boolean   := True;
   Zero_Or_One  : in Boolean   := False;
function Wildcard_Match
  (Pattern : in String;
   Text    : in String)
  return Boolean;
Returns True if the wildcard string Pattern matches the text Text, and False otherwise. Raises Illegal_Pattern if the pattern is malformed.

Wildcard patterns are a simple form of regular expressions. Their syntax is as follows: (This description assumes the default values for all generic parameters.)
? Matches any one character.
* Matches any sequence of characters (zero or more).
[...] The characters between the square brackets define a character set. Matches any one character of the given set.
[!...] Defines an inverted set. Matches any one character not listed.

Character sets are given either by specifying a range ("a-z"), single characters ("xyz") or any combination of the two ("a-zA-Z0123"). If the first character in the set is '!', the set is inverted, i.e. it contains all characters not listed.

Any character that is not one of the meta characters '?', '*', '[', ']', and '\' matches literally. To do a literal match against any meta character, escape it with a backslash, or use a one-character character set.

\? or [?] matches a ?
\* or [*] matches a *
\[ or [[] matches a [
\] or []] matches a ]
\\ or [\] matches a \

In a character set, characters must not and need not be escaped. To include the character '!' in a character set, make sure it is not the character immediately following the '['. To include ']' in a character set, make sure it follows the opening '[' (or the opening "[!" in the case of an inverted set) immediately. To include '-' in a character set, make it either the first or last character of the set, or the lower or upper bound of a range, e.g. "[-a-z]", or "[abc-]", or "[ab --9]", or "[!-./]".

(Note that in "[ab --9]", the set is 'a' or 'b' or (' ' to '-') or '9', not 'a' or 'b' or ' ' or ('-' to '9'), i.e. the earliest interpretation of a range is taken. Also note that the set "[abc--9]" is illegal because in the range "c--", 'c' > '-'. Specify this set as "[--9abc]" instead.)

The '!' used for set inversion matches literally when used outside a character set. It is a meta character only when immediately following the opening '[' of a character set.

Note that by default '?' matches any one character, not zero or one!

Matches are always case sensitive. To do a case insensitive match, map upper-case letters to lower-case letters in both the text and the pattern before calling this routine.

Note: if character sets are not allowed, they match literally. E.g. the pattern "[abc]" would then match the text "[abc]", but not "a".

Generic Parameters:
Any_One The character used to match any one arbitrary text character. If Zero_Or_One (see below) is True, this character matches zero or one arbitrary characters.
Zero_Or_More The character used to match zero or more arbitrary characters.
Set_Inverter The character used for inverting a character set. If it is No_Set_Inverter, but Has_Char_Set (see below) is True, character sets cannot be inverted. If Has_Char_Set is False, Set_Inverter is ignored.
Has_Char_Set If True, character sets are supported, otherwise, they're not allowed and the set meta characters '[' and ']' always match literally. (Note that the set inverter (by default '!') always matches literally if used outside a character set.)
Has_Escape If True, backslash-escaping of meta characters is supported. If False, it is not, and one-character character sets must be used for literal matches against meta characters.
Zero_Or_One If True, the Any_One character matches zero or one text characters. If False, Any_One must match a text character.

The three characters used for Any_One, Zero_Or_More and Set_Inverter should of course be distinct, and not coincide with any of the other meta characters either!

Note that character sets always must match a character; a null match is never allowed. (If null matches were allowed, a pattern like "[!a]*" would also match texts starting with "a"!)


function Match
  (Pattern : in String;
   Text    : in String)
  return Boolean;
A default instantiation of the above Wildcard_Match.
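
The following sketch exercises Match and one custom instantiation of Wildcard_Match. The names Match_Demo, Optional_Match, Ok and Dummy are hypothetical, and the expected results are derived from the pattern rules described above rather than from running the original implementation. The last block only shows how a caller might guard against Illegal_Pattern.

```ada
with Util.Strings; use Util.Strings;

procedure Match_Demo is
   --  Hypothetical instantiation in which '?' matches zero or one character.
   function Optional_Match is new Wildcard_Match (Zero_Or_One => True);

   Ok : Boolean := True;
begin
   --  '*' and character sets.
   pragma Assert (Match ("*.ad[sb]", "util-strings.ads"));
   pragma Assert (not Match ("*.ad[sb]", "util-strings.adc"));

   --  Inverted set: must match one character that is not 'a'.
   pragma Assert (Match ("[!a]*", "bcd"));
   pragma Assert (not Match ("[!a]*", "abc"));

   --  Escaped meta character.
   pragma Assert (Match ("\*", "*"));
   pragma Assert (not Match ("\*", "x"));

   --  By default '?' must match exactly one character.
   pragma Assert (not Match ("a?c", "ac"));
   pragma Assert (Optional_Match ("a?c", "ac"));
   pragma Assert (Optional_Match ("a?c", "abc"));

   --  Case-insensitive matching by lower-casing both arguments.
   pragma Assert (Match (To_Lower ("*.ADS"), To_Lower ("Util-Strings.Ads")));

   --  Guarding against a malformed pattern ("[abc--9]" is documented
   --  as illegal above).
   declare
      Dummy : Boolean;
   begin
      Dummy := Match ("[abc--9]", "x");
   exception
      when Illegal_Pattern =>
         Ok := False;   --  e.g. report the bad pattern to the user
   end;
end Match_Demo;
```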

private

   --  Implementation-defined ...
end Util.Strings;