Created by: gwideman, Jul 18, 2012 8:13 pm
Revised by: gwideman, Jul 21, 2012 1:37 pm (21 revisions)

Overview

One of the beauties of Delphi is that, like C and C++, it is a speedy compiled language which provides pointers for fine-grain manipulation of memory, but for the most part the complex or hazardous aspects of pointers don't need to be confronted in everyday programming. Delphi's implementation of strings is a key example.
However, that benefit is somewhat eroded when it's necessary to interface Delphi strings to other APIs, like Windows APIs or COM. The features are there to do it smoothly, but it can be a little confusing as to which of several Delphi string types or character pointer types to use. The confusion multiplied with Delphi 2009/2010 and later, which changed what the default String/Char/PChar types refer to.
That transition was mostly transparent for new projects built completely within Delphi, but rather disrupts the non-expert's understanding of which string types to use with various Windows or COM APIs, and why. Advice found online, which one might have followed cookbook-style, may apply only to particular versions of Delphi, but not make it clear which ones.
This page captures my notes on this topic.

String types and their layouts (Win32)

The memory layouts for string and related types provide the ground truth as to which type is compatible with which, particularly which pointer types are sensible to use with which string types. So herewith I gather notes on all the string and char types, and their memory layouts.
Notes:
  • The type names "String", "Char" and "PChar" are names for default types, the specific meanings of which have changed over the years, notably in Delphi 2009. Other type names refer to structures that have remained constant.

String types

All example strings contain the 26 letters of the English alphabet: "AB...YZ", and where it is relevant, character encoding code page 1252 ($4E4) is shown.
| Type | Default 'String' type | Element type | Null-term, pointer | Structure |
|---|---|---|---|---|
| ShortString | TP, BP, D1? | AnsiChar | No [1] | delphi-layout-shortstring.png [1] |
| AnsiString | D2-D2007 | AnsiChar | Yes, PAnsiChar | delphi-layout-ansistring.png |
| UnicodeString | D2009-on [2] | WideChar | Yes, PWideChar | [3] delphi-layout-unicodestring.png |
| WideString (= BStr = OleStr) [4] | n/a | WideChar | Yes, PWideChar | delphi-layout-widestring.png |

  • Purple outline: The variable's base slot (eg: on the stack)
  • Blue outline: Space that the variable's base slot points to (heap)
  • Red text: Terminating nulls that need special attention
  • Orange outline: For WideString (aka BStr), space that the variable points to allocated by Windows.
  • Grey outline: A PAnsiChar or PWideChar can point to a sequence of bytes anywhere.
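
As a rough cross-check of the layouts above, the following sketch prints a few facts that can be observed from code. It assumes Delphi 2009 or later on Win32; the program and variable names are only illustrative.

```delphi
program StringLayoutDemo;
{$APPTYPE CONSOLE}

var
  SS: ShortString;
  A: AnsiString;
  U: UnicodeString;
begin
  SS := 'ABC';
  A := 'ABC';
  U := 'ABC';

  // ShortString: the characters live in the variable's own slot.
  // Byte 0 holds the length; no terminating null is stored.
  Writeln('SizeOf(ShortString)   = ', SizeOf(SS));   // 256
  Writeln('ShortString length    = ', Ord(SS[0]));   // 3

  // AnsiString and UnicodeString: the variable is one pointer to heap data.
  Writeln('SizeOf(AnsiString)    = ', SizeOf(A));
  Writeln('SizeOf(UnicodeString) = ', SizeOf(U));

  // The payload is followed by a null element, which is what lets a
  // PAnsiChar or PWideChar read it as a C-style string.
  Writeln('A ends with null: ', PAnsiChar(A)[Length(A)] = #0);
  Writeln('U ends with null: ', PWideChar(U)[Length(U)] = #0);
end.
```

The two SizeOf results for the long string types equal the pointer size of the target platform, which is why they are left without an expected value here.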

Char types

| Type | Default 'Char' | Element size | Pointer | Comments |
|---|---|---|---|---|
| AnsiChar | D1-D2007 | unsigned byte | PAnsiChar | |
| WideChar | D2009-on | 2 unsigned bytes | PWideChar | UTF-16 "element" (usually element = character). |
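
The element sizes can be confirmed with SizeOf. The sketch below assumes Delphi 2009 or later, and also shows one case where a UTF-16 element is not a whole character: a code point outside the Basic Multilingual Plane occupies two WideChar elements.

```delphi
program CharSizeDemo;
{$APPTYPE CONSOLE}

var
  U: UnicodeString;
begin
  Writeln('SizeOf(AnsiChar) = ', SizeOf(AnsiChar)); // 1
  Writeln('SizeOf(WideChar) = ', SizeOf(WideChar)); // 2

  // The default Char follows the compiler version:
  // 1 byte before D2009, 2 bytes from D2009 on.
  Writeln('SizeOf(Char)     = ', SizeOf(Char));

  // U+1D11E (musical G clef) written as a UTF-16 surrogate pair.
  // Length counts elements, not characters.
  U := #$D834#$DD1E;
  Writeln('Length(U)        = ', Length(U)); // 2
end.
```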

Pointers to char (PAnsiChar, PWideChar)

In this table, the pointed-to (grey box) items show the structures that a PXxxxChar should point to in order to meet the expectations of a function that tries to read data from it. (Conversely, this is the structure that results when a function writes via a PXxxxChar.) The main requirement is the terminating null character, as this is the only indicator of the length of the string.
Delphi does not allocate these structures automatically. Instead:
  • a PXxxxChar can be pointed to an existing string,
  • specific storage can be allocated to the pointer by declaring and initializing a large enough string or array of chars (and assigning it to the PXxxxChar variable),
  • or use GetMem (...and later FreeMem).
Note the congruence between the structures here, and those of the string and array types.
| Type | Default 'PChar' | Pointer | Structure |
|---|---|---|---|
| PAnsiChar | D1-D2007 | | delphi-layout-pansichar.png |
| PWideChar | D2009-on | | delphi-layout-pwidechar.png |
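
The three ways of giving a PXxxxChar something to point at can be sketched as follows. This is a minimal illustration using PAnsiChar, not the only way to do it; the buffer size and names are arbitrary, and Move is used for copying so that the terminating null is carried along explicitly.

```delphi
program PCharStorageDemo;
{$APPTYPE CONSOLE}

var
  Src: AnsiString;
  Buf: array[0..15] of AnsiChar;
  P: PAnsiChar;
begin
  Src := 'hello';

  // 1. Point at an existing string. No new storage is created,
  //    so Src must stay alive and unchanged while P is in use.
  P := PAnsiChar(Src);
  Writeln(P);

  // 2. Use a declared array of chars as the storage.
  //    Length(Src) + 1 copies the terminating null as well.
  Move(PAnsiChar(Src)^, Buf, Length(Src) + 1);
  P := @Buf[0];
  Writeln(P);

  // 3. Allocate with GetMem, and release with FreeMem when done.
  GetMem(P, Length(Src) + 1);
  try
    Move(PAnsiChar(Src)^, P^, Length(Src) + 1);
    Writeln(P);
  finally
    FreeMem(P);
  end;
end.
```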

Example array-of-char types

For comparison, and possibly as an alternate way to allocate storage for PAnsiChar or PWideChar. Perhaps obviously, when this data is written to as an array, the terminating null character has to be added explicitly. (Not shown: corresponding static and dynamic arrays of AnsiChar.)
| Array kind | Declaration | Structure |
|---|---|---|
| Static array | array[1..n] of WideChar | delphi-layout-statarray-wc.png |
| Dynamic array | array of WideChar... SetLength | delphi-layout-dynarray-wc.png |
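
A small sketch of filling such arrays by hand, with the terminating null added explicitly, might look like this. It assumes Delphi 2009 or later; the array sizes are arbitrary.

```delphi
program CharArrayDemo;
{$APPTYPE CONSOLE}

var
  Stat: array[1..4] of WideChar;
  Dyn: array of WideChar;
  P: PWideChar;
  S: UnicodeString;
begin
  // Static array: three characters plus an explicit terminating null.
  Stat[1] := 'A';
  Stat[2] := 'B';
  Stat[3] := 'C';
  Stat[4] := #0;
  P := @Stat[1];
  S := P;               // copies characters up to the null
  Writeln(S);           // ABC

  // Dynamic array: indexes are zero-based; storage comes from SetLength.
  SetLength(Dyn, 4);
  Dyn[0] := 'X';
  Dyn[1] := 'Y';
  Dyn[2] := 'Z';
  Dyn[3] := #0;
  P := @Dyn[0];
  S := P;
  Writeln(S);           // XYZ
end.
```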
Notes
[1] To allow code to successfully read through a PAnsiChar pointer that points to a shortstring, application code could stuff a 0 byte into the character position after the last character of the actual string. (This additional null character must be allowed for in the length declared for the shortstring.)
[2] Actually it looks like, in XE2 at least, UnicodeString is an alias for String, not the other way around. So basically the actual definition of String is what changed in D2009 and after.
[3] See Clarification of UnicodeString and AnsiString elemSize field section below.
[4] WideString relationship to BStr: For Win32, _NewWideString actually calls the Windows OLE function SysAllocStringLen(), so that rather definitively tells us that a WideString works like a BStr and can be used in conjunction with OLE/COM/ActiveX interfaces that call for such an argument.
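
Note [1] can be illustrated with a short sketch. It assumes the declared maximum length leaves room for the extra null, and that range checking is at its default (off) setting; the declared size of 10 is arbitrary.

```delphi
program ShortStringNullDemo;
{$APPTYPE CONSOLE}

var
  SS: string[10];   // ShortString with room for up to 10 characters
  P: PAnsiChar;
begin
  SS := 'ABC';

  // Stuff a 0 byte into the position after the last character.
  // This is only valid while Length(SS) is less than the declared maximum.
  SS[Length(SS) + 1] := #0;

  // Point at the first character (SS[0] is the length byte, so skip it).
  P := @SS[1];
  Writeln(P);       // ABC
end.
```

Any later assignment to SS may overwrite the stuffed null, so it would have to be written again before the pointer is next used.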

Clarification of UnicodeString and AnsiString elemSize field.

The presence of the elemSize field suggests that instances of UnicodeString and AnsiString could accommodate elements of various sizes. Apparently, as of Delphi XE, this is not the case. AnsiString has elemSize always 1, and UnicodeString always 2. So what is the point of this field? This forum discussion describes the historical origin. From Remy Lebeau's comment:
In D2009, when UnicodeString was first introduced, it could hold both [AnsiChars or WideChars], actually. This was for compatibility with existing AnsiString code while people ported their apps to Unicode. That is why the StrRec record that is located in front of a string's character payload contains an elemSize field. The RTL in D2009 and D2010 would use that field to distinguish between Ansi payloads and UTF-16 payloads, regardless of what string type is actually holding the payload. If the RTL detected a mismatch, it would silently "correct" the mismatch by converting the payload to the correct type. This behavior was controlled by a new STRINGCHECKS compiler directive, but was dangerous and exhibited side effects under some situations. While most code was not affected by this, mismatches that did occur typically only occurred at library boundaries, particularly in event handlers where strings had to pass between C++ code and Delphi code.
By the time XE was released, [...] the entire STRINGCHECKS feature was removed from the compiler and RTL, and the IDE was updated to detect and error on mismatches in C++ event handlers. While the StrRec.elemSize field still exists (and the RTL still contains a StringElementSize() function) for backwards compatibility, it is not actually used for anything anymore. An AnsiString is always elemSize=1, and an UnicodeString is always elemSize=2, and all RTL code assumes that. If a mismatch happens to occur, you have a bug in your code, the data is corrupted, and the RTL will not try to fix it for you anymore.
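
The elemSize values can be read back with the RTL's StringElementSize() function mentioned in the quote. A minimal sketch, assuming Delphi 2009 or later (StringCodePage is a companion RTL function that reads the code page field of the same header):

```delphi
program ElemSizeDemo;
{$APPTYPE CONSOLE}

var
  A: AnsiString;
  U: UnicodeString;
begin
  A := 'ABC';
  U := 'ABC';

  Writeln('AnsiString elemSize    = ', StringElementSize(A)); // 1
  Writeln('UnicodeString elemSize = ', StringElementSize(U)); // 2

  // Payload size in bytes is element count times element size.
  Writeln('UnicodeString payload bytes = ', Length(U) * StringElementSize(U)); // 6

  // The code page field sits in the same header; 1200 means UTF-16.
  Writeln('UnicodeString code page = ', StringCodePage(U));
end.
```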

Pointer-to-character types

Reading via PChar/PAnsiChar/PWideChar

The types PAnsiChar and PWideChar (the latter = PChar from D2009 on) are simply pointers. To work properly in functions that expect to read string data from such a pointer, a PXxxxChar needs to point to a sequence of bytes that represent AnsiChars or WideChars, terminated by a null character (two bytes in the case of PWideChar).
Therefore, a PXxxxChar could be set to point to the beginning of any of the Delphi string types (except ShortString), or indeed to some intermediate position. Or a PXxxxChar could point to the beginning or an intermediate location of an array of bytes or chars, provided there is a terminating null.
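
For instance, a pointer can start at the beginning of a UnicodeString or be advanced to an intermediate position, and a reader such as StrLen simply scans from there to the null. This sketch assumes Delphi 2009 or later and the SysUtils unit (System.SysUtils if the project uses unit scope names).

```delphi
program ReadViaPCharDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  U: UnicodeString;
  P: PWideChar;
begin
  U := 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';

  // Point at the first character of the string's payload.
  P := PWideChar(U);
  Writeln(StrLen(P));   // 26

  // Advance by 23 elements (not bytes); P now points at 'X'.
  Inc(P, 23);
  Writeln(StrLen(P));   // 3
  Writeln(P);           // XYZ
end.
```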

Writing via PChar/PAnsiChar/PWideChar

1. Before handing over a PChar to some other code (say an API) to receive data, usually a buffer must be allocated to the PChar into which the API function can write. One way:
var
  PBufStr: PAnsiChar;
  BufStr, RetStr: AnsiString;
begin
  // Allocate a 10-character buffer (plus the implicit terminating null) owned by BufStr.
  BufStr := StringOfChar(AnsiChar('x'), 10);
  PBufStr := PAnsiChar(BufStr);
  SomeAPI(PBufStr, 10); // C prototype: SomeAPI(char * OutStr, int Len);
  // Make an AnsiString copy, because SomeAPI didn't update BufStr's Length field.
  RetStr := PBufStr;
end;
Notes:
  • The type of string and Pxxx declared in Delphi must match the type of string (eg: char *) in SomeAPI.
  • Make sure that BufStr's lifetime is at least as long as PBufStr needs to be around.
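
The same pattern with a real Win32 function might look like the sketch below, which asks GetWindowsDirectoryW to write into a UnicodeString buffer. It assumes Delphi 2009 or later and the Windows unit (Winapi.Windows if the project uses unit scope names). Here the API's return value gives the number of characters written, so the string can be trimmed in place instead of copied.

```delphi
program ApiWriteDemo;
{$APPTYPE CONSOLE}

uses
  Windows;

var
  Buf: UnicodeString;
  Len: Cardinal;
begin
  // Allocate MAX_PATH characters; the string also has room for the null.
  SetLength(Buf, MAX_PATH);

  // The API writes through the pointer and returns the character count
  // (excluding the null), 0 on failure, or a larger size if the buffer is too small.
  Len := GetWindowsDirectoryW(PWideChar(Buf), MAX_PATH);
  if (Len = 0) or (Len > MAX_PATH) then
  begin
    Writeln('GetWindowsDirectoryW failed');
    Halt(1);
  end;

  // Trim the string so its Length field matches what was written.
  SetLength(Buf, Len);
  Writeln(Buf);
end.
```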

Which types to use when

From the info gathered on this page, the following are my conclusions as to which types to use when. (I've not yet actually tested these completely.)
| API | API wants | pxx | LPxxx | API will | Use Delphi type | Comment |
|---|---|---|---|---|---|---|
| Win32 | char * | psz | LPCSTR | read | PAnsiChar | If the Delphi variable is not already a PAnsiChar, then cast it. |
| Win32 | char * (plus length arg) | psz | LPSTR (not const) | write | PAnsiChar | Same as preceding, but in this case caller must allocate sufficient space to the PAnsiChar variable. |
| Win32 | char * | psz | LPSTR | return | PAnsiChar | Variable should be a PAnsiChar, unallocated. [1] |
| Win32 | wchar * | pwsz | LPCWSTR | read | PWideChar | If the Delphi variable is not already a PWideChar, then cast it. |
| Win32 | wchar * (plus length arg) | pwsz | LPWSTR (not const) | write | PWideChar | Same as preceding, but in this case caller must allocate sufficient space to the PWideChar variable. |
| Win32 | wchar * | pwsz | LPWSTR | return | PWideChar | Variable should be a PWideChar, unallocated. [1] |
| COM | BSTR | | | read, write or return | WideString | See Lifecycle section below |

Notes

[1] API function will replace whatever value is in the variable, and any previously allocated memory would be leaked.
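
As a hedged illustration of the "read" and "return" rows, the sketch below uses two Win32 functions from the Windows unit: lstrlenW reads from an LPCWSTR, and GetCommandLineW returns an LPWSTR. It assumes Delphi 2009 or later (Winapi.Windows if the project uses unit scope names).

```delphi
program ApiReadReturnDemo;
{$APPTYPE CONSOLE}

uses
  Windows;

var
  S, Cmd: UnicodeString;
  P: PWideChar;
begin
  // "read": the API only reads, so cast the string to PWideChar.
  S := 'ABC';
  Writeln(lstrlenW(PWideChar(S)));   // 3

  // "return": the API hands back a pointer to memory that it owns.
  // P starts out unallocated and simply receives the pointer.
  P := GetCommandLineW;

  // Copy into a Delphi string; do not free P.
  Cmd := P;
  Writeln(Cmd);
end.
```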

Casting

Example:
MyPWideChar := PWideChar(MyUnicodeString);
In some cases, casting simply tells the compiler to treat the base value of the variable as a pointer to a type different from the one originally declared. In this example, the UnicodeString's pointer can legitimately be reinterpreted as a pointer to a WideChar string.
In other cases, the contents of the string would not be compatible:
MyPWideChar := PWideChar(MyAnsiString);
In this case I'm not clear on what happens. Does the compiler create a temporary variable of some kind? Does the compiler just let MyPWideChar point to a string of AnsiChars whose bytes won't make sense as a WideChar string? Needs more investigation. There are a bunch of conversion functions in system.pas, but it's not clear which are called under what circumstances.
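
One way to sidestep that question is to convert the string explicitly first and only then cast, so the pointer is always taken from a string of the matching element type. The sketch below assumes Delphi 2009 or later. It also shows a detail of the cast itself: for an empty string the variable's own pointer is nil, but the PWideChar cast still yields a pointer to a null character.

```delphi
program CastDemo;
{$APPTYPE CONSOLE}

var
  A: AnsiString;
  U: UnicodeString;
  PW: PWideChar;
begin
  A := 'ABC';

  // Convert the payload from the ANSI code page to UTF-16 first,
  // keeping the result in a variable so the storage stays alive.
  U := UnicodeString(A);

  // Now the cast is a compatible one: U really holds WideChars.
  PW := PWideChar(U);
  Writeln(PW);                        // ABC

  // Empty string: the string variable itself is a nil pointer,
  // but the cast gives a valid pointer to a terminating null.
  U := '';
  Writeln(Pointer(U) = nil);          // TRUE
  Writeln(PWideChar(U) <> nil);       // TRUE
end.
```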

C type mnemonics

  • LP = long pointer: basically, a long (32-bit) integer containing a pointer. (The LP, I think, came from Win16 and its choice between single-word and two-word pointers.)
  • C = Const
  • TSTR and TCHAR: can refer to either one-byte or two-byte chars depending on whether UNICODE is defined.
  • Example: LPCTSTR = pointer to const string using TCHARs, the size of which is defined elsewhere.

Lifecycle of BSTR (WideString)

This MSDN article describes management of allocation and deallocation of the space for a BSTR:
From examining system.pas, it's my belief that for WideString variables, the SysAllocString and SysFreeString situations are automatically handled by Delphi. In particular, Delphi frees the BSTR (via SysFreeString) when the WideString variable goes out of scope, and also does the right thing when a new value is copied into a WideString.
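
A small sketch of that lifecycle, assuming Win32 and the ActiveX unit (Winapi.ActiveX if the project uses unit scope names). SysStringLen is the OLE function that reads a BSTR's length prefix, so getting sensible answers from it is consistent with a WideString being a BSTR. The procedure name is only illustrative.

```delphi
program BStrLifecycleDemo;
{$APPTYPE CONSOLE}

uses
  ActiveX;

procedure Demo;
var
  W: WideString;
begin
  // Assignment allocates a BSTR behind the scenes.
  W := 'ABC';
  Writeln(SysStringLen(PWideChar(W)));   // 3

  // Assigning a new value replaces the BSTR; no manual free is written here.
  W := 'ABCDE';
  Writeln(SysStringLen(PWideChar(W)));   // 5
end; // W goes out of scope here, and Delphi releases the BSTR

begin
  Demo;
end.
```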

References

Delphi docs on Embarcadero docwiki

Regarding BSTR

Others

A Brief History of Delphi Strings