From: Greg Bailey [mailto:greg@minerva.com]
Sent: Tuesday, June 01, 1999 7:41 PM
To: 'ANSForth real mailgroup'
Cc: 'Localisation and Internationalisation'; 'ark-gvb-i'
Subject: Octet String Prospectus


Problem Statement:
------------------

Most standards defining interoperable data structures, such as for
example those used in networking and cryptography, do so in terms of
sequences of octets.  Even in embedded applications, these standards are
increasingly relevant, and indeed supporting them is often a critical
application requirement.

The most commonly encountered computer architectures today address
their memories in units of 8 bit bytes, and Standard Forth applications
have no difficulty in manipulating octet sequences directly when
running on typical systems, with eight bit character sets, for such
machines.

However, such applications are environmentally dependent upon this
common combination in which addresses are in units of bytes or octets,
*and* in which characters are eight bits wide; or upon machines whose
addresses are in units such as 4-bit nibbles which divide 8, and whose
characters are also eight bits wide.  On these families of architectures
portable software may manipulate octet sequences by treating them as
characters.

If, however, either character size or address units are larger
than eight bits, we do not document standard ways of allocating,
manipulating, or performing I/O using sequences of octets.

This proposal provides a mechanism that may be used by standard
programs to manipulate sequences of octets on any standard system
which supports it.

(Actual packaging TBD.  Should probably be an extension, but if
so it will depend upon presence of the DOUBLE extension; and it
will include additions to the FILE extension if both are present.)


Discussion of common practice and architectural tradeoffs:
----------------------------------------------------------

Many systems and applications have been written for "cell addressed"
machines with 16 bit and larger address units.  Many strategies have
been used for addressing characters, which were generally equivalent to
octets, on such machines.  In general the hardware does not directly
support linear addressing of bytes, characters, or octets, so this type
of arithmetically usable address has generally been simulated in
software.  The most commonly used strategy has been to multiply the
physical, cell address by the number of octets held within a cell, and
add to this product the relative position of the octet within the cell,
in order to form a linear octet address.  Coding strategies for
employing this additional, synthetic address data type depend on the
nature of the underlying CPU.  Since there is usually a substantial
performance penalty for using these synthetic addresses, it has been
common practice to use the octet address data type only in conjunction
with octet operators, and to use native cell addresses for all other
purposes.
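
The address synthesis just described might be sketched as follows.
This is only an illustration: the names OCTETS/CELL, >OCTET-ADDR and
OCTET-ADDR> are hypothetical, and the value of two octets per cell is
an assumption that models a 16 bit cell addressed machine.

```forth
\ Sketch only: models a hypothetical 16 bit cell addressed machine.
2 CONSTANT OCTETS/CELL   \ assumed value; not part of any standard

\ Form a linear octet address from a cell address and the position
\ of the octet within that cell.
: >OCTET-ADDR  ( cell-addr pos -- octet-addr )
   SWAP OCTETS/CELL * + ;

\ Split a linear octet address back into cell address and position,
\ using unsigned division so that large addresses are handled.
: OCTET-ADDR>  ( octet-addr -- cell-addr pos )
   0 OCTETS/CELL UM/MOD SWAP ;
```

With these definitions, 100 1 >OCTET-ADDR leaves 201, and
201 OCTET-ADDR> leaves 100 1.  Every octet access pays for this
arithmetic, which is the performance penalty noted above.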

Since the dynamic range required of this synthetic data type is
one or more bits larger than for native address units, it follows
that if the machine supports full cell width cell addresses, then
an address capable of identifying any stored character or octet
within the memory must be greater than one cell in width.

A number of practical systems have used cell width octet addresses
with varying degrees of success.  For example, a number of the 16 bit
minicomputers have been restricted architecturally to 15 bit
cell addressing; in fact, in some cases, the 16th bit has been used to
mark indirect addresses.  On such systems, it has been possible to
address all of memory with a 16 bit octet address, with no negative side
effects.

Less successful have been efforts to use 16 bit synthetic octet
addresses on machines that support full 16 bit cell addressing.
One strategy is to limit octet addressing to the low half of
memory.  Another is to "float" octet addressing upon each task's
private memory.  Yet another subdivides octet addressable space
into a static, common region and another which is "floated".  Each
of these strategies has inflicted pain upon programmers who have
had to live with them.

A slightly less obvious form of this pain has been experienced
when maintaining a single source base that runs on both cell and
octet addressed machines.  In a typical synthetic addressing scheme
for such 16 bit machines, it is possible to convert a cell address
into the synthetic address of its first octet by simply doubling
the cell address.  The advantage of this transformation was that
all the system had to do was specify which operators took octet
addresses as opposed to cell addresses, and expect the programmer
to use the conversion operator when needed.  This avoided the need
for special allocation and declaration functions for octet space.
The disadvantage was that, when running on an octet addressed machine,
the conversion operators were no-ops.  The consequences of failing to use
a conversion operator, or of using the wrong address type with a given
function, were nil.  As a result, a programmer could change such a
program inattentively, test it on an octet addressed machine, and never
discover the bugs thus introduced until the program was later run on a
cell addressed machine.  Practical experience has shown that this error
is easy to make, hard to detect, and is a direct consequence of having
an octet address that is of the same size and the same value as is the
regular memory address on octet addressed machines.  As a result, it
appears that from the perspective of human factors this is an
architecture to be avoided.

Based on this experience, it is proposed that explicit octet
addressing be done using an ordered pair.  This practice has actually
been used in a number of systems, and is also the method often
used in hardware and software support for octet sequences on
large cell addressed mainframes.


Synopsis of proposed architecture:
----------------------------------

The ordered pair of an Octet Address consists of a Base Address
and an Octet Index.  The Base Address is the standard Address of
the beginning of a memory allocation declared for an Octet Sequence. All
Octet Addresses within that allocation share the same Base Address, and
there is no portable method for transforming an Octet Address with a
given Base Address to use a different Base Address. The Octet Index is a
zero relative, non-negative integer denoting the position of an octet within
the sequence which starts at the Base Address.

On the stack, the Base Address is on top.  Arithmetic on Octet
Addresses is meaningful only when subtracting the address of
one octet from that of another within the same sequence, or
when adding or subtracting a scalar to or from the address of
an octet.  This structure and these rules allow the application
to use double operators such as M+ and D- for the valid arithmetic
if those operators are assumed present.  Otherwise, since such valid
arithmetic never involves carries or borrows between the Index and Base
parts of the Octet Address, it may be performed with simple phrases
built from standard CORE operators, and similarly in machine code.
For example, the difference between two Octet Addresses that may be
validly compared may be computed by

   ROT 2DROP -   ( in lieu of  D- )

and an Octet Address may be decremented using

   SWAP 1- SWAP   ( in lieu of  -1 M+ )

Incrementation is of course done by the dedicated operator below.
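
To make the equivalence concrete, here is a small sketch that wraps
both forms of each operation so they can be compared side by side.
The names BUF, 8-DIFF, 8-DEC, 8-DIFF-D and 8-DEC-D are hypothetical
and exist only for this illustration; the last two assume the DOUBLE
word set is present.

```forth
\ Sketch only.  An Octet Address is the pair ( index base ), base on top.
CREATE BUF  16 ALLOT    \ hypothetical octet sequence; supplies a Base Address

\ Using CORE words only.
: 8-DIFF    ( 8-addr1 8-addr2 -- n )   ROT 2DROP - ;
: 8-DEC     ( 8-addr1 -- 8-addr2 )     SWAP 1- SWAP ;

\ The same operations using the DOUBLE word set.  D- leaves a double
\ whose high cell is the difference of the two (equal) bases, so D>S
\ recovers the index difference.
: 8-DIFF-D  ( 8-addr1 8-addr2 -- n )   D- D>S ;
: 8-DEC-D   ( 8-addr1 -- 8-addr2 )     -1 M+ ;
```

For example, 9 BUF 4 BUF 8-DIFF and 9 BUF 4 BUF 8-DIFF-D both leave 5,
and 9 BUF 8-DEC and 9 BUF 8-DEC-D both leave the Octet Address 8 BUF.
The double forms agree only for valid arithmetic: decrementing an
index of zero would borrow from the Base Address.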

Finally, this arrangement leads to syntax which is analogous to
that which is commonly used with arrays in Forth.  If PACKET has
been declared as an octet sequence, the phrase:

   5 PACKET

places on the stack the formal Octet Address of the sixth octet
in that sequence since PACKET simply provides the Base Address
for that sequence.  In a loop,

    I PACKET

or  4 + DUP PACKET

occurs naturally as it does with arrays, avoiding the stack
bloat that would occur if "indexing" were not available and
arithmetic on the double form were the only way to navigate.
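
As an illustration of the array-like usage, the following sketch sums
the leading octets of a sequence in a loop.  It assumes a typical
system in which address unit, character and octet are all 8 bits, and
it supplies a minimal stand-in for the 8@ operator proposed later in
this message.  PACKET-SUM is a hypothetical name, and the 8 octet
PACKET here is only for the example.

```forth
\ Sketch only.  Assumes address unit = character = octet = 8 bits.
: 8@  ( 8-addr -- u )  + C@ ;     \ minimal stand-in for the proposed word
CREATE PACKET  8 ALLOT            \ hypothetical 8 octet sequence

\ Sum the first u octets of PACKET, indexing it as one would an array.
: PACKET-SUM  ( u -- sum )
   0 SWAP  0 ?DO  I PACKET 8@ +  LOOP ;
```

Only the running sum stays on the stack between iterations; the Octet
Address is formed and consumed inside the loop body.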


I believe, based on considerable experience, that this is the
cleanest way to deal with this issue.  In fact, it is precisely
the solution that ATHENA uses for data structures defined as
sequences of *bits*, where it has served well, led to readable
code, and produced no glaring inconsistencies.  Based on this,
the minimum set of things we might need is:

   OCTETS  ( n1 - n2)               Clone defn from CHARS
   OCTET+  ( 8-addr1 - 8-addr2)     Clone defn from CHAR+
   8@      ( 8-addr - u)            Clone defn from C@
   8!      ( u 8-addr - )           Clone defn from C!
   8MOVE   ( 8-addr1 8-addr2 u - )  Clone defn from CMOVE

It is strictly coincidental that "8" looks very much like "B"
at first glance ;-)
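
For a sense of how little this costs on typical equipment, here is a
minimal sketch of the five words for a system in which address unit,
character and octet are all 8 bits.  That equivalence is an assumption
of the sketch, not something the proposal requires; a cell addressed
system would need real packing and unpacking code.  The helper 16!NET
and the use of 2>R, 2R@ and 2R> from CORE EXT are mine, for
illustration only.

```forth
\ Sketch only.  Assumes address unit = character = octet = 8 bits.
\ An Octet Address is the pair ( index base ), with the base on top.
: OCTETS  ( n1 -- n2 )  ;                          \ one address unit per octet
: OCTET+  ( 8-addr1 -- 8-addr2 )  SWAP 1+ SWAP ;   \ advance the index
: 8@      ( 8-addr -- u )   + C@ ;
: 8!      ( u 8-addr -- )   + C! ;
: 8MOVE   ( 8-addr1 8-addr2 u -- )                 \ copy u octets, low to high
   >R  + >R  +  R> R>  CMOVE ;

\ Usage: store a 16 bit value in network (big-endian) order.
CREATE PACKET  536 OCTETS ALLOT
: 16!NET  ( x 8-addr -- )
   2>R  DUP 8 RSHIFT 255 AND  2R@ 8!               \ high octet first
   255 AND  2R> OCTET+ 8! ;
```

After  HEX 1234 0 PACKET 16!NET DECIMAL  the phrase 0 PACKET 8@
returns 18 (hex 12) and 1 PACKET 8@ returns 52 (hex 34).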


Storage for octet sequences is allocated using the present
conventions for allocating and identifying *aligned* addresses.
For example,

   CREATE PACKET  536 OCTETS ALLOT
   ... , ... ALIGN HERE  64 OCTETS ALLOT  ...

For the purpose of complying with standards, the first form is
more likely to be used.  The requirement for ALIGNing Base
Addresses facilitates efficient implementations on the universe
of equipment.


Addition of octet sequence support to the FILE extension must be
done in such a way that it is independent of character size, which
may be larger than an octet.  However, as written, all FILE operators
function in terms of lengths and positions whose units are characters.
Because more than one octet position may map onto the same character
position, dealing with the same file ID in terms of both octets and
characters would be problematic.

Instead, the following is proposed:

    OCT  ( fam1 - fam2)

       Modify the implementation-defined file access method fam1
       to additionally select an octet oriented, as opposed to character
       or file oriented, access method.  When a file ID has been opened
       with the OCT access method, all file positions and sizes used in
       association with that file are in units of octets instead of
       characters.  In addition, it is an amgiguous condition to use
       READ-FILE, READ-LINE, WRITE-FILE, WRITE-LINE, or INCLUDE-FILE
       with such a file ID.  INCLUDED is not mentioned in this list
       because it does not consume a file ID.

    READ-OCTET  ( 8-addr u1 fileid - u2 ior)   Clone from READ-FILE

       Note ambiguous condition if used with a fileid not opened as OCT

    WRITE-OCTET ( 8-addr u fileid - ior)       Clone from WRITE-FILE

       Note ambiguous condition if used with a fileid not opened as OCT
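
As a rough sketch of what this might amount to on a typical system,
again assuming address unit, character and octet are all 8 bits, the
three words reduce to thin wrappers over the existing FILE words.
Treating OCT as a no-op on such a system is an assumption of this
sketch, as is the use of 2>R and 2R> from CORE EXT.

```forth
\ Sketch only.  Assumes address unit = character = octet = 8 bits, so
\ an octet oriented file is the same thing as a character oriented one.
: OCT  ( fam1 -- fam2 )  ;          \ nothing to change on such a system

: READ-OCTET   ( 8-addr u1 fileid -- u2 ior )
   2>R  +  2R>  READ-FILE ;         \ index + base gives the c-addr

: WRITE-OCTET  ( 8-addr u fileid -- ior )
   2>R  +  2R>  WRITE-FILE ;
```

A file would then be opened with a phrase such as  R/O BIN OCT
OPEN-FILE  and read with, for example,  0 PACKET 64 fileid READ-OCTET
where PACKET is an octet sequence declared as shown earlier.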


This appears to be the minimum necessary change.  READ-FILE and
WRITE-FILE are not overloaded because experience indicates that
having different arguments for the same function depending on a
flag leads to maintenance problems.

If written, this proposal will of course have to include a number
of details in sections 2, 3, and 4 as well as 11 and whatever is
assigned for this extension.


ALTERNATIVE STRUCTURE 1:
------------------------

If the TC strongly feels that this is too much solution for the
problem, there is a simpler alternative that is logically self
consistent:

1.  An octet is guaranteed to fit inside the storage allocation
    for a character.

2.  Therefore, omit all of this except the FILE wordset part.

3.  In the FILE wordset, include  OCT  but simply note that in
    this access method octets are read from and written to the
    device, sizes and positions are in octets, and the data
    are read into and written from character storage such that
    octets are right justified and zero filled into characters
    on READ-FILE, only the low order eight bits of each character
    are written by WRITE-FILE, and that READ-LINE, WRITE-LINE,
    and INCLUDE-FILE are ambiguous with an  OCT  file handle.

The disadvantage of this is that while it would allow everyone
with the AU=byte=char=octet dependency to congratulate themselves
as having complied without doing any work, it would not address
the physical storage structures commonly used by hardware and
operating systems for cell addressed equipment, and would be
inefficient on byte addressed machines with large characters.


ALTERNATIVE STRUCTURE 2:
------------------------

It might be more useful to use the initial structure above but to
de-ambiguify READ-FILE and WRITE-FILE by incorporating the
conventions in item 3 of alternative 1 above.  What this would buy
is that an existing AU=byte=octet=char application that had to
be converted in a hurry to use, say, 16 bit characters could adapt
to such a system by using OCT as file access method with no other
changes (assuming it was coded with CHARS and CHAR+ as needed)
and still operate upon its octet sequence structures with reduced
efficiency.  For that matter, it could run on cell addressed
hardware with similarly reduced efficiency.  In either case, at leisure and
if necessary the application could be upgraded to actually use the Octet
Addressing functions, but in the meanwhile there would be a fast and
dirty way to solve the problem with minimal effort.


At present I think that Alternative 2 would be the wisest of these
three.  Perhaps the part of Alternative 2 taken from Alternative 1
could be the OCTET extension, and the rest of it could be called
OCTET EXT.

Or, if one felt more strongly about it, OCT could be added to the
base FILE wordset along with the change in behavior of that wordset per
Alternative 1, and the rest of 2 implemented as simply the OCTET wordset
with no OCTET EXT (as yet).  For those maintaining typical systems, that
could require as little as adding OCT as a no-op.



Obviously it would be nice to have a first draft that might pass,
so these packaging issues should be more or less resolved first.
In that regard the central question is, to me, how essential and
therefore how non-optional each of these layers should be.


-----------------------------------------------------------------


    Greg Bailey     |  ATHENA Programming, Inc  |  503-295-7703  |
  ----------------  |  310 SW 4th Ave  Ste 530  |  fax 295-6935  |
  greg@minerva.com  |  Portland, OR  97204  US  |