[AscToHTM] Documentation for AscToHTM conversion utility
----------------------------------------------------------------------------

5 HTML markup produced

5.1 Indentation

     AscToHTM performs statistical analysis on the document to determine at
     what character positions indentations occur. This information is used
     on the output pass to determine the indentation level for each source
     line.

     AscToHTM attempts to indent the HTML code to match the output
     indentation level, to make it easier to read. The indentations
     themselves will be marked up using <UL> ... </UL> tags.

     Note:
          This is arguably not strictly correct HTML, as we omit the <LI>
          tag that would give a bullet character. However, this does produce
          simpler HTML than using <TABLE> ... </TABLE> markup, which is not
          supported by earlier browsers and which gives (apparently, if not
          actually) slower-loading HTML.
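
     The analysis pass described above can be pictured with a minimal
     Python sketch. It is not AscToHTM's actual algorithm: the function
     names and the min_share threshold are assumptions made purely for
     illustration. It counts how often each leading-space offset occurs,
     keeps the offsets that are common enough, and then maps a line to an
     indent level.

```python
from collections import Counter


def indent_positions(lines, min_share=0.05):
    """Return the column offsets used often enough to count as indent levels.

    min_share is a hypothetical threshold: offsets used by fewer than this
    fraction of the non-blank lines are treated as noise.
    """
    counts = Counter(
        len(line) - len(line.lstrip(" "))
        for line in lines
        if line.strip()
    )
    total = sum(counts.values())
    return sorted(col for col, n in counts.items() if n / total >= min_share)


def indent_level(line, positions):
    """Map a line to the deepest indent level whose offset it reaches."""
    col = len(line) - len(line.lstrip(" "))
    level = 0
    for i, pos in enumerate(positions):
        if col >= pos:
            level = i
    return level
```

     On the output pass, each increase in level would open a <UL> and
     each decrease would close one.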

5.2 Header Lines

     AscToHTM recognises various types of headers. Where headers are found,
     and deemed to be consistent with the prevailing document policy
     (correct indentation, right type, in numerical sequence etc), AscToHTM
     will use the standard <Hn> ... </Hn> markup.

     In addition to this, AscToHTM will insert a named Anchor tag (<A> ...
     </A>) to allow hyperlink jumps to this point. These anchors are used
     for example in the contents list and cross-reference hyperlinks that
     AscToHTM generates.

5.2.1 Numbered headers

     This is the preferred heading type and the type that AscToHTM has most
     success with. Sections of type N.N.N can be checked for consistency,
     and references to them can be spotted and converted into hyperlinks.

     At present more exotic numbering schemes using roman numerals and
     letters of the alphabet are not fully supported. This is planned to be
     implemented soon, possibly via user policy files.
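
     As an illustration of the consistency check, here is a small Python
     sketch. It assumes headings start in column 0 and that a heading is
     "in sequence" if it either goes one level deeper starting at 1, or
     increments a number at the same or a shallower level. The anchor name
     format (section_N.N) is hypothetical, not AscToHTM's own.

```python
import html
import re

# A dotted section number at the start of the line, then the title.
HEADING_RE = re.compile(r"^(\d+(?:\.\d+)*)\s+(\S.*)$")


def parse_heading(line):
    """Return ((5, 2, 1), "Title") for a numbered heading, else None."""
    m = HEADING_RE.match(line)
    if not m:
        return None
    number = tuple(int(p) for p in m.group(1).split("."))
    return number, m.group(2).strip()


def follows(prev, cur):
    """True if section number cur is a plausible successor to prev."""
    if len(cur) == len(prev) + 1:
        # One level deeper, e.g. 5.2 -> 5.2.1
        return cur[:-1] == prev and cur[-1] == 1
    if len(cur) <= len(prev):
        # Same level or shallower, e.g. 5.2.1 -> 5.2.2 or 5.2.4 -> 5.3
        k = len(cur) - 1
        return cur[:-1] == prev[:k] and cur[-1] == prev[k] + 1
    return False


def render_heading(number, title):
    """Emit <Hn> markup with a named anchor (anchor naming is assumed)."""
    label = ".".join(str(n) for n in number)
    level = min(len(number), 6)
    return (f'<H{level}><A NAME="section_{label}">'
            f"{label} {html.escape(title)}</A></H{level}>")
```

     A line such as "1997 was a good year" also matches the pattern, which
     is why the sequence check matters: it fails follows() against the
     preceding section number and can be left as ordinary text.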

5.2.2 Capitalised headers

     AscToHTM can treat wholly capitalised lines as headers. It also allows
     for such headers to be spread over more than one line.

5.2.3 Underlined headers

     AscToHTM can recognise underlined text, and optionally promote the
     preceding line to be a section header.
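
     A rough Python sketch of underline detection follows. The set of
     underline characters and the rule that the underline must be within
     two characters of the text length are assumptions for illustration.

```python
import re

# A line made of three or more repeats of one underline character.
UNDERLINE_RE = re.compile(r"^\s*([-=~_*])\1{2,}\s*$")


def find_underlined_headers(lines):
    """Return indices of text lines immediately followed by an underline."""
    found = []
    for i in range(len(lines) - 1):
        text, below = lines[i], lines[i + 1]
        if not text.strip() or UNDERLINE_RE.match(text):
            continue
        if UNDERLINE_RE.match(below):
            # Assumed rule: the underline is roughly as long as the text.
            if abs(len(below.strip()) - len(text.strip())) <= 2:
                found.append(i)
    return found
```

     The length comparison keeps a full-width rule under a short line
     from being mistaken for an underline.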

5.2.4 Numbered paragraphs

     Some types of documents use what look like section numbers to number
     paragraphs (e.g. legal documents, or sets of rules).

     AscToHTM can recognise this, and mark up such lines by placing the
     number in bold, and not using <Hn> ... </Hn> markup on the whole line.

5.3 Hyperlinks

5.3.1 Contents List lines

     Contents list lines are marked up in bold, and turned into a hyperlink
     pointing at the section referenced. The text is sized according to
     heading type in the range +/- 1 font size from normal (3).

5.3.2 Cross-references

     AscToHTM can convert cross-references to other sections into hyperlinks
     to those sections. Unfortunately this is currently only possible for
     second, third, fourth... level numeric headings (n.n, n.n.n, n.n.n.n
     etc).

     This is because the error rate becomes too high on single
     numbers/letters or roman numerals. This may be refined in later
     releases.
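
     The restriction to multi-level numbers can be sketched in Python as
     below. This is an illustration only: the pattern, the known_sections
     argument and the section_N.N anchor names are assumptions. A
     reference is linked only if it has at least two levels and matches a
     section number actually found in the document.

```python
import re

# Two or more dot-separated numbers, not embedded in a longer token.
XREF_RE = re.compile(r"(?<![\w.])(\d+(?:\.\d+)+)(?!\w|\.\d)")


def link_cross_references(text, known_sections):
    """Turn references such as 5.2.1 into links if that section exists."""
    def repl(m):
        label = m.group(1)
        if label in known_sections:
            # Anchor naming is hypothetical.
            return f'<A HREF="#section_{label}">{label}</A>'
        return label
    return XREF_RE.sub(repl, text)
```

     A single number such as "5" never matches the pattern, and a number
     such as "1.25" is left alone unless a section 1.25 exists.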

5.3.3 URLs

     AscToHTM can convert any URLs in the document to hyperlinks. This
     includes http and ftp URLs and any web addresses beginning with www.
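
     A minimal Python sketch of this kind of URL detection is shown
     below. The pattern and the trimming of trailing punctuation are
     assumptions, not a description of AscToHTM's internals. Addresses
     beginning with www. are given an http:// prefix in the link target.

```python
import re

# http:// or ftp:// URLs, or bare addresses starting with www.
URL_RE = re.compile(r"\b(?:(?:http|ftp)://|www\.)[^\s<>\"]+")


def link_urls(text):
    """Wrap each URL found in text in an <A HREF=...> tag."""
    def repl(m):
        url = m.group(0)
        # Trailing punctuation usually belongs to the sentence, not the URL.
        trailing = ""
        while url and url[-1] in ".,;:!?)":
            trailing = url[-1] + trailing
            url = url[:-1]
        href = url if "://" in url else "http://" + url
        return f'<A HREF="{href}">{url}</A>{trailing}'
    return URL_RE.sub(repl, text)
```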

5.3.4 Usenet Newsgroups

     AscToHTM can convert any newsgroup names it spots into hyperlinks to
     those newsgroups. Currently only third level newsgroups such as
     comp.os.vms are converted, to reduce the error rate. This may be
     improved in later releases.
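
     One way to keep the error rate down is sketched in Python below. It
     rests on two assumptions that are not taken from AscToHTM: that
     "third level" means a name with three or more dot-separated parts,
     and that the name must start with one of a fixed list of well-known
     Usenet hierarchies, so that ordinary dotted names such as web
     addresses are not mistaken for newsgroups.

```python
import re

# Assumed list of top-level hierarchies; extend as needed.
HIERARCHIES = ("alt", "comp", "humanities", "misc", "news",
               "rec", "sci", "soc", "talk")

NEWSGROUP_RE = re.compile(
    r"(?<![\w.@/-])((?:%s)(?:\.[a-z0-9][a-z0-9+-]*){2,})(?![\w@-]|\.\w)"
    % "|".join(HIERARCHIES)
)


def link_newsgroups(text):
    """Wrap names such as comp.os.vms in a news: hyperlink."""
    return NEWSGROUP_RE.sub(r'<A HREF="news:\1">\1</A>', text)
```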

5.3.5 E-mail addresses

     AscToHTM can convert any email addresses into hypertext mailto: links.
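
     For illustration, a deliberately simple Python sketch of this
     conversion is given below. The pattern is an assumption and accepts
     only the common name@host.domain form.

```python
import re

# Local part, "@", then a host name with at least one dot.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+\b")


def link_emails(text):
    """Wrap each e-mail address in a mailto: hyperlink."""
    return EMAIL_RE.sub(
        lambda m: f'<A HREF="mailto:{m.group(0)}">{m.group(0)}</A>',
        text,
    )
```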

5.4 Hanging paragraph indents

     Some documents, especially ones dumped from Word, have hanging
     paragraph indents. That is, each paragraph starts at an offset to the
     rest of the paragraph.

     AscToHTM struggles heroically with this, and tries not to treat this
     as text at two indent levels, but it does occasionally get confused.

     If writing a text file from scratch with AscToHTM in mind, then it is
     best to avoid this practice.

5.5 Bullets

     AscToHTM detects and supports several types of bullets.

5.5.1 Bullet chars

     Bullet chars are lines of the type

        - this is a bullet line

        - this is a bullet paragraph
          because it carries over onto
          more lines

     That is, a single character followed by the bullet line. AscToHTM can
     determine via statistical analysis which character, if any, is being
     used in this way. Special attention is paid to the '-' and 'o'
     characters.

     Bullets of this type are given a <UL> ... <LI> ... </UL> markup.
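
     The statistical choice of bullet character might look something
     like the following Python sketch. The pattern and the min_count
     threshold are assumptions for illustration. Any punctuation
     character, or the letter 'o', counts as a candidate if it stands
     alone at the start of a line and is followed by text.

```python
import re
from collections import Counter

# One punctuation character (or a lone 'o'), whitespace, then text.
BULLET_RE = re.compile(r"^\s*([^\w\s]|o)\s+\S")


def detect_bullet_char(lines, min_count=2):
    """Return the most frequent candidate bullet character, or None."""
    counts = Counter()
    for line in lines:
        m = BULLET_RE.match(line)
        if m:
            counts[m.group(1)] += 1
    if not counts:
        return None
    char, n = counts.most_common(1)[0]
    # min_count is a hypothetical threshold against one-off matches.
    return char if n >= min_count else None
```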

5.5.2 Numbered bullets

     AscToHTM can spot numbered bullets. These can sometimes be confused
     with section headings in some documents. This is one area where the use
     of a document policy really pays dividends in sorting the sheep from
     the goats.

     Numbered bullets are given a <OL TYPE=1 START=N> ... <LI> ... </OL>
     markup.

     Note:
          Not all browsers support this type of markup. In such cases, it's
          possible that the numbering of bullets will get reset to 1 every
          so often. However, this isn't a problem with either Netscape or
          Internet Explorer.
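
     The markup above can be illustrated with a short Python sketch. It
     assumes that a numbered bullet is a number followed by "." or ")",
     and that the START value is taken from the first bullet in the run.
     Neither detail is confirmed by AscToHTM itself.

```python
import html
import re

# Number, then "." or ")", then the bullet text.
NUM_BULLET_RE = re.compile(r"^\s*(\d+)[.)]\s+(\S.*)$")


def render_numbered_bullets(lines):
    """Render a run of numbered bullet lines as an <OL> list."""
    items = []
    for line in lines:
        m = NUM_BULLET_RE.match(line)
        if m:
            items.append((int(m.group(1)), m.group(2).strip()))
    if not items:
        return ""
    start = items[0][0]
    body = "\n".join(f"<LI>{html.escape(text)}" for _, text in items)
    return f"<OL TYPE=1 START={start}>\n{body}\n</OL>"
```

     Carrying the START value over is what lets a list continue from,
     say, 3 after intervening text rather than restart at 1.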

5.5.3 Alphabetic bullets

     AscToHTM detects upper and lower case alphabetic bullets. These are
     marked up like numbered bullets, with TYPE=a.

5.5.4 Roman Numeral bullets

     AscToHTM detects upper and lower case roman numeral bullets. These are
     marked up like numbered bullets, with TYPE=i.

5.6 Definitions

5.6.1 Definition lines

     A definition line is a single line that appears to be defining
     something. Usually this is a line with either a colon (:) or an equals
     sign (=) in it. For example

        IMHO = In my humble opinion

        Address : Somewhere over the rainbow.

     AscToHTM attempts to determine what definition characters are used and
     whether they are strong (only ever used in a definition) or weak (only
     sometimes used in a definition).

     AscToHTM marks up definition lines by placing a <BR> on the end of the
     line to preserve the original line structure. Where this decision is
     made incorrectly, unexpected breaks can appear in the text.

     AscToHTM offers the option of marking up the definition term in bold.
     This is not the default behaviour however.
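
     The strong/weak distinction can be sketched in Python as follows.
     The heuristic is an assumption made for illustration: a line counts
     as a definition if a short term of at most three words is followed
     by the character and then more text. A character is "strong" if
     every line containing it is a definition, and "weak" if only some
     are.

```python
import re


def classify_definition_chars(lines, candidates=":="):
    """Return {char: "strong" or "weak"} for each character in use."""
    result = {}
    for ch in candidates:
        containing = [ln for ln in lines if ch in ln]
        if not containing:
            continue
        # Short term (1 to 3 words), the character, whitespace, then text.
        pattern = re.compile(
            r"^\s*\S+(?: \S+){0,2}\s*%s\s+\S" % re.escape(ch)
        )
        defining = sum(1 for ln in containing if pattern.match(ln))
        if defining == 0:
            continue
        result[ch] = "strong" if defining == len(containing) else "weak"
    return result
```

     With such a classification, lines using a strong character could
     always be given a <BR>, while lines using a weak one would need
     further checks.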

5.6.2 Definition paragraphs

     AscToHTM also recognises the use of definition paragraphs such as :-

      Note:     This is a definition paragraph whereby the whole
                paragraph is defining the term shown on the first line.
                Unfortunately AscToHTM currently only copes with single
                paragraphs (i.e. not with continuation paragraphs), and
                only with single word definitions.

     This gets marked up in a <DL> <DT>...</DT> <DD>...</DD> </DL> sequence.

     Note:
          This is a "definition" paragraph, i.e. the whole paragraph defines
          the term shown on the first line. Unfortunately AscToHTM currently
          only copes with single paragraphs (i.e. not with continuation
          paragraphs), and only with single word definitions.

5.7 Quoted lines

     AscToHTM recognises that, especially in Internet files, it is
     increasingly common to quote from other text sources such as e-mail.
     The convention used in such cases is to insert a quote character such
     as > at the start of each line.

     Consequently, AscToHTM adds a <BR> tag at the end of such lines to
     preserve the layout in the original.

5.8 Pre-formatted text

5.8.1 Lines and form feeds

     Lines are interpreted in context. If they appear to be underlining
     text, or part of some pre-formatted structure such as a table, then
     they are treated as such.

     Otherwise they become horizontal rules (<HR>). Form feeds or page
     breaks also become <HR> markups.

5.8.2 User defined pre-formatted text

     AscToHTM normally ignores any HTML markup in the original text. The
     sole exceptions are the < PRE> ... < /PRE> tags which a user may insert
     into their text document.

     For example :-

      The use of < PRE> and < /PRE> in the text documents tells AscToHTM
      that this portion of the document has been formatted and should
      be left unchanged.

     Note:
          Because of this, care has to be taken when referring to < PRE> and
          < /PRE> HTML tags in a source document. A single space after the
          opening < is sufficient. All other HTML tags are ignored and
          converted into "safe" text.
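
     The behaviour described in this section can be approximated by the
     Python sketch below. It is an assumption about the general approach,
     not AscToHTM's code. Text between the pre-formatting tags is passed
     through untouched, and everything else has its HTML special
     characters made safe.

```python
import html
import re

# The capture group makes split() keep the matched blocks in the result.
PRE_RE = re.compile(r"(<PRE>.*?</PRE>)", re.IGNORECASE | re.DOTALL)


def escape_except_pre(text):
    """Escape HTML everywhere except inside user-supplied PRE blocks."""
    parts = PRE_RE.split(text)
    # Odd-indexed parts are the matched blocks; even-indexed parts are
    # ordinary text that must be converted into safe text.
    return "".join(
        part if i % 2 else html.escape(part, quote=False)
        for i, part in enumerate(parts)
    )
```

     Because the pattern needs the tag name to follow the < directly, a
     tag written with a space after the < is treated as ordinary text
     and escaped.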

5.8.3 Automatically detected pre-formatted text

     AscToHTM attempts to spot chunks of preformatted text. This can vary
     from a single line (e.g. a line with a page number on the right-hand
     margin) to a complete table of data.

     Where such text is detected AscToHTM marks these sections up in < PRE>
     ... < /PRE> tags.

     Eventually it is hoped to add full <TABLE> ... </TABLE> generation for
     such sections.

5.9 Centred text

     AscToHTM can attempt to spot chunks of centred text. However, because
     this can easily go wrong, this option is normally switched off.

     Centering is only switched on for single isolated lines, or any group
     of at least two lines. <CENTER> ... </CENTER> markup is used.
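
     A simple test for a centred line is sketched in Python below. The
     page width of 76 columns and the tolerance of 2 are assumptions. A
     line is taken as centred if its left and right margins are both
     non-zero and nearly equal.

```python
def looks_centred(line, width=76, tolerance=2):
    """Return True if line appears centred within the given page width."""
    stripped = line.strip()
    if not stripped:
        return False
    left = len(line) - len(line.lstrip(" "))
    right = width - left - len(stripped)
    return left > 0 and right > 0 and abs(left - right) <= tolerance
```

     An indented paragraph line usually fails this test because its
     right margin is much larger or smaller than its left one, but short
     indented lines can pass by accident, which may be one reason the
     option is off by default.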

----------------------------------------------------------------------------
© 1997 John A. Fotheringham