AscToHTM 

Documentation for the AscToHTM conversion utility

This documentation can be downloaded as part of the documentation set in .zip format (370k)





5 HTML markup produced

 

5.1 Text layout

 

5.1.1 Indentation

AscToHTM performs statistical analysis on the document to determine at what character positions indentations occur. This information is used on the output pass to determine the indentation level for each source line.

AscToHTM attempts to indent the HTML code to match the output indentation level, to make it easier to read.

The indentations themselves are marked up using <BLOCKQUOTE> ... </BLOCKQUOTE> tags.

Future versions of AscToHTM may offer you the option of using <DIV> tags.
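As a rough illustration of the two-pass idea described above, the sketch below counts leading-space widths to find the indent positions, then maps each line onto a level. It is not AscToHTM's actual algorithm; the function names and the min_share threshold are assumptions made for this example.

```python
from collections import Counter

def detect_indent_positions(lines, min_share=0.05):
    """Return indent columns used by at least min_share of the non-blank lines."""
    counts = Counter(
        len(line) - len(line.lstrip(" "))
        for line in lines
        if line.strip()
    )
    total = sum(counts.values())
    if total == 0:
        return []
    return sorted(col for col, n in counts.items() if n / total >= min_share)

def indent_level(line, positions):
    """Map a line's leading-space count onto the deepest detected position it reaches."""
    col = len(line) - len(line.lstrip(" "))
    level = 0
    for i, pos in enumerate(positions):
        if col >= pos:
            level = i
    return level

def to_blockquotes(lines):
    """Wrap each non-blank line in one <BLOCKQUOTE> pair per indentation level."""
    positions = detect_indent_positions(lines)
    out = []
    for line in lines:
        if not line.strip():
            continue
        level = indent_level(line, positions)
        out.append("<BLOCKQUOTE>" * level + line.strip() + "</BLOCKQUOTE>" * level)
    return out
```

A real converter would open and close the tags once per indented region rather than once per line; the per-line form is used here only to keep the sketch short.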


5.1.2 Hanging paragraph indents

Some documents, especially ones dumped from Word, have hanging paragraph indents. That is, each paragraph starts at an offset to the rest of the paragraph.

AscToHTM struggles heroically with this, and tries not to treat this as text at two indent levels, but it does occasionally get confused.

If you are writing a text file from scratch with AscToHTM in mind, it is best to avoid this practice.


 

5.1.3 Bullets



AscToHTM detects and supports several types of bullets.


 

5.1.3.1 Bullet chars



Bullet chars are lines of the type

        - this is a bullet line
        - this is a bullet paragraph
          because it carries over onto
          more lines

That is, a single character followed by the bullet line. AscToHTM can determine via statistical analysis which character, if any, is being used in this way. Special attention is paid to the '-' and 'o' characters.

Bullets of this type are given a <UL> ... <LI> ... </UL> markup.
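The sketch below shows one way the bullet character could be guessed by counting, and the bullets then marked up. It is a simplified assumption of how such analysis might work, not the program's real logic; guess_bullet_char, bullets_to_html and the min_count threshold are names invented for this example.

```python
from collections import Counter
import re

# A single non-space character, then white space, then the bullet text
BULLET_RE = re.compile(r"^\s*(\S)\s+\S")

def guess_bullet_char(lines, min_count=2):
    """Return the most common candidate bullet character, or None."""
    counts = Counter()
    for line in lines:
        m = BULLET_RE.match(line)
        # Punctuation is a candidate; 'o' is allowed as a special case
        if m and (not m.group(1).isalnum() or m.group(1) == "o"):
            counts[m.group(1)] += 1
    if not counts:
        return None
    char, n = counts.most_common(1)[0]
    return char if n >= min_count else None

def bullets_to_html(lines):
    """Turn bullet lines (and their continuation lines) into <UL> markup."""
    char = guess_bullet_char(lines)
    items = []
    for line in lines:
        stripped = line.strip()
        if char and stripped.startswith(char + " "):
            items.append(stripped[1:].strip())
        elif items and stripped:
            items[-1] += " " + stripped  # continuation of the previous bullet
    return "<UL>\n" + "".join(f"<LI>{item}\n" for item in items) + "</UL>"
```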


5.1.3.2 Numbered bullets

AscToHTM can spot numbered bullets. These can sometimes be confused with section headings in some documents. This is one area where the use of a document policy really pays dividends in sorting the sheep from the goats.

Numbered bullets are given a <OL TYPE=1 START=N> ... <LI> ... </OL> markup.

Note:
Not all browsers support this type of markup. In such cases, it's possible that the numbering of bullets will get reset to 1 every so often. However, this isn't a problem with either Netscape or Internet Explorer.

5.1.3.3 Alphabetic bullets

AscToHTM detects upper and lower case alphabetic bullets. These are marked up like numbered bullets, with TYPE=a.


5.1.3.4 Roman Numeral bullets

AscToHTM detects upper and lower case roman numeral bullets. These are marked up like numbered bullets, with TYPE=i.


 

5.1.4 Centred text

AscToHTM can attempt to spot sections of centred text. However, because this can easily go wrong, this option is normally switched off.

Centring is only switched on for single isolated lines, or any group of at least two lines. <CENTER> ... </CENTER> markup is used.

See the policy "Allow automatic centring"




 


 

5.1.5 Definitions

5.1.5.1 Definition lines

A definition line is a single line that appears to be defining something. Usually this is a line with either a colon (:) or an equals sign (=) in it. For example

        IMHO = In my humble opinion

or

        Address : Somewhere over the rainbow.

AscToHTM attempts to determine what definition characters are used and whether they are strong (only ever used in a definition) or weak (only sometimes used in a definition).

AscToHTM marks up definition lines by placing a <BR> on the end of the line to preserve the original line structure. Where this decision is made incorrectly, unexpected breaks can appear in text.

AscToHTM offers the option of marking up the definition term in bold. This is not the default behaviour, however.
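To make the markup rule concrete, here is a minimal sketch that recognises a "term : value" or "term = value" line and appends <BR>, with optional bolding of the term. The regular expression and the function name markup_definition_line are assumptions for illustration; the strong/weak analysis described above is not attempted.

```python
import re

# term, then ':' or '=', then white space and a value
DEF_RE = re.compile(r"^(\s*)([^:=]+?)\s*([:=])\s+(\S.*)$")

def markup_definition_line(line, bold_term=False):
    """Return the line with <BR> appended if it looks like a definition line."""
    m = DEF_RE.match(line)
    if not m:
        return line  # not a definition line: leave untouched
    _, term, sep, value = m.groups()
    if bold_term:
        term = f"<B>{term}</B>"
    return f"{term} {sep} {value}<BR>"
```

For example, the line "IMHO = In my humble opinion" would come out as "IMHO = In my humble opinion<BR>", or with the term wrapped in <B> ... </B> when bold_term is set.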


5.1.5.2 Definition paragraphs

AscToHTM also recognises the use of definition paragraphs such as :-

      Note:     This is a "definition" paragraph, i.e. the whole
                paragraph defines the term shown on the first line.
                Unfortunately AscToHTM currently only copes with single
                paragraphs (i.e. not with continuation paragraphs), and
                only with single word definitions.

This gets marked up in a <DL> <DT>...</DT> <DD>...</DD> </DL> sequence, which displays as

Note:
This is a "definition" paragraph, i.e. the whole paragraph defines the term shown on the first line. Unfortunately AscToHTM currently only copes with single paragraphs (i.e. not with continuation paragraphs), and only with single word definitions.

 

5.2 Text formatting

 

5.2.1 Quoted lines

AscToHTM recognises that, especially in Internet files, it is increasingly common to quote from other text sources such as e-mail. The convention used in such cases is to insert a quote character such as > at the start of each line.

Consequently, AscToHTM adds a <BR> tag at the end of such lines to preserve the line structure of the original, and marks it up in <EM>..</EM> tags to differentiate the quoted text.
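A minimal sketch of this treatment of quoted lines follows. The function name and the decision to escape the text with Python's html.escape are assumptions for the example, not a description of the program's internals.

```python
import html

def markup_quoted_lines(lines, quote_char=">"):
    """Wrap lines starting with the quote character in <EM> and end them with <BR>."""
    out = []
    for line in lines:
        if line.lstrip().startswith(quote_char):
            # Escape the text so the '>' itself survives as &gt;
            out.append(f"<EM>{html.escape(line.strip())}</EM><BR>")
        else:
            out.append(html.escape(line))
    return out
```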


 

5.2.2 Emphasis

AscToHTM can look for text emphasised by placing asterisks (*) either side of it, or underscores (_). AscToHTM will convert the enclosed text to bold and italic respectively, using <STRONG> and <EM> tags.

AscToHTM will also look for combinations of asterisks and underscores, which are placed in bold italic. The asterisks and underscores should be properly nested.

The emphasised word or phrase should span no more than a few lines, and in particular should not span a blank line. If the phrase is longer, or if AscToHTM fails to match opening and closing emphasis marks, the characters are left unconverted.

To a limited extent the software can detect and handle nested emphasis of different types. For example, a phrase that is mostly in italics may contain a few bold words within it. Nested emphasis of the same type is not supported.
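The conversion can be approximated with two regular expressions, as in the sketch below. This is an illustrative assumption rather than the program's own method, and for simplicity it only matches emphasis that opens and closes on the same line. Unmatched marks, and underscores inside words, are left unconverted.

```python
import re

# '*' or '_' must hug the emphasised text and must not sit inside a word
BOLD_RE = re.compile(r"(?<![\w*])\*(\S(?:[^*\n]*\S)?)\*(?![\w*])")
ITALIC_RE = re.compile(r"(?<![\w_])_(\S(?:[^_\n]*\S)?)_(?![\w_])")

def convert_emphasis(paragraph):
    """Convert *bold* to <STRONG> and _italic_ to <EM>."""
    text = BOLD_RE.sub(r"<STRONG>\1</STRONG>", paragraph)
    return ITALIC_RE.sub(r"<EM>\1</EM>", text)
```

Because bold is converted first, a mostly italic phrase containing a few bold words comes out as <EM> wrapped around <STRONG>, which matches the nested case described above.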



Tests are made to


 

5.2.3 Fonts

The program allows you to select the default font for your files. Fonts are implemented using Cascading Style Sheets (CSS) by default, although this can be changed to the <FONT> tag should you wish.

The <FONT> tag is allowed in "HTML 3.2", but is "deprecated" in favour of CSS in "HTML 4.0 Transitional". It is completely disallowed under "HTML 4.0 Strict".

Some older features of the software still use <FONT> tags. These will be changed to also use CSS in later releases as support for "HTML 4.0 Strict" is added.

See Font Policies for more details.


5.2.4 Special characters

The program will detect special characters and symbols such as á and © and will replace these by the correct HTML entity codes &aacute; and &copy;. By using the correct HTML codes, the HTML produced is guaranteed to display correctly on all computers.

Some symbols use different character codes on different machines (e.g. Mac and PC).
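The replacement of special characters by named entities can be sketched with the entity table in Python's standard library. The function name encode_entities is invented for this example, and only non-ASCII characters are converted so that existing markup characters are left alone.

```python
from html.entities import codepoint2name

def encode_entities(text):
    """Replace non-ASCII characters that have a named HTML entity."""
    out = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        if ord(ch) > 127 and name:
            out.append(f"&{name};")
        else:
            out.append(ch)
    return "".join(out)
```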


 

5.3 Added hyperlinks

5.3.1 Contents List lines

Contents list lines are marked up in bold, and turned into a hyperlink pointing at the section referenced. The text is sized according to heading type in the range +/- 1 font size from normal (3).


 

5.3.2 Cross-references

AscToHTM can convert cross-references to other sections into hyperlinks to those sections. Unfortunately this is currently only possible for second, third, fourth... level numeric headings (n.n, n.n.n, n.n.n.n etc).

This is because the error rate becomes too high on single numbers/letters or roman numerals. This may be refined in future releases, although it's hard to see how that would work.
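The restriction to multi-level numbers can be illustrated as follows. The pattern only matches references of the form n.n, n.n.n and so on, and only links those that correspond to a known section. The anchor naming scheme "section_<number>" and the function name are assumptions made for this sketch.

```python
import re

# At least two numeric levels, not embedded in a longer word or number
XREF_RE = re.compile(r"(?<![\w.])(\d+(?:\.\d+)+)(?![\w]|\.\d)")

def link_cross_references(text, known_sections):
    """Turn n.n style references to known sections into hyperlinks."""
    def repl(m):
        number = m.group(1)
        if number in known_sections:
            return f'<A HREF="#section_{number}">{number}</A>'
        return number  # unknown section: leave as plain text
    return XREF_RE.sub(repl, text)
```

A single number such as "5" is never matched, which mirrors the point above that single numbers give too many false hits.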


 

5.3.3 URLs

AscToHTM can convert any URLs in the document to hyperlinks. This includes http and ftp URLs and any web addresses beginning with www.

The domain name part of the URL will be checked against the known domain name structures and country codes to check it falls within an allowed group. So www.somewhere.thing won't be allowed, as ".thing" isn't a proper top level domain.

URLs that use IP addresses or some more obscure methods of specifying domain names will also be recognised, but the link will be changed wherever possible to either a domain name or an IP address. This will de-obfuscate any obscure references so beloved by spammers.
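The domain check can be sketched as below. The KNOWN_TLDS set is a tiny illustrative subset chosen for this example, not the list the program uses, and the handling of IP addresses and obfuscated hosts described above is deliberately left out.

```python
import re
from urllib.parse import urlsplit

# Illustrative subset only; a real list would be much longer
KNOWN_TLDS = {"com", "org", "net", "edu", "gov", "uk", "de", "fr"}

URL_RE = re.compile(r'\b((?:https?://|ftp://|www\.)[^\s<>"]+)', re.IGNORECASE)

def link_urls(text):
    """Hyperlink URLs whose host ends in a recognised top level domain."""
    def repl(m):
        url = m.group(1).rstrip(".,;:!?)")   # drop trailing punctuation
        trailing = m.group(1)[len(url):]
        href = url if "://" in url else "http://" + url
        host = urlsplit(href).hostname or ""
        tld = host.rsplit(".", 1)[-1].lower()
        if tld not in KNOWN_TLDS:
            return m.group(0)                # not a recognised domain: leave as text
        return f'<A HREF="{href}">{url}</A>{trailing}'
    return URL_RE.sub(repl, text)
```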




5.3.4 Usenet Newsgroups

AscToHTM can convert any newsgroup names it spots into hyperlinks to those newsgroups. Because this is prone to error, AscToHTM currently only converts newsgroups in known USENET hierarchies such as rec.gardens by default.

This can be overcome either by

  1. placing "news:" in front of the newsgroup name (e.g. news:this.is.a.newsgroup.honest)

  2. relaxing this condition via a document policy (see the policy "Only use known groups")

  3. specifying the newsgroup hierarchy as recognised via a policy "Recognised USENET groups".

 

5.3.5 E-mail addresses

AscToHTM can convert any email addresses into hypertext mailto: links. As with URLs (see 5.3.3), the domain name is checked to see that it falls into a recognised group.


5.3.6 User-specified keywords

AscToHTM can convert user-specified keywords into hyperlinks. The words or phrase to be converted must lie on a single line in the source document. Care should be taken to ensure keywords are unambiguous. Normally I mark my keywords in [] brackets if authoring for conversion by AscToHTM.

See the discussions on "link dictionaries" in 4.3.2.2 and 4.4.2.

5.3.7 Other sections and URLs

AscToHTM offers a number of pre-processor commands and in-line tags that allow you to add hyperlink commands to your source text.

For example the "GOTO" tag allows hyperlinks to be created to named sections in the same document, so that

[[GOTO Other sections and URLs]]

becomes

Other sections and URLs

In the same way the "HYPERLINK" tag allows hyperlinks to be inserted


5.4 Section headings

AscToHTM recognises various types of headings. Where headings are found, and deemed to be consistent with the prevailing document policy (correct indentation, right type, in numerical sequence etc), AscToHTM will use the standard <Hn> ... </Hn> markup.

In addition to this, AscToHTM will insert a named Anchor tag (<A> ... </A>) to allow hyperlink jumps to this point. These anchors are used, for example, in the contents list and cross-reference hyperlinks that AscToHTM generates.
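As an illustration of the heading markup and named anchor, the sketch below converts a numbered heading line. The anchor name format "section_<number>" and the rule that heading depth sets the <Hn> level are assumptions for this example; the consistency checks against document policy are not modelled.

```python
import re

HEADING_RE = re.compile(r"^(\d+(?:\.\d+)*)\s+(\S.*)$")

def markup_heading(line):
    """Return <Hn> markup with a named anchor, or None if not a numbered heading."""
    m = HEADING_RE.match(line.strip())
    if not m:
        return None
    number, title = m.groups()
    level = min(number.count(".") + 1, 6)  # HTML only defines H1 to H6
    return f'<H{level}><A NAME="section_{number}">{number} {title}</A></H{level}>'
```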

5.4.1 Numbered headings

This is the preferred heading type and the type that AscToHTM has most success with. Sections of type N.N.N can be checked for consistency, and references to them can be spotted and converted into hyperlinks.

At present more exotic numbering schemes using roman numerals and letters of the alphabet are not fully supported. This is planned to be implemented soon, possibly via user policy files.

5.4.2 Capitalised headings

AscToHTM can treat wholly capitalised lines as headings. It also allows for such headings to be spread over more than one line.


5.4.3 Underlined headings

AscToHTM can recognise underlined text, and optionally promote the preceding line to be a section header. The "underlining" line should have no gaps in it.

5.4.4 Embedded headings

New in version 4

The program can look for headings "embedded" in the first paragraph. Such headings are expected to be a complete sentence or phrase in UPPER CASE at the start of a paragraph. Where detected the heading will be marked up in bold, rather than <Hn> markup, although it will still be added to, and accessible from, any hyperlinked contents list you generate for the document.

At present such headings are not auto-detected... you need to switch on the "Expect embedded headings" policy.

5.4.5 Key phrase headings

New in version 4

The program can now look for lines that start with particular words or phrases (such as "Chapter", "Part", "Title") of your choice and treat these lines as headings. Previously this only worked in a limited way if the heading line was also numbered ("Chapter 1" etc).

To use this feature, set the policy "Heading key phrases"


5.4.6 Numbered paragraphs

Some types of documents use what look like section numbers to number paragraphs (e.g. legal documents, or sets of rules).

AscToHTM can recognise this, and mark up such lines by placing the number in bold, and not using <Hn> ... </Hn> markup on the whole line.

5.4.7 Mail and USENET headers

Some documents, especially those that were originally email or USENET posts, come with header lines, usually in the form of a number of lines with a keyword followed by a colon and then some value.

AscToHTM can recognise these (to a limited extent). Where these are detected the program will parse the header lines to extract the Subject, Author and Date of the article concerned. A 2-line heading containing this information will then be generated to replace all the unsightly header lines.
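The header handling can be imitated with Python's standard email parser, as in the sketch below. The exact layout of the generated 2-line heading is not given in this section, so the <H2> plus <P> form used here is an assumption, as is the function name.

```python
from email import message_from_string

def summarise_headers(article):
    """Return (two-line HTML heading, remaining body text) for a mail or news article."""
    msg = message_from_string(article)
    subject = msg.get("Subject", "(no subject)")
    author = msg.get("From", "(unknown author)")
    date = msg.get("Date", "(no date)")
    heading = f"<H2>{subject}</H2>\n<P>{author}, {date}</P>"
    return heading, msg.get_payload()
```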

5.5 Pre-formatted text

5.5.1 Lines and form feeds

Lines are interpreted in context. If they appear to be underlining text, or part of some pre-formatted structure such as a table, then they are treated as such. Otherwise they become horizontal rules (<HR>).

An attempt is made to interpret half-lines etc as such, although the effect is only approximate.

Form feeds or page breaks also become <HR> markups.

New in version 4

You can use the HTML fragment HORIZONTAL_RULE to customize how lines are displayed, e.g. you can substitute your own image file.


5.5.2 User defined pre-formatted text

AscToHTM normally ignores any HTML markup in the original text. The sole exceptions are any preprocessor tags which a user may insert into their text document (see Using the preprocessor).

For example :-

      The use of BEGIN_PRE and END_PRE preprocessor
        commands (see 7.1) in
          the text documents
            tells AscToHTM that
              this portion of the
            document
          has been formatted
        by the user and
      should be left unchanged.


5.5.3 Automatically detected pre-formatted text

AscToHTM attempts to spot sections of preformatted text. This can vary from a single line (e.g. a line with a page number on the right-hand margin) to a complete table of data.

Where such text is detected AscToHTM analyses the section to determine what type of pre-formatted text it is. Options include tables (5.5.3.1), code samples (5.5.3.2), Ascii art and diagrams (5.5.3.3), text blocks (5.5.3.4) and other formatted text (5.5.3.5).

A number of policies allow you to control this behaviour. See Pre-formatted text policies for full details.

5.5.3.1 Tables

Tables are marked out by their use of white space, and a regular pattern of gaps or vertical bars being spotted on each line. AscToHTM will attempt to spot the table, its columns, its headings, its cell alignment and entries that span multiple columns or rows.

Should AscToHTM wrongly detect the extent of a table, you can mark up a section of text by using the "BEGIN/END_TABLE" pre-processor commands. Alternatively you can try adding blank lines before and after, as the analysis uses white space to delimit tables.

You can alter the characteristics of all tables via the table policies (see Table generation policies).

You can alter the characteristics of all or individual tables via the table pre-processor commands (see 7.1.4).

Or you can suppress the whole thing altogether via the "Attempt TABLE generation" policy
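The idea of finding columns from a regular pattern of gaps can be sketched as follows. A character position counts as part of a gap only if every line has a space or vertical bar there. This is a simplified assumption for illustration; headings, alignment and spanning cells are not handled, and the function names are invented.

```python
def find_columns(lines):
    """Return (start, end) spans of character positions used by at least one line."""
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    used = [any(row[i] not in " |" for row in padded) for i in range(width)]
    spans, start = [], None
    for i, is_used in enumerate(used + [False]):  # sentinel closes the last span
        if is_used and start is None:
            start = i
        elif not is_used and start is not None:
            spans.append((start, i))
            start = None
    return spans

def table_to_html(lines):
    """Cut each line at the detected column spans and emit simple <TABLE> markup."""
    spans = find_columns(lines)
    rows = []
    for line in lines:
        cells = "".join(f"<TD>{line[a:b].strip()}</TD>" for a, b in spans)
        rows.append(f"<TR>{cells}</TR>")
    return "<TABLE>\n" + "\n".join(rows) + "\n</TABLE>"
```

A cell containing a space at the same position on every row would be split in two by this sketch, which is one reason real detection needs more careful analysis.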


5.5.3.2 Code samples

AscToHTM attempts to recognise code fragments in technical documents. The code is assumed to be "C++" or "Java"-like, and key indicators are, for example, the presence of ";" characters on the end of lines.

Should AscToHTM wrongly detect the extent of a code fragment, you can mark up a section of text by using the "BEGIN/END_CODE" pre-processor commands.

You can choose what type of markup is used for the code fragment (see the policy "Use <CODE>..</CODE> markup").

Or you can suppress the whole thing altogether via the policy "Expect code samples".


5.5.3.3 Ascii art and diagrams

AscToHTM attempts to recognise Ascii art and diagrams in documents. Key indicators include large numbers of non-alphanumeric characters and the use of white space.

However, some diagrams use the same mix of line and alphabetic characters as tables, so the two sometimes get confused.

Should AscToHTM wrongly detect the extent or type of a diagram, you can mark up a section of text by using the "BEGIN/END_DIAGRAM" pre-processor commands.


5.5.3.4 Text blocks

New in version 4

If AscToHTM detects a block of text at a large indent, it will now place that text in a table in such a way as to preserve as faithfully as possible the original indent.

Of course, sometimes this text is, in fact, centred on the page. In that case you could consider switching on the policy "Allow automatic centring" (see 5.1.4).


5.5.3.5 Other formatted text

If AscToHTM detects formatted text, but decides that it is neither table, code nor art (and it knows what it likes), then the text may be put out "as normal", but with <BR> added to each line. In such regions other markup (such as bullets) may not be processed as it would be elsewhere.

5.6 Added value markup

5.6.1 Document Title

AscToHTM can calculate - or be told - the title of a document. This is placed in <TITLE>...</TITLE> markup in the <HEAD> section of each HTML page produced.

The Title is calculated in the order shown below. As soon as one algorithm returns a value, the subsequent ones are ignored.

  1. If a $_$_TITLE pre-processor command (see 7.1.2) is placed in the source text, that value is used.

  2. If the "Use first header as title" policy is set then the first heading (if any) encountered is used as the title.

     Note:
     Depending on your document structure, this is prone to give bland titles like "Introduction", "Overview" and "Summary".

  3. If the "Use first line as title" policy is set then the first line in the file is used as the title.

  4. If the "Document title" policy is set then this value is used.

     Note:
     If this is the value you want, ensure the other policies outlined above are disabled.

  5. Finally, if none of the above result in a title, the text "Converted from <filename>" is used.
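The precedence order above can be expressed as a simple fallback chain. The function and parameter names below are hypothetical stand-ins for the pre-processor command and the policies; only the ordering is taken from the text.

```python
def choose_title(filename, directive_title=None, first_heading=None,
                 first_line=None, policy_title=None,
                 use_first_heading=False, use_first_line=False):
    """Pick the document title using the documented order of precedence."""
    if directive_title:                      # 1. TITLE pre-processor command
        return directive_title
    if use_first_heading and first_heading:  # 2. "Use first header as title"
        return first_heading
    if use_first_line and first_line:        # 3. "Use first line as title"
        return first_line
    if policy_title:                         # 4. "Document title" policy
        return policy_title
    return f"Converted from {filename}"      # 5. default text
```

This also shows why the note under step 4 matters: a "Document title" value is only reached when the earlier policies are switched off or yield nothing.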
5.6.2 Contents lists

AscToHTM can detect the presence of a contents list in the original document, or it can generate a contents list for you from the headings that it observes. There are a number of policies that give you control over how and where a contents list is generated (see Contents generation policies).

There are four different situations in which contents lists may, or may not, be generated. These are default conversions (5.6.2.1), conversions to a single HTML file (5.6.2.2), conversions to multiple HTML files (5.6.2.3) and conversions to frames (5.6.2.4).

5.6.2.1 Contents lists in default conversions

By default AscToHTM will not generate a contents list for a file unless it already has one.

If it should detect a contents list in the document, then that list is changed into hyperlinks to the named sections. This only works currently for files with numbered headings.

Where an existing list is detected, headings shown in the contents list are converted into links, and the link text is that in the original contents list, and not the text in the actual heading (often they are different).

Note:
AscToHTM currently only detects numbered contents lists, and is occasionally prone to error when they are present. If you experience problems, either delete the contents list and get AscToHTM to generate one for you, or mark up the existing list using the contents pre-processor commands (see 7.1.1)

5.6.2.2 Contents lists in conversions to a single HTML file

As described in 5.6.2.1, AscToHTM will not generate a contents list by default unless the document already has one.

Requesting a contents list

You can request that a contents list is always generated, by using the "Add contents list" policy. In this case a contents list is either

  1. made from the existing contents list, or

  2. generated from the observed headings. In this case the contents list will only be as good as the detection of headings in the rest of the document permits.

Forcing a generated contents list

You can force a generated list to be used by disabling the "Use any existing contents list" policy.

If an existing contents list is present, it is deleted from the output. Normally it's best to either use the existing contents list, or to delete it from the source text and request a generated list.

Contents lists placement

By default the contents list is placed at the top of the output file. In earlier versions of AscToHTM the contents list was always placed in a separate file.

You can cause contents lists to be placed wherever you want by using the "CONTENTS_LIST" preprocessor command. If you do this, then contents lists are placed only where you place CONTENTS_LIST markers.

Generating a contents list in a separate file

If you select the "Generate external contents list" policy the contents list is placed in a separate file, and a hyperlink to that file called "Contents List" is placed at the top of the HTML page generated from the document.

You can choose the name of the external file using the "External contents list filename" policy. If omitted, the file is called "Contents_<filename>", where <filename> is the name of the document being converted.


5.6.2.3 Contents lists in conversions to multiple HTML files

AscToHTM can be made to split the output into many files. At present this is only possible at detected section headings. Each generated page usually has a navigation bar, which includes a hyperlink back to the following section in any contents list.

The behaviour is identical to that in 5.6.2.2 except that

  1. the output is now split into several files.

  2. the options to generate an external contents list in a separate file are no longer available.

  3. if the contents list is being generated, it is now placed at the foot of the first document, rather than at the top (unless the CONTENTS_LIST preprocessor command is used)

This is usually before the first heading (which now starts the second document), and after any document preamble.

Note:
Where the original contents list is used when splitting files it is possible that not every file is directly accessible from the contents list, and that the back links to the contents list may not function as expected. In such cases you can go from the contents list to a major section, and then use the navigation bars to page through to the minor section.

5.6.2.4 Contents lists in conversions to frames

New in version 4

Contents list generation for the main document will proceed as described in the previous sections.

When making a set of frames, you can elect to have a contents frame generated (the default behaviour), and this will have a generated list placed in a frame on the left. This can mean you have a contents list in the contents frame on the left, and also at the top of the first page in the main document. For this reason the main frame often starts by displaying the second page.

The number of levels shown in the contents frame list can be controlled by policy. Alternatively you can replace the whole contents of the contents frame by defining a CONTENTS_FRAME HTML fragment.


5.6.3 Directory page

When converting several files at once, AscToHTM can be made to generate a "Directory Page". This is an HTML index of all the files converted and their contents.

The policies available for controlling generation of a directory page are explained in "Directory Page policies".

The directory page will consist of an entry for each file converted, in the order that files are converted (usually alphabetic). Each entry will (optionally) contain :-

5.6.4 Headers, footers and JavaScript

AscToHTM can be made to add standard headers, footers and JavaScript to each page generated. It does this by allowing you to specify include files to be copied into the generated HTML. These include files can contain any valid HTML commands.

The program supports three types of such files :-

  1. Header files. These contain any HTML you want placing immediately after the output's <BODY> tag.

     A good example might be a standard header, with a logo and links back to the home page.

  2. Footer files. These contain any HTML you want placing immediately before the closing </BODY> tag.

  3. Script files. These contain any HTML you want placing inside the <HEAD> ... </HEAD> portion of the generated file. Such tags are not usually visible.

     You should place in here any JavaScript you want, although it is difficult to make this apply to the converted text.

You can specify include files for the converted files, as well as for any directory page (see 5.6.3) that you create. If you don't specify values for the directory page, then it will use the same files as the generated files.

HTML headers and footers can also be defined as HTML fragments (see 5.6.5).
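Where the three kinds of include file end up in a generated page can be shown with a small sketch. The function assemble_page and its parameters are hypothetical; only the placement of header, footer and script content follows the description above.

```python
def assemble_page(title, body_html, header_html="", footer_html="", script_html=""):
    """Place script content in <HEAD>, header after <BODY>, footer before </BODY>."""
    return (
        "<HTML>\n<HEAD>\n"
        f"<TITLE>{title}</TITLE>\n{script_html}"
        "</HEAD>\n<BODY>\n"
        f"{header_html}{body_html}{footer_html}"
        "</BODY>\n</HTML>\n"
    )
```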


5.6.5 HTML fragments

New in version 4

The "HTML fragments" feature allows you to define a block of HTML that you can elect to have used in place of the default HTML that the program would generate.

Some reserved names are recognised by the software and used to override the default HTML that is generated. For example the fragment HORIZONTAL_RULE can be used to define an <IMG> tag that displays an image and which will be used in a number of situations where a simple <HR> tag would have been used otherwise.

They are described fully in the Tag Manual (see HTML fragments).

Converted from a single text file by AscToHTM

© 1997-2001 John A Fotheringham