[AscToHTM] Documentation for AscToHTM conversion utility
----------------------------------------------------------------------------

6   Using File Policies

File policies have two main uses :

   a. To correct any failure of analysis that AscToHTM makes.  Hopefully
      this won't be needed too much as the core analysis engine improves.
      Examples include page width, whether or not underlined section
      headings are expected etc.

   b. To tell the program how to produce a better HTML end product in
      ways that couldn't possibly be inferred from the original text.
      Examples include adding colour and titles to the page, as well as
      requesting that a large document be split into several pages and
      a contents list created.

6.1 An example conversion

This documentation has itself been converted using AscToHTM.  The files
used were

   o  A2HDOCO.TXT.  This is the text version of the documentation.  The
      text version is kept as the master copy and updated as required.
      It's then converted to HTML.

   o  IA2HDOCO.POL.  This is the policy file used to create the HTML
      version of this document.  Only those policies that differ from
      the defaults have been added.  This policy file "includes" the
      link dictionary A2HLINKS.TXT.

   o  A2HDOCOH.TXT.  This is the header HTML added at the top of each
      generated HTML page.

   o  A2HDOCOS.TXT.  This is the JavaScript HTML added into the ...
      portion of the generated HTML page.  This particular example
      toggles the logo when the mouse is over it (only if you use
      Netscape V3.0 or above though).

   o  A2HLINKS.TXT.  This is the link dictionary used for this document
      and is used to add hyperlinks to the main text file.

   o  A2HDOCOF.TXT.  This is the footer HTML added at the bottom of each
      generated HTML page.

These files are included in the distribution kit as an example set of
documentation.  You can, of course, use AscToHTM to convert this doco
into whatever format, colour etc that you wish.

6.2 Layout policies

6.2.1 Indentation

AscToHTM has the following indentation policies that will normally be
correctly calculated on the analysis pass :-

   Indentation
   -----------
      Indent position(s)            : 0 6 8 16
      Hanging paragraph position(s) : (none)

6.2.1.1 "Indent position(s)"

These are the positions of the major indent levels in the document.

6.2.1.2 "Hanging paragraph position(s)"

These refer to the indentation levels at which definition paragraphs
are expected (see 5.6.2).

6.2.2 Page layout

AscToHTM has the following general layout policies that will normally
be correctly calculated on the analysis pass :-

   Page layout
   -----------
      Page width           : 78
      Expect Contents List : FALSE
      Text Justification   : Left

6.2.2.1 "Page width"

This value can be used to influence short line and centred text
detection.  It also helps to determine if the definition characters
':' and '-' (see 5.6.1) are to be regarded as "strong" or "weak".

6.2.2.2 "Expect Contents List"

See the discussion in 4.3.3.

6.2.2.3 "Text justification"

This policy is important in detecting pre-formatted text.  The possible
values are "left", "center" (i.e. left and right), "right" and "none".

If text is centered then padding spaces may be added.  These have to be
ignored when detecting pre-formatted text.

6.2.3 Paragraphs

AscToHTM has the following paragraph policies that will normally be
correctly calculated on the analysis pass :-

   Paragraphs
   ----------
      Expect blank lines between paras : TRUE
      New Paragraph Offset             : (none)

6.2.3.1 "Expect blank lines between paras"

Paragraphs are normally expected to have blank lines before them.
Where this isn't true (e.g. in a text file dumped from Word) different
algorithms can be applied more rigorously.

6.2.3.2 "New paragraph offset"

This policy refers to any hanging indent.  Again, this is a Word for
Windows favourite.

6.2.4 Styling

AscToHTM has the following "styling" policies that will normally be
correctly calculated on the analysis pass :-

   Style
   -----
      Highlight Definition Text      : FALSE
      Allow automatic centring       : FALSE
      Allow automatic 1-line < PRE>  : TRUE

6.2.4.1 "Highlight definition text"

This policy specifies whether or not the definition term (the part
marked up in ...) should be placed in bold for greater emphasis
(see 5.6).

6.2.4.2 "Allow automatic centring"

This policy allows automatic detection of centred text to be performed.
This is normally left switched off, as it is prone to give errors.
This algorithm may be refined in later versions.

6.2.4.3 "Allow automatic 1-line < PRE>"

This policy allows individual lines to be placed in their own
< PRE> ... < /PRE> sections.  This is sometimes desirable, for example
if you have lines with page numbers at the top of each page in your
document, so it is switched on by default.  Of course... this makes no
sense in an HTML document.

6.3 Section headers

AscToHTM has the following section heading policies that will normally
be correctly calculated on the analysis pass :-

   Section Types
   -------------
      First Section Number        : 1
      Expect Numbered Sections    : TRUE
      Expect Underlined Headings  : FALSE
      Expect Capitalised Headings : FALSE
      Expect Second Word Headings : FALSE

   We have 3 recognised headings
      Heading level 0 is of form : "" N ""      at indents : 0 /-1
      Heading level 1 is of form : "" N.N ""    at indents : 0 /-1
      Heading level 2 is of form : "" N.N.N ""  at indents : 0 /-1

Section headers are far and away the most complex things the analysis
pass has to detect, and the most likely area for errors to occur.

6.3.1 "First Section Number"

This policy indicates what the first section number is.  Normally this
starts at 1, but if it starts higher, then AscToHTM may reject headers
as being out of sequence, and fail to detect the presence or absence of
contents lists correctly.

6.3.2 "Expect Numbered Sections"

This indicates whether or not numbered sections are to be expected.

6.3.3 "Expect Underlined Headings"

This indicates whether or not underlined headers are to be expected.
AscToHTM normally promotes any underlined lines to section headers.
This policy can be used to switch that behaviour off.

6.3.4 "Expect Capitalised Headings"

This indicates whether or not a line that is wholly capitalised should
be regarded as a section heading.

6.3.5 "Expect Second Word Headings"

*** not fully supported in this version ***

6.3.6 "Heading level ..."

*** not fully supported in this version ***

6.4 Bullets

AscToHTM has the following bullet point policies that will normally be
correctly calculated on the analysis pass :-

   Bullet Types
   ------------
      Expect Numbered bullets      : FALSE
      Expect alphabetic bullets    : TRUE
      Expect Roman Numeral bullets : FALSE

   Bullet characters
   -----------------
      Bullet Char : '-'

AscToHTM tries hard not to get confused by a "1", "a" or "I" that
happens to end up at the start of a line by chance.  These could
otherwise be mistaken for bullet points.

6.4.1 "Expect Numbered bullets"

This indicates that numerical bullets are expected (but you probably
guessed that).

6.4.2 "Expect alphabetic bullets"

This does likewise for alphabetic bullet points.  AscToHTM recognises
(and distinguishes between) upper and lower case variants.

6.4.3 "Expect Roman Numeral bullets"

This does likewise for roman numerals.  Again upper and lower case
variants are recognised.

6.4.4 "Bullet Char"

These policy lines indicate character(s) that can occur at the start of
a line to represent a bullet point.  Special attention is paid to '-'
and 'o' characters, but any character will do.  Use one line per bullet
char.

6.5 Definitions

AscToHTM has the following "definitions" policies that will normally be
correctly calculated on the analysis pass :-

   Definitions
   -----------
      Definition Char : '-' (weak)
      Definition Char : ':' (strong)

See the discussion in 5.6.1.

6.6 Hyperlink policies

AscToHTM has the following hyperlink policies set as defaults :-

   Hyperlinks
   ----------
      Create hyperlinks   : TRUE
      Create mailto links : TRUE
      Create NEWS links   : TRUE
      Cross-refs at level : 2

6.6.1 "Create hyperlinks"

This policy means that all http, www and ftp URLs will get converted to
hyperlinks.

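As an illustration of the kind of conversion that 6.6.1 describes, the
Python sketch below wraps http, www and ftp URLs in anchor tags.  The
regular expression and the function name link_urls are assumptions made
for this example only; they are not AscToHTM's own detection rules,
which may well differ.

```python
import html
import re

# Assumed pattern: a URL starts with http://, ftp:// or www. and runs
# up to the next whitespace character.
URL_PATTERN = re.compile(r"\b(?:(?:http|ftp)://|www\.)\S+")


def link_urls(text):
    """Wrap each probable URL in `text` in an <A HREF> element."""

    def replace(match):
        # Trailing punctuation is treated as part of the sentence,
        # not part of the URL.
        url = match.group(0).rstrip(".,;:)")
        tail = match.group(0)[len(url):]
        # A bare www address is assumed to be an http address.
        href = url if "://" in url else "http://" + url
        return '<A HREF="%s">%s</A>%s' % (
            html.escape(href), html.escape(url), tail)

    return URL_PATTERN.sub(replace, text)


print(link_urls("See www.example.com for details."))
```

The sketch leaves the surrounding text untouched and only rewrites the
matched URL, which is the behaviour the policy description implies.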
6.6.2 "Create mailto links"

This indicates that probable email addresses such as jaf@yrl.co.uk are
to be converted into mailto hyperlinks.

6.6.3 "Create NEWS links"

This indicates that probable newsgroup references such as
alt.games.mornington.cresent (sic) are to be converted.

6.6.4 "Cross-refs at level"

This indicates the section level at which and above which all
cross-references are to be converted to hyperlinks.  For example a
value of 2 means all n.n, n.n.n etc references are converted.

A value of "1" might seem desirable, but is liable to give many false
references (see 5.3.2).  This behaviour may be improved in later
versions.

6.7 Extra HTML details

AscToHTM has the following HTML policies that will only ever take
effect if supplied in a user policy file :-

   HTML details
   ------------
      Document Title    : ASC2HTML user documentation
      Background Colour : DDDDCC
      Background Image  : (none)
      HTML Script file  : (none)
      HTML header       : a2hdocoh.txt
      HTML footer       : a2hdocof.txt

These "policies" allow you to start "adding value" to the HTML
generated.  That is, they allow you to specify things that cannot be
inferred from the original text.

6.7.1 "Document Title"

This identifies the text to be placed in the ... markup in the document
header.  We did consider defaulting to the first line of text, but that
rarely works.

6.7.2 "Background Colour"

This identifies the colour to be placed in the BGCOLOR attribute of the
tag.  The value is normally expressed as a 6-digit hexadecimal value in
the range 000000 (black) to FFFFFF (white).

6.7.3 "Background Image"

This identifies the URL of any image to be placed in the BACKGROUND
attribute of the tag.

6.7.4 "HTML Script file"

This identifies the name of a text include file to be transcribed into
the ... portion of the generated HTML page.  This allows you to place
JavaScript in your pages (though you'll be a little limited as to what
it can act on).

In the near future HTML will support style sheets, so again, this
provides a hook to refine the appearance of your HTML pages.

6.7.5 "HTML header"

This identifies the name of a text include file to be transcribed into
the HTML file at the top of the ... portion of the generated HTML page.

This can be used to add standard headers, logos and contact addresses
to your HTML pages, and is especially useful to give a consistent "look
and feel" when breaking your document up into a number of smaller HTML
files.

6.7.6 "HTML footer"

This identifies the name of a text include file to be transcribed into
the HTML file at the bottom of the ... portion of the generated HTML
page.

This can be used to add "return to home page" links and contact
addresses to your HTML pages.  Again, this helps to give a consistent
"look and feel" when breaking your document up into a number of smaller
HTML files.

6.8 File organisation policies

AscToHTM has the following HTML policies that will only ever take
effect if supplied in a user policy file :-

   File split details
   ------------------
      Split level        : 1
      Min HTML File size : (none)
      Add contents list  : TRUE
      Add navigation bar : TRUE
      Use DOS filenames  : TRUE
      DOS filename root  : a2h

These policies determine how your document is divided into one or more
HTML files, and how those files are to be named and linked together
with hyperlinks.

6.8.1 "Split level"

This identifies the heading level at which the generated HTML should be
split into smaller files.  A value of "none" will put all the HTML into
one file.  A value of "1" will create a new HTML file for each new
major section.  A value of "2" will create a new HTML file for each new
n.n section, whilst "3" creates a new file for each n.n.n section, and
so on.

The first file created normally has a name that matches the source
file.  Subsequent files append the section number, separated by
underscores.  Thus a file called MYDOC.TXT will generate MYDOC.HTML,
MYDOC_1.HTML, MYDOC_1_1.HTML etc...

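The naming scheme described in 6.8.1 can be sketched in a few lines of
Python.  This is only an illustration of the rule as stated above, not
AscToHTM's own code: the function name split_file_name is hypothetical,
and the sketch assumes long (non-DOS) file names with a .HTML extension.

```python
from pathlib import PurePath


def split_file_name(source_name, section=()):
    """Return the output file name for one section of a split document.

    Hypothetical helper: `section` is a tuple of section numbers,
    e.g. (1, 1) for section 1.1.  An empty tuple means the first file.
    """
    root = PurePath(source_name).stem          # "MYDOC.TXT" -> "MYDOC"
    parts = [root] + [str(n) for n in section]
    return "_".join(parts) + ".HTML"


for sec in [(), (1,), (1, 1)]:
    print(split_file_name("MYDOC.TXT", sec))
```

Run against MYDOC.TXT, the loop prints the three names used as the
example in 6.8.1.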
6.8.2 "Min HTML File size"

This policy is only relevant when splitting the document into small
output files, i.e. when a "split level" is specified (see 6.8.1).

This policy specifies a minimum output HTML size in lines (although
this is only approximate).  This can be useful for documents that have
chapters where all the content is in the sub-sections.  In such
documents you'd end up with a virtually empty chapter heading file if
this policy were not used.

6.8.3 "Add contents list"

This policy specifies that AscToHTM should generate a contents list to
match all the section headings that it marks up.  This contents list
will consist of hyperlinks to take you to the corresponding section and
HTML file.

The placement of the contents list depends on how you have decided to
split up your output HTML (see 6.8.1).

If you decide to convert MYDOC.TXT to a single HTML file MYDOC.HTML,
AscToHTM will create a separate file called CONTENTS_MYDOC.HTML and add
a link to this file at the top of MYDOC.HTML.  You can, if you wish,
simply cut and paste this file into MYDOC.HTML.

If you decide to convert MYDOC.TXT into several files, then the
contents list is placed at the bottom of MYDOC.HTML, and points to all
the newly created files.  Any text before the first section in your
document will be placed before the contents list in MYDOC.HTML.

Whenever you elect to have a contents list generated, any lines
perceived by AscToHTM as being part of a contents list in the original
document will be discarded.

6.8.4 "Add navigation bar"

This policy is only relevant if you have elected to split your document
into a number of smaller HTML files (see 6.8.1).

In such cases this policy allows you to have a navigation bar inserted
at the foot of each HTML page, before any standard footer is added.
The navigation bar consists of

   o  A "Previous" link, to take you to the previous HTML page.

   o  A "Next" link, to take you to the next HTML page.

   o  A "Contents" link, to take you to the start of the next section
      in the contents list.

6.8.5 "Use DOS filenames"

This policy allows you to specify that the HTML file names must be DOS
compatible.  If selected the filenames will all have a ".HTM"
extension, and be given upper case names.  Any file name whose root
exceeds 8 characters will be shortened by keeping the first 3
characters, and adding a unique 5-digit number derived from the longer
name.

See the discussion in 4.2.4.

6.8.6 "DOS filename root"

Where DOS filenames are used this allows you to specify a root of up to
5 characters to which any section numbers will be appended (see 6.8.1).

If splitting a document at 2 levels we normally recommend a 3-character
filename root.  Thus MYDOC.TXT given a root of MYD would produce
MYD.HTM, MYD_1.HTM, MYD_1_1.HTM etc... which are all less than 8
characters and thus maintain some readability.

If no root were specified, MYDOC_1_1.HTM would be renamed to
MYDnnnnn.HTM where "nnnnn" would be a generated 5-digit code.

See the discussion in 4.2.4.

6.9 Link definitions

Link definitions appear as follows :-

   Link Dictionary
   ---------------
      Link definition : "A2HDOCO.TXT" = "Source text" + "/~jaf/A2HDOCO.TXT"

That is, the text to be matched, the text to be used in its place as
the highlighted text, and the URL this link is to point to (in this
case a relative URL).

See the discussion in 4.2.2.

6.10 Analysis policies

AscToHTM has the following policies that can be used to influence its
analysis :-

   Analysis
   --------
      Min chapter size : 8

6.10.1 "Min chapter size"

This policy allows you to specify the minimum chapter size expected in
the document (in numbers of lines).  AscToHTM will ignore any apparent
chapter headings that appear too close together.

----------------------------------------------------------------------------
© 1997 John A. Fotheringham