
Converting Word to Accessible HTML

Because the HTML code that Word generates is not automatically accessible to all users, there are two options for making a Word document accessible on the web: create a new HTML document based on an accessible template, or edit the code Word has generated.

At first, both options may seem daunting, but neither is as difficult as you might think. The first suits a basic Word document without much formatting. For documents containing more complex features and formatting, it may be easier to edit the generated code. This second approach is the one you will take with the sample document.

Editing the HTML code generated by Word

Note: The following explanation covers Word 2000 and Word XP. There are some differences for Word 2007, which are described under "Updating the webpage code generated by Word 2007" later on this page.

The next section will look at the inaccessible aspects of the HTML code generated by Word and how to modify them by hand to make them accessible.

To make changes to the source code for your web page, open the HTML file in a text editor.

Notepad will be used for this example. All Windows computers should have Notepad installed.

To open Notepad:



  1. Go to Start.
  2. Select Programs.
  3. Select Accessories.
  4. Select Notepad.
  5. Go to the File menu.
  6. Choose Open.


     Note: When you browse to the directory where you saved your HTML file, you may not see the file at first. If this is the case, select the drop-down arrow next to "Files of Type:" and choose "All Files" (see Image 9).
  7. Open the HTML file you created. Notepad displays the file as HTML code (see Image 10).

The first line of your file should be <html> which tells the web browser what language this code is written in. The first thing a document must contain in order to be accessible and compliant with standards is a DOCTYPE statement. The DOCTYPE statement tells the browser exactly which version of HTML it is dealing with. Copy the following line into the code, just above the <html> line.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">

This tells the browser that this document should comply with the W3C standards for HTML version 4.01. The "Transitional" clause simply means that there may be some older elements and tags used since this is a transitional period and not all browsers support the "Strict" implementation of this document type. The web address tells the browser where it can find the details of this specification if it does not recognize the DOCTYPE. Make sure this line is the first thing in your html file (see Image 11).



Now the beginning of your document should look like the following:
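As a minimal sketch, assuming Word wrote a bare <html> tag (the tag in your own file may carry extra attributes, which can stay for now), the first two lines would read:

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
```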

Next, you should tell the browser what language the text in this file is written in. For a user who can see, it is easy to know that the page is in English. For someone using a screen reader, it is not so easy. The screen reader needs to know the language in order to know how to read the document. This information goes in the <html> tag.
<html lang="EN-US">

The <head> tag tells the browser that all the information in this section is for the browser's use and should not be displayed.

The next thing to address is the <title>...</title> line.

If you used Word XP to generate the HTML, you should have given your document a title already. If you used Word 2000, you may see the file name or something like "Untitled." Make sure not to erase either of the <title> tags or the brackets around them, but change the title to something meaningful if necessary. When a user views your web page, the title will appear in the browser's blue title bar (at the top of the browser window). The title is also what shows up by default in the Favorites list when a user bookmarks your page. For this example, make the title even more descriptive by changing it to "ITSK 1701 Syllabus."

The next block of code begins with <style> and tells the browser how to display certain elements of the page. Word generally includes extra style information here. The W3C guidelines and other accessibility organizations recommend the use of external styles, kept in a separate file outside the document, so you should delete the styles that Word has embedded in the page.

Select everything between <style> and </style> and delete this material.
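If you later want to restore some visual formatting, the external-style approach could look like the sketch below. The file name syllabus.css is a hypothetical example, not something Word creates; you would write that file yourself and save it alongside the page. This step is optional and is not part of the sample document.

```html
<head>
<title>ITSK 1701 Syllabus</title>
<!-- Hypothetical external style sheet; the file name is an assumption -->
<link rel="stylesheet" type="text/css" href="syllabus.css">
</head>
```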

In addition, you will want to remove the <meta> tags that Word has inserted. These can cause problems in displaying "smart quotes" and other special characters.

Once you have removed the above-mentioned information, you should now have something that looks like this:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html lang="EN-US">
<head>
<title>ITSK 1701 Syllabus</title>
</head>

The next section is the body, denoted by <body>. This tells the browser that this is the main part of the page and the things in this section should be displayed to the user.

First, since you have deleted all of Word's style definitions, use Find and Replace to get rid of the remaining style information that refers to them. Replacing it with a blank space is safe because HTML ignores extra white space in a document.

To delete remaining style information (see Image 12):



  1. Go to the Edit menu.
  2. Choose "Replace."
  3. Search for: class=MsoNormal
  4. Replace with a single blank space.
  5. Select "Replace All."
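As an illustration of what this replacement does, the sketch below shows one paragraph before and after. The paragraph text is a made-up example, not taken from the sample syllabus.

```html
<!-- Before: the class refers to a style definition that no longer exists -->
<p class=MsoNormal>Office hours are by appointment.</p>

<!-- After Replace All: the attribute is gone; the leftover space is ignored -->
<p >Office hours are by appointment.</p>
```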

Word sometimes includes <span> tags in strange places. These can also be deleted. Make sure to remove the opening and closing tags (<span> and </span>).
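A sketch of a stray <span> before and after cleanup follows. The style attribute and the text are assumed examples; the spans in your file may carry different attributes.

```html
<!-- Before: a span wrapped around part of a sentence -->
<p>Assignments are due <span style='font-size:12.0pt'>every Friday</span>.</p>

<!-- After: both the opening and closing span tags are removed -->
<p>Assignments are due every Friday.</p>
```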

The next thing you should change is in the table that displays the instructor's contact information. Find the first <table> tag (see Image 13).



Most users can look at a table and quickly see the relationships represented. A user who cannot see may have a harder time. For this reason, you should modify the table in a couple of ways:

  1. Add the summary attribute to the table so that a screen reader can give the user an overview of what the table contains. In this example, "Instructor contact information" is a sufficient summary. In the <table> tag, between the words "table" and "class", place the following: summary="Instructor contact information"
  2. Identify which cells in the table contain "header" information. HTML has two classifications for cells: the <td> tag represents a "normal" data cell and the <th> tag represents a header cell. In this example, the left column contains the headers, so change those tags. Find the <td> tag for each of the left-hand cells and replace "td" with "th", and replace the matching closing tags (</td>) with </th>. Because browsers generally display header cells as bold and centered, also add the attribute align=left to each <th> tag to keep the text left-aligned.
  3. Remove all of the style information within the table tags, since Word has duplicated these attributes.
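Taken together, the three changes might produce a table like the sketch below. The name, room, and e-mail address are placeholders, not the sample document's actual contents, and Word's border and class attributes are left out for brevity.

```html
<table summary="Instructor contact information">
<tr>
<!-- Left-hand cells are header cells, kept left-aligned -->
<th align=left>Instructor</th>
<td>Jane Doe</td>
</tr>
<tr>
<th align=left>Office</th>
<td>Room 101</td>
</tr>
<tr>
<th align=left>E-mail</th>
<td>jane.doe@example.edu</td>
</tr>
</table>
```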

Image 14 offers a sample of the HTML code with changes highlighted and deletions crossed out. The final document with all changes made should look like Image 15.







Repeat this process for each cell and table in the document.

The table containing the textbook information is an exception. That table is essentially for layout, so it does not need a summary or headers. The W3C recommends that you do not use tables for layout. The best thing to do is remove the layout table and place its content in a non-table format, reading left-to-right and top-to-bottom; this "linearizes" the table.

To create a non-table format, move the content into plain HTML without the table tag or any TH or TD tags around the content. In this example, the display order of the content should be:

  1. First textbook image.
  2. First textbook information (author, ISBN, etc.).
  3. Second textbook image.
  4. Second textbook information.
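The linearized order above could be written as in the following sketch. The image file names, alt text, and book details are hypothetical placeholders; substitute the values from your own document. The alt attribute is included because HTML 4.01 requires it on every <img> tag.

```html
<!-- First textbook: image, then its details -->
<p><img src="textbook1.jpg" alt="Cover of the first textbook"></p>
<p>Title of the first textbook<br>
Author: First Author<br>
ISBN: first ISBN here</p>

<!-- Second textbook: image, then its details -->
<p><img src="textbook2.jpg" alt="Cover of the second textbook"></p>
<p>Title of the second textbook<br>
Author: Second Author<br>
ISBN: second ISBN here</p>
```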

The resultant webpage will look like Image 16.










The HTML code for the bulleted and numbered lists should be the last elements to check for accessibility. When two lists are in close proximity to each other, Word assumes that the second list is a continuation of the first. For this reason, there is an extra ordered list (<ol>) tag that you should delete (see Image 17).

The <ol> tag tells the browser that this will be a numbered "ordered list." When the browser sees that tag, it looks for a list item (<li>) tag to start a line in the list. In this example, there is not a <li>, just an unordered "bulleted" list represented by the <ul> tag. In order to be completely compliant with HTML guidelines, you should correct Word's mistake.

To correct the problem, remove the opening <ol> tag and the closing </ol> tag around the "class expectations" list.
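A sketch of the correction, with made-up list items standing in for the real class expectations:

```html
<!-- Before: a stray ordered list wraps the bulleted list -->
<ol>
<ul>
<li>Attend every class session.</li>
<li>Submit assignments on time.</li>
</ul>
</ol>

<!-- After: only the bulleted list remains -->
<ul>
<li>Attend every class session.</li>
<li>Submit assignments on time.</li>
</ul>
```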

Your resulting HTML code should appear similar to Image 18.

The data table that the Word document contained is not fully accessible.  When data tables are used in an accessible HTML document, they must contain CAPTION, THEAD and TBODY tags and use TH and TD tags appropriately.  Here is the basic template that data tables should follow:
<table summary="SUMMARY HERE">
<caption> CAPTION TO ASSOCIATE WITH THE DATA TABLE </caption>
<thead> HEADER INFORMATION HERE ABOUT TABLE COLUMNS </thead>
<tbody> DATA CONTAINED HERE WITH TR AND TD TAGS </tbody>
</table>

In this example syllabus there are two data tables, one for the percentage of the overall grade to which each assignment contributes and one for grade cutoffs (what's an A, what's a B, etc.).

The HTML that Word generated for these two tables is quite close to accessible HTML, but you should add the summary, caption, and header information for complete compliance.  For this example, an appropriate caption for the first table would be "How much each assignment counts towards your overall grade" and "Letter grades for overall numeric grades" for the second.  Summaries for each would be similar (and are displayed below).  Adding the CAPTION, THEAD, and TBODY tags is easy to do.

As you can see from the following HTML example (from the first of the two tables), these tables are relatively easy to correct and make fully accessible.

The accessible "grade values" table:
<table border=1 cellspacing=0 cellpadding=0 style='border-collapse:collapse; border:none;' summary="Grade values for each assignment">
<caption> How much each assignment counts towards your overall grade </caption>
<thead>
<tr>
<th scope="col">Assignment</th>
<th scope="col">Percentage of grade</th>
</tr>
</thead>
<tbody> THIS IS THE MATERIAL THAT WORD GENERATED - YOU DON'T NEED TO CHANGE IT </tbody>
</table>  

The accessible "grade distribution" table:
<table border=1 cellspacing=0 cellpadding=0 style='border-collapse:collapse; border:none;' summary="Letter grade distribution">
<caption> Letter grades for overall numeric grades </caption>
<thead>
<tr>
<th scope="col">Letter grade earned</th>
<th scope="col">Numeric grade range</th>
</tr>
</thead>
<tbody> THIS IS THE MATERIAL THAT WORD GENERATED - YOU DON'T NEED TO CHANGE IT </tbody>
</table>  
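To show how the placeholder <tbody> might be filled in, here is a sketch of the second table with its rows written out. The letter grades and ranges are hypothetical values for illustration only; keep the rows Word generated from your own syllabus. Word's border attributes are left out to keep the sketch short.

```html
<table summary="Letter grade distribution">
<caption>Letter grades for overall numeric grades</caption>
<thead>
<tr>
<th scope="col">Letter grade earned</th>
<th scope="col">Numeric grade range</th>
</tr>
</thead>
<tbody>
<!-- Hypothetical rows; replace with the rows from your document -->
<tr><td>A</td><td>90-100</td></tr>
<tr><td>B</td><td>80-89</td></tr>
</tbody>
</table>
```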

When you have finished, choose "Save" from the File menu. Open your web page in a browser and you will see that it looks almost identical, but it is now accessible to everyone.

Updating the webpage code generated by Word 2007

If you view the webpage code generated by Word 2007's "Save As Webpage" feature, you will immediately notice that it seems like nonsense. Depending on the length of the content of your webpage, the first roughly 50-75% of the generated code will be "XML" markup, which you (the editor) don't even need to change.

An example "XML" tag: <w:LatentStyles DefLockedState="false" DefUnhideWhenUsed="true" DefSemiHidden="true" DefQFormat="false" DefPriority="99" LatentStyleCount="267">. Similar tags, along with CSS style rules, appear throughout the <head> section.

In order to actually begin editing the content, you must find where the <body> tag begins. In order to find this in the file, you can either search for "body", or the end of the "head" section, which will look like this: </head>.

A noticeable difference between how Word 2007 generates webpage code and how Word 2000/XP does is the location of the "lang" declaration. In Word 2007, this occurs in the body tag, so your body tag may look something like this: <body lang=EN-US style='tab-interval:.5in'>.
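The overall shape of a Word 2007 file can be sketched roughly as follows. This is a simplified outline under assumptions: the attributes Word places on the <html> tag and the long run of generated markup in the head are omitted, and the paragraph is a placeholder.

```html
<html>
<head>
<!-- Long run of Word-generated XML and style rules, which you can skip -->
<title>ITSK 1701 Syllabus</title>
</head>
<!-- Editing starts here; note the lang declaration on the body tag -->
<body lang=EN-US style='tab-interval:.5in'>
<p>Page content goes here.</p>
</body>
</html>
```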



Images in a Word 2007 generated webpage

As you browse through the body section, you will notice that more XML markup precedes every <img> (image) tag. You do not need to edit this markup; skip past it and continue editing the remainder of the code. If you do need to change an image, however, the best method is to edit it in the original Word document and repeat the conversion process. This is tedious, so make sure you are happy with your images before you continue.

Look at the selected text in Image N01 if you are having trouble discerning which text is related to an image.

Other elements of a Word 2007 generated webpage

Tables, lists, and other elements generated by Word 2007's "Save as Webpage" feature are standard HTML, the same as in previous versions. Therefore, you can refer to the section above for information on making these elements accessible.


The next page will discuss publishing Word documents to the course management system, Blackboard.
