This guide gives you a short overview of how to use FileZilla. By default you don’t have to configure FileZilla, so you can start working with the program right away.
Connecting to an FTP server
To connect to an FTP server, enter the address of the server into the ‘Address:’ field of the Quickconnect bar. If a username and password are required, enter them in the corresponding ‘User:’ and ‘Password:’ fields. Click Quickconnect or press Enter to connect to the server. In the setup information provided by Host4Africa.com, the ‘FTP Server:’ value goes into the ‘Address:’ field, the ‘FTP Username:’ value into the ‘User:’ field, and the ‘FTP Password:’ value into the ‘Password:’ field.
Navigating on the server
After a successful connection attempt, a list of files and folders appears on the right side of the main window. The current folder is listed in the edit field at the top. You can change the current folder by double-clicking a folder, or by entering the folder name into the edit field and pressing Enter. You may also right-click the file and folder list and select Open from the context menu to change the current folder. You will notice a folder called “..” displayed in all directories; this folder lets you go up to the parent directory of the current folder. Double-click the httpdocs/ directory. All the files for your website should be uploaded to this directory.
Navigating on your machine
Navigating on your machine works almost like navigating on the server. There’s only one addition: the folders on your machine are arranged in a tree for faster navigation. A tree is also available for the remote side, but it is hidden by default. The remote tree can be shown at any time by clicking the remote tree icon in the toolbar.
To change the current folder either on your machine or on the server, just select a tree item in the appropriate tree.
You can upload or download a file by double-clicking on it. It will be added to the transfer queue and the transfer starts automatically. To transfer folders and/or multiple files, select them and right-click the selection. Then you can click on Upload/Download in the popup menu.
You can also drag the files from one side and drop them on the other side. To add files to the queue so that they will be transferred later, select them and click Add to Queue from the popup menu. You may also drag the files directly into the queue. Click on the button on the toolbar to start the transfer.
Bandwidth is the total amount of data that can be sent in a given time between two computer devices. The more bandwidth that is available, the faster the access to the server will be.
Any webmaster expecting a decent amount of traffic to their site will require a web hosting package that includes a large amount of bandwidth. This becomes especially important as your online business, and ultimately your sites, grow. If you don’t have enough bandwidth, visitors will not be able to access your site as quickly; that may put them off and, in turn, drive them away from your site.
To calculate your bandwidth needs, you must know how large each page on your site is, including the graphics and any script usage you may have. Then, you multiply that by the number of views you expect the site to get every month.
For example, say you have three 5k images on your page and a 2k HTML file – you would have 17k of data on that page. Multiply that by your expected page views (let’s say in this case it is 1,000 per month), and you get 17 MB of data to be transferred that month for that page alone. Now calculate this for each page, and you will know approximately how much bandwidth your entire site requires.
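The arithmetic above can be sketched as a small script. The page sizes and view count are the example figures from the text, not measurements from any real site:

```python
# Back-of-the-envelope bandwidth estimate for a single page.
def monthly_bandwidth_kb(page_size_kb, views_per_month):
    """Approximate monthly transfer for one page, in KB."""
    return page_size_kb * views_per_month

page_kb = 3 * 5 + 2                     # three 5k images + one 2k HTML file = 17k
total_kb = monthly_bandwidth_kb(page_kb, 1000)
print(total_kb)                         # 17000 KB, i.e. roughly 17 MB for that page
```

Run the same calculation for every page, weighted by each page’s expected views, and sum the results to estimate your site’s total monthly bandwidth.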
The best way to use bandwidth efficiently is to keep your HTML pages small. In other words, make sure your HTML uses the minimum amount of code needed for its purpose, and that your photos and graphics are small. To reduce photo file sizes you should always use the JPEG format, which can compress a file to as little as 5% of its original size. Use the GIF format for graphics, as opposed to TIFFs or BMPs, which are generally much larger in file size.
If you point an IMG SRC at another person’s images without their permission, this is known throughout the online industry as hotlinking; it costs the person whose image you are linking to money, not you. For this reason, hotlinking is looked upon dimly.
The unfortunate truth is that hotlinking is rife in the online industry. But let’s not let that deter you: by using a file called .htaccess on your server, you can prevent dishonest webmasters from hotlinking your images.
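As a sketch of how that works, the .htaccess rules below block image requests whose Referer is another site. This assumes mod_rewrite is enabled on your server, and example.com stands in for your own domain:

```apache
RewriteEngine On
# Allow requests with an empty Referer (direct requests; some proxies strip the header).
RewriteCond %{HTTP_REFERER} !^$
# Allow requests coming from your own pages.
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]
# Anything else asking for an image gets 403 Forbidden.
RewriteRule \.(gif|jpe?g|png)$ - [F]
```

Check with your host before relying on this; not all servers permit mod_rewrite directives in .htaccess files.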
A Web cache sits between Web servers (or origin servers) and a client or many clients, and watches requests for HTML pages, images and files (collectively known as objects) come by, saving a copy for itself. Then, if there is another request for the same object, it will use the copy that it has, instead of asking the origin server for it again.
There are two main reasons that Web caches are used:
To reduce latency – Because the request is satisfied from the cache (which is closer to the client) instead of the origin server, it takes less time for the client to get the object and display it. This makes Web sites seem more responsive.
To reduce traffic – Because each object is only gotten from the server once, it reduces the amount of bandwidth used by a client. This saves money if the client is paying by traffic, and keeps their bandwidth requirements lower and more manageable.
Kinds of Web Caches
Browser Caches
If you examine the preferences dialog of any modern browser (like Internet Explorer or Netscape), you’ll probably notice a ‘cache’ setting. This lets you set aside a section of your computer’s hard disk to store objects that you’ve seen, just for you. The browser cache works according to fairly simple rules. It will check to make sure that the objects are fresh, usually once a session (that is, once in the current invocation of the browser).
This cache is useful when a client hits the ‘back’ button to go to a page they’ve already seen. Also, if you use the same navigation images throughout your site, they’ll be served from the browser cache almost instantaneously.
Proxy Caches
Web proxy caches work on the same principle, but on a much larger scale. Proxies serve hundreds or thousands of users in the same way; large corporations and ISPs often set them up on their firewalls.
Because proxy caches usually have a large number of users behind them, they are very good at reducing latency and traffic. That’s because popular objects are requested only once, and served to a large number of clients.
Most proxy caches are deployed by large companies or ISPs that want to reduce the amount of Internet bandwidth that they use. Because the cache is shared by a large number of users, there are a large number of shared hits (objects that are requested by a number of clients). Hit rates of 50% efficiency or greater are not uncommon. Proxy caches are a type of shared cache.
Aren’t Web Caches bad for me? Why should I help them?
Web caching is one of the most misunderstood technologies on the Internet. Webmasters in particular fear losing control of their site, because a cache can ‘hide’ their users from them, making it difficult to see who’s using the site.
Unfortunately for them, even if no Web caches were used, there are too many variables on the Internet to assure that they’ll be able to get an accurate picture of how users see their site. If this is a big concern for you, this document will teach you how to get the statistics you need without making your site cache-unfriendly.
Another concern is that caches can serve content that is out of date, or stale. However, this document can show you how to configure your server to control this, while making it more cacheable.
On the other hand, if you plan your site well, caches can help your Web site load faster, and save load on your server and Internet link. The difference can be dramatic; a site that is difficult to cache may take several seconds to load, while one that takes advantage of caching can seem instantaneous in comparison. Users will appreciate a fast-loading site, and will visit more often.
Think of it this way; many large Internet companies are spending millions of dollars setting up farms of servers around the world to replicate their content, in order to make it as fast to access as possible for their users. Caches do the same for you, and they’re even closer to the end user. Best of all, you don’t have to pay for them.
The fact is that caches will be used whether you like it or not. If you don’t configure your site to be cached correctly, it will be cached using whatever defaults the cache’s administrator decides upon.
How Web Caches Work
All caches have a set of rules that they use to determine when to serve an object from the cache, if it’s available. Some of these rules are set in the protocols (HTTP 1.0 and 1.1), and some are set by the administrator of the cache (either the user of the browser cache or the proxy administrator).
Generally speaking, these are the most common rules that are followed for a particular request (don’t worry if you don’t understand the details, it will be explained below):
If the object’s headers tell the cache not to keep the object, it won’t. Also, if no validator is present, most caches will mark the object as uncacheable.
If the object is authenticated or secure, it won’t be cached.
A cached object is considered fresh (that is, able to be sent to a client without checking with the origin server) if:
It has an expiry time or other age-controlling directive set, and is still within the fresh period.
A browser cache has already seen the object, and has been set to check once a session.
A proxy cache has seen the object recently, and it was modified relatively long ago.
Fresh documents are served directly from the cache, without checking with the origin server.
If an object is stale, the origin server will be asked to validate the object, or tell the cache whether the copy that it has is still good.
Together, freshness and validation are the most important ways that a cache works with content. A fresh object will be available instantly from the cache, while a validated object will avoid sending the entire object over again if it hasn’t changed.
There are several tools that Web designers and Webmasters can use to fine-tune how caches will treat their sites. It may require getting your hands a little dirty with the server configuration, but the results are worth it. For details on how to use these tools with your server, see the Implementation sections below.
HTML Meta Tags vs. HTTP Headers
HTML authors can put tags in a document’s HEAD section that describe its attributes. These Meta tags are often used in the belief that they can mark a document as uncacheable, or expire it at a certain time.
Meta tags are easy to use, but aren’t very effective. That’s because they’re usually only honored by browser caches (which actually read the HTML), not proxy caches (which almost never read the HTML in the document). While it may be tempting to slap a Pragma: no-cache meta tag on a home page, it won’t necessarily cause it to be kept fresh, if it goes through a shared cache.
On the other hand, true HTTP headers give you a lot of control over how both browser caches and proxies handle your objects. They can’t be seen in the HTML, and are usually automatically generated by the Web server. However, you can control them to some degree, depending on the server you use. In the following sections, you’ll see what HTTP headers are interesting, and how to apply them to your site.
If your site is hosted at an ISP or hosting farm and they don’t give you the ability to set arbitrary HTTP headers (like Expires and Cache-Control), complain loudly; these are tools necessary for doing your job.
HTTP headers are sent by the server before the HTML, and only seen by the browser and any intermediate caches. Typical HTTP 1.1 response headers might look like this:
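For instance (the server name, dates, and sizes here are made up for illustration):

```http
HTTP/1.1 200 OK
Date: Fri, 30 Oct 1998 13:19:41 GMT
Server: Apache/1.3.3 (Unix)
Cache-Control: max-age=3600, must-revalidate
Expires: Fri, 30 Oct 1998 14:19:41 GMT
Last-Modified: Mon, 29 Jun 1998 02:28:12 GMT
ETag: "3e86-410-3596fbbc"
Content-Length: 1040
Content-Type: text/html
```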
The HTML document would follow these headers, separated by a blank line.
Pragma HTTP Headers (and why they don’t work)
Many people believe that assigning a Pragma: no-cache HTTP header to an object will make it uncacheable. This is not necessarily true; the HTTP specification does not set any guidelines for Pragma response headers; instead, Pragma request headers (the headers that a browser sends to a server) are discussed. Although a few caches may honor this header, the majority won’t, and it won’t have any effect. Use the headers below instead.
Controlling Freshness with the Expires HTTP Header
The Expires HTTP header is the basic means of controlling caches; it tells all caches how long the object is fresh for; after that time, caches will always check back with the origin server to see if a document is changed. Expires headers are supported by practically every client.
Most Web servers allow you to set Expires response headers in a number of ways. Commonly, they will allow setting an absolute time to expire, a time based on the last time that the client saw the object (last access time), or a time based on the last time the document changed on your server (last modification time).
Expires headers are especially good for making static images (like navigation bars and buttons) cacheable. Because they don’t change much, you can set an extremely long expiry time on them, making your site appear much more responsive to your users. They’re also useful for controlling caching of a page that is regularly changed. For instance, if you update a news page once a day at 6am, you can set the object to expire at that time, so caches will know when to get a fresh copy, without users having to hit ‘reload’.
The only value valid in an Expires header is an HTTP date; anything else will most likely be interpreted as ‘in the past’, so that the object is uncacheable. Also, remember that the time in an HTTP date is Greenwich Mean Time (GMT), not local time.
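On Apache, for example, mod_expires can generate these headers for you. The following is a sketch, assuming the module is enabled; the media types and times are examples, not recommendations:

```apache
ExpiresActive On
# Static images: fresh for a month after they are first requested.
ExpiresByType image/gif  "access plus 1 month"
ExpiresByType image/jpeg "access plus 1 month"
# A page updated daily: expire 23 hours after it was last changed.
ExpiresByType text/html  "modification plus 23 hours"
```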
Cache-Control HTTP Headers
Although the Expires header is useful, it is still somewhat limited; there are many situations where content is cacheable, but the HTTP 1.0 protocol lacks methods of telling caches what it is, or how to work with it.
HTTP 1.1 introduces a new class of headers, the Cache-Control response headers, which allow Web publishers to define how pages should be handled by caches. They include directives to declare what should be cacheable, what may be stored by caches, modifications of the expiration mechanism, and revalidation and reload controls.
max-age=[seconds] – specifies the maximum amount of time that an object will be considered fresh. Similar to Expires, but this directive allows more flexibility. [seconds] is the number of seconds from the time of the request that you wish the object to be fresh for.
s-maxage=[seconds] – similar to max-age, except that it only applies to proxy (shared) caches.
public – marks the response as cacheable, even if it would normally be uncacheable. For instance, if your pages are authenticated, the public directive makes them cacheable.
no-cache – forces caches (both proxy and browser) to submit the request to the origin server for validation before releasing a cached copy, every time. This is useful to assure that authentication is respected (in combination with public), or to maintain rigid object freshness, without sacrificing all of the benefits of caching.
must-revalidate – tells caches that they must obey any freshness information you give them about an object. HTTP allows caches to take liberties with the freshness of objects; by specifying this header, you’re telling the cache that you want it to strictly follow your rules.
proxy-revalidate – similar to must-revalidate, except that it only applies to proxy caches.
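Putting the directives together, a response that anyone may cache for an hour, but must strictly revalidate after that, could carry a header like this:

```http
Cache-Control: max-age=3600, must-revalidate
```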
If you plan to use the Cache-Control headers, you should have a look at the excellent documentation in the HTTP 1.1 draft; see References and Further Information.
Validators and Validation
In How Web Caches Work, we said that validation is used by servers and caches to communicate when an object has changed. By using it, caches avoid having to download the entire object when they already have a copy locally, but they’re not sure if it’s still fresh.
Validators are very important; if one isn’t present, and there isn’t any freshness information (Expires or Cache-Control) available, most caches will not store an object at all.
The most common validator is the time that the document last changed, the Last-Modified time. When a cache has an object stored that includes a Last-Modified header, it can use it to ask the server if the object has changed since the last time it was seen, with an If-Modified-Since request.
HTTP 1.1 introduced a new kind of validator called the ETag. ETags are unique identifiers that are generated by the server and changed every time the object changes. Because the server controls how the ETag is generated, caches can be surer that if the ETag matches when they make an If-None-Match request, the object really is the same.
Almost all caches use Last-Modified times in determining if an object is fresh; as more HTTP/1.1 caches come online, Etag headers will also be used.
Most modern Web servers will generate both ETag and Last-Modified validators for static content automatically; you won’t have to do anything. However, they don’t know enough about dynamic content (like CGI, ASP or database sites) to generate them; see Writing Cache-Aware Scripts.
Tips for Building a Cache-Aware Site
Besides using freshness information and validation, there are a number of other things you can do to make your site more cache-friendly.
Refer to objects consistently – this is the golden rule of caching. If you serve the same content on different pages, to different users, or from different sites, it should use the same URL. This is the easiest and most effective way to make your site cache-friendly. For example, if you refer to /index.html in your HTML once, always refer to it that way.
Use a common library of images and other elements and refer back to them from different places.
Make caches store images and pages that don’t change often by specifying a far-future Expires header.
Make caches recognize regularly updated pages by specifying an appropriate expiration time.
If a resource (especially a downloadable file) changes, change its name. That way, you can make it expire far in the future, and still guarantee that the correct version is served; the page that links to it is the only one that will need a short expiry time.
Don’t change files unnecessarily. If you do, everything will have a falsely young Last-Modified date. For instance, when updating your site, don’t copy over the entire site; just move the files that you’ve changed.
Minimize use of SSL – because encrypted pages are not stored by shared caches, use them only when you have to, and use images on SSL pages sparingly.
Use the Cacheability Engine – it can help you apply many of the concepts in this tutorial.
Writing Cache-Aware Scripts
By default, most scripts won’t return a validator (e.g., a Last-Modified or Etag HTTP header) or freshness information (Expires or Cache-Control). While some scripts really are dynamic (meaning that they return a different response for every request), many (like search engines and database-driven sites) can benefit from being cache-friendly.
Generally speaking, if a script produces output that is reproducible with the same request at a later time (whether it be minutes or days later), it should be cacheable. If the content of the script changes only depending on what’s in the URL, it is cacheable; if the output depends on a cookie, authentication information or other external criteria, it probably isn’t.
The best way to make a script cache-friendly (as well as perform better) is to dump its content to a plain file whenever it changes. The Web server can then treat it like any other Web page, generating and using validators, which makes your life easier. Remember to only write files that have changed, so the Last-Modified times are preserved.
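A minimal sketch of that dump-to-file approach, with a hypothetical helper name; the point is the comparison step, which leaves the old modification time (and hence Last-Modified) untouched when nothing changed:

```python
import os

def write_if_changed(path, content):
    """Write content to path only if it differs from what is already there."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            if f.read() == content:
                return False  # unchanged: keep old mtime, so Last-Modified is preserved
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)
    return True
```

Call this from whatever regenerates your pages; the web server then serves the files as ordinary static content, complete with validators.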
Another way to make a script cacheable in a limited fashion is to set an age-related header for as far in the future as practical. Although this can be done with Expires, it’s probably easiest to do so with Cache-Control: max-age, which will make the request fresh for an amount of time after the request.
If you can’t do that, you’ll need to make the script generate a validator, and then respond to If-Modified-Since and/or If-None-Match requests. This can be done by parsing the HTTP headers, and then responding with 304 Not Modified when appropriate. Unfortunately, this is not a trivial task.
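The shape of that validation logic looks roughly like this. The function name and return format are illustrative only; a real script would emit the status line and headers through its CGI or framework interface:

```python
import os
import email.utils

def respond(path, if_modified_since=None):
    """Serve a file, honoring an If-Modified-Since request header."""
    mtime = int(os.path.getmtime(path))   # HTTP dates have one-second resolution
    last_mod = email.utils.formatdate(mtime, usegmt=True)
    if if_modified_since:
        since = email.utils.parsedate_to_datetime(if_modified_since).timestamp()
        if mtime <= since:
            # The cache's copy is still good: headers only, no body.
            return "304 Not Modified", {"Last-Modified": last_mod}, b""
    with open(path, "rb") as f:
        body = f.read()
    return "200 OK", {"Last-Modified": last_mod,
                      "Content-Length": str(len(body))}, body
```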
Some other tips:
If you have to use scripting, don’t POST unless it’s appropriate. The POST method is (practically) impossible to cache; if you send information in the path or query (via GET), caches can store that information for the future. POST, on the other hand, is good for sending large amounts of information to the server (which is why it won’t be cached; it’s very unlikely that the exact same POST will be made twice).
Don’t embed user-specific information in the URL unless the content generated is completely unique to that user.
Don’t count on all requests from a user coming from the same host, because caches often work together.
Generate Content-Length response headers. It’s easy to do, and it will allow the response of your script to be used in a persistent connection. This allows a client (whether a proxy or a browser) to request multiple objects on one TCP/IP connection, instead of setting up a connection for every request. It makes your site seem much faster.
See the Implementation Notes for more specific information.
Frequently Asked Questions
What are the most important things to make cacheable?
A good strategy is to identify the most popular, largest objects (especially images) and work with them first.
How can I make my pages as fast as possible with caches?
The most cacheable object is one with a long freshness time set. Validation does help reduce the time that it takes to see an object, but the cache still has to contact the origin server to see if it’s fresh. If the cache already knows it’s fresh, it will be served directly.
I understand that caching is good, but I need to keep statistics on how many people visit my page!
If you must know every time a page is accessed, select ONE small object on a page (or the page itself), and make it uncacheable by giving it suitable headers. For example, you could refer to a 1×1 transparent uncacheable image from each page. The Referer header will contain information about what page called it.
Be aware that even this will not give truly accurate statistics about your users, and is unfriendly to the Internet and your users; it generates unnecessary traffic, and forces people to wait for that uncached item to be downloaded. For more information about this, see On Interpreting Access Statistics in the references.
I’ve got a page that is updated often. How do I keep caches from giving my users a stale copy?
The Expires header is the best way to do this. By setting the server to expire the document based on its modification time, you can automatically have caches mark it as stale a set amount of time after it is changed.
For example, if your site’s home page changes every day at 8am, set the Expires header for 23 hours after the last modification time. This way, your users will always get a fresh copy of the page.
See also the Cache-Control: max-age header.
How can I see which HTTP headers are set for an object?
To see what the Expires and Last-Modified headers are, open the page with Netscape and select ‘page info’ from the View menu. This will give you a menu of the page and any objects (like images) associated with it, along with their details.
To see the full headers of an object, you’ll need to manually connect to the Web server using a Telnet client. Depending on what program you use, you may need to type the port into a separate field, or you may need to connect to www.myhost.com:80 or www.myhost.com 80 (note the space). Consult your Telnet client’s documentation.
Once you’ve opened a connection to the site, type a request for the object. For instance, if you want to see the headers for http://www.myhost.com/foo.html, connect to www.myhost.com, port 80, and type:
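A minimal request for that example looks like this (the host name is the placeholder from the text):

```http
GET /foo.html HTTP/1.1 [return]
Host: www.myhost.com [return]
[return]
```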
Press the Return key every time you see [return]; make sure to press it twice at the end. This will print the headers, and then the full object. To see the headers only, substitute HEAD for GET.
My pages are password-protected; how do proxy caches deal with them?
By default, pages protected with HTTP authentication are marked private; they will not be cached by shared caches. However, you can mark authenticated pages public with a Cache-Control header; HTTP 1.1-compliant caches will then allow them to be cached.
If you’d like the pages to be cacheable, but still authenticated for every user, combine the Cache-Control: public and no-cache headers. This tells the cache that it must submit the new client’s authentication information to the origin server before releasing the object from the cache.
Whether or not this is done, it’s best to minimize use of authentication; for instance, if your images are not sensitive, put them in a separate directory and configure your server not to force authentication for it. That way, those images will be naturally cacheable.
Should I worry about security if my users access my site through a cache?
SSL pages are not cached (or unencrypted) by proxy caches, so you don’t have to worry about that. However, because caches store non-SSL requests and URLs fetched through them, you should be conscious of security on unsecured sites; an unscrupulous administrator could conceivably gather information about their users.
In fact, any administrator on the network between your server and your clients could gather this type of information. One particular problem is when CGI scripts put usernames and passwords in the URL itself; this makes it trivial for others to find and use their login.
If you’re aware of the issues surrounding Web security in general, you shouldn’t have any surprises from proxy caches.
I’m looking for an integrated Web publishing solution. Which ones are cache-aware?
It varies. Generally speaking, the more complex a solution is, the more difficult it is to cache. The worst are ones which dynamically generate all content and don’t provide validators; they may not be cacheable at all. Speak with your vendor’s technical staff for more information, and see the Implementation notes below.
My images expire a month from now, but I need to change them in the caches now!
The Expires header can’t be circumvented; unless the cache (either browser or proxy) runs out of room and has to delete the objects, the cached copy will be used until then.
The most effective solution is to rename the files; that way, they will be completely new objects, and loaded fresh from the origin server. Remember that the page that refers to an object will be cached as well. Because of this, it’s best to make static images and similar objects very cacheable, while keeping the HTML pages that refer to them on a tight leash.
If you want to reload an object from a specific cache, you can either force a reload while using that cache (in Netscape, holding down Shift while pressing ‘reload’ will do this by issuing a Pragma: no-cache request header), or have the cache administrator delete the object through their interface.
I run a Web Hosting service. How can I let my users publish cache-friendly pages?
If you’re using Apache, consider allowing them to use .htaccess files, and provide appropriate documentation.
Otherwise, you can establish predetermined areas for various caching attributes in each virtual server. For instance, you could specify a directory /cache-1m that will be cached for one month after access, and a /no-cache area that will be served with headers instructing caches not to store objects from it.
Whatever you are able to do, it is best to work with your largest customers first on caching. Most of the savings (in bandwidth and in load on your servers) will be realized from high-volume sites.
.htaccess Explained
Comprehensive guide to .htaccess
Tutorial written and contributed by Feyd, moderator of the JK Forum. Please see tutorial footnote for additional/bio info on author.
I am sure that most of you have heard of htaccess, if just vaguely, and that you may think you have a fair idea of what can be done with an htaccess file. You are more than likely mistaken about that, however. Regardless, even if you have never heard of htaccess and what it can do for you, the intention of this tutorial is to get you two moving along nicely together.
If you have heard of htaccess, chances are that it has been in relation to implementing custom error pages or password protected directories. But there is much more available to you through the marvelously simple .htaccess file.
A Few General Ideas
An htaccess file is a simple ASCII file, such as you would create through a text editor like NotePad or SimpleText. Many people seem to have some confusion over the naming convention for the file, so let me get that out of the way.
.htaccess is the entire file name, not an extension on some other name. It is not file.htaccess or somepage.htaccess; it is simply named .htaccess
In order to create the file, open up a text editor and save an empty page as .htaccess (or type in one character, as some editors will not let you save an empty page). Chances are that your editor will append its default file extension to the name (ex: for Notepad it would call the file .htaccess.txt). You need to remove the .txt (or other) file extension in order to get yourself htaccessing–yes, I know that isn’t a word, but it sounds keen, don’t it? You can do this by right clicking on the file and renaming it by removing anything that doesn’t say .htaccess. You can also rename it via telnet or your ftp program, and you should be familiar enough with one of those so as not to need explaining.
htaccess files must be uploaded in ASCII mode, not BINARY. You may need to CHMOD the htaccess file to 644 (rw-r--r--). This makes the file usable by the server; the server itself is normally configured to refuse to serve files whose names begin with .ht, because letting a browser read the htaccess file can seriously compromise your security. (For example, if you have password protected directories and a browser can read the htaccess file, it can learn the location of the authentication file and then reverse engineer the list to get full access to anything you previously had protected. There are different ways to prevent this: one is to place all your authentication files above the root directory so that they are not www-accessible; another is an htaccess series of commands that prevents the file itself from being accessed by a browser, more on that later.)
Most commands in htaccess are meant to be placed on one line only, so if you use a text editor that uses word-wrap, make sure it is disabled or it might throw in a few characters that annoy Apache to no end, although Apache is typically very forgiving of malformed content in an htaccess file.
htaccess is an Apache thing, not an NT thing. There are similar capabilities for NT servers, though in my professional experience and personal opinion, NT’s ability in these areas is severely handicapped. But that’s not what we’re here for.
htaccess files affect the directory they are placed in and all sub-directories; that is, an htaccess file located in your root directory (yoursite.com) would affect yoursite.com/content, yoursite.com/content/contents, and so on. It is important to note that this can be prevented (if, for example, you did not want certain htaccess commands to affect a specific directory) by placing a new htaccess file within the directory you don’t want affected, and removing from it the specific command(s) that you do not want applied to that directory. In short, the nearest htaccess file to the current directory is treated as the htaccess file. If the nearest htaccess file is your global htaccess located in your root, then it affects every single directory in your entire site.
Before you go off and plant htaccess everywhere, read through this and make sure you don’t do anything redundant, since it is possible to cause an infinite loop of redirects or errors if you place something weird in the htaccess.
Also… some sites do not allow use of htaccess files, since depending on what they are doing, they can slow down a server overloaded with domains if they are all using htaccess files. I can’t stress this enough: You need to make sure you are allowed to use htaccess before you actually use it. Some things that htaccess can do can compromise a server configuration that has been specifically setup by the admin, so don’t get in trouble.
Client Request Errors:
400 – Bad Request
401 – Authorization Required
402 – Payment Required (not used yet)
403 – Forbidden
404 – Not Found
405 – Method Not Allowed
406 – Not Acceptable (encoding)
407 – Proxy Authentication Required
408 – Request Timed Out
411 – Content Length Required
413 – Request Entity Too Large
414 – Request URI Too Long
415 – Unsupported Media Type

Server Errors:
500 – Internal Server Error
505 – HTTP Version Not Supported
In order to specify your own ErrorDocuments, you need to be slightly familiar with the server-returned error codes (see the list above). You do not need to specify error pages for all of these; in fact, you shouldn't. An ErrorDocument for code 200 would cause an infinite loop whenever a page was found…this would not be good.
You will probably want to create an error document for codes 404 and 500, at the least 404 since this would give you a chance to handle requests for pages not found. 500 would help you out with internal server errors in any scripts you have running. You may also want to consider ErrorDocuments for 401 – Authorization Required (as in when somebody tries to enter a protected area of your site without the proper credentials), 403 – Forbidden (as in when a file with permissions not allowing it to be accessed by the user is requested) and 400 – Bad Request, which is one of those generic kind of errors that people get to by doing some weird stuff with your URL or scripts.
In order to specify your own customized error documents, you simply need to add the following command, on one line, within your htaccess file:
ErrorDocument code /directory/filename.ext
ErrorDocument 404 /errors/notfound.html
This would cause any request resulting in a 404 error code to be forwarded to yoursite.com/errors/notfound.html
ErrorDocument 500 /errors/internalerror.html
You can name the pages anything you want (I’d recommend something that would prevent you from forgetting what the page is being used for), and you can place the error pages anywhere you want within your site, so long as they are web-accessible (through a URL). The initial slash in the directory location represents the root directory of your site, that being where your default page for your first-level domain is located. I typically prefer to keep them in a separate directory for maintenance purposes and in order to better control spiders indexing them through a ROBOTS.TXT file, but it is entirely up to you.
If you were to use an error document handler for each of the error codes I mentioned, the htaccess file would look like the following (note each command is on its own line):
ErrorDocument 400 /errors/badrequest.html
ErrorDocument 401 /errors/authreqd.html
ErrorDocument 403 /errors/forbid.html
ErrorDocument 404 /errors/notfound.html
ErrorDocument 500 /errors/serverr.html
You can specify a full URL rather than a virtual URL in the ErrorDocument string (http://yoursite.com/errors/notfound.html vs. /errors/notfound.html). But this is not the preferred method by the server’s happiness standards.
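To make the difference concrete, here are both forms side by side. The reason the full URL is discouraged is that Apache issues an external redirect for it, so the visitor's browser receives a redirect status instead of the original error code:

```apache
# Discouraged: a full URL makes the server issue an external redirect,
# so the client no longer receives the original 404 status code.
ErrorDocument 404 http://yoursite.com/errors/notfound.html

# Preferred: a local path is served internally with the correct status.
ErrorDocument 404 /errors/notfound.html
```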
You can also specify HTML, believe it or not!
ErrorDocument 401 "<body bgcolor=#ffffff><h1>You have to actually <b>BE</b> a <a href="#">member</A> to view this page, Colonel!
The only time I use that HTML option is if I am feeling particularly saucy, since you can have so much more control over the error pages when used in conjunction with xSSI or CGI or both. Also note that the ErrorDocument starts with a " just before the HTML starts, but does not end with one…it shouldn't end with one, and if you do use that option, keep it that way. And again, that should all be on one line, no naughty word wrapping!
Now you should have yourself some brand-spanking new error documents…go off and destroy your site to see some of those beautiful ErrorDocuments get pulled up.
(Note: that last part is optional)
Next, we are moving on to password protection, that last frontier before I dunk you into the true capabilities of htaccess. If you are familiar with setting up your own password protected directories via htaccess, you may feel like skipping ahead.
Ever wanted a specific directory in your site to be available only to people who you want it to be available to? Ever got frustrated with the seeming holes in client-side options for this that allowed virtually anyone with enough skill to mess around in your source to get in? htaccess is the answer!
The first thing you will need to do is create a file called .htpasswd. I know, you might have problems with the naming convention, but it is the same idea behind naming the htaccess file itself, and you should be able to do that by this point. In the htpasswd file, you place the username and password (which is encrypted) for those whom you want to have access.
For example, for a username and password of wsabstract (and I do not recommend making the username the same as the password), the htpasswd file would look like this:
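A sketch of what such a file contains, one user per line; the hash shown here is illustrative only, since yours will be whatever the encryption tool produces:

```
wsabstract:y1jSaIzGRbWC2
```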
Notice that it is UserName first, followed by the Password. There is a handy-dandy tool available for you to easily encrypt the password into the proper encoding for use in the htpasswd file.
For security, you should not upload the htpasswd file to a directory that is web accessible (yoursite.com/.htpasswd), it should be placed above your www root directory. You’ll be specifying the location to it later on, so be sure you know where you put it. Also, this file, as with htaccess, should be uploaded as ASCII and not BINARY.
Create a new htaccess file and place the following code in it:
AuthUserFile /usr/local/you/safedir/.htpasswd
AuthGroupFile /dev/null
AuthName EnterPassword
AuthType Basic
require user wsabstract
The first line is the full server path to your htpasswd file. If you have installed scripts on your server, you should be familiar with this. Please note that this is not a URL, this is a server path. Also note that if you place this htaccess file in your root directory, it will password protect your entire site, which probably isn’t your exact goal.
The second to last line require user is where you enter the username of those who you want to have access to that portion of your site. Note that using this will allow only that specific user to be able to access that directory. This applies if you had an htpasswd file that had multiple users setup in it and you wanted each one to have access to an individual directory. If you wanted the entire list of users to have access to that directory, you would replace Require user xxx with require valid-user.
The AuthName is the name of the protected area, which is shown to the user in the login prompt. It could be anything, such as "EnterPassword". You can change the name of this 'realm' to whatever you want, within reason.
We are using AuthType Basic because we are using basic HTTP authentication.
Enabling SSI Via htaccess
Many people want to use SSI, but don’t seem to have the ability to do so with their current web host. You can change that with htaccess. A note of caution first…definitely ask permission from your host before you do this, it can be considered ‘hacking’ or violation of your host’s TOS, so be safe rather than sorry:
AddType text/html .shtml
AddHandler server-parsed .shtml
Options Indexes FollowSymLinks Includes
The first line tells the server that pages with a .shtml extension (for Server parsed HTML) are valid. The second line adds a handler, the actual SSI bit, for all files with the .shtml extension; it tells the server that any .shtml file should be parsed for server side commands. The last line sets the Options for the directory, including the Includes option that SSI needs.
And that’s it, you should have SSI enabled. But wait…don’t feel like renaming all of your pages to .shtml in order to take advantage of this neat little toy? Me either! Just add this line to the fragment above, between the first and second lines:
AddHandler server-parsed .html
A note of caution on that one too, however. This will force the server to parse every page named .html for SSI commands, even if they contain none. If you are using SSI sparingly on your site, this is going to give you more server drain than you can justify. SSI does slow down a server because it does extra work before serving up a page, although in human terms of speed it is virtually transparent. Some people also prefer to allow SSI in .html pages so that the page extension doesn't reveal that they are using SSI, in order to prevent the server being compromised through SSI hacks, which is possible. Either way, you now have the knowledge to choose.
If, however, you are going to keep SSI pages with the extension of .shtml, and you want to use SSI on your Index pages, you need to add the following line to your htaccess:
DirectoryIndex index.shtml index.html
This allows a page named index.shtml to be your default page, and if that is not found, index.html is loaded. More on DirectoryIndex later.
Deny users by IP
Is there a pesky person perpetrating pain upon you? Stalking your site from the vastness of the electron void? Blockem!
In your htaccess file, add the following code (changing the IPs to suit your needs), each command on its own line:
order allow,deny
deny from 012.34.5.6
deny from 012.34.5.
allow from all
You can deny access based upon IP address or an IP block. The above blocks access to the site from 012.34.5.6, and from any machine in the IP block 012.34.5. (012.34.5.1, 012.34.5.2, 012.34.5.3, etc.). I have yet to find a truly useful application of this; maybe if there is a site scraping your content you can block them, who knows.
Change your default directory page
Some of you may be wondering, just what in the world is a DirectoryIndex? Well, grasshopper, this is a command which allows you to specify a file that is to be loaded as your default page whenever a directory or url request comes in, that does not specify a specific page. Tired of having yoursite.com/index.html come up when you go to yoursite.com? Want to change it to be yoursite.com/ILikePizzaSteve.html that comes up instead? No problem!
DirectoryIndex filename.html
This would cause filename.html to be treated as your default page, or default directory page. You can also append other filenames to it. You may want to have certain directories use a script as a default page. That's no problem too:
DirectoryIndex filename.html index.cgi index.pl default.htm
Placing the above command in your htaccess file will cause this to happen: When a user types in yoursite.com, your site will look for filename.html in your root directory (or any directory if you specify this in the global htaccess), and if it finds it, it will load that page as the default page. If it does not find filename.html, it will then look for index.cgi; if it finds that one, it will load it, if not, it will look for index.pl and the whole process repeats until it finds a file it can use. Basically, the list of files is read from left to right.
Every once in a while, I use this method for the following needs: say I keep all my include files in a directory called include, and all my image files in a directory called images, and I don't want people to be able to browse through those directories (even though we can prevent that through another htaccess trick, more later). I would specify a DirectoryIndex entry, in a specific htaccess file for those two directories, of /redirect/index.pl, a redirect page that sends any request for those directories to the homepage. Or I could just specify a directory index of index.pl and upload an index.pl file to each of those directories. Or I could just stick in an htaccess redirect, which is our next subject!
htaccess uses Redirect to look for any request for a specific page (or a non-specific location, though this can cause infinite loops) and, if it finds that request, forwards it to a new page you have specified:
Redirect /olddirectory/oldfile.html http://yoursite.com/newdirectory/newfile.html
Note that there are 3 parts to that, which should all be on one line: the Redirect command, the location of the file/directory you want redirected relative to the root of your site (/olddirectory/oldfile.html = yoursite.com/olddirectory/oldfile.html) and the full URL of the location you want that request sent to. Each of the 3 is separated by a single space, but all on one line. You can also redirect an entire directory by simply using Redirect /olddirectory http://yoursite.com/newdirectory/
Using this method, you can redirect any number of pages no matter what you do to your directory structure. It is the fastest method, and its effect is global.
Prevent viewing of .htaccess file
If you use htaccess for password protection, then the location containing all of your password information is plainly available through the htaccess file. If you have set incorrect permissions or if your server is not as secure as it could be, a browser has the potential to view an htaccess file through a standard web interface and thus compromise your site/server. This, of course, would be a bad thing. However, it is possible to prevent an htaccess file from being viewed in this manner:
<Files .htaccess>
order allow,deny
deny from all
</Files>
The first line specifies that the file named .htaccess is having this rule applied to it. You could use this for other purposes as well if you get creative enough.
If you use this in your htaccess file, a person trying to see that file would get returned (under most server configurations) a 403 error code. You can also set permissions for your htaccess file via CHMOD, which would also prevent this from happening, as an added measure of security: 644 (RW-R--R--)
Adding MIME Types
What if your server wasn’t set up to deliver certain file types properly? A common occurrence with MP3 or even SWF files. Simple enough to fix:
AddType application/x-shockwave-flash swf
AddType is specifying that you are adding a MIME type. The application string is the actual parameter of the MIME you are adding, and the final little bit is the default extension for the MIME type you just added, in our example this is swf for ShockWave File.
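Since MP3 is mentioned above as another common case, the equivalent line for it would be:

```apache
# Serve .mp3 files with the correct audio MIME type
AddType audio/mpeg mp3
```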
Preventing hot linking of images
In the webmaster community, "hot linking" is a curse phrase. Also known as "bandwidth stealing" by the angry site owner, it refers to linking directly to non-html objects not on one's own server, such as images, .js files etc. The victim's server in this case is robbed of bandwidth (and in turn money) as the violator enjoys showing content without having to pay for its delivery. The most common practice of hot linking pertains to another site's images.
Using .htaccess, you can disallow hot linking on your server, so those attempting to link to an image on your site, for example, are shown either the door (a broken image) or the lion's mouth (another image of your choice, such as a "Barbra Streisand" picture, no emails please). There is just one small catch: unlike the rest of the .htaccess functionalities we saw earlier, disabling hot linking also requires that your server supports mod_rewrite. Check with your web host regarding this.
With all the pieces in place, here’s how to disable hot linking of images on your site. Simply add the below code to your .htaccess file, and upload the file either to your root directory, or a particular subdirectory to localize the effect to just one section of your site:
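A typical mod_rewrite fragment for this purpose looks like the following sketch; mydomain.com and nasty.gif are the placeholders the next line refers to:

```apache
RewriteEngine on
# Allow requests with an empty referer (direct requests, some proxies)
RewriteCond %{HTTP_REFERER} !^$
# ...but block requests whose referer is not your own site
RewriteCond %{HTTP_REFERER} !^http://(www\.)?mydomain\.com/ [NC]
# Option 1: refuse .gif/.jpg requests outright (shows a broken image):
# RewriteRule \.(gif|jpg)$ - [F]
# Option 2: serve an image of your choice instead:
RewriteRule \.(gif|jpg)$ http://www.mydomain.com/nasty.gif [R,L]
```

Use only one of the two RewriteRule lines, depending on whether you want the broken image or the substitute image.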
Same deal: replace mydomain.com with your own domain, and nasty.gif with the image of your choice.
Time to pour a bucket of cold water on hot linking!
Preventing Directory Listing
Do you have a directory full of images or zips that you do not want people to be able to browse through? Typically a server is setup to prevent directory listing, but sometimes they are not. If not, become self-sufficient and fix it yourself:
IndexIgnore *
The * is a wildcard that matches all files, so if you stick that line into an htaccess file in your images directory, nothing in that directory will be allowed to be listed.
On the other hand, what if you did want the directory contents to be listed, but only if they were HTML pages and not images? Simple says I:
IndexIgnore *.gif *.jpg
This would return a list of all files not ending in .jpg or .gif, but would still list .txt, .html, etc.
And conversely, if your server is setup to prevent directory listing, but you want to list the directories by default, you could simply throw this into an htaccess file the directory you want displayed:
Options +Indexes
If you do use this option, be very careful that you do not put any unintentional or compromising files in this directory. And as you may have guessed from the plus sign before Indexes, you can use a minus sign (Options -Indexes) to prevent directory listing entirely. This is typical of most server setups and is usually configured elsewhere in the Apache server, but it can be overridden through htaccess.
If you really want to be tricky, using the +Indexes option, you can include a default description for the directory listing that is displayed when you use it by placing a file called HEADER in the same directory. The contents of this file will be printed out before the list of directory contents is listed. You can also specify a footer, though it is called README, by placing it in the same directory as the HEADER. The README file is printed out after the directory listing is printed.
Conclusion & More Information
Of course, I can’t list every possible use of htaccess here, just the more notable and useful ones (read: for fun and profit). There is a list of Apache Directives you can use for your htaccess files, though not all of them are designed to be used by htaccess. Consult the documentation for the directive you are looking to use and make sure that you can actually use it as an htaccess string.
You should also go through the Apache User’s Guide for more detailed information if you are really serious about making your life easier as a webmaster. You don’t need to update all 4,000 of the pages on your site individually, by hand, in order to change one file reference…honestly!
In any event, I hope you got a better idea of the power available to you through this relatively simple little Clark Kent-ish file. You really do have the ability to save yourself a lot of time and grief by using htaccess, especially when you add to that the power of SSI and xSSI.
The Definition of Spam
The word “Spam” as applied to Email means Unsolicited Bulk Email (“UBE”).
Unsolicited means that the Recipient has not granted verifiable permission for the message to be sent. Bulk means that the message is sent as part of a larger collection of messages, all having substantively identical content.
A message is Spam only if it is both Unsolicited and Bulk.
Unsolicited Email is normal email (examples include first contact enquiries, job enquiries, sales enquiries, etc.)
Bulk Email is normal email (examples include subscriber newsletters, discussion lists, information lists, etc.).
This distinction is important because the Direct Marketing Association, the pro-junk group who lobby on behalf of the junk email industry, try to dupe politicians into thinking anti-spam organizations want "Unsolicited Email" banned, in order to get them to vote against anti-spam laws.
Technical Definition of “Spam”
An electronic message is “spam” IF: (1) the recipient’s personal identity and context are irrelevant because the message is equally applicable to many other potential recipients; AND (2) the recipient has not verifiably granted deliberate, explicit, and still-revocable permission for it to be sent; AND (3) the transmission and reception of the message appears to the recipient to give a disproportionate benefit to the sender.
Solicited Bulk Email is an important mechanism for keeping consenting customers informed of products or service news. When Bulk Email is Solicited it is valuable to the recipient and therefore also to the sender. When it’s Unsolicited it’s purely Spam, an unwanted nuisance to the recipient, and, because it forces the recipient to assume the cost of receiving, storing and dealing with the unwanted advert it is also a theft of the unwilling recipient’s time and resources.
The difference between senders of legitimate bulk email and spammers couldn't be clearer: the legitimate bulk email sender has verifiable permission from the recipients before sending; the spammer does not.
All bulk email sent to recipients who have not expressly registered permission for their addresses to be placed on the mailing list, and which requires recipients to opt-out to stop further unsolicited bulk mailings, is by definition Unsolicited Bulk Email. The sending of Unsolicited Bulk Email is illegal in most of Europe and is against all ISP Terms of Service worldwide.
Unconfirmed Opt-in
The Recipient has, according to the Bulk Email Sender, unverifiably initiated a request for the address to be subscribed to the Bulk Email Sender's mailing list. The Bulk Email Sender has subscribed the address to the mailing list without verifying whether the address owner has in fact granted permission. In most cases the Bulk Email Sender has simply purchased the address from another spammer.
As there is no verification, all spammers claim to practice ‘Opt-in’ which is why the vast majority of spam claims you “opted-in”.
Unconfirmed Opt-in means that anyone can subscribe anyone, therefore if the address submitted by an unverified user was “President@Whitehouse.gov”, the President has ‘opted-in’ and will receive bulk mailings whether he likes it or not until he opts-out.
In case of dispute
The Bulk Email Sender has no verifiable proof and is therefore liable for sending Spam, the sending of which is against all ISP contracts, against European laws, and against Spamhaus SBL policy.
Legitimate Bulk Email Closed-Loop Opt-In
Also known as “Confirmed Opt-in” or “Verified Opt-in”. The Recipient has verifiably confirmed permission for the address to be included on the specific mailing list, by confirming (responding to) the list subscription request verification. This is the standard practice for all Internet mailing lists, it ensures users are properly subscribed from a working address and with the address owner’s consent.
In case of dispute
The Bulk Email Sender is fully and legally protected because the reply to the Subscription Confirmation Request received back from the recipient proves that the recipient did in fact opt-in and grant verifiable consent for the mailings.
Spammers and some Direct Marketing Associations fronting for Spammers try to further confuse the Bulk Email issue by using variations on the above terms, which have very different meanings from what consumers expect. These tricks include:
Wording used usually by spammers to mean the recipient “has not opted-out, therefore they are opt-in”. Usually means any address the spammer can get hold of.
Wording used usually by spammers to imply the recipient has “opted-in twice”. The first time, says the spammer, was when the address was obtained and “opted-in” by the spammer without the recipient’s consent, the second time was when the recipient failed to opt-out after receiving spam.
Verbatim from Florida spam outfit Briceco Inc., who describe themselves as a “Leading provider of Marketing Solutions”: “A triple opt-in email is a person who subscribes and fills out name, address, and interest.” Presumably, a “quadruple-opt-in” is someone who fills out their age as well…
This is the path to your cgi-bin which holds the formmail.pl script. FormMail offers many new ways to code your form to tailor the resulting HTML page and the way the script performs. Below is a list of form fields you can use and how to implement them.
Mandatory Form Fields
There is only one form field that you must have in your form, for FormMail to work correctly. This is the recipient field.
This form field allows you to specify to whom you wish for your form results to be mailed. The field must contain a valid email address for the form to work. Most likely you will want to configure this option as a hidden form field with a value equal to that of your email address. If you would like the submitted form results copied to more than one recipient then you can separate multiple email addresses with commas.
NOTE: The recipient email addresses must all be hosted by Host4Africa.com in order for this form to work.
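A typical hidden recipient field looks like this (the address is a placeholder for your own):

```html
<input type=hidden name="recipient" value="you@yourdomain.com">
```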
The subject field will allow you to specify the subject that you wish to appear in the e-mail that is sent to you after this form has been filled out. If you do not have this option turned on, then the script will default to a message subject of: WWW Form Submission
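For example, to set the subject line (the value shown is illustrative):

```html
<input type=hidden name="subject" value="Website Feedback">
```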
This form field will allow the user to specify their return e-mail address. If you want to be able to return e-mail to your user, we strongly suggest that you include this form field and allow them to fill it in. This will be put into the From: field of the message you receive. If you want to require an email address with valid syntax, add this field name to the required field.
<input type=text name="email">
realname (use recommended)
The realname form field will allow the user to input their real name. This field is useful for identification purposes and will also be put into the From: line of your message header.
<input type=text name="realname">
redirect (use recommended)
If you wish to redirect the user to a different URL, rather than having them see the default response to the fill-out form, you can use this hidden variable to send them to a pre-made HTML page.
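For example (the URL is a placeholder for your own thank-you page):

```html
<input type=hidden name="redirect" value="http://www.yourdomain.com/thanks.html">
```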
You can now require for certain fields in your form to be filled in before the user can successfully submit the form. Simply place all field names that you want to be mandatory into this field. If the required fields are not filled in, the user will be notified of what they need to fill in, and a link back to the form they just submitted will be provided.
To use a customized error page, see missing_fields_redirect
If you want to require that they fill in the email and phone fields in your form, so that you can reach them once you have received the mail, use a syntax like:
<input type=hidden name="required" value="email,phone">
Allows you to have Environment variables included in the e-mail message you receive after a user has filled out your form. Useful if you wish to know what browser they were using, what domain they were coming from or any other attributes associated with environment variables. The following is a short list of valid environment variables that might be useful:
REMOTE_HOST – Sends the hostname making the request.
REMOTE_ADDR – Sends the IP address of the remote host making the request.
REMOTE_USER – If server supports authentication and script is protected, this is the username they have authenticated as. *This is not usually set.*
HTTP_USER_AGENT – The browser the client is using to send the request.
If you wanted to find the remote host and browser sending the request, you would put the following into your form:
<input type=hidden name="env_report" value="REMOTE_HOST,HTTP_USER_AGENT">
This field allows you to choose the order in which you wish for your variables to appear in the e-mail that FormMail generates. You can choose to have the fields sorted alphabetically or specify a set order in which you want the fields to appear in your mail message. By leaving this field out, the order will simply default to the order in which the browser sends the information to the script (which is usually the exact same order as they appeared in the form). When sorting by a set order of fields, you should include the phrase "order:" as the first part of your value for the sort field, and then follow that with the field names you want to be listed in the e-mail message, separated by commas.
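For example, to receive the fields in a fixed order (the field names here are illustrative), or alphabetically:

```html
<input type=hidden name="sort" value="order:realname,email,comments">
<!-- or, for alphabetical sorting: -->
<input type=hidden name="sort" value="alphabetic">
```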
print_config allows you to specify which of the config variables you would like to have printed in your e-mail message. By default, no config fields are printed to your e-mail. This is because the important form fields, like email, subject, etc. are included in the header of the message. However some users have asked for this option so they can have these fields printed in the body of the message. The config fields that you wish to have printed should be in the value attribute of your input tag separated by commas.
If you want to print the email and subject fields in the body of your message, you would place the following form tag:
<input type=hidden name="print_config" value="email,subject">
print_blank_fields allows you to request that all form fields are printed in the return HTML, regardless of whether or not they were filled in. FormMail defaults to turning this off, so that unused form fields aren’t e-mailed.
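To turn it on, add a hidden field with a true value:

```html
<input type=hidden name="print_blank_fields" value="1">
```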
This form field allows you to specify the title and header that will appear on the resulting page if you do not specify a redirect URL.
If you wanted a title of ‘Feedback Form Results’:
<input type=hidden name="title" value="Feedback Form Results">
This field allows you to specify a URL that will appear, as return_link_title, on the following report page. This field will not be used if you have the redirect field set, but it is useful if you allow the user to receive the report on the following page, but want to offer them a way to get back to your main page.
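For example (placeholder URL):

```html
<input type=hidden name="return_link_url" value="http://www.yourdomain.com/">
```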
This is the title that will be used to link the user back to the page you specify with return_link_url. The two fields will be shown on the resulting form page as:
<input type=hidden name="return_link_title" value="Back to Main Page">
This form field allows you to specify a URL that users will be redirected to if there are fields listed in the required form field that are not filled in. This is so you can customize an error page instead of displaying the default.
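For example (placeholder URL for your custom error page):

```html
<input type=hidden name="missing_fields_redirect" value="http://www.yourdomain.com/error.html">
```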
Any other form fields that appear in your script will be mailed back to you and displayed on the resulting page if you do not have the redirect field set. There is no limit as to how many other form fields you can use with this form, except the limits imposed by browsers and your server.
Example Formmail Form
Below is some example HTML code that creates a very basic form. The example includes all of the most important parameters for the form to function. It will submit the completed form results to firstname.lastname@example.org, with an email subject of "Website Contact Page".
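A sketch of such a form follows; the action path to formmail.pl is an assumption and may differ on your server:

```html
<form method="post" action="/cgi-bin/formmail.pl">
  <!-- Mandatory: where the results are mailed -->
  <input type=hidden name="recipient" value="firstname.lastname@example.org">
  <!-- Subject line of the resulting email -->
  <input type=hidden name="subject" value="Website Contact Page">
  <!-- Require these fields before the form can be submitted -->
  <input type=hidden name="required" value="realname,email">

  Your name: <input type=text name="realname"><br>
  Your email: <input type=text name="email"><br>
  Comments:<br>
  <textarea name="comments" rows="5" cols="40"></textarea><br>
  <input type=submit value="Send">
</form>
```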
Protect You and Your Customers from Phishing Scams
Phishing is one of the fastest growing online frauds today. It uses spam email to defraud victims. It is becoming increasingly common – and increasingly dangerous.
Phishers send out emails falsely claiming to be from an established and legitimate company in an attempt to scam users into surrendering private information that will subsequently be used for identity theft. The email directs the user to a website where they are asked to update personal information. This can include passwords and credit card, social security and bank account numbers, and other sensitive information that the legitimate organization already has on file.
This website is bogus even though it looks identical to the legitimate site. Once a customer has updated their data, the phishers steal their identity and run up bills in their name or use it to commit other crimes.
A common phishing technique involves creating the impression that there is an immediate need for personal information, luring unsuspecting users to quickly click on a link to these bogus sites. By spamming large groups of people, phishers can convince up to five percent of email users to reveal sensitive and personal information.
One of the easiest ways to protect yourself from phishers is to take simple precautions.
Do not respond to unsolicited emails that ask for any personal information regardless of how urgent the request appears. Legitimate companies do not ask for personal or sensitive information in this format. If you are concerned about your account – contact the company directly using an email address or phone number that you know is legitimate.
Do not email any personal or financial information. If you initiate a purchase online, look for indicators that the site is secure, e.g. a lock icon or a URL that begins with "https:" (the "s" stands for secure).
Review your credit card and bank statements as you receive them to ensure that all transactions are legitimate.
Get spam and anti-virus protection. Using the same methods to detect spam, phishing emails can be identified and filtered out of inbound email to stop you from receiving them.
Report anything suspicious. Contact the legitimate company in the suspect email using an email address or phone number that you know is correct.
If you believe that you responded to a phishing email and provided sensitive and personal information to a bogus website:
Contact the legitimate company in the suspect email using an email address or phone number that you know is correct.
Contact your credit card company and request that a fraud alert be placed on your card(s).
Take preventative action for the future through awareness and by investing in an anti-spam and anti-virus service.
Protect yourself and your customers from phishing. Be aware of what it is and take preventative measures.
Setting up your email account in Outlook 2003
Select the Tools menu and then select E-mail Accounts.
Select Add a new e-mail account.
Select the Next button.
On the Server Type page select POP3.
Select the Next button.
You can fill in your details on the Internet E-mail Settings (POP3) window.
In the Your Name box type in the name you want attached to your email.
In the Your Email box type in your email address that you created in the control panel.
In the Incoming Mail Server (POP3) box, type in the incoming mail server details that were supplied to you in the ‘Setup Information’ email.
In the Outgoing Mail Server (SMTP) box, type in the outgoing mail server details that were supplied to you in the ‘Setup Information’ email.
The User Name box should contain your complete email address.
The Password box should contain the password that you assigned to your email address in the control panel.
Please check that the User Name and Password are typed with the correct capitalisation, because user names and passwords are case sensitive. The password you type shows up as a row of stars to keep it private.
Select the Remember Password box if you do not want Outlook 2003 to prompt you for your password every time you download your email.
Click on More Settings and select the General tab.
The Reply E-mail box should contain your Reply-to email address. This is usually your full email address.
Select the Advanced tab. Check that the port number for the Incoming Server (POP3) is 110 and for the Outgoing Server (SMTP) is 25.
You can also change the server timeouts on this page; in most cases the default setting is Short (1 minute). We suggest that you increase the timeout to 2 minutes.
Click OK to return to the E-mail Accounts page, then click Next to complete the new mail account wizard.
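After completing the wizard, you can sanity-check that the server names and port numbers from your ‘Setup Information’ email are reachable. Here is a minimal sketch in Python; mail.example.com is a placeholder, not your actual server name.

```python
import socket

# Placeholder host -- substitute the server name from your
# 'Setup Information' email.
MAIL_HOST = "mail.example.com"

def reachable(host, port, timeout=3):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Standard ports checked on the Advanced tab:
# POP3 (incoming) on 110, SMTP (outgoing) on 25.
for port, role in [(110, "POP3 incoming"), (25, "SMTP outgoing")]:
    print(role, "reachable:", reachable(MAIL_HOST, port))
```

If a port shows as unreachable, a firewall or your ISP may be blocking it; contact your host before changing any settings.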
The Webalizer produces several HTML reports and graphics for each month processed. In addition, a summary page is generated for the current and previous months (up to 12). The yearly (index) report shows statistics for a 12-month period and links to each month. The monthly report has detailed statistics for that month, with additional links to any URLs and referrers found. The various totals shown are explained below.
Any request made to the server which is logged is considered a ‘hit’. The requests can be for anything: HTML pages, graphic images, audio files, CGI scripts, etc. Each valid line in the server log is counted as a hit. This number represents the total number of requests that were made to the server during the specified reporting period.
Some requests made to the server require that the server then send something back to the requesting client, such as an HTML page or graphic image. When this happens, it is considered a ‘file’ and the files total is incremented. The relationship between ‘hits’ and ‘files’ can be thought of as ‘incoming requests’ and ‘outgoing responses’.
Pages are, well, pages! Generally, any HTML document, or anything that generates an HTML document, would be considered a page. This does not include the other stuff that goes into a document, such as graphic images, audio clips, etc. This number represents the number of ‘pages’ requested only, and does not include the other ‘stuff’ that is in the page. What actually constitutes a ‘page’ can vary from server to server. The default action is to treat anything with the extension ‘.htm’, ‘.html’ or ‘.cgi’ as a page. A lot of sites will probably define other extensions, such as ‘.phtml’, ‘.php3’ and ‘.pl’, as pages as well. Some people consider this number the number of ‘pure’ hits… I’m not sure I totally agree with that viewpoint. Some other programs (and people 🙂 refer to this as ‘Pageviews’.
Each request made to the server comes from a unique ‘site’, which can be referenced by a name or ultimately, an IP address. The ‘sites’ number shows how many unique IP addresses made requests to the server during the reporting time period. This DOES NOT mean the number of unique individual users (real people) that visited, which is impossible to determine using just logs and the HTTP protocol (however, this number might be about as close as you will get).
Whenever a request is made to the server from a given IP address (site), the amount of time since that address’s previous request (if any) is calculated. If the time difference is greater than a preconfigured ‘visit timeout’ value, or the address has never made a request before, it is considered a ‘new visit’ and this total is incremented (both for the site and for the IP address). The default timeout value is 30 minutes (this can be changed), so if a user visits your site at 1:00 in the afternoon and then returns at 3:00, two visits would be registered. Note: in the ‘Top Sites’ table, the visits total should be discounted on ‘Grouped’ records and thought of as the “minimum number of visits” that came from that grouping instead. Note: visits only occur on PageType requests, that is, any request whose URL is one of the ‘page’ types defined with the PageType option. Due to the limitations of the HTTP protocol, log rotations and other factors, this number should not be taken as absolutely accurate; rather, it should be considered a pretty close “guess”.
The KBytes (kilobytes) value shows the amount of data, in KB, that was sent out by the server during the specified reporting period. This value is generated directly from the log file, so it is up to the web server to produce accurate numbers in the logs (some web servers do stupid things when it comes to reporting the number of bytes). In general, this should be a fairly accurate representation of the amount of outgoing traffic the server had, regardless of the web server’s reporting quirks.
Note: a kilobyte is 1024 bytes, not 1000.
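The counters defined above can be sketched as a small tally over Common Log Format lines. This is an illustrative re-implementation in Python, not Webalizer’s actual code; it assumes Webalizer’s default page extensions and the default 30-minute visit timeout.

```python
import re
from datetime import datetime, timedelta

# Matches Common Log Format, e.g.:
# 1.2.3.4 - - [10/Oct/2000:13:55:36 +0000] "GET /index.html HTTP/1.0" 200 2326
CLF = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "\S+ (\S+) [^"]*" (\d{3}) (\S+)')

PAGE_EXTS = (".htm", ".html", ".cgi")   # Webalizer's default page types
VISIT_TIMEOUT = timedelta(minutes=30)   # default 'visit timeout'

def tally(lines):
    stats = {"hits": 0, "files": 0, "pages": 0, "kbytes": 0.0, "visits": 0}
    last_seen = {}                      # ip -> time of that site's last request
    for line in lines:
        m = CLF.match(line)
        if not m:
            continue                    # only valid log lines count
        ip, ts, url, status, size = m.groups()
        t = datetime.strptime(ts.split()[0], "%d/%b/%Y:%H:%M:%S")
        stats["hits"] += 1              # every valid line is a hit
        if size.isdigit() and int(size) > 0:
            stats["files"] += 1         # something was sent back
            stats["kbytes"] += int(size) / 1024
        if url.lower().endswith(PAGE_EXTS):
            stats["pages"] += 1
            prev = last_seen.get(ip)    # visits count on page requests only
            if prev is None or t - prev > VISIT_TIMEOUT:
                stats["visits"] += 1
        last_seen[ip] = t
    stats["sites"] = len(last_seen)     # unique IP addresses seen
    return stats
```

Feeding it two page requests from the same IP more than 30 minutes apart registers two visits but one site, matching the 1:00 / 3:00 example above.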
Top Entry and Exit Pages
The Top Entry and Exit Pages tables give a rough estimate of which URLs are used to enter your site and which pages are viewed last. Because of limitations in the HTTP protocol, log rotations, etc., these numbers should be considered a good “rough guess” of the actual figures; however, they give a good indication of the overall trend in where users come into, and exit, your site.
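Entry and exit pages can be estimated from time-ordered page requests: the first page of each visit is an entry, and the last page seen from each site is an exit. The sketch below is illustrative only, not Webalizer’s actual algorithm.

```python
from collections import Counter

def entry_exit(page_requests, timeout_minutes=30):
    """page_requests: time-ordered list of (ip, minutes, url) page hits."""
    entries, exits = Counter(), Counter()
    last = {}                          # ip -> (time, url) of previous page hit
    for ip, t, url in page_requests:
        prev = last.get(ip)
        if prev is None or t - prev[0] > timeout_minutes:
            entries[url] += 1          # this request starts a new visit
            if prev is not None:
                exits[prev[1]] += 1    # previous page ended the old visit
        last[ip] = (t, url)
    for t, url in last.values():
        exits[url] += 1                # last page seen from each site
    return entries, exits
```

Because a browser never tells the server “this was my last page”, the exit counts in particular are only as good as the visit-timeout heuristic.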