URL normalization is the process by which URLs are modified and standardized in a consistent manner. The goal is to transform a URL into a normalized form so that it is possible to determine whether two syntactically different URLs are equivalent. Search engines employ URL normalization to assign importance to web pages and to reduce indexing of duplicate pages. Web crawlers normalize URLs to avoid crawling the same resource more than once. Web browsers may normalize URLs to determine whether a link has been visited or whether a page has been cached.
There are several types of normalization that may be performed. Some always preserve semantics; others may not.
The following normalizations are described in RFC 3986 [1] to result in equivalent URLs:
Converting the scheme and host to lowercase. Example: HTTP://www.Example.com/ → http://www.example.com/
Capitalizing letters in percent-encoded escape sequences. Example: http://www.example.com/a%c2%b1b → http://www.example.com/a%C2%B1b
Decoding percent-encoded octets of unreserved characters. Percent-encoded octets in the ranges of ALPHA (%41–%5A and %61–%7A), DIGIT (%30–%39), hyphen (%2D), period (%2E), underscore (%5F), or tilde (%7E) should not be created by URI producers and, when found in a URI, should be decoded to their corresponding unreserved characters by URI normalizers.[2] Example: http://www.example.com/%7Eusername/ → http://www.example.com/~username/
Removing the default port (port 80 for the http scheme). Example: http://www.example.com:80/bar.html → http://www.example.com/bar.html
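The four normalizations above can be sketched in Python using only the standard library. This is a minimal illustration, not a complete RFC 3986 normalizer: the names normalize_safe, DEFAULT_PORTS and UNRESERVED are assumptions made for this example, and percent-escapes inside the host are not handled.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Illustrative names; none of them come from RFC 3986 itself.
DEFAULT_PORTS = {"http": 80, "https": 443}
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def _fix_escape(match):
    """Decode escapes of unreserved characters; uppercase all other escapes."""
    char = chr(int(match.group(1), 16))
    if char in UNRESERVED:
        return char
    return "%" + match.group(1).upper()


def normalize_safe(url):
    parts = urlsplit(url)
    scheme = parts.scheme.lower()

    # Split the authority by hand so userinfo and IPv6 brackets are kept as-is.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    host, colon, port = hostport.rpartition(":")
    if colon and (port == "" or port.isdigit()):
        # Drop an empty port or the scheme's default port.
        if port == "" or int(port) == DEFAULT_PORTS.get(scheme):
            port = ""
    else:
        # No port present (or the colon belongs to an IPv6 literal).
        host, port = hostport, ""

    # Sketch limitation: percent-escapes inside the host are not treated specially.
    netloc = userinfo + at + host.lower() + (":" + port if port else "")

    path = _PCT.sub(_fix_escape, parts.path)
    query = _PCT.sub(_fix_escape, parts.query)
    fragment = _PCT.sub(_fix_escape, parts.fragment)
    return urlunsplit((scheme, netloc, path, query, fragment))


if __name__ == "__main__":
    for example in (
        "HTTP://www.Example.com/",
        "http://www.example.com/a%c2%b1b",
        "http://www.example.com/%7Eusername/",
        "http://www.example.com:80/bar.html",
    ):
        print(example, "->", normalize_safe(example))
```

The authority is split manually rather than through the hostname attribute so that userinfo and bracketed IPv6 literals pass through unchanged.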
For http and https URLs, the following normalizations listed in RFC 3986 may result in equivalent URLs, but are not guaranteed to by the standards:
Adding a trailing slash to a directory path. Example: http://www.example.com/alice → http://www.example.com/alice/
Removing dot-segments. Example: http://www.example.com/../a/b/../c/./d.html → http://www.example.com/a/c/d.html
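Dot-segment removal as in the example above can be sketched by following the step-by-step procedure in RFC 3986, section 5.2.4. The function names below are assumptions made for this illustration, and the trailing-slash helper uses a crude heuristic (a last segment without a "." is treated as a directory) that is not part of any standard.

```python
from urllib.parse import urlsplit, urlunsplit


def remove_dot_segments(path):
    """Remove "." and ".." segments, following RFC 3986 section 5.2.4."""
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]  # "/./x" becomes "/x"
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]  # "/../x" becomes "/x"
            if output:
                output.pop()
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/", if any) to the output.
            start = 1 if path.startswith("/") else 0
            end = path.find("/", start)
            if end == -1:
                end = len(path)
            output.append(path[:end])
            path = path[end:]
    return "".join(output)


def add_trailing_slash(path):
    """Heuristic (an assumption): a last segment without "." is a directory."""
    last = path.rsplit("/", 1)[-1]
    if not path.endswith("/") and "." not in last:
        return path + "/"
    return path


def normalize_path(url, trailing_slash=False):
    parts = urlsplit(url)
    path = remove_dot_segments(parts.path)
    if trailing_slash:
        path = add_trailing_slash(path)
    return urlunsplit(parts._replace(path=path))


if __name__ == "__main__":
    print(normalize_path("http://www.example.com/../a/b/../c/./d.html"))
    print(normalize_path("http://www.example.com/alice", trailing_slash=True))
```

The trailing slash is opt-in here because, as noted above, the standards do not guarantee that the resulting URL is equivalent.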
However, if a removed ".." component, e.g. "b/..", is a symlink to a directory with a different parent, eliding "b/.." will result in a different path and URL.[3] In rare cases, depending on the web server, this may even be true for the root directory (e.g. "//www.example.com/.." may not be equivalent to "//www.example.com/").
Applying the following normalizations results in a semantically different URL, although it may refer to the same resource:
Removing the directory index. Examples: http://www.example.com/default.asp → http://www.example.com/ and http://www.example.com/a/index.html → http://www.example.com/a/
Removing the fragment. Example: http://www.example.com/bar.html#section1 → http://www.example.com/bar.html
Replacing an IP address with its domain name. Example: http://208.77.188.166/ → http://www.example.com/
Limiting protocols, for example replacing the https scheme with http. Example: https://www.example.com/ → http://www.example.com/
Removing duplicate slashes. Example: http://www.example.com/foo//bar.html → http://www.example.com/foo/bar.html
Removing or adding "www" as the first domain label. For example, http://example.com/ and http://www.example.com/ may access the same website. Many websites redirect the user from the www address to the non-www address or vice versa. A normalizer may determine whether one of these URLs redirects to the other and normalize all URLs accordingly. Example: http://www.example.com/ → http://example.com/
Sorting the query parameters. Example: http://www.example.com/display?lang=en&article=fred → http://www.example.com/display?article=fred&lang=en
Removing unused query variables. Example: http://www.example.com/display?id=123&fakefoo=fakebar → http://www.example.com/display?id=123
Removing query parameters that carry default values. Example: http://www.example.com/display?id=&sort=ascending → http://www.example.com/display
Removing the "?" when the query is empty. Example: http://www.example.com/display? → http://www.example.com/display
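Several of the semantics-changing normalizations above can be sketched with Python's urllib.parse. Everything site-specific here is an assumption for illustration: the function name normalize_aggressive, the INDEX_NAMES tuple, and the ignored_params and default_params arguments, which a caller would have to supply from knowledge of the site. Normalizations that need network access (IP address to domain name, www redirects) are left out.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed directory index file names; real servers may use others.
INDEX_NAMES = ("index.html", "default.asp")


def normalize_aggressive(url, ignored_params=frozenset(), default_params=None):
    """Apply normalizations that may change the meaning of the URL."""
    default_params = default_params or {}
    parts = urlsplit(url)

    # Collapse runs of slashes in the path.
    path = re.sub(r"/{2,}", "/", parts.path)

    # Remove a trailing directory index file name.
    head, _, last = path.rpartition("/")
    if last in INDEX_NAMES:
        path = head + "/"

    # Drop unused and default-valued parameters, then sort the rest.
    # Note: parse_qsl/urlencode decode and re-encode values, which may
    # itself alter the escaping (for example "%20" becomes "+").
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    pairs = sorted(
        (key, value)
        for key, value in pairs
        if key not in ignored_params and default_params.get(key) != value
    )
    query = urlencode(pairs)

    # An empty query drops the "?"; the fragment is always removed.
    return urlunsplit((parts.scheme, parts.netloc, path, query, ""))


if __name__ == "__main__":
    print(normalize_aggressive("http://www.example.com/a/index.html"))
    print(normalize_aggressive("http://www.example.com/display?lang=en&article=fred"))
    print(
        normalize_aggressive(
            "http://www.example.com/display?id=123&fakefoo=fakebar",
            ignored_params={"fakefoo"},
        )
    )
```

Because these rewrites are not guaranteed to preserve meaning, it is probably safest to use the result only as a comparison key for duplicate detection and to keep the original URL for fetching.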
Some normalization rules may be developed for specific websites by examining URL lists obtained from previous crawls or web server logs. For example, if the URL
http://example.com/story?id=xyz
appears in a crawl log several times along with
http://example.com/story_xyz
we may assume that the two URLs are equivalent and can be normalized to one of the URL forms.
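Once such an equivalence has been inferred, it can be applied as a rewrite rule. The sketch below hard-codes one hypothetical rule for the example above; the RULES table and the function names are assumptions for illustration, and the step of discovering rules from logs is not shown.

```python
import re

# Hypothetical site-specific rule, assumed to have been inferred from crawl
# logs: http://example.com/story?id=X is equivalent to http://example.com/story_X
RULES = [
    (
        re.compile(r"^http://example\.com/story\?id=(\w+)$"),
        r"http://example.com/story_\g<1>",
    ),
]


def apply_site_rules(url):
    """Rewrite the URL with the first matching rule, or return it unchanged."""
    for pattern, replacement in RULES:
        rewritten, count = pattern.subn(replacement, url)
        if count:
            return rewritten
    return url


def deduplicate(urls):
    """Return the set of canonical forms for a list of crawled URLs."""
    return {apply_site_rules(url) for url in urls}


if __name__ == "__main__":
    crawl_log = [
        "http://example.com/story?id=xyz",
        "http://example.com/story_xyz",
    ]
    print(deduplicate(crawl_log))
```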
Schonfeld et al. (2006) present a heuristic called DustBuster for detecting DUST (different URLs with similar text) rules that can be applied to URL lists. They showed that once the correct DUST rules were found and applied with a normalization algorithm, they were able to find up to 68% of the redundant URLs in a URL list.