Jarink is a program to help web administrators maintain their websites.
Currently it provides a command to scan for broken links.
|
|
|
|
The error message can help users debug problems with links.
|
|
Using JSON as the output format lets the results be parsed by other tools.
|
|
After all of the scan results have been checked for whether they have
been seen, check for links that are still waiting for a status in a
second loop.
|
|
For links that are not from the same domain being scanned, use the
HTTP HEAD method to minimize the resources being transferred.
|
|
|
|
When using a goroutine to process a link, the result is passed to the
main goroutine through a channel.
The main goroutine then processes the results one by one, checking
whether each one has been seen, has an error, or needs to be scanned.
That way, we do not need a mutex to guard whether a link has been seen
or not.
|
|
Using HEAD does not return the content of the image, which consumes
fewer resources on both ends.
|
|
Printing the date and time during testing makes the log lines too long.
|
|
The CLI contains one command: scan.
It accepts a single argument, the URL to be scanned,
and one option, "-verbose".
|
|
The fragment part of a URL, for example "/page#fragment", should be
removed; otherwise it will be indexed as a different URL.
|
|
Using a struct allows the parameters to be extended later without
changing the signature.
|
|
After checking the code and tests for [html.Parse], there are no
actual cases where HTML content will cause it to return an error.
The only possible error is when reading from the body (io.Reader), and
that is also almost impossible.
[html.Parse]: https://go.googlesource.com/net/+/refs/tags/v0.40.0/html/parse.go#2347
|
|
|
|
Any HTML link from a domain other than the scanned domain should
not get parsed.
We only check whether the link is valid or not.
|
|
For links to images, we can skip parsing them.
|
|
The tests should not require an internet connection to pass.
|
|
It turns out broken HTML still gets parsed by the "net/html" package.
|
|
Scanning an invalid URL like "127.0.0.1:14594" (no HTTP scheme) or
"http://127.0.0.1:14594" (server not available) should return an
error.
Scanning a subpage like "http://127.0.0.1:11836/page2" should return
the same result as scanning from the base URL
"http://127.0.0.1:11836".
|
|
The current implementation covers at least 84% of the cases.
Todo:
* CLI for scan
* add more test cases for 100% coverage, including scanning an
  invalid base URL, an invalid HTML page, and an invalid href or
  image src
|
|
|