| Age | Commit message (Collapse) | Author |
|
**🌱 brokenlinks: add option to ignore listed HTTP status codes**.
When a link is known to have issues, one can ignore its status code
while scanning for broken links using the "-ignore-status" option.
**🌱 brokenlinks: add option "insecure"**.
The "-insecure" option suppresses errors for servers with invalid
certificates.
**🌱 brokenlinks: implement caching for external URLs**.
Any successful fetch of an external URL will be recorded in the jarink
cache file, located in the user's cache directory.
For example, on Linux it would be `$HOME/.cache/jarink/cache.json`.
This helps speed up future rescans of the same or a different target
URL by minimizing network requests.
**🌼 brokenlinks: reduce the number of goroutines per scan**.
Previously, each scan ran on one goroutine and the result was
pushed using pushResult in another goroutine.
This made scanning one link consume two goroutines.
This change makes the scan function return the result and push it
in the same goroutine.
|
|
The version command prints the version of the program.
|
|
|
|
|
|
|
|
The test runs a server that serves six pages, each with a different
[time.Sleep] duration before returning the response.
This allows us to see how the main scan loop works, waiting
for resultq and listWaitStatus.
|
|
There are two test cases: one for an invalid status code like "abc",
and one for an unknown status code like "50".
|
|
|
|
|
|
Before the Options are passed to a worker, they should be validated,
including the URL to be scanned.
|
|
This allows git.sr.ht to render the README.
|
|
The insecure option allows connections to servers with invalid
certificates without reporting them as errors.
|
|
When a link is known to have issues, one can ignore its status
code while scanning for broken links using the "-ignore-status" option.
|
|
The first release of jarink provides the command "brokenlinks",
which scans for broken links.
The output of this command is a list of pages with their broken links
in JSON format.
This command accepts the following options:
`-verbose`::
Print the page being scanned to standard error.
`-past-result=<path to JSON file>`::
Scan only the pages reported by the result of a past scan, based
on the content of the JSON file.
This minimizes the time to re-scan the pages once we have fixed the URLs.
|
|
|
|
|
|
When two or more structs share the same prefix, it is time to
group them together.
Also, we will group each command into its own package in the future.
|
|
Naming it page_links does not make sense when the result comes from
the brokenlinks command.
|
|
Using HTTP HEAD on certain pages may return:
* 404, not found, for example on
https://support.google.com/accounts/answer/1066447
* 405, method not allowed, for example on
https://aur.archlinux.org/packages/rescached-git
For a 405 response code we can check and retry with GET, but for 404 it
is impossible to check whether the URL really exists, since 404 means
the page was not found.
|
|
When a call to HTTP HEAD or GET returns an error and the error is a
*net.DNSError with Timeout set, retry the call up to 5 times, until it
succeeds or keeps timing out.
|
|
Previously, we only encoded BrokenlinksResult.PageLinks.
The struct may change in the future, so it is better to encode the
whole struct now rather than changing the output later.
|
|
The brokenlinks command now has the option "-past-result", which
accepts a path to a JSON file containing a past result.
If it is set, the program will only scan the pages with broken links
inside that report.
|
|
|
|
|
|
Previously, if we passed a URL with a path to brokenlinks, for example
"web.tld/path", it would scan all of the pages on the website "web.tld".
Now it only scans "/path" and its sub-paths.
|
|
|
|
The worker logs with date and time, while the main program does not.
|
|
The README contains the content of the usage function in
"cmd/jarink".
|
|
Jarink is a program that helps web administrators maintain their
websites.
Currently it provides a command to scan for broken links.
|
|
|
|
The error messages can help users debug problems with links.
|
|
Using JSON as the output format allows the results to be parsed by
other tools.
|
|
After all of the results from scan have been checked against the seen
list, check for links that are waiting for a status in the second loop.
|
|
For links that are not from the domain being scanned, use the HTTP
HEAD method to minimize the resources being transferred.
|
|
|
|
When using a goroutine to process a link, the result is then passed to
the main goroutine through a channel.
The main goroutine then processes the results one by one, checking
whether each has been seen, is an error, or needs to be scanned.
That way, we do not need a mutex to guard whether a link has been seen.
|
|
Using HEAD does not return the content of the image, which consumes
fewer resources on both ends.
|
|
Printing date and time during testing makes the log lines too long.
|
|
The CLI contains one command: scan.
It accepts a single argument, a URL to be scanned,
and one option, "-verbose".
|
|
The fragment part of a URL, for example "/page#fragment", should be
removed, otherwise it will be indexed as a different URL.
|
|
Using a struct allows extending the parameters later without changing
the function signature.
|
|
After checking the code and tests for [html.Parse], there are no
actual cases where HTML content will cause it to return an error.
The only possible error is a failure reading from the body (io.Reader),
and that is also almost impossible.
[html.Parse]: https://go.googlesource.com/net/+/refs/tags/v0.40.0/html/parse.go#2347
|
|
|
|
Any HTML link from a domain other than the scanned domain should
not get parsed.
We only check whether the link is valid or not.
|
|
For links to images, we can skip parsing the content.
|
|
The tests should not require an internet connection to pass.
|
|
It turns out that broken HTML still gets parsed by the "net/html"
package.
|
|
Scanning an invalid URL like "127.0.0.1:14594" (without an HTTP
scheme) or "http://127.0.0.1:14594" (server not available) should
return an error.
Scanning a subpage like "http://127.0.0.1:11836/page2" should return
the same result as scanning from the base URL
"http://127.0.0.1:11836".
|
|
The current implementation covers at least 84% of the cases.
Todo:
* CLI for scan
* add more test cases for 100% coverage, including scanning an
invalid base URL, an invalid HTML page, and an invalid href or
image src
|
|
|