har-heedless

Scriptable batch downloading of webpages to generate HTTP Archive (HAR) files, using PhantomJS. See har-dulcify for aggregate HAR analysis. You might want to use har-portent, which both downloads multiple dataset variations using har-heedless and analyzes them with har-dulcify in a single step.

  • Downloads the front web page of all domains in a dataset.
    • Input is a text file with one domain name per line.
    • Downloads n domains in parallel.
      • Tested with over 100 parallel requests on a single machine of moderate speed and memory. YMMV.
      • Machine load heavily depends on the complexity and response rate of the average domain in the dataset.
    • Shows progress as well as expected time to finish downloads.
    • Downloads domains with different prefixes as separate dataset variations.
      • Default prefixes:
        • http://
        • https://
        • http://www.
        • https://www.
    • Retries failed domains twice to reduce the effect of any intermittent problems.
      • Increases domain timeouts for failed domains.
    • Saves screenshots of all webpages.
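The prefix variations above mean each input domain is fetched four times, once per prefix. As an illustrative sketch (not the project's actual code), the expansion can be written in plain shell:

```shell
# Expand each input domain (one per line on stdin) with the four
# default prefixes, producing one URL per line on stdout.
expand_prefixes() {
    while IFS= read -r domain; do
        for prefix in 'http://' 'https://' 'http://www.' 'https://www.'; do
            printf '%s%s\n' "$prefix" "$domain"
        done
    done
}

printf 'example.com\n' | expand_prefixes
# http://example.com
# https://example.com
# http://www.example.com
# https://www.example.com
```

Each prefix variation is downloaded as a separate dataset, so results for, say, https:// and https://www. can be compared afterwards.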

Usage

# Download domain front pages in parallel.
# domains | ./src/domain/parallel.sh <prefix> <parallelism> --screenshot <true|false>
<domains.txt ./src/domain/parallel.sh 'https://www.' 10 --screenshot true

# More advanced usage, with pipe-viewer (pv) for speed estimates.
size=$(wc -l domains.txt | awk '{ print $1 }')
pv --line-mode --size "$size" -cN "input" domains.txt | ./src/domain/parallel.sh 'https://www.' 10 --screenshot true | pv --line-mode --size "$size" -cN "output" >> "domains.log"

Other options:

# Download domain front pages in serial. This can be very slow.
# domains | ./src/domain/serial.sh <prefix> --screenshot <true|false>
<domains.txt ./src/domain/serial.sh 'https://www.' --screenshot true

# Download custom URLs in parallel. Note that almost no testing of non-front-page downloading has been done.
# urls | ./src/url/parallel.sh --screenshot <true|false>
<urls.txt ./src/url/parallel.sh --screenshot true

# Download custom URLs in serial. This can be very slow. Note that almost no testing of non-front-page downloading has been done.
# urls | ./src/url/serial.sh --screenshot <true|false>
<urls.txt ./src/url/serial.sh --screenshot true

# Download a single URL. Note that almost no testing of non-front-page downloading has been done.
# ./src/url/single.sh <URL> --screenshot <true|false>
./src/url/single.sh 'https://joelpurra.com/' --screenshot true

# Fetch a single HAR, optionally with an embedded screenshot. Note that almost no testing of non-front-page downloading has been done.
# ./src/get/har.sh <URL> --screenshot <true|false>
./src/get/har.sh 'https://joelpurra.com/' --screenshot true
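A fetched HAR file is plain JSON with the recorded requests under .log.entries, so it can be inspected with jq. A minimal sketch, assuming jq is installed; the file name page.har is hypothetical, and a tiny stand-in HAR is created inline so the sketch is self-contained (real files come from the scripts above):

```shell
har='page.har'

# Tiny stand-in HAR for demonstration; real files come from har.sh.
cat > "$har" <<'EOF'
{"log":{"entries":[{"request":{"url":"https://joelpurra.com/"},"response":{"status":200}}]}}
EOF

# Number of requests recorded in the HAR.
jq '.log.entries | length' "$har"

# URLs of all recorded requests.
jq -r '.log.entries[].request.url' "$har"
```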

Citations

If you use, like, reference, or base work on the thesis report Swedes Online: You Are More Tracked Than You Think, the IEEE LCN 2016 paper Third-party Tracking on the Web: A Swedish Perspective, the open source code, or the open data, please add at least one of the following two citations with a link to the project website: https://joelpurra.com/projects/masters-thesis/

Master’s thesis citation:

Joel Purra. 2015. Swedes Online: You Are More Tracked Than You Think. Master’s thesis. Linköping University (LiU), Linköping, Sweden. https://joelpurra.com/projects/masters-thesis/

IEEE LCN 2016 paper citation:

J. Purra, N. Carlsson, Third-party Tracking on the Web: A Swedish Perspective, Proc. IEEE Conference on Local Computer Networks (LCN), Dubai, UAE, Nov. 2016. https://joelpurra.com/projects/masters-thesis/

Original purpose

Built as a component in Joel Purra’s master’s thesis research, where downloading lots of front pages in the .se top level domain zone was required to analyze their content and use of internal/external resources.

Copyright (c) 2014, 2015, 2016, 2017 Joel Purra. Released under GNU General Public License version 3.0 (GPL-3.0).
