har-portent
Using har-heedless to download and har-dulcify to analyze web pages in aggregate.
⚠️ This project has been archived
No future updates are planned. Feel free to continue using it, but expect no support.
- Downloads the front web page of all domains in a dataset.
- Input is a text file with one domain name per line (see the example after this list).
- Downloads n domains in parallel.
- Tested with over 100 parallel requests on a single machine of moderate speed and memory. YMMV.
- Machine load heavily depends on the complexity and response rate of the average domain in the dataset.
- Shows progress as well as expected time to finish downloads.
- Downloads domains with different prefixes as separate dataset variations.
- Default prefixes:
  - http://
  - https://
  - http://www.
  - https://www.
- Retries failed domains twice to reduce the effect of intermittent problems.
- Increases domain timeouts for failed domains.
- Saves screenshots of all webpages.
- Runs an analysis on each dataset variation.
- Outputs JSON files for analysis.
- Prepared for aggregate dataset analysis that outputs tables (TSV/CSV), which in turn are suitable for graph creation.
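As noted in the feature list, the input file contains one bare domain name per line; the prefixes listed above are prepended by the scripts for each dataset variation. A minimal, hypothetical domains.txt could contain (placeholder names):

example.com
example.net
example.org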
Directory structure
# $PWD/$(date -u +%F)/$(basename "$domainlist")-$prefix/hars
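For illustration, the components of that template expand as follows for the run shown under Example output (exact escaping of the prefix in the directory name is not covered here):

# $PWD                      -> /Users/joelpurra/analyze/the/web
# $(date -u +%F)            -> 2015-01-31
# $(basename "$domainlist") -> only-two-domains.txt
# $prefix                   -> https://www.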
Usage
# Create directory structure and download all domains in the given domain lists with a single prefix/variation.
# ./src/domains/download-and-analyze-https-www-combos.sh <prefix> <parallelism> <domainlists>
./src/domains/download-and-analyze-https-www-combos.sh 'https://www.' 10 many-domains.txt more-domains.txt 100k-se-domains.txt
# Create directory structure and download all domains in the given domain lists with all four default prefix variations.
# ./src/domains/download-and-analyze-https-www-combos.sh <parallelism> <domainlists>
./src/domains/download-and-analyze-https-www-combos.sh 10 many-domains.txt more-domains.txt 100k-se-domains.txt
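As a quick end-to-end check before starting a large run, the same command can first be pointed at a tiny list; a minimal sketch using placeholder domains (example.com and example.net are assumptions, not part of this project):

# Create a two-domain list, then run all four default prefix variations with parallelism 10.
printf '%s\n' 'example.com' 'example.net' > only-two-domains.txt
./src/domains/download-and-analyze-https-www-combos.sh 10 only-two-domains.txt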
Other usage:
# Re-run question step in each dataset.
~/path/to/har-dulcify/src/util/dataset-foreach.sh $(find . -mindepth 2 -maxdepth 2 -type d) -- echo "--- Entering {} ---" '&&' ~/path/to/har-dulcify/src/one-shot/questions.sh
# Re-run aggregate and question step in each dataset.
~/path/to/har-dulcify/src/util/dataset-foreach.sh $(find . -mindepth 2 -maxdepth 2 -type d) -- echo "--- Entering {} ---" '&&' ~/path/to/har-dulcify/src/one-shot/aggregate.sh '&&' ~/path/to/har-dulcify/src/one-shot/questions.sh
# Copy selected files from each dataset.
OUTPUT="$HOME/path/to/output/analysis/$(date -u +%F)" ~/path/to/har-dulcify/src/util/dataset-query.sh $(find . -mindepth 2 -maxdepth 2 -type d) -- echo "--- Entering {} ---" '&&' 'T="$OUTPUT/$(basename "{}")"' '&&' echo '$T' '&&' mkdir -p '$T/' '&&' cp aggregate.disconnect.categories.organizations.json aggregates.analysis.json '*.log' 'failed*' google-gtm-ga-dc.aggregate.json origin-redirects.aggregate.json ratio-buckets.aggregate.json prepared.disconnect.services.analysis.json '$T/'
Example output
Downloading two domains, one of which doesn’t have HTTPS.
$ ~/path/to/har-portent/src/domains/download-and-analyze-https-www-combos.sh 10 only-two-domains.txt
2015-01-31T113812Z start http://
2 /Users/joelpurra/analyze/the/web/only-two-domains.txt
in #1: 2 0:00:00 [26.7k/s] [=================================================================>] 100%
out #1: 2 0:00:12 [ 157m/s] [=================================================================>] 100%
Downloading https://services.disconnect.me/disconnect-plaintext.json
Downloading https://publicsuffix.org/list/effective_tld_names.dat
2015-01-31T113855Z done http://
2015-01-31T113855Z start http://www.
2 /Users/joelpurra/analyze/the/web/only-two-domains.txt
in #1: 2 0:00:00 [23.3k/s] [=================================================================>] 100%
out #1: 2 0:00:12 [ 164m/s] [=================================================================>] 100%
Downloading https://services.disconnect.me/disconnect-plaintext.json
Downloading https://publicsuffix.org/list/effective_tld_names.dat
2015-01-31T113937Z done http://www.
2015-01-31T113937Z start https://
2 /Users/joelpurra/analyze/the/web/only-two-domains.txt
in #1: 2 0:00:00 [26.3k/s] [=================================================================>] 100%
out #1: 2 0:00:12 [ 157m/s] [=================================================================>] 100%
Downloading 1 domains, up to 30 at a time
in #2: 1 0:00:00 [21.1k/s] [=================================================================>] 100%
out #2: 1 0:00:12 [ 163m/s] [=================================================================>] 100%
Downloading 1 domains, up to 50 at a time
in #3: 1 0:00:00 [29.9k/s] [=================================================================>] 100%
out #3: 1 0:00:12 [ 163m/s] [=================================================================>] 100%
Downloading https://services.disconnect.me/disconnect-plaintext.json
Downloading https://publicsuffix.org/list/effective_tld_names.dat
2015-01-31T114019Z done https://
2015-01-31T114019Z start https://www.
2 /Users/joelpurra/analyze/the/web/only-two-domains.txt
in #1: 2 0:00:00 [30.3k/s] [=================================================================>] 100%
out #1: 2 0:00:12 [ 163m/s] [=================================================================>] 100%
Downloading 1 domains, up to 30 at a time
in #2: 1 0:00:00 [32.3k/s] [=================================================================>] 100%
out #2: 1 0:00:12 [ 163m/s] [=================================================================>] 100%
Downloading 1 domains, up to 50 at a time
in #3: 1 0:00:00 [31.7k/s] [=================================================================>] 100%
out #3: 1 0:00:12 [ 163m/s] [=================================================================>] 100%
Downloading https://services.disconnect.me/disconnect-plaintext.json
Downloading https://publicsuffix.org/list/effective_tld_names.dat
2015-01-31T114101Z done https://www.
Original purpose
Built as a component of Joel Purra's master's thesis research, which required downloading a large number of front pages in the .se top-level domain zone to analyze their content and their use of internal/external resources.
Citations
If you use, like, reference, or base work on the thesis report Swedes Online: You Are More Tracked Than You Think, the IEEE LCN 2016 paper Third-party Tracking on the Web: A Swedish Perspective, the open source code, or the open data, please add at least one of the following two citations with a link to the project website: https://joelpurra.com/projects/masters-thesis/
Master’s thesis citation:
Joel Purra. 2015. Swedes Online: You Are More Tracked Than You Think. Master’s thesis. Linköping University (LiU), Linköping, Sweden. https://joelpurra.com/projects/masters-thesis/
IEEE LCN 2016 paper citation:
J. Purra, N. Carlsson, Third-party Tracking on the Web: A Swedish Perspective, Proc. IEEE Conference on Local Computer Networks (LCN), Dubai, UAE, Nov. 2016. https://joelpurra.com/projects/masters-thesis/
Copyright (c) 2014, 2015, 2016, 2017 Joel Purra. Released under GNU General Public License version 3.0 (GPL-3.0).