Using the Pinboard API to archive web pages

30 July 2021

I use the excellent Pinboard bookmarking service to keep track of web pages of interest.

Pinboard has a number of benefits, which is why I use it.

The aim here is to set up a data pipeline of sorts which, triggered via cron, will hold the saved links from the Pinboard API in Postgres, along with the raw text of each web page to allow for searching at some future date. The raw-text stage needs to be done on the desktop, where we have the benefit of a browser that we can control with Selenium WebDriver.

Pinboard does offer a slightly more expensive subscription tier which essentially does a better job of what we are doing here, namely protecting our bookmarks from link rot. So this is provided in the spirit of rolling one's own, so to speak.

We will do the following:

  1. Create some tables
  2. Pull data from the API and populate the tables
  3. Grab text version of website with browser

1. Create some tables

First of all let's set up some tables in Postgres.

-- This will hold the bookmarks data direct from the API
create table raw_pinboard (
	href text not null
		constraint pk_raw_pinboard
			primary key,
	description text,
	extended text,
	meta text,
	hash text,
	time timestamp,
	shared boolean,
	toread boolean,
	tags text,
	id serial not null,
	insert_ts timestamp with time zone default now()
);

-- This will hold the raw text of each web page
create table text (
	href text not null
		constraint pk_text
			primary key
		constraint fk_text
			references raw_pinboard,
	text text,
	id serial not null,
	insert_ts timestamp with time zone default now()
);

-- This will hold the status code and certain headers returned from an http HEAD request
create table head (
	href text not null
		constraint pk_head
			primary key
		constraint fk_head
			references raw_pinboard,
	headers text,
	server text,
	content text,
	ts timestamp,
	status bigint,
	html boolean,
	id serial not null,
	insert_ts timestamp with time zone default now()
);

-- This will hold any URLs which are completely unreachable
create table unreachable (
	href text not null
		constraint pk_unreachable
			primary key
		constraint fk_unreachable
			references raw_pinboard,
	insert_ts timestamp with time zone default now(),
	id serial not null
);

2. Pull data from the API and populate the tables

import requests
import json
import pandas as pd
import urllib.parse
from datetime import datetime
from sqlalchemy import create_engine
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from pathlib import Path

from time import sleep

We keep secrets in a separate file in order to keep them out of the repo—more on this approach here.

key_val_pairs = Path("secrets.env").read_text().replace("\n", ",\n")
exec(f"creds = dict({key_val_pairs})")
engine = create_engine(f"postgresql://{creds['username']}:{creds['password']}@{creds['server']}:5432/{creds['database']}")
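As an aside, the same KEY='value' format can be parsed by hand instead of via exec. This is only a sketch; the inline sample string is a made-up stand-in for the contents of secrets.env:

```python
# Sketch: parse simple KEY='value' lines into a dict without exec.
# The sample string below is a hypothetical stand-in for secrets.env.
sample = "username='me'\npassword='hunter2'\nserver='localhost'"

creds_alt = {}
for line in sample.splitlines():
    # Skip blank lines and comments
    if line.strip() and not line.lstrip().startswith('#'):
        key, _, value = line.partition('=')
        creds_alt[key.strip()] = value.strip().strip("'\"")

print(creds_alt['server'])  # localhost
```

Avoiding exec means a stray line in the secrets file cannot run arbitrary code.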

We ignore certain URLs where it does not make sense to get the text from the page, e.g. PDFs, or sites where authentication is required to get to the article.

ignore_urls = ['pdf'] # further tokens omitted here
ignore_urls = sorted(ignore_urls)
print('Ignoring any urls which contain', ', '.join(ignore_urls))

response.json() is the JSON encoded data returned by the API, which looks like this:

[{'description': 'The Man Who Broke Ticketmaster',
  'extended': '',
  'hash': 'a0ad94eb078e09f4fc37805ebc9fb925',
  'href': '',
  'meta': 'c9ba8a683a53f9b7fb3d30bf1e624568',
  'shared': 'no',
  'tags': '',
  'time': '2021-07-03T15:28:49Z',
  'toread': 'no'},
 {'description': 'DevOps; a decade of confusion and frustration',
  'extended': '',
  'hash': '1ab6d37916689cafd0c8afd4c0ca775d',
  'href': '',
  'meta': 'da3e90d11226ef6a6a1a1f356b3a6b9d',
  'shared': 'no',
  'tags': '',
  'time': '2021-07-01T22:54:39Z',
  'toread': 'yes'},
 {'description': "Is Facebook's Prophet the Time-Series Messiah, or Just a "
                 'Very Naughty Boy?',
  'extended': '',
  'hash': '7f868bbf8586f350261756eab4d25f07',
  'href': '',
  'meta': '236f226e667b7fff948be680458d8c6a',
  'shared': 'no',
  'tags': '',
  'time': '2021-07-01T21:33:24Z',
  'toread': 'no'}]

Which we can put directly into pd.DataFrame to get a DataFrame (df) that looks like this:

|    | href | description | extended | meta | hash | time | shared | toread | tags |
|---:|:-----|:------------|:---------|:-----|:-----|:-----|:-------|:-------|:-----|
|  0 |  | The Man Who Broke Ticketmaster |  | c9ba8a683a53f9b7fb3d30bf1e624568 | a0ad94eb078e09f4fc37805ebc9fb925 | 2021-07-03 15:28:49 | False | False |  |
|  1 |  | DevOps; a decade of confusion and frustration |  | da3e90d11226ef6a6a1a1f356b3a6b9d | 1ab6d37916689cafd0c8afd4c0ca775d | 2021-07-01 22:54:39 | False | True |  |
|  2 |  | Is Facebook’s Prophet the Time-Series Messiah, or Just a Very Naughty Boy? |  | 236f226e667b7fff948be680458d8c6a | 7f868bbf8586f350261756eab4d25f07 | 2021-07-01 21:33:24 | False | False |  |

We then squirrel away any rows in this df that we don’t already have into our raw_pinboard table.

# Get the set of URLs we already have in the table ...

with engine.connect() as conn:
    already_got = set(pd.read_sql('select href from raw_pinboard', conn)['href'])
print(f'{len(already_got)} existing URLs in db')

api_url = f"https://{creds['pb_username']}:{creds['pb_password']}@api.pinboard.in/v1/posts"
# requests' .get function sends an HTTPS GET request; the returned data and certain metadata will be in the response object
response = requests.get(api_url + "/all?format=json")
# Create a dataframe from the returned JSON data
df = pd.DataFrame(response.json())
print(f'{len(df)} from pinboard API')
# The below fixes up the datatypes
df.time = pd.to_datetime(df.time, infer_datetime_format=True)
df.shared = df.shared.apply(lambda x: False if x == 'no' else True)
df.toread = df.toread.apply(lambda x: False if x == 'no' else True)

# Exclude anything we already have
df = df.loc[~df.href.isin(already_got), :]
n = len(df)

if n > 0:
    with engine.connect() as conn:
        (
            df
            .where(pd.notnull(df), None) # Fixes up NULLs
            .to_sql(
                'raw_pinboard', # Table name
                conn,
                if_exists='append', # What to do with data if table already exists
                index=False # Exclude index
            )
        )
    print(f'Inserted {n} new entries from API')
else:
    print('No new info from API since last time')

While we’re at it, we also remove any bookmarks in pinboard that are over a year old, in a bid to keep things tidy.

# Find bookmarks over a year old which we want to delete from pinboard using the API
td =
date_1yr_ago = datetime(td.year-1, td.month,
old_urls_to_remove = list(df.loc[df.time < date_1yr_ago, 'href'])
if len(old_urls_to_remove) > 0:
    old_urls_to_remove_encoded = [urllib.parse.quote(url) for url in old_urls_to_remove]
    print(f'{len(old_urls_to_remove)} are over a year old and will be removed')
    for url, enc_url in zip(old_urls_to_remove, old_urls_to_remove_encoded):
        response = requests.get(api_url + f"/delete?format=json&url={enc_url}")
        if response.json()['result_code'] != 'done':
            print('Got response code', response.json()['result_code'], 'with URL', url)
        else:
            print(f'{url} removed')

Now that we have done that, we are ready to re-query to get the set of URLs whose status code we have not checked and which we have not saved as unreachable; all being well this will approximate the new URLs added to pinboard since we last ran the script.

# Re-read the whole of the raw_pinboard table, since we previously deleted some entries;
# we may as well ignore anything we have already had a look at at this juncture
with engine.connect() as conn:
    query = """
    SELECT href FROM raw_pinboard
    WHERE NOT EXISTS (SELECT 1 FROM head WHERE raw_pinboard.href = head.href)
    AND NOT EXISTS (SELECT 1 FROM unreachable WHERE raw_pinboard.href = unreachable.href)
    """
    df = pd.read_sql(query, conn)

# We remove all urls that contain a token from ignore_urls
mask = ~df['href'].isnull()
n = mask.sum()
for ignore_url in ignore_urls:
    mask &= ~df['href'].str.contains(ignore_url)
df = df.loc[mask, :]
print(f'{n-mask.sum()} removed as they include something in the URL ignore list, leaving {mask.sum()}')

We want to raise an exception if there’s nothing to do and assert is a lazy way of doing this which works both when run in the notebook and in a script.

# Assert that we do actually have something to do
assert len(df) > 0
# Want to do a maximum of 250 only
df = df.sample(min(250, len(df)))
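For the avoidance of doubt, a failing assert raises AssertionError, which halts a notebook cell and a script alike. A trivial illustration with a made-up empty list:

```python
# A failing assert raises AssertionError and stops execution;
# here we catch it just to show the message it carries.
rows = []
try:
    assert len(rows) > 0, 'nothing to do'
    msg = None
except AssertionError as e:
    msg = str(e)

print(msg)  # nothing to do
```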

Here we set up our headers so it looks like a browser request and create a couple of helper functions for HEAD and GET requests, before then trying HEAD requests for each of these URLs.

HEAD requests are more efficient than GET requests as they retrieve less data, but they don’t always work; it depends on configuration.

responses = dict()
failed = set()

# Here we use headers from a real browser as that makes it more likely that we receive a sensible response from the server.
headers = {
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Sec-Fetch-Site": "same-origin",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-User": "?1",
    "Sec-Fetch-Dest": "document",
    "Referer": "",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.9"

def head_wrapper(url):
    r = requests.Session()
    r.headers = headers
    return r.head(url)

def get_wrapper(url):
    r = requests.Session()
    r.headers = headers
    return r.get(url)

print(f'Getting {len(df)} HEAD responses, this may take a while so please be patient')

# This may take a few minutes
for url in df['href']:
    if url not in responses:
        try:
            responses[url] = head_wrapper(url)
        except Exception:
            failed.add(url)

if len(failed) > 0:
    print(', '.join(failed), 'failed')

Once we have done a single pass of HEAD requests we review the status codes, and re-request on the redirected URL where applicable.

The HTTP status codes we handle are mapped to short messages in msgs_d below.

unknown_codez = set()
iteration = 0

msgs_d = {200: 'looks good', 301: 'redirect trying new url',
          302: 'redirect trying new url', 303: 'see other', 307: 'temporary redirect trying new url', 308: 'permanent redirect trying new url',
          400: 'likely missing', 404: 'not found',
          403: 'forbidden', 405: 'method not allowed', 406: 'not acceptable i.e. not going to work with headers',
          500: 'internal server error', 502: 'bad gateway', 503: 'not available', 552: 'timeout'}

def proc_responses(responses):
    """Takes dict of {url (string): response (requests response object)}"""
    all_codez = set()

    for url, response in responses.items():
        all_codez.add(response.status_code)
        if response.status_code in (301, 302, 307, 308):
            new_url = response.headers['Location']
            print(f"{url} {response.status_code} {msgs_d[response.status_code]} {new_url}")
            # Is the new_url a relative path? If so, prepend the scheme and host
            if new_url[0] == '/':
                new_url = url.split('//')[0] + '//' + url.split('//')[1].split('/')[0] + new_url
            responses[url] = head_wrapper(new_url)
        elif response.status_code != 200:
            if response.status_code not in msgs_d:
                unknown_codez.add(response.status_code)
                print(f'{url} {response.status_code} unknown code')
            else:
                print(f'{url} {response.status_code} {msgs_d[response.status_code]}')

    return all_codez

while iteration < 10: # Sanity check
    iteration += 1
    print(f'Following 30x status codes iteration {iteration}')

    ret = proc_responses(responses)
    if 301 not in ret and 302 not in ret and 307 not in ret and 308 not in ret:
        break

if len(unknown_codez) > 0:
    print('Unknown http status codes:', unknown_codez)
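As an aside, the manual host splitting used above to resolve relative Location headers can be done more robustly with the standard library's urllib.parse.urljoin; a quick sketch with made-up URLs:

```python
from urllib.parse import urljoin

# urljoin resolves a redirect target against the original URL, whether the
# target is an absolute path, a relative path, or a full URL.
print(urljoin('https://example.com/a/b', '/c'))               # https://example.com/c
print(urljoin('https://example.com/a/b', 'c'))                # https://example.com/a/c
print(urljoin('https://example.com/a/b', 'https://x.org/y'))  # https://x.org/y
```

This also copes with edge cases like query strings and parent-relative (`../`) paths that the string-splitting approach does not.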

If we still don’t achieve success that way it suggests that HEAD is not allowed by the server and we will have to GET after all …

filtered_responses = {k:v for k,v in responses.items() if v.status_code != 200}

if len(filtered_responses) > 0:
    print('Non-200 responses, trying GET instead of HEAD')

    for url, response in filtered_responses.items():
        if response.status_code in msgs_d:
            print(f'{url} {response.status_code} {msgs_d[response.status_code]}', end=' ')
            new_response = get_wrapper(url)
            if new_response.status_code == 200:
                print('... success with GET')
                responses[url] = new_response
            else:
                print('... no joy with GET')
        else:
            print(f'{url} {response.status_code} (unknown)')

This will prep a df which looks like the below:

|   | href | server | content | ts | status | html |
|--:|:-----|:-------|:--------|:---|-------:|:-----|
| 0 |  | LiteSpeed | text/html;charset=UTF-8 | 2021-07-13 10:45:16 | 200 | True |
| 1 |  | cloudflare | text/html; charset=utf-8 | 2021-07-13 10:45:16 | 200 | True |
| 2 |  | Apache | text/html | 2021-07-13 10:45:16 | 200 | True |
| 3 |  | nan | text/html; charset=utf-8 | 2021-07-13 10:45:19 | 200 | True |
| 4 |  | Apache | text/html; charset=UTF-8 | 2021-07-13 10:49:39 | 200 | True |

Plus another column ‘headers’ which contains the full header info in JSON format.

Once we have that we insert these data and the unreachable URLs into the database.

# URL, raw headers
df_url_headers = pd.DataFrame(pd.Series({k:json.dumps(dict(v.headers)) for k,v in responses.items()}, name='headers'))

# Pull out list of dicts which represent the header data
header_data = [dict(r.headers) for r in responses.values()]
# Make header keys all lower-case
header_data = [{k.lower():v for k,v in header.items()} for header in header_data]
# Common parts of headers
df_common_headers = pd.DataFrame(header_data, index=responses.keys()).loc[:, ['server', 'content-type', 'date']]

# Put them together
df = pd.concat([
    df_url_headers,
    df_common_headers,
    pd.Series({k:v.status_code for k,v in responses.items()}, name='status-code')
], axis=1)

# Fix up col names
df.rename(columns={c:c.split('-')[0] for c in df.columns}, inplace=True)
df.rename(columns={'date':'ts'}, inplace=True)

df['html'] = df['content'].str.lower().str.contains('html')
df.ts = pd.to_datetime(df.ts, infer_datetime_format=True) = 'href'

if len(df) > 0:
    with engine.connect() as conn:
        (
            df
            .where(pd.notnull(df), None) # Fixes up NULLs
            .to_sql(
                'head', # Table name
                conn,
                if_exists='append', # What to do with data if table already exists
                index=True, # The index holds the href
                index_label='href'
            )
        )
    print(f'Put in {len(df)} new status codes')

if len(failed) > 0:
    with engine.connect() as conn:
        (
            pd.DataFrame(pd.Series(sorted(list(failed)), name='href'))
            .to_sql('unreachable', conn, if_exists='append', index=False)
        )
    print(f'Put in {len(failed)} unreachable URLs')
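The column-rename step above leans on str.split: everything after the first '-' is dropped, so e.g. 'content-type' becomes 'content' and 'status-code' becomes 'status'. In isolation:

```python
# Splitting each column name on '-' and keeping only the first part.
cols = ['headers', 'server', 'content-type', 'date', 'status-code']
renamed = [c.split('-')[0] for c in cols]
print(renamed)  # ['headers', 'server', 'content', 'date', 'status']
```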

We then re-read what is reachable, because we have just updated our status code data in the database.

# Reread what is reachable
# note that the penultimate where needs extending for the various status codes in
# select distinct (status) from head order by status;
query = """
SELECT href FROM head
WHERE NOT EXISTS (SELECT 1 FROM text WHERE text.href = head.href)
AND (head.status = 200 OR head.status = 405)
AND head.html = TRUE
"""
with engine.connect() as conn:
    urls = list(pd.read_sql(query, conn)['href'])

text = dict()

def urls_filter(urls_list):
    for exclude in ignore_urls:
        urls_list = [url for url in urls_list if exclude not in url]
    return urls_list

urls = urls_filter(urls)
print(f'{len(urls)} to grab')

3. Grab text version of website with browser

Everything above can be run with a script and a scheduler; the below needs to be run on a machine with a desktop so it can grab the web page with a browser.

The webdriver drivers can be found here. Note that it is very important to get the right version of the webdriver, or it will not work.

Although I could use Brave which is the browser I usually use, I prefer to use Chrome with no user data associated. The use of sleep to wait for 5 seconds on the page before getting the text allows for confirmation of cookie pop-ups etc. Chrome seems to get stuck sometimes—in this case I simply re-run the cell.

driver = webdriver.Chrome(executable_path='./chromedriver-91')

for url in urls:
    if url not in text or text[url] is None:
        driver.get(url)
        sleep(5) # Allow time to confirm cookie pop-ups etc.
        text[url] = '\n\n'.join([s.strip() for s in driver.find_element_by_xpath('//body').text.split('\n')])

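The join/strip in the loop above strips each line's surrounding whitespace and then separates lines with blank lines; on a made-up snippet of page text:

```python
# Normalise page text: strip each line, then rejoin with blank lines between.
raw = 'Title  \n  First paragraph\nSecond paragraph  '
cleaned = '\n\n'.join([s.strip() for s in raw.split('\n')])
print(cleaned)
```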

Once we have the text from the web pages we create a dataframe which looks like the below, and insert it into the database

|   | href | text |
|--:|:-----|:-----|
| 0 |  | Modin\n\nlatest\n\nINSTALLATION\n\nInstallation\n\nGETTING |
| 1 |  | Home\n\nCloud\n\nProducts\n\nSolutions\n\nLearn\n\nCustomers |
| 2 |  | Sign in\n\nGet started\n\nABOUT\n\nTECHNOLOGY\n\nRESOURCES |
| 3 |  | Ten thousand meters\n\nDiving deep, flying high to s |

df = pd.DataFrame(pd.Series(text, name='text')) = 'href'

if len(df) > 0:
    with engine.connect() as conn:
        (
            df
            .where(pd.notnull(df), None) # Fixes up NULLs
            .to_sql(
                'text', # Table name
                conn,
                if_exists='append', # What to do with data if table already exists
                index=True, # The index holds the href
                index_label='href'
            )
        )
    print(f'Text for {len(df)} new URLs inserted')
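Finally, the point of storing the raw text was to allow searching at some future date. One hedged sketch, assuming Postgres's built-in full-text search over the text table defined above (the search terms are made up):

```python
# A sketch of full-text search over the stored page text using Postgres's
# to_tsvector/to_tsquery. We only build the parameterised query string here;
# running it needs the engine from earlier.
search_sql = """
SELECT href
FROM text
WHERE to_tsvector('english', text) @@ to_tsquery('english', %(q)s)
"""

# Usage (not run here):
# with engine.connect() as conn:
#     hits = pd.read_sql(search_sql, conn, params={'q': 'data & pipeline'})
```

For heavier use, a generated tsvector column with a GIN index would avoid recomputing to_tsvector on every query.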