<?xml version="1.0" encoding="utf-8"?><?xml-stylesheet type="text/xsl" href="rss.xsl"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>sangaline.com Blog</title>
        <link>https://sangaline.com/blog</link>
        <description>sangaline.com Blog</description>
        <lastBuildDate>Wed, 05 Apr 2017 10:49:36 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <language>en</language>
        <item>
            <title><![CDATA[Internet Archaeology: Scraping time series data from Archive.org]]></title>
            <link>https://sangaline.com/blog/wayback-machine-scraper</link>
            <guid>https://sangaline.com/blog/wayback-machine-scraper</guid>
            <pubDate>Wed, 05 Apr 2017 10:49:36 GMT</pubDate>
            <description><![CDATA[A guide to scraping historical snapshots of webpages from the Archive.org Wayback Machine.]]></description>
            <content:encoded><![CDATA[<p><em>Skip to <a href="https://github.com/sangaline/wayback-machine-scraper" target="_blank" rel="noopener noreferrer" class="">the Wayback Machine Scraper GitHub repo</a> if you're just looking for the completed command-line utility or the <a href="https://github.com/sangaline/scrapy-wayback-machine" target="_blank" rel="noopener noreferrer" class="">Scrapy middleware</a>.</em>
<em>The article focuses on how the middleware was developed and an interesting use case: looking at time series data from Reddit.</em></p>
<!-- -->
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="introduction">Introduction<a href="https://sangaline.com/blog/wayback-machine-scraper#introduction" class="hash-link" aria-label="Direct link to Introduction" title="Direct link to Introduction" translate="no">​</a></h2>
<p>The <a href="https://archive.org/web" target="_blank" rel="noopener noreferrer" class="">Archive.org Wayback Machine</a> is pretty awe-inspiring.
It's been archiving web pages since 1996 and has amassed <em>284 billion</em> page captures and over 15 petabytes of raw data.
Many of these are sites that are no longer online and their content would have been otherwise lost to time.
For sites that are still around, it can be absolutely fascinating to watch how they've evolved over the years.</p>
<p>Take <a href="http://reddit.com/" target="_blank" rel="noopener noreferrer" class="">Reddit</a> for instance.
You can go back in time and watch it grow from this...</p>
<p><img decoding="async" loading="lazy" alt="Reddit.com in 2002" src="https://sangaline.com/assets/images/reddit-2002-c254c350ed327cb70a80511d5bc4a63f.png" width="800" height="400" class="img_ev3q"></p>
<p>to this...</p>
<p><img decoding="async" loading="lazy" alt="Reddit.com in 2005" src="https://sangaline.com/assets/images/reddit-2005-c48a461a30eb60c61dcb1657f5650dc2.png" width="800" height="400" class="img_ev3q"></p>
<p>to this...</p>
<p><img decoding="async" loading="lazy" alt="Reddit.com in 2016" src="https://sangaline.com/assets/images/reddit-2016-fa9af31a52fc03f384df304277e7de5e.png" width="800" height="400" class="img_ev3q"></p>
<p>with about <em>192 thousand</em> other stops along the way.
That's just absolutely incredible to me (if you agree then <a href="https://archive.org/donate/" target="_blank" rel="noopener noreferrer" class="">please consider donating to help them keep doing what they do</a>).</p>
<p>That's all well and good, but 192 thousand web captures, let alone 284 billion, is far too many for one person to sift through by hand.
There's a lot of interesting data there, but if you want to actually do something with it then you'll need some sort of scraper to collect the data from the Wayback Machine.</p>
<p>What might one call such a thing?
Hmmm, I dunno.
Maybe a...</p>
<p><a href="https://github.com/sangaline/wayback-machine-scraper" target="_blank" rel="noopener noreferrer" class=""><img decoding="async" loading="lazy" alt="Wayback Machine Scraper Logo" src="https://sangaline.com/assets/images/logo-ecfc68b295396b3c51b60b7da6790aa1.png" width="3120" height="612" class="img_ev3q"></a></p>
<p>That's what I called my own <a href="https://github.com/sangaline/wayback-machine-scraper" target="_blank" rel="noopener noreferrer" class="">reusable middleware and command-line utility</a> at least (original, right?).</p>
<p>In this article, I'll walk you through the process of writing it using Python and <a href="https://scrapy.org/" target="_blank" rel="noopener noreferrer" class="">Scrapy</a>.
I should mention that scraping archived pages from the <a href="https://archive.org/web" target="_blank" rel="noopener noreferrer" class="">Wayback Machine</a> isn't exactly a new idea (the official <a href="https://doc.scrapy.org/en/latest/topics/practices.html" target="_blank" rel="noopener noreferrer" class="">Scrapy docs</a> list scraping cached copies of pages under "Common Practices"), but I'm going to try to put a little bit of a twist on it.</p>
<p>If all you wanted to do was fetch a current or historical snapshot then you could just write something like</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token keyword" style="color:#ea6962">def</span><span class="token plain"> </span><span class="token function" style="color:#d8a657">waybackify_url</span><span class="token punctuation">(</span><span class="token plain">url</span><span class="token punctuation">,</span><span class="token plain"> closest_timestamp</span><span class="token operator" style="color:#a89984">=</span><span class="token string" style="color:#89b482">'2017'</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token keyword" style="color:#ea6962">return</span><span class="token plain"> </span><span class="token string-interpolation string" style="color:#89b482">f'https://web.archive.org/web/</span><span class="token string-interpolation interpolation punctuation">{</span><span class="token string-interpolation interpolation">closest_timestamp</span><span class="token string-interpolation interpolation punctuation">}</span><span class="token string-interpolation string" style="color:#89b482">/</span><span class="token string-interpolation interpolation punctuation">{</span><span class="token string-interpolation interpolation">url</span><span class="token string-interpolation interpolation punctuation">}</span><span class="token string-interpolation string" style="color:#89b482">'</span><br></span></code></pre></div></div>
<p>and "waybackify" your URLs before crawling them.
That's great if you just want to avoid rate limits and bans but it sure doesn't make for much of a web scraping tutorial!
Aside from that, it doesn't really help much if you're interested in how a page changes over time.
For example, I wanted to look at all of the available <a href="http://news.ycombinator.com/" target="_blank" rel="noopener noreferrer" class="">Hacker News</a> snapshots when I was writing <a class="" href="https://sangaline.com/blog/reverse-engineering-the-hacker-news-ranking-algorithm">Reverse Engineering the Hacker News Ranking Algorithm</a>.</p>
<p>The most obvious solution in cases like this is to write a spider that crawls the archive index pages</p>
<p><img decoding="async" loading="lazy" alt="The archive.org index for reddit.com" src="https://sangaline.com/assets/images/reddit-archive-index-d267652c0a290218d75f3d13495883c3.png" width="1109" height="896" class="img_ev3q"></p>
<p>and extracts the timestamps from the URLs.
That's definitely doable, but it gets more complicated than it seems at first, and if you put this logic in your spider then chances are that reusing the code won't be trivial.
Scraping time series data is a fairly general problem, so it would be nice if the code could be reused with minimal modifications to the spider code.</p>
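<p>To give a sense of what the timestamp extraction involves, here is a minimal sketch. It assumes snapshot URLs of the form <code>https://web.archive.org/web/{timestamp}/{url}</code> with a 14-digit <code>YYYYMMDDhhmmss</code> timestamp; the helper name <code>parse_wayback_url</code> is hypothetical and isn't part of the finished middleware.</p>

```python
import re
from datetime import datetime

# Assumed URL shape: .../web/<14-digit timestamp><optional flag such as id_>/<original url>
WAYBACK_PATTERN = re.compile(
    r'^https?://web\.archive\.org/web/(?P<timestamp>\d{14})[a-z_]*/(?P<original>.+)$'
)

def parse_wayback_url(snapshot_url):
    """Split a snapshot URL into (capture time, original URL), or return None."""
    match = WAYBACK_PATTERN.match(snapshot_url)
    if match is None:
        return None
    captured_at = datetime.strptime(match.group('timestamp'), '%Y%m%d%H%M%S')
    return captured_at, match.group('original')

print(parse_wayback_url('https://web.archive.org/web/20170405104936/http://reddit.com/'))
```

<p>Even this small helper has to know about URL flags and malformed links, which is the kind of detail that is better kept out of the spider itself.</p>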
<p>A more natural way to approach the problem is to write middleware that does the dirty work and integrates easily with existing code.
I had basically already done this myself but hadn't quite cleaned it up enough that I would feel comfortable open sourcing it.
After recently writing <a class="" href="https://sangaline.com/blog/advanced-web-scraping-tutorial">Advanced Web Scraping</a> and receiving a lot of really positive feedback, I realized that others might find it informative if I did a little walkthrough of how I developed it and made the code available.</p>
<p>Just talking about middleware can get a little bit boring so I'll frame the discussion within the context of trying to analyze time series data from Reddit.
The analysis there won't be particularly deep but there was something pretty interesting that popped out of the data.
It should be more than enough of a starting point to do some internet archaeology of your own!</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="a-simple-reddit-spider">A Simple Reddit Spider<a href="https://sangaline.com/blog/wayback-machine-scraper#a-simple-reddit-spider" class="hash-link" aria-label="Direct link to A Simple Reddit Spider" title="Direct link to A Simple Reddit Spider" translate="no">​</a></h2>
<p>Let's begin by throwing together a very simple spider that just grabs the titles and scores of the stories on the front page of <a href="http://reddit.com/" target="_blank" rel="noopener noreferrer" class="">Reddit</a>.
We'll do that right after we get the boilerplate out of the way by setting up a <a href="https://virtualenv.pypa.io/en/stable/" target="_blank" rel="noopener noreferrer" class="">virtualenv</a>, installing Scrapy, and scaffolding out a default Scrapy project.</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token function" style="color:#d8a657">mkdir</span><span class="token plain"> </span><span class="token parameter variable" style="color:#ea6962">-p</span><span class="token plain"> ~/scrapers/reddit</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token builtin class-name" style="color:#d8a657">cd</span><span class="token plain"> ~/scrapers/reddit</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">virtualenv </span><span class="token function" style="color:#d8a657">env</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token builtin class-name" style="color:#d8a657">.</span><span class="token plain"> env/bin/activate</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">pip </span><span class="token function" style="color:#d8a657">install</span><span class="token plain"> scrapy</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">scrapy startproject reddit_scraper</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token builtin class-name" style="color:#d8a657">cd</span><span class="token plain"> reddit_scraper</span><br></span></code></pre></div></div>
<p>If any of that stuff doesn't make sense to you then you might want to go check out <a href="https://doc.scrapy.org/en/latest/intro/tutorial.html" target="_blank" rel="noopener noreferrer" class="">The Scrapy Tutorial</a> or <a class="" href="https://sangaline.com/blog/advanced-web-scraping-tutorial">The Advanced Web Scraping Tutorial</a> (though you can probably follow along fine just knowing that that sets up a project scaffold).</p>
<p>Now we can add a basic spider to <code>reddit_scraper/spiders/reddit_spider.py</code></p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token keyword" style="color:#ea6962">from</span><span class="token plain"> datetime </span><span class="token keyword" style="color:#ea6962">import</span><span class="token plain"> datetime </span><span class="token keyword" style="color:#ea6962">as</span><span class="token plain"> dt</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token keyword" style="color:#ea6962">import</span><span class="token plain"> scrapy</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token keyword" style="color:#ea6962">class</span><span class="token plain"> </span><span class="token class-name" style="color:#d8a657">RedditSpider</span><span class="token punctuation">(</span><span class="token plain">scrapy</span><span class="token punctuation">.</span><span class="token plain">Spider</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    name </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> </span><span class="token string" style="color:#89b482">'reddit'</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" 
style="color:#ebdbb2"><span class="token plain">    </span><span class="token keyword" style="color:#ea6962">def</span><span class="token plain"> </span><span class="token function" style="color:#d8a657">start_requests</span><span class="token punctuation">(</span><span class="token plain">self</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token keyword" style="color:#ea6962">yield</span><span class="token plain"> scrapy</span><span class="token punctuation">.</span><span class="token plain">Request</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">'http://reddit.com'</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token keyword" style="color:#ea6962">def</span><span class="token plain"> </span><span class="token function" style="color:#d8a657">parse</span><span class="token punctuation">(</span><span class="token plain">self</span><span class="token punctuation">,</span><span class="token plain"> response</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        items </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> </span><span class="token punctuation">[</span><span class="token punctuation">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token keyword" 
style="color:#ea6962">for</span><span class="token plain"> div </span><span class="token keyword" style="color:#ea6962">in</span><span class="token plain"> response</span><span class="token punctuation">.</span><span class="token plain">css</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">'div.sitetable div.thing'</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            </span><span class="token keyword" style="color:#ea6962">try</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">                title </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> div</span><span class="token punctuation">.</span><span class="token plain">css</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">'p.title a::text'</span><span class="token punctuation">)</span><span class="token punctuation">.</span><span class="token plain">extract_first</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">                votes_div </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> div</span><span class="token punctuation">.</span><span class="token plain">css</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">'div.score.unvoted'</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">                votes </span><span class="token operator" 
style="color:#a89984">=</span><span class="token plain"> votes_div</span><span class="token punctuation">.</span><span class="token plain">css</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">'::attr(title)'</span><span class="token punctuation">)</span><span class="token punctuation">.</span><span class="token plain">extract_first</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">                votes </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> votes </span><span class="token keyword" style="color:#ea6962">or</span><span class="token plain"> votes_div</span><span class="token punctuation">.</span><span class="token plain">css</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">'::text'</span><span class="token punctuation">)</span><span class="token punctuation">.</span><span class="token plain">extract_first</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">                items</span><span class="token punctuation">.</span><span class="token plain">append</span><span class="token punctuation">(</span><span class="token punctuation">{</span><span class="token string" style="color:#89b482">'title'</span><span class="token punctuation">:</span><span class="token plain"> title</span><span class="token punctuation">,</span><span class="token plain"> </span><span class="token string" style="color:#89b482">'votes'</span><span class="token punctuation">:</span><span class="token plain"> </span><span class="token 
builtin" style="color:#d8a657">int</span><span class="token punctuation">(</span><span class="token plain">votes</span><span class="token punctuation">)</span><span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            </span><span class="token keyword" style="color:#ea6962">except</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">                </span><span class="token keyword" style="color:#ea6962">pass</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token keyword" style="color:#ea6962">if</span><span class="token plain"> </span><span class="token builtin" style="color:#d8a657">len</span><span class="token punctuation">(</span><span class="token plain">items</span><span class="token punctuation">)</span><span class="token plain"> </span><span class="token operator" style="color:#a89984">&gt;</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">0</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            timestamp </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> response</span><span class="token punctuation">.</span><span class="token plain">meta</span><span class="token punctuation">[</span><span class="token string" style="color:#89b482">'wayback_machine_time'</span><span class="token punctuation">]</span><span class="token punctuation">.</span><span class="token plain">timestamp</span><span 
class="token punctuation">(</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            </span><span class="token keyword" style="color:#ea6962">return</span><span class="token plain"> </span><span class="token punctuation">{</span><span class="token string" style="color:#89b482">'timestamp'</span><span class="token punctuation">:</span><span class="token plain"> timestamp</span><span class="token punctuation">,</span><span class="token plain"> </span><span class="token string" style="color:#89b482">'items'</span><span class="token punctuation">:</span><span class="token plain"> items</span><span class="token punctuation">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span></code></pre></div></div>
<p>This spider is about as simple as they come: it starts at <code>http://reddit.com</code> and doesn't crawl anywhere else from there.
It uses a few CSS selectors to pull out the title and votes for each story on the front page and then attaches a timestamp to them.
If we run the spider with</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token plain">scrapy crawl reddit </span><span class="token parameter variable" style="color:#ea6962">-o</span><span class="token plain"> snapshots.jl</span><br></span></code></pre></div></div>
<p>then it will produce an uglified version of something like</p>
<div class="language-json codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-json codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token punctuation">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">  </span><span class="token property" style="color:#ea6962">"timestamp"</span><span class="token operator" style="color:#a89984">:</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">1491171571.881031</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">  </span><span class="token property" style="color:#ea6962">"items"</span><span class="token operator" style="color:#a89984">:</span><span class="token plain"> </span><span class="token punctuation">[</span><span class="token punctuation">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">      </span><span class="token property" style="color:#ea6962">"title"</span><span class="token operator" style="color:#a89984">:</span><span class="token plain"> </span><span class="token string" style="color:#89b482">"Evidence that WSJ used FAKE screenshots"</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">      </span><span class="token property" style="color:#ea6962">"votes"</span><span class="token operator" style="color:#a89984">:</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">32459</span><span class="token 
plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token punctuation">}</span><span class="token punctuation">,</span><span class="token plain"> </span><span class="token punctuation">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">      </span><span class="token property" style="color:#ea6962">"title"</span><span class="token operator" style="color:#a89984">:</span><span class="token plain"> </span><span class="token string" style="color:#89b482">"Expand the canvas? [MOD APPROVED]"</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">      </span><span class="token property" style="color:#ea6962">"votes"</span><span class="token operator" style="color:#a89984">:</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">16305</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token punctuation">}</span><span class="token punctuation">,</span><span class="token plain"> </span><span class="token punctuation">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">      </span><span class="token property" style="color:#ea6962">"title"</span><span class="token operator" style="color:#a89984">:</span><span class="token plain"> </span><span class="token string" style="color:#89b482">"My sister's cat spazzing out in his new cat tree!"</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">      </span><span class="token property" style="color:#ea6962">"votes"</span><span class="token operator" 
style="color:#a89984">:</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">15815</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token punctuation">}</span><span class="token punctuation">,</span><span class="token plain"> etc.</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">  </span><span class="token punctuation">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token punctuation">}</span><br></span></code></pre></div></div>
<p>in <code>snapshots.jl</code>.</p>
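<p>Each line of a <code>.jl</code> (JSON lines) file is a standalone JSON object, so the output is easy to read back for analysis. Here is a minimal sketch, assuming the <code>timestamp</code> and <code>items</code> fields shown above; the <code>load_snapshots</code> helper and the added <code>time</code> key are hypothetical.</p>

```python
import json
from datetime import datetime, timezone

def load_snapshots(lines):
    """Parse JSON-lines text into snapshot dicts, skipping blank lines."""
    snapshots = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        snapshot = json.loads(line)
        # Assumes `timestamp` is POSIX seconds; store an aware UTC datetime too.
        snapshot['time'] = datetime.fromtimestamp(snapshot['timestamp'], tz=timezone.utc)
        snapshots.append(snapshot)
    return snapshots

# Works on any iterable of lines, e.g. an open file handle for snapshots.jl.
example = [
    '{"timestamp": 1491171571.881031, "items": ['
    '{"title": "Expand the canvas? [MOD APPROVED]", "votes": 16305}, '
    '{"title": "Evidence that WSJ used FAKE screenshots", "votes": 32459}]}',
    '',
]
for snapshot in load_snapshots(example):
    top = max(snapshot['items'], key=lambda item: item['votes'])
    print(snapshot['time'].isoformat(), top['title'], top['votes'])
```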
<p>To track changes over time, we could now set up a cron job to run our scraper at regular intervals.
That's a great way to collect frequently changing data but, much like in the movie <a href="https://www.imdb.com/title/tt0390384/" target="_blank" rel="noopener noreferrer" class="">Primer</a>, you can't go back further than when you first turned it on.
That's where the <a href="https://web.archive.org/" target="_blank" rel="noopener noreferrer" class="">Wayback Machine</a> comes in.</p>
<p>The <a href="https://web.archive.org/" target="_blank" rel="noopener noreferrer" class="">Wayback Machine</a> is basically a much more complicated spider that is saving the entire HTML content of each snapshot.
If we can feed the historical HTML snapshots into our spider and attach the correct timestamps then it will effectively be as though we were running our scraper at those points in time.</p>
<p>The point that I'm trying to make here isn't the obvious one that we can use the HTML snapshots to extract the historical data.
It's that if we connect the snapshots to our spider in the right way then the spider should be none the wiser and things should just work.
This is in contrast to the other approach we discussed, where our spider would need to be aware of the archive index pages, URLs, etc.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="developing-the-middleware">Developing the Middleware<a href="https://sangaline.com/blog/wayback-machine-scraper#developing-the-middleware" class="hash-link" aria-label="Direct link to Developing the Middleware" title="Direct link to Developing the Middleware" translate="no">​</a></h2>
<p>Writing a <a href="https://doc.scrapy.org/en/latest/topics/downloader-middleware.html" target="_blank" rel="noopener noreferrer" class="">Scrapy Downloader Middleware</a> is generally where you'll end up whenever you need to intercept requests and responses to modify or replace them.
Downloader middleware classes implement <code>process_request(request, spider)</code> and <code>process_response(request, response, spider)</code> methods that have a lot of freedom in what they can do.
Let's start piecing together the middleware in <code>reddit_scraper/middlewares.py</code> and it should hopefully become clear exactly how much you can accomplish with this freedom.</p>
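<p>As a rough illustration of that freedom, here is a bare-bones pass-through middleware. It's only a sketch: the class name <code>PassThroughMiddleware</code> and the <code>seen_at</code> meta key are made up for this example, and it deliberately avoids importing Scrapy so that it can be run on its own with stand-in request and response objects.</p>

```python
from datetime import datetime, timezone
from types import SimpleNamespace

class PassThroughMiddleware:
    """Illustrative downloader middleware that only annotates traffic."""

    def process_request(self, request, spider):
        # Returning None tells Scrapy to keep processing the request as usual.
        request.meta['seen_at'] = datetime.now(timezone.utc)
        return None

    def process_response(self, request, response, spider):
        # Returning a response hands it on toward the spider; a middleware
        # could instead return a new request to have that downloaded first.
        return response

# Stand-in objects so the sketch runs without Scrapy installed.
request = SimpleNamespace(url='http://reddit.com', meta={})
response = SimpleNamespace(url=request.url, status=200)
middleware = PassThroughMiddleware()
assert middleware.process_request(request, spider=None) is None
print(middleware.process_response(request, response, spider=None).status)
```

<p>In a real project, a class like this would be enabled through the <code>DOWNLOADER_MIDDLEWARES</code> setting.</p>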
<p>We'll first add the basic initialization that loads our <code>WAYBACK_MACHINE_TIME_RANGE</code> setting and saves the crawler.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token keyword" style="color:#ea6962">import</span><span class="token plain"> json</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token keyword" style="color:#ea6962">from</span><span class="token plain"> datetime </span><span class="token keyword" style="color:#ea6962">import</span><span class="token plain"> datetime </span><span class="token keyword" style="color:#ea6962">as</span><span class="token plain"> dt</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token keyword" style="color:#ea6962">from</span><span class="token plain"> scrapy </span><span class="token keyword" style="color:#ea6962">import</span><span class="token plain"> Request</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token keyword" style="color:#ea6962">from</span><span class="token plain"> scrapy</span><span class="token punctuation">.</span><span class="token plain">http </span><span class="token keyword" style="color:#ea6962">import</span><span class="token plain"> Response</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token keyword" style="color:#ea6962">from</span><span class="token plain"> scrapy</span><span class="token punctuation">.</span><span class="token plain">exceptions </span><span class="token keyword" 
style="color:#ea6962">import</span><span class="token plain"> IgnoreRequest</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token keyword" style="color:#ea6962">class</span><span class="token plain"> </span><span class="token class-name" style="color:#d8a657">UnhandledIgnoreRequest</span><span class="token punctuation">(</span><span class="token plain">IgnoreRequest</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token keyword" style="color:#ea6962">pass</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token keyword" style="color:#ea6962">class</span><span class="token plain"> </span><span class="token class-name" style="color:#d8a657">WaybackMachine</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    cdx_url_template </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> </span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">'http://web.archive.org/cdx/search/cdx?url={url}'</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">                    </span><span class="token string" style="color:#89b482">'&amp;output=json&amp;fl=timestamp,original,statuscode,digest'</span><span class="token punctuation">)</span><span class="token 
plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    snapshot_url_template </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> </span><span class="token string" style="color:#89b482">'http://web.archive.org/web/{timestamp}id_/{original}'</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token keyword" style="color:#ea6962">def</span><span class="token plain"> </span><span class="token function" style="color:#d8a657">__init__</span><span class="token punctuation">(</span><span class="token plain">self</span><span class="token punctuation">,</span><span class="token plain"> crawler</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        self</span><span class="token punctuation">.</span><span class="token plain">crawler </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> crawler</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token comment" style="color:#a89984"># read the settings</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        self</span><span class="token punctuation">.</span><span class="token plain">time_range </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> crawler</span><span class="token punctuation">.</span><span class="token plain">settings</span><span 
class="token punctuation">.</span><span class="token plain">get</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">'WAYBACK_MACHINE_TIME_RANGE'</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token decorator annotation punctuation">@classmethod</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token keyword" style="color:#ea6962">def</span><span class="token plain"> </span><span class="token function" style="color:#d8a657">from_crawler</span><span class="token punctuation">(</span><span class="token plain">cls</span><span class="token punctuation">,</span><span class="token plain"> crawler</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token keyword" style="color:#ea6962">return</span><span class="token plain"> cls</span><span class="token punctuation">(</span><span class="token plain">crawler</span><span class="token punctuation">)</span><br></span></code></pre></div></div>
<p>There are a few other things in here (the extra imports, the <code>UnhandledIgnoreRequest</code> exception, and the two URL templates) but they aren't doing anything yet, so just ignore them for now; we'll get to them shortly.
To actually turn this on, we'll also have to add a couple of settings to <code>reddit_scraper/settings.py</code>.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token comment" style="color:#a89984"># enable the middleware</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">DOWNLOADER_MIDDLEWARES </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> </span><span class="token punctuation">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token string" style="color:#89b482">'reddit_scraper.middlewares.WaybackMachine'</span><span class="token punctuation">:</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">50</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token punctuation">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token comment" style="color:#a89984"># only consider snapshots during the year of 2016</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">WAYBACK_MACHINE_TIME_RANGE </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> </span><span class="token punctuation">(</span><span class="token number" 
style="color:#d3869b">20160101000000</span><span class="token punctuation">,</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">20170101000000</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token comment" style="color:#a89984"># be bad but not too bad</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">ROBOTSTXT_OBEY </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> </span><span class="token boolean" style="color:#ea6962">False</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">DOWNLOAD_DELAY </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">5</span><br></span></code></pre></div></div>
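<p>The two integers in <code>WAYBACK_MACHINE_TIME_RANGE</code> use the same 14-digit <code>YYYYMMDDhhmmss</code> form as the Wayback Machine's snapshot timestamps. If you'd rather not type them out by hand, a small helper can build them from ordinary <code>datetime</code> objects. This is only a sketch: <code>to_wayback_timestamp</code> is a hypothetical name that isn't part of the middleware.</p>

```python
from datetime import datetime as dt


def to_wayback_timestamp(moment):
    """Format a datetime as a 14-digit YYYYMMDDhhmmss integer."""
    return int(moment.strftime('%Y%m%d%H%M%S'))


# the same range as in settings.py: the whole of 2016
WAYBACK_MACHINE_TIME_RANGE = (
    to_wayback_timestamp(dt(2016, 1, 1)),
    to_wayback_timestamp(dt(2017, 1, 1)),
)
print(WAYBACK_MACHINE_TIME_RANGE)  # (20160101000000, 20170101000000)
```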
<p>Our middleware is now enabled but we need to implement request and response processing to actually make it useful.
The request processing is the simpler of the two: we'll let any <a href="https://web.archive.org/" target="_blank" rel="noopener noreferrer" class="">web.archive.org</a> requests through without modification and for everything else we'll construct a request to the Wayback Machine's <a href="https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md" target="_blank" rel="noopener noreferrer" class="">public CDX Server API</a>.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token keyword" style="color:#ea6962">def</span><span class="token plain"> </span><span class="token function" style="color:#d8a657">process_request</span><span class="token punctuation">(</span><span class="token plain">self</span><span class="token punctuation">,</span><span class="token plain"> request</span><span class="token punctuation">,</span><span class="token plain"> spider</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token comment" style="color:#a89984"># let any web.archive.org requests pass through</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token keyword" style="color:#ea6962">if</span><span class="token plain"> request</span><span class="token punctuation">.</span><span class="token plain">url</span><span class="token punctuation">.</span><span class="token plain">find</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">'http://web.archive.org/'</span><span class="token punctuation">)</span><span class="token plain"> </span><span class="token operator" style="color:#a89984">==</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">0</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" 
style="color:#ebdbb2"><span class="token plain">            </span><span class="token keyword" style="color:#ea6962">return</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token comment" style="color:#a89984"># otherwise request a CDX listing of available snapshots</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token keyword" style="color:#ea6962">return</span><span class="token plain"> self</span><span class="token punctuation">.</span><span class="token plain">build_cdx_request</span><span class="token punctuation">(</span><span class="token plain">request</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token keyword" style="color:#ea6962">def</span><span class="token plain"> </span><span class="token function" style="color:#d8a657">build_cdx_request</span><span class="token punctuation">(</span><span class="token plain">self</span><span class="token punctuation">,</span><span class="token plain"> request</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        cdx_url </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> self</span><span class="token punctuation">.</span><span class="token plain">cdx_url_template</span><span class="token punctuation">.</span><span class="token builtin" 
style="color:#d8a657">format</span><span class="token punctuation">(</span><span class="token plain">url</span><span class="token operator" style="color:#a89984">=</span><span class="token plain">request</span><span class="token punctuation">.</span><span class="token plain">url</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        cdx_request </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> Request</span><span class="token punctuation">(</span><span class="token plain">cdx_url</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        cdx_request</span><span class="token punctuation">.</span><span class="token plain">meta</span><span class="token punctuation">[</span><span class="token string" style="color:#89b482">'original_request'</span><span class="token punctuation">]</span><span class="token plain"> </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> request</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        cdx_request</span><span class="token punctuation">.</span><span class="token plain">meta</span><span class="token punctuation">[</span><span class="token string" style="color:#89b482">'wayback_machine_cdx_request'</span><span class="token punctuation">]</span><span class="token plain"> </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> </span><span class="token boolean" style="color:#ea6962">True</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token keyword" style="color:#ea6962">return</span><span class="token plain"> 
cdx_request</span><br></span></code></pre></div></div>
<p>Returning a new request aborts the current request processing and sends the new request into the downloader pipeline.
The new request passes through our <code>WaybackMachine</code> middleware unscathed this time because the URL starts with <code>http://web.archive.org/</code>.
It (hopefully) makes it all the way to the CDX server, which will provide what is basically the computer-friendly version of the archive index pages.
Specifically, the CDX server will return a JSON file including the timestamp, original URL, and status code of each snapshot, as well as a hash of the snapshot content.
That will look something like</p>
<div class="language-json codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-json codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token punctuation">[</span><span class="token punctuation">[</span><span class="token string" style="color:#89b482">"timestamp"</span><span class="token punctuation">,</span><span class="token string" style="color:#89b482">"original"</span><span class="token punctuation">,</span><span class="token string" style="color:#89b482">"statuscode"</span><span class="token punctuation">,</span><span class="token string" style="color:#89b482">"digest"</span><span class="token punctuation">]</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token punctuation">[</span><span class="token string" style="color:#89b482">"20020718215101"</span><span class="token punctuation">,</span><span class="token plain"> </span><span class="token string" style="color:#89b482">"http://reddit.com:80/"</span><span class="token punctuation">,</span><span class="token plain"> </span><span class="token string" style="color:#89b482">"200"</span><span class="token punctuation">,</span><span class="token plain"> </span><span class="token string" style="color:#89b482">"VNG6YBPFVMBWJPRETPQX45QEHDHXFOFD"</span><span class="token punctuation">]</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token punctuation">[</span><span class="token string" style="color:#89b482">"20020802023739"</span><span class="token punctuation">,</span><span class="token plain"> 
</span><span class="token string" style="color:#89b482">"http://reddit.com:80/"</span><span class="token punctuation">,</span><span class="token plain"> </span><span class="token string" style="color:#89b482">"200"</span><span class="token punctuation">,</span><span class="token plain"> </span><span class="token string" style="color:#89b482">"VNG6YBPFVMBWJPRETPQX45QEHDHXFOFD"</span><span class="token punctuation">]</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token punctuation">[</span><span class="token string" style="color:#89b482">"20020923101504"</span><span class="token punctuation">,</span><span class="token plain"> </span><span class="token string" style="color:#89b482">"http://reddit.com:80/"</span><span class="token punctuation">,</span><span class="token plain"> </span><span class="token string" style="color:#89b482">"200"</span><span class="token punctuation">,</span><span class="token plain"> </span><span class="token string" style="color:#89b482">"VNG6YBPFVMBWJPRETPQX45QEHDHXFOFD"</span><span class="token punctuation">]</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">etc.</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token punctuation">]</span><br></span></code></pre></div></div>
<p>and, like all responses, will make its way back through the downloader middleware.
Our spider wouldn't know what to do with it, though, so we need to intercept the response before it gets that far.
We can do this by implementing <code>process_response(request, response, spider)</code>.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token keyword" style="color:#ea6962">def</span><span class="token plain"> </span><span class="token function" style="color:#d8a657">process_response</span><span class="token punctuation">(</span><span class="token plain">self</span><span class="token punctuation">,</span><span class="token plain"> request</span><span class="token punctuation">,</span><span class="token plain"> response</span><span class="token punctuation">,</span><span class="token plain"> spider</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        meta </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> request</span><span class="token punctuation">.</span><span class="token plain">meta</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token comment" style="color:#a89984"># parse CDX requests and schedule future snapshot requests</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token keyword" style="color:#ea6962">if</span><span class="token plain"> meta</span><span class="token punctuation">.</span><span class="token plain">get</span><span class="token punctuation">(</span><span class="token 
string" style="color:#89b482">'wayback_machine_cdx_request'</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            snapshot_requests </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> self</span><span class="token punctuation">.</span><span class="token plain">build_snapshot_requests</span><span class="token punctuation">(</span><span class="token plain">response</span><span class="token punctuation">,</span><span class="token plain"> meta</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            </span><span class="token comment" style="color:#a89984"># schedule all of the snapshots</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            </span><span class="token keyword" style="color:#ea6962">for</span><span class="token plain"> snapshot_request </span><span class="token keyword" style="color:#ea6962">in</span><span class="token plain"> snapshot_requests</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">                self</span><span class="token punctuation">.</span><span class="token plain">crawler</span><span class="token punctuation">.</span><span class="token plain">engine</span><span class="token punctuation">.</span><span class="token plain">schedule</span><span class="token punctuation">(</span><span class="token plain">snapshot_request</span><span class="token punctuation">,</span><span class="token plain"> spider</span><span 
class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            </span><span class="token comment" style="color:#a89984"># abort this request</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            </span><span class="token keyword" style="color:#ea6962">raise</span><span class="token plain"> UnhandledIgnoreRequest</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token comment" style="color:#a89984"># clean up snapshot responses</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token keyword" style="color:#ea6962">if</span><span class="token plain"> meta</span><span class="token punctuation">.</span><span class="token plain">get</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">'original_request'</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            </span><span class="token keyword" style="color:#ea6962">return</span><span class="token plain"> response</span><span class="token punctuation">.</span><span class="token plain">replace</span><span class="token punctuation">(</span><span class="token plain">url</span><span class="token operator" style="color:#a89984">=</span><span class="token plain">meta</span><span class="token punctuation">[</span><span class="token string" 
style="color:#89b482">'original_request'</span><span class="token punctuation">]</span><span class="token punctuation">.</span><span class="token plain">url</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token keyword" style="color:#ea6962">return</span><span class="token plain"> response</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span></code></pre></div></div>
<p>We start out by grabbing the meta information on the request (you may have already noticed that we had attached some to our CDX request in <code>build_cdx_request(request)</code>).
This meta information is then used to determine whether this is a response to a CDX request; if it is, then we parse it to construct requests for the individual snapshots, schedule these with the Scrapy engine, and abort the request by raising the <code>UnhandledIgnoreRequest</code> exception that we defined earlier.
The last bit of code rewrites the URL of each snapshot response to match that of the original request, so that the spider doesn't have to know where the snapshot responses actually came from.</p>
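<p>To make that URL swap concrete, here is a rough standalone sketch of the round trip using plain strings instead of Scrapy objects. The template is the <code>snapshot_url_template</code> from the middleware and the row values come from the CDX listing above; <code>build_snapshot_url</code> and <code>split_snapshot_url</code> are hypothetical helpers made up purely for illustration.</p>

```python
# the template is copied from the middleware class above
snapshot_url_template = 'http://web.archive.org/web/{timestamp}id_/{original}'
snapshot_prefix = 'http://web.archive.org/web/'


def build_snapshot_url(timestamp, original):
    """Point a CDX row at its archived snapshot (hypothetical helper)."""
    return snapshot_url_template.format(timestamp=timestamp, original=original)


def split_snapshot_url(snapshot_url):
    """Recover (timestamp, original URL) from a snapshot URL (hypothetical helper)."""
    if not snapshot_url.startswith(snapshot_prefix):
        raise ValueError('not a snapshot URL: ' + snapshot_url)
    remainder = snapshot_url[len(snapshot_prefix):]
    timestamp, separator, original = remainder.partition('id_/')
    if not separator:
        raise ValueError('missing id_ marker: ' + snapshot_url)
    return timestamp, original


snapshot_url = build_snapshot_url('20020718215101', 'http://reddit.com:80/')
print(snapshot_url)
# http://web.archive.org/web/20020718215101id_/http://reddit.com:80/
print(split_snapshot_url(snapshot_url))
# ('20020718215101', 'http://reddit.com:80/')
```

<p>The middleware never has to do the splitting step itself, since it carries the original request along in <code>meta['original_request']</code> and simply reads the URL from there.</p>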
<p>The final piece of the puzzle is to implement the code for actually parsing the CDX responses and building the snapshot requests.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token keyword" style="color:#ea6962">def</span><span class="token plain"> </span><span class="token function" style="color:#d8a657">build_snapshot_requests</span><span class="token punctuation">(</span><span class="token plain">self</span><span class="token punctuation">,</span><span class="token plain"> response</span><span class="token punctuation">,</span><span class="token plain"> meta</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token comment" style="color:#a89984"># parse the CDX snapshot data</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        data </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> json</span><span class="token punctuation">.</span><span class="token plain">loads</span><span class="token punctuation">(</span><span class="token plain">response</span><span class="token punctuation">.</span><span class="token plain">text</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        keys</span><span class="token punctuation">,</span><span class="token plain"> rows </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> data</span><span class="token punctuation">[</span><span 
class="token number" style="color:#d3869b">0</span><span class="token punctuation">]</span><span class="token punctuation">,</span><span class="token plain"> data</span><span class="token punctuation">[</span><span class="token number" style="color:#d3869b">1</span><span class="token punctuation">:</span><span class="token punctuation">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token keyword" style="color:#ea6962">def</span><span class="token plain"> </span><span class="token function" style="color:#d8a657">build_dict</span><span class="token punctuation">(</span><span class="token plain">row</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            new_dict </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> </span><span class="token punctuation">{</span><span class="token punctuation">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            </span><span class="token keyword" style="color:#ea6962">for</span><span class="token plain"> i</span><span class="token punctuation">,</span><span class="token plain"> key </span><span class="token keyword" style="color:#ea6962">in</span><span class="token plain"> </span><span class="token builtin" style="color:#d8a657">enumerate</span><span class="token punctuation">(</span><span class="token plain">keys</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">                new_dict</span><span class="token punctuation">[</span><span class="token plain">key</span><span class="token 
punctuation">]</span><span class="token plain"> </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> row</span><span class="token punctuation">[</span><span class="token plain">i</span><span class="token punctuation">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            </span><span class="token keyword" style="color:#ea6962">return</span><span class="token plain"> new_dict</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        snapshots </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> </span><span class="token builtin" style="color:#d8a657">list</span><span class="token punctuation">(</span><span class="token builtin" style="color:#d8a657">map</span><span class="token punctuation">(</span><span class="token plain">build_dict</span><span class="token punctuation">,</span><span class="token plain"> rows</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token comment" style="color:#a89984"># construct the requests</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        snapshot_requests </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> </span><span class="token punctuation">[</span><span class="token punctuation">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token keyword" style="color:#ea6962">for</span><span class="token plain"> snapshot 
</span><span class="token keyword" style="color:#ea6962">in</span><span class="token plain"> snapshots</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            </span><span class="token comment" style="color:#a89984"># ignore snapshots outside of the time range</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            </span><span class="token keyword" style="color:#ea6962">if</span><span class="token plain"> </span><span class="token keyword" style="color:#ea6962">not</span><span class="token plain"> </span><span class="token punctuation">(</span><span class="token plain">self</span><span class="token punctuation">.</span><span class="token plain">time_range</span><span class="token punctuation">[</span><span class="token number" style="color:#d3869b">0</span><span class="token punctuation">]</span><span class="token plain"> </span><span class="token operator" style="color:#a89984">&lt;</span><span class="token plain"> </span><span class="token builtin" style="color:#d8a657">int</span><span class="token punctuation">(</span><span class="token plain">snapshot</span><span class="token punctuation">[</span><span class="token string" style="color:#89b482">'timestamp'</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token plain"> </span><span class="token operator" style="color:#a89984">&lt;</span><span class="token plain"> self</span><span class="token punctuation">.</span><span class="token plain">time_range</span><span class="token punctuation">[</span><span class="token number" style="color:#d3869b">1</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span 
class="token plain">                </span><span class="token keyword" style="color:#ea6962">continue</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            </span><span class="token comment" style="color:#a89984"># update the url to point to the snapshot</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            url </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> self</span><span class="token punctuation">.</span><span class="token plain">snapshot_url_template</span><span class="token punctuation">.</span><span class="token builtin" style="color:#d8a657">format</span><span class="token punctuation">(</span><span class="token operator" style="color:#a89984">**</span><span class="token plain">snapshot</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            original_request </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> meta</span><span class="token punctuation">[</span><span class="token string" style="color:#89b482">'original_request'</span><span class="token punctuation">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            snapshot_request </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> original_request</span><span class="token punctuation">.</span><span class="token plain">replace</span><span class="token punctuation">(</span><span class="token plain">url</span><span class="token operator" style="color:#a89984">=</span><span class="token plain">url</span><span 
class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            </span><span class="token comment" style="color:#a89984"># attach extension-specific metadata to the request</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            snapshot_request</span><span class="token punctuation">.</span><span class="token plain">meta</span><span class="token punctuation">.</span><span class="token plain">update</span><span class="token punctuation">(</span><span class="token punctuation">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">                </span><span class="token string" style="color:#89b482">'original_request'</span><span class="token punctuation">:</span><span class="token plain"> original_request</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">                </span><span class="token string" style="color:#89b482">'wayback_machine_url'</span><span class="token punctuation">:</span><span class="token plain"> snapshot_request</span><span class="token punctuation">.</span><span class="token plain">url</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">                </span><span class="token string" style="color:#89b482">'wayback_machine_time'</span><span class="token punctuation">:</span><span class="token plain"> dt</span><span class="token punctuation">.</span><span class="token plain">strptime</span><span class="token punctuation">(</span><span class="token 
plain">snapshot</span><span class="token punctuation">[</span><span class="token string" style="color:#89b482">'timestamp'</span><span class="token punctuation">]</span><span class="token punctuation">,</span><span class="token plain"> </span><span class="token string" style="color:#89b482">'%Y%m%d%H%M%S'</span><span class="token punctuation">)</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            </span><span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            snapshot_requests</span><span class="token punctuation">.</span><span class="token plain">append</span><span class="token punctuation">(</span><span class="token plain">snapshot_request</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token keyword" style="color:#ea6962">return</span><span class="token plain"> snapshot_requests</span><br></span></code></pre></div></div>
<p>You can see here that our snapshot requests are built using <code>original_request</code> as a base.
This means that they still have the original callbacks attached and are otherwise identical except for the extra meta data that we attach (and the temporarily different <code>url</code> property).
Additionally, our <code>snapshot_url_template</code> uses a lesser-known feature of the Wayback Machine API that allows us to get the original raw page content instead of the one with modified links and added content.
After we switch <code>response.url</code> back to <code>original_request.url</code> in <code>process_response(request, response, spider)</code>, the response will only be distinguishable from one coming from the original server in that it has the additional <code>wayback_machine_url</code> and <code>wayback_machine_time</code> meta data attached.
This will all come in handy when it comes time to integrate the middleware with our spider as we'll do momentarily.</p>
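<p>To make the URL rewriting concrete, here is a minimal standalone sketch of turning CDX rows into raw-content snapshot URLs, outside of Scrapy.
The template string, the <code>build_snapshot_urls</code> helper, and the sample rows are assumptions for illustration rather than the middleware's actual code; the <code>id_</code> suffix after the timestamp is the Wayback Machine modifier that requests the page as it was originally archived.</p>

```python
# Assumed template: the "id_" suffix asks the Wayback Machine for the
# original page without rewritten links or the injected toolbar.
SNAPSHOT_URL_TEMPLATE = 'https://web.archive.org/web/{timestamp}id_/{original}'


def build_snapshot_urls(data, time_range):
    """Map CDX rows (header row first) to snapshot URLs inside time_range."""
    keys, rows = data[0], data[1:]
    snapshots = [dict(zip(keys, row)) for row in rows]
    start, end = time_range
    return [
        SNAPSHOT_URL_TEMPLATE.format(**snapshot)
        for snapshot in snapshots
        # exclusive bounds, mirroring the strict comparison in the middleware
        if int(snapshot['timestamp']) in range(start + 1, end)
    ]


# Hypothetical CDX response: a header row followed by one row per snapshot.
sample = [
    ['timestamp', 'original'],
    ['20151231235959', 'http://reddit.com/'],
    ['20160615120000', 'http://reddit.com/'],
]
urls = build_snapshot_urls(sample, (20160101000000, 20170101000000))
print(urls)
```

<p>Only the second sample row falls inside 2016, so only one snapshot URL is produced.</p>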
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="putting-it-all-together">Putting It All Together<a href="https://sangaline.com/blog/wayback-machine-scraper#putting-it-all-together" class="hash-link" aria-label="Direct link to Putting It All Together" title="Direct link to Putting It All Together" translate="no">​</a></h2>
<p>We designed our middleware so that pretty much any existing spider should "just work", and now we get to reap the benefits of that.
No modifications of any generated requests are required and the only evidence that the responses were fetched from <a href="https://archive.org/" target="_blank" rel="noopener noreferrer" class="">archive.org</a> should be the additional meta data and some minor header differences.
Indeed, we could run our scraper again now and it would successfully parse all of the available snapshots.
The only problem is that the timestamps would be wrong because we're populating them with <code>datetime.datetime.now()</code>.</p>
<p>To remedy this, we simply replace</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token keyword" style="color:#ea6962">return</span><span class="token plain"> </span><span class="token punctuation">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            </span><span class="token string" style="color:#89b482">'timestamp'</span><span class="token punctuation">:</span><span class="token plain"> dt</span><span class="token punctuation">.</span><span class="token plain">now</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span><span class="token plain">timestamp</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            </span><span class="token string" style="color:#89b482">'items'</span><span class="token punctuation">:</span><span class="token plain"> items</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token punctuation">}</span><br></span></code></pre></div></div>
<p>with</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token keyword" style="color:#ea6962">return</span><span class="token plain"> </span><span class="token punctuation">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            </span><span class="token string" style="color:#89b482">'timestamp'</span><span class="token punctuation">:</span><span class="token plain"> response</span><span class="token punctuation">.</span><span class="token plain">meta</span><span class="token punctuation">[</span><span class="token string" style="color:#89b482">'wayback_machine_time'</span><span class="token punctuation">]</span><span class="token punctuation">.</span><span class="token plain">timestamp</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            </span><span class="token string" style="color:#89b482">'items'</span><span class="token punctuation">:</span><span class="token plain"> items</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token punctuation">}</span><br></span></code></pre></div></div>
<p>in <code>reddit_scraper/spiders/reddit_spider.py</code>.
Running the crawler with <code>scrapy crawl reddit_scraper -o snapshots.jl</code> should now yield items for every front page snapshot from 2016!</p>
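<p>One caveat worth keeping in mind: Wayback Machine timestamps are in UTC, while calling <code>.timestamp()</code> on a naive <code>datetime</code> interprets it in the local time zone of the machine running the crawl.
If that matters for your analysis, a timezone-safe conversion might look like the sketch below, where <code>wayback_to_epoch</code> is a hypothetical helper name and not part of the middleware.</p>

```python
from datetime import datetime, timezone


def wayback_to_epoch(timestamp):
    """Convert a 14-digit Wayback Machine timestamp (UTC) to epoch seconds."""
    parsed = datetime.strptime(timestamp, '%Y%m%d%H%M%S')
    # Mark the naive datetime as UTC so the result does not depend on the
    # local time zone of the machine running the crawl.
    return parsed.replace(tzinfo=timezone.utc).timestamp()


epoch = wayback_to_epoch('20161207000000')
print(epoch)
print(datetime.fromtimestamp(epoch, tz=timezone.utc).isoformat())
```

<p>Reading the value back with an explicit <code>tz=timezone.utc</code> round-trips to the same wall-clock time regardless of where the script runs.</p>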
<p>We now have the data in a nice structured format and can finally get to the fun part.
Let's start by loading the JSON Lines file back into Python.</p>
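<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token keyword" style="color:#ea6962">from</span><span class="token plain"> datetime </span><span class="token keyword" style="color:#ea6962">import</span><span class="token plain"> datetime </span><span class="token keyword" style="color:#ea6962">as</span><span class="token plain"> dt</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token keyword" style="color:#ea6962">import</span><span class="token plain"> json</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token keyword" style="color:#ea6962">import</span><span class="token plain"> matplotlib</span><span class="token punctuation">.</span><span class="token plain">pyplot </span><span class="token keyword" style="color:#ea6962">as</span><span class="token plain"> plt</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token keyword" style="color:#ea6962">import</span><span class="token plain"> numpy </span><span class="token keyword" style="color:#ea6962">as</span><span class="token plain"> np</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token comment" style="color:#a89984"># figure size in inches (assumed; the original value is not shown)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">single_figsize </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> </span><span class="token punctuation">(</span><span class="token number" style="color:#d3869b">6</span><span class="token punctuation">,</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">4</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token comment" style="color:#a89984"># load in the data</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">times</span><span class="token punctuation">,</span><span class="token plain"> median_scores </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> </span><span class="token punctuation">[</span><span class="token punctuation">]</span><span class="token punctuation">,</span><span class="token plain"> </span><span class="token punctuation">[</span><span class="token 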
punctuation">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token keyword" style="color:#ea6962">with</span><span class="token plain"> </span><span class="token builtin" style="color:#d8a657">open</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">'snapshots.jl'</span><span class="token punctuation">,</span><span class="token plain"> </span><span class="token string" style="color:#89b482">'r'</span><span class="token punctuation">)</span><span class="token plain"> </span><span class="token keyword" style="color:#ea6962">as</span><span class="token plain"> f</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token keyword" style="color:#ea6962">for</span><span class="token plain"> line </span><span class="token keyword" style="color:#ea6962">in</span><span class="token plain"> f</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        row </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> json</span><span class="token punctuation">.</span><span class="token plain">loads</span><span class="token punctuation">(</span><span class="token plain">line</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        scores </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> </span><span class="token punctuation">[</span><span class="token plain">item</span><span class="token punctuation">[</span><span class="token string" style="color:#89b482">'votes'</span><span class="token punctuation">]</span><span 
class="token plain"> </span><span class="token keyword" style="color:#ea6962">for</span><span class="token plain"> item </span><span class="token keyword" style="color:#ea6962">in</span><span class="token plain"> row</span><span class="token punctuation">[</span><span class="token string" style="color:#89b482">'items'</span><span class="token punctuation">]</span><span class="token punctuation">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        time </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> dt</span><span class="token punctuation">.</span><span class="token plain">utcfromtimestamp</span><span class="token punctuation">(</span><span class="token plain">row</span><span class="token punctuation">[</span><span class="token string" style="color:#89b482">'timestamp'</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        times</span><span class="token punctuation">.</span><span class="token plain">append</span><span class="token punctuation">(</span><span class="token plain">time</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        median_scores</span><span class="token punctuation">.</span><span class="token plain">append</span><span class="token punctuation">(</span><span class="token plain">np</span><span class="token punctuation">.</span><span class="token plain">median</span><span class="token punctuation">(</span><span class="token plain">scores</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" 
style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token comment" style="color:#a89984"># plot the data</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">fig </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> plt</span><span class="token punctuation">.</span><span class="token plain">figure</span><span class="token punctuation">(</span><span class="token plain">figsize</span><span class="token operator" style="color:#a89984">=</span><span class="token plain">single_figsize</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">ax </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> fig</span><span class="token punctuation">.</span><span class="token plain">add_subplot</span><span class="token punctuation">(</span><span class="token number" style="color:#d3869b">1</span><span class="token punctuation">,</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">1</span><span class="token punctuation">,</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">1</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">ax</span><span class="token punctuation">.</span><span class="token plain">plot</span><span class="token punctuation">(</span><span class="token plain">times</span><span class="token punctuation">,</span><span class="token plain"> median_scores</span><span class="token punctuation">,</span><span class="token plain"> </span><span class="token string" style="color:#89b482">'o'</span><span class="token punctuation">,</span><span class="token 
plain"> ms</span><span class="token operator" style="color:#a89984">=</span><span class="token number" style="color:#d3869b">1</span><span class="token punctuation">)</span><br></span></code></pre></div></div>
<p>We're computing the median scores here instead of the average scores in order to suppress the noise of unusually popular stories.
The median score should be fairly representative of some convolution of Reddit site traffic and their scoring algorithm.
Now let's take a look at how the median scores change over time by plotting them.</p>
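<p>As a quick aside on the choice of statistic, the toy example below shows how a single viral story drags the mean far away from a typical score while barely moving the median.
The scores are made up for illustration and are not taken from the scraped data.</p>

```python
from statistics import mean, median

# Hypothetical front page scores: typical stories plus one viral outlier.
scores = [3100, 3400, 3600, 3900, 4200, 95000]

print(mean(scores))    # pulled far upwards by the single outlier
print(median(scores))  # stays close to the typical story
```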
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token keyword" style="color:#ea6962">import</span><span class="token plain"> matplotlib</span><span class="token punctuation">.</span><span class="token plain">pyplot </span><span class="token keyword" style="color:#ea6962">as</span><span class="token plain"> plt</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token comment" style="color:#a89984"># label the months</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">xticks </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> </span><span class="token punctuation">[</span><span class="token plain">dt</span><span class="token punctuation">(</span><span class="token number" style="color:#d3869b">2016</span><span class="token punctuation">,</span><span class="token plain"> i </span><span class="token operator" style="color:#a89984">+</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">1</span><span class="token punctuation">,</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">18</span><span class="token punctuation">)</span><span class="token plain"> </span><span class="token keyword" style="color:#ea6962">for</span><span class="token plain"> i </span><span class="token keyword" style="color:#ea6962">in</span><span class="token plain"> </span><span class="token builtin" 
style="color:#d8a657">range</span><span class="token punctuation">(</span><span class="token number" style="color:#d3869b">12</span><span class="token punctuation">)</span><span class="token punctuation">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">xticklabels </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> </span><span class="token punctuation">[</span><span class="token plain">date</span><span class="token punctuation">.</span><span class="token plain">strftime</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">'%b'</span><span class="token punctuation">)</span><span class="token plain"> </span><span class="token keyword" style="color:#ea6962">for</span><span class="token plain"> date </span><span class="token keyword" style="color:#ea6962">in</span><span class="token plain"> xticks</span><span class="token punctuation">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">ax</span><span class="token punctuation">.</span><span class="token plain">set_xticklabels</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">''</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">ax</span><span class="token punctuation">.</span><span class="token plain">set_xticks</span><span class="token punctuation">(</span><span class="token plain">xticks</span><span class="token punctuation">,</span><span class="token plain"> minor</span><span class="token operator" style="color:#a89984">=</span><span class="token boolean" style="color:#ea6962">True</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token 
plain">ax</span><span class="token punctuation">.</span><span class="token plain">set_xticklabels</span><span class="token punctuation">(</span><span class="token plain">xticklabels</span><span class="token punctuation">,</span><span class="token plain"> minor</span><span class="token operator" style="color:#a89984">=</span><span class="token boolean" style="color:#ea6962">True</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token comment" style="color:#a89984"># title and axis labels</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">ax</span><span class="token punctuation">.</span><span class="token plain">set_title</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">'Reddit Front Page Stories - 2016'</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">ax</span><span class="token punctuation">.</span><span class="token plain">set_ylabel</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">'Median Score'</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token comment" style="color:#a89984"># format and save it</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">fig</span><span class="token punctuation">.</span><span class="token 
plain">tight_layout</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">fig</span><span class="token punctuation">.</span><span class="token plain">savefig</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">'reddit-front-page-stories-2016.png'</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span></code></pre></div></div>
<p><img decoding="async" loading="lazy" alt="Reddit Front Page Stories for 2016" src="https://sangaline.com/assets/images/reddit-front-page-stories-2016-a64b66d4ebeeb096f574038dcd013146.png" width="600" height="400" class="img_ev3q"></p>
<p>Things are fairly stable with a small upwards trend until we hit December and the median scores go crazy!
Let's zoom in a little bit so we can see more clearly when this happened.</p>
<p><img decoding="async" loading="lazy" alt="Reddit Front Page Stories for December 2016" src="https://sangaline.com/assets/images/reddit-front-page-stories-december-2016-4d597386733e3d8b8cbdb31ee23da251.png" width="600" height="400" class="img_ev3q"></p>
<p>It looks like the median scores abruptly doubled on December 7th and the variations within each day also became more pronounced.
It's pretty clear that there was some algorithm change at Reddit that took place on that date.
This was also a significant enough change that it would likely be noticed by regular users.</p>
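<p>The date of the jump can also be located programmatically rather than by eye.
Here is a minimal sketch that groups snapshot medians by calendar day and reports the largest day-over-day ratio; <code>largest_daily_jump</code> and the sample values are hypothetical, and it assumes inputs shaped like the <code>times</code> and <code>median_scores</code> lists with non-zero daily medians.</p>

```python
from collections import defaultdict
from datetime import date, datetime
from statistics import median


def largest_daily_jump(times, median_scores):
    """Return (day, ratio) for the largest day-over-day rise in daily median."""
    by_day = defaultdict(list)
    for time, score in zip(times, median_scores):
        by_day[time.date()].append(score)

    days = sorted(by_day)
    daily = [median(by_day[day]) for day in days]
    # Pair each day (from the second one on) with the previous day's median.
    ratios = [
        (current / previous, day)
        for day, previous, current in zip(days[1:], daily, daily[1:])
    ]
    ratio, day = max(ratios)
    return day, ratio


# Hypothetical snapshot medians, two per day, doubling on December 7th.
sample_times = [datetime(2016, 12, d, h) for d in (5, 6, 7, 8) for h in (0, 12)]
sample_scores = [4000, 4200, 4100, 4300, 8300, 8500, 8400, 8800]

day, ratio = largest_daily_jump(sample_times, sample_scores)
print(day, ratio)
```

<p>On the real data you would pass in the full year of snapshots; days with only a handful of snapshots will make the ratio noisier.</p>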
<p>Let's Google 'Reddit vote scores "December 7th, 2016"' and see who else noticed it.
One of the first results is <a href="http://www.entirenewslink.com/reddit-overhauls-upvote-algorithm-to-thwart-cheaters-and-show-the-sites-true-scale/" target="_blank" rel="noopener noreferrer" class="">Reddit overhauls upvote algorithm to thwart cheaters and show the site's true scale</a> which links to <a href="https://www.reddit.com/r/announcements/comments/5gvd6b/scores_on_posts_are_about_to_start_going_up/" target="_blank" rel="noopener noreferrer" class="">a Reddit self-post by an admin named KeyserSosa</a> from December 6th, 2016.
Here is an abridged excerpt from that post.</p>
<blockquote>
<p>In the 11 years that Reddit has been around, we've accumulated <a href="https://i.redd.it/oek2mm1io01y.png" target="_blank" rel="noopener noreferrer" class="">a lot of rules</a> in our vote tallying as a way to mitigate cheating and brigading on posts and comments.
<a href="https://i.redd.it/dastklohq01y.gif" target="_blank" rel="noopener noreferrer" class="">Here's a rough schematic of what the code looks like without revealing any trade secrets or compromising the integrity of the algorithm</a>.
Many of these rules are still quite useful, but there are a few whose primary impact has been <a href="https://www.reddit.com/r/TheoryOfReddit/comments/z1sfn/one_minute_obamas_ama_karma_score_was_16000_a/" target="_blank" rel="noopener noreferrer" class="">to sometimes artificially deflate scores on the site</a>.</p>
<p>...Very soon (think hours, not days), we're going to cut the scores over to be reflective of these new and updated tallies...</p>
<p><strong>TL;DR</strong> voting is confusing, we cleaned up some outdated rules on voting, and we're updating the vote scores to be reflective of what they actually are.
Scores are increasing by a lot.</p>
</blockquote>
<p>And that's exactly what we saw in the data.
The data even lets us go a bit further and see that the new rules were relatively constant from the 7th until the 16th or 17th, when the median scores began to fluctuate between the 5k-20k range and the 30k-40k range.
It's hard to say exactly why this was happening from just the plots we've generated so far; maybe they were trying out new rule variations or maybe specific thresholds were being reached.
If we were to dig a little deeper and track individual story trajectories, then we could probably make some more specific guesses (sadly, this is outside the scope of this article).</p>
<p>In addition to the scores, we also scraped the story titles.
Let's cycle through all of the stories again and pick out the most highly rated ones.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token keyword" style="color:#ea6962">class</span><span class="token plain"> </span><span class="token class-name" style="color:#d8a657">TopStories</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token keyword" style="color:#ea6962">def</span><span class="token plain"> </span><span class="token function" style="color:#d8a657">__init__</span><span class="token punctuation">(</span><span class="token plain">self</span><span class="token punctuation">,</span><span class="token plain"> N</span><span class="token operator" style="color:#a89984">=</span><span class="token number" style="color:#d3869b">10</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        self</span><span class="token punctuation">.</span><span class="token plain">stories </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> </span><span class="token punctuation">[</span><span class="token punctuation">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        self</span><span class="token punctuation">.</span><span class="token plain">N </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> N</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token 
plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token keyword" style="color:#ea6962">def</span><span class="token plain"> </span><span class="token function" style="color:#d8a657">add_story</span><span class="token punctuation">(</span><span class="token plain">self</span><span class="token punctuation">,</span><span class="token plain"> new_story</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token comment" style="color:#a89984"># update any existing story with the higher score</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token keyword" style="color:#ea6962">for</span><span class="token plain"> story </span><span class="token keyword" style="color:#ea6962">in</span><span class="token plain"> self</span><span class="token punctuation">.</span><span class="token plain">stories</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            </span><span class="token keyword" style="color:#ea6962">if</span><span class="token plain"> story</span><span class="token punctuation">[</span><span class="token string" style="color:#89b482">'title'</span><span class="token punctuation">]</span><span class="token plain"> </span><span class="token operator" style="color:#a89984">==</span><span class="token plain"> new_story</span><span class="token punctuation">[</span><span class="token string" style="color:#89b482">'title'</span><span class="token punctuation">]</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" 
style="color:#ebdbb2"><span class="token plain">                story</span><span class="token punctuation">[</span><span class="token string" style="color:#89b482">'votes'</span><span class="token punctuation">]</span><span class="token plain"> </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> </span><span class="token builtin" style="color:#d8a657">max</span><span class="token punctuation">(</span><span class="token plain">story</span><span class="token punctuation">[</span><span class="token string" style="color:#89b482">'votes'</span><span class="token punctuation">]</span><span class="token punctuation">,</span><span class="token plain"> new_story</span><span class="token punctuation">[</span><span class="token string" style="color:#89b482">'votes'</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">                </span><span class="token keyword" style="color:#ea6962">return</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token comment" style="color:#a89984"># insert a story in its proper position</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token keyword" style="color:#ea6962">for</span><span class="token plain"> i </span><span class="token keyword" style="color:#ea6962">in</span><span class="token plain"> </span><span class="token builtin" style="color:#d8a657">range</span><span class="token punctuation">(</span><span class="token builtin" style="color:#d8a657">len</span><span class="token punctuation">(</span><span class="token 
plain">self</span><span class="token punctuation">.</span><span class="token plain">stories</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token punctuation">:</span><span class="token punctuation">:</span><span class="token operator" style="color:#a89984">-</span><span class="token number" style="color:#d3869b">1</span><span class="token punctuation">]</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            </span><span class="token keyword" style="color:#ea6962">if</span><span class="token plain"> new_story</span><span class="token punctuation">[</span><span class="token string" style="color:#89b482">'votes'</span><span class="token punctuation">]</span><span class="token plain"> </span><span class="token operator" style="color:#a89984">&gt;</span><span class="token plain"> self</span><span class="token punctuation">.</span><span class="token plain">stories</span><span class="token punctuation">[</span><span class="token plain">i</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string" style="color:#89b482">'votes'</span><span class="token punctuation">]</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">                </span><span class="token keyword" style="color:#ea6962">if</span><span class="token plain"> i </span><span class="token operator" style="color:#a89984">==</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">0</span><span class="token plain"> </span><span class="token keyword" style="color:#ea6962">or</span><span class="token plain"> self</span><span class="token punctuation">.</span><span class="token plain">stories</span><span 
class="token punctuation">[</span><span class="token plain">i </span><span class="token operator" style="color:#a89984">-</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">1</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string" style="color:#89b482">'votes'</span><span class="token punctuation">]</span><span class="token plain"> </span><span class="token operator" style="color:#a89984">&gt;</span><span class="token plain"> new_story</span><span class="token punctuation">[</span><span class="token string" style="color:#89b482">'votes'</span><span class="token punctuation">]</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">                    self</span><span class="token punctuation">.</span><span class="token plain">stories</span><span class="token punctuation">.</span><span class="token plain">insert</span><span class="token punctuation">(</span><span class="token plain">i</span><span class="token punctuation">,</span><span class="token plain"> new_story</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">                    </span><span class="token keyword" style="color:#ea6962">if</span><span class="token plain"> </span><span class="token builtin" style="color:#d8a657">len</span><span class="token punctuation">(</span><span class="token plain">self</span><span class="token punctuation">.</span><span class="token plain">stories</span><span class="token punctuation">)</span><span class="token plain"> </span><span class="token operator" style="color:#a89984">&gt;</span><span class="token plain"> self</span><span class="token punctuation">.</span><span class="token plain">N</span><span class="token punctuation">:</span><span class="token 
plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">                        self</span><span class="token punctuation">.</span><span class="token plain">stories</span><span class="token punctuation">.</span><span class="token plain">pop</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">                    </span><span class="token keyword" style="color:#ea6962">return</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            </span><span class="token keyword" style="color:#ea6962">else</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">                </span><span class="token keyword" style="color:#ea6962">break</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token comment" style="color:#a89984"># otherwise add it to the end if necessary</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token keyword" style="color:#ea6962">if</span><span class="token plain"> </span><span class="token builtin" style="color:#d8a657">len</span><span class="token punctuation">(</span><span class="token plain">self</span><span class="token punctuation">.</span><span class="token plain">stories</span><span class="token punctuation">)</span><span class="token plain"> </span><span class="token operator" style="color:#a89984">&lt;</span><span class="token plain"> self</span><span 
class="token punctuation">.</span><span class="token plain">N</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">                self</span><span class="token punctuation">.</span><span class="token plain">stories</span><span class="token punctuation">.</span><span class="token plain">append</span><span class="token punctuation">(</span><span class="token plain">new_story</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token comment" style="color:#a89984"># load in the data</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">top_stories </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> TopStories</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token keyword" style="color:#ea6962">with</span><span class="token plain"> </span><span class="token builtin" style="color:#d8a657">open</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">'snapshots.jl'</span><span class="token punctuation">,</span><span class="token plain"> </span><span class="token string" style="color:#89b482">'r'</span><span class="token punctuation">)</span><span class="token plain"> </span><span class="token keyword" style="color:#ea6962">as</span><span class="token plain"> f</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span 
class="token plain">    </span><span class="token keyword" style="color:#ea6962">for</span><span class="token plain"> line </span><span class="token keyword" style="color:#ea6962">in</span><span class="token plain"> f</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        row </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> json</span><span class="token punctuation">.</span><span class="token plain">loads</span><span class="token punctuation">(</span><span class="token plain">line</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        time </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> dt</span><span class="token punctuation">.</span><span class="token plain">utcfromtimestamp</span><span class="token punctuation">(</span><span class="token plain">row</span><span class="token punctuation">[</span><span class="token string" style="color:#89b482">'timestamp'</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token keyword" style="color:#ea6962">for</span><span class="token plain"> item </span><span class="token keyword" style="color:#ea6962">in</span><span class="token plain"> row</span><span class="token punctuation">[</span><span class="token string" style="color:#89b482">'items'</span><span class="token punctuation">]</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            top_stories</span><span class="token punctuation">.</span><span class="token 
plain">add_story</span><span class="token punctuation">(</span><span class="token plain">item</span><span class="token punctuation">)</span><br></span></code></pre></div></div>
<div style="overflow-x:auto"><table><thead><tr><th>Score</th><th>Title</th></tr></thead><tbody><tr><td>153759</td><td>1 dad reflex 2 children</td></tr><tr><td>109035</td><td>Hey Reddit, we need your help. We are small time youtubers who have recently discovered someone with 300x as many subscribers has made a near shot by shot rip off of one of our videos. The video has nearly 3x as many views as ours. Here is a side by side comparison. We don't know what to do.</td></tr><tr><td>100304</td><td>TIL Carrie Fisher told her fans: "No matter how I go, I want it reported that I drowned in moonlight, strangled by my own bra."</td></tr><tr><td>95621</td><td>Carrie Fisher Dies at 60</td></tr><tr><td>94915</td><td>Grindelwald, Switzerland</td></tr><tr><td>92277</td><td>Carrie Fisher dead at age 60</td></tr><tr><td>92192</td><td>Dog before and after being called a good boy</td></tr><tr><td>90257</td><td>Thanks Reddit. You saved me from potential credit card theft. Always wiggle the card reader.</td></tr><tr><td>88027</td><td>if you draw hands on the small McDonald's hot cup it looks like a butt. If you poke a hole in it...</td></tr><tr><td>87462</td><td>Australian man waits 416 days to see what happens after his ipod timer passes 9999 hours 59 minutes and 59 seconds.</td></tr></tbody></table></div>
<p>Well, now I kind of wish that we had scraped the URLs too so I could include the links.
Here's <a href="https://www.reddit.com/r/gifs/comments/5jrlw1/1_dad_reflex_2_children/" target="_blank" rel="noopener noreferrer" class="">1 dad reflex 2 children</a> at least; it's pretty impressive.</p>
<p>There's a ton more that we could do here if we extracted a bit more data, but hopefully this is enough to give you a taste for how easy it is to mess around with the data once we have it.</p>
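<p>As one small illustration of that, here's a rough sketch of an alternative way to get the same kind of ranking using only the standard library.
It isn't the <code>TopStories</code> class from above; it assumes the same row layout (an <code>items</code> list whose entries have <code>title</code> and <code>votes</code> keys, with numeric votes), and the <code>top_stories</code> function and the made-up <code>sample</code> rows are just names invented for this example.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv">import heapq
import json


def top_stories(lines, n=10):
    """Return the n highest (title, votes) pairs found in JSON-lines snapshots."""
    best = {}
    for line in lines:
        row = json.loads(line)
        for item in row['items']:
            title = item['title']
            # keep only the highest score ever observed for each title
            best[title] = max(best.get(title, item['votes']), item['votes'])
    # rank the stories by score, highest first
    return heapq.nlargest(n, best.items(), key=lambda pair: pair[1])


# made-up rows in the same shape as the lines of snapshots.jl
sample = [
    '{"timestamp": 1, "items": [{"title": "a", "votes": 10}, {"title": "b", "votes": 30}]}',
    '{"timestamp": 2, "items": [{"title": "a", "votes": 50}, {"title": "c", "votes": 20}]}',
]
print(top_stories(sample, n=2))  # [('a', 50), ('b', 30)]
</code></pre></div></div>
<p>Keying a dictionary on the title means a story that shows up in many snapshots is only counted once, at its peak score, and passing an open <code>snapshots.jl</code> file object in place of <code>sample</code> should work the same way.</p>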
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="wrap-up">Wrap Up<a href="https://sangaline.com/blog/wayback-machine-scraper#wrap-up" class="hash-link" aria-label="Direct link to Wrap Up" title="Direct link to Wrap Up" translate="no">​</a></h2>
<p>Well, I hope that you enjoyed our little foray into internet archaeology here.
There are countless possibilities for what you can do with time series data that you scrape from the Wayback Machine and what we've done here barely scrapes the surface.
You could use the data that we scraped here to apply the techniques developed in <a class="" href="https://sangaline.com/blog/reverse-engineering-the-hacker-news-ranking-algorithm">Reverse Engineering the Hacker News Ranking Algorithm</a> to Reddit or you could scrape other sites to find historical product prices, review scores, or anything else you're curious about or need to run your business.</p>
<p>I would love to hear about whatever analysis you might undertake so feel free to reach out at <a href="mailto:evan@intoli.com" target="_blank" rel="noopener noreferrer" class="">evan@intoli.com</a> (that's doubly true if you're looking for someone to help your business solve its data needs!).
If you do plan to actually use the middleware that we developed then please check out the full code on <a href="https://github.com/sangaline/scrapy-wayback-machine" target="_blank" rel="noopener noreferrer" class="">the Scrapy Wayback Machine GitHub repo</a>.
I skipped over some important error handling and edge cases to simplify the code a bit during this tutorial and you'll really want those in production.
The repo contains those additions as well as a really useful command-line utility for scraping pages without custom parsing (which may be useful if you want to parse them in a language other than python).</p>
<p>Finally, please don't forget to <a href="https://archive.org/donate/" target="_blank" rel="noopener noreferrer" class="">donate to archive.org if you're scraping data from their servers</a>.
They provide an awesome public service and scraping consumes a lot of their resources.
Throwing them a few bucks goes a long way in helping them provide the services that they do!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Advanced Web Scraping: Bypassing "403 Forbidden," captchas, and more]]></title>
            <link>https://sangaline.com/blog/advanced-web-scraping-tutorial</link>
            <guid>https://sangaline.com/blog/advanced-web-scraping-tutorial</guid>
            <pubDate>Thu, 16 Mar 2017 11:58:37 GMT</pubDate>
            <description><![CDATA[A tutorial on advanced web scraping techniques including bypassing 403 errors, solving captchas with OCR, and handling anti-scraping measures using Scrapy.]]></description>
            <content:encoded><![CDATA[<p><em>The full code for the completed scraper can be found in the <a href="https://github.com/sangaline/advanced-web-scraping-tutorial" target="_blank" rel="noopener noreferrer" class="">companion repository on github</a>.</em></p>
<!-- -->
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="introduction">Introduction<a href="https://sangaline.com/blog/advanced-web-scraping-tutorial#introduction" class="hash-link" aria-label="Direct link to Introduction" title="Direct link to Introduction" translate="no">​</a></h2>
<p>I wouldn't really consider web scraping one of my hobbies or anything but I guess I sort of do a lot of it.
It just seems like many of the things that I work on require me to get my hands on data that isn't available any other way.
I need to do static analysis of games for <a href="http://intoli.com/" target="_blank" rel="noopener noreferrer" class="">Intoli</a> and so I scrape the Google Play Store to find new ones and download the apks.
The <a href="http://pointyball.com/" target="_blank" rel="noopener noreferrer" class="">Pointy Ball</a> extension requires aggregating fantasy football projections from various sites and the easiest way was to write a scraper.
When I think about it, I've probably written about 40-50 scrapers.
I'm not quite at the point where I'm lying to my family about how many terabytes of data I'm hoarding away... but I'm close.</p>
<p>I've tried out <a href="https://github.com/lapwinglabs/x-ray" target="_blank" rel="noopener noreferrer" class="">x-ray</a>/<a href="https://github.com/cheeriojs/cheerio" target="_blank" rel="noopener noreferrer" class="">cheerio</a>, <a href="https://github.com/sparklemotion/nokogiri" target="_blank" rel="noopener noreferrer" class="">nokogiri</a>, and a few others but I always come back to my personal favorite: <a href="https://github.com/scrapy/scrapy" target="_blank" rel="noopener noreferrer" class="">scrapy</a>.
In my opinion, scrapy is an <strong>excellent</strong> piece of software.
I don't throw such unequivocal praise around lightly but it feels incredibly intuitive and has a great learning curve.</p>
<p>You can read <a href="https://doc.scrapy.org/en/latest/intro/tutorial.html" target="_blank" rel="noopener noreferrer" class="">The Scrapy Tutorial</a> and have your first scraper running within minutes.
Then, when you need to do something more complicated, you'll most likely find that there's a built-in and well-documented way to do it.
There's <em>a lot of power</em> built in but the framework is structured so that it stays out of your way until you need it.
When you finally do need something that isn't there by default, say a Bloom filter for deduplication because you're visiting too many URLs to store in memory, then it's usually as simple as subclassing one of the components and making a few small changes.
Everything just feels so <em>easy</em> and that's really a hallmark of good software design in my book.</p>
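<p>To make that Bloom filter example a little more concrete, here's a minimal sketch of one in plain python.
Nothing in it comes from scrapy itself: the <code>BloomFilter</code> class, its <code>seen()</code> method, and the default sizes are all assumptions made up for illustration, and it spends a whole byte per slot to keep the code short where a real filter would pack bits.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv">import hashlib


class BloomFilter:
    """A small Bloom filter for remembering which URLs have been seen."""

    def __init__(self, size=1000000, num_hashes=5):
        self.size = size
        self.num_hashes = num_hashes
        # one byte per slot keeps the sketch simple; a real filter would pack bits
        self.slots = bytearray(size)

    def _positions(self, url):
        # derive several slot indices from a single digest via double hashing
        digest = hashlib.sha256(url.encode('utf-8')).digest()
        first = int.from_bytes(digest[:8], 'big')
        second = int.from_bytes(digest[8:16], 'big') | 1
        return [(first + i * second) % self.size for i in range(self.num_hashes)]

    def seen(self, url):
        """Return True if url was probably seen before, otherwise record it."""
        positions = self._positions(url)
        if all(self.slots[position] for position in positions):
            return True  # could occasionally be a false positive
        for position in positions:
            self.slots[position] = 1
        return False


urls_seen = BloomFilter()
print(urls_seen.seen('http://zipru.to/torrents.php?category=TV'))  # False
print(urls_seen.seen('http://zipru.to/torrents.php?category=TV'))  # True
</code></pre></div></div>
<p>Hooking something like this into scrapy would presumably mean subclassing <code>scrapy.dupefilters.RFPDupeFilter</code>, overriding its <code>request_seen()</code> method, and pointing the <code>DUPEFILTER_CLASS</code> setting at the subclass.
The trade-off is that memory use stays fixed no matter how many URLs you visit, at the cost of occasionally skipping a page that was never actually visited.</p>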
<p>I've toyed with the idea of writing an advanced scrapy tutorial for a while now.
Something that would give me a chance to show off some of its extensibility while also addressing realistic challenges that come up in practice.
As much as I've wanted to do this, I just wasn't able to get past the fact that it seemed like a decidedly dick move to publish something that could conceivably result in someone's servers getting hammered with bot traffic.</p>
<p>I can sleep pretty well at night scraping sites that actively try to prevent scraping as long as I follow a few basic rules.
Namely, I keep my request rate comparable to what it would be if I were browsing by hand and I don't do anything distasteful with the data.
That makes running a scraper basically indistinguishable from collecting data manually in any way that matters.
Even if I were to personally follow these rules, it would still feel like a step too far to do a how-to guide for a specific site that people might actually want to scrape.</p>
<p>And so it remained just a vague idea in my head until I encountered a torrent site called Zipru.
It has multiple mechanisms in place that require advanced scraping techniques but its <code>robots.txt</code> file allows scraping.
Furthermore, there is <em>no reason to scrape it</em>.
It has a public API that can be used to get all of the same data.
If you're interested in getting torrent data then just use the API; it's great for that.</p>
<p>In the rest of this article, I'll walk you through writing a scraper that can handle captchas and various other challenges that we'll encounter on the Zipru site.
The code won't work exactly as written because Zipru isn't a real site but the techniques employed are broadly applicable to real-world scraping and the code is otherwise complete.
I'm going to assume that you have basic familiarity with python but I'll try to keep this accessible to someone with little to no knowledge of scrapy.
If things are going too fast at first then take a few minutes to read <a href="https://doc.scrapy.org/en/latest/intro/tutorial.html" target="_blank" rel="noopener noreferrer" class="">The Scrapy Tutorial</a> which covers the introductory stuff in much more depth.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="setting-up-the-project">Setting Up the Project<a href="https://sangaline.com/blog/advanced-web-scraping-tutorial#setting-up-the-project" class="hash-link" aria-label="Direct link to Setting Up the Project" title="Direct link to Setting Up the Project" translate="no">​</a></h2>
<p>We'll work within a <a href="https://virtualenv.pypa.io/en/stable/" target="_blank" rel="noopener noreferrer" class="">virtualenv</a> which lets us encapsulate our dependencies a bit.
Let's start by setting up a virtualenv in <code>~/scrapers/zipru</code> and installing scrapy.</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token function" style="color:#d8a657">mkdir</span><span class="token plain"> -p ~/scrapers/zipru</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token builtin class-name" style="color:#d8a657">cd</span><span class="token plain"> ~/scrapers/zipru</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">virtualenv </span><span class="token function" style="color:#d8a657">env</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token builtin class-name" style="color:#d8a657">.</span><span class="token plain"> env/bin/activate</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">pip </span><span class="token function" style="color:#d8a657">install</span><span class="token plain"> scrapy</span><br></span></code></pre></div></div>
<p>The terminal that you ran those in will now be configured to use the local virtualenv.
If you open another terminal then you'll need to run <code>. ~/scrapers/zipru/env/bin/activate</code> again (otherwise you may get errors about commands or modules not being found).</p>
<p>You can now create a new project scaffold by running</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token plain">scrapy startproject zipru_scraper</span><br></span></code></pre></div></div>
<p>which will create the following directory structure.</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token plain">└── zipru_scraper</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    ├── zipru_scraper</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    │   ├── __init__.py</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    │   ├── items.py</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    │   ├── middlewares.py</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    │   ├── pipelines.py</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    │   ├── settings.py</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    │   └── spiders</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    │       └── __init__.py</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    └── scrapy.cfg</span><br></span></code></pre></div></div>
<p>Most of these files aren't actually used at all by default; they just suggest a sane way to structure our code.
From now on, you should think of <code>~/scrapers/zipru/zipru_scraper</code> as the top-level directory of the project.
That's where any scrapy commands should be run and is also the root of any relative paths.</p>
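<p>Now that there's a <code>settings.py</code>, it's also a natural home for the "browse by hand" request rate mentioned in the introduction.
Here's a sketch of what could be added to <code>zipru_scraper/settings.py</code>; the setting names are standard scrapy settings, but the specific values are assumptions picked for illustration rather than anything this project requires.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"># zipru_scraper/settings.py (illustrative values, tune them for the site in question)

# wait roughly this many seconds between requests to the same site
DOWNLOAD_DELAY = 2.0
# keep a single request in flight per domain, like one person browsing
CONCURRENT_REQUESTS_PER_DOMAIN = 1
# let scrapy back off automatically when the server starts responding slowly
AUTOTHROTTLE_ENABLED = True
# respect the rules in the site's robots.txt file
ROBOTSTXT_OBEY = True
</code></pre></div></div>
<p>A couple of seconds between requests is slow for a crawler but is about the pace of a person clicking through listing pages, which is the point.</p>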
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="adding-a-basic-spider">Adding a Basic Spider<a href="https://sangaline.com/blog/advanced-web-scraping-tutorial#adding-a-basic-spider" class="hash-link" aria-label="Direct link to Adding a Basic Spider" title="Direct link to Adding a Basic Spider" translate="no">​</a></h2>
<p>We'll now need to add a spider in order to make our scraper actually do anything.
A spider is the part of a scrapy scraper that handles parsing documents to find new URLs to scrape and data to extract.
I'm going to lean pretty heavily on the <a href="https://doc.scrapy.org/en/latest/topics/spiders.html#scrapy-spider" target="_blank" rel="noopener noreferrer" class="">default Spider</a> implementation to minimize the amount of code that we'll have to write.
Things might seem a little automagical here but much less so if you check out the documentation.</p>
<p>First, create a file named <code>zipru_scraper/spiders/zipru_spider.py</code> with the following contents.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token keyword" style="color:#ea6962">import</span><span class="token plain"> scrapy</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token keyword" style="color:#ea6962">class</span><span class="token plain"> </span><span class="token class-name" style="color:#d8a657">ZipruSpider</span><span class="token punctuation">(</span><span class="token plain">scrapy</span><span class="token punctuation">.</span><span class="token plain">Spider</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    name </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> </span><span class="token string" style="color:#89b482">'zipru'</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    start_urls </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> </span><span class="token punctuation">[</span><span class="token string" style="color:#89b482">'http://zipru.to/torrents.php?category=TV'</span><span class="token punctuation">]</span><br></span></code></pre></div></div>
<p>Our spider inherits from <code>scrapy.Spider</code>, which provides a <code>start_requests()</code> method that goes through <code>start_urls</code> and uses them to seed the crawl.
We've provided a single URL in <code>start_urls</code> that points to the TV listings.
They look something like this.</p>
<p><img decoding="async" loading="lazy" alt="The tv show listings on zipru" src="https://sangaline.com/assets/images/tv-shows-page-1-2bd0b94ca16ec4f7ab4b38803afbffe6.png" width="866" height="277" class="img_ev3q"></p>
<p>At the top there, you can see that there are links to other pages.
We'll want our scraper to follow those links and parse them as well.
To do that, we'll first need to identify the links and find out where they point.</p>
<p>The DOM inspector can be a huge help at this stage.
If you right-click one of these page links and look at it in the inspector, you'll see that the links to other listing pages look like this:</p>
<div class="language-html codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-html codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token tag punctuation" style="color:#e78a4e">&lt;</span><span class="token tag" style="color:#e78a4e">a</span><span class="token tag" style="color:#e78a4e"> </span><span class="token tag attr-name" style="color:#a9b665">href</span><span class="token tag attr-value punctuation attr-equals" style="color:#89b482">=</span><span class="token tag attr-value punctuation" style="color:#89b482">"</span><span class="token tag attr-value" style="color:#89b482">/torrents.php?...page=2</span><span class="token tag attr-value punctuation" style="color:#89b482">"</span><span class="token tag" style="color:#e78a4e"> </span><span class="token tag attr-name" style="color:#a9b665">title</span><span class="token tag attr-value punctuation attr-equals" style="color:#89b482">=</span><span class="token tag attr-value punctuation" style="color:#89b482">"</span><span class="token tag attr-value" style="color:#89b482">page 2</span><span class="token tag attr-value punctuation" style="color:#89b482">"</span><span class="token tag punctuation" style="color:#e78a4e">&gt;</span><span class="token plain">2</span><span class="token tag punctuation" style="color:#e78a4e">&lt;/</span><span class="token tag" style="color:#e78a4e">a</span><span class="token tag punctuation" style="color:#e78a4e">&gt;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token tag punctuation" style="color:#e78a4e">&lt;</span><span class="token tag" style="color:#e78a4e">a</span><span class="token tag" style="color:#e78a4e"> </span><span class="token 
tag attr-name" style="color:#a9b665">href</span><span class="token tag attr-value punctuation attr-equals" style="color:#89b482">=</span><span class="token tag attr-value punctuation" style="color:#89b482">"</span><span class="token tag attr-value" style="color:#89b482">/torrents.php?...page=3</span><span class="token tag attr-value punctuation" style="color:#89b482">"</span><span class="token tag" style="color:#e78a4e"> </span><span class="token tag attr-name" style="color:#a9b665">title</span><span class="token tag attr-value punctuation attr-equals" style="color:#89b482">=</span><span class="token tag attr-value punctuation" style="color:#89b482">"</span><span class="token tag attr-value" style="color:#89b482">page 3</span><span class="token tag attr-value punctuation" style="color:#89b482">"</span><span class="token tag punctuation" style="color:#e78a4e">&gt;</span><span class="token plain">3</span><span class="token tag punctuation" style="color:#e78a4e">&lt;/</span><span class="token tag" style="color:#e78a4e">a</span><span class="token tag punctuation" style="color:#e78a4e">&gt;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token tag punctuation" style="color:#e78a4e">&lt;</span><span class="token tag" style="color:#e78a4e">a</span><span class="token tag" style="color:#e78a4e"> </span><span class="token tag attr-name" style="color:#a9b665">href</span><span class="token tag attr-value punctuation attr-equals" style="color:#89b482">=</span><span class="token tag attr-value punctuation" style="color:#89b482">"</span><span class="token tag attr-value" style="color:#89b482">/torrents.php?...page=4</span><span class="token tag attr-value punctuation" style="color:#89b482">"</span><span class="token tag" style="color:#e78a4e"> </span><span class="token tag attr-name" style="color:#a9b665">title</span><span class="token tag attr-value punctuation attr-equals" 
style="color:#89b482">=</span><span class="token tag attr-value punctuation" style="color:#89b482">"</span><span class="token tag attr-value" style="color:#89b482">page 4</span><span class="token tag attr-value punctuation" style="color:#89b482">"</span><span class="token tag punctuation" style="color:#e78a4e">&gt;</span><span class="token plain">4</span><span class="token tag punctuation" style="color:#e78a4e">&lt;/</span><span class="token tag" style="color:#e78a4e">a</span><span class="token tag punctuation" style="color:#e78a4e">&gt;</span><br></span></code></pre></div></div>
<p>Next we'll need to construct selector expressions for these links.
Some searches are a better fit for css selectors and others for xpath selectors, so I tend to mix and chain them somewhat freely.
I highly recommend <a href="http://edutechwiki.unige.ch/en/XPath_tutorial_-_basics" target="_blank" rel="noopener noreferrer" class="">learning xpath</a> if you don't know it, but it's unfortunately a bit beyond the scope of this tutorial.
I personally find it to be pretty indispensable for scraping, web UI testing, and even just web development in general.
I'll stick with css selectors here though because they're probably more familiar to most people.</p>
<p>To select these page links we can look for <code>&lt;a&gt;</code> tags with "page" in the title using <code>a[title ~= page]</code> as a css selector.
If you press <code>ctrl-f</code> in the DOM inspector then you'll find that you can use this css expression as a search query (this works for xpath too!).
Doing so lets you cycle through and see all of the matches.
This is a good way to check that an expression works but also isn't so vague that it matches other things unintentionally.
Our page link selector satisfies both of those criteria.</p>
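<p>A side note on the <code>~=</code> operator: it matches when the given value is one of the whitespace-separated words in the attribute, which is why it hits <code>title="page 2"</code> but would not hit something like <code>title="homepage"</code>. The sketch below imitates that rule using only the Python standard library. The <code>PageLinkFinder</code> class and the "homepage" link are made up for illustration; this is not how Scrapy evaluates selectors internally.</p>

```python
from html.parser import HTMLParser


class PageLinkFinder(HTMLParser):
    """Collects hrefs of <a> tags whose title contains the word 'page'.

    Hypothetical helper that mimics the css selector a[title ~= page].
    """

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attributes = dict(attrs)
        title = attributes.get('title') or ''
        # ~= compares against whitespace-separated words, not substrings
        if 'page' in title.split() and 'href' in attributes:
            self.hrefs.append(attributes['href'])


# markup modeled on the pagination links above, plus an invented non-match
sample_html = '''
<a href="/torrents.php?...page=2" title="page 2">2</a>
<a href="/torrents.php?...page=3" title="page 3">3</a>
<a href="/index.php" title="homepage">home</a>
'''

finder = PageLinkFinder()
finder.feed(sample_html)
page_links = finder.hrefs
print(page_links)
```

<p>Only the two pagination links end up in <code>page_links</code>; the "homepage" link is skipped because "page" is not a standalone word in its title.</p>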
<p>To tell our spider how to find these other pages, we'll add a <code>parse(response)</code> method to <code>ZipruSpider</code> like so</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token keyword" style="color:#ea6962">def</span><span class="token plain"> </span><span class="token function" style="color:#d8a657">parse</span><span class="token punctuation">(</span><span class="token plain">self</span><span class="token punctuation">,</span><span class="token plain"> response</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token comment" style="color:#a89984"># proceed to other pages of the listings</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token keyword" style="color:#ea6962">for</span><span class="token plain"> page_url </span><span class="token keyword" style="color:#ea6962">in</span><span class="token plain"> response</span><span class="token punctuation">.</span><span class="token plain">css</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">'a[title ~= page]::attr(href)'</span><span class="token punctuation">)</span><span class="token punctuation">.</span><span class="token plain">extract</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            page_url </span><span class="token operator" 
style="color:#a89984">=</span><span class="token plain"> response</span><span class="token punctuation">.</span><span class="token plain">urljoin</span><span class="token punctuation">(</span><span class="token plain">page_url</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            </span><span class="token keyword" style="color:#ea6962">yield</span><span class="token plain"> scrapy</span><span class="token punctuation">.</span><span class="token plain">Request</span><span class="token punctuation">(</span><span class="token plain">url</span><span class="token operator" style="color:#a89984">=</span><span class="token plain">page_url</span><span class="token punctuation">,</span><span class="token plain"> callback</span><span class="token operator" style="color:#a89984">=</span><span class="token plain">self</span><span class="token punctuation">.</span><span class="token plain">parse</span><span class="token punctuation">)</span><br></span></code></pre></div></div>
<p>When we start scraping, the URL that we added to <code>start_urls</code> will automatically be fetched and the response fed into this <code>parse(response)</code> method.
Our code then finds all of the links to other listing pages and yields new requests which are attached to the same <code>parse(response)</code> callback.
These requests will be turned into response objects and then fed back into <code>parse(response)</code> so long as the URLs haven't already been processed (thanks to the dupe filter).</p>
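<p>The two ideas in that loop, resolving relative links and skipping URLs that were already seen, can be sketched with the standard library alone. This is a simplified illustration: the <code>hrefs</code> list is invented, and Scrapy's real dupe filter works on request fingerprints rather than on a plain set of URL strings.</p>

```python
from urllib.parse import urljoin

# URL of the page currently being parsed (the one from start_urls)
base_url = 'http://zipru.to/torrents.php?category=TV'

# hypothetical hrefs as they might come out of the selector, with a repeat
hrefs = ['/torrents.php?page=2', '/torrents.php?page=3', '/torrents.php?page=2']

seen = set()      # stand-in for the dupe filter's memory
to_request = []   # URLs that would actually be scheduled

for href in hrefs:
    absolute = urljoin(base_url, href)  # same idea as response.urljoin(href)
    if absolute in seen:
        continue  # already scheduled once, so skip it
    seen.add(absolute)
    to_request.append(absolute)

print(to_request)
```

<p>Even though page 2 is linked twice, it is only scheduled once, which is what keeps the recursive <code>parse</code> callback from looping forever.</p>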
<p>Our scraper can already find and request all of the different listing pages but we still need to extract some actual data to make this useful.
The torrent listings sit in a <code>&lt;table&gt;</code> with <code>class="lista2t"</code> and then each individual listing is within a <code>&lt;tr&gt;</code> with <code>class="lista2"</code>.
Each of these rows in turn contains 8 <code>&lt;td&gt;</code> tags that correspond to "Category", "File", "Added", "Size", "Seeders", "Leechers", "Comments", and "Uploaders".
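It's probably easiest to just see the other details in code, so here's our updated <code>parse(response)</code> method.
Note that this version selects the page links with an xpath expression rather than the css selector from before; for the pagination links shown above it should match the same <code>&lt;a&gt;</code> tags.</p>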
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token keyword" style="color:#ea6962">def</span><span class="token plain"> </span><span class="token function" style="color:#d8a657">parse</span><span class="token punctuation">(</span><span class="token plain">self</span><span class="token punctuation">,</span><span class="token plain"> response</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token comment" style="color:#a89984"># proceed to other pages of the listings</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token keyword" style="color:#ea6962">for</span><span class="token plain"> page_url </span><span class="token keyword" style="color:#ea6962">in</span><span class="token plain"> response</span><span class="token punctuation">.</span><span class="token plain">xpath</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">'//a[contains(@title, "page ")]/@href'</span><span class="token punctuation">)</span><span class="token punctuation">.</span><span class="token plain">extract</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            page_url </span><span class="token 
operator" style="color:#a89984">=</span><span class="token plain"> response</span><span class="token punctuation">.</span><span class="token plain">urljoin</span><span class="token punctuation">(</span><span class="token plain">page_url</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            </span><span class="token keyword" style="color:#ea6962">yield</span><span class="token plain"> scrapy</span><span class="token punctuation">.</span><span class="token plain">Request</span><span class="token punctuation">(</span><span class="token plain">url</span><span class="token operator" style="color:#a89984">=</span><span class="token plain">page_url</span><span class="token punctuation">,</span><span class="token plain"> callback</span><span class="token operator" style="color:#a89984">=</span><span class="token plain">self</span><span class="token punctuation">.</span><span class="token plain">parse</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token comment" style="color:#a89984"># extract the torrent items</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token keyword" style="color:#ea6962">for</span><span class="token plain"> tr </span><span class="token keyword" style="color:#ea6962">in</span><span class="token plain"> response</span><span class="token punctuation">.</span><span class="token plain">css</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">'table.lista2t tr.lista2'</span><span class="token punctuation">)</span><span class="token 
punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            tds </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> tr</span><span class="token punctuation">.</span><span class="token plain">css</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">'td'</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            link </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> tds</span><span class="token punctuation">[</span><span class="token number" style="color:#d3869b">1</span><span class="token punctuation">]</span><span class="token punctuation">.</span><span class="token plain">css</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">'a'</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number" style="color:#d3869b">0</span><span class="token punctuation">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            </span><span class="token keyword" style="color:#ea6962">yield</span><span class="token plain"> </span><span class="token punctuation">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">                </span><span class="token string" style="color:#89b482">'title'</span><span class="token plain"> </span><span class="token punctuation">:</span><span class="token plain"> link</span><span class="token punctuation">.</span><span class="token plain">css</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">'::attr(title)'</span><span class="token 
punctuation">)</span><span class="token punctuation">.</span><span class="token plain">extract_first</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">                </span><span class="token string" style="color:#89b482">'url'</span><span class="token plain"> </span><span class="token punctuation">:</span><span class="token plain"> response</span><span class="token punctuation">.</span><span class="token plain">urljoin</span><span class="token punctuation">(</span><span class="token plain">link</span><span class="token punctuation">.</span><span class="token plain">css</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">'::attr(href)'</span><span class="token punctuation">)</span><span class="token punctuation">.</span><span class="token plain">extract_first</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">                </span><span class="token string" style="color:#89b482">'date'</span><span class="token plain"> </span><span class="token punctuation">:</span><span class="token plain"> tds</span><span class="token punctuation">[</span><span class="token number" style="color:#d3869b">2</span><span class="token punctuation">]</span><span class="token punctuation">.</span><span class="token plain">css</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">'::text'</span><span class="token punctuation">)</span><span class="token punctuation">.</span><span class="token plain">extract_first</span><span class="token punctuation">(</span><span class="token 
punctuation">)</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">                </span><span class="token string" style="color:#89b482">'size'</span><span class="token plain"> </span><span class="token punctuation">:</span><span class="token plain"> tds</span><span class="token punctuation">[</span><span class="token number" style="color:#d3869b">3</span><span class="token punctuation">]</span><span class="token punctuation">.</span><span class="token plain">css</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">'::text'</span><span class="token punctuation">)</span><span class="token punctuation">.</span><span class="token plain">extract_first</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">                </span><span class="token string" style="color:#89b482">'seeders'</span><span class="token punctuation">:</span><span class="token plain"> </span><span class="token builtin" style="color:#d8a657">int</span><span class="token punctuation">(</span><span class="token plain">tds</span><span class="token punctuation">[</span><span class="token number" style="color:#d3869b">4</span><span class="token punctuation">]</span><span class="token punctuation">.</span><span class="token plain">css</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">'::text'</span><span class="token punctuation">)</span><span class="token punctuation">.</span><span class="token plain">extract_first</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">,</span><span class="token 
plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">                </span><span class="token string" style="color:#89b482">'leechers'</span><span class="token punctuation">:</span><span class="token plain"> </span><span class="token builtin" style="color:#d8a657">int</span><span class="token punctuation">(</span><span class="token plain">tds</span><span class="token punctuation">[</span><span class="token number" style="color:#d3869b">5</span><span class="token punctuation">]</span><span class="token punctuation">.</span><span class="token plain">css</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">'::text'</span><span class="token punctuation">)</span><span class="token punctuation">.</span><span class="token plain">extract_first</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">                </span><span class="token string" style="color:#89b482">'uploader'</span><span class="token punctuation">:</span><span class="token plain"> tds</span><span class="token punctuation">[</span><span class="token number" style="color:#d3869b">7</span><span class="token punctuation">]</span><span class="token punctuation">.</span><span class="token plain">css</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">'::text'</span><span class="token punctuation">)</span><span class="token punctuation">.</span><span class="token plain">extract_first</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            </span><span 
class="token punctuation">}</span><br></span></code></pre></div></div>
<p>Our <code>parse(response)</code> method now also yields dictionaries which will automatically be differentiated from the requests based on their type.
Each dictionary will be interpreted as an item and included as part of our scraper's data output.</p>
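<p>The "differentiated based on their type" part can be pictured roughly like this. The sketch is a toy model rather than Scrapy's actual engine code: <code>FakeRequest</code>, <code>fake_parse()</code>, and the sample titles are all invented for illustration.</p>

```python
class FakeRequest:
    """Hypothetical stand-in for scrapy.Request, used only in this sketch."""

    def __init__(self, url):
        self.url = url


def fake_parse():
    """Yields a mix of follow-up requests and items, like parse(response)."""
    yield FakeRequest('http://zipru.to/torrents.php?page=2')
    yield {'title': 'Example Show S01E01', 'seeders': int('12')}
    yield {'title': 'Example Show S01E02', 'seeders': int('7')}


requests, items = [], []
for result in fake_parse():
    # route each yielded object by its type
    if isinstance(result, FakeRequest):
        requests.append(result)   # would be scheduled for download
    elif isinstance(result, dict):
        items.append(result)      # would be written to the output feed

print(len(requests), len(items))
```

<p>A single generator can therefore drive both the crawl frontier and the data output without any extra bookkeeping in the spider.</p>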
<p>With most websites, we would be done at this point.
We could just run</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token plain">scrapy crawl zipru </span><span class="token parameter variable" style="color:#ea6962">-o</span><span class="token plain"> torrents.jl</span><br></span></code></pre></div></div>
<p>and a few minutes later we would have a nice <a href="http://jsonlines.org/" target="_blank" rel="noopener noreferrer" class="">JSON Lines</a> formatted <code>torrents.jl</code> file with all of our torrent data.
Instead we get this (along with a lot of other stuff)</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token plain">[scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">[scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">[scrapy.core.engine] DEBUG: Crawled (403) &lt;GET http://zipru.to/robots.txt&gt; (referer: None) ['partial']</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">[scrapy.core.engine] DEBUG: Crawled (403) &lt;GET http://zipru.to/torrents.php?category=TV&gt; (referer: None) ['partial']</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">[scrapy.spidermiddlewares.httperror] INFO: Ignoring response &lt;403 http://zipru.to/torrents.php?category=TV&gt;: HTTP status code is not handled or not allowed</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">[scrapy.core.engine] INFO: Closing spider (finished)</span><br></span></code></pre></div></div>
<p>Drats!
We're going to have to be a little more clever to get our data that we could totally just get from the public API and would never actually scrape.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-easy-problem">The Easy Problem<a href="https://sangaline.com/blog/advanced-web-scraping-tutorial#the-easy-problem" class="hash-link" aria-label="Direct link to The Easy Problem" title="Direct link to The Easy Problem" translate="no">​</a></h2>
<p>Our first request gets a <code>403</code> response that's ignored and then everything shuts down because we only seeded the crawl with one URL.
The same request works fine in a web browser, even in incognito mode with no session history, so this is most likely caused by some difference in the request headers.
We could use <a href="https://en.wikipedia.org/wiki/Tcpdump" target="_blank" rel="noopener noreferrer" class="">tcpdump</a> to compare the headers of the two requests but there's a common culprit here that we should check first: the user agent.</p>
<p>Scrapy identifies as "Scrapy/1.3.3 (+<a href="http://scrapy.org/" target="_blank" rel="noopener noreferrer" class="">http://scrapy.org</a>)" by default and some servers might block this or even whitelist a limited number of user agents.
You can find lists of <a href="https://techblog.willshouse.com/2012/01/03/most-common-user-agents/" target="_blank" rel="noopener noreferrer" class="">the most common user agents</a> online and using one of these is often enough to get around basic anti-scraping measures.
Pick your favorite and then open up <code>zipru_scraper/settings.py</code> and replace</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token comment" style="color:#a89984"># Crawl responsibly by identifying yourself (and your website) on the user-agent</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token comment" style="color:#a89984">#USER_AGENT = 'zipru_scraper (+http://www.yourdomain.com)'</span><br></span></code></pre></div></div>
<p>with</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token plain">USER_AGENT </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> </span><span class="token string" style="color:#89b482">'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'</span><br></span></code></pre></div></div>
<p>You might notice that the default scrapy settings did a little bit of scrape-shaming there.
Opinions differ on the matter but I personally think it's OK to identify as a common web browser if your scraper acts like somebody using a common web browser.
So let's slow down the request rate a little bit by also adding</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token plain">CONCURRENT_REQUESTS </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">1</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">DOWNLOAD_DELAY </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">5</span><br></span></code></pre></div></div>
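<p>which will create a somewhat realistic browsing pattern, since Scrapy by default randomizes the wait between requests to somewhere between 0.5 and 1.5 times <code>DOWNLOAD_DELAY</code>. The <a href="https://doc.scrapy.org/en/latest/topics/autothrottle.html" target="_blank" rel="noopener noreferrer" class="">AutoThrottle extension</a> can adjust the delay further based on load, but only if you enable it with <code>AUTOTHROTTLE_ENABLED = True</code>.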
Our scraper will also respect <code>robots.txt</code> by default so we're really on our best behavior.</p>
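<p>To get a feel for what a delay of 5 means in practice: Scrapy's documented default (<code>RANDOMIZE_DOWNLOAD_DELAY</code>) is to wait a random time between 0.5 and 1.5 times <code>DOWNLOAD_DELAY</code> before each request to the same site. The snippet below is only a back-of-the-envelope model of that behavior; <code>randomized_delay()</code> is a made-up helper, not a Scrapy function.</p>

```python
import random

DOWNLOAD_DELAY = 5  # same value as in settings.py


def randomized_delay(base_delay, rng=random):
    """Returns a wait time between 0.5x and 1.5x of base_delay (hypothetical helper)."""
    return rng.uniform(0.5 * base_delay, 1.5 * base_delay)


# simulate the pauses before five consecutive requests
delays = [randomized_delay(DOWNLOAD_DELAY) for _ in range(5)]
print([round(delay, 2) for delay in delays])
```

<p>With these settings each pause lands somewhere between 2.5 and 7.5 seconds, so the requests don't arrive on a perfectly regular beat.</p>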
<p>Now running the scraper again with <code>scrapy crawl zipru -o torrents.jl</code> should produce</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token plain">[scrapy.core.engine] DEBUG: Crawled (200) &lt;GET http://zipru.to/robots.txt&gt; (referer: None)</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">[scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to &lt;GET http://zipru.to/threat_defense.php?defense=1&amp;r=78213556&gt; from &lt;GET http://zipru.to/torrents.php?category=TV&gt;</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">[scrapy.core.engine] DEBUG: Crawled (200) &lt;GET http://zipru.to/threat_defense.php?defense=1&amp;r=78213556&gt; (referer: None) ['partial']</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">[scrapy.core.engine] INFO: Closing spider (finished)</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span></code></pre></div></div>
<p>That's real progress!
We got two <code>200</code> statuses and a <code>302</code> that the downloader middleware knew how to handle.
Unfortunately, that <code>302</code> pointed us towards a somewhat ominous sounding <code>threat_defense.php</code>.
Unsurprisingly, the spider found nothing good there and the crawl terminated.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="downloader-middleware">Downloader Middleware<a href="https://sangaline.com/blog/advanced-web-scraping-tutorial#downloader-middleware" class="hash-link" aria-label="Direct link to Downloader Middleware" title="Direct link to Downloader Middleware" translate="no">​</a></h2>
<p>It will be helpful to learn a bit about how requests and responses are handled in scrapy before we dig into the bigger problems that we're facing.
When we created our basic spider, we produced <code>scrapy.Request</code> objects and then these were somehow turned into <code>scrapy.Response</code> objects corresponding to responses from the server.
A big part of that "somehow" is downloader middleware.</p>
<p>Downloader middlewares are ordinary Python classes that implement <code>process_request(request, spider)</code> and/or <code>process_response(request, response, spider)</code> methods; Scrapy doesn't require them to inherit from a particular base class.
You can probably guess what those do from their names.
There are actually a whole bunch of these middlewares enabled by default.
Here's what the standard configuration looks like (you can of course disable things, add things, or rearrange things).</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token plain">DOWNLOADER_MIDDLEWARES_BASE </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> </span><span class="token punctuation">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token string" style="color:#89b482">'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware'</span><span class="token punctuation">:</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">100</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token string" style="color:#89b482">'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware'</span><span class="token punctuation">:</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">300</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token string" style="color:#89b482">'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware'</span><span class="token punctuation">:</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">350</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token string" 
style="color:#89b482">'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware'</span><span class="token punctuation">:</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">400</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token string" style="color:#89b482">'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware'</span><span class="token punctuation">:</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">500</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token string" style="color:#89b482">'scrapy.downloadermiddlewares.retry.RetryMiddleware'</span><span class="token punctuation">:</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">550</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token string" style="color:#89b482">'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware'</span><span class="token punctuation">:</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">560</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token string" style="color:#89b482">'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware'</span><span class="token punctuation">:</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">580</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span 
class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token string" style="color:#89b482">'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware'</span><span class="token punctuation">:</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">590</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token string" style="color:#89b482">'scrapy.downloadermiddlewares.redirect.RedirectMiddleware'</span><span class="token punctuation">:</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">600</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token string" style="color:#89b482">'scrapy.downloadermiddlewares.cookies.CookiesMiddleware'</span><span class="token punctuation">:</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">700</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token string" style="color:#89b482">'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware'</span><span class="token punctuation">:</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">750</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token string" style="color:#89b482">'scrapy.downloadermiddlewares.stats.DownloaderStats'</span><span class="token punctuation">:</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">850</span><span 
class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token string" style="color:#89b482">'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware'</span><span class="token punctuation">:</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">900</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token punctuation">}</span><br></span></code></pre></div></div>
<p>As a request makes its way out to a server, it bubbles through the <code>process_request(request, spider)</code> method of each of these middlewares.
This happens in sequential numerical order such that the <code>RobotsTxtMiddleware</code> processes the request first and the <code>HttpCacheMiddleware</code> processes it last.
Then once a response has been generated it bubbles back through the <code>process_response(request, response, spider)</code> methods of any enabled middlewares.
This happens in reverse order this time so the higher numbers are always closer to the server and the lower numbers are always closer to the spider.</p>
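<p>The ordering is easier to see in a toy simulation. This is only a sketch of the idea and not how Scrapy is actually implemented; <code>run_chain</code> is a hypothetical helper and the three middlewares are picked from the list above.</p>

```python
def run_chain(orders):
    """Hypothetical helper: list the hooks in the order they would fire."""
    ordered = sorted(orders, key=orders.get)  # lowest number first
    log = []
    for name in ordered:
        log.append(("process_request", name))
    log.append(("download", "server"))
    for name in reversed(ordered):  # responses travel back the other way
        log.append(("process_response", name))
    return log


orders = {
    "HttpCacheMiddleware": 900,
    "RobotsTxtMiddleware": 100,
    "RedirectMiddleware": 600,
}
log = run_chain(orders)
for step in log:
    print(step)
```

<p>The <code>RobotsTxtMiddleware</code> sees the request first and the response last, while the <code>HttpCacheMiddleware</code> sits right next to the server in both directions.</p>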
<p>One particularly simple middleware is the <code>CookiesMiddleware</code>.
It basically checks the <code>Set-Cookie</code> header on incoming responses and persists the cookies.
Then when a request is on its way out it sets the <code>Cookie</code> header appropriately so the stored cookies are sent along with it.
It's <a href="https://github.com/scrapy/scrapy/blob/129421c7e31b89b9b0f9c5f7d8ae59e47df36091/scrapy/downloadermiddlewares/cookies.py" target="_blank" rel="noopener noreferrer" class="">a little more complicated</a> than that because of expirations and stuff but you get the idea.</p>
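<p>Stripped of expirations, domains, and everything else that makes the real thing complicated, the idea looks something like the sketch below. <code>ToyCookieJar</code> is a made-up class that works on plain header dictionaries; it only borrows the method names and is not Scrapy's implementation.</p>

```python
from http.cookies import SimpleCookie


class ToyCookieJar:
    """Illustrative only: remember cookies from responses, replay them on requests."""

    def __init__(self):
        self.cookies = {}

    def process_response(self, response_headers):
        """Persist any cookies that the server asked us to set."""
        for header in response_headers.get("Set-Cookie", []):
            parsed = SimpleCookie()
            parsed.load(header)
            for name, morsel in parsed.items():
                self.cookies[name] = morsel.value

    def process_request(self, request_headers):
        """Attach the stored cookies to an outgoing request."""
        if self.cookies:
            request_headers["Cookie"] = "; ".join(
                f"{name}={value}" for name, value in self.cookies.items()
            )
        return request_headers


jar = ToyCookieJar()
jar.process_response({"Set-Cookie": ["session=abc123; Path=/", "theme=dark"]})
headers = jar.process_request({})
print(headers)
```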
<p>Another fairly basic one is the <code>RedirectMiddleware</code> which handles, wait for it... <code>3XX</code> redirects.
This one lets any non-<code>3XX</code> status code responses happily bubble through but what if there is a redirect?
The only way that it can figure out how the server responds to the redirect URL is to create a new request, so that's exactly what it does.
When the <code>process_response(request, response, spider)</code> method returns a request object instead of a response then the current response is dropped and everything starts over with the new request.
That's how the <code>RedirectMiddleware</code> handles the redirects and it's a feature that we'll be using shortly.</p>
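<p>Here's a small sketch of that "return a request and start over" behavior. The <code>ToyRequest</code> and <code>ToyResponse</code> classes, the <code>fetch</code> loop, and the fake server are all stand-ins invented for this example, so treat it as a picture of the control flow instead of Scrapy's actual code.</p>

```python
from dataclasses import dataclass


@dataclass
class ToyRequest:
    url: str


@dataclass
class ToyResponse:
    url: str
    status: int
    location: str = ""


def toy_redirect_middleware(request, response):
    """Return a new request for 3XX responses, otherwise pass the response on."""
    if response.status in range(300, 400) and response.location:
        return ToyRequest(url=response.location)
    return response


def fetch(request, server, max_redirects=5):
    """Keep starting over with the new request until a real response comes back."""
    for _ in range(max_redirects + 1):
        result = toy_redirect_middleware(request, server(request))
        if isinstance(result, ToyResponse):
            return result
        request = result  # the old response is dropped and we go again
    raise RuntimeError("too many redirects")


def fake_server(request):
    """Pretend server that redirects /old to /new."""
    if request.url == "/old":
        return ToyResponse(url="/old", status=302, location="/new")
    return ToyResponse(url=request.url, status=200)


final = fetch(ToyRequest("/old"), fake_server)
print(final.url, final.status)
```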
<p>If it was surprising at all to you that there are so many downloader middlewares enabled by default then you might be interested in checking out the <a href="https://doc.scrapy.org/en/latest/topics/architecture.html" target="_blank" rel="noopener noreferrer" class="">Architecture Overview</a>.
There's actually kind of a lot of other stuff going on but, again, one of the great things about scrapy is that you don't <em>have to</em> know anything about most of it.
Just like you didn't even need to know that downloader middlewares existed to write a functional spider, you don't need to know about these other parts to write a functional downloader middleware.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-hard-problems">The Hard Problem(s)<a href="https://sangaline.com/blog/advanced-web-scraping-tutorial#the-hard-problems" class="hash-link" aria-label="Direct link to The Hard Problem(s)" title="Direct link to The Hard Problem(s)" translate="no">​</a></h2>
<p>Getting back to our scraper, we found that we were being redirected to some <code>threat_defense.php?defense=1&amp;...</code> URL instead of receiving the page that we were looking for.
When we visit this page in the browser, we see something like this for a few seconds</p>
<p><img decoding="async" loading="lazy" alt="A javascript redirect" src="https://sangaline.com/assets/images/redirecting-05d19cc34b5d6db4b5e343516830bdb5.png" width="694" height="195" class="img_ev3q"></p>
<p>before getting redirected to a <code>threat_defense.php?defense=2&amp;...</code> page that looks more like this</p>
<p><img decoding="async" loading="lazy" alt="A captcha box" src="https://sangaline.com/assets/images/captcha-screenshot-074380d8d0fafc7997e6082ed44bd8fd.png" width="434" height="201" class="img_ev3q"></p>
<p>A look at the source of the first page shows that there is some javascript code responsible for constructing a special redirect URL and also for manually constructing browser cookies.
If we're going to get through this then we'll have to handle both of these tasks.</p>
<p>Then, of course, we also have to solve the captcha and submit the answer.
If we happen to get it wrong then we're sometimes redirected to another captcha page and other times we end up on a page that looks like this</p>
<p><img decoding="async" loading="lazy" alt="Retry the captcha" src="https://sangaline.com/assets/images/captcha-retry-48b28caecdbfd2951f418643e9e7f5c7.png" width="624" height="117" class="img_ev3q"></p>
<p>where we need to click on the "Click here" link to start the whole redirect cycle over.
Piece of cake, right?</p>
<p>All of our problems sort of stem from that initial <code>302</code> redirect and so a natural place to handle them is within a customized version of the <a href="https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.redirect" target="_blank" rel="noopener noreferrer" class="">redirect middleware</a>.
We want our middleware to act like the normal redirect middleware in all cases except for when there's a <code>302</code> to the <code>threat_defense.php</code> page.
When it does encounter that special <code>302</code>, we want it to bypass all of this threat defense stuff, attach the access cookies to the session, and finally re-request the original page.
If we can pull that off then our spider doesn't have to know about any of this business and requests will "just work."</p>
<p>So open up <code>zipru_scraper/middlewares.py</code> and replace the contents with</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token keyword" style="color:#ea6962">import</span><span class="token plain"> os</span><span class="token punctuation">,</span><span class="token plain"> tempfile</span><span class="token punctuation">,</span><span class="token plain"> time</span><span class="token punctuation">,</span><span class="token plain"> sys</span><span class="token punctuation">,</span><span class="token plain"> logging</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">logger </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> logging</span><span class="token punctuation">.</span><span class="token plain">getLogger</span><span class="token punctuation">(</span><span class="token plain">__name__</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token keyword" style="color:#ea6962">import</span><span class="token plain"> dryscrape</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token keyword" style="color:#ea6962">import</span><span class="token plain"> pytesseract</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token keyword" style="color:#ea6962">from</span><span class="token plain"> PIL </span><span class="token keyword" 
style="color:#ea6962">import</span><span class="token plain"> Image</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token keyword" style="color:#ea6962">from</span><span class="token plain"> scrapy</span><span class="token punctuation">.</span><span class="token plain">downloadermiddlewares</span><span class="token punctuation">.</span><span class="token plain">redirect </span><span class="token keyword" style="color:#ea6962">import</span><span class="token plain"> RedirectMiddleware</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token keyword" style="color:#ea6962">class</span><span class="token plain"> </span><span class="token class-name" style="color:#d8a657">ThreatDefenceRedirectMiddleware</span><span class="token punctuation">(</span><span class="token plain">RedirectMiddleware</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token keyword" style="color:#ea6962">def</span><span class="token plain"> </span><span class="token function" style="color:#d8a657">_redirect</span><span class="token punctuation">(</span><span class="token plain">self</span><span class="token punctuation">,</span><span class="token plain"> redirected</span><span class="token punctuation">,</span><span class="token plain"> request</span><span class="token punctuation">,</span><span class="token plain"> spider</span><span class="token punctuation">,</span><span class="token plain"> reason</span><span class="token punctuation">)</span><span class="token 
punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token comment" style="color:#a89984"># act normally if this isn't a threat defense redirect</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token keyword" style="color:#ea6962">if</span><span class="token plain"> </span><span class="token keyword" style="color:#ea6962">not</span><span class="token plain"> self</span><span class="token punctuation">.</span><span class="token plain">is_threat_defense_url</span><span class="token punctuation">(</span><span class="token plain">redirected</span><span class="token punctuation">.</span><span class="token plain">url</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            </span><span class="token keyword" style="color:#ea6962">return</span><span class="token plain"> </span><span class="token builtin" style="color:#d8a657">super</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span><span class="token plain">_redirect</span><span class="token punctuation">(</span><span class="token plain">redirected</span><span class="token punctuation">,</span><span class="token plain"> request</span><span class="token punctuation">,</span><span class="token plain"> spider</span><span class="token punctuation">,</span><span class="token plain"> reason</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        
logger</span><span class="token punctuation">.</span><span class="token plain">debug</span><span class="token punctuation">(</span><span class="token string-interpolation string" style="color:#89b482">f'Zipru threat defense triggered for </span><span class="token string-interpolation interpolation punctuation">{</span><span class="token string-interpolation interpolation">request</span><span class="token string-interpolation interpolation punctuation">.</span><span class="token string-interpolation interpolation">url</span><span class="token string-interpolation interpolation punctuation">}</span><span class="token string-interpolation string" style="color:#89b482">'</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        request</span><span class="token punctuation">.</span><span class="token plain">cookies </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> self</span><span class="token punctuation">.</span><span class="token plain">bypass_threat_defense</span><span class="token punctuation">(</span><span class="token plain">redirected</span><span class="token punctuation">.</span><span class="token plain">url</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        request</span><span class="token punctuation">.</span><span class="token plain">dont_filter </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> </span><span class="token boolean" style="color:#ea6962">True</span><span class="token plain"> </span><span class="token comment" style="color:#a89984"># prevents the original link being marked a dupe</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token 
keyword" style="color:#ea6962">return</span><span class="token plain"> request</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token keyword" style="color:#ea6962">def</span><span class="token plain"> </span><span class="token function" style="color:#d8a657">is_threat_defense_url</span><span class="token punctuation">(</span><span class="token plain">self</span><span class="token punctuation">,</span><span class="token plain"> url</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token keyword" style="color:#ea6962">return</span><span class="token plain"> </span><span class="token string" style="color:#89b482">'://zipru.to/threat_defense.php'</span><span class="token plain"> </span><span class="token keyword" style="color:#ea6962">in</span><span class="token plain"> url</span><br></span></code></pre></div></div>
<p>You'll notice that we're subclassing <code>RedirectMiddleware</code> instead of writing a downloader middleware from scratch.
This allows us to reuse most of the built in redirect handling and insert our code into <code>_redirect(redirected, request, spider, reason)</code> which is only called from <code>process_response(request, response, spider)</code> once a redirect request has been constructed.
We just defer to the super-class implementation here for standard redirects but the special threat defense redirects get handled differently.
We haven't implemented <code>bypass_threat_defense(url)</code> yet but we can see that it should return the access cookies which will be attached to the original request and that the original request will then be reprocessed.</p>
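<p>One small caveat about <code>is_threat_defense_url(url)</code>: a substring test would also match a URL that merely mentions the threat defense page somewhere in its query string. That's unlikely to matter here, but if you ever want something stricter then a standalone sketch along these lines should do it (this variant is my own assumption about what "stricter" might mean, shown as a plain function to keep it self-contained).</p>

```python
from urllib.parse import urlsplit


def is_threat_defense_url(url):
    """Stricter variant: compare the parsed host and path instead of a substring."""
    parts = urlsplit(url)
    return parts.hostname == "zipru.to" and parts.path == "/threat_defense.php"


print(is_threat_defense_url("http://zipru.to/threat_defense.php?defense=1"))
print(is_threat_defense_url("http://zipru.to/torrents.php?category=TV"))
print(is_threat_defense_url("http://example.com/?next=http://zipru.to/threat_defense.php"))
```

<p>The last URL is the interesting one: the substring check would flag it even though the request is going to a completely different host.</p>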
<p>To enable our new middleware we'll need to add the following to <code>zipru_scraper/settings.py</code>.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token plain">DOWNLOADER_MIDDLEWARES </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> </span><span class="token punctuation">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token string" style="color:#89b482">'scrapy.downloadermiddlewares.redirect.RedirectMiddleware'</span><span class="token punctuation">:</span><span class="token plain"> </span><span class="token boolean" style="color:#ea6962">None</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token string" style="color:#89b482">'zipru_scraper.middlewares.ThreatDefenceRedirectMiddleware'</span><span class="token punctuation">:</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">600</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token punctuation">}</span><br></span></code></pre></div></div>
<p>This disables the default redirect middleware and plugs ours in at the exact same position in the middleware stack.
We'll also have to install a few additional packages that we're importing but not actually using yet.</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token plain">pip </span><span class="token function" style="color:#d8a657">install</span><span class="token plain"> dryscrape </span><span class="token comment" style="color:#a89984"># headless webkit</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">pip </span><span class="token function" style="color:#d8a657">install</span><span class="token plain"> Pillow </span><span class="token comment" style="color:#a89984"># image processing</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">pip </span><span class="token function" style="color:#d8a657">install</span><span class="token plain"> pytesseract </span><span class="token comment" style="color:#a89984"># OCR</span><br></span></code></pre></div></div>
<p>Note that all three of these are packages with external dependencies that pip can't handle.
If you run into errors then you may need to visit the <a href="http://dryscrape.readthedocs.io/en/latest/installation.html" target="_blank" rel="noopener noreferrer" class="">dryscrape</a>, <a href="http://pillow.readthedocs.io/en/3.4.x/installation.html" target="_blank" rel="noopener noreferrer" class="">Pillow</a>, and <a href="https://github.com/madmaze/pytesseract" target="_blank" rel="noopener noreferrer" class="">pytesseract</a> installation guides to follow platform specific instructions.</p>
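<p>If you'd like a quick sanity check before moving on, something like the following sketch will tell you which pieces still look missing. It only confirms that the Python modules can be found and that a <code>tesseract</code> executable is on your <code>PATH</code> (pytesseract is a wrapper around that binary); it won't catch every platform specific problem, and <code>missing_modules</code> is just a helper name I made up.</p>

```python
import importlib.util
import shutil


def missing_modules(names):
    """Return the module names that Python can't currently find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Pillow is imported as PIL, so that's the name to look for.
required = ["dryscrape", "PIL", "pytesseract"]
missing = missing_modules(required)
if missing:
    print("Still missing:", ", ".join(missing))
else:
    print("All three modules can be found.")

if shutil.which("tesseract") is None:
    print("The tesseract executable doesn't appear to be on your PATH.")
```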
<p>Our middleware should be functioning in place of the standard redirect middleware behavior now; we just need to implement <code>bypass_threat_defense(url)</code>.
We could parse the javascript to get the variables that we need and recreate the logic in python but that seems pretty fragile and is a lot of work.
Let's take the easier, though perhaps clunkier, approach of using a headless webkit instance.
There are a few different options but I personally like <a href="https://dryscrape.readthedocs.io/en/latest/index.html" target="_blank" rel="noopener noreferrer" class="">dryscrape</a> (which we already installed).</p>
<p>First off, let's initialize a dryscrape session in our middleware constructor.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token keyword" style="color:#ea6962">def</span><span class="token plain"> </span><span class="token function" style="color:#d8a657">__init__</span><span class="token punctuation">(</span><span class="token plain">self</span><span class="token punctuation">,</span><span class="token plain"> settings</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token builtin" style="color:#d8a657">super</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span><span class="token plain">__init__</span><span class="token punctuation">(</span><span class="token plain">settings</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token comment" style="color:#a89984"># start xvfb to support headless scraping</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token keyword" style="color:#ea6962">if</span><span class="token plain"> </span><span class="token string" style="color:#89b482">'linux'</span><span class="token plain"> </span><span class="token keyword" 
style="color:#ea6962">in</span><span class="token plain"> sys</span><span class="token punctuation">.</span><span class="token plain">platform</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            dryscrape</span><span class="token punctuation">.</span><span class="token plain">start_xvfb</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        self</span><span class="token punctuation">.</span><span class="token plain">dryscrape_session </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> dryscrape</span><span class="token punctuation">.</span><span class="token plain">Session</span><span class="token punctuation">(</span><span class="token plain">base_url</span><span class="token operator" style="color:#a89984">=</span><span class="token string" style="color:#89b482">'http://zipru.to'</span><span class="token punctuation">)</span><br></span></code></pre></div></div>
<p>You can think of this session as a single browser tab that does all of the stuff that a browser would typically do (<em>e.g.</em> fetch external resources, execute scripts).
We can navigate to new URLs in the tab, click on things, enter text into inputs, and all sorts of other things.
Scrapy supports concurrent requests and item processing but the response processing is single threaded.
This means that we can use this single dryscrape session without having to worry about being thread safe.</p>
<p>So now let's sketch out the basic logic of bypassing the threat defense.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token keyword" style="color:#ea6962">def</span><span class="token plain"> </span><span class="token function" style="color:#d8a657">bypass_threat_defense</span><span class="token punctuation">(</span><span class="token plain">self</span><span class="token punctuation">,</span><span class="token plain"> url</span><span class="token operator" style="color:#a89984">=</span><span class="token boolean" style="color:#ea6962">None</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token comment" style="color:#a89984"># only navigate if any explicit url is provided</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token keyword" style="color:#ea6962">if</span><span class="token plain"> url</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            self</span><span class="token punctuation">.</span><span class="token plain">dryscrape_session</span><span class="token punctuation">.</span><span class="token plain">visit</span><span class="token punctuation">(</span><span class="token plain">url</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" 
style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token comment" style="color:#a89984"># solve the captcha if there is one</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        captcha_images </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> self</span><span class="token punctuation">.</span><span class="token plain">dryscrape_session</span><span class="token punctuation">.</span><span class="token plain">css</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">'img[src *= captcha]'</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token keyword" style="color:#ea6962">if</span><span class="token plain"> </span><span class="token builtin" style="color:#d8a657">len</span><span class="token punctuation">(</span><span class="token plain">captcha_images</span><span class="token punctuation">)</span><span class="token plain"> </span><span class="token operator" style="color:#a89984">&gt;</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">0</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            </span><span class="token keyword" style="color:#ea6962">return</span><span class="token plain"> self</span><span class="token punctuation">.</span><span class="token plain">solve_captcha</span><span class="token punctuation">(</span><span class="token plain">captcha_images</span><span class="token punctuation">[</span><span class="token number" style="color:#d3869b">0</span><span class="token punctuation">]</span><span class="token 
punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token comment" style="color:#a89984"># click on any explicit retry links</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        retry_links </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> self</span><span class="token punctuation">.</span><span class="token plain">dryscrape_session</span><span class="token punctuation">.</span><span class="token plain">css</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">'a[href *= threat_defense]'</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token keyword" style="color:#ea6962">if</span><span class="token plain"> </span><span class="token builtin" style="color:#d8a657">len</span><span class="token punctuation">(</span><span class="token plain">retry_links</span><span class="token punctuation">)</span><span class="token plain"> </span><span class="token operator" style="color:#a89984">&gt;</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">0</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            </span><span class="token keyword" style="color:#ea6962">return</span><span class="token plain"> self</span><span class="token punctuation">.</span><span class="token plain">bypass_threat_defense</span><span class="token punctuation">(</span><span class="token plain">retry_links</span><span 
class="token punctuation">[</span><span class="token number" style="color:#d3869b">0</span><span class="token punctuation">]</span><span class="token punctuation">.</span><span class="token plain">get_attr</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">'href'</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token comment" style="color:#a89984"># otherwise, we're on a redirect page so wait for the redirect and try again</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        self</span><span class="token punctuation">.</span><span class="token plain">wait_for_redirect</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token keyword" style="color:#ea6962">return</span><span class="token plain"> self</span><span class="token punctuation">.</span><span class="token plain">bypass_threat_defense</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token keyword" style="color:#ea6962">def</span><span class="token plain"> </span><span class="token function" style="color:#d8a657">wait_for_redirect</span><span class="token punctuation">(</span><span class="token plain">self</span><span 
class="token punctuation">,</span><span class="token plain"> url </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> </span><span class="token boolean" style="color:#ea6962">None</span><span class="token punctuation">,</span><span class="token plain"> wait </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">0.1</span><span class="token punctuation">,</span><span class="token plain"> timeout</span><span class="token operator" style="color:#a89984">=</span><span class="token number" style="color:#d3869b">10</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        url </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> url </span><span class="token keyword" style="color:#ea6962">or</span><span class="token plain"> self</span><span class="token punctuation">.</span><span class="token plain">dryscrape_session</span><span class="token punctuation">.</span><span class="token plain">url</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token keyword" style="color:#ea6962">for</span><span class="token plain"> i </span><span class="token keyword" style="color:#ea6962">in</span><span class="token plain"> </span><span class="token builtin" style="color:#d8a657">range</span><span class="token punctuation">(</span><span class="token builtin" style="color:#d8a657">int</span><span class="token punctuation">(</span><span class="token plain">timeout</span><span class="token operator" style="color:#a89984">//</span><span class="token plain">wait</span><span class="token 
punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            time</span><span class="token punctuation">.</span><span class="token plain">sleep</span><span class="token punctuation">(</span><span class="token plain">wait</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            </span><span class="token keyword" style="color:#ea6962">if</span><span class="token plain"> self</span><span class="token punctuation">.</span><span class="token plain">dryscrape_session</span><span class="token punctuation">.</span><span class="token plain">url</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token plain"> </span><span class="token operator" style="color:#a89984">!=</span><span class="token plain"> url</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">                </span><span class="token keyword" style="color:#ea6962">return</span><span class="token plain"> self</span><span class="token punctuation">.</span><span class="token plain">dryscrape_session</span><span class="token punctuation">.</span><span class="token plain">url</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        logger</span><span class="token punctuation">.</span><span class="token plain">error</span><span class="token punctuation">(</span><span class="token string-interpolation string" style="color:#89b482">f'Maybe </span><span class="token string-interpolation interpolation punctuation">{</span><span 
class="token string-interpolation interpolation">self</span><span class="token string-interpolation interpolation punctuation">.</span><span class="token string-interpolation interpolation">dryscrape_session</span><span class="token string-interpolation interpolation punctuation">.</span><span class="token string-interpolation interpolation">url</span><span class="token string-interpolation interpolation punctuation">(</span><span class="token string-interpolation interpolation punctuation">)</span><span class="token string-interpolation interpolation punctuation">}</span><span class="token string-interpolation string" style="color:#89b482"> isn\'t a redirect URL?'</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token keyword" style="color:#ea6962">raise</span><span class="token plain"> Exception</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">'Timed out on the zipru redirect page.'</span><span class="token punctuation">)</span><br></span></code></pre></div></div>
<p>This handles all of the different cases that we encountered in the browser and does exactly what a human would do in each of them.
The action taken at any given point depends only on the current page, so this approach handles variations in the sequence of pages somewhat gracefully.</p>
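<p>Because each decision depends only on what is on the current page, the dispatch step can be pulled out into a pure function and checked without a browser. The following is a minimal sketch under that assumption; <code>Action</code> and <code>choose_action</code> are hypothetical names for illustration and are not part of the middleware.</p>

```python
from enum import Enum


class Action(Enum):
    """The three things the bypass logic can do on a threat defense page."""
    SOLVE_CAPTCHA = 'solve_captcha'
    FOLLOW_RETRY_LINK = 'follow_retry_link'
    WAIT_FOR_REDIRECT = 'wait_for_redirect'


def choose_action(captcha_count, retry_link_count):
    """Pick the next step from the current page alone (hypothetical helper)."""
    # a captcha takes priority, mirroring the order of checks in the method
    if captcha_count:
        return Action.SOLVE_CAPTCHA
    # next, prefer an explicit retry link if the page offers one
    if retry_link_count:
        return Action.FOLLOW_RETRY_LINK
    # otherwise assume we are on a redirect page
    return Action.WAIT_FOR_REDIRECT


# the same page features always produce the same action,
# regardless of how we arrived at the page
print(choose_action(captcha_count=1, retry_link_count=1))
print(choose_action(captcha_count=0, retry_link_count=2))
print(choose_action(captcha_count=0, retry_link_count=0))
```

<p>Keeping the decision separate from the browser calls also makes it straightforward to replace the recursion with a loop that has a maximum number of attempts, should unbounded retries ever become a concern.</p>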
<p>The last piece of the puzzle is to actually solve the captcha.
There are <a href="https://anti-captcha.com/" target="_blank" rel="noopener noreferrer" class="">captcha solving services</a> out there with APIs that you can use in a pinch, but this captcha is simple enough that we can just solve it using OCR.
Using pytesseract for the OCR, we can finally add our <code>solve_captcha(img)</code> method and complete the <code>bypass_threat_defense()</code> functionality.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token keyword" style="color:#ea6962">def</span><span class="token plain"> </span><span class="token function" style="color:#d8a657">solve_captcha</span><span class="token punctuation">(</span><span class="token plain">self</span><span class="token punctuation">,</span><span class="token plain"> img</span><span class="token punctuation">,</span><span class="token plain"> width</span><span class="token operator" style="color:#a89984">=</span><span class="token number" style="color:#d3869b">1280</span><span class="token punctuation">,</span><span class="token plain"> height</span><span class="token operator" style="color:#a89984">=</span><span class="token number" style="color:#d3869b">800</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token comment" style="color:#a89984"># take a screenshot of the page</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        self</span><span class="token punctuation">.</span><span class="token plain">dryscrape_session</span><span class="token punctuation">.</span><span class="token plain">set_viewport_size</span><span class="token punctuation">(</span><span class="token plain">width</span><span class="token punctuation">,</span><span class="token plain"> height</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span 
class="token-line" style="color:#ebdbb2"><span class="token plain">        filename </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> tempfile</span><span class="token punctuation">.</span><span class="token plain">mktemp</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">'.png'</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        self</span><span class="token punctuation">.</span><span class="token plain">dryscrape_session</span><span class="token punctuation">.</span><span class="token plain">render</span><span class="token punctuation">(</span><span class="token plain">filename</span><span class="token punctuation">,</span><span class="token plain"> width</span><span class="token punctuation">,</span><span class="token plain"> height</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token comment" style="color:#a89984"># inject javascript to find the bounds of the captcha</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        js </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> </span><span class="token string" style="color:#89b482">'document.querySelector("img[src *= captcha]").getBoundingClientRect()'</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        rect </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> self</span><span class="token punctuation">.</span><span 
class="token plain">dryscrape_session</span><span class="token punctuation">.</span><span class="token plain">eval_script</span><span class="token punctuation">(</span><span class="token plain">js</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        box </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> </span><span class="token punctuation">(</span><span class="token builtin" style="color:#d8a657">int</span><span class="token punctuation">(</span><span class="token plain">rect</span><span class="token punctuation">[</span><span class="token string" style="color:#89b482">'left'</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">,</span><span class="token plain"> </span><span class="token builtin" style="color:#d8a657">int</span><span class="token punctuation">(</span><span class="token plain">rect</span><span class="token punctuation">[</span><span class="token string" style="color:#89b482">'top'</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">,</span><span class="token plain"> </span><span class="token builtin" style="color:#d8a657">int</span><span class="token punctuation">(</span><span class="token plain">rect</span><span class="token punctuation">[</span><span class="token string" style="color:#89b482">'right'</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">,</span><span class="token plain"> </span><span class="token builtin" style="color:#d8a657">int</span><span class="token punctuation">(</span><span class="token plain">rect</span><span class="token punctuation">[</span><span class="token string" style="color:#89b482">'bottom'</span><span class="token punctuation">]</span><span class="token 
punctuation">)</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token comment" style="color:#a89984"># solve the captcha in the screenshot</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        image </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> Image</span><span class="token punctuation">.</span><span class="token builtin" style="color:#d8a657">open</span><span class="token punctuation">(</span><span class="token plain">filename</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        os</span><span class="token punctuation">.</span><span class="token plain">unlink</span><span class="token punctuation">(</span><span class="token plain">filename</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        captcha_image </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> image</span><span class="token punctuation">.</span><span class="token plain">crop</span><span class="token punctuation">(</span><span class="token plain">box</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        captcha </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> pytesseract</span><span class="token punctuation">.</span><span class="token plain">image_to_string</span><span class="token 
punctuation">(</span><span class="token plain">captcha_image</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        logger</span><span class="token punctuation">.</span><span class="token plain">debug</span><span class="token punctuation">(</span><span class="token string-interpolation string" style="color:#89b482">f'Solved the Zipru captcha: "</span><span class="token string-interpolation interpolation punctuation">{</span><span class="token string-interpolation interpolation">captcha</span><span class="token string-interpolation interpolation punctuation">}</span><span class="token string-interpolation string" style="color:#89b482">"'</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token comment" style="color:#a89984"># submit the captcha</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token builtin" style="color:#d8a657">input</span><span class="token plain"> </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> self</span><span class="token punctuation">.</span><span class="token plain">dryscrape_session</span><span class="token punctuation">.</span><span class="token plain">xpath</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">'//input[@id = "solve_string"]'</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number" style="color:#d3869b">0</span><span class="token punctuation">]</span><span class="token plain"></span><br></span><span 
class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token builtin" style="color:#d8a657">input</span><span class="token punctuation">.</span><span class="token builtin" style="color:#d8a657">set</span><span class="token punctuation">(</span><span class="token plain">captcha</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        button </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> self</span><span class="token punctuation">.</span><span class="token plain">dryscrape_session</span><span class="token punctuation">.</span><span class="token plain">xpath</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">'//button[@id = "button_submit"]'</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number" style="color:#d3869b">0</span><span class="token punctuation">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        url </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> self</span><span class="token punctuation">.</span><span class="token plain">dryscrape_session</span><span class="token punctuation">.</span><span class="token plain">url</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        button</span><span class="token punctuation">.</span><span class="token plain">click</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" 
style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token comment" style="color:#a89984"># try again if we redirect to a threat defense URL</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token keyword" style="color:#ea6962">if</span><span class="token plain"> self</span><span class="token punctuation">.</span><span class="token plain">is_threat_defense_url</span><span class="token punctuation">(</span><span class="token plain">self</span><span class="token punctuation">.</span><span class="token plain">wait_for_redirect</span><span class="token punctuation">(</span><span class="token plain">url</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            </span><span class="token keyword" style="color:#ea6962">return</span><span class="token plain"> self</span><span class="token punctuation">.</span><span class="token plain">bypass_threat_defense</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token comment" style="color:#a89984"># otherwise return the cookies as a dict</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        cookies </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> </span><span class="token punctuation">{</span><span class="token 
punctuation">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token keyword" style="color:#ea6962">for</span><span class="token plain"> cookie_string </span><span class="token keyword" style="color:#ea6962">in</span><span class="token plain"> self</span><span class="token punctuation">.</span><span class="token plain">dryscrape_session</span><span class="token punctuation">.</span><span class="token plain">cookies</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            </span><span class="token keyword" style="color:#ea6962">if</span><span class="token plain"> </span><span class="token string" style="color:#89b482">'domain=zipru.to'</span><span class="token plain"> </span><span class="token keyword" style="color:#ea6962">in</span><span class="token plain"> cookie_string</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">                key</span><span class="token punctuation">,</span><span class="token plain"> value </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> cookie_string</span><span class="token punctuation">.</span><span class="token plain">split</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">';'</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number" style="color:#d3869b">0</span><span class="token punctuation">]</span><span class="token punctuation">.</span><span class="token plain">split</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">'='</span><span 
class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">                cookies</span><span class="token punctuation">[</span><span class="token plain">key</span><span class="token punctuation">]</span><span class="token plain"> </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> value</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token keyword" style="color:#ea6962">return</span><span class="token plain"> cookies</span><br></span></code></pre></div></div>
<p>You can see that if the captcha solving fails for some reason, this delegates back to the <code>bypass_threat_defense()</code> method.
This grants us multiple captcha attempts where necessary because we can always keep bouncing around through the verification process until we get one right.</p>
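<p>One detail in the cookie handling at the end of <code>solve_captcha()</code> is worth a note: unpacking <code>split('=')</code> into exactly two names would raise a <code>ValueError</code> if a cookie value ever contained an <code>=</code> of its own. A slightly more defensive variant is sketched below. It assumes the cookie strings have the <code>name=value; domain=...</code> shape that the method above relies on, and <code>parse_cookie_strings</code> is a hypothetical helper name, not something dryscrape provides.</p>

```python
def parse_cookie_strings(cookie_strings, domain='zipru.to'):
    """Turn raw 'name=value; domain=...' strings into a dict (hypothetical helper)."""
    cookies = {}
    for cookie_string in cookie_strings:
        # skip cookies that belong to other domains
        if f'domain={domain}' not in cookie_string:
            continue
        # partition splits on the first '=' only, so values containing '=' survive
        key, separator, value = cookie_string.split(';')[0].partition('=')
        if separator:
            cookies[key.strip()] = value.strip()
    return cookies


# made-up sample strings, only to illustrate the parsing
raw = [
    'token=abc==; domain=zipru.to; path=/',
    'other=1; domain=example.com; path=/',
]
print(parse_cookie_strings(raw))  # prints {'token': 'abc=='}
```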
<p>This <em>should</em> be enough to get our scraper working, but instead it gets caught in an infinite loop.</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token plain">[scrapy.core.engine] DEBUG: Crawled (200) &lt;GET http://zipru.to/robots.txt&gt; (referer: None)</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">[zipru_scraper.middlewares] DEBUG: Zipru threat defense triggered for http://zipru.to/torrents.php?category=TV</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">[zipru_scraper.middlewares] DEBUG: Solved the Zipru captcha: "UJM39"</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">[zipru_scraper.middlewares] DEBUG: Zipru threat defense triggered for http://zipru.to/torrents.php?category=TV</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">[zipru_scraper.middlewares] DEBUG: Solved the Zipru captcha: "TQ9OG"</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">[zipru_scraper.middlewares] DEBUG: Zipru threat defense triggered for http://zipru.to/torrents.php?category=TV</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">[zipru_scraper.middlewares] DEBUG: Solved the Zipru captcha: "KH9A8"</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">...</span><br></span></code></pre></div></div>
<p>It at least looks like our middleware is successfully solving the captcha and then reissuing the request.
The problem is that the new request is <em>triggering the threat defense again</em>.
My first thought was that I had some bug in how I was parsing or attaching the cookies, but I triple-checked this and the code is fine.
This is another one of those "the only things that could possibly be different are the headers" situations.</p>
<p>The headers for scrapy and dryscrape are obviously both getting past the initial filter that triggers <code>403</code> responses because we're not getting any <code>403</code> responses.
The repeated threat defense trigger must somehow be caused by the fact that the two sets of headers <em>are different</em> from each other.
My guess is that one of the encrypted access cookies includes a hash of the complete headers and that a request will trigger the threat defense if it doesn't match.
The intention here might be to help prevent somebody from just copying the cookies from their browser into a scraper but it also just adds one more thing that you need to get around.</p>
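<p>One way to test a suspicion like this is to capture the headers that each client actually sends, for example by pointing both at a local server that echoes them back, and then diff the two sets. The following is a minimal sketch of the comparison step only; <code>diff_headers</code> is a hypothetical helper and the sample values are made up for illustration rather than captured from scrapy or dryscrape.</p>

```python
def diff_headers(first, second):
    """Return {name: (first_value, second_value)} for headers that differ."""
    # header names are case-insensitive, so normalize them before comparing
    a = {name.lower(): value for name, value in first.items()}
    b = {name.lower(): value for name, value in second.items()}
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


# made-up sample headers; a header missing on one side shows up as None
client_one = {'Accept-Language': 'en', 'Accept-Encoding': 'gzip, deflate'}
client_two = {'accept-language': 'en-US,*', 'Accept-Encoding': 'gzip, deflate',
              'Connection': 'Keep-Alive'}
for name, values in diff_headers(client_one, client_two).items():
    print(name, values)
```

<p>Any header that shows up in the diff is a candidate for whatever the site is checking, which is what motivates pinning the headers down explicitly.</p>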
<p>So let's specify our headers explicitly in <code>zipru_scraper/settings.py</code> like so.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token plain">DEFAULT_REQUEST_HEADERS </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> </span><span class="token punctuation">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token string" style="color:#89b482">'Accept'</span><span class="token punctuation">:</span><span class="token plain"> </span><span class="token string" style="color:#89b482">'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token string" style="color:#89b482">'User-Agent'</span><span class="token punctuation">:</span><span class="token plain"> USER_AGENT</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token string" style="color:#89b482">'Connection'</span><span class="token punctuation">:</span><span class="token plain"> </span><span class="token string" style="color:#89b482">'Keep-Alive'</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token string" style="color:#89b482">'Accept-Encoding'</span><span class="token punctuation">:</span><span class="token plain"> </span><span class="token string" 
style="color:#89b482">'gzip, deflate'</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token string" style="color:#89b482">'Accept-Language'</span><span class="token punctuation">:</span><span class="token plain"> </span><span class="token string" style="color:#89b482">'en-US,*'</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token punctuation">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span></code></pre></div></div>
<p>Note that we're explicitly setting the <code>User-Agent</code> header here to <code>USER_AGENT</code>, which we defined earlier.
This was already being added automatically by the user agent middleware but having all of these in one place makes it easier to duplicate the headers in dryscrape.
We can do that by modifying our <code>ThreatDefenceRedirectMiddleware</code> initializer like so.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token keyword" style="color:#ea6962">def</span><span class="token plain"> </span><span class="token function" style="color:#d8a657">__init__</span><span class="token punctuation">(</span><span class="token plain">self</span><span class="token punctuation">,</span><span class="token plain"> settings</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token builtin" style="color:#d8a657">super</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span><span class="token plain">__init__</span><span class="token punctuation">(</span><span class="token plain">settings</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token comment" style="color:#a89984"># start xvfb to support headless scraping</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token keyword" style="color:#ea6962">if</span><span class="token plain"> </span><span class="token string" style="color:#89b482">'linux'</span><span class="token plain"> </span><span class="token keyword" style="color:#ea6962">in</span><span class="token plain"> 
sys</span><span class="token punctuation">.</span><span class="token plain">platform</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        dryscrape</span><span class="token punctuation">.</span><span class="token plain">start_xvfb</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    self</span><span class="token punctuation">.</span><span class="token plain">dryscrape_session </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> dryscrape</span><span class="token punctuation">.</span><span class="token plain">Session</span><span class="token punctuation">(</span><span class="token plain">base_url</span><span class="token operator" style="color:#a89984">=</span><span class="token string" style="color:#89b482">'http://zipru.to'</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token keyword" style="color:#ea6962">for</span><span class="token plain"> key</span><span class="token punctuation">,</span><span class="token plain"> value </span><span class="token keyword" style="color:#ea6962">in</span><span class="token plain"> settings</span><span class="token punctuation">[</span><span class="token string" style="color:#89b482">'DEFAULT_REQUEST_HEADERS'</span><span class="token punctuation">]</span><span class="token punctuation">.</span><span class="token plain">items</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token 
plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token comment" style="color:#a89984"># seems to be a bug with how webkit-server handles accept-encoding</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token keyword" style="color:#ea6962">if</span><span class="token plain"> key</span><span class="token punctuation">.</span><span class="token plain">lower</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token plain"> </span><span class="token operator" style="color:#a89984">!=</span><span class="token plain"> </span><span class="token string" style="color:#89b482">'accept-encoding'</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">            self</span><span class="token punctuation">.</span><span class="token plain">dryscrape_session</span><span class="token punctuation">.</span><span class="token plain">set_header</span><span class="token punctuation">(</span><span class="token plain">key</span><span class="token punctuation">,</span><span class="token plain"> value</span><span class="token punctuation">)</span><br></span></code></pre></div></div>
<p>Now, when we run our scraper again with <code>scrapy crawl zipru -o torrents.jl</code> we see a steady stream of scraped items and our <code>torrents.jl</code> file records it all.
We've successfully gotten around all of the threat defense mechanisms!</p>
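<p>The <code>.jl</code> extension tells Scrapy to write the feed as JSON Lines, with one JSON object per line. As a quick sanity check, a minimal sketch along these lines could load the file back into Python. Note that <code>load_items</code> is a hypothetical helper name used only for illustration, and the fields in each item depend on whatever the spider yields.</p>

```python
import json


def load_items(path):
    """Read a JSON Lines file and return a list of dicts, one per non-empty line."""
    items = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                items.append(json.loads(line))
    return items
```

<p>Calling <code>load_items('torrents.jl')</code> after a crawl should then give one dictionary per scraped item.</p>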
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="wrap-it-up">Wrap it Up<a href="https://sangaline.com/blog/advanced-web-scraping-tutorial#wrap-it-up" class="hash-link" aria-label="Direct link to Wrap it Up" title="Direct link to Wrap it Up" translate="no">​</a></h2>
<p>We've walked through the process of writing a scraper that can overcome four distinct threat defense mechanisms:</p>
<ol>
<li class="">User agent filtering.</li>
<li class="">Obfuscated javascript redirects.</li>
<li class="">Captchas.</li>
<li class="">Header consistency checks.</li>
</ol>
<p>Our target website Zipru may have been fictional but these are all real anti-scraping techniques that you'll encounter on real sites.
Hopefully you'll find the approach we took useful in your own scraping adventures.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The stories that Hacker News removes from the front page]]></title>
            <link>https://sangaline.com/blog/the-stories-that-hacker-news-removes-from-the-front-page</link>
            <guid>https://sangaline.com/blog/the-stories-that-hacker-news-removes-from-the-front-page</guid>
            <pubDate>Mon, 13 Mar 2017 11:49:21 GMT</pubDate>
            <description><![CDATA[An analysis of which stories are removed from the front page of Hacker News due to moderator intervention.]]></description>
            <content:encoded><![CDATA[<p><strong>UPDATE:</strong> <em>I've spoken to @dang over at Hacker News and he's been extremely understanding and helpful in both explaining and handling the situation.
A new post has been created and it can be found at <a href="https://news.ycombinator.com/item?id=13867739" target="_blank" rel="noopener noreferrer" class="">https://news.ycombinator.com/item?id=13867739</a>.</em></p>
<p><em>A moderator accidentally added "(2010)" to the title of my previous post, users then flagged the story because of this, and it was automatically penalized after hitting a hidden flag threshold.
It sounds like the other submissions were penalized due to excessive flagging and there has been some <a href="https://news.ycombinator.com/item?id=13857086" target="_blank" rel="noopener noreferrer" class="">interesting discussion</a> as to whether some users abuse this and possible solutions.</em></p>
<!-- -->
<br>
<p>I published an article on Friday titled <a class="" href="https://sangaline.com/blog/reverse-engineering-the-hacker-news-ranking-algorithm">Reverse Engineering the Hacker News Ranking Algorithm</a> and posted it on <a href="https://news.ycombinator.com/item?id=13838399" target="_blank" rel="noopener noreferrer" class="">Hacker News</a>.
Now, I know that it's always a crapshoot whether or not something makes it to the front page, but I had a hunch that this article had what it takes.
It was directly relevant to the Hacker News audience in multiple ways, told a fairly interesting analysis story, and provided source code and data to make it easy for readers to try things out themselves.
At the very least, I figured that if I could get enough attention to initially get on to the front page then the article would be fairly well-received.</p>
<p>You can imagine then that I was pretty happy when the story popped onto the front page after 45 minutes or so.
Traffic started flowing in and I was looking forward to having some substantial discussion in the comments section.
Then all of a sudden <em>it was just gone.</em>
It didn't just fall off the front page, it was like it <em>completely disappeared</em> from the rankings.
A few minutes later <em>@awsoutage</em> left a comment noticing the same thing that I did</p>
<p><img decoding="async" loading="lazy" alt="A comment from awsoutage noting that the post was apparently censored, as no realistic human interaction would result in a popular post disappearing from the first 6 pages within moments." src="https://sangaline.com/assets/images/awsoutage-comment-bbb72473a2106168404066ca9cd6184b.png" width="916" height="246" class="img_ev3q"></p>
<p>and, later on, so did <em>@mos_basik</em></p>
<p><img decoding="async" loading="lazy" alt="A comment from mos_basik acknowledging the suppression, saying they understand why the powers that be might not want the article widely spread." src="https://sangaline.com/assets/images/mos_basik-comment-dde4dd33b1dd296ccda025c034f7985d.png" width="924" height="377" class="img_ev3q"></p>
<p>I have to admit that I found it a bit comforting that I wasn't the only one who thought this all seemed a bit fishy.</p>
<p>I was basically rolling in Hacker News data after working on my last article so it seemed pretty natural to take a quick look and see how the ranking of my post changed over time.
The data backed up what <em>@awsoutage</em>, <em>@mos_basik</em>, and I had all noticed.</p>
<p><img decoding="async" loading="lazy" alt="The story front paged briefly and then disappeared" src="https://sangaline.com/assets/images/single-story-trajectory-and-votes-8c91e3a4d96b9784f35f88d6190df3d5.png" width="1200" height="400" class="img_ev3q"></p>
<p>The story made it to the front page, started getting upvoted at a fairly high rate, and then disappeared within minutes.
For comparison, take a look at the trajectories of some more typical stories that made it to the front page.</p>
<p><img decoding="async" loading="lazy" alt="The position trajectories of a few more typical stories" src="https://sangaline.com/assets/images/typical-story-trajectories-5f65b21a518700f3b88971547ab18c0f.png" width="600" height="400" class="img_ev3q"></p>
<p>These seem to make a lot more sense; stories rise up to some peak position and then slowly drift down through the back pages over the course of days until they're gone.
Flagging and other factors affect these trajectories but they still fall off somewhat continuously.
There's also clearly some sort of auto-penalty that kicks in once stories are 15 hours old and other minor things but nothing nearly as extreme as disappearing from the rankings entirely.</p>
<p>My guess here is that moderator intervention was responsible for my post's disappearance.
It was on the front page for a grand total of 9 minutes, received 9 upvotes in that time, and then received an additional 11 upvotes after disappearing as people finished reading the article.
Maybe "Reverse Engineering" in the title made a moderator worry that it was going to discuss how the spam prevention mechanisms worked?
That's the only thing I can think of even though that would be an unfounded fear.
I had purposely avoided analyzing the spam prevention mechanisms and limited the analysis to the components of the algorithm that were already publicly available from posts by <em>@pg</em> and the Arc source code releases.</p>
<p>In any case, this occurrence seemed like a natural opportunity to extend my previous analysis.
I now strongly suspect that there's a mechanism for moderators to remove stories from the top stories section or, at the very least, attach a factor similar to the <code>lightweight-factor*</code> discussed in my last post that is so extreme that it effectively does the same thing.</p>
<p>The data signature of moderator intervention is relatively simple.
Basically, if a story has a normal trajectory on the front page for a while and then instantaneously makes a huge jump downwards or disappears completely then it suggests that a moderator may have manually adjusted the story.
To make this a little more concrete let's limit it to stories that got at least 10 votes, dropped from the front page by over 100 spots in less than 30 seconds, and weren't marked "dead", "dupe", or "flagged" (because those might organically result in the same behavior).
Applying this filter to the data lets us pick out the trajectories of stories that were doing well before being severely penalized.</p>
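<p>The filter described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the exact code behind the plots: the <code>Story</code> container, the <code>looks_suppressed</code> function, the 30-story front page size, and the convention that a position of <code>None</code> means "no longer ranked" are all hypothetical choices made for the example.</p>

```python
from dataclasses import dataclass, field

FRONT_PAGE_SIZE = 30   # assumption: the front page shows the top 30 stories
MIN_VOTES = 10         # "at least 10 votes"
MIN_DROP = 100         # "over 100 spots"
MAX_SECONDS = 30       # "in less than 30 seconds"
EXCLUDED_MARKS = {"dead", "dupe", "flagged"}


@dataclass
class Story:
    votes: int
    # time-ordered (unix_timestamp, position) samples; position is None
    # when the story is no longer ranked at all (assumed convention)
    samples: list
    marks: set = field(default_factory=set)


def looks_suppressed(story):
    """Return True if the trajectory shows a sudden drop from the front page."""
    if story.votes < MIN_VOTES or EXCLUDED_MARKS.intersection(story.marks):
        return False
    # compare each sample with the one that follows it
    for (t0, p0), (t1, p1) in zip(story.samples, story.samples[1:]):
        on_front_page = p0 is not None and p0 <= FRONT_PAGE_SIZE
        fast = (t1 - t0) < MAX_SECONDS
        # a story that vanishes from the rankings counts as an unbounded drop
        big_drop = p1 is None or (p0 is not None and p1 - p0 > MIN_DROP)
        if on_front_page and fast and big_drop:
            return True
    return False
```

<p>A story that sits at position 7 and shows up at position 250 twenty seconds later would match, while one that drifts down a few spots at a time would not.</p>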
<p><img decoding="async" loading="lazy" alt="Position trajectories for other stories with discontinuous behavior" src="https://sangaline.com/assets/images/suppressed-story-trajectories-3d413a32605622ceb8537ece02f5b944.png" width="600" height="400" class="img_ev3q"></p>
<p>As you can see, my post has some company here.
This has happened to 26 out of 1213 front page stories over the last couple of weeks or so.
That's about 2.1% so it's not a particularly common occurrence but it is happening on a daily basis.</p>
<p>Now let's take a closer look at the stories that have moderator fingerprints on them.
Are they spam, are they critical of Y Combinator companies, or what?</p>
<div style="overflow-x:auto"><table><thead><tr><th>Story</th><th>Hours on Front Page</th><th>Top Position</th><th>Votes</th></tr></thead><tbody><tr><td><a href="https://news.ycombinator.com/item?id=13741276" target="_blank" rel="noopener noreferrer" class="">The Fantasyland Code of Professionalism is an abuser's fantasy</a></td><td>0.11</td><td>3</td><td>52</td></tr><tr><td><a href="https://news.ycombinator.com/item?id=13742033" target="_blank" rel="noopener noreferrer" class="">Travel Press Is Reporting a Drop in Tourism to the United States</a></td><td>0.15</td><td>5</td><td>22</td></tr><tr><td><a href="https://news.ycombinator.com/item?id=13838399" target="_blank" rel="noopener noreferrer" class="">Reverse Engineering the Hacker News Ranking Algorithm</a></td><td>0.15</td><td>7</td><td>26</td></tr><tr><td><a href="https://news.ycombinator.com/item?id=13793549" target="_blank" rel="noopener noreferrer" class="">Python Fire: a library for automatically generating command line interfaces</a></td><td>0.25</td><td>8</td><td>19</td></tr><tr><td><a href="https://news.ycombinator.com/item?id=13737695" target="_blank" rel="noopener noreferrer" class="">Australian children's author Mem Fox detained by US border control: 'I sobbed..'</a></td><td>0.29</td><td>21</td><td>22</td></tr><tr><td><a href="https://news.ycombinator.com/item?id=13839134" target="_blank" rel="noopener noreferrer" class="">Books that Aaron Swartz read, loved and hated</a></td><td>0.48</td><td>1</td><td>65</td></tr><tr><td><a href="https://news.ycombinator.com/item?id=13801914" target="_blank" rel="noopener noreferrer" class="">YouTube TV will be huge. 
Apple must respond</a></td><td>0.73</td><td>6</td><td>20</td></tr><tr><td><a href="https://news.ycombinator.com/item?id=13830176" target="_blank" rel="noopener noreferrer" class="">Show HN: App that makes time travel possible</a></td><td>0.76</td><td>13</td><td>33</td></tr><tr><td><a href="https://news.ycombinator.com/item?id=13729678" target="_blank" rel="noopener noreferrer" class="">Collection of Computer Science papers along with their summaries</a></td><td>0.98</td><td>7</td><td>30</td></tr><tr><td><a href="https://news.ycombinator.com/item?id=13731916" target="_blank" rel="noopener noreferrer" class="">Trump's FCC Launches Attack on Net Neutrality Transparency Rules</a></td><td>1.06</td><td>8</td><td>20</td></tr><tr><td><a href="https://news.ycombinator.com/item?id=13831596" target="_blank" rel="noopener noreferrer" class="">Investor to Airbnb CEO: you want liquidity, make it available to everyone (2011)</a></td><td>1.19</td><td>10</td><td>60</td></tr><tr><td><a href="https://news.ycombinator.com/item?id=13803738" target="_blank" rel="noopener noreferrer" class="">Revised executive order bans travelers from six countries from getting new visas</a></td><td>1.25</td><td>3</td><td>66</td></tr><tr><td><a href="https://news.ycombinator.com/item?id=13743883" target="_blank" rel="noopener noreferrer" class="">UK government considering a "cut-off date" for EU citizens' rights to residency</a></td><td>1.38</td><td>4</td><td>126</td></tr><tr><td><a href="https://news.ycombinator.com/item?id=13835356" target="_blank" rel="noopener noreferrer" class="">PEP 308 and why I still hate Python</a></td><td>1.38</td><td>9</td><td>30</td></tr><tr><td><a href="https://news.ycombinator.com/item?id=13777993" target="_blank" rel="noopener noreferrer" class="">Words matter in a sensitive field like security</a></td><td>2</td><td>8</td><td>18</td></tr><tr><td><a href="https://news.ycombinator.com/item?id=13795580" target="_blank" rel="noopener noreferrer" class="">Monzo: All payments 
are failing temporarily</a></td><td>2.62</td><td>11</td><td>42</td></tr><tr><td><a href="https://news.ycombinator.com/item?id=13817770" target="_blank" rel="noopener noreferrer" class="">A Wall Street advertising stunt spotlights a push to get more women on boards</a></td><td>2.94</td><td>13</td><td>23</td></tr><tr><td><a href="https://news.ycombinator.com/item?id=13809959" target="_blank" rel="noopener noreferrer" class="">Famous Russian hacker Kris Kaspersky passed away</a></td><td>3.12</td><td>9</td><td>105</td></tr><tr><td><a href="https://news.ycombinator.com/item?id=13799725" target="_blank" rel="noopener noreferrer" class="">Why It's So Hard to Build the Next Silicon Valley</a></td><td>3.19</td><td>3</td><td>26</td></tr><tr><td><a href="https://news.ycombinator.com/item?id=13745294" target="_blank" rel="noopener noreferrer" class="">Tesla Tanks After Goldman Downgrades to Sell</a></td><td>4.31</td><td>2</td><td>97</td></tr><tr><td><a href="https://news.ycombinator.com/item?id=13710549" target="_blank" rel="noopener noreferrer" class="">David Bowie's list of books he loved in his life</a></td><td>4.86</td><td>6</td><td>46</td></tr><tr><td><a href="https://news.ycombinator.com/item?id=13740781" target="_blank" rel="noopener noreferrer" class="">Ask HN: What should I tell my cousin who wants to go to a "coding boot camp"?</a></td><td>4.97</td><td>20</td><td>73</td></tr><tr><td><a href="https://news.ycombinator.com/item?id=13763066" target="_blank" rel="noopener noreferrer" class="">Ask HN: Why isn't VoIP better?</a></td><td>5.88</td><td>9</td><td>143</td></tr><tr><td><a href="https://news.ycombinator.com/item?id=13786220" target="_blank" rel="noopener noreferrer" class="">You May Want to Marry My Husband</a></td><td>14.98</td><td>1</td><td>581</td></tr><tr><td><a href="https://news.ycombinator.com/item?id=13786997" target="_blank" rel="noopener noreferrer" class="">Show HN: Online Clojure REPL</a></td><td>15.25</td><td>11</td><td>45</td></tr><tr><td><a 
href="https://news.ycombinator.com/item?id=13831370" target="_blank" rel="noopener noreferrer" class="">Introducing Cloud Functions for Firebase</a></td><td>22.21</td><td>3</td><td>342</td></tr></tbody></table></div>
<p>A few of the posts seem a bit off-topic, some may be a bit light on content, but for the most part I wouldn't think twice if I saw any of these stories on the front page.
I mean... they were all actually upvoted to the front page after all.
Some of them are the types of stories that tend to invite flagging and might have been removed because of that (either manually or automatically).</p>
<p>This is most likely the case for the political posts.
They're usually fairly contentious and the moderators have expressed distaste for an overabundance of them (<em>e.g.</em> the <a href="https://news.ycombinator.com/item?id=13108404" target="_blank" rel="noopener noreferrer" class="">Political Detox Week</a> attempted a few months ago).
I wouldn't be surprised to see those removed, especially when the comment sections turn toxic (as they often do).</p>
<p>There are also a few articles that are thinly veiled affiliate spam.
For example, the ShelfJoy links about books that Aaron Swartz and David Bowie loved fall into this category.
They get called out in the comments for being very low-effort lists of Amazon affiliate links.</p>
<p>I have a hard time imagining others triggering any sort of hidden flag threshold.
I obviously think that <a class="" href="https://sangaline.com/blog/reverse-engineering-the-hacker-news-ranking-algorithm">Reverse Engineering the Hacker News Ranking Algorithm</a> falls into that category, but what about "Python Fire"?
A new, useful open source project that was met with glowing praise in the comments?
Take a look at the first sentences of some of the comments: "Convenient.", "Some libraries just seem obvious in retrospect.", "Looks great!", "Looks pretty good.", and "Looks fantastic.".
That just really doesn't seem like a recipe for heavy flagging.</p>
<p>The "Investor to Airbnb CEO" post is the only one on the list that pertains to a Y Combinator company.
I read through the comments to get a feel for the tone and how much controversy there was.
As a whole, they're fairly critical of the CEO.
Take a look at the top comment.</p>
<p><img decoding="async" loading="lazy" alt="The top comment is fairly critical of Y Combinator" src="https://sangaline.com/assets/images/throwaway6497-comment-f84cc104321a7ac32f6ccf368b4c7a81.png" width="925" height="859" class="img_ev3q"></p>
<p>I find this to be a very respectful comment that highlights the positive role that Y Combinator plays in the startup community while also explaining why that translates into it having some responsibility for taking a stance against unethical behavior.
That said, it's clearly critical of how Y Combinator has responded to controversies regarding the ethics of their companies.</p>
<p>There was another part of the comment that I found a bit chilling though: "I am looking forward to see[ing] constructive debate around this topic in the comments."
Did a moderator really read that and then disappear the story?
This might just be an unfortunate coincidence caused by some flagging threshold but the pattern looks strikingly similar to that of my own post.
It was on the front page for about 20 minutes, was getting rapidly upvoted, and then suddenly dropped hundreds of positions in the rankings.</p>
<p><img decoding="async" loading="lazy" alt="The Airbnb trajectory seems similar to mine" src="https://sangaline.com/assets/images/airbnb-trajectory-and-votes-29d6edf129ebf9ad138d7f635c517fa7.png" width="1200" height="400" class="img_ev3q"></p>
<p>The penalty was then reversed shortly after but it was off the front page at that point.
The submission never got a fraction of the attention that it would have received otherwise (though it did briefly rise back on to the front page after this).</p>
<p>I'm very curious to hear some other people's opinions on this.
Were these stories removed automatically or was there moderator intervention?
If you think it was the moderators then do their intentions make sense to you?
I don't want to jinx things but... <em>I am looking forward to seeing constructive debate around this topic in the comments.</em></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How many people will actually die this week because of Daylight Savings Time?]]></title>
            <link>https://sangaline.com/blog/deaths-caused-by-daylight-savings-time</link>
            <guid>https://sangaline.com/blog/deaths-caused-by-daylight-savings-time</guid>
            <pubDate>Sun, 12 Mar 2017 13:25:28 GMT</pubDate>
            <description><![CDATA[A data analysis of how many deaths the DST transition causes due to tired driving.]]></description>
            <content:encoded><![CDATA[<p>The idea that people die as a direct result of setting the clocks forward for Daylight Savings Time is an annual favorite.
I'm no fan of losing an hour of sleep and I have to admit that I find it somewhat vindicating to disdain something that's not only a minor inconvenience but also <em>a cold blooded killer.</em>
A healthy mixture of confirmation bias and macabre fascination has allowed me to go on for most of my life without ever questioning that people die in sleepy droves on the Monday following the switch to DST.
It is after all eminently reasonable; people are tired when they drive into work after the change and driving while tired reduces reaction times and increases the likelihood of accidents.</p>
<!-- -->
<p>Today happened to be the first day in my life that I have ever had even a shadow of a doubt about the truth of this statement.
I woke up this morning and immediately started thinking about how much better life would be if we could all just agree to switch to Daylight Savings Time year round and let Standard Time go the way of the Bramble Cay melomys.
We would get an extra hour of daylight in the winter, I wouldn't have had to wake up an hour early, and, of course, so many lives would be saved.*</p>
<p>Then a thought occurred to me: this is a factoid that I picked up on the playground in elementary school.
Not that that necessarily means it's wrong but the playground is also where I learned that gum stays in your stomach for seven years, the Great Wall of China is the only man-made structure visible from space, and that this:</p>
<p><img decoding="async" loading="lazy" alt="Tootsie Pop Wrapper" src="https://sangaline.com/assets/images/tootsie-6c02304c242392bea8105c45e35665b6.jpg" width="240" height="180" class="img_ev3q"></p>
<p>is redeemable for a free Tootsie Pop (<em>photo credit: <a href="https://www.flickr.com/photos/j_regan/3411324367" target="_blank" rel="noopener noreferrer" class="">regan76</a></em>).
I still haven't gotten over my disappointment after visiting the Great Wall.</p>
<p>In light of this, I decided to seek out a more trustworthy source and a quick googling found many articles supporting the claim.
Their references painted a blurrier picture, however.
<a href="http://www.sciencedirect.com/science/article/pii/S0001457503000150" target="_blank" rel="noopener noreferrer" class=""><em>The effects of daylight and daylight saving time on US pedestrian fatalities and motor vehicle occupant fatalities</em></a> is one of the most cited papers on the topic.
It concludes that 171 pedestrian fatalities and 195 vehicle occupant fatalities are caused each year by the time switch.
These deaths are not caused by sleepy drivers in the spring though, they're caused by <em>reduced sunlight in the winter.</em></p>
<p>This is apparently a popular topic in Finland as well where <a href="https://www.hindawi.com/journals/jeph/2010/657167/" target="_blank" rel="noopener noreferrer" class=""><em>Daylight Saving Time Transitions and Road Traffic Accidents</em></a> found no evidence of increased traffic accidents and <a href="https://www.ncbi.nlm.nih.gov/pubmed/21078830" target="_blank" rel="noopener noreferrer" class="">another paper</a> found no increase in work-related accidents.
There are some papers that claim an effect, such as <a href="https://www.ncbi.nlm.nih.gov/pubmed/11152980" target="_blank" rel="noopener noreferrer" class=""><em>Fatal accidents following changes in daylight savings time: the American experience</em></a>, but, as that paper itself points out, many of these analyses have conflicting results.</p>
<p>At this point, I decided that I should do a little analysis of my own.
I started looking around for US mortality datasets and quickly found that the data situation is complicated.
There's an understandable concern about protecting individuals' privacy and so recent data tends to be either not publicly available or obfuscated in various ways so that identification is impossible.
Unfortunately, removing the date of death is one of those ways.</p>
<p>The most recent usable data that I could get my hands on were the <a href="https://www.cdc.gov/nchs/nvss/mortality/gmwk304.htm" target="_blank" rel="noopener noreferrer" class="">GMWK304</a> and <a href="https://www.cdc.gov/nchs/nvss/mortality/gmwk305.htm" target="_blank" rel="noopener noreferrer" class="">GMWK305</a> mortality tables from the National Vital Statistics System.
These include the total deaths per day from 1999-2007 segmented by cause of death.
I focused on death by motor vehicle accidents for both occupants and pedestrians.</p>
<p>I aggregated the data across the available years to see the average number of deaths per day for the week before, the week of, and the week after the DST transition.</p>
<p><img decoding="async" loading="lazy" alt="Vehicle occupant deaths" src="https://sangaline.com/assets/images/vehicle-occupant-deaths-eb1c141bcc92321a5eb7efdb9fd2307d.png" width="600" height="400" class="img_ev3q"></p>
<p>Then I considered the average of the week before and the week after as a baseline and subtracted this off from the week of the transition to find the excess deaths.</p>
<p><img decoding="async" loading="lazy" alt="Excess vehicle occupant deaths" src="https://sangaline.com/assets/images/excess-vehicle-occupant-deaths-7898cd5639324630d682ba62dab2167d.png" width="600" height="400" class="img_ev3q"></p>
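<p>The baseline subtraction can be sketched in a few lines of Python. This is a simplified illustration rather than the exact analysis code: the function names are hypothetical, the input is assumed to be a dictionary mapping each <code>datetime.date</code> to a death count, and each week is assumed to start on the Sunday of the transition. Averaging the per-year excesses gives the same result as averaging the counts first and then subtracting, because the operation is linear.</p>

```python
from datetime import date, timedelta


def weekly_counts(daily_deaths, week_start):
    """Return the 7 daily counts for the week beginning on week_start."""
    return [daily_deaths[week_start + timedelta(days=d)] for d in range(7)]


def excess_by_weekday(daily_deaths, transitions):
    """Average excess deaths per weekday across all transition dates.

    daily_deaths maps datetime.date -> death count; transitions is a list of
    the Sundays on which DST began. Index 0 of the result is Sunday.
    """
    totals = [0.0] * 7
    for sunday in transitions:
        before = weekly_counts(daily_deaths, sunday - timedelta(days=7))
        during = weekly_counts(daily_deaths, sunday)
        after = weekly_counts(daily_deaths, sunday + timedelta(days=7))
        for d in range(7):
            # baseline is the mean of the same weekday one week before and after
            baseline = (before[d] + after[d]) / 2
            totals[d] += during[d] - baseline
    return [total / len(transitions) for total in totals]
```

<p>Index 1 of the returned list corresponds to the Monday after the transition, which is where a sleepy commuter effect would be expected to show up.</p>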
<p>We would expect a large excess on Monday in particular if the original premise of tired commuters causing accidents were true.
We do see an excess of about <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>6</mn></mrow><annotation encoding="application/x-tex">6</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">6</span></span></span></span> on Monday but the errors are so large that it's indistinguishable from zero.</p>
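<p>As a rough illustration of why the uncertainties are so large, suppose that each count is an independent Poisson variable, so that its variance equals the count itself. This is an assumption made for the sketch and not necessarily how the error bars in the plots were produced; <code>excess_with_error</code> is likewise a hypothetical helper. The error on the excess then follows from standard error propagation.</p>

```python
import math


def excess_with_error(before, during, after):
    """Excess over the two-week baseline and its propagated 1-sigma error.

    Assumes the three counts are independent Poisson variables, so each
    count's variance equals the count itself.
    """
    excess = during - (before + after) / 2
    # Var[D - (B + A) / 2] = Var[D] + (Var[B] + Var[A]) / 4
    variance = during + (before + after) / 4
    return excess, math.sqrt(variance)
```

<p>Because the error grows like the square root of the counts while the excess is a small difference between large numbers, an excess of a handful of deaths is easily swamped by the noise.</p>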
<p>The statistical uncertainties are even larger for the pedestrian deaths because they're less frequent</p>
<p><img decoding="async" loading="lazy" alt="Pedestrian deaths" src="https://sangaline.com/assets/images/pedestrian-deaths-8b7a11600ad461c5ec1fd002b6ff1dda.png" width="600" height="400" class="img_ev3q"></p>
<p>and the excess deaths</p>
<p><img decoding="async" loading="lazy" alt="Excess pedestrian deaths" src="https://sangaline.com/assets/images/excess-pedestrian-deaths-ea85f269859af109c31577e3e3bb9ccc.png" width="600" height="400" class="img_ev3q"></p>
<p>are again fairly indistinguishable from zero.
There's actually a deficit observed on Monday but that's meaningless in the face of the large statistical error.</p>
<p>We can also aggregate the deaths over the entire week instead of dividing it up by day.
This could possibly improve the signal-to-noise ratio if there are continued effects throughout the week.
The excesses for the week of the transition are <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>33</mn><mo>±</mo><mn>51</mn></mrow><annotation encoding="application/x-tex">33 \pm 51</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.7278em;vertical-align:-0.0833em"></span><span class="mord">33</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">±</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">51</span></span></span></span> vehicle occupant deaths and <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>5.6</mn><mo>±</mo><mn>11</mn></mrow><annotation encoding="application/x-tex">5.6 \pm 11</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.7278em;vertical-align:-0.0833em"></span><span class="mord">5.6</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">±</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">11</span></span></span></span> pedestrian deaths.
All that we could really do with these would be to put upper limits on the excesses; we can't exclude zero in either case.</p>
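<p>To make the idea of an upper limit concrete, here is a minimal sketch that assumes the errors are Gaussian and computes a one-sided 95% limit for each excess. The <code>upper_limit</code> helper and the choice of a 95% confidence level are illustrative assumptions, not something taken from the original analysis.</p>

```python
from statistics import NormalDist

def upper_limit(value, error, confidence=0.95):
    """One-sided Gaussian upper limit at the given confidence level."""
    z = NormalDist().inv_cdf(confidence)  # about 1.645 for 95%
    return value + z * error

measurements = [("vehicle occupants", 33, 51), ("pedestrians", 5.6, 11)]
for label, value, error in measurements:
    significance = value / error
    print(f"{label}: {significance:.2f} sigma from zero, "
          f"95% upper limit of about {upper_limit(value, error):.0f} excess deaths")
```

<p>Both excesses sit well under one standard deviation from zero, which is why neither can exclude the no-effect hypothesis.</p>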
<p>So there you have it: there isn't really much evidence that the Daylight Saving Time transition results in any deaths due to tired driving.
Is it weird that I'm a little disappointed?
It's certainly possible, maybe even likely, that it does but the nine years of data that we looked at simply don't have the statistical resolving power to prove it.
You could decrease the errors a little bit by coming up with a more sophisticated model for the baseline subtraction but that would only change the statistical errors by a factor of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msqrt><mfrac><mn>2</mn><mn>3</mn></mfrac></msqrt><mo>≈</mo><mn>0.82</mn></mrow><annotation encoding="application/x-tex">\sqrt{\frac 2 3} \approx 0.82</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.84em;vertical-align:-0.6049em"></span><span class="mord sqrt"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.2351em"><span class="svg-align" style="top:-3.8em"><span class="pstrut" style="height:3.8em"></span><span class="mord" style="padding-left:1em"><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.8451em"><span style="top:-2.655em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">3</span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.394em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">2</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.345em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span></span></span><span style="top:-3.1951em"><span class="pstrut" style="height:3.8em"></span><span class="hide-tail" style="min-width:1.02em;height:1.88em"><svg xmlns="http://www.w3.org/2000/svg" 
width="400em" height="1.88em" viewBox="0 0 400000 1944" preserveAspectRatio="xMinYMin slice"><path d="M983 90
l0 -0
c4,-6.7,10,-10,18,-10 H400000v40
H1013.1s-83.4,268,-264.1,840c-180.7,572,-277,876.3,-289,913c-4.7,4.7,-12.7,7,-24,7
s-12,0,-12,0c-1.3,-3.3,-3.7,-11.7,-7,-25c-35.3,-125.3,-106.7,-373.3,-214,-744
c-10,12,-21,25,-33,39s-32,39,-32,39c-6,-5.3,-15,-14,-27,-26s25,-30,25,-30
c26.7,-32.7,52,-63,76,-91s52,-60,52,-60s208,722,208,722
c56,-175.3,126.3,-397.3,211,-666c84.7,-268.7,153.8,-488.2,207.5,-658.5
c53.7,-170.3,84.5,-266.8,92.5,-289.5z
M1001 80h400000v40h-400000z"></path></svg></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.6049em"><span></span></span></span></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">≈</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0.82</span></span></span></span>.
Another factor of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msqrt><mfrac><mn>9</mn><mn>37</mn></mfrac></msqrt><mo>≈</mo><mn>0.49</mn></mrow><annotation encoding="application/x-tex">\sqrt{\frac{9}{37}} \approx 0.49</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.84em;vertical-align:-0.6049em"></span><span class="mord sqrt"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.2351em"><span class="svg-align" style="top:-3.8em"><span class="pstrut" style="height:3.8em"></span><span class="mord" style="padding-left:1em"><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.8451em"><span style="top:-2.655em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">37</span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.394em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">9</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.345em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span></span></span><span style="top:-3.1951em"><span class="pstrut" style="height:3.8em"></span><span class="hide-tail" style="min-width:1.02em;height:1.88em"><svg xmlns="http://www.w3.org/2000/svg" width="400em" height="1.88em" viewBox="0 0 400000 1944" preserveAspectRatio="xMinYMin slice"><path d="M983 90
l0 -0
c4,-6.7,10,-10,18,-10 H400000v40
H1013.1s-83.4,268,-264.1,840c-180.7,572,-277,876.3,-289,913c-4.7,4.7,-12.7,7,-24,7
s-12,0,-12,0c-1.3,-3.3,-3.7,-11.7,-7,-25c-35.3,-125.3,-106.7,-373.3,-214,-744
c-10,12,-21,25,-33,39s-32,39,-32,39c-6,-5.3,-15,-14,-27,-26s25,-30,25,-30
c26.7,-32.7,52,-63,76,-91s52,-60,52,-60s208,722,208,722
c56,-175.3,126.3,-397.3,211,-666c84.7,-268.7,153.8,-488.2,207.5,-658.5
c53.7,-170.3,84.5,-266.8,92.5,-289.5z
M1001 80h400000v40h-400000z"></path></svg></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.6049em"><span></span></span></span></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">≈</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0.49</span></span></span></span> could be achieved by getting access to the non-public National Death Index data stretching back to 1980, but I would be hesitant to use data any older than that because there have been tremendous gains in vehicle safety since then.
That would knock the statistical errors on the excess vehicle occupant deaths for the week down to about <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>21</mn></mrow><annotation encoding="application/x-tex">21</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">21</span></span></span></span> which may or may not be enough to resolve a signal.</p>
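<p>The arithmetic behind those factors can be checked in a few lines. The reading of the first factor is an assumption: if each week has the same variance, averaging two neighboring weeks for the baseline gives a total variance of 1.5 times a single week's, while a perfectly known baseline would leave just the single week's variance. The second factor assumes the errors scale like one over the square root of the number of years.</p>

```python
import math

weekly_error = 51  # current error on the weekly vehicle occupant excess

# Two-week baseline: variance = sigma^2 + sigma^2 / 2 = 1.5 * sigma^2.
# Ideal baseline: variance = sigma^2, so the error shrinks by sqrt(2 / 3).
baseline_factor = math.sqrt(2 / 3)

# Going from 9 years of data to 37 years scales the error by sqrt(9 / 37).
years_factor = math.sqrt(9 / 37)

improved_error = weekly_error * baseline_factor * years_factor
print(f"{baseline_factor:.2f} * {years_factor:.2f} * {weekly_error} = {improved_error:.1f}")
```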
<p>Overall, I would judge this factoid as much less apparently true than I expected it to be.
It might be possible to discern a definitive signal with access to the non-public data but I have a feeling that it would still be fairly ambiguous.
The statistical errors are such that the problem is highly susceptible to p-hacking and that's probably why you'll find papers using different model assumptions and coming to conflicting conclusions.</p>
<p>It might not be a killer but can we at least all agree that Standard Time sucks?</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Reverse Engineering the Hacker News Ranking Algorithm]]></title>
            <link>https://sangaline.com/blog/reverse-engineering-the-hacker-news-ranking-algorithm</link>
            <guid>https://sangaline.com/blog/reverse-engineering-the-hacker-news-ranking-algorithm</guid>
            <pubDate>Fri, 10 Mar 2017 12:36:07 GMT</pubDate>
            <description><![CDATA[A data-driven exploration of how the Hacker News ranking algorithm works.]]></description>
            <content:encoded><![CDATA[<p><em>All data and code used in this article can be found on <a href="https://github.com/sangaline/reverse-engineering-the-hacker-news-ranking-algorithm" target="_blank" rel="noopener noreferrer" class="">GitHub</a>.</em></p>
<!-- -->
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="introduction">Introduction<a href="https://sangaline.com/blog/reverse-engineering-the-hacker-news-ranking-algorithm#introduction" class="hash-link" aria-label="Direct link to Introduction" title="Direct link to Introduction" translate="no">​</a></h2>
<p>Articles occasionally pop up on Hacker News that analyze historical data relating to posts and comments on the site.
Some of the analyses have been quite interesting but they almost universally focus on either basic metrics, content analysis, or how to get on the front page.
Every time I read one of those articles, I start wondering what the same data could tell us about how Hacker News actually works.</p>
<p>There are a lot of questions one could ask but one of the most obvious is: what determines the position of stories on the front page?
I find this to be a particularly interesting question, not because I actually care about the answer, but because it feels like the data should be able to tell us the answer.
We could of course pick some different models and use the historical data to fit and validate them... but this isn't what I mean.
I just have this feeling that the data can actually <em>tell us the answer</em> in a more direct way than global optimization.</p>
<p>What follows is an exploration of how we can use the data to learn about how the algorithm works.
It shouldn't be confused with an attempt to find the best predictor for the front page rank; there are better ways to do that.
My main goal was to tease out the ranking algorithm from the data in a simple and elegant fashion.
This made it a little more interesting as an endeavor and hopefully makes it a more interesting read as well.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="when-you-make-assumptions">When You Make Assumptions<a href="https://sangaline.com/blog/reverse-engineering-the-hacker-news-ranking-algorithm#when-you-make-assumptions" class="hash-link" aria-label="Direct link to When You Make Assumptions" title="Direct link to When You Make Assumptions" translate="no">​</a></h2>
<p>Anybody who reads Hacker News with any regularity probably has some rough but intuitive understanding of how the ranking algorithm works.
If you've ever seen a story and thought "jeez, this must have been flagged a lot to be this far down the page" then you were basically predicting where it should appear and then noticing a discrepancy from that.
Here are a few examples of my guess-timations of where a story might appear on the front page.</p>
<table><thead><tr><th>votes</th><th>age</th><th>position #</th></tr></thead><tbody><tr><td>15</td><td>41 minutes</td><td>8</td></tr><tr><td>140</td><td>21 minutes</td><td>2</td></tr><tr><td>3000</td><td>4 days</td><td>&gt; 30</td></tr><tr><td>500</td><td>12 hours</td><td>15</td></tr><tr><td>40</td><td>8 hours</td><td>28</td></tr></tbody></table>
<p>I doubt that anybody is going to start calling my psychic hotline after seeing me make these guesses but it's actually pretty interesting that we can make even rough predictions.
I think that this implies that we have some underlying beliefs about how the ranking algorithm works.
Some of these might be things that we just feel must be true while others are more heavily supported by observations.
When I approach a new problem, I like to start by clarifying any implicit assumptions and then seeing if I can use them to build a more structured approach.</p>
<p>One very general assumption that I made when coming up with those predictions was that the rank is primarily determined by a story's vote total and its age which we'll denote with <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>v</mi></mrow><annotation encoding="application/x-tex">v</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span></span></span></span> and <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>t</mi></mrow><annotation encoding="application/x-tex">t</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6151em"></span><span class="mord mathnormal">t</span></span></span></span> respectively.
I know that there are contributions from flagging, <em>@dang</em> boosting stories, spam detection, <em>etc.</em> but I expect these to be somewhat secondary in most cases and we don't have access to their corresponding variables anyway.
Another broad assumption is that <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>v</mi></mrow><annotation encoding="application/x-tex">v</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span></span></span></span> and <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>t</mi></mrow><annotation encoding="application/x-tex">t</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6151em"></span><span class="mord mathnormal">t</span></span></span></span> are used to calculate some numerical score and that the stories are ranked by sorting their scores.
We'll represent this score function as <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="script">F</mi><mo stretchy="false">(</mo><mi>t</mi><mo separator="true">,</mo><mi>v</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\mathcal{F}(t, v)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathcal" style="margin-right:0.09931em">F</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mclose">)</span></span></span></span>.</p>
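<p>In code, this framework amounts to a score function and a sort. The sketch below uses a placeholder form for the score purely for illustration; it is an assumption and not a claim about the real algorithm. The <code>Story</code> class and its field names are likewise hypothetical, and the vote totals and ages are the ones from the table of guesses above.</p>

```python
from dataclasses import dataclass

@dataclass
class Story:
    title: str
    votes: int        # v
    age_hours: float  # t

def score(t, v):
    """Placeholder for F(t, v); this form is an assumption, not the real algorithm."""
    return v / (1 + t)

def rank(stories):
    """Return the stories ordered from highest score to lowest."""
    return sorted(stories, key=lambda s: score(s.age_hours, s.votes), reverse=True)

stories = [
    Story("a", votes=15, age_hours=41 / 60),
    Story("b", votes=140, age_hours=21 / 60),
    Story("c", votes=3000, age_hours=4 * 24),
    Story("d", votes=500, age_hours=12),
    Story("e", votes=40, age_hours=8),
]
for position, story in enumerate(rank(stories), start=1):
    print(position, story.title, round(score(story.age_hours, story.votes), 1))
```

<p>Any other candidate for the score function can be swapped in without touching the ranking step, which is what makes this separation useful.</p>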
<p>Having a general framework for how the ranking works, we can move to thinking about <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="script">F</mi><mo stretchy="false">(</mo><mi>t</mi><mo separator="true">,</mo><mi>v</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\mathcal{F}(t, v)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathcal" style="margin-right:0.09931em">F</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mclose">)</span></span></span></span> itself.
Let's start with how the score changes with age.
Here's how I classify some guesses regarding the age dependence of the score.</p>
<p><img decoding="async" loading="lazy" alt="Some functional forms are more intuitive than others" src="https://sangaline.com/assets/images/guesses-abdc92ebb323f1d1de488dda870b41fd.png" width="1200" height="400" class="img_ev3q"></p>
<p>There are two main things I consider when judging these different curves.
First off, it doesn't really make sense for the score to get higher as a story gets older; you don't see stories from a few months ago hopping back onto the front page for no reason.
I'm assuming that the score must decrease as <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>t</mi></mrow><annotation encoding="application/x-tex">t</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6151em"></span><span class="mord mathnormal">t</span></span></span></span> increases for a fixed <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>v</mi></mrow><annotation encoding="application/x-tex">v</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span></span></span></span>.
Mathematically, this is equivalent to saying that the <a href="https://en.wikipedia.org/wiki/Partial_derivative" target="_blank" rel="noopener noreferrer" class="">partial derivative</a> <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mfrac><mrow><mi mathvariant="normal">∂</mi><mi mathvariant="script">F</mi></mrow><mrow><mi mathvariant="normal">∂</mi><mi>t</mi></mrow></mfrac><mo>&lt;</mo><mn>0</mn></mrow><annotation encoding="application/x-tex">\frac{\partial \mathcal F}{\partial t} &lt; 0</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.2251em;vertical-align:-0.345em"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.8801em"><span style="top:-2.655em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight" style="margin-right:0.05556em">∂</span><span class="mord mathnormal mtight">t</span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.394em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight" style="margin-right:0.05556em">∂</span><span class="mord mathcal mtight" style="margin-right:0.09931em">F</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.345em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">&lt;</span><span class="mspace" style="margin-right:0.2778em"></span></span><span 
class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0</span></span></span></span>.</p>
<p>The second assumption is more subtle: small differences in age make less of a difference for large <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>t</mi></mrow><annotation encoding="application/x-tex">t</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6151em"></span><span class="mord mathnormal">t</span></span></span></span>.
The difference between ages of one hour and eleven hours will have a huge impact on the score while the difference between one year and one year plus ten hours should be pretty negligible.
This means that the score curves upwards as a function of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>t</mi></mrow><annotation encoding="application/x-tex">t</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6151em"></span><span class="mord mathnormal">t</span></span></span></span> and therefore <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mfrac><mrow><msup><mi mathvariant="normal">∂</mi><mn>2</mn></msup><mi mathvariant="script">F</mi></mrow><mrow><mi mathvariant="normal">∂</mi><msup><mi>t</mi><mn>2</mn></msup></mrow></mfrac><mo>&gt;</mo><mn>0</mn></mrow><annotation encoding="application/x-tex">\frac{\partial^2 \mathcal F}{\partial t^2} &gt; 0</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.3629em;vertical-align:-0.345em"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.0179em"><span style="top:-2.655em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight" style="margin-right:0.05556em">∂</span><span class="mord mtight"><span class="mord mathnormal mtight">t</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.7463em"><span style="top:-2.786em;margin-right:0.0714em"><span class="pstrut" style="height:2.5em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" 
style="border-bottom-width:0.04em"></span></span><span style="top:-3.394em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight"><span class="mord mtight" style="margin-right:0.05556em">∂</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8913em"><span style="top:-2.931em;margin-right:0.0714em"><span class="pstrut" style="height:2.5em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span><span class="mord mathcal mtight" style="margin-right:0.09931em">F</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.345em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">&gt;</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0</span></span></span></span>.</p>
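<p>These two conditions are easy to test numerically for any candidate age dependence. The sketch below uses central finite differences at a handful of ages; the candidate functions, the step size, and the tolerance are all illustrative assumptions, and none of the candidates is claimed to be the real one.</p>

```python
def satisfies_age_assumptions(f, ages, h=0.01, tol=1e-6):
    """Check dF/dt < 0 and d2F/dt2 > 0 at each age using central differences."""
    for t in ages:
        first = (f(t + h) - f(t - h)) / (2 * h)
        second = (f(t + h) - 2 * f(t) + f(t - h)) / h ** 2
        # The tolerance keeps floating point noise from passing as curvature.
        if not (first < -tol and second > tol):
            return False
    return True

# Hypothetical age dependences at fixed v (t in hours).
candidates = {
    "1 / (1 + t)": lambda t: 1 / (1 + t),  # decreasing, curves upwards
    "-t": lambda t: -t,                    # decreasing, no curvature
    "-t ** 2": lambda t: -t ** 2,          # decreasing, curves downwards
}
ages = [0.5, 1, 5, 24]
for name, f in candidates.items():
    print(name, satisfies_age_assumptions(f, ages))
```

<p>Only the first candidate passes both checks; the other two decrease with age but fail the curvature condition.</p>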
<p>We can apply similar reasoning to the dependence of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="script">F</mi><mo stretchy="false">(</mo><mi>t</mi><mo separator="true">,</mo><mi>v</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\mathcal F(t, v)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathcal" style="margin-right:0.09931em">F</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mclose">)</span></span></span></span> on the number of votes <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>v</mi></mrow><annotation encoding="application/x-tex">v</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span></span></span></span>.
I expect more votes to always mean a higher score so we can conclude that <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mfrac><mrow><mi mathvariant="normal">∂</mi><mi mathvariant="script">F</mi><mo stretchy="false">(</mo><mi>t</mi><mo separator="true">,</mo><mi>v</mi><mo stretchy="false">)</mo></mrow><mrow><mi mathvariant="normal">∂</mi><mi>v</mi></mrow></mfrac><mo>&gt;</mo><mn>0</mn></mrow><annotation encoding="application/x-tex">\frac{\partial \mathcal F(t, v)}{\partial v} &gt; 0</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.355em;vertical-align:-0.345em"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.01em"><span style="top:-2.655em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight" style="margin-right:0.05556em">∂</span><span class="mord mathnormal mtight" style="margin-right:0.03588em">v</span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.485em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight" style="margin-right:0.05556em">∂</span><span class="mord mathcal mtight" style="margin-right:0.09931em">F</span><span class="mopen mtight">(</span><span class="mord mathnormal mtight">t</span><span class="mpunct mtight">,</span><span class="mord mathnormal mtight" style="margin-right:0.03588em">v</span><span class="mclose mtight">)</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" 
style="height:0.345em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">&gt;</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0</span></span></span></span>.
My first guess would be that <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="script">F</mi><mo stretchy="false">(</mo><mi>t</mi><mo separator="true">,</mo><mi>v</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\mathcal F(t, v)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathcal" style="margin-right:0.09931em">F</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mclose">)</span></span></span></span> is proportional to <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>v</mi></mrow><annotation encoding="application/x-tex">v</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span></span></span></span> but I could imagine somebody wanting to slightly suppress the growth for very large <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>v</mi></mrow><annotation encoding="application/x-tex">v</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span></span></span></span> because stories that are on the front page for longer will get seen by more people and will therefore get even more votes.
There's a multiplicative effect here and you could counteract this by introducing downwards curvature with something like <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="script">F</mi><mo stretchy="false">(</mo><mi>t</mi><mo separator="true">,</mo><mi>v</mi><mo stretchy="false">)</mo><mo>∝</mo><msup><mi>v</mi><mfrac><mn>1</mn><mn>2</mn></mfrac></msup></mrow><annotation encoding="application/x-tex">\mathcal F(t, v) \propto v^{\frac 1 2}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathcal" style="margin-right:0.09931em">F</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">∝</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.954em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.954em"><span style="top:-3.363em;margin-right:0.05em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight"><span class="mopen nulldelimiter sizing reset-size3 size6"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.8443em"><span style="top:-2.656em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight"><span class="mord 
mtight">2</span></span></span></span><span style="top:-3.2255em"><span class="pstrut" style="height:3em"></span><span class="frac-line mtight" style="border-bottom-width:0.049em"></span></span><span style="top:-3.384em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight"><span class="mord mtight">1</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.344em"><span></span></span></span></span></span><span class="mclose nulldelimiter sizing reset-size3 size6"></span></span></span></span></span></span></span></span></span></span></span></span></span>.
Any functions that suppress this effect will have <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mfrac><mrow><msup><mi mathvariant="normal">∂</mi><mn>2</mn></msup><mi mathvariant="script">F</mi></mrow><mrow><mi mathvariant="normal">∂</mi><msup><mi>v</mi><mn>2</mn></msup></mrow></mfrac><mo>&lt;</mo><mn>0</mn></mrow><annotation encoding="application/x-tex">\frac{\partial^2 \mathcal F}{\partial v^2} &lt; 0</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.3629em;vertical-align:-0.345em"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.0179em"><span style="top:-2.655em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight" style="margin-right:0.05556em">∂</span><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.03588em">v</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.7463em"><span style="top:-2.786em;margin-right:0.0714em"><span class="pstrut" style="height:2.5em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.394em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight"><span class="mord mtight" style="margin-right:0.05556em">∂</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8913em"><span 
style="top:-2.931em;margin-right:0.0714em"><span class="pstrut" style="height:2.5em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span><span class="mord mathcal mtight" style="margin-right:0.09931em">F</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.345em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">&lt;</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0</span></span></span></span> while functions with <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mfrac><mrow><msup><mi mathvariant="normal">∂</mi><mn>2</mn></msup><mi mathvariant="script">F</mi></mrow><mrow><mi mathvariant="normal">∂</mi><msup><mi>v</mi><mn>2</mn></msup></mrow></mfrac><mo>&gt;</mo><mn>0</mn></mrow><annotation encoding="application/x-tex">\frac{\partial^2 \mathcal F}{\partial v^2} &gt; 0</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.3629em;vertical-align:-0.345em"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.0179em"><span style="top:-2.655em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight" style="margin-right:0.05556em">∂</span><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.03588em">v</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" 
style="height:0.7463em"><span style="top:-2.786em;margin-right:0.0714em"><span class="pstrut" style="height:2.5em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.394em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight"><span class="mord mtight" style="margin-right:0.05556em">∂</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8913em"><span style="top:-2.931em;margin-right:0.0714em"><span class="pstrut" style="height:2.5em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span><span class="mord mathcal mtight" style="margin-right:0.09931em">F</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.345em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">&gt;</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0</span></span></span></span> would actually exacerbate it.
We'll assume that <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mfrac><mrow><msup><mi mathvariant="normal">∂</mi><mn>2</mn></msup><mi mathvariant="script">F</mi></mrow><mrow><mi mathvariant="normal">∂</mi><msup><mi>v</mi><mn>2</mn></msup></mrow></mfrac><mo>≤</mo><mn>0</mn></mrow><annotation encoding="application/x-tex">\frac{\partial^2 \mathcal F}{\partial v^2} \le 0</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.3629em;vertical-align:-0.345em"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.0179em"><span style="top:-2.655em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight" style="margin-right:0.05556em">∂</span><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.03588em">v</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.7463em"><span style="top:-2.786em;margin-right:0.0714em"><span class="pstrut" style="height:2.5em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.394em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight"><span class="mord mtight" style="margin-right:0.05556em">∂</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8913em"><span 
style="top:-2.931em;margin-right:0.0714em"><span class="pstrut" style="height:2.5em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span><span class="mord mathcal mtight" style="margin-right:0.09931em">F</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.345em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">≤</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0</span></span></span></span> to allow for suppression but not enhancement of the effect.</p>
<p>I should probably mention now that these second derivative limits are <em>a little bit</em> over-simplified.
It's easy to see that multiplying the score function by a positive constant won't affect the relative rank of stories, but there is actually a much broader set of transformations that also won't affect it.
Given a score function <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="script">F</mi><mo stretchy="false">(</mo><mi>t</mi><mo separator="true">,</mo><mi>v</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\mathcal F(t, v)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathcal" style="margin-right:0.09931em">F</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mclose">)</span></span></span></span>, you can apply any strictly increasing transformation and get out another score function that sorts the stories in exactly the same order.
So if <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="script">F</mi><mo stretchy="false">(</mo><mi>t</mi><mo separator="true">,</mo><mi>v</mi><mo stretchy="false">)</mo><mo>∝</mo><mi>v</mi></mrow><annotation encoding="application/x-tex">\mathcal F(t, v) \propto v</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathcal" style="margin-right:0.09931em">F</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">∝</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.4306em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span></span></span></span> then <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="script">F</mi><mo stretchy="false">(</mo><mi>t</mi><mo separator="true">,</mo><mi>v</mi><msup><mo stretchy="false">)</mo><mn>2</mn></msup></mrow><annotation encoding="application/x-tex">\mathcal F(t, v)^2</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.0641em;vertical-align:-0.25em"></span><span class="mord mathcal" style="margin-right:0.09931em">F</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mclose"><span 
class="mclose">)</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8141em"><span style="top:-3.063em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span></span></span></span> will have upward curvature as a function of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>v</mi></mrow><annotation encoding="application/x-tex">v</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span></span></span></span> and violate our assumption that it shouldn't.
You can similarly apply transformations that would cause the <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mfrac><mrow><msup><mi mathvariant="normal">∂</mi><mn>2</mn></msup><mi mathvariant="script">F</mi></mrow><mrow><mi mathvariant="normal">∂</mi><msup><mi>t</mi><mn>2</mn></msup></mrow></mfrac><mo>&gt;</mo><mn>0</mn></mrow><annotation encoding="application/x-tex">\frac{\partial^2 \mathcal F}{\partial t^2} &gt; 0</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.3629em;vertical-align:-0.345em"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.0179em"><span style="top:-2.655em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight" style="margin-right:0.05556em">∂</span><span class="mord mtight"><span class="mord mathnormal mtight">t</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.7463em"><span style="top:-2.786em;margin-right:0.0714em"><span class="pstrut" style="height:2.5em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.394em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight"><span class="mord mtight" style="margin-right:0.05556em">∂</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8913em"><span 
style="top:-2.931em;margin-right:0.0714em"><span class="pstrut" style="height:2.5em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span><span class="mord mathcal mtight" style="margin-right:0.09931em">F</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.345em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">&gt;</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0</span></span></span></span> assumption to be violated.</p>
<p>What's really going on here is that those limits need to be constructed with appropriate scaling factors in order to be invariant under all allowable transformations.
I'm tempted to explain this more but it would add a lot of complexity to the discussion without really changing anything in the end.
I think that the best balance to strike here is to just be aware that there's a little more to the story, but that among all of the equivalent score functions there should exist one that still satisfies our constraints.</p>
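<p>As a quick sanity check on the invariance described above, here's a minimal Python sketch. The <code>score</code> function, its exponent, and the story data are all made up for illustration; none of them are meant to be the real ranking function. Squaring the scores or taking their logarithm (both strictly increasing as long as the scores are positive) leaves the ranking untouched, while squaring turns a function that's linear in votes into one with upward curvature.</p>

```python
import math


def score(t, v):
    """Hypothetical score: linear in votes v, decaying with age t in hours."""
    return v / (t + 2.0) ** 1.5


# Made-up (age in hours, votes) pairs standing in for front-page stories.
stories = [(1.0, 12), (3.5, 80), (7.0, 150), (0.5, 4), (12.0, 400)]


def ranking(score_fn):
    """Indices of the stories sorted from the best score to the worst."""
    return sorted(range(len(stories)), key=lambda i: score_fn(*stories[i]), reverse=True)


def squared(t, v):
    """Strictly increasing transformation of score (valid for positive scores)."""
    return score(t, v) ** 2


def logged(t, v):
    """Another strictly increasing transformation of score."""
    return math.log(score(t, v))


def second_difference_in_v(score_fn, t, v, h=1.0):
    """Unnormalized central second difference in v; its sign matches the curvature."""
    return score_fn(t, v + h) - 2.0 * score_fn(t, v) + score_fn(t, v - h)


base_order = ranking(score)
same_after_square = ranking(squared) == base_order
same_after_log = ranking(logged) == base_order

curvature_base = second_difference_in_v(score, 2.0, 50.0)
curvature_squared = second_difference_in_v(squared, 2.0, 50.0)
```

<p>Both <code>same_after_square</code> and <code>same_after_log</code> come out <code>True</code>, while <code>curvature_squared</code> is positive even though <code>curvature_base</code> is zero. The ranking can't tell these functions apart, which is why the curvature constraints only make sense for a suitably chosen representative.</p>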
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="working-in-observations">Working in Observations<a href="https://sangaline.com/blog/reverse-engineering-the-hacker-news-ranking-algorithm#working-in-observations" class="hash-link" aria-label="Direct link to Working in Observations" title="Direct link to Working in Observations" translate="no">​</a></h2>
<p>So to summarize the last section, this is what we're going to accept as absolute truth:</p>
<ol>
<li class=""><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="script">F</mi><mo stretchy="false">(</mo><mi>t</mi><mo separator="true">,</mo><mi>v</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\mathcal{F}(t, v) </annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathcal" style="margin-right:0.09931em">F</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mclose">)</span></span></span></span> - The score function used to rank stories.</li>
<li class=""><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mfrac><mrow><mi mathvariant="normal">∂</mi><mi mathvariant="script">F</mi></mrow><mrow><mi mathvariant="normal">∂</mi><mi>t</mi></mrow></mfrac><mo>&lt;</mo><mn>0</mn></mrow><annotation encoding="application/x-tex">\frac{\partial \mathcal F}{\partial t} &lt; 0</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.2251em;vertical-align:-0.345em"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.8801em"><span style="top:-2.655em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight" style="margin-right:0.05556em">∂</span><span class="mord mathnormal mtight">t</span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.394em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight" style="margin-right:0.05556em">∂</span><span class="mord mathcal mtight" style="margin-right:0.09931em">F</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.345em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">&lt;</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0</span></span></span></span> - It decreases over time.</li>
<li class=""><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mfrac><mrow><msup><mi mathvariant="normal">∂</mi><mn>2</mn></msup><mi mathvariant="script">F</mi></mrow><mrow><mi mathvariant="normal">∂</mi><msup><mi>t</mi><mn>2</mn></msup></mrow></mfrac><mo>&gt;</mo><mn>0</mn></mrow><annotation encoding="application/x-tex">\frac{\partial^2 \mathcal F}{\partial t^2} &gt; 0</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.3629em;vertical-align:-0.345em"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.0179em"><span style="top:-2.655em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight" style="margin-right:0.05556em">∂</span><span class="mord mtight"><span class="mord mathnormal mtight">t</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.7463em"><span style="top:-2.786em;margin-right:0.0714em"><span class="pstrut" style="height:2.5em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.394em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight"><span class="mord mtight" style="margin-right:0.05556em">∂</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8913em"><span style="top:-2.931em;margin-right:0.0714em"><span class="pstrut" 
style="height:2.5em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span><span class="mord mathcal mtight" style="margin-right:0.09931em">F</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.345em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">&gt;</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0</span></span></span></span> - It curves upwards over time.</li>
<li class=""><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mfrac><mrow><mi mathvariant="normal">∂</mi><mi mathvariant="script">F</mi></mrow><mrow><mi mathvariant="normal">∂</mi><mi>v</mi></mrow></mfrac><mo>&gt;</mo><mn>0</mn></mrow><annotation encoding="application/x-tex">\frac{\partial \mathcal F}{\partial v} &gt; 0</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.2251em;vertical-align:-0.345em"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.8801em"><span style="top:-2.655em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight" style="margin-right:0.05556em">∂</span><span class="mord mathnormal mtight" style="margin-right:0.03588em">v</span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.394em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight" style="margin-right:0.05556em">∂</span><span class="mord mathcal mtight" style="margin-right:0.09931em">F</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.345em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">&gt;</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0</span></span></span></span> - It increases with 
votes.</li>
<li class=""><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mfrac><mrow><msup><mi mathvariant="normal">∂</mi><mn>2</mn></msup><mi mathvariant="script">F</mi></mrow><mrow><mi mathvariant="normal">∂</mi><msup><mi>v</mi><mn>2</mn></msup></mrow></mfrac><mo>≤</mo><mn>0</mn></mrow><annotation encoding="application/x-tex">\frac{\partial^2 \mathcal F}{\partial v^2} \le 0</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.3629em;vertical-align:-0.345em"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.0179em"><span style="top:-2.655em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight" style="margin-right:0.05556em">∂</span><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.03588em">v</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.7463em"><span style="top:-2.786em;margin-right:0.0714em"><span class="pstrut" style="height:2.5em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.394em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight"><span class="mord mtight" style="margin-right:0.05556em">∂</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8913em"><span 
style="top:-2.931em;margin-right:0.0714em"><span class="pstrut" style="height:2.5em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span><span class="mord mathcal mtight" style="margin-right:0.09931em">F</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.345em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">≤</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0</span></span></span></span> - It curves downwards with votes.</li>
</ol>
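<p>The assumptions in the list above can be spot-checked for any proposed score function using finite differences. The sketch below is only an illustration: <code>candidate</code> is a hypothetical function whose exponents were picked arbitrarily, <code>t</code> is taken to be a story's age in hours, and passing the check at one point doesn't prove that the assumptions hold everywhere.</p>

```python
def candidate(t, v):
    """Hypothetical candidate score; the exponents are illustrative only."""
    return v ** 0.8 / (t + 2.0) ** 1.8


def first_difference(f, x, h=0.01):
    """Central finite-difference estimate of the first derivative of f at x."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


def second_difference(f, x, h=0.01):
    """Central finite-difference estimate of the second derivative of f at x."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)


def satisfies_assumptions(score_fn, t, v, tol=1e-9):
    """Spot-check assumptions 2 through 5 at a single point (t, v)."""

    def along_t(x):
        return score_fn(x, v)

    def along_v(x):
        return score_fn(t, x)

    return {
        "decreases over time": first_difference(along_t, t) < 0,
        "curves upwards over time": second_difference(along_t, t) > 0,
        "increases with votes": first_difference(along_v, v) > 0,
        # A small tolerance lets an exactly linear dependence on v pass despite rounding.
        "curves downwards with votes": second_difference(along_v, v) <= tol,
    }


checks = satisfies_assumptions(candidate, t=3.0, v=50.0)
```

<p>Every entry in <code>checks</code> is <code>True</code> for this candidate, whereas something like <code>v ** 2 / (t + 2.0)</code> would fail the last check because it curves upwards with votes.</p>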
<p>Now let's see what they can buy us.</p>
<p>Anything further is going to need to be tied to actual observations on how stories are ranked.
Fortunately, this is something that we can easily observe by visiting the website.
Say that we observe two stories on the front page and Story 1 appears in a better position than Story 2.
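This tells us that Story 1 has a higher score than Story 2 or, more formally, <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="script">F</mi><mo stretchy="false">(</mo><msub><mi>t</mi><mn>1</mn></msub><mo separator="true">,</mo><msub><mi>v</mi><mn>1</mn></msub><mo stretchy="false">)</mo><mo>&gt;</mo><mi mathvariant="script">F</mi><mo stretchy="false">(</mo><msub><mi>t</mi><mn>2</mn></msub><mo separator="true">,</mo><msub><mi>v</mi><mn>2</mn></msub><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\mathcal{F}(t_1, v_1) &gt; \mathcal{F}(t_2, v_2)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathcal" style="margin-right:0.09931em">F</span><span class="mopen">(</span><span class="mord"><span class="mord mathnormal">t</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.0359em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">&gt;</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathcal" style="margin-right:0.09931em">F</span><span class="mopen">(</span><span class="mord"><span class="mord mathnormal">t</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.0359em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mclose">)</span></span></span></span>.</p>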
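<p>To make that concrete, here's a small sketch of how a single snapshot of the front page turns into a set of inequalities, and how a proposed score function can be tested against them. The snapshot data and the <code>candidate</code> function are hypothetical and exist only for illustration; <code>t</code> is assumed to be a story's age in hours.</p>

```python
from itertools import combinations

# Made-up snapshot of a front page, listed from the top position down.
# Each entry is (age in hours, votes).
front_page = [(2.0, 120), (5.0, 250), (9.0, 400), (1.0, 15)]


def candidate(t, v):
    """Hypothetical score function to test against the observations."""
    return v / (t + 2.0) ** 1.8


def pairwise_constraints(stories):
    """Every (better, worse) pair implied by the observed order.

    Each pair asserts that F(t_better, v_better) > F(t_worse, v_worse).
    """
    return list(combinations(stories, 2))


def violations(score_fn, constraints):
    """The constraints that a given score function fails to satisfy."""
    return [(a, b) for a, b in constraints if not score_fn(*a) > score_fn(*b)]


constraints = pairwise_constraints(front_page)
failed = violations(candidate, constraints)
```

<p>Four stories give six pairwise constraints. This particular candidate happens to satisfy all of them, while ranking purely by votes would not.</p>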
<p>We now need to work in our assumptions somehow in order to make this statement more useful.
If you're a little rusty on your calculus then this chart might help to understand what we're about to do.</p>
<p><img decoding="async" loading="lazy" alt="The tangent line is always below the curve" src="https://sangaline.com/assets/images/calculus-101-682023af7c0449ed2f945d4a8e433e6c.png" width="600" height="400" class="img_ev3q"></p>
<p>What's plotted here is <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="script">F</mi><mo stretchy="false">(</mo><mi>t</mi><mo separator="true">,</mo><mi>v</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\mathcal F(t, v)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathcal" style="margin-right:0.09931em">F</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mclose">)</span></span></span></span> for some value of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>v</mi></mrow><annotation encoding="application/x-tex">v</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span></span></span></span>, essentially reducing it to a one-dimensional function.
The red line is the tangent line, which just barely kisses <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="script">F</mi><mo stretchy="false">(</mo><mi>t</mi><mo separator="true">,</mo><mi>v</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\mathcal F(t, v)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathcal" style="margin-right:0.09931em">F</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mclose">)</span></span></span></span> at <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>t</mi><mo>=</mo><msub><mi>t</mi><mn>1</mn></msub></mrow><annotation encoding="application/x-tex">t=t_1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6151em"></span><span class="mord mathnormal">t</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.7651em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal">t</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span> and has a slope of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mfrac><mrow><mi mathvariant="normal">∂</mi><mi mathvariant="script">F</mi></mrow><mrow><mi mathvariant="normal">∂</mi><mi>t</mi></mrow></mfrac><msub><mo fence="true" stretchy="true" minsize="2.4em" maxsize="2.4em">∣</mo><mrow><mi>t</mi><mo>=</mo><msub><mi>t</mi><mn>1</mn></msub></mrow></msub></mrow><annotation encoding="application/x-tex">\frac{\partial \mathcal F}{\partial t} \biggr\rvert_{t=t_{1}}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:2.4em;vertical-align:-0.95em"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.8801em"><span style="top:-2.655em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight" style="margin-right:0.05556em">∂</span><span class="mord mathnormal mtight">t</span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.394em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight" style="margin-right:0.05556em">∂</span><span class="mord mathcal mtight" style="margin-right:0.09931em">F</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.345em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mclose"><span class="mclose"><span class="delimsizing mult"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.45em"><span style="top:-3.45em"><span class="pstrut" style="height:4.4em"></span><span style="width:0.333em;height:2.400em"><svg xmlns="http://www.w3.org/2000/svg" width="0.333em" height="2.400em" viewBox="0 0 333 2400"><path d="M145 15 v585 v1200 v585 c2.667,10,9.667,15,21,15
c10,0,16.667,-5,20,-15 v-585 v-1200 v-585 c-2.667,-10,-9.667,-15,-21,-15
c-10,0,-16.667,5,-20,15z M188 15 H145 v585 v1200 v585 h43z"></path></svg></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.95em"><span></span></span></span></span></span></span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:-0.4em"><span style="top:-1.7em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">t</span><span class="mrel mtight">=</span><span class="mord mtight"><span class="mord mathnormal mtight">t</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3173em"><span style="top:-2.357em;margin-left:0em;margin-right:0.0714em"><span class="pstrut" style="height:2.5em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.143em"><span></span></span></span></span></span></span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:1.1em"><span></span></span></span></span></span></span></span></span></span>.
The arrow points to <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo stretchy="false">(</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>2</mn><mo separator="true">,</mo><mi mathvariant="script">F</mi><mo stretchy="false">(</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo separator="true">,</mo><mi>v</mi><mo stretchy="false">)</mo><mo>+</mo><mo stretchy="false">(</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>2</mn><mo>−</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo stretchy="false">)</mo><mfrac><mrow><mi mathvariant="normal">∂</mi><mi mathvariant="script">F</mi></mrow><mrow><mi mathvariant="normal">∂</mi><mi>t</mi></mrow></mfrac><mo fence="true" stretchy="true" minsize="2.4em" maxsize="2.4em">∣</mo><mi mathvariant="normal">_</mi><mrow><mi>t</mi><mo>=</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>1</mn></mrow><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">(t\_2, \mathcal F(t\_1, v) + (t\_2 - t\_1) \frac{\partial \mathcal F}{\partial t} \biggr\rvert\_{t=t\_{1}})</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.06em;vertical-align:-0.31em"></span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mord">_2</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathcal" style="margin-right:0.09931em">F</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mord">_1</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" 
style="height:1.06em;vertical-align:-0.31em"></span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mord">_2</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:2.4em;vertical-align:-0.95em"></span><span class="mord mathnormal">t</span><span class="mord">_1</span><span class="mclose">)</span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.8801em"><span style="top:-2.655em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight" style="margin-right:0.05556em">∂</span><span class="mord mathnormal mtight">t</span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.394em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight" style="margin-right:0.05556em">∂</span><span class="mord mathcal mtight" style="margin-right:0.09931em">F</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.345em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mclose"><span class="delimsizing mult"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.45em"><span style="top:-3.45em"><span class="pstrut" style="height:4.4em"></span><span style="width:0.333em;height:2.400em"><svg xmlns="http://www.w3.org/2000/svg" width="0.333em" height="2.400em" viewBox="0 0 333 2400"><path d="M145 15 v585 v1200 v585 
c2.667,10,9.667,15,21,15
c10,0,16.667,-5,20,-15 v-585 v-1200 v-585 c-2.667,-10,-9.667,-15,-21,-15
c-10,0,-16.667,5,-20,15z M188 15 H145 v585 v1200 v585 h43z"></path></svg></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.95em"><span></span></span></span></span></span></span><span class="mord" style="margin-right:0.02778em">_</span><span class="mord"><span class="mord mathnormal">t</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mord mathnormal">t</span><span class="mord" style="margin-right:0.02778em">_</span><span class="mord"><span class="mord">1</span></span></span><span class="mclose">)</span></span></span></span> and it's noted that this point is always below <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo stretchy="false">(</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>2</mn><mo separator="true">,</mo><mi mathvariant="script">F</mi><mo stretchy="false">(</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>2</mn><mo separator="true">,</mo><mi>v</mi><mo stretchy="false">)</mo><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">(t\_2, \mathcal F(t\_2, v))</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.06em;vertical-align:-0.31em"></span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mord">_2</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathcal" style="margin-right:0.09931em">F</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mord">_2</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mclose">))</span></span></span></span>.
This would not always be true if the score could curve downwards with time, but Assumption 2 forbids that, so we can safely conclude that <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="script">F</mi><mo stretchy="false">(</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo separator="true">,</mo><mi>v</mi><mo stretchy="false">)</mo><mo>+</mo><mo stretchy="false">(</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>2</mn><mo>−</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo stretchy="false">)</mo><mfrac><mrow><mi mathvariant="normal">∂</mi><mi mathvariant="script">F</mi></mrow><mrow><mi mathvariant="normal">∂</mi><mi>t</mi></mrow></mfrac><mo fence="true" stretchy="true" minsize="2.4em" maxsize="2.4em">∣</mo><mi mathvariant="normal">_</mi><mrow><mi>t</mi><mo>=</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>1</mn></mrow><mo>&lt;</mo><mi mathvariant="script">F</mi><mo stretchy="false">(</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>2</mn><mo separator="true">,</mo><mi>v</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\mathcal F(t\_1, v) + (t\_2 - t\_1) \frac{\partial \mathcal F}{\partial t} \biggr\rvert\_{t=t\_1} &lt; \mathcal F(t\_2, v)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.06em;vertical-align:-0.31em"></span><span class="mord mathcal" style="margin-right:0.09931em">F</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mord">_1</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span 
class="strut" style="height:1.06em;vertical-align:-0.31em"></span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mord">_2</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:2.4em;vertical-align:-0.95em"></span><span class="mord mathnormal">t</span><span class="mord">_1</span><span class="mclose">)</span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.8801em"><span style="top:-2.655em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight" style="margin-right:0.05556em">∂</span><span class="mord mathnormal mtight">t</span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.394em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight" style="margin-right:0.05556em">∂</span><span class="mord mathcal mtight" style="margin-right:0.09931em">F</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.345em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mclose"><span class="delimsizing mult"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.45em"><span style="top:-3.45em"><span class="pstrut" style="height:4.4em"></span><span style="width:0.333em;height:2.400em"><svg xmlns="http://www.w3.org/2000/svg" width="0.333em" height="2.400em" viewBox="0 0 333 2400"><path d="M145 15 v585 v1200 v585 
c2.667,10,9.667,15,21,15
c10,0,16.667,-5,20,-15 v-585 v-1200 v-585 c-2.667,-10,-9.667,-15,-21,-15
c-10,0,-16.667,5,-20,15z M188 15 H145 v585 v1200 v585 h43z"></path></svg></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.95em"><span></span></span></span></span></span></span><span class="mord" style="margin-right:0.02778em">_</span><span class="mord"><span class="mord mathnormal">t</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mord mathnormal">t</span><span class="mord">_1</span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">&lt;</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:1.06em;vertical-align:-0.31em"></span><span class="mord mathcal" style="margin-right:0.09931em">F</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mord">_2</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mclose">)</span></span></span></span>.
Plugging this back into our relationship between the scores of the two observed stories results in</p>
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="script">F</mi><mo stretchy="false">(</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo separator="true">,</mo><mi>v</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo stretchy="false">)</mo><mo>&gt;</mo><mi mathvariant="script">F</mi><mo stretchy="false">(</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo separator="true">,</mo><mi>v</mi><mi mathvariant="normal">_</mi><mn>2</mn><mo stretchy="false">)</mo><mo>+</mo><mo stretchy="false">(</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>2</mn><mo>−</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo stretchy="false">)</mo><mfrac><mrow><mi mathvariant="normal">∂</mi><mi mathvariant="script">F</mi></mrow><mrow><mi mathvariant="normal">∂</mi><mi>t</mi></mrow></mfrac><mo fence="true" stretchy="true" minsize="2.4em" maxsize="2.4em">∣</mo><mi mathvariant="normal">_</mi><mrow><mi>t</mi><mo>=</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo separator="true">,</mo><mi>v</mi><mo>=</mo><mi>v</mi><mi mathvariant="normal">_</mi><mn>2</mn></mrow></mrow><annotation encoding="application/x-tex">\mathcal F(t\_1, v\_1) &gt; \mathcal F(t\_1, v\_2) + (t\_2 - t\_1) \frac{\partial \mathcal F}{\partial t} \biggr\rvert\_{t=t\_1,v=v\_2}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.06em;vertical-align:-0.31em"></span><span class="mord mathcal" style="margin-right:0.09931em">F</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mord">_1</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mord">_1</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">&gt;</span><span class="mspace" 
style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:1.06em;vertical-align:-0.31em"></span><span class="mord mathcal" style="margin-right:0.09931em">F</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mord">_1</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mord">_2</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:1.06em;vertical-align:-0.31em"></span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mord">_2</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:2.4em;vertical-align:-0.95em"></span><span class="mord mathnormal">t</span><span class="mord">_1</span><span class="mclose">)</span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.8801em"><span style="top:-2.655em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight" style="margin-right:0.05556em">∂</span><span class="mord mathnormal mtight">t</span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.394em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight" style="margin-right:0.05556em">∂</span><span 
class="mord mathcal mtight" style="margin-right:0.09931em">F</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.345em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mclose"><span class="delimsizing mult"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.45em"><span style="top:-3.45em"><span class="pstrut" style="height:4.4em"></span><span style="width:0.333em;height:2.400em"><svg xmlns="http://www.w3.org/2000/svg" width="0.333em" height="2.400em" viewBox="0 0 333 2400"><path d="M145 15 v585 v1200 v585 c2.667,10,9.667,15,21,15
c10,0,16.667,-5,20,-15 v-585 v-1200 v-585 c-2.667,-10,-9.667,-15,-21,-15
c-10,0,-16.667,5,-20,15z M188 15 H145 v585 v1200 v585 h43z"></path></svg></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.95em"><span></span></span></span></span></span></span><span class="mord" style="margin-right:0.02778em">_</span><span class="mord"><span class="mord mathnormal">t</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mord mathnormal">t</span><span class="mord">_1</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mord">_2</span></span></span></span></span></p>
<p>A similar inequality for the extrapolation as a function of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>v</mi></mrow><annotation encoding="application/x-tex">v</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span></span></span></span> follows from Assumption 5: <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="script">F</mi><mo stretchy="false">(</mo><mi>t</mi><mo separator="true">,</mo><mi>v</mi><mi mathvariant="normal">_</mi><mn>2</mn><mo stretchy="false">)</mo><mo>+</mo><mo stretchy="false">(</mo><mi>v</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo>−</mo><mi>v</mi><mi mathvariant="normal">_</mi><mn>2</mn><mo stretchy="false">)</mo><mfrac><mrow><mi mathvariant="normal">∂</mi><mi mathvariant="script">F</mi></mrow><mrow><mi mathvariant="normal">∂</mi><mi>v</mi></mrow></mfrac><mo fence="true" stretchy="true" minsize="2.4em" maxsize="2.4em">∣</mo><mi mathvariant="normal">_</mi><mrow><mi>v</mi><mo>=</mo><mi>v</mi><mi mathvariant="normal">_</mi><mn>2</mn></mrow><mo>≥</mo><mi mathvariant="script">F</mi><mo stretchy="false">(</mo><mi>t</mi><mo separator="true">,</mo><mi>v</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\mathcal F(t, v\_2) + (v\_1 - v\_2) \frac{\partial \mathcal F}{\partial v} \biggr\rvert\_{v=v\_2} \ge \mathcal F(t, v\_1)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.06em;vertical-align:-0.31em"></span><span class="mord mathcal" style="margin-right:0.09931em">F</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mpunct">,</span><span 
class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mord">_2</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:1.06em;vertical-align:-0.31em"></span><span class="mopen">(</span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mord">_1</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:2.4em;vertical-align:-0.95em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mord">_2</span><span class="mclose">)</span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.8801em"><span style="top:-2.655em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight" style="margin-right:0.05556em">∂</span><span class="mord mathnormal mtight" style="margin-right:0.03588em">v</span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.394em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight" style="margin-right:0.05556em">∂</span><span class="mord mathcal mtight" style="margin-right:0.09931em">F</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" 
style="height:0.345em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mclose"><span class="delimsizing mult"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.45em"><span style="top:-3.45em"><span class="pstrut" style="height:4.4em"></span><span style="width:0.333em;height:2.400em"><svg xmlns="http://www.w3.org/2000/svg" width="0.333em" height="2.400em" viewBox="0 0 333 2400"><path d="M145 15 v585 v1200 v585 c2.667,10,9.667,15,21,15
c10,0,16.667,-5,20,-15 v-585 v-1200 v-585 c-2.667,-10,-9.667,-15,-21,-15
c-10,0,-16.667,5,-20,15z M188 15 H145 v585 v1200 v585 h43z"></path></svg></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.95em"><span></span></span></span></span></span></span><span class="mord" style="margin-right:0.02778em">_</span><span class="mord"><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mord">_2</span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">≥</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:1.06em;vertical-align:-0.31em"></span><span class="mord mathcal" style="margin-right:0.09931em">F</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mord">_1</span><span class="mclose">)</span></span></span></span>.
The inequality points the opposite way here because <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>g</mi><mo stretchy="false">(</mo><mi>v</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">g(v)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.03588em">g</span><span class="mopen">(</span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mclose">)</span></span></span></span> curves downwards instead of upwards, so the tangent line lies above the curve rather than below it; otherwise it's the exact same concept.
We can now plug this into the left-hand side of the earlier inequality to get</p>
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="script">F</mi><mo stretchy="false">(</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo separator="true">,</mo><mi>v</mi><mi mathvariant="normal">_</mi><mn>2</mn><mo stretchy="false">)</mo><mo>+</mo><mo stretchy="false">(</mo><mi>v</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo>−</mo><mi>v</mi><mi mathvariant="normal">_</mi><mn>2</mn><mo stretchy="false">)</mo><mfrac><mrow><mi mathvariant="normal">∂</mi><mi mathvariant="script">F</mi></mrow><mrow><mi mathvariant="normal">∂</mi><mi>v</mi></mrow></mfrac><mo fence="true" stretchy="true" minsize="2.4em" maxsize="2.4em">∣</mo><mi mathvariant="normal">_</mi><mrow><mi>t</mi><mo>=</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo separator="true">,</mo><mi>v</mi><mo>=</mo><mi>v</mi><mi mathvariant="normal">_</mi><mn>2</mn></mrow><mo>&gt;</mo><mi mathvariant="script">F</mi><mo stretchy="false">(</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo separator="true">,</mo><mi>v</mi><mi mathvariant="normal">_</mi><mn>2</mn><mo stretchy="false">)</mo><mo>+</mo><mo stretchy="false">(</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>2</mn><mo>−</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo stretchy="false">)</mo><mfrac><mrow><mi mathvariant="normal">∂</mi><mi mathvariant="script">F</mi></mrow><mrow><mi mathvariant="normal">∂</mi><mi>t</mi></mrow></mfrac><mo fence="true" stretchy="true" minsize="2.4em" maxsize="2.4em">∣</mo><mi mathvariant="normal">_</mi><mrow><mi>t</mi><mo>=</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo separator="true">,</mo><mi>v</mi><mo>=</mo><mi>v</mi><mi mathvariant="normal">_</mi><mn>2</mn></mrow></mrow><annotation encoding="application/x-tex">\mathcal F(t\_1, v\_2) + (v\_1 - v\_2) \frac{\partial \mathcal F}{\partial v} \biggr\rvert\_{t=t\_1,v=v\_2} &gt; \mathcal F(t\_1, v\_2) + (t\_2 - t\_1) \frac{\partial \mathcal 
F}{\partial t} \biggr\rvert\_{t=t\_1,v=v\_2}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.06em;vertical-align:-0.31em"></span><span class="mord mathcal" style="margin-right:0.09931em">F</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mord">_1</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mord">_2</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:1.06em;vertical-align:-0.31em"></span><span class="mopen">(</span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mord">_1</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:2.4em;vertical-align:-0.95em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mord">_2</span><span class="mclose">)</span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.8801em"><span style="top:-2.655em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight" style="margin-right:0.05556em">∂</span><span class="mord mathnormal mtight" style="margin-right:0.03588em">v</span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.394em"><span 
class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight" style="margin-right:0.05556em">∂</span><span class="mord mathcal mtight" style="margin-right:0.09931em">F</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.345em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mclose"><span class="delimsizing mult"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.45em"><span style="top:-3.45em"><span class="pstrut" style="height:4.4em"></span><span style="width:0.333em;height:2.400em"><svg xmlns="http://www.w3.org/2000/svg" width="0.333em" height="2.400em" viewBox="0 0 333 2400"><path d="M145 15 v585 v1200 v585 c2.667,10,9.667,15,21,15
c10,0,16.667,-5,20,-15 v-585 v-1200 v-585 c-2.667,-10,-9.667,-15,-21,-15
c-10,0,-16.667,5,-20,15z M188 15 H145 v585 v1200 v585 h43z"></path></svg></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.95em"><span></span></span></span></span></span></span><span class="mord" style="margin-right:0.02778em">_</span><span class="mord"><span class="mord mathnormal">t</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mord mathnormal">t</span><span class="mord">_1</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mord">_2</span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">&gt;</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:1.06em;vertical-align:-0.31em"></span><span class="mord mathcal" style="margin-right:0.09931em">F</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mord">_1</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mord">_2</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:1.06em;vertical-align:-0.31em"></span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mord">_2</span><span class="mspace" style="margin-right:0.2222em"></span><span 
class="mbin">−</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:2.4em;vertical-align:-0.95em"></span><span class="mord mathnormal">t</span><span class="mord">_1</span><span class="mclose">)</span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.8801em"><span style="top:-2.655em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight" style="margin-right:0.05556em">∂</span><span class="mord mathnormal mtight">t</span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.394em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight" style="margin-right:0.05556em">∂</span><span class="mord mathcal mtight" style="margin-right:0.09931em">F</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.345em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mclose"><span class="delimsizing mult"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.45em"><span style="top:-3.45em"><span class="pstrut" style="height:4.4em"></span><span style="width:0.333em;height:2.400em"><svg xmlns="http://www.w3.org/2000/svg" width="0.333em" height="2.400em" viewBox="0 0 333 2400"><path d="M145 15 v585 v1200 v585 c2.667,10,9.667,15,21,15
c10,0,16.667,-5,20,-15 v-585 v-1200 v-585 c-2.667,-10,-9.667,-15,-21,-15
c-10,0,-16.667,5,-20,15z M188 15 H145 v585 v1200 v585 h43z"></path></svg></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.95em"><span></span></span></span></span></span></span><span class="mord" style="margin-right:0.02778em">_</span><span class="mord"><span class="mord mathnormal">t</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mord mathnormal">t</span><span class="mord">_1</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mord">_2</span></span></span></span></span></p>
<p>which simplifies to</p>
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo stretchy="false">(</mo><mi>v</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo>−</mo><mi>v</mi><mi mathvariant="normal">_</mi><mn>2</mn><mo stretchy="false">)</mo><mfrac><mrow><mi mathvariant="normal">∂</mi><mi mathvariant="script">F</mi></mrow><mrow><mi mathvariant="normal">∂</mi><mi>v</mi></mrow></mfrac><mo fence="true" stretchy="true" minsize="2.4em" maxsize="2.4em">∣</mo><mi mathvariant="normal">_</mi><mrow><mi>t</mi><mo>=</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo separator="true">,</mo><mi>v</mi><mo>=</mo><mi>v</mi><mi mathvariant="normal">_</mi><mn>2</mn></mrow><mo>&gt;</mo><mo stretchy="false">(</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>2</mn><mo>−</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo stretchy="false">)</mo><mfrac><mrow><mi mathvariant="normal">∂</mi><mi mathvariant="script">F</mi></mrow><mrow><mi mathvariant="normal">∂</mi><mi>t</mi></mrow></mfrac><mo fence="true" stretchy="true" minsize="2.4em" maxsize="2.4em">∣</mo><mi mathvariant="normal">_</mi><mrow><mi>t</mi><mo>=</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo separator="true">,</mo><mi>v</mi><mo>=</mo><mi>v</mi><mi mathvariant="normal">_</mi><mn>2</mn></mrow></mrow><annotation encoding="application/x-tex">(v\_1 - v\_2) \frac{\partial \mathcal F}{\partial v} \biggr\rvert\_{t=t\_1,v=v\_2} &gt; (t\_2 - t\_1) \frac{\partial \mathcal F}{\partial t} \biggr\rvert\_{t=t\_1,v=v\_2}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.06em;vertical-align:-0.31em"></span><span class="mopen">(</span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mord">_1</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em"></span></span><span 
class="base"><span class="strut" style="height:2.4em;vertical-align:-0.95em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mord">_2</span><span class="mclose">)</span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.8801em"><span style="top:-2.655em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight" style="margin-right:0.05556em">∂</span><span class="mord mathnormal mtight" style="margin-right:0.03588em">v</span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.394em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight" style="margin-right:0.05556em">∂</span><span class="mord mathcal mtight" style="margin-right:0.09931em">F</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.345em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mclose"><span class="delimsizing mult"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.45em"><span style="top:-3.45em"><span class="pstrut" style="height:4.4em"></span><span style="width:0.333em;height:2.400em"><svg xmlns="http://www.w3.org/2000/svg" width="0.333em" height="2.400em" viewBox="0 0 333 2400"><path d="M145 15 v585 v1200 v585 c2.667,10,9.667,15,21,15
c10,0,16.667,-5,20,-15 v-585 v-1200 v-585 c-2.667,-10,-9.667,-15,-21,-15
c-10,0,-16.667,5,-20,15z M188 15 H145 v585 v1200 v585 h43z"></path></svg></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.95em"><span></span></span></span></span></span></span><span class="mord" style="margin-right:0.02778em">_</span><span class="mord"><span class="mord mathnormal">t</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mord mathnormal">t</span><span class="mord">_1</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mord">_2</span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">&gt;</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:1.06em;vertical-align:-0.31em"></span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mord">_2</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:2.4em;vertical-align:-0.95em"></span><span class="mord mathnormal">t</span><span class="mord">_1</span><span class="mclose">)</span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.8801em"><span style="top:-2.655em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight" 
style="margin-right:0.05556em">∂</span><span class="mord mathnormal mtight">t</span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.394em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight" style="margin-right:0.05556em">∂</span><span class="mord mathcal mtight" style="margin-right:0.09931em">F</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.345em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mclose"><span class="delimsizing mult"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.45em"><span style="top:-3.45em"><span class="pstrut" style="height:4.4em"></span><span style="width:0.333em;height:2.400em"><svg xmlns="http://www.w3.org/2000/svg" width="0.333em" height="2.400em" viewBox="0 0 333 2400"><path d="M145 15 v585 v1200 v585 c2.667,10,9.667,15,21,15
c10,0,16.667,-5,20,-15 v-585 v-1200 v-585 c-2.667,-10,-9.667,-15,-21,-15
c-10,0,-16.667,5,-20,15z M188 15 H145 v585 v1200 v585 h43z"></path></svg></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.95em"><span></span></span></span></span></span></span><span class="mord" style="margin-right:0.02778em">_</span><span class="mord"><span class="mord mathnormal">t</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mord mathnormal">t</span><span class="mord">_1</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mord">_2</span></span></span></span></span></p>
<p>OK, that's basically it!
We now have an inequality that translates our assumptions into a simple relationship that can be used to determine <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="script">F</mi><mo stretchy="false">(</mo><mi>t</mi><mo separator="true">,</mo><mi>v</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\mathcal F(t, v)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathcal" style="margin-right:0.09931em">F</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mclose">)</span></span></span></span>.
Each observation further constrains a differential equation that can then be solved to find <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="script">F</mi><mo stretchy="false">(</mo><mi>t</mi><mo separator="true">,</mo><mi>v</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\mathcal F(t, v)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathcal" style="margin-right:0.09931em">F</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mclose">)</span></span></span></span>.
This could be done directly but things will be much easier if we introduce an additional assumption here that the relative rate of score decay as a function of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>t</mi></mrow><annotation encoding="application/x-tex">t</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6151em"></span><span class="mord mathnormal">t</span></span></span></span> is independent of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>v</mi></mrow><annotation encoding="application/x-tex">v</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span></span></span></span>.
That makes the differential equation separable and so there exists a solution of the form <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="script">F</mi><mo stretchy="false">(</mo><mi>t</mi><mo separator="true">,</mo><mi>v</mi><mo stretchy="false">)</mo><mo>=</mo><mi>f</mi><mo stretchy="false">(</mo><mi>t</mi><mo stretchy="false">)</mo><mi>g</mi><mo stretchy="false">(</mo><mi>v</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\mathcal F(t, v) = f(t)g(v)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathcal" style="margin-right:0.09931em">F</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.10764em">f</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mclose">)</span><span class="mord mathnormal" style="margin-right:0.03588em">g</span><span class="mopen">(</span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mclose">)</span></span></span></span>.
This assumption just makes things easier to visualize and discuss; it isn't otherwise necessary.</p>
<p>Substituting in this separable form of the score function gives us</p>
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo>−</mo><mfrac><mrow><mi>f</mi><mo stretchy="false">(</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo stretchy="false">)</mo></mrow><mrow><msup><mi>f</mi><mo mathvariant="normal">′</mo></msup><mo stretchy="false">(</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo stretchy="false">)</mo></mrow></mfrac><mo>&gt;</mo><mfrac><mrow><mo stretchy="false">(</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>2</mn><mo>−</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo stretchy="false">)</mo></mrow><mrow><mo stretchy="false">(</mo><mi>v</mi><mi mathvariant="normal">_</mi><mn>2</mn><mo>−</mo><mi>v</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo stretchy="false">)</mo></mrow></mfrac><mfrac><mrow><mi>g</mi><mo stretchy="false">(</mo><mi>v</mi><mi mathvariant="normal">_</mi><mn>2</mn><mo stretchy="false">)</mo></mrow><mrow><msup><mi>g</mi><mo mathvariant="normal">′</mo></msup><mo stretchy="false">(</mo><mi>v</mi><mi mathvariant="normal">_</mi><mn>2</mn><mo stretchy="false">)</mo></mrow></mfrac></mrow><annotation encoding="application/x-tex">-\frac{f(t\_1)}{f^\prime(t\_1)} &gt; \frac{(t\_2 - t\_1)}{(v\_2 - v\_1)} \frac{g(v\_2)}{g^\prime(v\_2)}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.614em;vertical-align:-0.562em"></span><span class="mord">−</span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.052em"><span style="top:-2.655em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.10764em">f</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span 
class="vlist" style="height:0.6828em"><span style="top:-2.786em;margin-right:0.0714em"><span class="pstrut" style="height:2.5em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight">′</span></span></span></span></span></span></span></span><span class="mopen mtight">(</span><span class="mord mathnormal mtight">t</span><span class="mord mtight">_1</span><span class="mclose mtight">)</span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.527em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.10764em">f</span><span class="mopen mtight">(</span><span class="mord mathnormal mtight">t</span><span class="mord mtight">_1</span><span class="mclose mtight">)</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.562em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">&gt;</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:1.614em;vertical-align:-0.562em"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.052em"><span style="top:-2.655em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mopen mtight">(</span><span class="mord mathnormal mtight" style="margin-right:0.03588em">v</span><span class="mord mtight">_2</span><span class="mbin mtight">−</span><span class="mord mathnormal mtight" 
style="margin-right:0.03588em">v</span><span class="mord mtight">_1</span><span class="mclose mtight">)</span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.527em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mopen mtight">(</span><span class="mord mathnormal mtight">t</span><span class="mord mtight">_2</span><span class="mbin mtight">−</span><span class="mord mathnormal mtight">t</span><span class="mord mtight">_1</span><span class="mclose mtight">)</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.562em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.052em"><span style="top:-2.655em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.03588em">g</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.6828em"><span style="top:-2.786em;margin-right:0.0714em"><span class="pstrut" style="height:2.5em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight">′</span></span></span></span></span></span></span></span><span class="mopen mtight">(</span><span class="mord mathnormal mtight" style="margin-right:0.03588em">v</span><span class="mord mtight">_2</span><span class="mclose mtight">)</span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" 
style="border-bottom-width:0.04em"></span></span><span style="top:-3.527em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.03588em">g</span><span class="mopen mtight">(</span><span class="mord mathnormal mtight" style="margin-right:0.03588em">v</span><span class="mord mtight">_2</span><span class="mclose mtight">)</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.562em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span></span></span></span></p>
<p>when <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>v</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo>&gt;</mo><mi>v</mi><mi mathvariant="normal">_</mi><mn>2</mn></mrow><annotation encoding="application/x-tex">v\_1 &gt; v\_2</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mord">_1</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">&gt;</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mord">_2</span></span></span></span> or with a <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo>&lt;</mo></mrow><annotation encoding="application/x-tex">&lt;</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5782em;vertical-align:-0.0391em"></span><span class="mrel">&lt;</span></span></span></span> sign instead of the <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo>&gt;</mo></mrow><annotation encoding="application/x-tex">&gt;</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5782em;vertical-align:-0.0391em"></span><span class="mrel">&gt;</span></span></span></span> sign when <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>v</mi><mi mathvariant="normal">_</mi><mn>2</mn><mo>&gt;</mo><mi>v</mi><mi 
mathvariant="normal">_</mi><mn>1</mn></mrow><annotation encoding="application/x-tex">v\_2 &gt; v\_1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mord">_2</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">&gt;</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mord">_1</span></span></span></span>.
Note that we also implicitly used Assumptions 2 and 4 here to flip the inequality sign appropriately when dividing or multiplying by negative quantities.</p>
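<p>As a rough illustration of the sign bookkeeping above, here is a minimal Python sketch that turns a single observation into a bound on the left-hand side of the inequality. It assumes that an observation means the story at (t_1, v_1) is ranked above the story at (t_2, v_2). The function name <code>decay_bound</code> and the "lower"/"upper" labels are hypothetical and are not taken from any library or from the code used in this article. The example calls pass g(v) = v purely as a placeholder.</p>

```python
from typing import Callable, Tuple


def decay_bound(
    t1: float, v1: float, t2: float, v2: float,
    g: Callable[[float], float],
    g_prime: Callable[[float], float],
) -> Tuple[str, float]:
    """Bound on -f(t1)/f'(t1) implied by story 1 outranking story 2.

    Hypothetical helper: it only evaluates the right-hand side of the
    separable inequality and reports which way the sign points.
    """
    if v1 == v2:
        raise ValueError("the bound is undefined when v1 == v2")
    # Right-hand side: (t2 - t1) / (v2 - v1) * g(v2) / g'(v2)
    bound = (t2 - t1) / (v2 - v1) * g(v2) / g_prime(v2)
    # v1 > v2 keeps the ">" sign (a lower bound on -f/f');
    # v2 > v1 flips the sign, which makes it an upper bound.
    kind = "lower" if v1 > v2 else "upper"
    return kind, bound


# Placeholder choice g(v) = v, so g'(v) = 1.
print(decay_bound(5.0, 10.0, 2.0, 4.0, lambda v: v, lambda v: 1.0))
print(decay_bound(2.0, 4.0, 5.0, 10.0, lambda v: v, lambda v: 1.0))
```

<p>The first call has v_1 greater than v_2 and so yields a lower bound, while the second has v_2 greater than v_1 and yields an upper bound. Collecting many such bounds at different values of t_1 is what constrains the time dependence.</p>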
<p>The general approach that we'll now take is to choose an ansatz, or reasonable guess, for <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>g</mi><mo stretchy="false">(</mo><mi>v</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">g(v)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.03588em">g</span><span class="mopen">(</span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mclose">)</span></span></span></span> and then use that to determine an estimate for <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>f</mi><mo stretchy="false">(</mo><mi>t</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">f(t)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.10764em">f</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mclose">)</span></span></span></span>.
We can then turn around and use this estimate of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>f</mi><mo stretchy="false">(</mo><mi>t</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">f(t)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.10764em">f</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mclose">)</span></span></span></span> to figure out a better guess for <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>g</mi><mo stretchy="false">(</mo><mi>v</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">g(v)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.03588em">g</span><span class="mopen">(</span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mclose">)</span></span></span></span>.
If we repeat this process then things should hopefully settle down to satisfactory functions for <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>g</mi><mo stretchy="false">(</mo><mi>v</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">g(v)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.03588em">g</span><span class="mopen">(</span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mclose">)</span></span></span></span> and <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>f</mi><mo stretchy="false">(</mo><mi>t</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">f(t)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.10764em">f</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mclose">)</span></span></span></span>.
This lets us deal with differential equations of a single variable for simplicity.</p>
<p>Let's use <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>g</mi><mo stretchy="false">(</mo><mi>v</mi><mo stretchy="false">)</mo><mo>=</mo><mi>v</mi></mrow><annotation encoding="application/x-tex">g(v)=v</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.03588em">g</span><span class="mopen">(</span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.4306em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span></span></span></span> as our initial ansatz.
It's simple, satisfies our assumptions, and is probably what I would have picked if I had programmed Hacker News.
Given this ansatz we find that our inequality simplifies to</p>
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo>−</mo><mfrac><mrow><mi>f</mi><mo stretchy="false">(</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo stretchy="false">)</mo></mrow><mrow><msup><mi>f</mi><mo mathvariant="normal">′</mo></msup><mo stretchy="false">(</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo stretchy="false">)</mo></mrow></mfrac><mo>&gt;</mo><msub><mi>v</mi><mn>2</mn></msub><mfrac><mrow><mo stretchy="false">(</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>2</mn><mo>−</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo stretchy="false">)</mo></mrow><mrow><mo stretchy="false">(</mo><mi>v</mi><mi mathvariant="normal">_</mi><mn>2</mn><mo>−</mo><mi>v</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo stretchy="false">)</mo></mrow></mfrac></mrow><annotation encoding="application/x-tex">-\frac{f(t\_1)}{f^\prime(t\_1)} &gt; v_2\frac{(t\_2 - t\_1)}{(v\_2 - v\_1)}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.614em;vertical-align:-0.562em"></span><span class="mord">−</span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.052em"><span style="top:-2.655em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.10764em">f</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.6828em"><span style="top:-2.786em;margin-right:0.0714em"><span class="pstrut" style="height:2.5em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight">′</span></span></span></span></span></span></span></span><span class="mopen mtight">(</span><span class="mord 
mathnormal mtight">t</span><span class="mord mtight">_1</span><span class="mclose mtight">)</span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.527em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.10764em">f</span><span class="mopen mtight">(</span><span class="mord mathnormal mtight">t</span><span class="mord mtight">_1</span><span class="mclose mtight">)</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.562em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">&gt;</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:1.614em;vertical-align:-0.562em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.0359em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.052em"><span style="top:-2.655em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord 
mtight"><span class="mopen mtight">(</span><span class="mord mathnormal mtight" style="margin-right:0.03588em">v</span><span class="mord mtight">_2</span><span class="mbin mtight">−</span><span class="mord mathnormal mtight" style="margin-right:0.03588em">v</span><span class="mord mtight">_1</span><span class="mclose mtight">)</span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.527em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mopen mtight">(</span><span class="mord mathnormal mtight">t</span><span class="mord mtight">_2</span><span class="mbin mtight">−</span><span class="mord mathnormal mtight">t</span><span class="mord mtight">_1</span><span class="mclose mtight">)</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.562em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span></span></span></span></p>
<p>where the direction of the inequality again depends on whether or not <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>v</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo>&gt;</mo><mi>v</mi><mi mathvariant="normal">_</mi><mn>2</mn></mrow><annotation encoding="application/x-tex">v\_1 &gt; v\_2</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mord">_1</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">&gt;</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mord">_2</span></span></span></span>.
To avoid writing that left-hand side over and over, let's define <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>τ</mi><mo stretchy="false">(</mo><mi>t</mi><mo stretchy="false">)</mo><mo>≡</mo><mo>−</mo><mfrac><mrow><mi>f</mi><mo stretchy="false">(</mo><mi>t</mi><mo stretchy="false">)</mo></mrow><mrow><mi>f</mi><mo mathvariant="normal">′</mo><mo stretchy="false">(</mo><mi>t</mi><mo stretchy="false">)</mo></mrow></mfrac></mrow><annotation encoding="application/x-tex">\tau(t) \equiv -\frac{f(t)}{f\prime(t)}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">≡</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:1.53em;vertical-align:-0.52em"></span><span class="mord">−</span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.01em"><span style="top:-2.655em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.10764em">f</span><span class="mord mtight">′</span><span class="mopen mtight">(</span><span class="mord mathnormal mtight">t</span><span class="mclose mtight">)</span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.485em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 
mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.10764em">f</span><span class="mopen mtight">(</span><span class="mord mathnormal mtight">t</span><span class="mclose mtight">)</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.52em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span></span></span></span>, which then reduces our bound equation to</p>
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>τ</mi><mo stretchy="false">(</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo stretchy="false">)</mo><mo>&gt;</mo><mi>v</mi><mi mathvariant="normal">_</mi><mn>2</mn><mfrac><mrow><mo stretchy="false">(</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>2</mn><mo>−</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo stretchy="false">)</mo></mrow><mrow><mo stretchy="false">(</mo><mi>v</mi><mi mathvariant="normal">_</mi><mn>2</mn><mo>−</mo><mi>v</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo stretchy="false">)</mo></mrow></mfrac></mrow><annotation encoding="application/x-tex">\tau(t\_1) &gt; v\_2\frac{(t\_2 - t\_1)}{(v\_2 - v\_1)}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.06em;vertical-align:-0.31em"></span><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mord">_1</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">&gt;</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:1.614em;vertical-align:-0.562em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mord">_2</span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.052em"><span style="top:-2.655em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mopen mtight">(</span><span class="mord mathnormal mtight" style="margin-right:0.03588em">v</span><span class="mord mtight">_2</span><span class="mbin mtight">−</span><span class="mord 
mathnormal mtight" style="margin-right:0.03588em">v</span><span class="mord mtight">_1</span><span class="mclose mtight">)</span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.527em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mopen mtight">(</span><span class="mord mathnormal mtight">t</span><span class="mord mtight">_2</span><span class="mbin mtight">−</span><span class="mord mathnormal mtight">t</span><span class="mord mtight">_1</span><span class="mclose mtight">)</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.562em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span></span></span></span></p>
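<p>As a rough illustration, the right-hand side of this inequality can be evaluated for any pair of stories where one ranks above the other. The sketch below is not taken from the analysis code: the <code>Story</code> and <code>Bound</code> types, the <code>tau_bound</code> name, and the use of hours for story age are all assumptions made for the example.</p>

```python
from typing import NamedTuple, Optional


class Story(NamedTuple):
    votes: float  # v in the equations
    age: float    # t in the equations (assumed here to be hours since submission)


class Bound(NamedTuple):
    kind: str     # "lower" or "upper"
    value: float  # the right-hand side of the inequality


def tau_bound(top: Story, bottom: Story) -> Optional[Bound]:
    """Bound on tau(t_1) implied by `top` (Story 1) ranking above `bottom` (Story 2)."""
    v1, t1 = top
    v2, t2 = bottom
    if v1 == v2:
        return None  # the denominator vanishes, so this pair gives no bound
    value = v2 * (t2 - t1) / (v2 - v1)
    # The inequality is a lower bound when v1 > v2 and flips to an upper bound otherwise.
    kind = "lower" if v1 > v2 else "upper"
    return Bound(kind, value)


# Made-up numbers: an older, higher-voted story ranking first, then the reverse ordering.
older_top = tau_bound(Story(votes=100, age=5), Story(votes=50, age=2))
newer_top = tau_bound(Story(votes=50, age=2), Story(votes=100, age=5))
print(older_top)  # Bound(kind='lower', value=3.0)
print(newer_top)  # Bound(kind='upper', value=6.0)
```

<p>With these made-up numbers, a five-hour-old story with 100 votes ranking above a two-hour-old story with 50 votes gives a lower bound of 3 hours, while the opposite ordering gives an upper bound of 6 hours.</p>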
<p>Let's take a minute to step back and see what this equation is really telling us.
To simplify things a bit, we'll briefly consider the case of exponential decay <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>f</mi><mo stretchy="false">(</mo><mi>t</mi><mo stretchy="false">)</mo><mo>=</mo><msup><mi>e</mi><mrow><mo>−</mo><mi>t</mi><mi mathvariant="normal">/</mi><mi>τ</mi><mi mathvariant="normal">_</mi><mn>0</mn></mrow></msup></mrow><annotation encoding="application/x-tex">f(t)=e^{-t/\tau\_0}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.10764em">f</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.888em"></span><span class="mord"><span class="mord mathnormal">e</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.888em"><span style="top:-3.063em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">−</span><span class="mord mathnormal mtight">t</span><span class="mord mtight">/</span><span class="mord mathnormal mtight" style="margin-right:0.1132em">τ</span><span class="mord mtight">_0</span></span></span></span></span></span></span></span></span></span></span></span> where larger values of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>τ</mi><mi mathvariant="normal">_</mi><mn>0</mn></mrow><annotation encoding="application/x-tex">\tau\_0</annotation></semantics></math></span><span class="katex-html" 
aria-hidden="true"><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="mord">_0</span></span></span></span> mean that stories fall off the front page more slowly.
This isn't an assumption that we'll use in the analysis; we'll just use it to develop some intuition about what the bounds mean.
We can then see that <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>τ</mi><mo stretchy="false">(</mo><mi>t</mi><mo stretchy="false">)</mo><mo>=</mo><mo>−</mo><mfrac><mrow><mi>f</mi><mo stretchy="false">(</mo><mi>t</mi><mo stretchy="false">)</mo></mrow><mrow><mi>f</mi><mo mathvariant="normal">′</mo><mo stretchy="false">(</mo><mi>t</mi><mo stretchy="false">)</mo></mrow></mfrac><mo>=</mo><mi>τ</mi><mi mathvariant="normal">_</mi><mn>0</mn></mrow><annotation encoding="application/x-tex">\tau(t) = -\frac{f(t)}{f\prime(t)} = \tau\_0</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:1.53em;vertical-align:-0.52em"></span><span class="mord">−</span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.01em"><span style="top:-2.655em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.10764em">f</span><span class="mord mtight">′</span><span class="mopen mtight">(</span><span class="mord mathnormal mtight">t</span><span class="mclose mtight">)</span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.485em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 
mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.10764em">f</span><span class="mopen mtight">(</span><span class="mord mathnormal mtight">t</span><span class="mclose mtight">)</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.52em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="mord">_0</span></span></span></span> so our inequalities are going to be directly constraining <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>τ</mi><mi mathvariant="normal">_</mi><mn>0</mn></mrow><annotation encoding="application/x-tex">\tau\_0</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="mord">_0</span></span></span></span> regardless of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>t</mi></mrow><annotation encoding="application/x-tex">t</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6151em"></span><span class="mord mathnormal">t</span></span></span></span>.
The exact nature of the constraint is going to depend on whether <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>v</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo>&gt;</mo><mi>v</mi><mi mathvariant="normal">_</mi><mn>2</mn></mrow><annotation encoding="application/x-tex">v\_1&gt;v\_2</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mord">_1</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">&gt;</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mord">_2</span></span></span></span> and whether or not <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>t</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo>&gt;</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>2</mn></mrow><annotation encoding="application/x-tex">t\_1&gt;t\_2</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal">t</span><span class="mord">_1</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">&gt;</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal">t</span><span class="mord">_2</span></span></span></span>.</p>
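<p>The claim that the ratio −f(t)/f′(t) is the same at every age for exponential decay is easy to check numerically. This is only a sanity-check sketch: the <code>decay_timescale</code> helper, the finite-difference step <code>h</code>, and the value chosen for <code>TAU_0</code> are illustrative assumptions rather than anything used in the analysis.</p>

```python
import math


def decay_timescale(f, t, h=1e-5):
    """Estimate -f(t) / f'(t) using a central finite difference for f'(t)."""
    slope = (f(t + h) - f(t - h)) / (2 * h)
    return -f(t) / slope


TAU_0 = 4.0  # illustrative value, in the same units as t


def exponential_decay(t):
    return math.exp(-t / TAU_0)


# Evaluate the ratio at a few different ages.
estimates = [decay_timescale(exponential_decay, t) for t in (0.5, 2.0, 10.0)]
print([round(value, 6) for value in estimates])
```

<p>Each estimate should agree with <code>TAU_0</code> to within numerical error no matter which age is used, which is why a bound evaluated at any age constrains the same constant in the exponential case.</p>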
<p>If <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>v</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo>&gt;</mo><mi>v</mi><mi mathvariant="normal">_</mi><mn>2</mn></mrow><annotation encoding="application/x-tex">v\_1 &gt; v\_2</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mord">_1</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">&gt;</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mord">_2</span></span></span></span> then we'll be putting a lower bound on <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>τ</mi><mi mathvariant="normal">_</mi><mn>0</mn></mrow><annotation encoding="application/x-tex">\tau\_0</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="mord">_0</span></span></span></span>.
If <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>t</mi><mi mathvariant="normal">_</mi><mn>1</mn></mrow><annotation encoding="application/x-tex">t\_1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal">t</span><span class="mord">_1</span></span></span></span> is less than <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>t</mi><mi mathvariant="normal">_</mi><mn>2</mn></mrow><annotation encoding="application/x-tex">t\_2</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal">t</span><span class="mord">_2</span></span></span></span> then this bound will be negative which doesn't tell us much because <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>τ</mi><mi mathvariant="normal">_</mi><mn>0</mn></mrow><annotation encoding="application/x-tex">\tau\_0</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="mord">_0</span></span></span></span> has to be positive for exponential decay.
This makes a lot of sense because a newer story with more votes <em>has to have a higher score</em> given Assumptions 2 and 4.
Seeing that just confirms our assumptions; it doesn't tell us anything further about <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>f</mi><mo stretchy="false">(</mo><mi>t</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">f(t)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.10764em">f</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mclose">)</span></span></span></span>.
It gets more interesting when the top story is older than the lower story.
That puts a positive lower bound on <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>τ</mi><mi mathvariant="normal">_</mi><mn>0</mn></mrow><annotation encoding="application/x-tex">\tau\_0</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="mord">_0</span></span></span></span> which means that if old stories were to fall off more quickly than the bound allows then Story 1 would have already fallen below Story 2.</p>
<p>On the other hand, if <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>v</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo>&lt;</mo><mi>v</mi><mi mathvariant="normal">_</mi><mn>2</mn></mrow><annotation encoding="application/x-tex">v\_1 &lt; v\_2</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mord">_1</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">&lt;</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mord">_2</span></span></span></span> then we'll be putting an upper bound on <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>τ</mi><mi mathvariant="normal">_</mi><mn>0</mn></mrow><annotation encoding="application/x-tex">\tau\_0</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="mord">_0</span></span></span></span>.
If <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>t</mi><mi mathvariant="normal">_</mi><mn>1</mn></mrow><annotation encoding="application/x-tex">t\_1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal">t</span><span class="mord">_1</span></span></span></span> is greater than <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>t</mi><mi mathvariant="normal">_</mi><mn>2</mn></mrow><annotation encoding="application/x-tex">t\_2</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal">t</span><span class="mord">_2</span></span></span></span> then this bound will be negative, which would mean that <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msup><mi>f</mi><mo mathvariant="normal">′</mo></msup><mo stretchy="false">(</mo><mi>t</mi><mo stretchy="false">)</mo><mo>&gt;</mo><mn>0</mn></mrow><annotation encoding="application/x-tex">f^\prime(t) &gt; 0</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.0019em;vertical-align:-0.25em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.10764em">f</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.7519em"><span style="top:-3.063em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">′</span></span></span></span></span></span></span></span><span class="mopen">(</span><span class="mord 
mathnormal">t</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">&gt;</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0</span></span></span></span>, violating Assumption 2.
Indeed, if Assumptions 2 and 4 hold then we would expect to never see the higher-scored story be both older and have fewer votes, so this issue should never come up.
<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>t</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo>&lt;</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>2</mn></mrow><annotation encoding="application/x-tex">t\_1&lt;t\_2</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal">t</span><span class="mord">_1</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">&lt;</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal">t</span><span class="mord">_2</span></span></span></span> is the more interesting case because it means that the top story is newer but has fewer votes.
This means that Story 2 must have started falling off with a small enough value of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>τ</mi><mi mathvariant="normal">_</mi><mn>0</mn></mrow><annotation encoding="application/x-tex">\tau\_0</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="mord">_0</span></span></span></span> in order to come in second even though it has more votes.</p>
<p>Hopefully those examples make the bounds that we're calculating a little more intuitive.
The lower bounds eliminate faster decays while the upper bounds eliminate slower ones.
The decay doesn't need to be a constant value though; it can of course depend on <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>t</mi></mrow><annotation encoding="application/x-tex">t</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6151em"></span><span class="mord mathnormal">t</span></span></span></span>.
If it does, then we'll need to compute these bounds for different values of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>t</mi></mrow><annotation encoding="application/x-tex">t</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6151em"></span><span class="mord mathnormal">t</span></span></span></span> to get a more complete picture.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-good">The Good<a href="https://sangaline.com/blog/reverse-engineering-the-hacker-news-ranking-algorithm#the-good" class="hash-link" aria-label="Direct link to The Good" title="Direct link to The Good" translate="no">​</a></h2>
<p>Now it's just a matter of actually crunching some data to see what we find.
I put together a dataset of front page snapshots from 2007 to 2017 that includes the vote total, age, and relative position of each story to use for this analysis.
The stories are limited to those that link to external URLs in case self-posts are treated differently.
You can grab the dataset, as well as the analysis code, from this <a href="https://github.com/sangaline/reverse-engineering-the-hacker-news-ranking-algorithm" target="_blank" rel="noopener noreferrer" class="">GitHub repository</a> if you would like to play along at home.</p>
<p>Let's start out by just histogramming the upper and lower bounds that we observe.
Each entry here is a bound on <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>τ</mi><mo stretchy="false">(</mo><mi>t</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\tau(t)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mclose">)</span></span></span></span> obtained by evaluating the inequality that we derived earlier for two observed stories.</p>
<p><img decoding="async" loading="lazy" alt="Histograms of the upper and lower bounds of tau(t)" src="https://sangaline.com/assets/images/tau-limits-2007-2009-a8e707ef4f4ecd5059abfc1f4221f9b6.png" width="1200" height="400" class="img_ev3q"></p>
<p>We can already see how these two histograms can work together to constrain <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>τ</mi><mo stretchy="false">(</mo><mi>t</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\tau(t)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mclose">)</span></span></span></span>.
Let's combine them and plot the number of bounds that a given value of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>τ</mi><mo stretchy="false">(</mo><mi>t</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\tau(t)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mclose">)</span></span></span></span> would violate as a function of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>t</mi></mrow><annotation encoding="application/x-tex">t</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6151em"></span><span class="mord mathnormal">t</span></span></span></span>.</p>
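<p>The counting step can be sketched for a single bin of t as follows. This is a simplified illustration rather than the analysis code from the repository: the function names, the candidate grid, and the bound values are all made up for the example.</p>

```python
def count_violations(tau, lower_bounds, upper_bounds):
    """Count how many observed bounds a candidate value of tau would break."""
    # A lower bound is broken when tau fails to exceed it.
    broken_lower = sum(1 for bound in lower_bounds if bound >= tau)
    # An upper bound is broken when tau reaches or exceeds it.
    broken_upper = sum(1 for bound in upper_bounds if tau >= bound)
    return broken_lower + broken_upper


def best_tau(candidates, lower_bounds, upper_bounds):
    """Pick the candidate that violates the fewest bounds (first one wins ties)."""
    return min(
        candidates,
        key=lambda tau: count_violations(tau, lower_bounds, upper_bounds),
    )


# Hypothetical bounds collected from story pairs that fall in one t bin.
lower_bounds = [1.0, 2.0, 3.0, 8.0]
upper_bounds = [2.5, 5.0, 6.0, 7.0]
candidates = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

winner = best_tau(candidates, lower_bounds, upper_bounds)
print(winner, count_violations(winner, lower_bounds, upper_bounds))  # 4.0 2
```

<p>Real data is noisy, so some bounds contradict each other and no candidate satisfies all of them. In this toy example the outlying lower bound of 8.0 and upper bound of 2.5 are the two that the winning value still violates.</p>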
<p><img decoding="async" loading="lazy" alt="The number of bounds violated as a function of -f(t)/f&#39;(t) and t" src="https://sangaline.com/assets/images/tau-fit-2007-2009-b1c23fe2a9f190979350f516348bab91.png" width="600" height="400" class="img_ev3q"></p>
<p>The points on this plot show the value of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>τ</mi><mo stretchy="false">(</mo><mi>t</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\tau(t)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mclose">)</span></span></span></span> for each <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>t</mi></mrow><annotation encoding="application/x-tex">t</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6151em"></span><span class="mord mathnormal">t</span></span></span></span> bin that violates the fewest bounds.
The line represents the <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>τ</mi><mo stretchy="false">(</mo><mi>t</mi><mo stretchy="false">)</mo><mo>≈</mo><mi>τ</mi><mi mathvariant="normal">_</mi><mn>0</mn><mo>+</mo><mi>τ</mi><mi mathvariant="normal">_</mi><mn>1</mn><mi>t</mi></mrow><annotation encoding="application/x-tex">\tau(t) \approx \tau\_0 + \tau\_1 t</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">≈</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="mord">_0</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="mord">_1</span><span class="mord mathnormal">t</span></span></span></span> fit that minimizes the total number of bounds that are violated.
To my eye, this seems like a pretty reasonable fit and I don't think that we could accurately constrain any higher order terms given the available data.</p>
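<p>To make the fitting procedure a bit more concrete, here's a minimal sketch of counting violated bounds for a linear parameterization and picking the best parameters. The function names, the representation of each bound as a <code>(t, bound)</code> row, the toy bounds, and the brute-force grid search are all assumptions made for the example; the grid search is just a simple stand-in for whatever minimizer you prefer.</p>
<pre><code class="language-python">import numpy as np

def count_violations(tau0, tau1, lower, upper):
    """Count bounds violated by tau(t) = tau0 + tau1 * t.

    `lower` and `upper` are arrays of (t, bound) rows. A lower bound is
    violated when tau(t) falls below it, an upper bound when tau(t) exceeds it.
    """
    tau_at_lower = tau0 + tau1 * lower[:, 0]
    tau_at_upper = tau0 + tau1 * upper[:, 0]
    violated_lower = np.count_nonzero(np.less(tau_at_lower, lower[:, 1]))
    violated_upper = np.count_nonzero(np.greater(tau_at_upper, upper[:, 1]))
    return violated_lower + violated_upper

def fit_linear_tau(lower, upper, tau0_grid, tau1_grid):
    """Brute-force search for the (tau0, tau1) pair that violates the fewest bounds."""
    candidates = [(count_violations(a, b, lower, upper), a, b)
                  for a in tau0_grid for b in tau1_grid]
    violations, tau0, tau1 = min(candidates)
    return tau0, tau1, violations

# Toy bounds (hypothetical, not the Reddit data) that bracket tau(t) = 2 + 0.75 t.
t = np.linspace(0.0, 20.0, 50)
lower = np.column_stack([t, 2.0 + 0.75 * t - 0.5])
upper = np.column_stack([t, 2.0 + 0.75 * t + 0.5])
best = fit_linear_tau(lower, upper, np.linspace(0.0, 4.0, 41), np.linspace(0.0, 1.5, 31))
print(best)
</code></pre>
<p>With real data many parameter pairs can tie for the minimum, so in practice you'd want a rule for breaking ties or a smoother objective.</p>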
<p>The differential equation described by this parameterization of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>τ</mi><mo stretchy="false">(</mo><mi>t</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\tau(t)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mclose">)</span></span></span></span> has a very simple solution:</p>
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>f</mi><mo stretchy="false">(</mo><mi>t</mi><mo stretchy="false">)</mo><mo>=</mo><mo stretchy="false">(</mo><mi>τ</mi><mi mathvariant="normal">_</mi><mn>0</mn><mo>+</mo><mi>τ</mi><mi mathvariant="normal">_</mi><mn>1</mn><mi>t</mi><msup><mo stretchy="false">)</mo><mrow><mo>−</mo><mfrac><mn>1</mn><mrow><mi>τ</mi><mi mathvariant="normal">_</mi><mn>1</mn></mrow></mfrac></mrow></msup></mrow><annotation encoding="application/x-tex">f(t)=(\tau\_0 + \tau\_1 t)^{-\frac{1}{\tau\_1}}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.10764em">f</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:1.06em;vertical-align:-0.31em"></span><span class="mopen">(</span><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="mord">_0</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:1.4046em;vertical-align:-0.31em"></span><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="mord">_1</span><span class="mord mathnormal">t</span><span class="mclose"><span class="mclose">)</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:1.0946em"><span style="top:-3.5036em;margin-right:0.05em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord 
mtight"><span class="mord mtight">−</span><span class="mord mtight"><span class="mopen nulldelimiter sizing reset-size3 size6"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.8443em"><span style="top:-2.656em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.1132em">τ</span><span class="mord mtight">_1</span></span></span></span><span style="top:-3.2255em"><span class="pstrut" style="height:3em"></span><span class="frac-line mtight" style="border-bottom-width:0.049em"></span></span><span style="top:-3.384em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight"><span class="mord mtight">1</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.5654em"><span></span></span></span></span></span><span class="mclose nulldelimiter sizing reset-size3 size6"></span></span></span></span></span></span></span></span></span></span></span></span></span></p>
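<p>If you'd like to check that solution without doing the calculus, here's a small sketch that recovers the ratio -f(t)/f'(t) numerically and compares it to the linear form. The function names, the finite-difference step, and the parameter values are arbitrary choices for illustration.</p>
<pre><code class="language-python">import math

def f(t, tau0, tau1):
    """Candidate solution f(t) = (tau0 + tau1 * t) ** (-1 / tau1)."""
    return (tau0 + tau1 * t) ** (-1.0 / tau1)

def tau_from_f(t, tau0, tau1, h=1e-6):
    """Recover tau(t) = -f(t) / f'(t) using a central finite difference."""
    derivative = (f(t + h, tau0, tau1) - f(t - h, tau0, tau1)) / (2.0 * h)
    return -f(t, tau0, tau1) / derivative

tau0, tau1 = 2.0, 0.5  # arbitrary example values
for t in (0.0, 1.0, 5.0, 24.0):
    recovered = tau_from_f(t, tau0, tau1)
    expected = tau0 + tau1 * t
    print(t, recovered, expected, math.isclose(recovered, expected, rel_tol=1e-6))
</code></pre>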
<p>Interestingly, you can see that the limit of this quantity as <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>τ</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo>→</mo><mn>0</mn></mrow><annotation encoding="application/x-tex">\tau\_1 \to 0</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="mord">_1</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">→</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0</span></span></span></span> is proportional to, and therefore equivalent to in terms of ranking, <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msup><mi>e</mi><mrow><mo>−</mo><mi>t</mi><mi mathvariant="normal">/</mi><mi>τ</mi><mi mathvariant="normal">_</mi><mn>0</mn></mrow></msup></mrow><annotation encoding="application/x-tex">e^{-t / \tau\_0}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.888em"></span><span class="mord"><span class="mord mathnormal">e</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.888em"><span style="top:-3.063em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">−</span><span class="mord mathnormal mtight">t</span><span class="mord mtight">/</span><span class="mord mathnormal mtight" style="margin-right:0.1132em">τ</span><span class="mord 
mtight">_0</span></span></span></span></span></span></span></span></span></span></span></span>.
This is the function we played around with when trying to develop some intuition about what the bounds mean and we found that <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>τ</mi><mo stretchy="false">(</mo><mi>t</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\tau(t)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mclose">)</span></span></span></span> would be constrained to a constant value of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>τ</mi><mi mathvariant="normal">_</mi><mn>0</mn></mrow><annotation encoding="application/x-tex">\tau\_0</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="mord">_0</span></span></span></span>.
We can think of the solution we found here as a generalization of exponential decay that adds another parameter to make the score fall off more slowly for larger values of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>t</mi></mrow><annotation encoding="application/x-tex">t</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6151em"></span><span class="mord mathnormal">t</span></span></span></span>.</p>
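<p>The limit is also easy to see numerically. In this sketch I divide by f(0) to remove the constant of proportionality, which leaves a quantity that should approach plain exponential decay as the slope parameter shrinks. The function name and the sample values are made up for the example.</p>
<pre><code class="language-python">import math

def normalized_decay(t, tau0, tau1):
    """f(t) / f(0) for f(t) = (tau0 + tau1 * t) ** (-1 / tau1)."""
    return (1.0 + tau1 * t / tau0) ** (-1.0 / tau1)

tau0, t = 2.0, 3.0
target = math.exp(-t / tau0)  # exponential decay with time constant tau0
for tau1 in (1.0, 0.1, 0.01, 0.001):
    print(tau1, normalized_decay(t, tau0, tau1), target)
</code></pre>
<p>Evaluating the same function at a larger age with a slope that isn't small shows the other half of the story: the power law sits well above the exponential, so old posts are penalized less harshly.</p>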
<p>This plot of the bounds violations captures what I was talking about when I said that the data could tell us how the algorithm works.
From just glancing at the figure, it's obvious that the score decays as a power law with age and you can even pull out rough estimates for the parameters by eye: <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>τ</mi><mi mathvariant="normal">_</mi><mn>0</mn></mrow><annotation encoding="application/x-tex">\tau\_0</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="mord">_0</span></span></span></span> is about 2 hours and <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>τ</mi><mi mathvariant="normal">_</mi><mn>1</mn></mrow><annotation encoding="application/x-tex">\tau\_1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="mord">_1</span></span></span></span> is about 0.75.
There's just something about that which I find extremely satisfying.</p>
<p>Perhaps I'm getting ahead of myself a bit though... we said before that <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>g</mi><mo stretchy="false">(</mo><mi>v</mi><mo stretchy="false">)</mo><mo>=</mo><mi>v</mi></mrow><annotation encoding="application/x-tex">g(v)=v</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.03588em">g</span><span class="mopen">(</span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.4306em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span></span></span></span> was just a guess and that we would need to revisit it after we used it to constrain <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>f</mi><mo stretchy="false">(</mo><mi>t</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">f(t)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.10764em">f</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mclose">)</span></span></span></span>.
Let's flip it around now and use our newly determined function <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>f</mi><mo stretchy="false">(</mo><mi>t</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">f(t)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.10764em">f</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mclose">)</span></span></span></span> to constrain <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mfrac><mrow><mi>g</mi><mo stretchy="false">(</mo><mi>v</mi><mo stretchy="false">)</mo></mrow><mrow><msup><mi>g</mi><mo mathvariant="normal">′</mo></msup><mo stretchy="false">(</mo><mi>v</mi><mo stretchy="false">)</mo></mrow></mfrac></mrow><annotation encoding="application/x-tex">\frac{g(v)}{g^\prime(v)}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.53em;vertical-align:-0.52em"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.01em"><span style="top:-2.655em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.03588em">g</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.6828em"><span style="top:-2.786em;margin-right:0.0714em"><span class="pstrut" style="height:2.5em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord 
mtight">′</span></span></span></span></span></span></span></span><span class="mopen mtight">(</span><span class="mord mathnormal mtight" style="margin-right:0.03588em">v</span><span class="mclose mtight">)</span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.485em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.03588em">g</span><span class="mopen mtight">(</span><span class="mord mathnormal mtight" style="margin-right:0.03588em">v</span><span class="mclose mtight">)</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.52em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span></span></span></span> which we'll denote <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>ν</mi><mo stretchy="false">(</mo><mi>v</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\nu(v)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.06366em">ν</span><span class="mopen">(</span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mclose">)</span></span></span></span>.</p>
<p><img decoding="async" loading="lazy" alt="The number of bounds violated as a function of g(v)/g&#39;(v) and v" src="https://sangaline.com/assets/images/nu-fit-2007-2009-640b59e1dadeb971e5c7c86ade3bdb10.png" width="600" height="400" class="img_ev3q"></p>
<p>The line again represents a minimization of the total bound violations, this time parameterized as <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>ν</mi><mo stretchy="false">(</mo><mi>v</mi><mo stretchy="false">)</mo><mo>=</mo><mi>ν</mi><mi mathvariant="normal">_</mi><mn>0</mn><mo>+</mo><mi>ν</mi><mi mathvariant="normal">_</mi><mn>1</mn><mi>v</mi></mrow><annotation encoding="application/x-tex">\nu(v)=\nu\_0 + \nu\_1 v</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.06366em">ν</span><span class="mopen">(</span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal" style="margin-right:0.06366em">ν</span><span class="mord">_0</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal" style="margin-right:0.06366em">ν</span><span class="mord">_1</span><span class="mord mathnormal" style="margin-right:0.03588em">v</span></span></span></span>.
The fit parameters are <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>ν</mi><mi mathvariant="normal">_</mi><mn>0</mn><mo>=</mo><mo>−</mo><mn>0.41</mn><mo>±</mo><mn>0.31</mn></mrow><annotation encoding="application/x-tex">\nu\_0=-0.41 \pm 0.31</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal" style="margin-right:0.06366em">ν</span><span class="mord">_0</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.7278em;vertical-align:-0.0833em"></span><span class="mord">−</span><span class="mord">0.41</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">±</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0.31</span></span></span></span> and <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>ν</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo>=</mo><mn>1.091</mn><mo>±</mo><mn>0.069</mn></mrow><annotation encoding="application/x-tex">\nu\_1=1.091 \pm 0.069</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal" style="margin-right:0.06366em">ν</span><span class="mord">_1</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.7278em;vertical-align:-0.0833em"></span><span class="mord">1.091</span><span 
class="mspace" style="margin-right:0.2222em"></span><span class="mbin">±</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0.069</span></span></span></span> where the errors are only statistical and were determined by bootstrapping the set of bounds on <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>ν</mi><mo stretchy="false">(</mo><mi>v</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\nu(v)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.06366em">ν</span><span class="mopen">(</span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mclose">)</span></span></span></span>.
Both fitted values lie within about 1.3 statistical standard errors of our ansatz of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>ν</mi><mi mathvariant="normal">_</mi><mn>0</mn><mo>=</mo><mn>0</mn></mrow><annotation encoding="application/x-tex">\nu\_0=0</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal" style="margin-right:0.06366em">ν</span><span class="mord">_0</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0</span></span></span></span> and <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>ν</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo>=</mo><mn>1</mn></mrow><annotation encoding="application/x-tex">\nu\_1=1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal" style="margin-right:0.06366em">ν</span><span class="mord">_1</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">1</span></span></span></span>, so the initial guess appears to have been fairly accurate.</p>
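<p>For anybody unfamiliar with bootstrapping, the idea is to resample the set of bounds with replacement, redo the fit on each resample, and take the spread of the fitted parameters as the statistical error. Here's a rough sketch of that loop. The helper names and the synthetic rows are hypothetical, and <code>toy_fit</code> is an ordinary least-squares line used as a stand-in for the bound-violation minimization so that the example stays short.</p>
<pre><code class="language-python">import numpy as np

def bootstrap_errors(bounds, fit, n_resamples=200, seed=0):
    """Estimate statistical errors on fit parameters by bootstrapping.

    `bounds` is an array with one bound per row and `fit` maps such an array
    to a tuple of parameters. Each resample draws rows with replacement.
    """
    rng = np.random.default_rng(seed)
    n = len(bounds)
    estimates = np.array([fit(bounds[rng.integers(0, n, size=n)])
                          for _ in range(n_resamples)])
    return estimates.mean(axis=0), estimates.std(axis=0, ddof=1)

def toy_fit(rows):
    """Stand-in fit: least-squares line through (v, value) rows, returns (nu0, nu1)."""
    slope, intercept = np.polyfit(rows[:, 0], rows[:, 1], 1)
    return intercept, slope

# Synthetic rows scattered around nu(v) = v, purely for demonstration.
rng = np.random.default_rng(42)
v = rng.uniform(1.0, 50.0, size=300)
rows = np.column_stack([v, v + rng.normal(0.0, 2.0, size=300)])
means, errors = bootstrap_errors(rows, toy_fit)
print(means, errors)
</code></pre>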
<p>We could now try to iterate further using this solution but things are complicated by the fact that an arbitrary scaling factor can be applied to <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>ν</mi><mi mathvariant="normal">_</mi><mn>1</mn></mrow><annotation encoding="application/x-tex">\nu\_1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal" style="margin-right:0.06366em">ν</span><span class="mord">_1</span></span></span></span> and <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>τ</mi><mi mathvariant="normal">_</mi><mn>1</mn></mrow><annotation encoding="application/x-tex">\tau\_1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="mord">_1</span></span></span></span> without affecting the ranking properties of the score function.
Unfortunately, the main thing that this would accomplish would be to let these parameters drift indiscriminately.
If we had found non-zero curvature for <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>ν</mi><mo stretchy="false">(</mo><mi>v</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\nu(v)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.06366em">ν</span><span class="mopen">(</span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mclose">)</span></span></span></span> then we could have fixed the linear term to <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>1</mn></mrow><annotation encoding="application/x-tex">1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">1</span></span></span></span> and then used the iterative process to determine the higher order terms.
The only thing that we would be constraining in this linear case would be <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>ν</mi><mi mathvariant="normal">_</mi><mn>0</mn></mrow><annotation encoding="application/x-tex">\nu\_0</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal" style="margin-right:0.06366em">ν</span><span class="mord">_0</span></span></span></span> and it doesn't seem like we have the resolving power to differentiate between <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>0</mn></mrow><annotation encoding="application/x-tex">0</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0</span></span></span></span> and <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo>−</mo><mn>1</mn></mrow><annotation encoding="application/x-tex">-1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.7278em;vertical-align:-0.0833em"></span><span class="mord">−</span><span class="mord">1</span></span></span></span> so there isn't much point in doing that.</p>
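<p>The scaling freedom is easy to demonstrate. In the sketch below I assume a zero intercept for the vote term, which integrates to g(v) = v^(1/ν_1), and I rescale τ_0 along with τ_1 and ν_1 so that the linear age term is rescaled as a whole; that reading, the function name, and the random posts are my own assumptions for the example. The rescaled score is a constant times a positive power of the original one, so the ordering can't change.</p>
<pre><code class="language-python">import numpy as np

def score(t, v, tau0, tau1, nu1):
    """Score g(v) * f(t) with g(v) = v ** (1 / nu1) and
    f(t) = (tau0 + tau1 * t) ** (-1 / tau1)."""
    return v ** (1.0 / nu1) * (tau0 + tau1 * t) ** (-1.0 / tau1)

rng = np.random.default_rng(0)
ages = rng.uniform(0.0, 48.0, size=25)     # hypothetical post ages in hours
votes = rng.uniform(1.0, 500.0, size=25)   # hypothetical vote counts

base = score(ages, votes, tau0=2.0, tau1=0.75, nu1=1.0)
k = 3.0  # arbitrary scaling factor
scaled = score(ages, votes, tau0=k * 2.0, tau1=k * 0.75, nu1=k * 1.0)

# Compare the orderings produced by the two parameter sets.
same_ranking = np.array_equal(np.argsort(base), np.argsort(scaled))
print(same_ranking)
</code></pre>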
<p>Ultimately, we conclude that the score function, or at least <em>an equivalent score function</em>, is</p>
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="script">F</mi><mo stretchy="false">(</mo><mi>t</mi><mo separator="true">,</mo><mi>v</mi><mo stretchy="false">)</mo><mo>=</mo><mfrac><mi>v</mi><mrow><mo stretchy="false">(</mo><mfrac><mrow><mi>τ</mi><mi mathvariant="normal">_</mi><mn>0</mn></mrow><mrow><mi>τ</mi><mi mathvariant="normal">_</mi><mn>1</mn></mrow></mfrac><mo>+</mo><mi>t</mi><msup><mo stretchy="false">)</mo><mfrac><mn>1</mn><mrow><mi>τ</mi><mi mathvariant="normal">_</mi><mn>1</mn></mrow></mfrac></msup></mrow></mfrac></mrow><annotation encoding="application/x-tex">\mathcal F(t, v) = \frac{v}{(\frac{\tau\_0}{\tau\_1} + t)^{\frac{1}{\tau\_1}}}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathcal" style="margin-right:0.09931em">F</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:1.7583em;vertical-align:-1.0629em"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.6954em"><span style="top:-2.3329em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mopen mtight">(</span><span class="mord mtight"><span class="mopen nulldelimiter sizing reset-size3 size6"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span 
class="vlist-r"><span class="vlist" style="height:1.0052em"><span style="top:-2.656em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.1132em">τ</span><span class="mord mtight">_1</span></span></span></span><span style="top:-3.2255em"><span class="pstrut" style="height:3em"></span><span class="frac-line mtight" style="border-bottom-width:0.049em"></span></span><span style="top:-3.5449em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.1132em">τ</span><span class="mord mtight">_0</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.5654em"><span></span></span></span></span></span><span class="mclose nulldelimiter sizing reset-size3 size6"></span></span><span class="mbin mtight">+</span><span class="mord mathnormal mtight">t</span><span class="mclose mtight"><span class="mclose mtight">)</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:1.2245em"><span style="top:-3.4878em;margin-right:0.0714em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight"><span class="mord mtight"><span class="mopen nulldelimiter sizing reset-size1 size6"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.0314em"><span style="top:-2.468em"><span class="pstrut" style="height:3em"></span><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.1132em">τ</span><span class="mord mtight">_1</span></span></span><span style="top:-3.2255em"><span class="pstrut" style="height:3em"></span><span class="frac-line mtight" style="border-bottom-width:0.049em"></span></span><span 
style="top:-3.387em"><span class="pstrut" style="height:3em"></span><span class="mord mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.532em"><span></span></span></span></span></span><span class="mclose nulldelimiter sizing reset-size1 size6"></span></span></span></span></span></span></span></span></span></span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.394em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.03588em">v</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:1.0629em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span></span></span></span></p>
<p>with values of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>τ</mi><mn>0</mn></msub><mo>=</mo><mn>1.95</mn><mo>±</mo><mn>0.16</mn></mrow><annotation encoding="application/x-tex">\tau_0 = 1.95 \pm 0.16</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1132em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">0</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.7278em;vertical-align:-0.0833em"></span><span class="mord">1.95</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">±</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0.16</span></span></span></span> and <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>τ</mi><mn>1</mn></msub><mo>=</mo><mn>0.688</mn><mo>±</mo><mn>0.033</mn></mrow><annotation encoding="application/x-tex">\tau_1 = 0.688 \pm 0.033</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1132em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.7278em;vertical-align:-0.0833em"></span><span class="mord">0.688</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">±</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0.033</span></span></span></span> extracted from the fit.</p>
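<p>As a rough illustration, the fitted score function can be evaluated with a few lines of Python. This is only a sketch that plugs in the central values of the fit: the name <code>fitted_score</code> and its arguments are hypothetical placeholders rather than anything from the analysis code, and <code>t</code> is assumed to be the story age in hours.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token comment" style="color:#a89984"># Central values from the fit; the quoted uncertainties are ignored here.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">TAU_0 = 1.95</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">TAU_1 = 0.688</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">def fitted_score(t, v, tau_0=TAU_0, tau_1=TAU_1):</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    """Evaluate F(t, v) = v / (tau_0 / tau_1 + t) ** (1 / tau_1), with t assumed to be in hours."""</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    return v / (tau_0 / tau_1 + t) ** (1.0 / tau_1)</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token comment" style="color:#a89984"># Two hypothetical stories with the same vote count but different ages.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">newer = fitted_score(t=1.0, v=50)</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">older = fitted_score(t=8.0, v=50)</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">print(newer, older)  </span><span class="token comment" style="color:#a89984"># the older story gets the smaller score</span><br></span></code></pre></div></div>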
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-bad">The Bad<a href="https://sangaline.com/blog/reverse-engineering-the-hacker-news-ranking-algorithm#the-bad" class="hash-link" aria-label="Direct link to The Bad" title="Direct link to The Bad" translate="no">​</a></h2>
<p>I mentioned in the previous section that I put together a dataset of stories from 2007-2017.
What I <em>didn't mention</em> is that I was only showing you an analysis of data from 2007 and 2008.
That's partially because I wanted you to see this approach succeed before seeing it fail but it's also because it's a little easier to understand this next figure after walking through the analysis for a single time period.</p>
<p><img decoding="async" loading="lazy" alt="The time dependence of the f(t)/f&amp;#39;(t) constraint" src="https://sangaline.com/assets/images/bound-violations-over-time-ec2683c741a2958d8dc99076aa22a09d.png" width="600" height="400" class="img_ev3q"></p>
<p>This figure shows the number of bound violations for different values of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>τ</mi><mo stretchy="false">(</mo><mi>t</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\tau(t)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mclose">)</span></span></span></span> within a window of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>6</mn><mo>&lt;</mo><mi>t</mi><mo>&lt;</mo><mn>8</mn></mrow><annotation encoding="application/x-tex">6 &lt; t &lt; 8</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6835em;vertical-align:-0.0391em"></span><span class="mord">6</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">&lt;</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6542em;vertical-align:-0.0391em"></span><span class="mord mathnormal">t</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">&lt;</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">8</span></span></span></span> for each three month period from mid 2007 until present.
The columns are also normalized to give a consistent picture even though the amount of data varies over time.</p>
<p>It's clear that <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>τ</mi><mo stretchy="false">(</mo><mi>t</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\tau(t)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mclose">)</span></span></span></span> dropped significantly around the first quarter of 2009 which tells us that the score function was adjusted so that stories would drop off more quickly.
It seems like maybe there were changes around mid-2010 and mid-2011 as well but it's hard to tell exactly because the amount of data is also changing and this could just be shifts in the quantity of noise.
What's unmistakable, however, is that our valley of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>τ</mi><mo stretchy="false">(</mo><mi>t</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\tau(t)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mclose">)</span></span></span></span> values with relatively few bound violations gets demolished in mid-2014.
This sort of change is what we would expect if our assumptions get seriously broken (<em>e.g.</em> by heavy-duty vote fuzzing).</p>
<p>I just want to take a second to emphasize that it's really cool that the general approach we've taken allows us to visually see when these changes in the algorithm occur as well as their nature.
That said, these algorithm changes unfortunately do mark a breakdown in some of our assumptions and that means that any results we pull out will become less accurate.
Our approach is somewhat robust against factors like flagging and vote fuzzing: both of these will basically add noise to the upper and lower bound values which will cancel to leading order when determining the best values for <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>τ</mi><mo stretchy="false">(</mo><mi>t</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\tau(t)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mclose">)</span></span></span></span> or <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>ν</mi><mo stretchy="false">(</mo><mi>v</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\nu(v)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.06366em">ν</span><span class="mopen">(</span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mclose">)</span></span></span></span>.
If additional factors become large enough that the lower and upper bounds overlap a lot though then the extracted fit parameters may be seriously biased.
Just by eye, it's clear that this will be the case for mid-2014 through present and there's a good chance that it's also the case for 2009 through mid-2014.</p>
<p>Let's look briefly at what happens in the data after mid-2014.
The <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>τ</mi><mo stretchy="false">(</mo><mi>t</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\tau(t)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mclose">)</span></span></span></span> constraints actually aren't <em>that bad</em>...</p>
<p><img decoding="async" loading="lazy" alt="The number of bounds violated as a function of -f(t)/f&amp;#39;(t) and t for 2014-2017" src="https://sangaline.com/assets/images/tau-fit-2014-2017-cce3e1c213db9b44603616e6a89363ff.png" width="600" height="400" class="img_ev3q"></p>
<p>You could almost imagine how deconvolving that image would give back our nice clean valley but it's not really possible to put too much faith in the extracted parameters.
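You can definitely rule out exponential behavior, though, and you can probably constrain <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>τ</mi><mn>1</mn></msub></mrow><annotation encoding="application/x-tex">\tau_1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1132em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span> to within 50% or so.</p>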
<p>Where things get really bad is when we use the extracted <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>τ</mi><mo stretchy="false">(</mo><mi>t</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\tau(t)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mclose">)</span></span></span></span> function to constrain <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>ν</mi><mo stretchy="false">(</mo><mi>v</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\nu(v)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.06366em">ν</span><span class="mopen">(</span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mclose">)</span></span></span></span>.</p>
<p><img decoding="async" loading="lazy" alt="The number of bounds violated as a function of g(v)/g&amp;#39;(v) and v for 2014-2017" src="https://sangaline.com/assets/images/nu-fit-2014-2017-01b7573b28900f3148d273579ca0cfb4.png" width="600" height="400" class="img_ev3q"></p>
<p>The extracted value for <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>ν</mi><mn>1</mn></msub></mrow><annotation encoding="application/x-tex">\nu_1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.06366em">ν</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.0637em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span> is <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>0.52</mn><mo>±</mo><mn>0.01</mn></mrow><annotation encoding="application/x-tex">0.52 \pm 0.01</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.7278em;vertical-align:-0.0833em"></span><span class="mord">0.52</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">±</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0.01</span></span></span></span> which isn't consistent with our ansatz at all (and the agreement of the first-order terms can't be improved via iteration).
What's most likely happening here is that there are just way more upper bounds than lower bounds and the blurring effects introduced by vote fuzzing, story promotion, <em>etc.</em> are causing these to overwhelm the lower bounds and push the preferred <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>ν</mi><mo stretchy="false">(</mo><mi>v</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\nu(v)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.06366em">ν</span><span class="mopen">(</span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mclose">)</span></span></span></span> down significantly.
If <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>g</mi><mo stretchy="false">(</mo><mi>v</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">g(v)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.03588em">g</span><span class="mopen">(</span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mclose">)</span></span></span></span> is still equal to <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>v</mi></mrow><annotation encoding="application/x-tex">v</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span></span></span></span> then maybe we can kinda sorta use our <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>τ</mi><mo stretchy="false">(</mo><mi>t</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\tau(t)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mclose">)</span></span></span></span> constraints to determine <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="script">F</mi><mo stretchy="false">(</mo><mi>t</mi><mo separator="true">,</mo><mi>v</mi><mo stretchy="false">)</mo></mrow><annotation 
encoding="application/x-tex">\mathcal F(t, v)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathcal" style="margin-right:0.09931em">F</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mclose">)</span></span></span></span>, but we really can't do that with the same level of confidence that we did for the data from 2007-2008.</p>
<p>The situation for the 2009-2014 data has the same problem to a much, much lesser extent.
This article already has more than enough figures so I won't show the plots here but they're included in <a href="https://github.com/sangaline/reverse-engineering-the-hacker-news-ranking-algorithm" target="_blank" rel="noopener noreferrer" class="">the GitHub repository</a> if you would like to see them.
The extracted value for <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>ν</mi><mn>1</mn></msub></mrow><annotation encoding="application/x-tex">\nu_1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.06366em">ν</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.0637em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span> here is <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>0.81</mn><mo>±</mo><mn>0.02</mn></mrow><annotation encoding="application/x-tex">0.81 \pm 0.02</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.7278em;vertical-align:-0.0833em"></span><span class="mord">0.81</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">±</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0.02</span></span></span></span> which is much better than in the 2014-2017 data but still demonstrates some violation of our assumptions.
Despite that, the valley is clearly defined and you can much more clearly see how it was shifted down.
I would actually be willing to somewhat trust the <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>τ</mi><mo stretchy="false">(</mo><mi>t</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\tau(t)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mclose">)</span></span></span></span> parameterization in this case while also recognizing that there are definitely additional factors at play.</p>
<p>An interesting thing to note here is that both of these later time periods suggest <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>v</mi><mn>0</mn></msub><mo>=</mo><mo>−</mo><mn>1</mn></mrow><annotation encoding="application/x-tex">v_0=-1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.0359em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">0</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.7278em;vertical-align:-0.0833em"></span><span class="mord">−</span><span class="mord">1</span></span></span></span> rather than <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>0</mn></mrow><annotation encoding="application/x-tex">0</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0</span></span></span></span>.
It's a little hard to trust this coming from data where our assumptions clearly break down, but with the 2007-2009 data being inconclusive and the 2009-2014 data looking not-actually-that-bad, it at least suggests the possibility of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>g</mi><mo stretchy="false">(</mo><mi>v</mi><mo stretchy="false">)</mo><mo>=</mo><mi>v</mi><mo>−</mo><mn>1</mn></mrow><annotation encoding="application/x-tex">g(v)=v-1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.03588em">g</span><span class="mopen">(</span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6667em;vertical-align:-0.0833em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">1</span></span></span></span> where the initial automatic self-vote doesn't count.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-ugly">(the ugly)<a href="https://sangaline.com/blog/reverse-engineering-the-hacker-news-ranking-algorithm#the-ugly" class="hash-link" aria-label="Direct link to (the ugly)" title="Direct link to (the ugly)" translate="no">​</a></h2>
<p>So you've seen the good and you've seen the bad.
Now it's time to see... code written in Arc.
I kid 'cause I love.</p>
<p>Anyway, there exists a sort of <a href="http://arclanguage.org/forum" target="_blank" rel="noopener noreferrer" class="">Bizarro Hacker News</a> out there where everyone <em>really loves Lisp</em>.
OK, I guess that applies to the real Hacker News too... but everybody there <em>specifically loves Arc</em>.
For those of you who aren't familiar: Arc is a dialect of Lisp developed by Paul Graham.
The Arc Language Forum is a place for people to post and discuss stuff relating to Arc and it uses (or at least used at some point) basically the same code as Hacker News.
More importantly, this code was included as part of the Arc distribution.</p>
<p>The first major Arc release was that of Arc 2 in <a href="http://arclanguage.org/item?id=3426" target="_blank" rel="noopener noreferrer" class="">February 2008</a> and it included the source for Hacker News in <a href="https://github.com/arclanguage/anarki/blob/arc2.master/news.arc" target="_blank" rel="noopener noreferrer" class="">news.arc</a>.
The file is dated September 2006, though, and it had presumably been in use for the previous year before being released.
The portion of the code relevant to the ranking algorithm is</p>
<div class="language-scheme codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-scheme codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token comment" style="color:#a89984">; Ranking</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token comment" style="color:#a89984">; Votes divided by the age in hours to the gravityth power.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token comment" style="color:#a89984">; Would be interesting to scale gravity in a slider.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token punctuation">(</span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> gravity* </span><span class="token number" style="color:#d3869b">1.4</span><span class="token plain"> timebase* </span><span class="token number" style="color:#d3869b">120</span><span class="token plain"> front-threshold* </span><span class="token number" style="color:#d3869b">1</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token 
punctuation">(</span><span class="token function" style="color:#d8a657">def</span><span class="token plain"> frontpage-rank </span><span class="token punctuation">(</span><span class="token function" style="color:#d8a657">s</span><span class="token plain"> </span><span class="token punctuation">(</span><span class="token function" style="color:#d8a657">o</span><span class="token plain"> gravity gravity*</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">  </span><span class="token punctuation">(</span><span class="token operator" style="color:#a89984">/</span><span class="token plain"> </span><span class="token punctuation">(</span><span class="token operator" style="color:#a89984">-</span><span class="token plain"> </span><span class="token punctuation">(</span><span class="token function" style="color:#d8a657">realscore</span><span class="token plain"> s</span><span class="token punctuation">)</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">1</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">     </span><span class="token punctuation">(</span><span class="token builtin" style="color:#d8a657">expt</span><span class="token plain"> </span><span class="token punctuation">(</span><span class="token operator" style="color:#a89984">/</span><span class="token plain"> </span><span class="token punctuation">(</span><span class="token operator" style="color:#a89984">+</span><span class="token plain"> </span><span class="token punctuation">(</span><span class="token function" style="color:#d8a657">item-age</span><span class="token plain"> s</span><span class="token punctuation">)</span><span class="token plain"> timebase*</span><span class="token punctuation">)</span><span 
class="token plain"> </span><span class="token number" style="color:#d3869b">60</span><span class="token punctuation">)</span><span class="token plain"> gravity</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token punctuation">(</span><span class="token function" style="color:#d8a657">def</span><span class="token plain"> realscore </span><span class="token punctuation">(</span><span class="token function" style="color:#d8a657">i</span><span class="token punctuation">)</span><span class="token plain"> </span><span class="token punctuation">(</span><span class="token operator" style="color:#a89984">-</span><span class="token plain"> i!score i!sockvotes</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token punctuation">(</span><span class="token function" style="color:#d8a657">def</span><span class="token plain"> item-age </span><span class="token punctuation">(</span><span class="token function" style="color:#d8a657">i</span><span class="token punctuation">)</span><span class="token plain"> </span><span class="token punctuation">(</span><span class="token function" style="color:#d8a657">minutes-since</span><span class="token plain"> i!time</span><span class="token punctuation">)</span><span class="token punctuation">)</span><br></span></code></pre></div></div>
<p>which, if you aren't familiar with Lisp, might be more comprehensible rewritten in Python as</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token keyword" style="color:#ea6962">def</span><span class="token plain"> </span><span class="token function" style="color:#d8a657">frontpage_rank</span><span class="token punctuation">(</span><span class="token plain">story</span><span class="token punctuation">,</span><span class="token plain"> gravity</span><span class="token operator" style="color:#a89984">=</span><span class="token number" style="color:#d3869b">1.4</span><span class="token punctuation">,</span><span class="token plain"> timebase</span><span class="token operator" style="color:#a89984">=</span><span class="token number" style="color:#d3869b">120</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    effective_score </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> story</span><span class="token punctuation">.</span><span class="token plain">score </span><span class="token operator" style="color:#a89984">-</span><span class="token plain"> story</span><span class="token punctuation">.</span><span class="token plain">sockvotes </span><span class="token operator" style="color:#a89984">-</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">1</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token keyword" style="color:#ea6962">return</span><span class="token plain"> effective_score 
</span><span class="token operator" style="color:#a89984">/</span><span class="token plain"> </span><span class="token punctuation">(</span><span class="token punctuation">(</span><span class="token plain">timebase </span><span class="token operator" style="color:#a89984">+</span><span class="token plain"> story</span><span class="token punctuation">.</span><span class="token plain">age</span><span class="token punctuation">)</span><span class="token plain"> </span><span class="token operator" style="color:#a89984">/</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">60</span><span class="token punctuation">)</span><span class="token operator" style="color:#a89984">**</span><span class="token plain">gravity</span><br></span></code></pre></div></div>
<p>This is also equivalent to a score function</p>
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="script">F</mi><mo stretchy="false">(</mo><mi>t</mi><mo separator="true">,</mo><mi>v</mi><mo stretchy="false">)</mo><mo>=</mo><mfrac><mrow><mi>v</mi><mo>−</mo><mn>1</mn></mrow><mrow><mo stretchy="false">(</mo><mfrac><msub><mi>τ</mi><mn>0</mn></msub><msub><mi>τ</mi><mn>1</mn></msub></mfrac><mo>+</mo><mi>t</mi><msup><mo stretchy="false">)</mo><mfrac><mn>1</mn><msub><mi>τ</mi><mn>1</mn></msub></mfrac></msup></mrow></mfrac></mrow><annotation encoding="application/x-tex">\mathcal F(t, v) = \frac{v - 1}{(\frac{\tau_0}{\tau_1} + t)^{\frac{1}{\tau_1}}}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathcal" style="margin-right:0.09931em">F</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:1.9081em;vertical-align:-1.0629em"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.8451em"><span style="top:-2.3329em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mopen mtight">(</span><span class="mord mtight"><span class="mopen nulldelimiter sizing reset-size3 size6"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.0052em"><span style="top:-2.656em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.1132em">τ</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3448em"><span style="top:-2.3448em;margin-left:-0.1132em;margin-right:0.1em"><span class="pstrut" style="height:2.6444em"></span><span class="mord mtight">1</span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.3552em"><span></span></span></span></span></span></span></span></span></span><span style="top:-3.2255em"><span class="pstrut" style="height:3em"></span><span class="frac-line mtight" style="border-bottom-width:0.049em"></span></span><span style="top:-3.5449em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.1132em">τ</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3448em"><span style="top:-2.3448em;margin-left:-0.1132em;margin-right:0.1em"><span class="pstrut" style="height:2.6444em"></span><span class="mord mtight">0</span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.3552em"><span></span></span></span></span></span></span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.5654em"><span></span></span></span></span></span><span class="mclose nulldelimiter sizing reset-size3 size6"></span></span><span class="mbin mtight">+</span><span class="mord mathnormal mtight">t</span><span class="mclose mtight"><span class="mclose mtight">)</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:1.2245em"><span style="top:-3.4878em;margin-right:0.0714em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight"><span class="mord mtight"><span class="mopen nulldelimiter sizing reset-size1 size6"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.0314em"><span style="top:-2.468em"><span class="pstrut" style="height:3em"></span><span class="mord mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.1132em">τ</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3448em"><span style="top:-2.3448em;margin-left:-0.1132em;margin-right:0.1em"><span class="pstrut" style="height:2.6444em"></span><span class="mord mtight">1</span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.3552em"><span></span></span></span></span></span></span></span></span><span style="top:-3.2255em"><span class="pstrut" style="height:3em"></span><span class="frac-line mtight" style="border-bottom-width:0.049em"></span></span><span style="top:-3.387em"><span class="pstrut" style="height:3em"></span><span class="mord mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.532em"><span></span></span></span></span></span><span class="mclose nulldelimiter sizing reset-size1 size6"></span></span></span></span></span></span></span></span></span></span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.394em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.03588em">v</span><span class="mbin mtight">−</span><span class="mord mtight">1</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:1.0629em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span></span></span></span></p>
<p>where <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>τ</mi><mn>1</mn></msub><mo>=</mo><mfrac><mn>1</mn><mn>1.4</mn></mfrac><mo>≈</mo><mn>0.714</mn></mrow><annotation encoding="application/x-tex">\tau_1 = \frac{1}{1.4} \approx 0.714</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1132em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:1.1901em;vertical-align:-0.345em"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.8451em"><span style="top:-2.655em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">1.4</span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.394em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">1</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.345em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">≈</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0.714</span></span></span></span> and <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>τ</mi><mn>0</mn></msub><mo>=</mo><mfrac><mn>120</mn><mn>60</mn></mfrac><msub><mi>τ</mi><mn>1</mn></msub><mo>≈</mo><mn>1.43</mn></mrow><annotation encoding="application/x-tex">\tau_0 = \frac{120}{60}\tau_1 \approx 1.43</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1132em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">0</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:1.1901em;vertical-align:-0.345em"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.8451em"><span style="top:-2.655em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">60</span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.394em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">120</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.345em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1132em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">≈</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">1.43</span></span></span></span>.
Note that the initial self-vote is subtracted off, so the <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>ν</mi><mn>0</mn></msub><mo>=</mo><mo>−</mo><mn>1</mn></mrow><annotation encoding="application/x-tex">\nu_0=-1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.06366em">ν</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.0637em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">0</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.7278em;vertical-align:-0.0833em"></span><span class="mord">−</span><span class="mord">1</span></span></span></span> values that we got from the later time periods were likely correct (the early data was inconclusive).
The vote total also subtracts off "sockvotes" which are presumably votes connected to spam accounts or vote rings.
Other than that, our initial assumptions hold quite well and it's unsurprising that we had such success with the early data.</p>
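<p>As a quick sanity check on that equivalence, here is a minimal sketch that evaluates both forms for one made-up story. The <code>Story</code> container, the sample values, and the assumption that <code>age</code> is measured in minutes (so that dividing by 60 gives hours) are illustrative choices and are not taken from news.arc.</p>

```python
import math
from dataclasses import dataclass


@dataclass
class Story:
    """Hypothetical stand-in for a Hacker News story."""
    score: int          # raw vote total, including the submitter's self-vote
    age: float          # assumed to be minutes since submission
    sockvotes: int = 0  # discounted votes; assumed to be zero here


def frontpage_rank(story, gravity=1.4, timebase=120):
    """The Python rewrite of the original ranking function."""
    effective_score = story.score - story.sockvotes - 1
    return effective_score / ((timebase + story.age) / 60) ** gravity


TAU_1 = 1 / 1.4             # about 0.714
TAU_0 = (120 / 60) * TAU_1  # about 1.43


def score_function(t, v, tau_0=TAU_0, tau_1=TAU_1):
    """Closed form F(t, v), with t in hours and v the vote total."""
    return (v - 1) / (tau_0 / tau_1 + t) ** (1 / tau_1)


story = Story(score=50, age=180)  # made-up example: 50 points after 3 hours
rank = frontpage_rank(story)
closed_form = score_function(story.age / 60, story.score)
print(rank, closed_form, math.isclose(rank, closed_form))
```

<p>The two values agree up to floating point rounding as long as <code>sockvotes</code> is zero, which is the one term the closed form leaves out.</p>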
<p>Now that we know what the actual score function was, we can compare it to what we extracted using our differential equation approach.
To make things more interesting, let's also include the parameters extracted using global optimization over the data for two different cost functions.
One is the Euclidean distance such that if the front page ordering is observed as <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo stretchy="false">[</mo><mn>1</mn><mo separator="true">,</mo><mn>2</mn><mo separator="true">,</mo><mn>3</mn><mo separator="true">,</mo><mn>4</mn><mo stretchy="false">]</mo></mrow><annotation encoding="application/x-tex">[1, 2, 3, 4]</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mopen">[</span><span class="mord">1</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord">2</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord">3</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord">4</span><span class="mclose">]</span></span></span></span> but predicted as <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo stretchy="false">[</mo><mn>1</mn><mo separator="true">,</mo><mn>4</mn><mo separator="true">,</mo><mn>3</mn><mo separator="true">,</mo><mn>2</mn><mo stretchy="false">]</mo></mrow><annotation encoding="application/x-tex">[1, 4, 3, 2]</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mopen">[</span><span class="mord">1</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord">4</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord">3</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord">2</span><span 
class="mclose">]</span></span></span></span> then the cost is <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msqrt><mrow><mo stretchy="false">(</mo><mn>1</mn><mo>−</mo><mn>1</mn><msup><mo stretchy="false">)</mo><mn>2</mn></msup><mo>+</mo><mo stretchy="false">(</mo><mn>2</mn><mo>−</mo><mn>4</mn><msup><mo stretchy="false">)</mo><mn>2</mn></msup><mo>+</mo><mo stretchy="false">(</mo><mn>3</mn><mo>−</mo><mn>3</mn><msup><mo stretchy="false">)</mo><mn>2</mn></msup><mo>+</mo><mo stretchy="false">(</mo><mn>4</mn><mo>−</mo><mn>2</mn><msup><mo stretchy="false">)</mo><mn>2</mn></msup></mrow></msqrt><mo>≈</mo><mn>2.83</mn></mrow><annotation encoding="application/x-tex">\sqrt{(1-1)^2 + (2-4)^2 + (3-3)^2 + (4-2)^2}\approx 2.83</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.24em;vertical-align:-0.305em"></span><span class="mord sqrt"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.935em"><span class="svg-align" style="top:-3.2em"><span class="pstrut" style="height:3.2em"></span><span class="mord" style="padding-left:1em"><span class="mopen">(</span><span class="mord">1</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mord">1</span><span class="mclose"><span class="mclose">)</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.7401em"><span style="top:-2.989em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mopen">(</span><span 
class="mord">2</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mord">4</span><span class="mclose"><span class="mclose">)</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.7401em"><span style="top:-2.989em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mopen">(</span><span class="mord">3</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mord">3</span><span class="mclose"><span class="mclose">)</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.7401em"><span style="top:-2.989em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mopen">(</span><span class="mord">4</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mord">2</span><span class="mclose"><span class="mclose">)</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.7401em"><span style="top:-2.989em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 
mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span></span></span><span style="top:-2.895em"><span class="pstrut" style="height:3.2em"></span><span class="hide-tail" style="min-width:1.02em;height:1.28em"><svg xmlns="http://www.w3.org/2000/svg" width="400em" height="1.28em" viewBox="0 0 400000 1296" preserveAspectRatio="xMinYMin slice"><path d="M263,681c0.7,0,18,39.7,52,119
c34,79.3,68.167,158.7,102.5,238c34.3,79.3,51.8,119.3,52.5,120
c340,-704.7,510.7,-1060.3,512,-1067
l0 -0
c4.7,-7.3,11,-11,19,-11
H40000v40H1012.3
s-271.3,567,-271.3,567c-38.7,80.7,-84,175,-136,283c-52,108,-89.167,185.3,-111.5,232
c-22.3,46.7,-33.8,70.3,-34.5,71c-4.7,4.7,-12.3,7,-23,7s-12,-1,-12,-1
s-109,-253,-109,-253c-72.7,-168,-109.3,-252,-110,-252c-10.7,8,-22,16.7,-34,26
c-22,17.3,-33.3,26,-34,26s-26,-26,-26,-26s76,-59,76,-59s76,-60,76,-60z
M1001 80h400000v40h-400000z"></path></svg></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.305em"><span></span></span></span></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">≈</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">2.83</span></span></span></span>.
The second is the <a href="https://en.wikipedia.org/wiki/Levenshtein_distance" target="_blank" rel="noopener noreferrer" class="">Levenshtein distance</a> applied to the same vectors.
I won't go into the details here but the Levenshtein distance will basically punish a story the same amount for being a little or a lot out of order (<em>e.g.</em> <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo stretchy="false">[</mo><mn>4</mn><mo separator="true">,</mo><mn>1</mn><mo separator="true">,</mo><mn>2</mn><mo separator="true">,</mo><mn>3</mn><mo stretchy="false">]</mo></mrow><annotation encoding="application/x-tex">[4, 1, 2, 3]</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mopen">[</span><span class="mord">4</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord">1</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord">2</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord">3</span><span class="mclose">]</span></span></span></span> and <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo stretchy="false">[</mo><mn>1</mn><mo separator="true">,</mo><mn>2</mn><mo separator="true">,</mo><mn>4</mn><mo separator="true">,</mo><mn>3</mn><mo stretchy="false">]</mo></mrow><annotation encoding="application/x-tex">[1, 2, 4, 3]</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mopen">[</span><span class="mord">1</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord">2</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord">4</span><span class="mpunct">,</span><span class="mspace" 
style="margin-right:0.1667em"></span><span class="mord">3</span><span class="mclose">]</span></span></span></span> are equivalently bad predictions).</p>
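<p>To make the difference between the two cost functions concrete, here is a small sketch of both. The function names are hypothetical, and the Levenshtein implementation is the textbook dynamic-programming version, which is not necessarily the code that was used for the fits.</p>

```python
import math


def euclidean_cost(observed, predicted):
    """Euclidean distance between two front page orderings."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)))


def levenshtein_cost(observed, predicted):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn one ordering into the other."""
    previous = list(range(len(predicted) + 1))
    for i, o in enumerate(observed, start=1):
        current = [i]
        for j, p in enumerate(predicted, start=1):
            current.append(min(
                previous[j] + 1,             # delete o
                current[j - 1] + 1,          # insert p
                previous[j - 1] + (o != p),  # substitute o with p, or match
            ))
        previous = current
    return previous[-1]


observed = [1, 2, 3, 4]
for predicted in ([1, 4, 3, 2], [4, 1, 2, 3], [1, 2, 4, 3]):
    print(predicted,
          round(euclidean_cost(observed, predicted), 2),
          levenshtein_cost(observed, predicted))
```

<p>Both <code>[4, 1, 2, 3]</code> and <code>[1, 2, 4, 3]</code> have a Levenshtein cost of 2, while their Euclidean costs differ (about 3.46 versus about 1.41), which is the distinction described above.</p>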
<div style="overflow-x:auto"><table><thead><tr><th></th><th>Actual</th><th>Diff-EQ</th><th>Euclidean</th><th>Levenshtein</th></tr></thead><tbody><tr><td><strong>2007/01/01 - 2008/12/31</strong></td><td></td><td></td><td></td><td></td></tr><tr><td><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>τ</mi><mn>0</mn></msub></mrow><annotation encoding="application/x-tex">\tau_0</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1132em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">0</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span></td><td><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>1.43</mn></mrow><annotation encoding="application/x-tex">1.43</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">1.43</span></span></span></span></td><td><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>1.95</mn><mo>±</mo><mn>0.16</mn></mrow><annotation encoding="application/x-tex">1.95 \pm 0.16</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" 
style="height:0.7278em;vertical-align:-0.0833em"></span><span class="mord">1.95</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">±</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0.16</span></span></span></span></td><td><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>3.15</mn></mrow><annotation encoding="application/x-tex">3.15</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">3.15</span></span></span></span></td><td><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>3.34</mn></mrow><annotation encoding="application/x-tex">3.34</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">3.34</span></span></span></span></td></tr><tr><td><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>τ</mi><mn>1</mn></msub></mrow><annotation encoding="application/x-tex">\tau_1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1132em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span 
class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span></td><td><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>0.714</mn></mrow><annotation encoding="application/x-tex">0.714</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0.714</span></span></span></span></td><td><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>0.69</mn><mo>±</mo><mn>0.03</mn></mrow><annotation encoding="application/x-tex">0.69 \pm 0.03</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.7278em;vertical-align:-0.0833em"></span><span class="mord">0.69</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">±</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0.03</span></span></span></span></td><td><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>0.669</mn></mrow><annotation encoding="application/x-tex">0.669</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0.669</span></span></span></span></td><td><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>0.644</mn></mrow><annotation encoding="application/x-tex">0.644</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em"></span><span 
class="mord">0.644</span></span></span></span></td></tr></tbody></table></div>
<p>We can see that the differential equation approach, or Diff-EQ for short, outperforms both global optimizations in this case.
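All three methods overestimate <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>τ</mi><mn>0</mn></msub></mrow><annotation encoding="application/x-tex">\tau_0</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1132em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">0</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span>, but Diff-EQ does so by the smallest margin and lands barely outside of what would be reasonably consistent given the statistical error.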
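The methods all slightly underestimate <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>τ</mi><mn>1</mn></msub></mrow><annotation encoding="application/x-tex">\tau_1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1132em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span> as well, but Diff-EQ gives the closest estimate and is totally consistent with the actual value given the statistical error.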
It's also worth emphasizing again that the Diff-EQ approach told us what functional form to use while the other methods tell us very little about whether different functional forms would perform better.</p>
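<p>One way to put numbers on that comparison is to express each estimate relative to the actual value. The sketch below uses only the figures from the table. The helper names are hypothetical, and the "pull" (deviation divided by the quoted error) is computed for Diff-EQ alone because the table lists no uncertainties for the two global fits.</p>

```python
ACTUAL = {"tau_0": 1.43, "tau_1": 0.714}

# (estimate, statistical error) pairs from the Diff-EQ column of the table.
DIFF_EQ = {"tau_0": (1.95, 0.16), "tau_1": (0.69, 0.03)}

# Point estimates from the global optimizations and from Diff-EQ.
POINT_ESTIMATES = {
    "Diff-EQ": {"tau_0": 1.95, "tau_1": 0.69},
    "Euclidean": {"tau_0": 3.15, "tau_1": 0.669},
    "Levenshtein": {"tau_0": 3.34, "tau_1": 0.644},
}


def pull(estimate, error, actual):
    """Deviation from the actual value in units of the quoted error."""
    return (estimate - actual) / error


def relative_error(estimate, actual):
    """Signed fractional deviation from the actual value."""
    return (estimate - actual) / actual


pulls = {name: pull(est, err, ACTUAL[name]) for name, (est, err) in DIFF_EQ.items()}
relative = {
    method: {name: relative_error(value, ACTUAL[name]) for name, value in params.items()}
    for method, params in POINT_ESTIMATES.items()
}

for name, value in pulls.items():
    print(f"Diff-EQ pull on {name}: {value:+.2f}")
for method, params in relative.items():
    print(method, {name: f"{value:+.1%}" for name, value in params.items()})
```

<p>With these inputs the Diff-EQ pulls come out to roughly +3.25 for τ<sub>0</sub> and −0.8 for τ<sub>1</sub>, and its relative errors are the smallest of the three methods for both parameters.</p>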
<p>So the early days were all sunshine and rainbows but you'll remember that things started to go downhill a bit in early 2009.
It turns out that Arc 3 was released in <a href="http://arclanguage.org/item?id=9383" target="_blank" rel="noopener noreferrer" class="">May 2009</a>, lining up perfectly with when we predicted that the algorithm had changed.
The updated version of <a href="https://github.com/arclanguage/anarki/blob/master/lib/news.arc" target="_blank" rel="noopener noreferrer" class="">news.arc</a> does indeed update the algorithm as follows.</p>
<div class="language-scheme codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-scheme codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token punctuation">(</span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> gravity* </span><span class="token number" style="color:#d3869b">1.8</span><span class="token plain"> timebase* </span><span class="token number" style="color:#d3869b">120</span><span class="token plain"> front-threshold* </span><span class="token number" style="color:#d3869b">1</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">   nourl-factor* </span><span class="token number" style="color:#d3869b">.4</span><span class="token plain"> lightweight-factor* </span><span class="token number" style="color:#d3869b">.3</span><span class="token plain"> </span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token punctuation">(</span><span class="token function" style="color:#d8a657">def</span><span class="token plain"> frontpage-rank </span><span class="token punctuation">(</span><span class="token function" style="color:#d8a657">s</span><span class="token plain"> </span><span class="token punctuation">(</span><span class="token function" style="color:#d8a657">o</span><span class="token plain"> scorefn realscore</span><span class="token punctuation">)</span><span class="token plain"> </span><span class="token punctuation">(</span><span class="token function" 
style="color:#d8a657">o</span><span class="token plain"> gravity gravity*</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">  </span><span class="token punctuation">(</span><span class="token operator" style="color:#a89984">*</span><span class="token plain"> </span><span class="token punctuation">(</span><span class="token operator" style="color:#a89984">/</span><span class="token plain"> </span><span class="token punctuation">(</span><span class="token keyword" style="color:#ea6962">let</span><span class="token plain"> base </span><span class="token punctuation">(</span><span class="token operator" style="color:#a89984">-</span><span class="token plain"> </span><span class="token punctuation">(</span><span class="token function" style="color:#d8a657">scorefn</span><span class="token plain"> s</span><span class="token punctuation">)</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">1</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">          </span><span class="token punctuation">(</span><span class="token keyword" style="color:#ea6962">if</span><span class="token plain"> </span><span class="token punctuation">(</span><span class="token operator" style="color:#a89984">&gt;</span><span class="token plain"> base </span><span class="token number" style="color:#d3869b">0</span><span class="token punctuation">)</span><span class="token plain"> </span><span class="token punctuation">(</span><span class="token builtin" style="color:#d8a657">expt</span><span class="token plain"> base </span><span class="token number" style="color:#d3869b">.8</span><span class="token punctuation">)</span><span class="token plain"> base</span><span class="token punctuation">)</span><span 
class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        </span><span class="token punctuation">(</span><span class="token builtin" style="color:#d8a657">expt</span><span class="token plain"> </span><span class="token punctuation">(</span><span class="token operator" style="color:#a89984">/</span><span class="token plain"> </span><span class="token punctuation">(</span><span class="token operator" style="color:#a89984">+</span><span class="token plain"> </span><span class="token punctuation">(</span><span class="token function" style="color:#d8a657">item-age</span><span class="token plain"> s</span><span class="token punctuation">)</span><span class="token plain"> timebase*</span><span class="token punctuation">)</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">60</span><span class="token punctuation">)</span><span class="token plain"> gravity</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">     </span><span class="token punctuation">(</span><span class="token keyword" style="color:#ea6962">if</span><span class="token plain"> </span><span class="token punctuation">(</span><span class="token function" style="color:#d8a657">no</span><span class="token plain"> </span><span class="token punctuation">(</span><span class="token function" style="color:#d8a657">in</span><span class="token plain"> s!type </span><span class="token symbol" style="color:#d3869b">'story</span><span class="token plain"> </span><span class="token symbol" style="color:#d3869b">'poll</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token plain">  </span><span class="token number" style="color:#d3869b">.5</span><span class="token 
plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">         </span><span class="token punctuation">(</span><span class="token function" style="color:#d8a657">blank</span><span class="token plain"> s!url</span><span class="token punctuation">)</span><span class="token plain">                  nourl-factor*</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">         </span><span class="token punctuation">(</span><span class="token function" style="color:#d8a657">lightweight</span><span class="token plain"> s</span><span class="token punctuation">)</span><span class="token plain">                </span><span class="token punctuation">(</span><span class="token builtin" style="color:#d8a657">min</span><span class="token plain"> lightweight-factor*</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">                                             </span><span class="token punctuation">(</span><span class="token function" style="color:#d8a657">contro-factor</span><span class="token plain"> s</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">                                        </span><span class="token punctuation">(</span><span class="token function" style="color:#d8a657">contro-factor</span><span class="token plain"> s</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token punctuation">(</span><span class="token function" 
style="color:#d8a657">def</span><span class="token plain"> contro-factor </span><span class="token punctuation">(</span><span class="token function" style="color:#d8a657">s</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">  </span><span class="token punctuation">(</span><span class="token function" style="color:#d8a657">aif</span><span class="token plain"> </span><span class="token punctuation">(</span><span class="token function" style="color:#d8a657">check</span><span class="token plain"> </span><span class="token punctuation">(</span><span class="token function" style="color:#d8a657">visible-family</span><span class="token plain"> nil s</span><span class="token punctuation">)</span><span class="token plain"> </span><span class="token punctuation">[</span><span class="token operator" style="color:#a89984">&gt;</span><span class="token plain"> _ </span><span class="token number" style="color:#d3869b">20</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token punctuation">(</span><span class="token builtin" style="color:#d8a657">min</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">1</span><span class="token plain"> </span><span class="token punctuation">(</span><span class="token builtin" style="color:#d8a657">expt</span><span class="token plain"> </span><span class="token punctuation">(</span><span class="token operator" style="color:#a89984">/</span><span class="token plain"> </span><span class="token punctuation">(</span><span class="token function" style="color:#d8a657">realscore</span><span class="token plain"> s</span><span class="token punctuation">)</span><span class="token plain"> it</span><span class="token 
punctuation">)</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">2</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    </span><span class="token number" style="color:#d3869b">1</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token punctuation">(</span><span class="token function" style="color:#d8a657">def</span><span class="token plain"> realscore </span><span class="token punctuation">(</span><span class="token function" style="color:#d8a657">i</span><span class="token punctuation">)</span><span class="token plain"> </span><span class="token punctuation">(</span><span class="token operator" style="color:#a89984">-</span><span class="token plain"> i!score i!sockvotes</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token punctuation">(</span><span class="token function" style="color:#d8a657">disktable</span><span class="token plain"> lightweights* </span><span class="token punctuation">(</span><span class="token operator" style="color:#a89984">+</span><span class="token plain"> newsdir* </span><span class="token string" style="color:#89b482">"lightweights"</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span 
class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token punctuation">(</span><span class="token function" style="color:#d8a657">def</span><span class="token plain"> lightweight </span><span class="token punctuation">(</span><span class="token function" style="color:#d8a657">s</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">  </span><span class="token punctuation">(</span><span class="token builtin" style="color:#d8a657">or</span><span class="token plain"> s!dead</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">      </span><span class="token punctuation">(</span><span class="token function" style="color:#d8a657">mem</span><span class="token plain"> </span><span class="token symbol" style="color:#d3869b">'rally</span><span class="token plain"> s!keys</span><span class="token punctuation">)</span><span class="token plain">  </span><span class="token comment" style="color:#a89984">; title is a rallying cry</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">      </span><span class="token punctuation">(</span><span class="token function" style="color:#d8a657">mem</span><span class="token plain"> </span><span class="token symbol" style="color:#d3869b">'image</span><span class="token plain"> s!keys</span><span class="token punctuation">)</span><span class="token plain">  </span><span class="token comment" style="color:#a89984">; post is mainly image(s)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">      </span><span class="token punctuation">(</span><span class="token function" 
style="color:#d8a657">lightweights*</span><span class="token plain"> </span><span class="token punctuation">(</span><span class="token function" style="color:#d8a657">sitename</span><span class="token plain"> s!url</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">      </span><span class="token punctuation">(</span><span class="token function" style="color:#d8a657">lightweight-url</span><span class="token plain"> s!url</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token punctuation">(</span><span class="token function" style="color:#d8a657">defmemo</span><span class="token plain"> lightweight-url </span><span class="token punctuation">(</span><span class="token function" style="color:#d8a657">url</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">  </span><span class="token punctuation">(</span><span class="token function" style="color:#d8a657">in</span><span class="token plain"> </span><span class="token punctuation">(</span><span class="token function" style="color:#d8a657">downcase</span><span class="token plain"> </span><span class="token punctuation">(</span><span class="token function" style="color:#d8a657">last</span><span class="token plain"> </span><span class="token punctuation">(</span><span class="token function" style="color:#d8a657">tokens</span><span class="token plain"> url </span><span class="token char" style="color:#a9b665">#\.</span><span class="token 
punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token plain"> </span><span class="token string" style="color:#89b482">"png"</span><span class="token plain"> </span><span class="token string" style="color:#89b482">"jpg"</span><span class="token plain"> </span><span class="token string" style="color:#89b482">"jpeg"</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token punctuation">(</span><span class="token function" style="color:#d8a657">def</span><span class="token plain"> item-age </span><span class="token punctuation">(</span><span class="token function" style="color:#d8a657">i</span><span class="token punctuation">)</span><span class="token plain"> </span><span class="token punctuation">(</span><span class="token function" style="color:#d8a657">minutes-since</span><span class="token plain"> i!time</span><span class="token punctuation">)</span><span class="token punctuation">)</span><br></span></code></pre></div></div>
<p>Well there's your problem right there!
These various weighting factors are being applied to different categories of stories.
Any "Ask HN" post is going to incur a 60% score penalty for not including a URL, a direct link to an image will get a penalty of at least 70%, and for controversial stories the factor actually depends on how the number of votes compares to the size of the comment thread!
I had already filtered out items that either weren't stories or didn't have URLs so the <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>0.5</mn></mrow><annotation encoding="application/x-tex">0.5</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0.5</span></span></span></span> and the <code>nourl-factor*</code> won't affect the data that we looked at, but the <code>lightweight-factor*</code> and <code>contro-factor</code> definitely violate our assumptions pretty significantly.</p>
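<p>To make the branching easier to follow, here is a rough Python transcription of the Arc ranking code above. It is only a sketch under stated assumptions: the <code>Story</code> fields are hypothetical stand-ins (<code>family_size</code> for the result of <code>visible-family</code>, <code>lightweight</code> for the result of the <code>lightweight</code> predicate, <code>age_minutes</code> for <code>item-age</code>), and nothing here is taken from any real Hacker News API.</p>

```python
from dataclasses import dataclass

# Constants copied from the Arc snippet.
GRAVITY = 1.8
TIMEBASE = 120  # minutes
NOURL_FACTOR = 0.4
LIGHTWEIGHT_FACTOR = 0.3


@dataclass
class Story:
    """Hypothetical container; field names are assumptions for illustration."""
    score: int
    age_minutes: float
    family_size: int = 0       # stands in for (visible-family nil s)
    sockvotes: int = 0
    kind: str = "story"        # stands in for s!type
    url: str = "https://example.com"
    lightweight: bool = False  # stands in for (lightweight s)


def realscore(s: Story) -> int:
    return s.score - s.sockvotes


def contro_factor(s: Story) -> float:
    # Only threads with more than 20 visible items are penalized.
    if s.family_size > 20:
        return min(1.0, (realscore(s) / s.family_size) ** 2)
    return 1.0


def frontpage_rank(s: Story) -> float:
    base = realscore(s) - 1
    numerator = base ** 0.8 if base > 0 else base
    denominator = ((s.age_minutes + TIMEBASE) / 60) ** GRAVITY
    if s.kind not in ("story", "poll"):
        factor = 0.5
    elif not s.url:
        factor = NOURL_FACTOR
    elif s.lightweight:
        factor = min(LIGHTWEIGHT_FACTOR, contro_factor(s))
    else:
        factor = contro_factor(s)
    return numerator / denominator * factor


plain = Story(score=101, age_minutes=60)
ask_hn = Story(score=101, age_minutes=60, url="")
image = Story(score=101, age_minutes=60, lightweight=True)
heated = Story(score=11, age_minutes=60, family_size=40)

print(frontpage_rank(ask_hn) / frontpage_rank(plain))   # ratio set by the no-URL factor
print(frontpage_rank(image) / frontpage_rank(plain))    # ratio set by the lightweight factor
print(contro_factor(heated))                            # (11 / 40) ** 2
```

<p>Two stories with identical votes and age can therefore end up with quite different ranks, which is exactly the kind of effect that a single two-parameter score function can't capture.</p>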
<p>Ignoring the different factors for a second, the score function parameters have also been updated so that <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>τ</mi><mn>1</mn></msub><mo>=</mo><mfrac><mn>0.8</mn><mn>1.8</mn></mfrac><mo>≈</mo><mn>0.444</mn></mrow><annotation encoding="application/x-tex">\tau_1=\frac{0.8}{1.8} \approx 0.444</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1132em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:1.1901em;vertical-align:-0.345em"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.8451em"><span style="top:-2.655em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">1.8</span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.394em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">0.8</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.345em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">≈</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0.444</span></span></span></span> and <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>τ</mi><mn>0</mn></msub><mo>=</mo><mfrac><mn>120</mn><mn>60</mn></mfrac><msub><mi>τ</mi><mn>1</mn></msub><mo>=</mo><mn>0.888</mn></mrow><annotation encoding="application/x-tex">\tau_0=\frac{120}{60}\tau_1 = 0.888</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1132em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">0</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:1.1901em;vertical-align:-0.345em"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.8451em"><span style="top:-2.655em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">60</span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.394em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">120</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.345em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1132em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0.888</span></span></span></span>.
Note that I'm raising the listed score function here to a power of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mfrac><mn>1</mn><mn>0.8</mn></mfrac></mrow><annotation encoding="application/x-tex">\frac{1}{0.8}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.1901em;vertical-align:-0.345em"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.8451em"><span style="top:-2.655em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">0.8</span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.394em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">1</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.345em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span></span></span></span> to counteract the fact that an exponent of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>0.8</mn></mrow><annotation encoding="application/x-tex">0.8</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0.8</span></span></span></span> was added to the <span class="katex"><span class="katex-mathml"><math 
xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>v</mi><mo>−</mo><mn>1</mn></mrow><annotation encoding="application/x-tex">v-1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6667em;vertical-align:-0.0833em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">1</span></span></span></span> term.
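This preserves the sorting order while making <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>ν</mi><mn>1</mn></msub></mrow><annotation encoding="application/x-tex">\nu_1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.06366em">ν</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.0637em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span> match the value we used in our ansatz.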
The score functions are equivalent within our framework so this is just to facilitate comparison.</p>
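<p>As a quick sanity check on the arithmetic above, the sketch below recomputes the two parameters from the constants in the Arc code and confirms that raising the listed score to the power of 1/0.8 leaves the ordering unchanged. The sample stories are made up for illustration, the category factors are ignored, and only stories with more than one vote are considered (the Arc code treats a non-positive base differently).</p>

```python
GRAVITY = 1.8
TIMEBASE_MINUTES = 120
VOTE_EXPONENT = 0.8

tau_1 = VOTE_EXPONENT / GRAVITY
tau_0 = (TIMEBASE_MINUTES / 60) * tau_1


def listed_score(votes, age_minutes):
    """Score as written in the Arc code, without the category factors."""
    age_term = ((age_minutes + TIMEBASE_MINUTES) / 60) ** GRAVITY
    return (votes - 1) ** VOTE_EXPONENT / age_term


def rescaled_score(votes, age_minutes):
    """Listed score raised to 1/0.8 so the vote term becomes linear."""
    return listed_score(votes, age_minutes) ** (1 / VOTE_EXPONENT)


# Hypothetical (votes, age in minutes) pairs, purely for illustration.
stories = [(5, 30), (120, 600), (40, 90), (300, 1440), (12, 10)]

by_listed = sorted(stories, key=lambda s: listed_score(*s), reverse=True)
by_rescaled = sorted(stories, key=lambda s: rescaled_score(*s), reverse=True)
same_order = by_listed == by_rescaled

print(f"tau_1 = {tau_1:.4f}, tau_0 = {tau_0:.4f}, same order: {same_order}")
```

<p>The ordering survives because raising a positive number to a positive power is a monotonically increasing transformation.</p>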
<p>Now let's compare the various parameter extractions to these real values.</p>
<div style="overflow-x:auto"><table><thead><tr><th></th><th>Actual</th><th>Diff-EQ</th><th>Euclidean</th><th>Levenshtein</th></tr></thead><tbody><tr><td><strong>2009/04/01 - 2014/06/30</strong></td><td></td><td></td><td></td><td></td></tr><tr><td><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>τ</mi><mn>0</mn></msub></mrow><annotation encoding="application/x-tex">\tau_0</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1132em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">0</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span></td><td><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>0.888</mn></mrow><annotation encoding="application/x-tex">0.888</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0.888</span></span></span></span></td><td><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>0.61</mn><mo>±</mo><mn>0.02</mn></mrow><annotation encoding="application/x-tex">0.61 \pm 0.02</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" 
style="height:0.7278em;vertical-align:-0.0833em"></span><span class="mord">0.61</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">±</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0.02</span></span></span></span></td><td><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>2.22</mn></mrow><annotation encoding="application/x-tex">2.22</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">2.22</span></span></span></span></td><td><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>2.32</mn></mrow><annotation encoding="application/x-tex">2.32</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">2.32</span></span></span></span></td></tr><tr><td><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>τ</mi><mn>1</mn></msub></mrow><annotation encoding="application/x-tex">\tau_1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1132em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span 
class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span></td><td><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>0.444</mn></mrow><annotation encoding="application/x-tex">0.444</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0.444</span></span></span></span></td><td><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>0.52</mn><mo>±</mo><mn>0.01</mn></mrow><annotation encoding="application/x-tex">0.52 \pm 0.01</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.7278em;vertical-align:-0.0833em"></span><span class="mord">0.52</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">±</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0.01</span></span></span></span></td><td><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>0.443</mn></mrow><annotation encoding="application/x-tex">0.443</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0.443</span></span></span></span></td><td><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>0.435</mn></mrow><annotation encoding="application/x-tex">0.435</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em"></span><span 
class="mord">0.435</span></span></span></span></td></tr></tbody></table></div>
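<p>Here, the Diff-EQ approach underestimates <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>τ</mi><mn>0</mn></msub></mrow><annotation encoding="application/x-tex">\tau_0</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1132em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">0</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span> and overestimates <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>τ</mi><mn>1</mn></msub></mrow><annotation encoding="application/x-tex">\tau_1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1132em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span> slightly while the global optimizations are fairly consistent with each other and significantly overestimate <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>τ</mi><mn>0</mn></msub></mrow><annotation encoding="application/x-tex">\tau_0</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1132em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">0</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span> while getting <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>τ</mi><mn>1</mn></msub></mrow><annotation encoding="application/x-tex">\tau_1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1132em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span> pretty much spot on.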
Roughly speaking, the Diff-EQ parameterization is more accurate for small <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>t</mi></mrow><annotation encoding="application/x-tex">t</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6151em"></span><span class="mord mathnormal">t</span></span></span></span> while the other parameterizations become more accurate as <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>t</mi></mrow><annotation encoding="application/x-tex">t</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6151em"></span><span class="mord mathnormal">t</span></span></span></span> grows large.</p>
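<p>To put rough numbers on that comparison, the sketch below computes the relative deviation of each extraction from the actual values, using only the central values from the first table (the quoted uncertainties on the Diff-EQ fit are ignored here).</p>

```python
# Central values copied from the 2009-2014 comparison table.
ACTUAL = {"tau_0": 0.888, "tau_1": 0.444}
ESTIMATES = {
    "Diff-EQ": {"tau_0": 0.61, "tau_1": 0.52},
    "Euclidean": {"tau_0": 2.22, "tau_1": 0.443},
    "Levenshtein": {"tau_0": 2.32, "tau_1": 0.435},
}


def relative_error(estimate, actual):
    """Signed deviation as a fraction of the actual value."""
    return (estimate - actual) / actual


errors = {
    method: {name: relative_error(value, ACTUAL[name]) for name, value in params.items()}
    for method, params in ESTIMATES.items()
}

for method, errs in errors.items():
    print(f"{method:12s} tau_0: {errs['tau_0']:+.1%}  tau_1: {errs['tau_1']:+.1%}")
```

<p>A negative value means the method underestimates the parameter and a positive value means it overestimates it.</p>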
<p>Finally, let's take a look at the data since mid-2014.
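The code relating to these changes wasn't released but the preferred values of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>τ</mi><mo stretchy="false">(</mo><mi>t</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\tau(t)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mclose">)</span></span></span></span> look pretty similar so my best guess would be that <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>τ</mi><mn>0</mn></msub></mrow><annotation encoding="application/x-tex">\tau_0</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1132em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">0</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span> and <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>τ</mi><mn>1</mn></msub></mrow><annotation encoding="application/x-tex">\tau_1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1132em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span> didn't change (just the various other parts of the algorithm).</p>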
<div style="overflow-x:auto"><table><thead><tr><th></th><th>Actual???</th><th>Diff-EQ</th><th>Euclidean</th><th>Levenshtein</th></tr></thead><tbody><tr><td><strong>2014/07/01 - 2017/03/09</strong></td><td></td><td></td><td></td><td></td></tr><tr><td><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>τ</mi><mn>0</mn></msub></mrow><annotation encoding="application/x-tex">\tau_0</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1132em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">0</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span></td><td><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>0.888</mn></mrow><annotation encoding="application/x-tex">0.888</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0.888</span></span></span></span></td><td><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>0.40</mn><mo>±</mo><mn>0.01</mn></mrow><annotation encoding="application/x-tex">0.40 \pm 0.01</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" 
style="height:0.7278em;vertical-align:-0.0833em"></span><span class="mord">0.40</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">±</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0.01</span></span></span></span></td><td><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>4.630</mn></mrow><annotation encoding="application/x-tex">4.630</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">4.630</span></span></span></span></td><td><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>2.172</mn></mrow><annotation encoding="application/x-tex">2.172</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">2.172</span></span></span></span></td></tr><tr><td><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>τ</mi><mn>1</mn></msub></mrow><annotation encoding="application/x-tex">\tau_1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1132em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span 
class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span></td><td><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>0.444</mn></mrow><annotation encoding="application/x-tex">0.444</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0.444</span></span></span></span></td><td><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>0.456</mn><mo>±</mo><mn>0.004</mn></mrow><annotation encoding="application/x-tex">0.456 \pm 0.004</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.7278em;vertical-align:-0.0833em"></span><span class="mord">0.456</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">±</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0.004</span></span></span></span></td><td><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>0.230</mn></mrow><annotation encoding="application/x-tex">0.230</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0.230</span></span></span></span></td><td><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>0.417</mn></mrow><annotation encoding="application/x-tex">0.417</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" 
style="height:0.6444em"></span><span class="mord">0.417</span></span></span></span></td></tr></tbody></table></div>
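<p>As a quick sanity check on the table above, the sketch below computes the ratio of each fitted parameter to the value in the "Actual???" column. The numbers are copied straight from the table; the dictionary layout and variable names are just illustrative, and the "actual" values are themselves only a best guess since the newer code wasn't released.</p>

```python
# Values copied from the table above; the "Actual" column is itself a best
# guess, since the post-2014 ranking code was never released.
actual = {"tau_0": 0.888, "tau_1": 0.444}
estimates = {
    "Diff-EQ": {"tau_0": 0.40, "tau_1": 0.456},
    "Euclidean": {"tau_0": 4.630, "tau_1": 0.230},
    "Levenshtein": {"tau_0": 2.172, "tau_1": 0.417},
}

# Ratio of each estimate to the assumed actual value (1.0 would be perfect).
ratios = {
    method: {name: values[name] / actual[name] for name in actual}
    for method, values in estimates.items()
}

for method, r in ratios.items():
    print(f"{method:12s} tau_0 ratio = {r['tau_0']:.2f}, tau_1 ratio = {r['tau_1']:.2f}")
```

<p>A ratio below one is an underestimate and a ratio above one is an overestimate, which makes it easier to compare the three methods on a common scale.</p>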
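<p>The Levenshtein and Euclidean parameters diverge significantly here for the first time with Levenshtein giving much more accurate values for both <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>τ</mi><mn>0</mn></msub></mrow><annotation encoding="application/x-tex">\tau_0</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1132em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">0</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span> and <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>τ</mi><mn>1</mn></msub></mrow><annotation encoding="application/x-tex">\tau_1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1132em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span>.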
This is as we might expect if some stories are appearing much higher or much lower than their parameterized score would suggest.
Those heavily adjusted stories will have a huge influence on the Euclidean distance while essentially saturating with the Levenshtein distance.
The fact that these cost functions result in such different parameter values is itself evidence of unaccounted-for components in the scoring model.</p>
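<p>The saturation effect described above can be illustrated with a toy ranking. This is only a sketch: it assumes the Euclidean cost is a sum of squared differences in rank position and that the Levenshtein cost is the classic edit distance between the two ordered lists of story IDs, which may not match the exact cost functions used for the fits. The helper names (<code>squared_rank_distance</code>, <code>displace</code>) are made up for this example.</p>

```python
def levenshtein(a, b):
    """Classic edit distance (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute
        prev = curr
    return prev[-1]


def squared_rank_distance(expected, observed):
    """Sum of squared differences between each story's two rank positions."""
    position = {story: rank for rank, story in enumerate(observed)}
    return sum((rank - position[story]) ** 2
               for rank, story in enumerate(expected))


def displace(order, k):
    """Hypothetical adjustment: push the top story down by k places."""
    return order[1:k + 1] + order[:1] + order[k + 1:]


expected = list(range(30))  # story IDs in the order the model predicts
for k in (1, 5, 25):
    observed = displace(expected, k)
    print(k,
          squared_rank_distance(expected, observed),
          levenshtein(expected, observed))
```

<p>Under these assumptions the squared rank distance grows roughly quadratically with how far the single story was moved, while the edit distance stays fixed at one deletion plus one insertion no matter how large the displacement is.</p>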
<p>The Diff-EQ approach significantly outperforms both other methods here, underestimating <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>τ</mi><mn>0</mn></msub></mrow><annotation encoding="application/x-tex">\tau_0</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1132em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">0</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span> by roughly a factor of two and finding <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>τ</mi><mn>1</mn></msub></mrow><annotation encoding="application/x-tex">\tau_1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1132em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" 
style="height:0.15em"><span></span></span></span></span></span></span></span></span></span> almost exactly.
We can compare these visually by plotting the various <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>f</mi><mo stretchy="false">(</mo><mi>t</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">f(t)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.10764em">f</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mclose">)</span></span></span></span> functions.</p>
<p><img decoding="async" loading="lazy" alt="A comparison of the various f(t) functions." src="https://sangaline.com/assets/images/f-parameterizations-2014-2017-6de35f1215ebf1c52f330bed8b6d10c4.png" width="600" height="400" class="img_ev3q"></p>
<p>There's clearly some significant error when <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>t</mi></mrow><annotation encoding="application/x-tex">t</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6151em"></span><span class="mord mathnormal">t</span></span></span></span> is small but if you look back at the fit for this newer data then you'll see that there was some upward curvature in <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>τ</mi><mo stretchy="false">(</mo><mi>t</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\tau(t)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mclose">)</span></span></span></span> for low <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>t</mi></mrow><annotation encoding="application/x-tex">t</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6151em"></span><span class="mord mathnormal">t</span></span></span></span> and that the preferred y-intercept is actually much closer to <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>0.888</mn></mrow><annotation encoding="application/x-tex">0.888</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0.888</span></span></span></span> than <span 
class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>0.41</mn></mrow><annotation encoding="application/x-tex">0.41</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0.41</span></span></span></span>.
If we had allowed for this curvature then the agreement here would be much better, but doing so seemed pointless given the poor quality of the <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>ν</mi><mo stretchy="false">(</mo><mi>v</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\nu(v)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.06366em">ν</span><span class="mopen">(</span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mclose">)</span></span></span></span> constraint.
In either case, the parameterization is surprisingly accurate given that the Arc 2 code already significantly broke our assumptions and the more recent data obviously corresponds to much more extreme violations.
I suspected that fuzzing-like effects would cancel to leading order when the bounds overlapped but there's actually something more interesting going on here that explains why the agreement is so good despite the assumptions breaking down.</p>
<p>The value of a bound on <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>τ</mi><mo stretchy="false">(</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\tau(t\_1)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.06em;vertical-align:-0.31em"></span><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mord">_1</span><span class="mclose">)</span></span></span></span> when Story 1 appears first is <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>v</mi><mi mathvariant="normal">_</mi><mn>2</mn><mfrac><mrow><mi>t</mi><mi mathvariant="normal">_</mi><mn>2</mn><mo>−</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>1</mn></mrow><mrow><mi>v</mi><mi mathvariant="normal">_</mi><mn>2</mn><mo>−</mo><mi>v</mi><mi mathvariant="normal">_</mi><mn>1</mn></mrow></mfrac></mrow><annotation encoding="application/x-tex">v\_2 \frac{t\_2-t\_1}{v\_2-v\_1}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.5401em;vertical-align:-0.562em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mord">_2</span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.9781em"><span style="top:-2.655em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.03588em">v</span><span class="mord mtight">_2</span><span class="mbin mtight">−</span><span class="mord 
mathnormal mtight" style="margin-right:0.03588em">v</span><span class="mord mtight">_1</span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.527em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">t</span><span class="mord mtight">_2</span><span class="mbin mtight">−</span><span class="mord mathnormal mtight">t</span><span class="mord mtight">_1</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.562em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span></span></span></span>.
Adjusting the score of Story 1 upwards by some factor wouldn't change this expression at all.
Adjusting the score of Story 2 upwards such that it now appears first results in a bound on <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>τ</mi><mo stretchy="false">(</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>2</mn><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\tau(t\_2)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.06em;vertical-align:-0.31em"></span><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mord">_2</span><span class="mclose">)</span></span></span></span> that is equal to <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>v</mi><mi mathvariant="normal">_</mi><mn>1</mn><mfrac><mrow><mi>t</mi><mi mathvariant="normal">_</mi><mn>2</mn><mo>−</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>1</mn></mrow><mrow><mi>v</mi><mi mathvariant="normal">_</mi><mn>2</mn><mo>−</mo><mi>v</mi><mi mathvariant="normal">_</mi><mn>1</mn></mrow></mfrac></mrow><annotation encoding="application/x-tex">v\_1 \frac{t\_2-t\_1}{v\_2-v\_1}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.5401em;vertical-align:-0.562em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mord">_1</span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.9781em"><span style="top:-2.655em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.03588em">v</span><span class="mord 
mtight">_2</span><span class="mbin mtight">−</span><span class="mord mathnormal mtight" style="margin-right:0.03588em">v</span><span class="mord mtight">_1</span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.527em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">t</span><span class="mord mtight">_2</span><span class="mbin mtight">−</span><span class="mord mathnormal mtight">t</span><span class="mord mtight">_1</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.562em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span></span></span></span>.
If <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>v</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo>≈</mo><mi>v</mi><mi mathvariant="normal">_</mi><mn>2</mn></mrow><annotation encoding="application/x-tex">v\_1 \approx v\_2</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mord">_1</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">≈</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mord">_2</span></span></span></span> and <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>t</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo>≈</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>2</mn></mrow><annotation encoding="application/x-tex">t\_1 \approx t\_2</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal">t</span><span class="mord">_1</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">≈</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal">t</span><span class="mord">_2</span></span></span></span> then we find that the value of the computed bound doesn't change due to the score adjustment.
Instead, only the regions over which the bound would be upper or lower change.
If we're randomly adjusting the final scores of stories by some factor then the additional contributions from upper and lower bounds <em>exactly cancel.</em>
You don't even need the <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>t</mi><mi mathvariant="normal">_</mi><mn>1</mn><mo>≈</mo><mi>t</mi><mi mathvariant="normal">_</mi><mn>2</mn></mrow><annotation encoding="application/x-tex">t\_1 \approx t\_2</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal">t</span><span class="mord">_1</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">≈</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.9544em;vertical-align:-0.31em"></span><span class="mord mathnormal">t</span><span class="mord">_2</span></span></span></span> constraint when you consider the distribution over the whole ensemble of data.</p>
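<p>To make the argument above concrete, here is a small numerical sketch of the two bound expressions quoted in this paragraph. The function names and the example vote counts and ages are hypothetical; only the two formulas come from the text.</p>

```python
def bound_story1_first(v1, t1, v2, t2):
    """Bound on tau(t1) quoted in the text when Story 1 is ranked first."""
    return v2 * (t2 - t1) / (v2 - v1)


def bound_story2_first(v1, t1, v2, t2):
    """Bound on tau(t2) quoted in the text when Story 2 is ranked first."""
    return v1 * (t2 - t1) / (v2 - v1)


def relative_jump(v1, t1, v2, t2):
    """Fractional change in the bound value when the story order flips."""
    b1 = bound_story1_first(v1, t1, v2, t2)
    b2 = bound_story2_first(v1, t1, v2, t2)
    return abs(b1 - b2) / max(abs(b1), abs(b2))


# Made-up story pairs: (votes, age) for Story 1 followed by Story 2.
similar_pair = dict(v1=100.0, t1=5.0, v2=102.0, t2=5.1)
different_pair = dict(v1=20.0, t1=5.0, v2=200.0, t2=9.0)

for label, pair in (("similar", similar_pair), ("different", different_pair)):
    print(label,
          bound_story1_first(**pair),
          bound_story2_first(**pair),
          relative_jump(**pair))
```

<p>For the similar pair the two bound values differ by only a couple of percent, so flipping the order mostly changes whether the bound acts as an upper or a lower bound. For the very different pair the value itself jumps by a large fraction. Algebraically, the two expressions always differ by exactly the age difference between the stories, which is small whenever the two ages are close.</p>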
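<p>If you look at story pairs where <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>v</mi><mn>1</mn></msub></mrow><annotation encoding="application/x-tex">v_1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.0359em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span> and <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>v</mi><mn>2</mn></msub></mrow><annotation encoding="application/x-tex">v_2</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.0359em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span> are very different then this is no longer true; the value of the bound jumps discontinuously when the story order changes.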
This means that the density of upper and lower bounds will no longer cancel, they'll grow with <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="normal">∣</mi><mi>v</mi><mi mathvariant="normal">_</mi><mn>2</mn><mo>−</mo><mi>v</mi><mi mathvariant="normal">_</mi><mn>1</mn><mi mathvariant="normal">∣</mi></mrow><annotation encoding="application/x-tex">|v\_2-v\_1|</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.06em;vertical-align:-0.31em"></span><span class="mord">∣</span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mord">_2</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:1.06em;vertical-align:-0.31em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mord">_1∣</span></span></span></span>, the magnitude of the adjustment factors, and how quickly the story density changes with respect to <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>v</mi></mrow><annotation encoding="application/x-tex">v</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span></span></span></span> and <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>t</mi></mrow><annotation encoding="application/x-tex">t</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6151em"></span><span class="mord 
mathnormal">t</span></span></span></span>.
The thing is, as <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="normal">∣</mi><mi>v</mi><mi mathvariant="normal">_</mi><mn>2</mn><mo>−</mo><mi>v</mi><mi mathvariant="normal">_</mi><mn>1</mn><mi mathvariant="normal">∣</mi></mrow><annotation encoding="application/x-tex">|v\_2-v\_1|</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.06em;vertical-align:-0.31em"></span><span class="mord">∣</span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mord">_2</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:1.06em;vertical-align:-0.31em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mord">_1∣</span></span></span></span> grows, any measured bounds (without adjustment) get further from the real value of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>τ</mi><mo stretchy="false">(</mo><mi>t</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\tau(t)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mclose">)</span></span></span></span>.
The same discontinuity that causes the contributions to not cancel simultaneously reduces the overlap (or eliminates it completely depending on the size of the adjustment factor).
This is a direct result of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mfrac><mrow><msup><mi mathvariant="normal">∂</mi><mn>2</mn></msup><mi mathvariant="script">F</mi></mrow><mrow><mi mathvariant="normal">∂</mi><msup><mi>t</mi><mn>2</mn></msup></mrow></mfrac><mo>&gt;</mo><mn>0</mn></mrow><annotation encoding="application/x-tex">\frac{\partial^2 \mathcal F}{\partial t^2}&gt;0</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.3629em;vertical-align:-0.345em"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.0179em"><span style="top:-2.655em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight" style="margin-right:0.05556em">∂</span><span class="mord mtight"><span class="mord mathnormal mtight">t</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.7463em"><span style="top:-2.786em;margin-right:0.0714em"><span class="pstrut" style="height:2.5em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.394em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight"><span class="mord mtight" style="margin-right:0.05556em">∂</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8913em"><span style="top:-2.931em;margin-right:0.0714em"><span 
class="pstrut" style="height:2.5em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span><span class="mord mathcal mtight" style="margin-right:0.09931em">F</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.345em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">&gt;</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0</span></span></span></span> and our extrapolated tangent line falling under the curve.
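<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>t</mi><mn>1</mn></msub></mrow><annotation encoding="application/x-tex">t_1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.7651em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal">t</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span> and <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>t</mi><mn>2</mn></msub></mrow><annotation encoding="application/x-tex">t_2</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.7651em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal">t</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span> have to have similar values to get a very tight constraint and that means <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>v</mi><mn>1</mn></msub></mrow><annotation encoding="application/x-tex">v_1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.0359em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span> and <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>v</mi><mn>2</mn></msub></mrow><annotation encoding="application/x-tex">v_2</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.0359em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span> also have to have similar values.</p>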
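<p>This goes far beyond my somewhat hand-wavy original assertion that the overlap cancels to leading order.
The quality of the extracted <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>τ</mi><mn>0</mn></msub></mrow><annotation encoding="application/x-tex">\tau_0</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1132em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">0</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span> and <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>τ</mi><mn>1</mn></msub></mrow><annotation encoding="application/x-tex">\tau_1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1132em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span> parameters makes a lot more sense in light of this.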
It also makes more sense why the <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>τ</mi><mo stretchy="false">(</mo><mi>t</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\tau(t)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.1132em">τ</span><span class="mopen">(</span><span class="mord mathnormal">t</span><span class="mclose">)</span></span></span></span> fits looked OK in the newer data but the <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>ν</mi><mo stretchy="false">(</mo><mi>v</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\nu(v)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.06366em">ν</span><span class="mopen">(</span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mclose">)</span></span></span></span> fits looked really bad.
The constant factors weren't the issue; it was really the different <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>v</mi></mrow><annotation encoding="application/x-tex">v</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em"></span><span class="mord mathnormal" style="margin-right:0.03588em">v</span></span></span></span> dependence introduced by the <code>contro-factor</code>.
The distribution of bounds violations was sort of an admixture of one preferring a linear function and many preferring cubic functions.
Then <em>on top of that</em> you also have the contributions from the various adjustment factors.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="in-other-words">In Other Words<a href="https://sangaline.com/blog/reverse-engineering-the-hacker-news-ranking-algorithm#in-other-words" class="hash-link" aria-label="Direct link to In Other Words" title="Direct link to In Other Words" translate="no">​</a></h2>
<p>We took a data-driven approach to figuring out how the Hacker News algorithm works and were able to make some fairly accurate inferences about how the score function has depended on a story's age and vote total at various points in the site's history.
Our methodology proved to be robust against violations of our initial assumptions and generally outperformed global optimization approaches at reconstructing the parameters used in the Arc code, particularly when adjustment factors became more prevalent.
It even guided us to the correct functional form for the score function while it would have been difficult to do anything beyond guess-and-check with the global optimization approach.</p>
<p>Overall, it was a fun problem space to explore and I hope that you enjoyed following along.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[A Greedy Image Unshredder]]></title>
            <link>https://sangaline.com/blog/a-greedy-image-unshredder</link>
            <guid>https://sangaline.com/blog/a-greedy-image-unshredder</guid>
            <pubDate>Sun, 09 Oct 2016 17:06:35 GMT</pubDate>
            <description><![CDATA[A brief response to Nayuki's post about the use of simulated annealing to solve an image unshredding problem. An interactive demo is used to show that a simple greedy algorithm outperforms the SA, both in terms of results and computation time.
]]></description>
            <content:encoded><![CDATA[<p>A <a href="https://www.nayuki.io/page/image-unshredder-by-annealing" target="_blank" rel="noopener noreferrer" class="">blogpost by Nayuki</a> about unshredding images using simulated annealing was just posted on <a href="https://news.ycombinator.com/item?id=12667611" target="_blank" rel="noopener noreferrer" class="">Hacker News</a>.
This is just a simple demo to support <a href="https://news.ycombinator.com/item?id=12668728" target="_blank" rel="noopener noreferrer" class="">my comment on the article</a> and illustrate that a much simpler greedy algorithm is both faster and more effective for this specific problem.</p>
<!-- -->
<div class="container_RKCj"><canvas width="400" height="300" class="canvas_n8BK"></canvas><table class="controlsTable_SyT5"><tbody><tr><td><label>Select image:</label></td><td><select><option value="0">Abstract Light Painting</option><option value="1">Alaska Railroad</option><option value="2">Blue Hour in Paris</option><option value="3">Lower Kananaskis Lake</option><option value="4">Marlet 2 Radio Board</option><option value="5">Nikos’s Cat</option><option value="6" selected="">Pizza food wallpaper</option><option value="7">The Enchanted Garden</option><option value="8">Tokyo Skytree Aerial</option></select> <a href="https://www.flickr.com/photos/68711844@N07/15204301893/" target="_blank" rel="noopener noreferrer">Michael Stern</a>, <abbr title="Creative Commons">CC</abbr> license</td></tr><tr><td></td><td><button disabled="">Shuffle</button> <button disabled="">Unshred</button> <button disabled="">Stop</button></td></tr><tr><td>Iterations:</td><td>‒</td></tr></tbody></table></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="description">Description<a href="https://sangaline.com/blog/a-greedy-image-unshredder#description" class="hash-link" aria-label="Direct link to Description" title="Direct link to Description" translate="no">​</a></h2>
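<p>As a rough illustration of the greedy approach described in this section, here is a minimal Python sketch. It is not the code behind the demo: the image is assumed to be a list of columns, each a list of (r, g, b) tuples, and the names <code>column_distance</code> and <code>greedy_unshred</code> are made up for this example. The sketch also assumes that the strip may grow at either end, which is only one reasonable reading of "adding the best matching column."</p>

```python
def column_distance(left, right):
    """Sum of absolute RGB differences between horizontally adjacent pixels."""
    return sum(
        abs(a - b)
        for pixel_l, pixel_r in zip(left, right)
        for a, b in zip(pixel_l, pixel_r)
    )


def greedy_unshred(columns):
    """Return an ordering of column indices built up greedily."""
    n = len(columns)
    if n < 2:
        return list(range(n))
    # Seed the strip with the closest-matching ordered pair (left, right).
    _, i, j = min(
        (column_distance(columns[i], columns[j]), i, j)
        for i in range(n) for j in range(n) if i != j
    )
    order = [i, j]
    remaining = set(range(n)) - {i, j}
    while remaining:
        # Best remaining column for the right end and for the left end.
        right_cost, right_idx = min(
            (column_distance(columns[order[-1]], columns[k]), k) for k in remaining
        )
        left_cost, left_idx = min(
            (column_distance(columns[k], columns[order[0]]), k) for k in remaining
        )
        # Attach whichever of the two candidates matches better.
        if right_cost <= left_cost:
            order.append(right_idx)
            remaining.remove(right_idx)
        else:
            order.insert(0, left_idx)
            remaining.remove(left_idx)
    return order


# Toy example: a horizontal grayscale gradient with shuffled columns.
original = [[(10 * x, 10 * x, 10 * x)] * 3 for x in range(6)]
shuffled_indices = [3, 0, 5, 1, 4, 2]
shuffled = [original[i] for i in shuffled_indices]
order = greedy_unshred(shuffled)
restored = [shuffled[i] for i in order]
```

<p>Because this distance is symmetric, a sketch like this cannot tell an image from its mirror image, so the restored strip should be compared against both orientations.</p>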
<p>The algorithm simply finds the closest matching pair of columns and then builds up the image column by column from there, adding the best matching column out of all those remaining.
It takes less time to run and more accurately reconstructs the images (though there are a couple of remaining seams).
Both algorithms currently use the sum of the absolute values of the differences between the RGB values of adjacent pixels.
The performance could be improved by finding a better comparison function for the columns of pixels.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Finding an Optimal Keyboard Layout for Swype]]></title>
            <link>https://sangaline.com/blog/finding-an-optimal-keyboard-layout-for-swype</link>
            <guid>https://sangaline.com/blog/finding-an-optimal-keyboard-layout-for-swype</guid>
            <pubDate>Thu, 09 Apr 2015 20:20:48 GMT</pubDate>
            <description><![CDATA[An overview of my work on optimizing phone keyboard layouts for Swype and T9. There's some interesting history here as well as a novel simulation-based approach to keyboard optimization.
]]></description>
            <content:encoded><![CDATA[<p><em>What follows is a set of somewhat meandering musings. If cold hard facts are more your thing then you may prefer reading our recent paper <a href="http://arxiv.org/abs/1503.06300" target="_blank" rel="noopener noreferrer" class="">Optimizing Touchscreen Keyboard Layouts to Minimize Swipe Input Errors</a>. You might also want to check out the open source libraries <a href="https://github.com/sangaline/dodona" target="_blank" rel="noopener noreferrer" class="">dodona</a> and <a href="https://github.com/sangaline/svganimator" target="_blank" rel="noopener noreferrer" class="">svganimator</a> that were used for the analysis and to generate the graphics in this post.</em></p>
<!-- -->
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="introduction">Introduction<a href="https://sangaline.com/blog/finding-an-optimal-keyboard-layout-for-swype#introduction" class="hash-link" aria-label="Direct link to Introduction" title="Direct link to Introduction" translate="no">​</a></h2>
<p>How do you enter text on your phone? If it has a touch screen then chances are you use a tiny little keyboard on the screen. If not, you probably still know how that might work. You just press the tiny little keys on the little keyboard for each letter that you want to type and they show up on the screen. It's a pretty intuitive mode of interaction for anybody who has spent some time on a computer before.</p>
<p>There's a very serious limitation to tapping away at the little keys however: it's really slow. Despite the backroom deals that have taken place between the fannypack industry and mobile phone manufacturers, you're still not going to be able to make Mavis Beacon proud by fitting eight fingers on the home row. Most people end up tapping away with their two thumbs or a single finger. Even if you get in a groove it's just frustratingly slow compared to a real keyboard.</p>
<p>At this point some of you who prefer swyping to typing might be thinking something along the lines of, "tap typing is for suckers!" Now some of you are probably wondering, "what's swyping?" Well... swyping, a portmanteau of swiping and typing, is an alternative way of interacting with those same little keyboards. Instead of laboriously lifting your finger and plunging it down against rigid glass for every single letter, you just slide your finger from key to key. Let's take a quick look at how we could type or swype the word "head."</p>
<p><img decoding="async" loading="lazy" alt="typing vs swyping head" src="https://sangaline.com/assets/images/head.swype-cdd126358a1b4fa8a68d6c21b2809a9f.svg" width="1152" height="384" class="img_ev3q"></p>
<p>On the left you see how we would type it. The orange circle indicates a touch event on the device and we just highlight each key to make it more clear what's being entered. On the right we see how you would swype the word. We added a blue line to show you what the whole swype pattern looks like but the orange circle again represents touch events as time progresses. It's not really any more complicated than typing. Figuring out which word was intended from a swype pattern can, on the other hand, be a bit tricky... we'll get to that later.</p>
<p>Swyping might not seem like that big of a deal, but after you get the hang of it it's amazing. Your finger dances effortlessly across the screen as you enter entire words in single fluid motions. If you ever have to use a touch keyboard that doesn't support swyping you feel like you're using two ball-peen hammers on a typewriter. And, most importantly, it's fast.</p>
<p>Back in 2010, Franklin Page set the (now defeated) world record for fastest text message on a touchscreen device by swyping the phrase, "The razor-toothed piranhas of the genera Serrasalmus and Pygocentrus are the most ferocious freshwater fish in the world. In reality they seldom attack a human," in just 35.54 seconds. That works out to 43.9 words per minute! Granted, you better have a great smile if you're trying to get a job as a secretary typing at that speed, but it's still pretty damn impressive on a touch screen.</p>
<p>It's impressive but maybe slightly less so than it looks at first glance. The truth is that a Bostonian with one Irish great-great-grandparent could probably swype "pygocentrus" correctly on St. Patrick's Day. On the other hand, Norman Shumway would've likely struggled in his prime to swype "it" without it being interpreted as "out" some of the time. Take a look at what the swype pattern for "pygocentrus" actually looks like.</p>
<p><img decoding="async" loading="lazy" alt="pygocentrus swype pattern" src="https://sangaline.com/assets/images/pygocentrus.swype-84e31bd6efcd70e87aff47d8e54873ed.svg" width="576" height="384" class="img_ev3q"></p>
<p>The swype pattern looks complex at first but it's not that bad if you trace it out letter by letter. The complexity of the swype pattern is actually what makes it so easy for a computer to figure out what word it corresponds to. Even if you miss most of the letters by a bit there's still no other word with a remotely similar pattern. When you swype the word "it" then it looks almost the same as "out" if you start just a little too far to the right. If you're swyping "pot" then it looks exactly like "pout" and "pit" even if you do it perfectly.</p>
<p>When you try to swype one word but it's interpreted as another it's called a swypo. It can be pretty frustrating when these happen because you need to go back, delete the error, and then often tap type the correct word. This really breaks your flow and slows you down. We've also noticed that the incidence of swypos seems to go up after you've been swyping for a while. Once you get too comfortable it becomes really easy to cut a corner a little too sharply or to shift a whole pattern a bit to the side.</p>
<p>It usually starts small, with little mistakes that aren't that big of a deal. You know that your friends will be able to tell you mean "stories" when you say "sorties" so you don't stress. Then you start convincing yourself that "airtight" is just cool new slang you invented that means "alright." But before long you'll start texting "it'd a nice say toast" instead of "it's a nice day today" and you'll get concerned calls asking if you meant to say that you <em>smell</em> toast. Once it gets this bad you might as well be speaking with a British accent for how well people can understand you.</p>
<p>So what can you do? Other than the obviously unacceptable answers of looking at the screen while you're swyping or trying to be less sloppy... nothing really. Nothing at all. Unless... is it possible that a different keyboard layout could alleviate the problem?</p>
<p>A major cause of swypos is that there are inconvenient clusters of letters on a QWERTY keyboard, like "uio," which lead to a lot of ambiguity. Also, some of the least frequently used letters, like "q" and "z," are off in the corners while they would probably be more useful in the center to help separate the most commonly used keys. Just how much better suited for swyping could a keyboard be if you arranged the keys more optimally? If that's a question that you want to know the answer to then you're reading the right blog post.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="back-in-the-day">Back in the Day...<a href="https://sangaline.com/blog/finding-an-optimal-keyboard-layout-for-swype#back-in-the-day" class="hash-link" aria-label="Direct link to Back in the Day..." title="Direct link to Back in the Day..." translate="no">​</a></h2>
<blockquote>
<p>"Turn your pagers to 1993." -Christopher Wallace</p>
</blockquote>
<p>Before we really dig into swype we're going to take a little detour to explore its spiritual predecessor, T9. This will give me a chance to explain our general methodology without getting too caught up on some of the subtle points that only apply to swype. There's also some interesting history here that's hard to pass up on and relevant to swype. If you already know what T9 is and you're not in the mood for a history lesson then feel free to skip ahead to the next section.</p>
<p>The first incarnation of phone-to-phone SMS as we know it today was introduced in 1993 by Radiolinja in Finland. The idea of piggybacking 128-byte messages on the signaling paths used to control telephone traffic on GSM networks had been around since 1984, but it wasn't until 1993 that Nokia made the first phones which supported sending and receiving them.</p>
<p>By 1995, the technology had become ubiquitous but the average GSM user was still sending only 0.4 messages per month. It would be good for the narrative to say that this was because the text input methods were so inefficient but in reality it's probably more due to the complexity of the billing system in use at the time. The text input was really inefficient though, that part's true.</p>
<p>Remember how telephone keypads have one number and a few letters on each button? For example, the 2 button also has "ABC" on it and the 8 button has "TUV." To send a text message back then you had to rotate through the letters by hitting each button repeatedly. To type the word "cat" you would have to tap 2-2-2-pause-2-8. It was slow and a pain to use.</p>
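<p>To make the multi-tap scheme concrete, here is a small Python sketch that turns a word into the key presses described above. The keypad mapping is the standard telephone layout; the function name <code>multitap</code> and the literal <code>pause</code> token are just choices made for this illustration.</p>

```python
# Standard telephone keypad letters.
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
# Map each letter to (key, number of presses needed to reach it).
PRESSES = {
    letter: (key, position + 1)
    for key, letters in KEYPAD.items()
    for position, letter in enumerate(letters)
}


def multitap(word):
    """Return the multi-tap key presses for a word as a dash-separated string."""
    tokens = []
    previous_key = None
    for letter in word.lower():
        key, count = PRESSES[letter]
        # Consecutive letters on the same key need a pause between them.
        if key == previous_key:
            tokens.append("pause")
        tokens.extend([key] * count)
        previous_key = key
    return "-".join(tokens)


print(multitap("cat"))  # 2-2-2-pause-2-8
```

<p>Five presses and a pause for a three-letter word gives a sense of why this input method felt so slow.</p>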
<p>So anyway, in 1995, when SMS was really still in its infancy, Cliff Kushler and Martin King had a brilliant idea. That idea was to form a company called Tegic Communications and immediately file a patent for T9, short for Text on 9 Keys. With T9 you just press each letter on the keypad once while you're typing instead of cycling through the letters on each key. Going back to the "cat" example it would be shortened to 2-2-8 (although on some phones you would have to pause before the subsequent 2s).</p>
<p>An interesting historical footnote is that the idea for T9 had already been around for a full decade by the time Tegic formed and the patent was filed. I've investigated this thoroughly and believe that the seminal paper was "A Simplified Touch Tone Telecommunication Aid for Deaf and Hearing Impaired Individuals" by Scott Minneman from Tufts Medical Center. This paper describes the text entry method as well as two disambiguation algorithms, and it was published in the proceedings from The 7th Annual Rehabilitation Engineering and Assistive Technology Society of North America (RESNA) Conference in 1985.</p>
<p>There's no way that Kushler and King had any idea about some obscure proceedings from a conference that you've probably never heard of though, right? Well... they were both doing research on computer input mechanisms for disabled people before they started Tegic. They've also both presented their own research at RESNA conferences. It's not like there was only one obscure presentation on the topic either.</p>
<p>Minneman and Stephen Levine, a frequent collaborator, wrote a series of papers in the 1980s exploring different disambiguation algorithms and optimizing keyboard layouts to reduce errors and increase the input rate. There was even a 198 page dissertation written on the topic by Chandravalee Iyengar, who I assume was Levine's graduate student, in 1988 called "Development of a Multicharacter Key Text Entry System Using Computer Disambiguation." This collective work spawned a small subfield and by the time the patent was filed in 1995 there had been dozens of papers exploring every detail of these types of keyboards.</p>
<p>Ah, but you have to read all the claims of the patent and not just the abstract. Yeah, I did. And I've read dozens of research papers on disambiguation keyboards on microfiches that I had to dig out of dusty boxes in library basements. So, yeah... I have a lot of respect for Kushler and King for realizing that the real money in T9 was in facilitating flirting instead of helping deaf people. That was actually a pretty big insight in 1995, but probably not one that should be patentable.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="optimizing-the-t9-layout">Optimizing the T9 Layout<a href="https://sangaline.com/blog/finding-an-optimal-keyboard-layout-for-swype#optimizing-the-t9-layout" class="hash-link" aria-label="Direct link to Optimizing the T9 Layout" title="Direct link to Optimizing the T9 Layout" translate="no">​</a></h2>
<p>Getting back to the fun stuff: T9 text entry is a lot like swype. It sacrifices a bit of accuracy by allowing ambiguous inputs and in doing so gains a lot of speed. So we can take our initial question of how much we can improve the text input rate by rearranging keys and apply it to T9 instead of swype. It's a very similar problem but it's a lot easier to solve so we can use it as a stepping stone while explaining the general optimization approach.</p>
<p>So how do we optimize a keyboard layout? We first need to quantify what we're trying to optimize. You can get fancy and start talking about the time it takes to move your finger around but the biggest bottleneck with T9 is generally having to go back and fix things that weren't interpreted correctly. We quantify this with the error rate which is just the percentage of words you enter that are misinterpreted. We'll also refer to the efficiency of the keyboard which is the percentage of the time that a word is correctly interpreted (one minus the error rate).</p>
<p>In order to calculate the error rate we're going to need a way to simulate realistic user input on a keyboard and then to interpret it. We built an open source library called <a href="https://github.com/sangaline/dodona" target="_blank" rel="noopener noreferrer" class="">dodona</a> which we'll be using to do these things. The core of the library is written in C++ but we like to interact with it using Python and Jupyter. We'll include Python code as we go to help explain some of the concepts.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token comment" style="color:#a89984"># Import the necessary modules</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token keyword" style="color:#ea6962">from</span><span class="token plain"> dodona </span><span class="token keyword" style="color:#ea6962">import</span><span class="token plain"> core</span><span class="token punctuation">,</span><span class="token plain"> keyboards</span><span class="token punctuation">,</span><span class="token plain"> wordlists</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token comment" style="color:#a89984"># These are a couple of ipython specific things for plotting and numerics</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token operator" style="color:#a89984">%</span><span class="token plain">pylab</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token operator" style="color:#a89984">%</span><span class="token plain">matplotlib inline</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token comment" style="color:#a89984"># Create a 
keyboard and an interaction model</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">t9_keyboard </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> keyboards</span><span class="token punctuation">.</span><span class="token plain">MakeT9Keyboard</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">t9_model </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> core</span><span class="token punctuation">.</span><span class="token plain">SimpleGaussianModel</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token comment" style="color:#a89984"># Set how random the model is relative to the key sizes</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">t9_model</span><span class="token punctuation">.</span><span class="token plain">SetScale</span><span class="token punctuation">(</span><span class="token number" style="color:#d3869b">0.2</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token comment" style="color:#a89984"># Use the model to generate an input vector for the word "cat"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">vector </span><span class="token operator" style="color:#a89984">=</span><span 
class="token plain"> t9_model</span><span class="token punctuation">.</span><span class="token plain">RandomVector</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">"cat"</span><span class="token punctuation">,</span><span class="token plain"> t9_keyboard</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">keyboards</span><span class="token punctuation">.</span><span class="token plain">DrawKeyboard</span><span class="token punctuation">(</span><span class="token plain">t9_keyboard</span><span class="token punctuation">,</span><span class="token plain"> inputvector</span><span class="token operator" style="color:#a89984">=</span><span class="token plain">vector</span><span class="token punctuation">,</span><span class="token plain"> t9</span><span class="token operator" style="color:#a89984">=</span><span class="token boolean" style="color:#ea6962">True</span><span class="token punctuation">,</span><span class="token plain"> figsize</span><span class="token operator" style="color:#a89984">=</span><span class="token punctuation">(</span><span class="token number" style="color:#d3869b">5</span><span class="token punctuation">,</span><span class="token number" style="color:#d3869b">5</span><span class="token punctuation">)</span><span class="token punctuation">)</span><br></span></code></pre></div></div>
<p><img decoding="async" loading="lazy" alt="cat T9 input" src="https://sangaline.com/assets/images/cat.t9-bb56e134433587d81a99aa5431a0afb2.svg" width="480" height="480" class="img_ev3q"></p>
<p>The circles in this plot again indicate touch events but now the color indicates time with yellow corresponding to the first touch and red the last. We call sets of touch events an input vector which we represent as a sequence of (x, y, t) values. Input vectors are generated by input models which simulate how a user would enter words on a keyboard. The model we're using here chooses a random location for each key press according to a Gaussian distribution centered around a given key.</p>
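<p>A stripped-down version of that idea, independent of dodona, might look like the following. Only the notion of Gaussian scatter around each key center comes from the description above; the key coordinates, the unit key size, and the function name <code>random_vector</code> are assumptions made for this sketch and do not reflect dodona's internals.</p>

```python
import random

# Hypothetical key centers on a grid of unit-sized T9 letter keys.
KEY_CENTERS = {
    "abc": (1.5, 2.5), "def": (2.5, 2.5),
    "ghi": (0.5, 1.5), "jkl": (1.5, 1.5), "mno": (2.5, 1.5),
    "pqrs": (0.5, 0.5), "tuv": (1.5, 0.5), "wxyz": (2.5, 0.5),
}
LETTER_TO_CENTER = {
    letter: center for letters, center in KEY_CENTERS.items() for letter in letters
}


def random_vector(word, scale=0.2, rng=random):
    """Simulate a word as a list of (x, y, t) touch events.

    Each touch is drawn from a Gaussian centered on the intended key, with a
    standard deviation of `scale` key widths. Time is simply the touch index.
    """
    vector = []
    for t, letter in enumerate(word.lower()):
        cx, cy = LETTER_TO_CENTER[letter]
        vector.append((rng.gauss(cx, scale), rng.gauss(cy, scale), t))
    return vector


random.seed(0)
for x, y, t in random_vector("cat"):
    print(f"t={t}: ({x:.2f}, {y:.2f})")
```

<p>Setting <code>scale</code> to zero in this sketch places every touch exactly on the key center, which corresponds to a perfectly precise user.</p>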
<p>Now that we have a way to simulate user input we still have to figure out how to interpret it. One way to disambiguate inputs is to use a dictionary of valid words. The simplest approach is to then just cycle through every word in the dictionary and pick the word with the highest likelihood of generating the observed input vector. This is the general approach that SimpleGaussianModel uses but it also limits the search to words of the same length.</p>
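<p>The following sketch illustrates that kind of maximum-likelihood lookup under simplifying assumptions: independent Gaussian scatter around hypothetical key centers, a plain Python list standing in for the dictionary, and ties broken by list order. The names <code>log_likelihood</code> and <code>best_match</code> are invented here and are not dodona's API.</p>

```python
# Hypothetical key centers on a grid of unit-sized T9 letter keys.
KEY_CENTERS = {
    "abc": (1.5, 2.5), "def": (2.5, 2.5),
    "ghi": (0.5, 1.5), "jkl": (1.5, 1.5), "mno": (2.5, 1.5),
    "pqrs": (0.5, 0.5), "tuv": (1.5, 0.5), "wxyz": (2.5, 0.5),
}
LETTER_TO_CENTER = {
    letter: center for letters, center in KEY_CENTERS.items() for letter in letters
}


def log_likelihood(word, vector, scale=0.2):
    """Log-likelihood (up to a constant) that `word` produced the touch points."""
    total = 0.0
    for letter, (x, y, _t) in zip(word, vector):
        cx, cy = LETTER_TO_CENTER[letter]
        total -= ((x - cx) ** 2 + (y - cy) ** 2) / (2 * scale ** 2)
    return total


def best_match(vector, words, scale=0.2):
    """Pick the most likely word among those with the same length as the input."""
    candidates = [w for w in words if len(w) == len(vector)]
    return max(candidates, key=lambda w: log_likelihood(w, vector, scale))


# Touches near the "abc", "abc", and "tuv" keys.
vector = [(1.4, 2.6, 0), (1.6, 2.4, 1), (1.5, 0.6, 2)]
print(best_match(vector, ["dog", "act", "cat", "bat", "the"]))
```

<p>In this toy dictionary "act," "cat," and "bat" all sit on the same keys, so their likelihoods tie exactly and the sketch returns whichever comes first in the list, here "act." A real implementation would need some other way to break such ties, for example word frequency.</p>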
<p>To build our dictionary we used the Google Web Trillion Word Corpus as compiled by Peter Norvig. This corpus contains approximately 300,000 of the most commonly used words in the English language and their frequencies. Unfortunately, more than half of these are misspelled words and abbreviations. To get rid of these we cross checked against The Official Scrabble Dictionary, the Most Common Boys/Girls Names, and WinEdt's US Dictionary, including only words that appeared in at least one of them. This left us with 95,881 words or about five times the vocabulary of an average adult.</p>
<p>Now, using this dictionary, we can disambiguate input.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token comment" style="color:#a89984"># Load in the wordlist from disk</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">wordlist </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> core</span><span class="token punctuation">.</span><span class="token plain">WordList</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">wordlist</span><span class="token punctuation">.</span><span class="token plain">LoadFromFile</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">'wordlist.dat'</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token comment" style="color:#a89984"># Truncate it to the 5000 most common words</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">wordlist </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> wordlists</span><span class="token punctuation">.</span><span class="token plain">MostCommon</span><span class="token punctuation">(</span><span class="token plain">wordlist</span><span class="token punctuation">,</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">5000</span><span 
class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token comment" style="color:#a89984"># Make a new cat vector and find the best match</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">vector </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> t9_model</span><span class="token punctuation">.</span><span class="token plain">RandomVector</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">"cat"</span><span class="token punctuation">,</span><span class="token plain"> t9_keyboard</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token keyword" style="color:#ea6962">print</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">"Reconstructed as:"</span><span class="token punctuation">,</span><span class="token plain"> t9_model</span><span class="token punctuation">.</span><span class="token plain">BestMatch</span><span class="token punctuation">(</span><span class="token plain">vector</span><span class="token punctuation">,</span><span class="token plain"> t9_keyboard</span><span class="token punctuation">,</span><span class="token plain"> wordlist</span><span class="token punctuation">)</span><span class="token punctuation">)</span><br></span></code></pre></div></div>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token plain">Reconstructed as: act</span><br></span></code></pre></div></div>
<p>Oops, we interpreted it as "act" instead of "cat." If "c" and "a" were on different keys then we wouldn't have had this problem.</p>
<p>In order to quantify the overall frequency of errors we generate words randomly based on their frequency of usage and then compute the fraction of the time that a word is misinterpreted. If our model is random then this is only a statistical estimate of the error rate but, because the errors in T9 are almost entirely due to the keyboard layout itself, we can cheat a little and compute it directly.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token comment" style="color:#a89984"># Turn off the randomness</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">t9_model</span><span class="token punctuation">.</span><span class="token plain">SetScale</span><span class="token punctuation">(</span><span class="token number" style="color:#d3869b">0</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token comment" style="color:#a89984"># Find the efficiency</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">result </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> core</span><span class="token punctuation">.</span><span class="token plain">ExactEfficiency</span><span class="token punctuation">(</span><span class="token plain">t9_keyboard</span><span class="token punctuation">,</span><span class="token plain"> t9_model</span><span class="token punctuation">,</span><span class="token plain"> wordlist</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token keyword" style="color:#ea6962">print</span><span class="token punctuation">(</span><span class="token string" 
style="color:#89b482">'Error rate:'</span><span class="token punctuation">,</span><span class="token plain"> </span><span class="token number" style="color:#d3869b">1</span><span class="token operator" style="color:#a89984">-</span><span class="token plain">result</span><span class="token punctuation">.</span><span class="token plain">Fitness</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span><br></span></code></pre></div></div>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token plain">Error rate: 0.04062902967123838</span><br></span></code></pre></div></div>
<p>So the error rate is about 4% for the normal touchtone layout with a dictionary consisting of the 5,000 most commonly used words. Now let's see how that compares to the average error rate for randomly selected keyboards.</p>
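<p>To make the "compute it directly" idea concrete, here is a minimal sketch that doesn't use dodona at all. It assumes that the decoder always picks the most frequent word for a given key sequence, so every other word sharing that sequence counts as an error. The function names and the toy frequencies are made up for illustration.</p>

```python
from collections import defaultdict

# Standard touchtone letter groups (keys 2 through 9).
T9_GROUPS = ["abc", "def", "ghi", "jkl", "mno", "pqrs", "tuv", "wxyz"]

def key_sequence(word, groups):
    """Map a word to the tuple of key indices needed to type it."""
    letter_to_key = {c: i for i, g in enumerate(groups) for c in g}
    return tuple(letter_to_key[c] for c in word)

def exact_error_rate(frequencies, groups):
    """Fraction of typed words that a most-frequent-match decoder gets wrong."""
    best = defaultdict(float)  # highest frequency seen for each key sequence
    total = 0.0
    for word, freq in frequencies.items():
        seq = key_sequence(word, groups)
        best[seq] = max(best[seq], freq)
        total += freq
    # Only the most frequent word among colliding words is decoded correctly.
    return 1.0 - sum(best.values()) / total

# Made-up frequencies, purely for illustration.
toy_frequencies = {"act": 5.0, "cat": 3.0, "bat": 2.0, "dog": 10.0}
print(exact_error_rate(toy_frequencies, T9_GROUPS))
```

<p>In this toy dictionary "act", "cat", and "bat" all share a key sequence, so only "act" survives and the other two contribute their combined frequency to the error rate.</p>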
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token keyword" style="color:#ea6962">from</span><span class="token plain"> copy </span><span class="token keyword" style="color:#ea6962">import</span><span class="token plain"> deepcopy</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">error_rates </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> </span><span class="token punctuation">[</span><span class="token punctuation">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">random_keyboard </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> deepcopy</span><span class="token punctuation">(</span><span class="token plain">t9_keyboard</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token keyword" style="color:#ea6962">for</span><span class="token plain"> i </span><span class="token keyword" style="color:#ea6962">in</span><span class="token plain"> </span><span class="token builtin" style="color:#d8a657">range</span><span class="token punctuation">(</span><span class="token number" style="color:#d3869b">1</span><span class="token punctuation">,</span><span class="token 
plain"> </span><span class="token number" style="color:#d3869b">1000</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    random_keyboard</span><span class="token punctuation">.</span><span class="token plain">Randomize</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    fitness </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> core</span><span class="token punctuation">.</span><span class="token plain">ExactEfficiency</span><span class="token punctuation">(</span><span class="token plain">random_keyboard</span><span class="token punctuation">,</span><span class="token plain"> t9_model</span><span class="token punctuation">,</span><span class="token plain"> wordlist</span><span class="token punctuation">)</span><span class="token punctuation">.</span><span class="token plain">Fitness</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    error_rates</span><span class="token punctuation">.</span><span class="token plain">append</span><span class="token punctuation">(</span><span class="token number" style="color:#d3869b">1</span><span class="token operator" style="color:#a89984">-</span><span class="token plain">fitness</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token keyword" style="color:#ea6962">print</span><span 
class="token punctuation">(</span><span class="token string" style="color:#89b482">"Average error rate:"</span><span class="token punctuation">,</span><span class="token plain"> np</span><span class="token punctuation">.</span><span class="token plain">mean</span><span class="token punctuation">(</span><span class="token plain">error_rates</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token keyword" style="color:#ea6962">print</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">"Standard deviation:"</span><span class="token punctuation">,</span><span class="token plain"> np</span><span class="token punctuation">.</span><span class="token plain">std</span><span class="token punctuation">(</span><span class="token plain">error_rates</span><span class="token punctuation">)</span><span class="token punctuation">)</span><br></span></code></pre></div></div>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token plain">Average error rate: 0.0618525265595</span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">Standard deviation: 0.0167464493527</span><br></span></code></pre></div></div>
<p>So the efficiency of the standard touchtone layout is actually 1.27 standard deviations above average.</p>
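<p>For reference, that figure is just the gap between the average error rate and the touchtone error rate, measured in units of the standard deviation. The numbers below are copied from the outputs above; the variable names are only for illustration.</p>

```python
# Values copied from the printed outputs above.
t9_error = 0.04062902967123838
mean_error = 0.0618525265595
std_error = 0.0167464493527

# A lower error rate means a higher efficiency, so the sign is flipped.
z = (mean_error - t9_error) / std_error
print(round(z, 2))
```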
<p>How could we go about finding a more optimal arrangement of the keys? One simple approach is to swap a few random pairs of letters to get a similar but random keyboard and then reevaluate the efficiency. If it's greater than it was before the swaps then we keep the new keyboard and otherwise we keep the old. If we do that repeatedly then we'll end up with a keyboard with a lower error rate.</p>
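<p>The swap-and-keep loop described above can be sketched in plain Python. This is only an illustration under simplifying assumptions: the layout is a dictionary from letters to key indices, the error rate is computed with a most-frequent-match decoder, and the toy layout and frequencies are made up. None of these names come from dodona.</p>

```python
import random
from collections import defaultdict

def error_rate(layout, frequencies):
    """Exact error rate when `layout` maps each letter to a key index."""
    best = defaultdict(float)
    for word, freq in frequencies.items():
        seq = tuple(layout[c] for c in word)
        best[seq] = max(best[seq], freq)
    return 1.0 - sum(best.values()) / sum(frequencies.values())

def swap_letters(layout, rng, swaps=2):
    """Return a copy of `layout` with a few random pairs of letters exchanged."""
    new = dict(layout)
    for _ in range(swaps):
        a, b = rng.sample(sorted(new), 2)
        new[a], new[b] = new[b], new[a]
    return new

def optimize(layout, frequencies, iterations=200, seed=0):
    """Keep a swapped layout only when it lowers the error rate."""
    rng = random.Random(seed)
    current, current_error = layout, error_rate(layout, frequencies)
    for _ in range(iterations):
        candidate = swap_letters(current, rng)
        candidate_error = error_rate(candidate, frequencies)
        if current_error > candidate_error:
            current, current_error = candidate, candidate_error
    return current, current_error

# Toy setup: six letters on two keys, with made-up frequencies.
start = {"a": 0, "b": 0, "c": 0, "t": 1, "s": 1, "r": 1}
toy_frequencies = {"cat": 3.0, "act": 5.0, "bat": 2.0, "rat": 4.0, "sat": 1.0}
best_layout, best_error = optimize(start, toy_frequencies)
print(error_rate(start, toy_frequencies), "to", best_error)
```

<p>Flipping the comparison in <code>optimize</code> turns the same loop into a search for the worst layout instead of the best one.</p>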
<p>We repeated this exact process a few times to illustrate how the optimization progresses. For fun we also decided to see how bad of a layout we could come up with. In the image below you can see how the error rates progressed with iterations. The blue and the green represent optimizations for low and high error rates, respectively. The light lines are individual optimization progressions while the dark lines are the average over the ensemble.</p>
<p><img decoding="async" loading="lazy" alt="T9 optimization" src="https://sangaline.com/assets/images/t9.optimization-0b7da24ba0e83b5a9e00a5b5b4700a3d.svg" width="576" height="384" class="img_ev3q"></p>
<p>The best out of these had an error rate of 1.72% which is 58% lower than the standard "abc" layout. The worst had an error rate of 21.7% which is over 5 times higher than the "abc" layout. Clearly the performance of the keyboard is hugely dependent on the arrangement of the letters. Taking a look at these two layouts we can gain some insight into why the performance is so different.</p>
<p><img decoding="async" loading="lazy" alt="best and worst T9 layouts" src="https://sangaline.com/assets/images/best.worst.t9-d71c141d22ecdcf8cd588d6d8cdef246.svg" width="960" height="480" class="img_ev3q"></p>
<p>One of the first things we notice is that the really bad keyboard has the four most common vowels all on one key. Some of the other characteristics are more subtle but become apparent when we look at some of the most common errors for each layout.</p>
<div style="overflow-x:auto"><table><thead><tr><th style="text-align:center">Worst Layout</th><th style="text-align:center"></th><th style="text-align:center"></th><th style="text-align:center">Best Layout</th><th style="text-align:center"></th><th style="text-align:center"></th></tr></thead><tbody><tr><td style="text-align:center">intended</td><td style="text-align:center">interpreted</td><td style="text-align:center">frequency</td><td style="text-align:center">intended</td><td style="text-align:center">interpreted</td><td style="text-align:center">frequency</td></tr><tr><td style="text-align:center">in</td><td style="text-align:center">of</td><td style="text-align:center">1.82%</td><td style="text-align:center">them</td><td style="text-align:center">they</td><td style="text-align:center">0.09%</td></tr><tr><td style="text-align:center">on</td><td style="text-align:center">of</td><td style="text-align:center">0.80%</td><td style="text-align:center">non</td><td style="text-align:center">now</td><td style="text-align:center">0.04%</td></tr><tr><td style="text-align:center">this</td><td style="text-align:center">that</td><td style="text-align:center">0.69%</td><td style="text-align:center">gay</td><td style="text-align:center">may</td><td style="text-align:center">0.04%</td></tr><tr><td style="text-align:center">I</td><td style="text-align:center">a</td><td style="text-align:center">0.66%</td><td style="text-align:center">la</td><td style="text-align:center">of</td><td style="text-align:center">0.03%</td></tr><tr><td style="text-align:center">it</td><td style="text-align:center">is</td><td style="text-align:center">0.60%</td><td style="text-align:center">pc</td><td style="text-align:center">re</td><td style="text-align:center">0.03%</td></tr><tr><td style="text-align:center">not</td><td style="text-align:center">for</td><td style="text-align:center">0.56%</td><td style="text-align:center">nov</td><td style="text-align:center">not</td><td 
style="text-align:center">0.03%</td></tr><tr><td style="text-align:center">or</td><td style="text-align:center">is</td><td style="text-align:center">0.56%</td><td style="text-align:center">al</td><td style="text-align:center">do</td><td style="text-align:center">0.03%</td></tr><tr><td style="text-align:center">at</td><td style="text-align:center">is</td><td style="text-align:center">0.49%</td><td style="text-align:center">yet</td><td style="text-align:center">get</td><td style="text-align:center">0.03%</td></tr><tr><td style="text-align:center">as</td><td style="text-align:center">is</td><td style="text-align:center">0.48%</td><td style="text-align:center">id</td><td style="text-align:center">if</td><td style="text-align:center">0.03%</td></tr><tr><td style="text-align:center">an</td><td style="text-align:center">of</td><td style="text-align:center">0.33%</td><td style="text-align:center">term</td><td style="text-align:center">very</td><td style="text-align:center">0.02%</td></tr></tbody></table></div>
<p>More than half of the time that an error occurs on the worst keyboard it's the result of one of these ten most common errors. From these we can see that having "tsr" together and "nf" together is also problematic. On the best keyboard these groups of letters are all separated which eliminates any ambiguity for these extremely common words. The error rate does depend on all of the words in the dictionary, so the full picture is more complicated than this, but these common words are a dominant factor.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="optimizing-a-keyboard-layout-for-swype">Optimizing a Keyboard Layout for Swype<a href="https://sangaline.com/blog/finding-an-optimal-keyboard-layout-for-swype#optimizing-a-keyboard-layout-for-swype" class="hash-link" aria-label="Direct link to Optimizing a Keyboard Layout for Swype" title="Direct link to Optimizing a Keyboard Layout for Swype" translate="no">​</a></h2>
<p>T9 will always have a special place in my heart but it's probably a pretty safe bet to assume that the future of text entry doesn't involve a touchtone keypad. It seems like every kid I see these days is glued to their iThis or their Razzr Thatt HD+ and there's no sign that this is about to change. People aren't just writing 160 character text messages anymore either; now they're writing whole Tweets, emails, and god knows what else. I even typed up this entire blogpost on my phone just because my laptop was upstairs. A lot of time could be saved and frustration avoided if touchscreen keyboards were more optimal.</p>
<p>"More optimal" could of course mean a lot of different things. We're going to use the same criterion for being optimal that we did for T9: having the lowest rate of interpretation errors in typical usage. This probably wouldn't be the best quantity to optimize if we were dealing with tap typing, but it's a pretty good way to find faster and less frustrating keyboards with inherently ambiguous input methods like T9 and swype.</p>
<p>We'll also take the same general approach to computing the rate of errors that we did in the T9 analysis. We'll build a model of typical user input, try to interpret it, and then see how frequently we reconstruct it to be the wrong word. Then we can repeat this for different keyboard layouts and see how they compare.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="modeling-swype-input">Modeling Swype Input<a href="https://sangaline.com/blog/finding-an-optimal-keyboard-layout-for-swype#modeling-swype-input" class="hash-link" aria-label="Direct link to Modeling Swype Input" title="Direct link to Modeling Swype Input" translate="no">​</a></h3>
<p>Modeling how users input words on T9 was pretty easy. We basically just had them tap once in the center of each key. With swype it's a bit more complicated because the personal style and sloppiness of each user play a major role in the ambiguity. Although there are a number of inherently indistinguishable swype patterns on a QWERTY keyboard, the majority of errors are the result of there being words with similar patterns that become hard to distinguish in realistic usage.</p>
<p>To model the random elements of individual swypes we first choose control points for each letter in a given word. These points are chosen according to correlated Gaussian distributions around each key's center. We then interpolate between these control points in various ways and sample along the path so that every swype consists of the same number of discrete touch events. This allows for a wide variety of realistic swype input, capturing both differences in personal style and sloppiness. Below you can see how a handful of random swypes look for the word "cream".</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token plain">word </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> </span><span class="token string" style="color:#89b482">"cream"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token comment" style="color:#a89984"># Make a standard QWERTY keyboard</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">keyboard </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> keyboards</span><span class="token punctuation">.</span><span class="token plain">MakeStandardKeyboard</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">keyboard</span><span class="token punctuation">.</span><span class="token plain">RemoveKey</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">'.'</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token comment" style="color:#a89984"># Create an interpolation model and a 
perfect vector</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">model </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> core</span><span class="token punctuation">.</span><span class="token plain">SimpleInterpolationModel</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">perfect </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> model</span><span class="token punctuation">.</span><span class="token plain">PerfectVector</span><span class="token punctuation">(</span><span class="token plain">word</span><span class="token punctuation">,</span><span class="token plain"> keyboard</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token comment" style="color:#a89984"># A few possible interpolations</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">interpolations </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> </span><span class="token punctuation">[</span><span class="token plain">core</span><span class="token punctuation">.</span><span class="token plain">SpatialInterpolation</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">                  core</span><span class="token punctuation">.</span><span class="token plain">CubicSplineInterpolation</span><span class="token 
punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">                  core</span><span class="token punctuation">.</span><span class="token plain">MonotonicCubicSplineInterpolation</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">                  core</span><span class="token punctuation">.</span><span class="token plain">HermiteCubicSplineInterpolation</span><span class="token punctuation">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token comment" style="color:#a89984"># Plot ten random vectors for the word</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token keyword" style="color:#ea6962">for</span><span class="token plain"> i </span><span class="token keyword" style="color:#ea6962">in</span><span class="token plain"> </span><span class="token builtin" style="color:#d8a657">range</span><span class="token punctuation">(</span><span class="token number" style="color:#d3869b">10</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    model</span><span class="token punctuation">.</span><span class="token plain">Interpolation </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> np</span><span class="token punctuation">.</span><span class="token plain">random</span><span class="token punctuation">.</span><span class="token plain">choice</span><span class="token 
punctuation">(</span><span class="token plain">interpolations</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    vector </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> model</span><span class="token punctuation">.</span><span class="token plain">RandomVector</span><span class="token punctuation">(</span><span class="token plain">word</span><span class="token punctuation">,</span><span class="token plain"> keyboard</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">    keyboards</span><span class="token punctuation">.</span><span class="token plain">DrawKeyboard</span><span class="token punctuation">(</span><span class="token plain">keyboard</span><span class="token punctuation">,</span><span class="token plain"> perfectvector</span><span class="token operator" style="color:#a89984">=</span><span class="token plain">perfect</span><span class="token punctuation">,</span><span class="token plain"> inputvector </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> vector</span><span class="token punctuation">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">        frequencymap</span><span class="token operator" style="color:#a89984">=</span><span class="token plain">word</span><span class="token punctuation">,</span><span class="token plain"> colormap </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> mpl</span><span class="token punctuation">.</span><span class="token plain">cm</span><span class="token punctuation">.</span><span class="token plain">Reds</span><span class="token punctuation">,</span><span class="token plain"> nopalette</span><span 
class="token operator" style="color:#a89984">=</span><span class="token boolean" style="color:#ea6962">True</span><span class="token punctuation">)</span><br></span></code></pre></div></div>
<p><img decoding="async" loading="lazy" alt="cream swype animation" src="https://sangaline.com/assets/images/cream.animation-3b8f611cb0e748f68d73c50ff4431c38.svg" width="576" height="384" class="img_ev3q"></p>
<p>The blue line signifies what we call the perfect vector. This is a swype pattern that represents the ideal input vector for a word in the absence of any randomness or user bias. For our swype models, this consists of a linear interpolation between the centers of the successive keys spelling out the word.</p>
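<p>As a rough sketch of the two kinds of vectors, the snippet below builds a perfect vector by linear interpolation between key centers and a random vector by jittering those centers first. The key coordinates, the <code>sigma</code> and <code>rho</code> parameters, and the reading of "correlated" as a correlation between the x and y offsets are all assumptions made for illustration; they are not dodona's geometry or its actual models.</p>

```python
import numpy as np

# Hypothetical key centers on a unit-spaced QWERTY grid (not dodona's geometry).
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
CENTERS = {c: (col + 0.5 * row, -float(row))
           for row, letters in enumerate(ROWS)
           for col, c in enumerate(letters)}

def resample(points, n=50):
    """Linearly interpolate a polyline and sample n points evenly along its length."""
    points = np.asarray(points, dtype=float)
    steps = np.linalg.norm(np.diff(points, axis=0), axis=1)
    arc = np.concatenate([[0.0], np.cumsum(steps)])
    targets = np.linspace(0.0, arc[-1], n)
    return np.column_stack([np.interp(targets, arc, points[:, 0]),
                            np.interp(targets, arc, points[:, 1])])

def perfect_vector(word, n=50):
    """Straight segments between successive key centers, with no randomness."""
    return resample([CENTERS[c] for c in word], n)

def random_vector(word, rng, n=50, sigma=0.2, rho=0.3):
    """Jitter each control point with a correlated 2D Gaussian, then interpolate."""
    cov = sigma ** 2 * np.array([[1.0, rho], [rho, 1.0]])
    centers = np.array([CENTERS[c] for c in word])
    noise = rng.multivariate_normal([0.0, 0.0], cov, size=len(word))
    return resample(centers + noise, n)

rng = np.random.default_rng(0)
perfect = perfect_vector("cream")
sloppy = random_vector("cream", rng)
print(perfect.shape, sloppy.shape)
```

<p>Both vectors end up with the same number of samples, which is what makes them directly comparable later on. Only linear interpolation is shown here; the spline interpolations used above would replace the <code>resample</code> step.</p>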
<p>You might be wondering how much the parameters of the models affect the error rates that we ultimately determine. The answer is quite a bit, but this isn't as big of an issue as it seems at first. Clearly a user who is more sloppy with their inputs will encounter more errors than somebody who is very careful with their swypes. The frequency of errors will be different but, despite this, the words that are misinterpreted will tend to be the same. We spent a lot of time studying the systematics of this and have found that the relative error rate between keyboards tends to be mostly independent of the model parameters. This means that statements like, "this keyboard resulted in a 51% reduction in the error rate relative to QWERTY," are broadly applicable even if the error rates themselves aren't.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="interpreting-swypes">Interpreting Swypes<a href="https://sangaline.com/blog/finding-an-optimal-keyboard-layout-for-swype#interpreting-swypes" class="hash-link" aria-label="Direct link to Interpreting Swypes" title="Direct link to Interpreting Swypes" translate="no">​</a></h3>
<p>It's necessary to have an algorithm for interpreting swypes in order to quantify the error rate. With T9 we could tell directly if a word matched an input vector as long as we were willing to turn off randomness in the model. The randomness in swype inputs is fundamentally important so we can't simply turn that off anymore. We instead need a way to quantitatively estimate how similar an input pattern is to any given word. If we can do that then we can just cycle through our dictionary and pick the word that's the best match.</p>
<p>For our tap typing models we compute the posterior probability for each word directly but that's not really feasible with the swype models due to the complexity of the interpolations. We instead measure the similarity between a swype pattern and the perfect vector for each word. One simple approach to this is to take the Euclidean distance between two input vectors. This is actually the default implementation for interpolation models in <a href="https://github.com/sangaline/dodona" target="_blank" rel="noopener noreferrer" class="">dodona</a> and it does a pretty decent job of reconstructing words.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token comment" style="color:#a89984"># Make a new cream vector and find the best match</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain">vector </span><span class="token operator" style="color:#a89984">=</span><span class="token plain"> model</span><span class="token punctuation">.</span><span class="token plain">RandomVector</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">"cream"</span><span class="token punctuation">,</span><span class="token plain"> keyboard</span><span class="token punctuation">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#ebdbb2"><span class="token plain"></span><span class="token keyword" style="color:#ea6962">print</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">"Reconstructed as:"</span><span class="token punctuation">,</span><span class="token plain"> model</span><span class="token punctuation">.</span><span class="token plain">BestMatch</span><span class="token punctuation">(</span><span class="token plain">vector</span><span class="token punctuation">,</span><span class="token plain"> keyboard</span><span class="token punctuation">,</span><span class="token plain"> wordlist</span><span class="token punctuation">)</span><span class="token punctuation">)</span><br></span></code></pre></div></div>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token plain">Reconstructed as: cream</span><br></span></code></pre></div></div>
<p>It worked in this case and works well overall, but there are some cases where it gives quantitatively large distances between swypes when they're qualitatively very similar. In particular, it tends to give large distances between patterns that are slightly offset but still pass over all of the same keys. Unrealistic quirks in the matching algorithm are something that we really want to avoid because optimization tends to find and exploit things like that.</p>
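<p>The offset problem is easy to reproduce outside of dodona. The sketch below uses a hypothetical key grid and linearly interpolated perfect vectors, picks the dictionary word whose perfect vector has the smallest Euclidean distance to the input, and then measures how far a uniformly shifted copy of "cream" ends up from the original. All names and coordinates here are assumptions for illustration.</p>

```python
import numpy as np

# Hypothetical key centers on a unit-spaced QWERTY grid (not dodona's geometry).
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
CENTERS = {c: (col + 0.5 * row, -float(row))
           for row, letters in enumerate(ROWS)
           for col, c in enumerate(letters)}

def perfect_vector(word, n=50):
    """Linear interpolation between key centers, sampled evenly along the path."""
    points = np.array([CENTERS[c] for c in word], dtype=float)
    steps = np.linalg.norm(np.diff(points, axis=0), axis=1)
    arc = np.concatenate([[0.0], np.cumsum(steps)])
    targets = np.linspace(0.0, arc[-1], n)
    return np.column_stack([np.interp(targets, arc, points[:, 0]),
                            np.interp(targets, arc, points[:, 1])])

def euclidean_distance(a, b):
    """Distance between two swypes treated as flat vectors of coordinates."""
    return float(np.linalg.norm(a - b))

def best_match(vector, words):
    """Return the word whose perfect vector is closest to the input."""
    return min(words, key=lambda w: euclidean_distance(
        vector, perfect_vector(w, len(vector))))

words = ["cream", "dream", "cram", "team"]
cream = perfect_vector("cream")
shifted = cream + np.array([0.0, 0.3])  # same shape, nudged upward

print(best_match(cream, words))
print(euclidean_distance(cream, shifted))
print(euclidean_distance(cream, perfect_vector("dream")))
```

<p>Every sample in the shifted copy contributes the same 0.3 to the sum, so the distance grows with the square root of the number of samples even though the shape of the swype hasn't changed at all.</p>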
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="neural-network-identification">Neural Network Identification<a href="https://sangaline.com/blog/finding-an-optimal-keyboard-layout-for-swype#neural-network-identification" class="hash-link" aria-label="Direct link to Neural Network Identification" title="Direct link to Neural Network Identification" translate="no">​</a></h4>
<p>To improve the performance of the algorithm we trained a neural network to take in eleven comparison measures between swype patterns and then identify whether or not they correspond to a pair of random and perfect vectors for the same word. The eleven comparison measures relate to the differences in length, x/y positions, x/y derivatives, x/y second derivatives, and the first/last x/y positions of the two swipe patterns. We go into a lot more detail about them, and the network in general, in the <a href="http://arxiv.org/abs/1503.06300" target="_blank" rel="noopener noreferrer" class="">paper</a>.</p>
<p>Below we can see an example of how the network responds to a match and a non-match. The value for each node is represented in grayscale with black corresponding to the largest value and white to zero. The magnitude of the signal being passed through each connection is represented by its transparency with solid connections having the largest signal strength. Finally, red corresponds to positive weights and blue to negative weights. The swipe patterns being compared in the top were both generated from the word "phish" using two different interpolations and the network successfully identifies them as being a match. The swipe patterns compared in the bottom correspond to the words "alright" and "airtight." These words have very similar swype patterns but the network is still able to determine that there is not a match.</p>
<p><img decoding="async" loading="lazy" alt="neural network match" src="https://sangaline.com/assets/images/nn.match-eca7a580312247f7361f4af29b9c8a0b.svg" width="1361" height="519" class="img_ev3q">
<img decoding="async" loading="lazy" alt="neural network non-match" src="https://sangaline.com/assets/images/nn.nonmatch-3f25b9a0ff52daa266877c1648b6ba86.svg" width="1361" height="519" class="img_ev3q"></p>
<p>It's clear in these cases that the first and second derivatives play a significant role in differentiating and matching similar swype patterns. This makes a lot of sense because they're invariant to translation and the Euclidean distance tends to punish small translations very harshly. The neural network reduced our overall error rate by about 20% and helped eliminate nearly all unrealistic mismatches.</p>
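<p>To make the translation argument concrete, here is a minimal sketch in plain Python. It does not use dodona; the helper names <code>euclidean_distance</code> and <code>first_differences</code>, and the sample points, are made up for illustration. It compares a swipe with a slightly shifted copy of itself, first by position and then by first differences (a discrete stand-in for the first derivative).</p>

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length lists of (x, y) points."""
    return math.sqrt(
        sum((ax - bx) ** 2 + (ay - by) ** 2 for (ax, ay), (bx, by) in zip(a, b))
    )


def first_differences(points):
    """Discrete first derivative: the step between consecutive points."""
    return [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(points, points[1:])]


# A hypothetical swipe and a copy that is shifted but has the same shape.
swipe = [(0.0, 0.0), (1.0, 0.5), (2.0, 0.0), (3.0, 1.0)]
shifted = [(x + 0.5, y + 0.25) for x, y in swipe]

position_distance = euclidean_distance(swipe, shifted)
derivative_distance = euclidean_distance(
    first_differences(swipe), first_differences(shifted)
)

print("distance between positions:  ", position_distance)
print("distance between derivatives:", derivative_distance)
```

<p>The positions disagree at every point even though the shapes are identical, while the first differences agree exactly. That is the sense in which derivative-based measures ignore a small offset that the positional distance punishes.</p>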
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="limiting-candidate-words">Limiting Candidate Words<a href="https://sangaline.com/blog/finding-an-optimal-keyboard-layout-for-swype#limiting-candidate-words" class="hash-link" aria-label="Direct link to Limiting Candidate Words" title="Direct link to Limiting Candidate Words" translate="no">​</a></h4>
<p>Generating and comparing swype vectors is much more expensive than doing the same with T9. With T9 we were also able to reduce the number of possible words by only looking at words of the same length and stopping the search once we found the most probable direct match. We can't get any reliable information about the number of characters in a swype pattern so we can't apply the same tricks. So what can we do to speed things up?</p>
<p>A general approach is to only consider words that have a reasonably high probability of producing the pattern in question. One way to accomplish this is to work directly with the series of letters that are passed over by a swipe pattern. For example, a perfect swype of the word "pot" would correspond to "poiuyt" on a QWERTY keyboard. We call this discrete representation of a swype pattern the string form. We could then simulate a large number of random swypes for each word in the dictionary and build a hash table which maps string forms to a list of words that could result in each string form. Using this table we could quickly produce a short list of probable candidate words for any swype pattern and then evaluate each of these more carefully.</p>
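<p>As a rough sketch of the string form and the hash table idea, the snippet below uses plain Python rather than dodona. The key geometry (unit-width keys with each row shifted half a key to the right), the sampling density, and the names <code>string_form</code> and <code>build_candidate_table</code> are all assumptions made for illustration. It also only uses perfect, straight-line swipes, whereas the approach described above would fill the table from many random swipes per word.</p>

```python
from collections import defaultdict

# Assumed QWERTY geometry: unit-width keys, each row shifted half a key right.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
KEY_CENTERS = {
    letter: (column + 0.5 * row, float(row))
    for row, letters in enumerate(ROWS)
    for column, letter in enumerate(letters)
}


def nearest_key(x, y):
    """Return the letter whose key center is closest to the point (x, y)."""
    return min(
        KEY_CENTERS,
        key=lambda k: (KEY_CENTERS[k][0] - x) ** 2 + (KEY_CENTERS[k][1] - y) ** 2,
    )


def string_form(word, samples_per_segment=50):
    """Letters passed over by a straight-line swipe through the word's keys."""
    points = [KEY_CENTERS[letter] for letter in word]
    keys = [word[0]]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        for step in range(1, samples_per_segment + 1):
            t = step / samples_per_segment
            key = nearest_key(x0 + t * (x1 - x0), y0 + t * (y1 - y0))
            if key != keys[-1]:  # record a key only when the swipe enters it
                keys.append(key)
    return "".join(keys)


def build_candidate_table(words):
    """Map each string form to the list of words that produce it."""
    table = defaultdict(list)
    for word in words:
        table[string_form(word)].append(word)
    return table


table = build_candidate_table(["pot", "put", "pit", "pout", "top", "tip", "cat"])
print(string_form("pot"))
print(table[string_form("pot")])
```

<p>Under this assumed geometry "pot", "put", "pit", and "pout" all share the string form "poiuyt", so a single lookup returns all four as candidates to be compared more carefully.</p>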
<p>This would work well in a practical implementation of a swype keyboard, but it has a major issue in the context of optimization. Producing the initial hash table is very expensive and it has to be redone for every new keyboard layout. If we want to quantify the error rate for many different keyboards during optimization then this is going to end up being slower than just cycling through every word in the dictionary.</p>
<p>We get around this by cheating a little. In a simulation, we know what the correct word is for every swype pattern that we're trying to match. We use this to produce a bunch of random swypes for the correct word and then compute the string form for each of them. The string forms are then searched for words that are contained within them and also match the first and last letters. It's easiest to explain what we mean by "contained within" with an example.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token plain">wordlist</span><span class="token punctuation">.</span><span class="token plain">RadixTreeMatches</span><span class="token punctuation">(</span><span class="token string" style="color:#89b482">"poiuyt"</span><span class="token punctuation">)</span><br></span></code></pre></div></div>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#ebdbb2;--prism-background-color:#292828"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#ebdbb2;background-color:#292828"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#ebdbb2"><span class="token plain">['poot', 'pout', 'pott', 'pot', 'pit', 'putt', 'put']</span><br></span></code></pre></div></div>
<p>The candidate words are guaranteed to include any word with a perfect vector having the same string form. They can also be found very efficiently by recursively searching a radix tree representation of our dictionary. Below you can see an example of a radix tree containing the words that would match the string form for "poiuyt".</p>
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="300" height="137pt" viewBox="0 0 225 137" class="radixTree_y2DS"><defs><symbol id="a" overflow="visible"><path d="M2.922 1.672c-.766 0-.75.156-.75-.375V-.781l-.266.094c.203.312.813.812 1.578.812 1.375 0 2.75-1.312 2.75-2.844s-1.28-2.843-2.593-2.843c-1.047 0-1.735.796-1.766.843l.281.11v-.97l-1.984.157v.625c1 0 .906-.062.906.469v5.625c0 .531.031.375-.906.375v.672c.469-.047 1.125-.047 1.453-.047.344 0 1 0 1.453.047v-.672Zm-.75-5.625c0-.235-.047-.14.094-.328.359-.532.828-.735 1.28-.735.907 0 1.454.954 1.454 2.297C5-1.297 4.344-.422 3.438-.422c-.376 0-.61-.11-.86-.328-.265-.281-.406-.406-.406-.75Zm0 0" style="stroke:none"></path></symbol><symbol id="b" overflow="visible"><path d="M2.234-7.5c0-.312-.406-.75-.734-.75-.312 0-.75.406-.75.734 0 .36.453.72.75.72.36 0 .734-.454.734-.704M.266-5.281v.484c.922 0 .875-.062.875.516v3.25c0 .547.03.39-.907.39v.657c.5-.047 1.063-.047 1.422-.047.125 0 .813 0 1.375.047v-.657c-.922 0-.812.094-.812-.375v-4.562l-1.953.156Zm0 0" style="stroke:none"></path></symbol><symbol id="c" overflow="visible"><path d="M2.016-4.812h1.828v-.641H2.156v-2.203h-.578C1.562-6.375 1.297-5.375.047-5.328v.516h1.016v3.187c0 1.328 1.046 1.75 1.687 1.75.75 0 1.297-.89 1.297-1.75v-.844h-.578v.813c0 .875-.188 1.219-.64 1.219-.813 0-.673-.97-.673-1.172v-3.203Zm0 0" style="stroke:none"></path></symbol><symbol id="d" overflow="visible"><path d="M5.64-2.703c0-1.547-1.327-2.922-2.718-2.922C1.5-5.625.188-4.203.188-2.703.188-1.172 1.547.125 2.922.125c1.406 0 2.719-1.312 2.719-2.828M2.923-.437c-.438 0-.86-.141-1.219-.735-.312-.531-.281-1.14-.281-1.64 0-.454-.047-1.079.328-1.61.328-.516.75-.672 1.172-.672.453 0 .844.188 1.172.64.36.563.312 1.204.312 1.641 0 .422.047 1.063-.265 1.626-.328.562-.766.75-1.22.75m0 0" style="stroke:none"></path></symbol><symbol id="e" overflow="visible"><path d="M3.469-5.281v.484c.984 0 .906-.062.906.531v2.141C4.375-1.109 4-.422 3.109-.422c-.984 
0-.89-.406-.89-1.031v-4.125l-2 .156v.625c1.078 0 .906-.11.906.953v1.797c0 .75.047 1.281.406 1.672.282.313.89.5 1.5.5.203 0 .703-.047 1.11-.39.343-.266.53-.72.53-.72l-.265-.109V.125L6.391 0v-.64c-.97 0-.922.062-.922-.516v-4.422l-2 .156Zm0 0" style="stroke:none"></path></symbol></defs><path d="M2404.648 7041.367c0 46.055-37.343 83.399-83.398 83.399s-83.398-37.344-83.398-83.399 37.343-83.398 83.398-83.398 83.398 37.343 83.398 83.398Zm0 0" style="fill:none;stroke-width:7.97011;stroke-linecap:butt;stroke-linejoin:miter;stroke:#000;stroke-opacity:1;stroke-miterlimit:10" transform="matrix(.1 0 0 -.1 -128 713)"></path><use xlink:href="#a" x="100.87" y="10.27" style="fill:#000;fill-opacity:1"></use><path d="M1442.656 6732.54c0 42.382-34.375 76.718-76.718 76.718-42.383 0-76.758-34.336-76.758-76.719s34.375-76.758 76.757-76.758c42.344 0 76.72 34.375 76.72 76.758Zm0 0" style="fill:none;stroke-width:7.97011;stroke-linecap:butt;stroke-linejoin:miter;stroke:#000;stroke-opacity:1;stroke-miterlimit:10" transform="matrix(.1 0 0 -.1 -128 713)"></path><use xlink:href="#b" x="6.97" y="43.71" style="fill:#000;fill-opacity:1"></use><path d="m2238.086 7014.492-795.352-257.148" style="fill:none;stroke-width:8;stroke-linecap:butt;stroke-linejoin:miter;stroke:#000;stroke-opacity:1;stroke-miterlimit:10" transform="matrix(.1 0 0 -.1 -128 713)"></path><path d="M1443.047 6423.672c0 42.617-34.531 77.11-77.11 77.11s-77.109-34.493-77.109-77.11c0-42.578 34.531-77.11 77.11-77.11s77.109 34.532 77.109 77.11Zm0 0" style="fill:none;stroke-width:7.97011;stroke-linecap:butt;stroke-linejoin:miter;stroke:#000;stroke-opacity:1;stroke-miterlimit:10" transform="matrix(.1 0 0 -.1 -128 713)"></path><use xlink:href="#c" x="6.32" y="74.31" style="fill:#000;fill-opacity:1"></use><path d="M1365.938 6651.797v-147.031" style="fill:none;stroke-width:8;stroke-linecap:butt;stroke-linejoin:miter;stroke:#000;stroke-opacity:1;stroke-miterlimit:10" transform="matrix(.1 0 0 -.1 -128 713)"></path><path d="M1399.805 6114.844c0 
18.71-15.157 33.867-33.868 33.867-18.71 0-33.906-15.156-33.906-33.867s15.196-33.867 33.907-33.867 33.867 15.156 33.867 33.867Zm0 0" style="fill:none;stroke-width:7.97011;stroke-linecap:butt;stroke-linejoin:miter;stroke:#000;stroke-opacity:1;stroke-miterlimit:10" transform="matrix(.1 0 0 -.1 -128 713)"></path><path d="M1365.938 6342.578v-189.883" style="fill:none;stroke-width:8;stroke-linecap:butt;stroke-linejoin:miter;stroke:#000;stroke-opacity:1;stroke-miterlimit:10" transform="matrix(.1 0 0 -.1 -128 713)"></path><path d="M2330.703 6732.54c0 40.194-32.617 72.812-72.851 72.812S2185 6772.734 2185 6732.539c0-40.234 32.617-72.851 72.852-72.851 40.234 0 72.851 32.617 72.851 72.851Zm0 0" style="fill:none;stroke-width:7.97011;stroke-linecap:butt;stroke-linejoin:miter;stroke:#000;stroke-opacity:1;stroke-miterlimit:10" transform="matrix(.1 0 0 -.1 -128 713)"></path><use xlink:href="#d" x="94.86" y="42.32" style="fill:#000;fill-opacity:1"></use><path d="m2303.672 6955.781-30.39-148.008" style="fill:none;stroke-width:8;stroke-linecap:butt;stroke-linejoin:miter;stroke:#000;stroke-opacity:1;stroke-miterlimit:10" transform="matrix(.1 0 0 -.1 -128 713)"></path><path d="M1810.352 6423.672c0 40.234-32.618 72.851-72.813 72.851-40.234 0-72.851-32.617-72.851-72.851s32.617-72.813 72.851-72.813c40.195 0 72.813 32.578 72.813 72.813Zm0 0" style="fill:none;stroke-width:7.97011;stroke-linecap:butt;stroke-linejoin:miter;stroke:#000;stroke-opacity:1;stroke-miterlimit:10" transform="matrix(.1 0 0 -.1 -128 713)"></path><use xlink:href="#d" x="42.83" y="73.21" style="fill:#000;fill-opacity:1"></use><path d="m2191.797 6693.32-388.203-230.43" style="fill:none;stroke-width:8;stroke-linecap:butt;stroke-linejoin:miter;stroke:#000;stroke-opacity:1;stroke-miterlimit:10" transform="matrix(.1 0 0 -.1 -128 713)"></path><path d="M1814.648 6114.844c0 42.578-34.53 77.11-77.109 77.11-42.617 0-77.148-34.532-77.148-77.11s34.53-77.11 77.148-77.11c42.578 0 77.11 34.532 77.11 77.11Zm0 0" 
style="fill:none;stroke-width:7.97011;stroke-linecap:butt;stroke-linejoin:miter;stroke:#000;stroke-opacity:1;stroke-miterlimit:10" transform="matrix(.1 0 0 -.1 -128 713)"></path><use xlink:href="#c" x="43.48" y="105.19" style="fill:#000;fill-opacity:1"></use><path d="M1737.54 6346.836v-150.899" style="fill:none;stroke-width:8;stroke-linecap:butt;stroke-linejoin:miter;stroke:#000;stroke-opacity:1;stroke-miterlimit:10" transform="matrix(.1 0 0 -.1 -128 713)"></path><path d="M1771.406 5806.016c0 18.672-15.195 33.867-33.867 33.867-18.71 0-33.906-15.195-33.906-33.867 0-18.711 15.195-33.868 33.906-33.868 18.672 0 33.867 15.157 33.867 33.868Zm0 0" style="fill:none;stroke-width:7.97011;stroke-linecap:butt;stroke-linejoin:miter;stroke:#000;stroke-opacity:1;stroke-miterlimit:10" transform="matrix(.1 0 0 -.1 -128 713)"></path><path d="M1737.54 6033.75v-189.883" style="fill:none;stroke-width:8;stroke-linecap:butt;stroke-linejoin:miter;stroke:#000;stroke-opacity:1;stroke-miterlimit:10" transform="matrix(.1 0 0 -.1 -128 713)"></path><path d="M2313.008 6423.672c0 42.617-34.531 77.11-77.11 77.11-42.617 0-77.109-34.493-77.109-77.11 0-42.578 34.492-77.11 77.11-77.11 42.578 0 77.109 34.532 77.109 77.11Zm0 0" style="fill:none;stroke-width:7.97011;stroke-linecap:butt;stroke-linejoin:miter;stroke:#000;stroke-opacity:1;stroke-miterlimit:10" transform="matrix(.1 0 0 -.1 -128 713)"></path><use xlink:href="#c" x="93.31" y="74.31" style="fill:#000;fill-opacity:1"></use><path d="m2252.422 6655.898-10.781-151.328" style="fill:none;stroke-width:8;stroke-linecap:butt;stroke-linejoin:miter;stroke:#000;stroke-opacity:1;stroke-miterlimit:10" transform="matrix(.1 0 0 -.1 -128 713)"></path><path d="M2103.633 6114.844c0 18.71-15.156 33.867-33.867 33.867s-33.868-15.156-33.868-33.867 15.157-33.867 33.868-33.867c18.71 0 33.867 15.156 33.867 33.867Zm0 0" style="fill:none;stroke-width:7.97011;stroke-linecap:butt;stroke-linejoin:miter;stroke:#000;stroke-opacity:1;stroke-miterlimit:10" transform="matrix(.1 0 
0 -.1 -128 713)"></path><path d="m2197.46 6352.266-109.765-204.102" style="fill:none;stroke-width:8;stroke-linecap:butt;stroke-linejoin:miter;stroke:#000;stroke-opacity:1;stroke-miterlimit:10" transform="matrix(.1 0 0 -.1 -128 713)"></path><path d="M2479.14 6114.844c0 42.578-34.53 77.11-77.109 77.11-42.617 0-77.148-34.532-77.148-77.11s34.531-77.11 77.148-77.11c42.578 0 77.11 34.532 77.11 77.11Zm0 0" style="fill:none;stroke-width:7.97011;stroke-linecap:butt;stroke-linejoin:miter;stroke:#000;stroke-opacity:1;stroke-miterlimit:10" transform="matrix(.1 0 0 -.1 -128 713)"></path><use xlink:href="#c" x="109.93" y="105.19" style="fill:#000;fill-opacity:1"></use><path d="m2274.297 6352.266 89.297-165.977" style="fill:none;stroke-width:8;stroke-linecap:butt;stroke-linejoin:miter;stroke:#000;stroke-opacity:1;stroke-miterlimit:10" transform="matrix(.1 0 0 -.1 -128 713)"></path><path d="M2435.898 5806.016c0 18.672-15.195 33.867-33.867 33.867-18.71 0-33.906-15.195-33.906-33.867 0-18.711 15.195-33.868 33.906-33.868 18.672 0 33.867 15.157 33.867 33.868Zm0 0" style="fill:none;stroke-width:7.97011;stroke-linecap:butt;stroke-linejoin:miter;stroke:#000;stroke-opacity:1;stroke-miterlimit:10" transform="matrix(.1 0 0 -.1 -128 713)"></path><path d="M2402.031 6033.75v-189.883" style="fill:none;stroke-width:8;stroke-linecap:butt;stroke-linejoin:miter;stroke:#000;stroke-opacity:1;stroke-miterlimit:10" transform="matrix(.1 0 0 -.1 -128 713)"></path><path d="M2853.555 6423.672c0 41.601-33.75 75.351-75.352 75.351-41.64 0-75.351-33.75-75.351-75.351s33.71-75.313 75.351-75.313c41.602 0 75.352 33.711 75.352 75.313Zm0 0" style="fill:none;stroke-width:7.97011;stroke-linecap:butt;stroke-linejoin:miter;stroke:#000;stroke-opacity:1;stroke-miterlimit:10" transform="matrix(.1 0 0 -.1 -128 713)"></path><use xlink:href="#e" x="146.57" y="73.21" style="fill:#000;fill-opacity:1"></use><path d="m2323.945 6693.32 386.016-229.14" 
style="fill:none;stroke-width:8;stroke-linecap:butt;stroke-linejoin:miter;stroke:#000;stroke-opacity:1;stroke-miterlimit:10" transform="matrix(.1 0 0 -.1 -128 713)"></path><path d="M2855.313 6114.844c0 42.578-34.532 77.11-77.11 77.11s-77.148-34.532-77.148-77.11 34.57-77.11 77.148-77.11 77.11 34.532 77.11 77.11Zm0 0" style="fill:none;stroke-width:7.97011;stroke-linecap:butt;stroke-linejoin:miter;stroke:#000;stroke-opacity:1;stroke-miterlimit:10" transform="matrix(.1 0 0 -.1 -128 713)"></path><use xlink:href="#c" x="147.54" y="105.19" style="fill:#000;fill-opacity:1"></use><path d="M2778.203 6344.375v-148.437" style="fill:none;stroke-width:8;stroke-linecap:butt;stroke-linejoin:miter;stroke:#000;stroke-opacity:1;stroke-miterlimit:10" transform="matrix(.1 0 0 -.1 -128 713)"></path><path d="M2812.07 5806.016c0 18.672-15.156 33.867-33.867 33.867s-33.867-15.195-33.867-33.867c0-18.711 15.156-33.868 33.867-33.868s33.867 15.157 33.867 33.868Zm0 0" style="fill:none;stroke-width:7.97011;stroke-linecap:butt;stroke-linejoin:miter;stroke:#000;stroke-opacity:1;stroke-miterlimit:10" transform="matrix(.1 0 0 -.1 -128 713)"></path><path d="M2778.203 6033.75v-189.883" style="fill:none;stroke-width:8;stroke-linecap:butt;stroke-linejoin:miter;stroke:#000;stroke-opacity:1;stroke-miterlimit:10" transform="matrix(.1 0 0 -.1 -128 713)"></path><path d="M3351.914 6732.54c0 41.6-33.75 75.312-75.352 75.312-41.601 0-75.351-33.711-75.351-75.313 0-41.64 33.75-75.351 75.352-75.351 41.601 0 75.351 33.71 75.351 75.351Zm0 0" style="fill:none;stroke-width:7.97011;stroke-linecap:butt;stroke-linejoin:miter;stroke:#000;stroke-opacity:1;stroke-miterlimit:10" transform="matrix(.1 0 0 -.1 -128 713)"></path><use xlink:href="#e" x="196.4" y="42.32" style="fill:#000;fill-opacity:1"></use><path d="m2404.375 7014.492 796.719-257.578" style="fill:none;stroke-width:8;stroke-linecap:butt;stroke-linejoin:miter;stroke:#000;stroke-opacity:1;stroke-miterlimit:10" transform="matrix(.1 0 0 -.1 -128 713)"></path><path 
d="M3353.672 6423.672c0 42.617-34.531 77.11-77.11 77.11s-77.109-34.493-77.109-77.11c0-42.578 34.531-77.11 77.11-77.11s77.109 34.532 77.109 77.11Zm0 0" style="fill:none;stroke-width:7.97011;stroke-linecap:butt;stroke-linejoin:miter;stroke:#000;stroke-opacity:1;stroke-miterlimit:10" transform="matrix(.1 0 0 -.1 -128 713)"></path><use xlink:href="#c" x="197.38" y="74.31" style="fill:#000;fill-opacity:1"></use><path d="M3276.563 6653.203v-148.437" style="fill:none;stroke-width:8;stroke-linecap:butt;stroke-linejoin:miter;stroke:#000;stroke-opacity:1;stroke-miterlimit:10" transform="matrix(.1 0 0 -.1 -128 713)"></path><path d="M3144.297 6114.844c0 18.71-15.156 33.867-33.867 33.867s-33.867-15.156-33.867-33.867 15.156-33.867 33.867-33.867 33.867 15.156 33.867 33.867Zm0 0" style="fill:none;stroke-width:7.97011;stroke-linecap:butt;stroke-linejoin:miter;stroke:#000;stroke-opacity:1;stroke-miterlimit:10" transform="matrix(.1 0 0 -.1 -128 713)"></path><path d="m3238.125 6352.266-109.766-204.102" style="fill:none;stroke-width:8;stroke-linecap:butt;stroke-linejoin:miter;stroke:#000;stroke-opacity:1;stroke-miterlimit:10" transform="matrix(.1 0 0 -.1 -128 713)"></path><path d="M3519.805 6114.844c0 42.578-34.532 77.11-77.11 77.11s-77.11-34.532-77.11-77.11 34.532-77.11 77.11-77.11 77.11 34.532 77.11 77.11Zm0 0" style="fill:none;stroke-width:7.97011;stroke-linecap:butt;stroke-linejoin:miter;stroke:#000;stroke-opacity:1;stroke-miterlimit:10" transform="matrix(.1 0 0 -.1 -128 713)"></path><use xlink:href="#c" x="213.99" y="105.19" style="fill:#000;fill-opacity:1"></use><path d="m3314.96 6352.266 89.298-165.977" style="fill:none;stroke-width:8;stroke-linecap:butt;stroke-linejoin:miter;stroke:#000;stroke-opacity:1;stroke-miterlimit:10" transform="matrix(.1 0 0 -.1 -128 713)"></path><path d="M3476.563 5806.016c0 18.672-15.157 33.867-33.868 33.867-18.71 0-33.867-15.195-33.867-33.867 0-18.711 15.156-33.868 33.867-33.868s33.867 15.157 33.867 33.868Zm0 0" 
style="fill:none;stroke-width:7.97011;stroke-linecap:butt;stroke-linejoin:miter;stroke:#000;stroke-opacity:1;stroke-miterlimit:10" transform="matrix(.1 0 0 -.1 -128 713)"></path><path d="M3442.695 6033.75v-189.883" style="fill:none;stroke-width:8;stroke-linecap:butt;stroke-linejoin:miter;stroke:#000;stroke-opacity:1;stroke-miterlimit:10" transform="matrix(.1 0 0 -.1 -128 713)"></path></svg>
<p>This approach might seem like we're abusing our knowledge of the correct word, but it typically results in a superset of the candidate words that the hash table method would produce. The main assumption is that if a random vector from one word can look very similar to the perfect vector of a second word then a random vector from the second word can look very similar to the perfect vector from the first word. If that holds, and it almost always does, then candidate words from the hash table method will also be produced by this method. It produces realistic lists of candidate words for arbitrary keyboards while speeding up the error rate calculation by over two orders of magnitude when using our full dictionary. This difference is hugely important when performing the optimization.</p>
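<p>The "contained within" search can be sketched in a few lines of plain Python. This is not dodona's <code>RadixTreeMatches</code> implementation: it uses an uncompressed trie built from nested dictionaries instead of a true radix tree, and the matching rule (letters appear in order, doubled letters may reuse the same position, first and last letters must line up) is inferred from the example output above. The names <code>build_trie</code> and <code>trie_matches</code> are hypothetical.</p>

```python
END = "$"  # marker key that stores the complete word at the end of its path


def build_trie(words):
    """Build a simple (uncompressed) trie as nested dictionaries."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node[END] = word
    return root


def trie_matches(trie, string_form):
    """Words contained in string_form that share its first and last letters."""
    results = set()
    last = len(string_form) - 1

    def search(node, pos):
        # pos is the index in string_form of the most recently matched letter.
        if pos == last and END in node:
            results.add(node[END])
        for letter, child in node.items():
            if letter == END:
                continue
            # Starting at pos (not pos + 1) lets a doubled letter, as in
            # "poot", match the same position in the string form twice.
            for nxt in range(pos, len(string_form)):
                if string_form[nxt] == letter:
                    search(child, nxt)

    if string_form and string_form[0] in trie:
        search(trie[string_form[0]], 0)
    return sorted(results)


words = ["poot", "pout", "pott", "pot", "pit", "putt", "put",
         "top", "pat", "poi", "quit"]
matches = trie_matches(build_trie(words), "poiuyt")
print(matches)
```

<p>With this small word list the search returns the same seven words as the example above (in sorted order) and rejects "top", "pat", "poi", and "quit". Because the recursion only follows branches whose next letter still appears later in the string form, most of the dictionary is never visited.</p>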
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="optimization">Optimization<a href="https://sangaline.com/blog/finding-an-optimal-keyboard-layout-for-swype#optimization" class="hash-link" aria-label="Direct link to Optimization" title="Direct link to Optimization" translate="no">​</a></h3>
<p>There are 26! ≈ 4×10<sup>26</sup> possible keyboard layouts to consider, even when we limit ourselves to permutations of the letters. If we could calculate the error rate of one layout per nanosecond then it would still take roughly 13 billion years, comparable to the current age of the universe, to explore the full state space. Needless to say, it's simply not possible to test every keyboard. We can still do our best to find significantly improved keyboards though.</p>
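<p>The size of the search space is easy to check. The short calculation below assumes one layout evaluated per nanosecond and a 365.25-day year; the variable names are only for illustration.</p>

```python
import math

layouts = math.factorial(26)            # permutations of the 26 letters
seconds = layouts * 1e-9                # one layout evaluated per nanosecond
years = seconds / (365.25 * 24 * 3600)  # seconds in a Julian year

print(f"{layouts:.2e} layouts")
print(f"about {years:.2e} years at one layout per nanosecond")
```

<p>The result is on the order of 10<sup>10</sup> years, the same order of magnitude as the age of the universe, so an exhaustive search is out of the question.</p>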
<p>In the T9 section we employed a random walk optimization. For the swipe optimization we use a similar approach but gradually reduce the number of random swaps over time so that the keyboard settles into a local minimum. The number of swaps in this procedure is analogous to the temperature in a physical or simulated annealing process. An illustration of how it evolves a keyboard layout to have a lower error rate is shown below.</p>
<div>Loading keyboard slider...</div>
<p>The blue shading of the keys here indicates how frequently each letter is used in our word list. When you drag the slider you can see how the more frequently used keys tend to move towards the outside as the optimization progresses. We ran 256 optimizations like the one above, each starting with a unique random keyboard. Below you can see the average error rate at each iteration as well as the minimum and maximum.</p>
<p><img decoding="async" loading="lazy" alt="swype optimization" src="https://sangaline.com/assets/images/swipe.optimization-d674faa210fde1caeb811696ead310ec.svg" width="960" height="576" class="img_ev3q"></p>
<p>You can see from this plot that the average error rate of the random starting points is about 13%. QWERTY is close to average but slightly worse with an error rate of 15.3%. The best and worst keyboards that we encountered had error rates of 7.7% and 27.2%. These correspond to a 51% reduction in error rates and a 78% increase in error rates relative to QWERTY. As we mentioned before: the absolute error rates don't mean a lot by themselves but their ratios are fairly robust. Things could be a lot worse than QWERTY but they could also be a lot better.</p>
<p><img decoding="async" loading="lazy" alt="best and worst keyboards" src="https://sangaline.com/assets/images/best.worst.keyboards-fbc1035470677448277f948a469d1b57.svg" width="1152" height="384" class="img_ev3q"></p>
<p>Looking at the best and worst keyboards we find that they have a bit in common with the optimized T9 layouts that we looked at. The worst T9 layout had "eiao" all on one key and the worst keyboard here has those clustered together near the center of the keyboard while they're very spread out on the best keyboard. The clustering of "ts" and "nf" had also been an issue with T9 and we find that they're both separated by a full row here in the best keyboard. This shouldn't really come as much of a surprise because these have to do with common words that only differ by a letter.</p>
<p>We can also see that the center keys are occupied by less frequently used letters. These keys are swyped over very frequently and we can reduce the ambiguity in long swypes by populating them with letters that don't often occur in the middle of words. There are a few other trends to pick out but it really comes down to a complicated interplay between the structure of the keyboard and the English language, especially as you push the error rate lower and lower.</p>
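<p>For readers who want to experiment, the swap-based optimization described in this section can be sketched as follows. This is a simplified illustration under stated assumptions, not the code we ran. The linear cooling schedule and the rule of keeping a candidate only when it improves the error rate are assumptions, and <code>toy_error_rate</code> is a hypothetical stand-in for the simulated swipe error rate. The names <code>anneal_layout</code>, <code>max_swaps</code>, and <code>iterations</code> are likewise made up.</p>

```python
import random
import string


def anneal_layout(error_rate, iterations=200, max_swaps=8, seed=0):
    """Randomly swap keys, using fewer swaps per step as the run progresses."""
    rng = random.Random(seed)
    layout = list(string.ascii_lowercase)
    rng.shuffle(layout)  # start from a random keyboard
    best_error = error_rate(layout)
    for iteration in range(iterations):
        # Assumed schedule: cool linearly from max_swaps down to one swap.
        swaps = max(1, round(max_swaps * (1 - iteration / iterations)))
        candidate = layout[:]
        for _ in range(swaps):
            i, j = rng.sample(range(len(candidate)), 2)
            candidate[i], candidate[j] = candidate[j], candidate[i]
        candidate_error = error_rate(candidate)
        # Assumed acceptance rule: keep the candidate only if it improves.
        if best_error > candidate_error:
            layout, best_error = candidate, candidate_error
    return "".join(layout), best_error


def toy_error_rate(layout):
    """Placeholder objective: fraction of letters out of alphabetical order."""
    return sum(a != b for a, b in zip(layout, string.ascii_lowercase)) / 26


layout, error = anneal_layout(toy_error_rate)
print(layout, error)
```

<p>Swapping in a real error rate estimate for <code>toy_error_rate</code> is the expensive part. The many swaps early on let the layout jump around the state space, and the single swaps at the end let it settle into a nearby local minimum.</p>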
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="conclusion">Conclusion<a href="https://sangaline.com/blog/finding-an-optimal-keyboard-layout-for-swype#conclusion" class="hash-link" aria-label="Direct link to Conclusion" title="Direct link to Conclusion" translate="no">​</a></h3>
<p>So does any of this matter at all? Apple and Google probably aren't about to ship new keyboard layouts that are optimized for swype. Even if they did, most people probably wouldn't want to use them. There are lots of third party keyboards available though, so somebody could make one that lets adventurous users choose their own layouts (if you do this, let us know!). We are talking about <em>smart</em> phones here after all.</p>
<p>Really though, we just thought it was a fun problem to explore. As touchscreens become increasingly pervasive it feels like people are more and more ready for a paradigm shift in terms of entering text. Rearranging the keyboard isn't going to be that shift but we think it's important that people think about what might be. Hopefully <a href="https://github.com/sangaline/dodona" target="_blank" rel="noopener noreferrer" class="">dodona</a> can help some other people do that; it certainly helped us.</p>
<p>If you made it this far: thanks for reading!</p>]]></content:encoded>
        </item>
    </channel>
</rss>