You’ll probably get more out of this post if you read this one first Continuing from where I left off, I now want to use a normal distribution (i.e. gaussian) to calculate my cookie box distributions. Unlike the poisson distribution I used earlier, with just the mean as a parameter, the normal distribution is described by both a mean and a standard deviation. Chocolate Chip Likelihood The probability that a box was generated by a normal distribution with mean ($\mu$) and standard deviation ($\sigma$) is given by $$ \prod_{i=1}^{10} \textrm{normal}(\mu, \sigma)\textrm{.pdf}(\textrm{cookie_box}(i))$$ where pdf is the probability density function and cookie_box(i) is the number of chocolate chips on cookie i. 1 2 3 4 5 6 7 # Likelihood for the heavenly_bites box at mu = 8 and mu = 16 with sigma = 1 np.product(stats.norm(8, 1).pdf(heavenly_bites)) # => 5.35e-26 np.product(stats.norm(16, 1).pdf(heavenly_bites)) # => 1.29e-213 Since my parameter space is 2 dimensional ($\mu$, $\sigma$), I get to use these neat (but not as clear) 3D graphs to show the likelihood plots for the different cookie boxes. Cookie Posteriors Each step in this process is almost exactly the same as in the 1 parameter poisson case, except now with an additional dimension. The cookie prior is now the plane, pdf($\mu$, $\sigma$) = 1. Calculating the posterior is way more computationally expensive in this case. To get the same point density as the poisson model, I’d need to update the prior at $ 20,000 \times 20,000 $, i.e. $400,000,000$ points. What’s worse, most of those posterior values in this space are ~0. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 # Thanks to numpy arrays, this part of the code doesn't change at all. # heavenly_bites_likelihood and cookie_prior are both numpy arrays with dimensions (20000, 10000). # They contain y values for each function, for mu from [0, 20] and sigma [0, 10] # in increments of 0.001 # The unnormalized posterior, the numerator in Bayes' Rule heavenly_bites_un_posterior = np.multiply(heavenly_bites_likelihood, cookie_prior) # Approximating the integral in the denominator normalising_factor = np.sum(heavenly_bites_un_posterior) # Applying Bayes' Rule heavenly_bites_posterior = np.divide(heavenly_bites_un_posterior, normalising_factor) Calculating the credible interval is a little trickier here but the core idea is the same (gist) Cookie Box (0.95) $\mu$ $\sigma$ Heavenly Bites (3.72, 9.56) (1.72, 6.06) Hobbes’ Nobbs (9.95, 17.19) (2.12, 7.38) Kant Touch This (10.85, 18.69) (2.42, 8.18) Results A poisson distribution with mean $\lambda$ can be approximated by a normal distribution with mean and variance both $\lambda$, its not a great approximation at $\lambda < 1000$ but it does help me test my results. Cookie Box (0.95) actual $\lambda$ expected $\mu$ expected $\sigma$ Heavenly Bites 8 8 2.83 Hobbes’ Nobbs 12 12 3.46 Kant Touch This 16 16 4.00 Excellent! The actual values are within the predicted ranges, the model works. Issues The main problem is the cost to calculate, every additional dimension or an increase in the range can make the model prohibitively expensive. Fortunately for us, Monte Carlo methods like the Metropolis algorithm generate useful results far more efficiently. More about these in a later post. In the meantime, take a look at Chris’ post on the subject

I love chocolate chip cookies, the more chocolate chips the better I’ve managed to narrow down my favourites to 5 brands1, Heavenly Bites, Subjectively Scrumptious, Hobbes’ Nobs, The Happiness Mill and Kant Touch This I’d like to consistently buy the brand with the most chocolate chips per cookie and so, for science, I pick up a box of each with 10 cookies per box. 1 2 3 4 5 6 7 8 9 10 11 12 import numpy as np from scipy import stats np.random.seed(1) # for repeatability # Each box of cookies will be generated by drawing from a poisson distribution with # heavenly_bites having the lowest mean (8) and kant_touch_this, the highest (16) heavenly_bites = stats.poisson(8).rvs(10) subjectively_scrupmtious = stats.poisson(10).rvs(10) hobbes_nobs = stats.poisson(12).rvs(10) the_happiness_mill = stats.poisson(14).rvs(10) kant_touch_this = stats.poisson(16).rvs(10) 1 2 3 4 # each number in the array represents the number of chocolate chips on a cookie heavenly_bites # => [ 2 4 6 11 4 10 7 6 6 10] Chocolate Chip Likelihood Just like earlier I need a way to measure the likelihood of seeing a particular set of cookies. For this problem I’m going to assume that each bakery produces cookies following a poisson distribution around some mean. The probability that a box was generated by a poisson distribution with mean, $\lambda$, is given by $$ \prod_{i=1}^{10} \textrm{poisson}(\lambda)\textrm{.pmf}(\textrm{cookie_box}(i))$$ where pmf is the probability mass function and cookie_box(i) is the number of chocolate chips on cookie i. 1 2 3 4 5 6 7 # Likelihood for the heavenly_bites box at lambda = 8 and lambda = 16 np.product(stats.poisson(8).pmf(heavenly_bites)) # => 6.36e-12 np.product(stats.poisson(16).pmf(heavenly_bites)) # => 8.47e-27 These are the likelihood plots for Heavenly Bites, Hobbes’ Nobs and Kant Touch This (see how the peaks of the functions are near the true lambda values). Cookie Posteriors I’m going to assume that it’s impossible to have a cookie with more than 20 chips (or less than 0), I’m also choosing an initial prior that assigns equal probability to all possible values of $\lambda$ in this space. Unfortunately, there’s no clever hack I can use to easily calculate the posterior, I’ll have to resort to a numerical approximation of Bayes Rule. 1 2 3 4 5 6 7 8 9 10 11 12 # heavenly_bites_likelihood and cookie_prior are both long numpy arrays. # with dimensions (20000, 1) containing y values for x in [0, 20] but # in incrememnts of 0.001 # The unnormalized posterior, the numerator in Bayes' Rule heavenly_bites_un_posterior = np.multiply(heavenly_bites_likelihood, cookie_prior) # Approximating the integral in the denominator normalising_factor = np.sum(heavenly_bites_un_posterior) # Applying Bayes' Rule heavenly_bites_posterior = np.divide(heavenly_bites_un_posterior, normalising_factor) Heavenly Bites’ posterior is a lot more concentrated than the others but thats the way the cookie crumbles, it all depends on the amount of information each cookie box provides, a different set of cookies could have led to a different result. Here’s a gist of the function I used to figure out the intervals. Sleight of Hand The real $\lambda$ values are 8, 12 and 16; all within a 0.95 credible interval… so I have my answer, case closed? Well not really, I cheated, picking a poisson model over any other guaranteed good results, after all, thats what I used to generate the cookie boxes in the first place. If I was really approaching this problem without this information, I might have chosen a gaussian model or something else entirely. I’m going to try that next. Update: A Clever Hack As Chris pointed out, a Gamma distribution forms a conjugate prior for a Poisson likelihood. A Gamma distribution has the form $\textrm{gamma}(a, b)$ where $a$ is the shape parameter and $b$ is the inverted scale parameter. If I have a chocolate chip distribution prior of $\textrm{gamma}(a, b)$ and then examine $n$ cookies, my updated posterior would be $$\textrm{gamma}(a + \sum\limits_{i=1}^n \textrm{cookie_box}(i), b + n)$$ An uninformative gamma prior looks like $\textrm{gamma}(a = \textrm{~0}, b = \textrm{~0})$ where ~0 is very small but still positive. 1 2 3 4 5 6 7 from scipy.stats import gamma # Updating the gamma(~0, ~0) prior shape = 0 + sum(heavenly_bites) inverted_scale = 0 + len(heavenly_bites) heavenly_bites_posterior = gamma(shape, scale=1.0/inverted_scale) The credible intervals are almost the same in this case, the biggest difference being the values for $\textrm{pdf}(\lambda)$ which is an effect of the numerical approximation. The relative shapes of the distributions seem identical. An efficient way of solving the same problem and also verifies the numerical approach. Named for the philosophers found in the excellent Socrates Jones flash game.↩

While Chris has done an excellent job explaining this concept, I’m having too much fun with my coin example to stop now. A Maybe Biased Coin Suppose I try my hand at casting my own coin. If my work is even half as good, my coin is going to end up being rather uneven and most likely biased. Unlike the double headed coin problem, I know very little about the coin’s bias. One (not particularly good) choice for a prior is to assign equal probability to all possibilities. In my last post, $\theta$ could only be one of two values and the prior reflected that. The height of a bar represented the probability mass for each value of $\theta$ and the sum of all heights was 1. In this case $\theta$ can take any value between 0 and 1. It doesn’t make sense to talk about the probability for a specific value of $\theta$, the probability at any one point is infinitesimal. Here the prior is plotted as a probability density function, I can calculate the probability mass between any two points $(a, b)$ by integrating $\mathrm{pdf}(\theta)$ over $a$ and $b$ Since this is a probability density function, $\int_{0}^{1} \mathrm{pdf}(\theta)\cdot d\theta = 1$ Continuous Likelihood The binomial distribution is a function which plots the probability of any independent binary event (like a coin flip) succeeding for a number of attempts. If a binary event has a probability $\theta$ of succeeding and succeeds k times out of N $$\mathrm{P}(\textrm{k successes, N attempts} | \theta) = \binom{N}{k}\cdot\theta^{k}\cdot(1-\theta)^{N-k}$$ Imagine I flip my coin 10 times and end up with only 2 heads. For an unbiased coin, this is fairly unlikely $$\mathrm{P}(\textrm{2 successes, 10 attempts} | \theta=0.5) $$ $$ = \binom{10}{2}\cdot(0.5)^{2}\cdot(1 - 0.5)^{10 - 2} $$ $$ = 0.001 $$ Just like last time with a larger number of flips, the likelihood for most values of $\theta$ approaches 0. With additional flips, the likelihood spread gets smaller with the highest mass at $\frac{k}{N}$. Continuous Posteriors I can update the posterior using Bayes’ rule $$ \mathrm{posterior}(\theta) = \frac{\mathrm{likelihood}(\theta) \cdot \mathrm{prior}(\theta)}{\int_{0}^{1} \mathrm{likelihood}(\theta) \cdot \mathrm{prior}(\theta)\cdot d\theta} $$ Usually I’d have to approximate a solution using numerical integration but there’s a simpler solution for this particular type of problem. Conjugate Priors If the prior and the posterior belong to the same function family, it can make computing the posterior much simpler. For example, if my prior is Gaussian and my likelihood is Gaussian then my posterior will also be Gaussian, i.e. Gaussian functions are conjugate to themselves. In this case, since my likelihood function is a binomial distribution, its conjugate prior is a beta distribution. Specifically, if my prior is of the form $\mathrm{beta}(a, b)$, where a and b are the parameters of the distribution, and the likelihood function is a binomial distribution with N attempts and k successes, then the posterior would be a beta distribution with the parameters a + k and b + N - k. Updating the posterior reduces to simple addition. Updating the Posterior The uniform prior is the same as $\mathrm{beta}(1, 1)$ For 2 heads out of 10 flips and the prior $\mathrm{beta}(1, 1)$ $$ \mathrm{posterior}(\theta) $$ $$ = \mathrm{beta}(a + k, b + N - k) $$ $$ = \mathrm{beta}(1 + 2, 1 + 10 - 2) $$ $$ = \mathrm{beta}(3, 9) $$ Since pdf is a function for probability density and not probability mass, $\mathrm{pdf}(\theta)$ can be greater than 1 as long as the integral over (0, 1) is 1. Credibility It doesn’t make sense to ask for the probability of a particular $\theta$ but its useful to know the probability of a range. Knowing that $P(0.1\lt\theta\lt0.3) = 0.95$ is far better than where I started, especially as the range gets smaller. This is a credible interval and there are several ways to pick one. I like to pick the highest density interval, the shortest range (or set of ranges) that contains a certain amount of probability mass (0.95 for example). For a unimodal distribution, like a beta distribution, there is only 1 highest density interval. Making a Decision There will always be uncertainty and not making a decision is often not an option. In this situation, I have a few choices. I could pick the mean, which is the expected value of $\theta$, or the mode, which is the most likely value of the distribution (the peak). a b mode mean variance 1 2 0 0.33 0.05 2 5 0.2 0.29 0.03 3 9 0.2 0.25 0.01 Either way, with additional flips the variance drops and both values of central tendency more accurately predict $\theta$ Complications But what about if my $\theta$ changes over time or I have events with multiple possible values and can’t use beta distributions or even numerical integration. More on that next time.

While Chris has done an excellent job explaining this concept, I’m having too much fun with my coin example to stop now. A Maybe Biased Coin Suppose I try my hand at casting my own coin. If my work is even half as good, my coin is going to end up being rather uneven and most likely biased. Unlike the double headed coin problem, I know very little about the coin’s bias. One (not particularly good) choice for a prior is to assign equal probability to all possibilities. In my last post, $\theta$ could only be one of two values and the prior reflected that. The height of a bar represented the probability mass for each value of $\theta$ and the sum of all heights was 1. In this case $\theta$ can take any value between 0 and 1. It doesn’t make sense to talk about the probability for a specific value of $\theta$, the probability at any one point is infinitesimal. Here the prior is plotted as a probability density function, I can calculate the probability mass between any two points $(a, b)$ by integrating $\mathrm{pdf}(\theta)$ over $a$ and $b$ Since this is a probability density function, $\int_{0}^{1} \mathrm{pdf}(\theta)\cdot d\theta = 1$ Continuous Likelihood The binomial distribution is a function which plots the probability of any independent binary event (like a coin flip) succeeding for a number of attempts. If a binary event has a probability $\theta$ of succeeding and succeeds k times out of N $$\mathrm{P}(\textrm{k successes, N attempts} | \theta) = \binom{N}{k}\cdot\theta^{k}\cdot(1-\theta)^{N-k}$$ Imagine I flip my coin 10 times and end up with only 2 heads. For an unbiased coin, this is fairly unlikely $$\mathrm{P}(\textrm{2 successes, 10 attempts} | \theta=0.5) $$ $$ = \binom{10}{2}\cdot(0.5)^{2}\cdot(1 - 0.5)^{10 - 2} $$ $$ = 0.001 $$ Just like last time with a larger number of flips, the likelihood for most values of $\theta$ approaches 0. With additional flips, the likelihood spread gets smaller with the highest mass at $\frac{k}{N}$. Continuous Posteriors I can update the posterior using Bayes’ rule $$ \mathrm{posterior}(\theta) = \frac{\mathrm{likelihood}(\theta) \cdot \mathrm{prior}(\theta)}{\int_{0}^{1} \mathrm{likelihood}(\theta) \cdot \mathrm{prior}(\theta)\cdot d\theta} $$ Usually I’d have to approximate a solution using numerical integration but there’s a simpler solution for this particular type of problem. Conjugate Priors If the prior and the posterior belong to the same function family, it can make computing the posterior much simpler. For example, if my prior is Gaussian and my likelihood is Gaussian then my posterior will also be Gaussian, i.e. Gaussian functions are conjugate to themselves. In this case, since my likelihood function is a binomial distribution, its conjugate prior is a beta distribution. Specifically, if my prior is of the form $\mathrm{beta}(a, b)$, where a and b are the parameters of the distribution, and the likelihood function is a binomial distribution with N attempts and k successes, then the posterior would be a beta distribution with the parameters a + k and b + N - k. Updating the posterior reduces to simple addition. Updating the Posterior The uniform prior is the same as $\mathrm{beta}(1, 1)$ For 2 heads out of 10 flips and the prior $\mathrm{beta}(1, 1)$ $$ \mathrm{posterior}(\theta) $$ $$ = \mathrm{beta}(a + k, b + N - k) $$ $$ = \mathrm{beta}(1 + 2, 1 + 10 - 2) $$ $$ = \mathrm{beta}(3, 9) $$ Since pdf is a function for probability density and not probability mass, $\mathrm{pdf}(\theta)$ can be greater than 1 as long as the integral over (0, 1) is 1. Credibility It doesn’t make sense to ask for the probability of a particular $\theta$ but its useful to know the probability of a range. Knowing that $P(0.1\lt\theta\lt0.3) = 0.95$ is far better than where I started, especially as the range gets smaller. This is a credible interval and there are several ways to pick one. I like to pick the highest density interval, the shortest range (or set of ranges) that contains a certain amount of probability mass (0.95 for example). For a unimodal distribution, like a beta distribution, there is only 1 highest density interval. Making a Decision There will always be uncertainty and not making a decision is often not an option. In this situation, I have a few choices. I could pick the mean, which is the expected value of $\theta$, or the mode, which is the most likely value of the distribution (the peak). a b mode mean variance 1 2 0 0.33 0.05 2 5 0.2 0.29 0.03 3 9 0.2 0.25 0.01 Either way, with additional flips the variance drops and both values of central tendency more accurately predict $\theta$ Complications But what about if my $\theta$ changes over time or I have events with multiple possible values and can’t use beta distributions or even numerical integration. More on that next time.

Plotting Priors I often find it useful to plot probability functions, it gives me a better idea on how my probabilities are being updated. Take the double headed coin problem. A jar has 1000 coins, of which 999 are fair and 1 is double headed. Pick a coin at random, and toss it 10 times. Given that you see 10 heads, what is the probability that the next toss of that coin is also a head? This plot represents my prior probability for $\theta$, the probability that the coin I pull from the jar will land heads. Since most of the coins are unbiased (999 out of a 1000), most of the mass is concentrated over 0.5. Theres an invisible sliver above 1.0 for the double headed coin. To calculate P(heads), I take the sum over $\theta$ weighted by $P(\theta)$ $$ = \sum\limits_{i=1}^N \theta_i \cdot P(\theta_i) $$ $$ = 0.5 \cdot \frac{999}{1000} + 1.0 \cdot \frac{1}{1000} $$ $$ P(\textrm{ heads }) = 0.5005 $$ Coin Flip Likelihood After seeing new evidence, to compute a posterior I need both, the prior and the likelihood of the evidence. The posterior is the normalised product of the prior and the likelihood. For an unbiased coin, the probability of z heads in a row is $(0.5)^{z}$, for the double headed coin its $(1)^{z}$ which is $1$. In general, the probability of landing z heads in a row for a coin is $(\theta)^{z}$ when $P(\textrm{ heads }) = \theta$ (Plotting the likelihood for several flips) See how quickly the likelihood falls for lower values of $\theta$, 10 heads in a row is very unlikely and the plot after 10 flips reflects that. While the likelihood for individual $\theta$s is a probability, the likelihood plot isn’t a probability distribution and doesn’t need to sum to 1. Plotting Posteriors I can update the posterior at every $\theta$ using Bayes’ rule $$ \mathrm{posterior}(\theta) = \frac{\mathrm{likelihood}(\theta) \cdot \mathrm{prior}(\theta)}{\sum\limits_{i=1}^N \mathrm{likelihood}(\theta_i) \cdot \mathrm{prior}(\theta_i)} $$ The key take away here is that the posterior always lies somewhere between the prior and the likelihood. The complicated looking denominator is only there to ensure that the posterior sums up to 1 making it a valid probability distribution. The likelihood function is exponential in z. The first few flips barely change the distribution but by z=10 either kind of coin is equally likely. Calculating $ P(\textrm{ heads }) $ for each $ z $ $ P(\textrm{ heads } | z=\space\space1) = 0.5 \cdot 0.998 + 1.0 \cdot 0.002 = 0.501 $ $ P(\textrm{ heads } | z=\space\space5) = 0.5 \cdot 0.969 + 1.0 \cdot 0.031 = 0.516 $ $ P(\textrm{ heads } | z=10) = 0.5 \cdot 0.494 + 1.0 \cdot 0.506 = 0.753 $ Further Reading Plotting posteriors might give you insights you otherwise would have missed and it hints at even more. Bayes Rule works just as well for continuous functions, sums become integrals but the core idea remains the same. There are even neat hacks to make some of those calculations trivial.

Plotting Priors I often find it useful to plot probability functions, it gives me a better idea on how my probabilities are being updated. Take the double headed coin problem. A jar has 1000 coins, of which 999 are fair and 1 is double headed. Pick a coin at random, and toss it 10 times. Given that you see 10 heads, what is the probability that the next toss of that coin is also a head? This plot represents my prior probability for $\theta$, the probability that the coin I pull from the jar will land heads. Since most of the coins are unbiased (999 out of a 1000), most of the mass is concentrated over 0.5. Theres an invisible sliver above 1.0 for the double headed coin. To calculate P(heads), I take the sum over $\theta$ weighted by $P(\theta)$ $$ = \sum\limits_{i=1}^N \theta_i \cdot P(\theta_i) $$ $$ = 0.5 \cdot \frac{999}{1000} + 1.0 \cdot \frac{1}{1000} $$ $$ P(\textrm{ heads }) = 0.5005 $$ Coin Flip Likelihood After seeing new evidence, to compute a posterior I need both, the prior and the likelihood of the evidence. The posterior is the normalised product of the prior and the likelihood. For an unbiased coin, the probability of z heads in a row is $(0.5)^{z}$, for the double headed coin its $(1)^{z}$ which is $1$. In general, the probability of landing z heads in a row for a coin is $(\theta)^{z}$ when $P(\textrm{ heads }) = \theta$ (Plotting the likelihood for several flips) See how quickly the likelihood falls for lower values of $\theta$, 10 heads in a row is very unlikely and the plot after 10 flips reflects that. While the likelihood for individual $\theta$s is a probability, the likelihood plot isn’t a probability distribution and doesn’t need to sum to 1. Plotting Posteriors I can update the posterior at every $\theta$ using Bayes’ rule $$ \mathrm{posterior}(\theta) = \frac{\mathrm{likelihood}(\theta) \cdot \mathrm{prior}(\theta)}{\sum\limits_{i=1}^N \mathrm{likelihood}(\theta_i) \cdot \mathrm{prior}(\theta_i)} $$ The key take away here is that the posterior is simply the prior scaled by the likelihood and then renormalised. The complicated looking denominator is only there to ensure that the posterior sums up to 1 making it a valid probability distribution. The likelihood function is exponential in z. The first few flips barely change the distribution but by z=10 either kind of coin is equally likely. Calculating $ P(\textrm{ heads }) $ for each $ z $ $ P(\textrm{ heads } | z=\space\space1) = 0.5 \cdot 0.998 + 1.0 \cdot 0.002 = 0.501 $ $ P(\textrm{ heads } | z=\space\space5) = 0.5 \cdot 0.969 + 1.0 \cdot 0.031 = 0.516 $ $ P(\textrm{ heads } | z=10) = 0.5 \cdot 0.494 + 1.0 \cdot 0.506 = 0.753 $ Further Reading Plotting posteriors might give you insights you otherwise would have missed and it hints at even more. Bayes Rule works just as well for continuous functions, sums become integrals but the core idea remains the same. There are even neat hacks to make some of those calculations trivial.

In my last post I had started building a fanfiction recommender using collaborative filtering. In this post I’ll work through my solution which you can find here Author Similarity The core of a recommender is the similarity metric, how similar are two authors and how similar are you to them. Since I have sets of stories for each author (and a set of my own favourites), I can use a set distance metric as a measure of similarity. The Jaccard Index is just the the thing. For any two finite sets, the Jaccard Index is the size of the intersection of the sets divided by the size of their union $$ J(A, B) = \frac{| \space A \cap B \space |}{| \space A \cup B \space |} $$ The Jaccard Index lies between 0 and 1, inclusive. 1 when both sets are identical and 0 if they have no elements in common. $$ A = \{ 1, 2, 3 \}, \space B = \{4, 5, 6\}, \space J(A, B) = \frac{0}{6} = 0 $$ $$ A = \{ 1, 2, 3 \}, \space B = \{2, 3, 4\}, \space J(A, B) = \frac{2}{4} = 0.5 $$ $$ A = \{ 1, 2, 3 \}, \space B = \{1, 2, 3\}, \space J(A, B) = \frac{3}{3} = 1 $$ For each author I calculated the Jaccard Index of a set containing their own stories and favourite stories with a set of my favourite stories to get something like the table below. Top 5 authors author name similarity 4976703 alexanderwales 0.384615 5118664 daystar721 0.333333 3989854 Sir Poley 0.222222 4767519 Scientist’s Thesis 0.187500 3344060 Velorien 0.166667 Ranking Stories Now that I’ve got a similarity score for actors, I still need to convert this into a story recommendation. A possible strategy is to calculate the average similarity across all authors who have favourited or written a story and use that for the story preference. For example, if alexanderwales (0.38) wrote a story and Sir Poley (0.22) is the only person who liked it, the preference score would be $ \frac{0.38 + 0.22}{2} = 0.3 $ This could work because Similar authors to me like similar stories which are scored higher, while disimilar authors would have lower scores. Outliers, stories I wouldn’t have liked but are liked by a very similar author would be lowered by the dissimilar authors who liked it. This may not work because If a story hasn’t been liked by enough authors, the average wont be very representative. Very popular stories, liked by a large group of people would be ranked lower even if I’d have liked them. Top 5 stories story title sim_total sim_count sim_score 10327510 A Bluer Shade of White 0.384615 1 0.384615 10023949 Harry Potter and the Philosopher’s Zombie 0.717949 2 0.358974 9676374 Daystar’s Remix of Rationality 0.333333 1 0.333333 9794740 Pokemon: The Origin of Species 0.967949 4 0.241987 9658524 Branches on the Tree of Time 0.469361 2 0.234681 Conclusions A Bluer Shade of White and Daystar's Remix of Rationality both have only 1 data point so are probably misscored, on the other hand a story I do like, Methods of rationality has a score of 0.08 because of the popularity problem. Both of these problems are a result of limited data (only 100 authors) and should go away with more. I still see a bunch of stories I haven’t read which look promising and I hadn’t heard of before so tentatively successful, I’ll need to read them to make sure. I encourage you to try this yourself, I’ll be interested to see if it helps. [update] Recommending stories you aren’t interested in? Try out filtered.ipynb to use only a subset of the data filtered by genre and category.

In my last post I had started building a fanfiction recommender using collaborative filtering. In this post I’ll work through my solution which you can find here Author Similarity The core of a recommender is the similarity metric, how similar are two authors and how similar are you to them. Since I have sets of stories for each author (and a set of my own favourites), I can use a set distance metric as a measure of similarity. The Jaccard Index is just the the thing. For any two finite sets, the Jaccard Index is the size of the intersection of the sets divided by the size of their union $$ J(A, B) = \frac{| \space A \cap B \space |}{| \space A \cup B \space |} $$ The Jaccard Index lies between 0 and 1, inclusive. 1 when both sets are identical and 0 if they have no elements in common. $$ A = \{ 1, 2, 3 \}, \space B = \{4, 5, 6\}, \space J(A, B) = \frac{0}{6} = 0 $$ $$ A = \{ 1, 2, 3 \}, \space B = \{2, 3, 4\}, \space J(A, B) = \frac{2}{4} = 0.5 $$ $$ A = \{ 1, 2, 3 \}, \space B = \{1, 2, 3\}, \space J(A, B) = \frac{3}{3} = 1 $$ For each author I calculated the Jaccard Index of a set containing their own stories and favourite stories with a set of my favourite stories to get something like the table below. Top 5 authors author name similarity 4976703 alexanderwales 0.384615 5118664 daystar721 0.333333 3989854 Sir Poley 0.222222 4767519 Scientist’s Thesis 0.187500 3344060 Velorien 0.166667 Ranking Stories Now that I’ve got a similarity score for actors, I still need to convert this into a story recommendation. A possible strategy is to calculate the average similarity across all authors who have favourited or written a story and use that for the story preference. For example, if alexanderwales (0.38) wrote a story and Sir Poley (0.22) is the only person who liked it, the preference score would be $ \frac{0.38 + 0.22}{2} = 0.3 $ This could work because Similar authors to me like similar stories which are scored higher, while disimilar authors would have lower scores. Outliers, stories I wouldn’t have liked but are liked by a very similar author would be lowered by the dissimilar authors who liked it. This may not work because If a story hasn’t been liked by enough authors, the average wont be very representative. Very popular stories, liked by a large group of people would be ranked lower even if I’d have liked them. Top 5 stories story title sim_total sim_count sim_score 10327510 A Bluer Shade of White 0.384615 1 0.384615 10023949 Harry Potter and the Philosopher’s Zombie 0.717949 2 0.358974 9676374 Daystar’s Remix of Rationality 0.333333 1 0.333333 9794740 Pokemon: The Origin of Species 0.967949 4 0.241987 9658524 Branches on the Tree of Time 0.469361 2 0.234681 Conclusions A Bluer Shade of White and Daystar's Remix of Rationality both have only 1 data point so are probably misscored, on the other hand a story I do like, Methods of rationality has a score of 0.08 because of the popularity problem. Both of these problems are a result of limited data (only 100 authors) and should go away with more. I still see a bunch of stories I haven’t read which look promising and I hadn’t heard of before so tentatively successful, I’ll need to read them to make sure. I encourage you to try this yourself, I’ll be interested to see if it helps. [update] Recommending stories you aren’t interested in? Try out filtered.ipynb to use only a subset of the data filtered by genre and category.

I’m currently working through Python for Data Analysis (you can follow along here) using the useful IPython Notebook and Pandas. While the book has been enlightening, I think its time to dive into a meaty problem to get some experience. Some of the stories on fanfiction.net are brilliant but they’re very difficult to find, there are just too many to look through. So far I’ve relied on r/rational as a filter but I think I can do better. My goal is to create a simple collaborative filtering recommender using the preferences of the authors on fanfiction.net. The Data Authors on fanfiction.net have lists of stories they’ve written, favourite stories and favourite authors. Each story has several attributes like chapters and categories I’ve written a web scraper which collects author records using a breadth first search starting with a seed list of author ids. In case you don’t feel like picking up Clojure or writing your own scraper I’ve uploaded a 100 author records here. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 // an author record {"id":"2269863", "name":"Less Wrong", "author-stories":[{"id":"5782108", "title":"Harry Potter and the Methods of Rationality", "author":"2269863", "date-submitted":"2010-02-28T08:12:39Z", "date-updated":"2014-07-26T02:01:00Z", "categories":["Harry Potter"], "genres":["Drama", "Humor"], "chapters":102, "word-count":565501, "follows":13290, "favourites":14432, "reviews":24766, "language":"English", "complete?":false, "rating":"T"}, ...], "favourite-stories":[{"id":"5193644", "title":"Time Braid", ... }, ...], "favourite-authors":["4976703", ... ]} Wrangling A step by step guide on how to parse these records can be found in wrangling.ipynb I’ve decided on 6 DataFrames authors - author information (only name for now) indexed by author id stories - story information indexed by story id favourite_authors - mapping of author id to favourite author ids favourite_stories - mapping of author id to favourite story ids genres - a relation DataFrame mapping story ids to genres categories - a relation DataFrame mapping story ids to categories For the full schema in nicely formatted markdown check out the metafiction readme My first plot Matplotlib plays well with Pandas and a one liner reveals the most popular genres in the records I’ve got. 1 genres.sum().order().plot(kind="barh") Stories can be in multiple genres and most stories have at least one romantic relationship so this isn’t as damning as you might believe. Although ‘angst’ and ‘hurt’ being quite high up does lead some credence to the image of the Tortured artist Next steps I’ve been pointed to Programming Collective Intelligence to help me build the recommender, any other suggestions welcome. Stay tuned for part II

I’m currently working through Python for Data Analysis (you can follow along here) using the useful IPython Notebook and Pandas. While the book has been enlightening, I think its time to dive into a meaty problem to get some experience. Some of the stories on fanfiction.net are brilliant but they’re very difficult to find, there are just too many to look through. So far I’ve relied on r/rational as a filter but I think I can do better. My goal is to create a simple collaborative filtering recommender using the preferences of the authors on fanfiction.net. The Data Authors on fanfiction.net have lists of stories they’ve written, favourite stories and favourite authors. Each story has several attributes like chapters and categories I’ve written a web scraper which collects author records using a breadth first search starting with a seed list of author ids. In case you don’t feel like picking up Clojure or writing your own scraper I’ve uploaded a 100 author records here. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 // an author record {"id":"2269863", "name":"Less Wrong", "author-stories":[{"id":"5782108", "title":"Harry Potter and the Methods of Rationality", "author":"2269863", "date-submitted":"2010-02-28T08:12:39Z", "date-updated":"2014-07-26T02:01:00Z", "categories":["Harry Potter"], "genres":["Drama", "Humor"], "chapters":102, "word-count":565501, "follows":13290, "favourites":14432, "reviews":24766, "language":"English", "complete?":false, "rating":"T"}, ...], "favourite-stories":[{"id":"5193644", "title":"Time Braid", ... }, ...], "favourite-authors":["4976703", ... ]} Wrangling A step by step guide on how to parse these records can be found in wrangling.ipynb I’ve decided on 6 DataFrames authors - author information (only name for now) indexed by author id stories - story information indexed by story id favourite_authors - mapping of author id to favourite author ids favourite_stories - mapping of author id to favourite story ids genres - a relation DataFrame mapping story ids to genres categories - a relation DataFrame mapping story ids to categories For the full schema in nicely formatted markdown check out the metafiction readme My first plot Matplotlib plays well with Pandas and a one liner reveals the most popular genres in the records I’ve got. 1 genres.sum().order().plot(kind="barh") Stories can be in multiple genres and most stories have at least one romantic relationship so this isn’t as damning as you might believe. Although ‘angst’ and ‘hurt’ being quite high up does lead some credence to the image of the Tortured artist Next steps I’ve been pointed to Programming Collective Intelligence to help me build the recommender, any other suggestions welcome. Stay tuned for part II

The infinite chocolate trick is a nifty optical illusion, here’s how it works. This chocolate bar has $ m $ rows and $ n $ columns <!-- rows --> <!-- cols --> <!-- text --> m rows n cols We want to remove the top left piece, with length $ x $ and breadth $ y $. <!-- rows --> <!-- cols --> <!-- piece --> <!-- text --> x y m.x n.y We cut the bar into 4 components the piece we want to remove the column the piece is from a diagonal cut across the bar from the base and the large piece thats left over <!-- figure 1--> <!-- text --> 1 2 3 4 x y a b c d f <!-- figure 2 --> <!-- text --> 2 3 4 c b f d a When swapped, the tops of 4 and 2 align perfectly For this to happen, the left side of 2 (a) has to equal the right side of 4 (d) Since $ a = d $ and $ a + x = m \cdot x $ $\Rightarrow d + x = m \cdot x $ Since $ d + c = m \cdot x $ $\Rightarrow c = x $ We now know how to cut our chocolate bar to make infinite chocolate The Illusion The reason this illusion works is because the final chocolate bar isn’t that much smaller than the original. <!-- figure 1--> <!-- text --> x y (n-1)y (m-1)x x (m-1)x f g y x <!-- figure 2 --> <!-- text --> x n.y f y g (m-1)x The final length of the chocolate is $ x + f $ We also know that $ x + f + g = m \cdot x $ $ \Rightarrow g = m \cdot x - (x + f) $ Which means the new chocolate bar is only $ g $ units shorter than the original $ g = y \cdot tan(\theta) $ [$ \theta $ being the angle made by the lower wedge] $ g = y \cdot \frac{x}{n \cdot y} $ [trigonometry] $ g = \frac{x}{n} $ The area for the missing piece is taken as a sliver from the top row. The length of the sliver is equal to the length of the piece you want to remove divided by the number of columns An Alternate Solution (w/o trig) The total area of the chocolate bar is $ m \cdot x \times n \cdot y $ The area of the piece we remove is $ x \times y $ So the area thats left $ = m \cdot x \times n \cdot y - ( x \times y ) $ Dividing by the breadth, which doesn’t change, we get the new length $ \textrm{new length} = \frac{ m \cdot x \times n \cdot y - ( x \times y ) }{n \cdot y} $ $ \textrm{new length} = m \cdot x - \frac{x}{n} $ $ \textrm{new length} = \textrm{old length} - \frac{x}{n} $ Where $ \frac{x}{n} $ is the same as $ g $ in the first solution, the difference in lengths

The infinite chocolate trick is a nifty optical illusion, here’s how it works. This chocolate bar has $ m $ rows and $ n $ columns <!-- rows --> <!-- cols --> <!-- text --> m rows n cols We want to remove the top left piece, with length $ x $ and breadth $ y $. <!-- rows --> <!-- cols --> <!-- piece --> <!-- text --> x y m.x n.y We cut the bar into 4 components the piece we want to remove the column the piece is from a diagonal cut across the bar from the base and the large piece thats left over <!-- figure 1--> <!-- text --> 1 2 3 4 x y a b c d f <!-- figure 2 --> <!-- text --> 2 3 4 c b f d a When swapped, the tops of 4 and 2 align perfectly For this to happen, the left side of 2 (a) has to equal the right side of 4 (d) Since $ a = d $ and $ a + x = m \cdot x $ $\Rightarrow d + x = m \cdot x $ Since $ d + c = m \cdot x $ $\Rightarrow c = x $ We now know how to cut our chocolate bar to make infinite chocolate The Illusion The reason this illusion works is because the final chocolate bar isn’t that much smaller than the original. <!-- figure 1--> <!-- text --> x y (n-1)y (m-1)x x (m-1)x f g y x <!-- figure 2 --> <!-- text --> x n.y f y g (m-1)x The final length of the chocolate is $ x + f $ We also know that $ x + f + g = m \cdot x $ $ \Rightarrow g = m \cdot x - (x + f) $ Which means the new chocolate bar is only $ g $ units shorter than the original $ g = y \cdot tan(\theta) $ [$ \theta $ being the angle made by the lower wedge] $ g = y \cdot \frac{x}{n \cdot y} $ [trigonometry] $ g = \frac{x}{n} $ The area for the missing piece is taken as a sliver from the top row. The length of the sliver is equal to the length of the piece you want to remove divided by the number of columns An Alternate Solution (w/o trig) The total area of the chocolate bar is $ m \cdot x \times n \cdot y $ The area of the piece we remove is $ x \times y $ So the area thats left $ = m \cdot x \times n \cdot y - ( x \times y ) $ Dividing by the breadth, which doesn’t change, we get the new length $ \textrm{new length} = \frac{ m \cdot x \times n \cdot y - ( x \times y ) }{n \cdot y} $ $ \textrm{new length} = m \cdot x - \frac{x}{n} $ $ \textrm{new length} = \textrm{old length} - \frac{x}{n} $ Where $ \frac{x}{n} $ is the same as $ g $ in the first solution, the difference in lengths

crntaylor not so recently posted his mathematician’s fizzbuzz question on HN A jar has 1000 coins, of which 999 are fair and 1 is double headed. Pick a coin at random, and toss it 10 times. Given that you see 10 heads, what is the probability that the next toss of that coin is also a head? The Setup Pick a random coin out of the jar, whats the probability its the double headed coin? Well, there are a 1000 coins in the jar, theres only 1 double headed coin. So the probability of picking it is $ \frac{1}{1000} $ $$ \textrm { the probability of picking a double headed coin is } \frac{1}{1000} $$ $$ P( \textrm{ double headed coin } ) = \frac{1}{1000} $$ I don’t have any more information about how the coins are arranged so I’ll assume that I’m equally likely to pick any of them. This assumption can be wrong but theres not enough information to make a better guess. Adding Information Flip the coin, it lands tails, aha! more information. I now know that the coin I picked can’t be the double headed coin. A double headed coin can never land tails. $$ \textrm { the probability of tails given a double headed coin is } 0 $$ $$ P( \textrm{ tails } | \textrm{ double headed coin } ) = 0 $$ What about if it landed heads? Now I can’t tell for sure whether I have the double headed coin but I still have more information. Since I haven’t gotten a tails, the probability of having picked the double headed coin goes up. To find out by just how much I use Bayes’ rule Bayes’ rule $$ P(\textrm{ event }|\textrm{ evidence }) = \frac{P(\textrm{ evidence }|\textrm{ event }) \cdot P(\textrm{ event })}{P(\textrm{ evidence })} $$ $$ \textrm { or } $$ $$ P(\textrm{ double headed coin }|\textrm{ heads }) = \frac{P(\textrm{ heads }|\textrm{ double headed coin })P(\textrm{ double headed coin })}{P(\textrm{ heads })} $$ This might be completely unintuitive, check out the link above, it really helped me understand the concept, for now take it as fact and move on. In this case, the event I care about is picking the double headed coin, the evidence is the flip landing heads. The Evidence $ P(\textrm{ heads } ) $ is the probability of seeing a heads at all, whichever coin I picked. Thats equal to the probability of picking the double headed coin and landing a heads plus picking a fair coin and landing a heads. $$ P(\textrm{ heads } ) $$ $$ = P(\textrm{ double headed coin & heads }) + P(\textrm{ fair coin & heads }) $$ $$ = P(\textrm{ heads } | \textrm{ double headed coin }) \cdot P(\textrm{ double headed coin }) + P(\textrm{ heads } | \textrm{ fair coin }) \cdot P(\textrm{ fair coin })$$ $$ = 1 \cdot \frac{1}{1000} + \frac{1}{2} \cdot \frac{999}{1000}$$ $$ = 0.5005 $$ Which is slightly higher than $P(\textrm{ heads } | \textrm{ fair coin })$, after all, 999 out of 1000 coins in the jar are fair. (For an explanation of what just happened check out these Khan Academy videos) Picking the Double Headed Coin $$ P(\textrm{ double headed coin }| \textrm{ heads }) $$ $$ = \frac{P(\textrm{ heads }|\textrm{ double headed coin }) \cdot P(\textrm{ double headed coin })}{P(\textrm{ heads })} $$ $$ = \frac{1 \cdot \frac{1}{1000}}{1 \cdot \frac{1}{1000} + \frac{1}{2} \cdot \frac{999}{1000}} $$ $$ = \frac{2}{1001} $$ $$ = 0.0019 $$ Compared to the original probability of picking the double headed coin of $ 0.001 $ this isn’t a lot higher but then I’ve only flipped it once. Plugging in the values for 10 heads in a row. $$ P(\textrm{ double headed coin }| \textrm{ 10 heads }) $$ $$ = \frac{P(\textrm{ 10 heads }|\textrm{ double headed coin }) \cdot P(\textrm{ double headed coin })}{P(\textrm{ 10 heads })} $$ $$ = \frac{1 \cdot \frac{1}{1000}}{1 \cdot \frac{1}{1000} + (\frac{1}{2})^{10} \cdot \frac{999}{1000}} $$ $$ = \frac{1024}{2023} $$ $$ = 0.506 \textrm{ (or a little over 50%) } $$ Updating Posteriors Before I flipped anything, $P(\textrm{ heads })$ was 0.5005, only slightly higher than flipping a fair coin. Now after seeing 10 Heads in a row I must update my beliefs about the coin in my hand. My posterior becomes my new prior. $$ P(\textrm{ double headed coin }| \textrm{ 10 heads }) \Rightarrow P(\textrm{ double headed coin }) $$ $$ P(\textrm{ fair coin }| \textrm{ 10 heads }) \Rightarrow P(\textrm{ fair coin })$$ The 11th Flip Plugging the updated probabilities into the formula for $ P(\textrm{ heads }) $ $$ = P(\textrm{ double headed coin & heads }) + P(\textrm{ fair coin & heads }) $$ $$ = P(\textrm{ heads } | \textrm{ double headed coin }) \cdot P(\textrm{ double headed coin }) + P(\textrm{ heads } | \textrm{ fair coin }) \cdot P(\textrm{ fair coin })$$ $$ = 1 \cdot \frac{1024}{2023} + \frac{1}{2} \cdot \frac{999}{2023}$$ $$ = 0.753 $$ TLDR Initially, there was a small chance I was holding the double headed coin but with every flip landing heads I could no longer hold onto that belief. Bayes’ Rule quantifies my shift in belief with every additional flip, away from believing that I hold a fair coin closer to the belief that I hold the double headed coin.

crntaylor not so recently posted his mathematician’s fizzbuzz question on HN A jar has 1000 coins, of which 999 are fair and 1 is double headed. Pick a coin at random, and toss it 10 times. Given that you see 10 heads, what is the probability that the next toss of that coin is also a head? The Setup Pick a random coin out of the jar, whats the probability its the double headed coin? Well, there are a 1000 coins in the jar, theres only 1 double headed coin. So the probability of picking it is $ \frac{1}{1000} $ $$ \textrm { the probability of picking a double headed coin is } \frac{1}{1000} $$ $$ P( \textrm{ double headed coin } ) = \frac{1}{1000} $$ I don’t have any more information about how the coins are arranged so I’ll assume that I’m equally likely to pick any of them. This assumption can be wrong but theres not enough information to make a better guess. Adding Information Flip the coin, it lands tails, aha! more information. I now know that the coin I picked can’t be the double headed coin. A double headed coin can never land tails. $$ \textrm { the probability of tails given a double headed coin is } 0 $$ $$ P( \textrm{ tails } | \textrm{ double headed coin } ) = 0 $$ What about if it landed heads? Now I can’t tell for sure whether I have the double headed coin but I still have more information. Since I haven’t gotten a tails, the probability of having picked the double headed coin goes up. To find out by just how much I use Bayes’ rule Bayes’ rule $$ P(\textrm{ event }|\textrm{ evidence }) = \frac{P(\textrm{ evidence }|\textrm{ event }) \cdot P(\textrm{ event })}{P(\textrm{ evidence })} $$ $$ \textrm { or } $$ $$ P(\textrm{ double headed coin }|\textrm{ heads }) = \frac{P(\textrm{ heads }|\textrm{ double headed coin })P(\textrm{ double headed coin })}{P(\textrm{ heads })} $$ This might be completely unintuitive, check out the link above, it really helped me understand the concept, for now take it as fact and move on. In this case, the event I care about is picking the double headed coin, the evidence is the flip landing heads. The Evidence $ P(\textrm{ heads } ) $ is the probability of seeing a heads at all, whichever coin I picked. Thats equal to the probability of picking the double headed coin and landing a heads plus picking a fair coin and landing a heads. $$ P(\textrm{ heads } ) $$ $$ = P(\textrm{ double headed coin & heads }) + P(\textrm{ fair coin & heads }) $$ $$ = P(\textrm{ heads } | \textrm{ double headed coin }) \cdot P(\textrm{ double headed coin }) + P(\textrm{ heads } | \textrm{ fair coin }) \cdot P(\textrm{ fair coin })$$ $$ = 1 \cdot \frac{1}{1000} + \frac{1}{2} \cdot \frac{999}{1000}$$ $$ = 0.5005 $$ Which is slightly higher than $P(\textrm{ heads } | \textrm{ fair coin })$, after all, 999 out of 1000 coins in the jar are fair. (For an explanation of what just happened check out these Khan Academy videos) Picking the Double Headed Coin $$ P(\textrm{ double headed coin }| \textrm{ heads }) $$ $$ = \frac{P(\textrm{ heads }|\textrm{ double headed coin }) \cdot P(\textrm{ double headed coin })}{P(\textrm{ heads })} $$ $$ = \frac{1 \cdot \frac{1}{1000}}{1 \cdot \frac{1}{1000} + \frac{1}{2} \cdot \frac{999}{1000}} $$ $$ = \frac{2}{1001} $$ $$ = 0.0019 $$ Compared to the original probability of picking the double headed coin of $ 0.001 $ this isn’t a lot higher but then I’ve only flipped it once. Plugging in the values for 10 heads in a row. $$ P(\textrm{ double headed coin }| \textrm{ 10 heads }) $$ $$ = \frac{P(\textrm{ 10 heads }|\textrm{ double headed coin }) \cdot P(\textrm{ double headed coin })}{P(\textrm{ 10 heads })} $$ $$ = \frac{1 \cdot \frac{1}{1000}}{1 \cdot \frac{1}{1000} + (\frac{1}{2})^{10} \cdot \frac{999}{1000}} $$ $$ = \frac{1024}{2023} $$ $$ = 0.506 \textrm{ (or a little over 50%) } $$ Updating Posteriors Before I flipped anything, $P(\textrm{ heads })$ was 0.5005, only slightly higher than flipping a fair coin. Now after seeing 10 Heads in a row I must update my beliefs about the coin in my hand. My posterior becomes my new prior. $$ P(\textrm{ double headed coin }| \textrm{ 10 heads }) \Rightarrow P(\textrm{ double headed coin }) $$ $$ P(\textrm{ fair coin }| \textrm{ 10 heads }) \Rightarrow P(\textrm{ fair coin })$$ The 11th Flip Plugging the updated probabilities into the formula for $ P(\textrm{ heads }) $ $$ = P(\textrm{ double headed coin & heads }) + P(\textrm{ fair coin & heads }) $$ $$ = P(\textrm{ heads } | \textrm{ double headed coin }) \cdot P(\textrm{ double headed coin }) + P(\textrm{ heads } | \textrm{ fair coin }) \cdot P(\textrm{ fair coin })$$ $$ = 1 \cdot \frac{1024}{2023} + \frac{1}{2} \cdot \frac{999}{2023}$$ $$ = 0.753 $$ TLDR Initially, there was a small chance I was holding the double headed coin but with every flip landing heads I could no longer hold onto that belief. Bayes’ Rule quantifies my shift in belief with every additional flip, away from believing that I hold a fair coin closer to the belief that I hold the double headed coin.