{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "![DSA log](dsalogo-Abuja.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "
\n",
"Adapted with permission from Andreas Vlacho's COM4513/6513:NLP Module
\n",
"Department of Computer Science
\n",
"University of Sheffield\n",
"\n",
"
\n", "the likelihood of a particular sequence of words coming from a particular language (say English)!\n", "\n", "
\n", "unigram
, bigram
and trigram
counts\n",
"We need the brown corpus
in nltk
(see this link for more info and documentation on nltk
).\n",
"The word()
function converts the free flowing string to _words_, while the Counter
class in the package collections
yields the unigram_counts
.\n",
"\n",
"#### First, the unigram
counts..."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import nltk\n",
"from nltk.corpus import brown\n",
"from collections import Counter\n",
"brown_words = brown.words()\n",
"unigram_counts = Counter(brown_words)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"##UNCOMMENT AND EXECUTE BELOW: Counter({'The': 7258,'Fulton': 17,'County': 85...})\n",
"# unigram_counts"
]
},
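{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (our own addition, not part of the original exercise), `Counter` provides a `most_common` method we can use to peek at the highest-count words:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## A quick peek at the ten most frequent tokens and their counts\n",
"unigram_counts.most_common(10)"
]
},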
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Then, the bigram
and trigram
counts...\n",
"By the way, _\"padding\"_ ensures we have a sequence of n-grams at the beginnings and ends of sentences. If you are curious you could set the parameters to False
i.e. pad_left=True
and pad_right=True
and run the cell again to see what they do."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"bigrams = []\n",
"for sent in brown.sents():\n",
" bigrams.extend(nltk.bigrams(sent, pad_left=True, pad_right=True))\n",
"bigram_counts = Counter(bigrams)\n",
"\n",
"trigrams = []\n",
"for sent in brown.sents():\n",
" trigrams.extend(nltk.trigrams(sent, pad_left=True, pad_right=True))\n",
"trigram_counts = Counter(trigrams)"
]
},
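{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what the padding does on a small scale (a quick illustration on a made-up toy sentence), note that `nltk.bigrams` pads with `None` by default when `pad_left`/`pad_right` are set:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## Padding adds a (None, first word) and a (last word, None) bigram\n",
"list(nltk.bigrams([\"The\", \"dog\", \"barks\"], pad_left=True, pad_right=True))\n",
"## Expected: [(None, 'The'), ('The', 'dog'), ('dog', 'barks'), ('barks', None)]"
]
},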
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"##UNCOMMENT AND EXECUTE BELOW: Counter({(None, 'The'): 6544,('The', 'Fulton'): 1, ('Fulton', 'County'): 6,...\n",
"# bigram_counts"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"##UNCOMMENT AND EXECUTE BELOW: Counter({(None, None, 'The'): 6544, (None, 'The', 'Fulton'): 1, ('The', 'Fulton', 'County'): 1,\n",
"# trigram_counts"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Let's then define the bigram language model, bigram_LM
as a function\n",
"By the way, _\"padding\"_ ensures we have a sequence of n-grams at the beginnings and ends of sentences. If you are curious you could set the parameters to False
i.e. pad_left=True
and pad_right=True
and run the cell again to see what they do."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def bigram_LM(sentence_x, smoothing=0.0):\n",
" unique_words = len(unigram_counts.keys()) + 2 # For the None paddings\n",
" x_bigrams = nltk.bigrams(sentence_x, pad_left=True, pad_right=True)\n",
" prob_x = 1.0\n",
" for bg in x_bigrams:\n",
" if bg[0] == None:\n",
" prob_bg = (bigram_counts[bg]+smoothing)/(len(brown.sents())+smoothing*unique_words)\n",
" else:\n",
" prob_bg = (bigram_counts[bg]+smoothing)/(unigram_counts[bg[0]]+smoothing*unique_words)\n",
" prob_x = prob_x *prob_bg\n",
" print(str(bg)+\":\"+str(prob_bg))\n",
" return prob_x"
]
},
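{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check of the function (the example sentence is our own choice; it reappears in the problem setup below), let's score a short sentence:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## Each bigram probability is printed, then the product is returned\n",
"print(bigram_LM([\"The\", \"water\", \"is\", \"clear\", \".\"]))"
]
},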
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Problem setup\n",
"\n",
"Training data is a (large) set of sentences $\\mathbf{x}^m$ with words $x_n$:\n",
"\n",
"\n", "\\begin{align}\n", "D_{train} & = \\{\\mathbf{x}^1,...,\\mathbf{x}^M\\} \\\\\n", "\\mathbf{x}& = [x_1,... x_N]\\\\\n", "\\end{align}\n", "
\n", "\n", "\n", "for example:\n", "\\begin{align}\n", "\\mathbf{x}=&[\\text{None}, \\text{The}, \\text{water}, \\text{is}, \\text{clear}, \\text{.}, \\text{None}]\n", "\\end{align}\n", "
\n", "\n", "We want to learn a model that returns:\n", "\\begin{align}\n", "\\text{probability}\\; P(\\mathbf{x}), \\mathbf{for} \\; \\forall \\mathbf{x}\\in V^{maxN}\n", "\\end{align}\n", "$V$ is the vocabulary and $V^{maxN}$ all possible sentences\n", "\n", "#### Word counts\n", "Let's take look at our corpus:\n", " * List of all sentences in our corpus \n", " \n", " --brown.sents()
returns a list of all sentences in our corpus"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"## UNCOMMENT AND EXECUTE BELOW:\n",
"# brown.sents()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
" * How many \"words\" (or _tokens_) are in our corpus?\n",
" \n",
" -- len(brown_words)
returns the total number of tokens in our corpus"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"## UNCOMMENT AND EXECUTE BELOW:\n",
"# len(brown_words)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" * What is the count of the word _'the'_ in our corpus?\n",
" \n",
" -- unigram_counts[\"the\"]
returns the count of the word *the* in the corpus"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## UNCOMMENT AND EXECUTE BELOW:\n",
"# unigram_counts[\"the\"]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#### Let's move from word counts to probabilities\n",
"\n",
"probability $P(\\text{the}) = \\frac{counts(the)}{\\sum_{x \\in Vocabulary} counts(x)}$\n",
"\n",
" * What is the probability of finding _'the'_ in our corpus?\n",
" \n",
" -- unigram_counts[\"the\"]/len(brown_words)
returns the required probability"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"## UNCOMMENT AND EXECUTE BELOW:\n",
"# unigram_counts[\"the\"]/len(brown_words)"
]
},
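{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that the counts are case-sensitive: \"The\" and \"the\" are different tokens in `unigram_counts`. A quick check (our own addition) makes this visible:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## UNCOMMENT AND EXECUTE BELOW: sentence-initial \"The\" has its own, smaller count\n",
"# print(unigram_counts[\"The\"]/len(brown_words))\n",
"# print(unigram_counts[\"the\"]/len(brown_words))"
]
},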
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"To have a probability distributions and not just scores, the sum of the probabilities for all words must be 1. Let's check:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"## UNCOMMENT AND EXECUTE BELOW:\n",
"# sum = 0\n",
"# for word in unigram_counts:\n",
"# sum += unigram_counts[word]\n",
"# print(sum/len(brown_words))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Unigram language model\n",
"\n",
"Multiply the probability of each word in the sentence $\\mathbf{x}$:\n",
"\n",
"$$P(\\mathbf{x}) = \\prod_{n=1}^N P(x_n) = \\prod_{n=1}^N \\frac{counts(x_n)}{\\sum_{x \\in V} counts(x)}$$\n",
"\n",
"### Now let's attempt to solve the exercise above using the unigram language model\n",
" [Hint: Compare the unigram probabilities of the options and the higher option]\n",
" - All the questions and options are stored as one long string in the variable questions_options
. Remember to extract them into a sensible format.\n",
" - You can create a tuple of question
and options
for each where the question
is a list of tokens in the question e.g. ['I', 'don't', 'know', '____', 'to', 'go', 'out', 'or', 'not', '.']
and options
is a list of the corresponding choices e.g. ['weather', 'whether']
.\n",
" - If it helps keep a list the correct options as well in order to compare your model's answer"
]
},
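{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get you started, here is a minimal sketch of the product formula above as a function (the name `unigram_LM` is our own addition, mirroring `bigram_LM` defined earlier):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def unigram_LM(sentence_x):\n",
"    # P(x) = product over words of counts(word) / total token count;\n",
"    # unseen words get count 0, hence probability 0\n",
"    total_tokens = len(brown_words)\n",
"    prob_x = 1.0\n",
"    for word in sentence_x:\n",
"        prob_x *= unigram_counts[word] / total_tokens\n",
"    return prob_x"
]
},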
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"questions_options = \"\"\"I don't know ____ to go out or not . : weather/whether\n",
"We went ____ the door to get inside . : through/threw\n",
"They all had a ____ of the cake . : piece/peace\n",
"She had to go to ____ to prove she was innocent . : caught/court\n",
"We were only ____ to visit at certain times . : aloud/allowed\n",
"She went back to ____ she had locked the door . : cheque/check\n",
"Can you ____ me ? : hear/here\n",
"Do you usually eat ____ for breakfast ? : serial/cereal\n",
"She normally ____ with her mouth closed . : chews/choose\n",
"I'm going to ____ it on the internet . : cell/sell\"\"\""
]
},
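{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible way to parse the string (a sketch, assuming the `question : option1/option2` format above; feel free to do it differently):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## Split each line on \" : \" into question and options, tokenise the\n",
"## question on whitespace, and split the options on \"/\"\n",
"parsed = []\n",
"for line in questions_options.split(\"\\n\"):\n",
"    question, options = line.rsplit(\" : \", 1)\n",
"    parsed.append((question.split(), options.split(\"/\")))\n",
"# parsed[0] -> (['I', \"don't\", 'know', '____', ...], ['weather', 'whether'])"
]
},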
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"##PUT YOUR CODE FOR UNIGRAM MODEL SOLUTION IN HERE!"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"### What could go wrong with the unigram probabilities?"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"the most probable word is \"the\"\n",
"- the most probable single-word sentence is \"the\"\n",
"- the most probable two-word sentence is \"the the\"\n",
"- the most probable $N$-word sentence is $N\\times $\"the\""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"k-th order Markov assumption:\n", "\n", "$$P(x_n|x_{n-1},..., x_1) \\approx P(x_n|x_{n-1},..., x_{n-k})$$\n", "\n", "with k=1
\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Bigram language model in action\n", "\n", "$$\\mathbf{x}=[\\text{I}, \\text{spoke}, \\text{the}, \\text{truth}, \\text{.}]$$\n", "\n", "\\begin{align}\n", "P(\\mathbf{x}) =& P(\\text{I}|\\text{None})\\times P(\\text{spoke}|\\text{I})\\times P\\text{the}|\\text{spoke})\\times\\\\\n", "&P(\\text{truth}|\\text{the})\\times P(\\text{.}|\\text{truth})\\times P(\\text{.}|\\text{None})\n", "\\end{align}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "print(bigram_LM(x))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Longer contexts\n", "\n", "$$P(x|context) = \\frac{P(context,x)}{P(context)} = \\frac{counts(context,x)}{counts(context)} $$\n", "\n", "In bigram LM $context$ is $x_{n-1}$, trigram $x_{n-2}, x_{n-1}$, etc.\n", "\n", "The longer the context:\n", "- the more likely to capture long-range dependencies: \"I saw a tiger that was really very...\" fierce or talkative?\n", "- the sparser the counts (zero probabilities)\n", "\n", "5-grams\n", "and \n", "training sets with billions of tokens are common." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Can we avoid zero probabilities? " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "\n", " \n", " \n", "... stealing from the rich and giving to the poor!
\n", "\n", "\n", "By United Artists, cinematographers Arthur Edeson & Charles Richardson - Here, Public Domain, Link\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Taking from the frequent and giving to the rare (discounting)
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Smoothing\n", "\n", "Pretend we have seen all bigrams at least once:\n", "\n", "$$P_{add-1}(x_n|x_{n-1}) = \\frac{counts(x_{n-1},x_n)+1}{counts(x_{n-1}) + |V|} $$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Add-1 puts too much probability mass to unseen bigrams, better to add-$k, k<1$:\n", "$$P_{add-k}(x_n|x_{n-1}) = \\frac{counts(x_{n-1},x_{n})+k}{counts(x_{n-1}) + k|V|} $$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Smoothing in action" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "y = [\"The\", \"water\", \"is\",\"crystal\", \"clear\", \".\"]\n", "print(bigram_LM(y))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "print(bigram_LM(y, smoothing=1.0))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Having learnt about bigram models and smoothing, apply the techniques to the problem and see if the solution improves..." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "##PUT YOUR CODE FOR BIGRAM MODEL SOLUTION IN HERE!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### More smoothing\n", "\n", "If a word was never encountered in training,\n", "any sentence containing will have probability 0\n", "\n", "It happens:\n", "- all corpora are finite\n", "- new words emerging\n", "\n", "\n", "Common solutions: \n", "- Generate unknown words in the training data by\n", "replacing low-frequency words with a special UNKNOWN word to represent them\n", "- Use classes of unknown words, e.g. names and numbers" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Evaluation \n", "\n", "**Intrinsic** evaluation for language modeling\n", "\n", "How well does our model predict the next word?\n", "\n", "*I always order pizza with cheese and...*\n", "- mushrooms?\n", "- bread?\n", "- and?\n", "\n", "Measure accuracy: how often the LM predicts the correct word)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Implementation\n", "\n", "Dealing with large datasets requires efficiency:\n", "- use logprobs to avoid underflows (small numbers)\n", "- efficient data structures for sparse counts\n", "- lossy data structures (e.g. Bloom filters) are used\n", "\n", "Often more data matters more than a cleverer model." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Bibliography\n", "- Jurafsky & Martin [Chapter 4](https://web.stanford.edu/~jurafsky/slp3/4.pdf)\n", "- Michael Collins's [notes](http://www.cs.columbia.edu/~mcollins/lm-spring2013.pdf)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### More topics on Language modeling:\n", "- Interpolation methods\n", "- Applications and more methods for evaluation" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" }, "livereveal": { "start_slideshow_at": "selected", "theme": "solarized", "transition": "slide" } }, "nbformat": 4, "nbformat_minor": 2 }