{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "![DSA log](dsalogo-Abuja.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "
\n",
"Adapted with permission from Andreas Vlacho's COM4513/6513:NLP Module
\n",
"Department of Computer Science
\n",
"University of Sheffield\n",
"\n",
"
\n", "the likelihood of a particular sequence of words coming from a particular language (say English)!\n", "\n", "
\n", "unigram
, bigram
and trigram
counts\n",
"We need the brown corpus
in nltk
(see this link for more info and documentation on nltk
).\n",
"The word()
function converts the free flowing string to _words_, while the Counter
class in the package collections
yields the unigram_counts
.\n",
"\n",
"#### First, the unigram
counts..."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import nltk\n",
"from nltk.corpus import brown\n",
"from collections import Counter\n",
"brown_words = brown.words()\n",
"unigram_counts = Counter(brown_words)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"##UNCOMMENT AND EXECUTE BELOW: Counter({'The': 7258,'Fulton': 17,'County': 85...})\n",
"# unigram_counts"
]
},
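{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (our own addition, not part of the original exercise), `Counter` provides a `most_common` method we can use to peek at the highest-count words:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## A quick peek at the ten most frequent tokens and their counts\n",
"unigram_counts.most_common(10)"
]
},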
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Then, the bigram
and trigram
counts...\n",
"By the way, _\"padding\"_ ensures we have a sequence of n-grams at the beginnings and ends of sentences. If you are curious you could set the parameters to False
i.e. pad_left=True
and pad_right=True
and run the cell again to see what they do."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"bigrams = []\n",
"for sent in brown.sents():\n",
" bigrams.extend(nltk.bigrams(sent, pad_left=True, pad_right=True))\n",
"bigram_counts = Counter(bigrams)\n",
"\n",
"trigrams = []\n",
"for sent in brown.sents():\n",
" trigrams.extend(nltk.trigrams(sent, pad_left=True, pad_right=True))\n",
"trigram_counts = Counter(trigrams)"
]
},
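{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what the padding does on a small scale (a quick illustration on a made-up toy sentence), note that `nltk.bigrams` pads with `None` by default when `pad_left`/`pad_right` are set:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## Padding adds a (None, first word) and a (last word, None) bigram\n",
"list(nltk.bigrams([\"The\", \"dog\", \"barks\"], pad_left=True, pad_right=True))\n",
"## Expected: [(None, 'The'), ('The', 'dog'), ('dog', 'barks'), ('barks', None)]"
]
},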
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"##UNCOMMENT AND EXECUTE BELOW: Counter({(None, 'The'): 6544,('The', 'Fulton'): 1, ('Fulton', 'County'): 6,...\n",
"# bigram_counts"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"##UNCOMMENT AND EXECUTE BELOW: Counter({(None, None, 'The'): 6544, (None, 'The', 'Fulton'): 1, ('The', 'Fulton', 'County'): 1,\n",
"# trigram_counts"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Let's then define the bigram language model, bigram_LM
as a function\n",
"By the way, _\"padding\"_ ensures we have a sequence of n-grams at the beginnings and ends of sentences. If you are curious you could set the parameters to False
i.e. pad_left=True
and pad_right=True
and run the cell again to see what they do."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def bigram_LM(sentence_x, smoothing=0.0):\n",
" unique_words = len(unigram_counts.keys()) + 2 # For the None paddings\n",
" x_bigrams = nltk.bigrams(sentence_x, pad_left=True, pad_right=True)\n",
" prob_x = 1.0\n",
" for bg in x_bigrams:\n",
" if bg[0] == None:\n",
" prob_bg = (bigram_counts[bg]+smoothing)/(len(brown.sents())+smoothing*unique_words)\n",
" else:\n",
" prob_bg = (bigram_counts[bg]+smoothing)/(unigram_counts[bg[0]]+smoothing*unique_words)\n",
" prob_x = prob_x *prob_bg\n",
" print(str(bg)+\":\"+str(prob_bg))\n",
" return prob_x"
]
},
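{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check of the function (the example sentence is our own choice; it reappears in the problem setup below), let's score a short sentence:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## Each bigram probability is printed, then the product is returned\n",
"print(bigram_LM([\"The\", \"water\", \"is\", \"clear\", \".\"]))"
]
},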
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Problem setup\n",
"\n",
"Training data is a (large) set of sentences $\\mathbf{x}^m$ with words $x_n$:\n",
"\n",
"\n", "\\begin{align}\n", "D_{train} & = \\{\\mathbf{x}^1,...,\\mathbf{x}^M\\} \\\\\n", "\\mathbf{x}& = [x_1,... x_N]\\\\\n", "\\end{align}\n", "
\n", "\n", "\n", "for example:\n", "\\begin{align}\n", "\\mathbf{x}=&[\\text{None}, \\text{The}, \\text{water}, \\text{is}, \\text{clear}, \\text{.}, \\text{None}]\n", "\\end{align}\n", "
\n", "\n", "We want to learn a model that returns:\n", "\\begin{align}\n", "\\text{probability}\\; P(\\mathbf{x}), \\mathbf{for} \\; \\forall \\mathbf{x}\\in V^{maxN}\n", "\\end{align}\n", "$V$ is the vocabulary and $V^{maxN}$ all possible sentences\n", "\n", "#### Word counts\n", "Let's take look at our corpus:\n", " * List of all sentences in our corpus \n", " \n", " --brown.sents()
returns a list of all sentences in our corpus"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"## UNCOMMENT AND EXECUTE BELOW:\n",
"# brown.sents()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
" * How many \"words\" (or _tokens_) are in our corpus?\n",
" \n",
" -- len(brown_words)
returns the total number of tokens in our corpus"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"## UNCOMMENT AND EXECUTE BELOW:\n",
"# len(brown_words)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" * What is the count of the word _'the'_ in our corpus?\n",
" \n",
" -- unigram_counts[\"the\"]
returns the count of the word *the* in the corpus"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## UNCOMMENT AND EXECUTE BELOW:\n",
"# unigram_counts[\"the\"]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#### Let's move from word counts to probabilities\n",
"\n",
"probability $P(\\text{the}) = \\frac{counts(the)}{\\sum_{x \\in Vocabulary} counts(x)}$\n",
"\n",
" * What is the probability of finding _'the'_ in our corpus?\n",
" \n",
" -- unigram_counts[\"the\"]/len(brown_words)
returns the required probability"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"## UNCOMMENT AND EXECUTE BELOW:\n",
"# unigram_counts[\"the\"]/len(brown_words)"
]
},
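{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that the counts are case-sensitive: \"The\" and \"the\" are different tokens in `unigram_counts`. A quick check (our own addition) makes this visible:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## UNCOMMENT AND EXECUTE BELOW: sentence-initial \"The\" has its own, smaller count\n",
"# print(unigram_counts[\"The\"]/len(brown_words))\n",
"# print(unigram_counts[\"the\"]/len(brown_words))"
]
},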
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"To have a probability distributions and not just scores, the sum of the probabilities for all words must be 1. Let's check:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"## UNCOMMENT AND EXECUTE BELOW:\n",
"# sum = 0\n",
"# for word in unigram_counts:\n",
"# sum += unigram_counts[word]\n",
"# print(sum/len(brown_words))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Unigram language model\n",
"\n",
"Multiply the probability of each word in the sentence $\\mathbf{x}$:\n",
"\n",
"$$P(\\mathbf{x}) = \\prod_{n=1}^N P(x_n) = \\prod_{n=1}^N \\frac{counts(x_n)}{\\sum_{x \\in V} counts(x)}$$\n",
"\n",
"### Now let's attempt to solve the exercise above using the unigram language model\n",
" [Hint: Compare the unigram probabilities of the options and the higher option]\n",
" - All the questions and options are stored as one long string in the variable questions_options
. Remember to extract them into a sensible format.\n",
" - You can create a tuple of question
and options
for each where the question
is a list of tokens in the question e.g. ['I', 'don't', 'know', '____', 'to', 'go', 'out', 'or', 'not', '.']
and options
is a list of the corresponding choices e.g. ['weather', 'whether']
.\n",
" - If it helps keep a list the correct options as well in order to compare your model's answer"
]
},
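{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get you started, here is a minimal sketch of the product formula above as a function (the name `unigram_LM` is our own addition, mirroring `bigram_LM` defined earlier):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def unigram_LM(sentence_x):\n",
"    # P(x) = product over words of counts(word) / total token count;\n",
"    # unseen words get count 0, hence probability 0\n",
"    total_tokens = len(brown_words)\n",
"    prob_x = 1.0\n",
"    for word in sentence_x:\n",
"        prob_x *= unigram_counts[word] / total_tokens\n",
"    return prob_x"
]
},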
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"questions_options = \"\"\"I don't know ____ to go out or not . : weather/whether\n",
"We went ____ the door to get inside . : through/threw\n",
"They all had a ____ of the cake . : piece/peace\n",
"She had to go to ____ to prove she was innocent . : caught/court\n",
"We were only ____ to visit at certain times . : aloud/allowed\n",
"She went back to ____ she had locked the door . : cheque/check\n",
"Can you ____ me ? : hear/here\n",
"Do you usually eat ____ for breakfast ? : serial/cereal\n",
"She normally ____ with her mouth closed . : chews/choose\n",
"I'm going to ____ it on the internet . : cell/sell\"\"\""
]
},
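{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible way to parse the string (a sketch, assuming the `question : option1/option2` format above; feel free to do it differently):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## Split each line on \" : \" into question and options, tokenise the\n",
"## question on whitespace, and split the options on \"/\"\n",
"parsed = []\n",
"for line in questions_options.split(\"\\n\"):\n",
"    question, options = line.rsplit(\" : \", 1)\n",
"    parsed.append((question.split(), options.split(\"/\")))\n",
"# parsed[0] -> (['I', \"don't\", 'know', '____', ...], ['weather', 'whether'])"
]
},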
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"##PUT YOUR CODE FOR UNIGRAM MODEL SOLUTION IN HERE!"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"### What could go wrong with the unigram probabilities?"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"the most probable word is \"the\"\n",
"- the most probable single-word sentence is \"the\"\n",
"- the most probable two-word sentence is \"the the\"\n",
"- the most probable $N$-word sentence is $N\\times $\"the\""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"k-th order Markov assumption:\n", "\n", "$$P(x_n|x_{n-1},..., x_1) \\approx P(x_n|x_{n-1},..., x_{n-k})$$\n", "\n", "with k=1
\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Bigram language model in action\n", "\n", "$$\\mathbf{x}=[\\text{I}, \\text{spoke}, \\text{the}, \\text{truth}, \\text{.}]$$\n", "\n", "\\begin{align}\n", "P(\\mathbf{x}) =& P(\\text{I}|\\text{None})\\times P(\\text{spoke}|\\text{I})\\times P\\text{the}|\\text{spoke})\\times\\\\\n", "&P(\\text{truth}|\\text{the})\\times P(\\text{.}|\\text{truth})\\times P(\\text{.}|\\text{None})\n", "\\end{align}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "print(bigram_LM(x))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Longer contexts\n", "\n", "$$P(x|context) = \\frac{P(context,x)}{P(context)} = \\frac{counts(context,x)}{counts(context)} $$\n", "\n", "In bigram LM $context$ is $x_{n-1}$, trigram $x_{n-2}, x_{n-1}$, etc.\n", "\n", "The longer the context:\n", "- the more likely to capture long-range dependencies: \"I saw a tiger that was really very...\" fierce or talkative?\n", "- the sparser the counts (zero probabilities)\n", "\n", "5-grams\n", "and \n", "training sets with billions of tokens are common." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Can we avoid zero probabilities? " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "\n", " \n", " \n", "... stealing from the rich and giving to the poor!
\n", "\n", "\n", "By United Artists, cinematographers Arthur Edeson & Charles Richardson - Here, Public Domain, Link\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Taking from the frequent and giving to the rare (discounting)
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Smoothing\n", "\n", "Pretend we have seen all bigrams at least once:\n", "\n", "$$P_{add-1}(x_n|x_{n-1}) = \\frac{counts(x_{n-1},x_n)+1}{counts(x_{n-1}) + |V|} $$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Add-1 puts too much probability mass to unseen bigrams, better to add-$k, k<1$:\n", "$$P_{add-k}(x_n|x_{n-1}) = \\frac{counts(x_{n-1},x_{n})+k}{counts(x_{n-1}) + k|V|} $$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Smoothing in action" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "y = [\"The\", \"water\", \"is\",\"crystal\", \"clear\", \".\"]\n", "print(bigram_LM(y))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "print(bigram_LM(y, smoothing=1.0))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Having learnt about bigram models and smoothing, apply the techniques to the problem and see if the solution improves..." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "##PUT YOUR CODE FOR BIGRAM MODEL SOLUTION IN HERE!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### More smoothing\n", "\n", "If a word was never encountered in training,\n", "any sentence containing will have probability 0\n", "\n", "It happens:\n", "- all corpora are finite\n", "- new words emerging\n", "\n", "\n", "Common solutions: \n", "- Generate unknown words in the training data by\n", "replacing low-frequency words with a special UNKNOWN word to represent them\n", "- Use classes of unknown words, e.g. names and numbers" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Evaluation \n", "\n", "**Intrinsic** evaluation for language modeling\n", "\n", "How well does our model predict the next word?\n", "\n", "*I always order pizza with cheese and...*\n", "- mushrooms?\n", "- bread?\n", "- and?\n", "\n", "Measure accuracy: how often the LM predicts the correct word)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Implementation\n", "\n", "Dealing with large datasets requires efficiency:\n", "- use logprobs to avoid underflows (small numbers)\n", "- efficient data structures for sparse counts\n", "- lossy data structures (e.g. Bloom filters) are used\n", "\n", "Often more data matters more than a cleverer model." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Bibliography\n", "- Jurafsky & Martin [Chapter 4](https://web.stanford.edu/~jurafsky/slp3/4.pdf)\n", "- Michael Collins's [notes](http://www.cs.columbia.edu/~mcollins/lm-spring2013.pdf)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### More topics on Language modeling:\n", "- Interpolation methods\n", "- Applications and more methods for evaluation" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" }, "livereveal": { "start_slideshow_at": "selected", "theme": "solarized", "transition": "slide" } }, "nbformat": 4, "nbformat_minor": 2 }