"Well that was hard!" "That's what she said" Turning seemingly normal comments into sexual innuendo by adding the words "That's what she said" is a cultural phenomenon, appearing everywhere from TV sitcoms, to internet discussions, to movies. From having its own page on Wikipedia to sites dedicated to the joke, you don't have to look far on the internet before running into it. This has led some to wonder whether it is possible to determine when it is appropriate to add those magic four words to a sentence.
As it turns out, identifying humour through software is hard. For decades now, artificial intelligence (AI) researchers have been working on natural language processing (NLP). This field of computer science and linguistics is concerned with building systems that can understand ordinary language as spoken by humans. It is generally considered a hard task, because the meaning of a sentence often varies with the context in which it is presented, and context is difficult to capture in software. Add humour and puns, where words can have multiple meanings, and the task gets substantially harder.
Two researchers at the University of Washington, however, were willing to give it their best shot. In a recently released paper entitled "That's What She Said: Double Entendre Identification", Kiddon and Brun describe what they've found and introduce their new approach to the problem: "Double Entendre via Noun Transfer" or DEviaNT for short.
Their approach consists of creating three functions that score words based on a number of sample sentences sourced from either an erotic corpus or the Brown corpus, the standard used in this field. The authors used the Stanford Tagger to identify which parts of sentences in the corpus were nouns, adjectives, verbs and so on. Using these two sources, Kiddon and Brun were able to create these three functions, which they used to classify words based on their frequency and position relative to other words. The "noun sexiness"
These three functions were used to score sentences for noun euphemisms (i.e., does a test sentence include a word likely to be used in an erotic sentence). Sentences were also scored on the presence of adjective and verb combinations more likely to be used in erotic literature. Finally, the authors used some basic information about each sentence, such as the number of punctuation and non-punctuation items it contains.
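As a rough illustration of the kind of scoring described above, here is a minimal Python sketch. It is not the authors' implementation: the two tiny word lists stand in for the erotic and Brown corpora, and the function names (relative_frequency_score, sentence_features) and the smoothed log-ratio rule are assumptions made for this example only. It ignores part-of-speech tags and word position, both of which the paper's functions take into account.

```python
import math
import string
from collections import Counter

# Toy stand-ins for the two corpora (assumption: the real corpora are far larger).
EROTIC_TOKENS = "she held the hard rod and the wet meat".split()
NEUTRAL_TOKENS = "the committee held the meeting and the report was long".split()

EROTIC_COUNTS = Counter(EROTIC_TOKENS)
NEUTRAL_COUNTS = Counter(NEUTRAL_TOKENS)


def relative_frequency_score(word, erotic_counts=EROTIC_COUNTS,
                             neutral_counts=NEUTRAL_COUNTS):
    """Hypothetical word score: log-ratio of add-one-smoothed relative
    frequencies. Positive means the word is relatively more common in the
    erotic sample than in the neutral one."""
    vocab_size = len(set(erotic_counts) | set(neutral_counts))
    erotic_total = sum(erotic_counts.values())
    neutral_total = sum(neutral_counts.values())
    p_erotic = (erotic_counts[word] + 1) / (erotic_total + vocab_size)
    p_neutral = (neutral_counts[word] + 1) / (neutral_total + vocab_size)
    return math.log(p_erotic / p_neutral)


def sentence_features(sentence):
    """Turn a sentence into a small feature dictionary: word-score summaries
    plus the simple punctuation and word counts mentioned in the article."""
    words = [tok.strip(string.punctuation).lower() for tok in sentence.split()]
    words = [w for w in words if w]  # drop tokens that were only punctuation
    scores = [relative_frequency_score(w) for w in words]
    return {
        "max_word_score": max(scores, default=0.0),
        "mean_word_score": sum(scores) / len(scores) if scores else 0.0,
        "punctuation_count": sum(ch in string.punctuation for ch in sentence),
        "word_count": len(words),
    }


if __name__ == "__main__":
    print(sentence_features("Well, that was hard!"))
```

A feature dictionary like this is the sort of input a classifier could be trained on; the real system uses a richer set of features than the four shown here.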
Kiddon and Brun sourced a number of sentences from sites based on user-submitted content such as twssstories.com, fmylife.com and textsfromlastnight.com, which were scored using their system. These scores were used to train a classifier in WEKA, an open source machine learning package. Using their test set they were able to show a high level of identification of sentences which were suitable for "That's what she said"-style jokes, while keeping false positives to a minimum. The authors note that making the joke when the sentence is not appropriate is much worse than not making the joke when it is appropriate.
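In classifier terms, making the joke when it is not appropriate is a false positive, and staying silent when the joke would have worked is a false negative. Favouring the first kind of caution means favouring precision over recall. The Python sketch below shows one simple way to express that trade-off: pick a decision threshold that reaches a target precision and, among those, keeps recall as high as possible. The scores, labels, and the pick_threshold helper are invented for illustration and are not taken from the paper or from WEKA.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when every score >= threshold is flagged as a joke."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def pick_threshold(scores, labels, min_precision):
    """Return (threshold, precision, recall) with the best recall among
    thresholds whose precision is at least min_precision, or None."""
    best = None
    for threshold in sorted(set(scores), reverse=True):
        precision, recall = precision_recall(scores, labels, threshold)
        if precision >= min_precision and (best is None or recall > best[2]):
            best = (threshold, precision, recall)
    return best


if __name__ == "__main__":
    # Hypothetical classifier scores and ground-truth labels (True = joke fits).
    scores = [0.95, 0.90, 0.70, 0.60, 0.40, 0.20]
    labels = [True, True, False, True, False, False]
    print(pick_threshold(scores, labels, min_precision=0.9))
```

With a strict precision target the chosen threshold is high and some suitable sentences are missed, which is the trade the authors say they prefer.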
For those interested in the topic, the authors will be presenting on it at the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies in Portland next June.
Thanks to the folks at reddit for spotting the paper.