Frequency of Words in English


Frequency of Words in English

One of the most obvious features of text from a statistical point of view is that the distribution of word frequencies is very skewed. In fact, the two most frequent words in English ("the" and "of") account for about 10% of all word occurrences, the six most frequent words account for about 20% of occurrences, and the fifty most frequent words make up about 40% of all text. On the other hand, about half of all distinct words occur only once. This distribution is described by Zipf's law, which states that the frequency of the rth most common word is inversely proportional to r; equivalently, the rank of a word (r) times its frequency (f) is approximately a constant (k):

r · f ≈ k
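As an illustration (not from the original slides), a minimal Python sketch that counts word frequencies and prints rank times probability, the product Zipf's law predicts should stay roughly constant:

from collections import Counter

def zipf_products(text, top=10):
    # Count word frequencies and rank them from most to least frequent.
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    # Zipf's law predicts r * Pr stays roughly constant (about 0.1 for English).
    for r, (word, freq) in enumerate(counts.most_common(top), start=1):
        print(r, word, round(r * freq / total, 3))

On a large English corpus the printed products cluster around a single constant, which is the c of the probability form discussed next.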

We often talk about the probability of occurrence of a word, which is just the frequency of the word divided by the total number of word occurrences in the text. In these terms, Zipf's law is:

r · Pr = c

where Pr is the probability of occurrence of the rth-ranked word and c is a constant. For English, c ≈ 0.1. The law thus predicts, for example, that the second-ranked word occurs with probability 0.1/2 = 0.05, or about once in every twenty words.

Another useful prediction related to word occurrence is vocabulary growth: as the size of the corpus grows, new words continue to occur. The relationship between the size of the corpus and the size of the vocabulary was found empirically by Heaps (1978) to be:

v = k · n^β

where v is the vocabulary size for a corpus of n words, and k and β are parameters that vary from collection to collection. This is sometimes referred to as Heaps' law. Typical values for k and β are often stated to be 10 ≤ k ≤ 100 and β ≈ 0.5.
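A quick sketch of the prediction (k = 30 and β = 0.5 are illustrative values within the typical ranges above, not measurements from any collection):

def heaps_predict(n, k=30.0, beta=0.5):
    # Predicted vocabulary size for a corpus of n word occurrences.
    return k * n ** beta

print(heaps_predict(1_000_000))  # 30000.0 distinct words predicted

With these parameters, a million-word corpus is predicted to contain about 30,000 distinct words; quadrupling the corpus only doubles the predicted vocabulary, since β = 0.5 makes growth follow the square root of n.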

Document Parsing

Document parsing involves the recognition of the content and structure of text documents. Recognizing each word occurrence in the sequence of characters in a document is called tokenizing or lexical analysis. Apart from these words, a document can contain many other types of content, such as metadata, images, graphics, code, and tables. Metadata content includes document attributes such as date and author, and, most importantly, the tags that are used by markup languages to identify document components. The parser uses the tags and other metadata recognized in the document to interpret the document's structure based on the syntax of the markup language (syntactic analysis) and to produce a representation of the document that includes both the structure and the content.

Tokenizing

Tokenizing is the process of forming words from the sequence of characters in a document. Given that nearly everything in the text of a document can be important for some query, the tokenizing rules have to convert most of the content to searchable tokens. Rather than trying to do everything in the tokenizer, some of the more difficult issues, such as identifying word variants or recognizing that a string is a name or a date, can be handled by separate processes, including stemming, information extraction, and query transformation.
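As a concrete illustration (not part of the original slides), a minimal regex-based tokenizer might lowercase the text and extract alphanumeric sequences, leaving the harder cases to later stages:

import re

def tokenize(text):
    # Lowercase and extract alphanumeric sequences as tokens.
    # Real tokenizers add rules for hyphens, apostrophes, numbers, etc.
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize("Tropical fish, Aquarium fish feeder (2nd ed.)"))
# ['tropical', 'fish', 'aquarium', 'fish', 'feeder', '2nd', 'ed']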

Stopping

Human language is filled with function words: words that have little meaning apart from the other words around them. The most common of these, such as "the", "a", "an", "that", and "those", are determiners. In information retrieval, these function words have a second name: stopwords. They are called stopwords because text processing stops when one is seen, and it is thrown out. Throwing out these words decreases index size, increases retrieval efficiency, and generally improves retrieval effectiveness. It can cause problems, however, for queries made up entirely of stopwords, such as "to be or not to be?".
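A minimal sketch of stopping (the stopword list here is a toy example, not a standard list):

STOPWORDS = {"the", "a", "an", "that", "those", "of", "to", "be", "or", "not"}

def remove_stopwords(tokens):
    # Drop any token that appears in the stopword list.
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords(["to", "be", "or", "not", "to", "be"]))
# [] -- every word of the query was a stopword, illustrating the risk above.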

Stemming

Stemming, also called conflation, is a component of text processing that captures the relationships between different variations of a word. More precisely, stemming reduces the different forms of a word that occur because of inflection (e.g., plurals, tenses) or derivation (e.g., turning a verb into a noun by adding the suffix -ation) to a common stem. There are two basic types of stemmers: algorithmic and dictionary-based. An algorithmic stemmer uses a small program to decide whether two words are related, usually based on knowledge of word suffixes for a particular language. By contrast, a dictionary-based stemmer has no logic of its own, but instead relies on pre-created dictionaries of related terms to store term relationships.

Original text:
This document will describe marketing strategies carried out by U.S. companies for their agricultural chemicals, report predictions for market share of such chemicals, or report market statistics for agrochemicals, pesticide, herbicide, fungicide, insecticide, fertilizer, predicted sales, market share, stimulate demand, price cut, volume of sales.

Porter stemmer:
document describ market strategi carri compani agricultur chemic report predict market share chemic report market statist agrochem pesticid herbicid fungicid insecticid fertil predict sale market share stimul demand price cut volum sale

Krovetz stemmer:
document describe marketing strategy carry company agriculture chemical report prediction market share chemical report market statistic agrochemic pesticide herbicide fungicide insecticide fertilizer predict sale stimulate demand price cut volume sale
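Algorithmic stemmers like Porter's are available in standard toolkits. A quick illustration using NLTK's implementation (assuming NLTK is installed):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["marketing", "strategies", "agricultural", "chemicals", "predictions"]
print([stemmer.stem(w) for w in words])
# ['market', 'strategi', 'agricultur', 'chemic', 'predict']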

Phrases and N-grams

Noun phrases
Noun phrases (nouns, or adjectives followed by nouns) can be identified using a part-of-speech (POS) tagger. A POS tagger marks each word in a text with a label corresponding to its part of speech in that context. Taggers are based on statistical or rule-based approaches and are trained on large corpora that have been manually labeled. Typical tags used to label the words include NN (singular noun), NNS (plural noun), VB (verb), VBD (verb, past tense), VBN (verb, past participle), IN (preposition), JJ (adjective), CC (conjunction, e.g., "and", "or"), PRP (pronoun), and MD (modal auxiliary, e.g., "can", "will").
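POS taggers are also available off the shelf; a minimal example using NLTK (assuming NLTK and its tokenizer and tagger data packages are installed):

import nltk

tokens = nltk.word_tokenize("Tropical fish include fish found in tropical environments")
print(nltk.pos_tag(tokens))
# e.g. [('Tropical', 'JJ'), ('fish', 'NN'), ('include', 'VBP'), ('fish', 'NN'), ...]

Extracting noun phrases then amounts to scanning the tagged sequence for runs of JJ and NN* tags.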

N-grams

In the n-gram approach, a phrase is defined as any sequence of n words. Sequences of two words are called bigrams, sequences of three words are called trigrams, and single words are called unigrams. N-grams have been used in many text applications. N-grams, whether character or word, are generated by choosing a particular value of n and then moving a window of that size forward one unit (character or word) at a time.
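Generating word n-grams is a simple sliding window; a minimal sketch:

def ngrams(tokens, n):
    # Move a window of size n over the tokens, one token at a time.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(["to", "be", "or", "not", "to", "be"], 2))
# [('to', 'be'), ('be', 'or'), ('or', 'not'), ('not', 'to'), ('to', 'be')]

The same function produces character n-grams if given a string instead of a token list.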

Many web search engines use n-gram indexing because n-grams provide a fast method of incorporating phrase features in the ranking. Google's Books Ngram Viewer (https://books.google.com/ngrams) is a well-known example of n-gram statistics at scale: it charts the frequency of a phrase such as "science versus religion" over time across a large corpus of scanned books.

Document Structure and Markup

In the case of Web search, queries do not usually refer to document structure or fields, but that does not mean this structure is unimportant. Some parts of the structure of Web pages, indicated by HTML markup, are very significant features used by the ranking algorithm. The document parser must recognize this structure and make it available for indexing. For example, part of the Wikipedia page returned for the query "tropical fish":

<html>
<head>
<meta name="keywords" content="Tropical fish, Airstone, Albinism, Algae eater, Aquarium, Aquarium fish feeder, Aquarium furniture, Aquascaping, Bath treatment (fishkeeping), Berlin Method, Biotope" />
<title>Tropical fish - Wikipedia, the free encyclopedia</title>
</head>
<body>
<h1 class="firstHeading">Tropical fish</h1>
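A minimal sketch of pulling such fields out of the markup for indexing, using Python's built-in html.parser (real search engines use far more robust parsers):

from html.parser import HTMLParser

class FieldExtractor(HTMLParser):
    # Collects the text found inside <title> and <h1> tags as separate fields.
    def __init__(self):
        super().__init__()
        self.fields = {}     # field name -> collected text
        self.current = None  # tracked tag we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1"):
            self.current = tag

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        if self.current:
            self.fields[self.current] = self.fields.get(self.current, "") + data

parser = FieldExtractor()
parser.feed("<title>Tropical fish - Wikipedia</title><h1>Tropical fish</h1>")
print(parser.fields)
# {'title': 'Tropical fish - Wikipedia', 'h1': 'Tropical fish'}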

Information Extraction

Information extraction is a language technology that focuses on extracting structure from text. It is used in a number of applications, particularly text data mining. Most of the recent research in information extraction has been concerned with features that have specific semantic content, such as named entities, relationships, and events. Although all of these features contain important information, named entity recognition has been used most often in search applications. Given the more specific nature of these features, the process of recognizing them and tagging them in text is sometimes called semantic annotation.

Entities and Relations

Original sentence:
Fred Smith, who lives at 10 Water Street, Springfield, MA, is a long-time collector of tropical fish.

Annotated with entity markup:
<p><PersonName><GivenName>Fred</GivenName> <Sn>Smith</Sn></PersonName>, who lives at <Address><Street>10 Water Street</Street>, <City>Springfield</City>, <State>MA</State></Address>, is a long-time collector of <b>tropical fish</b>.</p>

Hidden Markov Models

A statistical entity recognizer uses a probabilistic model of the words in and around an entity, trained on a text corpus that has been manually annotated. A Markov model describes a process as a collection of states with transitions between them. Each transition has an associated probability, and the next state in the process depends solely on the current state and the transition probabilities. In a Hidden Markov Model (HMM), each state also has a probability distribution over a set of possible outputs. The model represents a process that generates output as it transitions between states, all determined by probabilities. Only the outputs generated by the state transitions are visible (i.e., can be observed); the underlying states are hidden.
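To make this concrete, here is a toy sketch of standard Viterbi decoding for an HMM entity recognizer: recovering the most likely hidden state sequence for an observed word sequence. All states, words, and probabilities below are invented for illustration.

# Toy HMM: states emit words; we recover the most likely hidden state sequence.
states = ["not-entity", "person"]
start_p = {"not-entity": 0.8, "person": 0.2}                  # illustrative values
trans_p = {"not-entity": {"not-entity": 0.8, "person": 0.2},
           "person":     {"not-entity": 0.6, "person": 0.4}}
emit_p = {"not-entity": {"fred": 0.01, "smith": 0.01, "lives": 0.2, "in": 0.3},
          "person":     {"fred": 0.4,  "smith": 0.4,  "lives": 0.01, "in": 0.01}}

def viterbi(words):
    # best[s] = (probability of the best path ending in state s, that path)
    best = {s: (start_p[s] * emit_p[s].get(words[0], 1e-6), [s]) for s in states}
    for w in words[1:]:
        new_best = {}
        for s in states:
            # Pick the predecessor state that maximizes the path probability.
            prob, path = max(
                ((p * trans_p[prev][s] * emit_p[s].get(w, 1e-6), path)
                 for prev, (p, path) in best.items()),
                key=lambda x: x[0])
            new_best[s] = (prob, path + [s])
        best = new_best
    return max(best.values(), key=lambda x: x[0])[1]

print(viterbi(["fred", "smith", "lives", "in"]))
# ['person', 'person', 'not-entity', 'not-entity']

Because only the words are observed, the recognizer infers the hidden entity states from the transition and output probabilities, exactly as described above.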