ACL RD-TEC 1.0 Summarization of J01-1001
Paper Title:
USING SUFFIX ARRAYS TO COMPUTE TERM FREQUENCY AND DOCUMENT FREQUENCY FOR ALL SUBSTRINGS IN A CORPUS
Authors: Mikio Yamamoto and Kenneth W. Church
Primarily assigned technology terms:
- algorithm
- binary search
- binding
- cd-rom
- computational linguistics
- computing
- corpus-based natural language processing
- cutoff
- database
- database indexing
- dictionary maintenance
- estimator
- good-turing method
- grouping
- idf df
- illustration
- indexing
- information retrieval
- japanese word extraction
- language processing
- likelihood estimator
- matching
- maximum likelihood
- maximum likelihood estimator
- natural language processing
- nlp
- processing
- recognition
- recognizer
- search
- searching
- segmentation
- speech recognition
- statistical natural language processing
- term weighting
- terminology
- weighting
- word extraction
- word segmentation
Other assigned terms:
- alphabet
- approach
- array
- case
- characters
- chinese characters
- cluster
- compact representation
- compositionality
- concordance
- corpora
- correlation
- data structure
- dictionaries
- dictionary
- discourse
- discourse structure
- distribution
- document
- document frequency
- empty string
- english corpus
- english dictionary
- english text
- english translation
- events
- expository convenience
- fact
- foreign words
- general vocabulary
- grammar
- heuristics
- implementation
- independence assumption
- interpolation
- interpretation
- inverse document frequency
- japanese corpus
- japanese text
- japanese word
- kanji
- katakana
- keyword
- large corpora
- large corpus
- lexicographer
- lexicography
- likelihood
- linguist
- linguistics
- method
- mutual information
- n-gram
- n-grams
- names
- natural language
- nlp applications
- noise
- noisy channel
- occurrence probability
- perplexity
- phrase
- prepositions
- probabilities
- probability
- procedure
- process
- processing time
- recursion
- ridf value
- search space
- statistical natural language
- statistics
- substring
- suffix
- suffixes
- symbol
- technical terminology
- technique
- term
- term frequency
- terms
- text
- tokens
- trees
- vocabulary
- wall street journal corpus
- word
- word sequences
- words
- wsj corpus