http://www.speech.sri.com/projects/srilm/
Google N-grams data
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
Sunday, June 10, 2012
Monday, May 14, 2012
Precision and Recall in plain English
In Information Retrieval,
Precision is the fraction of retrieved documents that are relevant to the user's information need. It measures how many bad results are mixed in with the retrieved ones.
Recall is the fraction of relevant documents in the collection that are retrieved. It measures how much of the good information in the document collection the system succeeds in finding.
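To make the two definitions concrete, here is a minimal sketch in Python. The function name precision_recall and the document IDs are made up for illustration; the only assumption is that retrieved and relevant documents can be represented as sets of IDs.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document IDs."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # documents that are both retrieved and relevant
    # Guard against empty sets so we never divide by zero.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 5 relevant documents in the collection.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d3", "d5", "d6", "d7"}
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Here two of the four retrieved documents are relevant, so precision is 2/4, and two of the five relevant documents were found, so recall is 2/5.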
Thursday, May 10, 2012
Entropy in Information Retrieval
Information Theory
•Value or content of a message is based on how much the receiver’s uncertainty (entropy) is reduced
•Predictability of the message (impact of content)
–Very predictable – low uncertainty – low entropy
•Hello, good day, how are you? Fine.
–Unpredictable – high uncertainty – high entropy
•Move your car. Leave the building.
Information Content
Function H defines the Information Content: H(p) = -log p
where p is the a priori probability that a message could be predicted
So, if a receiver can predict a message with certainty,
then p=1 and H(1) = 0
If the receiver cannot predict the message at all,
then p=0 and H(0) is undefined (H(p) grows without bound as p approaches 0)
So the smaller p is, the larger H(p) is
In other words, the less predictable a message is, the more information it contains
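A small sketch of the information content function, assuming base-2 logarithms so that the result is in bits. The name information_content is mine, not from any library.

```python
import math


def information_content(p):
    """Information content in bits of a message with a priori probability p."""
    if not 0 < p <= 1:
        # p = 0 is rejected because H(0) is undefined.
        raise ValueError("p must be in the interval (0, 1]")
    # log2(1/p) is the same quantity as -log2(p).
    return math.log2(1.0 / p)


# The less likely the message, the more bits of information it carries.
for p in (1.0, 0.5, 0.25, 0.01):
    print(p, information_content(p))
```

A fully predictable message (p = 1) carries 0 bits, and each halving of p adds one more bit.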
Calculation of Entropy
Example: Receive one letter of the alphabet
H = -log2(1/26) = log2(26), or about 4.7 bits, if all letters are equally likely
4.14 bits given the known distribution of letter frequencies
Given n messages, where message i has probability p_i, the average information content (bits) of any one of those messages is
H = -\sum_{i=1}^{n} p_i \log_2 p_i
Average Entropy is maximized when all messages are equally likely
When would this occur?
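One way to explore this is to compute the average information content directly, weighting each message's -log2 p by its probability p. The sketch below assumes base-2 logarithms; the function name entropy and the skewed distribution are made up for illustration and are not real letter frequencies.

```python
import math


def entropy(probabilities):
    """Average information content in bits for a list of message probabilities."""
    # Terms with p = 0 are skipped: a message that never occurs contributes nothing.
    return sum(p * math.log2(1.0 / p) for p in probabilities if p > 0)


# 26 equally likely letters.
uniform = [1 / 26] * 26

# Hypothetical skewed alphabet: one letter takes half the probability mass,
# the other 25 letters share the rest equally.
skewed = [0.5] + [0.5 / 25] * 25

print(round(entropy(uniform), 2))
print(round(entropy(skewed), 2))
```

The uniform case reproduces the roughly 4.7 bits from the alphabet example, and the skewed alphabet comes out lower, which is what "maximized when all messages are equally likely" means in practice.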
Using Entropy
Information Content is additive for two independent messages: H(p1, p2) = H(p1) + H(p2)
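A quick numeric check of additivity, assuming the two messages are independent so that the probability of receiving both is p1 * p2. The helper information_content and the probabilities are illustrative only.

```python
import math


def information_content(p):
    """Information content in bits of a message with probability p (0 < p <= 1)."""
    return math.log2(1.0 / p)


# Hypothetical probabilities of two independent messages.
p1, p2 = 0.5, 0.125

# Independence assumption: the probability of both messages is the product.
joint = information_content(p1 * p2)
separate = information_content(p1) + information_content(p2)
print(joint, separate)
```

Both values come out the same because the logarithm turns the product p1 * p2 into a sum.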
So what??
Google Queries
some terms have more information value
some retrieval messages have more information value
SO??