Authors
Grant Gehrke, Craig Martell, Andrew Schein, Pranav Anand
Publication date
2009/9
Journal
International Journal of Semantic Computing
Volume
3
Issue
03
Pages
365-382
Publisher
World Scientific Publishing Company
Description
Author identification algorithms attempt to ascribe documents to authors, with an eye towards diverse application areas including forensic evidence, authenticating communications, and intelligence gathering. We view author identification as a single-label classification problem, where 2000 authors would imply 2000 possible categories to assign to a post. Experiments with a naive Bayes classifier on a blog author identification task demonstrate a remarkable tendency to over-predict the most prolific authors. A literature search confirms that the class imbalance phenomenon is a challenge for author identification as well as other machine learning tasks. We develop a vector projection method to remove this hazard, and achieve a 63% improvement in accuracy over the baseline on the same task. Our method adds no additional asymptotic computational complexity to naive Bayes, and has no free parameters to set. The …
Scholar articles
G Gehrke, C Martell, A Schein, P Anand - International Journal of Semantic Computing, 2009