Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection (Data-Centric Systems and Applications)

Category: Computer Science
Author: Peter Christen


by olooney   2018-10-04
Some of the best textbooks in statistics and machine learning:

Applied
-------

Hosmer et al., Applied Logistic Regression. An exhaustive guide to the perils and pitfalls of logistic regression. Logistic regression is the power tool of interpretable statistical models, but if you don't understand it, it will take your foot off (concretely, your inferences will be wrong and your peers will laugh at you). This book is essential. Graduate level, or perhaps advanced undergraduate, intended for STEM and social science grad students.

Peter Christen's Data Matching. Record linkage is a relatively niche topic, so Christen's book has no right to be as good as it is. But it covers every relevant topic in a clear, even-handed way. If you are working on a record linkage system, then there's nothing in this book you can afford not to know. Undergraduate level, but intended for industry practitioners.

Max Kuhn's Applied Predictive Modeling. Even if you don't use R, this is an incredibly good introduction to how predictive modeling is done in practice. Early undergraduate level.

Theoretical
-----------

The Elements of Statistical Learning. Probably the single most respected book in machine learning. Exhaustive and essential. Advanced undergraduate level.

Kevin Murphy's Machine Learning: A Probabilistic Perspective. Covers lots of the same ground as Elements but is a little easier. Undergraduate level.

Taboga's Lectures on Probability Theory and Mathematical Statistics. Has the distinction of being available for free in web-friendly format at