Wonder

Author: R. J. Palacio
4.9

Comments

by anonymous   2018-01-07

I would be grateful for some help. I want to examine how reader preferences change over time among Amazon's best-selling books. The page "https://www.amazon.com/gp/bestsellers/2017/books/" shows a different set of best-sellers for each year. Following the link to each book shows which category it falls into: the first book, for example, is a “Children’s Book”, the second best-seller is “Literature & Fiction”, and so forth. In the end, I want to count and visualize how each category changes and derive a hypothesis. As for the web-mining code: I don’t know how to visit each link. (It seems you can’t simply change part of the URL?)

https://www.amazon.com/Wonder-R-J-Palacio/dp/0375869026/ref=zg_bsar_books_1?_encoding=UTF8&psc=1&refRID=WHP2CV9Z86NK5VYK3W27

-> Nothing in this URL clearly shows that the book is ranked first, so it seems to make more sense to select the entries by XPath.

//*[@id="zg_centerListWrapper"]/div[2]/div[2]/div/a/div[2]

-> This needs to be repeated for the next 80 items and for the years 2013-2017 (the page only shows 20 best-sellers of a given year at a time). How can I implement this with a while loop?
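One possible way to cover all years and pages without a while loop is to build every list URL up front and then iterate over them. The sketch below assumes the other years follow the same URL pattern as the 2017 link above, and that 100 items are split over 5 pages of 20. The `pg` query parameter is purely a guess at how paging might be addressed and has to be checked against the real site; `build_list_urls` is a made-up helper name.

```r
# Sketch: build one URL per (year, page) combination.
# ASSUMPTION: every year uses the same pattern as the 2017 link in the question.
# ASSUMPTION: "pg" as the paging parameter is hypothetical and unverified.
build_list_urls <- function(years, pages, page_param = "pg") {
  grid <- expand.grid(page = pages, year = years)  # page varies fastest
  data.frame(
    year = grid$year,
    page = grid$page,
    url  = sprintf("https://www.amazon.com/gp/bestsellers/%d/books/?%s=%d",
                   as.integer(grid$year), page_param, as.integer(grid$page)),
    stringsAsFactors = FALSE
  )
}

list_urls <- build_list_urls(years = 2013:2017, pages = 1:5)

# A for loop is enough here because the range of years and pages is known.
for (i in seq_len(nrow(list_urls))) {
  message("Would fetch: ", list_urls$url[i])
  # page <- rvest::read_html(list_urls$url[i])  # uncomment to actually download
  # Sys.sleep(2)                                # pause between requests
}
```

A while loop would only be needed if the number of pages were unknown, for example when following a "next page" link until none is left.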

So, the code should go to each of the links and extract basically everything:

  • name of the book
  • name of the category
  • rank in the given year
  • number of pages
  • number of customer reviews
  • customer rating

Here's what I have so far:

library(xml2)
library(httr)
library(rvest)
library(selectr)

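# Download and parse the 2017 best-seller list page
amazon_page <- read_html("https://www.amazon.com/gp/bestsellers/2017/books/")
# html_nodes() needs the parsed page as its first argument; the XPath is wrapped
# in single quotes so that the double quotes inside it do not end the string
amazon_title <- html_nodes(amazon_page, xpath = '//*[@id="zg_centerListWrapper"]/div[2]/div[2]/div/a/div[2]')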
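
Once nodes are selected, the title text and the link to each book page can be pulled out, and the rank can be taken from the position in the list. The sketch below runs on a small made-up HTML fragment that only loosely imitates the nesting implied by the XPath in the question. The real page may be structured differently, and both the `//a` path and the idea that links are relative to "https://www.amazon.com" are assumptions.

```r
library(rvest)

# Placeholder list page: NOT real Amazon markup, only here to make the sketch runnable.
toy_list <- '<div id="zg_centerListWrapper">
  <div class="item"><a href="/first-book/dp/0000000001"><span>First Book</span></a></div>
  <div class="item"><a href="/second-book/dp/0000000002"><span>Second Book</span></a></div>
</div>'
page <- read_html(toy_list)

# Select every link inside the list wrapper, in document order.
links <- html_nodes(page, xpath = '//*[@id="zg_centerListWrapper"]//a')

books <- data.frame(
  rank  = seq_along(links),                       # position in the list = rank on this page
  title = html_text(links, trim = TRUE),          # visible text of each link
  url   = paste0("https://www.amazon.com",        # ASSUMPTION: hrefs are relative paths
                 html_attr(links, "href")),
  stringsAsFactors = FALSE
)
print(books)
```

On later pages of the same year, the page offset would have to be added to `rank` (for instance 20 per page, given that 20 items are shown at a time).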