I'm trying to get the links to all pages listing popular feature films on IMDb. There is no problem with the first 2000 pages, since they all have exactly the same structure, for example:
http://ift.tt/1IWIa8Z
http://ift.tt/1FVQ1NW
Each page contains 50 links to movies, and the start parameter in the URL indicates that the page lists movies from start to start + 49.
The problem is with the pages that follow the one with start=99951: each of their URLs ends with an extra part such as &tok=0f97, for example
http://ift.tt/1IWI814
So when I try to parse these pages to get the links to all 50 movies (I use R for this), I get nothing.
Here is the code I use to parse the pages; it works for the first 2000 pages:
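To make the URL scheme concrete, here is a minimal sketch of how the start parameter could be swapped out while leaving any trailing parameters (such as &tok=...) untouched. The example.com URLs and the setStart helper are placeholders for illustration, since the real list URLs are hidden behind the shortened links above:

```r
library(stringi)

# Hypothetical stand-ins for the real list URLs.
urlEarly <- "http://www.example.com/list?start=1"
urlLate  <- "http://www.example.com/list?start=100001&tok=0f97"

# Replace whatever start=<number> is present; any trailing
# parameters (e.g. &tok=...) are left as-is.
setStart <- function(url, start) {
  stri_replace_all_regex(url, "start=[0-9]+",
                         paste("start=", start, sep = ""))
}

setStart(urlEarly, 51)     # "http://www.example.com/list?start=51"
setStart(urlLate, 100051)  # "http://www.example.com/list?start=100051&tok=0f97"
```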
library(XML)      # getHTMLLinks()
library(stringi)  # stri_replace_all_regex(), stri_detect_regex()

makeListOfUrls <- function() {
  howManyPages <- round(318485 / 50)
  urlStart <- "http://ift.tt/1IWIa8Z"
  linksList <- list()
  for (i in 1:howManyPages) {
    j <- 50 * (i - 1) + 1
    print(j)
    # Swap the start=1 of the template URL for the current offset.
    startNew <- paste("start=", j, sep = "")
    urlNew <- stri_replace_all_regex(urlStart, "start=1", startNew)
    titleLinks <- getLinks(urlNew)
    ## I get an empty character vector for pages 2001 and beyond!!!
    linksList[[i]] <- makeLongPath(titleLinks)
  }
  vector <- combineList(linksList)
  return(vector)
}

getLinks <- function(url) {
  # Collect every href on the page, then keep only the movie title links.
  allLinks <- getHTMLLinks(url, xpQuery = "//@href")
  titleLinks <- allLinks[stri_detect_regex(allLinks, "^/title/tt[0-9]+/$")]
  # There are no movie links for the pages after 2000 (titleLinks is empty).
  titleLinks <- titleLinks[!duplicated(titleLinks)]
  return(titleLinks)
}

makeLongPath <- function(links) {
  # Prepend the host to the relative /title/tt.../ paths.
  longPaths <- paste("http://www.imdb.com", links, sep = "")
  return(longPaths)
}

combineList <- function(UrlList) {
  # Concatenate the per-page character vectors into one vector.
  n <- length(UrlList)
  if (n == 1) {
    return(UrlList[[1]])  # return the vector itself, not the enclosing list
  } else {
    tmpV <- UrlList[[1]]
    for (i in 2:n) {
      tmpV <- c(tmpV, UrlList[[i]])
    }
    return(tmpV)
  }
}
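As an aside, the concatenation loop in combineList can be replaced by a single unlist call, which flattens a list of character vectors in one step and sidesteps the single-page special case entirely. A minimal sketch:

```r
combineList <- function(UrlList) {
  # Flatten the list of per-page character vectors into one vector.
  unlist(UrlList, use.names = FALSE)
}

combineList(list(c("a", "b"), "c"))  # c("a", "b", "c")
```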