Trying to parse IMDb but the links are different each time I open the site



I am trying to get the links to all pages listing popular feature films on IMDb. There is no problem with the first 2000 pages, since they all have exactly the same URL structure, for example:



http://ift.tt/1IWIa8Z

http://ift.tt/1FVQ1NW


Each page contains 50 links to movies, so the start parameter in the URL means that the page lists movies from position start to start + 49.


The problem is with the pages that come after the one with parameter start=99951. Each of their URLs has an extra part at the end, such as &tok=0f97, for example:



http://ift.tt/1IWI814


So when I try to parse these pages to get the links to all 50 movies (I use R for this), I get nothing.
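That tok value looks like a token IMDb appends to pagination URLs past a certain offset, so it probably cannot be precomputed. One possible workaround (a sketch, not tested against the live site) is to stop constructing the URLs and instead follow each page's own "Next" link, so whatever token the site adds comes along for free. The XPath for the "Next" anchor and the starting URL below are assumptions that may need adjusting:

```r
library(XML)

getNextUrl <- function(url) {
  doc <- htmlParse(url)
  # hypothetical selector: an <a> whose text contains "Next"
  nxt <- xpathSApply(doc, "//a[contains(text(), 'Next')]/@href")
  free(doc)
  if (length(nxt) == 0) return(NA)
  # relative hrefs need the host prepended
  paste("http://www.imdb.com", nxt[1], sep = "")
}

# walk the listing page by page instead of generating start=... URLs;
# getLinks() is the function defined further down in the question
url <- "http://www.imdb.com/search/title?title_type=feature&start=1"  # assumed start page
allTitleLinks <- character(0)
while (!is.na(url)) {
  allTitleLinks <- c(allTitleLinks, getLinks(url))
  url <- getNextUrl(url)
}
```

This trades the fixed page count for a loop that ends when no "Next" link is found, so it also avoids hard-coding 318485/50.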


The code I use to parse the pages, which works for the first 2000 of them:



library(stringi)  # stri_replace_all_regex, stri_detect_regex
library(XML)      # getHTMLLinks

makeListOfUrls <- function() {
  howManyPages <- round(318485/50)
  urlStart <- "http://ift.tt/1IWIa8Z"
  linksList <- list()
  for (i in 1:howManyPages) {
    j <- 50 * (i - 1) + 1
    print(j)
    startNew <- paste("start=", j, sep = "")
    urlNew <- stri_replace_all_regex(urlStart, "start=1", startNew)
    titleLinks <- getLinks(urlNew)

    ## I get an empty character vector for pages 2001 and beyond !!!

    linksList[[i]] <- makeLongPath(titleLinks)
  }
  vector <- combineList(linksList)
  return(vector)
}

getLinks <- function(url) {
  allLinks <- getHTMLLinks(url, xpQuery = "//@href")
  titleLinks <- allLinks[stri_detect_regex(allLinks, "^/title/tt[0-9]+/$")]

  # there are no links for movies on the pages after 2000 (titleLinks is empty)

  titleLinks <- titleLinks[!duplicated(titleLinks)]
  return(titleLinks)
}
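For reference, the filtering step in getLinks can be checked offline: the XPath query collects every href on the page, and the stringi filter then keeps only canonical /title/tt…/ paths and drops duplicates. A small self-contained illustration (the example hrefs are made up):

```r
library(stringi)

hrefs <- c("/title/tt0111161/",             # movie link
           "/title/tt0111161/",             # same movie linked twice on the page
           "/search/title?start=51",        # pagination link, filtered out
           "/title/tt0111161/fullcredits")  # sub-page, excluded by ^...$ anchors
titleLinks <- hrefs[stri_detect_regex(hrefs, "^/title/tt[0-9]+/$")]
titleLinks <- titleLinks[!duplicated(titleLinks)]
# titleLinks is now just "/title/tt0111161/"
```

So an empty result on the later pages means getHTMLLinks itself returned no matching hrefs, not that the regex is wrong.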

makeLongPath <- function(links) {
  longPaths <- paste("http://www.imdb.com", links, sep = "")
  return(longPaths)
}

combineList <- function(UrlList) {
  n <- length(UrlList)
  if (n == 1) {
    return(UrlList)
  } else {
    tmpV <- UrlList[[1]]
    for (i in 2:n) {
      cV <- c(tmpV, UrlList[[i]])
      tmpV <- cV
    }
    return(tmpV)
  }
}
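As an aside, combineList reimplements what base R's unlist already does, and it also has an edge case: when n == 1 it returns the list itself rather than its single element. unlist handles both uniformly:

```r
# Flattening a list of character vectors into one vector, preserving order:
linksList <- list(c("a", "b"), "c", c("d", "e"))
combined <- unlist(linksList, use.names = FALSE)
# combined is c("a", "b", "c", "d", "e"), also for a one-element list
```

So `vector <- unlist(linksList)` could replace the combineList call in makeListOfUrls.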
