I'm trying to get the links to all pages listing popular feature films on IMDb. There is no problem with the first 2000 pages, since they all have exactly the same structure, for example:
http://ift.tt/1IWIa8Z
http://ift.tt/1FVQ1NW
Each page contains 50 links to movies, and the start parameter in the URL indicates that the page lists movies from start to start + 49.
The problem is with the pages that follow the one with start=99951: each of their URLs ends with an extra part such as &tok=0f97, for example
http://ift.tt/1IWI814
So when I try to parse these pages to get the links to all 50 movies (I use R for this), I get nothing.
Here is the code I use to parse the pages; it works for the first 2000 pages:
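To make the URL scheme concrete, here is a minimal sketch of how the start parameter could be swapped out while leaving any trailing parameters (such as &tok=...) untouched. The example.com URLs and the setStart helper are placeholders for illustration, since the real list URLs are hidden behind the shortened links above:

```r
library(stringi)

# Hypothetical stand-ins for the real list URLs.
urlEarly <- "http://www.example.com/list?start=1"
urlLate  <- "http://www.example.com/list?start=100001&tok=0f97"

# Replace whatever start=<number> is present; any trailing
# parameters (e.g. &tok=...) are left as-is.
setStart <- function(url, start) {
  stri_replace_all_regex(url, "start=[0-9]+",
                         paste("start=", start, sep = ""))
}

setStart(urlEarly, 51)     # "http://www.example.com/list?start=51"
setStart(urlLate, 100051)  # "http://www.example.com/list?start=100051&tok=0f97"
```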
library(XML)      # getHTMLLinks()
library(stringi)  # stri_replace_all_regex(), stri_detect_regex()

makeListOfUrls <- function() {
  howManyPages <- round(318485 / 50)
  urlStart <- "http://ift.tt/1IWIa8Z"
  linksList <- list()
  for (i in 1:howManyPages) {
    j <- 50 * (i - 1) + 1
    print(j)
    # Swap the start=1 of the template URL for the current offset.
    startNew <- paste("start=", j, sep = "")
    urlNew <- stri_replace_all_regex(urlStart, "start=1", startNew)
    titleLinks <- getLinks(urlNew)
    ## I get an empty character vector for pages 2001 and beyond!!!
    linksList[[i]] <- makeLongPath(titleLinks)
  }
  vector <- combineList(linksList)
  return(vector)
}

getLinks <- function(url) {
  # Collect every href on the page, then keep only the movie title links.
  allLinks <- getHTMLLinks(url, xpQuery = "//@href")
  titleLinks <- allLinks[stri_detect_regex(allLinks, "^/title/tt[0-9]+/$")]
  # There are no movie links for the pages after 2000 (titleLinks is empty).
  titleLinks <- titleLinks[!duplicated(titleLinks)]
  return(titleLinks)
}

makeLongPath <- function(links) {
  # Prepend the host to the relative /title/tt.../ paths.
  longPaths <- paste("http://www.imdb.com", links, sep = "")
  return(longPaths)
}

combineList <- function(UrlList) {
  # Concatenate the per-page character vectors into one vector.
  n <- length(UrlList)
  if (n == 1) {
    return(UrlList[[1]])  # return the vector itself, not the enclosing list
  } else {
    tmpV <- UrlList[[1]]
    for (i in 2:n) {
      tmpV <- c(tmpV, UrlList[[i]])
    }
    return(tmpV)
  }
}
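As an aside, the concatenation loop in combineList can be replaced by a single unlist call, which flattens a list of character vectors in one step and sidesteps the single-page special case entirely. A minimal sketch:

```r
combineList <- function(UrlList) {
  # Flatten the list of per-page character vectors into one vector.
  unlist(UrlList, use.names = FALSE)
}

combineList(list(c("a", "b"), "c"))  # c("a", "b", "c")
```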