R regex that ignores some punctuation at the end of a URL string
Is it possible to use a regex function that ignores punctuation (but not "/"s) at the end of URL strings (i.e. punctuation at the end of a URL string followed by a space) when they are extracted? When extracting URLs, I'm getting periods, parentheses, question marks, and exclamation points at the end of the strings I extract. For example:
    findurl <- function(x){
      m <- gregexpr("http[^[:space:]]+", x, perl=TRUE)
      w <- unlist(regmatches(x,m))
      op <- paste(w,collapse=" ")
      return(op)
    }
    x <- "find out more @ http://bit.ly/ss/vuer). check out here http://bit.ly/14pwinr)? http://bit.ly/108vjom! now!"
    findurl(x)
    [1] "http://bit.ly/ss/vuer). http://bit.ly/14pwinr)? http://bit.ly/108vjom!"
and
    findurl2 <- function(x){
      m <- gregexpr("www[^[:space:]]+", x, perl=TRUE)
      w <- unlist(regmatches(x,m))
      op <- paste(w,collapse=" ")
      return(op)
    }
    y <- "this www.example.com/store/locator. of type of www.example.com/google/voice. data i'd extract www.example.com/network. it?"
    findurl2(y)
    [1] "www.example.com/store/locator. www.example.com/google/voice. www.example.com/network."
Is there a way to modify these functions so that if one of . ) ? ! or , (or, if possible, one of ). )? )! or ),) is found at the end of a URL string followed by a space, it is not extracted? In other words, periods, parentheses, question marks, exclamation points, or commas at the end of a URL string should be left out.
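Use a positive lookahead. You can also combine both functions into one. The lazy quantifier stops the match as soon as everything that follows, up to the next whitespace or the end of the string, is punctuation: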
    findurl <- function(x){
      m <- gregexpr("\\b(?:www|http)[^[:space:]]+?(?=[^\\s\\w]*(?:\\s|$))", x, perl=TRUE)
      w <- unlist(regmatches(x,m))
      op <- paste(w,collapse=" ")
      return(op)
    }
    x <- "find out more @ http://bit.ly/ss/vuer). check out here http://bit.ly/14pwinr)? http://bit.ly/108vjom! now!"
    y <- "this www.example.com/store/locator. of type of www.example.com/google/voice. data i'd extract www.example.com/network. it?"
    findurl(x)
    findurl(y)
    # [1] "http://bit.ly/ss/vuer http://bit.ly/14pwinr http://bit.ly/108vjom"
    # [1] "www.example.com/store/locator www.example.com/google/voice www.example.com/network"
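One thing to watch: the lookahead in the pattern above, [^\\s\\w]*, treats any trailing non-word character as strippable, so a trailing "/" is dropped as well. Since the question asks to keep slashes, here is a minimal sketch of a narrower variant. The name find_url_strict and the bracket set [.,!?)] are assumptions for illustration, based only on the punctuation listed in the question; the function also returns a character vector rather than one pasted string.

```r
# Hypothetical variant: strip only . , ! ? ) at the end of a URL,
# so that a trailing "/" stays part of the match.
find_url_strict <- function(x) {
  # Lazy match of non-space characters, stopping as soon as the rest
  # (up to whitespace or end of string) consists only of . , ! ? )
  pattern <- "\\b(?:www|http)[^[:space:]]+?(?=[.,!?)]*(?:\\s|$))"
  m <- gregexpr(pattern, x, perl = TRUE)
  unlist(regmatches(x, m))
}

z <- "see http://bit.ly/abc/). and www.example.com/path, ok"
find_url_strict(z)
# [1] "http://bit.ly/abc/"   "www.example.com/path"
```

If other trailing characters should be stripped too (for example ";" or ":"), they would need to be added to the bracket set.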