In R: a regex that ignores some punctuation at the end of a URL string


Is it possible to use a regex in a function that ignores punctuation (but not "/"s) at the end of URL strings (i.e. punctuation at the end of a URL string that is followed by a space) when they are extracted? When extracting URLs, I'm getting periods, parentheses, question marks, and exclamation points at the end of the strings I extract. For example:

    findurl <- function(x){
      m <- gregexpr("http[^[:space:]]+", x, perl=TRUE)
      w <- unlist(regmatches(x, m))
      op <- paste(w, collapse=" ")
      return(op)
    }

    x <- "find out more @ http://bit.ly/ss/vuer). check out here http://bit.ly/14pwinr)? http://bit.ly/108vjom! now!"

    findurl(x)
    # [1] "http://bit.ly/ss/vuer). http://bit.ly/14pwinr)? http://bit.ly/108vjom!"

and

    findurl2 <- function(x){
      m <- gregexpr("www[^[:space:]]+", x, perl=TRUE)
      w <- unlist(regmatches(x, m))
      op <- paste(w, collapse=" ")
      return(op)
    }

    y <- "this www.example.com/store/locator. of type of www.example.com/google/voice. data i'd extract www.example.com/network. it?"

    findurl2(y)
    # [1] "www.example.com/store/locator. www.example.com/google/voice. www.example.com/network."

Is there a way to modify these functions so that if . ) ? ! or , or (if possible) ). )? )! or ), is found at the end of a URL string followed by a space (i.e. periods, parentheses, question marks, exclamation points, or commas at the end of a URL string followed by a space), it is not extracted?

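Use a positive lookahead. You can also combine both patterns into one function. The lazy [^[:space:]]+? stops as soon as the lookahead sees nothing but characters that are neither whitespace nor word characters between the current position and the next whitespace (or the end of the string), so the trailing punctuation is left out of the match: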

    findurl <- function(x){
      # match from "www" or "http" lazily, stopping before trailing punctuation
      m <- gregexpr("\\b(?:www|http)[^[:space:]]+?(?=[^\\s\\w]*(?:\\s|$))", x, perl=TRUE)
      w <- unlist(regmatches(x, m))
      op <- paste(w, collapse=" ")
      return(op)
    }

    x <- "find out more @ http://bit.ly/ss/vuer). check out here http://bit.ly/14pwinr)? http://bit.ly/108vjom! now!"
    y <- "this www.example.com/store/locator. of type of www.example.com/google/voice. data i'd extract www.example.com/network. it?"

    findurl(x)
    # [1] "http://bit.ly/ss/vuer http://bit.ly/14pwinr http://bit.ly/108vjom"
    findurl(y)
    # [1] "www.example.com/store/locator www.example.com/google/voice www.example.com/network"
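One caveat: "/" is neither whitespace nor a word character, so the lookahead [^\s\w]* also drops a trailing slash, which the question wanted to keep. A minimal sketch of a variant, assuming the only characters to strip are the ones listed in the question (. , ! ? and a closing parenthesis), could spell that set out explicitly. The function name find_url_keep_slash and the sample string z are made up for illustration:

```r
# Hypothetical variant: strip only . , ! ? ) at the end of a URL,
# so that a trailing "/" stays part of the match.
find_url_keep_slash <- function(x){
  # the lookahead allows only the listed punctuation before whitespace or end of string
  pattern <- "\\b(?:www|http)[^[:space:]]+?(?=[.,!?)]*(?:\\s|$))"
  m <- gregexpr(pattern, x, perl=TRUE)
  w <- unlist(regmatches(x, m))
  paste(w, collapse=" ")
}

# illustrative input, not from the original question
z <- "see http://example.com/path/, or www.example.com/dir/. done"

find_url_keep_slash(z)
# [1] "http://example.com/path/ www.example.com/dir/"
```

Any other trailing character (a semicolon or a closing bracket, say) would stay attached to the URL with this version, so the character class would need extending for real-world text.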
