Splitting strings around 2 parameters
-
Hi, I really have not mastered strings, so this is probably a pretty beginner question. But I want to parse a string from an html. I want to find everything between the <span> and </span> tags in the string. How would I do that? Here is an example string that I could be parsing:
<dl class="apireference"> <dt id="copyright"><span class="myclass">I want all this text. All of it as a single string.</span><span class="version">SketchUp 6.0+</span></dt>
Any thoughts? Thanks folks,
Chris
-
txt1="<dl class=\"apireference\"> <dt id=\"copyright\"><span class=\"myclass\">I want all this text. All of it as a single string.</span><span class=\"version\">SketchUp 6.0+</span></dt>"
puts txts1=txt1.split("<span class=\"myclass\">")
puts txt2=txts1[1]###***
puts txts2=txt2.split("</span>")
puts txt3=txts2[0]
Split the string at the designated text into pieces in an array.
***If you are not sure which array item it'll be in 'txts1' then you can add a test for each until you find one that doesn't start with a '<' etc - i.e. it's your 'string'...
txt2=""
txts1.each{|txt|
  if not txt=~/^[<]/
    txt2=txt
    break
  end#if
}
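The split approach above can be wrapped in a small helper. This is only a sketch: the name `text_between` and its nil-on-missing-tag behaviour are assumptions for illustration, not something from the posts above. It passes a limit of 2 to `split` so that only the first occurrence of each tag is used.

```ruby
# Hypothetical helper: returns the text between the first open_tag and the
# next close_tag, or nil when either tag is missing.
def text_between(html, open_tag, close_tag)
  pieces = html.split(open_tag, 2)      # at most two pieces: before / after open_tag
  return nil if pieces.length < 2       # open_tag not found
  rest = pieces[1]
  return nil unless rest.include?(close_tag)
  rest.split(close_tag, 2)[0]           # everything before the first close_tag
end

html = '<dt id="copyright"><span class="myclass">I want all this text.</span>' \
       '<span class="version">SketchUp 6.0+</span></dt>'

puts text_between(html, '<span class="myclass">', '</span>')
```

Returning nil instead of raising lets the caller decide what to do when the page layout changes and the tag is no longer there.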
-
Or a RegEx: http://ruby-doc.org/core/classes/String.html#M000812
> match = str.scan(/<span(?:\s+.*?)>(.*?)<\/span>/)
[["I want all this text. All of it as a single string."], ["SketchUp 6.0+"]]
> match[0]
["I want all this text. All of it as a single string."]
> match[1]
["SketchUp 6.0+"]
> str1 = match[0][0]
"I want all this text. All of it as a single string."
> str2 = match[1][0]
"SketchUp 6.0+"
And the matches can also be accessed by a block:
> str.scan(/<span(?:\s+.*?)>(.*?)<\/span>/) { |match| p match[0] }
"I want all this text. All of it as a single string."
"SketchUp 6.0+"
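If only the span with one particular class is wanted, the same `scan` idea can be narrowed down. The helper below is a sketch under assumptions: the name `span_texts` is made up, it uses `[^>]*` instead of the pattern above so that a bare `<span>` also matches, and it expects simple, non-nested spans whose class attribute holds exactly one class name in double quotes.

```ruby
# Hypothetical helper: returns the text of every <span> element, or only of
# spans with the given class. Not a real HTML parser; nested spans will break it.
def span_texts(html, css_class = nil)
  pattern =
    if css_class
      /<span[^>]*\bclass="#{Regexp.escape(css_class)}"[^>]*>(.*?)<\/span>/m
    else
      /<span\b[^>]*>(.*?)<\/span>/m
    end
  html.scan(pattern).flatten   # scan yields one-element arrays; flatten to strings
end

str = '<dl class="apireference"> <dt id="copyright"><span class="myclass">' \
      'I want all this text. All of it as a single string.</span>' \
      '<span class="version">SketchUp 6.0+</span></dt>'

p span_texts(str)
p span_texts(str, "version")
```

The `(.*?)` is the non-greedy part that matters: a plain `(.*)` would run on to the last `</span>` in the string and return both spans as one match.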
-
Your last method to extract all of the strings is very elegant compared to my clumsy hack... however, I do find the construction of the RegEx test somewhat difficult - after many unsuccessful tests my quick 'hack' looked more appealing - but now you've made an example the 'crib' is there...
-
RegEx are a pain to learn IMO. I started meddling with them when I was doing web design, since you need to do a lot of string processing. For a long time I created my regexes on a hit-and-miss basis, but slowly I've managed to get a better grasp of them. There are still many features of the system I don't know how to use, but I know the basics to sniff out and extract basic data.
A very nice tool to use for testing regular expressions is this: http://www.rubular.com/
It updates live as you modify the expression, and there is a quick reference at the bottom to jog your memory.
-
Thanks for the site - useful...
-
Awesome guys! Thanks so much, I'll be working these into my script later today. Thanks again,
Chris
-
easier to understand:
# htstr would be the html you grab
htstr='<dl class="apireference"> <dt id="copyright"><span class="myclass">I want all this text. All of it as a single string.</span><span class="version">SketchUp 6.0+</span></dt>'
#
# replace first html tag with <***>
s1=htstr.sub('<span class="myclass">','<***>')
#
# replace second html tag with <***>
s2=s1.sub('</span>','<***>')
#
# now split using your custom <***> delimiter
# and take the second array element [1]
apistr=s2.split('<***>')[1]
#
# >> I want all this text. All of it as a single string.
it could be condensed into a one-liner method:
def grabAPI( htstr )
  return htstr.sub('<span class="myclass">','<***>').sub('</span>','<***>').split('<***>')[1]
end
#
-
That's awesome, thanks Dan! I'm going to play with this tonight. String parsing is not my favorite thing currently, but you guys are making it bearable.
Chris
-
Here's another example using substrings specified by range offsets:
(I dup'd the string just in case, because I'm slicing off the first unused part.)
def grabAPI( htstr )
  temp = htstr.dup
  temp.slice!(0..temp.index('<span class="myclass">')+21)
  return temp[0..(temp.index('</span>')-1)]
end
#
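The `+21` above is the length of the opening tag minus one, so it has to be recounted whenever the tag changes. A variation on the same offset idea, sketched here with an assumed helper name `grab_between`, takes the offset from the tag's own length and returns nil when a tag is not found. It does not modify the string, so no `dup` is needed.

```ruby
# Hypothetical helper: slices out the text between open_tag and close_tag
# using offsets computed from the tags themselves.
def grab_between(html, open_tag, close_tag)
  start = html.index(open_tag)
  return nil if start.nil?
  start += open_tag.length               # first character after the opening tag
  finish = html.index(close_tag, start)  # look for close_tag only after that point
  return nil if finish.nil?
  html[start...finish]                   # exclusive range stops before close_tag
end

htstr = '<dl class="apireference"> <dt id="copyright"><span class="myclass">' \
        'I want all this text. All of it as a single string.</span>' \
        '<span class="version">SketchUp 6.0+</span></dt>'

puts grab_between(htstr, '<span class="myclass">', '</span>')
```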
-
OK, this is remarkably painful, but still somehow keeping me amused. I stay up late every night trying to figure out how to parse this text. Thanks to everyone who is chiming in.
New question. What is this error?
(eval):62: warning: string pattern instead of regexp; metacharacters no longer effective
I am getting it for 2 different lines of code:
temp_info_str_array.sub(" ", "") if temp_info_str_array[0] == 32
and
temp_str = str.split("***")
In the first one I just wanted to remove the first character of the string if it is a space. The second one seems pretty simple: just split a string at the *** delimiter. But each of these lines seems to be throwing that error, and I'm not exactly sure what it means. I'm guessing I'm just doing something wrong. Any ideas what it is?
Chris
-
Not an error, just warning that your match pattern is not a regex.
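One way to make the warning go away is to hand `sub` and `split` an actual Regexp instead of a string. The snippet below is a sketch with made-up sample strings; `Regexp.escape` is used so the asterisks are matched literally rather than as repetition operators. Separately, note that `sub` without the `!` returns a new string and leaves the original alone, so the first line in the question would discard its result unless it is assigned or changed to `sub!`.

```ruby
str = "one***two***three"

# Regexp literal with each asterisk escaped so it matches a literal '*'.
parts_a = str.split(/\*\*\*/)

# The same pattern built from a plain string via Regexp.escape.
delimiter = Regexp.new(Regexp.escape("***"))
parts_b = str.split(delimiter)

p parts_a
p parts_b

# Remove one leading space, anchored to the very start of the string.
info = " leading space"
trimmed = info.sub(/\A /, "")   # sub returns a new string; info is unchanged
p trimmed
```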
-
temp_info_str_array.gsub!(/^ /,'')
should remove just the first white-space, or try
temp_info_str_array.strip!
to remove all leading and trailing white-spaces
str.lstrip!
to remove all leading white-spaces
str.rstrip!
to remove all trailing white-spaces
str.slice!()
to remove the specified portion(s) of the string,
e.g. str.slice!(0)
removes the first character, also
str.chomp!
typically to remove the trailing \n
etc
str.chop!
to remove the last character
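A quick side-by-side of the methods listed above, as a sketch on a made-up sample string. It uses the non-bang versions so each result can be shown separately; the bang versions change the string in place and return nil when there was nothing to change, so their return value should not be relied on. `slice!(0, 1)` is used rather than `slice!(0)` so the removed piece is a string on both Ruby 1.8 and 1.9.

```ruby
s = "  hello world \n"

p s.lstrip   # => "hello world \n"    leading whitespace removed
p s.rstrip   # => "  hello world"     trailing whitespace removed
p s.strip    # => "hello world"       both ends trimmed
p s.chomp    # => "  hello world "    only the trailing newline removed
p s.chop     # => "  hello world "    last character removed, whatever it is

t = " abc"
removed = t.slice!(0, 1)   # cuts the first character out of t and returns it
p removed    # => " "
p t          # => "abc"
```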
etc. etc. There are very many 'string' methods.
-
@chris fullmer said:
temp_info_str_array.sub(" ", "") if temp_info_str_array[0] == 32
The if condition has an error, should be:
... if temp_info_str_array[0] == 32.chr
but as TIG said,
temp_info_str_array.lstrip!
is much easier.
-
@dan rathbun said:
The if condition has an error, should be:
... if temp_info_str_array[0] == 32.chr
Nope - not under Ruby 1.8.
"string"[0]    # => 115
"string"[0,1]  # => "s"
This was changed in 1.9 though.
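For a script that may run under either version, the two-argument form `str[0, 1]` always gives a one-character string, so a comparison against `" "` behaves the same in both. A minimal sketch, with `starts_with_space?` as an assumed helper name:

```ruby
# Hypothetical helper: true when the string begins with a space character.
# str[0, 1] returns a one-character String on Ruby 1.8 and 1.9 alike.
def starts_with_space?(str)
  str[0, 1] == " "
end

p starts_with_space?(" string")
p starts_with_space?("string")

# If the numeric character code is wanted instead, unpack("C") reads the
# first byte as an unsigned integer on both versions.
p " string".unpack("C")[0]
```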
-
@thomthom said:
@dan rathbun said:
The if condition has an error, should be:
... if temp_info_str_array[0] == 32.chr
Nope - not under Ruby 1.8.
"string"[0]    # => 115
"string"[0,1]  # => "s"
I stand corrected. (Confused it with Pascal for a minute there.)
I always think of Strings as Arrays of Char, where a subscript should return the character at that index.
So for Ruby I'd probably need to do:
" string"[0..0]==32.chr
It's just kinda weird.
@thomthom said:
This was changed in 1.9 though.
What did they change it to?
EDIT: n/m, I see they changed it to the way I expected it to work. And added the String.ord method to return the ASCII ordinal. That's the way it should work! Like:
" string"[0].ord==32
>> true # in ver1.9.x
-
I got caught on this the first time I tried to extract characters at indexes as well, being used to PHP. And it really is counter-intuitive the way Ruby 1.8 works.
-
@thomthom said:
And it really is counter-intuitive the way Ruby 1.8 works.
Agree! .. but at least they are revising Ruby to correct things to the way they should be.