Friday, June 10, 2016

Remove HTML tag associated with a class


I am trying to learn to script purely in AppleScript, but I am stuck on removing a particular tag that has a given class. I have looked for solid documentation and examples, but so far they seem very limited.

Here is the HTML I have:

<p>Bacon ipsum dolor amet pork chop landjaeger short ribs boudin short loin jowl <span class="foo">shoulder</span> biltong shankle capicola drumstick pork loin rump spare ribs ham hock. <span class="bar">Pig brisket</span> jowl ham pastrami <span class="foo">jerky</span> strip steak bacon doner. Short loin leberkas jowl, filet mignon turducken chicken ribeye shank tail swine strip steak pork loin sausage. Frankfurter ground round porchetta, pork short ribs jowl alcatra flank sausage.</p>

What I am trying to do is remove the tags for one particular class while keeping the text inside them. Removing every <span class="foo"> and its closing </span> should give:

<p>Bacon ipsum dolor amet pork chop landjaeger short ribs boudin short loin jowl shoulder biltong shankle capicola drumstick pork loin rump spare ribs ham hock. <span class="bar">Pig brisket</span> jowl ham pastrami jerky strip steak bacon doner. Short loin leberkas jowl, filet mignon turducken chicken ribeye shank tail swine strip steak pork loin sausage. Frankfurter ground round porchetta, pork short ribs jowl alcatra flank sausage.</p>

I know how to do this with do shell script and from the terminal, but I want to learn what is available through AppleScript's dictionary.

While researching, I found a way to strip all HTML tags:

on removeMarkupFromText(theText)
    set tagDetected to false
    set theCleanText to ""
    repeat with a from 1 to length of theText
        set theCurrentCharacter to character a of theText
        if theCurrentCharacter is "<" then
            set tagDetected to true
        else if theCurrentCharacter is ">" then
            set tagDetected to false
        else if tagDetected is false then
            set theCleanText to theCleanText & theCurrentCharacter as string
        end if
    end repeat
    return theCleanText
end removeMarkupFromText

But that removes every HTML tag, which is not what I want. Searching SO, I found how to extract the text between tags in "Parsing HTML source code using AppleScript", but I am not trying to parse the file.

I am familiar with BBEdit's Balance Tags command (shown as Balance in the drop-down menu), but when I run:

tell application "BBEdit"
    activate
    find "<span class=\"foo\">" searching in text 1 of text document "test.html" options {search mode:grep, wrap around:true} with selecting match
    balance tags
end tell

the selection turns greedy: it grabs everything from the first opening tag to the second-to-last closing tag, including the text in between, instead of limiting itself to the first tag and its text.

Digging further in the dictionary under tag, I came across find tag, which lets me do: set spanTarget to (find tag "span" start_offset counter). I can then check the class with |class| of attributes of tag of spanTarget and call balance tags, but I still run into the same issue as before.

So, in pure AppleScript, how can I remove a tag associated with a class without the match being greedy?
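
One possible direction, shown only as a minimal sketch: plain AppleScript can split the text on the exact opening tag using text item delimiters, then drop just the first closing </span> that follows each split point. The handler name removeSpanWithClass and its parameters are hypothetical, not part of any dictionary. The sketch assumes the opening tag is written exactly as <span class="foo"> (no extra attributes or different spacing) and that no other span is nested inside a matching one.

```applescript
-- Sketch: remove <span class="..."> wrappers for one class but keep their text.
-- removeSpanWithClass is a hypothetical helper name, not a built-in command.
-- Assumes the opening tag appears exactly as built in openTag and that
-- no other <span> is nested inside a matching span.
on removeSpanWithClass(theText, theClass)
    if theText is "" then return ""
    set openTag to "<span class=\"" & theClass & "\">"
    set closeTag to "</span>"
    set savedDelimiters to AppleScript's text item delimiters
    considering case
        -- Split on the opening tag; every chunk after the first
        -- begins just inside a matching span.
        set AppleScript's text item delimiters to openTag
        set theChunks to text items of theText
        set theResult to item 1 of theChunks
        repeat with i from 2 to count of theChunks
            set theChunk to item i of theChunks
            -- Drop only the first closing tag in this chunk: that is the
            -- one paired with the opening tag that was just removed.
            set AppleScript's text item delimiters to closeTag
            set theParts to text items of theChunk
            if (count of theParts) > 1 then
                set theHead to item 1 of theParts
                -- Coercing the list to text rejoins the remaining parts with closeTag,
                -- so later closing tags (e.g. for class "bar") are kept.
                set theTail to (items 2 thru -1 of theParts) as text
                set theChunk to theHead & theTail
            end if
            set theResult to theResult & theChunk
        end repeat
    end considering
    -- Restore the delimiters so other code is not affected.
    set AppleScript's text item delimiters to savedDelimiters
    return theResult
end removeSpanWithClass
```

Called as removeSpanWithClass(theHTML, "foo"), where theHTML holds the paragraph above, this should leave the <span class="bar"> element untouched, because only the first </span> after each removed opening tag is dropped. It works on a string, so the text would still have to be read from and written back to the document separately.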

