Xtract My Links

n: smd_xml | v: 0.3 | d: 1058 | f: /

smd_xml

Yank bits out of any hunk of XML and reformat it to your own needs. Great for pulling feed info into your Textpattern site, for example from delicious.com.

Features

  • Specify your XML data from any URL — internal or external to TXP
  • Selectively extract any items in your record set
  • Use a Form or the plugin container to output data you have extracted
  • XML tag attributes are available as well
  • Supports pagination of results with limit/offset

Author

Stef Dawson. For other software by me, or to make a donation, see the software page.

Installation / Uninstallation

Requires PHP 5+

Download the plugin from either textpattern.org, or the software page above, paste the code into the TXP Admin -> Plugins pane, install and enable the plugin. Visit the forum thread for more info or to report on the success or otherwise of the plugin.

To remove the plugin, simply delete it from the Admin->Plugins tab.

Usage: smd_xml

Place a <txp:smd_xml> tag where you wish to process XML data — this could be from a feed. Since this plugin is best explained by example, assume the following XML document is presented to the plugin:

<employees>
   <employee>
      <name id="wile_e_coyote">Wile E. Coyote</name>
      <job_title>Schemer</job_title>
      <dept>ACME corp</dept>
      <quality>Cunning</quality>
      <quality>Deviousness</quality>
      <quality>Persistence</quality>
   </employee>
   <employee>
      <name id="road_runner">Road Runner</name>
      <job_title>Seed expert</job_title>
      <dept>Evasion</dept>
      <quality>Speed</quality>
      <quality>Meep meep</quality>
   </employee>
</employees>

Attributes: smd_xml

Use the following attributes to configure the smd_xml plugin (shaded attributes are mandatory) :

Data import

  • data : the XML data source. Most of the time this will be a URL, though you could hard-code the XML data or use another TXP tag here (e.g. <txp:variable />)
  • record : the name of the XML tag that surrounds each record of data in your feed. Thus you would need record="employee" in the above document
  • fields : list of XML fields you want to extract from each record. For example, fields="name, dept". Each field you specify here will create a similarly-named replacement tag that you may use in your form/container to display the relevant piece of data. In this case, {name} and {dept} would be available in your output. You may also extract multiple copies of the same field by separating the name of the field’s copy with param_delim. For example: fields="pubDate, title|url_ttl, id, link" would extract title twice: once as {title} and again as {url_ttl}. See example 6 for a practical application of this
  • skip : list of XML fields you want to skip over in each record. Useful if a field you wish to extract is used in two places in the same record. See example 2 for a practical application
  • defaults : list of default values you wish to set if any fields are not set in your document. Specify defaults in pairs of entries like this: defaults="field|default, field|default, ...". The pipe can be altered by param_delim
  • set_empty : by default, any field that is not set in your document leaves its raw {replacement tag} in your output. Use set_empty="1" to ensure that all empty nodes are set to an empty value instead. Any defaults you specify will take precedence over empties
  • cache_time : if set, the XML document is cached in the TXP prefs. Subsequent calls to smd_xml (e.g. refreshing the page) will read the cached information instead of hitting the data URL, thus cutting down on network traffic. After cache_time — specified in seconds — has elapsed, the next page refresh will cause the document to be fetched from the data URL again. You may, however, force a refresh from the data URL at any time by adding &force_read=1 to the browser URL (you can use smd_prefalizer and search for ‘smd_xml’ to find the cached documents — each is referenced by its unique ID)
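
To make the interplay between defaults and set_empty concrete, here is a minimal Python sketch of the substitution order described above. It is not the plugin's code: the helper name fill_template and the sample template are invented for illustration, and it assumes "not set" simply means the field is missing from the record.

```python
import re

def fill_template(template, values, defaults=None, set_empty=False):
    """Replace {field} markers in a template (hypothetical helper)."""
    defaults = defaults or {}

    def lookup(match):
        field = match.group(1)
        if field in values:        # field was found in the record
            return values[field]
        if field in defaults:      # defaults take precedence over empties
            return defaults[field]
        if set_empty:              # blank out anything that is still unset
            return ""
        return match.group(0)      # otherwise the raw {tag} stays visible

    return re.sub(r"\{([^{}]+)\}", lookup, template)

template = "{name} works in {dept} as {job_title}"
record = {"name": "Road Runner"}   # dept and job_title are not set

raw = fill_template(template, record)
filled = fill_template(template, record,
                       defaults={"dept": "Evasion"}, set_empty=True)
```

Here raw keeps the literal {dept} and {job_title} markers, while filled uses the default for dept and an empty string for job_title.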

Manipulation

  • format : alter the format of the listed fields. For each field, specify items separated by param_delim: the first is the name of the field you want to alter, the second is the type of alteration required, and the third and any subsequent items specify how you want to alter the data. The following alteration types are supported:
    • date : takes one argument; the format string as detailed in strftime. Example: format="pubDate|date|%d %B %Y %H:%M:%S" would reformat the pubDate field. Can also be used to reformat time strings
    • link : convert the URL in this field to an HTML anchor hyperlink. Example: format="cat_url|link". Replaces the linkify attribute from the v0.2x versions
    • escape : escape the field so it can be used in an SQL statement
    • sanitize : convert the field into one of three ‘dumbed down’ formats, as specified by the third parameter. Choose from url for creating simple, valid URL strings; file for creating valid file names; or url_title for making TXP-style URL titles as governed by your prefs settings. Example: format="Title|sanitize|url"
    • case : alter the case of the field. Choose one or more of four options as the third, fourth, etc. parameters: upper, lower, ucfirst, ucwords. The items are applied cumulatively; for example, to first convert the field to lower case and then convert the first letter of each word to upper case, specify format="Country|case|lower|ucwords"
  • target_enc : character encoding to apply to the parsed XML data. Choose from ISO-8859-1, US-ASCII, or UTF-8. Default: UTF-8
  • uppercase : set to 1 to force all XML tag names to be in upper case, thus you would have to specify fields="NAME, DEPT" in order to successfully extract those fields
  • concat : any duplicate tags in the stream are usually concatenated together. If you wish to turn this feature off so that only the last tag’s content remains, set concat="0". Default: 1
  • convert : if your data stream contains data you don’t want or data that you wish to translate (for example, character entities) you can list them here. Items are specified in pairs separated by param_delim; the first is the item to search for and the second is its replacement. For example: convert="&amp;#039|'" would replace all occurrences of &amp;#039 with an apostrophe character. Note that the replacements are performed on the raw stream before it is parsed and after it is cached. Also take care when decoding double quotes; this is the correct method: convert="&amp;quot;|""" (note the double quote is escaped by putting two double quote characters in)
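
The cumulative behaviour of the case format can be pictured with a short Python sketch. This only illustrates the rule format described above, not the plugin's PHP implementation; apply_case_format and the sample value are hypothetical, and ucwords is approximated by splitting on single spaces.

```python
def ucwords(text):
    """Upper-case the first letter of each space-separated word."""
    return " ".join(word[:1].upper() + word[1:] for word in text.split(" "))

# The four documented options, mapped to plain string operations.
CASE_OPS = {
    "upper": str.upper,
    "lower": str.lower,
    "ucfirst": lambda s: s[:1].upper() + s[1:],
    "ucwords": ucwords,
}

def apply_case_format(rule, record, param_delim="|"):
    """Apply one rule such as 'Country|case|lower|ucwords' to a record."""
    field, kind, *ops = rule.split(param_delim)
    if kind != "case":
        raise ValueError("this sketch only handles the case type")
    value = record[field]
    for op in ops:                 # options are applied left to right
        value = CASE_OPS[op](value)
    return value

country = apply_case_format("Country|case|lower|ucwords",
                            {"Country": "uNITED kINGDOM"})
```

With the sample record, country ends up as "United Kingdom": lower runs first, then ucwords.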

Forms and paging

  • form : the TXP Form with which to parse each record. You may use the plugin as a container instead if you prefer
  • pageform : optional TXP form used to specify the layout of any paging navigation and statistics such as page number, quantity of records per page, total number of records, etc. See paging replacement tags
  • pagepos : the position of the paging information. Options are below (the default), above, or both of them separated by delim
  • limit : show this many results per page. Setting a limit smaller than the total number of records switches paging on automatically so you can use the <txp:older /> and <txp:newer /> tags inside your pageform to step through each page of results. You may also construct your own paging (see example 3)
  • offset : skip this many rows before outputting the results. If you specify a negative offset you start that many records from the end of the document
  • pagevar : if you put smd_xml on the same page as a standard article list, the built-in newer and older tags will clash with those of smd_xml; clicking next/prev will step through both your result set and your article list. Specify a different variable name here so the two lists can be navigated independently, e.g. pagevar="xpage". Note that if you change this, you will have to generate your own custom newer/older links (see example 4 and the conditional tags smd_xml_if_prev / smd_xml_if_next). There is also a special value, SMD_XML_UNIQUE_ID, which assigns the tag’s unique ID as the paging variable. See example 5 for more. Default: pg
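
As a rough mental model of how limit and offset pick records, consider the Python sketch below. The function select_records is hypothetical, and the order of operations (offset first, then paging by limit) is an assumption rather than something taken from the plugin source.

```python
def select_records(records, limit=None, offset=0, page=1):
    """Pick the records shown on one page (illustrative only)."""
    # A negative offset counts back from the end of the document,
    # which is exactly how Python slicing treats a negative start.
    remaining = records[offset:]
    if limit is None:              # no limit means no paging
        return remaining
    start = (page - 1) * limit     # assumed: pages are numbered from 1
    return remaining[start:start + limit]

records = list(range(1, 11))       # stand-ins for ten XML records

second_page = select_records(records, limit=3, page=2)
last_three = select_records(records, limit=3, offset=-3)
skip_two = select_records(records, limit=3, offset=2)
```

With ten records, second_page is [4, 5, 6], last_three is [8, 9, 10] and skip_two is [3, 4, 5].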

Tag/class/formatting attributes

  • wraptag : the (X)HTML tag, without brackets, to surround each record you output
  • break : the (X)HTML tag, without brackets, to surround each field you output
  • class : the CSS class name to apply to the wraptag

Plugin customisation

  • delim : the delimiter to use between items in the plugin attributes. Default: comma
  • param_delim : the delimiter to use between items in XML and plugin data attributes. Default: pipe (|)
  • concat_delim : the delimiter to use between identically-named tags in the XML data stream. Default: space
  • transport : (should not be needed) if you would like to force the plugin to use a particular HTTP transport mechanism to fetch your data you can specify it here. Choose from fsock or curl. Default: fsock
  • line_length : if you are using the fsock transport mechanism, the plugin grabs the XML document line by line and uses a maximum line length of 8192 characters by default. This is usually good enough because most feeds contain newlines, but some (e.g. Google Spreadsheet) don’t have any newlines in them. To successfully parse such documents you may need to increase the line length. In these situations, however, it is highly recommended to switch to transport="curl" instead (if you can) because it does not have any line length restrictions
  • hashsize : (should not be needed) when specifying a cache_time the plugin assigns a 32-character, unique reference to the current smd_xml based on your import attributes. hashsize governs the mechanism for making this long reference shorter. It comprises two numbers separated by a colon; the first is the length of the unique ID, the second is how many characters to skip past each time a character is chosen. For example, if the unique reference was 0cf285879bf9d6b812539eb748fbc8f6 then hashsize="6:5" would make a 6-character unique ID using every 5th character; in other words 05f898. If at any time you “fall off” the end of the long string, the plugin wraps back to the beginning of the string and continues counting. Default: 6:5
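
The hashsize rule can be reproduced in a few lines of Python. This is a sketch of the behaviour as documented above, using a hypothetical helper called short_id; it models the wrap-around with a modulo and assumes counting starts at the first character, which matches the worked example.

```python
def short_id(reference, hashsize="6:5"):
    """Shorten a long reference using a 'length:skip' hashsize string."""
    length, skip = (int(part) for part in hashsize.split(":"))
    # Take every skip-th character, wrapping to the start when we
    # run off the end of the reference.
    return "".join(reference[(i * skip) % len(reference)]
                   for i in range(length))

unique_id = short_id("0cf285879bf9d6b812539eb748fbc8f6", "6:5")
wrapped = short_id("abcdef", "4:4")
```

unique_id comes out as 05f898, the same value as the worked example in the hashsize description, and wrapped ("aeca") shows the wrap-around on a deliberately short string.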

Replacement tags

Each XML field you extract from your data stream has an equivalently-named replacement tag available so you may use it anywhere you like in your Form/container. So, if you chose to extract fields="name, job_title, quality" you would have the following replacement tags available during the first record:

  • {name} : Wile E. Coyote
  • {name|id} : wile_e_coyote
  • {job_title} : Schemer
  • {quality} : Cunning Deviousness Persistence

And during the second record, the same replacement tag names would refer to the following items:

  • {name} : Road Runner
  • {name|id} : road_runner
  • {job_title} : Seed expert
  • {quality} : Speed Meep meep

Note that the attribute called id that is part of the <name> XML tag has been extracted and is made available automatically. By default, the names of attributes are defined as {tag|attribute}. The pipe can be altered using param_delim.

The {quality} tag appears more than once in the example document above. You can influence its output using the concat and concat_delim attributes, e.g. using concat_delim="|" would render the following replacement variable on the first record:

  • {quality} : Cunning|Deviousness|Persistence

while concat="0" would render this:

  • {quality} : Persistence
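
For readers who prefer code to prose, the following Python sketch mimics how the replacement values listed above could be derived from the example document. It is an approximation built on Python's standard xml.etree module, not the plugin's parser, and the function name extract is invented for this illustration.

```python
import xml.etree.ElementTree as ET

XML = """<employees>
   <employee>
      <name id="wile_e_coyote">Wile E. Coyote</name>
      <job_title>Schemer</job_title>
      <dept>ACME corp</dept>
      <quality>Cunning</quality>
      <quality>Deviousness</quality>
      <quality>Persistence</quality>
   </employee>
   <employee>
      <name id="road_runner">Road Runner</name>
      <job_title>Seed expert</job_title>
      <dept>Evasion</dept>
      <quality>Speed</quality>
      <quality>Meep meep</quality>
   </employee>
</employees>"""

def extract(record, fields, concat=True, concat_delim=" ", param_delim="|"):
    """Build a {replacement tag: value} mapping for one record."""
    values = {}
    for child in record:
        if child.tag not in fields:
            continue               # field was not requested
        text = child.text or ""
        if concat and child.tag in values:
            values[child.tag] += concat_delim + text   # join duplicates
        else:
            values[child.tag] = text                   # last one wins
        for attr, attr_value in child.attrib.items():
            values[child.tag + param_delim + attr] = attr_value
    return values

records = ET.fromstring(XML).findall("employee")
wanted = ["name", "job_title", "quality"]

first = extract(records[0], wanted)
piped = extract(records[0], wanted, concat_delim="|")
last_only = extract(records[0], wanted, concat=False)
second = extract(records[1], wanted)
```

first["quality"] is "Cunning Deviousness Persistence", piped["quality"] is "Cunning|Deviousness|Persistence" and last_only["quality"] is "Persistence", mirroring the three cases described above.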

There are also some special statistical tags available in each record:

  • {smd_xml_totalrecs} : the total number of records found in your XML document
  • {smd_xml_pagerecs} : the number of records on this page (if not using paging, this is the same as above)
  • {smd_xml_pages} : the number of available pages
  • {smd_xml_thispage} : the page number of the currently visible page
  • {smd_xml_thisrec} : the record number, starting at 1
  • {smd_xml_thisindex} : the record number, starting at 0
  • {smd_xml_runrec} : the record number, starting at 1 and including any offset
  • {smd_xml_runindex} : the record number, starting at 0 and including any offset

Paging replacement tags

In your pageform you can employ any of the following replacement tags to build up a navigation system for stepping through your XML document:

  • {smd_xml_totalrecs} : the total number of records found in your XML document
  • {smd_xml_pagerecs} : the number of records on this page
  • {smd_xml_pages} : the number of available pages
  • {smd_xml_prevpage} : the page number of the previous page — empty if on first page
  • {smd_xml_thispage} : the page number of the current page
  • {smd_xml_nextpage} : the page number of the next page — empty if on last page
  • {smd_xml_rec_start} : the record number of the first record on this page (counted from the start of the record set)
  • {smd_xml_rec_end} : the record number of the last record on this page (counted from the start of the record set)
  • {smd_xml_recs_prev} : the number of records on the previous page
  • {smd_xml_recs_next} : the number of records on the next page
  • {smd_xml_unique_id} : the unique reference number assigned to this smd_xml tag (see example 5 for usage of this)
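
The arithmetic behind these paging tags is sketched below in Python. The function paging_stats is hypothetical, and several details are assumptions rather than documented behaviour: record numbers are taken to start at 1, every page except the last is assumed to be full, and "empty" is represented by an empty string.

```python
import math

def paging_stats(total, limit, page):
    """Compute paging values for one page (illustrative only)."""
    pages = max(1, math.ceil(total / limit))
    page = min(max(page, 1), pages)        # clamp to a valid page
    rec_start = (page - 1) * limit + 1     # assumed to count from 1
    rec_end = min(page * limit, total)
    return {
        "smd_xml_totalrecs": total,
        "smd_xml_pagerecs": rec_end - rec_start + 1,
        "smd_xml_pages": pages,
        "smd_xml_prevpage": page - 1 if page > 1 else "",
        "smd_xml_thispage": page,
        "smd_xml_nextpage": page + 1 if page < pages else "",
        "smd_xml_rec_start": rec_start,
        "smd_xml_rec_end": rec_end,
        "smd_xml_recs_prev": limit if page > 1 else 0,
        "smd_xml_recs_next": min(limit, total - rec_end) if page < pages else 0,
    }

stats = paging_stats(total=10, limit=3, page=3)
```

For ten records shown three at a time, page 3 covers records 7 to 9, has three records on the previous page and one on the next.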

Usage: <txp:smd_xml_if_prev> / <txp:smd_xml_if_next>

Use these container tags to determine if there is a next or previous page and take action if so. Can only be used inside pageform, thus all paging replacement variables are available inside these tags.

<txp:smd_xml_if_prev>Previous page</txp:smd_xml_if_prev>
<txp:smd_xml_if_next>Next page</txp:smd_xml_if_next>

The tags support <txp:else />. See example 5 for more.

Examples

Example 1: delicious links

Swap roadrunner in this code with your delicious username to get your own feed:

<txp:smd_xml data="http://feeds.delicious.com/v2/rss/roadrunner"
     record="item" fields="title, link, pubDate, description"
     wraptag="dl">
   <dt><a href="{link}">{title}</a></dt>
   <dd>Posted: {pubDate}<br />{description}</dd>
</txp:smd_xml>

Example 2: twitter feed

<txp:smd_xml
     data="http://twitter.com/statuses/user_timeline/textpattern.xml"
     record="status" fields="id, text, created_at" skip="user"
     wraptag="ul" format="text|link">
   <li>
      <a href="http://twitter.com/textpattern/statuses/{id}">
         {created_at}
      </a>
      <br />{text}
   </li>
</txp:smd_xml>

Notice that we skip the whole user block in the XML data stream. This is for two reasons:

  1. it is redundant information that appears in every record: we already know which user the feed belongs to because every status is from the same user
  2. created_at is used inside the user block as well as in the outer status block, so we would get two datestamps, which is not what we want (if we simply used concat="0" to grab only one of the created_at entries, the last one would prevail: the one from the user block)

Example 3: limit and paging

This example views the We Love TXP feed three records at a time. Since the site is not updated frequently, a cache_time of 86400 seconds (1 day) is ample and avoids hammering the network:

<txp:smd_xml
     data="http://feeds.feedburner.com/welovetxp"
     record="item" fields="title,description, link, pubDate"
     wraptag="ul" limit="3" pageform="pager"
     cache_time="86400">
   <li>
      <a href="{link}">
         {title}
      </a><span class="published">{pubDate}</span>
      <br />{description}
   </li>
</txp:smd_xml>

And in form pager:

Page {smd_xml_thispage} of {smd_xml_pages}
<txp:newer>Previous page</txp:newer>
<txp:older>Next page</txp:older>

If you wanted to view the last three entries in the feed instead of the first three, you could set offset="-3".

Example 4: using pagevar

Adding pagevar="xmlpg" to example 3 allows paging independently of txp:older and txp:newer tags. You then need to build your own links in your pager form, like this:

Page {smd_xml_thispage} of {smd_xml_pages} |
   Showing records {smd_xml_rec_start} to {smd_xml_rec_end}
   of {smd_xml_totalrecs} |
  <a href="?xmlpg={smd_xml_prevpage}">Previous {smd_xml_recs_prev}</a>
  <a href="?xmlpg={smd_xml_nextpage}">Next {smd_xml_recs_next}</a>

That creates links to next and previous record sets using the assigned pagevar as the URL parameter.

Example 5: conditional navigation and the unique ID

Again using example 3, if you used pagevar="SMD_XML_UNIQUE_ID" the pagevar would be assigned the value f290b8. In this case we could use it like this:

Page {smd_xml_thispage} of {smd_xml_pages} |
   Showing records {smd_xml_rec_start} to {smd_xml_rec_end}
   of {smd_xml_totalrecs} |
<txp:smd_xml_if_prev>
  <a href="?{smd_xml_unique_id}={smd_xml_prevpage}">Previous {smd_xml_recs_prev}</a>
</txp:smd_xml_if_prev>
<txp:smd_xml_if_next>
  <a href="?{smd_xml_unique_id}={smd_xml_nextpage}">Next {smd_xml_recs_next}</a>
</txp:smd_xml_if_next>

Note that we use the conditional tags to display the next and previous links only if the next/prev page exists, and that the URL link is generated using {smd_xml_unique_id}. You could conceivably use this same pageform on more than one XML feed on the same page and navigate the two feeds independently, though you would have to work out a clever way of amalgamating the URL vars (perhaps using the adi_gps plugin).

Example 6: inserting XML data into TXP

<txp:smd_xml data="http://feeds.delicious.com/v2/rss/roadrunner"
     record="item" fields="title|utitle, link, pubDate, description, category"
     format="pubDate|date|%Y-%m-%d %H:%M:%S,
     description|escape, title|escape, utitle|sanitize|url_title">
   <txp:smd_query query="INSERT INTO textpattern
     SET Posted='{pubDate}', LastMod=NOW(),
     url_title='{utitle}',
     Title='{title}', custom_3='{link}',
     Body='{description}', Body_html='{description}',
     Section='links', Category1='delicious',
     keywords='{category}'" />
</txp:smd_xml>

This example takes a delicious feed, reformats the various entries and inserts them into the textpattern table in a dedicated section. Note that the date format is altered and the feed’s title is converted to a sanitized TXP URL suitable for the url_title field.

Changelog

  • 02 Jan 10 | 0.1 | Initial release
  • 03 Jan 10 | 0.2 | Added cache support (thanks variaas) ; added limit, offset and paging features ; added linkify (thanks Jaro)
  • 05 Jan 10 | 0.21 | Supports https:// feeds (thanks photonomad) ; added transport, defaults and set_empty attributes
  • 13 Jan 10 | 0.22 | Added line_length (thanks nardo)
  • 17 Jan 10 | 0.3 | Enabled URL params to be passed in the data attribute ; added format ; deprecated linkify ; param_delim default is now pipe

Source code

If you’d rather dig for buried treasure, you’ll need to step into the view source page.

Legacy software

If, for some inexplicable reason, you need last century's version of a plugin, it can probably be found on the plugin archive page.

Experimental software

If you’re feeling brave, or fancy dipping your toe in shark-infested water, you can test out some of my beta code. It can be found on the plugin beta page.
