This page exists for archival purposes. Please refer to the XMLSTATS API for current documentation.

XMLSTATS: Using SportsML

Introduction

To provide a consistent format for sports scores, schedules, and statistics, the International Press Telecommunications Council (IPTC) created an open, non-proprietary XML standard. The IPTC is an association of the largest news agencies in the world, including the Associated Press, Reuters, and the New York Times. The standard is named the Sports Markup Language, or SportsML. The SportsML 1.0 specification was formally adopted by the IPTC in May 2003. The first update to that specification, SportsML 1.5, was released in spring 2005.

All major sports websites receive their content in an XML format already. Unfortunately, this data is not readily available in the SportsML format for sports enthusiasts who wish to personalize the sports data to follow their favorite teams and players, present the information in a manner preferable to them, and perform various statistical analyses.

Project Goals

This project aims to provide examples of how to properly form SportsML for major team sports including MLB, NBA, and NCAA basketball. Example XSL style sheets to convert SportsML XML source files into XHTML pages will be listed along with all source code.

The previous day's box scores in the SportsML format will also be made available. However, it is not the goal of this project to provide sports data in real time. Paid services are available that provide live, real-time sports feeds in XML format.

All 2008-2009 NBA box scores in XML format.

All 2010 MLB box scores in XML format.

All 2009-2010 NCAA Basketball box scores in XML format.

Beta 2006 NFL box scores in XML format.

The Basics

A Brief MLB Box Score in XML

This walk-through shows what a brief baseball box score summary looks like in XML using SportsML.

A summary box score contains the team names, their score in each inning, and their total runs, hits, and errors. The first step is to give the XML file a declaration that identifies the version of the XML standard it follows and the encoding that tells a parser how its characters should be interpreted. The line below says the XML document uses version 1.0 and the UTF-8 character encoding. This line must be on the first line and in the first column of the document.

<?xml version="1.0" encoding="UTF-8"?>

The next part of the XML file is optional. A DOCTYPE identifier provides a mechanism to validate the XML. If it is absent a well-formed XML file will still load, however. For SportsML, the main DTD is named sportsml-core.dtd. For XML files that specify a DOCTYPE, the sportsml-core.dtd is the file to use. A DOCTYPE specifier uses the root node of the DTD as its first argument. In SportsML, the root node is named sports-content. Next in the DOCTYPE specification is a path to find the DTD file. The SYSTEM keyword means that the DTD can be found by a relative path on the filesystem or by a URL. The DOCTYPE line below indicates the sportsml-core.dtd file can be found at the specified URL.

<!DOCTYPE sports-content SYSTEM "http://erikberg.com/xmlstats/dtd/sportsml-core.dtd">
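The point that a well-formed file still loads without validation can be checked with a short script. This is a minimal sketch using Python's standard xml.etree.ElementTree module, whose parser is non-validating, so the DTD named in the DOCTYPE is neither downloaded nor applied. The helper name is_well_formed and the abbreviated document are illustrative and not part of the project.

```python
import xml.etree.ElementTree as ET

# Abbreviated document using the declaration and DOCTYPE lines from the text.
# The parser only checks well-formedness; it does not validate against the DTD.
DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE sports-content SYSTEM "http://erikberg.com/xmlstats/dtd/sportsml-core.dtd">
<sports-content>
 <sports-metadata doc-id="20060408_BAL_at_TEX"/>
</sports-content>
"""


def is_well_formed(data: bytes) -> bool:
    """Return True if the bytes parse as XML, regardless of DTD validity."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


root = ET.fromstring(DOC)
print(root.tag)                 # name of the root element
print(is_well_formed(DOC))
# A missing closing tag makes the document ill-formed, so parsing fails.
print(is_well_formed(b"<sports-content><sports-metadata></sports-content>"))
```

Checking the document against sportsml-core.dtd itself would need a validating parser, which is outside this sketch.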

With those two headers, work on the heart of the XML document can now begin. As mentioned above, the root node for SportsML is named sports-content. All information contained in the XML document will be a child node of sports-content. The first few lines of a simple SportsML document could look like the following.

<sports-content>
 <sports-metadata
     doc-id="20060408_BAL_at_TEX"
     document-class="event-statistics"
     date-coverage-type="2004">
  <sports-title>MLB Box Score: Baltimore at Texas</sports-title>
 </sports-metadata>

The only required attribute of sports-metadata is doc-id. The other attributes are all optional, but some are useful for identifying the scope of the document. Above, sports-metadata has two optional attributes defined. The document-class attribute specifies that this document contains statistics for an event, as opposed to schedules or league standings. The date-coverage-type attribute indicates that the document covers an event that took place in 2004. Finally, one important child element of sports-metadata is listed, sports-title. As its name indicates, it contains the title of the document.

The actual event information is included in an element named sports-event. There can be multiple sports-event elements within one sports-content container. Since multiple sports-event elements could be contained within a single sports-content, sports-event has a child element named event-metadata. It contains more specific event information like start and end times, environmental information, attendance, and whether the information is before, during, or after the event.

 <sports-event>
  <event-metadata
    event-status="post-event"
    start-date-time="2004-06-08"
    end-date-time="2004-06-08"
    site-attendance="43128"
    site-temperature="71"
    site-temperature-units="F"
    />

The attributes in event-metadata above denote that the event is over, that it started and ended on June 8, 2004, that 43,128 people attended, and that the temperature was a pleasant 71 degrees Fahrenheit. Dates and times in SportsML follow the ISO 8601 standard. The standard calls for times to be represented as YYYY-MM-DDThh:mm:ss±hhmm if available. An example is 2005-07-04T14:30:00-0500, which corresponds to July 4th, 2005 at 2:30 PM in a time zone five hours behind UTC, such as CDT. If smaller units of time, such as minutes and seconds, are unknown, it is acceptable to omit them.
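Both forms of the date-time value described above can be read with Python's standard datetime module. The sketch below is one possible approach; the function name parse_sportsml_time is hypothetical, and it only handles the two layouts that appear in this walk-through (a full timestamp with a numeric offset, and a date with no time).

```python
from datetime import datetime, timedelta


def parse_sportsml_time(value: str) -> datetime:
    """Parse a full timestamp with offset, or fall back to a date-only value."""
    for fmt in ("%Y-%m-%dT%H:%M:%S%z", "%Y-%m-%d"):
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(f"unsupported date-time value: {value!r}")


full = parse_sportsml_time("2005-07-04T14:30:00-0500")
date_only = parse_sportsml_time("2004-06-08")

print(full.isoformat())
# The -0500 suffix is an offset from UTC, not a named time zone.
print(full.utcoffset() == timedelta(hours=-5))
print(date_only.date())
```

A date-only value parses to midnight with no offset attached, so code that compares the two kinds of value should treat them differently.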

The next task is to describe the teams. This is done simply enough by the team element. A team element is important and contains the bulk of the data for a box score event. Keeping with SportsML's consistency, the team element has a child element named team-metadata. The metadata information for team contains items about where the team is playing (home, away, or at a neutral field), location information and the name.

  <team>
   <team-metadata alignment="away">
    <name first="Baltimore" last="Orioles"/>
   </team-metadata>

The next building block for this simple example is the set of score values. The child element of team that contains the basic score information is named team-stats. The score attribute is the total score for the game. The event-outcome attribute is a controlled value: win, loss, tie, or undecided. Each sub-score child element indicates the inning number and the runs scored in that inning.

   <team-stats score="2" event-outcome="loss">
    <sub-score period-value="1" score="0"/>
    <sub-score period-value="2" score="0"/>
    <sub-score period-value="3" score="1"/>
    <sub-score period-value="4" score="0"/>
    <sub-score period-value="5" score="0"/>
    <sub-score period-value="6" score="0"/>
    <sub-score period-value="7" score="1"/>
    <sub-score period-value="8" score="0"/>
    <sub-score period-value="9" score="0"/>

Sports have a wide variety of formats, rules, and scoring values. Accounting for every possibility for every sport in one DTD would create a monolithic file that is needlessly complex to maintain. Instead, SportsML has sport-specific DTDs that are loaded only when requested. Each sport-specific DTD contains elements and attributes that describe situations which apply only to that sport. Baseball has many unique statistics, and a detailed box score would use many of these specific elements. This simple example uses only a few. The baseball-specific element is named team-stats-baseball. It is a child element of team-stats. In addition, team-stats-baseball further breaks things down into offensive and defensive statistics. Other sport-specific SportsML DTDs behave the same way. In this short example, the number of hits is recorded with the hits attribute in stats-baseball-offensive. The number of errors is specified by the errors attribute in stats-baseball-defensive.

    <team-stats-baseball>
     <stats-baseball-offensive hits="7"/>
     <stats-baseball-defensive errors="1"/>
    </team-stats-baseball>
   </team-stats>
  </team>

The team element is repeated with the values for the second team. The XML file is completed with the close of the sports-event and sports-content elements.

 </sports-event>
</sports-content>

The completed XML file now looks like the following. Download the completed XML file.

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE sports-content SYSTEM "http://erikberg.com/xmlstats/dtd/sportsml-core.dtd">

<sports-content>
 <sports-metadata
     doc-id="20060408_BAL_at_TEX"
     document-class="event-statistics"
     date-coverage-type="2004">
  <sports-title>MLB Box Score: Baltimore at Texas</sports-title>
 </sports-metadata>

 <sports-event>
  <event-metadata
    event-status="post-event"
    start-date-time="2004-06-08"
    end-date-time="2004-06-08"
    site-attendance="43128"
    site-temperature="71"
    site-temperature-units="F"
    />

  <team>
   <team-metadata alignment="away">
    <name first="Baltimore" last="Orioles"/>
   </team-metadata>
   <team-stats score="2" event-outcome="loss">
    <sub-score period-value="1" score="0"/>
    <sub-score period-value="2" score="0"/>
    <sub-score period-value="3" score="1"/>
    <sub-score period-value="4" score="0"/>
    <sub-score period-value="5" score="0"/>
    <sub-score period-value="6" score="0"/>
    <sub-score period-value="7" score="1"/>
    <sub-score period-value="8" score="0"/>
    <sub-score period-value="9" score="0"/>
    <team-stats-baseball>
     <stats-baseball-offensive hits="7"/>
     <stats-baseball-defensive errors="1"/>
    </team-stats-baseball>
   </team-stats>
  </team>
  <team>
   <team-metadata alignment="home">
    <name first="Texas" last="Rangers"/>
   </team-metadata>
   <team-stats score="5" event-outcome="win">
    <sub-score period-value="1" score="2"/>
    <sub-score period-value="2" score="0"/>
    <sub-score period-value="3" score="1"/>
    <sub-score period-value="4" score="0"/>
    <sub-score period-value="5" score="0"/>
    <sub-score period-value="6" score="1"/>
    <sub-score period-value="7" score="1"/>
    <sub-score period-value="8" score="0"/>
    <sub-score period-value="9" score="x"/>
    <team-stats-baseball>
     <stats-baseball-offensive hits="14"/>
     <stats-baseball-defensive errors="0"/>
    </team-stats-baseball>
   </team-stats>
  </team>

 </sports-event>
</sports-content>
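One benefit of this structure is that the numbers can be cross-checked mechanically. The following sketch, written with Python's standard xml.etree.ElementTree module, reads an abbreviated copy of the box score above (declaration, DOCTYPE, and metadata omitted) and confirms that each team's inning scores add up to its total. The function names are illustrative, and treating the "x" in the unplayed half-inning as zero runs is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the completed box score: teams and scores only.
BOX = """
<sports-content>
 <sports-event>
  <team>
   <team-metadata alignment="away"><name first="Baltimore" last="Orioles"/></team-metadata>
   <team-stats score="2" event-outcome="loss">
    <sub-score period-value="1" score="0"/><sub-score period-value="2" score="0"/>
    <sub-score period-value="3" score="1"/><sub-score period-value="4" score="0"/>
    <sub-score period-value="5" score="0"/><sub-score period-value="6" score="0"/>
    <sub-score period-value="7" score="1"/><sub-score period-value="8" score="0"/>
    <sub-score period-value="9" score="0"/>
    <team-stats-baseball>
     <stats-baseball-offensive hits="7"/><stats-baseball-defensive errors="1"/>
    </team-stats-baseball>
   </team-stats>
  </team>
  <team>
   <team-metadata alignment="home"><name first="Texas" last="Rangers"/></team-metadata>
   <team-stats score="5" event-outcome="win">
    <sub-score period-value="1" score="2"/><sub-score period-value="2" score="0"/>
    <sub-score period-value="3" score="1"/><sub-score period-value="4" score="0"/>
    <sub-score period-value="5" score="0"/><sub-score period-value="6" score="1"/>
    <sub-score period-value="7" score="1"/><sub-score period-value="8" score="0"/>
    <sub-score period-value="9" score="x"/>
    <team-stats-baseball>
     <stats-baseball-offensive hits="14"/><stats-baseball-defensive errors="0"/>
    </team-stats-baseball>
   </team-stats>
  </team>
 </sports-event>
</sports-content>
"""


def inning_runs(team):
    """Return per-inning runs; a non-numeric score such as "x" counts as 0."""
    runs = []
    for sub in team.findall("team-stats/sub-score"):
        score = sub.get("score", "")
        runs.append(int(score) if score.isdigit() else 0)
    return runs


def line_score(team):
    """Return (city, runs by inning, R, H, E) for one team element."""
    stats = team.find("team-stats")
    baseball = stats.find("team-stats-baseball")
    return (
        team.find("team-metadata/name").get("first"),
        inning_runs(team),
        int(stats.get("score")),
        int(baseball.find("stats-baseball-offensive").get("hits")),
        int(baseball.find("stats-baseball-defensive").get("errors")),
    )


root = ET.fromstring(BOX)
lines = [line_score(team) for team in root.findall("sports-event/team")]
for city, innings, runs, hits, errors in lines:
    # The inning-by-inning runs should add up to the score attribute.
    assert sum(innings) == runs
    print(city, innings, runs, hits, errors)
```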

Adding an XSL Stylesheet

The structure looks orderly, but plain XML is not the best format for human consumption. The way data is displayed can be controlled through an XML stylesheet, formally the Extensible Stylesheet Language (XSL). XSL is a powerful but terse formatting language that allows unlimited possibilities for the presentation and display of XML. This is one of the great aspects of XML: the data is completely separate from the presentation layer and style contents. The stylesheet can be modified, duplicated, and altered without ever touching the data.

The start of an XSL document is similar to that of an XML one. The opening element of the document must declare the XSL namespace and version.

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

In this example the data will be displayed on the Web. XSL has powerful methods to convert XML into other XML, HTML, or plain text documents. This is called XSLT for XSL Transformations. It is tightly coupled with XSL. The following line instructs the transformation process to output the results in HTML.

<xsl:output method="html" indent="yes" encoding="UTF-8" />

Next, an instruction tells the XSL file to look for the root element, sports-content. All subsequent XSL commands will be relative to the sports-content root element.

 <xsl:template match="/sports-content">
  <html>
   <head>
    <title><xsl:value-of select="sports-metadata/sports-title" /></title>
   </head>

After the XSL is told to match on sports-content, the title of the document can be extracted with a relative path. The <xsl:value-of select="..."> statement instructs the XSL to use the value of the data contained in sports-metadata/sports-title.

To retrieve the data contained in the sports-event element, an XSL looping method named <xsl:for-each select="..."> is used. In this case, an xsl:for-each match is performed on sports-event. The general format of the output is created with <table>, <tr>, and <td> HTML tags. Another xsl:for-each loop is used to retrieve and display the number of innings specified in the XML document. The xsl:for-each loop is always closed with the </xsl:for-each> tag.

   <body>
    <xsl:for-each select="sports-event">
    <table style="background-color:#f0f0f0; border:solid #999999 1px;">
     <tr>
      <td>
       <table style="border-spacing:1px">
        <tr>
         <td colspan="2" class="heading">Final</td>
         <xsl:for-each select="team[1]/team-stats/sub-score">
          <td class="hdinning"><xsl:value-of select="@period-value" /></td>
         </xsl:for-each>
         <td class="hdinning">R</td>
         <td class="hdinning">H</td>
         <td class="hdinning">E</td>
        </tr>

With the basic structure of the document created, it is now a matter of looping through the data for each team. This is accomplished with another xsl:for-each statement. The following code adds an <xsl:choose> statement. This is similar to an if ... [ [else if ...] else ... ] conditional statement in other programming languages. The xsl:when statement performs the conditional test. In the example, the event-outcome attribute for team-stats is tested to see if it contains the text "win". If it does, a <td> tag with a class value of "win" is inserted. If the xsl:when test fails, xsl:otherwise is used to insert a <td> tag with a "loss" class value. It is important to note that all tags used in conditionals must close, or an XSLT parse error will be thrown. That is, a <table> tag cannot be opened in an xsl:when clause unless there is a closing </table> tag to go along with it.

        <xsl:for-each select="team">
         <tr>
          <xsl:choose>
          <xsl:when test="team-stats/@event-outcome = 'win'">
           <td class="win">»</td>
          </xsl:when>
          <xsl:otherwise>
           <td class="loss"></td>
          </xsl:otherwise>
          </xsl:choose>
          <td class="team">
          <xsl:value-of select="team-metadata/name/@first" /></td>
          <xsl:for-each select="team-stats/sub-score">
           <td class="inning"><xsl:value-of select="@score" /></td>
          </xsl:for-each>
          <td class="runs"><xsl:value-of select="team-stats/@score" /></td>
          <td class="totals"><xsl:value-of select="team-stats/team-stats-baseball/stats-baseball-offensive/@hits" /></td>
          <td class="totals"><xsl:value-of select="team-stats/team-stats-baseball/stats-baseball-defensive/@errors" /></td>
         </tr>
        </xsl:for-each>
       </table>

The XSL document is completed by closing the outer table, followed by the xsl:for-each, body, html, xsl:template, and xsl:stylesheet tags that were opened at the top of the stylesheet.

      </td>
     </tr>
   </table>
   </xsl:for-each>
   </body>
  </html>
 </xsl:template>
</xsl:stylesheet>

The completed XSL stylesheet now looks like the following. Download the completed XSL file.

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="html" indent="yes" encoding="UTF-8" />

 <xsl:template match="/sports-content">
  <html>
   <head>
    <title><xsl:value-of select="sports-metadata/sports-title" /></title>
   </head>

   <body>
    <xsl:for-each select="sports-event">
    <table style="background-color:#f0f0f0; border:solid #999999 1px;">
     <tr>
      <td>
       <table style="border-spacing:1px">
        <tr>
         <td colspan="2" class="heading">Final</td>
         <xsl:for-each select="team[1]/team-stats/sub-score">
          <td class="hdinning"><xsl:value-of select="@period-value" /></td>
         </xsl:for-each>
         <td class="hdinning">R</td>
         <td class="hdinning">H</td>
         <td class="hdinning">E</td>
        </tr>
        <xsl:for-each select="team">
         <tr>
          <xsl:choose>
          <xsl:when test="team-stats/@event-outcome = 'win'">
           <td class="win">»</td>
          </xsl:when>
          <xsl:otherwise>
           <td class="loss"></td>
          </xsl:otherwise>
          </xsl:choose>
          <td class="team">
          <xsl:value-of select="team-metadata/name/@first" /></td>
          <xsl:for-each select="team-stats/sub-score">
           <td class="inning"><xsl:value-of select="@score" /></td>
          </xsl:for-each>
          <td class="runs"><xsl:value-of select="team-stats/@score" /></td>
          <td class="totals"><xsl:value-of select="team-stats/team-stats-baseball/stats-baseball-offensive/@hits" /></td>
          <td class="totals"><xsl:value-of select="team-stats/team-stats-baseball/stats-baseball-defensive/@errors" /></td>
         </tr>
        </xsl:for-each>
       </table>
      </td>
     </tr>
   </table>
   </xsl:for-each>
   </body>
  </html>
 </xsl:template>
</xsl:stylesheet>
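For readers who want to follow the stylesheet's logic outside a browser, the same traversal can be sketched in Python. The standard library has no XSLT processor, so this is not a transformation of boxsamp.xsl; it is a hand-written imitation of its loops and its xsl:choose test that builds rows of cell text instead of HTML. The function name render_rows is hypothetical, and the input is an abbreviated copy of the box score.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the box score: only what the stylesheet reads.
BOX = """
<sports-content>
 <sports-event>
  <team>
   <team-metadata alignment="away"><name first="Baltimore" last="Orioles"/></team-metadata>
   <team-stats score="2" event-outcome="loss">
    <sub-score period-value="1" score="0"/><sub-score period-value="2" score="0"/>
    <sub-score period-value="3" score="1"/><sub-score period-value="4" score="0"/>
    <sub-score period-value="5" score="0"/><sub-score period-value="6" score="0"/>
    <sub-score period-value="7" score="1"/><sub-score period-value="8" score="0"/>
    <sub-score period-value="9" score="0"/>
    <team-stats-baseball>
     <stats-baseball-offensive hits="7"/><stats-baseball-defensive errors="1"/>
    </team-stats-baseball>
   </team-stats>
  </team>
  <team>
   <team-metadata alignment="home"><name first="Texas" last="Rangers"/></team-metadata>
   <team-stats score="5" event-outcome="win">
    <sub-score period-value="1" score="2"/><sub-score period-value="2" score="0"/>
    <sub-score period-value="3" score="1"/><sub-score period-value="4" score="0"/>
    <sub-score period-value="5" score="0"/><sub-score period-value="6" score="1"/>
    <sub-score period-value="7" score="1"/><sub-score period-value="8" score="0"/>
    <sub-score period-value="9" score="x"/>
    <team-stats-baseball>
     <stats-baseball-offensive hits="14"/><stats-baseball-defensive errors="0"/>
    </team-stats-baseball>
   </team-stats>
  </team>
 </sports-event>
</sports-content>
"""

WIN_MARKER = "\u00bb"  # the "»" character the stylesheet prints for the winner


def render_rows(root):
    """Mirror the stylesheet: a header row, then one row per team."""
    rows = []
    for event in root.findall("sports-event"):           # for-each sports-event
        first_team = event.find("team")                   # team[1]
        innings = [s.get("period-value")
                   for s in first_team.findall("team-stats/sub-score")]
        # "Final" spans two columns in the HTML table.
        rows.append(["Final"] + innings + ["R", "H", "E"])
        for team in event.findall("team"):                # for-each team
            stats = team.find("team-stats")
            # choose / when / otherwise on the event-outcome attribute
            marker = WIN_MARKER if stats.get("event-outcome") == "win" else ""
            baseball = stats.find("team-stats-baseball")
            rows.append(
                [marker, team.find("team-metadata/name").get("first")]
                + [s.get("score") for s in stats.findall("sub-score")]
                + [stats.get("score"),
                   baseball.find("stats-baseball-offensive").get("hits"),
                   baseball.find("stats-baseball-defensive").get("errors")]
            )
    return rows


rows = render_rows(ET.fromstring(BOX))
for row in rows:
    print(" ".join(cell or "." for cell in row))
```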

Add the XSL stylesheet

For the XSL file to be applied to an XML file, a processing instruction must be added to the XML file. This is a single line with an xml-stylesheet target and a pointer to the XSL file. In this case, the following line is added to boxsamp.xml, and it links to boxsamp.xsl.

<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="boxsamp.xsl" ?>
<!DOCTYPE sports-content SYSTEM "http://erikberg.com/xmlstats/dtd/sportsml-core.dtd">

Now viewing the XML file with Firefox or IE 6 should apply the stylesheet, transforming the XML to HTML on the fly. Additional formatting can be accomplished easily by adding a link to a CSS stylesheet. XSL also has a mechanism to achieve detailed styling using XSL Formatting Objects (XSL-FO).
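To confirm which stylesheet a document points to without opening a browser, the processing instruction can be read back with Python's standard xml.dom.minidom module. This is a small sketch; the function name stylesheet_href is illustrative, and the sample omits the DOCTYPE line and the document body to stay short.

```python
import re
from xml.dom import minidom

# The first two header lines from boxsamp.xml, followed by an empty root element.
DOC = b"""<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="boxsamp.xsl" ?>
<sports-content/>
"""


def stylesheet_href(data: bytes):
    """Return the href of the first xml-stylesheet instruction, or None."""
    document = minidom.parseString(data)
    for node in document.childNodes:
        if (node.nodeType == node.PROCESSING_INSTRUCTION_NODE
                and node.target == "xml-stylesheet"):
            # The instruction's data is plain text, so the href is
            # extracted with a regular expression.
            match = re.search(r'href="([^"]*)"', node.data)
            if match:
                return match.group(1)
    return None


print(stylesheet_href(DOC))
```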

League Standings Table

This example illustrates how to formulate a league standings table in SportsML.

Start with the same XML and encoding declaration described in the box score example above. The only required attribute of sports-metadata is doc-id. The language the document uses is expressed by the language attribute (see ISO 639 and ISO 3166 for acceptable codes). The type of document is described with the document-class attribute. In this example, the document contains "standings".

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE sports-content SYSTEM "http://erikberg.com/xmlstats/dtd/sportsml-core.dtd">

<sports-content>
 <sports-metadata
     doc-id="20060428-mlb-standings"
     language="en-US"
     document-class="standings">
  <sports-title>MLB Standings</sports-title>
 </sports-metadata>

The element used to describe the standings data is named, obviously enough, standing. Teams can be grouped within a standing element by league, conference, or division, or the element can hold a single team. The content-label attribute can contain an arbitrary value that describes the information the element contains. The label here is "AL East" since the teams in this standing group represent the American League East division.

  <standing content-label="AL East">
    <standing-metadata
      team-coverage="multi-team"
      date-coverage-type="season-regular"
      stats-coverage="standard"
      competition-scope="league">
    </standing-metadata>

Information covering the dates, team coverage, and the detail level of the standings table is controlled within the standing-metadata child element. For team-coverage the value is "multi-team", which means that more than one team is described within the standing element. The stats-coverage attribute may contain three different values: compact, standard, and expanded. There is no formal guideline for choosing among them; it is up to the discretion of the document's author. The guideline I use is that a standings table with fewer than five elements, such as team, wins, losses, and winning percentage, is classified as "compact". A standings table with five to eight elements, adding games behind, home records, and away records, is "standard". A standings table with more than eight elements, such as streaks, points scored, points against, and conference and division records, is "expanded".
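The author's rule of thumb for stats-coverage can be written down as a small function. This is only a sketch of the informal guideline described above, not part of the SportsML specification; the function name stats_coverage is hypothetical.

```python
def stats_coverage(column_count: int) -> str:
    """Classify a standings table by how many elements (columns) it has."""
    if column_count < 5:
        return "compact"      # e.g. team, wins, losses, winning percentage
    if column_count <= 8:
        return "standard"     # adds games behind, home and away records
    return "expanded"         # streaks, points, conference records, and so on


for count in (4, 5, 8, 9):
    print(count, stats_coverage(count))
```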

The next part in the standings table is listing the teams. The team element is used again to describe the team name and any additional team metadata.

    <team>
      <team-metadata>
        <name first="Boston" last="Red Sox"/>
      </team-metadata>

Next, two attributes in the team-stats element are used. The number of games played is listed in an attribute named events-played, and the number of games a team is out of first place is contained in the games-back attribute. When a team is in first place, use "0" as the value.

      <team-stats events-played="24" games-back="0">
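The text does not say how the games-back value is calculated. A common convention in baseball standings is to halve the sum of the difference in wins and the difference in losses between the leader and the trailing team. The sketch below assumes that convention; the function name games_back is hypothetical and the second-place record of 12 wins and 11 losses is invented purely for illustration.

```python
def games_back(leader_wins: int, leader_losses: int,
               wins: int, losses: int) -> float:
    """Games behind the leader, using the usual half-game convention."""
    return ((leader_wins - wins) + (losses - leader_losses)) / 2


# The division leader in this example has 14 wins and 10 losses.
print(games_back(14, 10, 14, 10))   # the leader compared with itself
# Hypothetical trailing team with 12 wins and 11 losses.
print(games_back(14, 10, 12, 11))
```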

The element that contains most of the information for a standings table is named outcome-totals. This element has attributes that specify time periods, where the game occurred, and records against specific teams, divisions, or conferences. It also has self-documenting attributes such as wins, losses, and winning-percentage. There can be several outcome-totals elements within team-stats. Each one holds information for a specific result set, which is defined by the following attributes.

        <outcome-totals
          duration-scope="events-all"
          competition-scope="events-all"
          alignment-scope="events-all"

For a standard standings table, the duration-scope value will be "events-all". This means that every game played is included. Other possible values are "events-overtime" which covers games decided in overtime, "events-shootout" specifically for soccer and hockey where games were decided by a shootout, and "events-most-recent-10" which describes the last ten game results i.e., the "Last 10" column in a standings table.

The attribute used to indicate the record totals against a certain team, division, conference, league, or set of ranked teams is named competition-scope. The most common value, and the default, is "events-all", which means to include all games played against all opponents. When a standings table breaks down a team's record by division, this is indicated by giving the attribute a "division" value. [The SportsML specification does not include a method for subdividing a division competition scope. For MLB, I append the division name. For example, the value for competition-scope when listing totals for the NL West would be "division-west".] For college sports, a valid competition-scope value is "top-25", which would only contain records against ranked opponents. [The SportsML specification does not include a method to show which poll is being used (AP, Coaches, etc.). Similar to MLB, I append the name of the poll to this value, e.g., "top-25-ap" for the Top 25 AP poll and "top-25-coaches" for the Top 25 Coaches poll.]

Just like it is used elsewhere in the SportsML definition, alignment-scope indicates whether the games covered include all games ("events-all"), only home games ("events-home"), or only away games ("events-away"). [I use "events-neutral" to describe games that occurred at a neutral location. This is not in the SportsML specification.]

          wins="14"
          losses="10"
          winning-percentage=".583"
          streak-type="win"
          streak-total="1"
          />
        <outcome-totals
          alignment-scope="events-home"
          wins="6"
          losses="4"
          />
        <outcome-totals
          alignment-scope="events-away"
          wins="8"
          losses="6"
          />
      </team-stats>
    </team>
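The values in a team entry can be checked against each other once the fragments above are assembled. The following sketch uses Python's standard xml.etree.ElementTree module on a well-formed copy of the Boston entry. The function names are illustrative, and treating a missing alignment-scope as "events-all" is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

# The Boston entry from the fragments above, assembled into one element.
TEAM = """
<team>
 <team-metadata><name first="Boston" last="Red Sox"/></team-metadata>
 <team-stats events-played="24" games-back="0">
  <outcome-totals duration-scope="events-all" competition-scope="events-all"
      alignment-scope="events-all" wins="14" losses="10"
      winning-percentage=".583" streak-type="win" streak-total="1"/>
  <outcome-totals alignment-scope="events-home" wins="6" losses="4"/>
  <outcome-totals alignment-scope="events-away" wins="8" losses="6"/>
 </team-stats>
</team>
"""


def record(stats, alignment):
    """Return (wins, losses) for the outcome-totals with this alignment-scope."""
    for totals in stats.findall("outcome-totals"):
        if totals.get("alignment-scope", "events-all") == alignment:
            return int(totals.get("wins")), int(totals.get("losses"))
    raise KeyError(alignment)


def winning_percentage(wins, losses):
    """Format wins / games to three decimals without the leading zero."""
    games = wins + losses
    pct = wins / games if games else 0.0
    return f"{pct:.3f}".lstrip("0")


stats = ET.fromstring(TEAM).find("team-stats")
overall = record(stats, "events-all")
home = record(stats, "events-home")
away = record(stats, "events-away")

print(overall, home, away)
print(winning_percentage(*overall))
# Home and away records should add up to the overall record, and wins plus
# losses should match events-played.
print((home[0] + away[0], home[1] + away[1]) == overall)
print(sum(overall) == int(stats.get("events-played")))
```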

The rest of the attributes in outcome-totals are largely self-explanatory. Values relevant for the streak-type attribute in a standings table are win, loss, and tie. The streak-total attribute holds the length of the streak.

The standings table now has information for the overall win-loss record including the current winning streak, the win-loss records for games that took place at home, and the win-loss record for games that took place on the road.

The team element is repeated with the values for the rest of the teams in the AL East. After the final team in the AL East is listed, the standing element is closed. A new standing element is used for each division: AL Central, AL West, NL East, NL Central, and NL West are each contained within their own standing tags.

Once all of the divisions and teams have been entered, the document is closed with the </sports-content> tag. A completed expanded standings table is also available.


Creative Commons License
Copyright © 2005-2010 Erik Berg