Better automated computer access to data

Requested By: Chris F.A. Johnson
Tagged: ttc, automation, data, format

So far, I have only looked at the TTC data, but I have three recommendations that would make the page more useful:

  1. The files would be more accessible if they were not zipped.

  2. There should be an invariant name for the files so that an automated script can find them. The files can be there with a datestamp, but the latest data should also be linked to a generic name. I know I can scrape the information from http://www.toronto.ca/open/datasets/ttc-routes/, but that is hardly in the spirit of easily accessible data.

  3. There should be metadata for the files in computer-readable (text) format; a PDF doesn't cut it.
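To illustrate the second recommendation: with an invariant name, an automated script reduces to fetching one fixed URL, with no scraping of the listing page. The following is only a sketch; `LATEST_URL` and the `fetch_latest` helper are hypothetical (the site publishes no such link today), and the unzip step is there only because the files are currently zipped.

```python
import io
import urllib.request
import zipfile

# Hypothetical stable URL, used for illustration only; no such link exists yet.
LATEST_URL = "http://example.org/ttc-routes/latest.zip"


def fetch_latest(url=LATEST_URL, opener=urllib.request.urlopen):
    """Download the archive at a fixed URL and return {member name: bytes}."""
    # The opener is a parameter so the function can be exercised offline.
    with opener(url) as response:
        payload = response.read()
    # Unpack in memory; this step would disappear if the files were not zipped.
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        return {name: archive.read(name) for name in archive.namelist()}
```

A nightly job could call `fetch_latest()` unchanged after every data refresh, which is the whole point of a generic name.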

There are 7 comments:

  • Chris F.A. Johnson
    posted Nov 14, 2009

    And please use ISO standard format for dates (e.g., 2009-10-27) rather than the ridiculous and ambiguous 10-27-2009.
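A short sketch of why the ISO format is friendlier to scripts: ISO dates sort chronologically as plain strings and parse one way only, while a string like 02-03-2009 can be read as either February 3 or March 2. The sample dates other than 2009-10-27 are made up for illustration.

```python
from datetime import date, datetime

# Illustrative datestamps; only 2009-10-27 comes from the comment above.
iso_stamps = ["2009-10-27", "2008-12-31", "2009-02-03"]
us_stamps = ["10-27-2009", "12-31-2008", "02-03-2009"]

# ISO 8601 strings sort chronologically with an ordinary string sort.
iso_sorted = sorted(iso_stamps)

# Month-first strings sort by month, so the 2008 file ends up last.
us_sorted = sorted(us_stamps)

# The same month-first string is valid under two different readings.
as_month_first = datetime.strptime("02-03-2009", "%m-%d-%Y").date()
as_day_first = datetime.strptime("02-03-2009", "%d-%m-%Y").date()

# An ISO date has exactly one reading.
parsed = date.fromisoformat("2009-10-27")
```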

  • Adam Thody
    posted Nov 14, 2009

    Agreed. Referring specifically to TTC data as well, why tab delimited .txt files? This is 2009, we have standards for formatting data. These should be XML files with descriptive tags, and proper XSDs.

    This doesn't require any kind of innovation, or radical thinking, just the use of this decade's standard notation for data exchange.

  • Chris F.A. Johnson
    posted Nov 15, 2009

    Character-delimited fields are far more accessible than XML.

    They don't require any special software (they can be read with any editor or any command-line tool), and can be interpreted by any programming language without needing an XML parser.

    They take up considerably less space than XML.

    The only thing in favour of XML is that it's better than PDF.
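As a rough illustration of the "no special software" point, tab-delimited text can be read with nothing beyond a language's standard library. The sample below is hypothetical; the column names and rows are not taken from the actual TTC files.

```python
import csv
import io

# Hypothetical sample; column names and values are illustrative only.
SAMPLE = "route_id\troute_name\n501\tQueen\n504\tKing\n"


def read_tab_delimited(text):
    """Return one dict per data row, keyed by the header line."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)


rows = read_tab_delimited(SAMPLE)
for row in rows:
    print(row["route_id"], row["route_name"])
```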

  • Adam Thody
    posted Nov 15, 2009

    Sorry, I disagree. The self-describing nature of XML makes it ideal for exchanging this kind of data. I'm not sure what you mean by "special software"; XML can be opened in any text editor and easily understood, without the need for external documentation. Since XML parsing is more or less trivial with nearly all modern programming languages, it's also highly portable.

    The size argument is totally valid, but the pros outweigh the cons.
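Both halves of this exchange can be sketched with the standard library: parsing self-describing XML takes a few lines, and the same records are larger as XML than as tab-delimited text. The element names and records below are hypothetical, not the TTC's schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical records; element names are illustrative only.
XML_SAMPLE = (
    "<routes>"
    "<route><id>501</id><name>Queen</name></route>"
    "<route><id>504</id><name>King</name></route>"
    "</routes>"
)
# The same two records as tab-delimited text with a header line.
TSV_SAMPLE = "id\tname\n501\tQueen\n504\tKing\n"


def parse_routes(xml_text):
    """Return one dict per <route>, keyed by child element name."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: child.text for child in route}
        for route in root.findall("route")
    ]


routes = parse_routes(XML_SAMPLE)

# Compare the encoded size of the two representations.
xml_bytes = len(XML_SAMPLE.encode("utf-8"))
tsv_bytes = len(TSV_SAMPLE.encode("utf-8"))
```

On a sample this small the ratio says little about real files, but the direction of the difference matches the size argument conceded above.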

  • Jungho Kim
    posted Nov 17, 2009

    Character-delimited fields may be fine for simple data; however, they don't scale with data complexity. Furthermore, data often needs to be transformed into what makes sense for a given domain. With XML there is already a standard means to do that via XSLT. No need to reinvent the wheel. Lastly, wide distribution and consumption of data can only be helped by interoperability, and XML is interoperable.

  • Geoffrey Wiseman
    posted Nov 17, 2009

    Last I heard, comments on existing data sources were to be made to the email address, rather than datato.org.

  • Ernest
    posted Nov 19, 2009

    The amount of data available to end users will continue to increase exponentially, and CSV data, while taking up less space, is actually less reliable when dealing with complex datasets and distribution. It can get really messy when trying to parse it at times, whereas XML and related standards have tonnes of tools out there to receive, process and transmit such data. As we move more and more into the semantic web environment, CSV data will ultimately become confined to the annals of history.
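One concrete case of the "messy" parsing mentioned here is a field that contains the delimiter itself. A minimal sketch, using a made-up record: splitting on commas by hand breaks the quoted field apart, while a CSV-aware reader keeps it whole.

```python
import csv
import io

# Hypothetical record: the second field contains a comma, so it is quoted.
LINE = '1234,"Queen St, East Side",43.65\n'

# Naive splitting ignores the quotes and yields four fragments.
naive_fields = LINE.strip().split(",")

# The csv module honours the quotes and yields the three intended fields.
parsed_fields = next(csv.reader(io.StringIO(LINE)))
```

The difficulty is real, but it comes from hand-rolled splitting; a proper CSV reader handles the quoting.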

Rate This Request

Votes in favour: 50%
Degree of contention: Low