Better automated computer access to data
So far, I have only looked at the TTC data, but I have three recommendations that would make the page more useful:
The files would be more accessible if they were not zipped.
There should be an invariant name for the files so that an automated script can find them. The files can be there with a datestamp, but the latest data should also be linked to a generic name. I know I can scrape the information from http://www.toronto.ca/open/datasets/ttc-routes/, but that is hardly in the spirit of easily accessible data.
There should be metadata for the files in computer-readable (text) format; a PDF doesn't cut it.
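To illustrate the second recommendation, here is a minimal sketch of the work a consumer's script has to do when only datestamped files exist. The naming scheme (`ttc-routes-YYYY-MM-DD.txt`, with a `ttc-routes-latest.txt` alias) and the function names are assumptions made for this example, not the site's actual layout.

```python
from datetime import date

def latest_name(dataset, ext="txt"):
    """Hypothetical invariant name a publisher could link to the newest file."""
    return f"{dataset}-latest.{ext}"

def pick_latest(filenames, dataset, ext="txt"):
    """Return the newest datestamped file in a listing, or None if there is none.

    Assumes names of the form <dataset>-YYYY-MM-DD.<ext>.
    """
    prefix = f"{dataset}-"
    suffix = f".{ext}"
    dated = []
    for name in filenames:
        if name.startswith(prefix) and name.endswith(suffix):
            stamp = name[len(prefix):-len(suffix)]
            try:
                dated.append((date.fromisoformat(stamp), name))
            except ValueError:
                # Not a datestamp (e.g. the "latest" alias itself); skip it.
                continue
    return max(dated)[1] if dated else None

listing = [
    "ttc-routes-2009-10-27.txt",
    "ttc-routes-2009-11-02.txt",
    "ttc-routes-latest.txt",
]
print(pick_latest(listing, "ttc-routes"))
print(latest_name("ttc-routes"))
```

With a generic name in place, a script would only need the single fixed name and none of the listing and date parsing.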
Votes in favour: 50%
Degree of contention: Low

There are 7 comments:
posted Nov 14, 2009
And please use ISO standard format for dates (e.g., 2009-10-27) rather than the ridiculous and ambiguous 10-27-2009.
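A short sketch of why the format matters, assuming the existing stamps are month-day-year. The helper name `to_iso` and the sample dates are made up for illustration.

```python
from datetime import datetime

def to_iso(stamp, fmt="%m-%d-%Y"):
    """Convert a stamp such as 10-27-2009 to ISO 8601 (2009-10-27)."""
    return datetime.strptime(stamp, fmt).date().isoformat()

# 10-27-2009 can only be month-day-year, because there is no month 27.
print(to_iso("10-27-2009"))

# 03-04-2009 parses under either convention and yields two different dates.
print(to_iso("03-04-2009", "%m-%d-%Y"))
print(to_iso("03-04-2009", "%d-%m-%Y"))

# ISO stamps also sort chronologically as plain strings.
stamps = ["11-02-2009", "01-15-2010", "10-27-2009"]
print(sorted(to_iso(s) for s in stamps))
```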
posted Nov 14, 2009
Agreed. Referring specifically to the TTC data as well, why tab-delimited .txt files? This is 2009; we have standards for formatting data. These should be XML files with descriptive tags and proper XSDs.
This doesn't require any kind of innovation, or radical thinking, just the use of this decade's standard notation for data exchange.
posted Nov 15, 2009
Character-delimited fields are far more accessible than XML.
They don't require any special software (they can be read with any editor or any command-line tool), and can be interpreted by any programming language without needing an XML parser.
They take up considerably less space than XML.
The only thing in favour of XML is that it's better than PDF.
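As a rough illustration of the point about not needing special software, tab-delimited text can be read with a language's standard library alone. The column names and rows below are invented sample data, not the real TTC file layout.

```python
import csv
import io

# Invented sample; the header names are assumptions for this sketch.
SAMPLE = "route_id\troute_name\n501\tQueen\n504\tKing\n"

def read_tab_delimited(text):
    """Parse tab-delimited text with a header row into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)

for row in read_tab_delimited(SAMPLE):
    print(row["route_id"], row["route_name"])
```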
posted Nov 15, 2009
Sorry, I disagree. The self-describing nature of XML makes it ideal for exchanging this kind of data. I'm not sure what you mean by "special software"; XML can be opened in any text editor and easily understood without the need for external documentation. Since XML parsing is more or less trivial in nearly all modern programming languages, it's also highly portable.
The size argument is totally valid, but the pros outweigh the cons.
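For what it's worth, a minimal sketch of XML parsing using only Python's standard library. The element and attribute names (`routes`, `route`, `id`, `name`) are invented for illustration and are not a proposed schema.

```python
import xml.etree.ElementTree as ET

# Invented sample document; tag and attribute names are assumptions.
SAMPLE = """<routes>
  <route id="501"><name>Queen</name></route>
  <route id="504"><name>King</name></route>
</routes>"""

def read_routes(xml_text):
    """Return one dict per <route> element found directly under the root."""
    root = ET.fromstring(xml_text)
    return [
        {"route_id": route.get("id"), "route_name": route.findtext("name")}
        for route in root.findall("route")
    ]

for row in read_routes(SAMPLE):
    print(row["route_id"], row["route_name"])
```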
posted Nov 17, 2009
Character-delimited fields may be fine for simple data; however, they don't scale with data complexity. Furthermore, data often needs to be transformed into what makes sense for a given domain. With XML there is already a standard means to do that via XSLT. No need to reinvent the wheel. Lastly, wide distribution and consumption of data can only be helped by interoperability, and XML is interoperable.
posted Nov 17, 2009
Last I heard, comments on existing data sources were to be made to the email address, rather than datato.org.
posted Nov 19, 2009
The amount of data available to end users will continue to increase exponentially, and CSV data, while taking up less space, is actually less reliable when dealing with complex datasets and distribution. It can get really messy when trying to parse it at times, whereas XML and related standards have tonnes of tools out there to receive, process and transmit such data. As we move more and more into the semantic web environment, CSV data will ultimately become confined to the annals of history.