Wednesday, April 11, 2018

3:45 PM  - 5:15 PM

Bulk Data Exchange Technology and the "Long Tail" of Data

Las Palmas C

Join a focused community discussion on issues related to Ed-Fi bulk data exchange technologies and consideration of new strategies for incorporating the “long tail” of data into Ed-Fi ODS implementations.

These broad session notes attempt to capture the spirit of the discussion and should not be interpreted as a transcript. Although Ed-Fi Alliance staff were involved in capturing these observations, the notes below should not be construed as official, complete, or 100% accurate.


Slide decks:


Chris Moffatt, VP Technology, Ed-Fi Alliance


Jason Hoekstra, Senior Program Manager, Ed-Fi Alliance

  • “Long Tail” of Data

    • Many states and SIS platforms are now loading data into the ODS via the API

    • Continuing to bring more vendors onto the API

    • What about other data types that we need to support, which are not covered by vendor solutions yet?

    • Trying to gather input and guidance right now

    • What domains are in the “long tail”?

      • Assessment data (other than summary) - formative, professional development

      • HR and finance data, particularly as they relate to ESSA

      • Instructional / LMS data that are outside of a SIS

      • Intervention data

      • Special education data

      • Transportation

      • Audience input:

        • Teacher and student surveys on the effectiveness of intervention work

        • Social and emotional learning data, separate from special education (tools such as Panorama)

        • Nutrition and health

        • College & career (feedback?) (Wal-Mart is working on a career readiness standard)

      • Most of the room relates to 3 or more of these topics

    • Common qualities

      • These data are unlikely to be API native in the short term (next 3 years). Messy and hard to manage. But valuable for analysis.

      • Quarterly or monthly in frequency

      • Mostly in CSV format or exportable to CSV

      • Home-grown solutions in Access, FoxPro, Excel

      • Accessible via SFTP or local network

      • Any other recurring themes? None noted by the audience.

  • Data Flow Project

    • MSDF-sponsored proof-of-concept project

    • Partners: RTI and Lesson Planning applications

    • Districts want more visualizations of what students know and strategies around assessments

    • Toolkit is a lightweight mapping and ETL solution for loading assessment data into the ODS

      • .NET solution paired with Cloud ODS

      • Open source, working to figure out what to do with it next

      • Admin, ETL, and Reporting/Cleanup modules

      • Admin tool configures mapping of CSV files to API calls (a loading sketch follows at the end of this section's notes)

    • Quick demo. Jason is available for a longer demo after the Congress; the audience emphatically requested a longer WebEx session.

    • Audience

      • How is this tool different from general vendor products like SSIS, Talend, WebLogic?

        • In creating this, the team tried to narrow down the ETL problem by mapping only to the ODS API standard. It is not a general-purpose ETL tool.

        • Allows implementers to use a simple tool without the high learning curve of standard industry tools. Enables a business analyst (BA) to do the mapping instead of requiring a programmer.

        • Re-usable code then lowers cost of implementation.

      • Why is the Ed-Fi Alliance so technology specific? It seems so when looking at the Tech Docs.

        • The standard is technology neutral

        • The Ed-Fi team's reference implementation of choice uses the Microsoft stack, but the Alliance does not strictly endorse or require it.

      • From multiple people: this is very valuable and exciting work to lower maintenance work

      • Ed-Fi Alliance still asking vendors to natively support the API - this solution is for small and custom applications that don’t have the capacity for building interoperability.

      • Quality concerns?

        • This is not a perfect solution. E.g., there is a concern that pushing a duplicate record becomes an update; is that what is really wanted? It is still ideal to code directly to the API.

      • Data shown in the tool using the API? Yes

        • Staged approach to loading data, i.e. one file might need to be loaded before another in order to populate reference information. Roster data for example.

      • This is not an Ed-Fi CSV Format.

      • Aware of required fields & referential integrity?

        • Required fields: yes. Referential: no.

        • BA will need to have an understanding of the domain model.

      • Is there a way to share (and govern) the mappings?

        • How does this relate to MappingEDU?

          • MappingEDU maps standard-to-standard

          • [Stephen: I didn’t hear a clear answer about the difference between these two, except that they came from different initiatives. Perhaps this is my lack of experience. But points to need for a little more clarity]

        • One-off and re-use are two different use cases

  • Any other tools you're working with? A few people are starting to work on similar solutions.

  • To be continued...
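
The sketch below illustrates, in broad strokes, the kind of CSV-to-API loading discussed for the toolkit: read a CSV row, apply a configured column-to-field mapping, and POST the result to the ODS API. The file name, column layout, and resource path are illustrative assumptions, not the toolkit's actual code.

    // Illustrative sketch only -- not the Data Flow toolkit's actual code.
    using System;
    using System.IO;
    using System.Linq;
    using System.Net.Http;
    using System.Text;
    using System.Threading.Tasks;

    class CsvToApiLoader
    {
        static async Task Main()
        {
            // Hypothetical ODS/API base URL; a real client also needs OAuth credentials.
            var client = new HttpClient { BaseAddress = new Uri("https://example.org/api/") };

            // Staged loading: reference data (e.g., students from roster files) must
            // already be present before dependent records like these are posted.
            foreach (var line in File.ReadLines("assessment_results.csv").Skip(1)) // skip header
            {
                var cols = line.Split(',');
                // This column-to-field mapping is what the Admin tool lets a BA configure.
                var json = "{"
                    + $"\"studentReference\": {{ \"studentUniqueId\": \"{cols[0]}\" }},"
                    + $"\"assessmentReference\": {{ \"assessmentIdentifier\": \"{cols[1]}\" }},"
                    + $"\"scoreResult\": \"{cols[2]}\""
                    + "}";
                var response = await client.PostAsync(
                    "data/v3/ed-fi/studentAssessments", // illustrative resource path
                    new StringContent(json, Encoding.UTF8, "application/json"));
                if (!response.IsSuccessStatusCode)
                    Console.WriteLine($"Row failed ({(int)response.StatusCode}): {line}");
            }
        }
    }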


Geoff McElhanon, CTO, Educuity

Bulk Data Exchange Technology

  • Aligning data standards - want to support JSON bulk load, not just XML

  • History

    • XML was chosen as standard in 2010, initially processed with SSIS

    • SSIS provided early value but became maintenance liability

    • 2013: Tennessee explored JSON but continued with XML.

    • Created C# app to read in XML instead of using SSIS

    • Still complex and difficult to maintain

    • 2015: replaced NHibernate with SQL Bulk Copy into a staging database, then a DB-to-DB push from staging to the transactional DB with built-in validations (essentially merge statements). Higher performance. A sketch of this pattern follows this list.

    • XML less and less valuable compared to JSON
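
    • A minimal sketch of that staging-then-merge pattern is below; the table and column names are illustrative assumptions, not the actual ODS schema.

      using System.Data;
      using System.Data.SqlClient;

      class StagedBulkLoader
      {
          static void BulkLoadStudents(DataTable rows, string connectionString)
          {
              using (var conn = new SqlConnection(connectionString))
              {
                  conn.Open();

                  // 1. High-throughput copy into a staging table (replacing the
                  //    per-row overhead of NHibernate).
                  using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "staging.Student" })
                  {
                      bulk.WriteToServer(rows);
                  }

                  // 2. DB-to-DB push: merge staging rows into the transactional
                  //    table, with validations expressed in SQL.
                  const string merge = @"
                      MERGE edfi.Student AS target
                      USING staging.Student AS source
                         ON target.StudentUniqueId = source.StudentUniqueId
                      WHEN MATCHED THEN
                         UPDATE SET target.LastSurname = source.LastSurname
                      WHEN NOT MATCHED THEN
                         INSERT (StudentUniqueId, LastSurname)
                         VALUES (source.StudentUniqueId, source.LastSurname);";
                  using (var cmd = new SqlCommand(merge, conn))
                  {
                      cmd.ExecuteNonQuery();
                  }
              }
          }
      }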

  • JSON proposal

    • Proposed to use JSON as primary data format

    • XML support continued to a lesser extent

    • Aligns bulk load format with transactional API, which helps vendors support both use cases

    • Support potential new bulk features

      • Bulk export of standard resources

      • Bulk export of changed resources (Change Events)

      • Bulk delete operations

  • Architecture

    • The XML loader reads the full file into memory, which is very problematic. JSON bulk loading can stream and is more efficient (see the sketch after this list).

    • Simple structure, e.g. [ { "schema": "ed-fi", "resource": "students", "data": [ ] }, { … } ]

    • Could support composite resources: { "compositePublisher": "ed-fi", "compositeCategory": "enrollment", "resource": "sections", "data": [ ] }

    • Schema for extensions? Should directly align
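
    • A sketch of how a streaming reader over the envelope format above avoids the full in-memory read (assumes Newtonsoft.Json; illustrative, not the actual loader):

      using System.IO;
      using Newtonsoft.Json;
      using Newtonsoft.Json.Linq;

      class BulkStreamReader
      {
          static void ReadBulkFile(string path)
          {
              using (var reader = new JsonTextReader(new StreamReader(path)))
              {
                  while (reader.Read())
                  {
                      // Materialize one resource envelope at a time; the rest of
                      // the file stays on disk, unlike the XML loader's full read.
                      if (reader.TokenType == JsonToken.StartObject && reader.Depth == 1)
                      {
                          var envelope = JObject.Load(reader);
                          var schema = (string)envelope["schema"];     // e.g. "ed-fi"
                          var resource = (string)envelope["resource"]; // e.g. "students"
                          foreach (var record in envelope["data"])
                          {
                              // Validate and upsert each record here.
                          }
                      }
                  }
              }
          }
      }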

  • What about client loader?

    • [Stephen did not follow the discussion due to lack of familiarity with the mentioned tool… ]

    • The client loader maps XML to JSON, so it would be even better to have tools write directly to JSON and the API.

  • What about loading directly to JSON data type in database?

    • Still need validations

  • Exporting changes

    • Change Events API provides raw Change Events, not including the actual resource changed

    • Adding an operation field to the schema, e.g. { …, "operation": "create", … }, would help both export and import of changed resources

    • What about partial updates? Probably not, many challenges with patching

  • Bulk load support for profiles now

  • What about adding filter queries in the API for bulk exports? Not currently in the works but should be feasible

  • Bulk deletes would include only natural keys

  • Batch JSON data transfers

    • Use same format as bulk to send small batches with data processed synchronously (whereas bulk is queued async)

    • Limit to small size < 1 MB

    • A lower-latency alternative to sending a large file, though a large file might make it easier to keep all referential integrity together

    • Example use case: fixing attendance - send a batch command to delete a record and replace it with a new one (sketched after this list).

      • In the current vision this would require two different batches, really a series of transactions, not one atomic transaction.
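
    • A speculative sketch of that attendance fix as two synchronous batch requests; the payload shape, including the "operation" field, combines ideas from these notes and is not a finalized format:

      using System.Net.Http;
      using System.Text;
      using System.Threading.Tasks;

      class BatchAttendanceFix
      {
          static async Task FixAttendance(HttpClient client)
          {
              // Batch 1: delete the bad record (bulk deletes carry only natural
              // keys; the key fields shown are simplified for illustration).
              var deleteBatch = @"[ { ""schema"": ""ed-fi"", ""resource"": ""studentSchoolAttendanceEvents"",
                                      ""operation"": ""delete"",
                                      ""data"": [ { ""studentUniqueId"": ""12345"", ""eventDate"": ""2018-04-02"" } ] } ]";
              await client.PostAsync("data/v3/batches",
                  new StringContent(deleteBatch, Encoding.UTF8, "application/json"));

              // Batch 2: create the corrected record. Two requests means a series
              // of transactions, not one atomic transaction.
              var createBatch = @"[ { ""schema"": ""ed-fi"", ""resource"": ""studentSchoolAttendanceEvents"",
                                      ""operation"": ""create"",
                                      ""data"": [ { ""studentUniqueId"": ""12345"", ""eventDate"": ""2018-04-02"",
                                                    ""attendanceEventCategoryDescriptor"": ""Excused Absence"" } ] } ]";
              await client.PostAsync("data/v3/batches",
                  new StringContent(createBatch, Encoding.UTF8, "application/json"));
          }
      }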

  • Hoping that this tool will support easier debugging of files compared to current XML

  • What Bulk API v2 might look like (a client sketch of this workflow follows at the end of these notes)

    • /bulk/v2

    • Operations

      • Create a bulk import

        • POST /bulk/v2/imports

        • Returns 201 and ID of the bulk operation

          • Does not return resource IDs

          • Could it return an array of IDs?

          • Could that be another operation on this API - asking for those resource IDs?

          • How would you map output to the input records?

          • Currently having to run a separate process to discover the resource IDs

          • Could this be handled by positional mapping? If there are 10,000 records in the input, the output contains exactly 10,000 records, each with an error or a resource ID.

      • Add data to a bulk import

        • POST /bulk/v2/imports/{id}/append

        • Max 150 MB per batch; 4 MB recommended

        • Returns 200

      • Complete a bulk import

        • /bulk/v2/imports/{id}/complete

        • Response contains createdItems, updatedItems, errors arrays

          • { schema, resource, key, id, link, etag }

          • Error { schema, resource, key, statusCode, message }

      • Abort a bulk import (before processing starts)

      • Get information about existing bulk import

        • GET /bulk/v2/imports

        • GET /bulk/v2/imports/{id}/errors

        • Authorization based on ownership (who created the import)

      • Export

        • POST /bulk/v2/export

        • Could this have an OData layer?

        • Creates a URL for retrieving (zipped) file contents

          • 404 if you call before the file is ready
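
        • A speculative polling sketch: request the export, then poll the returned URL, treating 404 as "not ready yet" (endpoints and response shape are proposals only):

          using System;
          using System.Net;
          using System.Net.Http;
          using System.Threading.Tasks;

          class BulkV2Export
          {
              static async Task<byte[]> DownloadExport(HttpClient client)
              {
                  var create = await client.PostAsync("bulk/v2/export", null);
                  var fileUrl = await create.Content.ReadAsStringAsync(); // assume the body carries the URL

                  while (true)
                  {
                      var response = await client.GetAsync(fileUrl);
                      if (response.StatusCode == HttpStatusCode.NotFound)
                      {
                          // File not ready yet; wait and retry.
                          await Task.Delay(TimeSpan.FromSeconds(30));
                          continue;
                      }
                      return await response.Content.ReadAsByteArrayAsync(); // zipped file contents
                  }
              }
          }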

      • Batch - POST /data/v3/batches [ { schema, resource, data: [ ] } ]

    • Bulk import workflow

      • Diagram

      • Using Hangfire
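
To tie the proposed operations together, here is a client-side sketch of the /bulk/v2 import workflow (create, append in chunks, complete, read errors). Every endpoint and response shape below is a proposal from this session, and the use of POST for /complete and the ID/error handling are assumptions, not a shipping API.

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Net.Http;
    using System.Text;
    using System.Threading.Tasks;

    class BulkV2Client
    {
        static async Task RunImport(HttpClient client, string bulkFilePath)
        {
            // 1. Create the bulk import; the proposal returns 201 and the ID of
            //    the bulk operation (not resource IDs -- those would come from
            //    /complete or, possibly, positional mapping).
            var create = await client.PostAsync("bulk/v2/imports",
                new StringContent("{}", Encoding.UTF8, "application/json"));
            var importId = await create.Content.ReadAsStringAsync(); // assume the body carries the ID

            // 2. Append data in chunks (max 150 MB per batch, 4 MB recommended).
            foreach (var chunk in ReadChunks(bulkFilePath, 4 * 1024 * 1024))
            {
                await client.PostAsync($"bulk/v2/imports/{importId}/append",
                    new StringContent(chunk, Encoding.UTF8, "application/json"));
            }

            // 3. Complete; the response contains createdItems, updatedItems,
            //    and errors arrays.
            var complete = await client.PostAsync($"bulk/v2/imports/{importId}/complete", null);
            Console.WriteLine(await complete.Content.ReadAsStringAsync());

            // 4. Read errors for this import (authorization is by ownership).
            var errors = await client.GetAsync($"bulk/v2/imports/{importId}/errors");
            Console.WriteLine(await errors.Content.ReadAsStringAsync());
        }

        // Naive chunker for illustration; a real client would split on record
        // boundaries so that each append is well-formed JSON.
        static IEnumerable<string> ReadChunks(string path, int size)
        {
            using (var reader = new StreamReader(path))
            {
                var buffer = new char[size];
                int read;
                while ((read = reader.Read(buffer, 0, size)) > 0)
                    yield return new string(buffer, 0, read);
            }
        }
    }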