Wednesday, April 11, 2018
3:45 PM - 5:15 PM | Bulk Data Exchange Technology and the "Long Tail" of Data | Las Palmas C
Join a focused community discussion on issues related to Ed-Fi bulk data exchange technologies and consideration of new strategies for incorporating the “long tail” of data into Ed-Fi ODS implementations.
These broad session notes attempt to capture the spirit of the discussion and should not be interpreted as a transcript. Although Ed-Fi Alliance staff were involved in capturing these observations, the notes below should not be construed as official, complete, or 100% accurate.
Slide decks:
Chris Moffatt, VP Technology, Ed-Fi Alliance
Topics for this session center on getting data into and out of the ODS outside the API
- PPT’s posted to https://techdocs.ed-fi.org/display/EE/2018+Technical+Congress
- Data Flow Code available at https://github.com/schoolstacks/dataflow
Jason Hoekstra, Senior Program Manager, Ed-Fi Alliance
“Long Tail” of Data
Many states and SIS vendors are now loading data into the ODS via the API
Work continues to bring more vendors onto the API
What about other data types that need support but are not yet covered by vendor solutions?
Trying to gather input and guidance right now
What domains are in the “long tail”?
Assessment data (other than summary) - formative, professional development
HR and finance data, particularly as they relate to ESSA
Instructional / LMS data that are outside of a SIS
Intervention data
Special education data
Transportation
Audience input: teacher and student surveys on the effectiveness of intervention work; social and emotional learning data (separate from special education; tools such as Panorama); nutrition and health; college & career (feedback?) (Wal-Mart is working on a career readiness standard)
Most of the room relates to 3 or more of these topics
Common qualities
These data are unlikely to be API native in the short term (next 3 years). Messy and hard to manage. But valuable for analysis.
Quarterly or monthly in frequency
Mostly in CSV format or exportable to CSV
Home grown solutions in Access, FoxPro, Excel
Accessible via SFTP or local network
Any other recurring themes? None noted by the audience.
Data Flow Project
MSDF-sponsored proof-of-concept project
Partners: RTI and Lesson Planning applications
Districts wanting more visualizations about what students know, strategies around assessments
Toolkit is a lightweight mapping and ETL solution for loading assessment data into the ODS
.NET solution paired with Cloud ODS
Open source, working to figure out what to do with it next
Admin, ETL, and Reporting/Cleanup modules
Admin tool configures mapping of CSV files to API calls (see the sketch below)
Quick demo - Jason is available for a longer demo after the Congress. There was a clear request for a follow-up WebEx.
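As a rough illustration of the mapping concept only (not Data Flow's actual code; the CSV headers, field paths, and endpoint here are hypothetical), the idea is that a BA-authored column-to-field mapping drives generated API calls:

    using System;
    using System.Collections.Generic;
    using System.Net.Http;
    using System.Text;
    using System.Threading.Tasks;
    using Newtonsoft.Json.Linq;

    class MappingSketch
    {
        // BA-authored mapping: CSV column -> Ed-Fi API field (dotted paths would be
        // expanded into nested JSON in a real implementation)
        static readonly Dictionary<string, string> Mapping = new Dictionary<string, string>
        {
            ["StudentID"]  = "studentReference.studentUniqueId",
            ["ScaleScore"] = "scoreResult"
        };

        static async Task LoadRowAsync(HttpClient api, IDictionary<string, string> csvRow)
        {
            var payload = new JObject();
            foreach (var map in Mapping)
                payload[map.Value] = csvRow[map.Key];   // flat assignment for brevity

            var body = new StringContent(payload.ToString(), Encoding.UTF8, "application/json");
            var response = await api.PostAsync("/data/v3/ed-fi/studentAssessments", body);
            response.EnsureSuccessStatusCode();
        }
    }

The point of the design is that only the mapping changes per file format; the ETL plumbing is reused, which is what lowers implementation cost.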
Audience
How is this tool different from general vendor products like SSIS, Talend, WebLogic?
In creating this, tried to narrow down the ETL problem by mapping only to the ODS API standard. Not a general-purpose ETL tool.
Allows implementers to use a simple tool without having the high learning curve around standard industry tools. Enables a BA to map this instead of requiring a programmer.
Re-usable code then lowers cost of implementation.
Why is the Ed-Fi Alliance so technology-specific? It seems to be when looking at Tech Docs.
Standard is technology neutral
The Ed-Fi team's reference implementation of choice uses the Microsoft stack, but the Alliance does not strictly endorse or require it.
From multiple people: this is very valuable and exciting work that lowers maintenance burden
Ed-Fi Alliance still asking vendors to natively support the API - this solution is for small and custom applications that don’t have the capacity for building interoperability.
Quality concerns?
This is not a perfect solution - e.g. there is a concern that pushing a duplicate record becomes an update; is that what is really wanted? It is still ideal to code directly to the API
Data shown in the tool using the API? Yes
Staged approach to loading data, i.e. one file might need to be loaded before another in order to populate reference information. Roster data for example.
This is not an Ed-Fi CSV Format.
Aware of required fields & referential integrity?
Required fields: yes. Referential: no.
BA will need to have an understanding of the domain model.
Is there a way to share (and govern) the mappings?
How does this relate to MappingEDU?
That tool maps standard-to-standard
[Stephen: I didn’t hear a clear answer about the difference between these two, except that they came from different initiatives. Perhaps this is my lack of experience. But points to need for a little more clarity]
One-off and re-use are two different use cases
Any other tools you’re working with? A few people starting to work on similar solutions.
To be continued...
Geoff McElhanon, CTO, Educuity
Bulk Data Exchange Technology
Aligning data standards - want to support JSON bulk load, not just XML
History
XML was chosen as standard in 2010, initially processed with SSIS
SSIS provided early value but became maintenance liability
2013: Tennessee explored JSON but continued with XML.
Created a C# app to read in XML instead of using SSIS
Still complex and difficult to maintain
2015: replaced NHibernate with SQL Bulk Copy into a staging database, then a DB-to-DB push from staging to the transactional DB with built-in validations. Essentially merge statements. Higher performance. (See the sketch after this list.)
XML less and less valuable compared to JSON
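A minimal sketch of that staging-then-merge pattern, assuming hypothetical staging and target table names (the actual ODS implementation differs in detail):

    using System.Data;
    using System.Data.SqlClient;

    static void BulkLoad(DataTable records, SqlConnection conn)
    {
        // 1. Stream rows into a staging table with SqlBulkCopy (fast, set-based)
        using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "staging.Student" })
        {
            bulk.WriteToServer(records);
        }

        // 2. DB-to-DB push: merge staged rows into the transactional table,
        //    where validations are enforced
        const string merge = @"
            MERGE edfi.Student AS target
            USING staging.Student AS source
                ON target.StudentUniqueId = source.StudentUniqueId
            WHEN MATCHED THEN UPDATE SET target.LastSurname = source.LastSurname
            WHEN NOT MATCHED THEN INSERT (StudentUniqueId, LastSurname)
                VALUES (source.StudentUniqueId, source.LastSurname);";
        using (var cmd = new SqlCommand(merge, conn))
        {
            cmd.ExecuteNonQuery();
        }
    }

The performance gain comes from replacing row-by-row ORM inserts with one set-based copy plus one merge per resource.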
JSON proposal
Proposed to use JSON as primary data format
XML support continued to lesser extent
Aligns bulk load format with transactional API, which helps vendors support both use cases
Support potential new bulk features
Bulk export of standard resources
Bulk export of changed resources (Change Events)
Bulk delete operations
Architecture
The XML loader reads the full file into memory, which is very problematic; JSON bulk loading is more efficient.
Simple structure, e.g. [ { "schema": "ed-fi", "resource": "students", "data": [ ] }, { … } ]
Could support composite resources: { "compositePublisher": "ed-fi", "compositeCategory": "enrollment", "resource": "sections", "data": [ ] }
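One way the efficiency gain could be realized is a forward-only streaming parse of that envelope structure, so the full file never sits in memory; a sketch using Json.NET (the file name and record handling are hypothetical, not the actual ODS loader):

    using System.IO;
    using Newtonsoft.Json;
    using Newtonsoft.Json.Linq;

    using (var file = File.OpenText("bulk.json"))
    using (var reader = new JsonTextReader(file))
    {
        string resource = null;
        while (reader.Read())                       // forward-only token stream
        {
            if (reader.TokenType != JsonToken.PropertyName) continue;

            var name = (string)reader.Value;
            if (name == "resource")
            {
                resource = reader.ReadAsString();   // e.g. "students"
            }
            else if (name == "data")
            {
                reader.Read();                      // advance to StartArray
                while (reader.Read() && reader.TokenType != JsonToken.EndArray)
                {
                    var record = JObject.Load(reader);  // materialize one record at a time
                    // hand the record off to the loader for the current resource
                }
            }
        }
    }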
Schema for extensions? Should directly align
What about client loader?
[Stephen did not follow the discussion due to lack of familiarity with the mentioned tool… ]
Client loader maps XML to JSON, so even better to have tools write directly to JSON and the API.
What about loading directly to JSON data type in database?
Still need validations
Exporting changes
Change Events API provides raw Change Events, not including the actual resource changed
Adding an operation field to the schema, e.g. { … "operation": "create" … }, would help both export and import of changed resources
What about partial updates? Probably not, many challenges with patching
Bulk load now has support for profiles
What about adding filter queries in the API for bulk exports? Not currently in the works but should be feasible
Bulk deletes would include only natural keys
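Combining the proposed operation field with natural-key-only deletes, an exported change entry might look like the following sketch (the shape is speculative, extrapolated from the discussion, and the key value is made up):

    // Speculative shape for a changed-resource export entry
    var exportEntry = new
    {
        schema = "ed-fi",
        resource = "students",
        operation = "delete",                        // proposed operation field
        data = new[]
        {
            new { studentUniqueId = "604821" }       // deletes carry only natural keys
        }
    };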
Batch JSON data transfers
Use the same format as bulk to send small batches, with data processed synchronously (whereas bulk is queued asynchronously)
Limit to small size < 1 MB
A lower-latency alternative to sending a large file, though a large file might more easily contain all referential integrity
Example use case: fixing attendance - send a batch command to delete a record and replace it with a new one (see the sketch below).
Under the current vision this would require two different batches: really a series of transactions, not one atomic transaction.
Hoping that this tool will support easier debugging of files compared to current XML
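A sketch of that attendance fix as two synchronous batch calls; the endpoint follows the batch proposal below, and the payload bodies are placeholders, not a final format:

    using System;
    using System.Net.Http;
    using System.Text;

    var api = new HttpClient { BaseAddress = new Uri("https://ods.example.org") };

    // Batch 1: delete the incorrect attendance record (natural keys only in the body)
    var deleteBatch = @"[ { ""schema"": ""ed-fi"", ""resource"": ""studentSchoolAttendanceEvents"",
        ""operation"": ""delete"", ""data"": [ { } ] } ]";
    await api.PostAsync("/data/v3/batches",
        new StringContent(deleteBatch, Encoding.UTF8, "application/json"));

    // Batch 2: create the corrected record; note these are two separate
    // transactions, not one atomic unit
    var createBatch = @"[ { ""schema"": ""ed-fi"", ""resource"": ""studentSchoolAttendanceEvents"",
        ""operation"": ""create"", ""data"": [ { } ] } ]";
    await api.PostAsync("/data/v3/batches",
        new StringContent(createBatch, Encoding.UTF8, "application/json"));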
Bulk API v2 might look like
/bulk/v2
Operations
Create a bulk import
POST /bulk/v2/imports
Returns 201 and ID of the bulk operation
Does not return resource IDs
Could it return an array of IDs?
Could that be another operation on this API - asking for those resource IDs?
How would you map output to the input records?
Currently having to run a separate process to discover the resource IDs
Could this be handled by positional mapping? If there are 10,000 records in the input, the output would contain exactly 10,000 records, each with an error or a resource ID.
Add data to a bulk import
POST /bulk/v2/imports/{id}/append
Max 150 MB per batch; recommended 4 MB
Returns 200
Complete a bulk import
/bulk/v2/imports/{id}/complete
Response contains createdItems, updatedItems, errors arrays
{ schema, resource, key, id, link, etag }
Error { schema, resource, key, statusCode, message }
Abort a bulk import (before processing starts)
Get information about existing bulk import
GET /bulk/v2/imports
GET /bulk/v2/imports/{id}/errors
Authorization based on ownership (who created)
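Putting the proposed import operations together, a client's happy path might look like this sketch (endpoints as proposed above; the ID handling and payloads are illustrative placeholders):

    using System;
    using System.Net.Http;
    using System.Text;

    var api = new HttpClient { BaseAddress = new Uri("https://ods.example.org") };

    // 1. Create the bulk import; a 201 response carries the operation's ID
    await api.PostAsync("/bulk/v2/imports",
        new StringContent("{}", Encoding.UTF8, "application/json"));
    string importId = "42";   // placeholder: parse the real ID from the create response

    // 2. Append data in batches (max 150 MB per batch, ~4 MB recommended)
    string jsonBatch = @"[ { ""schema"": ""ed-fi"", ""resource"": ""students"", ""data"": [ ] } ]";
    await api.PostAsync($"/bulk/v2/imports/{importId}/append",
        new StringContent(jsonBatch, Encoding.UTF8, "application/json"));

    // 3. Complete; the response contains createdItems, updatedItems, and errors arrays
    await api.PostAsync($"/bulk/v2/imports/{importId}/complete", null);

    // 4. Inspect errors once asynchronous processing finishes
    string errors = await api.GetStringAsync($"/bulk/v2/imports/{importId}/errors");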
Export
POST /bulk/v2/export
Could this have an OData layer?
Creates a URL for retrieving (zipped) file contents
404 if you call before the file is ready
Batch - POST /data/v3/batches [ { schema, resource, data: [ ] } ]
Bulk import workflow
Diagram shown (see the posted slide deck)
Using Hangfire
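For context on the Hangfire note, the complete step can enqueue processing as a background job and return immediately; a minimal sketch (the job class and method are hypothetical):

    using Hangfire;

    class BulkImportProcessor                // hypothetical job class
    {
        public static void Process(string importId)
        {
            // validate, stage, and merge the uploaded data for this import
        }
    }

    // On /complete, the API enqueues the work and responds without waiting:
    BackgroundJob.Enqueue(() => BulkImportProcessor.Process("42"));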