The goal is to consolidate all available and required data and prepare it for modeling.
Some projects require labeled data in order to train a model for a specific task. In most cases the client does the labeling, and we support them with tooling and similar help, because we generally do not have the domain knowledge needed to label the data correctly.
Steps:
- Identify sources for required data
- Curate and validate data sources
- Perform syntactic quality check
- Assess data cleanliness
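The last two steps above can be sketched in code. The following is a minimal illustration, not a prescribed procedure: it assumes the data arrives as CSV, and the column names, the `EXPECTED_COLUMNS` schema, and the `assess` function are all hypothetical. The syntactic check counts values that cannot be parsed as the expected type, and the cleanliness assessment counts blank values and exact duplicate rows.

```python
import csv
import io
from collections import Counter

# Hypothetical expected schema: column name -> parser that raises ValueError on bad input.
EXPECTED_COLUMNS = {"id": int, "temperature": float, "label": str}


def assess(csv_text: str) -> dict:
    """Run a syntactic check and a basic cleanliness assessment on CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    report = {
        "missing_columns": sorted(set(EXPECTED_COLUMNS) - set(header)),
        "rows": 0,
        "parse_errors": Counter(),  # syntactic: value cannot be parsed as the expected type
        "empty_values": Counter(),  # cleanliness: value is blank or absent
        "duplicate_rows": 0,        # cleanliness: exact repeat of an earlier row
    }
    seen = set()
    for row in reader:
        report["rows"] += 1
        key = tuple(row.get(col) for col in header)
        if key in seen:
            report["duplicate_rows"] += 1
        seen.add(key)
        for col, parse in EXPECTED_COLUMNS.items():
            value = (row.get(col) or "").strip()
            if value == "":
                report["empty_values"][col] += 1
                continue
            try:
                parse(value)
            except ValueError:
                report["parse_errors"][col] += 1
    return report


# Made-up sample: one blank value, one unparseable value, one duplicate row.
sample = "id,temperature,label\n1,20.5,ok\n2,,ok\n3,abc,bad\n1,20.5,ok\n"
report = assess(sample)
print(report)
```

A report like this is only a starting point for the discussion with the client. Which error rates are acceptable depends on the project.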
Deliverables:
- Consolidated data specification
- ETL procedure for getting data into the model
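To make the two deliverables more concrete, here is one possible shape they could take, under stated assumptions: the specification is expressed as a list of field definitions, and the ETL procedure is a single function that applies it to CSV input. The `FieldSpec` class, the `etl` function, and the column names are hypothetical, and the "load" step is reduced to returning the transformed rows rather than writing them anywhere.

```python
from __future__ import annotations

import csv
import io
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class FieldSpec:
    """One entry in a (hypothetical) consolidated data specification."""
    source_column: str                 # column name as delivered by the client
    target_name: str                   # name the modeling code expects
    convert: Callable[[str], object]   # parser; raises ValueError on bad input
    required: bool = True


# Illustrative specification; real column names come from the agreed data spec.
SPEC = [
    FieldSpec("Id", "id", int),
    FieldSpec("Temp (C)", "temperature_c", float),
    FieldSpec("Comment", "comment", str, required=False),
]


def etl(csv_text: str, spec: list[FieldSpec]) -> tuple[list[dict], list[str]]:
    """Extract rows from CSV text and transform them per the spec.

    Returns (accepted rows, rejection messages).
    """
    loaded, rejected = [], []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=2):
        record = {}
        try:
            for field in spec:
                raw = (row.get(field.source_column) or "").strip()
                if raw == "":
                    if field.required:
                        raise ValueError(f"missing {field.source_column}")
                    record[field.target_name] = None
                else:
                    record[field.target_name] = field.convert(raw)
        except ValueError as exc:
            rejected.append(f"line {line_no}: {exc}")
            continue
        loaded.append(record)
    return loaded, rejected


raw = "Id,Temp (C),Comment\n1,20.5,fine\n2,,no reading\n3,18.0,\n"
rows, rejects = etl(raw, SPEC)
print(rows)
print(rejects)
```

Keeping the specification as data, separate from the function that applies it, means the same document can be reviewed with the client and executed against incoming files.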
The data collection, cleaning, and organizing work varies considerably between customers and projects. In the simplest case, we provide a data specification, agree on how the data will be transferred, and run a basic inspection to confirm that the data is suitable for the model before training starts. In the most complex case, where the customer may not have in-house staff for this work, we do all of it ourselves.