Many machine learning problems involve more than a single data source related to the problem of interest. For instance, if one aims to develop a rich city model, datasets can be obtained describing not only the plan of the streets, avenues, and roads, but also the visual appearance from online cameras or street views, the population density across the districts, the geography of the region (including the existence of lakes, coast, mountains, etc.), and the economy of the region, among many other important related aspects. Such data integration has been considered strategic in the recent urban informatics literature as well as in other fields. However, these datasets are often obtained and organized independently and have varying degrees of completeness and quality. Equally important is the fact that such datasets are not directly useful, demanding transformations in order to be analyzed more effectively for specific applications. In the case of the previous example, the databases representing the streets and avenues are often organized in CAD structures that represent streets as polylines with control points defined by street intersections or high-curvature points, while a more powerful representation would be a graph or network containing only the former type of control points. It is therefore necessary to remove the latter points while performing consistency checking.

The integration of datasets is a critical task, because the data in each independent database must be combined into a coherent whole. In the case of the previous example, it would be necessary to integrate the geographical coordinates into the control points defining the streets. Frequently, such an integration demands incorporating into the processing a relatively high level of intelligence about the data, objectives, and application.
It is important to have common mathematical/computational intermediate representations to perform meaningful integration.

These two critical tasks, integration and transformation, therefore define the kernel of the present approach, as represented in the diagram in Figure 2 of the Thematic Project. Two particular topics are of interest: dataset augmentation and quality control.
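
As an illustration of the street transformation mentioned earlier (reducing CAD polylines to a graph whose nodes are intersections only), a minimal Python sketch follows. It is not the project's method. The function name `polylines_to_graph`, the input layout (each street as a list of coordinate tuples), and the junction rule are assumptions made for this example: a control point is treated as an intersection when it is shared by two or more polylines or when it is an endpoint of a polyline, and every other point is treated as a high-curvature point and dropped.

```python
from collections import Counter


def polylines_to_graph(polylines):
    """Collapse street polylines into graph edges between junction nodes.

    Assumed rule (hypothetical, for illustration only): a control point is
    a junction if two or more polylines share it or if it is the first or
    last point of a polyline. All other points are discarded.
    """
    # Count in how many distinct polylines each control point occurs.
    usage = Counter(p for line in polylines for p in set(line))

    nodes, edges = set(), []
    for line in polylines:
        if len(line) < 2:
            continue  # consistency check: a street needs at least two points
        start = line[0]
        nodes.add(start)
        for i, point in enumerate(line[1:], start=1):
            is_junction = usage[point] > 1 or i == len(line) - 1
            if is_junction:
                nodes.add(point)
                if point != start:  # skip degenerate zero-length edges
                    edges.append((start, point))
                start = point
    return nodes, edges


# Two hypothetical streets crossing at (2, 0); the points (1, 0.2) and
# (3, 0.1) only describe curvature along the first street.
main_street = [(0, 0), (1, 0.2), (2, 0), (3, 0.1), (4, 0)]
cross_street = [(2, -1), (2, 0), (2, 1)]
nodes, edges = polylines_to_graph([main_street, cross_street])
```

In this sketch the curvature points disappear and each street is split at the shared point, so the crossing becomes a single node with four incident edges. Real street data would also need tolerance-based matching of nearly coincident coordinates, which exact tuple equality does not provide.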