Data Enrichment

Exception, Anomaly, and Threat Detection

Eric D. Knapp, Joel Thomas Langill, in Industrial Network Security (Second Edition), 2015

Data Enrichment

Data enrichment refers to the process of appending or otherwise enhancing collected data with relevant context obtained from additional sources. For example, if a username is found within an application log, that username can be referenced against a central IAM system (or ICS application if Application Security is deployed) to obtain the user's actual name, departmental roles, privileges, and so on. This additional information "enriches" the original log with this context. Similarly, an IP address can be used to enrich a log file, referencing IP reputation servers for external addresses to see if there is known threat activity associated with that IP address, or by referencing geolocation services to determine the physical location of the IP address by country, state, or postal code (see "Additional Context" in Chapter 12, "Security Monitoring of Industrial Control Systems," for more examples of contextual data).

Caution

Many of the advanced security controls described in this chapter leverage the use of external threat intelligence data. It is always important to remember to follow strict security policies on network connectivity between trusted control zones and less-trusted enterprise and public (i.e. Internet) zones. This can be addressed by proper location of local assets requiring remote information, including the creation of dedicated "security zones" within the semi-trusted DMZ framework.

Data enrichment can occur in two primary ways. The first is by performing a lookup at the time of collection and appending the contextual information into the log. Another method is to perform a lookup at the time the event is scrutinized by the SIEM or log management system. Although both provide the relevant context, each has advantages and disadvantages. Appending the data at the time of collection provides the most accurate representation of context and prevents misrepresentations that may occur as the network environment changes. For example, if IP addresses are provided via the Dynamic Host Configuration Protocol (DHCP), the IP address associated with a specific log could be different at the time of collection than at the time of analysis. Although more accurate, this type of enrichment also burdens the analysis platform by increasing the amount of stored data. It is important to ensure that the original log file is maintained for compliance purposes, requiring the system to replicate the original raw log records prior to enrichment.

The alternative, providing the context at the time of analysis, removes these additional requirements at the cost of accuracy. Although there is no hard rule indicating how a particular product enriches the data that it collects, traditional Log Management platforms tend toward analytical enrichment, whereas SIEM platforms tend toward enrichment at the time of collection, perhaps because most SIEM platforms already replicate log data for parsing and analysis, minimizing the additional burden associated with this type of enrichment.
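
To make the collection-time option concrete, here is a minimal Python sketch of enriching a parsed log record while keeping the raw record intact; the lookup functions, field names, and returned attributes are hypothetical placeholders rather than anything prescribed by the text.

# Minimal sketch of collection-time enrichment (all lookups are hypothetical stubs).
import copy

def lookup_user(username):
    # Stand-in for a query against a central IAM/directory service.
    return {"full_name": "J. Smith", "department": "Operations", "role": "Engineer"}

def lookup_ip(ip):
    # Stand-in for IP reputation and geolocation services.
    return {"reputation": "clean", "country": "US", "region": "TX"}

def enrich_at_collection(raw_event):
    event = copy.deepcopy(raw_event)      # keep the raw record untouched for compliance
    event["user_context"] = lookup_user(raw_event["username"])
    event["ip_context"] = lookup_ip(raw_event["src_ip"])
    return raw_event, event               # store both the raw and the enriched copy

raw = {"username": "jsmith", "src_ip": "203.0.113.7", "action": "login"}
original, enriched = enrich_at_collection(raw)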

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780124201149000113

Data Quality Management

Mark Allen, Dalton Cervo, in Multi-Domain Master Data Management, 2015

Data Enrichment

Data enrichment or augmentation is the process of enhancing existing information by supplementing missing or incomplete data. Typically, data enrichment is achieved by using external data sources, but that is not always the case.

In large companies with multiple disparate systems and fragmented information, it is not unusual to enrich the information provided by one source with data from another. This is particularly common during data migration, where customer information is fragmented among multiple systems and the data from one system are used to complement data from the other and form a more complete data record in the MDM repository.

As with any other data-quality effort, data enrichment must serve a business purpose. New requirements come along that may require data to be augmented. Here are some examples:

A new marketing campaign requires nonexisting detail information about a set of customers, such as Standard Industry Code (SIC), annual sales information, company family information, etc.

A new tax calculation process requires county information for all U.S. address records, or an extended format for U.S. postal code, which includes ZIP + 4.

A new legal requirement requires province information to be populated for Italian addresses.

Much of this additional information needs to come from an external reference source, such as Dun & Bradstreet or OneSource for customer data enrichment, postal code references for address augmentation, and so on.

It can be quite a challenge to enrich data. This process all starts with the quality of the existing data. If the existing information is incorrect or too incomplete, it may be impossible to match it to a reference source to supplement what is missing. It can be quite expensive as well, since the bulk of the reference sources will either require a subscription fee or charge by volume or by specific regional data sets.

When matching data to another source, there is always the risk that the match will not be accurate. Most companies providing customer matching services with their sources will include an automated score representing their confidence level with the match. For instance, a score of 90 means a confidence level of 90 percent that the match is good. Companies will need to work with their data vendors to decide what is acceptable for their business. Typically, there are three ranges (a small triage sketch follows the list):

Higher range: For example, 80 percent and above, where matches are automatically accepted

Middle range: For example, between 60 and 80 percent, where matches have to be manually analyzed to determine if they are good or not

Lower range: For instance, 60 percent and below, where matches are automatically refused
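
A minimal sketch of this triage in Python, assuming the vendor returns a 0-100 confidence score and using the example thresholds above:

def triage_match(score, accept_at=80, review_at=60):
    """Route a vendor match score into accept / manual review / reject buckets."""
    if score >= accept_at:
        return "accept"         # automatically accepted
    if score >= review_at:
        return "manual_review"  # an analyst decides
    return "reject"             # automatically refused

print(triage_match(90))  # accept
print(triage_match(72))  # manual_review
print(triage_match(45))  # reject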

Once a match is deemed correct, the additional information provided by the reference source can be used to enrich the existing data. Address enrichment is very common, where the combination of some address elements is used to find what is missing. Examples include using postal code to figure out city and state, or using address line, city, and state to determine postal code.

The challenge comes when there is conflicting information. For example, let's say that city, state, and postal code are all populated. However, when trying to enrich county information, the postal code suggests one county, while the city and state suggest another. The final choice comes down to the confidence level of the original information. If the intent is to automate the matching process, it may be necessary to evaluate which information is usually populated more accurately according to that given system and the associated business practice. If it is not possible to make that determination, a manual inspection is likely to be required for such conflicting situations.
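
One way to encode such a precedence rule is sketched below; treating the postal code as the more trustworthy field is an assumption made purely for illustration, not a recommendation from the text.

def resolve_county(county_from_zip, county_from_city_state, trust_zip_more=True):
    """Resolve conflicting county lookups using a per-source precedence rule."""
    if county_from_zip == county_from_city_state:
        return county_from_zip, "agreed"
    if county_from_zip is None or county_from_city_state is None:
        return county_from_zip or county_from_city_state, "single_source"
    preferred = county_from_zip if trust_zip_more else county_from_city_state
    return preferred, "conflict_flagged_for_review"

print(resolve_county("Travis", "Williamson"))  # ('Travis', 'conflict_flagged_for_review')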

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780128008355000099

Security Monitoring of Industrial Control Systems

Eric D. Knapp, Joel Thomas Langill, in Industrial Network Security (Second Edition), 2015

Data management

The next step in security monitoring is to utilize the relevant security information that has been collected. Proper analysis of this data can provide the situational awareness necessary to detect incidents that could impact the safety and reliability of the industrial network.

Ideally, the SIEM or Log Manager will perform many underlying detection functions automatically—including normalization, data enrichment, and correlation (see Chapter 11, "Exception, Anomaly, and Threat Detection")—providing the security analyst with the following types of data at their disposal:

The raw log and event details obtained by monitoring relevant systems and services, normalized to a common taxonomy.

The larger "incidents" or more sophisticated threats derived from those raw events, which may include correlation with external global threat intelligence sources.

The associated necessary context to what has been observed (raw events) and derived (correlated events).

Typically, an SIEM will present a high-level view of the available data on a dashboard or console, as illustrated in Figure 12.12, which shows the dashboard of the Open Source Security Information Management (OSSIM) platform. With this information in hand, automated and manual interaction with the data can occur. This information can be queried directly to obtain direct answers to explicit questions. It can also be formulated into a report to satisfy specific business, policy, or compliance goals, or it can be used to proactively or reactively notify a security or operations officer of an incident. The information is also available to further investigate incidents that have already occurred.

Figure 12.12. The Open Source Security Information Management project.

Queries

The term "query" refers to a request for information from the centralized data store. This can sometimes be an actual database query, using structured query language (SQL), or it may be a plain-text request to make the data more accessible to users without database administration skills (although these requests may use SQL queries internally, hidden from the user). Common examples of initial queries include the following:

Top 10 talkers (by total network bandwidth used)

Top talkers (by unique connections or flows)

Top events (by frequency)

Top events (by severity)

Top events over time

Top applications in use

Open ports.

These requests can be made against any or all data that are available in the data store (see the section "Data Availability"). By providing additional conditions or filters, queries can be focused, yielding results more relevant to a specific situation. For example (a query sketch follows this list):

Top 10 talkers during non-business hours

Top talkers using specific industrial network protocols

All events of a common type (e.g. user account changes)

All events targeting a specific asset or assets (e.g. critical assets within a specific zone)

All ports and services used by a specific asset or assets

Top applications in use within more than one zone.
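
As a sketch of what one such filtered query might look like against a normalized event store, the Python fragment below builds a throwaway SQLite table and runs a "top talkers during non-business hours" query; the schema, sample rows, and business-hours window are illustrative only, not the schema of any particular SIEM.

import sqlite3

# Hypothetical normalized event store; schema and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (src_ip TEXT, bytes INTEGER, event_time TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    ("10.0.0.5", 120000, "2015-03-02 02:14:00"),
    ("10.0.0.7",  80000, "2015-03-02 03:30:00"),
    ("10.0.0.5",  15000, "2015-03-02 14:02:00"),
])

# "Top 10 talkers during non-business hours" expressed as a filtered query.
top_talkers_after_hours = """
    SELECT src_ip, SUM(bytes) AS total_bytes
    FROM events
    WHERE CAST(strftime('%H', event_time) AS INTEGER) NOT BETWEEN 8 AND 17
    GROUP BY src_ip
    ORDER BY total_bytes DESC
    LIMIT 10
"""
for src_ip, total_bytes in conn.execute(top_talkers_after_hours):
    print(src_ip, total_bytes)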

Query results can be returned in a number of ways: via delimited text files, a graphical user interface or dashboard, preformatted executive reports, an alert that is delivered by SMS or email, and so on. Figure 12.13 shows user activity filtered by a specific event type—in this example, administrative account change activities that correspond with NERC compliance requirements.

Figure 12.13. An SIEM dashboard showing administrative account changes.

A defining function of an SIEM is to correlate events to find larger incidents (see Chapter 11, "Exception, Anomaly, and Threat Detection"). This includes the ability to define correlation rules, as well as to present the results via a dashboard. Figure 12.14 shows a graphical event correlation editor that allows logical conditions to be defined (such as "if A and B then C"), while Figure 12.15 shows the result of an incident query—in this case the selected incident (an HTTP Command and Control Spambot) being derived from four discrete events.
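
A stripped-down illustration of the "if A and B then C" idea, using hypothetical event types and a single in-memory rule (real correlation engines also handle time windows, thresholds, and field matching):

# Minimal "if A and B then C" correlation sketch with invented event types.
def correlate(events, rule):
    seen = {e["type"] for e in events}
    if all(required in seen for required in rule["requires"]):
        return {"type": rule["produces"], "source_events": events}
    return None

rule = {"requires": {"failed_login_burst", "new_admin_account"},
        "produces": "possible_account_takeover"}

events = [{"type": "failed_login_burst", "host": "hmi01"},
          {"type": "new_admin_account", "host": "hmi01"}]

incident = correlate(events, rule)
print(incident["type"] if incident else "no incident")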

Figure 12.14. An example of a graphical interface for creating event correlation rules.

Figure 12.15. An SIEM dashboard showing a correlated event and its source events.

Reports

Reports select, organize, and format all relevant data from the enriched logs and events into a single document. Reports provide a useful means to present almost any information set. Reports can summarize high-level incidents for executives, or include precise and comprehensive documentation that provides minute details for internal auditing or for compliance. An example of a report generated by an SIEM is shown in Figure 12.16, showing a quick summary of OSIsoft PI Historian authentication failures and point change activity.

Figure 12.16. An SIEM report showing industrial activities.

Alerts

Alerts are active responses to observed conditions within the SIEM. An alert can be a visual notification in a console or dashboard, a direct communication (e-mail, page, SMS, etc.) to a security administrator, or even the execution of a custom script. Common alert mechanisms used by commercial SIEMs include the following:

Visual indicators (e.g. red, orange, yellow, green)

Direct notification to a user or group of users

Generation and delivery of a specific report(s) to a user or group of users

Internal logging of alert activity for audit control

Execution of a custom script or other external control

Generation of a ticket in a compatible help desk or incident management system.

Several compliance regulations, including NERC CIP, CFATS, and NRC RG 5.71, require that incidents be appropriately communicated to proper authorities inside and/or outside of the organization. The alerting mechanism of an SIEM can facilitate this process by creating a usable variable or data dictionary with appropriate contacts within the SIEM, and automatically generating appropriate reports and delivering them to key personnel.

Incident investigation and response

SIEM and log management systems are useful for incident response, because the structure and normalization of the data allow an incident response team to drill into a specific event to find additional details (often down to the source log file contents and/or captured network packets), and to pivot on specific data fields to find other related activities. For example, if there is an incident that requires investigation and response, it can be examined quickly, providing relevant details such as the username and IP address. The SIEM can then be queried to determine what other events are associated with the user, IP, and so on.

In some cases the SIEM may support active response capabilities, including

Allowing direct control over switch or router interfaces via SNMP, to disable network interfaces.

Executing scripts to interact with devices within the network infrastructure, to reroute traffic, isolate users, and so on.

Executing scripts to interact with perimeter security devices (e.g. firewalls) to block subsequent traffic that has been discovered to be malicious.

Executing scripts to interact with directory or IAM systems to alter or disable a user account in response to observed malicious behavior.

These responses may be supported manually or automatically, or both.
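
As a rough illustration of a script-driven response hook, the fragment below wraps a hypothetical site-specific blocking script behind a manual-confirmation gate; the script path, arguments, and zone check are placeholders and are not part of any real SIEM or firewall API.

import subprocess

def block_ip(ip, zone, auto=False, blocker="./block_ip.sh"):
    """Invoke a site-specific blocking script (placeholder path) for a malicious IP.

    Automated execution is refused for critical control zones, echoing the
    caution that follows in the text.
    """
    if zone == "critical" and auto:
        raise RuntimeError("automated response disabled for critical zones")
    if not auto:
        answer = input(f"Block {ip} in zone {zone}? [y/N] ")
        if answer.lower() != "y":
            return False
    subprocess.run([blocker, ip], check=True)
    return True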

Caution

While automated response capabilities can improve efficiencies, they should be limited to non-critical security zones and/or to zone perimeters. As with any control deployed within industrial networks, all automated responses should be carefully considered and tested prior to implementation. A false positive could trigger such a response and cause the failure of an industrial operation, with potentially serious consequences.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780124201149000125

Information Management and Life Cycle for Big Data

Krish Krishnan, in Data Warehousing in the Age of Big Data, 2013

Technology

Implementing the program from a concept to reality within data governance falls in the technology layers. There are several different technologies that are used to implement the different aspects of governance. These include tools and technologies used in data acquisition, data cleansing, data transformation, and database code such as stored procedures, programming modules coded as application programming interfaces (API), semantic technologies, and metadata libraries.

Data quality

Is implemented as a part of the data movement and transformation processes.

Is developed as a combination of business rules developed in ETL/ELT programs and third-party data enrichment processes.

Is measured in percentage of corrections required per execution per table. The lower the percentage of corrections, the higher the quality of data.
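
A toy computation of that correction-rate metric, with made-up counts of corrected rows per table per ETL execution:

# Correction rate per table per execution: lower is better.
corrections = {"customers": 120, "orders": 8}
rows_loaded = {"customers": 40_000, "orders": 55_000}

for table in corrections:
    rate = 100.0 * corrections[table] / rows_loaded[table]
    print(f"{table}: {rate:.2f}% of rows corrected")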

Data enrichment

This is not a new subject area in the world of data. We have always enriched data to improve its accuracy and data quality.

In the world of Big Data, data enrichment is achieved by integrating taxonomies, ontologies, and third-party libraries as a function of the data processing architecture (a small tagging sketch follows the list below).

Enriched data will provide users with the capability:

To define and manage hierarchies.

To create new business rules on-the-fly for tagging and classifying the information.

To process text and semi-structured information more efficiently.

To explore and process multilingual and multistructured data.
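
A minimal sketch of hierarchy-based enrichment during data processing; the taxonomy and the record are invented for illustration.

# Enrich records with their full taxonomy path (the hierarchy is illustrative).
hierarchy = {"minivan": "passenger vehicle", "passenger vehicle": "vehicle",
             "solar": "renewable", "renewable": "energy"}

def taxonomy_path(term):
    path = [term]
    while path[-1] in hierarchy:
        path.append(hierarchy[path[-1]])
    return " > ".join(reversed(path))

record = {"id": 42, "category": "minivan"}
record["category_path"] = taxonomy_path(record["category"])
print(record["category_path"])  # vehicle > passenger vehicle > minivan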

Data transformation

Is implemented as part of ETL/ELT processes.

Is defined as business requirements by the user teams.

Uses master data and metadata program outputs for referential data processing and data standardization.

Is developed by IT teams.

Includes auditing and traceability framework components for recording data manipulation language (DML) outputs and rejects from data quality and integrity checks.

Data archival and retention

Is implemented as part of the archival and purging process.

Is developed as a part of the database systems by many vendors.

Is often misquoted as a database feature.

Frequently fails when legacy data is imported back, due to lack of correct metadata and underlying structural changes. This can be avoided easily by exporting the metadata and the master data along with the data set.

Master data management

Is implemented as a standalone program.

Is implemented in multiple cycles for customers and products.

Is implemented for location, organization, and other smaller data sets as an add-on by the implementing system.

Is measured as a percentage of changes processed every execution from source systems.

Is operationalized as business rules for key management across operational, transactional, warehouse, and analytical data.

Metadata

Is implemented as a data definition process by business users.

Has business-oriented definitions of data for each business unit. One central definition is regarded as the enterprise metadata view of the data.

Has IT definitions for metadata related to data structures, data management programs, and semantic layers within the database.

Has definitions for semantic layers implemented for business intelligence and analytical applications.

All the technologies used in the processes described above have a database, a user interface for managing data, rules and definitions, and reports available on the processing of each component and its associated metrics.

There are many books and conferences on the subject of data governance and program governance. We recommend readers peruse the available material for continued reading on implementing governance for a traditional data warehouse. 1–4

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B978012405891000012X

Flickr: if it's good enough for the Library of Congress it's good enough for your library

Terry Ballard, in Google This!, 2012

A history of Flickr

Flickr was launched in February 2004 by Ludicorp, a Canadian company. Founders Stewart Butterfield and Caterina Fake based the service on an online game they had created but not released. The original Flickr was built around an online conversation room with real-time photo sharing, but it soon evolved into the sort of site it is today, minus much of the metadata and geotagging data enrichment.

Just over a year after Flickr went live, the company was acquired by Yahoo! for, reportedly, US$35 million. By 2008 the company had lifted any upload limits for paid or Flickr Pro accounts. By 2007, logins and passwords were synchronized to those of Yahoo! The next year, videos were included, but with a limit of 90 seconds (that rule is supposedly still in effect, but I have successfully uploaded videos three to four minutes in length). Particularly since the formation of the Flickr Commons, this has been well adopted by libraries for hosting their image projects.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9781843346777500048

Vector and Tensor Field Visualization

Gerik Scheuermann, Hans Hagen, in Handbook of Computer Aided Geometric Design, 2002

27.2 VISUALIZATION PROCESS

The transformation of data into images is a process with three steps, usually described as a visualization pipeline. A typical model is given by the three intermediate steps in Figure 27.2. The data generation phase stands outside the visualization process. It means the creation of numerical data by simulation, measurements during experiments, or observation of natural phenomena. The data enrichment and enhancement stage modifies the data to reduce its amount or improve the information content. Domain transformations, interpolation, sampling, and noise filtering are typical operations in this phase. The visualization mapping stage is the center of the whole transformation. The application data is mapped to visual primitives and attributes. This chapter will give an overview of successful techniques for vector and tensor data. The rendering phase does the usual computer graphics operation of creating an image on the screen from the graphics primitives and attributes, combined with a camera model, lighting operations, anti-aliasing filtering, and hidden surface removal. Finally, the display stage shows the image on the screen or prints it on paper.

Figure 27.2. The visualization pipeline.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780444511041500289

31st European Symposium on Computer Aided Process Engineering

Nancy Prioux, ... Gilles Hétreux, in Computer Aided Chemical Engineering, 2021

2 General approach

The approach is based on multidisciplinary fields: data management, sustainability engineering, and life cycle thinking. It supports the analysis of environmental impacts in an automated manner in grouping processes. Following the goal of their study, the researcher or the R&D engineer can compare or choose processes based on our approach's results. It is defined by five steps: (1) goal and scope, (2) data architecture, (3) life cycle inventory (LCI), (4) sustainability assessment, and (5) visualization and analysis of results.

In the first step, the goal and scope of the study must be clearly defined. Life cycle thinking is recommended. This thinking encourages a "cradle-to-grave" or "cradle-to-gate" approach if the logistics of a value chain are difficult to obtain. System boundaries and the functional unit significantly influence evaluations. For example, it needs to be clarified whether the upstream biomass supply chain is considered. Once the goal and scope have been properly defined, the supply chain, technologies, and transformation processes should also be described.

The data architecture is directly inspired by the structure of big data architecture and consists of five sub-steps: (i) data collection and extraction, (ii) data enrichment and storage, (iii) data processing, (iv) (raw) data analysis, and (v) (raw) data visualization. This step can be automated, semi-automated, or manual, and it uses data techniques, e.g. machine-learning methods, for the (raw) data analysis. These substeps are detailed in (Belaud et al., 2019).

The Life Cycle Inventory (LCI) is completed using the process data (also called foreground data) from the previous step. The background data come from free or commercial LCI databases such as the EcoInvent database.

For the fourth step, one or more impact calculation methods must be determined in accordance with the first step, which integrates the nature of the study and the system. Then, the environmental impacts are calculated using these methods. At the end of this stage, the main result is a structure [processes: biomasses: impacts] which is difficult to analyze.

The visualization and analysis of the results step includes methods derived from artificial intelligence, and more precisely from "machine learning," to assist in the analysis of environmental impacts. Starting from the statistical literature, traditional dimension reduction (DR) and unsupervised clustering techniques are combined to extract information from environmental impacts. More precisely, this hybrid approach is based on Multi-Dimensional Scaling (MDS) using the Canberra distance and k-means. The objective is to search for "hidden" structures in multi-dimensional data and to help interpret the matrix. The advantage of this approach is that data-based methods require very little knowledge of the processes to perform. Figure 1 summarizes the treatment for a [processes: impacts] matrix. First, DR techniques project the raw process data into a lower-dimensional space (two or three dimensions). After a projection of the data by a DR technique, the clustering approach is then applied to group similar impacts and processes within the lower-dimensional space. Finally, the user (expert) analyzes the points grouped in clusters to link them to meaningful processes/impacts. The visualization of the data clusters will help the researcher or R&D engineers in the final decision following the goal of their study.
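
As a compact sketch of that MDS-plus-k-means treatment, the fragment below embeds a made-up [processes x impacts] matrix with Canberra distances and clusters the projection using scipy and scikit-learn; the matrix values and the cluster count are arbitrary and not taken from the study.

import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS
from sklearn.cluster import KMeans

# Made-up [processes x impacts] matrix: rows are processes, columns are impact scores.
X = np.array([[1.2, 0.4, 3.1],
              [1.1, 0.5, 2.9],
              [4.0, 2.2, 0.3],
              [3.8, 2.5, 0.2]])

# Canberra distances between processes, then a 2-D MDS embedding of that matrix.
D = squareform(pdist(X, metric="canberra"))
embedding = MDS(n_components=2, dissimilarity="precomputed",
                random_state=0).fit_transform(D)

# Unsupervised clustering in the reduced space; the expert then interprets the clusters.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embedding)
print(labels)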

Figure 1

Figure 1. Schematic of data-driven processing

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780323885065501546

Data Governance for Big Data Analytics

David Loshin, in Big Data Analytics, 2013

5.4 Big Data Oversight: Five Key Concepts

The conclusion is that the standard approach to data governance, in which data policies defined by an internal governance council direct control of the usability of datasets, cannot be universally applied to big data applications. And yet there is definitely a need for some type of oversight that can ensure that the datasets are usable and that the analytic results are trustworthy. One way to address the need for data quality and consistency is to leverage the concept of data policies based on the data quality characteristics that are important to the big data project.

This means considering the intended uses of the results of the analyses and how the inability to exercise any kind of control over the original sources during the data production period can be mitigated by the users on the consumption side. This approach requires a number of key concepts for data practitioners and business process owners to keep in mind:

managing consumer data expectations;

identifying the critical data quality dimensions;

monitoring consistency of metadata and reference data as a basis for entity extraction;

repurposing and reinterpretation of information;

data enrichment and enhancement when possible.

5.4.1 Managing Consumer Data Expectations

There may be a broad diversity of users consuming the results of the spectrum of big data analytics applications. Many of these applications use an intersection of available datasets. Analytics applications are supposed to be designed to provide actionable knowledge to create or improve value. The quality of information must be directly related to the ways the business processes are either expected to be improved by better quality data or how ignoring data problems leads to undesired negative impacts, and there may be varied levels of interest in asserting levels of usability and acceptability for acquired datasets by different parties.

This means, for the scope of the different big data analytics projects, you must define these collective user expectations by engaging the different consumers of big data analytics to discuss how quality aspects of the input data might affect the computed results. Some examples include:

datasets that are out of sync from a time perspective (e.g., one dataset refers to today's transactions being compared to pricing information from yesterday);

not having all the datasets available that are necessary to execute the analysis;

not knowing if the data element values that feed the algorithms taken from different datasets share the same precision (e.g., sales per minute vs sales per hour);

not knowing if the values assigned to similarly named data attributes truly share the same underlying meaning (e.g., is a "customer" the person who pays for our products or the person who is entitled to customer support?).

Engaging the consumers for requirements is a process of discussions with the known end users, coupled with some degree of speculation and anticipation of who the pool of potential end users are, what they might want to do with a dataset, and correspondingly, what their levels of expectation are. Then, it is important to establish how those expectations can be measured and monitored, as well as the realistic remedial actions that can be taken.

5.4.2 Identifying the Critical Dimensions of Data Quality

An important step is to determine the dimensions of data quality that are relevant to the business and then distinguish those that are only measurable from those that are both measurable and controllable. This distinction is important, since you can use the measures to assess usability when you cannot exert control, and to make corrections or updates when you do have control. In either case, here are some dimensions for measuring the quality of information used for big data analytics (a small scoring sketch follows the list):

Temporal consistency: Measuring the timing characteristics of datasets used in big data analytics to see whether they are aligned from a temporal perspective.

Timeliness: Measuring if the data streams are delivered according to end-consumer expectations.

Currency: Measuring whether the datasets are up to date.

Completeness: Measuring that all the data is available.

Precision consistency: Assessing if the units of measure associated with each data source share the same precision, and if those units are properly harmonized if not.

Unique identifiability: Focusing on the ability to uniquely identify entities within datasets and data streams and link those entities to known system-of-record information.

Semantic consistency: This metadata activity may incorporate a glossary of business terms, hierarchies and taxonomies for business concepts, and relationships across concept taxonomies for standardizing ways that entities identified in structured and unstructured data are tagged in preparation for data use.
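
Two of these dimensions lend themselves to simple automated checks; the sketch below scores completeness and timeliness for a toy batch of records (the field names, deadline, and data are invented for illustration):

from datetime import datetime

records = [
    {"id": 1, "amount": 10.0, "received": datetime(2013, 5, 1, 9, 5)},
    {"id": 2, "amount": None, "received": datetime(2013, 5, 1, 9, 7)},
    {"id": 3, "amount": 7.5,  "received": datetime(2013, 5, 1, 11, 40)},
]

# Completeness: share of records with all required fields populated.
required = ("id", "amount", "received")
complete = sum(all(r.get(f) is not None for f in required) for r in records)
completeness = complete / len(records)

# Timeliness: share of records delivered by an agreed deadline.
deadline = datetime(2013, 5, 1, 10, 0)
on_time = sum(r["received"] <= deadline for r in records)
timeliness = on_time / len(records)

print(f"completeness={completeness:.0%}, timeliness={timeliness:.0%}")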

5.4.3 Consistency of Metadata and Reference Data for Entity Extraction

Big data analytics is often closely coupled with the concept of text analytics, which depends on contextual semantic analysis of streaming text and consequent entity concept identification and extraction. But before you can aspire to this kind of analysis, you need to ground your definitions within clear semantics for commonly used reference data and units of measure, as well as identifying aliases used to refer to the same or similar ideas.

Analyzing relationships and connectivity in text data is key to entity identification in unstructured text. But because of the diversity of types of data that span both structured and unstructured sources, one must be aware of the degree to which unstructured text is replete with nuances, variation, and double meanings. There are many examples of this ambiguity, such as references to a car, a minivan, an SUV, a truck, a roadster, as well as the manufacturer's company name, make, or model—all referring to an automobile.

These concepts are embedded in the value within a context, and are manifested as metadata tags, keywords, and categories that are often recognized as the terms that drive how search engine optimization algorithms associate concepts with content. Entity identification and extraction depend on differentiating the words and phrases that carry high levels of "meaning" (such as person names, business names, locations, or quantities) from those that are used to establish connections and relationships, generally embedded within the language of the text.

As data volumes expand, there must be some process for definition (and therefore control) over concept variation in source data streams. Introducing conceptual domains and hierarchies can help with semantic consistency, particularly when comparing data coming from multiple source data streams.

Be aware that context carries meaning; there are different inferences about data concepts and relationships you can make based on the identification of concept entities known within your reference data domains and how close together they are found in the data source or stream. But since the same terms and phrases may have different meanings depending on the participating constituency generating the content, it yet again highlights the need for precision in the semantics associated with concepts extracted from data sources and streams.

5.4.4 Repurposing and Reinterpretation

One of the foundational concepts for the use of data for analytics is the possibility of finding interesting patterns that can lead to actionable insight, and you must keep in mind that any acquired dataset may be used for any potential purpose at any time in the future. However, this strategy of data reuse can also backfire. Repeated copying and repurposing leads to a greater degree of separation between data producer and data consumer. With each successive reuse, the data consumers yet again must reinterpret what the data means. Eventually, any inherent semantics associated with the data when it is created evaporates.

Governance will also mean establishing some limits around the scheme for repurposing. New policies may be necessary when it comes to determining what data to acquire and what to ignore, which concepts to capture and which ones should be trashed, the volume of data to be retained and for how long, and other qualitative information management and custodianship policies.

5.4.5 Data Enrichment and Enhancement

It is difficult to consider any need for data governance or quality for large acquired datasets without discussing alternatives for data cleansing and correction. The plain truth is that in general you will have no control over the quality and validity of data that is acquired from outside the organization. Validation rules can be used to score the usability of the data based on end-user requirements, but if those scores are below the level of acceptability and you still want to do the analysis, you basically have these choices:

1.

Don't use the data at all.

2.

Use the data in its "unacceptable" state and modulate your users' expectations in relation to the validity score.

3.

Modify the data to a more acceptable form.

This choice might not be as drastic as you might think. If the business application requires accuracy and precision in the data, attempting to use unacceptable data will introduce a risk that the results may not be trustworthy. On the other hand, if you are analyzing extremely large datasets for curious and interesting patterns or to identify relationships among many different entities, there is some leeway for executing the process in the presence of a small number of errors. A minimal percentage of data flaws will not significantly skew the results.

As an example, large online retailers want to drive increased sales through relationship analysis, as well as look at sales correlations within sales "market baskets" (the collection of items purchased by an individual at one time). When processing millions of (or orders of magnitude more) transactions a day, a minimal number of inconsistencies, incomplete records, or errors are likely to be irrelevant.

However, should incorrect values be an impediment to the analysis, and should making changes not significantly modify the data from its original form other than in a positive and expected way, data enhancement and enrichment may be a reasonable alternative. A good example is address standardization. Address locations may be incomplete or even wrong (e.g., the ZIP code may be incorrect). Standardizing an address's format and applying corrections is a consistent way to improve the data.

The same could be said for linking extracted entities to known identity profiles using algorithms that match identities with high probability. Making that link enhances the analysis through the sharing of profile data for extracted entities. A similar process can be used in connection with our defined reference metadata hierarchies and taxonomies: standardizing references to items or concepts in relation to a taxonomic order lets your application treat cars, automobiles, vans, minivans, SUVs, trucks, and RVs as vehicles, at least for certain analytical purposes.
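
A small sketch of that taxonomy-based roll-up, with an invented term-to-category map:

# Map surface terms to a canonical taxonomy node (terms and mapping are illustrative).
VEHICLE_TERMS = {"car", "automobile", "van", "minivan", "suv", "truck", "rv"}

def standardize(term):
    return "vehicle" if term.lower() in VEHICLE_TERMS else term.lower()

mentions = ["SUV", "minivan", "RV", "tractor"]
print([standardize(m) for m in mentions])  # ['vehicle', 'vehicle', 'vehicle', 'tractor']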

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780124173194000053

Variational Modeling Methods for Visualization

HANS HAGEN, INGRID HOTZ, in Visualization Handbook, 2005

19.3 Variational Surface Modeling

The process of creating a 3D CAD model from an existing physical model is called reverse engineering, and it is different from standard engineering, where a physical model is created from a CAD model. Both approaches have, from a mathematical point of view, certain principles in common. Since the mid-1990s, a five-step "modeling pipeline" has been the state of the art, and it consists of the following steps:

Data generation and data reception: measurements and numerical simulations

Data enrichment and improvement: filtering and clustering

Data analysis and data reduction: structure recognition, testing of features, etc.

Modeling: variational design, physically based modeling, etc.

Quality analysis and surface interrogation: reflection lines, isophotes, variable offsetting, etc.

The last step is the scientific-visualization part of the modeling pipeline. We discuss this topic in detail in Section 19.4. After briefly discussing the topics of data reduction and segmentation, we concentrate in this chapter on the modeling step.

19.3.1 Data Reduction

Physical objects can be digitized using manual devices, CNC-controlled coordinate measuring machines, or laser range-scanning systems. In any case, we get large, unstructured datasets of arbitrarily distributed points.

Let P := {p_i ∈ IR^3 | i = 1, …, n} be a set of n distinct points. To reduce P to a smaller set Q := {q_j ∈ P | j = 1, …, m}, a subdivision into m distinct clusters will be calculated. Clustering means grouping similar points by optimizing a certain criterion function. Subsequently, a single point out of each cluster is selected as a "representation point" for this cluster; these points build the so-called representation set Q. As a criterion to verify the quality of the subdivision, a function K_h is introduced, which assigns a numerical value to each cluster C_h.

(19.7) K_h := \sum_{i, j \in I_h,\; i < j} \lVert p_i - p_j \rVert^2

where I_h := {i | p_i ∈ C_h} and h = 1, …, m. This cost of a cluster is a measure of the distribution of the points in the cluster. An optimal subdivision of P is given by minimizing the cost. That is,

(19.8) \sum_{h=1}^{m} K_h = \sum_{h=1}^{m} \; \sum_{i, j \in I_h,\; i < j} \lVert p_i - p_j \rVert^2 \;\rightarrow\; \min

This expression is equivalent to

(19.9) \sum_{h=1}^{m} \sum_{i \in I_h} \lVert p_i - S_h \rVert^2 \;\rightarrow\; \min

S_h is the center of the cluster C_h. Finding a global minimum of this expression is known to be NP-complete. Therefore, we have to use a heuristic method to find an "optimal" solution.

The Schreiber method is based on an iterative refinement strategy. The initial subdivision is the single cluster containing all points of P. In each step, the cluster with the highest cost is determined and divided into two new clusters, so that the cost is optimally reduced locally. For this purpose, a hyperplane is calculated orthogonal to the largest eigenvector of the covariance matrix of the points in a cluster. The optimal representation point for each cluster is the center point S_h of the cluster, but S_h is in general not a point of P. If this is a problem, the point of P nearest to S_h is used as the representation point. For more details on this algorithm, see Schreiber [4].
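
A rough NumPy sketch of this splitting heuristic, using the within-cluster cost of Eq. (19.9) and a fixed target number of clusters; the data and the choice to return centroids (rather than the nearest points of P) are illustrative simplifications, not details from Schreiber [4].

import numpy as np

def cost(points):
    # Within-cluster sum of squared distances to the centroid, as in Eq. (19.9).
    return np.sum((points - points.mean(axis=0)) ** 2)

def split(points):
    # Split along the hyperplane orthogonal to the dominant eigenvector
    # of the cluster's covariance matrix.
    centered = points - points.mean(axis=0)
    _, vecs = np.linalg.eigh(np.cov(centered.T))
    direction = vecs[:, -1]                      # eigenvector with largest eigenvalue
    side = centered @ direction >= 0
    return points[side], points[~side]

def reduce_points(P, m):
    clusters = [P]
    while len(clusters) < m:
        worst = max(range(len(clusters)), key=lambda k: cost(clusters[k]))
        a, b = split(clusters.pop(worst))
        clusters += [a, b]
    # Representation set Q: one centroid per cluster.
    return np.array([c.mean(axis=0) for c in clusters])

rng = np.random.default_rng(0)
P = rng.normal(size=(200, 3))
Q = reduce_points(P, m=8)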

19.3.2 Segmentation

For a segmentation, all points have to be grouped in such a way that each group represents a surface (patch) of the final CAD model. The segmentation criterion is based on a curvature estimation scheme. The curvature at a point p can be estimated by calculating an approximating function for a local set of points around p. Hamann [3] uses osculating paraboloids as approximating functions in his algorithm. Schreiber [5] extended this approach by using a general polynomial function to approximate a local set of points around p. First, a set of points neighboring p is determined by the Delaunay triangulation; this set of points is called the platelet. The platelet consists initially of all points that share a common edge of a triangle with p. This platelet is extended by adding all points that share a common edge with any platelet point. For a better curvature estimation, this extension is repeated several times.

19.3.3 Variational Design

The fourth step in the modeling pipeline is the final surface construction for a group of points. Analytical surfaces like planes, cylinders, and spheres can be created using standard CAD tools. Furthermore, fillets with a constant radius in one parameter direction, as a connecting surface between two given surfaces, can be generated in a standard way, so the focus of this section is set on free-form modeling. This technique offers the possibility for a user to predefine boundary curves and to select neighboring surfaces for tangent or curvature continuous transitions. Two approaches have become industry standards over the last couple of years: variational design and physically based modeling.

The variational design procedure of Brunnett et al. [6] combines a weighted least-squares approximation with an automatic smoothing of the surface. The chosen smoothing criterion minimizes the variation of the curvature along the parameter lines of the designed surface. This fundamental B-spline approach was extended for arbitrary degrees and arbitrary continuity conditions in both parameter directions, including given boundary information.

The following mathematical models can be used as variation principles:

(19.10) (1 - w_s) \sum_{k=1}^{n_p} w_{p_k} \bigl( F(u_k, v_k) - p_k \bigr)^2 + w_s \sum_{i=1}^{n} \sum_{j=1}^{m} \Bigl[ w_{u_g} \int_{v_j}^{v_{j+1}} \!\! \int_{u_i}^{u_{i+1}} w_{u_{ij}} \Bigl( \frac{\partial^3 F(u,v)}{\partial u^3} \Bigr)^2 \, du \, dv + w_{v_g} \int_{v_j}^{v_{j+1}} \!\! \int_{u_i}^{u_{i+1}} w_{v_{ij}} \Bigl( \frac{\partial^3 F(u,v)}{\partial v^3} \Bigr)^2 \, du \, dv \Bigr] \;\rightarrow\; \min

where F(u, v) is the representation of the surface, {p_k | k = 1, …, n_p} is the group of points, n and m are the numbers of segments in the u and v directions, and w_s, w_{u_g}, w_{v_g}, w_{u_{ij}}, and w_{v_{ij}} ∈ [0, 1] are the smoothing weights.

A successful alternative is to minimize the bending energy

(19.11) \int_S \bigl( \kappa_1^2 + \kappa_2^2 \bigr) \, dS \;\rightarrow\; \min

Variational design can to some extent be considered a part of physically based modeling. The starting point is always a specific physical demand of mechanical, electronic, aerodynamic, or similar origin.

Hamiltonian principle: Let a mechanical system be described by the functions q_i, i = 1, …, n, where n is the number of degrees of freedom. The system between a fixed starting state q_i(t_0) at starting time t_0 and a fixed final state q_i(t_1) at final time t_1 moves in such a way that the functions q_i(t) make the integral

(19.12) I = \int_{t_0}^{t_1} L\bigl( q_i(t), \dot{q}_i(t) \bigr) \, dt = \int_{t_0}^{t_1} \Bigl\{ T\bigl( q_i(t), \dot{q}_i(t) \bigr) - U\bigl( q_i(t), \dot{q}_i(t) \bigr) \Bigr\} \, dt

stationary, compared with all functions \bar{q}_i(t) that fulfill equal boundary conditions. T is the kinetic energy, U the potential energy, and L = T − U the Lagrange function.

The functionals in variational design express an energy type. One very popular functional for surface generation describes the energy that is stored in a thin, homogeneous, clamped plate with small deformations:

(19.13) F(\eta) = \frac{h^3}{24} \, \frac{E}{1 - \nu^2} \int \!\! \int \Bigl\{ \Bigl( \frac{\partial^2 \eta}{\partial x^2} + \frac{\partial^2 \eta}{\partial y^2} \Bigr)^2 - 2 (1 - \nu) \Bigl( \frac{\partial^2 \eta}{\partial x^2} \, \frac{\partial^2 \eta}{\partial y^2} - \Bigl( \frac{\partial^2 \eta}{\partial x \, \partial y} \Bigr)^2 \Bigr) \Bigr\} \, dx \, dy

The derivation of this term is based on the equilibrium of volume and surface forces and uses some linearizations, where h denotes the thickness of the plate, η denotes the deformation, and E and ν are material parameters. For more details and applications, see Hagen and Nawotki [2].

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780123875822500216

Connected Computing Environment

Zoran Babovic, Veljko Milutinovic, in Advances in Computers, 2013

4 Classification Criteria and the Classification Tree

The two major problems in creating a new taxonomy are the classification criteria and the classification tree. Here, the classification criteria have been chosen to reflect the essence of the basic research viewpoint. The classification tree has been obtained by successive application of the chosen criteria. The leaves of the classification tree are the examples (research efforts), which are elaborated briefly afterwards, in the Presentation of Existing Solutions section of this paper.

In this study, we have also included architectures that do not deal with data semantics, but which have influenced research in a certain direction. In addition to this, we have shown how semantic data enrichment improves the efficiency of the used approach.

Since the role of the sensor networks integration platform is to act as an interface between sensor networks and user applications, researchers are able to tackle the problem either at the sensor networks level, i.e., a bottom-up approach, or at the applications level, i.e., a top-down approach. Therefore, as the main classification criterion of the surveyed architectures, we classify architectures according to the selected approach, which may include: the sensor networks-oriented approach and the application-oriented approach. In the first approach, researchers try to address the sensor networks' heterogeneity, technical characteristics, constraints, protocols, and produced observations and measurements by proposing an optimal way for handling, representing, storing, and aggregating the available sensor data sources for the upper layers in the system, and thus for applications. In the second approach, researchers tend to enable an as-convenient-as-possible interface and interaction mechanism for users and applications, which enables them to get the information they are interested in from the integrated sensor networks, by releasing them from the complexities and specifics of those sensor networks.

Within the first class, we can identify three subgroups: database-centered architectures, approaches based on query translation, and sensor virtualization-based approaches. All these subgroups can be further divided into approaches with and without data semantics employment.

The database-centered solutions are characterized by a database as the central hub of all the collected sensor data, and consequently all search and manipulation of sensor data are performed over the database. It is a challenge to map heterogeneous sensor data to a unique database scheme. An additional mechanism should be provided for real-time data support, because this type of data can hardly be cached directly due to its large volume. The primary concern with this approach is scalability, since the database server should handle both insertions of data coming from the sensor nodes and application queries. This approach can benefit from the possibility of enabling support for data mining and machine learning techniques over the stored pool of sensor data.

The query translation approach utilizes the natural form of sensor data and the associated query languages in order to transform a user query into the target query language of a certain source. This approach implies a need to maintain information about the available data sources, primarily the native query language of a certain data source and the format and nature of the produced data, but it may also include information about sensor capabilities, network topology, and power constraints for better query optimization. The results of native queries should be assembled into the target data format. Potentially, a performance drawback lies in the fact that two conversions per user request must be done at runtime: when a query is translated to a native query, and again when the query results should be converted into the target format.

In the sensor virtualization approach, sensors and other devices are represented with an abstract data model, and applications are provided with the ability to directly interact with such an abstraction using an interface. Whether the implementation of the defined interface is achieved on the sensor nodes, sinks, or gateway components, the produced data streams must comply with a commonly accepted format that should enable interoperability. Generally, any common data format that leverages the semantic information model could be used for data representation, or even multiple data formats targeting different levels of information abstraction might coexist in parallel depending on the user needs. This approach is a promising one and offers good scalability, high performance, and efficient data fusion over heterogeneous sensor networks, as well as flexibility in aggregating data streams, etc.

As stated above, application-oriented approaches try to offer the most efficient way for user applications to get the needed information from the integrated sensor networks. However, focusing on the provision of high-level interaction between applications and the underlying system, with enabled knowledge-inferring features, sometimes suffers from performance aspects, which prevents these solutions' wider acceptance. We have identified four subgroups that share the same basic principle of the top-down approach: the service-oriented architecture approaches, service-composition approaches, rule-based data transformation approaches, and agent-based systems.

The service-oriented-architecture approaches provide a standard service interface with defined methods and data encodings for obtaining observations and measurements from desired sensors. Furthermore, it might offer functions such as getting information about sensor characteristics, the ability to subscribe to selected sensors' data values, submitting queries, optionally actuation functions, etc. The dominant interaction in these architectures is the request-reply model, and to a lesser extent the event-based delivery of sensor data. A drawback of this approach is that it does not have the ability to fuse stream-based sensor data along with archived or acquisitioned data types. Although there are no explicit constraints on concrete implementation, this approach tends to be vertically oriented and covers only one application domain.

The service-composition-oriented approaches allow users the ability to define arbitrary services or data streams with specific characteristics of interest. The system will try to compose such a data flow by applying specific processing over appropriate data sources, which will result in producing a data stream that conforms to the requested specification. Full user request expressiveness could be achieved by enabling a semantic model-based description of desired data streams and processing capabilities: semantics-based reasoning could be utilized when looking for an optimal composition of available components. This approach seems to offer the most flexible solutions from the applications perspective, although the performance may be degraded due to real-time discovery of the service composition.

The rule-based data transformation seems to be the most common approach for utilizing semantic data models. Inferring new knowledge or detecting high-level events is achieved by mapping functions relying on the relationships between the concepts captured in the domain model ontological representation and the sensor data observations and measurements. There could be multiple transformations through the architecture according to the different layers in the information model. Data are transformed from lower-level formats to semantic-based representations, enabling the application of semantic search and reasoning algorithms.

The agent-based systems consist of several types of agents. Agents are software components capable of performing specific tasks. They collaboratively achieve the desired functionalities. For the internal agent communications, some of the standard agent platforms or a specific implementation can be used. Typically, agents belong to one of several layers based on the type of functionalities they are responsible for. Also, there might be several agent types in one logical layer. Agents from upper layers utilize agents from lower layers. Whether the agents utilize sensor data semantics, or whether semantic models are used for describing the agent processing capabilities, depends on the concrete implementation.

The classification tree, derived from the same classification criteria, is presented in Fig. 1, and is composed of seven leaves. Each leaf of the classification tree is assigned a name, as described above. The list of existing solutions (examples) is given according to the applied classification for each leaf (class). We have provided only the names of approaches and major references in a separate paragraph in order to enable interested readers to study further details. For the sake of simplicity, we give an arbitrary name to a solution that does not have an explicit proper name given by its authors. We use either the name of the institution that the authors came from, or the name of the principal strategic issue characteristic of that solution.

Fig. 1. The classification tree of Sensor Web architectures.

The database-centered solutions include non-semantic approaches such as the Cougar database system [10], one of the first research works toward sensor networks integration, and SenseWeb [11], which is an example of maximum utilization of the described approach. The ES3N [13] is an example of a semantics-based database-centered approach.

All solutions pertaining to query translation approaches employ semantic technologies and include: the CSIRO semantic sensor network [14], the SPARQL STREAM-based approach [22], and the SemSorGrid4Env [47,48], which is the most comprehensive solution in this group.

The most recent research efforts in this field belong to the sensor virtualization approaches. The non-semantic approach is used in the GSN [18], while the solutions proposed in large-scale EU-funded projects such as the SENSEI [50] and the Internet of Things (IoT) [51,52] utilize semantics of data.

The service-oriented architectures include simple and yet efficient non-semantic solutions such as TinyREST [53] and the OGC SWE specifications of the reference architecture [2] implemented by various parties [54,55]. A semantics-enabled approach is used in the SemSOS [56].

The service-composition approaches tend to offer the most flexible interaction to users, and Hourglass [16] is an example of a non-semantic-based solution. More powerful solutions utilize semantic approaches and include the SONGS [17] and an architecture developed at IBM [59].

The most common architectures that utilize semantic technologies belong to the rule-based data transformation approaches and include: a semantics-based sensor data fusion system developed at the University of Toronto [20], the pluggable architecture designed at the National Technical University of Athens [23], and the SWASN [61], a part of Ericsson's CommonSense vision [62].

Finally, the agent-based approaches have both non-semantic and semantic representatives: the first one is an Internet-scale sensor infrastructure called IrisNet [15,63], while the second one is the Swap [64], a multi-agent system for Sensor Web architectures.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780124080911000026