Veracity - Cleansing & Transformation

 

Big data is about more than volume, variety, and velocity alone. You need to understand the ten characteristics and properties of big data to prepare for both the challenges and the benefits of big data initiatives. Veracity is one of the unfortunate characteristics of big data: as the other characteristics increase, veracity (confidence, or trust in the data) drops. Veracity is one of the Vs of big data, and it indicates the trustworthiness, quality, and credibility of the data that a company or organization has collected in order to gain accurate insights for sound decision making.

Data creation and consumption is becoming a way of life. According to a recent IBM research report, 2.5 quintillion bytes of data were produced globally every day in 2017, and analysts predict that over 1.7 megabytes of new information will be created every second for every person in the world. Generally speaking, cleansing, profiling, transformation, discovery, and similar operations should be discussed in terms of data that is captured or extracted from the web. Each website should be treated as a source, and you should use language from that standpoint rather than the traditional data-integration slant on enterprise data management and data from traditional sources. Only after a data source has been analyzed and understood can data processing continue. Data cleansing relies on complete and continuous data profiling to identify the data quality issues that need to be addressed.
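To make the profiling step concrete, here is a minimal sketch in Python using pandas. The input file, the column handling, and the specific checks are illustrative assumptions for this example, not something from the original post.

import pandas as pd

# Load data extracted from a website (hypothetical file name).
df = pd.read_csv("scraped_products.csv")

# Profile each column: data type, missing values, and distinct values,
# to surface quality issues before cleansing begins.
profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "missing": df.isna().sum(),
    "missing_pct": (df.isna().mean() * 100).round(1),
    "distinct": df.nunique(),
})
print(profile)

# Duplicate rows are a common quality issue in web-extracted data.
print("duplicate rows:", df.duplicated().sum())

A profile like this is what drives the cleansing step: each quality issue it surfaces (missing values, duplicates, unexpected types) becomes a rule in the cleansing pipeline.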

In general, data veracity is the degree of accuracy or truthfulness of a set of data. In the case of big data, it is not just the quality of the data that matters but also how trustworthy the data source, the data type, and the data processing are.

Figure: Sources of data veracity (source: Datafloq weekly digest)

What is data transformation?

Data transformation is the process of changing the format, structure, or values of data. In data analytics projects, data may be transformed at two different stages of the data pipeline. Organizations that use on-premises data warehouses normally follow an ETL (extract, transform, load) process, in which data transformation is the middle step. Nowadays, organizations mainly use cloud-based data warehouses, which are more advanced: they can scale compute and storage resources within seconds or minutes. These organizations can skip the preload transformation, load raw data into the data warehouse, and then transform it at query time.
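As a rough illustration of the transform step described above, the sketch below uses Python with pandas. The file names, column names, and cleaning rules are assumptions made for the example, not the post's own pipeline.

import pandas as pd

# Extract: read raw data pulled from a source system (hypothetical file).
raw = pd.read_csv("raw_orders.csv")

# Transform: fix the format, structure, and values of the data.
clean = (
    raw.drop_duplicates()  # remove duplicated records
    .assign(
        order_date=lambda d: pd.to_datetime(d["order_date"], errors="coerce"),
        amount=lambda d: pd.to_numeric(d["amount"], errors="coerce"),
    )
    .dropna(subset=["order_date", "amount"])  # drop rows that failed to parse
)

# Load: write the transformed data to a warehouse staging area.
clean.to_csv("orders_staged.csv", index=False)

In the cloud-warehouse pattern described above, the same transformation would instead run inside the warehouse after the raw data has been loaded, rather than before loading.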

Benefits of data transformation

·       Transformation makes data better organized. Transformed data is also easier for both humans and computers to use.

·       Properly formatted and validated data improves data quality and protects applications from problems such as null values, unexpected duplicates, and incorrect indexing (see the sketch after this list).

·       Data transformation facilitates compatibility between applications, systems, and types of data.
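As a small follow-up to the second benefit, the sketch below shows what such validation might look like in pandas, assuming a hypothetical order_id key column in the staged file from the earlier example.

import pandas as pd

clean = pd.read_csv("orders_staged.csv")

# Guard against null values and unexpected duplicates in the key column.
assert clean["order_id"].notna().all(), "null order IDs found"
assert clean["order_id"].is_unique, "duplicate order IDs found"

# Rebuild the row index so downstream positional lookups are not mis-indexed.
clean = clean.reset_index(drop=True)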


Author: Harikrishnan H 

 

 Keywords

#big data

#veracity

#data cleansing

#data transformation

#ETL

References


Datafloq.com. 2021. Data Veracity: a New Key to Big Data. [online] Available at: <https://datafloq.com/read/data-veracity-new-key-big-data/6595> [Accessed 7 March 2021].

Import.io. 2021. What is Data Cleansing and Transformation/ Wrangling? | Import.io. [online] Available at: <https://www.import.io/post/what-is-data-cleansing-and-transformation-wrangling/> [Accessed 7 March 2021].

Stitch. 2021. What is data transformation: definition, benefits, and uses | Stitch resource. [online] Available at: <https://www.stitchdata.com/resources/data-transformation/> [Accessed 7 March 2021].

Transforming Data with Intelligence. 2021. The 10 Vs of Big Data | Transforming Data with Intelligence. [online] Available at: <https://tdwi.org/articles/2017/02/08/10-vs-of-big-data.aspx> [Accessed 7 March 2021].

 

Comments

  1. Good work Hari. Your ideas about big data and data transformation are really helpful.

  2. Very helpful and informative as well

  3. You deserve great appreciation for this. Good job. 💯

  4. Good job
    Well done

  5. Good narration and a very accurate piece of information.

  6. Very informative! Looking forward to more content.

  7. Of course, veracity is as important as the velocity, volume, and variety of big data. Consumers and companies need to know how trustworthy the data is. Since companies only want to store and mine the data relevant to solving their problems, data scientists need to be extremely careful to keep bias and noise out of the dataset being analyzed. The big data that is generated is highly irregular and inconsistent, making it difficult for enterprises to make sense of it and build trust in their data. Trust is a big factor for any business, which I completely understood from this blog post. Keep it up!

  8. Great work! This process is crucial and emphasized because wrong data can drive a business to wrong decisions, wrong conclusions, and poor analysis, especially when huge quantities of big data are in the picture.

  9. Well written Harikrishnan… It is very much true that data veracity is the degree to which a data collection is reliable or truthful. Adding to what you have written in the blog, I wish to share a couple more insights. When it comes to big data accuracy, it's not only about the data's quality; it's also about how reliable the data source, type, and processing are. Improving the accuracy of big data requires removing bias, anomalies or discrepancies, duplication, and uncertainty, to name a few factors. Volatility is out of our reach at times, unfortunately. Volatility, another "V" of big data, refers to the rate of change and the lifespan of the data. Social networking, where sentiments and trending topics shift rapidly and often, is an example of extremely volatile data. Weather patterns, which shift less often and are easier to forecast and monitor, are an example of less volatile data.

    -- Thomas Devasia

  10. Good work Hari
    Expecting more from you

  11. I'm really amazed to see such an informative blog.

