Saturday, March 28, 2020

IT links (16. - 22.3.2020)

Java 14 was released on 17 March 2020.

Here is a great overview of the functionality newly available in this release. I like the more helpful messages for NPEs, which make it easier to investigate issues related to the most common exception.

InfoQ has published an eMag about recent innovations in the Java platform, covering recent Java releases

Article with some useful JVM arguments

How to compare two JSONs with Gson?

Automated testing on production with Selenium

Monday, March 16, 2020

IT links (9. - 15.3.2020)

Two articles about new features released in recent JDK releases:


Another one related to Java - How to get last element of stream?
And nice String format examples in Java
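
As a quick reminder of what the two topics above look like in practice, here is a minimal sketch (not taken from the linked articles). It uses `reduce` to pick the last element of a stream and `String.format` with width and precision flags. The class and method names are made up for illustration.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.stream.Stream;

final class StreamAndFormatSketch {

    // Returns the last element of the stream, or an empty Optional for an empty stream.
    // reduce keeps only the right-hand element at each step, so the final one survives.
    static <T> Optional<T> last(Stream<T> stream) {
        return stream.reduce((first, second) -> second);
    }

    // Left-justified name (width 6), right-justified count (width 4),
    // ratio with two decimals (width 6). Locale.ROOT keeps the decimal point stable.
    static String describe(String name, int count, double ratio) {
        return String.format(Locale.ROOT, "%-6s|%4d|%6.2f", name, count, ratio);
    }

    public static void main(String[] args) {
        System.out.println(last(Stream.of("a", "b", "c")));
        System.out.println(describe("java", 14, 0.5));
    }
}
```

Note that `reduce` walks the whole stream, so for a `List` it is cheaper to read the last index directly.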

A good reminder - Difference between save vs persist and saveOrUpdate in Hibernate

This was a really interesting talk at QCon London 2020 - I recommend it - Running Java on GPUs and FPGAs

And how to not misuse data warehouses?

Sunday, March 8, 2020

Data challenges in modern companies

As our civilization generates vast and continuously increasing amounts of data, opportunities to harness it are also growing. Companies participating in digital transformation integrate digital technology into all possible areas, altering how they operate and deliver value to customers.

More companies are basing their businesses on data or using data to improve sales and operations. Becoming data-driven involves making strategic decisions based on data analysis and extracting value from company data. Data utilization is central to competitiveness and survival across industries. The goal is to find advantages over competitors through better insights from available data and discover ways to enhance operations and overall business.

Some believe that businesses not fully utilizing data could eventually face extinction.

It is crucial to understand that information is not contained in individual pieces of data; they must first be combined. It is not only about storing data in data warehouses or data lakes but also about the ability to use all available data. And this brings challenges in collecting data, storing it, analyzing it, and applying the insights gained. Often, generated or received data is used for a single purpose, one time only. All these data-related operations raise issues connected to big data, represented by the well-known Vs (Volume, Velocity, Variety, Veracity, ...).

Data integration is vital, as is knowing the data owned by the business. Data should be considered an asset, serving everyone in the company, not just analysts. 

It is also important to know how data is used across the company. The quality of the various data sources should be part of this knowledge, because poor-quality data can negatively affect the decisions based on it. Information about data quality, among other useful information, should be provided by the data source itself.

Companies should have a data strategy, defining long-term goals that bring business value and lead to higher revenues. Data governance should support these goals by guiding daily data management, indicating how to handle data assets to support the organization immediately. Data governance offers a more granular approach to data management and tactics for company data strategy.

Having an overview of all data assets is crucial, and data cataloging should ideally be part of data governance. Data catalogs, available as services (e.g., via the cloud) or open-source tools, contain metadata about data sources (properties like connections, owners, structure, lineage, and quality).

They facilitate data discoverability, faster access to data, and improved decision-making for analysts and businesses. This helps to evolve better products for customers, deal with GDPR issues, consolidate data from acquired companies, or enrich the company's own data with publicly available data (e.g. open datasets).

The goal is to help analysts locate all the data they need, and to improve and speed up analyses such as customer segmentation and behavior, lead scoring, etc. This also helps to bring a more personalized focus on smaller customers.
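
To make the idea of a catalog more concrete, here is a minimal in-memory sketch of an entry holding the metadata mentioned above (connection, owner, structure, lineage, quality) together with a simple lookup. All names, the sample connection strings, and the 0.0 to 1.0 quality score are hypothetical. Real catalog products expose much richer models.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

final class DataCatalogSketch {

    // One catalog entry: metadata about a data source, not the data itself.
    static final class Entry {
        final String name;
        final String connection;      // where the source lives (illustrative string)
        final String owner;           // team responsible for the source
        final List<String> columns;   // structure
        final List<String> upstream;  // lineage: names of sources this one is derived from
        final double qualityScore;    // hypothetical indicator between 0.0 and 1.0

        Entry(String name, String connection, String owner,
              List<String> columns, List<String> upstream, double qualityScore) {
            this.name = name;
            this.connection = connection;
            this.owner = owner;
            this.columns = columns;
            this.upstream = upstream;
            this.qualityScore = qualityScore;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    void register(Entry entry) {
        entries.add(entry);
    }

    // Discoverability: names of sources that expose the column with at least the given quality,
    // in registration order.
    List<String> findByColumn(String column, double minQuality) {
        return entries.stream()
                .filter(e -> e.columns.contains(column))
                .filter(e -> e.qualityScore >= minQuality)
                .map(e -> e.name)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        DataCatalogSketch catalog = new DataCatalogSketch();
        catalog.register(new Entry("crm_customers", "jdbc:postgresql://crm/db", "sales",
                List.of("customer_id", "email"), List.of(), 0.9));
        catalog.register(new Entry("web_events", "s3://bucket/events", "marketing",
                List.of("customer_id", "page"), List.of("crm_customers"), 0.6));
        System.out.println(catalog.findByColumn("customer_id", 0.8));
    }
}
```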

Companies can also consider external services with APIs as data sources, such as emails, Google Analytics statistics, customer feedback tools, and social networks.

Data fusion, based on available data sources, helps drive the data flywheel: more data means better analytics, better analytics leads to better products, better products bring more users, and more users generate more useful data.

From a technical perspective, a better overview of owned data enables easier setup of data pipelines for ETL operations and makes the data more accessible to analytics tools like Power BI. It also makes it possible to take better architectural decisions and to think in a more platform-independent way.

For cloud-based companies, it allows a more cloud-agnostic approach and eventually helps to run and manage hybrid-cloud or polycloud environments. With the data fabric approach, uniform access to data across multiple environments can be provided.

Clear business processes and sufficient data enable automation with RPA or hyperautomation. These automated processes increase efficiency, which can allow us to handle more customers and help with scaling up operations. A substantial amount of data, combined with a good understanding of it, opens up possibilities for machine learning applications, which are usually data-hungry. Data is really important here: data dependencies can be unstable and the structure of the data can change, and all of this affects trained models.
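
One simple safeguard against changes in data structure is to compare the schema a model was trained on with the schema currently being delivered. The following is only a sketch under stated assumptions: a schema is represented as a plain column-to-type map, and the class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class SchemaDriftSketch {

    // Compares the column-to-type schema used at training time with the current one
    // and returns human-readable differences, sorted by column name.
    static List<String> diff(Map<String, String> trained, Map<String, String> current) {
        Set<String> allColumns = new TreeSet<>(trained.keySet());
        allColumns.addAll(current.keySet());

        List<String> differences = new ArrayList<>();
        for (String column : allColumns) {
            String before = trained.get(column);
            String after = current.get(column);
            if (before == null) {
                differences.add("added: " + column);
            } else if (after == null) {
                differences.add("removed: " + column);
            } else if (!before.equals(after)) {
                differences.add("type changed: " + column + " " + before + " -> " + after);
            }
        }
        return differences;
    }

    public static void main(String[] args) {
        Map<String, String> trained = Map.of("age", "int", "city", "string");
        Map<String, String> current = Map.of("age", "double", "country", "string");
        diff(trained, current).forEach(System.out::println);
    }
}
```

An empty result means the structure is unchanged. It says nothing about shifts in the values themselves, which need separate monitoring.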

A good overview of company-owned data increases the chances of success with machine learning experiments, so that they do not end at the proof-of-concept stage. These experiments should also not be run as isolated attempts; machine learning steps should be reproducible and repeatable.

Such an overview helps to overcome these obstacles and helps businesses transition from data-driven to model-driven operations. This means incorporating machine learning models directly into business processes, making data science a core capability of the company, and potentially even automating business-critical decisions.

Good data management can assist in advancing AI maturity. Wider usage of machine learning at the production level across a company and the interaction of deployed models can introduce issues specific to complex solutions.

It is tempting to run away from all these problems. However, external services will inevitably integrate more and more machine learning on their backends, so wider usage of such services could eventually bring similar issues, regardless of whether a company implements its own machine learning solutions or not.

You cannot run away because, as Nvidia CEO Jensen Huang said:


“Software Is Eating the World, but AI Is Going to Eat Software”.